OBJECTIVE MEASUREMENT: Theory Into Practice Volume 2

edited by

Mark Wilson Graduate School of Education University of California, Berkeley

ABLEX PUBLISHING CORPORATION NORWOOD, NEW JERSEY

Copyright © 1994 Ablex Publishing Corporation. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without permission of the publisher. Printed in the United States of America.

Library of Congress Cataloging-in-Publication Data (Revised for vol. 2) Objective measurement. "Papers presented at successive International Objective Measurement Workshop (IOMW)"--Pref. Includes bibliographical references and indexes. ISBN 0-89391-727-3 (v. 1) -- ISBN 0-89391-814-8 (v. 1 : pbk.) 1. Psychometrics--Congresses. 2. Psychometrics--Data processing--Congresses. 3. Educational tests and measurements--Congresses. I. Wilson, Mark. II. International Objective Measurement Workshop. BF39.024 1991 150'.1'5195 91-16210 CIP

Ablex Publishing Corporation 355 Chestnut Street Norwood, New Jersey 07648

Table of Contents

Preface
Acknowledgments

Part I. Historical and Philosophical Perspectives

1. Fundamental Measurement and the Fundamentals of Rasch Measurement
   Wim van der Linden
2. The Relevance of the Classical Theory of Measurement to Modern Psychology
   Joel Michell
3. The Rasch Debate: Validity and Revolution in Educational Measurement
   William P. Fisher, Jr.
4. Historical Views of the Concept of Invariance in Measurement Theory
   George Engelhard, Jr.

Part II. Practice

5. Computer Adaptive Testing: A National Pilot Study
   Mary E. Lunz and Betty A. Bergstrom
6. Reliability of Alternate Computer-adaptive Tests
   Mary E. Lunz, Betty A. Bergstrom, and Benjamin D. Wright
7. The Equivalence of Rasch Item Calibrations and Ability Estimates Across Modes of Administration
   Betty A. Bergstrom and Mary E. Lunz
8. Constructing Measurement with a Many-facet Rasch Model
   John Michael Linacre
9. Development of a Functional Assessment that Adjusts Ability Measures for Task Simplicity and Rater Leniency
   Anne G. Fisher
10. Measuring Chemical Properties with the Rasch Model
    Thomas Rehfeldt
11. Impact of Additional Person Performance Data on Person, Judge, and Item Calibrations
    John Stahl and Mary Lunz

Part III. Theory

12. Local Independence: Objectively Measurable or Objectionably Abominable?
    Robert J. Jannarone
13. Objective Measurement with Multidimensional Polytomous Latent Trait Models
    Henk Kelderman
14. When Does Misfit Make a Difference?
    Raymond Adams and Benjamin D. Wright
15. Comparing Attitude Across Different Cultures: Two Quantitative Approaches to Construct Validity
    Mark Wilson
16. Consequences of Removing Subjects in Item Calibration
    Patrick S.C. Lee and Hoi K. Suen
17. Item Information as a Function of Threshold Values in the Rating Scale Model
    Barbara G. Dodd and Ralph J. DeAyala
18. Assessing Unidimensionality for Rasch Measurement
    Richard M. Smith and Chang Y. Miao

Author Index
Subject Index

Preface

This volume is the second in a series that collects together papers presented at successive International Objective Measurement Workshops (IOMW). These workshops bring together researchers from all over the world to discuss, debate, and gossip about recent developments in the area of measurement in the social sciences generally, and, more specifically, developments within the community of researchers who see a special place for the measurement approach based on the ideas of Georg Rasch. This "special place" is evidenced by the frequent mention throughout the volume of Rasch himself, of the family of models named in his honor, and of the concept of specific objectivity, a term that he coined and that is perhaps his most significant contribution to the theory and practice of measurement.

Within this framework, new philosophical perspectives are discussed in chapters by Wim van der Linden and William Fisher. In the area of practice, two major clusters of new work are reported on in the volume: Mary Lunz, Betty Bergstrom, and Benjamin Wright describe a national pilot study of computer adaptive testing in professional licensure; and Michael Linacre introduces three chapters by Anne Fisher, Thomas Rehfeldt, and John Stahl and Mary Lunz that describe applications of a type of Rasch model called a facet model. Theoretical advancements in the area are reported by Henk Kelderman, Raymond Adams and Ben Wright, Barbara Dodd and Ralph DeAyala, and Richard Smith and Chang Miao.

The workshops do not exclusively focus on such work, however. Alternative perspectives are a frequent and important part of the presentations and discussions that take place at the workshops. In this volume, Joel Michell and George Engelhard, Jr., advance philosophical and historical perspectives that take a broader view, and the papers by Robert Jannarone, Mark Wilson, and Patrick Lee and Hoi Suen explicitly attempt to make connections outside the Rasch framework.

The chapters are largely drawn from those presented at the sixth IOMW, held at the University of Chicago in April 1990 and organized by Mary Lunz of the American Society of Clinical Pathologists. This is not the only source for chapters, however. One of the chapters (my own) was presented in only partially complete form at the fifth IOMW, and one other (by Wim van der Linden) is based on a debate at the American Educational Research Association annual meeting held immediately after the workshop. I hope that their inclusion will encourage contributions from authors who have either completed work that was not quite ready for publication immediately after past workshops (a virtual requirement for inclusion, given the tight time constraints associated with publication), or who have recently finished an appropriate paper, but, for whatever reason, did not present it at a workshop.

Acknowledgments

I would like to acknowledge the work of the Rasch Measurement Special Interest Group of the American Educational Research Association for putting together the Sixth International Objective Measurement Workshop, which was the source of most of these chapters. In particular, I would like to recognize the sterling work of John Michael Linacre and Mary Lunz in this regard. The subject index for this book was compiled by my wife, Janet Susan Williams, with the help of the chapter authors: Thank you, Janet, for persisting with our sometimes strange topics and concerns, and for enhancing the quality of the book in so fine a way.

Part I

Historical and Philosophical Perspectives

Chapter 1

Fundamental Measurement and the Fundamentals of Rasch Measurement

Wim J. van der Linden
University of Twente

To many of us, the natural sciences are the example upon which the social and behavioral sciences should be modeled. To some of us who are not fully aware of the daily research practice in the natural sciences, this conviction seems to take the form of a simple, inductivistic recipe in which the first concern is to measure the variables of interest on a quantitative scale. Once this basic step is taken, the ultimate goal is to discover universal laws in the measurements and to present them in mathematical form. Others, however, more aware of the important role that imagination plays in research, view measurements as the "hard" facts against which theoretical speculations have to be tested. To both parties, it would probably be a shock to read Campbell's (1928) book on scientific measurement, noting that according to this authoritative text the distinction between theory and measurement as two distinct realms is wrong and misleading. Just as with normal substantive research, measurement proceeds by establishing natural laws and empirically verifying their truth.

Campbell wrote his book because he was not pleased with the usual definition of measurement as "the process of assigning numbers to objects to represent their properties" (p. 1). According to Campbell such statements abound in textbooks on physics, but they are by no means true and show that even physicists at the front line of research may lack a thorough understanding of what measurement is about and how quantitative variables are established. The book had an immediate impact on scientists as well as philosophers of science, and has been the indisputable standard reference in discussions about measurement ever since. It took four decades before someone else (Ellis, 1966) dared to write a new monograph about measurement in the sciences, a monograph based on the same foundations, though, as those laid by Campbell.

One of Campbell's main points is the reminder that variables should not be conceived of as a generalization of our visual experience of physical length (that is, as an "empirical line") but as a set of physical objects with certain relations defined on it. For the variable to be quantitative, these relations should order the objects and define an operation of "addition" on them. The relations form a hypothesis that has to be verified, just as we had to verify, for example, the relations between objects implied by Boyle's law before we were able to consider it a genuine natural law. Once verified, we usually single out a particular object as the unit against which the others are compared to measure them. The choice of a unit is a practical issue; we mostly select some object that is convenient to us, for example, our feet when we pace out a distance.

Measurement that can be defined and verified in this way is called fundamental measurement. That measurement is theory based and that the theory involved has to go through a process of prediction and confirmation is demonstrated by those physical properties for which it has not been possible to verify the hypothesis of a quantitative variable. A well-known example in physics is Mohs' definition of hardness.
It is possible to order the hardness of physical objects by the operation of scratching and observing which object in the set scratches which other object, but for this operation it has not been possible to verify the relations implied by the addition operation, and we are still not able to measure hardness fundamentally. Fortunately, though, in such cases quantitative measurement may be possible by a process called derived measurement: Using proven numerical laws between the variable concerned and other variables that can be measured fundamentally, we may be able to calculate quantitative measurements for the former even if it cannot be measured itself in a direct or fundamental fashion. An obvious example is the measurement of temperature by the length of a column of mercury in a classic thermometer. In derived measurement, again, the keyword is relations.

For relatively new fields such as education and psychology, it has been tempting to try to emulate the success of the natural sciences by looking for the possibility of fundamental measurement. In particular, for a long period the quest was for psychological equivalents of the addition operation. (The precise properties of this operation, called the concatenation operation, will be explored later in this chapter.) This quest did not meet with success, though, and at a certain stage many doubted if quantitative measurement, and hence the establishment of psychology as a mature science, would be possible at all. An excellent historiography of this episode is given in Michell (1990).

A major step forward was taken by Luce and Tukey (1964), when they showed that variables can be tested for quantitativeness in the absence of any empirical concatenation operation. The example used by Luce and Tukey was the case of additive conjoint measurement. The principle underlying the example, namely that the nature of the variable follows from the measurement model for which testable consequences have been shown to hold against empirical data, is not unique to additive conjoint measurement and also applies, for example, to such modern developments in educational and psychological testing as item response models. The present chapter focuses on these models.

In the following we will first explore Campbell's notions of fundamental and derived measurement a little further. The emphasis is not on a careful, formal treatment, but on a rather loose discussion of the insights that led Campbell to his basic notions. The next part of the chapter raises an analogous problem for the behavioral sciences: how to found educational and psychological testing as a discipline of quantitative measurement in the absence of fundamental measurement operations. The chapter ends with a discussion of the fundamentals of Rasch measurement and seeks to define its unique position in educational and psychological measurement.
FUNDAMENTAL MEASUREMENT

Campbell's analysis of measurement can be summarized by the statement that establishing quantitative variables is a theoretical issue involving natural laws and that these laws have to be verified before the variable can be considered to be truly quantitative. It is now time to further explore the nature of these laws and to see how they can be tested. Ideally, for a variable to be quantitative three different types of laws have to hold. If these laws can be confirmed, the variable is directly or fundamentally measurable. Other variables may be measurable by the principle of derived measurement to be discussed later, or, according to Campbell, they are not quantitatively measurable at all.

As already observed, it is tempting to think of a physical variable as an empirical line. Our most immediate experience of the physical reality is one of objects showing different lengths in one, two, or three dimensions. Hence, it is not without reason that length is our intuitive model of any physical variable, a fact that is reinforced by our daily meetings with graphs and diagrams that map all kinds of physical variables as geometric lines. However, a more fruitful idea of a physical variable is one of a set of objects with a relational structure. The variable temperature, for instance, is given by the way such physical objects as the sun, my oven, John's ice cream, and the cup of coffee I had this morning relate to each other. If I enlarge this set to include all past, present, and future objects, then relations of "equality," "difference," "more than," and "less than" between these objects define the variable temperature. Of course, the variable weight is defined by a different collection of relations between the same objects, but the basic point is that the variables temperature and weight do not have any physical meaning over and above these two collections of relations.

Campbell's first two laws of measurement specify two different types of relations. Let capitals $A, B, C, \ldots$ denote the objects in the set. The first law of measurement specifies an order relation for the set. Let the order relation between objects $A$ and $B$ be denoted by $A >_E B$. Although this notation reminds us of the symbol that is used to denote the "larger than" relation between numbers in mathematics, no reference whatsoever to mathematical entities is intended. For this reason the subscript $E$ is added. For $>_E$ to be an order relation, the following properties have to hold for all possible pairs of objects:

As an example, the reader may think of the relation "longer than," which defines the variable length. The first proposition states that if $A$ is longer than $B$ and $B$ is longer than $C$, then $A$ is longer than $C$. The other two propositions can be interpreted similarly. We are now able to formulate the first law:

First Law of Measurement (Order Relation). All pairs of objects obey the properties of the order relation defined in (1) through (3).

Examples. Length, weight, and time can be mentioned as examples of variables obeying this law of measurement.
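The properties referred to as (1) through (3) can be written out as in the following sketch. Only (1), transitivity, is fixed by the verbal description in the text. The forms given for (2) and (3), asymmetry and connectedness, are an assumption based on the usual definition of an order relation, and the symbol $=_E$ for empirical equivalence is borrowed from its later use in the chapter.

```latex
% (1) follows the verbal description in the text.
% (2) and (3) are assumed standard forms, not quoted from the source.
\begin{align}
A >_E B \ \text{and}\ B >_E C \ &\Rightarrow\ A >_E C \tag{1}\\
A >_E B \ &\Rightarrow\ \text{not}\ (B >_E A) \tag{2}\\
\text{neither}\ A >_E B \ \text{nor}\ B >_E A \ &\Rightarrow\ A =_E B \tag{3}
\end{align}
```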

Objects can be ordered with respect to length by direct comparison. Similarly, objects can be ordered by weight using direct comparison on a balance. Another example is time; we are able to order periods of time by direct comparison (provided they begin simultaneously). A counterexample is Mohs' hardness. Mohs' scratching operation, already discussed above, orders objects only partially with respect to hardness, due to the fact that objects exist with scratching relations that do not obey the axioms. For a well-known psychological variable such as intelligence, procedures for ordering human beings by direct comparison usually seriously violate the transitivity property of the order relation defined in (1). Measurement procedures based on direct comparison are therefore unable to yield quantitative measurements of intelligence.

Weight is a nice example to illustrate that the first law of measurement, like the two laws to be introduced below, does not describe an isolated aspect of nature. To be able to verify the first law of measurement, other laws are involved too; for example, laws relating the behavior of balances to such physical variables as gravity, air turbulence, and buoyancy, or mechanical laws governing the operation of the balance. Without knowledge of such laws it would never be possible to confirm the order relation in (1) through (3) for sets of physical objects.

In addition to an order relation, a set of objects has to meet an empirical relation of additivity to form a quantitative variable. The term concatenation operation has been introduced to emphasize that an empirical operation is meant, and not the arithmetical operation on numbers. Examples of concatenation operations are: putting more than one object on the scale of a balance to compare their combined property with other objects, putting electrical resistances in a series to compare the resistance of this new object with other objects, or placing two objects end to end in a line to compare their length with those of other objects.

Concatenation operations are defined by a set of relations between objects. Let $A +_E B$ denote a new object that is produced by a concatenation operation. Again, this notation is somewhat misleading in that it reminds us of the addition operation in arithmetic, but the subscript $E$ is added to emphasize that an empirical and not a mathematical operation is intended. Later, if we assign numbers to measure quantitative variables, the rules of measurement will map this concatenation operation on the mathematical operation of addition. Now the following set of relations defines the concatenation operator:

The meaning of these relations is obvious. Relations (4) and (5) show that the order in which the objects are combined does not influence the results. Relations (6) and (7) relate the results of concatenation operations to the properties of order relations in (1) through (3). It should be noted that the set of conditions formulated in (4) through (5) is somewhat outdated and idiosyncratic. Modern versions can be found in algebraic texts axiomatically defining the formally equivalent operation of addition.

Second Law of Measurement (Additivity). All objects obey the properties of the additivity relation defined in (4) through (7).

Examples. Weight, length, and period of time were given earlier as examples of variables for which the order relation in (1) through (3) can be verified empirically. The same holds for these variables with respect to (4) through (7). Intelligence as measured by an IQ test is an example of a psychological variable for which we do not have a concatenation operation. Evidently, if two subjects work together on the test, the IQ for their concerted effort is not equal to the sum of their individual IQs. The properties in (4) through (7) provide the criterion by which we could empirically test a candidate for the concatenation operation for intelligence, if somebody proposed a new one. Again, the axioms in (4) through (7) may seem trivial just because we abstract from physical reality. However, it is emphasized again that measurement axioms can only be tested if embedded in a larger theory relating the physical variable of interest to relevant other variables. For example, we would never be able to verify (4) through (7) for the concatenation of objects on a balance if we were not able to use physical theory to control or correct for interferences between the results for the left-hand and right-hand sides of (4) through (7) due to, for instance, gravitational variation or mechanical friction.
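Relations (4) through (7) can be sketched in the same notation. The forms below are an assumption, not a quotation of the source: (4) and (5) are written as commutativity and associativity because the text says that the order of combination does not influence the result, while (6) and (7) are one common way of tying concatenation to the order relation (equals concatenated with equals give equals, and a concatenated object exceeds each of its parts).

```latex
% Assumed forms consistent with the verbal description; not quoted from the source.
\begin{align}
A +_E B \ &=_E\ B +_E A \tag{4}\\
(A +_E B) +_E C \ &=_E\ A +_E (B +_E C) \tag{5}\\
A =_E A' \ \text{and}\ B =_E B' \ &\Rightarrow\ A +_E B \ =_E\ A' +_E B' \tag{6}\\
A +_E B \ &>_E\ A \tag{7}
\end{align}
```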
Though the first two laws of measurement may sound somewhat abstract to readers not familiar with measurement theory, Campbell's third and last law comes closer to the actual practice of fundamental measurement. Its starting point is the observation that from the set of objects in the first two laws, we may pick a series of objects and consider them a standard series against which the other objects are to be measured. The basic procedure is to match the other objects with one in the standard series and use the numeral associated with the latter as the measure of the former.

The first two laws can be used to produce a standard series. An obvious procedure is to denote one object as the standard or unit object. The order relation could be used to find another object that has the relation $=_E$ to the standard. Then the concatenation operation defined by the second law can be used to combine the two objects into a new object. If the numeral 1 is assigned to the standard (other choices are possible, but probably less convenient), the new object receives the numeral 2. This process can be repeated until the standard series is large enough to measure all objects in the set. Noninteger measures are introduced if the concatenation operation is used in a reciprocal way; that is, if we take objects with a $<_E$ relation to one of the objects already in the standard series and determine the number of times the concatenation operation has to be applied to produce a new object that has a $=_E$ relation to the given object. If the standard series is complete in the sense that for each object in the set there is one in the standard series to which it has a $=_E$ relation, the series forms a feasible measuring device.

In more technical language, it can be stated that an (arbitrary) unit object and a concatenation operation together span or generate a standard series. Analogously, a numeral for the unit object along with the addition operator generates a set of quantitative measures for the objects in the universe. The surprising thing to be noted is that the actual numeral used for the unit object is not important at all; different numerals will generate different sets of values for the standard objects, but each set will map the same empirical relational structure between the objects.
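The construction of a standard series described above can be sketched in code. This is a minimal illustration under stated assumptions, not Campbell's own formalism: the class `Rod` and the functions `longer`, `equal`, `concat`, and `measure` are hypothetical names, and length is used as the variable. The hidden attribute `_length` stands in for physical reality; the measuring procedure itself only uses the empirical comparisons and the concatenation operation.

```python
from fractions import Fraction


class Rod:
    """Hypothetical physical object; `_length` is hidden from the measurer."""

    def __init__(self, length):
        self._length = Fraction(length)


def longer(a, b):
    """Empirical order relation a >_E b (lay the rods side by side)."""
    return a._length > b._length


def equal(a, b):
    """Empirical equivalence a =_E b (the rods match end to end)."""
    return a._length == b._length


def concat(a, b):
    """Empirical concatenation a +_E b (place the rods end to end)."""
    return Rod(a._length + b._length)


def measure(obj, unit, max_steps=1000):
    """Walk up the standard series generated by `unit` until a member matches `obj`.

    The unit receives the numeral 1; each further concatenation with a copy
    of the unit raises the numeral by 1. Only integer measures are handled.
    """
    member, numeral = unit, 1
    for _ in range(max_steps):
        if equal(member, obj):
            return numeral
        if longer(member, obj):
            raise ValueError("object falls between members of the standard series")
        # In practice a second object with a =_E relation to the unit is needed;
        # reusing `unit` here stands in for such a copy.
        member, numeral = concat(member, unit), numeral + 1
    raise ValueError("standard series too short")


short_rod, long_rod = Rod(6), Rod(12)
for unit in (Rod(2), Rod(3)):
    print(measure(short_rod, unit), measure(long_rod, unit))
```

With a unit of 2 the two rods receive the numerals 3 and 6; with a unit of 3 they receive 2 and 4. The numerals differ, but the ratio between them is the same, which is the sense in which the choice of unit is arbitrary.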
Campbell's third law identifies an important property of standard series:

Third Law of Measurement (Arbitrariness of Unit). Any object can be chosen as a unit object to form a standard series.

Examples. A well-known prototype of a standard series is the old-fashioned series of weights used on a balance. In fact, the series is only a partial standard series. If an object is met that cannot be matched with one of the weights in the series, a concatenation operation is used that combines weights on one scale into a new object that has a $=_E$ relation to the object on the other scale. The $=_E$ relation is defined by the balance of the scales. The unit object upon which a series of weights is based is not unique; any other object could have been chosen. It is convenience that determines our choice of standard series. Actually, convenience may take us one step further and have us replace the standard series by a single measuring device. The yardstick, with each of its notches replacing a separate object in a standard series, is a pertinent example. The history of measurement in physics can be looked upon as a long process in which old measuring devices are replaced by new devices. As each replacement usually is based on the application of new substantive laws, the latest device may hardly seem to bear any relation to its early ancestors, as is the case, for instance, with modern atomic clocks and the original sandglass. Campbell's analysis reminds us, however, of the fact that for measurement to qualify as fundamental at its basis there must be an empirical concatenation operation that can be used to derive a standard series of objects from an arbitrary unit object.

Below we will return to intelligence as an example of a variable for which no standard series has been possible. We could select a certain subject as our unit object, but it is impossible to build a series of standard objects from it, as we still have no concatenation operation. Hence, we are unable to assign numerals to intelligence that obey the laws of fundamental quantitative measurement.

DERIVED MEASUREMENT

Though fundamental measurement provides measurement in the natural sciences with a sound footing, it is not the only type of quantitative measurement possible. Another type defined by Campbell is derived measurement. Its name is appropriately chosen, since derived measurement always assumes the existence of fundamental measurement. The best way to appreciate the distinction between fundamental and derived measurement is by noting the different numbers of variables in physical laws. Each of the three laws of measurement given above was associated with a single variable.
This is typical of fundamental measurement; such laws explain the quantitative structure of a given variable, dealing only with properties of the relational structure on the set of objects that defines it. As argued earlier, this does not imply that substantive knowledge about other variables does not play a role in the confirmation of the laws of fundamental measurement, but the laws themselves are always formulated for single variables. Natural sciences, on the other hand, abound with laws of two or more variables. These laws govern the ways different physical variables relate to each other. They can also be used to measure variables.

As an example, think of the mechanical experiment in which a known force is applied to physical objects and their acceleration is measured. As a result, it can be observed that for each object force and acceleration are proportional to each other, but that different sets of objects may display different constants of proportionality. In straightforward notation this means: $F/a = c$. Now suppose it is observed that the values of this constant $c$ perfectly order the objects according to mass. These values can then be identified as measures of mass, and the law can be notated in its well-known form as: $F = ma$. Thus even if no concatenation operation is available for mass, and mass can never be measured directly, it is nevertheless possible to represent the mass of objects on a quantitative scale, provided the other variables in the law can be measured fundamentally. The properties of the scale follow from the mathematical structure of the model and are determined following a procedure that is known in physics as dimensional analysis. The question how to find an order of mass independently of $c$ so that $c$ can be identified as a measure of mass is not clearly dealt with in Campbell's book. A lucid treatment of this problem is given in Rasch (1960, chap. 7), where mass is identified as the acceleration of a standard object caused by a unit of force.
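The step from the law $F/a = c$ to a derived measure can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the function name `proportionality_constant`, the use of least squares through the origin to estimate the constant, and the made-up force and acceleration readings for two hypothetical objects. Force and acceleration are treated as already measured.

```python
def proportionality_constant(forces, accelerations):
    """Least-squares estimate of c in F = c * a (line through the origin)."""
    numerator = sum(f * a for f, a in zip(forces, accelerations))
    denominator = sum(a * a for a in accelerations)
    return numerator / denominator


# Made-up readings: the same three forces applied to two different objects.
forces = [1.0, 2.0, 3.0]
readings = {
    "cart": [0.5, 1.0, 1.5],     # accelerations observed for the cart
    "brick": [0.25, 0.5, 0.75],  # accelerations observed for the brick
}

# The estimated constant for each object serves as its derived measure of mass.
derived_mass = {
    name: proportionality_constant(forces, acc) for name, acc in readings.items()
}
for name in sorted(derived_mass, key=derived_mass.get):
    print(name, derived_mass[name])
```

For these readings the constant is 2.0 for the cart and 4.0 for the brick, so the constants order the two objects and supply numbers for a variable that was never measured directly.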

MEASUREMENT IN THE BEHAVIORAL AND SOCIAL SCIENCES

As already put forward, the behavioral and social sciences have lacked the possibility of fundamental measurement. Even for such sophisticated forms of measurement as intelligence measurement, the history of psychology has not produced any viable concatenation operation that could be used to "add" two amounts of intelligence to obtain a new amount equal to their "sum." As a consequence, it has been impossible to select a series of intelligent objects that forms a standard series and can be used as a measuring device. Of course, practically, it is possible to select a small series of people of increasing intelligence, provided their intelligence is spaced at large distances; in some cases it might even be possible to set up reliable trials in which the intelligence of the people in the series is compared with that of other people. The critical point, however, is the following: As long as it is impossible to obtain the intelligence of the other people in the series by repeated concatenation of the intelligence of a person chosen as the unit, such a series can never be a standard series.

How about IQ tests? Are they not the measuring instruments that yield quantitative intelligence scores? They certainly do not provide fundamental measurement. An intelligence test is not a device that replaces a standard series of intelligent objects as the yardstick replaces a set of sticks of variable lengths. Standard series are always parts of the universe of objects that define the variable; they possess the magnitude that the variable represents. It is by this virtue that direct comparison with other objects and hence fundamental measurement is possible. A yardstick itself has length, just as each weight in a standard series has a certain weight. However, IQ tests have no intelligence and it is impossible to directly compare the intelligence of people with the "intelligence of the test." The truth about IQ tests is that, notwithstanding our daily parlance, they are not measurement instruments at all in the same sense as physics has its thermometers, balances, and stopwatches! In fact, they are just standardized experiments used to collect such qualitative data as responses to problems formulated in test items. Measurement in the behavioral and social sciences never takes place while data are collected; it always happens after they are collected.

Now if the behavioral and social sciences have no fundamental measurement, and according to Campbell derived measurement is the only other sound form of quantitative measurement, is derived measurement possible in these sciences? Again the answer is no. By definition derived measurement is always based on fundamental measurement. And if no laws with relations between fundamentally measurable variables are at hand, we can never find the constants in such laws that identify measures for new quantitative variables.

Implicit Measurement

It is exactly here that Campbell's analysis goes wrong and comes to a premature stop.
Modern measurement theory shows that we can go one step further and verify laws that explain observable data using only unmeasured variables. If these laws (or models, as modern measurement theory prefers to call them) are quantitative and empirically verified, then the unmeasured or latent variables have quantitative scales on which, as a byproduct, the positions of the objects are known. As the model contains only latent variables, measurement of them is not derived from other fundamentally measured variables; all variables are measured jointly, in relation to one another. To distinguish this type of measurement from fundamental and derived measurement, it is called implicit measurement here. The first step in implicit measurement is the definition of the data for which the model has to be designed. These data are categorical or
ordinal. The fact that the data are qualitative and not quantitative is essential; otherwise there would be no reason at all to "upgrade the data" and derive quantitative measures from them.

Once the data are defined, the next activity is to design a model that explains the data as a function of the variables on which they depend. Now the basic point is that it is possible to explain qualitative data by a model with quantitative variables. Loosely speaking, here quantitative is taken to mean that the variables are allowed to have real values and that the model relates the variables or parameters to each other through a mathematical structure that contains at least a +. This operation of addition is present in the model to govern the way the variables are assumed to interact, not to map an empirical concatenation operation. In a model or law for a single variable the + can only be used to add values of the same variable, but in a model with more than one variable the + can be used to add values of different variables. For the model to be empirically testable, the former case requires a concatenation operation; the latter case does not.

The final step is to fit the model to actual data and test its goodness of fit. Generally, fitting a model means that values for the variables or parameters are found such that observable consequences from the model match the properties of the data as closely as possible. Several statistical methods are available to do the job, each based on a different criterion of optimal fit. The important point, however, is that if the model shows good fit, we have a tested quantitative scale for the variables in the model, just as a good fit of the First and Second Laws of Measurement gives us a tested quantitative scale for a single variable. The values for the variables that give the optimal match are the quantitative measures of the objects that explain the data in the experiment.
We have to be somewhat more specific about the quantitative structure of the variables in the measurement model. As the structure is not defined and tested following the axioms in the First and Second Law of Measurement, how do we know its formal properties? The criterion is the invariance or uniqueness of the model under transformation of scale of its variables. Though more formal definitions of invariance are possible, the following suffices for the present purpose: A model is invariant under a scale transformation if it has exactly the same observable consequences before and after transformation. The transformations under which a model is invariant are called admissible transformations. Admissible transformations fully define the structure of the scale. For example, if it is not possible to transform the unit or the zero of the model without changing its fit to data, then the unit or zero are empirical properties of the model and identify the structure of the variable.
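The invariance criterion can be illustrated with a small numerical sketch. It assumes, purely for illustration, a logistic model of the kind introduced later in the chapter, with a fixed unit slope; the function names and the parameter values are hypothetical and not taken from the text. Shifting all parameters by the same constant leaves every predicted probability unchanged, so the zero is arbitrary, whereas stretching the scale changes the predictions, so the unit is an empirical property of this particular model.

```python
import math

def success_probability(theta, b):
    """Illustrative logistic model: probability depends only on theta - b."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

def predictions(thetas, bs):
    """Table of observable consequences (probabilities) for all pairs."""
    return [[success_probability(t, b) for b in bs] for t in thetas]

def max_abs_diff(table1, table2):
    """Largest absolute difference between two equally shaped tables."""
    return max(
        abs(x - y)
        for row1, row2 in zip(table1, table2)
        for x, y in zip(row1, row2)
    )

# Hypothetical parameter values, chosen only for the demonstration.
thetas = [-1.0, 0.0, 1.5]
bs = [-0.5, 0.5]

base = predictions(thetas, bs)
# Transformation of the zero: add the same constant to every parameter.
shifted = predictions([t + 3.0 for t in thetas], [b + 3.0 for b in bs])
# Transformation of the unit: multiply every parameter by the same factor.
stretched = predictions([2.0 * t for t in thetas], [2.0 * b for b in bs])

shift_is_admissible = max_abs_diff(base, shifted) < 1e-12
stretch_is_admissible = max_abs_diff(base, stretched) < 1e-12
```

Under these assumptions only the shift is an admissible transformation, which is the sense in which the admissible transformations define the structure of the scale.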

Stevens' Theory of Scale Types

The theory of scale types has become popular through the work of Stevens (1951). His basic distinction was between nominal, ordinal, interval, and ratio scales, each defined by a different class of admissible transformations. Historically, Stevens' theory of scale types was a rebuttal to Campbell's condition of a concatenation operation as a prerequisite for fundamental measurement. Because from the 1920s through the 1940s psychology was unable to produce concatenation operations, psychologists felt that they either had to relax Campbell's condition or to believe that in psychology measurement was not possible at all. Stevens did the former. He maintained Campbell's notion of representationalism, but relaxed the idea that the relational structure of the variable had to represent a concatenation operation, introducing ordinal and even nominal measurement as other true forms of representational measurement.

Though Stevens' theory of scale types has become part of the standard outfit of all behavioral and social scientists, he has left them in uncertainty as to what level of scale their actual measurements are on. The theory provides no test whatsoever of level of scale. Stevens' view of measurement still had procedural overtones rather than being fully model based. Therefore he missed the point that in the behavioral and social sciences tests of scale properties can never be derived from measurement procedures themselves; only models can do the job. Had Stevens focused on relaxing Campbell's theory of derived measurement rather than fundamental measurement, his interest in scale invariance might have led him to the notion of implicit measurement as outlined above. It took some 15 years before others formalized the idea.

Additive Conjoint Measurement

Luce and Tukey (1964) showed the behavioral and social sciences that quantitative measurement is possible, provided more than one variable is measured and they are modeled jointly. They demonstrated the principle using their new model of additive conjoint measurement, which will be introduced here briefly. The model of additive conjoint measurement formulates the relation between the following three variables: a dependent variable P and two independent variables A and B. The variables A and B are unmeasured or latent, but it is possible to classify all objects simultaneously with respect to them. The dependent variable P is not measured either, but all objects are ordered completely with respect to
their values of P. The best way to represent the data is by a bivariate table with each row representing a different value of A and each column a different value of B, the values being arbitrarily chosen. For each cell there is a value of P attached to the objects classified into it and across cells the values satisfy a complete order relation. In additive conjoint measurement functions are fitted to the data in the table such that the following additive model holds:

$$f_1(P) = f_2(A) + f_3(B). \tag{1}$$

Luce and Tukey proved the powerful result that if the data in the table meet certain conditions, then:

(1) monotone functions f_1(.), f_2(.), and f_3(.) satisfying this additive model exist;
(2) f_1(P), f_2(A), and f_3(B) are quantitative variables.
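To give a feel for what a testable condition on such a table looks like, the following minimal sketch checks one condition that is commonly called double cancellation in the conjoint measurement literature. The chapter does not list the conditions, so the choice of this condition, the function name, and both example tables are assumptions made for illustration only. Rows stand for levels of A, columns for levels of B, and the cell entries carry only ordinal information about P.

```python
from itertools import permutations

def satisfies_double_cancellation(table):
    """Check double cancellation on an ordinal table.

    For every choice of three distinct rows (a1, a2, a3) and three distinct
    columns (b1, b2, b3): if P(a2, b1) >= P(a1, b2) and
    P(a3, b2) >= P(a2, b3), then P(a3, b1) >= P(a1, b3) must hold.
    """
    n_rows, n_cols = len(table), len(table[0])
    for a1, a2, a3 in permutations(range(n_rows), 3):
        for b1, b2, b3 in permutations(range(n_cols), 3):
            premise = (
                table[a2][b1] >= table[a1][b2]
                and table[a3][b2] >= table[a2][b3]
            )
            if premise and not table[a3][b1] >= table[a1][b3]:
                return False
    return True

# Hypothetical table generated by an additive rule (row value + column value).
additive_table = [[a + b for b in (0, 1, 3)] for a in (0, 2, 5)]

# Hypothetical table that is increasing along every row and column,
# yet cannot be represented additively.
violating_table = [
    [1, 2, 6],
    [3, 5, 7],
    [4, 8, 9],
]

additive_ok = satisfies_double_cancellation(additive_table)
violating_ok = satisfies_double_cancellation(violating_table)
```

The second table shows why the additive model is an empirical hypothesis: its ordering looks orderly in every row and column, but the check fails, so no monotone functions could satisfy (1) for it.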

For the sake of brevity, a discussion of the conditions will be skipped here. It suffices to say that a test of whether the data in the table meet the conditions is straightforward. Readers interested in the conditions may refer to the original paper by Luce and Tukey or to a lucid introduction to additive conjoint measurement in Michell (1990, chap. 4).

It is important to separate the methodology in Luce and Tukey's paper from the actual model they propose. The methodology reflects the steps of implicit measurement outlined above. First, the data are identified for which the measurement model is needed (here, data ordering objects on P and classifying them with respect to A and B). Then a model is formulated that explains the data as a function of relevant independent variables (here, P is modeled as a function of A and B). The model is quantitative in that it uses a + to represent the relation between the variables (here, f_1(P) = f_2(A) + f_3(B)). Then measures on the variables are derived by applying the model to the data and finding values for the variables (here, values for f_1(P), f_2(A), and f_3(B) such that f_2(A) + f_3(B) is equal to f_1(P) for all objects). It should also be noted that the model is not a mathematical tautology, but a hypothetical empirical law that may be rejected by the data. This is manifest from the fact that for the model to hold true the data in the table have to meet the three conditions in Luce and Tukey's theorem.

It is this underlying methodology and not the specific model in Luce and Tukey's paper that should be considered their most important contribution to measurement theory. Some authors seem to have difficulty distinguishing between the two and tend to assume that unless other models can be demonstrated to be equivalent to the model
of additive conjoint measurement, they do not provide quantitative measurement (e.g., Michell, 1990; see van der Linden, 1994). In particular, models that are stochastic or have a more complicated mathematical structure are ruled out by this assumption. This is not correct. Nonadditive models of measurement have been studied along the same lines as in Luce and Tukey's paper and proofs of the fact that they provide quantitative variables are available (Krantz & Tversky, 1971). The distinctive advantage of additive models such as the one above, however, is their simplicity, due to the absence of interaction between the independent variables in their effect on the dependent variable. In nonadditive models comparisons between the effects of different levels of the same variable always depend on the level of other variables. This does not prohibit comparison, but makes their formulation more complicated. Although the term conjoint measurement is a perfect description of the underlying principles in Luce and Tukey (1964), to some authors conjoint measurement is equivalent to additive conjoint measurement. To avoid this misunderstanding, the term implicit measurement is preferred here.

As for stochastic models of measurement, ironically, others had already been practicing model-based measurement long before Luce and Tukey wrote their seminal article. Independently, Lord (1952) and Rasch (1960) worked on models that are now known as item response models. In item response models, characteristics of the examinees and the test items are implicitly modeled as quantitative, unmeasured (or latent) variables. Along the same line, even Thurstone's (1927) work on models for paired comparisons shows an intuitive appreciation of the methodology of implicit measurement. Of these authors, Rasch was the only one to show an interest in the foundations of measurement and he introduced a basic principle of measurement to derive his model.
In the final section of this chapter, the central theme of this book is reflected in an analysis of the fundamentals of the Rasch model and their relation to Campbell's and Luce and Tukey's treatments of measurement theory.

FUNDAMENTALS OF RASCH MEASUREMENT

Rasch (1960) formulated his well-known model for achievement tests in which he assumed that only two parameters are needed to explain the probability of success on an item: an ability parameter θ for the examinee and a difficulty parameter b for the item. For item i the model stipulates the following probability of success as a function of θ:

$$P_i(\theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)}. \tag{2}$$


It should be noted that, applying the well-known logit transformation, the model can also be given in a different form as:

$$\ln \frac{P_i(\theta)}{1 - P_i(\theta)} = \theta - b_i. \tag{3}$$
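The logit form of the model can be checked numerically. The sketch below assumes the usual logistic expression for the Rasch probability of success, exp(θ - b) / (1 + exp(θ - b)); the function names and the ability and difficulty values are hypothetical. It confirms that the log-odds of success equal θ - b for every examinee-item pair, and that the difference between two examinees' log-odds is the same on every item.

```python
import math

def rasch_probability(theta, b):
    """Probability of success for ability theta on an item of difficulty b."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

def logit(p):
    """Log-odds of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Hypothetical values, chosen only for the demonstration.
abilities = [-1.0, 0.5, 2.0]
difficulties = [-0.5, 0.0, 1.5]

# Rows are examinees, columns are items.
logit_table = [
    [logit(rasch_probability(t, b)) for b in difficulties]
    for t in abilities
]

# Largest deviation of the table from the additive form theta - b.
max_error = max(
    abs(logit_table[i][j] - (abilities[i] - difficulties[j]))
    for i in range(len(abilities))
    for j in range(len(difficulties))
)

# Difference between the second and the first examinee, item by item.
row_differences = [
    logit_table[1][j] - logit_table[0][j] for j in range(len(difficulties))
]
```

Because the row differences do not vary across columns, the logit table has no interaction between examinees and items, which is the additive structure the remainder of the section relies on.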

Rasch's interest in educational and psychological measurement was primarily in its foundation. However, judging from his publications, he did not show much interest in Campbell's Laws of Fundamental Measurement and in fact never even made any reference to Campbell's work or to any other major paper on measurement theory. Instead he introduced a principle that he called specific objectivity; the principle will be introduced here briefly. Though Rasch considered specific objectivity to be a single principle, actually it has two different versions: one at the level of the parameters in the model and the other at the level of their statistical estimators. We will deal with the two versions separately.

Specific Objectivity as a Mathematical Principle

Suppose that the abilities of two examinees, a and b, are to be compared using their performances in item i. These performances are represented by the values f(θ_a, b_i) and f(θ_b, b_i) of the response function f(.). A comparison between the examinees is defined by Rasch as a comparator function

$$c\left[f(\theta_a, b_i),\, f(\theta_b, b_i)\right]. \tag{4}$$

The principle of specific objectivity requires that comparisons made between values of the ability parameter be independent of the values of the difficulty parameter of the items involved, and vice versa. Formally, this implies that the comparator function in (4) be independent of the item parameter b_i. Rasch (1977) was able to derive that a necessary and sufficient condition for this requirement to hold is additivity of the response function f(.). To demonstrate the condition, it is observed that from his proof it follows that there exist transformations g_1(.) and g_2(.) such that

$$g_1\left[f(\theta, b_i)\right] = \theta + g_2(b_i). \tag{5}$$


Obviously, if g_1(.) is taken to be the logit transformation and g_2(.) the reversal of the scale of the item difficulty parameter, the representation of the Rasch model in (2) is obtained. Thus, we may conclude that the Rasch model meets this version of the principle of specific objectivity.

To fully appreciate Rasch's derivation of (3) as a consequence of the principle of specific objectivity, several things should be noted. First, (4) is not a derivation of a model from certain conditions on the data; in fact, no definition of any data whatsoever is involved. The result is just a mathematical theorem on functions. The only quantities used are θ and b_i and another mathematical function c(.) defined on pairs of functions f(.). The reader should not be misled by the notation of the variables θ and b and derive some empirical meaning from it. As observed by Fischer (1987), the theorem belongs to the domain of functional equations and was already addressed by various mathematicians before Rasch formulated it as his first version of the principle of specific objectivity.

Second, an intuitive way to appreciate the result is to think of the well-known two-way ANOVA table, with the rows and columns representing the values of the parameters θ and b and the values of the response function f(θ, b) in the cells. The present version of the principle of specific objectivity requires that comparisons between columns be made independent of the value for the rows, and vice versa. In ANOVA terminology, it amounts to the requirement that the table be fully additive and show no interaction effects. Though additivity is a very welcome property making life truly elegant, life with interaction is possible. Rasch sometimes seemed to imply that in the presence of interaction effects no scientific statements are possible at all; see, for instance, the title of his 1977 paper.
As all analysts of tables know, comparisons in tables with interaction are possible; the only price to be paid is that they are to be made conditional on other variables. This makes them more complicated but not less true.

Third, the resemblance between the model of additive conjoint measurement in (1) and the representation of the Rasch model in (3) is remarkable and has been noted several times (Brogden, 1977; Perline, Wright, & Wainer, 1978). Strictly speaking, however, the resemblance is only formal. In the model of additive conjoint measurement, P is a dependent variable on which the objects are empirically ordered, whereas the left-hand side of the Rasch model is the logit of an unknown mathematical probability. Moreover, in (1) the objects are classified according to empirical values of A and B, but in (3) θ and b are unknown quantities again. All we are able to say is that if the Rasch
model held and the logits were known, then the logits would meet the technical conditions formulated in Luce and Tukey's (1964) theorem. Now, as will be shown below, the Rasch model has simple sufficient statistics for θ as well as b. These statistics, which are just the numbers of correct responses per examinee and item respectively, may be used to classify examinees and test items according to their estimated values of θ and b. Proceeding in this way, as Perline, Wright, and Wainer (1978) did, the fit of the model of additive conjoint measurement and the Rasch model to the same set of data may be compared. But the results are never decisive, since the model of additive conjoint measurement, being a deterministic model, will only fit a very small subset of all possible data sets generated according to the Rasch model. The fact that the Rasch model is not a deterministic but a stochastic measurement model brings us to the version of the principle of specific objectivity in the following section.

Fourth, the Rasch model is not the unique model that satisfies (5). If g_1(.) is taken to be the probit transformation, then the well-known normal-ogive model from Item Response Theory is obtained with discrimination and guessing parameters constrained to be equal to the values 1 and 0, respectively (Lord, 1952). According to the first version of the principle of specific objectivity, this constrained normal-ogive model is thus "specific objective."

Specific Objectivity as a Statistical Principle

The previous version of the principle of specific objectivity formulated a requirement for the model as a mathematical expression. Were the variables in the model known a priori for all persons and items, the principle would have had immediate practical meaning. Now it has not. For this reason, Rasch extended his principle to include a version formulated at the level of response data.
The version can be formulated as follows: Suppose one examinee with ability θ responds to a test consisting of only two items with difficulty parameters b_1 and b_2. Let us derive the probability that the examinee has one item correct, say item 1, given the fact that his total score on the test is r = 1. This means that either item 1 or item 2 is correct. The probabilities of the two outcomes are:

$$P\{(1,0) \mid \theta\} = \frac{\exp(\theta - b_1)}{\gamma_1 \gamma_2} \tag{6}$$

and

$$P\{(0,1) \mid \theta\} = \frac{\exp(\theta - b_2)}{\gamma_1 \gamma_2}, \tag{7}$$

where γ_i is the denominator of (2) for item i.


Now, noting cancellation of the factor dependent on θ, it follows for the probability of item 1 correct given r = 1 that:

$$P\{(1,0) \mid r = 1\} = \frac{\exp(-b_1)}{\exp(-b_1) + \exp(-b_2)}. \tag{8}$$
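The cancellation can be verified numerically. The following sketch assumes the logistic form of the Rasch probability and uses hypothetical difficulty values and function names; it computes the probability that item 1 is the correct one, given a total score of 1 on a two-item test, for several ability values and compares the results with the closed form exp(-b_1) / (exp(-b_1) + exp(-b_2)), which contains no ability parameter.

```python
import math

def rasch_probability(theta, b):
    """Probability of success for ability theta on an item of difficulty b."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

def prob_item1_given_score_one(theta, b1, b2):
    """P(response vector (1, 0) | total score r = 1) on a two-item test."""
    p1 = rasch_probability(theta, b1)
    p2 = rasch_probability(theta, b2)
    prob_10 = p1 * (1.0 - p2)        # item 1 correct, item 2 incorrect
    prob_01 = (1.0 - p1) * p2        # item 1 incorrect, item 2 correct
    return prob_10 / (prob_10 + prob_01)

# Hypothetical item difficulties and a spread of ability values.
b1, b2 = -0.3, 0.8
ability_values = [-2.0, 0.0, 1.0, 3.0]

conditional_values = [
    prob_item1_given_score_one(theta, b1, b2) for theta in ability_values
]

# Closed form that depends on the item parameters only.
closed_form = math.exp(-b1) / (math.exp(-b1) + math.exp(-b2))

# Largest deviation from the closed form across all ability values.
max_deviation = max(abs(v - closed_form) for v in conditional_values)
```

Whatever ability value is supplied, the conditional probability stays the same up to floating-point rounding, which is what makes the total score usable for conditional inference about the items.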

The surprising result is that although the probability of the response vector (1,0) depends both on θ and the two item parameters, the conditional probability given r = 1 depends only on the item parameters. In statistical terminology, and formulated at the level of any number of items, these few steps show us that the Rasch model has a simple sufficient statistic for the ability parameter: the number of correct responses by the examinee. Likewise, it can be shown that the number of correct responses on an item is a sufficient statistic for the difficulty parameter. Expressions as in (8) can be used for conditional maximum likelihood estimation of the ability and difficulty parameters. These conditional estimators have the same favorable asymptotic properties as maximum likelihood estimators in the regular case of models for identical independently distributed random variables (Andersen, 1980).

The above shows that the existence of the number of correct responses as a sufficient statistic is a necessary condition for the Rasch model. One may wonder if the reverse also holds and the presence of these statistics is a sufficient condition for the Rasch model. A proof of this property is given in Rasch (1968). Later, Andersen (1977) proved the more general claim that the existence of any (minimal) sufficient statistic for one parameter independent of the other parameter is a sufficient condition for the Rasch model. Thus the Rasch model has not only (nontrivial) sufficient statistics for its parameters, it is also the only model with this property.

The practical value of the presence of simple sufficient statistics can hardly be overstated. They allow the use of conditional inference that yields maximum likelihood estimators with known asymptotic properties.
This is not the case for other item response models, which are not even known to produce consistent estimators unless they are brought back to the regular case of models for identical independently distributed random variables, for instance, by introducing a common population from which the examinees are drawn. Because of this property, the Rasch model has a well-developed body of statistical theory for estimating its parameters and testing its goodness of fit. In particular the fact that excellent goodness-of-fit statistics are available for the Rasch model is of critical importance. As was pointed out in the earlier treatment of Luce and Tukey's methodology of implicit measurement, it is the fit of the model that guarantees the quantitativeness of the variables in the model. The Rasch model is based on statistical theory that works and produces results with known properties. The same holds for its many extensions to models dealing with different item formats, multidimensional abilities, and constraints on the item parameters.

In his writings, Rasch was not always clear about the meaning of his theorems and sometimes he was even a bit obscure. He seemed to prefer working outside of the mainstream of the statistical literature. For instance, he hardly ever referred to the theories of exponential families and sufficient statistics, which had their most important developments when Rasch worked on his model and were published in such standard references as Lehmann (1959). Nonetheless, his model belongs to an exponential family and thus has sufficient statistics. Instead he used such terms as "separability of parameters" or "specific objective comparisons" and always seemed to imply that his results meant something more than just statistical theorems and were attempts to found measurement, or even the validity of science. The danger of confusion is particularly evident in Rasch (1968), where he purports to prove that the Rasch model is a necessary consequence of separability of parameters but actually proves this for the presence of simple sufficient statistics. This is clear from the fact that in his proof he reduces the sample space to the two possible outcomes modeled in (6) through (7) and from there on demonstrates the necessity of the Rasch model. In so doing, the assumption of separable parameters is made identical to the one of the number of responses correct, r, being a sufficient statistic and can be abandoned as a superfluous concept. The same line of reasoning is typical of proofs on specific objectivity in Fischer (1987) and Roskam and Jansen (1984).
It is the generality of Rasch's claims and his mixing up of the concepts of specific objectivity and sufficient statistics that could lead to ascribing unrealistic properties to the Rasch model. For example, the belief is widespread that due to the presence of sufficient statistics, conditional maximum likelihood estimation in the Rasch model allows estimation of the same ability parameters from different samples of test items. This statement is statistically too simple to be true. First of all, any parameter can be estimated from any sample; the only relevant question is how good the estimators are. Now tests usually contain no infinitely large samples of items and we know that conditional maximum likelihood estimators have small-sample bias. Thus the expected ability estimates from different samples of test items (in the sense of hypothetical replicated administrations of the same two sets of items with the same examinees) are not identical and depend on the difficulty parameters of the items. Likewise, it is known that samples
of test items, however long, with different difficulty parameters may give rise to extremely different variances of the estimators. Thus conditional maximum likelihood estimators based on different samples of test items are not identically distributed estimators, let alone are they identical! What, then, is the correct claim? It is the statement that under the condition that the Rasch model holds, if the lengths of two different tests go to infinity, the conditional maximum likelihood estimators of the ability of the same person have the same expected value but are likely to have different variances. In other words, the correct inference is that the presence of sufficient statistics paves the way for the use of consistent estimators of the parameters in the Rasch model. "Specific objectivity" has no meaning beyond this! At the same time, consistency is a minimal prerequisite for parameter estimation, and from Andersen's (1977) result we know that the Rasch model has this property, but that all other models with incidental parameters lack it. It is in this sense that the fundamentals of Rasch measurement are fundamental.

The purpose of this chapter was to highlight a few moments in the history of thoughts about the foundation of measurement. In the first part of the chapter Campbell's notions of fundamental and derived measurement were reviewed and it was shown how nicely they fit the practice of measurement in the natural sciences. At the same time Campbell's emphasis on fundamental measurement as a necessary condition for derived measurement set a wrong model for the behavioral and social sciences. It created an obsession with fundamental measurement with subsequent attempts to relax fundamental measurement rather than derived measurement. Luce and Tukey, however, did the latter, using their model of additive conjoint measurement to show that measurement in the absence of fundamentally measured variables is possible, provided the variables are modeled jointly and directly as quantitative variables. It was emphasized that although it is tempting to see the absence of nonadditivity in Luce and Tukey's model as mandatory, nonadditive models, though more complicated, are still possible. The basic methodology is the joint modeling of latent variables to account for qualitative or ordinal data, which yields quantitative measures for the variables with scale properties defined by the invariance of the model. Others had already been practicing this form of implicit measurement, notably in the field of item response theory
where stochastic models were introduced to explain probabilities of success on test items by quantitative parameters associated with the abilities of the examinees and the features of the items. The Rasch model belongs to this domain of item response models. Rasch derived his model from his principle of specific objectivity. It was shown that this principle actually has two versions: the requirement of additivity of model structure and of simple sufficient statistics. The feature of additivity is not unique; it is shared with other models. However, the Rasch model is the only model with sufficient statistics and hence the unique model with incidental parameters for which consistent estimators are available.

REFERENCES

Andersen, E.B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69-81.
Andersen, E.B. (1980). Discrete statistical models with social science applications. Amsterdam: North-Holland.
Brogden, H.E. (1977). The Rasch model, the law of comparative judgment, and additive conjoint measurement. Psychometrika, 42, 631-635.
Campbell, N.R. (1928). An account of the principles of measurement and calculation. London: Longmans, Green & Co.
Ellis, B. (1966). Basic concepts of measurement. Cambridge: Cambridge University Press.
Fischer, G.H. (1987). Applying the principles of specific objectivity and of generalizability to the measurement of change. Psychometrika, 52, 565-587.
Krantz, D.H., & Tversky, A. (1971). Conjoint-measurement analysis of composition rules in psychology. Psychological Review, 78, 151-169.
Lehmann, E.L. (1959). Testing statistical hypotheses. New York: Wiley.
Lord, F.M. (1952). A theory of test scores. Psychometric Monograph No. 7. Psychometric Society.
Luce, R.D., & Tukey, J.W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1, 1-27.
Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Lawrence Erlbaum.
Perline, R., Wright, B.D., & Wainer, H. (1978). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 3, 237-255.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Paedagogiske Institut.
Rasch, G. (1968, September). A mathematical theory of objectivity and its consequences for model construction. Paper presented at the European Meeting on Statistics, Econometrics and Management Science, Amsterdam, The Netherlands.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. In M. Blegvad (Ed.), The Danish Yearbook of Philosophy. Copenhagen: Munksgaard.
Roskam, E.E., & Jansen, P.G.W. (1984). A new derivation of the Rasch model. In ... psychology. Amsterdam: Elsevier.
Stevens, S.S. (1951). Mathematics, measurement and psychophysics. In S.S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: Wiley.
Thurstone, L.L. (1927). A law of comparative judgment. Psychological Review, 34, 273-286.
van der Linden, W.J. (1994). Review of J. Michell, An introduction to the logic of psychological measurement. Psychometrika.

chapter 2

The Relevance of the Classical Theory of Measurement to Modern Psychology

Joel Michell
University of Sydney

... of measurement. It has been eclipsed by the representational theory, especially that version promoted by S.S. Stevens (1951, 1959) and those who later advanced his ideas much more rigorously (e.g., Krantz, Luce, Suppes, & Tversky, 1971; Luce, Krantz, Suppes, & Tversky, 1990). This theory, however, suffers certain philosophical weaknesses and, I argue, is inferior to the classical theory. The classical theory is not only sufficient to provide a basis for those enterprises called psychological measurement, it also has interesting consequences for that enterprise.

I am nervous about calling any theory classical, for it is a term debased by advertising copy. In this case, however, that qualm must be ignored. Literally, classical means of the highest class, and by association it has come to mean the cultures of ancient Greece and Rome. It is in this latter sense that I mean it. The theory of measurement described here is that implicit in the writings of Aristotle and Euclid. They presumed a theory that not only nourished the development of quantitative science in antiquity, but did so until the end of the 19th
century. Even after Aristotle fell from grace among the scientists of the 17th century, Euclid's Elements remained part of every scientist's training until the 20th century. This theory of measurement is still deeply ingrained in our culture. It remains not only the layperson's view of measurement, but the view of those scientists unaffected by philosophy or the social sciences. Of course, it was never static, and it changed over the centuries. What I offer is only an interpretation based on what I see as the best elements of that theory.

The central concept of this theory is the concept of a quantity. A quantity is a class of properties (such as length) or a class of relations (such as temporal durations), the elements of which stand in additive relations to one another rich enough to sustain numerical ratios. Length and time are two important paradigms of quantity, for the additive relations they involve seem, in some cases at least, to be directly visible. In some cases, for example, we are able to see that a particular length is composed entirely of other discrete lengths. Furthermore, this relation of additive composition between lengths we hold to be rich enough to sustain ratios. We do not hesitate to describe one length as being twice or thrice another, for example. In general, we believe that for any two lengths, x and y, there exists a real number, r, such that

$$x = ry.$$

The kind of structure that a set of properties or relations must have in order to sustain ratios is something like the following (Hölder, 1901; Michell, 1990; Stein, 1990). Let Q be a set of properties or relations and + a relation of composition upon Q, then + on Q sustains ratios if

1. for any a and b in Q, a + b = b + a (commutativity),
2. for any a, b, and c in Q, a + (b + c) = (a + b) + c (associativity),
3. for any a and b in Q one and only one of the following is true,
   3.1 a = b,
   3.2 there exists c in Q such that a = b + c,
   3.3 there exists c in Q such that b = a + c,
   (3 determines an order upon Q as follows: for any a and b in Q, a ≥ b if and only if either 3.1 or 3.2, and this order is transitive, antisymmetric, and strongly connected, i.e., a simple order),
4. for any a and b in Q, there exists a natural number n such that na > b (where na is defined recursively as 1a = a and (n + 1)a = na + a, for any natural number n).
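Conditions 1 to 4 can be checked mechanically on a finite sample of magnitudes. The following Python sketch is an illustration only and not part of the theory as stated. It assumes, hypothetically, that lengths are modeled as positive rational numbers (Python's `Fraction`) with ordinary addition as the relation of composition, and it bounds the search for the multiple in condition 4 by an arbitrary `max_n`. The function name `sustains_ratios` is invented for the example.

```python
from fractions import Fraction
from itertools import product


def sustains_ratios(sample, max_n=1000):
    """Check conditions 1-4 on a finite sample of magnitudes.

    Assumption for this sketch: Q is the positive rationals and + is
    ordinary addition, so "there exists c in Q with a = b + c" reduces
    to a - b > 0.
    """
    for a, b in product(sample, repeat=2):
        # Condition 1: commutativity.
        if a + b != b + a:
            return False
        # Condition 3: exactly one of a = b, a = b + c, b = a + c holds.
        if [a == b, a - b > 0, b - a > 0].count(True) != 1:
            return False
        # Condition 4 (Archimedean): some multiple na exceeds b.
        if not any(n * a > b for n in range(1, max_n + 1)):
            return False
    # Condition 2: associativity.
    return all(a + (b + c) == (a + b) + c
               for a, b, c in product(sample, repeat=3))


lengths = [Fraction(1, 3), Fraction(1, 2), Fraction(7, 5), Fraction(2)]
print(sustains_ratios(lengths))                     # True
print(sustains_ratios([Fraction(0), Fraction(1)]))  # False: no multiple of 0 exceeds any magnitude
print(lengths[3] / lengths[1])                      # 4, the ratio of the length 2 to the length 1/2
```

A finite check of this kind can only fail to refute the conditions; it cannot establish them for the whole of Q, which is why the hypothesis that a variable is quantitative remains an empirical one.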

Furthermore, if Q is order dense, continuous, and unbounded above (Michell, 1990) (as we believe length and time intervals to be), then


these numerical ratios are isomorphic to the positive real numbers. Of course, neither Aristotle nor Euclid possessed the modern concept of the real number system, but as both Bostock (1979) and Stein (1990) argue, the concept of a ratio developed by Euclid in Book V of the Elements (Heath, 1908) is equivalent to that of a positive real number as defined later by Dedekind (1909).

According to the classical theory, measurement is the discovery or estimation of such ratios. In very general terms what I mean by the ratio of a to b is the magnitude of a relative to b. For any a and b in Q (e.g., for any pair of lengths, say) the magnitude of a relative to b cannot necessarily be expressed as the ratio of one whole number to another, for there are, as we know, incommensurable pairs of magnitudes (for example, the lengths of the side and diagonal of a square). However, in such cases there will be a unique and well-defined set of numerical ratios less than a/b. Such a set is what Dedekind meant by a cut, and this concept he used to define the real number system. While the theory of ratios of nonnumerical quantities was highly developed by Euclid and his Book V of the Elements [...] Hölder (1901), the father of modern measurement theory, who first proved the relationship between Euclid's ratios and the modern concept of real number by explicitly defining what was meant by quantity.

The classical theory contains two more theses. One is that these ratios literally are the real numbers. The second is that the relation of additivity involved in any quantity is conceptually distinct from any relations of concatenation observable in the behavior of objects. The first thesis, that the real numbers are ratios of quantities, is not Aristotle's or Euclid's, though both held that numbers (for them, natural numbers) were empirical properties (see Lear, 1982; Stein, 1990) and [...] we attend to them while ignoring other properties of things).
However, this thesis was definitely a part of the classical theory by the 17th century, where we find it in Newton, who defined number as "the abstracted ratio of any quantity to another quantity of the same kind" (cf. Whiteside, 1967). From the classical view, the numbers are not abstract in the modern philosophical sense (i.e., nonempirical and outside of space and time); they are empirical relations of a special kind, the kind holding between different magnitudes of the same quantity. [...] things. Rather, in measurement we discover numerical relations between things, and these numerical relations are just as empirical as any other relations we may observe.

The second of these two additional theses constituting the classical theory is that the relation of additivity characterizing a quantity, and in virtue of which ratios obtain, is not to be identified with any relation of concatenation between the objects possessing magnitudes of the quantity. For example, in the case of length we may distinguish a relation between lengths on the one hand and a relation between objects possessing length (say, rods) defined in terms of an operation of concatenation. This operation of concatenation may or may not directly reflect the additivity of lengths, depending upon what other properties the rods possess, the conditions under which the operation is performed, and the precise nature of the operation. That is, there is no [...] connection and because any effect is never the product of a single cause (even in the laboratory), additivity will only be directly reflected in behavior under special conditions. [...] different kinds of quantities, but rather between the different ways quantities relate to the behavior of objects. In the case of extensive quantities, we are able to arrange conditions so that quantitative additivity is more or less directly reflected in the behavior of some objects for some restricted range of values. In the case of intensive quantities, quantitative additivity is only indirectly evident. This is essentially the distinction as made by the medieval scholar Nicole Oresme (see Clagett, 1968).

If there is a villain in the history of measurement theory then it is N.R. Campbell. Campbell (1920) denied both of these theses and so popularized the representational alternative that it became accepted dogma. However, he did not introduce representationalism. That honor belongs to Russell (1903). But it was Campbell's monograph that came to have a decisive influence. The last presentation of the classical theory was that given by A.N. Whitehead in Volume 3 of Principia Mathematica (Whitehead & Russell, 1913). Campbell's book was published in 1920, and from that time there are no expositions of the classical theory until my attempt (Michell, 1990).
Campbell made it seem that measurement was numerical representation rather than the discovery of the numerical value of ratios. [...] measurement as the numerical representation of empirical operations of addition. In the absence of such operations measurement was held to be impossible. This concept ignores the above distinction between additivity within the quantity and physical operations that reflect this underlying additivity. He did admit derived measurement, but it was made logically dependent upon fundamental measurement and the sense in which it involved numerical representation was never made


explicit. Thus, derived measurement sits uneasily with his insistence that measurement is numerical representation.

S.S. Stevens (1951, 1959) followed Campbell in denying these two features of the classical theory. He differed from Campbell in being a more thoroughgoing representationalist. Whereas Campbell wanted to restrict the concept of measurement to the numerical representation of operations of addition, Stevens simply wanted to define it as numerical representation per se. Measurement, for him, was the numerical representation of any empirical relation. This thoroughgoing representationalism entailed his famous theory of scale types and his notorious doctrine of permissible statistics. Both are artifacts of the representational theory of measurement and find no parallel within the classical theory.

Representationalism, despite its enormous popularity in both psychology and the philosophy of science, is really a sidetrack in the development of our understanding of measurement. It is a sidetrack because it is based upon an impossible theory of number. Within all versions of the representational theory, numbers are taken as given. However, it is clear from the logic of the representational theory that they are not given in empirical situations. The only empirical context complex enough to yield them is measurement itself, but according to this theory numbers are imported into measurement from outside the empirical domain. Representationalists make a hard and fast distinction between the empirical system, which is characterized as qualitative [...] Hence, numbers are held to be nonempirical entities of an abstract kind (in the special, modern sense of abstract, which means not located in space and time). Beyond that, representationalism involves no commitment as to what they might be.
This view of numbers makes them exotic things indeed, so it is something of a surprise to find that the representationalists' rationale for introducing them into science via measurement is their simplicity and the convenience of reasoning with them. As Bertrand Russell (1896/1983) put it, "Number is of all conceptions, the easiest to operate with, and science seeks everywhere for an opportunity to apply it" (p. 301). Hence, in measurement, empirical operations are represented numerically in order that "the powerful weapon of mathematical analysis" can "be applied to the subject matter of science" (Campbell, 1920, pp. 267-268). All representationalists have employed the same rationale.

This rationale raises some difficult questions. If the concepts of number are nonempirical, how can they be "the easiest to operate with"? Surely empirical concepts themselves would have to be easier,


for they are of familiar, perceptible qualities and relations, while numerical ones are abstract and unfamiliar. Related to this is a further question. Why are numerical concepts universally useful in empirical contexts if they are not also empirical concepts? Finally, if cognition is an empirical relation between our brains and the empirical environment, from whence would our numerical concepts have derived were they not empirical? The fact that numerical concepts are so easy to operate with, so universally useful, and so readily cognized is easily explained by the hypothesis that they are empirical concepts, but is seemingly inexplicable if they are not.

The hypothesis that numerical concepts are empirical ones has long been out of favor philosophically, and this is what has given the representational theory its philosophical audience. Stevens, in his turn, was influenced not only by Campbell and other representationalists, but also by the philosophical climate that held mathematics generally to be a system of tautologies, that is, by the movement called logical [...] empirical view is again on the philosophical agenda (see for example, Bigelow, 1988; Forrest & Armstrong, 1987; Irvine, 1990). In light of the above considerations, if plausible empirical candidates for the numbers, such as ratios of quantities, can be located, it seems obtuse not to recognize them as such.

If the classical theory could be rehabilitated into the mainstream of psychological science, what would be its implications for modern psychology? Some of the more important are as follows:

1. There are no distinctions of scale type;
2. There is no problem of permissible statistics (or, as it is known in its modern guise, of meaningfulness);
3. The hypothesis that a variable is quantitative is a substantive hypothesis and must be put to the test like any other in science;
4. Just because an instrument yields quantitative or numerical data, it does not follow that anything is being measured or that quantitative variables are involved; and
5. Testing the hypothesis that a variable is quantitative means finding evidence for additivity, and this does not necessarily mean extensive measurement (as Campbell thought).

Firstly, within the classical theory there are no distinctions of scale type. A measurement scale for some quantity is obtained when a unit is selected relative to which numerical ratios may be observed or estimated. Hence, all measurement scales are, to use Stevens' (1946) terminology, ratio scales. There are no nominal, ordinal, or interval scales


of measurement. This is not to say that one cannot code classes or orders numerically. It is just to say that numerical coding and measurement are quite different enterprises.

Secondly, there is no problem of permissible statistics. The numbers discovered or estimated in measurement are real numbers. Any mathematically valid argument forms applicable to real numbers may be applied to measurements, and the conclusions arrived at follow validly from those measurements. Of course, some conclusions have more generality than others; for example, conclusions that are independent of the unit employed. But this is just to indicate that formal validity is not the sole consideration in making inferences from measurements.

Stevens' problem of permissible statistics has, over the last 30 [...] Narens, 1985; Luce et al., 1990). This, like the problem of permissible statistics, is an artifact of the representational theory. According to that theory, since the facts numerically represented in measurement are essentially qualitative (that is, nonquantitative), it must follow that quantitative propositions based upon measurement are not literal descriptions of reality. Indeed, they may even lack any empirical or qualitative meaning. The problem of meaningfulness has two parts: first, the specification of necessary and sufficient conditions for quantitative propositions to contain empirical meaning; and second, the determination of the empirical content of the meaningful propositions. Both parts have proved difficult and neither is as yet satisfactorily solved within the framework of the representational theory. However, for the classical theory there is no problem of meaningfulness, for the numerical ratios discovered in measurement are held to exist empirically and quantitative measurement propositions are literal assertions about them. It is this consequence of the classical theory, with its great simplicity, that is its major strength relative to the representational theory.
Thirdly, the hypothesis that a variable is quantitative is a substantive hypothesis and must be put to the test, like any other hypothesis in science. There is a real distinction between quantitative and nonquantitative variables. It is a distinction that resides in the internal structure of the variable itself and not in our procedures. Hence, if psychology is to be a quantitative science it must be shown experimentally that psychological variables are quantitative. Two errors prevented psychologists from seeing this clearly. One was the Pythagorean dogma that all natural variables are quantitative. This dogma dominated much of 19th century science and strongly influenced the founders of modern psychology. Many of them presumed that if psychology were to be a science it had to be quantitative, and so they never


attempted to test the hypothesis that such variables as mental ability or intensity of sensations were quantitative. The second error that clouded the issue was the operational view that measurement is really only a matter of devising number-generating procedures. Of course, numerical procedures are needed for measurement, but only if the variable involved really is measurable.

Fourthly, taking up that last point, just because an instrument yields quantitative data, it does not follow that anything is being measured or that quantitative variables are involved. Guided by a mixture of Pythagoreanism and operationalism, psychologists have devised a wide range of procedures that generate numerical data, including mental tests, rating scales, attitude and personality questionnaires, and magnitude estimations. For many it seemed that no more was involved in psychological measurement than devising such procedures. Even if psychologists did not know exactly what they measured, they could be confident that because the procedures resulted in numerical assignments they must be measuring something. However, to assert that, on the classical view, means assuming that the underlying psychological variables causally implicated in producing numerical scores of one kind or another are quantitative, and a substantive hypothesis like that could well be false. Hence, to assume it is true is unwarranted. Evidence is needed.

This leads to the fifth implication, which is that testing for quantity means finding evidence for additivity, but this does not necessarily mean extensive measurement. All that is required in order to test for additivity is the discovery of situations sensitive to its presence or absence in the variables being studied. It is fruitless to attempt to test for additivity in situations that are indifferent to its existence. In that way the hypothesis could never be falsified.
Simply because many of the quantitative procedures devised by psychologists are not sensitive to underlying additivity, they do not enable a genuine test of this property. However, extensive measurement is not necessary to do this, as Campbell mistakenly insisted. Perhaps the most important legacy of the representational theory is the theory of conjoint measurement (see Krantz et al., 1971), for it demonstrates that additive structure can be tested for via ordinal relations. The future of psychological measurement lies in finding new ways to apply this theory to situations involving variables that psychologists have traditionally presumed to be quantitative. To elaborate upon this point, it is already known that many quantitative theories in psychology admit application of conjoint measurement theory. Some of the simpler applications are described in Michell (1990), and many others are described elsewhere (e.g., Perline, Wright,


& Wainer, 1979; and Levelt, Riemersma, & Bunt, 1972). The kind of situation to which conjoint measurement theory in its simplest form is applicable is one involving the relation between three not necessarily distinct variables. Suppose that levels of variables A and X combine noninteractively to produce levels of variable P, but that none of these variables can be measured as yet. If levels of A and X can be independently identified and the consequent levels of P can be ordered, then that is sufficient to (a) test the hypothesis that A, X, and P are quantitative, and (b) if they are, to begin measuring them. What is required is that the order upon P satisfy a hierarchy of cancellation conditions (see Krantz et al., 1971; Michell, 1990).

We may think of the relationship between A, X, and P as expressed in a matrix in which the rows are levels of A (call them a, b, c . . . ), the columns levels of X (call them x, y, z . . . ), and the cells levels of P (call the result of combining level a of A with level x of X, level (a, x) of P, and so on). The cancellation conditions are then constraints upon the ordinal relations between levels of P. For example, single cancellation (often called independence) is that the order upon the columns in any row must be replicated in all rows and that, likewise, the order upon the rows in any column must be replicated in all columns. Double cancellation, triple cancellation, and so on, are more complex ordinal constraints. The important point about such conditions is that they are testable and, so, present the possibility of testing the hypothesis that A, X, and P are all quantitative. To be more precise, single cancellation and double cancellation may be expressed as follows.

Single Cancellation

(1) For any levels, a and b, of A and, x, of X, if (a,x) ≥ (b,x) then for all other levels, y, of X, (a,y) ≥ (b,y); and
(2) for any levels x and y, of X and a of A, if (a,x) ≥ (a,y) then for all other levels, b, of A, (b,x) ≥ (b,y).

Double Cancellation

For any levels, a, b, and c, of A and x, y, and z, of X, if (a,y) ≥ (b,x) and (b,z) ≥ (c,y), then (a,z) ≥ (c,x).
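Because the cancellation conditions refer only to the order upon levels of P, they can be tested on any matrix of ordered outcomes. The Python sketch below is a hedged illustration, not a procedure taken from the sources cited. It assumes the order is read as "at least as great as", that the matrix is given as a list of rows whose entries need only be comparable, and it uses the standard statement of double cancellation: if (a,y) ≥ (b,x) and (b,z) ≥ (c,y), then (a,z) ≥ (c,x). The function names and the example matrices are invented for the purpose.

```python
from itertools import product


def single_cancellation(P):
    """P[i][j] is the ordinal level of P for row level i of A and column level j of X."""
    rows, cols = range(len(P)), range(len(P[0]))
    # (1) The order between two rows must be the same in every column.
    for a, b in product(rows, repeat=2):
        if any(P[a][x] >= P[b][x] for x in cols) and not all(P[a][y] >= P[b][y] for y in cols):
            return False
    # (2) The order between two columns must be the same in every row.
    for x, y in product(cols, repeat=2):
        if any(P[a][x] >= P[a][y] for a in rows) and not all(P[b][x] >= P[b][y] for b in rows):
            return False
    return True


def double_cancellation(P):
    """If (a,y) >= (b,x) and (b,z) >= (c,y), then (a,z) >= (c,x) must hold."""
    rows, cols = range(len(P)), range(len(P[0]))
    for a, b, c in product(rows, repeat=3):
        for x, y, z in product(cols, repeat=3):
            if P[a][y] >= P[b][x] and P[b][z] >= P[c][y] and not P[a][z] >= P[c][x]:
                return False
    return True


# Hypothetical additive structure: each cell is a row value plus a column value.
additive = [[r + c for c in (0, 2, 5)] for r in (0, 1, 3)]
# Rows and columns are consistently ordered, yet the order is not additive.
non_additive = [[1, 4, 6],
                [3, 5, 9],
                [7, 8, 10]]

print(single_cancellation(additive), double_cancellation(additive))          # True True
print(single_cancellation(non_additive), double_cancellation(non_additive))  # True False
```

The second matrix shows why independence alone is not enough: its rows and columns are ordered consistently, but (a,y) = 4 ≥ (b,x) = 3 and (b,z) = 9 ≥ (c,y) = 8 while (a,z) = 6 is less than (c,x) = 7, so no additive representation of it exists.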


The other cancellation conditions are of this form, but more complex. In essence they all state that if certain specified ordinal relations exist between levels of P, then others must obtain as well. As mentioned, A, X, and P need not be distinct variables, and I have been interested in exploring the application of conjoint measurement theory to Coombs' (1964) theory of unidimensional unfolding (Michell, 1990). For certain sets of preference orders, Coombs' theory entails an ordering upon interstimulus midpoints. Such an ordering must satisfy the hierarchy of cancellation conditions if the dimension involved is quantitative because the midpoint between any two stimuli is a noninteractive function (midpoint(x,y) = ½(x + y)). Hence, just by inspecting preference orders on sets of unidimensional stimuli (for example, attitude statements) the hypothesis that the dimension involved is quantitative may be tested.

Taking the classical theory of measurement seriously is a necessity for the enterprise called psychological measurement, if it is to become part of mainstream quantitative science. At present psychological measurement only sustains itself by defining measurement in its own special way. In the physical sciences its meaning is tied to the classical theory (cf., e.g., Beckwith & Buck, 1961). Taking the classical theory seriously means, above anything else, finding ways to test the hypothesis that psychological variables are quantitative, and our best hope of doing that is through applying the theory of conjoint measurement.

REFERENCES

Beckwith, T.G., & Buck, N.L. (1961). Mechanical measurements. Reading, MA: Addison-Wesley.
Bigelow, J. (1988). The reality of numbers. Oxford: Oxford University Press.
Bostock, D. (1979). Logic and arithmetic: Vol. 2, Rational and irrational numbers. Oxford: Oxford University Press.
Campbell, N.R. (1920). Physics, the elements. Cambridge, UK: Cambridge University Press.
Clagett, M. (1968). Nicole Oresme and the medieval geometry of qualities and motion. Madison, WI: Wisconsin University Press.
Coombs, C.H. (1964). A theory of data. New York: Wiley and Sons.
Dedekind, R. (1909). Essays on the theory of numbers. Chicago: Open Court.
Forrest, P., & Armstrong, D.M. (1987). The nature of number. Philosophical Papers, 16, 165-186.
Heath, T.L. (1908). The thirteen books of Euclid's elements (Vol. 2). Cambridge, UK: Cambridge University Press.
Hölder, O. (1901). Die Axiome der Quantität und die Lehre vom Mass. Berichte über die Verhandlungen der Königlich Sächsischen Gesellschaft der Wissenschaften zu Leipzig, Mathematisch-Physische Klasse, 54, 1-64.
Irvine, A.D. (1990). Physicalism in mathematics. Boston: Kluwer Academic.
Krantz, D.H., Luce, R.D., Suppes, P., & Tversky, A. (1971). Foundations of measurement (Vol. 1). New York: Academic Press.
Lear, J. (1982). Aristotle's philosophy of mathematics. Philosophical Review, 91, 161-192.
Levelt, W.J.M., Riemersma, J.B., & Bunt, A.A. (1972). Binaural additivity in loudness. British Journal of Mathematical and Statistical Psychology, 25, 51-68.
Luce, R.D., Krantz, D.H., Suppes, P., & Tversky, A. (1990). Foundations of measurement (Vol. 3). New York: Academic Press.
Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Erlbaum.
Narens, L. (1985). Abstract measurement theory. Cambridge, MA: MIT Press.
Newman, E.B. (1974). On the origin of scales of measurement. In H.R. Moskowitz, B. Scharf, & J.C. Stevens (Eds.), Sensation and measurement (pp. 137-145). Dordrecht-Holland: Reidel.
Perline, R., Wright, B.D., & Wainer, H. (1979). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 9, 249-264.
Russell, B. (1983). The a priori in geometry. In K. Blackwell, A. Brink, N. Griffin, R.A. Rempel, & J.G. Slater (Eds.), The collected papers of Bertrand Russell (Vol. 1, pp. 289-304). London: George Allen & Unwin. (Original work published 1896.)
Russell, B. (1903). Principles of mathematics. Cambridge, UK: Cambridge University Press.
Stein, H. (1990). Eudoxos and Dedekind: On the ancient Greek theory of ratios and its relation to modern mathematics. Synthese, 84, 163-211.
Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 667-680.
Stevens, S.S. (1951). Mathematics, measurement and psychophysics. In S.S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: Wiley.
Stevens, S.S. (1959). Measurement, psychophysics and utility. In C.W. Churchman & P. Ratoosh (Eds.), Measurement: Definitions and theories (pp. 18-63). New York: Wiley.
Suppes, P. (1959). Measurement, empirical meaningfulness and three-valued logic. In C.W. Churchman & P. Ratoosh (Eds.), Measurement: Definitions and theories (pp. 129-143). New York: Wiley.
Whitehead, A.N., & Russell, B. (1913). Principia mathematica (Vol. 3). Cambridge, UK: Cambridge University Press.
Whiteside, D.T. (1967). The mathematical works of Isaac Newton (Vol. 2). New York: Johnson Reprint Corp.

chapter

3

The Rasch Debate: Validity and Revolution in Educational Measurement*

William P. Fisher, Jr.

Postmodern Quantities, Inc. New Orleans, LA

THE DEBATE

Cherryholmes (1988, p. 449) uses a passage from Rorty (1985) to contrast traditional and alternative approaches to construct validity. Rorty describes two ways in which people make sense of their lives. In one way, the context in which life is understood is that of historical or fictional heroes and heroines; in the other, life is understood in relation to a nonhuman, supposedly unchangeable reality, such as nature. The first way fosters solidarity in community life, the second objectivity, in the positivist sense of facts supposed to completely transcend culture and history. Rorty and Cherryholmes stress that the problem with the one-sided sense of objectivity is that it fails to recognize and

* The author would like to thank the Spencer Foundation for supporting this research, and to thank Carol Myford, Jackson Stenner, Mark Wilson, and Benjamin Wright for their readings of the text and their helpful comments, but must take responsibility for the ideas expressed in the chapter himself.


acknowledge its own cultural and historical embeddedness. I would like to add that the problem with the use of narrative stories in the creation of meaning and validity of constructs is that it fails to recognize and acknowledge its own possibilities for a new, more conversational and playful, yet nonetheless rigorous, sense of objectivity.

The Rasch debate is a variation on the theme stated by Rorty and Cherryholmes. Jaeger (1987, p. 8) has juxtaposed two quotes that restate the theme in the terms of the debate:

There appears to be a fundamental difference in measurement philosophy between those on the two sides of the Rasch debate. . . . The difference is well characterized in the writings of Benjamin Wright (1968) and E.F. Lindquist (1953). First Wright:

Science conquers experience by finding the most succinct explanations to which experience can be forced to yield. Progress marches on the invention of simple ways to handle complicated situations. When a person tries to answer a test item the situation is potentially complicated. Many forces influence the outcome, too many to be named in a workable theory of the person's response. To arrive at a workable position, we must invent a simple conception of what we are willing to suppose happens, do our best to write items and test persons so that their interaction is governed by this conception and then impose its statistical consequences upon the data to see if the invention can be made useful. (1968, p. 97) [emphasis added; and the quote is actually from Wright, 1977b, p. 97].

In contrast, Lindquist wrote:

A good educational achievement test must itself define the objective measured. This means that the method of scaling an educational achievement test should not be permitted to determine the content of the test or to alter the definition of objectives implied in the test. From the point of view of the tester, the definition of the objective is sacrosanct; he has no business monkeying around with that definition. The objective is handed down to him by those agents of society who are responsible for decisions concerning educational objectives, and what the test constructor must do is to attempt to incorporate that definition as clearly and exactly as possible in the examination that he builds. (1953, p. 35) [emphases added].

Although Jaeger also characterizes the debate as one "between advocates and opponents of the use of IRT [Item Response Theory] in test development and scaling," the debate on the usefulness and meaningfulness of Rasch measurement is conducted within what Jaeger would call the IRT community just as much as between it and those outside of it. The debate is therefore taking place on a number of levels, as well as in an international forum.


Those advancing various reasons for not using Rasch's approach to educational and psychological measurement, or for narrowly restricting its application, include Bollinger and Hornke (1978), Divgi (1986, 1989), Goldstein (1979, 1980, 1983), Grau and Mueser (1986), Lord (1980, p. 58; 1983), Whitely (1977), Whitely and Dawis (1974), and Wood (1978). Those rebutting the claims of the critics include Andrich (1988, 1989), Fischer (1987, p. 585), Fisher (1991), Gustafsson (1980), Henning (1989), Lewine (1986), and Wright (1968, pp. 99-101; 1977a; 1977b, pp. 102-104; 1984; 1985, pp. 107-109; Wright & Linacre, 1989). Some Rasch advocates suggest that Rasch measurement presents the possibility for a revolution in educational and social measurement (Andrich, 1987; Duncan, 1984a,b,c; Fisher, 1988, 1991; Loevinger, 1965; Singleton, 1991). The same sort of claims (Cliff, 1973; Michell, 1990) have been advanced on behalf of conjoint measurement theory (Luce & Tukey, 1964; Krantz, et al., 1971; Ramsay, 1975), to which Rasch's work is closely related (Brogden, 1977; Perline, Wright, & Wainer, 1979).

Lindquist is plainly and emphatically appealing to a one-sided objectivism in which construct validation is assumed to take place outside of the context in which the construct is manifest. Wright, in contrast, is just as plainly and emphatically struggling with the problem of dealing with the way constructs are simultaneously invented and discovered. Where Lindquist speaks of the sacrosanct, untouchable nature of test items, Wright says that test items amount to nothing more than guesses as to how a construct articulates itself. Wright's suggestion that we observe how well the guesses work to provoke a manifestation of the construct via the interaction of question and answer, and then see how far the guesses can be made to work in practice, is a fair approximation of what Ricoeur (1981, pp.
212-213) calls the method of converging indices and its probabilistic approach to the validation of guesses. Lindquist wants to disavow the fact that the test items originated in a discursive context, preferring to conceive of them as given in an objective reality. Wright, however, is focusing explicitly on the circular manner in which guesses about reality are entertained, criticized, tested, and applied in an ongoing constructive way. The extent to which Lindquist is articulating a commonly held position in educational measurement is indicated by the popularity of multiparameter IRT models. The unwillingness of educators to enter into the circular and conversational logic of construct validity continues, despite the fact that the mathematical form of the IRT models contradicts necessary and sufficient requirements for objectivity (Wright, 1984; Andrich, 1988, p. 67), and makes the models difficult and expensive to use (Wright, 1984; Stocking, 1989; Hambleton &


Cook, 1977, p. 76; Hambleton & Rogers, 1989, p. 158). One reason for the popularity of two- and three-parameter measurement models in education is that they allow the test constructor to accept the validity of test items with no questions asked. Multiparameter models suppress questions of fit because most items fit these models, and when they do not, the reasons why are so technical that confidence in the test is not affected. The Rasch, or "one-parameter," approach, in contrast, requires the test constructor to pay close attention to the functioning of the items, checking for the extent to which they can be said to hang together along a single continuum of more and less difficulty. The critical evaluation of the performance of the items on the test undercuts the one-sidedness of the test writers' and researchers' authority by acknowledging the voices of the test takers. Instead of objectifying test takers by subjecting them to an unquestionable authority (Cherryholmes, 1988, p. 430), the Rasch approach to test construction promotes a conversation in which questions are tested by the respondents just as much as the respondents are tested by the questions.

Rigorous test administration practices demand that the intrusion of any factors other than the abilities of the persons measured and the difficulties of the problems posed be minimized. Wright and Stone (1979, pp. 10-11) ask why test administration should not follow through on this demand, explicitly enacting in practice what is otherwise merely assumed to be required for legitimate comparisons. Duncan (1984b, p. 217; also see 1984c, p. 400) observes that what we need are not so much a repertoire of more flexible models for describing extant tests and scales . . . but scales built to have the measurement properties we must demand if we take "measurement" seriously.
As I see it, a measurement model worthy of the name must make explicit some conceptualization—at least a rudimentary one—of what goes on when an examinee solves test problems or a respondent answers opinion questions; and it must incorporate a rigorous argument about what it means to measure an ability or attitude with a collection of discrete and somewhat heterogenous items.

The great majority of educational measurement models do in fact belong to a repertoire of models flexible enough to describe extant tests. Rasch models, in contrast, specify the properties we must demand if we take measurement seriously, focusing on meaningful comparisons, those in which item difficulty does not depend on person ability, and vice versa. More flexible models, by definition, allow unexamined presuppositions, prejudices, and preconceptions concerning who the persons measured are, and whether the test items actually belong to the same variable, to interfere with the measurement process. Should not the preconceptions that necessarily structure questions and observations themselves be examined, modified, and accounted for, just as much as the students' test behavior and environment is controlled? These questions raise issues best addressed by widening the scope of the debate to include explicit considerations of what the most important form of test validity is.
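The requirement that comparisons of items not depend on the persons involved can be written out for the simplest, dichotomous case. The following is a sketch only: the symbols are assumptions introduced for illustration and do not appear in the chapter, with $\theta_n$ standing for the ability of person $n$, $\delta_i$ for the difficulty of item $i$, $\alpha_i$ for an item discrimination parameter, and $P_{ni}$ for the probability that person $n$ succeeds on item $i$.

```latex
\begin{align*}
\text{Rasch:}\quad
  \log\frac{P_{ni}}{1-P_{ni}} &= \theta_n - \delta_i \\
  \log\frac{P_{ni}}{1-P_{ni}} - \log\frac{P_{nj}}{1-P_{nj}}
    &= (\theta_n - \delta_i) - (\theta_n - \delta_j) = \delta_j - \delta_i \\[6pt]
\text{Two-parameter:}\quad
  \log\frac{P_{ni}}{1-P_{ni}} &= \alpha_i(\theta_n - \delta_i) \\
  \log\frac{P_{ni}}{1-P_{ni}} - \log\frac{P_{nj}}{1-P_{nj}}
    &= (\alpha_i - \alpha_j)\,\theta_n + \alpha_j\delta_j - \alpha_i\delta_i
\end{align*}
```

In the first case the person parameter cancels, so the comparison of items $i$ and $j$ is the same for every person. In the second, the term $(\alpha_i - \alpha_j)\,\theta_n$ remains whenever the discriminations differ, so the comparison changes with the ability of the person who happens to be measured.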

MATTERS OF CONTEXT

Content and Construct

Lindquist is working from within the traditional positivist framework, described by Burtt (1954) as one which defines objectivity as a matter of letting data speak for themselves, with no recourse to presuppositions or hypotheses allowed. This sense of data arose in historical periods when nature was conceived to be a static constant, with the continents, seas, stars, planets, and biological life precisely the same now as they were on the day God finished the Creation. This sense of data as existing eternally and independent of any human context has fallen under the weight of many different factors, ranging from notions concerning the life cycle of the universe, plate tectonics, and evolution, to the observation that what counts as legitimate data and rational thinking changes from one historical period to another (Kuhn, 1961, 1970; Toulmin, 1982; Holton, 1988; Hesse, 1970, 1972). However, many of us, like Lindquist, continue to think and act, out of habit, perhaps, as if data are given, not emerging from within a frame of reference.

Messick (1975, p. 959; Cherryholmes, 1988, p. 426) offers a more specific reason for Lindquist's views on educational measurement: Construct validity is not usually sought for educational tests, because they are typically already considered to be valid on other grounds, namely, on the grounds of content validity. Hambleton and Novick (1973) claim, for example, that "above all else, a criterion-referenced test must have content validity" (p. 168). To assume that tests are valid on grounds of content validity is to be imbued with the overweening confidence that things are as they are because that is the way someone says they are, not because that is the way they actually play themselves out in practice. Examination of the empirical consistency of data may lead to the conclusions that particular test items, and perhaps specific content areas included on a test, represent constructs different enough in their conceptual structure to invalidate the inferences concerning abilities typically made on the basis of test scores. The search for construct validity may then contradict the conclusions already drawn concerning the content validity of test items, as Phillips (1986, p. 107) indicates:

the deletion of misfitting items raises the issue of sacrificing validity for model fit. Typically, achievement test batteries are carefully developed according to detailed content specifications. If items are dropped from a subtest, that subtest no longer matches the test specifications and has lost content validity.

Notice the force of Phillips's assertion: validity is inherently a matter of content validity. As Lindquist makes explicit, no question need be raised concerning construct validity, concerning whether or not what is measured is actually what is assumed to be measured. A typical reaction to the suggestion that some items should be deleted from a test assumes that content validity is the only validity relevant to an educational test, as when it is said that

It is by no means clear that the Rasch model does describe real data very well. Willmott & Fowles (1974) admit that when testing the model some items do not fit the model. These are omitted from the set of items. As they say, "The criterion is that items should fit the model, and not that the model should fit the items." (!) (Goldstein & Blinkhorn, 1977, p. 310; original emphasis and exclamation; also see Goldstein, 1979, pp. 215-216)

Because the position informed by measurement theory asserts that data should be fit to a model that clearly specifies criteria for recognizing data good enough to measure with, the Rasch model may not always describe real data very well.
This state of affairs says more about the quality of the data than the usefulness of the model. Goldstein (1979, p. 216), however, is adamant about "moving away from the doctrine of a singly underlying trait, [in order to] allow educational criteria properly to determine test content." But as Gustafsson (1980) points out, items that do not belong to one construct may well belong to another; the problem may be as simple as separately analyzing the groups of items. No one in this debate has seriously recommended that misfitting items simply be discarded. It is only reasonable to think that items from the same content domain might represent different constructs, and produce data with independent empirical consistencies. The point is to admit that measurement always and everywhere follows from a metaphysics of what counts as an observation (Burtt, 1954; Heelan, 1972, 1983, 1985; Heidegger, 1967; Hudson, 1972; Ihde, 1979, 1991; Kuhn, 1961), and to step into the flow of the hermeneutic circle deliberately and in accord with our intentions.

Imagination, Ideality, and Empirical Consistency

Focusing on content to the exclusion of the construct reenacts a fundamental error that has been repeated over and over again in the history of science. The error made one of its earliest and most famous appearances in the Pythagorean ontological confusion of representations and images for the things themselves. In the same way that an exclusive focus on content validity precludes attention to constructs, Pythagoreans

take number and numerical relationships for existence itself and are unable to think of the noetic order of existence by itself, [and so they never] see the real implications of the [Platonic] doctrine of ideas. (Gadamer, 1980, p. 35; also see p. 32)

The Pythagoreans were caught up in unsolvable problems such as the squaring of the circle, trying to solve them by means of the physical transcription of the images themselves. Besides forbidding "all recourse and all allusion to manipulations, [and] to physical transformations of figures," Plato redefined the elements of geometry, "denominating such concepts as line, surface, equality, and the similarity of figures" (Ricoeur, 1965, p. 202; also see Gadamer, 1980, p. 150). Conceiving a point as "'an indivisible line,' and a line as 'length without breadth'" (Cajori, 1985, p. 26), Plato construed geometric entities as fictions in order to make the difference between names and concepts as plain as possible.
Galileo placed modern science on the same footing when he based his theory of gravity on the behaviors of objects in a frictionless vacuum, behaviors he would never observe. Rasch's (1960, pp. 37-38) comment that "a model is not meant to be true" is intended to have the same effect as Galileo's realization that he was imagining how gravity might be modeled. Theories and models never fit experience exactly, but instead serve as heuristic aids in organizing and managing experience meaningfully. For instance, the crisis of Pythagorean mathematics was overcome by Plato's redefinition of geometrical elements, because irrational numbers live out the same conceptual existence in ideality that rational ones do. The irrationality of the square root of two no longer threatened the heart of mathematical reason after Plato because the existence of this number and the line segment it represents no longer depended upon representation as a line segment of precisely drawable length or as a number that could be exactly specified. The crisis of educational, psychological, and social measurement provoking the Rasch debate hinges on the same problem, namely, that the rationality of testing depends on whether the qualities measured are modeled by content (name) or construct (concept).

The point in using figures of any kind, whether they are metaphorical, numerical or geometrical, is to facilitate clarity in thinking through clear representation of the thing itself. Clear views of things are brought about when one can see through the content of the particular figure drawn and see the thing itself free of influence from the particular representation instrumental to the observation. Plato's restricting the use of instruments in geometry to the compass and straightedge was aimed at allowing things to communicate themselves, not by confusing the conceptual ideality of things with their names, as Pythagoreans and positivists do, but by using the instruments as media for the expression of the things themselves. Plato placed philosophy in close association with mathematics because geometrical analyses are not valid just because they are performed on geometrical figures such as circles and triangles.

It is essential to establish the validity of the construct, to distinguish between the content of the items and the validity of taking them as representative of a conceptual dimension. "Since predictive, concurrent, and content validities are all essentially ad hoc, construct validity is the whole of validity from a scientific point of view" (Loevinger, 1957, p. 636, in m referenced" (Messick, 1975, p. 957, emphasis in original). Loevinger's (1965, p.
151) appreciation for Rasch measurement cannot be separated from her position on construct validity, since "any concept of validity of measurement must include reference to empirical consistency" (Messick, 1975, p. 960). Whitely, on the other hand, holds to the explicitly positivist end of Cronbach and Meehl's (1955) sense of construct validity as "appealing to criteria outside of the measuring process . . . in accordance with a nomothetic network" (Whitely, 1977, p. 232), which is exactly the way Goldstein (1983), Hambleton and Novick (1973), Lindquist (1953), and Phillips (1986) see the matter. Wright's (Wright & Masters, 1982, p. 91) concept of construct validity is much closer to Cherryholmes's, Loevinger's, and Messick's discursive formulation than it is to Whitely's positivist construal:

The responses of each person can be examined for their consistency with the idea of a single dimension along which items have a unique order. Unless the responses of a person are in general agreement with the ordering of items implied by the majority of persons, the validity of the person's measure is suspect.

The same dialectical relation between whole and part holds for items:

Responses to each item must be examined for their consistency with the idea of a single dimension along which persons have a unique order. Unless the responses to an item are in general agreement with the ordering of persons implied by the majority of items, the validity of the item is suspect.

Wright stresses the need to constantly refer and defer to the text of what has been said and done in the administration of the test. In a manner reminiscent of recent work in the philosophy of science that stresses the mediating role of instruments in experiment (Ackermann, 1985; Heelan, 1983, 1985; Ihde, 1979, 1991), Wright is construing data as a text that resonates in the lives of those who read and write it. And in contrast to the detached, uninvolved, and cool sense of theorizing deployed by those who take content validity as primary, Wright's stress on the use of experiment belies his sense of theory as a matter of participating in and being committed to the object of discourse, which is again in close accord with recent observations made in the philosophy and history of science (Hacking, 1983, 1988; Heelan, 1988, 1989; Hesse, 1970, 1972; Holton, 1988; Kuhn, 1961, 1970; Latour & Woolgar, 1979; Ormiston & Sassower, 1989).

The history of science supports the discursive formulation of construct validity and disputes positivism's exclusive concern with content because of the crucial importance of the ontological difference between mathematical and perceptible being.
This difference is what "Eudemos singles out [as] Plato's contribution in his history of mathematics, namely, to have distinguished between name and concept (Simp Plato resolved the Pythagorean overcomplications with mathematical clarity and simplicity, Copernicus, Kepler, and Galileo founded modern science when they resolved the Aristotelian astronomical complications by basing their studies on mathematical idealizations and observations. Cronbach and Meehl (1955) focused attention on the difference between content and construct, and brought social measurement a step nearer to recreating the ancient meaning of mathematical clarity. Rasch's restrictions on measuring instruments, in turn, have the potential of recreating in social science what Plato's and Galileo's restrictions on, and uses of, measuring instruments did for geometry and natural science. Instead of allowing the perceptible being of content to dictate validity, Rasch measurement fosters an awareness of the ontological depth that mathematical description offers. Those who take content validity to be the sole form of validity required for measurement wish to be able to nail down hard facts, not go with the flow of the life cycle of facts (Fleck, 1979) through their birth, life, and death, as is required for the validation of constructs.

JAEGER'S REVOLUTION REVISITED

Jaeger (1987) juxtaposes the quotes from Lindquist and Wright in the context of alternately proclaiming and questioning the revolutionary status of developments in educational measurement over the last 20 years. Just as Wright (1984, 1988b, for example) often does, Jaeger (1987, pp. 9-12) uses quotes from Thorndike and Thurstone as evidence of the age and importance of some of the most fundamental ideas in educational measurement. But Jaeger does not explore the possibility that the revolution in educational measurement begun by Thorndike, Thurstone, and others is still happening; and he does not sufficiently elaborate upon what the point of the revolution might be. The contextual matters crucial to understanding the Rasch debate have provided some clues as to what that point might be. Kuhn (1970) suggests more to look for when he indicates that observational anomalies, methodological problems in accounting for them, and resulting degrees of extreme complication prepare the ground for scientific revolutions.
Thus, the Pythagorean and Aristotelian overcomplications and rationalizations that Plato and Galileo cut through with their insistence on rigorous observation and mathematical idealization in the use of the compass, straightedge, and telescope may have their parallels in the fixation on content validity plaguing educational measurement. The history of science in general, and Kuhn's theory of scientific revolutions in particular, leads to at least three hypotheses concerning the extent to which the Rasch debate is a revolution in the making (Andrich, 1987). These hypotheses, and some evidence bearing them out, will be briefly enumerated and sketched.

Crisis

The first hypothesis of scientific revolution asserts that there should be a widespread general sense of crisis in the field, as well as in others constrained by the same paradigmatic orientation. In this case, education, measurement, and the very proposition that quantification could be useful and meaningful should be under fire. That education is in a state of crisis is by now an understatement; crisis in the world at large has escalated to the point that crisis has become the normal, everyday state of affairs. Education has served as a model for dealing with political, economic, and social problems for centuries, and now it is failing as we see that much of what passed for education was actually indoctrination into various ideologies. Because testing is purported to separate those who know something from those who do not, it has come under harsh criticism for failing to perform this purpose fairly and unambiguously (Crouse & Trusheim, 1988; Gould, 1981; Owen, 1985; Strenio, 1981; Sutherland, 1984). The large and significant literature on the shortcomings of quantitative methods in social science that has erupted (Bakan, 1966; Carver, 1978; Coats, 1970; Falk, 1986; Krenz & Sax, 1986; Michell, 1986; to name just a few), and the horrors of educational measurement alluded to by Lumsden (1976), are part and parcel of the crisis of rationality.

Shifting Paradigms

Second, alternative paradigms should crystallize from the crisis situation; alternative methods and theoretical approaches coalesce into a new paradigm when their language becomes incommensurable with that of the traditional paradigm. Dissatisfaction with the very idea that human abilities and attitudes can be quantified has reached such a pitch that qualitative approaches are widely considered to be at the forefront of methodological innovation in the social sciences at large. The force of this movement comes from the realization that meaning is more important to social inquiry than facts are.
Andrich (1988), Michell (1990), and Wright (1977b) agree with Kuhn (1961) when they emphasize how important qualitative research is in the development of quantitative measures. What I shall call the quantitative paradigm refers to the uncritical acceptance of numbers as valid representatives of qualitative structures. In the same way that Pythagoreans worshipped number, mistaking numerical relations for existence itself, blind submission to the "quantitative imperative" (Michell, 1990) takes place in educational measurement whenever the content of the questions asked is the sole arbiter of validity. This is the same thing as ignoring the first fundamental problem of measurement, the justification of the measured and measuring (Suppes & Zinnes, 1963, p. 4).

The possibilities for different languages appear because, as Cherryholmes (1988) points out, the focus on construct validity in qualitative research offers a stark contrast with the lack of concern for it in the quantitative paradigm, despite Loevinger's (1957) and Messick's (1975) stress on it as the "whole of validity." The quantitative paradigm contends that, "above all else, a criterion-referenced test must have content validity" (Hambleton & Novick, 1973, p. 168). Whereas the qualitative paradigm takes an experimental perspective, allowing the imagination to play upon itself in the service of dialogical objectivity (Heelan, 1988; Ihde, 1991; Ormiston & Sassower, 1989), the quantitative paradigm insists only that its dictates be followed to the letter. For instance, Divgi (1986, p. 283) says: "Issues like 'objectivity' and consistent estimation are shown to be unimportant in selection of a latent trait model." Whitely (1977, p. 233) concurs, saying that "data on the internal structure of a test may not be substituted for other kinds of validity data." These statements replace construct validity with content validity and are completely opposed to Messick's (1975, p. 960) assertion that validity bears directly on empirical consistency. More echoes of Lindquist's appeal to the authorities on high, the sacrosanct nature of test items, and the prohibition against monkeying around with item content resound when Messick (1975, p. 959) quotes Osburn (1968, p. 101), who says that

what the test is measuring is operationally defined by the universe of content as embodied in the item generating rules. No recourse to response-inferred concepts such as construct validity, predictive validity, underlying factor structure or latent variables is necessary to answer this vital question.

Cherryholmes (1988, pp. 452-453) observes that this sort of ultraoperationalism had been rejected even by the logical positivists more than 30 years before Osburn wrote, because they saw that conceptual significance is never generated by strictly following rules. Cronbach and Meehl (1955) accordingly rejected operationalist definitions of constructs in their study of construct validity. Willmott and Fowles (1974) give concise expression to the different premises of the qualitative and quantitative paradigms, respectively, when they say that "The criterion is that items should fit the model, and not that the model should fit the items." Michell (1990, p. 8) phrases the qualitative theme in similar terms, saying that "The only way to decide whether or not the variables studied in any particular science are quantitative is to put that hypothesis to the test. This essential step is missing in the development of modern psychology."
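What it means for items to "fit the model" can be made concrete with a small numerical sketch. The code below is an illustration only, not a procedure given in this chapter. It assumes the dichotomous Rasch model, treats person abilities and the item difficulty as already known (in logits), and uses an unweighted mean of squared standardized residuals as the fit index. The function names and the toy data are hypothetical. Under these assumptions, values near 1 are what the model expects, and much larger values flag an item whose responses run against the ordering of persons.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model (logit units)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_mean_square(responses, abilities, difficulty):
    """Unweighted mean of squared standardized residuals for one item.

    responses: 0/1 scores, one per person; abilities: matching person measures.
    """
    total = 0.0
    for score, ability in zip(responses, abilities):
        expected = rasch_probability(ability, difficulty)
        variance = expected * (1.0 - expected)
        total += (score - expected) ** 2 / variance
    return total / len(responses)


# Hypothetical toy data: five persons ordered from least to most able.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]

# An item on which the more able persons succeed (consistent with the ordering).
consistent = item_mean_square([0, 0, 1, 1, 1], abilities, difficulty=0.0)

# An item on which only the less able persons succeed (against the ordering).
reversed_order = item_mean_square([1, 1, 0, 0, 0], abilities, difficulty=0.0)

print(f"consistent item: {consistent:.2f}")
print(f"reversed item:   {reversed_order:.2f}")
```

With these toy values the first item falls below 1 and the second well above it. An item of the second kind is one whose responses are not in general agreement with the ordering of persons, and it would be examined further (perhaps analyzed with a different group of items) rather than accepted on the grounds of its content alone.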

Just as Plato and Galileo stressed the conceptual ideality of measurement constructs in opposition to the Pythagorean and Aristotelian confusion of number and existence, Rasch's qualitative approach to measurement conceives of ability and difficulty idealistically, as if neither depended upon the particulars of the other. Just as Plato's geometrical fictions and Galileo's physical fictions served as heuristic models for the mathematical sciences of their ages, so will Rasch's socio-psycho-educational fictions serve as heuristic models for the coming age. Therefore, as Fischer (1987, p. 585) puts it, rather than rejecting Rasch's models as being too narrow, as Goldman and Raju (1986, p. 19), Goldstein (1983, p. 373; Goldstein & Blinkhorn, 1977, pp. 310-11), Hambleton and Rogers (1989, p. 148), and Whitely (1977, pp. 229, 232-233) explicitly do, one should instead change the data by altering the experimental design or the mode of observation. After all, it is "difficult to say in what sense measurement is achieved if that property [of parameter separability characteristic of data fitting a Rasch model] is violated" (Duncan, 1984a, p. 224; also see 1984c, pp. 398-399).

These alternative perspectives are paradigmatically distinct insofar as each has radically different presuppositions about what counts as a legitimate question, and how one goes about determining whether a question is legitimate. The two paradigms also trace separate historical traditions, which contributes to the way their proponents tend to speak at cross purposes. The quantitative paradigm in education owes a great deal to logical positivism (Cherryholmes, 1988) and the operationalism of Bridgman (1927) and its applications to measurement by Stevens (1946) (Michell, 1990, pp. 15-20).
The qualitative paradigm, on the other hand, largely follows from the phenomenology of Husserl (1970, originally published in German in 1936), the existential hermeneutics of Heidegger (1962, 1967; originally published in 1927 and 1935, respectively), Freudian psychology, Marxism, and ethnography. Contrary to the impression one might receive from most current works identifiable as qualitatively oriented, philosophical writers such as Husserl, Heidegger, Gadamer, Ricoeur, and Levi-Strauss explicitly related their interests to the understanding of mathematics, technology, and objectivity. Heelan and Ihde are among the very few contemporary writers who have realized and acted upon the relation of phenomenology to science, though Michell (1990, p. 8) recognizes Brentano, the teacher of Husserl and Freud, as an early leader in the qualitative paradigm, and Wheeler and Zurek (1983) mention the relevance of Husserl to the measurement problems of contemporary physics. In an article on construct validity, Whitely (Embretson (Whitely),

1983) has moved somewhat closer to a qualitatively informed theory of constructs than was evidenced in her earlier publications. But even when she qualifies her emphasis on item content and the nomothetic network in favor of empirical consistency and construct representation, Whitely continues to construe Rasch item and person parameters as representations of theoretical constructs (Embretson (Whitely), 1983, p. 186). Where Cherryholmes (1988) places construct validation in the realm of poststructuralist discourse analysis, Whitely (Embretson (Whitely), 1983, p. 179) traces a change from functionalism to structuralism, which means that her focus has shifted only one step away from the operational definition of the construct and is now concerned with combining the operationalism with an overly mechanical sense of the meaning of the item calibrations and person measures. In this context, Whitely points out that unidimensional measurement models do not provide a suitable basis for comparing alternative construct theories because tests of unidimensionality are "useful only for those theories that postulate a single construct," and even for these, the isolation of a "single dimension could be due to the completely confounded influence of several constructs" (Embretson (Whitely), 1983, p. 186). But why should it be reasonable to expect a general measurement model to serve as a means of representing constructs in the first place? Why should tests of unidimensionality be so crucial to the comparison of alternative construct theories? Whitely's (Embretson (Whitely), 1983, p. 195) reference to Bechtoldt's (1959) sense of construct operationalization as "a major focus of the proposed approach to construct validation research" provides an important clue to how she would answer these questions, as Messick (1981, p. 
578) indicates:

Bechtoldt's (1959) argument identifies not just the meaning of the test score but the meaning of the construct with the scoring operations, thereby confusing the test with the construct and the measurement model with the substantive theory.

In confusing the test with the construct and the measurement model with substantive theory, Bechtoldt and Whitely reiterate what Gadamer (1980, p. 35) calls the Pythagorean confusion of number and numerical relationships with existence itself. Others more appropriately stress that "nothing in the fit between response model and observation contributes to an understanding of what the regularity means. In this sense, the response model is atheoretical" (Stenner, Smith, and Burdick, 1983, p. 308). The only reason why Whitely might expect the response model to be theoretical is that her structuralist sense of construct representation demands it. Even when partial credit (Masters, 1982) or facets (Linacre, 1991) models are used to structure the theory informing a test's content, and tests of unidimensionality show themselves to be useful in relation to theories that postulate more than one construct, the theory of measurement implemented by the models cannot offer anything in the way of a substantive theory of the construct. Once responses have been determined to point along one direction of more and less useful for purposes of comparison, then questions of construct validity (Are persons expected to be more able scoring higher? Are items expected to be more difficult missed more often?) can be raised (Wright & Masters, 1982, p. 93).

Empirical vs. Theoretical Support

As a third sign of revolution, the traditional paradigm should have the advantage of more data supporting its position, and the disadvantage of fewer theoretical resources at its disposal to explain anomalous data, in relation to the alternative paradigm. In the present instance, adherents of the quantitative paradigm should assert that (a) their theories and models fit commonly found data better than the theories and models of the qualitative paradigm, and (b) their own theories and models are nonetheless extremely complicated, difficult to use, time consuming, inefficient, problematic, and expensive, whereas those of the qualitative paradigm are simple, easy to use, efficient, readily available, and inexpensive. The first half of this hypothesis is supported by Whitely's (1977, p. 229) comment that "the several studies which apply a reasonably stringent test of fit are notable for the frequency with which the [Rasch] model is found to be inappropriate." She even goes so far as to say, in the face of the crisis noted above, that "classical testing procedures have served test development admirably for several decades" (Whitely, 1977, p. 234).
Goldman and Raju (1986, p. 19) say that since the findings of their "study suggest that the two-parameter model fits the attitude survey [of interest] better than the Rasch model, future applications might emphasize the two-parameter model." Hambleton and Rogers (1989, p. 148) are direct, saying that "the one-parameter model has rarely provided a satisfactory fit to the test data; the three-parameter nearly always has."

In contrast to the value the quantitative paradigm places on control of item content, the qualitative paradigm values the theoretical and practical advantages of fundamental measurement principles. Kuhn (1961) says that the role of imagination and qualitative considerations in measurement is far greater than is usually supposed; commitment to these considerations means that some time usually has to pass before early advocates of new theories have managed to put together data supporting their hunches. Data fitting Rasch's implementations of measurement theory are sufficiently commonplace for published listings of widely-used Rasch-based item banks (Choppin, 1968, 1976, 1978; Wright & Bell, 1984) to be several years old.

The two- and three-parameter models' capacity to better describe extant data has a flip side to it; the structure of that data cannot be easily explained and cannot be related to principles of measurement in any useful way. As might be expected from item response models whose estimation algorithms contradict their own assumptions of unidimensionality, the most commonly used computer program for implementing the two- and three-parameter IRT models, LOGIST (Wingersky, Barton, & Lord, 1982), has been shown by Stocking (1989, p. 42) to be rife with "large (and sometimes unacceptable) biases" in the estimation of the parameters. Stocking took up the study of LOGIST-based applications of IRT in order "to explore and understand some apparently anomalous results . . . that have been obtained from time to time over the past several years" not only in real data, but also in data simulated to fit the three-parameter model. After remarking, in a manner reminiscent of many of her colleagues (documented in Wright, 1984), on the expense and difficulty of using LOGIST, Stocking (1989, pp. 44-45) concludes that

LOGIST . . . needs improvement. Most applications cannot afford to run the program to complete convergence. It may be possible to improve results of the four-step structure by obtaining better starting values for the parameters.
Alternatively, controlling the behavior of estimates of discrimination and guessing parameters through the imposition of prior distributions on them may be cost effective and provide reasonable results. The four-step procedure (Stocking, 1989, p. 21) referred to is one in which abilities and difficulties are estimated first, holding the discrimination and guessing parameters constant; then, the abilities are fixed and the three item parameters are estimated. Steps three and four repeat the first two steps. This structure was imposed on the estimation procedure in an effort aimed at overcoming the tendency of parameter estimates to diverge without limit (Stocking, 1989, pp. 2 5 26). Lord noted quite some time ago that "the [three-parameter] method


usually does not converge properly" (Lord, 1968, p. 1015) and that "experience has shown that if . . . restraints are not imposed, the estimated value of [discrimination] is likely to increase without limit" (Lord, 1975, p. 14). These problems are precisely what caused Wright to reject the multiparameter approaches in the mid-1960s, when he and Bruce Choppin wrote such programs against Rasch's advice (Wright, 1988a, p. 3). LOGIST's four-step procedure is intended to arrest the divergence of the parameters to infinity; this procedure uses the Rasch model, in effect, every other iteration through the data (on the first and third steps of the four-step procedure) in order to provide "reasonable estimates for item parameters and abilities in a feasible amount of time" (Stocking, 1989, p. 21). Stocking (1989, p. 45) makes the same recommendations concerning another program, BILOG (Mislevy & Bock, 1983):

BILOG, being a more recent computer program available for general use, has not been subjected to the same wide variety of applications as LOGIST. As such, it does not contain the necessary restrictions to prevent the numerical procedures from diverging from reasonable, although perhaps less than optimal starting values. It seems clear that such additional restrictions are necessary.

"Better starting values for the parameters," and "imposing prior distributions on them" are "necessary restrictions" that the two most widely used IRT computer programs must incorporate just to provide "reasonable estimates . . . in a feasible amount of time." Wright (1988a, p. 3) realized the same thing about his own two-parameter program in 1964, saying that it would not "converge unless I introduced some inevitably arbitrary constraint. The choice of the constraint would always alter the results. . . . Since I couldn't make the two-parameter program work, I discarded it." Hambleton and Rogers (1989, p.
158) comment on the unavailability, unfriendliness, cryptic and unwanted output, and bugs of IRT computer programs, in addition to the excessive time and prohibitive sample sizes required for their application. In contrast, Hambleton and Cook (1977, p. 88) write that "the problem of ability and item parameter estimation with the Rasch model is quite different. In fact, the estimation problem is essentially resolved." Hambleton and Cook's (1977, p. 76) comment that the only "fast and convenient-to-use computer programs for estimating the parameters [are those available] for the Rasch model" continues to be relevant. Wright (1984) documents more words of praise from those who have identified themselves with the quantitative paradigm's stress on content validity for the efficiency and effectiveness of Rasch's approach to measurement. Because the two- and three-parameter models often do not work at all with small sample sizes, Lord (1983) has said that small sample sizes justify the use of the Rasch model. Rasch measurement would then be the best route to take for the great majority of tests, since most are administered in classrooms with fewer than fifty students.

Validity by Default or Design?

It appears that the most important aspect of validity in American educational measurement is the capacity to tell what Rorty (1985) calls stories of objectivity, in the sense that objectivity is the one-sided imposition of authority. Most educational measurement experts are willing to allow issues of construct validity to be decided by default, and "if researcher-theorists default on construct validity, then they consciously or unconsciously adopt inherited discourses and meanings previously assigned to constructs and measurements" (Cherryholmes, 1988, p. 428; also see Gould, 1981). As Burtt (1954, p. 225) phrased it,

What kind of metaphysics are you likely to cherish when you sturdily suppose yourself to be free of the abomination? Of course . . . in this case your metaphysics will be held uncritically because it is unconscious; moreover, it will be passed on to others far more readily than your other notions inasmuch as it will be propagated by insinuation rather than by direct argument.

The positivist denial of metaphysics is also assumed any time someone purports to be able to count on test items to provide valid and reliable measures when no value is placed on checking whether it is reasonable to add up counts of right answers and assign scores. However, just because experts have decided that items on a test all belong to the same content domain does not mean that they belong to the same construct.
Viewed in this larger context, what Jaeger (1987) called the Rasch debate begins to look more like the validity debate. An exclusive focus on content validity in educational measurement serves ideological, bureaucratic, and administrative needs far more than scientific or human ones. Some writers suggest that educational measurement addresses the social, economic and political agenda of elite decision makers more than it does the interests of equal opportunity and justice (Crouse & Trusheim, 1988; Owen, 1985; Sutherland, 1984; Strenio,


1981); it will continue to do so until more attention is paid to the discourse processes and metaphysics of testing. Cherryholmes (1988, p. 421) suggests that some attention to these issues began, and "social research methodology entered adolescence, if not maturity, in July 1955 . . . with the publication of Cronbach and Meehl's 'Construct Validity in Psychological Tests.'" The problem is that "the adolescence has been arrested" (Cherryholmes, 1988, p. 450). If so, the potential for its further development grew with the publication of Rasch's (1960) research on measurement, as has been suggested by Duncan (1984b, pp. 216-218; c, pp. 398-400). That potential will hardly begin to be realized until educators overcome their fixation on content validity, however.

IMPLICATIONS FOR PRACTICE

The Things Themselves and Keeping the Scientific Theme Secure

Sensitivity to the role of culture in the framing of questions has led to a new emphasis on a qualitative, ethnographic style of research in education. Though this development has been productive in promoting a more dialectical critique of the question-and-answer process, few suggestions for improvements in quantitative thinking have been forthcoming; quantitative methods have been either relegated to the positivist trash heap of history by qualitative purists, or accepted as unavoidably positivist, at least in part, by most of those who still continue to use and think about them. Even those who recognize the philosophical problems attending quantitative methods and incorporate a critical dialectic into their application, such as Cook and Campbell (1979, pp. 91-94), still take only roundabout routes to show that their data focus on a common question and point in the direction from which the responses arrive. A more direct approach is to specify in advance what will count as an observation, on the basis of informal observations, imaginative hunches, or previous research; focus questions on the continuum along which the variable will likely be manifest; and examine the questions for conformity to measurement principles after they have been exposed to treatment by a relevant group of persons (Rasch, 1960; Wright, 1968, 1977b). Where education's traditional concern with content validity moves straight from the unarticulated theoretical construct to observation to assertions concerning what is observed (Cherryholmes, 1988, p. 448) in a monological and one-sided fashion, Rasch


and Wright insist on the importance of completing several spirals through the hermeneutic circle, returning to check and possibly alter observations and theoretical constructs before making assertions about what has been observed or what can be expected in the way of future observations. Cherryholmes (1988, p. 448; also see Fisher, 1990) says that "quantitative and qualitative approaches are combined when the meaning of these bidirectional arrows [moving from construct to observation to phenomenon and back again] is clarified and negotiated." What Cherryholmes (1988, p. 448) refers to as the "'covariation' or shared meaning but not identity" connoted by these arrows has also been called a "mutually critical correlation" (Tracy, 1975) and a "method of converging indices" (Ricoeur, 1981, pp. 212-213) tracing a dialectical spiral that delineates the "arrow of meaning" followed in pursuit of a line of questioning (Ricoeur, 1981, p. 193). The same mutual relation of construct to phenomenon that is mediated by the structure of language embodied in questions holds when data meet the requirements of measurement as these are modeled by Rasch. Focusing the research question by attending to the ways in which it is posed by the test or survey questions extends and refines the question and answer process by which meaning is created in conversation, or by which meaning emerges from the reading of a text. Rasch measurement advances the qualitative critique of quantification and facilitates the investigation of construct validity in distinctively phenomenological and hermeneutic ways. Cherryholmes (1988, p. 432) says that in

Phenomenological and interpretative research . . . authority derives from subjects and blurs distinctions between subjects and objects. . . . Phenomenologically based research produces "truths" different from quantitative, statistically sophisticated research because the locus of power that makes "truth" possible shifts from researchers as subjects to respondents as subjects.

Designing research with the intention of obtaining fit to a Rasch model is a way of heeding Husserl's call to return to the things themselves. Cherryholmes (1988, p. 430) describes the phenomenological epoche in a strict Husserlian sense as a bracketing of the researcher's prior beliefs and attitudes that results in a proscription against imposing their own categories of observation on the objects of study who have become subjects. This transcendental idealism of Husserl has been critically devaluated in the work of his students Heidegger and Gadamer such that


phenomenology is retained as the method of philosophy, but the epoche becomes a bracketing of the particulars through which things make themselves known. The epoche is still performed in order to gain access to the pure thought of the things themselves, but the researcher goes with the flow of, and organizes in an orderly fashion, the past beliefs, opinions, and frames of reference that Husserl (and Cherryholmes) proposed to be simply dropped. Research questions themselves constitute frames of reference and embody attitudes, so it is more realistic to attempt a fusion of the horizons of the research questions with the horizons of the questions the research subjects find pertinent (Gadamer, 1989) than it is to try to purify the questions of background assumptions and presume that the subjects have thereby been free to disclose their understanding of the world. Heidegger (1962, p. 195) said that attention to this hermeneutic circularity is our "first, last, and constant task" in "making the scientific theme secure." Because Rasch (1960, p. 110) estimated person and item parameters "one by means of the other . . . without getting into any logical circle," he was able to fix attention on the Heideggerian task. In opposition to what could be expected from Lindquist (1953), Rasch and Wright would agree with Heidegger that "science [is] genuine only if it succeeds in taking the measure from things, instead of imposing measure upon them" (Zimmerman, 1990, p. 228). Husserl and Heidegger's influences on the writers discussed by Cherryholmes, such as Derrida, Foucault, Habermas, Ihde, Rorty, and Schutz, bring Rasch into direct contact with the issues of construct validity raised in the discursive context. More specifically, to be sufficiently composed and prepared to pose real questions is to perform the phenomenological epoche such that the thing itself is brought into view.
The researcher has some evidence that the thing itself is in view when the observations delineating its structure do not inordinately vary depending upon the particular questions asked or the particular persons responding. For the bracketing, and separation, of the particulars to occur, they must converge upon a common line of thought; this belonging together is characteristic of Husserl's method of profile variation, Ricoeur's method of converging indices, and is referred to by Brenneman, Yarian, and Olson (1982) as the paradox of unity and separation. Things think themselves and method is an activity of the things themselves (Gadamer, 1989) when person parameters are estimated free of concern for the particular questions asked, item parameters are estimated free of concern for the particular persons responding, and fit to the model is checked free of concern for either parameter (Rasch, 1960, pp. 122, 178; 1961, p. 325). Whether this separability theorem,


and the specific objectivity attained when the theorem is satisfied, are practical for any particular field of research is a matter for empirical study. It must be asserted, however, that to attain specific objectivity is to make the scientific theme secure. Rasch's incorporation of basic phenomenological and hermeneutic themes into his mathematics has been ignored, leading some to relegate his work to the positivist trash heap. For instance, Cronbach (1982, p. 70) considered Rasch (1961) to hold that "one-parameter scaling can discover coherent variables independent of culture and population." On the contrary, Wright himself could have written what Cronbach says on the next page, that

the sooner all social scientists are aware that data never speak for themselves, that without a carefully framed statement of boundary conditions generalizations are misleading or trivially vague, and that forecasts depend on substantive conjectures, the sooner will social science be consistently a source of enlightenment.

With regard to Cronbach's statement that "data never speak for themselves," Wright and Masters (1982, p. 9) say that

To be able to do arithmetic we need to be able to count, and to count we need units. But there are no natural units. There are only the arbitrary units we construct and decide to use for our counting.

Cronbach expresses concern for "a carefully framed statement of boundary conditions," without which "generalizations will be misleading or trivially vague"; Wright and Masters (1982, p. 5) say

For scientific ideas to be useful, they must apply over some range of time and place; that is, over some frame of reference. The way we think things are must seem to stay the same within some useful context.

What is a Rasch model if it is not "a carefully framed statement of boundary conditions"? To require that test results be dominated only by abilities and difficulties is to make a substantive conjecture, as is evident in the quote from Wright (1977b, p.
97) used by Jaeger (1987) to characterize the debate. Cronbach's thoughtless dismissal of Rasch raises the point that the qualitative criticism of quantitative methods must be complemented by criticism of qualitative approaches that emphasize only the movement from the phenomenon to observation to construct, which makes them just as incomplete as the quantitative approaches that follow only the movement from construct to observation to phenomenon. Neither approach alone successfully addresses


the problem of method in social research, and to simply juxtapose them does not accomplish anything of substance, either. A more fully complementary relation between the two paradigms is required, one in which each incorporates what is most important about the other into its own movement, acknowledging in practice that "the social roots of social measurement are in the social process itself" and that "quantification is implicit . . . in the social process itself before any social scientist intrudes" (Duncan, 1984b, pp. 221, 36). The goals of the qualitative paradigm are not to abandon or bury quantification, but to explicate what Coombs (1967, pp. 4-5) called the "interpretive step . . . required to convert the recorded observations into data." When this interpretive step and its implications are included in research, the phenomenologically rich sense of method as the playful activity of the thing itself takes hold (Gadamer, 1989). To apply Rasch's models is to incorporate the interpretive step into scaling procedures, making interpretation of the construct unavoidable in calibrating instruments and making measurements, which is part of the reason Rasch has provoked debate. How does the interpretive step fit into the process of instrument calibration and person measurement? It is actually not just a single step, but is repeated several times. Even the invention of the questions to be asked involves an interpretation of the relevant content domain; decisions as to item appropriateness may be guided by criteria of content validity at this point, but they should also be guided by a theory of the variable: What will count as an observation of more or less of the ability or attitude of interest? The activity of the phenomenon measured moves first in the direction shared by the questions on a test toward the responses they provoke; the responses in turn raise new questions which either extend or otherwise alter the direction initially followed.
The back-and-forth motion continues in a manner that connects with what is most fundamental to method (from the ancient Greek meta-hodos), the way in which clear thinking follows after and i meaning or train of thought it cuts within a particular cultural and historical frame of reference. This is not to say that Rasch measurement models embody the essence of method, or that they even are methods, because they are not. The methods by which meaning is created vary substantially both among and within areas of interest. The point is only that obsession with content validity cuts off the flow of method prematurely; a shift in focus toward construct validity would contribute to the phenomenological and methodological soundness of educational research.


Interpreting Empirical Consistency

The recent surge of interest in fit analysis, differential item functioning, and the Mantel-Haenszel (MH) procedure is a move in the direction of a strong emphasis on construct validity in educational research, but presumes an approach to measurement often lacking in the methods creating the data to which it is applied. In the application of the MH procedure,

If one is not prepared to accept the validity of the Rasch model for the item under examination, the implicit assumptions of the MH procedure will not be satisfied either. If one is prepared to accept the Rasch assumptions, however, the Rasch model yields simpler and better statistics. (Linacre & Wright, 1987, p. 16; 1989, p. 3; also see Zwick, 1990)

Thus, the application of the MH procedure to data that fit the three-parameter IRT model but not the Rasch model adds yet another level of self-contradiction and complication to educational measurement. The residual differences between modeled and observed responses calculated by both the Rasch and the MH procedures implement the rigorous sense of unidimensionality contradicted by the two- and three-parameter estimation algorithms. This situation raises some hard questions. What is the point of obtaining complex and obscure statistics from the MH procedure when a model that almost always fits data is being used to provide ability and difficulty estimates? Why not use the same requirements used to calculate fit to estimate scale positions, and arrive at simpler statistics in less time and with less trouble? The sort of structure required of data for fit to a Rasch measurement model, and presumed in the application of the MH procedure, is displayed in Table 3-1.
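The reciprocal ordering that Table 3-1 displays can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a procedure given in the chapter: the response matrix is transcribed from Table 3-1, and the function names (`person_scores`, `item_scores`, `reciprocal_order`) are hypothetical. The sketch counts marginal scores and then sorts items from easiest to hardest and persons from lowest to highest score.

```python
# Responses transcribed from Table 3-1 (1 = correct or agree, 0 = incorrect or disagree).
# Rows are persons; columns are items 1-10 in the table's printed order.
TABLE_3_1 = {
    "Luc":    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "John":   [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    "Louise": [1, 1, 0, 1, 0, 0, 0, 0, 0, 0],
    "Martha": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
    "Jimi":   [1, 1, 1, 1, 0, 0, 1, 0, 0, 0],
    "Diane":  [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "Nathan": [1, 1, 1, 1, 1, 1, 0, 1, 0, 0],
    "Jon":    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "Laura":  [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
    "Alissa": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
}


def person_scores(responses):
    """Count of correct answers for each person (hypothetical helper)."""
    return {name: sum(row) for name, row in responses.items()}


def item_scores(responses):
    """Count of persons answering each item correctly, in column order."""
    return [sum(column) for column in zip(*responses.values())]


def reciprocal_order(responses):
    """Return persons sorted low-to-high score and item indices sorted easy-to-hard.

    Python's sort is stable, so tied persons or items keep their input order.
    """
    scores = person_scores(responses)
    persons = sorted(responses, key=lambda name: scores[name])
    totals = item_scores(responses)
    items = sorted(range(len(totals)), key=lambda j: -totals[j])
    return persons, items


if __name__ == "__main__":
    persons, items = reciprocal_order(TABLE_3_1)
    totals = item_scores(TABLE_3_1)
    print("Item scores:", [totals[j] for j in items])
    for name in persons:
        row = TABLE_3_1[name]
        print(f"{name:<7}", " ".join(str(row[j]) for j in items), sum(row))
```

Sorting on raw counts only arranges the data; whether the arranged pattern is close enough to the model to support measurement remains a separate question of fit.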
In fact, it is only reasonable to count up marks of correct and incorrect (or marks of correct, partly correct, and incorrect; see Wright & Masters, 1982, and Masters, 1982, for more on partial credit scoring), and use the counts as a basis for making inferences about person ability or item difficulty, when data can be organized into a pattern roughly similar to the one shown in Table 3-1. The items are ordered from more to less difficulty according to the number of persons responding correctly to each; the persons are ordered from more to less ability according to the number of items to which each has correctly responded. The resulting pattern required for measurement is one in which a person may occasionally score a correct answer after missing an item or two, but there is a general harmony to the continuum of more and less shared by the persons and items.

Table 3-1
Sample Data that Display the Reciprocal Order Needed for Convergence and Fit to an Additive Conjoint Measurement Model

           Items, Easy or Agreeable to Hard or Disagreeable
Persons       1   2   3   4   5   6   7   8   9  10   Person Scores
Luc           0   1   0   0   0   0   0   0   0   0   1
John          1   0   1   0   0   0   0   0   0   0   2
Louise        1   1   0   1   0   0   0   0   0   0   3
Martha        1   1   1   0   1   0   0   0   0   0   4
Jimi          1   1   1   1   0   0   1   0   0   0   5
Diane         1   1   1   1   1   1   0   0   0   0   6
Nathan        1   1   1   1   1   1   0   1   0   0   7
Jon           1   1   1   1   1   1   1   1   0   0   8
Laura         1   1   1   1   1   1   1   0   1   1   9
Alissa        1   1   1   1   1   1   1   1   1   0   9
Item Score    9   9   8   7   6   5   4   3   2   1

In contrast, Table 3-2 displays data that contradict the basic requirement of unidimensionality, and so threaten the construct validity of the calibrations and measures. Imagine that the data in Table 3-2 are embedded in a large matrix of data organized like that shown in Table 3-1, in which a general order of more and less of something remains relatively and probabilistically constant across items and persons.

Table 3-2
Sample Data on the Variation of Meaning in a Score

           Items, Easy or Agreeable to Hard or Disagreeable
Persons       1   2   3   4   5   6   7   8   9  10   Person Scores
Joe           0   0   0   0   0   1   1   1   1   1   5
Mary          1   1   1   1   1   0   0   0   0   0   5
Lucy          1   1   1   1   0   1   0   0   0   0   5
Bob           1   0   1   0   1   0   1   0   1   0   5
Anne          1   1   1   1   0   0   0   0   0   1   5
Larry         1   1   1   1   0   0   0   0   0   1   5
Igor          0   1   1   1   1   0   1   0   0   0   5

Every person in Table 3-2 has the same count of correct answers, but is it possible to assume that the counts mean the same thing? Is not that assumption made, however, every time a teacher or a tester computes the percentage of the total number of items to which a student responded correctly? In contrast to Divgi (1986, p. 283), Messick's (1975, p. 960) answer to this question is an unequivocal yes:

Inferences in educational and psychological measurement are made from scores, and scores are a function of subject responses. Any concept of validity of measurement must include reference to empirical consistency. Content coverage is an important consideration in test construction and interpretation, to be sure, but in itself does not provide validity.

After all, is it not possible that some students will respond to ostensibly easy questions incorrectly, and ostensibly hard ones correctly, independent of the fact that all of the items have been judged to belong to the same content domain? Is it not important to detect when this sort of thing happens on a large scale, as has been the case with Anne, Igor, Larry, and especially Joe, in Table 3-2? And what about Bob, who was correct on every other item when they are ordered by difficulty? Is he making some kind of joke? The probability of Igor missing the easiest item must be very small, so was this the result of simple carelessness or is something more important going on? Anne and Larry both got the very hardest item correct after missing five in a row. Is this simply a sign of some special knowledge they each have, did they collaborate on the answer, did one copy from the other, or were these independently made lucky guesses? Answers to these questions can be gained by asking the students new questions of the same difficulty as those on which their responses are surprising. If the items in Table 3-2 are in entry, as well as measure, order, it might be beneficial to ask if Mary ran out of time as she labored with each question before she moved on to the next. Did Joe skip all of the easy questions out of boredom? Did Bob make random marks on the answer sheet, or answer true/false or multiple choice questions all in the same category? If so, why?
Will Larry and Anne answer another item of question 10's difficulty correctly, or were their responses produced by collaboration, cheating, guessing, or special knowledge? Would Igor have missed the first question if he had not been in a hurry to get started, or if he had not had difficulty figuring out the test's purpose? The other side of validating a construct involves another, reciprocally structured, set of questions simultaneously raised about the test items. Is there a very easy item that groups of high-ability persons consistently miss? Is there a very hard item that groups of low-ability persons answer correctly? For instance, word problems in a mathematics test may become inordinately difficult for students who are unable to read the language in which the problems are written. If word problems are irrevocably deemed a valid part of the mathematics content domain, and the test analyst has no business monkeying around with the sacrosanct items handed down by the authorities, as Lindquist (1953) maintains, then discrimination and prejudice are built into the test and any decisions that follow from them. If, on the other hand, we are flexible enough to not regard content decisions as fixed, then the differential meaning of the items can be accounted for in the interpretation that transforms observations into measures. These examples are intended to show that there are many kinds of disturbance that interfere with the effort to measure, that each is as likely to occur as guessing is, and that each will present just as much potential for disruption. Are we then to model additional parameters for plodding, sleeping, and fumbling, as they are called by Wright and Stone (1979, pp. 170-190), in such a way that they will move us even further from Rasch's access to sufficient statistics? Hardly; two basic reasons for the movement toward qualitative methods in educational research are that usual applications of quantitative method traditionally strive to anticipate, close off, trap, or nail down anomalies, and to focus on operations and content instead of meaning and constructs. It is more sensible, though, to go with the flow of the multifaceted, conversational, and metaphorical logic by which things actually play themselves out, than it is to force a one-sided logic and rationality on what people do. Well put questions inevitably open up more questions than they answer, and to cut off questioning is to kill the potential for learning. Disruptions in the measurement process are inevitable, but it is far more productive to locate and interpret them after they occur than to try to include them as elements in a model of an already very complicated situation.
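One way to locate such disruptions after they occur can be sketched as follows. The sketch counts Guttman errors for each person in Table 3-2, that is, the number of item pairs in which an easier item is missed while a harder one is answered correctly. This count is a simple stand-in chosen for illustration; it is not the residual-based fit statistic that Rasch programs report, and the name `guttman_errors` is hypothetical. The responses are transcribed from Table 3-2, and the items are assumed to run from easiest to hardest, as the table's heading indicates.

```python
# Responses transcribed from Table 3-2; items run from easiest (left) to hardest (right).
TABLE_3_2 = {
    "Joe":   [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Mary":  [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "Lucy":  [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],
    "Bob":   [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "Anne":  [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],
    "Larry": [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],
    "Igor":  [0, 1, 1, 1, 1, 0, 1, 0, 0, 0],
}


def guttman_errors(row):
    """Count pairs (easier item missed, harder item correct) in one response string."""
    errors = 0
    misses_so_far = 0
    for response in row:
        if response == 0:
            misses_so_far += 1
        else:
            # Each earlier (easier) miss forms one unexpected pair with this success.
            errors += misses_so_far
    return errors


if __name__ == "__main__":
    for name, row in TABLE_3_2.items():
        print(f"{name:<6} score={sum(row)} guttman_errors={guttman_errors(row)}")
```

All seven persons have a score of 5, yet the counts range from 0 for Mary to 25 for Joe, with Bob at 10, Igor at 6, Anne and Larry at 5, and Lucy at 1. The count only flags a response string as surprising; interpreting it still depends on the kind of follow-up questioning described above.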
Patterns of anomalous response commonly found in educational test data are discussed in Wright and Stone (1979, pp. 170-190). Quantitative methods for flagging unexpected patterns of response associated with persons and items are standard equipment in programmatic applications of the Rasch models, such as BIGSTEPS (Wright & Linacre, 1991) and FACETS (Linacre, 1991). The statistics indicative of empirical inconsistency have been shown useful in investigating construct validity (Maier & Philipp, 1986; Wright & Masters, 1982, pp. 90-117). More complex multiple regression procedures using the conceptual structure of item characteristics to predict Rasch item difficulties have been presented by Stenner and Smith (1982) and Stenner, Smith, and Burdick (1983) in the context of exploring construct validity. The interpretive study of ordered data matrices shows that scores are meaningful only within the context of a frame of reference, and that Rasch's requirement of shared order across persons and items is in


fact assumed whenever raw scores are used as a basis for comparison, Goldstein's (1979, p. 219) claims to the contrary notwithstanding. Andersen (1977, p. 72) says that

If there exists a minimal sufficient statistic for the individual parameter θ which is independent of the item parameters, then the raw score is the minimal sufficient statistic and the model is the Rasch model.

In Wright's (1977b, p. 114; also see 1985, pp. 106-107) terms,

Unweighted scores are appropriate for person measurement if and only if what happens when a person responds to an item can be usefully approximated by the Rasch model. . . . Ironically, for anyone who claims skepticism about "the assumptions" of the Rasch model, those who use unweighted scores are, however unwittingly, counting on the Rasch model to see them through. Whether this is useful in practice is a question not for more theorizing, but for empirical study.

There are, perhaps, those who read these passages simply as expressions of the writers' demands that things be done their way, as if they believe they have access to a divine inspiration ordering sanctification of particular procedures and the conscription of a following of disciples, with no questions raised from anyone as to why things should be done this way. On the contrary, "the reader who believes that all that is at stake in the axiomatic treatment of measurement is a possible canonizing of one scaling procedure at the expense of others is missing the point" (Ramsay, 1975, p. 262; also see Andrich, 1988, p. 20). The point is to sanctify neither items nor procedures, but to undertake data analysis as a kind of detective work.

The Roman Catholic Church . . . has long held that sanctification was only for the dead—indeed only for those already dead for an appropriate period. . . . sanctification of data is equally only for dead data—data that are only of historical importance, like Newton's apple. . . . Data analysis has its major uses. They are detective work and guidance counseling. Let us all try to act accordingly. (Tukey, 1969, p. 90)

The empirical studies of the detective work and guidance counseling provided by Rasch measurement that were called for by Wright (1977b) have been completed on many different kinds of test, survey, and rating scale data. These studies have answered the question concerning the Rasch model's practical usefulness in the affirmative many times over, as is evidenced by just a cursory examination of the


papers presented to the Midwest Objective Measurement Seminars, the International Objective Measurement Workshops (Wilson, 1991), and the Rasch Measurement SIG sessions of the AERA, besides the publications appearing in journals as diverse as the Archives of Physic gy. The medical fields have found Rasch's approach to measurement especially useful, with a great many Rasch applications being found in accreditation and certification, as well as in psychiatry, nursing, and blind and physical rehabilitation. Perhaps the only obstacles to revolution in educational measurement are assumptions concerning the irreconcilable differences of solidarity and objectivity.

SOLIDARITY VS. OBJECTIVITY OR OBJECTIVE SOLIDARITY?

In contrast to Rorty and Cherryholmes, I would like to suggest that stories of solidarity and objectivity are not mutually exclusive. Cherryholmes (1988, p. 450) says that

If Rorty is correct that reflective human beings make sense of their lives by telling stories about either solidarity or objectivity and our stories about objectivity are flawed, they nevertheless describe a community. The community is elitist, control centralized; criticism is limited to experts; the social context and historical setting of the community is not discussed; constructs (the way the community is conceptually organized) are not chosen on ethico-political or aesthetic grounds but in terms of "scientific" criteria; and the discourse is thought of as nonmaterial and descriptive-explanatory.

To this it must be added that if the solidarity of societies emphasizing objectivity is likely to take a one-sided, dictatorial, and authoritarian form, then the objectivity of societies that emphasize solidarity is likely to be multifaceted, conversational, and playful (Heelan, 1983, 1985; Ihde, 1979; Ackermann, 1985). There is a large literature describing science in the language of community life (Fahnestock, 1986; Fleck, 1979; Hesse, 1970, 1972; Holton, 1988; Kuhn, 1961, 1970; Latour & Woolgar, 1979; Ormiston & Sassower, 1989; Toulmin, 1982); the problem these works address is how to find and nurture whatever resources for solidarity may remain in scientific society. This does not require us to abandon objectivity; on the contrary, we aim to avoid yet another simplistic reduction of rich variation to a mere dichotomy.

In opposition to Lindquist's approach to measurement, Wright specifically addresses ethical, political, and aesthetic criteria by which to judge and choose constructs. Because we intend to use our measures to inform decisions that affect people's lives, we are ethically bound to be sure that the numbers actually represent more and less of the construct in question. Some might say that the only ethics addressed by Lindquist concern a blind devotion to following orders. Because we are legally and morally bound not to discriminate among persons by religion, sex, race, sexual orientation, or age, we require that our measures not vary across these groups in an inordinate fashion. Lindquist's definition of the test content as sacrosanct prevents attention from being focused on these issues in an effective way. Rasch's measurement models offer an aesthetically pleasing symmetry of question and answer in which each plays itself out in terms of the other, effectively extending and furthering the process by which meaning is reproduced in social life, conversationally. Lindquist, on the other hand, would have us accept only that which is handed down without question because we have no business monkeying around with sacrosanct definitions.

The desire to understand human experience by means of stories told of a nonhuman, ahistorical reality still predominates in much of social science. In education this desire is evident in the popularity of measurement models that do not recognize or accept the fact of their own imposition of political, moral, and aesthetic criteria upon students, test items, and data. By recognizing that the projection of such criteria is unavoidable, and by formulating models of how consciously chosen criteria can be simply, easily, and practically implemented, explicated, and criticized, it becomes possible to explore whether we really know what we are talking about when we make assertions on the basis of test results.
And far from saying that construct validity is simply a matter of fitting data to Rasch's models, this chapter has attempted to provoke thoughtful attention to the problem of construct validity. Measuring what is supposed to be measured involves far more than anything that can be specified in a set of mechanically and thoughtlessly followed rules. Revolution in educational measurement will be attained only when we let go of our needs for rules and the capacity to dominate and control in favor of a thinking secure enough to go with the flow of letting individuals be what they are.

REFERENCES

Ackermann, J.R. (1985). Data, instruments, and theory: A dialectical approach to understanding science. Princeton, NJ: Princeton University Press.


Andersen, E.B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.
Andrich, D. (1987, April). Educational and other social science measurement: A Kuhnian revolution in progress. Presented to the American Educational Research Association, New Orleans.
Andrich, D. (1988). Rasch models for measurement. Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-068. Beverly Hills, CA: Sage Publications.
Andrich, D. (1989). Statistical reasoning in psychometric models and educational measurement. Journal of Educational Measurement, 26(1), 81-90.
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66(6), 423-437.
Bechtoldt, H.P. (1959). Construct validity: A critique. American Psychologist, 14, 619-629.
Bollinger, G., & Hornke, L.F. (1978). Über die Beziehung von Itemtrennschärfe und Rasch-Skalierbarkeit. Archiv für Psychologie, 130, 89-96.
Brenneman, W.L., & Yarian, S.O., with A.M. Olson. (1982). The seeing eye: Hermeneutical phenomenology in the study of religion. University Park, PA: Pennsylvania State University Press.
Bridgman, P.W. (1927). The logic of modern physics. New York: Macmillan.
Brogden, H.E. (1977). The Rasch model, the law of comparative judgment and additive conjoint measurement. Psychometrika, 42, 631-634.
Burtt, E.A. (1954). The metaphysical foundations of modern science. New York: Doubleday Anchor.
Cajori, F. (1985). A history of mathematics. New York: Chelsea.
Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48(3), 378-399.
Cherryholmes, C. (1988). Construct validity and the discourses of research. American Journal of Education, 96(3), 421-457.
Choppin, B. (1968). An item bank using sample-free calibration. Nature, 219, 870-872.
Choppin, B. (1976). Recent developments in item banking. In D.N. DeGruitjer & L.J. Vanderkamp (Eds.), Advances in psychological and educational measurement. London: John Wiley & Sons.
Choppin, B. (1978). Item banking and the monitoring of achievement. Slough, England: National Foundation for Educational Research.
Cliff, N. (1973). Scaling. Annual Review of Psychology, 24, 473-506.
Coats, W. (1970). A case against the normal use of inferential statistical models in educational research. Educational Researcher, 3, 6-7.
Cook, T.D., & Campbell, D.T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Boston: Houghton Mifflin.
Coombs, C. (1967). A theory of data. New York: Wiley.
Cronbach, L.J. (1982). Prudent aspirations for social inquiry. In W.H. Kruskal (Ed.), The social sciences: Their nature and uses. Chicago: University of Chicago Press.
Cronbach, L., & Meehl, P. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281-302.


Crouse, J., & Trusheim, D. (1988). The case against the SAT. Chicago: University of Chicago Press.
Divgi, D.R. (1986). Does the Rasch model really work for multiple choice items? Not if you look closely. Journal of Educational Measurement, 23(4), 283-296.
Divgi, D.R. (1989). Reply to Andrich and Henning. Journal of Educational Measurement, 26(3), 295-299.
Duncan, O.D. (1984a). Measurement and structure: Strategies for the design and analysis of subjective survey data. In C.F. Turner & E. Martin (Eds.), Surveying subjective phenomena (Vol. 1). New York: Russell Sage Foundation.
Duncan, O.D. (1984b). Notes on social measurement: Historical and critical. New York: Russell Sage Foundation.
Duncan, O.D. (1984c). Rasch measurement: Further examples and discussion. In C.F. Turner & E. Martin (Eds.), Surveying subjective phenomena (Vol. 2). New York: Russell Sage Foundation.
Embretson (Whitely), S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179-197.
Fahnestock, J. (1986). Accommodating science: The rhetorical life of scientific facts. Written Communication, 3(3), 275-296.
Falk, R. (1986). Misconceptions of statistical significance. Journal of Structural Learning, 9, 83-96.
Fischer, G.H. (1987). Applying the principles of specific objectivity and of generalizability to the measurement of change. Psychometrika, 52(4), 565-587.
Fisher, W.P. (1988). Recent developments in the philosophy of science pertaining to problems of objectivity in measurement. Rasch Measurement Transactions, 2(2), 1-3.
Fisher, W.P. (1990, April). Conversing, testing, questioning. Presented to the American Educational Research Association Annual Meeting, Boston [ERIC Document TM016413].
Fisher, W. (1991). Objectivity in measurement: A philosophical history of Rasch's separability theorem. In M. Wilson (Ed.), Objective measurement: Theory into practice. Norwood, NJ: Ablex Publishing Corp.
Fleck, L. (1979). The birth and genesis of a scientific fact. Chicago: University of Chicago Press.
Gadamer, H.-G. (1980). Dialogue and dialectic: Eight hermeneutical studies on Plato (P.C. Smith, Trans. and Intro.). New Haven: Yale University Press.
Gadamer, H.-G. (1989). Truth and method (2nd ed.) (J. Weinsheimer & D.G. Marshall, Rev. Trans.). New York: Crossroad.
Goldman, S.H., & Raju, N.S. (1986). Recovery of one- and two-parameter logistic item parameters: An empirical study. Educational and Psychological Measurement, 46, 11-21.
Goldstein, H. (1979). Consequences of using the Rasch model for educational assessment. British Educational Research Journal, 5(2), 211-220.
Goldstein, H. (1980). Dimensionality, bias, independence and measurement scale problems in latent trait test score models. British Journal of Mathematical and Statistical Psychology, 33, 234-246.
Goldstein, H. (1983). Measuring changes in educational attainment over time: Problems and possibilities. Journal of Educational Measurement, 20(4), 369-377.
Goldstein, H., & Blinkhorn, S. (1977). Monitoring educational standards - An inappropriate model. Bulletin of the British Psychological Society, 30, 309-311.
Grau, B.W., & Mueser, K.T. (1986). Measurement of negative symptoms. Schizophrenia Bulletin, 12(1), 7-8.
Gould, S.J. (1981). The mismeasure of man. New York: W. W. Norton.
Gustafsson, J.-E. (1980). Testing and obtaining fit of data to the Rasch model. British Journal of Mathematical and Statistical Psychology, 33, 205-233.
Hacking, I. (1983). Representing and intervening: Introductory topics in the philosophy of natural science. Cambridge, UK: Cambridge University Press.
Hacking, I. (1988). On the stability of the laboratory sciences. The Journal of Philosophy, 85(10), 507-514.
Hambleton, R.K., & Cook, L.L. (1977). Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 14(2), 75-96.
Hambleton, R.K., & Novick, M.R. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10, 159-170.
Hambleton, R.K., & Rogers, H.J. (1989). Solving criterion-referenced measurement problems with item response models. International Journal of Educational Research, 13(2), 145-160.
Heelan, P. (1972). Towards a hermeneutic of natural science. Journal of the British Society for Phenomenology, 3, 252-260.
Heelan, P. (1983). Natural science as a hermeneutic of instrumentation. Philosophy of Science, 50, 181-204.
Heelan, P. (1985, March). Interpretation in physics: Observation and measurement. Greater Philadelphia Philosophy Consortium.
Heelan, P. (1988). Experiment and theory: Constitution and reality. The Journal of Philosophy, 85(10), 515-524.
Heelan, P. (1989). After experiment: Realism and research. American Philosophical Quarterly, 26(4), 297-308.
Heidegger, M. (1962). Being and time (J. Macquarrie & E. Robinson, Trans.). New York: Harper & Row.
Heidegger, M. (1967). What is a thing? (W.B. Barton, Jr., & V. Deutsch, Trans.). (Analytic afterword by E. Gendlin). South Bend, IN: Regnery.
Henning, G. (1989). Does the Rasch model really work for multiple-choice items? Take another look: A response to Divgi. Journal of Educational Measurement, 26(1), 91-97.
Hesse, M. (1970). Models and analogies in science. Notre Dame, IN: University of Notre Dame Press.


Hesse, M. (1972). In defence of objectivity. Proceedings of the British Academy, 58, 275-292.
Holton, G. (1988). Thematic origins of scientific thought (rev. ed.). Cambridge, MA: Harvard University Press.
Hudson, L. (1972). The cult of the fact. New York: Harper & Row.
Husserl, E. (1970). The crisis of European science. Evanston, IL: Northwestern University Press.
Ihde, D. (1979). Technics and praxis. Boston: D. Reidel.
Ihde, D. (1991). Instrumental realism. Bloomington, IN: Indiana University Press.
Jaeger, R.M. (1987). Two decades of revolution in educational measurement!? Educational Measurement: Issues and Practice, 6(2), 6-14.
Krantz, D.H., Luce, R.D., Suppes, P., & Tversky, A. (1971). Foundations of measurement. Vol. 1: Additive and polynomial representations. New York: Academic Press.
Krenz, C., & Sax, G. (1986). What quantitative research is and why it doesn't work. American Behavioral Scientist, 30(1), 58-69.
Kuhn, T.S. (1961). The function of measurement in modern physical science. Isis, 52(168), 161-193.
Kuhn, T.S. (1970). The structure of scientific revolutions (2nd ed.). Chicago: University of Chicago Press.
Latour, B., & Woolgar, S. (1979). Laboratory life: The social construction of scientific facts. Beverly Hills: Sage.
Lewine, R.R.J. (1986). Reply to Grau and Mueser. Schizophrenia Bulletin, 12(1), 9-11.
Linacre, J.M. (1991). FACETS: A computer program for many-faceted Rasch analysis. Chicago: MESA Press.
Linacre, J.M., & Wright, B.D. (1987). Item bias: Mantel-Haenszel and the Rasch model (Memorandum No. 39, MESA Psychometric Laboratory, Department of Education). Chicago: University of Chicago.
Linacre, J.M., & Wright, B.D. (1989). The equivalence of Rasch PROX and Mantel-Haenszel. Rasch Measurement, 3(2), 1-3.
Lindquist, E.F. (1953). Selecting appropriate score scales for tests (Discussion). Proceedings of the 1952 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694.
Loevinger, J. (1965). Person and population as psychometric concepts. Psychological Review, 72(2), 143-155.
Lord, F.M. (1968). An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic model. Educational and Psychological Measurement, 28, 989-1020.
Lord, F.M. (1975). Evaluation with artificial data of a procedure for estimating ability and item characteristic curve parameters (Research Bulletin 75-33). Princeton, NJ: Educational Testing Service.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.


Lord, F.M. (1983). Small N justifies Rasch model. In D.J. Weiss (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing. New York: Academic.
Luce, R.D., & Tukey, J.W. (1964). Simultaneous conjoint measurement: A new kind of fundamental measurement. Journal of Mathematical Psychology, 1(1), 1-27.
Lumsden, J. (1976). Test theory. Annual Review of Psychology, 27, 251-280.
Maier, W., & Philipp, M. (1986). Construct validity of the DSM-III and RDC classification of melancholia (endogenous depression). Journal of Psychiatric Research, 20(4), 289-299.
Masters, G. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
Messick, S. (1981). Constructs and their vicissitudes in educational and psychological measurement. Psychological Bulletin, 89, 575-588.
Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100, 398-407.
Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Erlbaum.
Mislevy, R.J., & Bock, R.D. (1983). BILOG: Item analysis and test scoring with binary logistic models. Mooresville, IN: Scientific Software.
Ormiston, G., & Sassower, R. (1989). Narrative experiments: The discursive authority of science and technology. Minneapolis, MN: University of Minnesota Press.
Osburn, H.G. (1968). Item sampling for achievement testing. Educational and Psychological Measurement, 28, 95-104.
Owen, D.S. (1985). None of the above: Behind the myth of scholastic aptitude. Boston: Houghton Mifflin.
Perline, R., Wright, B.D., & Wainer, H. (1979). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 3(2), 237-255.
Phillips, S.E. (1986). The effects of the deletion of misfitting persons on vertical equating via the Rasch model. Journal of Educational Measurement, 23(2), 107-118.
Ramsay, J.O. (1975). Review of Foundations of Measurement, Vol. I, by D.H. Krantz et al. Psychometrika, 40, 257-262.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut. (Reprint, 1980, with Foreword and Afterword by Benjamin D. Wright, Chicago: University of Chicago Press.)
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 4 (pp. 321-333). Berkeley: University of California Press.
Ricoeur, P. (1965). History and truth (C.A. Kelbley, Trans.). Evanston: Northwestern University Press.


Ricoeur, P. (1981). Hermeneutics and the human sciences: Essays on language, action and interpretation (J.B. Thompson, Ed., Trans. and Intro.). Cambridge, UK: Cambridge University Press.
Rorty, R. (1985). Solidarity or objectivity. In J. Rajchman & C. West (Eds.), Post-analytic philosophy. New York: Columbia University Press.
Singleton, M. (1991). Rasch measurement as a Kuhnian revolution. Rasch Measurement, 4(4), 119.
Stenner, A.J., & Smith, M., III. (1982). Testing construct theories. Perceptual and Motor Skills, 55, 415-426.
Stenner, A.J., Smith, M., III, & Burdick, D.S. (1983). Toward a theory of construct definition. Journal of Educational Measurement, 20(4), 305-316.
Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
Stocking, M.L. (1989). Empirical estimation errors in item response theory as a function of test properties. Princeton, NJ: Educational Testing Service Research Report.
Strenio, A.J. (1981). The testing trap. New York: Rawson, Wade.
Suppes, P., & Zinnes, J.L. (1963). Basic measurement theory. In R.D. Luce, R.R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology. New York: John Wiley & Sons.
Sutherland, G., in collaboration with S. Sharp. (1984). Ability, merit, and measurement: Mental testing and English education, 1880-1940. Oxford: Clarendon Press.
Toulmin, S. (1982). The construal of reality: Criticism in modern and postmodern science. Critical Inquiry, 9, 93-111.
Tracy, D. (1975). Blessed rage for order: The new pluralism in theology. Minneapolis: The Winston-Seabury Press.
Tukey, J.W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83-91.
Wheeler, J.A., & Zurek, W. (Eds.). (1983). Quantum theory and measurement. Princeton, NJ: Princeton University Press.
Whitely, S.E. (1977). Models, meanings and misunderstandings: Some issues in applying Rasch's theory. Journal of Educational Measurement, 14(3), 227-235.
Whitely, S.E., & Dawis, R.V. (1974). The nature of objectivity with the Rasch model. Journal of Educational Measurement, 11(2), 163-178.
Willmott, A., & Fowles, D. (1974). The objective interpretation of test performance: The Rasch model applied. Atlantic Highlands, NJ: NFER Publishing.
Wilson, M. (Ed.). (1991). Objective measurement: Theory into practice. Norwood, NJ: Ablex Publishing Corp.
Wingersky, M.S., Barton, M.A., & Lord, F.M. (1982). LOGIST Users Guide. Princeton, NJ: Educational Testing Service.
Wood, R. (1978). Fitting the Rasch model: A heady tale. British Journal of Mathematical and Statistical Psychology, 31, 27-32.
Wright, B.D. (1968). Sample-free test calibration and person measurement. Proceedings of the 1967 Invitational Conference on Testing Problems (pp. 85-101). Princeton: Educational Testing Service.
Wright, B.D. (1977a). Misunderstanding the Rasch model. Journal of Educational Measurement, 14(3), 219-225.
Wright, B.D. (1977b). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116.
Wright, B.D. (1984). Despair and hope for educational measurement. Contemporary Education Review, 3(1), 281-288.
Wright, B.D. (1985). Additivity in psychological measurement. In E. Roskam (Ed.), Measurement and personality assessment. North Holland: Elsevier Science Publishers.
Wright, B.D. (1988a). Georg Rasch and measurement. Rasch Measurement, 2(3), 1-7.
Wright, B.D. (1988b). The model necessary for a Thurstone scale and Campbell concatenation for mental testing. Rasch Measurement, 2(1), 2-4.
Wright, B.D., & Bell, S.R. (1984). Item banks: What, why, how. Journal of Educational Measurement, 21(4), 331-345.
Wright, B.D., & Linacre, J.M. (1989). Observations are always ordinal; measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867.
Wright, B.D., & Linacre, J.M. (1991). BIGSTEPS: A Rasch-model computer program. Chicago: MESA Press.
Wright, B.D., & Stone, M. (1979). Best test design. Chicago: MESA Press.
Zimmerman, M.E. (1990). Heidegger's confrontation with modernity: Technology, politics, art. Bloomington: Indiana University Press.
Zwick, R. (1990). When do item response function and Mantel-Haenszel definitions of differential item functioning coincide? Journal of Educational Statistics, 15, 183-197.

chapter 4

Historical Views of the Concept of Invariance in Measurement Theory*

George Engelhard, Jr.
Emory University

The history of science is the history of measurement. (Cattell, 1893, p. 316)

The scientist is usually looking for invariance whether he knows it or not. (Stevens, 1951, p. 20)

Invariance has been identified as a fundamental characteristic of measurement in the behavioral sciences (Andrich, 1988a; Bock & Jones, 1968; Jones, 1960; Stevens, 1951). In essence, the goal of invariant measurement has been succinctly stated by Stevens: "the scientist seeks measures that will stay put while his back is turned" (1951, p. 21). The concept of invariance has implications for both item calibration and the measurement of individuals. Many of the measurement problems that confront researchers in psychology and education today, such as those related to invariance, are not new. By taking a historical perspective on these measurement problems, it may be possible to increase the understanding of the measurement problems themselves, assess the adequacy of solutions proposed by major measurement theorists, and identify promising areas for future research. Progress, and in some cases lack of progress, towards the solution of basic measurement problems can also be meaningfully documented.

* This research was supported in part by the University Research Committee of Emory University. Support for this research was also provided through a Spencer Fellowship from the National Academy of Education. Earlier versions of this chapter were presented at the Fifth International Objective Measurement Workshop at the University of California, Berkeley (March 1989), and at the Sixth International Objective Measurement Workshop at the University of Chicago (April 1991). Judith A. Monsaas and Larry Ludlow provided helpful comments on earlier drafts of this paper. Sections of this chapter have been published in Engelhard, G. (1992, Summer), Historical views of invariance: Evidence from the measurement theories of Thorndike, Thurstone and Rasch, Educational and Psychological Measurement. Permission to reprint has been obtained from the publisher. The figures reproduced in this chapter are based on the original graphics produced by Thorndike, Thurstone, and Rasch. The original graphics varied somewhat in quality, and for historical accuracy are reproduced in this chapter as originally drawn.

During the 20th century, there have been two major research traditions that have guided measurement theorists attempting to quantify various human characteristics, such as abilities, aptitudes, and attitudes. One tradition has its roots in the psychometric work of Charles Spearman (1904); this research tradition, which is focused on the test score, is primarily concerned with measurement error and the decomposition of an observed test score into several components including a "true" score and various error components. This research tradition within mental test theory can be labelled test theory. A second research tradition that has developed in a parallel fashion has its roots in the 19th-century work in psychophysics and has continued into present practice through the various forms of latent trait theory or, more specifically, item response theory (IRT). This second research tradition will be referred to as scaling theory. The focus of research within this second tradition is on the calibration of both individuals and items onto a latent variable scale. Within these two research traditions, test theory and scaling theory, there are several dominant perspectives that have evolved over time.
For example, Spearman's research on test theory has been extended through generalizability theory (Brennan, 1983; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson, Webb, & Rowley, 1989), as well as the LISREL models developed by Karl Jöreskog (Jöreskog & Sörbom, 1986). The purpose of this chapter is to examine advances within the second measurement tradition of scaling theory that are due to the contributions of Thorndike, Thurstone, and Rasch. Measurement perspectives within test theory will not be addressed in detail in this chapter.

A great deal of educational and psychological research has been conducted within the framework of test theory. For example, empirical research workers routinely include "coefficient alphas" or "KR-20s" for the instruments used in their studies. Along with this concern for "reliability" coefficients, research workers have also been concerned about the validity of their instruments, although documenting what a test score really represents is rarely resolved in most studies and may ultimately be the most important measurement question of all. Instead of focusing on measurement problems related to reliability and validity, which are the central concepts of test theory (Loevinger, 1957), this study focuses on measurement problems related to the concept of invariance, which appear clearly within scaling theory; this emphasis is not to say that the concepts of reliability or especially validity are unimportant, but rather that different research traditions focus on different aspects of the measurement problems encountered in the behavioral sciences. In fact, invariance has important relationships to and implications for issues related to reliability and validity, and it is essential for gaining a clear understanding of certain persistent problems encountered in test theory. As pointed out by Jones and Appelbaum (1989), developments in item response theory have led to constructive changes in psychological testing and the "primary advantage of IRT over classical test theory resides in properties of invariance" (p. 24).

The purpose of this chapter is to provide a historical perspective on the concept of invariance. Several enduring measurement problems related to item calibration and to the measurement of individuals can be meaningfully viewed by using the concept of invariance. The measurement theories of Thorndike, Thurstone, and Rasch are used because they address measurement problems related to the concept of invariance and propose solutions to these problems. These measurement theorists also share a common research tradition based on scaling theory.
Although there are quantitative aspects to the approaches used to address invariance, it is beyond the scope of this chapter to provide detailed derivations of the equations used by each theorist to achieve sample-invariant item calibration and item-invariant measurement of individuals. These detailed derivations are presented by Engelhard (1984) for measurement issues related to sample-invariant item calibration. A parallel analysis can also be developed for issues related to the item-invariant measurement of individuals, and these derivations are presented in detail by Engelhard (1991). In the next section of this chapter, the concept of invariance is defined and arguments are presented for its importance as a key idea in measurement. A description of the measurement theories of Thorndike, Thurstone, and Rasch is presented next; the role of invariance in each of these theories is also examined. Next, a comparison and discussion of these three theories of measurement are set forth in terms of their contributions to the solution of problems related to the concept of

76

ENGELHARD, JR.

invariance. The final section includes a summary of the major points of this chapter, as well as suggestions for additional research in this area. THE CONCEPT OF INVARIANCE Within the behavioral sciences, S.S. Stevens (1951) has presented one of the strongest cases for the general importance of the concept of invariance. In his chapter on "Mathematics, Measurement and Psychopp pp Stevens described the role of this concept in mathematics and physics, and he argued that "many psychological problems are already conceived as the deliberate search for invariances" (p. 20). In fact, Stevens defined the whole field of science in terms of a quest for invariance and the concomitant generalizability of results. In his words, The scientist is usually looking for invariance whether he knows it or not. Whenever he discovers a functional relationship his next question follows naturally: under what conditions does it hold? . . . The quest for invariant relations is essentially the aspiration toward generality, and in psychology, as in physics, the principles that have wide applications are those we prize. (Stevens, 1951, p. 20) Applying this view of invariance more specifically to measurement issues, Stevens used the concept of invariance to define his familiar scales of measurement—nominal, ordinal, interval, and ratio scales (Stevens, 1946). In his words, Each of the four classes of scales is best characterized by its range of invariance—by the kinds of transformations that leave the "structure" of the scale undistorted. And the nature of invariance sets limits to the kinds of statistical manipulations that can be legitimately applied to the scaled data. (Stevens, 1951, p. 
23) Influenced by the insightful work of Mosier (1940, 1941), Stevens pointed out the symmetry between the fields of psychophysics and psychometrics as related to the concept of invariance:

Psychophysics sees the response as an indicator of an attribute of the individual—an attribute that varies with the stimulus and is relatively invariant from person to person. Psychometrics regards the response as indicative of an attribute that varies from person to person but is relatively invariant for different stimuli. Both psychophysics and psychometrics make it their business to display the conditions and limits of these invariances. (Stevens, 1951, p. 31)

The first sentence in this quotation illustrates the idea of sample-invariant item calibration, whereas the second sentence points to the idea of item-invariant measurement of individuals. This duality between psychophysics and psychometrics, which was clearly described by Mosier (1940, 1941) and pointed out even earlier by Guilford (1936), represents one of the five major ideas underlying test theory identified by Lumsden (1976). Measurement problems related to invariance can be meaningfully viewed in terms of these two broad classes: sample-invariant item calibration and item-invariant measurement of individuals. Within each of these two classes, invariance over methods and conditions can be examined. Methods refer to the statistical procedures and models used within the measurement theory, including the method used to collect the data. For example, paired comparison and successive interval scaling would represent different methods of data collection, and would also require different statistical models. Conditions can refer to subgroupings of items, of examinees, or of both. For example, test equating is concerned with the development of procedures that yield comparable estimates of an individual's ability that are invariant over the subgroups of items (tests) that are used to obtain these ability estimates. As another example, the research on item bias or differential item functioning, as it has come to be labelled, reflects concern with whether or not the meaning of an individual's responses on a particular test item varies as a function of irrelevant factors related to membership in various social categories, such as gender, race, and social class.

Sample-Invariant Item Calibration

The basic measurement problem underlying sample-invariant item calibration is how to minimize the influence of arbitrary samples of individuals on the estimation of item scale values. For example, Engelhard (1984) described how Thorndike provided a single adjustment (location) for differences in group characteristics, whereas Thurstone provided for two adjustments (location and scale). Rasch's approach to sample-invariant calibration can be viewed as providing three adjustments (location, scale, and an individual level response model). Andrich (1978) has also provided an important comparison between the Thurstone and Rasch approaches to item scaling by using paired comparison responses that can also lead to sample-invariant item calibrations.

The overall goal of sample-invariant calibration of items is to estimate the location of items on a latent variable of interest that will remain unchanged across subgroups of individuals and also across various subgroups of items. For example, if the goal of sample-invariant calibration is achieved, then the item scale values will not be a function of subgroup characteristics, such as ability level, gender, race, or social class. Further, the calibration of the items should also be invariant over subsets of items, so that if a calibrated set of items is being developed, the scale values of the items are not affected by the inclusion or exclusion of other items in the test.
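The difference between a single adjustment (location) and two adjustments (location and scale) can be illustrated with a small numerical sketch. It is not a reconstruction of either theorist's actual procedure: the item locations and group parameters are hypothetical, the group means and standard deviations are treated as known rather than estimated, ability is assumed to be normally distributed in each group, and deviates are expressed in standard deviation units for simplicity.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to convert proportions to deviates


def proportion_correct(d, mean, sd):
    """Proportion of a normal ability group (mean, sd) located above item location d."""
    return 1.0 - NormalDist(mean, sd).cdf(d)


def normal_deviate(p):
    """Item deviate from the group mean, expressed in that group's own SD units."""
    return STD_NORMAL.inv_cdf(1.0 - p)


def location_only(p, mean):
    """One adjustment (location): add the group mean and ignore group dispersion."""
    return mean + normal_deviate(p)


def location_and_scale(p, mean, sd):
    """Two adjustments (location and scale): rescale the deviate by the group SD."""
    return mean + sd * normal_deviate(p)


# Hypothetical item locations and two hypothetical groups on a common scale.
items = {"a": -1.0, "b": 0.0, "c": 1.5}
groups = {"A": (0.0, 1.0), "B": (1.0, 2.0)}  # (mean, sd) for each group

one_adj = {}
two_adj = {}
for g, (mean, sd) in groups.items():
    for name, d in items.items():
        p = proportion_correct(d, mean, sd)
        one_adj[(name, g)] = location_only(p, mean)
        two_adj[(name, g)] = location_and_scale(p, mean, sd)

for name in items:
    print(name,
          [round(one_adj[(name, g)], 3) for g in groups],
          [round(two_adj[(name, g)], 3) for g in groups])
```

Under these assumptions the two-adjustment values agree across groups A and B, whereas the location-only values differ whenever the groups differ in dispersion.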

Item-Invariant Measurement of Individuals

In the case of item-invariant measurement, the basic measurement problem involves minimizing the influence of the particular items that happen to be used to estimate an individual's ability. This problem is also related to the scaling and equating of test scores, as well as to the scoring of each individual's performance. Solutions to this problem usually include adjustments for item characteristics (item difficulty) and test characteristics (location, dispersion, and shape of item distributions on the latent variable scale). The overall objective is to obtain comparable estimates of individual ability regardless of which items are included in the test. This objective is essentially the problem of equating person measurements obtained on tests composed of different items (Engelhard & Osberg, 1983). Invariance over scoring method also requires attention. In addition to considering invariance over methods, it is important to examine invariance over conditions within this context; an individual's score should not depend on the scores of other individuals being tested at the same time.

In summary, invariance can be viewed as an important general concept in the physical and behavioral sciences, as well as a key aspect of successful measurement in the behavioral sciences. As pointed out by Bock and Jones (1968), "in a well-developed science, measurement can be made to yield invariant results over a variety of measurement methods and over a range of experimental conditions for any one method" (p. 9).
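The goal of item-invariant measurement can be sketched numerically. The example below assumes the standard logistic form of the Rasch model, which is only one possible response model, and uses hypothetical item difficulties that are treated as already calibrated. To keep the example free of sampling error, it uses model-expected (fractional) scores in place of observed raw scores. All function and variable names are illustrative.

```python
import math


def p_success(theta, b):
    """Rasch probability that a person of ability theta succeeds on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def expected_score(theta, difficulties):
    """Model-expected number of successes on a set of items."""
    return sum(p_success(theta, b) for b in difficulties)


def ability_from_score(score, difficulties, lo=-10.0, hi=10.0, tol=1e-10):
    """Bisection search for the ability whose expected score equals `score`.

    Requires 0 < score < len(difficulties); the expected score increases with ability.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Two hypothetical tests built from different calibrated items.
easy_test = [-2.0, -1.5, -1.0, -0.5, 0.0]
hard_test = [0.0, 0.5, 1.0, 1.5, 2.0]
true_theta = 0.7  # hypothetical ability of one person

score_easy = expected_score(true_theta, easy_test)
score_hard = expected_score(true_theta, hard_test)
theta_easy = ability_from_score(score_easy, easy_test)
theta_hard = ability_from_score(score_hard, hard_test)

print(round(score_easy, 3), round(score_hard, 3))
print(round(theta_easy, 4), round(theta_hard, 4))
```

The scores on the two tests differ because the items differ in difficulty, yet once item difficulty is taken into account both tests point to the same location on the latent variable.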


THREE MEASUREMENT THEORIES AND INVARIANT MEASUREMENT

The purposes of this section are to describe and to illustrate how the concept of invariance emerged within the measurement theories of Thorndike, Thurstone, and Rasch. As the clearest statement of the conditions necessary to accomplish invariance is presented in the measurement theory of Rasch, this section begins with his research and then traces the adumbrations of these ideas within the work of Thurstone and Thorndike. It also should be pointed out that all three of these theorists wrote extensively on various measurement problems, and for Thorndike especially it was sometimes difficult to point to one consistent set of principles that defined his definitive theory of measurement. In order to address this issue, certain texts are explicitly cited. It should be understood that these texts are being used to define a particular individual's measurement theory. This endeavor was not much of a problem for Rasch because he was very consistent in his views related to invariance; Thurstone was fairly consistent, whereas Thorndike was the least consistent of the three.

Rasch

Based on psychometric research conducted during the 1950s, Rasch (1960/1980, 1961, 1966a,b) presented a set of ideas and methods that were described by Loevinger (1965) as a "truly new approach to psychometric problems" (p. 151) that can lead to "nonarbitrary measures" (p. 151). One of the major characteristics of this "new approach" was Rasch's explicit concern with the development of "individual-centered techniques" as opposed to the group-based measurement models used by measurement theorists such as Thorndike and Thurstone. In Rasch's words, "individual-centered statistical techniques require models in which each individual is characterized separately and from which, given adequate data, the individual parameters can be estimated" (1960/1980, p. xx).
Problems related to invariance played an important role in motivating the measurement theory of Rasch. As pointed out by Andrich (1988a), Rasch presented "two principles of invariance for making comparisons that in an important sense precede, though inevitably lead to, measurement" (p. 18). Rasch's concept of "specific objectivity," which he formulated in terms of his principles of comparison, forms his version of the goals of invariant measurement (Rasch, 1977). In Rasch's words,


The comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison; and it should also be independent of which stimuli within the considered class were or might also have been compared. Symmetrically, a comparison between two individuals should be independent of which particular stimuli within the class considered were instrumental for the comparison; and it should also be independent of which other individuals were also compared, on the same or on some other occasion. (Rasch, 1961, pp. 331-332)

It is clear in this quotation that Rasch recognized the importance of both sample-invariant item calibration and item-invariant measurement of individuals. In fact, he made them the cornerstones of his quest for specific objectivity. In order to address problems related to invariance, Rasch laid the foundation for the development of a "family of measurement models," which are characterized by separability of item and person parameters (Masters & Wright, 1984). Rasch's approach to sample-invariant item calibration involved the comparison of item difficulties obtained in separate groups. In his words,

In relation to attainment tests all the school grades for which the tests are in practice applicable may be considered as forming a total collection of persons, that may be divided into subpopulations, such as single grades, sex groups and age groups within a grade, social strata, etc. Between the test results in such more or less extensive groups the same fundamental relationship must hold, and if so we shall use the term that the relationship is "relatively independent of population," the qualification "relatively" pointing to the degree of breakdown that has been applied to the data. (Rasch, 1960/1980, p. 9)

In his book, he used ability groups formed on the basis of raw scores.
In essence, Rasch was "looking for trouble in a more or less definite direction, namely, for the possibility that the relative difficulties of the tests may vary with [raw score] that is, with the reading inability of the children" (Rasch, 1961, p. 323). This test of fit (or what Rasch referred to as control of the model) was presented graphically. In order to illustrate this idea, the results for two subtests, N and F, from the Danish Military Group Intelligence Test (BPP), which were used by Rasch (1960/1980), are presented in Figure 4-1. The test data were obtained from 1,904 recruits who were tested in September 1953. The results for Subtest N are presented in Panel A (Rasch, 1960/1980, p. 89), which illustrates successful sample-invariant item calibration. The abscissa is based on the average of the separate within group

Figure 4-1 Rasch's graphic approach for examining sample-invariant item calibration. Panel A: Subtest N of BPP (successful sample-invariant item calibration). Panel B: Subtest F of BPP (unsuccessful sample-invariant item calibration).

Note. The abscissa (l.i) in each panel is the average of the item difficulties calculated separately within the raw score groups (r). The ordinate (lri) represents the item difficulties calculated within each score group with a constant added by Rasch to avoid overlapping items and to highlight the linearity or non-linearity of these plots. From Probabilistic models for some intelligence and attainment tests (pp. 89 and 98) by G. Rasch, 1980/1960, Chicago: The University of Chicago Press. Copyright 1980 by The University of Chicago. Reprinted by permission.

Figure 4-2 Rasch's graphic approach for examining item-invariant measurement of individuals. Panel A: Subtest N of BPP (successful item-invariant measurement). Panel B: Subtest F of BPP (unsuccessful item-invariant measurement).

Note. The abscissa (lr.) in each panel is the average of the ability estimates calculated separately within item groups. The ordinate (lri) represents the ability estimates calculated within each item group with a constant added by Rasch to avoid overlapping ability estimates and to highlight the linearity or non-linearity of these plots. From Probabilistic models for some intelligence and attainment tests (pp. 87 and 97) by G. Rasch, 1980/1960, Chicago: The University of Chicago Press. Copyright 1980 by The University of Chicago. Reprinted by permission.

calibrations. The parallel lines indicate that the difficulty of the items is relatively invariant across raw-score groups. Unsuccessful sample-invariant item calibrations are presented in Panel B for Subtest F (Rasch, 1960/1980, p. 98) and are reflected in the nonparallel lines. Because of the formal symmetry in the model proposed by Rasch between items and individuals, he could use a similar graphic approach to examine whether or not item-invariant measurement of individuals had been achieved. The results for Subtests N and F are presented in Figure 4-2. Panel A (Rasch, 1960/1980, p. 87) illustrates successful item-invariant measurement with ability estimates relatively invariant over item groups, whereas Panel B (Rasch, 1960/1980, p. 97) shows unsuccessful item-invariant measurement, as evidenced by the inequality of the slopes based on the regression of ability estimates obtained separately within each item group on the total. Even though there are more sophisticated methods for examining invariance using statistical tests of item and person fit (Wright, 1988; Wright & Stone, 1979), the graphical methods can be a useful guide to whether or not invariance has been achieved.
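The logic of the parallel-lines check described above can be sketched in a few lines of code. This is a simplified illustration, not Rasch's own computation: it uses model-implied log-odds for hypothetical ability groups instead of observed proportions within raw-score groups, and the non-fitting case is a hypothetical model in which items differ in discrimination. Each group's item log-odds are regressed on the average over groups; equal slopes correspond to parallel lines.

```python
def rasch_logit(theta, b):
    """Log-odds of success under the Rasch model: ability minus difficulty."""
    return theta - b


def two_parameter_logit(theta, item):
    """Hypothetical non-Rasch case: each item has its own discrimination a."""
    a, b = item
    return a * (theta - b)


def least_squares_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def group_slopes(abilities, items, logit):
    """Slope of each group's item log-odds regressed on the across-group average."""
    table = [[logit(theta, item) for item in items] for theta in abilities]
    n_groups = len(table)
    average = [sum(row[i] for row in table) / n_groups for i in range(len(items))]
    return [least_squares_slope(average, row) for row in table]


abilities = [-1.0, 0.0, 1.0]            # hypothetical ability groups
rasch_items = [-1.0, 0.0, 0.5, 2.0]     # hypothetical difficulties
other_items = [(0.5, -1.0), (1.0, 0.0), (1.5, 0.5), (2.0, 2.0)]  # (a, b) pairs

fit_slopes = group_slopes(abilities, rasch_items, rasch_logit)
misfit_slopes = group_slopes(abilities, other_items, two_parameter_logit)

print([round(s, 3) for s in fit_slopes])
print([round(s, 3) for s in misfit_slopes])
```

When the Rasch form holds, every group yields the same slope, which is the numerical counterpart of the parallel lines in Panel A; when discriminations vary, the slopes differ across groups, as in Panel B.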
As will be seen in the next section, Thurstone used a similar graphical method to examine whether or not his method of absolute scaling was appropriate for a particular set of test data. By focusing on the individual as the level of analysis, Rasch was able to examine test data and to identify when invariance was exhibited. When the data fit the Rasch model, such as with Subtest N, then the types of invariance which eluded research workers in the test theory tradition can be obtained. To quote Loevinger,

Rasch is concerned with a different and more rigorous kind of generalization than Cronbach, Rajaratnam and Gleser. When his model fits, the results are independent of the sample of persons and of the particular items with some broad limits. Within these limits, generality is, one might say, complete. (Loevinger, 1965, p. 151)

Detailed descriptions of Rasch measurement are presented in Wright and Stone (1979), Wright and Masters (1982), and Wright (1988).

Thurstone

Thurstone also recognized the importance of invariant measurement. In fact, as pointed out by Bock and Jones (1968), "in the system of psychological measurement based on the Thurstonian models, we achieve some of the invariance in measurement which is characteristic


of the other sciences" (p. 9). In developing his method of absolute scaling for calibrating test items, Thurstone (1925, 1927, 1928a,b) was specifically motivated by the lack of sample-invariance he had observed in Thorndike's scaling method. In his words,

the probable error, or PE [used in Thorndike's method], is not valid as a unit of measurement for educational scales. Its defect consists in that it does not possess the one requirement of a unit of measurement, namely constancy. It fluctuates from one age to another. (Thurstone, 1927, p. 505; emphasis added)

The probable error is a measure of dispersion used by Thorndike that is similar to the interquartile range; for normal distributions, .6745 times the standard deviation is approximately equal to the PE. The concept of constancy proposed by Thurstone is his version of an invariance condition, and it is an explicit consequence of measurement situations that yield objective measurements. Thorndike's PE values fluctuate because the item scale values are not sample-invariant, a condition that violates Thurstone's insight that the "scale value of an item should be the same no matter which age group is used in the standardization" (Thurstone, 1928a, p. 119).

As did Rasch, Thurstone used the idea of a continuum to represent the latent variable of interest and assumed that items can be placed at points on this linear scale which would have a fixed position regardless of the group being tested. According to Thurstone, "if any particular test item or particular raw score is to be allocated on the absolute scale, its scale value should be ideally the same whether determined by group one or group two" (1925, p. 438). Thurstone presented this idea graphically, and his illustration is reproduced in Figure 4-3. In Figure 4-3, Thurstone (1927, p. 509) showed the location of seven items (a to g) and presented the idea that the calibration of these items that determines their location on the latent variable scale should be invariant over groups A and B, which are different in terms of location and variability on the latent variable scale.

In order to adjust for differences in the location and variability of two or more distributions, Thurstone assumed a normal distribution of ability for each group and essentially adjusted statistically for differences in locations (means) and scales (standard deviations). In order for these adjustments proposed by Thurstone to lead successfully to sample-invariant item calibration, Thurstone proposed a graphical test of fit. Thurstone's illustration, which is presented in Figure 4-4, shows the plot of the item scale values (sigma values) calibrated separately in grades 7 and 8. According to Thurstone,

Figure 4-3 Thurstone's view of sample-invariant item calibration

Note. The abscissa represents a latent variable scale. According to Thurstone (1927), the location of the seven items (a to g) on the latent variable scale should be invariant over ability groups A and B. From "The Unit of Measurement in Educational Scales" by L.L. Thurstone, 1927, The Journal of Educational Psychology, 18, p. 509. Copyright American Psychological Association. Reprinted by permission.

If the plot in Figure 4-4 should be distinctly non-linear, the present scaling method is not applicable. Non-linearity here shows that the two distributions cannot both be normal on the same scale. If the plot is linear, it proves that both distributions may be assumed to be normal on the same scale or base line. (Thurstone, 1927, p. 513) This test of fit can also be presented in the style of the graphical displays used by Rasch; this graphic representation is shown in Figure 4-5 (Engelhard, 1984, p. 33) for Thurstone's data. The effects of using Thurstone's method of absolute scaling, which provides adjustments for differences in the locations and variations of the ability distributions, as compared to Thorndike's scaling method, which simply adjusts for location differences, are shown in Figure 4-6. In Panel A of Figure 4-6 (Thurstone, 1927, p. 506), the results of using Thorndike's method to calibrate a language scale developed by Trabue (1916) are presented; the average language ability increases as a function of grade level, whereas the variances remain constant. The results obtained by using Thurstone's method are presented in Panel B of Figure 4-6 (Thurstone, 1927, p. 515); in this figure, average ability


Figure 4-4 Thurstone's graphic approach for examining sample-invariant item calibrations

Note. Item scale values (sigma values) were calculated separately by grade (7 and 8). From "The Unit of Measurement in Educational Scales" by L.L. Thurstone, 1927, The Journal of Educational Psychology, 18, p. 513. Copyright American Psychological Association. Reprinted by permission.

increases with grade level, but the variances of the scores also increase. These results seem theoretically plausible. Thurstone's method of absolute scaling is described and illustrated in detail in Engelhard (1984). An "experimental" adjustment for sample effects that occurs with Thurstone's model for paired comparisons is described in Andrich (1978). Thurstone's method of absolute scaling can also be used to scale test scores (Gulliksen, 1950), but a more interesting discussion of issues related to item-invariant measurement is presented by Thurstone (1926) in an article on the scoring of individual performance. In this article, Thurstone presented a set of conditions as follows:

1. It should not be required to have the same number of test elements at each step of the scale.
2. It should be possible to omit several test questions at different levels of the scale without affecting the individual score.
3. It should be possible to include in the same scale two forms of test.
4. It should not be required to submit every subject to the whole range of the scale. The starting point and terminal point, being selected by the examiner, should not directly affect the individual score.
5. It should be possible to use the scale so that a rational score may be determined for each individual subject and so that the performance of groups of subjects may be compared.
6. The arithmetical labor in determining individual scores should be a minimum.
7. The procedure should be as far as possible consistent with psychophysical methods so that it will be free from the logical errors involved in the Binet scales and its variants.

Figure 4-5 Rasch's graphic test of fit for Thurstone's data (axis label: Combined calibration sample)

Note. Based on same data presented in Figure 4. From "Thorndike, Thurstone and Rasch: A comparison of their methods of scaling psychological and educational tests" by G. Engelhard, 1984, Applied Psychological Measurement, 8, p. 33. Copyright 1984 by Applied Psychological Measurement, Inc. Reproduced by permission.

Figure 4-6 Distribution of language ability based on Thorndike's method (Panel A) and Thurstone's method of absolute scaling (Panel B). Panel A: Based on Thorndike's scale. Panel B: Based on Thurstone's method of absolute scaling.

Note. Abscissa is a latent variable scale for measuring language ability and ordinate indicates successive grade groups (grade 2 to 12). From "The Unit of Measurement in Educational Scales" by L.L. Thurstone, 1927, The Journal of Educational Psychology, 18, pp. 506 and 515. Copyright

Conditions one to five clearly show Thurstone's concern with item-invariant measurement. In his 1926 paper, he went on to propose a scoring method which meets these conditions. Thurstone's approach is presented in detail by Engelhard (1991). In essence, Thurstone proposed what would be recognized today as person characteristic curves that graphically present the probabilities of an individual succeeding on a set of calibrated test items. Many of Thurstone's articles on scaling are included in The Measurement of Values (1959), although his work on absolute scaling is not included in that volume. The technical details and elaborations of Thurstonian models are presented in Bock and Jones (1968). Andrich (1988c) provided a useful overview of Thurstone's contributions to measurement theory. Although it is not directly relevant for this chapter, it is interesting to note that both Thurstone (1947) and Rasch (1953) also used the concept of invariance as an important aspect of their approaches to factor analysis.

Thorndike

In 1904, Thorndike published the first edition of his highly influential book entitled An Introduction to the Theory of Mental and Social Measurements. Thorndike's major aim in writing this book was to "introduce students to the theory of mental measurements and to provide them with such knowledge and practice as may assist them to follow critically quantitative evidence and argument and to make their own researches exact and logical" (1904, p. v). Thorndike's book was the standard reference on statistics and quantitative methods in the mental and social sciences for the first two decades of the twentieth century (Clifford, 1984; Engelhard, 1988; Travers, 1983).
Much of this influence can be attributed to Thorndike's clear and expository writing style. He explicitly acknowledged that contemporary work in measurement theory had not been presented in a manner suitable for students without fairly advanced mathematical skills. He set out to present a less mathematical introduction to measurement theory based on the belief that

"there is, happily, nothing in the general principles of modern statistical theory but refined common sense, and little in the techniques resulting from them that general intelligence can not readily master" (p. 2). Thorndike, who wrote extensively on educational and psychological measurement, covered topics that ranged from the general statement of his theory (Thorndike, 1904) to the measurement of a variety of educational outcomes (Thorndike, 1910, 1914, 1918, 1921), as well as intelligence (Thorndike, Bregman, Cobb, & Woodyard, 1926). What were the basic measurement problems identified by Thorndike? Thorndike clearly stated that the "special difficulties" of measurement in the behavioral sciences are

1. Absence or imperfection of units in which to measure.
2. Lack of constancy in the facts measured.
3. Extreme complexity of the measurements to be made.

In order to illustrate the problems related to the absence of an accepted unit of measurement, Thorndike (1904) pointed out that the spelling tests developed by Joseph Mayer Rice did not have equal units. Rice assumed that all his spelling words were of equal difficulty, whereas Thorndike argued that the correct spelling of an easy versus a hard word did not reflect equal amounts of spelling ability. Because the units of measurement are unequal, Thorndike asserted that Rice's results were inaccurate. Without general agreement on units, the meaning of test scores becomes more subjective. Within the framework of this chapter, Thorndike was illustrating that obtained scores may not be invariant over subsets of items which vary in difficulty.

Inconstancy is the second major measurement problem identified by Thorndike (1904). Many of the measurement problems encountered in the behavioral sciences are related to random variation inherent in human characteristics. These variations are due not only to the unreliability of tests, but also to within-subject fluctuations. For example, if a person's motivation is measured repeatedly, these values tend to vary. Thorndike's concept of constancy is also related to the idea of invariance as developed in this chapter.

The final measurement problem or "special difficulty" identified by Thorndike pertains to the extreme complexity of the variables and constructs that social and behavioral scientists wish to measure. This problem primarily, although not totally, reflects a concern with dimensionality. Most of the variables worth measuring in the behavioral sciences do not readily translate into unidimensional tests that permit the reporting of a single score to represent the individual's location on


the latent variable or construct of interest. As pointed out by Jones and Appelbaum (1989), if unidimensionality is obtained for all items and over all groups of examinees, then item parameters will be invariant across groups, and ability parameters will be invariant across items. Methods for conducting item factor analyses designed to explore this issue have been summarized by Mislevy (1986), and an approach to this problem has been illustrated by Muraki and Engelhard (1985).

Thorndike's method for obtaining sample-invariant item calibration is very similar to Thurstone's method of absolute scaling. As described by Thurstone,

Thorndike's scaling method consists in first determining the scale value of each item for each grade separately with the mean of each grade as an origin. The difficulty of a test item for Grade V children, for example, is determined by the proportion of right answers to the test item in that grade. When a test item has been scaled in several grades, the scale values so obtained will, of course, be different because of the fact that they are expressed as deviations from different grade means as origins. Thorndike then reduces all these measurements to a common origin in the construction of an educational scale by adding to each scale value the scale value of the mean of the grade. (Thurstone, 1927, p. 508)

The major difference between Thorndike's method of item scaling and Thurstone's method of absolute scaling is that Thorndike assumed that the variances of the groups are equal. Thurstone criticized this assumption:

it is clear that in order to reduce the overlapping sentences or test items to a common base line or scale it is necessary to make not one but two adjustments. One of these adjustments concerns the means of the several grade groups and this adjustment is made by the Thorndike scaling methods. The second adjustment which is not made by Thorndike concerns the variation in dispersion of the several groups when they are referred to a common scale. (Thurstone, 1927, p. 509)

The results of using the two different methods were presented earlier in Figure 4-6. In his later work, Thorndike did include an adjustment for the range of scores (Thomson, 1940).

Thorndike's views of item-invariant measurement of individuals are presented in several places (Thorndike, 1914; Thorndike et al., 1926). Engelhard (1991) presents a detailed description of Thorndike's approach as applied to the measurement of reading ability (Thorndike, 1914). Essentially, Thorndike recommended using a set of procedures that are very similar to the methods of scoring individual performance


used by Thurstone and Rasch. Thorndike also suggested examining person fit and proposed adjusting reading ability estimates when an individual responded in an inconsistent manner to the test items.

COMPARISON AND DISCUSSION OF THREE MEASUREMENT THEORIES

The comparisons of the major similarities and differences among the measurement theories of Thorndike, Thurstone and Rasch are summarized in Tables 4-1 and 4-2. Table 4-1 presents a summary comparison of their views related to sample-invariant item calibrations, while Table 4-2 presents issues related to the item-invariant measurement of individuals. These issues are discussed in detail in two earlier articles (Engelhard, 1984, 1991).

In general terms, it is clear that Thorndike, Thurstone, and Rasch were all working within a common scaling tradition. They based many of their proposed methods for calibrating test items and measuring individuals on statistical advances made within the field of psychophysics. One of the differences between psychophysics and psychometrics is that the independent variable is usually an observable variable in psychophysics, whereas in psychometrics the

Table 4-1. Comparison of Thorndike, Thurstone, and Rasch on Major Issues Related to Sample-Invariant Item Calibration

Issue                                      Thorndike                  Thurstone                            Rasch
Recognized importance of item invariance   Yes                        Yes                                  Yes
Utilized the latent trait concept          Yes                        Yes                                  Yes
Transformation of percent correct          PE values                  Normal Deviates                      Logits
Level of analysis                          Group                      Group                                Individual
Assumed distribution of ability            Normal                     Normal                               None Required
Tests of fit                               Model to Data              Model to Data                        Data to Model
Number of adjustments                      1                          2                                    3
Item difficulties (Scale values)           $d_{ig} = M_g + x_{ig}$    $d_{ig} = \mu_g + x_{ig}\sigma_g$    $d_i = M + x_i Y$
Person measurement                         Separate Process           Separate Process                     Simultaneous Process

Note. From "Thorndike, Thurstone and Rasch: A comparison of their methods of scaling psychological and educational tests" by G. Engelhard, 1984, Applied Psychological Measurement, 8(1), p. 29. Copyright 1984 by Applied Psychological Measurement Inc. Reproduced by permission.


Table 4-2. Comparison of Thorndike, Thurstone, and Rasch on Major Issues Related to Item-Invariant Measurement of Individuals

Issue                                                  Thorndike          Thurstone          Rasch
Recognized importance of item-invariant measurement    Yes                Yes                Yes
Utilized concept of latent variable scale              Yes                Yes                Yes
Avoided using raw scores                               Yes                Yes                Yes
Used person response curves                            Yes                Yes                Yes
Had formal probabilistic model                         No                 No                 Yes
Used standard errors for ability estimates             No                 No                 Yes
Scoring criterion                                      80%                50%                50%
Flagged inconsistent response patterns                 Yes (ad hoc)       No                 Yes (theory)
Item calibration                                       Separate Process   Separate Process   Simultaneous Process

Note. From "Thorndike, Thurstone and Rasch: A comparison of their approaches to item-invariant measurement" by G. Engelhard, 1991, Journal of Research and Development in Education, 24(2), p. 55. Copyright 1991 by College of Education, The University of Georgia. Reprinted by permission.

construct is usually unobservable. As this construct is not directly observable, these three psychometricians used the idea of a latent continuum to represent this unobservable variable.

Although they all held similar positions on many measurement issues as highlighted in Tables 4-1 and 4-2, there are also several important differences between the conceptualizations of Thorndike and Thurstone as compared to the views of Rasch. One of the major differences is the recognition by Rasch that measurement models can and should be developed based on the responses of individuals to single test items. This focus on the individual, rather than on groups, allowed Rasch to avoid making unnecessary assumptions regarding the distribution of abilities that were needed by both Thorndike and Thurstone. As pointed out earlier, Thorndike's method of scaling test items and Thurstone's method of absolute scaling were both based on the assumption that abilities were normally distributed.

By using the individual and not the group as the level of analysis, Rasch invented measurement models that are capable of providing estimates of the location of both items and individuals on a latent variable continuum simultaneously. This approach also allowed Rasch to develop probabilistic models rather than deterministic ones for modelling the probability of each individual succeeding on a particular test item as a function of his or her ability and the item difficulties. This probabilistic relationship is clearly shown in the familiar S-shaped item characteristic curves. Further, by simultaneously including item calibration and individual measurement within one model, he was able to derive "conditional" estimates of these parameters, which provide a framework for determining whether or not invariance has been achieved.
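The probabilistic relationship described above can be illustrated with a minimal sketch. It assumes the dichotomous Rasch model in its usual logistic form; the function names and the numerical values are hypothetical and do not come from the chapter. The sketch shows one sense in which item comparisons are invariant: the difference in log-odds of success between two items is the same for every person, whatever that person's ability.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of success under the Rasch model (both arguments in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def item_contrast(ability, difficulty_a, difficulty_b):
    """Difference in log-odds of success on two items for one person."""
    return (log_odds(rasch_probability(ability, difficulty_a))
            - log_odds(rasch_probability(ability, difficulty_b)))

# Hypothetical values: two items with difficulties -0.5 and 1.0 logits,
# compared for persons of low, middle, and high ability.
contrasts = [item_contrast(b, -0.5, 1.0) for b in (-2.0, 0.0, 2.0)]
print(contrasts)  # each value equals the difficulty difference, up to rounding
```

Because the ability term cancels in the contrast, the comparison of the two items does not depend on which persons happen to be used, which is the property the chapter refers to as sample-invariant item calibration.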

SUMMARY

Progress is as difficult to define within the field of measurement as in any other field of study (Donovan, Laudan, & Laudan, 1988; Laudan, 1977). The analysis presented in this chapter suggests that Rasch's work provides a theoretical and statistical framework for the practical realization of invariant measurement that was sought by both Thorndike and Thurstone. The simultaneous inclusion of both ability and item difficulty within a probabilistic model defined at the individual level of analysis provided a general framework in which item and person parameters can be estimated separately. Rasch was able to use recent advances in statistics, such as the concept of sufficiency developed by Fisher (1925), to propose an approach to measurement that provides practical solutions to many testing problems related to invariance.

This chapter is part of a larger program of research related to the history and philosophy of measurement theory. The overall purposes of this research are to identify basic measurement problems and to describe how these measurement problems are addressed by major measurement theorists. As pointed out earlier, many of the measurement problems that are faced today are not new. Through the use of historical and comparative perspectives, it is possible to gain a better understanding of both the measurement problems themselves and of the progress that has been made toward the solution of these problems. Some of the perennial measurement problems in the behavioral sciences can be viewed as part of the quest for invariant measurement as described in this paper.

Another related concept that was not examined in this presentation is unidimensionality. A historical and comparative analysis of this concept and of its development within scaling theory along the lines used in this chapter would be an important contribution to the knowledge of progress in measurement theory.
This chapter has focused on the concept of invariance as it has appeared within the context of measurement theory. Invariance can also be viewed more broadly as the quest for generality in science. If science is viewed in its simplest form as a series of questions and answers, then invariance addresses the problem of whether or not answers are comparable over methods and groups. The concept of invariance within educational and psychological research can also be expanded to include first, second, and higher order invariances. For example, invariances of the first order might deal with mean differences between groups on a variable such as mathematics anxiety. A second order concern might be whether or not the correlations between mathematics achievement and anxiety are invariant over gender, social class, and race groups. Higher order invariances might relate to the generalizability of a system of interrelationships among more than two variables.

There are several areas for future research related to the manner in which the concept of invariance appears within other measurement theories that are not within the scaling tradition but derive from the test theory tradition. Some illustrative questions are: How does the work on test theory relate to the quest for invariance within scaling theory? Can the work of Spearman be viewed as a search for an invariant ranking of individuals regardless of time of administration and instruments used? Can the work of Cronbach and others on generalizability theory be viewed as an attempt to identify and examine sources of error variance in test scores which are related to the concept of "invariance" in educational and psychological tests as presented in this chapter? What about invariance within the framework of two- and three-parameter item response models? What about Guttman's research on psychometrics? What are the explicit connections of classical measurement concepts, such as reliability and validity, to the concept of invariance as presented in this chapter? How does invariance relate to unidimensionality?

In summary, the problem of invariance is of fundamental importance for the development of meaningful measures in education and psychology.
Item-invariant estimates of individual abilities and sample-invariant estimates of item difficulties are essential in order to realize the advantages of objective measurement. The conditions for objective measurement correspond to the concept of invariance as developed in this paper. The conditions for objective measurement are as follows:

First, the calibration of measuring instruments must be independent of those objects that happen to be used for the calibration. Second, the measurement of objects must be independent of the instrument that happens to be used for the measuring. (Wright, 1968, p. 87)

This chapter provides a historical and substantive review of the problems related to invariant measurement. It also illustrates the progress that has been made toward solving measurement problems related to invariance. Further, this chapter contributes to an appreciation of Rasch's accomplishments and of the elegance of his approach to problems related to invariant measurement. As pointed out by Andrich (1988b), Rasch's achievements did not occur in a "historical vacuum" (p. 13). This chapter illustrates the continuity and progress that is evident within the measurement theories of Thorndike, Thurstone, and Rasch.

REFERENCES

Andrich, D. (1978). Relationships between the Thurstone and Rasch approaches to item scaling. Applied Psychological Measurement, 2, 449-460.
Andrich, D. (1988a). Rasch models for measurement. Newbury Park, CA: Sage.
Andrich, D. (1988b, April). A scientific revolution in social measurement. Paper presented at the annual meeting of the American Educational Research Association in New Orleans.
Andrich, D. (1988c). Thurstone scales. In J.P. Keeves (Ed.), Educational research, methodology, and measurement: An international handbook. Oxford: Pergamon Press.
Bock, R.D., & Jones, L.V. (1968). The measurement and prediction of judgement and choice. San Francisco: Holden-Day.
Brennan, R.L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing Program.
Cattell, J.K. (1893). Mental measurement. Philosophical Review, 2, 316-332.
Clifford, G.J. (1984). Edward L. Thorndike: The sane positivist. Middleton, CT: Wesleyan University Press. (Original work published 1968.)
Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability of scores and profiles. New York: Wiley.
Donovan, A., Laudan, L., & Laudan, R. (Eds.). (1988). Scrutinizing science: Empirical studies of scientific change. Boston: Kluwer Academic Publishers.
Engelhard, G. (1984). Thorndike, Thurstone and Rasch: A comparison of their methods of scaling psychological tests. Applied Psychological Measurement, 8, 21-38.
Engelhard, G. (1988, April). Thorndike's and Wood's principles of educational measurement: A view from the 1980's. Paper presented at the annual meeting of the American Educational Research Association in New Orleans (ERIC Document Reproduction Service No. ED 295 961).
Engelhard, G. (1991). Thorndike, Thurstone and Rasch: A comparison of their approaches to item-invariant measurement. Journal of Research and Development in Education, 24(2), 45-60.
Engelhard, G., & Osberg, D.W. (1983). Constructing a test network with a Rasch measurement model. Applied Psychological Measurement, 7, 283-294.
Fisher, R.A. (1925). Statistical methods for research workers. Edinburgh: Oliver and Boyd.
Guilford, J.P. (1936). Psychometric methods. New York: McGraw-Hill Book Company, Inc.
Gulliksen, H. (1950). Theory of mental tests. New York: J. Wiley and Sons.
Jones, L.V. (1960). Some invariant findings under the method of successive intervals. In H. Gulliksen & S. Messick (Eds.), Psychological scaling: Theory and applications (pp. 7-20). New York: John Wiley and Sons.
Jones, L.V., & Appelbaum, M.I. (1989). Psychometric methods. Annual review of psychology, 40, 23-43.
Jöreskog, K.G., & Sörbom, D. (1986). LISREL VI: Analysis of linear structural relationships by maximum likelihood, instrumental variables, and least squares methods. Mooresville, IN: Scientific Software.
Laudan, L. (1977). Progress and its problems: Toward a theory of scientific change. Berkeley, CA: University of California Press.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694.
Loevinger, J. (1965). Person and population as psychometric concepts. Psychological Review, 72, 143-155.
Lumsden, J. (1976). Test theory. Annual review of psychology, 27, 251-280.
Masters, G.N., & Wright, B.D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529-544.
Mislevy, R.J. (1986). Recent developments in the factor analysis of categorical variables. Journal of Educational Statistics, 11, 3-31.
Mosier, C.I. (1940). Psychophysics and mental test theory: Fundamental postulates and elementary theorems. Psychological Review, 47, 355-366.
Mosier, C.I. (1941). Psychophysics and mental test theory II: The constant process. Psychological Review, 48, 235-249.
Muraki, E., & Engelhard, G. (1985). Full-information item factor analysis: Applications of EAP scores. Applied Psychological Measurement, 9, 417-430.
Rasch, G. (1953). On simultaneous factor analysis in several populations. Uppsala Symposium on Psychological Factor Analysis (pp. 65-71). Nordisk Psykologi's Monograph Series, 3.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In J. Neyman (Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (pp. 321-333). Berkeley, CA: University of California Press.
Rasch, G. (1966a). An individualistic approach to item analysis. In P.F. Lazarsfeld and N. Henry (Eds.), Readings in mathematical social science (pp. 89-107). Chicago: Science Research Associates.
Rasch, G. (1966b). An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 19, 49-57.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press. (Original work published 1960.)
Shavelson, R.J., Webb, N.M., & Rowley, G.L. (1989). Generalizability theory. American Psychologist, 44, 922-932.
Spearman, C. (1904). "General intelligence," objectively determined and measured. American Journal of Psychology, 15, 201-293.
Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
Stevens, S.S. (1951). Mathematics, measurement, and psychophysics. In S.S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: Wiley.
Thomson, G.H. (1940). The nature and measurement of the intellect. Teachers College Record, 41, 726-750.
Thorndike, E.L. (1904). An introduction to the theory of mental and social measurements. New York: Teachers College, Columbia University.
Thorndike, E.L. (1910). Handwriting. Teachers College Record, 11, 83-175.
Thorndike, E.L. (1914). The measurement of ability in reading. Teachers College Record, 15, 207-277.
Thorndike, E.L. (1918). The nature, purposes, and general methods of measurements of educational products. In G.M. Whipple (Ed.), The seventeenth yearbook of the national society for the study of education. Part II, The measurement of educational products. Bloomington, IL: Public School Publishing Company.
Thorndike, E.L. (1921). Measurement in education. Teachers College Record, 22, 371-379.
Thorndike, E.L., Bergman, E.O., Cobb, M.V., & Woodyard, E. (1926). The measurement of intelligence. New York: Bureau of Publications, Teachers College, Columbia University.
Thurstone, L.L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 15, 433-451.
Thurstone, L.L. (1926). The scoring of individual performance. Journal of Educational Psychology, 17, 446-457.
Thurstone, L.L. (1927). The unit of measurement in educational scales. Journal of Educational Psychology, 18, 505-524.
Thurstone, L.L. (1928a). Comment by Professor L.L. Thurstone. Journal of Educational Psychology, 19, 117-124.
Thurstone, L.L. (1928b). Scale construction with weighted observations. Journal of Educational Psychology, 19, 441-453.
Thurstone, L.L. (1947). Multiple-factor analysis: A development and expansion of the vectors of mind. Chicago: The University of Chicago Press.
Thurstone, L.L. (1959). The measurement of values. Chicago: The University of Chicago Press.
Trabue, M.R. (1916). Completion-test language scales. Contributions to Education (No. 77). New York: Columbia University, Teachers College.
Travers, R.M.W. (1983). How research has changed American schools: A history from 1840 to the present. Kalamazoo, MI: Mythos Press.
Wright, B.D. (1968). Sample-free test calibration and person measurement. Proceedings of the 1967 invitational conference on testing problems. Princeton, NJ: Educational Testing Service.
Wright, B.D. (1988). Rasch measurement models. In J.P. Keeves (Ed.), Educational research, methodology, and measurement: An international handbook. Oxford: Pergamon Press.
Wright, B.D., & Masters, G. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.
Wright, B.D., & Stone, M.H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.

part II

Practice

chapter 5

Computer-Adaptive Testing: A National Pilot Study

Mary E. Lunz
American Society of Clinical Pathologists

Betty A. Bergstrom
Computer Adaptive Technologies

The purpose of educational measurement is to inform educational decision making by providing estimates of an individual's knowledge and skill. For certification and licensure, this means making minimum competency pass/fail decisions. In recent years, computers have become more versatile and more accepted for the development and delivery of examinations. One of the most interesting and potentially advantageous methods for certification boards and examinees is computer-adaptive testing (CAT).

The adaptive algorithms for item selection usually depend on item response theory (IRT) (Rasch, 1960/1980; Lord & Novick, 1968; Wright & Stone, 1979). Items in the bank are calibrated to a benchmark scale on which a pass/fail point is established. The adaptive algorithm selects items that provide the most information about the examinee given the current ability measure estimated from responses to all of the previous items. Many studies (Weiss, 1983, 1985; Weiss & Kingsbury, 1984; McKinley & Reckase, 1980; Olsen, Maynes, Slawson, & Ho, 1986) have explored computer-adaptive tests and have found that because maximum information is gained from each item administered, lower measurement error and higher reliability can be achieved using fewer items. While this is advantageous from a psychometric perspective, it presents the examinee with a testing experience that is quite different from traditional multiple choice tests.
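The link between well-targeted items and lower measurement error can be made concrete with a short sketch. It assumes the dichotomous Rasch model, under which the information an item contributes is p(1 - p), and the standard asymptotic result that the standard error of the ability estimate is the reciprocal square root of the total information. The function names and the two hypothetical 50-item tests are illustrative only and are not taken from the study.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model (logit units)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def item_information(ability, difficulty):
    """Information from one dichotomous Rasch item: p * (1 - p)."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)

def standard_error(ability, difficulties):
    """Asymptotic standard error of the ability estimate: 1 / sqrt(total information)."""
    total = sum(item_information(ability, d) for d in difficulties)
    return 1.0 / math.sqrt(total)

# Hypothetical comparison for an examinee at 0.0 logits:
# 50 items targeted exactly at the examinee versus 50 items that are 2 logits too easy.
sem_targeted = standard_error(0.0, [0.0] * 50)
sem_off_target = standard_error(0.0, [-2.0] * 50)
print(round(sem_targeted, 2), round(sem_off_target, 2))
```

Item information is largest (0.25) when item difficulty equals ability, that is, when the probability of success is .50, so the targeted test reaches a given standard error with fewer items than the off-target test.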


Why a National Study

Computer-adaptive testing is attractive because of the convenience to examinees with regard to scheduling and reporting, potentially shorter tests, and increased availability of opportunities to challenge the test. Advantages to the certification board include improved security and data collection, better opportunity to control cheating, and cost savings with regard to committee expenses, printing, and shipping.

Computer adaptive tests, however, are different from traditional paper and pencil certification examinations. Written certification examinations usually include 200 to 500 items while computer adaptive examinations are usually shorter, including fewer than 100 items. Paper and pencil tests are administered simultaneously. Current practice suggests that certification examinations begin with an easier item, while computer adaptive tests usually begin by presenting an item of medium difficulty. Most examinees get 70 percent or more of the items correct on a certification examination, while a computer adaptive test is usually targeted at 50 percent probability of correct response. On a traditional test, examinees can review and change answers, but on a computer adaptive test this option may not be available.

The concern is how examinees and educators react to this innovation in test administration. Are examinees willing to believe in the IRT methodology? Even more mundane, can examinees follow the directions for entering responses into the computer, read items from the computer monitor, and look at a separate illustration book? Will examinees panic at the thought of a computer-administered test? Will examinees perform poorly when they have a harder than usual test, or when they are not given the opportunity to review their answers? These concerns could not be addressed adequately using simulated data, which effectively removes the human element from the evaluation process.
It therefore seemed mandatory to verify the known and postulated psychometric, psychological, and social attributes of computer adaptive testing. Thus a national pilot study was undertaken.

METHODS AND RESULTS

Item Precalibration

A paper and pencil examination was given to a sample of students from 57 medical technology programs. From the analysis of these data an item bank was constructed that met the test specifications for the traditional paper and pencil certification examination. The items were calibrated using the Rasch model (Rasch, 1960/1980; Wright & Stone, 1979). Inappropriate items and poorly fitting items were deleted before the calibrated item bank of 726 items was established. The stability of the item precalibrations is discussed in detail in the chapter entitled "The Equivalence of Rasch Item Calibrations and Ability Estimates Across Modes of Administration."

Data Collection

Two hundred thirty-eight medical technology programs from across the country participated in the second phase. Program directors agreed to administer, under secure conditions, a computer adaptive test and a written test composed of 109 items from the computer adaptive test pool, to their students who were eligible to take the certification examination. Comparable pass/fail decisions on paper-and-pencil and adaptive tests were made (Lunz & Bergstrom, 1991). The calibrated item bank of 726 items was used to construct computer-adaptive tests tailored to the current ability of each student. An individual computer disk was available for each student. The computer-adaptive test could be administered in a computer center to the group or individually in a private office as long as security was maintained. Useable data were gathered from approximately 1,077 students; 83 percent were white and 81 percent were female, which is a typical population mix for this certification examination.

Appropriateness of the Rasch Model for CAT

The appropriateness of the Rasch model over other IRT models for computer-adaptive testing has been confirmed by several studies. Wainer (1983) states that when items are targeted to the ability of the examinee, items that are very difficult for an examinee are not presented. Thus the incidence of guessing is minimal and the estimation of a lower asymptote within the confines of CAT is generally impractical.
Wainer (1983) also notes that "inclusion of slopes in the estimation model will result in a very optimistic estimate of the accuracy of the ability estimate." Sample sizes in this study were relatively small, but the Rasch model item calibrations have been found to be robust with small samples (Lord, 1983). Also, there is evidence that person measures estimated with the Rasch and the two- and three-parameter models correlate highly (.99) when tests are administered under a computer adaptive algorithm (Olsen et al., 1986).

The Rasch model (Rasch, 1960/1980) was used to calibrate items and estimate person measures. The PROX method was used for item selection (Wright & Stone, 1979) in the adaptive algorithm. The Rasch model calibrates item difficulties to a log-linear scale, $\log\{\exp(B - D)/[1 + \exp(B - D)]\}$. Item difficulties are expressed in log-odds units (logits).

Fit of the Data to the Rasch Model

The fit of the data to the Rasch model was verified by examining the infit statistic for the calibrated items (Wright & Masters, 1982). For each person/item encounter, the observed response was compared to the modeled expected response. Misfitting items were removed from the item bank. When data fit the Rasch model, the infit statistic (the mean of the standardized squared residual, weighted by its variance) has a value near 0 and a standard deviation near 1.0. For the 726-item pool, the mean item infit was .04 with a standard deviation of 1.01.

CAT Algorithm

The computer adaptive testing model used in this study has the following characteristics. It is designed as a mastery model (Weiss & Kingsbury, 1984) to determine whether a person's estimated ability level is above or below a preestablished criterion. Kingsbury and Houser (1990) have shown that an adaptive testing procedure that provides maximum information about the examinee's ability will provide a clearer indication that the examinee is above or below the pass/fail point than a test that peaks the information at the pass/fail point. The CAT ADMINISTRATOR program (Gershon, 1989) constructed computer-adaptive tests following the test specifications of the traditional paper-and-pencil certification examination (see Table 5-1). This means that the item with the most appropriate level of difficulty, within a given subtest, was presented to the examinee.
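Item selection within a subtest can be sketched as follows. This is a simplified illustration under stated assumptions, not the CAT ADMINISTRATOR program itself: the target difficulty is taken to be the current ability estimate, the .10-logit window is the one this chapter reports for the study, and the fallback to the nearest item when the window is empty is an assumption. The function name, the bank layout, and the four items are hypothetical.

```python
import random

def select_item(target, bank, subtest, used, window=0.10, rng=random):
    """Choose the next item for one subtest.

    bank   : list of (item_id, subtest, difficulty_in_logits) tuples
    used   : set of item ids already administered
    window : half-width in logits around the target difficulty
    """
    pool = [(item_id, d) for item_id, s, d in bank
            if s == subtest and item_id not in used]
    if not pool:
        raise ValueError("no unused items left in subtest " + subtest)
    near = [item_id for item_id, d in pool if abs(d - target) <= window]
    if near:
        # Random choice among well-targeted items limits item overexposure.
        return rng.choice(near)
    # Assumption: if no item lies inside the window, take the nearest one.
    return min(pool, key=lambda pair: abs(pair[1] - target))[0]

# Hypothetical mini-bank; subtest names follow Table 5-1.
bank = [
    ("M1", "Microbiology", -0.50),
    ("M2", "Microbiology", 0.35),
    ("M3", "Microbiology", 0.45),
    ("C1", "Chemistry", 0.40),
]
choice = select_item(0.40, bank, "Microbiology", used=set(), rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when checking a selection routine against a known item bank.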
In the first 50 items, blocks of 10 items were administered from subsets 1-4 and blocks of 5 items were administered from subsets 5 and 6. After 50 items, blocks of 4 items (subsets 1-4) and blocks of 2 items (subsets 5 and 6) were administered. Subset order was selected randomly by the computer algorithm. Maurelli and Weiss (1983) found subtest order to have no effect on the psychometric properties of an achievement test battery. Items were chosen at random from unused items within .10 logits of

COMPUTER-ADAPTIVE TESTING: A NATIONAL PILOT STUDY Table 5 - 1

107

Item Bank Description

Subtest

Test Plan Distribution*

Number of Items in Bank

Easiest Item

Mean

Hardest Item

SD

Microbiology Blood Banking Chemistry Hematology Body Fluids Immunology

20% 20% 20% 20% 10% 10%

147 165 142 135 72 65

-2.89 -2.21 -3.61 -2.80 -2.24 -2.78

-.06 -.07 -.07 -.05 -.09 .25

2.38 2.94 2.97 2.97 3.84 2.04

.96 1.00 1.06 .97 .97 .96

100%

726

-3.61

-.02

3.84

1.00

Bank Scale

*The test plan distribution for computer-adaptive tests was the same as the test plan for the traditional fixed-length written certification examination.

the targeted item difficulty within the specified content area. While the examinee considered the item presented, the computer selected two items: one that would yield maximum information should the current item be answered incorrectly, and another that would yield maximum information should the current item be answered correctly. This procedure ensured that there was no lag time before the next item was presented.

The minimum test length was 50 items and the maximum test length was 240 items. All examinees had four hours to complete the computer test. The test stopped when the examinee achieved a measure 1.3 x SEM (90% confidence, one-tailed test) above or below the pass point of .15 logits on the bank scale. Figure 5-1 shows an examinee's test map. Note that by item 50, the error band is well above the pass point, making this examinee a clear pass with greater than 90 percent confidence in the accuracy of the decision. If an examinee challenged 240 items and a pass/fail decision could not be made, the test stopped and a decision was made with less than 90 percent confidence, based on his or her measure at that point.

Figure 5-1. Computer-Adaptive Test Examinee Map

Experimental Conditions and Results

The computer-adaptive tests also incorporated varying combinations of experimental test conditions. These test conditions were designed to assess the known and assumed attributes of computer-adaptive testing, based on the assumption that some modifications to the "theoretically perfect computer-adaptive test" might be required to make it practical and acceptable to examinees. The goal was to determine which conditions, if any, make a difference in examinee performance.

Students were randomly assigned to a combination of test conditions. This caused the number of examinees included in each analysis to vary. Each study, however, included a reasonable number of examinees, comparable to typical computer adaptive test studies. The test conditions were transparent to the examinee, with the exception of the "review" condition, which required special instructions. Analysis of covariance, with the written test as a covariate, was performed for each of the experimental conditions.

Unidimensionality

The first condition related to unidimensionality. The certification board outlines the domain of practice that must be demonstrated by the examinee. The domain breaks down into logical subsets for purposes of education and evaluation. A student must be able to demonstrate proficiency across the domain. Thus the activities in the six subtests are related conceptually, as well as in practice, so that they must be tested using a single certification measurement instrument. It is the belief of this certification board and of those who practice in this field that the subtest areas are part of a single dimension. Students must demonstrate competence across subtests, even though some variance in their performance among subtests is expected.

The performance of examinees is positively correlated across subtests. The correlations are highly significant and range between .20 and .60. The subtests had statistically comparable mean item difficulties (df = 5, F = 1.36, P = .24), standard deviations, and ranges, so that adaptive tests with comparable content coverage could be constructed for examinees with differing ability levels (see Table 5-1). For 645 students pass/fail decisions were based on the total test measure, while for the other 432 students pass/fail decisions were made for each subtest.
Table 5-2 shows the results of the comparison of examinee performance when decisions were made by subtest or total test. There was no significant difference in mean performance (df = 1, F = 1.43, P = .23). Table 5-3 shows the percentage of examinees passing each subtest when decisions were made by subtest and total measure. The overall pass rate is about 4 percent higher when the decision is based on total test performance. The remaining conditions are reported only for examinees for whom decisions were made on the total test measure (N = 645).
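The stopping rule described earlier in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the logic of the CAT ADMINISTRATOR program: the function name, the argument names, and the treatment of a measure exactly equal to the pass point at the maximum test length are assumptions. The pass point (.15 logits), the 1.3 x SEM criterion, and the 50- and 240-item limits are taken from the text.

```python
PASS_POINT = 0.15   # pass point in logits on the bank scale (from the text)
Z_CRIT = 1.3        # 1.3 x SEM, the chapter's 90% one-tailed criterion
MIN_ITEMS = 50      # minimum test length (from the text)
MAX_ITEMS = 240     # maximum test length (from the text)


def stopping_decision(measure, sem, n_items):
    """Return 'pass', 'fail', or 'continue' for the current ability estimate.

    measure -- current ability estimate in logits (hypothetical argument name)
    sem     -- standard error of that estimate in logits
    n_items -- number of items answered so far
    """
    if n_items < MIN_ITEMS:
        return "continue"
    # Clear decision: the whole error band lies on one side of the pass point.
    if measure - Z_CRIT * sem > PASS_POINT:
        return "pass"
    if measure + Z_CRIT * sem < PASS_POINT:
        return "fail"
    if n_items >= MAX_ITEMS:
        # Forced decision with less than 90 percent confidence, based on the
        # measure at that point. Treating a tie as a pass is an assumption.
        return "pass" if measure >= PASS_POINT else "fail"
    return "continue"
```

For example, an examinee with a measure of .60 logits and an SEM of .25 after 50 items has an error band whose lower edge (.275) is above the pass point, so the sketch returns a clear pass; the same estimate after only 30 items returns "continue" because the minimum length has not been reached.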

Table 5-2. Comparison of Examinee Measures When Total Test or Subtest Performance Is the Criterion for Pass/Fail Decisions

                 Decision by Total Test   Decision by Subtest
N examinees      645                      432
Mean ability     .230 (.224)*             .191 (.196)*
SD               .57                      .46

df = 1, F = 1.43, P = .232

Reported in logits
*Adjusted means based on covariate analysis

Table 5-3. Comparison of Percentage of Examinees Passing Each Subtest When Total Measure or Subtest Measure Is the Criterion

Subtest          Decision by Total        Decision by Subtest
                 (% Examinees Passing)    (% Examinees Passing)
Microbiology     49                       49
Blood Banking    61                       59
Chemistry        54                       54
Hematology       53                       53
Body Fluids      52                       49
Immunology       48                       46
Total            56                       52

Targeted Level of Test Difficulty

Psychometricians postulate that a 50 percent probability of a correct response provides the best measurement of ability. Most written tests are, in fact, targeted to a 70 percent or even higher probability of correct response. The concerns are (a) how students, accustomed to getting high scores, react to harder tests; and (b) whether the item bank can provide an efficient test at a specifically targeted level of difficulty across student ability levels. Students were randomly assigned to test conditions for 50 percent, 60 percent, and 70 percent probability of a correct response.

Table 5-4 shows the results of controlling the probability of a correct response. There was no significant difference in examinee performance due to controlled probability of a correct response (df = 2, F = .08, P = .926). These results suggest that computer adaptive tests can be targeted at 50%, 60%, or 70% probability of a correct response without affecting examinee performance. Targeting to 60% or 70% may provide a psychological advantage for the examinee. It may also be useful for certification boards that have existing item banks created for easier paper and pencil tests. For further details on altering test difficulty see Bergstrom, Lunz, and Gershon (1992).

Table 5-4. Comparison of Examinee Measures Based on Targeting Condition

                 Probability of a Correct Response
                 50%              60%              70%
N examinees      201              232              212
Mean ability     .284 (.238)*     .168 (.224)*     .246 (.236)*
SD               .525             .558             .622

df = 2, F = .08, P = .926

Reported in logits
*Adjusted means based on covariate analysis

Minimum Test Length

A third condition was designed to address test length. Content experts often feel that long tests are necessary to cover the field. However, the principles of sampling suggest that well-targeted items will yield comparable results. Most examinees (79%) were allowed to stop after 50 items if a pass/fail decision with 90 percent confidence could be made. Some examinees (21%) were placed in a "long" test condition that required a minimum of 100 items even if a decision with 90 percent confidence could have been made with fewer items. Tests varied in length depending upon the performance of the examinee and the test length condition. Table 5-5 shows the results of examinee performance by minimum test length. Although the group means are not significantly different (df = 1, F = .82, P = .366), those examinees in the shorter minimum test condition performed slightly better.
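The targeting and test-length conditions above both follow from the Rasch model, under which the probability of a correct response depends only on the difference between examinee ability and item difficulty. The sketch below is illustrative rather than a description of the authors' software: the function names are hypothetical, and the last function makes the idealized assumption that every item is targeted at exactly the same probability of success. It shows how an item difficulty can be chosen to give a 50, 60, or 70 percent chance of success, and why easier targeting costs some efficiency: Rasch item information is p(1 - p), which is largest at p = .50.

```python
import math


def rasch_probability(ability, difficulty):
    """Rasch probability of a correct response (both arguments in logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def targeted_difficulty(ability, p_correct):
    """Item difficulty that gives this examinee the requested success rate."""
    return ability - math.log(p_correct / (1.0 - p_correct))


def item_information(p_correct):
    """Rasch item information for an item answered with probability p_correct."""
    return p_correct * (1.0 - p_correct)


def items_needed(target_sem, p_correct):
    """Items needed to reach target_sem if every item had the same p_correct.

    Idealized assumption: SEM = 1 / sqrt(n * p * (1 - p)).
    """
    return math.ceil(1.0 / (target_sem ** 2 * item_information(p_correct)))
```

Under these assumptions, information per item falls from .25 at 50 percent targeting to .24 at 60 percent and .21 at 70 percent, so a test targeted at 70 percent needs somewhat more items to reach the same standard error.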

Table 5-5. Comparison of Examinee Measures Based on Minimum Test Length

                 Min L = 50       Min L = 100
N examinees      428              217
Mean ability     .262 (.230)*     .167 (.199)*
SD               .580             .549

df = 1, F = .82, P = .366

Reported in logits
*Adjusted means based on covariate analysis

Opportunity to Review

Examinees often argue that they have the "right" to review their tests, and, indeed, have been trained to do so. Psychometricians argue that allowing examinees to change responses in a computer adaptive test decreases the information value of each item and therefore increases the error of measurement. A fourth condition involved the ability of examinees to review their test and alter responses. Examinees randomly placed in the review condition were required to answer items when they were presented but were allowed to review and change responses after they completed the test. The other examinees (nonreview condition) were not allowed to review items and alter responses.

Table 5-6 shows the comparison of examinee measures for the review and nonreview conditions. There was no significant difference in mean examinee performance (df = 1, F = .80, P = .37), although examinees who were allowed to review had slightly higher mean measures. No examinee changed status from pass to fail or fail to pass as a result of changing responses. Some responses were changed from wrong to right, while others were changed from right to wrong or wrong to wrong. The psychometric issues involving review are discussed in Lunz, Bergstrom, and Wright (1991).

Table 5-6. Comparison of Ability Measures Based on Review and Nonreview Test Conditions

                 Review           Nonreview
N examinees      109              536
Mean ability     .253 (.258)*     .225 (.220)*
SD               .546             .576

df = 1, F = .80, P = .37

Reported in logits
*Adjusted means based on covariate analysis

Reliability of Alternate Test Forms

A fifth condition involved reliability of alternate test forms. One assumption of computer adaptive testing is that comparable decisions will be made even though examinees are tested with different items, because all tests are equated to the same scale. Some examinees were placed in a condition that forced them to take two tests, one immediately following the other, without a break. In fact, the examinees did not know they were taking two unique tests. A detailed report of the results follows in Chapter 6, "Reliability of Alternate Computer Adaptive Tests."

REFERENCES

Bergstrom, B.A., Lunz, M.E., & Gershon, R.C. (1992). Altering the level of difficulty in computer adaptive tests. Applied Measurement in Education, 5(4), 137-149.
Gershon, R.C. (1989). CAT ADMINISTRATOR [Computer program]. Chicago: Micro Connections.
Kingsbury, G.G., & Houser, R.L. (1990, March). Assessing the utility of item response models: Computerized adaptive testing. Paper presented to the Annual Meeting of the National Council on Measurement in Education, Boston.
Lord, F.M. (1983). Small N justifies Rasch model. In D.J. Weiss (Ed.), New horizons in testing. New York: Academic Press.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lunz, M.E., & Bergstrom, B.A. (1991). Comparability of decision for computer adaptive and written examinations. Journal of Allied Health, 20(1), 15-23.
Lunz, M.E., Bergstrom, B.A., & Wright, B.D. (1992). The effect of review on student ability and test efficiency for computer adaptive tests. Applied Psychological Measurement, 16(1), 33-40.
McKinley, R.L., & Reckase, M.D. (1980). Computer applications to ability testing. Association for Educational Data Systems Journal, 13, 193-203.
Maurelli, V.A., & Weiss, D.J. (1983). Factors influencing the psychometric characteristics of an adaptive testing strategy for test batteries (Research Report 81-4). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program, Computerized Adaptive Testing Laboratory.
[...] equating of paper-administered, computer-administered and computerized adaptive tests of achievement. Paper presented at the American Educational Research Association Meeting, San Francisco.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original work published 1960.)
Wainer, H. (1983). Are we correcting for guessing in the wrong direction? In D.J. Weiss (Ed.), New horizons in testing. New York: Academic Press.
Weiss, D.J. (1983). New horizons in testing: Latent trait test theory and computerized adaptive testing. New York: Academic Press.
Weiss, D.J. (1985). Final report: Computerized adaptive measurement of achievement and ability (Project NR150-433, N00014-79-CO172). Minneapolis: University of Minnesota.
Weiss, D.J., & Kingsbury, G.G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21(4), 361-375.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B.D., & Stone, M.H. (1979). Best test design. Chicago: MESA Press.

chapter 6

Reliability of Alternate Computer-Adaptive Tests

Mary E. Lunz

American Society of Clinical Pathologists

Betty A. Bergstrom

Computer Adaptive Technologies

Benjamin D. Wright

University of Chicago

When items are IRT calibrated, ability estimation can be independent of the particular items used for measuring (Rasch, 1960/1980; Wright, 1968, 1977). Thus, when all items are calibrated on the same scale, statistically equivalent person measures should result from alternate computer-adaptive tests, regardless of which particular items are administered on each test. This is an essential requirement for successful computer-adaptive testing. If the adaptive item selection algorithm is working properly, and the person's ability has not changed significantly, the mean difficulty of the items presented to that examinee on each test should be statistically equivalent. When the items for two computer-adaptive tests are selected from the item bank, use the same test specifications, and are tailored to the same examinee ability, the two tests should be weakly parallel (Boekkooi-Timminga, 1990). For high-stakes testing, such as certification, where decisions are often permanent, the alternate forms reliability of computer-adaptive tests must be demonstrated prior to implementation of computer-adaptive strategies, since all examinees will take different and uniquely tailored tests.

The traditional index of test performance, reliability, can be applied to alternate computer adaptive tests. The Standards for Educational and Psychological Testing (1985) state that the goal of reliability is to estimate the consistency of scores on alternate tests constructed to defined test specifications. Allen and Yen (1979) define alternate tests as any two test forms that have been constructed to be parallel in content and that also have similar observed score means and variances for equivalent samples. They also state that a correlation between observed scores on alternate forms will produce a good estimate of test reliability when the alternate forms are parallel. While this assumes fixed-length written tests, the basic principle seems applicable to computer-adaptive tests.

Reliability between alternate computer adaptive tests was addressed by Martin, McBride, and Weiss (1983). Scores on two alternate fixed-length forms of adaptive tests correlated at .90 after 30 items were administered. Kingsbury and Weiss (1980) found that alternate forms of a computer-adaptive test resulted in more reliable scores than alternate forms of a traditional pencil-and-paper test (correlations .92 and .88, respectively).

Any subset of items selected adaptively from a calibrated item bank constitutes a test form and should produce statistically equivalent ability measures for an examinee of a given ability (Wright, 1977). Alternate computer-adaptive tests contain different items, but when administered sequentially to the same examinee, they should produce statistically equivalent ability estimates. They should function in parallel because both sets of items are tailored to the same examinee ability using the same test plan.

The purpose of this study is to determine the reliability of alternate test forms administered adaptively. Reliability will be assessed by comparing estimates of examinee ability and pass/fail decisions on alternate computer-adaptive tests.

METHOD

The computer-adaptive testing model used was designed to determine a person's estimated ability level with respect to a preestablished criterion. An alternate test was presented automatically for examinees who were randomly placed in the total test and alternate forms test conditions. One hundred forty-two examinees were placed in this combination of conditions. These 142 examinees took sequential computer-adaptive tests; however, they were not aware that they were taking two separate tests, because the second test began as soon as the first test was completed.

The alternate tests were constructed by the CAT ADMINISTRATOR program (Gershon, 1989), using the same test plan, starting point, and stopping rule. Each test was tailored to the ability of the examinee. Items presented to an examinee on the first test were marked by the computer so they would not be administered to the same examinee on the alternate test. This slightly limited the items available for the second test. Examinees were required to answer each item before another item was presented. The opportunity to review or change answers at a later time was not available. Since the tests were sequential, there was no opportunity for examinees to study between tests. The only possible change in ability could come from the practice gained or the fatigue caused by taking the first test.

These data were analyzed with correlations and paired t-tests of examinee measures on the alternate tests. It was expected that the null hypothesis of no significant difference between examinee measures on the alternate tests would be confirmed. In addition, pass/fail decisions on the alternate tests were compared.
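The marking of first-test items described in this section can be illustrated with a small sketch. It is not the CAT ADMINISTRATOR algorithm, whose internals are not given here: the function name, the dictionary-based item bank, and the rule of choosing the unused item whose difficulty is closest to the target are all assumptions made for illustration, and content-area constraints from the test plan are ignored.

```python
def select_item(bank, target_difficulty, excluded_ids):
    """Pick the unused item whose difficulty is closest to the target.

    bank              -- dict mapping item id to calibrated difficulty (logits)
    target_difficulty -- difficulty the algorithm is currently aiming for
    excluded_ids      -- ids already administered to this examinee, including
                         every item marked during the first test
    """
    candidates = {
        item_id: difficulty
        for item_id, difficulty in bank.items()
        if item_id not in excluded_ids
    }
    if not candidates:
        raise ValueError("no unused items remain in the bank")
    return min(candidates, key=lambda i: abs(candidates[i] - target_difficulty))


# Hypothetical four-item bank: item "B" is the best match for a target of
# 0.12 logits, but if it was used on the first test the alternate test must
# fall back to the next-closest item.
bank = {"A": -0.5, "B": 0.1, "C": 0.2, "D": 1.0}
first_choice = select_item(bank, 0.12, excluded_ids=set())
alternate_choice = select_item(bank, 0.12, excluded_ids={"B"})
```

The fallback item is slightly less well targeted than the one it replaces, which is the sense in which excluding first-test items limits the items available for the second test.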

RESULTS

Pass/Fail Consistency

Table 6-1 presents the pass/fail results for the alternate tests. Sixty-four examinees passed both computer-adaptive tests, while 56 examinees failed both computer-adaptive tests. This is an 85 percent consistency rate. Fifteen examinee measures were within 1.3 standard errors of measurement of the pass point for one or both tests. This means that the decision to pass or fail was made with less than 90 percent confidence in its accuracy. When the 15 examinees for whom decisions with 90 percent confidence could not be made were excluded, 94 percent of the examinees earned the same decision on the alternate tests.

Table 6-1. Pass/Fail Consistency of Alternate Computer-Adaptive Tests

All Examinees
                 Test 1 Pass   Test 1 Fail   Total
Test 2 Pass      64            7             71
Test 2 Fail      15            56            71
Total            79            63            142

Unclear decisions were made for 15 examinees: 3 = F/P, 12 = P/F

Examinees with Clear* Pass/Fail Decisions
                 Test 1 Pass   Test 1 Fail   Total
Test 2 Pass      64            4             68
Test 2 Fail      3             56            59
Total            67            60            127

*Clear decision = 90% confidence, 1.3 x SE above or below MPS

Comparison of Examinee Ability Measures

The observed correlation of the 142 pairs of examinee measures for the alternate tests was .79. When this correlation is corrected for measurement error it becomes .96. Table 6-2 gives summary statistics for examinee ability measures on test 1 and test 2. The mean difference in the 142 pairs of ability measures is -.03 logits. Results of a paired t-test indicate no significant differences between examinee measures on the alternate tests (t = .87, df = 141, p = .39). Figure 6-1 shows the plot of examinee measures on the alternate tests.
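The consistency rates quoted in this section can be checked directly from the cell counts in Table 6-1. The short sketch below uses only those counts; the function and variable names are illustrative.

```python
def consistency_rate(pass_both, pass1_fail2, fail1_pass2, fail_both):
    """Proportion of examinees who earned the same decision on both tests."""
    total = pass_both + pass1_fail2 + fail1_pass2 + fail_both
    return (pass_both + fail_both) / total


# Cell counts from Table 6-1.
all_examinees = consistency_rate(64, 15, 7, 56)   # 120 of 142 examinees
clear_only = consistency_rate(64, 3, 4, 56)       # 120 of 127 examinees
```

Rounded to whole percentages, these proportions correspond to the 85 percent and 94 percent consistency rates reported above.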

Table 6-2. Examinee Ability Summary Statistics

Statistic              Mean*   SD*
Test 1
  Ability Measure      .19     .59
  Error of Measure     .23     .05
Test 2
  Ability Measure      .16     .57
  Error of Measure     .23     .06

*Reported in logits

Figure 6-1. Plot of Examinee Ability Measures on Alternate Computer Adaptive Tests

DISCUSSION

This study was designed to verify the reliability of examinee ability measures and pass/fail decisions when alternate tests were administered sequentially using a computer adaptive algorithm that tailored items to examinee ability. The computer algorithm distributed the items according to the test plan on both alternate tests. The 142 pairs of alternate tests were evaluated based on content and comparability of item difficulties. The standard deviation of the ability measure difference (.38) is appropriate, given the mean measurement errors for test 1 (.23) and test 2 (.23). The disattenuated correlation is .96. These results confirm that the particular subset of items selected can vary and still produce statistically equivalent ability measures on alternate tests.

Certification boards frequently compile different written test forms for each test administration and assume that the decision to pass or fail has a comparable meaning as long as the tests are equated and the same test plan is implemented. Test specifications confirm the content validity of each test form (see Lunz & Stahl, 1989). The adaptive algorithm implemented the test specifications in addition to presenting items tailored to each examinee, so that the maximum information about the examinee was gained from each item in each content area.

The alternate tests varied in length and order of subtest presentation. This, however, did not alter the final decision for 94 percent of the examinees who earned clear (90 percent confidence) pass/fail decisions on both tests. The first tests averaged 72 items (SD = 23), while the second tests averaged 94 items (SD = 53). The number of items included on the second test was higher, on average, because the items that provided the most information about the examinee were presented on the first test. Since less information was gained from each item, more items were required to reach the same level of confidence in the decision.

More examinees passed test 1 and failed test 2. These examinees, however, had earned an unclear decision (less than 90 percent confidence) on test 1. Several examinees in the alternate forms condition took as many as 400 items because their ability measure was close to the pass point on both tests. This certainly challenged the depth of the item bank within each content area. A larger item bank would have provided better targeted alternate tests for these borderline examinees.

Shorter tests, made possible by tailoring to the ability of the examinee, are an asset for both the certification board and the examinee as long as there is evidence that decisions are reliable. The results of this study provide evidence of the reliability of alternate computer adaptive tests by documenting the consistency of pass/fail decisions and the comparability of the examinee ability measures.
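Two of the quantities discussed above can be approximated from the rounded summary statistics in Table 6-2. The sketch below is an illustration under stated assumptions, not the authors' computation, which is not given in the text: it assumes independent measurement errors on the two tests, estimates each test's reliability as 1 - SEM^2 / SD^2, and applies the standard correction for attenuation. All function names are hypothetical.

```python
import math


def expected_sd_of_difference(sem_1, sem_2):
    """SD of (measure 1 - measure 2) expected from measurement error alone."""
    return math.sqrt(sem_1 ** 2 + sem_2 ** 2)


def reliability(sd_observed, sem):
    """Reliability estimated as 1 - error variance / observed variance."""
    return 1.0 - (sem ** 2) / (sd_observed ** 2)


def disattenuated_correlation(r_observed, rel_1, rel_2):
    """Correct an observed correlation for unreliability in both measures."""
    return r_observed / math.sqrt(rel_1 * rel_2)


# Rounded values from Table 6-2 and the reported observed correlation.
sd_diff_from_error = expected_sd_of_difference(0.23, 0.23)
rel_test_1 = reliability(0.59, 0.23)
rel_test_2 = reliability(0.57, 0.23)
r_corrected = disattenuated_correlation(0.79, rel_test_1, rel_test_2)
```

With these rounded inputs, measurement error alone predicts a difference SD of about .33, slightly below the observed .38, and the corrected correlation comes out near .94 rather than the reported .96. The gap may reflect rounding in the published summary values or a different reliability estimate.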

REFERENCES

Allen, M.J., & Yen, W.M. (1979). Introduction to measurement theory. Belmont, CA: Wadsworth.
Boekkooi-Timminga, E. (1990). The construction of parallel tests from IRT based item banks. Journal of Educational Statistics, 15(2), 129-145.
Gershon, R.C. (1989). CAT ADMINISTRATOR [Computer program]. Chicago: Micro Connections.
Kingsbury, G.G., & Weiss, D.J. (1980). An alternate-forms reliability and concurrent validity comparison of Bayesian adaptive and conventional ability tests (Research Report 80-5). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program, Computerized Adaptive Testing Laboratory.
Lunz, M.E., & Stahl, J.A. (1989). Content validity revisited: Transforming job analysis data into test specifications. Evaluation and the Health Professions, 12, 192-206.
Martin, J.T., McBride, J.R., & Weiss, D.J. (1983). Reliability and validity of adaptive and conventional tests in a military recruit population (Research Report 83-1). Minneapolis: University of Minnesota.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original work published 1960.)
Standards for educational and psychological testing. (1985). Washington, DC: American Psychological Association.
Wright, B.D. (1968). Sample free calibration and person measurement. Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service.
Wright, B.D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.

chapter 7

The Equivalence of Rasch Item Calibrations and Ability Estimates Across Modes of Administration

Betty A. Bergstrom

Computer Adaptive Technologies

Mary E. Lunz

American Society of Clinical Pathologists Board of Registry

In order for an item to be used efficiently in a computer-adaptive algorithm, it must be precalibrated using a latent trait model, such as the Rasch model, which orders items from easy to difficult. This can be accomplished with data from a previous pencil-and-paper administration, or data from a previous computer-adaptive administration. Many organizations have item pools calibrated from previous pencil-and-paper administrations. However, the use of these calibrations for a computer-adaptive test needs careful consideration. Since the mode of administration is different, there is a possibility that items are somehow "different" when presented on a computer instead of on a piece of paper. If items are different, pencil-and-paper calibrations may not be appropriate for a computer-adaptive test.

In a computer-adaptive test each examinee takes a tailored test. Therefore, items are presented to examinees in different contexts and at different points during the test administration. Thus context effects and location effects will be unique for each examinee. In a paper-and-pencil test, item location and context do not fluctuate. If the pencil-and-paper location and/or context affect the item calibration, the calibration may not be appropriate for a computer-adaptive test.

The possibility that item calibrations might change due to the mode of administration, namely, conventional paper-and-pencil vs. computer-adaptive, has been discussed by several researchers (Kingsbury & Houser, 1989; Wise, Barnes, Harvey, & Plake, 1989). Green, Bock, Humphreys, Linn, and Reckase (1984) suggest several possible problems that might arise when items for a computer-adaptive test are calibrated using data from a paper-and-pencil test. An overall shift might occur, such that all items become easier or harder, or an "item-by-mode interaction" might occur, where some, but not all, item parameters change. They postulate that items with diagrams or many lines of text may be most vulnerable to an item-by-mode interaction.

Context effects have been addressed by Kingston and Dorans (1984). They note that the appropriateness of IRT equating based on precalibration requires that changes in the position of items in a test between the preoperational calibration and operational administrations of the test have no effect on item parameter estimates. They found some types of complex items, especially those that require extensive instructions, to be particularly sensitive to location effects and thus possibly unsuitable for computer-adaptive administration. Yen (1980) also found item characteristics to be affected by the sequence in which items were administered.

One of the consequences of targeting items to the ability level of the examinee is that examinees of different ability levels may be presented with items in different difficulty order. Folk (1990) points out that a high-ability examinee will generally answer the initial items on a computer-adaptive test correctly and then will receive more difficult items.
This results in his or her test being structured from easy to hard. A low-ability examinee will answer fewer initial items correctly, which results in his or her test being structured from hard to easy. However, Folk found that the administration of items in different orders did not substantially affect the performance of low- or high-ability examinees.

Other potential problems in precalibrating items with a pencil-and-paper test for computer-adaptive administration have been addressed by Wainer and Kiely (1987). One of these is the differential effect of cross information encountered in computer-adaptive testing. If a paper-and-pencil item provides a cue for another item, all examinees receive the same cue. With a computer-adaptive test, examinees are administered different items, and items are ordered differently. If an item calibration is influenced by a cueing effect in a pencil-and-paper administration, it may be invalid for the computer-adaptive administration. They also point out that one of the virtues of computer-adaptive testing, short test length, may become problematic if item calibrations are unstable. Since the shorter test lacks the redundancy of a conventional test, it will be more vulnerable to idiosyncrasies of item performance.

If items have not been precalibrated, an initial pencil-and-paper administration may be most practical. In this case, the size and composition of the sample needed for precalibration of items must be considered. It has been suggested that the sample include a minimum of 1,000 respondents and be comparable to the target population (Rudner, 1989; Green et al., 1984). However, it may be difficult to amass a comparable sample population this large in areas such as professional certification.

The purpose of this chapter is to explore two related issues to determine whether item calibrations from conventional pencil-and-paper tests are appropriate for use in this particular application of computer-adaptive testing. The first issue is the equivalence of item calibrations from paper-and-pencil and computer-adaptive administrations. The second issue is the equivalence of examinee ability measures when item calibrations from paper-and-pencil tests versus item calibrations from computer-adaptive tests are used for the tailoring algorithm.

METHOD Precalibration Three hundred and twenty-one medical technology students from 57 educational (training) programs across the country provided data for the precalibration of items. To participate, students had to be eligible to take the first semiannual administration of the related certification examination. Each student took one of four different forms of a 200-item conventional pencil-and-paper test. Each form included a subset of common items for equating so t h a t all forms could be placed on the same scale. Form 1 was taken by 73 students, Form 2 by 86 students, Form 3 by 71 students, and Form 4 by 91 students. Each of the four forms was calibrated by the Rasch model program MSCALE (Wright, Congdon, & Schultz, 1987). The forms were equated using common item equating (Wright & Stone, 1979). The items were evaluated for fit to the model

THE EQUIVALENCE OF RASCH ITEM CALIBRATIONS AND ABILITY ESTIMATES

125

and misfitting items were deleted. This established pencil-and-paper (PAP) item calibrations for a bank of 726 items. CAT Administration Useable data from the computer-adaptive test administration was obtained from 1,077 students from 238 medical technology programs across the country. To participate, students had to be eligible to take the second semiannual administration of the related certification examination. A detailed description of the computer adaptive testing model used in this study is given in Chapter 5. Recalibration from CAT Administration To determine the equivalence of item calibrations, and to determine whether shifts in item calibration affect examinee measures, the response data from the computer-adaptive test administration were recalibrated. Each computer adaptive test yielded an examinee response string. While the entire item pool consisted of 726 items, each examinee response string contained responses from between 50 items (minimum test length) to 240 items (maximum test length). Each item had a unique identifying number. Response strings from all examinees were appended, resulting in a file containing a 1,077 (examinee) by 726 (item) matrix, with missing data for all items not presented to particular examinee. The l,077-by-726 response matrix was analyzed with BIGSCALE (Wright, Linacre, & Schultz, 1990) a Rasch program that processes large data sets t h a t have missing data. This procedure produced a new set of item calibrations and a new set of examinee measures based upon responses from the CAT administration. The mean number of examinees per item calibration on the CAT was 146.45, with a standard deviation of 77.79. The minimum number of examinees to calibrate an item in the CAT administration was 13; the maximum number of examinees to calibrate an item was 348. Items with calibrations between - 1 and 1 logits were administered more frequently t h a n items with lower or higher precalibrations. 
Thus the number of examinees used to calibrate each item from the CAT administration data varied considerably. The paper-and-pencil calibration of the 726 items, and the computer-adaptive test calibration of the 726 items, were compared. Then the 1,077 examinee measures obtained from each calibration were compared.
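The appending step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure BIGSCALE uses internally: the function names, the (item number, score) pair layout, and the toy data are hypothetical, and missing responses are represented by `None`.

```python
def build_response_matrix(response_strings, n_items):
    """Stack variable-length CAT response strings into one examinee-by-item matrix.

    response_strings: one entry per examinee; each entry is a list of
        (item_id, score) pairs with item_id in 0..n_items-1 (hypothetical layout).
    Items never presented to an examinee stay None (missing data).
    """
    matrix = [[None] * n_items for _ in response_strings]
    for row, responses in enumerate(response_strings):
        for item_id, score in responses:
            matrix[row][item_id] = score
    return matrix


def examinees_per_item(matrix):
    """Count, for each item, how many examinees actually responded to it."""
    n_items = len(matrix[0])
    return [sum(1 for row in matrix if row[i] is not None) for i in range(n_items)]


# Toy data: 3 examinees and a pool of 6 items
# (the study itself had 1,077 examinees and 726 items).
strings = [
    [(0, 1), (2, 0), (5, 1)],
    [(1, 1), (2, 1)],
    [(0, 0), (3, 1), (4, 0), (5, 0)],
]
matrix = build_response_matrix(strings, n_items=6)
counts = examinees_per_item(matrix)
```

The per-item counts are the quantity whose mean, minimum, and maximum are reported above; they vary because an adaptive test presents each item only to examinees near its difficulty.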


BERGSTROM & LUNZ

RESULTS

Comparison of Item Calibrations

The mean for the PAP calibration was -0.02, with a standard deviation of 1.00. The mean for the CAT calibration was 0.00 (BIGSCALE mean centers the items), with a standard deviation of 1.22. Two types of shift occurred. The first is an overall shift, indicated by a difference in the standard deviation of the PAP calibration compared to the standard deviation of the CAT calibration. The spread of the CAT calibration (S.D. = 1.22) is wider than the spread of the PAP calibration (S.D. = 1.00). The second type of shift occurred with specific items. After the distribution of the CAT calibration is adjusted for differences in the mean and standard deviation, some item calibrations still shift and the order of item difficulty is altered. The correlation between the PAP item calibrations and the CAT item calibrations was .90 (.95 when disattenuated). A few items calibrate as more difficult on the CAT calibration than they did originally on the PAP calibration, and a few items calibrate as less difficult on the CAT calibration. The shifts from the PAP calibration (small sample) to the CAT calibration (varying sample per item) may be due to the mode of administration or to item bias (a difference in the intent or preparation between the PAP sample population and the CAT sample population). For example, of the seven items with the largest shifts in the direction of easier on the CAT calibration, five were from the same content area, indicating possible differential preparation between the two sample populations.

Comparison of Ability Measure Estimates

For examinees who took the computer-adaptive test, ability measures based on estimates obtained from the PAP calibration were compared with estimates made from the CAT calibration. The mean ability measure calculated with the PAP calibration was .24, with a standard deviation of .53. The mean ability estimate calculated with the CAT calibration was .25, with a standard deviation of .50.
The mean logit difference between ability estimates was -.01, and the standard deviation of the differences was .07. The correlation of the examinee measures obtained from the PAP item calibrations, and the examinee measures obtained from the CAT item calibrations, was .99. Thus there is no difference between the


examinee measures obtained due to the mode of data collection for item calibrations.

DISCUSSION

In this study, even though the item calibrations were obtained from a pencil-and-paper administration with relatively few participants, most of the Rasch item calibrations remained stable when calibrated from the computer-adaptive administration. The results demonstrate that, for these data, the item calibrations from a pencil-and-paper administration can be used for computer-adaptive tests. The item calibrations were equivalent, given varying numbers of examinees, different contexts, and varying modes of administration. The PAP calibrations used a sample of examinees of varying ability levels, so each item was calibrated from a range of examinee abilities. Items on the computer-adaptive administration were targeted to the examinee's ability, so the CAT calibrations were based on a smaller range of examinee ability levels.

Two types of shifts occurred in the item calibrations. The first type, an overall shift in mean and standard deviation, can be corrected by using an equating transformation. The second type of shift, a shift in the calibration of certain items, is potentially much more problematic, because examinees take different items. This means that when some items shift, examinees are differentially affected depending upon how many of the shifted items are presented to them. The examinee measure correlation of .99 indicates that even though a small percentage of the item calibrations shift, the examinee measures are not affected. No examinee measure differed beyond the variance expected due to error of measurement. However, if shift in item calibration is a concern, the items can be identified and revised or discarded from subsequent CAT administrations. Of course, the item pool must be continually monitored for drift, validity, and quality of item content whether tests are administered in a paper-and-pencil or computer-adaptive mode.
The examinee measures, however, can be considered valid even if it is necessary to reevaluate some items.

REFERENCES

Folk, V.G. (1990, April). Adaptive testing and item difficulty order effects. Paper presented at the annual meeting of the American Educational Research Association, Boston.
Green, B.F., Bock, R.D., Humphreys, L.G., Linn, R.L., & Reckase, M.D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21(4), 347-360.
Kingsbury, G.G., & Houser, R. (1989, March). Assessing the impact of using item parameter estimates obtained from paper-and-pencil testing for computerized adaptive testing. Paper presented to the annual meeting of the National Council of Measurement in Education, San Francisco.
Kingston, N.M., & Dorans, J.J. (1984). Item location effects and their implications for IRT equating and adaptive testing. Applied Psychological Measurement, 8(2), 147-154.
Rudner, L.M. (1989). Notes from Eric/TM. Journal of Educational Measurement Issues and Practice, 8(4), 25-26.
Wainer, H., & Kiely, G. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24(3), 185-201.
Wise, S.L., Barnes, L.B., Harvey, A.L., & Plake, B.S. (1989). Effects of computer anxiety and computer experience on the computer-based achievement test performance of college students. Applied Measurement in Education, 2, 235-241.
Wright, B.D., Congdon, R., & Shultz, M. (1987). MSCALE [Computer program]. Chicago: MESA Press.
Wright, B.D., Linacre, J.M., & Schultz, M. (1990). BIGSCALE [Computer program]. Chicago: MESA Press.
Wright, B.D., & Stone, M.H. (1979). Best test design. Chicago: MESA Press.
Yen, W.M. (1980). The extent, causes and importance of context effects on item parameters for two latent trait models. Journal of Educational Measurement, 17(4), 297-311.

chapter 8

Constructing Measurement with a Many-Facet Rasch Model

John Michael Linacre
MESA Psychometric Laboratory
Department of Education
University of Chicago

SUBJECTIVE VERSUS OBJECTIVE TESTS

The rush to objective multiple-choice question (MCQ) tests in the 1920s was driven by dissatisfaction with subjective judge-rated tests. Objective tests were intended to control intrusions of undesirable variance into subjective test scores. But, in the 1980s, the testing community began to realize that what is needed is not objective testing but rather objective measurement. The reevaluation of subjective tests in the light of objective measurement opens a new field of possibilities. Ruch (1929) summarized the drawbacks to subjective tests:

1. Subjectivity of scoring lowers reliability.
2. Sampling must be limited to a small number of broad questions.
3. Time required to write lengthy answers is excessive.
4. These examinations encourage bluffing.

His first drawback is our primary concern here. The importance of his last three drawbacks depends on the intention, construction, and application of the subjective test. Indeed, one of the documented drawbacks to MCQ testing is the success of test-taking strategies, which are equivalent to bluffing, in increasing students' performance without increasing their achievement (Haladyna, Nolen, & Haas, 1991).

Of course, subjective tests have remained in use. The example considered here is a selection examination for admission to a graduate program. Nineteen members of the admissions committee, the judges, rated 100 examinees on 14 items of competency using a five-point rating scale. Each examinee was rated by three, four, or five judges. Judges assigned ratings only when there was sufficient information to make a judgment. Consequently, not all judges awarded 14 ratings to each examinee that they rated. One judge rated 97 of the examinees. Another judge rated only one.

CONVENTIONAL ATTEMPTS TO MODEL JUDGING

Studies of scoring subjectivity have found that "there is as much variation among judges as to the value of each paper as there is variation among papers in the estimation of each judge" (Ruggles, 1911). But any difference among judges is a threat to fairness because raw score depends on which judge rates an examinee. Since differences in judge severity can account for as much variance in ratings as differences in examinee ability (Cason & Cason, 1984), an obvious and widely attempted correction for judge behavior is to deduct the mean value of all ratings given by a judge from his or her individual ratings in hope of obtaining a judge-free rating. This fails because:

1. All judges are required to rate all examinees on all items, a design that is impractical in any large-scale testing situation. Substituting partial sampling designs (Braun, 1988) lessens the judging load, but introduces daunting administrative requirements.
2. The stochastic aspect of the judging process remains unrecognized and unmanaged. Adjustments by averaging and subtracting do not control the effects of judge variation.
3. The nonlinearity of the initial rating scale is overlooked. Ratings originate on an ordinal, not an interval, scale.
   (a) The highest and lowest categories represent infinite ranges of performance above and below the intermediate categories.
   (b) The ranges of performance represented by intermediate categories depend on how their labels are interpreted by judges. The intervals are never equal.
4. Judge idiosyncracies are undiagnosed and uncontrolled. This means that the validity of the examination is unknown.
5. Measures for examinees, which are statistically independent of the local details of the examination and hence generalizable beyond the examination, cannot be produced.
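The first objection can be made concrete with a small sketch of the judge-mean correction applied to an incomplete design. The data, judge labels, and function name below are hypothetical and chosen only for illustration: two judges are taken to be equally severe, but one happens to rate only the stronger examinees.

```python
from statistics import mean


def deduct_judge_means(ratings):
    """Conventional correction: subtract each judge's mean rating from that judge's ratings.

    ratings: list of (judge, examinee, rating) triples (hypothetical layout).
    """
    by_judge = {}
    for judge, _, rating in ratings:
        by_judge.setdefault(judge, []).append(rating)
    judge_mean = {judge: mean(values) for judge, values in by_judge.items()}
    return [(judge, examinee, rating - judge_mean[judge])
            for judge, examinee, rating in ratings]


# Hypothetical incomplete design: judge "X" sees only strong examinees,
# judge "Y" only weak ones; no examinee is rated by both judges.
ratings = [("X", "e1", 5), ("X", "e2", 4), ("Y", "e3", 2), ("Y", "e4", 1)]
adjusted = deduct_judge_means(ratings)
```

After the deduction, examinees e1 and e3 receive the same adjusted rating even though their raw ratings were 5 and 2. With no overlap between the judges, the correction cannot tell a lenient judge from an able group of examinees, which is why the conventional approach needs every judge to rate every examinee.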

Attempts have been made to overcome these problems through nonlinear transformation of the responses combined with conventional approaches to modelling error (De Gruiter, 1984; Cason & Cason, 1984), but they have not been reported to succeed.

THE MANY-FACET RASCH MODEL

These obstacles can be overcome with a many-facet Rasch model. The specifications underlying the two-facet Rasch model can be extended to tests of many facets (Linacre, 1989). These specifications are:

1. The impact of each element of each facet on the test situation is dominated by a single parameter with a value independent of all other parameters within the frame of reference. (Single parameterization is necessary if examinees are to be arranged in one order of merit, or items indexed by difficulty on an item bank.)
2. These parameters combine additively: they share one linear scale.
3. The estimate of any parameter is dependent on the accumulation of all ratings in which it participates but is independent of the particular values of any of those ratings.

These specifications are the necessary and sufficient requirements for constructing a linear measurement system from any observed data. The degree to which this construction is useful and valid is measured by statistics quantifying the fit of the data to the measurement model (Wright & Masters, 1982). A many-facet Rasch model for the admission examination is:

\[ \log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k \qquad (1) \]

where
Pnijk is the probability that examinee n is awarded a rating in category k by judge j on item i, and Pnij(k-1) is the probability of a rating in category k-1,
Bn is the ability of examinee n, where n = 1, ..., 100,
Di is the difficulty of item i, where i = 1, ..., 14,
Cj is the severity of judge j, where j = 1, ..., 19,
Fk is the difficulty of the step up from category k-1 to category k, and k = 2, ..., 5.

Figure 8-1. Conventional and measurement perspectives on rating scales

Each examinee is represented by one parameter, Bn, which corresponds to the ability measure of the examinee on a linear continuum. Larger measures indicate greater ability. The difficulty of a successful performance on an item is parameterized by one parameter, Di, which is a measure on the same continuum as that of examinee ability. Thus the probability of a successful performance increases as either the examinee ability increases or the item difficulty decreases. Other elements also intervene. The assignment of ratings is mediated through a judge. Each judge is identified by one parameter, Cj, in the same linear measurement system. A more severe judge, with a larger measure, is less likely to award a high rating than a lenient judge with a smaller measure.

Finally, the step structure of the rating scale must also be parameterized. As Figure 8-1 illustrates, the fact that the categories are labelled 1 to 5 and printed uniformly spaced across the page seems to indicate that the levels of performance represented by the categories must be equally spaced and so can be analyzed as linear measures as they stand. Nevertheless, in reality, the rating categories themselves represent qualitatively distinct, but ordered, performance levels partitioning an infinite continuum of performance. The equal integer spacing of the category labels and their equally spaced printing invite the judge to devote equal attention to each of the alternatives. But the range of the performance level corresponding to each of the ordered categories can only be discovered empirically from how the judges behave. Moreover, since the number of rating categories is finite, the ranges corresponding to the extreme categories are always infinite, because there is conceptually no limit to how good or how bad a performance can be.
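How these parameters combine to give rating probabilities can be sketched as follows. The sketch assumes the adjacent-category form usual for Rasch rating-scale models, in which the log-odds of a rating one step higher equal examinee ability minus item difficulty minus judge severity minus step difficulty. The function names and the step values are hypothetical, chosen for illustration and not estimated from the admission data.

```python
import math


def category_probabilities(ability, difficulty, severity, steps):
    """Probabilities of step counts 0..len(steps) for one examinee-item-judge encounter.

    steps[k-1] is the difficulty of the step from count k-1 up to count k
    (count 0 corresponds to the lowest category label).
    """
    measure = ability - difficulty - severity
    # Cumulative log-numerators: 0 for the bottom category, then add (measure - step).
    log_numerators = [0.0]
    for step in steps:
        log_numerators.append(log_numerators[-1] + measure - step)
    top = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(value - top) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_count(probabilities):
    """Expected step count (expected rating minus the lowest category label)."""
    return sum(count * p for count, p in enumerate(probabilities))


steps = [-2.0, -0.5, 0.5, 2.0]  # hypothetical step difficulties for a five-category scale
lenient = category_probabilities(ability=1.0, difficulty=0.0, severity=-0.5, steps=steps)
severe = category_probabilities(ability=1.0, difficulty=0.0, severity=0.5, steps=steps)
```

For the same examinee and item, the expected count under the lenient judge is higher than under the severe judge, which is the sense in which a raw rating depends on who does the judging.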
It is the functioning of the categories of the rating scale that defines the measures, not the arbitrary assignment of equal integer category labels. The labelling of the categories is a convenience for the management of the examination. What is needed for analysis is not the category label but the count of qualitatively higher levels of performance represented by the category. Thus the lowest category, usually labelled 1, corresponds to a step count of 0, while the category labelled 5 corresponds to a step count of 4.

Equation (1) specifies the stochastic relationship between the ordered categories of the rating scale and the latent performance continuum. This relationship is an ogive that satisfies both the theoretical requirements for measurement and the functional form of the rating scale defined by the judges through their use of it. The unequal widths of the performance ranges corresponding to the intermediate categories are parameterized by the Fk terms. The infinite performance ranges at the extremes of the scale are mapped into the corresponding finite top and bottom categories.

A maximum likelihood estimate for each parameter is obtained when the expected marginal sum of the counts of the ratings in which the parameter participates is equal to the observed sum of counts. Missing ratings can be ignored in this estimation, as is done in the computer program FACETS (Linacre, 1988).

In Figure 8-2, the examinees, judges, and items of the admission examination have been measured on one common linear frame of reference. The expected scores (in rating points) are shown for examinees facing items of 0 logit difficulty and judges of 0 logit severity. Other expected scores are obtained by indexing the score scale at (examinee ability - judge severity - item difficulty) logits. An example of the ogival score-to-measure conversion is shown in Figure 8-3, where the average rating given an examinee on the admissions test has been mapped against examinee measure. The solid ogive traces the raw score to measure conversion that would have occurred if all judges had rated all examinees on all items.
Each point X represents the conversion for an examinee. Its placement depends on which judges rate the examinee's performance. Examinee A has a higher average rating, but a lower measure than Examinee B, because A happened to be rated by more lenient judges than B. Most Xs are displaced below the solid ogive, because the most lenient judge rated only a few examinees.

Figure 8-2. Results of a many-facet Rasch analysis

Figure 8-3. Average category labels for examinee performance plotted against estimated logit measures

FIT TO THE MODEL

Equation (1) specifies the stochastic structure of the data. The probability of a rating in any category is modelled explicitly. The modelled (expected) values of the error variance associated with each rating are

explicit. This enables a detailed examination of the data for fit to the model. Not only too much, but also too little, observed error variance threatens the validity of the measurement process, and motivates investigation, diagnosis, and remediation of specific measurement problems. The relationships between the modelled error variances and the observed error variances (sums of squared residuals) are used as partial and global tests of fit of data to model (Wright & Panchapakesan, 1969; Windmeijer, 1990).

In conventional analysis, by contrast, any difference between an observed and an expected rating is blamed on a judge's unexplained and undesired error variance. The optimal error value is zero, but this can never be obtained in a nontrivial situation. Any amount greater than zero threatens validity. Thus, "the widespread use of such items in standardized tests depends on whether some degree of scoring error, however small, can be accepted" (Bennett, Ward, Rock, & Lahart, 1990). This error variance is often compared to the observed variance of a judge's ratings, leading to an uncontrolled comparison between the within-judge randomness of a judge's ratings and the between-examinee spread of the abilities of the examinees who happen to have been rated.

An example of Rasch fit statistics for four of the admission examination judges is shown in Table 8-1. Their severity measures (in log-odds units, logits) are about equal, but their measures have different standard errors. These indicate the precision or reliability of their measures. The size of these errors is chiefly determined by the number of ratings the judge made. The more ratings a judge makes, the more information there is with which to estimate a severity measure and so the smaller its standard error.

Two fit statistics are reported, the mean-square and standardized forms of the Outfit statistic. Outfit is an acronym for "outlier-sensitive fit statistic," because its size is strongly influenced by single unexpectedly large residuals. Outfit is based on the ratio of observed error variance to modelled error variance. The ratio is computed on a rating-by-rating basis, and then averaged across all ratings in which the judge participated. The result is the mean of the ratios of squared observed residuals to modelled residuals. The mean-square outfit statistic is on a ratio scale with expectation 1 and range 0 to infinity. Its statistical significance is indicated by a standardized value with a modelled unit normal distribution. Since the success of the standardization is sample dependent, this value cannot be interpreted strictly in terms of the unit normal distribution, but must be evaluated in the light of the local situation.

In Table 8-1, Judges A and B have mean-square outfit statistics close to their expected values of 1, and standardized values close to their expectation of 0. Judge C, however, shows considerable misfit. His mean-square outfit of 1.4 indicates 40 percent more variance in his ratings than is modelled. The significance value of 3 indicates that this is rarely expected. Symptomatic of Judge C's behavior is the distribution of his ratings.
He awarded considerably more high and more low ratings than Judges A and B. This wider spread of ratings is unexpected in the light of the rating patterns of the other judges.

Judge D, on the other hand, exhibits a muted rating pattern. His mean-square statistic of .7 indicates 30 percent less variance in his ratings than is modelled. The high significance of this is flagged by the standardized value of -6. Judge D's ratings show a preference for central categories. He reduces the rating scale to a dichotomy and so reduces the variance of his ratings. The fact that Judge D's ratings are more predictable than those of the other raters would be regarded as beneficial in a conventional analysis. In a Rasch analysis, however, Judge D's predictability implies that Judge D is not supplying as much independent information as the other judges on which to base the examinees' measures. Were Judge D perfectly predictable, always rating in the same category, he would supply no information concerning differences among examinees.

Table 8-1. Judge Measures and Fit Statistics

                                                                  Outfit               % Frequency of Rating
Judge       Examinees  Total    Mean    Severity  Model   Mean-    Standardized
            Rated      Ratings  Rating  Measure   Error   Square                     1    2    3    4    5
A              12        168     2.8     0.62     0.13     1.0         0             0    0   35   53   11
B              48        672     2.7     0.68     0.07     1.1         1             0    1   42   42   15
C (Noisy)      17        231     2.7     0.81     0.11     1.4         3             0    6   35   41   18
D (Muted)      73       1018     2.8     0.63     0.05     0.7        -6             0    0   31   61    7

A frequently used alternative to Outfit is Infit, an information-weighted fit statistic sensitive to unexpected patterns of small residuals. This is calculated from the ratio of the sum of all squared residuals to the sum of all modelled error variances for ratings in which the judge participated. For the judges shown in Table 8-1, the Outfit and Infit statistics are numerically identical. This is because the misfit for this data set is homogeneous across examinee ability levels. By contrast, lucky guessing and carelessness on MCQ items cause large outlying residuals that are detected by unexpected Outfit values, while alternative curricula lead to unexpected patterns of small residuals which are detected by Infit.
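The two mean-square statistics described in this section can be sketched directly from their verbal definitions. The sketch assumes that the expected rating and the modelled variance of each rating have already been obtained from the measurement model. The function name and all numbers are hypothetical, and the standardized forms are not computed here.

```python
def fit_mean_squares(observed, expected, variance):
    """Outfit and Infit mean-squares for one judge's ratings.

    Outfit: mean, over ratings, of squared residual / modelled variance.
    Infit:  sum of squared residuals / sum of modelled variances.
    """
    squared = [(x - e) ** 2 for x, e in zip(observed, expected)]
    outfit = sum(s / v for s, v in zip(squared, variance)) / len(squared)
    infit = sum(squared) / sum(variance)
    return outfit, infit


# Hypothetical ratings with their modelled expectations and variances.
observed = [3, 2, 4, 1]
expected = [2.5, 2.5, 3.0, 2.0]
variance = [0.5, 0.5, 1.0, 1.0]
outfit, infit = fit_mean_squares(observed, expected, variance)

# Add one highly unexpected rating in an encounter with small modelled variance.
outfit_outlier, infit_outlier = fit_mean_squares(
    observed + [0], expected + [3.8], variance + [0.2]
)
```

In the first call both statistics fall somewhat below their expectation of 1. In the second, the single outlying rating inflates the unweighted Outfit much more than the information-weighted Infit, which matches the description of Outfit as the outlier-sensitive statistic.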

THE JUDGING PLAN

The only requirement on the judging plan is that there be enough linkage between all elements of all facets that all parameters can be estimated within one frame of reference without indeterminacy. An example of lack of linkage and consequent indeterminacy is a plan in which judge panel B grades only boys and judge panel G grades only girls, because then a relatively good performance by one gender can be attributed either to higher ability or to more lenient judges. The ideal and usually necessary judging plan for conventional analysis is that in which every judge rates every examinee on every item. This is illustrated in Figure 8-4, which follows the specifications of Braun (1988).

Figure 8-4. Complete judging plan

Under Rasch analysis, this design meets the linkage requirement and provides precise measures of all parameters in the shared frame of reference, but such completeness is not required. All that is required is a network of examinee, judge, and item overlap. A simple linking network can be obtained by having groups of judges rate some examinees on all items. This type of plan is shown in Figure 8-5. The parameters are linked into one frame of reference through ratings that share pairs of parameters: common persons, common essays, or common judges. Accidental omissions or unintended extra ratings amend the judging plan but do not threaten measurement construction. Measures are less precise than with complete data because fewer ratings are made. Since the standard errors of the measures are approximately in proportion to the inverse of the square root of the number of observations, the standard errors of measures estimated from this second incomplete data set will be about 2.5 times

larger than for the first complete data set. On the other hand, the judging effort will be reduced by 83 percent.

Figure 8-5. "Rotating test-book" judging plan

Judging is time consuming and expensive. It may be desirable to minimize the judging work by arranging for each item of each performance to be judged only once. Even under these circumstances, the statistical requirement for overlap can usually be met rather easily. For instance, if each examinee writes several essays and all essays are shuffled together randomly, overlap can be obtained by having each judge grade whichever essay happens to come next on the pile. Each judge grades as many essays as time and speed allow. But each essay is graded only once. Nevertheless, by the end of the judging session, many examinees will have been rated by more than one judge, but on

different essays, and many essay topics will have been rated by more than one judge, but for different examinees. An example of this type of minimal judging plan, but under slightly stricter rules, is shown in Figure 8-6. Each of the 32 examinees' three essays is rated by only one judge. Each of the 12 judges rates eight essays, including two or three of each essay type. The examinee-judge-essay overlap enables all parameters to be estimated unambiguously in one frame of reference. Assignment of essays to judges was by a simulated "random pile" of essays with the constraints that each essay be rated only once, each judge rate an examinee once at most, and each judge avoid rating any one type of essay too frequently.

Figure 8-6. Minimal-effort judging plan

The cost of this minimal data collection is lower measurement precision, with standard errors 3.5 times larger than for the full plan. The

judging effort, however, is reduced about 92 percent. The loss of information under such a plan might appear excessive, but where the number of different items of performance to be rated is high, this type of plan has proved feasible (Lunz, Wright, & Linacre, 1990).

GENERALIZABILITY OF RESULTS

The category labels of a rating scale are not only arbitrary and nonlinear, but also local to the design of the particular examination. The implications of this may be masked when all examinees are rated on the same items by the same judges in one testing session, but they are immediately apparent when examinees face different testing situations. Quantitative comparison requires a frame of reference in which it no longer matters which examinee is rated by which judge on which item in what session. The many-facet Rasch model enables such a framework to be constructed (Stahl, 1991).

CONTROL OF JUDGE IDIOSYNCRACY

Judge training is required to develop a shared understanding of a rating scale and a uniform perspective on the challenge applied by the test items. It is claimed that "subjectivity of marking may be reduced about one-half by the adoption of and adherence to a set of scoring rules when essay examinations are to be graded" (Ruch, 1929). Conventionally, training has been further aimed at obtaining unanimity across judges about the rating to be awarded to particular performances on particular items. This idealistic attempt to produce identical, and hence exchangeable, judges has met with little success. "Judges employ unique perceptions which are not easily altered by training" (Lunz et al., 1990, p. 332). No entirely successful large-scale judge training program has ever been reported.

There are many situations in which judge training is given little or no attention (for example, a supervisor rating subordinates) or has been discovered to have been ineffective. It is always essential to monitor the quality of the ratings being awarded and to direct each judge's attention to those areas in which there is doubt. An advantage of the Rasch many-facet measurement model is that within-judge self-consistency, rather than between-judge unanimity, is now the aim. On this basis, unexpectedly harsh or lenient ratings, not in accord with a judge's usual rating style, can be identified, and also each judge's biases relating to any particular items, groups of examinees, or the like, can be quickly revealed. This has two benefits. First, unacceptably idiosyncratic ratings can be treated as missing without disturbing the validity of the remainder of the analysis.
Second, precise feedback to each judge about specific questionable ratings or rating patterns can foster improvements in the judging process. In the admission data, 14 of the 6,227 ratings were sufficiently unexpected as to invite closer inspection, and, where necessary, corrective action. In three cases, the observed ratings were more than two rating points different from those expected based on the overall ability of the examinee, severity of the judge, and difficulty of the item—surely a large enough discrepancy to provoke skepticism about the validities of those ratings.


FURTHER MEASUREMENT MODELS

The many-facet measurement model can be expressed in many forms to meet the requirements of specific testing situations, including portfolio assessment, artistic and athletic competitions, and skill certification. Some of these forms are:

an item-scale model, in which each item is constructed with its own rating scale,

\[ \log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_{ik} \]

where Bn, Di, and Cj are as above, and Fik is the difficulty of the step from category k-1 to category k of the scale unique to item i, and k = 1, ..., Mi;

a judge-scale model, in which each judge uses his or her own interpretation of the rating scale,

\[ \log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_{jk} \]

where Bn, Di, and Cj are as above, and Fjk is the difficulty of the step from category k-1 to category k for judge j, and k = 1, ..., Mj;

a four-faceted model, in which each of the items is modelled to apply to each of a number of tasks,

\[ \log\left(\frac{P_{nmijk}}{P_{nmij(k-1)}}\right) = B_n - A_m - D_i - C_j - F_k \]

where Bn, Di, Cj, and Fk are as above, and Am is the difficulty of task m.


CONCLUSION

The construction of a measurement system for subjective tests is practical and useful. Test constructors no longer need limit themselves to what can be obtained from an MCQ test, but instead can devote their creative powers to designing tests that involve deeper, more relevant, and hence more authentic evidence of competence, without losing the benefits of objective measurement.

REFERENCES

Bennett, R.E., Ward, W.C., Rock, D.A., & Lahart, C. (1990). Toward a framew ton, NJ: Education Testing Service.
Braun, H.I. (1988). Understanding scoring reliability. Journal of Educational Statistics, 13(1), 1-18.
Cason, G.J., & Cason, C.L. (1984). A deterministic theory of clinical performance rating. Evaluation and the Health Professions, 7, 221-247.
De Gruiter, D.N.M. (1984). Two simple models for rater effects. Applied Psychological Measurement, 8, 213-218.
Haladyna, T.M., Nolen, S.B., & Haas, N.S. (1991). Raising standardized achievement test scores and the origins of test score pollution. Educational Researcher, 20(5), 2-7.
Linacre, J.M. (1988). FACETS computer program. Chicago: MESA Press.
Lunz, M.E., Wright, B.D., & Linacre, J.M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331-345.
Ruch, G.M. (1929). The objective or new-type examination. Chicago: Scott, Foresman.
Ruggles, A.M. (1911). Grades and grading. New York: Teacher's College.
Stahl, J. (1991, April). Equating examinations that require judges. Paper presented at AERA Annual Meeting, Chicago.
Windmeijer, F.A.G. (1990). The asymptotic distribution of the sum of weighted squared residuals in binary choice models. Statistica Neerlandica, 44(2), 69-78.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B.D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29(1), 23-48.

chapter

9

Development of a Functional Assessment That Adjusts Ability Measures for Task Simplicity and Rater Leniency* Anne G. Fisher

Professor, Department of Occupational Therapy, College of Applied Human Sciences Colorado State University

* Appreciation is extended to J. Michael Linacre and Benjamin Wright for their reviews and refinement of this manuscript. Kimberly Bryze and Anita Bundy also provided valuable editorial input. This project was supported, in part, by funding from the American Occupational Therapy Association and Foundation through the Gerontology Research Symposium, the Physical Disabilities Symposium, and the Center of Research and Measurement at the University of Illinois at Chicago, College of Associated Health Professions, Department of Occupational Therapy. Thanks are extended to the members of the AMPS gerontology and physical disabilities teams that served as the raters for this study. Finally, appreciation is extended to Ay Woan Pan for her assistance with data analysis. Portions of this chapter were presented at the annual meeting of the American Educational Research Association, Chicago, April 1991.

INTRODUCTION

Therapists draw important conclusions about the abilities and limitations of people by observing them in the context of their performances of activities of daily living (ADL) (for example, dressing, bathing, or eating) and instrumental activities of daily living (IADL) (for example, meal preparation, shopping, or laundry). Therapists use the information gathered to (a) make judgements regarding the overall functional ability of the person, (b) identify specific deficits that may be impairing functional performance, (c) plan appropriate intervention programs designed to enhance the person's level of independence, and (d) monitor change in performance levels over time.

While therapists routinely evaluate ADL/IADL ability by direct observation, the majority use homegrown evaluation tools of unknown validity and reliability. That is, there is general recognition that therapists practicing in a variety of settings, such as rehabilitation, long-term care, and home health, have developed their own ADL/IADL assessments with little attempt to establish the validity and reliability of the instruments. Further, no existing standardized instrument has been recognized as having the characteristics of a gold standard (Eakin, 1989; Keith, 1984; Law & Letts, 1989; Jongbloed, 1986).

There are several factors that may have contributed to the limited usage of standardized ADL/IADL evaluations by therapists in clinical settings. Among the most apparent is that existing standardized evaluations fail to meet the needs of the clinician involved in the direct intervention with people who have physical or psychosocial disabilities. For example, most standardized ADL/IADL scales were developed for managerial and policy purposes related to screening, determination of the need for services, resource allocation, and outcome analysis (see Fuhrer, 1987; Granger & Gresham, 1984; Kane & Kane, 1981, for reviews).

As a result, standardized ADL/IADL evaluations tend to be rather global in nature; they commonly are used to assess whether or not the person can perform a number of ADL/IADL tasks independently, and, if not, what level of assistance is required. From the perspective of the therapist responsible for providing intervention, such standardized global assessments provide an indication of what a person can or cannot do, but no information about why the person might be experiencing functional limitations. Yet an important prerequisite for planning cost-effective intervention programs is that the therapist be able to identify specific factors that limit performance ability so that those factors can be targeted in the intervention.


Therefore, the therapist who chooses to use a standardized instrument designed to evaluate global ADL/IADL ability, yet desires to identify specific deficits or impairments that are interfering with the functional performance of the individual, must supplement his or her global ADL/IADL evaluation with discrete evaluations of the distinct constituents underlying ADL/IADL performance (including strength, range of motion, perception, and mental status). The basic assumption made is that if the underlying cause of the ADL/IADL limitations can be identified and treated, the effects will generalize to improved functional performance across a wide range of ADL/IADL tasks. While this approach has logical appeal, research has not demonstrated a strong enough relationship between underlying constituents and ADL/IADL performance, when they are evaluated separately, to be able to make valid predictions about the abilities of a person in daily life task performance based on his or her discrete test scores (Bernspang, Asplund, Eriksson, & Fugl-Meyer, 1987; Jongbloed, Brighton, & Stacey, 1988; Pincus, Callahan, Brooks, Fuchs, Olsen, & Kaye, 1989; Reed, Jagust, & Seab, 1989; Skurla, Rogers, & Sunderland, 1988; Teri, Borson, Kiyak, & Yamagishi, 1989).

The commonly chosen alternative is for the therapist to observe directly the person performing selected ADL/IADL tasks that the individual has identified as relevant to his or her needs and goals, and then, simultaneously, make subjective judgements regarding (a) the person's overall ability to perform ADL/IADL tasks, and (b) the distinct underlying performance constituents that appear to be impairing the person's performance. There are certain advantages to this approach.
While most standardized ADL/IADL scales are of a self- or proxy-report or interview format, there is increasing recognition that direct observation of ADL/IADL performance may be preferred in many instances (Consensus Development Panel, 1988; Guralnik, Branch, Cummings, & Curb, 1989). Moreover, therapists are recognized for their expertise in performance evaluation (evaluation based on direct observation of performance) (Guralnik et al., 1989), as well as for their ability to effect comprehensive task analyses that result in the identification of appropriate adaptive or compensatory methods that can be utilized by the person to achieve desired functional goals (Faletti, 1984). Another advantage of directly observing a person perform selected ADL/IADL tasks is that the therapist is able to individualize the evaluation by observing the person perform only those tasks that the individual perceives as relevant and meaningful, given his or her living situation and interests. This is based on the assertion that the quality of task performance is influenced by the volitional characteristics of


the individual. Volition is assumed to determine what tasks the person chooses to perform, and function is hypothesized to be maximized when an individual performs a task of his or her choice (Kielhofner & Burke, 1985). However, observing the person perform self-selected tasks while making subjective judgements regarding the individual's ability to perform ADL/IADL tasks defies objective measurement. Indeed, even when a systematic and reproducible method of scoring the performance is used, the specific tasks chosen by the person vary in difficulty. If no mechanism is used to adjust person measures for the simplicity of the tasks performed, the person who performs easier tasks will have an unfair advantage over the person who performs harder tasks. Moreover, unless the person performs exactly the same set of tasks each time he or she is evaluated, this system does not allow the therapist to monitor change as the individual progresses over the course of intervention. The influence of rater judgement is another frequently cited area of concern, especially for IADL assessments (George & Fillenbaum, 1985; Lawton, 1987; Rubenstein, Schairer, Wieland, & Kane, 1984). The major reason for lowered interrater reliabilities is that the complexity of IADL requires that greater degrees of rater judgement be used in scoring; what is judged to constitute adequate performance is highly variable and reflects the personal biases of the raters (Lawton, 1987). As Lunz and Stahl (1990) pointed out, clinical observation and rating of a person's performance always requires the input of a judge. Since all judge-awarded ratings reflect some subjectivity, judge bias is a major drawback to objective measurement of examinee ability. Attempts to improve uniformity among judges have included constructing structured items . . . , standardizing grading criteria and administration procedures, and providing extensive judge training. 
But these efforts have served only to direct the attention of judges, not to control the [leniency] of their assessments. (p. 426)

Therefore, any objective measurement system that is developed to meet the requirements of clinical practice must have several important features. First, it must provide the therapist with the capability to assess the impact of discrete skill deficits on global ADL/IADL ability directly. Second, it must be developed so as to give consideration to the motivation, interests, and needs of the person tested by offering the opportunity for motivated task choice. Third, person ability measures must be adjusted for the simplicity of the tasks performed and for the leniency of the rater who observed the performance. And finally, the


measurement system must have demonstrated validity and reliability. The Assessment of Motor and Process Skills (AMPS) (Fisher, 1991), an innovative assessment of IADL, was designed to meet these requirements of clinical practice. The purpose of this chapter is to describe the application of the many-faceted Rasch model (Andrich, 1988; Linacre, 1989, this volume) to construct and validate the motor scale of the AMPS.

ASSESSMENT OF MOTOR AND PROCESS SKILLS

The Assessment of Motor and Process Skills (AMPS) was developed in response to the need for scales (a) that are defined by skill item easiness and IADL task simplicity, (b) that adjust the person ability measures for the leniency of the rater performing the observation, (c) that permit the simultaneous evaluation of IADL task performance and the underlying motor and process (organizational/adaptive) performance skill capacities necessary for skilled task performance, and (d) that provide the person observed the opportunity to select tasks to perform that reflect his or her values and interests. In the context of the person's actually performing one or more IADL tasks of his or her choice, the person is rated on 15 motor skill items and 20 process skill items. The motor skills are conceptualized as representing a taxonomy of universal motor operations that underlie task performance, and the process skills each are conceptualized as representing a taxonomy of universal process operations that underlie task performance. Motor skills pertain to those capacities that the person uses to produce or impart motion to self or objects. They are those performance skills that relate to the posture, mobility, coordination, and strength capacities of the person that provide the basis for movement of the body and objects. The term process may be defined as a series of actions en route to task completion.
Process skills are related to the attentional, conceptual, organizational, and adaptive capacities that the person uses to sensibly organize the actions he or she performs in order to complete the specified task. These motor and process skills are operationally defined as observable actions that reflect the underlying performance capacities (Fisher, 1991). Definitions of the 15 motor skills analyzed for this study are listed in Figure 9-1. When the AMPS is used to evaluate a person, he or she is offered several IADL task choices from approximately 30 listed in the test manual. Whenever possible, the person is asked to choose at least two to perform. During the performance, the rater scores the 15 observable motor skills on a 4-point rating scale. A score of 4 (Competent) is


STRENGTH
• Moves: pushes, shoves, pulls, or drags objects along a supporting surface; includes opening doors and drawers. Pertains to the moving of objects that are not lifted (e.g., pushing or pulling on a cart, door, or drawer; dragging a heavy bag across the floor; or sliding a heavy pan along the counter top). Includes the ability to self-propel a wheelchair.
• Lifts: raises or hoists objects off of supporting surface; includes moving an object that is lifted from one place to another, but without ambulation or moving from one place to another. Pertains to having enough strength to lift objects.
• Reaches: stretches or extends the arm, and, when appropriate, the trunk to grasp or place objects that are out of reach. Pertains to the ability to effectively reach to the extent necessary in order to obtain objects. Where appropriate, this includes trunk movement.
• Endures: persists and completes the task without evidence of fatigue, pausing to rest, or stopping to "catch one's breath."

POSTURE AND MOBILITY
• Transports: carries objects while ambulating or moving from one place to another (e.g., in a wheelchair). Pertains to the physical capacity to gather.
• Stabilizes: steadies body, and maintains trunk control and balance while sitting, standing, or walking, while reaching, or while moving, lifting, pushing, or pulling objects; pertains to postural control during trunk or limb movements.
• Aligns: maintains the body weight evenly distributed over the base of support; implies an absence of asymmetries, flexed or stooped posture, or excessive leaning; pertains to body alignment that may be affected by structural or strength limitations.
• Walks: ambulates on level surfaces; implies steadiness or an absence of shuffling, lurching, ataxia, etc.; includes the ability to turn around to change direction while walking.

FINE MOTOR ABILITIES AND SUBTLE POSTURAL ADJUSTMENTS
• Bends: actively flexes, rotates, or twists the body in a manner and direction appropriate to the task; pertains to trunk mobility.
• Coordinates: uses different parts of the body together or uses other body parts as an assist or stabilizer during bilateral motor tasks. Pertains to the physical capacity to hold, support, or stabilize objects during bilateral task performance.
• Manipulates: uses dexterous grasp and release, as well as coordinated in-hand manipulation patterns; pertains to skillful use of isolated finger movements when handling objects.
• Flows: uses smooth, fluid, continuous, uninterrupted arm and hand movements. Pertains to the quality or refinement of motor execution; includes the absence of dysmetria, ataxia, tremor, rigidity, or stiffness of movement. Implies the ability to isolate movements.
• Positions: positions body or wheelchair in relation to objects in a manner that promotes the use of efficient arm movements; pertains to the use of postural background movements appropriate to the task. Implies the absence of awkwardness of arm or body positions. Includes the ability to position the body or wheelchair appropriate to the task or movement pattern of the arm.
• Calibrates: regulates or grades the force, speed, and extent of movements in the performance of a step or action; pertains to the amount of effort exerted or an expenditure of energy that is appropriate to the requirements of the action or step (e.g., not too much or too little).
• Grips: pinches or grasps in order to grasp handles, to open fastenings and containers, or to remove coverings; relates to effectiveness of strength of pinch and grip.

Figure 9-1 Definitions of the AMPS motor skills.


assigned when the rater judges that there is no evidence of a motor skill deficit interfering with the person's performance. A score of 3 (Questionable) is assigned when the rater questions the presence of a motor skill deficit that is interfering with IADL task performance. A score of 2 (Ineffective) is assigned when the rater judges that a motor skill deficit is impacting on the person's effective use of time and energy such that ongoing task performance is affected. Finally, a score of 1 (Deficit) is assigned when the motor skill deficit is severe enough to result in task breakdown, risk of danger, or an unacceptable slowing of the task progression. Scoring examples for all skill items are listed in the test manual (Fisher, 1991). Scoring examples for each score category for the motor skill item Transports are shown in Figure 9-2.

TRANSPORTS: carries objects while ambulating or moving from one place to another (e.g., in a wheelchair). Pertains to physical capacity to gather. (Note. Score the ability to move objects such as doors, drawers, or carts that typically are not lifted under the motor verb Moves. The presence of instability when carrying objects is also scored under the motor verb Stabilizes.)

4 = readily and consistently carries objects from one place to another while walking or moving from place to place
- carries sheets from linen closet to bedroom without difficulty
- carries pan from stove to the other end of the counter
- carries, when appropriate, two or three items at a time
- while seated in a wheelchair, readily carries bread and condiments (placed in the lap) from refrigerator to counter
- while walking with a walker, carries shoes and polish in a basket on the walker without difficulty

3 = questionable transporting skill, but no apparent disruption of action item or task performance, or impact on other skill items
- possible hesitation or slowness while transporting objects
- examiner questions the presence of instability while transporting

2 = ineffective transporting skill impacts on action item or task performance, or results in inefficient use of time or energy
- some gait instability when carrying sheets
- slides objects that typically are transported (e.g., moving a pan from the stove to the other end of the counter top)
- difficulty carrying more than one or two items
- difficulty transporting objects in the wheelchair slows task progression

1 = severity of transporting skill deficit clearly impedes action item or task performance such that the results are unacceptable, or damage or danger is imminent
- attempts but unable to transport
- imminent risk of fall or dropping an object when attempting to walk while carrying the object
- unacceptable delay in task progression because of difficulty transporting
- examiner intervention required because severity of transporting skill deficit results in task breakdown, or imminent risk of damage or danger

Figure 9-2 Example performances by score category for the motor skill item Transports.


MANY-FACETED RASCH ANALYSIS OF THE AMPS MOTOR SCALE

Because the 15 motor skill items represent universal operations that underlie all IADL task performances, it is possible, for the first time, to relate motor skill capabilities directly to the simplicity of the IADL tasks. This is accomplished by using the many-faceted Rasch analysis computer program, FACETS (Linacre, 1988), to calibrate the motor skill items and the IADL tasks on a common log-linear scale (IADL motor scale). Person IADL motor skill measures are adjusted for the simplicity of the tasks actually performed. Therefore, it is possible to (a) determine where, on a conceptual continuum of ability, people of varying abilities are located; and (b) compare and predict performance capacity of those people across multiple tasks of greater or lesser simplicity than those they actually were observed performing. An added advantage of using many-faceted Rasch analysis is that raters can be calibrated according to their relative leniency. Moreover, the many-faceted Rasch model is used to calibrate each element (that is, each skill item, each task, each rater, each person) of each facet (item facet, task facet, rater facet, person facet) "on the same common log-linear scale so that a quantitative frame of reference for the [assessment] is constructed and quantitative comparisons among and within facets and facet elements can be made" (Lunz, Wright, & Linacre, 1990, p. 332). Therefore, it is possible to create a measurement system that is able to adjust person scores for the additive effects of skill item easiness, task simplicity, and rater leniency. (See Linacre, 1989, this volume; Lunz & Stahl, 1990; Lunz et al., 1990, for more detailed discussions of the many-faceted Rasch model.)
As applied to the AMPS, the many-faceted Rasch model specifies the following expectations: (a) a person has a higher probability of obtaining a higher score on an easy skill item than on a hard skill item, (b) easy skill items are easier for all individuals than are hard skill items, (c) judges award higher scores for easy skill items than hard skill items, (d) individuals obtain higher scores on less challenging tasks than more challenging tasks, and (e) people with higher ability obtain higher scores than do less able individuals. Moreover, since a 4-point rating scale is used to score the AMPS, all persons are expected to obtain progressively higher rating scale scores on progressively easier skill items and tasks (Andrich, 1988; Lunz & Stahl, 1990; Silverstein, Kilgore, & Fisher, 1989; Wright & Masters, 1982). When the data conform to these expectations, they fit the measurement model. The values of the parameters modeled to underlie the observed responses (raw skill item scores) are estimated according to these specifications until the expected (estimated) responses predicted by the model are as close as possible to the observed responses (Lunz & Stahl, 1990). With the AMPS, the skill item easiness calibration is the estimated location of that skill item on the continuum of increasing IADL motor ability. The task simplicity calibration is the estimated location of that task on the same continuum of increasing IADL motor ability. The rater leniency calibration is the estimated location of that rater on the common scale. Finally, the person measure is the estimated location of that person on the continuum of increasing ability that has been defined by the easiness of skill items and the simplicity of the tasks, after being adjusted for the raters who scored the task performances. These calibrations and measures are expressed in equal-interval units of measurement based on the logarithm of the odds (log-odds probability units or logits) of obtaining a given skill item score when a person of a given ability is observed by a given rater performing a given task (Andrich, 1988; Lunz & Stahl, 1990; Lunz et al., 1990; Wright & Masters, 1982).

The detailed fit statistics that are computed by the FACETS computer program then are examined to verify that a valid measurement system that conforms to the requirements for linear measurement is being constructed. The mean-square residuals, differences between observed and expected scores, provide a measure of the degree to which the skill items and tasks fit the expectations of the Rasch model (Linacre, this volume). The skill item and task mean-square fit statistics verify the internal validity of the AMPS motor scale. As the AMPS continues to be developed, those skill items and tasks that fit the model will be retained. Those that fail to fit the model will be revised or eliminated.
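The logit unit described above can be made concrete with a short worked illustration. It is only a sketch: it assumes the adjacent-category form in which the log-odds of score k over score k - 1 equals the sum of the person, skill item, task, and rater terms minus a step term, and the symbol \theta is introduced here purely for convenience.

```latex
% Assumed adjacent-category form; \theta is shorthand introduced for this illustration.
\theta = B_n + E_i + S_t + L_r,
\qquad
\frac{P_{nitrk}}{P_{nitr(k-1)}} = e^{\,\theta - F_k}

% Raising any one term by 1 logit multiplies the odds of the higher score by e:
\frac{e^{(\theta + 1) - F_k}}{e^{\,\theta - F_k}} = e^{1} \approx 2.72,
\qquad
\frac{e^{(\theta + 2) - F_k}}{e^{\,\theta - F_k}} = e^{2} \approx 7.39
```

Because the terms add on the logit scale, a 1-logit gain has the same effect on these odds whether it comes from a more able person, an easier skill item, a simpler task, or a more lenient rater.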
Since rater leniency also is calibrated, the FACETS computer program calculates rater fit statistics. Examination of rater fit statistics enables determination of the extent to which individual raters assign skill item scores consistently. A rater misfits when his or her assigned scores are internally inconsistent (that is, when the rater unexpectedly assigns high scores on hard skill items or to less able persons or low scores on easy skill items or to more able persons). Finally, person response validity is verified by examining person fit statistics that measure the extent to which a person's pattern of responses to the individual skill items corresponds to that predicted by the model (Linacre, this volume). A person will misfit when he or she obtains unexpectedly high scores on hard skill items or unexpectedly low scores on easy skill items. This misfit can provide useful diagnostic information that can be used to guide therapeutic interventions.


The intention is to construct a valid and reliable measurement system that can be used to evaluate individuals who have a wide range of ability levels. With individuals at the more able end of the ability continuum, the therapist must contribute to critical decisions regarding a person's ability to live independently in the community. Therefore, this study was focused on the examination of the validity and reliability of the AMPS motor scale when applied to community-living individuals. More specifically, a major focus of this study was the examination of rater consistency and severity. In addition, several aspects of validity were examined. The examination of the internal validity of the AMPS motor scale included evaluation of the fit of the items and the tasks to the many-faceted Rasch model (Linacre, 1989, this volume). Construct validity of the AMPS motor scale was evaluated by examining the hierarchical ordering of the motor skill item calibrations. Adequate strength of proximal shoulder and trunk musculature is necessary for postural control and fine motor skill (Case-Smith, Fisher, & Bauer, 1989). Further, fine motor skills and subtle postural background movements are commonly the only skills impaired in persons with mild motor deficits (cf. Fisher, Murray, & Bundy, 1991). Therefore, it was expected that (a) the motor skill items that assess components of strength would be among the easiest items, (b) the motor skill items that assess posture and mobility would be of intermediate difficulty, and (c) the motor skill items that assess fine motor skills and subtle postural control would be the most difficult (see Figure 9-1). Concurrent validity of the AMPS motor scale was examined by evaluating the ability of AMPS IADL motor measures to differentiate between individuals who are able to live independently in the community and those persons who require assistance to remain in the community.
Finally, the examination of the validity of the scales involved evaluation of person response validity.

METHODS

Subjects

The 56 subjects for this study included (a) 39 community-living well individuals without previously identified limitations of the ability to perform daily living tasks; (b) three community-living frail individuals without identified major medical conditions, but with identified functional limitations; and (c) eight community-living and six institutionalized individuals with major orthopedic, neurological, sensory (for example, hearing loss), or cognitive disabilities. Most of the subjects with disabilities experienced some restriction in the ability to perform daily life tasks. Three of the disabled subjects were able to live independently in the community; nine required minimal assistance or supervision to live in the community; and two needed maximal assistance or would be unable to live in the community. The well subjects ranged in age from 20 to 84 years; the frail subjects were all older adults; and the subjects with disabilities ranged in age from 28 to 81 years (see Table 9-1). All but four of the subjects were female. Three of the four male subjects were disabled.

Table 9-1 Subject Demographic Data

                                           Age (years)
Group                        Total (n)   Mean   Range    < 65 (n)   > 65 (n)
Community-living well           39        48    20-84       22         17
Community-living frail           3        77    68-84        -          3
Community-living disabled        8        72    62-81        1          7
Institutionalized disabled       6        64    28-80        2          4

Instrumentation

The AMPS was administered to each subject in accordance with the standardized administration procedures described in the test manual (Fisher, 1991). To ensure linkage between subjects, tasks, and raters, the task choices made available to the subjects were limited to the following eight tasks: repotting a small houseplant; vacuuming a living room (including moving light furniture); changing the sheets on a bed; preparing eggs, toast, and brewed coffee; preparing a grilled cheese sandwich; making a tossed green salad; preparing a tuna salad sandwich; and making a fruit salad. Forty-two of the subjects performed two tasks; the remaining 14 subjects performed one task.

Procedure

Upon obtaining informed consent for participation in this study, a trained rater administered the AMPS to each subject. Approximately five task choices were offered to each subject, and each subject selected one or two tasks to perform. All task performances were videotaped for later scoring by one or more of 15 trained raters. All of the raters were experienced occupational therapists trained in


the administration and scoring of the AMPS. Rater training was accomplished by means of a 3-day training workshop. Upon completion of the training, each rater independently scored one of four calibration videotapes containing approximately 10 videotaped task performances. Four of the raters co-scored several additional videotaped task performances. To ensure linkage among raters, each rater scored a minimum of five videotaped task performances (observations) that also were scored by at least four additional raters.

Data Analysis

A total of 221 rated observations were subjected to many-faceted Rasch analysis. To facilitate the ability to conceptualize the assumed additive relationship between the five facets of the constructed AMPS motor scale, the log-odds probability of a given score was modeled as

\[
\log\left(\frac{P_{nitrk}}{P_{nitr(k-1)}}\right) = B_n + E_i + S_t + L_r - F_k
\]

where

• P_{nitrk} = probability of person n being assigned score k by rater r on skill item i when performing task t
• P_{nitr(k-1)} = probability of person n being assigned score k - 1 by rater r on skill item i when performing task t
• B_n = ability measure of person n
• E_i = easiness calibration of skill item i
• S_t = simplicity calibration of task t
• L_r = leniency calibration of rater r
• F_k = difficulty of rating scale step k relative to step k - 1
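To see how the additive terms defined above turn into probabilities for the four score categories, the following minimal Python sketch may help. It assumes the adjacent-category form log(P_k / P_{k-1}) = B_n + E_i + S_t + L_r - F_k implied by those definitions; the function names and all numeric values are hypothetical illustrations, not estimates from this study and not the FACETS implementation.

```python
import math


def category_probabilities(ability, easiness, simplicity, leniency, steps):
    """Return P(score = 1..K) under an assumed adjacent-category model.

    Assumed form: log(P_k / P_{k-1}) = ability + easiness + simplicity
    + leniency - F_k, where `steps` holds the K-1 step difficulties F_2..F_K.
    """
    theta = ability + easiness + simplicity + leniency
    # Log-numerators: 0 for the lowest score, then running sums of (theta - F_k).
    log_num = [0.0]
    for f in steps:
        log_num.append(log_num[-1] + theta - f)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_num)
    weights = [math.exp(v - m) for v in log_num]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probs, lowest=1):
    """Expected rating given category probabilities (categories start at `lowest`)."""
    return sum((lowest + j) * p for j, p in enumerate(probs))


# Hypothetical values in logits (illustration only).
steps = [-1.0, 0.0, 1.0]  # F_2, F_3, F_4 for a 4-point scale
probs = category_probabilities(
    ability=2.0, easiness=0.5, simplicity=0.3, leniency=-0.2, steps=steps
)
print([round(p, 3) for p in probs], round(expected_score(probs), 2))
```

Under this form, a more able person, an easier skill item, a simpler task, or a more lenient rater each shifts probability toward the higher score categories by the same mechanism, which is what allows person measures to be adjusted for the other three facets.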

Both mean-square infit and mean-square outfit statistics were used to evaluate (a) the suitability of the skill items and tasks for constructing an IADL motor scale, (b) the consistency of the rater's scoring over skill items and observations, and (c) the usefulness of the scale, defined by the easiness of the skill items and the simplicity of the tasks, as a measure of the IADL motor ability of persons.

The infit statistic is an information weighted mean-square residual between observed and expected, which focuses on the accumulation of central, inlying, deviations from expectation. The outfit statistic is the usual unweighted mean-square residual, which is particularly sensitive to outlying deviations from expectation. (Lunz et al., 1990, p. 336)

The expected mean-square value is 1.0. Mean-squares less than 1.0 suggest the presence of unexpected redundancy, dependency, or constriction in the data. Redundancy or dependency occurs when items are highly correlated. Constriction occurs when scores are not sufficiently spread out across the range of the rating scale. Mean-squares greater than 1.0 signal the presence of unexpected variability, inconsistency, or extremism (Wright & Stone, 1979). Mean-squares greater than 1.3 or less than 0.7 were considered suggestive of unacceptable fit and they were targeted for further examination.
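The two fit statistics and the screening rule described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the commonly used formulation (outfit as the mean of squared standardized residuals, infit as the sum of squared residuals divided by the sum of model variances); it is not claimed to reproduce the FACETS computations, and the function names and sample numbers are hypothetical.

```python
def mean_square_fit(observed, expected, variance):
    """Return (infit, outfit) mean-squares for one element's ratings.

    observed: ratings actually assigned
    expected: model-expected ratings for the same observations
    variance: model variance of each rating (its statistical information)
    """
    if not observed or not (len(observed) == len(expected) == len(variance)):
        raise ValueError("inputs must be non-empty and of equal length")
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variance)) / len(observed)
    # Infit: information-weighted, so well-targeted observations dominate.
    infit = sum(sq_resid) / sum(variance)
    return infit, outfit


def needs_review(mnsq, low=0.7, high=1.3):
    """Apply the screening rule from the text: flag values < 0.7 or > 1.3."""
    return mnsq < low or mnsq > high


# Hypothetical ratings for a single skill item (illustration only).
infit, outfit = mean_square_fit(
    observed=[4, 3, 2, 4],
    expected=[3.5, 3.0, 2.5, 3.5],
    variance=[0.25, 0.5, 0.5, 0.25],
)
print(infit, outfit, needs_review(infit), needs_review(outfit))
```

Applied to the reported statistics, a rule of this kind would flag an infit of 1.4 or an outfit of 0.5 for further examination while leaving a value of exactly 1.3 or 0.7 unflagged.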

RESULTS

Validity of the AMPS Motor Scale

Table 9-2 shows the skill item easiness calibrations, the standard errors of these estimates, and the mean-square fit statistics for each skill item. Lifts is the easiest skill item (0.99) and Calibrates is the most difficult (-0.81). The construct validity of the AMPS motor scale is confirmed by the ordering of the easiness calibrations of the skill items. Lifts, Endures, Moves, and Reaches were expected to be the easiest skill items. Coordinates, Flows, Bends, Positions, Manipulates, Grips, and Calibrates were expected to be the most difficult skill items. The calibrated difficulty order is consistent with these hypothesized expectations.

Table 9-2  Skill Item Easiness Facet

Skill Item    Score  Count  Easiness Calibration (logits)  SE (logits)  Infit MnSq  Outfit MnSq
Calibrates      523    221                          -0.81          .12         1.2          1.6
Grips           528    221                          -0.73          .12         1.3          1.1
Manipulates     524    219                          -0.71          .12         1.0          0.9
Positions       527    220                          -0.71          .12         1.2          1.2
Bends           531    221                          -0.68          .12         0.9          0.8
Walks           563    221                          -0.14          .13         0.7          0.5
Flows           564    221                          -0.13          .13         0.9          0.6
Aligns          572    221                           0.02          .14         0.9          0.6
Stabilizes      575    221                           0.08          .14         0.8          0.5
Transports      573    219                           0.12          .14         1.0          0.7
Coordinates     590    221                           0.39          .15         1.4          1.0
Reaches         599    221                           0.61          .16         0.8          0.6
Moves           606    221                           0.79          .16         1.0          1.0
Endures         611    221                           0.92          .17         1.0          0.6
Lifts           613    221                           0.99          .17         1.1          1.0

Mean            567    221                           0.00          .14         1.0          0.9
SD               32      1                           0.62          .02         0.2          0.3
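The screening rule stated in the Method (mean-squares greater than 1.3 or less than 0.7) can be applied mechanically to the fit columns of Table 9-2. The sketch below is illustrative only: the values are transcribed from the table, and the helper name `screen_fit` is hypothetical.

```python
# (infit, outfit) mean-squares per skill item, transcribed from Table 9-2.
FIT = {
    "Calibrates": (1.2, 1.6), "Grips": (1.3, 1.1), "Manipulates": (1.0, 0.9),
    "Positions": (1.2, 1.2), "Bends": (0.9, 0.8), "Walks": (0.7, 0.5),
    "Flows": (0.9, 0.6), "Aligns": (0.9, 0.6), "Stabilizes": (0.8, 0.5),
    "Transports": (1.0, 0.7), "Coordinates": (1.4, 1.0), "Reaches": (0.8, 0.6),
    "Moves": (1.0, 1.0), "Endures": (1.0, 0.6), "Lifts": (1.1, 1.0),
}


def screen_fit(fit, low=0.7, high=1.3):
    """Split items into those with any mean-square above `high`
    (unexpected variability) and any below `low` (possible redundancy)."""
    too_high = sorted(item for item, values in fit.items() if max(values) > high)
    too_low = sorted(item for item, values in fit.items() if min(values) < low)
    return too_high, too_low


noisy_items, muted_items = screen_fit(FIT)
```

With these values the first list contains Calibrates and Coordinates, the two items the Discussion identifies as failing to fit; the second list holds the items with low outfit values that the Discussion recommends monitoring for dependency.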

Table 9-3  Summary of Misfitting Ratings by Rater

[The rater-by-skill-item cell counts are not legible in the source; only the marginal totals are given.]

Totals by rater

Rater Number  Total Ratings  Misfitting Ratings  Percentage Misfit
1                        89                   3                3.4
2                       105                   3                2.9
3                       419                  27                6.4
4                       418                  22                5.3
5                        90                   3                3.3
6                        75                   7                9.3
7                        75                   6                8.0
8                        75                   4                5.3
9                        75                   4                5.3
10                       75                   0                0
11                       75                   6                8.0
12                       74                   9               12.2
13                       75                  10               13.3
14                      405                  22                5.4
15                     1185                  42                3.5
Total                  3310                 168                5.1

Totals by skill item

Skill Item    Total Ratings  Misfitting Ratings  Percentage Misfitting
Stabilizes              221                   6                    2.7
Aligns                  221                   6                    2.7
Positions               220                  20                    9.1
Walks                   221                   6                    2.7
Reaches                 221                   5                    2.3
Bends                   221                   6                    2.7
Coordinates             221                  18                    8.1
Manipulates             219                  10                    4.6
Flows                   221                   5                    2.3
Moves                   221                  12                    5.4
Transports              219                   8                    3.7
Lifts                   221                  16                    7.2
Calibrates              221                  26                   11.8
Grips                   221                  16                    7.2
Endures                 221                   8                    3.6
Total                  3310                 168                    5.1

Table 9-4  Number of Score Category Ratings by Rater

Rater     1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  Total
Deficit   0  0  2  4  0  0  0  0  1   0   0   0   5   4   1

( 3.0 logits). Between 3.0 and 2.0 logits is a transition zone where the ability of the frail subjects and the most able subjects with disabilities equals that of the least able well subjects. These well subjects all were over the age of 65; they may be at risk for functional decline. Finally, the least able subjects (ability measures < 2.0 logits) were consistently those individuals who had identified functional limitations.

Rating Scale

The rating scale score categories and the frequency with which each score was assigned are shown in Table 9-11. In contrast to the scoring performance of the most lenient raters, who assigned Competent ratings 90 percent or more of the time (see Table 9-4), 69 percent of the total assigned ratings were Competent. The logit measures associated with each expected score are shown in Table 9-12. The expected score transitions are the expected calibrations for scores halfway between those actually included in the 4-point rating scale. It is these expected score transitions, expressed in logits, that are delineated on the rating scale facet in Figure 9-3. For example, a 3.5 expected score at 2.48 logits demarcates the transition between an expected Competent score of 4 and an expected Questionable score of 3. This transition between

Table 9-11  Rating Scale Score Category Statistics

Score Category  Count  Percentage  Step  Step (logits)  SE (logits)
1                  17           1     0
2                 383          12     1          -3.02         0.25
3                 614          19     2           1.10         0.07
4                2296          69     3           1.93         0.05

Table 9-12  Expected Score at Logit Measure

Expected Score  Logit Measure  Definition
1.0                        -∞  Deficit
1.5                     -3.04
2.0                     -0.98  Ineffective
2.5                      0.59
3.0                      1.53  Questionable
3.5                      2.48
4.0                        +∞  Competent

an expected competent IADL ability and an expected questionable IADL ability corresponds to the same region of the rating scale, between 3.0 and 2.0 logits, where the ability measures of the least able well subjects, the frail subjects, and the most able subjects with disabilities are located (see Figure 9-3).

Using the Constructed Scale to Predict Performance

Modeling the log-odds probability of a given score based on task simplicity, skill item easiness, and rater leniency has the effect of creating a set of geometrically additive slide rulers that make it easier to conceptualize the additive relationship between the five facets of the constructed AMPS IADL motor scale. These rulers are depicted in Figure 9-3. Vertically sliding each of the central three rulers to target the person, task, and skill items of interest enables the therapist to determine the predicted scores for that person when scored by a given rater for his or her performance on a given task. Figure 9-4 demonstrates this process. Suppose that we are interested in evaluating the ability of the AMPS motor scale to identify persons who may be beginning to experience functional decline or who may be at risk for loss of the ability to live in the community without assistance. If we position the task simplicity ruler so that the mean task simplicity (0.0, indicated by a pointer "<<") is centered on the most able of the identified frail community-living subjects (F), we can scan across Figure 9-3 to the rating scale facet to discover that this subject would be expected to be competent when repotting a plant, but questionable when making a fruit salad. Now, if we are interested in knowing what level of ability this person would be expected to have on the individual skill items, we can position the mean skill item easiness calibration (also shown by a pointer "<<") on the task of interest. In

Figure 9-4

The most able frail subject, performing a task of average challenge, scored on a difficult skill item.

(Note. Subject codes: W = community-living well subjects, F = frail subjects, D = subjects with identified orthopedic, neurological, or cognitive disabilities)

Figure 9-5

The most able frail subject, performing a task of average challenge, scored on an easy skill item.

(Note. Subject codes: W = community-living well subjects, F = frail subjects, D = subjects with identified orthopedic, neurological, or cognitive disabilities)


this case, we chose the four tasks of average simplicity. Again scanning across Figure 9-3 to the rating scale facet, we can see that we would expect this person to be competent on Lifts, Endures, Moves, Coordinates, and Reaches, but questionable on the harder skill items Calibrates, Grips, Manipulates, Positions, and Bends. Finally, if we position the rater facet ruler so that the mean rater leniency calibration (pointer "<<") is centered on the most difficult skill items, we can determine how scores assigned by raters of varying leniency can be expected to differ. This frail subject (F), when observed performing a task of average simplicity (for example, Salad), would be expected to score Competent on hard skill items (Bends) when rated by the most lenient two raters, but Ineffective when rated by the most severe rater. Comparison of Figure 9-4 and Figure 9-5 shows the range of expected scores between the easiest and the hardest skill items. In contrast to the expected performance shown in Figure 9-4, this frail subject (F), when observed performing a task of average challenge (Salad) and scored on an easy skill item (Endures), would be expected to score Competent when scored by all but the most severe rater, who would be expected to rate her Questionable. We are able to make these predictions for all calibrated facet elements even though the person in question performed only a few of the tasks. The opportunity to predict performances is valuable because it enables the therapist to assess in what areas the person will need intervention in order to be able to function independently in everyday tasks.
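The ruler-sliding procedure amounts to converting a combined logit measure into an expected score. A minimal sketch follows, assuming the usual rating scale formulation of the Rasch model (Wright & Masters, 1982) and the step calibrations reported in Table 9-11. The function names are hypothetical, and `measure` is taken to be the already combined logit value read from the rulers; how the facets are added together is not reproduced here.

```python
import math

# Step calibrations in logits, transcribed from Table 9-11.
STEPS = [-3.02, 1.10, 1.93]


def category_probabilities(measure, steps=STEPS):
    """Probability of each score category at a given logit measure."""
    # Log-numerator for category k is the running sum of (measure - step).
    log_numerators = [0.0]
    for step in steps:
        log_numerators.append(log_numerators[-1] + (measure - step))
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_score(measure, steps=STEPS, lowest_score=1):
    """Expected rating on the 1 to 4 scale at a given logit measure."""
    probabilities = category_probabilities(measure, steps)
    return sum((lowest_score + k) * p for k, p in enumerate(probabilities))
```

Under these assumptions, `expected_score(2.48)` comes out very close to 3.5, matching the Competent/Questionable transition reported in Table 9-12.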

DISCUSSION

The results of this study support the validity of the motor scale of the AMPS. This study also demonstrates the advantages of the use of the many-faceted Rasch model and the FACETS computer program to construct and validate measures. First, I have shown that it is possible to construct a single variable, a common measurement scale, that considers simultaneously the easiness of the skill items, the simplicity of the tasks, and the leniency of the rater in the calculation of person IADL motor skill measures. Second, I have shown how the detailed facet element fit statistics can be used to monitor and verify the validity of the scale, the consistency and leniency of the raters, and the responses of the individuals that are evaluated. Third, when the fit statistics signal unexpected behavior, I have shown how the source of the disturbance can be identified. Therapists can use this information to make informed decisions about the validity of the measures and the functional limitations of the person evaluated. This information also can


be used to make informed decisions about modification of the skill items and tasks or the provision of rater feedback. For example, the skill items Coordinates and Calibrates failed to demonstrate adequate fit to the Rasch model. In this instance, it was possible to determine that the source of the inconsistency was related to a few subjects and a few raters. This information was used to provide these raters with feedback regarding their inconsistent scoring, and to clarify for them the scoring criteria of these two skill items. These raters, who are now undergoing recalibration, can be monitored over time to evaluate the effects of the feedback on their scoring behavior. As the development of the AMPS motor scale proceeds, those skill items with low mean-square values should be monitored carefully. Future investigation should focus on verification of the presence of dependency among these skill items, and perhaps, the shortening of the assessment by omitting redundant items. As more of the 31 tasks currently included in the AMPS manual are calibrated into the measurement system, they will need to be monitored both for their fit to the measurement model and for their level of simplicity. The present results suggest the need to add more challenging tasks targeted at individuals whose ability measures are located near the transition zone between competent and questionable performance (see Figures 9-3, 9-4, and 9-5). The calibration of less challenging tasks that can be used to better evaluate individuals with disabilities also is needed.

SUMMARY

The FACETS Rasch analysis computer program is the first practical method that corrects person ability measures for differences among raters and, simultaneously, for variation in the simplicity of the tasks performed by the individual. The resulting person measures are not affected by the leniency of the particular rater who observed the performance, or by the simplicity of the particular tasks the person performed (Lunz & Stahl, 1990; Lunz et al., 1990). The feasibility of constructing a valid objective measurement system that meets the requirements of clinical practice has been demonstrated in this pilot study. The calibration of the skill items and the tasks on the same scale enables therapists to relate the discrete skill items directly to IADL tasks according to their relative positions on the common scale. This calibration takes advantage of all available observations, and the standardization process does not require a sophisticated or complete rating plan that requires that all persons be scored


on all items or on more than a few tasks. Moreover, when people are evaluated using the AMPS motor scale, their motor skill abilities can be related to all of the tasks calibrated in the measurement system, whether or not the person performed those tasks. Finally, when individuals do have unexpected patterns of scores that result in misfit to the Rasch-modeled expectations, their pattern of scores can be analyzed in order to interpret how the relationship between their motor skill deficits and their IADL task performance abilities differs from expectations. This study also supports the feasibility of developing a functional assessment that gives consideration to the motivation, interests, and needs of the individual, and that accounts for the leniency of the rater. Through the calibration of a bank of tasks that provides task choice options, each individual can select those tasks that are familiar to him or her and that reflect his or her values and interests. Finally, when the AMPS motor scale is used in clinical practice to evaluate people for whom there is concern about limitations in functional performance, therapists will be able to determine how those individuals would be expected to perform on tasks that are more or less challenging than those actually observed. Thus, therapists will be able to provide more detailed and accurate information to assist with important decisions about whether or not elderly or disabled individuals can live independently in the community. If assistance is required, they will have information about the level and type of assistance needed.

REFERENCES

Andrich, D. (1988). Rasch models of measurement (Sage University Paper series on Quantitative Applications in the Social Sciences, 07-068). Beverly Hills, CA: Sage.
Bernspang, B., Asplund, K., Eriksson, S., & Fugl-Meyer, A.R. (1987). Motor and perceptual impairments in acute stroke patients: Effects on self-care ability. Stroke, 18, 1081-1987.
Case-Smith, J., Fisher, A.G., & Bauer, D. (1989). An analysis of the relationship between proximal and distal motor control. American Journal of Occupational Therapy, 43, 657-662.
Consensus Development Panel (1988). National Institutes of Health Consensus Development Conference statement: Geriatric assessment methods for clinical decision-making. Journal of the American Geriatrics Society, 36, 342-347.
Eakin, P. (1989). Assessments of activities of daily living: A critical review. British Journal of Occupational Therapy, 52, 11-15.
Faletti, M.V. (1984). Human factors research and functional environments for the aged. In I. Altman, M.P. Lawton, & J.F. Wohlwill (Eds.), Elderly people and the environment (pp. 191-237, Human Behavior and Environment, Vol. 7). New York: Plenum Press.
Fisher, A.G. (1991). Assessment of motor and process skills (research ed. 5-R.2). Unpublished test manual available from the Department of Occupational Therapy, University of Illinois at Chicago.
Fisher, A.G., Murray, E.A., & Bundy, A.C. (1991). Sensory integration: Theory and practice. Philadelphia: F.A. Davis.
Fuhrer, M.J. (1987). Overview of outcome analysis in rehabilitation. In M.J. Fuhrer (Ed.), Rehabilitation outcomes: Analysis and measurement (pp. 1-15). Baltimore: Paul H. Brookes.
George, L.K., & Fillenbaum, G.G. (1985). OARS methodology: A decade of experience in geriatric assessment. Journal of the American Geriatrics Society, 33, 607-613.
Granger, C.V., & Gresham, G.E. (Eds.). (1984). Functional assessment in rehabilitation medicine. Baltimore: Williams & Wilkins.
Guralnik, J.M., Branch, L.G., Cummings, S.R., & Curb, J.D. (1989). Physical performance measures in aging research. Journal of Gerontology, 44, M141-146.
Jongbloed, L. (1986). Prediction of function after stroke: A critical review. Stroke, 17, 765-775.
Jongbloed, L., Brighton, C., & Stacey, S. (1988). Factors associated with independent meal preparation, self-care and mobility in CVA clients. Canadian Journal of Occupational Therapy, 55, 259-263.
Kane, R.A., & Kane, R.L. (1981). Assessing the elderly (pp. 1-23). Lexington, MA: Lexington Books.
Keith, R.A. (1984). Functional assessment measures in medical rehabilitation: Current status. Archives of Physical Medicine and Rehabilitation, 65, 74-78.
Kielhofner, G., & Burke, J.P. (1985). Components and determinants of human occupation. In G. Kielhofner (Ed.), A model of human occupation: Theory and application (pp. 12-41). Baltimore: Williams & Wilkins.
Law, M., & Letts, L. (1989). A critical review of scales of activities of daily living. American Journal of Occupational Therapy, 43, 522-528.
Lawton, M.P. (1987). Behavioral and social components of functional capacity. In National Institutes of Health (Author), Consensus Development Conference on Geriatric Assessment Methods for Clinical Decision-making (pp. 23-29). (Available from the National Institutes of Health, Washington, DC)
Linacre, J.M. (1988). FACETS computer program for many-faceted Rasch measurement. Chicago: MESA.
Linacre, J.M. (1989). Many-faceted Rasch measurement. Chicago: MESA.
Lunz, M.E., & Stahl, J.A. (1990). Judge consistency and severity across grading periods. Evaluation and the Health Professions, 13, 425-444.
Lunz, M.E., Wright, B.D., & Linacre, J.M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3, 331-345.
Pincus, T., Callahan, L.F., Brooks, R.H., Fuchs, H.A., Olsen, N.J., & Kaye, J.J. (1989). Self-report questionnaire scores in rheumatoid arthritis compared with traditional physical, radiographic, and laboratory measures. Annals of Internal Medicine, 110, 259-266.
Reed, B.R., Jagust, W.J., & Seab, J.P. (1989). Mental status as a predictor of daily function in progressive dementia. Gerontologist, 29, 804-807.
Rubenstein, L.Z., Schairer, C., Wieland, G.D., & Kane, R. (1984). Systematic biases in functional status assessment of elderly adults: Effects of different data sources. Journal of Gerontology, 39, 686-691.
Silverstein, B., Kilgore, K., & Fisher, W. (1989). Implementing patient tracking systems and using functional assessment scales (Center for Rehabilitation Outcome Analysis monograph series on issues and methods in rehabilitation outcome analysis, Vol. 1). Wheaton, IL: Marianjoy Rehabilitation Center.
Skurla, E., Rogers, J.C., & Sunderland, T. (1988). Direct assessment of activities of daily living in Alzheimer's disease: A controlled study. Journal of the American Geriatrics Society, 36, 97-103.
Teri, L., Borson, S., Kiyak, H.A., & Yamagishi, M. (1989). Behavioral disturbance, cognitive dysfunction, and functional skill: Prevalence and relationship in Alzheimer's disease. Journal of the American Geriatrics Society, 37, 109-116.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B.D., & Stone, M.H. (1979). Best test design. Chicago: MESA Press.

chapter 10

Measuring Chemical Properties With the Rasch Model

T.K. Rehfeldt
The Sherwin-Williams Co.

In the paint and coatings industry we do many careful quantitative experiments to develop better coatings. The use of sophisticated experimental designs is increasingly important. We carefully analyze the data that we obtain from these experiments. We carefully measure the processing and composition variables during the experiment, and we make every effort to control variables. However, the responses we measure for these designed experiments frequently take the form of subjective ratings on arbitrary scales. Solvent resistance is one such test. Here we are interested in what happens when solvent is spilled on the surface. For automotive coatings, solvents of interest are gasoline, methanol, and engine coolant. Obviously, we don't want a little gasoline to leave a visible mark on the car. Other tests are described below. Stain resistance is similar to solvent resistance but with more persistent substances, such as grease, oil, and tar. No one wants an automobile covered with oil stains that cannot be removed. Corrosion protection, salt spray, and weathering refer to the effects of water, sunlight, road salt, and environmental conditions on the finish. Blocking and blistering are related to the hardness and durability of the finish and how well the finish coat adheres to the primer or the metal surface. Orange peel is the presence of texture in the paint that makes the finish look like an orange. Excessive orange peel or other texture is objectionable in a high-quality paint. A whole series of appearance tests, such as texture, color, color match of adjacent parts of the car, and gloss, are also based upon ratings. All of these tests are very important to paint retailers and customers, because a poor paint job can kill a sale. The very nature of the rating process ensures that these scales are subjective. At best, the rating process produces ordinal rankings; but proper evaluation requires quantitative interval measures. In this chapter I describe how we have used the Rasch model to obtain the interval measures that are required but are only implied by the rankings.

EARLIER ANALYTIC TECHNIQUES

There have been attempts to overcome the problem of subjectivity in rating scales of paint performance. Usually a reference material, whose properties are known, is included with the experimental materials. However, this does not ensure equal interval, repeatable, objective scales. Rank order statistics are used to evaluate rating scale data (Lehmann, 1975; Sprent, 1989; Siegel, 1956; Hill & Prane, 1984). Here the scale itself is ignored, and the various paints under test are ranked from best to worst by one or several judges. Rank order calculations are used to equate the rankings of the various judges. These rankings usually work well; judges will rank a group of paints in the same order, subject to experimental error, and it is easy to tell which is the best and which is the worst. However, the ranking techniques do not provide objective scales of measurement. The rankings show which is better, but there is no way to tell how much better one coating is than the others. This is particularly troublesome in the middle range of the rankings, where discrimination is more important. In general, it is not difficult for judges to agree upon very good performance. Judges will also usually agree on very poor performance. However, this is not where most experiments will take place. One goal of industrial experimentation is to provide good performance at lower costs; thus, we are constantly looking for small improvements, or incremental changes in performance. This is


where discrimination is most important. We want to know how low we can push one part of a formulation and still get an acceptable performance in the middle ratings. Thus, for rank order methods the greatest uncertainty occurs at the place where precision is most important. Another consideration is that one must always deal with a group of coatings and references; the usual rank order statistics do not provide an objective scale that can be used in subsequent testing, where the make-up of the group will often change. Differences among coatings, which are part of a statistically designed experiment, are often analyzed by multiple analysis of variance (ANOVA) on the raw scores. In these cases each facet or factor is examined as a treatment level. However, this technique only detects which factors are associated with differences in the performance. It does not rank the various paints, nor does it construct useful rating scales. Often every factor appears to be significant in an analysis of variance. Furthermore, the initial scores, provided by the experts, do not, by themselves, provide the interval scale necessary to do the analysis of variance properly. What we would like is a technique that will allow for the differences between judges, that will measure the relative performance of the coatings, and that will produce an equal interval scale for use in subsequent testing.

An Example

The application of the Rasch model (Rasch, 1960; Wright & Panchapakesan, 1969; Wright & Linacre, 1987; Wright & Stone, 1979; Wright & Masters, 1981) to these paint problems will be illustrated by examination of an experiment in which the response of interest was stain resistance. This experiment was chosen for this work because it is typical, in design and extent, of experiments conducted in our development laboratories, and, thus, is a useful test case (Rehfeldt, 1990).

METHOD

An experiment was conducted that investigated seven different polymer formulas. Each formula was evaluated with two hardeners. The response variable was stain resistance; for this experiment the stain was applied to the test paints at three concentrations. The staining agent was placed on the test panels and allowed to remain overnight. The stain was then washed off with 10 double rubs of a cloth saturated


with a suitable solvent. The appearance of the stained and cleaned area was evaluated by the judges. A completely balanced design was used: we examined all combinations of polymer, hardener, and stain concentration. This produced 42 test results. Each of the 42 tests was rated by five judges on a scale of 0 to 8, where 0 is total failure and 8 is superior performance. The raw ratings for the experiment are shown in Table 10-1. The paint samples are designated by polymer, 61-67; hardener, A and B; and stain concentration, low, medium, and high.

RESULTS

From the data in Table 10-1, an objective equal interval scale was constructed by application of the basic Rasch model that produced a rating scale. The logit measures were estimated for each of the 42 tests. The fit of this model was good. There were no misfitting judges or panels. The test panel separation was about 1.6, and the separation reliability was about 0.8 (both on standardized residuals). Mean square errors were less than 1 on the logit scale. Since the data set was rather smaller than in the more traditional uses of this method, convergence was somewhat slower (100 to 150 iterations are typical). The position of each test coating was estimated on the scale. This scale, and the positions of each paint along the scale, are shown in Figure 10-1 and given in Table 10-2. Figure 10-1 shows the distribution of test paints on the scale. This plot tells us several things about our test paints. Polymer 67, when used with hardener A, is the best performer, since it is highest on the scale for both high and medium stain concentration. Polymer 61, with hardener A, gives equal performance, but only at the low stain concentration. As a result of the linear, equal interval scale, we can tell that, for example, the improvement in performance between the median, 1.5 logits, and the polymer 61/hardener B/low stain combination is the same as the improvement between polymer 64/hardener B/low stain and polymer 66/hardener A/low stain. In other words, the odds that polymer 61/hardener B/low stain will pass the stain resistance test are about 4.5 times the odds at the median performance. Likewise, the odds that polymer 66/hardener A/low stain will pass the test are 4.5 times the odds that polymer 61/hardener B/low stain will pass the test. Also, the improvement between polymer 61/hardener B/low stain and polymer 63/hardener A/low stain is twice the improvement between polymer 67/hardener B/low stain and polymer 61/hardener B/low stain. We


Table 10-1  Raw Ratings from Stain Resistance Experiment

       Conditions               Judges
Obs  Poly.  Hard.  Conc.   BHA  KIL  LMF  DIA  PCC
  1     61      A  Low       8    8    8    8    8
  2     62      A  Low       7    8    7    8    7
  3     63      A  Low       6    8    7    8    7
  4     64      A  Low       6    7    6    7    6
  5     65      A  Low       7    8    7    8    7
  6     66      A  Low       7    7    6    8    7
  7     67      A  Low       7    7    7    7    7
  8     61      B  Low       6    7    5    6    6
  9     62      B  Low       2    3    2    3    3
 10     63      B  Low       1    2    1    2    2
 11     64      B  Low       6    7    6    7    6
 12     65      B  Low       6    7    6    7    5
 13     66      B  Low       4    5    4    5    4
 14     67      B  Low       5    6    5    6    5
 15     61      A  Med       8    8    7    8    7
 16     62      A  Med       5    6    5    5    5
 17     63      A  Med       5    6    5    5    5
 18     64      A  Med       6    6    6    6    5
 19     65      A  Med       7    8    8    8    7
 20     66      A  Med       7    8    7    8    7
 21     67      A  Med       8    8    8    8    8
 22     61      B  Med       0    0    0    0    0
 23     62      B  Med       0    0    0    0    0
 24     63      B  Med       0    0    0    0    0
 25     64      B  Med       5    6    5    5    5
 26     65      B  Med       3    4    4    3    4
 27     66      B  Med       0    1    1    0    0
 28     67      B  Med       4    5    4    3    4
 29     61      A  High      6    7    6    7    6
 30     62      A  High      5    6    5    5    5
 31     63      A  High      5    6    5    5    4
 32     64      A  High      6    6    5    6    5
 33     65      A  High      6    7    6    7    6
 34     66      A  High      7    8    7    7    7
 35     67      A  High      8    8    8    8    8
 36     61      B  High      0    0    0    0    0
 37     62      B  High      0    0    0    0    0
 38     63      B  High      0    0    0    0    0
 39     64      B  High      5    6    5    5    5
 40     65      B  High      3    4    3    3    4
 41     66      B  High      0    0    0    0    0
 42     67      B  High      1    2    2    1    1

NB: 8 = Superior Performance and 0 = Complete Failure


Figure 10-1 Objective Scale of Stain Resistance Constructed from Raw Ratings


Table 10-2  Summary of Stain Resistance Measurements by Hardener, Stain Concentration, and Polymer on Original Logit Scale

                  Hardener A                Hardener B
              Stain Concentration       Stain Concentration
              High    Med    Low        High    Med    Low
Polymer 64    2.05   2.50   3.86        1.16   1.16   3.86
Polymer 65    3.86   7.32   6.63       -1.51  -1.28   3.40
Polymer 67    8.41   8.41   5.41       -3.81  -0.80   1.60
Polymer 61    3.86   7.32   8.41       -6.67  -6.67   2.95
Polymer 66    6.00   6.63   5.41       -6.67  -5.66  -0.27
Polymer 62    1.16   1.16   6.63       -6.67  -6.67  -2.38
Polymer 63    0.74   1.16   6.00       -6.67  -6.67  -3.54

N.B. 8.41 is the maximum measure and -6.67 is the minimum measure.

have no hope of making this kind of inference from the original 0-8 scale. The Rasch Facets (Linacre, 1989) model was next applied to the data shown in Table 10-1. The fit of the model in this case was very similar to the fit, described above, for the simple rating scale case. There were no misfitting panels, and the standard errors were less than 1 for all cases. If we use the extension of the Rasch model to the multifaceted case, then we can partition the effects of the separate facets, still on the equal interval scale. A model of this type was calculated, and the results, shown in Tables 10-2, 10-3, and 10-4, were obtained. In Table 10-3 we see the overall effect of the hardener on the performance of these test paints. We see immediately that hardener A is better than hardener B. This is the average effect of hardener, separated from the other variables. This means that, for any combination of polymer and stain concentration, the addition of hardener A will give better performance than hardener B. Thus, for whatever polymer

Table 10-3  Effect of Hardener on Stain Resistance

Hardener        Score  Count  Measure Logit  Model Error
1  A              705    105           1.61         0.12
2  B              279    105          -1.61         0.07

Table Mean:       492    105          0.00*         0.10
Table S.D.:       213      -           1.61         0.02

*Centered during estimation


Table 10-4  Effect of Stain Concentration on Stain Resistance

Concen.         Score  Count  Measure Logit  Model Error
LOW               415     70           1.09         0.12
MED               300     70          -0.38         0.11
HIGH              269     70          -0.71         0.10

Table Mean:       328     70          0.00*         0.11
Table S.D.:        62      0           0.79         0.01

*Centered during estimation

is chosen, you will be better off with hardener A. This is equivalent to an ANOVA, but the values used are interval measures and the emphasis is on the amount or magnitude of the effect. Often an ANOVA is made, in this context, which examines only the statistical significance of the effect, and the magnitude of the effect is left uninterpreted because the scale is not meaningful (see, for example, Broder, Kordomenos, & Thomson, 1988). In Table 10-4 we see the effect of stain concentration. Each polymer, when all conditions are considered, has better stain resistance when the stain concentration is low than when it is medium or high. This is to be expected, but we now have a quantitative estimate of the differences. The odds, for any combination of polymer and hardener, that a paint will pass the stain resistance test are about six times better at the low stain concentration than at the high concentration. In addition, we have a basis, as we shall see below, for detecting unusual performance, and, hence, unexpected results.

DIFFERENCES AMONG JUDGES

Table 10-5 illustrates one of the primary advantages of the Rasch model analysis over naive interpretation of the raw ratings. It is evident from this table that these five judges do not rate in the same way: different judges give different ratings to the same paint panel. Here are two groups of judges: KIL and DIA, who are similar in the leniency of their ratings at 0.38 and 0.12 logits; and LMF, BHA, and PCC, who are also similar among themselves, at -0.15, -0.17, and -0.17 logits, respectively, but who are significantly more severe than the previous group of two. The latter group is about 0.35 logits more stringent, or harsher in their ratings, than the former. If this difference is not considered in the analysis of the data, then the ratings

184

REHFELDT Table 10-5 Differences in Judges' Rating Behavior of Stain Resistance Score

Count

Measure Logit

Model Error

KIL DIA

216 203

42 42

0.38 0.12

0.14 0.14

LMF BHA PCC

189 188 188

42 42 42

0.15 0.17 -0.17

0.14 0.14 0.14

Table Mean: Table S.D.:

196 11

42 0

Judge

0.00* 0.22

0.14 0.00

*Centered during estimation

obtained depend, at least in part, on who does the rating and not on the performance of the paint. It may be argued that the rankings of the coatings may be the same even though each judge gives different individual ratings. While this may be true, and could be used in experiments of this type, it implicitly places two restrictions on the data analysis. First, the experiment must contain enough samples to provide significance to the rankings, which means 10 or more samples. Second, the rankings are only adequate for the experiment at hand and cannot be used to evaluate subsequent measurements of the property—here, stain resistance. One must, thus, conduct a complete experiment with at least 10 trials for each evaluation. Further Analysis In Table 10-6 we see the effect of the polymer on the performance. The measure order of the polymers is from best to worst, again separated from the other variables. We get our positions, on an objective scale; and the scale can be used for one or a few subsequent measurements without running the entire experiment over again. Further, the measures tell us, not only which polymer is better, but how much better as well. We can use the measures determined by the model in several ways. Table 10-2 shows the summary of the experiment plotted in Figure 10-1. The scale measures are in the body of the table. The polymers are in order of decreasing performance. The columns of the table show the effect of stain concentration and hardener. The values here are the logits on the original scale.
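Because the measures are log-odds, a difference between two measures converts directly into a ratio of odds. The short sketch below reproduces the "about six times" figure quoted for stain concentration from the LOW and HIGH measures of Table 10-4. The function name `odds_ratio` is ours, chosen for illustration; it is not part of FACETS.

```python
import math


def odds_ratio(logit_a: float, logit_b: float) -> float:
    """Ratio of the odds implied by two measures on the same logit scale."""
    return math.exp(logit_a - logit_b)


# Stain-concentration measures reported in Table 10-4.
low_concentration = 1.09
high_concentration = -0.71

# A 1.80-logit difference corresponds to odds roughly six times better.
print(round(odds_ratio(low_concentration, high_concentration), 2))
```

The same conversion applies to any pair of measures within one facet, for example two polymers in Table 10-6.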

Table 10-6
Effect of Polymer on Stain Resistance

Polymer        Score   Count   Measure (Logit)   Model Error
64 (Best)       173      30         1.38            0.16
65              173      30         1.38            0.16
67              169      30         1.29            0.15

61              140      30         0.61            0.16
66              132      30         0.37            0.18

62              102      30        -0.60            0.18
63 (Worst)       95      30        -0.83            0.18

Table Mean:     140      30.0       0.51            0.17
Table S.D.:      30       0.0       0.86            0.01

NB: There are three groups.

We see here that Polymer 64 is rated best overall by virtue of its total performance. While lower, for example, than Polymer 67 with high concentration of stain and hardener A, Polymer 64 is more consistent over the various stain concentrations and with both hardeners. In fact, Polymer 64 is the only polymer that did not receive negative measures with Hardener B.

We can also see an interesting anomaly. Polymer 67, with hardener A, performs better with high stain concentrations than it does with lower stain concentrations. This is not expected, and may be important for formulation of this type of coating. This anomaly was found by examination of the residuals from a multifaceted analysis. We see this in Table 10-7. Here, the expected ratings, near 8, are shown with the residuals. All the judges rated Polymer 67 at 7, lower than expected. This would indicate an area for further investigation.

Table 10-7
Residuals Analysis of Stain Resistance Measurement

Polymer/Hardener   Conc.   Judge   Obs.   Expect.   Residual
67A                LOW     BHA      7      7.9       -0.9
67A                LOW     KIL      7      8.0       -1.0
67A                LOW     LMF      7      7.9       -0.9
67A                LOW     DIA      7      8.0       -1.0
67A                LOW     PCC      7      7.9       -0.9

Finally, we can use the FACETS analysis to combine effects of the variables if we desire. For example, in Table 10-8, we have combined the effects of the polymer type and the hardener by using the polymer/hardener combination as a single facet rather than as two facets. In this analysis each polymer/hardener combination was entered as a separate element of that facet (polymer 61/hardener A, for example, is a single factor), so the dimensions of the data matrix are changed from 7 x 2 x 3 x 5 (polymers, hardeners, concentrations, judges, respectively) to 14 x 3 x 5 (polymer/hardener, concentrations, judges). Here, we obtain positions of polymer and hardener combinations with respect to stain resistance. The scale, shown in Table 10-8, then, is the scale for the polymer/hardener combinations calculated with respect to stain resistance by the five judges. In this case we do not separate the effects of the polymer and hardener, so the polymer 67/hardener A entity is the best overall.

Table 10-8
Ranking of Stain Resistance with Polymer and Hardener Combined

Polymer/Hard   Score   Count   Logit   Error
67A             115      15     5.97    0.55
61A             110      15     4.77    0.45
66A             108      15     4.39    0.43
65A             107      15     4.20    0.42
62A              89      15     1.64    0.35
64A              89      15     1.64    0.35
63A              87      15     1.40    0.34
64B              84      15     1.08    0.32
65B              66      15     0.20    0.23
61B              30      15    -2.02    0.27
67B              24      15    -2.52    0.30
62B              13      15    -3.51    0.30
63B               8      15    -4.00    0.33

Table Mean:      72      15     0.99    0.36
Table S.D.:      38       0.0   3.16    0.08

Further Application of the Rasch Model

We have begun to use the Rasch model for rating scales in several ways. One method that shows promise is to do the standard analysis and obtain the property map, such as the one shown in Figure 10-1. Once we have such a map, we can select a suitable number of paint panels scattered along the scale. We try to select a suitable number, 5, 8, or 10, depending on the test in question, and arrange them to approximate equal intervals. Then, for subsequent applications of the test we ask the judge to select the best match of the test piece panel

with one of the set of measured standards. In this manner the individual judge does not have to know anything about the analysis method, but the results are in line with the equal interval scale we want. We simply translate the match into the proper logit measure.

In another application we have started to examine color perception. This is important for automotive paints in particular. Even when we have gone to great lengths in spectroscopic analysis to assure a color match or color purity, we find that certain viewers can perceive a difference from the standard color. Until recently we were at a loss to control this feature of paints. We are now beginning to use the Rasch model to measure the color and match perception. In this manner we will obtain a measure of color perception which we can compare with the spectroscopic analysis. We believe that this additional testing will produce many fewer rejects on the basis of color than we currently experience.

SUMMARY

A method of overcoming the difficulties of rating scale rankings of paints has been demonstrated. The utility of the method includes construction of an objective measurement scale, detection and adjustment for differences in judges, measures of performance, means to detect outliers, and consistent measures from one experiment to the next. The model is suitable for rating scale, pass/fail, and minimum performance testing in paints and coatings. Such tests as stain resistance, solvent resistance, tape time, cross hatch adhesion, hardness, and other such tests with inherently large scatter are suitable candidates for Rasch facets analysis. When this model is used, rating scale rankings can be used to estimate experimental measures for regression and other designed experiments in a like manner to other quantitative measurements.

REFERENCES

Broder, M., Kordomenos, P.I., & Thomson, D.M. (1988). A statistically designed experiment for the study of a silver automotive basecoat. Journal of Coatings Technology, 60(766), 27.
Hill, E., & Prane, J.W. (1984). Applied techniques in statistics for selected industries: Coatings, paints, and pigments. New York: John Wiley and Sons.
Lehmann, E.L. (1975). Non-parametric statistical methods based on ranks. San Francisco: Holden-Day.
Linacre, J.M. (1989). Many-faceted Rasch measurement. Unpublished doctoral dissertation, University of Chicago.
Rasch, G. (1960). Probabilistic models for intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
Rehfeldt, T.K. (1990). Measurement and analysis of coatings properties. Journal of Coatings Technology, 60(790), 53-58.
Siegel, S. (1956). Non-parametric statistics. New York: McGraw-Hill.
Sprent, P. (1989). Applied non-parametric statistical methods. New York: Chapman-Hall.
Wright, B., & Linacre, M. (1987). Rasch model derived from objectivity. Rasch Measurement SIG Newsletter, 1(1), 2-3.
Wright, B., & Masters, J. (1981). Rating scale analysis. Chicago: MESA Press.
Wright, B., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48.
Wright, B., & Stone, M. (1979). Best test design. Chicago: MESA Press.

Chapter 11

Impact of Additional Person Performance Data on Person, Judge, and Item Calibrations

John A. Stahl
National Association of Boards of Pharmacy

Mary E. Lunz
American Society of Clinical Pathologists

Achievement testing often relies on multiple choice items. Multiple choice items are economical when testing large populations, they have well-documented psychometric properties, and they are reliable because many items can be included in a test. The limitation of multiple choice items lies in proving that they do indeed measure competence to perform specified tasks. In most cases, they measure knowledge of how to perform the task. In any performance-related field, a direct observation and judgement of a candidate's ability to perform the desired tasks is preferable to the less direct measure of knowledge provided by multiple choice items.

The development of Rasch models to handle many-faceted measurement, in particular the FACETS program (Linacre, 1988), has opened the opportunity for developing economically feasible ways of making more direct assessments of candidate performances. Oral examinations, practical examinations, and essay examinations, all of which involve the use of judges, can now be used in assessing candidates without sacrificing the properties of objective measurement (Lunz, Wright, & Linacre, 1990; Lunz & Stahl, 1990).

More direct assessment of a candidate's capability to perform a particular task is desirable; however, we should not be too hasty in abandoning the information that can be obtained through more traditional testing instruments. In many cases, the area being tested involves both the capability to perform tasks and a base of essential knowledge. Knowledge can be tested efficiently with a question-and-answer format. Frequently, a multiple choice test is the most efficient method for gathering this information.

The ideal situation would be to use all of the available information concerning a candidate's capabilities when making the assessment. The traditional method is to use several testing instruments, make an independent assessment using each instrument, and then require the candidate to pass all parts. An alternative method is to combine all the available information into one single analysis. The flexibility of the FACETS program allows this alternative method to be explored.

This study is an exploration of single analysis assessment using several different test instruments. Data from a multiple choice written examination and from a judge-mediated practical examination are combined. Both tests were administered to the candidates, although a small subgroup took only one of the two tests. The combined data set was analyzed using the FACETS program, and the results of the analysis were compared to the results obtained from analyzing the multiple choice and practical examinations separately.

METHOD

The data are from the certification process in histology,¹ a clinical laboratory specialty. The first examination consisted of 173 multiple choice items administered to 417 candidates. The questions covered processing, cutting, and staining tissue and general laboratory operations. The second examination was a practical that required the candidates to prepare 15 histology slides according to prescribed criteria. These slides were prepared by 321 candidates and mailed to a central location for grading.
The slides were graded by a group of trained judges during a two-day grading session. The slides were graded on seven tasks: preparing the tissue block, labeling the slide, coverslipping the slide, obtaining the proper tissue sample size, processing the tissue, cutting the tissue, and staining the tissue. The candidates for both examinations consisted of individuals who had met the criteria to sit for the examination either by completing an approved program of instruction in histology or through a combination of on-the-job training and experience.

¹ Histology is the science concerned with the structure of cells, tissues, and organs in relation to their function. Histotechnology is concerned with the preparation of slides for use in the microscopic study of tissues.

ANALYSES

The multiple choice examination was analyzed using the BIGSCALE program (Wright, Linacre, & Schultz, 1990) for Rasch analysis. Measures for each examinee and difficulties for each item on the test were obtained. The practical examination was analyzed initially using the FACETS (Linacre, 1988) program for many-faceted Rasch analysis. For this examination, the probability of candidate n with ability B_n achieving score x (rather than score x - 1) on slide i with difficulty D_i from judge j with severity C_j was modeled as:

\[
\log\left(\frac{P_{nijx}}{P_{nij(x-1)}}\right) = B_n - D_i - C_j - F_x
\]

where

P_{nijx} = probability of candidate n being given score x by judge j on slide i
P_{nij(x-1)} = probability of candidate n being given score x - 1 by judge j on slide i
B_n = ability of candidate n
D_i = difficulty of slide i
C_j = severity of judge j
F_x = difficulty of achieving rating step x relative to step x - 1
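To make the model concrete, the sketch below turns the adjacent-category log-odds into a full set of category probabilities and an expected rating. It is an illustration under stated assumptions, not the FACETS estimation routine: the function names and all parameter values in the usage lines are hypothetical, and score 0 is taken as the reference category.

```python
import math


def category_probabilities(b, d, c, steps):
    """Probabilities of scores 0..m when log(P_x / P_(x-1)) = b - d - c - F_x.

    b: candidate ability, d: slide difficulty, c: judge severity (logits).
    steps: step difficulties F_1..F_m; score 0 has no step of its own.
    """
    # Log of each unnormalized category weight: cumulative sum of step logits.
    log_weights = [0.0]
    for f in steps:
        log_weights.append(log_weights[-1] + (b - d - c - f))
    peak = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(probs):
    """Model-expected rating: sum of each score times its probability."""
    return sum(score * p for score, p in enumerate(probs))


# Hypothetical values for one candidate, slide, and judge on a 0-3 scale.
probs = category_probabilities(b=1.0, d=0.2, c=0.3, steps=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])
print(round(expected_score(probs), 2))
```

The difference between an observed rating and this expected score is the kind of residual on which fit statistics are built.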

The above equation is the general expression for the three-faceted Rasch rating scale model (Linacre, 1989, p. 62). The three components in the examination are the candidates, the items, and the judges. The probabilities of success are modeled as an additive combination of these three components. Taking the logarithm of the probability odds expresses these parameters in log-odds units (logits). Measures for each candidate, difficulties for each slide, and severities for each judge were obtained. The data from the histology multiple choice and practical examinations were then combined into one data set and reanalyzed using the FACETS program. This analysis added a facet to the model to account for the dichotomously scored multiple choice items (B_n - D_i - C_j - M_l - F_x), where M_l is the difficulty of the multiple choice items. This


combined analysis resulted in a single measure for each candidate, a severity for each judge, and a difficulty for each slide and each multiple choice item. The results of these three analyses were then compared. Calibrations and measures from the analyses of the examinations were plotted against the corresponding results obtained from the combined analysis.

RESULTS

The Rasch fit statistics are a measure of the fit of the data to the model. The Infit (information weighted mean squared residual) is sensitive to an accumulation of central or inlying deviations. The Outfit (unweighted mean squared residual) is sensitive to occasional outlying deviations. Significant departures from the expected value of 1.0 indicate disruptions in the testing process. The fit statistics for the multiple choice items and for the slide items are presented in Tables 11-1 and 11-2, for both the individual and the combined analyses. The multiple choice items show very little misfit. Two of the slide items show evidence of some misfit. Slide 3 has an Outfit of 2.2, indicating that there were some outlying scores. This was a relatively easy slide, and the outlying scores were probably due to unexpectedly low ratings on this item given to a few examinees. Slide 9 had low Infits and Outfits, indicating that there was a greater than expected consistency in the ratings of this item, probably all 2s and 3s. The degree of misfit for these two items was not sufficient to preclude using them in the analysis. Having determined that the data fit the model, we can now examine whether the simultaneous analysis of the two sets of results has introduced measurement disturbances. This is accomplished by comparing the pertinent measures derived from the separate analyses with those derived from the combined analysis. In Figure 11-1, the item difficulties obtained from the initial BIGSCALE multiple choice examination analysis are plotted against the item difficulties obtained from the combined FACETS analysis. It can be seen that the multiple choice item calibrations were not affected by the addition of the practical examination data. In Figure 11-2, the calibrations of the slides obtained from the initial FACETS analysis are plotted against the slide calibrations obtained from the combined FACETS analysis.
The slide calibrations were not substantially affected by the addition of the multiple choice item data. In Figure 11-3, the judge severities obtained from the initial FACETS

Table 11-1
Multiple Choice Items Fit Statistics

Combined Analysis

Individual Analysis Item

Infit

1

.9 1.0 .9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.1 1.0 1.0 1.0

2

3 4

5 6

7

8 9 10 11

12 13 14

15

16 17

18 19 20 21 22 23 24 25 26

27 28 29 30 31 32

33 34 35 36 37 38 39 40

41 42

1.1 1.0

1.1 1.0 1.0 .9 1.0 1.0 1.0 1.0 1.0 1.0

.9 1.0 1.1 .9 1.0 1.0 1.0 1.1 .9 1.0 1.0 1.1 1.0 1.0

Multiple Choice Items

Outfit

.9 .9 .7 1.1 1.0

1.2 1.0 1.0 1.0 1.1 1.0 1.0

1.1

1.0 1.0 .9 1.1 1.1 1.2 1.1 1.0 .8 1.0 .9 1.0 .9 1.0 1.0 .8 1.1 1.1 .9 1.0 .9

1.0 1.1 .9 1.1 1.2 1.3 1.1 .9

Infit

.9 .9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.1 1.0

1.0 1.0 .9 1.0 .9 1.0 1.0 1.0 1.0 .9 .9 1.0

.9 1.0 1.0 1.0 1.0 .9 1.0 1.0 1.0 1.0 .9

Outfit

.9 .9 .7 1.0 1.0 1.1

1.0 1.0 1.0 1.0 1.0 1.0 1.0

1.0 1.0 1.0 1.0

1.1 1.1

1.0 .9 .8 1.0 .9 1.0 .9 1.0 1.0 .8 .8 1.0 .9 1.0 .9 .9 1.1 .9 1.0 1.1 1.1 1.0 .9

(continued)


Table 11-1 (Continued) Individual Analysis

Combined Analysis

Item

Infit

Outfit

Infit

Outfit

43 44 45 46 47 48 49 50

1.0 1.1 1.0

1.0 1.1

1.0 1.0

1.0 1.0 1.0

1.0 1.0 1.0 1.0

1.1

51 52 53 54 55 56 57 58 59 60 61

62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85

1.0 1.0 1.0 .9 1.0 1.1 .9

.9 1.0 1.0 1.0 1.1 .9 1.0 .9 1.0 1.0 1.0 .9 .9 .9 1.0 1.0 1.0 1.0 1.1 1.0 1.0 .9 1.1 1.1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

1.0 .9 .9 1.1 .9 .9 1.0 1.0 1.2 1.1 .8 1.0 .8 1.0 .9 1.1 .9 .8 .9 1.0 1.0 1.1 .9 1.1 1.0 1.0 .9 1.1 1.1 1.0 .9 1.1

1.0 1.0 1.1 .9 1.0 1.1

.9 .9 1.0 .9 .9 1.0 .9 1.0 1.0 .9 1.0 .9 1.0 1.0 1.0

.9 .9 .9 1.0 1.0 1.0 .9 1.1 1.0 1.0 .9

1.0 1.1 1.0 .9 1.0 1.0 1.0 1.0 1.0 1.0 1.0

1.0

1.0 1.0 1.0

1.0 .9 .9 1.0 .9 .9 1.0 .9 1.1 1.0 .8 1.0 .9 1.0 .9 1.0 .9 .8 .9 1.0 1.0 1.0 .9 1.1 1.0 1.0 .9 1.0 1.1 1.0 .9 1.1 1.0 .9 1.0 .9 1.0 1.0

(continued)


Table 11-1

(Continued) Individual Analysis

Combined Analysis

Item

Infit

Outfit

Infit

Outfit

86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105

1.0 1.1 1.1 1.1 1.0 1.1 .9 1.1 1.0 1.0 .9 1.1 .9 .9 1.0 1.0 .9 1.0 .9 1.0 1.1 1.1 1.0 1.0 1.1 1.0

1.0 1.2 1.1 1.2 1.0 1.1 .9 1.2 1.0 1.0

1.0

.9 .9 1.0 1.0 .9 1.0 .9 1.0 1.1 1.2 1.1 1.0 1.1 1.0 1.1 .9 .9 .9 1.1 1.1

1.0 1.1 1.0 1.0 1.0 1.0 .9 1.1 1.0 1.0 .9 1.1 .9 .9 1.0 1.0 .9 1.0 .9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 .9 .9 1.0 1.0 1.1

.9 1.1

.9 1.0

1.0 .9

.9 .9 1.1

106 107

108 109 110 111 112 113 114 115 116 117 118 119 120 121

122 123 124 125 126 127 128

1.0

.9 .9 1.0 1.1 1.1 .9 1.1 .9 .9 1.1 1.0 1.0 1.1 1.1 1.0 1.0

.8 1.1

1.2

.9 1.0 1.1 1.1 .9 1.0

1.1 1.0

1.1 1.0

1.0 .9 1.1 .9 1.0 .8 1.1 .9 .9 .9 1.0 .9 1.0 .9 1.0 1.1 1.1 1.0 .9 1.0 1.0 1.0 .9 .9 .9

1.0 1.1 .9 1.0 .9 .9 1.1

.9 1.0 1.0 1.0

.8 1.0 1.0

1.0 .9

1.0 .9

1.0

(continued)


Table 11-1 (Continued) Combined Analysis

Individual Analysis Item

Infit

Outfit

Infit

Outfit

129 130 131 132 133 134 135 136

1.0 1.0

1.0

1.0 1.0 .9 1.1 .9 1.0 1.0 .9 1.0 .9 1.0 1.0 .9 1.0 .9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 .9 1.0 1.0 1.1 1.0 1.0 .9

1.0 1.0 .9 1.1 .9 1.0 1.0 .9 .9

137 138

139 140 141 142 143 144 145

146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162

.9 1.0 1.0 1.0 1.0 .9 1.0 1.0 .9 1.0 .9 1.0 1.0 1.0 1.0

165 166 167 168 169 170 171 172 173

1.0 1.0 1.0 1.1 .9 1.0 1.0 1.1 1.1 1.0 .9 1.0 .9 .9 1.1 .9 .9 .9 1.0 1.0 .9 1.0 1.1 1.0 1.0 1.0

Mean S.D.

1.0 .1

163 164


.9 1.1

1.1 .9

1.2 .9 1.0 1.1

.9 .9 .9 .9 1.0 .9 1.0 .9 1.0 1.0 .9 1.0 1.1 1.0 1.0 1.1 .9 1.0 1.0 1.3 1.1 1.1 .9 .9 .9 .9 1.2 .9 1.0 1.0 .9 1.0 1.1 1.1 .9 1.2

.9 1.1 .9 .9 .9 1.0 .9 .9 1.0 1.0 1.0 .9 1.0

.9 .8 1.0 .8 1.0 .9 1.0 1.0 .9 1.0 1.0 1.0 .9 1.0 .9 1.0 1.0 1.1 1.0 1.0 .9 .9 .9 .9 1.1 .9 .9 .9 1.0 .9 .9 1.0 1.0 1.0 .9 1.1

1.0 .1

1.0 .1

1.0 .1

.9 .9

.9 .9

Table 11-2
Slide Item Fit Statistics

         Individual Analysis     Combined Analysis
Item      Infit     Outfit        Infit     Outfit
1          .9        .8           1.0       1.0
2          .8        .9            .9        .8
3          .8       2.2            .9       2.2
4          .9        .6           1.0        .7
5          .9        .6           1.0        .6
6          .9        .9           1.0        .9
7          .9       1.4           1.0       1.5
8         1.0       1.5           1.1       1.6
9          .7        .4            .8        .5
10        1.0       1.0           1.1       1.2
11        1.2        .8           1.2        .8
12        1.2       1.0           1.2       1.0
13        1.1        .8           1.2        .9
14        1.3       1.3           1.5       1.4
15        1.2       1.1           1.3       1.0

Mean      1.0       1.0           1.0       1.0
S.D.       .2        .4            .2        .4
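The Infit and Outfit values in Tables 11-1 and 11-2 can be read more easily with their construction in view. The sketch below assumes the usual mean-square definitions: Outfit as the average squared standardized residual, and Infit as the sum of squared residuals divided by the sum of the model variances. The function names and the response data are hypothetical and are not taken from the histology examinations.

```python
def outfit(observed, expected, variance):
    """Unweighted mean square: average of squared standardized residuals."""
    z_squared = [(x - e) ** 2 / w
                 for x, e, w in zip(observed, expected, variance)]
    return sum(z_squared) / len(z_squared)


def infit(observed, expected, variance):
    """Information weighted mean square: squared residuals over total variance."""
    squared_residuals = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return squared_residuals / sum(variance)


# Hypothetical dichotomous responses to one item. For a right/wrong item the
# expected score is the success probability p and the variance is p * (1 - p).
p = [0.5, 0.5, 0.5, 0.9]
responses = [1, 1, 1, 0]  # the last response is a surprising failure
variances = [q * (1 - q) for q in p]

print(round(outfit(responses, p, variances), 2))
print(round(infit(responses, p, variances), 2))
```

In this made-up case the single unexpected failure inflates Outfit far more than Infit, which is the pattern described for Slide 3.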

Figure 11-1. Written Item Calibrations: Written Exam vs. Combined Data (axes: written data, combined data)

Figure 11-2. Slide Calibrations: Practical vs. Combined Data (axes: practical exam only, combined data)

Figure 11-3. Judge Calibrations: Practical vs. Combined Data (axes: practical exam only, combined data)

analysis are plotted against the judge severities obtained from the combined FACETS analysis. There is more variability between the judge severities derived from the two analyses, although the correlation is still high at .84 (p = .000 for a two-tailed test of significance). The reason for this variability will become more apparent as we look at the examinee measures.

There are three measures for each person: (a) the multiple choice examination measure, (b) the practical examination measure, and (c) the combined FACETS analysis measure. The multiple choice examination measures are plotted against the combined FACETS analysis measures in Figure 11-4. There is a linear relationship, but the combined FACETS analysis measures are about .5 logits higher than the multiple choice examination measures. The correlation between the measures is .97 (p = .000 for a two-tailed test of significance). In Figure 11-5, the person measures from the practical examination are plotted against the person measures from the combined FACETS analysis. The relationship is less strongly linear (correlation = .59, p = .000 for a two-tailed test of significance), and the combined FACETS measures tend to be lower than the practical examination measures.

These results suggest the following. First, the results of the multiple choice examination are having a much greater influence on the combined analysis than the results from the practical examination. This is logical, since the multiple choice examination consisted of 173 items, scored dichotomously, whereas the practical examination had only 77 judged responses per candidate, 15 responses scored on a 0-3 rating scale and the remainder scored on a 0-1 scale. Thus the multiple choice examination provided about 2.5 times the number of responses as the practical examination. Second, the practical examination was the easier of the two tests. Historically it has been harder to pass the multiple choice examination (about 50 percent pass) than the practical examination (about 80 percent pass).

The variation in judge severities between the practical analysis and the combined FACETS analysis can be attributed to the strong impact of the multiple choice test on the candidate measures. A candidate who is less able on the multiple choice examination forces down his or her combined analysis measure even if he or she is more able on the practical examination. The judges who graded that particular candidate appear more severe on the combined data analysis. The converse is true for candidates who were more able on the written examination than the practical. The increased ability of these individuals on the combined analysis has the effect of making the judges who graded these candidates appear less severe. A close examination of Figure 11-3 shows that the judge severities tend to split away from the identity line, with some looking harder on the practical than on the combined

Figure 11-4. Person Measures

Figure 11-5. Person Measures

FACETS analysis and some looking easier on the practical than the combined FACETS analysis. This divergence from the identity line represents the expected impact of the examinee ability measures, adjusted for the written examination measures, on the judge severity calibrations.

An alternative is to equalize the contribution of the two examinations to the final certification decision. The traditional method is to analyze the examinations separately and require the examinee to pass both before being certified. Another approach is to weight the contribution of each assessment in a combined analysis in such a way that the contribution is equal. Since the multiple choice examination had more impact on the candidate measures, the combined FACETS analysis was repeated with the results weighted so that the contribution of each examination would be approximately equal. The errors of measure for the candidate measures derived from each test were compared. The error of measure from the practical was about three times larger than the error from the multiple choice examination. The contribution of the multiple choice examination was therefore weighted by a value of .3 to equalize the impact of the multiple choice examination in the combined measure, and the analysis was repeated.

The relevant measures and calibrations derived under the separate and weighted/combined conditions were compared. The comparisons of the multiple choice item difficulty calibrations and the slide item difficulty calibrations showed no significant change between the separate and combined analyses. The plots of these comparisons were identical to the plots seen in Figures 11-1 and 11-2. In Figure 11-6, the judge severities from the practical analysis are plotted against the weighted combined judge severities. The impact of the multiple choice items on the judge severities is still apparent; however, the degree of impact is less in the weighted analysis. The correlation is higher at .96 (p = .000 for a two-tailed test of significance).

Figure 11-6. Judge Calibrations: Practical vs. Weighted Combined Data (axes: practical exam only, weighted combined data)

In Figures 11-7 and 11-8, the candidate measures from the written and practical analyses are plotted against the weighted combined candidate measures. In Figure 11-7, the linear relationship is less well defined than it was in Figure 11-4, as the impact of the multiple choice examination is reduced. The correlation coefficient is now .87 (p = .000 for a two-tailed test of significance). In Figure 11-8, the practical examination candidate measures and the weighted combined candidate measures have a more clearly linear relationship, with a correlation of .80 (p = .000 for a two-tailed test of significance). The combined analysis candidate measures are between the higher practical examination measures and the lower multiple choice measures. The contribution of each is relatively equivalent. This weighting could probably be fine-tuned even further, until the correlations between the results of the separate analyses and the weighted analysis become identical.
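Each comparison above reduces to two summaries of a scatter plot: the correlation between the two sets of measures and their average offset from the identity line. A minimal sketch follows, using made-up measures because the candidate data are not reproduced in this chapter; the function names and the numbers are illustrative assumptions only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of measures."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mean_shift(xs, ys):
    """Average amount by which the second set of measures exceeds the first."""
    return sum(y - x for x, y in zip(xs, ys)) / len(xs)


# Hypothetical logit measures for five candidates under two analyses.
separate = [-1.2, -0.4, 0.1, 0.9, 1.6]
combined = [-0.6, 0.0, 0.7, 1.3, 2.2]

print(round(pearson(separate, combined), 2))
print(round(mean_shift(separate, combined), 2))
```

Repeating the two summaries for each trial weight would show how the correlations with the written and practical measures move toward one another as the weight changes.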

DISCUSSION

Making assessments of a person's performance often can have serious implications. The more information that can be obtained and utilized for that assessment, the more reliable that assessment will be. Many instruments are available to obtain this information. These instruments include multiple choice tests, oral examinations, practical examinations, essay tests, and so on. This study was designed as an initial attempt to use the flexibility of the FACETS program to combine the information from two unique examinations, a practical examination and a multiple choice examination, and to explore the results of combining this information. The results indicate that combined analysis can occur without significant disturbances in the measurement process. The calibrations of the item difficulties, both multiple choice items and practical slides, were virtually unaffected. Fit to the model of these items was acceptable and directly comparable to the fit from the separate analyses. The mean squared information weighted residual for the slides on both the

Figure 11-7. Person Measures

Figure 11-8. Person Measures

practical and the combined analysis had a mean of 1.0 and a standard deviation of .2. For the multiple choice items, the mean squared Infit for both the multiple choice examination and the combined FACETS analysis had a mean of 1.0 and a standard deviation of .1. The largest impact appeared in the judge severity calibrations. Even here, the correlations between the calibrations derived from the different analyses were high, and no change in the fit of the data to the model was observed (mean squared Infit for both analyses had a mean of 1.0 and a standard deviation of .1).

The impact on candidate measures indicated that care must be taken in assigning weight to the contribution of each individual test used in the combined analysis. Each examination is designed to test elements of a candidate's performance. The role that each of these elements plays in the actual competence of a candidate to perform the tasks may influence the weight assigned to that task. The importance placed on the parts of the examinations by the examination board when making the final certification decision must also be considered. In this case, greater importance has been placed on the multiple choice examination, because the multiple choice examination tests the candidate's basic knowledge of histology in a broader context than the specific task-oriented practical examination. If this is the case, then the combined analysis may not be appropriate. This study, however, demonstrated that it is possible to combine information from different types of examinations and to analyze the extent of their contribution to the final evaluation. Further research on combined FACETS analysis is necessary; however, such an approach may be more commensurate with the assessment of overall competence now that the technology and theoretical models are available.

REFERENCES

Linacre, J.M. (1988). FACETS, a computer program for the analysis of multifaceted data. Chicago: Mesa Press.
Linacre, J.M. (1989). Many-faceted Rasch measurement. Chicago: Mesa Press.
Lunz, M.E., & Stahl, J.A. (1990). A comparison of intra- and interjudge decision consistency using analytical and holistic scoring criteria. Journal of Allied Health, 19, 173-179.
Lunz, M.E., Wright, B.D., & Linacre, J.M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3, 331-345.
Wright, B.D., Linacre, J.M., & Schultz, M. (1990). BIGSCALE, Rasch-Model Rating Scale Analysis Computer Program. Chicago: Mesa Press.

part I I I

Theory

This page intentionally left blank

chapter

12

Local Dependence: Objectively Measurable or Objectionably Abominable?*

Robert J. Jannarone

University of South Carolina

INTRODUCTION

This chapter concerns extending Rasch model-based objective measurement (Fisher, 1991; Rasch, 1980; Wright, 1980) to include a variety of locally dependent conjunctive measurement (LDCM) models (Jannarone, 1991). The Rasch model justifies the nearly universal practice of scoring tests by counting the number of binary items that are passed. LDCM models provide supplemental scoring schemes that use nonadditive combinations of item scores as well. In the process, LDCM violates the fundamental axiom of latent trait theory, which is a traditional basis for nearly all test models, including the Rasch model. LDCM models have been introduced and developed elsewhere (Jannarone, 1986, 1987, 1991; Jannarone, Yu, & Laughlin, 1990; Kelderman & Jannarone, 1989; Van der Linden & Jannarone, 1989). The issue to be raised here is whether LDCM offers useful cognitive measurement potential that the Rasch model does not, while preserving the essence of objective measurement.

* This chapter is dedicated to the memory of Rose and Peter.


Other attempts have been made to extend objective measurement, some of which are sharply attacked in this chapter. These attacks may upset some readers, especially if they ignore the fact that science rewards strong results with sharp rebukes (Kuhn, 1970; Popper, 1968). To set matters straight, the author admires and appreciates the contributors who are criticized below, without exception. Indeed, the results of this chapter would have been unthinkable without their extraordinary efforts. The chapter is organized as follows. First, the question of whether to extend objective measurement is addressed informally and answered affirmatively. Next, the question of how to extend objective measurement is addressed more carefully, with psychometric measurement history as a basis and specific guidelines as a result. Finally, locally dependent, conjunctive measurement is assessed against these guidelines, and conclusions are made regarding its utility.

SHOULD OBJECTIVE MEASUREMENT BE EXTENDED?

"If it's not broke, don't fix it!" is a common sentiment among people who just want to be productive. It is often a sound sentiment, because improvements are not easy to make. However, improvements surely cannot be made without being attempted, lending support to the opposite sentiment: "If it's not broke, fix it anyway!" When both short-term productivity and long-term optimality are concerns, then, it is natural to strike a balance between the two opposing sentiments.

The balance may lean more toward the fix-it than the don't-fix-it end among scientists, for several reasons. Researchers have been primarily trained and directed toward identifying, applying, and validating new ideas, which necessarily requires questioning and rejecting current ideas. Also, the history of science has repeatedly shown that pursuing new ideas, even without concrete goals in mind, can eventually result in important practical benefits. However, in pursuing either concrete, short-term goals or more abstract, long-term goals, researchers must use some established procedures and rely on some basic assumptions; otherwise they would have no basis at all for making progress. From the most practical to the most basic scientific work, then, the fix-it, don't-fix-it question is an important one.

The case in point for this chapter is model-based objective measurement (MOM), in the form of the Rasch model (Rasch, 1980; Wright, 1980). MOM is an interesting case for fix-it versus don't-fix-it study because it is currently being both heavily researched and widely used.


The contents of this edited series indicate much MOM-related activity, ranging from very basic research to very applied assessment. While some studying measurement foundations are naturally interested in extending MOM, others are perhaps either happy with MOM as it stands or too busy using MOM to seriously consider changing it, or both. The fix-MOM, don't-fix-MOM question, then, is a broad one that may have different answers for different researchers.

On the fix-MOM side, prudent extensions to the Rasch model could improve testing practice. For example, it is widely known that educational aptitude does not depend on achievement alone, but also on learning ability, strategy selection ability, motivation, and the efficient use of time. Yet MOM is limited for measuring effects of these skills on performance, because it is based on a certain restrictive axiom. Also, as computing power continues its remarkable growth, computerized testing and tutoring are certain to become widely used. Yet the same axiom limits MOM prospects for computer-based, dynamic ability assessment. In addition, reports indicate that MOM cannot properly measure characteristics of some items in current use, such as differential item discriminations (Lord, 1980) and dependencies due to shared content (Jannarone, 1991). Thus, extensions to MOM may be needed to broaden its formal domain as well as its utility.

The MOM axiom that limits its domain is the local independence assumption (Lazarsfeld, 1958), which is widely regarded as the fundamental axiom of latent trait theory (Lord & Novick, 1968, Sec. 24.5; Jannarone, 1991a). Local independence requires that measurement must be noninvasive (Jannarone, 1991) in that a person's future test behavior must be the same after responding to an item as it would have been before. By requiring that measurement be noninvasive, local independence prevents measuring a person's progress during a test as a function of his or her progress on previous items.
As a result, as long as local independence is imposed, some potentially interesting abilities will be neglected. Conversely, those who are interested in measuring such abilities will continue to neglect MOM in its current form.

Locally dependent instances can be found ranging from exercise-activity assessment settings, where injuries are recorded, to learning-activity settings, where task responses are recorded. Suppose, for example, that a binary "item score" is recorded weekly, indicating whether or not runners in a study have been injured (Macera, Pate, Powell, Jackson, Kendrick, & Craven, 1989). Running-injury incidence can obviously depend on recent running-injury history. Also, some people may be more likely to press on after an injury than others. As a consequence, running-injury measures may not only be locally dependent, but interesting individual differences in local dependencies may exist as well.

As a closely related example, suppose that item pairs have been constructed to reflect learning transfer, in the form of successfully learning information on one item and then successfully applying the learned information to a following item (Jannarone, 1987, 1991). As in the exercise activity case, (a) one item score is likely to depend (locally) on a preceding item score; and (b) individual differences in local dependencies (that is, learning transfer abilities) may be worth measuring.

In these and other instances, information may exist in test score patterns that cannot be measured by number-correct test scores alone. For example, counting the number of adjacent item pairs that are both passed can provide information about learning ability, if adjacent items are linked by content in certain ways (Jannarone, 1991). Yet the local independence axiom turns out to prohibit the use of such nonlinear, conjunctive scoring schemes (in a sense that will be shown below: when items are binary, item scores are equivalent to logical events, cross-products of which are called conjuncts, whence the term conjunctive). Arguments on the fix-MOM side, then, include prospects for allowing local dependencies among items and using nonadditive scoring schemes, both of which are prohibited by MOM in its current form.

On the don't-fix-MOM side, arguments can be made for continuing to use MOM as it stands. The strongest among these is the natural and important wish to retain simplicity and elegance, if possible. No method of combining item scores is simpler than adding them up, as prescribed by the Rasch model. Moreover, local independence is a necessary and sufficient condition for additivity (within a broad class of item response models, q.v.), which means that extending MOM along noninvasive lines would necessarily decrease its simplicity and elegance.
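The contrast drawn above between number-correct scoring and conjunctive scoring of adjacent item pairs can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and scoring every overlapping adjacent pair (items m and m + 1) is an assumption made here for simplicity, whereas the pairings in an actual LDCM application depend on how items are linked by content.

```python
from typing import Sequence


def number_correct(scores: Sequence[int]) -> int:
    """Additive score: the count of binary items passed."""
    return sum(scores)


def adjacent_conjunct_score(scores: Sequence[int]) -> int:
    """Conjunctive score: the count of adjacent item pairs that are both passed.

    Each cross-product x[m] * x[m + 1] equals 1 only when both items are passed.
    Using every overlapping adjacent pair is an assumption of this sketch.
    """
    return sum(a * b for a, b in zip(scores, scores[1:]))


# Two hypothetical response patterns with the same number-correct score.
person_a = [1, 1, 1, 0, 0, 0]
person_b = [1, 0, 1, 0, 1, 0]

print(number_correct(person_a), adjacent_conjunct_score(person_a))
print(number_correct(person_b), adjacent_conjunct_score(person_b))
```

The two patterns are indistinguishable under additive scoring but differ under the conjunctive score, which is the kind of score-pattern information the text describes as invisible to number-correct scoring.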
Also, it is usually the case that simple additive measurement works remarkably well relative to nonadditive alternatives, even when observations are generated according to nonadditive models (Jannarone, 1987). Number-correct scoring should be especially satisfactory for tests in current use, having items that were chosen with additivity in mind. In many practical settings, then, MOM in its current form can be expected to perform quite well. Local independence is the issue of focus here for the fix-MOM, don't-fix-MOM question, although it is not the only one. Other issues have also emerged over the years, for which different Rasch model variants have been proposed. These include extensions along multidimensional, multiparameter, multicategory, and nonparametric lines. Although these extensions are not directly related to the local

independence issue, they will be reviewed in the next section, in an attempt to identify the essence of objective measurement.

Arguments exist, therefore, for and against developing and applying extensions to MOM. For those who study psychometric foundations, the choice in favor of exploring ways to fix MOM is straightforward. Their only dilemma is how to develop general and potentially useful extensions to MOM that preserve good measurement properties. However, the choice tends to be more difficult for researchers who are more concerned with practical testing. Their choice requires assessing MOM alternatives for their particular needs (and within their busy schedules), rather than pursuing extended objective measurement for its own sake. They thus need to balance real rather than potential extended MOM utility against necessary increases in extended MOM complexity, which is not easy. It is hoped that the following description will aid researchers with both applied and basic interests, in choosing between MOM as it stands and extended locally dependent, conjunctive alternatives.

HOW SHOULD OBJECTIVE MEASUREMENT BE EXTENDED?

Since objective measurement has been closely tied to the Rasch model (Wright, 1980), the question of how to extend MOM will be addressed by first examining Rasch model attributes. Historical psychometric developments will then be reviewed, and resulting measurement extension guidelines will be proposed.

The Rasch model is remarkably simple and elegant, especially in terms of additivity. Indeed, it will be shown later that the Rasch model is the only additive member of a very general test model family. Moreover, the Rasch model offers a sound basis for measuring differential item characteristics and including them in the measurement process. Also, Rasch measurement results in sound and straightforward statistical inference procedures (Andersen, 1980). The Rasch model has other features that have become identified with "specifically objective measurement" (Fischer, 1981, 1991; Wright, 1980). In particular, Rasch measurement produces ability estimates that do not depend on item difficulty, along with item difficulty estimates that do not depend on abilities. Conceptually, this translates into a measuring process that "transcends the measurement instrument [by excluding person-by-item] interaction terms" (Wright, 1980). The most familiar form of the Rasch model is
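(The display below is a LaTeX reconstruction based on the definitions that follow it, assuming the standard logistic Rasch parameterization. The capital $X_{im}$ for random item scores and the boldface $\boldsymbol{\beta}$ are notational assumptions and may differ from the author's typesetting.)

```latex
\begin{equation}
\Pr\{X_{i1}=x_{i1},\ldots,X_{iM}=x_{iM}\mid\theta_i;\boldsymbol{\beta}\}
  \;=\; \prod_{m=1}^{M}\frac{\exp\{(\theta_i-\beta_m)\,x_{im}\}}{1+\exp\{\theta_i-\beta_m\}}
  \;\propto\; \prod_{m=1}^{M}\exp\{(\theta_i-\beta_m)\,x_{im}\},
\tag{1}
\end{equation}
```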


where i indexes individuals, m indexes item measurements, the x_im are binary item scores, the β_m are item parameters, and the θ_i are person parameters. (For readers who are unfamiliar with the proportionality (∝) sign in (1), it indicates that the second expression is the first expression times a factor that does not depend on observed scores.) The M factors in (1), which are called item response functions, give the probabilities of passing component items as functions of θ. Rasch model local independence is evident from (1), because when person parameter (θ_i) values are fixed, joint item score probabilities are products of the component item response functions. The subtractive nature of Rasch model person and item parameters, which leads to item and person parameters being comparable on the same scale, is also evident from (1). Finally, since it produces individual differences measures that are simple number-correct scores, the Rasch model also provides for measurement validation by correlating item and total test scores with external measures. (Other useful Rasch model properties will be described after the exponential family of statistical models is reviewed below.)

[...] to regression and correlation methods (Galton, 1888; Pearson, 1896), which rely on component measures that have substantial (individual differences) variation, along with mutual (additive) covariation. Regression and correlation features were given prominence in both the classical test model (Spearman, 1904) and the closely related single-factor model (Spearman, 1927), and they remain prominent in modern test theory. A closely related development was the advent of analysis of variance models (ANOVA: Box, 1978; Fisher, 1921). ANOVA and regression models are members of the general linear model family (Searle, 1971), all of which are based on additive associations between dependent and (perhaps nonadditive functions of) independent variables.
Multiple factor analysis developments (MFA: Thurstone, 1932) and interactive ANOVA developments (Box, 1978; Fisher & Mackenzie, 1923) have provided important lessons for extending objective measurement. Both multiple factors and ANOVA interactions increase explanatory power by supplementing simpler models with extra parameters. Extra "factor loadings" are used in the MFA case to supplement classical test "true scores" with "factor scores" (Lord & Novick, 1968). Likewise, extra "interaction effects" are used in the ANOVA case to supplement ANOVA "main effects" (Scheffe, 1959).

Although similar in motivation, MFA models and extended ANOVA models have different forms and uses. The statistical form of the extended ANOVA model is nonadditive, in that cross-products of main effect (group indicator) variables are used as extra predictor variables. These extra observable variables are used in extended ANOVA estimation and inference to account for required extra parameters. By contrast, only additive functions of observable variables are used in the MFA model (excepting the covariances that are used for estimation and inference in general linear models and the one-factor model).

Because extended ANOVA parameters are accompanied by corresponding nonadditive statistics, statistical estimation and inference procedures remain straightforward in the extended ANOVA case, just as in the additive ANOVA case. By contrast, MFA procedures involve exotic constraints (such as simple structure; see Thurstone, 1947) and estimation procedures (Joreskog & Sorbom, 1984) for dealing with difficult inference problems, some of which have yet to be resolved. As a result, while extended ANOVA models continue to be widely and successfully used, interest in MFA models seems to be decreasing (as indicated by fewer articles appearing in Psychometrika).

Although [...] and regression methods can be used for binary data, binary measures violate basic normality assumptions for these models.
The next major test theory development was the introduction of special binary item response theory (IRT) models, of the normal ogive (Ferguson, 1942; Lawley, 1943) and one-parameter logistic (Rasch 1960/1980) types. These new models were introduced as completely new alternatives to, rather than extended versions of, classical and MFA models, in order to precisely reflect associations among binary item scores. Like the Rasch model, they were constructed with provisions for individual differences along with subtractive item and person parameters, in an attempt to place item effects and person effects on the same scale.

Local independence and latent trait theory. The next development, which was more broad and foundational than model specific, was the identification of local independence (Lazarsfeld, 1958) as a test theory axiom. Specifically, permissible latent trait models (Lord & Novick, 1968, chap. 25) were restricted to settings where all local dependencies could be explained by latent traits (as opposed to more broadly defined latent variable settings, which may not necessarily satisfy the local independence axiom; see Jannarone, 1991a; Suppes & Zanotti, 1981). In the process, fundamental local dependence in general and LDCM models in particular were excluded from orthodox latent trait theory, by definition.

In a related development, it was shown that for any latent variable model (including orthodox latent trait models as well as LDCM models), locally independent counterparts can always be constructed (this result was first published by Suppes & Zanotti in 1981; see also Holland & Rosenbaum, 1986; Jannarone, 1991a; Stout, 1987, 1990). At first glance, the result suggests that locally dependent modeling is of minor importance because locally independent counterparts can always be constructed. However, the Suppes and Zanotti alternative has been recognized as "vacuous," because it simply identifies each new item score with a separate parameter value, as each item score becomes observed. As a result, a basic statistical inference requirement (identifying the same latent variables with each of several observations) becomes lost in the process.

IRT extensions have also been proposed, including multidimensional extensions of the Rasch model (Fisher, 1973; Whitely, 1980) to explain multiple person characteristics; multiparameter logistic models (Andrich, 1978; Birnbaum, 1958; Glas & Verhelst, 1989) to explain multiple item characteristics; and multidimensional, multiparameter models (Bock, 1972; Glas, 1991; Kelderman, 1984; McKinley & Reckase, 1983; Samejima, 1969; Wilson, 1989).
Tests based on nonparametrics (Mokken & Lewis, 1982; Holland, 1981; Holland & Rosenbaum, 1986; Rosenbaum, 1984, 1987; Stout, 1987, 1990) have been developed as well, for making inferences without having to make specific assumptions about item response function form. Most of these IRT extensions will be compared in more detail later in this section, once a basis for comparing them has been established.

Exponential family theory. In exponential family form (Lehmann, 1983, 1986), joint likelihood functions are expressed as exponents, which contain weighted sums of parameters. The parameter weights, which are called sufficient statistics, can be used for parameter estimation and inference. The exponential family format for the Rasch model (1) is,
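(The display below is a LaTeX reconstruction obtained by expanding the exponent of the logistic Rasch form, not a transcription of the author's typeset equation. Pooling over I independent persons, and the margin symbols $x_{i+}$ and $x_{+m}$, are assumptions made here so that the count of sufficient statistics matches the "M + I" mentioned in the text that follows.)

```latex
\begin{equation}
\Pr\{\mathbf{X}=\mathbf{x}\mid\boldsymbol{\theta};\boldsymbol{\beta}\}
  \;\propto\; \exp\Bigl\{\sum_{i=1}^{I}\sum_{m=1}^{M}(\theta_i-\beta_m)\,x_{im}\Bigr\}
  \;=\; \exp\Bigl\{\sum_{i=1}^{I}\theta_i\,x_{i+}\;-\;\sum_{m=1}^{M}\beta_m\,x_{+m}\Bigr\},
\qquad
x_{i+}=\sum_{m=1}^{M}x_{im},\quad x_{+m}=\sum_{i=1}^{I}x_{im}.
\tag{2}
\end{equation}
```

In this form each person parameter $\theta_i$ is weighted by that person's number-correct score $x_{i+}$, and each item parameter $\beta_m$ by the number of persons passing item m, $x_{+m}$.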


Exponential family formats identify distinct sources of information with distinct exponent terms. Statistical independence is always indicated by separable product factors, such as the M item response functions in (1). Since products are equivalent to exponential sums, distinct exponent sum terms in exponential family models are also independent, in a sense (but not in general, because the proportionality constant may not be factorable). For example, when the Rasch model is expressed in form (2), each of the M + I sufficient statistics is seen as a kind of independent information source for its corresponding parameter. More precisely, it follows from exponential family theory that estimation for each exponential family parameter depends only on its corresponding sufficient statistic, given the remaining sufficient statistics. This property, when applied to (2), results in person parameter and item parameter separability for the Rasch model, which was listed as a specific objectivity property earlier. Exponential family analysis can be used to identify test model strengths as well as weaknesses. If a given model can be expressed in exponential family form several highly useful statistical properties follow. These include guarantees that: (a) unique, optimal (maximum likelihood and conditional maximum likelihood) estimates exist; (b) such estimates can be found by straightforward estimation procedures (since exponential family likelihoods are convex—see Andersen, 1980); and (c) relatively simple, optimal inference procedures can be identified (due to exponential family monotone likelihood ratio, asymptotic normality, and other properties—see Lehmann, 1986). All of these properties are strengths of the Rasch model, because of its exponential form given in (2). Similar strengths apply to unextended ANOVA, multiple regression, and classical test models, because they can also be expressed in exponential family form. 
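The role of sufficient statistics described above can be checked numerically. The following Python sketch is an illustration under stated assumptions, not part of the original chapter: the function name and the parameter values are hypothetical, and the standard logistic Rasch likelihood for I persons and M items is assumed. It shows that two response matrices with identical person totals and item totals have the same Rasch likelihood, whereas a matrix with different totals does not.

```python
import math
from typing import List, Sequence


def rasch_log_likelihood(
    x: Sequence[Sequence[int]], theta: Sequence[float], beta: Sequence[float]
) -> float:
    """Joint Rasch log-likelihood of a persons-by-items binary matrix.

    Each cell contributes x_im * (theta_i - beta_m) - log(1 + exp(theta_i - beta_m)).
    The second term does not depend on the observed scores.
    """
    total = 0.0
    for i, row in enumerate(x):
        for m, score in enumerate(row):
            eta = theta[i] - beta[m]
            total += score * eta - math.log1p(math.exp(eta))
    return total


# Hypothetical parameter values for two persons and two items.
theta: List[float] = [0.5, -0.3]
beta: List[float] = [-0.2, 0.7]

# x_a and x_b share person totals (1, 1) and item totals (1, 1); x_c does not.
x_a = [[1, 0], [0, 1]]
x_b = [[0, 1], [1, 0]]
x_c = [[1, 1], [0, 0]]

ll_a = rasch_log_likelihood(x_a, theta, beta)
ll_b = rasch_log_likelihood(x_b, theta, beta)
ll_c = rasch_log_likelihood(x_c, theta, beta)

print(math.isclose(ll_a, ll_b))  # same margins, same likelihood
print(math.isclose(ll_a, ll_c))  # different margins, different likelihood
```

Because the likelihood depends on the data only through the row and column totals, those totals carry all the information about the person and item parameters, which is the separability property the text attributes to the exponential family form.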
Some extended statistical models can also be viewed as sound, once they are represented as extended exponential family models. For example, the ANOVA model with interactions represents an extended exponential family model with extra terms. Each term involves an additional interaction effect parameter, along with a corresponding new sufficient statistic, when expressed in exponential family form. Likewise, the Spearman one-factor model involves component item weights, over and above the (true score) parameters associated with the classical test model. When expressed in exponential family form (based on standard normality assumptions; see Joreskog & Sorbom, 1984), each such item weight becomes identified with an item variance parameter, which can be estimated by its corresponding item variance statistic. Thus, both of these models can be viewed as statistically sound extensions of their simpler ANOVA and classical test model counterparts. Other extended IRT models that belong in the exponential family (Andrich, 1978, 1985; Embretson, 1984; Fischer, 1973; Fischer & Formann, 1982; Kelderman, 1984; Masters, 1982; Whitely, 1980; Wilson, 1989) can be viewed as statistically sound as well.

Attempts to place some other test models in exponential form can expose deficits, however. For example, the multidimensional factor analysis model becomes exposed as having more parameters (population means, factor loadings, and uniqueness) than sufficient statistics (as restricted by normality assumptions: sample item means, sums-of-squares, and sums-of-cross-products). The MFA model thus comes up short, in that no exponential family form can be constructed with a sufficient statistic for each parameter. As a result, certain side conditions (Thurstone, 1947; Joreskog & Sorbom, 1984) must be imposed to make MFA parameter estimation even possible, with no resulting guarantees of optimality.

Similar problems exist for other multiparameter IRT extensions (Birnbaum, 1968), along with their multivariate extensions. Exponential forms for both Birnbaum model extensions and constrained MFA model extensions involve some terms that have products of two parameters associated with a single statistic, which cannot be separately estimated. Thus, distinct estimates for the two parameters cannot be identified (in the formal sense; see Fischer, 1981; Jannarone, 1991a), and estimation problems result (Fischer, 1981; Mislevy & Stocking, 1987). (Although the above examples suggest that statistical soundness is equivalent to exponential family membership, this may not be the case.
For example, certain IRT models described by Kristoff, 1968, and Jannarone, 1991b, do not belong in the exponential family. Yet estimates with sound properties, such as uniqueness resulting from convexity, can be obtained for these models.)

The two-parameter Birnbaum model deserves special attention because it has been strongly endorsed (Lord, 1980), and it continues to be widely studied (Mislevy & Verhelst, 1990; N.D. Verhelst, personal communication, December 1989). Abbreviated as the 2P model elsewhere, it will be called the toupee model here, because of its cosmetic nature (see below). Both the Spearman model and the toupee model were proposed to link latent traits with weighted sums of item scores, rather than simple, unweighted sums. It seems strange at first glance, then, that the Spearman model is statistically sound by exponential family standards, while the toupee model is not. The reason lies in the fact that individual item weighting information in both cases must come from second-order (and conceivably higher order) item statistics. In the Spearman model case for continuous data, this poses no problems from an information viewpoint, because sums of squared item statistics are distinct from sums of raw item scores. In the binary case, however, a squared binary item score is identical to a raw binary item score (whether its value is 0 or 1). As a result, no new item information can be obtained from examining binary item variances (and higher order moments), over and above that available in binary item sums.

The fact that binary item statistics are limited in this way has a more basic message for item response theory than simply an argument against the toupee model. The message is this: If item parameters that are distinct from Rasch difficulty parameters are to be estimable, then local independence must be violated in the process (since statistics other than additive item statistics must be obtained). One approach is to use nonbinary item information such as categorical item response data (Glas & Verhelst, 1989) and item response latency measures (Jannarone, 1991b). A second approach is to use item cross-product statistics such as item-subtest regression estimates (Engelen & Jannarone, 1989), to identify discrimination-like parameters. In either case, however, the use of such statistics presents a dilemma for researchers who believe in both local independence and the toupee model.

Some guidelines for extending objective measurement. A variety of conflicting lessons and guidelines can be gathered from the preceding survey. For example, some with interests in developing advanced statistical procedures would identify different guidelines than others with more practical interests.
For the author, who is mainly interested in developing simple tests based on established statistical principles to reflect general cognitive processes, the following guidelines seem vital.

A. Extend explanatory power, by
   A(1) including new item parameters to reflect task differences as necessary,
   A(2) including new person parameters to assess corresponding individual differences, and
   A(3) avoiding substantive constraints (such as local independence);
B. Ensure statistical soundness, by
   B(1) including new statistics to identify new parameters as necessary,
   B(2) preserving the use of fast, optimal, and conditionally invariant estimation procedures,
   B(3) using composites of distinct observations for each parameter, to increase measurement precision, and
   B(4) using theoretically sound inference procedures; and
C. Retain interpretive and statistical simplicity, by
   C(1) minimizing parametric and statistical complexity, and
   C(2) retaining specific objectivity.

Guidelines A(1) and A(2) may seem too parameter-based to those who favor nonparametric approaches. The argument in favor of nonparametrics, as indicated earlier, is that few assumptions need be satisfied in order for nonparametric procedures to be valid. The argument in favor of parametric modeling is that parameter estimates can be used to explain associations among items precisely, rank-order persons in terms of discrete skills, and test distinct cognitive processing hypotheses. At a more basic level, the nonparametric movement in statistics over the last 20 years (Lehmann, 1975) seems analogous to the old behaviorism movement in psychology (Boring, 1957). Both have their place for identifying simple associations, but both are simplistic to a fault for explaining underlying structures.

Guideline A(3) is included to avoid defining away the existence of potentially useful test models. Examples later in the chapter will show that orthodox latent trait theory violates A(3). Previously cited tests for dimensionality violate A(3) as well, making them strangely restrictive given their nonparametric intent. In order to exclude the trivial Suppes and Zanotti model from consideration, these tests impose presumably minimal restrictions on families of permissible test models, but they prohibit locally dependent alternatives in the process. Some such tests impose local independence explicitly (for example, locally independent, monotone IRF requirements; Holland & Rosenbaum, 1986), violating A(3) in the process. Others impose local independence with more subtlety. For example, according to a preliminary "essential local dependence" definition (Stout, 1990), certain LDCM models such as the Rasch-Markov model below are strangely both "essentially locally independent" and clearly locally dependent. The consequence is a dimensionality test that has equally strange implications.
If such tests are to be truly nonparametric, they should require only the weakest necessary substantive assumptions, as suggested by A(3) (see also Jannarone, 1991a).

Guideline A(2) provides an individual emphasis that may be helpful, if not essential, for externally validating test theory models. Previously successful extended models without individual differences (Andrich, 1978, 1985; Embretson, 1984; Fischer & Formann, 1982; Masters, 1982; Wilson, 1989) indicate that extended individual difference measures are not essential. In all these instances, however, if individual differences measures had been included, then external validation prospects could have been enhanced. Also, Rasch model specific objectivity features provide a very powerful basis for independently assessing and using item difference measures as well as person difference measures. Guideline A(2) is meant to indicate, then, that extended individual differences provisions can only be helpful, provided that other guidelines can be followed.

The guidelines under B and C are nearly equivalent to an exponential family membership requirement. If a statistical model falls in the exponential family, all of the requirements under B are satisfied, as was indicated earlier. Also, the exponential family format requires a one-to-one correspondence between statistics and parameters, ensuring the kind of statistical clarity that is required by B. The equivalence is not perfect because some models outside the family may be both statistically sound and interpretable, as was indicated earlier. However, exponential family membership is sufficient to satisfy B and C. Thus, all of the previously cited extended exponential family models satisfy B and C as well. Some prominent extended test models do not satisfy B, however, most notably the MFA and toupee models. Also, guideline B(3) was specifically included to clarify why the extended approach presented by Suppes and Zanotti is unacceptable.

Local independence has of course not been included as a guideline, given the viability of local dependence. In light of that viability, arguments that have been expressed in favor of local independence (Lord & Novick, 1968; McDonald, 1981) are not compelling (Jannarone, 1991). Although the above three guidelines are individually beneficial, they should be recognized as being mutually at odds.
The extended generality that falls under A cannot be achieved without added statistical and conceptual complexity, which compromises both B and C. For example, the interactive and multidimensional extensions of the ANOVA and factor analysis models necessarily involve more elaborate data explanations and analyses, as was indicated earlier. Thus, these and all other extended explanatory models are useful only if such elaborate explanations are necessary. More pointedly, extensions to the Rasch model should be considered only if extended insights about complex cognitive processes are required. Otherwise, there is no need to compromise the marvelous simplicity and elegance of the Rasch model. Because of this tradeoff between generality and complexity, more general alternatives to the Rasch model should be considered on a case-by-case basis, with this tradeoff between the three guidelines in mind.


LOCALLY DEPENDENT, CONJUNCTIVE MEASUREMENT AS THE CASE IN POINT

Each member of the conjunctive measurement family is a special case of the following general model.
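A form consistent with the surrounding description of (3) is sketched here. It is a reconstruction under stated assumptions, not a quotation: the symbols $\theta_{im}$ and $\beta_m$ (first order) and $\theta_{imm'}$ and $\beta_{mm'}$ (second order) are assumed names, chosen to match the text's account of subtractive person and item parameters attached to products of the binary item scores $x_{im}$.

```latex
\Pr\{X_{i1}=x_{i1},\ldots,X_{iM}=x_{iM} \mid \theta_i\}
  \;\propto\;
  \exp\Bigl\{
      \sum_{m=1}^{M} (\theta_{im}-\beta_{m})\,x_{im}
    + \sum_{m<m'} (\theta_{imm'}-\beta_{mm'})\,x_{im}x_{im'}
    + \cdots
    + (\theta_{i12\cdots M}-\beta_{12\cdots M})\,x_{i1}x_{i2}\cdots x_{iM}
  \Bigr\}
```

Under this sketch, equating the first-order person parameters within a person and dropping every product term leaves $\exp\{\sum_m (\theta_i-\beta_m)x_{im}\}$, which is the familiar Rasch kernel.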

The notation in (3) resembles the Rasch model notation in (1), in that M is the number of items, the $x_{im}$ are binary item scores, $\beta$ is a vector of item parameters, and the elements of $\theta$ are person parameters. However, equation (3) contains many more person and item parameters than (1); indeed, it contains far too many parameters to be useful (and identifiable) as it stands. Instead, some parameter values must be set to 0 or equated with other parameters, to obtain useful special cases. One such special case of (3) is the Rasch model (1), which is obtained by equating all first-order person parameters,

and excluding all higher order terms,

Multivariate Rasch models, from which distinct Rasch measurements are obtained for different tests within a battery, can also be viewed as special cases of (3). For example, the appropriate special case of (3) for a battery made up of ten 100-item tests would have: (a) M set at 1,000; (b) the first 100 $\theta_{im}$ values equated and denoted by $\theta_i^{(1)}$; (c) the next 100 $\theta_{im}$ values equated and denoted by $\theta_i^{(2)}$, and so on; and (d) the constraints in (5) imposed to remove higher order effects. The resulting model, after including (a) through (d) in (3), is,


The final form of (6) indicates that the 10 component tests can be treated as independent, distinct Rasch models. The general conjunctive model given in (3) represents an extension of Rasch model measurement, in that the Rasch model is a special case, along with a variety of other locally independent models such as (6). After statistical properties of (3) are examined next, some locally dependent special cases of (3) will be examined as well.

Conjunctive measurement model exponential family membership. The likelihood associated with (3) can be arranged in the following exponential family form:


Since all conjunctive measurement models are special cases of (7), they can be placed in exponential family form as well. For example, the exponential family and sufficient statistics for the Rasch model have already been formulated in (2). Also, the exponential family form of the likelihood for the above multivariate Rasch model (6) is,

As would be expected from fitting separate Rasch models to tests within the battery, (8) shows that item difficulty sufficient statistics are counts of persons who pass items, and ability sufficient statistics are persons' component test number-correct scores. Thus, person and item sufficient statistics for test battery Rasch models are additive, as in the usual Rasch model case.

The subtractive form of person and item parameters in (3), along with its exponential family membership, guarantees that some specific objectivity features will be retained by conjunctive Rasch model extensions. In particular, (7) guarantees that if person sufficient statistics are fixed then item parameter estimation will be independent of person parameter estimation, and vice versa. This holds for all conjunctive measurement models, not only the (univariate and multivariate) Rasch model special cases that have been introduced so far.

Local independence within the conjunctive measurement family. It has already been shown in (1) that the Rasch model is locally independent. Since (6) is a product of Rasch models, it follows that the multivariate Rasch model is locally independent as well. Among the many possible conjunctive measurement models, it happens that (univariate and multivariate) Rasch models are the only conjunctive measurement models that are locally independent. More precisely, it can be shown (Jannarone, 1991a) that:

Proposition I. Special cases of (3) are locally independent if and only if second-order and higher-order terms are absent, that is, all of the constraints (5) are satisfied.

It also follows from (7) that locally dependent conjunctive measurement models must involve nonadditive statistics. More precisely, it follows that:


Proposition II. Special cases of (3) are locally dependent if and only if they require the use of second-order and/or higher-order sufficient statistics, that is, one or more of the constraints in (5) are not satisfied.

The practical consequence of Propositions I and II is that Rasch models are quite exclusive (but substantively limited) members of a large measurement model family. The measurement family is large, in that it includes many special cases that are restricted only by objective measurement and statistical soundness concerns. Rasch models are exclusively locally independent and additive, hence easy to interpret and statistically elegant. However, they are also limited for reflecting interesting, locally dependent versions of (3), some of which will be given next.

Some locally dependent conjunctive measurement models. (The following examples will be treated briefly here; see Jannarone, 1988, 1991, 1991b, for details.) As mentioned earlier, locally dependent models can be useful in settings where items are sequentially linked, as in studying exercise injuries and learning transfer abilities. One relatively simple LDCM model for sequentially linked items has the Rasch-Markov form,

having person sufficient statistics of the form,

along with item sufficient statistics of the form,

(The M symbols in (10) and (11) indicate that parameter estimates are monotonically increasing functions of their corresponding sufficient


statistics. It follows from (9) that all item local dependencies can be explained by those among adjacent items only, resulting in the so-called Markov property; see Jannarone, 1987.)

The nonadditive person statistics associated with the Rasch-Markov model reflect individual differences in adjacent item local dependencies. For exercise injuries, high values of such statistics (controlling for numbers-of-injuries) reflect a person's inclination to press on in the face of physical adversity. For learning transfer settings, high cross-product statistic values (controlling for number-correct scores) reflect a person's inclination to successfully transfer information that was learned on one item to the next item.

Nonadditive person statistics can also be used to assess the utility of LDCM models in prediction. In particular, the explanatory power of (9), over and above that of the Rasch model, can be assessed with partial correlation tests. Such tests can be performed based on sample correlations among number-correct person sufficient statistics, cross-product sufficient statistics, and external criterion measures. (More precise tests based on correlations among efficient Rasch-Markov person parameter estimates can be obtained as well, but calculating efficient estimates is not easy; see Jannarone, 1987.) Similar tests can be constructed for validating the other LDCM models to be introduced below.

Some simplified versions of (9) can be formulated by equating item parameters. For example, all first-order item parameters can be equated and all second-order item parameters can be equated, resulting in the following stationary Rasch-Markov (or binomial-Markov) model,

having the same person statistics as (1), but item sufficient statistics of the form,

Choosing between the Rasch-Markov model (9) and the Rasch-binomial model (12) is like choosing between the Rasch model and the binomial model (Keats & Lord, 1962). Insofar as first-order and


second-order item sufficient statistics are equal, they should be combined for greater statistical efficiency and simplicity. Insofar as true corresponding item parameters are unequal, however, explanatory power will be lost. Constrained versions of the LDCM models below can be similarly constructed, and similar considerations apply.

Other simplified versions of (9) can be obtained by excluding person parameters. For example, second-order person parameters can be removed from (9), resulting in a model that accounts for local dependencies without accounting for individual differences in such dependencies. Such LDCM models, some of which have been adopted in interesting ways (Andrich, 1978, 1985; Embretson, 1984), are perfectly sound, provided that such individual differences do not exist. If they could exist, however, it would seem better to account for them formally and utilize resulting individual differences measures for external prediction. Similar options and concerns apply to the remaining LDCM models to be described.

Other LDCM models can be used in schemes where "testlets" made up of items sharing common content are involved (Wainer & Thissen, 1989). One familiar scheme involves testlets that assess comprehension ability, with each testlet being based on reading the same paragraph of text. For the case involving T testlets, with the t-th testlet being made up of $M_t$ items (t = 1, ..., T), an appropriate LDCM model would have the form,

where $M = M_1 + \cdots + M_T$. For simple cases involving two items in each testlet, the person sufficient statistics would be,

and the item sufficient statistics would be


The nonadditive person statistics in this case may be viewed as measures of task completion style. Among persons who get scores of 50 on a test made up of 50 two-item testlets, for example, the cross-product score can range from 0, indicating that a person got exactly one item correct in each testlet, to 25, indicating that a person either passed both items or no items in each testlet. The score of 0 would show an inclination to get one item correct and then move on, whereas the score of 25 would show an inclination to either work a problem through entirely or give it up. (For $M_t$ values higher than 2, the $\theta^{(2)}$ sufficient statistics are total numbers of item pairs that are both passed within testlets, with high values reflecting "compulsive" behavior and low values reflecting "hyperactive" behavior.) Insofar as these behaviors can be measured reliably, they may be useful in predicting certain external criteria, over and above number-correct scores. Alternatively, they could reflect sources of error variation if only number-correct scores are used, which could be either adjusted statistically or reduced by coaching examinees, or both.

Thus, LDCM models can be used to identify different cognitive resource allocation strategies, by capitalizing on nonadditive person statistics. This use of LDCM models has also been proposed in settings where a battery of diagnostic subtests is assigned, followed by a training period, followed by a battery of parallel achievement subtests (Jannarone, 1991). As in the previous two examples, the extra utility of nonadditive test scores in this setting is easy to demonstrate. In particular, it can be shown that different learning strategies can be uncovered by assessing their (nonadditive, within-person) subtest correlation coefficients; such differential strategies can be uncovered even among people having identical pretest, posttest, and change scores (Jannarone, 1991).
Furthermore, if such individual strategy differences exist, then pretest-posttest scores (in the latter case as well as testlet scores in the former case) must be locally dependent, in keeping with Propositions I and II.

Finally, LDCM models can be used in settings where quickness and correctness are measured concurrently (Jannarone, 1991b), to assess individual differences in speed-accuracy tradeoffs. For example, in speeded test settings where stringent time limits are imposed, some persons may "rise to the occasion" and do relatively well, while others may not do so well. Assessing individual differences along these lines may be useful in selecting personnel for whom quick, accurate responses are essential, such as air traffic controllers and police officers. Corresponding LDCM models are based on person sufficient statistics that are cross-products among item response speed measures and item


correctness measures. As in the previous three cases, since these measures are nonadditive they are also necessarily locally dependent.

CONCLUSIONS

This chapter began with an informal discussion of whether and how to extend objective measurement beyond the Rasch model. A more careful analysis was presented next, based on historical measurement developments. The analysis indicated that objective measurability should be assessed case-by-case, based on: A, extending explanatory power; B, ensuring statistical soundness; and C, retaining interpretive and statistical simplicity. Conjunctive measurement was also reviewed with these guidelines in mind. Overall conclusions follow.

Since conjunctive measurement models are exponential family members with subtractive person and item statistics, they satisfy all criteria previously listed under A and B. However, LDCM models necessarily violate some simplicity and specific objectivity requirements previously listed under C. Item and person measurement separability are preserved, in that person parameter estimates are conditionally independent of item parameters (when item sufficient statistics are fixed), and vice versa. However, it is less clear that LDCM "transcends the measuring instrument [by excluding person-by-item] interaction terms." Indeed, LDCM clearly involves such interactions at the item level. On the other hand, LDCM is free of such interactions at the testlet level. In paragraph comprehension settings, for example, items are locally independent between distinct paragraph testlets, so that number-correct and cross-product scores do not interact between testlets. LDCM may be viewed, then, as "transcending the measuring instrument" up to a point, once the instrument is viewed as performing measurements at the testlet level rather than the item level. Similar conclusions can be made for other LDCM models if they are applied to repeatable testlets, rather than nonrepeatable tests.
The most basic message from this chapter is that if reasonable statistical soundness conditions are imposed, then alternatives to the (univariate or multivariate) Rasch model will necessarily violate the local independence axiom. This message is noteworthy, because local dependence conflicts with traditional latent trait theory foundations. It is tempting to suggest that this violation marks only a lull between newly discovered locally dependent counterexamples and new locally dependent test theory axioms that will necessarily follow. However,


the history of science (Kuhn, 1970; Miller & Fredericks, 1991; Popper, 1968) has shown that foundational requirements carry a great deal of inertia. Because of this inertia, the local independence axiom cannot be taken lightly and much effort will be needed to overcome it. Continued attempts to develop the toupee model along latent trait theory lines illustrate this inertia quite well. The cosmetic nature of such attempts should be clear, in light of the previous discussion. Moreover, straightforward LDCM alternatives to the toupee model can easily be developed, by including item-subtest regression parameters and corresponding sufficient statistics as necessary. However, the toupee model will not be shed for a better alternative, until researchers face the bald fact that it presents a local independence paradox.

The status of local independence is conceptually (and substantively; see Jannarone, 1991) similar to the status of Newton's physics laws at the end of the last century. Just as evidence gathered then that Newton's laws were not always followed, evidence is gathering now that local independence does not always follow. Likewise, just as classical physics gave way to modern physics then, local independence is being recognized as a special case of a more general test theory now. This century has shown that classical physics continues to provide a simple and elegant model, which is adequate for most purposes, but other quite useful developments have resulted from extending physics to include relativity. Similar developments in psychometrics seem likely for the next century, including (a) the continued successful use of additive Rasch measurement in most cases, along with (b) occasional uses of alternative approaches for measuring more complicated, nonadditive cognitive processes. The development of such alternatives seems especially likely, in light of ongoing advances in real-time human-computer interaction (Wainer & Thissen, 1989).
Returning finally to whether LDCM is objective measurement or an abomination, it was suggested early in the chapter that different researchers with different needs are likely to have different answers. That conclusion holds after a more careful analysis as well. Those who only need to measure achievement and other simple attributes will lean toward the Rasch model and away from less elegant "abominations," and rightly so. Others, who feel the need to extend cognitive measurement along dynamic lines, will be more willing to view LDCM as objective and potentially useful. Hopefully the preceding description has helped researchers with these and other leanings to consider locally dependent, conjunctive measurement for their own needs. (Some computer programs are available and others are being developed by the author for LDCM analysis. Researchers with LDCM interests should feel free to contact him for assistance.)


REFERENCES

Andersen, E.B. (1980). Discrete statistical models with social science applications. Amsterdam: North-Holland.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
Andrich, D. (1985). A latent trait model for items with response dependencies: Implications for test construction and analysis. In S.E. Embretson (Ed.), Test design: Developments in psychology and psychometrics (pp. 245-275). Orlando, FL: Academic Press.
Birnbaum, A. (1958). Statistical theory of tests of a mental ability. Annals of Mathematical Statistics, 29, 1285 (abstract).
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Bock, R.D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
Boring, E.G. (1957). A history of experimental psychology (2nd ed.). New York: Appleton-Century-Crofts.
Box, J.F. (1978). R.A. Fisher: The life of a scientist. New York: Wiley.
Embretson [Whitely], S. (1984). A general latent trait model for response processes. Psychometrika, 49, 175-186.
Engelen, R.J.H., & Jannarone, R.J. (1989). A connection between item/subtest regression and the Rasch model (Research Report No. 89-1). Enschede, The Netherlands: Department of Education, Twente University.
Ferguson, G.A. (1942). Item selection by the constant process. Psychometrika, 7, 19-29.
Fischer, G. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.
Fischer, G.H. (1981). On the existence and uniqueness of maximum likelihood estimates in the Rasch model. Psychometrika, 46, 59-77.
Fischer, G.H., & Formann, A.K. (1982). Some applications of logistic latent trait models with linear constraints on the parameters. Applied Psychological Measurement, 6, 397-416.
Fisher, R.A. (1921). Studies in crop variation, I: An examination of the yield of dressed grain from Broadbalk. Journal of Agricultural Science, 11, 107-135.
Fisher, R.A., & Mackenzie, W.A. (1923). Studies in crop variation, II: The manurial response of different potato varieties. Journal of Agricultural Science, 13, 311-320.
Fisher, W.P. (1991). Objectivity in measurement: A philosophical history of Rasch's separability theorem. In M. Wilson (Ed.), Objective measurement: Theory into practice. Norwood, NJ: Ablex Publishing Corp.
Galton, F. (1888). Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society, 45, 135-145.


Glas, C.A.W. (1991). A Rasch model with a multivariate distribution of ability. In M. Wilson (Ed.), Objective measurement: Theory into practice. Norwood, NJ: Ablex Publishing Corp.
Glas, C.A.W., & Verhelst, N.D. (1989). Using the Rasch model for dichotomous data for analyzing polytomous responses. Unpublished manuscript, CITO, Arnhem, The Netherlands.
Holland, P.W. (1981). When are item response models consistent with observed data? Psychometrika, 46, 79-92.
Holland, P.W., & Rosenbaum, P.R. (1986). Conditional association and unidimensionality in monotone latent variable models. Annals of Statistics, 14, 1523-1543.
Jannarone, R.J. (1986). Conjunctive item response theory kernels. Psychometrika, 51, 357-373.
Jannarone, R.J. (1987). Locally independent models for reflecting learning abilities (Center for Machine Intelligence Report No. 87-67). University of South Carolina, Columbia.
Jannarone, R.J. (1991). Conjunctive measurement theory: Cognitive research prospects. In M. Wilson (Ed.), Objective measurement: Theory into practice. Norwood, NJ: Ablex Publishing Corp.
Jannarone, R.J. (1991a). ... Contrasts and connections with traditional test theory. Unpublished manuscript. University of South Carolina, Columbia.
Jannarone, R.J. (1991b). Measuring quickness and correctness concurrently: A conjunctive IRT approach. Unpublished manuscript. University of South Carolina, Columbia.
Jannarone, R.J., Yu, K.F., & Laughlin, J.E. (1990). Easy Bayes estimates for Rasch-type models. Psychometrika, 55, 449-460.
Joreskog, K., & Sorbom, D. (1984). LISREL VI users guide. Chicago: International Educational Resources.
Keats, J.A., & Lord, F.M. (1962). A theoretical distribution for mental test scores. Psychometrika, 27, 59-72.
Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika, 49, 223-245.
Kelderman, H., & Jannarone, R.J. (1989, March). Conditional maximum likelihood estimation in conjunctive item response models. Paper presented at the annual American Educational Research Association meetings, San Francisco.
Kristoff, W. (1968). On the parallelization of trace lines for a certain test model (Research Report No. RR-68-56). Princeton, NJ: Educational Testing Service.
Kuhn, T.S. (1970). The structure of scientific revolutions (2nd ed.). Chicago: University of Chicago Press.
Lawley, D.N. (1943). On problems connected with item selection and test construction. Proceedings of the Royal Society of Edinburgh, 61, 273-287.
Lazarsfeld, P.F. (1958). Latent structure analysis. In S. Koch (Ed.), Psychology: A study of a science, Vol. III. New York: McGraw-Hill.
Lehmann, E.L. (1975). Nonparametrics: Statistical methods based on ranks. San Francisco: Holden-Day.


Lehmann, E.L. (1983). Theory of point estimation. New York: Wiley.
Lehmann, E.L. (1986). Testing statistical hypotheses (2nd ed.). New York: Wiley.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Macera, C.A., Pate, R.R., Powell, K.E., Jackson, K.L., Kendrick, J.S., & Craven, T.E. (1989). Predicting lower extremity injuries among habitual runners. Archives of Internal Medicine, 149, 2565-2568.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
McDonald, R.P. (1981). The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 34, 100-117.
McKinley, R.L., & Reckase, M.D. (1983). An extension of the two-parameter logistic model to the multidimensional latent trait space (Research Report No. R83-2). Iowa City, IA: American College Testing Program.
Miller, S.I., & Fredericks, M. (1991). Postpositivistic assumptions and educational research: Another view. Educational Researcher, 20, 2-8.
Mislevy, R.J., & Stocking, M.L. (1987). A consumers guide to LOGIST and BILOG (Research Report No. RR-87-43). Princeton, NJ: Educational Testing Service.
Mislevy, R.J., & Verhelst, N.D. (1990). Modeling item responses when different subjects employ different solution strategies. Psychometrika, 55, 195-216.
Mokken, R.J., & Lewis, C. (1982). A nonparametric approach to the analysis of dichotomous item responses. Applied Psychological Measurement, 6, 417-430.
Pearson, K. (1896). Mathematical contributions to the theory of evolution, III. ... Royal Society, A, 187, 113-178.
Popper, K.P. (1968). The logic of scientific discovery. New York: Harper & Row.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original work published 1960).
Rosenbaum, P.R. (1984). Testing the conditional independence and monotonicity assumptions of item response theory. Psychometrika, 49, 425-436.
Rosenbaum, P.R. (1987). Probability inequalities for latent scales. British Journal of Mathematical and Statistical Psychology, 40, 157-168.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph No. 17.
Scheffe, H. (1959). The analysis of variance. New York: Wiley.
Searle, S.R. (1971). Linear models. New York: Wiley.
Spearman, C. (1904). The nature of intelligence and the principles of cognition. London: Macmillan.
Spearman, C. (1927). The abilities of man. New York: Macmillan.
Stout, W. (1987). A nonparametric approach for assessing latent trait dimensionality. Psychometrika, 52, 589-617.


Stout, W. (1990). A nonparametric multidimensional IRT approach with applications to ability estimation. Psychometrika, 55, 293-326.
Suppes, P., & Zanotti, M. (1981). When are probabilistic explanations possible? Synthese, 48, 191-199.
Thurstone, L.L. (1932). The theory of multiple factors. Ann Arbor, MI: Edwards Brothers.
Thurstone, L.L. (1947). Multiple factor analysis: A development and expansion of the vectors of mind. Chicago: University of Chicago Press.
Van der Linden, W., & Jannarone, R.J. (1989). Locally dependent choice models. Unpublished manuscript, Twente University.
Wainer, H., & Thissen, D. (1989). Item clusters in computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-202.
Whitely, S.E. (1980). Multi-component latent trait models for ability tests. Psychometrika, 45, 479-494.
Wilson, M. (1989). Saltus: A psychometric model of discontinuity in cognitive development. Psychological Bulletin, 105, 276-289.
Wright, B. (1980). Forward and Afterward to Rasch, G. Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press.

chapter

13

Objective Measurement with Multidimensional Polytomous Latent Trait Models*

Henk Kelderman

University of Twente now at Vrije University

* Requests for reprints should be sent to H. Kelderman, Department of Work and Organizational Psychology, Faculty of Psychology and Pedagogics, Vrije University, De Boelelaan 1081c, 1081HV Amsterdam, The Netherlands. The author thanks Mary Lunz of the American Society of Clinical Pathologists for test data and suggestions.

Objective measurement in the social sciences is rarely possible without probabilistic models. In many cases these measurements are based on aggregates of elementary measurements, such as answers to test questions and errors in spelling, which are themselves subject to random error. Classical test theory models (Lord & Novick, 1968) yield estimates of the reliability of the total test score under certain assumptions such as unidimensionality of the test. Item response theory (Birnbaum, 1968; Lord, 1980; Rasch, 1960/1980) explicitly relates the item responses to subject parameters and item parameters via a probabilistic model. For a set of test data, the assumptions of the model may be tested and the parameters estimated. A desirable property of the class of IRT models proposed by Rasch is that the subject parameters may be estimated independently of the item parameters, and vice versa. As an example of a Rasch type model


consider the unidimensional Rasch models for polytomous items described by Andrich (1978), Masters (1982, 1987), Wright and Masters (1982), Masters and Wright (1984), and others. Suppose that N subjects respond to k test items. On item $j$, subject $i$ may give any of $r_j + 1$ responses $x_{ij} = x$ ($x = 0, \ldots, r_j$). The probability of this is denoted by $\pi_{ijx}$. Let $\theta_i$ be a parameter describing the ability of person $i$ and let $\delta_{jx}$ be a difficulty parameter of response $x$ on item $j$. Written in terms of log-odds, the partial credit model relates the response probabilities to the person and item parameters through

$$\ln\frac{\pi_{ijx}}{\pi_{ij(x-1)}} = \theta_i - \delta_{jx}, \qquad (1)$$

$j = 1, \ldots, k$; $x = 1, \ldots, r_j$. If for a certain population of subjects P and a universe of items Q, this Rasch type model holds, the abilities of the subjects in P can be compared regardless of the choice of items from Q, and the difficulty of the items can be compared regardless of the particular sample of subjects taken from P. Rasch (1960/1980, 1977) calls this specifically objective measurement. It can be shown that the likelihood of the data under model (1) factors into two distinct parts, a part with only subject parameters and a part with only item parameters (e.g., Masters, 1982). To each of these sets of parameters corresponds a set of minimal sufficient statistics. For the subject parameters they are the simple sums of the item scores $x_{i1} + \cdots + x_{ik}$ and for the item response parameters they are the numbers of subjects that have given that particular response.

The sets P and Q define the limits of model validity. For subjects and items outside the sets, objective measurement may not be possible. For example, if the trait to be measured is knowledge of medieval history, kindergarten children may not be in P nor arithmetic items in Q. Sometimes a set of items Q is supposed to measure a certain trait for subjects in P, but it does not fit the Rasch model. In that case, one may attempt to partition the universe Q into s Rasch homogeneous subuniverses $Q_q$ ($q = 1, \ldots, s$), each fitting the Rasch model. Obviously, if this can be done, objective measurement is still possible because each subuniverse allows objective measurement. The only difference is that each subject is now characterized by a vector of person parameters $\theta_i = (\theta_{i1}, \ldots, \theta_{is})$ rather than a single scalar subject parameter.

However, in some testing situations, particularly if the items have several answer categories, multidimensionality may be more intricate and surface within a single item. For example, consider the following test item "$\sqrt{12 - 3} = ?$".
Two numerical operations may be assumed: subtraction ($12 - 3 = 9$) and taking the square root ($\sqrt{9} = 3$). If both operations are performed, the answer gets the full credit x = 2. If only the first operation is performed, the answer gets the partial credit x = 1. Finally, if the answer is incorrect it is scored x = 0. The partial credit model (1) then explains the odds of getting a credit of 1 rather than 0 by the person's ability $\theta_i$ and a response parameter $\delta_{j1}$ and explains the odds of getting a credit of 2 rather than 1 by the same ability parameter and a response parameter $\delta_{j2}$. It might, however, be hypothesized that "subtracting" and "taking the square root" are different abilities calling for a model where the consecutive odds depend on different latent traits. Therefore, we now consider multidimensional Rasch models.

Multidimensional Rasch Models

Rasch (1961), aware of multidimensionality within item responses, invented a model that allows for a different dimension in each category:

$j = 1, \ldots, k$; $x = 0, \ldots, r_j$, with constraints $\delta_{j0} = 0$ and $\delta_{11} + \cdots + \delta_{k1} = 0$. This model describes the log probability of a score $x$ on item $j$. It is easy to reformulate the model into a model for the odds of getting score $x$ rather than $x - 1$ as in the partial credit model, because

This multidimensional version of the partial credit model was described by Kelderman (1991a,b). As in the unidimensional partial credit model (1), the model contains a threshold parameter $\delta_{jx}$, but the model now has a separate subject parameter $\theta_{ix}$ for each response category. An extension of Rasch's multidimensional model that gives the analyst more flexibility in specifying the relation between item responses and latent traits is the Multidimensional Polytomous Latent Trait (MPLT) model (Kelderman & Rijkes, in press). Let $B_{qjx}$ be a positive integer valued weight of response $x$ of item $j$ with respect to the $q$th subject parameter. If $B_{qjx} > 0$, it means that the item response denoted by the pair $(j, x)$ depends on latent trait $q$, and if $B_{qjx} = 0$, it does not. In addition, if the weight is larger than one, it means that the response involves more than one application of the latent trait. The MPLT model is then written as:
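A form consistent with the definitions above is sketched here. It is an assumption offered for orientation, not a quotation of the original display: the weights $B_{qjx}$, person parameters $\theta_{iq}$, and response parameters $\delta_{jx}$ are those named in the text, while the number of traits $s$ and the normalization over the categories $y = 0, \ldots, r_j$ are supplied so that the probabilities sum to one.

```latex
\pi_{ijx}
  = \frac{\exp\bigl(\sum_{q=1}^{s} B_{qjx}\,\theta_{iq} - \delta_{jx}\bigr)}
         {\sum_{y=0}^{r_j} \exp\bigl(\sum_{q=1}^{s} B_{qjy}\,\theta_{iq} - \delta_{jy}\bigr)}
```

With a single trait and weights $B_{1jx} = x$, the consecutive log-odds under this sketch reduce to $\theta_{i1} - (\delta_{jx} - \delta_{j(x-1)})$, which has the partial credit form.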


$j = 1, \ldots, k$; $x = 0, \ldots, r_j$. As in the previous models, additional constraints must be imposed on the parameters to obtain a unique set of parameter estimates. There are two types of indeterminacies in the model: between $\mu_j$ and $\delta_{jx}$, and between $\theta_{iq}$ and $\delta_{jx}$. Adding a constant $c_j$ to each $\delta_{jx}$ and subtracting it from $\mu_j$ does not change the model. These indeterminacies may be removed by setting convenient linear restrictions on the parameters that facilitate their interpretation. For example, setting $\delta_{j0} = 0$ removes the first indeterminacy and makes sense if x = 0 is the incorrect response. The constant $c_q$ may be chosen such that the mean of the person parameters or of the item parameters is equal to zero, to fix the scale of $\theta_{iq}$. Different parameterizations may be employed in different situations to improve the interpretability of the parameters. One example of this is given later in this paper. If the parameters of model (2) are unique, conditional estimation of parameters is possible.

Several applications of MPLT models have been described in the literature (Duncan & Stenbeck, 1987; Kelderman, 1991; Kelderman & Rijkes, in press; Wilson, 1989, 1990; Wilson & Adams, 1993; Wilson & Masters, 1993), and a computer program that computes conditional maximum likelihood estimates and goodness-of-fit statistics has been developed (Kelderman, 1992; Kelderman & Steen, 1993). A powerful example of MPLT modeling in practice is the following analysis of medical-laboratory-test items.

An Example

The American Society of Clinical Pathologists (ASCP) produces tests for the certification of medical personnel. The Society has a long-standing commitment to objective measurement. Their tests are carefully constructed and analyzed to make sure that the comparison of person parameters is independent of item content as much as possible. Mary Lunz of ASCP made available for reanalysis the following set of data.
The data we analyze here are the responses of 333 examinees to nine four-choice items measuring the ability to perform medical laboratory tests. The items are calibrated under a Rasch model so that the sum of the correct answers contains all information about the subject's ability $\theta_i$ available in the data. There are, however, reasons to believe that this single ability parameter might not be sufficient to explain the subjects' behavior on the tests. In particular, it was hypothesized that several different cognitive processes are involved in answering the items, and that even the incorrect responses might be chosen on the basis of partial execution of these processes. The correct response would then be chosen if all processes were successfully executed and orchestrated.

Table 13-1 gives the judgements of ASCP content experts about three cognitive processes that are possibly involved in choosing the item alternatives. For example, in Item 3 the correct answer b involves the application of knowledge as well as two computations, whereas in the incorrect answer c one calculation is missing. Now assume that there are individual differences in the subjects' ability to perform each of these cognitive operations and that the three corresponding ability parameters are given by $\theta_{i1}$, $\theta_{i2}$, and $\theta_{i3}$. Furthermore, assume that there is a parameter $\theta_{i4}$ that is exclusively involved with giving the correct answer. In that case, the specification in Table 13-1 gives the B weights of the MPLT model.

To investigate whether this hypothesis is correct we specify two models: (a) a model with $\theta_{i4}$ only, and (b) a model with all four ability parameters $\theta_{i1}$, $\theta_{i2}$, $\theta_{i3}$, $\theta_{i4}$. The item difficulty parameters of both models were estimated with the LOGIMO program (Kelderman & Steen, 1993). The log-likelihood (and number of independent parameters) of model (a) is -9391 (36) and of model (b) is -9052 (446). The log-likelihood of a model is the logarithm of the probability of the observed data under that model. Obviously the likelihood of the data under model (b) is larger than under model (a), but this comparison is not fair, since model (b) has 446 parameters estimated from the data and model (a) only 36.

Table 13-1 Specification of Cognitive Processes Involved in Responses of the ASCP Medical Laboratory Test

I Applies

III Correlates

Knowledge

Item abed

II Calculates Data

abcdabcd

12 2 1 2 3 1 2 1 1 4 2 1 1 5 1 1 1 1 6 1 1 1 7 1 2 2 8 1 1 1 1 1 9 1 N = 3370, Nonresponse = 39

1 1 1 1

1

1 1

1

IV Correct a b c d

1 1 1 1 1 1 1 1 1


A statistic that makes a tradeoff between the log-likelihood and the number of parameters to be estimated is Akaike's Information Criterion, AIC = constant + 2 × (number of parameters) - 2 × (log-likelihood). Akaike (1977) found that adding one extra parameter to the model can generally be expected to be equivalent to an increase of one point on the log-likelihood scale. If we now compare models (a) and (b) with AIC (constant = -18000), we have values of 854 and 996, respectively. That is, adding the 410 parameters to the 36 parameters of model (a) to get model (b) does not increase the likelihood beyond that expected from chance. In fact, the likelihood of (b) is smaller than expected. The conclusion, therefore, must be that the cognitive processes do not explain the structure of the data beyond the simple correct model.

In Table 13-2, Pearson goodness-of-fit statistics are given for each of the items under model (a). Comparing the Pearson goodness-of-fit statistics $X^2$ with their degrees of freedom, we see that $X^2$ generally is not much larger, indicating a satisfactory fit. So we may conclude that the practice of using the number-correct score is not invalidated by possible multidimensionality in the correct responses and does not ignore possible information about cognitive processes present in incorrect responses.

Table 13-3 shows the parameter estimates of model (a). To obtain an interpretable and unique set of parameters, two types of identifying restrictions are imposed. First, to make the parameters of the incorrect responses comparable, we reparameterize in such a way that their sum equals zero in each item. This is achieved by subtracting in (2) their mean from the item's response parameters $\delta_{jx}$ and adding it to $\mu_j$. Second, as with the dichotomous Rasch model, to fix the origin of the latent trait scale the sum of the item parameters corresponding to the correct responses is set equal to zero.
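The AIC comparison reported above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `aic` and its `constant` argument are introduced here for convenience, and the inputs are the log-likelihoods and parameter counts quoted in the text.

```python
def aic(log_likelihood, n_params, constant=0.0):
    """AIC = constant + 2 * (number of parameters) - 2 * (log-likelihood)."""
    return constant + 2 * n_params - 2 * log_likelihood


# Log-likelihoods and parameter counts as reported for the two models.
aic_a = aic(-9391, 36, constant=-18000)   # model (a): "correct" trait only
aic_b = aic(-9052, 446, constant=-18000)  # model (b): all four traits

# Gain in log-likelihood versus the number of extra parameters spent on it.
loglik_gain = -9052 - (-9391)
extra_params = 446 - 36

print(aic_a, aic_b, loglik_gain, extra_params)
```

The lower AIC belongs to model (a): the 410 extra parameters of model (b) buy a log-likelihood gain of only 339 points.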
The starred parameters in Table 13-3 describe the item difficulty. A high value of this parameter means that the item is relatively difficult. This difficulty is measured in the

Table 13-2
Goodness-of-Fit Statistics for Model (a)

Item          1    2    3    4    5    6    7    8    9   Sum
Chi-Square   33   38   40   32   21   16   21   22   28   251
DF           21   22   22   22   23   20   23   20   21   194

Table 13-3

Parameter Estimates of Model (a) Item

1

2

3

4

5

6

7

8

9

-0.54 0.95 0.92* 1.50

-0.88

-0.37 -0.66* 1.00 -0.63

0.01* 0.07 -0.48 0.41

-0.14 -0.93 -0.47* 1.07

1.13 -0.07* -0.58 -0.54

0.30 0.60 0.21* -0.30

-0.46*

0.97

Response

a b c d

* Correct Response

0.02 0.50* 0.85

1.05 0.16 0.90

1.29 0.05* 0.31


same latent trait scale as the subject parameter $\theta_{i1}$ (see Model (2) for s = 1). It is seen that Items 1 and 2 are relatively difficult and Items 3, 5, and 8 are relatively easy. The nonstarred parameters are unpopularity parameters of the distractors. A high value of this parameter indicates that the particular distractor is not popular compared to the other distractors of the item. For example, Table 13-3 shows that Distractor b of Item 9 is much more popular than a or d.

DISCUSSION

In this chapter the possibility of multidimensional objective measurement is discussed and a general multidimensional Rasch model for polytomously scored items is introduced. The analysis of the ASCP data shows that multidimensionality can be modelled quite flexibly on the item response level. It is shown that multidimensionality is not present in these data and that a unidimensional model suffices to describe the data. It should be noted that this unidimensional model is not the same as the dichotomous Rasch model, but a unidimensional model for polytomous items. It is a subject for further investigation to determine whether this new model is more desirable in this case than the classic Rasch model. An advantage is that all given responses are modelled and described, and that goodness-of-fit studies that focus on the various responses may yield information on possible sources of misfit. It is hard to compare both models empirically, for example, using AIC, because the sample space is different.

REFERENCES

Akaike, H. (1977). On entropy maximization principle. In P.R. Krishnaiah (Ed.), Applications of statistics (pp. 27-41). Amsterdam: North Holland.
Andrich, D. (1978). A rating scale formulation for ordered response categories. Psychometrika, 43, 561-573.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Duncan, O.D., & Stenbeck, M. (1987). Are Likert scales unidimensional? Social Science Research, 16, 245-259.
Kelderman, H. (1991, April). Estimation and testing a multidimensional Rasch model for partial credit scoring. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, Illinois.
Kelderman, H. (1992). Computing maximum likelihood estimates of loglinear IRT models from marginal sums. Psychometrika, 57, 437-450.
Kelderman, H., & Rijkes, C.P.M. (in press). Loglinear multidimensional IRT models for polytomously scored items. Psychometrika, 59.
Kelderman, H., & Steen, R. (1988). LOGIMO: Loglinear Item Response Modeling [computer manual]. Groningen, The Netherlands: i.e.c. ProGAMMA.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Masters, G.N., & Wright, B.D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529-544.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Masters, G.N. (1987). Measurement models for ordered response categories. In R. Langeheine & J. Rost (Eds.), Latent trait and latent class models. New York: Plenum.
Rasch, G. (1961). On the meaning of measurement in psychology. In J. Neyman (Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 5). Berkeley, CA: University of California Press.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 17, 58-94.
r tests. Chicago: The University of Chicago Press. (Original work published 1960)
Wilson, M. (1989, April). The partial order model. Paper presented at the Fifth International Objective Measurement Workshop, Berkeley, CA.
Wilson, M. (1990). An extension of the partial credit model to incorporate diagnostic information. Unpublished paper, Graduate School of Education, University of California, Berkeley, CA.
Wilson, M., & Adams, R.A. (1993). Marginal maximum likelihood estimation for the ordered partition model. Journal of Educational Statistics, 18, 69-90.
Wilson, M., & Masters, G.N. (1993). The partial credit model and null categories. Psychometrika, 58, 87-99.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.

chapter 14

When Does Misfit Make a Difference?

Raymond J. Adams
Australian Council for Educational Research

Benjamin D. Wright
University of Chicago

The Rasch model (Rasch, 1960/1980; Wright & Stone, 1979), and indeed every fixed-effects item response model, requires that item parameters remain fixed and independent of the persons they are measuring. Similarly, the model requires that the person parameters be independent of the particular items used to measure them. When applying the model, it is usual to use tests of fit that examine the extent to which these requirements are met by a set of data. If these tests fail to reject the model (at some arbitrary level of statistical significance), then it is accepted that the model and data are compatible and that the above properties are met. On the other hand, if the fit tests lead to a rejection of the model, then it is concluded that the data and model are incompatible.

In reality, model and data are never fully compatible. The manifest ability of person n (that is, the ability level that is actually applied by person n) is always slightly different when faced with item i than it is when faced with item j. Similarly, the manifest difficulty of item j is always slightly different for persons n and m. No data can ever be perfectly compatible with any measurement model. The consequence is that as samples become large enough, tests of fit invariably indicate that the data do not exactly follow the requirements specified by the model. When the number of observations is sufficiently large, tests of fit will always indicate incompatibility between any data and any model (Gustafsson, 1980; Martin-Loff, 1974).

A more constructive approach to examining the fit of a model is to address the question of whether the model constructs a useful representation of the data structure (van den Wollenberg, 1988). Under these circumstances, the impractical all-or-nothing treatment of the usefulness of parameter estimates is replaced by a consideration of how well the model represents the important elements of the data. When we consider the case of person measurement, the most important questions are: How useful are the person parameter estimates that result from the application of a model to the data at hand? What kinds of mistakes are likely to be made if the parameter estimates are used as though the model and data are compatible?

In this chapter, misfit to the Rasch model is examined in terms of its effect upon person parameter estimates. The analysis addresses the case where a set of calibrated items is used to provide estimates of person abilities. Since the Rasch model can be derived from a set of coherent requirements for measurement (Thurstone, 1928; Wright, 1989), the term measurement disturbance is used to describe data misfit.

FRAMEWORK FOR STUDYING DISTURBANCE

To undertake the analysis, a general framework for disturbance is presented. This framework enables the imposition of a broad class of measurement disturbances through the specification of a single interaction term. The Rasch model for dichotomous data specifies that the odds of success of person n on item i must depend upon a function of n and i that can be factored into two components, one depending only on n and the other depending only on i. This requirement can be expressed as:

$$\frac{P_{ni}}{1 - P_{ni}} = K\,A(n)\,C(i) \qquad (1)$$

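where $P_{ni}$ is the probability that person n succeeds on item i and K is an arbitrary constant that can be set at one. One large class of deviations from this measurement model occurs when the odds of success depend on a function of n and i that cannot be factored. That is:

$$\frac{P_{ni}}{1 - P_{ni}} = K\,A(n)\,C(i)\,E(n,i) \qquad (2)$$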

where E(n,i) cannot be factored into separate functions of n and i. Taking logs (with K = 1) and writing $b_n = \log[A(n)]$, $d_i = -\log[C(i)]$, and $e_{ni} = \log[E(n,i)]$, this becomes:

$$\log\frac{P_{ni}}{1 - P_{ni}} = (b_n - d_i) + e_{ni} \qquad (3)$$

or equivalently:

$$P_{ni} = \frac{\exp(b_n - d_i + e_{ni})}{1 + \exp(b_n - d_i + e_{ni})} \qquad (4)$$

The requirement that the outcome be predictable solely from the fixed main effects is violated by the interaction, $e_{ni}$. This interaction can be produced by many kinds of measurement disturbances. When estimated item difficulties are used to estimate person abilities (and no other disturbance exists), then $e_{ni}$ can represent the uncertainty in the item parameter estimates. Here $e_{ni}$ is the same for all persons but varies over items (i.e., $e_{ni} = e_i$), and the observed outcomes actually result from $(b_n - d_i) + e_i$. In this case the actual difficulty of each item is not exactly what it is believed to be. The disturbance $e_i$ may be equal for all items (simply e), it may be different for all items, or it may be useful to consider $e_i$ as sampled randomly from a normal distribution with mean zero and variance $\sigma_i^2$, the calibration error for item i.

By specifying the combinations of n and i that lead to a nonzero interaction, $e_{ni}$, and by specifying the relationship between the magnitude of $e_{ni}$ and group membership, it is possible to generate a family of disturbances. This family includes the most widely mentioned disturbances: item bias, multidimensionality, variations in discrimination, and guessing.

A general model is derived by partitioning the test into Q mutually exclusive, exhaustive, nonempty item subsets, $\Omega_1, \Omega_2, \ldots, \Omega_Q$, and by partitioning the sample of individuals measured into P mutually exclusive, exhaustive, and nonempty person subsets, $\Theta_1, \Theta_2, \ldots, \Theta_P$. If P = 1, then the persons are not partitioned; if P = N, then they are partitioned one per group; and if P = p < N, they are partitioned into some small number of groups. Similarly, the items may be partitioned into 1, q, or L groups. Then when person n takes item i, the outcome is not best predicted by $(b_n - d_i)$ but by $(b_n - d_i) + e_{ni}$, where $e_{ni} = e_{st}$ for $n \in \Theta_s$, $i \in \Omega_t$, and $e_{st}$ is either some function of s and t or a sampled element from some distribution whose parameters depend on s and/or t.
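The partition model can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the authors' procedure: the function names, the dictionary representation of the person and item partitions, and the numerical values are all hypothetical.

```python
import math


def success_probability(b_n, d_i, e_ni=0.0):
    """Probability of success when the log-odds are (b_n - d_i) + e_ni."""
    logit = (b_n - d_i) + e_ni
    return 1.0 / (1.0 + math.exp(-logit))


def disturbance(n, i, person_group, item_group, e_st):
    """Return e_ni = e_st for person n in subset s and item i in subset t.

    Combinations of (s, t) that are not listed are undisturbed (e_ni = 0).
    """
    s = person_group[n]
    t = item_group[i]
    return e_st.get((s, t), 0.0)


# Hypothetical partition: two person subsets and two item subsets (P = Q = 2).
person_group = {0: 0, 1: 1}
item_group = {0: 0, 1: 1}
# Only persons in subset 1 meeting items in subset 1 are disturbed.
e_st = {(1, 1): -0.5}

p_undisturbed = success_probability(
    0.0, 0.0, disturbance(0, 0, person_group, item_group, e_st))
p_disturbed = success_probability(
    0.0, 0.0, disturbance(1, 1, person_group, item_group, e_st))
print(p_undisturbed, p_disturbed)
```

Setting every entry of `e_st` to zero recovers the undisturbed Rasch probability, which is why a single interaction term is enough to describe the whole family.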

Calibration noise is the disturbance that arises when items with specified difficulties are used in a measurement instrument. The item parameter estimates used for estimating person measures always involve some uncertainty. That uncertainty can be represented by an error variance in the parameter estimates. The parameter estimates are generally assumed to be unbiased and normally distributed. That is, $d_i \sim N(\delta_i, \sigma_i^2)$, where $\sigma_i^2$ is a function of the sample size (and the targeting of $\delta_i$ on the abilities of the people in the calibrating sample). The disturbance model expresses calibration noise by letting Q = L and P = 1 so that $e_{ni} = e_i$. There is only one group of people, so the disturbance only varies over items. The size of $e_i$ depends upon the uncertainty in the item calibration, and it can be viewed as an element sampled from a normal distribution with zero mean and variance $\sigma_i^2$, the error variance in the calibration of item i.

Random misfit is similar to calibration noise. In this case each item subset contains one item and each person subset contains one person, so that $e_{ni}$ is different for every item-person combination. That is, Q = L, P = N, and $e_{ni} = e_{ni}$, where $e_{ni}$, the disturbance, is random, perhaps viewed as sampled from a normal distribution with a unique variance $\sigma_{ni}^2$. For simplicity, it is easiest to begin with the variance of the disturbance as constant, so that $\sigma_{ni}^2 = \sigma^2$ for all n and i. Because the disturbance is regarded as random, the particular value of $e_{ni}$ is not related to $b_n$ or $d_i$.

Item bias is a third type of disturbance. There are innumerable definitions of item bias in the literature. Here an item is considered biased when its difficulty parameter is different for one group than for another. Mellenbergh (1982) describes two biases of this type: uniform bias, a constant shift in difficulty, and nonuniform bias, a shift in difficulty related to the ability of the individual. In our formulation only uniform bias is included under the heading item bias. Nonuniform bias is treated as multidimensionality. When the persons are partitioned into a small number of groups (P = p) and the items are partitioned into a small number of groups (Q = q), then the items in group $\Omega_t$ are considered biased for the people in group $\Theta_s$ if $e_{ni} = e_{st} \neq 0$ for $n \in \Theta_s$, $i \in \Omega_t$, and $e_{ni} = 0$ otherwise.

Multidimensionality is represented by a full person partition (P = N) and items partitioned into a small number of groups (Q = q). Then $e_{ni} = e_{nt}$ for $i \in \Omega_t$, so that the ability of person n, when attempting items in subgroup t, is given by $(b_n + e_{nt})$. When the dimensions underlying the test are thought to be related, the error components need to be specified so that corr$(b_n + e_{nt}, b_n)$ is some function of t. If there are people for whom $e_{nt}$ is zero, then this error specification gives nonuniform item bias. Suppose, for example, a mathematics test consists of a small subset of items that have a language requirement that is greater than that of the majority of items; that is, they are confounded by a second dimension. The ability necessary to respond to these items is some combination of mathematics and language ability, and this combination is $(b_n + e_{nt})$.

Variations in item discrimination are introduced when the disturbance is specified so that both items and persons are partitioned one per subgroup (P = N, Q = L); then $e_{ni} = e_{ni}$, the ability of person n on item i becomes $(b_n + e_{ni})$, and the abilities $(b_n + e_{ni})$ and $b_n$ are correlated differently for different items. If corr$(e_{ni}, b_n)$ is positive, $(b_n + e_{ni})$ will increase for persons with higher abilities and decrease for persons with lower abilities. This leads to increased discrimination between higher and lower abilities in item i, relative to the rest of the items. Similarly, a negative correlation leads to a decrease in item discrimination relative to the rest of the items. If the items are grouped into larger subsets with homogeneous discriminations, then the model for variations in discrimination is exactly the multidimensionality model. This framework shows that, for the Rasch model, variation in item discrimination can be regarded as a form of multidimensionality.

Guessing can be included in this disturbance model. As with variations in discrimination and random misfit, both persons and items are fully partitioned (P = N, Q = L) so that $e_{ni} = e_{ni}$. For random misfit, $e_{ni}$ was produced randomly. For variations in discrimination, it was correlated with $b_n$. Now for guessing, $e_{ni}$ is functionally related to $(b_n - d_i)$, so that the probability that person n will guess correctly on item i, $g_{ni}$, brings together the propensity for person n to guess and the chance that item i can be guessed correctly. This is a more general definition of guessing than that typically used with models such as the three-parameter logistic (Birnbaum, 1968), which suggests that guessing varies with items but not individuals. To specify $e_{ni}$ so that $g_{ni}$ is the minimum probability that person n will succeed on item i use:

$$e_{ni} = \max\left[b_n - d_i,\ \log\frac{g_{ni}}{1 - g_{ni}}\right] - (b_n - d_i)$$

so that:

$$(b_n - d_i) + e_{ni} = \max\left[b_n - d_i,\ \log\frac{g_{ni}}{1 - g_{ni}}\right]$$

The disturbances described have been recognized because their identification follows directly from the traditional statistical procedures of psychometrics. Our identification of variations in item discrimination is due to the role that item-test correlations have played in traditional test theory, and multidimensionality is identified because factor analysis is so often applied in test analysis. Item bias is specified because of the concern with the fairness of tests for all individuals. None of these disturbances, however, is more important or more likely than any other. Other, as yet unnamed, disturbances must exist and, in fact, are quite likely to be of equal prevalence. Consider, for example, analogues to item discrimination and item multidimensionality, such as person discrimination and person multidimensionality. These disturbances might be equally prevalent, and might present equally likely threats to valid measurement. They have been neglected because techniques that would expose them have not been routinely applied.

Table 14-1 summarizes the six disturbances that have been singled out for discussion and shows how they can be modelled. These six are introduced because they are most commonly recognized and named. While we have introduced and specified these disturbances separately, of course they exist simultaneously, to a greater or lesser extent, in all real test data. To specify a model with all disturbances is possible, but its study would be discouragingly ambiguous.
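As a numerical illustration of the guessing disturbance, the following Python sketch evaluates $e_{ni} = \max[b_n - d_i, \log(g_{ni}/(1 - g_{ni}))] - (b_n - d_i)$ for a low-ability and a high-ability person. The function names and the chosen abilities, difficulty, and guessing level are assumptions made only for this example.

```python
import math


def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))


def guessing_disturbance(b_n, d_i, g_ni):
    """e_ni = max[b_n - d_i, log(g_ni / (1 - g_ni))] - (b_n - d_i)."""
    return max(b_n - d_i, logit(g_ni)) - (b_n - d_i)


def success_probability(b_n, d_i, e_ni):
    """Probability of success when the log-odds are (b_n - d_i) + e_ni."""
    return 1.0 / (1.0 + math.exp(-((b_n - d_i) + e_ni)))


g = 0.25  # hypothetical guessing level, e.g. a four-choice item

# A person far below the item: the disturbance lifts the log-odds to logit(g).
e_low = guessing_disturbance(-3.0, 0.0, g)
p_low = success_probability(-3.0, 0.0, e_low)

# A person well above the item: no disturbance is needed.
e_high = guessing_disturbance(2.0, 0.0, g)
p_high = success_probability(2.0, 0.0, e_high)

print(e_low, p_low, e_high, p_high)
```

The disturbance is nonzero only when the undisturbed success probability would fall below $g_{ni}$, so it acts as a floor on the probability of success.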

Table 14-1
Summary of Some Measurement Disturbances

Disturbance Type          Characteristics for (b_n - d_i) + e_ni
calibration noise         e_ni = e_i     e_i sampled from N(0, σ_i^2)
random misfit             e_ni = e_ni    e_ni sampled from N(0, σ_ni^2)
item bias                 e_ni = e_st    e_st some constant
multidimensionality       e_ni = e_nt    corr(b_n, e_nt) varies with t
variable discrimination   e_ni = e_ni    corr(b_n, e_ni) varies with i
guessing                  e_ni = e_ni    e_ni = max[b_n - d_i, log(g_ni/(1 - g_ni))] - (b_n - d_i)

INVESTIGATING DISTURBANCE USING PROX

When the assumption is made that the item parameters are normally distributed, and the mean and the variance of the distribution are known, the PROX estimation equations (Cohen, 1979; Wright & Stone, 1979) provide closed-form estimators for Rasch model ability parameters. These simple equations will be used to deduce the likely effects of disturbance on ability estimates. The analytic findings will then be confirmed by simulations that use maximum likelihood estimation of person parameters.

The disturbance $e_{ni}$ can be introduced either as a modification to the ability of person n or as a modification to the difficulty of item i. In the case of item parameters assumed known and person parameters being estimated, it is the modification of item difficulties that leads to the clearer understanding of the effects of the disturbance.

Throughout the examination of measurement disturbance, $d_i$ will be used to denote the available estimate of difficulty for item i, perhaps stored in an item bank. It is this previously calibrated item difficulty that is used as the basis of subsequent person measurement. These difficulties are not, however, the actual item difficulties for each person n. The actual difficulty of the item for individual n is $d_i - e_{ni}$, and will be denoted $\delta_{ni}$. Depending upon the disturbance that is modelled, $\delta_{ni}$ may or may not vary across individuals.

The actual ability of person n is $\beta_n$. The estimator of $\beta_n$ that uses the estimated (or bank) item difficulties $d_i$ is denoted $b_n$, and the estimator of $\beta_n$ that uses the actual item difficulties $\delta_{ni}$ is denoted $\hat\beta_n$. We call $b_n$ the disturbed estimator and $\hat\beta_n$ the undisturbed estimator. Our aim is to investigate the bias in $b_n$ as an estimator of $\beta_n$, and we tackle this by examining the difference between $b_n$ and $\hat\beta_n$.

It is important to recognize, however, that to call $b_n$ the disturbed estimate and $\hat\beta_n$ the undisturbed estimate does not imply that $\hat\beta_n$ is better than $b_n$. The person parameter estimate can only be a useful measure when there is a stable frame of reference with respect to which it can be interpreted. The undisturbed ability estimate $\hat\beta_n$ does not satisfy this requirement, because its frame of reference is unique to that individual: it depends upon an individually defined set of item difficulties. The disturbed ability estimate $b_n$ may not be useful either, if the difference between $b_n$ and $\hat\beta_n$ makes the existing frame of reference inappropriate for this person.
The issue is not one of choosing between $b_n$ and $\hat\beta_n$ but of translating the disturbances into their possible effects on the validity and accuracy of parameter estimates. When a test is made up of L items with actual difficulties $\delta_{ni}$ that are normally distributed, the PROX formula gives the estimate $\hat\beta_n$ for individual n with proportion-correct score $f_n = r_n / L$ as:

$$\hat\beta_n = \delta_{\cdot} + \sqrt{1 + \frac{\sigma_\delta^2}{2.89}}\;\ln\frac{f_n}{1 - f_n} \qquad (5)$$

where $\delta_{\cdot}$ is the mean of the actual item difficulties $\delta_{ni}$ for person n and $\sigma_\delta^2$ is the variance of these actual item difficulties for person n. The item parameters enter this equation through their mean $\delta_{\cdot}$ and their spread $\sigma_\delta^2$, implying that the effect of the disturbance on the measure can be determined from the effect that the disturbance has upon the mean and dispersion of the item difficulties.

A change in the mean item difficulty adds a constant bias to the measure, which is equal to the change in the mean. An increase in the dispersion of the items produces estimates that are further from the centre of the test. A decrease produces estimates that are nearer the centre of the test. If $d_{\cdot}$ and $\sigma_d^2$ are the mean and variance of the bank item difficulties $d_i$, then an approximation for the difference between the disturbed ability estimate $b_n$, using the bank difficulties $d_i$, and the undisturbed ability estimate $\hat\beta_n$, using the actual difficulties $\delta_{ni}$, for person n is:

$$b_n - \hat\beta_n = (d_{\cdot} - \delta_{\cdot}) + \left[\sqrt{1 + \frac{\sigma_d^2}{2.89}} - \sqrt{1 + \frac{\sigma_\delta^2}{2.89}}\right]\ln\frac{f_n}{1 - f_n} \qquad (6)$$

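If $\hat\beta_n$ is considered an unbiased estimate of the actual ability $\beta_n$, this expression also gives the bias in $b_n$ as an estimator of $\beta_n$. Letting $v_n$ denote the variance ratio $\sigma_\delta^2 / \sigma_d^2$ and $\mu_n = (d_{\cdot} - \delta_{\cdot})$ enables $v_n$ and $\mu_n$ to be used as indices of the magnitude of disturbance. Substituting $\sigma_\delta^2 = v_n \sigma_d^2$ into the above equation, the expression becomes:

$$b_n - \hat\beta_n = \mu_n + \left[\sqrt{1 + \frac{\sigma_d^2}{2.89}} - \sqrt{1 + \frac{v_n \sigma_d^2}{2.89}}\right]\ln\frac{f_n}{1 - f_n} \qquad (7)$$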

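Expression (7) shows that a constant bias, independent of the score $f_n$, is introduced through the difference $\mu_n$. The second term is zero at $f_n = 0.5$ and/or $v_n = 1$ ($\sigma_\delta^2 = \sigma_d^2$). For $v_n \neq 1$, the absolute value of its contribution to the bias increases as the difference between $f_n$ and 0.5 increases. For $v_n < 1$ ($\sigma_d^2 > \sigma_\delta^2$) the bias is away from the centre of the test, and for $v_n > 1$ ($\sigma_d^2 < \sigma_\delta^2$) the bias is towards the centre of the test. The PROX standard errors of the ability estimates $b_n$ and $\hat\beta_n$ are:

$$SE(b_n) = \sqrt{1 + \frac{\sigma_d^2}{2.89}}\left[\frac{1}{L f_n (1 - f_n)}\right]^{1/2}, \qquad SE(\hat\beta_n) = \sqrt{1 + \frac{\sigma_\delta^2}{2.89}}\left[\frac{1}{L f_n (1 - f_n)}\right]^{1/2} \qquad (8)$$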

The items only enter these expressions through their dispersion. The larger the item dispersion, the larger the standard error of the parameter estimate. The mean squared error (MSE) of a bank estimate $b_n$ about the actual ability $\beta_n$, based on the estimated difficulties, is:

$$\mathrm{MSE}(b_n) = \mathrm{var}(b_n) + (b_n - \beta_n)^2 \qquad (9)$$

When the estimated difficulties are used to estimate an individual's ability, it is var($b_n$) that is reported as the error variance for the ability estimate, but it is the MSE expressed in (9) that gives the actual variation in $b_n$ about $\beta_n$. The difference between (9) and the modelled variation var($b_n$) is due to the bias, $b_n - \beta_n$. The ratio of MSE($b_n$) to var($b_n$):

$$\frac{\mathrm{MSE}(b_n)}{\mathrm{var}(b_n)} = 1 + \frac{(b_n - \beta_n)^2}{\mathrm{var}(b_n)} \qquad (10)$$

is the factor by which the sampling variation of $b_n$ about $\beta_n$ exceeds the error variances that would be reported on the basis of estimated difficulties alone. Expression (10) shows that modelled standard errors that are reported on the basis of the bank difficulties $d_i$ will underestimate the mean squared error in the bank estimates $b_n$. The increased uncertainty is due to the bias in the bank estimates. The bias causes a variation of $b_n$ about $\beta_n$ that is not symmetric. If the bank estimated item difficulties, $d_i$, have greater variation than the actual item difficulties, $\delta_{ni}$, then $(b_n - \beta_n)$ will be skewed away from the center of the test. But if the bank item difficulties have less variation than the actual item difficulties, then $(b_n - \beta_n)$ will be skewed toward the center of the test.¹
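The PROX relationships in this section can be explored numerically. The Python sketch below assumes the usual PROX expansion factor $\sqrt{1 + \sigma^2/2.89}$ applied to the log-odds of the proportion-correct score; the function names and the example values (30 items, $f_n = 0.8$, $\sigma_d^2 = 1$, $v_n = 1.5$, $\mu_n = 0$) are hypothetical and chosen only to show the direction of the bias.

```python
import math


def prox_ability(f, mean_difficulty, var_difficulty):
    """PROX ability estimate from a proportion-correct score f."""
    expansion = math.sqrt(1.0 + var_difficulty / 2.89)
    return mean_difficulty + expansion * math.log(f / (1.0 - f))


def prox_standard_error(f, n_items, var_difficulty):
    """PROX standard error; items enter only through their variance."""
    expansion = math.sqrt(1.0 + var_difficulty / 2.89)
    return expansion * math.sqrt(1.0 / (n_items * f * (1.0 - f)))


def prox_bias(f, mu_n, v_n, var_bank):
    """Disturbed minus undisturbed estimate, written in terms of mu_n and v_n."""
    x_bank = math.sqrt(1.0 + var_bank / 2.89)
    x_actual = math.sqrt(1.0 + v_n * var_bank / 2.89)
    return mu_n + (x_bank - x_actual) * math.log(f / (1.0 - f))


# Hypothetical test: bank difficulties with mean 0 and variance 1; actual
# difficulties with the same mean (mu_n = 0) but 1.5 times the variance.
n_items, f, var_bank, v_n, mu_n = 30, 0.8, 1.0, 1.5, 0.0

b = prox_ability(f, 0.0, var_bank)                # disturbed estimate
beta_hat = prox_ability(f, 0.0, v_n * var_bank)   # undisturbed estimate
bias = prox_bias(f, mu_n, v_n, var_bank)

se_b = prox_standard_error(f, n_items, var_bank)
mse_ratio = 1.0 + bias ** 2 / se_b ** 2           # MSE(b_n) / var(b_n)

print(b, beta_hat, bias, se_b, mse_ratio)
```

With $v_n > 1$ and a score above 0.5 the bias is negative, so the disturbed estimate sits nearer the centre of the test than the undisturbed one, and the ratio of MSE to reported variance exceeds one.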

EFFECTS OF THE DISTURBANCE ON ^ n A N D u n Both PROX and UFORM indicate that under the assumptions of this study, the bias in an estimate for person n, based upon disturbed items, depends only on the mean of the disturbances \xn, and the change in the dispersion of the items as expressed through the ratio vn. Because these two indices capture the effect of all of the disturbances in the class t h a t we are considering, they need not be examined separately. Describing each of the disturbances in terms of its effect on fxn and vn will be sufficient to specify the effects of that disturbance. Since vn captures the direction of the bias, it is important to consider the circumstances under which vn is likely to be greater than one and vn is likely to be less than one. Begin by recalling that the bank item difficulties are denoted by di? i = 1, L and the actual item difficulties for individual n are 8f = d{ e ni ; then the variance of the actual item difficulties for person n is: (ig = a* + &i - 2ade,

(11)

1 A similar analysis using UFORM estimation equations (which assume uniform rather than normal distributions for the item and person parameters) indicate the same bias patterns.

WHEN DOES MISFIT MAKE A DIFFERENCE?


where $\sigma_d^2$ is the variance of the bank difficulties, $\sigma_e^2$ is the variance of the disturbances, and $\sigma_{de}$ is the covariance between the bank difficulties and the disturbances. Therefore,

$$\nu_n = \frac{\sigma_{\delta_n}^2}{\sigma_d^2} = 1 + \frac{\sigma_e^2 - 2\sigma_{de}}{\sigma_d^2}.$$

When item difficulties and disturbances are uncorrelated, the variance of the actual item difficulties for person $n$ exceeds the variance of the bank item difficulties, and $\nu_n = 1 + (\sigma_e^2/\sigma_d^2)$ is always greater than one. This leads to person parameter estimates that are biased toward the centre of the test. When disturbances and difficulties are negatively correlated, $\nu_n$ will again be greater than one and there will be a bias toward the centre of the test. If the disturbances and item difficulties are positively correlated, however, and their covariance is more than half the variance of the disturbances, then the estimated abilities will be biased away from the centre of the test. That is, to get $\nu_n < 1$ requires:

$$\sigma_e^2 - 2\sigma_{de} < 0,$$

which requires

$$\sigma_{de} > \tfrac{1}{2}\sigma_e^2.$$
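The variance argument above can be checked numerically. The following Python sketch is an illustration only: the 100 equally spaced bank difficulties and the alternating disturbance of plus or minus one logit are assumptions chosen to give a deterministic, nearly uncorrelated example, and the function names are hypothetical. Taking the actual difficulty as bank difficulty minus disturbance, as in Equation (11), it computes the variance ratio directly and from its parts, and evaluates the condition under which the ratio would fall below one.

```python
# Illustrative check of the variance decomposition behind nu_n.
# Assumptions (not from the source): 100 equally spaced bank difficulties
# and a deterministic alternating disturbance of +/-1 logit.

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Population covariance (divides by n); cov(x, x) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def nu_from_parts(d, e):
    """nu_n = 1 + (var(e) - 2 cov(d, e)) / var(d)."""
    return 1.0 + (cov(e, e) - 2.0 * cov(d, e)) / cov(d, d)

def nu_direct(d, e):
    """nu_n as var(actual) / var(bank), with actual = bank - disturbance."""
    delta = [di - ei for di, ei in zip(d, e)]
    return cov(delta, delta) / cov(d, d)

L = 100
bank = [-3.0 + 6.0 * i / (L - 1) for i in range(L)]
noise = [1.0 if i % 2 == 0 else -1.0 for i in range(L)]  # nearly uncorrelated with bank

nu_a = nu_from_parts(bank, noise)
nu_b = nu_direct(bank, noise)
print(round(nu_a, 4), round(nu_b, 4))

# nu_n < 1 would need cov(d, e) > var(e) / 2; that is not the case here.
needs_positive_cov = cov(bank, noise) > cov(noise, noise) / 2.0
print(needs_positive_cov)
```

Because the disturbance in this example is almost uncorrelated with the bank difficulties, the ratio comes out above one, which is the case the text associates with a bias toward the centre of the test.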

Unless item partitioning is done in terms of item difficulty, calibration noise, random misfit, item bias, multidimensionality, and variation in discrimination are disturbances that are uncorrelated with the item difficulties. This is shown by considering each of the named disturbances in turn and describing its effect on $\mu_n$ and $\nu_n$.

Calibration Noise

In the case of item calibration error, it is assumed that a test is formed by selecting items from a previously calibrated bank. The existing bank item estimates are used as the basis for estimating individuals' abilities, on the assumption that they can be used as though they were the item difficulties. But this is not the case; the actual difficulty of item $i$ for person $n$ is $\delta_{ni} = d_i - e_i$, where $e_i$ is a random disturbance sampled from a normal distribution with mean zero and variance $\sigma_e^2$, the error variance of the item bank estimate. This kind of disturbance does not affect the mean item difficulty,

ADAMS & WRIGHT

because the expected values of $\delta_{ni}$ and $d_i$ are equal. It does, however, change the spread of the item difficulties. The disturbances and the estimated item difficulties are uncorrelated, so the variance of the actual item difficulties (assuming independence among items) is:

$$\sigma_{\delta_n}^2 = \sigma_d^2 + \sigma_e^2,$$

so that

$$\nu_n = 1 + \frac{\sigma_e^2}{\sigma_d^2},$$

and since $\nu_n$ must be greater than one, the disturbed estimates will be biased toward the center of the test. When calibrated items with estimated error variances are used, an estimate of $\nu_n$ is available, and either PROX or UFORM can be used to approximate the bias and the mean squared error. As will be shown later, calibration noise leads to a negligible bias, and it is likely that other disturbances will contribute more to bias and mean squared error than does calibration noise.

Random Misfit

In random misfit the disturbance is unbiased and independent of both person ability and item difficulty. This gives:

This disturbance will cause $\nu_n$ to be greater than one, and the disturbed estimates will be biased toward the center of the test.

Item Bias

If person parameters are estimated on the basis of known item parameters, then estimates for people who are not in the bias group will not be affected. Taking the simple case of one set of biased items, and one set of people for which the items are biased, the item bias model gives $\delta_{ni} = d_i$ for $i \notin \Omega_t$ or $n \notin \Theta_s$, and $\delta_{ni} = d_i + e_{st}$ for $i \in \Omega_t$ and $n \in \Theta_s$. Letting $M$ be the number of items in $\Omega_t$ gives:

Item bias causes a constant bias $\mu_n$, the magnitude of which depends on the size of the constant effect, $e_{st}$, and the proportion, $M/L$, of items that are biased. Because $\nu_n > 1$, the disturbed person parameter estimates will also be biased towards $\mu_n$ by an amount related to the size of the effect, $e_{st}^2/\sigma_d^2$, and the proportion, $M/L$, of items that are biased.

To illustrate the way item bias works, Table 14-2 shows PROX estimates of bias at various levels of ability, $\beta$, when the magnitude of the disturbance, $e_{st}$, is -0.25, -0.5, -1.0, and -2.0 and ten, twenty, and forty percent of the items are biased on a 100-item test with item difficulties ranging from -3 to 3 logits. The table shows that the disturbed ability estimates are always less than the undisturbed estimates. This negative bias increases with the magnitude of the disturbance and the number of disturbed items. Because the disturbance causes an increase in the test variance, there is a bias toward the center of the test that is added to the constant bias. This means that, relatively speaking, the bias for more able students is greater than the bias for less able students.

The practical consequences of the biases shown in Table 14-2 can be assessed by comparing their magnitude with the minimum measurement error that a 100-item test could provide, namely, $2/\sqrt{100} = 0.20$. For modest bias (less than 0.5 logits for less than 20 percent of the items) the bias is less than 0.10, which is half of one standard error. However, for more severe item bias the estimation bias can exceed two or three standard errors.

Item Multidimensionality

Multidimensionality is similar to item bias, differing in only two minor ways. First, it applies to all persons, not just a subset; second, the disturbance is not a fixed effect for each subset of items but is correlated with ability. In the case of a two-dimensional set of items with $M$ items on a second dimension, this gives:

Table 14-2. Bias in PROX Ability Estimates Caused by Item Bias

(Columns under each value of e_st are for M/L = .10, .20, and .40.)

            e_st = -0.25           e_st = -0.5            e_st = -1.0            e_st = -2.0
Ability (β)   .10    .20    .40      .10    .20    .40      .10    .20    .40      .10    .20    .40
    1.5     -.026  -.051  -.103    -.053  -.106  -.211    -.111  -.222  -.444    -.244  -.487  -.968
    1.0     -.025  -.051  -.102    -.052  -.104  -.207    -.107  -.215  -.429    -.229  -.457  -.910
    0.5     -.025  -.050  -.101    -.051  -.102  -.204    -.104  -.207  -.414    -.214  -.428  -.854
    0.0     -.025  -.050  -.100    -.050  -.100  -.200    -.100  -.200  -.400    -.200  -.400  -.800
   -0.5     -.025  -.050  -.099    -.049  -.098  -.196    -.096  -.193  -.386    -.186  -.372  -.746
   -1.0     -.025  -.049  -.098    -.048  -.096  -.193    -.093  -.185  -.371    -.171  -.343  -.690
   -1.5     -.024  -.049  -.097    -.047  -.094  -.189    -.089  -.178  -.356    -.156  -.313  -.632
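The comparison of the Table 14-2 biases with the minimum standard error can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only: it takes the minimum standard error of an L-item test as 2/√L, as in the text, and expresses a few biases from the zero-ability row of Table 14-2 in standard-error units. The function names and the selection of table entries are assumptions made for the example.

```python
import math

def min_standard_error(n_items):
    """Smallest standard error an n_items test can provide: 2 / sqrt(L)."""
    return 2.0 / math.sqrt(n_items)

def bias_in_se_units(bias, n_items):
    """Express a bias (in logits) as a multiple of the minimum standard error."""
    return abs(bias) / min_standard_error(n_items)

# Biases at ability 0.0 taken from Table 14-2: (e_st, M/L) -> bias in logits.
table_row_ability_zero = {
    (-0.25, 0.10): -0.025,
    (-0.50, 0.20): -0.100,
    (-1.00, 0.40): -0.400,
    (-2.00, 0.40): -0.800,
}

se = min_standard_error(100)
for (effect, proportion), bias in table_row_ability_zero.items():
    ratio = bias_in_se_units(bias, 100)
    print(f"e_st={effect:5.2f}  M/L={proportion:.2f}  bias={bias:6.3f}  bias/SE={ratio:.2f}")
```

Read this way, the mildest condition is a small fraction of one standard error, while the most severe condition is several standard errors.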


These equations are the same as those for item bias. The difference between the two disturbances is that now the bias occurs for all persons, not just a bias subgroup, and, because $e_{nt}$ varies, $\mu_n$ and $\nu_n$ vary across people. Because the underlying dimensions of most tests are positively correlated, $\mu_n$ and $\nu_n$ will tend to be larger for people with extreme abilities. That is, the biasing will be most pronounced for the people with the highest and lowest ability estimates. This implies that multidimensionality (and nonuniform item bias) can be advantageous to the least able students. If $\mu_n$ is zero or positive, then less able students will get disturbed ability estimates that are biased upwards. A negative $\mu_n$ may lead to a bias either up or down, depending upon the relative magnitudes of $\mu_n$ and $\nu_n$ and the score of the individual. For small negative $\mu_n$ it is possible for an individual of low ability to have a disturbed estimate that is positively biased. If a test is biased against a set of individuals, the measures of the less able individuals in that set will always be biased upwards relative to the measures of the more able individuals in that set.

Variations in Item Discrimination

For variations in discrimination, the disturbance varies across all items and all persons. Assuming that the test contains a set of items with a symmetric distribution of discriminations that are independent of item difficulty, then $\mu_n$, the mean disturbance for any person, will be zero, and $\nu_n$ will be given by:

Under these conditions variations in discrimination will behave exactly like random misfit. If the distribution of discriminations is not symmetric, then there will also be a bias due to $\mu_n$, which will no longer be zero.

Over- and Underdetermined Response Patterns

None of the disturbances considered so far directly addresses one misfit that is routinely identified in Rasch measurement: variation in person discrimination. The examination of individual response patterns often indicates that the hard items proved harder for the individual than the $d_i$ indicate, while the easy items proved easier; or that the hard items proved easier for the individual than the $d_i$ indicate, and the easy items proved harder. In the first case the probability that the individual will succeed on easy items is greater than expected, and the probability that they will fail on hard items is greater than expected. In traditional test analyses such a result would be regarded as desirable and might be labelled as high person discrimination. But there is also a sense in which this response pattern is overdetermined by the estimated item difficulties. From the perspective of objective measurement, an overdetermined response pattern is not a desirable outcome. The requirement of invariant item difficulties has been violated. As with the other disturbances, the actual difficulties of the items are unique to that individual.

In the second case, the probability that the individual will succeed on easy items is less than expected, and the probability that they will succeed on hard items is greater than expected. Here the pattern of responses would be underdetermined by the estimated difficulties of the items. In traditional item analyses such a response pattern would correspond to a poorly discriminating person. Again the underdetermined response pattern indicates that the measurement requirement of invariance has been violated: the actual item difficulties are unique to the individual.

Under- and overdetermined response patterns are caused by disturbances that affect both $\mu_n$ and $\nu_n$. For overdetermined response patterns, $\delta_{ni} > d_i$ if $d_i > \beta_n$ but $\delta_{ni} < d_i$ if $d_i < \beta_n$. As a result $\mu_n$ may take any value, depending upon the number of items above and below the individual's ability. A uniform distribution of items centered at zero makes $\mu_n > 0$ if $\beta_n > 0$ and $\mu_n < 0$ when $\beta_n < 0$. Thus overdetermined response patterns are likely to cause a bias away from the center of the test. The overdetermined response pattern also indicates that the variance of the actual item difficulties is greater than the variance of the calibrated item difficulties. That is, $\nu_n > 1$, which causes a bias toward the center of the test. The net result of these two competing biases will depend upon the distribution of the items and the magnitude of the disturbance. For underdetermined response patterns the above argument is reversed: $\mu_n$ causes a bias toward the center of the test and $\nu_n$ is most likely to cause a bias away from the center of the test. Again the net result will depend upon the distribution of the items and the magnitude of the disturbance.
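The signs described above can be illustrated with a small numerical example. The Python sketch below is a hedged illustration, not the authors' procedure: it assumes 100 equally spaced bank difficulties between -3 and 3, a constant shift of 0.5 logits, and the convention that the disturbance is the bank difficulty minus the actual difficulty. The function names are hypothetical.

```python
# Illustrative sketch: effect of over- and underdetermined disturbances on
# mu_n (mean disturbance) and nu_n (variance ratio) for one person.
# Assumptions (not from the source): equally spaced bank difficulties,
# a 0.5-logit shift, and disturbance = bank - actual.

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def disturbed_difficulties(bank, beta, shift, pattern):
    """Actual difficulties for an 'over' or 'under' determined pattern."""
    sign = 1.0 if pattern == "over" else -1.0
    actual = []
    for d in bank:
        if d > beta:
            actual.append(d + sign * shift)  # 'over': hard items become harder
        elif d < beta:
            actual.append(d - sign * shift)  # 'over': easy items become easier
        else:
            actual.append(d)
    return actual

def mu_and_nu(bank, actual):
    """Mean disturbance and ratio of actual to bank difficulty variance."""
    disturbance = [d - a for d, a in zip(bank, actual)]
    return mean(disturbance), variance(actual) / variance(bank)

bank = [-3.0 + 6.0 * i / 99 for i in range(100)]
for pattern in ("over", "under"):
    for beta in (-1.0, 1.0):
        mu, nu = mu_and_nu(bank, disturbed_difficulties(bank, beta, 0.5, pattern))
        print(f"{pattern:5s} beta={beta:+.1f}  mu_n={mu:+.3f}  nu_n={nu:.3f}")
```

Under these assumptions the overdetermined pattern gives a mean disturbance with the same sign as the ability and a variance ratio above one, and the underdetermined pattern reverses both, which is the pair of competing effects the paragraph describes.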


SIMULATIONS

The above discussion is based on the expected pattern of bias indicated by PROX (and UFORM). In what follows these expectations are compared with the results of a set of simulations that use maximum likelihood to estimate abilities on the basis of estimated item difficulties. Three classes of disturbances, identified by the type of response patterns they produce, are considered.

The first class is the noisy response pattern disturbances. They occur when random disturbances that are uncorrelated with the item difficulties are introduced while generating the response patterns for simulated individuals. Response patterns of this type emulate calibration noise, random misfit, item bias, multidimensionality, and variation in item discrimination. Noisy response patterns are also underdetermined response patterns: the introduction of the random disturbance means that the bank difficulties do not determine the response pattern as well as expected.

The second class of disturbances produces systematically underdetermined response patterns. When $\beta_n$ is the generating ability of person $n$, a disturbance is introduced that makes the items for which $d_i < \beta_n$ more difficult but items for which $d_i > \beta_n$ less difficult.

The third class of disturbances produces overdetermined response patterns. If $\beta_n$ is the generating ability of person $n$, a disturbance is introduced that makes the items for which $d_i < \beta_n$ less difficult but items for which $d_i > \beta_n$ more difficult.

One normally distributed sample of 500 persons was generated for all simulations. This sample was constructed by applying an inverse normal transformation to a set of numbers uniformly spaced between 0 and 1 and then scaling them so that the abilities ranged from -3.3 to 3.3 logits. These abilities were fixed throughout all simulations and are referred to as the generating abilities, $\beta$. The mean ability was zero and the standard deviation was 1.3.
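The construction of the generating abilities can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the source does not say which 500 equally spaced points between 0 and 1 were used, so the midpoints (i + 0.5)/500 are assumed here, and the function name is hypothetical. The standard deviation that results depends on that choice and need not match the value reported in the text.

```python
from statistics import NormalDist, mean, pstdev

def generating_abilities(n_persons=500, half_range=3.3):
    """Inverse-normal scores of equally spaced probabilities, rescaled
    so that the abilities run from -half_range to +half_range.

    Assumption: the probabilities are (i + 0.5) / n_persons.
    """
    normal = NormalDist()
    z = [normal.inv_cdf((i + 0.5) / n_persons) for i in range(n_persons)]
    scale = half_range / max(z)
    return [scale * value for value in z]

abilities = generating_abilities()
print(len(abilities), round(min(abilities), 2), round(max(abilities), 2))
print(round(mean(abilities), 3), round(pstdev(abilities), 3))
```

Fixing the abilities in this deterministic way, rather than sampling them, means every simulation condition is evaluated at exactly the same 500 ability values.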
Tests of 40, 60, and 100 items were constructed with difficulties uniformly spaced between -3.0 and 3.0 logits. These difficulties were used as the bank difficulties, $d_i$, and were fixed throughout the simulations. Tests shorter than 40 items were not considered because they introduce floor and ceiling effects sufficient to confound the study of bias due to item disturbance alone.

In the process of the simulation each bank item difficulty, $d_i$, had a disturbance added to it to construct an actual difficulty, $\delta_{ni}$, for each individual. The combination of $\beta_n$ and $\delta_{ni}$ was used to simulate item responses and produce test scores. Each test score was then transformed into two logit abilities: $\hat{\beta}_n$, based on the actual $\delta$s, and $b_n$, based on the bank $d$s. This process was replicated 100 times for each sample, producing 100 pairs of $\hat{\beta}$ and $b$ for each of the 500 generating abilities.

For the noisy response patterns, five different disturbance standard deviations and three different disturbance means were used. Three standard deviations were the same for all items; two were related to item difficulty. For the fixed standard deviations a random deviate was sampled from a normal distribution with mean 0, 0.25, or 0.5 and standard deviation 0.5, 0.75, or 1.0 and added to each bank difficulty, $d_i$. A unique disturbance was added to each item, but that disturbance remained constant across persons and replications.² The three standard deviations and three means combine to give nine different disturbances.

In an attempt to emulate the effect of calibration noise more closely, two standard deviations that varied with item difficulty were also considered. Here the disturbance for item $i$ was created by randomly selecting a deviate from a normal distribution with zero mean and variance given by:

This is an estimate of the asymptotic error variance for item parameter estimates made with a calibrating sample of size $N$, under the assumption that all members of the calibrating sample had $\beta = 0$ and that there was no covariance between item parameter estimates. This will overestimate the error variance for the hardest and easiest items and underestimate the error variance in the middle of the test. These two noise disturbances were generated with zero means and are denoted N10 and N100.

To produce response patterns that were underdetermined, either 0.25 or 0.5 logits was subtracted from the difficulty of an item when $d_i$ was greater than $\beta_n$ and either 0.25 or 0.5 logits was added to the difficulty when $d_i$ was less than $\beta_n$. These are denoted U25 and U50. To produce overdetermined response patterns, either 0.25 or 0.5 logits was added to the difficulty of an item when $d_i$ was greater than $\beta_n$ and either 0.25 or 0.5 logits was subtracted when $d_i$ was less than $\beta_n$. These are denoted O25 and O50.

² A disturbance can be generated for each item and held constant across persons and replications, or disturbances can be generated for each item-person combination and held constant across replications, or a unique disturbance can be generated for every item-person-replication combination. It was found that all three choices produced the same results. The first choice cuts computing time dramatically, and it was adopted.
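One replication of the simulation described above (disturb the bank difficulties, simulate a response pattern, then estimate ability twice from the same score) can be sketched in Python. Everything here is a hedged illustration rather than the authors' program: the Newton-Raphson routine is a standard way to obtain a maximum likelihood estimate from a raw score, the calibration-noise variance is an assumed form, 1/(N p (1 - p)) with p evaluated for a person at ability zero, standing in for the expression in the text, and all function names and the random seed are hypothetical.

```python
import math
import random

def p_success(beta, delta):
    """Rasch probability of success for ability beta on difficulty delta."""
    return 1.0 / (1.0 + math.exp(delta - beta))

def calibration_error_variance(d, n_calibrating):
    """Assumed calibration-noise variance: 1 / (N p (1 - p)), where p is the
    success probability of a person at beta = 0 on an item of difficulty d."""
    p = p_success(0.0, d)
    return 1.0 / (n_calibrating * p * (1.0 - p))

def ml_ability(score, difficulties, tol=1e-8, max_iter=100):
    """Maximum likelihood ability for a raw score (Newton-Raphson).
    Returns None for zero or perfect scores, which have no finite estimate."""
    n_items = len(difficulties)
    if score <= 0 or score >= n_items:
        return None
    beta = math.log(score / (n_items - score))  # starting value
    for _ in range(max_iter):
        probs = [p_success(beta, d) for d in difficulties]
        step = (score - sum(probs)) / sum(p * (1.0 - p) for p in probs)
        step = max(-1.0, min(1.0, step))  # damp large Newton steps
        beta += step
        if abs(step) < tol:
            break
    return beta

def simulate_once(beta, bank, actual, rng):
    """Simulate one response pattern from the actual difficulties and return
    (estimate based on actual difficulties, estimate based on bank difficulties)."""
    score = sum(1 for delta in actual if rng.random() < p_success(beta, delta))
    return ml_ability(score, actual), ml_ability(score, bank)

rng = random.Random(1)  # arbitrary seed, for reproducibility only
bank = [-3.0 + 6.0 * i / 99 for i in range(100)]
# One fixed disturbance per item (mean 0, sd 1), added to the bank difficulty.
actual = [d + rng.gauss(0.0, 1.0) for d in bank]
beta_hat, b = simulate_once(1.5, bank, actual, rng)
print(beta_hat, b)
```

Repeating `simulate_once` for each generating ability and each replication, and discarding replications that return `None`, would give the pairs of estimates that the study summarizes.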


For the underdetermined response pattern an additional condition was applied that prevented the actual difficulty $\delta_{ni}$ from becoming greater than $\beta_n$ if $d_i$ was less than $\beta_n$, or from becoming less than $\beta_n$ if $d_i$ was greater than $\beta_n$; in these cases $\delta_{ni}$ was set equal to $\beta_n$. This leads to 15 different disturbances. The five noisy response pattern disturbances with zero means were used with tests of 40, 60, and 100 items, and the remaining ten disturbances were applied with the 100-item tests only.
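The under- and overdetermined disturbances, including the extra condition for the underdetermined case, can be sketched as a single Python function. This is an illustrative reading of the description above, with a hypothetical function name and an arbitrary example ability of 0.4 logits.

```python
def actual_difficulty(d, beta, shift, pattern):
    """Actual difficulty of one item for a person of ability beta.

    pattern 'U' moves the difficulty toward beta, but never past it;
    pattern 'O' moves the difficulty away from beta.
    """
    if d == beta:
        return d
    if pattern == "U":
        if d > beta:
            return max(beta, d - shift)  # easier, but not below beta
        return min(beta, d + shift)      # harder, but not above beta
    if pattern == "O":
        return d + shift if d > beta else d - shift
    raise ValueError("pattern must be 'U' or 'O'")

bank = [-3.0 + 6.0 * i / 99 for i in range(100)]
beta = 0.4  # arbitrary example ability
u50 = [actual_difficulty(d, beta, 0.5, "U") for d in bank]
o50 = [actual_difficulty(d, beta, 0.5, "O") for d in bank]
clamped = sum(1 for value in u50 if value == beta)
print(clamped)
```

Items whose bank difficulty lies within the shift of the person's ability are the ones set equal to that ability, so the number of such items grows with the size of the shift.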

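RESULTS

In analyzing the results of the simulations, it was the bias caused by the disturbance that proved to be of most interest. For every simulation two bias indices were saved for each of the 500 sample elements:

$$\text{BIAS-}\beta = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{\beta}_{nr} - \beta_n\right)$$

and

$$\text{BIAS-}b = \frac{1}{R}\sum_{r=1}^{R}\left(b_{nr} - \beta_n\right).$$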

In both of these indices the denominator $R$ is the number of successful replications for generating ability $\beta$, and the summation is taken over the successful replications.³ The first index, BIAS-$\beta$, provides a frame of reference for the bias in $b$, since it is the bias that would be expected if there were no disturbance. BIAS-$\beta$ is the bias in the estimates $\hat{\beta}_n$ of $\beta_n$. Each $\hat{\beta}_n$ is estimated using the actual item difficulties for person $n$, $\delta_{ni}$, so it does not involve the disturbance and is expected to have a mean close to zero for all ability levels and test lengths. The second index, BIAS-b, gives the bias in the disturbed estimates $b$ when used as estimates of the actual ability $\beta$. Most of the analysis is concerned with the magnitude of BIAS-b and the way it is related to test length, disturbance, and $\beta$.

³ A successful replication is one for which a finite ability estimate was obtainable for $\beta$.

The first step in the analysis was to compare the 500 disturbed and undisturbed ability estimates with each other and with the true abilities. When such a comparison was undertaken, remarkable agreement was found between the two parameter estimates and the actual parameter values. Figure 14-1 compares the disturbed and undisturbed estimates with the true abilities for tests of 100 items and disturbance $\sigma = 1$, $\mu = 0$. This figure shows a worst-case scenario in the comparison of disturbed and undisturbed estimates when the disturbance has a zero mean. The test is long, so both sets of estimates have negligible standard errors, and any discrepancy between the estimates, and between the disturbed estimates and the generating values, is almost entirely due to the disturbance. The relationship between the 500 disturbed estimates and the generating parameter values for tests of length 100 and disturbance $\sigma = 1$, $\mu = 0$ is examined more closely in Figure 14-2. The solid line

Figure 14-1. Comparison of the 500 generating abilities $\beta$, undisturbed estimates $\hat{\beta}$, and disturbed estimates $b$, for 100-item tests with disturbance $\sigma = 1$, $\mu = 0$.


Figure 14-2. Plot of 500 generating abilities $\beta$ against disturbed estimates $b$ for 100-item tests with disturbance $\sigma = 1$, $\mu = 0$. (Horizontal axis: generating ability, $\beta$.)

corresponds to where the points would lie if the disturbed estimates and true values were equal. The bias in $b$ shown in Figure 14-1 is toward the center of the test. This is consistent with the predictions based on PROX and UFORM, which showed that the larger variance in the actual item difficulties ($\nu_n > 1$) results in parameter estimates that are biased toward the center of the test. Figures 14-1 and 14-2 highlight that, even with a substantial amount of disturbance that leads to noisy response patterns, the bias in the person parameter estimation is


quite small. This is consistent with a result reported by Wright and Douglas (1977), who found that, for test designs encountered in practice, random disturbances with standard deviations as large as one lead to negligible distortions in ability estimates.

Table 14-3 shows the mean, standard deviation, and range of the bias indices BIAS-$\beta$ and BIAS-b for all of the disturbances. Results for BIAS-$\beta$ are reported only once for each test length, because they are independent of the disturbance. The differences between the BIAS-b results and the BIAS-$\beta$ results are due to the disturbances. When there is no disturbance, BIAS-b is equal to BIAS-$\beta$. The results shown in Table 14-3 follow those that were predicted on the basis of PROX and UFORM. In each case the mean of the bias is close to the mean of the disturbance, and, for the noisy response patterns, the range and standard deviation of the bias increase with the standard deviation of the disturbance. The range and standard deviation of the bias decrease as test length increases; this was not predicted by PROX or UFORM. Both the PROX and UFORM bias formulae are independent of test length. That the decrease reflects sampling variation rather than the disturbance is supported by the same decrease in the range and standard deviation of the bias when no disturbance is added.

Table 14-3. Mean, Standard Deviation, and Range of Bias in Undisturbed and Disturbed Parameter Estimates

                             40 items                60 items                100 items
Disturbance               mean     sd   range     mean     sd   range     mean     sd   range
Undisturbed (BIAS-β)
  σ = 0.00, μ = 0.00     -.001   .045    .332     .000   .036    .250     .000   .029    .225
Noisy response patterns (BIAS-b)
  σ = 0.50, μ = 0.00     -.002   .049    .303    -.005   .040    .262    -.001   .036    .275
  σ = 0.50, μ = 0.25                                                      .250   .036    .164
  σ = 0.50, μ = 0.50                                                      .492   .039    .228
  σ = 0.75, μ = 0.00      .022   .071    .496     .025   .057    .437    -.007   .060    .398
  σ = 0.75, μ = 0.25                                                      .241   .059    .353
  σ = 0.75, μ = 0.50                                                      .479   .066    .396
  σ = 1.00, μ = 0.00      .001   .108    .687    -.010   .089    .537    -.016   .094    .549
  σ = 1.00, μ = 0.25                                                      .231   .097    .635
  σ = 1.00, μ = 0.50                                                      .533   .085    .529
  N100                   -.003   .046    .280     .001   .034    .227     .001   .029    .182
  N10                    -.027   .070    .529    -.011   .064    .486    -.003   .048    .402
Overdetermined response patterns (BIAS-b)
  O25                                                                     .000   .070    .652
  O50                                                                     .003   .130   1.229
Underdetermined response patterns (BIAS-b)
  U25                                                                     .002   .081    .568
  U50                                                                     .000   .155   1.011

For the noisy response patterns, the largest bias reported in Table 14-3 is approximately 0.34 logits (half of the range of 0.687) for 40-item tests with disturbance $\sigma = 1$. This maximum bias is no more than the standard error of person parameter estimates typical of 40-item tests. For the 100-item tests with disturbance $\sigma = 1$ the largest bias is approximately 0.27 logits, and this, too, is no more than the standard errors typical of 100-item tests. In fact, because the maximum biases occur at the extremes of the test, the modelled standard error of a parameter estimate always exceeds the corresponding bias by a considerable amount. It is also clear from Table 14-3 that BIAS-b for disturbance $\sigma = 0.5$ is not much larger than BIAS-$\beta$. In fact, for $\sigma < 0.5$ the bias is not discernible. Similarly, for tests calibrated on samples as small as 100, item parameter uncertainty does not cause any discernible bias in the person parameter estimates: the standard deviations of BIAS-b and BIAS-$\beta$ are almost identical, and the range of BIAS-b is slightly less than the range of BIAS-$\beta$. Even items calibrated on as few as 10 people appear to give person parameter estimates that are not excessively biased. The standard deviations and ranges for the over- and underdetermined response patterns, however, do show a substantial variation in the parameter estimates.

Figures 14-3, 14-4, and 14-5 show how the bias, BIAS-b, in the disturbed estimates varies with the generating values of $\beta$. Each plot contains 500 points, one for each ability, showing the mean bias from the 100 replications in the simulation. The plots also include a smooth curve, which is the expected bias based on PROX calculations.
The PROX estimates were produced by using the generating ability, $\beta$, and the bank difficulties, $d$, to produce expected relative scores for each individual. The variances of the bank difficulties and the disturbance-generating parameters were then used to estimate $\mu_n$ and $\nu_n$, and the bias was calculated. The PROX and UFORM estimates of bias due to disturbance are very similar. The PROX results are presented because, under the PROX assumptions, the bias is determined by the effect of the disturbance on the mean and variance of the item difficulties, an effect that can be easily derived. Under the UFORM assumptions the bias is determined by the effect of the disturbance on the range, an effect that cannot be easily derived. For each plot in Figure 14-3 there is strong agreement between the

Figure 14-3. Bias in disturbed ability estimates, BIAS-b, plotted against the generating ability, $\beta$, for a variety of noisy response patterns with mean disturbance zero. (Horizontal axis of each panel: generating ability, $\beta$.)

PROX estimate of the expected bias and the observed bias. As predicted, the noisy response pattern disturbances shown in Figure 14-3 cause ability estimates to be biased toward the centre of the test. The amount of the bias toward the centre of the test is larger for the larger disturbances. A change in test length alters the sampling variation but not the magnitude of the bias.

Figure 14-4. Bias in disturbed ability estimates, BIAS-b, plotted against the generating ability, $\beta$, for the calibration noise disturbances and noisy response patterns with nonzero mean disturbances. (Horizontal axis of each panel: generating ability, $\beta$.)

Figure 14-4 shows the bias for the N10 and N100 disturbances and for the noisy response pattern disturbances that have nonzero means. For N10 and N100 the bias is toward the centre of the test, as predicted. But the PROX estimates are not as accurate as they are for the constant-variance disturbances. For N10 it appears that in the middle of the test there is less bias than predicted by PROX. This may occur because the disturbance is smallest in the middle of the test, and the

Figure 14-5. Bias in disturbed ability estimates, BIAS-b, plotted against the generating ability, $\beta$, for over- and underdetermined response patterns. (Horizontal axis of each panel: generating ability, $\beta$.)

items in the middle of the test carry the most information for the estimation of abilities in the middle of the test. The N100 plot shows negligible bias, and the two nonzero-mean plots show the effect of the constant bias and the bias that varies with ability.

Figure 14-5 shows the bias caused by the under- and overdetermined response patterns. The overdetermined response patterns show a bias away from the center of the test, and the underdetermined response patterns show a bias toward the center of the test. There is a substantial range in the middle of the test, however, in which none of these disturbances leads to a bias larger than 0.2 logits.


SUMMARY AND CONCLUSION

The framework for describing measurement disturbance that was developed in this study shows that a substantial range of misfit to the Rasch model can be expressed as interactions between individual group membership and item group membership. This makes it possible to use the PROX estimation equations to determine the effects of all varieties of measurement disturbance on person parameter estimates. PROX estimates of abilities depend only upon the mean difficulty of the items and their variance. If the effect of the disturbance on the mean and variance of the item difficulties is available, then the PROX estimation equations can be used to produce ability estimates based on both the bank and the actual difficulties, and the simulations confirm that PROX estimates do accurately predict the nature and magnitude of the effects of disturbance on person parameter estimates.

Further, it was shown that the disturbance manifests itself as a bias in the parameter estimates. That is, disturbance leads to systematic errors in the estimation of individual person parameters. When the disturbance changes the mean of the item difficulties, there is a constant bias equal to the change in the mean. When the disturbance alters the variance of the item difficulties, a bias either toward or away from the centre of the test results. When the response pattern is noisy or underdetermined, the likely bias is toward the centre of the test. When the response pattern is overdetermined, the likely bias is away from the centre of the test.

In practice, of course, the effect that the disturbance has upon the mean, $\mu$, and variance, $\nu$, of the item difficulties is unknown. A further line of research, examining the relationship between fit statistics and $\mu$ and $\nu$, may be profitable. If fit statistics could be found that are systematically related to $\mu$ and $\nu$, then estimates of the bias caused by the disturbance would become available.
At this point we are only able to use fit statistics to indicate the likely direction of the bias. Previous research (Smith, 1982) has indicated that the t-fit statistics used by Wright and Stone (1979) are most sensitive to variations in discrimination. A positive t-statistic for a person generally corresponds to an underdetermined response pattern, while a negative t-statistic corresponds to an overdetermined response pattern. Pending further investigation, this suggests that positive t-statistics correspond to person parameter estimates biased toward the center of the test, and negative t-statistics correspond to person parameter estimates biased away from the center of the test.

While it may be possible to use indices of fit to obtain estimates of this bias, it is not recommended that the bias estimates be used as a correction to estimated parameters. The disturbed ability estimate is based on a standard set of item difficulties known, by virtue of a misfit indicator, not to be appropriate for the individual. The unbiased estimate is based on a nonstandard, slightly different set of item difficulties unique to that individual. Neither $b_n$ nor $\hat{\beta}_n$ qualifies as a best measure.

REFERENCES

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick, Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley.
Cohen, L. (1979). Approximate expression for parameter estimates in the Rasch model. British Journal of Mathematical and Statistical Psychology, 32, 113-120.
Gustafsson, J-E. (1980). Testing and obtaining fit of data to the Rasch model. British Journal of Mathematical and Statistical Psychology, 33, 205-233.
Martin-Löf, P. (1974). The notion of redundancy and its use as a quantitative measure of discrepancy between a statistical hypothesis and a set of o
Mellenbergh, G.J. (1982). Contingency table methods for assessing item bias. Journal of Educational Statistics, 7, 105-118.
Smith, R.M. (1982). Detecting measurement disturbances with the Rasch model. Unpublished doctoral dissertation, University of Chicago.
r tests (expanded ed.). Chicago: The University of Chicago Press. (Original work published 1960)
Thurstone, L.L. (1928). Attitudes can be measured. American Journal of Sociology, 33, 529-554.
van den Wollenberg, A.L. (1988). Testing a latent trait model. In R. Langeheine & J. Rost (Eds.), Latent trait models and latent class models. New York: Plenum Press.
van den Wollenberg, A.L., Wierda, F.W., & Jansen, P.G.W. (1988). Consistency of Rasch model parameter estimation: A simulation study. Applied Psychological Measurement, 12, 307-313.
Wright, B.D. (1989). Deducing the Rasch model from Thurstone's requirement that item comparisons be sample free. Rasch Measurement Special Interest Group Newsletter, 3(1), pp. 9-10.
Wright, B.D., & Douglas, G.A. (1977). Best procedures for sample-free item analysis. Applied Psychological Measurement, 1, 281-295.
w Chicago: MESA Press.

chapter 15

Comparing Attitude Across Different Cultures: Two Quantitative Approaches to Construct Validity

Mark Wilson
University of California, Berkeley

Use of an instrument across national and cultural groups raises issues concerning the validity of any comparison between the groups, because respondents in the groups may have understood the questions they are being asked in different ways according to their group membership. These differences could arise in translation, or could arise from cognitive and affective differences between cultural groups. For attitude scales and other types of instruments in the affective domain, the most usual process used to ensure that a scale's meaning has not drifted too far in the process of translation is to back-translate. That is, each translated item is translated back into the original language, and a panel of experts is consulted to ensure that the original and the back-translation are sufficiently close.

International comparisons of ability and attitude are an important part of the arsenal of techniques available to comparative education. For a comprehensive discussion of this issue with respect to ability tests, see Irvine and Berry (1988). In this chapter the focus is on the affective domain. An example is provided by the studies of the International Project for the Evaluation of Educational Achievement (IEA) comparing various national educational systems. These studies make regular use of attitude assessment instruments whose qualities within different cultures and languages must be considered constant to a certain degree in order to make such comparisons valid (e.g., Husen, 1967; Linden, 1977; Walker, 1976). In these studies the comparability of results across languages is examined exclusively by using back-translation to establish content validity (Messick, 1989). In this chapter, examples are given of techniques that could be used in addition to back-translation and that would allow one to examine the construct validity (Messick, 1989) of the instrument across cultures. Note that the point of this chapter is not to criticize the process of back-translation, but rather to raise the question of whether back-translation alone is sufficient, and to describe some additional techniques that may be useful.

When one wishes to compare a particular attitude across contexts, such as across different nationalities or languages, it is necessary first to establish that the instrument being used to assess the attitude means the same in the different contexts; otherwise the interpretation of differences becomes intractable. The question boils down to: What must remain the same in order to detect meaningful differences? This problem has been known to psychometricians as the issue of item parameter invariance:

What are needed are item parameters that remain approximately invariant from group to group. Since this need arises because of variations among groups of examinees in the abilities or traits measured by the items, any solution must necessarily involve a consideration of the relation between these abilities or traits and examinee performance on the items. The problem of dealing with the relationship between the examinee's mental traits and his performance is not a simple one, but we cannot avoid it.
It lies at the heart of mental test theory, which is, after all, fundamentally concerned with inferring the examinee's mental traits from his responses to test items. (Lord & Novick, 1968, p. 354)

What is different about the present study is that I am applying this same logic, which has traditionally been applied to ability and achievement tests, to instruments in the affective domain.

One problem with the application of construct validity concepts in the affective domain is that instruments are frequently developed seemingly without an explicit reference to any underlying structure that might be used as the basis for the examination of construct validity. This should not be seen so much as a problem with the use of construct validity as a criterion, but rather as a problem with the construction of such instruments. Messick (1989) has argued strongly


that construct validity is the foremost criterion for establishing validity. Any instrument developed without some sort of construct validation should be considered as having dubious quality. In fact, most instruments in the affective domain are scored by simply adding up the weights for the (usually Likert-type) responses for each item on a given subscale. This implicitly assumes that the underlying construct for the subscale is a unidimensional latent trait. Moreover, Andersen (1973) has shown that where the weights are integers (which is true in the great majority of cases), the resulting scores can be sufficient statistics only where the underlying model is a Rasch model (here I am referring to the class of models defined by Rasch that have specific objectivity (Rasch, 1960/1980), not just the simple logistic model). Thus, one can argue that even in cases where the instrument developers have ignored all reference to construct validity, the use of weighted scores betrays an unstated reliance on a unidimensional structure, and the use of integer weights betrays an unstated reliance on fit to a Rasch model.

Consider first an instrument that is intended to measure just one unidimensional attitude. What is needed to ensure that measurements within a certain context can be compared to measurements within a new context is that (a) the instrument is also unidimensional in the new context (consistent dimensionality), and (b) it is sufficiently consistent in its parametric structure (consistent construct validity).

For an instrument composed of several subscales, the situation can be somewhat more complicated. Such multiscale instruments are being increasingly used in social sciences research, for example, in the learning environment literature (Epstein & McPartland, 1976; Fraser & Fisher, 1983; Moos, 1978; Walberg, 1979).
If the theoretical basis of the instrument specifies no particular a priori multidimensional relationship between the subscales, then assessment of consistency involves only the replication of the above steps with each of the subscales. But if some particular relationship among the latent traits represented by the subscales is postulated as an inherent part of the construct, then, after confirming measurement stability for each subscale, the stability of the multidimensional relationship among the subscales must also be confirmed.

In this study, I will consider two different approaches to the study of measurement consistency: a structural equation modelling (SEM) approach and an item response theory (IRT) approach. Below I describe the two approaches, and this is followed by an example that illustrates the methods. For ease of understanding by an English-speaking audience, the example makes a comparison across two different English-speaking cultures, rather than across two different language groups.


THE TWO APPROACHES

Structural Equation Modelling Approach

In what follows, I describe statistics that result when one applies the unweighted least squares estimation procedure to polychoric correlation matrices rather than the more common maximum likelihood estimation applied to product moment correlation matrices. This is done because the assumption of normality of observed variables is unlikely to be fulfilled (even approximately) by Likert-style items such as those most commonly used in the affective domain (Joreskog & Sorbom, 1986). Using polychoric correlation coefficients assumes that the distribution of the observed categories on the Likert scale results from the discretization of an unobservable (latent) normally distributed variable into the categories by cutting the latent variable at successive thresholds. This has the advantage that the assumptions on which the analysis is based are more like what one might expect to be the case, but it also has the disadvantage that neither standard errors nor chi-square fit tests are available.

Unidimensionality. The unidimensionality of each scale within a multiscale instrument may be assessed using a congeneric test model approach (Joreskog, 1971). Each subscale is first fitted to a one-factor LISREL model (Joreskog & Sorbom, 1986) with one loading (the first) fixed to unity to provide a scale. Fit to a unidimensional model can be assessed by a number of measures, among them the squared multiple correlation (SMC) between each item and the underlying factor, the coefficient of determination (D), and the root mean square residual (RMR). The SMC for item i on a subscale is

$$\mathrm{SMC}_i = 1 - \frac{\hat{\theta}_{ii}}{s_{ii}},$$

where $\hat{\theta}_{ii}$ is the modelled error variance and $s_{ii}$ is the observed variance for item i (Joreskog & Sorbom, 1986, p. I.37). The coefficient of determination, D, is

$$D = 1 - \frac{|\hat{\Theta}|}{|S|},$$

where | | is the matrix determinant function, $\hat{\Theta}$ is the covariance matrix of the modelled errors, and S is the covariance matrix of the observed variables. It varies between 0 and 1 and is a generalized measure of reliability for the whole model (Joreskog & Sorbom, 1986, p. I.37).
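The two indices just defined can be sketched in a few lines of code. This is only an illustration under stated assumptions: the loadings are hypothetical, the observed matrix S is constructed so that a one-factor model fits it exactly, and the helper names (`determinant`, `squared_multiple_correlations`, `coefficient_of_determination`) are invented here and are not LISREL routines.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def squared_multiple_correlations(theta, s):
    """SMC_i = 1 - theta_ii / s_ii for each item i."""
    return [1.0 - theta[i][i] / s[i][i] for i in range(len(s))]


def coefficient_of_determination(theta, s):
    """D = 1 - |Theta| / |S|."""
    return 1.0 - determinant(theta) / determinant(s)


# Hypothetical standardized loadings for a four-item subscale (not from the chapter).
loadings = [0.7, 0.6, 0.8, 0.5]
k = len(loadings)
# Diagonal error covariance matrix, chosen so that each item has unit variance.
theta = [[(1.0 - loadings[i] ** 2) if i == j else 0.0 for j in range(k)]
         for i in range(k)]
# Assumption: the one-factor model fits exactly, so S = lambda * lambda' + Theta.
s = [[loadings[i] * loadings[j] + theta[i][j] for j in range(k)]
     for i in range(k)]

smc = squared_multiple_correlations(theta, s)
d = coefficient_of_determination(theta, s)
print([round(v, 2) for v in smc], round(d, 3))
```

In this artificial exact-fit case each SMC equals the squared loading, which makes the sketch easy to check by hand; with real data S is the observed (here polychoric) matrix and the error variances come from the fitted model.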


The RMR is

$$\mathrm{RMR} = \left[ \frac{2}{k(k+1)} \sum_{i=1}^{k} \sum_{j=1}^{i} (s_{ij} - \hat{\sigma}_{ij})^2 \right]^{1/2},$$

where k is the number of items, $s_{ij}$ are the elements of S, and $\hat{\sigma}_{ij}$ are the elements of $\hat{\Sigma}$, the fitted variance-covariance matrix (Joreskog & Sorbom, 1986, p. I.4). It is an indicator of a typical element among the variance and covariance residuals, and must be interpreted with respect to the size of the elements of S. The maximum of the residuals (MR) is also useful for getting a feel for the worst-case variation around the RMR. Fit can also be judged by using Joreskog's goodness of fit index (GFI) as an overall measure of fit (Joreskog & Sorbom, 1986, pp. I.40, IV.17): The goodness of fit index is

$$\mathrm{GFI} = 1 - \frac{\mathrm{tr}[(S - \hat{\Sigma})^2]}{\mathrm{tr}(S^2)},$$

where tr is the matrix trace function. GFI is a measure of the relative amount of variance and covariance accounted for by the model (i.e., the closer to 1 the more variance accounted for by the model), and it is independent of sample size and relatively robust against departures from normality. It can be used to compare the fit of models for different data, but its distributional properties are unknown, so there is no standard with which to compare it.

Item parameter invariance. Item parameter invariance under the SEM approach is assessed by testing the fit of a one-factor solution with factor loadings constrained to be the same across both samples (Munck, 1979). The same indices of fit are used here as were used for checking unidimensionality.

Item Response Theory Approach

In this discussion, I will use a particular form of IRT model drawn from the Rasch family of measurement models (Wright & Masters, 1982), and designed specifically for ordered polytomous data. The advantages of using Rasch models when the data have the appropriate characteristics have been noted elsewhere (Masters & Wright, 1984), and I will not pursue the issue here. The partial credit model (Masters, 1982) takes as its basic observation the number of steps that a person has made


beyond the lowest performance level, or, in a rating situation, the number of steps that the object has been judged to be above the lowest level. Note that the number of ordered levels in each item need not be constant across all items, although it is constant in many cases in attitude measurement because of the predominance of Likert-type response alternatives. Consequently, the basic parameter is the step difficulty within each item. For an item with m + 1 ordered levels from 0 to m, the probability of person i with ability $\beta_i$ being observed in category n in item j ($y_{ij} = n$) is:

$$P(y_{ij} = n) = \frac{\exp \sum_{k=1}^{n} (\beta_i - \delta_{jk})}{1 + \sum_{h=1}^{m} \exp \sum_{k=1}^{h} (\beta_i - \delta_{jk})}$$

for n = 1, 2, . . . , m, where $\delta_{jk}$ is the difficulty parameter for step k in item j; and

$$P(y_{ij} = 0) = \frac{1}{1 + \sum_{h=1}^{m} \exp \sum_{k=1}^{h} (\beta_i - \delta_{jk})}.$$

The local independence assumption used in the partial credit model is that, conditional on step difficulties, the interaction between a person and an item is independent between items. The analyses were conducted using the Quest computer program (Adams & Khoo, 1991).

Model-data fit. In order to use the partial credit model to compare subscales across different groups one must first check for adequate model-data fit. Only if the model fits in both contexts can meaningful comparisons be made. Note that this criterion is more demanding than the criterion of unidimensionality used in the SEM approach, as items may misfit due to other problems besides multidimensionality. Model fit is assessed here using two indices. The "Person Fit t" gives an indication of the statistical significance of misfit for persons. With no misfit, it is distributed approximately as a normal distribution with mean 0 and standard deviation 1 (Wright & Masters, 1982). A "mean square" statistic is used to assess the degree of item misfit (Wright & Masters, 1982). It has an expected value of 1, and a rule of thumb that I will use here is that the effect is strong when the statistic is outside the range (.75, 1.3).
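The partial credit category probabilities and an item mean square can be sketched as follows. This is a minimal illustration, not the Quest implementation: the function names and the example step difficulties are hypothetical, and the mean square shown is the unweighted form (the average squared standardized residual), which is assumed here because the text does not say which form was used.

```python
import math


def pcm_probabilities(beta, deltas):
    """Category probabilities for one person on one item under the partial credit model.

    beta: person location in logits.
    deltas: step difficulties [delta_1, ..., delta_m] in logits.
    Returns [P(0), P(1), ..., P(m)].
    """
    cumulative = [0.0]  # the empty sum for category 0
    for delta in deltas:
        cumulative.append(cumulative[-1] + (beta - delta))
    top = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def unweighted_mean_square(responses, betas, deltas):
    """Mean of squared standardized residuals for one item across persons."""
    z_squared = []
    for x, beta in zip(responses, betas):
        probs = pcm_probabilities(beta, deltas)
        expected = sum(k * p for k, p in enumerate(probs))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        z_squared.append((x - expected) ** 2 / variance)
    return sum(z_squared) / len(z_squared)


# Hypothetical step difficulties for a four-category Likert item (not from the chapter).
example_deltas = [-2.0, 0.0, 2.0]
example_probs = pcm_probabilities(0.5, example_deltas)
print([round(p, 3) for p in example_probs])
```

A useful property to keep in mind when reading the category characteristic curves later in the chapter is that the ratio of adjacent category probabilities, P(k)/P(k-1), equals exp(beta - delta_k), so two adjacent categories are equally likely exactly where the person location equals the corresponding step difficulty.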


Item parameter invariance. Item statistics can be compared to check for equivalence of item location using the item step difficulty estimates. These comparisons can be routinized by using the standardized difference between the parameters:

$$z = \frac{\hat{\delta}'_{jk} - \hat{\delta}_{jk}}{\left( \sigma'^{2}_{jk} + \sigma^{2}_{jk} \right)^{1/2}},$$

where the primed estimates refer to those from one sample, the unprimed estimates refer to those from the other sample, and the $\sigma$s are the appropriate standard errors in each case (Wright & Masters, 1982, p. 115). Note that this requirement is not the same as requiring equal item marginals, even though the item marginals are sufficient statistics for the item parameters. Rather, the requirement is that the item steps have the same relative difficulty for the two groups.

This comparison is far more detailed than that for the SEM approach. A comparison at a similar level of detail would be to compare the overall results for the persons from the two analyses. One way to do this is to use the difficulty estimates from one of the groups to estimate person abilities in the other, and then examine the overall fit of the new person estimates. This gives some indication of the overall impact of the altered difficulty estimates on person estimates.
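The standardized difference described above is simple to compute once the step estimates and their standard errors are available. The sketch below is illustrative only: the function names are invented here, and the numerical inputs, in particular the standard errors, are hypothetical placeholders rather than values from this study.

```python
import math


def standardized_difference(d_prime, d, se_prime, se):
    """z = (d' - d) / sqrt(se'^2 + se^2) for one step parameter."""
    return (d_prime - d) / math.sqrt(se_prime ** 2 + se ** 2)


def flag_discrepant(z, critical=1.96):
    """True when |z| exceeds the conventional two-sided cutoff."""
    return abs(z) > critical


# Hypothetical step estimates (logits) and standard errors for one item step.
z = standardized_difference(d_prime=-1.2, d=-0.4, se_prime=0.25, se=0.10)
print(round(z, 2), flag_discrepant(z))
```

Because the denominator combines both standard errors, the same logit difference produces a smaller z when either estimate is imprecise, as happens for steps at the extremes of a scale or in a small sample.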

AN EXAMPLE

In this study, data were collected using a multiscale quality of life instrument across Australian and American student samples. Instead of translating a scale from one language to another, a translation was made from one dialect of English to another. This short-cut is taken to allow study of this phenomenon in a monolingual setting, and to make the alterations completely comprehensible to an English-speaking audience. The results are used to illustrate the procedures described above.

THE SAMPLES

Two data sets are used as the basis for comparison:

1. (AUS sample): a sample of 1,368 Year-9 Victorian high school students collected as part of a study of school staffing policies (Ainley, Reed, & Miller, 1986);
2. (USA sample): a sample of 138 Year-9 high school students from Louisiana based on a stratification of the State's school system, identified as potential drop-outs, assessed before a summer-school intervention program called Louisiana State Youth Opportunities Unlimited (LSYOU; Shapiro, 1987).

Note that both samples are stratified samples of the schools in each state, with random choice of appropriate students within schools.

THE INSTRUMENT

The QSL Construct

The Quality of School Life instrument (QSL; Williams & Batten, 1981) was designed as an application of Burt's conception of quality of life assessment (Burt, Fischer, & Christman, 1979) and Spady and Mitchell's model of schooling (Mitchell & Spady, 1977). Spady and Mitchell have developed a model of schooling based on sociological theory. Drawing on the work of Talcott Parsons, they have postulated a four-part system that links societal expectations to school structures and hence to student experiences. In the four domains of societal expectations schools are expected to:

1. facilitate and certify the achievement of technical competence; in effect, to certify that individuals are capable of doing tasks valued in the society at large;
2. encourage and enhance personal development in the form of physical, emotional, and intellectual skills and abilities;
3. generate and support social integration among individuals across cultural groups and within institutions; and
4. nurture and guide each student's sense of social responsibility for the consequences of his or her own personal actions, and for the character and quality of the groups to which the student belongs. (Mitchell & Spady, 1977, p. 9)

Williams and Batten (1981) used exploratory factor analysis to explore the multidimensional nature of the QSL instrument, and then the hypothesized structure was tested using confirmatory procedures. It consists of six subscales, two general ones and four more specific ones matching the Spady-Mitchell domains. The two general scales are: (a) general affect (GA), which taps the nonspecific feelings of happiness and well-being associated with school; and (b) negative affect (NA), which taps the reverse of GA: depression, loneliness, and restlessness. The four domains are:

1. Status (ST), which assesses a student's feelings of worth in the social context;
2. Identity (ID), which assesses a student's feelings of growth as an individual;
3. Opportunity (OP), which assesses a student's feelings of increasing adequacy to meet society's standards; and
4. Teachers (TE), which assesses a student's feelings towards his or her teachers.

The original scheme was for a fifth domain, Adventure (AD), in place of the TE domain, to assess personal academic development. In the initial studies it was found that the items developed for this domain did not adequately identify it as a distinct factor, but that all items that involved teachers loaded on a distinct factor.

There are 27 items in the scale, with four or five for each subscale. The items are all statements with the stem "School is a place where . . . " followed by a specific predicate such as " . . . I feel happy." The response format is Likert-style with four categories: Strongly Disagree (scored 0), Disagree (1), Agree (2), and Strongly Agree (3). All are scored positively except for the NA subscale, which is scored negatively. Williams and Batten (1981) give complete details of the instrument.

Content Validity

Use of this instrument in different geographic, cultural, and developmental contexts raises issues of the ability of the respondents to understand the original intent of the instrument's authors because of differences in idiom and word meanings. Consequently, when use of the instrument was considered in an American context, each item was examined for appropriateness. A panel of local experts, the teachers who were involved in the LSYOU summer training program, was consulted to recommend alterations in the wording of the items for the USA sample. A complete record of the changes for the whole instrument is given in Figure 1 in Wilson (1988). In this chapter I will concentrate on three of the subscales, and the changes for those are given in Table 15-1.

Table 15-1
Comparison of the Two Item Sets

Item   Text for AUS Sample                  Text for USA Sample
NA1    I feel depressed                     SAME
NA2    I feel lonely                        SAME
NA3    I get upset                          SAME
NA4    I feel restless                      SAME
TE1    teachers help me to do my best       SAME
TE2    teachers listen to what I say        teachers take notice of me in class
TE3    teachers are fair and just           SAME
TE4    teachers treat me fairly in class    SAME
GA1    I really like to go each day         I really like to be each day
GA2    I get enjoyment from being there     I feel happy
GA3    I feel proud to be a student         SAME
GA4    I like learning                      I am interested in the work we do

The Negative Affect scale was found to require no

adjustments: It is an example of what one might consider an otherwise unattainable ideal in instrument translation. The Teachers scale was chosen to represent a scale that needed only minor adjustment. The General Affect scale was the one most affected by the adjustments. Although it is hard to put a limit to just how much a scale might be altered in translation, this was chosen as a representative of a heavily adjusted scale.

Reliability

The reliability of the QSL subscales has been examined in a number of circumstances. In the original study, Williams and Batten (1981) found that the reliabilities varied from .76 (for the NA scale) to .91 (for the ST scale), with a mean of .83. Wilson (1988) reported similar ranges and means for a high school and a university sample from Louisiana using the altered instrument. These are quite respectable reliabilities for instruments in the affective domain.

RESULTS

SEM Approach

Unidimensionality and item parameter invariance. Consider first the results of the LISREL analyses for the Negative Affect scale given in the top panel of Table 15-2.

Table 15-2
LISREL Unidimensionality Results

                           Loadings(a)         SMC(a)
Scale            Design    1   2   3   4     1   2   3   4    cD(a)  GFI(a)  MSR    MR
Negative Affect  USA      49  43  50  56    55  45  54  31     82     98    .026  .049
                 AUS      60  51  57  53    51  38  40  38     74     99    .011  .021
Teachers         USA      65  47  86  73    70  58  91  79     99     99    .008  .019
                 AUS      64  69  66  41    59  58  67  47     85     99    .007  .013
General Affect   USA      64  70  49  49    83  89  55  60     98     94    .029  .088
                 AUS      73  70  53  58    63  66  46  51     85     99    .013  .031

(a) The numbers under Loadings, SMC, cD, and GFI are to be divided by 100.

These are the results for a one-

factor solution in each of the samples. The factor loadings in the unconstrained design for the two samples are evidently not identical: the largest difference is .49 to .60, and the smallest is .56 to .53. These result in squared multiple correlations (SMC) for each of the four items as given in the next four columns, and a total coefficient of determination (cD) in the next column. The coefficients indicate that three of the four items, and the set as a whole, are better fit by the one-factor model in the USA sample than in the AUS sample. The next column gives the goodness-of-fit index (GFI), which seems to indicate a reasonably good fit for the one-factor design. The last two columns give the mean squared residual (MSR) of the fitted covariance matrix and the maximum residual (MR). The entries in the covariance matrices for both USA and AUS vary from about .2 to about .8, and this is typical for all the covariance matrices analyzed here. Hence, the residuals confirm the picture presented by the GFI, that the one-factor solution in each sample is a reasonable one.

Now compare the results for the one-factor solution with those for the one-factor solution with loadings constrained to be the same in both samples, given in the top panel of Table 15-3. By assumption the loadings are identical. Compared to the results in Table 15-2, the constrained loadings give somewhat different SMCs for the USA sample and identical ones for the AUS sample. This ought to be expected, as the common loadings are much closer to the original AUS loadings than to the USA loadings, which is due to the larger sample size for the AUS sample. Although the SMCs for the USA sample have changed, they are not systematically larger or smaller. The overall picture contained in the cD and GFI columns shows no interpretable change at all between the two designs. The RMR column shows that the overall change in the residuals has been largely confined to the USA sample. The MR column reveals that while the residuals remain small on the whole, the maxima have inflated by a factor of three for USA and five for AUS. Overall, the picture for Negative Affect looks pretty good: The differences in fit brought about by constraining the solution to have the same loadings are not particularly important according to the summary statistics. The maximum residuals give a somewhat more detailed, and perhaps somewhat more disturbing, comparison.

Table 15-3
LISREL Parameter Invariance Results

                           Loadings(a)         SMC(a)
Scale            Sample    1   2   3   4     1   2   3   4    cD(a)  GFI(a)  MSR    MR
Negative Affect  USA      59  51  56  53    59  47  52  26     82     97    .067  .158
                 AUS       "   "   "   "    51  38  40  38     74     99    .013  .102
Teachers         USA      63  66  69  45    71  51  88  59     97     72    .191  .315
                 AUS       "   "   "   "    57  56  69  48     85     97    .027  .039
General Affect   USA      72  71  53  57    84  86  55  61     98     92    .070  .111
                 AUS       "   "   "   "    63  66  45  51     84     99    .015  .038

(a) The numbers under Loadings, SMC, cD, and GFI are to be divided by 100.

The above analyses were then repeated for the Teachers and General Affect scales. The description of the results detailed in Tables 15-2 and 15-3 is abbreviated, as the format is the same as above. Only the most interesting differences are commented upon. For the Teachers scale, a somewhat better fit to the one-factor design (compared to Negative Affect) is not maintained for the constrained loadings design: GFI for USA drops from .99 to .72, the RMR inflates by a factor of over twenty, and the MR is clearly unacceptable. For the General Affect scale the situation for the Negative Affect scale is repeated, with almost identical general measures of fit for the two designs, and a somewhat greater degree of change revealed by the residuals.

IRT Approach

Model-data fit. The mean and standard deviation of the Person Fit t statistics are recorded in Table 15-4.

Table 15-4
Partial Credit Person Fit Statistics

                      AUS             USA          AUS Anchored
Scale             Mean    SD      Mean    SD      Mean    SD
Negative Affect   -.19   1.11     -.17   1.19     -.21   1.38
Teachers          -.24   1.14     -.22   1.14      .16   1.16
General Affect     .22   1.42     -.15   1.27      .07   1.31

These show that across both

subscale and sample, the variability in the statistics is slightly greater than would be expected, and that the values are somewhat more negative than we might expect. These negative values are sometimes associated with a situation where the items within a subscale have some degree of local dependence. The mean squares for the items are given in Table 15-5. The items in the Teachers scale immediately stand out as fitting poorly in the USA sample: items TE2 and TE3 both fall outside the guidelines. The remainder do not show such poor fit.

Table 15-5
Partial Credit Item Fit Statistics (Mean Squares)

                            Item
Scale            Sample     1      2      3      4
Negative Affect  USA      1.00   1.09    .98    .95
                 AUS       .95   1.07   1.01    .98
Teachers         USA       .94   1.52    .61    .77
                 AUS      1.15    .96    .87    .94
General Affect   USA       .92    .81   1.08   1.08
                 AUS      1.12   1.01    .89    .91

Item parameter invariance. The results of the analyses within the two samples are given in Table 15-6. For each scale, given as separate panels of the table, the results are organized by the partial credit step parameters. For each item within a scale, there are three sets of columns, one for each step parameter. Within those three columns, the first gives the USA estimate of that step parameter (in logits) and the second column gives the AUS estimate. The third column gives the standardized difference (z).

Table 15-6
Partial Credit Item Parameter Estimates

             First Step              Second Step             Third Step
Item      USA     AUS      z      USA     AUS      z      USA     AUS      z
Negative Affect
1        -2.28   -1.69   -1.77    0.18    0.53   -1.15    2.17    1.08    1.62
2        -1.37   -0.61   -2.70    0.69    1.05    1.03    2.32    0.81    1.85
3        -2.08   -1.40   -2.17    0.19    0.34    0.51    1.68    1.01    1.20
4        -2.12   -2.22    0.31   -0.32    0.12   -1.59    0.94    1.01   -0.17
Teachers
1        -3.00   -2.36   -0.70   -1.48    0.84    1.63    2.60    2.29    0.85
2        -2.85   -1.34   -2.59   -1.23   -0.31   -2.40    3.89    3.16    1.68
3        -2.06   -1.88    0.41    0.18   -0.39    0.58    3.82    3.63    0.44
4        -2.77   -3.39    1.07   -0.76   -1.84    2.82    4.02    3.26    1.70
General Affect
1        -2.82   -1.08   -3.01    0.03    0.37    0.97    3.99    3.35    1.54
2        -2.43   -1.92   -0.97   -0.11    0.31    0.57    4.17    2.78    3.33
3        -3.53   -2.51    1.25   -1.04    0.91    0.33    2.75    2.60    0.43
4        -4.23   -2.70   -1.41   -0.34   -1.40    2.91    3.56    1.73    4.90

Larger absolute values of

the standardized difference indicate greater discrepancy between the two samples, and, while the theoretical distribution of these statistics is only approximately known, values greater than 1.96 or less than -1.96 are generally accepted to indicate a problem (Wright & Masters, 1982, p. 115). It should be noted that relatively larger differences in logits between two estimates at the extremes of the scales may result in smaller standardized differences than in the middle because of the U-shaped standard error distribution for partial credit. Even though the TE scale showed a poor fit in the previous analyses, it will be included in the analyses at this next stage for illustrative purposes.

Looking at the results for the Negative Affect scale in the first panel of Table 15-6, one finds two standardized differences less than -1.96, both for step one, in items NA2 and NA3. The count for the Teachers scale is three: two less than -1.96 in item TE2 and one greater than 1.96 in item TE4. For General Affect there are four: one each in items GA1 and GA3, and two in item GA4. Rather than examine each of the discrepant items in detail, three representative items will be examined and illustrated below.

First, consider an item that shows little or no difference between the samples: item TE3, "Teachers are fair and just." The estimated category characteristic curves for the AUS sample are illustrated in Figure 15-1, and those for the USA sample are illustrated in Figure 15-2.

Figure 15-1. Probability of responses for item TE3 in the AUS sample. (Horizontal axis: Attitude to Teachers, in logits.)

Figure 15-2. Probability of responses for item TE3 in the USA sample. (Horizontal axis: Attitude to Teachers, in logits.)

The

figures give the probability of responding with each of the Likert-style responses indicated in the body of the figure, at increasing locations along the latent trait. For example, in Figure 15-2, a student located at -4.00 logits would be predicted to respond with "Strongly Disagree" (SD) with probability approximately .90, and "Disagree" (D) approximately .10, but the others with vanishing probability. At the upper end of the scale, a sample member located at 4.00 logits would be predicted to respond "Strongly Agree" with probability approximately .60, and "Agree" approximately .40, but the rest hardly at all. The sample members are, of course, located at positions estimated for each score. These did not alter noticeably between the two samples (a consistent pattern for all three scales), so the locations on the latent trait are indicated only by logit values in order to clarify the figures. Clearly, there would be no interpretable differences between the samples with regard to item TE3.

Second, consider an item with just one discrepancy between the samples: item TE4, "Teachers treat me fairly in class." The estimates for the AUS sample are illustrated in Figure 15-3, and those for the USA sample are illustrated in Figure 15-4.

Figure 15-3. Probability of responses for item TE4 in the AUS sample.

Figure 15-4. Probability of responses for item TE4 in the USA sample.

Although the standardized difference indicates a significant discrepancy only for the second step

parameter, the figures show that this results in noticeable differences for all the transitions. For instance, at -4.00 logits, the "Strongly Disagree" to "Disagree" (SD to D) ratio is approximately .65/.32 = 2.03 for the AUS sample, but is approximately .83/.16 = 5.19 for the USA sample. Similarly, at 4.00 logits, the SA to A ratio is approximately .7/.3 = 2.33 in the AUS sample, but is approximately .6/.4 = 1.5 in the USA sample. Looking overall, for a person at the same latent trait value in both samples, the discrepancy indicates that it is relatively easier for an AUS sample member at a particular location to give a positive response to the item than a USA sample member at the same location. The shapes of the curves are relatively unchanged, indicating that a simple translation of, say, .80 (which is the average discrepancy in logits) would bring the two sets of estimates into alignment. We might consider this a "consistent" difference.

Third, consider the item that is most discrepant between the samples: item GA4, "I like learning," for the AUS sample (Figure 15-5) and "I am interested in the work we do" in the USA sample (Figure 15-6). Here, although it is somewhat easier for the Australian sample to give a positive response, a simple shift in location does not suffice to make the curves even approximately equal.

Figure 15-5. Probability of responses for item GA4 in the AUS sample.

Figure 15-6. Probability of responses for item GA4 in the USA sample.

The Australian sample has

shown much greater proclivity to give more extreme responses closer to the middle of the probability location. For example, while a member of the Australian sample who is located at the point where D and A are equally likely (the intersection of the second curve and the 0.50 probability line) would have to change in attitude by 3.00 logits to move to the point at which A and SA are equally likely, a similarly located member of the USA sample would have to change by 4.00 logits. We might consider this an "inconsistent" difference. The USA item estimates were also used to anchor a second analysis of the AUS sample. The resulting overall fit statistics from this are shown in the column headed "AUS Anchored" in Table 15-4. Neither means nor standard deviations differ in any large extent for any of the subscales. This shows t h a t the differences in the item estimates for the two samples, although making statistically significant and interpretable differences for the items, do not seem to be having any great impact on the person estimates. We should not be too surprised at this, as the sufficient statistics for the students are the same under both sets of item estimates.
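The way adjacent response categories trade places along the latent trait can be sketched numerically. The following is a minimal sketch, assuming the usual form of the partial credit model (Masters, 1982); the function name `pcm_probs` and all step values are hypothetical and are not the estimates behind Figures 15-3 to 15-6. Under this form, two adjacent categories are equally likely exactly at the corresponding step parameter, so the distance between the D/A and A/SA crossing points equals the difference between the second and third step parameters.

```python
import math

def pcm_probs(theta, steps):
    """Category probabilities under the partial credit model.

    theta is a location in logits; steps holds hypothetical step
    parameters d_1..d_m for one item (category 0 has no step).
    """
    exponents = [0.0]                      # empty sum for category 0
    for d in steps:
        exponents.append(exponents[-1] + (theta - d))
    top = max(exponents)                   # subtract the max for stability
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical four-category items (SD, D, A, SA); not the chapter's estimates.
steps_narrow = [-2.0, -0.5, 2.5]           # second and third steps 3.0 logits apart
steps_wide = [-2.0, -0.5, 3.5]             # second and third steps 4.0 logits apart

for steps in (steps_narrow, steps_wide):
    _, d, a, _ = pcm_probs(steps[1], steps)    # D and A are equally likely here
    _, _, a2, sa = pcm_probs(steps[2], steps)  # A and SA are equally likely here
    print(steps[2] - steps[1], round(d - a, 12), round(a2 - sa, 12))
```

The two invented items differ only in the third step, which mirrors the 3.00 versus 4.00 logit contrast described for item GA4 without reproducing its actual parameters.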

DISCUSSION OF RESULTS FROM EXAMPLE

The two approaches have resulted in rather different orders of detail for the three chosen subscales. The SEM approach gave positive assurances for all three subscales concerning unidimensionality, and a similar assurance concerning parameter invariance for both the General Affect and the Negative Affect subscales, but indicated a problem for the Teachers subscale. Thus, we have an example where the most altered subscale in terms of content was not the most problematical in construct validity terms. The results for the partial credit model indicated that the Teachers subscale had a fit problem for one of the samples (USA), but that the others fit at a reasonable level. Comparison at the item step level between the two samples revealed considerable differences, which were illustrated for three cases that were, respectively, small, consistent, and inconsistent. These comparisons revealed statistically significant differences between the item parameters for a little over half of the items, including at least two in each subscale. Of the items that were identically worded in the two samples, 5 out of 8 were found to have significant differences; of the four items that were altered, all were found to have significant differences. Comparison at the overall level of person fit statistics, however, did not reveal any great impact from these differences in person estimates.

CONCLUSION

The overall finding is one that contains some good news and some bad news for those who use attitude instruments to conduct research across cultural contexts. Looking at it on the negative side, none of the subscales showed invariance on all criteria. In the SEM analysis, for construct validity as evaluated by fit to a constrained one-factor model, two subscales performed reasonably well. The IRT analysis revealed that all three of the subscales gave significantly different estimates of item location across the samples, indicating that the respondents saw the latent traits in different ways. Looking on the positive side, these results may be considered substantive results rather than merely negative findings, telling us about the different ways that people construct variables and respond to items in different contexts. In summary, this study has shown that through careful assessment of psychometric properties using techniques such as Structural Equation Modelling and Item Response Theory, attitude scales can be examined to see whether they are sufficiently consistent in their characteristics to allow meaningful comparisons to be made across cultural contexts. The results of such examinations will be dependent upon the level of detail that the researcher pursues. Clearly, the IRT approach resulted in a greater degree of detail in the examination, and hence found more discrepancies than the SEM approach. Many researchers in the area of cross-cultural comparisons will find such a level of examination alarming, leading potentially to the rejection of much of the existing research base. Others may consider it merely the inevitable result of trying to compare the incomparable. It is the position of this researcher that the present situation regarding the use of affective instruments across cultural contexts is not sufficiently well-researched to say which of these alternatives is correct; indeed, it may be that neither is correct.
What is needed is a program of study that seeks out the conditions under which affective instruments display parameter invariance across particular cultural and linguistic contexts. This might be called strong construct validity for the comparison. Where such conditions are not attainable, or where particular indicators are considered important enough to be kept free from modification, one might instead seek evidence of weak construct validity, such as that used in the SEM approach here, or perhaps by using a technique similar to that described above for assessing fit of one sample to the item parameters of the other. This will require both technical work on what are the most appropriate techniques to investigate these types of construct validity, and substantive and philosophical work on the meaningfulness of terms such as strong or weak construct validity.

REFERENCES

Adams, R.A., & Khoo, S.T. (1991). Quest [computer program]. Hawthorn, Australia: Australian Council for Educational Research.

Ainley, J., Reed, R., & Miller, H. (1986). School organisation and the quality of schooling (ACER Research Monograph No. 29). Hawthorn, Australia: ACER.

Andersen, E.B. (1973). Conditional inference for multiple choice questionnaires. British Journal of Mathematical and Statistical Psychology, 26, 31-44.

Burt, R.S., Fischer, M.G., & Christman, K.R. (1979). Structures of well-being: sufficient conditions for identification as restricted covariance models. Sociological Methods and Research, 8, 111-120.

Epstein, J.L., & McPartland, J.M. (1976). The concept and measurement of the quality of school life. American Educational Research Journal, 13(1), 15-30.

Fraser, B.J., & Fisher, D.L. (1983). Development and validation of short forms of some instruments measuring student perceptions of actual and preferred classroom learning environment. Science Education, 67, 115-131.

Husen, T. (1967). International study of achievement in mathematics: a comparison of twelve countries (Vols. 1 and 2). New York: Wiley.

Irvine, S.H., & Berry, J.W. (1988). Human abilities in cultural context. Cambridge, UK: Cambridge University Press.

Joreskog, K.G. (1971). Statistical analysis of a set of congeneric tests. Psychometrika, 36, 109-133.

Joreskog, K.G., & Sorbom, D. (1986). LISREL VI: Analysis of linear structural relationships by maximum likelihood and least square methods. Mooresville, IN: Scientific Software.

Linden, L. (1977). Home environment and student support (Department of Statistics Research Report No. 77-10). Uppsala: University of Uppsala.

Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Masters, G.N., & Wright, B.D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529-544.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed.). New York: ACE-Macmillan.

Mitchell, D.E., & Spady, W.G. (1977). Authority and the functional structuring of social actions in schools. Unpublished AERA symposium paper (quoted in Williams & Batten, 1981).

Moos, R.M. (1978). A typology of junior high and senior high classrooms. American Educational Research Journal, 15(1), 53-66.

Munck, I. (1979). Model building in comparative education. Stockholm: Almqvist & Wiksell.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (expanded ed.). Chicago: The University of Chicago Press. (Original work published 1960)

Shapiro, J.Z. (1987, April). Project LSYOU: A summative evaluation. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC.

Walberg, H.J. (1979). Educational environments and effects. Berkeley, CA: McCutchan.

Walker, D.A. (1976). The IEA Six Subject Survey: An empirical study of education in twenty-one countries. Stockholm: Almqvist & Wiksell.

Williams, T.H., & Batten, M.H. (1981). The quality of school life (ACER Research Monograph No. 12). Hawthorn, Australia: ACER.

Wilson, M. (1988). Internal construct validity and reliability of a quality of school life instrument across nationality and school level. Educational and Psychological Measurement, 48, 995-1009.

Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.

chapter 16

Consequences of Removing Subjects in Item Calibration

Patrick S.C. Lee

LaSalle University

Hoi K. Suen

Pennsylvania State University

The metric of the ability or $\theta$ scale in item response theory (IRT) is indeterminate. With this indeterminacy, item and ability parameters are theoretically unidentifiable unless an origin is assigned to $\theta$ (Lord, 1980). A common practice today is to scale along a z-score metric with a mean of 0 and a standard deviation of 1 (Hambleton & Swaminathan, 1985). Existing methods in IRT parameter estimation generally assume that, given the z-score metric, $\theta$ is within the interval $-\infty < \theta < \infty$. When Newton-Raphson (e.g., Lord, 1980; Hambleton & Swaminathan, 1985) or other unconstrained numerical procedures are applied to estimate ability, $\theta$ can theoretically take on a value of positive or negative infinity. Specifically, the maximum likelihood estimator for a subject with a perfect response vector is infinity, while that for a subject with an all-zero response vector is negative infinity. These estimates are problematic in a joint maximum likelihood estimation of item parameters in that item estimators are affected or unattainable. If item parameters are attainable but affected in an unspecified manner, the invariance of parameters is no longer guaranteed. Hence, it can potentially affect subsequent applications such as equating.
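The divergence of the ability estimate for perfect and all-zero response vectors can be seen in a few lines. The following is a minimal sketch, assuming the one-parameter (Rasch) model with known item difficulties; the function name `newton_raphson_theta` and the difficulty values are hypothetical. For a mixed response vector the iteration settles on a finite value, while for a perfect or all-zero vector each step moves the estimate further out and the convergence test is never met.

```python
import math

def newton_raphson_theta(responses, difficulties, start=0.0, max_iter=20, tol=1e-6):
    """Unconstrained Newton-Raphson ML estimate of theta under the Rasch model.

    Returns (theta, converged). Item difficulties are treated as known.
    """
    theta = start
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        gradient = sum(u - p for u, p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        if information == 0.0:             # likelihood is numerically flat
            break
        step = gradient / information
        theta += step
        if abs(step) < tol:
            return theta, True
    return theta, False

difficulties = [-1.0, 0.0, 1.0]            # hypothetical item difficulties
print(newton_raphson_theta([1, 1, 0], difficulties))   # mixed vector: converges
print(newton_raphson_theta([1, 1, 1], difficulties))   # perfect vector: keeps growing
print(newton_raphson_theta([0, 0, 0], difficulties))   # all-zero vector: keeps falling
```

The iteration cap only stops the loop; it does not make the final value for the extreme vectors a meaningful estimate.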

There are at least five alternatives to resolve this problem. One solution is to impose external constraints in the estimation procedure to minimize parameter drift to unacceptable values (cf. Hambleton, 1989). These constraints are generally based on experience or logical deduction. For example, in the 3-parameter context, the slope parameter may be constrained to be positive (i.e., $a > 0$), the guessing parameter may be constrained to be less than some reasonable amount (for example, $c < .35$), or the ability parameter constrained away from the extremes ($-3 < \theta < 3$). Another solution is to impose a nonuniform prior distribution of $\theta$ values; then the posterior $\theta$ values estimated through a Bayesian Modal Estimation procedure (Swaminathan & Gifford, 1986) are taken as the best estimates. The third solution is to remove the need for estimating $\theta$ altogether through the Marginal Maximum Likelihood procedure (Mislevy & Bock, 1990), although there is still a need to estimate the distribution of $\theta$. A fourth option is to create two "dummy" items.¹ One of these items will have a perfect classical p-value while the other will have a zero p-value. Subjects with perfect and zero raw scores would thus be eliminated. This alternative would be appropriate only for a conditional estimation of abilities. For a joint estimation of subject and item parameters, it essentially replaces the problem of perfect- and zero-scored subjects with perfect- and zero-scored items. A final alternative is to remove all subjects with perfect or zero raw scores prior to item calibration (e.g., Wright & Stone, 1979). The consequences of the final alternative of removing subjects prior to item calibration on the quality of the estimators are unknown (Hambleton & Swaminathan, 1985, pp. 92-93). The purpose of this chapter is to examine the effects of such a tactic on the $\theta$ metric and item parameters.
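The final alternative, dropping subjects with perfect or zero raw scores before calibration, amounts to a simple filter on the response matrix. The following is a minimal sketch for dichotomous items; the function name `drop_extreme_subjects` and the small response matrix are hypothetical. The counts are returned under the names p and m, matching the notation the chapter uses later for the numbers of perfect and all-zero response vectors.

```python
def drop_extreme_subjects(response_matrix):
    """Split off subjects with perfect or all-zero raw scores.

    Returns (retained rows, p, m): p perfect vectors and m all-zero
    vectors are removed, leaving n = N - p - m retained subjects.
    """
    retained, p, m = [], 0, 0
    for row in response_matrix:
        raw_score = sum(row)
        if raw_score == len(row):
            p += 1
        elif raw_score == 0:
            m += 1
        else:
            retained.append(row)
    return retained, p, m

# Hypothetical 0/1 responses: five subjects by four items.
responses = [
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
]
retained, p, m = drop_extreme_subjects(responses)
print(len(retained), p, m)
```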
INVARIANT ITEM PARAMETERS

An important and desired characteristic of IRT is the invariance of item parameters (Lord, 1980), which also enables the calibration process to be sample-free (Wright & Stone, 1979). When the z-score metric is imposed on the $\theta$ scale for each of two groups responding to the same set of items, estimators of item parameters will most likely be different from one group to another. However, the property of invariance is maintained if the two $\theta$ scales are linear transformations of one another (Hambleton & Swaminathan, 1985; Lord, 1980; Lord & Novick, 1968; Wright, 1968). If the effects of removing subjects are such that the $\theta$ scales from different calibration samples become unknown and nonlinear transformations of one another, the practice of removing subjects would be problematic in that item parameters are no longer invariant.

Let's assume that the $\theta$ metric X for group A with a number of perfect and all-zero response vectors is a linear transformation of the $\theta$ metric Y for group B, which also has a number of perfect and all-zero response vectors. If subjects are removed from these groups because of perfect and zero raw scores, the metric of the $\theta$ scales would change, resulting in two new metrics X* and Y*. The property of invariance is guaranteed only if X* is a linear transformation of X and Y* is a linear transformation of Y, which would then imply that X* remains a linear transformation of Y*.

¹ The authors wish to thank Robert Jannarone for pointing out this option.

TRANSFORMATION OF METRICS

Samuelson (1968) demonstrated that, given a finite sample of N subjects, no score can be beyond $\pm (N-1)^{0.5}$ standard deviations from the mean. For a $\theta$ scale with a z-score metric, this property implies that the boundaries of $\theta$ scores calibrated from a finite sample are $\pm (N-1)^{0.5}$. Let N be the size of a calibration sample in which p subjects have perfect response vectors and m subjects have all-zero response vectors and let X be the $\theta$ scale for this sample. With $\theta$ on a z-score metric, we can assume that the distribution of $\theta$ is symmetric. Let $\pm c$ be the actual maximum and minimum $\theta$ values for a given finite sample of subjects; then $-(N-1)^{0.5} < -c < \theta_R < +c < +(N-1)^{0.5}$, where $\theta_R$ is the ability score for all subjects R whose raw score is neither perfect nor zero. That is, each subject R would be retained after subjects with perfect and zero raw scores have been dropped. Let X* be the $\theta$ metric after subjects with perfect and zero raw scores have been dropped and $\theta^*_R$ be the ability score for subject R on the X* metric.
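Samuelson's bound is easy to check numerically. The following is a minimal sketch with hypothetical data; the helper names `z_scores` and `samuelson_bound` are assumptions, and the standard deviation is computed with divisor N (with divisor N - 1 the bound takes a different form). The second sample places all but one score at the same value, which is the configuration in which the bound is reached.

```python
import math

def z_scores(values):
    """Standardize using the population SD (divisor N, not N - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def samuelson_bound(n):
    """Largest possible |z| in a sample of size n."""
    return math.sqrt(n - 1)

# Hypothetical samples of size 10.
typical = [-1.2, -0.4, 0.0, 0.3, 0.5, 0.8, 1.1, 1.9, 2.4, 3.0]
extreme = [0.0] * 9 + [1.0]

for sample in (typical, extreme):
    largest = max(abs(z) for z in z_scores(sample))
    print(round(largest, 4), "<=", round(samuelson_bound(len(sample)), 4))
```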
We demonstrate below that $\theta^*_R$ is a linear transformation of $\theta_R$ by obtaining a mapping of the boundaries. That is, we need to show how $\theta_R$ of the interval $[-(N-1)^{0.5}, +(N-1)^{0.5}]$ is transformed. In the estimation of $\theta^*_R$, such a transformation is equivalent to transforming the interval $[-(N-1)^{0.5}, +(N-1)^{0.5}]$:

Maximize-minimize $\theta^*_R$: $1 \le R \le n$

where p is the number of perfect raw scores, m the number of zero raw scores, and n the number of nonperfect, nonzero raw scores. This constrained optimization problem requires that the sum of the $\theta$ scores is zero (Eq. (1)) with the variance of one (Eq. (2)) in order for $\theta$ to remain within the z-score metric. The solution of this optimization problem would lead to the range of $\theta^*_R$. Using the Lagrangian technique (cf. Mangasarian, 1969), we obtain the interval of $\theta^*_R$ to be

where c is the actual number of standard deviations away from the mean which will contain all possible scores in the sample of subjects. Note that two distinct scores $\theta_R$ and $\theta_S$ of the original X metric would become $\theta^*_R$ and $\theta^*_S$ of the new X* metric such that their ordinal positions are preserved. Thus, the transformation is a one-to-one ordered mapping. In other words,

For a given calibration sample of subjects, N, m, p, n and c are constants. Thus, Equation 4 demonstrates that $\theta_R$ and $\theta^*_R$ are linear transformations of one another.
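A simple numerical illustration of the kind of linearity at issue is sketched below. It is not the authors' Equation 4; it only shows, under the assumption that the new metric X* is obtained by re-standardizing the retained scores to mean 0 and standard deviation 1, that each retained score maps to its new value through one slope and one intercept that are constant for the sample, with order preserved. The ability values and the helper name `standardize` are hypothetical.

```python
import math

def standardize(values):
    """Return (z-scores, mean, SD) using the population SD (divisor N)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values], mean, sd

# Hypothetical abilities; the first and last stand in for the all-zero
# and perfect-score subjects that are removed before calibration.
raw = [-2.4, -1.1, -0.3, 0.2, 0.9, 2.7]
theta_x, _, _ = standardize(raw)                     # metric X: all N subjects
retained = theta_x[1:-1]                             # drop the two extremes
theta_x_star, mean_r, sd_r = standardize(retained)   # metric X*: retained only

# Constants of the sample: one slope and one intercept for every subject.
slope = 1.0 / sd_r
intercept = -mean_r / sd_r
for old, new in zip(retained, theta_x_star):
    print(round(old, 4), round(new, 4), round(slope * old + intercept, 4))
```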

DISCUSSION

Recall that the X metric is of all scores while the X* metric is of nonperfect, nonzero scores only. Equation (4) shows that the same ordering is preserved in the X* metric and the boundaries of $\theta^*_R$ have changed according to Equation (3). The absolute difference of two distinct scores $\theta_R$ and $\theta_S$ on the X metric is transformed into $\theta^*_R$ and $\theta^*_S$ on the X* metric, as given by Equation (4). This transformation is approximately equivalent to the magnitude of $n^{0.5}$, provided p and m are negligible. The result is that X* becomes a linear transformation of X. Thus, the invariance of item parameters is preserved.

Therefore, from the perspective of the invariance of item parameters, the practice of removing subjects with perfect and all-zero raw scores prior to item calibration is acceptable in that subjects' relative positions are maintained, item parameters remain invariant, and equating of $\theta$ across samples is possible. This finding is also consistent with the general notion that item calibration is sample-free.

It should be cautioned, however, that, while Equation (4) provides a theoretical justification for the removal of subjects in item calibration, it is by itself a necessary but insufficient condition to support the practice in applied settings. It demonstrates that parameters are not affected. To support practice in applied settings, it is also necessary to demonstrate that estimators, in addition to parameters, are also not affected. Further analyses are needed to explore the effects of removing subjects on estimators. An additional consideration is that, whereas this chapter provides a justification for removing subjects in item calibration, the problem of how to derive a finite $\theta$ for these subjects in ability estimation remains. Wilson and Wright (1985) provided one solution for this problem.

REFERENCES

Hambleton, R.K. (1989). Principles and selected applications of item response theory. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 147-200). New York: Macmillan.

Hambleton, R.K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Hingham, MA: Kluwer.

Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Mangasarian, O.L. (1969). Nonlinear programming. New York: McGraw-Hill.

Mislevy, R.J., & Bock, R.D. (1990). BILOG 3: Item analysis and test scoring with binary logistic models (2nd ed.). Mooresville, IN: Scientific Software.

Samuelson, P.A. (1968). How deviant can you be? Journal of the American Statistical Association, 63, 1522-1525.

Swaminathan, H., & Gifford, J.A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 589-601.

Wilson, M., & Wright, B.D. (1985, April). Finite measures from perfect scores. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal.

Wright, B.D. (1968). Sample-free test calibration and person measurement. Proceedings of the 1967 invitational conference on testing problems. Princeton, NJ: Educational Testing Service.

Wright, B.D., & Stone, M.H. (1979). Best test design. Chicago: MESA.

chapter 17

Item Information as a Function of Threshold Values in the Rating Scale Model

Barbara G. Dodd

The University of Texas at Austin

Ralph J. De Ayala

The University of Maryland—College Park

Birnbaum's (1968) conceptualization of information functions for tests and for individual items has been used in many applications of item response theory (IRT) models. The primary benefit of information functions is that they allow one to construct measurement instruments that will maximize the precision of measurement or information where it is needed most. Another benefit is that information functions for two measurement instruments can be compared in terms of relative efficiency to aid in the selection of the best instrument for a given measurement situation. Information functions have also been used effectively to determine item selection for computerized adaptive testing (CAT). Most of the applications of information functions have been restricted to the IRT models for dichotomously scored items, where item responses are scored either correct or incorrect. Very little research has investigated the properties of information functions for IRT models developed specifically for item responses that are scored into more than two categories.

Three of the models that are appropriate when item responses are scored using integers to represent ordered response categories corresponding to varying degrees of the trait measured by the item are the rating scale model (Andrich, 1978a,b), the partial credit model (Masters, 1982), and the graded response model (Samejima, 1969). The rating scale model was developed specifically for the case of attitude measurement when the Likert-type response format is used. The partial credit model is an extension to the multiple category case of the one-parameter Rasch model for dichotomously scored items, while the graded response model is an extension to the multiple category case of the two-parameter logistic model. Both the partial credit model and the graded response model are appropriate to use with items for which partial credit can be earned for partially correct solutions to problems. While the rating scale model has been shown to be a special case of the partial credit model (Wright & Masters, 1982), the partial credit model is not a special case of the graded response model (Thissen & Steinberg, 1986).

Samejima (1969) extended Birnbaum's formulation of information functions to the multiple category case. By comparing the information yielded by items scored with optimal dichotomization with the information yielded by scoring the items according to the graded response model, Samejima (1969, 1976) found that the graded response approach yielded considerably greater precision of measurement. Dodd and Koch (1987) applied Samejima's formulation of information functions for the multiple category case to the partial credit model. Unlike the simple Rasch model for dichotomously scored items, it was found that item information functions for the partial credit model could differ substantially from one another as a function of the step estimates for each item. Dodd and Koch also demonstrated the usefulness of information functions to test revision. Information functions for the multiple category case have also been shown to be effective for item selection during CAT based on either the partial credit model (Koch & Dodd, 1989) or the graded response model (Dodd, Koch, & De Ayala, 1989).

Dodd (1987) applied Samejima's (1969) formulation of information functions for polychotomously scored items to the rating scale model. It was found that the distribution of item information for a set of items with the same response threshold values was a function of the scale value for the item. Each item information function peaked near the scale value for the item. It was also discovered that rating scales with threshold values that spanned a small range along the attitude continuum yielded more peaked information functions than rating scales with threshold values that spanned a large range. Thus, the distribution of item information was a function of both the scale value for the item and the set of response threshold values for the rating scale.

This chapter presents the results of a further investigation of the relationship between the distribution of information for an item and the item parameter estimates of the rating scale model. The effectiveness of using item information functions for item selection during CAT was also investigated.

THE RATING SCALE MODEL

Andrich (1978a,b) extended the Rasch model for dichotomously scored items to the polychotomous case of rating scale items in which responses to an item are scored using ordered categories to represent varying degrees of the attitude level. In the rating scale model, a scale value is estimated for each item to reflect the location of the item on the attitude continuum. In addition, a single set of response thresholds is estimated for the entire set of items included in the rating scale, because the response threshold values are assumed to be constant across items on a given rating scale. The probability of responding in a given category is defined as

$$P_{ix}(\theta) = \frac{\exp \sum_{j=0}^{x} [\theta - (b_i + t_j)]}{\sum_{k=0}^{m} \exp \sum_{j=0}^{k} [\theta - (b_i + t_j)]}, \qquad x = 0, 1, \ldots, m. \tag{1}$$

Equation 1 is the general form for obtaining the operating characteristic curves for an item based on the rating scale model. The $\theta$ term is the attitude level, the $b_i$ term is the scale value or location parameter for item i, and the $t_j$ terms ($j = 1, \ldots, m$) are the response threshold parameters for the set of items. For notational convenience, $\sum_{j=0}^{0} [\theta - (b_i + t_j)]$ is defined as being equal to 0. Item information (after Samejima, 1969) for the rating scale model, conditional on theta, is defined as

$$I_i(\theta) = \sum_{x=0}^{m} \frac{[P'_{ix}(\theta)]^2}{P_{ix}(\theta)} \tag{2}$$

where P' is the first derivative of Equation 1. An example of item information functions for two hypothetical items with a scale value of zero and symmetric threshold values that differ in range is presented in Figure 17-1. Both items provided maximum information at the scale value. Item 2 had a slightly flatter information function than item 1 because of the larger range of the threshold values for the scale from which item 2 was selected compared to the scale for item 1.

The information for a given rating scale is simply defined as the sum of the item information functions. Thus, the information that a given item contributes to the scale information function is independent of the information provided by the other items in the rating scale. Item and scale information functions could prove useful in some applications of the rating scale model. For example, the scale information functions for two rating scales can be compared in terms of relative efficiency, which can aid in the selection of the best rating scale for a given measurement situation. Item information functions might also be used effectively to determine item selection for computerized adaptive attitude measurement.

Figure 17-1 Item information functions for two items that have a scale value of zero and threshold values that are symmetric around zero but differ in the range of the threshold values.
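The two definitions can be turned into a short computation. The following is a minimal sketch, assuming the usual form of the rating scale model in which the log-numerator for category x is the sum over j up to x of θ - (b_i + t_j), with the j = 0 term equal to 0; the function names and the two threshold sets are hypothetical and are not the items of Figure 17-1. The derivative used is the analytic one for this form, P'_x = P_x (x - E), where E is the expected category score.

```python
import math

def rsm_probs(theta, b, thresholds):
    """Category probabilities for scores 0..m under the rating scale model."""
    exponents = [0.0]                      # the j = 0 term is defined as 0
    for t in thresholds:
        exponents.append(exponents[-1] + (theta - (b + t)))
    top = max(exponents)                   # subtract the max for stability
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def rsm_information(theta, b, thresholds):
    """Item information: sum over categories of P'(theta)^2 / P(theta)."""
    probs = rsm_probs(theta, b, thresholds)
    expected = sum(x * p for x, p in enumerate(probs))
    # For this model the derivative of P_x with respect to theta is P_x * (x - expected).
    derivs = [p * (x - expected) for x, p in enumerate(probs)]
    return sum(d * d / p for d, p in zip(derivs, probs) if p > 0.0)

narrow = (-0.5, 0.0, 0.5)   # hypothetical thresholds spanning a small range
wide = (-1.5, 0.0, 1.5)     # hypothetical thresholds spanning a large range
for theta in (-1.0, 0.0, 1.0):
    print(theta,
          round(rsm_information(theta, 0.0, narrow), 3),
          round(rsm_information(theta, 0.0, wide), 3))
```

With both scale values set to zero, the printed columns can be compared to see how the range of the thresholds changes the height of the information function at and around the scale value.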

INFORMATION STUDY

Datasets

The relationship between the distribution of item information and the item parameters of the rating scale model was assessed with rating scales that had either three or four threshold values. A total of 30 different sets of scale threshold values were generated to investigate the effect of the number, symmetry, and distance between adjacent threshold values on the distribution of item information. For each of the 30 sets of threshold values, nine item scale values that ranged from -2.0 to 2.0 in .5 increments were used in the item information analyses. To determine if the relationships between the item parameters and the distribution of information across the trait continuum that were found for the generated items would hold for real data, the threshold values that were estimated for the AWS and ADCOM datasets (Dodd, 1990) and the threshold values estimated by Masters and Wright (1981) for the fear of crime items were used in the item information analysis. The three real attitude scales differed from one another in terms of the number and range of threshold values.

Analyses

The nine scale values used in conjunction with each of the 30 generated sets of scale threshold values to investigate the effects of number, symmetry, and distance between adjacent threshold values on the distribution of item information were treated as known parameters in the information analyses. Estimates of the threshold values reported in the literature for the three real attitude scales were also used in the item information analyses. Equation 2 was used to calculate information for the $\theta$ values ranging from -4.0 to 4.0 at intervals of .1 for the 270 generated items and the 27 items based on estimates of the threshold values for the three real attitude scales.
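The layout of the analysis, nine scale values crossed with each threshold set and evaluated on a grid of 81 trait values, can be sketched as follows. This is a minimal sketch under the same assumed form of the rating scale model; the single threshold set shown is hypothetical and stands in for one of the 30 generated sets, whose values are not listed in the text. The information is computed as the variance of the category score, which equals the sum of P'^2 / P when P'_x = P_x (x - E).

```python
import math

def rsm_information(theta, b, thresholds):
    """Rating scale model item information at theta for scale value b."""
    exponents = [0.0]
    for t in thresholds:
        exponents.append(exponents[-1] + (theta - (b + t)))
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    probs = [w / total for w in weights]
    expected = sum(x * p for x, p in enumerate(probs))
    # Equals sum of P'^2 / P because P'_x = P_x * (x - expected).
    return sum(p * (x - expected) ** 2 for x, p in enumerate(probs))

scale_values = [-2.0 + 0.5 * k for k in range(9)]            # -2.0 to 2.0 by .5
theta_grid = [round(-4.0 + 0.1 * k, 1) for k in range(81)]   # -4.0 to 4.0 by .1
thresholds = (-1.0, 0.0, 1.0)    # one hypothetical symmetric threshold set

peaks = {}
for b in scale_values:
    # Grid point at which this item's information is largest.
    peaks[b] = max(theta_grid, key=lambda th: rsm_information(th, b, thresholds))
    print(b, peaks[b])
```

Repeating the loop over 30 threshold sets would give the 270 generated items described above.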
Results

The item information functions for the 270 generated items confirmed the findings of Dodd (1987) that the item information function for each item peaked near the scale value and that rating scales with threshold values that spanned a small range along the attitude continuum yielded more peaked information functions than rating scales with threshold values that spanned a large range. As expected, it was also found that items with four threshold values yielded more total information across the trait continuum than the items with three threshold values. Thus only items with the same number of thresholds yield the same total amount of information across the entire attitude continuum.

Inspection of item information functions for scales with three threshold values revealed that the information functions peaked at the scale value of the item when the threshold values were symmetric around zero. For the scales with four threshold values that were symmetric around zero, it was found that the item information functions peaked at the item scale value provided the distance between the two middle threshold values was less than 2.0 logits. The four items selected to illustrate this finding had a scale value of zero but were from scales that differed in the distance between adjacent threshold values as well as the range of the threshold values. Figure 17-2 shows the item information functions for these four items. The information functions for items 3-5 all peaked at the scale value of zero.

Figure 17-2 Item information functions for four items that have a scale value of zero and threshold values that are symmetric around zero but differ in the range of the threshold values and the distance between adjacent threshold values.
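The contrast between single-peaked and two-peaked information functions for four symmetric thresholds can be reproduced with invented values. The following is a minimal sketch under the same assumed form of the rating scale model; the two threshold sets are hypothetical (they are not items 3 to 6 of Figure 17-2) and were chosen so that the middle thresholds are either well under or well over 2.0 logits apart. The sketch does not attempt to locate the 2.0 logit boundary itself.

```python
import math

def rsm_information(theta, b, thresholds):
    """Rating scale model item information at theta for scale value b."""
    exponents = [0.0]
    for t in thresholds:
        exponents.append(exponents[-1] + (theta - (b + t)))
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    probs = [w / total for w in weights]
    expected = sum(x * p for x, p in enumerate(probs))
    return sum(p * (x - expected) ** 2 for x, p in enumerate(probs))

def local_maxima(b, thresholds, grid):
    """Interior grid points whose information exceeds both neighbours."""
    values = [rsm_information(th, b, thresholds) for th in grid]
    return [grid[i] for i in range(1, len(grid) - 1)
            if values[i] > values[i - 1] and values[i] > values[i + 1]]

theta_grid = [round(-4.0 + 0.1 * k, 1) for k in range(81)]
close_middle = (-1.5, -0.5, 0.5, 1.5)   # hypothetical: middle thresholds 1.0 logit apart
far_middle = (-3.0, -2.0, 2.0, 3.0)     # hypothetical: middle thresholds 4.0 logits apart

print(local_maxima(0.0, close_middle, theta_grid))
print(local_maxima(0.0, far_middle, theta_grid))
```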

Item 6, however, had a bimodal distribution of information which peaked at trait levels of —1.6 and 1.6. Given the fact that the range of threshold values for the scale from which item 6 was selected is the same as the range of the threshold values for the scale from which item 5 was selected, it appeared that large distances between the two middle threshold values resulted in bimodal information functions. Inspection of the information functions for other scales with four threshold values revealed bimodal information functions when the distance between the two middle threshold values was equal to or greater than 2.0 logits. It should also be noted that the information function for item 4 was flatter than the information function for item 3 because the distance between the two middle threshold values was greater for the item 4 scale t h a n for the item 3 scale. The information functions for items 3 and 4 were also more peaked than the information functions for items 5 and 6 because the range of scale threshold values for items 3 and 4 were less t h a n the range of the scale threshold values for items 5 and 6. When there was an odd number of asymmetric thresholds, the peak of the information function was shifted away from the scale value in the direction of the dominant sign of the threshold values. Figure 17-3

Figure 17-3 Item information functions for three items that have a scale value of zero but differ in the range and degree of asymmetry of the threshold values.


DODD & DE AYALA

Figure 17-4 Item information functions for two items that have a scale value of zero and asymmetric threshold values with the same range but differ in the distance between adjacent threshold values.

depicts the information functions for three items that had a scale value of zero but differed in the range and degree of asymmetry of threshold values. As can be seen, the scale threshold values for item 7 had the smallest degree of asymmetry and the smallest shift of the peak of the information function away from the scale value. Items 8 and 9 had scale threshold values that differed from one another only in terms of the direction of the asymmetry. The magnitude of the shift of the peak of the information functions away from the scale value was identical for items 8 and 9 and differed only in terms of the direction of shift away from the scale value. For the scales with an even number of threshold values that were asymmetric, the degree of shift away from the scale value was found to be a function of the distance between adjacent threshold values. Figure 17-4 presents the item information functions for two items with four asymmetric scale threshold values that differ only in the distance between adjacent threshold values. Item 11, which has a distance of 2.5 logits between the two middle threshold values, had a 1.6 logit shift in


Figure 17-5 Item information functions for three items that have a scale value of zero but differ in the range and number of the threshold values.

the peak of the information function away from the scale value. Item 10, on the other hand, had a smaller shift in the peak of the information function (.9) because the distance between the two middle thresholds was smaller than the distance between the middle threshold values for item 11. Figure 17-5 depicts the item information functions for each of the three real attitude scales with an item scale value of zero. As can be seen, the magnitude of the shift away from the scale value is a function of the degree of asymmetry of the threshold values. For the odd number of threshold values, the direction of the shift is determined by the dominant sign of the threshold values. It is interesting to note that the shift for the fear of crime item was 2.1 logits. For the even number of threshold values, the direction of shift was a function of the magnitude of two middle threshold values; the shift was in the direction of the threshold value with the largest deviation from zero. These results confirmed the relationship between the distribution of item information and the item parameters of the rating scale model that were identified with the generated item parameters.
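The relationships described above can be explored numerically. The following is a minimal sketch, assuming the usual rating scale parameterization in which category x has log-weight equal to the sum over the first x thresholds of (theta - scale value - threshold), and using the Rasch-family result that item information equals the conditional variance of the item score. The function names and the two threshold sets are hypothetical; they are not the chapter's items.

```python
import math

def rsm_category_probs(theta, scale_value, thresholds):
    """Category probabilities (categories 0..m) under the rating scale model."""
    log_weights = [0.0]
    for tau in thresholds:
        log_weights.append(log_weights[-1] + (theta - scale_value - tau))
    top = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(w - top) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

def rsm_item_information(theta, scale_value, thresholds):
    """Item information, computed as the variance of the item score at theta."""
    probs = rsm_category_probs(theta, scale_value, thresholds)
    mean = sum(x * p for x, p in enumerate(probs))
    return sum((x - mean) ** 2 * p for x, p in enumerate(probs))

def total_information(scale_value, thresholds, lo=-30.0, hi=30.0, step=0.01):
    """Trapezoid-rule area under the information function on [lo, hi]."""
    n = int(round((hi - lo) / step))
    area = 0.0
    prev = rsm_item_information(lo, scale_value, thresholds)
    for k in range(1, n + 1):
        cur = rsm_item_information(lo + k * step, scale_value, thresholds)
        area += 0.5 * (prev + cur) * step
        prev = cur
    return area

# Hypothetical threshold sets, both symmetric around zero.
three_thresholds = [-1.0, 0.0, 1.0]
four_wide_middle = [-3.0, -2.0, 2.0, 3.0]  # middle thresholds 4 logits apart

if __name__ == "__main__":
    print(total_information(0.0, three_thresholds))
    print(total_information(0.0, four_wide_middle))
    for theta in (-2.5, 0.0, 2.5):
        print(theta, rsm_item_information(theta, 0.0, four_wide_middle))
```

Under these assumptions the area under each information function should come out close to the number of thresholds, and the item with widely separated middle thresholds should show less information at its scale value than at points on either side of it, which is the bimodal pattern described for item 6.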


CAT STUDY

Method

Datasets. Two real datasets consisted of response data for two different attitude scales. The third dataset consisted of simulated response data generated specifically to fit the rating scale model.

Responses made by 490 teachers to the Audit of Administrator Communication (ADCOM; Valentine, 1978) were available for use in the present study. ADCOM is a 40-item Likert-type attitude scale designed to measure attitudes of teachers toward the communication skills of their school administrators. All items are scored on a five-point scale on which 0 represents an unfavorable response toward the communication skills of the administrator, and a score of 4 represents a favorable response. Factor analysis of the ADCOM scale (Koch, 1983) indicated that the scale is approximately unidimensional; the first factor accounted for about 85% of the common variance.

The Attitude Toward Women Scale (AWS; Spence, Helmreich, & Stapp, 1973) was designed to measure attitudes toward the rights and roles of women in contemporary society. Each of the 25 items has four response alternatives ranging from "AGREE STRONGLY" to "DISAGREE STRONGLY." Responses are scored so that profeminist attitudes receive a score of 3, whereas very traditional attitudes receive a score of 0. Response data were available for 531 women. Previous factor analytic studies (Dodd, 1985) demonstrated that the AWS has one dominant factor that accounts for about 83% of the common variance.

The third dataset consisted of simulated responses to 27 items from 500 simulees. These data were generated according to the rating scale model using standard procedures. The items were constructed to have four response alternatives. Consequently, three response threshold values were specified for the set of 27 items, and a scale value was specified for each of the items.
The item parameters used to generate the data were those estimates reported by Masters and Wright (1981) based on real responses to fear of crime items. More specifically, the item parameter estimates for 9 items reported by Masters and Wright were treated as known item parameters and were used as input into the data generation program. Given the fact that 9 items is too small an item bank for CAT, the size of the item pool was tripled by duplicating Masters and Wright's item parameter estimates for the 9 items twice, and simulated item responses were thus generated for 27 items. Conventional procedures were used to generate the simulated item responses according to the rating scale model. The reader is referred to


Dodd (1990) for a detailed description of the data generation procedure. Response strings to 27 items for 500 simulees were generated for later use in the simulated adaptive measurement procedures. Because these data were generated according to the rating scale model, there was no need to assess the unidimensionality of the data.

Calibration. For each of the three datasets, a two-stage procedure outlined by Masters and Wright (1981) was used to obtain the estimates of the item parameters according to the rating scale model. In the first stage the computer program PARTIAL¹ was used to obtain item parameter estimates based on the partial credit model. This program was written according to the calibration procedures and estimation equations specified by Masters (1982) for the partial credit model. The second stage involved obtaining estimates of the threshold values and of the scale value parameters from the step value estimates obtained from the PARTIAL program. For each item, the partial credit model's step estimates were simply averaged to obtain the estimate of the scale value for the item. Estimates of the threshold values were obtained by first transforming each of the partial credit step value estimates for an item into a deviation score from the scale value for that item. The deviation scores for each step were then averaged across the items to obtain the estimate of the threshold value for each step. Note that, generally, these estimates will not be identical to those yielded by a computer program that estimates the item parameters of the rating scale model directly.

A computer program for the rating scale model was used to simulate computerized adaptive attitude measurement using a sample of 200 persons from each of the three datasets. The maximum likelihood estimation method was used to estimate the person's attitude trait level after each item.
Prior to maximum likelihood estimation, however, it was necessary to use a specified stepsize along the theta scale as preliminary theta estimates to administer the first two or three items. The variable stepsize recommended by Dodd was used to change the theta estimate by half the distance between the previous theta estimate and either of the two extreme scale value estimates for the item pool. If the response to the most recent item administered was in the lower half of the response categories, the lowest scale value estimate was used, while a response in the upper half of the response categories resulted in using the upper extreme scale value estimate.

¹ Contact the first author for information about the PARTIAL computer program.
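Two of the procedures described above, the second calibration stage and the variable stepsize rule, are simple enough to sketch in a few lines. This is an illustrative sketch only, not the PARTIAL program or the authors' simulation code. The function names and the example step estimates are hypothetical, and the treatment of the middle category when the number of categories is odd (counted here as the lower half) is an assumption, since the text does not say how that case was handled.

```python
def rating_scale_from_pcm_steps(step_estimates):
    """Second calibration stage.

    step_estimates[i][k] is the partial credit step estimate for step k of item i.
    Returns (scale_values, thresholds) for the rating scale model.
    """
    n_items = len(step_estimates)
    n_steps = len(step_estimates[0])
    # Scale value of an item: the mean of that item's step estimates.
    scale_values = [sum(steps) / n_steps for steps in step_estimates]
    # Threshold k: the mean over items of (step k estimate - item scale value).
    thresholds = []
    for k in range(n_steps):
        deviations = [step_estimates[i][k] - scale_values[i] for i in range(n_items)]
        thresholds.append(sum(deviations) / n_items)
    return scale_values, thresholds

def variable_stepsize_update(theta, response, n_categories,
                             lowest_scale_value, highest_scale_value):
    """Move theta halfway toward the extreme scale value on the response's side.

    Responses are scored 0..n_categories-1. Assumption: with an odd number of
    categories the middle category is treated as belonging to the lower half.
    """
    in_lower_half = response < n_categories / 2
    target = lowest_scale_value if in_lower_half else highest_scale_value
    return theta + 0.5 * (target - theta)

# Hypothetical partial credit step estimates for three four-category items.
pcm_steps = [
    [-1.0, 0.0, 1.0],
    [-0.5, 0.5, 1.5],
    [-2.0, 0.0, 0.5],
]
scale_values, thresholds = rating_scale_from_pcm_steps(pcm_steps)

if __name__ == "__main__":
    print(scale_values)
    print(thresholds)
    # Extreme scale values taken from the AWS column of Table 17-1.
    print(variable_stepsize_update(0.0, 0, 4, -1.864, 0.903))
```

Because each threshold is an average of deviations from the item means, the threshold estimates produced this way sum to zero by construction.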


Given the current theta estimate, the two item-selection procedures studied by Dodd were used in the present investigation to determine the most appropriate item remaining in the pool to administer next. The maximum information method involved choosing the item that provided the most information for the current theta estimate, while the scale value method involved selecting the item with the scale value closest to the current theta estimate. Unlike the Dodd study, however, the CAT sessions under both item selection procedures continued to administer items until a prespecified standard error was obtained or a maximum of 20 items had been administered. For the ADCOM the minimum standard error was arbitrarily set at .25. For the AWS dataset a slightly higher standard error level of .30 was used because the average standard error for the full scale calibration was higher than .25. An even higher standard error level of .50 was used for the artificial dataset because the average standard error for the full-scale calibration was .41.

Data Analyses

Descriptive statistics, correlations, and scattergrams were used to evaluate the two CAT conditions. For each dataset, means and standard deviations were obtained to describe the thetas, standard errors, and number of items administered under the two CAT conditions as well as for the full scale calibration. Scattergrams and correlations were run to determine the degree of linear relationship that existed between the theta estimates obtained under the two CAT conditions and the full scale calibration run for each dataset. For the artificial data, the theta estimates yielded by the CATs and the full scale calibration were also correlated with the known z values used to generate the data. In addition, the root mean squared error (RMSE) statistic was calculated to measure the correspondence between the full scale theta estimates and those yielded by the CAT procedures.

Results

Item pool calibration.
For the AWS data, the step value for the lowest category of one item could not be estimated because only one person responded in that category for that item. The item was thus deleted from the scale, and the remaining items recalibrated. Descriptive statistics for the scale values of the remaining 24 items and the threshold values are presented in Table 17-1. Initial results revealed that an estimate of the lowest step value for


Table 17-1 Descriptive Statistics, Scale Value Estimates, and Threshold Estimates for Three Datasets

                        AWS      ADCOM    Artificial
Scale Value
  Mean                 -.475      .855      -.097
  SD                    .838      .709       .685
  Minimum             -1.864    -1.985     -1.474
  Maximum               .903      .829      1.169
  Number of Items         24        39         27
Threshold
  1                    -.728    -1.347     -4.688
  2                     .091     -.536       .880
  3                     .819      .024      3.807
  4                              1.859

one item of the ADCOM scale was unobtainable because no person responded in the lowest category. In effect this item did not have the same functional response scale as the other 39 items. Consequently the item was deleted from the scale and the remaining 39 items recalibrated. Table 17-1 shows the descriptive statistics for the scale values and the threshold values. The PARTIAL program yielded step estimates for all 27 items of the fear of crime scale. Descriptive statistics for the scale values and the threshold values are displayed in Table 17-1.

Table 17-2 presents the means and standard deviations of the theta estimates, the standard errors of the theta estimates, and the number of items administered under the two adaptive testing conditions and the full scale calibration for each of the three datasets. The mean theta estimates for each of the two CAT conditions and for the full scale calibration within each dataset were very similar. For the AWS and ADCOM, the mean standard errors of the theta estimates were identical for the two adaptive conditions, which administered virtually the same number of items, on the average. The scale value item selection procedure administered one fewer item, on the average, for the artificial data, but resulted in approximately the same average standard error of the theta estimates as the item information selection technique.

The theta estimates yielded by each of the two CAT conditions were correlated with the theta estimates from full scale calibration. The resulting


Table 17-2 Descriptive Statistics for Three Datasets Under Two Adaptive Conditions and the Full-Scale Calibration

Dataset and              Theta Estimate     Standard Error     Number of Items
Testing Condition         Mean     SD        Mean     SD        Mean      SD
AWS (N = 200)
  Scale Value              .37     .89        .32     .09       16.10    2.52
  Information              .38     .88        .32     .09       16.10    2.51
  Full Scale               .35     .88        .26     .10       24.00
ADCOM (N = 200)
  Scale Value             -.03    1.18        .27     .04       19.13    1.12
  Information             -.07    1.18        .27     .04       18.88    1.33
  Full Scale              -.02    1.17        .21     .04       39.00
Artificial (N = 200)
  Scale Value              .05    1.13        .50     .07       16.12    2.80
  Information             -.05    1.07        .51     .05       17.08    2.52
  Full Scale               .02    1.09        .41     .06       27.00

coefficients of correlation were very high (.97 to .98) and virtually the same regardless of the item selection procedure used (see Table 17-3). For the artificial data, it was possible to determine the relationship between the known z values used to generate the data and the theta estimates yielded by the two CAT conditions and the full scale calibration. The coefficients of correlation obtained for the two CAT conditions were virtually the same (.88 and .89), but somewhat lower than what has been found in similar research. For example, using a different artificial dataset, Dodd (1990) obtained coefficients of .95 to .96 for various CAT procedures based on the rating scale model. The coefficient of correlation was also somewhat lower than expected for the full scale calibration (r = .92). These slightly lower coefficients of correlation are due to the size of the standard errors of the theta estimates from the full scale calibration of the artificial data (Mean = .41). The higher coefficients of correlation reported by Dodd (1990) resulted from a full scale calibration that produced an average standard error of .23. The standard errors are a function of the scale information, which in turn is related to the threshold values for the rating scale. The current item pool had an exceptionally wide spread of the threshold values compared to other scales reported in the literature (Andrich, 1978a; Dodd, 1990). The RMSE statistics, which are also presented in Table 17-3, mirrored the results found for the correlation coefficients. For each dataset, the RMSE statistics were virtually the same for the two CAT conditions.

Table 17-3 Pearson Correlation Coefficients and RMSE Statistics for Three Data Sets

                          AWS     ADCOM    Artificial
r(θ_FS, θ_SV)             .98      .98        .96
r(θ_FS, θ_INFO)           .98      .98        .97
r(z, θ_SV)                                    .89
r(z, θ_INFO)                                  .88
r(z, θ_FS)                                    .92
RMSE(θ_FS, θ_SV)          .16      .25        .31
RMSE(θ_FS, θ_INFO)        .16      .26        .28
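The two agreement indices reported in Table 17-3 can be computed as in the following sketch. It assumes the RMSE is the square root of the mean squared difference between paired full scale and CAT theta estimates, which is the conventional definition; the chapter does not print the formula. The function names and the short lists of theta estimates are hypothetical.

```python
import math

def rmse(full_scale_thetas, cat_thetas):
    """Root mean squared difference between paired theta estimates."""
    squared = [(f - c) ** 2 for f, c in zip(full_scale_thetas, cat_thetas)]
    return math.sqrt(sum(squared) / len(squared))

def pearson_r(u, v):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Hypothetical theta estimates for five persons.
full_scale = [-1.2, -0.4, 0.1, 0.8, 1.5]
cat_scale_value = [-1.0, -0.5, 0.3, 0.7, 1.6]

if __name__ == "__main__":
    print(pearson_r(full_scale, cat_scale_value))
    print(rmse(full_scale, cat_scale_value))
```

The correlation is insensitive to a constant shift between the two sets of estimates, whereas the RMSE is not, which is one reason for reporting both.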

DISCUSSION

The results of the item information analyses confirmed the findings of Dodd (1987) and provided further clarification and extension of other findings. Both studies demonstrated that across the entire trait continuum, items with the same number of threshold values provided the same total amount of information. The finding that items from scales with more threshold values yielded more total information across the entire theta scale than items from scales with fewer threshold values is not surprising. This finding is consistent with the belief that more categories provide more information or allow for finer discriminations among persons than items with fewer categories.

The systematic comparison of item information functions in this study provided further clarification of the previous finding that the item information function peaked near the scale value of the item. The results revealed that the magnitude of the shift away from the scale value for a given item in a scale was a function of the degree of asymmetry of the threshold values. When there was an odd number of asymmetric threshold values, the peak of the item information function was shifted away from the scale value in the direction of the dominant sign of the threshold values. For the scales with an even number of threshold values, the degree of shift away from the scale value for a given item was also found to be a function of the distance between adjacent threshold values. In addition, it was discovered that if the distance between the middle threshold values was large when the number of threshold values was even, the information function could be bimodal even if the thresholds were symmetric around zero.

The fact that the shift in the peak of the item information function was found to be 2.1 logits away from the scale values for one real


dataset suggested that using the closest scale value to select items for administration during an adaptive attitude measurement session (Dodd, 1990) might not be the best item selection procedure. The results of the CAT simulations that compared the scale value and the maximum information item selection procedures for three datasets did not, however, lead to this conclusion. Even though the two item selection procedures administered different items, the results of the two CATs were for all practical purposes the same. This result was particularly impressive for the artificial data, given the less than optimal item pool and the fact that the shift in the peak of the information function away from the scale value was greater than 2 logits. While the results reveal that both item selection procedures worked equally well, the scale value item selection procedure requires less computing time and thus would be the preferred method. This would be particularly true for large item pools because information would not have to be calculated for every item that had not been administered. Determining the closest scale value to the latest theta estimate would be much more efficient.
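The two item selection rules compared in this chapter, together with the stopping rule used in the CAT sessions, can be sketched as follows. This is a minimal illustration under the assumption that item information is the variance of the item score under the rating scale model; it is not the simulation program used in the study, and the function names, item pool, and thresholds are hypothetical.

```python
import math

def rsm_information(theta, scale_value, thresholds):
    """Rating scale model item information (variance of the item score)."""
    log_w = [0.0]
    for tau in thresholds:
        log_w.append(log_w[-1] + theta - scale_value - tau)
    top = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    probs = [v / total for v in w]
    mean = sum(x * p for x, p in enumerate(probs))
    return sum((x - mean) ** 2 * p for x, p in enumerate(probs))

def select_closest_scale_value(theta, scale_values, administered):
    """Scale value method: one subtraction per remaining item."""
    remaining = [i for i in range(len(scale_values)) if i not in administered]
    return min(remaining, key=lambda i: abs(scale_values[i] - theta))

def select_max_information(theta, scale_values, thresholds, administered):
    """Maximum information method: evaluates information for every remaining item."""
    remaining = [i for i in range(len(scale_values)) if i not in administered]
    return max(remaining,
               key=lambda i: rsm_information(theta, scale_values[i], thresholds))

def should_stop(standard_error, n_administered, se_target, max_items=20):
    """Stop when the target standard error is reached or 20 items are given."""
    return standard_error <= se_target or n_administered >= max_items

# Hypothetical four-item pool sharing one set of symmetric thresholds.
pool_scale_values = [-1.0, -0.2, 0.3, 1.2]
pool_thresholds = [-1.0, 0.0, 1.0]

if __name__ == "__main__":
    theta_hat = 0.25
    print(select_closest_scale_value(theta_hat, pool_scale_values, set()))
    print(select_max_information(theta_hat, pool_scale_values, pool_thresholds, set()))
```

With symmetric thresholds such as these, the two rules would be expected to choose the same item; they can diverge when asymmetric thresholds shift the peak of the information function away from the scale value.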

REFERENCES

Andrich, D. (1978a). Application of a psychometric model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.
Andrich, D. (1978b). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick, Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Dodd, B.G. (1985). Attitude scaling: A comparison of the graded response and partial credit latent trait models (Doctoral dissertation, University of Texas at Austin, 1984). Dissertation Abstracts International, 45, 2074A.
Dodd, B.G. (1987, April). Computerized adaptive testing with the rating scale model. Paper presented at the Fourth International Objective Measurement Workshop, Chicago.
Dodd, B.G. (1990). The effect of item selection procedure and stepsize on computerized adaptive attitude measurement using the rating scale model. Applied Psychological Measurement, 14, 355-366.
Dodd, B.G., & Koch, W.R. (1987). Effects of variations in step values on item and test information in the partial credit model. Applied Psychological Measurement, 11, 339-351.
Dodd, B.G., Koch, W.R., & De Ayala, R.J. (1989). Operational characteristics of adaptive testing procedures using the graded response model. Applied Psychological Measurement, 13, 129-143.


Koch, W.R. (1983). Likert scaling using the graded response latent trait model. Applied Psychological Measurement, 7, 15-32.
Koch, W.R., & Dodd, B.G. (1989). An investigation of procedures for computerized adaptive testing using partial credit scoring. Applied Measurement in Education, 2, 335-357.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Masters, G.N., & Wright, B.D. (1981). A model for partial credit scoring (Research Memorandum No. 31). Chicago: University of Chicago, MESA Statistical Laboratory.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17.
Samejima, F. (1976). Graded response model of the latent trait theory and tailored testing. In C.K. Clark (Ed.), Proceedings of the First Conference on Computerized Adaptive Testing. Washington, DC: U.S. Government Printing Office.
Spence, J.T., Helmreich, R., & Stapp, J. (1973). A short version of the Attitude toward Women Scale (AWS). Bulletin of the Psychonomic Society, 2, 219-220.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567-577.
Valentine, R.J. (1978). Audit of administrator communication. Columbia, MO: Jerry W. Valentine.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.

chapter 18

Assessing Unidimensionality for Rasch Measurement

Richard M. Smith
University of South Florida

Chang Y. Miao
American Dental Association

Unidimensionality is one of the requirements for Rasch measurement, as it is for most measurement models. However, the primary sources on Rasch measurement have very little to say about the requirement of unidimensionality and provide no recommendation as to methods for directly testing this assumption. Rasch (1960/1980), although describing several methods for "control of the model" and discussing the applicability of the model to data, does not directly address the issue of unidimensionality. Wright and Stone (1979) do not explicitly discuss unidimensionality of the data as a requirement of Rasch measurement. The notion, however, is implicit in their definition of a variable that results exclusively from items that share a common line of inquiry. Wright and Stone provide extensive documentation of methods to test the fit of the data to the model, suggesting that fit of the data to the model assures that the assumptions of the model were met. Wright and Masters (1982) define unidimensionality as a basic requirement for measurement and further expand the assessment of fit on an item and person level, suggesting fit on this level assures the


existence of a single variable. Andrich (1988) is more explicit in his discussions of unidimensionality, but again relies on tests of fit, either in testing the invariance of parameter estimation across subgroups or analyzing the differences between observed response patterns and probabilities developed from the estimated model parameters. Consistent throughout these works is the notion that the unidimensionality assumption is satisfied if the data fit the model. Practically, this is often interpreted by researchers using Rasch measurement as meaning that the requirement of unidimensionality is met if the fit values that accompany most calibration programs for items and/or persons do not depart significantly from their expected values. Hattie (1985) not only provides a comprehensive review of the various definitions of unidimensionality that appear in the psychometric literature, but also reviews a large number of studies that have attempted to develop and validate the use of a variety of indices for assessing unidimensionality. Given this review, there is reason to be extremely skeptical of the use of any fit indices based on the Rasch model, and there is little encouragement to use any principal component or factor analytic procedure. However, the practicality of the situation is such that many researchers do use the family of Rasch measurement models. The research based on the use of this model rarely contains evidence that the dimensionality issue has been addressed in any method other than looking at the general level of item and/or person fit available in the common calibration programs. It is also the case that many other researchers typically rely on factor analytic or principal component techniques to assess the unidimensionality of tests, either in the development stage or in assessing the applicability of a given test to a specific sample. It would appear helpful to directly compare the results of using these commonly available techniques.
In this study the use of the Rasch fit indices will be limited to the unweighted total item fit statistic (OUTFIT) found in such Rasch calibration programs as BICAL, MSCALE, and BIGSCALE. The choice between the principal component and factor analytic procedure is more difficult. Hattie (1985) separates principal component indices from factor analysis indices for several reasons, including the fact that factor analysis requires a hypothesis as to the number of factors. It is exactly for this reason that principal component analysis was chosen for this study. It seems reasonable that researchers using the Rasch model to analyze item level response data believe that, at least operationally, the test is unidimensional. Otherwise, there would be little reason to choose a model that makes unidimensionality a requirement for measurement.


It is always prudent to determine that the sample of persons taking a particular examination responded to the items in a manner that suggests unidimensionality. No matter how many times a test has been demonstrated to be unidimensional for other circumstances, it is always necessary to reconfirm this for the current circumstances. Given this framework, it is unlikely that the researchers wanting to assess unidimensionality would have a preconceived notion of a multidimensional factor structure, if it exists, but rather are simply checking to see if the common threats to unidimensionality, such as speededness, sex bias, race bias, or interactions between content and instruction, have affected the dimensionality of the test. This reasoning suggests that the principal component analysis, which assumes no a priori number of factors, would be the most appropriate method for assessing multidimensionality.

OBJECTIVE

The purpose of this study is to compare two methods of testing the assumption of unidimensionality: the Rasch fit statistic approach detailed in the references cited above and principal component factor analysis. The factor analytic technique is not based on the same set of assumptions as Rasch measurement and can be applied prior to Rasch analysis to test the unidimensionality assumption. The Rasch fit techniques must be used in the context of the Rasch models and require the estimation of parameters before they can be applied. To fully test the applicability of these two approaches, the true factor structure of the data must be known a priori, thus requiring simulated data. The use of real test data with an unknown factor structure would not be useful in deciding which of two or more methods of assessing the factor structure is appropriate, since there is no known structure against which to compare the results. Usually, a study using real data results in the methods that happen to agree being declared the winners because they happen to agree. This decision has no relevance in answering the question of which of the methods best describes the true factor structure of the data.

METHODS

The data for this study were simulated so as to represent varying degrees of correlation between the two factors represented in the response data and varying numbers of items representing each of the two factors. The correlations between the two factors (X and Y) ranged from 0.10 (1% common variance) to 0.87 (75% common variance), with nine different values for the common variance (.01, .04, .09, .16, .25, .36, .50, .64, .75). For each data set the total number of items on the test was set at 50. This test length was chosen to represent an average test length. The number of items in each factor was also varied across five different ratios of items for the two factors (45 & 5, 40 & 10, 35 & 15, 30 & 20, and 25 & 25, with the number of X factor items listed first). This resulted in 45 different combinations of common variance and ratio of X to Y items. For each data set a sample of 1,000 persons was used, again to represent an average number of examinees. For each person, two independent unit normal abilities (X and Z) were generated. The unit normal distributions for each data set were created in SYSTAT. From these two distributions the correlated data were produced by substituting one of the common variance values listed above in the following equation:

\[ Y_i = aX_i + (1 - a)Z_i \]

where $X_i$ is the first independent ability for person i, $Z_i$ is the second independent ability, a is the amount of common variance, and $Y_i$ is the correlated ability. The two abilities ($X_i$ and $Y_i$) for each person were then used to create simulated responses to the 50-item test. The X ability was used to generate the responses to the items measuring the X factor, and the Y ability was used to generate the responses to the items measuring the Y factor using the Rasch probability equation for dichotomous data:

\[ P(x = 1 \mid X, d) = \frac{\exp(X - d)}{1 + \exp(X - d)} \]

and

\[ P(y = 1 \mid Y, d) = \frac{\exp(Y - d)}{1 + \exp(Y - d)}. \]

Here X and Y are the person abilities, d is the item difficulty, and P is the probability of a correct response. Each probability was then compared to a random number between 0.0 and 1.0, chosen specifically for that person-item interaction using the random number function available in BASIC. If the value of the random number exceeded the probability, the item was assigned a response of 0; otherwise, the response was set to 1. The item difficulties used in the simulations were uniformly distributed in sets of five items (with item difficulties in logits of -1, -.5, 0, +.5, and +1) so that the number of items in each factor did not have an effect on the mean or distribution of the item difficulties for that data set. In this study, two replications of each data set were created.

The resulting sets of simulated response patterns were analyzed by two methods. The first was calibration and item analysis using the MSCALE program (Wright, Rossner, & Congdon, 1985). This provided the Rasch item difficulties and the unweighted total fit statistic (OUTFIT) for each item. The unweighted total fit statistic is based on the standardization of the difference between the person's observed score on an item and the probability of a correct response, based on the performance of the total calibration sample on the item and the person's total score on the test (Wright & Stone, 1979; Smith, 1986). The squared standardized residual is summed over all persons who took the item and converted to a mean square by dividing by the number of persons:

\[ MS(UT)_i = \frac{1}{N} \sum_{n=1}^{N} \frac{(x_{ni} - P_{ni})^2}{P_{ni}(1 - P_{ni})} \]

where N is the number of persons, $x_{ni}$ is the scored response (1, 0) of person n to item i, and $P_{ni}$ is the probability of a correct response for person n to item i. This mean square (MS(UT)) is then converted to an approximate unit normal using the cube root transformation. Values of the unweighted total fit statistic greater than 2 generally indicate a person has unexpected responses in his or her response pattern: easy items answered incorrectly for higher ability persons or hard items answered correctly for lower ability persons.

The second analysis was principal component factor analysis using SAS. This provided an estimate of the number of factors contained in each data set and factor loadings for each item. In the case of the Rasch analysis, the magnitude and the variance of the outfit statistics were used to assess unidimensionality. In the case of factor analysis, the eigen values for each factor and the factor loadings for each item were used to assess unidimensionality.

Table 18-1 contains the equations used to create the correlated abilities. A total of 10 different conditions (nine different amounts of common variance and no common variance) were developed with two sets of correlated abilities generated for each condition. The expected correlation between the two sets of ability based on the amount of common variance is also listed. Finally, the observed correlation between the two sets of ability is listed. For Tables 18-3 through 18-5, the results represent the average of two replications based on the two sets of correlated person abilities reported in Table 18-1.
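The data generation steps and the unweighted mean square described above can be sketched as follows. This is an illustrative sketch, not the SYSTAT, BASIC, or MSCALE procedures used in the study. The function names and the seed are hypothetical, and the fit statistic here is evaluated with the generating abilities and difficulties rather than with estimated parameters, which is a simplifying assumption.

```python
import math
import random

def correlated_abilities(a, n_persons, rng):
    """Draw independent unit normal X and Z, then form Y = aX + (1 - a)Z."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    zs = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    ys = [a * x + (1.0 - a) * z for x, z in zip(xs, zs)]
    return xs, ys

def rasch_probability(ability, difficulty):
    """Dichotomous Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def simulate_responses(abilities, difficulties, rng):
    """Score 0 when the uniform draw exceeds the probability, otherwise 1."""
    data = []
    for ability in abilities:
        row = []
        for d in difficulties:
            p = rasch_probability(ability, d)
            row.append(0 if rng.random() > p else 1)
        data.append(row)
    return data

def outfit_mean_square(responses, probabilities):
    """Mean over persons of the squared standardized residual for one item."""
    z2 = [(x - p) ** 2 / (p * (1.0 - p)) for x, p in zip(responses, probabilities)]
    return sum(z2) / len(z2)

def pearson(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    su = math.sqrt(sum((p - mu) ** 2 for p in u))
    sv = math.sqrt(sum((q - mv) ** 2 for q in v))
    return cov / (su * sv)

rng = random.Random(0)                       # hypothetical seed
a = 0.64                                     # one of the nine common variance values
xs, ys = correlated_abilities(a, 1000, rng)
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]   # one set of five item difficulties
x_data = simulate_responses(xs, difficulties * 7, rng)   # 35 X factor items
y_data = simulate_responses(ys, difficulties * 3, rng)   # 15 Y factor items

# Unweighted mean square for the third X item (difficulty 0.0).
item = [row[2] for row in x_data]
probs = [rasch_probability(x, difficulties[2]) for x in xs]
ms_ut = outfit_mean_square(item, probs)

if __name__ == "__main__":
    print(pearson(xs, ys))
    print(ms_ut)
```

When the probabilities match the process that generated the responses, each squared standardized residual has expectation 1, so the mean square should fall near 1 for an item that fits.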

ASSESSING UNIDIMENSIONALITY FOR RASCH MEASUREMENT


Table 18-1
Correlation Between Independent Ability and Correlated Ability

                                                    Observed
            Generating        Expected              Correlation
Data Set    Equation          Correlation           Simulation 1
0           .00X + 1.00Y      .00                   .07
1           .01X + .99Y       .10                   .08
2           .04X + .96Y       .20                   .11
3           .09X + .91Y       .30                   .17
4           .16X + .84Y       .40                   .26
5           .25X + .75Y       .50                   .39
6           .36X + .64Y       .60                   .55
7           .50X + .50Y       .71                   .72
8           .64X + .36Y       .80                   .89
9           .75X + .25Y       .87                   .96

Observed Correlation, Simulation 2 (only nine of the ten values are legible, so they are listed in order rather than aligned to rows): -.05, -.02, .06, .13, .31, .49, .42, .88, .95
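The generating equations in Table 18-1 can be read as variance weights: the expected correlations in the table are the square roots of the X weights (for example, .25 gives .50). Under that reading, which is an interpretation rather than something the table states outright, a correlated ability can be sketched as follows. The function name, the seed, and the sample size of 2,000 are hypothetical choices for illustration.

```python
import numpy as np

def correlated_ability(x, y, common_var):
    """Mix two independent standard-normal abilities so that the result
    shares `common_var` of its variance with x (assumed interpretation)."""
    return np.sqrt(common_var) * x + np.sqrt(1.0 - common_var) * y

rng = np.random.default_rng(1)
x = rng.standard_normal(2000)   # independent ability X
y = rng.standard_normal(2000)   # independent ability Y

# Common variance levels for data sets 0 through 9 in Table 18-1.
levels = (0.00, 0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.50, 0.64, 0.75)
for c in levels:
    z = correlated_ability(x, y, c)
    expected = np.sqrt(c)                  # theoretical correlation with X
    observed = np.corrcoef(x, z)[0, 1]     # sample correlation with X
    print(f"{c:.2f}  expected {expected:.2f}  observed {observed:.2f}")
```

Because the observed value is a sample correlation, it will differ somewhat from the expected value in any single replication, as the observed columns of Table 18-1 also show.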

RESULTS

The interpretation of the factor analytic results depends in large part on the choice of the critical value of the eigen values. To determine the best value to be used, a set of single factor data was created. The results, shown in Table 18-2, indicated that there were a considerable number of factors identified with eigen values greater than 1.0. However, the eigen values for the second component never exceeded 1.40 in the four simulations of unidimensional data. Consequently, the value 1.4 was chosen to determine the presence of a second factor in the two factor simulations.

Table 18-2
Results of Principal Component Analysis: Unidimensional Data

                          Eigen Values
Data Set   No. of Items   Factor 1   Factor 2   Factor 3   N > 1
0-1        50             8.51       1.26       1.23       13
0-2        50             7.75       1.33       1.25       15
0-3        50             8.69       1.21       1.19       13
0-4        50             8.43       1.26       1.22       13

The results of the principal component factor analysis are presented in Table 18-3. Overall, the factor analytic technique was able to detect the presence of two factors at all variations in the number of X factor


SMITH & MIAO

Table 18-3
Results of Principal Component Analysis: Multidimensional Data

           Number of Items   Eigen Values
Data Set   Y vs. X           Factor 1   Factor 2
1          5-45              8.50       1.71
1          10-40             7.65       2.52
1          15-35             6.73       3.23
1          20-30             5.93       4.09
1          25-25             5.25       4.52
2          5-45              8.36       1.61
2          10-40             7.65       2.43
2          15-35             6.94       3.14
2          20-30             5.80       3.79
2          25-25             5.40       4.31
3          5-45              8.50       1.61
3          10-40             7.83       2.29
3          15-35             6.99       2.89
3          20-30             6.14       3.60
4          5-45              8.59       1.49
4          10-40             7.72       2.01
4          15-35             7.05       2.64
4          20-30             6.19       3.20
5          5-45              8.51       1.48
5          10-40             7.54       1.88
5          15-35             7.16       2.34
5          20-30             6.46       2.76
6          5-45              8.66       1.40
6          10-40             8.00       1.67
6          15-35             7.16       2.02
6          20-30             6.66       2.08
7          5-45              8.28       1.24
7          10-40             7.52       1.48
7          15-35             7.36       1.61
7          20-30             6.70       1.71
8          5-45              8.77       1.21
8          10-40             8.47       1.29
8          15-35             8.12       1.31
8          20-30             7.50       1.37
9          5-45              8.88       1.22
9          10-40             8.43       1.22
9          15-35             8.28       1.28
9          20-30             7.95       1.29

The row alignment of the remaining columns is not recoverable; their values are listed in the order in which they appear.

Eigen Values, Factor 3: 1.22 1.20 1.21 1.20 1.21 1.19 1.24 1.22 1.18 1.26 1.19 1.21 1.21 1.21 1.19 1.24 1.21 1.21 1.21 1.23 1.29 1.24 1.20 1.29 1.24 1.26 1.23 1.28 1.29 1.28 1.17 1.21 1.26 1.30 1.17 1.19 1.20 1.25

N > 1: 12 12 11 13 12 12 13 12 12 14 12 13 13 12 13 13 13 14 12 14 12 14 12 13 14 15 13 15 14 15 12 14 13 15 13 14 14 14

Percent Correctly Loaded on Factor 1: 100 98 97 100 96 100 98 97 97 92 100 98 100 100 100 100 100 100 100 100 100 97 98 100 100 100 100 95 100 93 100 100 100 97 100 100 97 100

Percent Correctly Loaded on Factor 2: 100 100 100 100 68 100 100 100 100 68 100 100 93 100 100 100 93 75 80 90 80 75 80 80 53 50 40 70 40 35 40 20 7 15 0 0 7 10


and Y factor items for the following common variance levels: .01, .04, .09, .16, and .25 (data sets 1 to 5). For the .36 and the .50 common variance levels (data sets 6 and 7), the eigen values were lower than 1.40 for the 45-5 item ratio. For the .64 and .75 common variance levels (data sets 8 and 9), the eigen values were less than 1.40 for all of the item ratio levels. Also summarized in Table 18-3 is the percentage of items that loaded on the appropriate factor for each of the replications. As the proportion of common variance between ability X and Y increased, the principal component analysis was less able to assign the Y ability items correctly to that factor.

The interpretation of the Rasch item fit statistics was accomplished by comparing the value of the individual fit statistics to a critical value of +2.00 (a commonly used value that corresponds approximately to a Type I error rate of .05; Smith, 1991). The presence of a second factor was determined by evaluating the number of X factor items with a fit value greater than +2.00 and the number of Y factor items with a fit value greater than +2.00. The results of this analysis are summarized in Table 18-4.

For all levels of common variance involving 45 items on the X factor and 5 items on the Y factor, the percentage of the X factor items that had a fit value greater than +2.00 was at or below the Type I error rate, and 100% of the items on the Y factor had fit values greater than +2.00. For the 40 X factor and 10 Y factor item comparisons across levels of common variance, 90% or more of the Y factor items had fit values greater than +2.00, while the percentage of X factor items with fit values greater than +2.00 was less than the Type I error rate. The only exception was the 75% common variance level (data set 9), where only 35% of the Y factor items had fit values greater than +2.00.
For the 35 item X factor and 15 item Y factor comparisons across the nine levels of common variance, the percentage of X factor items with fit values greater than +2.00 remained less than the Type I error rate, while the percentage of Y factor items with values greater than +2.00 averaged over 90% up to the 25% common variance level (data set 5). Above 50% common variance (data sets 8 and 9), the percentage of Y factor items with values greater than +2.00 dropped to less than 50%. For the 30 item X factor and 20 item Y factor comparisons, the percentage of X factor items with values greater than +2.00 was less than or equal to the Type I error rate for this statistic. The percentage of Y factor items with values greater than +2.00 never exceeded 60% and dropped to 25% for the 75% common variance level (data set 9). For the 25 item X factor and 25 item Y factor comparisons for the first two levels of common variance, both the percentage of X factor items and Y factor


Table 18-4
Results of Rasch Fit Analysis: Multidimensional Data

Data  Items     Y Item Fit           X Item Fit           Total Test Item Fit
Set   Y vs. X   Mean   S.D.  %>2     Mean   S.D.  %>2     Mean   S.D.  %>2
1     5-45      8.40   1.53  100     -1.10  1.09  0       -0.15  3.10  10
1     10-40     6.08   1.37  100     -1.64  0.69  0       -0.09  3.23  20
1     15-35     3.69   1.08  100     -1.69  0.94  0       -0.07  2.66  30
1     20-30     1.52   0.83  40      -1.08  0.69  0       -0.04  1.49  16
1     25-25     -0.56  0.79  0       0.13   0.80  0       -0.09  0.79  0
2     5-45      7.88   0.86  100     -0.93  0.76  0       -0.05  2.77  10
2     10-40     5.91   1.00  100     -1.70  0.91  0       -0.17  3.21  20
2     15-35     3.59   1.16  93      -1.66  0.82  0       -0.10  2.60  28
2     20-30     1.76   0.91  40      -1.30  0.96  0       -0.08  1.76  16
2     25-25     0.20   1.14  8       -0.23  0.78  0       -0.01  0.99  4
3     5-45      8.50   1.88  100     -0.94  1.00  0       0.01   3.06  10
3     10-40     6.20   1.45  100     -1.69  0.84  0       -0.11  3.32  20
3     15-35     4.15   0.94  100     -1.93  1.01  0       -0.10  2.98  30
3     20-30     1.90   0.85  32      -1.34  0.84  0       -0.04  1.80  14
4     5-45      8.54   1.60  100     -0.96  0.87  0       0.01   3.04  10
4     10-40     5.75   1.37  100     -1.58  0.97  0       -0.11  3.13  20
4     15-35     3.79   1.03  100     -1.70  1.01  0       -0.05  2.74  30
4     20-30     1.77   0.82  40      -1.13  1.06  3       0.02   1.72  18
5     5-45      7.04   1.53  100     -0.79  0.68  0       0.00   2.54  10
5     10-40     5.00   1.42  100     -1.44  0.82  0       -0.15  2.77  20
5     15-35     3.85   1.14  93      -1.87  0.80  0       -0.16  2.81  28
5     20-30     1.91   0.90  35      -1.45  1.04  0       -0.10  1.93  14
6     5-45      5.40   1.11  100     -0.78  0.92  0       -0.17  2.09  10
6     10-40     4.20   1.12  100     -1.25  1.03  0       -0.15  2.43  20
6     15-35     3.15   1.59  73      -1.47  0.91  0       -0.09  2.42  22
6     20-30     2.21   0.99  55      -1.59  0.77  0       -0.07  2.06  22
7     5-45      4.80   0.68  100     -0.60  0.86  0       -0.01  1.83  10
7     10-40     3.08   1.14  90      -0.82  0.97  0       -0.03  1.86  18
7     15-35     2.55   0.68  87      -1.18  0.74  0       -0.07  1.87  26
7     20-30     1.89   0.91  45      -1.19  1.08  3       0.04   1.82  20
8     5-45      4.12   1.48  100     -0.44  0.86  0       0.01   1.65  10
8     10-40     2.81   1.16  90      -0.75  0.89  0       -0.04  1.71  18
8     15-35     1.95   0.99  40      -0.90  0.87  0       -0.04  1.59  12
8     20-30     1.57   1.03  30      -1.11  0.79  0       -0.04  1.59  12
9     5-45      2.34   0.31  100     -0.29  1.07  4       -0.03  1.30  10
9     10-40     1.30   0.97  30      -0.40  1.05  2       -0.05  1.23  8
9     15-35     1.22   0.96  20      -0.54  0.91  0       0.02   1.22  6
9     20-30     1.49   0.78  25      -0.92  0.95  0       0.03   1.49  10

items with fit values greater than +2.00 was very near the Type I error rate. Table 18-4 also summarizes the overall fit values for the entire set of 50 items—that is, X and Y factor item fit statistics combined. In no case does the absolute value of the mean fit value for the test exceed


Table 18-5
Recommended Procedure to Detect Multidimensionality

           No. of Items   Prin. Components   Rasch Item
Data Set   Y vs. X        Factor Analysis    OUTFIT
1          5-45           yes                yes
1          10-40          yes                yes
1          15-35          yes                yes
1          20-30          yes                No
1          25-25          yes                No
2          5-45           yes                yes
2          10-40          yes                yes
2          15-35          yes                yes
2          20-30          yes                No
2          25-25          yes                No
3          5-45           yes                yes
3          10-40          yes                yes
3          15-35          yes                yes
3          20-30          yes                No
4          5-45           No                 yes
4          10-40          yes                yes
4          15-35          yes                yes
4          20-30          yes                No
5          5-45           No                 yes
5          10-40          yes                yes
5          15-35          yes                yes
5          20-30          yes                No
6          5-45           No                 yes
6          10-40          yes                yes
6          15-35          yes                yes
6          20-30          yes                No
7          5-45           No                 yes
7          10-40          No                 yes
7          15-35          yes                yes
7          20-30          yes                No
8          5-45           No                 yes
8          10-40          No                 yes
8          15-35          No                 No
8          20-30          No                 No
9          5-45           No                 yes
9          10-40          No                 No
9          15-35          No                 No
9          20-30          No                 No

.20. There is considerable variation in the standard deviation of the fit values, but for the 64% and 75% common variance simulations (data sets 8 and 9) the standard deviation of the fit values approaches the expected standard deviation of the null distribution (1.00). Table 18-5 combines the results of the principal component analysis


and the Rasch fit analysis. For each of these two techniques, over each of the simulated data sets and combinations of X and Y factor items, a decision was made as to whether that method was appropriate to detect multidimensionality for that combination of X and Y factor items and common variance. The criterion used to make this yes/no decision was an eigen value greater than 1.5 for the second factor in the principal component analysis, or more than 60% of the Y factor items identified as misfitting for the Rasch item analysis method. Both of the decision points were reached on an ad hoc basis, and no attempt was made to determine if they were equivalent.

The results suggest that the principal component and the Rasch item fit approaches are not sensitive to the same combinations of common variance and the number of items represented on the second factor. These results strongly indicate that in cases of a second factor with less than 64% common variance (data sets 1 through 7), the factor analytic procedure will detect the factor as long as 20% or more of the items load on that factor. If less than 20% of the items load on that factor, the technique is much less sensitive to the presence of the second factor. For data with 64% and higher common variance, the factor analytic procedure identifies only a single factor, no matter what proportion of the items load on the second factor.

These results are almost the opposite of the Rasch fit values based on the unweighted total item fit statistic. The Rasch fit statistic is sensitive to the second factor until approximately 30% of the items load on the second factor for data sets 1 through 7, until approximately 20% of the items load on the second factor in data set 8, and until approximately 10% of the items belong to the second factor in data set 9.
If the percentage of items on the second factor was above that level, then the fit statistic was generally unable to detect multidimensionality, no matter what the degree of correlation between the two factors.

CONCLUSIONS

If one can assume that the original objective of the test construction process was to produce a unidimensional measure, it would be unusual to find that the test had approximately equal numbers of items on two relatively uncorrelated factors. Rather, one would expect to find the majority of the items on one factor and relatively few items on the second factor. It is also reasonable to expect that the second factor would be highly correlated with the primary factor. These are exactly the cases where the factor analytic method is inappropriate. If, in fact,


there had been equal numbers of items on uncorrelated factors, there would be reason to believe that the test developers had little understanding of the underlying construct that the test was designed to measure. Thus, although the factor analytic method detected the second factor in slightly more cases in these simulations, the Rasch item fit approach performed better in the simulations that most closely resembled the expectations discussed above for departures from an intended unidimensional test.

However, a prudent practice would be to use the two methods to complement each other, thus assuring the widest possible coverage of different combinations of common variance and proportion of items loading on the second factor. Further, it should be realized that neither of the procedures worked well when more than 30% of the items loaded on a second factor that had more than 64% common variance with the first factor. In situations like this, the important question is whether the test is functionally unidimensional despite the presence of two factors.
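The recommendation to use the two methods together, with the yes/no criteria applied in Table 18-5, can be sketched as a pair of checks. This is a minimal illustration, not the SAS or MSCALE procedure used in the study: the function names and the `suspect_items` argument (the indices of the items thought to belong to a second factor) are hypothetical, and the cutoffs of 1.5 and 60% are the ad hoc values reported in the Results.

```python
import numpy as np

def second_eigenvalue(responses):
    """Second-largest eigenvalue of the inter-item correlation matrix.
    Assumes every item column has some variance."""
    r = np.corrcoef(responses, rowvar=False)
    eig = np.linalg.eigvalsh(r)        # eigenvalues in ascending order
    return float(eig[-2])

def share_misfitting(fit_t, item_idx, critical=2.0):
    """Proportion of the chosen items whose standardized fit exceeds `critical`."""
    fit_t = np.asarray(fit_t, dtype=float)
    return float(np.mean(fit_t[item_idx] > critical))

def multidimensionality_flags(responses, fit_t, suspect_items,
                              eig_cut=1.5, misfit_cut=0.60):
    """Apply both criteria and report each one separately."""
    return {
        "principal_components": second_eigenvalue(responses) > eig_cut,
        "rasch_outfit": share_misfitting(fit_t, suspect_items) > misfit_cut,
    }

# Hypothetical example: six items forming two unrelated clusters of three.
rng = np.random.default_rng(2)
a = rng.integers(0, 2, size=1000)
b = rng.integers(0, 2, size=1000)
responses = np.column_stack([a, a, a, b, b, b])
fit_t = [-1.2, -0.8, -1.0, 3.1, 2.6, 2.9]   # made-up standardized fit values
print(multidimensionality_flags(responses, fit_t, suspect_items=[3, 4, 5]))
```

Reporting the two flags separately, rather than combining them into one verdict, mirrors the finding that the methods are sensitive to different combinations of common variance and item proportions.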

REFERENCES

Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage Publications.

Hattie, J. (1985). Methodological review: Assessing unidimensionality for tests and items. Applied Psychological Measurement, 9, 139-164.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (expanded ed.). Chicago: The University of Chicago Press. (Original work published 1960)

Smith, R.M. (1986). Person fit in the Rasch model. Educational and Psychological Measurement, 46, 359-372.

Smith, R.M. (1991). The distributional properties of Rasch item fit statistics. Educational and Psychological Measurement, 51, 541-565.

Wright, B.D., Rossner, M., & Congdon, R.T. (1985). MSCALE: A Rasch program for ordered categories. Chicago: MESA Press.

Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.

Wright, B.D., & Stone, M. (1979). Best test design. Chicago: MESA Press.



Table of Contents

Preface
Acknowledgments

Part I. Historical and Philosophical Perspectives

1. Fundamental Measurement and the Fundamentals of Rasch Measurement
   Wim van der Linden
2. The Relevance of the Classical Theory of Measurement to Modern Psychology
   Joel Michell
3. The Rasch Debate: Validity and Revolution in Educational Measurement
   William P. Fisher, Jr.
4. Historical Views of the Concept of Invariance in Measurement Theory
   George Engelhard, Jr.

Part II. Practice

5. Computer Adaptive Testing: A National Pilot Study
   Mary E. Lunz and Betty A. Bergstrom
6. Reliability of Alternate Computer-adaptive Tests
   Mary E. Lunz, Betty A. Bergstrom, and Benjamin D. Wright
7. The Equivalence of Rasch Item Calibrations and Ability Estimates Across Modes of Administration
   Betty A. Bergstrom and Mary E. Lunz
8. Constructing Measurement with a Many-facet Rasch Model
   John Michael Linacre
9. Development of a Functional Assessment that Adjusts Ability Measures for Task Simplicity and Rater Leniency
   Anne G. Fisher
10. Measuring Chemical Properties with the Rasch Model
    Thomas Rehfeldt
11. Impact of Additional Person Performance Data on Person, Judge, and Item Calibrations
    John Stahl and Mary Lunz

Part III. Theory

12. Local Independence: Objectively Measurable or Objectionably Abominable?
    Robert J. Jannarone
13. Objective Measurement with Multidimensional Polytomous Latent Trait Models
    Henk Kelderman
14. When Does Misfit Make a Difference?
    Raymond Adams and Benjamin D. Wright
15. Comparing Attitude Across Different Cultures: Two Quantitative Approaches to Construct Validity
    Mark Wilson
16. Consequences of Removing Subjects in Item Calibration
    Patrick S.C. Lee and Hoi K. Suen
17. Item Information as a Function of Threshold Values in the Rating Scale Model
    Barbara G. Dodd and Ralph J. DeAyala
18. Assessing Unidimensionality for Rasch Measurement
    Richard M. Smith and Chang Y. Miao

Author Index
Subject Index

Preface

This volume is the second in a series that collects together papers presented at successive International Objective Measurement Workshops (IOMW). These workshops bring together researchers from all over the world to discuss, debate, and gossip about recent developments in the area of measurement in the social sciences generally, and, more specifically, developments within the community of researchers who see a special place for the measurement approach based on the ideas of Georg Rasch. This "special place" is evidenced by the frequent mention throughout the volume of Rasch himself, of the family of models named in his honor, and of the concept of specific objectivity, a term that he coined and that is perhaps his most significant contribution to the theory and practice of measurement.

Within this framework, new philosophical perspectives are discussed in chapters by Wim van der Linden and William Fisher. In the area of practice, two major clusters of new work are reported on in the volume: Mary Lunz, Betty Bergstrom, and Benjamin Wright describe a national pilot study of computer adaptive testing in professional licensure; and Michael Linacre introduces three chapters by Anne Fisher, Thomas Rehfeldt, and John Stahl and Mary Lunz that describe applications of a type of Rasch model called a facet model. Theoretical advancements in the area are reported by Henk Kelderman, Raymond Adams and Ben Wright, Barbara Dodd and Ralph DeAyala, and Richard Smith and Chang Miao.

The workshops do not exclusively focus on such work, however. Alternative perspectives are a frequent and important part of the presentations and discussions that take place at the workshops. In this volume, Joel Michell and George Engelhard, Jr., advance philosophical and historical perspectives that take a broader view, and the papers by Robert Jannarone, Mark Wilson, and Patrick Lee and Hoi Suen explicitly attempt to make connections outside the Rasch framework.


The chapters are largely drawn from those presented at the sixth IOMW, held at the University of Chicago in April 1990 and organized by Mary Lunz of the American Society of Clinical Pathologists. This is not the only source for chapters, however. One of the chapters (my own) was presented in only partially complete form at the fifth IOMW, and one other (by Wim van der Linden) is based on a debate at the American Educational Research Association annual meeting held immediately after the workshop. I hope that their inclusion will encourage contributions from authors who have either completed work that was not quite ready for publication immediately after past workshops (a virtual requirement for inclusion, given the tight time constraints associated with publication), or who have recently finished an appropriate paper, but, for whatever reason, did not present it at a workshop.

Acknowledgments

I would like to acknowledge the work of the Rasch Measurement Special Interest Group of the American Educational Research Association for putting together the Sixth International Objective Measurement Workshop, which was the source of most of these chapters. In particular, I would like to recognize the sterling work of John Michael Linacre and Mary Lunz in this regard. The subject index for this book was compiled by my wife, Janet Susan Williams, with the help of the chapter authors: Thank you, Janet, for persisting with our sometimes strange topics and concerns, and for enhancing the quality of the book in so fine a way.

Part I

Historical and Philosophical Perspectives

Chapter 1

Fundamental Measurement and the Fundamentals of Rasch Measurement

Wim J. van der Linden
University of Twente

To many of us, the natural sciences are the example upon which the social and behavioral sciences should be modeled. To some of us who are not fully aware of the daily research practice in the natural sciences, this conviction seems to take the form of a simple, inductivistic recipe in which the first concern is to measure the variables of interest on a quantitative scale. Once this basic step is taken, the ultimate goal is to discover universal laws in the measurements and to present them in mathematical form. Others, however, more aware of the important role that imagination plays in research, view measurements as the "hard" facts against which theoretical speculations have to be tested. To both parties, it would probably be a shock to read Campbell's (1928) book on scientific measurement, noting that according to this authoritative text the distinction between theory and measurement as two distinct realms is wrong and misleading. Just as with normal substantive research, measurement proceeds by establishing natural laws and empirically verifying their truth.

Campbell wrote his book because he was not pleased with the usual definition of measurement as "the process of assigning numbers to objects to represent their properties" (p. 1). According to Campbell such statements abound in textbooks on physics, but they are by no means true and show that even physicists at the front line of research may lack a thorough understanding of what measurement is about and how quantitative variables are established. The book had an immediate impact on scientists as well as philosophers of science, and has been the indisputable standard reference in discussions about measurement ever since. It took four decades before someone else (Ellis, 1966) dared to write a new monograph about measurement in the sciences, a monograph based on the same foundations, though, as those laid by Campbell.

One of Campbell's main points is the reminder that variables should not be conceived of as a generalization of our visual experience of physical length (that is, as an "empirical line") but as a set of physical objects with certain relations defined on it. For the variable to be quantitative, these relations should order the objects and define an operation of "addition" on them. The relations form an hypothesis that has to be verified, just as we had to verify, for example, the relations between objects implied by Boyle's law before we were able to consider it a genuine natural law. Once verified, we usually single out a particular object as the unit against which the others are compared to measure them. The choice of a unit is a practical issue; we mostly select some object that is convenient to us, for example, our feet when we pace out a distance.

Measurement that can be defined and verified in this way is called fundamental measurement. That measurement is theory based and that the theory involved has to go through a process of prediction and confirmation is demonstrated by those physical properties for which it has not been possible to verify the hypothesis of a quantitative variable. A well-known example in physics is Mohs' definition of hardness.
It is possible to order the hardness of physical objects by the operation of scratching and observing which object in the set scratches which other object, but for this operation it has not been possible to verify the relations implied by the addition operation, and we are still not able to measure hardness fundamentally. Fortunately, though, in such cases quantitative measurement may be possible by a process called derived measurement: Using proven numerical laws between the variable concerned and other variables that can be measured fundamentally, we may be able to calculate quantitative measurements for the former even if it cannot be measured itself in a direct or fundamental fashion. An obvious example is the measurement of temperature by the length of a column of mercury in a classic thermometer. In derived measurement, again, the keyword is relations.

For relatively new fields such as education and psychology, it has been tempting to try to emulate the success of the natural sciences by looking for the possibility of fundamental measurement. In particular, for a long period the quest was for psychological equivalents of the addition operation. (The precise properties of this operation, called the concatenation operation, will be explored later in this chapter.) This quest did not meet with success, though, and at a certain stage many doubted if quantitative measurement, and hence the establishment of psychology as a mature science, would be possible at all. An excellent historiography of this episode is given in Michell (1990).

A major step forward was taken by Luce and Tukey (1964), when they showed that variables can be tested for quantitativeness in the absence of any empirical concatenation operation. The example used by Luce and Tukey was the case of additive conjoint measurement. The principle underlying the example, namely that the nature of the variable follows from the measurement model for which testable consequences have been shown to hold against empirical data, is not unique to additive conjoint measurement and also applies, for example, to such modern developments in educational and psychological testing as item response models. The present chapter focuses on these models.

In the following we will first explore Campbell's notions of fundamental and derived measurement a little further. The emphasis is not on a careful, formal treatment, but on a rather loose discussion of the insights that led Campbell to his basic notions. The next part of the chapter raises an analogous problem for the behavioral sciences: How to found educational and psychological testing as a discipline of quantitative measurement in the absence of fundamental measurement operations. The chapter ends with a discussion of the fundamentals of Rasch measurement and seeks to define its unique position in educational and psychological measurement.

FUNDAMENTAL MEASUREMENT

Campbell's analysis of measurement can be summarized by the statement that establishing quantitative variables is a theoretical issue involving natural laws and that these laws have to be verified before the variable can be considered to be truly quantitative. It is now time to further explore the nature of these laws and to see how they can be tested. Ideally, for a variable to be quantitative three different types of laws have to hold. If these laws can be confirmed, the variable is directly or fundamentally measurable. Other variables may be measurable by the principle of derived measurement to be discussed later, or, according to Campbell, they are not quantitatively measurable at all.


As already observed, it is tempting to think of a physical variable as an empirical line. Our most immediate experience of the physical reality is one of objects showing different lengths in one, two, or three dimensions. Hence, it is not without reason that length is our intuitive model of any physical variable, a fact that is reinforced by our daily meetings with graphs and diagrams that map all kinds of physical variables as geometric lines. However, a more fruitful idea of a physical variable is one of a set of objects with a relational structure. The variable temperature, for instance, is given by the way such physical objects as the sun, my oven, John's ice cream, and the cup of coffee I had this morning relate to each other. If I enlarge this set to include all past, present, and future objects, then relations of "equality," "difference," "more than," and "less than" between these objects define the variable temperature. Of course, the variable weight is defined by a different collection of relations between the same objects, but the basic point is that the variables temperature and weight do not have any physical meaning over and above these two collections of relations.

Campbell's first two laws of measurement specify two different types of relations. Let capitals A, B, C, ... denote the objects in the set. The first law of measurement specifies an order relation for the set. Let the order relation between objects A and B be denoted by A >E B. Although this notation reminds us of the symbol that is used to denote the "larger than" relation between numbers in mathematics, no reference whatsoever to mathematical entities is intended. For this reason the subscript E is added. For an order relation, the following properties have to hold for all possible pairs of objects:

As an example, the reader may think of the relation "longer than," which defines the variable length. The first proposition states that if A is longer than B and B is longer than C, then A is longer than C. The other two propositions can be interpreted similarly. We are now able to formulate the first law:

First Law of Measurement (Order Relation). All pairs of objects obey the properties of the order relation defined in (1) through (3).

Examples. Weight, length, and period of time can be mentioned as examples of variables obeying this law of measurement.
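The empirical character of the first law can be made concrete with a small sketch. It checks only transitivity, the one property spelled out in words above, on recorded outcomes of direct comparisons. The object names, the comparison data, and the function name `transitivity_violations` are hypothetical and serve only as illustration.

```python
from itertools import permutations

def transitivity_violations(objects, exceeds):
    """Return triples (a, b, c) with a > b and b > c observed, but not a > c.

    `exceeds` is a set of ordered pairs (x, y) meaning "x was observed to
    exceed y" in a direct empirical comparison (hypothetical data).
    """
    return [
        (a, b, c)
        for a, b, c in permutations(objects, 3)
        if (a, b) in exceeds and (b, c) in exceeds and (a, c) not in exceeds
    ]

# Hypothetical length comparisons: rod > pencil > nail, fully consistent.
lengths = {("rod", "pencil"), ("pencil", "nail"), ("rod", "nail")}
length_violations = transitivity_violations(["rod", "pencil", "nail"], lengths)

# Hypothetical comparisons containing a cycle: x beats y, y beats z, z beats x.
# No ordering of the three objects is consistent with these outcomes.
cyclic = {("x", "y"), ("y", "z"), ("z", "x")}
cyclic_violations = transitivity_violations(["x", "y", "z"], cyclic)
```

An empty list of violations does not prove the law; it only means that the comparisons made so far have not refuted it.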


Objects can be ordered with respect to length by direct comparison. Similarly, objects can be ordered by weight using direct comparison on a balance. Another example is time; we are able to order periods of time by direct comparison (provided they begin simultaneously). A counterexample is Mohs' hardness. Mohs' scratching operation, already discussed above, orders objects only partially with respect to hardness, due to the fact that objects exist with scratching relations that do not obey the axioms. For a well-known psychological variable such as intelligence, procedures for ordering human beings by direct comparison usually seriously violate the transitivity property of the order relation defined in (1). Measurement procedures based on direct comparison are therefore unable to yield quantitative measurements of intelligence.

Weight is a nice example to illustrate that the first law of measurement, as well as the two laws to be introduced below, describes no isolated aspects of nature. To be able to verify the first law of measurement, other laws are involved too; for example, laws relating the behavior of balances to such physical variables as gravity, air turbulence, and buoyancy, or mechanical laws governing the operation of the balance. Without knowledge of such laws it would never be possible to confirm the order relation in (1) through (3) for sets of physical objects.

In addition to an order relation, a set of objects has to meet an empirical relation of additivity to form a quantitative variable. The term concatenation operation has been introduced to emphasize that an empirical operation is meant, and not the arithmetical operation on numbers.
Examples of concatenation operations are: putting more than one object on the scale of a balance to compare their combined property with other objects, putting electrical resistances in a series to compare the resistance of this new object with other objects, or placing two objects end to end in a line to compare their length with those of other objects. Concatenation operations are defined by a set of relations between objects. Let A +E B denote a new object that is produced by a concatenation operation. Again, this notation is somewhat misleading in that it reminds us of the addition operation in arithmetic, but the subscript E is added to emphasize that an empirical and not a mathematical operation is intended. Later, if we assign numbers to measure quantitative variables, the rules of measurement will map this concatenation operation on the mathematical operation of addition. Now the following set of relations defines the concatenation operator:


The meaning of these relations is obvious. Relations (4) and (5) show that the order in which the objects are combined does not influence the results. Relations (6) and (7) relate the results of concatenation operations to the properties of order relations in (1) through (3). It should be noted that the set of conditions formulated in (4) through (5) is somewhat outdated and idiosyncratic. Modern versions can be found in algebraic texts axiomatically defining the formally equivalent operation of addition.

Second Law of Measurement (Additivity). All objects obey the properties of the additivity relation defined in (4) through (7).

Examples. Weight, length, and period of time were given earlier as examples of variables for which the order relation in (1) through (3) can be verified empirically. The same holds for these variables with respect to (4) through (7). Intelligence as measured by an IQ test is an example of a psychological variable for which we do not have a concatenation operation. Evidently, if two subjects work together on the test, the IQ for their concerted effort is not equal to the sum of their individual IQs. The properties in (4) through (7) provide the criterion by which we could empirically test a candidate for the concatenation operation for intelligence, if somebody proposed a new one. Again, the axioms in (4) through (7) may seem trivial just because we abstract from physical reality. However, it is emphasized again that measurement axioms can only be tested if embedded in a larger theory relating the physical variable of interest to relevant other variables. For example, we would never be able to verify (4) through (7) for the concatenation of objects on a balance if we were not able to use physical theory to control or correct for interferences between the results for the left-hand and right-hand sides of (4) through (7) due to, for instance, gravitational variation or mechanical friction.
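How a candidate concatenation operation might be put to the test can be sketched as follows. The four properties checked (commutativity, associativity, equals added to equals giving equals, and a whole exceeding its part) are an assumed reading of the verbal description of relations (4) through (7), not a transcription of them. The weights, the scores, and the "team score" candidate for intelligence are hypothetical.

```python
from itertools import product

def check_concatenation(objects, concat, equal, greater):
    """Check four assumed additivity properties on every pair and triple."""
    ok = {"commutative": True, "associative": True,
          "equals_to_equals": True, "whole_exceeds_part": True}
    for a, b in product(objects, repeat=2):
        if not equal(concat(a, b), concat(b, a)):
            ok["commutative"] = False
        if not greater(concat(a, b), a):
            ok["whole_exceeds_part"] = False
    for a, b, c in product(objects, repeat=3):
        if not equal(concat(concat(a, b), c), concat(a, concat(b, c))):
            ok["associative"] = False
        if equal(a, b) and not equal(concat(a, c), concat(b, c)):
            ok["equals_to_equals"] = False
    return ok

def same(x, y):
    return x == y

def exceeds(x, y):
    return x > y

# Hypothetical weights in grams; the integers stand in for balance outcomes.
weights = [50, 50, 120, 300]
weight_check = check_concatenation(weights, lambda a, b: a + b, same, exceeds)

# Hypothetical candidate for intelligence: two subjects working together
# score as high as the better of the two.
scores = [90, 100, 120]
team_check = check_concatenation(scores, max, same, exceeds)
```

In this sketch the weights pass all four checks, whereas the team-score candidate fails the last one: combining a subject with an equal or weaker partner produces nothing that exceeds the subject alone, so the candidate cannot serve as a concatenation operation.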
Though the first two laws of measurement may sound somewhat abstract to readers not familiar with measurement theory, Campbell's third and last law comes closer to the actual practice of fundamental measurement. Its starting point is the observation that from the set of objects in the first two laws, we may pick a series of objects and consider them a standard series against which the other objects are to be measured. The basic procedure is to match the other objects with one in the standard series and use the numeral associated with the latter as the measure of the former.

The first two laws can be used to produce a standard series. An obvious procedure is to denote one object as the standard or unit object. The order relation could be used to find another object that has the relation =E to the standard. Then the concatenation operation defined by the second law can be used to combine the two objects into a new object. If the numeral 1 is assigned to the standard (other choices are possible, but probably less convenient), the new object receives the numeral 2. This process can be repeated until the standard series is large enough to measure all objects in the set. Noninteger measures are introduced if the concatenation operation is used in a reciprocal way; that is, if we take objects with a <E relation to one of the objects already in the standard series and determine the number of times the concatenation operation has to be applied to produce a new object that has a =E relation to the given object. If the standard series is complete in the sense that for each object in the set there is one in the standard series to which it has a =E relation, the series forms a feasible measuring device.

In more technical language, it can be stated that an (arbitrary) unit object and a concatenation operation together span or generate a standard series. Analogously, a numeral for the unit object along with the addition operator generate a set of quantitative measures for the objects in the universe. The surprising thing to be noted is that the actual numeral used for the unit object is not important at all; different numerals will generate different sets of values for the standard objects, but each set will map the same empirical relational structure between the objects.
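The construction of measures from a unit object, and the arbitrariness of the numeral assigned to it, can be illustrated with a minimal sketch. It assumes that the empirical work has already been done: for each object we know how many concatenated copies of it balance how many concatenated copies of the unit. These counts, the object names, and the function name `build_measures` are hypothetical.

```python
from fractions import Fraction

def build_measures(matches, unit_numeral=1):
    """Assign numerals to objects from empirical matching results.

    `matches` maps each object to a pair (n_object, n_unit): n_object
    concatenated copies of the object balance n_unit concatenated copies
    of the unit. The numeral given to the unit is a free choice.
    """
    return {
        name: Fraction(n_unit, n_object) * unit_numeral
        for name, (n_object, n_unit) in matches.items()
    }

# Hypothetical outcomes on a balance.
matches = {"stone": (1, 3), "coin": (4, 1), "book": (2, 5)}

in_units = build_measures(matches, unit_numeral=1)      # stone 3, coin 1/4, book 5/2
rescaled = build_measures(matches, unit_numeral=1000)   # same objects, other numeral

# Ratios between objects do not depend on the numeral chosen for the unit.
ratio_before = in_units["stone"] / in_units["coin"]
ratio_after = rescaled["stone"] / rescaled["coin"]
```

The coin, which needs four copies to balance one unit, receives a noninteger measure through the reciprocal use of concatenation. Changing the numeral of the unit changes every value but leaves all ratios, and hence the relational structure, untouched.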
Campbell's third law identifies an important property of standard series:

Third Law of Measurement (Arbitrariness of Unit). Any object can be chosen as a unit object to form a standard series.

Examples. A well-known prototype of a standard series is the old-fashioned series of weights used on a balance. In fact, the series is only a partial standard series. If an object is met that cannot be matched with one of the weights in the series, a concatenation operation is used that combines weights on one scale into a new object that has a =E relation to the object on the other scale. The =E relation is defined by the balance of the scales. The unit object upon which a series of weights is based is not unique; any other object could have been chosen. It is convenience that determines our choice of standard series. Actually, convenience may take us one step further and have us replace the standard series by a single measuring device. The yardstick, with each of its notches replacing a separate object in a standard series, is a pertinent example. The history of measurement in physics can be looked upon as a long process in which old measuring devices are replaced by new devices. As each replacement usually is based on the application of new substantive laws, the latest device may hardly seem to bear any relation with its early ancestors, as is the case, for instance, with modern atomic clocks and the original sandglass. Campbell's analysis reminds us, however, of the fact that for measurement to qualify as fundamental at its basis there must be an empirical concatenation operation that can be used to derive a standard series of objects from an arbitrary unit object.

Below we will return to intelligence as an example of a variable for which no standard series has been possible. We could select a certain subject as our unit object, but it is impossible to build a series of standard objects from it, as we still have no concatenation operation. Hence, we are unable to assign numerals to intelligence that obey the laws of fundamental quantitative measurement.

DERIVED MEASUREMENT

Though fundamental measurement provides measurement in the natural sciences with a sound footing, it is not the only type of quantitative measurement possible. Another type defined by Campbell is derived measurement. Its name is appropriately chosen, since derived measurement always assumes the existence of fundamental measurement. The best way to appreciate the distinction between fundamental and derived measurement is by noting the different numbers of variables in physical laws. Each of the three laws of measurement given above was associated with a single variable.
This is typical of fundamental measurement; such laws explain the quantitative structure of a given variable, dealing only with properties of the relational structure on the set of objects that defines it. As argued earlier, this does not imply that substantive knowledge about other variables does not play a role in the confirmation of the laws of fundamental measurement, but the laws themselves are always formulated for single variables. Natural sciences, on the other hand, abound with laws of two or more variables. These laws govern the ways different physical variables relate to each other. They can also be used to measure variables.

As an example, think of the mechanical experiment in which a known force is applied to physical objects and their acceleration is measured. As a result, it can be observed that for each object force and acceleration are proportional to each other, but that different sets of objects may display different constants of proportionality. In straightforward notation this means: F/a = c. Now suppose it is observed that the values of this constant c perfectly order the objects according to mass. These values can then be identified as measures of mass, and the law can be notated in its well-known form as: F = ma. Thus even if no concatenation operation is available for mass, and mass can never be measured directly, it is nevertheless possible to represent the mass of objects on a quantitative scale, provided the other variables in the law can be measured fundamentally. The properties of the scale follow from the mathematical structure of the model and are determined following a procedure that is known in physics as dimensional analysis. The question how to find an order of mass independently of c so that c can be identified as a measure of mass is not clearly dealt with in Campbell's book. A lucid treatment of this problem is given in Rasch (1960, chap. 7), where mass is identified as the acceleration of a standard object caused by a unit of force.
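The step from the numerical law F/a = c to a measure of mass can be sketched in a few lines. The readings below are hypothetical and noise-free; the function name `derived_constants` and the tolerance used to decide whether the ratio is constant are assumptions made for the illustration.

```python
def derived_constants(trials):
    """Return c = F / a for each object from (force, acceleration) trials.

    Raises ValueError if an object's ratio is not constant across trials,
    since the numerical law F / a = c would then not hold for that object.
    """
    constants = {}
    for name, pairs in trials.items():
        ratios = [force / accel for force, accel in pairs]
        if max(ratios) - min(ratios) > 1e-9 * max(ratios):
            raise ValueError(f"F/a is not constant for {name}")
        constants[name] = sum(ratios) / len(ratios)
    return constants

# Hypothetical readings: (force in newtons, acceleration in m/s^2).
trials = {
    "cart": [(2.0, 1.0), (4.0, 2.0), (6.0, 3.0)],
    "block": [(2.0, 4.0), (4.0, 8.0)],
    "crate": [(10.0, 1.0), (20.0, 2.0)],
}

masses = derived_constants(trials)          # cart 2.0, block 0.5, crate 10.0
ordering = sorted(masses, key=masses.get)   # lightest to heaviest
```

Force and acceleration are the fundamentally measured inputs here; the constant is computed, not observed. Whether it deserves the name mass depends on the further empirical finding that it orders the objects as mass does.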

MEASUREMENT IN THE BEHAVIORAL AND SOCIAL SCIENCES

As already put forward, the behavioral and social sciences have lacked the possibility of fundamental measurement. Even for such sophisticated forms of measurement as intelligence measurement, the history of psychology has not produced any viable concatenation operation that could be used to "add" two amounts of intelligence to obtain a new amount equal to their "sum." As a consequence, it has been impossible to select a series of intelligent objects that forms a standard series and can be used as a measuring device. Of course, practically, it is possible to select a small series of people of increasing intelligence, provided their intelligence is spaced at large distances; in some cases it might even be possible to set up reliable trials in which the intelligence of the people in the series is compared with that of other people. The critical point, however, is the following: As long as it is impossible to obtain the intelligence of the other people in the series by repeated concatenation of the intelligence of a person chosen as the unit, such a series can never be a standard series.


How about IQ tests? Are they not the measuring instruments that yield quantitative intelligence scores? They certainly do not provide fundamental measurement. An intelligence test is not a device that replaces a standard series of intelligent objects as the yardstick replaces a set of sticks of variable lengths. Standard series are always parts of the universe of objects that define the variable; they possess the magnitude that the variable represents. It is by this virtue that direct comparison with other objects and hence fundamental measurement is possible. A yardstick itself has length, just as each weight in a standard series has a certain weight. However, IQ tests have no intelligence and it is impossible to directly compare the intelligence of people with the "intelligence of the test." The truth about IQ tests is that, notwithstanding our daily parlance, they are not measurement instruments at all in the same sense as physics has its thermometers, balances, and stopwatches! In fact, they are just standardized experiments used to collect such qualitative data as responses to problems formulated in test items. Measurement in the behavioral and social sciences never takes place while data are collected; it always happens after they are collected.

Now if the behavioral and social sciences have no fundamental measurement, and according to Campbell derived measurement is the only other sound form of quantitative measurement, is derived measurement possible in these sciences? Again the answer is no. By definition derived measurement is always based on fundamental measurement. And if no laws with relations between fundamentally measurable variables are at hand, we can never find the constants in such laws that identify measures for new quantitative variables.

Implicit Measurement

It is exactly here that Campbell's analysis goes wrong and comes to a premature stop.
Modern measurement theory shows that we can go one step further and verify laws t h a t explain observable data using only unmeasured variables. If these laws—or models, as modern measurement theory prefers to call them—are quantitative and empirically verified, then the unmeasured or latent variables have quantitative scales on which, as a byproduct, the positions of the objects are known. As the model contains only latent variables, measurement of them is not derived from other fundamentally measured variables—all variables are measured jointly, in relation to one another. To distinguish this type of measurement from fundamental and derived measurement, it is called implicit measurementnt here. The first step in implicit measurement is the definition of the data for which the model has to be designed. These data are categorical or


ordinal. The fact that the data are qualitative and not quantitative is essential; otherwise there would be no reason at all to "upgrade the data" and derive quantitative measures from them. Once the data are defined, the next activity is to design a model that explains the data as a function of the variables on which they depend. Now the basic point is that it is possible to explain qualitative data by a model with quantitative variables. Loosely speaking, here quantitative is taken to mean that the variables are allowed to have real values and that the model relates the variables or parameters to each other through a mathematical structure that contains at least a +. This operation of addition is present in the model to govern the way the variables are assumed to interact, not to map an empirical concatenation operation. In a model or law for a single variable the + can only be used to add values of the same variable, but in a model with more than one variable the + can be used to add values of different variables. For the model to be empirically testable, the former case requires a concatenation operation; the latter case does not. The final step is to fit the model to actual data and test its goodness of fit. Generally, fitting a model means that values for the variables or parameters are found such that observable consequences from the model match the properties of the data as closely as possible. Several statistical methods are available to do the job, each based on a different criterion of optimal fit. The important point, however, is that if the model shows good fit, we have a tested quantitative scale for the variables in the model, just as a good fit of the First and Second Laws of Measurement gives us a tested quantitative scale for a single variable. The values for the variables that give the optimal match are the quantitative measures of the objects that explain the data in the experiment.
We have to be somewhat more specific about the quantitative structure of the variables in the measurement model. As the structure is not defined and tested following the axioms in the First and Second Law of Measurement, how do we know its formal properties? The criterion is the invariance or uniqueness of the model under transformation of scale of its variables. Though more formal definitions of invariance are possible, the following suffices for the present purpose: A model is invariant under a scale transformation if it has exactly the same observable consequences before and after transformation. The transformations under which a model is invariant are called admissible transformations. Admissible transformations fully define the structure of the scale. For example, if it is not possible to transform the unit or the zero of the model without changing its fit to data, then the unit or zero are empirical properties of the model and identify the structure of the variable.
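The invariance criterion can be made concrete with a small numerical sketch. It assumes, purely for illustration, a model in which the observable probabilities are a logistic function of the difference between two latent variables (the form of the Rasch model taken up later in the chapter); the function names and the numerical values are hypothetical. Shifting both variables by the same constant leaves every probability unchanged, so the zero is not an empirical property of this illustrative model, whereas changing the unit alters the probabilities, so the unit is.

```python
import math

def success_probability(theta, b):
    """Logistic function of the difference theta - b (illustrative model)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def observable_consequences(thetas, bs):
    """Table of probabilities for every (theta, b) pair."""
    return [[success_probability(t, b) for b in bs] for t in thetas]

def is_admissible(transform, thetas, bs, tol=1e-12):
    """A transformation is admissible if the table of probabilities is unchanged."""
    before = observable_consequences(thetas, bs)
    after = observable_consequences([transform(t) for t in thetas],
                                    [transform(b) for b in bs])
    return all(abs(x - y) <= tol
               for row_before, row_after in zip(before, after)
               for x, y in zip(row_before, row_after))

# Hypothetical values for the latent variables.
thetas = [-1.0, 0.0, 1.5]
bs = [-0.5, 0.5]

def shift_zero(v):
    return v + 3.0    # move the zero point of the scale

def change_unit(v):
    return 2.0 * v    # change the unit of the scale

print(is_admissible(shift_zero, thetas, bs))   # prints True
print(is_admissible(change_unit, thetas, bs))  # prints False
```

In this sketch the set of admissible transformations consists of translations only, which is the sense in which the transformations "fully define the structure of the scale."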


Stevens' Theory of Scale Types

The theory of scale types has become popular through the work of Stevens (1951). His basic distinction was between nominal, ordinal, interval, and ratio scales, each defined by a different class of admissible transformations. Historically, Stevens' theory of scale types was a rebuttal to Campbell's condition of a concatenation operation as a prerequisite for fundamental measurement. Because in the 1920s through the 1940s psychology was unable to produce concatenation operations, psychologists felt that they either had to relax Campbell's condition or to believe that in psychology measurement was not possible at all. Stevens did the former. He maintained Campbell's notion of representationalism, but relaxed the idea that the relational structure of the variable had to represent a concatenation operation, introducing ordinal and even nominal measurement as other true forms of representational measurement. Though Stevens' theory of scale types has become part of the standard outfit of all behavioral and social scientists, he has left them in uncertainty as to what level of scale their actual measurements are on. The theory provides no test whatsoever of level of scale. Stevens' view of measurement still had procedural overtones rather than being fully model based. Therefore he missed the point that in the behavioral and social sciences tests of scale properties can never be derived from measurement procedures themselves; only models can do the job. Had Stevens focused on relaxing Campbell's theory of derived measurement rather than fundamental measurement, his interest in scale invariance might have led him to the notion of implicit measurement as outlined above. It took some 15 years before others formalized the idea.

Additive Conjoint Measurement

Luce and Tukey (1964) showed the behavioral and social sciences that quantitative measurement is possible, provided more than one variable is measured and they are modeled jointly. They demonstrated the principle using their new model of additive conjoint measurement, which will be introduced here briefly. The model of additive conjoint measurement formulates the relation between the following three variables: a dependent variable P and two independent variables A and B. The variables A and B are unmeasured or latent, but it is possible to classify all objects simultaneously with respect to them. The dependent variable P is not measured either, but all objects are ordered completely with respect to


their values of P. The best way to represent the data is by a bivariate table with each row representing a different value of A and each column a different value of B, the values being arbitrarily chosen. For each cell there is a value of P attached to the objects classified into it and across cells the values satisfy a complete order relation. In additive conjoint measurement functions are fitted to the data in the table such that the following additive model holds:

$$f_1(P) = f_2(A) + f_3(B) \tag{1}$$

Luce and Tukey proved the powerful result that if the data in the table meet certain conditions, then:

(1) monotone functions $f_1(.)$, $f_2(.)$, and $f_3(.)$ satisfying this additive model exist;
(2) $f_1(P)$, $f_2(A)$, and $f_3(B)$ are quantitative variables.
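A minimal sketch of the kind of data table involved may help. It assumes hypothetical scale values for three levels of A and three levels of B, builds the cell values from the additive model, and checks one consequence of additivity: any two rows are ordered the same way in every column, and likewise for the columns. This independence check is chosen here for illustration only; it is a necessary consequence of the additive model, not the full set of conditions in Luce and Tukey's theorem.

```python
from itertools import combinations

def additive_table(row_values, col_values):
    """Cell values f1(P) = f2(A) + f3(B) for hypothetical scale values."""
    return [[a + b for b in col_values] for a in row_values]

def rows_ordered_independently(table):
    """True if every pair of rows is ordered the same way in every column."""
    for r1, r2 in combinations(table, 2):
        signs = {(x > y) - (x < y) for x, y in zip(r1, r2)}
        if len(signs) > 1:
            return False
    return True

def transpose(table):
    """Swap rows and columns so the same check can be applied to columns."""
    return [list(col) for col in zip(*table)]

# Hypothetical scale values f2(A) and f3(B).
f2_A = [0.0, 1.0, 2.5]
f3_B = [0.0, 0.5, 3.0]
table = additive_table(f2_A, f3_B)

# A table with interaction: the order of the two rows reverses between columns.
crossed = [[1, 4],
           [2, 3]]

print(rows_ordered_independently(table))             # prints True
print(rows_ordered_independently(transpose(table)))  # prints True
print(rows_ordered_independently(crossed))           # prints False
```

Only the order of the cell values enters the check, which matches the kind of data the model starts from: objects ordered on P and classified on A and B.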

For the sake of brevity, a discussion of the conditions will be skipped here. It suffices to say that a test of whether the data in the table meet the conditions is straightforward. Readers interested in the conditions may refer to the original paper by Luce and Tukey or to a lucid introduction to additive conjoint measurement in Michell (1990, chap. 4). It is important to separate the methodology in Luce and Tukey's paper from the actual model they propose. The methodology reflects the steps of implicit measurement outlined above. First, the data are identified for which the measurement model is needed (here, data ordering objects on P and classifying them with respect to A and B). Then a model is formulated that explains the data as a function of relevant independent variables (here, P is modeled as a function of A and B). The model is quantitative in that it uses a + to represent the relation between the variables (here, $f_1(P) = f_2(A) + f_3(B)$). Then measures on the variables are derived by applying the model to the data and finding values for the variables (here, values for $f_1(P)$, $f_2(A)$, and $f_3(B)$ such that $f_2(A) + f_3(B)$ is equal to $f_1(P)$ for all objects). It should also be noted that the model is not a mathematical tautology, but a hypothetical empirical law that may be rejected by the data. This is manifest from the fact that for the model to hold true the data in the table have to meet the three conditions in Luce and Tukey's theorem. It is this underlying methodology and not the specific model in Luce and Tukey's paper that should be considered their most important contribution to measurement theory. Some authors seem to have difficulty distinguishing between the two and tend to assume that unless other models can be demonstrated to be equivalent to the model


of additive conjoint measurement, they do not provide quantitative measurement (e.g., Michell, 1990; see van der Linden, 1994). In particular, models that are stochastic or have a more complicated mathematical structure are ruled out by this assumption. This is not correct. Nonadditive models of measurement have been studied along the same lines as in Luce and Tukey's paper and proofs of the fact that they provide quantitative variables are available (Krantz & Tversky, 1971). The distinctive advantage of additive models such as the one above, however, is their simplicity, due to the absence of interaction between the independent variables in their effect on the dependent variable. In nonadditive models comparisons between the effects of different levels of the same variable always depend on the level of other variables. This does not prohibit comparison, but makes their formulation more complicated. Although the term conjoint measurement is a perfect description of the underlying principles in Luce and Tukey (1964), to some authors conjoint measurement is equivalent to additive conjoint measurement. To avoid this misunderstanding, the term implicit measurement is preferred here. As for stochastic models of measurement, ironically, others had already been practicing model-based measurement long before Luce and Tukey wrote their seminal article. Independently, Lord (1952) and Rasch (1960) worked on models that are now known as item response models. In item response models, characteristics of the examinees and the test items are implicitly modeled as quantitative, unmeasured (or latent) variables. Along the same line, even Thurstone's (1927) work on models for paired comparisons shows an intuitive appreciation of the methodology of implicit measurement. Of these authors, Rasch was the only one to show an interest in the foundations of measurement and he introduced a basic principle of measurement to derive his model.
In the final section of this chapter, the central theme of this book is reflected in an analysis of the fundamentals of the Rasch model and their relation to Campbell's and Luce and Tukey's treatments of measurement theory.

FUNDAMENTALS OF RASCH MEASUREMENT

Rasch (1960) formulated his well-known model for achievement tests in which he assumed that only two parameters are needed to explain the probability of success on an item: an ability parameter $\theta$ for the examinee and a difficulty parameter $b$ for the item. For item $i$ the model stipulates the following probability of success as a function of $\theta$:

$$P_i(\theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)} \tag{2}$$


It should be noted that, applying the well-known logit transformation, the model can also be given in a different form as:

$$\ln \frac{P_i(\theta)}{1 - P_i(\theta)} = \theta - b_i \tag{3}$$
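As a hedged numerical sketch of the two forms of the model, the code below computes the probability of success under the usual logistic form of the Rasch model and confirms that its logit equals the difference between ability and difficulty. The function names and the parameter values are illustrative assumptions.

```python
import math

def rasch_probability(theta, b):
    """Probability of success for ability theta on an item of difficulty b."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

theta, b = 1.2, 0.4  # hypothetical ability and difficulty
p = rasch_probability(theta, b)

# The logit of the success probability recovers theta - b (up to rounding error).
print(abs(logit(p) - (theta - b)) < 1e-9)  # prints True
```

An examinee whose ability equals the item difficulty has a success probability of one half, since the logit is then zero.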

Rasch's interest in educational and psychological measurement was primarily in its foundation. However, judging from his publications, he did not show much interest in Campbell's Laws of Fundamental Measurement and in fact never even made any reference to Campbell's work or to any other major paper on measurement theory. Instead he introduced a principle that he called specific objectivity, which will be introduced here briefly. Though Rasch considered specific objectivity to be a single principle, actually it has two different versions: one at the level of the parameters in the model and the other at the level of their statistical estimators. We will deal with the two versions separately.

Specific Objectivity as a Mathematical Principle

Suppose that the abilities of two examinees, a and b, are to be compared using their performances in item i. These performances are represented by the values of the response function, $f(\theta_a, b_i)$ and $f(\theta_b, b_i)$. A comparison between the examinees is defined by Rasch as a comparator function $c(.)$ of this pair of values:

$$c[f(\theta_a, b_i), f(\theta_b, b_i)] \tag{4}$$

The principle of specific objectivity requires that comparisons made between values of the ability parameter be independent of the values of the difficulty parameter of the items involved, and vice versa. Formally, this implies that the comparator function in (4) be independent of the item parameter $b_i$. Rasch (1977) was able to derive that a necessary and sufficient condition for this requirement to hold is additivity of the response function $f(.)$. To demonstrate the condition, it is observed that from his proof it follows that there exist transformations $g_1(.)$ and $g_2(.)$ such that

$$g_1[f(\theta, b_i)] = \theta + g_2(b_i) \tag{5}$$


Obviously, if $g_1(.)$ is taken to be the logit transformation and $g_2(.)$ the reversal of the scale of the item difficulty parameter, the representation of the Rasch model in (2) is obtained. Thus, we may conclude that the Rasch model meets this version of the principle of specific objectivity. To fully appreciate Rasch's derivation of (3) as a consequence of the principle of specific objectivity, several things should be noted. First, (4) is not a derivation of a model from certain conditions on the data; in fact, no definition of any data whatsoever is involved. The result is just a mathematical theorem on functions. The only quantities used are $\theta$ and $b_i$ and another mathematical function $c(.)$ defined on pairs of functions $f(.)$. The reader should not be misled by the notation of the variables $\theta$ and $b$ and derive some empirical meaning from it. As observed by Fischer (1987), the theorem belongs to the domain of functional equations and was already addressed by various mathematicians before Rasch formulated it as his first version of the principle of specific objectivity. Second, an intuitive way to appreciate the result is to think of the well-known two-way ANOVA table, with the rows and columns representing the values of the parameters $\theta$ and $b$ and the values of the response function $f(\theta, b)$ in the cells. The present version of the principle of specific objectivity requires that comparisons between columns be made independent of the value for the rows, and vice versa. In ANOVA terminology, it amounts to the requirement that the table be fully additive and show no interaction effects. Though additivity is a very welcome property making life truly elegant, life with interaction is possible. Rasch sometimes seemed to imply that in the presence of interaction effects no scientific statements are possible at all; see, for instance, the title of his 1977 paper.
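The two-way table can be sketched numerically. Assuming the logistic form of the Rasch model and hypothetical parameter values, the code below fills one table with response probabilities and a second with their logits. On the logit scale the difference between two columns is the same in every row (no interaction); on the probability scale it is not.

```python
import math

def rasch_probability(theta, b):
    """Logistic response function of the difference theta - b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def logit(p):
    return math.log(p / (1.0 - p))

def column_differences(table, col_a, col_b):
    """Difference between two columns, computed row by row."""
    return [row[col_a] - row[col_b] for row in table]

thetas = [-1.0, 0.0, 2.0]   # rows: hypothetical abilities
bs = [-0.5, 1.0]            # columns: hypothetical difficulties

prob_table = [[rasch_probability(t, b) for b in bs] for t in thetas]
logit_table = [[logit(p) for p in row] for row in prob_table]

# Constant across rows on the logit scale: always b_2 - b_1 = 1.5.
logit_diffs = column_differences(logit_table, 0, 1)
# Not constant across rows on the probability scale.
prob_diffs = column_differences(prob_table, 0, 1)

print([round(d, 6) for d in logit_diffs])
print([round(d, 4) for d in prob_diffs])
```

The sketch also shows why the choice of transformation matters: the same table is additive after the logit transformation and shows interaction before it.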
As all analysts of tables know, comparisons in tables with interaction are possible; the only price to be paid is that they are to be made conditional on other variables. This makes them more complicated but not less true. Third, the resemblance between the model of additive conjoint measurement in (1) and the representation of the Rasch model in (3) is remarkable and has been noted several times (Brogden, 1977; Perline, Wright, & Wainer, 1978). Strictly speaking, however, the resemblance is only formal. In the model of additive conjoint measurement, P is a dependent variable with respect to which the objects are empirically ordered, whereas the left-hand side of the Rasch model is the logit of an unknown mathematical probability. Moreover, in (1) the objects are classified according to empirical values of A and B, but in (3) $\theta$ and $b$ are unknown quantities again. All we are able to say is that if the Rasch


model held and the logits were known, then the logits would meet the technical conditions formulated in Luce and Tukey's (1964) theorem. Now, as will be shown below, the Rasch model has simple sufficient statistics for $\theta$ as well as $b$. These statistics, which are just the numbers of correct responses per examinee and item respectively, may be used to classify examinees and test items according to their estimated values of $\theta$ and $b$. Proceeding in this way, as Perline, Wright, and Wainer (1978) did, the fit of the model of additive conjoint measurement and the Rasch model to the same set of data may be compared. But the results are never decisive, since the model of additive conjoint measurement, being a deterministic model, will only fit a very small subset of all possible data sets generated according to the Rasch model. The fact that the Rasch model is not a deterministic but a stochastic measurement model brings us to the version of the principle of specific objectivity in the following section. Fourth, the Rasch model is not the unique model that satisfies (5). If $g_1(.)$ is taken to be the probit transformation, then the well-known normal-ogive model from Item Response Theory is obtained with discrimination and guessing parameters constrained to be equal to the values 1 and 0, respectively (Lord, 1952). According to the first version of the principle of specific objectivity, this constrained normal-ogive model is thus "specific objective."

Specific Objectivity as a Statistical Principle

The previous version of the principle of specific objectivity formulated a requirement for the model as a mathematical expression. Were the variables in the model known a priori for all persons and items, the principle would have had immediate practical meaning. Now it has not. For this reason, Rasch extended his principle to include a version formulated at the level of response data.
The version can be formulated as follows: Suppose one examinee with ability $\theta$ responds to a test consisting of only two items with difficulty parameters $b_1$ and $b_2$. Let us derive the probability that the examinee has one item correct, say item 1, given the fact that his total score on the test is r = 1. This means that either item 1 or item 2 is correct. The probabilities of the two outcomes are:

$$P(1, 0 \mid \theta) = \frac{\exp(\theta - b_1)}{\gamma_1 \gamma_2} \tag{6}$$

$$P(0, 1 \mid \theta) = \frac{\exp(\theta - b_2)}{\gamma_1 \gamma_2} \tag{7}$$

where $\gamma_i = 1 + \exp(\theta - b_i)$ is the denominator of (2) for item $i$.


Now, noting cancellation of the factor dependent on $\theta$, it follows for the probability of item 1 correct given r = 1 that:

$$P(1, 0 \mid r = 1) = \frac{\exp(-b_1)}{\exp(-b_1) + \exp(-b_2)} \tag{8}$$
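The cancellation can be checked numerically. The sketch below assumes the logistic form of the Rasch model, independent responses to the two items given ability, and hypothetical item difficulties; the function names are illustrative. It computes the probability of the response pattern (1,0) given a total score of 1 for several abilities and compares each value with the expression that involves the item parameters only.

```python
import math

def rasch_probability(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def pattern_probability(theta, b1, b2, x1, x2):
    """Probability of responses (x1, x2), assuming independence given theta."""
    p1 = rasch_probability(theta, b1)
    p2 = rasch_probability(theta, b2)
    return (p1 if x1 else 1.0 - p1) * (p2 if x2 else 1.0 - p2)

def conditional_item1_given_score1(theta, b1, b2):
    """P(pattern (1,0) | total score r = 1)."""
    p10 = pattern_probability(theta, b1, b2, 1, 0)
    p01 = pattern_probability(theta, b1, b2, 0, 1)
    return p10 / (p10 + p01)

b1, b2 = -0.3, 0.9  # hypothetical item difficulties

# Expression depending on the item parameters only.
closed_form = math.exp(-b1) / (math.exp(-b1) + math.exp(-b2))

# The conditional probability is the same for every ability value tried.
values = [conditional_item1_given_score1(t, b1, b2) for t in (-2.0, 0.0, 1.0, 3.0)]
print(all(abs(v - closed_form) < 1e-9 for v in values))  # prints True
```

The unconditional probability of the pattern (1,0), by contrast, changes with ability, which is what makes the conditional result useful for estimating item parameters.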

The surprising result is that although the probability of the response vector (1,0) depends both on $\theta$ and the two item parameters, the conditional probability given r = 1 depends only on the item parameters. In statistical terminology, and formulated at the level of any number of items, these few steps show us that the Rasch model has a simple sufficient statistic for the ability parameter: the number of correct responses by the examinee. Likewise, it can be shown that the number of correct responses on an item is a sufficient statistic for the difficulty parameter. Expressions as in (8) can be used for conditional maximum likelihood estimation of the ability and difficulty parameters. These conditional estimators have the same favorable asymptotic properties as maximum likelihood estimators in the regular case of models for identical independently distributed random variables (Andersen, 1980). The above shows that the existence of the number of correct responses as a sufficient statistic is a necessary condition for the Rasch model. One may wonder if the reverse also holds and the presence of these statistics is a sufficient condition for the Rasch model. A proof of this property is given in Rasch (1968). Later, Andersen (1977) proved the more general claim that the existence of any (minimal) sufficient statistic for one parameter independent of the other parameter is a sufficient condition for the Rasch model. Thus the Rasch model has not only (nontrivial) sufficient statistics for its parameters, it is also the only model with this property. The practical value of the presence of simple sufficient statistics can hardly be overestimated. They allow the use of conditional inference that yields maximum likelihood estimators with known asymptotic properties.
This is not the case for other item response models, which are not even known to produce consistent estimators unless they are brought back to the regular case of models for identical independently distributed random variables, for instance, by introducing a common population from which the examinees are drawn. Because of this property, the Rasch model has a well-developed body of statistical theory for estimating its parameters and testing its goodness of fit. In particular the fact that excellent goodness-of-fit statistics are available for the Rasch model is of critical importance. As was pointed out in the earlier treatment of Luce and Tukey's methodology of implicit measurement, it is the fit of the model that guarantees the quantitativeness of the variables in the model. The Rasch model is based on statistical theory that works and produces results with known properties. The same holds for its many extensions to models dealing with different item formats, multidimensional abilities, and constraints on the item parameters. In his writings, Rasch was not always clear about the meaning of his theorems and sometimes he was even a bit obscure. He seemed to prefer working outside of the mainstream of the statistical literature. For instance, he hardly ever referred to the theories of exponential families and sufficient statistics, which had their most important developments when Rasch worked on his model and were published in such standard references as Lehmann (1959). Nonetheless, his model belongs to an exponential family and thus has sufficient statistics. Instead he used such terms as "separability of parameters" or "specific objective comparisons" and always seemed to imply that his results meant something more than just statistical theorems and were attempts to found measurement, or even the validity of science. The danger of confusion is most apparent in Rasch (1968), where he purports to prove that the Rasch model is a necessary consequence of separability of parameters but actually proves this for the presence of simple sufficient statistics. This is clear from the fact that in his proof he reduces the sample space to the two possible outcomes modeled in (6) through (7) and from there on demonstrates the necessity of the Rasch model. In so doing, the assumption of separable parameters is made identical to the one of the number of responses correct, r, being a sufficient statistic and can be abandoned as a superfluous concept. The same line of reasoning is typical of proofs on specific objectivity in Fischer (1987) and Roskam and Jansen (1984).
It is the generality of Rasch's claims and his mixing up of the concepts of specific objectivity and sufficient statistics that could lead to ascribing unrealistic properties to the Rasch model. For example, the belief is widespread that due to the presence of sufficient statistics, conditional maximum likelihood estimation in the Rasch model allows estimation of the same ability parameters from different samples of test items. This statement is statistically too simple to be true. First of all, any parameter can be estimated from any sample; the only relevant question is how good the estimators are. Now tests usually contain no infinitely large samples of items and we know that conditional maximum likelihood estimators have small-sample bias. Thus the expected ability estimates from different samples of test items (in the sense of hypothetical replicated administrations of the same two sets of items with the same examinees) are not identical and depend on the difficulty parameters of the items. Likewise, it is known that samples


of test items, however long, with different difficulty parameters may give rise to extremely different variances of the estimators. Thus conditional maximum likelihood estimators based on different samples of test items are not identically distributed estimators, let alone are they identical! What, then, is the correct claim? It is the statement that under the condition that the Rasch model holds, if the lengths of two different tests go to infinity, the conditional maximum likelihood estimators of the ability of the same person have the same expected value but are likely to have different variances. In other words, the correct inference is that the presence of sufficient statistics paves the way for the use of consistent estimators of the parameters in the Rasch model. "Specific objectivity" has no meaning beyond this! At the same time, consistency is a minimal prerequisite for parameter estimation, and from Andersen's (1977) result we know that the Rasch model has this property, but that all other models with incidental parameters lack it. It is in this sense that the fundamentals of Rasch measurement are fundamental.
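The point about unequal variances can be illustrated with a rough sketch. It relies on an assumption not spelled out in the text: with the item difficulties treated as known, the Fisher information about ability in the Rasch model is the sum of P(1 - P) over the items, and the large-sample variance of a maximum likelihood estimator of ability is approximately its reciprocal. The two item sets and all names below are hypothetical.

```python
import math

def rasch_probability(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_set_information(theta, difficulties):
    """Sum of P(1 - P) over the items: Fisher information about theta."""
    probs = [rasch_probability(theta, b) for b in difficulties]
    return sum(p * (1.0 - p) for p in probs)

def approximate_variance(theta, difficulties):
    """Large-sample approximation to the variance of the ability estimator."""
    return 1.0 / item_set_information(theta, difficulties)

theta = 0.0  # hypothetical ability of one examinee

# Two hypothetical tests of equal length (20 items each).
well_targeted = [-0.5, -0.25, 0.0, 0.25, 0.5] * 4   # difficulties near theta
off_target = [2.5, 2.75, 3.0, 3.25, 3.5] * 4        # much harder items

var_targeted = approximate_variance(theta, well_targeted)
var_off = approximate_variance(theta, off_target)

# Same examinee, same test length, clearly different precision.
print(var_off > var_targeted)  # prints True
```

Both tests aim at the same ability parameter, yet the estimator based on the badly targeted items is several times more variable, which is the sense in which the two estimators are not identically distributed.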

The purpose of this chapter was to highlight a few moments in the history of thoughts about the foundation of measurement. In the first part of the chapter Campbell's notions of fundamental and derived measurement were reviewed and it was shown how nicely they fit the practice of measurement in the natural sciences. At the same time Campbell's emphasis on fundamental measurement as a necessary condition for derived measurement set a wrong model for the behavioral and social sciences. It created an obsession with fundamental measurement with subsequent attempts to relax fundamental measurement rather than derived measurement. Luce and Tukey, however, did the latter, using their model of additive conjoint measurement to show that measurement in the absence of fundamentally measured variables is possible, provided the variables are modeled jointly and directly as quantitative variables. It was emphasized that although it is tempting to see the additivity of Luce and Tukey's model as mandatory, nonadditive models are more complicated but still possible. The basic methodology is the joint modeling of latent variables to account for qualitative or ordinal data, which yields quantitative measures for the variables with scale properties defined by the invariance of the model. Others had already been practicing this form of implicit measurement, notably in the field of item response theory


where stochastic models were introduced to explain probabilities of success on test items by quantitative parameters associated with the abilities of the examinees and the features of the items. The Rasch model belongs to this domain of item response models. Rasch derived his model from his principle of specific objectivity. It was shown that this principle actually has two versions: the requirement of additivity of model structure and of simple sufficient statistics. The feature of additivity is not unique; it is shared with other models. However, the Rasch model is the only model with sufficient statistics and hence the unique model with incidental parameters for which consistent estimators are available.

REFERENCES

Andersen, E.B. (1980). Discrete statistical models with social science applications. Amsterdam: North-Holland.
Andersen, E.B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69-81.
Brogden, H.E. (1977). The Rasch model, the law of comparative judgment, and additive conjoint measurement. Psychometrika, 42, 631-635.
Campbell, N.R. (1928). An account of the principles of measurement and calculation. London: Longmans, Green & Co.
Ellis, B. (1966). Basic concepts of measurement. Cambridge: Cambridge University Press.
Fischer, G.H. (1987). Applying the principles of specific objectivity and of generalizability to the measurement of change. Psychometrika, 52, 565-587.
Krantz, D.H., & Tversky, A. (1971). Conjoint-measurement analysis of composition rules in psychology. Psychological Review, 78, 151-169.
Lehmann, E.L. (1959). Testing statistical hypotheses. New York: Wiley.
Lord, F.M. (1952). A theory of test scores. Psychometric Monograph No. 7. Psychometric Society.
Luce, R.D., & Tukey, J.W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1, 1-27.
Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Lawrence Erlbaum.
Perline, R., Wright, B.D., & Wainer, H. (1978). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 3, 237-255.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Paedagogiske Institut.
Rasch, G. (1968, September). A mathematical theory of objectivity and its consequences for model construction. Paper presented at the European Meeting on Statistics, Econometrics and Management Science, Amsterdam, The Netherlands.


Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. In M. Blegvad (Ed.), The Danish Yearbook of Philosophy. Copenhagen: Munksgaard.
Roskam, E.E., & Jansen, P.G.W. (1984). A new derivation of the Rasch model. In E. Degreef & J. van Buggenhaut (Eds.), Trends in mathematical psychology. Amsterdam: Elsevier.
Stevens, S.S. (1951). Mathematics, measurement and psychophysics. In S.S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: Wiley.
Thurstone, L.L. (1927). A law of comparative judgment. Psychological Review, 34, 273-286.
van der Linden, W.J. (1994). Review of J. Michell, An introduction to the logic of psychological measurement. Psychometrika.

chapter 2

The Relevance of the Classical Theory of Measurement to Modern Psychology

Joel Michell
University of Sydney

p of measurement. It has been eclipsed by the representational theory, especially that version promoted by S.S. Stevens (1951, 1959) and those who later advanced his ideas much more rigorously (e.g., Krantz, Luce, Suppes, & Tversky, 1971; Luce, Krantz, Suppes, & Tversky, 1990). This theory, however, suffers certain philosophical weaknesses and, I argue, is inferior to the classical theory. The classical theory is not only sufficient to provide a basis for those enterprises called psychological measurement, it also has interesting consequences for that enterprise. I am nervous about calling any theory classical, for it is a term debased by advertising copy. In this case, however, that qualm must be ignored. Literally, classical means of the highest class, and by association it has come to mean the cultures of ancient Greece and Rome. It is in this latter sense that I mean it. The theory of measurement described here is that implicit in the writings of Aristotle and Euclid. They presumed a theory that not only nourished the development of quantitative science in antiquity, but did so until the end of the 19th


century. Even after Aristotle fell from grace among the scientists of the 17th century, Euclid's Elements remained part of every scientist's training until the 20th century. This theory of measurement is still deeply ingrained in our culture. It remains not only the layperson's view of measurement, but the view of those scientists unaffected by philosophy or the social sciences. Of course, it was never static, and it changed over the centuries. What I offer is only an interpretation based on what I see as the best elements of that theory. The central concept of this theory is the concept of a quantity. A quantity is a class of properties (such as length) or a class of relations (such as temporal durations), the elements of which stand in additive relations to one another rich enough to sustain numerical ratios. Length and time are two important paradigms of quantity, for the additive relations they involve seem, in some cases at least, to be directly visible. In some cases, for example, we are able to see that a particular length is composed entirely of other discrete lengths. Furthermore, this relation of additive composition between lengths we hold to be rich enough to sustain ratios. We do not hesitate to describe one length as being twice or thrice another, for example. In general, we believe that for any two lengths, x and y, there exists a real number, r, such that

$$x = r y$$

The kind of structure that a set of properties or relations must have in order to sustain ratios is something like the following (Holder, 1901; Michell, 1990; Stein, 1990). Let Q be a set of properties or relations and + a relation of composition upon Q, then + on Q sustains ratios if

1. for any a and b in Q, a + b = b + a (commutativity),
2. for any a, b, and c in Q, a + (b + c) = (a + b) + c (associativity),
3. for any a and b in Q one and only one of the following is true,
   3.1 a = b,
   3.2 there exists c in Q such that a = b + c,
   3.3 there exists c in Q such that b = a + c,
   (3 determines an order upon Q as follows: for any a and b in Q, a ≥ b if and only if either 3.1 or 3.2, and this order is transitive, antisymmetric, and strongly connected, i.e., a simple order),
4. for any a and b in Q, there exists a natural number n such that na ≥ b (where na is defined recursively as 1a = a and (n + 1)a = na + a).
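As a rough illustration of conditions 1 to 4, the sketch below checks them on a small sample of positive rational numbers under ordinary addition. The use of positive rationals as a stand-in for Q, the function names `archimedean_multiple` and `sustains_ratios`, and the step limit are assumptions made for this example and do not come from the text. A finite check of this kind can only fail to refute the conditions; it cannot establish them for the whole of Q.

```python
from fractions import Fraction
from itertools import product


def archimedean_multiple(a, b, limit=1_000_000):
    """Return the smallest n with na >= b, building na as 1a = a, (n + 1)a = na + a.

    Returns None if no such n is found within `limit` steps (a hypothetical cutoff).
    """
    na, n = a, 1
    while na < b:
        if n >= limit:
            return None
        na, n = na + a, n + 1
    return n


def sustains_ratios(sample):
    """Check conditions 1-4 on a finite sample of positive rationals."""
    for a, b in product(sample, repeat=2):
        # Condition 1: commutativity.
        if a + b != b + a:
            return False
        # Condition 3: exactly one of a = b, a = b + c, b = a + c holds,
        # where c must itself be in Q (here: a positive rational).
        cases = [a == b, a - b > 0, b - a > 0]
        if sum(cases) != 1:
            return False
        # Condition 4: some multiple of a reaches b.
        if archimedean_multiple(a, b) is None:
            return False
    for a, b, c in product(sample, repeat=3):
        # Condition 2: associativity.
        if a + (b + c) != (a + b) + c:
            return False
    return True


sample = [Fraction(1, 3), Fraction(1, 2), Fraction(7, 4), Fraction(2)]
print(sustains_ratios(sample))                            # the conditions hold on this sample
print(archimedean_multiple(Fraction(1, 3), Fraction(2)))  # 6, since 6 * (1/3) = 2
print(Fraction(2) / Fraction(1, 2))                       # the ratio of 2 to 1/2
```

Exact rational arithmetic is used so that equalities such as a = b + c are tested without rounding error.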

Furthermore, if Q is order dense, continuous, and unbounded above (Michell, 1990) (as we believe length and time intervals to be), then

CLASSICAL THEORY OF MEASUREMENT


these numerical ratios are isomorphic to the positive real numbers. Of course, neither Aristotle nor Euclid possessed the modern concept of the real number system, but as both Bostock (1979) and Stein (1990) argue, the concept of a ratio developed by Euclid in Book V of the Elements (Heath, 1908) is equivalent to that of a positive real number as defined later by Dedekind (1909). According to the classical theory, measurement is the discovery or estimation of such ratios. In very general terms what I mean by the ratio of a to b is the magnitude of a relative to b. For any a and b in Q (e.g., for any pair of lengths, say) the magnitude of a to b cannot necessarily be expressed as the ratio of one whole number to another, for there are, as we know, incommensurable pairs of magnitudes (for example, the lengths of the side and diagonal of a square). However, in such cases there will be a unique and well-defined set of numerical ratios less than a/b. Such a set is what Dedekind meant by a cut, and this concept he used to define the real number system. While the theory of ratios of nonnumerical quantities was highly developed by Euclid and his Book V of the Elements [...] Holder (1901), the father of modern measurement theory, who first proved the relationship between Euclid's ratios and the modern concept of real number by explicitly defining what was meant by quantity. The classical theory contains two more theses. One is that these ratios literally are the real numbers. The second is that the relation of additivity involved in any quantity is conceptually distinct from any relations of concatenation observable in the behavior of objects. The first thesis, that the real numbers are ratios of quantities, is not Aristotle's or Euclid's, though both held that numbers (for them, natural numbers) were empirical properties (see Lear, 1982; Stein, 1990) and [...] we attend to them while ignoring other properties of things).
However, this thesis was definitely a part of the classical theory by the 17th century, where we find it in Newton, who defined number as "the abstracted ratio of any quantity to another quantity of the same kind" (cf. Whiteside, 1967). From the classical view, the numbers are not abstract in the modern philosophical sense (i.e., nonempirical and outside of space and time); they are empirical relations of a special kind, the kind holding between different magnitudes of the same quantity. [...] things. Rather, in measurement we discover numerical relations between things, and these numerical relations are just as empirical as any other relations we may observe.

The second of these two additional theses constituting the classical theory is that the relation of additivity characterizing a quantity, and in virtue of which ratios obtain, is not to be identified with any relation of concatenation between the objects possessing magnitudes of the quantity. For example, in the case of length we may distinguish a relation between lengths on the one hand and a relation between objects possessing length (say, rods) defined in terms of an operation of concatenation. This operation of concatenation may or may not directly reflect the additivity of lengths, depending upon what other properties the rods possess, the conditions under which the operation is performed, and the precise nature of the operation. That is, there is no [...] connection and because any effect is never the product of a single cause (even in the laboratory), additivity will only be directly reflected in behavior under special conditions. [...] different kinds of quantities, but rather between the different ways quantities relate to the behavior of objects. In the case of extensive quantities, we are able to arrange conditions so that quantitative additivity is more or less directly reflected in the behavior of some objects for some restricted range of values. In the case of intensive quantities, quantitative additivity is only indirectly evident. This is essentially the distinction as made by the medieval scholar Nicole Oresme (see Clagett, 1968).

If there is a villain in the history of measurement theory then it is N.R. Campbell. Campbell (1920) denied both of these theses and so popularized the representational alternative that it became accepted dogma. However, he did not introduce representationalism. That honor belongs to Russell (1903). But it was Campbell's monograph that came to have a decisive influence. The last presentation of the classical theory was that given by A.N. Whitehead in Volume 3 of Principia Mathematica (Whitehead & Russell, 1913). Campbell's book was published in 1920, and from that time there are no expositions of the classical theory until my attempt (Michell, 1990).
Campbell made it seem that measurement was numerical representation rather than the discovery of the numerical value of ratios. [...] measurement as the numerical representation of empirical operations of addition. In the absence of such operations measurement was held to be impossible. This concept ignores the above distinction between additivity within the quantity and physical operations that reflect this underlying additivity. He did admit derived measurement, but it was made logically dependent upon fundamental measurement and the sense in which it involved numerical representation was never made explicit. Thus, derived measurement sits uneasily with his insistence that measurement is numerical representation.

S.S. Stevens (1951, 1959) followed Campbell in denying these two features of the classical theory. He differed from Campbell in being a more thoroughgoing representationalist. Whereas Campbell wanted to restrict the concept of measurement to the numerical representation of operations of addition, Stevens simply wanted to define it as numerical representation per se. Measurement, for him, was the numerical representation of any empirical relation. This thoroughgoing representationalism entailed his famous theory of scale types and his notorious doctrine of permissible statistics. Both are artifacts of the representational theory of measurement and find no parallel within the classical theory.

Representationalism, despite its enormous popularity in both psychology and the philosophy of science, is really a sidetrack in the development of our understanding of measurement. It is a sidetrack because it is based upon an impossible theory of number. Within all versions of the representational theory, numbers are taken as given. However, it is clear from the logic of the representational theory that they are not given in empirical situations. The only empirical context complex enough to yield them is measurement itself, but according to this theory numbers are imported into measurement from outside the empirical domain. Representationalists make a hard and fast distinction between the empirical system, which is characterized as qualitat[...] Hence, numbers are held to be nonempirical entities of an abstract kind (in the special, modern sense of abstract, which means not located in space and time). Beyond that, representationalism involves no commitment as to what they might be.
This view of numbers makes them exotic things indeed, so it is something of a surprise to find that the representationalists' rationale for introducing them into science via measurement is their simplicity and the convenience of reasoning with them. As Bertrand Russell (1896/1983) put it, "Number is of all conceptions, the easiest to operate with, and science seeks everywhere for an opportunity to apply it" (p. 301). Hence, in measurement, empirical operations are represented numerically in order that "the powerful weapon of mathematical analysis" can "be applied to the subject matter of science" (Campbell, 1920, pp. 267-268). All representationalists have employed the same rationale.

This rationale raises some difficult questions. If the concepts of number are nonempirical, how can they be "the easiest to operate with"? Surely empirical concepts themselves would have to be easier, for they are of familiar, perceptible qualities and relations, while numerical ones are abstract and unfamiliar. Related to this is a further question. Why are numerical concepts universally useful in empirical contexts if they are not also empirical concepts? Finally, if cognition is an empirical relation between our brains and the empirical environment, from whence would our numerical concepts have derived were they not empirical? The fact that numerical concepts are so easy to operate with, so universally useful, and so readily cognized is easily explained by the hypothesis that they are empirical concepts, but is seemingly inexplicable if they are not.

The hypothesis that numerical concepts are empirical ones has long been out of favor philosophically, and this is what has given the representational theory its philosophical audience. Stevens, in his turn, was influenced not only by Campbell and other representationalists, but also by the philosophical climate that held mathematics generally to be a system of tautologies, that is, by the movement called logical e[...] empirical view is again on the philosophical agenda (see for example, Bigelow, 1988; Forrest & Armstrong, 1987; Irvine, 1990). In light of the above considerations, if plausible empirical candidates for the numbers, such as ratios of quantities, can be located, it seems obtuse not to recognize them as such.

If the classical theory could be rehabilitated into the mainstream of psychological science, what would be its implications for modern psychology? Some of the more important are as follows:

1. There are no distinctions of scale type;
2. There is no problem of permissible statistics (or, as it is known in its modern guise, of meaningfulness);
3. The hypothesis that a variable is quantitative is a substantive hypothesis and must be put to the test like any other in science;
4. Just because an instrument yields quantitative or numerical data, it does not follow that anything is being measured or that quantitative variables are involved; and
5. Testing the hypothesis that a variable is quantitative means finding evidence for additivity, and this does not necessarily mean extensive measurement (as Campbell thought).

Firstly, within the classical theory there are no distinctions of scale type. A measurement scale for some quantity is obtained when a unit is selected relative to which numerical ratios may be observed or estimated. Hence, all measurement scales are, to use Stevens' (1946) terminology, ratio scales. There are no nominal, ordinal, or interval scales of measurement. This is not to say that one cannot code classes or orders numerically. It is just to say that numerical coding and measurement are quite different enterprises.

Secondly, there is no problem of permissible statistics. The numbers discovered or estimated in measurement are real numbers. Any mathematically valid argument forms applicable to real numbers may be applied to measurements, and the conclusions arrived at follow validly from those measurements. Of course, some conclusions have more generality than others; for example, conclusions that are independent of the unit employed. But this is just to indicate that formal validity is not the sole consideration in making inferences from measurements. Stevens' problem of permissible statistics has, over the last 30 [...] Narens, 1985; Luce et al., 1990). This, like the problem of permissible statistics, is an artifact of the representational theory. According to that theory, since the facts numerically represented in measurement are essentially qualitative (that is, nonquantitative), it must follow that quantitative propositions based upon measurement are not literal descriptions of reality. Indeed, they may even lack any empirical or qualitative meaning. The problem of meaningfulness has two parts: first, the specification of necessary and sufficient conditions for quantitative propositions to contain empirical meaning; and second, the determination of the empirical content of the meaningful propositions. Both parts have proved difficult and neither is as yet satisfactorily solved within the framework of the representational theory. However, for the classical theory there is no problem of meaningfulness, for the numerical ratios discovered in measurement are held to exist empirically and quantitative measurement propositions are literal assertions about them. It is this consequence of the classical theory, with its great simplicity, that is its major strength relative to the representational theory.
Thirdly, the hypothesis that a variable is quantitative is a substantive hypothesis and must be put to the test, like any other hypothesis in science. There is a real distinction between quantitative and nonquantitative variables. It is a distinction that resides in the internal structure of the variable itself and not in our procedures. Hence, if psychology is to be a quantitative science it must be shown experimentally that psychological variables are quantitative. Two errors prevented psychologists from seeing this clearly. One was the Pythagorean dogma that all natural variables are quantitative. This dogma dominated much of 19th century science and strongly influenced the founders of modern psychology. Many of them presumed that if psychology were to be a science it had to be quantitative, and so they never attempted to test the hypothesis that such variables as mental ability or intensity of sensations were quantitative. The second error that clouded the issue was the operational view that measurement is really only a matter of devising number-generating procedures. Of course, numerical procedures are needed for measurement, but only if the variable involved really is measurable.

Fourthly, taking up that last point, just because an instrument yields quantitative data, it does not follow that anything is being measured or that quantitative variables are involved. Guided by a mixture of Pythagoreanism and operationalism, psychologists have devised a wide range of procedures that generate numerical data, including mental tests, rating scales, attitude and personality questionnaires, and magnitude estimations. For many it seemed that no more was involved in psychological measurement than devising such procedures. Even if psychologists did not know exactly what they measured, they could be confident that because the procedures resulted in numerical assignments they must be measuring something. However, to assert that, on the classical view, means assuming that the underlying psychological variables causally implicated in producing numerical scores of one kind or another are quantitative, and a substantive hypothesis like that could well be false. Hence, to assume it is true is unwarranted. Evidence is needed.

This leads to the fifth implication, which is that testing for quantity means finding evidence for additivity, but this does not necessarily mean extensive measurement. All that is required in order to test for additivity is the discovery of situations sensitive to its presence or absence in the variables being studied. It is fruitless to attempt to test for additivity in situations that are indifferent to its existence. In that way the hypothesis could never be falsified.
Simply because many of the quantitative procedures devised by psychologists are not sensitive to underlying additivity, they do not enable a genuine test of this property. However, extensive measurement is not necessary to do this, as Campbell mistakenly insisted. Perhaps the most important legacy of the representational theory is the theory of conjoint measurement (see Krantz et al., 1971), for it demonstrates that additive structure can be tested for via ordinal relations. The future of psychological measurement lies in finding new ways to apply this theory to situations involving variables that psychologists have traditionally presumed to be quantitative. To elaborate upon this point, it is already known that many quantitative theories in psychology admit application of conjoint measurement theory. Some of the simpler applications are described in Michell (1990), and many others are described elsewhere (e.g., Perline, Wright,


& Wainer, 1979; and Levelt, Riemersma, & Bunt, 1972). The kind of situation to which conjoint measurement theory in its simplest form is applicable is one involving the relation between three not necessarily distinct variables. Suppose that levels of variables A and X combine noninteractively to produce levels of variable P, but that none of these variables can be measured as yet. If levels of A and X can be independently identified and the consequent levels of P can be ordered, then that is sufficient to (a) test the hypothesis that A, X, and P are quantitative, and (b) if they are, to begin measuring them. What is required is that the order upon P satisfy a hierarchy of cancellation conditions (see Krantz et al., 1971; Michell, 1990).

We may think of the relationship between A, X, and P as expressed in a matrix in which the rows are levels of A (call them a, b, c . . . ), the columns levels of X (call them x, y, z . . . ), and the cells levels of P (call the result of combining level a of A with level x of X, level (a, x) of P, and so on). The cancellation conditions are then constraints upon the ordinal relations between levels of P. For example, single cancellation (often called independence) is that the order upon the columns in any row must be replicated in all rows and that, likewise, the order upon the rows in any column must be replicated in all columns. Double cancellation, triple cancellation, and so on, are more complex ordinal constraints. The important point about such conditions is that they are testable and, so, present the possibility of testing the hypothesis that A, X, and P are all quantitative. To be more precise, single cancellation and double cancellation may be expressed as follows.

Single Cancellation

(1) For any levels, a and b, of A and, x, of X, if (a,x) > (b,x) then for all other levels, y, of X, (a,y) > (b,y); and
(2) for any levels x and y, of X and a of A, if (a,x) > (a,y) then for all other levels, b, of A, (b,x) > (b,y).

Double Cancellation

For any levels, a, b, and c, of A and x, y, and z, of X, if (a,y) > (b,x) and (b,z) > (c,y), then (a,z) > (c,x).
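The two conditions can be checked mechanically on a matrix of ordinal levels. The sketch below is a minimal illustration, assuming that each cell of P is coded by a number whose order (and nothing else) is meaningful, and taking double cancellation in its usual form: if (a,y) > (b,x) and (b,z) > (c,y), then (a,z) > (c,x). The function names and both example matrices are hypothetical and are not taken from the text.

```python
from itertools import product


def single_cancellation(P):
    """P[i][j] codes the ordinal level of P for level i of A and level j of X."""
    rows, cols = range(len(P)), range(len(P[0]))
    for a, b in product(rows, repeat=2):
        for x, y in product(cols, repeat=2):
            # (1) if (a,x) > (b,x) then (a,y) > (b,y)
            if P[a][x] > P[b][x] and not P[a][y] > P[b][y]:
                return False
            # (2) if (a,x) > (a,y) then (b,x) > (b,y)
            if P[a][x] > P[a][y] and not P[b][x] > P[b][y]:
                return False
    return True


def double_cancellation(P):
    """If (a,y) > (b,x) and (b,z) > (c,y), then (a,z) > (c,x)."""
    rows, cols = range(len(P)), range(len(P[0]))
    for a, b, c in product(rows, repeat=3):
        for x, y, z in product(cols, repeat=3):
            if P[a][y] > P[b][x] and P[b][z] > P[c][y] and not P[a][z] > P[c][x]:
                return False
    return True


# Cells generated additively from row values 0, 1, 3 and column values 0, 2, 5.
additive = [[0, 2, 5],
            [1, 3, 6],
            [3, 5, 8]]

# Rows and columns are consistently ordered, yet no additive structure fits.
non_additive = [[1, 3, 4],
                [2, 6, 8],
                [5, 7, 9]]

print(single_cancellation(additive), double_cancellation(additive))
print(single_cancellation(non_additive), double_cancellation(non_additive))
```

The second matrix passes single cancellation but fails double cancellation, which shows why the higher-order conditions carry information about additivity that consistent row and column orders alone do not.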


The other cancellation conditions are of this form, but more complex. In essence they all state that if certain specified ordinal relations exist between levels of P, then others must obtain as well. As mentioned, A, X, and P need not be distinct variables, and I have been interested in exploring the application of conjoint measurement theory to Coombs' (1964) theory of unidimensional unfolding (Michell, 1990). For certain sets of preference orders, Coombs' theory entails an ordering upon interstimulus midpoints. Such an ordering must satisfy the hierarchy of cancellation conditions if the dimension involved is quantitative because the midpoint between any two stimuli is a noninteractive function (midpoint (x,y) = 1/2 (x + y)). Hence, just by inspecting preference orders on sets of unidimensional stimuli (for example, attitude statements) the hypothesis that the dimension involved is quantitative may be tested.

Taking the classical theory of measurement seriously is a necessity for the enterprise called psychological measurement, if it is to become part of mainstream quantitative science. At present psychological measurement only sustains itself by defining measurement in its own special way. In the physical sciences its meaning is tied to the classical theory (cf., e.g., Beckwith & Buck, 1961). Taking the classical theory seriously means, above anything else, finding ways to test the hypothesis that psychological variables are quantitative, and our best hope of doing that is through applying the theory of conjoint measurement.

REFERENCES

Beckwith, T.G., & Buck, N.L. (1961). Mechanical measurements. Reading, MA: Addison-Wesley.
Bigelow, J. (1988). The reality of numbers. Oxford: Oxford University Press.
Bostock, D. (1979). Logic and arithmetic: Vol. 2, Rational and irrational numbers. Oxford: Oxford University Press.
Campbell, N.R. (1920). Physics, the elements. Cambridge, UK: Cambridge University Press.
Clagett, M. (1968). Nicole Oresme and the medieval geometry of qualities and motion. Madison, WI: Wisconsin University Press.
Coombs, C.H. (1964). A theory of data. New York: Wiley and Sons.
Dedekind, R. (1909). Essays on the theory of numbers. Chicago: Open Court.
Forrest, P., & Armstrong, D.M. (1987). The nature of number. Philosophical Papers, 16, 165-186.
Heath, T.L. (1908). The thirteen books of Euclid's elements (Vol. 2). Cambridge, UK: Cambridge University Press.
Holder, O. (1901). Die axiome der quantitat und die lehre vom mass. Berichte uber die Verhandlungen der Koniglich Sachsischen Gesellschaft der Wissenschaften zu Leipzig, Mathematische-Physische Klasse, 54, 1-64.
Irvine, A.D. (1990). Physicalism in mathematics. Boston: Kluwer Academic.
Krantz, D.H., Luce, R.D., Suppes, P., & Tversky, A. (1971). Foundations of measurement (Vol. 1). New York: Academic Press.
Lear (1982). [...] 91, 161-192.
Levelt, W.J.M., Riemersma, J.B., & Bunt, A.A. (1972). Binaural additivity in loudness. British Journal of Mathematical and Statistical Psychology, 25, 51-68.
Luce, R.D., Krantz, D.H., Suppes, P., & Tversky, A. (1990). Foundations of measurement (Vol. 3). New York: Academic Press.
Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Erlbaum.
Narens, L. (1985). Abstract measurement theory. Cambridge, MA: MIT Press.
Newman, E.B. (1974). On the origin of scales of measurement. In H.R. Moskowitz, B. Scharf, & J.C. Stevens (Eds.), Sensation and measurement (pp. 137-145). Dordrecht-Holland: Reidel.
Perline, R., Wright, B.D., & Wainer, H. (1979). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 9, 249-264.
Russell, B. (1983). The a priori in geometry. In K. Blackwell, A. Brink, N. Griffin, R.A. Rempel, & J.G. Slater (Eds.), The collected papers of Bertrand Russell (Vol. 1, pp. 289-304). London: George Allen & Unwin. (Original work published 1896.)
Russell, B. (1903). Principles of mathematics. Cambridge, UK: Cambridge University Press.
Stein, H. (1990). Eudoxos and Dedekind: On the ancient Greek theory of ratios and its relation to modern mathematics. Synthese, 84, 163-211.
Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 667-680.
Stevens, S.S. (1951). Mathematics, measurement and psychophysics. In S.S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: Wiley.
Stevens, S.S. (1959). Measurement, psychophysics and utility. In C.W. Churchman & P. Ratoosh (Eds.), Measurement: Definitions and theories (pp. [...]-63). New York: Wiley.
Suppes, P. (1959). Measurement, empirical meaningfulness and three-valued logic. In C.W. Churchman & P. Ratoosh (Eds.), Measurement: Definitions and theories (pp. 129-143). New York: Wiley.
Whitehead, A.N., & Russell, B. (1913). Principia mathematica (Vol. 3). Cambridge, UK: Cambridge University Press.
Whiteside, D.T. (1967). The mathematical works of Isaac Newton (Vol. 2). New York: Johnson Reprint Corp.

chapter 3

The Rasch Debate: Validity and Revolution in Educational Measurement*

William P. Fisher, Jr.
Postmodern Quantities, Inc.
New Orleans, LA

THE DEBATE

Cherryholmes (1988, p. 449) uses a passage from Rorty (1985) to contrast traditional and alternative approaches to construct validity. Rorty describes two ways in which people make sense of their lives. In one way, the context in which life is understood is that of historical or fictional heroes and heroines; in the other, life is understood in relation to a nonhuman, supposedly unchangeable reality, such as nature. The first way fosters solidarity in community life, the second objectivity, in the positivist sense of facts supposed to completely transcend culture and history. Rorty and Cherryholmes stress that the problem with the one-sided sense of objectivity is that it fails to recognize and

* The author would like to thank the Spencer Foundation for supporting this research, and to thank Carol Myford, Jackson Stenner, Mark Wilson, and Benjamin Wright for their readings of the text and their helpful comments, but must take responsibility for the ideas expressed in the chapter himself.


THE RASCH DEBATE


acknowledge its own cultural and historical embeddedness. I would like to add that the problem with the use of narrative stories in the creation of meaning and validity of constructs is that it fails to recognize and acknowledge its own possibilities for a new, more conversational and playful, yet nonetheless rigorous, sense of objectivity. The Rasch debate is a variation on the theme stated by Rorty and Cherryholmes. Jaeger (1987, p. 8) has juxtaposed two quotes that restate the theme in the terms of the debate:

There appears to be a fundamental difference in measurement philosophy between those on the two sides of the Rasch debate. . . . The difference is well characterized in the writings of Benjamin Wright (1968) and E.F. Lindquist (1953). First Wright:

Science conquers experience by finding the most succinct explanations to which experience can be forced to yield. Progress marches on the invention of simple ways to handle complicated situations. When a person tries to answer a test item the situation is potentially complicated. Many forces influence the outcome, too many to be named in a workable theory of the person's response. To arrive at a workable position, we must invent a simple conception of what we are willing to suppose happens, do our best to write items and test persons so that their interaction is governed by this conception and then impose its statistical consequences upon the data to see if the invention can be made useful. (1968, p. 97) [emphasis added; and the quote is actually from Wright, 1977b, p. 97].

In contrast, Lindquist wrote:

A good educational achievement test must itself define the objective measured. This means that the method of scaling an educational achievement test should not be permitted to determine the content of the test or to alter the definition of objectives implied in the test. From the point of view of the tester, the definition of the objective is s[...]tion. The objective is handed down to him by those agents of society who are responsible for decisions concerning educational objectives, and what the test constructor must do is to attempt to incorporate that definition as clearly and exactly as possible in the examination that he builds. (1953, p. 35) [emphases added].

Although Jaeger also characterizes the debate as one "between advocates and opponents of the use of IRT [Item Response Theory] in test development and scaling," the debate on the usefulness and meaningfulness of Rasch measurement is conducted within what Jaeger would call the IRT community just as much as between it and those outside of it. The debate is therefore taking place on a number of levels, as well as in an international forum.


FISHER

Those advancing various reasons for not using Rasch's approach to educational and psychological measurement, or for narrowly restricting its application, include Bollinger and Hornke (1978), Divgi (1986, 1989), Goldstein (1979, 1980, 1983), Grau and Mueser (1986), Lord (1980, p. 58; 1983), Whitely (1977), Whitely and Dawis (1974), and Wood (1978). Those rebutting the claims of the critics include Andrich (1988, 1989), Fischer (1987, p. 585), Fisher (1991), Gustafsson (1980), Henning (1989), Lewine (1986), and Wright (1968, pp. 99-101; 1977a; b, pp. 102-104; 1984; 1985, pp. 107-109; Wright & Linacre, 1989). Some Rasch advocates suggest that Rasch measurement presents the possibility for a revolution in educational and social measurement (Andrich, 1987; Duncan, 1984a,b,c; Fisher, 1988, 1991; Loevinger, 1965; Singleton, 1991). The same sort of claims (Cliff, 1973; Michell, 1990) have been advanced on behalf of conjoint measurement theory (Luce & Tukey, 1964; Krantz et al., 1971; Ramsay, 1975), to which Rasch's work is closely related (Brogden, 1977; Perline, Wright, & Wainer, 1979).

Lindquist is plainly and emphatically appealing to a one-sided objectivism in which construct validation is assumed to take place outside of the context in which the construct is manifest. Wright, in contrast, is just as plainly and emphatically struggling with the problem of dealing with the way constructs are simultaneously invented and discovered. Where Lindquist speaks of the sacrosanct, untouchable nature of test items, Wright says that test items amount to nothing more than guesses as to how a construct articulates itself. Wright's suggestion that we observe how well the guesses work to provoke a manifestation of the construct via the interaction of question and answer, and then see how far the guesses can be made to work in practice, is a fair approximation of what Ricoeur (1981, pp. 212-213) calls the method of converging indices and its probabilistic approach to the validation of guesses. Lindquist wants to disavow the fact that the test items originated in a discursive context, preferring to conceive of them as given in an objective reality. Wright, however, is focusing explicitly on the circular manner in which guesses about reality are entertained, criticized, tested, and applied in an ongoing constructive way.

The extent to which Lindquist is articulating a commonly held position in educational measurement is indicated by the popularity of multiparameter IRT models. The unwillingness of educators to enter into the circular and conversational logic of construct validity continues, despite the fact that the mathematical form of the IRT models contradicts necessary and sufficient requirements for objectivity (Wright, 1984; Andrich, 1988, p. 67), and makes the models difficult and expensive to use (Wright, 1984; Stocking, 1989; Hambleton & Cook, 1977, p. 76; Hambleton & Rogers, 1989, p. 158). One reason for the popularity of two- and three-parameter measurement models in education is that they allow the test constructor to accept the validity of test items with no questions asked. Multiparameter models suppress questions of fit because most items fit these models, and when they do not, the reasons why are so technical that confidence in the test is not affected. The Rasch, or "one-parameter," approach, in contrast, requires the test constructor to pay close attention to the functioning of the items, checking for the extent to which they can be said to hang together along a single continuum of more and less difficulty. The critical evaluation of the performance of the items on the test undercuts the one-sidedness of the test writers' and researchers' authority by acknowledging the voices of the test takers. Instead of objectifying test takers by subjecting them to an unquestionable authority (Cherryholmes, 1988, p. 430), the Rasch approach to test construction promotes a conversation in which questions are tested by the respondents just as much as the respondents are tested by the questions.

Rigorous test administration practices demand that the intrusion of any factors other than the abilities of the persons measured and the difficulties of the problems posed be minimized. Wright and Stone (1979, pp. 10-11) ask why test administration should not follow through on this demand, explicitly enacting in practice what is otherwise merely assumed to be required for legitimate comparisons. Duncan (1984b, p. 217; also see 1984c, p. 400) observes that what we need are

not so much a repertoire of more flexible models for describing extant tests and scales . . . but scales built to have the measurement properties we must demand if we take "measurement" seriously.
As I see it, a measurement model worthy of the name must make explicit some conceptualization—at least a rudimentary one—of what goes on when an examinee solves test problems or a respondent answers opinion questions; and it must incorporate a rigorous argument about what it means to measure an ability or attitude with a collection of discrete and somewhat heterogenous items.

The great majority of educational measurement models do in fact belong to a repertoire of models flexible enough to describe extant tests. Rasch models, in contrast, specify the properties we must demand if we take measurement seriously, focusing on meaningful comparisons, those in which item difficulty does not depend on person ability, and vice versa. More flexible models, by definition, allow unexamined presuppositions, prejudices, and preconceptions concerning who the persons measured are, and whether the test items actually belong to the same variable, to interfere with the measurement process. Should not the preconceptions that necessarily structure questions and observations themselves be examined, modified, and accounted for, just as much as the students' test behavior and environment are controlled? These questions raise issues best addressed by widening the scope of the debate to include explicit considerations of what the most important form of test validity is.
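
The requirement that item difficulty not depend on person ability, and vice versa, can be illustrated with a short derivation. This is only a sketch under assumed notation that does not appear in the text above: $\theta_n$ stands for the ability of person $n$, $b_i$ for the difficulty of item $i$, and $a_i$ for an item discrimination parameter of the kind that two-parameter models add. The logistic form shown is the usual textbook statement of the dichotomous Rasch model, not a quotation from any of the sources cited here.

```latex
% Dichotomous Rasch model: probability that person n succeeds on item i
\[
  P(X_{ni}=1) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)},
  \qquad
  \operatorname{logit} P(X_{ni}=1) = \theta_n - b_i .
\]

% Comparing two persons n and m on the same item i: the item drops out
\[
  \operatorname{logit} P(X_{ni}=1) - \operatorname{logit} P(X_{mi}=1)
  = \theta_n - \theta_m .
\]

% Comparing two items i and j for the same person n: the person drops out
\[
  \operatorname{logit} P(X_{ni}=1) - \operatorname{logit} P(X_{nj}=1)
  = b_j - b_i .
\]

% With an item discrimination a_i (two-parameter form), neither cancellation holds
\[
  \operatorname{logit} P(X_{ni}=1) = a_i(\theta_n - b_i),
\]
\[
  a_i(\theta_n - b_i) - a_i(\theta_m - b_i) = a_i(\theta_n - \theta_m),
  \qquad
  a_i(\theta_n - b_i) - a_j(\theta_n - b_j)
  = (a_i - a_j)\,\theta_n - a_i b_i + a_j b_j .
\]
```

In the Rasch case the comparison of two persons is the same whichever item is used, and the comparison of two items is the same whichever person responds. Once discriminations are allowed to differ across items, the person comparison depends on the item chosen and the item comparison depends on the person's ability, which is one way of reading the claim that more flexible models let the particulars of persons and items interfere with the comparison.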

MATTERS OF CONTEXT

Content and Construct

Lindquist is working from within the traditional positivist framework, described by Burtt (1954) as one which defines objectivity as a matter of letting data speak for themselves, with no recourse to presuppositions or hypotheses allowed. This sense of data arose in historical periods when nature was conceived to be a static constant, with the continents, seas, stars, planets, and biological life precisely the same now as they were on the day God finished the Creation. This sense of data as existing eternally and independent of any human context has fallen under the weight of many different factors, ranging from notions concerning the life cycle of the universe, plate tectonics, and evolution, to the observation that what counts as legitimate data and rational thinking changes from one historical period to another (Kuhn, 1961, 1970; Toulmin, 1982; Holton, 1988; Hesse, 1970, 1972). However, many of us, like Lindquist, continue to think and act, out of habit, perhaps, as if data are given, not emerging from within a frame of reference. Messick (1975, p. 959; Cherryholmes, 1988, p. 426) offers a more specific reason for Lindquist's views on educational measurement: Construct validity is not usually sought for educational tests, because they are typically already considered to be valid on other grounds, namely, on the grounds of content validity. Hambleton and Novick (1973) claim, for example, that "above all else, a criterion-referenced test must have content validity" (p. 168). Assuming that tests are valid on grounds of content validity is to be imbued with the overweening confidence that things are as they are because that is the way someone says they are, not because that is the way they actually play themselves out in practice. Examination of the


empirical consistency of data may lead to the conclusions that particular test items, and perhaps specific content areas included on a test, represent constructs different enough in their conceptual structure to invalidate the inferences concerning abilities typically made on the basis of test scores. The search for construct validity may then contradict the conclusions already drawn concerning the content validity of test items, as Phillips (1986, p. 107) indicates: the deletion of misfitting items raises the issue of sacrificing validity for model fit. Typically, achievement test batteries are carefully developed according to detailed content specifications. If items are dropped from a subtest, that subtest no longer matches the test specifications and has lost content validity. Notice the force of Phillips's assertion: validity is inherently a matter of content validity. As Lindquist makes explicit, no question need be raised concerning construct validity, concerning whether or not what is measured is actually what is assumed to be measured. A typical reaction to the suggestion that some items should be deleted from a test assumes that content validity is the only validity relevant to an educational test, as when it is said that It is by no means clear that the Rasch model does describe real data very well. Willmott & Fowles (1974) admit that when testing the model some items do not fit the model. These are omitted from the set of items. As they say, "The criterion is that items should fit the model, and not that the model should fit the items." (!) (Goldstein & Blinkhorn, 1977, p. 310; original emphasis and exclamation; also see Goldstein, 1979, pp. 215-216). Because the position informed by measurement theory asserts that data should be fit to a model that clearly specifies criteria for recognizing data good enough to measure with, the Rasch model may not always describe real data very well.
This state of affairs says more about the quality of the data than the usefulness of the model. Goldstein (1979, p. 216), however, is adamant about "moving away from the doctrine of a singly underlying trait, [in order to] allow educational criteria properly to determine test content." But as Gustafsson (1980) points out, items that do not belong to one construct may well belong to another; the problem may be as simple as separately analyzing the groups of items. No one in this debate has seriously recommended that misfitting items simply be discarded. It is only reasonable to think that items from the same content domain might represent


different constructs, and produce data with independent empirical consistencies. The point is to admit that measurement always and everywhere follows from a metaphysics of what counts as an observation (Burtt, 1954; Heelan, 1972, 1983, 1985; Heidegger, 1967; Hudson, 1972; Ihde, 1979, 1991; Kuhn, 1961), and to step into the flow of the hermeneutic circle deliberately and in accord with our intentions.

Imagination, Ideality, and Empirical Consistency

Focusing on content to the exclusion of the construct reenacts a fundamental error that has been repeated over and over again in the history of science. The error made one of its earliest and most famous appearances in the Pythagorean ontological confusion of representations and images for the things themselves. In the same way that an exclusive focus on content validity precludes attention to constructs, Pythagoreans take number and numerical relationships for existence itself and are unable to think of the noetic order of existence by itself, [and so they never] see the real implications of the [Platonic] doctrine of ideas. (Gadamer, 1980, p. 35; also see p. 32). The Pythagoreans were caught up in unsolvable problems such as the squaring of the circle, trying to solve them by means of the physical transcription of the images themselves. Besides forbidding "all recourse and all allusion to manipulations, [and] to physical transformations of figures," Plato redefined the elements of geometry, "denominating such concepts as line, surface, equality, and the similarity of figures" (Ricoeur, 1965, p. 202; also see Gadamer, 1980, p. 150). Conceiving a point as "'an indivisible line,' and a line as 'length without breadth'" (Cajori, 1985, p. 26), Plato construed geometric entities as fictions in order to make the difference between names and concepts as plain as possible.
Galileo placed modern science on the same footing when he based his theory of gravity on the behaviors of objects in a frictionless vacuum, behaviors he would never observe. Rasch's (1960, pp. 37-38) comment that "a model is not meant to be true" is intended to have the same effect as Galileo's realization that he was imagining how gravity might be modeled. Theories and models never fit experience exactly, but instead serve as heuristic aids in organizing and managing experience meaningfully. For instance, the crisis of Pythagorean mathematics was overcome by Plato's redefinition of geometrical elements, because irrational numbers live out the same conceptual existence in ideality that rational ones do. The irrationality of the square root of two no longer threatened the heart of mathematical reason after Plato because the existence of this number and the line segment it represents no longer depended upon representation as a line segment of precisely drawable length or as a number that could be exactly specified. The crisis of educational, psychological, and social measurement provoking the Rasch debate hinges on the same problem, namely, that the rationality of testing depends on whether the qualities measured are modeled by content (name) or construct (concept). The point in using figures of any kind, whether they are metaphorical, numerical or geometrical, is to facilitate clarity in thinking through clear representation of the thing itself. Clear views of things are brought about when one can see through the content of the particular figure drawn and see the thing itself free of influence from the particular representation instrumental to the observation. Plato's restricting the use of instruments in geometry to the compass and straightedge was aimed at allowing things to communicate themselves, not by confusing the conceptual ideality of things with their names, as Pythagoreans and positivists do, but by using the instruments as media for the expression of the things themselves. Plato placed philosophy in close association with mathematics because geometrical analyses are not valid just because they are performed on geometrical figures such as circles and triangles. It is essential to establish the validity of the construct, to distinguish between the content of the items and the validity of taking them as representative of a conceptual dimension. "Since predictive, concurrent, and content validities are all essentially ad hoc, construct validity is the whole of validity from a scientific point of view" (Loevinger, 1957, p. 636, in m referenced" (Messick, 1975, p. 957, emphasis in original). Loevinger's (1965, p.
151) appreciation for Rasch measurement cannot be separated from her position on construct validity, since "any concept of validity of measurement must include reference to empirical consistency" (Messick, 1975, p. 960). Whitely, on the other hand, holds to the explicitly positivist end of Cronbach and Meehl's (1955) sense of construct validity as "appealing to criteria outside of the measuring process . . . in accordance with a nomothetic network" (Whitely, 1977, p. 232), which is exactly the way Goldstein (1983), Hambleton and Novick (1973), Lindquist (1953), and Phillips (1986) see the matter. Wright's (Wright & Masters, 1982, p. 91) concept of construct validity is much closer to Cherryholmes's, Loevinger's, and Messick's discursive formulation than it is to Whitely's positivist construal:


The responses of each person can be examined for their consistency with the idea of a single dimension along which items have a unique order. Unless the responses of a person are in general agreement with the ordering of items implied by the majority of persons, the validity of the person's measure is suspect. The same dialectical relation between whole and part holds for items: Responses to each item must be examined for their consistency with the idea of a single dimension along which persons have a unique order. Unless the responses to an item are in general agreement with the ordering of persons implied by the majority of items, the validity of the item is suspect. Wright stresses the need to constantly refer and defer to the text of what has been said and done in the administration of the test. In a manner reminiscent of recent work in the philosophy of science that stresses the mediating role of instruments in experiment (Ackermann, 1985; Heelan, 1983, 1985; Ihde, 1979, 1991), Wright is construing data as a text that resonates in the lives of those who read and write it. And in contrast to the detached, uninvolved, and cool sense of theorizing deployed by those who take content validity as primary, Wright's stress on the use of experiment belies his sense of theory as a matter of participating in and being committed to the object of discourse, which is again in close accord with recent observations made in the philosophy and history of science (Hacking, 1983, 1988; Heelan, 1988, 1989; Hesse, 1970, 1972; Holton, 1988; Kuhn, 1961, 1970; Latour & Woolgar, 1979; Ormiston & Sassower, 1989). The history of science supports the discursive formulation of construct validity and disputes positivism's exclusive concern with content because of the crucial importance of the ontological difference between mathematical and perceptible being.
This difference is what "Eudemos singles out [as] Plato's contribution in his history of mathematics, namely, to have distinguished between name and concept (Simp Plato resolved the Pythagorean overcomplications with mathematical clarity and simplicity, Copernicus, Kepler, and Galileo founded modern science when they resolved the Aristotelian astronomical complications by basing their studies on mathematical idealizations and observations. Cronbach and Meehl (1955) focused attention on the difference between content and construct, and brought social measurement a step nearer to recreating the ancient meaning of mathematical clarity. Rasch's restrictions on measuring instruments, in turn, have the


potential of recreating in social science what Plato's and Galileo's restrictions on, and uses of, measuring instruments did for geometry and natural science. Instead of allowing the perceptible being of content to dictate validity, Rasch measurement fosters an awareness of the ontological depth that mathematical description offers. Those who take content validity to be the sole form of validity required for measurement wish to be able to nail down hard facts, not go with the flow of the life cycle of facts (Fleck, 1979) through their birth, life, and death, as is required for the validation of constructs.

JAEGER'S REVOLUTION REVISITED

Jaeger (1987) juxtaposes the quotes from Lindquist and Wright in the context of alternately proclaiming and questioning the revolutionary status of developments in educational measurement over the last 20 years. Just as Wright (1984, 1988b, for example) often does, Jaeger (1987, pp. 9-12) uses quotes from Thorndike and Thurstone as evidence of the age and importance of some of the most fundamental ideas in educational measurement. But Jaeger does not explore the possibility that the revolution in educational measurement begun by Thorndike, Thurstone, and others is still happening; and he does not sufficiently elaborate upon what the point of the revolution might be. The contextual matters crucial to understanding the Rasch debate have provided some clues as to what that point might be. Kuhn (1970) suggests more to look for when he indicates that observational anomalies, methodological problems in accounting for them, and resulting degrees of extreme complication prepare the ground for scientific revolutions.
Thus, the Pythagorean and Aristotelian overcomplications and rationalizations that Plato and Galileo cut through with their insistence on rigorous observation and mathematical idealization in the use of the compass, straightedge, and telescope may have their parallels in the fixation on content validity plaguing educational measurement. The history of science in general, and Kuhn's theory of scientific revolutions in particular, leads to at least three hypotheses concerning the extent to which the Rasch debate is a revolution in the making (Andrich, 1987). These hypotheses, and some evidence bearing them out, will be briefly enumerated and sketched.

Crisis

The first hypothesis of scientific revolution asserts that there should be a widespread general sense of crisis in the field, as well as in others


constrained by the same paradigmatic orientation. In this case, education, measurement, and the very proposition that quantification could be useful and meaningful should be under fire. That education is in a state of crisis is by now an understatement; crisis in the world at large has escalated to the point that crisis has become the normal, everyday state of affairs. Education has served as a model for dealing with political, economic, and social problems for centuries, and now it is failing as we see that much of what passed for education was actually indoctrination into various ideologies. Because testing is purported to separate those who know something from those who do not, it has come under harsh criticism for failing to perform this purpose fairly and unambiguously (Crouse & Trusheim, 1988; Gould, 1981; Owen, 1985; Strenio, 1981; Sutherland, 1984). The large and significant literature on the shortcomings of quantitative methods in social science that has erupted (Bakan, 1966; Carver, 1978; Coats, 1970; Falk, 1986; Krenz & Sax, 1986; Michell, 1986; to name just a few), and the horrors of educational measurement alluded to by Lumsden (1976), are part and parcel of the crisis of rationality.

Shifting Paradigms

Second, alternative paradigms should crystalize from the crisis situation; alternative methods and theoretical approaches coalesce into a new paradigm when their language becomes incommensurable with that of the traditional paradigm. Dissatisfaction with the very idea that human abilities and attitudes can be quantified has reached such a pitch that qualitative approaches are widely considered to be at the forefront of methodological innovation in the social sciences at large. The force of this movement comes from the realization that meaning is more important to social inquiry than facts are.
Andrich (1988), Michell (1990), and Wright (1977b) agree with Kuhn (1961) when they emphasize how important qualitative research is in the development of quantitative measures. What I shall call the quantitative paradigm refers to the uncritical acceptance of numbers as valid representatives of qualitative structures. In the same way that Pythagoreans worshipped number, mistaking numerical relations for existence itself, blind submission to the "quantitative imperative" (Michell, 1990) takes place in educational measurement whenever the content of the questions asked is the sole arbiter of validity. This is the same thing as ignoring the first fundamental problem of measurement, the justification of the measured and measuring (Suppes & Zinnes, 1963, p. 4).


The possibilities for different languages appear because, as Cherryholmes (1988) points out, the focus on construct validity in qualitative research offers a stark contrast with the lack of concern for it in the quantitative paradigm, despite Loevinger's (1957) and Messick's (1975) stress on it as the "whole of validity." The quantitative paradigm contends that, "above all else, a criterion-referenced test must have content validity" (Hambleton & Novick, 1973, p. 168). Whereas the qualitative paradigm takes an experimental perspective, allowing the imagination to play upon itself in the service of dialogical objectivity (Heelan, 1988; Ihde, 1991; Ormiston & Sassower, 1989), the quantitative paradigm insists only that its dictates be followed to the letter. For instance, Divgi (1986, p. 283) says: "Issues like 'objectivity' and consistent estimation are shown to be unimportant in selection of a latent trait model." Whitely (1977, p. 233) concurs, saying that "data on the internal structure of a test may not be substituted for other kinds of validity data." These statements replace construct validity with content validity and are completely opposed to Messick's (1975, p. 960) assertion that validity bears directly on empirical consistency. More echoes of Lindquist's appeal to the authorities on high, the sacrosanct nature of test items, and the prohibition against monkeying around with item content resound when Messick (1975, p. 959) quotes Osburn (1968, p. 101), who says that what the test is measuring is operationally defined by the universe of content as embodied in the item generating rules. No recourse to response-inferred concepts such as construct validity, predictive validity, underlying factor structure or latent variables is necessary to answer this vital question. Cherryholmes (1988, pp. 452-453) observes that this sort of ultraoperationalism had been rejected even by the logical positivists more than 30 years before Osburn wrote, because they saw that conceptual significance is never generated by strictly following rules. Cronbach and Meehl (1955) accordingly rejected operationalist definitions of constructs in their study of construct validity. Willmott and Fowles (1974) give concise expression to the different premises of the qualitative and quantitative paradigms, respectively, when they say that "The criterion is that items should fit the model, and not that the model should fit the items." Michell (1990, p. 8) phrases the qualitative theme in similar terms, saying that "The only way to decide whether or not the variables studied in any particular science are quantitative is to put that hypothesis to the test. This essential step is missing in the development of modern psychology."


Just as Plato and Galileo stressed the conceptual ideality of measurement constructs in opposition to the Pythagorean and Aristotelian confusion of number and existence, Rasch's qualitative approach to measurement conceives of ability and difficulty idealistically, as if neither depended upon the particulars of the other. Just as Plato's geometrical fictions and Galileo's physical fictions served as heuristic models for the mathematical sciences of their ages, so will Rasch's socio-psycho-educational fictions serve as heuristic models for the coming age. Therefore, as Fischer (1987, p. 585) puts it, rather than rejecting Rasch's models as being too narrow, as Goldman and Raju (1986, p. 19), Goldstein (1983, p. 373; Goldstein & Blinkhorn, 1977, pp. 310-11), Hambleton and Rogers (1989, p. 148), and Whitely (1977, pp. 229, 232-233) explicitly do, one should instead change the data by altering the experimental design or the mode of observation. After all, it is "difficult to say in what sense measurement is achieved if that property [of parameter separability characteristic of data fitting a Rasch model] is violated" (Duncan, 1984a, p. 224; also see 1984c, pp. 398-399). These alternative perspectives are paradigmatically distinct insofar as each has radically different presuppositions about what counts as a legitimate question, and how one goes about determining whether a question is legitimate. The two paradigms also trace separate historical traditions, which contributes to the way their proponents tend to speak at cross purposes. The quantitative paradigm in education owes a great deal to logical positivism (Cherryholmes, 1988) and the operationalism of Bridgman (1927) and its applications to measurement by Stevens (1946) (Michell, 1990, pp. 15-20).
The qualitative paradigm, on the other hand, largely follows from the phenomenology of Husserl (1970, originally published in German in 1936), the existential hermeneutics of Heidegger (1962, 1967; originally published in 1927 and 1935, respectively), Freudian psychology, Marxism, and ethnography. Contrary to the impression one might receive from most current works identifiable as qualitatively oriented, philosophical writers such as Husserl, Heidegger, Gadamer, Ricoeur, and Levi-Strauss explicitly related their interests to the understanding of mathematics, technology, and objectivity. Heelan and Ihde are among the very few contemporary writers who have realized and acted upon the relation of phenomenology to science, though Michell (1990, p. 8) recognizes Brentano, the teacher of Husserl and Freud, as an early leader in the qualitative paradigm, and Wheeler and Zurek (1983) mention the relevance of Husserl to the measurement problems of contemporary physics. In an article on construct validity, Whitely (Embretson (Whitely),


1983) has moved somewhat closer to a qualitatively informed theory of constructs than was evidenced in her earlier publications. But even when she qualifies her emphasis on item content and the nomothetic network in favor of empirical consistency and construct representation, Whitely continues to construe Rasch item and person parameters as representations of theoretical constructs (Embretson (Whitely), 1983, p. 186). Where Cherryholmes (1988) places construct validation in the realm of poststructuralist discourse analysis, Whitely (Embretson (Whitely), 1983, p. 179) traces a change from functionalism to structuralism, which means that her focus has shifted only one step away from the operational definition of the construct and is now concerned with combining the operationalism with an overly mechanical sense of the meaning of the item calibrations and person measures. In this context, Whitely points out that unidimensional measurement models do not provide a suitable basis for comparing alternative construct theories because tests of unidimensionality are "useful only for those theories that postulate a single construct," and even for these, the isolation of a "single dimension could be due to the completely confounded influence of several constructs" (Embretson (Whitely), 1983, p. 186). But why should it be reasonable to expect a general measurement model to serve as a means of representing constructs in the first place? Why should tests of unidimensionality be so crucial to the comparison of alternative construct theories? Whitely's (Embretson (Whitely), 1983, p. 195) reference to Bechtoldt's (1959) sense of construct operationalization as "a major focus of the proposed approach to construct validation research" provides an important clue to how she would answer these questions, as Messick (1981, p. 
578) indicates: Bechtoldt's (1959) argument identifies not just the meaning of the test score but the meaning of the construct with the scoring operations, thereby confusing the test with the construct and the measurement model with the substantive theory. In confusing the test with the construct and the measurement model with substantive theory, Bechtoldt and Whitely reiterate what Gadamer (1980, p. 35) calls the Pythagorean confusion of number and numerical relationships with existence itself. Others more appropriately stress that "nothing in the fit between response model and observation contributes to an understanding of what the regularity means. In this sense, the response model is atheoretical" (Stenner, Smith, and Burdick, 1983, p. 308). The only reason why Whitely might expect the response model to be


theoretical is that her structuralist sense of construct representation demands it. Even when partial credit (Masters, 1982) or facets (Linacre, 1991) models are used to structure the theory informing a test's content, and tests of unidimensionality show themselves to be useful in relation to theories that postulate more than one construct, the theory of measurement implemented by the models cannot offer anything in the way of a substantive theory of the construct. Once responses have been determined to point along one direction of more and less useful for purposes of comparison, then questions of construct validity (Are persons expected to be more able scoring higher? Are items expected to be more difficult missed more often?) can be raised (Wright & Masters, 1982, p. 93).

Empirical vs. Theoretical Support

As a third sign of revolution, the traditional paradigm should have the advantage of more data supporting its position, and the disadvantage of fewer theoretical resources at its disposal to explain anomalous data, in relation to the alternative paradigm. In the present instance, adherents of the quantitative paradigm should assert that (a) their theories and models fit commonly found data better than the theories and models of the qualitative paradigm, and (b) their own theories and models are nonetheless extremely complicated, difficult to use, time consuming, inefficient, problematic, and expensive, whereas those of the qualitative paradigm are simple, easy to use, efficient, readily available, and inexpensive. The first half of this hypothesis is supported by Whitely's (1977, p. 229) comment that "the several studies which apply a reasonably stringent test of fit are notable for the frequency with which the [Rasch] model is found to be inappropriate." She even goes so far as to say, in the face of the crisis noted above, that "classical testing procedures have served test development admirably for several decades" (Whitely, 1977, p. 234).
Goldman and Raju (1986, p. 19) say that since the findings of their "study suggest that the two-parameter model fits the attitude survey [of interest] better than the Rasch model, future applications might emphasize the two-parameter model." Hambleton and Rogers (1989, p. 148) are direct, saying that "the one-parameter model has rarely provided a satisfactory fit to the test data; the three-parameter nearly always has." In contrast to the value the quantitative paradigm places on control of item content, the qualitative paradigm values the theoretical and practical advantages of fundamental measurement principles. Kuhn


(1961) says that the role of imagination and qualitative considerations in measurement is far greater than is usually supposed; commitment to these considerations means that some time usually has to pass before early advocates of new theories have managed to put together data supporting their hunches. Data fitting Rasch's implementations of measurement theory are sufficiently commonplace for published listing of widely-used Rasch-based item banks (Choppin, 1968, 1976, 1978; Wright & Bell, 1984) to be several years old. The two- and three-parameter models' capacity to better describe extant data has a flip side to it; the structure of that data cannot be easily explained and cannot be related to principles of measurement in any useful way. As might be expected from item response models whose estimation algorithms contradict their own assumptions of unidimensionality, the most commonly used computer program for implementing the two- and three-parameter IRT models, LOGIST (Wingersky, Barton, & Lord, 1982), has been shown by Stocking (1989, p. 42) to be rife with "large (and sometimes unacceptable) biases" in the estimation of the parameters. Stocking took up the study of LOGIST-based applications of IRT in order "to explore and understand some apparently anomalous results . . . that have been obtained from time to time over the past several years" not only in real data, but also in data simulated to fit the three-parameter model. After remarking, in a manner reminiscent of many of her colleagues (documented in Wright, 1984), on the expense and difficulty of using LOGIST, Stocking (1989, pp. 44-45) concludes that LOGIST . . . needs improvement. Most applications cannot afford to run the program to complete convergence. It may be possible to improve results of the four-step structure by obtaining better starting values for the parameters.
Alternatively, controlling the behavior of estimates of discrimination and guessing parameters through the imposition of prior distributions on them may be cost effective and provide reasonable results. The four-step procedure (Stocking, 1989, p. 21) referred to is one in which abilities and difficulties are estimated first, holding the discrimination and guessing parameters constant; then, the abilities are fixed and the three item parameters are estimated. Steps three and four repeat the first two steps. This structure was imposed on the estimation procedure in an effort aimed at overcoming the tendency of parameter estimates to diverge without limit (Stocking, 1989, pp. 25-26). Lord noted quite some time ago that "the [three-parameter] method

usually does not converge properly" (Lord, 1968, p. 1015) and that "experience has shown that if . . . restraints are not imposed, the estimated value of [discrimination] is likely to increase without limit" (Lord, 1975, p. 14). These problems are precisely what caused Wright to reject the multiparameter approaches in the mid-1960s, when he and Bruce Choppin wrote such programs against Rasch's advice (Wright, 1988a, p. 3). LOGIST's four-step procedure is intended to arrest the divergence of the parameters to infinity; this procedure uses the Rasch model, in effect, every other iteration through the data (on the first and third steps of the four-step procedure) in order to provide "reasonable estimates for item parameters and abilities in a feasible amount of time" (Stocking, 1989, p. 21). Stocking (1989, p. 45) makes the same recommendations concerning another program, BILOG (Mislevy & Bock, 1983):

BILOG, being a more recent computer program available for general use, has not been subjected to the same wide variety of applications as LOGIST. As such, it does not contain the necessary restrictions to prevent the numerical procedures from diverging from reasonable, although perhaps less than optimal starting values. It seems clear that such additional restrictions are necessary.

"Better starting values for the parameters," and "imposing prior distributions on them" are "necessary restrictions" that the two most widely used IRT computer programs must incorporate just to provide "reasonable estimates . . . in a feasible amount of time." Wright (1988a, p. 3) realized the same thing about his own two-parameter program in 1964, saying that it would not "converge unless I introduced some inevitably arbitrary constraint. The choice of the constraint would always alter the results. . . . Since I couldn't make the two-parameter program work, I discarded it." Hambleton and Rogers (1989, p.
158) comment on the unavailability, unfriendliness, cryptic and unwanted output, and bugs of IRT computer programs, in addition to the excessive time and prohibitive sample sizes required for their application. In contrast, Hambleton and Cook (1977, p. 88) write that "the problem of ability and item parameter estimation with the Rasch model is quite different. In fact, the estimation problem is essentially resolved." Hambleton and Cook's (1977, p. 76) comment that the only "fast and convenient-to-use computer programs for estimating the parameters [are those available] for the Rasch model" continues to be relevant. Wright (1984) documents more words of praise from those who have identified themselves with the quantitative paradigm's stress on content validity for the efficiency and effectiveness of Rasch's approach to measurement. Because the two- and three-parameter models often do not work at all with small sample sizes, Lord (1983) has said that small sample sizes justify the use of the Rasch model. Rasch measurement would then be the best route to take for the great majority of tests, since most are administered in classrooms with fewer than fifty students.

Validity by Default or Design?

It appears that the most important aspect of validity in American educational measurement is the capacity to tell what Rorty (1985) calls stories of objectivity, in the sense that objectivity is the one-sided imposition of authority. Most educational measurement experts are willing to allow issues of construct validity to be decided by default, and "if researcher-theorists default on construct validity, then they consciously or unconsciously adopt inherited discourses and meanings previously assigned to constructs and measurements" (Cherryholmes, 1988, p. 428; also see Gould, 1981). As Burtt (1954, p. 225) phrased it,

What kind of metaphysics are you likely to cherish when you sturdily suppose yourself to be free of the abomination? Of course . . . in this case your metaphysics will be held uncritically because it is unconscious; moreover, it will be passed on to others far more readily than your other notions inasmuch as it will be propagated by insinuation rather than by direct argument.

The positivist denial of metaphysics is also assumed any time someone purports to be able to count on test items to provide valid and reliable measures when no value is placed on checking whether it is reasonable to add up counts of right answers and assign scores. However, just because experts have decided that items on a test all belong to the same content domain does not mean that they belong to the same construct.
Viewed in this larger context, what Jaeger (1987) called the Rasch debate begins to look more like the validity debate. An exclusive focus on content validity in educational measurement serves ideological, bureaucratic, and administrative needs far more than scientific or human ones. Some writers suggest that educational measurement addresses the social, economic and political agenda of elite decision makers more than it does the interests of equal opportunity and justice (Crouse & Trusheim, 1988; Owen, 1985; Sutherland, 1984; Strenio,
1981); it will continue to do so until more attention is paid to the discourse processes and metaphysics of testing. Cherryholmes (1988, p. 421) suggests that some attention to these issues began, and "social research methodology entered adolescence, if not maturity, in July 1955 . . . with the publication of Cronbach and Meehl's 'Construct Validity in Psychological Tests.'" The problem is that "the adolescence has been arrested" (Cherryholmes, 1988, p. 450). If so, the potential for its further development grew with the publication of Rasch's (1960) research on measurement, as has been suggested by Duncan (1984b, pp. 216-218; c, pp. 398-400). That potential will hardly begin to be realized until educators overcome their fixation on content validity, however.

IMPLICATIONS FOR PRACTICE

The Things Themselves and Keeping the Scientific Theme Secure

Sensitivity to the role of culture in the framing of questions has led to a new emphasis on a qualitative, ethnographic style of research in education. Though this development has been productive in promoting a more dialectical critique of the question-and-answer process, few suggestions for improvements in quantitative thinking have been forthcoming; quantitative methods have been either relegated to the positivist trash heap of history by qualitative purists, or accepted as unavoidably positivist, at least in part, by most of those who still continue to use and think about them. Even those who recognize the philosophical problems attending quantitative methods and incorporate a critical dialectic into their application, such as Cook and Campbell (1979, pp. 91-94), still take only roundabout routes to show that their data focus on a common question and point in the direction from which the responses arrive. A more direct approach is to specify in advance what will count as an observation, on the basis of informal observations, imaginative hunches, or previous research; focus questions on the continuum along which the variable will likely be manifest; and examine the questions for conformity to measurement principles after they have been exposed to treatment by a relevant group of persons (Rasch, 1960; Wright, 1968, 1977b). Where education's traditional concern with content validity moves straight from the unarticulated theoretical construct to observation to assertions concerning what is observed (Cherryholmes, 1988, p. 448) in a monological and one-sided fashion, Rasch
and Wright insist on the importance of completing several spirals through the hermeneutic circle, returning to check and possibly alter observations and theoretical constructs before making assertions about what has been observed or what can be expected in the way of future observations. Cherryholmes (1988, p. 448; also see Fisher, 1990) says that "quantitative and qualitative approaches are combined when the meaning of these bidirectional arrows [moving from construct to observation to phenomenon and back again] is clarified and negotiated." What Cherryholmes (1988, p. 448) refers to as the "'covariation' or shared meaning but not identity" connoted by these arrows has also been called a "mutually critical correlation" (Tracy, 1975) and a "method of converging indices" (Ricoeur, 1981, pp. 212-213) tracing a dialectical spiral that delineates the "arrow of meaning" followed in pursuit of a line of questioning (Ricoeur, 1981, p. 193). The same mutual relation of construct to phenomenon that is mediated by the structure of language embodied in questions holds when data meet the requirements of measurement as these are modeled by Rasch. Focusing the research question by attending to the ways in which it is posed by the test or survey questions extends and refines the question and answer process by which meaning is created in conversation, or by which meaning emerges from the reading of a text. Rasch measurement advances the qualitative critique of quantification and facilitates the investigation of construct validity in distinctively phenomenological and hermeneutic ways. Cherryholmes (1988, p. 432) says that in

Phenomenological and interpretative research . . . authority derives from subjects and blurs distinctions between subjects and objects. . . . Phenomenologically based research produces "truths" different from quantitative, statistically sophisticated research because the locus of power that makes "truth" possible shifts from researchers as subjects to respondents as subjects.

Designing research with the intention of obtaining fit to a Rasch model is a way of heeding Husserl's call to return to the things themselves. Cherryholmes (1988, p. 430) describes the phenomenological epoche in a strict Husserlian sense as a bracketing of the researcher's prior beliefs and attitudes that results in a proscription against imposing their own categories of observation on the objects of study who have become subjects. This transcendental idealism of Husserl has been critically devaluated in the work of his students Heidegger and Gadamer such that
phenomenology is retained as the method of philosophy, but the epoche becomes a bracketing of the particulars through which things make themselves known. The epoche is still performed in order to gain access to the pure thought of the things themselves, but the researcher goes with the flow of, and organizes in an orderly fashion, the past beliefs, opinions, and frames of reference that Husserl (and Cherryholmes) proposed to be simply dropped. Research questions themselves constitute frames of reference and embody attitudes, so it is more realistic to attempt a fusion of the horizons of the research questions with the horizons of the questions the research subjects find pertinent (Gadamer, 1989) than it is to try to purify the questions of background assumptions and presume that the subjects have thereby been free to disclose their understanding of the world. Heidegger (1962, p. 195) said that attention to this hermeneutic circularity is our "first, last, and constant task" in "making the scientific theme secure." Because Rasch (1960, p. 110) estimated person and item parameters "one by means of the other . . . without getting into any logical circle," he was able to fix attention on the Heideggerian task. In opposition to what could be expected from Lindquist (1953), Rasch and Wright would agree with Heidegger that "science [is] genuine only if it succeeds in taking the measure from things, instead of imposing measure upon them" (Zimmerman, 1990, p. 228). Husserl and Heidegger's influences on the writers discussed by Cherryholmes, such as Derrida, Foucault, Habermas, Ihde, Rorty, and Schutz, bring Rasch into direct contact with the issues of construct validity raised in the discursive context. More specifically, to be sufficiently composed and prepared to pose real questions is to perform the phenomenological epoche such that the thing itself is brought into view.
The researcher has some evidence that the thing itself is in view when the observations delineating its structure do not inordinately vary depending upon the particular questions asked or the particular persons responding. For the bracketing, and separation, of the particulars to occur, they must converge upon a common line of thought; this belonging together is characteristic of Husserl's method of profile variation, Ricoeur's method of converging indices, and is referred to by Brenneman, Yarian, and Olson (1982) as the paradox of unity and separation. Things think themselves and method is an activity of the things themselves (Gadamer, 1989) when person parameters are estimated free of concern for the particular questions asked, item parameters are estimated free of concern for the particular persons responding, and fit to the model is checked free of concern for either parameter (Rasch, 1960, pp. 122, 178; 1961, p. 325). Whether this separability theorem,
and the specific objectivity attained when the theorem is satisfied, are practical for any particular field of research is a matter for empirical study. It must be asserted, however, that to attain specific objectivity is to make the scientific theme secure. Rasch's incorporation of basic phenomenological and hermeneutic themes into his mathematics has been ignored, leading some to relegate his work to the positivist trash heap. For instance, Cronbach (1982, p. 70) considered Rasch (1961) to hold that "one-parameter scaling can discover coherent variables independent of culture and population." On the contrary, Wright himself could have written what Cronbach says on the next page, that

the sooner all social scientists are aware that data never speak for themselves, that without a carefully framed statement of boundary conditions generalizations are misleading or trivially vague, and that forecasts depend on substantive conjectures, the sooner will social science be consistently a source of enlightenment.

With regard to Cronbach's statement that "data never speak for themselves," Wright and Masters (1982, p. 9) say that

To be able to do arithmetic we need to be able to count, and to count we need units. But there are no natural units. There are only the arbitrary units we construct and decide to use for our counting.

Cronbach expresses concern for "a carefully framed statement of boundary conditions," without which "generalizations will be misleading or trivially vague"; Wright and Masters (1982, p. 5) say

For scientific ideas to be useful, they must apply over some range of time and place; that is, over some frame of reference. The way we think things are must seem to stay the same within some useful context.

What is a Rasch model if it is not "a carefully framed statement of boundary conditions"? To require that test results be dominated only by abilities and difficulties is to make a substantive conjecture, as is evident in the quote from Wright (1977b, p.
97) used by Jaeger (1987) to characterize the debate. Cronbach's thoughtless dismissal of Rasch raises the point that the qualitative criticism of quantitative methods must be complemented by criticism of qualitative approaches that emphasize only the movement from the phenomenon to observation to construct, which makes them just as incomplete as the quantitative approaches that follow only the movement from construct to observation to phenomenon. Neither approach alone successfully addresses
the problem of method in social research, and to simply juxtapose them does not accomplish anything of substance, either. A more fully complementary relation between the two paradigms is required, one in which each incorporates what is most important about the other into its own movement, acknowledging in practice that "the social roots of social measurement are in the social process itself" and that "quantification is implicit . . . in the social process itself before any social scientist intrudes" (Duncan, 1984b, pp. 221, 36). The goals of the qualitative paradigm are not to abandon or bury quantification, but to explicate what Coombs (1967, pp. 4-5) called the "interpretive step . . . required to convert the recorded observations into data." When this interpretive step and its implications are included in research the phenomenologically rich sense of method as the playful activity of the thing itself takes hold (Gadamer, 1989). To apply Rasch's models is to incorporate the interpretive step into scaling procedures, making interpretation of the construct unavoidable in calibrating instruments and making measurements, which is part of the reason Rasch has provoked debate.

How does the interpretive step fit into the process of instrument calibration and person measurement? It is actually not just a single step, but is repeated several times. Even the invention of the questions to be asked involves an interpretation of the relevant content domain; decisions as to item appropriateness may be guided by criteria of content validity at this point, but they should also be guided by a theory of the variable: What will count as an observation of more or less of the ability or attitude of interest? The activity of the phenomenon measured moves first in the direction shared by the questions on a test toward the responses they provoke; the responses in turn raise new questions which either extend or otherwise alter the direction initially followed.
The back-and-forth motion continues in a manner that connects with what is most fundamental to method (from the ancient Greek meta-hodos), the way in which clear thinking follows after and [...] meaning or train of thought it cuts within a particular cultural and historical frame of reference. This is not to say that Rasch measurement models embody the essence of method, or that they even are methods, because they are not. The methods by which meaning is created vary substantially both among and within areas of interest. The point is only that obsession with content validity cuts off the flow of method prematurely; a shift in focus toward construct validity would contribute to the phenomenological and methodological soundness of educational research.

Interpreting Empirical Consistency

The recent surge of interest in fit analysis, differential item functioning, and the Mantel-Haenszel (MH) procedure is a move in the direction of a strong emphasis on construct validity in educational research, but presumes an approach to measurement often lacking in the methods creating the data to which it is applied. In the application of the Mantel-Haenszel (MH) procedure,

If one is not prepared to accept the validity of the Rasch model for the item under examination, the implicit assumptions of the MH procedure will not be satisfied either. If one is prepared to accept the Rasch assumptions, however, the Rasch model yields simpler and better statistics. (Linacre & Wright, 1987, p. 16; 1989, p. 3; also see Zwick, 1990)

Thus, the application of the MH procedure to data that fit the three-parameter IRT model but not the Rasch model adds yet another level of self-contradiction and complication to educational measurement. The residual differences between modeled and observed responses calculated by both the Rasch and the MH procedures implement the rigorous sense of unidimensionality contradicted by the two- and three-parameter estimation algorithms. This situation raises some hard questions. What is the point of obtaining complex and obscure statistics from the MH procedure when a model that almost always fits data is being used to provide ability and difficulty estimates? Why not use the same requirements used to calculate fit to estimate scale positions, and arrive at simpler statistics in less time and with less trouble? The sort of structure required of data for fit to a Rasch measurement model, and presumed in the application of the MH procedure, is displayed in Table 3-1.
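For readers who want the model behind these statements written out, the dichotomous Rasch model and the response residuals used in fit analysis can be sketched as follows. The notation is an assumption made here for illustration, since the chapter does not fix one: \theta_n is the ability of person n, \delta_i is the difficulty of item i, x_{ni} is the scored response (1 or 0), and r_n is person n's raw score. The last line assumes that responses are independent given the parameters.

```latex
% Dichotomous Rasch model (notation assumed for illustration)
P_{ni} \equiv \Pr(x_{ni}=1 \mid \theta_n,\delta_i)
      = \frac{\exp(\theta_n-\delta_i)}{1+\exp(\theta_n-\delta_i)}

% Raw and standardized residuals between observed and modeled responses
y_{ni} = x_{ni} - P_{ni}, \qquad
z_{ni} = \frac{x_{ni}-P_{ni}}{\sqrt{P_{ni}\,(1-P_{ni})}}

% Probability of person n's whole response string over L items
\prod_{i=1}^{L} P_{ni}^{\,x_{ni}}\,(1-P_{ni})^{\,1-x_{ni}}
  = \frac{\exp\!\bigl(r_n\theta_n-\sum_{i=1}^{L}x_{ni}\delta_i\bigr)}
         {\prod_{i=1}^{L}\bigl(1+\exp(\theta_n-\delta_i)\bigr)},
  \qquad r_n=\sum_{i=1}^{L}x_{ni}
```

In this form the log-odds of success is simply \theta_n - \delta_i, with no item-specific slope or lower asymptote, and the ability parameter enters the probability of a response string only through the count r_n. The two- and three-parameter models add discrimination and guessing parameters to the first expression, and the factorization in the last line then no longer holds for the unweighted count.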
In fact, it is only reasonable to count up marks of correct and incorrect (or marks of correct, partly correct, and incorrect; see Wright & Masters, 1982, and Masters, 1982, for more on partial credit scoring), and use the counts as a basis for making inferences about person ability or item difficulty, when data can be organized into a pattern roughly similar to the one shown in Table 3-1. The items are ordered from less to more difficulty according to the number of persons responding correctly to each; the persons are ordered from less to more ability according to the number of items to which each has correctly responded. The resulting pattern required for measurement is one in which a person may occasionally score a correct answer after missing an item or two, but there is a general harmony to the continuum of more and less shared by the persons and items.

Table 3-1
Sample Data that Display the Reciprocal Order Needed for Convergence and Fit to an Additive Conjoint Measurement Model

              Items (Easy or Agreeable to Hard or Disagreeable)
Persons        1   2   3   4   5   6   7   8   9  10    Person Scores
Luc            0   1   0   0   0   0   0   0   0   0         1
John           1   0   1   0   0   0   0   0   0   0         2
Louise         1   1   0   1   0   0   0   0   0   0         3
Martha         1   1   1   0   1   0   0   0   0   0         4
Jimi           1   1   1   1   0   0   1   0   0   0         5
Diane          1   1   1   1   1   1   0   0   0   0         6
Nathan         1   1   1   1   1   1   0   1   0   0         7
Jon            1   1   1   1   1   1   1   1   0   0         8
Laura          1   1   1   1   1   1   1   0   1   1         9
Alissa         1   1   1   1   1   1   1   1   1   0         9
Item Score     9   9   8   7   6   5   4   3   2   1

In contrast, Table 3-2 displays data that contradict the basic requirement of unidimensionality, and so threaten the construct validity of the calibrations and measures. Imagine that the data in Table 3-2 are embedded in a large matrix of data organized like that shown in Table 3-1, in which a general order of more and less of something remains relatively and probabilistically constant across items and persons. Every person in Table 3-2 has the same count of correct answers, but is it possible to assume that the counts mean the same thing? Is not that assumption made, however, every time a teacher or a tester computes the percentage of the total number of items to which a student responded correctly? In contrast to Divgi (1986, p. 283), Messick's (1975, p. 960) answer to this question is an unequivocal yes:

Inferences in educational and psychological measurement are made from scores, and scores are a function of subject responses. Any concept of validity of measurement must include reference to empirical consistency. Content coverage is an important consideration in test construction and interpretation, to be sure, but in itself does not provide validity.

Table 3-2
Sample Data on the Variation of Meaning in a Score

              Items (Easy or Agreeable to Hard or Disagreeable)
Persons        1   2   3   4   5   6   7   8   9  10    Person Scores
Joe            0   0   0   0   0   1   1   1   1   1         5
Mary           1   1   1   1   1   0   0   0   0   0         5
Lucy           1   1   1   1   0   1   0   0   0   0         5
Bob            1   0   1   0   1   0   1   0   1   0         5
Anne           1   1   1   1   0   0   0   0   0   1         5
Larry          1   1   1   1   0   0   0   0   0   1         5
Igor           0   1   1   1   1   0   1   0   0   0         5

After all, is it not possible that some students will respond to ostensibly easy questions incorrectly, and ostensibly hard ones correctly, independent of the fact that all of the items have been judged to belong to the same content domain? Is it not important to detect when this sort of thing happens on a large scale, as has been the case with Anne, Igor, Larry, and especially Joe, in Table 3-2? And what about Bob, who was correct on every other item when they are ordered by difficulty? Is he making some kind of joke? The probability of Igor missing the easiest item must be very small, so was this the result of simple carelessness or is something more important going on? Anne and Larry both got the very hardest item correct after missing five in a row. Is this simply a sign of some special knowledge they each have, did they collaborate on the answer, did one copy from the other, or were these independently made lucky guesses? Answers to these questions can be gained by asking the students new questions of the same difficulty as those on which their responses are surprising. If the items in Table 3-2 are in entry, as well as measure, order, it might be beneficial to ask if Mary ran out of time as she labored with each question before she moved on to the next. Did Joe skip all of the easy questions out of boredom? Did Bob make random marks on the answer sheet, or answer true/false or multiple choice questions all in the same category? If so, why?
Will Larry and Anne answer another item of question 10's difficulty correctly, or were their responses produced by collaboration, cheating, guessing, or special knowledge? Would Igor have missed the first question if he had not been in a hurry to get started, or if he had not had difficulty figuring out the test's purpose?

The other side of validating a construct involves another, reciprocally structured, set of questions simultaneously raised about the test items. Is there a very easy item that groups of high-ability persons consistently miss? Is there a very hard item that groups of low-ability persons answer correctly? For instance, word problems in a mathematics test may become inordinately difficult for students who are unable to read the language in which the problems are written. If word problems are irrevocably deemed a valid part of the mathematics content domain, and the test analyst has no business monkeying around with the sacrosanct items handed down by the authorities, as Lindquist (1953) maintains, then discrimination and prejudice are built into the test and any decisions that follow from them. If, on the other hand, we are flexible enough to not regard content decisions as fixed, then the differential meaning of the items can be accounted for in the interpretation that transforms observations into measures.

These examples are intended to show that there are many kinds of disturbance that interfere with the effort to measure, each is as likely to occur as guessing is, and each will present just as much potential for disruption. Are we then to model additional parameters for plodding, sleeping, and fumbling, as they are called by Wright and Stone (1979, pp. 170-190), in such a way that they will move us even further from Rasch's access to sufficient statistics? Hardly; two basic reasons for the movement toward qualitative methods in educational research are that usual applications of quantitative method traditionally strive to anticipate, close off, trap, or nail down anomalies, and to focus on operations and content instead of meaning and constructs. It is more sensible, though, to go with the flow of the multifaceted, conversational, and metaphorical logic by which things actually play themselves out, than it is to force a one-sided logic and rationality on what people do. Well put questions inevitably open up more questions than they answer, and to cut off questioning is to kill the potential for learning. Disruptions in the measurement process are inevitable but it is far more productive to locate and interpret them after they occur than to try to include them as elements in a model of an already very complicated situation.
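As a rough illustration of what locating disturbances after they occur can look like, the sketch below counts Guttman errors for the seven response strings in Table 3-2: each pair of items in which a person misses the easier item but succeeds on the harder one counts as one error. This is a deliberately simple, deterministic stand-in chosen for illustration. It is not the probabilistic fit statistic reported by Rasch programs, and the names `TABLE_3_2`, `guttman_errors`, and `summarize` are hypothetical.

```python
# Responses from Table 3-2, items ordered from easiest (1) to hardest (10).
TABLE_3_2 = {
    "Joe":   [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "Mary":  [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "Lucy":  [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],
    "Bob":   [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "Anne":  [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],
    "Larry": [1, 1, 1, 1, 0, 0, 0, 0, 0, 1],
    "Igor":  [0, 1, 1, 1, 1, 0, 1, 0, 0, 0],
}


def guttman_errors(responses):
    """Count item pairs where an easier item is missed but a harder one is passed."""
    errors = 0
    for position, x_easy in enumerate(responses):
        for x_hard in responses[position + 1:]:
            if x_easy == 0 and x_hard == 1:
                errors += 1
    return errors


def summarize(table):
    """Return {person: (raw score, Guttman error count)} for every response string."""
    return {name: (sum(resp), guttman_errors(resp)) for name, resp in table.items()}


if __name__ == "__main__":
    for name, (score, errors) in summarize(TABLE_3_2).items():
        print(f"{name:<6} score={score}  guttman_errors={errors}")
```

Every person has the same raw score of 5, yet the error counts run from 0 for Mary to 25 for Joe, which is one crude way of seeing that identical counts need not carry identical meaning. A count like this only flags a response string for attention; deciding whether the cause is boredom, guessing, copying, or special knowledge remains the interpretive work described in the text.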
Patterns of anomalous response commonly found in educational test data are discussed in Wright and Stone (1979, pp. 170-190). Quantitative methods for flagging unexpected patterns of response associated with persons and items are standard equipment in programmatic applications of the Rasch models, such as BIGSTEPS (Wright & Linacre, 1991) and FACETS (Linacre, 1991). The statistics indicative of empirical inconsistency have been shown useful in investigating construct validity (Maier & Philipp, 1986; Wright & Masters, 1982, pp. 90-117). More complex multiple regression procedures using the conceptual structure of item characteristics to predict Rasch item difficulties have been presented by Stenner and Smith (1982) and Stenner, Smith, and Burdick (1983) in the context of exploring construct validity. The interpretive study of ordered data matrices shows that scores are meaningful only within the context of a frame of reference, and that Rasch's requirement of shared order across persons and items is in
fact assumed whenever raw scores are used as a basis for comparison, Goldstein's (1979, p. 219) claims to the contrary notwithstanding. Andersen (1977, p. 72) says that

If there exists a minimal sufficient statistic for the individual parameter θ which is independent of the item parameters, then the raw score is the minimal sufficient statistic and the model is the Rasch model.

In Wright's (1977b, p. 114; also see 1985, pp. 106-107) terms,

Unweighted scores are appropriate for person measurement if and only if what happens when a person responds to an item can be usefully approximated by the Rasch model. . . . Ironically, for anyone who claims skepticism about "the assumptions" of the Rasch model, those who use unweighted scores are, however unwittingly, counting on the Rasch model to see them through. Whether this is useful in practice is a question not for more theorizing, but for empirical study.

There are, perhaps, those who read these passages simply as expressions of the writers' demands that things be done their way, as if they believe they have access to a divine inspiration ordering sanctification of particular procedures and the conscription of a following of disciples, with no questions raised from anyone as to why things should be done this way. On the contrary, "the reader who believes that all that is at stake in the axiomatic treatment of measurement is a possible canonizing of one scaling procedure at the expense of others is missing the point" (Ramsay, 1975, p. 262; also see Andrich, 1988, p. 20). The point is to sanctify neither items nor procedures, but to undertake data analysis as a kind of detective work.

The Roman Catholic Church . . . has long held that sanctification was only for the dead -- indeed only for those already dead for an appropriate period. . . . sanctification of data is equally only for dead data -- data that are only of historical importance, like Newton's apple. . . . Data analysis has its major uses. They are detective work and guidance counseling. Let us all try to act accordingly. (Tukey, 1969, p. 90)

The empirical studies of the detective work and guidance counseling provided by Rasch measurement that were called for by Wright (1977b) have been completed on many different kinds of test, survey, and rating scale data. These studies have answered the question concerning the Rasch model's practical usefulness in the affirmative many times over, as is evidenced by just a cursory examination of the
papers presented to the Midwest Objective Measurement Seminars, the International Objective Measurement Workshops (Wilson, 1991), and the Rasch Measurement SIG sessions of the AERA, besides the publications appearing in journals as diverse as the Archives of Physic gy. The medical fields have found Rasch's approach to measurement especially useful, with a great many Rasch applications being found in accreditation and certification, as well as in psychiatry, nursing, and blind and physical rehabilitation. Perhaps the only obstacles to revolution in educational measurement are assumptions concerning the irreconcilable differences of solidarity and objectivity.

SOLIDARITY VS. OBJECTIVITY OR OBJECTIVE SOLIDARITY?

In contrast to Rorty and Cherryholmes, I would like to suggest that stories of solidarity and objectivity are not mutually exclusive. Cherryholmes (1988, p. 450) says that

If Rorty is correct that reflective human beings make sense of their lives by telling stories about either solidarity or objectivity and our stories about objectivity are flawed, they nevertheless describe a community. The community is elitist, control centralized; criticism is limited to experts; the social context and historical setting of the community is not discussed; constructs (the way the community is conceptually organized) are not chosen on ethico-political or aesthetic grounds but in terms of "scientific" criteria; and the discourse is thought of as nonmaterial and descriptive-explanatory.

To this it must be added that if the solidarity of societies emphasizing objectivity is likely to take a one-sided, dictatorial, and authoritarian form, then the objectivity of societies that emphasize solidarity is likely to be multifaceted, conversational, and playful (Heelan, 1983, 1985; Ihde, 1979; Ackermann, 1985). There is a large literature describing science in the language of community life (Fahnestock, 1986; Fleck, 1979; Hesse, 1970, 1972; Holton, 1988; Kuhn, 1961, 1970; Latour & Woolgar, 1979; Ormiston & Sassower, 1989; Toulmin, 1982); the problem these works address is how to find and nurture whatever resources for solidarity there may be remaining in scientific society. This does not require us to abandon objectivity; on the contrary, we aim to avoid yet another simplistic reduction of rich variation to another mere dichotomy.

In opposition to Lindquist's approach to measurement, Wright specifically addresses ethical, political, and aesthetic criteria by which to judge and choose constructs. Because we intend to use our measures to inform decisions that affect people's lives, we are ethically bound to be sure that the numbers actually represent more and less of the construct in question. Some might say that the only ethics addressed by Lindquist concern a blind devotion to following orders. Because we are legally and morally bound not to discriminate among persons by religion, sex, race, sexual orientation, or age, we require that our measures not vary across these groups in an inordinate fashion. Lindquist's definition of the test content as sacrosanct prevents attention from being focused on these issues in an effective way. Rasch's measurement models offer an aesthetically pleasing symmetry of question and answer in which each plays itself out in terms of the other, effectively extending and furthering the process by which meaning is reproduced in social life, conversationally. Lindquist, on the other hand, would have us only accept that which is handed down without question because we have no business monkeying around with sacrosanct definitions.

The desire to understand human experience by means of stories told of a nonhuman, ahistorical reality still predominates in much of social science. In education this desire is evident in the popularity of measurement models that do not recognize or accept the fact of their own imposition of political, moral and aesthetic criteria upon students, test items, and data. By recognizing that the projection of such criteria is unavoidable, and by formulating models of how consciously chosen criteria can be simply, easily and practically implemented, explicated, and criticized, it becomes possible to explore whether we really know what we are talking about when we make assertions on the basis of test results.
And far from saying that construct validity is simply a matter of fitting data to Rasch's models, this chapter has attempted to provoke thoughtful attention to the problem of construct validity. Measuring what is supposed to be measured involves far more than anything that can be specified in a set of mechanically and thoughtlessly followed rules. Revolution in educational measurement will be attained only when we let go of our needs for rules and the capacity to dominate and control in favor of a thinking secure enough to go with the flow of letting individuals be what they are.

REFERENCES

Ackermann, J.R. (1985). Data, instruments, and theory: A dialectical approach to understanding science. Princeton, NJ: Princeton University Press.

Andersen, E.B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42(1), 69-81.
Andrich, D. (1987, April). Educational and other social science measurement: A Kuhnian revolution in progress. Presented to the American Educational Research Association, New Orleans.
Andrich, D. (1988). Rasch models for measurement. Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-068. Beverly Hills, CA: Sage Publications.
Andrich, D. (1989). Statistical reasoning in psychometric models and educational measurement. Journal of Educational Measurement, 26(1), 81-90.
Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66(6), 423-437.
Bechtoldt, H.P. (1959). Construct validity: A critique. American Psychologist, 14, 619-629.
Bollinger, G., & Hornke, L.F. (1978). Über die Beziehung von Itemtrennschärfe und Rasch-Skalierbarkeit. Archiv für Psychologie, 130, 89-96.
Brenneman, W.L., & Yarian, S.O., with A.M. Olson. (1982). The seeing eye: Hermeneutical phenomenology in the study of religion. University Park, PA: Pennsylvania State University Press.
Bridgman, P.W. (1927). The logic of modern physics. New York: Macmillan.
Brogden, H.E. (1977). The Rasch model, the law of comparative judgment and additive conjoint measurement. Psychometrika, 42, 631-634.
Burtt, E.A. (1954). The metaphysical foundations of modern science. New York: Doubleday Anchor.
Cajori, F. (1985). A history of mathematics. New York: Chelsea.
Carver, R. (1978). The case against statistical significance testing. Harvard Educational Review, 48(3), 378-399.
Cherryholmes, C. (1988). Construct validity and the discourses of research. American Journal of Education, 96(3), 421-457.
Choppin, B. (1968). An item bank using sample-free calibration. Nature, 219, 870-872.
Choppin, B. (1976). Recent developments in item banking. In D.N. DeGruitjer & L.J. Vanderkamp (Eds.), Advances in psychological and educational measurement. London: John Wiley & Sons.
Choppin, B. (1978). Item Banking and the Monitoring of Achievement. Slough, England: National Foundation for Educational Research.
Cliff, N. (1973). Scaling. Annual Review of Psychology, 24, 473-506.
Coats, W. (1970). A case against the normal use of inferential statistical models in educational research. Educational Researcher, 3, 6-7.
Cook, T.D., & Campbell, D.T. (1979). Quasi-experimentation: Design & analysis issues for field settings. Boston: Houghton Mifflin.
Coombs, C. (1967). A theory of data. New York: Wiley.
Cronbach, L.J. (1982). Prudent aspirations for social inquiry. In W.H. Kruskal (Ed.), The social sciences: Their nature and uses. Chicago: University of Chicago Press.
Cronbach, L., & Meehl, P. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281-302.

Crouse, J., & Trusheim, D. (1988). The case against the SAT. Chicago: University of Chicago Press.
Divgi, D.R. (1986). Does the Rasch model really work for multiple choice items? Not if you look closely. Journal of Educational Measurement, 23(4), 283-296.
Divgi, D.R. (1989). Reply to Andrich and Henning. Journal of Educational Measurement, 26(3), 295-299.
Duncan, O.D. (1984a). Measurement and structure: Strategies for the design and analysis of subjective survey data. In C.F. Turner & E. Martin (Eds.), Surveying subjective phenomena (Vol. 1). New York: Russell Sage Foundation.
Duncan, O.D. (1984b). Notes on social measurement: Historical and critical. New York: Russell Sage Foundation.
Duncan, O.D. (1984c). Rasch measurement: Further examples and discussion. In C.F. Turner & E. Martin (Eds.), Surveying subjective phenomena (Vol. 2). New York: Russell Sage Foundation.
Embretson (Whitely), S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179-197.
Fahnestock, J. (1986). Accommodating science: The rhetorical life of scientific facts. Written Communication, 3(3), 275-296.
Falk, R. (1986). Misconceptions of statistical significance. Journal of Structural Learning, 9, 83-96.
Fischer, G.H. (1987). Applying the principles of specific objectivity and of generalizability to the measurement of change. Psychometrika, 52(4), 565-587.
Fisher, W.P. (1988). Recent developments in the philosophy of science pertaining to problems of objectivity in measurement. Rasch Measurement Transactions, 2(2), 1-3.
Fisher, W.P. (1990, April). Conversing, testing, questioning. Presented to the American Educational Research Association Annual Meeting, Boston [ERIC Document TM016413].
Fisher, W. (1991). Objectivity in measurement: A philosophical history of Rasch's separability theorem. In M. Wilson (Ed.), Objective Measurement: Theory into practice. Norwood, NJ: Ablex Publishing Corp.
Fleck, L. (1979). The birth and genesis of a scientific fact. Chicago: University of Chicago Press.
Gadamer, H.-G. (1980). Dialogue and dialectic: Eight hermeneutical studies on Plato (P.C. Smith, Trans. and Intro.). New Haven: Yale University Press.
Gadamer, H.-G. (1989). Truth and method (2nd ed.) (J. Weinsheimer & D.G. Marshall, Rev. Trans.). New York: Crossroad.
Goldman, S.H., & Raju, N.S. (1986). Recovery of one- and two-parameter logistic item parameters: An empirical study. Educational and Psychological Measurement, 46, 11-21.
Goldstein, H. (1979). Consequences of using the Rasch model for educational assessment. British Educational Research Journal, 5(2), 211-220.
Goldstein, H. (1980). Dimensionality, bias, independence and measurement scale problems in latent trait test score models. British Journal of Mathematical and Statistical Psychology, 33, 234-246.
Goldstein, H. (1983). Measuring changes in educational attainment over time: Problems and possibilities. Journal of Educational Measurement, 20(4), 369-377.
Goldstein, H., & Blinkhorn, S. (1977). Monitoring educational standards—An inappropriate model. Bulletin of the British Psychological Society, 30, 309-311.
Grau, B.W., & Mueser, K.T. (1986). Measurement of negative symptoms. Schizophrenia Bulletin, 12(1), 7-8.
Gould, S.J. (1981). The mismeasure of man. New York: W. W. Norton.
Gustafsson, J.-E. (1980). Testing and obtaining fit of data to the Rasch model. British Journal of Mathematical and Statistical Psychology, 33, 205-233.
Hacking, I. (1983). Representing and intervening: Introductory topics in the philosophy of natural science. Cambridge, UK: Cambridge University Press.
Hacking, I. (1988). On the stability of the laboratory sciences. The Journal of Philosophy, 85(10), 507-514.
Hambleton, R.K., & Cook, L.L. (1977). Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 14(2), 75-96.
Hambleton, R.K., & Novick, M.R. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10, 159-170.
Hambleton, R.K., & Rogers, H.J. (1989). Solving criterion-referenced measurement problems with item response models. International Journal of Educational Research, 13(2), 145-160.
Heelan, P. (1972). Towards a hermeneutic of natural science. Journal of the British Society for Phenomenology, 3, 252-260.
Heelan, P. (1983). Natural science as a hermeneutic of instrumentation. Philosophy of Science, 50, 181-204.
Heelan, P. (1985, March). Interpretation in physics: Observation and measurement. Greater Philadelphia Philosophy Consortium.
Heelan, P. (1988). Experiment and theory: Constitution and reality. The Journal of Philosophy, 85(10), 515-524.
Heelan, P. (1989). After experiment: Realism and research. American Philosophical Quarterly, 26(4), 297-308.
Heidegger, M. (1962). Being and time (J. Macquarrie and E. Robinson, Trans.). New York: Harper & Row.
Heidegger, M. (1967). What is a thing? (W.B. Barton, Jr., & V. Deutsch, Trans.). (Analytic afterword by E. Gendlin). South Bend, IN: Regnery.
Henning, G. (1989). Does the Rasch model really work for multiple-choice items? Take another look: A response to Divgi. Journal of Educational Measurement, 26(1), 91-97.
Hesse, M. (1970). Models and analogies in science. Notre Dame, IN: University of Notre Dame Press.

Hesse, M. (1972). In defence of objectivity. Proceedings of the British Academy, 58, 275-292.
Holton, G. (1988). Thematic origins of scientific thought (rev. ed.). Cambridge, MA: Harvard University Press.
Hudson, L. (1972). The cult of the fact. New York: Harper & Row.
Husserl, E. (1970). The crisis of European science. Evanston, IL: Northwestern University Press.
Ihde, D. (1979). Technics and praxis. Boston: D. Reidel.
Ihde, D. (1991). Instrumental realism. Bloomington, IN: Indiana University Press.
Jaeger, R.M. (1987). Two decades of revolution in educational measurement!? Educational Measurement: Issues and Practice, 6(2), 6-14.
Krantz, D.H., Luce, R.D., Suppes, P., & Tversky, A. (1971). Foundations of measurement. Vol. 1: Additive and polynomial representations. New York: Academic Press.
Krenz, C., & Sax, G. (1986). What quantitative research is and why it doesn't work. American Behavioral Scientist, 30(1), 58-69.
Kuhn, T.S. (1961). The function of measurement in modern physical science. Isis, 52(168), 161-193.
Kuhn, T.S. (1970). The structure of scientific revolutions (2nd ed.). Chicago: University of Chicago Press.
Latour, B., & Woolgar, S. (1979). Laboratory life: The social construction of scientific facts. Beverly Hills: Sage.
Lewine, R.R.J. (1986). Reply to Grau and Mueser. Schizophrenia Bulletin, 12(1), 9-11.
Linacre, J.M. (1991). FACETS: A computer program for many-faceted Rasch analysis. Chicago: MESA Press.
Linacre, J.M., & Wright, B.D. (1987). Item bias: Mantel-Haenszel and the Rasch model (Memorandum No. 39, MESA Psychometric Laboratory, Department of Education). Chicago: University of Chicago.
Linacre, J.M., & Wright, B.D. (1989). The equivalence of Rasch PROX and Mantel-Haenszel. Rasch Measurement, 3(2), 1-3.
Lindquist, E.F. (1953). Selecting appropriate score scales for tests (Discussion). Proceedings of the 1952 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694.
Loevinger, J. (1965). Person and population as psychometric concepts. Psychological Review, 72(2), 143-155.
Lord, F.M. (1968). An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic model. Educational and Psychological Measurement, 28, 989-1020.
Lord, F.M. (1975). Evaluation with artificial data of a procedure for estimating ability and item characteristic curve parameters (Research Bulletin 75-33). Princeton, NJ: Educational Testing Service.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Lord, F.M. (1983). Small N justifies Rasch model. In D.J. Weiss (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing. New York: Academic.
Luce, R.D., & Tukey, J.W. (1964). Simultaneous conjoint measurement: A new kind of fundamental measurement. Journal of Mathematical Psychology, 1(1), 1-27.
Lumsden, J. (1976). Test theory. Annual Review of Psychology, 27, 251-280.
Maier, W., & Philipp, M. (1986). Construct validity of the DSM-III and RDC classification of melancholia (endogenous depression). Journal of Psychiatric Research, 20(4), 289-299.
Masters, G. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
Messick, S. (1981). Constructs and their vicissitudes in educational and psychological measurement. Psychological Bulletin, 89, 575-588.
Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100, 398-407.
Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Erlbaum.
Mislevy, R.J., & Bock, R.D. (1983). BILOG: Item analysis and test scoring with binary logistic models. Mooresville, IN: Scientific Software.
Ormiston, G., & Sassower, R. (1989). Narrative experiments: The discursive authority of science and technology. Minneapolis, MN: University of Minnesota Press.
Osburn, H.G. (1968). Item sampling for achievement testing. Educational and Psychological Measurement, 28, 95-104.
Owen, D.S. (1985). None of the above: Behind the myth of scholastic aptitude. Boston: Houghton Mifflin.
Perline, R., Wright, B.D., & Wainer, H. (1979). The Rasch model as additive conjoint measurement. Applied Psychological Measurement, 3(2), 237-255.
Phillips, S.E. (1986). The effects of the deletion of misfitting persons on vertical equating via the Rasch model. Journal of Educational Measurement, 23(2), 107-118.
Ramsay, J.O. (1975). Review of Foundations of Measurement, Vol. I, by D.H. Krantz et al. Psychometrika, 40, 257-262.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut. (Reprint, 1980, with Foreword and Afterword by Benjamin D. Wright, Chicago: University of Chicago Press.)
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 4 (pp. 321-333). Berkeley: University of California Press.
Ricoeur, P. (1965). History and truth (C.A. Kelbley, Trans.). Evanston: Northwestern University Press.

Ricoeur, P. (1981). Hermeneutics and the human sciences: Essays on language, action and interpretation (J.B. Thompson, Ed., Trans. and Intro.). Cambridge, UK: Cambridge University Press.
Rorty, R. (1985). Solidarity or objectivity. In J. Rajchman & C. West (Eds.), Post-analytic philosophy. New York: Columbia University Press.
Singleton, M. (1991). Rasch measurement as a Kuhnian revolution. Rasch Measurement, 4(4), 119.
Stenner, A.J., & Smith, M., III. (1982). Testing construct theories. Perceptual and Motor Skills, 55, 415-426.
Stenner, A.J., Smith, M., III, & Burdick, D.S. (1983). Toward a theory of construct definition. Journal of Educational Measurement, 20(4), 305-316.
Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
Stocking, M.L. (1989). Empirical estimation errors in item response theory as a function of test properties. Princeton, NJ: Educational Testing Service Research Report.
Strenio, A.J. (1981). The testing trap. New York: Rawson, Wade.
Suppes, P., & Zinnes, J.L. (1963). Basic measurement theory. In R.D. Luce, R.R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology. New York: John Wiley & Sons.
Sutherland, G., in collaboration with S. Sharp. (1984). Ability, merit, and measurement: Mental testing and English education, 1880-1940. Oxford: Clarendon Press.
Toulmin, S. (1982). The construal of reality: Criticism in modern and postmodern science. Critical Inquiry, 9, 93-111.
Tracy, D. (1975). Blessed rage for order: The new pluralism in theology. Minneapolis: The Winston-Seabury Press.
Tukey, J.W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83-91.
Wheeler, J.A., & Zurek, W. (Eds.). (1983). Quantum theory and measurement. Princeton, NJ: Princeton University Press.
Whitely, S.E. (1977). Models, meanings and misunderstandings: Some issues in applying Rasch's theory. Journal of Educational Measurement, 14(3), 227-235.
Whitely, S.E., & Dawis, R.V. (1974). The nature of objectivity with the Rasch model. Journal of Educational Measurement, 11(2), 163-178.
Willmott, A., & Fowles, D. (1974). The objective interpretation of test performance: The Rasch model applied. Atlantic Highlands, NJ: NFER Publishing.
Wilson, M. (Ed.). (1991). Objective measurement: Theory into practice. Norwood, NJ: Ablex Publishing Corp.
Wingersky, M.S., Barton, M.A., & Lord, F.M. (1982). LOGIST Users Guide. Princeton, NJ: Educational Testing Service.
Wood, R. (1978). Fitting the Rasch model: A heady tale. British Journal of Mathematical and Statistical Psychology, 31, 27-32.
Wright, B.D. (1968). Sample-free test calibration and person measurement. Proceedings of the 1967 Invitational Conference on Testing Problems (pp. 85-101). Princeton: Educational Testing Service.
Wright, B.D. (1977a). Misunderstanding the Rasch model. Journal of Educational Measurement, 14(3), 219-225.
Wright, B.D. (1977b). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97-116.
Wright, B.D. (1984). Despair and hope for educational measurement. Contemporary Education Review, 3(1), 281-288.
Wright, B.D. (1985). Additivity in psychological measurement. In Measurement and personality assessment (E. Roskam, Ed.). North Holland: Elsevier Science Publishers.
Wright, B.D. (1988a). Georg Rasch and measurement. Rasch Measurement, 2(3), 1-7.
Wright, B.D. (1988b). The model necessary for a Thurstone scale and Campbell concatenation for mental testing. Rasch Measurement, 2(1), 2-4.
Wright, B.D., & Bell, S.R. (1984). Item banks: What, why, how. Journal of Educational Measurement, 21(4), 331-345.
Wright, B.D., & Linacre, J.M. (1989). Observations are always ordinal; Measurements, however, must be interval. Archives of Physical Medicine and Rehabilitation, 70(12), 857-867.
Wright, B.D., & Linacre, J.M. (1991). BIGSTEPS: A Rasch-Model Computer Program. Chicago: MESA Press.
Wright, B.D., & Stone, M. (1979). Best test design. Chicago: MESA Press.
Zimmerman, M.E. (1990). Heidegger's confrontation with modernity: Technology, politics, art. Bloomington: Indiana University Press.
Zwick, R. (1990). When do item response function and Mantel-Haenszel definitions of differential item functioning coincide? Journal of Educational Statistics, 15, 183-197.

chapter 4

Historical Views of the Concept of Invariance in Measurement Theory*

George Engelhard, Jr.
Emory University

* This research was supported in part by the University Research Committee of Emory University. Support for this research was also provided through a Spencer Fellowship from the National Academy of Education. Earlier versions of this chapter were presented at the Fifth International Objective Measurement Workshop at the University of California, Berkeley (March 1989), and at the Sixth International Objective Measurement Workshop at the University of Chicago (April 1991). Judith A. Monsaas and Larry Ludlow provided helpful comments on earlier drafts of this paper. Sections of this chapter have been published in Engelhard, G. (1992, Summer), Historical views of invariance: Evidence from the measurement theories of Thorndike, Thurstone and Rasch, Educational and Psychological Measurement. Permission to reprint has been obtained from the publisher. The figures reproduced in this chapter are based on the original graphics produced by Thorndike, Thurstone, and Rasch. The original graphics varied somewhat in quality, and for historical accuracy are reproduced in this chapter as originally drawn.

The history of science is the history of measurement. (Cattell, 1893, p. 316)

The scientist is usually looking for invariance whether he knows it or not. (Stevens, 1951, p. 20)

Invariance has been identified as a fundamental characteristic of measurement in the behavioral sciences (Andrich, 1988a; Bock & Jones, 1968; Jones, 1960; Stevens, 1951). In essence, the goal of invariant measurement has been succinctly stated by Stevens: "the scientist seeks measures that will stay put while his back is turned" (1951, p. 21). The concept of invariance has implications for both item calibration and the measurement of individuals.

Many of the measurement problems that confront researchers in psychology and education today, such as those related to invariance, are not new. By taking a historical perspective on these measurement problems, it may be possible to increase the understanding of the measurement problems themselves, assess the adequacy of solutions proposed by major measurement theorists, and identify promising areas for future research. Progress, and in some cases lack of progress, towards the solution of basic measurement problems can also be meaningfully documented.

During the 20th century, there have been two major research traditions that have guided measurement theorists attempting to quantify various human characteristics, such as abilities, aptitudes, and attitudes. One tradition has its roots in the psychometric work of Charles Spearman (1904); this research tradition, which is focused on the test score, is primarily concerned with measurement error and the decomposition of an observed test score into several components including a "true" score and various error components. This research tradition within mental test theory can be labelled test theory. A second research tradition that has developed in a parallel fashion has its roots in the 19th-century work in psychophysics and has continued into present practice through the various forms of latent trait theory or, more specifically, item response theory (IRT). This second research tradition will be referred to as scaling theory. The focus of research within this second tradition is on the calibration of both individuals and items onto a latent variable scale. Within these two research traditions, test theory and scaling theory, there are several dominant perspectives that have evolved over time.
For example, Spearman's research on test theory has been extended through generalizability theory (Brennan, 1983; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson, Webb, & Rowley, 1989), as well as the LISREL models developed by Karl Jöreskog (Jöreskog & Sörbom, 1986). The purpose of this chapter is to examine advances within the second measurement tradition of scaling theory that are due to the contributions of Thorndike, Thurstone, and Rasch. Measurement perspectives within test theory will not be addressed in detail in this chapter.

A great deal of educational and psychological research has been conducted within the framework of test theory. For example, empirical research workers routinely include "coefficient alphas" or "KR-20s" for the instruments used in their studies. Along with this concern for "reliability" coefficients, research workers have also been concerned about the validity of their instruments, although the question of what a test score really represents is rarely resolved in most studies and may ultimately be the most important measurement question of all. Instead of focusing on measurement problems related to reliability and validity, which are the central concepts of test theory (Loevinger, 1957), this study focuses on measurement problems related to the concept of invariance, which appear clearly within scaling theory; this emphasis is not to say that the concepts of reliability or especially validity are unimportant, but rather that different research traditions focus on different aspects of the measurement problems encountered in the behavioral sciences. In fact, invariance has important relationships to and implications for issues related to reliability and validity, and it is essential for gaining a clear understanding of certain persistent problems encountered in test theory. As pointed out by Jones and Appelbaum (1989), developments in item response theory have led to constructive changes in psychological testing and the "primary advantage of IRT over classical test theory resides in properties of invariance" (p. 24).

The purpose of this chapter is to provide a historical perspective on the concept of invariance. Several enduring measurement problems related to item calibration and to the measurement of individuals can be meaningfully viewed by using the concept of invariance. The measurement theories of Thorndike, Thurstone, and Rasch are used because they address measurement problems related to the concept of invariance and propose solutions to these problems. These measurement theorists also share a common research tradition based on scaling theory.
Although there are quantitative aspects to the approaches used to address invariance, it is beyond the scope of this chapter to provide detailed derivations of the equations used by each theorist to achieve sample-invariant item calibration and item-invariant measurement of individuals. These detailed derivations are presented by Engelhard (1984) for measurement issues related to sample-invariant item calibration. A parallel analysis can also be developed for issues related to the item-invariant measurement of individuals, and these derivations are presented in detail by Engelhard (1991). In the next section of this chapter, the concept of invariance is defined and arguments are presented for its importance as a key idea in measurement. A description of the measurement theories of Thorndike, Thurstone, and Rasch is presented next; the role of invariance in each of these theories is also examined. Next, a comparison and discussion of these three theories of measurement are set forth in terms of their contributions to the solution of problems related to the concept of

76

ENGELHARD, JR.

invariance. The final section includes a summary of the major points of this chapter, as well as suggestions for additional research in this area.

THE CONCEPT OF INVARIANCE

Within the behavioral sciences, S.S. Stevens (1951) has presented one of the strongest cases for the general importance of the concept of invariance. In his chapter on "Mathematics, Measurement and Psychophysics," Stevens described the role of this concept in mathematics and physics, and he argued that "many psychological problems are already conceived as the deliberate search for invariances" (p. 20). In fact, Stevens defined the whole field of science in terms of a quest for invariance and the concomitant generalizability of results. In his words,

The scientist is usually looking for invariance whether he knows it or not. Whenever he discovers a functional relationship his next question follows naturally: under what conditions does it hold? . . . The quest for invariant relations is essentially the aspiration toward generality, and in psychology, as in physics, the principles that have wide applications are those we prize. (Stevens, 1951, p. 20)

Applying this view of invariance more specifically to measurement issues, Stevens used the concept of invariance to define his familiar scales of measurement: nominal, ordinal, interval, and ratio (Stevens, 1946). In his words,

Each of the four classes of scales is best characterized by its range of invariance—by the kinds of transformations that leave the "structure" of the scale undistorted. And the nature of invariance sets limits to the kinds of statistical manipulations that can be legitimately applied to the scaled data. (Stevens, 1951, p. 23)

Influenced by the insightful work of Mosier (1940, 1941), Stevens pointed out the symmetry between the fields of psychophysics and psychometrics as related to the concept of invariance:

Psychophysics sees the response as an indicator of an attribute of the individual—an attribute that varies with the stimulus and is relatively invariant from person to person. Psychometrics regards the response as indicative of an attribute that varies from person to person but is relatively invariant for different stimuli. Both psychophysics and psychometrics make it their business to display the conditions and limits of these invariances. (Stevens, 1951, p. 31)

The first sentence in this quotation illustrates the idea of sample-invariant item calibration, whereas the second sentence points to the idea of item-invariant measurement of individuals. This duality between psychophysics and psychometrics, which was clearly described by Mosier (1940, 1941) and pointed out even earlier by Guilford (1936), represents one of the five major ideas underlying test theory identified by Lumsden (1976).

Measurement problems related to invariance can be meaningfully viewed in terms of these two broad classes: sample-invariant item calibration and item-invariant measurement of individuals. Within each of these two classes, invariance over methods and conditions can be examined. Methods refer to the statistical procedures and models used within the measurement theory, including the method used to collect the data. For example, paired comparison and successive interval scaling represent different methods of data collection and also require different statistical models. Conditions can refer to subgroupings of items, of examinees, or of both. For example, test equating is concerned with the development of procedures that yield comparable estimates of an individual's ability, estimates that are invariant over the subgroups of items (tests) used to obtain them. As another example, the research on item bias or differential item functioning, as it has come to be labelled, reflects concern with whether or not the meaning of an individual's responses on a particular test item varies as a function of irrelevant factors related to membership in various social categories, such as gender, race, and social class.

Sample-Invariant Item Calibration

The basic measurement problem underlying sample-invariant item calibration is how to minimize the influence of arbitrary samples of individuals on the estimation of item scale values. For example, Engelhard (1984) described how Thorndike provided a single adjustment (location) for differences in group characteristics, whereas Thurstone provided for two adjustments (location and scale). Rasch's approach to sample-invariant calibration can be viewed as providing three adjustments (location, scale, and an individual-level response model). Andrich (1978) has also provided an important comparison between the Thurstone and Rasch approaches to item scaling by using paired comparison responses that can also lead to sample-invariant item calibrations.

The overall goal of sample-invariant calibration of items is to estimate the location of items on a latent variable of interest that will remain unchanged across subgroups of individuals and also across various subgroups of items. For example, if the goal of sample-invariant calibration is achieved, then the item scale values will not be a function of subgroup characteristics, such as ability level, gender, race, or social class. Further, the calibration of the items should also be invariant over subsets of items, so that if a calibrated set of items is being developed, the scale values of the items are not affected by the inclusion or exclusion of other items in the test.
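
The contrast between one adjustment (location) and two adjustments (location and scale) can be made concrete with a small numerical sketch. It is an illustration under stated assumptions, not a reproduction of either theorist's computations: ability within each group is assumed to be normally distributed, item difficulty is expressed as a normal deviate of the proportion correct, and the item locations, group parameters, and function names are all hypothetical. Thorndike worked in PE units rather than standard deviation units; standard deviation units are used for both adjustments here so that only the number of adjustments differs.

```python
from statistics import NormalDist, mean, pstdev

STANDARD_NORMAL = NormalDist()  # mean 0, sd 1


def within_group_deviates(item_locations, group_mean, group_sd):
    """Normal deviates x_ig that each item would show inside one group.

    Assumption: ability in the group is Normal(group_mean, group_sd), and the
    proportion correct on an item located at d is P(ability > d).
    """
    ability = NormalDist(group_mean, group_sd)
    proportions = [1.0 - ability.cdf(d) for d in item_locations]
    # A lower proportion correct means a harder item, hence a larger deviate.
    return [-STANDARD_NORMAL.inv_cdf(p) for p in proportions]


def location_only(deviates, group_mean):
    """One adjustment: d_ig = M_g + x_ig."""
    return [group_mean + x for x in deviates]


def location_and_scale(deviates, group_mean, group_sd):
    """Two adjustments: d_ig = M_g + x_ig * sigma_g."""
    return [group_mean + x * group_sd for x in deviates]


def link_to_reference(x_reference, x_other):
    """Mean and sd of the other group in reference-group units.

    Uses the linear relation between the two sets of deviates; the reference
    group is taken to have mean 0 and sd 1.
    """
    sd_other = pstdev(x_reference) / pstdev(x_other)
    mean_other = mean(x_reference) - sd_other * mean(x_other)
    return mean_other, sd_other


# Hypothetical item locations on the common scale (reference-group units).
items = [-1.5, -0.5, 0.0, 0.5, 1.5]
x_a = within_group_deviates(items, 0.0, 1.0)   # reference group A
x_b = within_group_deviates(items, 1.0, 1.5)   # abler, more variable group B

mean_b, sd_b = link_to_reference(x_a, x_b)
one_adjustment_b = location_only(x_b, mean_b)
two_adjustments_b = location_and_scale(x_b, mean_b, sd_b)

for d, one, two in zip(items, one_adjustment_b, two_adjustments_b):
    print(f"true {d:+.2f}  location only {one:+.2f}  location and scale {two:+.2f}")
```

In this idealized setting the two-adjustment scale values from group B coincide with those from group A, whereas the location-only values are compressed toward the group mean because the difference in dispersion is ignored. With real data the agreement would hold only approximately.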

Item-Invariant Measurement of Individuals

In the case of item-invariant measurement, the basic measurement problem involves minimizing the influence of the particular items that happen to be used to estimate an individual's ability. This problem is also related to the scaling and equating of test scores, as well as to the scoring of each individual's performance. Solutions to this problem usually include adjustments for item characteristics (item difficulty) and test characteristics (location, dispersion, and shape of item distributions on the latent variable scale). The overall objective is to obtain comparable estimates of individual ability regardless of which items are included in the test. This objective is essentially the problem of equating person measurements obtained on tests composed of different items (Engelhard & Osberg, 1983). Invariance over scoring method also requires attention. In addition to considering invariance over methods, it is important to examine invariance over conditions within this context; an individual's score should not depend on the scores of other individuals being tested at the same time.

In summary, invariance can be viewed as an important general concept in the physical and behavioral sciences, as well as a key aspect of successful measurement in the behavioral sciences. As pointed out by Bock and Jones (1968), "in a well-developed science, measurement can be made to yield invariant results over a variety of measurement methods and over a range of experimental conditions for any one method" (p. 9).
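
The objective of obtaining comparable ability estimates from different sets of items can be illustrated with a short sketch. It assumes the dichotomous Rasch model in its usual logistic form, which is not written out in this chapter, and it treats the item difficulties as already calibrated on a common scale. The two item sets, the ability value, and the function names are hypothetical, and noise-free expected scores are used in place of observed raw scores so that the comparison is exact.

```python
import math


def p_success(theta, difficulty):
    """Probability of success under the (assumed) dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def expected_score(theta, difficulties):
    """Expected raw score on a test made up of the given calibrated items."""
    return sum(p_success(theta, d) for d in difficulties)


def ability_from_score(score, difficulties, lo=-10.0, hi=10.0, tol=1e-10):
    """Ability at which the expected raw score equals the given score.

    Solved by bisection; the expected score increases with ability, so the
    root is unique. The score must lie strictly between 0 and the test length.
    """
    if not 0.0 < score < len(difficulties):
        raise ValueError("score must lie strictly between 0 and the number of items")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_score(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Two hypothetical tests calibrated on the same latent variable scale.
easy_test = [-2.0, -1.5, -1.0, -0.5, 0.0]
hard_test = [0.0, 0.5, 1.0, 1.5, 2.0]

theta_true = 0.5
score_easy = expected_score(theta_true, easy_test)
score_hard = expected_score(theta_true, hard_test)

theta_from_easy = ability_from_score(score_easy, easy_test)
theta_from_hard = ability_from_score(score_hard, hard_test)

print(f"raw scores: easy {score_easy:.2f}, hard {score_hard:.2f}")
print(f"abilities:  easy {theta_from_easy:.3f}, hard {theta_from_hard:.3f}")
```

The raw scores differ markedly between the two tests, whereas the ability recovered from each test is the same. With observed integer scores the two estimates would agree only within measurement error.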


THREE MEASUREMENT THEORIES AND INVARIANT MEASUREMENT

The purposes of this section are to describe and to illustrate how the concept of invariance emerged within the measurement theories of Thorndike, Thurstone, and Rasch. As the clearest statement of the conditions necessary to accomplish invariance is presented in the measurement theory of Rasch, this section begins with his research and then traces the adumbrations of these ideas within the work of Thurstone and Thorndike. It also should be pointed out that all three of these theorists wrote extensively on various measurement problems, and for Thorndike especially it was sometimes difficult to point to one consistent set of principles that defined his definitive theory of measurement. In order to address this issue, certain texts are explicitly cited. It should be understood that these texts are being used to define a particular individual's measurement theory. This endeavor was not much of a problem for Rasch because he was very consistent in his views related to invariance; Thurstone was fairly consistent, whereas Thorndike was the least consistent of the three.

Rasch

Based on psychometric research conducted during the 1950s, Rasch (1960/1980, 1961, 1966a,b) presented a set of ideas and methods that were described by Loevinger (1965) as a "truly new approach to psychometric problems" (p. 151) that can lead to "nonarbitrary measures" (p. 151). One of the major characteristics of this "new approach" was Rasch's explicit concern with the development of "individual-centered techniques" as opposed to the group-based measurement models used by measurement theorists such as Thorndike and Thurstone. In Rasch's words, "individual-centered statistical techniques require models in which each individual is characterized separately and from which, given adequate data, the individual parameters can be estimated" (1960/1980, p. xx).

Problems related to invariance played an important role in motivating the measurement theory of Rasch. As pointed out by Andrich (1988a), Rasch presented "two principles of invariance for making comparisons that in an important sense precede, though inevitably lead to, measurement" (p. 18). Rasch's concept of "specific objectivity," which he formulated in terms of his principles of comparison, forms his version of the goals of invariant measurement (Rasch, 1977). In Rasch's words,

The comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison; and it should also be independent of which stimuli within the considered class were or might also have been compared. Symmetrically, a comparison between two individuals should be independent of which particular stimuli within the class considered were instrumental for the comparison; and it should also be independent of which other individuals were also compared, on the same or on some other occasion. (Rasch, 1961, pp. 331-332)

It is clear in this quotation that Rasch recognized the importance of both sample-invariant item calibration and item-invariant measurement of individuals. In fact, he made them the cornerstones of his quest for specific objectivity. In order to address problems related to invariance, Rasch laid the foundation for the development of a "family of measurement models," which are characterized by separability of item and person parameters (Masters & Wright, 1984).

Rasch's approach to sample-invariant item calibration involved the comparison of item difficulties obtained in separate groups. In his words,

In relation to attainment tests all the school grades for which the tests are in practice applicable may be considered as forming a total collection of persons, that may be divided into subpopulations, such as single grades, sex groups and age groups within a grade, social strata, etc. Between the test results in such more or less extensive groups the same fundamental relationship must hold, and if so we shall use the term that the relationship is "relatively independent of population," the qualification "relatively" pointing to the degree of breakdown that has been applied to the data. (Rasch, 1960/1980, p. 9)

In his book, he used ability groups formed on the basis of raw scores.
In essence, Rasch was "looking for trouble in a more or less definite direction, namely, for the possibility that the relative difficulties of the tests may vary with [raw score] that is, with the reading inability of the children" (Rasch, 1961, p. 323). This test of fit (or what Rasch referred to as control of the model) was presented graphically. In order to illustrate this idea, the results for two subtests, N and F, from the Danish Military Group Intelligence Test (BPP), which were used by Rasch (1960/1980), are presented in Figure 4-1. The test data were obtained from 1,904 recruits who were tested in September 1953. The results for Subtest N are presented in Panel A (Rasch, 1960/1980, p. 89), which illustrates successful sample-invariant item calibration. The abscissa is based on the average of the separate within-group calibrations. The parallel lines indicate that the difficulty of the items is relatively invariant across raw-score groups. Unsuccessful sample-invariant item calibrations are presented in Panel B for Subtest F (Rasch, 1960/1980, p. 98) and are reflected in the nonparallel lines.

Figure 4-1 Rasch's graphic approach for examining sample-invariant item calibration
Panel A. Subtest N of BPP: successful sample-invariant item calibration
Panel B. Subtest F of BPP: unsuccessful sample-invariant item calibration
Note. The abscissa (l.i) in each panel is the average of the item difficulties calculated separately within the raw score groups (r). The ordinate (lri) represents the item difficulties calculated within each score group with a constant added by Rasch to avoid overlapping items and to highlight the linearity or non-linearity of these plots. From Probabilistic models for some intelligence and attainment tests (pp. 89 and 98) by G. Rasch, 1960/1980, Chicago: The University of Chicago Press. Copyright 1980 by The University of Chicago. Reprinted by permission.

Because of the formal symmetry in the model proposed by Rasch between items and individuals, he could use a similar graphic approach to examine whether or not item-invariant measurement of individuals had been achieved. The results for Subtests N and F are presented in Figure 4-2. Panel A (Rasch, 1960/1980, p. 87) illustrates successful item-invariant measurement with ability estimates relatively invariant over item groups, whereas Panel B (Rasch, 1960/1980, p. 97) provides evidence of unsuccessful item-invariant measurement, as shown by the inequality of the slopes based on the regression of ability estimates obtained separately within each item group on the total.

Figure 4-2 Rasch's graphic approach for examining item-invariant measurement of individuals
Panel A. Subtest N of BPP: successful item-invariant measurement
Panel B. Subtest F of BPP: unsuccessful item-invariant measurement
Note. The abscissa (lr.) in each panel is the average of the ability estimates calculated separately within item groups. The ordinate (lri) represents the ability estimates calculated within each item group with a constant added by Rasch to avoid overlapping ability estimates and to highlight the linearity or non-linearity of these plots. From Probabilistic models for some intelligence and attainment tests (pp. 87 and 97) by G. Rasch, 1960/1980, Chicago: The University of Chicago Press. Copyright 1980 by The University of Chicago. Reprinted by permission.

Even though there are more sophisticated methods for examining invariance using statistical tests of item and person fit (Wright, 1988; Wright & Stone, 1979), the graphical methods can be a useful guide to whether or not invariance has been achieved.
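
The logic of the graphical check for sample-invariant item calibration can be imitated with simulated data. The sketch below is a rough illustration and not Rasch's estimation procedure: responses are generated from the dichotomous Rasch model in its usual logistic form (an assumption, since the model is not written out in this chapter), item difficulties within each raw-score group are taken as centered negative logits of the proportion correct, and a small continuity correction is added to keep the logits finite. The item difficulties, sample size, random seed, and function names are hypothetical.

```python
import math
import random


def simulate_responses(abilities, difficulties, rng):
    """Draw 0/1 responses from the (assumed) dichotomous Rasch model."""
    data = []
    for theta in abilities:
        row = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(theta - d))) else 0
               for d in difficulties]
        data.append(row)
    return data


def difficulties_by_score_group(data, n_items):
    """Centered item difficulties computed separately in each raw-score group."""
    groups = {}
    for row in data:
        r = sum(row)
        if 0 < r < n_items:  # zero and perfect scores say nothing about relative difficulty
            groups.setdefault(r, []).append(row)
    result = {}
    for r, rows in sorted(groups.items()):
        n = len(rows)
        values = []
        for i in range(n_items):
            correct = sum(row[i] for row in rows)
            p = (correct + 0.5) / (n + 1.0)        # continuity correction
            values.append(-math.log(p / (1.0 - p)))  # harder item, larger value
        center = sum(values) / n_items
        result[r] = [v - center for v in values]
    return result


rng = random.Random(1953)
true_difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
abilities = [rng.gauss(0.0, 1.0) for _ in range(20000)]
data = simulate_responses(abilities, true_difficulties, rng)
by_group = difficulties_by_score_group(data, len(true_difficulties))

for r, values in by_group.items():
    print(r, [round(v, 2) for v in values])
```

When the data are generated by the model, the centered difficulties are similar in every raw-score group, which corresponds to the roughly parallel lines in Panel A of Figure 4-1. Data that violate the model would be expected to produce values that drift across score groups.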
As will be seen in the next section, Thurstone used a similar graphical method to examine whether or not his method of absolute scaling was appropriate for a particular set of test data. By focusing on the individual as the level of analysis, Rasch was able to examine test data and to identify when invariance was exhibited. When the data fit the Rasch model, such as with Subtest N, then the types of invariance which eluded research workers in the test theory tradition can be obtained. To quote Loevinger,

Rasch is concerned with a different and more rigorous kind of generalization than Cronbach, Rajaratnam and Gleser. When his model fits, the results are independent of the sample of persons and of the particular items within some broad limits. Within these limits, generality is, one might say, complete. (Loevinger, 1965, p. 151)

Detailed descriptions of Rasch measurement are presented in Wright and Stone (1979), Wright and Masters (1982), and Wright (1988).

Thurstone

Thurstone also recognized the importance of invariant measurement. In fact, as pointed out by Bock and Jones (1968), "in the system of psychological measurement based on the Thurstonian models, we achieve some of the invariance in measurement which is characteristic of the other sciences" (p. 9). In developing his method of absolute scaling for calibrating test items, Thurstone (1925, 1927, 1928a,b) was specifically motivated by the lack of sample-invariance he had observed in Thorndike's scaling method. In his words,

the probable error, or PE [used in Thorndike's method], is not valid as a unit of measurement for educational scales. Its defect consists in that it does not possess the one requirement of a unit of measurement, namely constancy. It fluctuates from one age to another. (Thurstone, 1927, p. 505; emphasis added)

The probable error is a measure of dispersion used by Thorndike that is similar to the interquartile range; for normal distributions, .6745 times the standard deviation is approximately equal to the PE. The concept of constancy proposed by Thurstone is his version of an invariance condition, and it is an explicit consequence of measurement situations that yield objective measurements. Thorndike's PE values fluctuate because the item scale values are not sample-invariant, a condition that violates Thurstone's insight that the "scale value of an item should be the same no matter which age group is used in the standardization" (Thurstone, 1928a, p. 119).

As did Rasch, Thurstone used the idea of a continuum to represent the latent variable of interest and assumed that items can be placed at points on this linear scale which would have a fixed position regardless of the group being tested. According to Thurstone, "if any particular test item or particular raw score is to be allocated on the absolute scale, its scale value should be ideally the same whether determined by group one or group two" (1925, p. 438). Thurstone presented this idea graphically, and his illustration is reproduced in Figure 4-3. In Figure 4-3, Thurstone (1927, p. 509) showed the location of seven items (a to g) and presented the idea that the calibration of these items, which determines their location on the latent variable scale, should be invariant over groups A and B, which are different in terms of location and variability on the latent variable scale.

In order to adjust for differences in the location and variability of two or more distributions, Thurstone assumed a normal distribution of ability for each group and essentially adjusted statistically for differences in locations (means) and scales (standard deviations). To check whether these adjustments lead successfully to sample-invariant item calibration, Thurstone proposed a graphical test of fit. Thurstone's illustration, which is presented in Figure 4-4, shows the plot of the item scale values (sigma values) calibrated separately in grades 7 and 8. According to Thurstone,

If the plot in Figure 4-4 should be distinctly non-linear, the present scaling method is not applicable. Non-linearity here shows that the two distributions cannot both be normal on the same scale. If the plot is linear, it proves that both distributions may be assumed to be normal on the same scale or base line. (Thurstone, 1927, p. 513)

This test of fit can also be presented in the style of the graphical displays used by Rasch; this graphic representation is shown in Figure 4-5 (Engelhard, 1984, p. 33) for Thurstone's data. The effects of using Thurstone's method of absolute scaling, which provides adjustments for differences in the locations and variations of the ability distributions, as compared to Thorndike's scaling method, which simply adjusts for location differences, are shown in Figure 4-6. In Panel A of Figure 4-6 (Thurstone, 1927, p. 506), the results of using Thorndike's method to calibrate a language scale developed by Trabue (1916) are presented; the average language ability increases as a function of grade level, whereas the variances remain constant. The results obtained by using Thurstone's method are presented in Panel B of Figure 4-6 (Thurstone, 1927, p. 515); in this figure, average ability increases with grade level, but the variances of the scores also increase. These results seem theoretically plausible. Thurstone's method of absolute scaling is described and illustrated in detail in Engelhard (1984). An "experimental" adjustment for sample effects that occurs with Thurstone's model for paired comparisons is described in Andrich (1978).

Figure 4-3 Thurstone's view of sample-invariant item calibration
Note. The abscissa represents a latent variable scale. According to Thurstone (1927), the location of the seven items (a to g) on the latent variable scale should be invariant over ability groups A and B. From "The Unit of Measurement in Educational Scales" by L.L. Thurstone, 1927, The Journal of Educational Psychology, 18, p. 509. Copyright American Psychological Association. Reprinted by permission.

Figure 4-4 Thurstone's graphic approach for examining sample-invariant item calibrations
Note. Item scale values (sigma values) were calculated separately by grade (7 and 8). From "The Unit of Measurement in Educational Scales" by L.L. Thurstone, 1927, The Journal of Educational Psychology, 18, p. 513. Copyright American Psychological Association. Reprinted by permission.

Figure 4-5 Rasch's graphic test of fit for Thurstone's data
Axis label: Combined calibration sample.
Note. Based on same data presented in Figure 4-4. From "Thorndike, Thurstone and Rasch: A comparison of their methods of scaling psychological and educational tests" by G. Engelhard, 1984, Applied Psychological Measurement, 8, p. 33. Copyright 1984 by Applied Psychological Measurement, Inc. Reproduced by permission.

Figure 4-6 Distribution of language ability based on Thorndike's method (Panel A) and Thurstone's method of absolute scaling (Panel B)
Panel A. Based on Thorndike's scale
Panel B. Based on Thurstone's method of absolute scaling
Note. Abscissa is a latent variable scale for measuring language ability and ordinate indicates successive grade groups (grade 2 to 12). From "The Unit of Measurement in Educational Scales" by L.L. Thurstone, 1927, The Journal of Educational Psychology, 18, pp. 506 and 515. Copyright

Thurstone's method of absolute scaling can also be used to scale test scores (Gulliksen, 1950), but a more interesting discussion of issues related to item-invariant measurement is presented by Thurstone (1926) in an article on the scoring of individual performance. In this article, Thurstone presented a set of conditions as follows:

1. It should not be required to have the same number of test elements at each step of the scale.
2. It should be possible to omit several test questions at different levels of the scale without affecting the individual score.
3. It should be possible to include in the same scale two forms of test.
4. It should not be required to submit every subject to the whole range of the scale. The starting point and terminal point, being selected by the examiner, should not directly affect the individual score.
5. It should be possible to use the scale so that a rational score may be determined for each individual subject and so that the performance of groups of subjects may be compared.
6. The arithmetical labor in determining individual scores should be a minimum.
7. The procedure should be as far as possible consistent with psychophysical methods so that it will be free from the logical errors involved in the Binet scales and its variants.

Conditions one to five clearly show Thurstone's concern with item-invariant measurement. In his 1926 paper, he went on to propose a scoring method which meets these conditions. Thurstone's approach is presented in detail by Engelhard (1991). In essence, Thurstone proposed what would be recognized today as person characteristic curves that graphically present the probabilities of an individual succeeding on a set of calibrated test items.

Many of Thurstone's articles on scaling are included in The Measurement of Values (1959), although his work on absolute scaling is not included in that volume. The technical details and elaborations of Thurstonian models are presented in Bock and Jones (1968). Andrich (1988c) provided a useful overview of Thurstone's contributions to measurement theory. Although it is not directly relevant for this chapter, it is interesting to note that both Thurstone (1947) and Rasch (1953) also used the concept of invariance as an important aspect of their approaches to factor analysis.

Thorndike

In 1904, Thorndike published the first edition of his highly influential book entitled An Introduction to the Theory of Mental and Social Measurements. Thorndike's major aim in writing this book was to "introduce students to the theory of mental measurements and to provide them with such knowledge and practice as may assist them to follow critically quantitative evidence and argument and to make their own researches exact and logical" (1904, p. v). Thorndike's book was the standard reference on statistics and quantitative methods in the mental and social sciences for the first two decades of this century (Clifford, 1984; Engelhard, 1988; Travers, 1983).
Much of this influence can be attributed to Thorndike's clear and expository writing style. He explicitly acknowledged that contemporary work in measurement theory had not been presented in a manner suitable for students without fairly advanced mathematical skills. He set out to present a less mathematical introduction to measurement theory based on the belief that "there is, happily, nothing in the general principles of modern statistical theory but refined common sense, and little in the techniques resulting from them that general intelligence can not readily master" (p. 2). Thorndike, who wrote extensively on educational and psychological measurement, covered topics that ranged from the general statement of his theory (Thorndike, 1904) to the measurement of a variety of educational outcomes (Thorndike, 1910, 1914, 1918, 1921), as well as intelligence (Thorndike, Bregman, Cobb, & Woodyard, 1926).

What were the basic measurement problems identified by Thorndike? Thorndike clearly stated that the "special difficulties" of measurement in the behavioral sciences are:

1. Absence or imperfection of units in which to measure.
2. Lack of constancy in the facts measured.
3. Extreme complexity of the measurements to be made.

In order to illustrate the problems related to the absence of an accepted unit of measurement, Thorndike (1904) pointed out that the spelling tests developed by Joseph Mayer Rice did not have equal units. Rice assumed that all his spelling words were of equal difficulty, whereas Thorndike argued that the correct spelling of an easy versus a hard word did not reflect equal amounts of spelling ability. Because the units of measurement are unequal, Thorndike asserted that Rice's results were inaccurate. Without general agreement on units, the meaning of test scores becomes more subjective. Within the framework of this chapter, Thorndike was illustrating that obtained scores may not be invariant over subsets of items which vary in difficulty.

Inconstancy is the second major measurement problem identified by Thorndike (1904). Many of the measurement problems encountered in the behavioral sciences are related to random variation inherent in human characteristics. These variations are due not only to the unreliability of tests, but also to within-subject fluctuations. For example, if a person's motivation is measured repeatedly, these values tend to vary. Thorndike's concept of constancy is also related to the idea of invariance as developed in this chapter.

The final measurement problem or "special difficulty" identified by Thorndike pertains to the extreme complexity of the variables and constructs that social and behavioral scientists wish to measure. This problem primarily, although not totally, reflects a concern with dimensionality. Most of the variables worth measuring in the behavioral sciences do not readily translate into unidimensional tests that permit the reporting of a single score to represent the individual's location on the latent variable or construct of interest. As pointed out by Jones and Appelbaum (1989), if unidimensionality is obtained for all items and over all groups of examinees, then item parameters will be invariant across groups, and ability parameters will be invariant across items. Methods for conducting item factor analyses designed to explore this issue have been summarized by Mislevy (1986), and an approach to this problem has been illustrated by Muraki and Engelhard (1985).

Thorndike's method for obtaining sample-invariant item calibration is very similar to Thurstone's method of absolute scaling. As described by Thurstone,

Thorndike's scaling method consists in first determining the scale value of each item for each grade separately with the mean of each grade as an origin. The difficulty of a test item for Grade V children, for example, is determined by the proportion of right answers to the test item in that grade. When a test item has been scaled in several grades, the scale values so obtained will, of course, be different because of the fact that they are expressed as deviations from different grade means as origins. Thorndike then reduces all these measurements to a common origin in the construction of an educational scale by adding to each scale value the scale value of the mean of the grade. (Thurstone, 1927, p. 508)

The major difference between Thorndike's method of item scaling and Thurstone's method of absolute scaling is that Thorndike assumed that the variances of the groups are equal. Thurstone criticized this assumption:

it is clear that in order to reduce the overlapping sentences or test items to a common base line or scale it is necessary to make not one but two adjustments. One of these adjustments concerns the means of the several grade groups and this adjustment is made by the Thorndike scaling methods.
The second adjustment which is not made by Thorndike concerns the variation in dispersion of the several groups when they are referred to a common scale. (Thurstone, 1927, p. 509) The results of using the two different methods were presented earlier in Figure 4-6. In his later work, Thorndike did include an adjustment for the range of scores (Thomson, 1940). Thorndike's views of item-invariant measurement of individuals are presented in several places (Thorndike, 1914; Thorndike et al., 1926). Engelhard (1991) presents a detailed description of Thorndike's approach as applied to the measurement of reading ability (Thorndike, 1914). Essentially, Thorndike recommended using a set of procedures t h a t are very similar to the methods of scoring individual performance

92

ENGELHARD, JR.

used by Thurstone and Rasch. Thorndike also suggested examining person fit and proposed adjusting reading ability estimates when an individual responded in an inconsistent manner to the test items. COMPARISON AND DISCUSSION OF THREE MEASUREMENT THEORIES The comparisons of the major similarities and differences among the measurement theories of Thorndike, Thurstone and Rasch are summarized in Tables 4-1 and 4-2. Table 4-1 presents a summary comparison of their views related to sample-invariant item calibrations, while Table 4-2 presents issues related to the item-invariant measurement of individuals. These issues are discussed in detail in two earlier articles (Engelhard, 1984, 1991). In general terms, it is clear that Thorndike, Thurstone, and Rasch were all working within a common scaling tradition. They based many of their proposed methods for calibrating test items and measuring individuals on statistical advances made within the field of psychophysics. One of the differences between psychophysics and psychometrics is that the independent variable is usually an observable variable in psychophysics, whereas in psychometrics the

Table 4-1. Comparison of Thorndike, Thurstone, and Rasch on Major Issues Related to Sample-Invariant Item Calibration

Issue                                     Thorndike                 Thurstone                         Rasch
Recognized importance of item invariance  Yes                       Yes                               Yes
Utilized the latent trait concept         Yes                       Yes                               Yes
Transformation of percent correct         PE values                 Normal deviates                   Logits
Level of analysis                         Group                     Group                             Individual
Assumed distribution of ability           Normal                    Normal                            None required
Tests of fit                              Model to data             Model to data                     Data to model
Number of adjustments                     1                         2                                 3
Item difficulties (scale values)          $d_{ig} = M_g + x_{ig}$   $d_{ig} = M_g + x_{ig}\sigma_g$   $d_i = M + x_i Y$
Person measurement                        Separate process          Separate process                  Simultaneous process

Note. From "Thorndike, Thurstone and Rasch: A comparison of their methods of scaling psychological and educational tests" by G. Engelhard, 1984, Applied Psychological Measurement, 8(1), p. 29. Copyright 1984 by Applied Psychological Measurement Inc. Reproduced by permission.
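
The item-difficulty row of Table 4-1 can be illustrated with a small numerical sketch. The code below is an illustration only, not a reconstruction of either author's computations: the grade means, grade standard deviations, and proportions correct are hypothetical, the function names are invented for this example, and normal deviates are used for both methods (Table 4-1 notes that Thorndike worked in PE values). It shows that the two scale values agree when a grade's dispersion is 1 and diverge otherwise, which is the second adjustment Thurstone argued for.

```python
from statistics import NormalDist


def normal_deviate(p_correct):
    """Within-grade difficulty of an item as a normal deviate.

    A lower proportion correct gives a larger (harder) value; an item that
    half of the grade answers correctly sits at the grade mean (0).
    """
    return NormalDist().inv_cdf(1.0 - p_correct)


def thorndike_scale_value(p_correct, grade_mean):
    # d_ig = M_g + x_ig: only the grade mean is adjusted.
    return grade_mean + normal_deviate(p_correct)


def thurstone_scale_value(p_correct, grade_mean, grade_sd):
    # d_ig = M_g + x_ig * sigma_g: the grade dispersion is adjusted as well.
    return grade_mean + normal_deviate(p_correct) * grade_sd


# Hypothetical grade parameters on a common scale: (mean M_g, SD sigma_g).
grades = {"V": (0.0, 1.0), "VI": (0.8, 1.3)}
# Hypothetical proportions correct for a single item in each grade.
p_correct = {"V": 0.30, "VI": 0.55}

for grade, (mean, sd) in grades.items():
    d_thorndike = thorndike_scale_value(p_correct[grade], mean)
    d_thurstone = thurstone_scale_value(p_correct[grade], mean, sd)
    print(f"Grade {grade}: Thorndike {d_thorndike:.3f}, Thurstone {d_thurstone:.3f}")
```

With these made-up numbers, the two methods give the same value in the grade whose standard deviation is 1 and different values in the other grade.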

Table 4-2. Comparison of Thorndike, Thurstone, and Rasch on Major Issues Related to Item-Invariant Measurement of Individuals

Issue                                                Thorndike          Thurstone          Rasch
Recognized importance of item-invariant measurement  Yes                Yes                Yes
Utilized concept of latent variable scale            Yes                Yes                Yes
Avoided using raw scores                             Yes                Yes                Yes
Used person response curves                          Yes                Yes                Yes
Had formal probabilistic model                       No                 No                 Yes
Used standard errors for ability estimates           No                 No                 Yes
Scoring criterion                                    80%                50%                50%
Flagged inconsistent response patterns               Yes (ad hoc)       No                 Yes (theory)
Item calibration                                     Separate process   Separate process   Simultaneous process

Note. From "Thorndike, Thurstone and Rasch: A comparison of their approaches to item-invariant measurement" by G. Engelhard, 1991, Journal of Research and Development in Education, 24(2), p. 55. Copyright 1991 by College of Education, The University of Georgia. Reprinted by permission.

construct is usually unobservable. As this construct is not directly observable, these three psychometricians used the idea of a latent continuum to represent this unobservable variable. Although they all held similar positions on many measurement issues as highlighted in Tables 4-1 and 4-2, there are also several important differences between the conceptualizations of Thorndike and Thurstone as compared to the views of Rasch. One of the major differences is the recognition by Rasch that measurement models can and should be developed based on the responses of individuals to single test items. This focus on the individual, rather than on groups, allowed Rasch to avoid making unnecessary assumptions regarding the distribution of abilities that were needed by both Thorndike and Thurstone. As pointed out earlier, Thorndike's method of scaling test items and Thurstone's method of absolute scaling were both based on the assumption that abilities were normally distributed. By using the individual and not the group as the level of analysis, Rasch invented measurement models that are capable of providing estimates of the location of both items and individuals on a latent variable continuum simultaneously. This approach also allowed Rasch to develop probabilistic models rather than deterministic ones for modelling the probability of each individual succeeding on a particular test item as a function of his or her ability and the item difficulties. This probabilistic relationship is clearly shown in the familiar S-shaped item characteristic
curves. Further, by simultaneously including item calibration and individual measurement within one model, he was able to derive "conditional" estimates of these parameters, which provide a framework for determining whether or not invariance has been achieved.
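
The probabilistic relationship described above can be made concrete with a brief sketch. It assumes the standard logistic form of the Rasch model, with ability and difficulty both expressed in logits; the item difficulties and ability values are hypothetical and the function names are chosen only for this illustration. It evaluates the S-shaped item characteristic curve at a few ability levels and shows that the log-odds gap between two items does not depend on the ability of the person, which is one simple way to see what invariance means in this model.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical item difficulties in logits.
easy_item, hard_item = -1.0, 0.5

for ability in (-2.0, 0.0, 2.0):
    p_easy = rasch_probability(ability, easy_item)
    p_hard = rasch_probability(ability, hard_item)
    # The gap equals hard_item - easy_item at every ability level.
    gap = log_odds(p_easy) - log_odds(p_hard)
    print(f"ability={ability:+.1f}  P(easy)={p_easy:.3f}  P(hard)={p_hard:.3f}  gap={gap:.3f}")
```

The probabilities change with ability along each curve, while the comparison of the two items stays fixed at the difference between their difficulties.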

SUMMARY

Progress is as difficult to define within the field of measurement as in any other field of study (Donovan, Laudan, & Laudan, 1988; Laudan, 1977). The analysis presented in this chapter suggests that Rasch's work provides a theoretical and statistical framework for the practical realization of invariant measurement that was sought by both Thorndike and Thurstone. The simultaneous inclusion of both ability and item difficulty within a probabilistic model defined at the individual level of analysis provided a general framework in which item and person parameters can be estimated separately. Rasch was able to use recent advances in statistics, such as the concept of sufficiency developed by Fisher (1925), to propose an approach to measurement that provides practical solutions to many testing problems related to invariance.

This chapter is part of a larger program of research related to the history and philosophy of measurement theory. The overall purposes of this research are to identify basic measurement problems and to describe how these measurement problems are addressed by major measurement theorists. As pointed out earlier, many of the measurement problems that are faced today are not new. Through the use of historical and comparative perspectives, it is possible to gain a better understanding of both the measurement problems themselves and of the progress that has been made toward the solution of these problems. Some of the perennial measurement problems in the behavioral sciences can be viewed as part of the quest for invariant measurement as described in this paper. Another related concept that was not examined in this presentation is unidimensionality. A historical and comparative analysis of this concept and of its development within scaling theory along the lines used in this chapter would be an important contribution to the knowledge of progress in measurement theory.
This chapter has focused on the concept of invariance as it has appeared within the context of measurement theory. Invariance can also be viewed more broadly as the quest for generality in science. If science is viewed in its simplest form as a series of questions and answers, then invariance addresses the problem of whether or not answers are comparable over methods and groups. The concept of invariance within educational and psychological research can also be expanded to include first, second, and higher order invariances. For example, invariances of the first order might deal with mean differences between groups on a variable such as mathematics anxiety. A second order concern might be whether or not the correlations between mathematics achievement and anxiety are invariant over gender, social class, and race groups. Higher order invariances might relate to the generalizability of a system of interrelationships among more than two variables.

There are several areas for future research related to the manner in which the concept of invariance appears within other measurement theories that are not within the scaling tradition but derive from the test theory tradition. Some illustrative questions are: How does the work on test theory relate to the quest for invariance within scaling theory? Can the work of Spearman be viewed as a search for an invariant ranking of individuals regardless of time of administration and instruments used? Can the work of Cronbach and others on generalizability theory be viewed as an attempt to identify and examine sources of error variance in test scores which are related to the concept of "invariance" in educational and psychological tests as presented in this chapter? What about invariance within the framework of two- and three-parameter item response models? What about Guttman's research on psychometrics? What are the explicit connections of classical measurement concepts, such as reliability and validity, to the concept of invariance as presented in this chapter? How does invariance relate to unidimensionality?

In summary, the problem of invariance is of fundamental importance for the development of meaningful measures in education and psychology.
Item-invariant estimates of individual abilities and sample-invariant estimates of item difficulties are essential in order to realize the advantages of objective measurement. The conditions for objective measurement correspond to the concept of invariance as developed in this paper. The conditions for objective measurement are as follows: First, the calibration of measuring instruments must be independent of those objects that happen to be used for the calibration. Second, the measurement of objects must be independent of the instrument that happens to be used for the measuring. (Wright, 1968, p. 87) This chapter provides a historical and substantive review of the problems related to invariant measurement. It also illustrates the progress that has been made toward solving measurement problems related to
invariance. Further, this chapter contributes to an appreciation of Rasch's accomplishments and of the elegance of his approach to problems related to invariant measurement. As pointed out by Andrich (1988b), Rasch's achievements did not occur in a "historical vacuum" (p. 13). This chapter illustrates the continuity and progress that is evident within the measurement theories of Thorndike, Thurstone, and Rasch.

REFERENCES

Andrich, D. (1978). Relationships between the Thurstone and Rasch approaches to item scaling. Applied Psychological Measurement, 2, 449-460.
Andrich, D. (1988a). Rasch models for measurement. Newbury Park, CA: Sage.
Andrich, D. (1988b, April). A scientific revolution in social measurement. Paper presented at the annual meeting of the American Educational Research Association in New Orleans.
Andrich, D. (1988c). Thurstone scales. In J.P. Keeves (Ed.), Educational research, methodology, and measurement: An international handbook. Oxford: Pergamon Press.
Bock, R.D., & Jones, L.V. (1968). The measurement and prediction of judgement and choice. San Francisco: Holden-Day.
Brennan, R.L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing Program.
Cattell, J.K. (1893). Mental measurement. Philosophical Review, 2, 316-332.
Clifford, G.J. (1984). Edward L. Thorndike: The sane positivist. Middleton, CT: Wesleyan University Press. (Original work published 1968.)
Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability of scores and profiles. New York: Wiley.
Donovan, A., Laudan, L., & Laudan, R. (Eds.). (1988). Scrutinizing science: Empirical studies of scientific change. Boston: Kluwer Academic Publishers.
Engelhard, G. (1984). Thorndike, Thurstone and Rasch: A comparison of their methods of scaling psychological tests. Applied Psychological Measurement, 8, 21-38.
Engelhard, G. (1988, April). Thorndike's and Wood's principles of educational measurement: A view from the 1980's. Paper presented at the annual meeting of the American Educational Research Association in New Orleans (ERIC Document Reproduction Service No. ED 295 961).
Engelhard, G. (1991). Thorndike, Thurstone and Rasch: A comparison of their approaches to item-invariant measurement. Journal of Research and Development in Education, 24(2), 45-60.
Engelhard, G., & Osberg, D.W. (1983). Constructing a test network with a Rasch measurement model. Applied Psychological Measurement, 7, 283-294.
Fisher, R.A. (1925). Statistical methods for research workers. Edinburgh: Oliver and Boyd.
Guilford, J.P. (1936). Psychometric methods. New York: McGraw-Hill Book Company, Inc.
Gulliksen, H. (1950). Theory of mental tests. New York: J. Wiley and Sons.
Jones, L.V. (1960). Some invariant findings under the method of successive intervals. In H. Gulliksen & S. Messick (Eds.), Psychological scaling: Theory and applications (pp. 7-20). New York: John Wiley and Sons.
Jones, L.V., & Appelbaum, M.I. (1989). Psychometric methods. Annual review of psychology, 40, 23-43.
Joreskog, K.G., & Sorbom, D. (1986). LISREL VI: Analysis of linear structural relationships by maximum likelihood, instrumental variables, and least squares methods. Mooresville, IN: Scientific Software.
Laudan, L. (1977). Progress and its problems: Toward a theory of scientific change. Berkeley, CA: University of California Press.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694.
Loevinger, J. (1965). Person and population as psychometric concepts. Psychological Review, 72, 143-155.
Lumsden, J. (1976). Test theory. Annual review of psychology, 27, 251-280.
Masters, G.N., & Wright, B.D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529-544.
Mislevy, R.J. (1986). Recent developments in the factor analysis of categorical variables. Journal of Educational Statistics, 11, 3-31.
Mosier, C.I. (1940). Psychophysics and mental test theory: Fundamental postulates and elementary theorems. Psychological Review, 47, 355-366.
Mosier, C.I. (1941). Psychophysics and mental test theory II: The constant process. Psychological Review, 48, 235-249.
Muraki, E., & Engelhard, G. (1985). Full-information item factor analysis: Applications of EAP scores. Applied Psychological Measurement, 9, 417-430.
Rasch, G. (1953). On simultaneous factor analysis in several populations. Uppsala Symposium on Psychological Factor Analysis (pp. 65-71). Nordisk Psykologi's Monograph Series, 3.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In J. Neyman (Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (pp. 321-333). Berkeley, CA: University of California Press.
Rasch, G. (1966a). An individualistic approach to item analysis. In P.F. Lazarsfeld and N. Henry (Eds.), Readings in mathematical social science (pp. 89-107). Chicago: Science Research Associates.
Rasch, G. (1966b). An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 19, 49-57.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58-94.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press. (Original work published 1960.)
Shavelson, R.J., Webb, N.M., & Rowley, G.L. (1989). Generalizability theory. American Psychologist, 44, 922-932.
Spearman, C. (1904). "General intelligence," objectively determined and measured. American Journal of Psychology, 15, 201-293.
Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
Stevens, S.S. (1951). Mathematics, measurement, and psychophysics. In S.S. Stevens (Ed.), Handbook of experimental psychology (pp. 1-49). New York: Wiley.
Thomson, G.H. (1940). The nature and measurement of the intellect. Teachers College Record, 41, 726-750.
Thorndike, E.L. (1904). An introduction to the theory of mental and social measurements. New York: Teachers College, Columbia University.
Thorndike, E.L. (1910). Handwriting. Teachers College Record, 11, 83-175.
Thorndike, E.L. (1914). The measurement of ability in reading. Teachers College Record, 15, 207-277.
Thorndike, E.L. (1918). The nature, purposes, and general methods of measurements of educational products. In G.M. Whipple (Ed.), The seventeenth yearbook of the national society for the study of education. Part II, The measurement of educational products. Bloomington, IL: Public School Publishing Company.
Thorndike, E.L. (1921). Measurement in education. Teachers College Record, 22, 371-379.
Thorndike, E.L., Bergman, E.O., Cobb, M.V., & Woodyard, E. (1926). The measurement of intelligence. New York: Bureau of Publications, Teachers College, Columbia University.
Thurstone, L.L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 15, 433-451.
Thurstone, L.L. (1926). The scoring of individual performance. Journal of Educational Psychology, 17, 446-457.
Thurstone, L.L. (1927). The unit of measurement in educational scales. Journal of Educational Psychology, 18, 505-524.
Thurstone, L.L. (1928a). Comment by Professor L.L. Thurstone. Journal of Educational Psychology, 19, 117-124.
Thurstone, L.L. (1928b). Scale construction with weighted observations. Journal of Educational Psychology, 19, 441-453.
Thurstone, L.L. (1947). Multiple-factor analysis: A development and expansion of the vectors of mind. Chicago: The University of Chicago Press.
Thurstone, L.L. (1959). The measurement of values. Chicago: The University of Chicago Press.
Trabue, M.R. (1916). Completion-test language scales. Contributions to Education (No. 77). New York: Columbia University, Teachers College.
Travers, R.M.W. (1983). How research has changed American schools: A history from 1840 to the present. Kalamazoo, MI: Mythos Press.
Wright, B.D. (1968). Sample-free test calibration and person measurement. Proceedings of the 1967 invitational conference on testing problems. Princeton, NJ: Educational Testing Service.
Wright, B.D. (1988). Rasch measurement models. In J.P. Keeves (Ed.), Educational research, methodology, and measurement: An international handbook. Oxford: Pergamon Press.
Wright, B.D., & Masters, G. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.
Wright, B.D., & Stone, M.H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.

Part II. Practice

Chapter 5

Computer-Adaptive Testing: A National Pilot Study

Mary E. Lunz
American Society of Clinical Pathologists

Betty A. Bergstrom
Computer Adaptive Technologies

The purpose of educational measurement is to inform educational decision making by providing estimates of an individual's knowledge and skill. For certification and licensure, this means making minimum competency pass/fail decisions. In recent years, computers have become more versatile and more accepted for the development and delivery of examinations. One of the most interesting and potentially advantageous methods for certification boards and examinees is computer-adaptive testing (CAT). The adaptive algorithms for item selection usually depend on item response theory (IRT) (Rasch, 1960/1980; Lord & Novick, 1968; Wright & Stone, 1979). Items in the bank are calibrated to a benchmark scale on which a pass/fail point is established. The adaptive algorithm selects items that provide the most information about the examinee given the current ability measure estimated from responses to all of the previous items. Many studies (Weiss, 1983, 1985; Weiss & Kingsbury, 1984; McKinley & Reckase, 1980; Olsen, Maynes, Slawson, & Ho, 1986) have explored computer-adaptive tests and have found that because maximum information is gained from each item administered, lower measurement error and higher reliability can be achieved using fewer items. While this is advantageous from a psychometric perspective, it presents the examinee with a testing experience that is quite different from traditional multiple choice tests.

Why a National Study

Computer-adaptive testing is attractive because of the convenience to examinees with regard to scheduling and reporting, potentially shorter tests, and increased availability of opportunities to challenge the test. Advantages to the certification board include improved security and data collection, better opportunity to control cheating, and cost savings with regard to committee expenses, printing, and shipping. Computer adaptive tests, however, are different from traditional paper and pencil certification examinations. Written certification examinations usually include 200 to 500 items while computer adaptive examinations are usually shorter, including fewer than 100 items. Paper and pencil tests are administered simultaneously. Current practice suggests that certification examinations begin with an easier item, while computer adaptive tests usually begin by presenting an item of medium difficulty. Most examinees get 70 percent or more of the items correct on a certification examination, while a computer adaptive test is usually targeted at 50 percent probability of correct response. On a traditional test, examinees can review and change answers, but on a computer adaptive test this option may not be available.

The concern is how examinees and educators react to this innovation in test administration. Are examinees willing to believe in the IRT methodology? Even more mundane, can examinees follow the directions for entering responses into the computer, read items from the computer monitor, and look at a separate illustration book? Will examinees panic at the thought of a computer-administered test? Will examinees perform poorly when they have a harder than usual test, or when they are not given the opportunity to review their answers? These concerns could not be addressed adequately using simulated data, which effectively removes the human element from the evaluation process.
It therefore seemed mandatory to verify the known and postulated psychometric, psychological, and social attributes of computer adaptive testing. Thus a national pilot study was undertaken.

METHODS AND RESULTS

Item Precalibration

A paper and pencil examination was given to a sample of students from 57 medical technology programs. From the analysis of these data
an item bank was constructed that met the test specifications for the traditional paper and pencil certification examination. The items were calibrated using the Rasch model (Rasch, 1960/1980; Wright & Stone, 1979). Inappropriate items and poorly fitting items were deleted before the calibrated item bank of 726 items was established. The stability of the item precalibrations is discussed in detail in the chapter entitled "The Equivalence of Rasch Item Calibrations and Ability Estimates Across Modes of Administration."

Data Collection

Two hundred thirty-eight medical technology programs from across the country participated in the second phase. Program directors agreed to administer, under secure conditions, a computer adaptive test and a written test composed of 109 items from the computer adaptive test pool, to their students who were eligible to take the certification examination. Comparable pass/fail decisions on paper-and-pencil and adaptive tests were made (Lunz & Bergstrom, 1991). The calibrated item bank of 726 items was used to construct computer-adaptive tests tailored to the current ability of each student. An individual computer disk was available for each student. The computer-adaptive test could be administered in a computer center to the group or individually in a private office as long as security was maintained. Useable data were gathered from approximately 1,077 students; 83 percent were white and 81 percent were female, which is the typical population mix for this certification examination.

Appropriateness of the Rasch Model for CAT

The appropriateness of the Rasch model over other IRT models for computer-adaptive testing has been confirmed by several studies. Wainer (1983) states that when items are targeted to the ability of the examinee, items that are very difficult for an examinee are not presented. Thus the incidence of guessing is minimal and the estimation of a lower asymptote within the confines of CAT is generally impractical. Wainer (1983) also notes that "inclusion of slopes in the estimation model will result in a very optimistic estimate of the accuracy of the ability estimate." Sample sizes in this study were relatively small, but the Rasch model item calibrations have been found to be robust with small samples (Lord, 1983). Also, there is evidence that person measures estimated with the Rasch and the two- and three-parameter models correlate
highly (.99) when tests are administered under a computer adaptive algorithm (Olsen et al., 1986).

The Rasch model (Rasch, 1960/1980) was used to calibrate items and estimate person measures. The PROX method was used for item selection (Wright & Stone, 1979) in the adaptive algorithm. The Rasch model calibrates item difficulties to a log-linear scale: the probability of a correct response is $P = \exp(B - D) / [1 + \exp(B - D)]$, so that $\log[P / (1 - P)] = B - D$, where $B$ is the person measure and $D$ is the item difficulty. Item difficulties are expressed in log-odds units (logits).

Fit of the Data to the Rasch Model

The fit of the data to the Rasch model was verified by examining the infit statistic for the calibrated items (Wright & Masters, 1982). For each person/item encounter, the observed response was compared to the modeled expected response. Misfitting items were removed from the item bank. When data fit the Rasch model, the infit statistic (the mean of the standardized squared residual, weighted by its variance) has a value near 0 and a standard deviation near 1.0. For the 726-item pool, the mean item infit was .04 with a standard deviation of 1.01.

CAT Algorithm

The computer adaptive testing model used in this study has the following characteristics. It is designed as a mastery model (Weiss & Kingsbury, 1984) to determine whether a person's estimated ability level is above or below a preestablished criterion. Kingsbury and Houser (1990) have shown that an adaptive testing procedure that provides maximum information about the examinee's ability will provide a clearer indication that the examinee is above or below the pass/fail point than a test that peaks the information at the pass/fail point. The CAT ADMINISTRATOR program (Gershon, 1989) constructed computer-adaptive tests following the test specifications of the traditional paper-and-pencil certification examination (see Table 5-1). This means that the item with the most appropriate level of difficulty, within a given subtest, was presented to the examinee.
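
The idea of presenting the most appropriate item within a subtest can be sketched as follows. This is a minimal illustration under stated assumptions, not the CAT ADMINISTRATOR program: the small item bank, the item identifiers, and the function names are hypothetical, and the selection criterion is the Rasch item information P(1 - P), which is largest when the item difficulty equals the current ability estimate.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Rasch item information P(1 - P); it peaks when difficulty == ability."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


def select_item(ability, bank, subtest, used):
    """Return the id of the most informative unused item in `subtest`.

    `bank` maps item id -> (subtest name, difficulty in logits).
    Returns None when the subtest has no unused items left.
    """
    candidates = [
        (item_id, difficulty)
        for item_id, (name, difficulty) in bank.items()
        if name == subtest and item_id not in used
    ]
    if not candidates:
        return None
    best = max(candidates, key=lambda c: item_information(ability, c[1]))
    return best[0]


# Hypothetical miniature bank; the study's bank held 726 calibrated items.
bank = {
    "M1": ("Microbiology", -1.20),
    "M2": ("Microbiology", 0.10),
    "M3": ("Microbiology", 1.40),
    "C1": ("Chemistry", 0.30),
}

current_estimate = 0.25  # hypothetical current ability estimate in logits
first = select_item(current_estimate, bank, "Microbiology", used=set())
second = select_item(current_estimate, bank, "Microbiology", used={first})
print(first, second)
```

In this toy bank the item closest in difficulty to the current estimate is chosen first, and the next closest unused item in the same subtest is chosen once the first has been administered.
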
In the first 50 items, blocks of 10 items were administered from subsets 1-4 and blocks of 5 items were administered from subsets 5 and 6. After 50 items, blocks of 4 items (subsets 1-4) and blocks of 2 items (subsets 5 and 6) were administered. Subset order was selected randomly by the computer algorithm. Maurelli and Weiss (1983) found subtest order to have no effect on the psychometric properties of an achievement test battery. Items were chosen at random from unused items within .10 logits of

Table 5-1. Item Bank Description

Subtest        Test Plan Distribution*  Number of Items in Bank  Easiest Item  Mean  Hardest Item  SD
Microbiology   20%                      147                      -2.89         -.06  2.38          .96
Blood Banking  20%                      165                      -2.21         -.07  2.94          1.00
Chemistry      20%                      142                      -3.61         -.07  2.97          1.06
Hematology     20%                      135                      -2.80         -.05  2.97          .97
Body Fluids    10%                      72                       -2.24         -.09  3.84          .97
Immunology     10%                      65                       -2.78         .25   2.04          .96
Bank Scale     100%                     726                      -3.61         -.02  3.84          1.00

*The test plan distribution for computer-adaptive tests was the same as the test plan for the traditional fixed-length written certification examination.

the targeted item difficulty within the specified content area. While the examinee considered the item presented, the computer selected two items, one that would yield maximum information should the current item be answered incorrectly and another that would yield maximum information should the current item be answered correctly. This procedure ensured that there was no lag time before the next item was presented.

The minimum test length was 50 items and the maximum test length was 240 items. All examinees had four hours to complete the computer test. The test stopped when the examinee achieved a measure 1.3 × SEM (90% confidence, one-tailed test) above or below the pass point of .15 logits on the bank scale. Figure 5-1 shows an examinee's test map. Note that by item 50, the error band is well above the pass point, making this examinee a clear pass with greater than 90 percent confidence in the accuracy of the decision. If an examinee challenged 240 items and a pass/fail decision could not be made, the test stopped and a decision was made with less than 90 percent confidence, based on his or her measure at that point.

Experimental Conditions and Results

The computer-adaptive tests also incorporated varying combinations of experimental test conditions. These test conditions were designed to assess the known and assumed attributes of computer-adaptive testing, based on the assumption that some modifications to the "theoretically perfect computer-adaptive test" might be required to make it

Figure 5-1 Computer-Adaptive Test Examinee Map


practical and acceptable to examinees. The goal was to determine which conditions, if any, make a difference in examinee performance. Students were randomly assigned to a combination of test conditions. This caused the number of examinees included in each analysis to vary. Each study, however, included a reasonable number of examinees, comparable to typical computer-adaptive test studies. The test conditions were transparent to the examinee, with the exception of the "review" condition, which required special instructions. Analysis of covariance, with the written test as a covariate, was performed for each of the experimental conditions.

Unidimensionality

The first condition related to unidimensionality. The certification board outlines the domain of practice that must be demonstrated by the examinee. The domain breaks down into logical subsets for purposes of education and evaluation. A student must be able to demonstrate proficiency across the domain. Thus the activities in the six subtests are related conceptually, as well as in practice, so that they must be tested using a single certification measurement instrument. It is the belief of this certification board and of those who practice in this field that the subtest areas are part of a single dimension. Students must demonstrate competence across subtests, even though some variance in their performance among subtests is expected. The performance of examinees is positively correlated across subtests. The correlations are highly significant and range between .20 and .60. The subtests had statistically comparable mean item difficulties (df = 5, F = 1.36, P = .24), standard deviations, and ranges, so that adaptive tests with comparable content coverage could be constructed for examinees with differing ability levels (see Table 5-1). For 645 students, pass/fail decisions were based on the total test measure, while for the other 432 students, pass/fail decisions were made for each subtest. Table 5-2 shows the results of the comparison of examinee performance when decisions were made by subtest or total test. There was no significant difference in mean performance (df = 1, F = 1.43, P = .23). Table 5-3 shows the percentage of examinees passing each subtest when decisions were made by subtest and total measure. The overall pass rate is about 4 percent higher when the decision is based on total test performance. The remaining conditions are reported only for examinees for whom decisions were made on the total test measure (N = 645).
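The adjusted means reported in parentheses in Tables 5-2 through 5-6 come from the analysis of covariance described above. The chapter does not print the adjustment itself; as a reminder, and assuming the usual textbook form with a single covariate (the notation is introduced here for illustration and is not the authors'), the adjusted mean for condition j can be written as:

```latex
\bar{Y}_j^{\mathrm{adj}} = \bar{Y}_j - b_w \left( \bar{X}_j - \bar{X} \right)
```

In this sketch, \bar{Y}_j is the mean computer-adaptive measure in condition j, \bar{X}_j is that group's mean on the written test, \bar{X} is the written-test mean over all groups, and b_w is the pooled within-group slope of the adaptive measure on the written test. A group that happened to draw slightly abler examinees on the written test has its mean pulled down, and a slightly weaker group has its mean pulled up, which is why the adjusted means differ a little from the raw means.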

Table 5-2 Comparison of Examinee Measures When Total Test or Subtest Performance Is the Criterion for Pass/Fail Decisions

                 Decision Total Test    Decision by Subtest

N examinees      645                    432
Mean ability     .230 (.224)*           .191 (.196)*
SD               .57                    .46

df = 1    F = 1.43    P = .232

Reported in logits
*Adjusted means based on covariate analysis

Targeted Level of Test Difficulty

Psychometricians postulate that a 50 percent probability of a correct response provides the best measurement of ability. Most written tests are, in fact, targeted to a 70 percent or even higher probability of correct response. The concerns are (a) how do students, accustomed to getting high scores, react to harder tests; and (b) can the item bank provide an efficient test at a specifically targeted level of difficulty across student ability levels. Students were randomly assigned to test conditions for 50 percent, 60 percent, and 70 percent probability of a correct response. Table 5-4 shows the results of controlling the probability of a correct

Table 5-3 Comparison of Percentage of Examinees Passing Each Subtest When Total Measure or Subtest Measure Is the Criterion

Subtest          Decision by Total          Decision by Subtest
                 % Examinees Passing        % Examinees Passing

Microbiology     49                         49
Blood Banking    61                         59
Chemistry        54                         54
Hematology       53                         53
Body Fluids      52                         49
Immunology       48                         46

Total            56                         52

Table 5-4 Comparison of Examinee Measures Based on Targeting Condition

                 Probability of a Correct Response
                 50%              60%              70%

N examinees      201              232              212
Mean ability     .284 (.238)*     .168 (.224)*     .246 (.236)*
SD               .525             .558             .622

df = 2    F = .08    P = .926

Reported in logits
*Adjusted means based on covariate analysis

response. There was no significant difference in examinee performance due to controlled probability of a correct response (df = 2, F = .08, P = .926). These results suggest that computer-adaptive tests can be targeted at 50%, 60%, or 70% probability of a correct response without affecting examinee performance. Targeting to 60% or 70% may provide a psychological advantage for the examinee. It may also be useful for certification boards that have existing item banks created for easier paper-and-pencil tests. For further details on altering test difficulty, see Bergstrom, Lunz, and Gershon (1992).

Minimum Test Length

A third condition was designed to address test length. Content experts often feel that long tests are necessary to cover the field. However, the principles of sampling suggest that well-targeted items will yield comparable results. Most examinees (79%) were allowed to stop after 50 items if a pass/fail decision with 90 percent confidence could be made. Some examinees (21%) were placed in a "long" test condition that required a minimum of 100 items even if a decision with 90 percent confidence could have been made with fewer items. Tests varied in length depending upon the performance of the examinee and the test length condition. Table 5-5 shows the results of examinee performance by minimum test length. Although the group means are not significantly different (df = 1, F = .82, P = .366), those examinees in the shorter minimum test condition performed slightly better.
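Both the targeting conditions and the test-length condition above turn on how much information each item contributes. As a minimal sketch, assuming the dichotomous Rasch model (the function names and the illustrative ability value are hypothetical, not taken from the study's software), the item difficulty that yields a chosen probability of success, and the information such an item provides, can be computed as:

```python
import math


def rasch_probability(ability, difficulty):
    """Rasch probability of a correct response; both arguments are in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def targeted_difficulty(ability, p_target):
    """Difficulty at which an examinee of this ability succeeds with probability p_target."""
    return ability - math.log(p_target / (1.0 - p_target))


def item_information(p):
    """Information from one dichotomous Rasch item answered correctly with probability p."""
    return p * (1.0 - p)


# Illustrative only: an examinee whose current estimate sits at the pass point.
ability = 0.15
for p_target in (0.50, 0.60, 0.70):
    d = targeted_difficulty(ability, p_target)
    info = item_information(p_target)
    print(f"target {p_target:.0%}: difficulty {d:+.2f} logits, information {info:.2f}")
```

Under this model, per-item information drops only from .25 at 50 percent to .24 at 60 percent and .21 at 70 percent, so easier targeting costs some efficiency (more items for the same standard error) without changing what is being measured.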

Table 5-5 Comparison of Examinee Measures Based on Minimum Test Length

                 Min L = 50       Min L = 100

N examinees      428              217
Mean ability     .262 (.230)*     .167 (.199)*
SD               .580             .549

df = 1    F = .82    P = .366

Reported in logits
*Adjusted means based on covariate analysis

Opportunity to Review

Examinees often argue that they have the "right" to review their tests, and, indeed, have been trained to do so. Psychometricians argue that allowing examinees to change responses in a computer-adaptive test decreases the information value of each item and therefore increases the error of measurement. A fourth condition involved the ability of examinees to review their test and alter responses. Examinees randomly placed in the review condition were required to answer items when they were presented but were allowed to review and change responses after they completed the test. The other examinees (nonreview condition) were not allowed to review items and alter responses. Table 5-6 shows the comparison of examinee measures for the review and nonreview conditions. There was no significant difference in mean examinee performance (df = 1, F = .80, P = .37), although examinees who were allowed to review had slightly higher mean measures. No examinee changed status from pass to fail or fail to pass as a result of changing responses. Some responses were changed from wrong to right, while others were changed from right to wrong or wrong to wrong. The psychometric issues involving review are discussed in Lunz, Bergstrom, and Wright (1991).

Table 5-6 Comparison of Ability Measures Based on Review and Nonreview Test Conditions

                 Review           Nonreview

N examinees      109              536
Mean ability     .253 (.258)*     .225 (.220)*
SD               .546             .576

df = 1    F = .80    P = .37

Reported in logits
*Adjusted means based on covariate analysis

Reliability of Alternate Test Forms

A fifth condition involved reliability of alternate test forms. One assumption of computer-adaptive testing is that comparable decisions will be made even though examinees are tested with different items, because all tests are equated to the same scale. Some examinees were placed in a condition that forced them to take two tests, one immediately following the other, without a break. In fact, the examinees did not know they were taking two unique tests. A detailed report of the results follows in Chapter 6, "Reliability of Alternate Computer-Adaptive Tests."

REFERENCES

Bergstrom, B.A., Lunz, M.E., & Gershon, R.C. (1992). Altering the level of difficulty in computer adaptive tests. Applied Measurement in Education, 5(4), 137-149.

Gershon, R.C. (1989). CAT ADMINISTRATOR [Computer program]. Chicago: Micro Connections.

Kingsbury, G.G., & Houser, R.L. (1990, March). Assessing the utility of item response models: Computerized adaptive testing. Paper presented to the Annual Meeting of the National Council on Measurement in Education, Boston.

Lord, F.M. (1983). Small N justifies Rasch model. In D.J. Weiss (Ed.), New horizons in testing. New York: Academic Press.

Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Lunz, M.E., & Bergstrom, B.A. (1991). Comparability of decision for computer adaptive and written examinations. Journal of Allied Health, 20(1), 15-23.

Lunz, M.E., Bergstrom, B.A., & Wright, B.D. (1992). The effect of review on student ability and test efficiency for computer adaptive tests. Applied Psychological Measurement, 16(1), 33-40.

McKinley, R.L., & Reckase, M.D. (1980). Computer applications to ability testing. Association for Educational Data Systems Journal, 13, 193-203.

Maurelli, V.A., & Weiss, D.J. (1983). Factors influencing the psychometric characteristics of an adaptive testing strategy for test batteries (Research Report 81-4). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program, Computerized Adaptive Testing Laboratory.

[...] equating of paper-administered, computer-administered and computerized adaptive tests of achievement. Paper presented at the American Educational Research Association Meeting, San Francisco.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original work published 1960.)

Wainer, H. (1983). Are we correcting for guessing in the wrong direction? In D.J. Weiss (Ed.), New horizons in testing. New York: Academic Press.

Weiss, D.J. (1983). New horizons in testing: Latent trait test theory and computerized adaptive testing. New York: Academic Press.

Weiss, D.J. (1985). Final report: Computerized adaptive measurement of achievement and ability (Project NR150-433, N00014-79-CO172). Minneapolis: University of Minnesota.

Weiss, D.J., & Kingsbury, G.G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21(4), 361-375.

Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.

Wright, B.D., & Stone, M.H. (1979). Best test design. Chicago: MESA Press.

chapter 6

Reliability of Alternate Computer-Adaptive Tests

Mary E. Lunz
American Society of Clinical Pathologists

Betty A. Bergstrom
Computer Adaptive Technologies

Benjamin D. Wright
University of Chicago

When items are IRT calibrated, ability estimation can be independent of the particular items used for measuring (Rasch, 1960/1980; Wright, 1968, 1977). Thus, when all items are calibrated on the same scale, statistically equivalent person measures should result from alternate computer-adaptive tests, regardless of which particular items are administered on each test. This is an essential requirement for successful computer-adaptive testing. If the adaptive item selection algorithm is working properly, and the person has not altered significantly in ability, the mean difficulty of the items presented to that examinee on each test should be statistically equivalent. When the items for two computer-adaptive tests are selected from the same item bank, use the same test specifications, and are tailored to the same examinee ability, the two tests should be weakly parallel (Boekkooi-Timminga, 1990). For high-stakes testing, such as certification, where decisions are often permanent, the alternate forms reliability of computer-adaptive tests must be demonstrated prior to implementation of computer-adaptive strategies, since all examinees will take different and uniquely tailored tests.


The traditional index of test performance, reliability, can be applied to alternate computer-adaptive tests. The Standards for Educational and Psychological Testing (1985) state that the goal of reliability is to estimate the consistency of scores on alternate tests constructed to defined test specifications. Allen and Yen (1979) define alternate tests as any two test forms that have been constructed to be parallel in content and that also have similar observed score means and variances for equivalent samples. They also state that a correlation between observed scores on alternate forms will produce a good estimate of test reliability when the alternate forms are parallel. While this assumes fixed-length written tests, the basic principle seems applicable to computer-adaptive tests.

Reliability between alternate computer-adaptive tests was addressed by Martin, McBride, and Weiss (1983). Scores on two alternate fixed-length forms of adaptive tests correlated at .90 after 30 items were administered. Kingsbury and Weiss (1980) found that alternate forms of a computer-adaptive test resulted in more reliable scores than alternate forms of a traditional pencil-and-paper test (correlations .92 and .88, respectively).

Any subset of items selected adaptively from a calibrated item bank constitutes a test form and should produce statistically equivalent ability measures for an examinee of a given ability (Wright, 1977). Alternate computer-adaptive tests contain different items, but when administered sequentially to the same examinee, they should produce statistically equivalent ability estimates. They should function in parallel because both sets of items are tailored to the same examinee ability using the same test plan.

The purpose of this study is to determine the reliability of alternate test forms administered adaptively. Reliability will be assessed by comparing estimates of examinee ability and pass/fail decisions on alternate computer-adaptive tests.

METHOD

The computer-adaptive testing model used was designed to determine a person's estimated ability level with respect to a preestablished criterion. An alternate test was presented automatically for examinees who were randomly placed in the total test and alternate forms test conditions. One hundred forty-two examinees were placed in this combination of conditions. These 142 examinees took sequential computer-adaptive tests; however, they were not aware that they were taking two separate tests, because the second test began as soon as the first test was completed.

The alternate tests were constructed by the CAT ADMINISTRATOR program (Gershon, 1989), using the same test plan, starting point, and stopping rule. Each test was tailored to the ability of the examinee. Items presented to an examinee on the first test were marked by the computer so they would not be administered to the same examinee on the alternate test. This slightly limited the items available for the second test. Examinees were required to answer each item before another item was presented. The opportunity to review or change answers at a later time was not available. Since the tests were sequential, there was no opportunity for examinees to study between tests. The only possible change in ability could come from the practice gained or the fatigue caused by taking the first test.

These data were analyzed with correlations and paired t-tests of examinee measures on the alternate tests. It was expected that the null hypothesis of no significant difference between examinee measures on the alternate tests would be confirmed. In addition, pass/fail decisions on the alternate tests were compared.
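The stopping rule shared by both tests is the one described in Chapter 5: stop once the measure is at least 1.3 standard errors from the pass point of .15 logits, subject to a minimum of 50 and a maximum of 240 items. A minimal sketch follows; the function name, the return format, and the handling of a measure that falls exactly on a boundary are assumptions made for illustration, not details of the CAT ADMINISTRATOR program.

```python
PASS_POINT = 0.15   # pass point in logits on the bank scale (Chapter 5)
Z_90 = 1.3          # SEM multiplier for 90% confidence, one-tailed
MIN_ITEMS = 50      # minimum test length
MAX_ITEMS = 240     # maximum test length


def stopping_decision(measure, sem, items_given,
                      min_items=MIN_ITEMS, max_items=MAX_ITEMS):
    """Return (stop, decision, clear) for the current ability estimate.

    clear is True when the measure is at least Z_90 * sem from the pass point.
    Assumption: a measure exactly at the pass point counts as a pass.
    """
    clear = abs(measure - PASS_POINT) >= Z_90 * sem
    decision = "pass" if measure >= PASS_POINT else "fail"
    if items_given < min_items:
        return (False, None, clear)      # keep testing until the minimum length
    if clear or items_given >= max_items:
        return (True, decision, clear)   # clear decision, or forced stop at the maximum
    return (False, None, clear)


# Illustrative calls with made-up measures and standard errors.
print(stopping_decision(0.60, 0.25, 50))    # well above the pass point at item 50
print(stopping_decision(0.10, 0.15, 240))   # borderline examinee at the maximum length
```

The third element of the result distinguishes decisions made with 90 percent confidence from those forced at the maximum length, which is the distinction the pass/fail analysis in this chapter relies on.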

RESULTS

Pass/Fail Consistency

Table 6-1 presents the pass/fail results for the alternate tests. Sixty-four examinees passed both computer-adaptive tests, while 56 examinees failed both computer-adaptive tests. This is an 85 percent consistency rate. Fifteen examinee measures were within 1.3 standard errors of measurement of the pass point for one or both tests. This means that the decision to pass or fail was made with less than 90 percent confidence in its accuracy. When the 15 examinees for whom decisions with 90 percent confidence could not be made were excluded, 94 percent of the examinees earned the same decision on the alternate tests.

Table 6-1 Pass/Fail Consistency, Alternate Computer-Adaptive Tests

All Examinees

                    Test 1
Test 2        Pass     Fail     Total

Pass          64       7        71
Fail          15       56       71
Total         79       63       142

Unclear decisions were made for 15 examinees: 3 = F/P, 12 = P/F

Examinees with Clear* Pass/Fail Decisions

                    Test 1
Test 2        Pass     Fail     Total

Pass          64       4        68
Fail          3        56       59
Total         67       60       127

*Clear decision = 90% confidence, 1.3 x SE above or below MPS

Comparison of Examinee Ability Measures

The observed correlation of the 142 pairs of examinee measures for the alternate tests was .79. When this correlation is corrected for measurement error it becomes .96. Table 6-2 gives summary statistics for examinee ability measures on test 1 and test 2. The mean difference in the 142 pairs of ability measures is -.03 logits. Results of a paired t-test indicate no significant differences between examinee measures on the alternate tests (t = .87, df = 141, p = .39). Figure 6-1 shows the plot of examinee measures on the alternate tests.

Table 6-2 Examinee Ability Summary Statistics

Statistic                 Mean*     SD*

Test 1
  Ability Measure         .19       .59
  Error of Measure        .23       .05

Test 2
  Ability Measure         .16       .57
  Error of Measure        .23       .06

*Reported in logits

Figure 6-1 Plot of Examinee Ability Measures on Alternate Computer-Adaptive Tests
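The two consistency rates quoted in the results follow directly from the cell counts in Table 6-1. A minimal sketch of the arithmetic (the function and variable names are hypothetical, chosen only for illustration):

```python
def consistency(pass_pass, fail_then_pass, pass_then_fail, fail_fail):
    """Return (proportion with the same decision on both tests, number of examinees)."""
    total = pass_pass + fail_then_pass + pass_then_fail + fail_fail
    return (pass_pass + fail_fail) / total, total


# Cell counts from Table 6-1: all examinees, then clear decisions only.
all_rate, all_n = consistency(64, 7, 15, 56)
clear_rate, clear_n = consistency(64, 4, 3, 56)

print(f"all examinees:   {all_n} pairs, {all_rate:.1%} consistent")
print(f"clear decisions: {clear_n} pairs, {clear_rate:.1%} consistent")
```

The same 120 examinees agree in both tables; excluding the 15 unclear decisions only shrinks the denominator from 142 to 127, which is what raises the rate from about 85 to about 94 percent.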


DISCUSSION

This study was designed to verify the reliability of examinee ability measures and pass/fail decisions when alternate tests were administered sequentially using a computer-adaptive algorithm that tailored items to examinee ability. The computer algorithm distributed the items according to the test plan on both alternate tests. The 142 pairs of alternate tests were evaluated based on content and comparability of item difficulties. The standard deviation of the ability measure difference (.38) is appropriate, given the mean measurement errors for test 1 (.23) and test 2 (.23). The disattenuated correlation is .96. These results confirm that the particular subset of items selected can vary and still produce statistically equivalent ability measures on alternate tests.

Certification boards frequently compile different written test forms for each test administration and assume that the decision to pass or fail has a comparable meaning as long as the tests are equated and the same test plan is implemented. Test specifications confirm the content validity of each test form (see Lunz & Stahl, 1989). The adaptive algorithm implemented the test specifications in addition to presenting items tailored to each examinee so that the maximum information about the examinee was gained from each item in each content area.

The alternate tests varied in length and order of subtest presentation. This, however, did not alter the final decision for 94 percent of the examinees who earned clear (90 percent confidence) pass/fail decisions on both tests. The first tests averaged 72 items (SD = 23), while the second tests averaged 94 items (SD = 53). The number of items included on the second test was higher, on average, because the items that provided the most information about the examinee were presented on the first test. Since less information was gained from each item, more items were required to reach the same level of confidence in the decision.
More examinees passed test 1 and failed test 2. These examinees, however, had earned an unclear decision (less than 90 percent confidence) on test 1. Several examinees in the alternate forms condition took as many as 400 items because their ability measure was close to the pass point on both tests. This certainly challenged the depth of the item bank within each content area. A larger item bank would have provided better targeted alternate tests for these borderline examinees.

Shorter tests, made possible by tailoring to the ability of the examinee, are an asset for both the certification board and the examinee as long as there is evidence that decisions are reliable. The results of this study provide evidence of the reliability of alternate computer-adaptive tests by documenting the consistency of pass/fail decisions and the comparability of the examinee ability measures.
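The two numerical claims in the Discussion can be checked roughly from the rounded values in Table 6-2. The sketch below assumes (a) independent measurement errors, so that error alone gives a difference with standard deviation equal to the root sum of squares of the two standard errors, and (b) a reliability of 1 - SE²/SD² for each test, used in the classical correction for attenuation. The chapter does not state which reliability estimates the authors used, so the function names and assumption (b) are illustrative rather than the authors' procedure.

```python
import math


def expected_sd_of_difference(se1, se2):
    """SD of the difference between two measures if only measurement error separated them."""
    return math.sqrt(se1 ** 2 + se2 ** 2)


def reliability(sd_observed, se_mean):
    """Assumed reliability: 1 minus error variance over observed variance."""
    return 1.0 - se_mean ** 2 / sd_observed ** 2


def disattenuate(r_observed, rel1, rel2):
    """Classical correction of a correlation for unreliability in both measures."""
    return r_observed / math.sqrt(rel1 * rel2)


# Rounded values from Table 6-2 and the reported observed correlation.
sd_error_only = expected_sd_of_difference(0.23, 0.23)
rel_test1 = reliability(0.59, 0.23)
rel_test2 = reliability(0.57, 0.23)
r_corrected = disattenuate(0.79, rel_test1, rel_test2)

print(f"error-only SD of difference: {sd_error_only:.2f}")
print(f"reliabilities: {rel_test1:.2f}, {rel_test2:.2f}")
print(f"disattenuated correlation: {r_corrected:.2f}")
```

Under these assumptions, measurement error alone predicts a difference SD of about .33, a little below the observed .38, and the corrected correlation comes out near .94, in the neighborhood of the reported .96. A gap of that size is unsurprising given the rounded inputs and the unstated reliability estimates.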

REFERENCES

Allen, M.J., & Yen, W.M. (1979). Introduction to measurement theory. Belmont, CA: Wadsworth.

Boekkooi-Timminga, E. (1990). The construction of parallel tests from IRT based item banks. Journal of Educational Statistics, 15(2), 129-145.

Gershon, R.C. (1989). CAT ADMINISTRATOR [Computer program]. Chicago: Micro Connections.

Kingsbury, G.G., & Weiss, D.J. (1980). An alternate-forms reliability and concurrent validity comparison of Bayesian and adaptive and conventional ability tests (Research Report 80-5). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program, Computerized Adaptive Testing Laboratory.

Lunz, M.E., & Stahl, J.A. (1989). Content validity revisited: Transforming job analysis data into test specifications. Evaluation and the Health Professions, 12, 192-206.

Martin, J.T., McBride, J.R., & Weiss, D.J. (1983). Reliability and validity of adaptive and conventional tests in a military recruit population (Research Report 83-1). Minneapolis: University of Minnesota.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original work published 1960.)

Standards for educational and psychological testing. (1985). Washington, DC: American Psychological Association.

Wright, B.D. (1968). Sample free calibration and person measurement. Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service.

Wright, B.D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.

chapter 7

The Equivalence of Rasch Item Calibrations and Ability Estimates Across Modes of Administration

Betty A. Bergstrom
Computer Adaptive Technologies

Mary E. Lunz
American Society of Clinical Pathologists Board of Registry

In order for an item to be used efficiently in a computer-adaptive algorithm, it must be precalibrated using a latent trait model, such as the Rasch model, which orders items from easy to difficult. This can be accomplished with data from a previous pencil-and-paper administration or data from a previous computer-adaptive administration. Many organizations have item pools calibrated from previous pencil-and-paper administrations. However, the use of these calibrations for a computer-adaptive test needs careful consideration. Since the mode of administration is different, there is a possibility that items are somehow "different" when presented on a computer instead of on a piece of paper. If items are different, pencil-and-paper calibrations may not be appropriate for a computer-adaptive test. In a computer-adaptive test each examinee takes a tailored test. Therefore, items are presented to examinees in different contexts and at different points during the test administration. Thus context effects and location effects will be


unique for each examinee. In a paper-and-pencil test, item location and context do not fluctuate. If the pencil-and-paper location and/or context affect the item calibration, the calibration may not be appropriate for a computer-adaptive test.

The possibility that item calibrations might change due to the mode of administration, namely, conventional paper-and-pencil vs. computer-adaptive, has been discussed by several researchers (Kingsbury & Houser, 1989; Wise, Barnes, Harvey, & Plake, 1989). Green, Bock, Humphreys, Linn, and Reckase (1984) suggest several possible problems that might arise when items for a computer-adaptive test are calibrated using data from a paper-and-pencil test. An overall shift might occur, such that all items become easier or harder, or an "item-by-mode interaction" might occur, where some, but not all, item parameters change. They postulate that items with diagrams or many lines of text may be most vulnerable to an item-by-mode interaction.

Context effects have been addressed by Kingston and Dorans (1984). They note that the appropriateness of IRT equating based on precalibration requires that changes in the position of items between the preoperational calibration and the operational administrations of the test have no effect on item parameter estimates. They found some types of complex items, especially those that require extensive instructions, to be particularly sensitive to location effects and thus possibly unsuitable for computer-adaptive administration. Yen (1980) also found item characteristics to be affected by the sequence in which items were administered.

One of the consequences of targeting items to the ability level of the examinee is that examinees of different ability levels may be presented with items in different difficulty order. Folk (1990) points out that a high-ability examinee will generally answer the initial items on a computer-adaptive test correctly and then will receive more difficult items. This results in his or her test being structured from easy to hard. A low-ability examinee will answer fewer initial items correctly, which results in his or her test being structured from hard to easy. However, Folk found that the administration of items in different orders did not substantially affect the performance of low- or high-ability examinees.

Other potential problems in precalibrating items with a pencil-and-paper test for computer-adaptive administration have been addressed by Wainer and Kiely (1987). One of these is the differential effect of cross information encountered in computer-adaptive testing. If a paper-and-pencil item provides a cue for another item, all examinees receive the same cue. With a computer-adaptive test, examinees are administered different items, and items are ordered differently. If an


item calibration is influenced by a cueing effect in a pencil-and-paper administration, it may be invalid for the computer-adaptive administration. They also point out that one of the virtues of computer-adaptive testing, short test length, may become problematic if item calibrations are unstable. Since the shorter test lacks the redundancy of a conventional test, it will be more vulnerable to idiosyncrasies of item performance.

If items have not been precalibrated, an initial pencil-and-paper administration may be most practical. In this case, the size and composition of the sample needed for precalibration of items must be considered. It has been suggested that the sample include a minimum of 1,000 respondents and be comparable to the target population (Rudner, 1989; Green et al., 1984). However, it may be difficult to amass a comparable sample this large in areas such as professional certification.

The purpose of this chapter is to explore two related issues to determine whether item calibrations from conventional pencil-and-paper tests are appropriate for use in this particular application of computer-adaptive testing. The first issue is the equivalence of item calibrations from paper-and-pencil and computer-adaptive administrations. The second issue is the equivalence of examinee ability measures when item calibrations from paper-and-pencil tests versus item calibrations from computer-adaptive tests are used for the tailoring algorithm.

METHOD

Precalibration

Three hundred and twenty-one medical technology students from 57 educational (training) programs across the country provided data for the precalibration of items. To participate, students had to be eligible to take the first semiannual administration of the related certification examination. Each student took one of four different forms of a 200-item conventional pencil-and-paper test. Each form included a subset of common items for equating so that all forms could be placed on the same scale. Form 1 was taken by 73 students, Form 2 by 86 students, Form 3 by 71 students, and Form 4 by 91 students. Each of the four forms was calibrated by the Rasch model program MSCALE (Wright, Congdon, & Schultz, 1987). The forms were equated using common item equating (Wright & Stone, 1979). The items were evaluated for fit to the model


and misfitting items were deleted. This established pencil-and-paper (PAP) item calibrations for a bank of 726 items. CAT Administration Useable data from the computer-adaptive test administration was obtained from 1,077 students from 238 medical technology programs across the country. To participate, students had to be eligible to take the second semiannual administration of the related certification examination. A detailed description of the computer adaptive testing model used in this study is given in Chapter 5. Recalibration from CAT Administration To determine the equivalence of item calibrations, and to determine whether shifts in item calibration affect examinee measures, the response data from the computer-adaptive test administration were recalibrated. Each computer adaptive test yielded an examinee response string. While the entire item pool consisted of 726 items, each examinee response string contained responses from between 50 items (minimum test length) to 240 items (maximum test length). Each item had a unique identifying number. Response strings from all examinees were appended, resulting in a file containing a 1,077 (examinee) by 726 (item) matrix, with missing data for all items not presented to particular examinee. The l,077-by-726 response matrix was analyzed with BIGSCALE (Wright, Linacre, & Schultz, 1990) a Rasch program that processes large data sets t h a t have missing data. This procedure produced a new set of item calibrations and a new set of examinee measures based upon responses from the CAT administration. The mean number of examinees per item calibration on the CAT was 146.45, with a standard deviation of 77.79. The minimum number of examinees to calibrate an item in the CAT administration was 13; the maximum number of examinees to calibrate an item was 348. Items with calibrations between - 1 and 1 logits were administered more frequently t h a n items with lower or higher precalibrations. 
Thus the number of examinees used to calibrate each item from the CAT administration data varied considerably. The paper-and-pencil calibration of the 726 items, and the computer-adaptive test calibration of the 726 items, were compared. Then the 1,077 examinee measures obtained from each calibration were compared.
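The assembly of the sparse examinee-by-item matrix described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (one dictionary of item number to scored response per examinee), the function names, and the toy data are hypothetical, and it does not reproduce the BIGSCALE input format, which is not described here.

```python
from statistics import mean, pstdev

def build_response_matrix(response_strings, n_items):
    """Stack per-examinee CAT records into an examinee-by-item matrix.

    response_strings: one dict per examinee mapping a 0-based item number
    (hypothetical layout) to a scored response (1 = correct, 0 = incorrect).
    Items never presented to an examinee are stored as None (missing data).
    """
    matrix = []
    for responses in response_strings:
        row = [None] * n_items
        for item_id, score in responses.items():
            row[item_id] = score
        matrix.append(row)
    return matrix

def examinees_per_item(matrix):
    """Count, for each item, how many examinees actually responded to it."""
    n_items = len(matrix[0])
    return [sum(row[i] is not None for row in matrix) for i in range(n_items)]

# Toy data: 3 examinees drawing different items from a pool of 5.
strings = [{0: 1, 2: 0, 3: 1}, {1: 1, 2: 1}, {2: 0, 4: 0, 0: 1}]
matrix = build_response_matrix(strings, n_items=5)
counts = examinees_per_item(matrix)
print(counts)                      # [2, 1, 3, 1, 1]
print(mean(counts), pstdev(counts))
```

Because each examinee sees only part of the pool, the per-item counts vary, which is the same effect reported for the 726-item pool (13 to 348 examinees per item).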


BERGSTROM & LUNZ

RESULTS

Comparison of Item Calibrations

The mean for the PAP calibration was -0.02, with a standard deviation of 1.00. The mean for the CAT calibration was 0.00 (BIGSCALE mean-centers the items), with a standard deviation of 1.22. Two types of shift occurred. The first is an overall shift, indicated by a difference in the standard deviation of the PAP calibration compared to the standard deviation of the CAT calibration. The spread of the CAT calibration (S.D. = 1.22) is wider than the spread of the PAP calibration (S.D. = 1.00). The second type of shift occurred with specific items. After the distribution of the CAT calibration is adjusted for differences in the mean and standard deviation, some item calibrations still shift and the order of item difficulty is altered. The correlation between the PAP item calibrations and the CAT item calibrations was .90 (.95 when disattenuated). A few items calibrate as more difficult on the CAT calibration than they did originally on the PAP calibration, and a few items calibrate as less difficult on the CAT calibration. The shifts from the PAP calibration (small sample) to the CAT calibration (varying sample per item) may be due to the mode of administration or to item bias (a difference in the intent or preparation between the PAP sample population and the CAT sample population). For example, of the seven items with the largest shifts in the direction of easier on the CAT calibration, five were from the same content area, indicating possible differential preparation between the two sample populations.

Comparison of Ability Measure Estimates

For examinees who took the computer-adaptive test, ability measures, based on estimates obtained from the PAP calibration, were compared with estimates made from the CAT calibration. The mean ability measure calculated with the PAP calibration was .24, with a standard deviation of .53. The mean ability estimate calculated with the CAT calibration was .25, with a standard deviation of .50.
The mean logit difference between ability estimates was -.01, and the standard deviation of the differences was .07. The correlation of the examinee measures obtained from the PAP item calibrations, and the examinee measures obtained from the CAT item calibrations, was .99. Thus there is no difference between the


examinee measures obtained due to the mode of data collection for item calibrations.

DISCUSSION

In this study, even though the item calibrations were obtained from a pencil-and-paper administration with relatively few participants, most of the Rasch item calibrations remained stable when calibrated from the computer-adaptive administration. The results demonstrate that, for these data, the item calibrations from a pencil-and-paper administration can be used for computer-adaptive tests. The item calibrations were equivalent, given varying numbers of examinees, different contexts, and varying modes of administration. The PAP calibrations used a sample of examinees of varying ability levels, so each item was calibrated from a range of examinee abilities. Items on the computer-adaptive administration were targeted to the examinee's ability, so the CAT calibrations were based on a smaller range of examinee ability levels.

Two types of shifts occurred in the item calibrations. The first type, an overall shift in mean and standard deviation, can be corrected by using an equating transformation. The second type of shift, a shift in the calibration of certain items, is potentially much more problematic, because examinees take different items. This means that when some items shift, examinees are differentially affected depending upon how many of the shifted items are presented to them. The examinee measure correlation of .99 indicates that even though a small percentage of the item calibrations shift, the examinee measures are not affected. No examinee measure differed beyond the variance expected due to error of measurement. However, if shift in item calibration is a concern, the items can be identified and revised or discarded from subsequent CAT administrations. Of course, the item pool must be continually monitored for drift, validity, and quality of item content whether tests are administered in a paper-and-pencil or computer-adaptive mode.
The examinee measures, however, can be considered valid even if it is necessary to reevaluate some items.

REFERENCES

Folk, V.G. (1990, April). Adaptive testing and item difficulty order effects. Paper presented at the annual meeting of the American Educational Research Association, Boston.
Green, B.F., Bock, R.D., Humphreys, L.G., Linn, R.L., & Reckase, M.D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21(4), 347-360.
Kingsbury, G.G., & Houser, R. (1989, March). Assessing the impact of using item parameter estimates obtained from paper-and-pencil testing for computerized adaptive testing. Paper presented to the annual meeting of the National Council of Measurement in Education, San Francisco.
Kingston, N.M., & Dorans, J.J. (1984). Item location effects and their implications for IRT equating and adaptive testing. Applied Psychological Measurement, 8(2), 147-154.
Rudner, L.M. (1989). Notes from ERIC/TM. Journal of Educational Measurement Issues and Practice, 8(4), 25-26.
Wainer, H., & Kiely, G. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24(3), 185-201.
Wise, S.L., Barnes, L.B., Harvey, A.L., & Plake, B.S. (1989). Effects of computer anxiety and computer experience on the computer-based achievement test performance of college students. Applied Measurement in Education, 2, 235-241.
Wright, B.D., Congdon, R., & Shultz, M. (1987). MSCALE [Computer program]. Chicago: MESA Press.
Wright, B.D., Linacre, J.M., & Schultz, M. (1990). BIGSCALE [Computer program]. Chicago: MESA Press.
Wright, B.D., & Stone, M.H. (1979). Best test design. Chicago: MESA Press.
Yen, W.M. (1980). The extent, causes and importance of context effects on item parameters for two latent trait models. Journal of Educational Measurement, 17(4), 297-311.

chapter

8

Constructing Measurement with a Many-Facet Rasch Model

John Michael Linacre

MESA Psychometric Laboratory Department of Education University of Chicago

SUBJECTIVE VERSUS OBJECTIVE TESTS

The rush to objective multiple-choice question (MCQ) tests in the 1920s was driven by dissatisfaction with subjective judge-rated tests. Objective tests were intended to control intrusions of undesirable variance into subjective test scores. But, in the 1980s, the testing community began to realize that what is needed is not objective testing but rather objective measurement. The reevaluation of subjective tests in the light of objective measurement opens a new field of possibilities. Ruch (1929) summarized the drawbacks to subjective tests:

1. Subjectivity of scoring lowers reliability.
2. Sampling must be limited to a small number of broad questions.
3. Time required to write lengthy answers is excessive.
4. These examinations encourage bluffing.

His first drawback is our primary concern here. The importance of his last three drawbacks depends on the intention, construction, and application of the subjective test. Indeed, one of the documented drawbacks to MCQ testing is the success of test-taking strategies, which are equivalent to bluffing, in increasing students' performance without increasing their achievement (Haladyna, Nolen, & Haas, 1991).

Of course, subjective tests have remained in use. The example considered here is a selection examination for admission to a graduate program. Nineteen members of the admissions committee, the judges, rated 100 examinees on 14 items of competency using a five-point rating scale. Each examinee was rated by three, four, or five judges. Judges assigned ratings only when there was sufficient information to make a judgment. Consequently, not all judges awarded 14 ratings to each examinee that they rated. One judge rated 97 of the examinees. Another judge rated only one.

CONVENTIONAL ATTEMPTS TO MODEL JUDGING

Studies of scoring subjectivity have found that "there is as much variation among judges as to the value of each paper as there is variation among papers in the estimation of each judge" (Ruggles, 1911). But any difference among judges is a threat to fairness because raw score depends on which judge rates an examinee. Since differences in judge severity can account for as much variance in ratings as differences in examinee ability (Cason & Cason, 1984), an obvious and widely attempted correction for judge behavior is to deduct the mean value of all ratings given by a judge from his or her individual ratings in hope of obtaining a judge-free rating. This fails because:

1. All judges are required to rate all examinees on all items, a design that is impractical in any large-scale testing situation. Substituting partial sampling designs (Braun, 1988) lessens the judging load, but introduces daunting administrative requirements.
2. The stochastic aspect of the judging process remains unrecognized and unmanaged. Adjustments by averaging and subtracting do not control the effects of judge variation.
3. The nonlinearity of the initial rating scale is overlooked. Ratings originate on an ordinal, not an interval, scale.
   (a) The highest and lowest categories represent infinite ranges of performance above and below the intermediate categories.
   (b) The ranges of performance represented by intermediate categories depend on how their labels are interpreted by judges. The intervals are never equal.
4. Judge idiosyncrasies are undiagnosed and uncontrolled. This means that the validity of the examination is unknown.
5. Measures for examinees, which are statistically independent of the local details of the examination and hence generalizable beyond the examination, cannot be produced.
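The first of these failures can be made concrete with a small sketch of the mean-deduction correction applied to an incomplete judging design. The data and the function name `judge_centered` are hypothetical and chosen only for illustration: two judges are assumed to be equally severe, but one happens to rate abler examinees.

```python
from collections import defaultdict

def judge_centered(ratings):
    """Deduct each judge's mean rating from that judge's individual ratings.

    ratings: list of (judge, examinee, rating) tuples.
    Returns a list of (judge, examinee, adjusted rating) tuples.
    """
    by_judge = defaultdict(list)
    for judge, _, rating in ratings:
        by_judge[judge].append(rating)
    judge_mean = {j: sum(v) / len(v) for j, v in by_judge.items()}
    return [(j, e, r - judge_mean[j]) for j, e, r in ratings]

# Hypothetical incomplete design: judge X sees only strong performances,
# judge Y only weak ones, and no examinee is seen by both judges.
ratings = [("X", "e1", 4), ("X", "e2", 5), ("Y", "e3", 2), ("Y", "e4", 3)]
adjusted = judge_centered(ratings)
for row in adjusted:
    print(row)
```

After the correction, examinee e1 (rated 4) and examinee e3 (rated 2) receive the same adjusted value, because the difference in the ability of the examinees each judge happened to see has been absorbed into the judge means. The correction is only interpretable when every judge rates the same examinees.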

Attempts have been made to overcome these problems through nonlinear transformation of the responses combined with conventional approaches to modelling error (De Gruiter, 1984; Cason & Cason, 1984), but they have not been reported to succeed.

THE MANY-FACET RASCH MODEL

These obstacles can be overcome with a many-facet Rasch model. The specifications underlying the two-facet Rasch model can be extended to tests of many facets (Linacre, 1989). These specifications are:

1. The impact of each element of each facet on the test situation is dominated by a single parameter with a value independent of all other parameters within the frame of reference. (Single parameterization is necessary if examinees are to be arranged in one order of merit, or items indexed by difficulty on an item bank.)
2. These parameters combine additively: they share one linear scale.
3. The estimate of any parameter is dependent on the accumulation of all ratings in which it participates but is independent of the particular values of any of those ratings.

These specifications are the necessary and sufficient requirements for constructing a linear measurement system from any observed data. The degree to which this construction is useful and valid is measured by statistics quantifying the fit of the data to the measurement model (Wright & Masters, 1982). A many-facet Rasch model for the admission examination is:

\[
\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k \qquad (1)
\]

where
P_{nijk} is the probability of examinee n being awarded a rating in category k on item i by judge j, and P_{nij(k-1)} is the probability of a rating in category k-1,
Bn is the ability of examinee n, where n = 1, ..., 100,
Di is the difficulty of item i, where i = 1, ..., 14,
Cj is the severity of judge j, where j = 1, ..., 19,
Fk is the difficulty of the step up from category k-1 to category k, and k = 2, ..., 5.

Figure 8-1. Conventional and measurement perspectives on rating scales

Each examinee is represented by one parameter, Bn, which corresponds to the ability measure of the examinee on a linear continuum. Larger measures indicate greater ability. The difficulty of a successful performance on an item is parameterized by one parameter, Di, which is a measure on the same continuum as that of examinee ability. Thus the probability of a successful performance increases as either the examinee ability increases or the item difficulty decreases. Other elements also intervene. The assignment of ratings is mediated through a judge. Each judge is identified by one parameter, Cj, in the same linear measurement system. A more severe judge, with a larger measure, is less likely to award a high rating than a lenient judge with a smaller measure. Finally, the step structure of the rating scale must also be parameterized. As Figure 8-1 illustrates, the fact that the categories are labelled 1 to 5 and printed uniformly spaced across the page seems to indicate that the levels of performance represented by the categories must be equally spaced and so can be analyzed as linear measures as they stand. Nevertheless, in reality, the rating categories themselves represent qualitatively distinct, but ordered, performance levels partitioning an infinite continuum of performance. The equal integer spacing of the category labels and their equally spaced printing invite the judge to devote equal attention to each of the alternatives. But the range of the performance level corresponding to each of the ordered categories can only be discovered empirically from how the judges behave. Moreover, since the number of rating categories is finite, the ranges corresponding to the extreme categories are always infinite, because there is conceptually no limit to how good or how bad a performance can be.
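How the four kinds of parameter combine into rating probabilities can be sketched in a few lines. The sketch assumes the usual adjacent-category log-odds form of the model, in which the log-odds of a rating in category k rather than category k-1 equal Bn - Di - Cj - Fk. The function names and the step difficulties used below are hypothetical illustration values, not estimates from the admission data.

```python
import math

def category_probabilities(b, d, c, steps):
    """Probabilities of each rating category for one examinee-item-judge encounter.

    b, d, c: examinee ability, item difficulty, judge severity (logits).
    steps:   step difficulties F_2..F_m (logits), one per step up the scale,
             so the scale has len(steps) + 1 categories.
    """
    logit = b - d - c
    # Log-numerator of the lowest category is 0; each step up adds (logit - F_k).
    log_num = [0.0]
    for f in steps:
        log_num.append(log_num[-1] + logit - f)
    top = max(log_num)                      # subtract the max for numerical stability
    num = [math.exp(v - top) for v in log_num]
    total = sum(num)
    return [v / total for v in num]

def expected_rating(probs, lowest_label=1):
    """Expected rating in category-label units (lowest category labelled 1)."""
    return sum((lowest_label + x) * p for x, p in enumerate(probs))

steps = [-1.5, -0.5, 0.5, 1.5]              # hypothetical F_2..F_5
probs = category_probabilities(b=0.0, d=0.0, c=0.0, steps=steps)
print([round(p, 3) for p in probs])
print(round(expected_rating(probs), 3))     # 3.0 for these symmetric steps
# A more severe judge (larger c) lowers the expected rating:
print(round(expected_rating(category_probabilities(0.0, 0.0, 1.0, steps)), 3))
```

Only the difference Bn - Di - Cj enters the computation, which is why the three facets share one linear scale; the unequal spacing of the categories is carried entirely by the step values.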
It is the functioning of the categories of the rating scale that defines the measures, not the arbitrary assignment of equal integer category labels. The labelling of the categories is a convenience for the management of the examination. What is needed for analysis is not the category label but the count of qualitatively higher levels of performance represented by the category. Thus the lowest category, usually labelled 1, corresponds to a step count of 0, while the category labelled 5 corresponds to a step count of 4. Equation (1) specifies the stochastic relationship between the ordered categories of the rating scale and the latent performance continuum. This relationship is an ogive that satisfies both the theoretical requirements for measurement and the functional form of the rating scale defined by the judges through their use of it. The unequal widths of the performance ranges corresponding to the intermediate categories are parameterized by the Fk terms. The infinite performance ranges at the extremes of the scale are mapped into the corresponding finite top and bottom categories.

A maximum likelihood estimate for each parameter is obtained when the expected marginal sum of the counts of the ratings in which the parameter participates is equal to the observed sum of counts. Missing ratings can be ignored in this estimation, as is done in the computer program FACETS (Linacre, 1988).

In Figure 8-2, the examinees, judges, and items of the admission examination have been measured on one common linear frame of reference. The expected scores (in rating points) are shown for examinees facing items of 0 logit difficulty and judges of 0 logit severity. Other expected scores are obtained by indexing the score scale at (examinee ability - judge severity - item difficulty) logits. An example of the ogival score-to-measure conversion is shown in Figure 8-3, where the average rating given an examinee on the admissions test has been mapped against examinee measure. The solid ogive traces the raw score to measure conversion that would have occurred if all judges had rated all examinees on all items. Each point X represents the conversion for an examinee. Its placement depends on which judges rate the examinee's performance. Examinee A has a higher average rating, but a lower measure than Examinee B, because A happened to be rated by more lenient judges than B. Most Xs are displaced below the solid ogive, because the most lenient judge rated only a few examinees.

FIT TO THE MODEL

Equation (1) specifies the stochastic structure of the data. The probability of a rating in any category is modelled explicitly. The modelled (expected) values of the error variance associated with each rating are

Figure 8-2. Results of a many-facet Rasch analysis

Figure 8-3. Average category labels for examinee performance plotted against estimated logit measures

explicit. This enables a detailed examination of the data for fit to the model. Not only too much, but also too little, observed error variance threatens the validity of the measurement process, and motivates investigation, diagnosis, and remediation of specific measurement problems. The relationships between the modelled error variances and the observed error variances (sums of squared residuals) are used as partial and global tests of fit of data to model (Wright & Panchapakesan, 1969; Windmeijer, 1990).

In conventional analysis, by contrast, any difference between an observed and an expected rating is blamed on a judge's unexplained and undesired error variance. The optimal error value is zero, but this can never be obtained in a nontrivial situation. Any amount greater than zero threatens validity. Thus, "the widespread use of such items in standardized tests depends on whether some degree of scoring error, however small, can be accepted" (Bennett, Ward, Rock, & Lahart, 1990). This error variance is often compared to the observed variance of a judge's ratings, leading to an uncontrolled comparison between the within-judge randomness of a judge's ratings and the between-examinee spread of the abilities of the examinees who happen to have been rated.

An example of Rasch fit statistics for four of the admission examination judges is shown in Table 8-1. Their severity measures (in log-odds units, logits) are about equal, but their measures have different standard errors. These indicate the precision or reliability of their measures. The size of these errors is chiefly determined by the number of ratings the judge made. The more ratings a judge makes, the more information there is with which to estimate a severity measure and so the smaller its standard error. Two fit statistics are reported, the mean-square and standardized forms of the Outfit statistic. Outfit is an acronym for "outlier-sensitive fit statistic," because its size is strongly influenced by single unexpectedly large residuals. Outfit is based on the ratio of observed error variance to modelled error variance. The ratio is computed on a rating-by-rating basis, and then averaged across all ratings in which the judge participated. The result is the mean of the ratios of squared observed residuals to modelled error variances. The mean-square outfit statistic is on a ratio scale with expectation 1 and range 0 to infinity. Its statistical significance is indicated by a standardized value with a modelled unit normal distribution. Since the success of the standardization is sample dependent, this value cannot be interpreted strictly in terms of the unit normal distribution, but must be evaluated in the light of the local situation.

In Table 8-1, Judges A and B have mean-square outfit statistics close to their expected values of 1, and standardized values close to their expectation of 0. Judge C, however, shows considerable misfit. His mean-square outfit of 1.4 indicates 40 percent more variance in his ratings than is modelled. The significance value of 3 indicates that this is rarely expected. Symptomatic of Judge C's behavior is the distribution of his ratings.
He awarded considerably more high and more low ratings than Judges A and B. This wider spread of ratings is unexpected in the light of the rating patterns of the other judges. Judge D, on the other hand, exhibits a muted rating pattern. His mean-square statistic of .7 indicates 30 percent less variance in his ratings than is modelled. The high significance of this is flagged by the standardized value of -6. Judge D's ratings show a preference for central categories. He reduces the rating scale to a dichotomy and so reduces the variance of his ratings. The fact that Judge D's ratings are more predictable than those of the other raters would be regarded as beneficial in a conventional analysis. In a Rasch analysis, however, Judge D's predictability implies that Judge D is not supplying as much independent information as the other judges on which to base the examinees' measures. Were Judge D perfectly predictable, always rating in the same category, he would supply no information concerning differences among examinees.

Table 8-1. Judge Measures and Fit Statistics

Judge      Examinees  Total    Mean    Severity  Model  Outfit       Outfit        % Frequency of Rating
           Rated      Ratings  Rating  Measure   Error  Mean-Square  Standardized  1    2    3    4    5
A          12         168      2.8     0.62      0.13   1.0          0             0    0    35   53   11
B          48         672      2.7     0.68      0.07   1.1          1             0    1    42   42   15
C (Noisy)  17         231      2.7     0.81      0.11   1.4          3             0    6    35   41   18
D (Muted)  73         1018     2.8     0.63      0.05   0.7          -6            0    0    31   61   7

A frequently used alternative to Outfit is Infit, an information-weighted fit statistic sensitive to unexpected patterns of small residuals. This is calculated from the ratio of the sum of all squared residuals to the sum of all modelled error variances for ratings in which the judge participated. For the judges shown in Table 8-1, the Outfit and Infit statistics are numerically identical. This is because the misfit for this data set is homogeneous across examinee ability levels. By contrast, lucky guessing and carelessness on MCQ items cause large outlying residuals that are detected by unexpected Outfit values, while alternative curricula lead to unexpected patterns of small residuals which are detected by Infit.
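The two mean-square statistics described above differ only in how the squared residuals are pooled. A minimal sketch, assuming that the expected rating and modelled error variance of each rating have already been obtained from the model; the function names and the small data set are hypothetical:

```python
def outfit_mean_square(observed, expected, variance):
    """Mean of the rating-by-rating ratios of squared residual to modelled variance."""
    ratios = [(x - e) ** 2 / w for x, e, w in zip(observed, expected, variance)]
    return sum(ratios) / len(ratios)

def infit_mean_square(observed, expected, variance):
    """Sum of squared residuals divided by the sum of modelled variances."""
    squared = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return squared / sum(variance)

# Hypothetical ratings for one judge: two on-target ratings and one large
# surprise on a rating whose modelled variance is small.
observed = [2, 2, 5]
expected = [2.0, 2.0, 3.0]
variance = [1.0, 1.0, 0.25]

print(round(outfit_mean_square(observed, expected, variance), 2))  # 5.33
print(round(infit_mean_square(observed, expected, variance), 2))   # 1.78
```

The single outlying rating inflates Outfit far more than Infit, because Outfit gives each ratio equal weight while Infit weights each squared residual by the information (modelled variance) of its rating. When the modelled variances are all equal, the two statistics coincide.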

THE JUDGING PLAN

The only requirement on the judging plan is that there be enough linkage between all elements of all facets that all parameters can be estimated within one frame of reference without indeterminacy. An example of lack of linkage and consequent indeterminacy is a plan in which judge panel B grades only boys and judge panel G grades only girls, because then a relatively good performance by one gender can be attributed either to higher ability or to more lenient judges.

The ideal and usually necessary judging plan for conventional analysis is that in which every judge rates every examinee on every item. This is illustrated in Figure 8-4, which follows the specifications of Braun (1988). Under Rasch analysis, this design meets the linkage requirement and provides precise measures of all parameters in the shared frame of reference, but such completeness is not required. All that is required is a network of examinee, judge, and item overlap.

Figure 8-4. Complete judging plan

A simple linking network can be obtained by having groups of judges rate some examinees on all items. This type of plan is shown in Figure 8-5. The parameters are linked into one frame of reference through ratings that share pairs of parameters: common persons, common essays, or common judges. Accidental omissions or unintended extra ratings amend the judging plan but do not threaten measurement construction. Measures are less precise than with complete data because fewer ratings are made. Since the standard errors of the measures are approximately in proportion to the inverse of the square root of the number of observations, the standard errors of measures estimated from this second incomplete data set will be about 2.5 times larger than for the first complete data set. On the other hand, the judging effort will be reduced by 83 percent.

Figure 8-5. "Rotating test-book" judging plan

Judging is time consuming and expensive. It may be desirable to minimize the judging work by arranging for each item of each performance to be judged only once. Even under these circumstances, the statistical requirement for overlap can usually be met rather easily. For instance, if each examinee writes several essays and all essays are shuffled together randomly, overlap can be obtained by having each judge grade whichever essay happens to come next on the pile. Each judge grades as many essays as time and speed allow. But each essay is graded only once. Nevertheless, by the end of the judging session, many examinees will have been rated by more than one judge, but on different essays, and many essay topics will have been rated by more than one judge, but for different examinees.

An example of this type of minimal judging plan, but under slightly stricter rules, is shown in Figure 8-6. Each of the 32 examinees' three essays is rated by only one judge. Each of the 12 judges rates eight essays, including two or three of each essay type. The examinee-judge-essay overlap enables all parameters to be estimated unambiguously in one frame of reference. Assignment of essays to judges was by a simulated "random pile" of essays with the constraints that each essay be rated only once, each judge rate an examinee once at most, and each judge avoid rating any one type of essay too frequently. The cost of this minimal data collection is lower measurement precision, with standard errors 3.5 times larger than for the full plan. The

judging effort, however, is reduced about 92 percent. The loss of information under such a plan might appear excessive, but where the number of different items of performance to be rated is high, this type of plan has proved feasible (Lunz, Wright, & Linacre, 1990).

Figure 8-6. Minimal-effort judging plan

GENERALIZABILITY OF RESULTS

The category labels of a rating scale are not only arbitrary and nonlinear, but also local to the design of the particular examination. The implications of this may be masked when all examinees are rated on the same items by the same judges in one testing session, but they are immediately apparent when examinees face different testing situations. Quantitative comparison requires a frame of reference in which it no longer matters which examinee is rated by which judge on which item in what session. The many-facet Rasch model enables such a framework to be constructed (Stahl, 1991).

CONTROL OF JUDGE IDIOSYNCRASY

Judge training is required to develop a shared understanding of a rating scale and a uniform perspective on the challenge applied by the test items. It is claimed that "subjectivity of marking may be reduced about one-half by the adoption of and adherence to a set of scoring rules when essay examinations are to be graded" (Ruch, 1929). Conventionally, training has been further aimed at obtaining unanimity across judges about the rating to be awarded to particular performances on particular items. This idealistic attempt to produce identical, and hence exchangeable, judges has met with little success. "Judges employ unique perceptions which are not easily altered by training" (Lunz et al., 1990, p. 332). No entirely successful large-scale judge training program has ever been reported. There are many situations in which judge training is given little or no attention (for example, a supervisor rating subordinates) or has been discovered to have been ineffective. It is always essential to monitor the quality of the ratings being awarded and to direct each judge's attention to those areas in which there is doubt.

An advantage of the Rasch many-facet measurement model is that within-judge self-consistency, rather than between-judge unanimity, is now the aim. On this basis, unexpectedly harsh or lenient ratings, not in accord with a judge's usual rating style, can be identified, and also each judge's biases relating to any particular items, groups of examinees, or the like, can be quickly revealed. This has two benefits. First, unacceptably idiosyncratic ratings can be treated as missing without disturbing the validity of the remainder of the analysis.
Second, precise feedback to each judge about specific questionable ratings or rating patterns can foster improvements in the judging process. In the admission data, 14 of the 6,227 ratings were sufficiently unexpected as to invite closer inspection, and, where necessary, corrective action. In three cases, the observed ratings were more than two rating points different from those expected based on the overall ability of the examinee, severity of the judge, and difficulty of the item—surely a large enough discrepancy to provoke skepticism about the validities of those ratings.
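Screening for such ratings can be sketched as a simple filter on residuals. The sketch assumes that an expected rating and a modelled variance are available for every observed rating. The two-rating-point rule follows the discrepancy mentioned above; the standardized-residual cutoff `max_z`, the record layout, and the example records are hypothetical.

```python
import math

def flag_unexpected(records, max_points=2.0, max_z=3.0):
    """Return the ratings that differ markedly from their model expectation.

    records: dicts with keys 'examinee', 'judge', 'item', 'observed',
             'expected', and 'variance' (modelled error variance).
    A rating is flagged if it is more than max_points rating points from its
    expectation, or if its standardized residual is at least max_z in size.
    """
    flagged = []
    for rec in records:
        residual = rec["observed"] - rec["expected"]
        z = residual / math.sqrt(rec["variance"])
        if abs(residual) > max_points or abs(z) >= max_z:
            flagged.append({**rec, "residual": residual, "z": z})
    return flagged

records = [
    {"examinee": "n1", "judge": "j1", "item": "i1", "observed": 5, "expected": 2.6, "variance": 0.8},
    {"examinee": "n2", "judge": "j1", "item": "i2", "observed": 3, "expected": 2.9, "variance": 0.7},
    {"examinee": "n3", "judge": "j2", "item": "i1", "observed": 1, "expected": 2.5, "variance": 0.2},
]
for hit in flag_unexpected(records):
    print(hit["examinee"], hit["judge"], hit["item"], round(hit["residual"], 2), round(hit["z"], 2))
```

Flagged ratings can then be reported back to the judge concerned or treated as missing in a rerun of the analysis.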


FURTHER MEASUREMENT MODELS

The many-facet measurement model can be expressed in many forms to meet the requirements of specific testing situations, including portfolio assessment, artistic and athletic competitions, and skill certification. Some of these forms are:

an item-scale model, in which each item is constructed with its own rating scale,

\[
\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_{ik}
\]

where Bn, Di, and Cj are as above, and Fik is the difficulty of the step from category k-1 to category k of the scale unique to item i, and k = 1, ..., Mi;

a judge-scale model, in which each judge uses his or her own interpretation of the rating scale,

\[
\log\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_{jk}
\]

where Bn, Di, and Cj are as above, and Fjk is the difficulty of the step from category k-1 to category k for judge j, and k = 1, ..., Mj;

a four-faceted model, in which each of the items is modelled to apply to each of a number of tasks,

\[
\log\left(\frac{P_{nmijk}}{P_{nmij(k-1)}}\right) = B_n - A_m - D_i - C_j - F_k
\]

where Bn, Di, Cj, and Fk are as above, and Am is the difficulty of task m.


CONCLUSION

The construction of a measurement system for subjective tests is practical and useful. Test constructors no longer need limit themselves to what can be obtained from an MCQ test, but instead can devote their creative powers to designing tests that involve deeper, more relevant, and hence more authentic evidence of competence, without losing the benefits of objective measurement.

REFERENCES

Bennett, R.E., Ward, W.C., Rock, D.A., & Lahart, C. (1990). Toward a framew ton, NJ: Education Testing Service.
Braun, H.I. (1988). Understanding scoring reliability. Journal of Educational Statistics, 13(1), 1-18.
Cason, G.J., & Cason, C.L. (1984). A deterministic theory of clinical performance rating. Evaluation and the Health Professions, 7, 221-247.
De Gruiter, D.N.M. (1984). Two simple models for rater effects. Applied Psychological Measurement, 8, 213-218.
Haladyna, T.M., Nolen, S.B., & Haas, N.S. (1991). Raising standardized achievement test scores and the origins of test score pollution. Educational Researcher, 20(5), 2-7.
Linacre, J.M. (1988). FACETS computer program. Chicago: MESA Press.
Lunz, M.E., Wright, B.D., & Linacre, J.M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3(4), 331-345.
Ruch, G.M. (1929). The objective or new-type examination. Chicago: Scott, Foresman.
Ruggles, A.M. (1911). Grades and grading. New York: Teacher's College.
Stahl, J. (1991, April). Equating examinations that require judges. Paper presented at AERA Annual Meeting, Chicago.
Windmeijer, F.A.G. (1990). The asymptotic distribution of the sum of weighted squared residuals in binary choice models. Statistica Neerlandica, 44(2), 69-78.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B.D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29(1), 23-48.

chapter

9

Development of a Functional Assessment That Adjusts Ability Measures for Task Simplicity and Rater Leniency* Anne G. Fisher

Professor, Department of Occupational Therapy, College of Applied Human Sciences Colorado State University

INTRODUCTION
Therapists draw important conclusions about the abilities and limitations of people by observing them in the context of their performances of activities of daily living (ADL) (for example, dressing, bathing, or eating) and instrumental activities of daily living (IADL) (for example, meal preparation, shopping, or laundry). Therapists use the information gathered to (a) make judgements regarding the overall functional ability of the person, (b) identify specific deficits that may be impairing functional performance, (c) plan appropriate intervention programs designed to enhance the person's level of independence, and (d) monitor change in performance levels over time. While therapists routinely evaluate ADL/IADL ability by direct observation, the majority use homegrown evaluation tools of unknown validity and reliability. That is, there is general recognition that therapists practicing in a variety of settings, such as rehabilitation, long-term care, and home health, have developed their own ADL/IADL assessments with little attempt to establish the validity and reliability of the instruments. Further, no existing standardized instrument has been recognized as having the characteristics of a gold standard (Eakin, 1989; Keith, 1984; Law & Letts, 1989; Jongbloed, 1986). There are several factors that may have contributed to the limited usage of standardized ADL/IADL evaluations by therapists in clinical settings. Among the most apparent is that existing standardized evaluations fail to meet the needs of the clinician involved in the direct intervention with people who have physical or psychosocial disabilities. For example, most standardized ADL/IADL scales were developed for managerial and policy purposes related to screening, determination of the need for services, resource allocation, and outcome analysis (see Fuhrer, 1987; Granger & Gresham, 1984; Kane & Kane, 1981, for reviews).
As a result, standardized ADL/IADL evaluations tend to be rather global in nature; they commonly are used to assess whether or not the person can perform a number of ADL/IADL tasks independently, and, if not, what level of assistance is required. From the perspective of the therapist responsible for providing intervention, such standardized global assessments provide an indication of what a person can or cannot do, but no information about why the person might be experiencing functional limitations. Yet an important prerequisite for planning cost-effective intervention programs is that the therapist be able to identify specific factors that limit performance ability so that those factors can be targeted in the intervention.

* Appreciation is extended to J. Michael Linacre and Benjamin Wright for their reviews and refinement of this manuscript. Kimberly Bryze and Anita Bundy also provided valuable editorial input. This project was supported, in part, by funding from the American Occupational Therapy Association and Foundation through the Gerontology Research Symposium, the Physical Disabilities Symposium, and the Center of Research and Measurement at the University of Illinois at Chicago, College of Associated Health Professions, Department of Occupational Therapy. Thanks are extended to the members of the AMPS gerontology and physical disabilities teams that served as the raters for this study. Finally, appreciation is extended to Ay Woan Pan for her assistance with data analysis. Portions of this chapter were presented at the annual meeting of the American Educational Research Association, Chicago, April 1991.


Therefore, the therapist who chooses to use a standardized instrument designed to evaluate global ADL/IADL ability, yet desires to identify specific deficits or impairments that are interfering with the functional performance of the individual, must supplement his or her global ADL/IADL evaluation with discrete evaluations of the distinct constituents underlying ADL/IADL performance (including strength, range of motion, perception, and mental status). The basic assumption made is that if the underlying cause of the ADL/IADL limitations can be identified and treated, the effects will generalize to improved functional performance across a wide range of ADL/IADL tasks. While this approach has logical appeal, research has not demonstrated a strong enough relationship between underlying constituents and ADL/IADL performance, when they are evaluated separately, to be able to make valid predictions about the abilities of a person in daily life task performance based on his or her discrete test scores (Bernspang, Asplund, Eriksson, & Fugl-Meyer, 1987; Jongbloed, Brighton, & Stacey, 1988; Pincus, Callahan, Brooks, Fuchs, Olsen, & Kaye, 1989; Reed, Jagust, & Seab, 1989; Skurla, Rogers, & Sunderland, 1988; Teri, Borson, Kiyak, & Yamagishi, 1989). The commonly chosen alternative is for the therapist to observe directly the person performing selected ADL/IADL tasks that the individual has identified as relevant to his or her needs and goals, and then, simultaneously, make subjective judgements regarding (a) the person's overall ability to perform ADL/IADL tasks, and (b) the distinct underlying performance constituents that appear to be impairing the person's performance. There are certain advantages to this approach.
While most standardized ADL/IADL scales are of a self- or proxy-report or interview format, there is increasing recognition that direct observation of ADL/IADL performance may be preferred in many instances (Consensus Development Panel, 1988; Guralnik, Branch, Cummings, & Curb, 1989). Moreover, therapists are recognized for their expertise in performance evaluation (evaluation based on direct observation of performance) (Guralnik et al., 1989), as well as for their ability to effect comprehensive task analyses that result in the identification of appropriate adaptive or compensatory methods that can be utilized by the person to achieve desired functional goals (Faletti, 1984). Another advantage of directly observing a person perform selected ADL/IADL tasks is that the therapist is able to individualize the evaluation by observing the person perform only those tasks that the individual perceives as relevant and meaningful, given his or her living situation and interests. This is based on the assertion that the quality of task performance is influenced by the volitional characteristics of
the individual. Volition is assumed to determine what tasks the person chooses to perform, and function is hypothesized to be maximized when an individual performs a task of his or her choice (Kielhofner & Burke, 1985). However, observing the person perform self-selected tasks while making subjective judgements regarding the individual's ability to perform ADL/IADL tasks defies objective measurement. Indeed, even when a systematic and reproducible method of scoring the performance is used, the specific tasks chosen by the person vary in difficulty. If no mechanism is used to adjust person measures for the simplicity of the tasks performed, the person who performs easier tasks will have an unfair advantage over the person who performs harder tasks. Moreover, unless the person performs exactly the same set of tasks each time he or she is evaluated, this system does not allow the therapist to monitor change as the individual progresses over the course of intervention. The influence of rater judgement is another frequently cited area of concern, especially for IADL assessments (George & Fillenbaum, 1985; Lawton, 1987; Rubenstein, Schairer, Wieland, & Kane, 1984). The major reason for lowered interrater reliabilities is that the complexity of IADL requires that greater degrees of rater judgement be used in scoring; what is judged to constitute adequate performance is highly variable and reflects the personal biases of the raters (Lawton, 1987). As Lunz and Stahl (1990) pointed out, clinical observation and rating of a person's performance always requires the input of a judge. Since all judge-awarded ratings reflect some subjectivity, judge bias is a major drawback to objective measurement of examinee ability. Attempts to improve uniformity among judges have included constructing structured items . . . , standardizing grading criteria and administration procedures, and providing extensive judge training. 
But these efforts have served only to direct the attention of judges, not to control the [leniency] of their assessments. (p. 426)
Therefore, any objective measurement system that is developed to meet the requirements of clinical practice must have several important features. First, it must provide the therapist with the capability to assess the impact of discrete skill deficits on global ADL/IADL ability directly. Second, it must be developed so as to give consideration to the motivation, interests, and needs of the person tested by offering the opportunity for motivated task choice. Third, person ability measures must be adjusted for the simplicity of the tasks performed and for the leniency of the rater who observed the performance. And finally, the
measurement system must have demonstrated validity and reliability. The Assessment of Motor and Process Skills (AMPS) (Fisher, 1991), an innovative assessment of IADL, was designed to meet these requirements of clinical practice. The purpose of this chapter is to describe the application of the many-faceted Rasch model (Andrich, 1988; Linacre, 1989, this volume) to construct and validate the motor scale of the AMPS.
ASSESSMENT OF MOTOR AND PROCESS SKILLS
The Assessment of Motor and Process Skills (AMPS) was developed in response to the need for scales (a) that are defined by skill item easiness and IADL task simplicity, (b) that adjust the person ability measures for the leniency of the rater performing the observation, (c) that permit the simultaneous evaluation of IADL task performance and the underlying motor and process (organizational/adaptive) performance skill capacities necessary for skilled task performance, and (d) that provide the person observed the opportunity to select tasks to perform that reflect his or her values and interests. In the context of the person's actually performing one or more IADL tasks of his or her choice, the person is rated on 15 motor skill items and 20 process skill items. The motor skills are conceptualized as representing a taxonomy of universal motor operations that underlie task performance, and the process skills are conceptualized as representing a taxonomy of universal process operations that underlie task performance. Motor skills pertain to those capacities that the person uses to produce or impart motion to self or objects. They are those performance skills that relate to the posture, mobility, coordination, and strength capacities of the person that provide the basis for movement of the body and objects. The term process may be defined as a series of actions en route to task completion.
Process skills are related to the attentional, conceptual, organizational, and adaptive capacities that the person uses to sensibly organize the actions he or she performs in order to complete the specified task. These motor and process skills are operationally defined as observable actions that reflect the underlying performance capacities (Fisher, 1991). Definitions of the 15 motor skills analyzed for this study are listed in Figure 9-1. When the AMPS is used to evaluate a person, he or she is offered several IADL task choices from approximately 30 listed in the test manual. Whenever possible, the person is asked to choose at least two to perform. During the performance, the rater scores the 15 observable motor skills on a 4-point rating scale. A score of 4 (Competent) is
STRENGTH
• Moves: pushes, shoves, pulls, or drags objects along a supporting surface; includes opening doors and drawers. Pertains to the moving of objects that are not lifted (e.g., pushing or pulling on a cart, door, or drawer; dragging a heavy bag across the floor; or sliding a heavy pan along the counter top). Includes the ability to self-propel a wheelchair.
• Lifts: raises or hoists objects off of supporting surface; includes moving an object that is lifted from one place to another, but without ambulation or moving from one place to another. Pertains to having enough strength to lift objects.
• Reaches: stretches or extends the arm, and, when appropriate, the trunk to grasp or place objects that are out of reach. Pertains to the ability to effectively reach to the extent necessary in order to obtain objects. Where appropriate, this includes trunk movement.
• Endures: persists and completes the task without evidence of fatigue, pausing to rest, or stopping to "catch one's breath."

POSTURE AND MOBILITY
• Transports: carries objects while ambulating or moving from one place to another (e.g., in a wheelchair). Pertains to the physical capacity to gather.
• Stabilizes: steadies body, and maintains trunk control and balance while sitting, standing, or walking, while reaching, or while moving, lifting, pushing, or pulling objects; pertains to postural control during trunk or limb movements.
• Aligns: maintains the body weight evenly distributed over the base of support; implies an absence of asymmetries, flexed or stooped posture, or excessive leaning; pertains to body alignment that may be affected by structural or strength limitations.
• Walks: ambulates on level surfaces; implies steadiness or an absence of shuffling, lurching, ataxia, etc.; includes the ability to turn around to change direction while walking.

FINE MOTOR ABILITIES AND SUBTLE POSTURAL ADJUSTMENTS
• Bends: actively flexes, rotates, or twists the body in a manner and direction appropriate to the task; pertains to trunk mobility.
• Coordinates: uses different parts of the body together or uses other body parts as an assist or stabilizer during bilateral motor tasks. Pertains to the physical capacity to hold, support, or stabilize objects during bilateral task performance.
• Manipulates: uses dexterous grasp and release, as well as coordinated in-hand manipulation patterns; pertains to skillful use of isolated finger movements when handling objects.
• Flows: uses smooth, fluid, continuous, uninterrupted arm and hand movements. Pertains to the quality or refinement of motor execution; includes the absence of dysmetria, ataxia, tremor, rigidity, or stiffness of movement. Implies the ability to isolate movements.
• Positions: positions body or wheelchair in relation to objects in a manner that promotes the use of efficient arm movements; pertains to the use of postural background movements appropriate to the task. Implies the absence of awkwardness of arm or body positions. Includes the ability to position the body or wheelchair appropriate to the task or movement pattern of the arm.
• Calibrates: regulates or grades the force, speed, and extent of movements in the performance of a step or action; pertains to the amount of effort exerted or an expenditure of energy that is appropriate to the requirements of the action or step (e.g., not too much or too little).
• Grips: pinches or grasps in order to grasp handles, to open fastenings and containers, or to remove coverings; relates to effectiveness of strength of pinch and grip.

Figure 9-1 Definitions of the AMPS motor skills.

assigned when the rater judges that there is no evidence of a motor skill deficit interfering with the person's performance. A score of 3 (Questionable) is assigned when the rater questions the presence of a motor skill deficit that is interfering with IADL task performance. A score of 2 (Ineffective) is assigned when the rater judges that a motor skill deficit is impacting on the person's effective use of time and energy such that ongoing task performance is affected. Finally, a score of 1 (Deficit) is assigned when the motor skill deficit is severe enough to result in task breakdown, risk of danger, or an unacceptable slowing of the task progression. Scoring examples for all skill items are listed in the test manual (Fisher, 1991). Scoring examples for each score category for the motor skill item Transports are shown in Figure 9-2.

TRANSPORTS: carries objects while ambulating or moving from one place to another (e.g., in a wheelchair). Pertains to physical capacity to gather. (Note. Score the ability to move objects such as doors, drawers, or carts that typically are not lifted under the motor verb Moves. The presence of instability when carrying objects is also scored under the motor verb Stabilizes.)

4 = readily and consistently carries objects from one place to another while walking or moving from place to place
- carries sheets from linen closet to bedroom without difficulty
- carries pan from stove to the other end of the counter
- carries, when appropriate, two or three items at a time
- while seated in a wheelchair, readily carries bread and condiments (placed in the lap) from refrigerator to counter
- while walking with a walker, carries shoes and polish in a basket on the walker without difficulty

3 = questionable transporting skill, but no apparent disruption of action item or task performance, or impact on other skill items
- possible hesitation or slowness while transporting objects
- examiner questions the presence of instability while transporting

2 = ineffective transporting skill impacts on action item or task performance, or results in inefficient use of time or energy
- some gait instability when carrying sheets
- slides objects that typically are transported (e.g., moving a pan from the stove to the other end of the counter top)
- difficulty carrying more than one or two items
- difficulty transporting objects in the wheelchair slows task progression

1 = severity of transporting skill deficit clearly impedes action item or task performance such that the results are unacceptable, or damage or danger is imminent
- attempts but unable to transport
- imminent risk of fall or dropping an object when attempting to walk while carrying the object
- unacceptable delay in task progression because of difficulty transporting
- examiner intervention required because severity of transporting skill deficit results in task breakdown, or imminent risk of damage or danger

Figure 9-2 Example performances by score category for the motor skill item Transports.


MANY-FACETED RASCH ANALYSIS OF THE AMPS MOTOR SCALE
Because the 15 motor skill items represent universal operations that underlie all IADL task performances, it is possible, for the first time, to relate motor skill capabilities directly to the simplicity of the IADL tasks. This is accomplished by using the many-faceted Rasch analysis computer program, FACETS (Linacre, 1988), to calibrate the motor skill items and the IADL tasks on a common log-linear scale (IADL motor scale). Person IADL motor skill measures are adjusted for the simplicity of the tasks actually performed. Therefore, it is possible to (a) determine where, on a conceptual continuum of ability, people of varying abilities are located; and (b) compare and predict performance capacity of those people across multiple tasks of greater or lesser simplicity than those they actually were observed performing. An added advantage of using many-faceted Rasch analysis is that raters can be calibrated according to their relative leniency. Moreover, the many-faceted Rasch model is used to calibrate each element (that is, each skill item, each task, each rater, each person) of each facet (item facet, task facet, rater facet, person facet) "on the same common log-linear scale so that a quantitative frame of reference for the [assessment] is constructed and quantitative comparisons among and within facets and facet elements can be made" (Lunz, Wright, & Linacre, 1990, p. 332). Therefore, it is possible to create a measurement system that is able to adjust person scores for the additive effects of skill item easiness, task simplicity, and rater leniency. (See Linacre, 1989, this volume; Lunz & Stahl, 1990; Lunz et al., 1990, for more detailed discussions of the many-faceted Rasch model.)
As applied to the AMPS, the many-faceted Rasch model specifies the following expectations: (a) a person has a higher probability of obtaining a higher score on an easy skill item than on a hard skill item, (b) easy skill items are easier for all individuals than are hard skill items, (c) judges award higher scores for easy skill items than hard skill items, (d) individuals obtain higher scores on less challenging tasks than more challenging tasks, and (e) people with higher ability obtain higher scores than do less able individuals. Moreover, since a 4-point rating scale is used to score the AMPS, all persons are expected to obtain progressively higher rating scale scores on progressively easier skill items and tasks (Andrich, 1988; Lunz & Stahl, 1990; Silverstein, Kilgore, & Fisher, 1989; Wright & Masters, 1982). When the data conform to these expectations, they fit the measurement model. The values of the parameters modeled to underlie the observed responses (raw skill item scores) are estimated according to these specifications until the expected (estimated) responses predicted by the model are as close as possible to the observed responses (Lunz & Stahl, 1990). With the AMPS, the skill item easiness calibration is the estimated location of that skill item on the continuum of increasing IADL motor ability. The task simplicity calibration is the estimated location of that task on the same continuum of increasing IADL motor ability. The rater leniency calibration is the estimated location of that rater on the common scale. Finally, the person measure is the estimated location of that person on the continuum of increasing ability that has been defined by the easiness of skill items and the simplicity of the tasks, after being adjusted for the raters who scored the task performances. These calibrations and measures are expressed in equal-interval units of measurement based on the logarithm of the odds (log-odds probability units or logits) of obtaining a given skill item score when a person of a given ability is observed by a given rater performing a given task (Andrich, 1988; Lunz & Stahl, 1990; Lunz et al., 1990; Wright & Masters, 1982). The detailed fit statistics that are computed by the FACETS computer program then are examined to verify that a valid measurement system that conforms to the requirements for linear measurement is being constructed. The mean-square residuals, differences between observed and expected scores, provide a measure of the degree to which the skill items and tasks fit the expectations of the Rasch model (Linacre, this volume). The skill item and task mean-square fit statistics verify the internal validity of the AMPS motor scale. As the AMPS continues to be developed, those skill items and tasks that fit the model will be retained. Those that fail to fit the model will be revised or eliminated.
Since rater leniency also is calibrated, the FACETS computer program calculates rater fit statistics. Examination of rater fit statistics enables determination of the extent to which individual raters assign skill item scores consistently. A rater misfits when his or her assigned scores are internally inconsistent (that is, when the rater unexpectedly assigns high scores on hard skill items or to less able persons or low scores on easy skill items or to more able persons). Finally, person response validity is verified by examining person fit statistics that measure the extent to which a person's pattern of responses to the individual skill items corresponds to that predicted by the model (Linacre, this volume). A person will misfit when he or she obtains unexpectedly high scores on hard skill items or unexpectedly low scores on easy skill items. This misfit can provide useful diagnostic information that can be used to guide therapeutic interventions.

The intention is to construct a valid and reliable measurement system that can be used to evaluate individuals who have a wide range of ability levels. With individuals at the more able end of the ability continuum, the therapist must contribute to critical decisions regarding a person's ability to live independently in the community. Therefore, this study was focused on the examination of the validity and reliability of the AMPS motor scale when applied to community-living individuals. More specifically, a major focus of this study was the examination of rater consistency and severity. In addition, several aspects of validity were examined. The examination of the internal validity of the AMPS motor scale included evaluation of the fit of the items and the tasks to the many-faceted Rasch model (Linacre, 1989, this volume). Construct validity of the AMPS motor scale was evaluated by examining the hierarchical ordering of the motor skill item calibrations. Adequate strength of proximal shoulder and trunk musculature is necessary for postural control and fine motor skill (Case-Smith, Fisher, & Bauer, 1989). Further, fine motor skills and subtle postural background movements are commonly the only skills impaired in persons with mild motor deficits (cf. Fisher, Murray, & Bundy, 1991). Therefore, it was expected that (a) the motor skill items that assess components of strength would be among the easiest items, (b) the motor skill items that assess posture and mobility would be of intermediate difficulty, and (c) the motor skill items that assess fine motor skills and subtle postural control would be the most difficult (see Figure 9-1). Concurrent validity of the AMPS motor scale was examined by evaluating the ability of AMPS IADL motor measures to differentiate between individuals who are able to live independently in the community and those persons who require assistance to remain in the community.
Finally, the examination of the validity of the scales involved evaluation of person response validity.
METHODS
Subjects
The 56 subjects for this study included (a) 39 community-living well individuals without previously identified limitations of the ability to perform daily living tasks; (b) three community-living frail individuals without identified major medical conditions, but with identified functional limitations; and (c) eight community-living and six institutionalized individuals with major orthopedic, neurological, sensory (for example, hearing loss), or cognitive disabilities. Most of the subjects with disabilities experienced some restriction in the ability to perform daily life tasks. Three of the disabled subjects were able to live independently in the community; nine required minimal assistance or supervision to live in the community; and two needed maximal assistance or would be unable to live in the community. The well subjects ranged in age from 20 to 84 years; the frail subjects were all older adults; and the subjects with disabilities ranged in age from 28 to 81 years (see Table 9-1). All but four of the subjects were female. Three of the four male subjects were disabled.

Table 9-1 Subject Demographic Data

                                        Age (years)
Group                        Total   Mean   Range   < 65 (n)   > 65 (n)
Community-living well           39     48   20-84         22         17
Community-living frail           3     77   68-84          -          3
Community-living disabled        8     72   62-81          1          7
Institutionalized disabled       6     64   28-80          2          4

Instrumentation
The AMPS was administered to each subject in accordance with the standardized administration procedures described in the test manual (Fisher, 1991). To ensure linkage between subjects, tasks, and raters, the task choices made available to the subjects were limited to the following eight tasks: repotting a small houseplant; vacuuming a living room (including moving light furniture); changing the sheets on a bed; preparing eggs, toast, and brewed coffee; preparing a grilled cheese sandwich; making a tossed green salad; preparing a tuna salad sandwich; and making a fruit salad. Forty-two of the subjects performed two tasks; the remaining 14 subjects performed one task.
Procedure
Upon obtaining informed consent for participation in this study, a trained rater administered the AMPS to each subject. Approximately five task choices were offered to each subject, and each subject selected one or two tasks to perform. All task performances were videotaped for later scoring by one or more of 15 trained raters. All of the raters were experienced occupational therapists trained in
the administration and scoring of the AMPS. Rater training was accomplished by means of a 3-day training workshop. Upon completion of the training, each rater independently scored one of four calibration videotapes containing approximately 10 videotaped task performances. Four of the raters co-scored several additional videotaped task performances. To ensure linkage among raters, each rater scored a minimum of five videotaped task performances (observations) that also were scored by at least four additional raters.
Data Analysis
A total of 221 rated observations were subjected to many-faceted Rasch analysis. To make the assumed additive relationship among the five facets of the constructed AMPS motor scale easier to conceptualize, the log-odds probability of a given score was modeled as

\[ \log\left(\frac{P_{nitrk}}{P_{nitr(k-1)}}\right) = B_n + E_i + S_t + L_r - F_k \]

where
• $P_{nitrk}$ = probability of person n being assigned score k by rater r on skill item i when performing task t
• $P_{nitr(k-1)}$ = probability of person n being assigned score k - 1 by rater r on skill item i when performing task t
• $B_n$ = ability measure of person n
• $E_i$ = easiness calibration of skill item i
• $S_t$ = simplicity calibration of task t
• $L_r$ = leniency calibration of rater r
• $F_k$ = difficulty of rating scale step k relative to step k - 1
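The additive relationship among the facets can be illustrated with a short numerical sketch. It assumes the model takes the form log(Pk / Pk-1) = Bn + Ei + St + Lr - Fk, which is the form implied by the facet definitions above. The function names and all numeric values are hypothetical and chosen only for illustration; they are not AMPS calibrations and this is not the FACETS estimation procedure.

```python
import math

def category_probabilities(b, e, s, l, steps):
    """Probabilities of rating categories 1..K under an assumed additive
    many-facet rating scale model:
        log(P_k / P_{k-1}) = b + e + s + l - F_k
    b: person ability, e: item easiness, s: task simplicity,
    l: rater leniency, steps: [F_2, ..., F_K], all in logits.
    """
    theta = b + e + s + l
    # Cumulative log-numerators; the lowest category is fixed at 0.
    log_num = [0.0]
    for f in steps:
        log_num.append(log_num[-1] + theta - f)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_num)
    weights = [math.exp(x - m) for x in log_num]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(probs, lowest=1):
    """Model-expected rating, with categories numbered from `lowest`."""
    return sum((lowest + k) * p for k, p in enumerate(probs))

# Hypothetical step difficulties for a 4-point scale (three steps).
steps = [-1.0, 0.0, 1.0]

# Same person, item, and task; only the rater's leniency differs.
lenient = category_probabilities(1.0, 0.0, 0.0, 0.5, steps)
severe = category_probabilities(1.0, 0.0, 0.0, -0.5, steps)

print("lenient rater expected score:", round(expected_score(lenient), 2))
print("severe rater expected score: ", round(expected_score(severe), 2))
```

Because every facet enters additively on the logit scale, a more lenient rater shifts the whole set of category probabilities toward higher scores. Calibrating the rater is what allows that shift to be removed from the person measure.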

Both mean-square infit and mean-square outfit statistics were used to evaluate (a) the suitability of the skill items and tasks for constructing an IADL motor scale, (b) the consistency of the rater's scoring over skill items and observations, and (c) the usefulness of the scale, defined by the easiness of the skill items and the simplicity of the tasks, as a measure of the IADL motor ability of persons.

The infit statistic is an information weighted mean-square residual between observed and expected, which focuses on the accumulation of central, inlying, deviations from expectation. The outfit statistic is the usual unweighted mean-square residual, which is particularly sensitive to outlying deviations from expectation. (Lunz et al., 1990, p. 336)

The expected mean-square value is 1.0. Mean-squares less than 1.0 suggest the presence of unexpected redundancy, dependency, or constriction in the data. Redundancy or dependency occurs when items are highly correlated. Constriction occurs when scores are not sufficiently spread out across the range of the rating scale. Mean-squares greater than 1.0 signal the presence of unexpected variability, inconsistency, or extremism (Wright & Stone, 1979). Mean-squares greater than 1.3 or less than 0.7 were considered suggestive of unacceptable fit and they were targeted for further examination.
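The two fit statistics can be sketched as follows, using the conventional Rasch definitions (outfit as the unweighted mean of squared standardized residuals, infit as the sum of squared residuals divided by the sum of model variances). The sketch assumes that the expected score and model variance for each rating are already available from a fitted model. The function names are hypothetical, and the code is not taken from the FACETS program.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean-squares for one element.

    observed: ratings actually assigned
    expected: model-expected ratings for the same observations
    variance: model variance of each rating (must be positive)
    """
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals,
    # so a single far-off rating can dominate it.
    outfit = sum(r / v for r, v in zip(sq_resid, variance)) / len(sq_resid)
    # Infit: information-weighted mean-square, so well-targeted
    # observations (larger variance) carry more weight.
    infit = sum(sq_resid) / sum(variance)
    return infit, outfit

def flag_misfit(mnsq, low=0.7, high=1.3):
    """Apply the screening rule used in this chapter (< 0.7 or > 1.3)."""
    return mnsq < low or mnsq > high

# Illustration with two values reported in Table 9-2.
print(flag_misfit(1.4))  # Coordinates infit
print(flag_misfit(1.0))  # Manipulates infit
```

Under these definitions, a mean-square exactly at a cutoff (for example, an infit of 1.3) is not flagged, because the rule in the text is stated as "greater than 1.3 or less than 0.7."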

RESULTS

Validity of the AMPS Motor Scale

Table 9-2 shows the skill item easiness calibrations, the standard errors of these estimates, and the mean-square fit statistics for each skill item. Lifts is the easiest skill item (0.99 logits) and Calibrates is the most difficult (-0.81 logits). The construct validity of the AMPS motor scale is confirmed by the ordering of the easiness calibrations of the skill items. Lifts, Endures, Moves, and Reaches were expected to be the easiest skill items. Coordinates, Flows, Bends, Positions, Manipulates, Grips, and Calibrates were expected to be the most difficult skill items. The calibrated difficulty order is consistent with these hypothesized expectations.

Table 9-2  Skill Item Easiness Facet

Skill Item     Score   Count   Easiness Calibration (logits)   SE (logits)   Infit MnSq   Outfit MnSq
Calibrates     523     221     -0.81                           .12           1.2          1.6
Grips          528     221     -0.73                           .12           1.3          1.1
Manipulates    524     219     -0.71                           .12           1.0          0.9
Positions      527     220     -0.71                           .12           1.2          1.2
Bends          531     221     -0.68                           .12           0.9          0.8
Walks          563     221     -0.14                           .13           0.7          0.5
Flows          564     221     -0.13                           .13           0.9          0.6
Aligns         572     221      0.02                           .14           0.9          0.6
Stabilizes     575     221      0.08                           .14           0.8          0.5
Transports     573     219      0.12                           .14           1.0          0.7
Coordinates    590     221      0.39                           .15           1.4          1.0
Reaches        599     221      0.61                           .16           0.8          0.6
Moves          606     221      0.79                           .16           1.0          1.0
Endures        611     221      0.92                           .17           1.0          0.6
Lifts          613     221      0.99                           .17           1.1          1.0

Mean           567     221      0.00                           .14           1.0          0.9
SD              32       1      0.62                           .02           0.2          0.3
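As a hedged illustration of the screening rule stated earlier (mean-squares greater than 1.3 or less than 0.7), the sketch below applies that rule to the one-decimal mean-squares printed in Table 9-2. The variable names are hypothetical, and because the published values are rounded, items sitting exactly at 0.7 or 1.3 are treated here as within bounds, which is an assumption.

```python
ITEMS = ["Calibrates", "Grips", "Manipulates", "Positions", "Bends", "Walks",
         "Flows", "Aligns", "Stabilizes", "Transports", "Coordinates",
         "Reaches", "Moves", "Endures", "Lifts"]
INFIT = [1.2, 1.3, 1.0, 1.2, 0.9, 0.7, 0.9, 0.9, 0.8, 1.0, 1.4, 0.8, 1.0, 1.0, 1.1]
OUTFIT = [1.6, 1.1, 0.9, 1.2, 0.8, 0.5, 0.6, 0.6, 0.5, 0.7, 1.0, 0.6, 1.0, 0.6, 1.0]

LOW, HIGH = 0.7, 1.3  # screening bounds quoted in the text

# Items with either mean-square above the upper bound (unexpected variability).
too_high = [item for item, i, o in zip(ITEMS, INFIT, OUTFIT)
            if i > HIGH or o > HIGH]
# Items with either mean-square below the lower bound (possible redundancy).
too_low = [item for item, i, o in zip(ITEMS, INFIT, OUTFIT)
           if i < LOW or o < LOW]

print("Above 1.3:", too_high)
print("Below 0.7:", too_low)
```

Under these assumptions the items above the upper bound are Calibrates and Coordinates, and six items fall below the lower bound on outfit.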

Table 9-3  Summary of Misfitting Ratings by Rater

[The individual skill-item-by-rater cell counts are not legible in this copy; the row and column summaries are given below.]

Summary by skill item

Skill Item     Total Ratings   Misfitting Ratings   Percentage Misfitting
Stabilizes     221              6                    2.7
Aligns         221              6                    2.7
Positions      220             20                    9.1
Walks          221              6                    2.7
Reaches        221              5                    2.3
Bends          221              6                    2.7
Coordinates    221             18                    8.1
Manipulates    219             10                    4.6
Flows          221              5                    2.3
Moves          221             12                    5.4
Transports     219              8                    3.7
Lifts          221             16                    7.2
Calibrates     221             26                   11.8
Grips          221             16                    7.2
Endures        221              8                    3.6
Total          3310            168                   5.1

Summary by rater

Rater Number   Total Ratings   Misfitting Ratings   Percentage Misfit
1                89              3                    3.4
2               105              3                    2.9
3               419             27                    6.4
4               418             22                    5.3
5                90              3                    3.3
6                75              7                    9.3
7                75              6                    8.0
8                75              4                    5.3
9                75              4                    5.3
10               75              0                    0
11               75              6                    8.0
12               74              9                   12.2
13               75             10                   13.3
14              405             22                    5.4
15             1185             42                    3.5
Total          3310            168                    5.1

Table 9-4  Number of Score Category Ratings by Rater

Rater      1   2   3   4   5   6   7   8   9   10   11   12   13   14   15   Total
Deficit    0   0   2   4   0   0   0   0   1   0    0    0    5    4    1

( 3.0 logits). Between 3.0 and 2.0 logits is a transition zone where the ability of the frail subjects and the most able subjects with disabilities equals that of the least able well subjects. These well subjects all were over the age of 65; they may be at risk for functional decline. Finally, the least able subjects (ability measures < 2.0 logits) were consistently those individuals who had identified functional limitations.

Rating Scale

The rating scale score categories and the frequency with which each score was assigned are shown in Table 9-11. In contrast to the scoring performance of the most lenient raters, who assigned Competent ratings 90 percent or more of the time (see Table 9-4), 69 percent of the total assigned ratings were Competent. The logit measures associated with each expected score are shown in Table 9-12. The expected score transitions are the expected calibrations for scores halfway between those actually included in the 4-point rating scale. It is these expected score transitions, expressed in logits, that are delineated on the rating scale facet in Figure 9-3. For example, a 3.5 expected score at 2.48 logits demarcates the transition between an expected Competent score of 4 and an expected Questionable score of 3. This transition between an expected competent IADL ability and an expected questionable IADL ability corresponds to the same region of the rating scale, between 3.0 and 2.0 logits, where the ability measures of the least able well subjects, the frail subjects, and the most able subjects with disabilities are located (see Figure 9-3).

Table 9-11  Rating Scale Score Category Statistics

Score Category   Count   Percentage   Step   Step (logits)   SE (logits)
1                  17     1           0
2                 383    12           1      -3.02           0.25
3                 614    19           2       1.10           0.07
4                2296    69           3       1.93           0.05

Table 9-12  Expected Score at Logit Measure

Logit Measure   Expected Score   Definition
-∞              1.0              Deficit
-3.04           1.5
-0.98           2.0              Ineffective
 0.59           2.5
 1.53           3.0              Questionable
 2.48           3.5
+∞              4.0              Competent

Using the Constructed Scale to Predict Performance

Modeling the log-odds probability of a given score based on task simplicity, skill item easiness, and rater leniency has the effect of creating a set of geometrically additive slide rulers that make it easier to conceptualize the additive relationship between the five facets of the constructed AMPS IADL motor scale. These rulers are depicted in Figure 9-3. Vertically sliding each of the central three rulers to target the person, task, and skill items of interest enables the therapist to determine the predicted scores for that person when scored by a given rater for his or her performance on a given task. Figure 9-4 demonstrates this process.

Suppose that we are interested in evaluating the ability of the AMPS motor scale to identify persons who may be beginning to experience functional decline or who may be at risk for loss of the ability to live in the community without assistance. If we position the task simplicity ruler so that the mean task simplicity (0.0, indicated by a pointer "<<") is centered on the most able of the identified frail community-living subjects (F), we can scan across Figure 9-3 to the rating scale facet to discover that this subject would be expected to be competent when repotting a plant, but questionable when making a fruit salad.

Now, if we are interested in knowing what level of ability this person would be expected to have on the individual skill items, we can position the mean skill item easiness calibration (also shown by a pointer "<<") on the task of interest. In this case, we chose the four tasks of average simplicity. Again scanning across Figure 9-3 to the rating scale facet, we can see that we would expect this person to be competent on Lifts, Endures, Moves, Coordinates, and Reaches, but questionable on the harder skill items Calibrates, Grips, Manipulates, Positions, and Bends.

Finally, if we position the rater facet ruler so that the mean rater leniency calibration (pointer "<<") is centered on the most difficult skill items, we can determine how scores assigned by raters of varying leniency can be expected to differ. This frail subject (F), when observed performing a task of average simplicity (for example, Salad), would be expected to score Competent on hard skill items (Bends) when rated by the most lenient two raters, but Ineffective when rated by the most severe rater.

Comparison of Figure 9-4 and Figure 9-5 shows the range of expected scores between the easiest and the hardest skill items. In contrast to the expected performance shown in Figure 9-4, this frail subject (F), when observed performing a task of average challenge (Salad) and scored on an easy skill item (Endures), would be expected to score Competent when scored by all but the most severe rater, who would be expected to rate her Questionable. We are able to make these predictions for all calibrated facet elements even though the person in question only performed a few of the tasks. The opportunity to predict performances is valuable because it enables the therapist to assess in what areas the person will need intervention in order to be able to function independently in everyday tasks.

Figure 9-4  The most able frail subject, performing a task of average challenge, scored on a difficult skill item. (Note. Subject codes: W = community-living well subjects, F = frail subjects, D = subjects with identified orthopedic, neurological, or cognitive disabilities)

Figure 9-5  The most able frail subject, performing a task of average challenge, scored on an easy skill item. (Note. Subject codes: W = community-living well subjects, F = frail subjects, D = subjects with identified orthopedic,
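The link between the step calibrations in Table 9-11 and the expected scores in Table 9-12 can be sketched as follows. This is a minimal illustration that assumes the standard rating scale formulation, in which the log-odds of category k over category k-1 equal the net logit measure minus step k. The function names are hypothetical, the measure passed in is assumed to be the net value after the person, task, skill item, and rater facets have been combined, and the 2.0 entry of Table 9-12 is read as -0.98 logits.

```python
import math

STEPS = [-3.02, 1.10, 1.93]  # step calibrations from Table 9-11 (logits)


def category_probabilities(measure, steps=STEPS):
    """Probabilities of score categories 1..4 at a given net logit measure."""
    # Cumulative log-numerators: category k adds (measure - step_k).
    log_numerators = [0.0]
    for step in steps:
        log_numerators.append(log_numerators[-1] + measure - step)
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(measure, steps=STEPS, lowest=1):
    """Model-expected rating on the 1-4 scale at a given net logit measure."""
    probabilities = category_probabilities(measure, steps)
    return sum((lowest + k) * p for k, p in enumerate(probabilities))


# Logit measures listed in Table 9-12 for expected scores 1.5 through 3.5.
for measure in (-3.04, -0.98, 0.59, 1.53, 2.48):
    print(f"{measure:+.2f} logits -> expected score {expected_score(measure):.2f}")
```

Under these assumptions the computed values land within about 0.01 of the tabled expected scores, which suggests that the half-score transitions are simply the points where the model-expected rating crosses 1.5, 2.0, and so on.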

DISCUSSION

The results of this study support the validity of the motor scale of the AMPS. This study also demonstrates the advantages of the use of the many-faceted Rasch model and the FACETS computer program to construct and validate measures. First, I have shown that it is possible to construct a single variable, a common measurement scale, that considers simultaneously the easiness of the skill items, the simplicity of the tasks, and the leniency of the rater in the calculation of person IADL motor skill measures. Second, I have shown how the detailed facet element fit statistics can be used to monitor and verify the validity of the scale, the consistency and leniency of the raters, and the responses of the individuals that are evaluated. Third, when the fit statistics signal unexpected behavior, I have shown how the source of the disturbance can be identified. Therapists can use this information to make informed decisions about the validity of the measures and the functional limitations of the person evaluated. This information also can be used to make informed decisions about modification of the skill items and tasks or the provision of rater feedback.

For example, the skill items Coordinates and Calibrates failed to demonstrate adequate fit to the Rasch model. In this instance, it was possible to determine that the source of the inconsistency was related to a few subjects and a few raters. This information was used to provide these raters with feedback regarding their inconsistent scoring, and to clarify for them the scoring criteria of these two skill items. These raters, who are now undergoing recalibration, can be monitored over time to evaluate the effects of the feedback on their scoring behavior.

As the development of the AMPS motor scale proceeds, those skill items with low mean-square values should be monitored carefully. Future investigation should focus on verification of the presence of dependency among these skill items, and perhaps, the shortening of the assessment by omitting redundant items. As more of the 31 tasks currently included in the AMPS manual are calibrated into the measurement system, they will need to be monitored both for their fit to the measurement model and for their level of simplicity. The present results suggest the need to add more challenging tasks targeted at individuals whose ability measures are located near the transition zone between competent and questionable performance (see Figures 9-3, 9-4, and 9-5). The calibration of less challenging tasks that can be used to better evaluate individuals with disabilities also is needed.

SUMMARY

The FACETS Rasch analysis computer program is the first practical method that corrects person ability measures for differences among raters and, simultaneously, for variation in the simplicity of the tasks performed by the individual.
The resulting person measures are not affected by the leniency of the particular rater who observed the performance, or by the simplicity of the particular tasks the person performed (Lunz & Stahl, 1990; Lunz et al., 1990).

The feasibility of constructing a valid objective measurement system that meets the requirements of clinical practice has been demonstrated in this pilot study. The calibration of the skill items and the tasks on the same scale enables therapists to relate the discrete skill items directly to IADL tasks according to their relative positions on the common scale. This calibration takes advantage of all available observations, and the standardization process does not require a sophisticated or complete rating plan in which all persons must be scored on all items or on more than a few tasks. Moreover, when people are evaluated using the AMPS motor scale, their motor skill abilities can be related to all of the tasks calibrated in the measurement system, whether or not the person performed those tasks. Finally, when individuals do have unexpected patterns of scores that result in misfit to the Rasch-modeled expectations, their pattern of scores can be analyzed in order to interpret how the relationship between their motor skill deficits and their IADL task performance abilities differs from expectations.

This study also supports the feasibility of developing a functional assessment that gives consideration to the motivation, interests, and needs of the individual, and that accounts for the leniency of the rater. Through the calibration of a bank of tasks that provides task choice options, individuals can select those tasks that are familiar to them and that reflect their values and interests.

Finally, when the AMPS motor scale is used in clinical practice to evaluate people for whom there is concern about limitations in functional performance, therapists will be able to determine how those individuals would be expected to perform on tasks that are more or less challenging than those actually observed. Thus, therapists will be able to provide more detailed and accurate information to assist with important decisions about whether or not elderly or disabled individuals can live independently in the community. If assistance is required, they will have information about the level and type of assistance needed.

REFERENCES

Andrich, D. (1988). Rasch models of measurement (Sage University Paper series on Quantitative Applications in the Social Sciences, 07-068). Beverly Hills, CA: Sage.
Bernspang, B., Asplund, K., Eriksson, S., & Fugl-Meyer, A.R. (1987). Motor and perceptual impairments in acute stroke patients: Effects on self-care ability. Stroke, 18, 1081-1987.
Case-Smith, J., Fisher, A.G., & Bauer, D. (1989). An analysis of the relationship between proximal and distal motor control. American Journal of Occupational Therapy, 43, 657-662.
Consensus Development Panel (1988). National Institutes of Health Consensus Development Conference statement: Geriatric assessment methods for clinical decision-making. Journal of the American Geriatrics Society, 36, 342-347.
Eakin, P. (1989). Assessments of activities of daily living: A critical review. British Journal of Occupational Therapy, 52, 11-15.
Faletti, M.V. (1984). Human factors research and functional environments for the aged. In I. Altman, M.P. Lawton, & J.F. Wohlwill (Eds.), Elderly people and the environment (pp. 191-237, Human Behavior and Environment, Vol. 7). New York: Plenum Press.
Fisher, A.G. (1991). Assessment of motor and process skills (research ed. 5-R.2). Unpublished test manual available from the Department of Occupational Therapy, University of Illinois at Chicago.
Fisher, A.G., Murray, E.A., & Bundy, A.C. (1991). Sensory integration: Theory and practice. Philadelphia: F.A. Davis.
Fuhrer, M.J. (1987). Overview of outcome analysis in rehabilitation. In M.J. Fuhrer (Ed.), Rehabilitation outcomes: Analysis and measurement (pp. 1-15). Baltimore: Paul H. Brookes.
George, L.K., & Fillenbaum, G.G. (1985). OARS methodology: A decade of e Society, 33, 607-613.
Granger, C.V., & Gresham, G.E. (Eds.). (1984). Functional assessment in rehabilitation medicine. Baltimore: Williams & Wilkins.
Guralnik, J.M., Branch, L.G., Cummings, S.R., & Curb, J.D. (1989). Physical performance measures in aging research. Journal of Gerontology, 44, M141-146.
Jongbloed, L. (1986). Prediction of function after stroke: A critical review. Stroke, 17, 765-775.
Jongbloed, L., Brighton, C., & Stacey, S. (1988). Factors associated with independent meal preparation, self-care and mobility in CVA clients. Canadian Journal of Occupational Therapy, 55, 259-263.
Kane, R.A., & Kane, R.L. (1981). Assessing the elderly (pp. 1-23). Lexington, MA: Lexington Books.
Keith, R.A. (1984). Functional assessment measures in medical rehabilitation: Current status. Archives of Physical Medicine and Rehabilitation, 65, 74-78.
Kielhofner, G., & Burke, J.P. (1985). Components and determinants of human occupation. In G. Kielhofner (Ed.), A model of human occupation: Theory and application (pp. 12-41). Baltimore: Williams & Wilkins.
Law, M., & Letts, L. (1989). A critical review of scales of activities of daily living. American Journal of Occupational Therapy, 43, 522-528.
Lawton, M.P. (1987). Behavioral and social components of functional capacity. In National Institutes of Health (Author), Consensus Development Conference on Geriatric Assessment Methods for Clinical Decision making (pp. 23-29). (Available from the National Institutes of Health, Washington, DC)
Linacre, J.M. (1988). FACETS computer program for many-faceted Rasch measurement. Chicago: MESA.
Linacre, J.M. (1989). Many-faceted Rasch measurement. Chicago: MESA.
Lunz, M.E., & Stahl, J.A. (1990). Judge consistency and severity across grading periods. Evaluation and the Health Professions, 13, 425-444.
Lunz, M.E., Wright, B.D., & Linacre, J.M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3, 331-345.
Pincus, T., Callahan, L.F., Brooks, R.H., Fuchs, H.A., Olsen, N.J., & Kaye, J.J. (1989). Self-report questionnaire scores in rheumatoid arthritis compared with traditional physical, radiographic, and laboratory measures. Annals of Internal Medicine, 110, 259-266.
Reed, B.R., Jagust, W.J., & Seab, J.P. (1989). Mental status as a predictor of daily function in progressive dementia. Gerontologist, 29, 804-807.
Rubenstein, L.Z., Schairer, C., Wieland, G.D., & Kane, R. (1984). Systematic biases in functional status assessment of elderly adults: Effects of different data sources. Journal of Gerontology, 39, 686-691.
Silverstein, B., Kilgore, K., & Fisher, W. (1989). Implementing patient tracking systems and using functional assessment scales (Center for Rehabilitation Outcome Analysis monograph series on issues and methods in rehabilitation outcome analysis, Vol. 1). Wheaton, IL: Marianjoy Rehabilitation Center.
Skurla, E., Rogers, J.C., & Sunderland, T. (1988). Direct assessment of activities of daily living in Alzheimer's disease: A controlled study. Journal of the American Geriatrics Society, 36, 97-103.
Teri, L., Borson, S., Kiyak, H.A., & Yamagishi, M. (1989). Behavioral disturbance, cognitive dysfunction, and functional skill: Prevalence and relationship in Alzheimer's disease. Journal of the American Geriatrics Society, 37, 109-116.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.
Wright, B.D., & Stone, M.H. (1979). Best test design. Chicago: MESA Press.

chapter 10

Measuring Chemical Properties With the Rasch Model

T.K. Rehfeldt
The Sherwin-Williams Co.

In the paint and coatings industry we do many careful quantitative experiments to develop better coatings. The use of sophisticated experimental designs is increasingly important. We carefully analyze the data that we obtain from these experiments. We carefully measure the processing and composition variables during the experiment, and we make every effort to control variables. However, the responses we measure for these designed experiments frequently take the form of subjective ratings on arbitrary scales.

Solvent resistance is one such test. Here we are interested in what happens when solvent is spilled on the surface. For automotive coatings, solvents of interest are gasoline, methanol, and engine coolant. Obviously, we don't want a little gasoline to leave a visible mark on the car. Other tests are described below.

Stain resistance is similar to solvent resistance but with more persistent substances, such as grease, oil, and tar. No one wants an automobile covered with oil stains that cannot be removed. Corrosion protection, salt spray, and weathering refer to the effects of water, sunlight, road salt, and environmental conditions on the finish. Blocking and blistering are related to the hardness and durability of the finish and how well the finish coat adheres to the primer or the metal surface. Orange peel is the presence of texture in the paint that makes the finish look like an orange. Excessive orange peel or other texture is objectionable in a high-quality paint. A whole series of appearance tests, such as texture, color, color match of adjacent parts of the car, and gloss are also based upon ratings.

All of these tests are very important to paint retailers and customers, because a poor paint job can kill a sale. The very nature of the rating process ensures that these scales are subjective. At best, the rating process produces ordinal rankings; but proper evaluation requires quantitative interval measures. In this chapter I describe how we have used the Rasch model to obtain the interval measures that are required but are only implied by the rankings.

EARLIER ANALYTIC TECHNIQUES

There have been attempts to overcome the problem of subjectivity in rating scales of paint performance. Usually a reference material, whose properties are known, is included with the experimental materials. However, this does not ensure equal interval, repeatable, objective scales.

Rank order statistics are used to evaluate rating scale data (Lehmann, 1975; Sprent, 1989; Siegel, 1956; Hill & Prane, 1984). Here the scale itself is ignored, and the various paints under test are ranked from the best to worst by one or several judges. Rank order calculations are used to equate the rankings of the various judges. These rankings usually work well; judges will rank a group of paints in the same order, subject to experimental error, and it is easy to tell which is the best and which is the worst. However, the ranking techniques do not provide objective scales of measurement. The rankings show which is better, but there is no way to tell how much better one coating is than the others. This is particularly troublesome in the middle range of the rankings, where discrimination is more important.

In general, it is not difficult for judges to agree upon very good performance. Judges will also usually agree on very poor performance. However, this is not where most experiments will take place. One goal of industrial experimentation is to provide good performance at lower costs; thus, we are constantly looking for small improvements, or incremental changes in performance. This is where discrimination is most important. We want to know how low we can push one part of a formulation and still get an acceptable performance in the middle ratings. Thus, for rank order methods the greatest uncertainty occurs at the place where precision is most important. Another consideration is that one must always deal with a group of coatings and references; the usual rank order statistics do not provide an objective scale that can be used in subsequent testing, where the make-up of the group will often change.

Differences among coatings, which are part of a statistically designed experiment, are often analyzed by multiple analysis of variance (ANOVA) on the raw scores. In these cases each facet or factor is examined as a treatment level. However, this technique only detects which factors are associated with differences in the performance. It does not rank the various paints, nor does it construct useful rating scales. Often every factor appears to be significant in an analysis of variance. Furthermore, the initial scores, provided by the experts, do not, by themselves, provide the interval scale necessary to do the analysis of variance properly.

What we would like is a technique that will allow for the differences between judges, that will measure the relative performance of the coatings, and that will produce an equal interval scale for use in subsequent testing.

An Example

The application of the Rasch model (Rasch, 1960; Wright & Panchapakesan, 1969; Wright & Linacre, 1987; Wright & Stone, 1979; Wright & Masters, 1981) to these paint problems will be illustrated by examination of an experiment in which the response of interest was stain resistance. This experiment was chosen for this work because it is typical, in design and extent, of experiments conducted in our development laboratories, and, thus, is a useful test case (Rehfeldt, 1990).

METHOD

An experiment was conducted that investigated seven different polymer formulas.
Each formula was evaluated with two hardeners. The response variable was stain resistance; for this experiment the stain was applied to the test paints at three concentrations. The staining agent was placed on the test panels and allowed to remain overnight. The stain was then washed off with 10 double rubs of a cloth saturated with a suitable solvent. The appearance of the stained and cleaned area was evaluated by the judges.

A completely balanced design was used: we examined all combinations of polymer with hardener with stain concentration. This produced 42 test results. Each of the 42 tests was rated by five judges on a scale of 0 to 8, where 0 is total failure and 8 is superior performance. The raw ratings for the experiment are shown in Table 10-1. The paint samples are designated by polymer, 61-67; hardener, A and B; and stain concentration, low, medium, and high.
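The balanced design and the raw scores it yields can be sketched in a few lines. The example below is an illustration only: it enumerates the 7 x 2 x 3 design in the observation order used in Table 10-1 (an assumption read off that table), transcribes the five judges' ratings from Table 10-1, and sums them by judge, hardener, and stain concentration. The variable and function names are hypothetical.

```python
from itertools import product

POLYMERS = range(61, 68)
HARDENERS = ["A", "B"]
CONCENTRATIONS = ["Low", "Med", "High"]

# Observation order assumed from Table 10-1: concentration varies slowest,
# then hardener, then polymer.
conditions = [(polymer, hardener, conc)
              for conc, hardener, polymer in product(CONCENTRATIONS, HARDENERS, POLYMERS)]

# Raw ratings transcribed from Table 10-1, one string per judge, observations 1-42.
raw = {
    "BHA": "8 7 6 6 7 7 7 6 2 1 6 6 4 5 8 5 5 6 7 7 8 0 0 0 5 3 0 4 6 5 5 6 6 7 8 0 0 0 5 3 0 1",
    "KIL": "8 8 8 7 8 7 7 7 3 2 7 7 5 6 8 6 6 6 8 8 8 0 0 0 6 4 1 5 7 6 6 6 7 8 8 0 0 0 6 4 0 2",
    "LMF": "8 7 7 6 7 6 7 5 2 1 6 6 4 5 7 5 5 6 8 7 8 0 0 0 5 4 1 4 6 5 5 5 6 7 8 0 0 0 5 3 0 2",
    "DIA": "8 8 8 7 8 8 7 6 3 2 7 7 5 6 8 5 5 6 8 8 8 0 0 0 5 3 0 3 7 5 5 6 7 7 8 0 0 0 5 3 0 1",
    "PCC": "8 7 7 6 7 7 7 6 3 2 6 5 4 5 7 5 5 5 7 7 8 0 0 0 5 4 0 4 6 5 4 5 6 7 8 0 0 0 5 4 0 1",
}
ratings = {judge: [int(v) for v in row.split()] for judge, row in raw.items()}

# Raw score per judge: the sum of that judge's 42 ratings.
judge_totals = {judge: sum(values) for judge, values in ratings.items()}


def total_for(position, level):
    """Sum all judges' ratings over the tests whose condition matches `level`.

    position: 0 = polymer, 1 = hardener, 2 = stain concentration
    """
    return sum(values[i]
               for values in ratings.values()
               for i, condition in enumerate(conditions)
               if condition[position] == level)


hardener_totals = {h: total_for(1, h) for h in HARDENERS}
concentration_totals = {c: total_for(2, c) for c in CONCENTRATIONS}

print(len(conditions), judge_totals, hardener_totals, concentration_totals)
```

These raw sums agree with the Score columns reported for judges, hardeners, and stain concentrations in Tables 10-3 to 10-5; they are ordinal totals only, which is why the analysis that follows converts them to logit measures.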

RESULTS From the data in Table 10-1, an objective equal interval scale was constructed by application of the basic Rasch model t h a t produced a rating scale. The logit measures were estimated for each of the 42 tests. The fit of this model was good. There were no misfitting judges or panels. The test panel separation was about 1.6, and the separation reliability was about 0.8 (both on standardized residuals). Mean square errors were less t h a n 1 on the logit scale. Since the data set was rather smaller than the more traditional uses of this method the convergence was somewhat slower (100 to 150 iterations are typical). The position of each test coating was estimated on the scale. This scale, and the positions of each paint along the scale, are shown in Figure 10-1 and given in Table 10-2. Figure 10-1 shows the distribution of test paints on the scale. This plot tells us several things about our test paints. Polymer 67, when used with hardener A, is the best performer, since it is highest on the scale for both high and medium stain concentration. Polymer 61, with hardener A, gives equal performance, but only at the low stain concentration. As a result of the linear, equal interval scale, we can tell that, for example, the improvement in performance between the median, 1.5 logits, and the polymer 61/hardener B/low stain combination is the same as the improvement between polymer 64/hardener B/low stain and polymer 66/hardener A/low stain. In other words the odds that polymer 61/hardener B/low stain will pass the stain resistance test is about 4.5 times the median performance. Likewise the odds that polymer 66/hardener A/low stain will pass the test is 4.5 times the odds t h a t polymer 61/hardener B/low stain will pass the test. Also, the improvement between polymer 61/hardener B/low stain and polymer 63/hardener A/low stain is twice the improvement between polymer 67/hardener B/low stain and polymer 61/hardener B/low stain. We

180

REHFELDT

Table 1 0 - 1

R a w R a t i n g s f r o m Stain R e s i s t a n c e Experiment Judges

Conditions Obs

Poly.

Hard.

Cone.

BHA

KIL

LMF

DIA

PCC

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42

61 62 63 64 65 66 67 61 62 63 64 65 66 67 61 62 63 64 65 66 67 61 62 63 64 65 66 67 61 62 63 64 65 66 67 61 62 63 64 65 66 67

A A A A A A A B B B B B B B A A A A A A A B B B B B B B A A A A A A A B B B B B B B

Low Low Low Low Low Low Low Low Low Low Low Low Low Low Med Med Med Med Med Med Med Med Med Med Med Med Med Med High High High High High High High High High High High High High High

8 7 6 6 7 7 7 6 2 1 6 6 4 5 8 5 5 6 7 7 8 0 0 0 5 3 0 4 6 5 5 6 6 7 8 0 0 0 5 3 0 1

8 8 8 7 8 7 7 7 3 2 7 7 5 6 8 6 6 6 8 8 8 0 0 0 6 4 1 5 7 6 6 6 7 8 8 0 0 0 6 4 0 2

8 7 7 6 7 6 7 5 2 1 6 6 4 5 7 5 5 6 8 7 8 0 0 0 5 4 1 4 6 5 5 5 6 7 8 0 0 0 5 3 0 2

8 8 8 7 8 8 7 6 3 2 7 7 5 6 8 5 5 6 8 8 8 0 0 0 5 3 0 3 7 5 5 6 7 7 8 0 0 0 5 3 0 1

8 7 7 6 7 7 7 6 3 2 6 5 4 5 7 5 5 5 7 7 8 0 0 0 5 4 0 4 6 5 4 5 6 7 8 0 0 0 5 4 0 1

NB: 8 = Superior Performance and 0 = Complete Failure

MEASURING CHEMICAL PROPERTIES

181

Figure 10-1 Objective Scale of Stain Resistance Constructed from Raw Ratings

182

REHFELDT

Table 10-2 Summary of Stain Resistance Measurements by Hardener, Stain Concentration, and Polymer on Original Logit Scale Hardener B

Hardener A Stain Concentration

Polymer Polymer Polymer Polymer Polymer Polymer Polymer

64 65 67 61 66 62 63

High

Med

Low

High

Med

Low

2.05 3.86 8.41 3.86 6.00 1.16 0.74

2.50 7.32 8.41 7.32 6.63 1.16 1.16

3.86 6.63 5.41 8.41 5.41 6.63 6.00

1.16 -1.51 -3.81 -6.67 -6.67 -6.67 -6.67

1.16 -1.28 -0.80 -6.67 -5.66 -6.67 -6.67

3.86 3.40 1.60 2.95 -0.27 -2.38 -3.54

N.B. 8.41 Is the maximum measure and - 6.67 is the minimum measure.

have no hope of making this kind of inference from the original 0 - 8 scale. The Rasch Facets (Linacre, 1989) model was next applied to the data shown in Table 10-1. The fit of the model in this case was very similar to the fit, described above, for the simple rating scale case. There were no misfitting panels, and the standard errors were less than 1 for all cases. If we use the extension of the Rasch model to the multifaceted case, then we can partition the effects of the separate facets, still on the equal interval scale. A model of this type was calculated, and the results, shown in Tables 10-2, 10-3, and 10-4, were obtained. In Table 10-3 we see the overall effect of the hardener on the performance of these test paints. We see immediately that hardener A is better t h a n hardener B. This is the average effect of hardener, separated from the other variables. This means that, for any combination of polymer and stain concentration, the addition of hardener A will give better performance than hardener B. Thus, for whatever polymer Table 10-3 Hardener 1 2

A B

Table Mean: Table S.D.:

Effect of Hardener on Stain Resistance Score

Count

Measure Logit

Model Error

705 279

105 105

1.61 -1.61

0.12 0.07

492 213

105

*Centered during estimation

—

0.00* 1.61

0.10 0.02

MEASURING CHEMICAL PROPERTIES

183

Table 10-4 Effect of Stain Concentration on Stain Resistance Score

Count

Measure Logit

Model Error

LOW MED HIGH

415 300 269

70 70 70

1.09 -0.38 -0.71

0.12 0.11 0.10

Table Mean: Table S.D.:

328 62

70 0

Concen.

0.00* 0.79

0.11 0.01

*Centered during estimation

is chosen, you will be better off with hardener A. This is equivalent to an ANOVA, but the values used are interval measures and the emphasis is on the amount or magnitude of the effect. Often an ANOVA is made, in this context, which examines only the statistical significance of the effect, and the magnitude of the effect is left uninterpreted because the scale is not meaningful (see, for example, Broder, Kordomenos, & Thomson, 1988). In Table 10-4 we see the effect of stain concentration. Each polymer, when all conditions are considered, has better stain resistance when the stain concentration is low than when medium or high. This is to be expected, but we now have a quantitative estimate of the differences. The odds, for any combination of polymer and hardener, t h a t a paint will pass the stain resistance test is about six times better at the lower stain concentration than at the high concentration. In addition, we have a basis, as we shall see below, for detecting unusual performance, and, hence, unexpected results. DIFFERENCES AMONG J U D G E S Table 10-5 illustrates one of the primary advantages of the Rasch model analysis over naive interpretation of the raw ratings. It is evident from this table t h a t these five judges do not rate in the same way—different judges give different ratings to the same paint panel. Here are two groups of judges: KIL and DIA, who are similar in the leniency of their ratings at 0.38 and 0.12 logits; and LMF, BHA, and PCC, who are also similar among themselves, at —0.15, —0.17, and - 0 . 1 7 logits, respectively, but who are significantly more severe than the previous group of two. The latter group is about 0.35 logits more stringent, or harsher in their ratings, t h a n the former. If this difference is not considered in the analysis of the data, then the ratings

Table 10-5
Differences in Judges' Rating Behavior of Stain Resistance

Judge          Score   Count   Measure (Logit)   Model Error
KIL            216     42       0.38             0.14
DIA            203     42       0.12             0.14

LMF            189     42      -0.15             0.14
BHA            188     42      -0.17             0.14
PCC            188     42      -0.17             0.14

Table Mean:    196     42       0.00*            0.14
Table S.D.:     11      0       0.22             0.00

*Centered during estimation

obtained depend, at least in part, on who does the rating and not on the performance of the paint. It may be argued that the rankings of the coatings may be the same even though each judge gives different individual ratings. While this may be true, and could be used in experiments of this type, it implicitly places two restrictions on the data analysis. First, the experiment must contain enough samples to provide significance to the rankings, which means 10 or more samples. Second, the rankings are only adequate for the experiment at hand and cannot be used to evaluate subsequent measurements of the property (here, stain resistance). One must, thus, conduct a complete experiment with at least 10 trials for each evaluation.

Further Analysis

In Table 10-6 we see the effect of the polymer on the performance. The measure order of the polymers is from best to worst, again separated from the other variables. We get our positions on an objective scale, and the scale can be used for one or a few subsequent measurements without running the entire experiment over again. Further, the measures tell us not only which polymer is better, but how much better as well.

We can use the measures determined by the model in several ways. Table 10-2 shows the summary of the experiment plotted in Figure 10-1. The scale measures are in the body of the table. The polymers are in order of decreasing performance. The columns of the table show the effect of stain concentration and hardener. The values here are the logits on the original scale.
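Because the measures are in logits, a difference between two measures converts to an odds ratio by exponentiation. The following minimal sketch reproduces the "about six times" figure quoted above for stain concentration, assuming the Table 10-4 measures of 1.09 (low) and -0.71 (high). The function name `odds_ratio` is illustrative only and is not part of the original analysis.

```python
import math


def odds_ratio(logit_a: float, logit_b: float) -> float:
    """Odds ratio implied by two measures on the same logit scale."""
    return math.exp(logit_a - logit_b)


# Stain-concentration measures as reported in Table 10-4 (logits).
concentration = {"LOW": 1.09, "MED": -0.38, "HIGH": -0.71}

low_vs_high = odds_ratio(concentration["LOW"], concentration["HIGH"])
print(f"LOW vs HIGH odds ratio: {low_vs_high:.2f}")
```

The same conversion applies to any pair of measures on the scale, which is what allows the size of an effect, and not only its rank order, to be stated.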

Table 10-6
Effect of Polymer on Stain Resistance

Polymer        Score   Count   Measure (Logit)   Model Error
64 (Best)      173     30       1.38             0.16
65             173     30       1.38             0.16
67             169     30       1.29             0.15

61             140     30       0.61             0.16
66             132     30       0.37             0.18

62             102     30      -0.60             0.18
63 (Worst)      95     30      -0.83             0.18

Table Mean:    140     30.0     0.51             0.17
Table S.D.:     30      0.0     0.86             0.01

NB: There are three groups.

We see here that Polymer 64 is rated best overall by virtue of its total performance. While lower, for example, than Polymer 67, with high concentration of stain and hardener A, Polymer 64 is more consistent over the various stain concentrations and with both hardeners. In fact, Polymer 64 is the only polymer that did not receive negative measures with Hardener B.

We can also see an interesting anomaly. Polymer 67, with hardener A, performs better with high stain concentrations than it does with lower stain concentrations. This is not expected, and may be important for formulation of this type of coating. This anomaly was found by examination of the residuals from a multifaceted analysis. We see this in Table 10-7. Here, the expected ratings, near 8, are shown with the residuals. Polymer 67 was rated lower, at 7, than was expected by all the judges. This would indicate an area for further investigation.

Table 10-7
Residuals Analysis of Stain Resistance Measurement

Polymer/Hardener   Conc.   Judge   Obs.   Expect.   Residual
67A                LOW     BHA     7      7.9       -0.9
67A                LOW     KIL     7      8.0       -1.0
67A                LOW     LMF     7      7.9       -0.9
67A                LOW     DIA     7      8.0       -1.0
67A                LOW     PCC     7      7.9       -0.9

Finally, we can use the FACETS analysis to combine effects of the variables if we desire. For example, in Table 10-8, we have combined

Table 10-8
Ranking of Stain Resistance with Polymer and Hardener Combined

Polymer/Hard   Score   Count   Logit    Error
67A            115     15       5.97    0.55
61A            110     15       4.77    0.45
66A            108     15       4.39    0.43
65A            107     15       4.20    0.42
62A             89     15       1.64    0.35
64A             89     15       1.64    0.35
63A             87     15       1.40    0.34
64B             84     15       1.08    0.32
65B             66     15       0.20    0.23
61B             30     15      -2.02    0.27
67B             24     15      -2.52    0.30
62B             13     15      -3.51    0.30
63B              8     15      -4.00    0.33

Table Mean:     72     15       0.99    0.36
Table S.D.:     38      0.0     3.16    0.08

the effects of the polymer type and the hardener by using the polymer/hardener combination as a single facet rather than as two facets. In this analysis each polymer/hardener combination was entered as a separate element of that facet (polymer 61/hardener A, for example, is a single factor), so the dimensions of the data matrix are changed from 7 x 2 x 3 x 5 (polymers, hardeners, concentrations, judges, respectively) to 14 x 3 x 5 (polymer/hardener combinations, concentrations, judges). Here, we obtain positions of polymer and hardener combinations with respect to stain resistance. The scale, shown in Table 10-8, then, is the scale for the polymer/hardener combinations calculated with respect to stain resistance by the five judges. In this case we do not separate the effects of the polymer and hardener, so the polymer 67/hardener A entity is the best overall.

Further Application of the Rasch Model

We have begun to use the Rasch model for rating scales in several ways. One method that shows promise is to do the standard analysis and obtain the property map, such as the one shown in Figure 10-1. Once we have such a map, we can select a suitable number of paint panels scattered along the scale. We try to select a suitable number, 5, 8, or 10, depending on the test in question, and arrange them to approximate equal intervals. Then, for subsequent applications of the test we ask the judge to select the best match of the test piece panel


with one of the set of measured standards. In this manner the individual judge does not have to know anything about the analysis method, but the results are in line with the equal interval scale we want. We simply translate the match into the proper logit measure.

In another application we have started to examine color perception. This is important for automotive paints in particular. Even when we have gone to great lengths in spectroscopic analysis to assure a color match or color purity, we find that certain viewers can perceive a difference from the standard color. Until recently we were at a loss to control this feature of paints. We are now beginning to use the Rasch model to measure the color and match perception. In this manner we will obtain a measure of color perception which we can compare with the spectroscopic analysis. We believe that this additional testing will produce many fewer rejects on the basis of color than we currently experience.

SUMMARY

A method of overcoming the difficulties of rating scale rankings of paints has been demonstrated. The utility of the method includes construction of an objective measurement scale, detection and adjustment for differences in judges, measures of performance, means to detect outliers, and consistent measures from one experiment to the next. The model is suitable for rating scale, pass/fail, and minimum performance testing in paints and coatings. Such tests as stain resistance, solvent resistance, tape time, cross hatch adhesion, hardness, and other such tests with inherently large scatter are suitable candidates for Rasch facets analysis. When this model is used, rating scale rankings can be used to estimate experimental measures for regression and other designed experiments in a like manner to other quantitative measurements.

REFERENCES

Broder, M., Kordomenos, P.I., & Thomson, D.M. (1988). A statistically designed experiment for the study of a silver automotive basecoat. Journal of Coatings Technology, 60(766), 27.
Hill, E., & Prane, J.W. (1984). Applied techniques in statistics for selected industries: Coatings, paints, and pigments. New York: John Wiley and Sons.
Lehmann, E.L. (1975). Non-parametric statistical methods based on ranks. San Francisco: Holden-Day.


Linacre, J.M. (1989). Many-faceted Rasch measurement. Unpublished doctoral dissertation, University of Chicago.
Rasch, G. (1960). Probabilistic models for intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
Rehfeldt, T.K. (1990). Measurement and analysis of coatings properties. Journal of Coatings Technology, 60(790), 53-58.
Siegel, S. (1956). Non-parametric statistics. New York: McGraw-Hill.
Sprent, P. (1989). Applied non-parametric statistical methods. New York: Chapman-Hall.
Wright, B., & Linacre, M. (1987). Rasch model derived from objectivity. Rasch Measurement SIG Newsletter, 1(1), 2-3.
Wright, B., & Masters, J. (1981). Rating scale analysis. Chicago: MESA Press.
Wright, B., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23-48.
Wright, B., & Stone, M. (1979). Best test design. Chicago: MESA Press.

chapter

11

Impact of Additional Person Performance Data on Person, Judge, and Item Calibrations

John A. Stahl

National Association of Boards of Pharmacy

Mary E. Lunz

American Society of Clinical Pathologists

Achievement testing often relies on multiple choice items. Multiple choice items are economical when testing large populations, they have well-documented psychometric properties, and they are reliable because many items can be included in a test. The limitation of multiple choice items is proving that they do indeed measure competence to perform specified tasks. In most cases, they measure knowledge of how to perform the task. In any performance-related field, a direct observation and judgement of a candidate's ability to perform the desired tasks is preferable to the less direct measure of knowledge provided by multiple choice items.

The development of Rasch models to handle many-faceted measurement, in particular the FACETS program (Linacre, 1988), has opened the opportunity for developing economically feasible ways of making more direct assessments of candidate performances. Oral examinations, practical examinations, and essay examinations, all of which involve the use of judges, can now be used in assessing candidates without sacrificing the properties of objective measurement (Lunz, Wright, & Linacre, 1990; Lunz & Stahl, 1990).

More direct assessment of a candidate's capability to perform a particular task is desirable; however, we should not be too hasty in abandoning the information that can be obtained through more traditional testing instruments. In many cases, the area being tested involves both the capability to perform tasks and a base of essential knowledge. Knowledge can be tested efficiently with a question-and-answer format. Frequently, a multiple choice test is the most efficient method for gathering this information.

The ideal situation would be to use all of the available information concerning a candidate's capabilities when making the assessment. The traditional method is to use several testing instruments, make an independent assessment using each instrument, and then require the candidate to pass all parts. An alternative method is to combine all the available information into one single analysis. The flexibility of the FACETS program allows this alternate method to be explored.

This study is an exploration of single analysis assessment using several different test instruments. Data from a multiple choice written examination and from a judge-mediated practical examination are combined. Both tests were administered to the candidates, although a small subgroup took only one of the two tests. The combined data set was analyzed using the FACETS program, and the results of the analysis were compared to the results obtained from analyzing the multiple choice and practical examinations separately.

METHOD

The data are from the certification process in histology,¹ a clinical laboratory specialty. The first examination consisted of 173 multiple choice items administered to 417 candidates. The questions covered processing, cutting, and staining tissue and general laboratory operations. The second examination was a practical that required the candidates to prepare 15 histology slides according to prescribed criteria. These slides were prepared by 321 candidates and mailed to a central location for grading.
The slides were graded by a group of trained judges during a two-day grading session. The slides were graded on seven tasks: preparing the tissue block, labeling the slide, coverslipping the slide, obtaining the proper tissue sample size, processing the tissue, cutting the tissue, and staining the tissue. The candidates for both examinations consisted of individuals who had met the criteria to sit for the examination either by completing an approved program of instruction in histology or through a combination of on-the-job training and experience.

¹ Histology is the science concerned with the structure of cells, tissues, and organs in relation to their function. Histotechnology is concerned with the preparation of slides for use in the microscopic study of tissues.

ANALYSES

The multiple choice examination was analyzed using the BIGSCALE program (Wright, Linacre, & Schultz, 1990) for Rasch analysis. Measures for each examinee and difficulties for each item on the test were obtained. The practical examination was analyzed initially using the FACETS (Linacre, 1988) program for many-faceted Rasch analysis. For this examination, the probability of candidate $n$ with ability $B_n$ achieving score $x$ (rather than score $x-1$) on slide $i$ with difficulty $D_i$ from judge $j$ with severity $C_j$ was modeled as:

$$\log\left(\frac{P_{nijx}}{P_{nij(x-1)}}\right) = B_n - D_i - C_j - F_x$$

where
$P_{nijx}$ = probability of candidate $n$ being given score $x$ by judge $j$ on slide $i$
$P_{nij(x-1)}$ = probability of candidate $n$ being given score $x-1$ by judge $j$ on slide $i$
$B_n$ = ability of candidate $n$
$D_i$ = difficulty of slide $i$
$C_j$ = severity of judge $j$
$F_x$ = difficulty of achieving rating step $x$ relative to step $x-1$

The above equation is the general expression for the three-faceted Rasch rating scale model (Linacre, 1989, p. 62). The three components in the examination are the candidates, the items, and the judges. The probabilities of success are modeled as an additive combination of these three components. Taking the logarithm of the probability odds expresses these parameters in log-odds units (logits). Measures for each candidate, difficulties for each slide, and severities for each judge were obtained.

The data from the histology multiple choice and practical examinations were then combined into one data set and reanalyzed using the FACETS program. This analysis added a facet to the model to account for the dichotomously scored multiple choice items ($B_n - D_i - C_j - M_l - F_x$), where $M_l$ is the difficulty of the multiple choice items. This combined analysis resulted in a single measure for each candidate, a severity for each judge, and a difficulty for each slide and each multiple choice item. The results of these three analyses were then compared. Calibrations and measures from the analyses of the examinations were plotted against the corresponding results obtained from the combined analysis.
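The three-faceted rating scale model described in the ANALYSES section can be made concrete with a small numerical sketch. It assumes the standard adjacent-category form, in which the log-odds of score x over score x - 1 equal ability minus slide difficulty minus judge severity minus the step difficulty. The function name and every parameter value below are hypothetical and chosen only for illustration; this is not the FACETS estimation procedure, only the probability calculation for known parameters.

```python
import math


def category_probabilities(ability, difficulty, severity, steps):
    """Rating-category probabilities under a three-facet rating scale model.

    steps[k - 1] is the difficulty of step k relative to step k - 1.
    Returns [P_0, ..., P_m] for m = len(steps).
    """
    # Cumulative log-numerators: category 0 is the reference (0.0).
    cumulative = [0.0]
    for step in steps:
        cumulative.append(cumulative[-1] + (ability - difficulty - severity - step))
    peak = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical values in logits for one candidate, slide, and judge,
# with three steps as on a 0-3 rating scale.
probs = category_probabilities(ability=1.0, difficulty=0.2, severity=-0.1,
                               steps=[-1.0, 0.0, 1.0])

# Adjacent-category log-odds recover ability - difficulty - severity - step.
adjacent_log_odds = [math.log(probs[x] / probs[x - 1]) for x in range(1, len(probs))]
print([round(p, 3) for p in probs])
print([round(v, 3) for v in adjacent_log_odds])
```

A more severe judge (larger severity) lowers every adjacent-category log-odds by the same amount, which is the sense in which the facets combine additively.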

RESULTS

The Rasch fit statistics are a measure of the fit of the data to the model. The Infit (information weighted mean squared residual) is sensitive to an accumulation of central or inlying deviations. The Outfit (unweighted mean squared residual) is sensitive to occasional outlying deviations. Significant departures from expected indicate disruptions in the testing process. The fit statistics for the multiple choice items and for the slide items are presented in Tables 11-1 and 11-2, for both the individual and the combined analyses.

The multiple choice items show very little misfit. Two of the slide items show evidence of some misfit. Slide 3 has an Outfit of 2.2, indicating that there were some outlying scores. This was a relatively easy slide and the outlying scores were probably due to unexpectedly low ratings on this item given to a few examinees. Slide 9 had low Infits and Outfits, indicating that there was a greater than expected consistency in the ratings of this item, probably all 2s and 3s. The degree of misfit for these two items was not sufficient to preclude using them in the analysis.

Having determined that the data fit the model, we can now examine whether the simultaneous analysis of the two sets of results has introduced measurement disturbances. This is accomplished by comparing the pertinent measures derived from the separate analyses with those derived from the combined analysis.

In Figure 11-1, the item difficulties obtained from the initial BIGSCALE multiple choice examination analysis are plotted against the item difficulties obtained from the combined FACETS analysis. It can be seen that the multiple choice item calibrations were not affected by the addition of the practical examination data. In Figure 11-2, the calibrations of the slides obtained from the initial FACETS analysis are plotted against the slide calibrations obtained from the combined FACETS analysis. The slide calibrations were not substantially affected by the addition of the multiple choice item data. In Figure 11-3, the judge severities obtained from the initial FACETS


Table 11-1 Multiple Choice Items Fit Statistics

Combined Analysis

Individual Analysis Item

Infit

1

.9 1.0 .9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.1 1.0 1.0 1.0

2

3 4

5 6

7

8 9 10 11

12 13 14

15

16 17

18 19 20 21 22 23 24 25 26

27 28 29 30 31 32

33 34 35 36 37 38 39 40

41 42

1.1 1.0

1.1 1.0 1.0 .9 1.0 1.0 1.0 1.0 1.0 1.0

.9 1.0 1.1 .9 1.0 1.0 1.0 1.1 .9 1.0 1.0 1.1 1.0 1.0

Multiple Choice Items

Outfit

.9 .9 .7 1.1 1.0

1.2 1.0 1.0 1.0 1.1 1.0 1.0

1.1

1.0 1.0 .9 1.1 1.1 1.2 1.1 1.0 .8 1.0 .9 1.0 .9 1.0 1.0 .8 1.1 1.1 .9 1.0 .9

1.0 1.1 .9 1.1 1.2 1.3 1.1 .9

Infit

.9 .9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.1 1.0

1.0 1.0 .9 1.0 .9 1.0 1.0 1.0 1.0 .9 .9 1.0

.9 1.0 1.0 1.0 1.0 .9 1.0 1.0 1.0 1.0 .9

Outfit

.9 .9 .7 1.0 1.0 1.1

1.0 1.0 1.0 1.0 1.0 1.0 1.0

1.0 1.0 1.0 1.0

1.1 1.1

1.0 .9 .8 1.0 .9 1.0 .9 1.0 1.0 .8 .8 1.0 .9 1.0 .9 .9 1.1 .9 1.0 1.1 1.1 1.0 .9


43 44 45 46 47 48 49 50

1.0 1.1 1.0

1.0 1.1

1.0 1.0

1.0 1.0 1.0

1.0 1.0 1.0 1.0

1.1

51 52 53 54 55 56 57 58 59 60 61

62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85

1.0 1.0 1.0 .9 1.0 1.1 .9

.9 1.0 1.0 1.0 1.1 .9 1.0 .9 1.0 1.0 1.0 .9 .9 .9 1.0 1.0 1.0 1.0 1.1 1.0 1.0 .9 1.1 1.1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

1.0 .9 .9 1.1 .9 .9 1.0 1.0 1.2 1.1 .8 1.0 .8 1.0 .9 1.1 .9 .8 .9 1.0 1.0 1.1 .9 1.1 1.0 1.0 .9 1.1 1.1 1.0 .9 1.1

1.0 1.0 1.1 .9 1.0 1.1

.9 .9 1.0 .9 .9 1.0 .9 1.0 1.0 .9 1.0 .9 1.0 1.0 1.0

.9 .9 .9 1.0 1.0 1.0 .9 1.1 1.0 1.0 .9

1.0 1.1 1.0 .9 1.0 1.0 1.0 1.0 1.0 1.0 1.0

1.0

1.0 1.0 1.0

1.0 .9 .9 1.0 .9 .9 1.0 .9 1.1 1.0 .8 1.0 .9 1.0 .9 1.0 .9 .8 .9 1.0 1.0 1.0 .9 1.1 1.0 1.0 .9 1.0 1.1 1.0 .9 1.1 1.0 .9 1.0 .9 1.0 1.0


86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105

1.0 1.1 1.1 1.1 1.0 1.1 .9 1.1 1.0 1.0 .9 1.1 .9 .9 1.0 1.0 .9 1.0 .9 1.0 1.1 1.1 1.0 1.0 1.1 1.0

1.0 1.2 1.1 1.2 1.0 1.1 .9 1.2 1.0 1.0

1.0

.9 .9 1.0 1.0 .9 1.0 .9 1.0 1.1 1.2 1.1 1.0 1.1 1.0 1.1 .9 .9 .9 1.1 1.1

1.0 1.1 1.0 1.0 1.0 1.0 .9 1.1 1.0 1.0 .9 1.1 .9 .9 1.0 1.0 .9 1.0 .9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 .9 .9 1.0 1.0 1.1

.9 1.1

.9 1.0

1.0 .9

.9 .9 1.1

106 107

108 109 110 111 112 113 114 115 116 117 118 119 120 121

122 123 124 125 126 127 128

1.0

.9 .9 1.0 1.1 1.1 .9 1.1 .9 .9 1.1 1.0 1.0 1.1 1.1 1.0 1.0

.8 1.1

1.2

.9 1.0 1.1 1.1 .9 1.0

1.1 1.0

1.1 1.0

1.0 .9 1.1 .9 1.0 .8 1.1 .9 .9 .9 1.0 .9 1.0 .9 1.0 1.1 1.1 1.0 .9 1.0 1.0 1.0 .9 .9 .9

1.0 1.1 .9 1.0 .9 .9 1.1

.9 1.0 1.0 1.0

.8 1.0 1.0

1.0 .9

1.0 .9

1.0


129 130 131 132 133 134 135 136

1.0 1.0

1.0

1.0 1.0 .9 1.1 .9 1.0 1.0 .9 1.0 .9 1.0 1.0 .9 1.0 .9 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 .9 1.0 1.0 1.1 1.0 1.0 .9

1.0 1.0 .9 1.1 .9 1.0 1.0 .9 .9

137 138

139 140 141 142 143 144 145

146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162

.9 1.0 1.0 1.0 1.0 .9 1.0 1.0 .9 1.0 .9 1.0 1.0 1.0 1.0

165 166 167 168 169 170 171 172 173

1.0 1.0 1.0 1.1 .9 1.0 1.0 1.1 1.1 1.0 .9 1.0 .9 .9 1.1 .9 .9 .9 1.0 1.0 .9 1.0 1.1 1.0 1.0 1.0

Mean S.D.

1.0 .1

163 164


.9 1.1

1.1 .9

1.2 .9 1.0 1.1

.9 .9 .9 .9 1.0 .9 1.0 .9 1.0 1.0 .9 1.0 1.1 1.0 1.0 1.1 .9 1.0 1.0 1.3 1.1 1.1 .9 .9 .9 .9 1.2 .9 1.0 1.0 .9 1.0 1.1 1.1 .9 1.2

.9 1.1 .9 .9 .9 1.0 .9 .9 1.0 1.0 1.0 .9 1.0

.9 .8 1.0 .8 1.0 .9 1.0 1.0 .9 1.0 1.0 1.0 .9 1.0 .9 1.0 1.0 1.1 1.0 1.0 .9 .9 .9 .9 1.1 .9 .9 .9 1.0 .9 .9 1.0 1.0 1.0 .9 1.1

1.0 .1

1.0 .1

1.0 .1

.9 .9

.9 .9

Table 11-2
Slide Item Fit Statistics

          Individual Analysis     Combined Analysis
Item      Infit     Outfit        Infit     Outfit
1          .9        .8           1.0       1.0
2          .8        .9            .9        .8
3          .8       2.2            .9       2.2
4          .9        .6           1.0        .7
5          .9        .6           1.0        .6
6          .9        .9           1.0        .9
7          .9       1.4           1.0       1.5
8         1.0       1.5           1.1       1.6
9          .7        .4            .8        .5
10        1.0       1.0           1.1       1.2
11        1.2        .8           1.2        .8
12        1.2       1.0           1.2       1.0
13        1.1        .8           1.2        .9
14        1.3       1.3           1.5       1.4
15        1.2       1.1           1.3       1.0

Mean      1.0       1.0           1.0       1.0
S.D.       .2        .4            .2        .4
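The Infit and Outfit values tabulated above are mean squares of residuals. As a minimal sketch of how the two differ, the following assumes the usual definitions: Outfit is the unweighted mean of squared standardized residuals, and Infit is the sum of squared residuals divided by the sum of the model variances. The responses and probabilities are hypothetical dichotomous data, not values from the histology examinations, and the function name is illustrative.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean squares for one item or one person.

    outfit: unweighted mean of squared standardized residuals.
    infit:  squared residuals summed, divided by the summed model variance
            (an information-weighted mean square).
    """
    squared = [(x - e) ** 2 for x, e in zip(observed, expected)]
    outfit = sum(s / w for s, w in zip(squared, variance)) / len(squared)
    infit = sum(squared) / sum(variance)
    return infit, outfit


# Hypothetical 0/1 responses with model success probabilities p;
# for a 0/1 response the expectation is p and the variance is p * (1 - p).
p = [0.9, 0.7, 0.5, 0.3, 0.1]
x = [1, 1, 0, 1, 0]
var = [q * (1 - q) for q in p]
infit, outfit = fit_mean_squares(x, p, var)

# Add one surprising response: a failure where success was 95% likely.
p_out, x_out = p + [0.95], x + [0]
var_out = [q * (1 - q) for q in p_out]
infit_out, outfit_out = fit_mean_squares(x_out, p_out, var_out)

print(f"baseline:     infit={infit:.2f} outfit={outfit:.2f}")
print(f"with outlier: infit={infit_out:.2f} outfit={outfit_out:.2f}")
```

In this toy case the single unexpected response raises the Outfit much more than the Infit, which is the pattern reported for Slide 3 (Outfit of 2.2 with an Infit below 1.0).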

Figure 11-1. Written Item Calibrations: Written Exam vs. Combined Data. (Axes: written data, combined data.)

Figure 11-2. Slide Calibrations: Practical vs. Combined Data. (Axes: practical exam only, combined data.)

Figure 11-3. Judge Calibrations: Practical vs. Combined Data. (Axes: practical exam only, combined data.)

analysis are plotted against the judge severities obtained from the combined FACETS analysis. There is more variability between the judge severities derived from the two analyses, although the correlation is still high at .84 (p = .000 for a two tailed test of significance). The reason for this variability will become more apparent as we look at the examinee measures.

There are three measures for each person: (a) the multiple choice examination measure, (b) the practical examination measure, and (c) the combined FACETS analysis measure. The multiple choice examination measures are plotted against the combined FACETS analysis measures in Figure 11-4. There is a linear relationship, but the combined FACETS analysis measures are about .5 logits higher than the multiple choice examination measures. The correlation between the measures is .97 (p = .000 for a two tailed test of significance). In Figure 11-5, the person measures from the practical examination are plotted against the person measures from the combined FACETS analysis. The relationship is less strongly linear (correlation = .59, p = .000 for a two-tailed test of significance), and the combined FACETS measures tend to be lower than the practical examination measures.

These results suggest the following. First, the results of the multiple choice examination are having a much greater influence on the combined analysis than the results from the practical examination. This is logical, since the multiple choice examination consisted of 173 items, scored dichotomously, whereas the practical examination had only 77 judged responses per candidate, 15 responses scored on a 0-3 rating scale and the remainder scored on a 0-1 scale. Thus the multiple choice examination provided about 2.5 times the number of responses as the practical examination. Second, the practical examination was the easier of the two tests. Historically it has been harder to pass the multiple choice examination (about 50 percent pass) than the practical examination (about 80 percent pass).

The variation in judge severities between the practical analysis and the combined FACETS analysis can be attributed to the strong impact of the multiple choice test on the candidate measures. A candidate who is less able on the multiple choice examination forces down his or her combined analysis measure even if he or she is more able on the practical examination. The judges who graded that particular candidate appear more severe on the combined data analysis. The converse is true for candidates who were more able on the written examination than the practical. The increased ability of these individuals on the combined analysis has the effect of making the judges who graded these candidates appear less severe. A close examination of Figure 11-3 shows that the judge severities tend to split away from the identity line with some looking harder on the practical than on the combined

Figure 11-4. Person Measures.

Figure 11-5. Person Measures.

FACETS analysis and some looking easier on the practical than the combined FACETS analysis. This divergence from the identity line represents the expected impact of the examinee ability measures, adjusted for the written examination measures, on the judge severity calibrations.

An alternative is to equalize the contribution of the two examinations to the final certification decision. The traditional method is to analyze the examinations separately and require the examinee to pass both before being certified. Another approach is to weight the contribution of each assessment in a combined analysis in such a way that the contribution is equal. Since the multiple choice examination had more impact on the candidate measures, the combined FACETS analysis was repeated with the results weighted so that the contribution of each examination would be approximately equal. The errors of measure for the candidate measures derived from each test were compared. The error of measure from the practical was about three times larger than the error from the multiple choice examination. The contribution of the multiple choice examination was therefore weighted by a value of .3 to equalize the impact of the multiple choice examination in the combined measure and the analysis repeated.

The relevant measures and calibrations derived under the separate and weighted/combined conditions were compared. The comparisons of the multiple choice item difficulty calibrations and the slide item difficulty calibrations showed no significant change between the separate and combined analyses. The plots of these comparisons were identical to the plots seen in Figures 11-1 and 11-2. In Figure 11-6, the judge severities from the practical analysis are plotted against the weighted combined judge severities. The impact of the multiple choice items on the judge severities is still apparent; however, the degree of impact is less in the weighted analysis. The correlation is higher at .96 (p = .000 for a two tailed test of significance).

Figure 11-6. Judge Calibrations: Practical vs. Weighted Combined Data. (Axes: practical exam only, weighted combined data.)

In Figures 11-7 and 11-8, the candidate measures from the written and practical analyses are plotted against the weighted combined candidate measures. In Figure 11-7, the linear relationship is less well defined than it was in Figure 11-4, as the impact of the multiple choice examination is reduced. The correlation coefficient is now .87 (p = .000 for a two tailed test of significance). In Figure 11-8, the practical examination candidate measures and the weighted combined candidate measures have a more clearly linear relationship with a correlation of .80 (p = .000 for a two tailed test of significance). The combined analysis candidate measures are between the higher practical examination measures and the lower multiple choice measures. The contribution of each is relatively equivalent. This weighting could probably be fine-tuned even further, until the correlations between the results of the separate analyses and the weighted analysis become identical.
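The weight of .3 applied to the multiple choice data can be read as roughly the ratio of the two errors of measure. The sketch below illustrates that reading only. The chapter reports just the ratio (the practical error was about three times larger) and the resulting weight, so the standard error values here are hypothetical, and taking the simple ratio is an assumption about how the weight was derived.

```python
def equalizing_weight(se_precise: float, se_imprecise: float) -> float:
    """Down-weighting factor for the more precise test.

    Assumed rule: the ratio of the smaller error of measure to the larger.
    """
    return se_precise / se_imprecise


# Hypothetical typical errors of measure in logits; only their 1:3 ratio
# reflects what the text reports.
se_multiple_choice = 0.2
se_practical = 0.6

weight = equalizing_weight(se_multiple_choice, se_practical)
print(f"weight for multiple choice data: {weight:.1f}")
```

Other rules, such as weighting by the ratio of error variances, would give a different value, which is one reason the text notes that the weighting could be tuned further.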

DISCUSSION

Making assessments of a person's performance often can have serious implications. The more information that can be obtained and utilized for that assessment, the more reliable that assessment will be. Many instruments are available to obtain this information. These instruments include multiple choice tests, oral examinations, practical examinations, essay tests, and so on. This study was designed as an initial attempt to use the flexibility of the FACETS program to combine the information from two unique examinations, a practical examination and a multiple choice examination, and to explore the results of combining this information.

The results indicate that combined analysis can occur without significant disturbances in the measurement process. The calibrations of the item difficulties, both multiple choice items and practical slides, were virtually unaffected. Fit to the model of these items was acceptable and directly comparable to the fit from the separate analyses. The mean squared information weighted residual for the slides on both the

Figure 11-7. Person Measures.

Figure 11-8. Person Measures.

practical and the combined analysis had a mean of 1.0 and a standard deviation of .2. For the multiple choice items, the mean squared infit for both the multiple choice examination and the combined FACETS analysis had a mean of 1.0 and a standard deviation of .1. The largest impact appeared in the judge severity calibrations. Even here, the correlations between the calibrations derived from the different analyses were high and no change in the fit of the data to the model was observed (mean squared Infit for both analyses had a mean of 1.0 and a standard deviation of .1).

The impact on candidate measures indicated that care must be taken in assigning weight to the contribution of each individual test used in the combined analysis. Each examination is designed to test elements of a candidate's performance. The role that each of these elements contributes to the actual competence of a candidate to perform the tasks may influence the weight assigned to that task. The importance placed on the parts of the examinations by the examination board when making the final certification decision must also be considered. In this case, greater importance has been placed on the multiple choice examination, because the multiple choice examination tests the candidate's basic knowledge of histology in a broader context than the specific task-oriented practical examination. If this is the case, then the combined analysis may not be appropriate.

This study, however, demonstrated that it is possible to combine information from different types of examinations and to analyze the extent of their contribution to the final evaluation. Further research on combined FACETS analysis is necessary; however, such an approach may be more commensurate with the assessment of overall competence now that the technology and theoretical models are available.

REFERENCES

Linacre, J.M. (1988). FACETS, a computer program for the analysis of multifaceted data. Chicago: MESA Press.
Linacre, J.M. (1989). Many-faceted Rasch measurement. Chicago: MESA Press.
Lunz, M.E., & Stahl, J.A. (1990). A comparison of intra- and interjudge decision consistency using analytical and holistic scoring criteria. Journal of Allied Health, 19, 173-179.
Lunz, M.E., Wright, B.D., & Linacre, J.M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3, 331-345.
Wright, B.D., Linacre, J.M., & Schultz, M. (1990). BIGSCALE, Rasch-Model Rating Scale Analysis Computer Program. Chicago: MESA Press.

part III

Theory


chapter

12

Local Dependence: Objectively Measurable or Objectionably Abominable?*

Robert J. Jannarone

University of South Carolina

INTRODUCTION

This chapter concerns extending Rasch model-based objective measurement (Fisher, 1991; Rasch, 1980; Wright, 1980) to include a variety of locally dependent conjunctive measurement (LDCM) models (Jannarone, 1991). The Rasch model justifies the nearly universal practice of scoring tests by counting the number of binary items that are passed. LDCM models provide supplemental scoring schemes that use nonadditive combinations of item scores as well. In the process, LDCM violates the fundamental axiom of latent trait theory, which is a traditional basis for nearly all test models, including the Rasch model. LDCM models have been introduced and developed elsewhere (Jannarone, 1986, 1987, 1991; Jannarone, Yu, & Laughlin, 1990; Kelderman & Jannarone, 1989; Van der Linden & Jannarone, 1989). The issue to be raised here is whether LDCM offers useful cognitive measurement potential that the Rasch model does not, while preserving the essence of objective measurement.

* This chapter is dedicated to the memory of Rose and Peter.

Other attempts have been made to extend objective measurement, some of which are sharply attacked in this chapter. These attacks may upset some readers, especially if they ignore the fact that science rewards strong results with sharp rebukes (Kuhn, 1970; Popper, 1968). To set matters straight, the author admires and appreciates the contributors who are criticized below, without exception. Indeed, the results of this chapter would have been unthinkable without their extraordinary efforts. The chapter is organized as follows. First, the question of whether to extend objective measurement is addressed informally and answered affirmatively. Next, the question of how to extend objective measurement is addressed more carefully, with psychometric measurement history as a basis and specific guidelines as a result. Finally, locally dependent, conjunctive measurement is assessed against these guidelines, and conclusions are made regarding its utility.

SHOULD OBJECTIVE MEASUREMENT BE EXTENDED?

"If it's not broke, don't fix it!" is a common sentiment among people who just want to be productive. It is often a sound sentiment, because improvements are not easy to make. However, improvements surely cannot be made without being attempted, lending support to the opposite sentiment: "If it's not broke, fix it anyway!" When both short-term productivity and long-term optimality are concerns, then, it is natural to strike a balance between the two opposing sentiments.

The balance may lean more toward the fix-it than the don't-fix-it end among scientists, for several reasons. Researchers have been primarily trained and directed toward identifying, applying, and validating new ideas, which necessarily requires questioning and rejecting current ideas. Also, the history of science has repeatedly shown that pursuing new ideas, even without concrete goals in mind, can eventually result in important practical benefits. However, in pursuing either concrete, short-term goals or more abstract, long-term goals, researchers must use some established procedures and rely on some basic assumptions; otherwise they would have no basis at all for making progress. From the most practical to the most basic scientific work, then, the fix-it, don't-fix-it question is an important one.

The case in point for this chapter is model-based objective measurement (MOM), in the form of the Rasch model (Rasch, 1980; Wright, 1980). MOM is an interesting case for fix-it versus don't-fix-it study because it is currently being both heavily researched and widely used.

The contents of this edited series indicate much MOM-related activity, ranging from very basic research to very applied assessment. While some studying measurement foundations are naturally interested in extending MOM, others are perhaps either happy with MOM as it stands or too busy using MOM to seriously consider changing it, or both. The fix-MOM, don't-fix-MOM question, then, is a broad one that may have different answers for different researchers.

On the fix-MOM side, prudent extensions to the Rasch model could improve testing practice. For example, it is widely known that educational aptitude does not depend on achievement alone, but also on learning ability, strategy selection ability, motivation, and the efficient use of time. Yet MOM is limited for measuring effects of these skills on performance, because it is based on a certain restrictive axiom. Also, as computing power continues its remarkable growth, computerized testing and tutoring are certain to become widely used. Yet the same axiom limits MOM prospects for computer-based, dynamic ability assessment. In addition, reports indicate that MOM cannot properly measure characteristics of some items in current use, such as differential item discriminations (Lord, 1980) and dependencies due to shared content (Jannarone, 1991). Thus, extensions to MOM may be needed to broaden its formal domain as well as its utility.

The MOM axiom that limits its domain is the local independence assumption (Lazarsfeld, 1958), which is widely regarded as the fundamental axiom of latent trait theory (Lord & Novick, 1968, Sec. 24.5; Jannarone, 1991a). Local independence requires that measurement must be noninvasive (Jannarone, 1991) in that a person's future test behavior must be the same after responding to an item as it would have been before. By requiring that measurement be noninvasive, local independence prevents measuring a person's progress during a test as a function of his or her progress on previous items.
As a result, as long as local independence is imposed, some potentially interesting abilities will be neglected. Conversely, those who are interested in measuring such abilities will continue to neglect MOM in its current form.

Locally dependent instances can be found ranging from exercise-activity assessment settings, where injuries are recorded, to learning-activity settings, where task responses are recorded. Suppose, for example, that a binary "item score" is recorded weekly, indicating whether or not runners in a study have been injured (Macera, Pate, Powell, Jackson, Kendrick, & Craven, 1989). Running-injury incidence can obviously depend on recent running-injury history. Also, some people may be more likely to press on after an injury than others. As a consequence, running-injury measures may not only be locally dependent, but interesting individual differences in local dependencies may exist as well.

As a closely related example, suppose that item pairs have been constructed to reflect learning transfer, in the form of successfully learning information on one item and then successfully applying the learned information to a following item (Jannarone, 1987, 1991). As in the exercise activity case, (a) one item score is likely to depend (locally) on a preceding item score; and (b) individual differences in local dependencies (that is, learning transfer abilities) may be worth measuring.

In these and other instances, information may exist in test score patterns that cannot be measured by number-correct test scores alone. For example, counting the number of adjacent item pairs that are both passed can provide information about learning ability, if adjacent items are linked by content in certain ways (Jannarone, 1991). Yet the local independence axiom turns out to prohibit the use of such nonlinear, conjunctive scoring schemes (in a sense that will be shown below: when items are binary, item scores are equivalent to logical events, cross-products of which are called conjuncts, whence the term conjunctive). Arguments on the fix-MOM side, then, include prospects for allowing local dependencies among items and using nonadditive scoring schemes, both of which are prohibited by MOM in its current form.

On the don't-fix-MOM side, arguments can be made for continuing to use MOM as it stands. The strongest among these is the natural and important wish to retain simplicity and elegance, if possible. No method of combining item scores is simpler than adding them up, as prescribed by the Rasch model. Moreover, local independence is a necessary and sufficient condition for additivity (within a broad class of item response models, q.v.), which means that extending MOM along noninvasive lines would necessarily decrease its simplicity and elegance.
Also, it is usually the case that simple additive measurement works remarkably well relative to nonadditive alternatives, even when observations are generated according to nonadditive models (Jannarone, 1987). Number-correct scoring should be especially satisfactory for tests in current use, having items that were chosen with additivity in mind. In many practical settings, then, MOM in its current form can be expected to perform quite well.

Local independence is the issue of focus here for the fix-MOM, don't-fix-MOM question, although it is not the only one. Other issues have also emerged over the years, for which different Rasch model variants have been proposed. These include extensions along multidimensional, multiparameter, multicategory, and nonparametric lines. Although these extensions are not directly related to the local independence issue, they will be reviewed in the next section, in an attempt to identify the essence of objective measurement.

Arguments exist, therefore, for and against developing and applying extensions to MOM. For those who study psychometric foundations, the choice in favor of exploring ways to fix MOM is straightforward. Their only dilemma is how to develop general and potentially useful extensions to MOM that preserve good measurement properties. However, the choice tends to be more difficult for researchers who are more concerned with practical testing. Their choice requires assessing MOM alternatives for their particular needs (and within their busy schedules), rather than pursuing extended objective measurement for its own sake. They thus need to balance real rather than potential extended MOM utility against necessary increases in extended MOM complexity, which is not easy. It is hoped that the following description will aid researchers with both applied and basic interests, in choosing between MOM as it stands and extended locally dependent, conjunctive alternatives.
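The contrast drawn in this section between number-correct scoring and conjunctive scoring can be made concrete with a small sketch. The function names and the two response patterns are hypothetical, and counting every overlapping adjacent pair is an assumption made for illustration; the chapter's own item-pair designs may pair items differently.

```python
from typing import Sequence


def number_correct(scores: Sequence[int]) -> int:
    """Additive score: the count of binary items that are passed."""
    return sum(scores)


def adjacent_pairs_passed(scores: Sequence[int]) -> int:
    """Conjunctive score: the count of adjacent item pairs that are both passed.

    Each term a * b is a cross-product (conjunct) of two binary item scores,
    so it equals 1 only when both items in the pair are passed.
    """
    return sum(a * b for a, b in zip(scores, scores[1:]))


# Two hypothetical response patterns with the same number-correct score.
pattern_a = [1, 1, 1, 0, 0, 0]
pattern_b = [1, 0, 1, 0, 1, 0]

for name, pattern in (("a", pattern_a), ("b", pattern_b)):
    print(name, number_correct(pattern), adjacent_pairs_passed(pattern))
```

The two patterns are indistinguishable by number-correct score, yet the conjunctive count separates them, which is the kind of extra pattern information the fix-MOM argument points to.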

HOW SHOULD OBJECTIVE MEASUREMENT BE EXTENDED?

Since objective measurement has been closely tied to the Rasch model (Wright, 1980), the question of how to extend MOM will be addressed by first examining Rasch model attributes. Historical psychometric developments will then be reviewed, and resulting measurement extension guidelines will be proposed.

The Rasch model is remarkably simple and elegant, especially in terms of additivity. Indeed, it will be shown later that the Rasch model is the only additive member of a very general test model family. Moreover, the Rasch model offers a sound basis for measuring differential item characteristics and including them in the measurement process. Also, Rasch measurement results in sound and straightforward statistical inference procedures (Andersen, 1980). The Rasch model has other features that have become identified with "specifically objective measurement" (Fischer, 1981, 1991; Wright, 1980). In particular, Rasch measurement produces ability estimates that do not depend on item difficulty, along with item difficulty estimates that do not depend on abilities. Conceptually, this translates into a measuring process that "transcends the measurement instrument [by excluding person-by-item] interaction terms" (Wright, 1980). The most familiar form of the Rasch model is
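(As a sketch consistent with the definitions that follow, and not a quotation of the original display, form (1) can be written as below. The capital letters for random variables and the bold β for the item parameter vector are notational assumptions.)

```latex
\Pr\{X_{i1}=x_{i1},\ldots,X_{iM}=x_{iM}\mid\Theta_i=\theta_i;\boldsymbol{\beta}\}
  \;=\; \prod_{m=1}^{M}\frac{\exp\{(\theta_i-\beta_m)\,x_{im}\}}{1+\exp\{\theta_i-\beta_m\}}
  \;\propto\; \exp\Bigl\{\sum_{m=1}^{M}(\theta_i-\beta_m)\,x_{im}\Bigr\}
  \tag{1}
```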

where i indexes individuals, m indexes item measurements, the x_im are binary item scores, the β_m are item parameters, and the θ_i are person parameters. (For readers who are unfamiliar with the proportionality (∝) sign in (1), it indicates that the second expression is the first expression times a factor that does not depend on observed scores.) The M factors in (1), which are called item response functions, give the probabilities of passing component items as functions of θ. Rasch model local independence is evident from (1), because when person parameter (θ_i) values are fixed, joint item score probabilities are products of the component item response functions. The subtractive nature of Rasch model person and item parameters, which leads to item and person parameters being comparable on the same scale, is also evident from (1). Finally, since it produces individual differences measures that are simple number-correct scores, the Rasch model also provides for measurement validation by correlating item and total test scores with external measures. (Other useful Rasch model properties will be described after the exponential family of statistical models is reviewed below.)

[...] to regression and correlation methods (Galton, 1888; Pearson, 1896), which rely on component measures that have substantial (individual differences) variation, along with mutual (additive) covariation. Regression and correlation features were given prominence in both the classical test model (Spearman, 1904) and the closely related single-factor model (Spearman, 1927), and they remain prominent in modern test theory. A closely related development was the advent of analysis of variance (ANOVA; Box, 1978; Fisher, 1921) models. ANOVA and regression models are members of the general linear model family (Searle, 1971), all of which are based on additive associations between dependent and (perhaps nonadditive functions of) independent variables.
[...] analysis developments (MFA; Thurstone, 1932) and interactive ANOVA developments (Box, 1978; Fisher & Mackenzie, 1923) have provided important lessons for extending objective measurement. Both multiple factors and ANOVA interactions increase explanatory power by supplementing simpler models with extra parameters. Extra "factor loadings" are used in the MFA case to supplement classical test "true scores" with "factor scores" (Lord & Novick, 1968). Likewise, extra "interaction effects" are used in the ANOVA case to supplement ANOVA "main effects" (Scheffé, 1959).

Although similar in motivation, MFA models and extended ANOVA models have different forms and uses. The statistical form of the extended ANOVA model is nonadditive, in that cross-products of main effect (group indicator) variables are used as extra predictor variables. These extra observable variables are used in extended ANOVA estimation and inference to account for required extra parameters. By contrast, only additive functions of observable variables are used in the MFA model (excepting the covariances that are used for estimation and inference in general linear models and the one-factor model).

Because extended ANOVA parameters are accompanied by corresponding nonadditive statistics, statistical estimation and inference procedures remain straightforward in the extended ANOVA case, just as in the additive ANOVA case. By contrast, MFA procedures involve exotic constraints (such as simple structure; see Thurstone, 1947) and estimation procedures (Jöreskog & Sörbom, 1984) for dealing with difficult inference problems, some of which have yet to be resolved. As a result, while extended ANOVA models continue to be widely and successfully used, interest in MFA models seems to be decreasing (as indicated by fewer articles appearing in Psychometrika).

[...] and regression methods can be used for binary data, binary measures violate basic normality assumptions for these models.
The next major test theory development was the introduction of special binary item response theory (IRT) models, of the normal ogive (Ferguson, 1942; Lawley, 1943) and one-parameter logistic (Rasch, 1960/1980) types. These new models were introduced as completely new alternatives to, rather than extended versions of, classical and MFA models, in order to precisely reflect associations among binary item scores. Like the Rasch model, they were constructed with provisions for individual differences along with subtractive item and person parameters, in an attempt to place item effects and person effects on the same scale.

Local independence and latent trait theory. The next development, which was more broad and foundational than model specific, was the identification of local independence (Lazarsfeld, 1958) as a test theory axiom. Specifically, permissible latent trait models (Lord & Novick, 1968, chap. 25) were restricted to settings where all local dependencies could be explained by latent traits (as opposed to more broadly defined latent variable settings, which may not necessarily satisfy the local independence axiom; see Jannarone, 1991a; Suppes & Zanotti, 1981). In the process, fundamental local dependence in general and LDCM models in particular were excluded from orthodox latent trait theory, by definition.

In a related development, it was shown that for any latent variable model (including orthodox latent trait models as well as LDCM models), locally independent counterparts can always be constructed (this result was first published by Suppes & Zanotti in 1981; see also Holland & Rosenbaum, 1986; Jannarone, 1991a; Stout, 1987, 1990). At first glance, the result suggests that locally dependent modeling is of minor importance because locally independent counterparts can always be constructed. However, the Suppes and Zanotti alternative has been recognized as "vacuous," because it simply identifies each new item score with a separate parameter value, as each item score becomes observed. As a result, a basic statistical inference requirement, identifying the same latent variables with each of several observations, becomes lost in the process.

[...] have also been proposed, including multidimensional extensions of the Rasch model (Fischer, 1973; Whitely, 1980) to explain multiple person characteristics; multiparameter logistic models (Andrich, 1978; Birnbaum, 1958; Glas & Verhelst, 1989) to explain multiple item characteristics; and multidimensional, multiparameter models (Bock, 1972; Glas, 1991; Kelderman, 1984; McKinley & Reckase, 1983; Samejima, 1969; Wilson, 1989).
Tests based on nonparametrics (Mokken & Lewis, 1982; Holland, 1981; Holland & Rosenbaum, 1986; Rosenbaum, 1984, 1987; Stout, 1987, 1990) have been developed as well, for making inferences without having to make specific assumptions about item response function form. Most of these IRT extensions will be compared in more detail later in this section, once a basis for comparing them has been established.

Exponential family theory. In exponential family form (Lehmann, 1983, 1986), joint likelihood functions are expressed as exponents, which contain weighted sums of parameters. The parameter weights, which are called sufficient statistics, can be used for parameter estimation and inference. The exponential family format for the Rasch model (1) is,
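(As a sketch only: a form of (2) consistent with the M + I sufficient statistics discussed next collects the exponent of (1) over I persons, so that each person parameter is weighted by that person's number-correct score and each item parameter by that item's number of passes. The matrix symbols X and x and the index ranges are notational assumptions.)

```latex
\Pr\{\mathbf{X}=\mathbf{x}\mid\boldsymbol{\theta};\boldsymbol{\beta}\}
  \;\propto\; \exp\Bigl\{\sum_{i=1}^{I}\theta_i\Bigl(\sum_{m=1}^{M}x_{im}\Bigr)
  \;-\;\sum_{m=1}^{M}\beta_m\Bigl(\sum_{i=1}^{I}x_{im}\Bigr)\Bigr\}
  \tag{2}
```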

Exponential family formats identify distinct sources of information with distinct exponent terms. Statistical independence is always indicated by separable product factors, such as the M item response functions in (1). Since products are equivalent to exponential sums, distinct exponent sum terms in exponential family models are also independent, in a sense (but not in general, because the proportionality constant may not be factorable). For example, when the Rasch model is expressed in form (2), each of the M + I sufficient statistics is seen as a kind of independent information source for its corresponding parameter. More precisely, it follows from exponential family theory that estimation for each exponential family parameter depends only on its corresponding sufficient statistic, given the remaining sufficient statistics. This property, when applied to (2), results in person parameter and item parameter separability for the Rasch model, which was listed as a specific objectivity property earlier. Exponential family analysis can be used to identify test model strengths as well as weaknesses. If a given model can be expressed in exponential family form several highly useful statistical properties follow. These include guarantees that: (a) unique, optimal (maximum likelihood and conditional maximum likelihood) estimates exist; (b) such estimates can be found by straightforward estimation procedures (since exponential family likelihoods are convex—see Andersen, 1980); and (c) relatively simple, optimal inference procedures can be identified (due to exponential family monotone likelihood ratio, asymptotic normality, and other properties—see Lehmann, 1986). All of these properties are strengths of the Rasch model, because of its exponential form given in (2). Similar strengths apply to unextended ANOVA, multiple regression, and classical test models, because they can also be expressed in exponential family form. 
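The separability property described above can be checked numerically. The sketch below assumes the logistic item response functions of the Rasch model and uses hypothetical item difficulties; the function names are illustrative. It computes the probability of a response pattern given its number-correct score by brute-force enumeration and evaluates it at two different ability values.

```python
import math
from itertools import product


def pattern_probability(x, theta, betas):
    """Joint probability of a binary pattern x under Rasch local independence."""
    p = 1.0
    for x_m, beta_m in zip(x, betas):
        p_pass = 1.0 / (1.0 + math.exp(-(theta - beta_m)))
        p *= p_pass if x_m == 1 else 1.0 - p_pass
    return p


def conditional_given_score(x, theta, betas):
    """P(pattern x | number-correct score), by enumerating all binary patterns."""
    score = sum(x)
    total = sum(
        pattern_probability(y, theta, betas)
        for y in product((0, 1), repeat=len(betas))
        if sum(y) == score
    )
    return pattern_probability(x, theta, betas) / total


betas = [-1.0, 0.0, 1.5]  # hypothetical item difficulties
x = (1, 0, 1)             # hypothetical response pattern

low = conditional_given_score(x, theta=-2.0, betas=betas)
high = conditional_given_score(x, theta=2.0, betas=betas)
print(low, high)
```

The two conditional probabilities agree up to rounding error, because conditioning on the number-correct score removes the person parameter. This is the basis for estimating item parameters without reference to abilities.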
Some extended statistical models can also be viewed as sound, once they are represented as extended exponential family models. For example, the ANOVA model with interactions represents an extended exponential family model with extra terms. Each term involves an additional interaction effect parameter, along with a corresponding new sufficient statistic, when expressed in exponential family form. Likewise, the Spearman one-factor model involves component item weights, over and above the (true score) parameters associated with the classical test model. When expressed in exponential family form (based on standard normality assumptions; see Jöreskog & Sörbom, 1984), each such item weight becomes identified with an item variance parameter, which can be estimated by its corresponding item variance statistic. Thus, both of these models can be viewed as statistically sound extensions of their simpler ANOVA and classical test model counterparts. Other extended IRT models that belong in the exponential family (Andrich, 1978, 1985; Embretson, 1984; Fischer, 1973; Fischer & Formann, 1982; Kelderman, 1984; Masters, 1982; Whitely, 1980; Wilson, 1989) can be viewed as statistically sound as well.

Attempts to place some other test models in exponential form can expose deficits, however. For example, the multidimensional factor analysis model becomes exposed as having more parameters (population means, factor loadings, and uniqueness) than sufficient statistics (as restricted by normality assumptions: sample item means, sums-of-squares, and sums-of-cross-products). The MFA model thus comes up short, in that no exponential family form can be constructed with a sufficient statistic for each parameter. As a result, certain side conditions (Thurstone, 1947; Jöreskog & Sörbom, 1984) must be imposed to make MFA parameter estimation even possible, with no resulting guarantees of optimality.

Similar problems exist for other multiparameter IRT extensions (Birnbaum, 1968), along with their multivariate extensions. Exponential forms for both Birnbaum model extensions and constrained MFA model extensions involve some terms that have products of two parameters associated with a single statistic, which cannot be separately estimated. Thus, distinct estimates for the two parameters cannot be identified (in the formal sense; see Fischer, 1981; Jannarone, 1991a), and estimation problems result (Fischer, 1981; Mislevy & Stocking, 1987). (Although the above examples suggest that statistical soundness is equivalent to exponential family membership, this may not be the case.
For example, certain IRT models described by Kristoff, 1968, and Jannarone, 1991b, do not belong in the exponential family. Yet estimates with sound properties, such as uniqueness resulting from convexity, can be obtained for these models.)

The two-parameter Birnbaum model deserves special attention because it has been strongly endorsed (Lord, 1980), and it continues to be widely studied (Mislevy & Verhelst, 1990; N.D. Verhelst, personal communication, December 1989). Abbreviated as the 2P model elsewhere, it will be called the toupee model here, because of its cosmetic nature (see below). Both the Spearman model and the toupee model were proposed to link latent traits with weighted sums of item scores, rather than simple, unweighted sums. It seems strange at first glance, then, that the Spearman model is statistically sound by exponential family standards, while the toupee model is not. The reason lies in the fact that individual item weighting information in both cases must come from second-order (and conceivably higher order) item statistics. In the Spearman model case for continuous data, this poses no problems from an information viewpoint, because sums of squared item statistics are distinct from sums of raw item scores. In the binary case, however, a squared binary item score is identical to a raw binary item score (whether its value is 0 or 1). As a result, no new item information can be obtained from examining binary item variances (and higher order moments), over and above that available in binary item sums.

The fact that binary item statistics are limited in this way has a more basic message for item response theory than simply an argument against the toupee model. The message is this: If item parameters that are distinct from Rasch difficulty parameters are to be estimable, then local independence must be violated in the process (since statistics other than additive item statistics must be obtained). One approach is to use nonbinary item information such as categorical item response data (Glas & Verhelst, 1989) and item response latency measures (Jannarone, 1991b). A second approach is to use item cross-product statistics such as item-subtest regression estimates (Engelen & Jannarone, 1989), to identify discrimination-like parameters. In either case, however, the use of such statistics presents a dilemma for researchers who believe in both local independence and the toupee model.

Some guidelines for extending objective measurement. A variety of conflicting lessons and guidelines can be gathered from the preceding survey. For example, some with interests in developing advanced statistical procedures would identify different guidelines than others with more practical interests.
For the author, who is mainly interested in developing simple tests based on established statistical principles to reflect general cognitive processes, the following guidelines seem vital.

A. Extend explanatory power, by A(1) including new item parameters to reflect task differences as necessary, A(2) including new person parameters to assess corresponding individual differences, and A(3) avoiding substantive constraints (such as local independence);

B. Ensure statistical soundness, by B(1) including new statistics to identify new parameters as necessary, B(2) preserving the use of fast, optimal, and conditionally invariant estimation procedures, B(3) using composites of distinct observations for each parameter, to increase measurement precision, and B(4) using theoretically sound inference procedures; and

C. Retain interpretive and statistical simplicity, by C(1) minimizing parametric and statistical complexity, and C(2) retaining specific objectivity.

Guidelines A(1) and A(2) may seem too parameter-based to those who favor nonparametric approaches. The argument in favor of nonparametrics, as indicated earlier, is that few assumptions need be satisfied in order for nonparametric procedures to be valid. The argument in favor of parametric modeling is that parameter estimates can be used to explain associations among items precisely, rank-order persons in terms of discrete skills, and test distinct cognitive processing hypotheses. At a more basic level, the nonparametric movement in statistics over the last 20 years (Lehmann, 1975) seems analogous to the old behaviorism movement in psychology (Boring, 1957). Both have their place for identifying simple associations, but both are simplistic to a fault for explaining underlying structures.

Guideline A(3) is included to avoid defining away the existence of potentially useful test models. Examples later in the chapter will show that orthodox latent trait theory violates A(3). Previously cited tests for dimensionality violate A(3) as well, making them strangely restrictive given their nonparametric intent. In order to exclude the trivial Suppes and Zanotti model from consideration, these tests impose presumably minimal restrictions on families of permissible test models, but they prohibit locally dependent alternatives in the process. Some such tests impose local independence explicitly (for example, locally independent, monotone IRF requirements; see Holland & Rosenbaum, 1986), violating A(3) in the process. Others impose local independence with more subtlety. For example, according to a preliminary "essential local dependence" definition (Stout, 1990), certain LDCM models such as the Rasch-Markov model below are strangely both "essentially locally independent" and clearly locally dependent. The consequence is a dimensionality test that has equally strange implications. If such tests are to be truly nonparametric, they should require only the weakest necessary substantive assumptions, as suggested by A(3) (see also Jannarone, 1991a).

Guideline A(2) provides an individual emphasis that may be helpful, if not essential, for externally validating test theory models. Previously successful extended models without individual differences (Andrich, 1978, 1985; Embretson, 1984; Fischer & Formann, 1982; Masters, 1982; Wilson, 1989) indicate that extended individual difference measures are not essential. In all these instances, however, if individual differences measures had been included, then external validation prospects could have been enhanced. Also, Rasch model specific objectivity features provide a very powerful basis for independently assessing and using item difference measures as well as person difference measures. Guideline A(2) is meant to indicate, then, that extended individual differences provisions can only be helpful, provided that other guidelines can be followed.

The guidelines under B and C are nearly equivalent to an exponential family membership requirement. If a statistical model falls in the exponential family, all of the requirements under B are satisfied, as was indicated earlier. Also, the exponential family format requires a one-to-one correspondence between statistics and parameters, ensuring the kind of statistical clarity that is required by B. The equivalence is not perfect because some models outside the family may be both statistically sound and interpretable as was indicated earlier. However, exponential family membership is sufficient to satisfy B and C. Thus, all of the previously cited extended exponential family models satisfy B and C as well. Some prominent extended test models do not satisfy B, however, most notably the MFA and toupee models. Also, guideline B(3) was specifically included to clarify why the extended approach presented by Suppes and Zanotti is unacceptable.

Local independence has of course not been included as a guideline, given the viability of local dependence. In light of that viability, arguments that have been expressed in favor of local independence (Lord & Novick, 1968; McDonald, 1981) are not compelling (Jannarone, 1991). Although the above three guidelines are individually beneficial, they should be recognized as being mutually at odds.
The extended generality that falls under A cannot be achieved without added statistical and conceptual complexity, which compromises both B and C. For example, the interactive and multidimensional extensions of the ANOVA and factor analysis models necessarily involve more elaborate data explanations and analyses, as was indicated earlier. Thus, these and all other extended explanatory models are useful only if such elaborate explanations are necessary. More pointedly, extensions to the Rasch model should be considered only if extended insights about complex cognitive processes are required. Otherwise, there is no need to compromise the marvelous simplicity and elegance of the Rasch model. Because of this tradeoff between generality and complexity, more general alternatives to the Rasch model should be considered on a case-by-case basis, with this tradeoff between the three guidelines in mind.


LOCALLY DEPENDENT, CONJUNCTIVE MEASUREMENT AS THE CASE IN POINT

Every member of the conjunctive measurement family is a special case of the following general model.

The notation in (3) resembles the Rasch model notation in (1), in that $M$ is the number of items, the $x_{im}$ are binary item scores, $\beta$ is a vector of item parameters, and the elements of $\theta$ are person parameters. However, equation (3) contains many more person and item parameters than (1). Indeed, it contains far too many parameters to be useful (and identifiable) as it stands. Instead, some parameter values must be set to 0 or equated with other parameters to obtain useful special cases. One such special case of (3) is the Rasch model (1), which is obtained by equating all first-order person parameters,

and excluding all higher order terms,

Multivariate Rasch models, from which distinct Rasch measurements are obtained for different tests within a battery, can also be viewed as special cases of (3). For example, the appropriate special case of (3) for a battery made up of ten 100-item tests would have: (a) $M$ set at 1,000; (b) the first 100 $\theta_{im}$ values equated and denoted by $\theta_{i(1)}$; (c) the next 100 $\theta_{im}$ values equated with $\theta_{i(2)}$, and so on; and (d) the constraints in (5) imposed to remove higher order effects. The resulting model, after including (a) through (d) in (3), is,


The final form of (6) indicates that the 10 component tests can be treated as independent, distinct Rasch models. The general conjunctive model given in (3) represents an extension of Rasch model measurement, in that the Rasch model is a special case, along with a variety of other locally independent models such as (6). After statistical properties of (3) are examined next, some locally dependent special cases of (3) will be examined as well.

Conjunctive measurement model exponential family membership. The likelihood associated with (3) can be arranged in the following exponential family form:


Since all conjunctive measurement models are special cases of (7), they can be placed in exponential family form as well. For example, the exponential family and sufficient statistics for the Rasch model have already been formulated in (2). Also, the exponential family form of the likelihood for the above multivariate Rasch model (6) is,

As would be expected from fitting separate Rasch models to tests within the battery, (8) shows that item difficulty sufficient statistics are counts of persons who pass items, and ability sufficient statistics are persons' component test number-correct scores. Thus, person and item sufficient statistics for test battery Rasch models are additive, as in the usual Rasch model case. The subtractive form of person and item parameters in (3), along with its exponential family membership, guarantees that some specific objectivity features will be retained by conjunctive Rasch model extensions. In particular, (7) guarantees that if person sufficient statistics are fixed then item parameter estimation will be independent of person parameter estimation, and vice versa. This holds for all conjunctive measurement models, not only the (univariate and multivariate) Rasch model special cases that have been introduced so far.

Local independence within the conjunctive measurement family. It has already been shown in (1) that the Rasch model is locally independent. Since (6) is a product of Rasch models, it follows that the multivariate Rasch model is locally independent as well. Among the many possible conjunctive measurement models, it happens that (univariate and multivariate) Rasch models are the only conjunctive measurement models that are locally independent. More precisely, it can be shown (Jannarone, 1991a) that:

Proposition I. Special cases of (3) are locally independent if and only if second-order and higher-order terms are absent, that is, all of the constraints (5) are satisfied.

It also follows from (7) that locally dependent conjunctive measurement models must involve nonadditive statistics. More precisely, it follows that:


Proposition II. Special cases of (3) are locally dependent if and only if they require the use of second-order and/or higher-order sufficient statistics, that is, one or more of the constraints in (5) are not satisfied.

The practical consequence of Propositions I and II is that Rasch models are quite exclusive, but substantively limited, members of a large measurement model family. The measurement family is large, in that it includes many special cases that are restricted only by objective measurement and statistical soundness concerns. Rasch models are exclusively locally independent and additive, hence easy to interpret and statistically elegant. However, they are also limited for reflecting interesting, locally dependent versions of (3), some of which will be given next.

Some locally dependent conjunctive measurement models. (The following examples will be treated briefly here; see Jannarone, 1988, 1991, 1991b, for details.) As mentioned earlier, locally dependent models can be useful in settings where items are sequentially linked, as in studying exercise injuries and learning transfer abilities. One relatively simple LDCM model for sequentially linked items has the Rasch-Markov form,


along with item sufficient statistics of the form,

(The M symbols in (10) and (11) indicate that parameter estimates are monotonically increasing functions of their corresponding sufficient
statistics. It follows from (9) that all item local dependencies can be explained by those among adjacent items only, resulting in the so-called Markov property; see Jannarone, 1987.) The nonadditive person statistics associated with the Rasch-Markov model reflect individual differences in adjacent item local dependencies. For exercise injuries, high values of such statistics (controlling for numbers-of-injuries) reflect a person's inclination to press on in the face of physical adversity. For learning transfer settings, high cross-product statistic values (controlling for number-correct scores) reflect a person's inclination to successfully transfer information that was learned on one item to the next item.

Nonadditive person statistics can also be used to assess the utility of LDCM models in prediction. In particular, the explanatory power of (9), over and above that of the Rasch model, can be assessed with partial correlation tests. Such tests can be performed based on sample correlations among number-correct person sufficient statistics, cross-product sufficient statistics, and external criterion measures. (More precise tests based on correlations among efficient Rasch-Markov person parameter estimates can be obtained as well, but calculating efficient estimates is not easy; see Jannarone, 1987.) Similar tests can be constructed for validating the other LDCM models to be introduced below.

Some simplified versions of (9) can be formulated by equating item parameters. For example, all first-order item parameters can be equated and all second-order item parameters can be equated, resulting in the following stationary Rasch-Markov (or binomial-Markov) model,

having the same person statistics as (1), but item sufficient statistics of the form,

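The partial correlation check described above for the Rasch-Markov model can be sketched in a few lines. This is only an illustration under stated assumptions: it takes the Rasch-Markov person statistics to be the number-correct score and the sum of adjacent-item cross-products, and the function names (`pearson`, `person_statistics`, `partial_correlation`) and all data values are hypothetical.

```python
from math import sqrt


def pearson(u, v):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)


def person_statistics(responses):
    """Return (number-correct, adjacent cross-product sum) for one person.

    `responses` is a list of 0/1 item scores in administration order.
    Assumption: the nonadditive statistic is the sum of x_m * x_(m+1).
    """
    number_correct = sum(responses)
    adjacent_cross = sum(a * b for a, b in zip(responses, responses[1:]))
    return number_correct, adjacent_cross


def partial_correlation(x, y, control):
    """Correlation between x and y with the linear effect of control removed."""
    r_xy = pearson(x, y)
    r_xc = pearson(x, control)
    r_yc = pearson(y, control)
    return (r_xy - r_xc * r_yc) / sqrt((1 - r_xc ** 2) * (1 - r_yc ** 2))


# Hypothetical item scores for six persons and a hypothetical external criterion.
scores = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]
criterion = [5.0, 3.0, 7.0, 2.0, 5.5, 3.5]

stats = [person_statistics(row) for row in scores]
number_correct = [s[0] for s in stats]
cross_product = [s[1] for s in stats]

# Does the cross-product statistic predict the criterion beyond number-correct?
print(partial_correlation(cross_product, criterion, number_correct))
```

A partial correlation near zero would suggest that the cross-product statistic adds little to number-correct scores for predicting the criterion; a formal test would also need a sampling distribution, which this sketch does not address.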
Choosing between the Rasch-Markov model (9) and the binomial-Markov model (12) is like choosing between the Rasch model and the binomial model (Keats & Lord, 1962). Insofar as first-order and
second-order item sufficient statistics are equal, they should be combined for greater statistical efficiency and simplicity. Insofar as true corresponding item parameters are unequal, however, explanatory power will be lost. Constrained versions of the LDCM models below can be similarly constructed, and similar considerations apply.

Other simplified versions of (9) can be obtained by excluding person parameters. For example, second-order person parameters can be removed from (9), resulting in a model that accounts for local dependencies without accounting for individual differences in such dependencies. Such LDCM models, some of which have been adopted in interesting ways (Andrich, 1978, 1985; Embretson, 1984), are perfectly sound, provided that such individual differences do not exist. If they could exist, however, it would seem better to account for them formally and utilize resulting individual differences measures for external prediction. Similar options and concerns apply to the remaining LDCM models to be described.

Other LDCM models can be used in schemes where "testlets" made up of items sharing common content are involved (Wainer & Thissen, 1989). One familiar scheme involves testlets that assess comprehension ability, with each testlet being based on reading the same paragraph of text. For the case involving $T$ testlets, with the $t$th testlet being made up of $M_t$ items ($t = 1, \ldots, T$), an appropriate LDCM model would have the form,

where $M = M_1 + \cdots + M_T$. For simple cases involving two items in each testlet, the person sufficient statistics would be,

and the item sufficient statistics would be


The nonadditive person statistics in this case may be viewed as measures of task completion style. Among persons who get scores of 50 on a test made up of 50 two-item testlets, for example, the cross-product score can range from 0, indicating that a person got exactly one item correct in each testlet, to 25, indicating that a person either passed both items or no items in each testlet. The score of 0 would show an inclination to get one item correct and then move on, whereas the score of 25 would show an inclination to either work a problem through entirely or give it up. (For $M_t$ values higher than 2, the $\theta^{(2)}$ sufficient statistics are total numbers of item pairs that are both passed within testlets, with high values reflecting "compulsive" behavior and low values reflecting "hyperactive" behavior.) Insofar as these behaviors can be measured reliably, they may be useful in predicting certain external criteria, over and above number-correct scores. Alternatively, they could reflect sources of error variation if only number-correct scores are used, which could be either adjusted statistically or reduced by coaching examinees, or both. Thus, LDCM models can be used to identify different cognitive resource allocation strategies, by capitalizing on nonadditive person statistics.

This use of LDCM models has also been proposed in settings where a battery of diagnostic subtests is assigned, followed by a training period, followed by a battery of parallel achievement subtests (Jannarone, 1991). As in the previous two examples, the extra utility of nonadditive test scores in this setting is easy to demonstrate. In particular, it can be shown that different learning strategies can be uncovered by assessing their (nonadditive, within-person) subtest correlation coefficients. Such differential strategies can be uncovered even among people having identical pretest, posttest, and change scores (Jannarone, 1991).
Furthermore, if such individual strategy differences exist, then pretest-posttest scores (in the latter case as well as testlet scores in the former case) must be locally dependent, in keeping with Propositions I and II.

Finally, LDCM models can be used in settings where quickness and correctness are measured concurrently (Jannarone, 1991b), to assess individual differences in speed-accuracy tradeoffs. For example, in speeded test settings where stringent time limits are imposed, some persons may "rise to the occasion" and do relatively well, while others may not do so well. Assessing individual differences along these lines may be useful in selecting personnel for whom quick, accurate responses are essential, such as air traffic controllers and police officers. Corresponding LDCM models are based on person sufficient statistics that are cross-products among item response speed measures and item
correctness measures. As in the previous three cases, since these measures are nonadditive they are also necessarily locally dependent.

CONCLUSIONS

This chapter began with an informal discussion of whether and how to extend objective measurement beyond the Rasch model. A more careful analysis was presented next, based on historical measurement developments. The analysis indicated that objective measurability should be assessed case-by-case, based on: A, extending explanatory power; B, ensuring statistical soundness; and C, retaining interpretive and statistical simplicity. Conjunctive measurement was also reviewed with these guidelines in mind. Overall conclusions follow.

Since conjunctive measurement models are exponential family members with subtractive person and item statistics, they satisfy all criteria previously listed under A and B. However, LDCM models necessarily violate some simplicity and specific objectivity requirements previously listed under C. Item and person measurement separability are preserved, in that person parameter estimates are conditionally independent of item parameters (when item sufficient statistics are fixed), and vice versa. However, it is less clear that LDCM "transcends the measuring instrument [by excluding person-by-item] interaction terms." Indeed, LDCM clearly involves such interactions at the item level. On the other hand, LDCM is free of such interactions at the testlet level. In paragraph comprehension settings, for example, items are locally independent between distinct paragraph testlets, so that number-correct and cross-product scores do not interact between testlets. LDCM may be viewed, then, as "transcending the measuring instrument" up to a point, once the instrument is viewed as performing measurements at the testlet level rather than the item level. Similar conclusions can be made for other LDCM models if they are applied to repeatable testlets, rather than nonrepeatable tests.
The most basic message from this chapter is that if reasonable statistical soundness conditions are imposed, then alternatives to the (univariate or multivariate) Rasch model will necessarily violate the local independence axiom. This message is noteworthy, because local dependence conflicts with traditional latent trait theory foundations. It is tempting to suggest that this violation marks only a lull between newly discovered locally dependent counterexamples and new locally dependent test theory axioms that will necessarily follow. However,
the history of science (Kuhn, 1970; Miller & Fredericks, 1991; Popper, 1968) has shown that foundational requirements carry a great deal of inertia. Because of this inertia, the local independence axiom cannot be taken lightly and much effort will be needed to overcome it. Continued attempts to develop the toupee model along latent trait theory lines illustrate this inertia quite well. The cosmetic nature of such attempts should be clear, in light of the previous discussion. Moreover, straightforward LDCM alternatives to the toupee model can easily be developed, by including item-subtest regression parameters and corresponding sufficient statistics as necessary. However, the toupee model will not be shed for a better alternative, until researchers face the bald fact that it presents a local independence paradox.

The status of local independence is conceptually (and substantively; see Jannarone, 1991) similar to the status of Newton's physics laws at the end of the last century. Just as evidence gathered then that Newton's laws were not always followed, evidence is gathering now that local independence does not always follow. Likewise, just as classical physics gave way to modern physics then, local independence is being recognized as a special case of a more general test theory now. This century has shown that classical physics continues to provide a simple and elegant model, which is adequate for most purposes, but other quite useful developments have resulted from extending physics to include relativity. Similar developments in psychometrics seem likely for the next century, including (a) the continued successful use of additive Rasch measurement in most cases, along with (b) occasional uses of alternative approaches for measuring more complicated, nonadditive cognitive processes. The development of such alternatives seems especially likely, in light of ongoing advances in real-time human-computer interaction (Wainer & Thissen, 1989).
Returning finally to whether LDCM is objective measurement or an abomination, it was suggested early in the chapter that different researchers with different needs are likely to have different answers. That conclusion holds after a more careful analysis as well. Those who only need to measure achievement and other simple attributes will lean toward the Rasch model and away from less elegant "abominations," and rightly so. Others, who feel the need to extend cognitive measurement along dynamic lines, will be more willing to view LDCM as objective and potentially useful. Hopefully the preceding description has helped researchers with these and other leanings to consider locally dependent, conjunctive measurement for their own needs. (Some computer programs are available and others are being developed by the author for LDCM analysis. Researchers with LDCM interests should feel free to contact him for assistance.)


REFERENCES

Andersen, E.B. (1980). Discrete statistical models with social science applications. Amsterdam: North-Holland.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
Andrich, D. (1985). A latent trait model for items with response dependencies: Implications for test construction and analysis. In S.E. Embretson (Ed.), Test design: Development in psychology and psychometrics (pp. 245-275). Orlando, FL: Academic Press.
Birnbaum, A. (1958). Statistical theory of tests of a mental ability. Annals of Mathematical Statistics, 29, 1285 (abstract).
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Bock, R.D. (1972). Estimating item parameters and latent ability when responses are scored in two or more numerical categories. Psychometrika, 37, 29-51.
Boring, E.G. (1957). A history of experimental psychology (2nd ed.). New York: Appleton-Century-Crofts.
Box, J.F. (1978). R.A. Fisher: The life of a scientist. New York: Wiley.
Embretson [Whitely], S. (1984). A general latent trait model for response processes. Psychometrika, 49, 175-186.
Engelen, R.J.H., & Jannarone, R.J. (1989). A connection between item/subtest regression and the Rasch model (Research Report No. 89-1). Enschede, The Netherlands: Department of Education, Twente University.
Ferguson, G.A. (1942). Item selection by the constant process. Psychometrika, 7, 19-29.
Fischer, G. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.
Fischer, G.H. (1981). On the existence and uniqueness of maximum likelihood estimates in the Rasch model. Psychometrika, 46, 59-77.
Fischer, G.H., & Formann, A.K. (1982). Some applications of logistic latent trait models with linear constraints on the parameters. Applied Psychological Measurement, 6, 397-416.
Fisher, R.A. (1921). Studies in crop variation, I: An examination of the yield of dressed grain from Broadbalk. Journal of Agricultural Science, 11, 107-135.
Fisher, R.A., & Mackenzie, W.A. (1923). Studies in crop variation, II: The manurial response of different potato varieties. Journal of Agricultural Science, 13, 311-320.
Fisher, W.P. (1991). Objectivity in measurement: A philosophical history of [...] Theory into practice. Norwood, NJ: Ablex Publishing Corp.
Galton, F. (1888). Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society, 45, 135-145.
Glas, C.A.W. (1991). A Rasch model with a multivariate distribution of ability. In M. Wilson (Ed.), Objective measurement: Theory into practice. Norwood, NJ: Ablex Publishing Corp.
Glas, C.A.W., & Verhelst, N.D. (1989). Using the Rasch model for dichotomous data for analyzing polytomous responses. Unpublished manuscript, CITO, Arnhem, The Netherlands.
Holland, P.W. (1981). When are item response models consistent with observed data? Psychometrika, 46, 79-92.
Holland, P.W., & Rosenbaum, P.R. (1986). Conditional association and unidimensionality in monotone latent variable models. Annals of Statistics, 14, 1523-1543.
Jannarone, R.J. (1986). Conjunctive item response theory kernels. Psychometrika, 51, 357-373.
Jannarone, R.J. (1987). Locally independent models for reflecting learning abilities (Center for Machine Intelligence Report No. 87-67). University of South Carolina, Columbia.
Jannarone, R.J. (1991). Conjunctive measurement theory: Cognitive research prospects. In M. Wilson (Ed.), Objective measurement: Theory into practice. Norwood, NJ: Ablex Publishing Corp.
Jannarone, R.J. (1991a). [...] Contrasts and connections with traditional test theory. Unpublished manuscript. University of South Carolina, Columbia.
Jannarone, R.J. (1991b). Measuring quickness and correctness concurrently: A conjunctive IRT approach. Unpublished manuscript. University of South Carolina, Columbia.
Jannarone, R.J., Yu, K.F., & Laughlin, J.E. (1990). Easy Bayes estimates for Rasch-type models. Psychometrika, 55, 449-460.
Joreskog, K., & Sorbom, D. (1984). LISREL VI users guide. Chicago: International Educational Resources.
Keats, J.A., & Lord, F.M. (1962). A theoretical distribution for mental test scores. Psychometrika, 27, 59-72.
Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika, 49, 223-245.
Kelderman, H., & Jannarone, R.J. (1989, March). Conditional maximum likelihood estimation in conjunctive item response models. Paper presented at the annual American Educational Research Association meetings, San Francisco.
Kristoff, W. (1968). On the parallelization of trace lines for a certain test model (Research Report No. RR-68-56). Princeton, NJ: Educational Testing Service.
Kuhn, T.S. (1970). The structure of scientific revolutions (2nd ed.). Chicago: University of Chicago Press.
Lawley, D.N. (1943). On problems connected with item selection and test construction. Proceedings of the Royal Society of Edinburgh, 61, 273-287.
Lazarsfeld, P.F. (1958). Latent structure analysis. In S. Koch (Ed.), Psychology: A study of a science, Vol. III. New York: McGraw-Hill.
Lehmann, E.L. (1975). Nonparametrics: Statistical methods based on ranks. San Francisco: Holden-Day.
Lehmann, E.L. (1983). Theory of point estimation. New York: Wiley.
Lehmann, E.L. (1986). Testing statistical hypotheses (2nd ed.). New York: Wiley.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Macera, C.A., Pate, R.R., Powell, K.E., Jackson, K.L., Kendrick, J.S., & Craven, T.E. (1989). Predicting lower extremity injuries among habitual runners. Archives of Internal Medicine, 149, 2565-2568.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
McDonald, R.P. (1981). The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 34, 100-117.
McKinley, R.L., & Reckase, M.D. (1983). An extension of the two-parameter logistic model to the multidimensional latent trait space (Research Report No. R83-2). Iowa City, IA: American College Testing Program.
Miller, S.I., & Fredericks, M. (1991). Postpositivistic assumptions and educational research: Another view. Educational Researcher, 20, 2-8.
Mislevy, R.J., & Stocking, M.L. (1987). A consumers guide to LOGIST and BILOG (Research Report No. RR-87-43). Princeton, NJ: Educational Testing Service.
Mislevy, R.J., & Verhelst, N.D. (1990). Modeling item responses when different subjects employ different solution strategies. Psychometrika, 55, 195-216.
Mokken, R.J., & Lewis, C. (1982). A nonparametric approach to the analysis of dichotomous item responses. Applied Psychological Measurement, 6, 417-430.
Pearson, K. (1896). Mathematical contributions to the theory of evolution, III. [...] Royal Society, A, 187, 113-178.
Popper, K.P. (1968). The logic of scientific discovery. New York: Harper & Row.
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press. (Original work published 1960).
Rosenbaum, P.R. (1984). Testing the conditional independence and monotonicity assumptions of item response theory. Psychometrika, 49, 425-436.
Rosenbaum, P.R. (1987). Probability inequalities for latent scales. British Journal of Mathematical and Statistical Psychology, 40, 157-168.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph No. 17.
Scheffe, H. (1959). The analysis of variance. New York: Wiley.
Searle, S.R. (1971). Linear models. New York: Wiley.
Spearman, C. (1904). The nature of intelligence and the principles of cognition. London: Macmillan.
Spearman, C. (1927). The abilities of man. New York: Macmillan.
Stout, W. (1987). A nonparametric approach for assessing latent trait dimensionality. Psychometrika, 52, 589-617.
Stout, W. (1990). A nonparametric multidimensional IRT approach with applications to ability estimation. Psychometrika, 55, 293-326.
Suppes, P., & Zanotti, M. (1981). When are probabilistic explanations possible? Synthese, 48, 191-199.
Thurstone, L.L. (1932). The theory of multiple factors. Ann Arbor, MI: Edwards Brothers.
Thurstone, L.L. (1947). Multiple factor analysis: A development and expansion of the vectors of mind. Chicago: University of Chicago Press.
Van der Linden, W., & Jannarone, R.J. (1989). Locally dependent choice models. Unpublished manuscript, Twente University.
Wainer, H., & Thissen, D. (1989). Item clusters in computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-202.
Whitely, S.E. (1980). Multi-component latent trait models for ability tests. Psychometrika, 45, 479-494.
Wilson, M. (1989). Saltus: A psychometric model of discontinuity in cognitive development. Psychological Bulletin, 105, 276-289.
Wright, B. (1980). Foreword and afterword to Rasch, G., Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press.

chapter

13

Objective Measurement with Multidimensional Polytomous Latent Trait Models*

Henk Kelderman

University of Twente (now at Vrije University)

* Requests for reprints should be sent to H. Kelderman, Department of Work and Organizational Psychology, Faculty of Psychology and Pedagogics, Vrije University, De Boelelaan 1081c, 1081HV Amsterdam, The Netherlands. The author thanks Mary Lunz of the American Society of Clinical Pathologists for test data and suggestions.

Objective measurement in the social sciences is rarely possible without probabilistic models. In many cases these measurements are based on aggregates of elementary measurements, such as answers to test questions and errors in spelling, which are themselves subject to random error. Classical test theory models (Lord & Novick, 1968) yield estimates of the reliability of the total test score under certain assumptions such as unidimensionality of the test. Item response theory (Birnbaum, 1968; Lord, 1980; Rasch, 1960/1980) explicitly relates the item responses to subject parameters and item parameters via a probabilistic model. For a set of test data, the assumptions of the model may be tested and the parameters estimated. A desirable property of the class of IRT models proposed by Rasch is that the subject parameters may be estimated independently of the item parameters, and vice versa.

As an example of a Rasch type model, consider the unidimensional Rasch models for polytomous items described by Andrich (1978), Masters (1982, 1987), Wright and Masters (1982), Masters and Wright (1984), and others. Suppose that $N$ subjects respond to $k$ test items. On item $j$, subject $i$ may give any of $r_j + 1$ responses $x_{ij} = x$ ($x = 0, \ldots, r_j$). The probability of this response is denoted by $\pi_{ijx}$. Let $\theta_i$ be a parameter describing the ability of person $i$ and let $\delta_{jx}$ be a difficulty parameter of response $x$ on item $j$. Written in terms of log-odds, the partial credit model relates the response probabilities to the person and item parameters through

\[ \ln \frac{\pi_{ijx}}{\pi_{ij(x-1)}} = \theta_i - \delta_{jx}, \qquad (1) \]

$j = 1, \ldots, k$; $x = 1, \ldots, r_j$. If, for a certain population of subjects $P$ and a universe of items $Q$, this Rasch type model holds, the abilities of the subjects in $P$ can be compared regardless of the choice of items from $Q$, and the difficulty of the items can be compared regardless of the particular sample of subjects taken from $P$. Rasch (1960/1980, 1977) calls this specifically objective measurement. It can be shown that the likelihood of the data under model (1) factors into two distinct parts, a part with only subject parameters and a part with only item parameters (e.g., Masters, 1982). To each of these sets of parameters corresponds a set of minimal sufficient statistics. For the subject parameters they are the simple sums of the item scores $x_{i1} + \cdots + x_{ik}$, and for the item response parameters they are the numbers of subjects that have given that particular response.

The sets $P$ and $Q$ define the limits of model validity. For subjects and items outside the sets, objective measurement may not be possible. For example, if the trait to be measured is knowledge of medieval history, kindergarten children may not be in $P$ nor arithmetic items in $Q$. Sometimes a set of items $Q$ is supposed to measure a certain trait for subjects in $P$, but it does not fit the Rasch model. In that case, one may attempt to partition the universe $Q$ into $s$ Rasch homogeneous subuniverses $Q_q$ ($q = 1, \ldots, s$), each fitting the Rasch model. Obviously, if this can be done, objective measurement is still possible because each subuniverse allows objective measurement. The only difference is that each subject is now characterized by a vector of person parameters $\theta_i = (\theta_{i1}, \ldots, \theta_{is})$ rather than a single scalar subject parameter.

However, in some testing situations, particularly if the items have several answer categories, multidimensionality may be more intricate and surface within a single item. For example, consider the following test item: "$\sqrt{12 - 3} = ?$". Two numerical operations may be assumed: subtraction ($12 - 3 = 9$) and taking the square root ($\sqrt{9} = 3$). If both operations are performed, the answer gets the full credit $x = 2$. If only the first operation is performed, the answer gets the partial credit $x = 1$. Finally, if the answer is incorrect it is scored $x = 0$. The partial credit model (1) then explains the odds of getting a credit of 1 rather than 0 by the person's ability $\theta_i$ and a response parameter $\delta_{j1}$, and explains the odds of getting a credit of 2 rather than 1 by the same ability parameter and a response parameter $\delta_{j2}$. It might, however, be hypothesized that "subtracting" and "taking the square root" are different abilities, calling for a model where the consecutive odds depend on different latent traits. Therefore, we now consider multidimensional Rasch models.

Multidimensional Rasch Models

Rasch (1961), aware of multidimensionality within item responses, invented a model that allows for a different dimension in each category:

$j = 1, \ldots, k$; $x = 0, \ldots, r_j$, with constraints $\delta_{j0} = 0$ and $\delta_{11} + \cdots + \delta_{k1} = 0$. This model describes the log probability of a score $x$ on item $j$. It is easy to reformulate the model into a model for the odds of getting score $x$ rather than $x - 1$, as in the partial credit model, because

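The step from adjacent-category odds to category probabilities can be illustrated numerically. The sketch below is not the chapter's estimation method; it assumes that the log-odds of score $x$ rather than $x - 1$ equal a subject parameter minus a threshold parameter, with one subject parameter per category in the multidimensional case and a single repeated value in the unidimensional case (1). The function name `category_probabilities` and all parameter values are hypothetical.

```python
from math import exp


def category_probabilities(thetas, deltas):
    """Probabilities of scores 0..r for one subject on one item.

    Assumes ln(p[x] / p[x-1]) = thetas[x-1] - deltas[x-1] for x = 1..r.
    Passing the same theta r times gives the unidimensional special case.
    """
    if len(thetas) != len(deltas):
        raise ValueError("thetas and deltas must have the same length")
    # Accumulate log-odds relative to score 0.
    log_weights = [0.0]
    for theta_x, delta_x in zip(thetas, deltas):
        log_weights.append(log_weights[-1] + theta_x - delta_x)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_weights)
    weights = [exp(w - top) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical thresholds for a three-category item (scores 0, 1, 2).
deltas = [-0.5, 1.0]

# One ability drives both steps (unidimensional partial credit model).
unidimensional = category_probabilities([0.8, 0.8], deltas)

# A different ability for each step, e.g. subtraction versus square root.
multidimensional = category_probabilities([0.8, -0.3], deltas)

print(unidimensional)
print(multidimensional)
```

With these hypothetical values, the first-step odds are identical in the two cases, so only the odds of full versus partial credit differ between them.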
This multidimensional version of the partial credit model was described by Kelderman (1991a,b). As in the unidimensional partial credit model (1), the model contains a threshold parameter $\delta_{jx}$, but the model now has a separate subject parameter $\theta_{ix}$ for each response category.

An extension of Rasch's multidimensional model that gives the analyst more flexibility in specifying the relation between item responses and latent traits is the Multidimensional Polytomous Latent Trait (MPLT) model (Kelderman & Rijkes, in press). Let $B_{qjx}$ be a nonnegative integer-valued weight of response $x$ of item $j$ with respect to the $q$th subject parameter. If $B_{qjx} \neq 0$, the item response denoted by the pair $(j, x)$ depends on latent trait $q$, and if $B_{qjx} = 0$, it does not. In addition, if the weight is larger than one, it means that the response involves more than one application of the latent trait. The MPLT model is then written as:


$j = 1, \ldots, k$; $x = 0, \ldots, r_j$. As in the previous models, additional constraints must be imposed on the parameters to obtain a unique set of parameter estimates. There are two types of indeterminacies in the model: between $\mu_j$ and $\delta_{jx}$, and between $\theta_{iq}$ and $\delta_{jx}$. Adding a constant $c_j$ to each $\delta_{jx}$ and subtracting it from $\mu_j$ does not change the model. These indeterminacies may be removed by setting convenient linear restrictions on the parameters that facilitate the interpretation of the parameters. For example, setting $\delta_{j0} = 0$ removes the first indeterminacy and makes sense if x = 0 is the incorrect response. $c_q$ may be chosen such that the mean of the person parameters or of the item parameters is equal to zero, to fix the scale of $\theta_{iq}$. Different parameterizations may be employed in different situations to improve the interpretability of the parameters. One example of this is given later in this paper. If the parameters of model (2) are unique, conditional estimation of parameters is possible.

Several applications of MPLT models have been described in the literature (Duncan & Stenbeck, 1987; Kelderman, 1991; Kelderman & Rijkes, in press; Wilson, 1989, 1990; Wilson & Adams, 1993; Wilson & Masters, 1993), and a computer program that computes conditional maximum likelihood estimates and goodness-of-fit statistics has been developed (Kelderman, 1992; Kelderman & Steen, 1993). A powerful example of MPLT modeling in practice is the following analysis of medical-laboratory-test items.

An Example

The American Society of Clinical Pathologists (ASCP) produces tests for the certification of medical personnel. The Society has a long-standing commitment to objective measurement. Their tests are carefully constructed and analyzed to make sure that the comparison of person parameters is independent of item content as much as possible. Mary Lunz of ASCP made available for reanalysis the following set of data.
The data we analyze here are the responses of 333 examinees to nine four-choice items measuring the ability to perform medical laboratory tests. The items are calibrated under a Rasch model so that the sum of the correct answers contains all information about the subject's ability $\theta_i$ available in the data. There are, however, reasons to believe that this single ability parameter might not be sufficient to explain the subjects' behavior on the tests. In particular, it was hypothesized that several different cognitive processes are involved in answering the items, and that even the incorrect responses might be chosen on the basis of partial execution of these processes. The correct response would then be chosen if all processes were successfully executed and orchestrated. Table 13-1 gives the judgements of ASCP content experts about three cognitive processes that are possibly involved in choosing the item alternatives. For example, in Item 3 the correct answer b involves the application of knowledge as well as two computations, whereas in the incorrect answer c one calculation is missing.

Now assume that there are individual differences in the subjects' ability to perform each of these cognitive operations and that the three corresponding ability parameters are given by $\theta_{i1}$, $\theta_{i2}$ and $\theta_{i3}$. Furthermore, assume that there is a parameter $\theta_{i4}$ that is exclusively involved with giving the correct answer. In that case, the specification in Table 13-1 gives the B weights of the MPLT model. To investigate whether this hypothesis is correct we specify two models: (a) a model with $\theta_{i4}$ only, and (b) a model with all four ability parameters $\theta_{i1}$, $\theta_{i2}$, $\theta_{i3}$, $\theta_{i4}$. The item difficulty parameters of both models were estimated with the LOGIMO program (Kelderman & Steen, 1993). The log-likelihood (and number of independent parameters) of model (a) is -9391 (36) and of model (b) is -9052 (446). The log-likelihood of a model is the logarithm of the probability of the observed data under that model. Obviously the likelihood of the data under model (b) is larger than under model (a), but this comparison is not fair since model (b) has 446 parameters estimated from the data and model (a) only 36.

Table 13-1 Specification of Cognitive Processes Involved in Responses of the ASCP Medical Laboratory Test

Column groups (each over responses a b c d): I Applies Knowledge; II Calculates Data; III Correlates; IV Correct

Item a b c d a b c d a b c d a b c d
12 2 1 2 3 1 2 1 1 4 2 1 1 5 1 1 1 1 6 1 1 1 7 1 2 2 8 1 1 1 1 1 9 1
1 1 1 1
1
1 1
1
1 1 1 1 1 1 1 1 1

N = 3370, Nonresponse = 39


A statistic that trades off log-likelihood against the number of parameters to be estimated is Akaike's Information Criterion, AIC = constant + 2 (number of parameters) - 2 (log-likelihood). Akaike (1977) found that it can be expected that adding two extra parameters to the model is generally equivalent to an increase of one point on the log-likelihood scale. If we now compare models (a) and (b) with AIC (constant = -18000), we have values of 854 and 996, respectively. That is, adding the 410 parameters to the 36 parameters of model (a) to get model (b) does not increase the likelihood beyond that expected from chance. In fact the likelihood of (b) is smaller than expected. The conclusion, therefore, must be that the cognitive processes do not explain the structure of the data beyond the simple "correct" model.

In Table 13-2, Pearson goodness-of-fit statistics and item response parameters are given for each of the items under model (a). Comparing the Pearson goodness-of-fit statistics $X^2$ with their degrees of freedom, we see that $X^2$ generally is not much larger, indicating a satisfactory fit. So we may conclude that the practice of using the number-correct score is not invalidated by possible multidimensionality in the correct responses and does not ignore possible information about cognitive processes present in incorrect responses.

Table 13-3 shows the parameter estimates of model (a). To obtain an interpretable and unique set of parameters, two types of identifying restrictions are imposed. Firstly, to make the parameters of the incorrect responses comparable, we reparameterize in such a way that their sum equals zero in each item. This is achieved by subtracting in (2) their mean from the item's response parameters $\delta_{jx}$ and adding it to $\mu_j$. Secondly, as with the dichotomous Rasch model, to fix the origin of the latent trait scale the sum of the item parameters corresponding to the correct responses is set equal to zero.
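The AIC comparison of models (a) and (b) can be checked directly from the log-likelihoods and parameter counts reported in the text. The short sketch below does only that arithmetic; the function name `aic` and its argument names are illustrative choices, not part of LOGIMO or any other program mentioned in the chapter.

```python
def aic(log_likelihood, n_params, constant=0):
    """AIC = constant + 2 * (number of parameters) - 2 * (log-likelihood)."""
    return constant + 2 * n_params - 2 * log_likelihood


# Log-likelihoods and parameter counts as reported in the text.
aic_a = aic(log_likelihood=-9391, n_params=36, constant=-18000)   # model (a)
aic_b = aic(log_likelihood=-9052, n_params=446, constant=-18000)  # model (b)

print("AIC model (a):", aic_a)
print("AIC model (b):", aic_b)
print("preferred model:", "(a)" if aic_a < aic_b else "(b)")
```

The smaller AIC belongs to model (a): the gain of 339 log-likelihood points in model (b) does not offset the penalty for its 410 additional parameters.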
Table 13-2 Goodness-of-Fit Statistics for Model (a)

Item         1    2    3    4    5    6    7    8    9    Sum
Chi-Square   33   38   40   32   21   16   21   22   28   251
DF           21   22   22   22   23   20   23   20   21   194

Table 13-3 Parameter Estimates of Model (a)

Item 1 2 3 4 5 6 7 8 9
Response a b c d

-0.54 0.95 0.92* 1.50
-0.88
-0.37 -0.66* 1.00 -0.63
0.01* 0.07 -0.48 0.41
-0.14 -0.93 -0.47* 1.07
1.13 -0.07* -0.58 -0.54
0.30 0.60 0.21* -0.30
-0.46*
0.97
0.02 0.50* 0.85
1.05 0.16 0.90
1.29 0.05* 0.31

* Correct response

The starred parameters in Table 13-3 describe the item difficulty. A high value of this parameter means that the item is relatively difficult. This difficulty is measured on the same latent trait scale as the subject parameter $\theta_{i1}$ (see Model (2) for s = 1). It is seen that Items 1 and 2 are relatively difficult and Items 3, 5, and 8 are relatively easy. The nonstarred parameters are unpopularity parameters of the distractors. A high value of this parameter indicates that the particular distractor is not popular compared to the other distractors of the item. For example, Table 13-3 shows that Distractor b of Item 9 is much more popular than a or d.

DISCUSSION

In this chapter the possibility of multidimensional objective measurement is discussed and a general multidimensional Rasch model for polytomously scored items is introduced. The analysis of the ASCP data shows that multidimensionality can be modelled quite flexibly at the item response level. It is shown that multidimensionality is not present in these data and that a unidimensional model suffices to describe the data. It should be noted that this unidimensional model is not the same as the dichotomous Rasch model, but a unidimensional model for polytomous items. It is a subject for further investigation to determine whether this new model is more desirable in this case than the classic Rasch model. An advantage is that all given responses are modelled and described and that goodness-of-fit studies that focus on the various responses may yield information on possible sources of misfit. It is hard to compare both models empirically, for example, using AIC, because the sample space is different.

REFERENCES

Akaike, H. (1977). On entropy maximization principle. In P.R. Krishnaiah (Ed.), Applications of statistics (pp. 27-41). Amsterdam: North Holland.
Andrich, D. (1978). A rating scale formulation for ordered response categories. Psychometrika, 43, 561-573.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Duncan, O.D., & Stenbeck, M. (1987). Are Likert scales unidimensional? Social Science Research, 16, 245-259.
Kelderman, H. (1991, April). Estimation and testing a multidimensional Rasch model for partial credit scoring. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, Illinois.
Kelderman, H. (1992). Computing maximum likelihood estimates of loglinear IRT models from marginal sums. Psychometrika, 57, 437-450.
Kelderman, H., & Rijkes, C.P.M. (in press). Loglinear multidimensional IRT models for polytomously scored items. Psychometrika, 59.
Kelderman, H., & Steen, R. (1988). LOGIMO: Loglinear Item Response Modeling [computer manual]. Groningen, The Netherlands: i.e.c. ProGAMMA.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Masters, G.N., & Wright, B.D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529-544.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Masters, G.N. (1987). Measurement models for ordered response categories. In R. Langeheine & J. Rost (Eds.), Latent trait and latent class models. New York: Plenum.
Rasch, G. (1961). On the meaning of measurement in psychology. In J. Neyman (Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 5). Berkeley, CA: University of California Press.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 17, 58-94.
r tests. Chicago: The University of Chicago Press. (Original work published 1960)
Wilson, M. (1989, April). The partial order model. Paper presented at the Fifth International Objective Measurement Workshop, Berkeley, CA.
Wilson, M. (1990). An extension of the partial credit model to incorporate diagnostic information. Unpublished paper, Graduate School of Education, University of California, Berkeley, CA.
Wilson, M., & Adams, R.A. (1993). Marginal maximum likelihood estimation for the ordered partition model. Journal of Educational Statistics, 18, 69-90.
Wilson, M., & Masters, G.N. (1993). The partial credit model and null categories. Psychometrika, 58, 87-99.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.

chapter 14

When Does Misfit Make a Difference? Raymond J. Adams

Australian Council for Educational Research

Benjamin D. Wright University of Chicago

The Rasch model (Rasch, 1960/1980; Wright & Stone, 1979), and indeed all fixed effects item response models, requires that item parameters remain fixed and independent of the persons they are measuring. Similarly, the model requires that the person parameters be independent of the particular items used to measure them. When applying the model, it is usual to use tests of fit that examine the extent to which these requirements are met by a set of data. If these tests fail to reject the model (at some arbitrary level of statistical significance), then it is accepted that the model and data are compatible and that the above properties are met. On the other hand, if the fit tests lead to a rejection of the model, then it is concluded that the data and model are incompatible.

In reality, model and data are never fully compatible. The manifest ability of person n (that is, the ability level that is actually applied by person n) is always slightly different when faced with item i than it is when faced with item j. Similarly, the manifest difficulty of item j is always slightly different for persons n and m. No data can ever be perfectly compatible with any measurement model. The consequence is that as samples become large enough, tests of fit invariably indicate that the data do not exactly follow the requirements specified by the model. When the number of observations is sufficiently large, tests of fit will always indicate incompatibility between any data and any model (Gustafsson, 1980; Martin-Löf, 1974).

A more constructive approach to examining the fit of a model is to address the question of whether the model constructs a useful representation of the data structure (van den Wollenberg, 1988). Under these circumstances, the impractical all-or-nothing treatment of the usefulness of parameter estimates is replaced by a consideration of how well the model represents the important elements of the data.

When we consider the case of person measurement, the most important questions are: How useful are the person parameter estimates that result from the application of a model to the data at hand? What kinds of mistakes are likely to be made if the parameter estimates are used as though the model and data are compatible?

In this chapter, misfit to the Rasch model is examined in terms of its effect upon person parameter estimates. The analysis addresses the case where a set of calibrated items is used to provide estimates of person abilities. Since the Rasch model can be derived from a set of coherent requirements for measurement (Thurstone, 1928; Wright, 1989), the term measurement disturbance is used to describe data misfit.

FRAMEWORK FOR STUDYING DISTURBANCE

To undertake the analysis, a general framework for disturbance is presented. This framework enables the imposition of a broad class of measurement disturbances through the specification of a single interaction term. The Rasch model for dichotomous data specifies that the odds of success of person n on item i must depend upon a function of n and i that can be factored into two components, one depending only on n and the other depending only on i. This requirement can be expressed as:

$$\frac{P_{ni}}{1 - P_{ni}} = K\,A(n)\,C(i)$$

where $P_{ni}$ is the probability that person n succeeds on item i and K is an arbitrary constant that can be set at one. One large class of deviations from this measurement model occurs when the odds of success depend on a function of n and i that cannot be factored. That is:

$$\frac{P_{ni}}{1 - P_{ni}} = K\,A(n)\,C(i)\,E(n,i)$$

where $E(n,i)$ cannot be factored into separate functions of n and i (with $P_{ni}$ the success probability of person n on item i). Taking logs, with K = 1, and writing $b_n = \log[A(n)]$, $d_i = -\log[C(i)]$ and $e_{ni} = \log[E(n,i)]$, this becomes:

$$\log\frac{P_{ni}}{1 - P_{ni}} = b_n - d_i + e_{ni}$$

or equivalently:

$$P_{ni} = \frac{\exp(b_n - d_i + e_{ni})}{1 + \exp(b_n - d_i + e_{ni})}$$

The requirement that the outcome be predictable solely from the fixed main effects is violated by the interaction, $e_{ni}$. This interaction can be produced by many kinds of measurement disturbances. When estimated item difficulties are used to estimate person abilities (and no other disturbance exists), then $e_{ni}$ can represent the uncertainty in the item parameter estimates. Here $e_{ni}$ is the same for all persons but varies over items (i.e., $e_{ni} = e_i$) and the observed outcomes actually result from $(b_n - d_i) + e_i$. In this case the actual difficulty of each item is not exactly what it is believed to be. The disturbance $e_i$ may be equal for all items (simply e), it may be different for all items, or it may be useful to consider $e_i$ as sampled randomly from a normal distribution with mean zero and variance $\sigma_i^2$, the calibration error variance for item i.

By specifying the combinations of n and i that lead to a nonzero interaction, $e_{ni}$, and by specifying the relationship between the magnitude of $e_{ni}$ and group membership, it is possible to generate a family of disturbances. This family includes the most widely mentioned disturbances: item bias, multidimensionality, variations in discrimination, and guessing.

A general model is derived by partitioning the test into Q mutually exclusive, exhaustive, nonempty item subsets, $\Omega_1, \Omega_2, \ldots, \Omega_Q$, and by partitioning the sample of individuals measured into P mutually exclusive, exhaustive, and nonempty person subsets, $\Theta_1, \Theta_2, \ldots, \Theta_P$. If P = 1, then the persons are not partitioned; if P = N, then they are partitioned one per group; and if P = p < N, they are partitioned into some small number of groups. Similarly, the items may be partitioned into 1, q, or L groups. Then when person n takes item i, the outcome is not best predicted by $(b_n - d_i)$ but by $(b_n - d_i) + e_{ni}$, where $e_{ni} = e_{st}$ for $n \in \Theta_s$, $i \in \Omega_t$, and $e_{st}$ is either some function of s and t or a sampled element from some distribution whose parameters depend on s and/or t.
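The partition idea can be made concrete with a small numerical sketch. The code below is an illustration under stated assumptions, not the authors' simulation code: the function names, the 0-based group labels, and the particular abilities, difficulties, and disturbance values are all hypothetical. It expands group-level disturbances $e_{st}$ into a full person-by-item matrix $e_{ni}$ and evaluates the success probabilities implied by $(b_n - d_i) + e_{ni}$.

```python
import numpy as np


def disturbance_matrix(person_group, item_group, e_st):
    """Expand group-level disturbances e_st into an N x L matrix e_ni.

    person_group[n] is the (0-based) person subset of person n,
    item_group[i] is the (0-based) item subset of item i, and
    e_st[s, t] is the disturbance shared by that pair of subsets.
    """
    person_group = np.asarray(person_group)
    item_group = np.asarray(item_group)
    return np.asarray(e_st)[person_group[:, None], item_group[None, :]]


def success_probability(b, d, e):
    """Probability of success when the logit is (b_n - d_i) + e_ni."""
    logit = b[:, None] - d[None, :] + e
    return 1.0 / (1.0 + np.exp(-logit))


# Hypothetical uniform item-bias example: P = 2 person groups, Q = 2 item groups.
b = np.array([-1.0, 0.0, 1.0, 2.0])   # person abilities (illustrative)
d = np.array([-0.5, 0.0, 0.5])        # bank item difficulties (illustrative)
person_group = [0, 0, 1, 1]
item_group = [0, 0, 1]
e_st = np.array([[0.0, 0.0],
                 [0.0, -0.8]])        # item group 1 is harder for person group 1

e = disturbance_matrix(person_group, item_group, e_st)
p = success_probability(b, d, e)
print(e)
print(np.round(p, 3))
```

Setting P = 1 and Q = L (one person group, one item per item group) gives the calibration-noise pattern in which the disturbance varies only over items; setting P = N and Q = L gives a separate disturbance for every person-item pair.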

Calibration noise is the disturbance that arises when items with specified difficulties are used in a measurement instrument. The item parameter estimates used for estimating person measures always involve some uncertainty. That uncertainty can be represented by an error variance in the parameter estimates. The parameter estimates are generally assumed to be unbiased and normally distributed. That is, $d_i \sim N(\delta_i, \sigma_i^2)$, where $\sigma_i^2$ is a function of the sample size (and the targeting of $\delta_i$ on the abilities of the people in the calibrating sample). The disturbance model expresses calibration noise by letting Q = L and P = 1 so that $e_{ni} = e_{1i}$. There is only one group of people, so the disturbance only varies over items. The size of $e_{1i}$ depends upon the uncertainty in the item calibration, and it can be viewed as an element sampled from a normal distribution with zero mean and variance $\sigma_i^2$, the error variance in the calibration of item i.

Random misfit is similar to calibration noise. In this case every subset of the item and person partitions contains a single element, so that $e_{ni}$ is different for every item-person combination. That is, Q = L, P = N, and $e_{ni} = e_{ni}$, where $e_{ni}$, the disturbance, is random, perhaps viewed as sampled from a normal distribution with a unique variance $\sigma_{ni}^2$. For simplicity, it is easiest to begin with the variance of the disturbance as constant, so that $\sigma_{ni}^2 = \sigma^2$ for all n and i. Because the disturbance is regarded as random, the particular value of $e_{ni}$ is not related to $b_n$ or $d_i$.

Item bias is a third type of disturbance. There are innumerable definitions of item bias in the literature. Here an item is considered biased when its difficulty parameter is different for one group than for another. Mellenbergh (1982) describes two biases of this type: uniform bias, a constant shift in difficulty, and nonuniform bias, a shift in difficulty related to the ability of the individual. In our formulation only uniform bias is included under the heading item bias. Nonuniform bias is treated as multidimensionality. When the persons are partitioned into a small number of groups (P = p) and the items are partitioned into a small number of groups (Q = q), then the items in group $\alpha$ are considered biased for the people in group $a$ if $e_{ni} = e_{a\alpha} \neq 0$ for $n \in \Theta_a$, $i \in \Omega_\alpha$, and $e_{ni} = 0$ otherwise.

Multidimensionality is modelled with a full person partition (P = N) and items partitioned into a small number of groups (Q = q). Then $e_{ni} = e_{nt}$ for $i \in \Omega_t$, so that the ability of person n, when attempting items in subgroup t, is given by $(b_n + e_{nt})$. When the dimensions underlying the test are thought to be related, the error components need to be specified so that corr$(b_n + e_{nt}, b_n)$ is some function of t. If there are people for whom $e_{nt}$ is zero, then this error specification gives nonuniform item bias. Suppose, for example, a mathematics test includes a small subset of items that have a language requirement that is greater than that of the majority of items; that is, they are confounded by a second dimension. The ability necessary to respond to these items is some combination of mathematics and language ability, and this combination is $(b_n + e_{nt})$.

When variations in item discrimination are introduced, item bias is specified so that both items and persons are partitioned one per subgroup (P = N, Q = L); then $e_{ni} = e_{ni}$, the ability of person n on item i becomes $(b_n + e_{ni})$, and the abilities $(b_n + e_{ni})$ and $b_n$ are correlated differently for different items. If corr$(e_{ni}, b_n)$ is positive, $(b_n + e_{ni})$ will increase for persons with higher abilities and decrease for persons with lower abilities. This leads to increased discrimination between higher and lower abilities in item i, relative to the rest of the items. Similarly, a negative correlation leads to a decrease in item discrimination relative to the rest of the items. If the items are grouped into larger subsets with homogeneous discriminations, then the model for variations in discrimination is exactly the multidimensionality model. This framework shows that for the Rasch model, variation in item disc

Guessing can be included in this disturbance model. As with variations in discrimination and random misfit, both persons and items are fully partitioned (P = N, Q = L) so that $e_{ni} = e_{ni}$. For random misfit, $e_{ni}$ was produced randomly. For variations in discrimination, it was correlated with $b_n$. Now for guessing, $e_{ni}$ is functionally related to $(b_n - d_i)$ so that the probability that person n will guess correctly on item i, $g_{ni}$, brings together the propensity for person n to guess and the chance that item i can be guessed correctly. This is a more general definition of guessing than that typically used with models such as the three-parameter logistic (Birnbaum, 1968), which suggests that guessing varies with items but not individuals. To specify $e_{ni}$ so that $g_{ni}$ is the minimum probability that person n will succeed on item i use:

$$e_{ni} = \max\left[b_n - d_i,\ \log\frac{g_{ni}}{1 - g_{ni}}\right] - (b_n - d_i)$$

so that:

$$\log\frac{P_{ni}}{1 - P_{ni}} = (b_n - d_i) + e_{ni} = \max\left[b_n - d_i,\ \log\frac{g_{ni}}{1 - g_{ni}}\right]$$
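A minimal numerical sketch of the guessing specification, using the form of $e_{ni}$ listed for guessing in Table 14-1, shows that the disturbed success probability never falls below $g_{ni}$ and coincides with the undisturbed Rasch probability whenever that probability already exceeds $g_{ni}$. The function names and the example values are illustrative assumptions.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def rasch_probability(b, d):
    """Undisturbed Rasch probability of success for ability b and difficulty d."""
    return 1.0 / (1.0 + math.exp(-(b - d)))


def guessing_disturbance(b, d, g):
    """e_ni = max[b - d, log(g / (1 - g))] - (b - d), as listed in Table 14-1."""
    return max(b - d, logit(g)) - (b - d)


def disturbed_probability(b, d, g):
    """Probability of success when the logit is (b - d) + e_ni."""
    e = guessing_disturbance(b, d, g)
    return 1.0 / (1.0 + math.exp(-((b - d) + e)))


# Illustrative values: a guessing floor of 0.25 (for example, a four-choice item).
for b in (-3.0, -1.0, 1.0):
    print(b, round(rasch_probability(b, 0.0), 3),
          round(disturbed_probability(b, 0.0, 0.25), 3))
```

For low abilities the disturbance is positive and lifts the probability to the floor; for abilities whose Rasch probability is above the floor the disturbance is zero.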

The disturbances described have been recognized because their identification follows directly from the traditional statistical procedures of psychometrics. Our identification of variations in item discrimination is due to the role that item-test correlations have played in traditional test theory, and multidimensionality is identified because factor analysis is so often applied in test analysis. Item bias is specified because of the concern with the fairness of tests for all individuals. None of these disturbances, however, is more important or more likely than any other. Other, as yet unnamed, disturbances must exist and, in fact, are quite likely to be of equal prevalence. Consider, for example, analogues to item discrimination and item multidimensionality, such as person discrimination and person multidimensionality. These disturbances might be equally prevalent, and might present equally likely threats to valid measurement. They have been neglected because techniques that would expose them have not been routinely applied.

Table 14-1 summarizes the six disturbances that have been singled out for discussion and shows how they can be modelled. These six are introduced because they are most commonly recognized and named. While we have introduced and specified these disturbances separately, they of course exist simultaneously, to a greater or lesser extent, in all real test data. To specify a model with all disturbances is possible, but its study would be discouragingly ambiguous.

Table 14-1 Summary of Some Measurement Disturbances

Disturbance type           $e_{ni}$ =    Characteristics for $(b_n - d_i) + e_{ni}$
calibration noise          $e_i$         $e_i$ sampled from $N(0, \sigma_i^2)$
random misfit              $e_{ni}$      $e_{ni}$ sampled from $N(0, \sigma_{ni}^2)$
item bias                  $e_{st}$      $e_{st}$ some constant
multidimensionality        $e_{nt}$      corr$(b_n, e_{nt})$ varies with t
variable discrimination    $e_{ni}$      corr$(b_n, e_{ni})$ varies with i
guessing                   $e_{ni}$      $e_{ni} = \max[b_n - d_i, \log(g_{ni}/(1 - g_{ni}))] - (b_n - d_i)$

INVESTIGATING DISTURBANCE USING PROX

When the assumption is made that the item parameters are normally distributed, and the mean and the variance of the distribution are known, the PROX estimation equations (Cohen, 1979; Wright & Stone, 1979) provide closed-form estimators for Rasch model ability parameters. These simple equations will be used to deduce the likely effects of disturbance on ability estimates. The analytic findings will then be confirmed by simulations that use maximum likelihood estimation of person parameters.

The disturbance $e_{ni}$ can be introduced either as a modification to the ability of person n or as a modification to the difficulty of item i. In the case of item parameters assumed known and person parameters being estimated, it is the modification of item difficulties that leads to the clearer understanding of the effects of the disturbance. Throughout the examination of measurement disturbance, $d_i$ will be used to denote the available estimate of difficulty for item i, perhaps stored in an item bank. It is this previously calibrated item difficulty that is used as the basis of subsequent person measurement. These difficulties are not, however, the actual item difficulties for each person n. The actual difficulty of the item for individual n is $d_i - e_{ni}$, and will be denoted $\delta_i^n$. Depending upon the disturbance that is modelled, $\delta_i^n$ may or may not vary across individuals.

The actual ability of person n is $\beta_n$. The estimator of $\beta_n$ that uses the estimated (or bank) item difficulties $d_i$ is denoted $b_n$, and the estimator of $\beta_n$ that uses the actual item difficulties $\delta_i^n$ is denoted $\hat\beta_n$. We call $b_n$ the disturbed estimator and $\hat\beta_n$ the undisturbed estimator. Our aim is to investigate the bias in $b_n$ as an estimator of $\beta_n$, and we tackle this by examining the difference between $b_n$ and $\hat\beta_n$. It is important to recognize, however, that to call $b_n$ the disturbed estimate and $\hat\beta_n$ the undisturbed estimate does not imply that $\hat\beta_n$ is better than $b_n$. The person parameter estimate can only be a useful measure when there is a stable frame of reference with respect to which it can be interpreted. The undisturbed ability estimate $\hat\beta_n$ does not satisfy this requirement, because the frame of reference against which it is interpreted is unique to that individual: it depends upon an individually defined set of item difficulties. The disturbed ability estimate $b_n$ may not be useful either, if the difference between $b_n$ and $\hat\beta_n$ makes the existing frame of reference inappropriate for this person. The issue is not one of choosing between $b_n$ and $\hat\beta_n$ but of translating the disturbances into their possible effects on the validity and accuracy of parameter estimates.

When a test is made up of L items with actual difficulties $\delta_i^n$ that are normally distributed, the PROX formula gives the estimate $\hat\beta_n$ for individual n with proportion-correct score $f_n = r_n/L$ as:

$$\hat\beta_n = \delta_\cdot + \sqrt{1 + \sigma_\delta^2/2.89}\;\ln\frac{f_n}{1 - f_n}$$
where $\delta_\cdot$ is the mean of the actual item difficulties $\delta_i^n$ for person n and $\sigma_\delta^2$ is the variance of these actual item difficulties for person n. The item parameters enter this equation through their mean $\delta_\cdot$ and their spread $\sigma_\delta^2$, implying that the effect of the disturbance on the measure can be determined from the effect that the disturbance has upon the mean and dispersion of the item difficulties. A change in the mean item difficulty adds a constant bias to the measure, which is equal to the change in the mean. An increase in the dispersion of the items produces estimates that are further from the centre of the test. A decrease produces estimates that are nearer the centre of the test.

If $d_\cdot$ and $\sigma_d^2$ are the mean and variance of the bank item difficulties $d_i$, then an approximation for the difference between the disturbed ability estimate $b_n$, using the bank difficulties $d_i$, and the undisturbed ability estimate $\hat\beta_n$, using the actual difficulties $\delta_i^n$, for person n is:

$$b_n - \hat\beta_n = (d_\cdot - \delta_\cdot) + \left[\sqrt{1 + \sigma_d^2/2.89} - \sqrt{1 + \sigma_\delta^2/2.89}\right]\ln\frac{f_n}{1 - f_n}$$

If $\hat\beta_n$ is considered an unbiased estimate of the actual ability $\beta_n$, this expression also gives the bias in $b_n$ as an estimator of $\beta_n$. Letting $v_n$ denote the variance ratio $\sigma_\delta^2/\sigma_d^2$ and $\mu_n = (d_\cdot - \delta_\cdot)$ enables $v_n$ and $\mu_n$ to be used as indices of the magnitude of disturbance. Substituting $\sigma_\delta^2 = v_n\sigma_d^2$ into the above equation, the expression becomes:

$$b_n - \hat\beta_n = \mu_n + \left[\sqrt{1 + \sigma_d^2/2.89} - \sqrt{1 + v_n\sigma_d^2/2.89}\right]\ln\frac{f_n}{1 - f_n} \qquad (7)$$

Expression (7) shows that a constant bias, independent of the score $f_n$, is introduced through the difference $\mu_n$. The second term is zero at $f_n = 0.5$ and/or $v_n = 1$ ($\sigma_d^2 = \sigma_\delta^2$). For $v_n \neq 1$, the absolute value of its contribution to the bias increases as the difference between $f_n$ and 0.5 increases. For $v_n < 1$ ($\sigma_d^2 > \sigma_\delta^2$) the bias is away from the centre of the test, and for $v_n > 1$ ($\sigma_d^2 < \sigma_\delta^2$) the bias is towards the centre of the test. The PROX standard errors of the ability estimates $b_n$ and $\hat\beta_n$ are:

$$SE(b_n) = \sqrt{\frac{1 + \sigma_d^2/2.89}{L f_n (1 - f_n)}}, \qquad SE(\hat\beta_n) = \sqrt{\frac{1 + \sigma_\delta^2/2.89}{L f_n (1 - f_n)}}$$
The items only enter these expressions through their dispersion. The larger the item dispersion, the larger the standard error of the parameter estimate. The mean squared error (MSE) of a bank estimate $b_n$ about the actual ability $\beta_n$, based on the estimated difficulties, is:

$$\mathrm{MSE}(b_n) = \mathrm{var}(b_n) + (b_n - \beta_n)^2 \qquad (9)$$

When the estimated difficulties are used to estimate an individual's ability, it is var$(b_n)$ that is reported as the error variance for the ability estimate, but it is the MSE expressed in (9) that gives the actual variation in $b_n$ about $\beta_n$. The difference between (9) and the modelled variation var$(b_n)$ is due to the bias, $b_n - \beta_n$. The ratio of MSE$(b_n)$ to var$(b_n)$:

$$\frac{\mathrm{MSE}(b_n)}{\mathrm{var}(b_n)} = 1 + \frac{(b_n - \beta_n)^2}{\mathrm{var}(b_n)} \qquad (10)$$

is the factor by which the sampling variation of $b_n$ about $\beta_n$ exceeds the error variances that would be reported on the basis of estimated difficulties alone. Expression (10) shows that modelled standard errors that are reported on the basis of the bank difficulties $d_i$ will underestimate the mean squared error in the bank estimates $b_n$. The increased uncertainty is due to the bias in the bank estimates. The bias causes a variation of $b_n$ about $\beta_n$ that is not symmetric. If the bank estimated item difficulties, $d_i$, have greater variation than the actual item difficulties, $\delta_i^n$, then $(b_n - \beta_n)$ will be skewed away from the center of the test. But if the bank item difficulties have less variation than the actual item difficulties, then $(b_n - \beta_n)$ will be skewed toward the center of the test.¹
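The PROX relations in this section can be explored numerically. The sketch below assumes the standard PROX forms (an expansion factor of $\sqrt{1 + \text{variance}/2.89}$ applied to the log-odds of the proportion-correct score, and a standard error equal to that factor divided by $\sqrt{L f_n(1 - f_n)}$), and it treats the undisturbed estimate as unbiased for the actual ability, as the text does. All function names and example values are hypothetical.

```python
import math


def prox_expansion(variance):
    """PROX expansion factor sqrt(1 + variance / 2.89)."""
    return math.sqrt(1.0 + variance / 2.89)


def prox_ability(f, mean_difficulty, var_difficulty):
    """PROX ability estimate for a proportion-correct score f in (0, 1)."""
    return mean_difficulty + prox_expansion(var_difficulty) * math.log(f / (1.0 - f))


def prox_standard_error(f, n_items, var_difficulty):
    """PROX standard error of the ability estimate on an n_items test."""
    return prox_expansion(var_difficulty) / math.sqrt(n_items * f * (1.0 - f))


def bias_and_mse_ratio(f, n_items, mean_bank, var_bank, mu, v):
    """Bias of the bank-based estimate and the ratio MSE / var.

    mu is the shift (bank mean minus actual mean difficulty) and
    v is the ratio of actual to bank difficulty variance.
    """
    b = prox_ability(f, mean_bank, var_bank)                  # disturbed estimate
    beta_hat = prox_ability(f, mean_bank - mu, v * var_bank)  # undisturbed estimate
    bias = b - beta_hat
    var_b = prox_standard_error(f, n_items, var_bank) ** 2
    return bias, 1.0 + bias ** 2 / var_b


# Illustrative case: a 30-item test, score of 80 percent, no shift in mean difficulty.
for v in (0.5, 1.0, 1.5):
    bias, ratio = bias_and_mse_ratio(f=0.8, n_items=30, mean_bank=0.0,
                                     var_bank=1.0, mu=0.0, v=v)
    print(f"v = {v}: bias = {bias:+.3f}, MSE/var = {ratio:.3f}")
```

For a score above 0.5, a variance ratio above one gives a negative bias (toward the centre of the test) and a ratio below one gives a positive bias (away from the centre), in line with the discussion of Expression (7).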

EFFECTS OF THE DISTURBANCE ON ^ n A N D u n Both PROX and UFORM indicate that under the assumptions of this study, the bias in an estimate for person n, based upon disturbed items, depends only on the mean of the disturbances \xn, and the change in the dispersion of the items as expressed through the ratio vn. Because these two indices capture the effect of all of the disturbances in the class t h a t we are considering, they need not be examined separately. Describing each of the disturbances in terms of its effect on fxn and vn will be sufficient to specify the effects of that disturbance. Since vn captures the direction of the bias, it is important to consider the circumstances under which vn is likely to be greater than one and vn is likely to be less than one. Begin by recalling that the bank item difficulties are denoted by di? i = 1, L and the actual item difficulties for individual n are 8f = d{ e ni ; then the variance of the actual item difficulties for person n is: (ig = a* + &i - 2ade,

(11)

1 A similar analysis using UFORM estimation equations (which assume uniform rather than normal distributions for the item and person parameters) indicate the same bias patterns.


where $\sigma_d^2$ is the variance of the bank difficulties, $\sigma_e^2$ is the variance of the disturbances, and $\sigma_{de}$ is the covariance between the bank difficulties and the disturbances. Therefore,

$$\nu_n = \frac{\sigma_\delta^2}{\sigma_d^2} = 1 + \frac{\sigma_e^2 - 2\sigma_{de}}{\sigma_d^2}.$$

When item difficulties and disturbances are uncorrelated, the variance of the actual item difficulties for person n exceeds the variance of the bank item difficulties, and $\nu_n = 1 + (\sigma_e^2/\sigma_d^2)$ is always greater than one. This leads to person parameter estimates that are biased toward the centre of the test. When disturbances and difficulties are negatively correlated, $\nu_n$ will again be greater than one and there will be a bias toward the centre of the test. If the disturbances and item difficulties are positively correlated, however, and their covariance is more than half the variance of the disturbances, then the estimated abilities will be biased away from the centre of the test. That is, to get $\nu_n < 1$ requires:

$$\sigma_e^2 - 2\sigma_{de} < 0,$$

which requires

$$\sigma_{de} > \tfrac{1}{2}\sigma_e^2.$$
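The relation between (11) and the ratio $\nu_n$ can be checked numerically. The sketch below is illustrative only: it assumes the sign convention implied by the $-2\sigma_{de}$ term in (11), namely that the actual difficulty is the bank difficulty minus the disturbance, and the function names are hypothetical rather than taken from the chapter.

```python
import random
from statistics import fmean


def variance(x):
    """Population variance of a sequence."""
    m = fmean(x)
    return fmean((v - m) ** 2 for v in x)


def covariance(x, y):
    """Population covariance of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    return fmean((a - mx) * (b - my) for a, b in zip(x, y))


def variance_ratio(d, e):
    """nu = var(d - e) / var(d); assumes actual = bank - disturbance."""
    delta = [di - ei for di, ei in zip(d, e)]
    return variance(delta) / variance(d)


# Bank difficulties uniformly spaced from -3 to 3 logits (100 items).
d = [-3.0 + 6.0 * i / 99 for i in range(100)]

# Uncorrelated noise: the ratio is expected to exceed one.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in d]
nu_noise = variance_ratio(d, noise)

# Disturbance positively correlated with difficulty, with covariance
# larger than half its variance: the ratio drops below one.
nu_positive = variance_ratio(d, [0.5 * x for x in d])

# Disturbance negatively correlated with difficulty: ratio above one.
nu_negative = variance_ratio(d, [-0.5 * x for x in d])
```

With a disturbance equal to half the bank difficulty the covariance is twice the required threshold and the ratio falls below one; reversing the sign of that disturbance pushes the ratio above one, as the text describes.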

Unless item partitioning is done in terms of item difficulty, calibration noise, random misfit, item bias, multidimensionality, and variation in discrimination are disturbances that are uncorrelated with the item difficulties. This is shown by considering each of the named disturbances in turn and describing their effect on $\mu_n$ and $\nu_n$.

Calibration Noise

In the case of item calibration error, it is assumed that a test is formed by selecting items from a previously calibrated bank. The existing bank item estimates are used as the basis for the estimation of individuals' abilities on the assumption that they can be used as though they were the item difficulties. But this is not the case; the actual difficulty of item i for person n is $\delta_{ni} = d_i - e_i$, where $e_i$ is a random disturbance sampled from a normal distribution with mean zero and variance $\sigma_e^2$, the error variance of the item bank estimate. This kind of disturbance does not affect the mean item difficulty, because the expected values of $\delta_{ni}$ and $d_i$ are equal. It does, however, change the spread of the item difficulties. The disturbances and the estimated item difficulties are uncorrelated, so the variance of the actual item difficulties (assuming independence among items) is:

$$\sigma_\delta^2 = \sigma_d^2 + \sigma_e^2,$$

so that

$$\nu_n = 1 + \frac{\sigma_e^2}{\sigma_d^2},$$

and since $\nu_n$ must be greater than one, the disturbed estimates will be biased toward the center of the test. When calibrated items with estimated error variances are used, an estimate of $\nu_n$ is available and either PROX or UFORM can be used to approximate the bias and the mean squared error. As will be shown later, calibration noise leads to a negligible bias, and it is likely that other disturbances will contribute more to bias and mean squared error than does calibration noise.

Random Misfit

In random misfit the disturbance is unbiased and independent of both person ability and item difficulty. This gives:

$$\mu_n = 0, \qquad \nu_n = 1 + \frac{\sigma_e^2}{\sigma_d^2}.$$

This disturbance will cause $\nu_n$ to be greater than one, and the disturbed estimates will be biased toward the center of the test.

Item Bias

If person parameters are estimated on the basis of known item parameters, then estimates for people who are not in the bias group will not be affected. Taking the simple case of one set of biased items, and one set of people for which the items are biased, the item bias model gives $\delta_{ni} = d_i$ for $i \notin \Omega_t$ or $n \notin \Theta_s$, and $\delta_{ni} = d_i + e_{st}$ for $i \in \Omega_t$ and $n \in \Theta_s$. Letting M be the number of items in $\Omega_t$ gives:
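One reading of the result follows from the definitions alone, taking as assumptions that the disturbance equals $e_{st}$ on the $M$ biased items and zero on the other $L - M$, and that it is uncorrelated with the bank difficulties so that the covariance term in (11) vanishes. Under those assumptions the mean disturbance and the variance ratio would be:

```latex
\mu_n = \frac{M}{L}\, e_{st}, \qquad
\sigma_e^2 = \frac{M}{L}\left(1 - \frac{M}{L}\right) e_{st}^2, \qquad
\nu_n = 1 + \frac{M}{L}\left(1 - \frac{M}{L}\right)\frac{e_{st}^2}{\sigma_d^2}.
```

For instance, with $e_{st} = -0.5$ and $M/L = 0.2$ the mean disturbance is $-0.10$, and the disturbance variance is $0.2 \times 0.8 \times 0.25 = 0.04$.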

Item bias causes a constant bias $\mu_n$, the magnitude of which depends on the size of the constant effect, $e_{st}$, and the proportion, $M/L$, of items that are biased. Because $\nu_n > 1$, the disturbed person parameter estimates will also be biased towards $\mu_n$ by an amount related to the size of the effect, $e_{st}^2/\sigma_d^2$, and the proportion, $M/L$, of items that are biased.

To illustrate the way item bias works, Table 14-2 shows PROX estimates of bias at various levels of ability $\beta$, when the magnitude of the disturbance, $e_{st}$, is -0.25, -0.5, -1.0, and -2.0, and ten, twenty, and forty percent of the items are considered biased on a 100-item test with item difficulties ranging from -3 to 3 logits. The table shows that the disturbed ability estimates are always less than the undisturbed estimates. This negative bias increases with the magnitude of the disturbance and the number of disturbed items. Because the disturbance causes an increase in the test variance, there is a bias toward the center of the test that is added to the constant bias. This means that, relatively speaking, the bias for more able students is greater than the bias for less able students.

The practical consequences of the biases shown in Table 14-2 can be assessed by comparing their magnitude with the minimum measurement error that a 100-item test could provide, namely, $2/\sqrt{100} = 0.20$. For modest item bias (less than 0.5 logits on fewer than 20 percent of the items) the bias is less than 0.10, which is half of one standard error. However, for more severe item bias the estimation bias can exceed two or three standard errors.

Item Multidimensionality

Multidimensionality is similar to item bias, differing in only two minor ways. First, it applies to all persons, not just a subset; second, the disturbance is not a fixed effect for each subset of items but is correlated with ability. In the case of a two-dimensional set of items with M items on a second dimension, this gives:

Table 14-2. Bias in PROX Ability Estimates Caused by Item Bias

Column subheadings give M/L, the proportion of biased items.

           e_st = -0.25          e_st = -0.5           e_st = -1.0           e_st = -2.0
Ability      .10    .20    .40     .10    .20    .40     .10    .20    .40     .10    .20    .40
    1.5    -.026  -.051  -.103   -.053  -.106  -.211   -.111  -.222  -.444   -.244  -.487  -.968
    1.0    -.025  -.051  -.102   -.052  -.104  -.207   -.107  -.215  -.429   -.229  -.457  -.910
    0.5    -.025  -.050  -.101   -.051  -.102  -.204   -.104  -.207  -.414   -.214  -.428  -.854
    0.0    -.025  -.050  -.100   -.050  -.100  -.200   -.100  -.200  -.400   -.200  -.400  -.800
   -0.5    -.025  -.050  -.099   -.049  -.098  -.196   -.096  -.193  -.386   -.186  -.372  -.746
   -1.0    -.025  -.049  -.098   -.048  -.096  -.193   -.093  -.185  -.371   -.171  -.343  -.690
   -1.5    -.024  -.049  -.097   -.047  -.094  -.189   -.089  -.178  -.356   -.156  -.313  -.632
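The qualitative pattern in Table 14-2, a constant shift plus a pull toward the center of the test, can be sketched with the standard PROX person formula (mean item difficulty plus an expansion factor times the score logit). This is a minimal sketch, not the computation behind the table: the use of the population variance, the choice of every fifth item as the biased set, and all function names are assumptions made for illustration.

```python
import math


def prox_ability(score, difficulties):
    """PROX person estimate: mean difficulty + expansion * score logit."""
    n_items = len(difficulties)
    mean = sum(difficulties) / n_items
    var = sum((x - mean) ** 2 for x in difficulties) / n_items
    expansion = math.sqrt(1.0 + var / 2.89)
    return mean + expansion * math.log(score / (n_items - score))


def prox_bias(score, bank, actual):
    """Estimate from bank difficulties minus estimate from actual ones."""
    return prox_ability(score, bank) - prox_ability(score, actual)


# 100 bank difficulties uniformly spaced from -3 to 3 logits.
bank = [-3.0 + 6.0 * i / 99 for i in range(100)]

# Hypothetical biased set: every fifth item (M/L = 0.20) is 0.5 logits
# harder for the person than the bank value says.
actual = [d + 0.5 if i % 5 == 0 else d for i, d in enumerate(bank)]

bias_low = prox_bias(20, bank, actual)    # low scorer
bias_mid = prox_bias(50, bank, actual)    # score at the center of the test
bias_high = prox_bias(80, bank, actual)   # high scorer
```

At the center of the test the score logit is zero, so the bias reduces to the change in mean difficulty; away from the center the larger variance of the actual difficulties makes the bias more negative for high scorers and less negative for low scorers.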


These equations are the same as those for item bias. The difference between the two disturbances is that now the bias occurs for all persons, not just a bias subgroup, and, because $e_{nt}$ varies, $\mu_n$ and $\nu_n$ vary across people. Because the underlying dimensions of most tests are positively correlated, $\mu_n$ and $\nu_n$ will tend to be larger for people with extreme abilities. That is, the biasing will be most pronounced for the people with the highest and lowest ability estimates.

This implies that multidimensionality (and nonuniform item bias) can be advantageous to the least able students. If $\mu_n$ is zero or positive, then less able students will get disturbed ability estimates that are biased upwards. A negative $\mu_n$ may lead to a bias either up or down, depending upon the relative magnitudes of $\mu_n$ and $\nu_n$ and the score of the individual. For small negative $\mu_n$ it is possible for an individual of low ability to have a disturbed estimate that is positively biased. If a test is biased against a set of individuals, the measures of the less able individuals in that set will always be biased upwards relative to those of the more able individuals in that set.

Variations in Item Discrimination

For variations in discrimination, the disturbance varies across all items and all persons. Assuming that the test contains a set of items with a symmetric distribution of discriminations that are independent of item difficulty, then $\mu_n$, the mean disturbance for any person, will be zero, and $\nu_n$ will be given by:

Under these conditions variations in discrimination will behave exactly like random misfit. If the distribution of discriminations is not symmetric, then there will also be a bias due to $\mu_n$, which will no longer be zero.

Over- and Underdetermined Response Patterns

None of the disturbances we have considered so far directly addresses one misfit that is routinely identified in Rasch measurement: variation in person discrimination. The examination of individual response patterns often indicates that the hard items proved harder for the individual than the $d_i$ indicate, while the easy items proved easier; or that the hard items proved easier for the individual than the $d_i$ indicate, and the easy items proved harder.

In the first case the probability that the individual will succeed on easy items is greater than expected, and the probability that they will fail on hard items is greater than expected. In traditional test analyses such a result would be regarded as desirable and might be labelled as high person discrimination. But there is also a sense in which this response pattern is overdetermined by the estimated item difficulties. From the perspective of objective measurement, an overdetermined response pattern is not a desirable outcome. The requirement of invariant item difficulties has been violated. As with the other disturbances, the actual difficulties of the items are unique to that individual.

In the second case, the probability that the individual will succeed on easy items is less than expected, and the probability that they will succeed on hard items is greater than expected. Here the pattern of responses would be underdetermined by the estimated difficulties of the items. In traditional item analyses such a response pattern would correspond to a poorly discriminating person. Again the underdetermined response pattern indicates that the measurement requirement of invariance has been violated: the actual item difficulties are unique to the individual.

Under- and overdetermined response patterns are caused by disturbances that affect both $\mu_n$ and $\nu_n$. For overdetermined response patterns $\delta_{ni} > d_i$ if $d_i > \beta_n$ but $\delta_{ni} < d_i$ if $d_i < \beta_n$. As a result $\mu_n$ may take any value, depending upon the number of items above and below the individual's ability. A uniform distribution of items centered at zero makes $\mu_n > 0$ if $\beta_n > 0$ and $\mu_n < 0$ when $\beta_n < 0$. Thus overdetermined response patterns are likely to cause a bias away from the center of the test. The overdetermined response pattern also indicates that the variance of actual item difficulties is greater than the variance in the calibrated item difficulties. That is, $\nu_n > 1$, which causes a bias toward the center of the test. The net result of these two competing biases will depend upon the distribution of the items and the magnitude of the disturbance.

For underdetermined response patterns the above argument is reversed: $\mu_n$ causes a bias toward the center of the test and $\nu_n$ is most likely to cause a bias away from the center of the test. Again the net result will depend upon the distribution of the items and the magnitude of the disturbance.


SIMULATIONS

The above discussion is based on the expected pattern of bias indicated by PROX (and UFORM). In what follows these expectations are compared to the results of a set of simulations that use maximum likelihood to estimate abilities on the basis of estimated item difficulties. Three classes of disturbances, identified by the type of response patterns they produce, are considered.

The first class comprises the noisy response pattern disturbances. They occur when random disturbances that are uncorrelated with the item difficulties are introduced while generating the response patterns for simulated individuals. Response patterns of this type emulate calibration noise, random misfit, item bias, multidimensionality, and variation in item discrimination. Noisy response patterns are also underdetermined response patterns: the introduction of the random disturbance means that the bank difficulties do not determine the response pattern as well as expected.

The second class of disturbances produces systematically underdetermined response patterns. When $\beta_n$ is the generating ability of person n, a disturbance is introduced that makes the items for which $d_i < \beta_n$ more difficult but items for which $d_i > \beta_n$ less difficult.

The third class of disturbances produces overdetermined response patterns. If $\beta_n$ is the generating ability of person n, a disturbance is introduced that makes the items for which $d_i < \beta_n$ less difficult but items for which $d_i > \beta_n$ more difficult.

One normally distributed sample of 500 persons was generated for all simulations. This sample was constructed by applying an inverse normal transformation to a set of numbers uniformly spaced between 0 and 1 and then scaling them so that the abilities ranged from -3.3 to 3.3 logits. These abilities were fixed throughout all simulations and are referred to as the generating abilities, $\beta$. The mean ability was zero and the standard deviation was 1.3.
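The construction of the generating abilities can be sketched as follows. The exact spacing of the uniform numbers is not stated, so the midpoint positions used here are an assumption, as is the function name; only the sample size and the range of -3.3 to 3.3 logits come from the description above.

```python
from statistics import NormalDist


def generating_abilities(n_persons=500, max_logit=3.3):
    """Normal scores at evenly spaced probabilities, rescaled to +/- max_logit."""
    normal = NormalDist()
    # Midpoint positions (i + 0.5) / n are an assumption; they keep every
    # probability strictly inside (0, 1) so that inv_cdf is defined.
    z = [normal.inv_cdf((i + 0.5) / n_persons) for i in range(n_persons)]
    # Rescale so the largest ability equals max_logit.
    scale = max_logit / z[-1]
    return [scale * value for value in z]


abilities = generating_abilities()
```

The standard deviation produced this way depends on the chosen spacing, so this sketch should not be expected to reproduce the reported value of 1.3.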
Tests of 40, 60, and 100 items were constructed with difficulties uniformly spaced between -3.0 and 3.0 logits. These difficulties were used as the bank difficulties, $d_i$, and were fixed throughout the simulations. Tests shorter than 40 items were not considered because they introduce floor and ceiling effects sufficient to confound the study of bias due to item disturbance alone.

In the process of the simulation each bank item difficulty, $d_i$, had a disturbance added to it to construct an actual difficulty, $\delta_{ni}$, for each individual. The combination of $\beta_n$ and $\delta_{ni}$ was used to simulate item responses and produce test scores. Each test score was then transformed into two logit abilities, $\hat\beta_n$ based on the actual $\delta$s and $b_n$ based on the bank $d$s. This process was replicated 100 times for each sample, producing 100 pairs of $\hat\beta$ and $b$ for each of the 500 generating abilities.

For the noisy response patterns, five different disturbance standard deviations and three different disturbance means were used. Three standard deviations were the same for all items. Two were related to item difficulty. For the fixed standard deviations a random deviate was sampled from a normal distribution with mean 0, 0.25, or 0.5 and a standard deviation of 0.5, 0.75, or 1.0, and added to each bank difficulty, $d_i$. A unique disturbance was added to each item, but that disturbance remained constant across the persons and replications.² The three standard deviations and three means combine to give nine different disturbances.

In an attempt to emulate the effect of calibration noise more closely, two standard deviations that varied with item difficulty were also considered. Here the disturbance for item i was created by randomly selecting a deviate from a normal distribution with zero mean and variance given by:
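One form consistent with the description that follows is the reciprocal of the Rasch item information summed over a calibrating sample of N persons who all have ability zero. The sketch below takes that form as an assumption; the function name is hypothetical, and reading N10 and N100 as calibrating samples of 10 and 100 persons follows the later discussion of the results.

```python
import math


def calibration_error_variance(d, n_calibrating):
    """Asymptotic error variance of an item estimate, all persons at ability 0."""
    # Rasch probability of success on an item of difficulty d at ability 0.
    p = 1.0 / (1.0 + math.exp(d))
    # Each person contributes information p * (1 - p) about the item.
    return 1.0 / (n_calibrating * p * (1.0 - p))


# An item at the center of the test calibrated on 100 persons.
center_variance = calibration_error_variance(0.0, 100)
# An item at the edge of the test (3 logits) calibrated on 100 persons.
edge_variance = calibration_error_variance(3.0, 100)
```

Under this assumed form the variance is smallest for items at the center of the test and grows symmetrically toward the hardest and easiest items.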

This is an estimate of the asymptotic error variance for item parameter estimates made with a calibrating sample of size N, under the assumption that all members of the calibrating sample had $\beta = 0$ and that there was no covariance between item parameter estimates. This will overestimate the error variance for the hardest and easiest items and underestimate the error variance in the middle of the test. These two noise disturbances were generated with zero means and are denoted N10 and N100.

To produce response patterns that were underdetermined, either 0.25 or 0.5 logits was subtracted from the difficulty of an item when $d_i$ was greater than $\beta_n$, and either 0.25 or 0.5 logits was added to the difficulty when $d_i$ was less than $\beta_n$. These are denoted U25 and U50. To produce overdetermined response patterns, either 0.25 or 0.5 logits was added to the difficulty of an item when $d_i$ was greater than $\beta_n$, and either 0.25 or 0.5 logits was subtracted when $d_i$ was less than $\beta_n$. These are denoted O25 and O50.

² A disturbance can be generated for each item and held constant across persons and replications, or disturbances can be generated for each item-person combination and held constant across replications, or a unique disturbance can be generated for every item-person-replication combination. It was found that all three choices produced the same results. The first choice cuts computing time dramatically, and it was adopted.


For the underdetermined response pattern an additional condition was applied that prevented the actual difficulty $\delta_{ni}$ from becoming greater than $\beta_n$ if $d_i$ was less than $\beta_n$, or $\delta_{ni}$ from becoming less than $\beta_n$ if $d_i$ was greater than $\beta_n$; in these cases $\delta_{ni}$ was set equal to $\beta_n$. This leads to 15 different disturbances. The five noisy response pattern disturbances with zero means were used with tests of 40, 60, and 100 items, and the remaining ten disturbances were applied with the 100-item tests only.
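The rules for the under- and overdetermined disturbances, including the additional condition just described, can be written compactly. This is a minimal sketch with hypothetical names; how an item whose bank difficulty exactly equals the person's ability is treated is not stated, so leaving it unchanged is an assumption.

```python
def shifted_difficulty(d, beta, shift, pattern):
    """Actual difficulty of one item under a 'U' or 'O' disturbance."""
    if d == beta:
        # Case not covered by the description; left unchanged (assumption).
        return d
    if pattern == "U":
        # Underdetermined: move the item toward the person's ability,
        # but never across it (the additional condition).
        if d > beta:
            return max(d - shift, beta)
        return min(d + shift, beta)
    if pattern == "O":
        # Overdetermined: move the item away from the person's ability.
        return d + shift if d > beta else d - shift
    raise ValueError("pattern must be 'U' or 'O'")


# U25 for a person at 0 logits: an item at 1.0 moves to 0.75, while an
# item at 0.1 would cross the ability and is set equal to it instead.
u_far = shifted_difficulty(1.0, 0.0, 0.25, "U")
u_near = shifted_difficulty(0.1, 0.0, 0.25, "U")
# O50 for the same person: an item at 1.0 moves to 1.5.
o_far = shifted_difficulty(1.0, 0.0, 0.5, "O")
```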

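RESULTS

In analyzing the results of the simulations, it was the bias caused by the disturbance that proved to be of most interest. For every simulation two bias indices were saved for each of the 500 sample elements:

$$\text{BIAS-}\beta = \frac{1}{R}\sum_{r=1}^{R}\left(\hat\beta_{nr} - \beta_n\right)$$

and

$$\text{BIAS-}b = \frac{1}{R}\sum_{r=1}^{R}\left(b_{nr} - \beta_n\right).$$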

In both of these indices the denominator R is the number of successful replications for generating ability $\beta$, and the summation was taken over each of the successful replications.³ The first index, BIAS-$\beta$, provides a frame of reference for the bias in b, since it is the bias that would be expected if there were no disturbance. BIAS-$\beta$ is the bias in the estimates $\hat\beta_n$ of $\beta_n$. Each $\hat\beta_n$ is estimated using the actual item difficulties for person n, $\delta_{ni}$; it does not involve the disturbance, so it is expected to have a mean close to zero for all ability levels and test lengths. The second index, BIAS-b, gives the bias in the disturbed estimates b when used as estimates of the actual ability $\beta$. Most of the analysis is concerned with the magnitude of BIAS-b and the way it is related to test length, disturbance, and $\beta$.

The first step in the analysis was to compare the 500 disturbed and undisturbed ability estimates with each other and with the true abilities. When such a comparison was undertaken, remarkable agreement was found between the two parameter estimates and the actual parameter values. Figure 14-1 contains a comparison of each of the disturbed and undisturbed estimates with the true abilities for tests of 100 items and disturbance $\sigma = 1$, $\mu = 0$. This figure shows a worst-case scenario in the comparison of disturbed and undisturbed estimates when the disturbance has a zero mean. The test is long, so both sets of estimates have negligible standard errors, and any discrepancy between estimates and between the disturbed estimates and the generating values is almost entirely due to the disturbance.

³ A successful replication being one for which a finite ability was estimable for $\beta$.

The relationship between the 500 disturbed estimates and the generating parameter values for tests of length 100 and disturbance $\sigma = 1$, $\mu = 0$ is examined more closely in Figure 14-2. The solid line

Figure 14-1. Comparison of the 500 generating abilities $\beta$, undisturbed estimates $\hat\beta$, and disturbed estimates $b$, for 100-item tests with disturbance $\sigma = 1$, $\mu = 0$ (horizontal axis: generating ability, $\beta$).

Figure 14-2. Plot of the 500 generating abilities, $\beta$, against disturbed estimates, $b$, for 100-item tests with disturbance $\sigma = 1$, $\mu = 0$ (horizontal axis: generating ability, $\beta$).

corresponds to where the points would lie if the disturbed estimates and true values were equal. The bias in b shown in Figure 14-1 is towards the center of the test. This is consistent with the predictions based on PROX and UFORM, which showed that the larger variance in the actual item difficulties ($\nu_n > 1$) results in parameter estimates that are biased toward the center of the test.

Figures 14-1 and 14-2 highlight that, even with a substantial amount of disturbance that leads to noisy response patterns, the bias in the person parameter estimation is quite small. This is consistent with a result reported by Wright and Douglas (1977), who found that for test designs encountered in practice, a random disturbance with a standard deviation as large as one leads to negligible distortions in ability estimates.

Table 14-3 shows the mean, standard deviation, and range for the bias indices BIAS-$\beta$ and BIAS-b for all of the disturbances. Results for BIAS-$\beta$ are reported only once for each test length, because they are independent of the disturbance. The difference between the BIAS-b results and the BIAS-$\beta$ results is due to the disturbances. When there is no disturbance BIAS-b is equal to BIAS-$\beta$.

The results shown in Table 14-3 follow those that were predicted on the basis of PROX and UFORM. In each case the mean of the bias is close to the mean of the disturbance, and, for the noisy response patterns, the range and standard deviation of the bias increase with the standard deviation of the disturbance. The range and standard deviation of bias decrease as test length

Table 14-3. Mean, Standard Deviation, and Range of Bias in Undisturbed and Disturbed Parameter Estimates

                    Test length 40          Test length 60          Test length 100
 σ      μ           mean    sd   range      mean    sd   range      mean    sd   range

Undisturbed estimates
0.00   0.00        -.001  .045   .332       .000  .036   .250       .000  .029   .225

Noisy response patterns
0.50   0.00        -.002  .049   .303      -.005  .040   .262      -.001  .036   .275
0.50   0.25                                                          .250  .036   .164
0.50   0.50                                                          .492  .039   .228
0.75   0.00         .022  .071   .496       .025  .057   .437      -.007  .060   .398
0.75   0.25                                                          .241  .059   .353
0.75   0.50                                                          .479  .066   .396
1.00   0.00         .001  .108   .687      -.010  .089   .537      -.016  .094   .549
1.00   0.25                                                          .231  .097   .635
1.00   0.50                                                          .533  .085   .529
N100               -.003  .046   .280       .001  .034   .227       .001  .029   .182
N10                -.027  .070   .529      -.011  .064   .486      -.003  .048   .402

Overdetermined response patterns
O25                                                                  .000  .070   .652
O50                                                                  .003  .130  1.229

Underdetermined response patterns
U25                                                                  .002  .081   .568
U50                                                                  .000  .155  1.011

increases; this was not predicted by PROX or UFORM. Both the PROX and UFORM bias formulae are independent of test length. Support for this is given by the decrease in the range and standard deviation of the bias with no disturbance added.

For the noisy response patterns, the largest bias reported in Table 14-3 is approximately 0.34 logits (half of the range of 0.687) for 40-item tests with disturbance $\sigma = 1$. This maximum bias is no more than the standard error of person parameter estimates typical of a 40-item test. For the 100-item tests with disturbance $\sigma = 1$ the largest bias is approximately 0.27 logits, and this, too, is no more than the standard errors typical of 100-item tests. In fact, because the maximum biases occur at the extremes of the test, the modelled standard error of a parameter estimate always exceeds the corresponding bias by a considerable amount.

It is also clear from Table 14-3 that BIAS-b for disturbance $\sigma = 0.5$ is not much larger than BIAS-$\beta$. In fact, for $\sigma < 0.5$ the bias is not discernible. Similarly, for tests calibrated on samples as small as 100, item parameter uncertainty does not cause any discernible bias in the person parameter estimates: the standard deviations of BIAS-b and BIAS-$\beta$ are almost identical, and the range of BIAS-b is slightly less than the range of BIAS-$\beta$. Even items calibrated on as few as 10 people appear to give person parameter estimates that are not excessively biased. The standard deviations and ranges for the over- and underdetermined response patterns, however, do show a substantial variation in the parameter estimates.

Figures 14-3, 14-4, and 14-5 show how the bias, BIAS-b, in the disturbed estimates varies with the generating values of $\beta$. Each plot contains 500 points, one for each ability, showing the mean bias from the 100 replications in the simulation. The plots also include a smooth curve, which is the expected bias based on PROX calculations. The PROX estimates were produced by using the generating ability, $\beta$, and the bank difficulties, $d$, to produce expected relative scores for each individual. The variances of the bank difficulties and the disturbance generating parameters were then used to estimate $\mu_n$ and $\nu_n$, and the bias was calculated.

The PROX and UFORM estimates of bias due to disturbance are very similar. The PROX results are presented because, under the PROX assumptions, the bias is determined by the effect of the disturbance on the mean and variance of the item difficulties, an effect that can be easily derived. Under the UFORM assumptions the bias is determined by the effect of the disturbance on the range, an effect that cannot be easily derived. For each plot in Figure 14-3 there is strong agreement between the

Figure 14-3. Bias in disturbed ability estimates, BIAS-b, plotted against the generating ability, $\beta$, for a variety of noisy response patterns with mean disturbance zero (horizontal axis of each panel: generating ability, $\beta$).

PROX estimate of the expected bias and the observed bias. As predicted, the noisy response pattern disturbances shown in Figure 14-3 cause ability estimates to be biased toward the centre of the test. The amount of the bias toward the centre of the test is larger for the larger disturbances. Change in test length alters the sampling variation but not the magnitude of the bias.
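The direction of this bias can be illustrated without any sampling variation by replacing simulated responses with expected scores. The sketch below is not the simulation reported here: it assumes a deterministic zero-mean disturbance that alternates between +1 and -1 logit across items, and it obtains the disturbed estimate by solving the Rasch maximum likelihood equation (expected score equals observed score) with the bank difficulties. All names are hypothetical.

```python
import math


def expected_score(ability, difficulties):
    """Expected raw score under the Rasch model."""
    return sum(1.0 / (1.0 + math.exp(d - ability)) for d in difficulties)


def ml_ability(score, difficulties, tol=1e-10):
    """Solve expected_score(b) = score by bisection."""
    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_score(mid, difficulties) < score:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def disturbed_bias(ability, bank, actual):
    """b - beta when the score arises from `actual` but is scored with `bank`."""
    score = expected_score(ability, actual)
    return ml_ability(score, bank) - ability


# 100 bank difficulties uniformly spaced from -3 to 3 logits.
bank = [-3.0 + 6.0 * i / 99 for i in range(100)]
# Hypothetical zero-mean disturbance: alternate +1 and -1 logit.
actual = [d + (1.0 if i % 2 == 0 else -1.0) for i, d in enumerate(bank)]

bias_high = disturbed_bias(2.0, bank, actual)    # able person
bias_center = disturbed_bias(0.0, bank, actual)  # person at the center
bias_low = disturbed_bias(-2.0, bank, actual)    # less able person
```

Because the disturbed difficulties are more spread out than the bank difficulties but have the same mean, the estimate for the able person is pulled down and the estimate for the less able person is pulled up, while the estimate at the center of the test is unaffected.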

Figure 14-4. Bias in disturbed ability estimates, BIAS-b, plotted against the generating ability, $\beta$, for the calibration noise disturbances and noisy response patterns with nonzero mean disturbances (horizontal axis of each panel: generating ability, $\beta$).

Figure 14-4 shows the bias for the N10 and N100 disturbances and the noisy response pattern disturbances that have a nonzero mean. For N10 and N100 the bias is toward the centre of the test, as predicted. But the PROX estimates are not as accurate as they are for the constant variance disturbance. For N10 it appears that in the middle of the test there is less bias than predicted by PROX. This may occur because the disturbance is smallest in the middle of the test, and the

Figure 14-5. Bias in disturbed ability estimates, BIAS-b, plotted against the generating ability, $\beta$, for over- and underdetermined response patterns (horizontal axis of each panel: generating ability, $\beta$).

items in the middle of the test carry most information for the estimation of the abilities in the middle of the test. The N100 plot shows negligible bias, and the two nonzero mean plots show the effect of the constant bias and the bias that varies with ability.

Figure 14-5 shows the bias caused by the under- and overdetermined response patterns. The overdetermined response patterns show a bias away from the center of the test, and the underdetermined response patterns show a bias toward the center of the test. There is a substantial range in the middle of the test, however, in which none of these disturbances leads to bias larger than 0.2 logits.


SUMMARY AND CONCLUSION

The framework for describing measurement disturbance that was developed in this study shows that a substantial range of misfit to the Rasch model can be expressed as interactions between individual group membership and item group membership. This makes it possible to use the PROX estimation equations to determine the effects of all varieties of measurement disturbance on person parameter estimates. PROX estimates of abilities depend only upon the mean difficulty of the items and their variance. If the effect of the disturbance on the mean item difficulty and variance is available, then PROX estimation equations can be used to obtain ability estimates based on both the bank and actual difficulties, and the simulations confirm that PROX estimates do accurately predict the nature and magnitude of the effects of disturbance on person parameter estimates.

Further, it was shown that the disturbance manifests itself as a bias in the parameter estimates. That is, disturbance leads to systematic errors in the estimation of individual person parameters. When the disturbance changes the mean of the item difficulties, there is a constant bias equal to the change in the mean. When the disturbance alters the variance of the item difficulties, a bias either toward or away from the centre of the test results. When the response pattern is noisy or underdetermined, the likely bias is towards the centre of the test. When the response pattern is overdetermined, the likely bias is away from the centre of the test.

In practice, of course, the effect that the disturbance has upon the mean, $\mu$, and variance, $\nu$, of the item difficulties is unknown. A further line of research, which examines the relationship between fit statistics and $\mu$ and $\nu$, may be profitable. If fit statistics could be found that are systematically related to $\mu$ and $\nu$, then estimates of the bias caused by the disturbance would become available. At this point we are only able to use fit statistics to indicate the likely direction of the bias. Previous research (Smith, 1982) has indicated that the t-fit statistics used by Wright and Stone (1979) are most sensitive to variations in discrimination. A positive t-statistic for a person generally corresponds to an underdetermined response pattern, while a negative t-statistic corresponds to an overdetermined response pattern. Pending further investigation, this suggests that positive t-statistics correspond to person parameter estimates biased towards the center of the test, and negative t-statistics correspond to person parameter estimates biased away from the center of the test.

While it may be possible to use indices of fit to obtain estimates of this bias, it is not recommended that the bias estimates be used as a correction to estimated parameters. The disturbed ability estimate is based on a standard set of item difficulties known, by virtue of a misfit indicator, not to be appropriate for the individual. The unbiased estimate is based on a nonstandard, slightly different set of item difficulties unique to that individual. Neither $b_n$ nor $\hat\beta_n$ qualifies as a best measure.

REFERENCES

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick, Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley.

Cohen, L. (1979). Approximate expression for parameter estimates in the Rasch model. British Journal of Mathematical and Statistical Psychology, 32, 113-120.

Gustafsson, J-E. (1980). Testing and obtaining fit of data to the Rasch model. British Journal of Mathematical and Statistical Psychology, 33, 205-233.

Martin-Löf, P. (1974). The notion of redundancy and its use as a quantitative measure of discrepancy between a statistical hypothesis and a set of o[...]

Mellenbergh, G.J. (1982). Contingency table methods for assessing item bias. Journal of Educational Statistics, 7, 105-118.

Smith, R.M. (1982). Detecting measurement disturbances with the Rasch model. Unpublished doctoral dissertation, University of Chicago.

[...] tests (expanded ed.). Chicago: The University of Chicago Press. (Original work published 1960)

Thurstone, L.L. (1928). Attitudes can be measured. American Journal of Sociology, 33, 529-554.

van den Wollenberg, A.L. (1988). Testing a latent trait model. In R. Langeheine & J. Rost (Eds.), Latent trait models and latent class models. New York: Plenum Press.

van den Wollenberg, A.L., Wierda, F.W., & Jansen, P.G.W. (1988). Consistency of Rasch model parameter estimation: A simulation study. Applied Psychological Measurement, 12, 307-313.

Wright, B.D. (1989). Deducing the Rasch model from Thurstone's requirement that item comparisons be sample free. Rasch Measurement Special Interest Group Newsletter, 3(1), 9-10.

Wright, B.D., & Douglas, G.A. (1977). Best procedures for sample-free item analysis. Applied Psychological Measurement, 1, 281-295.

[...] Chicago: MESA Press.

chapter 15

Comparing Attitude Across Different Cultures: Two Quantitative Approaches to Construct Validity

Mark Wilson

University of California, Berkeley

Use of an instrument across national and cultural groups raises issues concerning the validity of any comparison between the groups, because respondents in the groups may have understood the questions they are being asked in different ways according to their group membership. These differences could arise in translation or from cognitive and affective differences between cultural groups. For attitude scales and other types of instruments in the affective domain, the most usual process used to ensure that a scale's meaning has not drifted too far in translation is back-translation. That is, each translated item is translated back into the original language, and a panel of experts is consulted to ensure that the original and the back-translation are sufficiently close.

International comparisons of ability and attitude are an important part of the arsenal of techniques available to comparative education. For a comprehensive discussion of this issue with respect to ability tests, see Irvine and Berry (1988). In this chapter the focus is on the affective domain. An example is provided by the studies of the International Project for the Evaluation of Educational Achievement (IEA) comparing various national educational systems. These studies make regular use of attitude assessment instruments whose qualities within different cultures and languages must be considered constant to a certain degree in order to make such comparisons valid (e.g., Husen, 1967; Linden, 1977; Walker, 1976). In these studies the comparability of results across languages is examined exclusively by using back-translation to establish content validity (Messick, 1989). In this chapter, examples are given of techniques that could be used in addition to back-translation and that would allow one to examine the construct validity (Messick, 1989) of the instrument across cultures. Note that the point of this chapter is not to criticize the process of back-translation, but rather to raise the question of whether back-translation alone is sufficient, and to describe some additional techniques that may be useful.

When one wishes to compare a particular attitude across contexts, such as across different nationalities or languages, it is necessary first to establish that the instrument being used to assess the attitude means the same in the different contexts; otherwise the interpretation of differences becomes intractable. The question boils down to: What must remain the same in order to detect meaningful differences? This problem has been known to psychometricians as the issue of item parameter invariance:

"What are needed are item parameters that remain approximately invariant from group to group. Since this need arises because of variations among groups of examinees in the abilities or traits measured by the items, any solution must necessarily involve a consideration of the relation between these abilities or traits and examinee performance on the items. The problem of dealing with the relationship between the examinee's mental traits and his performance is not a simple one, but we cannot avoid it. It lies at the heart of mental test theory, which is, after all, fundamentally concerned with inferring the examinee's mental traits from his responses to test items." (Lord & Novick, 1968, p. 354)

What is different about the present study is that I am applying this same logic, which has traditionally been applied to ability and achievement tests, to instruments in the affective domain.

One problem with the application of construct validity concepts in the affective domain is that instruments are frequently developed seemingly without an explicit reference to any underlying structure that might be used as the basis for the examination of construct validity. This should not be seen so much as a problem with the use of construct validity as a criterion, but rather as a problem with the construction of such instruments. Messick (1989) has argued strongly that construct validity is the foremost criterion for establishing validity. Any instrument developed without some sort of construct validation should be considered as having dubious quality.

In fact, most instruments in the affective domain are scored by simply adding up the weights for the (usually Likert-type) responses for each item on a given subscale. This implicitly assumes that the underlying construct for the subscale is a unidimensional latent trait. Moreover, Andersen (1973) has shown that where the weights are integers (which is true in the great majority of cases), the resulting scores can be sufficient statistics only where the underlying model is a Rasch model (here I am referring to the class of models defined by Rasch, 1960/1980, that have specific objectivity, not just the simple logistic model). Thus, one can argue that even in cases where the instrument developers have ignored all reference to construct validity, the use of weighted scores betrays an unstated reliance on a unidimensional structure, and the use of integer weights betrays an unstated reliance on fit to a Rasch model.

Consider first an instrument that is intended to measure just one unidimensional attitude. What is needed to ensure that measurements within a certain context can be compared to measurements within a new context is that (a) the instrument is also unidimensional in the new context (consistent dimensionality), and (b) it is sufficiently consistent in its parametric structure (consistent construct validity).

For an instrument composed of several subscales, the situation can be somewhat more complicated. Such multiscale instruments are being increasingly used in social sciences research, for example, in the learning environment literature (Epstein & McPartland, 1976; Fraser & Fisher, 1983; Moos, 1978; Walberg, 1979). If the theoretical basis of the instrument specifies no particular a priori multidimensional relationship between the subscales, then assessment of consistency involves only the replication of the above steps with each of the subscales. But if some particular relationship among the latent traits represented by the subscales is postulated as an inherent part of the construct, then, after confirming measurement stability for each subscale, the stability of the multidimensional relationship among the subscales must also be confirmed.

In this study, I will consider two different approaches to the study of measurement consistency: a structural equation modelling (SEM) approach and an item response theory (IRT) approach. Below I describe the two approaches, and this is followed by an example that illustrates the methods. For ease of understanding by an English-speaking audience, the example makes a comparison across two different English-speaking cultures, rather than across two different language groups.

THE TWO APPROACHES

Structural Equation Modelling Approach

In what follows, I describe statistics that result when one applies the unweighted least squares estimation procedure to polychoric correlation matrices rather than the more common maximum likelihood estimation applied to product moment correlation matrices. This is done because the assumption of normality of observed variables is unlikely to be fulfilled (even approximately) by Likert-style items such as those most commonly used in the affective domain (Jöreskog & Sörbom, 1986). Using polychoric correlation coefficients assumes that the distribution of the observed categories on the Likert scale results from the discretization of an unobservable (latent) normally distributed variable into the categories by cutting the latent variable at successive thresholds. This has the advantage that the assumptions on which the analysis is based are more like what one might expect to be the case, but it also has the disadvantage that no standard errors are available, nor are chi-square fit tests available.

Unidimensionality. The unidimensionality of each scale within a multiscale instrument may be assessed using a congeneric test model approach (Jöreskog, 1971). Each subscale is first fitted to a one-factor LISREL model (Jöreskog & Sörbom, 1986) with one loading (the first) fixed to unity to provide a scale. Fit to a unidimensional model can be assessed by a number of measures, among them the squared multiple correlation (SMC) between each item and the underlying factor, the coefficient of determination (D), and the root mean square residual (RMR). The SMC for item i on a subscale is

$$\mathrm{SMC}_i = 1 - \frac{\hat{\theta}_{ii}}{s_{ii}},$$

where $\hat{\theta}_{ii}$ is the modelled error variance and $s_{ii}$ is the observed variance for item i (Jöreskog & Sörbom, 1986, p. 1.37). The coefficient of determination, D, is

$$D = 1 - \frac{|\hat{\Theta}|}{|S|},$$

where | | is the matrix determinant function, $\hat{\Theta}$ is the covariance matrix of the modelled errors, and S is the covariance matrix of the observed variables. It varies between 0 and 1 and is a generalized measure of reliability for the whole model (Jöreskog & Sörbom, 1986, p. 1.37).
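The descriptive fit indices used in this approach (the SMC and D above, together with the RMR, MR, and GFI described in the paragraphs that follow) can be sketched numerically. The following is a minimal illustration in plain Python, not the LISREL implementation: the function names `determinant` and `fit_indices`, the toy loadings, and the artificial residual of .05 are all hypothetical, and the RMR and GFI expressions are assumed to be the unweighted least squares forms, so they should be checked against the software actually used.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def fit_indices(S, Sigma, Theta):
    """S: observed, Sigma: fitted, Theta: modelled-error covariance matrices (k x k)."""
    k = len(S)
    resid = [[S[i][j] - Sigma[i][j] for j in range(k)] for i in range(k)]
    # SMC_i = 1 - (modelled error variance) / (observed variance)
    smc = [1.0 - Theta[i][i] / S[i][i] for i in range(k)]
    # D = 1 - |Theta| / |S|
    d = 1.0 - determinant(Theta) / determinant(S)
    # RMR: root mean square of the k(k+1)/2 distinct residuals (j <= i)
    lower = sum(resid[i][j] ** 2 for i in range(k) for j in range(i + 1))
    rmr = math.sqrt(2.0 * lower / (k * (k + 1)))
    # MR: largest absolute residual
    mr = max(abs(x) for row in resid for x in row)
    # GFI (assumed ULS form) = 1 - tr[(S - Sigma)^2] / tr(S^2);
    # for a symmetric matrix A, tr(A^2) is the sum of its squared elements.
    gfi = 1.0 - (sum(x * x for row in resid for x in row)
                 / sum(x * x for row in S for x in row))
    return {"SMC": smc, "D": d, "RMR": rmr, "MR": mr, "GFI": gfi}


# Hypothetical standardized loadings for a four-item subscale (illustration only).
loadings = [0.7, 0.6, 0.5, 0.8]
k = len(loadings)
Theta = [[(1.0 - loadings[i] ** 2) if i == j else 0.0 for j in range(k)] for i in range(k)]
Sigma = [[loadings[i] * loadings[j] + Theta[i][j] for j in range(k)] for i in range(k)]
# Observed matrix: the fitted matrix plus one artificial residual of .05.
S = [row[:] for row in Sigma]
S[0][1] += 0.05
S[1][0] += 0.05

result = fit_indices(S, Sigma, Theta)
```

In this toy case the SMCs equal the squared loadings because the observed variances are 1, and the single misfitting covariance shows up directly as the MR while contributing only modestly to the RMR, which is why the two residual summaries are read together.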

The RMR is

$$\mathrm{RMR} = \left[ \frac{2 \sum_{i=1}^{k} \sum_{j=1}^{i} (s_{ij} - \hat{\sigma}_{ij})^2}{k(k+1)} \right]^{1/2},$$

where k is the number of items, $s_{ij}$ are the elements of S, and $\hat{\sigma}_{ij}$ are the elements of $\hat{\Sigma}$, the fitted variance-covariance matrix (Jöreskog & Sörbom, 1986, p. 1.4). It is an indicator of a typical element among the variance and covariance residuals, and must be interpreted with respect to the size of the elements of S. The maximum of the residuals (MR) is also useful for getting a feel for the worst-case variation around the RMR. Fit can also be judged by using Jöreskog's goodness of fit index (GFI) as an overall measure of fit (Jöreskog & Sörbom, 1986, pp. 1.40, IV.17). The goodness of fit index is

$$\mathrm{GFI} = 1 - \frac{\operatorname{tr}[(S - \hat{\Sigma})^2]}{\operatorname{tr}(S^2)},$$

where tr is the matrix trace function. GFI is a measure of the relative amount of variance and covariance accounted for by the model (i.e., the closer to 1, the more variance accounted for by the model), and it is independent of sample size and relatively robust against departures from normality. It can be used to compare the fit of models for different data, but its distributional properties are unknown, so there is no standard with which to compare it.

Item parameter invariance. In the SEM approach, item parameter invariance is assessed by testing the fit of a one-factor solution with factor loadings constrained to be the same across both samples (Munck, 1979). The same indices of fit are used here as were used for checking unidimensionality.

Item Response Theory Approach

In this discussion, I will use a particular form of IRT model drawn from the Rasch family of measurement models (Wright & Masters, 1982), and designed specifically for ordered polytomous data. The advantages of using Rasch models when the data have the appropriate characteristics have been noted elsewhere (Masters & Wright, 1984), and I will not pursue the issue here. The partial credit model (Masters, 1982) takes as its basic observation the number of steps that a person has made beyond the lowest performance level, or, in a rating situation, the number of steps that the object has been judged to be above the lowest level. Note that the number of ordered levels in each item need not be constant across all items, although it is constant in many cases in attitude measurement because of the predominance of Likert-type response alternatives. Consequently, the basic parameter is the step difficulty within each item. For an item with m + 1 ordered levels from 0 to m, the probability of person i with ability $\beta_i$ being observed in category n in item j ($y_{ij} = n$) is:

$$P(y_{ij} = n \mid \beta_i) = \frac{\exp \sum_{k=1}^{n} (\beta_i - \delta_{jk})}{1 + \sum_{h=1}^{m} \exp \sum_{k=1}^{h} (\beta_i - \delta_{jk})}$$

for n = 1, 2, . . . , m, where $\delta_{jk}$ is the difficulty parameter for step k in item j; and

$$P(y_{ij} = 0 \mid \beta_i) = \frac{1}{1 + \sum_{h=1}^{m} \exp \sum_{k=1}^{h} (\beta_i - \delta_{jk})}.$$

The local independence assumption used in the partial credit model is that, conditional on step difficulties, the interaction between a person and an item is independent between items. The analyses were conducted using the Quest computer program (Adams & Khoo, 1991).

Model-data fit. In order to use the partial credit model to compare subscales across different groups one must first check for adequate model-data fit. Only if the model fits in both contexts can meaningful comparisons be made. Note that this criterion is more demanding than the criterion of unidimensionality used in the SEM approach, as items may misfit due to other problems besides multidimensionality. Model fit is assessed here using two indices. The "Person Fit t" gives an indication of the statistical significance of misfit for persons. With no misfit, it is distributed approximately as a normal distribution with mean 0 and standard deviation 1 (Wright & Masters, 1982). A "mean square" statistic is used to assess the degree of item misfit (Wright & Masters, 1982). It has an expected value of 1, and a rule of thumb that I will use here is that the effect is strong when the statistic is outside the range (.75, 1.3).
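The category probabilities of the partial credit model described earlier in this section can be illustrated with a short computation. This is a minimal sketch rather than the Quest implementation: the function name `pcm_probabilities` and the step difficulties are hypothetical, chosen only to resemble a four-category Likert item.

```python
import math


def pcm_probabilities(beta, deltas):
    """Category probabilities (categories 0..m) under the partial credit model.

    beta   : person location in logits
    deltas : step difficulties [delta_1, ..., delta_m] for one item
    """
    # The exponent for category n is the sum over k <= n of (beta - delta_k);
    # the empty sum for category 0 is zero.
    exponents = [0.0]
    for delta in deltas:
        exponents.append(exponents[-1] + (beta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(exponents)
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical step difficulties for a four-category item (SD, D, A, SA).
steps = [-2.0, 0.0, 2.0]
probs_mid = pcm_probabilities(0.0, steps)   # person located at the second step
probs_low = pcm_probabilities(-2.0, steps)  # person located at the first step
```

A property of the model that is useful when reading category characteristic curves is visible here: when a person's location equals a step difficulty, the two adjacent categories are equally likely, so `probs_low` gives equal probability to the two lowest categories and `probs_mid` gives equal probability to the two middle ones.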

Item parameter invariance. Item statistics can be compared to check for equivalence of item location using the item step difficulty estimates. These comparisons can be routinized by using the standardized difference between the parameters:

$$z = \frac{\hat{\delta}'_{jk} - \hat{\delta}_{jk}}{\sqrt{\sigma'^{2}_{jk} + \sigma^{2}_{jk}}},$$

where the primed estimates refer to those from one sample, the unprimed estimates refer to the other sample, and the $\sigma$s are the appropriate standard errors in each case (Wright & Masters, 1982, p. 115). Note that this requirement is not the same as requiring equal item marginals, even though the item marginals are sufficient statistics for the item parameters. Rather, the requirement is that the item steps have the same relative difficulty for the two groups.

This comparison is far more detailed than that for the SEM approach. A comparison at a similar level of detail would be to compare the overall results for the persons from the two analyses. One way to do this is to use the difficulty estimates from one of the groups to estimate person abilities in the other, and then examine the overall fit of the new person estimates. This gives some indication of the overall impact of the altered difficulty estimates on person estimates.
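The standardized difference between two step estimates is simple to compute once the estimates and their standard errors are available. The sketch below is illustrative only: the function names and the numerical inputs are hypothetical, and the default cut-off of 1.96 is the conventional criterion applied later in the chapter.

```python
import math


def standardized_difference(est_primed, se_primed, est_unprimed, se_unprimed):
    """z = (primed - unprimed) / sqrt(se_primed^2 + se_unprimed^2)."""
    return (est_primed - est_unprimed) / math.sqrt(se_primed ** 2 + se_unprimed ** 2)


def is_flagged(z, critical=1.96):
    """True when the absolute standardized difference exceeds the critical value."""
    return abs(z) > critical


# Hypothetical step estimates and standard errors (in logits), for illustration only.
z_example = standardized_difference(-1.0, 0.3, 0.0, 0.4)
flagged = is_flagged(z_example)
```

With these made-up inputs the pooled standard error is about .5, so a difference of one logit gives a z of about -2.0 and the step would be flagged. The same difference with larger standard errors, as at the extremes of a scale, would not be.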

AN EXAMPLE

In this study, data were collected using a multiscale quality of life instrument across Australian and American student samples. Instead of translating a scale from one language to another, a translation was made from one dialect of English to another. This short-cut is taken to allow study of this phenomenon in a monolingual setting, and to make the alterations completely comprehensible to an English-speaking audience. The results are used to illustrate the procedures described above.

THE SAMPLES

Two data sets are used as the basis for comparison:

1. (AUS sample): a sample of 1,368 Year-9 Victorian high school students collected as part of a study of school staffing policies (Ainley, Reed, & Miller, 1986);
2. (USA sample): a sample of 138 Year-9 high school students from Louisiana based on a stratification of the State's school system, identified as potential drop-outs, assessed before a summer-school intervention program called Louisiana State Youth Opportunities Unlimited (LSYOU; Shapiro, 1987).

Note that both samples are stratified samples of the schools in each state, with random choice of appropriate students within schools.

THE INSTRUMENT

The QSL Construct

The Quality of School Life instrument (QSL; Williams & Batten, 1981) was designed as an application of Burt's conception of quality of life assessment (Burt, Fischer, & Christman, 1979) and Spady and Mitchell's model of schooling (Mitchell & Spady, 1977). Spady and Mitchell have developed a model of schooling based on sociological theory. Drawing on the work of Talcott Parsons, they have postulated a four-part system that links societal expectations to school structures and hence to student experiences. In the four domains of societal expectations schools are expected to:

1. facilitate and certify the achievement of technical competence; in effect, to certify that individuals are capable of doing tasks valued in the society at large;
2. encourage and enhance personal development in the form of physical, emotional, and intellectual skills and abilities;
3. generate and support social integration among individuals across cultural groups and within institutions; and
4. nurture and guide each student's sense of social responsibility for the consequences of his or her own personal actions, and for the character and quality of the groups to which the student belongs. (Mitchell & Spady, 1977, p. 9)

Williams and Batten (1981) used exploratory factor analysis to explore the multidimensional nature of the QSL instrument, and then the hypothesized structure was tested using confirmatory procedures. It consists of six subscales, two general ones and four more specific ones matching the Spady-Mitchell domains. The two general scales are: (a) general affect (GA), which taps the nonspecific feelings of happiness and well-being associated with school; and (b) negative affect (NA), which taps the reverse of GA: depression, loneliness, and restlessness. The four domains are:

1. Status (ST), which assesses a student's feelings of worth in the social context;
2. Identity (ID), which assesses a student's feelings of growth as an individual;
3. Opportunity (OP), which assesses a student's feelings of increasing adequacy to meet society's standards; and
4. Teachers (TE), which assesses a student's feelings towards his or her teachers.

The original scheme was for a fifth domain, Adventure (AD), in place of the TE domain, to assess personal academic development. In the initial studies it was found that the items developed for this domain did not adequately identify it as a distinct factor, but that all items that involved teachers loaded on a distinct factor.

There are 27 items in the scale, with four or five for each subscale. The items are all statements with the stem "School is a place where . . . " followed by a specific predicate such as " . . . I feel happy." The response format is Likert-style with four categories: Strongly Disagree (scored 0), Disagree (1), Agree (2), and Strongly Agree (3). All are scored positively except for the NA subscale, which is scored negatively. Williams and Batten (1981) give complete details of the instrument.

Content Validity

Use of this instrument in different geographic, cultural, and developmental contexts raises issues of the ability of the respondents to understand the original intent of the instrument's authors because of differences in idiom and word-meanings. Consequently, when use of the instrument was considered in an American context, each item was examined for appropriateness. A panel of local experts (the teachers who were involved in the LSYOU summer training program) was consulted to recommend alterations in the wording of the items for the USA sample. A complete record of the changes for the whole instrument is given in Figure 1 in Wilson (1988). In this chapter I will concentrate on three of the subscales, and the changes for those are given in Table 15-1.

Table 15-1. Comparison of the Two Item Sets

Item   Text for AUS Sample                 Text for USA Sample
NA1    I feel depressed                    SAME
NA2    I feel lonely                       SAME
NA3    I get upset                         SAME
NA4    I feel restless                     SAME
TE1    teachers help me to do my best      SAME
TE2    teachers listen to what I say       teachers take notice of me in class
TE3    teachers are fair and just          SAME
TE4    teachers treat me fairly in class   SAME
GA1    I really like to go each day        I really like to be each day
GA2    I get enjoyment from being there    I feel happy
GA3    I feel proud to be a student        SAME
GA4    I like learning                     I am interested in the work we do

The Negative Affect scale was found to require no
adjustments: It is an example of what one might consider an otherwise unattainable ideal in instrument translation. The Teachers scale was chosen to represent a scale that needed only minor adjustment. The General Affect scale was the one most affected by the adjustments. Although it is hard to put a limit to just how much a scale might be altered in translation, this was chosen as a representative of a heavily adjusted scale.

Reliability

The reliability of the QSL subscales has been examined in a number of circumstances. In the original study, Williams and Batten (1981) found that the reliabilities varied from .76 (for the NA scale) to .91 (for the ST scale), with a mean of .83. Wilson (1988) reported similar ranges and means for a high school and a university sample from Louisiana using the altered instrument. These are quite respectable reliabilities for instruments in the affective domain.

RESULTS

SEM Approach

Unidimensionality and item parameter invariance. Consider first the results of the LISREL analyses for the Negative Affect scale given in the top panel of Table 15-2. These are the results for a one-factor solution in each of the samples. The factor loadings in the unconstrained design for the two samples are evidently not identical: the largest difference is .49 to .60, the smallest .56 to .53. These result in squared multiple correlations (SMC) for each of the four items as given in the next four columns, and a total coefficient of determination (CD) in the next column. The coefficients indicate that three of the four items and the set as a whole are better fit by the one-factor model in the USA sample than in the AUS sample. The next column gives the goodness-of-fit index (GFI), which seems to indicate a reasonably good fit for the one-factor design. In the last two columns are included the mean squared residual (MSR) of the fitted covariance matrix, and the maximum residual (MR). The entries in the covariance matrices for both USA and AUS vary from about .2 to about .8, and this is typical for all the covariance matrices analyzed here. Hence, the residuals confirm the picture presented by the GFI, that the one-factor solution in each sample is a reasonable one.

Table 15-2. LISREL Unidimensionality Results

                          Loadings(a)        SMC(a)
Scale            Design   1   2   3   4      1   2   3   4    cD(a)  GFI(a)  MSR    MR
Negative Affect  USA      49  43  50  56     55  45  54  31   82     98      .026   .049
                 AUS      60  51  57  53     51  38  40  38   74     99      .011   .021
Teachers         USA      65  47  86  73     70  58  91  79   99     99      .008   .019
                 AUS      64  69  66  41     59  58  67  47   85     99      .007   .013
General Affect   USA      64  70  49  49     83  89  55  60   98     94      .029   .088
                 AUS      73  70  53  58     63  66  46  51   85     99      .013   .031

(a) The numbers under Loadings, SMC, cD, and GFI are to be divided by 100.

Now compare the results for the one-factor solution with those for the one-factor solution with loadings constrained to be the same in both samples, given in the top panel of Table 15-3. By assumption the loadings are identical. Compared to the results in Table 15-2, the constrained loadings give somewhat different SMCs for the USA sample and identical ones for the AUS sample. This ought to be expected, as the common loadings are much closer to the original AUS loadings than to the USA loadings, which is due to the larger sample size for the AUS sample. Although the SMCs for the USA sample have changed, they are not systematically larger or smaller. The overall picture contained in the cD and GFI columns shows no interpretable change at all between the two designs. The RMR column shows that the overall change in the residuals has been largely confined to the USA sample. The MR column reveals that while the residuals remain small on the whole, the maxima have inflated by a factor of three for USA and five for AUS. Overall, the picture for Negative Affect looks pretty good: The differences in fit brought about by constraining the solution to have the same loadings are not particularly important according to the summary statistics. The maximum residuals give a somewhat more detailed, and perhaps somewhat more disturbing, comparison.

Table 15-3. LISREL Parameter Invariance Results

                          Loadings(a)        SMC(a)
Scale            Sample   1   2   3   4      1   2   3   4    cD(a)  GFI(a)  MSR    MR
Negative Affect  USA      59  51  56  53     59  47  52  26   82     97      .067   .158
                 AUS      "   "   "   "      51  38  40  38   74     99      .013   .102
Teachers         USA      63  66  69  45     71  51  88  59   97     72      .191   .315
                 AUS      "   "   "   "      57  56  69  48   85     97      .027   .039
General Affect   USA      72  71  53  57     84  86  55  61   98     92      .070   .111
                 AUS      "   "   "   "      63  66  45  51   84     99      .015   .038

(a) The numbers under Loadings, SMC, cD, and GFI are to be divided by 100.

The above analyses were then repeated for the Teachers and General Affect scales. The description of the results detailed in Tables 15-2 and 15-3 is abbreviated, as the format is the same as above. Only the most interesting differences are commented upon. For the Teachers scale, a somewhat better (compared to Negative Affect) fit to the one-factor design is not maintained for the constrained loadings design: GFI for USA drops from .99 to .72, the RMR inflates by a factor of over twenty, and the MR is clearly unacceptable. For the General Affect scale the situation for the Negative Affect scale is repeated, with almost identical general measures of fit for the two designs, and a somewhat greater degree of change revealed by the residuals.

IRT Approach

Model-data fit. The mean and standard deviation of the Person Fit t statistics are recorded in Table 15-4. These show that across both
subscale and sample, the variability in the statistics is slightly greater than would be expected, and the values are somewhat more negative than we might expect. These negative values are sometimes associated with a situation where the items within a subscale have some degree of local dependence.

Table 15-4. Partial Credit Person Fit Statistics

                   AUS             USA             AUS Anchored
Scale              Mean   SD       Mean   SD       Mean   SD
Negative Affect    -.19   1.11     -.17   1.19     -.21   1.38
Teachers           -.24   1.14     -.22   1.14      .16   1.16
General Affect      .22   1.42     -.15   1.27      .07   1.31

The mean squares for the items are given in Table 15-5. The items in the Teachers scale immediately stand out as fitting poorly in the USA sample: items TE2 and TE3 both fall outside the guidelines. The remainder do not show such poor fit.

Item parameter invariance. The results of the analyses within the two samples are given in Table 15-6. For each scale, given as separate panels of the table, the results are organized by the partial credit step parameters. For each item within a scale, there are three sets of columns, one for each step parameter. Within each set of three columns, the first gives the USA estimate of that step parameter (in logits) and the second column gives the AUS estimate. The third column gives the standardized difference (z). Larger absolute values of
the standardized difference indicate greater discrepancy between the two samples, and, while the theoretical distribution of these statistics is only approximately known, values greater than 1.96 or less than -1.96 are generally accepted to indicate a problem (Wright & Masters, 1982, p. 115). It should be noted that relatively larger differences in logits between two estimates at the extremes of the scales may result in smaller standardized differences than in the middle because of the U-shaped standard error distribution for partial credit.

Table 15-5. Partial Credit Item Fit Statistics (Mean Squares)

Scale            Sample   Item 1   Item 2   Item 3   Item 4
Negative Affect  USA      1.00     1.09      .98      .95
                 AUS       .95     1.07     1.01      .98
Teachers         USA       .94     1.52      .61      .77
                 AUS      1.15      .96      .87      .94
General Affect   USA       .92      .81     1.08     1.08
                 AUS      1.12     1.01      .89      .91

Table 15-6. Partial Credit Item Parameter Estimates

                        First Step               Second Step              Third Step
Scale            Item   USA     AUS     z        USA     AUS     z        USA     AUS     z
Negative Affect  1      -2.28   -1.69   -1.77     0.18    0.53   -1.15     2.17    1.08    1.62
                 2      -1.37   -0.61   -2.70     0.69    1.05    1.03     2.32    0.81    1.85
                 3      -2.08   -1.40   -2.17     0.19    0.34    0.51     1.68    1.01    1.20
                 4      -2.12   -2.22    0.31    -0.32    0.12   -1.59     0.94    1.01   -0.17
Teachers         1      -3.00   -2.36   -0.70    -1.48    0.84    1.63     2.60    2.29    0.85
                 2      -2.85   -1.34   -2.59    -1.23   -0.31   -2.40     3.89    3.16    1.68
                 3      -2.06   -1.88    0.41     0.18   -0.39    0.58     3.82    3.63    0.44
                 4      -2.77   -3.39    1.07    -0.76   -1.84    2.82     4.02    3.26    1.70
General Affect   1      -2.82   -1.08   -3.01     0.03    0.37    0.97     3.99    3.35    1.54
                 2      -2.43   -1.92   -0.97    -0.11    0.31    0.57     4.17    2.78    3.33
                 3      -3.53   -2.51    1.25    -1.04    0.91    0.33     2.75    2.60    0.43
                 4      -4.23   -2.70   -1.41    -0.34   -1.40    2.91     3.56    1.73    4.90

Even though the TE scale showed a poor fit in the previous analyses, for illustrative purposes it will be included in the analyses at this next stage. Looking at the results for the Negative Affect scale in the first panel of Table 15-6, one finds two standardized differences less than -1.96, for step one of both items NA2 and NA3. The count for the Teachers scale is three: two less than -1.96 in item TE2 and one greater than 1.96 in item TE4. For General Affect there are four: one each in items GA1 and GA3, and two in item GA4. Rather than examine each of the discrepant items in detail, three representative items will be examined and illustrated below.

First, consider an item that shows little or no difference between the samples: item TE3, "Teachers are fair and just." The estimated category characteristic curves for the AUS sample are illustrated in Figure 15-1, and those for the USA sample are illustrated in Figure 15-2. The
figures give the probability of responding with each of the Likert-style responses indicated in the body of the figure, at increasing locations along the latent trait. For example, in Figure 15-2, a student located at -4.00 logits would be predicted to respond with "Strongly Disagree" (SD) with probability approximately .90, and "Disagree" (D) approximately .10, but the others with vanishing probability. At the upper end of the scale, a sample member located at 4.00 logits would be predicted to respond "Strongly Agree" with probability approximately .60, and "Agree" approximately .40, but the rest hardly at all. The sample members are, of course, located at positions estimated for each score. These did not alter noticeably between the two samples (a consistent pattern for all three scales), so the locations on the latent trait are indicated only by logit values in order to clarify the figures. Clearly, there would be no interpretable differences between the samples with regard to item TE3.

Figure 15-1. Probability of responses for item TE3 in the AUS sample (horizontal axis: Attitude to Teachers, in logits).

Figure 15-2. Probability of responses for item TE3 in the USA sample (horizontal axis: Attitude to Teachers, in logits).

Second, consider an item with just one discrepancy between the samples: item TE4, "Teachers treat me fairly in class." The estimates for the AUS sample are illustrated in Figure 15-3, and those for the USA sample are illustrated in Figure 15-4. Although the standardized difference indicates a significant discrepancy only for the second step
parameter, the figures show that this results in noticeable differences for all the transitions. For instance, at -4.00 logits, the "Strongly Disagree" to "Disagree" (SD to D) ratio is approximately .65/.32 = 2.03 for the AUS sample, but is approximately .83/.16 = 5.19 for the USA sample. Similarly, at 4.00 logits, the SA to A ratio is approximately .7/.3 = 2.33 in the AUS sample, but is approximately .6/.4 = 1.5 in the USA sample. Looking overall, the discrepancy indicates that it is relatively easier for an AUS sample member at a particular location on the latent trait to give a positive response to the item than for a USA sample member at the same location. The shapes of the curves are relatively unchanged, indicating that a simple translation of, say, .80 logits (the average discrepancy) would bring the two sets of estimates into alignment. We might consider this a "consistent" difference.

Figure 15-3. Probability of responses for item TE4 in the AUS sample.

Figure 15-4. Probability of responses for item TE4 in the USA sample.

Third, consider the item that is most discrepant between the samples: item GA4, "I like learning," for the AUS sample (Figure 15-5) and "I am interested in the work we do" in the USA sample (Figure 15-6). Here, although it is somewhat easier for the Australian sample to give a positive response, a simple shift in location does not suffice to make the curves even approximately equal. The Australian sample has

Figure 15-5

Probability of responses for item GA4 in the AUS sample.

Figure 15-6

Probability of responses for item GA4 in the USA sample.


shown much greater proclivity to give more extreme responses closer to the middle of the probability location. For example, while a member of the Australian sample who is located at the point where D and A are equally likely (the intersection of the second curve and the 0.50 probability line) would have to change in attitude by 3.00 logits to move to the point at which A and SA are equally likely, a similarly located member of the USA sample would have to change by 4.00 logits. We might consider this an "inconsistent" difference.

The USA item estimates were also used to anchor a second analysis of the AUS sample. The resulting overall fit statistics from this are shown in the column headed "AUS Anchored" in Table 15-4. Neither means nor standard deviations differ to any large extent for any of the subscales. This shows that the differences in the item estimates for the two samples, although statistically significant and interpretable at the item level, do not seem to have any great impact on the person estimates. We should not be too surprised at this, as the sufficient statistics for the students are the same under both sets of item estimates.
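The adjacent-category comparisons made above for items TE4 and GA4 can be illustrated with a small sketch. It assumes the usual partial credit parameterization, in which the log of the ratio of two adjacent category probabilities equals the person location minus the corresponding step parameter. The function name `pcm_probs` and the step values are hypothetical; they are not the estimates behind Figures 15-3 to 15-6.

```python
import math

def pcm_probs(theta, steps):
    """Category probabilities under the partial credit model.

    steps[k - 1] is the step parameter for moving from category k - 1 to k.
    """
    # Cumulative sums of (theta - step); category 0 has an empty sum of 0.
    logits = [0.0]
    for d in steps:
        logits.append(logits[-1] + (theta - d))
    top = max(logits)                      # subtract the maximum for stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical step values for a four-category item (SD, D, A, SA).
steps = [-2.0, 0.0, 3.0]

# Adjacent categories are equally likely exactly at the step parameter.
p = pcm_probs(steps[1], steps)   # theta at the second step: P(D) equals P(A)
q = pcm_probs(steps[2], steps)   # theta at the third step:  P(A) equals P(SA)

# Distance on the latent trait between the D/A and A/SA crossover points.
gap = steps[2] - steps[1]

# Adding 0.80 logits to every step translates the curves without reshaping them.
shifted = [d + 0.80 for d in steps]
same = pcm_probs(1.0, steps)
moved = pcm_probs(1.80, shifted)
```

Under these assumptions the distance between the two crossover points is simply the difference between the second and third step parameters, and adding a constant to every step slides the whole set of curves along the trait, which is the sense in which a difference is called "consistent" above. A difference that changes the spacing of the steps cannot be removed by any single shift.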

DISCUSSION OF RESULTS FROM EXAMPLE

The two approaches have resulted in rather different orders of detail for the three chosen subscales. The SEM approach gave positive assurances for all three subscales concerning unidimensionality, and a similar assurance concerning parameter invariance for both the General Affect and the Negative Affect subscales, but indicated a problem for the Teachers subscale. Thus, we have an example where the most altered subscale in terms of content was not the most problematical in construct validity terms. The results for the partial credit model indicated that the Teachers subscale had a fit problem for one of the samples (USA), but that the others fit at a reasonable level. Comparison at the item step level between the two samples revealed considerable differences, which were illustrated for three cases that were, respectively, small, consistent, and inconsistent. These comparisons revealed statistically significant differences between the item parameters for a little over half of the items, including at least two in each subscale. Of the items that were identically worded in the two samples, 5 out of 8 were found to have significant differences; of the four items that were altered, all were found to have significant differences. Comparison at the overall level of person fit statistics, however, did not reveal any great impact from these differences on person estimates.


CONCLUSION

The overall finding is one that contains some good news and some bad news for those who use attitude instruments to conduct research across cultural contexts. Looking at it on the negative side, none of the subscales showed invariance on all criteria. In the SEM analysis, for construct validity as evaluated by fit to a constrained one-factor model, two subscales performed reasonably well. The IRT analysis revealed that all three of the subscales gave significantly different estimates of item location across the samples, indicating that the respondents saw the latent traits in different ways. Looking on the positive side, these results may be considered substantive results rather than merely negative findings, telling us about the different ways that people construct variables and respond to items in different contexts.

In summary, this study has shown that through careful assessment of psychometric properties using techniques such as Structural Equation Modelling and Item Response Theory, attitude scales can be examined to see whether they are sufficiently consistent in their characteristics to allow meaningful comparisons to be made across cultural contexts. The results of such examinations will be dependent upon the level of detail that the researcher pursues. Clearly, the IRT approach resulted in a greater degree of detail in the examination, and hence found more discrepancies than the SEM approach. Many researchers in the area of cross-cultural comparisons will find such a level of examination alarming, leading potentially to the rejection of much of the existing research base. Others may consider it merely the inevitable result of trying to compare the incomparable. It is the position of this researcher that the present situation regarding the use of affective instruments across cultural contexts is not sufficiently well researched to say which of these alternatives is correct; indeed, it may be that neither is correct.
What is needed is a program of study that seeks out the conditions under which affective instruments display parameter invariance across particular cultural and linguistic contexts. This might be called strong construct validity for the comparison. Where such conditions are not attainable, or where particular indicators are considered important enough to be kept free from modification, one might instead seek evidence of weak construct validity, such as that used in the SEM approach here, or perhaps by using a technique similar to that described above for assessing fit of one sample to the item parameters of the other. This will require both technical work on what are the most appropriate techniques to investigate these types of construct validity, and substantive and philosophical work on the meaningfulness of terms such as strong or weak construct validity.


REFERENCES

Adams, R.A., & Khoo, S.T. (1991). Quest [computer program]. Hawthorn, Australia: Australian Council for Educational Research.

Ainley, J., Reed, R., & Miller, H. (1986). School organisation and the quality of schooling (ACER Research Monograph No. 29). Hawthorn, Australia: ACER.

Andersen, E.B. (1973). Conditional inference for multiple choice questionnaires. British Journal of Mathematical and Statistical Psychology, 26, 31-44.

Burt, R.S., Fischer, M.G., & Christman, K.R. (1979). Structures of well-being: sufficient conditions for identification as restricted covariance models. Sociological Methods and Research, 8, 111-120.

Epstein, J.L., & McPartland, J.M. (1976). The concept and measurement of the quality of school life. American Educational Research Journal, 13(1), 15-30.

Fraser, B.J., & Fisher, D.L. (1983). Development and validation of short forms of some instruments measuring student perceptions of actual and preferred classroom learning environment. Science Education, 67, 115-131.

Husén, T. (1967). International study of achievement in mathematics: a comparison of twelve countries (Vols. 1 and 2). New York: Wiley.

Irvine, S.H., & Berry, J.W. (1988). Human abilities in cultural context. Cambridge, UK: Cambridge University Press.

Jöreskog, K.G. (1971). Statistical analysis of a set of congeneric tests. Psychometrika, 36, 109-133.

Jöreskog, K.G., & Sörbom, D. (1986). LISREL VI: Analysis of linear structural relationships by maximum likelihood and least square methods. Mooresville, IN: Scientific Software.

Linden, L. (1977). Home environment and student support (Department of Statistics Research Report No. 77-10). Uppsala: University of Uppsala.

Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Masters, G.N., & Wright, B.D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529-544.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed.). New York: ACE-Macmillan.

Mitchell, D.E., & Spady, W.G. (1977). Authority and the functional structuring of social actions in schools. Unpublished AERA symposium paper (quoted in Williams & Batten, 1981).

Moos, R.M. (1978). A typology of junior high and senior high classrooms. American Educational Research Journal, 15(1), 53-66.

Munck, I. (1979). Model building in comparative education. Stockholm: Almqvist & Wiksell.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (expanded ed.). Chicago: The University of Chicago Press. (Original work published 1960)


Shapiro, J.Z. (1987, April). Project LSYOU: A summative evaluation. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC.

Walberg, H.J. (1979). Educational environments and effects. Berkeley, CA: McCutchan.

Walker, D.A. (1976). The IEA Six Subject Survey: An empirical study of education in twenty-one countries. Stockholm: Almqvist & Wiksell.

Williams, T.H., & Batten, M.H. (1981). The quality of school life. ACER Research Monograph, No. 12. Hawthorn, Australia: ACER.

Wilson, M. (1988). Internal construct validity and reliability of a quality of school life instrument across nationality and school level. Educational and Psychological Measurement, 48, 995-1009.

Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.

chapter

16

Consequences of Removing Subjects in Item Calibration

Patrick S.C. Lee

LaSalle University

Hoi K. Suen

Pennsylvania State University

The metric of the ability or θ scale in item response theory (IRT) is indeterminate. With this indeterminacy, item and ability parameters are theoretically unidentifiable unless an origin is assigned to θ (Lord, 1980). A common practice today is to scale along a z-score metric with a mean of 0 and a standard deviation of 1 (Hambleton & Swaminathan, 1985). Existing methods in IRT parameter estimation generally assume that, given the z-score metric, θ is within the interval -∞ < θ < ∞. When Newton-Raphson (e.g., Lord, 1980; Hambleton & Swaminathan, 1985) or other unconstrained numerical procedures are applied to estimate ability, θ can theoretically take on a value of positive or negative infinity. Specifically, the maximum likelihood estimator for a subject with a perfect response vector is infinity, while that for a subject with an all-zero response vector is negative infinity. These estimates are problematic in a joint maximum likelihood estimation of item parameters in that item estimators are affected or unattainable. If item parameters are attainable but affected in an unspecified manner, the invariance of parameters is no longer guaranteed. Hence, it can potentially affect subsequent applications such as equating.
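The divergence described above can be seen in a minimal sketch. It assumes the one-parameter (Rasch) model with item difficulties treated as known, which is a simplification of the joint estimation problem discussed in this chapter; the function names and the three difficulty values are hypothetical.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def newton_ability(responses, difficulties, theta=0.0, max_iter=20, tol=1e-8):
    """Unconstrained Newton-Raphson search for the ML ability estimate.

    Returns (theta, converged). Item difficulties are treated as known.
    """
    for _ in range(max_iter):
        probs = [rasch_p(theta, b) for b in difficulties]
        gradient = sum(u - p for u, p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        if information == 0.0:             # likelihood is numerically flat
            break
        step = gradient / information
        theta += step
        if abs(step) < tol:
            return theta, True
    return theta, False

difficulties = [-1.0, 0.0, 1.0]                      # hypothetical item values
mixed = newton_ability([1, 1, 0], difficulties)      # finite estimate
perfect = newton_ability([1, 1, 1], difficulties)    # keeps increasing
zero = newton_ability([0, 0, 0], difficulties)       # keeps decreasing
```

For the mixed response vector the iterations settle on a finite value. For the perfect and all-zero vectors every Newton step moves at least one logit further in the same direction, so the search stops only because the iteration limit is reached.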


There are at least five alternatives to resolve this problem. One solution is to impose external constraints in the estimation procedure to keep parameters from drifting to unacceptable values (cf. Hambleton, 1989). These constraints are generally based on experience or logical deduction. For example, in the 3-parameter context, the slope parameter may be constrained to be positive (i.e., a > 0), the guessing parameter may be constrained to be less than some reasonable amount (for example, c < .35), or the ability parameter constrained away from the extremes (-3 < θ < 3). Another solution is to impose a nonuniform prior distribution of θ values; then the posterior θ values estimated through a Bayesian Modal Estimation procedure (Swaminathan & Gifford, 1986) are taken as the best estimates. The third solution is to remove the need for estimating θ altogether through the Marginal Maximum Likelihood procedure (Mislevy & Bock, 1990), although there is still a need to estimate the distribution of θ. A fourth option is to create two "dummy" items.¹ One of these items will have a perfect classical p-value while the other will have a zero p-value. Subjects with perfect and zero raw scores would thus be eliminated. This alternative would be appropriate only for a conditional estimation of abilities. For a joint estimation of subject and item parameters, it essentially replaces the problem of perfect- and zero-scored subjects with perfect- and zero-scored items. A final alternative is to remove all subjects with perfect or zero raw scores prior to item calibration (e.g., Wright & Stone, 1979).

The consequences of the final alternative of removing subjects prior to item calibration on the quality of the estimators are unknown (Hambleton & Swaminathan, 1985, pp. 92-93). The purpose of this chapter is to examine the effects of such a tactic on the θ metric and item parameters.
¹ The authors wish to thank Robert Jannarone for pointing out this option.

INVARIANT ITEM PARAMETERS

An important and desired characteristic of IRT is the invariance of item parameters (Lord, 1980), which also enables the calibration process to be sample-free (Wright & Stone, 1979). When the z-score metric is imposed on the θ scale for each of two groups responding to the same set of items, estimators of item parameters will most likely be different from one group to another. However, the property of invariance is maintained if the two θ scales are linear transformations of one another (Hambleton & Swaminathan, 1985; Lord, 1980; Lord & Novick,


1968; Wright, 1968). If the effects of removing subjects are such that the θ scales from different calibration samples become unknown and nonlinear transformations of one another, the practice of removing subjects would be problematic in that item parameters are no longer invariant. Let us assume that the θ metric X for group A with a number of perfect and all-zero response vectors is a linear transformation of the θ metric Y for group B, which also has a number of perfect and all-zero response vectors. If subjects are removed from these groups because of perfect and zero raw scores, the metric of the θ scales would change, resulting in two new metrics X* and Y*. The property of invariance is guaranteed only if X* is a linear transformation of X and Y* is a linear transformation of Y, which would then imply that X* remains a linear transformation of Y*.

TRANSFORMATION OF METRICS

Samuelson (1968) demonstrated that, given a finite sample of N subjects, no score can be beyond ±(N - 1)^0.5 standard deviations from the mean. For a θ scale with a z-score metric, this property implies that the boundaries of θ scores calibrated from a finite sample are ±(N - 1)^0.5. Let N be the size of a calibration sample in which p subjects have perfect response vectors and m subjects have all-zero response vectors, and let X be the θ scale for this sample. With θ on a z-score metric, we can assume that the distribution of θ is symmetric. Let ±c be the actual maximum and minimum θ values for a given finite sample of subjects; then -(N - 1)^0.5 < -c < θ_R < +c < +(N - 1)^0.5, where θ_R is the ability score for each subject R whose raw score is neither perfect nor zero. That is, each subject R would be retained after subjects with perfect and zero raw scores have been dropped. Let X* be the θ metric after subjects with perfect and zero raw scores have been dropped and θ*_R be the ability score for subject R on the X* metric.
We demonstrate below that θ*_R is a linear transformation of θ_R by obtaining a mapping of the boundaries. That is, we need to show how θ_R of the interval [-(N - 1)^0.5, +(N - 1)^0.5] is transformed. In the estimation of θ*_R, such a transformation is equivalent to transforming the interval [-(N - 1)^0.5, +(N - 1)^0.5]:

Maximize-minimize θ*_R : 1 ≤ R ≤ n


where p is the number of perfect raw scores, m the number of zero raw scores, and n the number of nonperfect, nonzero raw scores. This constrained optimization problem requires that the sum of the θ scores is zero (Eq. (1)) with the variance of one (Eq. (2)) in order for θ to remain within the z-score metric. The solution of this optimization problem would lead to the range of θ*_R. Using the Lagrangian technique (cf. Mangasarian, 1969), we obtain the interval of θ*_R to be

where c is the actual number of standard deviations away from the mean which will contain all possible scores in the sample of subjects. Note that two distinct scores θ_R and θ_S of the original X metric would become θ*_R and θ*_S of the new X* metric such that their ordinal positions are preserved. Thus, the transformation is a one-to-one ordered mapping. In other words,

For a given calibration sample of subjects, N, m, p, n and c are constants. Thus, Equation 4 demonstrates that θ_R and θ*_R are linear transformations of one another.
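The bound attributed to Samuelson (1968) at the start of this section can be checked numerically. The sketch below assumes the standard deviation is computed with divisor N, the form under which the ±(N - 1)^0.5 bound holds; the helper name `z_scores` and the simulated sample are illustrative only.

```python
import math
import random

def z_scores(values):
    """Standardize using the population standard deviation (divisor N)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

# A simulated sample of N = 50 scores: no z-score exceeds sqrt(N - 1).
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(50)]
bound = math.sqrt(len(sample) - 1)
largest = max(abs(z) for z in z_scores(sample))

# The bound is attained only when N - 1 scores coincide and one differs.
extreme = z_scores([0.0] * 9 + [1.0])
```

In the degenerate ten-score case the single deviant score sits exactly (N - 1)^0.5 = 3 standard deviations from the mean, which shows that the bound is sharp but is reached only in extreme configurations.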

DISCUSSION

Recall that the X metric is of all scores while the X* metric is of nonperfect, nonzero scores only. Equation (4) shows that the same ordering is preserved in the X* metric and the boundaries of θ*_R have changed according to Equation (3). The absolute difference of two distinct scores θ_R and θ_S on the X metric is transformed into that of θ*_R and θ*_S on the X* metric, as given by Equation (4). This transformation is approximately equivalent to the magnitude of n^0.5, provided p and m


are negligible. The result is that X* becomes a linear transformation of X. Thus, the invariance of item parameters is preserved. Therefore, from the perspective of the invariance of item parameters, the practice of removing subjects with perfect and all-zero raw scores prior to item calibration is acceptable in that subjects' relative positions are maintained, item parameters remain invariant, and equating of θ across samples is possible. This finding is also consistent with the general notion that item calibration is sample-free. It should be cautioned, however, that, while Equation (4) provides a theoretical justification for the removal of subjects in item calibration, it is by itself a necessary but insufficient condition to support the practice in applied settings. It demonstrates that parameters are not affected. To support practice in applied settings, it is also necessary to demonstrate that estimators, in addition to parameters, are not affected. Further analyses are needed to explore the effects of removing subjects on estimators. An additional consideration is that, whereas this chapter provides a justification for removing subjects in item calibration, the problem of how to derive a finite θ for these subjects in ability estimation remains. Wilson and Wright (1985) provided one solution for this problem.
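The sense in which X* is a linear transformation of X can be illustrated with a toy example. The sketch assumes that dropping subjects changes nothing except the mean and standard deviation used to set the z-score metric, which is a deliberate simplification of the argument in this chapter and says nothing about the behavior of estimators. All data values and names are hypothetical.

```python
import math

def standardize(values):
    """Return z-scores plus the mean and population SD that produced them."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values], mean, sd

# Hypothetical calibration sample: raw scores on a 5-item test and
# unscaled ability values for the same ten subjects.
n_items = 5
raw_scores = [0, 1, 2, 2, 3, 3, 4, 4, 5, 5]
abilities = [-2.6, -1.2, -0.5, -0.3, 0.2, 0.4, 1.0, 1.3, 2.7, 2.9]

# Metric X: every subject is included in the standardization.
theta_x, _, _ = standardize(abilities)

# Drop zero and perfect raw scores, then re-standardize (metric X*).
kept = [i for i, r in enumerate(raw_scores) if 0 < r < n_items]
theta_r = [theta_x[i] for i in kept]
theta_star, mean_r, sd_r = standardize(theta_r)

# The mapping from X to X* is theta* = slope * theta + intercept.
slope = 1.0 / sd_r
intercept = -mean_r / sd_r
rebuilt = [slope * t + intercept for t in theta_r]
```

Because the slope is positive, the ordering of the retained subjects is unchanged, and the difference between any two retained scores is multiplied by the same constant.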

REFERENCES

Hambleton, R.K. (1989). Principles and selected applications of item response theory. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 147-200). New York: Macmillan.

Hambleton, R.K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Hingham, MA: Kluwer.

Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Mangasarian, O.L. (1969). Nonlinear programming. New York: McGraw-Hill.

Mislevy, R.J., & Bock, R.D. (1990). BILOG 3: Item analysis and test scoring with binary logistic models (2nd ed.). Mooresville, IN: Scientific Software.

Samuelson, P.A. (1968). How deviant can you be? Journal of the American Statistical Association, 63, 1522-1525.

Swaminathan, H., & Gifford, J.A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 589-601.

Wilson, M., & Wright, B.D. (1985, April). Finite measures from perfect scores. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal.


Wright, B.D. (1968). Sample-free test calibration and person measurement. Proceedings of the 1967 invitational conference on testing problems. Princeton, NJ: Educational Testing Service.

Wright, B.D., & Stone, M.H. (1979). Best test design. Chicago: MESA.

chapter

17

Item Information as a Function of Threshold Values in the Rating Scale Model

Barbara G. Dodd

The University of Texas at Austin

Ralph J. De Ayala

The University of Maryland—College Park

Birnbaum's (1968) conceptualization of information functions for tests and for individual items has been used in many applications of item response theory (IRT) models. The primary benefit of information functions is that they allow one to construct measurement instruments that will maximize the precision of measurement or information where it is needed most. Another benefit is that information functions for two measurement instruments can be compared in terms of relative efficiency to aid in the selection of the best instrument for a given measurement situation. Information functions have also been used effectively to determine item selection for computerized adaptive testing (CAT). Most of the applications of information functions have been restricted to the IRT models for dichotomously scored items, where item responses are scored either correct or incorrect. Very little research has investigated the properties of information functions for IRT models developed specifically for item responses that are scored into more than two categories.


Three of the models that are appropriate when item responses are scored using integers to represent ordered response categories corresponding to varying degrees of the trait measured by the item are the rating scale model (Andrich, 1978a,b), the partial credit model (Masters, 1982), and the graded response model (Samejima, 1969). The rating scale model was developed specifically for the case of attitude measurement when the Likert-type response format is used. The partial credit model is an extension to the multiple category case of the one-parameter Rasch model for dichotomously scored items, while the graded response model is an extension to the multiple category case of the two-parameter logistic model. Both the partial credit model and the graded response model are appropriate to use with items for which partial credit can be earned for partially correct solutions to problems. While the rating scale model has been shown to be a special case of the partial credit model (Wright & Masters, 1982), the partial credit model is not a special case of the graded response model (Thissen & Steinberg, 1986). Samejima (1969) extended Birnbaum's formulation of information functions to the multiple category case. By comparing the information yielded by items scored with optimal dichotomization with the information yielded by scoring the items according to the graded response model, Samejima (1969, 1976) found that the graded response approach yielded considerably greater precision of measurement. Dodd and Koch (1987) applied Samejima's formulation of information functions for the multiple category case to the partial credit model. Unlike the simple Rasch model for dichotomously scored items, it was found that item information functions for the partial credit model could differ substantially from one another as a function of the step estimates for each item. Dodd and Koch also demonstrated the usefulness of information functions for test revision.
Information functions for the multiple category case have also been shown to be effective for item selection during CAT based on either the partial credit model (Koch & Dodd, 1989) or the graded response model (Dodd, Koch, & De Ayala, 1989). Dodd (1987) applied Samejima's (1969) formulation of information functions for polychotomously scored items to the rating scale model. It was found that the distribution of item information for a set of items with the same response threshold values was a function of the scale value for the item. Each item information function peaked near the scale value for the item. It was also discovered that rating scales with threshold values that spanned a small range along the attitude continuum yielded more peaked information functions than rating scales with threshold values that spanned a large range. Thus, the


distribution of item information was a function of both the scale value for the item and the set of response threshold values for the rating scale. This chapter presents the results of a further investigation of the relationship between the distribution of information for an item and the item parameter estimates of the rating scale model. The effectiveness of using item information functions for item selection during CAT was also investigated.

THE RATING SCALE MODEL

Andrich (1978a,b) extended the Rasch model for dichotomously scored items to the polychotomous case of rating scale items in which responses to an item are scored using ordered categories to represent varying degrees of the attitude level. In the rating scale model, a scale value is estimated for each item to reflect the location of the item on the attitude continuum. In addition, a single set of response thresholds is estimated for the entire set of items included in the rating scale, because the response threshold values are assumed to be constant across items on a given rating scale. The probability of responding in a given category is defined as
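In the notation explained in the next paragraph, the rating scale model is usually written in the following form. It is given here as an assumed reconstruction rather than as the authors' typeset equation; the symbols x (the category score, from 0 to m), m (the number of thresholds), and k (a summation index) are labels introduced for illustration.

```latex
P_{ix}(\theta) =
  \frac{\exp \sum_{j=0}^{x} \left[\theta - (b_i + t_j)\right]}
       {\sum_{k=0}^{m} \exp \sum_{j=0}^{k} \left[\theta - (b_i + t_j)\right]},
  \qquad x = 0, 1, \ldots, m
```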

Equation 1 is the general form for obtaining the operating characteristic curves for an item based on the rating scale model. The θ term is the attitude level, the b_i term is the scale value or location parameter for item i, and the t_j terms are the response threshold parameters for the set of items. For notational convenience, Σ[θ - (b_i + t_j)], for j = 0 to 0, is defined as being equal to 0. Item information (after Samejima, 1969) for the rating scale model, conditional on theta, is defined as
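A form consistent with the statement that follows (that P' is the first derivative of Equation 1) is Samejima's category-wise sum. It is given here as an assumed reconstruction; P_ix denotes the probability of category x on item i and m the number of thresholds, labels introduced for illustration.

```latex
I_i(\theta) = \sum_{x=0}^{m} \frac{\left[P'_{ix}(\theta)\right]^{2}}{P_{ix}(\theta)}
```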


Figure 17-1 Item information functions for two items that have a scale value of zero and threshold values that are symmetric around zero but differ in the range of the threshold values.

where P' is the first derivative of Equation 1. An example of item information functions for two hypothetical items with a scale value of zero and symmetric threshold values that differ in range is presented in Figure 17-1. Both items provided maximum information at the scale value. Item 2 had a slightly flatter information function than item 1 because of the larger range of the threshold values for the scale from which item 2 was selected compared to the scale for item 1. The information for a given rating scale is simply defined as the sum of the item information functions. Thus, the information that a given item contributes to the scale information function is independent of the information provided by the other items in the rating scale. Item and scale information functions could prove useful in some applications of the rating scale model. For example, the scale information functions for two rating scales can be compared in terms of relative efficiency, which can aid in the selection of the best rating scale for a given measurement situation. Item information functions might also be used effectively to determine item selection for computerized adaptive attitude measurement.
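A short sketch may help make the definitions above concrete. It assumes the standard rating scale form in which the exponent for category x is the running sum of θ - (b_i + t_j), and it uses the fact that, for this form, the derivative of each category probability equals that probability times the difference between the category score and the expected score. The function names and the two threshold sets are hypothetical and are not the values behind Figure 17-1.

```python
import math

def rsm_probs(theta, b, thresholds):
    """Category probabilities for one item under the rating scale model."""
    # exponents[x] is the sum over j = 1..x of [theta - (b + t_j)];
    # the x = 0 term is defined as 0.
    exponents = [0.0]
    for t in thresholds:
        exponents.append(exponents[-1] + (theta - (b + t)))
    top = max(exponents)                   # subtract the maximum for stability
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def item_information(theta, b, thresholds):
    """Item information as the sum of P'(theta)^2 / P(theta) over categories."""
    probs = rsm_probs(theta, b, thresholds)
    expected = sum(x * p for x, p in enumerate(probs))
    # For this model the derivative of P_x is P_x * (x - expected score).
    derivs = [p * (x - expected) for x, p in enumerate(probs)]
    return sum(d * d / p for d, p in zip(derivs, probs) if p > 0.0)

def scale_information(theta, scale_values, thresholds):
    """Scale information is the sum of the item information functions."""
    return sum(item_information(theta, b, thresholds) for b in scale_values)

# Two hypothetical threshold sets, symmetric around zero, differing in range.
narrow = [-0.5, 0.0, 0.5]
wide = [-2.0, 0.0, 2.0]
peak_narrow = item_information(0.0, 0.0, narrow)
peak_wide = item_information(0.0, 0.0, wide)
```

With these hypothetical values the item from the narrower scale gives more information at its scale value than the item from the wider scale, in line with the flatter curve described for item 2. Because the sum simplifies to the variance of the category score, the same function can be evaluated cheaply over a grid of θ values.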


INFORMATION STUDY

Datasets

The relationship between the distribution of item information and the item parameters of the rating scale model was assessed with rating scales that had either three or four threshold values. A total of 30 different sets of scale threshold values were generated to investigate the effect of the number, symmetry, and distance between adjacent threshold values on the distribution of item information. For each of the 30 sets of threshold values, nine item scale values that ranged from -2.0 to 2.0 in .5 increments were used in the item information analyses. To determine if the relationships between the item parameters and the distribution of information across the trait continuum that were found for the generated items would hold for real data, the threshold values that were estimated for the AWS and ADCOM datasets (Dodd, 1990) and the threshold values estimated by Masters and Wright (1981) for the fear of crime items were used in the item information analysis. The three real attitude scales differed from one another in terms of the number and range of threshold values.

Analyses

The nine scale values, used in conjunction with each of the 30 generated sets of scale threshold values, were treated as known parameters in the information analyses. Estimates of the threshold values reported in the literature for the three real attitude scales were also used in the item information analyses. Equation 2 was used to calculate information for the θ values ranging from -4.0 to 4.0 at intervals of .1 for the 270 generated items and the 27 items based on estimates of the threshold values for the three real attitude scales.
Results

The item information functions for the 270 generated items confirmed the findings of Dodd (1987) that the item information function for each item peaked near the scale value and that rating scales with threshold values that spanned a small range along the attitude continuum yielded more peaked information functions than rating scales with threshold values that spanned a large range. As expected, it was


Figure 17-2 Item information functions for four items that have a scale value of zero and threshold values that are symmetric around zero but differ in the range of the threshold values and the distance between adjacent threshold values.

also found that items with four threshold values yielded more total information across the trait continuum than the items with three threshold values. Thus only items with the same number of thresholds yield the same total amount of information across the entire attitude continuum. Inspection of item information functions for scales with three threshold values revealed that the information functions peaked at the scale value of the item when the threshold values were symmetric around zero. For the scales with four threshold values that were symmetric around zero, it was found that the item information functions peaked at the item scale value provided the distance between the two middle threshold values was less than 2.0 logits. The four items selected to illustrate this finding had a scale value of zero but were from scales that differed in the distance between adjacent threshold values as well as the range of the threshold values. Figure 17-2 shows the item information functions for these four items. The information functions for items 3 to 5 all peaked at the scale value of zero.


Item 6, however, had a bimodal distribution of information which peaked at trait levels of -1.6 and 1.6. Given the fact that the range of threshold values for the scale from which item 6 was selected is the same as the range of the threshold values for the scale from which item 5 was selected, it appeared that large distances between the two middle threshold values resulted in bimodal information functions. Inspection of the information functions for other scales with four threshold values revealed bimodal information functions when the distance between the two middle threshold values was equal to or greater than 2.0 logits. It should also be noted that the information function for item 4 was flatter than the information function for item 3 because the distance between the two middle threshold values was greater for the item 4 scale than for the item 3 scale. The information functions for items 3 and 4 were also more peaked than the information functions for items 5 and 6 because the range of scale threshold values for items 3 and 4 was less than the range of the scale threshold values for items 5 and 6. When there was an odd number of asymmetric thresholds, the peak of the information function was shifted away from the scale value in the direction of the dominant sign of the threshold values. Figure 17-3

Figure 17-3 Item information functions for three items that have a scale value of zero but differ in the range and degree of asymmetry of the threshold values.


Figure 17-4 Item information functions for two items that have a scale value of zero and asymmetric threshold values with the same range but differ in the distance between adjacent threshold values.

depicts the information functions for three items that had a scale value of zero but differed in the range and degree of asymmetry of threshold values. As can be seen, the scale threshold values for item 7 had the smallest degree of asymmetry and the smallest shift of the peak of the information function away from the scale value. Items 8 and 9 had scale threshold values that differed from one another only in terms of the direction of the asymmetry. The magnitude of the shift of the peak of the information functions away from the scale value was identical for items 8 and 9 and differed only in terms of the direction of shift away from the scale value. For the scales with an even number of threshold values that were asymmetric, the degree of shift away from the scale value was found to be a function of the distance between adjacent threshold values. Figure 17-4 presents the item information functions for two items with four asymmetric scale threshold values that differ only in the distance between adjacent threshold values. Item 11, which has a distance of 2.5 logits between the two middle threshold values, had a 1.6 logit shift in


Figure 17-5 Item information functions for three items that have a scale value of zero but differ in the range and number of the threshold values.

the peak of the information function away from the scale value. Item 10, on the other hand, had a smaller shift in the peak of the information function (.9) because the distance between the two middle thresholds was smaller than the distance between the middle threshold values for item 11. Figure 17-5 depicts the item information functions for each of the three real attitude scales with an item scale value of zero. As can be seen, the magnitude of the shift away from the scale value is a function of the degree of asymmetry of the threshold values. For the odd number of threshold values, the direction of the shift is determined by the dominant sign of the threshold values. It is interesting to note that the shift for the fear of crime item was 2.1 logits. For the even number of threshold values, the direction of shift was a function of the magnitude of two middle threshold values; the shift was in the direction of the threshold value with the largest deviation from zero. These results confirmed the relationship between the distribution of item information and the item parameters of the rating scale model that were identified with the generated item parameters.
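The qualitative pattern reported in this section can be reproduced numerically. The following is a minimal Python sketch, not the program used in the study. It assumes the usual rating scale parameterization, in which the probability of score x is proportional to exp of the sum over j <= x of (theta - (b + t_j)), and it takes item information to be the variance of the item score at theta. The threshold sets `narrow`, `wide`, and `three` are hypothetical and are not the items plotted in Figures 17-2 through 17-5.

```python
import math


def category_probabilities(theta, scale_value, thresholds):
    """Rating scale model probabilities for scores 0..m (m = number of thresholds)."""
    # psi_0 = 0, psi_x = sum over j <= x of (theta - (scale_value + t_j))
    psi = [0.0]
    for t in thresholds:
        psi.append(psi[-1] + theta - (scale_value + t))
    peak = max(psi)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(p - peak) for p in psi]
    total = sum(weights)
    return [w / total for w in weights]


def item_information(theta, scale_value, thresholds):
    """Item information at theta, computed as the variance of the item score."""
    probs = category_probabilities(theta, scale_value, thresholds)
    mean = sum(x * p for x, p in enumerate(probs))
    return sum((x - mean) ** 2 * p for x, p in enumerate(probs))


def total_information(scale_value, thresholds, lo=-30.0, hi=30.0, step=0.01):
    """Trapezoid-rule approximation to the area under the information function."""
    n = int(round((hi - lo) / step))
    values = [item_information(lo + k * step, scale_value, thresholds)
              for k in range(n + 1)]
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


# Hypothetical threshold sets, all symmetric around zero.
narrow = [-1.5, -0.5, 0.5, 1.5]  # middle thresholds 1.0 logit apart
wide = [-3.0, -1.5, 1.5, 3.0]    # middle thresholds 3.0 logits apart
three = [-1.0, 0.0, 1.0]         # three thresholds

if __name__ == "__main__":
    for name, t in (("narrow", narrow), ("wide", wide), ("three", three)):
        print(name, "total information:", round(total_information(0.0, t), 3))
    for theta in (-1.5, 0.0, 1.5):
        print("wide item, theta =", theta, "information =",
              round(item_information(theta, 0.0, wide), 3))
```

Under these assumptions the area under the curve equals the number of thresholds, which is consistent with the finding that items with the same number of thresholds carry the same total information. The `wide` item has less information at theta = 0 than at theta = 1.5 or -1.5, which is the bimodal shape described for large distances between the two middle thresholds.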


CAT STUDY

Method

Datasets. Two real datasets consisted of response data for two different attitude scales. The third dataset consisted of simulated response data generated specifically to fit the rating scale model.

Responses made by 490 teachers to the Audit of Administrator Communication (ADCOM; Valentine, 1978) were available for use in the present study. ADCOM is a 40-item Likert-type attitude scale designed to measure attitudes of teachers toward the communication skills of their school administrators. All items are scored on a five-point scale on which 0 represents an unfavorable response toward the communication skills of the administrator, and a score of 4 represents a favorable response. Factor analysis of the ADCOM scale (Koch, 1983) indicated that the scale is approximately unidimensional; the first factor accounted for about 85% of the common variance.

The Attitude Toward Women Scale (AWS; Spence, Helmreich, & Stapp, 1973) was designed to measure attitudes toward the rights and roles of women in contemporary society. Each of the 25 items has four response alternatives ranging from "AGREE STRONGLY" to "DISAGREE STRONGLY." Responses are scored so that profeminist attitudes receive a score of 3, whereas very traditional attitudes receive a score of 0. Response data were available for 531 women. Previous factor analytic studies (Dodd, 1985) demonstrated that the AWS has one dominant factor that accounts for about 83% of the common variance.

The third dataset consisted of simulated responses to 27 items from 500 simulees. These data were generated according to the rating scale model using standard procedures. The items were constructed to have four response alternatives. Consequently, three response threshold values were specified for the set of 27 items, and a scale value was specified for each of the items.
The item parameters used to generate the data were those estimates reported by Masters and Wright (1981) based on real responses to fear of crime items. More specifically, the item parameter estimates for 9 items reported by Masters and Wright were treated as known item parameters and were used as input into the data generation program. Given the fact that 9 items is too small an item bank for CAT, the size of the item pool was tripled by duplicating Masters and Wright's item parameter estimates for the 9 items twice, and simulated item responses were thus generated for 27 items. Conventional procedures were used to generate the simulated item responses according to the rating scale model. The reader is referred to


Dodd (1990) for a detailed description of the data generation procedure. Response strings to 27 items for 500 simulees were generated for later use in the simulated adaptive measurement procedures. Because these data were generated according to the rating scale model, there was no need to assess the unidimensionality of the data.

Calibration. For each of the three datasets, a two-stage procedure outlined by Masters and Wright (1981) was used to obtain the estimates of the item parameters according to the rating scale model. In the first stage the computer program PARTIAL¹ was used to obtain item parameter estimates based on the partial credit model. This program was written according to the calibration procedures and estimation equations specified by Masters (1982) for the partial credit model. The second stage involved obtaining estimates of the threshold values and of the scale value parameters from the step value estimates obtained from the PARTIAL program. For each item, the partial credit model's step estimates were simply averaged to obtain the estimate of the scale value for the item. Estimates of the threshold values were obtained by first transforming each of the partial credit step value estimates for an item into a deviation score from the scale value for that item. The deviation scores for each step were then averaged across the items to obtain the estimate of the threshold value for each step. Note that, generally, these estimates will not be identical to those yielded by a computer program that estimates the item parameters of the rating scale model directly.

A computer program for the rating scale model was used to simulate computerized adaptive attitude measurement using a sample of 200 persons from each of the three datasets, respectively. The maximum likelihood estimation method was used to estimate the person's attitude trait level after each item.
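The second stage of the calibration is simple arithmetic on the partial credit step estimates, and can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the PARTIAL program: the function name `rating_scale_from_steps` and the step estimates in `steps` are hypothetical, and every item is assumed to have the same number of steps.

```python
def rating_scale_from_steps(step_estimates):
    """Convert partial credit step estimates to rating scale estimates.

    step_estimates[i][j] is the estimate of step j for item i.
    Returns (scale_values, thresholds).
    """
    n_items = len(step_estimates)
    n_steps = len(step_estimates[0])

    # Scale value of an item: the mean of that item's step estimates.
    scale_values = [sum(steps) / n_steps for steps in step_estimates]

    # Deviation of each step estimate from its own item's scale value.
    deviations = [[s - b for s in steps]
                  for steps, b in zip(step_estimates, scale_values)]

    # Threshold j: the mean of the step-j deviations across items.
    thresholds = [sum(dev[j] for dev in deviations) / n_items
                  for j in range(n_steps)]
    return scale_values, thresholds


# Hypothetical step estimates for three items with three steps each.
steps = [
    [-1.0, 0.2, 1.4],
    [-0.5, 0.5, 1.5],
    [-2.0, -0.4, 0.6],
]

scale_values, thresholds = rating_scale_from_steps(steps)

if __name__ == "__main__":
    print("scale values:", [round(b, 3) for b in scale_values])
    print("thresholds:  ", [round(t, 3) for t in thresholds])
```

Because each item's deviations sum to zero by construction, the averaged thresholds also sum to zero, which gives a quick check on the computation.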
Prior to maximum likelihood estimation, however, it was necessary to use a specified stepsize along the theta scale as preliminary theta estimates to administer the first two or three items. The variable stepsize recommended by Dodd was used to change the theta estimate by half the distance between the previous theta estimate and either of the two extreme scale value estimates for the item pool. If the response to the most recent item administered was in the lower half of the response categories, the lowest scale value estimate was used, while a response in the upper half of the response categories resulted in using the upper extreme scale value estimate.

¹ Contact the first author for information about the PARTIAL computer program.
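The variable stepsize rule described above amounts to a one-line update. The sketch below is illustrative only. The function name is hypothetical, and the treatment of the middle category when the number of categories is odd is an assumption (here it is counted in the lower half), since the text does not say how that case was handled. The extreme scale values are the AWS minimum and maximum from Table 17-1.

```python
def variable_stepsize_update(theta, response, n_categories,
                             min_scale_value, max_scale_value):
    """Move theta halfway toward an extreme scale value of the item pool.

    Responses are scored 0..n_categories-1. A response in the lower half
    moves theta toward the lowest scale value; a response in the upper
    half moves it toward the highest. ASSUMPTION: with an odd number of
    categories the middle category is treated as lower half.
    """
    if response < n_categories / 2:
        target = min_scale_value
    else:
        target = max_scale_value
    return theta + 0.5 * (target - theta)


# AWS pool extremes from Table 17-1; four response categories scored 0-3.
AWS_MIN, AWS_MAX = -1.864, 0.903

theta_after_high = variable_stepsize_update(0.0, 3, 4, AWS_MIN, AWS_MAX)
theta_after_low = variable_stepsize_update(0.0, 0, 4, AWS_MIN, AWS_MAX)

if __name__ == "__main__":
    print(theta_after_high, theta_after_low)
```

Starting from a theta of zero, a response in the top category moves the provisional estimate to half of the maximum scale value, and a response in the bottom category moves it to half of the minimum.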


Given the current theta estimate, the two item-selection procedures studied by Dodd were used in the present investigation to determine the most appropriate item remaining in the pool to administer next. The maximum information method involved choosing the item that provided the most information for the current theta estimate, while the scale value method involved selecting the item with the scale value closest to the current theta estimate. Unlike the Dodd study, however, the CAT sessions under both item selection procedures continued to administer items until a prespecified standard error was obtained or a maximum of 20 items had been administered. For the ADCOM the minimum standard error was arbitrarily set at .25. For the AWS dataset a slightly higher standard error level of .30 was used because the average standard error for the full scale calibration was higher than .25. An even higher standard error level of .50 was used for the artificial dataset because the average standard error for the full-scale calibration was .41.

Data Analyses

Descriptive statistics, correlations, and scattergrams were used to evaluate the two CAT conditions. For each dataset, means and standard deviations were obtained to describe the thetas, standard errors, and number of items administered under the two CAT conditions as well as for the full scale calibration. Scattergrams and correlations were run to determine the degree of linear relationship that existed between the theta estimates obtained under the two CAT conditions and the full scale calibration run for each dataset. For the artificial data, the theta estimates yielded by the CATs and the full scale calibration were also correlated with the known z values used to generate the data. In addition, the root mean squared error (RMSE) statistic was calculated to measure the correspondence between the full scale theta estimates and those yielded by the CAT procedures.

Results

Item pool calibration.
For the AWS data, the step value for the lowest category of one item could not be estimated because only one person responded in that category for that item. The item was thus deleted from the scale, and the remaining items recalibrated. Descriptive statistics for the scale values of the remaining 24 items and the threshold values are presented in Table 17-1. Initial results revealed that an estimate of the lowest step value for


Table 17-1 Descriptive Statistics, Scale Value Estimates, and Threshold Estimates for Three Datasets

                      AWS       ADCOM     Artificial
Scale Value
  Mean               -.475      .855      -.097
  SD                  .838      .709       .685
  Minimum           -1.864    -1.985     -1.474
  Maximum             .903      .829      1.169
  Number of Items      24        39         27
Threshold
  1                  -.728    -1.347     -4.688
  2                   .091     -.536       .880
  3                   .819      .024      3.807
  4                            1.859

one item of the ADCOM scale was unobtainable because no person responded in the lowest category. In effect this item did not have the same functional response scale as the other 39 items. Consequently the item was deleted from the scale and the remaining 39 items recalibrated. Table 17-1 shows the descriptive statistics for the scale values and the threshold values. The PARTIAL program yielded step estimates for all 27 items of the fear of crime scale. Descriptive statistics for the scale values and the threshold values are displayed in Table 17-1.

Table 17-2 presents the means and standard deviations of the theta estimates, standard error of the theta estimates, and the number of items administered under the two adaptive testing conditions and the full scale calibration for each of the three datasets. The mean theta estimates for each of the two CAT conditions and for the full scale calibration within each dataset were very similar. For the AWS and ADCOM, the mean standard errors of the theta estimates were identical for the two adaptive conditions, which administered virtually the same number of items, on the average. The scale value item selection procedure administered one fewer item, on the average, for the artificial data, but resulted in approximately the same average standard error of the theta estimates as the item information selection technique.

The theta estimates yielded by each of the two CAT conditions were correlated with the theta estimates from full scale calibration. The resulting


Table 17-2 Descriptive Statistics for Three Datasets Under Two Adaptive Conditions and the Full-Scale Calibration

Dataset and               Theta Estimate    Standard Error    Number of Items
Testing Condition         Mean     SD       Mean     SD       Mean      SD
AWS (N = 200)
  Scale Value              .37     .89      .32      .09      16.10     2.52
  Information              .38     .88      .32      .09      16.10     2.51
  Full Scale               .35     .88      .26      .10      24.00
ADCOM (N = 200)
  Scale Value             -.03    1.18      .27      .04      19.13     1.12
  Information             -.07    1.18      .27      .04      18.88     1.33
  Full Scale              -.02    1.17      .21      .04      39.00
Artificial (N = 200)
  Scale Value              .05    1.13      .50      .07      16.12     2.80
  Information             -.05    1.07      .51      .05      17.08     2.52
  Full Scale               .02    1.09      .41      .06      27.00

coefficients of correlation were very high (.97 to .98) and virtually the same regardless of the item selection procedure used (see Table 17-3). For the artificial data, it was possible to determine the relationship between the known z values used to generate the data and the theta estimates yielded by the two CAT conditions and the full scale calibration. The coefficients of correlation obtained for the two CAT conditions were virtually the same (.88 and .89), but somewhat lower than what has been found in similar research. For example, using a different artificial dataset, Dodd (1990) obtained coefficients of .95 to .96 for various CAT procedures based on the rating scale model. The coefficient of correlation was also somewhat lower than expected for the full scale calibration (r = .92). These slightly lower coefficients of correlation are due to the size of the standard errors of the theta estimates

Table 17-3 Pearson Correlation Coefficients and RMSE Statistics for Three Data Sets

                       AWS     ADCOM    Artificial
r(θFS, θSV)            .98     .98      .96
r(θFS, θINFO)          .98     .98      .97
r(z, θSV)                               .89
r(z, θINFO)                             .88
r(z, θFS)                               .92
RMSE(θFS, θSV)         .16     .25      .31
RMSE(θFS, θINFO)       .16     .26      .28


from the full scale calibration of the artificial data (Mean = .41). The higher coefficients of correlation reported by Dodd (1990) resulted from a full scale calibration that produced an average standard error of .23. The standard errors are a function of the scale information, which in turn is related to the threshold values for the rating scale. The current item pool had an exceptionally wide spread of the threshold values compared to other scales reported in the literature (Andrich, 1978a; Dodd, 1990). The RMSE statistics, which are also presented in Table 17-3, mirrored the results found for the correlation coefficients. For each dataset, the RMSE statistics were virtually the same for the two CAT conditions.
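For readers who wish to compute the two agreement statistics used here, a minimal Python sketch follows. It assumes the conventional definitions, RMSE as the square root of the mean squared difference and r as the Pearson product-moment correlation. The theta values in `full_scale` and `cat_estimates` are hypothetical and are not data from the study.

```python
import math


def rmse(x, y):
    """Root mean squared error between two equal-length lists of estimates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical full scale and CAT theta estimates for five persons.
full_scale = [-1.2, -0.4, 0.1, 0.8, 1.5]
cat_estimates = [-1.0, -0.6, 0.3, 0.7, 1.6]

if __name__ == "__main__":
    print("r    =", round(pearson_r(full_scale, cat_estimates), 3))
    print("RMSE =", round(rmse(full_scale, cat_estimates), 3))
```

The two statistics answer different questions: the correlation is unaffected by a constant shift between the two sets of estimates, whereas the RMSE is not, which is one reason to report both.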

DISCUSSION

The results of the item information analyses confirmed the findings of Dodd (1987) and provided further clarification and extension of other findings. Both studies demonstrated that across the entire trait continuum, items with the same number of threshold values provided the same total amount of information. The finding that items from scales with more threshold values yielded more total information across the entire theta scale than items from scales with fewer threshold values is not surprising. This finding is consistent with the belief that more categories provide more information or allow for finer discriminations among persons than items with fewer categories.

The systematic comparison of item information functions in this study provided further clarification of the previous finding that the item information function peaked near the scale value of the item. The results revealed that the magnitude of the shift away from the scale value for a given item in a scale was a function of the degree of asymmetry of the threshold values. When there was an odd number of asymmetric threshold values, the peak of the item information function was shifted away from the scale value in the direction of the dominant sign of the threshold values. For the scales with an even number of threshold values, the degree of shift away from the scale value for a given item was also found to be a function of the distance between adjacent threshold values. In addition, it was discovered that if the distance between the middle threshold values was large when the number of threshold values was even, the information function could be bimodal even if the thresholds were symmetric around zero. The fact that the shift in the peak of the item information function was found to be 2.1 logits away from the scale values for one real


dataset suggested that using the closest scale value to select items for administration during an adaptive attitude measurement session (Dodd, 1990) might not be the best item selection procedure. The results of the CAT simulations that compared the scale value and the maximum information item selection procedures for three datasets did not, however, lead to this conclusion. Even though the two item selection procedures administered different items, the results of the two CATs were for all practical purposes the same. This result was particularly impressive for the artificial data, given the less than optimal item pool and the fact that the shift in the peak of the information function away from the scale value was greater than 2 logits. While the results reveal that both item selection procedures worked equally well, the scale value item selection procedure requires less computing time and thus would be the preferred method. This would be particularly true for large item pools because information would not have to be calculated for every item that had not been administered. Determining the closest scale value to the latest theta estimate would be much more efficient.
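The difference in computational work between the two selection rules can be seen in a short sketch. The Python below is illustrative only and is not the simulation program used in the study. It assumes the usual rating scale parameterization, with item information equal to the variance of the item score. The thresholds are the artificial-dataset estimates from Table 17-1; the three scale values in `pool` are hypothetical.

```python
import math


def item_information(theta, scale_value, thresholds):
    """Rating scale model item information (variance of the item score)."""
    psi = [0.0]
    for t in thresholds:
        psi.append(psi[-1] + theta - (scale_value + t))
    peak = max(psi)
    weights = [math.exp(p - peak) for p in psi]
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(x * p for x, p in enumerate(probs))
    return sum((x - mean) ** 2 * p for x, p in enumerate(probs))


def select_by_scale_value(theta, scale_values, administered):
    """Pick the unused item whose scale value is closest to theta."""
    remaining = [i for i in range(len(scale_values)) if i not in administered]
    return min(remaining, key=lambda i: abs(scale_values[i] - theta))


def select_by_information(theta, scale_values, thresholds, administered):
    """Pick the unused item with the most information at theta.

    Requires one information calculation per remaining item.
    """
    remaining = [i for i in range(len(scale_values)) if i not in administered]
    return max(remaining,
               key=lambda i: item_information(theta, scale_values[i], thresholds))


ARTIFICIAL_THRESHOLDS = [-4.688, 0.880, 3.807]  # Table 17-1, artificial dataset
pool = [-2.0, 0.0, 2.0]                         # hypothetical scale values

if __name__ == "__main__":
    theta = 2.0
    print("closest scale value picks item",
          select_by_scale_value(theta, pool, set()))
    print("maximum information picks item",
          select_by_information(theta, pool, ARTIFICIAL_THRESHOLDS, set()))
```

With these asymmetric thresholds the two rules pick different items at theta = 2.0, because the information function of each item peaks well above its scale value. The scale value rule needs only a comparison of distances, whereas the information rule evaluates the category probabilities of every remaining item.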

REFERENCES

Andrich, D. (1978a). Application of a psychometric model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.
Andrich, D. (1978b). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick, Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Dodd, B.G. (1985). Attitude scaling: A comparison of the graded response and partial credit latent trait models (Doctoral dissertation, University of Texas at Austin, 1984). Dissertation Abstracts International, 45, 2074A.
Dodd, B.G. (1987, April). Computerized adaptive testing with the rating scale model. Paper presented at the Fourth International Objective Measurement Workshop, Chicago.
Dodd, B.G. (1990). The effect of item selection procedure and stepsize on computerized adaptive attitude measurement using the rating scale model. Applied Psychological Measurement, 14, 355-366.
Dodd, B.G., & Koch, W.R. (1987). Effects of variations in step values on item and test information in the partial credit model. Applied Psychological Measurement, 11, 339-351.
Dodd, B.G., Koch, W.R., & De Ayala, R.J. (1989). Operational characteristics of adaptive testing procedures using the graded response model. Applied Psychological Measurement, 13, 129-143.


Koch, W.R. (1983). Likert scaling using the graded response latent trait model. Applied Psychological Measurement, 7, 15-32.
Koch, W.R., & Dodd, B.G. (1989). An investigation of procedures for computerized adaptive testing using partial credit scoring. Applied Measurement in Education, 2, 335-357.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Masters, G.N., & Wright, B.D. (1981). A model for partial credit scoring (Research Memorandum No. 31). Chicago: University of Chicago, MESA Statistical Laboratory.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17.
Samejima, F. (1976). Graded response model of the latent trait theory and tailored testing. In C.K. Clark (Ed.), Proceedings of the First Conference on Computerized Adaptive Testing. Washington, DC: U.S. Government Printing Office.
Spence, J.T., Helmreich, R., & Stapp, J. (1973). A short version of the Attitude toward Women Scale (AWS). Bulletin of the Psychonomic Society, 2, 219-220.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567-577.
Valentine, R.J. (1978). Audit of administrator communication. Columbia, MO: Jerry W. Valentine.
Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.

Chapter 18

Assessing Unidimensionality for Rasch Measurement

Richard M. Smith
University of South Florida

Chang Y. Miao
American Dental Association

Unidimensionality is one of the requirements for Rasch measurement, as it is for most measurement models. However, the primary sources on Rasch measurement have very little to say about the requirement of unidimensionality and provide no recommendation as to methods for directly testing this assumption. Rasch (1960/1980), although describing several methods for "control of the model" and discussing the applicability of the model to data, does not directly address the issue of unidimensionality. Wright and Stone (1979) do not explicitly discuss unidimensionality of the data as a requirement of Rasch measurement. The notion, however, is implicit in their definition of a variable that results exclusively from items that share a common line of inquiry. Wright and Stone provide extensive documentation of methods to test the fit of the data to the model, suggesting that fit of the data to the model assures that the assumptions of the model were met. Wright and Masters (1982) define unidimensionality as a basic requirement for measurement and further expand the assessment of fit on an item and person level, suggesting fit on this level assures the


existence of a single variable. Andrich (1988) is more explicit in his discussions of unidimensionality, but again relies on tests of fit, either in testing the invariance of parameter estimation across subgroups or analyzing the differences between observed response patterns and probabilities developed from the estimated model parameters. Consistent throughout these works is the notion that the unidimensionality assumption is satisfied if the data fit the model. Practically, this is often interpreted by researchers using Rasch measurement as meaning that the requirement of unidimensionality is met if the fit values that accompany most calibration programs for items and/or persons do not depart significantly from their expected values. Hattie (1985) not only provides a comprehensive review of the various definitions of unidimensionality that appear in the psychometric literature, but also reviews a large number of studies that have attempted to develop and validate the use of a variety of indices for assessing unidimensionality. Given this review, there is reason to be extremely skeptical of the use of any fit indices based on the Rasch model, and there is little encouragement to use any principal component or factor analytic procedure. However, the practicality of the situation is such that many researchers do use the family of Rasch measurement models. The research based on the use of this model rarely contains evidence that the dimensionality issue has been addressed in any method other than looking at the general level of item and/or person fit available in the common calibration programs. It is also the case that many other researchers typically rely on factor analytic or principal component techniques to assess the unidimensionality of tests, either in the development stage or in assessing the applicability of a given test to a specific sample. It would appear helpful to directly compare the results of using these commonly available techniques.
In this study the use of the Rasch fit indices will be limited to the unweighted total item fit statistic (OUTFIT) found in such Rasch calibration programs as BICAL, MSCALE, and BIGSCALE. The choice between the principal component and factor analytic procedure is more difficult. Hattie (1985) separates principal component indices from factor analysis indices for several reasons, including the fact that factor analysis requires a hypothesis as to the number of factors. It is exactly for this reason that principal component analysis was chosen for this study. It seems reasonable that researchers using the Rasch model to analyze item level response data believe that, at least operationally, the test is unidimensional. Otherwise, there would be little reason to choose a model that makes unidimensionality a requirement for measurement.


It is always prudent to determine that the sample of persons taking a particular examination responded to the items in a manner which suggests unidimensionality. No matter how many times a test has been demonstrated to be unidimensional for other circumstances, it is always necessary to reconfirm this for the current circumstances. Given this framework, it is unlikely that the researchers wanting to assess unidimensionality would have a preconceived notion of a multidimensional factor structure, if it exists, but rather are simply checking to see if the common threats to unidimensionality, such as speededness, sex bias, race bias, or interactions between content and instruction, have affected the dimensionality of the test. This reasoning suggests that the principal component analysis, which assumes no a priori number of factors, would be the most appropriate method for assessing multidimensionality.

OBJECTIVE

The purpose of this study is to compare two methods of testing the assumption of unidimensionality: the Rasch fit statistic approach detailed in the references cited above and principal component factor analysis. The factor analytic technique is not based on the same set of assumptions as Rasch measurement and can be applied prior to Rasch analysis to test the unidimensionality assumption. The Rasch fit techniques must be used in the context of the Rasch models and require the estimation of parameters before they can be applied. To fully test the applicability of these two approaches, the true factor structure of the data must be known a priori, thus requiring simulated data. The use of real test data with an unknown factor structure would not be useful in deciding which of two or more methods of assessing the factor structure is appropriate, since there is no known structure against which to compare the results. Usually, a study using real data results in the methods that happen to agree being declared the winners because they happen to agree. This decision has no relevance in answering the question of which of the methods best describes the true factor structure of the data.

METHODS

The data for this study were simulated so as to represent varying degrees of correlation between the two factors represented in the response data and varying numbers of items representing each of the two factors. The correlations between the two factors (X and Y) ranged from 0.10 (1% common variance) to 0.87 (75% common variance), with nine different values for the common variance (.01, .04, .09, .16, .25, .36, .50, .64, .75). For each data set the total number of items on the test was set at 50. This test length was chosen to represent an average test length. The number of items in each factor was also varied across five different ratios of items for the two factors (45 & 5, 40 & 10, 35 & 15, 30 & 20, and 25 & 25, with the number of X factor items listed first). This resulted in 45 different combinations of common variance and ratio of X to Y items. For each data set a sample of 1,000 persons was used, again to represent an average number of examinees. For each person two sets of independent unit normal ability distributions were generated (X and Z). The unit normal distributions for each data set were created in SYSTAT. From these two distributions the correlated data were produced by substituting one of the common variance values listed above in the following equation:

$$Y_i = aX_i + (1 - a)Z_i$$

where $X_i$ is the first independent ability for person i, $Z_i$ is the second independent ability, a is the amount of common variance, and $Y_i$ is the correlated ability. The two abilities ($X_i$ and $Y_i$) for each person were then used to create simulated responses to the 50-item test. The X ability was used to generate the responses to the items measuring the X factor, and the Y ability was used to generate the responses to the items measuring the Y factor using the Rasch probability equation for dichotomous data:

$$P(x = 1 \mid X, d) = \frac{\exp(X - d)}{1 + \exp(X - d)}$$

and

$$P(y = 1 \mid Y, d) = \frac{\exp(Y - d)}{1 + \exp(Y - d)}.$$

Here X and Y are the person abilities, d is the item difficulty, and P is the probability of a correct response. Each probability was then compared to a random number between 0.0 and 1.0, chosen specifically for that person-item interaction using the random number function available in BASIC. If the value of the random number exceeded the probability, the item was assigned a response of 0; otherwise, the response was set to 1. The item difficulties used in the simulations were uniformly distributed in sets of five items (with item difficulties in logits of -1, -.5, 0, +.5, and +1) so that the number of items in each factor did not have an effect on the mean or distribution of the item difficulties for that data set. In this study, two replications of each data set were created. The resulting sets of simulated response patterns were analyzed by two methods. The first was calibration and item analysis using the MSCALE program (Wright, Rossner, & Congdon, 1985). This provided the Rasch item difficulties and the unweighted total fit statistic (OUTFIT) for each item. The unweighted total fit statistic is based on the standardization of the difference between the person's observed score on an item and the probability of a correct response, based on the performance of the total calibration sample on the item and the person's total score on the test (Wright & Stone, 1979; Smith, 1986). The squared standardized residual is summed over all persons who took the item and converted to a mean square by dividing by the number of persons:

$$MS(UT)_i = \frac{1}{N} \sum_{n=1}^{N} \frac{(x_{ni} - P_{ni})^2}{P_{ni}(1 - P_{ni})}$$

where N is the number of persons, $x_{ni}$ is the scored response (1, 0) of person n to item i, and $P_{ni}$ is the probability of a correct response for person n to item i. This mean square (MS(UT)) is then converted to an approximate unit normal using the cube root transformation. Values of the unweighted total fit statistic greater than 2 generally indicate a person has unexpected responses in his or her response pattern: easy items answered incorrectly for higher ability persons or hard items answered correctly for lower ability persons. The second analysis was principal component factor analysis using SAS. This provided an estimate of the number of factors contained in each data set and factor loadings for each item. In the case of the Rasch analysis, the magnitude and the variance of the outfit statistics were used to assess unidimensionality. In the case of factor analysis, the eigen values for each factor and the factor loadings for each item were used to assess unidimensionality. Table 18-1 contains the equations used to create the correlated abilities. A total of 10 different conditions (nine different amounts of common variance and no common variance) were developed with two sets of correlated abilities generated for each condition. The expected correlation between the two sets of ability based on the amount of common variance is also listed. Finally, the observed correlation between the two sets of ability is listed. For Tables 18-3 through 18-5, the results represent the average of two replications based on the two sets of correlated person abilities reported in Table 18-1.
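The unweighted mean square and its cube root standardization can be sketched as follows. This is an illustrative Python sketch, not the MSCALE code. The mean square follows the definition in the text (the average squared standardized residual). The standardization uses the form of the cube root (Wilson-Hilferty) transformation commonly given for dichotomous Rasch fit statistics, t = (MS^(1/3) - 1)(3/q) + q/3, with q^2 equal to the sum of 1/(P(1 - P)) over persons divided by N^2, minus 4/N. That exact form is an assumption here, since the chapter does not print it. The response strings and probabilities are hypothetical.

```python
import math


def outfit(responses, probabilities):
    """Unweighted mean square and its cube-root standardization for one item.

    responses:     scored responses (1 or 0), one per person
    probabilities: model probability of a correct response, one per person
    Returns (mean_square, t). t is NaN if the variance term q is zero.
    """
    n = len(responses)
    variances = [p * (1.0 - p) for p in probabilities]

    # Squared standardized residuals, averaged over persons.
    z_squared = [(x - p) ** 2 / w
                 for x, p, w in zip(responses, probabilities, variances)]
    mean_square = sum(z_squared) / n

    # ASSUMED form of the model variance of the mean square (dichotomous case).
    q_squared = sum(1.0 / w for w in variances) / n ** 2 - 4.0 / n
    if q_squared <= 0.0:
        return mean_square, float("nan")
    q = math.sqrt(q_squared)

    # Cube root (Wilson-Hilferty) transformation to an approximate unit normal.
    t = (mean_square ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0
    return mean_square, t


# Hypothetical probabilities for four persons on one item.
probs = [0.8, 0.6, 0.4, 0.2]
ms_expected, t_expected = outfit([1, 1, 0, 0], probs)      # responses follow the model
ms_unexpected, t_unexpected = outfit([0, 0, 1, 1], probs)  # responses reversed

if __name__ == "__main__":
    print("expected pattern:  ", round(ms_expected, 3), round(t_expected, 3))
    print("unexpected pattern:", round(ms_unexpected, 3), round(t_unexpected, 3))
```

The reversed response string produces a mean square well above 1 and a larger standardized value, which is the direction of misfit the statistic is meant to flag. With only four persons the normal approximation is crude; the example is meant to show the arithmetic, not realistic values.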


Table 18-1
Correlation Between Independent Ability and Correlated Ability

Data Set   Generating Equation   Expected Correlation   Observed Correlation, Simulation 1
0          .00X + 1.00Y          .00                    .07
1          .01X + .99Y           .10                    .08
2          .04X + .96Y           .20                    .11
3          .09X + .91Y           .30                    .17
4          .16X + .84Y           .40                    .26
5          .25X + .75Y           .50                    .39
6          .36X + .64Y           .60                    .55
7          .50X + .50Y           .71                    .72
8          .64X + .36Y           .80                    .89
9          .75X + .25Y           .87                    .96

Observed Correlation, Simulation 2 (only nine of the ten values survive, so their row alignment is uncertain): -.05 -.02 .06 .13 .31 .49 .42 .88 .95
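The expected correlations in Table 18-1 equal the square root of the common variance (for example, .25 gives .50 and .75 gives .87). The sketch below illustrates that relation under a stated assumption: the coefficients in the generating equations are read as proportions of variance, so the weights applied to two independent unit-normal abilities are their square roots. The chapter does not show its generating code, and all names here are illustrative.

```python
import math
import random


def expected_correlation(common_variance):
    """Correlation implied by a given proportion of common variance."""
    return math.sqrt(common_variance)


def correlated_abilities(common_variance, n_persons, seed=0):
    """Return (x, z), where z shares `common_variance` of its variance with x.

    Assumption: x and y are independent unit normals and the table's
    coefficients are proportions of variance, so the weights are square roots.
    """
    rng = random.Random(seed)
    a = math.sqrt(common_variance)
    b = math.sqrt(1.0 - common_variance)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    zs = [a * x + b * y for x, y in zip(xs, ys)]
    return xs, zs


def pearson_r(u, v):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((p - mean_u) * (q - mean_v) for p, q in zip(u, v))
    var_u = sum((p - mean_u) ** 2 for p in u)
    var_v = sum((q - mean_v) ** 2 for q in v)
    return cov / math.sqrt(var_u * var_v)


for c in (0.01, 0.25, 0.50, 0.75):
    x, z = correlated_abilities(c, n_persons=5000, seed=1)
    print(c, round(expected_correlation(c), 2), round(pearson_r(x, z), 2))
```

Observed correlations from any finite sample scatter around the expected value, which is the kind of sampling variation that the observed columns of Table 18-1 reflect.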

RESULTS

The interpretation of the factor analytic results depends in large part on the choice of the critical value of the eigen values. To determine the best value to be used, a set of single factor data was created. The results, shown in Table 18-2, indicated that there were a considerable number of factors identified with eigen values greater than 1.0. However, the eigen values for the second component never exceeded 1.40 in the four simulations of unidimensional data. Consequently, the value 1.4 was chosen to determine the presence of a second factor in the two factor simulations.

The results of the principal component factor analysis are presented in Table 18-3. Overall, the factor analytic technique was able to detect the presence of two factors at all variations in the number of X factor

Table 18-2
Results of Principal Component Analysis Unidimensional Data

                         Eigen Values
Data Set   No. of Items  Factor 1   Factor 2   Factor 3   N > 1
0-1        50            8.51       1.26       1.23       13
0-2        50            7.75       1.33       1.25       15
0-3        50            8.69       1.21       1.19       13
0-4        50            8.43       1.26       1.22       13
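The eigen value rule above can be illustrated with a small sketch. It is not the SAS procedure used in the study: the power iteration routine, the function names, and the 3 x 3 example matrix are all hypothetical, and only the 1.4 critical value and the second eigen values of Table 18-2 come from the text.

```python
import math
import random


def top_eigenvalues(matrix, k, iterations=500, seed=0):
    """Largest k eigenvalues of a symmetric matrix by power iteration with deflation."""
    rng = random.Random(seed)
    n = len(matrix)
    work = [row[:] for row in matrix]
    values = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(c * c for c in w))
            v = [c / norm for c in w]
        av = [sum(work[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(a * b for a, b in zip(av, v))  # Rayleigh quotient
        values.append(eigenvalue)
        # Deflate: remove the component just found before the next pass.
        for i in range(n):
            for j in range(n):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return values


def second_factor_present(second_eigenvalue, critical=1.4):
    """Decision rule from the text: flag a second factor above the critical value."""
    return second_eigenvalue > critical


# Hypothetical symmetric matrix with known eigenvalues 3, 1, 1.
example = [[2.0, 1.0, 0.0],
           [1.0, 2.0, 0.0],
           [0.0, 0.0, 1.0]]
print([round(v, 3) for v in top_eigenvalues(example, 2)])

# Second eigen values of the four unidimensional data sets in Table 18-2.
unidimensional_second = [1.26, 1.33, 1.21, 1.26]
print([second_factor_present(v) for v in unidimensional_second])
```

With the 1.4 critical value, none of the four unidimensional data sets is flagged, which is the property the authors used to choose that value.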


Table 18-3
Results of Principal Component Analysis Multidimensional Data

           Number of Items   Eigen Values
Data Set   Y vs. X           Factor 1   Factor 2
1          5-45              8.50       1.71
1          10-40             7.65       2.52
1          15-35             6.73       3.23
1          20-30             5.93       4.09
1          25-25             5.25       4.52
2          5-45              8.36       1.61
2          10-40             7.65       2.43
2          15-35             6.94       3.14
2          20-30             5.80       3.79
2          25-25             5.40       4.31
3          5-45              8.50       1.61
3          10-40             7.83       2.29
3          15-35             6.99       2.89
3          20-30             6.14       3.60
4          5-45              8.59       1.49
4          10-40             7.72       2.01
4          15-35             7.05       2.64
4          20-30             6.19       3.20
5          5-45              8.51       1.48
5          10-40             7.54       1.88
5          15-35             7.16       2.34
5          20-30             6.46       2.76
6          5-45              8.66       1.40
6          10-40             8.00       1.67
6          15-35             7.16       2.02
6          20-30             6.66       2.08
7          5-45              8.28       1.24
7          10-40             7.52       1.48
7          15-35             7.36       1.61
7          20-30             6.70       1.71
8          5-45              8.77       1.21
8          10-40             8.47       1.29
8          15-35             8.12       1.31
8          20-30             7.50       1.37
9          5-45              8.88       1.22
9          10-40             8.43       1.22
9          15-35             8.28       1.28
9          20-30             7.95       1.29

The remaining columns are listed in extraction order; their alignment to the rows above is not recoverable.

Eigen Values, Factor 3: 1.22 1.20 1.21 1.20 1.21 1.19 1.24 1.22 1.18 1.26 1.19 1.21 1.21 1.21 1.19 1.24 1.21 1.21 1.21 1.23 1.29 1.24 1.20 1.29 1.24 1.26 1.23 1.28 1.29 1.28 1.17 1.21 1.26 1.30 1.17 1.19 1.20 1.25

N > 1: 12 12 11 13 12 12 13 12 12 14 12 13 13 12 13 13 13 14 12 14 12 14 12 13 14 15 13 15 14 15 12 14 13 15 13 14 14 14

Percent Correctly Loaded on Factor (Factor 1 and Factor 2 values intermixed): 100 98 97 100 96 100 98 97 97 100 100 100 100 68 100 100 100 100 68 100 100 93 100 100 100 93 75 80 90 80 92 100 98 100 100 100 100 100 100 100 100 100 97 98 100 100 100 100 95 100 93 100 100 100 97 100 100 97 100 75 80 80 53 50 40 70 40 35 40 20 7 15 0 0 7 10


and Y factor items for the following common variance levels: .01, .04, .09, .16, and .25 (data sets 1 to 5). For the .36 and the .50 common variance levels (data sets 6 and 7), the eigen values were lower than 1.40 for the 45-5 item ratio. For the .64 and .75 common variance levels (data sets 8 and 9), the eigen values were less than 1.40 for all of the item ratio levels. Also summarized in Table 18-3 is the percentage of items that loaded on the appropriate factor for each of the replications. As the proportion of common variance between ability X and Y increased, the principal component analysis was less able to assign the Y ability items correctly to that factor.

The interpretation of the Rasch item fit statistics was accomplished by comparing the value of the individual fit statistics to a critical value of +2.00, a commonly used value that corresponds approximately to a Type I error rate of .05 (Smith, 1991). The presence of a second factor was determined by evaluating the number of X factor items with a fit value greater than +2.00 and the number of Y factor items with a fit value greater than +2.00. The results of this analysis are summarized in Table 18-4.

For all levels of common variance involving 45 items on the X factor and 5 items on the Y factor, the percentage of the X factor items that had a fit value greater than +2.00 was at or below the Type I error rate, and 100% of the items on the Y factor had fit values greater than +2.00. For the 40 X factor and 10 Y factor item comparisons across levels of common variance, 90% or more of the Y factor items had fit values greater than +2.00, while the percentage of X factor items with fit values greater than +2.00 was less than the Type I error rate. The only exception was the 75% common variance level (data set 9), where only 35% of the Y factor items had fit values greater than +2.00.
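The flagging step described above amounts to counting, separately for the X and Y factor items, how many standardized fit values exceed +2.00. The sketch below is a hypothetical illustration: the fit values are invented, the names are illustrative, and the tail probabilities are those of an exact unit normal, shown only for reference next to the approximate .05 rate that the text attributes to Smith (1991).

```python
import math


def upper_tail(z):
    """P(Z > z) for an exact unit normal variate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def percent_misfitting(fit_values, critical=2.0):
    """Percentage of items whose standardized fit exceeds the critical value."""
    flagged = sum(1 for t in fit_values if t > critical)
    return 100.0 * flagged / len(fit_values)


# Hypothetical standardized fit values for a 45-item X set and a 5-item Y set.
x_fit = [-1.2, -0.8, 0.3, -1.5, 0.9] * 9
y_fit = [8.1, 7.4, 9.0, 6.6, 8.8]

print("X items above +2.00:", percent_misfitting(x_fit), "%")
print("Y items above +2.00:", percent_misfitting(y_fit), "%")
print("one-sided tail at 2.00:", round(upper_tail(2.0), 4))
print("two-sided tail at 2.00:", round(2.0 * upper_tail(2.0), 4))
```

A second factor is suggested when the Y percentage is high while the X percentage stays near or below the nominal error rate.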
For the 35 item X factor and 15 item Y factor comparisons across the nine levels of common variance, the percentage of X factor items with fit values greater than +2.00 remained less than the Type I error rate, while the percentage of Y factor items with values greater than +2.00 averaged over 90% up to the 25% common variance level (data set 5). Above 50% common variance (data sets 8 and 9), the percentage of Y factor items with values greater than +2.00 dropped to less than 50%. For the 30 item X factor and 20 item Y factor comparisons, the percentage of X factor items with values greater than +2.00 was less than or equal to the Type I error rate for this statistic. The percentage of Y factor items with values greater than +2.00 never exceeded 60% and dropped to 25% for the 75% common variance level (data set 9). For the 25 item X factor and 25 item Y factor comparisons for the first two levels of common variance both the percentage of X factor items and Y factor


Table 18-4
Results of Rasch Fit Analysis Multidimensional Data

           Number of Items   Y Item Fit            X Item Fit            Total Test Item Fit
Data Set   Y vs. X           Mean    S.D.   % > 2  Mean    S.D.   % > 2  Mean    S.D.   % > 2
1          5-45              8.40    1.53   100    -1.10   1.09   0      -0.15   3.10   10
1          10-40             6.08    1.37   100    -1.64   0.69   0      -0.09   3.23   20
1          15-35             3.69    1.08   100    -1.69   0.94   0      -0.07   2.66   30
1          20-30             1.52    0.83   40     -1.08   0.69   0      -0.04   1.49   16
1          25-25             -0.56   0.79   0      0.13    0.80   0      -0.09   0.79   0
2          5-45              7.88    0.86   100    -0.93   0.76   0      -0.05   2.77   10
2          10-40             5.91    1.00   100    -1.70   0.91   0      -0.17   3.21   20
2          15-35             3.59    1.16   93     -1.66   0.82   0      -0.10   2.60   28
2          20-30             1.76    0.91   40     -1.30   0.96   0      -0.08   1.76   16
2          25-25             0.20    1.14   8      -0.23   0.78   0      -0.01   0.99   4
3          5-45              8.50    1.88   100    -0.94   1.00   0      0.01    3.06   10
3          10-40             6.20    1.45   100    -1.69   0.84   0      -0.11   3.32   20
3          15-35             4.15    0.94   100    -1.93   1.01   0      -0.10   2.98   30
3          20-30             1.90    0.85   32     -1.34   0.84   0      -0.04   1.80   14
4          5-45              8.54    1.60   100    -0.96   0.87   0      0.01    3.04   10
4          10-40             5.75    1.37   100    -1.58   0.97   0      -0.11   3.13   20
4          15-35             3.79    1.03   100    -1.70   1.01   0      -0.05   2.74   30
4          20-30             1.77    0.82   40     -1.13   1.06   3      0.02    1.72   18
5          5-45              7.04    1.53   100    -0.79   0.68   0      0.00    2.54   10
5          10-40             5.00    1.42   100    -1.44   0.82   0      -0.15   2.77   20
5          15-35             3.85    1.14   93     -1.87   0.80   0      -0.16   2.81   28
5          20-30             1.91    0.90   35     -1.45   1.04   0      -0.10   1.93   14
6          5-45              5.40    1.11   100    -0.78   0.92   0      -0.17   2.09   10
6          10-40             4.20    1.12   100    -1.25   1.03   0      -0.15   2.43   20
6          15-35             3.15    1.59   73     -1.47   0.91   0      -0.09   2.42   22
6          20-30             2.21    0.99   55     -1.59   0.77   0      -0.07   2.06   22
7          5-45              4.80    0.68   100    -0.60   0.86   0      -0.01   1.83   10
7          10-40             3.08    1.14   90     -0.82   0.97   0      -0.03   1.86   18
7          15-35             2.55    0.68   87     -1.18   0.74   0      -0.07   1.87   26
7          20-30             1.89    0.91   45     -1.19   1.08   3      0.04    1.82   20
8          5-45              4.12    1.48   100    -0.44   0.86   0      0.01    1.65   10
8          10-40             2.81    1.16   90     -0.75   0.89   0      -0.04   1.71   18
8          15-35             1.95    0.99   40     -0.90   0.87   0      -0.04   1.59   12
8          20-30             1.57    1.03   30     -1.11   0.79   0      -0.04   1.59   12
9          5-45              2.34    0.31   100    -0.29   1.07   4      -0.03   1.30   10
9          10-40             1.30    0.97   30     -0.40   1.05   2      -0.05   1.23   8
9          15-35             1.22    0.96   20     -0.54   0.91   0      0.02    1.22   6
9          20-30             1.49    0.78   25     -0.92   0.95   0      0.03    1.49   10

items with fit values greater than +2.00 was very near the Type I error rate. Table 18-4 also summarizes the overall fit values for the entire set of 50 items—that is, X and Y factor item fit statistics combined. In no case does the absolute value of the mean fit value for the test exceed


Table 18-5
Recommended Procedure to Detect Multidimensionality

           No. of Items   Prin. Components    Rasch Item
Data Set   Y vs. X        Factor Analysis     OUTFIT
1          5-45           yes                 yes
1          10-40          yes                 yes
1          15-35          yes                 yes
1          20-30          yes                 No
1          25-25          yes                 No
2          5-45           yes                 yes
2          10-40          yes                 yes
2          15-35          yes                 yes
2          20-30          yes                 No
2          25-25          yes                 No
3          5-45           yes                 yes
3          10-40          yes                 yes
3          15-35          yes                 yes
3          20-30          yes                 No
4          5-45           No                  yes
4          10-40          yes                 yes
4          15-35          yes                 yes
4          20-30          yes                 No
5          5-45           No                  yes
5          10-40          yes                 yes
5          15-35          yes                 yes
5          20-30          yes                 No
6          5-45           No                  yes
6          10-40          yes                 yes
6          15-35          yes                 yes
6          20-30          yes                 No
7          5-45           No                  yes
7          10-40          No                  yes
7          15-35          yes                 yes
7          20-30          yes                 No
8          5-45           No                  yes
8          10-40          No                  yes
8          15-35          No                  No
8          20-30          No                  No
9          5-45           No                  yes
9          10-40          No                  No
9          15-35          No                  No
9          20-30          No                  No
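The yes/no entries of Table 18-5 follow from the two ad hoc criteria stated in the text: a second eigen value greater than 1.5 for the principal component analysis, and more than 60% of the Y factor items misfitting for the Rasch item analysis. The sketch below applies those criteria to data set 7, using the second eigen values and Y item misfit percentages as read from Tables 18-3 and 18-4; the function and variable names are illustrative.

```python
def detects_by_pca(second_eigenvalue, critical=1.5):
    """Principal component criterion: second eigen value above the critical value."""
    return second_eigenvalue > critical


def detects_by_outfit(percent_y_misfit, critical=60.0):
    """Rasch criterion: more than 60% of Y factor items with fit above +2.00."""
    return percent_y_misfit > critical


# Data set 7 (.50 common variance), as read from Tables 18-3 and 18-4.
item_ratios = ["5-45", "10-40", "15-35", "20-30"]
second_eigenvalues = [1.24, 1.48, 1.61, 1.71]
percent_y_misfit = [100, 90, 87, 45]

for ratio, eigen, pct in zip(item_ratios, second_eigenvalues, percent_y_misfit):
    pca = "yes" if detects_by_pca(eigen) else "No"
    rasch = "yes" if detects_by_outfit(pct) else "No"
    print(ratio, pca, rasch)
```

For this data set the two methods disagree at both ends of the item ratio range, which is the complementary pattern the chapter goes on to discuss.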

.20. There is considerable variation in the standard deviation of the fit values, but for the 64% and 75% common variance simulations (data sets 8 and 9) the standard deviation of the fit values approaches the expected standard deviation of the null distribution (1.00). Table 18-5 combines the results of the principal component analysis


and the Rasch fit analysis. For each of these two techniques, over each of the simulated data sets and combinations of X and Y factor items, a decision was made as to whether that method was appropriate to detect multidimensionality for that combination of X and Y factor items and common variance. The criterion used to make this yes/no decision was an eigen value greater than 1.5 for the second factor in the principal component analysis, or more than 60% of the Y factor items identified as misfitting for the Rasch item analysis method. Both of the decision points were reached on an ad hoc basis, and no attempt was made to determine if they were equivalent.

The results suggest that the principal component and the Rasch item fit approaches are not sensitive to the same combinations of common variance and the number of items represented on the second factor. These results strongly indicate that in cases of a second factor with less than 64% common variance (data sets 1 through 7), the factor analytic procedure will detect the factor as long as 20% or more of the items load on that factor. If less than 20% of the items load on that factor, the technique is much less sensitive to the presence of the second factor. For data with 64% and higher common variance, the factor analytic procedure identifies only a single factor, no matter what proportion of the items load on the second factor. These results are almost the opposite of the Rasch fit values based on the unweighted total item fit statistic. The Rasch fit statistic is sensitive to the second factor until approximately 30% of the items load on the second factor for data sets 1 through 7, until approximately 20% of the items load on the second factor in data set 8, and until approximately 10% of the items belong to the second factor in data set 9.
If the percentage of items on the second factor was above that level, then the fit statistic was generally unable to detect multidimensionality, no matter what the degree of correlation between the two factors.

CONCLUSIONS

If one can assume that the original objective of the test construction process was to produce a unidimensional measure, it would be unusual to find that the test had approximately equal numbers of items on two relatively uncorrelated factors. Rather, one would expect to find the majority of the items on one factor and relatively few items on the second factor. It is also reasonable to expect that the second factor would be highly correlated with the primary factor. These are exactly the cases where the factor analytic method is inappropriate. If, in fact,


there had been equal numbers of items on uncorrelated factors, there would be reason to believe that the test developers had little understanding of the underlying construct that the test was designed to measure. Thus, although the factor analytic method detected the second factor in slightly more cases in these simulations, the Rasch item fit approach performed better in the simulations that most closely resembled the expectations discussed above for departures from an intended unidimensional test. However, a prudent practice would be to use the two methods to complement each other, thus assuring the widest possible coverage of different combinations of common variance and proportion of items loading on the second factor. Further, it should be realized that neither of the procedures worked well when more than 30% of the items loaded on a second factor that had more than 64% common variance with the first factor. In situations like this, the important question is whether the test is functionally unidimensional despite the presence of two factors.

REFERENCES

Andrich, D. (1988). Rasch models for measurement. Newbury Park, CA: Sage Publications.

Hattie, J. (1985). Methodological review: Assessing unidimensionality for tests and items. Applied Psychological Measurement, 9, 139-164.

Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (expanded ed.). Chicago: The University of Chicago Press. (Original work published 1960)

Smith, R.M. (1986). Person fit in the Rasch model. Educational and Psychological Measurement, 46, 359-372.

Smith, R.M. (1991). The distributional properties of Rasch item fit statistics. Educational and Psychological Measurement, 51, 541-565.

Wright, B.D., Rossner, M., & Congdon, R.T. (1985). MSCALE: A Rasch program for ordered categories. Chicago: MESA Press.

Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.

Wright, B.D., & Stone, M. (1979). Best test design. Chicago: MESA Press.




