The final camera copy for this work was prepared by the editors, and therefore the publisher takes no responsibility for consistency or correctness of typographical style. However, this arrangement helps to make publication of this kind of scholarship possible.

No part of this book may be reproduced in any form, by photostat, microform, retrieval system, or any other means, without the prior written permission of the publisher.
Lawrence Erlbaum Associates, Inc., Publishers
10 Industrial Avenue
Mahwah, New Jersey

Cover design by Kathryn Houghtaling Lacey

Library of Congress Cataloging-in-Publication Data

The role of constructs in psychological and educational measurement / edited by Henry I. Braun, Douglas N. Jackson, David E. Wiley.
Contents

List of Contributors ix

Preface

PART I: PERSONALITY: THEORY AND ASSESSMENT

1. The Constructs in People's Heads
   Douglas N. Jackson 3

2. The Questionnaire Construction of Personality: Pragmatics of Personality Assessment
   Willem K. B. Hofstee 19

3. Personality Theory and Assessment: Current …
   … 37

4. Socially Desirable Responding: The Evolution of a Construct
   Delroy L. Paulhus 49

PART II: INTELLIGENCE

5. Measurement From a Hierarchical Point of View
   Jan-Eric Gustafsson 73

6. The Five-Factor Personality Model: How Complete and Satisfactory Is It?
   John B. Carroll 97

7. …
   Robert Glaser and Gail P. Baxter 127

PART III: …

8. … Alternatives in Construct …
   Warren W. Willingham

9. Validity of Constructs Versus Construct Validity of Scores
   David E. Wiley

10. Constructs and Values in Standards-Based Assessment
    Robert L. Linn

11. … in Assessment
    … 315
Preface

In September 1997, the Educational Testing Service hosted a conference in honor of Sam Messick on the occasion of his retirement as an officer of ETS, a position he held for some 30 years. It was a wonderful event, not only because the invited speakers and guests included many leading researchers in psychology and educational measurement, but also because of their personal and professional connections to Sam. Held at the Chauncey Conference Center on the ETS campus, the conference gave Sam's many friends and colleagues at ETS an opportunity to participate in a milestone event and add their good wishes to those of the participants.

This volume has much to offer the reader. It comprises a set of chapters based on papers presented at the conference. As befits a scientist of Sam's interests, the chapters cover a broad spectrum of topics in the study of personality and intellect, with particular focus on constructs, validity, and values. A number of authors seized the occasion to provide a review of work in a particular area and to suggest directions for further research. Some took a more critical stance, focused on the more difficult issues in the field, and indicated how they might be resolved. Others presented their own leading-edge work. All acknowledged their debt to Sam and his seminal work that spanned more than 40 years.

One sad note at the conference was the fact that Dick Snow, Sam's dear friend, was not able to attend due to his increasingly fragile health. Dick died in December 1997, followed in August 1998 by the death of Ann Jungeblut. Ann, of course, was Sam's longtime collaborator and inseparable companion. Her death was a shock to us all, but most of all to Sam, and to Betty, his wife. But there was more sadness to come. Sam fell ill in September and, despite a valiant struggle, died on October 6, 1998. In 13 short months we had gone from the "high" of the conference to the loss of three very special people. As I write almost exactly 2 years later, it is still difficult to come to terms with that sense of loss and the realization that they might have accomplished even more had they been accorded a little more time.

Along with Sam and his assistant Kathy Howell, Ann and Dick were deeply involved in organizing the conference. Indeed, they were planning to edit the present volume as part of their tribute to Sam. Sadly, that was not to be. I decided to take on the task and recruited Doug Jackson and David Wiley, both conference participants, to serve as coeditors with me. They …
1

The Constructs in People's Heads

Douglas N. Jackson
In this chapter I describe how one can employ the conceptions of personality that exist in ordinary people to further the process of construct validation in personality. I illustrate how Sam Messick strongly influenced me in the development of my thinking about constructs and construct validation by introducing me to the quantitative analysis of judged similarity. Messick convinced me that multidimensional scaling, regarded by some as an arcane method for evaluating psychophysical judgments, had the potential to permit the representation of the constructs of ordinary people as projections on dimensions in a geometric space. This in turn provided an orderly, rigorous way of measuring people's constructions of important entities in their psychological world. Because these important entities are often other people, multidimensional scaling methods have provided a foundation for research into the organization of the constructs of ordinary people about other people. These methods do not require prior specification of a structure. And the evidence supports the view that psychologists can learn a great deal about their own constructs from studying the constructs of ordinary people.

But before discussing some of this evidence, I would like to say something about my early association with Sam Messick, particularly the first year of our collaboration that began in 1955 shortly after we both were National Institute of Health Postdoctoral Fellows at the Menninger Foundation. Gardner Murphy, one of the leading personality theorists of the day, was our sponsor. I was immediately struck by Messick's skill as a communicator, whether in explaining the intricacies of multidimensional scaling, of test theory, or of modern literary criticism. I was also fascinated by Messick as a person, one who was full of complexity. In many ways he
was a reconciliation of contradictions: a person who arose from a modest background (his father, a police officer, died when Sam was young) but possessed champagne tastes and high aspirations; liberal intellectually and politically, but with strong traditional and family values; a psychometrician, but with wide-ranging interests; a bon vivant and connoisseur of the arts, but a loyal fan of the Philadelphia baseball club; and one who held the position of vice president for research at the Educational Testing Service with decorum and propriety, but who could entertain as a raconteur, which he accomplished, as with all things, with creativity and zest.

Although Sam and I had different educational backgrounds (I graduated with a PhD in clinical psychology and he completed the ETS-sponsored PhD psychometric fellowship program at Princeton University), we immediately became intellectual allies and personal friends, beginning a collaborative venture that extended over decades and resulted in more than 25 jointly authored papers. One of the first of these was a multidimensional scaling study of the perception of personality (Jackson, Messick, & Solley, 1957), which I, after reading Sam's doctoral dissertation, proposed to him and to Charles Solley, who shared with us the position of postdoctoral research fellow. We must have been dedicated to this research project, as each of us spent many hours at a mechanical calculator extracting dimensions by hand. In this chapter I discuss some of the many studies that followed in the wake of that initial study.
Jane Loevinger (1957/1967), in seeking to elucidate the term construct, noted that "traits exist in people; constructs (here usually about traits) exist in the minds and magazines of psychologists. People have constructs too, but that is outside of the present scope" (p. 83). People's constructs are not out of the scope of this chapter, nor should they be, in my view, outside of the scope of any consideration of construct validity and personality. I set out to provide evidence in support of three points: (a) one of the most fruitful avenues for identifying and confirming constructs is through the systematic study of human judgment as it applies to personality; (b) the study of people's constructs can be undertaken with the same rigor as one normally associates with psychophysical judgment; and (c) the study of people's knowledge of constructs and of construct
interrelationships provides a solid basis for investigating and understanding a key facet of social intelligence, an important but neglected aspect of human intelligence.

Let me begin with the latter point. Imagine a point in prehistory, say 50,000 years ago, when one of our ancestors was rounding a mountain pass on an expedition when suddenly and without warning he or she encountered three sturdy-built Neanderthal strangers. A quick calculation was required to estimate and predict their probable behavior. Should one exchange pleasantries, if necessary with sign language or grunts, ignore them, or run away? The fact that you and I are here, while other potential ancestors in countless past generations perished, gives us some confidence that our ancestors made the correct calculation. And capacity for such calculations might be conveyed in a cumulative fashion in the transmission of culture and in the gene pool. The point that I make is that social intelligence is not a trivial set of cognitive abilities, but central to the survival and success of individuals and of humankind.

But we need not limit our consideration of the development of social intelligence to human prehistory. Humans are higher primates, and if there's a single description that characterizes higher primates, it is their social ability. N. Humphrey (as reported by Shreeve, 1996), in reporting on his observations of mountain gorillas, remarked:

I cannot help being struck by the fact that of all the animals in the forest, the gorillas seemed to live the simplest existence. Gorillas live in areas with a benign climate, abundant food, no natural animal enemies, and little to do but eat, sleep, and play. How have gorillas and other great apes evolved the remarkable intelligence that has been observed? (p. 292)
Humphrey believed that the source of gorilla social intelligence was social living itself. Great apes are characterized by long periods of immaturity and a great deal of direct tuition. Infants and juveniles are indirectly taught survival skills, and how to live in a hierarchical extended family group of siblings, parents, grandparents, cousins, uncles, and aunts, each blessed with superior intelligence. They learned individually how to compete for mates, food, the best sleeping places, and to avoid being hurt by the larger animals. Humphrey
characterizes the game as one of "plot and counterplot." "Social primates," Humphrey stated,

are required by the very nature of the system they create and maintain to be calculating beings. ... They must be able to calculate the consequences of their own behavior, to calculate the likely behavior of others, to calculate the balance of advantage and loss, and all this in a context where the evidence on which the calculations are based is ephemeral, ambiguous and liable to change, not least as a consequence of their own actions. ... It asks for levels of intelligence which is, I submit, unparalleled in any other sphere of living. (p. 293)
And the social pressures on evolving human populations were even greater. This was particularly true after the advent of agriculture well over ten thousand years ago, when it became more necessary to form alliances with nonkin. The growth of consciousness about the personalities of others arose in part out of an outgrowth and adaptation to processing the bewildering complexity of information from the social sphere, the immediate community of more-or-less like-minded individuals. Social intelligence is honed in a variety of ways: by providing a social context for learning, by surrounding the individual with other clever individuals who are simultaneously looking after the group but also, probably with greater motivation, themselves, and by providing challenges involving communication, threat, bluff, promise, and counter-threat. After the invention of agriculture and the opportunity for sedentary individuals to live in larger groups, sophisticated solutions were required for the problems of living and working together in limited groups and for sharing limited resources. According to Shreeve, in such a complex world, ordinary intelligence might not suffice:
The advantage would go to the player who was not only a keen observer of others' behavior, but who could also peek inside their minds and anticipate their next moves. No one has yet evolved a nervous system capable of directly reading the minds of others. But we have evolved the means to read our own. By providing awareness of the motivations and consequences of our own actions, consciousness grants us the insight into the actions of others as well.

Obviously we were not born with our social identity nor with insight into the personality of ourselves and others. But our brain has evolved in
such a way that acquisition of the knowledge and skill necessary to acquire social intelligence is present. Despite enormous advances in the measurement and understanding of cognitive ability, psychologists and educators have been hard-pressed to demonstrate empirically the kind of social intelligence that I describe, although, as Sechrest and I (Sechrest & Jackson, 1961) showed many years ago, ordinary people had no trouble in judging the degree of social intelligence in others. They also convincingly differentiated social intelligence from academic intelligence. I believe that the approach that I and my collaborators have employed might offer a foundation for understanding one important facet of social intelligence: the manner in which we infer the likelihood of behavior from limited information about the person being judged.
My approach to understanding the inferential relationships among personality-motivated behaviors stems from at least three sources. The first derives from the process of creating and evaluating a personality item. The second (and I have Sam Messick to thank for this) is drawn from experience with powerful judgment methods in psychology, particularly multidimensional scaling. The third springs from two of the branches of applied psychology with which I've had some experience, namely, clinical psychology, where clinical judgment is regarded as having a central place, and industrial psychology, where an understanding of the decision-making processes of the employment interviewer has been an important focus of research, my own and that of others.
I have probably reviewed and evaluated more than 15,000 personality, vocational interest, and attitude items in various assessment ventures. Even when I had reviewed only a fraction of that number, I realized that it was possible to judge quite accurately how the item would perform on its targeted scale, and what its relationship would be with irrelevant scales, as well as with sources of evaluative bias. But given the Meehl (1945) manifesto on empirical strategies for personality scale construction and his argument for the method of contrasted groups, I was at first reluctant to argue for the role of human judgment. Instead, I and my colleagues, notably Sam Messick, devised ever more complicated multivariate techniques to show what we already knew: that the great majority of items
prepared for a given personality scale performed as expected (Jackson, 1971). When I tried to make the point privately with colleagues, that I can judge with considerable accuracy the degree of relevance of an item to a scale, they were often surprisingly willing to agree that I had superior insight into personality. However flattering this was, I doubted it, and not only because many people granting me this insight were my present and former graduate students. The question arose as to how one could demonstrate the critical importance of "the constructs in people's heads" in formulating the mini-hypotheses that comprise the creation of a personality item. One of my first ventures in such a demonstration (Jackson, 1971) was a simple multidimensional scaling analysis of judgments of hypothetical people and of items. Consider the following person designed to represent the positive pole of the personality dimension of Autonomy.

Edward is a free-lance writer. At one time he worked for a newspaper, but quit because he felt restricted by the company's regulations. He now enjoys his work because he is his own boss. Edward spent the last year traveling alone through several European countries. He has currently taken up temporary residence in a large city's Bohemian district. Edward strongly advocated the view that today's young people are mere imitations of each other. He frequently tells other people that the world needs more "individuals." His friends describe him as independent, headstrong, and freedom-loving.
Similar personality capsules were written for the negative pole of Autonomy and of two other Personality Research Form scales, Impulsivity and Dominance. Also included were positively and negatively keyed personality items from each of the scales. Two sets of judges of hardly greater than average sophistication regarding personality rated the joint probabilities of occurrence of the behavior described by the people and by the items. Each set yielded three dimensions, each defined only by the exemplars relevant to a particular scale. The average correlation between the corresponding projections on the three dimensions between the two separate sets of judges was .99, indicating that when given the appropriate opportunity, ordinary people can make informed judgments about the relevance of the personality constructs reflected in persons and items.
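The scaling machinery behind these studies is easy to sketch. Below is a minimal classical (Torgerson) multidimensional scaling in Python; this is a modern reconstruction rather than the procedure used in the 1950s, and the dissimilarity values are invented for illustration.

```python
import numpy as np

# Judged dissimilarities among six stimuli (two exemplars for each of three
# construct poles); the values are invented for illustration only.
D = np.array([
    [0.0, 0.2, 0.9, 0.8, 0.5, 0.6],
    [0.2, 0.0, 0.8, 0.9, 0.6, 0.5],
    [0.9, 0.8, 0.0, 0.1, 0.7, 0.7],
    [0.8, 0.9, 0.1, 0.0, 0.6, 0.7],
    [0.5, 0.6, 0.7, 0.6, 0.0, 0.2],
    [0.6, 0.5, 0.7, 0.7, 0.2, 0.0],
])

# Classical (Torgerson) scaling: double-center the squared dissimilarities,
# then take the leading eigenvectors as coordinates.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
B = -0.5 * J @ (D ** 2) @ J                    # pseudo inner-product matrix
eigval, eigvec = np.linalg.eigh(B)
order = np.argsort(eigval)[::-1][:2]           # two largest eigenvalues
coords = eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0))
print(coords)                                  # projections on two dimensions
```

Projections that agree across independent sets of judges, as the .99 correlations above did, indicate that the recovered dimensions reflect shared constructs rather than the idiosyncrasies of particular raters.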
The Constructs in People’s Heads 9 seven scales constructed for the PRF, which comprised before screening, from 100 to 140 items per scale. I evaluated the frequency with which each item was rniskeyed, that is, should have been included in the selected items ofanalternativescalebased on that item’s pattern of correlations with targeted and irrelevant scales.In no case did the number of rniskeyed items exceed a percentage of one tenth of 1Yo. Again, this illustration of human judgment in the item creation process supports the idea that the “constructs in people’sheads”show a correspondence withhow other people respondconsistently to construct-oriented personality items and scales. Resultssuchasthesegaveme the confidence to issue a challenge (Jackson, 1971). I suggested that the fruits of a couple of hours of work in preparing items by ordinary people be pitted against the most elaborate empirical item analyses using external criteria available, to determine which method yielded higher validties. Ashton and Goldberg (1973) accepted the challenge. They reported that journeyman item writers produced validties more than comparable with those obtained from California Psycholo Inventory (CPI) scales,whichareconsideredbymany to be the finest example of the empirical method of personality scale construction. In a replication of Ashton and Goldberg, I (Jackson, 1975) found that when given a clear scale defnition of the construct underlpg a scale, ~dividual undergraduate students produced scales y i e l h g three times the validity of CBI scales of Comparable length. Human judgment regardmg constructs of personality indeed warrant greater respect than accorded it by the radcal empiricists amongpast generations of psychologists.
This is a contentious issue in personality research. Historically, many researchers found little evidence supporting accuracy in person perception. I and my collaborators have approached this issue with a construct approach, one that requires that the information provided to judges, and the behavioral outcomes that they are predicting, have some clear, demonstrable relationship with a personality construct, and in addition, that traits and behaviors be linked in both inferential and empirical networks. We have often found it convenient and illuminating to study the accuracy of inferential judgments of personality by employing personality items (i.e., the statements that comprise personality scales). The advantage of using
personality items is that we know a great deal about their psychometric properties, particularly their correlation with the personality factor that they were designed to measure, as well as their correlations with irrelevant variance from other traits and from evaluative bias.

One study dealt with the focal issue of person perception, the accuracy of implicit personality theory (Bruner & Tagiuri, 1954). A number of prominent authors, including Bandura (1969), Mischel (1968), Schneider (1973), and Shweder (1975), argued that inferential trait relationships were based on semantic similarity and, accordingly, were "illusory" with little or no validity. Our research was a response to findings reported by Mirels (1976), which offered empirical results that seemed to call into question the validity of implicit personality theory and support the view that it was illusory. Mirels employed 21 items from the PRF, one from each content scale, and asked judges to rate the conditional probability of pairs of these items, by asking a question like "If a person answers True to Item A, what is the probability that he would answer True to Item B?" Mirels found that conditional probabilities correlated only weakly with actual empirical co-endorsements between the items based on the self-descriptive responses of persons to the PRF items. Mirels referred to "semantic inferential illusions" in his conclusion. We thought that humans do not think in terms of conditional probabilities between items of behavior, but, rather, in terms of constructs. Accordingly, Jackson, Chan, and Stricker (1979) employed correlation coefficients rather than conditional probabilities, because we believed that people think about implicative relationships in terms of distances, not conditional probabilities. Furthermore, we viewed the empirical relationships between endorsements of individual items as exemplars of underlying constructs. This encouraged us to consider not only the judged and empirical relationships between individual items, but relationships between underlying scales, and between items and scales as well. Thus, we analyzed the correlation between the scales underlying the pairs of items for which judges estimated co-endorsement, as well as the correlations between an item and the scale represented by the item with which it was paired. These comparisons yielded an average correlation of .81 between judged co-endorsement and indices of actual trait relationships. Given that the items in a pair were never from the same scale (thus requiring judges to have accurate knowledge of the implicative distances between scales) and the acknowledged construct underrepresentation of a single item, we interpreted these findings as strong evidence that the "constructs in people's heads" mirror more than mere illusion.
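The index at issue, agreement between judged implicative relationships and empirical co-endorsement, can be sketched as follows. All data here are simulated; in particular, the "judged" matrix is generated as noisy truth purely to show how the agreement coefficient is computed, whereas the actual studies used human raters.

```python
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 400, 21

# Simulated binary endorsements driven by two latent traits (stand-ins for
# personality constructs); real data would come from actual protocols.
loadings = rng.choice([0, 1], size=(n_items, 2))
latent = rng.normal(size=(n_persons, 2))
endorse = (latent @ loadings.T + rng.normal(size=(n_persons, n_items))) > 0

# Empirical co-endorsement index: phi correlations between item pairs.
empirical = np.corrcoef(endorse.T.astype(float))

# Hypothetical judged implicative relationships: here simply the empirical
# values plus judgment noise, to show the comparison being made.
judged = empirical + rng.normal(scale=0.15, size=empirical.shape)

iu = np.triu_indices(n_items, k=1)
agreement = np.corrcoef(judged[iu], empirical[iu])[0, 1]
print(f"judged vs. empirical agreement: {agreement:.2f}")
```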
There is a widespread view, held particularly by social psychologists, attributing the observed consistencies in judging other people to a lawful process reflecting similarity in the meanings of trait names, rather than knowledge derived from the actual perception of personality. Paunonen (Paunonen & Jackson, 1979) addressed this question by developing a nonverbal representation of 17 scales of the PRF. Figure 1.1 contains two samples: the first (A) was designed to reflect Thrillseeking in the central character, and the second (B), Nurturance.

FIG. 1.1. Sample nonverbal items depicting thrillseeking behavior and nurturant behavior.

Paunonen identified, using multivariate procedures, … profiles of respondents in a set of 796 profiles. The first was marked by … nonverbal items. Paunonen systematically varied verbal and nonverbal descriptions, had judges predict behavior using either a verbal or a nonverbal medium, and determined whether there were
differences between the results from verbal or nonverbal informational or rating conditions. The import of these findings is that judges could quite accurately infer the patterning of the entire set of 17 PRF scales with limited information, and that this accuracy did not depend crucially on the use of verbal trait names or descriptors. Finally, Paunonen conducted a principal components factor analysis that showed that more than 88% of the variance in the judgments was attributable to personality content, about 8% to the use of verbal versus nonverbal rating scales, and less than 3% to the verbal or nonverbal medium for providing the personality information. Clearly, the hypothesis that the structure of implicative personality relationships reflects "nothing but" the structure of semantic relationships is not supported in the Paunonen data. Neither is the view that semantic overlap accounts for the major portion of the variance. Rather, the most
plausible interpretation is that the "constructs in people's heads" and their perceived interrelationships are responsible for what has proven to be a surprisingly accurate representation of the organization of traits in other people, which translates into accurate predictions regarding their behavior. Subsequently, MacLennan (MacLennan & Jackson, 1985), using similar nonverbal materials, demonstrated consistent developmental trends in accuracy in the perception of personality in four groups of judges ranging in age from 5 years to 22 years.
Reed and Jackson (1975) employed a construct approach in the study of clinical judgments of psychopathology. We provided the judges with a brief description of three persons, each representing a different form of psychopathology: clinical depression, psychopathy, and paranoia. We did not try to pull these descriptions out of thin air, nor did we base them on impressionistic clinical experience. Rather, each description was based on a replicated syndrome of psychopathology that was derived from multivariate classification analyses of several hundred psychiatric patients. The following is a description of the psychopathy target:
Jack Cole has been arrested several times for theft. Usually his crimes have been poorly planned and rather reckless. He says that he does not feel bad about his behavior and often explains his stealing something by simply saying he wanted it. In interviews Jack frequently mentions his strong dislike for rules and discipline, and he seldom speaks of friends.

Consistent with our multivariate findings, the Jack Cole psychopathology description was based on construct-oriented psychopathology scales for Desocialization, Impulsivity, Rebelliousness, and Socially Deviant Attitudes. Judges were assigned the task of rating the probability that the given person would endorse each of a set of 52 heterogeneous psychopathology items. When judges were assigned randomly to two groups and the correlation computed between their aggregated judgments, inter-judge reliabilities were high: .97, .99, .99, and .99, respectively, for the three clinical targets and for a fourth nonclinical control target. Reed also correlated the judgments of each individual rater with the group consensus judgment. These covered a range of values; in the case of the Jack Cole target the values ranged from slightly negative to .89, with a median value
of …, indicating wide individual differences in this skill. Furthermore, there was strong evidence that judgmental accuracy generalized across different targets. Our results suggested that this type of task could serve as a measure of clinical judgment of psychopathology, although, of course, an ideal measure would sample psychopathological types much more broadly than the three that we employed. In a follow-up unpublished study, Reed (1976) demonstrated that judgments of the responses of designated psychopathological targets corresponded strongly with the actual responses of psychiatric patients belonging to the same syndromes as the targets described in the brief sketches. For example, in the case of the Jack Cole target, Reed found that when content judgments were combined with information from judgments of base rates and of desirable responding, a multiple correlation of .95 was obtained.

It is interesting to speculate as to why the model employed by Reed yielded a high degree of judgmental consensus, in contrast to much previous research that found little or no consensus in clinical judgment. I submit that what distinguishes Reed's research from that of others is that he employed constructs, as well as exemplars in the form of personality items that have been linked empirically and substantively to those constructs. Furthermore, the judged targets represented empirically derived syndromes that showed a lawful, replicated organization. When provided relevant information, ordinary people do have insight into the organization, not only of normal personality types, but also of psychopathological syndromes. The problem lies not with the judges, but with the construct-poor materials that they have been asked to judge in many studies. When given information bearing on constructs, they can and do show high levels of judgmental accuracy, even to the point that professionals might benefit from studying the structure that emerges from the judgments of ordinary people, particularly when they are aggregated in a designed study or in systematic analyses of implicit theories of psychopathology.
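Reed's inter-judge reliabilities follow a simple recipe: randomly split the judges into two groups, average each group's probability ratings over the 52 items, and correlate the two aggregated profiles. A minimal sketch with invented ratings:

```python
import numpy as np

rng = np.random.default_rng(2)
n_judges, n_items = 40, 52
consensus = rng.uniform(0.1, 0.9, size=n_items)      # latent group consensus
ratings = np.clip(consensus + rng.normal(scale=0.2,
                  size=(n_judges, n_items)), 0, 1)

half = rng.permutation(n_judges)
a = ratings[half[:n_judges // 2]].mean(axis=0)        # aggregated group A
b = ratings[half[n_judges // 2:]].mean(axis=0)        # aggregated group B
print(f"split-half inter-judge reliability: {np.corrcoef(a, b)[0, 1]:.2f}")
```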
If one considers the kinds of information that can be gleaned from the employment interview, one is likely to recognize that information about personality is salient and is usually the most sought after. However, the great majority of the studies examining the employment interview sidestep the issue of personality and its relation to job performance, whether this relation is examined empirically or in terms of the constructs employed by interviewers. My collaborators and I have undertaken a number of experimental studies examining in some detail the links between personality and perceived job demands. These studies have focused on the degree of congruence between the personality of the applicant and the personality associated with a particular occupation. We have undertaken studies with a wide variety of occupations examining the relative influence on judgments of candidate suitability of personality congruence and other variables, for example, work experience, education, the desirability of self-referent statements made during the interview, and letters of recommendation. We have presented interview information in a variety of formats to judges, including printed transcripts, tape recordings, and videotapes of simulated interviews. We were aided considerably in these ventures by the results of a study (Siess & Jackson, 1970) that permitted us to derive personality profiles for different clusters of occupations, and by having hundreds of personality items that had known relationships with different personality constructs, items that could be inserted smoothly into simulated interview transcripts as self-referent statements made by applicants.

An example of such a study would be useful. In one experiment from the Jackson, Peacock, and Smith (1980) study, judges evaluated candidates for one of four jobs (accountant, advertiser, industrial supervisor, and orchestra librarian) based on the Siess and Jackson findings. Each job had a distinctive personality profile. For example, the accountant occupation was marked by high Order and Cognitive Structure, and low Impulsivity, Change, and Autonomy. The advertiser occupation was defined by an opposite pattern. The job of industrial supervisor was marked by high Dominance scores, whereas the mirror-opposite orchestra librarian was marked by high scores on Harmavoidance and Abasement. The personality information was embedded into transcripts by inserting statements reflecting congruent or incongruent traits. Suitability and predicted job performance ratings of candidates were powerfully influenced by personality congruence, so much so that they washed out the effects of prior work experience. When personality information was congruent, judges were very much more likely to find the applicant suitable and to expect superior job performance. In another of the studies reported by Jackson et al. (1980), the desirability of self-reference statements was varied systematically,
independent of personality congruence, for several different occupations. In general, personality congruence had a considerably stronger effect than did the desirability of the statements made by job candidates. But there was a notable exception. Siess and Jackson had found that the job of guidance counselor was defined by desirability responding, as well as by personality traits such as Dominance, Exhibition, and Nurturance. For the counselor job an interaction emerged, indicating that if the self-presentation skills were seen as deficient by judges, high personality congruence was insufficient to elicit high suitability and job performance ratings. The import of these findings concerning personality in the employment interview is that ordinary people are sensitive to the nuances of the personality requirements of different jobs, possessing "constructs in their heads" that not only permit inferences about the relationships between personality characteristics, but accurate knowledge of the network of relationships between personality and work behavior as well, even differentiating jobs that require high-order self-presentation skills, and affording such skills greater weight when appropriate.
I have attempted to show in brief outlines of a few illustrative studies how Sam Messick's introducing me to powerful judgmental methods and affording me the opportunity to work with him on construct validation problems had ramifications that neither one of us fully realized at the outset. By thinking in terms of personality constructs rather than isolated facts or simple empirical findings, one can hope for progress. My message is thus consistent with T. H. Huxley's famous dictum that those who do not go beyond fact rarely get as far as fact. The use of a construct approach, even in some quite applied areas of psychology, like personality assessment, person perception, clinical judgment, and the employment interview, will contribute to an understanding of certain of the processes underlying these endeavors, and will, perhaps, contribute to the betterment of the human condition.
References

Ashton, S. G., & Goldberg, L. R. (1973). In response to Jackson's challenge: The comparative validity of personality scales constructed by the external (empirical) strategy and scales developed intuitively by experts, novices, and laymen. Journal of Research in Personality, 7, 1-20.
Bandura, A. (1969). Principles of behavior modification. New York: Holt, Rinehart & Winston.
Bruner, J. S., & Tagiuri, R. (1954). The perception of people. In G. Lindzey (Ed.), Handbook of social psychology (Vol. 2). Cambridge, MA: Addison-Wesley.
Chan, D. W., & Jackson, D. N. (1979). Implicit theory of psychopathology. Multivariate Behavioral Research, 14, 3-19.
Jackson, D. N. (1971). The dynamics of structured personality tests. Psychological Review, 78, 229-248.
Jackson, D. N. (1975). The relative validity of scales prepared by naive item writers and those based on empirical methods of personality scale construction. Educational and Psychological Measurement, 35, 361-370.
Jackson, D. N., Chan, D. W., & Stricker, L. J. (1979). Implicit personality theory: Is it illusory? Journal of Personality, 47, 1-10.
Jackson, D. N., Messick, S. J., & Solley, C. M. (1957). A multidimensional scaling approach to the perception of personality. The Journal of Psychology, 44, 311-318.
Jackson, D. N., Peacock, A. C., & Smith, J. P. (1980). Impressions of personality in the employment interview. Journal of Personality and Social Psychology, 39, 294-307.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694. Reprinted in D. N. Jackson & S. Messick (Eds.) (1967), Problems in human assessment. New York: McGraw-Hill.
MacLennan, R. N., & Jackson, D. N. (1985). Accuracy and consistency in the development of social perception. Developmental Psychology, 21, 30-36.
Meehl, P. E. (1945). The dynamics of "structured" personality tests. Journal of Clinical Psychology, 1, 296-303.
Mirels, H. L. (1976). Implicit personality theory and inferential illusions. Journal of Personality, 44, 467-487.
Mischel, W. (1968). Personality and assessment. New York: Wiley.
Paunonen, S. V., & Jackson, D. N. (1979). Nonverbal trait inference. Journal of Personality and Social Psychology, 37, 1645-1659.
Reed, P. L. (1976). Assessing inferential accuracy in clinical judgment and person perception (Doctoral dissertation, University of Western Ontario, London, Canada). Dissertation Abstracts International, 1977, 37, 5333B.
Reed, P. L., & Jackson, D. N. (1975). Clinical judgment of psychopathology: A model for inferential accuracy. Journal of Abnormal Psychology, 84, 475-482.
Schneider, D. J. (1973). Implicit personality theory: A review. Psychological Bulletin, 79, 294-309.
Sechrest, L., & Jackson, D. N. (1961). Social intelligence and accuracy of interpersonal predictions. Journal of Personality, 29, 167-182.
Shreeve, J. (1996). The Neandertal enigma: Solving the mystery of human origins. New York: Avon Books.
Shweder, R. A. (1975). How relevant is an individual difference theory of personality? Journal of Personality, 43, 455-484.
Siess, T. F., & Jackson, D. N. (1970). Vocational interests and personality: An empirical integration. Journal of Counseling Psychology, 17, 27-35.
2

The Questionnaire Construction of Personality: Pragmatics of Personality Assessment

Willem K. B. Hofstee
This chapter treats personality questionnaires as a means to communicate about personality. From this perspective, a number of recommendations are listed for state-of-the-art personality assessment. Item formulations should avoid negations, conditionals (specifically, counterfactuals), adjectival phrasings (which result in paradoxes), and conspirational language. The questionnaire should not contain repetitive items, stylistic variation, and flippant instructions to the assessor. Averaging over multiple assessors is a condition for proper assessments. With respect to the scoring and scaling of results, the use of principal component analysis, correction for acquiescence, anchored scores, and natural confidence intervals are recommended. In discussing applications, I argue that the use of personality questionnaires in institutional settings is problematic, but that in individual settings self-management is served by designs attending to personality-environment fit.
In most research and applications, personality comes from answers to questions of what a person is like. The basic script involves three roles: an investigator, who asks the questions; an assessor, who gives the answers; and the principal character or target person, whom the questions are about.
In the compact version of the script, where people are asked to assess themselves, the latter two roles are played by one and the same person; in other words, the questions are asked in the second person singular ("What are you like?") rather than the third ("What is he or she like?"), although the actual grammatical phrasing may vary. In the full-size version, the investigator would command a sufficient number of independent assessors per target to attain an elementary level of normal-scientific precision (Hofstee, 1994). Conversely, in a subcompact version, an individual would ask questions about himself or herself, therefore, in the first person singular ("What am I like" or even "Who am I"); this philosophical rather than psychological script is not considered here.

Two developments have extended a beginning of scientific respectability to the questionnaire construction of personality. First, research in behavior genetics has established heritability coefficients in the order of .4 for traits assessed by questionnaire (Loehlin, 1992). This figure is in the same order of magnitude as the generalizability of a questionnaire score over time, item samplings, and assessors (Hofstee, 1994). Thus the modest generalizability raises an argument a fortiori: As one increases the generalizability of questionnaire scores, the heredity coefficients would approach unity. Confirmation of this argument is found in a study by Riemann, Angleitner, and Strelau (1997) using two assessors per target. The heredity coefficients rose to the upper .60's, which is what would have been expected under the hypothesis of perfect heritability of the true score. Second, questionnaire investigators are taking steps toward getting their conceptual book-keeping in order. I feel less and less inclined (see also Hofstee, 1997; Hofstee, Ten Berge, & Hendriks, 1998) to take sides in arguments on numbers like the big 5, the giant 3, or other numerological entities. But I do find progress in the exploitation of the lexical hypothesis (which holds that novel trait concepts can generally be represented in ordinary language) and, especially, of the well-tried Eckart-Young Theorem, which provides us with a sequence of principal components of personality that follow the law of diminishing returns.

Still, questionnaire research has a reputation of being quick and dirty. There is the story of the young psychologist who followed his wife to a distant post and mailed a question to his Rabbi: "What can I do by way of research in the middle of nowhere"; he received an immediate reply: "Good question, my son; all you need is to keep asking them". More seriously, however, there is nothing unscientific about asking questions of others.
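The a fortiori argument is ordinary correction for attenuation. As a sketch, treating the generalizability coefficient $g$ of a questionnaire score as a reliability coefficient and assuming parallel assessors:

$$
h^2_{\mathrm{true}} = \frac{h^2_{\mathrm{obs}}}{g},
\qquad
g_k = \frac{k\,g_1}{1 + (k - 1)\,g_1}
$$

With $h^2_{\mathrm{obs}} \approx .4$ and $g_1$ of the same order, $h^2_{\mathrm{true}}$ approaches unity; and under perfect true-score heritability, averaging two assessors should push the observed coefficient toward $g_2 = 2g_1/(1 + g_1)$, which for $g_1$ around .5 lands in the upper .60's, consistent with the Riemann, Angleitner, and Strelau result.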
Investigators may mostly prefer to find out by themselves, through experimentation and objective measurement. At some point, one may thus be able to predict individual differences in extraversion, agreeableness, conscientiousness, emotional stability, and the like by genetic diagnosis. But in the meantime (which may last longer than some would expect) there is a contribution to be made to accurate assessment; even the progress of genetic research on personality depends on it. However, complacency about the questionnaire construction of personality is not in order. I summarize some recommended further improvements in the interest of an optimal performance of the questionnaire script. Certain of these are quite technical, but the general point of view is discourse-analytic: I present questionnaires as a way of communicating with people, about people. Finally, I discuss questionnaire applications in various contexts.
Administering questionnaires had better not be viewed as a form of objective measurement; rather, it is a way of communicating with the assessor (not with the target person). The investigator depends on the assessor for his or her information, for better or for worse, if one wishes. Once this simple point is accepted, it serves as a powerful organizer of do's and don'ts in questionnaire construction and administration. Generally speaking, the investigator's questions had better make sense to the assessor. Note that this recommendation is specific, obvious though it may be: If one measures a subject's brain mass as an indicator of his or her intelligence, the measurement does not depend on whether it makes sense to the person. Preoccupation with measurement (as distinct from assessment) may well be the reason why elementary rules of sensible communication are frequently violated. Illustrations of item formulation that will hinder the process of communication between the investigator and the assessor are the following:
The technical problem with negatively formulated items like "Is not easily frustrated" is that a denial may logically have two quite different meanings, namely, the absence of a trait or its opposite. The communicative problem
is that the question may drive people mad. In the event that the target person happens to be quite Stable and Agreeable, the assessor may be able to muster sufficient intellectual sophistication to read "not easily frustrated" as a litotes (Oxford: expressing of an affirmative by the negative of its contrary). The answer would go like "Yes, you might say that he/she is indeed not easily frustrated, if you insist on using an understatement." If, however, the person happens to be Easily Frustrated, the expected response from the assessor is "No, it is not at all so that this person is not easily frustrated," thus, a reversal of a litotes, or a quadratic understatement. In our experience, even thorough-bred intellectuals may find this too much of a good thing, and wonder why the question could not be posed in a normal way.

Note that even affirmative items carry the problem to some extent. If the item is phrased as "Is easily frustrated," and the target person is quite Stable and Agreeable, the assessor is still supposed to respond with an understatement. The problem may be met by defining the other pole of the trait in an affirmative manner, for example, "Keeps his/her cool"; the assessor would be asked which of the two expressions would apply more. In many cases, however, the expressions would both have partly a specific meaning, so that the assessor would have difficulty seeing them as opposites.
Example: "Helps doing the dishes"; in general: "trait conditional upon situation." Conditionals may follow from an interactionistic paradigm where traits like helpfulness are indeed supposed to be tied to specific situations. (One may wonder, of course, where interactionism ends: Should we distinguish as separate traits helpfulness with respect to dish-washing in the morning versus evening, conditional on using a detergent, toward males versus females, or not?) More often, however, conditionals seem to come from an investigator's lack of trust in the ability of the assessor to perform an aggregation operation ("does the target show helping behavior in general, more or less than others?"). The idea is to use the respondent as a vicarious observer rather than assessor; Buss and Craik's (1983) act frequency approach embodies that idea in the purest form. However, people assess all the time, and hardly ever observe in a detached scientific manner; the investigator might as well try to capitalize on that expertise.
A special problem with conditionals is that they may be counterfactual, as can be illustrated by the above example: If the household contains a dishwasher, the item would read like "Would he/she help doing the dishes, if not for the fact that there is a dishwasher?". The case arises when a work attitude scale is tested (counterfactually) on a student sample. The most notorious example is the opinion pollster's question: "If elections were held today, for whom would you vote?" (for a discussion, see Hofstee & Schaapman, 1990). Assessors may find it fun to engage in speculation, but that is hardly what the investigator intended them to do.
Many questionnaire items revolve around a trait-descriptive adjective. This is not the way talking about people happens in everyday life. De Raad found that personalities are discussed in terms of behaviors denoted by verbs; trait adjectives are seldom used. In word counts of ordinary language their frequencies are among the lowest. Trait adjectives may be found in ceremonial and official discourse like eulogies, letters of reference and, most notably, psychological reports. In such communications, they may serve the purpose of diplomatic vagueness. However, that is hardly what an investigator wants from assessors. In past years, we have spent much time and effort in developing a large pool of brief concrete sentences for personality assessment. Whereas they appear to fit in the same multidimensional space as trait adjectives, both their communalities and their self-peer validities run higher than those of adjectives (Hendriks, 1997).

A special problem with many adjectives arises when they are used in self-report (Hofstee, 1990). One may describe a third person as Modest or Superficial, but self-description in such terms is paradoxical. Any claim that I would be a Modest person would only testify to my immodesty; if I declare myself to be a superficial character, my audience must presuppose a measure of profoundness on my part without which I would be incapable of finding myself Superficial, and that would be precisely the intention of my disclaimer. Thus to administer such items for self-report is to place the assessor in a classical double bind. Some might react like the bridge-player who cannot be misled because he or she lacks fantasy. However, they would hardly rate as ideal assessors.
To assess is to be in office; assessors of personality are not asked about their likes or dislikes, any more than peer reviewers are supposed to exercise their predilections. The investigator should thus avoid the impression of inviting a conspiracy at the expense of the target person. Colloquialisms, including fashionable barbarisms and other idiomatic language, invectives and positive emotional expressions, and sexisms, ethnocentrisms, and the like, interfere with the descriptive scenario. Questionnaires have to be impeccable.

Moving now from item formulation to the level of the questionnaire, illustrations of improper pragmatics at that level are:
A frequent complaint voiced by respondents to personality questionnaires is that their consistency is being checked. In a sense, that apprehension is correct: To attain reliability and construct validity, the investigator has to represent a construct in several guises. However, that psychometric principle is no excuse for repetitiveness. First, high redundancy indicates a "bloated-specific" (Cattell, 1964) construct; any trait that is worth assessing is broad enough to have many diverse manifestations. Second and most important in the present context, redundancy is a form of rudeness. In ordinary life, one would get slapped in the face when posing essentially the same question twice, let alone 10 or 20 times.

An elegant technical solution for the dilemma is available that is rarely used. It consists of principal-component scoring, so that every item obtains a nonzero weight for every construct scale. We (Hofstee et al., 1998) have shown that principal-component scoring is the logical consequence of selecting items on the basis of their item-scale correlation. (More generally, I cannot think of any excuse to use unit weights in any scoring key.) With this flexible and efficient procedure, items can be used that are blends of underlying constructs; for example, Cordiality is a blend of Extraversion and Agreeableness, and Determination blends Emotional Stability and Conscientiousness. That approach enhances the diversity of item content without introducing specific item variance, which would otherwise be needed to achieve content variety but is psychometrically counterproductive.
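A minimal sketch of this kind of principal-component scoring: every standardized item receives a nonzero weight on every scale, taken here from unrotated principal-component loadings. A real application would rotate the components and select them substantively; the data below are simulated.

```python
import numpy as np

rng = np.random.default_rng(3)
n_persons, n_items, n_comp = 300, 30, 2
latent = rng.normal(size=(n_persons, n_comp))       # underlying constructs
W = rng.normal(scale=0.8, size=(n_comp, n_items))   # every item blends both
X = latent @ W + rng.normal(size=(n_persons, n_items))
Z = (X - X.mean(0)) / X.std(0)                      # standardized items

R = np.corrcoef(Z.T)                                # item correlation matrix
eigval, eigvec = np.linalg.eigh(R)
idx = np.argsort(eigval)[::-1][:n_comp]
loadings = eigvec[:, idx] * np.sqrt(eigval[idx])

# A unit-weight key would zero out all but the keyed items; here every item
# carries a (generally nonzero) weight on every component scale.
scores = Z @ loadings
# Cross-correlations between component scores and the latent constructs
# (agreement up to rotation of the components):
print(np.round(np.corrcoef(scores.T, latent.T)[:n_comp, n_comp:], 2))
```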
uestionnaire Construction ofPerso euristic that is easily ~ p l e m e n t e dthrough w p r o ~ a m sis the following: In each r the sake of clarity, the item sho core word-and prevent it from o c c ~ mor ~ g
Angleitner, John, and Löhr (1986) provide a classification of questionnaire items. Clearly appropriate are overt behaviors ("I often go to parties") and covert behaviors that together make up overt dispositions. Other categories are symptoms ("I sweat a lot"), trait attributions ("I am a sociable person"), wishes and interests ("Sometimes I would really like to …"), biographical facts ("I had some trouble with the law when I was younger"), attitudes ("I …"), social effects ("At parties, I am the center of attention"), and bizarre items ("Someone is trying to poison me"). On the assumption that the aim is to obtain an assessment of a target's personality, only behavior items belong in the questionnaire. The other types hinder getting an accurate description of the target.
Perhaps the foremost example of poor manners of questionnaire administration is the standard instruction to the assessor: "Don't think too long about a question; there are no right or wrong answers." That is like the bridge player who says to his spouse: "Let me do the no-trumps, dear" (no-trump play being the difficult part of the game). The implication is that the thinking is going to be done by the investigator, that assessors cannot be expected to fathom the meaning of their own responses, and that the best assessors behave like a jar of liquid into which a litmus paper is dipped. In any serious conception of personality there are more and less right answers, and if anything, the questionnaire instruction should acknowledge the inevitable uncertainty. Note that this shift toward stressing accuracy involves a departure from the traditional treatment of questionnaires by trait psychologists, from the social-perception paradigm, which locates personality in the eye of the beholder, and from the clinical paradigm, according to which assessors cannot know what they are talking about. These paradigms are unfounded in view of the behavior-genetic evidence as discussed previously; here, the emphasis is on their aversive contribution to the public relations of personality psychology.
Among the most misleading expressions in the psychological jargon is personality test (for a questionnaire), carrying a suggestion of objectiveness that is largely unwarranted. Administering a self-report questionnaire to a number of persons, for example, is quite different from administering and scoring an intelligence test. Each person gets the same intelligence test and key, but each responder to a questionnaire is scored by a different assessor, namely, that person himself or herself. Even if one would opt for a test sampling design and administer a different intelligence test to each person, the comparability among persons would still be higher: total scores on IQ batteries may be expected to correlate above .8, whereas the ceiling for different assessors is about .6. So the usual questionnaire administration is akin to having each applicant select a different IQ-test to his or her liking, whose generalizabilities would be far below standard. Needless to say, internal consistency coefficients of questionnaire scales, which may be quite high, have nothing to do with this argument (to the extent that a questionnaire's alpha is less than unity, it only detracts further from its generalizability).

A minimal condition to bring personality assessment up to elementary standards is aggregation: averaging over a number of independent judges. Assessments will not thereby be objective, but a sufficient degree of intersubjectivity can be attained. The argument is not specifically directed against self-assessment: The person himself or herself, as an assessor, has certain strengths and weaknesses (Hofstee, 1994), and it is an empirical question whether self-assessments are more or less representative and valid than other-assessments. The cardinal point is that self-reports have the insurmountable handicap of being single by definition. So at the very least, they have to be supplemented by other-assessments.

The aggregation argument is primarily technical; its formalization follows the law of large numbers and the function derived independently by Spearman (1910) and Brown (1910). However, it has profound discursive implications. The assessor is approached as an exchangeable element of a
set of experts. That position of exchangeability shows resemblance to being a subject of research. So on the one hand, assessors are coworkers of the investigator, who depends on them for their privileged information; on the other, however, they are not approached for their personal point of view; they should realize that any unique variance in their assessments is part of the error term. (The same holds for other forms of assessment and evaluation, most notably, peer review.) In our Five-Factor Personality Inventory (FFPI; Hendriks, 1997) we drive home the point by maintaining a third-person formulation even in the case of self-report. Another implementation is to instruct assessors to predict the average assessment, and to reward them to the extent that they succeed (Hofstee, 1995).
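The aggregation argument runs on the Spearman-Brown function just cited. A sketch of the arithmetic, assuming a single-assessor generalizability of .4, the order of magnitude mentioned earlier:

```python
def spearman_brown(r1: float, k: int) -> float:
    """Generalizability of the average of k parallel assessors."""
    return k * r1 / (1 + (k - 1) * r1)

for k in (1, 2, 4, 8):
    print(k, round(spearman_brown(0.4, k), 2))
# 1 judge: 0.4; 2: 0.57; 4: 0.73; 8: 0.84.
# A single self-report, by definition k = 1, stays at the floor.
```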
Communication of questionnaire results is an integral part of personality assessment. Recently, we (Hofstee et al., 1998; Hofstee, Ten Berge, & Snijders, 1997) have applied ourselves to aspects of it. This section contains a restatement of some results.
Almost all personality constructs, when scored in the socially desirable direction, have a negatively skewed raw-score distribution; mean scores are above the neutral midpoint (e.g., 3 on a scale from 1 to 5) of the scale. Customary procedures of relative scaling therefore involve moving the midpoint. That would be inconsequential with unipolar scales, where any midpoint would not have a natural meaning. With the bipolar scales of personality, however, the midpoint is where a trait shifts into reverse gear. Moving the midpoint is thus a manoeuvre that can pretty well ruin the construct. Less metaphorically: A person whose Conscientiousness or Emotional Stability would be assessed as somewhat above the scale midpoint would obtain a somewhat negative score on the relative scale. That is not what the assessor had in mind. Other cases suffer from the rough treatment as well, though less dramatically: A somewhat Sloppy person, for example, would be reported to be clearly Sloppy. A radical solution to this problem is to resort to absolute scores. Hofstee and others (1998) present the appropriate procedures for the use
of absolute scores, in particular, a version … component and little else. … mathematically different from the one that … principal component analysis. We have … the factor scores of a person who would …
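The midpoint problem is easy to show numerically. In this sketch all values are invented: the assessor places a target somewhat above the neutral midpoint of a 1-to-5 scale, yet norm-referenced scaling reports the same person as below average, because the raw-score distribution is skewed toward the desirable pole.

```python
raw, neutral = 3.4, 3.0            # assessor's rating, bipolar scale midpoint
sample_mean, sample_sd = 3.8, 0.5  # skewed norms: most targets score high

absolute = raw - neutral                      # +0.4: somewhat conscientious
relative_z = (raw - sample_mean) / sample_sd  # -0.8: "below average"
print(f"absolute: {absolute:+.1f}, relative z: {relative_z:+.1f}")
# The sign flips: relative scaling has moved the midpoint.
```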
… and repetitive (see the previous discussion), so acquiescence has to be spotted in an empirical way. For the investigator, however, acquiescence is a problem: differential acquiescence makes the correlations between items more positive than would otherwise be found. Hofstee, Ten Berge, and Hendriks (1998) present solutions to this problem, based on pairing items with their opposites. For a complete solution, one should at least demonstrate that all variants lead to the same end result. Two misunderstandings deserve mention. One is the confusion between socially desirable responding and the acquiescence response set, which we have encountered repeatedly. With unidirectional scales consisting of positive items only, the two cannot be distinguished, which is probably the reason for the confusion. However, with bidirectional scales, acquiescence is the tendency to endorse both positive and negative items, whereas social desirability would involve endorsing positive items and rejecting undesirable items. Another misunderstanding is that acquiescence is a minor artifact (e.g., Nunnally, 1967, pp. 611-612). We found acquiescence variance to be comparable in size to the third principal component in heterogeneous sets of questionnaire items (Hofstee et al., 1998). At the level of pragmatics, I subscribe to the reservation that correcting for response artifacts like acquiescence is a form of deception of the assessor. However, the objection could be met by informing the assessors: Part of the communication between the investigator and the assessors is an unsentimental discussion of their weaknesses as well as their strengths.
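The bidirectional-scale logic can be sketched numerically: given an item and its keyed opposite, the half-difference isolates the trait (acquiescence cancels) while the half-sum isolates acquiescence (the trait cancels). The data and scoring conventions below are assumed for illustration, not taken from the FFPI:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
trait = rng.normal(size=n)            # latent trait
acq = rng.normal(scale=0.5, size=n)   # individual acquiescence tendency

pos = trait + acq + rng.normal(scale=0.5, size=n)   # positively keyed item
neg = -trait + acq + rng.normal(scale=0.5, size=n)  # its keyed opposite

trait_score = (pos - neg) / 2         # acquiescence cancels out
acq_score = (pos + neg) / 2           # trait cancels out
print(round(np.corrcoef(trait_score, trait)[0, 1], 2),
      round(np.corrcoef(acq_score, acq)[0, 1], 2))
```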
The reporting of statistics, including questionnaire scores, in the shape of point estimates is crude and potentially misleading (as it invites the reader to underestimate the margin of error). Using confidence or credibility intervals would be a step forward. Hofstee and others (1997) have developed natural confidence intervals that do not require an arbitrary decision with respect to the proportion of the distribution to be covered (50%, 90%, and so on). In our proposal, the distribution of a target person's true score is compared with a
background distribution, for example, the prior distribution of true scores in a relevant population or subpopulation. The natural confidence interval is located between the intersection points of the target and background distributions. Thus, the reporting of a score interval is constructed as a bet between the investigator and an opponent who would, in the example, find the individual assessment uninformative: The natural interval is where the investigator's probability is higher; therefore, it is the interval on which the investigator places his or her stakes. A spectacular property of this natural confidence interval is that its midpoint is the target's observed score rather than the true score as in classical test theory.
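A rough numerical sketch of that property, assuming (as a simplification) normal prior and posterior distributions in standardized units and using invented values for the reliability and the observed score (Python, with scipy assumed available):

    import numpy as np
    from scipy.stats import norm

    rho2, x_obs = 0.8, 1.0   # illustrative reliability and observed score

    # Target: posterior true-score distribution given the observed score.
    target = norm(loc=rho2 * x_obs, scale=np.sqrt(rho2 * (1 - rho2)))
    # Background: prior distribution of true scores in the population.
    background = norm(loc=0.0, scale=np.sqrt(rho2))

    # The natural interval is the region where the target density exceeds
    # the background density: the investigator's side of the bet.
    t = np.linspace(-4, 4, 80001)
    inside = target.pdf(t) > background.pdf(t)
    lo, hi = t[inside].min(), t[inside].max()

    print(round(lo, 2), round(hi, 2), round((lo + hi) / 2, 2))
    # The midpoint is the observed score (1.0), not the shrunken
    # true-score estimate (0.8).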
With respect to questionnaire application, a sharp distinction should be made between individual and institutional decision contexts. Generally speaking, assessments of personality have their place in individual rather than institutional contexts.
Institutional Contexts

Institutional application contexts like personnel selection and student admission are basically contexts of maximum performance, whereas personality summarizes typical behavior. Thus the pragmatics of selection and admission situations constitute a formidable obstacle to personality description. Consequently, answers to the question of what a person is like can only be ambiguous in that setting.

According to a widespread but primitive notion, the answer to an item like "Do you tend to keep your appointments?" would come about as follows: The self-assessor makes up his or her mind on the basis of past behavior, and subsequently engages in self-enhancement in the socially desirable direction; thus the answer confounds individual differences in punctuality and faking good. This notion has given rise to attempts to control for social desirability. The failure of that approach was documented in the early 1960s by Dicken (1963) and, most recently, by Ones, Viswesvaran, and Reiss (1996).

In a more subtle conception, the answers are programmatic. After all, from the point of view of the applicant, the institution cannot legitimately be interested in his or her past behavior; the task is to preview applicants'
future comportment. First, applicants will correctly realize that behavior is subject to situational effects: For example, employees tend to behave in a more socially desirable manner than students. Second, selection settings carry an element of bargaining. A notoriously lazy person may sincerely promise to be punctual, through his or her answers to a questionnaire; the weaker the bargaining position of the applicant, the more likely it is that such promises are made. Tragically, the less self-insight the applicant has, and the more his or her ideas are influenced by voluntaristic theories on the changeability of personality, the more easily such promises are submitted. But there need not be any question of lying or faking.

Can instructions to the (self-)assessor be devised that would remove response ambiguity? The standard instruction to respond naively and spontaneously is so far removed from the pragmatics of the selection situation that it can only function as an insult to the applicant's intelligence. A realistic instruction would be the following: "For each item, it will be quite evident what the most socially desirable response option is. The more socially desirable your answers, the higher are your chances. Also, there is no way of checking them. Only, in the longer run you may have done yourself and others a disservice if you misrepresent yourself. Also, don't fool yourself into thinking that you could actually change your personality to become entirely socially desirable." That would also not help much: Cynical and short-sighted applicants would still be at an advantage. However, the instruction would comply with requirements of informed consent, and would be beneficial to the public relations of applied
psychology.

A coherent solution would consist of turning the personality questionnaire into a cognitive test. The instruction to the applicant would consist of an explicit challenge to find the most desirable response option. The criterion providing the scoring key might be found by asking key persons in the organisation what the most desirable option is, and averaging their answers. The test would measure empirical intelligence, the ability to predict a state of affairs, or, more specifically, psychological intelligence. The reader may wonder if such a thing exists, that is, if the test would be reliable and valid. However, quite comparable procedures (Hofstee, 1997), called tests of practical intelligence (Sternberg, Wagner, Williams, & Horvath, 1995), appear to do well in organizational settings. Also, the prediction task is not trivial, in the sense that the most socially desirable option would not simply be the same in all cases.
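A minimal sketch of that scoring idea (Python; the response scale, items, and all numbers are invented): the key is the average of the key persons' judgments of the most desirable option, and the applicant's score is his or her closeness to that key.

    import numpy as np

    # Key persons in the organisation judge, for each of 5 items, which
    # response option (1-5) is most desirable; the key is their average.
    experts = np.array([
        [5, 4, 1, 5, 2],
        [5, 5, 1, 4, 2],
        [4, 4, 2, 5, 1],
    ])
    key = experts.mean(axis=0)

    # An applicant, explicitly challenged to find the most desirable option:
    applicant = np.array([5, 4, 1, 5, 1])

    # Score as (negative) mean absolute deviation from the key.
    score = -np.mean(np.abs(applicant - key))
    print(round(score, 2))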
Recasting personality questionnaires as tests of psychological intelligence may seem to throw out the baby with the bathwater. With increasing recognition of the autonomy of the individual, however, I can see no future for personality assessment as such, by questionnaire or otherwise, in institutional selection contexts.
Individual Contexts

To end on a brighter note, much more use can be made of state-of-the-art personality assessment in individual decision contexts. Generally put, personality assessment might help people to avoid environments that do not fit their personality. I do not pretend to exhaust that issue, but present a few central considerations.

First, people do have personality traits. They can try to deny them, but acceptance of one's traits is not to be escaped in the long run. Within the bounds set by personality, there are wide margins of freedom: There is no reason to suppose deterministic, one-to-one personality-environment relations. When the bounds are exceeded, however, people may become quite miserable. Also, traits can be controlled to some extent for some time; still, it may be more efficient to invest in changing the situation than in self-control.

Second, personality traits along with abilities are more central to self-management than are interests, motivation, values, and attitudes, all of which are changeable by definition (to the extent they are not, they should be called traits). Still, most of the research on person-environment fit has dealt with these more ephemeral phenomena (for an overview, see Kristof, 1996). Another limitation of that research tradition is its focus on work environments. Personality-environment fit pertains also to, for example, educational, leisure-time, and partner environments.

In preparation for operationalizing personality-environment fit, the idea that environments have personalities that may or may not be fitting should probably be avoided. Such notions violate the principle of methodological individualism and divert the attention from the genotypic foundation of traits (one can attribute personality to animals and plants, but not to environments). Mainly, however, they invite comparisons between someone's perceptions of his or her own personality and the environment's. Such comparisons are seriously contaminated when the assessor is one and the same.
REFERENCES
Angleitner, A., John, O. P., & Löhr, F.-J. (1986). It's what you ask and how you ask it: An itemmetric analysis of personality questionnaires. In A. Angleitner & J. S. Wiggins (Eds.), Personality assessment via questionnaires (pp. 61-108). Berlin: Springer.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.
Buss, D. M., & Craik, K. H. (1983). The act frequency approach to personality. Psychological Review, 90, 105-126.
Cattell, R. B. (1964). The importance of factor-trueness and validity, versus homogeneity and orthogonality, in test scales. Educational and Psychological Measurement, 24, 3-30.
De Raad, B. (1985). Person-talk in everyday life: Pragmatics of utterances about personality. Unpublished doctoral dissertation, University of Groningen, Netherlands.
Dicken, C. (1963). Good impression, social desirability, and acquiescence as suppressor variables. Educational and Psychological Measurement, 23, 699-720.
Hendriks, A. A. J. (1997). The construction of the Five-Factor Personality Inventory (FFPI). Unpublished doctoral dissertation, University of Groningen, Netherlands.
Hofstee, W. K. B. (1990). The use of everyday personality language for scientific purposes. European Journal of Personality, 4, 77-88.
Hofstee, W. K. B. (1994). Who should own the definition of personality? European Journal of Personality, 8, 149-162.
Hofstee, W. K. B. (1995). Beoordelen: Wetenschap of kunst? [Assessment and evaluation: Science or art?]. Amsterdam: Royal Netherlands Academy of Sciences.
Hofstee, W. K. B. (1997, July 14-16). Personality and intelligence: Do they mix? Paper presented at The Second Spearman Seminar, Plymouth, UK.
Hofstee, W. K. B., & Schaapman, H. (1990). Bets beat polls: Averaged predictions of election outcomes. Acta Politica, 25, 257-270.
Hofstee, W. K. B., Ten Berge, J. M. F., & Hendriks, A. A. J. (1998). How to score questionnaires. Personality and Individual Differences, 25, 897-909.
Hofstee, W. K. B., Ten Berge, J. M. F., & Snijders, T. A. B. (1997). Natural confidence intervals for test scores and other quantities. Unpublished manuscript, University of Groningen, Netherlands.
Kristof, A. L. (1996). Person-organization fit: An integrative review of its conceptualizations, measurement, and implications. Personnel Psychology, 49, 1-49.
Loehlin, J. C. (1992). Genes and environment in personality development. Newbury Park, CA: Sage.
Nunnally, J. C. (1967). Psychometric theory. New York: McGraw-Hill.
Ones, D. S., Viswesvaran, C., & Reiss, A. D. (1996). The role of social desirability in personality testing for personnel selection: The red herring. Journal of Applied Psychology, 81, 660-679.
Riemann, R., Angleitner, A., & Strelau, J. (1997). Genetic and environmental influences on personality: A study of twins reared together using the self- and peer report NEO-FFI scales. Journal of Personality, 65, 449-475.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271-295.
Sternberg, R. J., Wagner, R. K., Williams, W. M., & Horvath, J. A. (1995). Testing common sense. American Psychologist, 50, 912-927.
The 1961 volume of the Annual Review of Psychology was the first to have a chapter devoted entirely to recent research on personality structure. It is interesting to note that the psychologist commissioned to write this review was Samuel Messick (1961), whom we are honoring at this symposium. Four years later, I wrote the third review on personality structure (Holtzman, 1965), a task made much simpler by Messick's outstanding scholarship and definitive work. My acquaintance with Messick goes back to his arrival with Doug Jackson in 1956 on the campus of the Menninger Foundation in Topeka, Kansas, where both of them were postdoctoral research fellows under Gardner Murphy in the field of personality and assessment. This meeting was the beginning of a close friendship and professional collaboration that has continued to this day. I was privileged at that time to be a research consultant who periodically visited the Foundation, where a vigorous, exciting research program had been developed under Murphy's leadership. It was a distinct honor and pleasure to be called back to ETS on the happy occasion of this conference.

At an APA symposium in 1963 commemorating the 25th anniversary of the classical publication, Explorations in Personality, by Henry Murray and his associates, I presented six major unresolved issues related to recurring dilemmas in personality assessment as follows: (1) the meaning of personality assessment, (2) how many things must be known about an individual to understand his personality, (3) how personality variance can be separated from method variance, (4) whether we are culture-bound in the theory and technique of assessment, (5) whether we can ever develop a systemic, comprehensive personality theory closely linked with empirical data, and (6) the moral dilemmas created by personality assessment (Holtzman, 1964). Today most of these issues continue to
persist, although in somewhat different form; one has faded from the scene, and new ones have taken its place.
The meaning of personality assessment continues to be an important, unresolved issue, depending on one's point of view. Personologists and many clinical psychologists continue to insist that one must know a great deal about an individual to truly understand his personality. Specific scale or factor scores are viewed as either irrelevant or too superficial to prove useful without a thorough analysis of an individual's motivations, aspirations, history, and life circumstances. Although some personality measures may have a limited use to confirm personal impressions or to provide supplementary information, they are generally dismissed as incidental to a deep understanding of personality.

Psychometrically oriented personality psychologists differ strongly with this idiographic point of view, insisting that a true science of personality must focus on measurement if any progress is to be made. The primary focus of scientific study should be the development of more reliable, valid measures of important aspects of personality that can then provide a stable framework within which to gain a deeper understanding of the individual. Many personality psychologists believe a stable framework is now at hand with the convergence of psychometric thinking on a hierarchical system dominated by the general Five factors. But such enthusiastic optimism is probably premature.

There are good reasons why the idiographic-versus-nomothetic debate that raged vigorously in the 1960s has apparently faded away. Psychology has become more fragmented and compartmentalized with the growth in numbers, the diversity of research, and the proliferation of specialized societies and scholarly journals. Consequently, the many thousands of psychologists in applied settings who engage daily in personality appraisals for practical reasons have little discourse with the academic, psychometrically oriented psychologists engaged in personality research. Each group has a different mission and set of priorities that rarely meet. The fact that the issue has faded from the radar screen only means that the two groups have little more to say to each other at present.
The different worlds of clinical assessment and psychometrically based personality assessment are once again beginning to collide and even show signs of collaboration. The most notable example is the long use of the MMPI in clinical assessment as well as in personality research. Because the purposes of the two worlds are quite different (diagnosing psychopathology versus assessing normal personality) and their methods differ as well (criterion-based methods versus rational construct-based methods), collaboration has been uneasy, difficult, and often doomed to failure. Several trends since the 1980s point to a more optimistic future for resolution of these issues.

First, the rapid rise of computer-based clinical assessment methods in diagnosis has forced clinicians to think differently about their practice. At the same time, these new computer methods have greatly increased the efficiency and power of those clinicians who have learned to use them wisely. Second, a growing number of younger, highly competent clinician-researchers and personality psychologists have a shared stake in both camps, challenging each other to develop hybrid methods and systems of assessment. And third, academic leaders in both fields are exchanging points of view, encouraging graduate students to examine their ideas and to undertake new lines of research.

Perhaps the most important new trend accelerating a change of outlook is the external marketplace, which is demanding among clinicians more accountability and value for the money from the thousands of practitioners whose livelihood depends on it. The old ways, of sprinkling a lot of projective-technique jargon into a standard case history and summary that justifies the proposed treatment or satisfies the therapist, are no longer acceptable to many managed-care payers. And in many situations, the clinical or counseling psychologist's uniquely valued contribution to the improvement of a client is being challenged in that same marketplace by social workers and other therapists whose services are available for a lower price. Therefore it is incumbent on those leading the way in clinical assessment to develop stronger, more efficient, more valid methods that capitalize on the psychologist's special training and role as a mental health service provider.
A good example of these emerging new developments is the special diagnostic area known as personality disorders. As clinicians increasingly rely on the widely accepted diagnostic standards that are periodically refined under leadership of the American Psychiatric Association, known as the Diagnostic and Statistical Manual of Mental Disorders or DSM-IV (APA, 1994), there is a compelling new opportunity for collaboration between clinicians and personality psychologists. One serious shortcoming of the revised DSM-IV has become apparent in the lack of widely accepted diagnostic categories and efficient techniques for the assessment of personality disorders. Realization of this problem has already stimulated a flurry of new research and development that will undoubtedly be expanded. Unlike any other area of psychodiagnosis, personality disorders are an obvious target for fruitful collaboration between clinicians and personality psychologists.
One old issue that heats up occasionally concerns the relative value of the empirical method of developing scales, by selecting items that correlate with an external criterion of special interest, as contrasted with the rational method of starting out with a theoretical construct and then developing items that are chosen or discarded according to the degree of homogeneity produced in the final scale. One side of this controversy argues that if you have a known criterion to be predicted, such as the accurate diagnosis of schizophrenia, the best approach is a strictly empirical one. The MMPI is an excellent example of a successfully developed personality inventory based on criterion-referenced items. The other side insists just as strongly that a science of personality has to be built upon theoretical constructs, where construct validity is the stern taskmaster that should dictate the method of scale construction. Then, once the scales are appropriately constructed, a sound theoretical framework can be developed for mapping the relationships between the known personality factors and other independently recognized characteristics or specific behavior of interest.

Unlike the unresolved issues between clinicians and academic personality psychologists, the empirically oriented advocates and the rational scale supporters share certain beliefs in common that make it far easier for them
to interact productively. First of all, they are dedicated to improving personality measurement through research based on sound psychometric methods. Second, they generally begin with objective, statistically based items in questionnaires, peer ratings, or observations of actual behavior. And third, they both work in research-based academic settings where similar-minded colleagues encourage productivity. Currently this debate is best represented by the Minnesota group, led by James Butcher and Yossef Ben-Porath (1994), and the Big Five group, led by Jerry Wiggins, Lewis Goldberg (1990), and Robert McCrae, who believe they have finally converged on five dimensions having universal significance. A notable trend among investigators in the 1990s has been the development of instruments that begin with rationally constructed scales that are then empirically refined. Among these are Jackson's (1989) Basic Personality Inventory and Morey's (1991) Personality Assessment Inventory. Another hybrid approach is the Millon Clinical Multiaxial Inventory (Millon, 1994), which was developed as a set of personality scales keyed to the personality disorders as defined within Axis II of the DSM manual. No doubt these hybrid approaches will be extended as the picture of personality disorders becomes clarified through clinical research that will eventually work its way into practice.
Broad personality traits or typologies have an alluring simplicity and pervasiveness that have attracted many leading personality psychologists. Usually they take the form of factors that can be repeatedly found in different settings and cultures, leading to a lofty status that commands reverence among their most dedicated advocates. On the other hand, specific, narrowly defined traits have the advantage of more direct linkage to observed behavior and to the common language of personality description that has evolved over many years of human interaction. The modern version of this debate arose in the nineteenth century, later flourishing in the
differing points of view of Guilford, Cattell, Vernon, and Eysenck. Today the focus of many personality psychologists on the so-called Big Five of Extraversion, Emotional Stability, Conscientiousness, Nurturance, and Inquiring Intellect can be contrasted with such specific scales as Spielberger's (1983) State-Trait Anxiety Inventory, the Beck Depression Inventory (Beck, Steer, & Garbin, 1988), or the Buss and Perry (1992) scales for measuring four aspects of aggression.

Nearly everyone now agrees that both general and specific traits have a rightful place in any comprehensive personality system. But what that place should be is still a continuing issue. A hierarchical system with three or four levels varying from highly specific to general traits is widely accepted, even if individual preferences may focus on only one or two components or levels. Whether a personality system and its components should be conceptualized entirely on the basis of a somewhat arbitrary, although widely accepted, method of factor analysis, however, remains to be seen. As Loevinger (1994, p. 6) has pointed out, "There is no reason to believe that the bedrock of personality is a set of orthogonal...factors, unless you think that nature is constrained to present us a world in rows and columns."
DIRECT VERSUS INDIRECT METHODS

The direct methods of asking someone questions dealing specifically with traits, or asking others to rate an individual on such traits, are the most popular approaches to personality assessment for obvious reasons. They have an appealing face validity, they are usually simple, straightforward methods, they are based on commonly understood language used to describe personality, and they are economical to employ, especially for large numbers of individuals. But they also suffer from ease of faking, response sets, variance due to the situation, and other extraneous factors that lower their validity in all too many situations. Considerable attention has been given to embedding disguised validity scales in the more sophisticated personality inventories such as the MMPI, but they still frequently fall short. And even more important, any comprehensive approach to personality must delve into enduring personal dispositions and stylistic behavior that reflect levels of organization and functioning beneath the surface of verbal expression or interpersonal behavior.
Personality "heory and Assessment 43 Indirect methods, by contrast, involve assessments where the meaning of the measurements is dispsed so that the individual hasn't the slightest idea how his responses will be interpreted. Such methods often involve asking an individual to perform a specific, standardized task such as using hs i m a ~ a t i o nin viewing meaningless colored inkblotsand reporting what he sees. The ThematicApperception Test, theRorschach or its more ~sychome~cally basedoffspring,theHoltzman Inkblot Technique,are favored by many clinicians as projective techniques for personality assessment because it is very difficult to fake one's responses or malinger when reacting to such ambiguous stirnull. "he most comprehensive studies of indirect approaches to personality assessment are still those of Raymond Cattell and hs associates that were done overforty yearsago.Cattell(1963)stated that overa thousand different behavioral measures were developed and studied in a number of factor-analytx, multivariate matrices during the 1940s and 1950s. Many of these measures were performance scores on ingenious tasks that survived as partof his revisedObjective-AnalyticPersonalityFactorBatteries 21 primarypersonalitydimensions designed to measurewhathecalled (Hundleby, Paw&, 8r Cattell, 1964). A basic difficulty with most of them, however, was the very low order of intercorrelation between these inhect, objective test scores and either behavior ratings or personality questionnaire scores. The possibility of indirect, performance tests of personality were so appealing in the 1950s, largely because of Cattell's work and the studies of field dependency by Witkin, that a national conference was held in 1959 that I chaired under sponsorship of the NationalInstitute of Mental Health to review this rapidly growing, new field to summarize the work to date, and to point the way to future promising areas of research.A wide range of interesting performance tests, many of them derived from experimental studies of perception, learning, or cognition, were discussed. Many proved to be highly reliable measures of individual differencesthat were interesting in their o m right but faded to correlateappreciablywithany other measures of personality. As a result of these disappointing results, the focus upon perfomance tests of personality faded away and greater efforts were devoted to a better ~ d e r s t a n d i nof~ how the objective measurement of personality could be improved by a deeper under st an^^ of how personality variance could be separated from methodvariance. Moreover, some perceptual and cognitive
measures proved sufficiently interesting to stimulate more attention to the possibility that at least some of them might well be useful as moderator variables or as valid indicators of coping strategies in problem solving.
A frustrating and persistent problem in personality assessment is the confounding of traits with the particular method used for measuring them. Well-known examples are social desirability, the tendency to endorse an item according to whether its content is desirable or not, and acquiescence, the tendency to agree regardless of content. Social desirability and acquiescence have been treated as method variance by incorporating additional scales that measure and isolate these response styles and other artifacts. The more difficult, unresolved problem is that of measuring a trait free of the methods variance characteristic of constructs; the multitrait-multimethod analysis of Campbell and Fiske (1959) is a sobering, wise reminder of how far we have yet to go.
The flurry of research in the 1950s and 1960s on perceptual and cognitive style variables resulted from the widespread belief that these stylistic measures would have theoretical and practical use as indirect approaches to personality. Herman Witkin's important experiments with the tilting chair in a conflicting visual field, the Rod-and-Frame Test, and other perceptual tasks led to the important concept of field dependency, which he later broadened to the less precise, vague concept of psychological differentiation (Witkin, Dyk, Faterson, Goodenough, & Karp, 1962). A second main stream of activity, shortly after Witkin's, concerned cognitive style variables growing out of the work at the Menninger Foundation (Gardner, Holzman, Klein, Linton, & Spence, 1959) and by Samuel Messick, Douglas Jackson, and their colleagues, first at the Menninger Foundation and then at Educational Testing Service. Messick's review in 1961 of personality structure included research on stylistic variables in perception, judgment, and memory, covering such cognitive control variables as leveling-sharpening, constricted-flexible control, equivalence range, and tolerance for unrealistic experiences, in addition to field dependency.

As with hundreds of other indirect approaches to personality, failure of these perceptual and cognitive variables to correlate appreciably with well-known personality trait measures from self-report inventories, peer ratings, and independent behavioral observations proved disappointing to most investigators, though not to such leaders in the field as Messick and Witkin, who conceived reliable measures of individual differences in perception and cognition as valid domains within a broadly conceived field of personality. It is interesting to note that personality psychologists who embrace the Big Five as the basic dimensions of personality are now including intellect or openness to experience as the fifth dimension in their Big Five personality model, thereby making room for Messick's view after all these years.
One frequently hears the criticism that personality theories and assessment, especially trait-based inventories, are culture-bound because of their derivation from the languages of Western Europe. Expansion of cross-cultural studies of personality throughout the world in the late 20th century
has produced convincing evidence that the main dimensions of personality found in America and Europe are more universal than earlier critics had believed. Recent reviews by Butcher and Rouse (1996) of clinical assessment and by McCrae and Costa (1997) of personality trait structure reveal a large number of major studies in just the past several years that demonstrate the wide applicability of personality assessment across many different cultures and languages. The issue that now seems to be emerging is not whether personality assessment is culture-bound, but rather, how the basic dimensions of personality are expressed in different cultures and whether cultural values and customs influence this expression. Most likely there will be continuing expansion of cross-cultural research on these important questions in the near future, spurred on by the growing agreement concerning the nature of personality-trait structure and facilitated by high-speed, inexpensive, international communication.
Thirty-five years ago there was considerable concern about public indignation over possible misuses of personality assessment. Although public mistrust still exists in some quarters, psychologists and others have properly addressed the legitimate ethical and social issues. Adequate safeguards for the protection of human subjects have been built into the entire process, from reviewing research grant applications to monitoring the collection and analysis of data while protecting the confidentiality of participants. Many of us find these restrictions and procedures frustrating and time-consuming. Nevertheless, such safeguards are necessary. At least the educated public in highly developed countries is much better informed than in the past. And yet, the fears that psychologists have powerful, secret techniques for uncovering private information about an individual or controlling his personality still lurk in the background as obstacles to the conduct of scientific research. Moral dilemmas created by personality assessment in other, more autocratic, less openly democratic countries are more likely to become a serious problem that will bear watching carefully in the coming years. The protection of human subjects, the education of the general public, and
establishment of high standards of ethical conduct are important regardless of where the personality assessment is done.

The last fifty years of the 20th century have seen a great deal of progress in the development of personality theory and measurement. In spite of these important scientific advances, current issues in the field have a remarkable similarity to the issues and controversies debated many years ago. Of course they are usually expressed in a different form today. Whether these same issues will still be of major concern 10 or 20 years from now is difficult to say. But one thing is virtually certain. The complex and changing nature of personality will continue to be a major challenge to those who seek a truly comprehensive understanding of the individual and social interaction. The field has more promising leads for young investigators to pursue than ever before, assuring that reinvigorated research in personality and clinical measurement will advance more significantly than in the past.
REFERENCES
American Psychiatric Association. (1994). Diagnostic and statistical manual of mental disorders (4th ed.). Washington, DC: Author.
Beck, A. T., Steer, R. A., & Garbin, M. G. (1988). Psychometric properties of the Beck Depression Inventory: Twenty-five years of evaluation. Clinical Psychology Review, 8, 77-100.
Ben-Porath, Y. S. (1994). The MMPI and MMPI-2: Fifty years of differentiating normal and abnormal personality. In S. Strack & M. Lorr (Eds.), Differentiating normal and abnormal personality. New York: Springer.
Buss, D. M., & Perry, M. (1992). The aggression questionnaire. Journal of Personality and Social Psychology, 63, 452-459.
Butcher, J. N., & Rouse, S. V. (1996). Personality: Individual differences and clinical assessment. Annual Review of Psychology, 47, 87-111.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod approach. Psychological Bulletin, 56, 81-105.
Cattell, R. B. (1963). Concepts of personality growing from multivariate experiment. In J. M. Wepman & R. W. Heine (Eds.), Concepts of personality. Chicago: Aldine.
Gardner, R. W., Holzman, P. S., Klein, G. S., Linton, H. B., & Spence, D. P. (1959). Cognitive control: A study of individual consistencies in cognitive behavior. Psychological Issues, 1, 1-186.
Goldberg, L. R. (1990). An alternative "description of personality": The Big-Five factor structure. Journal of Personality and Social Psychology, 59, 1216-1229.
Holtzman, W. H. (1964). Recurring dilemmas in personality assessment. Journal of Projective Techniques and Personality Assessment, 28, 144-150.
Holtzman, W. H. (1965). Personality structure. Annual Review of Psychology, 16, 119-156.
Hundleby, J., Pawlik, K., & Cattell, R. B. (1964). Personality factors in objective test devices. San Diego, CA: Knapp.
Jackson, D. (1989). Basic Personality Inventory manual. Port Huron, MI: Sigma Assessment Systems.
Loevinger, J. (1994). Has psychology lost its conscience? Journal of Personality Assessment, 62, 2-8.
McCrae, R. R., & Costa, P. T., Jr. (1997). Personality trait structure as a human universal. American Psychologist, 52, 509-516.
Messick, S. (1961). Personality structure. Annual Review of Psychology, 12, 93-128.
Millon, T. (1994). Manual for the Millon Clinical Multiaxial Inventory-III. Minneapolis, MN: National Computer Systems.
Morey, L. C. (1991). Personality Assessment Inventory: Professional manual. Odessa, FL: Psychological Assessment Resources.
Spielberger, C. D. (1983). State-Trait Anxiety Inventory (Form Y) manual. Palo Alto, CA: Consulting Psychologists Press.
Wiggins, J. S., & Pincus, A. L. (1992). Personality: Structure and assessment. Annual Review of Psychology, 43, 473-504.
Witkin, H. A., Dyk, R. B., Faterson, H. F., Goodenough, D. R., & Karp, S. A. (1962). Psychological differentiation. New York: Wiley.
Socially desirable responding is typically defined as the tendency to give overly positive self-descriptions. The status of such a response style rests on its claim to be a psychological construct. A brief history of such constructs is reviewed here, along with evidence bearing on their measurement independent of their personality content.
The topic of this essay is restricted to one response bias, socially desirable responding (SDR), defined here as the tendency to give overly positive self-descriptions. Note that my qualification "overly" is seldom included in definitions of SDR, but it is of central importance in this essay. Indeed, I will argue that no SDR measure should be used without sufficient evidence that high scores indicate a departure from reality.

This essay begins with a selective review of the wide variety of constructs held to underlie SDR scores. Coverage of the early developments is particularly selective because that history has already been reviewed elsewhere (Messick, 1991; Paulhus, 1986). The latter part of the essay emphasizes the recent developments with which I have been associated. Although my approach departs from theirs in some respects, my understanding of the topic of SDR draws liberally from the substantial empirical and theoretical contributions of the team of Sam Messick and Doug Jackson (e.g., Jackson & Messick, 1962; Jackson, 1961). And specific to this volume, my depiction of the interplay between response styles and personality can be traced to Messick's insightful analyses (Damarin & Messick, 1965; Messick, 1991).
Assessment psychologists have agreed, for the most part, that the tendency to give socially desirable responses is a meaningful construct. In developing measures of SDR, however, they have used a diversity of operationalizations. A singular lack of empirical convergence was the unfortunate result. Commentators who were already wary of the very concept of SDR have exploited this disagreement to buttress their skepticism (e.g., Block, 1965; Kozma & Stones, 1988; Nevid, 1983). And the skeptics have a point in that the allegation that SDR contaminates personality measures is difficult to substantiate without a clarification of the SDR construct itself. This chapter aims to provide such a clarification. I argue that the attention given to SDR research cannot be dismissed as a red herring (Ones, Viswesvaran, & Reiss, 1996), but represents a process of construct validation that has now accumulated to the point where a coherent integration is possible. Accordingly, my review of the literature begins by laying out the three approaches that require integration.
1. Minimalist Constructs. A number of contributors have erred on the cautious side by using a straightforward operationalization of SDR with minimal theoretical elaboration. One standard approach entails (a) collecting social desirability ratings of a large variety of items, and (b) assembling an SDR
measure consisting of those items with the most extreme desirability ratings (e.g., Edwards, 1953; Jackson & Messick, 1961; Saucier, 1994). The rationale is that individuals who claim the high-desirability items and disclaim the low-desirability items are likely to be responding on the basis of an item's desirability rather than its accuracy.

The validity of such SDR measures has been supported by demonstrations of consistency across diverse judges in the desirability ratings of those items (Edwards, 1970; Jackson & Messick, 1962).1 Moreover, scores on SDR scales developed from two different item domains (e.g., clinical problems, personality) were shown to be highly intercorrelated (Edwards, 1970). In short, the same set of respondents was claiming to possess a variety of desirable traits.

Exemplifying the minimalist approach was the psychometrically rigorous but theoretically austere creation of the SD scale by Allen Edwards (1957, 1970). Throughout his career, Edwards remained cautious in representing SD scores as "individual differences in rates of SD responding" (Edwards, 1990, p. 272). At the same time, the prominence of his work derived undoubtedly from the implication that (a) high SD scores indicate misrepresentation and that (b) personality measures correlating highly with his SD scale were contaminated to the point of futility (Edwards & Walker, 1961). Such inferences were easily drawn from his statements about the utmost necessity that personality measures be uncorrelated with SD (Edwards, 1970, p. 232; Edwards, 1957, p. 71).2

An important alternative operationalization of SDR has been labeled role-playing (e.g., Cofer, Chance, & Judson, 1949; Wiggins, 1959). Here, one group of participants is asked to "fake good," that is, respond to a wide array of items as if they were trying to appear socially desirable. The control group does a "straight take": That is, they are asked simply to describe themselves as accurately as possible. The items that best discriminate the two responses are selected for the SDR measure. This approach led to the construction of the MMPI Malingering scale and Wiggins's Sd scale, a scale still proving useful after 30 years (see Baer, Wetter, & Berry, 1992).

Both of the above operationalizations seemed reasonable, yet the popular measures ensuing from the two approaches (e.g., Edwards's SD scale and Wiggins's Sd scale) showed notoriously low intercorrelations (e.g., Holden & Fekken, 1989).
1 Nonetheless, both authors noted elsewhere that multiple points of view of SD must be recognized to understand the role of SD in personality (Jackson & Singer, 1967; Messick, 1960).
2 On the few occasions where he lost his equanimity, his opinion was clear: "Faking good on personality inventories, without special instructions to do so, I would consider equivalent to the socially desirable responses in self-description" (Edwards, 1957, p. 57).
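The minimalist recipe translates into a few lines of code. In this sketch (Python; the ratings, cutoff, and responses are all invented for illustration), items with extreme judged desirability are retained and keyed, and a respondent's SDR score is the count of desirability-consistent answers:

    import numpy as np

    # Judged desirability of 8 items, averaged over raters (1-9 scale).
    desirability = np.array([8.6, 5.1, 1.4, 8.9, 4.8, 1.2, 5.0, 8.8])

    # Retain only items with extreme ratings; key claiming a desirable
    # item, or disclaiming an undesirable one, as an SDR response.
    extreme = np.abs(desirability - 5.0) > 2.5
    key = np.where(desirability > 5.0, 1, -1)[extreme]

    # One respondent's True/False self-descriptions on all 8 items.
    answers = np.array([True, False, True, False, False, True, True, True])
    claims = np.where(answers[extreme], 1, -1)

    sdr_score = int(np.sum(claims == key))   # desirability-consistent answers
    print(sdr_score)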
A hint toward resolving the discrepancy lies in endorsement rates: The endorsement rate of SD items (e.g., "I am not afraid to handle money") is relatively high, whereas the rate for Sd items (e.g., "I never worry about my looks") is relatively low. To obtain a high score on the Sd scale, one must claim rare but desirable traits. Thus the Sd scale (and similarly derived scales) incorporated the notion of exaggeration.
2. Elaborated Constructs. Some attempts to develop SD measures involved more theoretical investment at the operationalization stage: The developers provided a detailed construct elaboration, including specific hypotheses regarding the underlying psychology (e.g., Crowne & Marlowe, 1964; Eysenck & Eysenck, 1964; Sackeim & Gur, 1978). The items were designed to evoke different responses in honest responders than in respondents given to desirable responding. Several such measures incorporated the notion of exaggeration by describing behaviors that have rather widespread social approval but are rare in practice. High scores on the lie scale were assumed to mark a dishonest respondent, because credit was accumulated by self-descriptions that were not just positive, but improbable.

The most thoroughly elaborated of these constructs was the need for approval (Crowne & Marlowe, 1964), developed by studying the behavioral correlates of the scale in great detail. The authors concluded that a need for approval was the motivational force behind both (a) high scores on the Marlowe-Crowne scale and (b) public behavior that was both conforming and harmonious. Further resolution of this character was provided by subsequent research. Thus the construct had evolved appropriately in response to the accumulating data.
Socially Desirable Respondin c C o ~~s t ~Serious c ~~ s . consideration ~ ~ must begven to the theorists e that those scoring high on i SDR n s ~ e n t should s be take trai actually do possess an abundance of desirable osta, 1983; ~ o ~ a n 1964). d , To support the chers showed that the self-reports on SDR instruments correlated with reports by owle edge able observers. More recen however, have revealed that the evidence regardmg the acc made by lagh SDR respondentsis mixed, at best(Pauhus The most prominent example of the accuracy position is Block's (1965) book, theC ~ ~ ~ ~Z e ~ ~ ~ Sets. e His view ~ wasthat o hgh scores ~ on Edwards's ~ e (as well as the first factor of the "PI) represented a desirable ty s y n ~ o m e called ~ o - ~ e sHis ~ ~evidence e ~ included ~ . the tion by owle edge able observers(e.g.,spouses)ofmanyofthe desirable qualities that were self-adbed on the SD scale. No doubt there is some degree of accuracy in SD scores, butmy recent analysis of Block'sEgo Resiliency measure c o n h e d that it also includesa demonstrable degree of sto or ti on (Paulhus, 1998a). McCrae and Costa (1983) articulated a s d a r a r ~ m efor n t theaccuracy of self-descriptionsonthe Marlowe-Cro~e(MC)scale.Theyshowedthat spouses sustzined many of the claims by high scorers that they possessed a vdety of desirable traits. In apparent eontradlction, a series of studies by am and Jacobson(1978) showed that high-MCs would lie and cheat to impress experimenters with their character. These c o n ~ c depictions ~g can be reconciled w h i the construct of need for approval. High scorers MC may on realne that socially conventional behavior is usually the best way to gain in a number of situations where approval yet believe that deceit works better detection is very u&ely. In short, the data do not support the naive clakn that high MCs (orhigh~SDs)are simply those with desirable character.3 and appearto tap In sum, the two most popular measures of SDR (SDMC) both reality and distortion.C o n ~ a t i o of n the distortion component makes it easier to understand why some respondents describe themselves in consistently positive terns across a variety of traithensions.
The notion that SDR appears in two distinct forms was recognized by a number of early researchers (Cattell & Scheier, 1961; Edwards, Diers, & Walker, 1962; Jackson & Messick, 1962; Messick, 1962; Wiggins, 1964). Factor analyses revealed two relatively independent clusters of measures, noncommittally labeled Alpha and Gamma by Wiggins (1964). The first factor was clearly marked by Edwards's SD scale and the second by Wiggins's Sd scale. Subsequent research positioned other measures: Loading on the first factor were, among others, the MMPI K-scale, Byrne's (1961) Repression-Sensitization scale, and Sackeim and Gur's (1978) Self-Deception Questionnaire, whereas the second factor included Eysenck's Lie scale and the Other-Deception Questionnaire. A third set of measures loading largely, but not exclusively, on the second factor included the Marlowe-Crowne scale, the Good Impression scale (Gough, 1957), and the MMPI Lie scale (Hathaway & McKinley, 1951). With a growing consensus regarding two empirical factors, the conceptual task was now doubly challenging: What are the psychological constructs underlying these two SDR factors?

3 See Paulhus and John (1998) for other reasons why the claims of high MC scorers cannot be taken at face value.
Damarin and Messick
It was not until the review by Damarin and Messick (1965)4 that a detailed theoretical interpretation of the two factors was offered (see Figure 4.1a). Factor 1 was said to involve the defensive distortion of one's private self-image to be consistent with a global evaluative bias. As a substantive label for this factor, they proposed autistic bias in self-regard. Associated personality traits included self-esteem and ego-resiliency. Factor 2 was labeled propagandistic bias, to indicate a naive tendency to promote a desirable public reputation. Here, the underlying motivation was linked to factors varying from social approval to habitual lying. For the first time, a detailed characterological analysis had been provided for both factors.
Perhaps the most clear-cut example of the rational approach to SDR scale development was the work by Sackeim and Gur (1978; Gur & Sackeim, 1979; Sackeim, 1983). They applied to the process of questionnaire responding the distinction between the constructs of self-deception and other-deception. Some respondents report unrealistically positive self-depictions about which they appear to be convinced; other respondents consciously and deliberately distort their self-descriptions to fool an audience (see Figure 4.1b).

4 This was a technical report with limited circulation, but much of the material was reviewed in
the subsequent chapter by Messick (1991).
FIG. 4.1a. Two constructs proposed by Damarin and Messick.
To compose a set of items indicating self-deception, the authors drew on the psychodynamic notion that sexual and aggressive thoughts are universally experienced yet often denied. If respondents overreact to questions with offensive content (e.g., "Have you ever thought about killing someone?"), then they are assumed to have self-deceptive tendencies. To measure other-deception, the authors wrote items describing desirable behaviors that are so public and blatant that they are not subject to self-deception (e.g., "I always pick up my litter"). According to the authors' reasoning, then, excessive claims of such commendable behaviors must involve conscious dissimulation.

The result of Sackeim and Gur's rational item composition was the matched duo of measures labeled the Self-Deception Questionnaire and the Other-Deception Questionnaire. Use of the word "deception" in both titles made it clear that exaggeration was an integral part of both conceptions. To ensure that this exaggeration tendency was captured by the measures, the authors recommended a scoring procedure that gave credit only for exaggeratedly positive item responses: Specifically, only responses of '6' or '7' on a 7-point scale were counted.
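That scoring rule is easy to state precisely. A sketch (Python; the responses and keying are invented) of dichotomous, polarized scoring, in which only extreme claims on the desirable pole earn credit:

    import numpy as np

    # Ratings (1-7) on six items; keyed = -1 marks reverse-worded items
    # that must be reflected before scoring.
    responses = np.array([7, 6, 2, 1, 7, 4])
    keyed = np.array([1, 1, -1, -1, 1, 1])

    desirable = np.where(keyed == 1, responses, 8 - responses)

    # Credit only exaggeratedly positive answers of 6 or 7, so the total
    # indexes overclaiming rather than mere positivity.
    score = int(np.sum(desirable >= 6))
    print(score)   # 5 of the 6 items answered in the exaggerated direction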
My early work was essentially an attempt to link and integrate the provocative concepts and instruments developed by Sackeim and Gur with the integrative structure provided by Damarin and Messick (see Paulhus, 1984, 1986).
FIG. 4.1b. Sackeim and Gur's two deception constructs.
A joint factor analysis including the Sackeim and Gur scales was revealing: Those two scales clearly marked the two factors, suggesting a theoretical interpretation of the factors that was consistent with, yet more theoretically trenchant than, the labels provided by Damarin and Messick. I settled on the labels self-deception and impression management (see Figure 4.1c). The term other-deception was replaced because its implication of deliberate lying seemed presumptuous. Instead I argued, following Damarin and Messick, that habitual presentation of a specific positive public impression could be construed as an aspect of personality, rather than a deception (see also Hogan, 1983). Hence, the term impression management was judged to be more apt.

I also devoted some effort to evaluating the psychometric properties of Sackeim and Gur's Self- and Other-Deception Questionnaires, with some dismaying conclusions. To begin with, all the items on the former measure were negatively keyed and all items on the latter measure, positively keyed. Because the measures were thus confounded in opposite directions with acquiescence, their observed intercorrelation of .30 was likely to have
FIG. 4.1c. Two constructs proposed by Paulhus (1984).
underestimated the true value. As feared, when reversals were added to each scale, the intercorrelation exceeded .50. Although the balanced versions of these measures still loaded on their original factors, the high intercorrelation negated their advantage over single-factor measures. Moreover, some of the items on the Self-Deception Questionnaire were blatantly confounded with adjustment. To say the least, this state of affairs was discouraging for the two-factor conception.
Instead of conceding to the one-factor conception, my research group embarked on a new phase of item-writing. An extensive range of items were rationally composed to tap every conceivable form of self-deception and impression management (Paulhus, Reid, & Murphy, 1987). A swarm of factor analyses consistently revealed one factor of impression management and two factors of self-deception. The impression management items that cohered were largely the same items from earlier versions of the measure going back to Sackeim and Gur (1978). The two clusters of self-deception items appeared to involve enhancement (promoting positive qualities) and denial (disavowing negative qualities; Paulhus & Reid, 1991). Figure 4.2 shows the resulting subscales labeled Impression Management (IM), Self-Deceptive Enhancement (SDE), and Self-Deceptive Denial (SDD): They were incorporated into Version 6 of the Balanced Inventory of Desirable Responding (BIDR), which I began distributing in 1988. Table 4.1 provides examples of the three types of items.

Construct Validity of the BIDR

The SDE and IM scales, in particular, form a useful combination of response style measures because they are relatively uncorrelated but capture the two major SDR dimensions (Paulhus, 1988, 1991). Their utility was demonstrated recently in a study of self-presentation during a job application situation (Paulhus, Bruce, & Trapnell, 1995). The IM scale, but not the SDE, was extremely sensitive to faking instructions requesting various degrees of self-presentation. The sensitivity of the IM scale also far exceeded that of any of the NEO-FFI measures of the Big Five personality traits (Costa & McCrae, 1989). A similar pattern was observed in a study of job applicants vs. incumbents (Rosse et al., 1998). In other studies, the SDE scale, but not the IM, predicted various kinds of self-deceptive distortions, for example, hindsight bias (Hoorens, 1995; Paulhus, 1988). More than 40 other studies, most outside of our laboratory, have added to the construct validity. For a more extensive review, see Paulhus (1992).
TABLE 4.1. Sample items from the Balanced Inventory of Desirable Responding, Version 6.
I have never thought about killing someone.
FIG. 4.2. Refined constructs proposed by Paulhus (1988).
Personality Correlates of the BIDR

One argument against interpreting SDR factors as personality constructs is that they rarely appear as independent factors in comprehensive factor analyses of personality. One possible explanation is that the BIDR response styles are simply disguised measures of normal personality. To clarify the interrelationships, my colleagues and I administered both kinds of measures to the same students under various conditions (Paulhus, Weston, Heiman, & Trapnell). Table 4.2 shows the correlations of the three response styles with Costa and McCrae's (1989) measures of the Big Five personality traits. Although the response styles do not line up directly with any of the Big Five, the two self-deception subscales, SDE and SDD, do seem to pervade several of the factors. Given the anonymous conditions of administration, the results suggest that self-deceptive bias plays a role in all personality factors.
TABLE 4.2. Correlations of BIDR subscales with the Big Five personality factors.
Columns: Extraversion, Openness, Emotional Stability, Conscientiousness.
The correlations with the Impression Management scale are weaker, but the fact that they are non-zero is noted here and discussed later.

The adjustment correlates of these response style measures have also been studied. In general, SDE, but not IM, is positively related to self-reports of mental health (e.g., Bonanno et al., in press; Brown, 1998; Nichols & Greene, 1997; Paulhus, 1991; Paulhus, 1998b). High SDE can also have a positive impact on performance in certain circumstances (Johnson, 1995). In a recent study of interpersonal adjustment, however, high SDE scorers were perceived negatively after 7 weeks of interaction. Moreover, high-SDE but not high-IM or high-SDD participants exhibited a discordance with reality, as indicated by an inflation in self-ratings relative to ratings by fellow group members (Paulhus, 1998a).

That research bears directly on the debate about whether positive illusions are adaptive (Taylor & Brown, 1988; Yik, Bond, & Paulhus, 1998) or maladaptive (Colvin, Block, & Funder, 1995; John & Robins, 1994). In that debate, the SDE scale (along with measures of narcissism) represents positive illusions, that is, trait self-enhancement. The two studies by Paulhus (1998a) indicated, in short, that trait self-enhancement was adaptive in promoting high self-esteem and positive first impressions, but had negative interpersonal consequences in the longer run (see also Bonanno et al., in press).
The implication of this understated comment was that researchers making allegations about response bias must do the work of demonstrating departures from reality: This task requires the collection of credible measures of personality to be partialed from self-reports. Damarin and Messick (1965) went on to lay out the statistical partitioning necessary to isolate the residual bias component (p. 21). This recommendation proved invaluable in my work with Oliver John on determining the structure of self-favoring bias (John & Paulhus, 2000; Paulhus & John, 1998). We needed a unit of bias to represent each part of the personality space.5 For each personality variable, we collected self-ratings to compare with a more objective criterion, namely, ratings by knowledgeable peers (i.e., friends, family). In the case of intelligence, we also used IQ scores as a criterion. Each self-rating was regressed on its corresponding criterion to create a residual score representing the departure of the self-rating from reality. Factor analysis of a comprehensive set of such residuals should uncover the structure of self-favoring bias.

Using the Big Five dimensions of personality plus intelligence to represent personality space, our factor analyses of residuals revealed a smaller space than the 5-space of either self- or peer-ratings. The first two major dimensions appeared as in Figure 4.3. Factor 1 was marked by the Extraversion and Openness residuals, whereas Factor 2 was marked by the Agreeableness and Conscientiousness residuals.6 Apparently, the structure of bias bears little resemblance to the standard Big Five structure. If anything, these factors look more like agency and communion (see Bakan, 1966).

A replication study helped to clarify the meaning of the bias factors through a wide variety of self-report measures. These included the addition of traditional measures of SDR (the Marlowe-Crowne scale) as well as related measures of self-enhancement (e.g., the Narcissistic Personality Inventory). The additions allowed us to project a variety of bias and personality measures onto the two bias factors. The resulting projections (correlations with the factors) are depicted in Figure 4.4.
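The partialing logic can be sketched in a few lines (Python; the data are simulated, and a single shared bias is built in purely for illustration): each self-rating is regressed on its criterion, the residual is kept as the unit of bias, and the residuals are then factored.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 200, 6   # simulated sample; Big Five plus intelligence

    # criterion[:, j]: peer rating (or IQ) for variable j; self-ratings
    # equal the criterion plus a shared self-favoring bias plus noise.
    criterion = rng.normal(size=(n, k))
    bias = rng.normal(size=(n, 1))
    self_rt = criterion + bias + 0.5 * rng.normal(size=(n, k))

    # Residual of each self-rating after regression on its criterion:
    # the departure of the self-report from reality.
    residuals = np.empty_like(self_rt)
    for j in range(k):
        slope, intercept = np.polyfit(criterion[:, j], self_rt[:, j], 1)
        residuals[:, j] = self_rt[:, j] - (slope * criterion[:, j] + intercept)

    # Principal components of the residuals reveal the structure of bias:
    # here, one large eigenvalue corresponding to the built-in bias factor.
    eigvals = np.linalg.eigvalsh(np.corrcoef(residuals, rowvar=False))[::-1]
    print(np.round(eigvals, 2))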
5 The last 15 years of work on the Five Factor Model suggest that it captures the five most important dimensions of personality (Wiggins, 1996). There is some dispute, however, about which rotation is optimal.

6 The results were more clear when we separated Conscientiousness into Dutifulness and Ambition following Paunonen and Jackson (1996). It is the Dutifulness measure that is most faithful, conceptually and empirically, to the Gamma factor.
FIG. 4.3. Structure of Big Five Residuals.
FIG. 4.4. Response style correlates of Alpha and Gamma.
Note first the striking match of the two BIDR subscales (SDE and SDD) to the two bias dimensions. Immediately, we have reason to believe that the factors represent Alpha and Gamma, the bias factors named by Wiggins (1964) and explicated by Damarin and Messick (1965). Note further that narcissism, as measured by the Narcissistic Personality Inventory (Raskin & Hall, 1981), marks Factor 1 along with SDE.7 Factor 2 resembles earlier studies in being well marked by the IM and SDD scales and less well by Eysenck's Lie scale, the MMPI Lie scale, and the Marlowe-Crowne scale.

Remarkably, the venerable Alpha and Gamma SDR factors (noted above) have been regenerated via a novel technique requiring only personality content measures. The convergence of results across the two techniques adds substantial credibility to the Alpha and Gamma factors. In particular, the new technique provides evidence that both Alpha and Gamma assess departure from reality. That is, high scores on both SDR factors involve overly positive self-descriptions (Q.E.D.).

One remaining puzzle is the fact that two conceptually different response style measures, Self-Deceptive Denial and Impression Management, load on the same SDR factor, Gamma. How can the response styles previously held to capture unconscious and conscious distortions, respectively (Paulhus, 1986; Sackeim & Gur, 1978), now coalesce at this point? In anonymous student samples, where pressure for self-presentation is minimal, SDD and IM appear to be capturing similar personality content. Yet IM is more responsive to instructional manipulations. In short, Gamma subsumes both conscious and unconscious aspects of common content. Apparently, I have to question my previous contention that level of consciousness is the core difference between the Alpha and Gamma factors of SDR (e.g., Paulhus, 1986). This theoretical revision makes it easier to explain why the Gamma loading of an allegedly conscious deception measure (IM) does not disappear entirely in anonymous responses. With Gamma as a content factor, it is now quite understandable that IM should appear even when there is no audience to motivate impression management.
In a more recent set of experiments, we sought to clarify the Alpha and Gamma factors via a series of studies varying self-presentation instructions (Paulhus & Notareschi, 1993). First we wondered why Gamma measures were

7 A number of recent reports on narcissism and self-deceptive enhancement suggest substantial overlap in both the constructs and the primary measures, NPI and SDE (McHoskey, Worzel, & Szyarto, 1998; Paulhus, 1998a; Raskin, Novacek, & Hogan, 1991).
so sensitive to instructional manipulations (e.g., Paulhus, 1984). When given standard instructions to respond in a socially desirable fashion, respondents reported that they interpreted the instruction to mean that they should respond like a "nice person" or "good citizen." It struck us that this interpretation of social desirability was rather narrow, focusing on content related to agreeableness and dutifulness, i.e., communal traits. Accordingly, we tried an agentic form of instruction to respondents: "Respond to the questionnaire in a way to impress an experimenter with how strong and competent you are." Lo and behold, the SDE was more sensitive than the IM scale8 to these instructions (Paulhus, Tanchuk, & Wehr, 1999). In retrospect, these findings seem embarrassingly obvious; yet they have dramatic implications for previous research on SDR. First, it is now apparent why the items on Wiggins's Sd and other Gamma factor scales contained those socially desirable but distinctively conventional items.9 Recall that these measures were developed using role-playing instructions that emphasized communion-related desirability. Second, it now seems obvious why Gamma-related scales were so responsive to instructions: They contain the very content that is implied by the instructions. Third, Alpha-related measures may be no more unconscious (and therefore self-deceptive) than Gamma measures. Then what, after all, can we make of these two factors? Both respond to faking instructions, even under anonymous conditions; both have conscious and unconscious aspects to them. At least we don't have to withdraw the (thankfully noncommittal) labels, Alpha and Gamma. The "final" two-tier conception suggests that (a) Alpha and Gamma be distinguished in terms of personality content, and (b) each comprises a self-deceptive style and an impression management style. Alpha and Gamma are held to be two constellations of traits with origins in two fundamental values, agency and communion (Paulhus & John, 1998). Excessive adherence to these values results in self-deceptive tendencies, which we label egoistic bias and moralistic bias.

Associated with Alpha is an egoistic bias, a self-deceptive tendency to exaggerate one's social and intellectual status. This tendency leads to unrealistically positive self-perceptions on such agentic traits as dominance, fearlessness, emotional stability, intellect, and creativity. Self-perceptions of

8 When subjects were notified that their answers could land them a summer job (emphasizing competence), then both scales showed significant increases (Paulhus, Lysy, & Yik, 1994).

9 The items were also low in communality (Wiggins, 1964).
FIG. 4.5. Proposed two-tier system.
high scorers have a narcissistic, "superhero" quality. Associated with Gamma is the moralistic bias, a self-deceptive tendency to deny socially-deviant impulses and claim sanctimonious, "saint-like" attributes. This tendency is played out in overly positive self-perceptions on such traits as agreeableness, dutifulness, and restraint. At the impression management level, people are often motivated to deliberately exaggerate their attainment of agency and communion values. Thus the same two clusters of traits are involved but the exaggeration is more conscious. At this level, Alpha involves Agency Management, that is, asset promotion. Such deliberate promotion of competence, fearlessness, or physical prowess is most commonly seen in job applicants or in males attempting to impress a dating partner. Deliberate exaggeration of Gamma is termed Communion Management and involves excuse-making and damage control of various sorts. Such deliberate minimization of faults might also be seen in religious settings, or in employees who are trying to hold on to the status quo, or legal defendants trying to avoid punishment.

To fully assess the two-tiered system of SDR constructs, four types of measures are needed. Fortunately, three out of the four have been available for some time. Self-deceptive enhancement can be measured with its namesake (SDE) or the Narcissistic Personality Inventory (see Paulhus, 1998a). Self-deceptive denial can also be measured by its namesake scale (SDD). Communion management may be assessed using the traditional Impression Management scale, which has varied little since Sackeim and Gur (1978). Tentatively, it is renamed Communion Management.

The fourth type of desirable responding construct, agency management, required the development of a new instrument by the same name (AM). It consists of items related to agency content but with low endorsement rates in straight-take administrations. The low communalities permit room for manipulators to deliberately enhance impressions of their agency. Examples are
"I am very brave" and "I am exceptionally talented". Such items tend not to be claimed, even by narcissists, under anonymous conditions. But the endorsement rate is higher under agency-motivated conditions than under anonymous conditions.

In recent studies, we have found that the impression management scales, AM and CM, are more useful as response sets than response styles. They were designed for the purpose of capturing instructional sets to appear agentic or communal and they perform that task very well. These measures do not perform so well as styles, presumably because impression management has so many sources and is so sensitive to situational demands.10 To the extent that a response bias is self-deceptive, the motivation for bias is more trait-like and therefore consistent with the definition of response style.
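To keep the four constructs straight, the two-tier system can be summarized in a two-by-two layout (my arrangement of the labels introduced above; cf. Fig. 4.5):

                          Alpha (agency)              Gamma (communion)
Self-deception            Egoistic bias (SDE, NPI)    Moralistic bias (SDD)
Impression management     Agency Management (AM)      Communion Management (CM)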
The fact that theories evolve is not a deficiency of science. Indeed, its responsiveness to new data can be seen as science's greatest asset. In this light, the evolution of constructs underlying SDR should be viewed as inevitable rather than distressing. At the same time, science should exhibit net progress rather than veer haphazardly. The ideas about SDR presented here are the result of such progress: They were founded on and developed from earlier work. In particular, the earlier writings by Messick (1991; Damarin & Messick, 1965) were a necessary precursor for many of the ideas presented here. For example, Messick's writings emphasized the necessity of demonstrating departure-from-reality in assessing SDR. To this end, he outlined a statistical analysis of partial correlations. That notion and that method proved to be central to the development of our residuals method of determining the structure of bias in self-reports (Paulhus & John, 1998).

Yet those earlier ideas could not account for all the newly-collected data. In particular, the new data required a more elaborate structural model of SDR. This final two-tiered system incorporates a content level (agency vs. communion) as well as a process level (conscious vs. unconscious). All four types of SDR were shown to involve the departure from reality that distinguishes response biases from content dimensions of personality. And they reaffirm the continuing challenge of response biases to valid assessment.
10 Promising new methods for measuring individual differences in impression management include the overclaiming technique (Paulhus & Bruce, 1990) and response latencies (Holden & Fekken, 1993).
Baer, R. A., Wetter, M. W., & Berry, D. T. (1992). Detection of underreporting of psychopathology on the MMPI: A meta-analysis. Clinical Psychology Review, 12, 509-525.
Bakan, D. (1966). The duality of human existence: Isolation and communion in Western man. Boston: Beacon.
Block, J. (1965). The challenge of response sets. New York: Century.
Bonanno, G. A., Field, N. P., Kovacevic, A., & Kaltman, S. (in press). Self-enhancement as a buffer against extreme adversity. Journal of Personality and Social Psychology.
Brown, J. D. (1998). The self. Boston: McGraw-Hill.
Byrne, D. (1961). The repression-sensitization scale: Rationale, reliability, and validity. Journal of Personality, 29, 334-349.
Cattell, R. B., & Scheier, I. H. (1961). Extension of meaning of objective test personality factors: Especially into anxiety, neuroticism, questionnaire, and physical factors. Journal of General Psychology, 61, 287-315.
Cofer, C. N., Chance, J., & Judson, A. J. (1949). A study of malingering on the Minnesota Multiphasic Personality Inventory. Journal of Psychology, 27, 491-499.
Colvin, C. R., Block, J., & Funder, D. C. (1995). Overly-positive self-evaluations and personality: Negative implications for mental health. Journal of Personality and Social Psychology, 68, 1152-1162.
Costa, P. T., & McCrae, R. R. (1989). Manual for the NEO Personality Inventory/NEO Five-Factor Inventory. Odessa, FL: PAR.
Crowne, D. P. (1979). The experimental study of personality. Hillsdale, NJ: Lawrence Erlbaum Associates.
Crowne, D. P., & Marlowe, D. (1964). The approval motive. New York: Wiley.
Damarin, F., & Messick, S. (1965). Response styles as personality variables: A theoretical integration (RB 65-10). Princeton, NJ: Educational Testing Service.
Edwards, A. L. (1957). The social desirability variable in personality assessment and research. New York: Dryden.
Edwards, A. L. (1970). The measurement of personality traits by scales and inventories. New York: Holt, Rinehart & Winston.
Edwards, A. L. (1990). Construct validity and social desirability. American Psychologist, 45, 287-289.
Edwards, A. L., Diers, C. J., & Walker, J. N. (1962). Response sets and factor loadings on sixty-one personality scales. Journal of Applied Psychology, 46, 220-225.
Edwards, A. L., & Walker, J. N. (1961). A short form of the MMPI: The SD scale. Psychological Reports.
Eysenck, H. J., & Eysenck, S. B. G. (1964). The manual of the Eysenck Personality Inventory. London: University of London Press.
Gough, H. G. (1957). Manual for the California Psychological Inventory. Palo Alto, CA: Consulting Psychologists Press.
Gur, R. C., & Sackeim, H. A. (1979). Self-deception: A concept in search of a phenomenon. Journal of Personality and Social Psychology, 37, 147-169.
Hartshorne, H., & May, M. A. (1930). Studies in the nature of character. New York: Macmillan.
Hogan, R., & Nicholson, R. A. (1988). The meaning of personality test scores. American Psychologist, 43, 621-626.
Holden, R. R., & Fekken, G. C. (1989). Three common social desirability scales: Friends, acquaintances, or strangers? Journal of Research in Personality, 23, 180-191.
Holden, R. R., & Fekken, G. C. (1993). Can personality test item responses have construct validity? Issues of reliability and convergent and discriminant validity. Personality and Individual Differences, 15, 243-248.
Hoorens, V. (1995). Self-favoring biases, self-presentation, and the self-other asymmetry in social comparison. Journal of Personality, 63, 793-818.
Jackson, D. N., & Messick, S. (1958). Content and style in personality assessment. Psychological Bulletin, 55, 243-252.
Jackson, D. N., & Messick, S. (1961). Acquiescence and desirability as response determinants on the MMPI. Educational and Psychological Measurement, 21, 771-792.
Jackson, D. N., & Messick, S. (1962). Response styles on the MMPI: Comparison of clinical and normal samples. Journal of Abnormal and Social Psychology, 65, 285-299.
Jackson, D. N., & Singer, J. E. (1967). Judgements, items, and personality. Journal of Experimental Research in Personality, 2, 70-79.
John, O. P., & Paulhus, D. L. (2000). The structure of self-enhancement. Unpublished manuscript, University of California, Berkeley.
John, O. P., & Robins, R. (1994). Accuracy and bias in self-perception: Individual differences in self-enhancement and the role of narcissism. Journal of Personality and Social Psychology, 66, 206-219.
Johnson, E. A. (1995). Self-deceptive coping: Adaptive only in ambiguous contexts. Journal of Personality, 63, 759-792.
Kozma, A., & Stones, M. J. (1988). Social desirability and measures of subjective well-being: Age comparisons. Social Indicators Research, 20, 1-14.
McCrae, R. R., & Costa, P. T. (1983). Social desirability scales: More substance than style. Journal of Consulting and Clinical Psychology, 51, 882-888.
McHoskey, J. W., Worzel, W., & Szyarto, C. (1998). Machiavellianism and psychopathy. Journal of Personality and Social Psychology, 74, 192-210.
Messick, S. (1960). Dimensions of social desirability. Journal of Consulting Psychology, 24, 279-287.
Messick, S. (1962). Response style and content measures from personality inventories. Educational and Psychological Measurement, 22, 41-56.
Messick, S. (1991). Psychology and methodology of response styles. In R. E. Snow & D. E. Wiley (Eds.), Improving inquiry in social science (pp. 161-200). Hillsdale, NJ: Lawrence Erlbaum Associates.
Messick, S., & Jackson, D. N. (1961). Desirability scale values and dispersions for MMPI items. Psychological Reports, 8, 409-414.
Meston, C. M., Heiman, J. R., Trapnell, P. D., & Paulhus, D. L. (1998). Socially desirable responding and sexuality self-reports. Journal of Sex Research, 35, 148-157.
Milholland, J. E. (1964). Theory and techniques of assessment. Annual Review of Psychology, 15, 311-346.
Millham, J., & Jacobson, L. I. (1978). The need for approval. In H. London & J. E. Exner (Eds.), Dimensions of personality (pp. 365-390). New York: Wiley.
Nevid, J. S. (1983). Hopelessness, social desirability, and construct validity. Journal of Consulting and Clinical Psychology, 51, 139-140.
Nichols, D. S., & Greene, R. L. (1997). Dimensions of deception in personality assessment: The example of the MMPI-2. Journal of Personality Assessment, 68, 251-266.
Ones, D. S., Viswesvaran, C., & Reiss, A. D. (1996). Role of social desirability in personality testing for personnel selection: The red herring. Journal of Applied Psychology, 81, 660-679.
Paulhus, D. L. (1984). Two-component models of socially desirable responding. Journal of Personality and Social Psychology, 46, 598-609.
Paulhus, D. L. (1986). Self-deception and impression management in test responses. In A. Angleitner & J. S. Wiggins (Eds.), Personality assessment via questionnaires (pp. 143-165). New York: Springer-Verlag.
Paulhus, D. L. (1988). Manual for the Balanced Inventory of Desirable Responding (BIDR-6). Unpublished manual, University of British Columbia.
Paulhus, D. L. (1991). Measurement and control of response bias. In J. P. Robinson, P. R. Shaver, & L. S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 17-59). New York: Academic Press.
Paulhus, D. L. (1998a). Intrapsychic and interpersonal adaptiveness of trait self-enhancement: A mixed blessing? Journal of Personality and Social Psychology, 74, 812-820.
Paulhus, D. L. (1998b). Manual for the Balanced Inventory of Desirable Responding (BIDR-7). Toronto/Buffalo: Multi-Health Systems.
Paulhus, D. L., & Bruce, M. N. (1990, June). The Over-Claiming Questionnaire (OCQ). Presented at the meeting of the Canadian Psychological Association, Ottawa, Canada.
Paulhus, D. L., Bruce, M. N., & Trapnell, P. D. (1995). Effects of self-presentation strategies on personality profiles and structure. Personality and Social Psychology Bulletin, 21, 100-108.
Paulhus, D. L., & John, O. P. (1998). Egoistic and moralistic bias in self-perceptions: The interplay of self-deceptive styles with basic traits and motives. Journal of Personality, 66, 1024-1060.
Paulhus, D. L., Lysy, D., & Yik, M. (1994). Self-presentation on a real-world job application. Unpublished manuscript, University of British Columbia.
Paulhus, D. L., & Notareschi, R. F. (1993). Varieties of faking manipulations. Unpublished data, University of British Columbia.
Paulhus, D. L., & Reid, D. B. (1991). Enhancement and denial in socially desirable responding. Journal of Personality and Social Psychology, 60, 307-317.
Paulhus, D. L., Reid, D. B., & Murphy, G. (1987). The Omnibus Study of Desirable Responding. Unpublished data, University of British Columbia.
Paulhus, D. L., Tanchuk, T., & Wehr, P. (1999, August). Value-based faking on personality questionnaires: Agency and communion. Presented at the meeting of the American Psychological Association, Boston.
Paunonen, S., & Jackson, D. N. (1996). The Jackson Personality Inventory and the five-factor model of personality. Journal of Research in Personality, 30, 42-59.
Raskin, R. N., & Hall, C. S. (1981). The narcissistic personality inventory: Alternative form reliability and further evidence of construct validity. Journal of Personality Assessment, 45, 159-160.
Raskin, R. N., Novacek, J., & Hogan, R. T. (1991). Narcissism, self-esteem and defensive self-enhancement. Journal of Personality, 59, 19-38.
Rosse, J. G., Stecher, M. D., Miller, J. L., & Levin, R. A. (1998). The impact of response distortion on pre-employment personality testing and hiring decisions. Journal of Applied Psychology, 83, 634-644.
Sackeim, H. A. (1983). Self-deception, self-esteem, and depression: The adaptive value of lying to oneself. In J. Masling (Ed.), Empirical studies of psychoanalytic theories (pp. 101-157). Hillsdale, NJ: Lawrence Erlbaum Associates.
Sackeim, H. A., & Gur, R. C. (1978). Self-deception, other-deception and consciousness. In G. E. Schwartz & D. Shapiro (Eds.), Consciousness and self-regulation: Advances in research (Vol. 2, pp. 139-197). New York: Plenum Press.
Saucier, G. (1994). Separating description and evaluation in the structure of personality attributes. Journal of Personality and Social Psychology, 66, 141-154.
Taylor, S. E., & Brown, J. D. (1988). Illusion and well-being: A social-psychological perspective on mental health. Psychological Bulletin, 103, 193-210.
Wiggins, J. S. (1959). Interrelationships among MMPI measures of dissimulation under standard and social desirability instructions. Journal of Consulting Psychology, 23, 419-427.
Wiggins, J. S. (1964). Convergences among stylistic response measures from objective personality tests. Educational and Psychological Measurement, 24, 551-562.
Wiggins, J. S. (1991). Agency and communion as conceptual coordinates for the understanding and measurement of interpersonal behavior. In W. Grove & D. Cicchetti (Eds.), Thinking clearly about psychology: Essays in honor of Paul Meehl (Vol. 2, pp. 89-113). Minneapolis: University of Minnesota Press.
Wiggins, J. S. (Ed.). (1996). The five-factor model of personality. New York: Guilford.
Yik, M. S. M., Bond, M. H., & Paulhus, D. L. (1998). Do Chinese self-enhance or self-efface? It's a matter of domain. Journal of Research in Personality, 24, 399-406.
There is ample empirical evidence that the structure of individual differences in cognitive abilities may be described in terms of a hierarchical model with three strata (Carroll, 1993; Gustafsson, 1988; Gustafsson & Undheim, 1996; Messick, 1992). From a taxonomic point of view the hierarchical approach has important advantages (Gustafsson, 1988), and it makes it possible to unite conflicting models that emphasize either one general ability (e.g., Spearman, 1927) or many specific abilities (e.g., Thurstone, 1938). But the hierarchical approach also may have implications for the measurement of cognitive abilities, and for measurement in general, which so far seem largely unexplored. The purpose of this chapter is to discuss possible implications of the hierarchical approach for measurement.

Coan (1964) introduced the term referent generality to refer to the scope of reference of a construct, or "the variety of behaviors or mental activities to which it relates and the degree to which it relates to them" (p. 138). Snow (1974) emphasized the referent generality of constructs representing outcomes of experimental designs, and Messick (1989) made frequent reference to the construct of referent generality in his treatise on validity. The idea that measures and constructs differ in referent generality has, of course, close affinity to the hierarchical model of the structure of individual differences (Coan, 1964; Gustafsson & Undheim, 1996; Messick, 1992; Snow, Corno & Jackson, 1996). For example, the construct general cognitive ability is more general than is the construct spatial ability, in the sense that the latter ability is only one among many abilities that may be subsumed under general cognitive ability. Similarly, it is obvious that a broad spatial ability subsumes a large set of more narrowly defined abilities,
such as Visualization, Flexibility of Closure, or Spatial Relations (see, e.g., Carroll, 1993). Such comparisons of referent generality are easy to make. It is not immediately obvious, however, how to approach the measurement of constructs of different referent generality. The issue of how to best measure the high referent generality constructs is considered somewhat more closely below.

... cognitive abilities. Some ... confronted with empirical data, and the crude factor-analytic model used by Spearman was superseded by the considerably more advanced multiple-factor analysis developed by Thurstone (1947). Another reason for the reluctance to accept general intelligence as a scientific construct is probably that the heterogeneous intelligence tests have not been accepted as measures of a single construct, because they have not fulfilled the ideal of homogeneity and unidimensionality demanded both by classical (e.g., Gulliksen, 1950) and modern (e.g., Lord, 1980) measurement theories. Furthermore, such tests have had difficulties meeting the criterion of "face validity" when being scrutinized (e.g., Neisser, 1976; Gould, 1981). Spearman (1927) did try to solve the measurement problem through development of more pure g-factor tests. John Raven was thus inspired by Spearman to develop the Progressive Matrices test as a g-test. This test has met with considerable success, but it has not generally been accepted as an alternative to the heterogeneous intelligence tests in practical work, and it has not generally been accepted as the measure of g in scientific studies. Compared to Binet's practical contribution, it thus seems that Spearman's theoretical and methodological contributions have had less of an impact during large periods of the 20th century. It seems, however, that the 1980s and 1990s have implied a renewed scientific interest in the construct of general intelligence, and in Spearman's work (Carroll, 1996; Dennis & Tapsfield, 1996).

One reason for this revitalization of theoretical interest in general intelligence is probably the growing popularity and success of mechanistic, biologically based models to explain individual differences in general intelligence (see, e.g., Anderson, 1992; Brody, 1992). Another is the research conducted during the 1980s on the structure of intelligence that has restored the g factor (Gustafsson, 1988; Gustafsson & Undheim, 1996). A third likely reason is that research has accumulated a massive amount of empirical evidence on the ubiquitous importance of this dimension of individual differences (Lubinski & Humphreys, 1997). Reviewing the relations between a variety of behavioral and medical phenomena and individual differences in general intelligence, Lubinski and Humphreys concluded that the relations are so obvious and strong that they must be taken into account. If that is not done, the researchers commit the logical error of the "neglected aspect," because they fail to include all relevant variables. According to Lubinski and Humphreys, general intelligence has been a "neglected aspect" in too much research, and ignoring the fact that general intelligence holds causal power over a long array of critically important behaviors is no longer defensible: we must let general intelligence compete with other constructs on questions concerning its scientific merits (Lubinski & Humphreys, 1997, p. 190). Another principle of scientific work, namely "Occam's razor," dictates that a more narrow concept should not be invoked when there already exists a more general concept with the same explanatory power.
There is, thus, considerable evidence that we must accept general intelligence as a construct, and later in this chapter I consider measurement issues associated with this construct. There is also evidence that structural models that only include one general factor fail to account for the relations among performance measures. It would, thus, be as mistaken to neglect the specific abilities as it is to neglect general ability.

For many decades Thurstone's (1938) approach of describing Primary Mental Abilities, using Multiple Factor analysis (Thurstone, 1947) of test batteries, dominated the research on the structure of abilities. The approach was successful in yielding factors which in one sense are narrow (i.e., of low referent generality), but which also show ... From 1950 and onward, work in the field of differential
model may be seen as ... Gustafsson ... In Carroll's (1993) survey, based on reanalyses of more than 400 correlation matrices collected throughout the 20th century ... In addition to General Intelligence (G), Carroll also has identified a set of broad abilities. Among these are the factors identified in the Cattell-Horn model, namely Fluid Intelligence (Gf), Crystallized Intelligence (Gc), Broad Visual Perception (Gv), and Broad Auditory Perception (Ga). In addition Carroll identified a memory factor (General Memory and Learning, Gy) as well as speediness factors (Gs, Gt, and Gp), among them Broad Cognitive Speediness (Gs). A major difference between Carroll's model and the Cattell-Horn model is ...

Hierarchical models have thus become increasingly popular. However, estimation of hierarchical models ... the Cattell-Horn model has ... relied on factor techniques which start ... the general factor is so close to ... the correlation between these factors must ... the factor accounts ...
have been performed. In the American research the narrow factors have been seen as building blocks for the broader factors and the analysis is conducted from the bottom going up, using what Gustafsson and Balke (1993) labeled a higher order modeling approach. In the British tradition the analysis has started at the top, going down, using a nested-factor approach (Gustafsson & Balke, 1993). If, however, these analyses are carried through to yield a hierarchical model with three levels, similar results are obtained (Gustafsson, 1997). There remains, however, an important conceptual difference between the two approaches. According to the bottom-up, higher order, modeling approach the lower order factors are undivisible, and there is no direct involvement of the higher order factors in the lower order factors. However, in the top-down approach factors at lower levels in the hierarchy are automatically freed from the variance due to the broader factors. Thus, the lower order factors are split up into two or more parts: one due to the lower order factor, and one due to the higher order factors. These two ways of looking at the lower order factors (i.e., the low referent generality factors) have, as will be shown below, important implications for how to go about measuring the factors.
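In schematic equations (my notation, not the chapter's), the contrast between the two approaches is:

$$\text{Higher order:}\quad \mathbf{x} = \Lambda\,\boldsymbol{\xi} + \boldsymbol{\delta},\qquad \boldsymbol{\xi} = \boldsymbol{\gamma}\,g + \boldsymbol{\zeta}$$

$$\text{Nested factor (top-down):}\quad \mathbf{x} = \boldsymbol{\lambda}_g\,g + \Lambda_s\,\boldsymbol{\eta} + \boldsymbol{\delta},\qquad \operatorname{Cov}(g,\boldsymbol{\eta}) = \mathbf{0}$$

In the higher order form, g reaches the tests only through the first-order factors ξ; in the nested form every test loads directly on g, and the group factors η are, by construction, orthogonal to g, that is, already freed of g-variance.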
Discussions about alternative models of the structure of abilities tend to become both complex and abstract, so there is need for a concrete example. In order to be able to discuss measurement issues in concrete terms we also need an example, and I rely on the Holzinger and Swineford (1939) study for both these purposes. This study certainly deserves much more attention than it has gained so far. It seems, in fact, that this study, which was conducted in the 1930s, presented all the important results about hierarchical structures of ability that have been obtained in recent years. A second reason for selecting this study for reanalysis is that it includes a large test-battery, which covers a wide spectrum of broad and narrow abilities. The test-battery was administered to a reasonably large group (N = 301) of 7th- and 8th-grade students from two Chicago schools. These data thus provide material for discussion about many interesting issues.

Holzinger, who was an American, studied for several years at the University of London with Pearson and Spearman. After returning,
he did methodological work on factor analysis, and he also did a considerable amount of substantive work. One line of work focused on the structure of mental abilities, and Holzinger in particular was oriented toward extending the Spearman model to accommodate the fact that the Two-Factor model had been ... The work was done under the auspices of the so-called Unitary Traits Committee, established in 1931 by E. L. Thorndike; Spearman was also a member of the committee. ... Holzinger published a series of reports from the Unitary Trait Study. The last report (1939) ... included tests to measure abilities in five broad areas: spatial, verbal, speed, memory, and mathematical deduction. The results of Holzinger and Swineford (1939) supported the hypothesized factors, except for the factor representing mathematical deduction: the general factor accounted for all the variance in the mathematical deduction factor, and Holzinger and Swineford (1939) concluded that "the general factor may ... deductive factor as these tests were expected to measure" (p. 8).

The Holzinger and Swineford study was reanalyzed by Gustafsson (1997), who used two different approaches of fitting hierarchical factor models to the Holzinger and Swineford data. One purpose of the reanalyses was to investigate the hypothesis of the equivalence of general intelligence and fluid intelligence. First, a higher order model (see, e.g., Gustafsson & Balke, 1993) was fitted that included five first-order factors (Gv = spatial, Gf = mathematical deductive, Gc = verbal, Gy = memory, Gs = speed) and one second-order factor (G). The higher order models showed quite clearly that there is a relation of unity between G and Gf, whereas the relations between G and the other factors were lower. Second, a so-called nested-factor model (Gustafsson & Balke, 1993) was fitted, with a G-factor related to all tests along with residual factors Gc, Gv, Gy and Gs. This model, too, supported the hypothesis of equivalence between G and
Gf, because in this model it was not possible to introduce a residual Gf-factor after G was included into the model. What is even more interesting is that the results of the nested-factor model were very close to the results presented by Holzinger and Swineford. For their analyses they used the bi-factor method developed by Holzinger (see Holzinger, 1944; Holzinger & Swineford, 1939). The essence of the bi-factor solution is that it comprises a general factor, uncorrelated group factors, and unique factors. The bi-factor solution can directly, and relatively simply, be computed from the correlation matrix. It does, however, require that the tests are brought into groups before the analysis, so in this sense it is similar to confirmatory factor analysis. The technique is described in great detail in Harman's (1967) book on factor analysis. Table 5.1 presents the estimates originally obtained by Holzinger and Swineford, and the results computed with the confirmatory nested-factor model. As may be seen, the results are generally quite close. The reanalysis of the Holzinger and Swineford data with a modern form of factor analysis thus gives excellent support to the original bi-factor analysis, and it supports the hypothesis that reasoning is of central importance to the general factor (see Gustafsson, 1997, for an extended discussion about the substantive implications of these findings).
I now consider some measurement issues that are associated with the hierarchical model, using the empirical results presented above. In the discussion I rely on a statistical model for relating observed test-score variance to the dimensions of a factor-analytic model. This model is closely associated with the nested-factor model, and may be viewed as an extension of classical test-theory to deal with multidimensionality. Reuterberg and Gustafsson (1992) demonstrated that the internal-consistency measures of reliability (e.g., Cronbach's alpha) may be formulated in terms of confirmatory factor analytic models, and that reliability measures may easily be computed from the parameters estimated in such models. Thus, given a factor model in which a set of components is related to one common factor, the amount of true variance in the sum of the components is the square of the sum of the unstandardized factor loadings.
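In symbols (my rendering, assuming uncorrelated residuals and anticipating the notation introduced below): for m components with loadings λi on one common factor and residual variances ψii, the true variance and the reliability of the unit-weighted sum s are

$$\operatorname{Var}_{\text{true}}(s) = \Big(\sum_{i=1}^{m}\lambda_i\Big)^{2},\qquad
\rho_{ss'} = \frac{\big(\sum_{i}\lambda_i\big)^{2}}{\big(\sum_{i}\lambda_i\big)^{2} + \sum_{i}\psi_{ii}},$$

the latter being the composite-reliability (omega) coefficient of confirmatory factor analysis.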
Table 5.1
Estimated Loadings in the Original Analysis (HS) and in the Reanalysis (NF)
[loading values for the 24 tests not recoverable from the source]
Let x be a vector of m observed variables, ξ a vector of k latent variables, Λ an m × k matrix of factor loadings, and δ a vector of residuals in the observed variables. Under the usual assumptions of independence of residuals and factors this model implies the covariance structure (e.g., Jöreskog, 1971):

$$\Sigma = \Lambda \Phi \Lambda' + \Psi,$$

where Φ is the covariance matrix of the latent variables and Ψ that of the residuals in the manifest variables. In the nested-factor models considered here, Φ and Ψ will be diagonal. The focus here is on properties of functions of the observed variables. Suppose that we construct a simple unit-weighted sum s of the m variables, which are assumed to be standardized. Because the variance of a sum equals the sum of all the elements of the covariance matrix for the components, it follows that:

$$\operatorname{Var}(s) = \mathbf{1}'\Sigma\mathbf{1} = \sum_{j=1}^{k}\Big(\sum_{i=1}^{m}\lambda_{ij}\Big)^{2}\phi_{jj} + \sum_{i=1}^{m}\psi_{ii}.$$

This yields a decomposition of the total observed variance in the sum into components due to the different latent variables: the contribution of each orthogonal factor is the square of the sum of its loadings, and the residuals contribute the sum of their variances.
Three propositions about measurement follow from the hierarchical approach, all involved in how to go about constructing such measures. The first is that to construct measures of constructs with high referent generality it is necessary to use heterogeneous measurement devices. ...

At first sight these three propositions may seem to conflict with current thinking about measurement, which emphasizes unidimensionality and homogeneity as prerequisites for interpretability (e.g., Pedhazur & Pedhazur Schmelkin, 1991). I try to show, however, that there is little conflict, but that the notion of measurement at different levels of generality makes it necessary to elaborate the meaning of the concepts unidimensionality and homogeneity.
The first proposition will be examined using measurement of general intelligence as an example of a construct with high referent generality. One way to create a heterogeneous test is to use a simple unweighted sum of all the tests in the Holzinger and Swineford battery. Such a composite score would approximate an IQ score, or a score on a very heterogeneous test. Assuming that standardized scores (i.e., z-scores) are summed together, the estimated variance for the sum would be 165.9. According to the formula presented above no less than 127.0, or 77%, of the total variance is due to G. Another way of putting this information is that there is a correlation of .88 (i.e., the square root of .77) between G and the total score. This is a somewhat paradoxical result, given that the relations between G and the individual tests are not particularly high. The standardized loadings (i.e., correlations) range between .14 and .76, but most of the loadings are quite low. Often the relation between the broad factor and a test is, furthermore, higher than is the relation between the G-factor and the test. However, there is a rather obvious reason why the G-factor is the dominating source of variance in the sum of scores, and it is that it is, to some extent, present in every test. According to the formula presented above the contribution of a latent variable is a function of the square of the number of variables to which it is related. Because the general factor is present in every one of the 24 tests it gets a weight, as it were, of 24² = 576, whereas the Gv-factor, for example, to which only 4 tests are related, gets a weight of 16. The contribution of Gv to the variance in the sum of scores is 3.1, or 1.9%. From Gc (5 tests) the contribution is 8.7 (5.2%); from Gs (4 tests) it is 3.1 (1.9%); and from Gy (6 tests) it is 6.4 (3.9%). These results thus show a striking dominance of the G-factor in the sum of scores, and quite limited contributions from the group-factors.
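The arithmetic behind these percentages is easy to reproduce. The following sketch uses hypothetical loading values, not the actual Holzinger-Swineford estimates; the point it illustrates is only that a factor's contribution is the square of the sum of its loadings, so a factor present in all 24 tests dominates the composite.

import numpy as np

# Orthogonal (nested-factor) loading pattern: G loads on all 24 tests,
# each group factor only on its own subset. Values are placeholders.
factors = {"G": range(0, 24), "Gf": range(0, 5), "Gc": range(5, 10),
           "Gv": range(10, 14), "Gs": range(14, 18), "Gy": range(18, 24)}
lam = np.zeros((24, len(factors)))
for j, (name, tests) in enumerate(factors.items()):
    lam[list(tests), j] = 0.5 if name == "G" else 0.4

psi = 1.0 - (lam ** 2).sum(axis=1)   # residual variances of the z-scored tests
contrib = lam.sum(axis=0) ** 2       # (column sum of loadings) squared
total = contrib.sum() + psi.sum()    # variance of the unit-weighted sum

for name, c in zip(factors, contrib):
    print(f"{name}: {c:6.1f}  ({100 * c / total:4.1f}% of total {total:.1f})")

Even with these modest placeholder loadings, G accounts for roughly four fifths of the composite variance, mirroring the pattern reported above.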
The influence from the residuals also is very much reduced in the sum, and together they only account for 6.9% of the variance.

It is interesting to compare the properties of the total sum of scores as a measure of Gf with the properties of the Gf-tests. The single best Gf-test is the Series Completion test, with a loading of .76 on G. This implies that about 58% of the variance in this test is due to G, and it seems that this is about as high as it is possible to get with a single measure (Gustafsson, 1998). The other Gf-tests have loadings on G around .6, which implies that no more than some 40% of the variance in these is due to G (or Gf). However, a sum of scores on the five Gf-tests in the Holzinger and Swineford battery has a variance which to 76% is due to Gf, and this is close to what is obtained from the entire test battery. Such a score has a better face validity as a measure of reasoning ability than has the sum of scores on the entire test battery, but it still would be derived from quite different types of tasks. It may, in passing, be noted that these results explain the success of Binet's approach to measuring general ability with heterogeneous tests, as compared to the relatively modest amount of success met by Spearman's attempts to measure general intelligence with a single Gf-test.

The total sum of scores on the test battery thus is a fairly homogeneous measure of G, and also of Gf. The fact that a heterogeneous mixture of tasks of all types may yield a fairly pure measure of a single ability, much as the mixture of all colors yields white, is not easily realized, and it certainly eludes us in an ocular inspection of a test battery. Heterogeneous tests thus are seriously lacking in face validity. It should, however, be emphasized that the aggregation phenomenon demonstrated here has been identified before. Thus, Humphreys (1962, 1985) argued that measurement of general intelligence is best done with heterogeneous tests, because such tests do provide a homogeneous measure of general intelligence. It also may be pointed out that a similar line of reasoning as that used here can be applied to explain why the general factor is so highly predictive of performance over a wide range of situations. Although any single task performed in school, on the job, or in daily life has a low relation to G, the fact that there are so many tasks on which performance to some extent depends on the general factor explains why, in the long run, there will be a relation between aggregated measures of performance and test performance.
The second proposition is that one and the same test simultaneously measures multiple constructs. This statement seems to be completely at odds with current views on the ... The proposition is closely related to ... the broad factors that ma... ...
Table 5.2
Standardized Mean Differences between Schools on Latent Variables in Oblique and Nested-Factor Models
[table values not recoverable; positive values indicate a higher mean for the Pasteur Elementary School]
... overlapping components, the three measures would appear to converge in the measurement of the overall complex construct. Yet each measures some aspect of the construct. A composite of the three would cover all three aspects and, furthermore, the construct-relevant variance would contribute to the composite score while the irrelevant variance would not. (p. 35)
... (Gustafsson & Balke, 1993). According to this model the observed variables are exchangeable indicators of the latent construct, and there is only a probabilistic relation between the observed and latent variables. For a construct such as general intelligence such a reflective, "open," model seems more appropriate than a componential, "closed," model. These subtleties aside, it seems that construction of heterogeneous measures is a way around the problem of construct underrepresentation in the measurement of high referent generality constructs. The problem of construct-irrelevant variance, which is the other major threat to construct validity discussed by Messick (1989), does not, however, appear to be a major problem in a well constructed heterogeneous measure, because of the aggregation effects. However, when a more or less homogeneous test is used to measure a high generality construct, such as when the Raven Progressive Matrices Test is used to measure the G-factor, the problem of construct-irrelevant variance, in the form of test-specific variance, seems to be the major threat to validity. When a low generality construct is the intent of measurement (Wiley, 1991), construct-irrelevant variance seems to be the major problem too, but now contributed by more general sources of variance. At least this follows if a reflective, hierarchical, measurement model is assumed, because according to such a model variance from high referent generality constructs contributes to variance in the measure. If the measure is interpreted as a measure of the low referent generality construct only, the variance due to the more general constructs takes the form of construct-irrelevant variance.

The general sources of variance are only rarely conceptualized in terms of construct-irrelevant variance. It is, however, interesting to observe that Thorndike (1951), in the chapter on reliability in the first edition of Educational Measurement, made a classification of sources of variance in test scores in terms of referent generality, from general to specific, and in terms of "lastingness." This model is close to the one presented here, and it follows that when a particular construct is in the focus of interest, variance from other sources would be irrelevant.

The fact that measures of low referent generality constructs may capture more general sources of variance is, fortunately enough, often realized when interpretations of findings are made. If, for example, a study establishes a correlation between a measure of spatial visualization ability and reading comprehension performance, the interpretation would typically not be couched in terms of, say, the demands on spatial visualization ability by the processing of letters and symbols of punctuation
as figural elements. A more reasonable interpretation would be that the correlation is accounted for in terms of an unobserved third variable, such as general cognitive ability, which is related both to performance on the visualization test and the reading comprehension test. Thus, as long as measures of the more general abilities are included, the construct-irrelevant variance may be kept under control, and even when the more general sources of variance have not been measured they may be brought in via theoretical deliberations. But much may be gained if we also conceptualize the problem as a problem of construct validity, because this would make the interpretational problems more obvious, and it would stimulate development of appropriate methods of controlling for the more general sources of variance (cf. Gustafsson & Snow, 1997).

Before leaving this topic it must be emphasized, however, that the line of reasoning presented here is meaningful only when the measures of low referent generality constructs are interpreted as signs (Loevinger, 1957) of unobservable abilities. If the test scores instead are seen as samples from a domain, it does not make much sense to divide the variance in performance into sources of different degrees of generality. Thus, in this case we have to regard the performance measures as measures of an undivisible construct.

Let me, finally, make a few comments about the choice of a hierarchical measurement model instead of a more traditional measurement model. It must, thus, be emphasized that compared to an oblique measurement model a hierarchical model is more restrictive, and it is based on an additional set of assumptions. Thus, if the hierarchical model fits the data, and the assumptions are not being violated, such a model is to be preferred because it is more informative than is a less restrictive model. The problem is, of course, that it is not always clear which assumptions are imposed, and how they may be tested. The ordinary first-order factor model for relations between observed and latent variables is an additive, linear and compensatory model, and a higher order model imposes the same assumptions for the relations between higher order and lower order factors. Normally, however, a factor analyst does not test these assumptions, and carrying the factor analysis into the higher order realms does not imply any increased concern with these problems. In future work these issues should, however, be given more attention. The measurement issues discussed here are implications of a particular theoretical model of the structure of cognitive abilities, and if this theoretical model is not accepted, these measurement implications will also be challenged. In fact, the question whether general intelligence exists or
not may itself make researchers hesitant to adopt a hierarchical model ... that the results from ...
Anderson, M. (1992). Intelligence and development: A cognitive theory. Oxford: Blackwell.
Binet, A. (1905). Analyse de C. E. Spearman, "The proof and measurement of association between two things" et "General intelligence objectively determined and measured." L'Année Psychologique, 11, 623-624.
Binet, A., & Simon, T. (1905). New methods for the diagnosis of the intellectual level of abnormals. L'Année Psychologique, 11, 191-244.
Burt, C. (1949). The structure of the mind: A review of the results of factor analysis. British Journal of Educational Psychology, 19, 100-111, 176-199.
Carroll, J. B. (1982). The measurement of intelligence. In R. J. Sternberg (Ed.), Handbook of human intelligence (pp. 29-120). New York: Cambridge University Press.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge: Cambridge University Press.
~ e a s ~ eFrom ~ e a~Hierarcheal t Point of View 93 Carroll, J. B. (1996). A three-stratum theory of intelligence: Spearman's contribution. In I. Dennis & P, Tapsfield(Eds.) Ekman a ~ ~ t i eTbeir s . natzmand me~~mment (pp.1-17). Mahwah, New Jersey: Lawrence Erlbaum Associates. Cattell,R.B. (1963). Theory of fluidandcrystallized intehence: Acriticalexperiment. ~ o ~ o f~ E a~ cZa ~ o n a ~ P ~ c54, b o1-22. Zo~l Coan,R. W. (1964). Facts,factorsandartifacts:Thequest for psychological mean~g. P~cboZo~ca~ ~ e 71,123-1 ~ 40.] Cronbach, L. J. (1975). Five decades of public controversy over mental testing. herican PgcboZo~s~ 30,l-14. Cronbach, L. J. (1988). Five perspectives on the validity argument. In H. Wainer 8r H. I. Test v a (pp. ~3-17). Hillsdale, ~ ~ New Jersey:LawrenceErlbaum Braun (E3ds.) Associates. e s nutgm . and ~ e ~ ~ ~ mMahwah, e n t . New Tapsfield, P.(1996). man a ~ ~ ~Tbeir nce Erlbaum Associates, DuBois, P.H. (1970). A bistooly ~ P ~ c b o ~ ~ c a ZBoston: t e s ~ ~ Myn g . & Bacon. S in 14 nations: What IQ tests really measure. Pgcbo Flynn, J. R,(1987). Massive IQ ~ ~ Z Z101,171-191. e~~, Gould, S.J. (1981). Tbe is me^^^ o f ~ a nNew . York Norton. binteZ~ge~ce. ~ New ~ YorkaMcGraw-Hill. ~ Guilford,J. P.(1967). The n a t m ~ e s t s York , John Wiley. Gdiksen, H, (1950). T b e o ~ ~ ~ e n t a Z tNew Gustafsson, J.-E. (1984). A uni+g model for thestructure of intellectualabilities. InteZ~gence,8,179-203. Gustafsson,J.-E. (1997, July 14-16). On the hierarchical structureof ability and personali~. Paper presented at the Second Spearman Seminar, Plymouth, England. Gustafsson, J.-E. (1988). Hierarchical models of individual differences in cognitive abilities. o hman ~ ~ i~teZ~ge~ce. Vol, 4, (pp.35-71). In R. J. Stemberg, Ahances in tbe p ~ c b of Hillsdale, New Jersey: Lawrence Erlbaum Associates, Inc. Gus~fsson,J.-E. (1998). M e a s ~ and g u n d e r s ~ G: ~ gExperimental and correlational n g approaches.In P. L.Ackerman, P. C. Kyllonen, & R. D. Roberts (Eds.) ~ a ~ i am# i ~ ~ ~ d ~ a Z Pmcess, ~ ~ e ~trait n cand e content s. ~ t e ~ ' n a(pp. ~ t 275-291). s W a s ~ ~ oD.C.: n, American Psychological Association. Gustafsson, J.-E., 8r Bake, G. (1993). General and specific abilities as predictorsof school achievement. ~ ~ Z ~~eba~oraZ v ~ ~ t~searcb, e 28,407-434. Gustafsson,J.-E., 8r Snow, R.E, (1997). Ability profiles. In R. F. Dillon, (Ed.) andb boo^ on t ~ s ~ (pp. n g 107-135). W e s ~ oConnecticut: ~, Greenwood Press. User? G&%, Version 1.7. Molndal, Gustafsson, J.-E., & Stahl, P. A. (1997). S~~ Sweden: Multiv~teWare. Gustafsson,J. E., & Undheim, J. 0. (1996). Individual differencesin cognitive functions.In D.Berliner & R. Calfee (Eds.), ~ a n ~ b oo fo~~~ c a ~ o nPgcboZo~ aZ (pp. 186-242). New York M a c ~ a n . s . e ~ ~ oChicago: n. The Universityof Chic Harman, H. H. (1967). ~0~~ factor a n a ~ ~'2nd Press. 9,257-262. Holzinger, K. J. (1944). A simple method of factor analysis. Pgcbo~et#~a, Holzinger, K. J., & Swineford, F. (1939). A study in factor analysis: The stability of a bifactor solution, S ~ p ~ ~ ~ e~ nc ta ~~ ~oo nn oag r~~ bNo. s , 48. Chicago: ~ e p ~ eofn t Education, U~versity of Chicago. of thetheory of fluidand Horn, J. L., & Cattell,R.B. (1966). Refmementandtest ~ ~ , crystallized intelligence.~ o ~ o f~ E a~ cZa ~ o P~ a~ Zc b o 57,253-270.
Humphreys, L. G. (1962). The organization of human abilities. American Psychologist, 17, 475-483.
Humphreys, L. G. (1985). General intelligence: An integration of factor, test and simplex theory. In B. B. Wolman (Ed.), Handbook of intelligence: Theories, measurements, and applications (pp. 201-224). New York: John Wiley & Sons.
Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109-133.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694 (Monograph Supp. 9).
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Lubinski, D., & Humphreys, L. G. (1997). Incorporating general intelligence into epidemiology and the social sciences. Intelligence, 24(1), 159-201.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Messick, S. (1992). Multiple intelligences or multilevel intelligences? Selective emphasis on distinctive properties of hierarchy: On Gardner's Frames of Mind and Sternberg's Beyond IQ in the context of theory and research on the structure of human abilities. Psychological Inquiry, 3(4), 365-384.
Neisser, U. (1976). General, academic, and artificial intelligence. In L. Resnick (Ed.), The nature of intelligence (pp. 134-144). Hillsdale, NJ: Lawrence Erlbaum Associates.
Pedhazur, E. J., & Pedhazur Schmelkin, L. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
Reuterberg, S.-E., & Gustafsson, J.-E. (1992). Confirmatory factor analysis and reliability: Testing measurement model assumptions. Educational and Psychological Measurement, 52, 795-811.
Rosén, M. (1995). Gender differences in structure, means and variances of hierarchically ordered ability dimensions. Learning and Instruction, 5, 37-62.
Scarr, S. (1989). Protecting general intelligence: Constructs and consequences for interventions. In R. L. Linn (Ed.), Intelligence: Measurement, theory and public policy (pp. 74-118). Urbana: University of Illinois Press.
Snow, R. E. (1974). Representative and quasi-representative designs for research on teaching. Review of Educational Research, 44, 265-291.
Snow, R. E., Corno, L., & Jackson III, D. (1996). Individual differences in affective and conative functions. In D. Berliner & R. Calfee (Eds.), Handbook of educational psychology (pp. 243-310). New York: Macmillan.
Spearman, C. (1904a). The proof and measurement of association between two things. American Journal of Psychology, 15, 72-101.
Spearman, C. (1904b). "General intelligence," objectively determined and measured. American Journal of Psychology, 15, 201-293.
Spearman, C. (1927). The abilities of man. London: Macmillan.
Terman, L. M. (1916). The measurement of intelligence. Boston: Houghton-Mifflin.
Thorndike, R. L. (1951). Reliability. In E. F. Lindquist (Ed.), Educational measurement (pp. 560-620). Washington, DC: American Council on Education.
Thorndike, R. M., & Lohman, D. F. (1990). A century of ability testing. Chicago: Riverside Publishing Company.
Thurstone, L. L. (1938). Primary mental abilities. Psychometric Monographs, No. 1.
Thurstone, L. L. (1947). Multiple factor analysis. Chicago: The University of Chicago Press.
~easurementFrom a HierarchicalPoint of View 95 Undheh, J.0. (1981). On intelligence 11: A neo-Spearmanmodel to replaceCattell's f u ~ , theory of fluid and crystallized intehgence.~ c a n & n ~ ~ a n~~~u~~ c~ ha of 22,181-187. Undheim, J. O., 8c Gustafsson, J. E. (1987). The hierarchicalorganization of cognitive abilities:Restoringgeneralintelligencethroughtheuseoflinear structural relations (LISmL). ~ ~ f ~~ e~h a ~a o r~Research, a f a ~ 22,149-1 e 71. ~~ ~ h a ~c ~~ ~London: ~e s~ . ~ Methuen. a ~ ~ Vernon, P.E. (1950). The s h a ~ ~~~ j2nd e s ~e. & h . aLondon: ~ Methuen. Vernon, P.E. (1961). The s&&m ~ e a~ a~ d j ~~~~ ~ f f ~~~g Baltimore: e n~c e . e Willims ~ ~ 8c Wilkins. Wechsler, D. (1939). The ~ Wiley, D. E. (1991). Test validity and invalidity reconsidered. In R.E. Snow 8c D. E. Wiley j ~ science (pp. 75-107).Hillsdale, New Jersey: Lawrence (Elds.). I ~ ~ j ~n n~ in~gsocial Erlbaum Associates.
... the model proposed by Costa and McCrae ... research, he concludes ... first-order dimensions, or it is possible that the model does not cover ... higher-order "superfactors" that account for a ...
A reader may well wonder how I came to explore the field of personality, given that in recent years I have devoted most of my efforts to the structure of cognitive abilities. Several years ago I was asked to review for Contemporary Psychology a large volume, the International Handbook of Personality and Intelligence, edited by Saklofske and Zeidner (1995). With this volume, Saklofske and Zeidner hoped to promote the integration of personality and intelligence as fields of research. In my review (Carroll, 1997), I noted that the contributors of essays in the volume often had difficulty in bringing the fields together, neglecting to point out similarities and differences between them or ways in which they might cooperate. Indeed, some of the contributors treated the fields as if they were almost completely separate. Few if any of them concerned themselves with comparing them, as they might have done, for example, with respect to the treatment of cognitive ability. In the field of intelligence per se, the development of multilevel or hierarchical structures has been a prevalent strategy. In the field of personality, intellect has often been treated as one of the factors in a so-called Big Five model favored by many researchers.

In any case, through reading and studying Saklofske and Zeidner's edited book with the hope of writing a competent review, I became aware of a major controversy about the place of the Big Five model in personality research. This controversy has been brought to a head in various ways. First, there has been a long series of publications promoting the Big Five model by Costa and McCrae (1988, 1992a, 1995; McCrae & Costa, 1987), who appear to be among its major proponents. They (Costa & McCrae, 1985, 1989, 1992b) have published a personality questionnaire based on the five-factor model (FFM), the NEO Personality Inventory, with a revision and a shortened form. This questionnaire has been widely used in clinical practice and research. A book published by the American Psychological Association (Costa & Widiger, 1994) seeks to show how the FFM applies to the classification and treatment of personality disorders and how it can be a supplement to the Diagnostic and Statistical Manual of Mental Disorders (DSM-III; American Psychiatric Association, 1980). Several reviews in the Annual Review of Psychology (Wiggins & Pincus, 1992; Ozer & Reise, 1994) have given the FFM status as a highly favored basis for the analysis of personality; only the authors of the most recent of these reviews, Butcher and Rouse (1996), are strongly critical, presenting various arguments and empirical evidence
against the five-factor model. According to them, "[m]any recognize the Five-Factor Model as too superficial to help much in clinical assessment, which requires more refined and broadened personality and symptom foci than are provided through the narrow lens of only five factors" (p. 103).

Only in later phases of my literature review did I come across the most useful references I have found: edited volumes by Strack and Lorr (1994) and by Wiggins (1996), the former devoted to differentiating normal and abnormal personalities by means of various trait models, including the Big Five, and the latter expressly concerned with defending the five-factor model against its critics. My only complaint about the latter book is that the authors neglect to say enough about possible criticisms of the model. Finally, an impressive treatise on personality (Hogan, Johnson, & Briggs, 1997) devoted five of its 36 chapters to the Big Five model and its dimensions, and McCrae and Costa (1997b) suggested that the five-factor personality trait structure is a human universal because it has been found to occur in a number of different languages, Indo-European and others. They even cite Edward Sapir (1921) as holding that thought is structured by the language one speaks, just missing citing Benjamin Lee Whorf and the "Whorfian hypothesis" (Carroll, 1956).

In preparing my review of the Saklofske and Zeidner volume, I noted that Boyle et al. (1995) took a strong position against the FFM, in stating that "both Cattell and Eysenck are in complete agreement that studies of the so-called big five are scientifically unacceptable" (p. 431). The authors of several other chapters in the volume, however, devoted considerable space to the FFM, without expressly criticizing its contribution to research in personality structure. I was sufficiently intrigued by this controversy to decide to study the literature about the FFM (and other models) for myself, in order to form an independent opinion about it. In my review (Carroll, 1997) of the Saklofske and Zeidner book I complained that it failed to contain material that would serve to resolve the controversy. When I received an invitation from the Educational Testing Service to participate in a conference honoring Sam Messick, the topic seemed tailor-made to discuss at that conference, given Messick's interest in the study of personality. I thought it might be especially useful if I could develop my views from the standpoint of work I have done in the cognitive ability domain.
,has b ~ g e o n e dtremendou asily able to id en ti^ several dozen articles the last 3 or 4 years; through CO eir ~ e f ~ e nlists, c e I haveamassedandstudie re~erences.
In this literature, trait ratings were usually dealt with by computing a matrix of correlations among the rated variables; it was then possible to use this matrix in doing factor analyses, with attention given mainly to loadings greater than about .30. I noted at least one instance in which the Schmid and Leiman (1957) procedure for orthogonalizing hierarchical factor solutions, a procedure that I had frequently used in the cognitive ability domain, was applied to such data.
Such studies identified the so-called "big five." For the benefit of those readers who are not familiar with it, the FFM is a statement of what are regarded as five major independent and "robust" factors that can be identified, by factor analysis of ratings or other instruments for measuring personality, with the implication that these five factors encompass the overwhelming majority of the variance needed to describe individual differences in personality. Names were chosen for these factors to describe them in a concise manner; these names, with symbols, are as follows:

I: Extraversion/Introversion, or Surgency .............. E
II: Friendliness/hostility, or Agreeableness ........... A
III: Conscientiousness, or Will ........................ C
IV: Neuroticism/Emotional Stability
    (or Emotional Stability) ........................... N
V: Intellect (or Openness) ............................. O
The symbols, rearranged, yield the mnemonic OCEAN. As shown above, all five factors are claimed to be needed to describe personality and to guide the classification and treatment of personality disorders. Digman's (1990) review was entitled Personality Structure: Emergence of the Five-Factor Model; as one who favored (or had been converted to favoring) the FFM, he presented the model as exemplary, a model not to be ignored by reviewers and researchers. The five factors could also be compared with those yielded by Cattell's (1957) 16-factor system; thus, the model could be examined in further studies of the personality domain. It was the five factors (named in the title of the article) that emerged as "recurrent" across studies, even when more factors appeared in one analysis or another.
Occasional failures to recover all five factors seem to be ignored. These failures could usually be attributed to deficiencies in the sampling of variables. Still, one must raise the question of how well the various factors, and the lower order variables that support them, can be matched across studies and samples. For the most part, the names assigned correspond to those assigned by Tupes and Christal (1961): Surgency, Agreeableness, Dependability, Emotional Stability, and Culture. Even so, the table shows discrepancies in the names assigned to factors. For example, it might be difficult to conclude that factors named surgency, assertiveness, and positive emotionality are all the same basic factor.
Different studies use somewhat different sets of lower order variables to define the factors, but I cannot undertake here any detailed examination of this variation. My impression is that the choice of these lower order variables can make a difference in the factors obtained. There has also been controversy about the interpretation of factors, particularly Factor V. In a chapter devoted to this factor, McCrae and Costa (1997a) defended the interpretation of Openness as openness to experience, treating the concept of openness as a mental phenomenon.
Coming to these questions from the cognitive ability domain, I feel like an observer of the manners of two very different fields. Similar variations occur in the research on models with more than five factors: Tellegen (1993) developed a seven-factor model, and a somewhat different set of seven factors emerged from a study of personality descriptors in Spanish (Benet & Waller, 1995).
Pervin (1994) offered a critical analysis of current trait theories. He was impressed by claims that personality traits are influenced by genetics and that they are stable over time. He was particularly critical of claims for agreement on a "magic number five, plus or minus two" (p. 305). According to him, "despite heroic efforts to map traits from different schemes onto one another and suggestions that there is substantial agreement in this regard, questions remain concerning the comparability of factors across instruments and data sources" (pp. 105-106). Furthermore, "rather than being a serviceable system, [the tr]ait model is ... fundamentally flawed in terms of its ability to come to [grips] with the issues of personality dynamics and personality pattern and organization" (p. 111).

Block (1995a) offered a strongly negative, "contrarian" view of what he called "the five-factor approach [FFA; my emphasis]" because he doesn't believe that it merits the term model. His review was published in Psychological Bulletin, along with replies by advocates of the FFM (Costa & McCrae, 1995; Goldberg & Saucier, 1995), as well as a rejoinder in which Block (1995b) continued to disparage the model. Block had misgivings about factor-analytic studies of personality. For one thing, he claimed that the algorithmic methods of factor analysis may not be conducive to the finding of personality structure:

...It is the personality structure of an individual that, energized by motivations, dynamically organizes perceptions, cognitions, and behaviors so as to achieve certain "system" goals. No functioning psychological "system," with its rules and bounds, is designated or implied by the "Big Five" formulation; it does not offer a sense of what goes on within the structured, motivation-processing, system-maintaining individual. ...the frequent presence and the powerful effects [of prior prestructuring] are often not sufficiently recognized by those using exploratory factor analysis. ...[i]nfluential demonstrations of the FFA may have been unduly influenced by prior prestructuring of the personality variables used in these analyses. If so, then the "recurrence" and "robustness" over diverse samples of factor structure may be attributable more to the sameness of the variable sets used than to the intrinsic structure of the personality-descriptive domain. (Block, 1995a)
Block further worried about often-cited problems with factor analysis: deciding on the proper number of factors, factor rotations, interpreting factors, the effects of merging samples from different populations, the selection of factorial models, and so forth. Actually, I believe that many of these problems (except that of assessing the number of factors; see the
following) have been reasonably well resolved in current factor-analytic methodology. In my opinion, Block's mistrust of factor analysis goes too far. Already several research studies (e.g., Saucier & Goldberg, 1996) have strongly suggested that biases resulting from what Block called prestructuring have not occurred. It is possible, however, that many current investigators in the personality domain may be unaware of technical problems with their analyses. Block correctly noted the problem that the Big Five dimensions may have been limited by the samples of adjective variables used, as Tellegen (1993) also suspected. As Block wrote:
An infinite number of sets of descriptive variables can be formulated, each being preferred by its progenitor and contestable otherwise. What is needed is a basis for choosing among these alternative sets. Efforts to study or conceptualize the dynamics underlying intraindividual functioning might well move the study of personality toward such a basis. (p. 210)
These sound like good ideas, but in the minds of some of his critics (e.g., Goldberg & Saucier, 1995), Block failed to suggest how a reasonable basis for choosing appropriate sets of variables might be established.
I have a number of suggestions about personality research, looking at it from the standpoint of experience with research in the cognitive abilities realm. First, I believe that there are lessons to be learned from that research, where the most successful and informative studies have come from efforts to find new factors and to determine the structure of new domains in relation to domains already established. This means that in the personality field, more attention needs to be paid to studying the factors already established, be they five, six, seven, or whatever, with the object of (a) determining whether any of the factors need to be more finely divided, and (b) determining what aspects of personality are poorly covered by those factors. This would presumably lead to the generation of lists of variables in the personality domain that hold promise of revealing new factors or dimensions of personality. Research should be directed not at confirming that personality is describable in terms of a given number of dimensions, such as five, but toward charting further dimensions until it becomes apparent that all the necessary dimensions are in hand. Do the Big Five factors really provide for all the varieties of personality described in
the natural language? I was also impressed by Lykken's (1995) discussion of these issues, and by Goldberg and Digman's (1994) tutorial on exploratory factor analysis, which shows how personality variables can be arranged at several levels, with specific variables at the lowest. As can be seen from their exposition, the arrangement constitutes a hierarchical structure much like that long familiar in the cognitive ability domain.
Costa and McCrae (1992b) sought to measure the facet level in their Revised NEO Personality Inventory, but they established facet levels by item-analysis techniques rather than by more powerful factor-analytic procedures. They did, however, use factor analysis to establish factors at the domain level. Essentially, the expanded FFM presents a set of five second-order factors (or perhaps one might call them second-stratum factors, analogous to the second-stratum factors in the cognitive ability domain) in addition to a series of first-order factors, facets, and possibly also one or more third-order superfactors.

The popularity of the FFM has prompted me to think that my work with cognitive abilities might have been more exciting to the psychometric community if I had dubbed the second-stratum factors in the cognitive domain the "Big Eight": 2F, 2C, 2Y, 2V, 2U, 2R, 2S, and 2T (Carroll, 1993, p. 626). But because I am uncertain about how many second-stratum factors actually exist in the cognitive domain, I do not like to commit myself to some definite number prematurely. Similarly, if I were working in the personality domain, I would not like to tie myself down to the idea that it contains only five second-order factors.

From the hierarchical perspective, it may be said that disagreements between Cattell (1957) and Eysenck (1970) in their theories of personality structure might be resolved by pointing out that Cattell was primarily concerned with first-stratum factors, what are elsewhere called facets, whereas Eysenck was concerned mainly with second-stratum or even third-stratum factors.

Treatment of hierarchical data is not limited to exploratory factor analysis. It can also be handled in confirmatory factor analysis by postulating and testing factor patterns that specify different levels of a hierarchy. This procedure has been pioneered in the ability realm by Gustafsson (1984, 1988). It can easily be applied in the personality realm.

Since delivering my address at the September 1997 ETS conference, I have had an opportunity to consider further the nature of data typically found in the personality realm and to formulate factor-analytic procedures suitable for analyzing such data. I have been able to test my procedures by applying them to a dataset supplied to me by Digman, previously analyzed and reported on in an article published by Digman and Inouye (1986; see below).
As noted previously, typical data in the personality research realm can be hierarchically structured, with facets at the first order and domains at the
second order (Goldberg & Digman, 1994). In investigating simulated data, I have discovered that when factor loadings of facets on domains are generally higher than factor loadings of variables on facets, as they can be in the personality realm, standard methods of estimating the number of factors that are worthwhile to analyze may fail. Specifically, the number of principal component roots that are greater than one may under these conditions give an estimate, not of the number of facets (first-order factors) in a set of data, but of the number of domains (second-order factors) in the data. Similarly, the Montanelli and Humphreys (1976) parallel analysis criterion may estimate the number of domains, not the number of facets, in a dataset. These facts imply that at least some published factor analyses of personality data may have been performed incorrectly and incompletely, in the sense that they have found loadings for domains, but not for facets. It may be the case that a given set of data has "five robust factors" (second-order factors masquerading as first-order factors), but that this does not inform one about facets (as true first-order factors).

To perform correct factor analyses in the personality realm, it is necessary first to estimate the number of facets (first-order factors) in the dataset. My results suggest that the number of facets can best be estimated by determining the maximum number of principal factors that properly and easily (in a relatively small number of iterations) converge to a strict epsilon criterion of .0005 for differences between successive communality estimates. Further computations involve exploratory factor-analytic procedures required to perform a Schmid-Leiman orthogonalization of factor matrices produced at two or more levels of a hierarchy. Given results of this procedure, confirmatory factor analyses using LISREL or other appropriate computer programs are done to refine the total structure and to establish significance of the structure and its elements. A full account of the details of these procedures is beyond the scope of this chapter. However, the procedures may be illustrated as applied to the Digman and Inouye (1986) correlation matrix that was supplied by Digman.
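To make this convergence criterion concrete, the following sketch shows one way to implement it with iterated principal-axis factoring. It is a minimal illustration in Python with NumPy, not Carroll's original program (which is not given in the chapter); the function name, the starting values, and the iteration cap are my own assumptions.

```python
import numpy as np

def iterations_to_converge(R, k, eps=5e-4, max_iter=1000):
    """Iterated principal-axis factoring of correlation matrix R with k factors.
    Returns the number of iterations needed for successive communality
    estimates to differ by less than eps, or None if there is no convergence."""
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))   # start from squared multiple correlations
    for it in range(1, max_iter + 1):
        Rh = R.copy()
        np.fill_diagonal(Rh, h2)                 # reduced correlation matrix
        vals, vecs = np.linalg.eigh(Rh)
        idx = np.argsort(vals)[::-1][:k]         # k largest eigenvalues
        lam = vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))
        h2_new = np.sum(lam ** 2, axis=1)        # new communality estimates
        if np.max(np.abs(h2_new - h2)) < eps:    # the strict .0005 criterion
            return it
        h2 = h2_new
    return None
```

On this logic, the number of facets is estimated as the largest k for which convergence is reached within a small number of iterations, in contrast to roots-greater-than-one or parallel-analysis criteria, which in the simulations described above tracked the number of domains instead.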
The data consisted of the matrix of Pearsonian correlations among 43 personality trait ratings made by teachers on 499 grade-school children in Hawaii. According to Digman and Inouye (1986, p. 118),
the resulting distributions of ratings "may be described as quasi-normal, generally symmetric, and with very comparable means and standard deviations."

Conventional procedures in the exploratory factor analysis of this matrix would have yielded five or possibly six factors, in that there were five principal component roots greater than unity, and the Montanelli and Humphreys (1976) parallel analysis procedure for principal factoring indicated that there were six factors worth analyzing. However, on the basis of the simulation studies mentioned earlier, it was conjectured that these estimates were for the number of second-order factors, not for the number of first-order factors. The number of first-order factors was estimated provisionally by performing exploratory factoring for 3 to 15 factors, choosing the number of useful factors as the highest number that provided convergence for estimated communalities without an excessive number of iterations. Convergence was defined as occurring if the maximum difference between successively estimated communalities was less than .0005. To illustrate with the Digman and Inouye dataset, for 7 to 12 factors, convergence occurred with 9, 13, 10, 9, 14, and 49 iterations, respectively. For 13, 14, and 15 factors, convergence occurred with 165, 14, and 115 iterations, respectively. A conservative estimate of the number of useful factors would have been 11 or 12, but initially the more liberal estimate, 14, was used.

Using Jöreskog and Sörbom's (1989) LISREL 7 program, I estimated the number of factors required to fit the underlying correlation matrix; thus, in the model parameter line, PH=ST was specified. Initially, a pattern of loadings on 14 factors was proposed, based on the pattern of high loadings in the corresponding varimax-rotated matrix computed by exploratory procedures. This and some of the succeeding runs failed, presumably because it was impossible to fit the data when too many factors were specified. In successive runs, guided by the modification indexes provided by the program, it became apparent that no more than 10 first-order factors (facets) could be supported. With 10 factors, the value of chi-square with 796 degrees of freedom was 1774; the Goodness of Fit Index (GFI) and Adjusted Goodness of Fit Index (AGFI) were .848 and .819, respectively; and the Root Mean Square Residual (RMSR) indicated relatively small differences between the correlations reproduced from the fitted matrix and the actual correlation matrix.

The resulting first-order factor matrix, with its second-order matrix of correlations among the 10 first-order factors, was used in the
next stage of the analysis. A Schmid-Leiman orthogonalization yielded, together with the first-order matrix, a second-order factor matrix for five factors, which furnished starting values for a hierarchical confirmatory model. In subsequent runs, the choice of postulated pattern entries was based on these results and on the modification indexes, with unsupported entries dropped. Such specification searches risk capitalizing on chance (MacCallum, 1986), although the sample
used here (N = 496) was relatively large. Furthermore, in the final stage of the present analysis, the LISREL model was tested against a correlation matrix from each of two virtually random independent samples (i.e., the odd- and the even-numbered cases). Any nonzero parameter found to be nonsignificant in either of these two samples was dropped and replaced by zero. The resulting factor matrix model, with 12 uncorrelated factors, is shown in Table 6.1. These factors, and their labels, are arranged to show their provenance in the Schmid-Leiman matrix that was used to start the computations. There were two superfactors, labeled SF1 and SF2; five domain factors, labeled +O, -N, +C, -E, and +A, that were interpreted as identical or similar to the factors in the so-called five-factor model; and five facets, labeled f3, f4, f7, f8, and f12, concerned with highly specific varieties of behavior. It is interesting to note that the two superfactors accounted for by far the largest amounts of variance in the data, certainly more than the variance accounted for by the "standard" five factors. For this matrix, the fit to the data matrix was indicated by the following statistics: chi-square with 759 degrees of freedom was 1370; GFI = .887; AGFI = .860; RMSR = .084.
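The Schmid-Leiman transformation that produced the starting matrix can be sketched compactly. Assuming a first-order pattern (variables by facets) from an oblique rotation and a second-order pattern (facets by domains) obtained by factoring the correlations among the first-order factors, the orthogonalization distributes each variable's variance across uncorrelated domain factors and residualized facet factors. The code below is a hypothetical illustration under those assumptions, not the program actually used in the chapter.

```python
import numpy as np

def schmid_leiman(F1, F2):
    """Schmid-Leiman orthogonalization for a two-level hierarchy.
    F1: first-order loadings (variables x facets), from an oblique rotation.
    F2: second-order loadings (facets x domains).
    Returns loadings of variables on the orthogonalized domain factors
    followed by loadings on the residualized facet factors."""
    u2 = 1.0 - np.sum(F2 ** 2, axis=1)           # facet uniqueness at the 2nd order
    domain = F1 @ F2                             # variables x domains
    facet = F1 * np.sqrt(np.clip(u2, 0, None))   # residualized facet loadings
    return np.hstack([domain, facet])            # full orthogonal loading matrix
```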
The factor matrix shown in Table 6.1 is not subject to the restriction whereby higher order factors are dependent on lower order factors, because the LISREL procedure chosen ensures that all factors are completely independent.
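Of the fit statistics reported above, the RMSR is the easiest to verify directly, because it depends only on the observed correlations and those reproduced by the factor model. A minimal sketch, assuming orthogonal factors so that the reproduced matrix is the loading matrix times its transpose:

```python
import numpy as np

def rmsr(R, L):
    """Root mean square residual between an observed correlation matrix R
    and the matrix reproduced from loadings L (orthogonal factors assumed),
    computed over off-diagonal elements only."""
    R_hat = L @ L.T
    np.fill_diagonal(R_hat, 1.0)                 # the diagonal is not fitted
    resid = R - R_hat
    mask = ~np.eye(R.shape[0], dtype=bool)
    return np.sqrt(np.mean(resid[mask] ** 2))
```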
Interpretation of the Factors. The factor matrix shown in Table 6.1 indicates how much each of the 43 variables measures each of the 12 factors that were found to be significantly present in the data. That is, each row of the table specifies the "loading" of the particular variable on each of the 12 factors listed. The loadings, whether they are zero or nonzero, can be thought of as the estimated weights of the factors in producing the scores or ratings on a particular variable. Most variables are shown to measure several factors. For example, variable 8 ("knowledgeable") has nonzero loadings on three factors: .705 on the superfactor SF1, .524 on the domain factor O ("Openness"), and .261 on facet f3. The problem is to arrive at a reasonable interpretation of the meaning of each factor. Typically, this is done by comparing variables having high loadings on a factor with variables having low or zero loadings on the factor, seeking to induce some general rule or characteristic, variations in which would appear to explain the contrasts in loadings. It should be noted that some variables were reflected in the process of factoring; the names of these variables have negative signs
affixed to them. A negative sign thus means "opposite of"; for example, variable -13 (-lethargic) is taken to mean "the opposite of lethargic," or "active."

It may be useful to consider first the domain factors +O (factor 2), -N (factor 5), +C (factor 6), -E (factor 10), and +A (factor 11) and the facets associated with them. The symbols for these factors are found in the row, near the top of the table, labeled "Symbol for Factor." The sign of the factor shows how it is oriented. Factor +O is oriented positively: high scores on the factor indicate high degrees of "openness"; -N is oriented negatively: high scores on the factor indicate lack of "neuroticism." In most cases, the variables loaded on a factor are listed in Table 6.1 in the order of the algebraic values of their loadings on it. Loadings greater than |.30| are printed in boldface.

Factor +O is interpreted as "openness to experience," the interpretation favored by a number of personality researchers (e.g., McCrae & Costa, 1997a). Some of the variables highly loaded on +O also have loadings on facet f3, "Knowledge," or on facet f4, "Creativity." For one of the variables loaded on facet f3, "sensible," the loading is significant but relatively small and negative. It is hard to find an explanation for this, except that it seems to indicate that persons rated as knowledgeable and "verbal" (with a large vocabulary) tend to be rated as less "sensible" than otherwise. All variables loaded highly on +O also have high loadings on the superfactor SF1, to be discussed below.

Factor -N is interpreted as "Not neurotic"; the loadings are relatively small. Associated with this factor is facet f12, possibly indicating lack of anxiety, in the sense that high scores on facet f12 indicate lack of tenseness, not being "concerned," and not being nervous. Some of the variables with high loadings on -N (-12, "not rigid," and +9, "adaptable") have high loadings on SF1; others, -43, "not restless," and -19, "not outspoken," have high loadings on SF2, discussed below.

Factor +C is the factor often interpreted as indicating Conscientiousness, but for these data it seems that this interpretation is not as appropriate as a more general term might be, even though +2, "conscientious," has a loading on it. Facets associated with this factor are interpretable as "neatness" and "non-eccentricity."
Factor -E is the opposite of the factor E often cited in the factorial literature as "Extraversion," possibly thus justifying the label "Introversion." It is not particularly well defined here: its loadings are for ratings characterizing persons in this fashion, there are no facets associated with it, and its variables also load on either or both of the superfactors SF1 and SF2.

Factor +A is the factor often cited in the literature as A, Agreeableness. Variables with high loadings on it include "not touchy," "considerate of others," and "not jealous." Some of the variables loaded on +A also load on facet f12, possibly interpretable as "non-anxious."

It is particularly important to find meaningful interpretations of the superfactors SF1 and SF2 because they account for large proportions of the common factor covariance (.386 for SF2 alone, in contrast to the proportion, .261, contributed by all domain factors and facets). If one arranges all variables in order of their absolute loadings on factor SF1, the highest 20 of the loadings include the following variables: socially confident, adaptable (.76), perceptive (.75), verbal (.73), original (.72), sensible (.72), knowledgeable (.70), curious (.68), imaginative (.68), planful (.66), not lethargic, not rigid (.62), not submissive (.56), persevering (.54), outspoken (.50), responsible (.49), assertive, energetic (.45), and mannerly (.45). In contrast, variables with near-zero loadings are: considerate of others (.17), restless (.11), rude (0), spiteful (0), eccentric (0), touchy (0), and impulsive (0). From these contrasts, superfactor SF1 refers to what may be called General Social Competence. Variables with high loadings describe various forms of social competence, whereas the variables with low or zero loadings are generally irrelevant to social competence.

Using a similar approach to the interpretation of superfactor SF2, I find the variables with the 20 highest loadings on it (in absolute magnitude) to include the following: not impulsive (.85), not restless (.81), not rude, doesn't fidget (.79), not spiteful (.77), not outspoken (.76), not assertive (.73), not jealous (.66), not complaining (.65), fluent (.65), not careless (.59), considerate of others (.58), not fickle (.58), conscientious, not gregarious (.54), not touchy, submissive (.53), not energetic (.53), and careful (.50). In contrast, traits having zero loadings on the factor, and consequently irrelevant to it, are: perceptive, verbal (large vocabulary), knowledgeable, adaptable, not fearful, and happy. One gets the impression that persons with high
scores on this factor are simply "nice people" who don't draw attention to themselves and who don't act in any antisocial ways. They are considerate of others, conscientious, and thoughtful. Persons with low scores on this factor would be those who tend to be antisocial, too assertive, too talkative, or otherwise objectionable. One might call this factor General Good-naturedness.
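The interpretive routine just illustrated, contrasting the variables with the highest absolute loadings on a factor against those with near-zero loadings, is mechanical enough to automate. A small sketch (the function and variable names are hypothetical):

```python
def describe_factor(names, loadings, n=10, near_zero=0.20):
    """List the n variables loading highest (in absolute value) on a factor,
    and those with near-zero loadings, to aid interpretation by contrast."""
    ranked = sorted(zip(names, loadings), key=lambda t: -abs(t[1]))
    high = [(v, round(l, 2)) for v, l in ranked[:n]]
    low = [v for v, l in ranked if abs(l) < near_zero]
    return high, low

# e.g., describe_factor(trait_names, L[:, 0]) for a superfactor such as SF1
```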
The above analysis of the Digman and Inouye (1986) dataset, yielding 12 factors at three levels of a hierarchy, appears to make sense. Because of the methodology used to arrive at it, its results are rather different from most of what may be seen in the personality trait literature, except, of course, with respect to the "recurrent" factors of the five-factor model. It is difficult to tell whether the two superfactors found here correspond, for example, to any of the factors found by Tellegen (1993) or others who have explored personality traits beyond those specified by the FFM. This is a question that can be settled only with other samples of individuals and variables similar to those employed here.

Note that the interpretations of superfactors SF1 and SF2 offered here differ considerably from the interpretations of higher order factors alpha and beta offered by Digman (1997). Undoubtedly the differences arise from differences in methodologies. In this chapter, the analysis depended on individual items to the extent that they play a role in defining factors, whereas Digman's (1997) methodology focused on factors without further consideration of individual items. The factorial structure found in the reanalysis presented here is more complex than that found in many recent studies of personality traits, because it appears that most variables measure, simultaneously, several different factors. This may make for difficulty in estimating persons' scores on the several factors. Further work will be needed on the best means of structuring sets of variables and computing reliable and construct-valid scores from them.
Research studies should attempt to come to better grips with the processes underlying the personality factors found in ensembles of behavioral traits. Rather than simply describing personality in terms of adjectives or short phrases, the variables used in questionnaires should attempt to get at the motives, attitudes, and beliefs underlying observable behavior habits. According to Johnson (1997):
Most psychologists regard "outer" (behavioral) traits as descriptions that need explanation and they assume that inner (emotional and cognitive) traits generate and therefore explain outer traits. Behavioral consistencies may be determined by the interaction of several emotional and cognitive traits. (p. 79)

Consider the dimension labeled Conscientiousness. According to Wiggins and Pincus (1994, p. 82), characteristic questionnaire items measuring this dimension include the following: (On the positive side:) "When I make a commitment, I can always be counted on to follow through." "I am a productive person who always gets the job done." (And on the negative side:) "Sometimes I'm not as dependable or reliable as I should be." "I never seem to be able to get organized."

But these items pertain only to observable behaviors. Could it be that more informative items could be constructed around the possible motives, attitudes, and beliefs underlying these behaviors? I am not sure whether research in personality has adequately identified and listed the motives, beliefs, and attitudes that lead people to behave in conscientious or unconscientious ways. Nevertheless, a chapter on conscientiousness and integrity written by Hogan and Ones (1997) may be helpful. According to them, conscientiousness is a complex scale. It contains at least three themes: (a) control and lack of impulsiveness; (b) orderliness, tidiness, and methodicalness; (c) hard work and perseverance. (Note that two of these themes correspond to facets f7 and f8 in the reanalysis of the Digman and Inouye dataset.) Hogan and Ones appeal to socioanalytic theory (Hogan, 1983) for an explanation of these traits in terms of their value for social conformity, the maintenance of group structure, a hierarchy of social statuses, and accountability and dependability of group members. It might be possible, therefore, to construct scales whereby subjects could be asked to rate the importance to them of maintaining group structure and the individual accountability of group members. Possibly this is already done in certain questionnaires for measuring honesty and integrity. For example, subjects might be asked to rate each alternative in the following: "When I have agreed to do a job, it is important for me to get it done, because (a) people
will think poorly of me if I fail, (b) it is important that people be able to plan on whether I will get a job done, (c) things work out better if my commitments can be counted on." The intention would be to construct scales whereby one could detect the strength of the motives and beliefs underlying each of the Big Five (and other) factors. It would be interesting to factor-analyze these scales to see whether the factors would correspond in some way to the Big Five (or Seven) factors, or to a new set of factors quite different from the Big Five.
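As a concrete sketch of what such an instrument might look like, the item just quoted could be represented as a stem with separately rated motive alternatives, with scale scores formed by averaging each motive's ratings across items. The labels and layout below are my own invented examples, not an existing questionnaire:

```python
# One hypothetical item from a motive-oriented conscientiousness scale.
item = {
    "stem": "When I have agreed to do a job, it is important for me to get it done, because...",
    "alternatives": {
        "social_evaluation": "people will think poorly of me if I fail",
        "dependability":     "it is important that people be able to plan on whether I will get a job done",
        "practical_utility": "things work out better if my commitments can be counted on",
    },
    "scale": (1, 5),   # each alternative rated for personal importance
}

def motive_scores(responses):
    """Average each motive's ratings across items; responses is a list of
    dicts mapping motive labels to 1-5 importance ratings."""
    motives = responses[0].keys()
    return {m: sum(r[m] for r in responses) / len(responses) for m in motives}
```

Scores of this kind, rather than the behavior reports themselves, would then be submitted to factor analysis.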
I see many good features in the Big Five model. It has been useful in bringing agreement on important and recurrent features of personality structure, and in organizing new research. But it is undoubtedly incomplete. Posing the question, "Is the five-factor model comprehensive from an evolutionary psychological perspective?" Buss (1996) opined, "From the current theoretical perspective, it is unlikely that the five factors alone will prove to be sufficient" (p. 203). As I have already mentioned, possibly the sixth and seventh factors of personality identified by Tellegen (1993) and others must be considered, and possibly there are still more factors of personality to be elevated to second-stratum status similar to that of the Big Five. Several of the Big Five factors seem overly complex, such that they need to be split into further factors. In particular, the factor of Openness to Experience seems unnecessarily ambiguous and complex. The work of Ackerman and Heggestad (1997) and Goff and Ackerman (1992) on the role of intellect, and what they call typical intellectual engagement, needs to be further developed.

It will be important, also, in reanalyzing the factor-analytic literature on the structure of personality, to add information on facets underlying the Big Five and other second-order factors, such as those identified in the sample reanalysis offered here. Above all, work on the three-level structure of personality, comprising facets, domains, and possible higher order factors, needs to be continued in the framework of normal science. I can eagerly endorse McCrae and Costa's (1996) assertion that the FFM is "desperately in need of elaboration" (p. 78). And there must be renewed focus on the motives, attitudes, and beliefs that may underlie different aspects of personality structure. Ten years from now, perhaps the Big Five model will have been laid aside, replaced by still another, more compelling model.
Ackerman, P. L., & Heggestad, E. D. (1997). Intelligence, personality, and interests: Evidence for overlapping traits. Psychological Bulletin, 121, 219-245.
Allport, G. W., & Odbert, H. S. (1936). Trait-names: A psycho-lexical study. Psychological Monographs, 47(1, Whole No. 211).
American Psychiatric Association. (1980). Diagnostic and statistical manual of mental disorders (3rd ed.). Washington, DC: Author.
Benet, V., & Waller, N. G. (1995). The Big Seven factor model of personality description: Evidence for its cross-cultural generality in a Spanish sample. Journal of Personality and Social Psychology, 69, 701-718.
Block, J. (1995a). A contrarian view of the five-factor approach to personality description. Psychological Bulletin, 117, 187-215.
Block, J. (1995b). Going beyond the five factors given: Rejoinder to Costa and McCrae (1995) and Goldberg and Saucier (1995). Psychological Bulletin, 117, 226-229.
Borgatta, E. F. (1964). The structure of personality characteristics. Behavioral Science, 9, 8-17.
Bouchard, T. J., Jr. (1995). Longitudinal studies of personality and intelligence: A behavior genetic and evolutionary psychology perspective. In D. H. Saklofske & M. Zeidner (Eds.), International handbook of personality and intelligence (pp. 81-106). New York: Plenum.
Boyle, G. J., Stankov, L., & Cattell, R. B. (1995). Measurement and statistical models in the study of personality and intelligence. In D. H. Saklofske & M. Zeidner (Eds.), International handbook of personality and intelligence. New York: Plenum.
Buss, D. M. (1996). Social adaptation and five major factors of personality. In J. S. Wiggins (Ed.), The five-factor model of personality: Theoretical perspectives (pp. 180-207). New York: Guilford.
Butcher, J. N., & Rouse, S. V. (1996). Personality: Individual differences and clinical assessment. Annual Review of Psychology, 47, 87-111.
Carroll, J. B. (Ed.). (1956). Language, thought, and reality: Selected writings of Benjamin Lee Whorf. New York: Wiley; Cambridge, MA: Technology Press of M.I.T.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New York: Cambridge University Press.
Cattell, R. B. (1947). Confirmation and clarification of primary personality factors. Psychometrika, 12, 197-220.
Cattell, R. B. (1957). Personality and motivation structure and measurement. Yonkers-on-Hudson, NY: World Book.
Costa, P. T., Jr., & McCrae, R. R. (1985). The NEO Personality Inventory manual. Odessa, FL: Psychological Assessment Resources.
Costa, P. T., Jr., & McCrae, R. R. (1988). Personality in adulthood: A six-year longitudinal study of self-reports and spouse ratings on the NEO Personality Inventory. Journal of Personality and Social Psychology, 54, 853-863.
Costa, P. T., Jr., & McCrae, R. R. (1989). The NEO-PI/NEO-FFI manual supplement. Odessa, FL: Psychological Assessment Resources.
Costa, P. T., Jr., & McCrae, R. R. (1992a). Four ways five factors are basic. Personality and Individual Differences, 13, 653-665.
Costa, P. T., Jr., & McCrae, R. R. (1992b). Revised NEO Personality Inventory (NEO PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.
Costa, P. T., Jr., & McCrae, R. R. (1995). Solid ground in the wetlands of personality: A reply to Block. Psychological Bulletin, 117, 216-220.
Costa, P. T., Jr., & Widiger, T. A. (Eds.). (1994). Personality disorders and the five-factor model of personality. Washington, DC: American Psychological Association.
Digman, J. M. (1990). Personality structure: Emergence of the five-factor model. Annual Review of Psychology, 41, 417-440.
Digman, J. M. (1996). The curious history of the five-factor model. In J. S. Wiggins (Ed.), The five-factor model of personality: Theoretical perspectives (pp. 1-20). New York: Guilford Press.
Digman, J. M. (1997). Higher-order factors of the Big Five. Journal of Personality and Social Psychology, 73, 1246-1256.
Digman, J. M., & Inouye, J. (1986). Further specification of the five robust factors of personality. Journal of Personality and Social Psychology, 50, 116-123.
Eysenck, H. J. (1970). The structure of human personality (3rd ed.). London: Methuen.
Eysenck, H. J. (1991). Dimensions of personality: 16, 5, or 3? Criteria for a taxonomic paradigm. Personality and Individual Differences, 12, 773-790.
Eysenck, H. J. (1992). Four ways five factors are not basic. Personality and Individual Differences, 13, 667-673.
Fiske, D. W. (1949). Consistency of the factorial structures of personality ratings from different sources. Journal of Abnormal and Social Psychology, 44, 329-344.
Goff, M., & Ackerman, P. L. (1992). Personality-intelligence relations: Assessment of typical intellectual engagement. Journal of Educational Psychology, 84, 537-552.
Goldberg, L. R., & Digman, J. M. (1994). Revealing structure in the data: Principles of exploratory factor analysis. In S. Strack & M. Lorr (Eds.), Differentiating normal and abnormal personality (pp. 216-242). New York: Springer.
Goldberg, L. R., & Saucier, G. (1995). So what do you propose we use instead? A reply to Block. Psychological Bulletin, 117, 221-225.
Gorsuch, R. L., & Cattell, R. B. (1967). Second stratum personality factors defined in the questionnaire realm by the 16 P.F. Multivariate Behavioral Research, 2, 211-224.
Guilford, J. P. (1975). Factors and factors of personality. Psychological Bulletin, 82, 802-814.
Gustafsson, J.-E. (1984). A unifying model for the structure of intellectual abilities. Intelligence, 8, 179-203.
Gustafsson, J.-E. (1988). Hierarchical models of individual differences in cognitive abilities. In R. J. Sternberg (Ed.), Advances in the psychology of human intelligence (Vol. 4, pp. 35-71). Hillsdale, NJ: Lawrence Erlbaum Associates.
Hogan, J., & Ones, D. S. (1997). Conscientiousness and integrity at work. In R. Hogan, J. A. Johnson, & S. R. Briggs (Eds.), Handbook of personality psychology (pp. 849-870). San Diego, CA: Academic Press.
Hogan, R. (1983). A socioanalytic theory of personality. In M. M. Page (Ed.), The 1982 Nebraska Symposium on Motivation: Current theory and research (pp. 59-89). Lincoln: University of Nebraska Press.
Hogan, R. (1986). Manual for the Hogan Personality Inventory (HPI). Minneapolis, MN: National Computer Systems.
Hogan, R., Johnson, J. A., & Briggs, S. R. (Eds.). (1997). Handbook of personality psychology. San Diego, CA: Academic Press.
Jackson, D. N., Ashton, M. C., & Tomes, J. L. (1996). The six-factor model of personality: Facets from the big five. Personality and Individual Differences, 21, 391-402.
Johnson, J. A. (1997). Units of analysis for the description and explanation of personality. In R. Hogan, J. A. Johnson, & S. R. Briggs (Eds.), Handbook of personality psychology (pp. 73-93). San Diego, CA: Academic Press.
Jöreskog, K. G., & Sörbom, D. (1989). LISREL 7: User's reference guide. Mooresville, IN: Scientific Software.
Krug, S. E. (1994). Personality: A Cattellian perspective. In S. Strack & M. Lorr (Eds.), Differentiating normal and abnormal personality (pp. 65-78). New York: Springer.
Lohman, D. F., & Rocklin, T. (1995). Current and recurring issues in the assessment of intelligence and personality. In D. H. Saklofske & M. Zeidner (Eds.), International handbook of personality and intelligence (pp. 447-474). New York: Plenum.
London, H., & Exner, J. E., Jr. (Eds.). (1978). Dimensions of personality. New York: Wiley.
Lorr, M. (1986). Interpersonal Style Inventory: Manual. Los Angeles, CA: Western Psychological Services.
Lykken, D. T. (1995). The antisocial personalities. Mahwah, NJ: Lawrence Erlbaum Associates.
MacCallum, R. (1986). Specification searches in covariance structure modeling. Psychological Bulletin, 100, 107-120.
McCrae, R. R., & Costa, P. T., Jr. (1987). Validation of the five-factor model of personality across instruments and observers. Journal of Personality and Social Psychology, 52, 81-90.
McCrae, R. R., & Costa, P. T., Jr. (1996). Toward a new generation of personality theories: Theoretical contexts for the five-factor model. In J. S. Wiggins (Ed.), The five-factor model of personality: Theoretical perspectives (pp. 51-87). New York: Guilford.
McCrae, R. R., & Costa, P. T., Jr. (1997a). Conceptions and correlates of openness to experience. In R. Hogan, J. A. Johnson, & S. R. Briggs (Eds.), Handbook of personality psychology (pp. 825-847). San Diego, CA: Academic Press.
McCrae, R. R., & Costa, P. T., Jr. (1997b). Personality trait structure as a human universal. American Psychologist, 52, 509-516.
Montanelli, R. G., Jr., & Humphreys, L. G. (1976). Latent roots of random data correlation matrices with squared multiple correlations on the diagonal: A Monte Carlo study. Psychometrika, 41, 341-348.
Most, R. B., & Zeidner, M. (1995). Constructing personality and intelligence instruments: Methods and uses. In D. H. Saklofske & M. Zeidner (Eds.), International handbook of personality and intelligence (pp. 475-503). New York: Plenum.
Noller, P., Law, H., & Comrey, A. L. (1987). Cattell, Comrey and Eysenck personality factors compared: More evidence for the five robust factors? Journal of Personality and Social Psychology, 53, 775-782.
Norman, W. T. (1963). Toward an adequate taxonomy of personality attributes: Replicated factor structure in peer nomination personality ratings. Journal of Abnormal and Social Psychology, 66, 574-583.
Ozer, D. J., & Reise, S. P. (1994). Personality assessment. Annual Review of Psychology, 45, 357-388.
Peabody, D., & Goldberg, L. R. (1989). Some determinants of factor structures from personality-trait descriptors. Journal of Personality and Social Psychology, 57, 552-567.
Pervin, L. A. (1994). A critical analysis of current trait theory. Psychological Inquiry, 5, 103-113.
Revelle, W. (1989). Personality, motivation, and cognitive performance. In R. Kanfer, P. L. Ackerman, & R. Cudeck (Eds.), Abilities, motivation, and methodology: The Minnesota Symposium on Learning and Individual Differences (pp. 297-341). Hillsdale, NJ: Lawrence Erlbaum Associates.
Revelle, W. (1995). Personality processes. Annual Review of Psychology, 46, 295-328.
Saklofske, D. H., & Zeidner, M. (Eds.). (1995). International handbook of personality and intelligence. New York: Plenum.
Sapir, E. (1921). Language: An introduction to the study of speech. New York: Harcourt, Brace.
Saucier, G., & Goldberg, L. R. (1996). Evidence for the Big Five in analyses of familiar English personality adjectives. European Journal of Personality, 10, 61-77.
Schmid, J., & Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22, 53-61.
Strack, S., & Lorr, M. (Eds.). (1994). Differentiating normal and abnormal personality. New York: Springer.
Tellegen, A. (1985). Structures of mood and personality and their relevance to assessing anxiety, with an emphasis on self-report. In A. H. Tuma & J. D. Maser (Eds.), Anxiety and the anxiety disorders (pp. 681-706). Hillsdale, NJ: Lawrence Erlbaum Associates.
Tellegen, A. (1993). Folk concepts and psychological concepts of personality and personality disorder. Psychological Inquiry, 4, 122-130.
Tupes, E. C., & Christal, R. E. (1961). Recurrent personality factors based on trait ratings (Technical Report). Lackland Air Force Base, TX: United States Air Force. (Reprinted in Journal of Personality, 1992, 60, 225-251)
Waller, N. G., & Zavala, J. D. (1993). Evaluating the Big Five. Psychological Inquiry, 4, 131-134.
Wiggins, J. S. (Ed.). (1996). The five-factor model of personality: Theoretical perspectives. New York: Guilford.
Wiggins, J. S., & Pincus, A. L. (1992). Personality: Structure and assessment. Annual Review of Psychology, 43, 473-504.
Wiggins, J. S., & Pincus, A. L. (1994). Personality structure and the structure of personality disorders. In P. T. Costa, Jr., & T. A. Widiger (Eds.), Personality disorders and the five-factor model of personality (pp. 73-93). Washington, DC: American Psychological Association.
Wiggins, J. S., & Trapnell, P. D. (1997). Personality structure: The return of the big five. In R. Hogan, J. A. Johnson, & S. R. Briggs (Eds.), Handbook of personality psychology (pp. 737-765). San Diego, CA: Academic Press.
Zinbarg, R. E., & Barlow, D. H. (1996). Structure of anxiety and the anxiety disorders: A hierarchical model. Journal of Abnormal Psychology, 105, 181-193.
Some styles and abilities need to be disentangled to insure the valid measurement of each.
   -Messick, 1996, p. 92
Sam Messick has long championed the cause of cognitive styles, at once disentangling style constructs while simultaneously tracing their path through an immense field of research on the psychology of human differences (Gardner, Jackson, & Messick, 1960; Messick, 1984, 1987, 1996; Messick & Kogan, 1963). Indeed, the influence of cognitive styles extends well beyond the borders of differential psychology. Characteristic ways of perceiving and organizing experience represented by style constructs are important not merely for understanding how individuals differ, but for understanding belief and conflict in science itself. In other words, cognitive styles are not just an interesting subfield of differential psychology but are more foundational elements that help shape the sorts of theories we build, the methods we use to test them, and, most important, cause conflict among those who hold different views. One of our themes will be the confusions that have resulted from the failure to understand why investigators adhere so tenaciously to different research paradigms and procedures. In this chapter, however, we discuss not so much the broad sweep of theorizing about cognitive styles but rather the much narrower topic of how they might be measured. We emphasize the limitations of trait-factor models and the potential contributions of cognitive models for this task. We hope to hasten the arrival of the day when the sophistication of techniques for measuring style constructs catches up with the sophistication of theorizing about them that Messick championed. One avenue for improved measurement is through the use of measurement models derived from cognitive theory. However, before we discuss how such models can aid in the measurement of style constructs, it is necessary to understand why cognitively based models have not had much impact on the measurement of ability constructs. Thus, first we discuss abilities, then styles.
A new enthusiasm invigorated discussions of ability measurement in the 1970s. For the first time in a very long time, experimental psychology saw more than error in individual differences. To name a few of the many contributors: Estes (1974) proposed studying cognitive tests as cognitive tasks; Hunt, Frost, and Lunneborg (1973) proposed using laboratory tests to clarify the meaning of ability constructs; Underwood (1975) proposed individual differences as a crucible for theory construction; Chiang and Atkinson (1976) investigated correlations among individual differences on a memory search task, a visual memory search task, and SAT scores. From the differential side, Carroll (1976) showed how an information-processing analysis might help us understand ability factors, and Royer (1971) showed how the Digit Symbol subtest of the WAIS could be studied as an information-processing task. The new look held a particularly strong attraction for those, such as Bob Glaser at Pittsburgh and Dick Snow at Stanford, who had long tried to keep a foot in both the experimental and differential camps. Finally, there were the freshly hatched new PhDs who developed these ideas into research programs of their own: Susan Embretson, Bob Sternberg, Jim Pellegrino, Phil Ackerman. But what began with parades down Main Street petered out in a hundred side streets. Some early enthusiasts, such as Earl Hunt, wondered aloud whether experimental psychology and differential psychology might indeed be fundamentally incompatible. After years of effort that produced, at best, a scattering of small correlations, Hunt (1987) concluded: "It does not seem particularly fruitful to try to derive the dimensions of a [trait model] of abilities from an underlying process theory" (p. 36). Although this surely overstates, we believe Hunt's
pessimism is closer to the truth than the naive optimism of many would-be bridge builders, whether they begin their efforts from the precipitous cliffs that ring the tight little island of experimental psychology or from the beaches of the seemingly borderless empire of differential psychology (cf. Cronbach, 1957). Of the many differences between the two disciplines that could be discussed, we believe two are central. The first concerns how researchers think about variation. One could call it a difference in philosophy or cognitive style. The second difference stems from the fact that constructs in the two disciplines are defined by quite different, often largely independent, aspects of score variation. We discuss each of these in turn.
In his efforts to explain the rift between experimental and evolutionary biology, Mayr (1982) distinguished between what he called population thinking and essentialist thinking. Variation and diversity are the stuff of population thinking; categories and typologies are the stuff of essentialist thinking. Population thinking uniquely characterized the Darwin-Wallace theory of natural selection, and later Galton's studies of the inheritance of mental and physical traits. Essentialist thinking, on the other hand, has guided experimentalists in biology, physics, and psychology. Essentialist philosophy, dating from Plato and Aristotle, asserts that observable characteristics of objects in the world are but imperfect shadows of more perfect forms or essences. These essences are more permanent and therefore more real than the particular objects through which we conceive or deduce them. Variation among category members reflects error or imperfection in the manifestation of the essential form.

Essentialist thinking in psychology is perhaps most clearly evident in the statistical work of the Belgian statistician Quetelet and his conception of the mean of a distribution of anthropomorphic measurements as revealing the essential form of the average man (l'homme moyen). Variation about the mean reflected the action of accidental causes. So Quetelet reasoned that "there is [no] possibility of discovering anything about the important constant [or systematic] causes in nature from the character of the error distribution, since this distribution is related only to accidental causes" (Hilts, 1973, p. 217). In its purest form, this view endures in psychometrics in what Lord
and Novick (1968) called a platonic true score. In muted form, it characterizes attempts to describe elements within a category by a single score, from the goal of carving nature at its joints to the more esoteric applications of classification. Chemists are not troubled by questions about the character of individual samples of carbon; experimental methods, likewise, seem most comfortable with individual differences as error.
ose that differ chemists are not out the character o f carbon. I methods seem most comfortable with i n ~ ~ differences d u ~ as error
ted, the work of this group an chore^ in the categ~riesa
S
way. In his m e m Q ~he s noted:
' yobjects of the ~aussian one sense, to those to which I appli just ~ o ~for ~errors. c Bue S I wanted to preserve and to
nbach (1957) put it: ' e ~ o r ~ e l psycho1 a ~ ~ n ~ ose varia~lesthe exp nter left home to fo
Using Cognitive ~ e a s ~ e m e n t differences between individuals, and with relativerather than with a g the relativefitbetween all about. Even when abso a v ~ a b l eit, is ~ f o ~ a t i about o n the relative is itsspecial concern. Thus, b p s ~etweentraitand proc ental psychology-is that adherents of th eptuallzeproblemsand conse~uentlyto m e a s ~ e ~ a ~ a bdifferen~y. les ~ ~ p e r ~ e n t a hgenera~y sts prefer the neatly essentiahsm;differential psycholo~sts prefer the sional spaces ofpop~lation tluahng.
chfferences in c e style translate into mu nces in the constructs in the two domain at least the most ell-studied) construct in each e r ~ e n t a l psychology and i n t e ~ ~ e n cin e ch is defied by changes over trials or colum
to relate these two domains,
eh
elations ship between le
( l ~ ? ~ )to orr relate scores for cons~cts.
CO
S
sciphes do meet, or overlap. ~ o ~ d e p e n d e n of c erow a n n shows up in the interac~onterm. m e n consi
132 L o ~ a and n Bosma relations~pbetween learning and inte~gence,the most important cause of the interaction is an increase in score variation across trials, or what Kenny (1974) called thefan effect. Statistically, the fan effectoccurswhen true on the learning task is positively related to initial status on the learning .If initial status on the learning task correlates with intefigence, then g a i n s d also show a correlation. There are, of course, other possibilities, but this is a c o m o n scenario. Thus, the interaction term is the key to a better understanding of styles. ~nfortunately,both dfferential and experimental psychologists have been taught to dnirnize the interaction tern. Differential psychologists evaluate the dependability or reliabllity of individual drfferences by the proportionof the between-person varianceattributable to the person variance component (Cronbach, Gleser, Nanda, & Rajaratnam, 1972). A large person variance component and a comparatively small person X item interaction variance component are the goal. For the experimentalist, dfferences between conditions(or items, z) are judged relative to the size of the p by i interaction. But a s m d p by i interaction is not always the goal. Diagnostic i n f o ~ a t i o nabout how subjects solve tasks is most informative when the interaction term is large. In such cases, a single rank order of indviduals or of conditions does not give all of the interesting infornation. I ~ ~ u e n ~ a l developmental psychologlsts have long built their psychology around tasks that induce subjects to reveal important, preferably~ualitativedifferences in owledge or strategyby the type or pattern of responsesthey give. Furthermore,thesedifferences in owle edge or strategy mustthen be shown to generalize toother tasks or even to be indicative of broad competencies.Piagetwas partic~arlyclever in inventing or suchtasks for use with children. Siegler (1988) and others have continued thetradition, p r con~bution ~ ~ ofan information-processing analysis of a problematic situation is infornation on how subjects understood that situation or solved that task. Although such analyses usefully inform all subjects follow a u n i f o ~ test scoresevenwhen modelsaremost usehl for understan~g individual there are interesting dfferences in theway subjects S and in the strategiesthey deploy when attempting to er, most tasks studied by experimental psyc most tests developed by differential psychologists are not des such qualitative dfferences in type of knowledge or strategy use or to reveal them when they do occur. In fact, tasks and tests are usually constructed
Using Cognitive ~easurement withexactl
the oppositegoalinmind.ensuchtests or tasks are n info~ation-processinganalysis, the results are not exactly For e x ~ p l einformation , processing analyses of spatial tasks that require the mental rotation of figures tell us that a major source of in~vidualchfferencesonsuchtasks is to befoundinthe sp accuracy of the rotation process. Did anyone seriously doubt h s ? news is when we f i d subjects who do not rotate stimd, or who persist in rotating them in one direction when rotation in the other ~ e c t i o nwould be shorter, or when some rotate along rigid axes while others perform a mental ~ s andturning ~ atgthesametime.Yeteventhesestrategy differences are of no enduring interest unless they can be related to more global indices of abilityor some personological attribute suchas conation. Mostresearch in thepast 20 years attempting to relativeand differ en ti^ psychology has assumed that connections b the two i lines would be more s t r ~ g h t f o ~ a rInvestigators d. fitted information models to each subject's data, then estimated component scores entalprocesses(suchastheslopeparameter from the of latency on angular separation between stimuli in the rotation en used these process-based parametersas new in~vidual les.However, in~vidualdfferences that areconsistent across trials are locatedin the interceptsof the indmidual regressions,not in the slopes or other component scores,as c o ~ o n l yassmed (Lohman, 1994). Such comple~tiescomplicate but by no meansembargo tra between the two disciphes of scientific psychology. The main avenue of contactisthroughtasks or measurementproceduresdesigned to elicit rather than to prohibit (or obscure) dfferences in strategy or style, which brings us back to cognitive styles.
Cognitive styles include constructs such as field articulation, extensiveness of scanning, cognitive complexity versus simplicity, leveling versus sharpening, breadth of categorizing (category width), reflection versus impulsivity, automatization versus restructuring, and converging versus diverging thinking. Styles reflect consistent individual differences in the manner of cognition, as distinct from its level; thus, cognitive styles are often viewed as performance rather than as competence variables. The division is not sharp, because styles are generally thought to be interwoven with personality characteristics and to function in concert with conative and affective processes. Some style constructs are interpreted more broadly, spanning perception, memory, and thought; scanning, for instance, although a component of attention, would fall under this broader reading. Pask distinguished learners by their characteristic strategies for organizing their knowledge (Pask, 1976). Other style constructs describe characteristic ways of regulating cognition, often in the face of intense affect. Four broad styles of this kind were proposed (obsessive-compulsive, hysterical, paranoid, and impulsive), which, in the normal range of personality, are marked by meticulous, impressionistic, suspicious, and unintegrated cognition, respectively.
In one way or another, the notion of strategy enters into all of these style constructs. What, then, is the relationship between the two? Style is clearly a broader term than strategy. Strategy may signify no more than a particular way of solving a task. When the term is used in this way, there is no requirement that individuals choose, or even be aware of, the strategies they use. However, strategy use can also imply choice in action or thought. When the term is used in this way, listing strategies as exemplars of styles implies
the presence of some form of executive or self-regulatory control. The range of situations in which particular processes are used and the consistency with which they are used may depend on such control. Therefore, cognitive styles contain conative and volitional aspects that have implications for their assessment. These include, in the case of volition, mechanisms for the self-regulation of cognition and affect in directing action or behavior, and mechanisms for the initiation and maintenance of action-appropriate thought. Thus, one way to observe styles is to look for consistencies in the application of strategies across tasks or situations. For example, an obsessive-compulsive style may be inferred from consistencies in the strategies used to fend off the influence of negative affect.

The style-strategy distinction is perhaps most salient in the style constructs of field dependence and field independence. A crucial aspect of strategy control is not so much the purposeful direction of performance through task-relevant cognitions but rather the inhibition of task-irrelevant or misleading cues (see Kuhl, 1992, and Pascual-Leone, 1989, for two perspectives on the role of inhibition and facilitation in strategy use). Self-regulation represents a situationally sensitive and adaptable approach to the planning, initiation, and maintenance of context-appropriate (or disengagement from context-inappropriate) intentions. Field-independent and field-dependent learners can be distinguished in this respect. The former are more able to make use of appropriate (inhibitory or facilitatory) strategies. The latter are more oriented toward situational cues and make less use of appropriate strategies, even when they are available. It is the differential effect of internal versus external cues that appears to distinguish between the obsessive-compulsive and field-dependent/field-independent dimensions. In the first case, internally generated affect influences strategy, whereas in the second case, external cues influence strategy. In one, the individual keeps the world at bay by inhibiting external influences; in the other, the individual keeps inner demands at bay by inhibiting affect. In both cases, style facilitates cognition, and so the type of facilitation can indicate broader dispositional style.

This brings us to the issue of conscious control or choice in strategy use, alluded to earlier. Control appears to be a question of degree, ranging from unconscious and automatic control to fully conscious control. For our purposes, we assume that control can be exerted at any of these
levels and reflects the action of a higher order self-regulatory system. However, for the valid measurement of cognitive styles, it is not a prerequisite that an individual be consciously involved in the application of any particular strategy.

Because strategies and, to a lesser degree, styles can be perceived as part of a self-regulatory system, they can be situated within a cognitive-conative-affective framework (Snow et al., 1991). In Snow's taxonomy the conative domain lies between the cognitive and affective domains, and is represented by a motivation-volition continuum. Strategies are mostly subsumed under the more cognitive-volitional pole and, to a lesser degree, under the affective-motivational pole. Styles, on the other hand, are distributed more evenly across volition and motivation. One advantage of Snow's scheme is that different style constructs (and their concomitant strategies) are not operationally dependent on an overarching and rather conceptually nebulous cognitive-personality system.

Messick took a somewhat different approach. Whereas Snow proposed a specific set of variables spanning the space, but excluding the superordinate constructs of cognition and personality, Messick (1989) preferred a greater inclusion of personality variables. He wrote:

The human personality is a system in the technical sense of something that functions as a whole by virtue of the interdependence of its parts.... Personality may influence the organization of cognition, the dimensionality and stability of structure, and the nature and course of cognitive processes, as well as that of level of measured ability. (p. 36)
Accordingly, styles should be treated, not as cognitive or behavioral variables related to personality, but as "manifestations of organizing personality structures in cognition, affect, and behavior" (Messick, 1994, p. 133). An important question then is, "Do we include personality characteristics when attempting to assess styles?" and, if we do, "At which point do we integrate them into our measures?" This question can be answered from either a top-down or a bottom-up analysis, with the former linking personality to performance and the latter performance to personality. From the top-down perspective, styles are considered the superordinate tier subsuming and instantiating strategies; most likely they do so differentially across situations and tasks, not unlike the personality constructs to which they are presumably affixed. Our tack will take us through the bottom-up
analysis: We try to determine how individuals process information in particular contexts, and then look for consistencies in the strategies used.
By definition, styles concern not how much but how. As Messick (1976) observed:
Cognitive styles differ from intellectual abilities in a number of ways.... Ability dimensions essentially refer to the content of cognition or the question of what—what kind of information is being processed by what operation in what form?... Cognitive styles, in contrast, bear on the questions of how—the manner in which behavior occurs.... (pp. 6-9)

Measures of styles should yield scores that are bipolar and value differentiated rather than unipolar and value directed (Messick, 1984, 1996). Messick proposed that we examine typical performance (see also Goff & Ackerman, 1992) and use ipsative or contrasted scores to measure styles. There are a variety of ways to do this. However, most attempts to measure cognitive styles have inappropriately followed the ability-factor model, which is better suited to value-directional questions about unipolar, maximal performance constructs that ask how much.

The subversion of questions about how by methods better attuned to how much is but one example of how the application of elegant statistical techniques that do not really answer the questions posed can unwittingly reshape a discipline. Early mental testers, particularly Binet but others as well (see Freeman, 1926), were as much concerned with how children solved problems as with the answers they gave. This concern with process was picked up by developmental psychologists, but gradually abandoned by psychometricians, especially with the rise of group-administered tests that could be scored by a clerk and, later, by a machine. Tests became increasingly efficient vehicles for identifying those who were more (or less) able, but increasingly uninformative as to what abilities might be (Lohman, 1989). Issues of process were exiled to the land of cognitive styles. There, isolated from the mainstream of differential psychology, promising style constructs were gradually ground into traits already known to ability theorists, but by other names. When the redundancy was finally discovered,
ability theorists claimed priority, and style theorists were left with little.
The key to measuring style lies in measuring how rather than how much. How can one measure how? First, one needs tasks in which individual differences in how, rather than in how much, are clearly reflected in the measures; that is, tasks that everyone solves in some sense but that are amenable to different solution strategies. Second, one must have some way of making clear inferences about strategy from the responses that are given. There are different ways to do this, and the choice of dependent measures is important. We often find that, even though there are different ways of solving a task, the different methods are not distinguishable with common dependent measures. For example, different ways of solving a problem that requires mental rotation of a figure may all show increasing latency with the amount of rotation required. It may therefore be difficult to detect such strategy differences
with the usual dependent measures. Third, one needs not only tasks that elicit different strategies and dependent measures that are sensitive to them, but also measurement models that can represent them; such models are much better suited to the task of measuring how than the ability-factor models typically used. Fourth, one needs some scheme (or model) whereby different strategies can be mapped onto one or more style constructs. This mapping is rarely simple: many different strategies could indicate a particular style, and strategies may shade into one another throughout a continuum of control. Every test may be solved in more than one way, and systematic variation in how examinees solve items will appear as unexplained variation unless it is represented in our measurement (or process) models.
For example, Sternberg (1977) distinguished among four different models for analogical reasoning tasks. In Model I, all component processes were self-terminating, whereas in Model IV, all component processes were exhaustive. Models II and III distinguished intermediate cases. Performance of adult subjects was generally well fit by Models III or IV. In later work with children, Sternberg discovered that the performance of younger children was better fit by models with self-terminating processes, whereas that of older children was generally better fit by models that hypothesized more exhaustive processing (Sternberg & Rifkin, 1979). Thus, models could be ordered by amount of exhaustive processing required. Category score in this measurement model was then shown to be correlated with age or developmental level.

Sometimes more than one dimension is required, such as in attempts to relate differences on cognitive tasks to ability constructs identified in trait theories. Siegler (1988) reported a nice example of how classification of measurement models along two dimensions might be accomplished. He administered addition, subtraction, and word identification tasks to two groups of first graders. Performance on each item was classified as based either on retrieval of a response or on construction of a response using a back-up strategy. Students were then classified in one of three groups depending on the pattern of response correctness overall, on retrieval problems, and on back-up strategy problems. Siegler labeled the groups good, not-so-good, and perfectionist students. Perfectionists were students who exhibited good knowledge of problems but set high confidence thresholds for stating retrieval answers. The distinction between perfectionist and good students thus mirrors the dimension of reflectivity-impulsivity. Note, however, that this dimension is typically defined by performing a median split on latency and error scores on a figure-matching task and then discarding subjects in two of the four cells. Siegler, however, started with a model that distinguished between strength of association (a cognitive construct) and a confidence criterion for answers (a "conative" construct). Furthermore, the hypothesized dimension was shown by examining response patterns across three tasks commonly used in the classroom.

The key assumption in both the Sternberg and Rifkin (1979) and Siegler (1988) studies is that individuals can be classified on the basis of which of several models best describes their data. Once again, this is an essentialist or typological way of thinking about the issue. When tasks admit a variety
of solution strategies, individuals only rarely appear to solve all items in the same way. The problem is not "which strategy does the individual use?" or even "which strategy does the individual use most frequently?" but rather "what is the probability that the individual used each of the n strategies?" When stated in this way, it is obvious that individuals differ not only in the strategies they typically use, but also in the propensity to use a variety of strategies. Experts not only have a broader array of strategies at their disposal than do novices; they use them more appropriately. In other words, they are attuned to environmental constraints and affordances, and to metacognitive knowledge of self. Indeed, the continued application of an ineffective strategy is one hallmark of immature and disordered functioning.

Kyllonen, Lohman, and Woltz (1984) showed how to do this in a rough way in an investigation of the solution strategies subjects used on a spatial task. Consider the case in which two strategies are hypothesized: Strategy 1 and Strategy 2. Kyllonen et al. tested models that presumed subjects solved different proportions of the items using each strategy: 0%, 25%, 50%, 75%, or 100% of Strategy 1, with the complement solved using Strategy 2. Of course, 0% and 100% represent single-strategy models. The investigators were able to distinguish among these different models because the characteristics of items used to predict whether subjects synthesized or did not synthesize figures varied orthogonally in the design. Without this, it would not have been possible to distinguish among the different process or measurement models.

Kyllonen et al. (1984) found that subjects with extreme ability profiles were more likely to use a single strategy. In particular, subjects who scored much higher on reference spatial ability and visual memory tests than on other reference tests consistently synthesized component figures into a single shape, whereas those who showed the opposite profile seemed only able to combine figures actually in view. Those who were most able showed the most flexible adaptation, changing solution strategies to meet changes in item demands. Brodzinsky (1985) claimed that this generalization also applies to the cognitive style construct of impulsivity-reflectivity. In particular, individuals who show extremely impulsive or reflective behavior are less able to modify their speed-accuracy trade-off across situations.
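The logic of this kind of model comparison can be made concrete in a few lines of code. The sketch below is our own minimal illustration, not the analysis Kyllonen et al. actually ran: the function names, the equally spaced candidate proportions, and the least-squares criterion are all assumptions made for exposition. It asks which fixed mixture of two strategy-based latency predictions best reproduces one examinee's observed item latencies.

```python
import numpy as np

def fit_strategy_mixture(latencies, pred_s1, pred_s2,
                         weights=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Compare mixture models assuming a fixed proportion of items was
    solved with Strategy 1 (e.g., synthesizing figures) and the rest
    with Strategy 2 (e.g., combining figures in view).

    latencies        : observed item latencies for one examinee
    pred_s1, pred_s2 : latencies predicted from item characteristics
                       under each hypothesized strategy
    Returns the best-fitting proportion and the error of each model.
    """
    latencies = np.asarray(latencies, dtype=float)
    fits = {}
    for w in weights:
        predicted = w * np.asarray(pred_s1) + (1 - w) * np.asarray(pred_s2)
        fits[w] = float(np.mean((latencies - predicted) ** 2))  # MSE
    best = min(fits, key=fits.get)
    return best, fits

# Hypothetical data: 8 items; Strategy 1 predicts latency from the
# number of figures to synthesize, Strategy 2 from rotation demands.
pred_s1 = [2.1, 2.8, 3.5, 4.2, 2.1, 2.8, 3.5, 4.2]
pred_s2 = [1.5, 1.5, 4.0, 4.0, 1.5, 1.5, 4.0, 4.0]
observed = [2.0, 2.6, 3.7, 4.1, 2.3, 2.9, 3.4, 4.4]

best_w, all_fits = fit_strategy_mixture(observed, pred_s1, pred_s2)
print(f"Best-fitting proportion of Strategy 1: {best_w}")
```

The comparison only works because the two prediction vectors are built from item characteristics that vary independently of one another; if both strategies predicted the same latency pattern, every candidate mixture would fit equally well, which is exactly the design point made in the passage above.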
A much simpler example comes from the work of Riding and Dyer (1980). Children in their study first listened to a short story and then answered a series of questions about the passage. The questions were of two types: those that depended on imagery and those that depended on semantic elaboration. For example, the story may have mentioned the fact that someone knocked on the door of a house; an imagery question might be "What color was the door?" (the color of the door was not specified in the story). Latencies to answer were recorded. However, the dependent variable of interest was an ipsative score that compared latencies on semantic and imagery questions. The idea was to identify children who were much quicker to answer one type of question than the other. Correlations between this ipsative score and the Junior Eysenck Personality Inventory were then computed; correlations with the Extroversion scale were substantial for both boys (n = 107) and girls (n = 107). Thus, children who showed a preference for imaging were much more likely to be introverted, whereas those who showed a preference for verbal elaboration were more likely to be extroverted. One of the nice features of this study is that the method did not impose a typology, even though careless interpretation might do so.

Although different in many respects, the Siegler (1988), Kyllonen et al. (1984), and Riding and Dyer (1980) studies all show that individual differences in strategy preference can, with proper observation and measurement models, define style constructs that provide one bridge between the domains of personality and ability. Although these studies use latency as the only or the primary dependent measure, other dependent measures can also be used. For example, one can follow the lead of Binet and many others in the developmental tradition who have attempted to make inferences from a classification of the responses given. Many such attempts have failed because they sought to place the child in a category rather than to estimate the probability that the child's responses fall in each of the categories used. A good measure of style seeks to use this structure rather than to discard or ignore information on the consistency of behavior across trials, tasks, or contexts. It is only our essentialist thinking that makes inconsistency in style or strategy use seem problematic.
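A minimal sketch of the ipsative contrast just described follows. This is our own illustration: Riding and Dyer's exact scoring formula is not given above, so the normalization by each child's overall speed is an assumption made so that only the relative preference, not general quickness, remains in the score.

```python
import numpy as np

def imagery_preference(imagery_latencies, semantic_latencies):
    """Ipsative contrast score for one child. Negative values mean
    imagery questions were answered relatively faster (an 'imager'
    preference); positive values mean semantic questions were
    relatively faster (a 'verbalizer' preference)."""
    img = np.mean(imagery_latencies)
    sem = np.mean(semantic_latencies)
    overall = np.mean(np.concatenate([imagery_latencies,
                                      semantic_latencies]))
    return (img - sem) / overall  # scale out the child's overall speed

# Hypothetical children: latencies in seconds per question type
child_a = imagery_preference(np.array([1.2, 1.1, 1.3]),
                             np.array([2.0, 2.2, 1.9]))
child_b = imagery_preference(np.array([2.4, 2.6, 2.5]),
                             np.array([1.4, 1.5, 1.3]))
print(child_a, child_b)  # a < 0: imager; b > 0: verbalizer
```

Because the score is a within-person contrast, it is bipolar and value differentiated in exactly the sense Messick recommended for style measures.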
Style is a second-order resemblance concept. Individuals generally cannot be typed by strategy, and strategies cannot be typed by style. Thus, the relationship between individual and style is distal. This means that attempts to make strong predictions about behavior in a particular context on the basis of a style will generally not succeed. This does not mean that style constructs are any less real than more proximal measures of behavior. A description of the general features of the landscape is valid even if it does not well describe a particular location.

The measurement of cognitive styles can provide a fertile interaction between the two disciplines of scientific psychology. Indeed, we believe there probably is greater promise for fruitful interaction between the two disciplines in the measurement of styles than in the measurement of abilities. For this to occur, however, trait models of cognitive styles that involve a simple aggregation of item scores must give way to models that reflect qualitative differences in strategy that can be mapped onto one or more style constructs. Psychology must also overcome its penchant for categorizing: essentialist or typological thinking is as dangerous for the measurement of styles as is reductionism for psychology generally. Typological labels usually identify extremes on a continuum of normally distributed scores. In other words, the measurement of style must recognize that the categorization of responses or persons is a probabilistic affair. Categories are often nothing more than convenient fictions, arbitrary partitions of a continuous space that enable us to communicate with one another. And because of this need to communicate, we will always have a need for category labels. The trick is to remember not to be misled into taking literally what we say.
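The probabilistic alternative urged here can be stated directly in code. The sketch below is a hypothetical illustration (uniform priors over strategy models and Gaussian residuals are our assumptions): rather than assigning a person to the single best-fitting strategy model, it returns the posterior probability of each model via Bayes' rule and keeps the whole profile.

```python
import numpy as np

def strategy_posteriors(residual_sds, mses, n_items):
    """Posterior probability of each strategy model for one person.

    residual_sds : assumed residual standard deviation of each model
    mses         : mean squared prediction error of each model for
                   this person's responses
    n_items      : number of items the errors were computed over
    Assumes Gaussian residuals and a uniform prior over models.
    """
    mses = np.asarray(mses, dtype=float)
    sds = np.asarray(residual_sds, dtype=float)
    # Gaussian log-likelihood of the data under each model:
    # sum of squared residuals equals n_items * MSE
    loglik = -0.5 * n_items * (np.log(2 * np.pi * sds**2) + mses / sds**2)
    loglik -= loglik.max()          # guard against numerical underflow
    post = np.exp(loglik)
    return post / post.sum()

# Hypothetical fits for one examinee under three strategy models
print(strategy_posteriors(residual_sds=[0.5, 0.5, 0.5],
                          mses=[0.21, 0.36, 0.95], n_items=24))
```

A person whose posteriors come out at, say, .62, .33, and .05 is then described by a graded profile rather than forced into a single strategy category, which is exactly the shift from typological to probabilistic description that the passage recommends.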
Ainley, M. D. (1993). Styles of engagement with learning: Multidimensional assessment of their relationship with strategy use and school achievement. Journal of Educational Psychology, 85, 395-405.
Allison, R. B. (1960). Learning parameters and human abilities (UM 60-4958). Unpublished report, Princeton, NJ: Educational Testing Service.
Biggs, J. B. (1987). Student approaches to learning and studying. Hawthorn, Victoria: Australian Council for Educational Research.
Brodzinsky, D. M. (1985). On the relationship between cognitive styles and cognitive structures. In E. D. Neimark, R. D. Lisi, & J. L. Newman (Eds.), Moderators of competence. Hillsdale, NJ: Lawrence Erlbaum Associates.
Carroll, J. B. (1976). Psychometric tests as cognitive tasks: A new "structure of intellect." In L. B. Resnick (Ed.), The nature of intelligence. Hillsdale, NJ: Lawrence Erlbaum Associates.
Chiang, A., & Atkinson, R. C. (1976). Individual differences and interrelationships among a select set of cognitive skills. Memory and Cognition, 4, 661-672.
Cronbach, L. J. (1957). The two disciplines of scientific psychology. American Psychologist, 12, 671-684.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements. New York: Wiley.
Entwistle, N. (1987). Explaining individual differences in school learning. In E. DeCorte, H. Lodewijks, R. Parmentier, & P. Span (Eds.), Learning and instruction: European research in an international context (Vol. 1, pp. 69-88). Oxford: Pergamon Press.
Estes, W. K. (1974). Learning theory and intelligence. American Psychologist, 29, 740-749.
Freeman, F. N. (1926). Mental tests: Their history, principles and applications. Boston: Houghton Mifflin.
Galton, F. (1908). Memories of my life (2nd ed.). London: Methuen.
Gardner, R. W., Holzman, P. S., Klein, G. S., Linton, H. B., & Spence, D. (1959). Cognitive control: A study of individual consistencies in cognitive behavior. Psychological Issues, 1, Monograph 4.
Gardner, R. W., Jackson, D. N., & Messick, S. (1960). Personality organization in cognitive controls and intellectual abilities. Psychological Issues, 2, Monograph 8.
Goff, M., & Ackerman, P. L. (1992). Personality-intelligence relations: Assessment of typical intellectual engagement. Journal of Educational Psychology, 84(4), 537-552.
Hilts, V. L. (1973). Statistics and social sciences. In R. N. Giere & R. S. Westfall (Eds.), Foundations of scientific method: The nineteenth century (pp. 206-233). Bloomington: Indiana University Press.
Hunt, E. (1987). Science, technology, and intelligence. In R. R. Ronning, J. A. Glover, J. C. Conoley, & J. C. Witt (Eds.), The influence of cognitive psychology on testing: Buros-Nebraska symposium on measurement and testing (Vol. 3, pp. 11-40). Hillsdale, NJ: Lawrence Erlbaum Associates.
Hunt, E. B., Frost, N., & Lunneborg, C. (1973). Individual differences in cognition: A new approach to intelligence. In G. Bower (Ed.), The psychology of learning and motivation (Vol. 7, pp. 87-122). New York: Academic Press.
Kenny, D. A. (1974). A quasi-experimental approach to assessing treatment effects in the nonequivalent control group design. Psychological Bulletin, 82, 345-362.
Kogan, N. (1994). Cognitive styles. In R. J. Sternberg (Ed.), Encyclopedia of human intelligence (pp. 266-273). New York: Macmillan.
Kuhl, J. (1992). A theory of self-regulation: Action versus state orientation, self-discrimination, and some applications. Applied Psychology: An International Review, 41, 97-129.
Kyllonen, P. C., Lohman, D. F., & Woltz, D. J. (1984). Componential modeling of alternative strategies for performing spatial tasks. Journal of Educational Psychology, 76, 1325-1345.
Lohman, D. F. (1989). Human intelligence: An introduction to advances in theory and research. Review of Educational Research, 59, 333-373.
Lohman, D. F. (1994). Component scores as residual variation (or why the intercept correlates best). Intelligence, 19, 1-12.
Lohman, D. F., & Ippel, M. J. (1993). Cognitive diagnosis: From statistically-based assessment toward theory-based assessment. In N. Frederiksen, R. Mislevy, & I. Bejar (Eds.), Test theory for a new generation of tests (pp. 41-71). Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F. M., & Novick, M. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Mayr, E. (1982). The growth of biological thought: Diversity, evolution, inheritance. Cambridge, MA: Harvard University Press.
Messick, S. (1976). Personality consistencies in cognition and creativity. In S. Messick (Ed.), Individuality in learning: Implications of cognitive styles and creativity for human development. San Francisco: Jossey-Bass.
Messick, S. (1984). The nature of cognitive styles: Problems and promise in educational practice. Educational Psychologist, 19, 59-74.
Messick, S. (1987). Structural relationships across cognition, personality and style. In R. E. Snow & M. J. Farr (Eds.), Aptitude, learning, and instruction: Vol. 3. Conative and affective process analyses (pp. 35-75). Hillsdale, NJ: Lawrence Erlbaum Associates.
Messick, S. (1989). Cognitive style and personality: Scanning and orientation toward affect (RR-89-16). Princeton, NJ: Educational Testing Service.
Messick, S. (1994). The matter of style: Manifestations of personality in cognition, learning, and teaching. Educational Psychologist, 29(3), 121-136.
Messick, S. (1996). Human abilities and modes of attention: The issue of stylistic consistencies in cognition. In I. Dennis & P. Tapsfield (Eds.), Human abilities: Their nature and measurement (pp. 77-96). Hillsdale, NJ: Lawrence Erlbaum Associates.
Messick, S., & Kogan, N. (1963). Differentiation and compartmentalization in object-sorting measures of categorizing style. Perceptual and Motor Skills, 16, 47-51.
Miller, A. (1991). Personality types: A modern synthesis. Calgary, Alberta, Canada: University of Calgary Press.
Novick, M. R. (1982). Educational testing: Inferences in relevant subpopulations. Educational Researcher, 11, 4-10.
Pascual-Leone, J. (1989). An organismic process model of Witkin's field-dependence-independence. In T. Globerson & T. Zelniker (Eds.), Cognitive style and cognitive development (pp. 36-70). Norwood, NJ: Ablex.
Pask, G. (1976). Styles and strategies of learning. British Journal of Educational Psychology, 46, 128-148.
Riding, R. J., & Dyer, V. A. (1980). The relationship between extroversion and verbal-imagery learning style in twelve-year-old children. Personality and Individual Differences, 1, 273-279.
Royer, F. L. (1971). Information processing of visual figures in a digit symbol substitution task. Journal of Experimental Psychology, 87, 335-342.
Siegler, R. S. (1988). Individual differences in strategy choices: Good students, not-so-good students, and perfectionists. Child Development, 59, 833-851.
Snow, R. E., Corno, L., & Jackson, D. (1996). Individual differences in conative and affective functions. In D. C. Berliner & R. C. Calfee (Eds.), Handbook of educational psychology (pp. 243-310). New York: Macmillan.
Snow, R. E., & Swanson, J. (1992). Instructional psychology: Aptitude, adaptation, and assessment. Annual Review of Psychology, 43, 583-626.
Sternberg, R. J. (1977). Intelligence, information processing, and analogical reasoning: The componential analysis of human abilities. Hillsdale, NJ: Lawrence Erlbaum Associates.
Sternberg, R. J., & Rifkin, B. (1979). The development of analogical reasoning processes. Journal of Experimental Child Psychology, 27, 195-232.
Woodrow, H. (1946). The ability to learn. Psychological Review, 53, 147-158.
Sam's own papers on cognitive styles were more analytic, and more critically disposed toward the topic, than mine. Of course, Sam
cited my work as well, and I am grateful for his praiseworthy comments about it.

As my engagement with the topic of cognitive styles approaches the 40-year mark, I am compelled to acknowledge that the peak of activity and interest in the field was reached some years ago, with a steady decline since that time. I am sure that Sam would have agreed with this assessment. It is quite disconcerting to discover that cognitive styles, on which one had focused only a short time ago, were now relegated to historical chapters representing older approaches (Cantor & Kihlstrom, 1987; Sternberg, 1997). The latter author claims that his work on thinking styles (based on a theory of mental self-government) will rejuvenate the field. Clearly, more time will have to pass before such a claim can be effectively evaluated. Following my contribution to the cognitive style literature in the form of an encyclopedia entry (Kogan, 1994), I vowed that it would be my final word on the subject. Breaking the vow for the purpose of this volume would have been possible, but I could not see the point of it. I would essentially find myself recycling ideas that I had previously committed to print. It was evident that I had to move in a different direction.

Throughout my career, I have derived much gratification from the construction of psychological measuring instruments tapping constructs for which no such instruments existed or where existent instruments were grossly inadequate. My initial engagement in such activity began before my arrival at ETS. The area of gero-psychology was quite underdeveloped at that time, and I accordingly took on the challenge of devising a series of instruments to measure attitudes toward and beliefs about the elderly (Kogan, 2000). In collaboration with Michael Wallach, the Choice Dilemmas Questionnaire was constructed (Kogan & Wallach, 1964), an enormously popular instrument that pervaded the research on risk-taking behavior (individual and group) for many years. With the move to ETS, and the influence of Sam and the prevailing psychometric tradition, I became more aware of the subtleties and complexities of measurement, and I would like to believe that my instrument-construction activities became more sophisticated. In the 1970s, graduate students and I devoted much effort to the construction and validation of the pictorial Metaphoric Triads Task, an instrument intended to assess individual differences in metaphoric sensitivity in both children and adults (Kogan, Connor, Gross, & Fava, 1980).

All of this is a prelude to the challenge that confronted me when I accepted a consultantship with the Lincoln Center Institute, an organization
dedicated to the facilitation of aesthetic education in public primary and secondary schools. Jack Carroll (a former ETS colleague and a contributor to this volume) had previously served in the role and recommended that I take over as his successor. The Institute wished to evaluate the success of their program, and it became evident to me that new and possibly unique instruments would have to be developed to accomplish that goal. In the next section of this chapter, I describe the approach taken in the development of such instruments. The intent of the present chapter, however, is to go beyond the particulars of the Lincoln Center Institute evaluation to assessment in the performing arts more generally, and of performing artists more specifically.
How does one proceed to study children's sensitivity in the aesthetic domain? It is evident that we must place aesthetically relevant stimuli before the child, and then evaluate the responses offered to such stimuli. There are a number of ways such a goal can be accomplished. At one extreme, one can offer the child musical or dance works from the appropriate standard repertoire, and then ask a series of questions intended to tap his or her appreciation of such works. This is a direction that we deliberately chose not to take, given the premiums such a procedure would place upon verbal ability, prior familiarity, and cultural sophistication. Accordingly, we adopted a more experimental trial-and-error approach to the construction of the required instruments.

Consider first the tasks developed for the dance assessment. We asked the question: Are there aspects of dance of sufficient saliency to lend themselves to at least partial apprehension by children naive to the dance field? If so, could tasks incorporating those aspects be pitched at a level of difficulty that allows for meaningful individual differences? In the case of dance, we settled on tasks assessing (1) sensitivity to the affective and descriptive qualities inherent in dance movements, and (2) the ability to translate a dance movement into a two-dimensional abstract visual pattern (one that is not a floor pattern of dance steps). Improvisational dancers were available to us, and a particular test item was represented by one, two, or all three dancers. Thus, in the case of the first dance component indicated previously, the dancers were requested to produce a brief movement representing a specific descriptive-affective combination, and a test item was then constructed by adding distractor alternatives to each of these (eight in all). With new instruments of the type described, there is bound to be concern about whether naive children can apprehend the qualities of dance movement in the first place.

The task in the case of music required that the child match a musical selection with a visual pattern (again drawn from a reference test). A professional music educator selected visual patterns to which a classical or ethnic music selection could be matched, and distractors for each item were chosen. Advanced music students were used to ensure a consensus on the item keyed as correct; distractors attracting more than 5% of their responses were replaced.

A few words about the use of distractors is called for here. In initial pilot work, we employed distractors highly discrepant from the correct alternative, and obtained marked ceiling effects in children 8 to 9 years old. It is thus apparent that such children exhibit a basic veridical sensitivity to the affective-descriptive properties of dance movement, and are able to recognize the correspondences between dancers' movements and an abstract representation of that movement in the form of a visual pattern. Such ceiling effects were less severe in the case of music, presumably due to the increased difficulty associated with matching the auditory and the visual. Of course, as more subtle items were introduced through the use of less obviously discrepant distractors, difficulty increased. Such variation is clearly essential for measurement. At the same time, it is apparent that the sensitivities at issue are likely to be present in at least rudimentary form very early in development: infants demonstrate
categorical matching of auditory signals and visual line patterns (Wagner, Winner, Cicchetti, & Gardner, 1981). Such findings point to the primitive synesthetic and physiognomic linkages described by Werner and Kaplan (1963). The point at issue is that cross-categorical sensitivities can be demonstrated early in the life span, but with development, they become more subtle and differentiated. I should like to argue that this increasing subtlety of differentiation proceeds at different rates across children, hence yielding the kinds of individual differences under discussion.

It is important to note that the aesthetic education program of the Lincoln Center Institute was not aimed at training performers, although the classroom exercises offered students the opportunity to perform. Rather, the intent was basically aimed at instilling an appreciation for and sensitivity to the performing arts through the demonstration and participation of experienced teacher-artists in the fields of music, dance, and drama. The performing arts world obviously requires sophisticated spectators as well as talented performers. To the degree that children are sensitized to the performing arts and exposed to some of its classic works in the educational context, one can hopefully anticipate that they will grow to become the engaged audiences of the future.
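The pilot-testing logic described above can be summarized in a small item-analysis routine. This is our own generic sketch, not the Institute's actual procedure: the 5% threshold mirrors the criterion mentioned earlier, and the function and variable names are invented for illustration.

```python
import numpy as np

def distractor_check(expert_responses, key, n_options, max_pull=0.05):
    """expert_responses: options chosen by advanced (expert) judges for
    one item; key: index of the keyed answer. Because experts should
    agree on the keyed alternative, any distractor attracting more
    than max_pull of their responses is flagged for replacement."""
    r = np.asarray(expert_responses)
    flagged = []
    for opt in range(n_options):
        if opt == key:
            continue
        if np.mean(r == opt) > max_pull:
            flagged.append(opt)
    return {"consensus": float(np.mean(r == key)), "replace": flagged}

# Hypothetical item: 40 advanced music students, 4 options, key = 2
resp = np.array([2] * 35 + [0] * 4 + [3] * 1)
print(distractor_check(resp, key=2, n_options=4))
# Option 0 draws 10% of expert choices, so it is flagged for replacement.
```

The same proportion-correct statistic, computed on the children's responses rather than the experts', flags the ceiling effects described above: an item answered correctly by nearly everyone cannot separate more and less sensitive children.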
Of course, it is entirely possible that a small proportion of the children in our program were stimulated to the point of aspiring toward a performing arts career. Such children would naturally be expected to excel on measures of aesthetic sensitivity, but clearly this would be but a minimal requirement for the successful pursuit of a career in the performing arts. Ideally, we should follow the lives of such children through adolescence and young adulthood to see what happens to them. The research on teenagers by Csikszentmihalyi, Rathunde, and Whalen (1993) offers a model of how talent is developed or dissipated across the high school years in different domains (including music and the visual arts). That research does not, however, offer information on the fate of these teenagers after high school graduation, nor does it consider students talented in dance or acting. Although we learn a great deal about the forces, internal and external, that contribute to success and failure at the high school level, we would also like to know more about those individuals who have committed themselves to a career in the performing arts. Are
there distinctive cognitive, personality, and motivational characteristics among musicians, dancers, and actors?

To raise this question implies that we have entered the realm of expertise, a realm that has been explored recently by K. A. Ericsson (1996). To oversimplify, the issue of concern is the role of deliberate practice in the acquisition of expertise: whether extent of practice is the primary cause of expert performance, or whether a high level of talent, possibly inborn, must be present for expertise to emerge as an outcome. The two views of expertise are represented, on the one hand, by Sloboda (1996), who stresses parental involvement in fostering good practice habits in musical training and, on the other hand, by Winner (1996), who makes a case for precocity in the visual arts independent of extrinsic influences. It is possible to argue, of course, for domain differences, such that musical expertise is largely a skill developed through specific training and practice, whereas expertise in the visual arts essentially reflects an inborn talent that unfolds over time. On the practice view, one must search further for potential instances of confirmed exceptional performance not preceded by practice and training (p. 33). The approach taken in this chapter opts for neither extreme, but rather for a working model of careers in the performing arts.
A cursory glance at the contemporary research on performing artists points to the very thin boundary separating technical mastery from artistry. Indeed, James Sloan Allen commented on how demanding are the technical expectations of performance in the literature on the performing arts, and on the transformation of technique into art that the exceptional performer must achieve.
For the exceptionally gifted performer, the transformation will produce a personal performance style often distinguished by new interpretive insights.

A highly relevant study employing expert and novice ballet dancers was carried out by Janet Starkes and her associates (Starkes, Deakin, Lindley, & Crisp, 1987). The dancers were shown a videotaped sequence of eight ballet steps without accompanying music. In one case the sequence was selected by an experienced choreographer; in the other case, the sequence used the same elements randomly arranged. The dancers were required to reproduce the sequence of steps from memory. The results showed that the expert dancers did better at the task than did the novices, but only in the case of the sequences with choreographic structure. In addition, when music was added, the expert dancers' probability of recall showed a further increase. Of interest as well are the differences in the strategies of recall of the expert and novice dancers. The novices rushed to reproduce the sequences before the memories faded; the experts, by contrast, had encoded the movement sequences through a process called marking, where dancers substitute hand positions for foot and body positions. These hand positions are reinstated during recall, thereby facilitating retrieval of the sequence of steps. Verbal labels are also sometimes assigned to the particular movement sequences to enhance recall. It is thus evident that acquisition of ballet expertise entails a growing sensitivity to choreographic structure (of which music is an integral part), and the effective use of specialized memory techniques to encode movement sequence (see Allard & Starkes, 1991). Clearly, we have advanced well beyond rote memory as typically conceived. Indeed, as we have seen, removal of choreographic structure through randomization of ballet steps, essentially turning the task into one of rote memory, wipes out the advantage of expertise.

Comparable effects have been reported by Noice and Noice (1997) in their studies of the memory for dramatic script among professional actors. Professional actors and novices were randomly assigned to rote and gist instructional conditions. In the former, the task involved memorizing the lines by rote repetition without scanning forward or backward in the script; in the latter, the participants had to learn the role as if preparing for an audition. Recall proved to be vastly superior in the gist relative to the rote condition for both the actors and novices, although the difference was surprisingly larger for the latter. Noice and Noice (1997) suggested that the actors were better able "to defeat the constraints of the rote condition" (p. 45). In comments following the experiment, the actors talked about the
extreme frustration they felt when not permitted to move freely through the script. No such anecdotal comments were offered by the novices, according to the Noices. In sum, it is evident that rote memorizing does not work for either actors or dancers, for the reason that it does not permit access to the deeper structure or meaning of the material to be learned.
Personality. As we depart from the study of the skills inherently demanded by the performing arts, and move to the personality correlates of choosing specific performing arts careers, we cannot anticipate findings distinguished by great statistical power. For we are essentially asking whether there are characteristics beyond the requisite talents that contribute to choosing particular performing arts careers. One can conceptualize the issue in multiple regression terms (i.e., what is the incremental contribution of personality dispositions to choice of and success in the performing arts, assuming that skill and talent factors constitute the strongest predictors?). There is also the possibility that the criteria employed (choice of a performing arts career and success in that career) may have distinct causal antecedents. Conceivably, different qualities are associated with success in the performing arts than with choosing to embark on such a career in the first place. Unfortunately, at the present time, we do not have answers to these questions for the reason that the kinds of studies required to answer them have not been conducted. Investigators who study the skill components of the performing arts are not especially interested in personality factors, and vice versa.

A first approach, largely confined to actors, involves the construction of a theoretically based instrument on which actors are expected to achieve especially high scores. A good example is represented by Mark Snyder's (1987) construct of self-monitoring, whose central feature concerns the extent to which individuals are able to achieve self-control, and accordingly "are, in a sense, actors with a large repertoire of roles, willing and able to work from a wide range of scripts; they cast themselves in many different parts in life. By contrast, the low self-monitor may be likened to the performer who has the same part in every production, particularly the performer whose own personality seems to provide the script for his or her every role" (Snyder, 1987, p. 186). Not surprisingly, the item that best differentiates high from low self-monitors on the Self-Monitoring Scale reads as follows: "I would probably make a good actor."
A second approach involves personality comparisons between performing artists and control subjects, and among different kinds of performing artists. The hypothesis behind this work is that the performing arts constitute distinctive subcultures, each with its own trait requirements. The presumption is that individuals will thrive to the extent that they possess the requisite traits. Bakker's (1988) research on young ballet students exemplifies the approach. That author observed that, relative to controls, ballet students were more introverted, achievement-oriented, and higher in levels of emotionality. These traits are considered conducive to adaptation to the world of classical ballet: the strong emphasis upon individual performance during training is viewed as conducive to introversion, the focus on competition to succeed as consistent with achievement motivation, and the need to give emotional expression to music and choreography as consistent with heightened emotionality.

Clearly needed is a personality comparison within the same study of artists across different fields. Such a study has been carried out with actors, musicians, singers, dancers, and nonartist controls, who volunteered to complete a personality instrument, the Eysenck Personality Profiler; among the dimensions indexed by that instrument are extraversion, neuroticism, and psychoticism. In that study, dancers emerged at the dysphoric emotional extreme, and musicians at the introverted pole. Dancers obtained their most extreme scores on traits reflecting anxiety and low subjective well-being. One cannot believe that these characteristics have any bearing on the choice of a dance career: Ballet careers generally begin in preadolescent childhood, and it truly stretches credibility to believe that the most neurotic children are selected out for ballet training. Rather, such neurotic traits may well be reactive to the kinds of life stresses that a dance career entails. Similarly, in the case of classical musicians, the most extreme scores are for traits suggesting a lack of ambition and drive. Again, we cannot quite believe that these traits characterized musicians at the time they embarked on a musical career. Rather, the traits probably reflect the classical musician's realization that he or she is but a part of a large musical ensemble, and that aspirations toward advancement to the status of soloist are unrealizable.

It is evident, in sum, that the personality analysis of performing artists carries with it an array of interpretive difficulties. In the few studies of performing artists commencing in childhood, we are left with a set of distinguishing traits, some possibly of genetic origin, and some strongly suggestive of a reactive response to the pressures of a performing arts career. Conceivably, the nature of this reactive response is partially attributable to individual differences in susceptibility to stress, a person-by-situation interaction, if you will. Clearly, we have barely scratched the surface of the role of personality in the lives of performing artists. Contradictions abound that call for some resolution. In a survey of dancers contemplating retirement, Ellen Wallach (1988) vividly reported on the emotionally wrenching quality of the separation from dance, and Hamilton and Hamilton (1991) described a high level of suicidal ideation. It is apparent that dancers derive much satisfaction from their careers, and do not relinquish them lightly. How is one to reconcile such information with personality data pointing to heightened levels of dysphoric affect and lessened subjective well-being in dancers? Where actors are concerned, we might well ask how they manage to maintain their extroverted and adventurous dispositions in the face of the remarkable instability that characterizes acting careers. As we shall soon see, approximately 95% of New York stage actors are not engaged in acting at any particular point in time. How do we account for the resiliency that actors seem to display in the face of periodic rejection? In the remainder of this chapter, I explore socialization influences and childhood experience that might have contributed to choosing to pursue acting or dance as a career. Then, having chosen a performing arts career, what are
the motivational forces at work that lead to persistence in the face of career hardships?
Socialization. In the matter of socialization, consider the 448 actors and dancers sampled in Reciniello's (1987) dissertation. Where parents of these performers are concerned, the data clearly demonstrate substantial involvement in the arts at either the professional or amateur level. For fathers, 21% were professionals and 49% participated in the arts as amateurs. In the case of the mothers, the figures were 22% and 69% for professional and amateur involvements, respectively. In this sample, actors and dancers were basically similar in the pattern of parental artistic involvements.

Consider next the responses of the actors and dancers to Helson's (1965) Childhood Activity Checklist. A factor analysis of these data generated six factors, of which two are especially relevant: an Imaginative Play Factor and a Performing Factor. Of particular interest in the case of Imaginative Play is the degree to which the item content represents apparent analogues of acting at the adult level. In particular, "creating imaginary situations," "writing poems and stories," and "pretending to be different people" are dramatic by their very nature, and it thus appears that actors by almost direct extension over time have found the perfect field in which to indulge their childhood passions. Of course, dance, too, can have strong imaginative elements, and as we have seen, dancers also score high on the Imaginative Play factor. But it makes evident sense that actors would score higher than dancers, given the internal kinesthetic focus of dance in comparison to the imaginary role playing that constitutes the essence of drama.

Both of these retrospective data sources suggest that the choice of acting or dancing as a possible profession may be set quite early in childhood, given the observed high level of parental involvement in the arts and the pattern of childhood activity preferences. Those two factors in combination must have a powerful impact, for it implies that the parents are at least implicitly encouraging the child along pathways close to that child's salient interests.
re~uested tospecify the major cause of that unfortunate situation and to whether that cause would continue to prevail in the future, The a n tiveevent; other itemsreflect itemobviouslyrepresents lected out of the cast to as tive or positiveeventscan e x t e ~ a ~(i.e., y to 0th emore, the cause may be viewed as specific to the event described or as likely to prevail when such events occur in rds, the ~ s or stability ~ of theb cause ~presumed ~
g to find that actors’ internal at that a sense of personal control over eventsareviewed as stable, S ood eventsis accompa~ed e ~ p e c t that a~~ such control d persist mto the fume. Onthe other that one is p ~ s o ~ aresponsible ~ y for ple, is not necessdy experienced as rs clearly do not fi things that are h a ~ p e to ~ nme~ and that future.) Actors obviously do not lend
oice and m ~ t e n ~ of c ea per
8.1 divides career develop~entinto five
-ats careercannot ate such acareer. F
Assessment in the Performing Arts 16 By thetimeyoungadulthoodisreached, one canexpect that those aspirants without the requisite talent and motivation would have movedon to other fields. Those few who move on to a career in the p e r f o ~ n garts will identify themselves as professional musicians, dancers, singers, actors, and will be subject to all of the rewards andhardshps that those fields have to offer. As we have observed in thecase of acting, the incidence of fdure to maintain is so very substantial that adaptive coping strategies are required one’s career identifica~onas an actor. Further, stage fright troubles many performers (seeWilson, 1994, chap. ll), andcopingstrategiesmustbe developed to deal with it if a performer is to remain in the field. The figurelistsworkversushomedemandsandphysicalinjuryas extrinsic factors, For those performers withf d y attachents whose work requires constant touring, the strain that such travel can have on marital ~elations~ps isconsiderable. Not all performerscope wellwithsuch stresses. In thematterofphysicalinjury,balletdancers axe particularly susceptible. If such injury is severe enough,it can seriously dsrupt and even end a dance career. It is events such as these along with all of the other exigencies a c c o m p a n ~ ga performingartscareerthataccountsforthe inclusionofpost-careerchoicesunderintrinsicfactors. For somefields sooner, and for others later, the performerw i l l have to consider post-career o tions.Suchoptions,ofcourse,are S gly different for the dancer at age 35, and the musician‘ retiring from a symphony
8.1 isintendedasagenericworkingmodel for the career ent of performing artists. No doubt, models constructed for the differentperformingartswouldrevealsomevariationfromthegeneric model as well as greater precision regardingthe forces at work ’ It is c o ~ o n knowledge,forexample,thatballetcareers cM&en as young as 7 to 8 years of age. Musical careers m earlier. For these performing arts,a k e c t trajectory can be child’s to thematureperformer’sactivities. In contrast, it isdubious whether acting careers (apartfrom a small ~ o rof ichild ~ actors) fo begin at such young ages, and the trajectory from childhood activities to the profession of acting is likely to display nmnerous twists and turns. In sum, theworkingmodeldisplayed in the figureis intended as a heuristic device to promote hture research on the forces that c o n ~ b u t eto the development of performing artists. For inves tors seeking a relatively uncultivated field, the study of p e r f o ~ artists g offers ani n ~ t i n gtarget.
A s s ~ s s ~inethe ~ tPerf02
ts
Sloboda, J. (1991). Musical expertise. In K. A. Ericsson & J. Smith (Eds.), Toward a general theory of expertise (pp. 153-171). New York: Cambridge University Press.
Sloboda, J. (1996). The acquisition of musical performance expertise: Deconstructing the "talent" account of individual differences in musical expressivity. In K. A. Ericsson (Ed.), The road to excellence. Mahwah, NJ: Lawrence Erlbaum Associates.
Werner, H., & Kaplan, B. (1963). Symbol formation. New York: Wiley.
Wilson, G. D. (1994). Psychology for performing artists: Butterflies and bouquets. London: Jessica Kingsley.
Wilson, M. J. (1989). Unpublished doctoral dissertation.
Winner, E. (1996). The rage to master: The decisive role of talent in the visual arts. In K. A. Ericsson (Ed.), The road to excellence (pp. 271-301). Mahwah, NJ: Lawrence Erlbaum Associates.
I think it was World War II that forced psychologists to rethink the concept of validity. We could no longer just ask "Does this test predict the outcome of training?" We had to ask "Does this test predict performance on the job, such as skill in combat?" and "How does one assess that?" Does anyone want to volunteer as observer-rater for studying personnel in combat?

In his paper, Validity for what? (1946), Jenkins reflected on the professional lessons learned during the war by psychologists working in the Armed Services. He pointed out that, before World War II, the major emphasis was on the predictor test, with any old criterion being hauled in as needed. In that war, the psychologists working on personnel selection and classification quickly saw the need to pick and study each criterion carefully. The criteria must be valid. But what do you do if scores from two immediately successive flight checks (work sample tests, each conducted by an experienced instructor) correlated with each other around .00? (Yes, you can fire the instructor you like least, but that wouldn't fix things.) The problem was not just due to subjectivity in instructor ratings. Psychologists were also disturbed to discover that there was an essentially zero correlation between the test and the retest the next day of graduates from bombardier school. A moment's thought about the situation makes one realize the multiple determinants of the nice, neat measure of the distance between where the dummy bomb landed and the target. How well did the pilot line up the bomber before turning it over to the bombardier? Did wind and weather affect the accuracy? So even an objective measure, distance, may have its limitations.

From these results, psychologists saw that they had to be concerned about the validity of criteria, and that they should be cautious about criteria that a priori seemed quite adequate. And as a staff officer, what would your recommendation be to your C.O. if you obtained results like these? (I am afraid I do not recall what the military officers did to resolve these serious problems, but we won the war anyway.)

After the war, some psychologists continued to work in their ivory towers, concerned primarily with the tests themselves, with only passing attention to relationships between tests and the real world. Harold Gulliksen's Theory of Mental Tests is an excellent presentation of that approach as of 1950. In that book, he was more concerned with true scores than with test validity as obtained in the real world. That same year, however, he published a paper in the American Psychologist on intrinsic validity, arguing that psychologists must assess the intrinsic validity of the
criterion. By its correlations with other measures, we can learn about the criterion. He distinguished between intrinsic content validity for achievement tests and intrinsic correlational validity for ability tests.

Guilford was more practical. By the second edition of his Psychometric Methods (1954), he offers not only a careful statement of the classical work on validity but also hints of things to come, with references to subjects' motivations and to response sets. Regrettably, this edition omitted his citation (found in the first edition) of Clark Hull's 1928 book on Aptitude Testing. (That date is not an error. It was before Hull got into how rats learn.) Hull identified three classes of aptitude criteria: product criteria, such as amount of work done; action criteria where one measures what occurs, such as the speed of a runner or the height of the bar in pole-vaulting; and finally subjective judgments, if nothing else is available.

A major shift in our thinking about validity, and a major advance, came with the work leading to the first Technical Recommendations for Psychological Tests and Diagnostic Techniques published by the American Psychological Association in 1954. The authors' basic distinctions between content, concurrent, and predictive validity helped to structure our thinking. This committee report was followed by the Cronbach-Meehl paper on construct validity (1955). Those authors pointed out that the basic notion of construct validity cannot be expressed by a single coefficient. It is a pattern of results consistent with the pattern specified by the conceptualization of the construct. One basic statement in that paper deserves our attention and careful thought: "The investigation of a test's construct validity is not essentially different from the general scientific procedures for developing and confirming theories" (p. 300).

That paper and the Technical Recommendations document started a concern with validity of constructs and not just with the validity of a single test. But the spread of concern with validity did not stop there. It went over to validity of the whole research plan. To be more specific, "validity for what" became questions about the validity of conclusions, of interpretations, and of generalizations. It was almost as if we began to ask about the validity of each piece of information in our whole research enterprise. We'll turn to that development later in this chapter.

Concern with classical problems continued alongside concern with construct validity. In 1957, Loevinger published a monograph, Objective tests as instruments of psychological theory. A large part of the text is concerned with construct validity. As a part of validity, she designated structural validity, which includes the fidelity of the structural model to the structural characteristics of the construct.
For each type of validity, Campbell and his collaborators list a number of threats in their classic treatments of research design. The lists are helpful to the student, although they probably also increase student anxiety, supporting that basic oracular warning from research lore: if anything can go wrong, it will.

The sudden flare-up of program evaluation, a much-needed line of effort fostered by federal support for research on social plans and programs, did much to stimulate thinking about validity. Working together on this endeavor was the Stanford Evaluation Consortium. It generated two major works: Toward Reform of Program Evaluation (Cronbach, L. J., et al., 1980) and Designing Evaluations of Educational and Social Programs (Cronbach, L. J., 1982). Cronbach's UTOS notational and analytic system provides an excellent basis for thinking about many fundamental validity issues. (In UTOS, U stands for the unit studied, T for the treatment, O for the observations, and S for the setting.)

Cronbach's 1989 picture of construct validation after thirty years came out a few years later. He noted the great transformation from construct validity as an approach developed for applications where other forms of test validity could not be applied (for example, the validation of projective tests) to its status as the base on which all other forms of validity rest. One can look at the other forms of validity as contributing to construct validity.

Meanwhile, Cook, Campbell, and others at Northwestern were extending these ideas. Their contributions included secondary analysis, the desirability of having a body of evaluation data reexamined by an independent analyst using different analytic methods. After all, what you get out of data depends on how you analyze them.

The increasing dominance of construct validity is reflected in Messick's analysis of the concept. He distinguished six aspects of construct validity: content, substantive, structural, generalizability, external, and consequential. He went on to say that these six aspects "function as general validity criteria or standards for all educational and psychological measurement" (Messick, 1995). (The original statement of the analysis was made at least 6 years earlier.) In other words, these are matters that all of us who are concerned with measurement should worry about.

The content aspect is concerned with the construct domain: what is inside it and what is outside it? In particular, does the
assessment include tasks that provide a representation of all parts of the construct domain?

The structural aspect of construct validity addresses the consistency between the internal structure of the assessment and what is known about the internal structure of the construct domain. The theory of the construct domain should guide the rational development of the construct-based scoring criteria.

Messick's generalizability aspect of construct validity "examines the extent to which score properties and interpretations generalize to and across population groups, settings, and tasks, including generalizability of test-criterion relationships across settings and time periods" (Messick, 1996, p. 9). In his later discussion, he included in this aspect the consistency of performance across tasks. Couldn't the degree of such consistency be expected to vary with the delineation of the construct? A heterogeneous construct should, quite appropriately, have low mean intertask correlations, assuming that different tasks are aimed at different parts of the construct domain.

The external aspect of validity refers to the pattern of relationships between assessment scores and criterion measures in applied situations and also the relationships among the assessment scores. The consequential aspect "appraises the value implications of score interpretation as a basis for action as well as the actual and potential consequences of test use, especially in regard to sources of invalidity related to issues of bias, fairness, and distributive justice" (Messick, 1996, pp. 9-10).

The substantive aspect of construct validity is focused on the processes used by respondents: Are they the ones that were intended on the basis of the domain processes? Should we try to maximize the similarity between the processes required for successful answering of a test item and those embedded in the construct as delineated? At one time, I was convinced that close correspondence between these was necessary for good measurement. But as I thought about it, I realized that rarely could the two processes be expected to be similar. To be sure, one can measure achievement in multiplication by asking the subject to multiply some pairs of numbers, or measure competence in spelling by asking the subject to spell some words. But that is about as far as one can go. In Measuring the Concepts of Personality, I suggested three test designs: simulated stimuli, a priori related process, and empirically related process. A test of field independence illustrates the simulated stimuli type of design, subject reports of prior behaviors illustrate the a priori related process, and
measures whose process linkage is established through research illustrate the empirically related process. We use whichever is possible and promising for the task at hand.

After studying Messick's comprehensive analysis of construct validity, one would hesitate to undertake the determination of the construct validity of any assessment operation. One sees why construct validation is seen as a never-ending procedure. One also speculates (or dreams) about the possibility of going back to nontest behavior, to opportunities for appraising actual manifestations of the construct under naturalistic conditions: forget the ethical problems, forget the difficulties in obtaining samples that cover the construct domain, forget the difficulties in obtaining reliable judgments of the construct from recordings. No, let's forget that daydream and get back to measuring procedures where we, the authorities, can tell our subjects as subjects what they have to do. Back to reality.

With a bit of luck, our work on construct validity of measurements and on forms of validity in research designs will complement each other, so that we can fit the various pieces into one big picture. As Cronbach and Meehl said at the end of their 1955 paper (and as I have quoted earlier), construct validation is essentially the general scientific approach for testing theories. In pursuing it, we are doing what colleagues in other sciences have been doing all along. Our thinking about it may help us to understand better what we are doing.

Anastasi wrote in 1986 that construct validation is a never-ending process. It is, however, not a task for Sisyphus, doing the same thing over and over. And we are moving ahead. We have identified a major problem in psychology, and that is the first step toward solving it.

Construct validity refers to a goal. It may be a will-o'-the-wisp, leading us farther and farther down a road. I say that because rarely if ever have I seen a completely executed program for establishing the construct validity of anything, an assessment procedure or a construct. Yes, construct validation is a never-ending process, partly because the construct validation of one assessment procedure depends to some extent on the construct validation of other constructs and assessment procedures.

To summarize what this chapter has argued up to this point, there has been a development from one loose concept of validity to multiple meanings. Validity is a constantly expanding problem. Even by 1980, Messick counted 17 usages and labels for kinds of validity. Those working on a single test seek to increase its reliability and hence its validity, a reasonable assumption if one has some validity to begin with. But others
are concerned with the validity of the criteria. More broadly, we are concerned with the validity of everything we use: not just the validity of measurement procedures but also the validity of the research designs, the validity of the experimental methods (including the validity of the manipulations themselves), and the validity of our conclusions and inferences. Construct validity underlies all of these aspects of research. "Validity for what?" Validity for predictor tests and for criteria. Validity for everything we use and do in research.

On the validity matter, does anyone feel completely comfortable with where we are today? I am impressed by the progress we have made, particularly with the extending of our validity concerns, but a bit reluctant to … the question: What are we … I don't feel that … If you ask me the embarrassing question, have I ever fully specified a construct, following the prescriptions given in that chapter, I have done so only rarely. … Some researchers are not concerned with having a detailed picture of a construct at the outset of their research: They feel that it will emerge as their research results come in. I differ with that view and would spend some time and energy … But consider what these differences in … of what we are … observed in the natural setting … better constructs.
REFERENCES

Anastasi, A. (1986). Evolving concepts of test validation. Annual Review of Psychology, 37, 1-15.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Cronbach, L. J., Ambron, S. R., Dornbusch, S. M., Hess, R. D., Hornik, R. C., Phillips, D. C., Walker, D. F., & Weiner, S. S. (1980). Toward reform of program evaluation: Aims, methods, and institutional arrangements. San Francisco: Jossey-Bass.
Fiske, D. W. (1971). Measuring the concepts of personality. Chicago: Aldine.
Fiske, D. W. (1978). Strategies for personality research. San Francisco: Jossey-Bass.
Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York: McGraw-Hill.
Gulliksen, H. (1950). Intrinsic validity. American Psychologist, 5, 511-517.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Hull, C. L. (1928). Aptitude testing. Yonkers-on-Hudson, NY: World Book.
Jenkins, J. G. (1946). Validity for what? Journal of Consulting Psychology, 10, 93-98.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694.
Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-966.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education/Macmillan.
Messick, S. (1995). Validation of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.
Messick, S. (1996). Validity and washback in language testing (Research Report No. 96-17). Princeton, NJ: Educational Testing Service.
Newtson, D. (1976). Foundations of attribution: The unit of perception of ongoing behavior. In J. Harvey, W. Ickes, & R. Kidd (Eds.), New directions in attribution research (pp. 223-247). Hillsdale, NJ: Lawrence Erlbaum Associates.
Sternberg, R. J., & Barnes, M. L. (Eds.). (1988). The psychology of love. New Haven: Yale University Press.
Technical recommendations for psychological tests and diagnostic techniques. (1954). Psychological Bulletin Supplement, 51(2), 1-38.
The study of human cognition and the measurement of educational achievement are beginning to cross intellectual and empirical paths. Current efforts to define the conditions of educational attainment with the integral use of assessments demand an explicit alliance of these disciplines. Cognitive research has described processes, strategies, and structures of knowledge that contribute to competent performance and identified the characteristics of performance change as subject-matter competence develops. This work is contributing to a better understanding of what learning involves, to a theoretical and empirical base for measuring what has been learned, and to the formulation of methods for addressing certain aspects of the construct validity of performance assessments. Construct validity in a modern context of cognitive process interpretation would imply that assessment situations be evaluated in terms of the coordination of knowledge and skills in a particular domain and the associated cognitive activities that underlie competent performance (Glaser, 1981; Linn, Baker, & Dunbar, 1991; Messick, 1994, 1995). Of critical importance is that ". . . the level and sources of task complexity should match those of the construct being measured and be attuned to the level of developing expertise of the students assessed" (Messick, 1994, p. 21). In this regard, two aspects of the assessment situation merit attention: (a) the relationship between the goals of the task and the performance elicited by the assessment situation (i.e., substantive aspect) and (b) the relationship between students' scores and the quality of the cognitive activity elicited (i.e., structural aspect).
FIG. 10.1. Content-process space of science assessments.
… that the problem does not have a clean solution, providing opportunities for students to apply their subject-matter knowledge and experience to develop an understanding of the flight of the maple seed. In this situation, success is dependent on an adequate representation of the problem, systematic exploration strategies (observation and experimentation), monitored progress, and explanation of outcomes. In contrast to the maplecopter task, tasks that are knowledge lean-process constrained draw on school experiences … for successful completion. Rather, students are given procedures and then asked to re…
… of their investigation, students replicate potential chemical reactions from that situation. They are explicitly directed to add measured amounts of the relevant substances in a predetermined sequence to set up the reactions. Following this, they are prompted to observe each of these for temperature change and color observed. A table is provided to record observations. Students are then posed questions that can be answered by rereading data from the observations or other information provided. For tasks of this kind, generative opportunities for problem representation and self-monitoring are precluded by the step-by-step directions that are given in the task, so that students with minimal levels of formal instruction can complete the situation.

Content Lean-Process Open. Assessment tasks of this type require students…
… skills with minimal demands for content knowledge … a sequence of process skills. For example, the "Mystery Powders" assessment asks students to identify the substances in each of six bags from a list of five possible alternatives (Baxter, Elder, & Shavelson, 1997). Students are presented with relevant indicators and tools and told they can use equipment in any way they wish to solve the problem. With these instructions, students represent the problem in terms of actions that follow from what they know about the properties of the substances and ways to identify them (i.e., tests and relevant observations). They then implement a strategy, such as …, and revise that strategy, if necessary, based on their interpretations of current trials. Although such tasks require student-generated process skills, their use may become routinized in situations where the demands for content knowledge are minimal.
Content Rich-Process Constrained. Tasks that are content rich-process constrained emphasize knowledge generation or recall. For example, high school students were asked to "describe the possible forms of energy and sources of materials involved in growing a plant and explain how they are related" (Lomask, Baron, Greig, & Harrison, 1992). A comprehensive, coherent explanation revolves around a discussion of inputs, processes, and products, such as: the plant takes in water, light, and carbon dioxide; through the process of photosynthesis, light energy is converted into chemical energy used to produce new materials such as sugar needed for growth; in addition, oxygen is given off (e.g., Gregory, 1989; Solomon, Berg, Martin, & Villee, 1993). In developing their explanations, students make decisions about which concepts are important and how these concepts are related, thereby reflecting their conceptual understanding of the topic. Although the opportunities for explanation are apparent, opportunities for other activities, such as planning, selecting and implementing appropriate strategies, or monitoring problem-solving procedures, are less so.

In summary, specifying cognitive activities in the context of the subject-matter demands (i.e., content and process) provides a framework for anticipating the impact of assessment features on student performance. Tasks can be designed with specific cognitive goals in mind, and task quality can be judged in terms of an alignment with the goals of the developers. Furthermore, scoring systems can be keyed to the quality of cognitive activity rather than the easily quantifiable aspects of performance. Examination of the alignment of tasks and scoring systems in a sample of assessment efforts is now described.
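For concreteness, the content-process space can be rendered as a simple two-way classification. The sketch below is illustrative only: it encodes the two dimensions as enumerations and records the two category assignments that are explicit in the text (the "Mystery Powders" task as content lean-process open; the plant-growth explanation task as content rich-process constrained). All names are hypothetical, not part of the published framework's code.

    from enum import Enum

    class Content(Enum):
        LEAN = "lean"
        RICH = "rich"

    class Process(Enum):
        OPEN = "open"
        CONSTRAINED = "constrained"

    # Category assignments stated in the chapter; other tasks would be placed
    # in the same two-way space when anticipating the demands they make.
    tasks = {
        "Mystery Powders": (Content.LEAN, Process.OPEN),
        "Plant growth explanation": (Content.RICH, Process.CONSTRAINED),
    }

    for name, (content, process) in tasks.items():
        print(f"{name}: content {content.value}-process {process.value}")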
The properties and objectives of assessments and scoring systems visible in programs across the country were examined (Baxter & Glaser, 1998). In each of the assessment situations, we analyzed verbal protocols, observed student performance, examined student written work, and reviewed task instructions and scoring criteria. The goal was to ascertain whether and how these assessments are measuring cognitive capabilities that distinguish various levels of student achievement. Two
questions focused our analyses. First, to what extent are the intentions of test developers reflected in the cognitive activities requisite for successful task completion? A correspondence between the performance goals as described by the test developer and the performance elicited in the assessment situation provides some evidence for the substantive aspect of construct validity (Messick, 1994, 1995). Second, to what extent are students' scores reflective of the quality of observed cognitive activity? A positive relationship between the quality of the elicited cognitive activity (novice to expert) and performance score (low to high) provides some evidence for the structural aspect of construct validity (Messick, 1994, 1995).

We describe here four categories of assessments illustrative of the kinds of hits and misses experienced by test developers in their efforts to design cognitively complex science performance assessments. Two categories-Uniform Performance and Goals Misinterpreted-exemplify the difficulty in translating task goals into assessment situations, and two categories-Understanding Overestimated and Cognitive Consistency-illustrate the necessity of developing scoring criteria that reflect the complexity of the task. Each of these is described in turn.
Uniform Performance. In some performance assessment situations, a potentially knowledge rich-process open activity was configured to elicit only uniform performance with little opportunity for differentiation. For example, consider a plate tectonics task designed to measure fifth-grade students' understanding "of the process that causes rock layers to fold and twist" (California Department of Education, 1993b). Students were provided with a cardboard model and guided through a sequence of procedures for manipulating it. They then were asked to respond to questions based on the results of these procedures. Student responses required description of their observations, and for this purpose, a list of relevant conceptual terms was provided. However, in responding to the task (and scoring performance), there was no requirement for the use of these technical terms or for students to recognize the cardboard model as a representation of the movement of the earth's plates and the resulting changes of that movement. Rather, performance could be independent of an understanding of how geological and geomorphic processes have shaped the earth, understandings that the task was designed to measure.
Goals Misinterpreted. … For example, in descri… Our analyses did identify tasks that elicited performances inconsistent with the intended goals for a problem-solving activity. An example is the "Electric Mysteries" task, where fifth-grade students … the Electric Mysteries assessment.
… into assessment situations without the benefit of well-defined rules for the design of performance assessments … the need for performance scores to reflect the quality and extent of student understanding … interpretation of test scores … knowledge of modern cognition … or structures …
Messick has expressed continuing concern for the evidential basis for construct validity. It is useful to think of the facts and evidence in terms of a theory of human performance for the purposes of test design and the interpretation of measurement procedures. As described in this chapter, mapping cognitive activity onto the relative demands for science content knowledge and process skills forces coherent articulation of assessment concepts and practices that are informed by knowledge of human cognition and learning. Essential in this regard is recognition of the qualitative differences in performance that signal degrees of learning and experience with a subject matter. Explicit attention to these differences will: (a) enhance the quality of current practice for the design and interpretation of achievement measures, (b) enable the definition and analysis of consistency and errors in performance interpretation, and (c) suggest key considerations for scoring performance. In this chapter, we offered a framework for these purposes.

Further development of structures to guide the design and evaluation of performance assessments will require refinement of the framework described here. Consideration also needs to be given to specific understanding of the course of learning and growth of knowledge in other content domains, and the elaboration of the particular characteristics of developing competence appropriate to these subject matters. An important endeavor to be undertaken for improved assessment is empirical research that demonstrates the impact of changes in design features on the quality of cognitive activity observed in assessment situations. Our work also has indicated that the kind of analysis we propose is particularly useful for evaluating scoring procedures and judgments of performance. We pointed to cases where performances have been overestimated or where students scored highly but essentially missed the concept involved. As Linn (1982) noted, ". . . providing the evidence and logical analysis that supports the uses and interpretations of test scores is the fundamental psychometric problem" (p. 13). The message of this chapter has been that such evidence can be derived from consideration of cognitive activity in relation to the content and process skills of the subject matter. Still to be considered is the use of theories of human performance as a basis for the systematic design of performance assessment.
… National Center for Research on Evaluation, Standards, and Student Testing (CRESST), Office of Educational Research and Improvement … The views expressed are those of the authors and not necessarily those of the supporting agencies.
REFERENCES

Anastasi, A. (1967). Psychology, psychologists, and psychological testing. American Psychologist, 22, 297-306.
Anderson, J. R. (1985). Cognitive psychology and its implications (2nd ed.). New York: W. H. Freeman.
Baron, J. B., Carlyon, E., Greig, J., & Lomask, M. (1992, March). What do our students know? Assessing students' ability to think and act like scientists through performance assessment. Paper presented at the Annual Meeting of the National Science Teachers Association.
Baxter, G. P., Elder, A. D., & Glaser, R. (1996). Knowledge-based cognition and performance assessment in the science classroom. Educational Psychologist, 31, 133-140.
Baxter, G. P., Elder, A. D., & Shavelson, R. J. (1997). Effect of embedded assessments on … in science classrooms. Unpublished manuscript, University of Michigan.
Baxter, G. P., & Glaser, R. (1998). Investigating the cognitive complexity of science assessments. Educational Measurement: Issues and Practice, 17(3), 37-45.
California Department of Education. (1993a). Science grade 5 administration manual. Sacramento, CA: Author.
California Department of Education. (1993b). Science grade 8 administration manual. Sacramento, CA: Author.
Chi, M. T. H., Glaser, R., & Farr, M. J. (Eds.). (1988). The nature of expertise. Hillsdale, NJ: Lawrence Erlbaum Associates.
Ericsson, K. A., & Smith, J. (Eds.). (1991). Toward a general theory of expertise: Prospects and limits. Cambridge, England: Cambridge University Press.
Glaser, R. (1981). The future of testing: A research agenda for cognitive psychology and psychometrics. American Psychologist, 36, 923-936.
Glaser, R. (1984). Education and thinking: The role of knowledge. American Psychologist, 39, 93-104.
Gregory, R. P. F. (1989). Photosynthesis. New York: Chapman and Hall.
Linn, R. L. (1982). Two weak spots in the practice of criterion-referenced measurement. Educational Measurement: Issues and Practice, 1(1), 12-13, 25.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15-21.
Lomask, M., Baron, J. B., Greig, J., & Harrison, C. (1992, March). ConnMap: Connecticut's use of concept mapping to assess the structure of students' knowledge of science. Symposium presented at the annual meeting of the National Association of Research in Science Teaching.
Mayr, E. (1976). Evolution and the diversity of life: Selected essays. Cambridge, MA: Belknap Press of Harvard University.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-104). New York: Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performance as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.
Resnick, L. B. (Ed.). (1989). Knowing, learning, and instruction: Essays in honor of Robert Glaser. Hillsdale, NJ: Lawrence Erlbaum Associates.
Solomon, E. P., Berg, L. R., Martin, D. W., & Villee, C. (1993). Biology (3rd ed.). Orlando, FL: Harcourt Brace Jovanovich.
It is a personal pleasure to recognize the many contributions of a friend and colleague who has had such beneficial influence on educational and psychological measurement. Sam has been a leader in molding the way we think and in creating the structures on which we hang our better ideas. Such accomplishment deserves high honor.

In this chapter I want to discuss an important issue concerning fairness in test design-a consequential issue that is too little appreciated. My comments originate in a project in which I was involved with Nancy Cole and others in the 1990s-a study of Gender and Fair Assessment (Willingham & Cole, 1997). Cole deserves equal credit for the concern about fairness in test design, but she is not responsible for liberties I may take here in extending the topic. A brief overview of the ETS Gender Study will provide useful context for my remarks.

Our group of ETS researchers undertook this study for three reasons. We hoped to help clarify some confusing findings regarding gender differences; we realized that testing organizations need to know more about fairness issues in order to cope with new forms of assessment; and we believed that a careful study of gender difference and similarity would provide a useful template for studying fairness issues generally. This was a 4-year study involving extensive review of previous work and quite substantial data assembly-about 1,500 data sets involving some 400 tests and related measures of experience, attainment, and proficiency.

Our review of previous research convinced us that observed patterns of gender difference are distorted by three confounded effects: construct differences, sample differences, and cohort differences (i.e., differences by grade and year). We focused especially on large, nationally representative samples in order to better disentangle the construct, sample, and cohort effects.
… science scores … young women are now … plausible alternatives with … in measurement are … representation of … differences observed in test performance … the actual or hoped-for positive impact of tests on …
                          Test Interpretation      Test Use

Evidential Basis          Construct validity       Construct validity + Relevance/utility

Consequential Basis       Value implications       Social consequences

Source: Messick, 1980.
FIG. 11.1. Facets of validity.
As we know, enthusiasm for and worries about the educational impact of tests have spawned large initiatives: standards-based assessment, the national test, and so on. Test fairness has important ties to these concerns. Group impact and educational impact are the quintessence of social consequence. They drag measurement practice into that fourth quadrant.
Test Design Is Critical. In Figure 11.1, social consequences are the outcome of test use. That does not necessarily mean that test use is the stage of the assessment process where decisions are most likely to affect group differences. If there are limited test options to consider, then the decisions most likely to bear on test fairness must come earlier. Test design is a natural place to look, because we design tests with a use in mind. There was a time when the user typically decided whether to test, which test to use, and whether it was valid for the purpose. That is still true for some types of tests, but for the most important present-day educational tests, the options as to construct choice lie mostly in the design stage. We do not usually build several different high-stakes tests and then expect users to consider carefully the utility and consequences of each test, and decide-as independent practitioners-which test to use. We design tests with an intended purpose and population in mind. We build the test in consultation with users, verify its validity and quality, and then the test is put to use-more or less as intended. This is the typical scenario with admissions tests, scholarship tests, major placement tests, and to a
considerable degree, the most frequently used commercial achievement tests. Thus, much hangs on test design.

There are Options in Test Design. Constructs are not provided by some higher power. In large measure, we still use old-fashioned expert judgment in writing sensible items, and then apply quite modern statistical techniques to assemble tests and to see what we have. In educational testing certainly, constructs are complex. Even in relatively homogeneous tests, the tested construct is layered with latent skills and knowledge bundles. We try to design test constructs by deciding what should be part of the construct and what should not, but these decisions are certainly not based on sure knowledge of the cognitive skills involved or what construct irrelevance may lurk in the assessment format. Describing this process as construct design is perhaps more flattering than factual. Nevertheless, my point is, we make choices-deliberate or unknowing.
There are Consequences to the Options. It is understandable that individuals and groups of examinees might perform better on one construct than another, but why are there choices that affect fairness? We always shoot for the most valid test. Isn't the most valid test the fairest test? That is true, but not always by any means. For example, adding a verbal measure gives an admissions test better predictive validity overall, but worsens the underprediction of grades of language minority students (Ramist, Lewis, & McCamley-Jenkins, 1994, Tables 5 & 8). There may be good reasons-educational or social-for favoring different representations of a construct. As a result, inconspicuous latent skills embedded in the construct may enhance its relevance for some purposes, but also alter the pattern of group differences. And most important, tests invite different interpretations of validity because tests often have multiple uses, and, some argue, major side effects.

What if two test designs that are otherwise equally valid favor different groups? How does one resolve that choice? What fairness principle is involved here? Definitions of test bias have always been based on the notion of differential validity. Increasingly, we have come to regard test fairness as a generalization of that idea; that is, comparable validity across groups (Cole & Moss, 1989; Moss, 1995). One can think of test fairness as the absence of invalid group differences, but that principle is manifested differently in different aspects of the assessment process (Willingham, 1998).
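The underprediction finding mentioned above can be made concrete. The sketch below is a minimal illustration on synthetic data, not an analysis of the cited study: it fits one pooled regression of a criterion (e.g., college grades) on a test score and then compares mean residuals by group. A positive group mean residual signals that the common regression line underpredicts that group's criterion performance. All numbers and names are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2_000
    group = rng.integers(0, 2, n)            # 1 = hypothetical minority group
    test = rng.standard_normal(n)
    # Hypothetical criterion: group 1 earns higher grades than the pooled
    # line will predict (the +0.2 offset), producing underprediction.
    grades = 0.6 * test + 0.2 * group + 0.5 * rng.standard_normal(n)

    X = np.column_stack([np.ones(n), test])  # pooled regression, no group term
    beta, *_ = np.linalg.lstsq(X, grades, rcond=None)
    residuals = grades - X @ beta

    for g in (0, 1):
        print(f"group {g}: mean residual {residuals[group == g].mean():+.3f}")
    # Expected: positive mean residual for group 1 (underprediction).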
… acquired that are relevant to … for different …
1. …-an illustration involving construct choice. In such situations, construct choices are based more on judgment than on evidence. Willingham and Cole illustrated what can happen when a school board chooses one pair of tests over another for a school-leaving exam (Willingham & Cole, 1997, pp. 241-243). They used NELS and NAEP data to compare four pairs of tests with respect to the number of females and males who would fail to earn a diploma. The four pairs were based on different value judgments, in each case quite a plausible rationale involving … or national interests. It's easy
to see the difference in the outcome: female failures for Pair 1 (Math …), male failures for Pair 4 (Reading & Writing) … Which outcome best fits other facts in the situation?
TABLE 11.1
Four Hypothetical Choices for a Two-Test School-Leaving Exam: Who Fails?

Ratio of female to male failures* by test pair: 1.31 … .78 …

* "Failures" scored in the bottom 10% on both tests. Source: Willingham & Cole, 1997.
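The "who fails" computation behind Table 11.1 is simple to state: a student fails if he or she scores in the bottom 10% on both tests of the chosen pair, and the table reports the resulting ratio of female to male failures. A minimal sketch on synthetic data follows; the score distributions (small opposite-signed mean shifts) are purely hypothetical, not estimates from NELS or NAEP.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    # Columns: the two tests of a pair (e.g., reading, math).
    female = rng.standard_normal((n, 2)) + np.array([0.15, -0.10])
    male = rng.standard_normal((n, 2)) + np.array([-0.15, 0.10])

    def failure_rate(scores, cutoffs):
        # "Failure" = bottom 10% on BOTH tests (see table note).
        return np.mean((scores[:, 0] <= cutoffs[0]) & (scores[:, 1] <= cutoffs[1]))

    pooled = np.vstack([female, male])
    cutoffs = np.percentile(pooled, 10, axis=0)   # bottom-10% cut on each test
    ratio = failure_rate(female, cutoffs) / failure_rate(male, cutoffs)
    print(f"female/male failure ratio: {ratio:.2f}")

Changing which two tests make up the pair changes the ratio, which is the design choice at issue.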
2. … fairness and positive educational consequences against practical difficulties, like cost to the examinee.

3. Writing as a latent skill-an illustration involving format choice. There have been a number of reports suggesting that men do better on the multiple-choice format and women do better on the essay format. There is much data
to confirm that tendency, but the issue is more complicated. Table 11.2 shows average gender differences on the multiple-choice and essay sections of Advanced Placement examinations in four areas. The first two subject areas-language and literature and mathematics-show no format effect; that is, the amount of gender difference is the same in the two formats. The other two areas-natural science and history and government-show a gender difference favoring males, larger on the multiple-choice section than on the essay section. This differential format effect across subjects is consistent in AP exams from year to year (Willingham & Cole, 1997, pp. 260-262). Much the same subject pattern has been described in the GCE exams in England (Murphy, 1982).

Why the subject difference? The work of Breland, Danos, Kahn, Kubota, and Bonner (1994) clearly implicates writing as a major source of the format effect on the AP History exams. In this analysis, writing skill appears to interact with content knowledge, which strengthens the supposition that writing is important to proficiency in history (Willingham & Cole, 1997, p. 264). Such results suggest a reasonable question in the design of an educational test. Namely, "How important is writing to the intended construct?"
4. Spatial visualization as a latent skill-another illustration involving format choice. In 27 AP examinations, Willingham and Cole found that wherever there was a format effect-in about half of the examinations-free-response favored women and multiple-choice favored men (Willingham & Cole, 1997, p. 261). Is that a generalizable result? No, it depends on what one means by free-response. The 1990 NAEP Science Assessment showed a small difference favoring males on a multiple-choice section, and a larger difference favoring males on a figural-response section (Jones, Mullis, Raizen, Weiss, & Weston, 1992). It was the men who did better on free-response, not the women. The items in this section appear to call on spatial visualization-a skill on which males tend to perform well.
TABLE 11.2
Standard Gender Difference on the Multiple-Choice and Free-Response Sections of AP Examinations in Four Subject Areas

                              Mean Gender Difference
                            M-C Section    F-R Section
Language & Literature           .01            .04
Mathematics                    -.28           -.28
Natural Science                -.42           -.21
History & Government           -.31            .00

Source: Willingham & Cole, 1997, p. 262 (Jones, Mullis, Raizen, Weiss, & Weston, 1992).
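Entries in a table like this are standardized mean differences: the female-male gap on a section expressed in pooled standard deviation units, which puts multiple-choice and free-response sections on a common scale. A minimal sketch of the computation, with placeholder arrays standing in for real section scores:

    import numpy as np

    def standard_difference(f, m):
        """Standardized mean difference (female minus male) in pooled-SD units."""
        f, m = np.asarray(f, float), np.asarray(m, float)
        pooled_var = (((len(f) - 1) * f.var(ddof=1) + (len(m) - 1) * m.var(ddof=1))
                      / (len(f) + len(m) - 2))
        return (f.mean() - m.mean()) / np.sqrt(pooled_var)

    # Hypothetical section scores; a format effect is a difference between the
    # statistic computed on the M-C section and on the F-R section.
    rng = np.random.default_rng(0)
    mc_female, mc_male = rng.normal(0.0, 1, 300), rng.normal(0.3, 1, 300)
    print(f"M-C section difference: {standard_difference(mc_female, mc_male):+.2f}")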
In another analysis, a NELS performance test involving figural material yielded a similar result (Pollack & Rock, 1997). In these two assessments free-response apparently meant spatial visualization. But in Advanced Placement, free-response usually means writing-a quite different cognitive implication and a quite different gender result. There is a lesson here. We are prone to think of bias and fairness issues as construct irrelevance-something quirky about the test. But this fairness issue is not construct irrelevance. Both writing and spatial skills are relevant to science. The question is whether the balance of writing and spatial skills in the test accurately reflects the criterion domain.

5. Achievement in math-an illustration involving construct representation. Educational test batteries normally include a mathematics test because mathematical skills are critical to many areas of study and a number of career lines. Gender differences follow a dissimilar pattern in different aspects of math proficiency (i.e., different construct components). Many studies have shown the same general result. Females tend to do better on basic facts and computation. Males tend to do better on math reasoning, particularly at higher age levels (Willingham & Cole, 1997, pp. 286-289). Mathematics illustrates why choosing how to represent a construct can create a disjunction between utility and fairness. Test function is the mediating consideration. If a math test is designed for course placement, basic facts and computation may be the most appropriate content. If the test
… considerations: (a) the effects of such alternatives … assessments … in the context of actual use.
It is in the last quadrant of Sam Messick's matrix that we best perceive the multiple interpretations of validity, the side effects that may be main effects, and the conflicting values that create divergent impressions of a fair or unfair test. We measurement specialists correctly maintain that it is the user who must make the final judgment on whether to use a test. But it is the measurement specialist who is often the best-informed and has the heaviest hand in designing the test that will probably be used.
REFERENCES
Breland, H. M., Danos, D. O., Kahn, H. D., Kubota, M. Y., & Bonner, M. W. (1994). Performance versus objective testing and gender: An exploratory study of an Advanced Placement history examination. Journal of Educational Measurement, 31, 275-293.
Bridgeman, B. (1991). Essays and multiple-choice tests as predictors of college freshman GPA (ETS RR-91-3). Princeton, NJ: Educational Testing Service.
Bridgeman, B., & McHale, F. (1996). Gender and ethnic group differences on the GMAT analytical writing assessment (ETS RR-96-2). Princeton, NJ: Educational Testing Service.
Bridgeman, B., & Schmitt, A. (1997). Fairness issues in test development and administration. In W. W. Willingham & N. S. Cole (Eds.), Gender and fair assessment (pp. 185-226). Mahwah, NJ: Lawrence Erlbaum Associates.
Cole, N. S. (1997). Understanding gender differences and fair assessment in context. In W. W. Willingham & N. S. Cole, Gender and fair assessment (pp. 157-183). Mahwah, NJ: Lawrence Erlbaum Associates.
Cole, N. S., & Moss, P. A. (1989). Bias in test use. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 201-219). New York: American Council on Education & Macmillan.
Cronbach, L. J. (1980). Validity on parole: How can we go straight? In W. B. Schrader (Ed.), New directions for testing and measurement: No. 5. Measuring achievement over a decade. Proceedings of the 1979 ETS Invitational Conference (pp. 99-108). San Francisco: Jossey-Bass.
Flanagan, J. C., Davis, F. B., Dailey, J. T., Shaycoft, M. F., Orr, D. V., Goldberg, I., & Neyman, C. A., Jr. (1964). Project Talent: The American high-school student (Final Report for Cooperative Research Project No. 635, U.S. Office of Education). Pittsburgh, PA: University of Pittsburgh.
Jones, L. R., Mullis, I. V., Raizen, S. A., Weiss, I. R., & Weston, E. A. (1992). The 1990 science report card: NAEP's assessment of fourth, eighth, and twelfth graders. Princeton, NJ: Educational Testing Service, National Assessment of Educational Progress.
Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027.
Moss, P. A. (1995). Themes and variations in validity theory. Educational Measurement: Issues and Practice, 14, 5-13.
Murphy, R. J. (1982). Sex differences in objective test performance. British Journal of Educational Psychology, 52, 213-219.
O'Neill, K. A., Wild, C. L., & McPeek, W. M. (1998). Identifying differentially functioning items on the Graduate Record Examination General Test. Manuscript submitted for publication.
Pollack, J. M., & Rock, D. A. (1997). Constructed response tests in the NELS:88 high school effectiveness study. Washington, DC: National Center for Education Statistics.
Ramist, L., Lewis, C., & McCamley-Jenkins, L. (1994). Student group differences in predicting college grades: Sex, language, and ethnic groups (CB Rep. No. 93-1; ETS RR-94-27). New York: College Entrance Examination Board.
Ward, W. C., Frederiksen, N., & Carlson, S. B. (1980). Construct validity of free-response and machine-scorable forms of a test. Journal of Educational Measurement, 17, 11-29.
Willingham, W. W. (1999). A systemic view of test fairness. In S. Messick (Ed.), Assessment in higher education. Mahwah, NJ: Lawrence Erlbaum Associates.
Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum Associates.
This chapter has a twofold objective. … First, in Test Validity and Invalidity Reconsidered (Wiley, 1991), I expressed a preference for separating validity questions concerning test use and consequences from those concerning the relations between test scores and constructs. My rationale, at the time, was that investigations concerning test use and its consequences involved, at base, issues about the social organization of human activity, whereas investigations concerning relations between test scores and constructs …
1. What is a construct domain (i.e., how should domains be defined, organized, and structured)?
2. How should constructs be selected and specified in domain terms?
3. How should tests be developed to measure these specified constructs?
4. How should domain-based specifications of constructs be utilized in the validation of test scores developed from them?
Most progress has been made toward answering questions 3 and 4. However, important beginnings have been made on questions 1 and 2 as well. Some of this is sketched further in the chapter. A second part of the chapter focuses on approaches to questions of test use and consequential aspects of validity. Three questions structure this discussion:

1. What conceptual schema can be profitably used to frame questions about test use and the consequences of such use?
2. What role should constructs play in this schema?
3. How should such a schema be utilized in construct validation?
Constructs and Tasks. In this chapter, the term construct encompasses characteristics that are commonly classified as cognitive, focusing more specifically on educational achievement. A construct, here, is an ability (i.e., a human characteristic required for successful task performance). At the simplest level, these constructs can be identified with capacities to perform classes of tasks defined by task specifications. Because they must enable more than a single task performance, the concept implicitly follows from the formulation of an equivalence class of task implementations or realizations, all of which require possession of the same ability construct for successful performance. However, in order to be an ability, a human characteristic must not only differentiate successful from unsuccessful task performance, but must also apply to some tasks and not to others. That is, every ability must be defined so that it subdivides tasks into two groups: those to which that ability applies and those to which it does not.
Learning Goals and Task Performances. In order to comprehend achievement constructs, we must link task performances to the learning goals they instantiate. To do this we must distinguish learning goals from teaching specifications. Curriculum is usually defined in terms of the goals that are to be addressed by a learning system. Such goals refer to what is desired or intended to be learned by students (i.e., what students should become capable of doing after completing instruction). In contrast, teaching specifications, whether phrased in terms of syllabi, lesson plans, or specifications for learning activities, address what instruction must, should, or may take place. These specifications are often phrased as guidelines and linked to goals, but they usually take the form of examples of relevant instruction that are not explicitly analyzed in terms of the total set of goals. In short, we frame educational goals in terms of abilities to be acquired, but we frame instruction in terms of activities (tasks) to be carried out.

During this century, student learning goals have increasingly been phrased in psychological terms. That is, doing has been defined in terms of task performance, and capability has become terminologized as knowledge, skill, or ability. Teaching specifications, on the other hand, are usually phrased in social organizational terms. They focus on activities, mostly defined in terms of what teachers should do with students; less frequently, in terms of student participation in instruction.

Thus, although goals refer most directly to the attributes successful students should come to possess, the operational focus of goals is actually the activities in which the students participate. Obviously, these activities should be selected or created with the goals in view. It is these activities (lessons, tasks) that are intended to promote or assess the learning of the intended capabilities. The unit or entity for which goals are established is therefore some kind of learning activity (or part of one, or an aggregate of several). These units range from an entire school system's full instructional program through those mounted within it for a school, a class, a course, a lesson, a test, or a single task.

Thus, goals are abilities or capabilities, which students are intended to acquire. The structure of interrelations among goals is complex. First, some capabilities are prerequisite to others. That is, some specific learnings must take place before others can occur. (This is not to say that in any curriculum there are not arbitrary orderings of the skills to be acquired, only that there exist some abilities that cannot be acquired before certain others.) Second, capabilities are usually thought of as groupable, that is, clusters of
… the environment or circumstances within which the task performance is to take place, including the social and physical environment; the equipment, physical resources, etc., to be made available …; the specifications directed to the person performing the task, including the communication of its performance goal and the evaluation criteria; and the performance constraints (i.e., the circumstances limiting performance, such as the tools that could be used).
A task specification sets up an equivalence class of task implementations or realizations, such that a realization belongs to a specification's equivalence class if and only if its conditions match those of the specification. It is this framework that allows two different individuals to perform the "same" task, or more than one … setting the context for both performances ensues from …

Both tasks and abilities can be structured. Structure, for our purposes here, consists of subdivisions that group entities within the same subdivision more closely than those in different subdivisions … subdivisions may be … That is, structures need not be simple …
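The equivalence-class rule above has a direct operational reading: two administrations count as realizations of the same task exactly when their specified conditions match. A minimal sketch, in which the condition names and values are purely illustrative:

    # A task specification represented as a set of named condition/value pairs.
    spec = frozenset({
        ("performance_goal", "identify the substances"),
        ("resources", "indicators and tools"),
        ("constraints", "40 minutes, individual work"),
    })

    def belongs_to(realization: frozenset, specification: frozenset) -> bool:
        """A realization is in the specification's equivalence class iff its
        conditions match those of the specification."""
        return realization == specification

    # Two administrations with matching conditions count as the "same" task:
    print(belongs_to(spec, spec))  # True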
… Educational work, especially that linked to testing, is organized around the correspondence of abilities and tasks. Thus, tasks are grouped into content domains and classes of task-ability categories … not whether goal or ability categories apply to the tasks or to the abilities. The issue here is … task distinctions. They clearly cannot, as … (i.e., they are linked by definition). The main point is that the (potentially complex) joint structures of abilities and tasks do not consist of simple one-to-one correspondences of task and ability.

In the instructional context, there is no essential difference between learning tasks and test tasks. All are classroom activities or components of such activities; they only differ in the intent of their use. Learning tasks … particular clusters of … successful … task performance. … The intent is to select tasks that require both the abilities … task performance. Test tasks do not have learning goals, but they do require … function is to assess whether …
Two Uses of "Construct." In the psychometric literature, the term construct has been used in two ways for two different purposes:
1. to name the psychological characteristics actually estimated by an existing test score or other measurement.
2. to name the psychological characteristics that a test score or other measurement is intended ("designed") to measure.
Thus, the first purpose is usually to find out what a test score actually measures. For example, does this score labeled mathematical problem solving actually measure that ability, or how should this mathematical problem solving score be interpreted? The second purpose is usually to orient the test design process so that the resulting measurement will be interpretable as planned (e.g., develop a measure of mathematical problem solving or, given this definition of mathematical problem solving, assure that the test score resulting from the development process adequately corresponds to it). These two purposes are obviously interconnected. From my perspective, a targeted development process should be iterative (i.e., test items are drafted, tested, and revised iteratively through a sequence of stages. Each of these stages involves, during testing, an assessment of what each item measures and whether it matches the target).

In the original formulation of construct validity (Cronbach & Meehl, 1955), the nomothetic network of interrelations among observables was conceived as dynamic; that is, measurements changed over time so that validity would change as measures were modified and new observables were added. Thus, from my perspective, the intent (dynamic) aspect of construct validity as well as the actual (static) were built into the concept from the beginning.

The point I want to make here is that the word construct is used in the measurement literature in two ways: as characterizing what a test measures (e.g., intelligence may be what an IQ test measures, but we need to identify more precisely what that is) versus iteratively improving both the construct and its measures (e.g., how well does the SAT9 problem solving score measure problem solving using the New Standards construct definition?). Thus, construct validity has both a static and a dynamic aspect:

1. What does this score measure? vs.
2. Does this score measure what we want it to?
Psychological Task Analysis and Scoring. When a performance task uses collections of typical and atypical performances (i.e., student work) to demark kinds or levels of performance in terms of constructs, the result is a mapping of different abilities into differentiable parts of task performance. This allows the multiple learning or assessment goals specified for the task to be more closely aligned with its internal structure. This kind of task analysis makes clear the available information from the task performances about the target abilities. This has important implications for scoring.

Performance Records. The scoring of task performances is always based on scoring records of some kind. However, the criteria for what is included in a scoring record are quite varied. As an example, in a multiple-choice task, only the response category chosen by the respondent is recorded. There are no standard mechanisms for recording the process stages or the preliminary products of multiple-choice tasks. In some experimental work, eye movements have been recorded, but this is not feasible under ordinary testing conditions. In computer administered multiple-choice test tasks, it is possible to gather information on search and intermediate processing, depending on how the computer program and the task are structured. In research studies, process information can be obtained using think-aloud or stimulated-recall methods to supplement or stand for performance records. These procedures are increasingly used for validity assessment.

Tasks and Task Scores. The purpose of formulating an assessment task to span a construct domain is to have the task evoke responses (i.e., performances) that provide evidence for the utilization of the psychological processes represented in the domain. Establishment of criteria for inferring these processes from assessment performance records is the essence of valid scoring. Confirmation that a task does evoke appropriate responses is the key criterion for task validation. Thus, the critical bases for evidence are the performance records of respondents.

In this conception, task performances and the records that reflect them contain information about constructs that capture their measurement possibilities. This information circumscribes the potential scores that might be extracted from the performance records (e.g., if a task requires conceptual understanding of linear equations for its solution, then
information about a respondent's understanding should be contained in the performance record if the task has structured the response record to reflect it). Given this, a scoring rubric could be constructed to extract this information. (See Wiley & Haertel, 1996, for a discussion of performance records.) In general, a scoring record for a task may contain information about multiple constructs if the task requires performers to use processes or knowledge that contribute to more than one construct.
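The contrast between scoring records can be made concrete. Below is a minimal sketch of two record shapes: a multiple-choice record that preserves only the chosen option, and a performance record that also preserves intermediate products and process traces from which scores for more than one construct might later be extracted. All field names and the example values are hypothetical.

    from dataclasses import dataclass, field

    @dataclass
    class MultipleChoiceRecord:
        # Only the chosen response category is recorded; process is lost.
        item_id: str
        chosen_option: str

    @dataclass
    class PerformanceRecord:
        # Richer record: intermediate products and process traces remain
        # available for scoring against multiple constructs.
        task_id: str
        final_product: str
        intermediate_products: list = field(default_factory=list)
        process_trace: list = field(default_factory=list)  # e.g., think-aloud segments

    record = PerformanceRecord(
        task_id="mystery-powders",
        final_product="bag 3 contains baking soda",
        intermediate_products=["table of indicator reactions"],
        process_trace=["tested each powder with vinegar first"],
    )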
The specifications of the constructs to be measured by a test would seem to be a necessary prerequisite to its construct validation. Performance assessment has been characterized as being more "authentic" than traditional modes of assessment. As Messick (1994) asserted:
Authentic assessments aim to capture a richer array of student knowledge than is possible with multiple choice tests; to depict the processes and strategies by which students produce their work; to align the assessment more directly with the ultimate goals of schooling; and to provide realistic contexts for the production of student work by having task and processes, as well as time and resources, parallel those in the real world. (Arter & Spandel, 1992)

Messick (1989, 1994) discusses two major threats to validity for tests generally and performance assessments in particular. These are "construct irrelevant variance" and "construct underrepresentation". Authentic assessments, in Messick's view, are intended not to leave out relevant construct aspects, minimizing construct under-
representation. With respect to construct specification … delineating the construct … There are important differences between category systems … One is that the application of a category system to a test can result in tasks relating to multiple categories. … task category systems, as they are usually used, are classifications that are mutually exclusive and exhaustive of the tasks.

In traditional specifications for tests, a category system for test tasks is selected or created that is intended to apply to the domain or universe of tasks and to represent all of the possible tasks that could have been used to build the test. Consequently, the categories are used to classify the tasks in the test. Specification categories are typically restricted to content, not process. Content, in this sense, refers to characteristics of successful products of task performance. Success means that the performance product matches the performance goal specified by the task and communicated to the respondent (Wiley & Haertel, 1996).

Because of this focus on the task itself and its successful performance products, traditional test specifications typically ignore process (i.e., they leave out aspects or parts of the performance leading to the product). To the extent that specific psychological processes underlying performance characteristics constitute part of the desired construct, traditional test specifications and test specification categories may lead to serious construct underrepresentation in the tests for which they are used. This is the case because the content of the test is either (a) evaluated by matching each task in the test to the categories and evaluating the resulting distribution of tasks over categories, or (b) framed by specifying this distribution. To the extent that the categories are inadequately or incompletely characterized with respect to the desired constructs, a test will appear to be more valid than it is, and will actually underrepresent key constructs or construct aspects.

Rigney, Petit, Mapus, and Atash (1995) have identified a number of performance task characteristics influencing or reflecting process that are not accounted for in usual content specification categories. Some of these are grouped into a larger class labeled challenge. These provide salient examples of process and performance aspects usually omitted in traditional content categories. In my view, these are task characteristics associated with important process aspects of assessed constructs. Some of these "challenge" characteristics are:
1. Openness: the degree of openness in the task, ranging from closed tasks with only one approach and one successful solution to open tasks with many approaches and many successful solutions.
2. Scaffolding: the degree of structure within a task (e.g., inclusion of leading questions) as related to cueing the student's selection of approach or strategy for arriving at a successful solution.
3. Generalization: the degree to which the task demands generalization in order for a student to arrive at a successful solution.
4. Load: the number of different concepts a student must use to arrive at a successful solution.
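One way to operationalize these four characteristics is as a rating profile attached to each task. The sketch below is purely illustrative: the 0-3 scale and the example ratings are hypothetical, not drawn from Rigney et al.

    from dataclasses import dataclass

    @dataclass
    class ChallengeProfile:
        openness: int        # 0 = closed (one approach/solution) ... 3 = many of both
        scaffolding: int     # 0 = no cueing ... 3 = step-by-step leading questions
        generalization: int  # 0 = none demanded ... 3 = strong generalization required
        load: int            # number of distinct concepts a solution must use

    # A hypothetical profile for an open, lightly cued investigation task:
    example_task = ChallengeProfile(openness=3, scaffolding=0, generalization=1, load=2)
    print(example_task)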
Without attempting to specify here all of the aspects of process and performance needed to minimize construct underrepresentation through adequate specification of construct categories, clearly the supplementation of traditional content descriptions with additional information of this type on the character of the desired construct would likely improve the fidelity of the assessment. Fidelity here means validity in the sense of match between constructs assessed and constructs desired. At the level of an assessment (task, form, test, instrument, etc.), this concept denotes the match between the desired mixture of constructs from the domain and the mixture produced by the task score or task score aggregation defining the test score. At the level of a task, fidelity denotes the centrality and scope of the constructs assessed in the particular task score to the overall mixture desired for the test or subtest score. As such, task fidelity might also be called task quality.

In the United States, the educational reform movement has promoted the spread of performance assessment and along with that has also come the creation of "content standards" to guide curriculum and assessment development. In assessment terms, these standards can be viewed as approximations to content specifications, because they provide categories for the classification of content and performance. Viewed in this way, they do not suffer from some of the defects of traditional specifications. First, they are not simple classifications of tasks or items. This is because they are intended to affect performance as well as traditional content; they tend to incorporate process or performance aspects such as those described previously. Also they tend to be descriptive and are not usually exemplified by tasks or performances. This latter problem has led the New Standards Partnership (1995) to formulate performance standards with two parts:

Performance descriptions-descriptions of what students should know and the ways they should demonstrate the knowledge and skill they have acquired.
Work samples and commentaries-samples of student work selected for their capacity to illustrate the meaning of the content and performance descriptions, together with commentary that shows how the content and performance descriptions are reflected in the work sample.

The inclusion of student work is central to the development of performance standards. The role of these work samples is partly one of exemplification. The value of content and performance descriptions is limited unless the descriptions are accompanied by samples of actual student work that demonstrate what is meant by the expectations set out in the descriptions. The value of these samples of student work is greatly enhanced if they are accompanied by commentary that sets the context for the work: how the task was presented, the kind of support the student received, and so on. It is further enhanced if the commentary includes discussion of the qualities and characteristics of the work, and is annotated with reference to the content and performance descriptions to show how the work illustrates the descriptions.

Once samples of student work and commentaries are included in this way, it becomes apparent that their role extends well beyond exemplification. They form an essential part of the articulation of the standard itself, since it is only through cross-referencing to the samples of student work and their accompanying commentaries that the content and performance descriptions take on shared meaning. In other words, the samples of student work anchor the standards in a way that allows for their consistent interpretation (NSP, 1995).

It should be noted that "samples of student work and accompanying commentaries" means, in this context, (a) full specification of the task the student was to successfully complete, (b) inclusion of the performance record (i.e., student work), and (c) evaluation of work against both the task's performance goal and its measurement goal (i.e., the intended construct). See Wiley (1991), Haertel and Wiley (1993), and Wiley and Haertel (1996) for a fuller discussion of these concepts.

Thus, two ways of more completely specifying and anchoring construct categories have been identified: specifications of additional task characteristics reflecting process and performance aspects of intended constructs (Rigney, Petit, Mapus, & Atash, 1995) and inclusion of samples of student work and accompanying commentaries (NSP, 1995). Building on these concepts, Kopriva's theory of tasks and task performance dimensions (Kopriva, 1997b) represents the beginning of a
… theory of construct domains, especially for educational achievement. It specifies and differentiates the content/process and accessibility aspects of tasks and … and reorients several of … Identification and exemplification of construct categories … may need further specification in terms of the centrality of importance of the various … Also, category boundaries may need delineation …
Conceptual understanding tasks are broadly described as those that usually create the opportunity for students to … it in their own terms. Tasks designed to assess conceptual understanding are usually cast in a context. Such tasks can be thought of as tests of a conceptual understanding that rely on reconstruction rather than on recall; solutions are characterized by explanation rather than by manipulation. … sufficient to accomplish the task, which students can do easily if they understand …
Skill tasks are broadly described as those that create the opportunity for students to apply a well-practiced and important routine or procedure. Tasks or parts of tasks designed to assess skills are routine and often not cast in a context. Accomplishment of this type of task relies heavily on recall, and solutions are characterized largely by manipulation. Tasks … involve mathematical …
Problem-solving tasks are described as those that create the opportunity for students to select and deploy problem-solving strategies. Tasks designed to assess problem solving are usually nonroutine, longer, and cast in a context. A way of thinking about the assessment of problem solving that is useful is to think of it as a measure of what students can do with knowledge and skills they have learned one or even two years previously. Problem-solving tasks make high-level use of concepts and skills. Appropriately cast problem-solving tasks ensure that students are given the opportunity to formulate an approach to the problem and the opportunity to work it through …
Currently, a cohesive technology … does not exist for all subject areas and grade levels …
… technology and tools for evaluating the construct validity of items and tests … analyses across item types for evaluating construct validity … The components consisted of creating methods for: evaluating and promoting the match of content standards to item specifications; ensuring item alignment in the item development process; identifying the skills and abilities needed in items that are anchored to standards; ensuring that tests/forms balance over items and construct elements to ensure alignment to standards; and evaluating accessibility in tests/forms.
This set of studies was broken down into two major strands: investigations with the goal of producing effective means of identifying constructs and evaluating completed items/rubrics to ensure alignment, and investigations for producing efficient methods (and associated materials) of building the effective construct alignment into large-scale item and test development. The proposed studies were to develop, refine, and use the methods and materials to evaluate construct identification, with components built in to evaluate the effectiveness of the identification, the construct alignment, and the methodology.
In order to validate a performance task and its accompanying scoring rubrics in terms of intended constructs, several approaches may be taken. One method of construct validation is to (a) systematically modify assessment tasks and task specifications, (b) administer them to small groups of students, and (c) analyze the resulting student work to compare performances across different versions in order to assess the effects of the modifications on performance. These effects would be used to infer how different performance capabilities are evoked by tasks. Because the purpose is to validate performance tasks by inferring which performance capabilities they can be used to assess, these activities help accomplish this goal. The second method is to gather additional data directly from the respondents about their performance. This can be accomplished with "think-aloud" protocols, post-performance interviews or, in some cases, through questionnaire responses. See Kopriva, Wiley, and Schmidt (1997) and Kopriva (1997a) for elaboration. (In the context of the New Standards Project, see Young, Resnick, & Wiley, 1997, and Wiley, Kopriva, & Shannon, 1997.)
Aggregation Rules and Assessment Scores. Assessments are not single tasks, however; they are collections of tasks together with rules for aggregating scores on component tasks into overall scores on the collection. Thus the construct validation of a test, as opposed to a task, must focus on these aggregation rules as well. In practice, typical performance tasks, especially those of longer duration, can be informative about multiple constructs. An aggregation rule is a function that maps a pattern of scores across tasks into a score for the task collection. For example, in traditional multiple-choice testing, test items are each scored only once as correct (score = 1) or incorrect (score = 0); the total score for a collection of items is usually simply the sum of the item scores. In performance assessments, rubric scores for tasks are typically graded scales of performance, and summary scores may take the form of task mappings (California Learning Assessment System [CLAS], 1994; Wiley, 1993). The New Standards tasks may be used to assess more than one construct, on single or multiple rubrics for a given task. A task collection may therefore have more than one aggregation rule. In general, there will be a distinct aggregation rule for each construct measured by the task collection. Because tasks may have more than one scoring rubric, the number of score types potentially entering an aggregation rule is equal to the sum, over tasks, of the number of rubrics for each task. An aggregation rule groups task performance patterns into score pattern groups, with all the patterns in a group considered equally informative about the construct. The only logical constraint typically placed on aggregation rules is that if one pattern represents better task performance than another, then the aggregate score should be greater or equal (e.g., f(2, 3, 1) ≥ f(2, 2, 1)). Most aggregation rules are linear and additive. Two major forms have emerged, both linear and additive. One is where each task is scored on a single "holistic" rubric (h_t): here, each task is given a weight (w_tc ≥ 0) on each construct, and the construct score (s_c) is a weighted sum of task scores [s_c = Σ_t w_tc h_t]. A second is where tasks are scored on rubrics that assess different constructs. So if a task has a rubric for a construct, we define a rubric indicator (u_tc = 1 if the task has a rubric for c; = 0 if not). A construct rubric score for a task is v_tc, and the construct score (r_c) is a weighted sum of rubric scores [r_c = Σ_t w_tc u_tc v_tc]. Under this schema, w_tc or w_tc u_tc represent the estimated information in the performance record of task t about construct c. Validity of the assessment construct scores is clearly dependent on the validity of the pieces used in the aggregation, i.e., s_c = Σ_t w_tc h_t and r_c = Σ_t w_tc u_tc v_tc, so that validity of the aggregate score rests on the validities of w_tc, h_t, u_tc, and v_tc (i.e., on the validity of the task scores and weights). We note that the validation of the assessment here must be multivariate (i.e., an array of w_tc, h_t, u_tc, and v_tc), because the construct is a mixture of subconstructs defined across the tasks. Validation of task scores was briefly discussed above. Task weights are usually based on judgments about the centrality and coverage of the intended construct reflected in the rubric scores; weights also reflect the amount of time given to performance. Validation of weights should come about via performance records. One possibility, when the measured constructs are not highly interdependent, is to sort sampled performance records by construct and estimate average amounts of evidence in order to cross-validate the weights. As an overall validation strategy that does not directly depend on the task or weight validities, using cross-assessment relationships among construct-focused scores is appropriate as a convergent and discriminant validation process. Consistent with their reliabilities, measures of the same construct should be more highly related across assessments than measures of different constructs.
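Because both aggregation rules are weighted sums, they can be illustrated in a few lines. The sketch below uses invented tasks, weights, and rubric scores (they are not drawn from any actual assessment); it computes s_c = Σ_t w_tc h_t for the holistic case and r_c = Σ_t w_tc u_tc v_tc for the construct-rubric case:

# Hypothetical sketch of the two linear, additive aggregation rules.
def holistic_construct_score(weights, holistic_scores):
    # Rule 1: s_c = sum_t w_tc * h_t, one holistic rubric score per task.
    return sum(w * h for w, h in zip(weights, holistic_scores))

def rubric_construct_score(weights, has_rubric, rubric_scores):
    # Rule 2: r_c = sum_t w_tc * u_tc * v_tc, where u_tc indicates whether
    # task t carries a rubric for construct c and v_tc is that rubric score.
    return sum(w * u * v for w, u, v in zip(weights, has_rubric, rubric_scores))

w = [0.5, 0.3, 0.2]   # w_tc: judged centrality/coverage for construct c
h = [3, 4, 2]         # h_t: holistic task scores
u = [1, 0, 1]         # u_tc: rubric indicators for construct c
v = [3, 0, 2]         # v_tc: construct-rubric scores (0 where no rubric)

print(holistic_construct_score(w, h))    # 0.5*3 + 0.3*4 + 0.2*2 = 3.1
print(rubric_construct_score(w, u, v))   # 0.5*3 + 0.2*2 = 1.9

The monotonicity constraint mentioned above is easy to check against such a rule: raising any task or rubric score can never lower the aggregate, because all weights are nonnegative.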
I briefly comment on three arenas in test construction that are currently educationally prominent:

1. Example: teacher tests (constructs framed post hoc). When teachers construct tests for classroom use, they often draft test items around an instructional unit for which they are evaluating learning. The items span the instructional unit, but the construct being measured is not explicitly articulated. In this case a posterior analysis might frame constructs in terms of the instructional goals of the teaching unit, but this is not usually done prior to test construction.

2. Example: published tests (construction via test specification). When test publishers construct tests for clients or public sale, they usually develop test specifications prior to item development. For achievement tests, the current form of these specifications is often a content-by-process matrix. These specifications are not equivalent to a full specification of the constructs to be measured by the test, but they are usually explicit enough that they could validate a partial specification of constructs.

3. Example: new standards-directed assessments (construction via content standards). The educational reform movement of the past 15 years advocates improving classroom instruction by specifying what is important for students to know and to be able to do (i.e., what learning outcomes are expected of students). The vehicle for accomplishing this has become content standards. Content standards are supposed to affect the instructional process by changing the goals and foci to which instruction is directed. A major means for accomplishing this has been through assessment. Assessments should test what is judged to be important in the standards, and, because teachers teach what is being evaluated, this in turn would encourage the focus on certain types of information and performances in the classroom.
When we approach the construct validity of a test score, we are taking a particular view of that score by the very nature of the notion of construct. Implicit in the idea that a score can measure a construct is a decomposition of the score into construct plus other (i.e., S = C + O). Usually, we conceive of the other as the discrepancy between the score and the construct. Sources of discrepancies may be (a) errors of measurement in the traditional sense, (b) construct underrepresentation (Messick, 1989), and (c) ancillary abilities (Haertel & Wiley, 1993) leading to construct-irrelevant variance (Messick, 1989; i.e., invalidity). Note that I am lumping unreliability with sources of lowered validity to constitute the other, O. This perspective leads to the concept that the validity of measurements has two subconcepts: (1) the validity of the construct itself, and (2) the validity or invalidity of the score for the construct. The notion here is that the traditional concept applies jointly to the construct being measured and to the score used to measure it. The analogy is that of joint and conditional
probability. Denote the joint validity of score and construct by V(S, C) and the validity of the construct itself by V(C). I do not wish to imply by this that there must be validity indices such that V(S, C) = V(C)V(S/C), and which are scaled like probabilities (i.e., 0 ≤ V ≤ 1), although that is a possibility. My intent is only to allow separation of the evaluation of the validity of the score for the construct from the evaluation of the construct itself. These aspects of validity require more complete definition. The definition of the joint validity of score and construct, that is, V(S, C), is to my mind the same as the definition implicitly used by Messick (1989, 1994) and thus involves all the validity aspects he identified (e.g., content, consequential). To understand the decomposition better, I define the V(S/C) aspect as the limited definition of construct validity of a score, focused on how well the test score measures the construct (i.e., whether or not the score has construct-irrelevant variance or construct underrepresentation). To better understand the meaning of V(C), let's assume that the score validity for the construct is perfect. This eliminates score imperfections as factors lowering validity and reliability, and under these hypothetical conditions V(S, C) = V(C). The aspects of validity influencing V(S, C) under these conditions, and therefore V(C) generally, are primarily consequential. To repeat what I said in the introduction, this chapter has a twofold purpose. First, in Test Validity and Invalidity Reconsidered (Wiley, 1991), I expressed a preference for separating validity questions about test use and consequences from those concerning the relations between test scores and constructs. My rationale, at the time, was that investigations concerning test use and its consequences involved the psychology of individuals. I now find it less useful to simply separate these perspectives, and as this new perspective has evolved, an integration now seems more productive.
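The separation can be stated compactly. The display below restates, in symbols, exactly the relations given in the text (the probability-like scaling is, as noted, optional rather than required):

S = C + O, \qquad V(S, C) = V(C)\, V(S \mid C), \qquad 0 \le V \le 1,

where V(S | C) is the limited construct validity of the score (how well the score measures the construct) and V(C) is the validity of the construct itself.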
Validity is at base about interpretations. In the most traditional sense (e.g., Cronbach, 1988, 1989; AERA, APA, & NCME, 1985), this interpretation is of scores. I would like here, however, to propose an expansion of the targets of interpretation in a validity frame to include constructs as well as scores.
1. interpretation (of construct): the validity of a construct. In one sense a construct is, itself, an interpretation. However, as I discussed previously, construct is generally used in measurement to denote either (1) what a score measures or (2) what a score could measure. In either case, a construct bears a specific relationship to some score, and in traditional measurement frameworks relations between these construct-laden scores and other criteria have evidential value for construct validity. Because of this, my perspective in this chapter is that the notion of construct is specific enough to support a conception that the score is an (errorful) mathematical function of a hypothetical construct score (as discussed previously). Thus, the interpretation of the construct should (at least in part) derive from the construct itself.

2.-3. score (of construct): construct validity. The relationship of the construct to the score is the major part of construct validity.

4. interpretation (of construct) → interpretation (of score) (social). Score interpretation is a contextually mediated function of both (1) the score value and (2) the interpretation of the construct.

5. interpretation (of score) → decision/action (social). Finally, the consequential decision or action resulting from these interpretations is a direct outcome of the interpretation of the score's value.

FIG. 12.1.

Note that Figure 12.1 exhibits two routes between score interpretation and Decision/Action.

FIG. 12.2.

We know very little about how anticipated uses influence construct selection, or even if they do. We know little about how people interpret particular construct labels. We have even less knowledge of how the social interpretations underlying decision and action are acquired.
Arter, J. A., & Spandel, V. (1992). Using portfolios of student work in instruction and assessment. Educational Measurement: Issues and Practice, 11(1), 36-44.
California Learning Assessment System. (1994). Technical report. Monterey, CA: CTB/McGraw-Hill.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.
Haertel, E. H., & Wiley, D. E. (1993). Representations of ability structures: Implications for testing. In N. Frederiksen, R. J. Mislevy, & I. I. Bejar (Eds.), Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum Associates.
Kopriva, R. (1997a). Building a construct validity technology for large-scale assessment. Discussion paper.
Kopriva, R. (1997b). Using three lenses to reconceptualize test item coverage as a function of content, process, performance depth, and accessibility. Discussion paper.
Kopriva, R. (1997c). Getting real about accuracy: Insuring accessibility in testing for everyone. Discussion paper.
Kopriva, R., & Wiley, D. (1996). Validating and comparing assessments: Making linkages between content experts and measurement systems. Discussion paper.
Kopriva, R., Wiley, D., & Schmidt, W. (1997). Building a construct validity technology for standards-based assessment. A proposal submitted to OERI.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed.). New York: ACE/Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
New Standards Project. (1995). Performance standards: Version 5.1. Pittsburgh, PA: National Center on Education and the Economy.
New Standards Project. (1997). The New Standards reference examination standards-referenced scoring system. Pittsburgh, PA: National Center on Education and the Economy.
Rigney, S., Petit, M., Mapus, L., & Atash, N. (1995, June). Challenge presented by the task: Grade 8. CCSSO Large Scale Assessment Conference.
Wiley, D. E. (1991). Test validity and invalidity reconsidered. In R. E. Snow & D. E. Wiley (Eds.), Improving inquiry in social science: A volume in honor of Lee J. Cronbach. Hillsdale, NJ: Lawrence Erlbaum Associates.
Wiley, D. E. (1993, June). Combining scores into overall levels of performance: From products of chains to chains. Paper presented at the California Assessment conference. Kenilworth, IL: Beacon Institute.
Wiley, D. E., & Haertel, E. H. (1996). Extended assessment tasks: Purposes, definitions, scoring, and accuracy. In R. Mitchell (Ed.), Implementing performance assessments: Promises, problems, challenges. Hillsdale, NJ: Lawrence Erlbaum Associates.
Wiley, D., Kopriva, R., & Shannon, A. (1997, February). Standards-based validation of performance assessments. Paper presented at the annual meeting of the American Association for the Advancement of Science, Seattle, WA.
Young, M. J., Resnick, L., & Wiley, D. E. (1997). What is a standards-referenced assessment? Paper presented at the symposium on New Standards Implementation of Standards-Referenced Assessment, NCME, Chicago, IL.
*An earlier version of this chapter was presented at a conference in honor of Samuel Messick, "Under Construction: The Role of Constructs in Psychological and Educational Testing," Princeton, NJ: Educational Testing Service, September 19, 1997. Preparation of this chapter was supported under the Educational Research and Development Centers Program, PR/Award Number ..., as administered by the Office of Educational Research and Improvement, U.S. Department of Education. The findings and opinions expressed in this publication do not reflect the position or policies of the National Institute on Student Achievement, the Office of Educational Research and Improvement, or the U.S. Department of Education.
Although I was quick to say yes when Ann Jungeblut asked me to participate in this celebration of Sam Messick's contributions to educational measurement, I must admit it is a bit intimidating to write about issues of validity, constructs, and values in this context. Many people have contributed to the refinement of the ideas about validity. Two people, Sam Messick and Lee Cronbach, however, stand head and shoulders above the crowd in this regard. In the space available, it would be difficult to do justice to a summary of the ways in which Sam has advanced our thinking about the interplay of constructs and values or the process of validating inferences and actions based on test scores. The intimidation, however, comes from the desire to build on and extend his work in these complex areas with him sitting here. Nonetheless, I, of course, told Ann that I wouldn't miss the chance to participate.
Although it may not be evident from my remarks, I have learned a great deal from Sam. I am indebted to him for the support he gave me early in my career when I was at ETS. He always gave me encouragement and the freedom to pursue my research interests. In addition to teaching me a lot about validity, Sam also taught me some more mundane and practical things. For example, he taught me that an ETS Senior Research Psychologist (the title he held when I was new to ETS in 1965) is obviously superior to a full professor, because we always waited for him to arrive a full 15 minutes after the time he set for the start of a meeting. When he became a vice president, I learned that there is no university equivalent of that exalted title, because no one would wait that long in a university setting. I also learned, as editor of the Third Edition of Educational Measurement, that although a chapter by Sam will certainly not arrive on the editor's desk on schedule, it will be clear when it does arrive that the contribution it makes is more than worth the wait. Although I haven't checked it, I am confident that Sam's chapter is the most cited, and certainly the most influential, chapter in the Third Edition of Educational Measurement.
Of course, there were good reasons people were willing to wait for Sam to show up for a meeting, just as there are for an editor of a book to be willing to wait for a chapter. Most important of these reasons is the quality of his thinking on issues ranging from the mundane bureaucratic ones to the profound. Sam's contributions to the discussion are consistently worth the wait. It is no wonder that when ETS has had a problem or issue that required the best the organization had to offer, Sam has long been the person that every ETS President (Henry Chauncey, Bill Turnbull, Greg Anrig, and now Nancy Cole) has turned to for help. Thus, there was no answer other than "yes" when Ann Jungeblut asked me to participate in this celebration. I am delighted to be part of it.

The focus of my chapter is on issues of values, constructs, and validity in the context of standards-based assessment programs. I take as my starting point Sam's one-sentence definition of validity in his 1989 chapter in the third edition of Educational Measurement, which Shepard (1993) describes as currently the most cited authoritative reference on the subject: "Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment" (Messick, 1989, p. 13, emphasis in original). This statement, which is the first sentence of Sam's influential chapter, is so packed with meaning that it took him 91 pages of an oversized book with relatively small type to elaborate the concept.

As I'm sure is familiar to most readers, Sam's comprehensive definition of validity is elaborated in a two-by-two table corresponding to the adequacy/appropriateness and inferences/actions distinctions of the definition. The two rows of the table distinguish two bases of support for validity claims, the evidential basis and the consequential basis, that are used to support claims of adequacy and appropriateness. The two columns of the table distinguish between interpretations of assessment results (e.g., "the latest NAEP history results show students have a striking lack of knowledge about their heritage," Innerst, Washington Times, November 2, 1995) and uses of results (e.g., the cash awards given to teachers in Kentucky based on increases in assessment scores by successive cohorts of students). Although there is some disagreement in the field regarding the desirability of including consequences as a part of validity (see, for example, Green, 1990; Moss, 1994; Wiley, 1991), there is broad consensus both with other parts of Messick's comprehensive formulation and with the importance of investigations of consequences as part of the overall evaluation of particular uses and interpretations of assessment results (Baker, O'Neil, & Linn, 1993; Cronbach, 1980, 1988; Linn, 1994; Linn, Baker, & Dunbar, 1991; Linn & Baker, 1998; Moss, 1992, 1994; Shepard, 1993).

Of course, affirmation of the primacy of validity based on a comprehensive framework is one thing. Validity practice is quite another. Validity practice is too often more in keeping with outmoded notions of validity that base validity claims on a demonstration that test items correspond to the cells of a matrix of test specifications or a demonstration that scores on a test are correlated with other relevant measures (e.g., teacher ratings, or another test administered at a later time). Although both content-related evidence and criterion-related evidence are relevant to validity judgments, they do not provide a sufficient basis for the kind of "integrated evaluative judgment" that Messick demands. Standards-based assessments currently being introduced in states around the country are frequently introduced by legislation that includes a requirement that assessments be "valid, reliable, and fair." However, the approach to validation is often limited to comparisons of the assessment content to the content standards that are supposed to determine what gets
taught and assessed. Such comparisons, although relevant, do not by themselves provide adequate evidence for making the required validity judgments. In the remainder of this chapter I focus on current standards-based assessment programs, which typically follow a standards-based design: assessments are developed to align with the frameworks of content standards, and the intended alignment is reinforced during development, often through overlapping membership on the committees responsible for the content standards and for the assessments.
Before getting into issues specific to the validity of assessments, it is useful to consider some aspects of the standards-based reform context that is the basis for the types of assessments that are the focus of my chapter. A 1995 report of the National Academy of Education Panel on Standards-Based Education Reform described the vision of the movement as follows:

The intentions of standards-based reform are best captured in the slogan 'high standards for all students.' Unlike previous reforms focused on the attainment of minimum competencies, setting higher academic expectations is expected to ensure that students have more complete mastery of subject matter and the ability to apply what they know to solve real-world problems. Standards-based reform also reflects a strong commitment to educational equity: The same high expectations are to be established for all students, even groups who have traditionally performed poorly and received watered-down curricula. (McLaughlin & Shepard, 1995, p. xvi)

Values questions are evident in this characterization of standards-based education reform. So too are issues about the nature of constructs that assessments based on standards consistent with this description are intended to measure. The same curricula and same high standards for all students, although easy to accept at a rhetorical level, obviously may conflict with the desire of parents to have the best for their children. The recently released annual Phi Delta Kappa/Gallup education poll (Rose, Gallup, & Elam, 1997), for example, reported that while roughly three-fourths of the respondents approved establishing national standards for measuring the academic performance of the public schools (77%), an approximately equal proportion favored moving chronic "troublemakers" into alternative programs. Apparently, all students does not include troublemakers in the minds of the survey respondents, or they don't agree with the dual goals of the leaders of the standards-based reform movement. Moreover, two out of three favored ability grouping (66%). Although ability grouping is not, in principle, inconsistent with the same "high standards for all students," considerable experience suggests that it is often inconsistent in practice (e.g., Gamoran & Berends, 1987; Oakes, 1990). Value conflicts over the weight to be given to the equity goal, and over acceptable or desirable approaches to achieving that goal, provide
potential sources of conflict regarding the standards-based reform effort. These conflicts are less visible, however, than value conflicts over what should be emphasized in the standards and, therefore, taught and assessed. As is evident from the heated debate about the Voluntary National Tests, value conflicts about constructs that are to be assessed have considerable salience. This is so, in part, because it is less socially acceptable to question the goal of equity than to question the vision of a discipline such as history, or even the less controversial discipline of mathematics.

The standards movement has pushed new concepts of the disciplines that generally downplay the accumulation of facts and procedural skills in favor of conceptual understanding and problem-solving processes. Mathematics was in many respects the lead discipline. It is almost a decade since the National Council of Teachers of Mathematics (NCTM, 1989) published the Curriculum and Evaluation Standards for School Mathematics. Although it is hardly the case that the instruction envisioned in the NCTM Standards has been incorporated into the daily practices of most classrooms, there is little doubt that those standards have been influential. They have received a great deal of praise from politicians as well as educators, have served as a model for mathematics standards developed in most states as well as for standards in other subject areas at the national and state levels, and are now routinely claimed as a source of guidance by textbook and test publishers.

The broad support enjoyed for a number of years by the NCTM Standards was never there for some of the other content areas. The history and English language arts standards, in particular, were subject to harsh criticism (e.g., Diegmueller, 1994). As the NCTM standards have become more influential and as assessments designed to be aligned with the standards have started to make the constructs and values of the standards more explicit, however, the NCTM standards also have come increasingly under attack. This point is illustrated by the exchange in the New York Times in which Lynn Cheney (1997) attacked what she called the "fuzzy math" or "whole math" encouraged by the NCTM standards and Tom Romberg (1997) defended the standards and the mathematics that they envision. That and other recent criticisms have taken place not only in the context of the debate about the Voluntary National Test, but in the context of several state assessment programs and efforts to revise state content frameworks or standards.
People sympathetic to the national content standards are sometimes quick to dismiss attacks such as Cheney's as representing only a small fraction of the public. There is considerable evidence, however, that the questions raised by the opposition mirror broader public concerns; recent consumer research on mathematics and writing, for example, points to fundamental differences between the values of education reformers and those of the larger public (McDonnell, 1996, p. 31). Large segments of the public value computation done without the aid of a calculator, and attach importance to such familiar matters as memorization of facts and basic skills. Leading education reformers, by contrast, place much greater emphasis on conceptual understanding and favor constructivist notions of effective instruction.

Standards, especially ones that actually have an impact through instruments such as assessments that influence instructional programs, are controversial. To have an effect, standards must specify what knowledge and skills matter, with some specificity about what is of most worth within a content area. It is this specificity that makes standards valuable as guides of curriculum and instruction, but specificity also makes them potentially contentious. As Cremin (1989, p. 9) noted, "standards involve much more than determinations of what knowledge is of most worth; they also involve social and cultural preferences, and they frequently serve as symbols and surrogates."

The emphasis in standards-based assessment is on (1) clear specifications through "content standards" of what students should know and be able to do in specific content or subject matter areas at identified points of their education (e.g., 4th grade reading, or 8th grade mathematics), and (2) clear specifications through "performance standards" of the levels of performance that students are expected to achieve. In other words, content standards define what is to be measured, whereas performance standards define how good is good enough.
Sadler (1987) distinguished among four methods of specifying performance standards. These are (1) numerical cut-offs, (2) tacit knowledge, (3) exemplars, and (4) verbal descriptions. Numerical cut-offs are the most familiar and widely used. They have the advantages of specificity and apparent precision, but they do not aid understanding of what is required by the standard. By itself, knowledge of the score on a test that is required to meet the "proficient" standard does not tell either the teacher or the learner what needs to be done to achieve a proficient performance.

Sadler's second method, tacit knowledge, is best illustrated by connoisseurship. Experts who have the tacit knowledge of the standard may achieve relatively high degrees of agreement, but that is of little aid to the novice if the standards continue to depend on tacit knowledge possessed only by experts. Tacit knowledge can be used successfully to grade student work by consensus among experts, possibly using formal moderation procedures. More than consistency among experts is needed, however, if students and other novices are to understand what the standard requires and hence be able to judge performances against it.

The third method, the use of exemplars of performance that does or does not meet standards, is usually instituted in an explicit effort to communicate what the standards entail to a broader audience. "The exemplars are not the standards themselves, but are indicative of them; they specify standards implicitly" (Sadler, 1987, p. 200). Because there are innumerable potential examples and a host of contextual and other factors that may influence performance, many exemplars are needed, and a heavy interpretation burden is placed on anyone attempting to infer performance standards from a collection of exemplars.

Verbal descriptions attempt to provide explicit definitions of performance standards. Some verbal descriptions rely on adverbs and adjectives to define levels of performance. Other verbal descriptions rely on verbs (e.g., make inferences, draw conclusions). For example, the Proficient level for fourth grade reading is defined as follows for NAEP:

Fourth-grade students performing at the Proficient level should be able to demonstrate an overall understanding of the text, providing inferential as well as literal information. When reading text appropriate to fourth grade, they should be able to extend the ideas in the text by making inferences, drawing conclusions, and making connections to their own experiences. The connection between the text and what the student infers should be clear. (Campbell, Donahue, Reese, & Phillips, 1996, p. 42)

Unlike the precision of numerical cut-offs, verbal descriptions are "fuzzy standards." Yet it is verbal descriptions and associated exemplars that most influence interpretations of assessment results. Hence, those descriptions and exemplars deserve much more attention in validation efforts than they usually receive. Three of the four methods described by Sadler are used to implement and communicate the NAEP performance standards. Verbal descriptions, such as the one just illustrated, are accompanied by numerical cut-offs that operationalize the performance standards on the NAEP score scale and by exemplar items and student responses. Apparent inconsistencies between these different approaches to specifying performance standards sometimes arise and lead to challenges to the validity of the standards. Verbal descriptions and exemplars define the construct only fuzzily, and the fact that there is fuzziness in the constructs measured by the assessment implies that there is fuzziness in the performance level a student is determined to have achieved.
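Sadler's first method can be made concrete in a few lines of code. The cut scores below are invented for illustration (they are not NAEP's); the sketch shows both the precision of a numerical cut-off and its silence about what the resulting level means:

# Hypothetical sketch: classifying a scale score by numerical cut-offs.
CUTOFFS = [(280, "Advanced"), (240, "Proficient"), (200, "Basic")]

def performance_level(scale_score):
    # Map a scale score to a performance level via fixed cut scores,
    # checking the highest threshold first.
    for cut, level in CUTOFFS:
        if scale_score >= cut:
            return level
    return "Below Basic"

print(performance_level(243))   # "Proficient": exact, but it says nothing
                                # about what proficient performance entails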
The prospect of the Voluntary National Test has caused concerns in some states about the degree to which the tests will match their own content standards. Educational leaders in some states that have invested heavily in the development of standards and have put in place assessments aligned with those standards worry that the national tests will not match their assessments and content standards and will become the de facto determiner of curriculum and teaching. State-based efforts to improve curriculum and instruction through the development and promulgation of content standards exist in almost every state. The National Academy of Education Panel on Standards-Based Education Reform, for example, reported that by 1995 all but one of the 50 states had developed or were developing curriculum frameworks, some of them intended to influence national standard-setting efforts (McLaughlin & Shepard, 1995).

Consider, as an illustration, NCTM Standards 10 and 11. The standards say that students in grades 5 through 8 should have experiences so that students can "construct, read, and interpret tables, charts, and graphs" (from NCTM Standard 10, Grades 5 through 8) and experiences "devising and carrying out experiments or simulations to determine probabilities" (from NCTM Standard 11, Grades 5 through 8).
As can be seen in the accompanying table comparing state standards, there is considerable consistency across states in the coverage of these topics. There is also considerable consistency between the state standards and those of the NCTM and the specifications for the Voluntary National Test. The test specifications call for items that require students to use graphs and plots to organize and display data, and to interpret displayed data. The items themselves should push students to draw conclusions, select appropriate representations, make graphs or plots, and justify their selections, and to solve problems using the representation they select. Analysis and generalizations could be called for across multiple representations. Items could also involve multiple data sets or multiple representations of a single data set. [Example 9, Ch. 5]
[Table: comparison of state content standards.]
Similarly, for probability: "7. Determine probabilities, describe events, and make predictions using simulations, sample spaces, and the definition of probability, and the fairness of games. Use probability as it relates to independent and dependent events." In the assessment of probability, situations should be rich as well as simple enough for students to demonstrate what probability means and to apply addition and multiplication rules for probability. [Example 6] (See web site: http://www.mprinc.com/nationaltests/math.htm.)

Of course, analyses of other topics or strands of mathematics, or of the standards and specifications, might show far less consistency than the ones I have presented from the areas of probability and statistics. The point is that detailed analyses are needed for a state with its own content standards to be convinced that the test will be based on specifications that are consistent with its own standards.
The question of what should be assessed is, and should be, of central importance. Traditionally, publishers of widely used test batteries have relied on analyses of curricula and textbooks, feedback from users, and experience with previous editions of the test battery. For obvious reasons, market forces play a prominent role. Even if a publisher wanted, for example, to do away with a separate arithmetic computation score and make calculators an assumed tool for its mathematics test, it is unlikely to do so as long as the marketplace wants scores to show how well students can perform arithmetic computations by hand. Real and perceived market forces push in the direction of the status quo. The cost of being overly innovative is a higher price tag.

The standards movement has done a great deal to rock the boat. Standards-based assessments make explicit particular values and provide operational examples of what students "should understand and be able to do." The criteria and cut scores required for a student to be at the "basic" level in history, the "advanced" level in mathematics, or the "proficient" level in reading also make explicit how good is good enough and what is required to be considered excellent. In other words, it is through assessments that both content standards and performance standards become concrete.

Because of the diversity of opinion on what is most important for students to learn and what is legitimate for schools to teach, it is essential that the process of development engage the broad public so that the diverse range of views can be heard. Basic skills such as spelling, phonics, multiplication tables, and historical dates and locations that are valued by segments of the public must be considered. They can be dismissed because educational reformers believe in "constructivist" theories only at the peril of the whole enterprise. Achieving consensus requires not just public hearings and a period of time for public review and input, but a process that ensures that concerns are heard and a balance is struck in the standards and assessments that attends to major concerns of the public.

Neither the voluntary national tests nor important state assessments can escape this debate. Indeed, it is already underway. As Ravitch (1997a) noted, critics charge that the process is stacked to favor "whole language" and to promote "fuzzy math" (where the right answer no longer matters most). Although she stated that she did not share this view, she used the concern to argue forcefully for Congressional authorization for the voluntary national tests and the assignment of responsibility for determining what should be assessed by the tests to a nonpartisan body. She praised the action by Congress to establish an expanded National Assessment Governing Board (NAGB) for this purpose. The key point is that the tests need to be protected from partisan views and political influence.
Standardization has long been thought of as a foundation for building tests that are valid, reliable, and fair. The first edition of Anastasi's (1954) textbook, Psychological Testing, described a psychological test as "an objective and standardized measure of a sample of behavior" (p. 22). She elaborated the concept of a standardized measure as follows: "Standardization implies uniformity of procedure in administering and scoring the test. If the scores obtained by different individuals are to be comparable, testing conditions must obviously be the same for all. Such a requirement is only a special application of the need for controlled conditions in all scientific observations" (p. 23). Other well-known authorities presented similar statements. Cronbach (1960), for example, emphasized the standardization of procedure, defining a standardized test as "one in which the procedure, apparatus, and scoring have been fixed so that precisely the same test can be given at different times and places" (p. 22).

Popular use of standardized tests often has other meanings. It may, for example, be the term used to refer to a published, multiple-choice test, but such characteristics are not fundamental to the concept. An essay test can have a relatively high degree of standardization of procedures (e.g., instructions, timing, and scoring), whereas a multiple-choice test can be administered in unstandardized ways (e.g., with different instructions, under different conditions, and with different time constraints). It is hardly surprising that uniformity of procedure would be emphasized when the goal is fair and valid comparisons among individual test takers. It would obviously be quite unfair in the administration of an essay test scored for content, spelling, and grammar to allow some job applicants to type their responses at a word-processor with access to a spell-checker, a thesaurus, and a grammar checker whereas other applicants were required to write their responses by hand without access to such resources. Traditionally, it would also be considered unfair to allow one test taker 90 minutes to complete a test whereas other test takers with whom they are competing were held to a strict time limit of 60 minutes.

Such traditional notions of standardization are being challenged with increasing frequency, however. The goal of including students in assessments who would have been excluded in the past because of a disability or because of lack of facility in English has raised questions about the essential and nonessential aspects of standardization. This is fundamentally a values-laden construct validity problem. The most commonly requested and used variance from stipulated administration conditions in state and national assessments is extra time. Other adaptations and accommodations that are used with increasing frequency include oral presentations of assessments to students, individually administered assessments, and translations of assessments into
languages other than English. The plan for the 8th grade voluntary national test in mathematics, for example, includes a provision for the creation of a Spanish language version of the test. A number of states are following a similar path. The expanded emphasis on adaptations and accommodations that allow greater numbers of students to participate in assessments raises fundamental issues about standardization, not just for the students for whom the variations in procedures are intended, but for all students. Uniformity of procedure should be recognized as a means to the ends of validity, reliability, and fairness rather than an end in itself. Simplistic examples, such as being allowed to use eye glasses when taking a test, make it clear that complete uniformity was never the intent. Indeed, most people would agree that it would be unfair to deny a person who wears glasses the use of this accommodation when taking a test. Similarly, there is little debate about the fairness of the use of large print or Braille versions of a test. Implicit in the allowance of such accommodations is the belief that they are not relevant to the construct being measured. Or, conversely, the unaided visual acuity that may contribute to variability in test performance is an ancillary ability (Haertel & Linn, 1996) that contributes construct-irrelevant variance to test scores (Cook & Campbell, 1979; Messick, 1989).

In a similar vein, the decision to make a Spanish version of the 8th grade voluntary national test in mathematics is predicated on the notion that the specific language used for presentation of problems and for constructing responses by test takers involves language skills that are ancillary to the mathematical knowledge and skills that the test is intended to measure. Presumably the limitation to either English or Spanish, rather than any language preferred by a student, is based on practical considerations of numbers and cost rather than principle. Note that the same decision was not made for the 4th grade voluntary national test in reading, where the Department of Education plan, prior to the transfer of responsibility for the test content to NAGB, called for the test to be printed only in English. Given that position, it is obvious that the measurement intent is not the ability to read, but the ability to read in English. It is worth noting that both the plan to develop a Spanish language version of the 8th grade mathematics test and the plan not to do that for the 4th grade reading test were challenged. Advocates for linguistic minority children argued that the reading test needs to be given in Spanish to allow children who have received instruction in Spanish to participate in the test. On the other hand, the plan to give the 8th grade mathematics test in Spanish was called a "terrible idea" by those who argued that "[s]tudents in American schools must learn to function in the larger society, which requires knowledge of the English language" (Ravitch, 1997b, p. 64).

Although the allocation of the likely effects on performance to construct-relevant and construct-irrelevant sources of variance is straightforward and noncontroversial for some variations in administration procedure (e.g., the use of eye glasses), in most cases there is uncertainty. First, there is controversy about the choice of construct: should it be reading in English, doing mathematics in English, or should it be reading or doing mathematics in the language of a student's choice? Second, there is uncertainty about the degree to which skills intended to be ancillary (e.g., reading on a mathematics test) influence performance for different groups of students. Third, the evidence to support or refute a specific allocation of variance as relevant or irrelevant is generally weak or nonexistent. Does the extra time allowed a student diagnosed as having a particular learning disability make the test performance a better reflection of that student's skills and knowledge the test is intended to measure than it would be under standard time limits? Would a similar increase in time make the test a more valid measure of those intended skills and knowledge for students who were not diagnosed as having a learning disability? Both questions are relevant to judgments about the validity and fairness of the accommodation. If the answer to both questions is yes, then it may be considered unfair to allow extra time only for students diagnosed to have a disability, although it would enhance the validity of the measure for those students.

Two key ideas that direct the search for evidence relevant to an evaluation of the validity of the interpretations and uses of an assessment are the concepts of construct underrepresentation and construct-irrelevant variance (Cook & Campbell, 1979). As Messick's (1989) elaboration of these ideas makes clear, the nature of the evidence relevant to these two ideas can be multifaceted and quite complex. The basic ideas, however, are straightforward and intuitively reasonable. Construct underrepresentation occurs to the degree that the assessment excludes or gives little attention to aspects of the intended domain of measurement (e.g., excludes hard-to-measure skills and understandings specified in the content standards). Construct-irrelevant variance refers to the degree to which scores depend on skills that are irrelevant to the intent of the measurement. Although one might like a measure to be pure in the sense that it depends only on the single ability the test is intended to measure, performance always depends to some extent on ancillary abilities. That is, there are some construct-irrelevant sources of variance in the performance of test takers. Considerable effort is put into minimizing the effects of ancillary abilities. For example, the reading level of a mathematics test may be reduced well below the typical reading level for the grade in an effort to minimize the effect of reading ability, because it is considered ancillary to the intent of the measurement. Such efforts can effectively eliminate this noncomparability for most students, and may reduce its effect for other students, but there are limits. Obviously, this approach does not help students who read and speak a different language. A mathematics test administered in English is obviously unfair to students who speak only another language and receive their mathematics instruction in that other language. In either case the unfairness is the result of scores that are biased downward for members of one group but not another. In general, "any single task that relies on ancillary abilities that one [group po]ssesses and another group does not will be biased" (Haertel & Linn, 1996, p. 67). Although easily stated, such bias is often difficult to evaluate and even more difficult to eliminate in practical assessment situations. Language, for example, though real and important for large numbers of students in the U.S., is in some ways too easy a case. Familiarity with specific aspects of tasks on an assessment is more subtle and more pervasive.
Fairness questions arise most sharply around high-stakes decisions. Willingham and Cole (1997) have provided a thoughtful analysis of the way in which the concept of test fairness has evolved. They argued that the concept of test fairness was made more complicated by the addition of requirements that the test be "unbiased" and that the uses of test results be "equitable and just." They went on to make the case that "fairness is an important aspect of validity ..." and that "[a]nything that reduces fairness also reduces validity" (p. 6). They concluded that "test fairness is best conceived as comparability in assessment; more specifically comparable validity for all individuals and groups" (pp. 6-7). Equating test fairness with comparable validity for all individuals and groups places some constraints on the concept that may not be compatible with certain views of equity or justice. The formulation by Willingham and Cole, however, provides a useful lens for analyzing quite a range of fairness issues.

Test fairness as comparable validity implies a great deal for the ways in which evidence can be brought to bear on fairness questions, and challenges some older ways of thinking about fairness. Although popular discussions often speak of validity as if it were an all-or-nothing characteristic of a test, there is broad professional consensus that validity is neither a property of the instrument nor an all-or-nothing characteristic. Inferences and actions based on test scores are validated, not the test itself, and validity is always a matter of degree. The same could be said of fairness. Thinking of the concept as a matter of degree rather than an all-or-none characteristic, however, may be even farther removed from common usage in the case of fairness than in the case of validity. It may also be more foreign to think of fairness as residing not in the test but in the inferences or decisions based on test scores.

Evidence about construct-irrelevant variance is especially important to an evaluation of the fairness of an assessment, because groups may differ in a wide variety of attitudes, experiences, and capabilities that are not explicitly a part of the intended focus of measurement. The assessment would have reduced validity and be less fair to the degree that those ancillary skills and characteristics affect performance on the assessment. It should be noted that skills that are considered ancillary or sources of construct-irrelevant variance for one interpretation or use of assessment results may be relevant for part of the intent of measurement for another interpretation.

The final issue I want to mention before closing is an important one, but also a very complex and messy issue. That is the notion of opportunity to learn. Although it may be fair to include material students have not had an opportunity to learn for some uses of the assessment results (e.g., monitoring system progress or system-level accountability), the same assessment may be considered quite unfair for other purposes (e.g., consequential decisions about individual students) without evidence that students had had an adequate opportunity to learn.

Unfortunately, with the Voluntary National Test the variations in consequences associated with uses of the test from one state to the next or one district to the next are unknown. The Administration has been in such a strong promotional mode so far that little serious attention has been given to the question of appropriate and inappropriate uses. The committee that developed the draft specifications for the eighth grade mathematics test made some fairly strong assumptions about the learning opportunities that students would have had before they take the test. According to the Mathematics Test Specification Committee, for example,
Students should have had extensive experience in gathering data, organizing data into tables and graphs, and interpreting data from a variety of situations. Since the early elementary grades, they have interpreted, used, and made basic bar graphs, pictographs, and line graphs, although often these experiences may have been teacher led in terms of making decisions about form, scale, axes, units, and so on. Students should have been previously assessed on basic reading and constructing of such graphs to organize data. Consequently, it is assumed that they are ready to be assessed at a more in-depth level than making a bar graph on provided axes or reading information from a line graph. (http://www.mprinc.com/nationaltests/math.htm)
The committee also assumed that students should have had opportunities to use calculators on a routine basis. They viewed calculators as a standard tool that should be available to students and hence pushed for allowing students to use calculators of their choice when taking all parts of the Voluntary National Mathematics Test. Of course, not all students will have had an adequate opportunity to learn how or when to use calculators. Nor will all students even have access to calculators when taking the test. Although the specifications call for test items that can be done either with or without a calculator, I believe that the committee's permissive stance on the use of calculators raises fundamental issues of validity and fairness that could cause serious problems in any high-stakes uses of the test that might be mandated by a state or district. Such uses would require close attention to the social consequences of the test.

Evaluation of the adequacy and appropriateness of inferences and actions based on results of the Voluntary National Tests will require the efforts of many researchers and organizations. Given the federal financial support for development, the government sponsorship, and the prominent place of the tests in the Administration's education agenda, the validation demands are greater than they are for a typical test. The prominence of the tests will only heighten the value conflicts and exacerbate the difficulties of conducting rigorous validation work. For these reasons, I think it is critical that the measurement profession and organizations such as ETS get actively involved in fostering such work, starting with the 1998 field tests and continuing with the use of the tests in 1999 and beyond. Measurement professionals cannot answer the question "Should the test be used for this purpose?" that underlies Messick's demand for an overall evaluation of "adequacy and appropriateness," but it is incumbent upon the profession to accumulate evidence regarding the consequences of particular uses of tests, especially when they bear on the certification of individuals. As Cronbach (1980, p. 100) aptly put it, our task is not to judge for others, but to inform their judgments so that they can use their own values.
Anastasi, A. (1954). Psychological testing. New York: Macmillan.
Baker, E. L., O'Neil, H. F., Jr., & Linn, R. L. (1993). Policy and validity prospects for performance-based assessment. American Psychologist, 48, 1210-1218.
Campbell, J. R., Donahue, P. L., Reese, C. M., & Phillips, G. W. (1996). NAEP 1994 reading report card for the nation and the states: Findings from the National Assessment of Educational Progress and Trial State Assessments. Washington, DC: National Center for Education Statistics.
Cheney, L. (1997, August 11). Once again, basic skills fall prey to a fad. New York Times, p. A13.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.
Cremin, L. A. (1989). Popular education and its discontents. New York: Harper & Row.
Cronbach, L. J. (1960). Essentials of psychological testing (2nd ed.). New York: Harper and Row.
Cronbach, L. J. (1980). Validity on parole: How can we go straight? New Directions for Testing and Measurement, 5, 99-108.
Cronbach, L. J. (1988). Five perspectives on validation argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Lawrence Erlbaum.
Diegmueller, K. (1994). Panel unveils standards for history: Release comes amid outcries of imbalance. Education Week, 14(10), pp. 1, 10.
Gamoran, A., & Berends, M. (1987). The effects of stratification in secondary schools: Synthesis of survey and ethnographic research. Review of Educational Research, 57, 415-435.
Green, B. F. (1990). A comprehensive assessment of measurement. Contemporary Psychology, 35(9), 850-851.
Haertel, E. H., & Linn, R. L. (1996). Comparability. In G. W. Phillips (Ed.), Technical issues in large-scale performance assessment (pp. 59-78). Washington, DC: National Center for Education Statistics (Report NCES 96-802), U.S. Government Printing Office.
Innerst. (1995, November 2). Washington Times.
Linn, R. L. (1982). Two weak spots in the practice of criterion-referenced measurement. Educational Measurement: Issues and Practice, 1, 12-13, 25.
Linn, R. L. (1994). Performance assessment: Policy promises and technical measurement standards. Educational Researcher, 23(9), 4-14.
Linn, R. L., & Baker, E. L. (1998). Can performance-based student assessments be psychometrically sound? In J. B. Baron & D. P. Wolf (Eds.), Performance-based assessment: Toward access, equity and coherence. Ninety-third yearbook of the National Society for the Study of Education.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15-21.
McDonnell, L. (1996). The politics of state testing: Implementing new student assessments (Technical Report). Los Angeles: UCLA Center for the Evaluation, Standards, and Student Testing.
McLaughlin, M. W., & Shepard, L. A. (1995). Improving education through standards-based reform: A report by the National Academy of Education Panel on Standards-Based Education Reform. Stanford, CA: The National Academy of Education.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23, 13-23.
Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62, 229-258.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5-12.
National Council of Teachers of Mathematics. (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: Author.
Oakes, J. (1990). Multiplying inequalities: The effects of race, social class, and tracking on opportunities to learn math and science. Santa Monica, CA: Rand Corporation.
Ravitch, D. (1997a, August). National tests: A good idea going wrong. The Washington Post, p. A15.
Ravitch, D. (1997b, May). Yes to national tests. Forbes, p. 64.
Romberg, T. (1997, August 11). Mediocre is not good enough. New York Times, p. A13.
Rose, L. C., Gallup, A. M., & Elam, S. M. (1997). The 29th annual Phi Delta Kappa/Gallup poll of the public's attitudes toward the public schools. Phi Delta Kappan, 79(1), 41-56.
Sadler, D. R. (1987). Specifying and promulgating achievement standards. Oxford Review of Education, 13, 191-209.
Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405-450.
Wiley, D. E. (1991). Test validity and invalidity reconsidered. In R. E. Snow & D. E. Wiley (Eds.), Improving inquiry in social science: A volume in honor of Lee J. Cronbach (pp. 75-107). Hillsdale, NJ: Lawrence Erlbaum Associates.
Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum Associates.
undesirable, are not to be considered as conclusions about validity. In passing one should also note that we owe thanks to his employer for having the good sense to see him as someone clearing ground that had to be cleared if progress was to be made, rather than, as others have seen such critics, as an attacker of testing at a time when it needed little defense. ETS thereby aligned itself with those other enlightened employers such as IBM and Bell Labs in their golden days, who supported research by their staff even if it opened up prospects of having to rethink the assumptions on which much of their business practice depended. In the case of ETS, as with the others, this policy commanded a respect in the educational research community that could not be bought at any price. Having paid some well-earned compliments, what I can now do, albeit mainly in programmatic form here, is to follow Sam Messick's example and push still harder at the value assumptions underlying paradigmatic standardized testing. I will pick up six examples of places where I think we need to reconsider our present assumptions. I should start by emphasizing that in this enterprise the issue is not whether I get all these points right; that is more than one can reasonably expect in an exercise which challenges conventional expertise on so many issues. It would at most be reasonable to hope that some few of these points survive serious reconsideration of current theory and practice. Nor should this concession to reality be seen as grounds for casually dismissing any of these points as catcalls from the bleachers. I do not put them forward with any sense that the arguments for any of them are slight or merely provocative. I have argued very hard for each point: these are serious charges, because these arguments attack the foundational structures of theory and practice in assessment.
Let me start with this core notion, because other assumptions depend on it. In the Educational Researcher paper, Sam Messick pointed out the extent to which the NAS report is committed to what he called the 'institutional perspective': a focus on the selection and employment functions of testing. He suggests that we should pay more attention to the diagnostic and instructional frameworks in which testing might be placed, with a particular eye on the validity of the inferences involved. This may look like a rather pedantic effort to draw distinctions. As we'll see, however, it bears directly on related conclusions about utility, validity, and what I am about to discuss: the validity of inferences from results on the test to states of affairs beyond it. The GRE, for example, is a valid test of certain skills, both quantitative and verbal, and we use
it to predict future performance in graduate school. It is not correct to refer to the GRE as a valid test of future graduate school performance, essentially because no test can be a valid test of something that has not yet happened. When we talk of its use as a predictor, the more exact formulation would be to say that the GRE is a valid indicator of future graduate school performance (i.e., we can make valid (although of course only probabilistic) inferences from the test to future performance). It is not a valid test of those matters, although it's easy to see how we are tempted to talk that way. In medical diagnostics, we have a number of tests for cancer. In the sense that we are using here, the biopsy is a logically valid test, because it involves direct visual identification of cancer cells directly removed from the patient's body. The others (for example, the mammogram) are tests of various states of affairs that provide evidence from which we can make inferences of modest to good validity about the presence of cancer. (In the Cronbach and Meehl terminology, this is of course concurrent validity rather than the predictive validity of the GRE; for Messick, it's 'diagnostic utility', an excellent way to describe it.) One might think that at least a positive mammogram is also logically valid because it is commonly thought to be an example of 'seeing' the cancer; but the false positive rate is high just because benign tumors present identically. And of course the false negative rate is very high: neither is consistent with true logical validity. (The problem of (possible) 'pseudo-seeing' also applies to tests of teacher competence that are based on what experienced teachers say they see in the candidate's performance.) Note that logical similarity between test content and criterion is not the only relation that supports test validity; a mercury thermometer is a valid instrument for measuring temperature even though the volume of mercury is not a sample of the thermodynamic concept of temperature. It is valid because it is calibrated so as to exactly exhibit an aspect of variations in temperature, and nothing else. When cows lie down in the fields, although this is said to be an indicator of coming rain, it is surely not an aspect of it. In general, there are no valid tests of future affairs, only indicators. Current states of affairs, however, are directly and validly accessible by tests like the thermometer. Validity in mathematics is like validity in tests: it has to be supported by logically valid proofs. The judgment of good mathematicians as to the truth of unproved hypotheses like Fermat's Last Theorem is highly correlated with actual truth; but that does not make it proper to call their judgment a type of proof. So one should not call the reliability of inferences
from results on test X to condition Y a kind of test validity; it's just a diagnostic inference, a shadow of coming events, not part of them. Although Messick wants to move to what he calls unified validity, he takes this to include both of what are, I suggest, properly called validity and utility:
"he essence of unified validity is that the appropriateness, m e ~ ~ w e s s , and usewess of score-basedinferencesareinseparableandthatthe unifjmg force behind this integration is the trustworthiness of empirically grounded score interpretation, thatis, construct validity. (1989, p. 5) By contrast, although I alsowant to move to a unitary notion, it's a unitary notion ofvalidity only. I want to do thls by separating off the validity from theutdity. This meansabandoning all seriousefforts to identify separate notions of validrty"constmct valtdity, predictive validity, etc.-and r e c o ~ e n ~ that g weregardthemasmattersofuullty not validity, andthat we restrict the use of the concept of validity to the original and proper notion. We got seduced by talk of 'the nomological network' into r e ~ r all ~ legitimate g inferencesfromtestresultsaspartofthe valtdity of the test, instead of being a separate issue of inference validrty.(I was also seduced, which got my name in the attribution footnote to the original Cronbach and Meehl paper,so this can be seen as a slightly belated postscript to that essay.) A test is a valid test of X if and only if it directly tests for X, that is, it calls for performance correctly described as X or an essential part of X, whenadministeredandscored in theprescribed way to the prescribed population, etc. Predrctive or concurrent inferences are useful applications of a test, but not separate types of validity. Content and face validity are considerations in identifymg valid tests,but not separate types of validity in themselves. Constmct validity is an illicit combination of test validity and indicatorvalidity(i.e., the validityofinferencesfrom the test to other conclusions) So, a test of X that is useful in predicting Y should not be advertised as a test of Y' or even as a test forJI;it can be advertised as a testthat is a good indicator of Y, or as a test (that can be used) for predicting Y. Further examples: (a) A projective test is a test of one9s response to inkblots, etc., not of p s y c h o d ~ state; ~ c @) The M e y e r s - ~ r i ~ can s correctly be called an inventory; (c) "he " P I tests people's belief in various c l h s (thanks to the lie scale), and indicates something itnportant about their classi~cation in a rather antiquated but not useless taxonomy; (d) The SAT may or may
not be an aptitude test, but it's no longer seen as staking much on that title; and it is a valid test of certain scholastic skills, with predictive utility as an indicator of later performance. (e) The Stanford-Binet can be called an intelligence test because, first, the sub-tests require intelligence; and second, because it correlates as well or better, across large numbers of cases and allowing for the usual sources of error, with the judgments of teachers who know the individuals tested. Still, we must be careful about the substitution of indicators for direct evidence, which is a recurrent source of confusion about validation. Taking high-stakes test results as validating tests is another form of confusion here; what it does is to encourage the violation of the standard conditions for one use of the test. In the evaluation of teachers, the discovery of correlations between certain features of teaching styles and successful causation of learning in students led to the use of those styles, which are only indicators of good teaching, as tests of good teachers. This became known as the 'research-based model of teacher evaluation' and has been widely built into standard evaluation procedures, often with state endorsement. It is a completely inappropriate procedure because we have, or can readily get, evidence about criterial performance. This reminds us that any use of indicators (i.e., variables that justify an inference in some cases) can only be justified when we cannot get or cannot afford to get criterial evidence (i.e., variables that define the quality we are seeking to identify); and this is indeed the case with the predictive use of academic tests. In the testing world, we tend to forget that teacher-made tests make up the largest proportion of tests in use. And validity as I have described it, not validity in the sense of what distant matters can be inferred from the results, is exactly the standard to which they should be held. In the world of commercial testing, we are today far more sensitive to looking at the consequences of testing than we were, and at looking at contextual modifiers: both of these remind us that the long-chain inferences from test scores to future performance are fragile. We should demarcate the use of tests as providing data for such inferences from the validity of the tests themselves, a much more robust property of tests. One should not be suggesting, as is too common today, that the validity of a test is that fragile; we were wrong to tie validity to these weaker inferences, just as we were wrong to tie merit at all strongly to performance indicators, or to teaching styles. The validity of tests should be seen as something tied logically, not correlatively, to the test. It is not a trivial matter to determine the validity of tests (or of the description of them, even in this narrower sense). A mathematical reasoning test, for example, developed by an ETS team, can't be validated by just checking whether its items are all good examples of mathematical items. The desired label here, calling it 'a test of mathematical reasoning', is not easy to attain: one must show that all significant types of mathematical reasoning are covered, and that there is not too much emphasis on one rather than another. But that's a feasible project, much less expensive than validation in the sloppy sense (i.e.,
validation of inferences to future performance). So, test validity is indeed a matter of whether the test tests what it says it tests, as the textbooks all say, but only if what it says it tests is what it directly tests.
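The kind of content audit this implies can be made concrete with a small sketch in Python; the reasoning-type categories, target shares, and tolerance below are my own illustrative assumptions, not anyone's actual blueprint.

from collections import Counter

TARGET_SHARES = {            # intended share of items per reasoning type
    "deductive": 0.25,
    "inductive": 0.25,
    "spatial": 0.25,
    "quantitative": 0.25,
}

def coverage_report(item_categories, tolerance=0.10):
    """Compare observed category shares against the target blueprint."""
    counts = Counter(item_categories)
    n = len(item_categories)
    problems = []
    for category, target in TARGET_SHARES.items():
        share = counts.get(category, 0) / n
        if counts.get(category, 0) == 0:
            problems.append(f"category '{category}' is not covered at all")
        elif abs(share - target) > tolerance:
            problems.append(f"category '{category}' share {share:.0%} "
                            f"deviates from target {target:.0%}")
    return problems or ["blueprint coverage looks balanced"]

print(coverage_report(["deductive"] * 10 + ["inductive"] * 2 + ["spatial"] * 2))

A check like this addresses only the coverage-and-balance half of the label claim; judging whether each item really is an instance of its category still requires expert review.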
Naturally, the validity of a test depends not only on its content, as we stated previously, but also on the way it is scored (and administered, etc., but we will here concentrate on the scoring). Considerations of fairness, a commendably favorite topic of Sam Messick's and an important element in validity considerations, suggest that a requirement that I have called the PCR should often be given center stage in the discussion of the test scoring component in validity. The PCR simply says that a point, however it is earned on the test, should represent an equal increment of merit on whatever it is the test is said to be a test of. We'll never attain this ideal in practice, but we can certainly get reasonably close to it: the paradigm is a test of basic arithmetic skill. It should be treated as an ideal in the design of all scoring keys (a.k.a. rubrics). Moreover, it is an ideal that should be approximated, not treated as extremely remote. (For convenience, we can focus on quantitative scoring, although the same principle extends to qualitative assessment, e.g., most grading.) Note that the PCR does not involve claims about item(-success) difficulty, only item-success merit. The two are connected, of course, but it's easier to show that all basic two-by-two multiplication items are of equal merit, in a test of arithmetic competence, than that they are of equal difficulty; and one can do it from one's armchair, which helps the development budget. The PCR is a principle that governs test validity, but its ethical aspect surfaces when we look at the continuing failure of students in less-well-funded schools to receive any training in test-taking skills. This means, for example, on the usual multiple-choice tests, that they are not taught to: (a) put in answers by guessing, when they don't know or have time to work out the correct answer; and (b) answer the easy questions first. Immediately, we have adverse impact: and, for SM and MS as well as all the enthusiasts for consequential validity, we have test invalidity. Not because of the adverse impact, however, although that must be studied, as SM stressed, because it often serves to warn us of invalidity, but because the test is in fact a test of two quite different skills, although its results are being treated as if the test only tested one of these skills. Students from the wealthier districts or parents are getting extra scores on that skill by virtue of having (purchased) training in
the other skill; a clear violation of the PCR. The worst violation of the PCR, however, involves a more general problem, mentioned as item (a) above. Blind guessing on the usual multiple-choice test has an expectancy of 25% and a merit of zero, a flagrant violation of the PCR. Now, the adverse impact of this can be reduced by applying a 'correction for guessing' (a misnomer, since the usual version is a correction for not guessing), but the validity of the test cannot be saved by this process, because the blind guesser still gets 25% of the total score for knowing nothing at all. Hence, in case you were getting a little sleepy by now, it is provable that all standard multiple-choice tests are invalid. I'll later indicate how to correct this without plunging off the straight and narrow path of sound testing into the quicksands of pseudo-authentic testing. The second argument for this conclusion concerns item (b) above: the award of a single point for all correct answers, when in fact a correct answer may be much more meritorious on one rather than another question. I'll now extend our list to include a third breach of the PCR. This is (c) the failure to distinguish between the scores for incorrect answers (i.e., zero), when in fact, on some items (not all), some incorrect answers are very nearly as correct as the official correct answer, whereas other distractors would only be chosen by someone with essentially no knowledge of the topic. (Note that there are test items, e.g., in competency testing for surgical or airline pilot skills, in which one wants to say that any answer besides the correct one is equally and absolutely valueless; but these are the exception rather than the rule.) In terms of our list of assumptions for assessment, under the heading of test scoring we are now addressing three common assumptions of standardized testing (two of them are not entirely independent): the assumption that blind guessing can legitimately yield 25% (in the usual case; 20% in some other cases); the assumption that all correct responses should be equally weighted in the scoring key; and the assumption that all incorrect responses should be equally weighted. These bring us up to a total of four assumptions that we will have discussed. I will propose relatively easy fixes for each of these problems; the main functions of the proposals are to show that fixes are not impossible, and to stimulate discussion, rather than to claim that these are the only or best ways to fix the problems. But first I need to head off the usual dismissal of any attempts to fix them. It is commonly said (forgive them, Lord, for they know not what they do) that such changes have been shown repeatedly to have little effect on the rank ordering of testees. Justice is not done by
what supposedly counts as negligible. If one of your children did not get a fair result because of a sloppy scoring key, even if it is only a rare occurrence, you would not regard this as acceptable, making that the test for acceptability. Of course, this recommendation assumes that the costs for remediation are not beyond our means. These errors are not just methodological errors but, because of the inevitably adverse effects on some people, ethical ones. But there is also a practical issue. To this enlightened company of colleagues, I am sure I need not stress that the days of assuming that tests are to be rated purely or even mainly as instruments for discriminating (rank-ordering) are over. We are interested in criterion-referenced testing as well as norm-referenced, not always but often. We need the total score on a test to be an indication of competency, not just of relative competency. This has some serious consequences (e.g., that we need to avoid the old practice of validation in which we threw out items that did not correlate well with other items, without thorough exploration of the possibility that they were probes into untouched regions of the domain). (In the critical thinking area, for example, they often were just that.) In the present case, we want a score that represents the actual value of the testee's knowledge or skills as well as is practicable. So we must set up the scoring key to measure that knowledge as accurately as we can, as always with due respect to cost feasibility. Ranking will always follow from grading, but should not be the primary function of most tests, and discriminating power should not be the key index of merit in items. The crucial question about a criterion-referenced test, analogous to the question of how many testees are wrongly ranked, is, 'How accurate and how efficient is our estimate of total knowledge?' Some rough calculations suggest a surprising figure. In round numbers, our present scoring system often throws away 50% of the information that is on the answer sheet turned in by the testee. We can recover most of that extra information by adding estimates of variable item merit, variable error merit, and by scoring the test so as to eliminate any expectancy for guessing. The fixes for this wastage that I am going to suggest involve three fairly simple procedures, but no pretense is made that they do not involve any extra costs. There is no completely free lunch in this business, and scoring will often take about 50% more effort to recover twice as much information in the typical case, not a bad ratio. However, the suggestions here are not all-or-none; one can recover smaller amounts, perhaps half of what is recoverable, at proportionately less cost.
When differential weighting of items was rejected in the past, the negative argument was that getting interjudge agreement on weights was difficult, hence no differential weights could be justified. The positive argument was presumably that because certain predictions from the level-weighted tests were modestly successful, then they were valid (or valid enough). On our account, the situation is much simpler: the test is valid only if its scores match its title. It may be useful for certain purposes without meeting that requirement; but that match is what is required for validity. It is likely to be more useful, for criterion-referenced evaluation, if we make the scoring more defensible.

Near-Miss Scoring. In general terms, there are two ways to produce improved tests that take some of these suggestions into account: test conversion and test creation. The first means improving the scoring of existing tests; the second means improving the instrument itself in the course of creating it. Item weighting is easily used in conversion; near-miss weighting can be used either way. In the conversion scenario, using the existing distractors, inspection will usually reveal that some of them are only subtly different from the correct answer, whereas others would only be chosen by someone who was extremely confused about the subject or knew little or nothing about it and was essentially guessing. The suggestion is that in many cases it is more appropriate to reward the testee who can tell the difference between nearly the best option and the hopeless ones, by awarding half a point for the near-miss. From the diagnostic point of view there is obviously considerable value in knowing this, in order to adjust future teaching: more on this in a moment. In the construction scenario, one can engineer the items so that they illustrate this distinction more clearly, since it is an important distinction. An independent reason for making these efforts is that they all contribute to improved face-validity, which leads to much improved user relations, as we have discovered in evaluating tests in the personnel evaluation field. The changes described do something to avoid the many complaints about what is often described as the 'rigid' and 'unforgiving' nature of standard multiple-choice tests. At this point, we need to say something more about the diagnostic use of these reweighted items. By running a computer analysis that identifies the number of near-misses in the score total versus the number of 'far-misses' versus the number of correct options chosen, and between the number of points scored on single-weighted items by contrast with those weighted 50% and 100% more, we can often identify a pattern that is
helpful for instructional purposes. For example, we might be able to suggest that an inordinate number of responses are in the near-miss category and that the testee needs to focus more on getting exactly the right form of expression or the right qualifications into their understanding of a matter. Or we might be able to classify someone as doing very well on basic items (merit level = weight = 1) by contrast with those worth double the points. We should also be able to identify the fingerprint of the (largely) random guesser who is doing rather better than the expectancy for that approach. When in the construction mode, it will also be desirable to try to restrict the actual range of item difficulty to the range we are recognizing with our 200% spread (i.e., to avoid items that are much more than twice as hard or meaty as any other item). This will improve the absolute accuracy of our scoring as an estimate of merit (and the accuracy of our estimate of information gained over the standard approach), although the superiority of near-miss scoring over the standard method of level scoring will not be impaired, or improved, in the slightest. Ethically, as well as epistemologically, the advantage of doing this is to reduce the payoff from the testwise strategy of doing easy items first. (Of course, there will still be an advantage in doing items first that are easy for you.)
Guess-Control Scoring. Now we come to the feature of our scoring reform plan that eliminates pay-for-guessing. Let's begin with a plan we won't stay with but which indicates the approach. Suppose we are converting or creating an item with four options: one correct, one near-miss, one plain wrong answer without redeeming features (but not ridiculous), and one far-miss (i.e., a hopelessly wrong or absurd answer). If this is a typical item, and if we score one for the correct answer (subject to the multiplier from the weighting of the item), half for the near-miss, zero for the plainly wrong answer, and minus 1.5 for the absurd answer ('absurd' only if you understand the field), the expectancy from blind guessing goes to zero, and with it the benefits of testwise training in this respect. Or you could deal with an item that had two far-miss answers by scoring the responses 1, .5, -.75, -.75, with the same net effect. And so on for near-misses. The possibility of a net negative score on some items creates the possibility that the testee loses points that were legitimately gained on other items. At first sight this seems 'unfair'; however, it is only the total score that is used for test-dependent actions, or that is relevant to test validity, and the total score more accurately reflects merit if this constraint is imposed on it. (For diagnostic purposes, the teacher can look at each item separately. If testees omit an item, they get zero points; if they do that, there is no need to correct for what they earned on other items.)
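As a worked check on the two keys just proposed (my arithmetic, assuming a blind guesser picks each of the four options with equal probability):

\[
E_{\text{standard}} = \tfrac{1}{4}(1) + \tfrac{3}{4}(0) = 0.25, \qquad
E_{\text{reformed}} = \tfrac{1}{4}\,(1 + 0.5 + 0 - 1.5) = 0, \qquad
E_{\text{two far misses}} = \tfrac{1}{4}\,(1 + 0.5 - 0.75 - 0.75) = 0 .
\]

The standard key hands a blind guesser a quarter of the marks in expectation; either reformed key drives that expectation to exactly zero.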
These changes have a number of advantages over the present standard procedure (MC), by comparison with other alternatives that have been proposed.
1. It does not involve abandoning objectively scorable tests, and hence it keeps us out of the serious financial trouble that the adoption of most so-called authentic alternatives would involve.
2. It reduces some of the antagonism inspired by standard MC tests.
3. It eliminates the advantage of testwise testees (thereby reducing adverse impact).
4. It provides no reward for blind guessing, thereby avoiding gross invalidity, and avoiding positive reinforcement of the practice of faking knowledge when it's not there. Note that guessing
thatisbasedonagood a ow led of thesubjectbut ~ c e ~ a i n ~ still makessense, abouthowtoansweraparticularquestion research on it, andw llipay off; we can think of this as “informed . It avoids one of the pressures towards grade inflation. It’s c o ~ o to n give C-or at worst D grades for 35% scores, which represent a true owle edge level of IO%, h a r ~ approp~ate y for a pass. Worse, a score of even 45% on a st will often reflect 10% or more from overcueing or other as much as it does knowledge, so even that score level may or less of true owle edge. Using thescoringsystem stedherewouldmakeithardertojustifyanygivenlevel of se they wouldnot all be pushed upward by a false Now$if one did not use negative scoringof the kind could alte~ativelyjustsubtract 25% from each ‘‘correction for guessing”thatwouldmake@End)guessinganincorrect The results would no doubt be poEtic~yexciting. F o ~ a t e l ytihis , is inva~dfor other reasons. However, the kind of grade inflation due to bad MC test scoring may be one main reason that people report that US. educationisintroublenationally,althoughtheyalsosaythatthe schools their children attend aregood: their chddren are getting good (onlybecauseinflatedasabove)buttheyarealarmedbytherepo read about the state and nation, which are based on more analytical studies. 6. It fac~tatesbetterfeedbacktotesters,teachers,andtestees for a n of thetest.Itshares this advantagewith com~uter-inter but for com~letelydifferent reasons. It squeezes considerably more tionaboutthetesteefromthefootprint of the sametest thesametestitems.Wehavemadeagooddeal t it appears we can uter interactive testing, e by redoing ow: of at least the sme ma *
The next assumption on our list is one that has received remarkably little critical scrutiny. It is the common textbook claim that multiple-choice items are the most general form of 'objective' test (i.e., objectively scorable test). 'Objectively scorable' simply means scorable by key or machine once the test is completed by the testee, without further human judgment. The term is something of a misnomer, since the scoring key used may itself embody judgment; all that's objective is the scoring procedure instead. It's an entrenched usage, however, and there are good reasons for it; after all, most large-scale
testing would be virtually impossible to finance if we used essay items throughout. Returning to the claim being considered, it is true that the multiple-choice item is more general than a number of other machine-scorable item types such as the true-false item or the matching item. But it is far from being the most general machine-scorable item type. I will briefly review another and considerably more general type that offers some powerful advantages over the multiple-choice item (which is a special case of it), not in all cases but in many. It may or may not be combined with the three modifications to the standard MC item that are built into what we have called the modified MC item in the previous section. This type of item is called the multiple-rating item, and is considered at greater length in Fisher and Scriven (1997). The multiple-rating item is typically presented in a format that is rather similar to a multiple-choice item. It begins with a text stem followed by a set of short, numbered, text passages. But what the testee is called on to do with it is completely different. First, the short passages following the stem (they are called stubs) are not options from which the best is to be picked; each is to be dealt with separately. Hence, none serve the throwaway role of distractor, and we get more mileage out of each; a stem with four stubs is formally equivalent to four MC items, which saves time and trees. Second, what is to be done with them, the stubs, is not to choose one from the set, but to rate each one using a rating scale provided; hence the term multiple-rating item. The rating scale might be a merit scale A-F with suitably defined anchors; or it might be an ad hoc scale such as a set of cost-effectiveness categories or storm weather categories. Any item may have any grade; so, for example, there may be no As or all As. For teacher-made tests, the student can write the appropriate letter into a single box that is provided; for machine scoring, a list of the rating options (e.g., the letters A to F) would be printed next to each stub and the testee would circle one of these. One advantage of this type of item is that it cuts directly to higher-order thinking skills, notably evaluation and critical thinking skills and, with a suitable choice of items, synthesis skills. This represents one advantage over the more typical kind of MC item, though MC items are in principle capable of similar testing. A second advantage is that it undercuts many of the other test-taking strategies, such as looking for subtle cues to the correct choice, or eliminating options without having any knowledge with
respect to the right answer ('I don't know what this option means, but the rest are surely wrong.'). A third advantage is that the evaluation task is much more realistic than the choice task; in the real world, we are rarely offered a set of options of which one is guaranteed to be correct: in this sense, MR items are more authentic than MC items. A fourth advantage is that many of the usual errors of amateur item-writers are avoided; for example, grammatical cueing.
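To fix ideas, here is a minimal sketch, in Python, of how a multiple-rating item and its key might be represented; the item content, the grade scale, and the scoring function (which anticipates the near-miss and guess-control rules given under 'Scoring for MR Items' below) are all illustrative assumptions, not anything prescribed by Fisher and Scriven.

GRADES = ["A", "B", "C", "D", "F"]  # five ratings, as on the A-F merit scale

mr_item = {
    "stem": "A short technical report (full text would appear here).",
    "stubs": [                      # each stub is rated, not chosen
        {"text": "Summary 1 ...", "correct_grade": "B"},
        {"text": "Summary 2 ...", "correct_grade": "F"},
        {"text": "Summary 3 ...", "correct_grade": "A"},
        {"text": "Summary 4 ...", "correct_grade": "C"},
    ],
}

def score_stub(given, correct):
    """1 point for the correct grade, 0.5 for one grade away,
    -1 for two or more grades away (the 'negative one' rubric)."""
    distance = abs(GRADES.index(given) - GRADES.index(correct))
    if distance == 0:
        return 1.0
    if distance == 1:
        return 0.5
    return -1.0

responses = ["B", "A", "A", "C"]
total = sum(score_stub(r, s["correct_grade"])
            for r, s in zip(responses, mr_item["stubs"]))
print(total)   # 1 + (-1) + 1 + 1 = 2.0

The point of the data structure is that every stub carries its own correct grade, so a four-stub item is scored as four sub-items rather than as one choice among distractors.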
Examples of MR Items. An important kind of example of the multiple-rating item, although no more than a sketch is offered here: The stem would be a text, let us say a technical report. The four stubs (or two, or six) would be attempted summaries of it, as if for use in a condensed field manual or textbook. The rating scale might have five anchors like: (a) essentially accurate but not highly readable; (b) readable but seriously inaccurate in content; and so on. Another example would present an inflammatory speech on campus as the stem; the stubs would be critiques of the argument. They might of course all be equally good or bad, or any intermediate mix. There is no way of pulling up one's score by eliminating alternatives in order to increase one's chance of a correct answer, despite having no understanding of the material. A third example, of a kind that one might say belongs in a school of education: the stem describes an eleventh-grade assignment, and the stubs are descriptions of five student answers from the previous semester, to be graded as appropriate for such a class. This would be an example, like the one just given, which is readily generalizable to other instructional contexts, certainly in junior high school and
beyond. This also illustrates a useful and labor-saving device, namely, the reuse of earlier answers: a teacher can mark or note a few answers, keeping them from the previous term, as illustrating important strengths or errors, giving students examples for
them to emulate and learn something about the teacher's grading behavior. Indeed, having the student role-play the instructor in grading is a step in the process of internalizing quality standards, with a real payoff. This is an advantage that also helps bridge the grading gap, so destructive in many classrooms; thus it bears on Sam Messick's recommendation that we look for instructional uses of testing.

Scoring for MR Items. The basic scoring key for each stub (sub-item) would be to award one point for the correct grade and zero for the rest, if using the traditional approach. Using the recommended near-miss system, it would award half a point for a grade only one grade away from the correct grade, zero for the rest. If that were the only innovation in the rubric, however, intelligent blind guessing would still bring in a substantial reward; it could in fact bring in 40% by using the optimal strategy of guessing C on every stub. But using the recommended guess-control system of negative scores for far misses, we would set a score of minus one point for being two or more grades off the correct grade (e.g., for selecting C when the correct grade is A or F). If you construct the items so that about an equal number of them have each of the five possible answers (A through F) as the correct answer, that 'negative one' rubric will guarantee at best a zero expectancy for blind guessing. If using item-weighting, you would also want to check that in the long run this arrangement holds across each group of items of comparable difficulty/merit level. For a teacher-made test, it's hardly worth taking that much trouble about balancing the correct answers, unless constructing a cumulative item pool, in which case the trouble is relatively small for any one run of the test. If a test ignores balancing, the results are likely to be only minor departures from the ideal: The expectancy from guessing will vary slightly, from slightly negative to slightly positive, which should ensure that a guessing strategy will not be worth following, thus protecting the PCR. For those with an interest in diagnostic use of testing, and in deeper pedagogical insights from it, there is a powerful variation on the multiple-rating item, called the 'two-stage MR' item. In this type of item, we seek the reasons that testees have for their ratings, if any. For a given item, down the right-hand margin next to the list of its stubs, we provide a list of ten or a dozen possible reasons for the answers (grades) given. The list may or may not include correct reasons for each correct answer; it usually does provide some that are correct for one or more answers, and
incorrect or absurd for others, and a number that are incorrect for all answers. The testee then proceeds to rate each stub as usual, by selecting one of the grades (etc.) listed next to the stub, then writes in, next to the grade, the letters that label his or her reasons (or, for machine scoring, circles the appropriate letters from a list provided for each stub). For a teacher dealing with small classes, it is useful to leave a line between stubs on which the student can write in a reason if it is not seen as one of those on the list. This approach quickly pinpoints guessing, and also helps to identify lines of incorrect reasoning that may explain any surprising failures to get the right answer. I think it's fair to say that there is a common assumption in assessment circles that one cannot eliminate the possibility of guessing when using 'objective' tests; and I think it's clear that the two-stage MR item eliminates it about as well as the usual constructed-response (e.g., essay) test, and better than most of them. I hope enough has been said here to indicate that the MC test is a long way from being the most general machine-scorable test, and that there are interesting and useful possibilities in these further reaches of the testing universe, without the need for and cost of computer-interactive testing or panel grading. In head-to-head comparisons with simulation testing, the most authentic of all authentic tests, multiple-choice tests of the highest technical quality have done remarkably well; my thought here is that improving the scoring system for them, and supplementing them (not replacing them) with some, perhaps almost as many, multiple-rating items, can lead us to a new and significantly higher level of practical testing methodology without the need for computer equipment and its attendant costs and without the need for moving into the quicksands (and the even higher costs) of most of the other, allegedly authentic testing approaches.
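To verify the guessing arithmetic given under 'Scoring for MR Items' above (my calculation, assuming the five grades A-F are equally often correct and that a near-miss is an adjacent grade): a testee who always guesses C earns, per stub,

\[
E_{\text{near-miss only}} = \tfrac{1}{5}(1) + \tfrac{2}{5}(0.5) = 0.40,
\]

the 40% reward mentioned there, whereas adding the minus-one penalty for responses two or more grades off gives

\[
E_{\text{negative-one rubric}} = \tfrac{1}{5}(1) + \tfrac{2}{5}(0.5) + \tfrac{2}{5}(-1) = 0,
\]

the promised zero expectancy (guessing at the extremes, A or F, does even worse, at -0.3 per stub).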
It seems clear enough from the research that the best one-shot test of composition skill is a cloze test. The result strikes nonprofessionals as bizarre, but professionals know that when you have low test-retest reliability, a relatively stable even if content-invalid indicator can average closer to the mean of the criterion measures than any single measure of one true criterion sample. Nevertheless, although it's apparently better as a predictor of what we want to predict, we should not use it, and in fact we no longer do use it as the principal predictor of the criterion ability. Why? The usual assumption is that the correct answer is simply that it's politically unacceptable . . . (concurrent) validity . . . there is no feasible alternative . . . if it wins on the statistics . . . the state we are . . . with the cloze test . . . on what can be . . .
Fisher, A., & Scriven, M. (1997). Critical thinking: Its definition and assessment. Inverness, CA: Edgepress; and Centre for Research in Critical Thinking, University of East Anglia.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
My career has been strongly influenced by various mentors, but strangely with much delay from the time at which their influence was first encountered. Perhaps the delay was my way of indicating intellectual independence, not to adopt the approaches of my mentors while I was still in their presence. When I worked with Elliot Aronson at Harvard, I stayed away from his favored cognitive dissonance theory, but started writing articles on the topic 15 years later. Gordon Allport advised my work on a dissertation problem that I devised to be quite remote from his topic interests and preferred research methods. Fifteen years later, however, I was working on one of his favorite topics, the self, and after nearly another 15 years I was working on prejudice, another of Allport's major topics. It is therefore no surprise to me to recognize that, many years after being exposed to Sam Messick during a few formative postdoctoral years at Educational Testing Service, I developed a research program that combines three of his career-long passions: psychological testing, construct validity, and educational psychology. How these influences worked themselves out in my research of the past several years, in a fashion that I hope will please Sam, is what this chapter is about.

The psychological test in which I became interested is a set of procedures known as student ratings of instruction. Anyone who teaches at a university in the United States has been exposed to these, as have those who teach in many other countries. In 1989 and 1990 I received widely divergent ratings for teaching the same course in 2 successive years, using the same syllabus, text, course requirements, meeting format, etc. Although the discrepancy could conceivably have been nothing other than sampling error, it was impressive in degree: my average ratings were separated by 8
deciles (about 2.5 standard deviations) on the university norms that accompanied the ratings reports I received. Searching for an explanation, I naturally suspected that the ratings were influenced by something other than the qualities of the instructor, which should have been almost as constant as possible over the two offerings of this particular course. Perhaps it is significant that I first approached this topic as an outsider who had no previous research involvement in the topic. My first research on the topic therefore had some of the character of the naive child seeing the Emperor's New Clothes.
Electronic search made it possible to arrive quickly at a summary of the history of research on validity of student ratings. The search revealed that validity had been the focus of much research, peaking 15 to 20 years earlier. Figure 15.1 characterizes a sample of that research over a 25-year period. It can be seen in Figure 15.1 that, over the entire 25-year period, more publications favored validity than invalidity. It can also be seen that the research changed sharply in character around 1980. Prior to 1980, research was frequently critical of the validity of student ratings. The major validity-related criticism of the pre-1980 period was concern that instructors who graded leniently received inappropriately high ratings. As Figure 15.1 shows, 1980 marked a change in the character of research on student ratings. There was a decline in the number of studies that were neutral (dropping from 31 to 16 between 1976 and 1980) and in those that were critical (dropping even more). At the same time, the number of studies favorable to validity remained about the same, and these increased in proportion between 1976 and 1980 to a majority of publications. By the 1990s, research on validity of ratings had diminished to such a low level that it is easy to infer that researchers regarded the major issues as settled. Articles published from 1980 on do indeed give the impression that some major questions about ratings validity were considered to have been answered. Researchers were remarkably general in their willingness to proclaim the validity of student ratings in confident

FIG. 15.1. Scholarly appraisals of validity of student ratings. This figure summarizes the distribution of study conclusions on the basis of abstracts retrieved from searches of PsycINFO and ERIC, using for both data bases the same search query, with truncation matching any words formed by appending letters after a stem. Categorization as 'biased' indicates study conclusions that student ratings of instruction are contaminated by one or more extraneous influences. The ERIC search was limited to unpublished reports, in order not to have the two searches produce duplicates.

fashion. The opinions of several prominent reviewers, concerning leniency as a possible source of invalidity of ratings, are well represented in this quotation from McKeachie (1979):
In general, . . . most of the factors [that] might be expected to invalidate ratings have relatively small effects. . . . Some studies have found a tendency for teachers giving higher grades to get higher ratings. However, one might argue that in courses in which students learn more the grades should be higher and the ratings should be higher, so that a correlation between average grades and ratings is not necessarily a sign of invalidity. . . . My own conclusion is that one need not worry much about standards within the range of normal variability. (McKeachie, 1979, pp. 390-391)
Although research published in the 1970s covered a variety of concerns about validity, the major concern of that period was the possible effect of grades on ratings. The concern with grade-induced bias is apparent in the following quotations:
“he present evidence, then, supports a notion that a teacher can get a “good” rating simply by assigning “good” grades. The effect of obtained gradesmaybiasthestudents’evaluation of theinstructorandtherefore challenges the validity of the ratings used on many college and university campuses. (Snyder & Clair, 1976, p. 81) The imp~cationsof the findings reportedareconsiderable,and it is nt of ins~ctionmust be sted that the validity of s ~ ~ eevaluations qu~stionedseriously. It is clear that . .an instructor [who] inflates grades ..will be much more likely to receive positive evaluations. ~ o ~ n & Wong, 1979, p. 775)
These were conclusions from experiments in which grades had been manipulated upward or downward, and the manipulated grades were observed to result in raised or lowered student ratings, correspondingly. Several such experiments, mostly in the 1970s, were conducted in actual undergraduate courses (Chacko, 1983; Holmes, 1972; Powell, 1977; Vasta & Sarmiento, 1979; Worthington & Wong, 1979). On reading these field experiments in the 1990s, it is easy to conclude that, taken side by side, they make a powerful case that ratings can be biased by grading practices. Those experiments are difficult to replicate in the 1990s, because their grade manipulations imposed stresses and used deceptions that university human subjects review committees do not now look kindly on. However, the best argument for not replicating these
experiments 20 years later is that it hardly seems necessary to do so: the results of the older studies were clear enough so that there seems little doubt about what new replications would find.1

This is a strange situation. On the one hand, experimental results reported during the 1970s appeared to demonstrate that grading practices influence student ratings. Contemporary folklore among academicians also endorses the conclusion that one can raise ratings by inflating grades. On the other hand, concern about the possibility that grading practices can distort student ratings largely disappeared from the scholarly literature on student ratings after about 1980. How did research manage to quiet concerns that ratings could be biased by manipulating grades?
Since 1980, research on student ratings has mostly been in the form of correlational construct validity designs. Three kinds of studies provided evidence that has supported the construct validity of student ratings.
Multisection Validity Studies. In the best-known and largest group of construct validity studies, multiple sections of the same course are taught by different instructors, with student ability approximately matched across sections and with all sections having identical or at least similarly difficult examinations. Using examination performance as the criterion measure of achievement, these studies have determined whether differences in achievement for students taught by different instructors are reflected in the student ratings of the instructors. The collection of multisection validity studies has been reviewed in several meta-analyses. Although the meta-analytic reviews do not agree on all points concerning the validity of student ratings, nevertheless it is clear that multisection validity studies yield evidence for modest validity of ratings. Correlations between ratings and exam-measured achievement average about 0.40 (see the overview of meta-analyses by Abrami, Cohen, & d'Apollonia, 1988, esp. pp. 160-162).2 Multisection validity studies favor construct validity of ratings by supporting an interpretation of observed grades-ratings correlations as based
1 Contemporary reviews of student ratings literature either omit treatment of these natural classroom experiments on effects of manipulated grades on ratings, or mention them only in the context of suggesting that they are collectively flawed (e.g., Marsh & Dunkin, 1992, p. 202).
on effects of a third variable, teaching effectiveness. If grades covary with ratings because good teachers produce both high achievement and high ratings, then all is well with the validity of student ratings.
The second type of correlational construct validity study also focuses on both grades and ratings. It explores the idea that effects of third variables explain their correlation, but in ways consistent with valid measurement of teaching effectiveness. For example, Howard and Maxwell (1980) used path analysis techniques to show that the grades-ratings relationship could be traced in part to measures of students' level of motivation prior to entering their courses, from which the relationship between grades and student satisfaction might be viewed as a welcome result of important causal relationships among other variables, rather than simply as evidence of contamination due to grading leniency. In another example of this type of study, Marsh (1980) observed that path analysis demonstrated that students' Prior Subject Interest had the strongest impact on student ratings and accounted for much of the relationship between Expected Grades and student ratings. Only a small part of the association with Expected Grade was seen as a bias on ratings, albeit a small one, and even this interpretation was open to alternative interpretations.
The third type of construct validity study seeks to assess both convergent and discriminant validity of ratings, and such studies have typically reported evidence for both convergent and discriminant validity.

2 Interpretation of this approximate .40 correlation as reflecting processes other than, or in addition to, validity of student ratings has also been suggested. For example, Marsh and Dunkin (1992, pp. 173 ff.) note that this correlation among students in different sections may reflect motivational variations and satisfaction with higher grades.

Consider an analogy: seasonal variations in hiring affect unemployment figures independently of the health of the economy, and make the raw unemployment rate misleading (i.e., discriminantly invalid) as an indicator of economic health. Fortunately, this discriminant validity problem of the raw unemployment rate does not render those data useless. If one applies appropriate adjustments for known seasonal effects, then the adjusted unemployment rate can provide a valid measure of the overall economy. Student ratings measures are now used in most undergraduate institutions without any adjustments. In other words, student ratings are treated as if they have excellent discriminant validity. It is assumed that student ratings do not suffer from any substantial contaminating influences (as asserted, e.g., in the previously quoted remark by McKeachie, 1979). On the one hand, this seems implausible, because convergent validity with nonrating measures of quality of instruction has never been shown to be more than moderate, and also because replicated experiments, conducted in actual classroom settings, have repeatedly demonstrated that grading policy variations substantially affect student ratings. On the other hand, however, for the past 20 years well-respected researchers have repeatedly asserted that student ratings are construct-valid measures of instructional quality. This is a paradox.
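The seasonal-adjustment analogy can be pictured in ratings terms with a simple formula (my notation; an illustration of the logic, not any institution's actual procedure). If R_c is a course's mean rating, \bar G_c its mean expected grade, and \hat\beta the slope of the between-courses regression of ratings on expected grades, then

\[
R^{\text{adj}}_c = R_c - \hat\beta\,\bigl(\bar G_c - \bar{\bar G}\bigr)
\]

removes the component of a course's rating that is predictable from its grading level, just as seasonal adjustment removes the component of unemployment that is predictable from the calendar.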
An acceptable response to the paradoxical status of the literature on student ratings is to attempt to subdue the paradox with theoretical analysis and new data. Toward that goal, a series of data collections was conducted at the University of Washington starting in 1992. These studies sought to choose among the alternative theoretical interpretations that were central to pre-1980 concerns about ratings validity (see review by Stumpf, 1979). Figure 15.2 shows the grades-ratings correlation in the form of a structural model that relates two measures of expected grade to two measures of course and instructor evaluation. The data were obtained in a series of three studies at the University of Washington in the 1993-1994 academic year. These studies used a new rating form (Form 11; see Gillmore & Greenwald, 1994) that added several measures to forms previously in use at the University of Washington, administered in 200 or more courses in each of several academic terms.
FIG. 15.2. Structural model including two measures of expected grade and two measures of evaluative ratings of course and instructor. The three coefficients on each path are standardized values (i.e., on the same -1 to +1 scale as correlation coefficients) shown in left-to-right order for the three data sets. Statistics report major tests of fit for this structural model. Nonsignificant (p > .05) chi-square values indicate satisfactory fit. Chi-square values have an extra degree of freedom (df) when the computational routine added a constraint to avoid a negative variance estimate. 'rmsea' is the root-mean-square error of approximation index of fit that has been described by Browne and Cudeck (1993) and by MacCallum, Browne, and Sugawara (1996). These authors characterize rmsea < .05 as indicating 'close' fit, .05-.08 as 'fair' fit, .08-.10 as 'mediocre' fit, and rmsea > .10 as 'poor' fit. 'p(close fit)' values greater than .05 indicate satisfactory fit.
Although these were university-wide samples of courses that were diverse in subject matter, class size, and academic level, the courses were also self-selected by virtue of instructors having volunteered to use the new rating form. Results from undergraduate courses for which at least 10 students provided ratings responses are summarized in the figure. The positive correlation is measured by the standardized path coefficient (+.45 for the three samples) that links the two latent variables of Expected Grade and Evaluation.
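In equation form, the model of Figure 15.2 is the standard two-indicators-per-factor structure (symbols mine):

\[
\begin{aligned}
g_{\text{abs}} &= \lambda_1\,\eta_G + \varepsilon_1, &\qquad g_{\text{rel}} &= \lambda_2\,\eta_G + \varepsilon_2,\\
r_{\text{course}} &= \lambda_3\,\eta_E + \varepsilon_3, &\qquad r_{\text{instr}} &= \lambda_4\,\eta_E + \varepsilon_4,
\end{aligned}
\]

where \eta_G and \eta_E are the latent Expected Grade and Evaluation variables, the \lambda s are factor loadings, and the \varepsilon s are residuals; the correlation of interest is the standardized path \operatorname{corr}(\eta_G, \eta_E) of about +.45.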
The existence of this grades-ratings correlation of course prompts a suspicion that ratings can be increased by the strategy of increasing grades. That conclusion assumes a causal influence of grades on ratings, but observed correlations such as in Figure 15.2 by no means demand the conclusion of a causal influence of grades on evaluative ratings. Each of the first three of the following five theories explains the grades-ratings correlation in noncausal fashion by hypothesizing that a third variable influences both grades and ratings. The remaining two theories do assume a causal influence of grades on ratings.

1. Teaching Effectiveness Influences Both Grades and Ratings. This is the one theory
that is fully based on the presumed construct validity of student ratings (e.g., McKeachie, 1979, pp. 390-391). The central principle of the teaching effectiveness theory is that strong instructors teach courses in which students both (a) learn much (therefore they earn and deserve high grades), and (b) give appropriately high ratings to the course and instructor. Instructional quality is thus a third variable that explains the grades-ratings correlation in a way that raises no concern about grades having improper influences on ratings.
2. Students' General Academic Motivation Influences Both Grades and Ratings. Compared to unmotivated students, students with strong academic motivation should both (a) do better in their course work and (b) more fully appreciate the efforts of the instructor, possibly even inspiring the instructor to superior performance. Courses that attract highly motivated students should give higher grades (because the students work harder) and should get higher ratings (because the motivated students appreciate both course and instructor). Student motivation has been suggested as the operative third variable in several research investigations of student ratings (e.g., Howard & Maxwell, 1980; Marsh, 1984).
3. Students' Course-Specific Motivation Influences Both Grades and Ratings. This theory differs from the preceding one by supposing that a student's motivation can vary from course to course rather than being a stable characteristic of the student. Because the two motivation theories trace the relation between grades and ratings to a characteristic of students, they appear not to support the teaching effectiveness interpretation of ratings. However, if student motivation is itself credited to the instructor (for example, the instructor either attracts highly motivated students or motivates them once they are in the course), these theories retain the interpretation that ratings measure teaching effectiveness.

4. Students Infer Course Quality and Own Ability From Perceived Grades. Social psychological attribution theories describe how people make inferences about both their own traits and the properties of situations in which they act by observing the outcomes of their actions. Research in the attribution theory tradition shows that favorable outcomes for one's own behavior typically lead to inferences that one has desirable traits, whereas unfavorable
outcomes may lead one to perceive situational obstacles to success. A simple summary of these attributional principles is that people tend to accept credit for desired outcomes while denying responsibility for undesired outcomes (Greenwald, 1980). Applying this principle to student ratings leads to an expectation that high grades will be self-attributed to intelligence and/or diligence, and low grades to poor instruction. Social psychological attribution theory matured after the peak of research activity on student ratings, perhaps explaining why this interpretation has been little mentioned in research on student ratings. Some recent discussions of attribution interpretations appear in papers by Gigliotti and Buchtel (1990) and Theall, Franklin, and Ludlow (1990); see also the recent overview by Feldman (1997).
5. Students Give High Ratings in Exchange for Lenient Grading. The idea that being praised can induce liking for the praiser (especially if the praise is greater than expected) is familiar in social psychology (Aronson & Linder, 1965). The translation of this familiar principle into the ratings context is that the instructor in effect praises the student via a high grade, and the student's appreciation is expressed by providing high ratings. This leniency or grade-satisfaction theory has been a focus of much controversy in past research on validity of student ratings. The leniency interpretation was advocated by researchers who were critical of ratings validity in the 1970s, including those who published demonstrations in natural class settings that grade manipulations affected student ratings (Chacko, 1983; Holmes, 1972; Vasta & Sarmiento, 1979; Worthington & Wong, 1979). Support for the leniency theory dropped sharply in the wake of correlational construct validity research conducted in the late 1970s and early 1980s. Mentions of leniency or grade-satisfaction theories in post-1980 publications appear mostly in the context of asserting that leniency may account for only minor and ignorable influences on student ratings (in addition to McKeachie's [1979] already-quoted comment, see Marsh, 1984, pp. 741, 749; Howard et al., 1985, p. 187; Cashin, 1995, p. 6).
Table 15.1 presents four data patterns that, as a collection, can discriminate among the five theoretical interpretations of the grades-ratings correlation. The grades-ratings correlation also appears as a fifth (but first-listed) pattern in Table 15.1. With the exception of one that was tested only during a single academic term (the third one listed below), the four diagnostic data patterns of Table
15.1 have been corroborated as findings in separate data collections over three or more academic terms in university-wide samples of courses at the University of Washington. As each finding is described, its use to evaluate the five theories is explained.

1. Positive Grades-Ratings Correlations Within Classes. In addition to between-classes grades-ratings correlations as described in Figure 15.2, correlations are also routinely obtained within classes (Stumpf, 1979). In the University of Washington data, the within-classes relationship has been observed very reliably. Because, in the teaching effectiveness theory, the variable that influences both grades and ratings is a constant (the instructor) within any classroom, that theory does not explain the within-classes positive correlation of grades and ratings. By contrast, the two third-variable theories that relate student motivation differences to ratings differences are able to explain the within-classes grades-ratings correlation. That is, within each class, the more highly motivated students may both get higher grades and give higher ratings. Also, of course, the attribution and leniency theories very directly explain why students who get higher grades in a course should evaluate that course more positively than others.

2. Stronger Grades-Ratings Correlation With Relative (Than Absolute) Measure of Expected Grade. Figure 15.2's structural model included two measures of
expected grade: absolute and relative expected grades. The absolute measure used class means on the 0.0 (E, or fail) to 4.0 (A) grading system in use at the University of Washington. The relative measure used class medians on an item that asked each student to report the relation of the grade expected in the rated course to the student's average grades in other courses. The stronger weight of the relative measure on Figure 15.2's Expected Grade latent variable reflects the finding that the ratings relationship was stronger for the relative-grade measure than for the absolute-grade measure. In analyses that predicted ratings simultaneously from both of the grade measures, the relative grade measure yielded a substantial gain in percent of ratings variance explained, over and above that explained by the absolute expected grade measure. By contrast, the absolute grade measure accounted for very little beyond what was explained by the relative grade measure. The superiority of the relative grade measure was evident in both between-course and within-course analyses. The use of a relative grade measure was a novel feature of the University of Washington research. Consequently, this finding (that the grades-ratings correlation is stronger for the relative-grade measure) was previously unreported in the student ratings research literature.
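To make the logic of this incremental-variance comparison concrete, here is a minimal sketch in Python (not the authors' code; the variable names and the simulated data are hypothetical) of the hierarchical regression that asks how much ratings variance each expected-grade measure explains beyond the other:

import numpy as np
import statsmodels.api as sm

def r_squared(y, X):
    """R^2 from an OLS regression of y on X (with intercept)."""
    return sm.OLS(y, sm.add_constant(X)).fit().rsquared

rng = np.random.default_rng(0)
n = 200                                              # courses (or students, within classes)
absolute = rng.normal(3.0, 0.4, n)                   # mean expected grade, 0.0-4.0 scale
relative = 0.6 * absolute + rng.normal(0, 0.3, n)    # expected grade vs. one's own average
rating = 0.8 * relative + 0.1 * absolute + rng.normal(0, 0.5, n)

r2_abs = r_squared(rating, absolute)
r2_rel = r_squared(rating, relative)
r2_both = r_squared(rating, np.column_stack([absolute, relative]))

# On data patterned like those described above, the first gain is substantial
# and the second is near zero.
print(f"gain of relative over absolute: {r2_both - r2_abs:.3f}")
print(f"gain of absolute over relative: {r2_both - r2_rel:.3f}")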
The teaching effectiveness interpretation does not explain any within-classes grades-ratings correlation, let alone the greater correlation for the relative-grade than the absolute-grade measure. The general academic motivation theory, which ties ratings to an assumed stable level of motivation, also has trouble with the superiority of the relative grade measure, unless it is (very implausibly) assumed that highly motivated students always report that they expect grades above their average grade. By contrast, the course-specific motivation theory and the attribution and leniency theories readily explain why ratings associated with a specific course are higher when the grade in that course is relatively high for the student.
3. Grade-Related Halo Effects in Ratings of Peripheral Course Characteristics. In Winter 1994, approximately 100 instructors at the University of Washington agreed to add a small set of items to their regular rating forms. These included three judgments that, a priori, were expected to be at most peripherally related to quality of instruction. These three items sought students' judgments of (a) legibility of the instructor's writing, (b) audibility of the instructor's voice, and (c) quality of classroom facilities to aid instruction (such as an overhead projector). There was no evidence of a grades-ratings relationship in the between-courses analyses of these responses, consistent with the judgment that these items are peripheral to instructional quality. However, within-courses analyses showed reliable relationships of expected grades to these items (Greenwald & Gillmore, 1997a). Although these within-courses relationships were not large, they were nevertheless very stable statistically. Because all students in the same classroom saw the same instructor's handwriting, heard the same instructor's voice, and used the same classroom teaching aids, the observation of these correlations is remarkable. The content of the items on which the effects occurred identifies them as grade-related halo effects that extend even beyond instructional quality ratings.3
3Previous findings that front-of-class seating is associated with higher grades (e.g., Knowles, 1982) provide the basis for a possible student-motivation interpretation of the within-courses relationships of expected grades to ratings of instructor voice and legibility, although not the relationship to ratings of classroom facilities. The author thanks Lloyd K. Stires (personal communication, October 26, 1995) for noting the relevance of the classroom seating variable to these data.
The third-variable theories have difficulty with these grade-related halo effects. For the teaching effectiveness theory, if there were grade effects on the legibility, audibility, and classroom facilities items, those effects should appear in between-classes analyses (but they do not).
4. Negative Correlation of Expected Grades With Workload (Students Study More in Strictly Graded Classes). It seems reasonable to expect that students should work harder in courses in which they receive high grades than in ones in which they receive low grades. This reasonable expectation rests on two assumptions: (a) that grades provide an indicator of student accomplishment, and (b) that students work harder in courses in which they learn much than in courses in which they learn little. From these assumptions it follows that students should tend to work hardest in courses in which they expect high grades. Contrary to this expectation, expected grades were negatively related to reported hours of work in courses, as shown in the structural model of Figure 15.3.

Several of the theories imply nonnegative relations between grades and workload. This is most readily seen for the motivation theories. If students earn high grades by working hard, then a positive relation
FIG. 15.3. Structural equation model replicated on three data sets from the 1993-1994 academic year. The "Challenge" and "Hours Worked per Credit" measures are based, respectively, on Items 20 and 26 of the University of Washington's Form X (see Gillmore & Greenwald, 1994). The negative between-course relationship between Expected Grade and Workload is measured by the standardized coefficients (average = -.45) for the path linking their latent variables. See additional explanatory information in Figure 15.2's caption.
between expected grades and workload is clearly expected. For the teaching effectiveness theory, it might be assumed that effective teachers manage to get their students to do more work and, thus, if high grades are explained by effective teaching, a positive relation between expected grades and workload is expected. If, on the other hand, it is assumed that effective teachers are just more efficient in imparting knowledge to students, then expected grades should be unrelated (but not negatively related) to workload. The attribution theory is equivocal in regard to the workload relationship because it is possible either for students to attribute a high grade to hard work or for them to attribute hard work to a low expected grade. Only the leniency theory readily explains the observed negative
relationship. The explanation is that strict-grading instructors induce students to work hard in order to avoid very low grades.4
Summary Evaluation of the Five Theories. Each theory predicts a different subset of the four diagnostic data patterns of Table 15.1, ranging from the teaching effectiveness theory predicting none of them to the leniency theory predicting all (see Table 15.1). Each of the three third-variable theories fails to explain at least two of the four findings. The two direct-influence (grades influence ratings) theories fare best as a class and, of these two, the leniency theory is favored by virtue of being the only theory to explain the negative relation between grades and workload.
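The discriminating logic of this summary can be made explicit with a small tabulation. The following Python sketch encodes the verbal conclusions above as a theory-by-pattern prediction matrix; the True/False cells are one reading of the text, not a reproduction of Table 15.1, and the None entry marks the pattern on which attribution theory is described as equivocal:

patterns = ["within-class r(grade, rating) > 0",
            "relative grade beats absolute grade",
            "grade-related halo on peripheral items",
            "negative r(expected grade, workload)"]

predictions = {
    "teaching effectiveness":      [False, False, False, False],
    "general academic motivation": [True,  False, False, False],
    "course-specific motivation":  [True,  True,  False, False],
    "attribution":                 [True,  True,  True,  None],
    "leniency":                    [True,  True,  True,  True],
}

observed = [True, True, True, True]  # all four diagnostic patterns were found

for theory, preds in predictions.items():
    explained = sum(p is True for p, o in zip(preds, observed) if o)
    print(f"{theory:28s} explains {explained}/4 observed patterns")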
The findings described previously, considered in the context of much previous research on student ratings, justify the following conclusions:

1. Instructors' Grades Influence Students' Ratings. The conclusion that grades influence ratings appears to be decisively established on the combined basis of (a) experimental studies that show impact of grades on ratings, (b) replicable correlational data patterns that are explained only by theories that suppose a causal influence of grades on ratings, and (c) the existence of well-established theories (attribution and reciprocal attraction) that provide plausible social psychological interpretations of this causal influence. The evidence certainly does not warrant the conclusion that giving high grades is, by itself, sufficient to assure high ratings. Nevertheless, it does support the conclusion that, if an instructor varies nothing between two course offerings other than grading policy, higher ratings should be obtained in the more leniently graded section.
2. With Statistical Adjustment, Student Ratings May Be Very Useful. Their failing of discriminant validity (i.e., their contamination by grading leniency) notwithstanding, student ratings have repeatedly been shown to have modest convergent validity. In other words, at the same time that student ratings provide a distorted measure of instructional quality, they also have a moderate level of valid correlation with instructional quality. The valid
4The analyses of competing interpretations of the relationship between expected grades and workload have been described in more detail by Greenwald and Gillmore (1997b, pp. 743-744).
component of ratings may be enhanced to the extent that it is possible to statistically adjust for the sources of discriminant invalidity.
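One simple way to perform such an adjustment, sketched below in Python under assumed variable names (an illustration of residualizing, not the authors' exact procedure), is to regress course-mean ratings on the relative expected-grade measure and keep the residuals as leniency-adjusted ratings:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 150
relative_grade = rng.normal(0.0, 0.5, n)   # expected grade relative to student's average
true_quality = rng.normal(0.0, 1.0, n)     # unobserved instructional quality
rating = true_quality + 0.7 * relative_grade + rng.normal(0, 0.4, n)

fit = sm.OLS(rating, sm.add_constant(relative_grade)).fit()
adjusted_rating = fit.resid                # leniency-adjusted ratings

# Removing the grade-related component leaves ratings slightly more
# correlated with the (here simulated) quality variable.
print(f"r(rating, quality)          = {np.corrcoef(rating, true_quality)[0, 1]:.2f}")
print(f"r(adjusted rating, quality) = {np.corrcoef(adjusted_rating, true_quality)[0, 1]:.2f}")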
3. Workload Measures Are Needed. The consistent finding of a negative relationship between course grades and workload (illustrated in Figure 15.3) is disturbing. Although this relationship may exist at many colleges and universities, it has never become a focus of research attention, perhaps because workload measures are not routinely included in student ratings. The inclusion of workload estimates in course evaluation forms can assure that this important aspect of differences among courses does not continue to escape attention.
The demonstration of contaminating psychological processes in student ratings might be interpreted as sufficient basis for abandoning the whole enterprise of using student ratings. However, there are three good reasons to conclude just the reverse: that even more attention should be paid to student ratings.
First, in many cases there is no practical alternative method for evaluating instruction. Although expert appraisals and standardized achievement tests might, in principle, provide more valid assessments, unfortunately both of those alternative methods are considerably more costly than student ratings. Their present very limited use probably stands as an indicator of their relative impracticality.

Second, there is good evidence for convergent validity of student ratings. Although student ratings are overlaid with contaminating influences, they nevertheless also contain useful information. Statistical adjustments can make that information more usable.

Third, even in the worst case, student ratings provide useful information in much the same way that an assessment of bedside manner is useful in evaluating a physician. The assessment of bedside manner doesn't measure the physician's success in preventing or curing illness, but it does capture something that may predict a patient's willingness to adhere to treatments and to return for future treatment. Similarly,
knowledge of how much a teacher is liked should provide information that can predict a student's willingness to do assigned work and to return for further course work from the same instructor. In summary, there very likely is an instructional quality baby in with the bathwater of grades-ratings correlations and other possible contaminants of ratings. It seems much, much wiser to give that baby a bath, to clean it up and make it presentable, than to abandon the baby in the process of discarding the bathwater.
Higher education relies heavily on student ratings to evaluate faculty teaching, largely because the alternatives (expert peer appraisals or objective performance criteria) are costly or unavailable. Because student ratings are crucial not only to improving instruction, but also in making or breaking faculty careers, it is important to assure that they provide valid indications of instructional quality. Problematically, findings summarized in this chapter show that student ratings suffer from artifacts that lead to underestimation of teaching ability for instructors who grade strictly and overestimates for those who grade leniently. Some likely system impacts of this distortion of ratings are to guide (a) instructors toward lenient grading, and (b) students toward nonchallenging courses. The bright side of this picture is that the usefulness of student ratings can be improved statistically.
Portions of this chapter were first presented as an address for the Donald T. Campbell Award from the Society of Personality and Social Psychology, presented at the 1995 meetings of the American Psychological Association in New York. The research on which this chapter is based was facilitated by the Office of Educational Assessment at the University of Washington, and was conducted in collaboration with Gerald M. Gillmore, who is co-author of some of the more detailed reports of this collaborative research (Greenwald, 1997; Greenwald & Gillmore, 1997a, 1997b). Some material from those other reports has been reproduced in this chapter with permission of the American Psychological Association. Additional support was provided by a grant from the National Science Foundation and by Grant 01533 from the National Institute of Mental Health. For comments on various previous drafts of related material, the author thanks Robert D. Abbott, Philip C. Abrami, Kenneth A. Feldman, Gerald M. Gillmore, Joe Horn, George S. Howard, Herbert W. Marsh, Scott E. Maxwell, Jeremy D. Mayer, Robert S. ..., Lloyd K. Stires, and John E. Stone. Correspondence may be addressed to the author at Department of Psychology, Box 351525, University of Washington, Seattle, WA 98195-1525, or by electronic mail at ...washington.edu.
Abrami, P. C., Cohen, P. A., & d'Apollonia, S. (1988). Implementation problems in meta-analysis. Review of Educational Research, 58, 151-179.
Aronson, E., & Linder, D. E. (1965). Gain and loss of esteem as determinants of interpersonal attractiveness. Journal of Experimental Social Psychology, 1, 156-171.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA: Sage.
Cashin, W. E. (1995). Student ratings of teaching: The research revisited (IDEA Paper No. 32). Manhattan, KS: Kansas State University, Center for Faculty Evaluation and Development.
Chacko, T. I. (1983). Student ratings of instruction: A function of grading standards. Educational Research Quarterly, 8(2), 19-25.
Feldman, K. A. (1997). Identifying exemplary teachers and teaching: Evidence from student ratings. In R. P. Perry & J. C. Smart (Eds.), Effective teaching in higher education: Research and practice (pp. 368-395). New York: Agathon Press.
Freedman, R. D., Stumpf, S. A., & Aguanno, J. C. (1979). Validity of the Course-Faculty Instrument (CFI): Intrinsic and extrinsic variables. Educational and Psychological Measurement, 39, 153-158.
Gigliotti, R. J., & Buchtel, F. S. (1990). Attributional bias and course evaluations. Journal of Educational Psychology, 82, 341-351.
Gillmore, G. M., & Greenwald, A. G. (1994, April). The effects of course demands and grading leniency on student ratings of instruction. Paper presented at the meetings of the American Educational Research Association, Orlando, FL.
Greenwald, A. G. (1980). The totalitarian ego: Fabrication and revision of personal history. American Psychologist, 35, 603-618.
Greenwald, A. G. (1997). Validity concerns and usefulness of student ratings. American Psychologist, 52, 1182-1186.
Greenwald, A. G., & Gillmore, G. M. (1997a). Grading leniency is a removable contaminant of student ratings. American Psychologist, 52, 1209-1217.
Greenwald, A. G., & Gillmore, G. M. (1997b). No pain, no gain? The importance of measuring course workload in student ratings of instruction. Journal of Educational Psychology, 89, 743-751.
Holmes, D. S. (1972). Effects of grades and disconfirmed grade expectancies on students' evaluations of their instructor. Journal of Educational Psychology, 63, 130-133.
Howard, G. S., Conway, C. G., & Maxwell, S. E. (1985). Construct validity of measures of college teaching effectiveness. Journal of Educational Psychology, 77, 187-196.
Howard, G. S., & Maxwell, S. E. (1980). Correlation between student satisfaction and grades: A case of mistaken causation? Journal of Educational Psychology, 72, 810-820.
Knowles, E. S. (1982). A comment on the study of classroom ecology: A lament for the good old days. Personality and Social Psychology Bulletin, 8, 357-361.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1, 130-149.
Marsh, H. W. (1980). The influence of student, course, and instructor characteristics on evaluations of university teaching. American Educational Research Journal, 17, 219-237.
Marsh, H. W. (1982). Validity of students' evaluations of college teaching: A multitrait-multimethod analysis. Journal of Educational Psychology, 74, 264-279.
Marsh, H. W. (1984). Students' evaluations of university teaching: Dimensionality, reliability, validity, potential biases, and utility. Journal of Educational Psychology, 76, 707-754.
Marsh, H. W., & Dunkin, M. J. (1992). Students' evaluations of university teaching: A multidimensional perspective. Higher Education: Handbook of Theory and Research, 8, 143-233.
McKeachie, W. J. (1979). Student ratings of faculty: A reprise. Academe, 65, 384-397.
Powell, R. W. (1977). Grades, learning, and student evaluation of instruction. Research in Higher Education, 7, 193-205.
Snyder, C. R., & Clair, M. (1976). Effects of expected and obtained grades on teacher evaluation and attribution of performance. Journal of Educational Psychology, 68, 75-82.
Stumpf, S. A., & Freedman, R. D. (1979). Expected grade covariation with student ratings of instruction: Individual versus class effects. Journal of Educational Psychology, 71, 293-302.
Theall, M., Franklin, J., & Ludlow, L. (1990). Attributions and retributions: Student ratings and the perceived causes of performance. Instructional Evaluation, 11, 12-17.
Vasta, R., & Sarmiento, R. F. (1979). Liberal grading improves evaluations but not performance. Journal of Educational Psychology, 71, 207-211.
Worthington, A. G., & Wong, P. T. P. (1979). Effects of earned and assigned grades on student evaluations of an instructor. Journal of Educational Psychology, 71, 764-775.
...have to do with construct validity as a tool of thought: with how well the concept serves us intellectually and with what it does to the way we frame our research questions. I do not claim to be an expert on construct validity per se, at least not in the company of such experts as those assembled here; I speak instead as one of the many people who are broadly interested in such questions and who have given them some thought over the years on several occasions. Although my aim is to state my views as clearly as I can, and for reasons soon to be explained, I take a very circuitous approach to my two concerns, devoting almost the entire first half of my chapter to the telling of two personal anecdotes that bear illustratively but only tangentially at best on the commentary contained in the second half. The fact that Sam Messick figures directly into one of my stories and was indirectly involved with the other helps to explain why I have chosen to tell them on this special occasion. I trust that roundabout path will bring me ultimately to the central topic pointed to above.
* * * * *

Sam and I shared a fellowship year at the Center for Advanced Study in the Behavioral Sciences. Our stay at the Center began in the
fall of 1962, just about the time of the Cuban Missile crisis, when John F. Kennedy and Nikita Khrushchev stared each other down, eyeball to eyeball, while the world trembled in fear of the consequence. For a reason soon to be made known, I had occasion to fly from California to the east coast during those tense days, a trip made with some trepidation I must say, and one whose attendant feelings I still vividly recall. The international crisis aside, it was a great year to be at the Center. In fact, Sam and I to this day boast of it having been one of the greatest years in the Center's history, an audacious claim, to say the least. Whether or not our claim is justified, the year as a whole certainly felt like a once-in-a-lifetime experience for the two of us. We were both young enough at the time to be properly impressed by the luminosity of that year's crop of senior fellows, which included Meyer Schapiro, Michael Polanyi, Erik Erikson, Carl Rogers, Renato Poggioli, Fred Mosteller, Lancelot Law Whyte, Fred Harbison, and others. There was also an impressive collection of younger talent that included, among others, Wayne Holtzman, Michael Scriven, Alberta Schaller, Irven DeVore, and Phyllis Jay. Even the casual visitors in and out of the Center that year were often noteworthy enough to make us all sit up and take notice. Arnold Toynbee graced us with a one-day seminar, as did Alan Watts, whose writings on Zen were then very popular. More impressive company would be difficult to imagine, I hardly need say. As new acquaintances Sam and I hit it off from the start. We found that we had a lot in common, both professionally and non-professionally. On the professional side, we were both interested in problems of psychological assessment, particularly the assessment of creativity. I had just co-authored a book with J. W. Getzels entitled Creativity and Intelligence: Explorations with Gifted Students (Getzels & Jackson, 1962). Among Sam's many interests at that time were correlates of intellectual performance associated with personality variables, especially those having to do with what was called cognitive style. The closeness of those interests led us to spend a lot of time discussing how creativity was currently being assessed and thinking about how those assessments might be improved. We talked about extending them to include aspects of creative performance that went far beyond the then-popular notions of divergent thinking and verbal fluency. From those discussions came a jointly authored paper entitled "The person, the product, and the response: Conceptual problems in the assessment of creativity," published in the Journal of Personality (Jackson & Messick, 1965).
What Sam and I sought to do in that paper was to postulate a set of criteria by which a creative product of almost any kind might be judged. We next asked by what standards those judgments might be made and, furthermore, what kinds of responses those properties of the object might elicit from the beholder. We sought also to name the personal qualities and cognitive styles of the individuals associated with each standard and each kind of response. Our effort was ambitious, to say the least. In retrospect I would also call it naive. Neither of us knew much about any of the arts, at least not enough to speak authoritatively about them. Though we drew our examples from diverse sources, including poetry, music, and painting, they by and large lacked richness and subtlety. So did the argument we were trying to make. It is easy to translate what we were trying to do in our paper into the language of construct validity. Essentially, we were working on the problem of construct under-representation as it applies to creativity. By drawing on our own knowledge of poems, music, novels, and so forth, we sought to identify aspects of creativity that were not then being tested. We also tried to give some order to the qualities we identified. We did so by positing a developmental hierarchy among the yet-to-be-tested components of the construct. In short, our work involved us in what might properly be called the structure of the construct of creativity. Another way of describing what we were about will be introduced later.

During our work on the paper my own enthusiasm for continuing to do research on creativity suffered a serious setback. It did so as a result of the trip to the east coast that I have already mentioned. The reason for my trip and what I found when I got there are both a bit complicated, so I must explain each in some detail.

In the research described in our book Getzels and I had focused on two groups of students identified from among all of the students, from the sixth grade forward, then attending the University of Chicago's Laboratory School. One group had performed very well on several tests of creativity, so-called, but not so outstandingly on a standard test of intelligence; the other group had a pattern of scores exactly the reverse. We sought to compare those two groups on a wide variety of dependent variables having chiefly to do with their performances in school: variables like teacher and peer ratings, attitudes toward school and family, career aspirations, and so forth. The five tests comprising our "creativity" battery had either been modeled after or were exact copies of tests employed in a wide variety of other investigations of creativity by authors such as J. P. Guilford, Raymond B. Cattell, E. Paul Torrance, and so on. Here is how we described that set of five tests.
In general, our tests of creativity involved the ability to deal inventively with verbal and numerical symbol systems and with object-space relations. What most of these tests had in common was that the score depended not on a single predetermined correct response ... but on the number, novelty, and variety of adaptive responses to a given stimulus task. (Getzels & Jackson, 1962)
Though some were eager to use them, we emphasized in our book that employing the tests for selection purposes was very premature and was not something we endorsed or recommended. But we did not take the additional step of withholding the content of each test, for we wanted the tests to be open to inspection and use by other researchers. As it turned out, an early report of their premature use in a non-research setting came to me shortly after my arrival at the Center. I received a call from the superintendent of schools in a suburb of Boston (let's call it Easton). He described an "educational experiment," as he called it, currently underway in the Easton schools. Near the close of the previous school year the district's Director of Psychological Services had supervised the administration of our five creativity tests to all of the fifth-grade students, scored according to the instructions supplied in our book. Twenty or so of the highest scorers, the town's 25 "most creative" students, had been placed for the following year in a special classroom with a specially chosen teacher, and I was being invited to come see the experiment for myself. The invitation aroused in me a certain
uneasiness. I also reasoned that the students themselves were probably not suffering from having been so identified and might even be benefitting from their new educational arrangement. In the end, my curiosity turned out to be stronger than my sense of disapproval. So I suppressed my misgivings and accepted the superintendent's invitation. I looked forward to what my journey to Easton held in store.

My visit there covered three school days plus a day of traveling at each end. I spent most of the time there sitting in the back of the experimental classroom, watching the town's 25 "most creative" sixth graders and their "most creative" teacher go about their business. What I saw turned out to be interesting, all right, but not at all in the way I had anticipated.

During my first morning's visit I was chiefly struck by how ordinary everything looked. The physical classroom, the students, the teacher, all looked pretty much like their counterparts back in the Laboratory School in Chicago, which, in turn, had looked to me pretty much like the classrooms I had inhabited as a child. I had no particular expectations about what I would witness in Easton but somehow I expected it to be different from what it was. The room was bright and cheerful with plenty of colorful charts and samples of student artwork on display. The students seemed eager to participate in whatever activity the teacher introduced. They raised their hands when questioned. They huddled together in small groups to work on projects of one kind or another. They read silently during free reading time. They wrote reports. They moved about the room and chatted comfortably with each other and with their teacher during study periods. And so on. The curriculum too contained nothing out of the ordinary that I could see; just the standard school subjects of math, social studies, language arts, science, and so forth. Long before my first full day of observation had ended I found myself stifling yawns and beginning to wonder whether or not the trip would turn out to be worthwhile.

Some questions pertaining to construct validity did enter my thinking almost at once on my first day. I worried first of all about how well the testing had been handled. Had the tests been properly administered? Were they accurately scored? (I was not then familiar with the phrase construct-irrelevant variance, but such is the term I would now apply to the subject of that set of worries.) I also wondered how other differences in the way the students had been selected might have affected the outcome of the selection process. The students Getzels and I had called our high creatives had not been equally outstanding on tests of intelligence. They were, one might say, distinctively higher performers on the battery of creativity tests. The school district in Easton had paid no attention to intelligence test scores at all in making
... criterion is warranted. (Cronbach, 1989, p. 151)
...the students I was watching. I was not at all sure what I was looking for and it turns out that I never did come to a clear sense of mission rivaling that of my prior test-giving and data-gathering past. (More on that too in due course.) But my new tourist-like gawking seemed like the right thing to be doing and elementary classrooms seemed like the right place to do it.

I am fast approaching the end of my two anecdotal accounts but I cannot resist adding a brief extension to the story I have just told. It ties to one of the things I later want to say about construct validity. In retrospect, I find it amusing, although I did not find it so at the time. N. L. Gage had just moved from the University of Illinois to Stanford that year. As we were both newcomers to the west coast and had known each other casually before then, we decided to get together on a fairly regular basis to compare notes on our new surroundings and to share some of our common interests. At one of our lunchtime meetings shortly after my return from Easton I announced to Nate my plan to visit some local classrooms just to look around and see what was going on. He stared at me as if I had lost my mind. "Do you know what you're doing, Phil?" he asked incredulously. "You're studying the oil lamp in the age of the electric light!" What he meant, of course, was that any teaching I might observe was soon to be outdated and replaced by an entirely new mode of instruction that would be based on "scientific" research. He and his students at Stanford were then busily engaged in exactly the kind of research that they confidently believed would one day form the basis of a new science of teaching. Numerous others were similarly engaged at universities around the country and throughout the world. Although Nate did not come right out and say so at the time, I have no doubt that he also believed that what I was proposing to do was unworthy of being called "research," much less "science." He might then have conceded (as he later did in some of his writings) that the kind of unstructured observations that I was about to undertake could possibly lead to science. They might, in other words, become the source of hypotheses that could later be tested but this meant that they were no more than exploratory in nature, something to engage in very briefly before getting down to brass tacks, which is to say: the serious business of doing research. The latter, as both Gage and I would have readily explained at the time if asked, entailed undertaking a carefully designed study, gathering quantifiable data, analyzing it statistically, presenting its results in tabular or graphic form, and so on.

I confess that at the time Gage's reaction aroused worries that even then had not been very far beneath the surface of my thoughts. I had worked hard learning how to design and conduct empirical studies. I was moderately pleased with what I had already managed to accomplish along those lines. I was
also as confident as was Gage that Science with a capital S offered the only true route to knowledge. I further took pride in being a psychologist, particularly one who did research. The prospect of giving up that identification was not at all attractive, in fact it made me very uneasy, yet that was the direction in which my proposed plan pointed, as Gage's reaction to it clearly implied. I will have something further to say about the source of my uneasiness in due course.

I now want to turn from my two anecdotal accounts to a set of remarks about construct validity as a tool of thought. All that has been said to this point will continue to reverberate as background, I trust, as I move on to what follows. My concerns are twofold. They have to do in part with what might be called the ontological status of constructs, with what the word construct stands for and with how we habitually speak of whatever the term represents. They also have to do with the validation process qua process and with whether and why we should insist on it being perceived as scientific.
Do psychological constructs like intelligence or anxiety or creativity refer to entities that truly exist or are they fictional devices, more theoretical than real? The answer to that question, according to philosopher Stephen Norris, depends on one's philosophical pedigree (Norris, 1983). Those who call themselves logical positivists, for example (a dying breed, it would seem), answer in one way, whereas those who think of themselves as realists (a term of self-reference also not in wide use, it again would seem) answer in another. The former treat constructs as fictions and thereby tend to dismiss them; the latter see them as real and thereby attach great importance to them. Which one is correct? The issue, according to Norris's account, remains open to debate, although not as tendentiously perhaps as was once the case. Moreover, opposing views of the matter are by no means limited to philosophers who attach themselves to one or another school of thought. They also include, among others: psychologists, educational researchers, and test-makers. It is among the latter, Norris tells us, that we find some of the most serious forms of ignorance and confusion with respect to this important set of questions. The chief difficulty, we are told, lies in the inconsistency with which test-makers in particular view and discuss the ontological status of psychological constructs. Sometimes they treat them one way (as real) and sometimes another (as fictional). Some, it would seem, want to have it both ways at once. Still others appear not to care very much one way or the other
...them as real or unreal because the fact of the matter has not yet been established. Messick's saying that the reality of constructs like "childhood" remains "debatable" would seem to leave room for both views. A possibility occurs to me, and it is one on which I want to dwell. It is that such inconsistencies as occur, if that is what to call them, reflect neither intellectual laxity nor confusion on the part of test-makers but arise simply because that is the
way ordinary language works. As an illustration of our habitual mixed-up-ness when it comes to such terms, consider how we speak of one of them. Does the term simply stand for how certain people behave, or does it name relationships between the person and the circumstances of action? In a word, we "objectify" the concept.
..."to clear up such confusions and to answer such questions once and for all?" Why not allow such investigations to determine what is real and unreal? Messick obliquely addresses the attraction of such a proposed policy in the distinction he draws between observed consistency (of a pattern of test performance, let us say) and inferred construct (Messick, 1989b, p. 29). Presumably, the greater the observed consistency, the more confident we become in believing that what is named by the construct refers to something real (i.e., the more willing we are to reify it). Yet, as he elsewhere points out, such observed consistencies are almost always open to more than one interpretation. Moreover, alternative interpretations need not invariably devolve into a life-or-death struggle to determine which one reigns supreme. When it comes to establishing fruitful ways of thinking about particular constructs, even when one of them posits something real and the other does not, the law of the excluded middle, it would seem, need not always apply.

Lest the previous talk about the laissez-faire character of ordinary speech come across as sounding too permissive to apply within the framework of test validation, let me return briefly to Norris' warning about the risks attendant on test-makers being inconsistent in how they speak about the ontological status of psychological constructs. His worry, we will recall, was that testing would ultimately suffer the negative effects of such inconsistencies. As much as I disagree with the extremely high value Norris attached to the virtue of consistency in such matters, I acknowledge that he had a point about the risks involved. How we choose to speak of this construct or that often does have consequences that need to be taken into account on penalty of doing harm, albeit unwittingly. I would agree with that. It also matters to whom we speak in particular ways. Moreover, the risks involved in different ways of speaking are commonly multiplied by the authority of the speaker. Those who present themselves as experts and who speak in the name of science or research do so from positions of authority that elevate the significance of their words far above that of the ordinary speaker. That being so, they, above all, need be sensitive to what they say and how they say it, as well as to whom they say it. This warning is in accord with the spirit of Norris's admonition, or so it seems to me.

While thinking about the potential consequences of different kinds of talk, I recalled Dewey's observation about our ways of speaking of intelligence. In Democracy and Education he says:
Construct Validlty and the Language 311 that every individual should have opportunitiesto employ his o activities that havemeaning. Pewey, 1916, p. 172)
m powers in
Dewey went on to urge teachers (and others) to think of intelligence as being more like an adjective or adverb than a noun. By so doing, he believed, teachers in particular would be more inclined than they otherwise might be to ask how persons may be encouraged to behave more intelligently in particular circumstances and with reference to particular kinds of materials and subject matter. In making those suggestions, Dewey was not advocating that teachers be kept in the dark, so to speak, about well-established facts having to do with the relative stability of performances on intelligence tests or anything else of that nature. He was, however, questioning the wisdom of certain practices, such as supplying teachers with each student's score on such a test. He also by indirection was calling into question the kind of talk about intelligence that ascribes to it greater causal potency than the facts allow. We do not have to share his estimate of the risks involved in such practices and such talk to understand what he thought those risks to be. He feared that misinformation and even accurate information, improperly communicated, might inadvertently contribute to a decline in teachers' efforts to do the best they can for each of their students.

The possibility of untoward consequences being associated with certain ways of talking about psychological constructs calls to mind Messick's many discussions of what he calls the consequential basis of validity, which refers to the social consequences of test use and interpretation (Messick, 1989a, 1989b, 1995). Would Dewey's worries about how teachers might make use of the construct of intelligence fit comfortably within Messick's category of social consequences? I see no reason why not. Indeed, they seem to me to provide an almost classic instance of the kind of concerns with which today's enlightened test-makers are increasingly occupied. Following Dewey's logic, might it be that, in general, adjectival and adverbial ways of referencing psychological constructs are better suited to the kinds of discussions that go on within educational contexts than are the more noun-referenced discussions that typify so many psychology textbooks? I would not support such a conclusion without it having been given a lot of thought and investigation but I do find it to be an intriguing idea, all the same.

This brings me to the word construct, which conveniently can serve as either a noun or a verb, depending on its use. "Construct connotes construction and artifice," Jane Loevinger declared, back in the days when the term's meaning was being freshly explored by psychologists (Loevinger, 1957). Indeed, there may be ...
As part of their effort to make test use and interpretation scientifically respectable, Cronbach and Meehl included within their paper (in a section subtitled "The Logic of Test Use") a lot of talk about "the interlocking system of laws which constitute a theory" as a nomological network (Cronbach & Meehl, 1955, p. 290). In brief, such a net was said to consist of laws in which observables were tied to theoretical constructs, either statistically or deterministically. After explaining how such a nomological network might be constructed, a process that entails the application of scientific methodology, Cronbach and Meehl conclude, "The preceding guide rules [sic] should reassure the 'tough-minded' who fear that allowing construct validation opens the door to nonconfirmable test claims" (Cronbach & Meehl, 1955, p. 291).

Things did not turn out quite as Cronbach and Meehl had hoped. Not only were not all of the tough-minded reassured by the article's appeal to the criterion of scientific methodology, but, ironically, at least one such tough-minded critic, Harold Bechtoldt of the University of Iowa, was quick to accuse the article's authors of themselves being nonscientific. Bechtoldt wrote:

The renaming of the process of building a theory of behavior by the new term 'construct validity' contributes nothing to the understanding of the process nor to the usefulness of the concepts. The introduction into discussions of psychological theorizing of the aspects of construct validity [discussed in this article] ... creates, at best, unnecessary confusion and, at worst, a nonempirical, nonscientific approach to the study of behavior. (Bechtoldt, 1959, p. 628)

The vituperative tone of Bechtoldt's accusation is indicative of the passions that could be easily aroused in those days by any perceived threat to the scientific status of psychology. Looking back on his jointly written paper some 30 years later, Cronbach sounded apologetic and almost embarrassed by its scientistic flavor. He said,

Paragraphs on the network and on links between theoretical notions and observables were added to the CM [Cronbach & Meehl] paper. They bolstered a virtuous claim that CV [construct validity] was in line with philosophy of science, and not a nostrum brewed up hastily to relieve psychology's pains. (p. 159, emphasis added)
Construct Validity and the Language 315 He then goes on: Still it was~ r e ~ e to n ~dress ~ o up ~ our s at^^ scietzce in positivist l ~ ~ a ~ e ; to say a ~ that~ an construct g not part of a nomological network and itwas s e ~ ~ ~ is not s c i e n ~ ~ c a (p. ~ 159,aemphasis d ~ ~ added) s ~ ~ ~ .
Cronbach's candor is unsparing and exemplary. Bolstering "a virtuous claim," being "pretentious" and "self-defeating" on behalf of an "immature science": those are harsh things to say about one's prior behavior, even from the distance of several decades. Yet, although applauding his forthrightness, we must not ignore the question of why he and Meehl might have behaved that way in the first place. A piece of the answer surely must lie in the aspiration to be tough-minded and to be seen as being so, which, at least back then, translated into the goal of being scientific and of doing science. Moreover, that Past, to some degree, remains our Present. No psychologist, then or now, can remain completely indifferent to such aspirations. To do so would cast doubt on his or her entitlement to membership within the community of psychologists.

It was certainly the fear of that happening that made me wary back in 1962 when I turned from giving paper-and-pencil tests to more informal and naturalistic methods of data-gathering. Was I not giving up science in the process? That was the question I asked myself at the time. I feared I was. I knew that not every psychologist crunched numbers and worked with quantitative data. In my heart of hearts, however, I harbored little respect for those psychologists who had nothing but words at their command. That lack of respect was widely shared by the majority of the faculty and students with whom I studied, I must say, and remained evident among my colleagues and associates at Chicago and elsewhere.

In his 1989 overview of the history of construct validity, Cronbach clearly rues the pretensions associated with his earlier effort to pacify the tough-minded and to keep psychology scientifically respectable. Yet if one reads his 1989 paper carefully one can still discern, I believe, traces of the attitudes and biases that partially motivated the earlier work. In 1989 he still spoke almost wistfully of "the community of pure scientists" and of "the very long enterprise of pure science" which lives "for the day when truth becomes crystal clear" (p. 163). He also twice uses the term jawboning to refer to the test interpretations of those who have no hard data to back their claims. "The good news," Cronbach told us in 1989, "is that today's [test] manuals rarely flood users with jawboning speculative defenses." The innuendo communicated by
...important set of ideas and practices.
Bechtoldt, H. P. (1959). Construct validity: A critique. American Psychologist, 14(10), 619-629.
Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147-171). Urbana: University of Illinois Press.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281-302.
Dewey, J. (1916). Democracy and education. New York: The Free Press.
Getzels, J. W., & Jackson, P. W. (1962). Creativity and intelligence: Explorations with gifted students. New York: Wiley.
Jackson, P. W., & Messick, S. (1965). The person, the product, and the response: Conceptual problems in the assessment of creativity. Journal of Personality, 33, 309-329.
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694.
Messick, S. (1989a). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.
Messick, S. (1989b). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
Messick, S. (1995). Validity of psychological assessment. American Psychologist, 50(9), 741-749.
Norris, S. P. (1983). The inconsistencies at the foundation of construct validation theory. In E. R. House (Ed.), Philosophy of evaluation (pp. 53-74). San Francisco: Jossey-Bass.
Wittgenstein, L. (1968). Philosophical investigations. Oxford: Blackwell.
Author Index

A
Abrami, P. C., 281, 296
Abramson, L. Y., 159, 164
Ackerman, P. L., 105, 121-122, 124, 137, 144
Aguanno, J. C., 282, 296
Ainley, M. D., 134, 143
Alexander, P. A., 134, 146
Allard, F., 154, 164
Allen, J. S., 152-153, 164
Allison, R. B., 131, 143
Allport, G. W., 104-105, 122
Ambron, S. R., 173, 178
American Psychiatric Association, 40, 47, 98, 122
Anastasi, A., 175, 177, 189, 191, 245, 252
Anderson, J. R., 180, 191
Anderson, M., 76, 92
Angleitner, A., 20, 25, 34
Aronson, E., 287, 296
Arter, J. A., 214, 227
Ashton, M. C., 104, 124
Ashton, S. G., 9, 16
Atash, N., 215, 218, 227
Atkinson, R. C., 128, 144
B
Baer, R. A., 51, 66
Bakan, D., 60, 66
Baker, E. L., 179, 191, 245, 252
Bakker, F. C., 157, 164
Balke, G., 79-80, 89, 93
Bandura, A., 10, 16
Barlow, D. H., 101, 126
Barnes, M. L., 177-178
Baron, J. B., 183, 186, 191
Baxter, G. P., 180, 185, 186, 189, 191
Bechtoldt, H. P., 314, 317
Beck, A. T., 42, 47
Ben-Porath, Y. S., 41, 47
Benet, V., 104, 122
Berends, M., 236, 253
Berg, L. R., 186, 192
Berry, D. T., 51, 66
Biggs, J. B., 134, 143
Binet, A., 74, 75, 92
Block, J., 50, 53, 59, 66, 97, 106-107, 122
Bonanno, G. A., 59, 66
Bond, M. H., 59, 69
Bonner, M. W., 200, 204
Borgatta, E. F., 102, 122
Bouchard, T. J., Jr., 99, 122
Boyle, G. J., 97, 99, 102, 122
Breland, H. M., 200, 204
Bridgeman, B., 194, 199, 202, 204
Briggs, S. R., 99, 105, 124
Brody, N., 76, 92
Brodzinsky, D. M., 141, 144
Brown, J. D., 59, 66, 69
Brown, W., 26, 34
Browne, M. W., 285, 296-297
Bruce, M. N., 57, 65, 68
Bruner, J. S., 10, 17
Buchtel, F. S., 287, 296
Burt, C., 78, 92
Buss, A. H., 102, 121-122
Buss, D. M., 22, 34, 42, 47
Butcher, J. N., 41, 46-47, 98, 123
Byrne, D., 54, 66
C
California Department of Education, 184, 187-188, 191
California Learning Assessment System, 221, 227
Campbell, D. T., 44, 47, 172, 177
Campbell, J. R., 240, 247-248, 252-253
Cantor, N., 148, 164
Carlson, S. B., 203, 205
Carlyon, E., 183, 191
Carroll, J. B., 72-78, 92, 98-100, 110, 123, 128, 144
Cartwright, D., 108, 123
Cashin, W. E., 287, 296
Cattell, R. B., 24, 34, 42-43, 47-48, 53, 66, 77-78, 93, 97, 99, 102, 122
Chacko, T. I., 280, 287, 296
Chan, D. W., 10, 14-15, 17
Chance, J., 51, 66
Cicchetti, D., 151, 165
Clair, M., 280, 297
Cloninger, C. R., 105, 123
Coan, R. W., 73, 93
Cofer, C. N., 51, 66
Cohen, P. A., 281, 296
Comrey, A. L., 105, 122-123, 125
Connor, K., 148, 164
Conway, C. G., 282, 297
Cook, T. D., 172, 177, 247-248, 253
Corno, L., 94, 136, 245
Costa, P. T., 53, 57-58, 66-67
Costa, P. T., Jr., 41, 48, 97-99, 102-103, 107, 108, 110, 115, 121, 123, 125
Cremin, L. A., 238, 253
Crisp, F., 154, 165

D
Dewey, J., 311, 317
Dicken, C., 30, 34
Dyke, R. B., 45, 48

E
Edwards, A. L., 51, 53, 66
Elam, S. M., 236, 254
Exner, J. E., Jr., 108, 125
Eysenck, H. J., 52, 54, 66, 97

F
Funder, D. C., 59, 66

G
Goff, M., 121, 124, 137

H
Heiman, J. R., 58, 67
Helson, R., 159, 164
Hendriks, A. J. A., 20, 23, 27, 29, 33-34
Hofstee, W. K. B., 20, 23-24
Holmes, D. S., 277, 287, 297
Holzinger, K. J., 79-81, 87, 93
Holzman, P. S., 45, 47, 130, 144
Hoorens, V., 57, 67
Horn, J. L., 77, 93
Horvath, J. A., 31, 35
Howard, G. S., 282, 286-287, 297
Hull, C. L., 171, 178
Humphreys, L. G., 76-77, 85, 94, 111-112, 125
Hundleby, J., 43, 48
Hunt, E., 128, 131, 144
I
Innerst, 233, 253
Inouye, J., 97, 110-111, 119, 124
Ippel, M. J., 138, 145

J
Jackson, D. N., 4, 7-11, 13-17, 49-52, 54, 60, 67-68, 73, 94
Johnson, E. A., 59, 67
Johnson, J. A., 99, 105
Judson, A. J., 51, 66

K
Kahn, H. D., 200, 204
Kaltman, S., 59, 66
Kaplan, B.
Karp, S. A., 45, 48
Kenny, D., 132, 144
Kihlstrom, J. F., 148, 164
Klein, G. S., 45, 47, 130, 144
Knowles, E. S., 290, 297
Kogan, N., 127, 130, 144-145, 148, 150, 164
Kopriva, R., 207, 218-220, 226-227
Kovacevic, A., 59, 66
Kozma, A., 50, 67
Kristof, A. L., 32, 34
Krug, S. E., 105, 124
Kubota, M., 200, 204
Kuhl, J., 135, 144
Kyllonen, P. C., 141-142, 144

L
Law, H., 105, 125
Lehman, J. M., 101, 108, 113, 125
Levin, R. A., 57, 68
Lewis, C., 197, 205
Linder, D. E., 287, 296
Lindley, S., 154, 165
Linn, R. L., 179, 190-191, 233-234, 247, 249, 252-253
Linton, H. B., 45, 47, 130, 144
Loehlin, J. C., 20, 34
Loevinger, J., 4, 17, 42, 48, 81, 86, 94, 171, 178, 311, 317
Lohman, D. F., 75, 94, 99, 125, 133, 137-138, 141-142, 144-145
Löhr, F.-J., 25, 34
Lomask, M., 183, 186, 188, 191
London, H., 108, 125
Lord, F. M., 76, 94, 129, 145
Lorr, M., 99, 102, 125-126
Lubinski, D., 76-77, 94
Ludlow, L., 287, 297
Lunneborg, C., 128, 131, 144
Lykken, D., 108, 125
Lysy, D., 63, 68
M
MacCallum, R., 113, 125, 285, 297
MacLennan, R. N., 13, 17
Mapus, L., 215, 217, 227
Marlowe, D., 52-53, 66
Marsh, H. W., 285, 297
Martin, D. W., 186, 192
Maxwell, S. E., 282, 297
May, M. A., 52, 66
Mayr, E., 129, 145, 188, 192
McCamley-Jenkins, L., 197, 205
McCrae, R. R., 41, 48, 53, 57-58, 66-67, 97-99, 102-103, 106, 108, 110, 115, 121, 123, 125
McDonnell, L., 238, 253
McHale, F., 199, 204
McHoskey, J. W., 62, 67
McKeachie, W. J., 279, 280, 286-287, 297
McKinley, J. C., 52, 54, 66
McLaughlin, M. W., 236, 241, 253
McPeek, W. M., 202, 204
Meehl, P. E., 7, 17, 175, 178, 212, 227, 313-317
Messick, S. J., 4, 17, 37, 43, 47, 49-52, 54-56, 59-60, 62, 65-67, 73, 89, 90, 94, 127, 133-134, 136, 137, 144-145, 169, 173-175, 178, 179, 187, 190, 192, 194-196, 204, 214, 218, 223-224, 227, 232, 234, 247-248, 253, 259, 275, 300, 308, 310-311, 316-317
Meston, C. M., 58, 67
Metalsky, G. I., 159, 164
Milholland, J. E., 53, 67
Millham, J. E., 53, 67
Miller, A., 134, 136, 145
Miller, J. L., 57, 68
Millon, T., 41, 48
Mirels, H. L., 10, 17
Mischel, W., 10, 17
Montanelli, R. G., Jr., 111-112, 125
Morey, L. C., 41, 48
Moss, P. A., 197, 204, 233, 253
Most, R. B., 99, 125
Mullis, I. V., 200-201, 204
Murphy, G., 57, 68
Murphy, R. J., 200, 204
N
Nanda, H., 132, 144
National Council of Teachers of Mathematics, 237, 253
Neisser, U., 76, 94
Nevid, J. S., 50, 67
New Standards Project, 207, 216, 218, 220, 227
Newtson, D., 176, 178
Neyman, C. A., Jr., 194, 204
Nichols, D. S., 59, 67
Noice, H., 154, 164
Noice, T., 154, 164
Noller, P., 105, 125
Norman, W. T., 102, 125
Norris, S. P., 307-308, 311
Notareschi, R. F., 62, 68
Novacek, J., 62, 68
Novick, M., 130, 145
Nunnally, J. C., 29, 34

O
Oakes, J., 236, 254
Odbert, H. S., 104-105, 122
O'Neil, H. F., Jr., 245, 252
O'Neil, K. A., 202, 204
Ones, D. S., 30, 34, 50, 68, 120, 124
Orr, D. V., 194, 204
Ozer, D. J., 97, 125
P
Pascual-Leone, J., 135, 145
Pask, G., 134, 145
Paulhus, D. L., 49-50, 52-53, 55-60, 62-69
Paunonen, S. V., 11, 17, 60, 68
Pawlik, K., 43, 48
Peabody, D., 97, 125
Peacock, A. C., 15, 17
Peckar, H., 108, 123
Pedhazur, E. J., 84, 94
Perry, M., 42, 47
Pervin, L. A., 97, 105, 125
Peterson, C., 159, 164
Petit, M., 215, 218, 227
Phillips, D., 173, 178
Phillips, E., 160, 164
Phillips, G. W., 240, 252
Pincus, A. L., 41, 48, 98, 120, 126
Plomin, R., 102, 122
Pollack, J. M., 201, 204
Powell, R. W., 280, 287, 297
Prince, L. M., 156, 164

R
Raizen, S. A., 200-201, 204
Rajaratnam, N., 132, 144
Ramist, L., 197, 205
Raskin, R. N., 62, 68
Rathunde, K., 151, 164
Ravitch, D., 245, 248, 254
Reciniello, S., 159, 164
Reed, P. L., 13-14, 17
Reese, C. M., 240, 252
Reid, D. B., 57, 59, 68
Reise, S. P., 97, 125
Reiss, A. D., 30, 34, 50, 68
Resnick, L. B., 180, 192, 207, 220, 227
Reuterberg, S.-E., 81, 94
Revelle, W., 119, 125
Riding, R. J., 142, 145
Riemann, R., 20, 34
Rifkin, B., 140, 146
Riggio, R. E., 156, 164
Riley, S., 215, 218, 227
Robins, R., 59, 67
Rock, D. A., 201, 204
Rocklin, T., 99, 125
Romberg, T., 231, 254
Rose, L. C., 236, 254
Rosén, M., 88, 94
Ross, J. G., 57, 68
Rouse, S. V., 41, 46-47, 98, 123
Royer, F. L., 128, 145
S
Schaafsma, H., 23, 34
Schmitt, A., 194, 202, 204
Shweder, R. A., 10, 17
Stankov, L., 97, 99, 102, 122
Stanley, J. C., 172, 177
Starkes, J. L., 154, 164, 165
Stecher, M. D., 57, 68
Steer, R. A., 42, 47
Sternberg, R. J., 31, 35, 131, 140, 146, 148, 165, 177-178
Stones, M. J., 50, 67
Strack, S., 99, 126
Strelau, J., 20, 34
Stricker, L. J., 10, 17
Stumpf, S. A., 282, 284
Sugawara, H. M., 285, 297
Svrakic, D. M., 105, 123
Swanson, J., 136, 145
Swineford, F.
Szyarto, C., 62, 67

T
Tagiuri, R., 10, 17
Tanchuk, T., 63, 68
Tapsfield, P., 76, 93
Tatsuoka, M. M., 105, 123
Taylor, S. E., 59, 69
Technical Recommendations for Psychological Tests and Diagnostic Techniques, 171, 178
Tellegen, A., 97, 102, 104, 107, 119, 121
ten Berge, J. M. F., 20, 27, 29, 34
Theall, M., 287, 297
Thorndike, R. M., 75, 90, 94
Thurstone, L. L., 73, 76-77, 94
Tomes, J. L., 104, 124
U
Underwood, B. J., 128, 146
Undheim, J. O., 73-74, 76, 78, 93, 95

V
Vernon, P. E., 78, 86, 95
Villee, C., 186, 192
Viswesvaran, C., 30, 34, 50, 68
von Baeyer, C., 159, 164
W
Wagner, R. K., 31, 35, 151, 165
Walker, D. F., 173, 178
Walker, J. N., 51, 54, 66
Wallach, M. A., 148, 158, 165
Waller, N. G., 104, 122, 126
Ward, W. C., 203, 205
Wechsler, D., 75, 95
Wehr, P., 63, 68
Weiner, S. S., 173, 178
Weinstein, C. E., 134, 146
Weiss, I. R., 200-201, 204
Welsh, G. S., 150, 165
Werner, H., 151, 165
Weston, E. A., 200-201, 204
Wetter, M. W., 51, 66
Whalen, S., 151, 164
Widiger, T. A., 98, 123
Wiggins, J. S., 41, 48, 51-52, 54, 60, 62-63, 69, 98-99, 105, 120, 126
Wild, C. L., 202, 204
Wiley, D. E., 74, 90, 95, 207, 210, 214-215, 217, 219-221, 223-224, 226-227, 233, 254
Williams, W. M., 31, 35
Willingham, W. W., 193-194, 197-201, 205, 249, 254
Wilson, G. D., 163, 165
Wilson, M. J., 159, 165
Winner, E., 151-152, 165
Witkin, H. A., 45, 48
Wittgenstein, L., 316-317
Woltz, D. J., 141-142, 144
Wong, P. T. P., 280, 287, 297
Woodrow, H., 131, 146
Worthington, A. G., 280, 287, 297
Worzel, W., 62, 67
Y
Yik, M., 59, 63, 68, 69
Young, M. J., 207, 220, 227

Z
Zavala, J. D., 104, 126
Zeidner, M., 98-99, 125
Zinbarg, R. E., 101, 126
Subject Index

A
Ability tests, 171
Accuracy constructs, 53
Achievement tests, 171
Adjusted Goodness of Fit Index (AGFI), 112
Adventurousness, 157
Alpha, 119
Antisocial personalities, 108
Assessment, 221, 244
Assessment scores, 221

B
Balanced Inventory of Desirable Responding (BIDR), 57-58
Beck Depression Inventory, 42
Beta, 119
Big Five personality traits, 57, 61, 98

C
California Psychological Inventory (CPI), 9
Childhood activity checklist, 159
Clinical assessment, 39
Clinical depression, 13
Cognition, 134
Cognitive ability, 103
Cognitive construct, 140
Cognitive research, 179
Cognitive style, 44, 127, 133
Coherent, 31
Cohort differences, 193
Competence, 181
Computer-based testing, 202
Conative construct, 140
Consequential, 174, 255, 311
Consistency, 188
Construct differences, 193
Construct validity, 57, 171-172, 175, 190, 195, 282, 299, 301, 303
Constructs, 3-4, 212, 307, 311, 312
Content, 173
Content, lean-process constrained, 184
Content, open, 185
Content, process space, 183
Content, relevance, 257
Content, rich-process open, 183
Content, rich-process constrained, 186
Content standards, 216, 238, 240
Content validity, 234
CRESST validity criteria, 257

D
Desirability factor, 104
Diagnostic and Statistical Manual of Mental Disorders (DSM), 40, 98

E
Educational achievement, 179, 208
Educational experiment, 302
Ego-resiliency, 53
Elaborate constructs, 52
Emotional stability, 33
Emotionality, 157
Essentialist thinking, 129
External, 174
Extraversion, 118
Extroversion-introversion, 157
F
Face validity, 266
Fair test design, 198
Fairness, 193
Five-Factor Model, 101, 121
Five-Factor Personality Inventory (FFPI), 27, 33
Five-Factor Personality Model, 97, 99
Flexibility of closure, 74

G
GCE Exam, 200
GPA, 199
GRE, 257
GRE Mathematical Reasoning Test, 261
Gender differences, 201
General Goodness of Personality, 119
Generalizability, 174
Goal misinterpreted, 188
Goodness of Fit Index (GFI), 112

H
Hierarchical Model, 73
High creatives, 302
Higher order factors, 100
Holtzman Inkblot Technique, 43
Human cognition, 179

I
IQ, 26, 84
Impression management, 57
Impulsivity, 8
Intellectual autonomy, 33
Intelligence, 31
Intrinsic validity, 170, 257
Introversion, 118
Item weighting, 265

J
Junior Eysenck Personality Inventory, 14

M
Model-based reasoning skills, 183
Multidimensional scaling analysis, 8
Multidimensional scaling methods, 3
Multiple choice, 200, 210
Multiple rating, 270, 271
Multitrait-multimethod studies, 282
"Mystery Powders" Assessment, 185
N
NELS Performance Test, 200
NEO, 147
Narcissistic Personality Inventory, 60, 64
National Assessment of Educational Progress (NAEP), 235, 240-241
National Institute of Mental Health Postdoctoral Fellow, 3
Near-miss scoring, 266
New Standards Project, 207
Nomological network, 314
Nonverbal, 11, 13

O
Objective-Analytic Personality Factor Batteries, 43
Observed consistency, 310
Other-deception, 54

P
Paranoid psychology, 13
Performance assessment, 190, 202, 214, 234
Performance goals, 210
Performance records, 213
Performance standards, 239
Personality, 155
Personality assessment, 7, 19, 26, 38
Personality Assessment Inventory, 39, 41
Personality constructs, 9
Personality-motivated behavior, 7
Personality questionnaires, 19, 32
Personality Research Form scales, 8, 10-11
Platonic true score, 130
Point Constancy Requirement
Psychological measurement, 193
Psychopathology, 13-14

S
Sample differences, 193
Self-deception, 54
Self-Deceptive Denial (SDD)
Self-Deceptive Enhancement (SDE), 59
Social consequences, 311
Socialization, 155
Spatial relations, 74
Spearman-Holzinger Unitary Trait Study, 80
Standards-based assessment, 207, 233
Standards-based educational reform, 236
Structure-of-Intellect (SI), 77
Superfactors, 108

T
Task performance, 209
Task specification, 210
Tellegen model, 105
Test bias, 197
Test constructs, 194
Test design, 196
Test use, 196
Test validity, 256
Thematic Apperception Test, 43
Trait factor model, 127
Tridimensional Personality Inventory, 101

U
Understanding overestimated
Uniform performance, 187

V
Validity of constructs, 223
Verbal, 11
Voluntary national tests, 237, 240, 251-252

W
Welsh Figure Preference Test, 150
Writing proficiency, 199