Analyzing Rater Agreement: Manifest Variable Methods
Alexander von Eye, Michigan State University
Eun Young Mun, University of Alabama at Birmingham

LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS
Mahwah, New Jersey    London
2005
Camera ready copy for this book was provided by the authors.
Copyright © 2005 by Lawrence Erlbaum Associates, Inc. All rights reserved. No part of this book may be reproduced in any form, by photostat, microform, retrieval system, or any other means, without prior written permission of the publisher.

Lawrence Erlbaum Associates, Inc., Publishers
10 Industrial Avenue
Mahwah, New Jersey 07430

Cover design by Kathryn Houghtaling Lacey

Library of Congress Cataloging-in-Publication Data

Eye, Alexander von.
Analyzing rater agreement: manifest variable methods / Alexander von Eye, Eun Young Mun.
p. cm.
Includes bibliographical references and index.
ISBN 0-8058-4967-X (alk. paper)
1. Multivariate analysis. 2. Acquiescence (Psychology)—Statistical methods. I. Mun, Eun Young. II. Title.
QA278.E94 2004
519.5'35—dc22    2004043344    CIP

Books published by Lawrence Erlbaum Associates are printed on acid-free paper, and their bindings are chosen for strength and durability.

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
Disclaimer: This eBook does not include the ancillary media that was packaged with the original printed version of the book.
Contents

Preface    ix

1. Coefficients of Rater Agreement    1
   1.1 Cohen's κ (kappa)    2
       1.1.1 κ as a Summary Statement for the Entire Agreement Table    2
       1.1.2 Conditional κ    8
   1.2 Weighted κ    10
   1.3 Raw Agreement, Brennan and Prediger's κₙ, and a Comparison with Cohen's κ    13
   1.4 The Power of κ    17
   1.5 Kendall's W for Ordinal Data    19
   1.6 Measuring Agreement among Three or More Raters    22
   1.7 Many Raters or Many Comparison Objects    25
   1.8 Exercises    27

2. Log-Linear Models of Rater Agreement    31
   2.1 A Log-Linear Base Model    32
   2.2 A Family of Log-Linear Models for Rater Agreement    34
   2.3 Specific Log-Linear Models for Rater Agreement    35
       2.3.1 The Equal-Weight Agreement Model    35
       2.3.2 The Weight-by-Response-Category Agreement Model    40
       2.3.3 Models with Covariates    41
           2.3.3.1 Models for Rater Agreement with Categorical Covariates    42
           2.3.3.2 Models for Rater Agreement with Continuous Covariates    48
       2.3.4 Rater Agreement plus Linear-by-Linear Association for Ordinal Variables    54
       2.3.5 Differential Weight Agreement Model with Linear-by-Linear Interaction plus Covariates    59
   2.4 Extensions    63
       2.4.1 Modeling Agreement among More than Two Raters    63
           2.4.1.1 Estimation of Rater-Pair-Specific Parameters    64
           2.4.1.2 Agreement among Three Raters    67
       2.4.2 Rater-Specific Trends    68
       2.4.3 Generalized Coefficients κ    70
   2.5 Exercises    75

3. Exploring Rater Agreement    79
   3.1 Configural Frequency Analysis: A Tutorial    80
   3.2 CFA Base Models for Rater Agreement Data    85
1. Coefficients of Rater Agreement

The number of coefficients for rater agreement is large. However, the number of coefficients that are actually used in empirical research is small. The present section discusses five of the more frequently used coefficients. The first is Cohen's (1960) κ (kappa), one of the most widely employed coefficients in the social sciences. κ is a coefficient for nominal-level variables. The second and the third coefficients are variants of κ. The second coefficient, weighted κ, allows the statistical analyst to place differential weights on discrepant ratings. This coefficient requires ordinal rating scales. The third coefficient is Brennan and Prediger's κₙ, a variant of κ that uses a different chance model than the original κ. The fourth coefficient is raw agreement, which expresses the degree of agreement as the percentage of judgements in which the raters agree. The fifth coefficient discussed here is Kendall's W. This coefficient is defined for ordinal variables.
1.1 Cohen's κ (Kappa)

Clearly the most frequently used coefficient of rater agreement is Cohen's (1960) kappa, κ. In its original form, which is presented in this section, this coefficient can be applied to square cross-classifications of two raters' judgements (variants for three or more raters are presented in Section 1.6). These cross-classifications are also called agreement tables. Consider the two raters A and B who used the three categories 1, 2, and 3 to evaluate students' performance in English. The cross-classification of these raters' judgements can be depicted as given in Table 1.1.
The interpretation of the frequencies, m_ij, in the cross-classification given in Table 1.1 is straightforward. Cell 11 displays the number of instances in which both Rater A and Rater B used Category 1. Cell 12 contains the number of instances in which Rater A used Category 1 and Rater B used Category 2, and so forth. The cells with indexes i = j display the numbers of incidences in which the two raters used the same category. These cells are also called the agreement cells (the diagonal cells m11, m22, and m33 in Table 1.1).
Table 1.1: Cross-Classification of Two Raters' Judgements

                                 Rater B Rating Categories
                                   1       2       3
Rater A Rating Categories   1     m11     m12     m13
                            2     m21     m22     m23
                            3     m31     m32     m33
The following two sections first introduce κ as a coefficient that allows one to describe rater agreement in the form of a summary statement for an entire table. Second, conditional κ is introduced. This measure allows one to describe rater agreement separately for each rating category.

1.1.1 κ as a Summary Statement for the Entire Agreement Table
To introduce Cohen's κ, let p_ij be the probability of Cell ij. The cells that indicate rater agreement, that is, the agreement cells, have probability p_ii. The parameter

$$\theta_1 = \sum_{i=1}^{I} p_{ii}$$

describes the proportion of instances in which the two raters agree, where I is the number of categories. To have a reference with which θ₁ can be compared, we assume independence of the two raters. In other words, we assume that the raters did not influence each other when evaluating the students. (Later in this text, we will see that this assumption corresponds to main effects models in log-linear analysis and Configural Frequency Analysis.) Based on this assumption, we can estimate the proportion of instances in which the two raters agree by chance using

$$\theta_2 = \sum_{i=1}^{I} p_{i.}\,p_{.i},$$

where p_i. and p_.i are the row and column marginal probabilities. Cohen's κ relates the two quantities:

$$\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}.$$
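As a quick computational check, the following sketch computes κ as just defined (Python with NumPy; the function name and the example table are illustrative and not from the book):

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa for a square I x I agreement table of counts."""
    m = np.asarray(table, dtype=float)
    p = m / m.sum()                           # cell probabilities p_ij
    theta1 = np.trace(p)                      # observed agreement: sum of p_ii
    theta2 = p.sum(axis=1) @ p.sum(axis=0)    # chance agreement: sum of p_i. p_.i
    return (theta1 - theta2) / (1 - theta2)

# Hypothetical 3 x 3 table for Raters A and B using categories 1-3
table = [[20,  5,  1],
         [ 4, 30,  6],
         [ 2,  3, 29]]
print(round(cohens_kappa(table), 3))          # 0.681
```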
1.3 Raw Agreement, Brennan and Prediger's κₙ, and a Comparison with Cohen's κ

Figure 2: κ, κₙ, and ra when agreement decreases; marginals discrepant. (The figure plots Cohen's κ, Brennan and Prediger's κₙ, and raw agreement against the frequency in Cell 21.)
From the earlier considerations and these simulations, we conclude that

(1) for ra > 0.5, which indicates that rater agreement is more likely than disagreement, the three measures ra, κ, and κₙ correlate strongly;
(2) for ra > 0.5 and κ ≠ 1, the magnitude of κ is only a weak indicator of strength of agreement; instead, κ indicates strength of agreement above and beyond the variation that can be explained by taking into account the main effects;
(3) thus, except for the case in which κ = 1.0, κ is not a measure of raw agreement but a measure of agreement beyond chance (see the sketch below);
(4) researchers are well advised when they report raw agreement or Brennan and Prediger's κₙ in addition to Cohen's κ.
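Conclusion (3) is easy to reproduce numerically. The sketch below (Python with NumPy; the helper and the table are ours, chosen so that the marginals are discrepant) returns all three measures:

```python
import numpy as np

def agreement_measures(table):
    """Raw agreement, Cohen's kappa, and Brennan & Prediger's kappa_n
    for a square two-rater agreement table."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    n_cat = p.shape[0]
    ra = np.trace(p)                              # raw agreement
    theta2 = p.sum(axis=1) @ p.sum(axis=0)        # chance model with main effects
    kappa = (ra - theta2) / (1 - theta2)
    kappa_n = (ra - 1 / n_cat) / (1 - 1 / n_cat)  # uniform chance model, 1/I
    return ra, kappa, kappa_n

# Discrepant marginals: raw agreement is high although kappa is low
ra, k, kn = agreement_measures([[80, 15],
                                [ 2,  3]])
print(round(ra, 2), round(k, 3), round(kn, 3))    # 0.83 0.198 0.66
```

Here ra = 0.83 although κ ≈ 0.20, illustrating that κ measures agreement beyond chance rather than raw agreement.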
1.4 The Power of κ

In this section, power considerations for κ are reported for two cases. The first case covers sample size calculations for 2 × 2 tables. The second case is more general. It covers sample size calculations for tables of size 3 × 3 or larger for two or more raters.
The sample size requirements for κ under various hypotheses in 2 × 2 tables have been described by Cantor (1996). Based on Fleiss, Cohen, and Everitt (1969), the asymptotic variance of the estimate of κ can be given by Q/N, with

$$Q = \frac{\sum_i p_{ii}\big[(1-\theta_2) - (p_{i.}+p_{.i})(1-\theta_1)\big]^2 + (1-\theta_1)^2\sum_{i\neq j} p_{ij}(p_{.i}+p_{j.})^2 - (\theta_1\theta_2 - 2\theta_2 + \theta_1)^2}{(1-\theta_2)^4},$$

where θ₁, θ₂, p_ii, p_i., and p_.i are defined as in Section 1.1, p_ij indicates the probability of the off-diagonal cells, and p_.j is the probability of Column j. Fortunately, the values of Q are tabulated for various values of p_i. and p_.j (Cantor, 1996, p. 152), so that hand calculations are unnecessary.
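Where tabulated values are inconvenient, Q can also be computed directly. The following is a sketch transcribing the formula as reconstructed above (Python with NumPy; the function name is ours):

```python
import numpy as np

def fleiss_q(p):
    """Q of Fleiss, Cohen, and Everitt (1969), so that Var(kappa-hat) = Q/N;
    p is the I x I matrix of cell probabilities p_ij."""
    p = np.asarray(p, dtype=float)
    row, col = p.sum(axis=1), p.sum(axis=0)       # p_i. and p_.i
    theta1, theta2 = np.trace(p), row @ col
    # sum_i p_ii [ (1 - theta2) - (p_i. + p_.i)(1 - theta1) ]^2
    t1 = (np.diag(p) * ((1 - theta2) - (row + col) * (1 - theta1)) ** 2).sum()
    # (1 - theta1)^2 * sum_{i != j} p_ij (p_.i + p_j.)^2
    w = col[:, None] + row[None, :]               # w[i, j] = p_.i + p_j.
    t2 = (1 - theta1) ** 2 * ((p * w ** 2).sum()
                              - (np.diag(p) * np.diag(w) ** 2).sum())
    t3 = (theta1 * theta2 - 2 * theta2 + theta1) ** 2
    return (t1 + t2 - t3) / (1 - theta2) ** 4
```

As a plausibility check, a table with perfect agreement returns Q = 0, as it should, since the estimate of κ then has no sampling variance.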
Now suppose a researcher intends to test the null hypothesis H₀: κ = κ₀ with α = 0.05 and power 1 − β = 0.80. The sample size required for this situation is

$$N = \left[\frac{z_\alpha\sqrt{Q_0} + z_\beta\sqrt{Q_1}}{\kappa_1 - \kappa_0}\right]^2,$$

where z_α is the z-score that corresponds to the significance level α (e.g., for α = 0.05, one obtains z₀.₀₅ = 1.645; accordingly, for β = 0.2, one obtains z₀.₂ = 0.842), and Q₀ and Q₁ are the Q-scores for the null hypothesis and the alternative hypothesis, respectively. Note that the null hypothesis that κ₀ = 0 can be tested using this methodology. The values for Q can be either calculated or taken from Cantor's Table 1.
Consider the following example. A police officer and an insurance agent determine whether drivers in traffic accidents are at-fault or not-at-fault. A researcher intends to test the null hypothesis that κ₀ = 0.3 against the one-sided alternative hypothesis that κ > 0.3, with α = 0.05 and power = 0.8 for κ₁ = 0.5. We can take the two z-scores from above and find in the table Q₀ = 0.91 and Q₁ = 0.75. Inserting into the equation for N yields

$$N = \left[\frac{1.645\sqrt{0.91} + 0.842\sqrt{0.75}}{0.5 - 0.3}\right]^2 = 132.07.$$

Thus, to reject the null hypothesis that κ₀ = 0.3 in favor of the one-sided alternative hypothesis that κ₁ = 0.5 with power 0.8 at a significance level of α = 0.05, the researcher needs at least 132 traffic accidents in the sample.

For more hypothesis tests, including tests that compare rating patterns from two independent samples, see Cantor (1996). For tables larger than 2 × 2 see, for instance, Flack, Afifi, and Lachenbruch (1988).
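The sample size formula is equally simple to script. A minimal sketch (Python with SciPy; the function name is ours) that reproduces the traffic-accident example:

```python
from scipy.stats import norm

def kappa_sample_size(kappa0, kappa1, q0, q1, alpha=0.05, power=0.80):
    """N for a one-sided test of H0: kappa = kappa0 vs. H1: kappa = kappa1,
    following Cantor (1996); q0 and q1 are the Q values under H0 and H1."""
    z_alpha = norm.ppf(1 - alpha)    # 1.645 for alpha = 0.05
    z_beta = norm.ppf(power)         # 0.842 for power = 0.80
    return ((z_alpha * q0 ** 0.5 + z_beta * q1 ** 0.5) / (kappa1 - kappa0)) ** 2

print(kappa_sample_size(0.3, 0.5, q0=0.91, q1=0.75))   # ~132, as in the text
```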
A more general approach to power analysis for κ was recently proposed by Indurkhya, Zayas, and Buka (2004). The authors note that, under the usual multinomial sampling scheme, the cell frequencies follow a Dirichlet-multinomial distribution. This distribution can then be used to estimate the probability that all raters choose category i. Based on this estimate, the authors derive a χ²-distributed statistic for the null hypothesis that κ = 0 and an equation for the required sample size. This estimate depends on the probabilities that all raters chose a given rating category, the number of rating categories, the significance level, and the desired power. The authors present two tables with minimum required sample sizes. The first table states the required sample sizes for the null hypothesis that κ₀ = 0.4 versus the alternative hypothesis that κ₁ = 0.6, for α = 0.05 and power = 0.8. The second table presents the required sample sizes for the null hypothesis that κ₀ = 0.6 versus the alternative hypothesis that κ₁ = 0.8, also for α = 0.05 and power = 0.8. Table 1.4 summarizes these two tables.
Table 1.4: Minimum Required Sample Sizes for α = 0.05 and Power = 0.8 (adapted from Indurkhya et al., 2004)

Null hypothesis: κ₀ = 0.4; alternative hypothesis: κ₁ = 0.6

3 Categories                          Number of raters
π₁      π₂      π₃             2      3      4      5      6
0.1     0.1     0.8          205    113     83     68     59
0.1     0.4     0.5          127     69     50     40     35
0.33    0.33    0.34         107     58     42     35     30

4 Categories                          Number of raters
π₁      π₂      π₃      π₄     2      3      4      5      6
0.1     0.1     0.1     0.7  102     42     38     32     29
0.1     0.3     0.3     0.3   88     30     30     29     27
0.25    0.25    0.25    0.25  60     28     27     25     25

Null hypothesis: κ₀ = 0.6; alternative hypothesis: κ₁ = 0.8

3 Categories                          Number of raters
π₁      π₂      π₃             2      3      4      5      6
0.1     0.1     0.8          172    102     77     66     58
0.1     0.4     0.5          102     60     46     40     35
0.33    0.33    0.34          87     52     39     33     30

4 Categories                          Number of raters
π₁      π₂      π₃      π₄     2      3      4      5      6
0.1     0.1     0.1     0.7  157     74     68     52     49
0.1     0.3     0.3     0.3   89     38     34     27     24
0.25    0.25    0.25    0.25  68     26     24     23     20
The sample sizes listed in Table 1.4 suggest that the required sample size decreases as the number of raters and the number of rating categories increase. In addition, as the probabilities of the categories become more discrepant, the required sample size increases. Donner and Eliasziw (1992) presented similar results for the case of two raters and two rating categories.
1.5 Kendall's W for Ordinal Data

For ordinal (rank-ordered) data, Kendall's (1962) W is often preferred over κ (for interval-level data, κ is equivalent to the well-known intraclass correlation; see Rae, 1988). W allows one to (1) compare many objects; (2) compare these objects in a number of criteria; and (3) take the ordinal nature of ranks into account. The measure W, also called the coefficient of concordance, compares the agreement found between two or more raters with perfect agreement. The measure is

$$W = \frac{s}{\frac{1}{12}k^2(N^3 - N)},$$

where N is the number of rated objects, k is the number of raters, and s is the sum of the squared deviations of the rank sums Rᵢ from the average rank sum, that is,

$$s = \sum_{i=1}^{N} (R_i - \bar{R})^2,$$

with i = 1, …, N, where Rᵢ is the sum of the ranks assigned to object i (i.e., the column sum in Table 1.5, below). The maximum sum of squared deviations is

$$\frac{1}{12}k^2(N^3 - N).$$

This is the s for perfect rater agreement. For small samples, the critical values of W can be found in tables (e.g., in Siegel, 1956). For 8 or more objects, the χ²-distributed statistic

$$X^2 = \frac{s}{\frac{1}{12}kN(N+1)} = k(N-1)W$$

can be used under df = N − 1. Large, significant values of W suggest strong agreement among the k raters.
When there are ties, that is, judgements share the same rank, the maximum deviance from the average rank is smaller than the maximum deviance without ties. Therefore, a correction element is introduced in the denominator of the formula for Kendall's W. The formula becomes

$$W' = \frac{s}{\frac{1}{12}\Big[k^2(N^3 - N) - k\sum_j \big(t_j^3 - t_j\big)\Big]},$$

where tⱼ is the number of tied ranks in the jth tie. This quantity is also called the length of the jth tie. If there are no ties, there will be N "ties" of length 1, and the second term in the denominator disappears. The corrected formula then becomes identical to the original formula.

Data example. The following example re-analyzes data presented by Lienert (1978). A sample of 10 participants processed a psychometric test. The solutions provided by the participants were scored according to the three criteria X = number of correct solutions, Y = number of correct solutions minus number of incorrect solutions, and Z = number of items attempted. The participants were ranked in each of these three criteria. We now ask how similar the rankings under the three criteria are. Table 1.5 displays the three series of ranks.
Table 1.5: Ranks of 10 Subjects in the Criteria X = Number of Correct Solutions, Y = Number of Correct Solutions minus Number of Incorrect Solutions, and Z = Number of Items Attempted

Participant    1     2     3     4     5     6     7     8     9    10
Criterion X   4.5    2     1    4.5    3    7.5    6     9    7.5   10
          Y   2.5   2.5    1    4.5   4.5    8     9    6.5   10    6.5
          Z    2     1    4.5   4.5   4.5   4.5    8     8     8    10
Sum            9    5.5   6.5  13.5   12    20    23   23.5  25.5  26.5
The average rank is calculated as the sum of all ranks over the number of participants. We obtain ΣR = 165 and R̄ = 16.5. For the sum of the squared deviations we obtain s = (5.5 − 16.5)² + … + (26.5 − 16.5)² = 591. For the correction term, we calculate for X: (2³ − 2) + (2³ − 2) = 12; for Y: (2³ − 2) + (2³ − 2) + (2³ − 2) = 18; and for Z: (4³ − 4) + (3³ − 3) = 84. The sum of these correction terms is 114. Inserting into the equation for W' yields

$$W' = \frac{591}{\frac{1}{12}\big[3^2(10^3 - 10) - 3 \cdot 114\big]} = 0.828.$$
The value of W' = 0.83 suggests a high degree of concordance. That is, the three criteria suggest largely the same rankings. In the present example, this does not come as a surprise, because there exists an algebraic relation between these three criteria. We find that Y = X − F and Z = X + F, where F is the number of false solutions. Because of this dependency, the following significance test can be interpreted only as an illustration. Inserting into the χ² formula, we obtain X² = 3(10 − 1)·0.828 = 22.349. For df = 9, we find that p = 0.0078 and reject the null hypothesis of no concordance.
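The W' calculation can be replicated with a few lines of code. A sketch (Python with NumPy; the function is ours, and the ranks are those of Table 1.5):

```python
import numpy as np

def kendalls_w(ranks):
    """Kendall's coefficient of concordance with tie correction;
    ranks is a k x N array: k rankings of N objects."""
    r = np.asarray(ranks, dtype=float)
    k, n = r.shape
    col_sums = r.sum(axis=0)                          # rank sums R_i
    s = ((col_sums - col_sums.mean()) ** 2).sum()     # sum of squared deviations
    ties = 0                                          # sum of (t^3 - t) over all ties
    for row in r:
        counts = np.unique(row, return_counts=True)[1]
        ties += (counts ** 3 - counts).sum()
    return s / ((k ** 2 * (n ** 3 - n) - k * ties) / 12)

X = [4.5, 2, 1, 4.5, 3, 7.5, 6, 9, 7.5, 10]
Y = [2.5, 2.5, 1, 4.5, 4.5, 8, 9, 6.5, 10, 6.5]
Z = [2, 1, 4.5, 4.5, 4.5, 4.5, 8, 8, 8, 10]
w = kendalls_w([X, Y, Z])
print(round(w, 3), round(3 * (10 - 1) * w, 3))        # 0.828 22.349
```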
1.6 Measuring Agreement among Three or More Raters

The routine case of determining rater agreement involves two raters. However, occasionally three or more raters are used. In specific contexts, for example in political polls, large samples function as raters. The present section focuses on small numbers of raters. The formulas presented here describe measures for three raters. Generalizations to four or more raters are straightforward. When assessing the agreement among three raters, an I × I × I table is created. That is, it is required again that all raters use the same rating categories. This table has three diagonals. The diagonal that contains the agreement cells is the one with like cell indexes, that is, iii. More specifically, the cells 111, 222, …, III are the agreement cells.

For Cohen's non-weighted κ, we specify

$$\theta_1 = \sum_i p_{iii} \quad\text{and}\quad \theta_2 = \sum_i p_{i..}\,p_{.i.}\,p_{..i},$$

so that we obtain, as before,

$$\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}.$$

An estimator of κ for three raters under multinomial sampling is then

$$\hat{\kappa} = \frac{N^2\sum_i m_{iii} - \sum_i m_{i..}\,m_{.i.}\,m_{..i}}{N^3 - \sum_i m_{i..}\,m_{.i.}\,m_{..i}}.$$

The interpretation of κ for three or more raters is the same as for two raters.
For Brennan and Prediger's κₙ, we obtain the estimator

$$\hat{\kappa}_n = \frac{\sum_i m_{iii}/N - 1/I^2}{1 - 1/I^2}.$$

For both κ and κₙ for three or more raters, the binomial test can be used as significance test. Specifically, let p be estimated by θ̂₂ and q = 1 − p. The one-sided tail probability for κ > 0 is then

$$P = \sum_{j=\sum_i m_{iii}}^{N} \binom{N}{j} p^j q^{N-j}.$$

The estimator of raw agreement among three raters is

$$ra = \frac{\sum_i m_{iii}}{N}.$$
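For a three-way table stored as an I × I × I array, the three estimators can be computed as follows (a sketch in Python with NumPy; the function name is ours, and the 1/I² chance term for Brennan and Prediger's κₙ follows the estimator given above):

```python
import numpy as np

def three_rater_agreement(table):
    """kappa, Brennan & Prediger's kappa_n, and raw agreement
    for an I x I x I table of three raters' judgements."""
    m = np.asarray(table, dtype=float)
    n, n_cat = m.sum(), m.shape[0]
    idx = np.arange(n_cat)
    theta1 = m[idx, idx, idx].sum() / n      # raw agreement: sum of m_iii over N
    # chance agreement from the one-way marginals m_i.., m_.i., m_..i
    theta2 = (m.sum(axis=(1, 2)) * m.sum(axis=(0, 2))
              * m.sum(axis=(0, 1))).sum() / n ** 3
    kappa = (theta1 - theta2) / (1 - theta2)
    kappa_n = (theta1 - n_cat ** -2) / (1 - n_cat ** -2)
    return kappa, kappa_n, theta1
```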
Data example. For the following example, we analyze data from a study on the agreement of raters on the qualification of job applicants in a large agency in the United States.² A sample of 420 interview protocols was examined by three evaluators. Each evaluator indicated on a three-point scale the degree to which an applicant was close to the profile specified in the advertisement of the position, with 1 indicating a very good match and 3 indicating lack of match. In the following paragraphs, we employ the coefficients just introduced. Table 1.6 displays the cross-classification of the three raters' judgements in tabular form. The table contains in its first column the cell indexes, in its second column the observed frequencies, followed by the expected frequencies that result for Cohen's κ, and the standardized residuals, z. The overall Pearson X² = Σ(m_ijk − m̂_ijk)²/m̂_ijk for this cross-classification is 723.26 (df = 20; p < 0.01).

2. Log-Linear Models of Rater Agreement

2.1 A Log-Linear Base Model

The base model for the I × J cross-classification of two raters' judgements is

$$\log m_{ij} = \lambda_0 + \lambda_i^A + \lambda_j^B + e_{ij},$$

where λ₀ is the intercept, λᵢᴬ are the main effect parameters for the row variable (Rater A), λⱼᴮ are the main effect parameters for the column variable (Rater B), and eᵢⱼ are the residuals. In the models that we present in the following chapters, rater agreement is compared with this base model and is parameterized in terms of deviations from this model.
Different base models are conceivable. For instance, Brennan and Prediger (1981; see Section 1.3, above) propose using a model in which the two main effect terms, λᴬ and λᴮ, are fixed to be zero. In the following sections, we enrich the base model given here by introducing various terms that allow one to consider characteristics of measurement models, or to test specific hypotheses. The base model does not contain terms for the interaction between the two raters' responses. Interaction terms are not needed for the following two reasons. First, models for I × J cross-classifications that contain the A × B interaction are saturated. The present approach strives for more parsimonious models. Second, one reason why the raters' responses may be associated with each other is that they agree, but there may be other reasons for an association. The present approach attempts to model rater agreement. However, before presenting the general model, we give an example of the design matrix approach to log-linear modeling (Christensen, 1997). Consider a 3 × 3 table. The log-linear base model for this table can be written in matrix form as
$$
\begin{bmatrix}
\log m_{11} \\ \log m_{12} \\ \log m_{13} \\
\log m_{21} \\ \log m_{22} \\ \log m_{23} \\
\log m_{31} \\ \log m_{32} \\ \log m_{33}
\end{bmatrix}
=
\begin{bmatrix}
1 &  1 &  0 &  1 &  0 \\
1 &  1 &  0 &  0 &  1 \\
1 &  1 &  0 & -1 & -1 \\
1 &  0 &  1 &  1 &  0 \\
1 &  0 &  1 &  0 &  1 \\
1 &  0 &  1 & -1 & -1 \\
1 & -1 & -1 &  1 &  0 \\
1 & -1 & -1 &  0 &  1 \\
1 & -1 & -1 & -1 & -1
\end{bmatrix}
\begin{bmatrix}
\lambda_0 \\ \lambda_1^A \\ \lambda_2^A \\ \lambda_1^B \\ \lambda_2^B
\end{bmatrix}
+
\begin{bmatrix}
e_{11} \\ e_{12} \\ e_{13} \\
e_{21} \\ e_{22} \\ e_{23} \\
e_{31} \\ e_{32} \\ e_{33}
\end{bmatrix},
$$
where the column vector on the left-hand side of the equation contains the logarithms of the observed cell frequencies. The design matrix, X, on the right-hand side of the equation contains five column vectors. The first of these vectors contains only ones. It is the constant vector, used to estimate λ₀, the intercept parameter. The following two vectors represent the main effect of the first rater, A. The judgements from this rater appear in the rows of the 3 × 3 cross-classification. The first of these two vectors contrasts the first of the three categories with the third. The second of these vectors contrasts the second category with the third. The last two vectors of the design matrix represent the main effect of the second rater, B. The judgements from this rater appear in the columns of the 3 × 3 cross-classification. As for the first rater, the first column main-effect vector contrasts the first with the third category, and the second column main-effect vector contrasts the second with the third category. In different words, the first two main-effect vectors contrast rows, whereas the second two main-effect vectors contrast columns. To express the base model for the 3 × 3 example, we used the method of effects coding. There exists a number of alternative coding schemes. Most prominent is dummy coding (see Christensen, 1997). The two methods of coding are equivalent in the sense that each model that can be expressed using one of the methods can be expressed equivalently using the other method. We select for the present context effects coding because this method makes it easy to show which cells are contrasted with each other.³

³ Readers who use the SPSS package to recalculate the data examples in this booklet will notice that SPSS uses dummy coding by default.
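In software, fitting the base model amounts to a Poisson regression of the nine cell counts on this design matrix. A sketch (Python with NumPy and statsmodels; the cell frequencies are hypothetical, not from the book):

```python
import numpy as np
import statsmodels.api as sm

# Effects coding for three categories: category 3 is the reference
effect = {1: [1, 0], 2: [0, 1], 3: [-1, -1]}

# Rows ordered as cells 11, 12, 13, 21, 22, 23, 31, 32, 33
X = np.array([[1] + effect[i] + effect[j]
              for i in (1, 2, 3) for j in (1, 2, 3)])

m = np.array([20, 5, 1, 4, 30, 6, 2, 3, 29])     # hypothetical cell counts

base = sm.GLM(m, X, family=sm.families.Poisson()).fit()
print(base.params)    # lambda_0, lambda_1^A, lambda_2^A, lambda_1^B, lambda_2^B
```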
2.2 A Family of Log-Linear Models for Rater Agreement

Consider the I × I cross-classification of two raters' judgements of the same objects. In matrix notation, the base model given above can be expressed as

$$\log M = X\lambda + e,$$

where M is the column vector of observed frequencies, X is a design matrix (for an example see the matrix in Section 2.1, above), λ is a parameter vector, and e is the residual vector. For the models considered in the following chapters we use the form X = [X_b, X_δ, X_β, X_c], where X_b is the design matrix for the base model. An example was given in the last section. The adjoined, that is, horizontally merged, design matrices express those aspects of the bivariate frequency distribution that we consider when modeling rater agreement. X_δ is the design matrix that contains indicator variables for the cells that reflect rater agreement. These are typically the cells in the main diagonal of the I × I cross-classification. X_β is the design matrix that contains indicator variables used to specify characteristics of the measurement model concerning, for instance, the ordinal nature of the rating categories. X_c is the design matrix for covariate information. Accordingly, the parameter vector becomes λ′ = [λ′_b, λ′_δ, λ′_β, λ′_c]. The models presented in the following chapters use selections of these design matrices and parameter vectors. Any selection of these matrices is conceivable. Expressed in terms of the parameters that we estimate, the general model is

$$\log m_{ij} = \lambda_0 + \lambda_i^A + \lambda_j^B + \delta_{ij} + \beta u_i u_j + \lambda_{ij}^C + e_{ij}.$$
The order of the terms on the right hand side of this equation is the same as in the matrix form of the equation. It should be noted, however, that the order of terms is arbitrary. It has no effect on the magnitude and standard errors of parameter estimates. These estimates can depend, however, on the other parameters in the model. The following sections discuss submodels and the meaning of parameters in more detail.
2.3 Specific Log-Linear Models for Rater Agreement

The following sections introduce readers to specific log-linear models for rater agreement. We begin with the equal-weight agreement model, originally proposed by Tanner and Young (1985). For a discussion of such models in the context of symmetry and diagonal parameter models, see Lawal (2001).
2.3.1 The Equal-Weight Agreement Model

The first model that we consider for rater agreement adds one term to the base model introduced in Section 2.1. This model is

$$\log m_{ij} = \lambda_0 + \lambda_i^A + \lambda_j^B + \delta_{ij} + e_{ij},$$

where δ (delta) assigns weights to the cells in the main diagonal, that is, the cells where the two raters agree. Typically, one assigns equal weights to the cells in the main diagonal, that is,

$$\delta_{ij} = \begin{cases}\delta & \text{in diagonal cells}\\ 0 & \text{otherwise.}\end{cases}$$

The resulting model is called the equal-weight agreement model (Tanner & Young, 1985) or diagonal set model (Wickens, 1989). The model considers each rating category equally important for the assessment of rater agreement.

Data example. To illustrate the equal-weight agreement model, we use data from a study by Lienert (1978), in which two teachers estimate the intelligence of 40 students by assigning them into four groups. The groups are 1: IQ < 90; 2: 90 ≤ IQ < 100; 3: 100 ≤ IQ < 110; and 4: IQ ≥ 110. Table 2.1 displays the cross-classification of the two teachers' ratings.
Table 2.1: Cross-Classification of Two Teachers' Intelligence Ratings of Their Students
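Since the cell frequencies of Table 2.1 did not survive extraction, the following sketch fits the equal-weight agreement model to hypothetical 3 × 3 counts instead; the point is the single indicator column appended to the base-model design matrix (Python with NumPy and statsmodels):

```python
import numpy as np
import statsmodels.api as sm

effect = {1: [1, 0], 2: [0, 1], 3: [-1, -1]}
cells = [(i, j) for i in (1, 2, 3) for j in (1, 2, 3)]

X_b = np.array([[1] + effect[i] + effect[j] for i, j in cells])
X_delta = np.array([[1 if i == j else 0] for i, j in cells])  # agreement cells
X = np.hstack([X_b, X_delta])                                 # X = [X_b, X_delta]

m = np.array([20, 5, 1, 4, 30, 6, 2, 3, 29])   # hypothetical counts
ewa = sm.GLM(m, X, family=sm.families.Poisson()).fit()
print(ewa.params[-1])    # delta-hat, the equal-weight agreement parameter
```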
3. Exploring Rater Agreement

3.1 Configural Frequency Analysis: A Tutorial

If mᵢ > m̂ᵢ, Cell i is said to constitute a CFA type. If mᵢ < m̂ᵢ, Cell i is said to constitute a CFA antitype. If there is no statistically significant difference between mᵢ and m̂ᵢ, Cell i constitutes neither a type nor an antitype.⁷ Types suggest that the frequency in Cell i is significantly greater than expected based on the chance model used to estimate m̂ᵢ. Antitypes suggest that the frequency in Cell i is significantly smaller than expected based on the chance model used to estimate m̂ᵢ. CFA chance models reflect assumptions that are contradicted by the existence of types and antitypes. For example, if the chance model proposes independence between the variables that span a cross-classification, types and antitypes suggest that the variables are associated, at least locally, that is, in particular sectors of the cross-classification (Havranek & Lienert, 1984). In principle, any model that allows one to derive expected probabilities for a cross-classification can serve as a CFA chance model. Constraints result mostly from sampling schemes and from interpretational issues (von Eye & Schuster, 1998a). Four groups of base models have been discussed in the CFA literature (von Eye, 2002, 2004). The first and most frequently employed group includes hierarchical log-linear models. These are standard log-linear models which take main effects and variable interactions into account. The role of these models in CFA is to estimate the expected cell frequencies according to the specifications made in some base model. Estimation typically uses maximum likelihood methods and the observed marginal frequencies. The second group of base models estimates the expected cell frequencies using population parameters rather than observed marginal frequencies. The third group of base models uses theoretically derived, a priori probabilities for the estimation of the expected cell frequencies. Groups two and three of base models are not necessarily log-linear. The fourth group estimates the expected cell frequencies based on distributional assumptions. This group of base models is practically never log-linear.

⁷ A similar approach, with a focus on what is here called a type, was described by DuMouchel (1999) in the context of Bayesian exploration of large contingency tables.

In the present context, however, we focus on log-linear base models of the kind given above. More specifically, we ask whether the agreement between two raters results in types in the main diagonal of an I × I cross-classification, and in antitypes in the off-diagonal cells. More detail follows in the sections below. There exists a large number of statistical tests that can be used for the comparison of mᵢ with m̂ᵢ. Most popular are the binomial test and Pearson's X² test. The binomial test allows one to estimate the tail probability Bᵢ of the observed frequency, mᵢ, of Cell i, given the probability, pᵢ, expected for this cell from some base model and the sample size, N. The tail probability Bᵢ can be calculated as follows. For a CFA chance model, the expected probability for Cell i is estimated as p̂ᵢ = m̂ᵢ/N, where i indexes the cells of the cross-classification, also called configurations. Then, the tail probability for the observed frequency, mᵢ, and more extreme frequencies is

$$B_i(m_i) = \sum_{j=m_i}^{N} \binom{N}{j} p_i^j q_i^{N-j},$$

where i indexes the cells of the cross-classification, qᵢ = 1 − pᵢ, and the sum runs over the upper tail when mᵢ > m̂ᵢ; for a candidate antitype (mᵢ < m̂ᵢ), the sum runs over the lower tail, from j = 0 to mᵢ.
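A sketch of the binomial test as reconstructed above (Python with SciPy; the function name is ours):

```python
from scipy.stats import binom

def cfa_binomial_tail(m_i, m_hat_i, n):
    """One-sided binomial tail probability B_i for configuration i:
    upper tail if m_i > m_hat_i (candidate type),
    lower tail if m_i < m_hat_i (candidate antitype)."""
    p_i = m_hat_i / n                        # expected cell probability
    if m_i > m_hat_i:
        return binom.sf(m_i - 1, n, p_i)     # P(X >= m_i)
    return binom.cdf(m_i, n, p_i)            # P(X <= m_i)

# e.g., 30 observed vs. 15.2 expected in a sample of N = 100
print(cfa_binomial_tail(30, 15.2, 100))      # well below 0.05: a candidate type
```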