Analyzing Rater Agreement: Manifest Variable Methods
Alexander von Eye, Michigan State University
Eun Young Mun, University of Alabama at Birmingham

LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS
Mahwah, New Jersey    London
2005
Camera ready copy for this book was provided by the authors.
Copyright © 2005 by Lawrence Erlbaum Associates, Inc. All rights reserved. No part of this book may be reproduced in any form, by photostat, microform, retrieval system, or any other means, without prior written permission of the publisher.

Lawrence Erlbaum Associates, Inc., Publishers
10 Industrial Avenue
Mahwah, New Jersey 07430

Cover design by Kathryn Houghtaling Lacey

Library of Congress Cataloging-in-Publication Data

Eye, Alexander von.
Analyzing rater agreement: manifest variable methods / Alexander von Eye, Eun Young Mun.
p. cm.
Includes bibliographical references and index.
ISBN 0-8058-4967-X (alk. paper)
1. Multivariate analysis. 2. Acquiescence (Psychology)—Statistical methods. I. Mun, Eun Young. II. Title.
QA278.E94 2004
519.5'35—dc22    2004043344    CIP

Books published by Lawrence Erlbaum Associates are printed on acid-free paper, and their bindings are chosen for strength and durability.

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
Disclaimer: This eBook does not include the ancillary media that was packaged with the original printed version of the book.
Contents

Preface    ix

1. Coefficients of Rater Agreement    1
   1.1 Cohen's κ (kappa)    2
       1.1.1 κ as a Summary Statement for the Entire Agreement Table    2
       1.1.2 Conditional κ    8
   1.2 Weighted κ    10
   1.3 Raw Agreement, Brennan and Prediger's κₙ, and a Comparison with Cohen's κ    13
   1.4 The Power of κ    17
   1.5 Kendall's W for Ordinal Data    19
   1.6 Measuring Agreement among Three or More Raters    22
   1.7 Many Raters or Many Comparison Objects    25
   1.8 Exercises    27

2. Log-Linear Models of Rater Agreement    31
   2.1 A Log-Linear Base Model    32
   2.2 A Family of Log-Linear Models for Rater Agreement    34
   2.3 Specific Log-Linear Models for Rater Agreement    35
       2.3.1 The Equal-Weight Agreement Model    35
       2.3.2 The Weight-by-Response-Category Agreement Model    40
       2.3.3 Models with Covariates    41
           2.3.3.1 Models for Rater Agreement with Categorical Covariates    42
           2.3.3.2 Models for Rater Agreement with Continuous Covariates    48
       2.3.4 Rater Agreement plus Linear-by-Linear Association for Ordinal Variables    54
       2.3.5 Differential Weight Agreement Model with Linear-by-Linear Interaction plus Covariates    59
   2.4 Extensions    63
       2.4.1 Modeling Agreement among More than Two Raters    63
           2.4.1.1 Estimation of Rater-Pair-Specific Parameters    64
           2.4.1.2 Agreement among Three Raters    67
       2.4.2 Rater-Specific Trends    68
       2.4.3 Generalized Coefficients κ    70
   2.5 Exercises    75

3. Exploring Rater Agreement    79
   3.1 Configural Frequency Analysis: A Tutorial    80
   3.2 CFA Base Models for Rater Agreement Data    85
1. Coefficients of Rater Agreement

The number of coefficients for rater agreement is large. However, the number of coefficients that are actually used in empirical research is small. The present section discusses five of the more frequently used coefficients. The first is Cohen's (1960) κ (kappa), one of the most widely employed coefficients in the social sciences. κ is a coefficient for nominal-level variables. The second and the third coefficients are variants of κ. The second coefficient, weighted κ, allows the statistical analyst to place differential weights on discrepant ratings. This coefficient requires ordinal rating scales. The third coefficient is Brennan and Prediger's κₙ, a variant of κ that uses a different chance model than the original κ. The fourth coefficient is raw agreement, which expresses the degree of agreement as the percentage of judgements in which the raters agree. The fifth coefficient discussed here is Kendall's W. This coefficient is defined for ordinal variables.
1.1 Cohen's κ (Kappa)

Clearly the most frequently used coefficient of rater agreement is Cohen's (1960) kappa, κ. In its original form, which is presented in this section, this coefficient can be applied to square cross-classifications of two raters' judgements (variants for three or more raters are presented in Section 1.6). These cross-classifications are also called agreement tables. Consider the two raters A and B who used the three categories 1, 2, and 3 to evaluate students' performance in English. The cross-classification of these raters' judgements can be depicted as given in Table 1.1.
The interpretation of the frequencies, m_ij, in the cross-classification given in Table 1.1 is straightforward. Cell 11 displays the number of instances in which both Rater A and Rater B used Category 1. Cell 12 contains the number of instances in which Rater A used Category 1 and Rater B used Category 2, and so forth. The cells with indexes i = j display the numbers of incidences in which the two raters used the same category. These cells are also called the agreement cells (the diagonal cells m11, m22, and m33 in Table 1.1).
Table 1.1: Cross-Classification of Two Raters' Judgements

                                 Rater B Rating Categories
                                   1       2       3
Rater A Rating Categories   1     m11     m12     m13
                            2     m21     m22     m23
                            3     m31     m32     m33
The following two sections first introduce κ as a coefficient that allows one to describe rater agreement in the form of a summary statement for an entire table. Second, conditional κ is introduced. This measure allows one to describe rater agreement separately for each rating category.

1.1.1 κ as a Summary Statement for the Entire Agreement Table
To introduce Cohen's κ, let p_ij be the probability of Cell ij. The cells that indicate rater agreement, that is, the agreement cells, have probability p_ii. The parameter

$$\theta_1 = \sum_{i=1}^{I} p_{ii}$$

describes the proportion of instances in which the two raters agree, where I is the number of categories. To have a reference with which θ₁ can be compared, we assume independence of the two raters. In other words, we assume that the raters did not influence each other when evaluating the students. (Later in this text, we will see that this assumption corresponds to main effects models in log-linear analysis and Configural Frequency Analysis.) Based on this assumption, we can estimate the proportion of instances in which the two raters agree by chance using

$$\theta_2 = \sum_{i=1}^{I} p_{i.}\,p_{.i},$$

where p_i. and p_.i are the row and column marginal probabilities. Cohen's κ relates the two quantities:

$$\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}.$$
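As a quick computational check, the following sketch computes κ as just defined (Python with NumPy; the function name and the example table are illustrative and not from the book):

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa for a square I x I agreement table of counts."""
    m = np.asarray(table, dtype=float)
    p = m / m.sum()                           # cell probabilities p_ij
    theta1 = np.trace(p)                      # observed agreement: sum of p_ii
    theta2 = p.sum(axis=1) @ p.sum(axis=0)    # chance agreement: sum of p_i. p_.i
    return (theta1 - theta2) / (1 - theta2)

# Hypothetical 3 x 3 table for Raters A and B using categories 1-3
table = [[20,  5,  1],
         [ 4, 30,  6],
         [ 2,  3, 29]]
print(round(cohens_kappa(table), 3))          # 0.681
```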
1.3 Raw Agreement, Brennan and Prediger's κₙ, and a Comparison with Cohen's κ

Figure 2: κ, κₙ, and ra when agreement decreases; marginals discrepant. (The figure plots Cohen's κ, Brennan and Prediger's κₙ, and raw agreement against the frequency in Cell 21.)
From the earlier considerations and these simulations, we conclude that

(1) for ra > 0.5, which indicates that rater agreement is more likely than disagreement, the three measures ra, κ, and κₙ correlate strongly;
(2) for ra > 0.5 and κ ≠ 1, the magnitude of κ is only a weak indicator of strength of agreement; instead, κ indicates strength of agreement above and beyond the variation that can be explained by taking into account the main effects;
(3) thus, except for the case in which κ = 1.0, κ is not a measure of raw agreement but a measure of agreement beyond chance (see the sketch below);
(4) researchers are well advised when they report raw agreement or Brennan and Prediger's κₙ in addition to Cohen's κ.
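Conclusion (3) is easy to reproduce numerically. The sketch below (Python with NumPy; the helper and the table are ours, chosen so that the marginals are discrepant) returns all three measures:

```python
import numpy as np

def agreement_measures(table):
    """Raw agreement, Cohen's kappa, and Brennan & Prediger's kappa_n
    for a square two-rater agreement table."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    n_cat = p.shape[0]
    ra = np.trace(p)                              # raw agreement
    theta2 = p.sum(axis=1) @ p.sum(axis=0)        # chance model with main effects
    kappa = (ra - theta2) / (1 - theta2)
    kappa_n = (ra - 1 / n_cat) / (1 - 1 / n_cat)  # uniform chance model, 1/I
    return ra, kappa, kappa_n

# Discrepant marginals: raw agreement is high although kappa is low
ra, k, kn = agreement_measures([[80, 15],
                                [ 2,  3]])
print(round(ra, 2), round(k, 3), round(kn, 3))    # 0.83 0.198 0.66
```

Here ra = 0.83 although κ ≈ 0.20, illustrating that κ measures agreement beyond chance rather than raw agreement.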
1.4 The Power of κ

In this section, power considerations for κ are reported for two cases. The first case covers sample size calculations for 2 × 2 tables. The second case is more general. It covers sample size calculations for tables of size 3 × 3 or larger for two or more raters.
The sample size requirements for κ under various hypotheses in 2 × 2 tables have been described by Cantor (1996). Based on Fleiss, Cohen, and Everitt (1969), the asymptotic variance of the estimate of κ can be given by Q/N, with

$$Q = \frac{\sum_i p_{ii}\big[(1-\theta_2) - (p_{i.}+p_{.i})(1-\theta_1)\big]^2 + (1-\theta_1)^2\sum_{i\neq j} p_{ij}(p_{.i}+p_{j.})^2 - (\theta_1\theta_2 - 2\theta_2 + \theta_1)^2}{(1-\theta_2)^4},$$

where θ₁, θ₂, p_ii, p_i., and p_.i are defined as in Section 1.1, p_ij indicates the probability of the off-diagonal cells, and p_.j is the probability of Column j. Fortunately, the values of Q are tabulated for various values of p_i. and p_.j (Cantor, 1996, p. 152), so that hand calculations are unnecessary.
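Where tabulated values are inconvenient, Q can also be computed directly. The following is a sketch transcribing the formula as reconstructed above (Python with NumPy; the function name is ours):

```python
import numpy as np

def fleiss_q(p):
    """Q of Fleiss, Cohen, and Everitt (1969), so that Var(kappa-hat) = Q/N;
    p is the I x I matrix of cell probabilities p_ij."""
    p = np.asarray(p, dtype=float)
    row, col = p.sum(axis=1), p.sum(axis=0)       # p_i. and p_.i
    theta1, theta2 = np.trace(p), row @ col
    # sum_i p_ii [ (1 - theta2) - (p_i. + p_.i)(1 - theta1) ]^2
    t1 = (np.diag(p) * ((1 - theta2) - (row + col) * (1 - theta1)) ** 2).sum()
    # (1 - theta1)^2 * sum_{i != j} p_ij (p_.i + p_j.)^2
    w = col[:, None] + row[None, :]               # w[i, j] = p_.i + p_j.
    t2 = (1 - theta1) ** 2 * ((p * w ** 2).sum()
                              - (np.diag(p) * np.diag(w) ** 2).sum())
    t3 = (theta1 * theta2 - 2 * theta2 + theta1) ** 2
    return (t1 + t2 - t3) / (1 - theta2) ** 4
```

As a plausibility check, a table with perfect agreement returns Q = 0, as it should, since the estimate of κ then has no sampling variance.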
Now suppose a researcher intends to test the null hypothesis H₀: κ = κ₀ with α = 0.05 and power 1 − β = 0.80. The sample size required for this situation is

$$N = \left[\frac{z_\alpha\sqrt{Q_0} + z_\beta\sqrt{Q_1}}{\kappa_1 - \kappa_0}\right]^2,$$

where z_α is the z-score that corresponds to the significance level α (e.g., for α = 0.05, one obtains z₀.₀₅ = 1.645; accordingly, for β = 0.2, one obtains z₀.₂ = 0.842), and Q₀ and Q₁ are the Q-scores for the null hypothesis and the alternative hypothesis, respectively. Note that the null hypothesis that κ₀ = 0 can be tested using this methodology. The values for Q can be either calculated or taken from Cantor's Table 1.
Consider the following example. A police officer and an insurance agent determine whether drivers in traffic accidents are at-fault or not-at-fault. A researcher intends to test the null hypothesis that κ₀ = 0.3 against the one-sided alternative hypothesis that κ > 0.3, with α = 0.05 and power = 0.8 for κ₁ = 0.5. We can take the two z-scores from above and find in the table Q₀ = 0.91 and Q₁ = 0.75. Inserting into the equation for N yields

$$N = \left[\frac{1.645\sqrt{0.91} + 0.842\sqrt{0.75}}{0.5 - 0.3}\right]^2 = 132.07.$$

Thus, to reject the null hypothesis that κ₀ = 0.3 in favor of the one-sided alternative hypothesis that κ₁ = 0.5 with power 0.8 at a significance level of α = 0.05, the researcher needs at least 132 traffic accidents in the sample.

For more hypothesis tests, including tests that compare rating patterns from two independent samples, see Cantor (1996). For tables larger than 2 × 2 see, for instance, Flack, Afifi, and Lachenbruch (1988).
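The sample size formula is equally simple to script. A minimal sketch (Python with SciPy; the function name is ours) that reproduces the traffic-accident example:

```python
from scipy.stats import norm

def kappa_sample_size(kappa0, kappa1, q0, q1, alpha=0.05, power=0.80):
    """N for a one-sided test of H0: kappa = kappa0 vs. H1: kappa = kappa1,
    following Cantor (1996); q0 and q1 are the Q values under H0 and H1."""
    z_alpha = norm.ppf(1 - alpha)    # 1.645 for alpha = 0.05
    z_beta = norm.ppf(power)         # 0.842 for power = 0.80
    return ((z_alpha * q0 ** 0.5 + z_beta * q1 ** 0.5) / (kappa1 - kappa0)) ** 2

print(kappa_sample_size(0.3, 0.5, q0=0.91, q1=0.75))   # ~132, as in the text
```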
A more general approach to power analysis for κ was recently proposed by Indurkhya, Zayas, and Buka (2004). The authors note that, under the usual multinomial sampling scheme, the cell frequencies follow a Dirichlet-multinomial distribution. This distribution can then be used to estimate the probability that all raters choose category i. Based on this estimate, the authors derive a χ²-distributed statistic for the null hypothesis that κ = 0 and an equation for the required sample size. This estimate depends on the probabilities that all raters chose a given rating category, the number of rating categories, the significance level, and the desired power. The authors present two tables with minimum required sample sizes. The first table states the required sample sizes for the null hypothesis that κ₀ = 0.4 versus the alternative hypothesis that κ₁ = 0.6, for α = 0.05 and power = 0.8. The second table presents the required sample sizes for the null hypothesis that κ₀ = 0.6 versus the alternative hypothesis that κ₁ = 0.8, also for α = 0.05 and power = 0.8. Table 1.4 summarizes these two tables.
Table 1.4: Minimum Required Sample Sizes for α = 0.05 and Power = 0.8 (adapted from Indurkhya et al., 2004)

Null hypothesis: κ₀ = 0.4; alternative hypothesis: κ₁ = 0.6

3 Categories                          Number of raters
π₁      π₂      π₃             2      3      4      5      6
0.1     0.1     0.8          205    113     83     68     59
0.1     0.4     0.5          127     69     50     40     35
0.33    0.33    0.34         107     58     42     35     30

4 Categories                          Number of raters
π₁      π₂      π₃      π₄     2      3      4      5      6
0.1     0.1     0.1     0.7  102     42     38     32     29
0.1     0.3     0.3     0.3   88     30     30     29     27
0.25    0.25    0.25    0.25  60     28     27     25     25

Null hypothesis: κ₀ = 0.6; alternative hypothesis: κ₁ = 0.8

3 Categories                          Number of raters
π₁      π₂      π₃             2      3      4      5      6
0.1     0.1     0.8          172    102     77     66     58
0.1     0.4     0.5          102     60     46     40     35
0.33    0.33    0.34          87     52     39     33     30

4 Categories                          Number of raters
π₁      π₂      π₃      π₄     2      3      4      5      6
0.1     0.1     0.1     0.7  157     74     68     52     49
0.1     0.3     0.3     0.3   89     38     34     27     24
0.25    0.25    0.25    0.25  68     26     24     23     20
The sample sizes listed in Table 1.4 suggest that the required sample size decreases as the number of raters and the number of rating categories increase. In addition, as the probabilities of the categories become more discrepant, the required sample size increases. Donner and Eliasziw (1992) presented similar results for the case of two raters and two rating categories.
1.5 Kendall's W for Ordinal Data

For ordinal (rank-ordered) data, Kendall's (1962) W is often preferred over κ (for interval-level data, κ is equivalent to the well-known intraclass correlation; see Rae, 1988). W allows one to (1) compare many objects; (2) compare these objects in a number of criteria; and (3) take the ordinal nature of ranks into account. The measure W, also called the coefficient of concordance, compares the agreement found between two or more raters with perfect agreement. The measure is

$$W = \frac{s}{\frac{1}{12}k^2(N^3 - N)},$$

where N is the number of rated objects, k is the number of raters, and s is the sum of the squared deviations of the rank sums Rᵢ from the average rank sum, that is,

$$s = \sum_{i=1}^{N} (R_i - \bar{R})^2,$$

with i = 1, …, N, where Rᵢ is the sum of the ranks assigned to object i (i.e., the column sum in Table 1.5, below). The maximum sum of squared deviations is

$$\frac{1}{12}k^2(N^3 - N).$$

This is the s for perfect rater agreement. For small samples, the critical values of W can be found in tables (e.g., in Siegel, 1956). For 8 or more objects, the χ²-distributed statistic

$$X^2 = \frac{s}{\frac{1}{12}kN(N+1)} = k(N-1)W$$

can be used under df = N − 1. Large, significant values of W suggest strong agreement among the k raters.
When there are ties, that is, judgements share the same rank, the maximum deviance from the average rank is smaller than the maximum deviance without ties. Therefore, a correction element is introduced in the denominator of the formula for Kendall's W. The formula becomes

$$W' = \frac{s}{\frac{1}{12}\Big[k^2(N^3 - N) - k\sum_j \big(t_j^3 - t_j\big)\Big]},$$

where tⱼ is the number of tied ranks in the jth tie. This quantity is also called the length of the jth tie. If there are no ties, there will be N "ties" of length 1, and the second term in the denominator disappears. The corrected formula then becomes identical to the original formula.

Data example. The following example re-analyzes data presented by Lienert (1978). A sample of 10 participants processed a psychometric test. The solutions provided by the participants were scored according to the three criteria X = number of correct solutions, Y = number of correct solutions minus number of incorrect solutions, and Z = number of items attempted. The participants were ranked in each of these three criteria. We now ask how similar the rankings under the three criteria are. Table 1.5 displays the three series of ranks.
Table 1.5: Ranks of 10 Subjects in the Criteria X = Number of Correct Solutions, Y = Number of Correct Solutions minus Number of Incorrect Solutions, and Z = Number of Items Attempted

Participant    1     2     3     4     5     6     7     8     9    10
Criterion X   4.5    2     1    4.5    3    7.5    6     9    7.5   10
          Y   2.5   2.5    1    4.5   4.5    8     9    6.5   10    6.5
          Z    2     1    4.5   4.5   4.5   4.5    8     8     8    10
Sum            9    5.5   6.5  13.5   12    20    23   23.5  25.5  26.5
The average rank is calculated as the sum of all ranks over the number of participants. We obtain ΣR = 165 and R̄ = 16.5. For the sum of the squared deviations we obtain s = (5.5 − 16.5)² + … + (26.5 − 16.5)² = 591. For the correction term, we calculate for X: (2³ − 2) + (2³ − 2) = 12; for Y: (2³ − 2) + (2³ − 2) + (2³ − 2) = 18; and for Z: (4³ − 4) + (3³ − 3) = 84. The sum of these correction terms is 114. Inserting into the equation for W' yields

$$W' = \frac{591}{\frac{1}{12}\big[3^2(10^3 - 10) - 3 \cdot 114\big]} = 0.828.$$
The value of W' = 0.83 suggests a high degree of concordance. That is, the three criteria suggest largely the same rankings. In the present example, this does not come as a surprise, because there exists an algebraic relation between these three criteria. We find that Y = X − F and Z = X + F, where F is the number of false solutions. Because of this dependency, the following significance test can be interpreted only as an illustration. Inserting into the χ² formula, we obtain X² = 3(10 − 1)·0.828 = 22.349. For df = 9, we find that p = 0.0078 and reject the null hypothesis of no concordance.
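The W' calculation can be replicated with a few lines of code. A sketch (Python with NumPy; the function is ours, and the ranks are those of Table 1.5):

```python
import numpy as np

def kendalls_w(ranks):
    """Kendall's coefficient of concordance with tie correction;
    ranks is a k x N array: k rankings of N objects."""
    r = np.asarray(ranks, dtype=float)
    k, n = r.shape
    col_sums = r.sum(axis=0)                          # rank sums R_i
    s = ((col_sums - col_sums.mean()) ** 2).sum()     # sum of squared deviations
    ties = 0                                          # sum of (t^3 - t) over all ties
    for row in r:
        counts = np.unique(row, return_counts=True)[1]
        ties += (counts ** 3 - counts).sum()
    return s / ((k ** 2 * (n ** 3 - n) - k * ties) / 12)

X = [4.5, 2, 1, 4.5, 3, 7.5, 6, 9, 7.5, 10]
Y = [2.5, 2.5, 1, 4.5, 4.5, 8, 9, 6.5, 10, 6.5]
Z = [2, 1, 4.5, 4.5, 4.5, 4.5, 8, 8, 8, 10]
w = kendalls_w([X, Y, Z])
print(round(w, 3), round(3 * (10 - 1) * w, 3))        # 0.828 22.349
```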
1.6 Measuring Agreement among Three or More Raters

The routine case of determining rater agreement involves two raters. However, occasionally three or more raters are used. In specific contexts, for example in political polls, large samples function as raters. The present section focuses on small numbers of raters. The formulas presented here describe measures for three raters. Generalizations to four or more raters are straightforward. When assessing the agreement among three raters, an I × I × I table is created. That is, it is required again that all raters use the same rating categories. This table has three diagonals. The diagonal that contains the agreement cells is the one with like cell indexes, that is, iii. More specifically, the cells 111, 222, …, III are the agreement cells.

For Cohen's non-weighted κ, we specify

$$\theta_1 = \sum_i p_{iii} \quad\text{and}\quad \theta_2 = \sum_i p_{i..}\,p_{.i.}\,p_{..i},$$

so that we obtain, as before,

$$\kappa = \frac{\theta_1 - \theta_2}{1 - \theta_2}.$$

An estimator of κ for three raters under multinomial sampling is then

$$\hat{\kappa} = \frac{N^2\sum_i m_{iii} - \sum_i m_{i..}\,m_{.i.}\,m_{..i}}{N^3 - \sum_i m_{i..}\,m_{.i.}\,m_{..i}}.$$

The interpretation of κ for three or more raters is the same as for two raters.
For Brennan and Prediger's κₙ, we obtain the estimator

$$\hat{\kappa}_n = \frac{\sum_i m_{iii}/N - 1/I^2}{1 - 1/I^2}.$$

For both κ and κₙ for three or more raters, the binomial test can be used as significance test. Specifically, let p be estimated by θ̂₂ and q = 1 − p. The one-sided tail probability for κ > 0 is then

$$P = \sum_{j=\sum_i m_{iii}}^{N} \binom{N}{j} p^j q^{N-j}.$$

The estimator of raw agreement among three raters is

$$ra = \frac{\sum_i m_{iii}}{N}.$$
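For a three-way table stored as an I × I × I array, the three estimators can be computed as follows (a sketch in Python with NumPy; the function name is ours, and the 1/I² chance term for Brennan and Prediger's κₙ follows the estimator given above):

```python
import numpy as np

def three_rater_agreement(table):
    """kappa, Brennan & Prediger's kappa_n, and raw agreement
    for an I x I x I table of three raters' judgements."""
    m = np.asarray(table, dtype=float)
    n, n_cat = m.sum(), m.shape[0]
    idx = np.arange(n_cat)
    theta1 = m[idx, idx, idx].sum() / n      # raw agreement: sum of m_iii over N
    # chance agreement from the one-way marginals m_i.., m_.i., m_..i
    theta2 = (m.sum(axis=(1, 2)) * m.sum(axis=(0, 2))
              * m.sum(axis=(0, 1))).sum() / n ** 3
    kappa = (theta1 - theta2) / (1 - theta2)
    kappa_n = (theta1 - n_cat ** -2) / (1 - n_cat ** -2)
    return kappa, kappa_n, theta1
```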
Data example. For the following example, we analyze data from a study on the agreement of raters on the qualification of job applicants in a large agency in the United States.² A sample of 420 interview protocols was examined by three evaluators. Each evaluator indicated on a three-point scale the degree to which an applicant was close to the profile specified in the advertisement of the position, with 1 indicating a very good match and 3 indicating lack of match. In the following paragraphs, we employ the coefficients just introduced. Table 1.6 displays the cross-classification of the three raters' judgements in tabular form. The table contains in its first column the cell indexes, in its second column the observed frequencies, followed by the expected frequencies that result for Cohen's κ, and the standardized residuals, z. The overall Pearson X² = Σ(m_ijk − m̂_ijk)²/m̂_ijk for this cross-classification is 723.26 (df = 20; p < 0.01).

2. Log-Linear Models of Rater Agreement

2.1 A Log-Linear Base Model

The base model for the I × J cross-classification of two raters' judgements is

$$\log m_{ij} = \lambda_0 + \lambda_i^A + \lambda_j^B + e_{ij},$$

where λ₀ is the intercept, λᵢᴬ are the main effect parameters for the row variable (Rater A), λⱼᴮ are the main effect parameters for the column variable (Rater B), and eᵢⱼ are the residuals. In the models that we present in the following chapters, rater agreement is compared with this base model and is parameterized in terms of deviations from this model.
Different base models are conceivable. For instance, Brennan and Prediger (1981; see Section 1.3, above) propose using a model in which the two main effect terms, λᴬ and λᴮ, are fixed to be zero. In the following sections, we enrich the base model given here by introducing various terms that allow one to consider characteristics of measurement models, or to test specific hypotheses. The base model does not contain terms for the interaction between the two raters' responses. Interaction terms are not needed for the following two reasons. First, models for I × J cross-classifications that contain the A × B interaction are saturated. The present approach strives for more parsimonious models. Second, one reason why the raters' responses may be associated with each other is that they agree, but there may be other reasons for an association. The present approach attempts to model rater agreement. However, before presenting the general model, we give an example of the design matrix approach to log-linear modeling (Christensen, 1997). Consider a 3 × 3 table. The log-linear base model for this table can be written in matrix form as
$$
\begin{bmatrix}
\log m_{11} \\ \log m_{12} \\ \log m_{13} \\
\log m_{21} \\ \log m_{22} \\ \log m_{23} \\
\log m_{31} \\ \log m_{32} \\ \log m_{33}
\end{bmatrix}
=
\begin{bmatrix}
1 &  1 &  0 &  1 &  0 \\
1 &  1 &  0 &  0 &  1 \\
1 &  1 &  0 & -1 & -1 \\
1 &  0 &  1 &  1 &  0 \\
1 &  0 &  1 &  0 &  1 \\
1 &  0 &  1 & -1 & -1 \\
1 & -1 & -1 &  1 &  0 \\
1 & -1 & -1 &  0 &  1 \\
1 & -1 & -1 & -1 & -1
\end{bmatrix}
\begin{bmatrix}
\lambda_0 \\ \lambda_1^A \\ \lambda_2^A \\ \lambda_1^B \\ \lambda_2^B
\end{bmatrix}
+
\begin{bmatrix}
e_{11} \\ e_{12} \\ e_{13} \\
e_{21} \\ e_{22} \\ e_{23} \\
e_{31} \\ e_{32} \\ e_{33}
\end{bmatrix},
$$
where the column vector on the left-hand side of the equation contains the logarithms of the observed cell frequencies. The design matrix, X, on the right-hand side of the equation contains five column vectors. The first of these vectors contains only ones. It is the constant vector, used to estimate λ₀, the intercept parameter. The following two vectors represent the main effect of the first rater, A. The judgements from this rater appear in the rows of the 3 × 3 cross-classification. The first of these two vectors contrasts the first of the three categories with the third. The second of these vectors contrasts the second category with the third. The last two vectors of the design matrix represent the main effect of the second rater, B. The judgements from this rater appear in the columns of the 3 × 3 cross-classification. As for the first rater, the first column main-effect vector contrasts the first with the third category, and the second column main-effect vector contrasts the second with the third category. In different words, the first two main-effect vectors contrast rows, whereas the second two main-effect vectors contrast columns. To express the base model for the 3 × 3 example, we used the method of effects coding. There exists a number of alternative coding schemes. Most prominent is dummy coding (see Christensen, 1997). The two methods of coding are equivalent in the sense that each model that can be expressed using one of the methods can be expressed equivalently using the other method. We select for the present context effects coding because this method makes it easy to show which cells are contrasted with each other.³

³ Readers who use the SPSS package to recalculate the data examples in this booklet will notice that SPSS uses dummy coding by default.
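In software, fitting the base model amounts to a Poisson regression of the nine cell counts on this design matrix. A sketch (Python with NumPy and statsmodels; the cell frequencies are hypothetical, not from the book):

```python
import numpy as np
import statsmodels.api as sm

# Effects coding for three categories: category 3 is the reference
effect = {1: [1, 0], 2: [0, 1], 3: [-1, -1]}

# Rows ordered as cells 11, 12, 13, 21, 22, 23, 31, 32, 33
X = np.array([[1] + effect[i] + effect[j]
              for i in (1, 2, 3) for j in (1, 2, 3)])

m = np.array([20, 5, 1, 4, 30, 6, 2, 3, 29])     # hypothetical cell counts

base = sm.GLM(m, X, family=sm.families.Poisson()).fit()
print(base.params)    # lambda_0, lambda_1^A, lambda_2^A, lambda_1^B, lambda_2^B
```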
2.2 A Family of Log-Linear Models for Rater Agreement

Consider the I × I cross-classification of two raters' judgements of the same objects. In matrix notation, the base model given above can be expressed as

$$\log M = X\lambda + e,$$

where M is the column vector of observed frequencies, X is a design matrix (for an example see the matrix in Section 2.1, above), λ is a parameter vector, and e is the residual vector. For the models considered in the following chapters we use the form X = [X_b, X_δ, X_β, X_c], where X_b is the design matrix for the base model. An example was given in the last section. The adjoined, that is, horizontally merged, design matrices express those aspects of the bivariate frequency distribution that we consider when modeling rater agreement. X_δ is the design matrix that contains indicator variables for the cells that reflect rater agreement. These are typically the cells in the main diagonal of the I × I cross-classification. X_β is the design matrix that contains indicator variables used to specify characteristics of the measurement model concerning, for instance, the ordinal nature of the rating categories. X_c is the design matrix for covariate information. Accordingly, the parameter vector becomes λ′ = [λ′_b, λ′_δ, λ′_β, λ′_c]. The models presented in the following chapters use selections of these design matrices and parameter vectors. Any selection of these matrices is conceivable. Expressed in terms of the parameters that we estimate, the general model is

$$\log m_{ij} = \lambda_0 + \lambda_i^A + \lambda_j^B + \delta_{ij} + \beta u_i u_j + \lambda_{ij}^C + e_{ij}.$$
The order of the terms on the right hand side of this equation is the same as in the matrix form of the equation. It should be noted, however, that the order of terms is arbitrary. It has no effect on the magnitude and standard errors of parameter estimates. These estimates can depend, however, on the other parameters in the model. The following sections discuss submodels and the meaning of parameters in more detail.
2.3 Specific Log-Linear Models for Rater Agreement

The following sections introduce readers to specific log-linear models for rater agreement. We begin with the equal-weight agreement model, originally proposed by Tanner and Young (1985). For a discussion of such models in the context of symmetry and diagonal parameter models, see Lawal (2001).
2.3.1 The Equal-Weight Agreement Model

The first model that we consider for rater agreement adds one term to the base model introduced in Section 2.1. This model is

$$\log m_{ij} = \lambda_0 + \lambda_i^A + \lambda_j^B + \delta_{ij} + e_{ij},$$

where δ (delta) assigns weights to the cells in the main diagonal, that is, the cells where the two raters agree. Typically, one assigns equal weights to the cells in the main diagonal, that is,

$$\delta_{ij} = \begin{cases}\delta & \text{in diagonal cells}\\ 0 & \text{otherwise.}\end{cases}$$

The resulting model is called the equal-weight agreement model (Tanner & Young, 1985) or diagonal set model (Wickens, 1989). The model considers each rating category equally important for the assessment of rater agreement.

Data example. To illustrate the equal-weight agreement model, we use data from a study by Lienert (1978), in which two teachers estimate the intelligence of 40 students by assigning them into four groups. The groups are 1: IQ < 90; 2: 90 ≤ IQ < 100; 3: 100 ≤ IQ < 110; and 4: IQ ≥ 110. Table 2.1 displays the cross-classification of the two teachers' ratings.
Table 2.1: Cross-Classification of Two Teachers' Intelligence Ratings of Their Students
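Since the cell frequencies of Table 2.1 did not survive extraction, the following sketch fits the equal-weight agreement model to hypothetical 3 × 3 counts instead; the point is the single indicator column appended to the base-model design matrix (Python with NumPy and statsmodels):

```python
import numpy as np
import statsmodels.api as sm

effect = {1: [1, 0], 2: [0, 1], 3: [-1, -1]}
cells = [(i, j) for i in (1, 2, 3) for j in (1, 2, 3)]

X_b = np.array([[1] + effect[i] + effect[j] for i, j in cells])
X_delta = np.array([[1 if i == j else 0] for i, j in cells])  # agreement cells
X = np.hstack([X_b, X_delta])                                 # X = [X_b, X_delta]

m = np.array([20, 5, 1, 4, 30, 6, 2, 3, 29])   # hypothetical counts
ewa = sm.GLM(m, X, family=sm.families.Poisson()).fit()
print(ewa.params[-1])    # delta-hat, the equal-weight agreement parameter
```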
3. Exploring Rater Agreement

3.1 Configural Frequency Analysis: A Tutorial

If mᵢ > m̂ᵢ, Cell i is said to constitute a CFA type. If mᵢ < m̂ᵢ, Cell i is said to constitute a CFA antitype. If there is no statistically significant difference between mᵢ and m̂ᵢ, Cell i constitutes neither a type nor an antitype.⁷ Types suggest that the frequency in Cell i is significantly greater than expected based on the chance model used to estimate m̂ᵢ. Antitypes suggest that the frequency in Cell i is significantly smaller than expected based on the chance model used to estimate m̂ᵢ. CFA chance models reflect assumptions that are contradicted by the existence of types and antitypes. For example, if the chance model proposes independence between the variables that span a cross-classification, types and antitypes suggest that the variables are associated, at least locally, that is, in particular sectors of the cross-classification (Havranek & Lienert, 1984). In principle, any model that allows one to derive expected probabilities for a cross-classification can serve as a CFA chance model. Constraints result mostly from sampling schemes and from interpretational issues (von Eye & Schuster, 1998a). Four groups of base models have been discussed in the CFA literature (von Eye, 2002, 2004). The first and most frequently employed group includes hierarchical log-linear models. These are standard log-linear models which take main effects and variable interactions into account. The role of these models in CFA is to estimate the expected cell frequencies according to the specifications made in some base model. Estimation typically uses maximum likelihood methods and the observed marginal frequencies. The second group of base models estimates the expected cell frequencies using population parameters rather than observed marginal frequencies. The third group of base models uses theoretically derived, a priori probabilities for the estimation of the expected cell frequencies. Groups two and three of base models are not necessarily log-linear. The fourth group estimates the expected cell frequencies based on distributional assumptions. This group of base models is practically never log-linear.

⁷ A similar approach, with a focus on what is here called a type, was described by DuMouchel (1999) in the context of Bayesian exploration of large contingency tables.

In the present context, however, we focus on log-linear base models of the kind given above. More specifically, we ask whether the agreement between two raters results in types in the main diagonal of an I × I cross-classification, and in antitypes in the off-diagonal cells. More detail follows in the sections below. There exists a large number of statistical tests that can be used for the comparison of mᵢ with m̂ᵢ. Most popular are the binomial test and Pearson's X² test. The binomial test allows one to estimate the tail probability Bᵢ of the observed frequency, mᵢ, of Cell i, given the probability, pᵢ, expected for this cell from some base model and the sample size, N. The tail probability Bᵢ can be calculated as follows. For a CFA chance model, the expected probability for Cell i is estimated as p̂ᵢ = m̂ᵢ/N, where i indexes the cells of the cross-classification, also called configurations. Then, the tail probability for the observed frequency, mᵢ, and more extreme frequencies is

$$B_i(m_i) = \sum_{j=m_i}^{N} \binom{N}{j} p_i^j q_i^{N-j},$$

where i indexes the cells of the cross-classification, qᵢ = 1 − pᵢ, and the sum runs over the upper tail when mᵢ > m̂ᵢ; for a candidate antitype (mᵢ < m̂ᵢ), the sum runs over the lower tail, from j = 0 to mᵢ.
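A sketch of the binomial test as reconstructed above (Python with SciPy; the function name is ours):

```python
from scipy.stats import binom

def cfa_binomial_tail(m_i, m_hat_i, n):
    """One-sided binomial tail probability B_i for configuration i:
    upper tail if m_i > m_hat_i (candidate type),
    lower tail if m_i < m_hat_i (candidate antitype)."""
    p_i = m_hat_i / n                        # expected cell probability
    if m_i > m_hat_i:
        return binom.sf(m_i - 1, n, p_i)     # P(X >= m_i)
    return binom.cdf(m_i, n, p_i)            # P(X <= m_i)

# e.g., 30 observed vs. 15.2 expected in a sample of N = 100
print(cfa_binomial_tail(30, 15.2, 100))      # well below 0.05: a candidate type
```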