MISSING DATA

PAUL D. ALLISON
University of Pennsylvania

SAGE PUBLICATIONS
International Educational and Professional Publisher
Thousand Oaks  London  New Delhi

Copyright © 2002 by Sage Publications, Inc. All rights reserved.
Figure 5.2. Autocorrelations for Regression Slope of GRADRAT on CSAT for Lags Varying Between 1 and 100
expected diminishing returns in the effect of enrollment on graduation rates, I decided to leave the variable in logarithmic form, just as I did for the regression models estimated by ML in Chapter 4.

So the next step was simply to estimate the regression model for each of the five completed data sets. This is facilitated in SAS by the use of a BY statement, which avoids the necessity to specify five different regression models:

   proc reg data=collimp outest=estimate covout;
      model gradrat = csat lenroll private stufac rmbrd;
      by _imputation_;
   run;
This set of statements tells SAS to estimate a separate regression model for each subgroup defined by the five values of the _imputation_ variable. Outest=estimate requests that the regression estimates be written into a new data set called estimate, and covout requests that the covariance matrix of the regression parameters be included in this data set. This makes it easy to combine the estimates in the next step. Results for the five regressions are shown in Table 5.3. Clearly there is a great deal of stability from one regression to the next, but there is also noticeable variability, which is attributable to the random component of the imputation.
The results from these regressions are integrated into a single set of estimates using another SAS procedure called MIANALYZE. It is invoked with the following statements:

   proc mianalyze data=estimate;
      var intercept csat lenroll private stufac rmbrd;
   run;
This procedure operates directly on the data set estimate, which contains the coefficients and associated statistics produced by the regression runs. Results are shown in Figure 5.3.
The column labeled "Mean" in Figure 5.3 contains the means of the coefficients in Table 5.3. The standard errors, calculated using formula 5.1, are appreciably larger than the standard errors in Table 5.3, because the between-regression variability is added to the within-regression variability. However, there is more between-regression variability for some coefficients than for others. At the low end, the standard error for the lenroll coefficient in Figure 5.3 is only about 10% larger than the mean of the standard errors in Table 5.3. At the high end, the combined standard error for rmbrd is about 70% larger than the mean of the individual standard errors. The greater variability in the rmbrd coefficients is apparent in Table 5.3, where the estimates range from 1.66 to 2.95.
The column labeled "t for H0: Mean=0" in Figure 5.3 is just the ratio of each coefficient to its standard error. The immediately preceding column gives the degrees of freedom used to calculate the p value from a t table. This number has nothing to do with the number of observations or the number of variables. It is simply a way to specify a reference distribution that happens to be a good approximation to the sampling distribution of the t-ratio statistic. Although it is not essential to know how the degrees of freedom is calculated, I think it is worth a short explanation. For a given coefficient, let U be the average of the squared, within-regression standard errors. Let B be the variance of the coefficients between regressions. The relative increase in variance due to missing data is defined as

   r = (1 + M^-1)B / U,
Multiple-Imputation Parameter Estimates

                                              t for H0:              Fraction Missing
Variable      Mean         Std Error    DF    Mean=0       Pr > |t|  Information
intercept    -32.309795    5.639411      72   -6.596995    <.0001    0.255724
csat           0.068255    0.004692      39   14.547388    <.0001    0.356451
lenroll        1.916654    0.595229     110    3.220027    0.0017    0.206210
private       12.481050    1.367858      40    9.124524    <.0001    0.344151
stufac        -0.169484    0.099331      42   -1.706258    0.0953    0.329284
rmbrd          2.348136    0.670105      10    3.504132    0.0067    0.708476

Figure 5.3. Selected Output From PROC MIANALYZE
where M is, as before, the number of completed data sets used to produce the estimates. The degrees of freedom is then calculated as

   df = (M - 1)(1 + r^-1)^2.
Thus, the smaller the between-regression variation is relative to the within-regression variation, the larger is the degrees of freedom. Sometimes the calculated degrees of freedom will be substantially greater than the number of observations. This is nothing to be concerned about, because any number greater than about 150 will yield a t table that is essentially the same as a standard normal distribution. However, some software (including PROC MIANALYZE) can produce an adjusted degrees of freedom that cannot be greater than the sample size (Barnard & Rubin, 1999).

The last column, "Fraction Missing Information," is an estimate of how much information about each coefficient is lost because of missing data. It ranges from a low of 21% for lenroll to a high of 71% for rmbrd. It's not surprising that the missing information is high for rmbrd, which had 40% missing data, but it is surprisingly high for private, which had no missing data, and stufac, which had less than 1% missing data. To understand this, it is important to know a couple of things. First, the amount of missing information for a given
coefficient depends not only on the missing data for that particular variable but also on the percentage of missing data for other variables that are correlated with it. Second, the MIANALYZE procedure has no way to know how much missing data there are on each variable. Instead, the missing information estimate is based entirely on the relative variation within and between regressions. If there is a lot of variation between regressions, that is an indication of a lot of missing information. Sometimes denoted as γ, the fraction of missing information is calculated from two statistics that we just defined, r and df. Specifically,

   γ̂ = (r + 2/(df + 3)) / (r + 1).
Keep in mind that the fraction of missing information reported in the table is only an estimate that may be subject to considerable sampling variability.
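To make these combining formulas concrete, here is a minimal Python sketch (not SAS, and not the MIANALYZE implementation itself) that pools one coefficient's estimates across the M completed data sets. It assumes the usual total variance U + (1 + M^-1)B of Equation 5.1; the function name and the dictionary output are my own illustrative choices:

```python
import math

def rubin_combine(estimates, std_errors):
    """Pool one coefficient's estimates from M completed data sets,
    computing the quantities defined in the text: U, B, r, df, and the
    estimated fraction of missing information. Assumes B > 0."""
    M = len(estimates)
    mean = sum(estimates) / M                              # combined point estimate
    U = sum(se ** 2 for se in std_errors) / M              # mean squared within-regression SE
    B = sum((q - mean) ** 2 for q in estimates) / (M - 1)  # between-regression variance
    total_var = U + (1 + 1 / M) * B                        # total variance (Equation 5.1)
    r = (1 + 1 / M) * B / U                                # relative increase in variance
    df = (M - 1) * (1 + 1 / r) ** 2                        # degrees of freedom
    gamma = (r + 2 / (df + 3)) / (r + 1)                   # fraction of missing information
    return {"mean": mean, "se": math.sqrt(total_var), "df": df, "fmi": gamma}
```

Feeding in one coefficient's five estimates and standard errors from Table 5.3 should reproduce, up to rounding, the corresponding row of Figure 5.3.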
As noted earlier, one of the troubling things about multiple imputation is that it does not produce a determinate result. Every time you do it, you get slightly different estimates and associated statistics. To see this, take a look at Figure 5.4, which is based on five data
Multiple-Imputation Parameter Estimates

                                              t for H0:              Fraction Missing
Variable      Mean         Std Error    DF    Mean=0       Pr > |t|  Information
intercept    -32.474158    4.816341     124   -6.742496    <.0001    0.192429
csat           0.066590    0.005187      20   12.838386    <.0001    0.489341
lenroll        2.173214    0.546177    2157    3.978955    <.0001    0.043949
private       13.125024    1.171488    1191   11.203719    <.0001    0.059531
stufac        -0.190031    0.099027      51   -1.918988    0.0607    0.307569
rmbrd          2.357444    0.599341      12    3.933396    0.0020    0.623224

Figure 5.4. Output From MIANALYZE for Replication of Multiple Imputation
sets produced by an entirely new run of data augmentation. Most of the results are quite similar to those in Figure 5.3, although note that the fractions of missing information for lenroll and private are much lower than before.

When the fraction of missing information is high, more than the recommended three to five completed data sets may be necessary to get stable estimates. How many might that be? Multiple imputation with an infinite number of data sets is fully efficient (like ML), but MI with a finite number of data sets does not achieve full efficiency. Rubin (1987) showed that the relative efficiency of an estimate based on M data sets compared with an estimate based on an infinite number of data sets is given by (1 + γ/M)^-1, where γ is the fraction of missing information. This implies that with five data sets and 50% missing information, the efficiency of the estimation procedure is 91%. With 10 data sets, the efficiency goes up to 95%. Equivalently, using only five data sets would give us standard errors that are 5% larger than when an infinite number of data sets is used. Ten data sets would yield standard errors that are 2.5% larger than an infinite number of data sets. The bottom line is that even with 50% missing information, five data sets do a pretty good job. Doubling the number of data sets cuts the excess standard error in half, but the excess is small to begin with.
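These efficiency figures are easy to reproduce numerically. A small Python sketch (the function names are mine) computes Rubin's relative efficiency and the implied standard error inflation for a given fraction of missing information γ and number of data sets M:

```python
def mi_relative_efficiency(gamma, m):
    """Rubin's relative efficiency of MI based on m data sets,
    compared with an infinite number of data sets: (1 + gamma/m)^-1."""
    return 1.0 / (1.0 + gamma / m)

def se_inflation(gamma, m):
    """Factor by which standard errors exceed the infinite-m standard
    errors; the square root of the inverse relative efficiency."""
    return (1.0 + gamma / m) ** 0.5
```

With γ = 0.5, five data sets give efficiency 1/1.1, about 0.91, and standard errors about 5% too large; ten data sets give about 0.95 and 2.5%, matching the figures in the text.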
Before leaving the regression example, let us compare the MI results in Figure 5.4 with the ML results in Table 4.6. The coefficient estimates are quite similar, as are the standard errors and t statistics. Certainly the same conclusions would be reached from the two analyses.
6. MULTIPLE IMPUTATION: COMPLICATIONS

Interactions and Nonlinearities in MI

Although the methods we have just described are very good for estimating the main effects of the variables with missing data, they may not be so good for estimating interaction effects. Suppose, for example, we suspect that the effect of SAT scores (CSAT) on graduation rate (GRADRAT) is different for public and private colleges. One way to test this hypothesis (Method 1) would be to take the previously imputed data, create a new variable that is the product of CSAT
TABLE 6.1
Regressions With Interaction Terms: Three Methods

               Method 1              Method 2              Method 3
Variable       Coefficient  p Value  Coefficient  p Value  Coefficient  p Value
INTERCEPT       -50.2       .000      -39.142     .000      -48.046     .000
CSAT              0.073     .000        0.085     .000        0.085     .000
LENROLL           2.383     .000        1.932     .001        1.950     .001
STUFAC           -0.175     .205       -0.204     .083       -0.152     .101
PRIVATE          20.870     .023       35.128     .001       36.118     .001
RMBRD             2.134     .002        2.448     .000        2.641     .003
PRIVCSAT         -0.008     .388       -0.024     .022       -0.024     .02
and PRIVATE, and include this product term in the regression equation along with the other variables already in the model. The leftmost panel of Table 6.1 (Method 1) shows the results of doing this. The variable PRIVCSAT is the product of CSAT and PRIVATE. With a p value of .39, the interaction is far from statistically significant, so we conclude that the effect of CSAT does not vary between public and private institutions.

The problem with this approach is that although the multivariate normal model is good at imputing values that reproduce the linear relationships among variables, it does not model any higher-order moments. Consequently, the imputed values display no evidence of interaction unless special techniques are implemented. In this example, where one of the two variables in the interaction is a dichotomy (PRIVATE), the most natural solution (Method 2) is to do separate chains of data augmentation for private colleges and for public colleges. This allows the relationship between CSAT and GRADRAT to differ across the two groups and allows the imputed values to reflect that fact. Once the separate imputations are completed, the data sets are recombined into a single data set, the product variable is created, and the regression is run with the product variable. Results in the middle panel of Table 6.1 show that the interaction between PRIVATE and CSAT is significant at the .02 level. More specifically, we find that the positive effect of CSAT on graduation rates is smaller in private colleges than in public colleges.

A third approach (Method 3) is to create the product variable for all cases with observed values of CSAT and PRIVATE before imputation,
then impute the product variable just like any other variable with missing data, and, finally, using the imputed data, estimate the regression model that includes the product variable. This method is less appealing than Method 2 because the product variable typically will have a distribution that is far from normal, yet normality is assumed in the imputation process. Nevertheless, as seen in the right-hand panel of Table 6.1, the results from Method 3 are very close to those obtained with Method 2 and certainly much closer than those of Method 1.
The results for Method 3 are reassuring because Method 2 is not feasible when both variables in the interaction are measured on quantitative scales. Thus, if we wish to estimate a model with the interaction of CSAT and RMBRD, we need to create a product variable for the 476 cases that have data on both these variables. For the remaining 826 cases, we must impute the product term as part of the data augmentation process. This method (or Method 2 when possible) should be used whenever the goal is to estimate a model with nonlinear relationships that involves variables with missing data. For example, if we want to estimate a model with both RMBRD and RMBRD squared, the squared term should be imputed as part of the data augmentation. This requirement puts some burden on the imputer to anticipate the desired functional form before beginning the imputation. It also means that we must be cautious about estimating nonlinear models from data that have been imputed by others using strictly linear models. Of course, if the percentage of missing data on a given variable is small, we may be able to get by with imputing a variable in its original form and then constructing a nonlinear transformation later. Certainly for the variable STUFAC (student/faculty ratio), with only two cases missing out of 1,302, it would be quite acceptable to put STUFAC squared in a regression model after simply squaring the two imputed values rather than imputing the squared values.
Compatibility of the Imputation Model and the Analysis Model

The problem of interactions illustrates a more general issue in multiple imputation. Ideally, the model used for imputation should agree with the model used in analysis, and both should correctly represent the data. The basic formula (Equation 5.1) for computing standard errors depends on this compatibility and correctness.
What happens when the imputation and analysis models differ? It all depends on the nature of the difference and which model is correct (Schafer, 1997). Of particular interest are cases in which one model is a special case of the other. For example, the imputation model may allow for interactions, but the analysis model may not, or the analysis model may allow for interactions, but the imputation model may not. In either case, if the additional restrictions imposed by the simpler model are correct, then the procedures we have discussed for inference under multiple imputation will be valid. However, if the additional restrictions are not correct, inferences that use the standard methods may not be valid.

Methods that are less sensitive to model choice have been proposed for estimating standard errors under multiple imputation (Wang & Robins, 1998; Robins & Wang, 2000). Specifically, these methods give valid standard error estimates when the imputation and analysis models are incompatible and when both models are incorrect. Nevertheless, incorrect models at either stage still may give biased parameter estimates, and the alternative methods require specialized software that is not yet readily available.
Role of the Dependent Variable in Imputation

Because GRADRAT was one of the variables included in the data augmentation process, the dependent variable was implicitly used to impute missing values on the independent variables. Is this legitimate? Does it not tend to produce spuriously large regression coefficients? The answer is that not only is this OK, it is essential for getting unbiased estimates of the regression coefficients. With deterministic imputation, using the dependent variable to impute the missing values of the independent variables can, indeed, produce spuriously large regression coefficients, but the introduction of a random component into the imputation process counterbalances this tendency and gives us approximately unbiased estimates. In fact, leaving the dependent variable out of the imputation process tends to produce regression coefficients that are spuriously small, at least for those variables that have missing data (Landerman, Land, & Pieper, 1997). In the college example, if GRADRAT is not used in the imputations, the coefficients for CSAT and RMBRD, both with large fractions of missing data, are reduced by about 25% and 20%, respectively. At the same time, the coefficient for LENROLL, which only had five missing values, is 65% larger.

Of course, including GRADRAT in the data augmentation process also means that any missing values of GRADRAT also were imputed. Some authors have recommended against imputing missing data on the dependent variable (Cohen & Cohen, 1985). To follow this advice, we would have to delete any cases with missing data on the dependent variable before beginning the imputation. There is a valid rationale for this recommendation, but it applies only in special cases. If there are missing data on the dependent variable but not on any of the independent variables, maximum likelihood estimation of a regression model (whether linear or nonlinear) does not use any information from cases with missing data. Because ML is optimal, there is nothing to gain from imputing the missing cases under multiple imputation. In fact, although such imputation would not lead to any bias, the standard errors would be larger. However, the situation changes when there are also missing data on the independent variables. Then cases with missing values on the dependent variable do have some information to contribute to the estimation of the regression coefficients, although probably not a great deal. The upshot is that in the typical case with missing values on both dependent and independent variables, the cases with missing values on the dependent variable should not be deleted.
Using Additional Variables in the Imputation Process

As already noted, the set of variables used in data augmentation certainly should include all variables that will be used in the planned analysis. In the college example, we also included one additional variable, ACT (mean ACT scores), because of its high correlation with CSAT, a variable that had substantial missing data. The goal was to improve the imputations of CSAT to get more reliable estimates of its regression coefficient. We might have done even better had we included still other variables that were correlated with CSAT.

A somewhat simpler example illustrates the benefits of additional predictor variables. Suppose we want to estimate the mean CSAT score across the 1,302 colleges. As we know, data are missing on CSAT for 523 cases. If we calculate the mean for the other 779 cases with values on CSAT, we get the results in the first line of Table 6.2. The
TABLE 6.2
Means (and Standard Errors) of CSAT With Different Variables Used in Imputation

Variables Used in Imputing    Mean      Standard Error   % Missing Information
None                          967.98        4.43              40.1 a
ACT                           956.87        3.84              26.5
ACT, PCT25                    959.48        3.60              13.3
ACT, PCT25, GRADRAT           958.04        3.58              11.3

a. Actual percentage of missing data.
next line shows the estimated mean (with standard error) using multiple imputation and the ACT variable. The mean has decreased by 11 points, whereas the standard error has decreased by 13%. Although ACT has a correlation of about 0.90 with CSAT, its usefulness as a predictor variable is somewhat marred by the fact that values are observed on ACT for only 226 of the 523 cases missing on CSAT. If we add an additional variable, PCT25 (the percentage of students in the top 25% of their class), we get an additional reduction in standard error. PCT25 has a correlation of about 0.80 with CSAT and is available for an additional 240 cases that have missing data on both CSAT and ACT. The last line of Table 6.2 adds in GRADRAT, which has a correlation of about 0.60 with CSAT, but is only available for 17 cases not already covered by PCT25 or ACT. Not surprisingly, the decline in the standard error is quite small. When I tried to introduce all the other variables in the regression model in Figure 5.4, the standard error actually got larger. This is likely due to the fact that the other variables have much lower correlations with CSAT, yet additional variability is introduced because of the need to estimate their regression coefficients for predicting CSAT. As in other forecasting problems, imputations may get worse when poor predictors are added to the model.
Other Parametric Approaches to Multiple Imputation

As we have seen, multiple imputation under the multivariate normal model is reasonably straightforward under a wide variety of data types and missing data patterns. As a routine method for handling missing data, it is probably the best that is currently available. There are, however, several alternative approaches that may be preferable in some circumstances.

One of the most obvious limitations of the multivariate normal model is that it is designed only to impute missing values for quantitative variables. As we have seen, categorical variables can be accommodated by using some ad hoc fixups. However, sometimes you may want to do better. For situations in which all variables in the imputation process are categorical, a more attractive model is the unrestricted multinomial model (which has a parameter for every cell in the contingency table) or a log-linear model that allows restrictions on the multinomial parameters. In Chapter 4, we discussed ML estimation of these models. Schafer (1997) showed how these models also can be used as the basis for data augmentation to produce multiple imputations, and he developed a freeware program called CAT to implement the method (http://www.stat.psu.edu/~jls/).

Another Schafer program (MIX) uses data augmentation to generate imputations when the data consist of a mixture of categorical and quantitative variables. This method presumes that the categorical variables have a multinomial distribution, possibly with log-linear restrictions on the parameters. Within each cell of the contingency table created by the categorical variables, the quantitative variables are assumed to have a multivariate normal distribution. The means of these variables are allowed to vary across cells, but the covariance matrix is assumed to be constant. At this writing, both CAT and MIX are available only as libraries for the S-PLUS statistical package, although stand-alone versions are promised. In both cases, the underlying models potentially have many more parameters than the multivariate normal model. As a result, effective use of these methods typically requires more knowledge and input from the person performing the imputation, together with larger sample sizes to achieve stable estimates.

If data are missing for a single categorical variable, multiple imputation under a logistic (logit) regression model is reasonably straightforward (Rubin, 1987). Suppose data are missing on marital status, coded into five categories, and there are several potential predictor variables, both continuous and categorical. For the purposes of imputation, we estimate a multinomial logit model for marital status as a function of the predictors, using cases with complete data. This produces a set of coefficient estimates β̂ and an estimate of the covariance matrix V(β̂). To allow for variability in the parameter estimates, we take a random draw from a normal distribution with a mean of β̂ and a covariance matrix V(β̂). (Schafer [1997] gave practical suggestions on how to do this efficiently.) For each case with missing data, the drawn coefficient values and the observed covariate values are substituted into the multinomial logit model to generate predicted probabilities of falling into the five marital status categories. Based on these predicted probabilities, we randomly draw one of the marital status categories as the final imputed value.¹² The whole process is repeated multiple times to generate multiple completed data sets. Of course, a binary variable would be just a special case of this method. This approach also can be used with a variety of other parametric models, including Poisson regression and parametric failure-time regressions.
"-
Nonparametric and Partially Parametric Methods

Many methods have been proposed for doing multiple imputation under less stringent assumptions than the fully parametric methods we have just considered. In this section, I will consider a few representative approaches, but keep in mind that each of these approaches has many different variations. All these methods are most naturally applied when there are missing data on only a single variable, although they often can be generalized without difficulty to multiple variables when data are missing in a monotone pattern (described in Chapter 4). See Rubin (1987) for details on monotone generalizations. These methods can sometimes be used when the missing data do not follow a monotone pattern, but in such settings they typically lack solid theoretical justification.

When choosing between parametric and nonparametric methods, there is the usual trade-off between bias and sampling variability. Parametric methods tend to have less sampling variability, but they may give biased estimates if the parametric model is not a good approximation to the phenomenon of interest. Nonparametric methods may be less prone to bias under a variety of situations, but the estimates often have more sampling variability.
Hot Deck Methods

The best-known approach to nonparametric imputation is the "hot deck" method, which is frequently used by the U.S. Census Bureau to produce imputed values for public-use data sets. The basic idea is that we want to impute missing values for a particular variable Y, which may be either quantitative or categorical. We find a set of categorical X variables (with no missing data) that are associated with Y. We form a contingency table based on the X variables. If there are cases with missing Y values within a particular cell of the contingency table, we take one or more of the nonmissing cases in the same cell and use their Y values to impute the missing Y values.
Obviously there are a lot of complications that may arise. The critical question is how do you choose which "donor" values to assign to the cases with missing values? Clearly, the choice of donor cases should be randomized somehow to avoid bias. This leads naturally to multiple imputation because any randomized method can be applied more than once to produce different imputed values. The trick is to do the randomization in such a way that all the natural variability is preserved. To accomplish this, Rubin proposed a method he coined the approximate Bayesian bootstrap (Rubin, 1987; Rubin & Schenker, 1991). Here is how it is done. Suppose that in a particular cell of the contingency table there are n1 cases with complete data on Y and n0 cases with missing data on Y. Follow these steps:

1. From the set of n1 cases with complete data, take a random sample (with replacement) of n1 cases.

2. From this sample, take a random sample (with replacement) of n0 cases.

3. Assign the n0 observed values of Y to the n0 cases with missing data on Y.

4. Repeat steps 1 to 3 for every cell in the contingency table.

These four steps produce one completed data set when applied to all cells of the contingency table. For multiple imputation, the whole process is repeated multiple times. After the desired analysis is performed on each data set, the results are combined using the same formulas we used for multivariate normal imputations.
Although it might seem that we could skip step 1 and directly choose n0 donor cases from among the n1 cases with complete data, this does not produce sufficient variability for estimating standard errors. Additional variability comes from the fact that sampling in step 1 is with replacement.
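For a single cell of the contingency table, the four steps reduce to a few lines of Python. This is an illustrative sketch, with the function name and list inputs my own:

```python
import random

def abb_impute(observed, n_missing, rng=random):
    """One approximate Bayesian bootstrap draw for a single cell.

    observed:  the n1 observed Y values in the cell
    n_missing: the number n0 of cases in the cell with Y missing
    Returns n0 imputed Y values.
    """
    n1 = len(observed)
    # Step 1: sample n1 donors from the observed values, with replacement.
    donors = [rng.choice(observed) for _ in range(n1)]
    # Step 2: sample n0 imputed values from those donors, with replacement.
    return [rng.choice(donors) for _ in range(n_missing)]
```

Calling the function once per cell completes one data set; repeating the whole pass several times gives the multiple completed data sets to be analyzed and combined as usual.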
Predictive Mean Matching

A major attraction of hot deck imputation is that the imputed values are all actual observed values. Consequently, there are no impossible or out-of-range values, and the shape of the distribution tends to be preserved. A disadvantage is that the predictor variables all must be categorical (or treated as such), which imposes serious limitations on the number of possible predictor variables. To remove this limitation, Little (1988) proposed a partially parametric method called predictive mean matching. Like the multivariate normal parametric method, this approach begins by regressing Y, the variable to be imputed, on a set of predictors for cases with complete data. This regression is then used to generate predicted values for both the missing and the nonmissing cases. Then, for each case with missing data, we find a set of cases with complete data that have predicted values of Y that are "close" to the predicted value for the case with missing data. From this set of cases, we randomly choose one case whose Y value is donated to the missing
case. For a single Y variable, it is straightforward to define closeness as the absolute difference between predicted values. However, then we mu s t decide how many of the c l o s e p re dic ted values to include in the donor pool for each missing ca se o r equiv al e ntly what should ' be the cu t o ff point in closeness for fo rIn i ng the set of possible donor v a lue s ? If a small donor pool is chosen, there will be more s amp l i ng vari ab i l i ty in the estimates. On the other h an d too large a donor pool c an lead to possible b i as because many donors m ay be unlike the re ci p ien t s To deal with this amb ig u i ty , Schenker and Taylor ( 1 996) developed an a d ap t iv e method" that varies the size of the donor pool for e ach missing c a se based o n th e d ens i ty of co mp l e t e cases w i t h close p r e d ict e d values. T h ey found that their method did somewhat better than methods with fixed size donor pools of either 3 or 10 clo se s t cases. H owever the differences among the t h re e methods were sufficiently small that the adaptive method hardly seems worth the extra com p u t a t io n a l cost. In d o i n g p r e d ict iv e mean mat c h in g , it is also important to adjust for the fact that the regr e s s i o n coefficients are on ly estimates of t h e true coefficients . As in the p ar ame tr ic case, this can be acco m p li s he d by ra n d o m ly drawing a new set o f r eg r e s s i o n parameters from t h e i r p ost e rio r distribution before c a lcu lat in g p re d icted values for each ,
,
,
.
"
"
,
"
•
to do it:
i l l l P l l 1 c d dat a s e t . Here is how
1. Regress Y on X (a vector of covariates) for the n₁ cases with no missing data on Y, producing regression coefficients b (a k × 1 vector) and residual variance estimate s².

2. Make a random draw from the posterior distribution of the residual variance (assuming a noninformative prior). This is accomplished by calculating (n₁ − k)s²/χ², where χ² represents a random draw from a chi-square distribution with n₁ − k degrees of freedom. Let s²[1] be the first such random draw.

3. Make a random draw from the posterior distribution of the regression coefficients. This is accomplished by drawing from a multivariate normal distribution with mean b and covariance matrix s²[1](X'X)⁻¹, where X is an n₁ × k matrix of X values. Let b[1] be the first such random draw. See Schafer (1997) for practical suggestions on how to do this.
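Steps 1 to 3 can be sketched as follows. The data are simulated for illustration, and the draws assume the noninformative prior described above:

```python
import numpy as np

def posterior_draw(X, y, rng):
    """Steps 1-3: OLS fit, then one posterior draw of the variance and coefficients."""
    n1, k = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]            # step 1: coefficients b
    resid = y - X @ b
    s2 = resid @ resid / (n1 - k)                       # step 1: residual variance s^2
    s2_draw = (n1 - k) * s2 / rng.chisquare(n1 - k)     # step 2: s^2[1]
    cov = s2_draw * np.linalg.inv(X.T @ X)              # step 3: covariance matrix
    b_draw = rng.multivariate_normal(b, cov)            # step 3: b[1]
    return b_draw, s2_draw

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(50), rng.normal(size=50)])        # simulated covariates
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=50)  # simulated outcome
b1, s2_1 = posterior_draw(X, y, rng)
```

Each completed data set gets its own call to this function, so the imputations reflect uncertainty about the regression parameters as well as residual variation.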
For each new set of regression parameters, predicted values are generated for all cases. Then, for each case with missing data on Y, we form a donor pool based on the predicted values and randomly choose one of the observed values of Y from the donor pool. This approach to predictive mean matching can be generalized to more than one Y variable with missing data, although the computations may become rather complex (Little, 1988).
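A sketch of the matching step, given a posterior coefficient draw b1 from the preceding steps. The helper function, the fixed pool size of 3, and all the data values are illustrative, not taken from the text:

```python
import numpy as np

def pmm_impute(X_obs, y_obs, X_mis, b1, pool_size, rng):
    """For each missing case, donate an observed Y from among the
    pool_size complete cases with the closest predicted values."""
    pred_obs = X_obs @ b1
    pred_mis = X_mis @ b1
    imputed = []
    for p in pred_mis:
        nearest = np.argsort(np.abs(pred_obs - p))[:pool_size]  # closest donors
        imputed.append(y_obs[rng.choice(nearest)])              # random donor
    return np.array(imputed)

rng = np.random.default_rng(1)
X_obs = np.column_stack([np.ones(8), np.arange(8.0)])  # hypothetical complete cases
y_obs = np.array([1.0, 2.0, 2.5, 4.0, 4.5, 6.0, 7.0, 8.5])
X_mis = np.array([[1.0, 2.5], [1.0, 6.5]])             # cases needing imputation
b1 = np.array([0.9, 1.05])                             # one posterior draw
values = pmm_impute(X_obs, y_obs, X_mis, b1, pool_size=3, rng=rng)
```

Note that every imputed value is an actual observed Y, so the method inherits the hot deck's protection against impossible or out-of-range imputations while allowing continuous predictors.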
Sampling on Empirical Residuals

In the data augmentation method, residual values are sampled from a standard normal distribution and then added to the regression predicted values to get the final imputed values. We can modify this method to be less dependent on parametric assumptions by making random draws from the actual set of residuals produced by the linear regression. This can yield imputed values whose distribution is more like that of the observed variable (Rubin, 1987), although it is still possible to get imputed values that are outside the permissible range.
As with other approaches to multiple imputation, there are some important subtleties involved in doing this properly. As before, let Y be the variable with missing data, to be imputed for the n₀ cases and observed for the n₁ cases. Let X be a k × 1 vector of variables (including a constant) with no missing data on the n₁ cases. We begin by performing the preceding three steps to obtain the linear regression of Y on X and generate random draws from the posterior distribution of the parameters. Then we add the following steps:

4. Based on the regression estimates in step 1, calculate standardized residuals for the n₁ cases with no missing data:

   eᵢ = (yᵢ − b'xᵢ)/s.
5. Draw a simple random sample (with replacement) of n₀ values from the n₁ residuals calculated in step 4.

6. For the n₀ cases with missing data, calculate imputed values of Y as

   yᵢ = b[1]'xᵢ + s[1]eᵢ,

where eᵢ represents the residuals drawn in step 5, and b[1] and s[1] (the square root of s²[1]) are the first random draws from the posterior distribution of the parameters.
These six steps produce one completed set of data. To get additional data sets, simply repeat steps 2 through 6 (except for step 4, which should not be repeated). As Rubin (1987) explained, this methodology can be readily extended to data sets with a monotonic missing pattern on several variables. Each variable is imputed using as predictors all variables that are observed when it is missing. The empirical residual method also can be modified to allow for heteroscedasticity in the imputed values (Schenker & Taylor, 1996). For each case to be imputed, the pool of residuals is restricted to those observed cases that have predicted values of Y that are close to the predicted value for the case with missing data.

Example
Let us try the partially parametric methods on a subset of the college data. TUITION is fully observed for 1,272 colleges. (For simplicity, we shall exclude the 30 cases with missing data on this variable.) Of these 1,272 colleges, only 796 report BOARD, the annual average cost of board at each college. Using TUITION as a predictor, our goal is to impute the missing values of BOARD for the other 476 colleges and estimate the mean of BOARD for all 1,272 colleges. First, we apply the methods we have used before. For the 796 colleges with complete data (listwise deletion), the average BOARD is $2,060 with a standard error of 23.4. Applying the EM algorithm to
TUITION and BOARD yields a mean BOARD of 2,032 (but no standard