MISSING DATA

PAUL D. ALLISON
University of Pennsylvania

SAGE PUBLICATIONS
International Educational and Professional Publisher
Thousand Oaks  London  New Delhi

Copyright © 2002 by Sage Publications, Inc. All rights reserved.
Figure 5.2. Autocorrelations for Regression Slope of GRADRAT on CSAT for Lags Varying Between 1 and 100
expected diminishing returns in the effect of enrollment on graduation rates, I decided to leave the variable in logarithmic form, just as I did for the regression models estimated by ML in Chapter 4.

So the next step was simply to estimate the regression model for each of the five completed data sets. This is facilitated in SAS by the use of a BY statement, which avoids the necessity to specify five different regression models:

   proc reg data=collimp outest=estimate covout;
      model gradrat = csat lenroll private stufac rmbrd;
      by _imputation_;
   run;
This set of statements tells SAS to estimate a separate regression model for each subgroup defined by the five values of the _imputation_ variable. Outest=estimate requests that the regression estimates be written into a new data set called estimate, and covout requests that the covariance matrix of the regression parameters be included in this data set. This makes it easy to combine the estimates in the next step. Results for the five regressions are shown in Table 5.3. Clearly there is a great deal of stability from one regression to the next, but there is also noticeable variability, which is attributable to the random component of the imputation.
The results from these regressions are integrated into a single set of estimates using another SAS procedure called MIANALYZE. It is invoked with the following statements:

   proc mianalyze data=estimate;
      var intercept csat lenroll private stufac rmbrd;
   run;
This procedure operates directly on the data set estimate, which contains the coefficients and associated statistics produced by the regression runs. Results are shown in Figure 5.3.
The column labeled "Mean" in Figure 5.3 contains the means of the coefficients in Table 5.3. The standard errors, calculated using formula 5.1, are appreciably larger than the standard errors in Table 5.3, because the between-regression variability is added to the within-regression variability. However, there is more between-regression variability for some coefficients than for others. At the low end, the standard error for the lenroll coefficient in Figure 5.3 is only about 10% larger than the mean of the standard errors in Table 5.3. At the high end, the combined standard error for rmbrd is about 70% larger than the mean of the individual standard errors. The greater variability in the rmbrd coefficients is apparent in Table 5.3, where the estimates range from 1.66 to 2.95.
The column labeled "t for H0: Mean=0" in Figure 5.3 is just the ratio of each coefficient to its standard error. The immediately preceding column gives the degrees of freedom used to calculate the p value from a t table. This number has nothing to do with the number of observations or the number of variables. It is simply a way to specify a reference distribution that happens to be a good approximation to the sampling distribution of the t-ratio statistic. Although it is not essential to know how the degrees of freedom is calculated, I think it is worth a short explanation. For a given coefficient, let U be the average of the squared, within-regression standard errors. Let B be the variance of the coefficients between regressions. The relative increase in variance due to missing data is defined as

   r = (1 + M^-1)B / U,
Multiple-Imputation Parameter Estimates

                                              t for H0:              Fraction Missing
Variable      Mean         Std Error    DF    Mean=0       Pr > |t|  Information
intercept    -32.309795    5.639411      72   -6.596995    <.0001    0.255724
csat           0.068255    0.004692      39   14.547388    <.0001    0.356451
lenroll        1.916654    0.595229     110    3.220027    0.0017    0.206210
private       12.481050    1.367858      40    9.124524    <.0001    0.344151
stufac        -0.169484    0.099331      42   -1.706258    0.0953    0.329284
rmbrd          2.348136    0.670105      10    3.504132    0.0067    0.708476

Figure 5.3. Selected Output From PROC MIANALYZE
where M is, as before, the number of completed data sets used to produce the estimates. The degrees of freedom is then calculated as

   df = (M - 1)(1 + r^-1)^2.
Thus, the smaller the between-regression variation is relative to the within-regression variation, the larger is the degrees of freedom. Sometimes the calculated degrees of freedom will be substantially greater than the number of observations. This is nothing to be concerned about, because any number greater than about 150 will yield a t table that is essentially the same as a standard normal distribution. However, some software (including PROC MIANALYZE) can produce an adjusted degrees of freedom that cannot be greater than the sample size (Barnard & Rubin, 1999).

The last column, "Fraction Missing Information," is an estimate of how much information about each coefficient is lost because of missing data. It ranges from a low of 21% for lenroll to a high of 71% for rmbrd. It's not surprising that the missing information is high for rmbrd, which had 40% missing data, but it is surprisingly high for private, which had no missing data, and stufac, which had less than 1% missing data. To understand this, it is important to know a couple of things. First, the amount of missing information for a given
coefficient depends not only on the missing data for that particular variable but also on the percentage of missing data for other variables that are correlated with it. Second, the MIANALYZE procedure has no way to know how much missing data there are on each variable. Instead, the missing information estimate is based entirely on the relative variation within and between regressions. If there is a lot of variation between regressions, that is an indication of a lot of missing information. Sometimes denoted as γ, the fraction of missing information is calculated from two statistics that we just defined, r and df. Specifically,

   γ̂ = (r + 2/(df + 3)) / (r + 1).
Keep in mind that the fraction of missing information reported in the table is only an estimate that may be subject to considerable sampling variability.
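To make these combining formulas concrete, here is a minimal Python sketch (not SAS, and not the MIANALYZE implementation itself) that pools one coefficient's estimates across the M completed data sets. It assumes the usual total variance U + (1 + M^-1)B of Equation 5.1; the function name and the dictionary output are my own illustrative choices:

```python
import math

def rubin_combine(estimates, std_errors):
    """Pool one coefficient's estimates from M completed data sets,
    computing the quantities defined in the text: U, B, r, df, and the
    estimated fraction of missing information. Assumes B > 0."""
    M = len(estimates)
    mean = sum(estimates) / M                              # combined point estimate
    U = sum(se ** 2 for se in std_errors) / M              # mean squared within-regression SE
    B = sum((q - mean) ** 2 for q in estimates) / (M - 1)  # between-regression variance
    total_var = U + (1 + 1 / M) * B                        # total variance (Equation 5.1)
    r = (1 + 1 / M) * B / U                                # relative increase in variance
    df = (M - 1) * (1 + 1 / r) ** 2                        # degrees of freedom
    gamma = (r + 2 / (df + 3)) / (r + 1)                   # fraction of missing information
    return {"mean": mean, "se": math.sqrt(total_var), "df": df, "fmi": gamma}
```

Feeding in one coefficient's five estimates and standard errors from Table 5.3 should reproduce, up to rounding, the corresponding row of Figure 5.3.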
As noted earlier, one of the troubling things about multiple imputation is that it does not produce a determinate result. Every time you do it, you get slightly different estimates and associated statistics. To see this, take a look at Figure 5.4, which is based on five data
Multiple-Imputation Parameter Estimates

                                              t for H0:              Fraction Missing
Variable      Mean         Std Error    DF    Mean=0       Pr > |t|  Information
intercept    -32.474158    4.816341     124   -6.742496    <.0001    0.192429
csat           0.066590    0.005187      20   12.838386    <.0001    0.489341
lenroll        2.173214    0.546177    2157    3.978955    <.0001    0.043949
private       13.125024    1.171488    1191   11.203719    <.0001    0.059531
stufac        -0.190031    0.099027      51   -1.918988    0.0607    0.307569
rmbrd          2.357444    0.599341      12    3.933396    0.0020    0.623224

Figure 5.4. Output From MIANALYZE for Replication of Multiple Imputation
sets produced by an entirely new run of data augmentation. Most of the results are quite similar to those in Figure 5.3, although note that the fractions of missing information for lenroll and private are much lower than before.

When the fraction of missing information is high, more than the recommended three to five completed data sets may be necessary to get stable estimates. How many might that be? Multiple imputation with an infinite number of data sets is fully efficient (like ML), but MI with a finite number of data sets does not achieve full efficiency. Rubin (1987) showed that the relative efficiency of an estimate based on M data sets compared with an estimate based on an infinite number of data sets is given by (1 + γ/M)^-1, where γ is the fraction of missing information. This implies that with five data sets and 50% missing information, the efficiency of the estimation procedure is 91%. With 10 data sets, the efficiency goes up to 95%. Equivalently, using only five data sets would give us standard errors that are 5% larger than when an infinite number of data sets is used. Ten data sets would yield standard errors that are 2.5% larger than an infinite number of data sets. The bottom line is that even with 50% missing information, five data sets do a pretty good job. Doubling the number of data sets cuts the excess standard error in half, but the excess is small to begin with.
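These efficiency figures are easy to reproduce numerically. A small Python sketch (the function names are mine) computes Rubin's relative efficiency and the implied standard error inflation for a given fraction of missing information γ and number of data sets M:

```python
def mi_relative_efficiency(gamma, m):
    """Rubin's relative efficiency of MI based on m data sets,
    compared with an infinite number of data sets: (1 + gamma/m)^-1."""
    return 1.0 / (1.0 + gamma / m)

def se_inflation(gamma, m):
    """Factor by which standard errors exceed the infinite-m standard
    errors; the square root of the inverse relative efficiency."""
    return (1.0 + gamma / m) ** 0.5
```

With γ = 0.5, five data sets give efficiency 1/1.1, about 0.91, and standard errors about 5% too large; ten data sets give about 0.95 and 2.5%, matching the figures in the text.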
Before leaving the regression example, let us compare the MI results in Figure 5.4 with the ML results in Table 4.6. The coefficient estimates are quite similar, as are the standard errors and t statistics. Certainly the same conclusions would be reached from the two analyses.
6. MULTIPLE IMPUTATION: COMPLICATIONS

Interactions and Nonlinearities in MI

Although the methods we have just described are very good for estimating the main effects of the variables with missing data, they may not be so good for estimating interaction effects. Suppose, for example, we suspect that the effect of SAT scores (CSAT) on graduation rate (GRADRAT) is different for public and private colleges. One way to test this hypothesis (Method 1) would be to take the previously imputed data, create a new variable that is the product of CSAT
TABLE 6.1
Regressions With Interaction Terms: Three Methods

               Method 1              Method 2              Method 3
Variable       Coefficient  p Value  Coefficient  p Value  Coefficient  p Value
INTERCEPT       -50.2       .000      -39.142     .000      -48.046     .000
CSAT              0.073     .000        0.085     .000        0.085     .000
LENROLL           2.383     .000        1.932     .001        1.950     .001
STUFAC           -0.175     .205       -0.204     .083       -0.152     .101
PRIVATE          20.870     .023       35.128     .001       36.118     .001
RMBRD             2.134     .002        2.448     .000        2.641     .003
PRIVCSAT         -0.008     .388       -0.024     .022       -0.024     .02
and PRIVATE, and include this product term in the regression equation along with the other variables already in the model. The leftmost panel of Table 6.1 (Method 1) shows the results of doing this. The variable PRIVCSAT is the product of CSAT and PRIVATE. With a p value of .39, the interaction is far from statistically significant, so we conclude that the effect of CSAT does not vary between public and private institutions.

The problem with this approach is that although the multivariate normal model is good at imputing values that reproduce the linear relationships among variables, it does not model any higher-order moments. Consequently, the imputed values display no evidence of interaction unless special techniques are implemented. In this example, where one of the two variables in the interaction is a dichotomy (PRIVATE), the most natural solution (Method 2) is to do separate chains of data augmentation for private colleges and for public colleges. This allows the relationship between CSAT and GRADRAT to differ across the two groups and allows the imputed values to reflect that fact. Once the separate imputations are completed, the data sets are recombined into a single data set, the product variable is created, and the regression is run with the product variable. Results in the middle panel of Table 6.1 show that the interaction between PRIVATE and CSAT is significant at the .02 level. More specifically, we find that the positive effect of CSAT on graduation rates is smaller in private colleges than in public colleges.

A third approach (Method 3) is to create the product variable for all cases with observed values of CSAT and PRIVATE before imputation,
then impute the product variable just like any other variable with missing data, and, finally, using the imputed data, estimate the regression model that includes the product variable. This method is less appealing than Method 2 because the product variable typically will have a distribution that is far from normal, yet normality is assumed in the imputation process. Nevertheless, as seen in the right-hand panel of Table 6.1, the results from Method 3 are very close to those obtained with Method 2 and certainly much closer than those of Method 1.
The results for Method 3 are reassuring because Method 2 is not feasible when both variables in the interaction are measured on quantitative scales. Thus, if we wish to estimate a model with the interaction of CSAT and RMBRD, we need to create a product variable for the 476 cases that have data on both these variables. For the remaining 826 cases, we must impute the product term as part of the data augmentation process. This method (or Method 2 when possible) should be used whenever the goal is to estimate a model with nonlinear relationships that involves variables with missing data. For example, if we want to estimate a model with both RMBRD and RMBRD squared, the squared term should be imputed as part of the data augmentation. This requirement puts some burden on the imputer to anticipate the desired functional form before beginning the imputation. It also means that we must be cautious about estimating nonlinear models from data that have been imputed by others using strictly linear models. Of course, if the percentage of missing data on a given variable is small, we may be able to get by with imputing a variable in its original form and then constructing a nonlinear transformation later. Certainly for the variable STUFAC (student/faculty ratio), with only two cases missing out of 1,302, it would be quite acceptable to put STUFAC squared in a regression model after simply squaring the two imputed values rather than imputing the squared values.
Compatibility of the Imputation Model and the Analysis Model

The problem of interactions illustrates a more general issue in multiple imputation. Ideally, the model used for imputation should agree with the model used in analysis, and both should correctly represent the data. The basic formula (Equation 5.1) for computing standard errors depends on this compatibility and correctness.
What happens when the imputation and analysis models differ? It all depends on the nature of the difference and which model is correct (Schafer, 1997). Of particular interest are cases in which one model is a special case of the other. For example, the imputation model may allow for interactions, but the analysis model may not, or the analysis model may allow for interactions, but the imputation model may not. In either case, if the additional restrictions imposed by the simpler model are correct, then the procedures we have discussed for inference under multiple imputation will be valid. However, if the additional restrictions are not correct, inferences that use the standard methods may not be valid.

Methods that are less sensitive to model choice have been proposed for estimating standard errors under multiple imputation (Wang & Robins, 1998; Robins & Wang, 2000). Specifically, these methods give valid standard error estimates when the imputation and analysis models are incompatible and when both models are incorrect. Nevertheless, incorrect models at either stage still may give biased parameter estimates, and the alternative methods require specialized software that is not yet readily available.
Role of the Dependent Variable in Imputation

Because GRADRAT was one of the variables included in the data augmentation process, the dependent variable was implicitly used to impute missing values on the independent variables. Is this legitimate? Does it not tend to produce spuriously large regression coefficients? The answer is that not only is this OK, it is essential for getting unbiased estimates of the regression coefficients. With deterministic imputation, using the dependent variable to impute the missing values of the independent variables can, indeed, produce spuriously large regression coefficients, but the introduction of a random component into the imputation process counterbalances this tendency and gives us approximately unbiased estimates. In fact, leaving the dependent variable out of the imputation process tends to produce regression coefficients that are spuriously small, at least for those variables that have missing data (Landerman, Land, & Pieper, 1997). In the college example, if GRADRAT is not used in the imputations, the coefficients for CSAT and RMBRD, both with large fractions of missing data, are reduced by about 25% and 20%, respectively. At the same time, the coefficient for LENROLL, which only had five missing values, is 65% larger.

Of course, including GRADRAT in the data augmentation process also means that any missing values of GRADRAT also were imputed. Some authors have recommended against imputing missing data on the dependent variable (Cohen & Cohen, 1985). To follow this advice, we would have to delete any cases with missing data on the dependent variable before beginning the imputation. There is a valid rationale for this recommendation, but it applies only in special cases. If there are missing data on the dependent variable but not on any of the independent variables, maximum likelihood estimation of a regression model (whether linear or nonlinear) does not use any information from cases with missing data. Because ML is optimal, there is nothing to gain from imputing the missing cases under multiple imputation. In fact, although such imputation would not lead to any bias, the standard errors would be larger. However, the situation changes when there are also missing data on the independent variables. Then cases with missing values on the dependent variable do have some information to contribute to the estimation of the regression coefficients, although probably not a great deal. The upshot is that in the typical case with missing values on both dependent and independent variables, the cases with missing values on the dependent variable should not be deleted.
Using Additional Variables in the Imputation Process

As already noted, the set of variables used in data augmentation certainly should include all variables that will be used in the planned analysis. In the college example, we also included one additional variable, ACT (mean ACT scores), because of its high correlation with CSAT, a variable that had substantial missing data. The goal was to improve the imputations of CSAT to get more reliable estimates of its regression coefficient. We might have done even better had we included still other variables that were correlated with CSAT.

A somewhat simpler example illustrates the benefits of additional predictor variables. Suppose we want to estimate the mean CSAT score across the 1,302 colleges. As we know, data are missing on CSAT for 523 cases. If we calculate the mean for the other 779 cases with values on CSAT, we get the results in the first line of Table 6.2. The
TABLE 6.2
Means (and Standard Errors) of CSAT With Different Variables Used in Imputation

Variables Used in Imputing    Mean      Standard Error   % Missing Information
None                          967.98        4.43              40.1 a
ACT                           956.87        3.84              26.5
ACT, PCT25                    959.48        3.60              13.3
ACT, PCT25, GRADRAT           958.04        3.58              11.3

a. Actual percentage of missing data.
next line shows the estimated mean (with standard error) using multiple imputation and the ACT variable. The mean has decreased by 11 points, whereas the standard error has decreased by 13%. Although ACT has a correlation of about 0.90 with CSAT, its usefulness as a predictor variable is somewhat marred by the fact that values are observed on ACT for only 226 of the 523 cases missing on CSAT. If we add an additional variable, PCT25 (the percentage of students in the top 25% of their class), we get an additional reduction in standard error. PCT25 has a correlation of about 0.80 with CSAT and is available for an additional 240 cases that have missing data on both CSAT and ACT. The last line of Table 6.2 adds in GRADRAT, which has a correlation of about 0.60 with CSAT, but is only available for 17 cases not already covered by PCT25 or ACT. Not surprisingly, the decline in the standard error is quite small. When I tried to introduce all the other variables in the regression model in Figure 5.4, the standard error actually got larger. This is likely due to the fact that the other variables have much lower correlations with CSAT, yet additional variability is introduced because of the need to estimate their regression coefficients for predicting CSAT. As in other forecasting problems, imputations may get worse when poor predictors are added to the model.
Other Parametric Approaches to Multiple Imputation

As we have seen, multiple imputation under the multivariate normal model is reasonably straightforward under a wide variety of data types and missing data patterns. As a routine method for handling missing data, it is probably the best that is currently available. There are, however, several alternative approaches that may be preferable in some circumstances.

One of the most obvious limitations of the multivariate normal model is that it is designed only to impute missing values for quantitative variables. As we have seen, categorical variables can be accommodated by using some ad hoc fixups. However, sometimes you may want to do better. For situations in which all variables in the imputation process are categorical, a more attractive model is the unrestricted multinomial model (which has a parameter for every cell in the contingency table) or a log-linear model that allows restrictions on the multinomial parameters. In Chapter 4, we discussed ML estimation of these models. Schafer (1997) showed how these models also can be used as the basis for data augmentation to produce multiple imputations, and he developed a freeware program called CAT to implement the method (http://www.stat.psu.edu/~jls/).

Another Schafer program (MIX) uses data augmentation to generate imputations when the data consist of a mixture of categorical and quantitative variables. This method presumes that the categorical variables have a multinomial distribution, possibly with log-linear restrictions on the parameters. Within each cell of the contingency table created by the categorical variables, the quantitative variables are assumed to have a multivariate normal distribution. The means of these variables are allowed to vary across cells, but the covariance matrix is assumed to be constant. At this writing, both CAT and MIX are available only as libraries for the S-PLUS statistical package, although stand-alone versions are promised. In both cases, the underlying models potentially have many more parameters than the multivariate normal model. As a result, effective use of these methods typically requires more knowledge and input from the person performing the imputation, together with larger sample sizes to achieve stable estimates.

If data are missing for a single categorical variable, multiple imputation under a logistic (logit) regression model is reasonably straightforward (Rubin, 1987). Suppose data are missing on marital status, coded into five categories, and there are several potential predictor variables, both continuous and categorical. For the purposes of imputation, we estimate a multinomial logit model for marital status as a function of the predictors, using cases with complete data. This produces a set of coefficient estimates β̂ and an estimate of the covariance matrix V(β̂). To allow for variability in the parameter estimates, we take a random draw from a normal distribution with a mean of β̂ and a covariance matrix V(β̂). (Schafer [1997] gave practical suggestions on how to do this efficiently.) For each case with missing data, the drawn coefficient values and the observed covariate values are substituted into the multinomial logit model to generate predicted probabilities of falling into the five marital status categories. Based on these predicted probabilities, we randomly draw one of the marital status categories as the final imputed value.¹² The whole process is repeated multiple times to generate multiple completed data sets. Of course, a binary variable would be just a special case of this method. This approach also can be used with a variety of other parametric models, including Poisson regression and parametric failure-time regressions.
"-
Nonparametric and Partially Parametric Methods

Many methods have been proposed for doing multiple imputation under less stringent assumptions than the fully parametric methods we have just considered. In this section, I will consider a few representative approaches, but keep in mind that each of these approaches has many different variations. All these methods are most naturally applied when there are missing data on only a single variable, although they often can be generalized without difficulty to multiple variables when data are missing in a monotone pattern (described in Chapter 4). See Rubin (1987) for details on monotone generalizations. These methods can sometimes be used when the missing data do not follow a monotone pattern, but in such settings they typically lack solid theoretical justification.

When choosing between parametric and nonparametric methods, there is the usual trade-off between bias and sampling variability. Parametric methods tend to have less sampling variability, but they may give biased estimates if the parametric model is not a good approximation to the phenomenon of interest. Nonparametric methods may be less prone to bias under a variety of situations, but the estimates often have more sampling variability.
Hot Deck Methods

The best-known approach to nonparametric imputation is the "hot deck" method, which is frequently used by the U.S. Census Bureau to produce imputed values for public-use data sets. The basic idea is that we want to impute missing values for a particular variable Y, which may be either quantitative or categorical. We find a set of categorical X variables (with no missing data) that are associated with Y. We form a contingency table based on the X variables. If there are cases with missing Y values within a particular cell of the contingency table, we take one or more of the nonmissing cases in the same cell and use their Y values to impute the missing Y values.
Obviously there are a lot of complications that may arise. The critical question is how do you choose which "donor" values to assign to the cases with missing values? Clearly, the choice of donor cases should be randomized somehow to avoid bias. This leads naturally to multiple imputation because any randomized method can be applied more than once to produce different imputed values. The trick is to do the randomization in such a way that all the natural variability is preserved. To accomplish this, Rubin proposed a method he coined the approximate Bayesian bootstrap (Rubin, 1987; Rubin & Schenker, 1991). Here is how it is done. Suppose that in a particular cell of the contingency table there are n1 cases with complete data on Y and n0 cases with missing data on Y. Follow these steps:

1. From the set of n1 cases with complete data, take a random sample (with replacement) of n1 cases.

2. From this sample, take a random sample (with replacement) of n0 cases.

3. Assign the n0 observed values of Y to the n0 cases with missing data on Y.

4. Repeat steps 1 to 3 for every cell in the contingency table.

These four steps produce one completed data set when applied to all cells of the contingency table. For multiple imputation, the whole process is repeated multiple times. After the desired analysis is performed on each data set, the results are combined using the same formulas we used for multivariate normal imputations.
Although it might seem that we could skip step 1 and directly choose n0 donor cases from among the n1 cases with complete data, this does not produce sufficient variability for estimating standard errors. Additional variability comes from the fact that sampling in step 1 is with replacement.
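For a single cell of the contingency table, the four steps reduce to a few lines of Python. This is an illustrative sketch, with the function name and list inputs my own:

```python
import random

def abb_impute(observed, n_missing, rng=random):
    """One approximate Bayesian bootstrap draw for a single cell.

    observed:  the n1 observed Y values in the cell
    n_missing: the number n0 of cases in the cell with Y missing
    Returns n0 imputed Y values.
    """
    n1 = len(observed)
    # Step 1: sample n1 donors from the observed values, with replacement.
    donors = [rng.choice(observed) for _ in range(n1)]
    # Step 2: sample n0 imputed values from those donors, with replacement.
    return [rng.choice(donors) for _ in range(n_missing)]
```

Calling the function once per cell completes one data set; repeating the whole pass several times gives the multiple completed data sets to be analyzed and combined as usual.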
Predictive Mean Matching

A major attraction of hot deck imputation is that the imputed values are all actual observed values. Consequently, there are no impossible or out-of-range values, and the shape of the distribution tends to be preserved. A disadvantage is that the predictor variables all must be categorical (or treated as such), which imposes serious limitations on the number of possible predictor variables. To remove this limitation, Little (1988) proposed a partially parametric method called predictive mean matching. Like the multivariate normal parametric method, this approach begins by regressing Y, the variable to be imputed, on a set of predictors for cases with complete data. This regression is then used to generate predicted values for both the missing and the nonmissing cases. Then, for each case with missing data, we find a set of cases with complete data that have predicted values of Y that are "close" to the predicted value for the case with missing data. From this set of cases, we randomly choose one case whose Y value is donated to the missing
case. For a single Y variable, it is straightforward to define closeness as the absolute difference between predicted values. However, then we mu s t decide how many of the c l o s e p re dic ted values to include in the donor pool for each missing ca se o r equiv al e ntly what should ' be the cu t o ff point in closeness for fo rIn i ng the set of possible donor v a lue s ? If a small donor pool is chosen, there will be more s amp l i ng vari ab i l i ty in the estimates. On the other h an d too large a donor pool c an lead to possible b i as because many donors m ay be unlike the re ci p ien t s To deal with this amb ig u i ty , Schenker and Taylor ( 1 996) developed an a d ap t iv e method" that varies the size of the donor pool for e ach missing c a se based o n th e d ens i ty of co mp l e t e cases w i t h close p r e d ict e d values. T h ey found that their method did somewhat better than methods with fixed size donor pools of either 3 or 10 clo se s t cases. H owever the differences among the t h re e methods were sufficiently small that the adaptive method hardly seems worth the extra com p u t a t io n a l cost. In d o i n g p r e d ict iv e mean mat c h in g , it is also important to adjust for the fact that the regr e s s i o n coefficients are on ly estimates of t h e true coefficients . As in the p ar ame tr ic case, this can be acco m p li s he d by ra n d o m ly drawing a new set o f r eg r e s s i o n parameters from t h e i r p ost e rio r distribution before c a lcu lat in g p re d icted values for each ,
,
,
.
"
"
,
"
•
to do it:
i l l l P l l 1 c d dat a s e t . Here is how
1. Regress Y on X (a vector of covariates) for the n₁ cases with no missing data on Y, producing regression coefficients b (a k × 1 vector) and residual variance estimate s².

2. Make a random draw from the posterior distribution of the residual variance (assuming a noninformative prior). This is accomplished by calculating (n₁ − k)s²/χ², where χ² represents a random draw from a chi-square distribution with n₁ − k degrees of freedom. Let s²[1] be the first such random draw.

3. Make a random draw from the posterior distribution of the regression coefficients. This is accomplished by drawing from a multivariate normal distribution with mean b and covariance matrix s²[1](X'X)⁻¹, where X is an n₁ × k matrix of X values. Let b[1] be the first such random draw. See Schafer (1997) for practical suggestions on how to do this.
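Steps 1 to 3 can be sketched as follows. The data are simulated for illustration, and the draws assume the noninformative prior described above:

```python
import numpy as np

def posterior_draw(X, y, rng):
    """Steps 1-3: OLS fit, then one posterior draw of the variance and coefficients."""
    n1, k = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]            # step 1: coefficients b
    resid = y - X @ b
    s2 = resid @ resid / (n1 - k)                       # step 1: residual variance s^2
    s2_draw = (n1 - k) * s2 / rng.chisquare(n1 - k)     # step 2: s^2[1]
    cov = s2_draw * np.linalg.inv(X.T @ X)              # step 3: covariance matrix
    b_draw = rng.multivariate_normal(b, cov)            # step 3: b[1]
    return b_draw, s2_draw

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(50), rng.normal(size=50)])        # simulated covariates
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=50)  # simulated outcome
b1, s2_1 = posterior_draw(X, y, rng)
```

Each completed data set gets its own call to this function, so the imputations reflect uncertainty about the regression parameters as well as residual variation.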
For each new set of regression parameters, predicted values are generated for all cases. Then, for each case with missing data on Y, we form a donor pool based on the predicted values and randomly choose one of the observed values of Y from the donor pool. This approach to predictive mean matching can be generalized to more than one Y variable with missing data, although the computations may become rather complex (Little, 1988).
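A sketch of the matching step, given a posterior coefficient draw b1 from the preceding steps. The helper function, the fixed pool size of 3, and all the data values are illustrative, not taken from the text:

```python
import numpy as np

def pmm_impute(X_obs, y_obs, X_mis, b1, pool_size, rng):
    """For each missing case, donate an observed Y from among the
    pool_size complete cases with the closest predicted values."""
    pred_obs = X_obs @ b1
    pred_mis = X_mis @ b1
    imputed = []
    for p in pred_mis:
        nearest = np.argsort(np.abs(pred_obs - p))[:pool_size]  # closest donors
        imputed.append(y_obs[rng.choice(nearest)])              # random donor
    return np.array(imputed)

rng = np.random.default_rng(1)
X_obs = np.column_stack([np.ones(8), np.arange(8.0)])  # hypothetical complete cases
y_obs = np.array([1.0, 2.0, 2.5, 4.0, 4.5, 6.0, 7.0, 8.5])
X_mis = np.array([[1.0, 2.5], [1.0, 6.5]])             # cases needing imputation
b1 = np.array([0.9, 1.05])                             # one posterior draw
values = pmm_impute(X_obs, y_obs, X_mis, b1, pool_size=3, rng=rng)
```

Note that every imputed value is an actual observed Y, so the method inherits the hot deck's protection against impossible or out-of-range imputations while allowing continuous predictors.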
Sampling on Empirical Residuals

In the data augmentation method, residual values are sampled from a standard normal distribution and then added to the regression predicted values to get the final imputed values. We can modify this method to be less dependent on parametric assumptions by making random draws from the actual set of residuals produced by the linear regression. This can yield imputed values whose distribution is more like that of the observed variable (Rubin, 1987), although it is still possible to get imputed values that are outside the permissible range.
As with other approaches to multiple imputation, there are some important subtleties involved in doing this properly. As before, let Y be the variable with missing data, to be imputed for the n₀ cases and observed for the n₁ cases. Let X be a k × 1 vector of variables (including a constant) with no missing data on the n₁ cases. We begin by performing the preceding three steps to obtain the linear regression of Y on X and generate random draws from the posterior distribution of the parameters. Then we add the following steps:

4. Based on the regression estimates in step 1, calculate standardized residuals for the n₁ cases with no missing data:

   eᵢ = (yᵢ − b'xᵢ)/s.
5. Draw a simple random sample (with replacement) of n₀ values from the n₁ residuals calculated in step 4.

6. For the n₀ cases with missing data, calculate imputed values of Y as

   yᵢ = b[1]'xᵢ + s[1]eᵢ,

where eᵢ represents the residuals drawn in step 5, and b[1] and s[1] (the square root of s²[1]) are the first random draws from the posterior distribution of the parameters.
These six steps produce one completed set of data. To get additional data sets, simply repeat steps 2 through 6 (except for step 4, which should not be repeated). As Rubin (1987) explained, this methodology can be readily extended to data sets with a monotonic missing pattern on several variables. Each variable is imputed using as predictors all variables that are observed when it is missing. The empirical residual method also can be modified to allow for heteroscedasticity in the imputed values (Schenker & Taylor, 1996). For each case to be imputed, the pool of residuals is restricted to those observed cases that have predicted values of Y that are close to the predicted value for the case with missing data.

Example
Let us try the partially parametric methods on a subset of the college data. TUITION is fully observed for 1,272 colleges. (For simplicity, we shall exclude the 30 cases with missing data on this variable.) Of these 1,272 colleges, only 796 report BOARD, the annual average cost of board at each college. Using TUITION as a predictor, our goal is to impute the missing values of BOARD for the other 476 colleges and estimate the mean of BOARD for all 1,272 colleges. First, we apply the methods we have used before. For the 796 colleges with complete data (listwise deletion), the average BOARD is $2,060 with a standard error of 23.4. Applying the EM algorithm to
TUITION and BOARD yields a mean BOARD of 2,032 (but no standard