QUANTITATIVE DATA ANALYSIS
Doing Social Research to Test Ideas
DONALD J. TREIMAN
Copyright © 2009 by John Wiley & Sons, Inc. All rights reserved.

Published by Jossey-Bass
San Francisco, CA 94103, www.josseybass.com

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, 201-748-6011, fax 201-748-6008, or online at www.wiley.com/go/permissions.

Readers should be aware that Internet Web sites offered as citations or sources for further information may have changed or disappeared between the time this was written and when it is read.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Jossey-Bass books and products are available through most bookstores. To contact Jossey-Bass directly call our Customer Care Department within the United States at (800) 956-7739, outside the United States at (317) 572-3986, or via fax at (317) 572-4002.

Jossey-Bass also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data
Treiman, Donald J.
Quantitative data analysis : doing social research to test ideas / Donald J. Treiman.
p. cm.
1. Social sciences--Research--Statistical methods. 2. Sociology--Research--Statistical methods. 3. Sociology--Statistical methods. 4. Social sciences--Statistical methods--Computer programs. 5. Stata. I. Title.
HA29.T675 2008
300.72--dc22

Printed in the United States of America
FIRST EDITION
PB Printing 10 9 8 7 6 5 4 3 2 1
20080131:v
CONTENTS

Tables, Figures, Exhibits, and Boxes
Preface
The Author
Introduction

1 CROSS-TABULATIONS
  What This Chapter Is About
  Introduction to the Book via a Concrete Example
  Cross-Tabulations
  What This Chapter Has Shown

2 MORE ON TABLES
  What This Chapter Is About
  The Logic of Elaboration
  Suppressor Variables
  Additive and Interaction Effects
  Direct Standardization
  A Final Note on Statistical Controls Versus Experiments
  What This Chapter Has Shown

3 STILL MORE ON TABLES
  What This Chapter Is About
  Reorganizing Tables to Extract New Information
  When to Percentage a Table "Backwards"
  Cross-Tabulations in Which the Dependent Variable Is Represented by a Mean
  Index of Dissimilarity
  Writing About Cross-Tabulations
  What This Chapter Has Shown

4 ON THE MANIPULATION OF DATA BY COMPUTER
  What This Chapter Is About
  Introduction
  How Data Files Are Organized
  Transforming Data
  What This Chapter Has Shown
  Appendix 4.A Doing Analysis Using Stata
    Tips on Doing Analysis Using Stata
    Some Particularly Useful Stata 10.0 Commands

5 INTRODUCTION TO CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
  What This Chapter Is About
  Introduction
  Quantifying the Size of a Relationship: Regression Analysis
  Assessing the Strength of a Relationship: Correlation Analysis
  The Relationship Between Correlation and Regression Coefficients
  Factors Affecting the Size of Correlation (and Regression) Coefficients
  Correlation Ratios
  What This Chapter Has Shown

6 INTRODUCTION TO MULTIPLE CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
  What This Chapter Is About
  Introduction
  A Worked Example: The Determinants of Literacy in China
  Dummy Variables
  A Strategy for Comparisons Across Groups
  A Bayesian Alternative for Comparing Models
  Independent Validation
  What This Chapter Has Shown

7 MULTIPLE REGRESSION TRICKS: TECHNIQUES FOR HANDLING SPECIAL ANALYTIC PROBLEMS
  What This Chapter Is About
  Nonlinear Transformations
  Testing the Equality of Coefficients
  Trend Analysis: Testing the Assumption of Linearity
  Linear Splines
  Expressing Coefficients as Deviations from the Grand Mean (Multiple Classification Analysis)
  Other Ways of Representing Dummy Variables
  Decomposing the Difference Between Two Means
  What This Chapter Has Shown

8 MULTIPLE IMPUTATION OF MISSING DATA
  What This Chapter Is About
  Introduction
  A Worked Example: The Effect of Cultural Capital on Educational Attainment in Russia
  What This Chapter Has Shown

9 SAMPLE DESIGN AND SURVEY ESTIMATION
  What This Chapter Is About
  Survey Samples
  Conclusion
  What This Chapter Has Shown

10 REGRESSION DIAGNOSTICS
  What This Chapter Is About
  Introduction
  A Worked Example: Societal Differences in Status Attainment
  Robust Regression
  Bootstrapping and Standard Errors
  What This Chapter Has Shown

11 SCALE CONSTRUCTION
  What This Chapter Is About
  Introduction
  Validity
  Reliability
  Scale Construction
  Errors-in-Variables Regression
  What This Chapter Has Shown

12 LOG-LINEAR ANALYSIS
  What This Chapter Is About
  Introduction
  Choosing a Preferred Model
  Parsimonious Models
  A Bibliographic Note
  What This Chapter Has Shown
  Appendix 12.A Derivation of the Effect Parameters
  Appendix 12.B Introduction to Maximum Likelihood Estimation
    Mean of a Normal Distribution
    Log-Linear Parameters

13 BINOMIAL LOGISTIC REGRESSION
  What This Chapter Is About
  Introduction
  Relation to Log-Linear Analysis
  A Worked Logistic Regression Example: Predicting Prevalence of Armed Threats
  A Second Worked Example: Schooling Progression Ratios in Japan
  A Third Worked Example (Discrete-Time Hazard-Rate Models): Age at First Marriage
  A Fourth Worked Example (Case-Control Models): Who Was Appointed to a Nomenklatura Position in Russia?
  What This Chapter Has Shown
  Appendix 13.A Some Algebra for Logs and Exponents
  Appendix 13.B Introduction to Probit Analysis

14 MULTINOMIAL AND ORDINAL LOGISTIC REGRESSION AND TOBIT REGRESSION
  What This Chapter Is About
  Multinomial Logit Analysis
  Ordinal Logistic Regression
  Tobit Regression (and Allied Procedures) for Censored Dependent Variables
  Other Models for the Analysis of Limited Dependent Variables
  What This Chapter Has Shown

15 IMPROVING CAUSAL INFERENCE: FIXED EFFECTS AND RANDOM EFFECTS MODELING
  What This Chapter Is About
  Introduction
  Fixed Effects Models for Continuous Variables
  Random Effects Models for Continuous Variables
  A Worked Example: The Determinants of Income in China
  Fixed Effects Models for Binary Outcomes
  A Bibliographic Note
  What This Chapter Has Shown

16 FINAL THOUGHTS AND FUTURE DIRECTIONS: RESEARCH DESIGN AND INTERPRETATION ISSUES
  What This Chapter Is About
  Research Design Issues
  The Importance of Probability Sampling
  A Final Note: Good Professional Practice
  What This Chapter Has Shown

Appendix A: Data Descriptions and Download Locations for the Data Used in This Book
Appendix B: Survey Estimation with the General Social Survey
References
Index
TABLES, FIGURES, EXHIBITS, AND BOXES

TABLES
1.1. Joint Frequency Distribution of Militancy by Religiosity Among Urban Negroes in the U.S., 1964.
1.2. Percent Militant by Religiosity Among Urban Negroes in the U.S., 1964.
1.3. Percentage Distribution of Religiosity by Educational Attainment, Urban Negroes in the U.S., 1964.
1.4. Percent Militant by Educational Attainment, Urban Negroes in the U.S., 1964.
1.5. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964.
1.6. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964 (Three-Dimensional Format).
2.1. Percentage Who Believe Legal Abortions Should Be Possible Under Specified Circumstances, by Religion and Education, U.S. 1965 (N = 1,368; Cell Frequencies in Parentheses).
2.2. Percentage Accepting Abortion by Religion and Education (Hypothetical Data).
2.3. Percent Militant by Religiosity, and Percent Militant by Religiosity Adjusting (Standardizing) for Religiosity Differences in Educational Attainment, Urban Negroes in the U.S., 1964 (N = 993).
2.4. Percentage Distribution of Beliefs Regarding the Scientific View of Evolution (U.S. Adults, 1993, 1994, and 2000).
2.5. Percentage Accepting the Scientific View of Evolution by Religious Denomination (N = 3,663).
2.6. Percentage Accepting the Scientific View of Evolution by Level of Education.
2.7. Percentage Accepting the Scientific View of Evolution by Age.
2.8. Percentage Distribution of Educational Attainment by Religion.
2.9. Percentage Distribution of Age by Religion.
2.10. Joint Probability Distribution of Education and Age.
2.11. Percentage Accepting the Scientific View of Evolution by Religion, Age, and Sex (Percentage Bases in Parentheses).
2.12. Observed Proportion Accepting the Scientific View of Evolution, and Proportion Standardized for Education and Age.
2.13. Percentage Distribution of Occupational Groups by Race, South African Males Age 20-69, Early 1990s (Percentages Shown Without Controls and Also Directly Standardized for Racial Differences in Educational Attainment; N = 4,004).
2.14. Mean Number of Chinese Characters Known (Out of 10), for Urban and Rural Residents Age 20-69, China 1996 (Means Shown Without Controls and Also Directly Standardized for Urban-Rural Differences in Distribution of Education; N = 6,081).
3.1. Frequency Distribution of Acceptance of Abortion by Religion and Education, U.S. Adults, 1965 (N = 1,368).
3.2. Social Origins of Nobel Prize Winners (1901-1972) and Other U.S. Elites (and, for Comparison, the Occupations of Employed Males 1900-1920).
3.3. Mean Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
3.4. Means and Standard Deviations of Income in 1979 by Education and Gender, U.S. Adults, 1980.
3.5. Median Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
3.6. Percentage Distribution Over Major Occupation Groups by Race and Sex, U.S. Labor Force, 1979 (N = 96,945).
5.1. Mean Number of Positive Responses to an Acceptance of Abortion Scale (Range: 0-7), by Religion, U.S. Adults, 2006.
6.1. Means, Standard Deviations, and Correlations Among Variables Affecting Knowledge of Chinese Characters, Employed Chinese Adults Age 20-69, 1996 (N = 4,802).
6.2. Determinants of the Number of Chinese Characters Correctly Identified on a Ten-Item Test, Employed Chinese Adults Age 20-69, 1996 (Standard Errors in Parentheses).
6.3. Coefficients of Models of Acceptance of Abortion, U.S. Adults, 1974 (Standard Errors Shown in Parentheses); N = 1,481.
6.4. Goodness-of-Fit Statistics for Alternative Models of the Relationship Among Religion, Education, and Acceptance of Abortion, U.S. Adults, 1973 (N = 1,499).
7.1. Demonstration That Inclusion of a Linear Term Does Not Affect Predicted Values.
7.2. Coefficients for a Linear Spline Model of Trends in Years of School Completed by Year of Birth, U.S. Adults Age 25 and Older, and Comparisons with Other Models (Pooled Data for 1972-2004, N = 39,324).
7.3. Goodness-of-Fit Statistics for Models of Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling, with Various Specifications of the Effect of the Cultural Revolution (Those Affected by the Cultural Revolution Are Defined as People Turning Age 11 During the Period 1966 through 1977), Chinese Adults Age 20 to 69 in 1996 (N = 6,086).
7.4. Coefficients for Models 4, 5, and 7 Predicting Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling (p Values in Parentheses).
7.5. Coefficients of Models of Tolerance of Atheists, U.S. Adults, 2000 to 2004 (N = 4,299).
7.6. Design Matrices for Alternative Ways of Coding Categorical Variables (See Text for Details).
7.7. Coefficients for a Model of the Determinants of Vocabulary Knowledge, U.S. Adults, 1994 (N = 1,757; R² = .2445; Wald Test That Categorical Variables All Equal Zero: F[...] = 12.48; p [...]).
[...] Association Membership.
12.9. Frequency Distribution of Occupation by Father's Occupation, Chinese Adults, 1996.
12.10. Interaction Parameters for the Saturated Model Applied to Table 12.9.
12.11. Goodness-of-Fit Statistics for Alternative Models of Intergenerational Occupational Mobility in China (Six-by-Six Table).
12.12. Frequency Distribution of Educational Attainment by Size of Place of Residence at Age Fourteen, Chinese Adults Not Enrolled in School, 1996.
13.1. Percentage Ever Threatened by a Gun, by Selected Variables, U.S. Adults, 1973 to 1994 (N = 19,260).
13.2. Goodness-of-Fit Statistics for Various Models Predicting the Prevalence of Armed Threat to U.S. Adults, 1973 to 1994.
13.3. Effect Parameters for Models 2 and 4 of Table 13.2.
13.4. Goodness-of-Fit Statistics for Various Models of the Process of Educational Transition in Japan (Preferred Model Shown in Boldface).
13.5. Effect Parameters for Model 3 of Table 13.4.
13.6. Odds Ratios for a Model Predicting the Likelihood of Marriage from Age at Risk, Sex, Race, and Mother's Education, with Interactions Between Age at Risk and the Other Variables.
13.7. Coefficients for a Model of Determinants of Nomenklatura Membership, Russia, 1988.
13.B.1. Effect Parameters for a Probit Analysis of Gun Threat (Corresponding to Models 2 and 4 of Table 13.3).
14.1. Effect Parameters for a Model of the Determinants of English and Russian Language Competence in the Czech Republic, 1993 (N = 3,945). (Standard Errors in Parentheses; p Values in Italic.)
14.2. Effect Parameters for an Ordered Logit Model of Political Party Identification, U.S. Adults, 1998 (N = 2,443).
14.3. Predicted Probability Distributions of Party Identification for Black and Non-Black Males Living in Large Central Cities of Non-Southern SMSAs and Earning $40,000 to $50,000 per Year.
14.4. Effect Parameters for a Generalized Ordered Logit Model of Political Party Identification, U.S. Adults, 1998.
14.5. Effect Parameters for an Ordinary Least-Squares Regression Model of Political Party Identification, U.S. Adults, 1998.
14.6. Codes for Frequency of Sex in the Past Year, U.S. Adults, 2000.
14.7. Alternative Estimates of a Model of Frequency of Sex, U.S. Adults, 2000 (N = 2,258). (Standard Errors in Parentheses; All Coefficients Are Significant at .001 or Beyond.)
15.1. Socioeconomic Characteristics of Chinese Adults by Size of Place of Residence, 1996.
15.2. Comparison of OLS and FE Estimates for a Model of the Determinants of Family Income, Chinese RMB, 1996 (N = 5,342).
15.3. Comparison of OLS and FE Estimates for a Model of the Effect of Migration and Remittances on South African Black Children's School Enrollment, 2002 to 2003. (N(FE) = 2,408 Children; N(full RE) = 12,043 Children.)
FIGURES
2.1. The Observed Association Between X and Y Is Entirely Spurious and Goes to Zero When Z Is Controlled.
2.2. The Observed Association Between X and Y Is Partly Spurious: the Effect of X on Y Is Reduced When Z Is Controlled (Z Affects X and Both Z and X Affect Y).
2.3. The Observed Association Between X and Y Is Entirely Explained by the Intervening Variable Z and Goes to Zero When Z Is Controlled.
2.4. The Observed Association Between X and Y Is Partly Explained by the Intervening Variable Z: the Effect of X on Y Is Reduced When Z Is Controlled (X Affects Z, and Both X and Z Affect Y).
2.5. Both X and Z Affect Y, but There Is No Assumption Regarding the Causal Ordering of X and Z.
2.6. The Size of the Zero-Order Association Between X and Y (and Between Z and Y) Is Suppressed When the Effects of X on Z and Y Have Opposite Sign, and the Effects of X and Z on Y Have Opposite Sign.
4.1. An IBM Punch Card.
5.1. Scatter Plot of Years of Schooling by Father's Years of Schooling (Hypothetical Data, N = 10).
5.2. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling.
5.3. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling, Showing How the "Error of Prediction" or "Residual" Is Defined.
5.4. Least-Squares Regression Lines for Three Configurations of Data: (a) Perfect Independence, (b) Perfect Correlation, and (c) Perfect Curvilinear Correlation: a Parabola Symmetrical to the X-Axis.
5.5. The Effect of a Single Deviant Case (High Leverage Point).
5.6. Truncating Distributions Reduces Correlations.
5.7. The Effect of Aggregation on Correlations.
6.1. Three-Dimensional Representation of the Relationship Between Number of Siblings, Father's Years of Schooling, and Respondent's Years of Schooling (Hypothetical Data; N = 10).
6.2. Expected Number of Chinese Characters Identified (Out of Ten) by Years of Schooling and Gender, Urban Origin Chinese Adults Age 20 to 69 in 1996 with Nonmanual Occupations and with Years of Father's Schooling and Level of Cultural Capital Set at Their Means (N = 4,802). (Note: the female line does not extend beyond 16 because there are no females in the sample with post-graduate education.)
6.3. Acceptance of Abortion by Education and Religious Denomination, U.S. Adults, 1974 (N = 1,481).
7.1. The Relationship Between 2003 Income and Age, U.S. Adults Age Twenty to Sixty-Four in 2004 (N = 1,573).
7.2. Expected ln(Income) by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7; N = 1,459).
7.3. Expected Income by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7).
7.4. Trend in Attitudes Regarding Gender Equality, U.S. Adults Surveyed in 1974 Through 1998 (Linear Trend and Annual Means; N = 21,464).
7.5. Years of School Completed by Year of Birth, U.S. Adults (Pooled Samples from the 1972 Through 2004 GSS; N = 39,324; Scatter Plot Shown for 5 Percent Sample).
7.6. Mean Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
7.7. Three-Year Moving Average of Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
7.8. Trend in Years of School Completed by Year of Birth, U.S. Adults (Same Data as for Figure 7.5). Predicted Values from a Linear Spline with a Knot at 1947.
7.9. Graphs of Three Models of the Effect of the Cultural Revolution on Vocabulary Knowledge, Holding Constant Education (at Twelve Years), Chinese Adults, 1996 (N = 6,086).
7.10. Figure 7.9 Rescaled to Show the Entire Range of the Y-Axis.
10.1. Four Scatter Plots with Identical Lines.
10.2. Scatter Plot of the Relationship Between X and Y and Also the Regression Line from a Model That Incorrectly Assumes a Linear Relationship Between X and Y (Hypothetical Data).
10.3. Years of School Completed by Number of Siblings, U.S. Adults, 1994 (N = 2,992).
10.4. Years of School Completed by Number of Siblings, U.S. Adults, 1994.
10.5. A Plot of Leverage Versus Squared Normalized Residuals for Equation 7 in Treiman and Yip (1989).
10.6. A Plot of Leverage Versus Studentized Residuals for Treiman and Yip's Equation 7, with Circles Proportional to the Size of Cook's D.
10.7. Added-Variable Plots for Treiman and Yip's Equation 7.
10.8. Residual-Versus-Fitted Plot for Treiman and Yip's Equation 7.
10.9. Augmented Component-Plus-Residual Plots for Treiman and Yip's Equation 7.
10.10. Objective Functions for Three M Estimators: (a) OLS Objective Function, (b) Huber Objective Function, and (c) Bi-Square Objective Function.
10.11. Sampling Distributions of Bootstrapped Coefficients (2,000 Repetitions) for the Expanded Model, Estimated by Robust Regression on Seventeen Countries.
11.1. Loadings of the Seven Abortion-Acceptance Items on the First Two Factors, Unrotated and Rotated 30 Degrees Counterclockwise.
13.1. Expected Probability of Marrying for the First Time by Age at Risk, U.S. Adults, 1994 (N = 1,556).
13.2. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Discrete-Time Model, U.S. Adults, 1994.
13.3. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Polynomial Model, U.S. Adults, 1994.
13.4. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Non-Black U.S. Adults, 1994.
13.5. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Black U.S. Adults, 1994.
13.B.1. Probabilities Associated with Values of Probit and Logit Coefficients.
14.1. Three Estimates of the Expected Frequency of Sex per Year, U.S. Married Women, 2000 (N = 552).
14.2. Expected Frequency of Sex per Year by Gender and Marital Status, U.S. Adults, 2000 (N = 2,258).
16.1. 1980 Male Disability by Quarter of Birth (Prevented from Work by a Physical Disability).
16.2. Blau and Duncan's Basic Model of the Process of Stratification.
EXHIBITS
4.1. Illustration of How Data Files Are Organized.
4.2. A Codebook Corresponding to Exhibit 4.1.
BOXES
Open-Ended Questions
Samuel A. Stouffer
Technical Points on Table 1.1
Technical Points on Table 1.2
Technical Points on Table 1.3
Technical Points on Table 1.4
Technical Points on Table 1.5
Technical Points on Table 1.6
Paul Lazarsfeld
Hans Zeisel
Stata -do- Files and -log- Files
Direct Standardization in Earlier Survey Research
The Weakness of Matching and a Useful Fix
Technical Points on Table 3.3
Substantive Points on Table 3.3
A Historical Note on Social Science Computer Packages
Herman Hollerith
The Way Things Were
Treating Missing Values as If They Were Not
People Generally Like to Respond to (Well-Designed and Well-Administered) Surveys
Why Use the "Least Squares" Criterion to Determine the Best-Fitting Line?
Karl Pearson
A Useful Computational Formula for r
A "Real Data" Example of the Effect of Truncating the Distribution
A Useful Computational Formula for η²
Multicollinearity
Reminder Regarding the Variance of Dichotomous Variables
A Formula for Computing R² from Correlations
Adjusted R²
Always Present Descriptive Statistics
Technical Point on Table 6.2
Why You Should Include the Entire Sample in Your Analysis
Getting p-Values via Stata
Using Stata to Compare the Goodness-of-Fit of Regression Models
R. A. (Ronald Aylmer) Fisher
How to Test the Significance of the Difference Between Two Coefficients
Alternative Ways to Estimate BIC
Why the Relationship Between Income and Age Is Curvilinear
A Trick to Reduce Collinearity
In Some Years of the GSS, Only a Subset of Respondents Was Asked Certain Questions
An Alternative Specification of Spline Functions
Why Black versus Non-Black Is Better Than White versus Non-White for Social Analysis in the United States
A Comment on Credit in Science
Why Pairwise Deletion Should Be Avoided
Technical Details on the Variables
Telephone Surveys
Mail Surveys
Web Surveys
Philip M. Hauser
A Superior Sampling Procedure
Sources of Nonresponse
Leslie Kish
How the Chinese Stratified Sample Used in the Design Experiments Was Constructed
Weighting Data in Stata
Limitations of the Stata 10.0 Survey Estimation Procedure
An Alternative to Survey Estimation
How to Downweight Sample Size in Stata
Ways to Assess Reliability
Why the SAT and GRE Tests Include Several Hundred Items
Transforming Variables so That "High" Has a Consistent Meaning
Constructing Scales from Incomplete Information
In Log-Linear Analysis "Interaction" Simply Means "Association"
L² Defined
Other Software for Estimating Log-Linear Models
Maximum Likelihood Estimation
Probit Analysis
Technical Point on Table 13.1
Limitations of Wald Tests
Smoothing Distributions
Estimating Generalized Ordered Logit Models with Stata
James Tobin
Panel Surveys in the Public Domain
Otis Dudley Duncan
Sewall Wright
Ask a Foreigner to Do It
George Peter Murdock
In the United States, Publicly Funded Studies Must Be Made Available to the Research Community
An "Available from Author" Archive
PREFACE

This is a book about how to conduct theoretically informed quantitative social research; that is, social research to test ideas. It derives from a course for graduate students in sociology and other social sciences and social science-based professional schools (public health, education, social welfare, urban planning, and so on) that I have been teaching at UCLA for some thirty years. The course has evolved as quantitative methods in the social sciences have advanced; early versions of the course were based on the first half of this book (through Chapter Seven), with additional materials added over the years. Interestingly, I have been able to retain the same format (a twenty-week course with one three-hour lecture per week and a weekly exercise, culminating in a term paper written during the last four weeks of the course) from the outset, which is, I suppose, a tribute to the increasing level of preparation and quantitative competence of graduate students in the social sciences. The book owes much to lively class discussions over the years, of often subtle and complex methodological points.

By the end of the book, you should know how to make substantive sense of a body of quantitative data. That is, you should be well prepared to produce publishable papers in your field, as well as first-rate dissertation chapters. Of course, there is always more to learn. In the final chapter (Chapter Sixteen), I discuss advanced topics that go beyond what can be covered in a first course in data analysis.

The focus is on the analysis of data from representative samples of well-defined populations, although some exceptions are considered. The populations can consist of almost anything: people, formal organizations, societies, occupations, pottery shards, or whatever; the analytic issues are essentially the same. Data collection procedures are mentioned only in passing. There simply is not enough space in an already lengthy book to do justice to both data analysis and data collection. Thus, you will need to look elsewhere for systematic instruction on data-collection procedures. A strong case can be made that you should do this after rather than before a course on data analysis because the main problem in designing a data collection effort is deciding what to collect, which means you first need to know how you will conduct your analysis. An alternative method of learning about the practical details of data collection is to become an apprentice (unpaid, if necessary) to someone who is about to conduct a survey and insist that you get to participate in it step-by-step even when your presence is a nuisance.

This book covers a variety of techniques, including tabular analysis, log-linear models for tabular data, regression analysis in its various forms, regression diagnostics and robust regression, ways to cope with missing data, logistic regression, factor-based and other techniques for scale construction, and fixed- and random-effects models as a way to make causal inferences. But this is not a statistics book; the emphasis is on using these procedures to draw substantive conclusions about how the social world works. Accordingly, the book is designed for a course to be taken after a first-year graduate statistics course in the social sciences. Although there are many equations in the book, this is because it is necessary to understand how statistical procedures work to use them intelligently. Because the emphasis is on applications, there are many worked examples, often adapted from my own research. In addition to data from sample surveys I have conducted, I also rely heavily on the General Social Survey, an omnibus survey designed for use by the research community and also for teaching. Appendix A describes the main data sets used for the substantive examples and provides information on how to obtain them; they are all available without cost.

The only prerequisites for successful use of this book are a prior graduate-level social science statistics course, a willingness to think carefully and work hard, and the ability to do high school algebra, either remembered or relearned. With only a handful of exceptions (references at one or two points to calculus and to matrix algebra), no mathematics beyond high school algebra is used. If your high school algebra is rusty, you can find good reviews in Helen Walker, Mathematics Essential for Elementary Statistics, and W. L. Bashaw, Mathematics for Statistics. These books have been around forever. Although more recent equivalents probably exist, school algebra has not changed, so it hardly matters. Copies of these books are readily available at amazon.com, and probably many other places as well.

The statistical software package used in this book is Stata (release 10). Downloadable command files (-do- files in Stata's terminology), files of results (-log- files), and ancillary computer files used in the computations are available at www.josseybass.com/go/quantitativedataanalysis. Often the details underlying particular computations are only found in the downloadable -do- and -log- files, so be sure to download and study them carefully. These files will be updated as new releases of Stata become available.

I use Stata in my teaching and in this book because it has very rapidly become the statistical package of choice in leading sociology and economics departments. This is not accidental. Stata is a fast and efficient package that includes most of the statistical procedures of interest to social scientists, and new commands are being added at a rapid pace. Although many statistical packages are available, the three leading contenders currently are Stata, SPSS, and SAS. As software, Stata is clearly superior to SPSS: it is faster, more accurate, and includes a wider range of applications. SAS, although very powerful, is not nearly as intuitive as Stata and is more difficult to learn (and to teach). Nonetheless, this book can be readily used in conjunction with either SPSS or SAS, simply by translating the syntax of the Stata -do- files. (I have done something like this, exploiting Allison's excellent, but SAS-based, exposition of fixed- and random-effects models [Allison 2005] by writing the corresponding Stata code.)
FOR INSTRUCTORS

Some notes on how I have used these materials in teaching may be helpful to you as you design your own course.

As noted previously, the course on which this book is based runs for two quarters (twenty weeks). I have offered one three-hour lecture per week and have assigned an exercise every week. When I first taught the course, I read these exercises myself, but as enrollments have increased, I have enjoyed the services of a T.A. (chosen from among students who had done well in the course in previous years), who assists students with the mysteries of computing and statistics and also reads and comments on the exercises. In recent years, I have offered seventeen lectures and have assigned exercises for all but the last, with the final month of the course devoted to producing two drafts of a term paper. I read the first drafts and write comments, in an attempt to emulate a journal submission process. Thus, in my course, everyone gets a "revise and resubmit" response. I encourage students to develop their term papers in the course of doing the exercises and to complete their drafts in the two weeks after the last exercise is due.

The initial exercises are designed to lead students in a guided way through the mechanics of analysis, and some of the later exercises do this as well. But the exercises increasingly take a free form: "carry out an analysis like that presented in the book." Illustrative answers are provided for those exercises that involve definitive answers, that is, those that are similar to statistics problem sets.

The course syllabus, weekly exercises, and illustrative answers to those exercises for which I have written illustrative answers are available for downloading from www.josseybass.com/go/quantitativedataanalysis
ACKNOWLEDGMENTS
As noted earlier, this book has been developed in interaction with many cohorts of graduate students at UCLA who have wrestled with each of the chapters included here and have revealed troubles in the exposition, sometimes by way of explicit comments and sometimes via displays of confusion. The book would not exist without them, as I never imagined myself writing a textbook, and so I owe them great thanks. One in particular, Pamela Stoddard, literally caused the book to be published in its current form by mentioning in the course of a chance airplane conversation with Andrew Pasternack, a Jossey-Bass acquisitions editor, that her professor was thinking of publishing the chapters he used as a course text. Andy contacted me, and the rest is history.
The course on which this book is based first came into being through collaboration with my colleague Jonathan Kelley, when he was a visiting professor at UCLA in the [...]. The first exercise is borrowed from him, and the general thrust of the course, especially the first half, owes much to him.
My colleague, Bill Mason, recently retired from the UCLA Sociology and Statistics departments, has been my statistical guru for many years. Often I have turned to him for insights into difficult statistical issues. And much that I have learned about topics that were not part of the curriculum when I was a graduate student has been from sitting in on [...] statistics courses offered by Bill. Another colleague, Rob Mare, has been helpful in much the same way. My new colleague, Jennie Brand, who took over my quantitative data analysis course in the fall of 2008, has read the entire manuscript and has offered helpful suggestions. Finally, the book has benefited greatly from very careful reading by a group of about 100 Chinese students, to whom I gave a special version of the course in an intensive summer session at Beijing University in July 2008. They caught
many errors that had gone unnoticed and raised often subtle points that resulted in the reworking of selected portions of the text.
My understanding of research design and statistical issues, especially concerning causality and threats to causal inference, has benefited greatly from the weekly seminar of the California Center for Population Research, which brings together sociologists, economists, and other social scientists to listen to, and comment on, presentations of work in progress, mainly by visitors from other campuses. The lively and wide-ranging discussion has been something of a floating tutorial, a realization of what I have imagined academic life could and should be like.
Finally, my wife, Judith Herschman, has displayed endless patience, only occasionally asking, "When are you going to finally publish your methods book?"
THE AUTHOR
Donald J. Treiman is distinguished professor of sociology at the University of California at Los Angeles (UCLA) and was until recently director of UCLA's California Center for Population Research. He has a BA from Reed College (1962) and an MA and PhD from the University of Chicago (1967). As a graduate student at Chicago, he spent most of his time at the National Opinion Research Center (NORC), where he gained valuable training and experience in survey research. He then taught at the University of Wisconsin, where he decided that he really was a social demographer at heart, and made the Center for Demography and Ecology his intellectual home. From Wisconsin, he moved to Columbia University and then, in 1975, to UCLA, where he has been ever since, albeit with extended sojourns elsewhere, as staff director of a study committee at the National Academy of Sciences/National Research Council (1978-1981) and fellowship years at the U.S. Bureau of the Census (1987-1988), the Center for Advanced Study in the Behavioral Sciences (1992-1993), and the Netherlands Institute for Advanced Study in the Humanities and Social Sciences (1996-1997).
Professor Treiman started his career as a student of social stratification and status attainment, particularly from a cross-national perspective, and this has remained a continuing interest. He and his Dutch colleague, Harry Ganzeboom, have been engaged in a long-term cross-national project to analyze variations in the status attainment process across nations throughout the world over the course of the twentieth century. To date, they have compiled an archive of more than 300 sample surveys from more than 50 nations, ranging through the last half of the century. In addition to his comparative project, Professor Treiman has conducted large-scale national probability sample surveys in South Africa (1991-1994), Eastern Europe (1993-1994), and China (1996), all concerned with various aspects of social inequality.
His current research has moved in a more demographic direction. He has a national probability sample survey currently in progress in China, which focuses on the determinants, dynamics, and consequences of internal migration.
INTRODUCTION
It is not uncommon for statistics courses taken by graduate students in the social sciences to be treated essentially as mathematics courses, with substantial emphasis on derivations and proofs. Even when empirical examples are used, which they frequently are because [...]

[...] showing what the relationship between religiosity and militancy would be if all religiosity groups had the same distribution of education. It is in this precise sense that we can say we are showing the association between religiosity and militancy net of the effect of education. As noted earlier, this procedure is known as direct standardization or covariate adjustment.
Note that the weights need not be constructed from the overall distribution in the table. Any other set of weights could be applied as well. For example, if we wanted to assess the association between religiosity and militancy on the assumption that Blacks had the same distribution of education as Whites, we would treat Whites as the standard population and use the White distribution across educational categories (derived from some external source) as the weights. We will see two examples of this strategy a bit later in the chapter.
Now let us construct a militancy-by-religiosity table adjusted, or standardized, for education, to see how the procedure works. We do this from the data in Table 1.6. First, we derive the standard distribution, the overall distribution of education. Because there are 993 cases in the table (= 108 + ... + 49), and there are 353 (= 108 + 201 + 44) people with a grammar school education, the proportion with a grammar school education is .356 (= 353/993). Similarly, the proportion with a high school education is .508, and the proportion with a college education is .137. These are our weights. Then to get the adjusted, or standardized, percent militant among the very religious, we take the weighted sum of the percent militant across the three education groups that subdivide the "very religious" category (that is, the figures in the top row of the table): 17% × .356
Quantitative Data Analysis: Doing Social Research to Test Ideas
TABLE 2.3. Percent Militant by Religiosity, and Percent Militant by Religiosity Adjusting (Standardizing) for Religiosity Differences in Educational Attainment, Urban Negroes in the U.S., 1964 (N = 993).
[Columns: Percent Militant; Percent Militant Adjusted for Education. Final row: Percentage spread.]
+ 34% × .508 + 38% × .137 = 29%. To get the adjusted percent militant among the "somewhat religious," we apply the same weights to the percentages in the second row in the table: 22% × .356 + 32% × .508 + 48% × .137 = 31%. Finally, to get the adjusted percent militant among the "not very or not at all religious," we do the same for the third row of the table, which yields 45 percent. We can then compare these percentages to the corresponding percentages for the zero-order relationship between religiosity and militancy (that is, not controlling for education). The comparison is shown in Table 2.3. (The Stata -do- file used to carry out the computations, using the command -dstdize-, and the -log- file that shows the results, are available as downloadable files from the publisher, Jossey-Bass/Wiley (www.josseybass.com/go/quantitativedataanalysis), as are similar files for the remaining worked examples in the chapter. Because we have not yet begun computing, it probably is best to note the availability of this material and return to it later unless you are already familiar with Stata.)
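The weighted sums just described can be checked with a few lines of code. The following is only an illustrative sketch in Python, not the author's Stata -do- file; it uses the weights and the two rows of percentages quoted in the text (the third row's cell percentages are not given here, so it is omitted), and the function and variable names are hypothetical.

```python
# Education weights (the standard distribution) as quoted in the text:
# grammar school, high school, college.
weights = [0.356, 0.508, 0.137]

# Percent militant within each education group, for the two religiosity
# rows whose cell values are quoted in the text.
percent_militant = {
    "very religious": [17, 34, 38],
    "somewhat religious": [22, 32, 48],
}


def standardize(rates, standard_weights):
    """Direct standardization: weighted sum of the category-specific rates."""
    return sum(rate * weight for rate, weight in zip(rates, standard_weights))


adjusted = {
    group: standardize(rates, weights)
    for group, rates in percent_militant.items()
}

for group, value in adjusted.items():
    print(f"{group}: {value:.2f}, or {round(value)} percent after rounding")
```

Swapping in a different list of weights (for example, the education distribution of some other standard population) changes only the `weights` line, which is the sense in which any set of weights could be applied.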
STATA -DO- FILES AND -LOG- FILES
In Stata, -do- files are files of commands, and -log- files record the results of executing -do- files. As you will see in Chapter Four, the management of data analysis is complex and is much facilitated by the creation of -do- files, which are efficient and also provide a permanent record of what you have done to produce each tabulation or coefficient. Anyone who has tried to replicate an analysis performed several years or even several months earlier will appreciate the value of having an exact record of the computations used to generate each result.
More on Tables
When presenting data of this sort, it is sometimes useful to compare the range in the percentage positive (in this case, the percent militant) across categories of the independent variable, with and without controls. In Table 2.3 we see that the difference in the percent militant between the least and most religious categories is twenty-one points whereas, when education is controlled, the difference is reduced to sixteen points, a 24 percent reduction (= 1 - 16/21). In some sense, then, we can say that education "explains" about a quarter of the relationship between religiosity and militancy. We need to be cautious about making computations of this sort and only employ them when they are helpful in making the analysis clear. In this example, for instance, it would not make much sense to compute a "spread" or "range" in the percentages if the relationship between religiosity and militancy is not monotonic (that is, if the percentage militant does not increase, or at least not decrease, as religiosity declines).
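The arithmetic behind the "about a quarter" statement, together with the monotonicity caution, can be sketched as follows. This is a minimal Python illustration; the helper names are hypothetical, and the three adjusted percentages are the ones reported in the text.

```python
def spread_reduction(unadjusted_spread, adjusted_spread):
    """Proportional reduction in the percentage spread once a control is applied."""
    return 1 - adjusted_spread / unadjusted_spread


def is_monotonic(percentages):
    """True if the percentages never decrease from one category to the next."""
    return all(later >= earlier for earlier, later in zip(percentages, percentages[1:]))


# Spreads quoted in the text: 21 points without controls, 16 points with education controlled.
reduction = spread_reduction(21, 16)

# Adjusted percent militant, ordered from most to least religious (values from the text).
adjusted_percent_militant = [29, 31, 45]

if is_monotonic(adjusted_percent_militant):
    print(f"Reduction in spread: {reduction:.1%}")
else:
    print("Relationship is not monotonic; a spread would be misleading.")
```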
DIRECT STANDARDIZATION IN EARLIER [...]
[...] to a "weighted net percentage difference" or "weighted net percentage spread." The really useful part of the procedure is the computation of adjusted, or standardized, rates. The subsequent computation of percentage differences or percentage spreads is only sometimes useful, as a way of summarizing the effect of control variables.
Example 2: Belief That Humans Evolved from Animals (Direct Standardization with Two or More Control Variables)
Sometimes we want to adjust, or standardize, our data by more than one control variable at a time, to get a summary of the effect of some variable on another when two or more other variables are held constant. Consider, for example, acceptance of the scientific theory of evolution. In 1993, 1994, and 2000, the [...]
[...]

where there are J groups and I categories of the dependent variable, which in this case is designated by X. So $X_i$ is the score for the ith category (of the jth group, although the category scores are the same for all groups), and $f_{ij}$ is the number of cases in the ith category among cases of the jth group. Notice the difference from Equation 5.9, where the i refers to individuals rather than to categories of the dependent variable.
WHAT THIS CHAPTER HAS SHOWN
In this chapter we have considered simple (two-variable) ordinary least-squares (OLS) [...]
CHAPTER 6
INTRODUCTION TO MULTIPLE CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
WHAT THIS CHAPTER IS ABOUT
In this chapter we consider the central technique for dealing with the most typical social science problem: understanding how some outcome is affected by several determining variables that are correlated with each other. We begin with a conceptual overview of multiple correlation and regression, and then continue with a worked example to illustrate how to interpret regression coefficients. We then turn to consideration of the special properties of categorical independent variables, which can be included in multiple regression equations as a set of dichotomous ("dummy") variables, one for each category of the original variable (except that to enable estimation of the equation, one category must be represented only implicitly). In the course of our discussion of dummy variables, we develop a strategy for comparing groups that enables us to determine whether whatever social process we are investigating operates in the same way for two or more subsegments of the population: males and females, ethnic categories, and so on. We conclude with an alternative way of choosing a preferred model, the Bayesian Information Criterion (BIC).
INTRODUCTION
For most social science purposes, the two-variable regressions we encountered in the previous chapter are not very interesting, except as a baseline against which to compare models involving several independent variables. Such models are the focus of this chapter. Here we generalize the two-variable procedure to many variables. That is, we predict some (interval or ratio) dependent variable from a set of independent variables. The logic is exactly the same as in the case of two-variable regression, except that we are estimating an equation in many dimensions.
Let us first consider the case where we have two independent variables. Extending the ten-observation example from the previous chapter, suppose we think that education depends not only on the father's education but also on the number of siblings. The argument is that the more siblings one has, the less attention one receives from one's parents (all else equal), and hence the less well one does in school and, therefore, the less education one obtains, on average (for examples of studies of sibship-size effects in the research literature, see Downey [1995], [...], Lu [2005], and Lu and Treiman [2008]). Suppose, further, that we have information on all three variables for our sample of ten cases:

Father's Years    Respondent's Years    Number of
of Schooling      of Schooling          Siblings
      2                  4                  3
     12                 10                  3
      4                  8                  4
     13                 13                  0
      6                  9                  2
      6                  4                  5
      8                 13                  3
      4                  6                  4
      8                  6                  3
     10                 11                  4
Note that the first two columns are simply repeated from the example in the previous chapter (see page 88). To test our hypothesis that the number of siblings negatively affects educational attainment, we would estimate an equation of the form:

$\hat{E} = a + b(E_F) + c(S)$    (6.1)
(Note that I use generic symbols, for example, X and Y, to indicate variables in equations of a general form, but mnemonic symbols, [...], to indicate variables in equations that refer to specific concrete examples. I find it much easier to keep track of what is in my equation when I use mnemonic symbols for variables.)
Equations such as Equation 6.1 are known as multiple regression equations. In multiple regression equations the coefficients associated with each variable measure the expected difference in the dependent variable associated with a one-unit difference in the given independent variable, holding constant each of the other independent variables. So in the present case, the coefficient associated with the number of siblings tells us the expected difference in educational attainment for each additional sibling among those whose fathers have exactly the same years of education. Correspondingly, the coefficient associated with the father's education tells us the expected difference in years of education for those whose fathers differ by one year in their education but who have exactly the same number of siblings. In the three-variable case (that is, when we have only two independent variables), but not when we have more variables, we can construct a geometric representation that illustrates the sense in which we are holding constant one variable and estimating the net effect of the other.
In multiple regression, as in two-variable regression, we use the least-squares criterion to find the "best" equation; that is, we find the equation that minimizes the sum of squared errors of prediction. However, whereas in bivariate regression we think in terms of the deviation between each observed point and a line, in multiple regression the analog is the deviation between each observed point and a geometric surface in k-dimensional space, where k = 1 + the number of independent variables. Thus, where there are two independent variables, the least-squares criterion minimizes the sum of squared deviations of each observation from a plane, as shown in Figure 6.1.
FIGURE 6.1. Three-Dimensional Representation of the Relationship Between Number of Siblings, Father's Years of Schooling, and Respondent's Years of Schooling (Hypothetical Data; N = 10).
Metric Regression Coefficients
The coefficients associated with each independent variable are known as regression coefficients, or net regression coefficients (or sometimes raw or metric regression coefficients, to distinguish them from standardized regression coefficients, about which we will learn more later). In the present case, the estimated regression equation is

$\hat{E} = 6.26 + .564(E_F) - .640(S)$    (6.2)

This equation tells us that a person who had no siblings and whose father had no education would be expected to have 6.26 years of schooling [...]
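Equation 6.2 can be reproduced from the ten hypothetical cases listed earlier in the chapter. What follows is a minimal sketch in plain Python rather than the Stata -regress- command used in the book; it solves the least-squares normal equations for the two-predictor case directly, and all function and variable names are illustrative assumptions.

```python
# The ten hypothetical cases from the text.
father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]        # father's years of schooling
respondent = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]   # respondent's years of schooling
siblings = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]         # number of siblings


def mean(values):
    return sum(values) / len(values)


def dev_cross(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


s_ff = dev_cross(father, father)
s_ss = dev_cross(siblings, siblings)
s_fs = dev_cross(father, siblings)
s_fe = dev_cross(father, respondent)
s_se = dev_cross(siblings, respondent)

# Least-squares solution of the normal equations with two independent variables.
det = s_ff * s_ss - s_fs ** 2
b = (s_fe * s_ss - s_fs * s_se) / det   # net coefficient for father's schooling
c = (s_se * s_ff - s_fs * s_fe) / det   # net coefficient for number of siblings
a = mean(respondent) - b * mean(father) - c * mean(siblings)

print(f"E-hat = {a:.2f} + {b:.3f}(E_F) + ({c:.3f})(S)")
```

To rounding, the three estimates match the intercept and the two net regression coefficients reported in Equation 6.2.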
Note that the coefficient associated with the father's education in Equation 6.2 is smaller than the corresponding coefficient [...] (in fact, -.503 in this example). Thus, in Equation [...], the observed effect of the father's education on the respondent's education [...] poorly educated [...] this association and gives the effect of the father's education net of (or controlling for) the number of siblings. The implication of this result [...] in the equation will be biased, that is, will overstate the causal relation between the given independent variable [...] (except in the limiting case that the left-out variable is uncorrelated with the [...] in the equation). This is known as specification error or omitted variable bias.
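The size of the omitted variable bias in this example can be worked out from the same ten cases. The sketch below (plain Python, illustrative names) compares the two-variable slope for the father's education with its net coefficient and checks the standard identity that the two-variable slope equals the net coefficient plus the sibling coefficient times the slope from regressing siblings on father's education; that identity is a general property of least squares, stated here as background rather than taken from the text.

```python
import math

father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]
respondent = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]
siblings = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]


def dev_cross(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


s_ff, s_ss = dev_cross(father, father), dev_cross(siblings, siblings)
s_fs = dev_cross(father, siblings)
s_fe, s_se = dev_cross(father, respondent), dev_cross(siblings, respondent)

# Correlation between the two independent variables.
r_father_siblings = s_fs / math.sqrt(s_ff * s_ss)

# Two-variable slope (siblings omitted) and the net coefficients (siblings included).
b_bivariate = s_fe / s_ff
det = s_ff * s_ss - s_fs ** 2
b_net = (s_fe * s_ss - s_fs * s_se) / det
c_net = (s_se * s_ff - s_fs * s_fe) / det

# Omitted-variable identity: bivariate slope = net slope + c * (slope of S on E_F).
implied_bivariate = b_net + c_net * (s_fs / s_ff)

print(f"r(E_F, S) = {r_father_siblings:.3f}")
print(f"two-variable slope = {b_bivariate:.3f}, net slope = {b_net:.3f}")
```

Because siblings depress schooling and are negatively correlated with the father's education, leaving them out inflates the father's coefficient, which is the direction of bias the paragraph describes.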
Some analysts present a series of successively more complete multiple regression equations and discuss changes in the size of specific coefficients resulting from the inclusion of additional variables. This is a sensible strategy under one specific condition: when the analyst wants to consider how the effect of one variable is modified by the inclusion of another variable (or variables). In a manner analogous to the search for spurious or intervening relationships in tables (see Chapters Two and Three), the analyst might want to investigate whether a particular relationship is wholly or partly explained by another factor. For example, it is observed that Southerners are less tolerant of social deviants than are persons living outside the South. However, the analyst may want to assess the possibility that this relationship is entirely (or largely) spurious, arising from the fact that Southerners are less well educated and less urban than others, and that education and urban residence increase tolerance. In this case it would be appropriate to present two models, one regressing tolerance on Southern residence and a second regressing tolerance on Southern residence, education,
and size-of-place, and then to discuss the reduction in the size of the coefficient associated with Southern residence that occurs when education and size-of-place are added to the equation. However, absent specific hypotheses regarding spurious or mediating effects, there is no point in estimating successive equations (except for models involving sets of dummy variables, discussed in the next section, or variables that alter functional forms, discussed in the next chapter); rather, all relevant variables should be included in a single regression equation. However, even in this case the analyst should present a table of zero-order (two-variable) correlation coefficients between pairs of variables, plus means and standard deviations for all interval and continuous variables and percentage distributions for all categorical variables. These descriptive statistics help the reader to understand the properties of the variables being analyzed. In addition, as noted earlier, the zero-order correlations provide a baseline for assessing the size of net effects when other variables are controlled.
Testing the Significance of Individual Coefficients
It is conventional to compute and report the standard error of the coefficient of each independent variable, although, as you will soon see, standard errors have limited utility in the case of dummy variables or interaction terms. The convention is to interpret coefficients at least twice the size of their standard error as statistically significant. This convention arises from the fact that the sampling distribution of regression coefficients follows a t-distribution and that, with 60 d.f. (where the degrees of freedom is computed as N - k - 1, with k the number of independent variables), t = 2.00 defines the 95 percent confidence interval around the value b = 0. It is important to understand that the t-statistics indicate the significance of each coefficient net of the effect of all other coefficients in the model. Thus, when several highly correlated variables are included in the model, it is possible that no one of them is significantly different from zero, although as a group they are significant (see also the following boxed comment on multicollinearity).
Some analysts estimate regression models involving several independent variables, drop the variables with nonsignificant coefficients (this is known as trimming the regression equation), and reestimate the model, on the ground that to leave coefficients in the model that have nonsignificant effects biases the estimates of the other variables. However, other analysts argue that the best estimate of the dependent variable is obtained by including all possible predictors, even those for which the difference from zero cannot be established with high confidence. The latter strategy is preferable because it provides the best point estimate based on a set of variables that the analyst has an a priori basis for suspecting affect the outcome.
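To make the "twice the standard error" convention concrete, the sketch below computes standard errors and t-ratios for the two coefficients of the ten-case example, using the usual OLS formulas for two predictors. It is an illustration in plain Python with hypothetical names, not output from the book. Note one caveat: the rule of thumb presumes roughly 60 degrees of freedom, whereas this toy example has only N - k - 1 = 7, for which the two-tailed .05 critical value of t is about 2.36, so a t-ratio slightly above 2 should not be read as significant here.

```python
import math

father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]
respondent = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]
siblings = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]


def dev_cross(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


n, k = len(respondent), 2
s_ff, s_ss = dev_cross(father, father), dev_cross(siblings, siblings)
s_fs = dev_cross(father, siblings)
s_fe, s_se = dev_cross(father, respondent), dev_cross(siblings, respondent)
s_ee = dev_cross(respondent, respondent)

det = s_ff * s_ss - s_fs ** 2
b = (s_fe * s_ss - s_fs * s_se) / det
c = (s_se * s_ff - s_fs * s_fe) / det

# Residual sum of squares and error variance with N - k - 1 degrees of freedom.
residual_ss = s_ee - b * s_fe - c * s_se
degrees_of_freedom = n - k - 1
error_variance = residual_ss / degrees_of_freedom

# Standard errors of the two net coefficients (two-predictor OLS formulas).
se_b = math.sqrt(error_variance * s_ss / det)
se_c = math.sqrt(error_variance * s_ff / det)
t_b, t_c = b / se_b, c / se_c

print(f"father's schooling: b = {b:.3f}, s.e. = {se_b:.3f}, t = {t_b:.2f}")
print(f"siblings:           c = {c:.3f}, s.e. = {se_c:.3f}, t = {t_c:.2f}")
print(f"d.f. = {degrees_of_freedom}")
```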
Standardized Coefficients
A question that naturally arises when there are multiple determinants of some dependent variable is which determinant has the greatest impact. We cannot directly compare the coefficients associated with each independent variable because they typically are expressed in different metrics. Is the consequence of a difference of one year of schooling completed by the father greater or smaller than the consequence of a difference of one sibling? Although the question can, of course, be answered - as we saw earlier, the cost of each additional
MULTICOLLINEARITY
When independent variables are highly correlated, a condition known as multicollinearity, regression coefficients tend to have large standard errors and to be rather unstable, in the sense that quite small changes in the distribution of the data can produce large changes in the size of the coefficients. As Fox notes (1991, 11; see also Fox 1997, 337-366), the inflation in the standard error of an independent variable, j, due to multicollinearity, is given by $1/\sqrt{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination (discussed later in this chapter) associated with the regression of variable j on the remaining independent variables; the square of this quantity, $1/(1 - R_j^2)$, is known as the variance inflation factor and can be computed in Stata by using the -estat vif- command after the -regress- command. (See Fox and Monette [1992] for a generalization to sets of independent variables, such as a set of dummy variables or a variable and its square; see also the discussion in Chapter Seven of this book, in the section on "Nonlinear Transformations.")
Clearly, the independent variables must be quite highly correlated for multicollinearity to be an important problem. For example, if $R_j^2 = .75$, the error variance will be quadrupled, and the standard error will thus be doubled. Because $R_j^2$'s as large as .75 are quite uncommon, multicollinearity is not often a problem in the social sciences, mainly arising in situations in which alternative measures of the same underlying concept are included in a single model and most commonly when aggregated data, such as properties of occupations, cities, or nations, are analyzed. In such situations, a reasonable solution is to combine the measures into a multiple-item scale (see Chapter Eleven).
Some analysts attempt to minimize multicollinearity by employing what is known as stepwise regression, in which variables are selected into (or out of) a model one at a time, in the order that produces the greatest increment (or the smallest decrement) in the size of the $R^2$. Such methods are generally misguided, both because they are completely atheoretical and because the order in which variables are selected can be quite arbitrary, given the previously noted instability in regression coefficients when variables are highly correlated.
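The relationship between $R_j^2$ and the inflation of the error variance and standard error is easy to tabulate. The following Python sketch is purely illustrative (it is not Stata's -estat vif-, and the function names are hypothetical); it evaluates the formulas from the box for a few values of $R_j^2$, including the .75 case discussed above.

```python
import math


def variance_inflation_factor(r2_j):
    """Inflation of the error variance of coefficient j: 1 / (1 - R_j^2)."""
    return 1.0 / (1.0 - r2_j)


def standard_error_inflation(r2_j):
    """Inflation of the standard error of coefficient j: the square root of the VIF."""
    return math.sqrt(variance_inflation_factor(r2_j))


for r2_j in (0.0, 0.25, 0.50, 0.75):
    vif = variance_inflation_factor(r2_j)
    se_ratio = standard_error_inflation(r2_j)
    print(f"R_j^2 = {r2_j:.2f}: variance x {vif:.2f}, standard error x {se_ratio:.2f}")
```

The slow growth of the inflation at low and moderate values of $R_j^2$ is what underlies the remark that the independent variables must be quite highly correlated before multicollinearity becomes an important problem.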
sibling is somewhat greater than the gain from each year of the father's education - the answer does not tell us which variable has the stronger effect on the dependent variable because the variance in the number of siblings is much smaller than the variance in the father's years of schooling. If it is not obvious why the size of the variance matters, consider the effect of education and income on the value of the car a person drives. Suppose that for a sample of U.S. adults, we estimate such an equation and obtain the following:
$\hat{V} = 15{,}000 + .5(I) - 500(E)$    (6.3)
We would hardly want to conclude from this that the effect of education is 1,000 times as large as the effect of income, or to measure income in units of $100 and then to
conclude that the effect of education is 10 times that of income. Actually, the equation indicates that a year of education reduces the (expected) value of a person's car by $500, net of income, whereas a $1,000 increment in income increases the (expected) value of a person's car by $500, net of education. In this precise sense, a year of education exactly offsets $1,000 in income. However, a more general way to compare regression coefficients is to transform them into a common metric.
The conventional way this is done is to express the relationship between the dependent and independent variables in terms of standardized variables, that is, variables transformed by subtracting the mean and dividing by the standard deviation. Because such variables all have standard deviation = 1, the regression coefficients associated with standardized variables indicate the number of standard deviations of difference on the dependent variable expected for a one standard deviation difference on the independent variable, net of the effects of all other independent variables. In the present example, the equation relating the standardized coefficients, that is, the standardized counterpart to Equation 6.2, is

$\hat{e} = .601(e_F) - .260(s)$    (6.4)
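The standardized coefficients in Equation 6.4 can be recovered from the metric coefficients in Equation 6.2 by rescaling each one by the ratio of the standard deviation of its independent variable to the standard deviation of the dependent variable. The sketch below does this in Python for the ten hypothetical cases; it takes the rounded metric coefficients as given from Equation 6.2, and the names are illustrative.

```python
from statistics import stdev

father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]
respondent = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]
siblings = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]

# Metric (unstandardized) coefficients as reported in Equation 6.2.
b_father, b_siblings = 0.564, -0.640


def standardized(metric_coefficient, x, y):
    """Rescale a metric coefficient by the ratio of standard deviations, s_X / s_Y."""
    return metric_coefficient * stdev(x) / stdev(y)


beta_father = standardized(b_father, father, respondent)
beta_siblings = standardized(b_siblings, siblings, respondent)

print(f"beta for father's schooling: {beta_father:.3f}")
print(f"beta for number of siblings: {beta_siblings:.3f}")
```

The number of siblings varies much less than the father's schooling in these data, which is why its standardized coefficient is the smaller of the two even though its metric coefficient is the larger in absolute value.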
(Reminder: As noted in the previous chapter, there is no intercept because standardized variables all have mean = 0 and a regression surface must pass through the mean of each variable.) From inspection of the coefficients in Equation 6.4, we conclude that the father's education has a greater effect on educational attainment than does the number of siblings, a greater effect in the precise sense that a one standard deviation difference in the father's years of schooling implies an expected difference of .60 of a standard deviation in the respondent's years of schooling, whereas a one standard deviation difference in the number of siblings implies only a -.26 standard deviation expected difference in the respondent's years of schooling.
Note that in practice we do not ordinarily standardize the variables and recompute the regression equation but rather instruct the software to report standardized coefficients (usually in addition to metric coefficients). Because standardized coefficients often are not reported, particularly in the economics literature, we also can make use of the relation $\beta_{YX} = b_{YX}(s_X/s_Y)$, that is, the fact that the standardized coefficient relating independent variable X to dependent variable Y is equal to the metric coefficient multiplied by the ratio of the standard deviations of the independent and dependent variables, to convert metric to standardized coefficients (or vice-versa). (Recall Equations 5.7 and 5.8.)
There is some controversy regarding standardized regression coefficients. The conventional wisdom in sociology and other social sciences is that they are useful for the purpose just described, to assess the relative effect size of each of a set of independent variables in determining some outcome, but that they are inappropriate for assessing the relative effect size of a given variable in different populations, precisely because the standardized coefficients will differ if the relative standard deviations differ in the populations [...]
[...] Because for dichotomous variables the size of standardized coefficients depends not only on the size of the metric coefficient but also on the proportion of the [...] "positive" attribute, it [...] generally [...] the coefficients for such variables.
Coefficient of Determination ($R^2$)
How well does Equation 6.2 explain the variance in educational attainment? We determine this via an exact analogy to $r^2$, known as $R^2$, or the coefficient of determination, which tells us the proportion of variance in the dependent variable explained by the entire set of independent variables. Just as for $r^2$, $R^2$ = 1 - the ratio of the error variance (the variance [...]
A FORMULA FOR COMPUTING $R^2$ FROM CORRELATIONS
A convenient formula for computing $R^2$ from a matrix of correlations and standardized regression coefficients is

$R^2 = \sum_i r_{YX_i}\,\beta_{YX_i}$

That is, $R^2$ can be computed as the sum of the products of the correlations between each of the independent variables and the dependent variable and the corresponding standardized regression coefficient.
$\hat{Y} = a + b(T) + \sum_i c_i Z_i$    (7.33)
where T is a linear representation of time (here, the year of the survey), and the $Z_i$ are dummy variables for each year the survey was conducted; note that two dummy variables must be omitted because the linear term uses up one degree of freedom. We then compare the two models in the usual way, via an F-test of the significance of the increment in $R^2$ and a comparison of BIC values. A convenient way to do the first in Stata is to estimate Equation 7.33 and then to test the hypothesis that all the $Z_i$ coefficients equal zero, via a Wald test using Stata's -test- command. (Note that Equation 7.33 is simply a different parameterization of an equation in which the linear term is omitted and only the dummy variables are included. The coefficients will, of course, differ. But the predicted values, $R^2$, and BIC will be identical.) If we conclude that a simple linear trend does not fit the data, we might then posit either a model with a smooth curve by including a squared term for T, or a model that tries to model particular historical events by grouping years into historically meaningful groups and identifying each group (less one) via a dummy variable, or a spline model (see the section "Linear Splines" later in the chapter). Because the variance explained by Equation 7.33 is the maximum possible from any representation of time (measured in years), the $R^2$ associated with Equation 7.33 provides a standard against which to assess, in substantive rather than strictly statistical terms, how close various
1 50
to Testldeas QuantitativeDataAnalysis:DoingSocialResearch
sociologically motivated constrainedmodels come to fully explaining temporal variation in the dependentvariable. Although, to simplify the exposition, I have not included any variablesin the model other than time, a model actually positedby a researcherqpically would include a number of covariates(otherindependentvariables)andalso,perhaps,interactionsbetweenthecovariatesandthe variablesrepresentingtime. Exactly the samelogic would apply to suchan analysis asto the simpleranalysisjust described;the logic is alsoidentical to the dummy variable approachto the assessment of group differencesdescribedin the previouschapter(although herethe "groups" are yearsor, if warrantedby the analysis,multiyear historical periods).
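The F-test of the increment in R² mentioned above can be sketched as follows. This is a minimal Python illustration rather than the Stata -test- procedure: the function name is hypothetical, and the two R² values are invented for the example (only the 11 and 21,448 degrees of freedom echo the worked example that follows).

```python
def incremental_f(r2_full, r2_restricted, extra_terms, resid_df):
    """F statistic for the increment in R^2 when extra_terms predictors are added.

    resid_df is the residual degrees of freedom of the fuller model.
    """
    if not 0.0 <= r2_restricted <= r2_full < 1.0:
        raise ValueError("expected 0 <= r2_restricted <= r2_full < 1")
    numerator = (r2_full - r2_restricted) / extra_terms
    denominator = (1.0 - r2_full) / resid_df
    return numerator / denominator


# Hypothetical R^2 values for a linear-trend model and an annual-dummies model.
f_stat = incremental_f(r2_full=0.0418, r2_restricted=0.0400,
                       extra_terms=11, resid_df=21448)
print(round(f_stat, 2))
```

The resulting statistic would be referred to an F distribution with extra_terms and resid_df degrees of freedom.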
Predicting Variation in Gender Role Attitudes over Time: A Worked Example

Four items on attitudes regarding gender-role equality were asked in most years of the GSS between 1974 and 1998. The four variables are shown here with the percentage endorsing the pro-equality position, pooled over all years in which all four questions were asked:

• Do you agree or disagree with this statement? Women should take care of running their homes and leave running the country up to men (74 percent disagree).

• Do you approve or disapprove of a married woman earning money in business or industry if she has a husband capable of supporting her? (77 percent approve).

• If your party nominated a woman for President, would you vote for her if she were qualified for the job? (84 percent say yes).

• Tell me if you agree or disagree with this statement: Most men are better suited emotionally for politics than are most women (63 percent disagree).
To form a gender-equality scale, I simply summed the pro-equality responses for the four items, excluding all people to whom the questions were not asked and treating other nonresponses as negative values. The point of treating "don't know" and similar responses as negative values rather than excluding them is to save cases. But this would not be wise if there were not substantive grounds for doing so; in this case, it seemed reasonable to me to treat "don't know" as something other than a clear-cut endorsement of gender equality.
IN SOME YEARS OF THE GSS, ONLY A SUBSET OF RESPONDENTS WAS ASKED CERTAIN QUESTIONS

Users of the GSS need to be aware that to increase the number of items that can be included in the GSS each year, some items are asked only of subsets of the sample. A convenient way to exclude people who were not asked the questions is to use the Stata -rmiss- option under the -egen- command to count the number of missing data responses and then to exclude people missing data on all items included in a scale. However, in the current analysis I excluded all those who lacked responses on any of the four items because some, but not all, of the questions were asked in some years.
Multiple Regression Tricks: Techniques for Handling Special Analytic Problems
Estimating equations such as Equations 7.32 and 7.33 suggests significant nonlinearities in attitudes regarding gender inequality. The increment in R² implies F = 3.54 with 11 and 21,448 d.f., which has a probability of less than 0.0001. However, the BIC for the linear trend model is more negative than the BIC for the annual variability model (the BICs are, respectively, -959 and -871), suggesting that a linear trend is more likely given the data. Because BIC and classical inference yield contradictory results, a sensible next step is to graph annual variations in the mean level of support for gender equality, to see whether there is any obvious pattern to the nonlinearity. If substantively sensible deviations from linearity are observed, the annual variation model might be accepted, or a new model, aggregating years into historically meaningful periods, might be posited (keeping in mind the dangers of modifying your hypotheses based upon inspection of the data; see the discussion of this issue at the end of Chapter Six), or a smooth curve or spline function might be fitted to the data.

Figure 7.4 shows both the linear trend line and annual variations in the mean. Inspecting the graph, it appears that deviations from linearity are neither large nor systematic. Given this, I am inclined to accept a linear trend model as the most parsimonious representation of the data, despite the F-test results. The linear trend is, in fact, quite substantial, implying an increase of 0.81 (= .0338*(1998 - 1974)) over the quarter of a century for which we have data; this is about 20 percent of the range of the scale and is about two-thirds of the standard deviation of the scale scores. Apparently, support for gender equality has been increasing modestly but steadily throughout the closing years of the twentieth century.

FIGURE 7.4. Trend in Attitudes Regarding Gender Equality, U.S. Adults Surveyed in 1974 Through 1998 (Linear Trend and Annual Means; N = 21,464). [Horizontal axis: year of survey; plotted series: linear trend and mean for year.]

From a technical point of view, it may be helpful to compare the estimates implied by the two alternative ways of representing departures from linearity: Equation 7.33 and the alternative specification that does not include a linear term for year. When the linear term is included, two dummy variable categories are dropped, rather than one, because the linear term uses up one degree of freedom. However, the two procedures produce identical results, which is evident from inspection of Table 7.1. Unfortunately, there is no simple correspondence between the coefficients in equations of the form of Equation 7.33 and deviations from the predictions of the linear equation. If you want to show annual departures from linearity, you need to construct a new variable, which is the difference between the predicted values for each year from Equation 7.32 and Equation 7.33. This is easily accomplished in Stata using the -foreach- or -forvalues- command.
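The arithmetic behind these statements can be sketched in a few lines. This is a Python illustration under stated assumptions, not the Stata -foreach- approach the text recommends: the intercept, slope, and 1975 dummy coefficient are the values legible in the 1975 row of Table 7.1, the slope .0338 is the linear-trend estimate quoted above, and the function names are hypothetical.

```python
# Coefficients as read from the 1975 row of Table 7.1 (Equation 7.33 form).
A = -71.68578        # intercept
B = 0.0375872        # coefficient on survey year (the linear term)
C_1975 = 0.0403799   # coefficient on the dummy variable for 1975


def predicted_annual_mean(year, dummy_coef):
    """Predicted value for one survey year from an equation like 7.33."""
    return A + B * year + dummy_coef


def departure_from_linearity(annual_prediction, linear_prediction):
    """Annual departure: the Equation 7.33 prediction minus the linear-trend prediction."""
    return annual_prediction - linear_prediction


pred_1975 = predicted_annual_mean(1975, C_1975)
print(round(pred_1975, 4))       # 2.5893

# Total change implied by the linear trend over the period covered by the data.
trend_increase = 0.0338 * (1998 - 1974)
print(round(trend_increase, 2))  # 0.81
```

Computing departures for every year would additionally require the intercept of the linear-trend equation, which is not reported in this passage.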
TABLE 7.1. Demonstration That Inclusion of a Linear Term Does Not Affect Predicted Values.

Year    Linear term omitted            Linear term included: a + b(year) + c(year)
1975    2.5114 + 0.0779 = 2.5893       -71.68578 + 0.0375872*1975 + 0.0403799 = 2.5893
1977    2.5114 - 0.0009 = 2.5105       -71.68578 + 0.0375872*1977 - 0.1136418 = 2.5105
...
1998    ...                            ...

LINEAR SPLINES

Sometimes we encounter situations in which we believe that the relationship between two variables changes abruptly at some point on the distribution of the independent variable, so that neither a linear nor a curvilinear representation of the relationship is adequate. For example, alcohol consumption may have no impact on health below some threshold, whereas above the threshold health declines in a linear way as alcohol consumption increases. Temporal trends also may abruptly change, as a result of policy changes or cataclysmic events such as depressions, wars, revolutions, and so on. In cases of this kind, it is useful to represent the relationships via a set of connected line segments, known as linear splines.

A Worked Example: Trends in Educational Attainment over Time in the United States

Consider changes in the average level of education over time. Figure 7.5 presents a scatter plot relating educational attainment to year of birth, estimated from the GSS. To create this graph, I combined data from all years between 1972 and 2004, dropped those born prior to 1900 because of the very small sample sizes, and dropped those less than age twenty-five at the time of the survey because many people do not complete their schooling until their mid-twenties. The graph is shown for a 5 percent sample of cases, in order to make it readable, and is "jittered" to make the density of points visible. There appears to be an increase in education over time, but its form is difficult to discern: is it linear or is the trend better represented by some other functional form?

FIGURE 7.5. Years of School Completed by Year of Birth, U.S. Adults (Pooled Samples from the 1972 Through 2004 GSS; N = 39,324; Scatter Plot Shown for 5 Percent Sample).

To discover how the increase is best represented, it is useful to plot the mean years of schooling for each birth year. Figure 7.6 shows such a plot, made with the same specifications as the scatter plot. Inspecting the graph, we see that the average education increased in a more or less linear way for those born between 1900 and 1947 but then leveled off. Because the means bounce around a bit, probably due to the relatively small number of cases for each year, it might be better to plot a three-year moving average, as in Figure 7.7 (see the downloadable -do- and -log- files). Inspecting this graph, we come to the same conclusion: there is a fairly abrupt change in the trend, with those born in the first half of the twentieth century (precisely, until 1947) experiencing a fairly steady year-by-year increase in their schooling, but those born in 1947 or later experiencing no change at all. This suggests that the trend in educational attainment is appropriately represented by a linear spline with a knot at 1947, where "knot" refers to the point at which the slope changes.

FIGURE 7.6. Mean Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).

FIGURE 7.7. Three-Year Moving Average of Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).

This specification can be represented by an equation of the form:

$$\hat{E} = a + b_1(B_1) + b_2(B_2) \qquad (7.34)$$

where $B_1$ = the year of birth for those born in 1947 or earlier and = 1947 otherwise, and $B_2$ = the year of birth - 1947 for those born after 1947 and = 0 otherwise. More generally, a spline function relating Y to X with segments $\nu_1, \ldots, \nu_{n+1}$ and knots at $k_1, k_2, \ldots, k_n$ can be represented by

$$\hat{Y} = a + b_1(\nu_1) + b_2(\nu_2) + \cdots + b_{n+1}(\nu_{n+1}) \qquad (7.35)$$

where $\nu_1 = \min(X, k_1)$, $\nu_2 = \max(\min(X - k_1, k_2 - k_1), 0)$, ..., $\nu_{n+1} = \max(X - k_n, 0)$ (see Panis [1994]; the entry for Stata's -mkspline- command [StataCorp 2007]; and Greene [2008]). Each slope coefficient is then the slope of the specified line segment. We can see this concretely by going back to our example, Equation 7.34, and evaluating the equation separately for those born in 1947 or earlier and those born after 1947. Writing B for year of birth, for those born in 1947 or earlier, we have

$$\hat{E} = a + b_1(B) + b_2(0) = a + b_1(B) \qquad (7.36)$$

and for those born later than 1947, we have

$$\hat{E} = a + b_1(1947) + b_2(B - 1947) = (a + 1947 b_1) + b_2(B - 1947) \qquad (7.37)$$
Notice that the intercept in Equation 7.37 is just the expected level of education for those born in 1947 and that $b_2$ gives the slope for those born after 1947. Thus, the expected level of education for those born in 1948 is just the expected level of education for those born in 1947 plus $b_2$; for those born in 1949 it is the expected level of education for those born in 1947 plus $2b_2$; and so on.

Estimating Equation 7.34 from the pooled 1972-2004 GSS data yields the coefficients in Table 7.2. By inspecting the BICs for three models (the spline model, a linear trend model, and a model that allows the expected level of schooling to vary year by year), it is evident that the linear spline model is to be preferred. Note, however, that a comparison of R²s indicates that by the criterion of classical inference, the model positing year-by-year variation in the level of schooling fits significantly better than the spline model. I am inclined to discount this result because it has no theoretical justification, is clearly inferior by the BIC criterion, and occurs simply as a consequence of the large sample size. Thus, I accept the linear spline model as the preferred model.

The coefficients for the line segments indicate that for people born in 1947 or earlier, there is an expected increase of .086 years of schooling for each successive birth cohort. Thus, people born twelve years apart would be expected to differ on average by about a year of schooling. However, for people born in 1947 or later, there is virtually no trend in educational attainment; the coefficient .0092 implies that it would take about a century for average schooling to increase by one year. This is a somewhat surprising result, especially because there have been substantial increases in educational attainment among disadvantaged minorities and also among women. However, as Mare (1995) notes, educationally disadvantaged proportions of the population have grown over time relative to the White majority. Disaggregation of the trend would be worthwhile but is not pursued here; it would make an interesting paper. The graph implied by the coefficients for the linear spline model is shown in Figure 7.8, together with a 2 percent random sample of observations from each cohort (reduced from 5 percent to 2 percent to make it easier to see the shape of the spline). In this figure the -jitter- feature in Stata is used to make it clear where in the graph there is the greatest density of points.

TABLE 7.2. Coefficients for a Linear Spline Model of Trends in Years of School Completed by Year of Birth, U.S. Adults Age 25 and Older, and Comparisons with Other Models (Pooled Data for 1972-2004, N = 39,324).

                                     b        s.e.
Slope (birth years 1900-1947)        ...      ...
Slope (birth years 1947-1979)        .0092    .0024
Intercept                            ...      ...

Model Comparisons
(1) Spline model                     ...
(2) Linear trend model               R² = .1167
(3) ...                              ...
(1) vs. (2)                          BIC difference = -531; R² difference = .0121; F = 545.2; d.f. = 1; 39321; p = .0000

AN ALTERNATIVE SPECIFICATION OF SPLINE FUNCTIONS

An alternative specification represents the slope of each line segment as a deviation from the slope of the previous line segment. In this specification, a different set of new variables is constructed. Suppose there are k knots, that X is the original variable, and that $\nu_1, \ldots, \nu_{k+1}$ are the constructed variables. Then

$\nu_1 = X$
$\nu_2 = X - k_1$ if $X > k_1$; = 0 otherwise
...
$\nu_{k+1} = X - k_k$ if $X > k_k$; = 0 otherwise

To see this concretely, consider the present example specifying a knot at birth year 1947 in the trend in educational attainment. We would estimate the equation with $\nu_1$ = birth year (X) and $\nu_2 = X - 1947$ if $X > 1947$ and = 0 otherwise. Then for those born in 1947 or earlier,

$$\hat{E} = a + b_1(X) + b_2(0) = a + b_1(X)$$

while for those born later than 1947

$$\hat{E} = a + b_1(X) + b_2(X - 1947)$$

Thus, for those born in 1948, the expected level of education is given by $(a + 1948 b_1) + b_2$; for those born in 1949 it is $(a + 1949 b_1) + 2b_2$; and so on. From this, it is evident that $b_2$ gives the deviation of the slope from the slope for the previous line segment. For useful discussions of these methods, see Smith (1979) and Gould (1993).
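The two ways of coding a linear spline described in this section, and the interpretation of the two slopes, can be sketched as follows. This is a minimal Python illustration, not Stata's -mkspline- command: the function names are hypothetical, the intercept is a placeholder (it is not reported in this passage and does not affect the comparison), and the slopes .086 and .0092 are the ones quoted in the text.

```python
def segment_coding(x, knots):
    """Spline variables whose coefficients are the slopes of each segment (Equation 7.35)."""
    ks = sorted(knots)
    terms = [min(x, ks[0])]
    for prev, cur in zip(ks, ks[1:]):
        terms.append(max(min(x - prev, cur - prev), 0))
    terms.append(max(x - ks[-1], 0))
    return terms


def deviation_coding(x, knots):
    """Spline variables whose coefficients are changes in slope at each knot."""
    return [x] + [max(x - k, 0) for k in sorted(knots)]


def predict(intercept, coefs, terms):
    return intercept + sum(c * t for c, t in zip(coefs, terms))


KNOT = 1947
b1, b2 = 0.086, 0.0092   # segment slopes quoted in the text
a = 0.0                  # placeholder intercept (assumption; not given here)

for year in (1940, 1947, 1950, 1975):
    by_segment = predict(a, (b1, b2), segment_coding(year, [KNOT]))
    # In the deviation coding the second coefficient is the change in slope, b2 - b1.
    by_deviation = predict(a, (b1, b2 - b1), deviation_coding(year, [KNOT]))
    assert abs(by_segment - by_deviation) < 1e-9

print(segment_coding(1950, [KNOT]))    # [1947, 3]
print(deviation_coding(1950, [KNOT]))  # [1950, 3]

# Birth-year gap needed for a one-year difference in expected schooling.
print(round(1 / b1, 1))  # 11.6 (roughly twelve years, before the knot)
print(round(1 / b2))     # 109 (about a century, after the knot)
```

Both codings give identical predicted values; they differ only in whether the second coefficient is read as a slope or as a change in slope.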
FIGURE 7.8. Trend in Years of School Completed by Year of Birth, U.S. Adults (Same Data as for Figure 7.5; Scatter Plot Shown for 2 Percent Sample). Predicted Values from a Linear Spline with a Knot at 1947.
A Second Worked Example, with a Discontinuity: Quality of Education in China Before, During, and After the Cultural Revolution

The typical use of spline functions is to estimate equations such as the one just discussed, in which all points are connected but the slope changes at specified points ("knots"). However, there are occasions in which we may want to posit discontinuous functions. The Chinese Cultural Revolution is such a case. It can be argued that the disruption of social order at the beginning of the Cultural Revolution in 1966 was so massive that it is inappropriate to assume any continuity in trends. Deng and Treiman (1997) make just such an argument with respect to trends in educational reproduction. They argue that there was then a gradual "return to normalcy" so that changes resulting from the end of the Cultural Revolution in 1977 were not nearly as sharp and were appropriately represented by a knot in a spline function rather than a break in the trend line.

Here we consider another consequence of the Cultural Revolution, the quality of education received (the example is adapted from Treiman [2007a]). Although primary schools remained open throughout the Cultural Revolution, higher level schools were shut down for varying periods: most secondary schools were closed for two years, from 1966 to 1968, and most universities and other tertiary-level institutions were closed for six years, from 1966 to 1972. Moreover, it was widely reported that even when the schools were open, little conventional instruction was offered: rather, school hours were taken up with political meetings and political indoctrination. Rigorous academic instruction was not fully reinstituted until 1977, after the death of Mao. Under the circumstances, we might well suspect that, quite apart from deficits in the amount of schooling acquired by those who were unfortunate enough to be of school age during the Cultural Revolution period, those cohorts also experienced deficits in the quality of schooling compared to those who obtained an equal amount of schooling before or after the Cultural Revolution.

To test this hypothesis, we can exploit the ten-item character recognition test administered to a national sample of Chinese adults that was also analyzed in Chapter Six (see Table 6.2). As before, I take the number of characters correctly identified as a measure of literacy and hypothesize that, net of years of school completed, people who turned age eleven during the Cultural Revolution would be able to recognize fewer characters than people who turned eleven before or after the Cultural Revolution period. Moreover, following Deng and Treiman (1997), I posit a discontinuity in the scores at the beginning but not at the end of the period. To do this, I estimate an equation of the form:
$$\hat{Y} = a + b_1(B_1) + b_2(B_2) + b_3(B_3) + c_1(D_1) + c_2(E) \qquad (7.38)$$

where $\hat{Y}$ is the predicted number of characters recognized; E = years of schooling; $B_1$ = year of birth (last two digits) if born prior to or in 1955 and = 55 if born subsequent to 1955; $B_2$ = 0 if born prior to 1956, = year of birth - 55 if born between 1956 and 1967, inclusive, and = 67 - 55 if born subsequent to 1967; $B_3$ = 0 if born prior to or in 1967 and = year of birth - 67 for those born after 1967; and $D_1$ = 0 for those born prior to or in 1955 and = 1 for those born after 1955. Note that the difference between this and Equation 7.35 is that I include a dummy variable to distinguish those born after 1955 from those born earlier; this is what permits the line segments to be discontinuous at 1955. If I were to have posited a discontinuity at 1967 as well, the equation would be the mathematical equivalent to estimating three separate equations, for the period before, during, and after the Cultural Revolution, in each case predicting the number of characters recognized from years of schooling and year of birth. The advantage of equations such as Equation 7.38 is that they permit the specification of alternative models within a coherent framework and by so doing permit us to select between models.

Estimating this equation yields the results shown for Model 4 in Tables 7.3 and 7.4. As in the previous example, I contrast my theory-driven specification with other possibilities: that there is a simple linear trend in the data; that there are year-by-year variations; that there are knots at both the beginning and the end of the Cultural Revolution, but no discontinuities; that there are discontinuities at both the beginning and the end of the Cultural Revolution; and, for the three spline functions, that there is a curvilinear relationship between year and knowledge of characters during the Cultural Revolution period.
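The birth-year variables defined for Equation 7.38 can be constructed mechanically. The following is a minimal Python sketch of that construction (the function name is hypothetical, and the analysis in the text was done in Stata); it builds only the spline and dummy terms and does not estimate the model.

```python
def cultural_revolution_terms(birth_year):
    """Return (B1, B2, B3, D1) for a two-digit birth year, e.g. 60 for 1960."""
    b1 = min(birth_year, 55)                    # slope for those born in 1955 or earlier
    b2 = max(min(birth_year - 55, 67 - 55), 0)  # slope for those born 1956-1967
    b3 = max(birth_year - 67, 0)                # slope for those born after 1967
    d1 = 1 if birth_year > 55 else 0            # permits a break in level at 1955
    return b1, b2, b3, d1


for year in (50, 55, 56, 67, 70):
    print(year, cultural_revolution_terms(year))
# 50 (50, 0, 0, 0)
# 55 (55, 0, 0, 0)
# 56 (55, 1, 0, 1)
# 67 (55, 12, 0, 1)
# 70 (55, 12, 3, 1)
```

Positing a second discontinuity, at 1967, would amount to adding one more dummy variable coded 1 for those born after 1967.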
TABLE 7.3. Goodness-of-Fit Statistics for Models of Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling, with Various Specifications of the Effect of the Cultural Revolution (Those Affected by the Cultural Revolution Are Defined as People Turning Age 11 During the Period 1966 through 1977), Chinese Adults Age 20 to 69 in 1996 (N = 6,086).
' - t iple Re g re s s i olnri c k s :T e c h n i q u efo s r H andl i ngS peci aA na yti c P robl errs
] 5l
' , :, Coefficients for Models 4, 5. and 7 Predicting Knowledge :- Chinese Characters by Year of Birth, Controlling for Years of Schooling :-Va lues in Parentheses).
:: 's of schooling
- i 955 or earlier(age11 1965or earler)
:
:--
1956-1967(age11 1966*1977)
: r - - 1968or l a te r(a g e1 1 1 9 7 8o r l a te r)
: - i: q inu: t ya t ' 9 5 5
Model4
Model 5
.443 (.000)
.443 (.000)
A44 (.000)
0.001 \.721)
0.001 (.134\
0.001 (.749)
0.043 (0.000).
0.032 (0.000)
0.041 (.000)
0.016
-0.557 ( 000)
*0.508. (.000)
. -o.o4l (0.18s) 0.028 (.012) -0.349. / nnl\
o.241 (.010)
, : : r nt inuit ya t 1 9 6 7
0.0066 (.00e)
:--,llineartend'195ffi7
= : (rootmeansquareerror)
Model 7
0.770
0.770
0.771
0.571
0.672
o.672
1.29
1.29
1.29
Comparison of the BICs suggests that three models (my hypothesized model, a model that in addition to a discontinuity at the beginning of the Cultural Revolution allows the trend during the Cultural Revolution period to be curvilinear, and a model positing discontinuities at both the beginning and the end of the Cultural Revolution) are about equally likely given the data, albeit with weak evidence favoring the single-knot model. Note that all three are strongly to be preferred over all other models.
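A comparison of this kind can be sketched as follows. The Python below assumes the R²-based form of BIC for regression models, N ln(1 - R²) + p ln(N), where p is the number of independent variables; that formula is an assumption here, since this passage does not restate it. The function name and the R² values and parameter counts are hypothetical and are not the entries of Table 7.3; only N = 6,086 is taken from the table title.

```python
import math


def bic_from_r2(r2, n, p):
    """BIC for a regression model, assuming the form N*ln(1 - R^2) + p*ln(N)."""
    if not 0.0 <= r2 < 1.0 or n <= 1:
        raise ValueError("expected 0 <= r2 < 1 and n > 1")
    return n * math.log(1.0 - r2) + p * math.log(n)


N = 6086  # sample size reported in the title of Table 7.3

# Hypothetical (R^2, number of predictors) pairs, for illustration only.
candidates = {
    "linear trend": (0.6650, 2),
    "single discontinuity plus knot": (0.6712, 5),
}

for name, (r2, p) in candidates.items():
    print(name, round(bic_from_r2(r2, N, p), 1))

# The model with the more negative BIC is the more likely one given the data.
best = min(candidates, key=lambda k: bic_from_r2(candidates[k][0], N, candidates[k][1]))
print(best)
```

Because the penalty term grows with ln(N), a small gain in R² bought with several extra parameters can leave BIC worse even when an F-test is significant, which is the kind of disagreement discussed in this section.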
Again, BIC and classical inference yield contradictory results because the two alternative models fit significantly better (at the 0.01 level) than does the originally hypothesized model. Here I am in a bit of a quandary as to which model to prefer. I have already stated a basis for positing a single discontinuity, plus a knot at the end of the Cultural Revolution. However, another analyst might favor a two-discontinuity model, on the ground that the curricular reform in 1977 that restored the primacy of academic subjects was radical enough to posit a discontinuity at the end as well as at the beginning of the Cultural Revolution. A third analyst might argue that a linear specification of trends, especially in times of great social disruption, is too restrictive and that it makes more sense to posit a curvilinear effect of time during the Cultural Revolution period. In Treiman (2007a), I presented the model positing a discontinuity at 1955, a knot at 1967, and a curve between 1955 and 1967 (see Figure 7.4 in that paper). However, the truth is that there is no clear basis for preferring any one of the three, except for the evidence provided by BIC, which suggests that the originally hypothesized model is slightly more likely than the others given the data. Again, my suggestion is, go with theory. If you have a theoretical basis for one specification over the others, that is the one to feature; but, at the same time, you must be honest about the fact that alternative specifications fit nearly equally well. In fact, the optimal approach is to present all three models and invite the reader to choose among them. A warning: if you do this, you probably will have to fight with journal editors, who are always trying to get authors to reduce the length of papers, and perhaps with reviewers, who sometimes seem to want definitive conclusions even when the evidence is ambiguous.
The estimated coefficients for all three models are shown in Table 7.4. In all three models each additional year of schooling results in nearly half a point improvement in the number of characters identified. However, the coefficients associated with trends over time are relatively difficult to interpret. Again, this is an instance in which graphing the relationship helps. Figure 7.9 shows, for each of the three preferred models, the predicted number of characters recognized for people with twelve years of schooling, that is, who have completed high school. Although the three graphs appear to be quite different, they all show a decline of about half a point in the number of characters identified for those who were age eleven during the early years of the Cultural Revolution period, relative to those with the same level of schooling who turned eleven before and after the Cultural Revolution. Thus, despite the difficulty in choosing among alternative specifications, together they strongly suggest that the quality of education declined during the Cultural Revolution. People who acquired their middle school (junior high school) education during the Cultural Revolution, in effect, lost a year of schooling; that is, displayed knowledge of vocabulary equivalent to those with one year less schooling who were educated before and after the Cultural Revolution.

Still, we should be cautious in our interpretation of Figure 7.9, where the Cultural Revolution effect appears to be quite large because of the way the data are graphed (with the y-axis ranging from 5.3 to 6.7 characters recognized). Indeed, Figure 7.10, in which the y-axis ranges from 0 to 10, suggests a rather different story: a very modest decline in the number of characters recognized. It is quite reasonable to report figures such as Figure 7.9 to make the differences among the model
FIGURE 10.7. Added-Variable Plots for Treiman and Yip's Equation 7.

Denmark, by contrast, has unusually low educational inequality relative to its income inequality and industrialization, but it has a much stronger education-occupation connection than would be expected from its position on the other two variables, so the omission or downweighting of Denmark would decrease the effect of educational inequality. Graph (b), assessing the effect of income inequality (II), reveals that only Denmark is a large outlier. Otherwise, the plot is fairly unremarkable. Graph (c), assessing the effect of industrialization (D), shows the United States to be a high-leverage observation, with a very high level of industrialization relative to its level of educational and income inequality. Because the United States is below the regression line, its omission would increase the slope.

FIGURE 10.8. Residual-Versus-Fitted Plot for Treiman and Yip's Equation 7.
ptotsand Formal Testsfor patterns Residual-Versus-Fitted in the Data
A secondtest,stata's - ovtest - command,assesses the possib ity of omittedvariables by testing whether the fit of the model is improved when the second through fourth powers of the fitted valuesare addedto the equation.Given the small samplesize,I takethe p-value of .08 resulting from this test as suggestingthe possibility of omitted variablesComponent-plus-residuar plots aretseful in iwealing theiunctional form of relationships and,by extension,the possibility of omitted variables.Suchplots differ from added-variable plots becausethey add back the linear componentof the;aftial relationshipbetweeny and X to the least-squaresresiduals, which may incl]de an unmodeled nonlinear component.Figure 10.9 showssuch plots for our data,using the ,.augmented,, version availablein StataGearchfor.,acprplot,,in the downloadableile ,,chl0.do,,). The plots in Figure 10.9 continue to show Denmark as a large outlier. But otherwise they do_not appearorderly; and-with one exception-I can tiink of no omitted vari_ ables.The exceptionderivesfrom work by Miillir and Shavit (199g) that suggeststhat the education-occupationcomection is especially strong in nations with wel-Jeveloped vocationaleducationsystemsand especiallyrv"uk in *ion, with poorly developedvoca_ tional educationsystems.In our dataDenmark,Germany,Austria, and ttre Netherlandshave especiallyshongvocationaleducationsysterns,andthe United States, Japan,ard Irelandhave very weak vocationaleducationsystems.The relationshipfound by Miiiter and shavit seems to hold in our data, with the nations with strong vocational educationsysternsabove the
ni riu t d Ef DC-
-_a
&16l lfcr dtufl
tutu, trd hi[
]F.
:lr b -*tu rEd€
|
h:r fr nru['d form r fuft.8!
RegressionDiagnostics 235
!
51 !9
n
==
0123 Educational inequality
-1.5
1-.5 0.5 Incomeinequality
-2 9-
1
Pe td *
F * td
-2
1012 Economi.development
.!0.9"Augmented Component-Plus-Residual FIGURE Plotsfor Treimanand Ws Equation7.
b
s t br F * rI 9I
ir 'tG D'
b Fd E} E F T
lb
qression line and the nationswith weak vocationaleducationsystemsbelow the regression h. This result suggestsaddingthe strengthof the vocationaleducationsystemasa predictor. b do this I add two dummy variablesto distinguishthe three setsof nations(strong,weab d neitherespeciallystrongnor especiallyweakvocationaleducationsystems).I thenreestiEe Equation7, which yields the coefficientsshownin the secondcolumn of Table 10.1 (for ruvenience, Column 1 showsthe metric coefflcientsfrom Treimanandyip's original Equa7, that is, those shownin Equation 10.3 of this chapter);lhe remaining colurnnsshow :ious additionalestimatesdiscussedin the following paragraphs. -D The specificationshownin Column2 poduces a betterrepresentation of the deierminants dthe strengthof theeducation-occupation cormectionin theeighteennationsstudiedherethan the original specification.The adjustedR2increasessubsantially and, as expectedfrom A pafiem of residuals,the coeffrcientsfor strongand weak vocationaleducationalsystems -es ld havethe expectedsigns.(I discussthe standarderrorslater in this chapter.) However,the question remains as to whether the results are still substantially driven !- India and Denmark. To determine this I repeatedall the diagnostic proceduresdisossed previously with the new equation. The Stata log contains the commandsI used, lm in the interest of saving spaceand avoiding tedium, I have not shown the resulting floa andwill not discussthe resultsexceptto note that India continuesto be a high leverage lnint and Denmark continues to be a large oudier, although the diagnostic indicators for both are somewhatless extremethan the conesponding indicatorsjust reviewed.
TABLE 10.1. Coefficients for Models of the Determinants of the Strength of the Occupation-Education Connection in Eighteen Nations.

                            18 Observations                              17 Observations (Expanded Model)
                            Original Model          Expanded Model       OLS                Robust
                            (Metric Coefficients)   (OLS Estimates)                         Regression
Educational inequality      -0.354 (0.532)          -0.292 (0.565)       -0.821 (2.268)     ...
Income inequality           -0.320 (0.299)          -0.342 (0.324)       -0.321 (1.594)     ...
Industrialization            0.299 (0.275)           0.287 (0.275)        0.208 (1.449)     ...
Strong voc. ed. system                               0.836 (0.410)        0.707 (0.644)     ...
Weak voc. ed. system                                -0.403 (0.414)       -0.476 (2.518)     ...
Intercept                    2.021 (0.222)           1.899 (0.251)        1.814 (0.631)     ...
R²                           .553                    .762                 .792
Adjusted R²                  .457                    .662                 .698
ROBUST REGRESSION
So what to do? Because we have no clear basis for modifying or omitting particular observations, nor for transforming our variables to a different functional form, we need an alternative way of handling outliers and high leverage points. One alternative is robust estimation, which does not in general discard observations but rather downweights them, giving less influence to highly idiosyncratic observations. Robust estimators are attractive because they are nearly as efficient as ordinary least-squares estimators when the error distribution is normal and are much more efficient when the errors are heavy-tailed, as is typical with high leverage points and outliers. There are, however, several robust estimators, and there are no clear-cut rules for knowing which to apply in what circumstances. The best advice is to explore your data as thoroughly as time and energy permit. (For further details on robust estimation, consult Fox [1997, 405-414; 2002], Berk [1990], and Hamilton [1992a; 1992b, 207-211].)

One class of robust estimators, known as M estimators, works by downweighting observations with large residuals. It does this by performing successive regressions, each time (after the first) downweighting each observation according to the absolute size of the residual from the previous iteration. Different M estimators are defined by how much weight they give to residuals of various sizes, which can be shown graphically as objective functions. The objective functions of three well-known M estimators are shown in (a), (b), and (c) of Figure 10.10. The OLS objective function ([a] of Figure 10.10) increases quadratically, as it must given that OLS regression minimizes the sum of squared residuals. The Huber function ([b] of Figure 10.10) gives small weight to small residuals but weights larger residuals as a linear function of their size. The bi-square objective function ([c] of Figure 10.10) gives sharply increasing weight to medium-sized residuals but then flattens out so that all large residuals have equal weight.
Because Huber weights deal poorly with severe outliers (whereas bi-weights sometimes fail to converge or produce multiple solutions), Stata's implementation of robust regression first omits any observations with very large influence (Cook's D > 1), uses Huber weights until the solutions converge, and then uses bi-weights until the solutions again converge. Because of the way it is defined, robust regression takes account only of outliers but not of high-leverage observations with small residuals. For some problems this can be a major limitation.

Figure 10.10. Objective Functions for Three M Estimators: (a) OLS Objective Function, (b) Huber Objective Function, and (c) Bi-Square Objective Function. (Horizontal axis of each panel: deviation score.)

Panel 2 of Table 10.1 shows (in Column 4) robust regression estimates for the elaborated model of the education-occupation connection we have been studying. There is no robust regression estimate in Panel 1 because the procedure dropped India at the outset because of its large Cook's D. Column 3 shows the corresponding OLS estimates with India omitted. Interestingly, the OLS and robust regression estimates differ very little in Panel 2, with the exception of the effect of strong vocational education, which is reduced in the robust estimate because Denmark, with its large residual, is downweighted. The agreement between different estimators does not always hold and should not be taken as an indication that robust estimation is unnecessary. However, the stability of the estimates under different estimation procedures gives us added confidence in them.

By contrast, the omission of India strongly affects the educational inequality coefficient, increasing it by more than a factor of two. The coefficient for strong vocational education is modestly reduced, and the coefficient for industrialization is even more modestly reduced. A reasonable conclusion is that the education-occupation connection […] general relationship […] between […] properly set India aside for separate consideration.
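The iterative downweighting that M estimators perform can be sketched in a few lines. The Python example below is only an illustration under stated assumptions, not Stata's robust regression procedure: it fits a one-predictor regression by iteratively reweighted least squares with Huber weights alone (no Cook's D screening and no bi-weight stage). The tuning constant 1.345, the scale estimate based on the median absolute residual, the function names, and the data are all assumptions introduced for the example.

```python
import statistics


def weighted_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def huber_fit(x, y, c=1.345, tol=1e-8, max_iter=100):
    """Iteratively reweighted least squares with Huber weights.

    Returns (intercept, slope, final weights). The constant c and the
    scale estimate are conventional choices assumed for this sketch.
    """
    weights = [1.0] * len(x)
    a, b = weighted_line(x, y, weights)  # the first pass is ordinary OLS
    for _ in range(max_iter):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale: median absolute residual, rescaled for normal errors.
        scale = statistics.median([abs(r) for r in resid]) / 0.6745
        if scale == 0:
            break
        # Weight 1 for small residuals; weight shrinks as |residual| grows.
        weights = [1.0 if abs(r) <= c * scale else c * scale / abs(r)
                   for r in resid]
        a_new, b_new = weighted_line(x, y, weights)
        converged = abs(a_new - a) < tol and abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b, weights


# Hypothetical data: an exact line y = 2 + 0.5x with one large outlier.
x = [float(i) for i in range(10)]
y = [2 + 0.5 * xi for xi in x]
y[5] += 10.0

ols_a, ols_b = weighted_line(x, y, [1.0] * len(x))
rob_a, rob_b, rob_w = huber_fit(x, y)
print("OLS slope:", ols_b, "Huber slope:", rob_b)
print("weight given to the outlier:", rob_w[5])
```

In this sketch the outlying observation is never dropped; it simply receives a smaller weight at each iteration, which is the behavior the text attributes to robust estimation in general.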
BOOTSTRAPPING AND STANDARD ERRORS
[…] both ordinary least squares and r[…] are not normally distributed, the distribution of errors is asymptotically normal, that is, […] the number of observations and […] is the number of […] One way around this problem is […]
[…] of eighteen nations is drawn does not actually exist since we took all nations for which data were available. Thus we need to resort to an approximation.

Bootstrapping approximates resampling by taking the observed sample as a proxy for the population and repeatedly sampling, with replacement, observations from the observed sample. Thus, in our current example, we would randomly draw (with replacement) a first sample of eighteen cases from our eighteen observations, say Norway, Netherlands, India, Ireland, Austria, United States, Finland, Philippines, Denmark […]

[…] this is true only to the extent that each item in a scale reflects the same underlying dimension (the conceptual variable). If an item is capturing some other underlying dimension instead of, or in addition to, the one of interest to the analyst, it will undercut the reliability (and validity) of a scale. For example, suppose responses to a question about willingness to have people of a different race as neighbors reflected differences in economic anxiety, with some people rejecting potential neighbors not because of racial intolerance but for fear (rightly or wrongly) of a reduction in property values. We would not want to include such an item in a scale of racial tolerance because it would tend to make the scale less reliable, with scale scores determined to some extent by whether we happened to include a lot of people with economic anxieties or only a few such people.

An important reason for creating reliable scales is that, all else equal, unreliable scales tend to have lower correlations with other variables. This follows from the fact that unreliable scales contain a lot of "noise." We might think of scales as having a "true" component and an "error" component. The "true" component is represented by the correlation of the observed measurement with the true underlying dimension; the size of this correlation gives the reliability of the scale. The "error" component (the portion uncorrelated with the underlying dimension) reflects idiosyncratic determinants of the observed measurement. From this definition of reliability, it follows that the lower the reliability of each of two measures, the lower the correlation between their observed values relative to the true correlation between the underlying dimensions. Formally, we can estimate the "true" correlation between variables by knowing their observed correlation and the reliability of each variable. The true correlation is given by

\[ \rho_{X_t Y_t} = \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}} \tag{11.1} \]

where $\rho_{X_t Y_t}$ is the correlation between the true scores, $r_{XY}$ is the observed correlation between X and Y, and $r_{XX'}$ and $r_{YY'}$ are the reliability coefficients for X and Y, respectively. Equation 11.1 is also referred to as a formula for correcting for attenuation caused by unreliability; $\rho_{X_t Y_t}$ is the correlation between X and Y corrected for attenuation. For example, if two scales each have a reliability of .7 and an observed correlation between them of .3, the correlation corrected for attenuation will be $.3/\sqrt{(.7)(.7)} = .43$. Clearly, correlations can be strongly affected by the reliability of the component variables.
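The correction for attenuation is simple enough to check by hand, but a short Python sketch makes the arithmetic explicit. The function name and argument names are hypothetical; the computation is the observed correlation divided by the square root of the product of the two reliabilities, as described in the text.

```python
import math


def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Observed correlation divided by the square root of the product
    of the two reliability coefficients."""
    if reliability_x <= 0 or reliability_y <= 0:
        raise ValueError("reliabilities must be positive")
    return r_xy / math.sqrt(reliability_x * reliability_y)


# The example from the text: reliabilities of .7 and an observed correlation of .3.
corrected = correct_for_attenuation(0.3, 0.7, 0.7)
print(round(corrected, 2))  # 0.43
```

Because the divisor is never larger than one, the corrected correlation is always at least as large in absolute value as the observed correlation.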
[…] RELIABILITY
There are several ways to measure reliability:

• Test-retest reliability is the correlation between scores of a scale administered at two points in time.
• Alternate-forms reliability is the correlation between two different scales thought to measure the same underlying dimension.
• Internal-consistency reliability is a function of the correlation among the items in a scale. Cronbach's alpha, discussed in the following paragraphs, is an internal-consistency measure.
[…] to estimate […] extensive […] and others [1972, 1979] […] in this chapter […] this concept.

[…] depends on two factors: […] reliability, or internal consistency, […]

\[ \alpha = \frac{N\bar{r}}{1 + \bar{r}(N - 1)} \tag{11.2} \]

where N is the number of items and $\bar{r}$ is the average correlation among items. In Table […] respectively […] average correlation as .25, scales composed of at least seven […]

WHY THE SAT AND GRE TESTS INCLUDE SEVERAL […]
[…] number of items in […] makes clear why examinations […] GRE comprise several […] the test is taken, and also, of course, the degree of preparation.
Table 11.1. Values of Cronbach's Alpha for Multiple-Item Scales with Various Combinations of the Number of Items and the Average Correlation Among Items.

                     Average Correlation Among Items
Number of Items      .09       .25
  2                  .17       .40
  3                  .23       .50
  4                  .28       .57
  5                  .33       .62
  6                  .37       .67
  7                  .41       .70
  8                  .44       .73
  9                  .47       .75
 10                  .50       .77
 20                  .66       .87
 50                  .83       .94
100                  .91       .97
200                  .95       .99
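The entries in a table of this kind can be regenerated from the number of items and the average inter-item correlation alone. The sketch below assumes the standardized form of Cronbach's alpha, alpha = N r / (1 + r(N - 1)), where N is the number of items and r is the average correlation among items; the function and variable names are hypothetical.

```python
def cronbach_alpha(n_items, mean_r):
    """Standardized alpha from the number of items and the average
    inter-item correlation: N*r / (1 + r*(N - 1))."""
    return n_items * mean_r / (1 + mean_r * (n_items - 1))


# Print alpha for a few scale lengths at two average correlations.
for n in (2, 5, 7, 10, 20, 50):
    low = cronbach_alpha(n, 0.09)
    high = cronbach_alpha(n, 0.25)
    print(f"{n:>3} items: {low:.2f} {high:.2f}")
```

Reading down either column shows the practical point of the formula: with weakly correlated items, reliability can still be made high by adding items, but the gain from each additional item shrinks as the scale grows.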
SCALE CONSTRUCTION
In this chapter we consider three strategies to create multiple-item scales: additive scaling, factor-based scaling, and effect-proportional scaling. (For a brief general introduction to scaling, see McIver and Carmines [1981]. For a recent extended treatment, see Netemeyer, Bearden, and Sharma [2003]. A classic but still useful treatment is Nunnally and Bernstein [1984].)
Additive Scaling
The simplest way to create a multiple-item scale is simply to sum or average the scores of each of the component items, which is what we have been doing up to now. Where items are dichotomous, this amounts to counting the number of positive responses. Where the items themselves constitute scales (for example, continuous variables for education or income or attitude items ranging from "strongly agree" to "strongly disagree"), we ordinarily standardize the variables before averaging (by subtracting the mean and dividing by the standard deviation). If we fail to do this, the item that has the largest variance will have the greatest weight in the resulting scale. The effect of the variance on the weight is easy to see by considering what would happen if a researcher decided to make a socioeconomic status (SES) scale by combining education and income, and did so simply by summing, for each respondent, the number of years of school completed and the annual income. He may think he has an SES scale, but what he actually has is an income scale with a very slight amount of noise, because education typically ranges from zero to twenty years (and, in the United States, effectively from eight to twenty years) whereas income ranges in the tens of thousands of dollars. By dividing each variable by its standard deviation, the analyst gives each variable equal weight in determining the overall scale score. (I first came to appreciate this point many years ago in graduate school when a professor told me that he and another member of the faculty effectively controlled who passed and who failed the collectively graded PhD exams. They did so simply by using all one hundred points in the scale the faculty had devised for scoring exams, while most of their colleagues gave failing examinations a score of fifty or so.)
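The education-and-income example can be made concrete with a small Python sketch. The five respondents below are invented for illustration, and the helper names are hypothetical. The sketch compares a raw sum of the two variables with an average of their standardized values.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation computed from standardized scores."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Invented data for five respondents.
education = [8, 12, 12, 16, 20]                # years of school completed
income = [15000, 22000, 30000, 48000, 90000]   # annual income in dollars

# Naive "SES scale": the raw sum is dominated by the income variable.
raw_sum = [e + i for e, i in zip(education, income)]

# Additive scale built from standardized items: equal weight to each.
ses = [(ze + zi) / 2
       for ze, zi in zip(standardize(education), standardize(income))]

print("raw sum vs. income:", pearson(raw_sum, income))
print("scale vs. education:", pearson(ses, education))
print("scale vs. income:", pearson(ses, income))
```

The raw sum is almost perfectly correlated with income, whereas the standardized scale is correlated equally with its two components, which is what equal weighting means in practice.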
The trouble with simple additive scales of the sort just described is that the items included may or may not reflect a single underlying dimension. A scale with a heterogeneous set of items runs the risk of being both invalid, because in addition to what the analyst thinks the scale is measuring, it is also measuring something else, and unreliable, because at least some of the items are weakly or even negatively correlated.

Factor-Based Scaling
How can we determine whether the items we propose to include in a scale reflect a single dimension? First we identify a set of candidate items that we believe measure a single underlying concept. Then we empirically investigate two questions: (1) Do the items all "hang together" as a whole, or do one or more items turn out to be empirically distinct from (in the sense of having low correlations with) the remaining items, even though we thought they reflected the same conceptual domain? If so, we must reject the offending items. (2) Does each item have approximately the same net relationship to the dependent variable of interest? If not, the deviant items should not be used because this is again evidence that they do not measure the same concept (or that they measure other concepts besides the one of interest). Assessing the second question is a simple matter of regressing the dependent variable on the set of tentatively selected components of the scale, plus additional control variables where appropriate. When the scale will be used as a dependent variable, the correlations between the component items and the independent variables should be investigated. In both situations, we are looking for evidence that the candidate items have the same sign and approximately the same magnitude.
Education, occupational status, and income are good examples of items that are positively correlated but that tend to have quite different net effects on various dependent variables. For example, fertility is known to be negatively related to education net of income but positively related to income net of education. Similarly, various measures of tolerance tend to be positively related to education net of income but unrelated or negatively related to income net of education. For this reason the common practice of constructing scales of socioeconomic status should be avoided, and each of the component variables should be included as a separate predictor of the dependent variable of interest.

A useful procedure for deciding whether items "hang together" is to submit them to a factor analysis. Factor analysis (or more precisely, exploratory factor analysis) is a procedure for empirically determining whether a set of observed correlations can, with reasonable accuracy, be thought of as reflecting, or as generated by, a small number of hypothetical underlying factors. Factor analysis is a well-developed set of techniques, with many variations. However, this chapter is concerned not with the intricacies of factor analysis but with its use as a tool in scale construction. For our present purposes, the optimal procedure is to use principal factor analysis with iterations and a varimax rotation and then to inspect the rotated factor matrix. The varimax rotation rotates the factor matrix in such a way as to maximize the contrast between factors, which is what we want when we are trying to determine whether we can find distinctive subsets of items within a larger set of candidate items. We then choose the items that have high loadings on one factor and low loadings on the remaining factor or factors. A rule of thumb for "high" is loadings of .5 or more (which are consistent with correlations of about $.5^2 = .25$ or higher).
TRANSFORMING VARIABLES SO THAT "HIGH" HAS A CONSISTENT MEANING
In the context of factor analysis, "high" refers to the absolute value of a factor loading. We would thus regard a loading less than or equal to -.5 or greater than or equal to .5 as high. It is important to appreciate, however, that a high negative loading implies that a variable is negatively related to the underlying concept. For this reason, it is desirable to transform all variables so they conceptually run in the same direction (that is, so that a high value on the variable indicates a high level of the underlying dimension, from which it then follows that all the indicators should be positively correlated). For example, consider the GSS items SPKCOM ("Suppose this admitted Communist wanted to make a speech in your community. Should he be allowed to speak, or not?") and COLCOM ("Suppose he is teaching in a college. Should he be fired, or not?"). Clearly, a positive response to the first item and a negative response to the second item both indicate support for civil liberties. So to make the interpretation of the factor analysis less confusing, it would be desirable to reverse the scaling of the second item. This can be accomplished easily by transforming the original variable, X, into a reverse-scaled variable, X', using the relation $X' = (k + 1) - X$, where there are k response categories. Similar transformations are helpful in any kind of multivariate analysis.
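Reverse scaling is a one-line transformation. The sketch below assumes an item coded with the integers 1 through k; the function name and the five-category example are hypothetical and do not reproduce the actual GSS codes.

```python
def reverse_scale(x, k):
    """Reverse an item coded 1..k so that high and low swap: X' = (k + 1) - X."""
    if not 1 <= x <= k:
        raise ValueError("response must lie between 1 and k")
    return (k + 1) - x


# Hypothetical five-category item (1 = lowest, 5 = highest).
responses = [1, 2, 3, 4, 5]
reversed_responses = [reverse_scale(r, 5) for r in responses]
print(reversed_responses)  # [5, 4, 3, 2, 1]
```

Applying the transformation twice returns the original codes, which is a convenient check that no category has been lost in the recoding.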
We then choose those items that meet both criteria (high loadings on the factor and similar relationships to the dependent variable) and combine them into a single scale by standardizing them (subtracting the mean and dividing by the standard deviation) and then averaging them. These procedures typically produce scales with a mean near zero and a range from some messy negative number, around -2 or -3, to some equally messy positive number. For convenience of exposition, it is useful to convert the scale to a range extending from zero to one because then the coefficient associated with the scale gives the expected (net) difference on the dependent variable between cases with the lowest and highest scores on the scale. Such a conversion is easy to accomplish, by solving two equations in two unknowns as you did in school algebra:

\[ 1 = a + b(\max) \qquad 0 = a + b(\min) \tag{11.3} \]

where "max" is the maximum value of a scale, S, in the data, and "min" is the minimum value of S in the data. This yields a and b, which you then use to transform S into a new scale, S', as follows:

\[ S' = a + b(S) \tag{11.4} \]
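Solving the two equations gives b = 1/(max - min) and a = -min/(max - min). The Python sketch below applies that solution to a handful of invented scale scores; the function name and the data are hypothetical.

```python
def unit_range_coefficients(scores):
    """Solve 1 = a + b*max and 0 = a + b*min for a and b."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        raise ValueError("scale has no variation")
    b = 1 / (hi - lo)
    a = -lo * b
    return a, b


# Invented standardized scale scores with a "messy" range.
scores = [-2.3, -0.4, 0.0, 0.9, 2.7]
a, b = unit_range_coefficients(scores)
rescaled = [a + b * s for s in scores]
print(rescaled)
```

Because the transformation is linear, it changes only the metric of the scale: correlations with other variables are unaffected, and a regression coefficient for the rescaled variable is the expected difference between the lowest-scoring and highest-scoring cases.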
CONSTRUCTING SCALES FROM INCOMPLETE INFORMATION
When you construct multiple-item scales, it often is useful to compute scale scores even when information on some items is missing. This reduces the number of missing cases. For example, if I am constructing a five-item scale, I might compute the average if data are present for at least three of the five items. This is easy to accomplish in Stata by using the -rowmean- function of the -egen- command to compute the mean and the -rowmiss- function to count the number of missing items, replacing the scale score with the missing value code if the number of missing items exceeds your chosen limit (in the present example, if more than two of the five items have missing values).
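The same rule can be written out in Python to make the logic explicit. This is an analogue of the procedure just described, not Stata code: None stands in for the missing-value code, and the function name, argument names, and data are hypothetical.

```python
def scale_score(items, min_valid):
    """Average the non-missing items; return None (missing) when fewer
    than min_valid items have valid responses."""
    valid = [v for v in items if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


# Invented responses to a five-item scale; None marks a missing item.
respondents = [
    [4, 5, 3, 4, 4],           # complete data
    [2, None, 3, None, 1],     # two items missing: still scored
    [None, None, 5, None, 4],  # three items missing: score set to missing
]
scores = [scale_score(r, min_valid=3) for r in respondents]
print(scores)  # [4.0, 2.0, None]
```

The threshold is a judgment call. A stricter limit keeps scores more comparable across respondents, while a looser one retains more cases.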
If several factors emerge from the factor analysis, we can, of course, construct several scales. Here the problem of validity looms again. Because we ordinarily start with a set of candidate items that a priori we think measure a single underlying concept, we are on the firmest ground if only one factor emerges. If more than one factor emerges, we are forced to consider what concept each factor is measuring. Working from indicators to concepts poses the very real danger that our sociological imagination will get the better of us and that we will invent a concept to explain a set of correlations that reflect sampling error rather than some underlying reality. The danger is compounded if we forget that we have invented the concept to explain the data and start treating it as if it has an independent reality, that is, if we reify our concept. To be sure that we have actually discovered some underlying reality, we should replicate the items and the scale in some independent data set (perhaps by using a random half of our sample to develop our scales and fit our models and then using the other random half of the sample to verify the adequacy of both scales and models). Unfortunately, this seldom is done, because we usually want larger samples no matter how large our sample is. However, the GSS provides such opportunities because an analysis developed using data from one year often can be replicated using data from the preceding or following year. I strongly encourage this kind of independent validation.

Readers familiar with factor analysis may wonder why I suggest choosing a set of candidate items, weighting them equally, and averaging them, in contrast to constructing a scale by using the factor scores as weights. The reason is that using the factor scores maximizes the association between the hypothetical underlying concept and the constructed scale in the sample. That is, it capitalizes on sampling variability. The result is that the correlations between a scale constructed in this way and other variables are likely to be substantially smaller if the same analysis is replicated using a different data set. By contrast, the factor-based scaling procedure, in which the items are equally weighted, is much less subject to cross-sample shrinkage. In this sense factor-based scales are more reliable than are scales constructed using factor scores as weights.
A Worked Example: Religiosity and Abortion Attitudes (Again)
Abortion has become an increasingly salient and emotionally charged issue in recent years. Fundamentalist religious groups (and others) oppose abortion as "murder" while feminists (and others) defend the right of women to control their own bodies. Despite the sharp polarization of opinion regarding abortion, most Americans evidently support the availability of legal abortion under at least some circumstances. Many people find abortion acceptable for medical or therapeutic reasons but not for reasons of personal preference or convenience. Considering the theological underpinning of the "right to life" movement (that a fetus is a person and hence that abortion is tantamount to murder), we might expect strongly religious people to adamantly oppose abortion for personal preference reasons but to be less opposed to abortion for therapeutic reasons, when the "rights" of the fetus must be weighed against the health and safety of the mother. Those who are less religious, by contrast, might be expected to make less of a distinction between the acceptability of abortion for personal preference and therapeutic reasons. If these suppositions are correct, we would expect religiosity to have a weaker effect on attitudes regarding therapeutic abortion than on attitudes regarding abortion for personal preference reasons.

To test this hypothesis, I use data from the 1984 GSS, a representative sample of […] adult Americans. (See downloadable files "ch11.do" and "ch11.log" for estimation details.) I use the 1984 survey because it contains items suitable for constructing a scale of religiosity (discussed later). Specifically, I compare the coefficients in two regression equations:

\[ T = a + b(F) + c(E) \tag{11.5} \]

\[ P = a' + b'(F) + c'(E) \tag{11.6} \]

where T, P, and F are, respectively, scales of the acceptability of abortion for therapeutic reasons, the acceptability of abortion for personal preference reasons, and religiosity (fundamentalism). E is years of school completed, introduced as a control variable because it is known that acceptance of abortion increases with education and that religious fundamentalism is negatively correlated with education in the United States.

The three scales were constructed by factor analyzing items thought to represent the dimension being measured, eliminating items with low factor loadings, converting each item to standard score form, and averaging items. To facilitate interpretation of the regression coefficients, the resulting scales were then transformed so that each had a range of zero (for the lowest level of religiosity and the lowest acceptance of abortion) to one (for the highest level of religiosity and the highest acceptance of abortion).

Candidate items for the scale of religious fundamentalism included the following:
1. ATTEND: How often do you attend religious services? (Range: never . . . several times a week).
2. POSTLIFE: Do you believe there is a life after death? (no, yes).
3. PRAY: About how often do you pray? (Range: never . . . several times a day).
4. RELITEN: Would you call yourself a strong [religion named by respondent in response to question on religious preference] or not a strong [preference]? (not very strong; somewhat strong [volunteered] or don't know or no answer; strong).
5. BIB: Alternative versions of this question were asked of two-thirds and one-third of the sample, respectively:
   I. Which of these statements comes closest to describing your feelings about the Bible?
      a. The Bible is the actual word of God and is to be taken literally, word for word.
      b. The Bible is the inspired word of God but not everything in it should be taken literally, word for word.
      c. The Bible is an ancient book of fables, legends, history, and moral precepts recorded by men.
   II. Here are four statements about the Bible, and I'd like you to tell me which is closest to your own view:
      a. The Bible is God's word and all it says is true.
      b. The Bible was written by men inspired by God, but it contains some human errors.
      c. The Bible is a good book because it was written by wise men, but God had nothing to do with it.
      d. The Bible was written by men who lived so long ago that it is worth little today.
Versions I and II were combined, except that […] combined with category (c) from Version I to […] a new variable, NEWBIB. Before these […] correlated with the […] "don't know" and "no answer" responses […] the number of cases available […] iterated principal factoring and varimax rotation. A single dominant factor emerged, which explained 86 percent of the total variance, with loadings after rotation:

Items: ATTEND, POSTLIFE, PRAY, RELITEN, NEWBIB
Loadings: .787, .573, .654, .260 […]

Given the pattern of factor loadings, it appears from simple inspection that a three-[…] scale that includes […] with much lower […]

To create scales […] abortion, […] I factor analyzed the following […]: Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion . . .

1. ABDEFECT: If there is a strong chance of serious defect in the baby?
2. ABNOMORE: If she is married and does not want any more children?
3. ABHLTH: If the woman's own health is seriously endangered by the pregnancy?
4. ABPOOR: If the family has a very low income and cannot afford any more children?
5. ABRAPE: If she became pregnant as a result of rape?
6. ABSINGLE: If she is not married and does not want to marry the man?
7. ABANY: If the woman wants it for any reason?

In each case the possible responses were "Yes," "No," "Don't know," and "No answer." […] "Don't know" and "No answer" […] intermediate between "Yes" and "No." Although, as indicated […] that there are distinctive responses to abortion for therapeutic and personal preference reasons, I nonetheless factor analyzed both subsets of items together to confirm empirically that the two sets of items do in fact behave distinctively. Two nontrivial factors were extracted, which together explained 96 percent of the variance in the items. Table 11.2 shows the loadings before rotation.

As is evident, all seven items load strongly on Factor 1. But some loadings are positive and some are negative on Factor 2. The pattern of positive and negative loadings on Factor 2 suggests that these items can be subdivided into two distinct subsets. Table 11.3 shows the result of executing a varimax rotation, a rotation of the […] factor matrix that maximizes the distinction between factors.
Table 11.2. Factor Loadings for Abortion Acceptance Items Before Rotation. (Columns: Factor 1, Factor 2. Recoverable row labels: ABNOMORE, ABPOOR. Recoverable values, in order of appearance: -.263, .831, -.183, .412, .869, -.249.)

Table 11.3. Abortion Factor Loadings After Varimax Rotation. (Columns: Factor 1, Factor 2. Recoverable row labels: ABNOMORE, ABPOOR, ABSINGLE. Recoverable values, in order of appearance: .880, .876, .217.)
Inspecting these loadings you see that, as hypothesized, two factors underlie abortion attitudes. ABNOMORE, ABPOOR, ABSINGLE, and ABANY all load strongly on Factor 1 (shown in bold) and weakly on Factor 2, whereas the remaining three items load strongly on Factor 2 (shown in bold) and weakly on Factor 1. These two sets of items correspond to the a priori distinction I made between abortion for personal preference reasons (Factor 1) and abortion for therapeutic reasons (Factor 2).

Figure 11.1 demonstrates that the unrotated and rotated factor structures are simply mathematical transformations of one another and do nothing to change the relationship among the variables. The rotation merely presents the results in a form that makes them more readily interpretable. As noted previously, in the unrotated matrix (solid axes), all items load positively on Factor 1 but some items have positive loadings on Factor 2 and some items have negative loadings. After I rotate the axes 30 degrees counterclockwise (to the dashed lines), all items have positive loadings on both factors, but four (the personal preference reasons) load strongly on the first factor and weakly on the second factor, while three (the "therapeutic" reasons) load weakly on the first factor and strongly on the second factor.

Figure 11.1. Loadings of the Seven Abortion-Acceptance Items on the First Two Factors, Unrotated and Rotated 30 Degrees Counterclockwise. (Solid lines: unrotated axes. Dashed lines: rotated axes.)

Given these results, two separate scales are warranted. I therefore constructed a scale of acceptance of abortion for personal preference reasons, using the four items loading strongly on Factor 1, and a scale of items for therapeutic reasons, using the three items loading strongly on Factor 2. In each case the items were converted to standard form and averaged. I computed averages if valid responses were available for at least three of the four personal preference items and at least two of the three therapeutic items. Again the scales were transformed to range from zero to one, with one indicating high acceptance of abortion.

The second criterion for scale validity is whether the component items all bear approximately the same relationship to the other variables in the analysis. Ideally, one would assess both the zero-order and net relationships between the component items and the dependent variables. Here, however, the dependent variables are the two abortion attitudes scales. Thus I assess the consistency of the relationships simply by inspecting the correlations among each of the components of all three scales plus the remaining independent variable, education. These correlations are shown in downloadable file "ch11.log." All of the components of each scale show consistency with respect to sign and gross similarity with respect to magnitude in their correlations with the remaining variables. Thus I conclude that combining these items into scales as I have done is appropriate.

Table 11.4 shows the means, standard deviations, and correlations among the three scales and years of school completed, and Table 11.5 shows the coefficients estimated for Equations 11.5 and 11.6. Not surprisingly, the mean for the scale of acceptance of therapeutic abortion is much higher than the mean for the scale of acceptance of abortion for personal preference reasons. (Because each scale is calibrated by converting the lowest score in the sample to zero and the highest score in the sample to one, comparison of the means across scales is not, strictly speaking, legitimate. However, they do indicate where the typical respondent falls relative to the most accepting and least accepting respondents of each category of abortion, and hence can be used to compare the relative acceptance of the two types of abortion.)
As predicted, acceptance of abortion for reasons of personal preference is somewhat more strongly socially structured than is acceptance of abortion for therapeutic reasons. The R² for the former is .182, compared with .136 for the latter. Moreover, both of the metric coefficients are substantially larger for the personal preference equation than for the therapeutic equation, indicating that both education and religiosity have a greater effect on attitudes regarding abortion for personal preference reasons than regarding abortion for therapeutic reasons. However, the standardized effect of religiosity is about equally strong for both sets of abortion reasons, whereas the standardized effect of education is much stronger for personal preference abortion.
Seemingly Unrelated Regression
A formal test of whether corresponding coefficients differ significantly in the two equations is available through Zellner's seemingly unrelated regression procedure, implemented in Stata as -sureg-. This procedure simultaneously estimates models containing some or all of the same independent variables but different dependent variables. When the independent variables are identical across models, the coefficients and standard errors are identical to those from separately estimated equations, but -sureg- provides two additional kinds of information: an estimate of the correlation between residuals from each equation and a test of the significance of the difference between corresponding coefficients. In the present case, the correlation between residuals is .38, which tells us that whatever factors other than education and religiosity lead to acceptance of abortion for therapeutic reasons also tend (modestly) to lead to acceptance of abortion for personal preference reasons. The tests of the equality of corresponding coefficients reveal that, as hypothesized, the coefficients for education and religiosity are significantly larger in the personal preference equation than in the therapeutic equation. (See downloadable file "ch11.do" for details on how to implement -sureg-.)
Effect-Proportional Scaling
A special kind of scaling problem arises when we have an independent variable that has a nonlinear relationship to the dependent variable of interest. In Chapter Seven I discussed procedures for assessing whether relationships are nonlinear and for representing nonlinear relationships by changing the functional form of equations. One possibility I discussed was to represent nonlinear relationships by converting variables into sets of categories and studying the relationship between category membership and the outcome variable. In this section I describe an extension of categorical representations of variables: effect-proportional scaling, which is available in situations in which the dependent variable has a clear metric. (For an example of a research use of effect-proportional scaling, see Treiman and Terrell [1975].)

Suppose, for example, that we are interested in the relationship between educational attainment and occupational status in a nation with a multitrack school system. We might well expect that in such systems occupational attainment depends not only on the amount of schooling but on the type of schooling completed. How to represent the effect of schooling in a succinct way becomes a difficult problem in such situations. We could, of course, estimate and report the coefficients for a typology of type-by-extent of schooling, but this is
likely to require the presentation of many coefficients. An alternative would be to go a step further and scale educational categories in terms of their effect on occupational status.

From a technical point of view, this is very simple. We estimate the relationship between occupational status (measured, say, by the International Socioeconomic Index of Occupations [ISEI] [Ganzeboom, de Graaf, and Treiman 1992; Ganzeboom and Treiman 1996]) and a set of dummy variables corresponding to our typology of type-by-extent of schooling, and then we form a new education variable in which each category of the typology is assigned its predicted occupational status.

Doing this maximizes the correlation between educational attainment and occupational status (no other scaling of education would produce a higher correlation, given the same set of categories), and, of course, the correlation is identical to the correlation ratio. Thus the interpretation of the education variable becomes "the highest level of education achieved, calibrated in terms of its average occupational status return." So long as the analyst is candid with the reader that this is what has been done, there can be no objection. The clear advantage of the procedure is that it allows educational attainment to be included succinctly in subsequent analysis and thus permits assessment of how the relationship between educational attainment and occupational status is affected by other factors, and how the relationship differs across subpopulations, for example, by ethnicity.

Here is an example of the construction and use of such a scale. (No log file is provided for this worked example because no new computing techniques are introduced.)
In the 1996 Chinese survey analyzed earlier (in Chapters Six, Seven, and Nine; see Appendix A for details on the data set and how to obtain it), education was solicited with a question that included the categories shown in Table 11.6. Although, with the exception of the last two categories, the classification appears to form an ordinal scale of increasing education, it is not evident whether the scale has a monotonic relationship to occupational status. In fact it does not, as can be seen from the means on the ISEI shown in Table 11.6. In particular, vocational and technical middle school graduates tend to achieve substantially higher occupational status than do academic upper middle school graduates who do not go on to university.

I thus created a new education variable in which each category was assigned the mean ISEI score shown in Table 11.6. (A convenient way to do this in Stata is to regress ISEI on the education categories and get predicted values from the regression. The R² associated with this regression, .372, is, of course, just the square of the correlation ratio that we encountered in Chapter Five, η².) This scale can then be used in other analyses. For example, we might wish to assess the dependence of occupational status on education and father's occupational status for several nations, including China, to assess national similarities and differences in the relative importance of achievement and ascription in occupational status attainment.
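The scaling step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's Stata procedure: the category names and status scores are invented toy data, and the helper names (`category_means`, `pearson_r`, `correlation_ratio_squared`) are hypothetical. The sketch assigns each category its mean outcome and then checks that the squared correlation between the new scale and the outcome equals the squared correlation ratio.

```python
from statistics import mean

# Hypothetical toy data: (education category, occupational status score).
records = [
    ("primary", 20.0), ("primary", 24.0), ("primary", 22.0),
    ("academic_upper_middle", 34.0), ("academic_upper_middle", 38.0),
    ("vocational_middle", 48.0), ("vocational_middle", 52.0),
    ("university", 64.0), ("university", 68.0),
]


def category_means(rows):
    """Return the mean outcome for each category (the effect-proportional scale)."""
    groups = {}
    for category, y in rows:
        groups.setdefault(category, []).append(y)
    return {category: mean(ys) for category, ys in groups.items()}


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def correlation_ratio_squared(rows):
    """Eta squared: between-category sum of squares over total sum of squares."""
    ys = [y for _, y in rows]
    grand_mean = mean(ys)
    means = category_means(rows)
    ss_total = sum((y - grand_mean) ** 2 for y in ys)
    ss_between = sum((means[category] - grand_mean) ** 2 for category, _ in rows)
    return ss_between / ss_total


scale = category_means(records)                  # category -> scale score
scaled_education = [scale[c] for c, _ in records]
status = [y for _, y in records]

r = pearson_r(scaled_education, status)
eta_squared = correlation_ratio_squared(records)
print(scale)
print(round(r ** 2, 4), round(eta_squared, 4))   # the two values coincide
```

Because the scale scores are the category means, the non-monotonic ordering of the toy categories (vocational above academic upper middle) is carried directly into the new variable.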
TABLE 11.6. Mean Score on the ISEI by Level of Education, Chinese Males Age Twenty to Sixty-Nine, 1996.

Level of Education                         Mean ISEI    N
Illiterate                                 ...          113
Can read                                   16.0         ...
...
Upper middle (also specialized)            35.5         272
...
Specialized, including "five big"          61.0         111
...                                        65.1         65
Imperial degree holder (xiucai, juren)     30.5         ...
Other                                      39.0         ...
Total                                      28.5         2,413

ERRORS-IN-VARIABLES REGRESSION

As noted previously, unreliable measurement generally produces weaker measured effects. Thus when variables are measured with differential reliability, the multivariate structure of relationships can be substantially distorted. Because attitude variables often have low reliability, analyses including such variables often can be misleading. A way of correcting this problem, when measures of reliability are available, is to correct correlations for attenuation caused by unreliability. The Stata command -eivreg- (errors-in-variables regression) does this conveniently. The analyst supplies an estimate of the reliability of each variable, and the command makes the adjustment and carries out the regression estimation. (If no estimate is supplied, the variable is assumed to be measured with perfect reliability.)
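The idea of correcting a correlation for attenuation can be illustrated with the classic bivariate formula, in which the observed correlation is divided by the square root of the product of the two reliabilities. The Python sketch below shows only that textbook formula; it is not a description of what -eivreg- does internally, and the function name `disattenuate` and the input values are assumptions chosen for illustration.

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y=1.0):
    """Classic correction for attenuation of a single correlation.

    A reliability of 1.0 means the variable is treated as measured
    without error, so no adjustment is made for it.
    """
    for reliability in (reliability_x, reliability_y):
        if not 0.0 < reliability <= 1.0:
            raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Hypothetical inputs: an observed correlation of .30 between two scales
# with reliabilities of .66 and .93.
observed_r = 0.30
corrected_r = disattenuate(observed_r, 0.66, 0.93)
print(round(corrected_r, 3))
```

The less reliable a measure, the larger the upward adjustment, which is why variables measured with differential reliability can change their relative importance once the correction is applied.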
To show how this procedure works and what consequences it can have, I here present an analysis of the effect of abortion attitudes and religiosity (the three scales created previously) plus race, region of residence, an interaction between race and region, and the natural log of income on political conservatism.

From the previous analysis we have the reliability of the three scales. I take the reliability of the income measure, .8, from Jencks and others (1979, Table A2.13) and assume that race and region of residence are measured without error. Table 11.7 shows the results of OLS estimation without correcting for unreliability of measurement, and errors-in-variables estimation that does correct for unreliability. Because -eivreg- does not permit correction of the standard errors for clustering and requires aweights (or fweights), I carried out both conventional and errors-in-variables regression with the aweights specification.
TABLE 11.7. Coefficients of a Model of the Determinants of Political Conservatism Estimated by Conventional OLS and Errors-in-Variables Regression, U.S. Adults, 1984 (N = 1,294).

                                Conventional OLS          Errors-in-Variables
                                b        s.e.    p        b        s.e.    p
Religiosity                     0.692    0.170   .000     1.066    0.257   .000
Personal preference abortion   -0.282    0.091   .002    -0.220    0.113   .051
...
s.e.e.                          1.24                      1.23
The effect of adjusting for differential reliability is dramatic: the coefficient associated with religiosity increases by 54 percent. In addition, the coefficients associated with acceptance of therapeutic abortion and with income increase slightly, and the coefficient associated with acceptance of abortion for personal preference reasons decreases slightly. These results indicate clearly how the relative effects of variables in a multiple regression can be distorted if the variables are measured with differential reliability, as these are. (Recall that the reliabilities of the religiosity, therapeutic abortion, income, and personal preference abortion measures are, respectively, .66, .78, .80, and .93.)

Note that with one exception, all the coefficients are about what we would expect: political conservatism increases with religiosity, with income, and for non-Blacks, with Southern residence (although this last effect is only marginally significant) and decreases as acceptance of both kinds of abortion increase. The unexpected result is that acceptance of therapeutic abortion is a much stronger predictor of political conservatism than is acceptance of abortion for personal preference reasons. From the analysis I presented

... to make too much of this because the confidence intervals overlap (the 95 percent confidence interval is 0.71 to 1.01 for county-level cities and 0.63 to 0.84 for prefecture-level cities).
Column-effects models are formally identical to row-effects models, but with the role of rows and columns reversed. A column-effects model of the relationship between size of place at age fourteen and educational attainment does not fit as well as the corresponding row-effects model (BIC = -108, Δ = 2.98, and p < .000), which suggests that the assumption of equal scale differences between adjacent size-of-place categories is probably incorrect. This is hardly surprising given the deviation from equal differences in the estimated coefficients for size-of-place categories in the row-effects model and, especially, the non-monotonicity of the scores relative to my a priori ordering.

Row-and-Column-Effects Model I

Another analytic possibility is to treat both the row and column effects scores as unknown quantities to be estimated. However, in this case it is important to have the correct ordering of both the row and column categories because the results are not invariant under different orderings. For the Chinese example we have been exploring (the relationship between the size of the place of origin and educational attainment) this creates a bit of a dilemma. Is it better to reorder the size-of-place categories according to the scale scores derived from the row-effects model or to retain the a priori ordering derived from the Chinese administrative hierarchy? One possibility

... L.bney 1999, 7-8); and as noted, BIC is not available. The optimal solution may be to treat clustered samples in a multilevel context, estimating either fixed- or random-effects models (Mason 2001), which can be done in Stata using the -xt- or -gee- command; these procedures go beyond what can be covered in this book, but see Chapter Sixteen for a brief introduction to multilevel analysis. Although even now much that is published, even in leading journals, simply ignores complex sample designs and treats data as if they were generated by random sampling procedures, this is generally inappropriate and can lead to incorrect inferences. For the present, I suggest for logistic regression in its various forms that when you have data that are weighted or clustered you carry out your estimation using Stata's survey estimation commands and rely on adjusted Wald tests for model selection. Be cautious, however, in your interpretation and explore alternative specifications. Only where you have a true, unweighted, random sample should you use the -logistic- command and likelihood ratio test (-lrtest-). Further, whenever possible, eschew weighting in favor of including the variables used to create the weights in the model.
techniques for making the general shape of a distribution clear by removing "noise," that is, deviations from the underlying trend that result from sampling error or idiosyncratic factors. Perhaps the simplest smoother is a moving average. A moving average is the average value of several consecutive data points. Consider the worked example in this section. A three-year moving average of the expected probability of marriage at each age would be constructed by first taking the average of the expected probabilities for ages fifteen, sixteen, and seventeen; then the average of the expected probabilities for ages sixteen, seventeen, and eighteen; and so on. At the time the age-at-first-marriage example was created, the Stata subcommand -ma- ("moving average") was available within the -egen- command. However, this subcommand is no longer documented in Stata 10 (although it still works), and has been replaced by -smooth-, which generates medians of the included points rather than means. Another smoother available in Stata is -lowess-.
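The three-year moving average described above is simple enough to sketch directly. The Python fragment below is a generic illustration, not the -egen- subcommand mentioned in the text; the function name `moving_average` and the probabilities are invented for the example.

```python
def moving_average(values, window=3):
    """Average each run of `window` consecutive values.

    The result has len(values) - window + 1 entries; the first entry
    averages values[0:window], the second values[1:window + 1], and so on.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


# Hypothetical expected probabilities of first marriage at ages 15-19.
ages = [15, 16, 17, 18, 19]
expected_probability = [0.01, 0.02, 0.06, 0.04, 0.07]

smoothed = moving_average(expected_probability, window=3)
centered_ages = ages[1:-1]  # each average is plotted at the middle age
for age, value in zip(centered_ages, smoothed):
    print(age, round(value, 4))
```

Note that the dip at age eighteen in the toy series disappears in the smoothed values, which is the kind of year-to-year fluctuation a moving average is meant to remove.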
non-Blacks (precisely, 0.591 = 0.190*3.108). Among 30-year-old never-married people, the odds of marrying in that year among those whose mothers are college graduates are only 10 percent higher than the odds for those of the same race and sex whose mothers are high school graduates (precisely, 1.094 = (0.918*1.114)^4).

Despite the usefulness of Table 13.6 for making specific contrasts, the overall pattern implied by the coefficients is difficult to discern. Again, graphs help. Figures 13.4 and 13.5 show three-year moving averages of the expected probability of first marriage by age at risk, separately for Blacks and non-Blacks. In each graph, separate lines are shown for males and females whose mothers had twelve and sixteen years of schooling (as a convenient way of visually representing the effect of mother's education). Moving averages are shown because there is a great deal of "float and bounce" for individual years, which is evident from inspection of the coefficients in Table 13.6. (See the downloadable file "ch13_2.do" for details on how the moving averages were constructed.)

Inspecting Figures 13.4 and 13.5, we see that marriage rates for Blacks differ substantially from those for non-Blacks, with Blacks much less likely than non-Blacks to marry at all. Moreover, non-Black females (especially those whose mothers have only a high school education) marry at disproportionately high rates at ages nineteen through twenty-five; non-Black males marry a bit later and with less concentration in a short period. Black marriage rates, by contrast, are spread out over a much longer period, but with an upsurge in marriage rates for males in their thirties, especially those whose mothers are high school educated. For both Blacks and non-Blacks, males tend to marry later than females, with male rates exceeding those of females beginning around age thirty. Finally, among all race-by-sex groups, those whose mothers are high school graduates are more likely to marry than are those whose mothers are college graduates.

If I were preparing these results for publication, I would present only a subset of the rather large set of tables and graphs we have just marched through. The intent here, of
FIGURE 13.4. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Non-Black U.S. Adults, 1994.
[Plotted series: Females (12), Males (12), Females (16), Males (16). Horizontal axis: age at risk.]
FIGURE 13.5. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Black U.S. Adults, 1994.
[Plotted series: Females (12), Males (12), Females (16), Males (16). Horizontal axis: age at risk.]
course, is to provide alternatives for you to consider when presenting your own analyses. Examples of the application of discrete-time hazard-rate models include Astone and others (2000), Dawson (2000), Lewis and Oppenheimer (2000), and Sweeney (2002).
FOURTH WORKED EXAMPLE (CASE-CONTROL MODELS): WHO WAS APPOINTED TO A NOMENKLATURA POSITION IN RUSSIA?

When a dependent variable is a rare event, it is inefficient to draw a representative sample of the population at risk for the event, because the sample size would have to be extremely large to obtain enough "positive" cases to analyze. This is a frequent occurrence in epidemiological research, where the events of interest are diseases, but it also occurs in the social sciences. For example, if we are interested in studying what determines who gets elected to Congress, we could hardly do this by drawing a representative sample of the population and looking for the congressmen in it. We have similar problems in studying crime victimization, homosexuality, and various other relatively uncommon phenomena. One solution to this problem is to sample on the dependent variable (that is, to draw a sample of congressmen, criminals, or homosexuals), collect information on that sample, collect corresponding information on a representative sample of the population that has not experienced the rare event (becoming congressmen, criminals, or homosexuals), combine the two samples, and model the odds of experiencing the rare event. This is known as case-control sampling in the epidemiological literature (for an excellent review of the statistical procedures involved, see Breslow [1996]).

Case-control sampling exploits the fact that odds ratios are invariant under shifts in the distribution of the data. This extremely important feature of odds ratios makes it possible to combine samples with very different distributions on the independent and dependent variable in order to model rare events. This capability is not possible with OLS regression because OLS coefficients are affected by the distributions of the variables in the model.

To see how case-control procedures work in practice, let us consider what factors
affected the odds of becoming a member of the Russian political elite at the end of the Communist era. From Social Stratification in Eastern Europe after 1989 (Treiman and Szelenyi 1993), we have two representative samples from Russia: a probability sample of the general population (N = 5,002) and a random sample of persons who were in nomenklatura positions as of January 1988 (N = 850). (See Appendix A for a description of the data and information on how to obtain them.) Nomenklatura positions were those that required the approval of the Central Committee of the Communist Party. They ranged from very high government officials (for example, members of the Politburo) down to heads of sensitive organizations, for example, rectors of universities, editors in chief of newspapers, and heads of large industrial enterprises.

The general population sample departs in two ways from compliance with the assumptions underlying case-control sampling, but neither deviation is important from a practical standpoint. First, it is a probability sample of the 1993 population rather than the 1988 population. However, the sampling frame is based on the 1989 census, and the sample therefore probably represents the 1988 population nearly as well as it does
Binomial Logistic Regression 329
Before turning to interpretation of the results, we should note the one difference between case-control analysis and ordinary binomial logistic regression: in case-control analysis the intercept is not meaningful. This should be obvious from the fact that the intercept in logistic regression indicates the proportion of the sample that is "positive" with respect to the dependent variable. However, in case-control designs this proportion is fixed by the sample design, and thus the coefficient adds no information.

Inspecting the coefficients in Table 13.7, we see very large effects and few surprises. Each year of schooling increases the odds of becoming a member of the nomenklatura by more than 70 percent. Thus, all else equal, university graduates (who typically have 15 years of schooling in Russia) are more than 15 times as likely as high school graduates (with 10 years of schooling) to be appointed to nomenklatura positions (precisely, 15.32 = 1.726^(15-10)). The effect of gender is astronomical: males are more than 17 times as likely as females to be appointed to nomenklatura posts. The effect of age is also extremely strong: all else equal, the odds of being appointed to a nomenklatura position increase about 14 percent per year. Thus, for example, a 50-year-old is more than 7 times as likely to secure a nomenklatura position as is a 35-year-old (precisely, 7.23 = 1.141^(50-35)).

Perhaps more interesting, the effect of social origins, even among those equally well educated, is far from trivial. Coming from a family in which one's father was a member of the Communist Party improves one's chances of a nomenklatura appointment by about half, all else equal. Also, each year of father's schooling increases the odds of nomenklatura appointment by about 11 percent (this in the worker's paradise!), so that the offspring of the university-educated intelligentsia (15 years of school) are about three times as likely as the offspring of those with only a primary education to secure nomenklatura appointments, irrespective of their own educational achievement (precisely, 2.94 = 1.114^(15-5)). Alone among the variables we have considered, father's occupational status has no impact on the odds of appointment to a nomenklatura post.
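The multi-year contrasts quoted above all follow the same rule: a per-unit odds ratio is raised to the power of the number of units separating the two groups. A short Python check of the three figures, using the per-unit odds ratios given in the text (the function name `odds_multiplier` is a label introduced here for illustration):

```python
def odds_multiplier(per_unit_odds_ratio, units):
    """Odds ratio implied by a difference of `units` on the predictor."""
    return per_unit_odds_ratio ** units


# Per-unit odds ratios as quoted in the text.
schooling = odds_multiplier(1.726, 15 - 10)         # university vs. high school
age = odds_multiplier(1.141, 50 - 35)               # age 50 vs. age 35
father_schooling = odds_multiplier(1.114, 15 - 5)   # 15 vs. 5 years of father's schooling

print(round(schooling, 2), round(age, 2), round(father_schooling, 2))
```

The compounding is multiplicative because the model is additive in the log odds: a difference of several units adds the coefficient that many times on the log scale, which multiplies the odds ratio on the original scale.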
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have seen how to estimate and interpret binary logistic regression models, which are widely used to model dichotomous outcomes such as whether people vote, are employed, or are members of a particular organization. We have seen that although the estimation procedures are quite different, the interpretation of the coefficients of such models is similar to that of OLS regression, except that the coefficients represent net effects of each independent variable on the log odds of an outcome.

Because log odds are not intuitive quantities, we have considered two nonlinear transformations to more readily interpretable coefficients (odds and expected probabilities) and have also seen how to graph net relationships, a form of regression standardization for logistic regression. Finally, we considered three extensions of the basic logistic regression model: education progression ratios, discrete-time hazard-rate models, and case-control models. A notable feature of logistic regression models is that they are invariant with respect to the distributions of variables in the sample, which is what makes case-control procedures legitimate in the logistic regression context but not in the OLS regression context.
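The invariance property that makes case-control designs workable can be seen with a two-by-two table. The Python sketch below uses invented counts: a hypothetical population in which the event is rare, and a case-control sample that keeps every case but only 1 percent of the non-cases. The odds ratio is unchanged by the sampling, whereas the difference in proportions (which is what an OLS slope on a binary predictor estimates) changes drastically. All names and numbers are assumptions for illustration.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return (exposed_cases / unexposed_cases) / (exposed_controls / unexposed_controls)


def risk_difference(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Proportion of cases among the exposed minus proportion among the unexposed."""
    p_exposed = exposed_cases / (exposed_cases + exposed_controls)
    p_unexposed = unexposed_cases / (unexposed_cases + unexposed_controls)
    return p_exposed - p_unexposed


# Hypothetical population: 100 cases among roughly 100,000 people.
population = dict(exposed_cases=90, unexposed_cases=10,
                  exposed_controls=30000, unexposed_controls=70000)

# Case-control sample: all cases, but only 1 percent of the controls.
case_control = dict(exposed_cases=90, unexposed_cases=10,
                    exposed_controls=300, unexposed_controls=700)

or_population = odds_ratio(**population)
or_case_control = odds_ratio(**case_control)
rd_population = risk_difference(**population)
rd_case_control = risk_difference(**case_control)

print(round(or_population, 2), round(or_case_control, 2))    # same odds ratio
print(round(rd_population, 4), round(rd_case_control, 4))    # very different
```

Sampling controls at a different rate than cases rescales both control cells by the same factor, and that factor cancels in the odds ratio but not in the proportions.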
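APPENDIX 13.A SOME ALGEBRA FOR LOGS AND EXPONENTS

For those who have forgotten their school algebra, here are some useful relationships involving natural logarithms and antilogs (exponents):

\[
\begin{aligned}
e^{\ln(X)} &= X \\
\ln(X \cdot Y) &= \ln(X) + \ln(Y) \\
\ln(X / Y) &= \ln(X) - \ln(Y) \\
X \cdot Y &= e^{\ln(X)}\, e^{\ln(Y)} = e^{\ln(X) + \ln(Y)} \\
e^{(X + Y)} &= e^{X} \cdot e^{Y} \\
e^{(X - Y)} &= e^{X} / e^{Y} \\
\ln(X^{P}) &= P \cdot \ln(X) \\
X^{P} &= \left[e^{\ln(X)}\right]^{P} = e^{P \ln(X)}
\end{aligned}
\]

Note that Y = ln(X) and X = e^Y are equivalent.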
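These identities are easy to confirm numerically. The following Python fragment checks each one for arbitrarily chosen positive values; the particular numbers and the dictionary keys are illustrative only.

```python
import math

# Arbitrary positive values chosen for illustration.
X, Y, P = 3.0, 7.0, 2.5

checks = {
    "exp of log": math.isclose(math.exp(math.log(X)), X),
    "log of product": math.isclose(math.log(X * Y), math.log(X) + math.log(Y)),
    "log of quotient": math.isclose(math.log(X / Y), math.log(X) - math.log(Y)),
    "product via exp": math.isclose(X * Y, math.exp(math.log(X) + math.log(Y))),
    "exp of sum": math.isclose(math.exp(X + Y), math.exp(X) * math.exp(Y)),
    "exp of difference": math.isclose(math.exp(X - Y), math.exp(X) / math.exp(Y)),
    "log of power": math.isclose(math.log(X ** P), P * math.log(X)),
    "power via exp": math.isclose(X ** P, math.exp(P * math.log(X))),
}

for name, holds in checks.items():
    print(name, holds)
```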
APPENDIX 13.B INTRODUCTION TO PROBIT ANALYSIS

As noted at the beginning of this chapter, an alternative to logistic regression as a model for predicting binary responses is the probit model, which is defined as

\[
\Pr(Y = 1 \mid \mathbf{x}) = \Phi(\boldsymbol{\beta}'\mathbf{x}) = \Phi\!\left(\beta_0 + \sum_{k=1}^{K} \beta_k x_k\right) \tag{13.B.1}
\]

where Φ is the standard cumulative normal distribution and there are K predictor variables. From this definition it is evident that the βs are z-scores, and that the associated probability is determined by finding the area under the normal curve corresponding to a particular z-score. This can be done by invoking Stata's -normal- function.

Consider the example used in the chapter to illustrate the interpretation of logistic regression models: the determinants of the likelihood of being threatened by a gun or being shot at. Table 13.B.1 shows the probit coefficients corresponding to the logistic coefficients shown for Models 2 and 4 in Table 13.3. Note that the probit and logit models yield similar conclusions except that in Model 4 the interaction term is marginally
significant when estimated using a logit model, and not even that when estimated by using a probit model.

TABLE 13.B.1. Effect Parameters for a Probit Analysis of Gun Threat (Corresponding to Models 2 and 4 of Table 13.3).

Independent Variable       b          Standard Error    p       Marginal Effect
Model 2
  Male                     0.8022     ...               ...     ...
  Years of schooling      -0.0111     ...               ...     ...
  Year                     0.0062     ...               ...     ...
  Black                    0.2586     ...               ...     ...
  Intercept               -1.7095     ...               ...
  Predicted probability                                         ...
Model 4
  Male                     0.8126     .0729             ...
  Years of schooling      -0.0114     .0038             .003
  Year                     0.0062     .0022             .004
  Black                    0.2994     .0545             .000
  Black x Male            -0.0806     .0721             .264
  Intercept               -1.7117     .1810             .000

Because probits are z-scores, they indicate the expected change in standard deviation units in the latent dependent variable (what StataCorp [2007, Reference I-P, 620] calls "the probit index") resulting from a one-unit change in the associated predictor variable. However, it is a property of probits, in common with logits, that the variance of the latent
variable changes as additional variables are introduced into a model. This means that it is not appropriate to compare corresponding probit coefficients across models to assess the effects of mediating variables, as we do with metric OLS coefficients. Rather, we must standardize the latent dependent variable by dividing by the standard deviation of its distribution. This produces Y*-standardized coefficients, which can then be directly compared across models with differing numbers of predictor variables ...

Since neither the probits nor the Y*-standardized probits have intrinsic metrics, they are difficult to interpret. Thus probits typically are interpreted by evaluating the expected probability of a positive outcome for a given configuration of values on the predictor variables, or by interpreting the marginal effect of a change in each predictor variable on the probability of a positive outcome.
Consider the worked example in the chapter, in which expected probabilities were implied by the logit coefficients in Model 4. We can compute the corresponding expected probabilities for a probit model using the same values, that is, for people with twenty years of schooling in 1994. To evaluate the probit equation for these values, we first compute a' = a + b_E(20) + b_Y(94) = -1.7117 - 0.0114(20) + 0.0062(94) = -1.3569, where a is the intercept, b_E is the probit coefficient for years of schooling, and b_Y is the probit coefficient for year. Then we write out the expected probabilities by race and sex (where b_M is the coefficient for male, b_B is the coefficient for Black, and b_BM is the coefficient for the interaction term) and transform them using Stata's -normal- function:

            Non-Blacks          Blacks
Females     Φ(a')               Φ(a' + b_B)
Males       Φ(a' + b_M)         Φ(a' + b_B + b_M + b_BM)

Substituting the numerical values of these coefficients, we have

            Non-Blacks                                   Blacks
Females     Φ(-1.3569) = 0.0874                          Φ(-1.3569 + 0.2994) = Φ(-1.0575) = 0.1451
Males       Φ(-1.3569 + 0.8126) = Φ(-0.5443) = 0.2931    Φ(-1.3569 + 0.8126 + 0.2994 - 0.0806) = Φ(-0.3255) = 0.3724

Note that, multiplied by 100, these are extremely close to the percentages predicted from the logit model (for Black men, for example, 37.2; see the paragraph accompanying Equation 13.16).
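The four probabilities above can be reproduced outside Stata. The Python sketch below uses the standard library's `statistics.NormalDist` in place of Stata's -normal- function and takes the Model 4 coefficients from the text; the variable names and the dictionary layout are choices made for this illustration.

```python
from statistics import NormalDist

phi = NormalDist().cdf  # standard cumulative normal distribution

# Model 4 probit coefficients as quoted in the text.
a = -1.7117            # intercept
b_schooling = -0.0114  # years of schooling
b_year = 0.0062        # year
b_male = 0.8126
b_black = 0.2994
b_black_male = -0.0806

# Probit index for a non-Black female with 20 years of schooling in 1994.
a_prime = a + b_schooling * 20 + b_year * 94

expected = {
    ("female", "non-Black"): phi(a_prime),
    ("female", "Black"): phi(a_prime + b_black),
    ("male", "non-Black"): phi(a_prime + b_male),
    ("male", "Black"): phi(a_prime + b_male + b_black + b_black_male),
}

for group, probability in expected.items():
    print(group, round(probability, 4))
```

Each cell simply adds the relevant coefficients to the common index before applying the normal transformation, which is why the interaction term enters only for Black males.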
Now let us consider the marginal effect. We might ask how big a change in the probability we could expect for a small change in a particular independent variable. However, because the relationship between the probit index and the probability is nonlinear, the answer depends on the values of the independent variables at which we evaluate the change. Unless we have a reason for doing otherwise, evaluating the marginal effect of each variable relative to the expected value when all independent variables are set at their means would seem most reasonable, and this is the approach Stata takes for continuous variables. However, there is an exception: it makes little sense to evaluate marginal changes in dummy variables relative to their means. A better approach for dummy variables is to compute the discrete change, that is, the difference in the expected probability for those scored 1 and 0 on the dummy variable, with all other variables (including any other dummy variables in the equation) set at their means. Thus, for example, we would want to know the expected difference in the probability of males and females having been threatened, among people who are at the mean with respect to the other variables. For continuous variables, however, we want to know the effect of a small change relative to the mean for all variables. Thus for continuous variables the marginal effect is defined as the slope of the probability function at the mean, extrapolated to a unit increase.

The marginal effects for Model 2 are shown in the rightmost column of Table 13.B.1. Note that I do not show marginal effects for Model 4. This is because when we have interaction terms, the effects of the variables included in the interaction cannot be separated. Thus when we have a model involving interactions, it is best to evaluate the probabilities for various combinations of variables, as in the logit example.
The first thing to note is the predicted probability, 0.1753, which tells us the expected probability that the average person in the data set has ever been threatened by a gun or shot at. It is reassuring that the predicted value is close to the observed value: 19.5 percent of the sample has been threatened. This gives us confidence in the correctness of the model.

Now note the marginal effect for males. Because sex is a dichotomous variable, this coefficient gives the difference in the expected probability of having ever been threatened for males and females who are at the mean with respect to the other characteristics included in the model; among such people, males are predicted to be 21 percentage points more likely than females to have experienced a gun threat. We also see that, at the mean, a one-year increase in schooling would be expected to reduce the probability of having been threatened by 0.0029. What would, say, a ten-year increase in schooling bring? Note that here we cannot simply extrapolate the marginal effect. For example, it is not correct to say that a ten-year increase in schooling would result in a 0.029 decrease in the expected proportion having been threatened. Rather, we need to compare the cumulative normal transformations at the mean and at the mean plus ten years:
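\[
\begin{aligned}
&\Phi\big(\beta_0 + \beta_M \bar{M} + \beta_E(\bar{E} + 10) + \beta_Y \bar{Y} + \beta_B \bar{B}\big) - \Phi\big(\beta_0 + \beta_M \bar{M} + \beta_E \bar{E} + \beta_Y \bar{Y} + \beta_B \bar{B}\big) \\
&\quad = \Phi\big(-1.710 + 0.802 \times 0.451 - 0.0111 \times (12.39 + 10) + 0.0062 \times 84.47 + 0.259 \times 0.111\big) \\
&\qquad - \Phi\big(-1.710 + 0.802 \times 0.451 - 0.0111 \times 12.39 + 0.0062 \times 84.47 + 0.259 \times 0.111\big) \\
&\quad = .1482 - .1753 = -.0272
\end{aligned}
\tag{13.B.2}
\]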
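The contrast between extrapolating the marginal effect and recomputing the probability can be checked numerically. The Python sketch below plugs the rounded Model 2 coefficients and variable means from Equation 13.B.2 into the standard normal distribution from `statistics.NormalDist`. The dictionary keys are labels introduced here for readability and are an assumption about which symbol goes with which variable.

```python
from statistics import NormalDist

std_normal = NormalDist()

# Rounded Model 2 coefficients and means, as they appear in Equation 13.B.2.
intercept = -1.710
coef = {"male": 0.802, "schooling": -0.0111, "year": 0.0062, "black": 0.259}
means = {"male": 0.451, "schooling": 12.39, "year": 84.47, "black": 0.111}

# Probit index and expected probability with every predictor at its mean.
index_at_mean = intercept + sum(coef[name] * means[name] for name in coef)
p_at_mean = std_normal.cdf(index_at_mean)

# Expected probability after adding ten years of schooling to the mean.
p_plus_ten = std_normal.cdf(index_at_mean + coef["schooling"] * 10)
actual_change = p_plus_ten - p_at_mean

# Marginal effect of schooling: slope of the probability curve at the mean.
marginal_effect = std_normal.pdf(index_at_mean) * coef["schooling"]
extrapolated_change = 10 * marginal_effect

print(round(p_at_mean, 4), round(p_plus_ten, 4), round(actual_change, 4))
print(round(marginal_effect, 4), round(extrapolated_change, 4))
```

The extrapolated change overstates the drop because the normal curve flattens as the index moves further into the lower tail, so each additional year of schooling reduces the probability by a little less than the one before.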
FIGURE 13.B.1. Probabilities Associated with Values of Probit and Logit Coefficients.
[Horizontal axis: coefficient (b), from -5 to 5.]

A final point to note is that the logit and probit models have similar shapes, except that probit coefficients more quickly reach probabilities asymptotically close to zero or one than do logit coefficients, as is evident from Figure 13.B.1. For this reason, logit models are more sensitive when dealing with rare events or with predicted probabilities close to zero or one. But with this exception, the two models almost always yield similar substantive conclusions.

For further discussion of the binomial probit model, see Petersen (1985), Long (1997, 40-84), Powers and Xie (2000, Chapter 3), Long and Freese (2006), Wooldridge (2006, 583-595), and the -probit-, -probit postestimation-, -svy: probit-, and -svy: probit postestimation- entries in StataCorp (2007). For an interesting application see Manski and Wise (1983).

The Stata commands used to create the worked example for the probit model and the output are shown as the last part of downloadable files "ch13_1.do" and "ch13_1.log."
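The difference in how quickly the two curves approach zero and one can be tabulated directly. This Python sketch evaluates the standard normal and logistic cumulative distributions at the same index values; it is an illustration of the shapes only, with function names chosen here, and it makes no attempt to rescale the coefficients of one model into the metric of the other.

```python
import math
from statistics import NormalDist

_standard_normal = NormalDist()


def probit_probability(index):
    """Probability implied by a probit index (standard normal CDF)."""
    return _standard_normal.cdf(index)


def logit_probability(index):
    """Probability implied by a logit index (logistic CDF)."""
    return 1.0 / (1.0 + math.exp(-index))


# Evaluate both curves at whole-number index values from -5 to 5.
rows = [(value, probit_probability(value), logit_probability(value))
        for value in range(-5, 6)]

for value, probit_p, logit_p in rows:
    print(f"{value:>3}  probit={probit_p:.4f}  logit={logit_p:.4f}")
```

At an index of -3, for example, the probit curve is already within a fraction of a percentage point of zero, whereas the logit curve is still several percentage points above it.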
CHAPTER 14

MULTINOMIAL AND ORDINAL LOGISTIC REGRESSION AND TOBIT REGRESSION

WHAT THIS CHAPTER IS ABOUT

In this chapter we consider models for three additional types of limited dependent variables:

• categorical variables with more than two categories, for which multinomial logistic regression is appropriate
• ordinal variables, for which ordinal logistic regression is appropriate
• censored, or truncated, dependent variables, where observations are not observed below or above some level, for which tobit regression is appropriate

In each case we see how the model is specified and then work through an illustrative substantive analysis.
MULTINOMIAL LOGIT ANALYSIS

Sometimes we wish to analyze categorical dependent variables with more than two categories. In this case, we have available a natural extension of binomial logistic regression: multinomial logistic regression. The procedure involves simultaneously estimating a set of logistic regression equations, of the form

$$
\begin{aligned}
\ln\left[\frac{P(Y=1)}{P(Y=0)}\right] &= a_1 + \sum_k b_{1k}X_k \\
\ln\left[\frac{P(Y=2)}{P(Y=0)}\right] &= a_2 + \sum_k b_{2k}X_k \\
&\;\;\vdots \\
\ln\left[\frac{P(Y=m)}{P(Y=0)}\right] &= a_m + \sum_k b_{mk}X_k
\end{aligned}
\tag{14.1}
$$

Here, one category of the dependent variable is omitted and becomes the reference category. The estimation procedure yields, for a set of m + 1 categories of some dependent variable, m logistic regression equations, each of which predicts the log odds of a case falling into a specific category rather than into the reference category (here designated Y = 0). Note, however, that although the interpretation is similar to the binomial case, the estimation procedure is not equivalent to estimating a set of binomial logistic regression equations in which the odds of being in a particular category versus not being in that category are predicted. In general, the estimates will differ and the binomial estimates will be incorrect.

This can easily be appreciated by imagining that we are interested in what factors determine whether, in 1988 Poland, a person was a Communist Party official, a Communist Party member but not an official, or neither a member nor an official. If we estimated a binomial logistic regression predicting ordinary party membership (without office holding) and another logistic regression equation predicting party office holding, we would be in trouble with respect to the first equation because the negative category (not an ordinary party member) would include those who were neither party members nor officials and also those who were party officials. In consequence, the resulting coefficients would be misleading. For example, it is likely that a coefficient relating education to party membership would be very weak because party officials are likely to be better educated than members, whereas party members are likely to be better educated than nonmembers.

The appropriate way to handle this problem would be to estimate a multinomial logistic regression model with three categories: nonmember, ordinary member, and official. Doing so would result in two equations, one contrasting ordinary members versus nonmembers and the other contrasting officials versus nonmembers, which are
interpreted in the ordinary way. An alternative would be to do a sequential logit analysis in which first membership versus nonmembership is modeled, and then office holding versus ordinary membership is modeled for party members only. The choice between the alternatives would depend on how the process of becoming a party member or an official occurs. (See the brief discussion at the end of the chapter in the section on "Other Models.")
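To make the link between the m log-odds equations and the m + 1 category probabilities concrete, the following sketch converts a set of multinomial logit equations into predicted probabilities. The intercepts, slopes, and the single predictor (years of schooling) are hypothetical values invented for illustration; they are not estimates from any data discussed here.

```python
from math import exp, log

def multinomial_probs(intercepts, slopes, x):
    """Return [P(Y=0), P(Y=1), ..., P(Y=m)] for one case.

    intercepts[j] and slopes[j] describe the equation contrasting
    category j + 1 with the reference category (Y = 0).
    """
    linear = [a + sum(b * xk for b, xk in zip(bs, x))
              for a, bs in zip(intercepts, slopes)]
    denom = 1.0 + sum(exp(v) for v in linear)   # the 1.0 is exp(0) for Y = 0
    return [1.0 / denom] + [exp(v) / denom for v in linear]

# Hypothetical three-category example: 0 = nonmember, 1 = ordinary member,
# 2 = official, with years of schooling as the only predictor.
intercepts = [-3.0, -4.5]
slopes = [[0.20], [0.30]]
probs = multinomial_probs(intercepts, slopes, [12.0])

for category, p in enumerate(probs):
    print(f"P(Y = {category}) = {p:.4f}")
# Each log odds relative to the reference category reproduces its equation.
print(log(probs[1] / probs[0]), log(probs[2] / probs[0]))
```

Because all categories share one denominator, the probabilities sum to one, which is why the equations must be estimated jointly rather than as separate binomial models.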
A Worked Example: Foreign-Language Competence in the Czech Republic

To see how this procedure works in practice, let us analyze the factors that account for competence in English and Russian in the Czech Republic. The data used here were collected in 1993 from a representative national probability sample of 5,496 Czechs age twenty to sixty-nine, as part of the survey Social Stratification in Eastern Europe After 1989 (Treiman and Szelényi 1993; see Appendix A for details on this survey and how to obtain the data set and documentation). Here we consider four groups:

• those who speak neither English nor Russian
• those who speak English but not Russian
• those who speak Russian but not English
• those who speak both languages
To be classed as a speaker of a language, a respondent had to report that he or she speaks the language "fairly well" or "very well"; those who reported that they speak the language "only a little" or "not at all" or who failed to answer the question were classified as nonspeakers of the language. Because the survey was conducted in Czech, everyone spoke Czech. A few may also have spoken a second language other than Russian or English, but this possibility is not analyzed here.

My expectation is that professionals and technicians would be more likely than other occupation groups to speak English because English is now the international language of science, technology, and scholarship, and hence the ability to speak English is important for professional advancement. Those who were ever Communist Party members, and especially those who were government or party officials, would be more likely than other occupation groups to speak Russian because Russian was necessary for political advancement in the Eastern Bloc. It is less clear whether or to what extent being a manager would increase the odds of speaking English (perhaps necessary for international business dealings) or Russian (perhaps necessary for Eastern Bloc dealings).

To identify those who potentially needed Russian for their careers, I classify respondents by their 1988 occupation and create four dummy variables for 1988 occupation, each scored 1 for those in the category and scored 0 otherwise: officials, other managers, professionals and technicians, and others. (This variable was constructed by recoding the version of ISCO 88 shown in Treiman [1994, Appendix C]. "Officials" include codes 1000 to 1166, "other managers" include codes 1200 to 1320, "professionals and technicians" include codes 2000 to 3480, and "others" include codes 4000 to 9333. Those
not reporting an occupation in 1988 were excluded from the analysis.) In addition to these variables, I include education as a control variable because it is clear that those who are educated are more likely to speak foreign languages in general.

The data were weighted to adjust for differential household size and to bring the sample characteristics into conformity with population distributions (see Treiman 1994, Section I.G, for details). However, standard errors were not adjusted for clustering. In the sample design, census tracts were divided into eight strata on the basis of [...] households were randomly sampled within strata. Because the stratum identification is not given in the documentation, there is no alternative but to treat the sample as a simple (weighted) random sample. Given the probable lack of systematic association between [...] tracts and other characteristics, the lack of adjustment for stratification is likely to be of little consequence. The results are reported in Table 14.1 for the [...] people with a job in 1988 for whom complete information was available. (Downloadable file "ch14_1.log" shows the Stata log for the analysis, and "ch14_1.do" shows the -do- file used to obtain the results.)

Inspecting the coefficients in Table 14.1, we see that, as expected, the odds of speaking either Russian or English, or both, improved substantially with education. The odds multipliers in the second panel tell us that each additional year of schooling increased the odds of speaking Russian by 25 percent, the odds of speaking English by 36 percent, and the odds of speaking both languages by 51 percent, all in contrast to speaking neither Russian nor English. Thus, for example, net of other factors, the odds that a Czech university graduate could speak Russian but not English (in contrast to speaking neither Russian nor English) are nearly two and one-half times as high as the odds that a high school graduate could do so (because $1.248^{(16-12)} = 2.43$). The odds that a university graduate could speak English but not Russian are more than three times the odds for a high school graduate [...]
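Because odds multipliers compound multiplicatively, the effect of several additional years of schooling is the per-year multiplier raised to the number of years. The sketch below repeats the calculation for the Russian contrast and applies the same logic to English. The multiplier 1.248 is the value used in the text; 1.36 is only the rounded figure implied by "36 percent," so the English result is approximate.

```python
def odds_ratio(multiplier_per_year, years):
    """Odds multiplier for a difference of `years` in schooling."""
    return multiplier_per_year ** years

# University graduate (16 years) versus high school graduate (12 years).
YEARS = 16 - 12
russian_only = odds_ratio(1.248, YEARS)   # multiplier taken from the text
english_only = odds_ratio(1.36, YEARS)    # rounded multiplier, approximate

print(f"Russian but not English: {russian_only:.2f}")
print(f"English but not Russian: {english_only:.2f}")
```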
$$
Y = \begin{cases} Y^* & \text{if } Y^* > \tau \\ \tau_Y & \text{if } Y^* \le \tau \end{cases}
\tag{14.12}
$$

That is, the observed value of Y is equal to the "true" value of Y, Y*, if Y* is above the value at which observations are censored and is equal to some constant value (usually, but not necessarily, the value at which observations are censored) if the true value is at or below the value at which observations are censored. For the first set, estimates are derived the same way as in ordinary least-squares estimation. For the second set, it is possible to estimate the probability that an observation is censored, conditional on the values of the independent variables, and to use this probability to estimate the likelihood. These estimates are then combined to produce expected values for all observations, conditional on the values of the independent variables:
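$$
E(Y_i \mid x_i) = \bigl[\Pr(\text{uncensored} \mid x_i) \times E(Y_i \mid Y_i > \tau, x_i)\bigr] + \bigl[\Pr(\text{censored} \mid x_i) \times \tau_Y\bigr]
\tag{14.13}
$$

where $x_i = a + \sum_k b_k X_{ik}$. For an accessible exposition of the mathematics involved, see Long (1997, Chapter 7).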
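Equation 14.13 can be evaluated directly once a distribution for the latent variable is specified. The sketch below assumes, as in the standard tobit model, that Y* is normally distributed around the linear predictor with standard deviation sigma, censored from below at tau. The function name, argument names, and example values are illustrative assumptions rather than anything taken from the downloadable files.

```python
from math import erf, exp, pi, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def tobit_expected_value(xb, sigma, tau, tau_y=None):
    """Expected observed Y under left censoring at tau (Equation 14.13).

    xb    : linear predictor for the latent variable Y*
    sigma : standard deviation of the normal error (an assumption)
    tau_y : value recorded for censored cases; defaults to tau
    """
    if tau_y is None:
        tau_y = tau
    d = (xb - tau) / sigma
    pr_uncensored = norm_cdf(d)
    # Mean of a normal variable truncated from below at tau.
    # (For extremely negative d, norm_cdf(d) can underflow to zero.)
    e_uncensored = xb + sigma * norm_pdf(d) / norm_cdf(d)
    return pr_uncensored * e_uncensored + (1.0 - pr_uncensored) * tau_y

# Hypothetical case: latent mean exactly at the censoring point.
print(tobit_expected_value(xb=0.0, sigma=1.0, tau=0.0))
```

When the linear predictor lies far above the censoring point, the probability of censoring is negligible and the expected value is essentially the linear predictor itself.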
The tobit model has been extended and generalized in a number of ways:

• to allow for right censoring and both left and right censoring (that is, at low values and high values of a distribution)
• to allow for the possibility that different observations are censored at different values (for example, income when several years of the GSS are pooled)
• to allow for situations in which an underlying continuous variable is coded into a set of categories (in many surveys income is coded this way)
• to correctly estimate effects where observations are truncated
• to deal with sample-selection problems

In the following section I provide a worked example that illustrates many of these extensions. (For estimation details, see the Stata downloadable files "ch14_3.do" and "ch14_3.log.")
A Worked Example: Frequency of Sex

The 2000 GSS included the question "About how often did you have sex during the last twelve months?" The response categories (shown with codes to be used later) are shown in Table 14.6.

Clearly, these data are censored both below and above. Those who have not had sex at all in the last year include those who have never ever had sex and those who have simply been unlucky in the past year, with others in between. At the other extreme, coding "more than three times a week" as four times a week, or five times a week, may understate the prowess of newlyweds and other sexual athletes. Finally, some categories include a range, which might or might not be optimally represented by the midpoint.

TABLE 14.6. Codes for Frequency of Sex in the Past Year, U.S. Adults
[Table body not recovered. Columns: response category, Midpoint, Lower Bound, Upper Bound. Surviving response categories: "2 or 3 times a month"; "2 or 3 times a week."]

To illustrate the effect of censoring, let us consider a simple model in which frequency of sex is predicted from age, gender, and marital status (currently married versus not). In fact, in this and most analyses involving age, it would be better to include a squared term. However, I do not do so just yet because including only linear terms makes the exposition easier. Table 14.7 shows the results for four estimates:
• ordinary least-squares estimates with the categories coded at their midpoints but with an arbitrary top code of 208 for "more than 3 times a week" (= 52 × 4)
• tobit estimates with censoring from below
• tobit estimates with censoring both below and above
• interval regression estimates with censoring both below and above
TABLE 14.7. Alternative Estimates of a Model of Frequency of Sex, U.S. Adults, 2000 (N = 2,258). (Standard Errors in Parentheses; All Coefficients Significant at .001 or Beyond.)
[Table body not recovered. Columns: Model 1: OLS; Model 2: Tobit, Left Censored; Model 3: Tobit, Left and Right Censored; Model 4: Interval, Left and Right Censored. Surviving entries, row labels lost: 119.2 (6.8); 118.4; (6.8); -1.41 (0.09); -2.16 (0.12); 71.7.]

Comparing the coefficients in the two left columns, we see that the effect of censoring from below is severe. Failure to take proper account of such censoring results in an underestimate of the effect of marital status on frequency of sex by about half and very substantial underestimation of the effects of age and of being male. Interestingly, taking account of censoring from above as well as below hardly changes the coefficients, suggesting that marital status, age, and gender have little impact on the probability of being extremely sexually active. Inspection of the probability of censorship from above confirms this supposition: even among the most sexually active group, young married men, no more than about 15 percent have sex more than three times per week. By contrast, there is great variability by marital status, sex, and especially age in the probability of never having had sex in the last year, ranging from about 3 percent of young married men to about 90 percent of elderly unmarried women.

Apart from the probabilities, three predictions are of interest: the linear prediction from the model, the censored prediction, and the truncated prediction. Graphs of these predicted values for Model 4 are shown in Figure 14.1 by age, for married women. The linear prediction is the prediction from the model, which tells us that, net of other factors, the frequency of sex per year declines by about 2.3 occasions per year of age. The graph tells us that for married women the frequency of sex declines to less than once a year at about age seventy. Although negative observed values make no sense, the linear prediction gives the values of a latent, or underlying, variable. We can think of this variable as the propensity for sex, which declines steadily with age (because, of course, we have modeled the frequency of sex as a linear function of age).

The censored prediction equals the latent prediction when the dependent variable is observed and equals the censoring value when the dependent variable is censored. (Somewhat confusingly, Stata calls censored predictions the "ystar" option, although Y* is
usually taken to indicate the latent variable, as it is in Equation 14.12.) Thus in this case, we assume that 0 and 208 are true values for those in the lowest and highest categories. By construction, censored predictions must fall within the range of the uncensored observations.

FIGURE 14.1. Three Estimates of the Expected Frequency of Sex per Year, U.S. Married Women, 2000 (N = 552). [Horizontal axis: Age.]

The truncated prediction is defined only for those observations that are not censored. In this case the truncated prediction gives the predicted frequency of sex among those who had any sex at all in the last year. Note that neither the censored prediction nor the truncated prediction is linear. Thus, these predictions must be evaluated at specific levels of the independent variables. Most commonly we will be interested in the linear prediction.

Now that we see how to interpret tobit coefficients, let us extend the analysis slightly to make it more substantively plausible. I do this by adding a squared term for age and trying interactions between age, gender, and marital status. As it happens, it is not necessary to posit three-way interactions among marital status, gender, and, respectively, age and age squared; a model positing the three-way interactions does not fit significantly better than a model with the two sets of two-way interactions, between gender and, respectively, age and age squared, and between marital status and, respectively, age and age squared. The coefficients for this model are shown in downloadable file "ch14_3.log." Because they are difficult to interpret directly, I have graphed (in Figure 14.2) the relationship between age and the frequency of sex for each gender-marital status combination.

Inspecting the graph, we see that (no surprise) married people have more active sex lives than do currently unmarried people of the same age and gender, and that sexual activity declines at an increasing rate with age.
FIGURE 14.2. Expected Frequency of Sex Per Year by Gender and Marital Status, U.S. Adults, 2000 (N = 2,258). [Horizontal axis: Age. Lines: currently married men; currently married women; not married men; not married women.]
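The distinction among the linear, censored, and truncated predictions discussed in this example can be illustrated numerically. The sketch below uses the standard formulas for a normal latent variable censored at 0 and at 208. The value of sigma and the two latent means are hypothetical, chosen only to show how the three predictions diverge; they are not the Model 4 estimates.

```python
from math import erf, exp, pi, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def three_predictions(mu, sigma, lower=0.0, upper=208.0):
    """Linear, censored, and truncated predictions for latent mean mu."""
    a = (lower - mu) / sigma
    b = (upper - mu) / sigma
    p_low = norm_cdf(a)               # probability of censoring from below
    p_high = 1.0 - norm_cdf(b)        # probability of censoring from above
    p_mid = norm_cdf(b) - norm_cdf(a)
    # Mean of the latent variable among uncensored cases.
    truncated = mu + sigma * (norm_pdf(a) - norm_pdf(b)) / p_mid
    # Censored cases are assigned the censoring values themselves.
    censored = lower * p_low + upper * p_high + p_mid * truncated
    return {"linear": mu, "censored": censored, "truncated": truncated}

SIGMA = 50.0                           # hypothetical error standard deviation
for latent_mean in (60.0, -20.0):      # hypothetical linear predictions
    print(latent_mean, three_predictions(latent_mean, SIGMA))
```

A negative linear prediction is perfectly possible for the latent variable, whereas the censored and truncated predictions remain within the observable range.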
Interestingly, in both marital status categories, men report more active sex lives than do women of the same age and marital status. The reason for the gender discrepancy within marital status categories is not completely clear but probably reflects a tendency for men to overreport and (or) for women to underreport their sexual activity. Note that, considering only heterosexual activity, both the average number of sexual encounters and the average number of partners must be identical for males and females. Thus there clearly is biased reporting; differential nonresponse (for example, the likelihood that women with many sexual partners, such as prostitutes, are underrepresented in the GSS); or more reported homosexual activity among men than among women.

Married men and women both average about one partner (precisely, 1.03 and .9...), which suggests that for both married men and married women their spouse is usually their only partner, which in turn would imply that the average number of sexual encounters should be the same for currently married men and women, adjusting for the three-year average difference in age. However, inspection of Figure 14.2 shows a difference larger than can be explained by the age gap (if the age gap were the full explanation, the lines would be parallel for married men and women, and a line segment of three years' length drawn to the left of the male line and parallel to the x-axis should just touch the female line). This suggests the possibility that either or both married men and married women distort their reports of the frequency of sexual activity in a socially desirable direction, with men claiming sexual prowess and women claiming sexual modesty. The likelihood of distortion is substantially greater among the unmarried: unmarried men on average report about twice as many partners in the last year as do unmarried women (1.85 compared to .90), which, given that this discrepancy is far too large to be accounted for by differences in homosexual activity, suggests the possibility that unmarried men and women distort both the number of partners and the frequency of sexual activity in the socially desirable direction.

Another possibility is that the propensity for women to be younger than their male partners partly accounts for the gender difference in reported sexual activity among the unmarried. Adjudicating among these possibilities would require more analysis than is warranted here.
OTHER MODELS FOR THE ANALYSIS OF LIMITED DEPENDENT VARIABLES

This introduction by no means exhausts the variety of procedures available for the analysis of limited dependent variables. Stata 10.0 includes commands to carry out a number of procedures, including

• Conditional logistic regression and mixed models, where outcomes depend on features of the outcomes as well as on characteristics of the individuals. For examples, see Boskin (1974), Hoffman and Duncan (1988), White and Liang (1998), and Yanovitzky and Cappella (2001).
• Nested logistic regression, which extends conditional logit analysis by dividing outcomes into a hierarchy of levels. For examples, see Cameron (2000), Soopramanien and Johnes (2001), and South and Baumer (2001).
• Probit regression, an alternative to logistic regression. For a brief introduction, see Appendix 13.B.
• Poisson regression, used to model counts, the number of occurrences of an event. A classic example is von Bortkiewicz's 1898 study of the number of soldiers kicked to death by horses in the Prussian army. Applications in the social sciences include Long (1990), Greenberg (1991), Rasler (1996), Chattopadhyay and others (2006), and Weitoff and others (2008). The definitive statistical treatment of Poisson regression is Cameron and Trivedi (1998).

Long (1997), Hosmer and Lemeshow (2000), and Powers and Xie (2000) provide excellent introductions to many of these procedures that, with a bit of diligence, are accessible to social scientists who have a modest statistical background. Long and Freese (2006) provide a guide to using the procedures in Stata. For a useful overview, see Gould (2000).
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have seen how to estimate models for three types of limited dependent variables: ordinal variables, for which ordinal logit analysis is the appropriate method; categorical variables, for which multinomial logit analysis is the appropriate method; and censored variables (where values above or below some cutting point are not observed), for which tobit modeling is the appropriate method.
CHAPTER 15

IMPROVING CAUSAL INFERENCE: FIXED EFFECTS AND RANDOM EFFECTS MODELING

WHAT THIS CHAPTER IS ABOUT

In this chapter we consider two closely related techniques for coping with omitted variable bias. Recall from Chapter Six that omitted variable bias occurs when we have failed to include in our model variables that affect the outcome and that are correlated with one or more of the predictor variables. The techniques discussed in this chapter for estimating unbiased coefficients are known as fixed effects and random effects models. These models use information on the same individuals from two or more time points or information on two or more individuals within groups (families, schools, firms, communities, or similar groups) to purge the estimating equation of all characteristics, measured or unmeasured, that are constant over time or constant within groups. The result is that the characteristics we are able to measure are unbiased by unobserved time-invariant factors. For useful introductions to these techniques, see Allison (2005) and Wooldridge (2006, Chapters 13 and 14), both of which I draw on in this chapter.
INTRODUCTION

As we have seen at many points in this book, the nonexperimental methods we have been studying are vulnerable to omitted variable bias: the possibility that unmeasured factors affect both the predictor and outcome variables. In this case the coefficients we estimate through OLS or logistic regression will be incorrect. To appreciate this most fully, it is helpful to contrast the linear model approaches we have been studying with randomized experiments.

In the classic randomized experiment, individuals are randomly assigned to two groups; members of the treatment group are exposed to some sort of intervention and members of the control group are not, and differences in one or more outcomes are measured. (This design can be generalized to include several different treatment groups, but the logic remains unchanged.) Because the treatment and control groups are, within the limits of sampling error, identical on average in their pretreatment attributes (or, to put the same point differently, receipt of treatment is uncorrelated with pretreatment attributes of individuals), any difference in average outcomes may be assumed to be caused by the treatment.

With linear model approaches, we attempt to approximate randomized experiments by statistically controlling for as many confounding factors (that is, factors correlated with both the predictor and outcome variables) as possible. For example, if we observe that men earn more than women, we might wonder whether this is due to discrimination. However, before we accepted such a conclusion we would want to consider whether, at least in part, the pay gap is due to the fact that men are more likely to have technical training, to enter high-paying fields, to have more work experience, and to work longer hours. We would then statistically control for these variables and assess the effect of gender on earnings among people who are identical with respect to the control variables. If we still found a gender difference in pay, we might then be willing to attribute the remaining difference to discrimination. However, we would be vulnerable to the claim that we had not included other crucial factors that could result in pay differences. For example, women may not bargain as effectively as men and may therefore accept jobs at lower salary levels than men. If we omit a measure of bargaining prowess (or if we measure bargaining prowess imperfectly so that true bargaining prowess remains partly unmeasured), then any effect of prowess would be captured by the error term. However, if bargaining prowess is correlated with gender, the assumption of OLS (and other linear models) that the error is uncorrelated with the predictor variables would be violated, producing biased coefficients.

So what can we do? It turns out that if we have measurements on the same individuals for at least two points in time, we can get unbiased estimates of the effects of variables that, at least for some individuals being studied, change over time. We do this by predicting change in our outcome variable from changes in our predictor variables, which has the effect of purging from our prediction equation those factors, measured and unmeasured, that do not change over time. But there is no such thing as a free lunch. The cost of this method, known as fixed effects (FE) modeling, is twofold: (1) We are unable to estimate the "main effects" of predictors that do not vary over time for individuals, for example, sex and race (although we are able to estimate interaction effects involving
such variables and variables that do change over time, a point we will return to in the context of our gender pay gap example, and we also are able to estimate effects that change over time for time-constant variables). (However, recent work by Bollen and Brand [2008] has shown how, with suitable assumptions about unobserved latent factors, it is possible to obtain effects of time-constant predictors within a structural equation modeling [SEM] framework, a set of techniques briefly discussed in the next chapter.) (2) When we are analyzing limited dependent variables, we usually will have a substantial reduction in sample size because in FE logistic regression, individuals who do not change over time on the outcome variable are dropped from the analysis. However, under some circumstances, and with some additional assumptions, we can recover our sample size by resorting to what is known as random effects (RE) modeling. We will consider this approach later in the chapter.
FIXED EFFECTS MODELS FOR CONTINUOUS VARIABLES

To see how FE works for continuous outcome variables, let us write a prediction equation:

$$
y_{it} = \mu_t + \beta x_{it} + \gamma z_i + \alpha_i + \varepsilon_{it}; \qquad i = 1, \ldots, n; \quad t = 1, \ldots, T
\tag{15.1}
$$

where $y_{it}$ is the value of the outcome variable for the ith individual at time t; $\mu_t$ is an intercept that is allowed to vary with time; $x_{it}$ is a vector of variables that vary both over individuals and, for each individual, over time; $z_i$ is a vector of variables that vary over individuals but, for each individual, not over time; $\alpha_i$ represents unmeasured differences between individuals, that is, differences not accounted for by the $z_i$, that are fixed over time; and $\varepsilon_{it}$ represents idiosyncratic factors that vary both over time and across individuals.

To simplify the discussion, assume that T = 2, although the same conclusions hold when T > 2. Now suppose we simply pooled observations from the two time points and estimated our outcome through OLS. Clearly, insofar as omitted variables are correlated with the variables in the model (as in our example involving bargaining prowess), doing this will produce biased estimates because the fundamental assumption of OLS, that the error term (which in this case is the sum $\alpha_i + \varepsilon_{it}$ because $\alpha_i$ is unobserved) is uncorrelated with the predictor variable, will be violated.
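Fundamental FE Equation

However, suppose we write separate equations for each time period and subtract one from the other:

$$
\begin{aligned}
y_{i1} &= \mu_1 + \beta x_{i1} + \gamma z_i + \alpha_i + \varepsilon_{i1} \\
y_{i2} &= \mu_2 + \beta x_{i2} + \gamma z_i + \alpha_i + \varepsilon_{i2}
\end{aligned}
\tag{15.2}
$$

Subtracting the first equation from the second yields

$$
y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta(x_{i2} - x_{i1}) + (\varepsilon_{i2} - \varepsilon_{i1})
\tag{15.3}
$$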
Notice that both $\gamma z_i$, the time-constant effect of predictor variables, and $\alpha_i$ have been differenced out of Equation 15.3, which is why equations of this sort are known as first-difference equations. Equation 15.3 is purged of all unmeasured factors that are constant over time, as well as any measured factors that are constant over time. Thus Equation 15.3 solves the omitted-variable-bias problem, assuming that there are no unmeasured factors whose effects change over time; this is a nontrivial assumption [...] for which $x_{i1}$ and $x_{i2}$ are not perfectly correlated (ruling out age, for example, as a candidate [...]); and that the observed predictor variables are uncorrelated with the idiosyncratic error terms, $\varepsilon_{i1}$ and $\varepsilon_{i2}$, at both time points: that is, that the observed predictor variables are strictly exogenous. Crucially, they do not depend on the outcome observed at an earlier time point.
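A small simulation may help show why differencing matters. The sketch below generates two-wave data in which an unmeasured, time-constant factor (alpha) is correlated with the predictor, then compares the pooled OLS slope with the first-difference slope of Equation 15.3. All parameter values, the sample size, and the random seed are arbitrary assumptions made for illustration; the time-constant z term is omitted because it would difference out in the same way as alpha.

```python
import random

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate(n=5000, beta=2.0, seed=12345):
    """Return (pooled OLS slope, first-difference slope) for simulated data."""
    rng = random.Random(seed)
    mu1, mu2 = 1.0, 1.5                      # wave-specific intercepts
    x1, x2, y1, y2 = [], [], [], []
    for _ in range(n):
        alpha = rng.gauss(0, 1)              # unmeasured, constant over time
        xa = alpha + rng.gauss(0, 1)         # predictor correlated with alpha
        xb = alpha + rng.gauss(0, 1)
        x1.append(xa)
        x2.append(xb)
        y1.append(mu1 + beta * xa + alpha + rng.gauss(0, 1))
        y2.append(mu2 + beta * xb + alpha + rng.gauss(0, 1))
    pooled = slope(x1 + x2, y1 + y2)         # ignores alpha: biased
    first_diff = slope([b - a for a, b in zip(x1, x2)],
                       [b - a for a, b in zip(y1, y2)])
    return pooled, first_diff

pooled_slope, fd_slope = simulate()
print(f"true beta = 2.0, pooled OLS = {pooled_slope:.3f}, "
      f"first difference = {fd_slope:.3f}")
```

In this setup the pooled estimate is pulled upward because alpha enters the error term and is correlated with x, whereas the differenced estimate stays close to the true slope.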
Allowing the Slopes of the Xs to Vary

Notice that in Equation 15.3 it is assumed that the effects of the predictor variables, the xs, are constant over time. This assumption can be tested by estimating a first-difference equation in which the slopes of the xs are allowed to vary. To see this, consider the following pair of equations:

$$
\begin{aligned}
y_{i1} &= \mu_1 + \beta_1 x_{i1} + \gamma z_i + \alpha_i + \varepsilon_{i1} \\
y_{i2} &= \mu_2 + \beta_2 x_{i2} + \gamma z_i + \alpha_i + \varepsilon_{i2}
\end{aligned}
\tag{15.4}
$$

Here, subtracting the first equation of 15.4 from the second yields

$$
y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta_2(x_{i2} - x_{i1}) + (\beta_2 - \beta_1)x_{i1} + (\varepsilon_{i2} - \varepsilon_{i1})
\tag{15.5}
$$

That is, to test the hypothesis that the slope of any of the xs differs over time, we include both the time 1 variable and the difference score. Then, if the coefficient for the time 1 variable differs significantly from zero, we conclude that the slopes are not equal and can get the value of the time 1 slope by subtracting the coefficient for the time 1 variable from the coefficient for the difference score.
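The recipe just described (regress the change in y on the difference score and the time 1 variable, then subtract) can be verified on artificial data. In the sketch below the data are generated without idiosyncratic error so that the coefficients are recovered exactly; the slopes, intercepts, sample size, and the small `ols` helper are all assumptions made for illustration.

```python
import random

def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    aug = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
           + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [aug[a][k] for a in range(k)]

rng = random.Random(2024)
BETA1, BETA2, MU1, MU2 = 1.0, 1.5, 0.5, 0.8     # hypothetical true values
x1 = [rng.gauss(0, 1) for _ in range(200)]
x2 = [a + rng.gauss(0, 1) for a in x1]
alpha = [rng.gauss(0, 1) for _ in range(200)]    # differenced out below
y1 = [MU1 + BETA1 * a + al for a, al in zip(x1, alpha)]
y2 = [MU2 + BETA2 * b + al for b, al in zip(x2, alpha)]

dy = [b - a for a, b in zip(y1, y2)]
dx = [b - a for a, b in zip(x1, x2)]
intercept, coef_diff, coef_time1 = ols([dx, x1], dy)

time2_slope = coef_diff                  # beta_2
time1_slope = coef_diff - coef_time1     # beta_2 - (beta_2 - beta_1) = beta_1
print(round(intercept, 6), round(time2_slope, 6), round(time1_slope, 6))
```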
Testing Whether the Effects of the Time-Invariant Variables Vary over Time

We also can allow the coefficients for the time-invariant variables to change over time. To see this, consider the following pair of equations:

$$
\begin{aligned}
y_{i1} &= \mu_1 + \beta x_{i1} + \gamma_1 z_i + \alpha_i + \varepsilon_{i1} \\
y_{i2} &= \mu_2 + \beta x_{i2} + \gamma_2 z_i + \alpha_i + \varepsilon_{i2}
\end{aligned}
\tag{15.6}
$$

Subtracting the first equation of 15.6 from the second yields

$$
y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta(x_{i2} - x_{i1}) + (\gamma_2 - \gamma_1)z_i + (\varepsilon_{i2} - \varepsilon_{i1})
\tag{15.7}
$$

From Equation 15.7 we see that it is possible to assess the claim that the effects of the zs do not vary over time, by testing the significance of the coefficients associated with the z variables. Note that these coefficients do not show the effects of the zs but rather the differences in the effects of the zs between time 2 and time 1.
Interactions Between Time-Constant and Time-Varying Variables

As noted previously, we generally cannot get the effect of time-constant variables from the FE model (but see Bollen and Brand [2008]). However, we can get the effect of the interaction of the time-constant variables with the time-varying variables, the xs. To see this, consider the following pair of equations:

$$
\begin{aligned}
y_{i1} &= \mu_1 + \beta x_{i1} + \gamma z_i + \delta x_{i1} z_i + \alpha_i + \varepsilon_{i1} \\
y_{i2} &= \mu_2 + \beta x_{i2} + \gamma z_i + \delta x_{i2} z_i + \alpha_i + \varepsilon_{i2}
\end{aligned}
\tag{15.8}
$$

Subtracting the first equation of 15.8 from the second yields

$$
y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta(x_{i2} - x_{i1}) + \delta z_i(x_{i2} - x_{i1}) + (\varepsilon_{i2} - \varepsilon_{i1})
\tag{15.9}
$$

For example, returning to the effect of gender on income, the FE model does not allow a direct assessment of the role of gender in creating income differences. However, it does allow us to determine, say, gender differences in the effect of changes in performance evaluation scores on changes in income. Suppose gender is coded 1 for males and 0 for females and x (now designating a single variable rather than a vector of variables) is a performance evaluation measure. Then we would have:

$$
\begin{aligned}
\text{for females:}\quad y_{i2} - y_{i1} &= \beta(x_{i2} - x_{i1}) + \cdots \\
\text{for males:}\quad y_{i2} - y_{i1} &= (\beta + \delta)(x_{i2} - x_{i1}) + \cdots
\end{aligned}
\tag{15.10}
$$
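Equation 15.10 implies that the slope of the change in y on the change in x is β for the group coded 0 and β + δ for the group coded 1. The sketch below confirms this on artificial data generated from Equation 15.8 without idiosyncratic error, so that the slopes are recovered exactly. The parameter values and the variable names are hypothetical.

```python
import random

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(7)
BETA, GAMMA, DELTA, MU1, MU2 = 0.8, 2.0, 0.4, 1.0, 1.3   # hypothetical values

changes = {0: ([], []), 1: ([], [])}      # z -> (changes in x, changes in y)
for i in range(400):
    z = i % 2                             # 1 = male, 0 = female
    alpha = rng.gauss(0, 1)               # unmeasured, constant over time
    x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    y1 = MU1 + BETA * x1 + GAMMA * z + DELTA * x1 * z + alpha
    y2 = MU2 + BETA * x2 + GAMMA * z + DELTA * x2 * z + alpha
    changes[z][0].append(x2 - x1)
    changes[z][1].append(y2 - y1)

female_slope = slope(*changes[0])         # estimates beta
male_slope = slope(*changes[1])           # estimates beta + delta
print(round(female_slope, 6), round(male_slope, 6))
```

The main effect of z (here GAMMA) never appears in the differenced data, which is why only the interaction, not the main effect, can be estimated.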
More than Two Time Points

When we have three or more measurements per individual (which we increasingly do because a number of multiple-wave data sets are now in the public domain), there are several possibilities, of which two are simple extensions of the methods we have just discussed. Consider each of these.

First, we may analyze two waves at a time, computing first differences between successive waves. This approach has the limitation that, unless we turn to advanced methods such as generalized least squares, we cannot get a single set of coefficients for all waves combined, but rather T - 1 sets of coefficients for T waves of data. Thus the successive-waves approach is likely to be of greatest interest when the number of waves is small, say three or four.
An alternative is to pool the data over waves; then, for each variable in the data, compute the average value for each individual over all waves; compute the difference between the wave-specific value and that individual's average; and use the resulting deviation scores in a conventional OLS regression equation. That is, compute
$$\bar{y}_i = \frac{1}{n_i}\sum_{t} y_{it} \quad\text{and}\quad \bar{x}_i = \frac{1}{n_i}\sum_{t} x_{it}$$
where $n_i$ is the number of observations for person $i$. Then, for each variable, subtract the person-specific mean from the observed value: $y^*_{it} = y_{it} - \bar{y}_i$ and $x^*_{it} = x_{it} - \bar{x}_i$. This yields an equation of the form
$$y^*_{it} = \sum_{t=1}^{T} \mu_t D_t + \beta x^*_{it} + \varepsilon^*_{it} \tag{15.13}$$
where the $D_t$ are dummy variables that allow the intercept to differ by time period. Notice that the zs and the αs that were in the original equation are omitted from Equation 15.13. The reason is that they are constant within individuals, so their deviations from the person-specific means are zero. Equation 15.13 can be estimated through OLS, except that the standard errors must be based on the correct degrees of freedom, which are reduced by the number of person-specific means that have been estimated. Estimating Equation 15.13 is equivalent to leaving the data untransformed but, instead of using deviation scores, including a dummy variable for each individual in the sample. Such equations can be estimated with Stata's -xt- commands. But the optimal way to estimate [...].
The various elaborations of FE models discussed earlier in the chapter also can be adapted to the analysis of more than two waves and need not be further discussed.
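The deviation-score procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the routine a statistical package would use: the data are simulated, the function names `person_means` and `deviation_scores` are hypothetical, there is a single predictor, and the intercept is held constant across waves so that no wave dummies are needed. The unobserved person effect is built to be correlated with x, so pooled OLS should overstate the true slope of 0.5 while the within-person estimate should recover it.

```python
import random
from collections import defaultdict

def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (intercept included)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def person_means(records):
    """Return {person_id: (mean_x, mean_y)} from (person_id, x, y) records."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for pid, x, y in records:
        t = totals[pid]
        t[0] += x
        t[1] += y
        t[2] += 1
    return {pid: (sx / n, sy / n) for pid, (sx, sy, n) in totals.items()}

def deviation_scores(records):
    """Subtract each person's own mean from every one of that person's observations."""
    means = person_means(records)
    xs = [x - means[pid][0] for pid, x, _ in records]
    ys = [y - means[pid][1] for pid, _, y in records]
    return xs, ys

# Simulated panel: 3,000 persons, 4 waves each (all values hypothetical).
rng = random.Random(2)
records = []
for pid in range(3000):
    alpha = rng.gauss(0, 1)                    # unobserved, time-constant effect
    for wave in range(4):
        x = alpha + rng.gauss(0, 1)            # predictor correlated with alpha
        y = 0.5 * x + alpha + rng.gauss(0, 1)  # true slope is 0.5
        records.append((pid, x, y))

pooled_slope = ols_slope([x for _, x, _ in records], [y for _, _, y in records])
dev_x, dev_y = deviation_scores(records)
within_slope = ols_slope(dev_x, dev_y)

print("pooled OLS slope (biased by alpha):", round(pooled_slope, 3))
print("within-person slope:               ", round(within_slope, 3))
```

The sketch reports coefficients only. As noted above, standard errors from a naive OLS run on deviation scores would need a degrees-of-freedom correction.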
Fixing Effects Across Individuals Rather Than over Time
So far we have discussed methods for fixing effects over time within individuals. The same methods can be applied when we have observations on different individuals within groups, for example, families, schools, or communities. Consider the relationship between education and income. [...] may be due, in part, to characteristics of families that help their members to be successful both in school and in the job market. One way to control for such unobserved characteristics of families would be to compare siblings. Ashenfelter and Krueger (1994) carried out such an analysis [...] of differences in the level of education [...]. [...] the effects of education were in fact slightly stronger than [...], controlling for gender, age, and race.
PANEL SURVEYS IN THE PUBLIC DOMAIN
Major U.S. panel studies of interest to social scientists include
• Panel Study of Income Dynamics (PSID): http://psidonline.isr.umich.edu
• National Longitudinal Surveys (NLS): http://www.bls.gov/nls
• Wisconsin Longitudinal Study (WLS): http://www.ssc.wisc.edu/wlsresearch
• Health and Retirement Study (HRS): http://hrsonline.isr.umich.edu
• National Longitudinal Study of Adolescent Health (Add Health): http://www.cpc.unc.edu/addhealth
Important foreign panel studies include
• China Health and Nutrition Survey (CHNS): http://www.cpc.unc.edu/china
• German Socio-Economic Panel Study (SOEP): http://www.diw.de/english/sop/index.html
• Indonesia Family Life Survey (IFLS): http://www.rand.org/labor/FLS/IFLS
• Mexican Family Life Survey (MXFLS): http://www.radix.uia.mx/ennvih/main.php?lang=en
• Mexican Health and Aging Study (MHAS): http://www.mhas.pop.upenn.edu/english/home.htm
Many additional panel surveys more or less comparable to the PSID are listed at http://psidonline.isr.umich.edu/Guide/PanelStudies.aspx.
Now consider a second example. In an analysis of Indonesian data, Frankenberg and Mason (1995) studied the effect of maternal education on behaviors conducive to children's health, including sanitation and hygiene practices such as the source and treatment of drinking water, waste disposal practices, and so on. However, in developing nations such as Indonesia, both a mother's level of education and the possibility of easily obtaining safe water or protecting against contamination from human waste tend to vary across communities, depending on their level of development. In this situation one would want to purge the association between maternal education and child health-related practices of the confounding influences of community characteristics. This is what Frankenberg and Mason did by fixing community characteristics and relating differences in health practices to differences in maternal education among women in the same communities. In this way they were able to show a causal effect of maternal education on behavior conducive to child health.
Limitations of Fixed Effects Approaches and Cautions to Keep in Mind
Like all other statistical procedures, FE approaches carry a set of assumptions and requirements. When these are violated, FE coefficients may be worse (more biased) than
simply pooling data and obtaining OLS estimates. Unfortunately, often these assumptions are untestable. Here are some cautions:
• If unmeasured effects do change over time (or, in the cross-sectional application just discussed, do vary across individuals), FE estimation does not solve the bias problem. It is thus necessary to think carefully about whether the assumption of time-constant unmeasured effects is tenable. The same point holds even more strongly for family or community fixed effects: one has to assume that none of the unmeasured factors affecting the outcome varies across individuals within families or communities. This is often dubious, especially within families. To convince yourself of this, think of recent U.S. presidents and their ne'er-do-well siblings; or simply consider variations among siblings in families you know. Could such differences account for differences in the kinds of outcomes studied with family FE models? This is a crucial question, often ignored by researchers. (Of course, unmeasured effects that change over time also bias OLS coefficients. So resorting to OLS regression in such cases is no solution.)
• The predictor variables must be strictly exogenous, conditional on the unobserved variables. That is, we must assume that once we control for the unobserved variables, there is no remaining correlation between the predictor variables and the idiosyncratic errors, the $x_{it}$ and the $\varepsilon_{it}$. One common way strict exogeneity is violated is when one or more of the predictor variables depends on the outcome variable measured at a previous point in time. For example, if we were studying how the crime rate responds to changes in the size of the police force, and the size of the police force were determined by the crime rate in the previous year, the strict exogeneity assumption would be violated.
• Relative to variability in the outcome variable, there must be sufficient variability over time in the predictor variables (or across individuals in the cross-sectional FE approach). What is sufficient? This is difficult to quantify. Still, it is obvious that predictor variables that hardly vary can have little impact on the outcome, just as in OLS analysis one cannot predict a variable from a constant and will do a poor job of trying to predict a variable from a near constant.
• A corollary of the previous point is that variables that differ only by a linear transformation over time are regarded as unchanged over time. Thus, for example, age cannot be included in an over-time FE analysis because age at time 2 is identical to age at time 1 plus a constant. It then follows that variables that differ over time by a near linear transformation create problems.
• The predictor variables must be reliably measured. As Wooldridge notes, "Differencing a poorly measured regressor reduces its variation relative to its correlation with the differenced error caused by classical measurement error, resulting in a potentially sizable bias" (2006, 475).
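The last caution can be made concrete with a short calculation. The sketch below rests on assumptions that are not in the text: classical measurement error that is uncorrelated over time and with the true scores, a true score with the same variance in both waves, and a single predictor. Under those assumptions the reliability of a difference score falls as the over-time correlation of the true scores rises, because differencing removes stable true variance but none of the error variance. The function names and numeric values are hypothetical.

```python
def reliability_level(var_true, var_error):
    """Share of observed variance in a single wave that is true-score variance."""
    return var_true / (var_true + var_error)

def reliability_difference(var_true, var_error, rho):
    """Reliability of the difference x2 - x1.

    Assumes equal true variance in both waves, true scores correlated rho
    over time, and independent classical errors with variance var_error.
    """
    signal = 2 * var_true * (1 - rho)   # variance of the true difference
    noise = 2 * var_error               # variance of the differenced error
    return signal / (signal + noise)

# Hypothetical values: a reasonably well measured, fairly stable predictor.
level = reliability_level(var_true=1.0, var_error=0.25)
diff = reliability_difference(var_true=1.0, var_error=0.25, rho=0.8)

print("reliability of the level:     ", round(level, 3))
print("reliability of the difference:", round(diff, 3))
```

In a bivariate regression with classical measurement error, the expected slope is the true slope multiplied by the reliability of the predictor, so under these assumptions a coefficient estimated from differences would be attenuated considerably more than one estimated from levels.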
RANDOM EFFECTS MODELS FOR CONTINUOUS VARIABLES
Because FE models do not allow us to assess the size of the effects of time-invariant variables (or, in family, organizational, or community applications, variables that are invariant across individuals within units), there has been a strong incentive to find models that do yield such estimates. Among these, a frequently used approach is the random effects (RE) model. Like the FE model, the RE model can be written by starting with Equation 15.1. However, the assumptions are different. Whereas the FE model assumes that the $\alpha_i$ represent a set of fixed parameters, which are purged from the model by differencing, the RE model assumes that each $\alpha_i$ is a normally distributed random variable with a mean of zero and constant variance and that it is independent of $z_i$, $x_{it}$, and $\varepsilon_{it}$. This is a strong assumption.
Fortunately, it can be tested, using a test proposed by Hausman (1978). The strategy is to estimate corresponding FE and RE models and to compare the similarity of the coefficients using the Hausman test. If the null hypothesis of no difference is not rejected, we can conclude that the independence of the $\alpha_i$ is supported, which means that the RE model yields unbiased coefficients. Because the RE model yields estimates of the effects of the $z_i$, the RE model is to be preferred if the independence assumption is satisfied. If it is not satisfied, we must settle for the FE model and forgo estimates of the effects of the $z_i$. The Hausman test is quite restrictive and often does not support the RE model. Bollen and Brand (2008) offer a range of alternative statistics for comparing FE and RE models and also procedures for forming hybrid models. Bollen and Brand's procedures are based on structural equation modeling, which is beyond what is covered in this book but is briefly discussed in the next chapter.
How can we estimate the RE model? The details are beyond what can be considered here, but it is possible to sketch the general approach. Because, by assumption, $\alpha_i$ is uncorrelated with the explanatory variables, the coefficients of these variables could be consistently estimated from a single cross-section. However, doing so would ignore at least half of the data (or more, for more than two time periods). Pooling the data and estimating the coefficients through OLS also would yield consistent estimates. However, neither procedure yields the correct standard errors. The reason for this is that the errors will be serially correlated over time. We can easily see this by replacing the two error terms in Equation 15.1 with a single term for the composite error:
$$u_{it} = \alpha_i + \varepsilon_{it} \tag{15.14}$$
Because $\alpha_i$ is included in the composite error for each time period, the $u_{it}$ are serially correlated over time, with the correlation given by
$$\mathrm{corr}(u_{it}, u_{is}) = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\varepsilon^2}, \quad t \neq s \tag{15.15}$$
where $\sigma_\alpha^2 = \mathrm{Var}(\alpha_i)$ and $\sigma_\varepsilon^2 = \mathrm{Var}(\varepsilon_{it})$. However, it is possible to derive a generalized least-squares transformation that eliminates the serial correlation in the errors. Defining
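$$\lambda = 1 - \left[\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T\sigma_\alpha^2}\right]^{1/2} \tag{15.16}$$
we can write
$$y_{it} - \lambda\bar{y}_i = \mu(1 - \lambda) + \beta_1(x_{1it} - \lambda\bar{x}_{1i}) + \cdots + \beta_k(x_{kit} - \lambda\bar{x}_{ki}) + (u_{it} - \lambda\bar{u}_i) \tag{15.17}$$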
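Note the similarity to the FE transformation: rather than subtracting the entire person-specific mean, the RE transformation subtracts a fraction, λ, of it, where the size of the fraction depends on $\sigma_\alpha^2$, $\sigma_\varepsilon^2$, and T. λ can be estimated (which can be done in several ways that need not concern us), and Equation 15.17 can then be estimated through OLS from the pooled transformed data to yield consistent coefficients and the correct standard errors.
Finally, we can see the relationship between FE and RE by rewriting the error term in Equation 15.17 as
$$u_{it} - \lambda\bar{u}_i = (1 - \lambda)\alpha_i + \varepsilon_{it} - \lambda\bar{\varepsilon}_i \tag{15.18}$$
From Equation 15.18 we see that the transformed error weights the unobserved effect, $\alpha_i$, by (1 - λ). As λ approaches 1, the RE estimator approaches the FE estimator, and any bias due to correlation between $\alpha_i$ and the explanatory variables approaches zero. As λ approaches 0, a larger fraction of the unobserved effect is left in the error term and, by definition, the bias increases.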
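To get a feel for how the serial correlation in Equation 15.15 and the weight λ in Equation 15.16 behave, the two formulas can be evaluated for a few hypothetical variance components. The function names and the numbers below are illustrative assumptions; in practice the variance components would be estimated from the data.

```python
import math

def serial_correlation(var_alpha, var_eps):
    """Equation 15.15: correlation between composite errors at two time points."""
    return var_alpha / (var_alpha + var_eps)

def re_lambda(var_alpha, var_eps, n_waves):
    """Equation 15.16: fraction of the person-specific mean that RE subtracts."""
    return 1 - math.sqrt(var_eps / (var_eps + n_waves * var_alpha))

# No person-level variance: lambda is 0 and RE reduces to pooled OLS.
lam_none = re_lambda(var_alpha=0.0, var_eps=1.0, n_waves=5)

# Equal variance components, two waves versus ten waves.
rho = serial_correlation(var_alpha=1.0, var_eps=1.0)
lam_two = re_lambda(var_alpha=1.0, var_eps=1.0, n_waves=2)
lam_ten = re_lambda(var_alpha=1.0, var_eps=1.0, n_waves=10)

# Person-level variance dominating: lambda moves toward 1 (the FE case).
lam_big = re_lambda(var_alpha=100.0, var_eps=1.0, n_waves=10)

print("serial correlation:", round(rho, 3))
print("lambda, no person variance:", round(lam_none, 3))
print("lambda, T = 2: ", round(lam_two, 3))
print("lambda, T = 10:", round(lam_ten, 3))
print("lambda, large person variance:", round(lam_big, 3))
```

The pattern matches the discussion of Equation 15.18: more waves, or a larger share of person-level variance, push λ toward 1 and the RE estimates toward the FE estimates.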
A WORKED EXAMPLE: THE DETERMINANTS OF INCOME IN CHINA
To see how to estimate and interpret FE and RE models for continuous dependent variables, consider the determinants of family income in China. As elsewhere, in China income differs substantially across communities. Communities higher in the Chinese urban hierarchy (a seven-category classification that ranges from rural villages to province-level cities: Beijing, [...], Chongqing) tend to have higher average income. But they also have populations that are better endowed with both the [...]. [...] community [...], on the one hand, and family income, on the other, simply reflects the differences [...], for example, the tendency of [...] to disproportionately move to [...] labor market conditions [...].
[...] I use the sample of Chinese adults from the survey used in previous chapters. The sample design [...] one hundred rural villages and one hundred urban neighborhoods [...] about thirty households. (See the appendix for details on the study design and for information on how to obtain the data.)
analysis using a treatment-effects approach, see Brand's 2006 study of the effect of job displacement on the quality of subsequent jobs.
The most obvious way to correct for endogeneity problems is to measure all the factors thought to affect the outcome. We encountered this idea in our consideration of ordinary least-squares regression in Chapter Six, when we discussed the presentation of several models, with successively more variables, to assess how newly introduced variables mediate the effects of variables already in the model. From this we see that endogeneity bias is a form of omitted-variable bias.
However, it is not always possible to measure all the potential influences on an outcome, either because we are reanalyzing already collected data or because the analyst may not be able to identify a priori all the potential influences on an outcome that are correlated with variables explicitly measured, for example, all the factors that might both lead individuals to join a union and be correlated with their capacity to command high wages as individual employees. Thus we need ways to correct for endogeneity bias (and its close cousin, sample-selection bias). We have already covered one approach, fixed effects or random effects modeling, which is possible when we have measurements for individuals at more than one point in time or measurements for different individuals within groups (for example, families or classrooms or communities). When such data are not available, several other analytic strategies may be considered, all of which go beyond what it has been possible to cover in this book. For useful discussions of what is entailed in establishing causality, see Holland (1986), together with comments by Rubin, Cox, Glymour, and Granger, and Holland's rejoinder; and Winship and Morgan (1999).
Instrumental Variables Regression
An approach to coping with endogeneity that is popular among economists is instrumental variable (IV) estimation. If a variable (Z) can be found that is uncorrelated with unobserved variables (u) that affect the outcome (Y), is correlated with the variable (X) in the model thought to be correlated with the unobserved variables, and is conditionally unrelated to the outcome variable net of the effect of both the observed and unobserved variables, Z can be used as an instrument for X to yield relatively unbiased estimates of the effect of X on Y.
For example, consider a 1990 paper by Angrist studying the effect of service in the military during the Vietnam War on lifetime earnings. The difficulty with estimating an OLS equation is that the decision to join the military might well have been correlated with unmeasured factors that affect earnings. Angrist exploited the fact that for much of the war period, a lottery system was used to determine who would be drafted into service. Although there were many exceptions, the increased probability of being drafted for those with low numbers makes the assignment of lottery numbers a kind of natural experiment: one's lottery number was correlated with the likelihood of serving but not with other factors related both to service and to subsequent income. Thus the lottery number is a good instrument for adjusting the effect of Vietnam veteran status on income.
Another situation where IV estimation may be helpful is where the causal order is ambiguous. Suppose we observe that women who work are less likely to be depressed than women who do not work. Can we conclude that employment protects against depression? Perhaps. But the causal order might go the other way: depressed women may be less
likely to seek or retain employment. One way to address this problem would be to find an instrument for employment. A reasonable choice might be whether the woman's mother worked. It is known that the daughters of mothers who worked are more likely to work. But there is no particular reason to believe that, net of her own employment, a woman's mother's employment affects the likelihood that the woman herself experiences depression (the example is from Ettner [2004]). Thus, mother's employment would satisfy the conditions for an instrumental variable.
A final circumstance in which IV approaches can be helpful is to estimate simultaneous equation models or, as they are sometimes called, reciprocal causation models. Wooldridge (2002, 555) provides a useful example: in a sample of cities, we might expect the murder rate to depend on the size of the police force (the more police per capita, the lower the expected per capita murder rate). But we might also expect the size of the police force to depend on the murder rate: the higher the (anticipated) murder rate, the greater the incentive to increase the size of the police force. Because we observe only the equilibrium condition, a particular murder rate and a police force of a particular size, specifying a simultaneous equation model amounts to asking the counterfactual questions: What would be the murder rate if the size of the police force were different? What would be the size of the police force if the murder rate were different? IV methods provide a way of estimating such models.
Another case that usually involves reciprocal causation and hence might be handled by IV methods (or by structural equation models of the kind discussed later) is when one attitude is thought to affect another but they are measured at the same time.
Rather than assuming that somehow one causally precedes the other, it usually is more sensible to treat them as either both dependent on a third variable (seemingly unrelated regression, encountered in Chapter Eleven, can be helpful in such cases) or as having reciprocal effects.
The difficulty with IV estimation is that it often is difficult to find good instrumental variables, and poor instruments often produce results worse (more biased) than using no instruments at all. For a good introduction to IV estimation and its dangers, see Wooldridge (2006, Chapters 15 and 16). Other useful references include Baum (2006), the entry for the -ivregress- command in StataCorp (2007), and Greene (2008).
Sample-Selection Bias
Sample-selection bias arises when unmeasured factors correlated with an outcome determine whether an individual is included in the sample. For example, a woman may enter the labor force only if she can command a reasonably high wage. Thus selection into the sample (people in the labor force) is nonrandom but depends on unmeasured characteristics correlated with the outcome variable (wages). Analyzing only those with wages then results in biased estimates.
Consider another example. Many surveys in China are restricted to the de jure urban population. As Wu and Treiman (2007) have shown, in such studies estimates of intergenerational mobility are overstated because those of rural origins who obtain urban registration are not a random sample of the rural population but rather are the "best and the brightest," who have experienced long-range upward social mobility. If the entire population is included in the analysis, the estimates are much more modest.
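The wage example can be illustrated with simulated data. The sketch below is a simplified illustration and not the full correction procedure taken up in the next section: all parameter values and variable names are hypothetical, and the selection index is treated as known, whereas in practice it would be estimated with a probit equation. An unmeasured factor `v` raises both the chance of being observed and the wage. The code compares the education slope in the full population, the slope among selected cases only, and the slope after adding the expected error (the inverse Mills ratio) as a regressor.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def inverse_mills(c):
    """Expected value of a standard normal error, given that it exceeds -c."""
    return STD_NORMAL.pdf(c) / STD_NORMAL.cdf(c)

def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (intercept included)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)

def ols_two(x1, x2, y):
    """Coefficients (b1, b2) from an OLS regression of y on x1 and x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

rng = random.Random(3)
educ_all, wage_all = [], []
educ_sel, wage_sel, mills_sel = [], [], []
for _ in range(40000):
    educ = rng.gauss(0, 1)
    excl = rng.gauss(0, 1)            # hypothetical variable affecting selection only
    v = rng.gauss(0, 1)               # unmeasured factor in the selection equation
    u = v + rng.gauss(0, 0.5)         # wage error shares v, so it is correlated with selection
    wage = 1.0 + 0.5 * educ + u       # true education slope is 0.5
    index = educ + excl               # selection index, assumed known here
    educ_all.append(educ)
    wage_all.append(wage)
    if index + v > 0:                 # wage observed only for selected cases
        educ_sel.append(educ)
        wage_sel.append(wage)
        mills_sel.append(inverse_mills(index))

full_slope = ols_slope(educ_all, wage_all)
naive_slope = ols_slope(educ_sel, wage_sel)
corrected_slope, mills_coef = ols_two(educ_sel, mills_sel, wage_sel)

print("full population slope:     ", round(full_slope, 3))
print("selected cases only:       ", round(naive_slope, 3))
print("selected, with Mills ratio:", round(corrected_slope, 3))
```

The simulation includes a variable that affects selection but not wages. Without such a variable the expected-error term would be nearly collinear with education, and the corrected estimate would be far less stable.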
Heckman Selection Model
A standard approach to correcting for sample-selection bias (in cases where it is not possible to redefine the population as Wu and Treiman did) is to use a Heckman correction (see Heckman 1979). The procedure involves predicting (using a binary probit equation) the probability of being in the sample (or, equivalently, of having an observed outcome), calculating the expected error for each observation, and using these errors as regressors in an equation predicting the outcome of interest. See Winship and Mare (1992) for a very clear exposition of this and other models for sample-selection bias, and see Dubin and Rivers (1989) for an extension of these procedures to models with binary outcomes. The Stata entry for the -heckman- command (StataCorp 2007) offers another very clear example and exposition of the method, using the canonical example, women's earnings. In the example, earnings (for women who have earnings) are predicted from education and age, and the probability of having earnings is predicted from marital status, the number of children at home, education, and age (and implicitly, through the inclusion of education and age, which predict the outcome, from the expected wage itself). Note that the assumption here is that marital status and the number of children at home do not affect earnings but only the probability of having earnings. We might well question this assumption because married women, and particularly women with children at home, may choose to take lower-paying jobs that more readily accommodate their dual careers as workers and mothers.
This example thus reveals a major limitation of the procedure. To yield robust results, the predictors in the selection equation should strongly affect the probability of being selected but should have no net effect on the outcome. (Heckman corrections can be made even when there are no such variables, by relying on the functional form of the equation to identify the model.
However, the results are often neither robust nor substantively compelling.) Suitable variables are often difficult to find. Note the similarity to IV estimation discussed previously.
For instructive applications of corrections for sample-selection bias, see Mare and Winship's 1984 study of employment trends for young Black and White men; Hagan's studies of factors influencing the severity of punishment for convicted criminals (Peterson and Hagan 1984; Hagan and Parker 1985; Zatz and Hagan 1985); Manski and Wise's (1983) study of the determinants of graduation from college; and Hardy's (1989) study of occupational mobility in the nineteenth century based on matching data across censuses, which takes account of selection due to deaths, emigration, and name changes.
Endogenous Switching Regression
Note that the Heckman procedure also can be used to analyze endogenous treatment effects, as an alternative to IV estimation. However, a separate command, -treatreg-, is also available in Stata, in addition to -heckman-. The problem of an endogenous treatment effect (that is, where there is a nonzero correlation between assignment to a "treatment" group and unmeasured factors affecting the outcome) can in turn be generalized to the case in which the parameters of a model linking treatments to outcomes differ across treatment groups and assignment to treatment groups is endogenous. For example, Gerber (2000) asks whether the fact that former Communist Party members do better in post-Soviet Russia than do others is due to
residual social capital (the fact that connections continue to favor former party members) or rather to unmeasured factors that affect both the likelihood that people became party members during the Soviet era and their earnings in the post-Soviet period.
This kind of problem can be addressed using methods that are similar to those for treatment effects and sample-selection problems, specifically, endogenous switching regression models. Endogenous switching regression models are used in situations where one outcome is observed if a selection variable, Z, equals 0, but a different outcome is observed if Z equals 1. Using this method, Gerber concludes that the advantage enjoyed by former communists is due entirely to unmeasured characteristics associated with becoming a member of the Communist Party and that there is no lingering effect of Soviet-era social or political capital. (See also the critique by Rona-Tas and Guseva [2001] and the rejoinder by Gerber [2001].)
Good descriptions of the technique and of how to implement it can be found in Mare and Winship (1988) and Powers (1993). For additional applications see Willis and Rosen (1979); Gamoran and Mare (1989); Long (1990); Sakamoto and Chen (1991); Manski and others (1992); Tienda and Wilson (1992); Powers and Ellison (1995); Smock, Manning, and Gupta (1999); Hofmeyr and Lucas (2001); Lichter, McLaughlin, and Ribar (2002); Sousa-Poza (2004); and Prouteau and Wolff (2006).
Propensity Score Matching
Another threat to correct causal inference occurs when the predictor variable of interest occurs only rarely in the sample and is highly correlated with other independent variables. For example, what is the effect of attending an elite university on subsequent occupational status? The usual way of approaching such a question is to carry out a multiple regression of occupational status on attendance at elite versus other universities plus a set of variables controlling for family background, high school performance, and so on.
The difficulty is that attending an elite university tends to be so highly correlated with the control variables that controlling for confounding factors fails to hold them constant, because there are few people with low values on the control variables who attend elite universities. Apart from the conceptual problem this creates about the meaning of "holding constant," there is a serious statistical problem: "unbalanced treatments" tend to inflate standard errors (Rosenbaum and Rubin 1983, 48), making problematic the rejection of the null hypothesis of no effect. To cope with this problem, analysts sometimes resort to matching pairs of cases that differ with respect to the variable of interest (the "treatment" variable) but that are identical on a set of covariates. However, as Smith notes (1997, 326-327), until recently matching studies often have been resisted on the ground that they involve "throwing away" a lot of data. Moreover, it often is difficult to find good matches for more than a small number of variables because for a linear increase in the number of covariates there is a geometric increase in the number of matches required.
However, advances in the statistical theory of matching (the seminal article is by Rosenbaum and Rubin [1983]) have led to the development of a procedure that replaces the large set of discrete matches required by classical matching procedures with a propensity score, a scalar summary of the degree of similarity between cases with respect to a large number of covariates. The procedure involves predicting the treatment variable
from covariates and then matching each "treatment" case with the control case that has the nearest propensity score (or sometimes with several control cases; see Morgan and Winship [2007] for a useful discussion of the technical issues involved). The resulting sample is then analyzed in one of several ways: focusing on outcome differences between matched treatment and control cases, ignoring the unmatched cases; stratifying the sample into strata with similar propensity scores and comparing outcomes within strata (for an interesting application, see Brand and Xie [2007]); or using the propensity score directly in a regression equation to get an estimate of the effect of the treatment net of the propensity to be in the "treatment" group. The essential insight is that by comparing cases that have a similar propensity to be in the treatment group, we create a quasi experiment. That is, we can think of matched cases as being, in effect, randomly assigned to either the treatment or the control group because they have the same probability of being in either group, given their covariates.
Consider the example presented by Smith (1997) in his illuminating exegesis of matching methods. He was interested in comparing the mortality rate in two types of hospitals, ordinary hospitals (N = 5,053) and "magnet" hospitals (N = 39), that is, hospitals with organizational practices that enhanced their reputations as good places to practice nursing. Contrasting an OLS analysis with a propensity matching procedure, he showed that the two methods yielded similar estimates of the difference in mortality rates in the two types of hospitals, but the latter method had far smaller standard errors, yielding a statistically significant reduction in mortality in the magnet hospitals compared to ordinary hospitals, a conclusion not yielded by the OLS analysis because of the large standard errors resulting from the unbalanced design.
There is by now a substantial literature on both the statistical theory underlying propensity score matching and practical procedures for implementing the method. Smith's 1997 paper is a good place to start and also has a useful bibliography. Becker and Ichino (2002), Abadie and others (2004), and Becker and Caliendo (2007) discuss the implementation of propensity score matching in Stata. Dehejia and Wahba (2002) and Brand and Halaby (2006) provide useful evaluations and worked examples. Harding (2002) is a particularly instructive application. For other applications, see Berk and Newton (1985), [...] and others (1995), Keating and others (2001), Lu and others (2001), Morgan (2001), [...] and Smith (2003), Lundquist (2004), and Cohen (2005). One limitation of propensity score matching is that it may not balance unobserved covariates. Thus if you suspect [...], you will need to resort to one of the methods discussed here or in the previous chapter that are specifically designed to handle such problems.
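The nearest-score matching step described in this section can be sketched as follows. This is an illustration under simplifying assumptions, not a full implementation: the data are simulated, the true propensity score is used in place of one estimated from a logit or probit equation, each treated case is matched to a single control with replacement, and no standard errors are computed. All names and values are hypothetical. Treatment is fairly rare and strongly related to two covariates that also affect the outcome, so the raw difference in means should badly overstate the true treatment effect of 1.0.

```python
import bisect
import math
import random

def nearest_index(sorted_scores, score):
    """Index of the entry in sorted_scores that is closest to score."""
    i = bisect.bisect_left(sorted_scores, score)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(sorted_scores)]
    return min(candidates, key=lambda j: abs(sorted_scores[j] - score))

rng = random.Random(4)
treated, controls = [], []          # each entry: (propensity score, outcome)
for _ in range(20000):
    x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    # Propensity to be treated; assumed known here rather than estimated.
    p = 1 / (1 + math.exp(-(-2 + x1 + x2)))
    is_treated = rng.random() < p
    # Outcome depends on both covariates; the true treatment effect is 1.0.
    y = x1 + 2 * x2 + (1.0 if is_treated else 0.0) + rng.gauss(0, 1)
    (treated if is_treated else controls).append((p, y))

# Unadjusted comparison of treated and control means.
naive_diff = (sum(y for _, y in treated) / len(treated)
              - sum(y for _, y in controls) / len(controls))

# Match each treated case to the control with the nearest propensity score.
controls.sort()
control_scores = [p for p, _ in controls]
pair_diffs = []
for p, y in treated:
    j = nearest_index(control_scores, p)
    pair_diffs.append(y - controls[j][1])
matched_diff = sum(pair_diffs) / len(pair_diffs)

print("treated cases:", len(treated), " control cases:", len(controls))
print("unadjusted difference in means:", round(naive_diff, 3))
print("matched difference:            ", round(matched_diff, 3))
```

Note that matching is done on a single number even though the outcome depends on the two covariates in different ways. This works because, among cases with the same propensity score, the covariates are distributed alike in the treated and control groups. As the text cautions, nothing in the procedure balances covariates that were left out of the propensity equation.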
Structural Equation Models
Structural equation modeling (SEM) is a technique (or, more precisely, a set of techniques) that permits the estimation of systems of equations, often involving unmeasured latent constructs. Consider a simple example, Blau and Duncan's (1967, 170) classic model of status attainment, shown in Figure 16.2. When we think about how occupational status is transmitted from one generation to the next, it becomes evident that this is a multistep process: men whose fathers are well educated and have high-status jobs tend to achieve more schooling; those who achieve high levels of schooling tend to obtain high-status first jobs (but their social origins may also help); and those who have high-status first jobs are likely to be able to parlay them into high-status current jobs (but their social origins and even their education may continue to matter). The various "paths" by which fathers' occupational status is transmitted to their sons are shown in the figure, known as a "path diagram." The paths can be represented by a set of equations predicting each of the outcomes in turn. The relationships among the equations can be explored to yield insights regarding the relative importance of different paths linking the two variables. Moreover, under some circumstances (typically, if the size of particular coefficients is fixed, usually but not necessarily at zero, or two or more coefficients are constrained to be equal, that is, if the model is overidentified), the goodness of fit of the model can be assessed.
[Figure 16.2. Blau and Duncan's Basic Model of the Process of Stratification. Path diagram linking father's education, father's occupation, respondent's education, and first job. Source: Blau and Duncan 1967, 170.]
In the model just discussed, there is only one indicator for each variable. However, often the analyst has available repeated measures or a set of measures thought to represent a single underlying or latent construct. In such cases, it is possible to use SEMs to assess and correct for measurement error. See Bielby, Hauser, and Featherman (1977) and Hauser, Tsai, and Sewell (1983) for two early but instructive examples. Still another use of SEMs is to estimate processes involving reciprocal causation, even involving latent variables. See Duncan, Haller, and Portes (1968) for such an example. (Note that the applications just cited are all very old. This is not due to a lack of recent work, since the recent literature is vast, but rather to the fact that the early work was much more explicit
Final Thoughts and Future Directions: Research Design and Interpretation Issues
OTIS DUDLEY DUNCAN was, according to statistician-sociologist Leo Goodman, "the most important quantitative sociologist in the world in the latter half of the twentieth century." Duncan was responsible for introducing path analysis (a version of structural equation models) into sociology. He used path analysis as the technical apparatus to reconceptualize intergenerational social mobility as a multistep process in which status attributes (such as education, occupational status, and income) are modeled as depending not only on parental status but also on the prior statuses of individuals. Duncan also contributed importantly to our understanding of racial differences in socioeconomic attainment, spatial and racial inequalities within cities, and, late in his career, attitude measurement. Although lacking advanced mathematical training, Duncan probably made better use of the statistical tools at his disposal than any other social scientist, through the combination of an unusual ability to think through a problem in advance and great clarity about how to represent sociological ideas in statistical models. It is striking, and telling, that because of then extant rules governing access to Current Population Survey data, all of the tabulations and estimates in Duncan's landmark book The American Occupational Structure (Blau and Duncan 1967) were specified in advance, without the analysts having seen a single coefficient. Interestingly, Duncan himself regarded his late book, Notes on Social Measurement (1984), as his most important contribution, a judgment not widely shared by the many researchers strongly influenced by his substantive contributions.
Born in Nocona, Texas, Duncan spent most of his precollege years in Stillwater, Oklahoma, where his father, Otis Durant Duncan, also a sociologist, was a professor at Oklahoma State University. Duncan did his undergraduate work at Louisiana State University, obtained an MA at the University of Minnesota, served three years in the U.S. Army during World War II, and then completed his PhD at the University of Chicago in 1949. He taught at Pennsylvania State University, the University of Wisconsin, the University of Chicago, the University of Michigan, the University of Arizona, and the University of California at Santa Barbara. Duncan enjoyed a second career as a composer of electronic music and was famous among people who had no idea that he was a distinguished social scientist.
about the models being estimated than much of the literature that followed, after structural equation modeling became widely used. Thus for didactic purposes the early papers are more useful.)
The strategy for estimating SEMs is to exploit the fact that the posited relationships among the variables (observed and latent) imply a particular covariance structure (that is, a set of relationships among the variances and covariances of the observed variables), which is why the technique is sometimes called covariance structure modeling. Goodness of fit is assessed by comparing the covariance structure implied by the model with the covariance structure observed in the data set being analyzed.
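To make the idea of an implied covariance structure concrete, here is a minimal sketch in Python (NumPy is assumed to be available). It is not Blau and Duncan's model: the three-variable chain, its coefficients, and the function names are hypothetical, chosen only for illustration. For a recursive path model written as y = By + e, the implied covariance matrix is (I - B)^-1 Psi (I - B)^-T, where Psi holds the disturbance variances. The discrepancy function is the usual maximum likelihood fitting function, which equals zero when the implied and observed matrices coincide.

```python
import numpy as np

def implied_covariance(B, Psi):
    """Covariance matrix implied by a recursive path model y = B y + e, Cov(e) = Psi."""
    A = np.linalg.inv(np.eye(B.shape[0]) - B)
    return A @ Psi @ A.T

def ml_discrepancy(S, Sigma):
    """ML fitting function: log|Sigma| + tr(S Sigma^-1) - log|S| - p (0 means a perfect fit)."""
    p = S.shape[0]
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    _, logdet_s = np.linalg.slogdet(S)
    return logdet_sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_s - p

# Hypothetical chain x -> y -> z. The direct path x -> z is fixed at zero,
# which is the kind of restriction that makes a model overidentified and testable.
B = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0],    # y = 0.5 x + e_y
              [0.0, 0.4, 0.0]])   # z = 0.4 y + e_z
Psi = np.diag([1.0, 0.75, 0.84])  # disturbance variances chosen so each variable has variance 1

Sigma = implied_covariance(B, Psi)

# A made-up "observed" matrix in which x and z covary more than the chain allows.
S = Sigma.copy()
S[0, 2] = S[2, 0] = 0.45

fit_perfect = ml_discrepancy(Sigma, Sigma)  # implied and observed structures agree
fit_misfit = ml_discrepancy(S, Sigma)       # positive: the zero restriction is violated

print(np.round(Sigma, 3))
print(fit_perfect, fit_misfit)
```

In this sketch the model implies that the covariance of x and z equals the product of the two path coefficients; an observed covariance that departs from that value is what produces a positive discrepancy.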
that enables the analyst to explore the implications of whatever model the analyst posits on a priori grounds. Thus structural equation modeling is best seen as an interpretative procedure, with the added feature that in some cases it is possible to determine whether a particular model is consistent with observed data. Used properly in this way, SEM can be a valuable tool. (The best introduction remains the 1989 text by Bollen, which, although somewhat demanding, is intended for and accessible to social scientists. See also a collection of papers on technical issues, edited by Bollen and Long [1993]; Bollen and Curran's 2006 book using SEMs to estimate latent curve models; and Bollen and Brand's 2008 paper using SEMs to estimate random and fixed effects models.)
THE IMPORTANCE OF PROBABILITY SAMPLING
To generalize from a sample to a population (which is what social scientists are almost always interested in doing, whether we admit it or not), it is necessary to sample cases from the population of interest in such a way that each individual in the population has a known probability of being included in the sample. Only under this circumstance do the principles of statistical inference apply. Nonetheless, many studies violate this principle, drawing "convenience" or "casual" samples. Chinese social surveys are particularly egregious in this respect, often sampling a set of provinces or cities that are said to be typical of particular types of places; this is true of even high-quality surveys such as the Chinese Health and Nutrition Survey (Henderson and others 1994). The difficulty is that there is no way of knowing to what extent and in what ways the chosen places are indeed similar to the places that are not chosen but are purported to be represented by the chosen places. In sum, samples of "typical" places are no substitute for probability samples. It is well worth the extra cost, in the sampling effort and, often, in the fieldwork, to design a sample in such a way that it can be generalized to the population of interest.
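Why a known inclusion probability matters can be illustrated with a small simulation. The following is a sketch only, using the Python standard library; the population, the strata sizes, the sampling fractions, and the function name are all hypothetical. When each case's probability of selection is known, weighting by its inverse recovers the population mean even from a deliberately unbalanced design, which is not possible for a convenience sample whose selection probabilities are unknown.

```python
import random

def weighted_mean(values, inclusion_probs):
    """Estimate a population mean, weighting each case by 1 / (its inclusion probability)."""
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical population: 9,000 rural cases (mean near 6 years of schooling)
# and 1,000 urban cases (mean near 12), with urban cases deliberately oversampled.
rng = random.Random(42)
rural = [rng.gauss(6, 2) for _ in range(9000)]
urban = [rng.gauss(12, 2) for _ in range(1000)]
population_mean = (sum(rural) + sum(urban)) / 10000

rural_sample = rng.sample(rural, 450)   # inclusion probability 450 / 9000 = 0.05
urban_sample = rng.sample(urban, 450)   # inclusion probability 450 / 1000 = 0.45
values = rural_sample + urban_sample
probs = [0.05] * 450 + [0.45] * 450

naive = sum(values) / len(values)            # ignores the design; pulled toward the urban mean
design_based = weighted_mean(values, probs)  # uses the known inclusion probabilities

print(round(population_mean, 2), round(naive, 2), round(design_based, 2))
```

The unweighted mean treats the oversampled urban cases as if they were half the population; the weighted estimate does not, but only because the selection probabilities were fixed by design.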
ASK A FOREIGNER TO DO IT
Social scientists are notoriously bad at characterizing their own societies. A case in point: in my 1996 Chinese survey, it proved impossible to do the fieldwork in one county-level urban district because of opposition from local officials. Instead of asking me to provide a substitute place from the same stratum (recall from Chapter Nine that there were twenty-five urban strata, based on the level of education in the population), my Chinese colleagues simply substituted another district from the same city that they said was very similar to the omitted district. However, it turned out that whereas the omitted district was in the eighteenth stratum, the substitute was in the twenty-third stratum, clearly a violation of the stratified sampling design. The truth is that if you want a clear-headed characterization of a society, you should ask a foreigner to render it. This essential point was understood by the Carnegie Corporation, which commissioned the Swedish economist and sociologist Gunnar Myrdal to head a study of race relations in the United States in the 1930s. The result was the classic monograph An American Dilemma (Myrdal 1944, vi-vii).
Still another example can be found in institutionally based studies, for example, studies of hospitals, clinics, and their catchment areas, which are often used in public health research. The justification for what amount to convenience samples is that the particular hospitals or clinics being studied are representative of all similar places.
When is it legitimate to invoke the concept of a superpopulation? I suggest that when data for a population exist from which a probability sample can be drawn, convenience samples are a poor substitute and do not meet current scientific standards, claims of graduate student poverty, lack of time, and so on, notwithstanding. However, when the population is unknown and unknowable, as in the case of Murdock and Provost's ethnographic sample, use of the available data and generalization to a superpopulation are legitimate. In the case of single cross-sectional surveys based on probability samples of the population at the time of the survey, we are on firm ground in characterizing the society as it was at the time of the survey but are increasingly on shaky ground as we try to generalize over time. It can be done, but it must be justified.

Pooling Data from Multiple Surveys
Invoking the concept of a superpopulation has considerable practical use when it can be justified. A particularly compelling application is when comparable data are available over time, as in the U.S. GSS and other repeated cross-sections. If it can be shown that relationships among the variables of interest do not vary over time, data from several years may be pooled to increase the size of the sample available for analysis. This can be a particularly useful strategy when any one year yields insufficient data to sustain reliable comparisons, for example of race differences in the United States. The basic test is a variant on the strategy for group comparisons discussed earlier in this chapter and also in Chapter Six (see also the discussion of trend analysis in Chapter Seven). There are two steps. First, estimate an equation of the form
$$
\hat{Y} = a + \sum_{i=1}^{I} b_i X_i + \sum_{j=2}^{J} c_j T_j + \sum_{i=1}^{I} \sum_{j=2}^{J} d_{ij} X_i T_j \tag{16.7}
$$
where the $X_i$ are predictor variables and the $T_j$ are cross-sectional replicates of the survey (with the first omitted to avoid linear dependency). Second, test whether the $c_j$ and the $d_{ij}$ are collectively zero. If so, you can conclude that all the samples are drawn from a single population and happily proceed to pool your data. But even if there are year-to-year variations in the level of Y (significant differences among the $c_j$) or in the relationship of one or more of the Xs to Y (significant differences among the $d_{ij}$), you may still wish to pool your data but include the dummy variables and interaction terms necessary to capture the way the social process you are studying changes over time. This has the advantage of permitting an analysis of change and increasing statistical power for estimating the relationships that do not vary over time. (For some recent examples of the use of this strategy, see Barkan and Greenwood [2003], Chen and Guilkey [2003], Pow [2003], Fitzgerald and Ribar [2004], Kelly and Kelly [2005], and Tavits [2005].)
An alternative use of repeated cross-sections, which has much to recommend it, is to use data from one survey to develop a preferred model, modifying the model in light of
relationships unanticipated by your theory but observed in the data. Then estimate your preferred model using data from a replicated cross-section, for example, a survey conducted in the preceding or following year (recall the discussion of this strategy in Chapter Seven).
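The two-step pooling test described earlier in this section (Equation 16.7) can be sketched in a few lines. What follows is an illustration under stated assumptions, not a prescribed implementation: it uses Python with NumPy, ordinary least squares without survey weights or design corrections, simulated data, and hypothetical function and variable names. The restricted model contains only the predictors; the full model adds a dummy variable for every survey but the first and the interaction of each dummy with each predictor. The F statistic tests whether the added coefficients are collectively zero.

```python
import numpy as np

def residual_ss(X, y):
    """Residual sum of squares from an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def pooling_f_test(x, y, survey):
    """F test that all survey dummies and survey-by-predictor interactions are zero."""
    y = np.asarray(y, dtype=float)
    survey = np.asarray(survey)
    n = len(y)
    x = np.asarray(x, dtype=float).reshape(n, -1)
    surveys = np.unique(survey)
    # One dummy per survey, omitting the first to avoid linear dependency.
    dummies = np.column_stack([(survey == s).astype(float) for s in surveys[1:]])
    interactions = np.column_stack(
        [dummies[:, [j]] * x for j in range(dummies.shape[1])]
    )
    X_restricted = np.column_stack([np.ones(n), x])
    X_full = np.column_stack([X_restricted, dummies, interactions])
    df_num = X_full.shape[1] - X_restricted.shape[1]
    df_den = n - X_full.shape[1]
    rss_r = residual_ss(X_restricted, y)
    rss_f = residual_ss(X_full, y)
    f_stat = ((rss_r - rss_f) / df_num) / (rss_f / df_den)
    return f_stat, df_num, df_den

# Simulated example: three hypothetical survey years with 200 cases each.
rng = np.random.default_rng(0)
survey = np.repeat([2000, 2002, 2004], 200)
x = rng.normal(size=600)
noise = rng.normal(size=600)
y_stable = 1.0 + 0.5 * x + noise                      # same process in every year
y_changing = y_stable + 0.8 * x * (survey == 2004)    # slope shifts in the last year

f_stable, df1, df2 = pooling_f_test(x, y_stable, survey)
f_changing, _, _ = pooling_f_test(x, y_changing, survey)
print(f_stable, f_changing, df1, df2)
```

The F statistic would be referred to an F distribution with the returned degrees of freedom. With real survey data the test should be carried out with survey estimation procedures that respect the sample design, which this sketch ignores.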
A final possibility, in cases where information is collected for more than one individual in a household (either by interviewing more than one household member or by asking a respondent about the characteristics of other household members), is to expand the sample by treating each individual for whom information is available as a case. However, in such cases it is necessary to take account of the fact that observations are not independent, by adjusting for clustering within households using survey estimation procedures or by adopting the kind of multilevel modeling discussed by Mason (2001), cited earlier. Moreover, when information is available for a restricted set of others, for example, spouses, it is important to be attentive to the consequences, for example, by carrying out a sensitivity analysis of the differences in conclusions yielded by a sample of married people and a sample of all adults (see the discussion of sensitivity analysis in the next section).
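The consequence of ignoring within-household clustering can be seen in a small simulation. This is a sketch under assumptions, written in Python with NumPy: the data are simulated, the function name is hypothetical, and the sandwich estimator shown applies no small-sample correction, so its values will differ slightly from those of packages that apply one. It is one simple way of adjusting for clustering, not a substitute for full survey estimation or multilevel modeling.

```python
import numpy as np

def ols_with_cluster_se(X, y, cluster):
    """OLS coefficients with cluster-robust (sandwich) standard errors.

    No small-sample correction is applied.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    cluster = np.asarray(cluster)
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        rows = cluster == g
        score = X[rows].T @ resid[rows]   # sum of x_i * e_i within the cluster
        meat += np.outer(score, score)
    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))

# Hypothetical data: 300 households, two members each, who share a household-level
# predictor and an unobserved household effect.
rng = np.random.default_rng(1)
n_hh = 300
household = np.repeat(np.arange(n_hh), 2)
hh_x = rng.normal(size=n_hh)
hh_effect = rng.normal(size=n_hh)
x = hh_x[household]
y = 2.0 + 0.5 * x + hh_effect[household] + rng.normal(scale=0.5, size=2 * n_hh)
X = np.column_stack([np.ones(2 * n_hh), x])

beta, se_cluster = ols_with_cluster_se(X, y, household)
# Treating every person as his or her own cluster ignores the household structure.
_, se_independent = ols_with_cluster_se(X, y, np.arange(2 * n_hh))
print(beta, se_cluster, se_independent)
```

Because the two members of each household carry largely the same information, the standard error that ignores clustering overstates the precision of the slope; the clustered version treats the household, not the person, as the independent unit.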
A FINAL NOTE: GOOD PROFESSIONAL PRACTICE
quantitativedata analystsandhar-ebd Now that we haveconsideredvanous issuesfacing of study, I close by offering serall a uri"r introduction to advancedtechniquesworthy that make a difference bersrar things g good professional pra&ce-the t;-dt ;;; principles' availat'tsI are-simple mediocreand supenorquantltatrvedata analysis These or brilliance insight or matllerEuny--ufyrt; tft"it upplication doesnot require particular to them is sure to improve the quality of your work' i"if f".lfi v. e* ",Adon
the Propertiesof You Data aJnderstand
Whether working with data you have collected yourself, data you acquired from another analyst, or data from an archive, you should thoroughly understand how the data were created and also should explore their properties. Pay particular attention to the sample design to determine whether survey estimation is possible and how to implement it. For the same reason, you need to understand how any weight variables were constructed and how to use them. Often the weights provided by the original investigators are poorly documented. It is entirely appropriate to write to the investigators to ask them how they constructed their weights. You should not regard this as an imposition; one of the responsibilities of those who make their data available for public use is to provide adequate documentation.
You also should calculate and inspect univariate frequency distributions for every variable in the data set, or at least every variable pertinent to your analysis. This is easy to do by using Stata's -codebook- command. With respect to each variable, ask yourself whether the observed distribution is plausible, given what you know about the population being studied. It is surprising how informative the inspection of univariate distributions can be. The next step is to create cross-tabulations or tables of means that show the association between each of your central dependent variables and all
IN THE UNITED STATES, PUBLICLY FUNDED STUDIES MUST BE MADE AVAILABLE TO THE RESEARCH COMMUNITY
It is now a requirement of both the National Science Foundation (NSF) and the National Institutes of Health (NIH) that sample surveys funded by these agencies be made available for public use in a timely way. The current NIH policy reads, "NIH endorses the sharing of final research data . . . and expects and supports the timely release and sharing of final research data from NIH-supported studies for use by other researchers. 'Timely release and sharing' is defined as no later than the acceptance for publication of the main findings from the final data set" (http://grants.nih.gov/grants/policy/nihgps_2003/NIHGPS_Part7.htm#_Toc54600131, accessed December 9, 2007). The NSF policy statement is less precise but conveys the same principle: "NSF expects . . . investigators to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work. It also encourages awardees to share software and inventions or otherwise act to make the innovations they embody widely useful and usable" (http://www.nsf.gov/pubs/2001/gc101/gc101rev1.pdf, accessed December 9, 2007). Providing adequate documentation is part of the requirement.
the candidate predictor variables. This too can be extremely informative, revealing both deficiencies in the data and deficiencies in your a priori assumptions.
I still recall, with some embarrassment, an incident forty-five years ago when I was a beginning graduate student at the University of Chicago. I worked as a research assistant at the National Opinion Research Center (NORC), and Peter Rossi was the director of NORC. I ran into him one evening as he was leaving the building and carrying a great stack of computer printout: cross-tabs from the study we were working on. I made some snide remark about why should we bother with cross-tabs now that we could do regressions by computer, and he gave me a withering look and said something like, "Live and learn, kid." Of course, he was completely correct. There is a lot to be learned by getting a feel for the data before rushing to estimate fancy, or even not-so-fancy, models.
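The kind of first look at the data recommended above can be sketched as follows. This is an illustration only, written in Python with the standard library rather than Stata; it is not a reimplementation of the -codebook- command, and the variables, values, and function names are hypothetical. It prints a univariate frequency distribution and then a table of means of an outcome by categories of a predictor.

```python
from collections import Counter, defaultdict

def frequency_table(values):
    """Return (value, count, percent) rows sorted by value, with missing values (None) last."""
    counts = Counter(values)
    total = len(values)
    keys = sorted(k for k in counts if k is not None)
    if None in counts:
        keys.append(None)
    return [(k, counts[k], 100.0 * counts[k] / total) for k in keys]

def means_by_group(groups, outcomes):
    """Mean of the outcome within each category of a predictor, skipping missing outcomes."""
    sums, ns = defaultdict(float), defaultdict(int)
    for g, v in zip(groups, outcomes):
        if v is not None:
            sums[g] += v
            ns[g] += 1
    return {g: sums[g] / ns[g] for g in sorted(sums)}

# Hypothetical records: years of schooling and occupational sector.
# The 99 is a missing-data code that was never converted to missing.
schooling = [12, 16, 12, 9, 99, 12, 16, None, 9, 12]
sector = ["manual", "nonmanual", "manual", "manual", "manual",
          "nonmanual", "nonmanual", "manual", "manual", "nonmanual"]

for value, count, pct in frequency_table(schooling):
    print(f"{value!s:>5} {count:>3} {pct:5.1f}%")

print(means_by_group(sector, schooling))
```

In this made-up example the frequency table exposes the implausible value of 99 years of schooling at a glance, and the table of means shows how a single unconverted missing-data code can distort a group mean before any model is estimated.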
Explore Alternatives to Your a Priori Hypotheses
One of the features of truly strong research papers is that the author anticipates and explores all of the alternative explanations for the observed phenomenon or relationship that a critic might propose. In nonexperimental work the search for alternative explanations often amounts to assessing the possibility of spurious association due to the failure to include variables that affect both the independent and dependent variables in the model. Thus you need to ask yourself, is there an alternative explanation for the association I observe? In particular, might some other variable be causing both the outcome I observe and the values of my predictor variables? Then, if possible, include the candidate variables in your model, or do a side analysis (even using a different data set) to investigate the association of these variables with variables already in your model.
A nice example of the use of this strategy is a paper by Miller (2007) that explores whether granting women the vote early in the twentieth century resulted in an increase in public health spending and hence a reduction in child mortality. He finds strong evidence in support of his claim. But he recognizes that before accepting the causal argument, he needs to rule out the possibility that suffrage legislation was endogenous, that is, a consequence of factors that also resulted in an increase in public health spending. He thus devotes a section of his paper to various "validity tests" designed to rule out the possibility of alternative explanations for his results.
Where the strategy just discussed is not possible because the potential confounding variables have not been observed, it may be possible to rule them out by noting that if they were operating, the predicted results would differ from what is observed. For example, in a paper analyzing the determinants of literacy in China (Treiman 2007a) I [...] "the hypothesis that nonmanual w[...] work suppresses it" (146). However, before accepting this conclusion, I had to rule out the possibility that age [...] historical changes in China that produced differences in literacy by cohort. I did this by pointing out that if the quality of education increased (decreased) we would expect an increase (decrease) in literacy for both manual and nonmanual workers rather than the observed [...]. I ruled out the possibility that, as the nonmanual sector grew, the average "quality" of both nonmanual and manual workers declined in a similar way.
Another possibility, available under some circumstances, is to sweep unmeasured potential confounders out of the analysis by estimating fixed or random effects models as we did in the previous chapter. Still another possibility is to adjust for the effect of potential confounders by using one of the methods for coping with endogeneity and sample selection bias discussed earlier in this chapter.
Conduct Sensitivity Analysis
Another way to gain confidence, and to inspire confidence on the part of the reader, is [...] your results are robust [...] forms by which you represent relationships in a general linear model framework, different cutting points when carrying out tabular analysis, and, more generally, [...] ways of representing your concepts. Like consideration of potential omitted-variable bias, this sort of exploration also may require going [...] set being ana[...] the adequacy [...]. See, for example, Treiman and