THE QUARTERLY JOURNAL OF ECONOMICS
Vol. CXXIV, Issue 2, May 2009
CONSPICUOUS CONSUMPTION AND RACE∗
KERWIN KOFI CHARLES, ERIK HURST, AND NIKOLAI ROUSSANOV

Using nationally representative data on consumption, we show that Blacks and Hispanics devote larger shares of their expenditure bundles to visible goods (clothing, jewelry, and cars) than do comparable Whites. These differences exist among virtually all subpopulations, are relatively constant over time, and are economically large. Although racial differences in utility preference parameters might account for a portion of these consumption differences, we emphasize instead a model of status seeking in which conspicuous consumption is used as a costly indicator of a household’s economic position. Using merged data on race- and state-level income, we demonstrate that a key prediction of the status-signaling model—that visible consumption should be declining in reference group income—is strongly borne out in the data for each racial group. Moreover, we show that accounting for differences in reference group income characteristics explains most of the racial difference in visible consumption.
∗ We thank Mark Aguiar, Gary Becker, Matthew Gentzkow, Ed Glaeser, Jonathan Guryan, Daniel Hamermesh, Larry Katz, Kevin Murphy, Andy Postlewaite, Karl Scholz, Jesse Shapiro, Nick Souleles, Francesco Trebbi, and four anonymous referees for very useful comments and conversations. We are particularly indebted to Daniel Hartley for excellent research assistance. The paper has also benefited from comments from seminar participants at the University of Chicago, the IRP Summer Workshop, UCLA, Washington University, the University of Minnesota, Dartmouth College, the NBER Labor Studies Summer Program, the NBER Consumption Group Summer Program, Stanford University, Wharton School, and the St Louis Federal Reserve. Hurst acknowledges support from the University of Chicago Graduate School of Business and the Neubauer Family Faculty Fellowship. Roussanov acknowledges support from the Rodney White Center for Financial Research. We absolve all of them of responsibility for errors or omissions that remain.
© 2009 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, May 2009

I. INTRODUCTION

In his famous study of consumption during the Gilded Age, Veblen (1899) argued that “Consumption is evidence of wealth, and thus becomes honorific, and . . . failure to consume a mark of demerit,” dubbing consumption that aims to demonstrate one’s economic position to observers “conspicuous consumption.”1

In this paper, we study households’ consumption of items that are readily observable in anonymous social interactions and that are portable across those interactions. We refer to these goods as “visible consumption.” Prompted by both Veblen’s insight that the consumption and display of these items communicates information about economic status, and the fact that few easily observable variables are as strongly correlated with economic status as an individual’s race, we investigate a series of questions about visible consumption and race.

A large body of anecdotal evidence suggests that Blacks devote a larger share of their overall expenditures to consumption items that are readily visible to observers than do otherwise similar Whites. Automobiles, clothing, and jewelry are examples of these forms of “visible” consumption. To date, however, there has been little formal economic analysis of the degree to which these racial differences in consumption patterns actually exist in the data, what accounts for them if they do, and what the consequences of any such differential expenditures might be.2 We address these questions in this paper.

The first part of our paper documents differences by race in expenditures devoted to visible consumption items. Using data from the Consumer Expenditure Survey (CEX) from 1986 to 2002, we show that although, unconditionally, racial minorities and Whites spend approximately the same fraction of their expenditures on visible consumption, Blacks and Hispanics spend about 25% more on visible goods after differences in permanent income are accounted for. These expenditure differences are found for all subgroups except older households.
We find that these racial gaps have been relatively constant over the past seventeen years and that neither spending on housing nor differential treatment in the housing market can explain these patterns. Finally, the gaps are economically large: the absolute annual dollar differential for visible consumption is on the order of $1,900, which is a nontrivial quantity given Black and Hispanic average income.

1. In fact, predating Veblen’s analysis by 140 years, Adam Smith argued that the desire for rank, and the display of wealth associated with it, are nearly universal features of human behavior (Smith 1759).
2. One exception is an early piece by Alexis (1970), who examined racial differences in consumption patterns between 1935 and 1960 using data from the Consumer Purchases Survey: 1935–1936 and early waves of the Federal Reserve’s Survey of Consumer Finances. Consistent with the findings we present below, Alexis found that Blacks were much more likely to spend on clothing (as a share of total expenditures) than similar Whites. Outside of economics, there is also limited work on the consumption patterns of Blacks. Examples include Mullins (1999), Lamont and Molnar (2001), and Chambers (2006).

Because of an intertemporal budget constraint, spending devoted to visible consumption must be diverted from some alternative use. We show that the higher visible spending of racial minorities is drawn from both future consumption and all other categories of current consumption, with Blacks consuming less than Whites in essentially every other expenditure category (aside from housing) to maintain higher visible consumption.3

What theoretical explanation accounts for these facts? One argument is that racial differences in expenditures on visible items derive simply from racial differences in preferences—that minorities spend more on jewelry, cars, and apparel because they like these items more than Whites. This argument is consistent with the basic facts but is essentially tautological. Moreover, an argument centered on racial differences in preferences yields no prediction that is falsifiable in the data. An alternative explanation assumes that utility functions are the same across races, but that some feature of the economic environment makes people from different races place different marginal valuations on visible consumption items. Apart from not simply assuming that Blacks behave differently from Whites because they have different preferences, an argument of this form yields additional, empirically testable predictions, beyond the basic facts described above, that should hold within a racial group. Our explanation, an argument of this second type, borrows from the extensive theoretical literature on the demand for social status.
According to the signaling version of this literature, individuals derive utility from status, which depends on others’ beliefs about their income (Ireland 1994; Glazer and Konrad 1996; Bagwell and Bernheim 1996). Although income (or wealth) is not observed, visible consumption is. The level of an individual’s conspicuous consumption can be expected to depend on the income distribution from which his income is drawn—his reference group. In particular, to the extent that visible consumption signals useful information about unobserved income, visible consumption should rise in own income, and, to the extent that being associated with a poorer reference group has negative informational consequences, visible consumption should fall in the income of the reference group.

3. As discussed below, housing may be considered a visible good. In fact, we do find that Blacks and Hispanics spend more on housing than do comparable Whites. Our results (in terms of dollar magnitudes) are slightly stronger if we include housing as a component of visible consumption. However, given the large literature on racial differences in housing (which can explain housing expenditure differences), we err on the side of caution by excluding housing from our base measure of visible goods.

Applying these insights, we argue that a status-signaling model will predict racial differences in visible consumption even if there are no racial differences in preferences. Because Whites and racial minorities belong to reference groups with different income distributions, persons with similar incomes will face different incentives to signal by consuming visibly. Importantly, if status signaling is indeed a determinant of visible consumption, the predicted negative relationship between visible expenditures and average reference group income should hold not only across races but also within any given race, across communities with different average incomes.

To assess empirical support for the status-signaling argument, we combine data about expenditures from the CEX with income data from the Current Population Survey (CPS). We define an individual’s reference group as persons of the individual’s race living in his state. Strikingly, we find that, consistent with the status argument, there is a strong negative association within races between visible spending and the mean income of one’s reference group. That is, analysis performed on a sample of White households finds the same pattern as separate analyses conducted for racial minorities: increases in mean income of one’s own race in the state are associated with reduced visible spending, holding one’s own income constant.
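The reference-group construction described above (the mean income of persons of one's own race in one's own state, computed from CPS data and attached to each CEX household) can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, field names, and income figures are hypothetical, and the plain-Python approach is not necessarily how the authors built their merge.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical CPS-style records: (race, state, household income in 2005 dollars).
cps_records = [
    ("White", "IL", 70000), ("White", "IL", 50000),
    ("Black", "IL", 40000), ("Black", "IL", 36000),
    ("White", "MS", 48000), ("Black", "MS", 30000),
]

def reference_group_means(records):
    """Mean income for each (race, state) cell, i.e., each reference group."""
    cells = defaultdict(list)
    for race, state, income in records:
        cells[(race, state)].append(income)
    return {cell: mean(incomes) for cell, incomes in cells.items()}

# Hypothetical CEX households: (household id, race, state).
cex_households = [(1, "Black", "IL"), (2, "White", "MS"), (3, "Black", "MS")]

ref_means = reference_group_means(cps_records)
# Many-to-one merge: each household receives its own-race, own-state mean income.
merged = [(hh_id, race, state, ref_means[(race, state)])
          for hh_id, race, state in cex_households]
for row in merged:
    print(row)
```

The falsification test mentioned in the text would instead attach another group's cell mean (for example, `ref_means[("White", state)]` for a Black household) and check that it does not carry the same negative association.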
As a falsification test of the status and reference group conjecture, we relate household visible spending to mean incomes of other groups in the state and find either no effect or very modest positive effects. Additionally, we relate household nonvisible spending to reference group average income and find no systematic relationship. The results for average reference group income remain qualitatively the same if we simultaneously control for the dispersion of reference group income, which theory suggests should also affect visible spending, although the predicted sign of that effect is ambiguous.

We next turn to the following question: Do differences in reference group income explain the racial expenditure gaps that are our main focus? In a series of regressions, we show that accounting for the mean (and to a smaller degree the dispersion) of income in a household’s race/state reference group explains most of the racial gap in visible spending. This conclusion is robust to a variety of alternative sample modifications and specification tests. Importantly, it is also robust to the addition of state fixed effects, which account for regional differences across all groups in the propensity to visibly consume.

On the whole, the paper’s results point to an important signaling role for consumption items, apart from their direct consumption value. Although this exhibitionistic motivation has long been discussed in economics, we are aware of very little formal evidence on the question, especially in terms of the racial differences that are our focus.4 Over the past decade, economists and sociologists have provided considerable empirical support for the notion that individuals care about their relative positions in their communities, often using evidence about subjective well-being.5 Our work complements this literature in that we are able to link consumption patterns to social concerns by analyzing economic behavior directly. Perhaps more importantly, our specific focus on racial differences in consumption, and our results about the potential role played by the use and display of visible items, suggest that a deeper understanding of the racial gaps in wealth, savings, and consumption that have long bedeviled economists and others will require further exploration of the issues raised in this paper.

II. DATA

Our primary source of data for studying racial differences in consumption patterns is the 1986–2002 CEX, collected by the United States Department of Labor. The CEX is an ongoing rotating panel dataset, in which participating households are interviewed up to five times at three-month intervals. In any given calendar quarter, there are approximately 5,000 households in the survey, with some households entering the survey and others exiting.
The initial interview collects household demographic information, which is updated during subsequent interviews to reflect any changes in household composition. Information on income during the previous twelve months is collected during the second and fifth interviews. Additionally, the second through fifth interviews each collect detailed household expenditure information for the three calendar months immediately preceding the interview. Like previous users of CEX data, we aggregate to the consumption categories proposed by Harris and Sabelhaus (2000). We use the CEX family-level extracts made available by the National Bureau of Economic Research (NBER).6 Table A.1 lists the fifteen broad consumption categories used in the paper and their relationship to the 47 categories in the Harris and Sabelhaus files. All data are deflated to 2005 dollars using the June CPI-U.

Our primary analysis sample consists of a total of 49,363 households, with heads of household between 18 and 49 years old.7 There are 37,289 White households, 6,766 Black households, and 5,308 Hispanic households. To mitigate the effects of measurement error in the expenditure categories, the unit of analysis is the average quarterly expenditure in a consumption category over the period that the household is in the sample. Descriptive statistics for the sample, by race, are provided in Table I.

4. Notable recent exceptions include Ravina (2007) and Kuhn et al. (2008).
5. Recent examples include Clark and Oswald (1996), McBride (2001), and Luttmer (2005). See also the survey by Kahneman and Krueger (2006) and references therein.
6. Appendix I discusses in detail the NBER CEX family extracts, the details of our sample selection criteria, and the 47 specific expenditure categories included in the Harris and Sabelhaus consumption classification.
7. In some specifications, we explore the robustness of our results by examining the consumption patterns of older households and the sensitivity of our results to excluding younger households.

TABLE I
DESCRIPTIVE STATISTICS OF CEX FULL SAMPLE

                                     All       White     Black     Hispanic
Age                                  35.7      35.9      35.8      34.5
Education < 12                       0.11      0.06      0.14      0.38
Education = 12                       0.30      0.28      0.38      0.29
Education: some college              0.29      0.30      0.31      0.21
Education: college or more           0.30      0.36      0.16      0.11
Married                              0.55      0.58      0.35      0.61
Family size                          2.9       2.8       3.0       3.7
Number of adults                     1.9       1.9       1.8       2.2
Fraction with zero/missing income    0.27      0.26      0.31      0.25
Total family income (if income > 0)  $57,800   $63,800   $38,400   $39,800
Quarterly total expenditure          $10,700   $11,600   $7,700    $8,400
Sample size                          49,363    37,289    6,766     5,308

Notes. Data from the 1986–2002 waves of the Consumer Expenditure Survey (CEX). All expenditures are averaged over all quarters that the household remained in the survey. The sample includes all households where the head is between the ages of 18 and 49 (inclusive) and where the head reported his or her race as being White, Black, or Hispanic over all quarters in the sample. We also restrict the data to households that did not change their state during their sample period and that have nonmissing values for the head’s educational attainment, total household family size, and Census region where the household resides. All amounts are in 2005 dollars. In this table, age and education refer to the household head.
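The unit of analysis described in this section, a household's average quarterly spending in a category over its time in the survey, expressed in 2005 dollars, can be sketched as below. The CPI values, records, and function names are hypothetical placeholders (they are not actual June CPI-U figures or CEX data), and the sketch assumes one record per household, quarter, and category, with zero spending recorded explicitly so that averages are taken over all quarters in the sample.

```python
from collections import defaultdict

# Hypothetical June CPI-U values (placeholders only, NOT official BLS figures).
JUNE_CPI = {2001: 178.0, 2002: 179.9, 2005: 194.5}
BASE_YEAR = 2005

# Hypothetical interview records: (household id, year, category, nominal dollars for the quarter).
quarters = [
    (101, 2001, "visible", 900.0),
    (101, 2001, "visible", 1100.0),
    (101, 2002, "visible", 1000.0),
    (102, 2002, "visible", 400.0),
    (102, 2002, "food", 1500.0),
]

def to_base_dollars(amount, year):
    """Deflate a nominal amount to BASE_YEAR dollars using the (hypothetical) June CPI."""
    return amount * JUNE_CPI[BASE_YEAR] / JUNE_CPI[year]

def household_quarterly_average(records, category):
    """Average real quarterly spending on one category for each household."""
    totals, counts = defaultdict(float), defaultdict(int)
    for hh, year, cat, amount in records:
        if cat == category:
            totals[hh] += to_base_dollars(amount, year)
            counts[hh] += 1
    return {hh: totals[hh] / counts[hh] for hh in totals}

visible_avg = household_quarterly_average(quarters, "visible")
print({hh: round(v, 2) for hh, v in visible_avg.items()})
```

Averaging across quarters before estimation is what smooths out quarter-to-quarter reporting noise in the individual categories.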
Our focus in this paper is on visible consumption expenditures—items for which spending is readily observable and highly portable across a variety of interactions, including anonymous ones. Also, we want to identify goods with the characteristic that individuals who consume more of them are believed to be in better economic circumstances, on average, than individuals who consume less of such goods. Simple introspection suggests what these items are likely to be, but rather than simply asserting what those items are, we conducted an anonymous online survey of 320 students at the University of Chicago’s Harris School and Graduate School of Business. After providing basic demographic information, respondents were asked how close their interaction with someone would have to be in order to ascertain whether that person’s spending on various expenditure categories was above average. The details of our survey and a discussion of its results can be found in the Online Robustness Appendix to this paper posted on the QJE website. Consistent with both common sense and the results of our survey, our analysis treats visible consumption as expenditures on apparel (including accessories such as jewelry), personal care, and vehicles (excluding maintenance). Note that one especially important item is housing. Our survey evidence suggests both that housing is reasonably observable and that it is perceived to have high income elasticity. Our concern is that racial differences in housing expenditures might derive from differential treatment in the housing market—a phenomenon that has been the focus of a large literature.8 Differential treatment in the housing market could, by itself, cause minorities to have very different housing expenditures than Whites, even absent conspicuous or exhibitionistic considerations. 
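The classification above can be made concrete with a small sketch that tags categories as visible and computes a household's visible share of total expenditure, keeping housing out of the base visible measure (as the text argues) but allowing it to be added back. The category labels and dollar amounts are hypothetical, not the actual Harris and Sabelhaus category names, and the sketch assumes housing always remains part of total expenditure.

```python
# Hypothetical category labels. The visible set follows the definition in the text:
# apparel (including jewelry and accessories), personal care, and vehicles excluding maintenance.
VISIBLE = {"apparel", "jewelry", "personal_care", "vehicle_purchases"}

def visible_share(spending, include_housing=False):
    """Visible spending as a share of total spending.

    Housing is excluded from the visible measure unless include_housing=True;
    in this sketch it always stays in the denominator (total expenditure).
    """
    visible_set = (VISIBLE | {"housing"}) if include_housing else VISIBLE
    total = sum(spending.values())
    visible = sum(v for k, v in spending.items() if k in visible_set)
    return visible / total

household = {
    "apparel": 600.0, "jewelry": 100.0, "personal_care": 150.0,
    "vehicle_purchases": 350.0, "vehicle_maintenance": 200.0,
    "food": 2000.0, "housing": 2500.0, "other": 4100.0,
}
print(round(visible_share(household), 3))                        # base measure
print(round(visible_share(household, include_housing=True), 3))  # robustness variant
```

Vehicle maintenance is deliberately left out of the visible set, mirroring the exclusion stated in the text.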
Previewing our later results, we find that minorities spend more on housing than do comparable Whites, implying that if housing expenditures were lumped together with other visible spending, the estimated difference in visible expenditures would be slightly larger. However, given the concerns about differential treatment in the housing market, we adopt the conservative policy of excluding housing from the measure of visible spending in most of our main results. We treat housing separately throughout, except in some robustness specifications in which we assess how the results are affected when housing expenditures are lumped in with overall visible spending.

8. There is evidence that minorities face significantly higher rejection rates for mortgages, which serve to limit their access to owner-occupied housing (see Munnell et al. [1996] and Charles and Hurst [2002]). Moral hazard considerations cause rental prices to exceed the flow cost from owning an otherwise identical unit, so households that rent will pay more for housing services, all else equal, than those who own.

Appendix II summarizes expenditures in our CEX sample on visible and other goods. Overall, visible consumption expenditures compose roughly 12 percent of household total expenditures, whereas spending on food and shelter represents roughly 20 and 25 percent, respectively, of total expenditures. Appendix II also shows that some CEX households spend nothing on some expenditure categories over their time in the survey: whereas nearly all households spend on food, housing, entertainment services, and visible goods, 57 percent of households spend nothing on education, and around 20 percent spend nothing on alcohol and tobacco.9

III. RACIAL DIFFERENCES IN CONSPICUOUS CONSUMPTION

Standard consumption theory suggests that total household expenditures should be related to the household’s permanent income (Modigliani and Brumberg 1954; Friedman 1957): households with lower permanent incomes should consume less, all else equal. Likewise, differences in family size should also affect household consumption. To explore racial differences in visible expenditures in our CEX sample, the regression one would want to estimate is

$$\ln(\text{visible}_i) = \beta_0 + \beta_1\,\text{Black}_i + \beta_2\,\text{Hispanic}_i + \phi(\text{Permanent Income})_i + \theta X_i + \eta_i, \tag{1}$$

where Black_i and Hispanic_i are indicator variables denoting whether a household head is Black or Hispanic, respectively; Permanent Income is the household’s permanent income; and X_i is a vector of controls designed to measure differences in age, family structure, and other demographic variables across households. This vector consists of a quadratic in the age of the household head, household wealth controls, year effects, and indicator variables for the number of adults in the household, the total number of family members in the household, marital status, whether the household head is male, urban residence, MSA residence, and Census region.10

9. One thing to note from Appendix II is that the share of visible expenditures out of total expenditures is constant across races at 12%. These statistics do not imply that the consumption of visible goods is constant across races. The reason is that visible goods are luxuries (i.e., estimated slopes of within-race Engel curves are much greater than one). Given that Whites, on average, are much richer than Blacks and Hispanics, the Engel curves would predict that Whites should allocate a much bigger share of their expenditures to visible goods. In the section that follows, we estimate all differences in visible spending by race conditioning on household income.

To estimate (1), one needs a good measure of household permanent income. The CEX asked households to report their various sources of income as they entered the survey. Many authors have shown that the CEX income data are of poor quality—something we find as well. As Table I shows, total family income, defined to include labor, asset, and transfer income, is zero or missing for 27% of the sample. The CEX does not attempt to impute the missing income data. More importantly, Table I also shows that for those reporting positive income, White households have 67% higher total income than Black households and 61% higher total income than Hispanic households. These numbers are not consistent with those from other micro data sources designed to measure labor income. For example, using data from the Current Population Survey (CPS) for a similar time period and making similar sample restrictions, the comparable racial differences in total family income are 51% and 37%, respectively. Because the CEX’s income measures are not of especially high quality—particularly along racial dimensions—they are unlikely to accurately reflect the racial differences in household permanent income needed for estimation of (1).

Theory suggests a solution to the problem of poor quality CEX income data. The Permanent Income Hypothesis implies that total expenditure is an especially good proxy for a household’s permanent income. Fortunately, CEX expenditure data are of much higher quality than CEX income data. The racial differences in total expenditures from the CEX line up nearly exactly with the racial differences in total family income from the CPS. Specifically, as seen in Table I, Whites consume 50% more and 38% more than Blacks and Hispanics, respectively. However, proxying for permanent income with the log of total expenditures in (1) raises two problems.
First, because expenditure components are jointly determined in models of lifecycle consumption, total expenditures are endogenous in an equation for any particular component of expenditures such as visible expenditures. Second, there is the purely statistical concern that measurement error in the components of consumption will be related to measurement error in total expenditures.

10. For household wealth, we use the log of liquid assets if liquid assets are positive and a dummy for whether the household has positive liquid assets as controls. Liquid assets are defined as checking accounts, savings accounts, stock holdings, and bond holdings.
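The income and expenditure gaps quoted above can be checked against the rounded entries of Table I. Because the table reports rounded dollar amounts, the ratios computed here land within about a percentage point of the figures in the text rather than matching them exactly.

```python
# Rounded means from Table I (2005 dollars).
income = {"White": 63_800, "Black": 38_400, "Hispanic": 39_800}      # total family income, if > 0
expenditure = {"White": 11_600, "Black": 7_700, "Hispanic": 8_400}   # quarterly total expenditure

def white_premium(table, group):
    """Percent by which the White mean exceeds the given group's mean."""
    return 100 * (table["White"] / table[group] - 1)

for group in ("Black", "Hispanic"):
    print(group,
          f"income gap {white_premium(income, group):.1f}%",
          f"expenditure gap {white_premium(expenditure, group):.1f}%")
```

The pattern the text relies on shows up directly: the CEX expenditure gaps (about 51% and 38%) sit close to the CPS income gaps of 51% and 37%, whereas the CEX income gaps (about 66% and 60%) do not.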
TABLE II
ESTIMATED BLACK–WHITE GAP IN LOG VISIBLE EXPENDITURES WITH AND WITHOUT INCOME, EXPENDITURE, AND DEMOGRAPHIC CONTROLS

Regression controls included                                  Black coefficient   Hispanic coefficient
(1) No additional controls                                    −0.38 (0.04)        −0.23 (0.04)
(2) Specification (1) plus income controls                    −0.03 (0.03)         0.14 (0.04)
(3) Specification (1) plus log total expenditure               0.31 (0.03)         0.26 (0.06)
(4) IV regression where log total expenditure is instrumented
    with income controls                                       0.23 (0.03)         0.20 (0.05)
(5) Specification (4) plus time dummies                        0.24 (0.03)         0.21 (0.05)
(6) Specification (5) plus demographic and wealth controls     0.26 (0.02)         0.23 (0.05)

Notes. See the note to Table I for sample description and relevant sample sizes. The table reports the coefficient on the race dummies from a regression of the log of household visible consumption on race dummies and other controls. Specification (2) includes the log of current household income, if income is positive, a cubic in the level of current household income, and a dummy for whether current household income is positive, as well as dummies for the education level (four categories), occupation (1-digit), and industry (1-digit) of the household head. Specification (3) reestimates specification (1) including the log of total household expenditures as an additional regressor. Specification (4) is an IV regression where log total expenditures is instrumented with the income controls (added in specification (2)). Specification (5) is the same as specification (4) but also includes year dummies. Specification (6) is the same as specification (5) but also includes a quadratic in the age of the household head, a dummy if the household head is male, a married dummy, Census region dummies, a dummy if the household lived in an MSA, an urban dummy, wealth controls, and a series of separate dummies for the number of adults and children in the household. See Section III for a full description of these regressions. Robust standard errors (clustered at the state level) are reported in parentheses.
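Specification (4) replaces log total expenditure with its projection on the income instruments. The sketch below illustrates that two-stage least squares logic on simulated data; it is not the authors' code. The data-generating process, the instrument names (`ln_income`, `educ`), and all parameter values are hypothetical (the true elasticity of 1.5 merely echoes the estimate reported later in this section), and a hand-rolled normal-equations solver keeps it to the Python standard library. It reports point estimates only: standard errors taken naively from the second stage would be wrong, and a real application would use an econometrics package with clustered standard errors.

```python
import random

def solve(a, b):
    """Solve the small dense system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """OLS coefficients from the normal equations X'X b = X'y (X is a list of rows)."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

def two_stage_least_squares(y, exog, endog, instruments):
    """2SLS point estimates; the last coefficient is on the endogenous regressor."""
    Z = [e + z for e, z in zip(exog, instruments)]            # exogenous regressors + instruments
    pi = ols(Z, endog)                                         # first stage
    fitted = [sum(p * v for p, v in zip(pi, row)) for row in Z]
    X_hat = [e + [f] for e, f in zip(exog, fitted)]            # second stage uses fitted values
    return ols(X_hat, y)

# Hypothetical DGP: shock u moves both total and visible spending, so OLS on
# log total expenditure is biased upward; the instruments are independent of u.
rng = random.Random(2009)
TRUE_BLACK, TRUE_PHI = 0.25, 1.5
y, exog, endog, instruments = [], [], [], []
for _ in range(5000):
    black = 1.0 if rng.random() < 0.2 else 0.0
    ln_income, educ = rng.gauss(0, 1), rng.gauss(0, 1)
    u = rng.gauss(0, 0.5)
    log_total = 9.0 - 0.4 * black + 0.8 * ln_income + 0.3 * educ + u
    log_visible = -7.0 + TRUE_BLACK * black + TRUE_PHI * log_total + 0.8 * u + rng.gauss(0, 0.3)
    y.append(log_visible)
    exog.append([1.0, black])
    endog.append(log_total)
    instruments.append([ln_income, educ])

ols_coefs = ols([e + [x] for e, x in zip(exog, endog)], y)
iv_coefs = two_stage_least_squares(y, exog, endog, instruments)
print("OLS  phi = %.3f, Black = %.3f" % (ols_coefs[2], ols_coefs[1]))
print("2SLS phi = %.3f, Black = %.3f" % (iv_coefs[2], iv_coefs[1]))
```

With this simulated endogeneity, the OLS elasticity should come out noticeably above 1.5, while the 2SLS elasticity and Black coefficient should land close to the true values of 1.5 and 0.25.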
Given these problems, in our CEX sample we estimate

$$\ln(\text{visible}_i) = \beta_0 + \beta_1\,\text{Black}_i + \beta_2\,\text{Hispanic}_i + \phi\,\ln(\text{Total Expenditure})_i + \theta X_i + \eta_i \tag{2}$$

and instrument for the log of total expenditures using the vector of current and permanent income controls, Income_i. This vector consists of an indicator variable for whether current income is nonmissing, the log of current income if nonmissing, a cubic in the level of current income, three indicator variables for education, and a series of one-digit industry and occupation codes. Reassuringly, our CEX results are very robust to alternative instrument sets in (2), and in each case the F-statistics on the instrument set are so large as to render irrelevant any “weak-instrument” concerns.

Table II shows the results of our estimation. When we estimate (2) with only the race dummies and no other controls, Blacks and Hispanics are found to spend less on visible items than Whites, by 38% and 24%, respectively (row (1)). These results simply reflect the unconditional means of visible expenditures, by race, reported in the first row of Appendix II. As we show below, spending on visible goods increases with income, and Blacks and Hispanics have much lower incomes than do Whites.

The regressions in rows (2)–(4) of Table II control for permanent income in various ways. The specification in row (2) simply adds the vector Income_i. As expected, the addition of these income controls (whose limitations as a measure of permanent income we have already discussed) raises both the Black and the Hispanic coefficients relative to those in row (1). In row (3), we add the log of total expenditures rather than Income_i. Once this arguably better proxy for permanent income is added to the regression, we find that Blacks and Hispanics consume 31% and 26% more visible goods than Whites with similar permanent income. Next, given the concerns outlined above about using total expenditures as a control in a regression for a specific component of expenditures, we instrument the log of total expenditures with the vector Income_i in row (4). The results in row (4) are similar to those in row (3). Specifically, we find that Blacks and Hispanics spend 22% and 19% more, respectively, on visible goods than White households with similar permanent income.

Adding a full set of time and demographic controls in rows (5) and (6) does not appreciably change the estimated racial differences in visible spending. In our preferred estimate (row (6) of Table II), Blacks and Hispanics spend 26% and 23% more, respectively, on visible goods than do otherwise similar Whites.11 Although, to conserve space, we do not report point estimates for the nonrace coefficients, two results are worth noting.
First, the propensity to purchase visible goods declines sharply with age for all races. Second, we find that visible goods are luxury goods. Specifically, the estimated coefficient on the log of total expenditures from the regression shown in row (6) of Table II is 1.5 (standard error = 0.03), implying that a 1% increase in total expenditures is associated with a 1.5% increase in visible expenditures. The luxury property of visible goods suggests why it is essential to control for permanent income when measuring racial differences in visible goods expenditures. It also explains why there is no unconditional racial difference in the share of spending devoted to visible goods: Blacks spend more than Whites on visible goods at every level of permanent income, but in unconditional comparisons this is obscured by the fact that Whites, with their higher incomes, consume more of these luxury goods.

11. All of our results are robust to controlling for nonlinear measures of total expenditures. For example, if we reestimate the specification shown in row (6) of Table II, including both the log of total expenditures and the square of the log of total expenditures, the estimated coefficients on the Black and Hispanic indicator variables remain essentially unchanged at 0.27 and 0.24, respectively.

The racial difference in visible expenditures is large in absolute dollars. Appendix II shows that, on average, Whites spend about $7,160 on visible items per year. The finding that Blacks and Hispanics spend 26% more than comparable Whites on visible goods therefore implies that Blacks and Hispanics spend, on average, roughly $1,900 per year more on visible goods than their White counterparts. Because the CEX underreports total household consumption relative to data from the National Income and Product Accounts, this estimate is likely a lower bound. To put these magnitudes in perspective, data from the March CPS show that, for the 1990–2002 period, Black and Hispanic households had average incomes of $42,500 and $48,300, respectively, in 2005 dollars. Outlays on visible goods thus represent a substantial fraction of the overall budget of minorities: the differential alone amounts to roughly 4 to 4.5 percent of average Black and Hispanic household income.

Figure I plots the estimated nonlinear visible-expenditure Engel curves for Blacks and Whites separately. To generate the Engel curves, we regress log visible expenditures on log total expenditures and log total expenditures squared separately for Blacks and Whites. As above, we instrument log total expenditures and log total expenditures squared with the vector Income. The figure shows that for both Blacks and Whites, visible goods are, on average, luxury goods.
Also, at every level of log total expenditures, Blacks spend more on visible goods than their White counterparts.¹² Notice further that the two Engel curves are parallel over most of the total expenditures range, mitigating concerns that the main results derive in some way from a fundamental difference in the shapes of these relationships across races.

12. One may ask whether there are differences in “price” effects that cause Blacks to spend more on visible goods than comparable Whites. For example, if Blacks were discriminated against in the market for visible goods, Blacks with a given income would pay more for those items than comparable Whites. General discrimination cannot explain the results in Table II, which control for total expenditures directly. Our results should therefore be read as raising the question of why Blacks and Hispanics allocate a greater share of their expenditures to visible goods. There is no evidence that, relative to other goods, Blacks and Hispanics pay higher prices for clothing, jewelry, and personal care items than similar Whites.
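The magnitudes quoted above follow from simple arithmetic on the reported figures. The Python sketch below reproduces them. It treats the 26% gap as a proportional difference, as the text does, and, as an added illustration rather than a figure from the paper, also shows the exact proportional gap if 0.26 is read as a coefficient in a log-outcome regression.

```python
import math

ELASTICITY = 1.5        # coefficient on log total expenditures, row (6) of Table II
WHITE_VISIBLE = 7_160   # average annual White visible spending in dollars (Appendix II)
GAP = 0.26              # conditional Black/Hispanic-White gap, read as a proportion


def visible_change(total_change_pct: float, elasticity: float = ELASTICITY) -> float:
    """Percent change in visible spending for a given percent change in total spending."""
    return elasticity * total_change_pct


def dollar_gap(base: float = WHITE_VISIBLE, gap: float = GAP) -> float:
    """Extra annual visible spending implied by a proportional gap over a dollar base."""
    return base * gap


def log_points_to_proportion(coef: float) -> float:
    """Exact proportional difference implied by a coefficient in a log-outcome regression."""
    return math.exp(coef) - 1.0


print(visible_change(1.0))                       # 1.5 (percent)
print(round(dollar_gap()))                       # 1862, i.e. roughly $1,900
print(round(log_points_to_proportion(GAP), 3))   # 0.297
```

For coefficients of this size the two readings differ by a few percentage points, which does not change the qualitative conclusion.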
[Figure I: Engel curves for Black and White households. Vertical axis: log quarterly visible expenditures; horizontal axis: log quarterly total expenditure.]

FIGURE I Estimates of Nonlinear Visible Good Engel Curves: Estimated Separately for Blacks and Whites. The figure shows separate Engel curve estimates of log visible expenditures on a quadratic expression in log total expenditures for Blacks (solid line) and Whites (dotted line) using data from the CEX. Log total expenditures and log total expenditures squared are instrumented using the same vector Income described in the notes to Table II. The regressions are estimated over a similar range populated by both Black and White households: households with quarterly total expenditure greater than $1,300 and less than $26,200 (in 2005 dollars). These total expenditure cutoffs are approximately the first percentile of the White quarterly total expenditure distribution and the ninety-ninth percentile of the Black quarterly total expenditure distribution, respectively.
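The Engel curves in Figure I come from regressing log visible expenditures on a quadratic in log total expenditures, with both terms instrumented by income measures. The sketch below is a minimal two-stage least squares (2SLS) implementation with NumPy on simulated data. The data-generating process, the use of a single income variable and its square as the only instruments, and all parameter values are hypothetical stand-ins for the CEX and the paper’s vector Income. It illustrates why instrumenting matters: a transitory shock that raises both total and visible spending biases the OLS slope upward, while 2SLS recovers the assumed coefficients.

```python
import numpy as np


def two_stage_least_squares(y, X, Z):
    """Textbook 2SLS: project X on Z, then regress y on the projection.

    y: (n,) outcome; X: (n, k) regressors incl. constant; Z: (n, m) instruments incl. constant, m >= k.
    """
    # First stage: fitted values of every regressor from the instrument set.
    gamma, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ gamma
    # Second stage: OLS of y on the fitted regressors gives the 2SLS coefficients.
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return beta


def quadratic_engel_curve(log_visible, log_total, log_income):
    """Log visible spending on a quadratic in log total spending, instrumented by income."""
    ones = np.ones_like(log_total)
    X = np.column_stack([ones, log_total, log_total**2])
    Z = np.column_stack([ones, log_income, log_income**2])
    return two_stage_least_squares(log_visible, X, Z)


# Simulated data (hypothetical parameters, not CEX estimates).
rng = np.random.default_rng(0)
n = 50_000
log_income = rng.normal(0.0, 1.0, n)          # instrument, centred
transitory = rng.normal(0.0, 0.3, n)          # shock hitting both total and visible spending
log_total = 0.5 * log_income + transitory     # centred log total expenditure
error = 0.5 * transitory + rng.normal(0.0, 0.2, n)
log_visible = 0.2 + 1.5 * log_total + 0.1 * log_total**2 + error

iv = quadratic_engel_curve(log_visible, log_total, log_income)
ols, *_ = np.linalg.lstsq(
    np.column_stack([np.ones(n), log_total, log_total**2]), log_visible, rcond=None
)
print(iv.round(2))    # close to the assumed [0.2, 1.5, 0.1]
print(ols.round(2))   # slope on log_total biased upward by the shared shock
```

Running the quadratic separately by race, as in Figure I, would simply mean calling `quadratic_engel_curve` on each subsample.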
The finding that racial minorities exhibit a greater propensity to consume visible goods is robust to a variety of alternative specifications and restrictions, including restricting the sample to households with positive current income, excluding households with less than $23,200 a year in total expenditures (the twenty-fifth percentile of the expenditure distribution), excluding households under the age of 24, varying the specific components of the instrument set $\text{Income}_i$, including log expenditures on housing as an additional control, and restricting the sample to include only those who completed all four CEX surveys. Additionally, the racial differences in visible spending are found in all subgroups in our sample. For example, single Black men, single Black women, and married Black households consume 32%, 28%, and 22% more, respectively, than their White counterparts. The racial difference in visible spending is statistically larger among single men than among married households, where the gap is nonetheless substantial. Similar patterns are found among Hispanics. We find racial differences in visible spending within all education groups, and the gap for those with only a high school education (−0.30) is not statistically different from the gap for those with at least a college degree (−0.23). The racial visible spending difference does diminish sharply with age. Among households aged 18–34 the Black–White conditional gap in visible spending is 30%; it declines to 23% for households aged 35–49 and further to only 15% for households aged 50–69.¹³

Table III presents estimated racial differences for the separate components of visible consumption in the CEX, namely, vehicles, clothing and jewelry, and personal care. Panel A presents results for the full sample, whereas Panel B presents results for the sample of households that own vehicles. In both samples, Blacks and Hispanics spend significantly more than comparable Whites on both personal care and clothing and jewelry. For vehicle spending the results are more nuanced. In the overall sample, both Blacks and Hispanics spend less on cars than do Whites. Among vehicle owners, however, Blacks in the CEX spend around 12% more on vehicles than comparable Whites. The fact that Blacks and Hispanics, all else equal, are less likely to own vehicles explains why the racial difference in vehicle spending is not found for the full sample. The lower vehicle ownership among Blacks and Hispanics likely results from two factors: Blacks and Hispanics are more likely to live in city centers and, as a result, have lower vehicle needs; and liquidity constraints may prevent them from making a sufficient down payment to purchase a vehicle.

If minority households spend more on visible goods than White households with the same permanent income and demographics, on what expenditures are they spending less?
The intertemporal budget constraint implies that the observed higher spending on conspicuous goods must come from another component of current consumption and/or from future consumption (i.e., current savings). Table IV looks at the conditional differences in spending on other consumption categories. Along with visible consumption, these consumption categories make up the universe of consumption expenditures in the CEX and are described in Table A.1. The coefficients in Table IV come from a regression identical to that reported in row (6) of Table II, except that the dependent variable is now the log of the particular consumption category, and Tobit models are estimated for categories with a high incidence of zero expenditure. The first striking fact from Table IV is that there is no evidence that Blacks and Hispanics allocate a higher percentage of their spending than Whites to any consumption category other than visible goods and housing. In fact, aside from utilities, Blacks spend less than similar Whites on all other consumption categories. Some of the differences are small, such as the difference between Blacks and Whites in food expenditures. However, Blacks spend 16% less on education, approximately 29% less on entertainment services, and 50% less on health. Similar patterns emerge for Hispanics.

13. The Online Robustness Appendix presents results for various alternative specifications and subsamples.

TABLE III
RACIAL DIFFERENCES IN LOG SPENDING ON SPECIFIC VISIBLE ITEMS, CONTROLLING FOR INCOME, EXPENDITURE, AND DEMOGRAPHIC CONTROLS

Visible consumption subcategory:   Clothing/jewelry   Personal care   Cars (limited)   Cars (including maintenance)

A. Full sample
  Black dummy                      0.38 (0.03)        0.73 (0.05)     −0.43 (0.07)     −0.46 (0.10)
  Hispanic dummy                   0.41 (0.04)        0.43 (0.03)     −0.29 (0.10)     −0.34 (0.17)
B. Positive car spending
  Black dummy                      0.36 (0.04)        0.81 (0.06)      0.12 (0.04)      0.09 (0.03)
  Hispanic dummy                   0.37 (0.02)        0.42 (0.05)      0.09 (0.06)      0.04 (0.05)

Notes. For Panel A, the sample and specification are the same as those used in row (6) of Table II except that the dependent variable is a subcomponent of visible consumption (cars, clothing and jewelry, or personal care). The sample for Panel B is the same as the sample for Panel A except for the further restriction that the household must report owning at least one automobile (sample size = 11,900 households). The limited measure of car spending includes only initial outlays for new or used cars. The expanded car spending measure includes the initial outlays plus expenditures on car services and the principal component of the vehicle loan payment. See Appendix I for more details on the two car measures. Our primary measure of visible consumption includes only the limited measure of car spending. Robust standard errors (clustered at the state level) are in parentheses.

TABLE IV
DIFFERENCES IN LOG EXPENDITURES BY CATEGORY AMONG BLACKS, HISPANICS, AND WHITES (IV REGRESSIONS)

Log expenditure category       Black coefficient    Hispanic coefficient
Housing                         0.03 (0.02)          0.13 (0.03)
Utilities                       0.09 (0.03)         −0.02 (0.02)
Food                           −0.06 (0.02)          0.06 (0.02)
Other transportation           −0.15 (0.03)         −0.02 (0.04)
Entertainment services         −0.29 (0.03)         −0.36 (0.05)
Home furnishingsᵃ              −0.18 (0.04)          0.09 (0.05)
Educationᵃ                     −0.16 (0.10)         −0.30 (0.12)
Entertainment durablesᵃ        −0.35 (0.05)         −0.17 (0.05)
Healthᵃ                        −0.51 (0.05)         −0.48 (0.06)
Alcohol and tobaccoᵃ           −1.04 (0.05)         −1.04 (0.05)

Notes. The sample and specification are the same as those used in row (6) of Table II except that the dependent variable is the log of each of the other consumption categories. These consumption categories are defined in Table A.1. For the categories marked with a superscript a, a nontrivial fraction of respondents report zero spending in a given year (see Appendix II). For these categories, we estimate the specification using a Tobit and report the corresponding unconditional marginal effects. Robust standard errors (clustered at the state level) are in parentheses.
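The Table IV notes report unconditional marginal effects from Tobit models. As a sketch, assume the standard Tobit in which the observed outcome is censored from below at zero, y = max(0, x′β + ε) with ε ~ N(0, σ²); the text does not spell out the censoring point or how zeros enter the log specification, so that is an assumption. The unconditional marginal effect of regressor j is then β_j Φ(x′β/σ), which the Python code below computes and checks against a numerical derivative of E[y | x]. The parameter values are hypothetical, and the derivative formula treats the regressor as continuous; for a 0/1 race dummy, the discrete difference in E[y | x] is an alternative.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def tobit_expected_value(xb: float, sigma: float) -> float:
    """E[y | x] when y = max(0, xb + e) and e ~ N(0, sigma^2)."""
    z = xb / sigma
    return normal_cdf(z) * xb + sigma * normal_pdf(z)


def tobit_unconditional_marginal_effect(beta_j: float, xb: float, sigma: float) -> float:
    """dE[y | x]/dx_j for a Tobit censored from below at zero: beta_j * Phi(xb / sigma)."""
    return beta_j * normal_cdf(xb / sigma)


# Hypothetical values: latent coefficient, linear index x'beta, and error scale.
beta_j, index, sigma = -0.40, 0.8, 1.2
me = tobit_unconditional_marginal_effect(beta_j, index, sigma)

# Central-difference check: shifting x_j by h moves the index by beta_j * h.
h = 1e-6
numeric = (tobit_expected_value(index + beta_j * h, sigma)
           - tobit_expected_value(index - beta_j * h, sigma)) / (2 * h)
print(round(me, 4), round(numeric, 4))
```

Because Φ lies between 0 and 1, the reported unconditional effect is always smaller in magnitude than the latent coefficient β_j.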
Both Blacks and Hispanics spend slightly more on housing and utilities than their White counterparts, while at the same time spending much less on home furnishings. As we have noted, housing may itself be a visible good, which would explain why it is associated with expenditure patterns similar to those for jewelry, clothing, and vehicles. However, as discussed above, there may also be discrimination against racial minorities in the housing market. To provide conservative estimates of conspicuous spending differences, we exclude housing from our measure of visible goods.

To confirm the patterns depicted above for racial consumption differences, we also estimate a variety of models using data from the Panel Study of Income Dynamics (PSID). This exercise is important partly to establish whether our main results are found in another nationally representative data source with information on consumption. Moreover, as noted previously, although the CEX is the primary source of data on consumption expenditures in the United States, and thus serves as our main data source, it is not designed to measure household income. By contrast, the PSID provides excellent measures of household income over multiple years, so it is possible to control carefully for permanent income in our regressions. The limitation of the PSID is that, until recently, it contained only limited measures of household consumption. Starting in 2005, the survey added an expanded set of expenditure questions, including some questions about the visible items we study. Currently, these measures are available only for the 2005 wave. Using data from the 2005 PSID, we can examine racial differences in consumption patterns for this limited set of categories using a different measure of permanent income. These estimates can then be compared with those from the CEX, where permanent income is proxied by total expenditures.
We restrict the 2005 wave of the PSID to meet the same age and other restrictions used for the CEX sample.¹⁴ We estimate versions of (1) using the log of clothing expenditures as the dependent variable. Our proxy for the household’s permanent income is the average of total annual family income between 1999 and 2005 for the years that the household was in the sample. Table V presents the results for the measures available in the PSID. Row (1) presents the estimated racial difference in clothing expenditures when no controls are added to regression (1). As in the CEX, lower overall income among Blacks means that they tend to spend less on clothing than do Whites, on average. The specification in row (2) controls for permanent income and for the full set of demographic controls used in earlier regressions. The results for clothing are similar to the preferred CEX estimates: Blacks in the PSID spend 24% more on clothing than do comparable Whites. Row (3) presents results for the price of recent car purchases, the only other visible spending we can sharply identify in the 2005 PSID data. The estimate suggests that Blacks bought cars that were 12% more expensive than those similar Whites bought. Perhaps because of the small sample size, the effect is not statistically significant at conventional levels, but it is reassuringly similar to the corresponding estimate from the CEX data. Similarly reassuring are the other PSID estimates in the table, which indicate that, as in the CEX, Blacks in the PSID spend less than similar Whites on food (row (4)), entertainment (row (5)), and other transportation (row (6)).

14. Full sample selection and other details about the PSID sample are provided in the Online Robustness Appendix. We also discuss additional visible expenditure results from the PSID data beyond the estimates of cross-race differences given here, and present results about the distribution of retail establishments by the racial makeup of the ZIP code, using data from County Business Patterns. This evidence is only suggestive, but it does show a higher incidence of businesses devoted to selling visible items such as clothing in ZIP codes with greater numbers of racial minorities.

TABLE V
DIFFERENCES IN LOG EXPENDITURES BY CATEGORY ACROSS BLACKS AND WHITES, PSID DATA

Log expenditure category                                 Coefficient on Black dummy   Sample size
Visible spending
  (1) Clothing expenditures, no additional controls      −0.07 (0.07)                 3,928
  (2) Clothing expenditures, full controls                0.24 (0.07)                 3,928
  (3) Price of recent car purchase, full controls         0.12 (0.09)                 1,882
Other spending categories
  (4) Food expenditures, full controls                   −0.12 (0.03)                 4,167
  (5) Entertainment, full controls                       −0.33 (0.08)                 3,724
  (6) Other transportation, full controls                −0.09 (0.06)                 3,708

Notes. The sample includes all households in the 2005 wave of the Panel Study of Income Dynamics where the head is between the ages of 25 and 49 (inclusive) and is either Black or White. The table displays the coefficient on the Black dummy from a regression of log spending for different consumption categories on a race dummy only (specification in row (1)) and on a race dummy, the log of household permanent income, a cubic in the age of the household head, a dummy for the sex of the household head, a marital status dummy, and a vector of family size, number of children, and region dummies (specifications in rows (2)–(6)). For our measure of permanent income, we average total household annual family income between 1999 and 2005 for the years that the household was in the sample. See Section III for complete details. Sample sizes differ across the specifications because we restricted each specification to include only households with positive spending on the given category. Nearly all households in the sample reported some spending on food and clothing, and nearly all consumed some form of entertainment and other transportation. Roughly 50% of the sample purchased a vehicle during the prior three years. All data are weighted using the PSID core family weights. Robust standard errors (clustered at the state level) are in parentheses.

The fact that the PSID estimates, which control directly for permanent income using high-quality panel data on income, correspond well with our preferred CEX estimates suggests that the approach of using total expenditures as a proxy for permanent income, and then instrumenting it using available income measures, captures variation in permanent income quite well. Indeed, as seen in Figure II, the distributions, by race, of total expenditures from the CEX are remarkably similar to the distributions of household permanent income, by race, in the PSID.

In summary, we find that Blacks and Hispanics spend roughly 30 percent more on visible expenditures (cars, clothing, jewelry, and personal care items) than do otherwise similar Whites. These patterns are similar across all subgroups of the population (with the notable exception that the differential racial propensity to consume visibly declines sharply with age), across the two nationally representative surveys in which this can be studied, and with different methods of controlling for household permanent income. Strikingly, while minority households consume more visible goods than comparable Whites, they consume less than or the same amount as Whites in all other consumption categories aside from housing.

IV. STATUS AND CONSPICUOUS CONSUMPTION

What explains these differences in visible spending? Racial differences across dimensions as diverse as cuisine, music, and popular entertainment suggest that the consumption patterns above could derive, in part, from differences in tastes.
We eschew this essentially tautological explanation, however, and investigate instead whether racial consumption differences can be reconciled within a framework in which no racial preference differences are assumed. We draw on insights from the literature spawned by the seminal work of Veblen (1899) and Smith (1759), which centers on the idea that individuals care about their status, that is, the economic position that others ascribe to them. In this framework, conspicuous consumption is a form of signaling in the sense demonstrated by Spence (1973). We briefly outline a signaling model of visible consumption and discuss its testable implications.¹⁵

15. In other formulations, a person’s utility from visible goods is determined by how personal consumption of the good compares to the average consumption or income in some reference group. See Duesenberry (1949) for an important early treatment. The NBER working paper version of our paper outlines a model of this form. The main predictions discussed in this section about how the mean income of one’s reference group affects visible spending can also be derived within this alternative class of models.

[Figure II, Panel A: kernel density of total annual expenditure (CEX), White and Black. Panel B: kernel density of average family income (PSID), White and Black. Vertical axes: density.]

FIGURE II Kernel Density of Black and White Annual Family Income and Expenditures. The figures show the kernel density of total annual expenditures from the CEX (Panel A) and average family income from the PSID (Panel B) separately for Blacks and Whites. The samples used for the kernel estimation are the same samples as described in the notes to Table I (for the CEX) and Table V (for the PSID). Likewise, the measures of total expenditures and total family income are described in Section III.

Consider an economy in which individuals belonging to group $k$ have incomes $y_i^k$ drawn from a known distribution with density $f_k(y)$ and support on the interval $[y^k_{\min}, y^k_{\max}]$. Income is not publicly observed and is used to finance consumption of two goods: $c$, which is observed by outsiders, and $(y - c)$, which is not. Each agent has the same utility, given by

$$ v\left(y_i^k - c_i^k\right) + u\left(c_i^k\right) + w\left(s_i^k\right), \qquad (3) $$

where $u$, $v$, and $w$ are each concave and twice continuously differentiable. In (3), status, $s_i^k$, reflects society’s inference about $i$’s income based on what is observed about $i$. It follows that $s_i^k = E[y_i^k \mid c_i^{k*}, k]$, where $c_i^{k*}$ is $i$’s equilibrium visible consumption and $k$ is his group. In the separating equilibrium of this model, each agent chooses consumption to maximize (3) subject to his budget set, and society’s beliefs about income are correct for each individual; that is, $s_i(c_i^{k*}(y_i^k)) = y_i^k$.

Recent theoretical work studies models of this form, formally characterizing the equilibrium and key comparative statics.¹⁶ We summarize these results and provide some intuition for them. Equilibrium spending on conspicuous goods, $c_i^{k*}$, is strictly increasing in $y_i$. The relationship is concave if utility from status is sufficiently more concave than the two other components of utility; otherwise, visible spending rises with income in a convex fashion. Importantly, because the income of the poorest person in a group is correctly assessed, in equilibrium this person has no incentive to consume more of the visible good than if there were no signaling motive whatsoever.

What does the theory say about the relationship between $c_i^{k*}$ and changes (or differences) in the income distribution of a group in the perfectly revealing equilibrium? There are two results. The first is that as the dispersion of a group’s income distribution increases, the effect on average conspicuous spending in the group is theoretically ambiguous. The intuition for this ambiguity is as follows. Suppose that income is transferred from a person A to a richer person B, so that group income dispersion increases. Because $c_i^*$ is strictly increasing in $y_i$, conspicuous spending will decrease for A and increase for B. However, because the relationship between $c_i^*$ and $y_i$ may be either concave or convex, the relative magnitude (absolute value) of the decrease in visible spending for A and the increase in visible spending for B is ambiguous.¹⁷

16. See Mailath (1987), Ireland (1994), and especially Glazer and Konrad (1996) for formal treatments of models of this form. Our framework borrows most from the work of Glazer and Konrad (1996), who study the signaling value of observable charitable donations rather than consumption. Otherwise, our framework is virtually identical to theirs. See their paper for a formal derivation of the predictions discussed here.

The other result about the distribution of group income is unambiguous: if poorer persons are added to a group, so that the support of the group’s income distribution becomes $[y^k_{\min} - \theta, y^k_{\max}]$ with $\theta > 0$, and average group income falls, then conspicuous spending rises at every level of income. The intuition is that as poorer people are added to a population, persons at every level of income must now signal more to distinguish themselves from those immediately poorer, because those people are themselves now compelled to spend more to distinguish themselves from persons who are poorer still.

The framework outlined above is quite general. Depending on the situation, different types of expenditures may be visible to observers. More importantly, the reference groups $k$ can, in principle, represent any grouping into which a population can be sorted. Depending on the situation, observers will know more or less about the distribution from which other individuals’ unobserved income is drawn. In other words, the particular reference group $k$ that is used to draw inferences about individual income will vary from one context to another. The key prediction is that information about one’s reference group influences observers’ inferences about one’s income and thus interacts with the optimal choice of signaling expenditures. The patterns in Figure II, showing that Blacks have a much lower permanent income, on average, than Whites, suggest that the higher relative visible spending of Blacks is consistent with the main prediction of a status model if race is the only exogenous observable characteristic that helps one infer an individual’s socioeconomic position.
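The unambiguous comparative static (adding poorer persons raises conspicuous spending at every income) can be illustrated numerically. The Python sketch below assumes hypothetical log utilities, u(c) = a·log c, v(y − c) = log(y − c), and w(s) = b·log s, which are not the paper’s functional forms. It relies on the standard local incentive-compatibility condition for separating signaling equilibria: an agent with income y must not gain by choosing the visible spending of a slightly different income, which gives c′(y) = w′(y) / (v′(y − c) − u′(c)), with the poorest type at its no-signaling optimum. Because that slope is infinite at the poorest type, the code integrates the inverse, dy/dc, with simple Euler steps.

```python
def equilibrium_schedule(y_min, y_max, a=1.0, b=1.0, step=1e-4):
    """Trace the separating schedule c*(y) under assumed log utilities.

    u(c) = a*log(c), v(y - c) = log(y - c), w(s) = b*log(s)  (hypothetical choices).
    Incentive compatibility: c'(y) = w'(y) / (v'(y - c) - u'(c)).
    We integrate the inverse, dy/dc = (v'(y - c) - u'(c)) / w'(y), by Euler steps in c,
    because dy/dc is zero (rather than infinite) at the poorest type.
    """
    c = a * y_min / (1.0 + a)   # poorest type: u'(c) = v'(y_min - c), no signaling distortion
    y = y_min
    ys, cs = [y], [c]
    while y < y_max:
        dy_dc = (1.0 / (y - c) - a / c) * (y / b)   # v' = 1/(y-c), u' = a/c, 1/w' = y/b
        c += step
        y += step * dy_dc
        ys.append(y)
        cs.append(c)
    return ys, cs


def consumption_at(y_target, ys, cs):
    """Linearly interpolate the traced schedule at income y_target."""
    for i in range(1, len(ys)):
        if ys[i] >= y_target:
            y0, y1, c0, c1 = ys[i - 1], ys[i], cs[i - 1], cs[i]
            if y1 == y0:
                return c1
            return c0 + (c1 - c0) * (y_target - y0) / (y1 - y0)
    raise ValueError("y_target lies outside the traced schedule")


narrow = equilibrium_schedule(y_min=2.0, y_max=3.0)   # group without its poorest members
wide = equilibrium_schedule(y_min=1.0, y_max=3.0)     # same group after poorer persons are added
for income in (2.2, 2.5, 2.9):
    print(income, round(consumption_at(income, *narrow), 3), round(consumption_at(income, *wide), 3))
```

In this sketch the schedule depends on the lower bound of the income support but not on the density itself, so the experiment isolates the effect of adding poorer persons; the wider group shows higher visible spending at each income printed.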
But, even in a random anonymous situation, an observer of a Black (White) person will typically know more about the person’s income than that it is drawn from the national income distribution of Blacks (Whites). At a minimum, the observer knows that the person’s income is likely drawn from the Black (White) income distribution in the state where the person
17. It has also been shown that equilibrium conspicuous signaling is invariant to a replication of the distribution of income. That is, conspicuous signaling should be unaffected by differences in the size of groups, all else equal. See Glazer and Konrad (1996).
CONSPICUOUS CONSUMPTION AND RACE
447
resides.18 If k is taken to represent different race/state cells, several interesting testable predictions follow from the statussignaling model. First, differential visible spending should be observed not only across races based on the mean and dispersion of racial incomes, but also among persons of the same race in different states. Further, the overall income distribution in different states should not determine visible spending for a given race; only the income distributions of people in the state of a person’s own race should matter. Finally, if visible spending is truly driven by statusseeking behavior, the estimated racial differences in visible spending shown in the previous section will be eliminated, or at least substantially reduced, if controls for the mean and dispersion of the person’s race/state cell were added to the regressions. We analyze these implications below.19 V. EMPIRICAL TESTS OF THE CONSPICUOUS CONSUMPTION MODEL V.A. Explaining Within-Race Conspicuous Consumption Differences Before conducting separate within-race analyses of conspicuous spending behavior, we explore whether there is evidence to support the idea that persons at a given level of income, and belonging to a particular race/state cell, spend more on visible goods than similar persons belonging to race/state cells with higher average income. Using the same CEX sample described above, we estimate the following regression of the total visible spending of an individual i of race r living in state s:20 ln(visibleisr ) = β0 + δsr (s ∗r ) (4)
+ ϕ log(Total Expenditurei ) + θ Xi + ηi ,
18. In fact, observers likely have more detailed spatial information than a person’s state. We define reference groups with respect to state because the state is the lowest level of spatial aggregation available in our CEX data. In the Online Robustness Appendix, we use the PSID data to explore the sensitivity of our empirical results to the use of finer levels of spatial aggregation. 19. The signaling interpretation for conspicuous spending may also account for the fact that conspicuous consumption differences decline with age, as shown in Section III. Younger persons, given their greater involvement in marriage and other social markets as they search for spouses and friends, are likely more concerned than their older counterparts about outsiders’ assessments of their wealth and should be more likely to conspicuously consume as a result. 20. Otherwise, the controls are identical to those used in row (6) of Table II, and the sample restrictions are the same as discussed above.
448 Conditional log difference in visible spending relative to White Alabamans –0.5 0 1 0.5
QUARTERLY JOURNAL OF ECONOMICS
AL KY
AR
MANVNV NJ AR KS SC IL VA TX OK TX OH WA WI KS MD DC CA OR NYDC AL MD CO MA MI AK PATN CO CA WIIA FLSC NY AZ AZINNJHI AK IL KY MN NC NC LA MN CT OH MI CT OKMOIN GA IAVA FL WA AL HI PA OR AR KY PA IN LA SC MO IA OH KSWI MI MA TN GA LA TN MONC TX IL ORAZ GA MN WA CT NY AK OK FL NVCO VA CA NJ HI
9.6
10 10.4 10.8 Log of mean income of race-state cell White
Black
DC
MD
11.2
Hispanic
FIGURE III Relationship between Conditional Log Visible Spending and Log of Mean Income by Race/State Cells The figure plots the log of mean male labor income of race/state cells against the conditional log difference in visible spending of race/state cells. The mean of male labor income for each race/state cell is computed using CPS data as described in the note to Appendix III. Visible spending shown for race/state cells is dummies estimated on race/state interactions in a regression of log visible spending on race/state dummies, total expenditure controls (instrumented with current income controls), and demographics. The regression uses CEX data and is identical to the regression described in row (6) of Table II, aside from the fact that the race dummies are replaced with race/state dummies. The omitted race/state dummy in the regression is White Alabamans, so conditional visible spending in each cell is relative to White Alabamans. The figure presents results for 114 race/state cells. For the years studied, the CEX does not interview households from ME, MT, ND, RI, SD, WV, and WY, so there is no spending on households from these states. In addition, the numbers of observations for some race/state cells (Blacks in Utah, for example) were very tiny. We therefore exclude observations from DE, ID, NE, NH, UT, and VT from the data. The remaining sample covers 37 states plus DC, for three racial groups.
where s and r are vectors of state and race effects, respectively; and where, as in previous regressions, log total expenditures proxies for permanent income and is instrumented for with the vector Income (described above). Figure III plots the estimated effects δsr against the mean level of income for the particular race/state cell as estimated in the CPS. We use data from the 1990 through the 2002 March CPSs to compute the mean labor income of White
449
CONSPICUOUS CONSUMPTION AND RACE
males by state.21 To be consistent with our CEX sample, we restrict the CPS sample to include only individuals between the ages of 18 and 49 (inclusive). Two results in Figure III are striking. First, there is a negative and strongly statistically significant relationship between the mean income of a race/state cell and average spending on visible items among persons in that cell, relative to similar persons belonging to other race/state groupings. This result, estimated across all race/state cells, is consistent with the prediction of the status-seeking model. Notice also that the distribution of visible expenditures for different race/age cells supports the crossrace evidence presented earlier: Black race/state cells have lower permanent incomes and higher visible spending, White race/state cells have substantially higher permanent incomes and lower visible spending, and Hispanic race/state cells are, on average, between those for Blacks and Whites on both dimensions. What is the evidence about visible spending for people of the same race? To answer this question we estimate separately for each race the regression given by
(5)
y y ln(visibleik) = β0 + δ1 µk + δ2 Dk + ϕExpenditurei + θ Xi + ηi , y
where k indexes race/state cells for the particular race, and $\mu_k^y$ and $D_k^y$ are, respectively, the log of the mean and the dispersion of income for persons in race/state cell k. As before, we instrument for $\mathit{Expenditure}_i$ using the vector $\mathit{Income}_i$. Henceforth, we measure the dispersion of income in a race/state cell by the coefficient of variation, a dimensionless measure of dispersion. As noted previously, the mean and dispersion are estimated from CPS income data.

Table VI presents results for Whites in the CEX. Column (1) of Table VI shows that the base estimate of $\delta_1$ is a strongly statistically significant −0.60. Because both visible spending and reference group income enter in logs, this coefficient is an elasticity: a 1 percent increase in the mean state income of Whites is associated with a reduction of about 0.6 percent in the visible expenditures of Whites, all else equal. The specification in the second column adds the coefficient of variation. In this regression, we continue to find that higher average income of Whites in a White household's state is associated with lower visible spending, all else equal. Indeed, the point estimate on mean reference group income (−0.70) is larger in absolute value than in column (1). These basic results for average reference group income in columns (1) and (2) are strongly consistent with the main prediction of the status-signaling model. In column (2), higher dispersion in reference group income lowers White visible spending, and this effect is statistically significant. As discussed above, the theory is ambiguous about the sign of the effect of reference group income dispersion on visible spending.

21. The labor income of adult men in a person's state/race cell is our main measure of average reference group income. We also tried several alternative measures of reference group income, including total family income and total family labor income of all persons in the individual's race/state cell. In all that follows, the results are essentially unchanged under these alternative income specifications. We use the CPS rather than the CEX to estimate the mean income of the reference group within each state because of both the large sample sizes available in the CPS and its better-quality income data.

TABLE VI
WITHIN-WHITE DIFFERENCES IN VISIBLE EXPENDITURE BY MEAN INCOME OF OWN RACE WITHIN A STATE

Dependent variable: columns (1)–(3), log visible expenditure; column (4), log food expenditure; column (5), log total expenditure less visible and housing expenditures.

                                               (1)            (2)            (3)            (4)            (5)
Log of mean income of own race in state     −0.60 (0.14)   −0.70 (0.14)   −0.58 (0.13)    0.23 (0.06)   −0.01 (0.05)
Coefficient of variation of income for                     −0.72 (0.30)   −0.63 (0.28)    0.59 (0.10)   −0.06 (0.03)
  own race in state
Log of individual housing expenditures                                    −0.13 (0.06)    0.01 (0.03)   −0.15 (0.02)

Notes. The sample in the table is the same as that used in Table II except for the additional restriction that it include only White households (n = 37,289). For column (1), the specification is the same as in row (6) of Table II except for two changes: the race dummies are dropped as regressors, and the log of mean total household labor income of White men in the household's state of residence is included as a regressor. In column (2), we also add the coefficient of variation for total labor income of White men in the household's state of residence. In column (3), we add the log of individual housing expenditures as an additional regressor, instrumenting individual housing expenditures with the mean level of housing prices in the individual's state of residence. See Section V for a discussion of how we use CPS data to compute the mean and standard deviation of total labor income for men by state and of how we use data from the 1990 and 2000 U.S. Censuses to compute state housing prices. Columns (4) and (5) repeat the specification shown in column (3) with two different dependent variables. In column (4), the dependent variable is the log of household food spending. In column (5), the dependent variable is the log of household total spending less spending on visible goods and less spending on housing. Robust standard errors (clustered at the state level) are shown in parentheses.

A potential concern about the results in the first two columns is that some factor correlated with average state income may mechanically reduce spending on visible goods. Differences across states in housing prices are one such factor. Consider a state where the price of housing is high, all else equal. Individuals with a given level of income in that state will spend more for the same amount of housing, and less on other consumption items, possibly including visible items. To account for this, we control directly for the individual's log housing expenditures in our estimation of (5). Because individuals' housing expenditures are chosen jointly with their total and visible expenditures, and are therefore endogenous, we instrument individual housing expenditures with the mean value of house prices in the household's state of residence.²² We compute the mean value of house prices using data from the 1990 and 2000 U.S. Censuses. For households in the CEX from 1986 to 1994, we use the 1990 Census average state house price; for CEX households from 1995 to 2002, we use the 2000 Census average state house price.²³ Column (3) of Table VI shows the results from including log individual housing expenditures (instrumented with state housing prices) as an additional control. Controlling for individual housing expenditures slightly reduces the estimated effects of both the mean and the dispersion of reference group income on Whites' visible spending. Both effects, however, remain significant after controlling for housing expenses.

22. The first-stage relationship between housing expenditures and state housing prices is very strong, with F-statistics on the excluded instruments well in excess of 50.
23. The ordinal relationship across states in average housing prices is so stable that it does not matter whether we instead use only the 1990 house price or only the 2000 price.

Apart from concerns about state-level differences in housing costs, a state's level of income might be related to the menu of prices its residents pay for different consumption items. For example, the generosity of transfer or insurance programs might vary with a state's average level of income. If so, Whites with the same level of income in different states would effectively pay different prices for, and consume different amounts of, various consumption items. In particular, we would then expect a negative relationship between state income and expenditures on other items as well. The specifications in columns (4) and (5) are identical to that in column (3), except that the outcome variables are, in turn, the log of food expenditures and the log of all expenditures minus reported visible and housing expenditures. In stark contrast to the results for visible goods, we find no evidence of a negative relationship between higher average reference group income and these expenditures among Whites. Indeed, food expenditures increase with state income for Whites. For total nonvisible, nonhousing expenditures, we find no evidence of any systematic relationship between Whites' propensity to spend on these items across states and mean levels of White incomes in the state. Overall, for Whites, the sharp negative relationship between visible expenditures and mean reference group income does not exist for other categories of expenditures. These results provide strong support for the main unambiguous prediction of the status-signaling model: visible spending is negatively related to the economic status of the reference group from which a person's income is drawn, all else equal. Table VII presents within-race estimates for a pooled sample of Blacks and Hispanics.
We pool Blacks and Hispanics to increase the sample size for our estimation. Aside from larger standard errors, however, the point estimates in the pooled regression are similar to those we obtain when we restrict the sample to only Blacks or only Hispanics. The measure of $\mu_k^y$ used for the results in Table VII is the mean income of Black men in the state if the household head is Black, or of Hispanic men in the state if the household head is Hispanic. The results indicate that among racial minorities, visible spending is lower the higher the mean income of racial minorities in the state. The point estimate of −0.44 in column (1) implies that a 1 percent increase in the average income of minorities lowers minority visible spending by about 0.44 percent, all else equal.
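Because both visible expenditure and reference group mean income enter equation (5) in logs, $\delta_1$ is a constant elasticity, and the exact proportional change it implies depends on the size of the income change. The short Python sketch below, a minimal illustration rather than part of the authors' analysis, computes that exact change for a 10 percent increase and for a doubling of reference group income, using the column (1) point estimates of −0.60 (Table VI) and −0.44 (Table VII). The helper name `pct_change_from_elasticity` is hypothetical.

```python
def pct_change_from_elasticity(elasticity: float, income_ratio: float) -> float:
    """Exact percent change in visible spending when reference-group mean income
    is multiplied by income_ratio, all else fixed, under the constant-elasticity
    (log-log) form of equation (5)."""
    return 100.0 * (income_ratio ** elasticity - 1.0)


# Column (1) point estimates of delta_1 from Tables VI and VII.
ESTIMATES = {"Whites (Table VI)": -0.60, "Blacks and Hispanics (Table VII)": -0.44}

if __name__ == "__main__":
    for label, delta1 in ESTIMATES.items():
        up_10 = pct_change_from_elasticity(delta1, 1.10)
        doubled = pct_change_from_elasticity(delta1, 2.0)
        print(f"{label}: +10% income -> {up_10:.1f}%; doubled income -> {doubled:.1f}%")
```

For small income changes the exact figure is close to $\delta_1$ times the percentage change in income; for a doubling, the exact reduction (about 34 percent for −0.60 and 26 percent for −0.44) is well below $|\delta_1| \times 100$ percent.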
TABLE VII
WITHIN-BLACK AND -HISPANIC DIFFERENCES IN VISIBLE EXPENDITURE BY MEAN INCOME OF OWN RACE WITHIN A STATE

Dependent variable: columns (1)–(4), log visible expenditure; column (5), log food expenditure; column (6), log total expenditure less visible and housing expenditures.

                                             (1)            (2)            (3)            (4)            (5)            (6)
Log of mean income of own race in state   −0.44 (0.13)   −0.51 (0.12)   −0.45 (0.13)   −0.64 (0.15)    0.12 (0.08)   −0.02 (0.03)
Coefficient of variation of income for                    0.25 (0.17)    0.26 (0.18)    0.26 (0.17)   −0.14 (0.07)   −0.02 (0.04)
  own race in state
Log of individual housing expenditures                                   −0.09 (0.08)   −0.16 (0.09)    0.16 (0.04)   −0.14 (0.03)
Log mean income of all in state                                                          0.60 (0.31)

Notes. The sample in the table is the same as that used in Table II except for the additional restriction that it include only Black and Hispanic households (n = 12,074). For column (1), the specification is the same as in row (6) of Table II except for two changes: the race dummies are dropped as regressors, and the log of mean total household labor income of men of the household's own race in the household's state of residence is included as a regressor. In column (2), we also add the coefficient of variation for total labor income of men of the household's own race in the household's state of residence. In column (3), we add the log of individual housing expenditures as an additional regressor, instrumenting individual housing expenditures with the mean level of housing prices in the individual's state of residence. In column (4), we add the log of mean total labor income for all men in the household's state of residence. See Section V for a discussion of how we use CPS data to compute the mean and standard deviation of total labor income for men by state and of how we use data from the 1990 and 2000 U.S. Censuses to compute state housing prices. Columns (5) and (6) repeat the specification shown in column (3) with two different dependent variables. In column (5), the dependent variable is the log of household food spending. In column (6), the dependent variable is the log of household total spending less spending on visible goods and less spending on housing. Robust standard errors (clustered at the state level) are shown in parentheses.
In the second column we add the dispersion of reference group income to the regression. The estimated effect of $\mu_k^y$ in this regression is still strongly negative. Interestingly, unlike in the White regressions, we find that greater dispersion of reference group income is associated with higher visible spending for minorities, although this estimate is imprecise. Although we have stressed a reluctance to rely on racial preference differences to explain our results, recall that the sign of the dispersion effect has been shown theoretically to depend crucially on the relative curvatures of the different components of the utility function. We cannot reject the possibility that this curvature differs across races, which could explain the difference in the income dispersion results. We control for the log of individual housing expenditures (instrumented with mean state housing prices) in column (3) of Table VII and find that the estimated effects of both the mean and the dispersion of reference group income remain essentially unchanged. In column (4) of Table VII, we include the mean income of all men in the state as an additional regressor. Strikingly, we continue to find that Blacks and Hispanics have lower visible expenditures when the mean income of their race-based reference group is higher. However, if the mean income of all men in the state increases, holding the mean income of men from the person's own race constant, visible expenditures increase. The final two columns of the table repeat the exercise conducted earlier for Whites: we estimate the same regression as in column (3), but now with food expenditures and all nonvisible, nonhousing spending as the outcomes. The results are very similar to those for Whites. There is some evidence that food spending varies positively with average reference group income, but for total nonvisible, nonhousing spending, the very small point estimates indicate that the impact of higher mean reference group income on these expenditures is essentially zero.
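Column (3) of Tables VI and VII treats log individual housing expenditure as endogenous and instruments it with mean state house prices, just as total expenditure is instrumented with income throughout. The Python sketch below illustrates the two-stage least squares logic on simulated data only. The data-generating process, the variable names, and the coefficient −0.13 (chosen merely to echo the magnitude in Table VI) are hypothetical; the function returns point estimates without standard errors (the paper's state-clustered errors are not reproduced); and it is not the authors' code.

```python
import numpy as np


def two_stage_least_squares(y, X_exog, x_endog, Z_excl):
    """Point estimates from 2SLS with one endogenous regressor.

    y       : (n,) outcome, e.g. log visible expenditure
    X_exog  : (n, k) included exogenous controls (a constant is added here)
    x_endog : (n,) endogenous regressor, e.g. log housing expenditure
    Z_excl  : (n, m) excluded instruments, e.g. log state house price
    Returns [intercept, controls..., coefficient on x_endog]. Standard errors
    are NOT computed: second-stage OLS errors would be wrong for 2SLS.
    """
    n = y.shape[0]
    W = np.hstack([np.ones((n, 1)), X_exog])   # constant + exogenous controls
    Z = np.hstack([W, Z_excl])                 # full instrument matrix
    # First stage: project the endogenous regressor on all instruments.
    gamma, *_ = np.linalg.lstsq(Z, x_endog, rcond=None)
    x_hat = Z @ gamma
    # Second stage: OLS of y on the controls and the first-stage fitted values.
    R = np.hstack([W, x_hat.reshape(-1, 1)])
    beta, *_ = np.linalg.lstsq(R, y, rcond=None)
    return beta


def simulate(n=20_000, seed=0):
    """Hypothetical DGP: an unobserved taste u raises both housing and visible
    spending (so OLS is biased), while z shifts housing but, by construction,
    has no direct effect on visible spending (the exclusion restriction)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                     # stand-in for log state house price
    u = rng.normal(size=n)                     # unobserved taste shock
    w = rng.normal(size=n)                     # stand-in exogenous control
    x = 0.8 * z + u + rng.normal(size=n)       # log housing expenditure
    y = 1.0 + 0.5 * w - 0.13 * x + 0.5 * u + rng.normal(size=n)  # log visible
    return y, w.reshape(-1, 1), x, z.reshape(-1, 1)


if __name__ == "__main__":
    y, W, x, Z = simulate()
    beta_iv = two_stage_least_squares(y, W, x, Z)
    beta_ols, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(y), W, x]), y, rcond=None)
    print(f"true -0.13 | OLS {beta_ols[-1]:.3f} | 2SLS {beta_iv[-1]:.3f}")
```

Because the simulated taste shock raises both housing and visible spending, OLS pushes the housing coefficient upward, while 2SLS recovers a value near the assumed −0.13. With real data, such an estimate is only as credible as the assumption that state house prices affect visible spending solely through housing expenditure.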
Overall, the within-race results are strongly consistent with the status-signaling model outlined above. If the mean income of a person's own race/state cell increases, the person spends less on visible goods, all else equal. This pattern holds among Whites, Blacks, and Hispanics, and it persists even after controlling for differences in housing expenditures across states.

V.B. Explaining Racial Differences in Visible Expenditures

We analyze next whether the racial differences in visible consumption presented earlier in the paper can be reconciled by a status model with the key features described in the previous section. Using the same methods as described above, we reestimate equation (2), with which we earlier documented the differences in visible spending across races, but now add to that regression, for each individual, the mean and coefficient of variation of income in the household's race/state cell. This regression assesses whether Blacks or Hispanics, with their own income and the mean income of the racial peer group held constant, have the same visible expenditures as Whites, all else equal.

The results are shown in Table VIII. Column (1) displays the results from row (6) of Table II, in which we do not control for features of the reference group income distribution. Without reference group income controls, observationally equivalent Black and Hispanic male-headed households spend 26% and 23% more on visible goods, respectively, than do Whites. The regression in column (2) continues to exclude reference group income but adds state fixed effects. The estimated effects of 0.28 and 0.26 show that the state fixed effects have little influence on the estimated racial gaps in visible expenditures. In the third column we add, for each individual, the average income of his or her race/state reference group and exclude the state fixed effects. This regression shows dramatically that our control for reference group income explains nearly the entire gap in spending across races: both the Black and Hispanic point estimates are quantitatively tiny and statistically indistinguishable from zero. Column (4) adds state fixed effects to the regression in the third column, and the fifth column adds both state fixed effects and the coefficient of variation of reference group income. The results in both of these specifications are qualitatively the same as the results in column (3).

TABLE VIII
RACIAL DIFFERENCES IN LOG VISIBLE EXPENDITURES AFTER CONTROLLING FOR MEAN GROUP STATE INCOME, INCLUDING OWN INCOME, EXPENDITURE, AND DEMOGRAPHIC CONTROLS

                                               (1)            (2)            (3)            (4)             (5)
Black coefficient                             0.26 (0.02)    0.28 (0.02)   −0.03 (0.07)   −0.005 (0.07)   −0.04 (0.07)
Hispanic coefficient                          0.23 (0.05)    0.26 (0.03)   −0.01 (0.08)   −0.01 (0.06)    −0.04 (0.07)
Log of mean own group income in state                                    −0.53 (0.12)   −0.51 (0.11)    −0.52 (0.11)
Coefficient of variation of income for                                                                   0.17 (0.12)
  own race in state
State fixed effects included                  No             Yes            No             Yes             Yes

Notes. The table shows the results of regressions of log visible consumption on race dummies and a full set of income, total expenditure, demographic, and year controls. These controls are the same as those used in the regression displayed in row (6) of Table II (see the note to Table II for details). The first column of this table replicates the results shown in row (6) of Table II. In the second column, we include state fixed effects. In the third column, we add the log of mean total household income for men of the household's own race in the household's state of residence as an additional control (but exclude state fixed effects). In the fourth column, we include both state fixed effects and the log of mean income for the household's own race within its state of residence. In column (5), we further add the coefficient of variation of income for the household's own race within its state of residence. Robust standard errors (clustered at the state level) are in parentheses.

In summary, the results show that the visible expenditure differences between Whites, on the one hand, and Blacks and Hispanics, on the other, vanish once we control for the average income of the race/state cells from which individuals' incomes are drawn. Importantly, the results also indicate that it is not some generic trait of the state that explains the conspicuous consumption gap, but rather the incomes of individuals' racial reference groups specifically.

Income distributions of reference groups are not exogenously assigned in our regressions. In practice, persons who differ in ways unrelated to conspicuous preferences may choose to locate in one place versus another. Could our results be explained by systematic sorting whereby, for example, persons whose high discount rates make them buy more visible items than investment goods locate in states where the mean income for their racial group is low? Lacking instrumental variables for location, we cannot rule out this possibility, but two empirical facts suggest it is unlikely. First, the results in Table IV indicate that racial minorities spend less than comparable Whites on tobacco and alcohol, goods most consumed by those with high discount rates. Second, contrary to what a sorting story would imply, Tables VI and VII find no relationship between spending on all nonvisible, nonhousing items and average income in race/state cells. On the whole, these results are strongly consistent with the predictions of the models of status and conspicuous consumption discussed in Section IV.
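The percentage readings of the race coefficients in Table VIII (for example, 26% for Blacks in column (1)) treat coefficients on dummy variables in a log regression as percentage differences, a first-order approximation that is accurate only when the coefficients are small. A minimal Python sketch of the exact conversion, $100(e^{b}-1)$, applied to the column (1) and column (3) point estimates; the function name is illustrative, not from the paper.

```python
import math


def log_points_to_percent(coef: float) -> float:
    """Exact percentage difference implied by a dummy-variable coefficient
    in a regression whose dependent variable is in logs: 100 * (e^coef - 1)."""
    return 100.0 * math.expm1(coef)


# Point estimates from columns (1) and (3) of Table VIII.
TABLE_VIII = [
    ("Black, col. (1)", 0.26),
    ("Hispanic, col. (1)", 0.23),
    ("Black, col. (3)", -0.03),
    ("Hispanic, col. (3)", -0.01),
]

if __name__ == "__main__":
    for label, coef in TABLE_VIII:
        print(f"{label}: {coef:+.2f} log points = {log_points_to_percent(coef):+.1f}%")
```

The exact column (1) gaps are somewhat larger (roughly 30% and 26%), while for the near-zero column (3) estimates the approximation and the exact value are practically identical, so the conclusion that the gap disappears does not hinge on this choice.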
This simple status-signaling model appears to explain differences in individual visible consumption within and across races, and it does so without requiring systematic differences in preferences by race. Race is important only insofar as it provides information to an observer about the income distribution from which a person's income is drawn, creating in the process a differential incentive for people of different races to engage in conspicuous signaling.

VI. CONCLUSIONS

In this paper we document divergent patterns of expenditures on visible consumption goods across races. Consistent
with popular perception, we find that minorities spend more on conspicuous items than Whites, controlling for differences in income. A variety of estimates show that these visible expenditure differences are relatively large and are associated with a substantial diversion of resources from other uses, such as health care and education. Next, we argue that one need not appeal to cultural or racial differences in preferences to understand this evidence. Specifically, we outline a model of status seeking and conspicuous consumption in which individuals use conspicuous spending as a signal of income. Consistent with results from a growing theoretical literature suggesting that visible consumption should rise as poorer persons are added to a reference group, we find that visible consumption both within and across races falls as the mean of reference group income rises. This finding is buttressed by additional tests showing that racial differences in consumption spending are sharply reduced when we control for the mean and dispersion of reference group income, which jointly provide a powerful empirical measure of the reference distribution. Of course, our results do not rule out the possibility that there may yet be racial differences in utility parameters that act in combination with the effects we have identified. Note that the random, anonymous social interactions that are the focus of our paper constitute only a subset of the possible interactions that people care about. Depending on the interaction, an observer will already have finer or coarser information about the particular income distribution from which the person's income happens to be drawn, meaning that the relevant reference groups across different interactions may be narrower or broader than the race/state cells we study. Further, the specific types of goods used to signal economic position in different interactions may also differ from the items studied in this paper.
For example, among friends or family, status signaling might be effected with home furnishings, entertainment durables, or spending on children’s education—expenditures that only intimates have an opportunity to observe. Interesting avenues for future work include an investigation of which specific types of conspicuous consumption matter in different contexts, and whether people choose their neighbors with an eye to satisfying status considerations. Our findings on status signaling may have policy implications. Recent authors have suggested that a desire for social status informs such behavior as spending on weddings in rural India (Bloch, Rao, and Desai 2004) or the expenditures of recent
immigrants (Chung and Fisher 2001). That these status-related expenditures may represent inefficient transfers from spending on goods such as healthcare, education, or savings has been forcefully argued by Frank (2000). Ireland (1994) investigates whether the provision of monetary rather than certain in-kind transfers may lead to superior outcomes, because the receipt and use of money communicates much less negative information about economic position than is true of observable in-kind benefits. Our results on conspicuous spending and race offer further evidence that understanding the complicated nature and possible consequences of status signaling is an important area for future work.
APPENDIX I: DATA APPENDIX

For our primary analysis, we use the extracts of the Consumer Expenditure Survey (CEX) compiled by Harris and Sabelhaus (2000) and available online through the National Bureau of Economic Research (NBER).²⁴ The NBER CEX files are available from 1980Q1 to 2003Q1, and we use data from 1986 to 2002. The year 1986 is the first year in which the CEX data included unique family identifiers, which we need in order to merge key additional information from the Bureau of Labor Statistics' (BLS) raw CEX data files. The NBER CEX extracts were intended to provide a condensed version of the original data that is consistent over time. The extracts include information from the CEX family files, the member files, the detailed expenditure files, and the detailed income files. The extracts aggregate spending on more than 500 detailed items in the raw data into 47 spending categories. Our analysis further aggregates spending into 15 categories, as summarized in Table A.1. The 15 categories we use in this paper make up the universe of all expenditure categories in the NBER CEX files. We restrict the NBER CEX data to household heads, ensuring that there is only one observation per household in our data. After deletions, our sample includes 49,363 households: 37,289 White households, 6,766 Black households, and 5,308 Hispanic households.

24. See http://www.nber.org/data/ces_cbo.html for the data files and http://www.nber.org/ces_cbo/Cexfam.doc for the corresponding documentation. All data and code used to generate the results in this paper can be found at http://faculty.chicagogsb.edu/erik.hurst/research/race_and_consumption_data_page.html.
TABLE A.1
AGGREGATION OF THE NBER CEX FILES

Our spending category: corresponding NBER CEX spending categories

Visible spending components
Clothing/jewelry: clothing and shoes (029); clothing services (030); jewelry and watches (031)
Personal care: toilet articles and preparations (032); barbershops, beauty parlors, and health clubs (033)
Vehicle (limited): net outlay on new and used motor vehicles (052)
Vehicle (expanded): net outlay on new and used motor vehicles (052); repair, leasing, greasing, washing, parking, storage, and rental (054); reduction of principal on vehicle loan (096); tires, tubes, accessories, and other parts (053)

Other spending components
Housing: tenant-occupied nonfarm dwellings, rent (including the rental of furniture and appliances) (034); rental equivalence of owned home (075)
Food: food off-premise (023); food on-premise (024); food furnished employees (025)
Utilities: electricity (038); gas (039); water and other sanitary services (040); fuel oil and coal (040); telephone (042)
Other transportation: vehicle gasoline and oil (055); bridge, tunnel, ferry, and toll roads (056); auto insurance (057); mass transit systems (058); taxicab, railway, bus, and other travel (059)
Entertainment services: recreation services (060); books and maps (061); magazines, newspapers, nondurable toys (062)
Entertainment durables: recreation and sports equipment (063)
Alcohol and tobacco: tobacco products (026); alcohol off-premise (027); alcohol on-premise (028)
Household furnishings: furniture and durable household equipment (036)
Education: higher education (066); nursery, elementary, and secondary education (067); other education services (068)
Health: prescription drugs (044); ophthalmic products and orthopedic appliances (045); physicians, dentists, other medical professionals (046); hospitals (047); nursing homes (048); health insurance (049)
Other: nondurable household supplies and equipment (037); domestic service, other household operation (043); business services (050); expense of handling life insurance (051); pari-mutuel net receipts (065); religious and welfare activities (069)

Notes. A full description of the NBER CEX consumption categories can be found online at http://www.nber.org/ces_cbo/Cexfam.doc. The category numbers from the NBER CEX files are in parentheses.
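Table A.1 amounts to a many-to-many mapping from NBER CEX category numbers to the paper's spending categories; category 052, for instance, feeds both vehicle measures. A minimal Python sketch of the visible-spending rows as a lookup table, applied to a hypothetical household record keyed by NBER category number (the record structure is an assumption, not the layout of the NBER files):

```python
# Visible-spending components of Table A.1, keyed by our category name and
# listing the NBER CEX category numbers that feed each one. The two vehicle
# measures overlap: 052 (net outlay on motor vehicles) appears in both.
VISIBLE_COMPONENTS = {
    "Clothing/jewelry": (29, 30, 31),
    "Personal care": (32, 33),
    "Vehicle (limited)": (52,),
    "Vehicle (expanded)": (52, 53, 54, 96),
}


def aggregate(spending_by_code, components=VISIBLE_COMPONENTS):
    """Sum one household's spending (dict: NBER code -> dollars) into categories.
    Codes absent from the record are treated as zero spending."""
    return {
        category: sum(spending_by_code.get(code, 0.0) for code in codes)
        for category, codes in components.items()
    }


if __name__ == "__main__":
    household = {29: 400.0, 31: 100.0, 33: 60.0, 52: 2000.0, 54: 300.0}  # hypothetical
    print(aggregate(household))
    # {'Clothing/jewelry': 500.0, 'Personal care': 60.0, 'Vehicle (limited)': 2000.0, 'Vehicle (expanded)': 2300.0}
```

Storing the mapping as category-to-codes, rather than code-to-category, keeps overlapping definitions such as the two vehicle measures explicit.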
We briefly summarize the modifications and restrictions we imposed on the data.
• The NBER CEX files do not include state of residence, Hispanic origin, city size, number of adults in the household, or the number of quarters in which the household participated in the survey. We downloaded these key variables from the raw CEX files and merged them with the analysis sample manually.
• As is standard in the literature, we compute a measure of housing service flows. For renters, this is the rent for their home or apartment; for homeowners, it is the homeowner's report of the rental equivalence of the home. In the analysis, we experiment with other measures of housing service flows, such as setting them to 6% of the homeowner's housing value. The results are unaffected.
• The analysis uses two measures of vehicle spending: a "limited" measure that includes only net outlays (mostly down payments) associated with the initial purchase of the vehicle, and an "expanded" measure that also includes the repayment of principal on vehicle loans; spending on maintenance, leasing, repairs, storage, and rental; and spending on tires, tubes, accessories, and other parts.
• Our measure of housing services spending includes spending on the rental of household furniture and spending on home maintenance (such as paint and roof repair and replacement), home remodeling (adding an addition), and home decorating (wall-to-wall carpeting, replacement of hardwood floors). The inclusion of these categories is an artifact of the NBER CEX files: their measure of rent paid for tenant-occupied dwellings combines a broad set of housing expenditures aside from rent paid, and as a result it is impossible to disaggregate the data at a finer level.
• The NBER CEX files report the sum of spending in a variety of categories across all quarters that the household participated in the survey.
Households surveyed for two quarters will therefore have only half the total expenditures of otherwise identical households participating for all four quarters. The NBER CEX files do not include an indicator variable for the number of quarters that the household participated in the survey, although a summary variable indicates that less than 50% of the sample completed all four surveys.
After manually merging in the exact number of quarters that the household participated in the survey, we reexpress the spending data on a per-quarter basis: per-quarter spending in a given category is the NBER CEX spending in that category divided by the number of quarters that the household participated in the survey.
We made the following restrictions on the CEX sample:
• We include only households reporting themselves as Black, White, or Hispanic. We treat mixed-race heads as Hispanic in our analysis; this choice has no effect on the results, which are the same if we exclude these households.
• We exclude households with total expenditures of over $400,000 per year (in 2005 dollars). These 98 households make up the top 0.1 percent of the total expenditure distribution.
• We exclude households that changed their state of residence during the year, households for which the head's education is missing (4,134 households), and households for which the household's region is missing (617 households).

APPENDIX II: MEAN QUARTERLY EXPENDITURE (IN 2005 DOLLARS), PERCENT WITH POSITIVE EXPENDITURES, AND EXPENDITURE SHARES BY CONSUMPTION CATEGORY, BY RACE
                                     (1) All               (2) White             (3) Black             (4) Hispanic
Visible expenditures                 1,670 / 0.99 / 0.12   1,790 / 0.99 / 0.12   1,260 / 0.99 / 0.12   1,320 / 0.99 / 0.12
Shelter expenditures                 2,500 / 0.99 / 0.25   2,670 / 0.98 / 0.25   1,830 / 0.99 / 0.26   2,150 / 0.99 / 0.28
Food expenditures                    1,660 / 1.00 / 0.18   1,730 / 1.00 / 0.17   1,300 / 1.00 / 0.21   1,630 / 1.00 / 0.22
Utility expenditures                   740 / 0.99 / 0.08     760 / 0.99 / 0.07     730 / 0.99 / 0.11     650 / 0.99 / 0.09
Vehicle service expenditures           800 / 0.88 / 0.07     870 / 0.93 / 0.07     580 / 0.71 / 0.06     540 / 0.80 / 0.05
Other transportation expenditures      670 / 0.98 / 0.07     710 / 0.99 / 0.07     500 / 0.96 / 0.06     580 / 0.97 / 0.07
Entertainment service expenditures     580 / 0.98 / 0.05     660 / 0.99 / 0.07     290 / 0.95 / 0.04     330 / 0.95 / 0.04
Health expenditures                    410 / 0.85 / 0.04     470 / 0.89 / 0.04     250 / 0.74 / 0.03     270 / 0.76 / 0.03
Home furnishing expenditures           310 / 0.83 / 0.03     350 / 0.86 / 0.03     190 / 0.71 / 0.02     220 / 0.79 / 0.02
Education expenditures                 250 / 0.43 / 0.02     290 / 0.46 / 0.02     170 / 0.35 / 0.02     120 / 0.30 / 0.01
Entertainment durable expenditures     250 / 0.80 / 0.02     290 / 0.85 / 0.02     110 / 0.64 / 0.01     140 / 0.72 / 0.02
Alcohol/tobacco expenditures           210 / 0.82 / 0.02     240 / 0.86 / 0.03     120 / 0.68 / 0.02     120 / 0.70 / 0.02
Other expenditures                     650 / 0.92 / 0.05     730 / 0.95 / 0.05     400 / 0.86 / 0.04     360 / 0.85 / 0.03
Sample size                          49,363                37,289                6,766                 5,308

Notes. See the notes to Table I for a full sample description. See Table A.1 for the definition of each consumption category. For each consumption category, each cell reports three statistics separated by slashes: the average spending per quarter in that category (in 2005 dollars, rounded to the nearest ten dollars), the fraction of households with positive spending in the category, and the share of expenditures in the category out of total expenditures. Columns (1)–(4), respectively, show the relevant statistics for the total population, a sample with White heads, a sample with Black heads, and a sample with Hispanic heads.
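Each cell of Appendix II combines three statistics. A minimal Python sketch of how they could be computed from household records after the per-quarter normalization described in Appendix I. The record layout is hypothetical, the statistics are unweighted, and the expenditure share is taken here to be the average of household-level shares; that last point is an assumption, because the note does not say whether shares are averaged across households or computed from aggregate spending.

```python
from statistics import mean


def to_per_quarter(household):
    """Divide each category's multi-quarter total by the number of quarters
    the household was surveyed, as described in the data appendix."""
    q = household["quarters"]
    return {cat: amount / q for cat, amount in household["spending"].items()}


def cell_statistics(households, category):
    """Return (mean per-quarter spending, fraction with positive spending,
    mean household-level expenditure share) for one category.
    Unweighted; each household's share is its category spending over its
    total spending across all categories (assumed to be positive)."""
    per_q = [to_per_quarter(h) for h in households]
    amounts = [h.get(category, 0.0) for h in per_q]
    shares = [h.get(category, 0.0) / sum(h.values()) for h in per_q]
    frac_positive = sum(1 for a in amounts if a > 0) / len(amounts)
    return mean(amounts), frac_positive, mean(shares)


if __name__ == "__main__":
    # Two hypothetical households, surveyed for four and two quarters.
    sample = [
        {"quarters": 4, "spending": {"Visible": 4000.0, "Food": 6000.0, "Other": 10000.0}},
        {"quarters": 2, "spending": {"Visible": 0.0, "Food": 2000.0, "Other": 2000.0}},
    ]
    for cat in ("Visible", "Food"):
        print(cat, cell_statistics(sample, cat))
```

Normalizing to a per-quarter basis first matters: without it, the household surveyed for four quarters would appear to spend twice as much as an otherwise identical household surveyed for two.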
APPENDIX III: MEAN AND STANDARD DEVIATION OF MALE LABOR INCOME BY RACE AND STATE USING CPS DATA

Means are in 2005 dollars; CV denotes the coefficient of variation.

          All               White             Black             Hispanic
State     Mean      CV      Mean      CV      Mean      CV      Mean      CV
AL        30,346    1.13    35,071    1.06    17,809    1.15    26,371    0.84
AK        40,425    0.97    43,715    0.90    31,985    1.35    40,835    1.12
AZ        33,421    1.17    38,440    1.11    26,070    0.81    23,664    1.30
AR        26,983    1.11    29,215    1.07    15,905    1.05    22,339    1.32
CA        36,057    1.19    45,185    1.08    27,050    1.27    24,456    1.16
CO        39,302    1.08    42,132    1.04    24,826    1.07    29,310    1.26
CT        44,472    1.06    47,665    1.02    24,922    1.05    28,062    1.50
DE        35,153    0.98    38,148    0.95    25,390    1.01    26,864    1.04
DC        35,396    1.33    60,269    1.06    22,847    1.20    23,020    1.40
FL        32,577    1.17    36,898    1.12    20,492    1.16    26,844    1.23
GA        34,767    1.11    40,551    1.03    24,363    1.26    28,050    0.95
HI        33,651    0.98    36,828    0.99    31,158    0.84    29,573    0.82
ID        31,728    1.07    33,275    1.07    11,511    1.40    21,417    0.85
IL        37,934    1.07    42,859    1.00    22,434    1.25    26,981    1.17
IN        34,380    0.99    35,432    0.97    23,032    1.01    27,290    1.17
IA        32,685    0.97    33,480    0.96    20,041    0.92    21,260    0.90
KS        34,921    1.09    36,422    1.09    26,793    0.85    27,561    1.04
KY        31,597    1.15    32,379    1.15    21,102    0.94    26,878    1.08
LA        30,775    1.16    35,643    1.07    17,460    1.11    30,640    1.18
ME        31,690    1.11    31,677    1.11    29,120    0.77    53,447    0.83
MD        40,196    1.05    46,347    0.95    27,443    1.24    31,327    1.21
MA        39,941    1.08    42,208    1.05    23,961    1.45    21,414    1.11
MI        37,748    1.04    40,029    1.00    22,252    1.17    30,429    1.16
MN        38,178    1.09    39,522    1.07    22,868    1.16    24,157    0.96
MS        26,868    1.13    32,271    1.01    17,432    1.32    31,987    0.82
MO        33,532    1.11    34,943    1.08    21,879    0.93    27,648    1.39
MT        27,262    0.98    28,016    0.95    24,632    0.64    24,904    1.05
NE        33,525    0.95    34,673    0.95    22,279    0.95    23,320    0.82
NV        36,396    1.05    40,452    0.98    25,241    1.19    26,510    1.20
NH        39,982    1.03    39,972    1.03    36,794    1.10    37,150    1.21
NJ        43,518    1.09    48,838    1.01    26,872    1.40    28,339    1.15
NM        28,031    1.04    34,296    0.95    31,239    0.91    23,175    1.07
NY        36,352    1.21    42,313    1.13    21,773    1.25    23,598    1.30
NC        33,008    1.08    36,642    1.04    22,577    1.08    23,312    1.02
ND        30,138    1.04    31,107    1.02    21,750    0.55    24,257    0.80
OH        36,285    1.05    38,046    1.02    22,082    1.24    30,821    1.07
OK        30,990    1.14    33,472    1.12    20,827    0.99    23,458    0.96
OR        34,682    1.08    36,645    1.04    28,994    1.75    20,376    1.03
PA        35,639    1.10    37,553    1.07    20,066    1.28    27,075    1.27
RI        36,161    1.08    38,216    1.05    23,707    1.06    20,225    1.30
SC        31,515    1.02    35,637    0.95    21,183    1.06    29,305    1.23
SD        29,268    1.04    30,517    1.02    21,588    0.78    25,887    0.86
TN        30,840    1.22    33,051    1.19    20,726    1.28    24,628    0.98
TX        34,065    1.15    42,797    1.02    23,671    1.27    23,733    1.23
UT        34,127    1.02    35,382    1.01    21,800    0.87    24,749    0.91
VT        33,152    0.97    33,344    0.97    10,801    1.72    38,846    0.87
VA        38,667    1.07    42,737    1.03    24,949    1.03    34,833    1.16
WA        38,229    1.08    39,699    1.03    33,986    2.12    23,616    0.99
WV        26,352    1.11    26,534    1.07    15,257    1.26    24,327    0.95
WI        36,256    0.97    37,812    0.95    19,243    1.14    25,236    1.15
WY        33,842    0.97    34,564    0.96    24,369    0.89    28,919    0.81

Note. The table shows the means and coefficients of variation of male labor income by race and state from the 1990–2002 CPS. Data are averaged over the entire sample period and are reported in 2005 dollars. The sample includes males aged 18–49 (inclusive). All data are weighted using the CPS weights.
CONSPICUOUS CONSUMPTION AND RACE
465
466
QUARTERLY JOURNAL OF ECONOMICS
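The Appendix III title refers to standard deviations, but the columns report coefficients of variation. Since the coefficient of variation is the standard deviation divided by the mean, multiplying the two reported figures gives an approximate standard deviation. The Python sketch below is illustrative only. The function name `implied_sd` is hypothetical, and because the CVs are rounded to two decimals, the recovered values are approximate.

```python
def implied_sd(mean: float, cv: float) -> float:
    """Return the standard deviation implied by a mean and a coefficient of variation.

    Uses CV = SD / mean, so SD = CV * mean. Appendix III rounds CVs to two
    decimals, so the result is only approximate.
    """
    if mean <= 0:
        raise ValueError("mean must be positive")
    return cv * mean


# Example values taken from Appendix III (means in 2005 dollars).
examples = {
    ("AL", "All"): (30_346, 1.13),
    ("ME", "Hispanic"): (53_447, 0.83),
}
for (state, group), (mean, cv) in examples.items():
    print(state, group, round(implied_sd(mean, cv)))
# AL All 34291
# ME Hispanic 44361
```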
UNIVERSITY OF CHICAGO AND NBER
UNIVERSITY OF CHICAGO AND NBER
UNIVERSITY OF PENNSYLVANIA
REFERENCES
Alexis, Marcus, “Patterns of Black Consumption: 1935–1960,” Journal of Black Studies, 1 (1970), 55–74.
Bagwell, Laurie Simon, and B. Douglas Bernheim, “Veblen Effects in a Theory of Conspicuous Consumption,” American Economic Review, 86 (1996), 349–373.
Bloch, Francis, Vijayendra Rao, and Sonalde Desai, “Wedding Celebrations as Conspicuous Consumption: Signaling Social Status in Rural India,” Journal of Human Resources, 39 (2004), 675–695.
Chambers, Jason, “Equal in Every Way: African Americans, Consumption and Materialism from Reconstruction to the Civil Rights Movement,” Advertising and Society Review, 7 (2006). http://muse.jhu.edu/journals/asr/v0007/7.1chambers.html.
Charles, Kerwin Kofi, and Erik Hurst, “The Transition to Home Ownership and the Black–White Wealth Gap,” Review of Economics and Statistics, 84 (2002), 281–297.
Chung, Ed, and Eileen Fischer, “When Conspicuous Consumption Becomes Inconspicuous: The Case of Migrant Hong Kong Consumers,” Journal of Consumer Marketing, 18 (2001), 474–487.
Clark, Andrew E., and Andrew J. Oswald, “Satisfaction and Comparison Income,” Journal of Public Economics, 56 (1996), 359–381.
Duesenberry, James S., Income, Saving, and the Theory of Consumer Behavior (Cambridge, MA: Harvard University Press, 1949).
Frank, Robert H., Luxury Fever: Money and Happiness in an Era of Excess (Princeton, NJ: Princeton University Press, 2000).
Friedman, Milton, A Theory of the Consumption Function (Princeton, NJ: Princeton University Press, 1957).
Glazer, Amihai, and Kai Konrad, “A Signaling Explanation for Private Charity,” American Economic Review, 86 (1996), 1019–1028.
Harris, Ed, and John Sabelhaus, “Consumer Expenditure Survey, Family-Level Extracts, 1980:1–1998:2,” NBER Working Paper, 2000.
Ireland, Norman, “On Limiting the Market for Status Signals,” Journal of Public Economics, 53 (1994), 91–110.
Kahneman, Daniel, and Alan Krueger, “Developments in the Measurement of Subjective Well-Being,” Journal of Economic Perspectives, 20 (2006), 3–24.
Kuhn, Peter J., Peter Kooreman, Adriaan R. Soetevent, and Arie Kapteyn, “The Own and Social Effects of an Unexpected Income Shock: Evidence from the Dutch Postcode Lottery,” RAND Working Paper #574, 2008.
Lamont, Michèle, and Virág Molnár, “How Blacks Use Consumption to Shape Their Collective Identity: Evidence from Marketing Specialists,” Journal of Consumer Culture, 1 (2001), 31–45.
Luttmer, Erzo F. P., “Neighbors as Negatives: Relative Earnings and Well-Being,” Quarterly Journal of Economics, 120 (2005), 963–1002.
Mailath, George J., “Incentive Compatibility in Signaling Games with a Continuum of Types,” Econometrica, 55 (1987), 1349–1365.
McBride, Michael, “Relative-Income Effects on Subjective Well-Being in the Cross-Section,” Journal of Economic Behavior and Organization, 45 (2001), 251–278.
Modigliani, Franco, and Richard Brumberg, “Utility Analysis and the Consumption Function: An Interpretation of the Cross Section Data,” in Post-Keynesian Economics, Kenneth Kurihara, ed. (New Brunswick, NJ: Rutgers University Press, 1954), 388–436.
Mullins, Paul, “Race and the Genteel Consumer: Class and African-American Consumption, 1850–1930,” Historical Archaeology, 33 (1999), 22–38.
Munnell, Alicia H., Geoffrey M. B. Tootell, Lynn E. Browne, and James McEneaney, “Mortgage Lending in Boston: Interpreting HMDA Data,” American Economic Review, 86 (1996), 25–53.
Ravina, Enrichetta, “Habit Formation and Keeping Up with the Joneses: Evidence from Micro Data,” Columbia Business School Working Paper, 2007.
Smith, Adam, The Theory of Moral Sentiments, reprint (New Rochelle, NY: Arlington House, 1969 [original publication 1759]).
Spence, Michael, “Job Market Signaling,” Quarterly Journal of Economics, 87 (1973), 355–374.
Veblen, Thorstein, The Theory of the Leisure Class: An Economic Study of Institutions, reprint (Kila, MT: Kessinger, 2004 [original publication 1899]).
THE DIFFUSION OF DEVELOPMENT∗

ENRICO SPOLAORE AND ROMAIN WACZIARG

We find that genetic distance, a measure associated with the time elapsed since two populations’ last common ancestors, has a statistically and economically significant effect on income differences across countries, even controlling for measures of geographical distance, climatic differences, transportation costs, and measures of historical, religious, and linguistic distance. We provide an economic interpretation of these findings in terms of barriers to the diffusion of development from the world technological frontier, implying that income differences should be a function of relative genetic distance from the frontier. The empirical evidence strongly supports this barriers interpretation.
I. INTRODUCTION

What explains the vast differences in income per capita across countries? This paper provides new empirical evidence shedding light on this question.1 At the center of our analysis is genetic distance, a measure based on aggregate differences in the distribution of gene variants across populations.2 For the first time, we document and discuss the relationship between genetic distance and differences in income per capita across countries. We find that measures of genetic distance bear a statistically and economically significant relationship to income differences, and that this relationship is robust to controlling for a large number of measures of geographical distance, climatic differences, transportation costs, and measures of historical, linguistic, and religious distance. The effect of genetic distance holds not only for contemporary income differences, but also for income differences measured since 1500.

∗ We are grateful to Alberto Alesina, Robert Barro, Sam Bowles, Robert Boyd, Francesco Caselli, Dan Cox, Steven Durlauf, James Fearon, Oded Galor, Luigi Guiso, Peter Howitt, Yannis Ioannides, Larry Katz, Pete Klenow, Andros Kourtellos, Ed Kutsoati, Edward Leamer, John Londregan, Peter Lorentzen, Lisa Lynch, Paolo Mauro, Deborah Menegotto, Sharun Mukand, Louis Putterman, Gérard Roland, Fabio Schiantarelli, Antonio Spilimbergo, Susan Stokes, Chih Ming Tan, David Weil, Bruce Weinberg, Ivo Welch, several anonymous referees, and participants at numerous seminars and conferences for helpful comments. We gratefully acknowledge financial support from Stanford University’s Presidential Fund for Innovation in International Studies.

1. Contributions to the literature on the determinants of income per capita using cross-country regressions include Hall and Jones (1999), Acemoglu, Johnson, and Robinson (2001), Easterly and Levine (2003), Alcalá and Ciccone (2004), and Glaeser et al. (2004), among many others.

2. Our source for genetic distances between human populations is Cavalli-Sforza, Menozzi, and Piazza (1994). Recent textbook references on human evolution are Boyd and Silk (2003) and Jobling, Hurles, and Tyler-Smith (2004). For a nontechnical discussion of these concepts see Dawkins (2004).

© 2009 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, May 2009
Moreover, the effect of genetic distance on income differences is quantitatively large, statistically significant, and robust not only worldwide, but also within Europe, for which more precise measures of cross-country genetic distance are available.

In addition to establishing these facts, we provide an economic interpretation for our findings. What does genetic distance capture, and why is it correlated with income differences, even controlling for geographical distance and other factors? Technically, genetic distance measures the difference in gene distributions between two populations, where the genes under consideration are neutral: they change randomly and independently of selection pressure. The rationale for this approach is that divergence in neutral genes provides information about lines of descent. Most random genetic change takes place regularly over time, as in a molecular clock. Therefore, genetic distance measures the time since two populations shared common ancestors, that is, the time since they were the same population. In other words, genetic distance is a summary measure of general relatedness between populations. An intuitive analogue is the familiar concept of relatedness between individuals: two siblings are more closely related than two cousins because they share more recent common ancestors (their parents rather than their grandparents).

Because genetic distance is based on neutral change, it is not meant to capture differences in specific genetic traits that directly matter for survival and fitness. Hence, our results provide no evidence for a direct effect of specific genes on income or productivity. Our findings are not about some societies having some specific genes that make them directly richer. Instead, our results provide strong evidence that a general measure of genealogical relatedness between populations can explain income differences today, even though it reflects mostly neutral genetic variation. Why?
Our interpretation is that genetic distance captures barriers to the diffusion of development. More closely related societies are more likely to learn from each other and adopt each other’s innovations. It is easier for someone to learn from a sibling than from a cousin, and easier to learn from a cousin than from a stranger. Populations that share more recent common ancestors have had less time to diverge in a wide range of traits and characteristics that are transmitted across generations with variation. Of course, in human populations many of those traits are transmitted across generations culturally rather than biologically.3 Similarity in such traits would tend to facilitate communication and understanding, and hence the diffusion and adaptation of complex technological and institutional innovations.

What traits are captured by genetic distance? We argue that, by its very definition, genetic distance is an excellent summary statistic capturing divergence in the whole set of implicit beliefs, customs, habits, biases, conventions, etc. that are transmitted across generations (biologically and/or culturally) with high persistence. In a nutshell, human genetic distance can be viewed as a summary measure of very long-term divergence in intergenerationally transmitted traits across populations. Our key hypothesis is that such long-term (and mainly random) divergence has created barriers to the diffusion of technological and institutional innovations across societies in more recent times. Although we provide a general economic interpretation of genetic distance in terms of barriers to the diffusion of development from the frontier, we remain largely agnostic about specific mechanisms of technology diffusion, as well as about the specific traits and characteristics that create the barriers.

If our interpretation is correct, the relevant measure of genetic distance associated with economic distance between two societies should not be the absolute genetic distance between them, but their relative distance from the world technological frontier. In our empirical analysis we test this central implication using Britain (in the nineteenth century) and the United States (in the twentieth century) as the world technological frontiers. Consistent with our hypothesis, we find that the effect of relative genetic distance on economic distance is positive, and larger than the effect of absolute genetic distance, itself only an imperfect proxy for relative genetic distance to the frontier. We view this as important evidence in support of a barriers effect.
The historical evidence also suggests that the effect of genetic distance on income differences, although always positive and significant since 1500, increased considerably between 1820 and 1870, consistent with a salient role for relative genetic distance during the gradual spread of the Industrial Revolution. More broadly, our interpretation is consistent with the diffusion of economic development as emerging from the formation of a human web, gradually joined by different cultures and societies as a function of their relative distance from the technological and institutional frontier.4

This paper is part of a small but growing set of contributions that use human genetic data in empirical economic analyses.5 Guiso, Sapienza, and Zingales (2004) used genetic distance between European populations as an instrument for trust in trade gravity regressions.6 Giuliano, Spilimbergo, and Tonon (2006) directly study the effect of genetic distance on bilateral trade flows in gravity regressions and argue that the effect of genetic distance on trade volume is sensitive to controlling for geography (in contrast, in our analysis the effect of genetic distance on income differences is robust to geographical controls, including transportation costs). A major difference between our paper and these contributions is our focus on the determinants of income differences rather than trade flows. Moreover, we are the first economists to use worldwide measures of genetic distance across human populations, in addition to European data.

Desmet et al. (2007) document the close relationship between genetic distance and cultural differences and argue that genetic distance can be used to study nation formation in Europe. They show a strong and robust correlation between answers to the World Values Survey (WVS) and genetic distance, finding that European populations that are genetically closer give more similar answers to a set of 430 questions about norms, values, and cultural characteristics included in the 2005 WVS sections on perceptions of life, family, religion, and morals. They also find that the correlation between genetic distance and cultural values remains positive and significant after controlling for linguistic and geographic distances. Their empirical analysis supports our interpretation of genetic distance as a broad measure of differences in intergenerationally transmitted characteristics, including cultural values.

3. Classic references on cultural transmission and evolution are Cavalli-Sforza and Feldman (1981) and Boyd and Richerson (1985). See also Richerson and Boyd (2004). Economic analyses of cultural transmission across generations include Bisin and Verdier (2000, 2001) and others. This issue will be discussed in more detail in Section II of this paper.

4. For a historical overview of the formation of the complex web of exchanges and interactions across human communities going back to the Neolithic period, see McNeill and McNeill (2003).

5. There also exists a different economic literature that uses genetic distances between species to evaluate biodiversity (for example, Weitzman [1992]).

6. In a more recent version of their paper, Guiso, Sapienza, and Zingales no longer use genetic distance, but rely on a new measure of somatic similarities (see our discussion in Section II.B).
Finally, Ashraf and Galor (2008) study the relationship between genetic diversity within a society and economic development in precolonial times, measured by population density, and find a nonmonotonic relation between genetic diversity and population density (a higher population density is associated with intermediate levels of genetic diversity). Their paper shares our focus on economic development, but considers different genetic data and effects (genetic heterogeneity within each society), whereas we study the effect of genetic distance between societies.

In Section II we present a simple framework in which genetic distance captures divergence in characteristics that are transmitted across generations within populations over the long run, and those differences act as barriers to the diffusion of development from the world technological frontier. The section also contains a general taxonomy of the mechanisms linking genetic distance and economic outcomes. Section III presents the data on genetic distance. In Section IV we present and discuss our empirical findings. Section V concludes.

II. A CONCEPTUAL FRAMEWORK

In this section we present an analytic framework linking genetic distance, intergenerationally transmitted traits, and the diffusion of economic development from the technological frontier. Our analysis leads to a testable prediction: Income differences across societies should depend on their relative genetic distance from the technological frontier. In Section IV, we show that the empirical evidence strongly supports this prediction. The main building block of our model is that genetic distance between populations captures the degree of genealogical relatedness of different populations over time. Thus, it can be interpreted as a general metric for average differences in characteristics transmitted across generations.
In this paper we call vertically transmitted characteristics (or vertical characteristics) the set of characteristics passed on across generations within a population over the very long run, that is, over the time horizon along which populations have diverged.7

This leads to our second main idea: differences in vertical characteristics act as barriers to the diffusion of productivity-enhancing innovations across societies.8 We argue that populations that share a more recent common history, and are therefore closer in terms of vertical characteristics, face lower costs and obstacles to adopting each other’s innovations.9 We are interested primarily in the diffusion of economic development in historical times, and especially after the Industrial Revolution. Thus, in our empirical analysis we focus on differences in vertical characteristics as barriers to the diffusion of development from the modern technological frontier.10

II.A. A Simple Model

A stylized model illustrates our ideas in the simplest possible way.11 Consider three periods (o for “origin,” p for “prehistory,” and h for “history”). In period o there exists only one population (population 0). In period p the original population splits into two populations (1 and 2). In period h each of the two populations splits in two separate populations again (population 1 into 1.1 and 1.2, and population 2 into 2.1 and 2.2), as displayed in Figure I.

FIGURE I. Population Tree (population 0 in period o splits into populations 1 and 2 in period p, which split into 1.1, 1.2, 2.1, and 2.2 in period h; horizontal axis: Time)

In this setting the genetic distance $d_g(i, j)$ between population $i$ and population $j$ can be simply measured by the number of periods since they were one population:

$$d_g(1.1, 1.2) = d_g(2.1, 2.2) = 1 \tag{1}$$

and

$$d_g(1.1, 2.1) = d_g(1.1, 2.2) = d_g(1.2, 2.1) = d_g(1.2, 2.2) = 2. \tag{2}$$

For simplicity, all vertical characteristics of a population are summarized as a point on the real line (i.e., a population $i$ has vertical characteristics $v_i$, where $v_i$ is a real number). Populations inherit characteristics from their ancestor populations with variations: a population $i'$ descending from a population $i$ will have characteristics

$$v_{i'} = v_i + \eta_{i'}. \tag{3}$$

7. This terminology is borrowed from the evolutionary literature on cultural transmission (Cavalli-Sforza and Feldman 1981; Boyd and Richerson 1985; Shennan 2002; Richerson and Boyd 2004). Such vertical transmission takes place across generations within a given population, and, in our definition, includes not only direct parent-to-child transmission, but also “oblique” transmission from other genetically related people within the group. Hence, our definition is broader than usages of the term strictly limited to parent-to-child transmission. We thank Robert Boyd for pointing out this distinction to us.

8. Policy-induced barriers to the diffusion of technology are analyzed by Parente and Prescott (1994, 2002). In our framework we interpret barriers more broadly to include all long-term societal differences that are obstacles to the diffusion of development.

9. The idea that differences in long-term societal characteristics may act as barriers to the diffusion of development is stressed in a large literature on the diffusion of innovations, including the classic book by Rogers (1962). For a historical comparative analysis of managerial innovations and performance, see Clark and Wolcott (1999).

10. World technological leadership since the British Industrial Revolution (1700s) has been predominantly associated with Britain and, by the late 1800s, the United States (Brezis, Krugman, and Tsiddon 1993). In the years right before the Industrial Revolution, the technological frontier was probably the Netherlands. According to Maddison (2003), in previous times the regions with the highest levels of income per capita were Italy (around 1500) and China (around 1000). We return to these issues in Section IV.

11. In a previous version of this paper (available upon request), we presented a dynamic micro-founded extension of this model, built on Barro and Sala-i-Martin (1997). In that extension, imitation costs were a function of distance in vertical characteristics from the technological frontier, and hence income differences were a function of relative genetic distance in steady state.
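Equations (1) and (2) measure genetic distance as the number of periods since two populations were one population. On the tree in Figure I, this amounts to walking each population's lineage back to its most recent common ancestor. The Python sketch below encodes Figure I as a hypothetical child-to-parent dictionary. The names `PARENT`, `lineage`, and `genetic_distance` are illustrative. The step count equals elapsed time only because, as in the model, every split takes exactly one period.

```python
# Figure I as a child -> parent map; "0" is the period-o origin population.
PARENT = {"1": "0", "2": "0", "1.1": "1", "1.2": "1", "2.1": "2", "2.2": "2"}


def lineage(pop):
    """Return pop followed by its ancestors, back to the origin population."""
    chain = [pop]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def genetic_distance(i, j):
    """Periods since i and j were one population, as in equations (1)-(2).

    Counting steps back to the most recent common ancestor equals elapsed time
    here because i and j belong to the same period and every split takes one period.
    """
    ancestors_of_j = set(lineage(j))
    for steps, ancestor in enumerate(lineage(i)):
        if ancestor in ancestors_of_j:
            return steps
    raise ValueError("populations share no common ancestor")


for pair in [("1.1", "1.2"), ("2.1", "2.2"), ("1.1", "2.1"), ("1.2", "2.2")]:
    print(pair, genetic_distance(*pair))
```

The four printed distances are 1, 1, 2, and 2, reproducing equations (1) and (2).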
Consider the simplest possible mechanism for variation: vertical change as a random walk. For every population $i'$, $\eta_{i'}$ takes value $\eta > 0$ with probability 1/2 and $-\eta$ with probability 1/2.12 Consequently, the distance in vertical characteristics between two populations, $d_v(i, j) \equiv |v_j - v_i|$, is, on average, increasing in their genetic distance $d_g(i, j)$.13 This captures our first main idea.

Our second main idea can also be captured within this simplified setting. Assume that in periods o and p all populations produce output using the basic technology $Y = A_0 L$, so that all populations have the same income per capita $y_0 = A_0$. In period h a population happens to find a more productive technology $A_1 = A_0 + \Delta$, where $\Delta > 0$.14 Denote this population, the technological frontier, as $f$.15 We assume that populations farther from population $f$ in terms of vertical characteristics face higher barriers to adopting the new technology. To fix ideas, assume that a society $i$ at a vertical distance from the frontier equal to $d_v(i, f)$ can improve its technology only by

$$\Delta_i = \Delta\,[1 - \beta\, d_v(i, f)], \tag{4}$$

where the parameter $\beta > 0$ captures the barriers to the horizontal diffusion of innovations due to distance in vertical characteristics.16 Hence, income per capita in society $i$ is given by

$$y_i = A_0 + \Delta\,[1 - \beta\, d_v(i, f)]. \tag{5}$$

This implies that the economic distance between population $i$ and population $j$, measured by their income difference $d_e(i, j) \equiv |y_i - y_j|$, is a function of their relative vertical distance from the frontier $|d_v(i, f) - d_v(j, f)|$:

$$d_e(i, j) \equiv |y_j - y_i| = \beta\Delta\,|d_v(i, f) - d_v(j, f)|. \tag{6}$$

As we have shown, vertical difference $d_v(i, j)$ and genetic distance $d_g(i, j)$ are positively correlated. Therefore, on average, income differences across societies are increasing in their relative genetic distance from the frontier society. Formally,

$$E\{d_e(i, j) \mid |d_g(i, f) - d_g(j, f)| = 2\} - E\{d_e(i, j) \mid |d_g(i, f) - d_g(j, f)| = 1\} = \frac{\eta\beta\Delta}{3} > 0. \tag{7}$$

12. This simplification is consistent with the molecular-clock interpretation of genetic distance itself. While more complex processes could be considered, this formalization has two advantages: it is economical (“Occam’s razor”), and it illustrates how neutral random changes are sufficient to generate our theoretical predictions.

13. Specifically, in period h the expected difference in vertical characteristics between populations at a genetic distance equal to 2 and populations at a genetic distance equal to 1 is given by $E\{d_v(i, j) \mid d_g(i, j) = 2\} - E\{d_v(i, j) \mid d_g(i, j) = 1\} = \eta/2 > 0$. Of course, this is not a deterministic relationship. Some pairs of populations that are genealogically more distant may end up with more similar vertical characteristics than two more closely related populations, but that outcome is less likely to be observed than the opposite. On average, genetic distance and distance in vertical characteristics go hand in hand.

14. We abstract from the possibility that the likelihood of finding the innovation may itself be a function of a society’s vertical characteristics. Such direct effects of vertical characteristics would strengthen the links between genetic distance and economic outcomes, but are not necessary for our results.

15. The model can be viewed as a very reduced form of a dynamic process in which the frontier economy produces several innovations, including improvements to the innovation process itself, in the spirit of the observation that “the greatest invention of the 19th century was the invention of the method of invention” (Alfred North Whitehead [1931, p. 38], quoted in Howitt and Mayer-Foulkes [2005]).

16. Without loss of generality, we assume that $\beta$ is lower than 1/2. Alternatively, the formula could be rewritten as $\Delta_i = \max\{\Delta[1 - \beta\, d_v(i, f)], 0\}$.
This result is intuitive. As we increase relative genetic distance from the frontier, the expected income gap increases. The size of the effect is a positive function of three things: the extent of divergence in vertically transmitted characteristics ($\eta$), the extent to which this divergence constitutes a barrier to the horizontal diffusion of innovations ($\beta$), and the size of the improvement in productivity at the frontier ($\Delta$).

Our framework predicts a positive correlation between economic distance $|y_j - y_i|$ and relative genetic distance from the frontier $|d_g(i, f) - d_g(j, f)|$. It also accounts for a positive correlation between economic distance and simple genetic distance $d_g(i, j)$, as long as $|d_g(i, f) - d_g(j, f)|$ and $d_g(i, j)$ are positively correlated.17 At the same time, our theory predicts that relative genetic distance from the frontier should have a stronger impact on economic distance than absolute genetic distance, because relative genetic distance is a more accurate proxy for relative distance from the frontier in terms of vertical characteristics. In fact, the expected economic distance associated with an absolute genetic distance $d_g(i, j) = 1$ is $E\{d_e(i, j) \mid d_g(i, j) = 1\} = \eta\beta\Delta$. The expected economic distance associated with an equivalent level of relative genetic distance, $|d_g(i, f) - d_g(j, f)| = 1$, is higher:18

$$E\{d_e(i, j) \mid |d_g(i, f) - d_g(j, f)| = 1\} = \frac{7\eta\beta\Delta}{6} > E\{d_e(i, j) \mid d_g(i, j) = 1\}. \tag{8}$$

In summary, our theory has the following testable implications:

IMPLICATION 1. Relative genetic distance from the frontier is positively correlated with differences in income per capita (economic distance).

17. It is easy to verify that the two measures are positively correlated in our theoretical framework. More importantly, relative genetic distance from the frontier and absolute genetic distance are also positively correlated in the actual data, as we show in Section IV. Our framework explains the observed positive correlation between economic distance and absolute genetic distance in the data: absolute genetic distance is an imperfect proxy for the economically relevant variable, relative genetic distance.

18. An analogous relationship exists between $E\{d_e(i, j) \mid |d_g(i, f) - d_g(j, f)| = 2\}$ and $E\{d_e(i, j) \mid d_g(i, j) = 2\}$.
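The random walk in the model has only six binary shocks, one each for populations 1, 2, 1.1, 1.2, 2.1, and 2.2. That makes it possible to check the conditional expectations in footnote 13 and in equations (7) and (8) by enumerating all 2^6 = 64 equally likely outcomes. The Python sketch below does this with exact fractions. It rests on these assumptions:

- By symmetry, population 1.1 is the frontier $f$.
- $v_0 = 0$ and $A_0 = 1$.
- Each conditional expectation averages over all pairs of period-h populations that satisfy the condition.
- The parameter values are illustrative and chosen so that $1 - \beta d_v(i, f)$ stays positive.

```python
from fractions import Fraction
from itertools import product

# Figure I: period-h populations and their period-p parents.
PARENT = {"1.1": "1", "1.2": "1", "2.1": "2", "2.2": "2"}
LEAVES = ["1.1", "1.2", "2.1", "2.2"]
FRONTIER = "1.1"  # assumption: by symmetry any period-h population can be f


def d_g(i, j):
    """Genetic distance from equations (1)-(2)."""
    if i == j:
        return 0
    return 1 if PARENT[i] == PARENT[j] else 2


def conditional_means(eta, beta, delta, a0=Fraction(1)):
    """Exact conditional means over all 2**6 random-walk draws and all period-h pairs."""
    nodes = ["1", "2"] + LEAVES
    pairs = [(i, j) for k, i in enumerate(LEAVES) for j in LEAVES[k + 1:]]
    totals, counts = {}, {}
    for draw in product((eta, -eta), repeat=len(nodes)):
        shock = dict(zip(nodes, draw))
        # Equation (3) applied twice, starting from v_0 = 0.
        v = {x: shock[PARENT[x]] + shock[x] for x in LEAVES}
        dv_f = {x: abs(v[x] - v[FRONTIER]) for x in LEAVES}
        y = {x: a0 + delta * (1 - beta * dv_f[x]) for x in LEAVES}  # equation (5)
        for i, j in pairs:
            rel = abs(d_g(i, FRONTIER) - d_g(j, FRONTIER))
            d_e = abs(y[i] - y[j])
            observations = [
                (("d_v | d_g", d_g(i, j)), abs(v[i] - v[j])),
                (("d_e | d_g", d_g(i, j)), d_e),
                (("d_e | rel", rel), d_e),
            ]
            for key, value in observations:
                totals[key] = totals.get(key, Fraction(0)) + value
                counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


eta, beta, delta = Fraction(1), Fraction(1, 10), Fraction(1)
m = conditional_means(eta, beta, delta)
unit = eta * beta * delta
print(m[("d_v | d_g", 2)] - m[("d_v | d_g", 1)])            # footnote 13: eta/2
print((m[("d_e | rel", 2)] - m[("d_e | rel", 1)]) / unit)   # equation (7), in units of eta*beta*Delta
print(m[("d_e | rel", 1)] / unit, m[("d_e | d_g", 1)] / unit)  # equation (8), same units
```

With these inputs the printed lines read `1/2`, `1/3`, and `7/6 1`. These match footnote 13 and the multiples of $\eta\beta\Delta$ in equations (7) and (8). Exact fractions are used instead of a Monte Carlo simulation so that the comparison is free of sampling noise.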
IMPLICATION 2. The effect on income differences associated with relative genetic distance from the frontier is larger than the effect associated with absolute genetic distance.

As we will see in Section IV, both predictions are consistent with the empirical evidence.

II.B. A General Taxonomy

To clarify the nature of the links between genetic distance and income differences, it is useful to introduce a broader classification of different mechanisms through which the transmission of characteristics across generations may in principle affect economic outcomes. In general, traits can be transmitted across generations through DNA (call it “genetic transmission,” or GT; e.g., eye color) or through pure cultural interactions (call it “cultural transmission,” or CT; e.g., a specific language). Moreover, vertical characteristics, whether passed on through GT or CT, may affect income differences because of a direct (D) effect on productivity or because they constitute barriers (B) to the transmission of innovations across populations. There are four possible combinations of mechanisms through which intergenerationally transmitted characteristics may affect income differences. The following chart summarizes the four possibilities:

                      Genetic transmission (GT)   Cultural transmission (CT)
Direct effect (D)     Quadrant I                  Quadrant III
Barrier effect (B)    Quadrant II                 Quadrant IV
For instance, genetic traits affecting the trade-off between quality and quantity of children in the theoretical framework proposed by Galor and Moav (2002) would be examples of GT direct effects (Quadrant I).19 GT barrier effects (Quadrant II) could stem from visible characteristics (say, physical appearance) that do not affect productivity directly, but introduce barriers to the diffusion of innovations by reducing exchanges and learning across populations that perceive each other as different. This effect is related to the already cited study by Guiso, Sapienza, and Zingales (2004), who argue that differences in physical characteristics affect the extent of trust across populations, and that trust affects bilateral trade between different societies. Consistent with this view, in the most recent version of their paper these authors use a measure of somatic similarity to instrument for trust, on the grounds that people tend to trust people who look similar to them physically.

Direct economic effects of cultural characteristics have been emphasized in a vast sociological literature that goes back at least to Max Weber. A recent empirical study of the relationship between cultural values and economic outcomes that is consistent with the mechanisms of Quadrant III is provided by Tabellini (2005). Guiso, Sapienza, and Zingales (2006) define culture as “customary beliefs and values that ethnic, religious and social groups transmit fairly unchanged from generation to generation” and provide an extensive discussion of the links between cultural variables and economic outcomes.

The link between differences in vertically transmitted characteristics (including cultural characteristics) and barriers to the diffusion of innovations, as in Quadrant IV, is at the core of our own model. In the model presented in Section II.A, differences in neutral characteristics (traits that do not have a direct effect on productivity) explain income differences by acting as barriers to the diffusion of innovation across populations.

The distinction between GT and CT is useful to fix ideas, but is not a clear-cut dichotomy. In fact, this distinction (related to the distinction between nature and nurture), if taken too literally, may be misleading from an economic as well as from a biological perspective. Generally, the economic effects of human characteristics are likely to result from interactions of cultural and genetic factors, with the effects of genetic characteristics on economic outcomes changing over space and time depending on cultural characteristics, and vice versa. To illustrate this point, consider differences across individuals within a given population (say, the United States).

19. For a discussion of related ideas, see also the recent book by Clark (2007) on the causes of the Industrial Revolution.
Consider a clearly genetic characteristic, for instance having two X chromosomes, the purely genetic characteristic associated with the female sex. This characteristic is likely to have had very different effects on a person’s income and other economic outcomes in the year 1900 and in the year 2000, because of changes in culturally transmitted characteristics over the century. This is a case where the impact of genes on outcomes varies with a change in cultural characteristics.20 By the same token, one can think of the differential impact of a given cultural characteristic (say, the habit of drinking alcohol) on individuals with different genetic characteristics (say, genetic variation in alcohol dehydrogenase, the alcohol-metabolizing enzyme). An example of a complex interaction in which culture affects genes is the spread of the gene for lactose tolerance in populations that domesticated cows and goats. Hence, in interpreting our empirical results we do not dwell much on the distinction between genetic and cultural transmission of traits, but instead interpret genetic distance as an overall measure of differences in the whole set of intergenerationally transmitted characteristics.21
20. This is a variation on an example by Alison Gopnik in her comment to the Pinker vs. Spelke debate at http://www.edge.org/discourse/science-gender.html#ag. Pinker’s response is also available at http://www.edge.org/discourse/sciencegender.html.
21. That said, we do find clues pointing to cultural transmission, rather than purely biological transmission, as a likely mechanism behind our results. For instance, we find large effects of genetic distance on income differences within Europe, among populations that are geographically close, have shared very similar environments, and have had a very short time to diverge genetically. The view that cultural transmission trumps genetic transmission in explaining differences within human populations is standard among geneticists and anthropologists. For nontechnical discussions of these issues, see Diamond (1992), Cavalli-Sforza and Cavalli-Sforza (1995), Diamond (1997), and Richerson and Boyd (2004).
III. THE GENETIC DISTANCE DATA
III.A. Measuring Genetic Distance
Because the data on genetic distance are not commonly used in the economics literature, we describe them in some detail. Genetic distance measures genetic differences between two populations. The basic unit of analysis is the allele, which is a particular form taken by a gene.22 By sampling populations for specific genes that can take different forms, geneticists have compiled data on allele frequencies.23 Differences in allele frequencies are the basis for computing summary measures of distance between populations. Following Cavalli-Sforza, Menozzi, and Piazza (1994), we use measures of FST distance, also known as “coancestor coefficients.” FST distances, like most measures of genetic differences, are based on indices of heterozygosity, the probability that two alleles at a given locus selected at random from two populations will be different. FST takes a value equal to zero if and only if the allele distributions are identical across the two populations, whereas it is positive when the allele distributions differ. A higher FST is associated with larger differences.24
Measures of genetic distance can be used to reconstruct phylogenies (or family trees) of human populations. FST is strongly related to how long two populations have been isolated from each other.25 When two populations split apart, their genes can start to change as a result either of random genetic drift or natural selection. When calculating genetic distances to study population history and phylogenesis, geneticists concentrate on neutral characteristics that are not affected by strong directional selection, but only by random drift.26 Importantly, our measures of genetic distance are based on such neutral markers only, and not on selected traits. When populations become separated, the process of random drift will take them in different directions, raising their genetic distance. The longer the period of separation, the greater the genetic distance becomes. If drift rates are constant, genetic distance can be used as a molecular clock: the time elapsed since two populations separated can be measured by the genetic distance between them.27 Consequently, FST is a measure of distance to the most recent common ancestors of two populations, or, equivalently, of their degree of genealogical relatedness.
To summarize, we use FST distance as a measure of genealogical relatedness between populations. A larger FST distance reflects a longer separation between populations, and hence, on average, a larger difference in vertical characteristics.
22. A gene is commonly defined as a DNA sequence that encodes for a protein. The genetic data in Cavalli-Sforza, Menozzi, and Piazza (1994) have been obtained from “classical analysis,” which focuses on protein polymorphism. More recent approaches look directly at the DNA. So far those studies, which include the Human Genome Diversity Project (http://www.stanford.edu/group/morrinst/hgdp.html) and the International HapMap Project (http://www.hapmap.org/), have confirmed the results from classical protein analysis, but are not yet available for extensive cross-regional analysis.
23. Allele frequencies for various genes and for most populations in the world can be found at http://alfred.med.yale.edu/.
24. Appendix I provides an illustration of the construction of FST for the simple case of two populations of equal size, and one gene that can take only two forms (i.e., two alleles).
25. Isolation here refers to the bulk of the genetic heritage of a given population. As stressed by Cavalli-Sforza, Menozzi, and Piazza (1994), small amounts of intermixing between members of different populations do not affect measured genetic distance.
26. Cavalli-Sforza, Menozzi, and Piazza (1994, p. 36). The classic reference for the neutral theory of molecular evolution is Kimura (1968). For more details on the neutral theory, the molecular clock hypothesis, and the construction and interpretation of measures of genetic distance, see Jobling, Hurles, and Tyler-Smith (2004).
27. When genetic distance is based on neutral markers, and populations are sufficiently large, geneticists have shown that drift rates are indeed constant (very small populations are generally subject to faster random genetic drift).
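Footnote 24 refers to Appendix I for the construction of FST in the simplest case of two equal-size populations and one gene with two alleles. The sketch below illustrates that case under an assumption of ours: it uses the standard heterozygosity-based definition FST = (H_T − H_S)/H_T, where H_S is the average expected heterozygosity within the two populations and H_T is the expected heterozygosity of the pooled population. Appendix I may present the computation differently, and the function name `fst_two_pops` is hypothetical.

```python
def fst_two_pops(p1: float, p2: float) -> float:
    """FST for two equal-size populations and one gene with two alleles.

    p1, p2: frequency of the same allele in population 1 and population 2.
    Assumed definition: FST = (H_T - H_S) / H_T (heterozygosity-based).
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    # Expected heterozygosity within each population, averaged over the two.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Equal sizes: the pooled allele frequency is the simple average.
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0.0:
        # Both populations fixed for the same allele: identical distributions.
        return 0.0
    return (h_t - h_s) / h_t


print(fst_two_pops(0.5, 0.5))  # identical distributions -> 0.0
print(fst_two_pops(0.6, 0.4))  # small difference -> about 0.04
print(fst_two_pops(1.0, 0.0))  # completely different alleles -> 1.0
```

In this special case the formula reduces to (p1 − p2)² / (4 p̄ (1 − p̄)), with p̄ the average allele frequency, which makes explicit that FST is zero exactly when the two distributions coincide and positive otherwise. The distances used in the paper combine information across many genes (120 alleles in the world sample), which this single-gene sketch does not attempt.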
III.B. The World Sample
The genetic distance data are from Cavalli-Sforza, Menozzi, and Piazza (1994, pp. 75–76). Our main focus is on the set of 42 world populations for which they report all bilateral distances, computed from 120 alleles.28 These populations are aggregated from subpopulations characterized by a high level of genetic similarity. However, measures of bilateral distance among these subpopulations are available only regionally, not for the world as a whole. Among the set of 42 world populations, the greatest genetic distance observed is between Mbuti Pygmies and Papua New Guineans, where the FST distance is 0.4573, and the smallest is between the Danish and the English, where the genetic distance is 0.0021.29 The mean genetic distance among the 861 available pairs is 0.1338. Figure II, from Cavalli-Sforza, Menozzi, and Piazza (1994, Figure 2.3.2B, p. 78), is a phylogenetic tree illustrating the process by which different human populations have split apart over time.30 Such phylogenetic trees, constructed from genetic distance data, are the population analogs of family trees for individuals.
28. Cavalli-Sforza, Menozzi, and Piazza (1994) also provide a different measure of genetic distance (Nei’s distance). FST and Nei’s distance have slightly different theoretical properties, but their correlation (93.9%) is very high (Table II). We show below that the choice of measure does not affect our results.
29. Among the more disaggregated data for Europe that we also gathered, the smallest genetic distance (equal to 0.0009) is between the Dutch and the Danish, and the largest (equal to 0.0667) is between the Lapps and the Sardinians. The mean genetic distance across European populations is 0.013. Genetic distances are roughly ten times smaller on average across populations of Europe than in the world data set.
30. The figure was constructed to maximize the correlation between Euclidean distances to common nodes, measured along the branches, and the FST genetic distance computed directly from allele frequencies. Hence, the tree diagram is a simplified summary of (but not a substitute for) the matrix of genetic distances between populations, organized by clusters. It is important to note that the organization of populations by tree does not imply that genetic distance establishes a linear relation among all of them, either along the x-axis (abscissa) or along the y-axis (ordinate). The abscissa at the bottom of the diagram can be used to read the genetic distance between pairs of populations in the tree only when they share direct common ancestors. For example, the genetic distance between New Guineans and Australians can be calculated by reading the position of the node that separates the two populations, which is approximately at 0.1. It is also possible to measure average genetic distance between clusters of populations by reading the position of the node that separates two clusters on the abscissa. For example, the average genetic distance between African populations and the rest of the world is approximately 0.2. However, to read the genetic distance between any pair of populations, one should use (as we do) the complete matrix of genetic distances, which is provided in Cavalli-Sforza, Menozzi, and Piazza (1994, Table 2.3.1A, p. 75).
FIGURE II Genetic Distance among 42 Populations. Source: Cavalli-Sforza, Menozzi, and Piazza (1994).
Genetic distance data are available at the population level, not at the country level. It was thus necessary to match populations to countries. We did so using ethnic composition data by country from Alesina et al. (2003). It was possible to match ethnic group labels with population labels from Cavalli-Sforza, Menozzi, and Piazza (1994), using their Appendices 2 and 3 to identify the ethnic groups sampled to obtain genetic distances. Obviously, many countries feature several ethnic groups. Alesina et al. list 1,120 country-ethnic group categories. We matched virtually all of these categories to some genetic group. The only groups that were not matched were the ones that were not labeled in Alesina et al., usually residual groups labeled “other” that represented a small share of a country’s population. As an example, the Alesina et al. (2003) data on ethnic groups show India as composed of 72% “Indo-Aryans” and 25% “Dravidians.” These groups were matched, respectively, to the Cavalli-Sforza groups labeled “Indians” and “Dravidians” (i.e., S.E. Indian in Figure II). The residual category “India Other” (3% of the population) was not matched to any genetic group. Another example is Italy, where the ethnic groups labeled “Italian” and “Rhaetians” (a combined 95.4% of the population) were matched to the genetic category “Italian,” whereas the “Sardinians” ethnic group (2.7% of the population) was matched to the “Sardinian” genetic group.31 This match served as the basis for constructing measures of genetic distance between countries, rather than groups. We constructed two such measures. The first was the distance between the plurality ethnic groups of each country in a pair, that is, the groups with the largest shares of each country’s population. In the examples above, that means that the plurality genetic distance between India and Italy is the genetic distance between the Indian and the Italian genetic groups (FST = 0.026).
This resulted in a data set of 21,321 pairs of countries (207 underlying countries and dependencies) with available genetic distance data.32 The second was a measure of weighted genetic distance. Many countries, such as the United States and Australia, are made up of subpopulations that are genetically distant, and for which both genetic distance data and data on the shares of each genetic group are available.
31. The complete match of genetic groups to ethnic groups, and in turn to countries, is available upon request.
32. For 27 countries, the data on group shares were missing from Alesina et al.’s (2003) database, but a match to genetic groups based on plurality groups was possible through information from the Encyclopaedia Britannica. Thus, the weighted measure of genetic distance covers 16,110 pairs, or 180 countries, 27 fewer than the plurality match.
Assume that country 1 contains populations $i = 1, \ldots, I$ and country 2 contains populations $j = 1, \ldots, J$; denote by $s_{1i}$ the share of population $i$ in country 1 (similarly for country 2) and by $d_{ij}$ the genetic distance between populations $i$ and $j$. The weighted FST genetic distance between countries 1 and 2 is then
$$F_{ST}^{W} = \sum_{i=1}^{I} \sum_{j=1}^{J} \left( s_{1i} \times s_{2j} \times d_{ij} \right), \tag{9}$$
where $s_{ki}$ is the share of group $i$ in country $k$ and $d_{ij}$ is the FST genetic distance between groups $i$ and $j$.33 The interpretation of this measure is straightforward: it represents the expected genetic distance between two randomly selected individuals, one from each country. Weighted genetic distance is very highly correlated with genetic distance based on dominant groups (the correlation is 94%), so for practical purposes it makes little difference which one we use. We will use the weighted FST distance as the baseline measure throughout this study, as it is a more precise measure of average genetic distance between countries.
Errors in matching populations to ethnic groups should lead us to understate the correlation between genetic distance and income differences. Several regions may be particularly prone to matching errors. One is Latin America, where it is sometimes difficult to identify whether populations are predominantly of European descent or of Amerindian descent. This is particularly problematic in countries with large proportions of Mestizos, that is, populations of mixed descent, such as Colombia (in this specific case the country’s dominant group was matched to the South Amerindian category). Another is Europe, where countries can only be matched to one of four genetic groups (Danish, English, Greek, and Italian). As a strict rule, we matched countries to groups that were the closest genetically to that country’s population, using data on regional genetic distance from Cavalli-Sforza, Menozzi, and Piazza (1994).
The ethnic composition in Alesina et al. (2003) refers to the 1990s. This is potentially endogenous with respect to current income differences if the latter are persistent and if areas with high income potential tended to attract European immigration since 1500. This would be the case, for example, under the view that the Europeans settled in the New World due to a favorable geographical environment.34 To construct genetic distance between countries as of 1500, we also mapped populations to countries using their ethnic composition as of 1500, prior to the major colonizations of modern times. Thus, for instance, although the United States is classified as predominantly populated with English people for the current match, it is classified as being populated with North Amerindians for the 1500 match. This distinction mostly affected countries that were colonized by Europeans since 1500 to the point where the dominant ethnic group is now of European descent (New Zealand, Australia, North America, and some countries in Latin America). Because we do not have data on ethnic composition going back to 1500, the corresponding match refers only to plurality groups. Genetic distance in 1500 can be used as a convenient instrument for current genetic distance. The matching of countries to populations for 1500 is also more straightforward than for the current period, because Cavalli-Sforza, Menozzi, and Piazza (1994) attempted to sample populations as they were in 1500, likely reducing the extent of measurement error.
33. When some ethnic category was not matched to a genetic group due to a missing ethnic label in the Alesina et al. (2003) source data, the population shares were rescaled to sum to 1 for the purpose of calculating weighted distances. Thus, for instance, the weighted genetic distance between India and Italy was calculated as $F_{ST}^{W} = (0.972 \times 0.258 \times 0.0402) + (0.972 \times 0.742 \times 0.0261) + (0.028 \times 0.258 \times 0.0531) + (0.028 \times 0.742 \times 0.0449) = 0.0302$.
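The two country-level measures described in Section III.B, the plurality distance and the weighted distance of equation (9) with shares rescaled as in footnote 33, can be sketched in Python using only the India and Italy figures quoted in the text. This is a minimal illustration, not the authors' code: the dictionary layout and all function names are hypothetical, the group shares are the percentages from the Alesina et al. (2003) example, and the four bilateral distances are those used in footnote 33.

```python
# Hypothetical data layout. Shares are the percentages quoted in the text for the
# matched groups only ("India Other", with no genetic match, is left out); the
# bilateral FST distances are the four values used in footnote 33.
FST = {
    frozenset({"Italian", "Indian"}): 0.0261,
    frozenset({"Italian", "Dravidian"}): 0.0402,
    frozenset({"Sardinian", "Indian"}): 0.0449,
    frozenset({"Sardinian", "Dravidian"}): 0.0531,
}


def group_distance(a: str, b: str) -> float:
    """FST distance between two genetic groups (zero for the same group)."""
    return 0.0 if a == b else FST[frozenset({a, b})]


def rescale(shares: dict[str, float]) -> dict[str, float]:
    """Rescale matched group shares so that they sum to 1 (footnote 33)."""
    total = sum(shares.values())
    return {group: s / total for group, s in shares.items()}


def plurality_distance(c1: dict[str, float], c2: dict[str, float]) -> float:
    """Distance between the largest (plurality) groups of the two countries."""
    return group_distance(max(c1, key=c1.get), max(c2, key=c2.get))


def weighted_distance(c1: dict[str, float], c2: dict[str, float]) -> float:
    """Equation (9): sum over group pairs of s_1i * s_2j * d_ij."""
    s1, s2 = rescale(c1), rescale(c2)
    return sum(s1[i] * s2[j] * group_distance(i, j) for i in s1 for j in s2)


india = {"Indian": 72.0, "Dravidian": 25.0}  # Indo-Aryans -> Indian, Dravidians -> Dravidian
italy = {"Italian": 95.4, "Sardinian": 2.7}  # Italian + Rhaetians -> Italian

print(round(plurality_distance(india, italy), 4))  # 0.0261
print(round(weighted_distance(india, italy), 4))   # 0.0302
```

Because rescaling happens inside `weighted_distance`, raw percentages can be passed directly, and unmatched residual categories are handled simply by leaving them out of the input. The result reproduces the 0.0302 reported in footnote 33, and the plurality value matches the FST = 0.026 quoted for India and Italy.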
34. In fact, income differences are not very persistent for a long time horizon such as this; see Acemoglu, Johnson, and Robinson (2002). Our own data show that pairwise log income differences in 1500 are uncorrelated with the 1995 series in the common sample (Table II).
III.C. The European Sample
Cavalli-Sforza, Menozzi, and Piazza (1994) also present matrices of genetic distance among populations within several regions. These submatrices cannot be merged with the world data, because they are based on sets of underlying genes distinct from the 120 genes used for the 42 populations in the world sample, and because the genetic distances between most groups in the regional samples and in the world sample are unavailable. They can, however, be used separately. We assembled a data set of genetic distances between 26 European populations, a much finer classification than the world sample, which only featured four distinct (nonminority) European populations (English, Danish, Italian, and Greek).35 Matching populations to countries is more straightforward for the European sample than for the world sample, because the choice of sampled European populations generally corresponds to nation-state boundaries. This should reduce the incidence of measurement error. The populations were matched to 26 countries, resulting in 325 country pairs.36 The largest FST genetic distance among those pairs was 0.032, between Iceland and Slovenia. The smallest, among countries matched to distinct genetic groups, was between Denmark and the Netherlands (FST = 0.0009).
35. Minority populations in the world sample also include Basque, Lapp, and Sardinian.
36. These 26 countries are Austria, Belgium, Croatia, Czech Republic, Denmark, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Italy, Macedonia, the Netherlands, Norway, Poland, Portugal, Russian Federation, Slovak Republic, Slovenia, Spain, Sweden, Switzerland, the United Kingdom, and Serbia/Montenegro. The Basque, Lapp, and Sardinian populations were not matched to any country, and some countries were matched to the same groups (Croatia, Slovenia, Macedonia, and Serbia/Montenegro were all matched to the Yugoslavian population, whereas the Czech and Slovak Republics were both matched to the Czech population).
IV. THE EMPIRICS OF INCOME DIFFERENCES
In this section we test the empirical implications of our model. We investigate the relationship between genetic distance and economic distance. Genetic distance is considered both relative to the technological frontier and in absolute terms. In line with our theory, we use log income per capita as a metric of economic performance. The data on per capita income are purchasing-power-parity-adjusted data from the World Bank, for the year 1995.37
37. We also used data from the Penn World Tables version 6.1 (Heston, Summers, and Aten 2002), which made little difference in the results. We focus on the World Bank data for 1995, as this allows us to maximize the number of countries in our sample.
IV.A. Genetic Distance to the Frontier
We start with a simple descriptive approach. Does a country’s genetic distance to the world technological frontier correlate with its income level? To investigate this hypothesis, we run income level regressions, for now confining our attention to the world sample, where we have data on all our variables for 137 countries. We consider the United States as the technological frontier in 1995. We measure distance to the United States using our weighted measure, which is more appropriate because the United States is a genetically diverse country (variation in this measure is dominated by distance to the English population). Table I presents the results. In column (1), genetic distance to the United States is entered alone, and the coefficient has the expected negative sign and is highly significant statistically, with a t-statistic of about 9.3. In this specification, genetic distance entered alone accounts for 39% of the variation in log income levels. Figure III displays the univariate results of column (1) graphically. Columns (2) and (3) add several controls for geographic distance from the United States and transport costs (column (2)), as well as linguistic and religious differences (column (3)). We will say a lot more about these control variables below, but for now it suffices to note that the coefficient on genetic distance is barely affected by the inclusion of the geographic distance controls, and that its magnitude is reduced by about 20% by including linguistic and religious distance. The latter finding is consistent with our interpretation of genetic distance as capturing a broad set of vertical characteristics, including but not limited to language and religion. We return to this important topic in Section IV.F.
TABLE I
INCOME LEVEL REGRESSIONS, WORLD DATA SET
Columns: (1) Univariate; (2) Add geographic distance; (3) Add linguistic and religious distance
Variable                                                                   (1)                 (2)                 (3)
FST genetic distance to the United States, weighted                        −12.906 (1.383)∗∗   −12.523 (1.558)∗∗   −10.245 (1.567)∗∗
Absolute difference in latitude from the United States                                         1.970 (0.868)∗∗     1.518 (0.827)∗
Absolute difference in longitude from the United States                                        0.438 (0.454)       0.786 (0.401)∗
Geodesic distance from the United States (1,000s of km)                                        −0.179 (0.075)∗∗    −0.191 (0.071)∗∗
=1 for contiguity with the United States                                                       1.055 (0.300)∗∗     0.452 (0.390)
=1 if the country is an island                                                                 0.505 (0.397)       0.362 (0.483)
=1 if the country is landlocked                                                                −0.384 (0.206)∗     −0.410 (0.198)∗∗
=1 if the country shares at least one sea or ocean with the United States                      −0.201 (0.197)      −0.080 (0.171)
Freight rate to northeastern United States (surface transport)                                 3.460 (2.507)       5.794 (2.816)∗∗
Linguistic distance to the United States, weighted                                                                 −0.520 (0.648)
Religious distance to the United States, weighted                                                                  −2.875 (0.591)∗∗
Constant                                                                   9.421 (0.149)∗∗     8.876 (0.536)∗∗     10.499 (0.751)∗∗
Observations                                                               137                 137                 137
Adjusted R2                                                                .39                 .46                 .53
Note. Dependent variable: log income per capita 1995. Robust standard errors in parentheses.
∗ Significant at 10%. ∗∗ Significant at 5%.
FIGURE III Log Income in 1995 and Genetic Distance to the United States (scatter plot of countries labeled by three-letter codes; vertical axis: log per capita income, 1995; horizontal axis: FST genetic distance to the United States, weighted)
IV.B. Bilateral Approach
To generalize the results of the previous section, we consider a specification in which the absolute difference in income between pairs of countries is regressed on measures of distance between the countries in each pair. In addition to being closer to our theoretical specification, this has two advantages. First, we can now investigate the correlation between absolute genetic distance and income differences. (Section II.A led to predictions about the sign of this correlation.) Second, we can make more efficient use of a wealth of bilateral distance data as regressors. We will use this bilateral approach for the rest of this paper.
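As an arithmetic check on the Table I figures discussed in Section IV.A, the sketch below recomputes the column (1) t-statistic and the proportional change in the genetic-distance coefficient across columns from the published estimates. It re-estimates nothing; the variable and function names are ours.

```python
# Coefficients on weighted FST distance to the United States, copied from Table I.
coef_col1, se_col1 = -12.906, 1.383  # (1) univariate
coef_col2 = -12.523                  # (2) adds geographic distance
coef_col3 = -10.245                  # (3) adds linguistic and religious distance


def shrinkage(new: float, base: float) -> float:
    """Proportional reduction in a coefficient's magnitude relative to a baseline."""
    return 1 - abs(new) / abs(base)


t_col1 = coef_col1 / se_col1
print(f"t-statistic, column (1): {t_col1:.2f}")              # -9.33
print(f"(1) -> (2): {shrinkage(coef_col2, coef_col1):.1%}")  # 3.0%
print(f"(1) -> (3): {shrinkage(coef_col3, coef_col1):.1%}")  # 20.6%
print(f"(2) -> (3): {shrinkage(coef_col3, coef_col2):.1%}")  # 18.2%
```

The roughly 20% reduction cited in the text corresponds to comparing column (3) with the univariate column (1); measured against column (2), the reduction is about 18%.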
We computed income differences between all pairs of countries for which income and other data were available, that is, 9,316 pairs (based on 137 underlying countries) in the world sample, and 325 pairs (based on 26 underlying countries) in the Europe sample. Define $G^{D}_{ij}$ as the absolute genetic distance between countries $i$ and $j$. Denote as $G^{R}_{ij}$ the genetic distance between $i$ and $j$ relative to the technological frontier (in most of what follows, the technological frontier is the United States). Then, by definition, $G^{R}_{ij} = |G^{D}_{i,US} - G^{D}_{j,US}|$.38 Our baseline specifications are
$$|\log y_i - \log y_j| = \beta_0 + \beta_1 G^{D}_{ij} + \beta_2 X_{ij} + \varepsilon_{ij} \tag{10}$$
and
$$|\log y_i - \log y_j| = \gamma_0 + \gamma_1 G^{R}_{ij} + \gamma_2 X_{ij} + \nu_{ij}, \tag{11}$$
where $X_{ij}$ is a set of measures of geographic and cultural distance and $\varepsilon_{ij}$ and $\nu_{ij}$ are disturbance terms.39 By using income differences rather than a single country’s income level on the left-hand side, we can use bilateral measures of distance between countries on the right-hand side. Our regression is not directional: our specifications are not simply obtained by differencing level regressions across pairs of countries.40 We should also stress that our specifications are reduced forms. Differences in income are presumably the result of differences in institutions, technologies, human capital, savings rates, etc., all of which are possibly endogenous with respect to income differences, and themselves a function of geographic and human barriers.
38. Absolute and relative genetic distance are algebraically the same when one of the two countries in a pair is the frontier economy, and the two measures are also closely correlated when pairs involving one country that is very close genetically to the frontier economy are considered. The measures differ most for countries that are genetically far from each other (e.g., Ethiopia and Nigeria) but roughly equally distant from the frontier; in this case absolute distance is large, and relative distance is small. Relative genetic distance is meant to capture the fact that per se the distance between Nigeria and Ethiopia does not matter in explaining their income difference, because they are unlikely to learn frontier technologies from each other. Rather, what matters is their relative distance from the United States.
39. We also estimated an alternative specification where the distance measures were all entered in logs. This did not lead to appreciable differences in the economic magnitude or statistical significance of any of the estimates. Because several countries were matched to the same genetic group, so that the corresponding pairs had a genetic distance of zero, taking logs resulted in the loss of valuable observations, so we omit these results here.
40. Our methodology is as much akin to gravity regressions in the empirical trade literature as it is to levels or growth regressions in the literature on comparative development.
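The construction of the bilateral data set (one observation per unordered pair of countries, with the dependent variable $|\log y_i - \log y_j|$, the absolute distance $G^{D}_{ij}$, and the relative distance $G^{R}_{ij}$) can be sketched as follows. All country codes, incomes, and distances are hypothetical illustrative values, not the paper's data; the pair BBB and CCC mimics the Ethiopia and Nigeria case of footnote 38, with a large absolute distance but a small relative distance.

```python
from itertools import combinations
from math import log

FRONTIER = "USA"

# Hypothetical illustrative inputs, NOT the paper's data.
income = {"USA": 30000.0, "AAA": 12000.0, "BBB": 1500.0, "CCC": 900.0}
dist_to_frontier = {"USA": 0.0, "AAA": 0.03, "BBB": 0.15, "CCC": 0.16}
# Absolute FST distances between non-frontier countries (hypothetical).
other_dist = {
    frozenset({"AAA", "BBB"}): 0.14,
    frozenset({"AAA", "CCC"}): 0.15,
    frozenset({"BBB", "CCC"}): 0.12,
}


def absolute_distance(i: str, j: str) -> float:
    """G^D_ij; for pairs that include the frontier it is the distance to the frontier."""
    if FRONTIER in (i, j):
        other = j if i == FRONTIER else i
        return dist_to_frontier[other]
    return other_dist[frozenset({i, j})]


def bilateral_rows(countries) -> list[dict]:
    """One row per unordered pair, so n countries give n * (n - 1) / 2 rows."""
    rows = []
    for i, j in combinations(sorted(countries), 2):
        rows.append({
            "pair": (i, j),
            "abs_log_income_diff": abs(log(income[i]) - log(income[j])),
            "G_D": absolute_distance(i, j),
            "G_R": abs(dist_to_frontier[i] - dist_to_frontier[j]),
        })
    return rows


for row in bilateral_rows(income):
    print(row["pair"], round(row["abs_log_income_diff"], 3), row["G_D"], round(row["G_R"], 3))
```

With n countries the loop yields n(n − 1)/2 rows, which is how 137 countries produce 9,316 pairs and 26 countries produce 325 pairs. Pairs that include the frontier have identical absolute and relative distances, as footnote 38 notes.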
Before turning to the results, we must address a technical point regarding the disturbances $\varepsilon_{ij}$ and $\nu_{ij}$. In principle, if one is willing to assume that the measures of barriers are exogenous, equations (10) and (11) can be estimated using least squares. However, in this case the usual methods of inference will be problematic due to spatial correlation resulting from the construction of the dependent variable.41 Appendix II illustrates why using the difference in log income as the dependent variable results in spatial correlation. To address the problem of spatial correlation, we rely on two-way clustering of the standard errors, following the approach in Cameron, Gelbach, and Miller (2006). In our application, clustering arises at the level of country 1 and at the level of country 2, and is non-nested: each individual observation on income differences, say $|\log y_i - \log y_j|$, belongs to the group that includes country $i$ and to the group that includes country $j$. The estimator in Cameron, Gelbach, and Miller (2006) allows for an arbitrary correlation between errors that belong to “the same group (along either dimension)” (p. 7). Their method is therefore directly applicable to the specific econometric issue we face (on p. 3 of their paper the authors specifically mention spatial correlation as a possible application of their estimator). Results obtained with this method feature standard errors that are an order of magnitude larger than those obtained with simple OLS with heteroscedasticity-robust standard errors, suggesting that spatial correlation was indeed an important issue. However, as we show below, genetic distance remains statistically significant even after correcting the standard errors for spatial correlation.42
41. This, of course, was not a concern in the simple regressions presented in Section IV.A. These results featured t-statistics in excess of 9 in the world sample, much larger than the t-statistics that we find using the bilateral approach with two-way clustering. This reinforces our confidence that our results are not driven by standard errors that are too low due to spatial correlation.
42. There are in principle several other ways to address the problem of spatial correlation. One approach would be to do feasible GLS by explicitly estimating the elements of the covariance matrix and introducing the estimated covariance matrix as a weighting matrix in the second stage of the GLS procedure. This is computationally very demanding, as the dimensionality of the matrix is large: in our application we have 9,316 country pairs with available data on the variables of interest, and up to 137 covariance terms to estimate (for the same reason, it is difficult to implement tests of spatial correlation in our context). Another approach, which we pursued in a previous version of this paper, is to include in our regressions common country fixed effects, meant to soak up the spatial correlation. For this we relied on well-known results cited in Case (1991), showing that fixed effects soak up spatial correlation, though in a context quite different from ours. Following this insight, we modeled $\varepsilon_{ij} = \sum_{k=1}^{N} \gamma_k \delta_k + \eta_{ij}$, where $\delta_k = 1$ if $k = i$ or $k = j$, $\delta_k = 0$ otherwise, and $\eta_{ij}$ is a well-behaved disturbance term. We treated $\delta_k$ as fixed effects; that is, we introduced into the regression a set of $N$ dummy variables $\delta_k$, each taking a value of one $N - 1$ times ($\delta_j$ takes a value of 1 whenever country $j$ appears in a pair). This did not affect the qualitative and quantitative nature of our results compared to the solution we pursue here, but the standard errors were smaller than the ones we report in this version. Finally, we also included separate fixed effects for each country in a pair, with no significant changes in the estimated effects of genetic distance.
IV.C. Unconditional Results
Table II presents some summary statistics for our variables. Throughout, we use a baseline sample of 9,316 country pair observations obtained from 137 underlying countries. We consider various measures of genetic distance. As already mentioned, our baseline measure is weighted FST genetic distance. We also used the weighted Nei genetic distance.43 These measures bear a high correlation with each other (.939), and in practice it matters little which one we use. On the other hand, the theoretically more appropriate measure of relative distance to the United States bears a correlation of only .634 with absolute FST genetic distance. Finally, we considered FST genetic distance with countries matched to populations as they were in 1500. The correlation between this variable and the current measure is .827. Our measure of absolute FST genetic distance, $G^{D}$, bears a positive correlation of .197 with the absolute value of log income differences in 1995. Genetic distance relative to the frontier, $G^{R}$, bears a higher correlation with income differences, equal to .337, which is directly in line with our model’s prediction (these correlations are higher in the European sample, .328 and .409, respectively). Table III presents univariate regressions of income differences on various measures of genetic distance for the world sample. As a measure of the magnitude of the coefficients, we report the standardized beta coefficient on genetic distance for each regression.44 Column (1) shows that, when entered alone in the regression, one standard deviation in FST genetic distance between plurality groups accounts for 16.79% of a standard deviation of income differences. This effect rises in magnitude to 26.98% when we consider genetic distance (also between plurality ethnic groups of each country in a pair) relative to the frontier (column (2)). This
43. In past work we also used both FST and Nei genetic distance based on plurality groups, with results very similar to those reported here.
44. The standardized beta is defined as the effect of a one-standard-deviation change in the regressor, expressed as a percentage of one standard deviation of the dependent variable.
TABLE II
SUMMARY STATISTICS FOR THE MAIN VARIABLES (WORLD DATA SET)

Panel A. Simple correlations among genetic and economic distance measures

                                                               (1)     (2)     (3)     (4)     (5)     (6)
(1) FST gen. dist., weighted                                     1
(2) FST gen. dist., weighted, relative to United States       .634       1
(3) FST gen. dist., 1500 match                                .827    .544       1
(4) Nei gen. dist., weighted                                  .939    .742    .782       1
(5) Abs. log income diff., 1995                               .197    .337    .203    .231       1
(6) Abs. log income diff., 1870^a                             .007    .200    .058    .072    .596       1
(7) Abs. log income diff., 1500^b                            −.088   −.030    .198   −.068   −.051    .060

Panel B. Means and standard deviations

Variable                                                   # of obs.      Mean Std. dev.       Min       Max
FST genetic distance, weighted                                 9,316     0.111     0.071     0.000     0.344
FST genetic distance, weighted, relative to United States      9,316     0.062     0.048     0.000     0.213
FST genetic distance, 1500 match                               9,316     0.127     0.082     0.000     0.356
Nei genetic distance, weighted                                 9,316     0.018     0.013     0.000     0.059
Abs. log income difference, 1995                               9,316     1.290     0.912     0.000     4.133
Abs. log income difference, 1870^a                             1,326     0.658     0.488     0.000     2.110
Abs. log income difference, 1500^b                               325     0.327     0.237     0.000     1.012

Note. Number of observations: 9,316, except ^a 1,326 and ^b 325.
THE DIFFUSION OF DEVELOPMENT
493
TABLE III
UNIVARIATE REGRESSIONS (TWO-WAY CLUSTERED STANDARD ERRORS)

Columns: (1) FST gen. dist.; (2) FST gen. dist. relative to United States; (3) Weighted FST gen. dist.; (4) Weighted FST gen. dist., relative to United States; (5) Weighted Nei gen. dist.; (6) Weighted regression.

                                       (1)       (2)       (3)       (4)       (5)       (6)
FST genetic distance               1.853∗∗                                             2.214∗∗
                                  (0.508)                                             (0.533)
FST genetic distance relative                3.541∗∗
  to the United States                      (0.654)
Weighted FST genetic distance                          2.516∗∗
                                                      (0.630)
FST gen. dist. relative to the                                    6.357∗∗
  United States, weighted                                        (0.996)
Weighted Nei genetic distance                                              16.868∗∗
                                                                            (3.792)
Constant                           1.079∗∗   0.977∗∗   1.010∗∗   0.893∗∗   0.986∗∗   1.044∗∗
                                  (0.051)   (0.049)   (0.059)   (0.052)   (0.057)   (0.050)
Standardized beta (%)              16.79     26.98     19.71     33.65     27.01     20.07
R²                                   .03       .07       .04       .11       .05       .03

Notes. Dependent variable: absolute value of log income differences, 1995. Two-way clustered standard errors in parentheses; 9,316 observations from 137 countries. ∗ Significant at 10%. ∗∗ Significant at 5%.
means that the effect of genetic distance relative to the frontier is larger than the effect of bilateral genetic distance, exactly as predicted by Implication 2 of our model. Turning to the weighted measures, similar effects are found, with slightly larger magnitudes (columns (3) and (4)): 19.71% and 33.65%, respectively. The larger magnitudes are consistent with the idea that weighted measures are better proxies for the expected genetic distance between countries. The effect is also larger when Nei genetic distance is used instead of FST (column (5)).45 We next make use of data from Cavalli-Sforza, Menozzi, and Piazza (1994) on the standard deviation of the genetic distance estimates. Because these estimates are based on allele frequencies collected from samples of different sizes, they are more or less precise depending on the population pair. We have data on the standard errors of each estimate of genetic distance, obtained from bootstrap analysis. In column (6), we linearly downweight observations with higher standard errors on genetic distance. As expected, the magnitude of the resulting weighted least squares effect of FST genetic distance is larger than under simple OLS, consistent with the idea that measurement error is greater for pairs with high standard errors on genetic distance. Similar results are obtained using alternative measures of genetic distance. While providing suggestive evidence in favor of Implications 1 and 2 of our theoretical model, these unconditional results may confound the effect of barriers linked to vertical characteristics with geographic barriers. In the next section, we control for a large number of measures of geographic distance. In everything that follows we will focus on the weighted genetic distance relative to the frontier as the baseline measure of genetic distance, that is, the measure used in column (4) of Table III, because it is theoretically more appropriate.

45. We found no evidence of nonlinear effects of genetic distance. A quadratic term in genetic distance bore an insignificant coefficient, and the total effect of genetic distance in this quadratic specification, evaluated at the mean of genetic distance, was commensurate with the linear effect reported here.

IV.D. Controlling for Geographic Factors

Genetic distance and geographic isolation are likely to be highly correlated. The more isolated two groups become, the more they will drift apart genetically, since genetic admixture is made difficult by geographic barriers. It is therefore important to control adequately for geographic isolation: failing to do so would ascribe to genetic distance an effect that should be attributed to geographic distance. In this section, we control for a vast array of measures of geographic isolation.

Distance Metrics. Our first set of measures of geographic isolation between countries consists of various measures of distance. We consider a measure of the great circle (geodesic) distance between the major cities of the countries in our sample, from a data set compiled by researchers at the Centre d’Etudes Prospectives et d’Informations Internationales (CEPII).46 We also include latitudinal distance, that is, simply the absolute value of the difference in latitude between the two countries i and j in each pair: G^LA_ij = |latitude_i − latitude_j|. Latitude could be associated with climatic factors that affect income levels directly, as in Gallup, Mellinger, and Sachs (1998) and Sachs (2001). Latitude differences would also act as barriers to technological diffusion: Diamond (1997) suggests that barriers to the transmission of technology are greater along the North–South axis than along the East–West axis, because regions at the same latitude share similar climate, availability of domesticable species, soil conditions, etc. We should therefore expect countries at similar latitudes to also display similar levels of income. Third, we use a measure of longitudinal distance, G^LO_ij = |longitude_i − longitude_j|, to capture possible geographic isolation along this alternative axis. Table IV, column (2), includes these three measures jointly with FST genetic distance relative to the frontier. The effect of relative genetic distance barely changes compared to the baseline univariate regression replicated in column (1). We find evidence that latitudinal distance matters: the standardized beta on this variable is 11.97%, consistent with Jared Diamond’s hypothesis.

Microgeographic Factors.
In addition to these straightforward distance measures, we controlled for other measures of isolation between countries. In the context of gravity regressions for Europe, Giuliano, Spilimbergo, and Tonon (2006) argued that genetic distance was likely correlated with features of the terrain. These “microgeographic” features may not be well captured by

46. The data are available at http://www.cepii.fr/anglaisgraph/bdd/distances.htm. The correlation between geodesic distance and weighted FST genetic distance is .349. This correlation rises to .486 if genetic distance is measured based on populations as they were in 1500, because the colonization era weakened the link between genetic distance and geographic distance by shuffling populations across the globe. The correlations are lower in magnitude when considering distances relative to the frontier, although their relative magnitude is preserved.
TABLE IV
CONTROLLING FOR GEOGRAPHIC DISTANCE (TWO-WAY CLUSTERED STANDARD ERRORS)

Columns: (1) Baseline; (2) Distance metrics; (3) Add microgeography controls; (4) Add transport costs; (5) Continent dummies; (6) Climatic difference control; (7) Tropical difference control.

                                                     (1)       (2)       (3)       (4)       (5)       (6)       (7)
FST gen. dist. relative to the                   6.357∗∗   6.387∗∗   6.273∗∗   6.312∗∗   4.134∗∗   6.067∗∗   6.368∗∗
  United States, weighted                       (0.996)   (0.994)   (0.989)   (0.988)   (1.046)   (0.960)   (1.003)
Absolute difference in latitudes                           0.523∗∗   0.494∗∗   0.494∗∗  −0.228     0.254     0.497∗∗
                                                          (0.241)   (0.238)   (0.237)   (0.217)   (0.221)   (0.237)
Absolute difference in longitudes                          0.387∗    0.391∗    0.376∗    0.084     0.257     0.380∗
                                                          (0.235)   (0.226)   (0.224)   (0.162)   (0.224)   (0.224)
Geodesic distance (1,000s of km)                          −0.050∗   −0.057∗∗  −0.081∗∗  −0.008    −0.062∗   −0.081∗∗
                                                          (0.028)   (0.026)   (0.039)   (0.036)   (0.038)   (0.039)
=1 for contiguity                                                   −0.456∗∗  −0.462∗∗  −0.284∗∗  −0.328∗∗  −0.464∗∗
                                                                    (0.064)   (0.064)   (0.060)   (0.061)   (0.064)
=1 if either country is an island                                    0.178∗    0.180∗    0.119     0.162     0.181∗
                                                                    (0.094)   (0.094)   (0.090)   (0.102)   (0.092)
=1 if either country is landlocked                                   0.071     0.078     0.110     0.084     0.075
                                                                    (0.076)   (0.076)   (0.071)   (0.076)   (0.075)
=1 if pair shares at least one sea or ocean                         −0.029    −0.024     0.030     0.044    −0.024
                                                                    (0.062)   (0.062)   (0.050)   (0.059)   (0.062)
Freight rate (surface transport)                                               1.282    −0.197     1.160     1.286
                                                                              (1.568)   (1.517)   (1.490)   (1.583)
Climatic difference of land areas,                                                                 0.032∗∗
  by 12 KG zones                                                                                  (0.007)
Difference in % land area in KG                                                                             −0.033
  tropical climates                                                                                         (0.083)
Constant                                         0.893∗∗   0.866∗∗   0.889∗∗   0.675∗∗   1.919∗∗   0.284     0.681∗∗
                                                (0.052)   (0.066)   (0.078)   (0.263)   (0.407)   (0.263)   (0.264)
Standardized beta (%)                            33.65     33.81     33.20     33.41     21.88     32.12     33.70
R²                                                 .11       .12       .13       .13       .22       .15       .13

Notes. Dependent variable: absolute value of log income differences, 1995. Two-way clustered standard errors in parentheses. 9,316 observations from 137 countries in all columns. KG denotes Koeppen–Geiger climate zones. Column (5) includes two sets of continent dummies (estimates not reported): a set of dummies each equal to 1 if both countries in a pair are on the same given continent; and a set of dummies each equal to 1 if exactly one country belongs to a given continent and the other does not. Continents are defined as Europe, Africa, Latin America, North America, Asia, and Oceania. ∗ Significant at 10%. ∗∗ Significant at 5%.
simple metrics of distance. We included dummy variables taking a value of 1 if countries in a pair were contiguous, if they had access to a common sea or ocean, or if either country in a pair was an island or was landlocked.47 These measures are meant to capture ease of communication and travel between countries, which may be associated both with barriers to technological diffusion and with population isolation (and thus genetic distance). Column (3) of Table IV shows that these variables have the expected signs, but their inclusion does not affect the coefficient on genetic distance. To summarize, although the additional controls have explanatory power for income differences, we found no evidence that the inclusion of microgeographic factors modifies the effect of genetic distance.

Transportation Costs. A good summary measure of geographic isolation is transportation costs. Giuliano, Spilimbergo, and Tonon (2006) use a new measure of transportation costs based directly on freight rates for surface transport (sea or land) between European countries. We obtained the same data they used, but for the world sample.48 Column (4) of Table IV adds this measure of freight costs to our specification. We find that freight costs bear a positive relationship to income differences, as expected. However, this effect is not statistically significant, and it does not affect the signs or magnitudes of the other included variables, particularly genetic distance. We find no evidence that genetic distance captures the effect of geographic isolation or transportation costs in our application.49

Continent Effects. The largest genetic distances observed in our worldwide data set occur between populations that live on

47. The common sea variable is the same as that used in Giuliano, Spilimbergo, and Tonon (2006). These authors also used a measure of the average elevation of the countries that lie between any two countries in a pair, as a measure of how hard it is to travel from one to the other. Although we calculate and use this variable for the Europe sample, where it is relatively straightforward to do so, there are simply too many possible paths between any two countries in the world for this to be practical in the broader sample of world countries.

48. The data are available from http://www.importexportwizard.com/. The measure we used referred to 1,000 kg of unspecified freight transported over sea or land, with no special handling. This is the same definition used in Giuliano, Spilimbergo, and Tonon (2006). The data on 10,825 pairs of countries were downloaded from the website using a Perl script.

49. In the previous version of this paper, we also used the approach in Limao and Venables (2001) and Hummels and Lugovskyy (2006) to measure trade costs indirectly through the matched partner technique, using the ratio of CIF to FOB exports. The measure of indirect trade costs is ITC_ij = (CIF_ij/FOB_ij) − 1. Results with this alternative measure of trade costs were based on a much smaller sample, but were similar to those reported here.
TABLE V
ENDOGENEITY OF GENETIC DISTANCE AND DIAMOND GAP (TWO-WAY CLUSTERED STANDARD ERRORS)

Columns: (1) 2SLS with 1500 genetic distance; (2) Without New World; (3) Diamond gap, w/o New World; (4) Income 1500, Diamond gap.

                                                     (1)       (2)       (3)       (4)
FST genetic distance relative to the             9.400∗∗   4.428∗∗   2.815∗∗
  United States, weighted                       (1.665)   (1.252)   (1.347)
FST genetic distance relative to the                                           1.737∗∗
  English, 1500 match                                                         (0.427)
Absolute difference in latitudes                 0.402     0.901∗∗   1.078∗∗   0.152
                                                (0.293)   (0.420)   (0.471)   (0.138)
Absolute difference in longitudes                0.601∗∗   0.349     0.781∗∗  −0.007
                                                (0.246)   (0.258)   (0.333)   (0.070)
Geodesic distance (1,000s of km)                −0.114∗∗  −0.087∗   −0.155∗∗  −0.016
                                                (0.039)   (0.051)   (0.055)   (0.022)
=1 for contiguity                               −0.381∗∗  −0.471∗∗  −0.461∗∗  −0.048
                                                (0.063)   (0.069)   (0.067)   (0.040)
=1 if either country is an island                0.209∗∗   0.134     0.176     0.004
                                                (0.094)   (0.113)   (0.115)   (0.053)
=1 if either country is landlocked               0.052     0.016     0.022    −0.059∗
                                                (0.076)   (0.081)   (0.076)   (0.034)
=1 if pair shares at least one sea or ocean     −0.043    −0.060    −0.071    −0.068
                                                (0.077)   (0.087)   (0.085)   (0.047)
Freight rate (surface transport)                 1.700     1.627     1.508    −0.263
                                                (1.341)   (1.809)   (1.781)   (0.847)
Diamond gap                                                          0.472∗∗   0.164∗∗
                                                                    (0.137)   (0.059)
Constant                                         0.488∗∗   0.701∗∗   0.760∗∗   0.338∗∗
                                                (0.241)   (0.312)   (0.309)   (0.144)
# of observations                                9,316     6,105     6,105       325
# of countries                                     137       111       111        26
Standardized beta (%)                            49.75     23.56     14.98     37.96
R²                                                 .10       .11       .13       .22

Notes. Dependent variable: absolute value of log income differences, 1995 (columns (1)–(3)) or 1500 (column (4)). Two-way clustered standard errors in parentheses. The Diamond gap is a dummy variable that takes on a value of 1 if one and only one of the countries in each pair is located on the Eurasian landmass, and 0 otherwise. ∗ Significant at 10%. ∗∗ Significant at 5%.
different continents. One concern is that genetic distance may simply be picking up the effect of cross-continental barriers to the diffusion of development, that is, continent effects. To test explicitly for this possibility, we added to our baseline specification two sets of continent dummies. We included one set of six dummies (one for each continent) taking on a value of 1 if the two countries in a pair were on the same continent. We also included a set of six dummies, each equal to 1 if one country belonged to a given continent and the other did not. The results are in column (5) of Table IV. The inclusion of continent dummies reduces the magnitude of the genetic distance effect by about one-third, but the latter remains statistically significant. Its magnitude is still large, with a standardized beta of 21.88%.

Figure III shows that many of the countries most genetically and economically distant from the United States are in sub-Saharan Africa. To examine whether this drives our results, we excluded from our sample any pair involving a sub-Saharan African country. In the resulting regression (available upon request), the standardized beta on genetic distance was 32.11%, and the coefficient was highly significant statistically. We therefore find no evidence that our results are driven by the inclusion of Africa in our sample. We will provide further evidence on the within-continent effects of genetic distance using the European data set in Section IV.H.

Climatic Similarity. Next, we constructed measures of climatic similarity based on 12 Koeppen–Geiger climate zones.50 One measure is the absolute difference between the two countries' percentages of land area in each of the 12 climate zones, averaged across the zones. Under this measure, two countries have identical climates if they have identical shares of their land areas in each climate zone. As a simpler alternative, we used the absolute difference in the percentage of land area in tropical climates. As with latitude, climate may have direct effects on productivity, or barrier effects on technological diffusion: countries located in different climates may have had difficulty adopting each other’s modes of production, particularly in the agrarian era. Columns (6) and (7) of Table IV report the results.
As expected, climatic differences are associated with greater income differences, even controlling for latitude differences. However, the inclusion of these variables hardly affects the coefficient on genetic distance.

50. The 12 Koeppen–Geiger climate zones are tropical rain forest climate (Af), monsoon variety of Af (Am), tropical savannah climate (Aw), steppe climate (BS), desert climate (BW), mild humid climate with no dry season (Cf), mild humid climate with a dry summer (Cs), mild humid climate with a dry winter (Cw), snowy-forest climate with a dry winter (Dw), snowy-forest climate with a moist winter (Df), tundra/polar ice climate (E), and highland climate (H). The data, compiled by Gallup, Mellinger, and Sachs, are available at http://www.ciesin.columbia.edu/eidata/.
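The pair-level geographic controls entering Table IV (latitudinal and longitudinal distance, geodesic distance, the two sets of continent dummies, and the climatic-difference measure) are simple functions of country attributes. The Python sketch below computes them for a single pair. It is illustrative only: the haversine formula on a spherical Earth of radius 6,371 km is our assumption (the paper takes geodesic distances between major cities from the CEPII data set rather than computing them), and all function names and inputs are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)
CONTINENTS = ("Europe", "Africa", "Latin America", "North America", "Asia", "Oceania")


def latitude_distance(lat_i: float, lat_j: float) -> float:
    """G^LA_ij = |latitude_i - latitude_j|, in degrees."""
    return abs(lat_i - lat_j)


def longitude_distance(lon_i: float, lon_j: float) -> float:
    """G^LO_ij = |longitude_i - longitude_j|, in degrees (no wrap-around at 180)."""
    return abs(lon_i - lon_j)


def great_circle_km(lat_i, lon_i, lat_j, lon_j):
    """Haversine great-circle distance in km between two points in decimal degrees."""
    phi_i, phi_j = math.radians(lat_i), math.radians(lat_j)
    d_phi = phi_j - phi_i
    d_lambda = math.radians(lon_j - lon_i)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi_i) * math.cos(phi_j) * math.sin(d_lambda / 2) ** 2
    # min() guards against floating-point values slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def climatic_difference(shares_i, shares_j):
    """Absolute difference in % of land area per Koeppen-Geiger zone, averaged over 12 zones."""
    if len(shares_i) != 12 or len(shares_j) != 12:
        raise ValueError("expected 12 zone shares per country")
    return sum(abs(a - b) for a, b in zip(shares_i, shares_j)) / 12


def continent_dummies(cont_i, cont_j):
    """Two sets of six dummies: both countries on continent c, or exactly one of them."""
    dummies = {}
    for c in CONTINENTS:
        in_i, in_j = cont_i == c, cont_j == c
        dummies[f"both_{c}"] = int(in_i and in_j)
        dummies[f"exactly_one_{c}"] = int(in_i != in_j)
    return dummies


# Hypothetical pair with made-up coordinates and continents.
print(latitude_distance(10.0, -20.0), longitude_distance(30.0, 5.0))  # 30.0 25.0
print(round(great_circle_km(0.0, 0.0, 0.0, 90.0)))  # 10008 (a quarter of the equator)
print(continent_dummies("Europe", "Africa")["exactly_one_Africa"])  # 1
```

In a full data set, these functions would be applied to every country pair and the results stacked as regressors alongside genetic distance.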
We conclude that our results are robust to controlling for a wide variety of measures of geographic distance, microgeographic measures of isolation, continent effects, climatic differences, and transportation costs, whether entered in absolute terms or relative to the frontier.51

IV.E. Endogeneity and the Diamond Gap

Possible Endogeneity of Current Genetic Distance. We attempted to control for the possible endogeneity of genetic distance with respect to income differences. Although differences in (neutral) allele frequencies between the populations of two countries do not result causally from income differences, migration could lead to a pattern of genetic distances today that is closely linked to current income differences. The issue arises from the pattern of colonization of the New World after 1500. Europeans tended to settle in larger numbers in the temperate climates of North America and Oceania. If geographic factors have a direct effect on income levels and were not properly accounted for in our regressions through the included control variables, then genetic distance today could be positively related to income distance not because genetic distance precluded the diffusion of development, but because similar populations settled in regions prone to generating similar incomes. To assess this possibility, we use our data on FST genetic distance as of 1500, relative to the English population, as an instrument for current genetic distance. This variable reflects genetic distance between populations as they were before the great migrations of the modern era, that is, as determined since the Neolithic era, and yet it is highly correlated (.611) with current genetic distance relative to the United States, so it fulfills the conditions of a valid instrument.
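The instrumental-variables strategy can be sketched as textbook two-stage least squares: current genetic distance is first projected on the 1500 instrument and the exogenous controls, and income differences are then regressed on the fitted values. The NumPy sketch below is a generic implementation under that assumption, with a hypothetical function name and simulated data. It computes point estimates only and does not reproduce the paper's two-way clustered standard errors (standard errors taken naively from the second stage would be incorrect for 2SLS in any case).

```python
import numpy as np


def two_sls(y, x_endog, z_instr, controls=None):
    """Point estimates from two-stage least squares with one endogenous regressor.

    Returns coefficients ordered as [endogenous regressor, constant, controls...].
    """
    n = len(y)
    exog = np.ones((n, 1)) if controls is None else np.column_stack([np.ones(n), controls])
    X = np.column_stack([x_endog, exog])  # second-stage regressors
    Z = np.column_stack([z_instr, exog])  # excluded instrument plus exogenous controls
    # First stage: project every regressor on Z (exogenous columns are reproduced exactly).
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: OLS of y on the fitted regressors.
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]


# Simulated example: x is correlated with the error u; z shifts x but not y directly.
rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + 0.6 * u + 0.5 * rng.normal(size=n)
y = 1.0 + 2.0 * x + u + 0.5 * rng.normal(size=n)

ols = np.linalg.lstsq(np.column_stack([x, np.ones(n)]), y, rcond=None)[0]
iv = two_sls(y, x, z)
print(f"OLS slope {ols[0]:.2f}, 2SLS slope {iv[0]:.2f} (true value 2.0)")
```

Because x and u are positively correlated in the simulation, OLS overstates the slope while 2SLS lands close to 2. In the paper's setting, y would be the absolute log income difference, x current genetic distance relative to the United States, z genetic distance relative to the English in 1500, and the controls the geographic variables.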
An added benefit of the IV approach is that it allows us to address in part possible measurement error in current genetic distance: the matching of populations to countries is much more straightforward for 1500, as explained above. In column (1) of Table V, the magnitude of the genetic distance effect rises by roughly one-half, with a standardized beta reaching 49.75%. As is usual in this type of application, the larger estimated effect may come from a lower incidence of measurement error under two-stage least squares. To further assess whether our results are driven by endogeneity of the sort discussed above, column (2) of Table V excludes from the sample any pairs involving one or more countries from the New World (defined as countries in North America, Latin America, the Caribbean, and Oceania), where the endogeneity problem is likely to be most acute. The effect of genetic distance falls by about one-third, but remains both statistically and economically significant. The coefficient on the difference in latitudes becomes much larger, an observation to which we shall return below.

51. In all these regressions, geographic distance was entered in absolute terms, not relative to the frontier. In results available upon request, all the measures of distance and transport costs used in Table IV were entered relative to the United States instead. If anything, the coefficient on genetic distance became larger.

The Diamond Gap. Jared Diamond’s (1997) influential book stressed that differences in latitude played an important role as barriers to the transfer of technological innovations in early human history, and later in the preindustrial era, an effect that could have persisted to this day. Our estimates of the effect of latitudinal distance provided evidence that this effect was still at play: we found that differences in latitudes help explain income differences across countries, and that this effect was much larger when the New World was excluded from our sample. However, Diamond took his argument one step further and argued that Eurasia enjoyed major advantages in the development of agriculture and animal domestication because (a) it had the largest number of potentially domesticable plants and animals and (b) it had a predominantly East–West axis that allowed an easier and faster diffusion of domesticated species. By contrast, differences in latitudes in the Americas and Africa created major environmental barriers to the diffusion of species and innovations.
More generally, Eurasia might have enjoyed additional benefits in the production and transfer of technological and institutional innovations because of its large size.52 It is important to properly control for Diamond’s geography story, as it could be either a substitute for or a complement to ours. To test and control for a Eurasian effect, we constructed a dummy variable that takes on a value of 1 if one and only one of the countries in each pair is in Eurasia, and 0 otherwise (the “Diamond gap”).53 In order to test Diamond’s hypothesis, we added the Diamond gap to regressions explaining income differences in 1995 (column (3) of Table V) and, using Maddison’s historical income data, in 1500 (column (4)). For the former regression, we restricted our sample to the Old World.54 As expected, in the regression for 1995 income differences, the Diamond gap enters with a positive and significant coefficient, and its inclusion reduces (but does not come close to eliminating) the effect of genetic distance. In column (4), using 1500 income differences as the dependent variable, the Diamond gap is also significant and large in magnitude, despite the paucity of observations. Again, the effect of genetic distance relative to the English population using the 1500 match remains large in magnitude, with a standardized beta of 37.96%. This provides suggestive quantitative evidence in favor of Diamond’s observation that the diffusion of development was faster in Eurasia. We also conclude that genetic distance between populations plays an important role in explaining income differences even when controlling for the environmental advantages and disadvantages associated with Eurasia. Diamond’s hypothesis on the long-term diffusion of development is complementary to ours.

52. This point is stressed in Kremer (1993). See also Masters and McMillan (2001).

53. For further tests providing statistical support for Diamond’s observations, see Olsson and Hibbs (2005).

IV.F. Controlling for Common History, Linguistic Distance, and Religious Distance

In this section we control for additional possible determinants of income differences.55 We consider common history variables (for example, whether countries shared a common colonial past), and variables capturing distances in specific cultural characteristics, such as language and religion. In principle, countries that are close in terms of genetic distance may also be close in terms of common history, language, and religion, so we check whether the effect of genetic distance on income differences is robust to controlling for these specific channels of historical and cultural similarity.
In particular, in this section we discuss two related questions: (a) How correlated is genetic distance with measures of linguistic and religious distances? (b) How is the effect of genetic distance on income differences affected by the introduction of these additional variables? As we detail below, overall we find positive but relatively modest correlations between measures of linguistic or religious distance and genetic distance across populations. We also find that controlling for such distances reduces the effect of genetic distance on income, but only to a modest extent: the effect of genetic distance remains large and significant. These results confirm the robustness of the relationship between genetic distance and income differences, but may also be viewed as somewhat surprising. Because both language and religion tend to be transmitted across generations, one could have expected that measures of linguistic, religious, and genetic distance would capture similar patterns of genealogical relatedness, and that their joint inclusion would reduce the effect of genetic distance. However, as we discuss below, there are several reasons, related to the definition and measurement of these variables, that make them empirically distinct. This may shed light on why genetic distance plays such a predominant role in explaining income differences, and why its effect appears to be largely unaffected by the inclusion of linguistic and religious distances.

54. It is appropriate to exclude the New World from the sample when using 1995 incomes because Diamond’s theory is about the geographic advantages that allowed Eurasians to settle and dominate the New World. If we were to include the New World in a regression explaining income differences today, we would include the higher income per capita of nonaboriginal populations who are there because of guns, germs, and steel, that is, thanks to their ancestors’ Eurasian advantage.

55. Throughout this section we will use the specification in column (4) of Table IV as the baseline; that is, we include a large array of geographic isolation controls.

Measures of Linguistic and Religious Distance. We construct two measures of linguistic distance and one measure of religious distance. Our first approach to linguistic distance follows Fearon (2003). Fearon used data from Ethnologue to create linguistic trees, classifying languages into common families and displaying graphically the degree of relatedness of world languages. The linguistic tree in this data set contains up to 15 nested classifications. If two languages share many common nodes in the tree, these languages are more likely to trace their roots to a more recent common ancestor language.
The number of common nodes in the linguistic tree, then, is a measure of linguistic similarity. For instance, according to this measure, French and Italian share four common nodes: both belong to the Indo-European/Italic/Romance/Italo-Western linguistic groupings. Using data on the linguistic composition of countries (also from Fearon [2003]), and matching languages to countries, we can construct indices of linguistic distance between countries. We did so, as for genetic distance, in two ways. First, we computed the number of common nodes shared by the languages spoken by plurality groups within each country in a pair. Second, we computed a weighted measure of linguistic similarity, representing the expected number of common linguistic nodes between two randomly chosen individuals, one from each country in a pair (the formula is analogous to that of equation (9)).56 Following Fearon (2003), we transformed each of these series so that they are increasing in linguistic distance (LD) and bounded by 0 and 1:

$$\mathrm{LD} = \frac{15 - \#\,\text{Common Nodes}}{15} \qquad (12)$$
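Equation (12) and its weighted variant can be computed as follows. In this Python sketch, the number of common nodes is taken to be the length of the shared leading part of two languages' classification paths; the paths beyond the four Indo-European/Italic/Romance/Italo-Western nodes named in the text, the language shares, and the convention that a language shares all 15 nodes with itself are hypothetical illustrations, not values from Fearon's (2003) data. The weighted measure takes the expected number of common nodes between two randomly drawn individuals, one from each country, and then applies equation (12); because equation (12) is linear, this equals the expected linguistic distance whenever each country's language shares sum to one.

```python
from itertools import product

MAX_NODES = 15  # the linguistic tree has up to 15 nested classifications


def common_nodes(path_a, path_b):
    """Number of shared leading classifications between two language paths."""
    if tuple(path_a) == tuple(path_b):
        return MAX_NODES  # assumption: identical languages are at zero distance
    count = 0
    for node_a, node_b in zip(path_a, path_b):
        if node_a != node_b:
            break
        count += 1
    return count


def linguistic_distance(n_common):
    """Equation (12): (15 - # common nodes) / 15, bounded by 0 and 1."""
    if not 0 <= n_common <= MAX_NODES:
        raise ValueError("number of common nodes must lie between 0 and 15")
    return (MAX_NODES - n_common) / MAX_NODES


def weighted_linguistic_distance(shares_i, shares_j, paths):
    """Equation (12) applied to the expected common nodes of two random speakers."""
    expected = sum(
        s_a * s_b * common_nodes(paths[a], paths[b])
        for (a, s_a), (b, s_b) in product(shares_i.items(), shares_j.items())
    )
    return linguistic_distance(expected)


# Hypothetical classification paths: only the first four nodes come from the text.
PATHS = {
    "French": ("Indo-European", "Italic", "Romance", "Italo-Western", "French branch"),
    "Italian": ("Indo-European", "Italic", "Romance", "Italo-Western", "Italian branch"),
}

print(common_nodes(PATHS["French"], PATHS["Italian"]))  # 4
print(linguistic_distance(4))  # 11/15, about 0.733
print(weighted_linguistic_distance({"French": 1.0}, {"French": 0.5, "Italian": 0.5}, PATHS))
```

In the last line, half of the second country's speakers share all 15 nodes with the first country and half share 4, so the expected count is 9.5 and the weighted distance is 5.5/15.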
Our second measure of linguistic distance is based on work in the field of lexicostatistics (a branch of linguistics). We use data from Dyen, Kruskal, and Black (1992). They assembled data on 200 common “meanings” from all Indo-European languages. For each language, they compiled lists of words expressing these meanings. When words from two languages expressing a given meaning originated from a common source, these words were considered to be cognate. For instance, the words “table” in French and “tavola” in Italian are cognate because both stem from the word “tabula” in Latin. Aggregating over the 200 meanings, our measure of linguistic distance is the percentage of noncognate words, and as before we can compute an expected (weighted) measure and a measure based on the percentage of cognate words between the languages spoken by the plurality linguistic groups in each country in a pair. Again, the greater the percentage of cognate words, the more recently the languages shared a common ancestor language. In contrast to the linguistic trees data, this measure has the advantage of being a continuous measure of similarity. Its main drawback is that it is only available for Indo-European languages, so the geographic coverage is reduced to 62 countries (when considering the % cognate between plurality languages).57

To measure religious distance, we followed an approach similar to that used for linguistic distance. We relied on a nomenclature of world religions obtained from Mecham, Fearon, and Laitin

56. Using the measure based on the plurality language or the weighted measure did not make any difference for the results. In keeping with what we did for genetic distance, we focus on weighted measures in our empirical work.
57. In addition, when using the weighted measure of lexicostatistical distance, we lose further countries with sizable minorities of non-Indo-European-speaking populations, such as India. For this variable, only 43 countries remain. These are mostly countries in Europe and the Americas.
THE DIFFUSION OF DEVELOPMENT
507
(2006).58 This nomenclature was broken down into religious families, first distinguishing between monotheistic religions of Middle-Eastern origin, Asian religions, and “others,” then subdividing each group into finer groups (such as Christians, Muslims, and Jews), and so on. The number of common classifications (up to five in this data set) is a measure of religious proximity. We matched religions to countries using Mecham, Fearon, and Laitin’s (2006) data on the prevalence of religions by country and transformed the data in a manner similar to that in equation (12).

Pairwise correlations between measures of genetic, linguistic, and religious distances are displayed in Table VI. These correlations are generally positive, as expected, but they are not very large in magnitude. For instance, the correlation between FST genetic distance and weighted linguistic distance is .227. The two alternative measures of linguistic distance bear a correlation of .745. Religious distance bears a correlation of .438 with linguistic distance and .171 with genetic distance.

As mentioned above, the relatively small magnitude of the correlation between linguistic distance and genetic distance may be viewed as somewhat surprising. Anthropologists and population geneticists, including Cavalli-Sforza, Menozzi, and Piazza (1994, pp. 98–105), have pointed out that there is usually little genetic admixture between populations that speak different languages and that linguistic trees often mirror genetic trees for aboriginal populations. However, these scholars have also stressed forces that may occasionally lead to dramatic divergence between genetic distance and linguistic distance. Historically, conquests and migrations have often been associated with language replacement, gene replacement, or both; because the two need not occur together, they create a wedge between linguistic trees and genetic trees.
An example is the very different relation between the Hungarian population and neighboring populations in terms of genetic distance vs. linguistic distance. The Hungarian language (Magyar) is a Uralic language, unrelated to most other languages in Europe, and was introduced into present-day Hungary by Magyar-speaking tribes around 900 A.D. However, modern Hungarians are genetically very close to their European neighbors, suggesting that the Magyars mixed with preexisting Slavic-speaking populations when they arrived in modern-day Hungary and/or that a large

58. An alternative classification obtained from the World Christian Database, with only three nested classifications, did not lead to appreciably different results.
TABLE VI
SUMMARY STATISTICS FOR GENETIC DISTANCE AND MEASURES OF LINGUISTIC AND RELIGIOUS DISTANCE

Panel A: Correlations between genetic distance and various measures of linguistic and religious distance
Variables: weighted FST genetic distance; weighted linguistic distance; weighted religious distance; weighted FST genetic distance relative to the United States; weighted linguistic distance relative to the United States; weighted religious distance relative to the United States; 1 − % cognate, plurality(a); 1 − % cognate, relative to the United States, plurality(a).
[Correlation matrix: cell layout not recoverable from the extracted text. Correlations cited in the text: .227 (weighted FST genetic distance and weighted linguistic distance), .745 (the two alternative measures of linguistic distance), .438 (religious and linguistic distance), .171 (religious and genetic distance), .062 (weighted linguistic and genetic distance, both relative to the United States).]

Panel B: Summary statistics for genetic distance and various measures of linguistic and religious distance

| Variable | Mean | Standard deviation | Minimum | Maximum |
|---|---|---|---|---|
| Weighted FST genetic distance | 0.111 | 0.071 | 0.000 | 0.344 |
| Weighted linguistic distance | 0.968 | 0.106 | 0.000 | 1.000 |
| Weighted religious distance | 0.841 | 0.151 | 0.089 | 1.000 |
| Weighted FST genetic distance relative to United States | 0.062 | 0.048 | 0.000 | 0.213 |
| Weighted linguistic distance relative to United States | 0.088 | 0.169 | 0.000 | 1.000 |
| Weighted religious distance relative to United States | 0.149 | 0.134 | 0.000 | 0.999 |

Note. Number of observations = 9,316, except (a): number of observations = 1,830.
508 QUARTERLY JOURNAL OF ECONOMICS
number of non-Magyar-speaking individuals moved to Hungary and adopted Magyar as their language in the following centuries. For example, the genetic distance between Hungarians and Poles is only 0.0025 (lower than the genetic distance between Swedes and Danes, who speak closely related languages). By contrast, Poles and Yugoslavs have a genetic distance equal to 0.0137, even though both populations speak Slavic languages (interestingly, the genetic distance between Hungarians and Yugoslavs is about the same, 0.0136). This example illustrates how populations who speak very different languages may be genetically close, whereas populations that speak more similar languages may be quite far apart genetically. Even more dramatic examples can be found among countries that were colonized by European powers and adopted the colonizers’ language (English, French, Portuguese, or Spanish), while maintaining very distinct populations in terms of common ancestry. By the same token, conquests and conversions have led to the adoption of similar religions by genetically distinct populations, as well as of different religions by genetically close populations. In that respect, the relatively low correlation between genetic distance and linguistic or religious distance partly reflects the fact that these variables measure conceptually distinct relations between populations.

In addition, technical reasons related to the construction of the measures of linguistic and religious distances contribute to the low correlation between these measures and genetic distances. Genetic distance is a continuous measure, reflecting an objective molecular clock, and maps linearly into the time elapsed since different populations shared common ancestors.
In contrast, the number of nodes is a discrete and imperfect measure of linguistic or religious distance, based on sometimes arbitrary classifications of languages into groups by linguists, and counting the discrete number of common nodes may not capture such distances appropriately.59 For example, Fearon (2003) argues that the move from 0 to 1 common node is more important than the move from

59. Populations may share few common nodes even though their linguistic splits occurred recently, in which case one is overestimating distance, or they may share many common nodes even though their last split occurred a long time ago, in which case one is underestimating distance. The idea that these measures of linguistic and religious distance may include a lot of measurement error is confirmed by Fearon (2006) in a short unpublished comment on our work suggesting, for European countries, that genetic distance is robust to the inclusion of a variety of measures of linguistic distance in a regression seeking to explain income differences.
3 to 4 common nodes. The lexicostatistical measure may partly address this problem, because it is more continuous, but at the cost of losing all of the non-Indo-European-speaking countries. Finally, a third reason for low correlations is that we consider genetic and cultural distances relative to the world technological frontier (the United States) in our regressions, in keeping with our theory. As shown in Table VI, the correlations between measures relative to the United States are lower than the correlations between simple distances. For instance, while the correlation between weighted linguistic and genetic distances is .227, once these variables are considered relative to the United States, the correlation falls to .062.

Regression Results. Table VII presents results obtained when including measures of linguistic and religious distance as well as common history variables. We first control for variables representing a pair’s common historical experience, obtained from CEPII. These are dummy variables for pairs that were ever part of the same country (for example, Austria and Hungary), were ever in a colonial relationship, have shared a common colonizer since 1945, and are currently in a colonial relationship (such as France and French Polynesia). These variables all bear the expected signs and have statistically significant coefficients (Table VII, column (2)). For instance, having had a common colonizer and having been part of the same country are associated with smaller income differences. The inclusion of these variables in the regression barely affects the magnitude of the genetic distance effect (6.283 in column (2), versus 6.312 in column (1)). Turning to linguistic and religious distance, in Table VII, columns (3) and (4), both linguistic distance and religious distance enter with the expected positive signs and are statistically significant at the 5% level when entered individually.
Their standardized betas are, respectively, 15.10% and 20.17%, so these variables can help account for a sizable fraction of the variation in income differences. When the two variables are entered jointly, only religious distance remains significant (column (5)). More importantly from our perspective, the inclusion of these variables, either alone or together, only slightly reduces the effect of genetic distance on income differences. Comparing column (5) with column (2), the reduction in the coefficient on genetic distance (and in the standardized beta) is 11.5%. Column (7) shows the results obtained when including the measure of linguistic distance based on the percentage
TABLE VII
CONTROLLING FOR COMMON HISTORY, LINGUISTIC DISTANCE, AND RELIGIOUS DISTANCE (TWO-WAY CLUSTERED STANDARD ERRORS)

| | (1) Baseline | (2) Colonial history controls | (3) Linguistic distance, weighted | (4) Religious distance, weighted | (5) Religious + linguistic, weighted | (6) Baseline (smaller sample) | (7) % cognate, plurality |
|---|---|---|---|---|---|---|---|
| FST gen. dist. relative to the United States, weighted | 6.312 (0.988)∗∗ | 6.283 (0.988)∗∗ | 5.827 (0.944)∗∗ | 5.702 (0.950)∗∗ | 5.557 (0.940)∗∗ | 6.853 (2.116)∗∗ | 5.995 (2.109)∗∗ |
| =1 if countries were or are the same country | | −0.217 (0.088)∗∗ | −0.223 (0.087)∗∗ | −0.209 (0.087)∗∗ | −0.214 (0.087)∗∗ | −0.262 (0.084)∗∗ | −0.196 (0.090)∗∗ |
| =1 for pairs ever in colonial relationship | | 0.304 (0.131)∗∗ | 0.255 (0.134)∗ | 0.307 (0.101)∗∗ | 0.282 (0.106)∗∗ | 0.109 (0.112) | 0.167 (0.119) |
| =1 for common colonizer post-1945 | | −0.226 (0.066)∗∗ | −0.214 (0.063)∗∗ | −0.135 (0.060)∗∗ | −0.142 (0.059)∗∗ | −0.046 (0.111) | −0.038 (0.082) |
| =1 for pairs currently in colonial relationship | | −1.033 (0.193)∗∗ | −0.823 (0.200)∗∗ | −0.969 (0.167)∗∗ | −0.873 (0.176)∗∗ | −0.720 (0.162)∗∗ | −0.444 (0.179)∗∗ |
| Linguistic distance index, relative to United States, weighted | | | 0.815 (0.204)∗∗ | | 0.409 (0.292) | | |
| Religious distance index, relative to United States, weighted | | | | 1.373 (0.266)∗∗ | 1.172 (0.317)∗∗ | | |
| 1 − % cognate, relative to United States, plurality | | | | | | | 0.631 (0.185)∗∗ |
| Constant | 0.675 (0.263)∗∗ | 0.740 (0.257)∗∗ | 0.849 (0.246)∗∗ | 0.703 (0.250)∗∗ | 0.763 (0.246)∗∗ | 1.053 (0.179)∗∗ | 0.928 (0.188)∗∗ |
| Standardized beta (%) | 33.41 | 33.26 | 30.84 | 30.18 | 29.42 | 23.58 | 20.63 |
| R2 | .13 | .14 | .16 | .17 | .18 | .14 | .19 |

Note. Dependent variable: absolute value of log income differences, 1995. Two-way clustered standard errors in parentheses. 9,316 observations from 137 countries in columns (1)–(5), 1,830 observations from 61 countries in columns (6) and (7). All columns include geographic controls, that is, absolute difference in latitudes, absolute difference in longitudes, geodesic distance, dummy for contiguity, dummy = 1 if either country is an island, dummy = 1 if either country is landlocked, dummy = 1 if pair shares at least one sea or ocean, freight rate for surface transport (estimates not reported). ∗ Significant at 10%. ∗∗ Significant at 5%.
of cognate words between plurality languages.60 To allow comparisons within a common sample, column (6) presents a baseline regression controlling for geographic distance, transport costs, and common history variables, for the sample for which the lexicostatistical measure is available. We find results consistent with those obtained using Fearon’s discrete measure of linguistic distance: comparing columns (6) and (7) of Table VII, the effect of genetic distance falls by 12.5% when controlling for lexicostatistical distance.

In summary, using the best available measures of linguistic and religious distance, the effect of genetic distance on income differences is reduced by about 12%, but the effect remains large and significant. Overall, these results are consistent with our interpretation: when we measure some specific differences in vertically transmitted traits, such as in language or religion, we obtain a reduction in the size of the coefficient on genetic distance, suggesting that genetic distance was capturing some of the barrier effects associated with differences in these vertical characteristics. However, the reduction is not large enough to suggest that genetic distance only captures the effect of linguistic and religious distance. On the contrary, the reduction is relatively modest, and the effect of genetic distance remains large and significant even when controlling for linguistic and religious distance. This suggests that language and religion are but two of the many vertical characteristics that differ across populations, and perhaps not the most important barriers to the diffusion of economic development.

As already mentioned, linguistic or religious distance and genetic distance do not necessarily capture the same long-term historical relations among populations.
Societies with very different languages may share recent common ancestors, and therefore a large number of other important cultural and biological characteristics, whereas societies with different genetic histories and traits may share similar languages and religions because of more recent conquests or conversions. This raises the question of which vertical traits and characteristics, besides language and religion, lie behind the large effects of genetic distance on income differences. Although identifying specific traits and characteristics is beyond the scope of this paper and is left for further research, some further discussion is in order.

60. Results using the weighted measure, which features far fewer observations, were similar and are available upon request.
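The percentage reductions quoted above (11.5% between columns (2) and (5) of Table VII, 12.5% between columns (6) and (7), hence "about 12%") can be recomputed directly from the table. The Python sketch below assumes the usual definition of a standardized beta as the coefficient times sd(x)/sd(y), expressed in percent. Within a given sample that scale factor is fixed, so the coefficient and its standardized beta fall by the same proportion, up to rounding in the published figures. The dictionary and helper names are illustrative and do not come from the paper.

```python
from __future__ import annotations

from statistics import mean

# (coefficient on relative genetic distance, standardized beta in %),
# transcribed from Table VII and keyed by the table's column number.
TABLE_VII = {
    2: (6.283, 33.26),  # colonial history controls
    5: (5.557, 29.42),  # adds linguistic and religious distance
    6: (6.853, 23.58),  # baseline on the smaller lexicostatistical sample
    7: (5.995, 20.63),  # adds 1 - % cognate
}


def pct_reduction(base: float, new: float) -> float:
    """Percentage decline when moving from `base` to `new`."""
    return 100.0 * (base - new) / base


def compare(base_col: int, new_col: int) -> tuple[float, float]:
    """Percentage decline in the raw coefficient and in the standardized beta."""
    (coef0, beta0), (coef1, beta1) = TABLE_VII[base_col], TABLE_VII[new_col]
    return pct_reduction(coef0, coef1), pct_reduction(beta0, beta1)


def scale_factor(col: int) -> float:
    """Standardized beta divided by the coefficient, i.e. 100 * sd(x) / sd(y)."""
    coef, beta = TABLE_VII[col]
    return beta / coef


if __name__ == "__main__":
    fearon = compare(2, 5)
    lexico = compare(6, 7)
    print("columns (2)->(5): coefficient %.2f%%, beta %.2f%%" % fearon)
    print("columns (6)->(7): coefficient %.2f%%, beta %.2f%%" % lexico)
    print("average coefficient reduction: %.1f%%" % mean([fearon[0], lexico[0]]))
    # Within each sample the scale factor is (up to rounding) the same number.
    print([round(scale_factor(c), 2) for c in TABLE_VII])
```

Columns (1) to (5) share one sample and columns (6) and (7) another, which is why the scale factor differs between the two groups but not within them.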
There are several possible (and not mutually exclusive) channels through which relative genetic distance may operate as a barrier to the diffusion of innovations and development. One possibility is that genetic distance creates obstacles to interaction and communication that cannot be overcome through translation technologies (such as those readily available when people speak different languages). For example, genetic distance may reflect biological traits that, for cultural reasons (racism, discrimination, lack of trust), affect people’s willingness to interact with each other. This would be consistent with work by Guiso, Sapienza, and Zingales (2004, revised in 2008) on cultural biases and trade. Even when people are willing to interact with each other, communication and adaptation of each other’s innovations may be hampered by deep cultural differences (norms, values, habits, etc.) that are not codifiable and translatable from one society to the other. This would be consistent with the evidence in Desmet et al. (2007), already mentioned in the Introduction, showing a strong correlation between genetic distance and answers to 430 questions about norms, values, and cultural characteristics in the World Values Survey, correlations that remain even after controlling for linguistic distance.61 Such characteristics may facilitate the diffusion of innovations across cultures that share a set of common attitudes, while preventing or slowing down such diffusion when societies are more distant across a large range of values and norms.62 More generally, our results suggest that (a) societies

61. Desmet et al. (2007) compared the matrix of genetic distances between fourteen European countries from Cavalli-Sforza, Menozzi, and Piazza (1994) with the answers given in the World Values Survey (WVS) to 430 questions on “Perceptions of Life, Family and Religion and Morals” from the four WVS waves currently available online at http://www.worldvaluessurvey.org/. They construct a matrix of opinion poll distances across countries, such that each element of the matrix represents the average Manhattan distance between each pair of nations in their respective responses to the 430 questions (p. 25). They find that the WVS matrix of cultural distances and the matrix of genetic distances are strongly correlated, with a correlation coefficient equal to .64, and that “the hypothesis of non-positive correlation is strongly rejected based on a Mantel test with 100,000 replications (p-value of .00014)” (p. 27). They also find that the correlation between the matrix of WVS cultural distances and the matrix of genetic distances remains positive and significant even when controlling for a matrix of geographical distances and a matrix of linguistic distances.
62. For example, while the Germans and Austrians share the same language and many other characteristics, there also exist cultural dimensions where the distance between Germany and Britain is smaller than that between Austria and Britain. In the 1999 World Values Survey (http://www.worldvaluessurvey.org/), when asked about “important child quality that children can be encouraged to learn at home,” “hard work” was listed as important by 38.7% of respondents in
differ in more respects than those captured by language and geography, (b) such differences go back to the distant historical and possibly prehistoric past, and (c) such differences still matter for differences in income per capita and the diffusion of modern economic development. Genetic distance represents a novel and useful way to summarize these important traits and characteristics, which are, almost by definition, difficult to codify and measure.

IV.G. Historical Income Data

In this subsection we examine the time variation in the effect of genetic distance in the 500 years that surrounded the Industrial Revolution. We find a pattern of coefficients supportive of our model of diffusion. In Table VIII, we use income per capita data since 1500 from Maddison (2003), and repeat our basic reduced-form regression for 1500, 1700, 1820, 1870, 1913, and 1960.63 Our measure of genetic distance is now FST genetic distance between plurality groups, relative to the English population.64 This is both the group to which the plurality genetic group in the United States is matched for the modern period, and (conveniently) the group located in the birthplace of the Industrial Revolution. For the 1500 and 1700 regressions, we use the early match for genetic distance, that is, genetic distance between populations as they were in 1492, prior to the discovery of the Americas and the great migrations
Britain, 23% in Germany, and only 9.9% in Austria, whereas “saving money” was mentioned as important by 32.8% of respondents in Britain, 35% in Germany, and 47.6% in Austria. These numbers are just quick examples of measured cultural characteristics where the British happen to be more similar to the Germans than to the Austrians. The genetic distance between the Germans and the English is less than half that between the English and the Austrians in our European sample.
63. The data on income for 1960 and 1995 are from the Penn World Tables version 6.1. For the pre–Industrial Revolution periods (1500 and 1700), where the level of technology might be well captured by population density rather than per capita income, we also used the absolute difference in log population density as the dependent variable instead of the absolute difference in log per capita income. Relative genetic distance again came out positive and significant, with a standardized beta of about 30%, slightly smaller than but in line with what we find for income differences. For a recent analysis of economic development in precolonial, Malthusian times, using population density as the main dependent variable, see Ashraf and Galor (2008).
64. We also used Italy as the reference point for the early periods, because there is evidence that Italy was the technological leader in Europe during the Renaissance. This led to no appreciable difference in the results. The Italians and the English are genetically very close relative to average worldwide genetic distance: the genetic distance between the English and the Italians is 0.0072, whereas the average genetic distance between world populations is 0.111.
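Footnote 61 summarizes the procedure of Desmet et al. (2007): a matrix of average Manhattan distances between countries' survey answers is compared with the matrix of genetic distances using a Mantel permutation test. The Python sketch below is a generic, standard-library-only version of that kind of test, written under our own assumptions (Pearson correlation over the upper triangle, rows and columns of one matrix permuted together, one-sided p-value computed as (hits + 1)/(permutations + 1)). It is not Desmet et al.'s code, and the six answer profiles in the usage example are hypothetical.

```python
from __future__ import annotations

import math
import random
from itertools import combinations
from typing import List, Sequence

Matrix = List[List[float]]


def avg_manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Average absolute difference between two countries' answer vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distance_matrix(profiles: Sequence[Sequence[float]]) -> Matrix:
    """Symmetric matrix of average Manhattan distances between all profiles."""
    return [[avg_manhattan(p, q) for q in profiles] for p in profiles]


def upper_triangle(m: Matrix, order: Sequence[int]) -> List[float]:
    """Off-diagonal upper-triangle entries of m after relabeling by `order`."""
    return [m[order[i]][order[j]] for i, j in combinations(range(len(m)), 2)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(d1: Matrix, d2: Matrix, n_perm: int = 9999, seed: int = 0) -> tuple[float, float]:
    """One-sided Mantel test of H0: no positive association between d1 and d2.

    Rows and columns of d2 are permuted together; the observed statistic is
    counted as one of the permutations when forming the p-value.
    """
    identity = list(range(len(d1)))
    x = upper_triangle(d1, identity)
    r_obs = pearson(x, upper_triangle(d2, identity))
    rng = random.Random(seed)
    order = identity[:]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        if pearson(x, upper_triangle(d2, order)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical answer profiles for six countries (illustration only).
    answers = [[1, 2, 3], [2, 2, 4], [5, 1, 1], [3, 3, 3], [4, 5, 2], [1, 5, 5]]
    cultural = distance_matrix(answers)
    # A second matrix built to track the first exactly, so r is 1.
    other = [[2.0 * v for v in row] for row in cultural]
    print(mantel(cultural, other, n_perm=999, seed=1))
```

Under this p-value convention, 100,000 permutations (the number quoted in footnote 61) allow p-values as small as about 1/100,001; the toy example uses far fewer so that it runs instantly.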
TABLE VIII
REGRESSIONS USING HISTORICAL INCOME DATA (TWO-WAY CLUSTERED STANDARD ERRORS)

| | (1) Income 1500 | (2) Income 1700 | (3) Income 1820 | (4) Income 1870 | (5) Income 1913 | (6) Income 1960 | (7) Income 1995 |
|---|---|---|---|---|---|---|---|
| Relative FST genetic distance to the English, 1500 match | 2.059 (0.479)∗∗ | 2.788 (0.515)∗∗ | | | | | |
| FST genetic distance relative to the English, weighted | | | 0.671 (0.338)∗∗ | 1.684 (0.846)∗∗ | 1.967 (0.924)∗∗ | 3.503 (0.784)∗∗ | 4.948 (0.785)∗∗ |
| Absolute difference in latitudes | 0.233 (0.124)∗ | 0.690 (0.180)∗∗ | 1.034 (0.199)∗∗ | 1.188 (0.274)∗∗ | 1.286 (0.263)∗∗ | 1.201 (0.244)∗∗ | 0.527 (0.245)∗∗ |
| Absolute difference in longitudes | 0.040 (0.075) | 0.164 (0.085)∗ | 0.525 (0.144)∗∗ | 0.692 (0.267)∗∗ | 0.950 (0.261)∗∗ | 0.742 (0.248)∗∗ | 0.420 (0.239)∗ |
| Geodesic distance (1,000s of km) | −0.015 (0.023) | −0.082 (0.025)∗∗ | −0.096 (0.033)∗∗ | −0.124 (0.064)∗ | −0.175 (0.084)∗∗ | −0.171 (0.065)∗∗ | −0.087 (0.040)∗∗ |
| =1 for contiguity | −0.051 (0.047) | −0.168 (0.054)∗∗ | −0.226 (0.053)∗∗ | −0.257 (0.048)∗∗ | −0.272 (0.049)∗∗ | −0.102 (0.064) | −0.466 (0.064)∗∗ |
| =1 if either country is an island | −0.069 (0.031)∗∗ | 0.003 (0.044) | −0.067 (0.029)∗∗ | 0.070 (0.099) | 0.062 (0.088) | −0.010 (0.073) | 0.177 (0.095)∗ |
| =1 if either country is landlocked | −0.042 (0.032) | −0.018 (0.063) | 0.136 (0.037)∗∗ | 0.173 (0.078)∗∗ | 0.217 (0.081)∗∗ | 0.125 (0.089) | 0.083 (0.075) |
| =1 if pair shares at least one sea or ocean | −0.027 (0.045) | −0.045 (0.067) | −0.009 (0.042) | 0.070 (0.049) | 0.118 (0.047)∗∗ | 0.082 (0.061) | −0.011 (0.067) |
| Freight rate (surface transport) | −0.005 (0.863) | 1.727 (1.187) | 1.098 (1.045) | 1.668 (2.859) | 3.072 (4.374) | 4.600 (3.331) | 1.362 (1.584) |
| Constant | 0.249 (0.149)∗ | 0.051 (0.170) | 0.139 (0.192) | 0.100 (0.471) | −0.066 (0.689) | −0.250 (0.526) | 0.663 (0.267)∗∗ |
| # observations | 325 | 406 | 1,035 | 1,485 | 1,653 | 4,753 | 9,316 |
| # countries | 26 | 29 | 46 | 55 | 58 | 98 | 137 |
| Standardized beta (%) | 45.01 | 40.48 | 8.74 | 14.95 | 14.89 | 29.07 | 32.84 |
| Standardized beta (%) (common sample of 325 obs.) | 45.01 | 40.58 | 10.20 | 32.63 | 29.71 | 26.24 | 25.35 |
| R2 | .18 | .24 | .23 | .16 | .17 | .17 | .13 |

Notes. Dependent variable: difference in per capita income, in the years specified in the column headings. Two-way clustered standard errors in parentheses. ∗ Significant at 10%. ∗∗ Significant at 5%.
of modern times.65 For the subsequent periods we use the current match. Table VIII shows that across all periods, the coefficient on relative genetic distance is statistically significant and positive. Moreover, except in 1820, the magnitudes are larger than for the current period: in regressions obtained from a common sample of 26 countries (325 pairs) for which data are continuously available, standardized beta coefficients range from 25.35% (in 1995) to 45.01% (in 1500), with the exception of 1820 (10.20%). Thus, genetic distance is strongly positively correlated with income differences throughout modern history. It is worth noting that genetic distance bears a large, positive, and significant effect on income differences for the past five centuries, even though income differences in 1500 and in 1995 are basically uncorrelated. (Table II shows this correlation to be −0.051 for the 325 country pairs for which data are available.) This noteworthy fact is highly consistent with our interpretation of genetic distance as a barrier to the diffusion of innovations across populations: genetic distance remains significant throughout the centuries despite significant reversals of fortune since 1500, and despite the fact that genetic distance itself remained highly persistent (composition effects related to the conquest of the New World being the only significant sources of change).

The time pattern of the effect in the common sample of 26 countries provides additional clues that support our interpretation. The standardized beta on genetic distance decreases from 1500 to 1820, then increases significantly in 1870 during the Industrial Revolution, and declines gradually thereafter. The shape of this time path during the 19th century is consistent with the view that the effect captures the diffusion of economic development from the world technological frontier, in particular the gradual spread of the Industrial Revolution.
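The time path just described can be read off the common-sample row of Table VIII. The short Python sketch below transcribes those standardized betas, labels each step between consecutive benchmark years as a rise or a fall, and also checks that the observation counts used throughout are numbers of unordered country pairs, n(n − 1)/2 (26 countries give 325 pairs; 137 countries give 9,316). The dictionary and function names are illustrative, and nothing here re-estimates the regressions.

```python
from __future__ import annotations

from math import comb

# Standardized beta (%) on relative genetic distance in the common sample
# of 26 countries (325 pairs), transcribed from Table VIII.
COMMON_SAMPLE_BETA = {
    1500: 45.01, 1700: 40.58, 1820: 10.20, 1870: 32.63,
    1913: 29.71, 1960: 26.24, 1995: 25.35,
}


def n_pairs(n_countries: int) -> int:
    """Number of unordered country pairs, n(n - 1) / 2."""
    return comb(n_countries, 2)


def step_directions(betas: dict[int, float]) -> list[tuple[int, int, str]]:
    """Label each move between consecutive years as 'up' or 'down'."""
    years = sorted(betas)
    return [
        (a, b, "up" if betas[b] > betas[a] else "down")
        for a, b in zip(years, years[1:])
    ]


if __name__ == "__main__":
    # Observation counts in Tables VII and VIII are numbers of country pairs.
    for n in (26, 29, 61, 137):
        print(n, "countries ->", n_pairs(n), "pairs")
    for start, end, direction in step_directions(COMMON_SAMPLE_BETA):
        print(f"{start}->{end}: {direction}")
    trough = min(COMMON_SAMPLE_BETA, key=COMMON_SAMPLE_BETA.get)
    print("lowest standardized beta in:", trough)
```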
A major shift in the growth regime (the Industrial Revolution) initially results in large income discrepancies. These discrepancies persist in proportion to genealogical distance from the frontier. As more and more countries adopt the major innovation, the impact of genetic distance progressively

65. Regressions for these early periods feature at most 29 countries. These countries are Australia, Austria, Belgium, Brazil, Canada, China, Denmark, Egypt, Finland, France, Germany, Greece, India, Indonesia, Ireland, Italy, Japan, Korea, Mexico, Morocco, Netherlands, New Zealand, Norway, Portugal, Spain, Sweden, Switzerland, the United Kingdom, and the United States. There were 325 pairs (26 countries) with available data for 1500 income and 406 pairs (29 countries) for 1700.
declines.66 At the same time, the slight decrease of the effect in recent times suggests that the impact of genetic distance may progressively decline, as more and more countries adopt the frontier innovations and intersocietal barriers to the diffusion of development decrease through globalization and other forces.67

IV.H. Genetic Distance across European Countries

As the last step in our empirical investigation, we provide a detailed analysis of the European sample. Analyzing the European data can be informative for several reasons. First, it constitutes a robustness check on the worldwide results. Second, matching populations to countries is much more straightforward for Europe than for the rest of the world, because the choice of sampled populations happens to match nation-state boundaries. This should reduce the incidence of measurement error. Third, genetic distances are orders of magnitude smaller across countries of Europe, and genetic specificities there have developed over the last few thousand years (and not tens of thousands of years). It is very unlikely that any genetic traits have risen to prominence within Europe as the result of strong natural selection over such a short period of time, so a finding that genetic distance based on neutral markers within Europe is associated with income differences would be evidence that barriers to the diffusion of development are primarily induced by differences in culturally transmitted traits. We maintain the choice of the United States as the frontier country. This requires us to use measures of genetic distance based on plurality groups, because we lack the data to calculate weighted genetic distances from European countries to the United States.68 To maintain consistency throughout, we also use measures of linguistic and religious distance based on plurality languages and plurality religions (this choice does not matter in terms of our

66. In terms of the comparative statics of our simplified reduced-form model, an increase in the effect of genetic distance on income differences may be expected right after a big jump in the parameter at the technological frontier. A more general interpretation is that the effect should increase after a series of big positive shocks to technology at the frontier, including possibly to the R&D technology itself. See the discussion in Howitt and Mayer-Foulkes (2005). An analytical formalization of these ideas within a dynamic extension of our framework is available upon request.
67. Within our simplified model, globalization and other forces that reduce intersocietal barriers can be interpreted as a reduction in the parameter β.
68. We do not have data on genetic distance between West Africans, Central Amerindians, and Chinese, on the one hand, and European populations at the level of precision of the European data set, on the other. These would be required to compute weighted genetic distance from European countries to the United States.
empirical results). Because the plurality population of the United States is matched to the English population, genetic distance is entered relative to the English. Similarly, linguistic distance is relative to the English language, and religious distance is relative to Protestants.69

Univariate Regressions and Geographic Controls. In Table IX we present univariate regressions and regressions controlling for geography and transport costs.70 Genetic distance is again positively and significantly associated with income differences. Columns (1) and (2) confirm empirically Implication 2 of our model: the coefficient on relative genetic distance is about 30% larger than the coefficient on absolute genetic distance. These effects also are about 30% larger in magnitude than the corresponding effects found in the world sample (Table III, columns (1)–(4)). While genetic distances across European countries are smaller than in the world sample, so are the income differences to be explained. We then add distance metrics and a large set of micro-geography controls as defined in Section IV.D.71 The main direction of geographic inequality in Europe seems to be longitudinal, between the former Soviet Bloc countries and the West. The controls include the variables used in Giuliano, Spilimbergo, and Tonon (2006), namely a dummy variable taking a value of 1 if countries in a pair share access to the same sea or ocean; a variable that measures the average elevation of countries

69. Our results are robust to using Germany or the United Kingdom as the frontier countries instead of the United States. Results with the Germany and UK baselines are available upon request. It is not surprising that the results using the United Kingdom as the baseline would be similar to those using the United States (the genetic and linguistic groups are the same, and only the religious plurality groups differ). Germans are genetically very close to the English, and, as in the United States, the plurality religion is Protestant, so again the results change little when Germany is used as the baseline. In fact, any genetic group in Northwestern Europe is close to the English.
70. Summary statistics for the European sample are available upon request. On these, brief observations are in order. First, as stressed by Giuliano, Spilimbergo, and Tonon (2006), genetic distance is indeed correlated with geodesic distance and transportation costs (geodesic distance and freight costs themselves bear a .968 correlation with each other). This remains true, though the correlations are weaker, when relative genetic distance is considered. Second, as in the world sample, relative genetic distance is only weakly correlated with relative linguistic and religious distances. Third, the correlation of absolute log income differences with absolute genetic distance (.328) is smaller than their correlation with relative genetic distance to the English (.409), as our model predicts.
71. Again, introducing distance metrics and freight costs relative to the frontier (either the United Kingdom, the United States, or Germany) did not change our results concerning genetic distance. These regressions are available upon request.
TABLE IX
RESULTS FOR THE EUROPEAN DATA SET (TWO-WAY CLUSTERED STANDARD ERRORS)

| | (1) No controls, simple genetic distance | (2) No controls, relative genetic distance | (3) Add geography | (4) 1870 income data |
|---|---|---|---|---|
| FST genetic distance in Europe | 28.134 (14.605)∗ | | | |
| FST genetic distance, relative to the English | | 45.222 (22.193)∗∗ | 39.333 (18.708)∗∗ | 39.842 (11.052)∗∗ |
| Absolute difference in latitudes | | | −0.800 (0.723) | 0.467 (1.397) |
| Absolute difference in longitudes | | | 0.233 (0.129)∗ | 1.022 (1.171) |
| Geodesic distance (1,000s of km) | | | −0.344 (0.306) | −0.150 (0.177) |
| =1 for contiguity | | | −0.136 (0.073)∗ | −0.204 (0.063)∗∗ |
| =1 if either country is an island | | | −0.078 (0.087) | 0.039 (0.086) |
| =1 if pair shares at least one sea or ocean | | | −0.159 (0.137) | −0.063 (0.070) |
| Average elevation between countries | | | −0.028 (0.223) | −0.049 (0.141) |
| Freight rate (surface transport) | | | 16.532 (14.387) | −4.004 (5.938) |
| =1 if either country is landlocked | | | 0.074 (0.178) | a |
| Constant | 0.378 (0.099)∗∗ | 0.382 (0.084)∗∗ | −2.079 (2.293) | 1.130 (0.947) |
| # of observations | 325 | 325 | 325 | 171 |
| # of countries | 26 | 26 | 26 | 19 |
| Standardized beta (%) | 31.69 | 39.80 | 34.61 | 59.28 |
| R2 | .10 | .16 | .21 | .39 |
Notes. Dependent variable: Difference in log per capita income across pairs (in 1995 for columns (1)-(3), in 1870 for column (4)). Two-way clustered standard errors in parentheses. a Dropped due to singularity. ∗ Significant at 10%. ∗∗ Significant at 5%.
that lies on the direct path between two countries (for instance, the average elevation between France and Austria is the average elevation of France, Germany, and Austria); and their measure of freight costs from the Import Export Wizard. In addition to these controls, we include further measures of isolation: a dummy for contiguity, a dummy taking a value of 1 if at least one country in a pair is landlocked, and a similar dummy variable for islands. Together, the inclusion of these variables reduces the standardized effect of relative genetic distance from 41.08% to 34.61%, but the effect remains larger than in the world sample and statistically significant at the 5% level. Although it is clearly crucial to control for geographic factors here, these factors do not seem to play nearly as important a role in income differences as they apparently do in bilateral trade between European countries. The last column of Table IX uses income differences in 1870 as the dependent variable. Although we lose seven countries for lack of income data in the Maddison source, the results on genetic distance relative to the English are quantitatively and statistically much stronger than those for the contemporary period, consistent with the findings reported in Section IV.G: upon impact, a major new innovation such as the Industrial Revolution diffuses more slowly to countries that are genetically more distant from the frontier.72

Controlling for Linguistic and Religious Distance. In a short comment on our work, Fearon (2006) examined the interrelationships between genetic and linguistic distance within Europe. Regressing income levels on genetic distance from the English, geodesic distance to the United Kingdom, and linguistic distance from the English language (using both his data based on linguistic trees and the lexicostatistical data), he found that genetic distance was generally robust to the inclusion of these variables.73 We reexamine this issue using our bilateral methodology and our full set of geographic controls. Table X presents the results.74 The bottom line is that the magnitude of the genetic distance effect is not affected by the inclusion of linguistic and religious distance. We refer the reader to Section IV.F for an interpretation of these results.

72. It is convenient (and surely not a coincidence) that the baseline population for calculating relative genetic distance, the English, is the plurality group both in the United States (the frontier in 1995) and in the birthplace of the Industrial Revolution (the frontier in 1870).
73. Fearon (2006) presented one regression with 22 observations using the lexicostatistical data in which the t-statistic on the coefficient on genetic distance to the English fell to 1.5. We have replicated this regression with all 23 European countries for which the lexicostatistical and linguistic trees data are available and found that genetic distance remained statistically significant at the 5% level (Iceland, Hungary, and Finland are missing from these regressions due to lack of data on one or the other linguistic distance measure). These results are available upon request.

74. While our baseline sample features 26 countries, we lack observations on linguistic and religious distance for Iceland, and we lack lexicostatistical data for Hungary and Finland, where Indo-European languages are not spoken. As a result, columns (1) and (3) present baseline regressions, including geographic controls only, for comparison.
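The observation counts in Tables IX and X follow from the bilateral design described in Appendix II: with N countries there are N(N − 1)/2 distinct pairs, so 26 countries give 325 pairs, 25 give 300, 24 give 276, and 19 give 171. The Python sketch below is an illustration, not the authors' code. It enumerates pairs in the order given in Appendix II and builds the residual covariance pattern assumed there: a common variance on the diagonal, a common covariance σ_m for two pairs sharing country m, and zero otherwise. The function names and the covariance values in the usage lines are hypothetical.

```python
from itertools import combinations

import numpy as np


def country_pairs(n):
    """Return the n(n-1)/2 pairs (i, j), i < j, of countries labeled 1..n,
    in the Appendix II order: (1, 2), ..., (1, n), (2, 3), ..., (n-1, n)."""
    return list(combinations(range(1, n + 1), 2))


def pair_covariance(n, sigma_eps2, sigma_m):
    """Residual covariance matrix under the Appendix II assumption.

    sigma_eps2 -- common error variance on the diagonal
    sigma_m    -- dict mapping each country m to the common covariance of
                  two distinct pairs that both contain m
    Pairs with no country in common get covariance 0.
    """
    pairs = country_pairs(n)
    cov = np.zeros((len(pairs), len(pairs)))
    for r, a in enumerate(pairs):
        for c, b in enumerate(pairs):
            if r == c:
                cov[r, c] = sigma_eps2
                continue
            shared = set(a) & set(b)
            # Two distinct pairs can share at most one country.
            if shared:
                cov[r, c] = sigma_m[shared.pop()]
    return cov


if __name__ == "__main__":
    # (countries, observations) as reported in Tables IX and X.
    for n_countries, n_obs in [(26, 325), (19, 171), (25, 300), (24, 276)]:
        assert len(country_pairs(n_countries)) == n_obs
    print(country_pairs(4))  # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
    # N = 4 example with hypothetical covariance values.
    print(pair_covariance(4, 1.0, {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}))
```

Representing the dependence by which country two pairs share, rather than filling the matrix by hand, makes it easy to see that the number of nonzero off-diagonal terms grows quickly with N, which is why ordinary least-squares standard errors are not used.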
TABLE X
CONTROLLING FOR RELIGIOUS AND LINGUISTIC DISTANCE IN THE EUROPE DATA SET (TWO-WAY CLUSTERED STANDARD ERRORS)

| | (1) Baseline | (2) Linguistic and religious distance | (3) Baseline (smaller sample) | (4) % cognate, plurality |
|---|---|---|---|---|
| FST genetic distance, relative to the English | 41.691 (18.875)∗∗ | 42.485 (19.310)∗∗ | 44.252 (20.209)∗∗ | 44.096 (20.288)∗∗ |
| Linguistic distance, plurality, relative to English | | −0.224 (0.109)∗∗ | | |
| Religious distance, plurality, relative to Protestants | | 0.107 (0.171) | | |
| 1 − % cognate, plurality, relative to English | | | | 0.045 (0.186) |
| Constant | −1.537 (2.025) | −1.445 (1.973) | −1.339 (2.161) | −1.314 (2.231) |
| # of observations | 300 | 300 | 276 | 276 |
| # of countries | 25 | 25 | 24 | 24 |
| Standardized beta (%) | 37.55 | 38.26 | 39.52 | 39.38 |
| R2 | .21 | .22 | .25 | .25 |
Notes. Dependent variable: absolute value of log income differences, 1995. Two-way clustered standard errors in parentheses. All columns include the following controls (estimates not reported): absolute difference in latitudes, absolute difference in longitudes, geodesic distance, dummy for contiguity, dummy = 1 if either country is landlocked, dummy = 1 if pair shares at least one sea or ocean, average elevation between countries, freight rate (surface transport). Compared to Table IX, in columns (1) and (2) Iceland is dropped due to missing data on linguistic and religious distance from Fearon. In columns (3) and (4) Hungary and Finland are dropped because their languages are not Indo-European, and thus not part of the lexicostatistical data set. ∗ Significant at 10%. ∗∗ Significant at 5%.
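Tables IX and X report two-way clustered standard errors, and the reference list cites Cameron, Gelbach, and Miller (2006) on multi-way clustering. A minimal NumPy sketch of their two-way formula follows. It assumes, as our own choice since the text does not spell it out, that the two clustering dimensions are the first and second country of each pair. The variance is the one-way cluster-robust variance for each dimension, summed, minus the variance clustered on their intersection; in pair data the intersection is a single observation, so the subtracted term is the heteroskedasticity-robust (White) variance. No small-sample corrections are applied, and the function names are hypothetical. Note one limitation of this cluster definition: a country that appears first in some pairs and second in others (country 2 in ε12 and ε23) falls into two different clusters, so the sketch does not capture every nonzero covariance in the Appendix II matrix.

```python
import numpy as np


def ols(y, X):
    """Ordinary least squares: return coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta


def cluster_meat(X, resid, groups):
    """Sum over clusters g of s_g s_g', where s_g = X_g' u_g is the cluster score."""
    k = X.shape[1]
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        in_g = groups == g
        score = X[in_g].T @ resid[in_g]
        meat += np.outer(score, score)
    return meat


def twoway_clustered_se(y, X, first, second):
    """Two-way clustered standard errors in the Cameron-Gelbach-Miller form
    V = V(first) + V(second) - V(first & second), each term a sandwich
    (X'X)^-1 M (X'X)^-1. No small-sample corrections are applied."""
    beta, u = ols(y, X)
    bread = np.linalg.inv(X.T @ X)
    # Intersection clusters: one label per distinct (first, second) combination.
    # In pair data each combination is a single observation, so this term is
    # the heteroskedasticity-robust (White) middle matrix.
    both = np.array([f"{a}|{b}" for a, b in zip(first, second)])
    meat = (cluster_meat(X, u, first)
            + cluster_meat(X, u, second)
            - cluster_meat(X, u, both))
    V = bread @ meat @ bread
    # V is not guaranteed to be positive semidefinite; clip negative diagonal
    # entries to zero as a crude guard before taking square roots.
    return beta, np.sqrt(np.clip(np.diag(V), 0.0, None))


if __name__ == "__main__":
    # Synthetic illustration only: 8 hypothetical countries, all 28 pairs,
    # one made-up bilateral regressor plus a constant.
    rng = np.random.default_rng(0)
    n = 8
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    first = np.array([i for i, _ in pairs])
    second = np.array([j for _, j in pairs])
    x = rng.normal(size=len(pairs))
    y = 1.0 + 2.0 * x + rng.normal(size=len(pairs))
    X = np.column_stack([np.ones(len(pairs)), x])
    beta, se = twoway_clustered_se(y, X, first, second)
    print("coefficients:", beta)
    print("two-way clustered SEs:", se)
```

If the second dimension is replaced by a unique observation identifier, the formula collapses to ordinary one-way clustering on the first country, which is a convenient sanity check. The clipping step is only a guard; more careful corrections for a non-positive-semidefinite two-way matrix exist.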
V. CONCLUSION

In this paper we make two contributions: (1) for the first time, we document a statistically and economically significant positive relationship between measures of genetic distance and cross-country income differences, even controlling for measures of geographical and climatic distances, transportation costs, and measures of historical, linguistic, and religious distance; and (2) we provide an economic interpretation of these findings, in terms of barriers to the diffusion of development from the world technological frontier.

Our interpretation is based on two main ideas. The first idea is that, on average, genetic distance captures divergence in characteristics that are transmitted across generations within populations over the very long run. Genetic distance, measuring the time
since two populations shared common ancestors, provides an ideal summary of differences in slowly changing genealogically transmitted characteristics, including habits and customs. The second idea is that such differences act as barriers to the diffusion of development from the world technological frontier. The empirical evidence is consistent with this barriers interpretation. In line with our framework, the effect on economic distance associated with relative genetic distance from the world technological frontier is larger than the effect of absolute genetic distance. We also found that the effect has varied across time and space in ways that support our diffusion-from-the-frontier interpretation: the effect increased in the first part of the nineteenth century, peaked in 1870, and slightly decreased afterward, consistent with the view that relative genetic distance captures barriers to the diffusion of the Industrial Revolution. Some evidence, particularly the results for European countries, suggests that these differences may stem in substantial part from cultural (rather than purely biological) transmission of characteristics across generations. Although our analysis provides a general macroeconomic framework to interpret our empirical findings, the study of the specific microeconomic mechanisms through which the effects operate is left for future research. An analysis of microeconomic data may shed light on the relations among genetic distance, vertical characteristics, imitation costs, and the spread of specific innovations.75 Interestingly, we have found that linguistic and religious distances, two culturally transmitted characteristics, only slightly reduce the effect of genetic distance on income differences, therefore suggesting a role for other slow-changing biological and/or cultural traits—including differences in customs, norms, or values. 
These traits are inherently harder to measure, particularly within the long-term macroeconomic perspective that we have adopted. Another natural extension of our work would be to investigate whether and how genetic distance affects bilateral exchanges and interactions between different groups and societies, both peacefully (trade, foreign direct investment) and nonpeacefully (conflict and wars). Finally, it would be interesting to link our results to the vast literature on demographics and economic

75. For instance, recent microeconomic analysis of international technological diffusion finds an important role for ethnic scientific communities, consistent with our interpretation (Kerr 2008).
growth and explore the connections between genetic distance, intergenerationally transmitted characteristics, and demographic patterns.76

A final consideration concerns policy implications. A common concern with research documenting the importance of variables such as genetic distance or geography is that it leads to pessimism about policy. What use is it to know that genetic distance explains income differences, if one cannot change genetic distance, at least in the short run? These concerns miss a bigger point: available policy variables may have a major impact not on genetic distance itself, but on the magnitude of the effect of genetic distance on income differences. This magnitude has changed over time, and can change further. If we are correct in interpreting our results as evidence for long-term barriers across different societies, significant reductions in income disparities could be obtained by encouraging policies that reduce such barriers, including efforts to translate and adapt technological and institutional innovations to different histories and traditions, and to foster cross-societal exchanges and openness. More work is needed, at the micro as well as the macro level, to understand the specific mechanisms, market forces, and policies that could facilitate the diffusion of development across societies with distinct long-term histories.

APPENDIX I: DEFINITION OF FST

In this Appendix we illustrate the construction of $F_{ST}$ for the simple case of two populations ($a$ and $b$) of equal size, and one gene that can take only two forms (allele 1 and allele 2). Let $p_a$ and $q_a = 1 - p_a$ be the gene frequencies of allele 1 and allele 2, respectively, in population $a$.77 The probability that two randomly selected alleles at the given locus are identical within the population (homozygosity) is $p_a^2 + q_a^2$, and the probability that they are different (heterozygosity) is

$$h_a = 1 - (p_a^2 + q_a^2) = 2p_aq_a. \tag{13}$$
76. For instance, Coale and Cotts Watkins (1986) documented the correlation between cultural similarity and the time paths of fertility across Europe (see also Richerson and Boyd [2004, Chapter 5]). Galor (2005) provides an in-depth discussion of the economic literature on demographics and growth.

77. Note that because $p_a + q_a = 1$, we also have $(p_a + q_a)^2 = p_a^2 + q_a^2 + 2p_aq_a = 1$.
By the same token, heterozygosity in population $b$ is

$$h_b = 1 - (p_b^2 + q_b^2) = 2p_bq_b, \tag{14}$$

where $p_b$ and $q_b = 1 - p_b$ are the gene frequencies of allele 1 and allele 2, respectively, in population $b$. The average gene frequencies of allele 1 and allele 2 in the two populations are, respectively,

$$p = \frac{p_a + p_b}{2} \tag{15}$$

and

$$q = \frac{q_a + q_b}{2} = 1 - p. \tag{16}$$

Heterozygosity in the sum of the two populations is

$$h = 1 - (p^2 + q^2) = 2pq. \tag{17}$$

Average heterozygosity is measured by

$$h_m = \frac{h_a + h_b}{2}. \tag{18}$$

$F_{ST}$ measures the variation in the gene frequencies of populations by comparing $h$ and $h_m$:

$$F_{ST} = 1 - \frac{h_m}{h} = 1 - \frac{p_aq_a + p_bq_b}{2pq} = \frac{1}{4}\,\frac{(p_a - p_b)^2}{p(1 - p)}. \tag{19}$$
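Equations (13) through (19) can be checked numerically. The Python sketch below, with hypothetical function names, computes the two-population, single-locus $F_{ST}$ step by step and compares it with the closed form at the end of equation (19). It covers only this simple biallelic, equal-size case, not the generalizations to many alleles, many populations, and sampling-bias corrections.

```python
def fst_two_populations(p_a, p_b):
    """F_ST for one biallelic locus in two equal-size populations,
    built step by step from equations (13)-(19) of Appendix I."""
    for p in (p_a, p_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    h_a = 1.0 - (p_a**2 + q_a**2)   # (13) heterozygosity in a, = 2 p_a q_a
    h_b = 1.0 - (p_b**2 + q_b**2)   # (14) heterozygosity in b, = 2 p_b q_b
    p = (p_a + p_b) / 2.0           # (15) average frequency of allele 1
    q = 1.0 - p                     # (16) average frequency of allele 2
    h = 1.0 - (p**2 + q**2)         # (17) heterozygosity of the pooled populations
    h_m = (h_a + h_b) / 2.0         # (18) average heterozygosity
    if h == 0.0:
        # Both populations are fixed for the same allele, so their
        # frequencies are identical and F_ST is 0.
        return 0.0
    return 1.0 - h_m / h            # (19)


def fst_closed_form(p_a, p_b):
    """Closed form of (19): (p_a - p_b)^2 / (4 p (1 - p))."""
    p = (p_a + p_b) / 2.0
    if p in (0.0, 1.0):
        return 0.0
    return (p_a - p_b) ** 2 / (4.0 * p * (1.0 - p))


if __name__ == "__main__":
    print(fst_two_populations(0.2, 0.6))   # about 0.1667 (= 0.16 / 0.96)
    print(fst_two_populations(1.0, 0.0))   # 1.0: completely different
    print(fst_two_populations(0.3, 0.3))   # 0.0: identical frequencies
```

For instance, with $p_a = 0.2$ and $p_b = 0.6$ we get $p = 0.4$, $h = 0.48$, $h_m = 0.40$, and $F_{ST} = 1 - 0.40/0.48 = 0.16/0.96 \approx 0.167$, matching the closed form.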
If the two populations have identical allele frequencies ($p_a = p_b$), $F_{ST}$ is zero. On the other hand, if the two populations are completely different at the given locus ($p_a = 1$ and $p_b = 0$, or $p_a = 0$ and $p_b = 1$), $F_{ST}$ takes the value 1. In general, the higher the variation in the allele frequencies across the two populations, the higher is their $F_{ST}$ distance. The formula can be extended to account for $L$ alleles, $S$ populations, and different population sizes and to adjust for sampling bias. The details of these generalizations are provided in Cavalli-Sforza, Menozzi, and Piazza (1994, pp. 26–27).

APPENDIX II: SPATIAL CORRELATION

This Appendix illustrates why spatial correlation may be present in our bilateral analysis. Consider three countries, 1, 2,
and 3. Observations on the dependent variable, $|\log y_1 - \log y_2|$ and $|\log y_1 - \log y_3|$, will be correlated by virtue of the presence of country 1 in both observations. Conditioning on the right-hand-side variables (which are bilateral in nature) should reduce cross-sectional dependence in the errors $\varepsilon_{12}$ and $\varepsilon_{13}$, but we are unwilling to assume that observations on the dependent variable are independent conditional on the regressors. In other words, simple least-squares standard errors will lead to misleading inferences due to spatial correlation.

With $N$ countries, there are $N(N-1)/2$ distinct pairs. Denote the observation on absolute value income differences between country $i$ and country $j$ as $dy_{ij}$. Pairs are ordered so that country 1 appears in position $i$ and is matched with all countries from 2 to $N$ appearing in position $j$. Then country 2 is in position $i$ and is matched with 3 to $N$ appearing in position $j$, and so on. The last observation has country $N-1$ in position $i$ and country $N$ in position $j$. We denote the nonzero off-diagonal elements of the residual covariance matrix by $\sigma_m$, where $m$ is the country common to each pair. A simple example when the number of countries is $N = 4$ is illustrative. In this case, under our maintained assumption that the error covariances among pairs containing a common country $m$ are equal to a common value $\sigma_m$, the covariance matrix of the vector of residuals $\varepsilon$ is of the form

$$\operatorname{cov}(\varepsilon) = \operatorname{cov}\begin{pmatrix} \varepsilon_{12} \\ \varepsilon_{13} \\ \varepsilon_{14} \\ \varepsilon_{23} \\ \varepsilon_{24} \\ \varepsilon_{34} \end{pmatrix} = \begin{pmatrix} \sigma_\varepsilon^2 & \sigma_1 & \sigma_1 & \sigma_2 & \sigma_2 & 0 \\ \sigma_1 & \sigma_\varepsilon^2 & \sigma_1 & \sigma_3 & 0 & \sigma_3 \\ \sigma_1 & \sigma_1 & \sigma_\varepsilon^2 & 0 & \sigma_4 & \sigma_4 \\ \sigma_2 & \sigma_3 & 0 & \sigma_\varepsilon^2 & \sigma_2 & \sigma_3 \\ \sigma_2 & 0 & \sigma_4 & \sigma_2 & \sigma_\varepsilon^2 & \sigma_4 \\ 0 & \sigma_3 & \sigma_4 & \sigma_3 & \sigma_4 & \sigma_\varepsilon^2 \end{pmatrix}.$$

This clearly demonstrates the presence of cross-sectional (spatial) correlation. It is important to note, however, that our data are not linearly dependent; that is, there is additional information brought in by considering the bilateral approach. One major reason is that the dependent variable is the absolute difference in log income, not just the difference in log income.
It is easy to show that taking absolute values greatly reduces spatial dependence in the dependent variable. Another major reason is that we are conditioning on right-hand-side variables (such as geodesic distance and genetic distance) that are truly bilateral in nature; that
is, our empirical model is not the result of simply differencing a “level” specification across cross-sectional units.

TUFTS UNIVERSITY AND NATIONAL BUREAU OF ECONOMIC RESEARCH
UNIVERSITY OF CALIFORNIA, LOS ANGELES AND NATIONAL BUREAU OF ECONOMIC RESEARCH
REFERENCES

Acemoglu, Daron, Simon Johnson, and James A. Robinson, “The Colonial Origins of Comparative Development: An Empirical Investigation,” American Economic Review, 91 (2001), 1369–1401.
——, “Reversal of Fortune: Geography and Institutions in the Making of the Modern World Income Distribution,” Quarterly Journal of Economics, 117 (2002), 1231–1294.
Alcalá, Francisco, and Antonio Ciccone, “Trade and Productivity,” Quarterly Journal of Economics, 119 (2004), 612–645.
Alesina, Alberto, Arnaud Devleeschauwer, William Easterly, Sergio Kurlat, and Romain Wacziarg, “Fractionalization,” Journal of Economic Growth, 8 (2003), 155–194.
Ashraf, Quamrul, and Oded Galor, “Human Genetic Diversity and Comparative Economic Development,” working paper, Brown University, 2008.
Barro, Robert J., and Xavier Sala-i-Martin, “Technological Diffusion, Convergence and Growth,” Journal of Economic Growth, 2 (1997), 1–26.
Bisin, Alberto, and Thierry Verdier, “Beyond the Melting Pot: Cultural Transmission, Marriage, and the Evolution of Ethnic and Religious Traits,” Quarterly Journal of Economics, 115 (2000), 955–988.
——, “The Economics of Cultural Transmission and the Evolution of Preferences,” Journal of Economic Theory, 97 (2001), 298–319.
Boyd, Robert, and Peter J. Richerson, Culture and the Evolutionary Process (Chicago, IL: University of Chicago Press, 1985).
Boyd, Robert, and Joan B. Silk, How Humans Evolved (New York, NY: Norton, 2003).
Brezis, Elise, Paul Krugman, and Daniel Tsiddon, “Leapfrogging in International Competition: A Theory of Cycles in National Technological Leadership,” American Economic Review, 83 (1993), 1211–1219.
Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller, “Robust Inference with Multi-Way Clustering,” NBER Technical Working Paper No. T0327, 2006.
Case, Anne, “Spatial Patterns in Household Demand,” Econometrica, 59 (1991), 953–965.
Cavalli-Sforza, Luigi L., and Francesco Cavalli-Sforza, The Great Human Diasporas (Reading, MA: Addison-Wesley, 1995).
Cavalli-Sforza, Luigi L., and Marcus W. Feldman, Cultural Transmission and Evolution (Princeton, NJ: Princeton University Press, 1981).
Cavalli-Sforza, Luigi L., Paolo Menozzi, and Alberto Piazza, The History and Geography of Human Genes (Princeton, NJ: Princeton University Press, 1994).
Clark, Gregory, A Farewell to Alms. A Brief Economic History of the World (Princeton, NJ: Princeton University Press, 2007).
Clark, Gregory, and Susan Wolcott, “Why Nations Fail: Managerial Decisions and Performance in Indian Cotton Textiles, 1890–1938,” Journal of Economic History, 59 (1999), 397–423.
Coale, Ansley J., and Susan Cotts Watkins, The Decline of Fertility in Europe (Princeton, NJ: Princeton University Press, 1986).
Dawkins, Richard, The Ancestor’s Tale: A Pilgrimage to the Dawn of Evolution (Boston, MA: Houghton Mifflin, 2004).
Desmet, Klaus, Michel Le Breton, Ignacio Ortuno Ortin, and Shlomo Weber, “Stability of Nations and Genetic Diversity,” working paper, Universidad Carlos III, 2007.
Diamond, Jared, The Third Chimpanzee. The Evolution and Future of the Human Animal (New York, NY: Harper Collins, 1992).
——, Guns, Germs, and Steel: The Fates of Human Societies (New York, NY: Norton, 1997).
Dyen, Isidore, Joseph B. Kruskal, and Paul Black, “An Indoeuropean Classification: A Lexicostatistical Experiment,” Transactions of the American Philosophical Society, 82 (1992), 1–132.
Easterly, William, and Ross Levine, “Tropics, Germs, and Crops: How Endowments Influence Economic Development,” Journal of Monetary Economics, 50 (2003), 3–39.
Fearon, James, “Ethnic and Cultural Diversity by Country,” Journal of Economic Growth, 8 (2003), 195–222.
——, “Is Genetic Distance a Plausible Measure of Cultural Distance?,” working paper, Stanford University, 2006.
Gallup, John L., Andrew D. Mellinger, and Jeffrey D. Sachs, “Geography and Economic Development,” NBER Working Paper No. 6849, 1998.
Galor, Oded, “From Stagnation to Growth: Unified Growth Theory,” in Handbook of Economic Growth, Philippe Aghion and Steven Durlauf, eds. (Amsterdam, the Netherlands: North-Holland, 2005).
Galor, Oded, and Omer Moav, “Natural Selection and the Origin of Economic Growth,” Quarterly Journal of Economics, 117 (2002), 1133–1191.
Giuliano, Paola, Antonio Spilimbergo, and Giovanni Tonon, Genetic, Cultural and Geographical Distances, unpublished, International Monetary Fund, 2006.
Glaeser, Edward, Rafael La Porta, Florencio Lopez-de-Silanes, and Andrei Shleifer, “Do Institutions Cause Growth?,” Journal of Economic Growth, 9 (2004), 271–303.
Guiso, Luigi, Paola Sapienza, and Luigi Zingales, “Cultural Biases in Economic Exchange,” NBER Working Paper No. 11005, 2004.
——, “Does Culture Affect Economic Outcomes?,” Journal of Economic Perspectives, 20 (2006), 23–48.
Hall, Robert E., and Charles I. Jones, “Why Do Some Countries Produce So Much More Output per Worker Than Others?,” Quarterly Journal of Economics, 114 (1999), 83–116.
Heston, Alan, Robert Summers, and Bettina Aten, “Penn World Table Version 6.1,” Center for International Comparisons at the University of Pennsylvania (CICUP), 2002.
Howitt, Peter, and David Mayer-Foulkes, “R&D, Implementation, and Stagnation: A Schumpeterian Theory of Convergence Clubs,” Journal of Money, Credit and Banking, 37 (2005), 147–177.
Hummels, David, and V. Lugovskyy, “Usable Data? Matched Partner Trade Statistics as a Measure of Transportation Cost,” Review of International Economics, 14 (2006), 69–86.
Jobling, Mark A., Matthew E. Hurles, and Chris Tyler-Smith, Human Evolutionary Genetics. Origins, Peoples & Diseases (New York, NY: Garland Science, 2004).
Kerr, William R., “Ethnic Scientific Communities and International Technology Diffusion,” Review of Economics and Statistics, 90 (2008), 518–537.
Kimura, Motoo, “Evolutionary Rate at the Molecular Level,” Nature, 217 (1968), 624–626.
Kremer, Michael, “Population Growth and Technological Change: 1,000,000 B.C. to 1990,” Quarterly Journal of Economics, 108 (1993), 681–716.
Limao, Nuno, and Anthony Venables, “Infrastructure, Geographical Disadvantage, Transport Costs and Trade,” World Bank Economic Review, 15 (2001), 451–479.
Maddison, Angus, The World Economy: Historical Statistics (Paris, France: OECD Development Center, 2003).
Masters, William A., and Margaret S. McMillan, “Climate and Scale in Economic Growth,” Journal of Economic Growth, 6 (2001), 167–186.
McNeill, John Robert, and William H. McNeill, The Human Web: A Bird’s Eye View of Human History (New York, NY: Norton, 2003).
Mecham, R. Quinn, James Fearon, and David Laitin, Religious Classification and Data on Shares of Major World Religions, unpublished, Stanford University, 2006.
Olsson, Ola, and Douglas A. Hibbs Jr., “Biogeography and Long-Run Economic Development,” European Economic Review, 49 (2005), 909–938.
Parente, Stephen L., and Edward C. Prescott, “Barriers to Technology Adoption and Development,” Journal of Political Economy, 102 (1994), 298–321.
——, Barriers to Riches (Cambridge, MA: MIT Press, 2002).
Richerson, Peter J., and Robert Boyd, Not by Genes Alone: How Culture Transformed Human Evolution (Chicago, IL: University of Chicago Press, 2004).
Rogers, Everett M., Diffusion of Innovations (New York, NY: The Free Press, 1962).
Sachs, Jeffrey, “Tropical Underdevelopment,” NBER Working Paper No. 8119, 2001.
Shennan, Stephen, Genes, Memes and Human History. Darwinian Archaeology and Cultural Evolution (London: Thames and Hudson, 2002).
Tabellini, Guido, “Culture and Institutions: Economic Development in the Regions of Europe,” IGIER Working Paper No. 292, 2005.
Weitzman, Martin, “On Diversity,” Quarterly Journal of Economics, 107 (1992), 363–405.
Whitehead, Alfred North, Science and the Modern World (Cambridge, UK: Cambridge University Press, 1931).
WAS WEBER WRONG? A HUMAN CAPITAL THEORY OF PROTESTANT ECONOMIC HISTORY∗

SASCHA O. BECKER AND LUDGER WOESSMANN

Max Weber attributed the higher economic prosperity of Protestant regions to a Protestant work ethic. We provide an alternative theory: Protestant economies prospered because instruction in reading the Bible generated the human capital crucial to economic prosperity. We test the theory using county-level data from late-nineteenth-century Prussia, exploiting the initial concentric dispersion of the Reformation to use distance to Wittenberg as an instrument for Protestantism. We find that Protestantism indeed led to higher economic prosperity, but also to better education. Our results are consistent with Protestants’ higher literacy accounting for most of the gap in economic prosperity.
I. INTRODUCTION

In the century since Max Weber suggested in The Protestant Ethic and the Spirit of Capitalism that a “Protestant ethic” was instrumental for economic progress (Weber 2001), several interpretations have emerged as to how Protestants came to be more prosperous than Catholics. Weber’s study is considered a seminal work in sociology, and subsequent sociological interpretations have incorporated the idea that the specific ethic of Protestant theology may have induced its followers to work harder and to save more. To these sociological theories we offer a simple alternative economic theory based on the standard human capital model. Martin Luther explicitly favored universal schooling in order to enable all Christians to read the Gospel by themselves. We

∗ The idea that the rise of Protestantism may have fostered human capital accumulation in Europe stems from our discussions with Paul E. Peterson, who in turn traces it back to a late-1960s Chicago seminar by C. Arnold Anderson and Mary Jean Bowman. We received substantive comments during seminar presentations at the Universities of Harvard, Stanford, UC Davis, Munich, Zurich, Stockholm, Florence, and Rome “Tor Vergata,” the London School of Economics, Aarhus Business School, the Ifo Institute Munich, ZEW Mannheim, WZB Berlin, the Max Planck Institute for Research on Collective Goods Bonn, the Third Christmas Conference of German Expatriate Economists, the European Meeting of the Econometric Society in Budapest, the Oslo conference of the European Association of Labour Economists, and the Munich conference of the German Economic Association. Discussions with and comments from Andreas Ammermüller, Knut Borchardt, David Card, Andreas Diekmann, Guido Friebel, Claudia Goldin, Avner Greif, Bob Hart, Mathias Hoffmann, Larry Kahn, Tim Lorentzen, Volker Meier, Petra Moser, Paul Peterson, Guido Schwerdt, Holger Sieg, Daniel Sturm, Marty West, Ulrich Woitek, the editors, and anonymous referees were very fruitful. Support has come from the Program on Education Policy and Governance of Harvard University. Erik Hornung, Martin Hofmann, and Clemens König provided capable research assistance. We are grateful to all of them.
© 2009 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology.
The Quarterly Journal of Economics, May 2009
suggest that, incidentally, the ensuing literacy among Protestants was also then used in economic activities. We take Weber’s own perspective to test our theory empirically, using variation across the 452 counties of his native Prussia, the dominant state of the German Empire. We find a significant, positive association between Protestantism and economic prosperity in late-nineteenth-century Prussia, confirming Weber’s descriptive observation. To our knowledge, this is the first thorough empirical analysis of the Weber thesis at the subnational level. More importantly, we argue that the approximately concentric diffusion of Protestantism in Prussia around Luther’s city of Wittenberg in Lutheran times allows us to identify exogenous variation in Protestantism in the late nineteenth century. Using distance to Wittenberg as an instrument for counties’ shares of Protestants, we find that Protestantism had a strong effect on literacy, confirming the basic mechanism of our human capital theory. In this model, identification comes from the assumption that the Reformation was an exogenous event, generating a random shock that spread concentrically around Wittenberg. We corroborate this identifying assumption by showing that distance to Wittenberg is indeed unrelated to a series of proxies for economic and educational development before 1517, including the pre-Luther placement of schools, universities, monasteries, and free imperial and Hanseatic cities and urbanization. The Protestant lead in literacy is large enough to account for practically the entire Protestant lead in economic outcomes. When we restrict the economic return to literacy to values consistent with existing causal estimates in the literature, the point estimate of the independent effect of Protestantism on economic outcomes adjusted for literacy approaches zero. 
Our results thus suggest that human capital can account for at least some of the denominational difference in economic affluence, and they are consistent with the hypothesis that it can account for most or even all of the difference. This would leave little scope for independent effects of channels traditionally associated with the Weber thesis, such as increased work effort and saving. The result holds for a series of measures of economic outcomes, including per capita income taxes, an income measure based on teacher salaries, and the size of the nonagricultural sector. Prussia in the late nineteenth century provides the natural place to study the relationship between Protestantism, education,
and economic prosperity. It includes Wittenberg, where the Reformation was initiated and from where Luther’s doctrine diffused in its purest form. Prussia had uniform laws and institutional settings, so that the empirical investigation is not hampered by institutional heterogeneity. It was also reasonably well divided between Protestants and Catholics, at roughly two-thirds to one-third of the population, so that no denomination constitutes just a small minority. Finally, the Prussian Statistical Office collected an impressive amount of data, the quality of which is generally accepted as having been outstanding already in the nineteenth century, and which have survived at the county level in the archives. The 1871 Prussian Census was the first to survey the literacy of the whole population. We thus do not have to rely on data from selective samples such as military recruits, which provide only a limited picture of the population at large. The Prussian county data allow us to go beyond the existing empirical literature on the Weber thesis, which mostly uses cross-country variation. Although the issue is not fully resolved in the literature (cf. Iannaccone [1998]; Delacroix and Nielsen [2001]), we provide descriptive evidence below that Protestant countries were on average economically more advanced in 1900 than Catholic countries, and were substantially more literate. A broader literature studies the association between religion and economic outcomes. As an important expression of culture (Guiso, Sapienza, and Zingales 2006), religion is generally viewed as a possible fundamental determinant of economic growth. Thus, Barro and McCleary (2003, 2005) study the association between different religions and economic growth. In a study concerned with controlling for the effects of economic institutions, Acemoglu, Johnson, and Robinson (2001, 2005) find no effect of religion on growth in a cross-country setting.
Any cross-country study is plagued by the difficulty of disentangling the effect of religion from other possible fundamental causes of economic prosperity that vary across countries, such as institutions and geography. By contrast, looking at regional data within Prussia exposes all our observations to the same institutional and legal setting. Similarly, problems of geographical variation are substantially smaller within Prussia than globally, and we test for robustness by adding controls for a rich set of geographical features. We can even include district fixed effects, using only variation across counties within each district. In effect, we hold institutions and geography constant and ask whether there is a role for religion in economic outcomes.1 Our results suggest that although religious affiliation indeed had economic consequences, this may have been for reasons other than the disposition of the work ethic. Our human capital theory of Protestant economic prosperity is certainly complementary to ethics-based theories in the sense that the economic role of the Protestant ethic may work essentially via human capital accumulation (a thought that, however, is not explicitly contained in Weber’s work). But our evidence that the human capital channel can be traced back to denominational variation stemming from the choices of local rulers during the Reformation in the sixteenth and early seventeenth centuries suggests that explanations based purely on differential work ethics may have limited power. A major driving force of the higher economic prosperity of Protestants in late-nineteenth-century Prussia was education. Of course, religion was important for economic success in the sense that it unintentionally resulted in an uneven accumulation of human capital. The denominational differences originating in the Reformation affected economic outcomes even after several centuries.
The next section presents evidence that Protestant countries and regions were economically more prosperous than Catholic countries and regions in the late nineteenth century. Section III develops the theoretical foundation. Section IV demonstrates that Protestant countries and regions had better-educated populations. Section V shows that this association warrants a causal interpretation in an instrumental variable model exploiting the historical origin of the Reformation in Prussia. Section VI provides evidence on the importance of education in accounting for the Protestant economic lead in late-nineteenth-century Prussia.
II. THE BASIC FACTS: PROTESTANTISM AND ECONOMIC PROSPERITY
The purported association underlying the Weber thesis is that Protestantism was correlated with greater economic prosperity. This section provides several pieces of evidence—patterns
1. In a cross-country study, Glaeser et al. (2004) find that human capital may be a more basic source of growth than institutions. Our finding of an important role of education is also consistent with a long literature stressing the importance of human capital for historical economic development in general; cf., for example, Easterlin (1981), Goldin (2001), Lindert (2003), and Galor (2005). For a recent review of the vast literature on the role of human capital in modern economic growth, cf. Hanushek and Woessmann (2008).
FIGURE I The Cross-Country Pattern of Protestantism and GDP Per Capita, 1900 See Appendix III for data details.
across countries in 1900, within-country patterns in the existing literature, and new evidence for Prussia in the 1870s/1880s and Germany today—confirming the validity of this basic association. Despite some negative cross-country assessments in the literature (Iannaccone 1998; Delacroix and Nielsen 2001),2 Glaeser and Glendon (1998) provide evidence from available data that economic growth between 1820 and 1950 was faster in seven predominantly Protestant countries than in five predominantly Catholic countries. They also report that average income levels were higher in the Protestant countries. In Figure I, we plot Maddison’s (2006) data on GDP per capita in 1900 against the data on religious population shares by Barrett, Kurian, and Johnson (2001) for all countries in which Catholics and Protestants together accounted for the majority of the population. The cross-country pattern in 1900 reveals that countries with a larger share of Protestants were on average economically more advanced than countries with
2. See also the growth model calibrated to England by Cavalcanti, Parente, and Zhao (2007), which suggests that differences between Catholics and Protestants could at best account for only slight delays in the start of industrialization.
a larger share of Catholics. The correlation coefficient across the 29 countries with available data is .52 (statistically significant at the 1% level).3 Micro evidence from within countries is much scarcer. The only explicit evidence that Weber put forward is a descriptive exposition by Offenbacher (1900) for the German region of Baden in 1885–1895 showing that Protestant children were more likely than Catholics to attend institutions that prepared for technical and commercial occupations.4 Although George Becker (1997) aptly exposes the shortcomings of these data, his reanalysis still reveals a positive association between Protestantism and orientation toward higher-paying occupations. For the United States, Goldin and Katz (2000) show that Protestants had higher earnings than Catholics in 1915 Iowa. To this evidence, we add our new analyses of the 1870s and 1880s at the county level in Prussia, which was well divided between Protestants and Catholics (see Section VI for details of the data and regression analyses). Our best proxy for county income is income tax revenues in 1877. Per capita income taxes and the share of Protestants are significantly positively correlated across the 426 counties with tax data (correlation coefficient .13, statistically significant at the 1% level). Per capita income taxes are 9.1% higher in the 225 mostly-Protestant counties (more than 80% Protestants) than in the 102 mostly-Catholic counties (less than 20% Protestants). Another indicator of economic progressiveness is the sectoral structure of the 452 counties in 1882, where we observe the shares of the labor force working in the manufacturing sector and in the service sector. The service-sector share, which includes such businesses as trade, insurance, and transport (but not servants,
3. The scatterplot depicts the role of the Nordic countries as “impoverished sophisticates” (Sandberg [1979] on Sweden), whose level of economic development was not up to their level of human capital before the industrial revolution took hold there. Without the Nordic countries, the correlation between GDP per capita and the share of Protestants across predominantly Christian countries in 1900 is as high as .77.
4. Common perceptions of systematic denominational differences in economic backwardness within Germany also suggest that the Weber observation was indeed viewed as an important stylized fact in Germany both in the late nineteenth century and in the mid-twentieth century. Weber (2001) refers to regular public discussions at official meetings of the Catholic laity in Germany about the general public feeling that Catholics were economically lagging behind Protestants in his time. The very same discussions of Catholic backwardness reemerged in Catholic meetings and media in the 1950s and 1960s (e.g., Herder-Korrespondenz [1954]; Erlinghagen [1965]).
housemaids, or the public administration), is significantly positively correlated with Protestantism (correlation coefficient .10, significant at 4%). The correlation of Protestantism with the manufacturing share in the full sample is not statistically significantly different from zero, but this is driven by the fact that the affluent Ruhr area derived its large manufacturing share from coal, which fostered a large mining industry (subsumed under manufacturing in the 1882 census). Disregarding the two provinces that contain the Ruhr area (Rheinprovinz and Westphalen), the manufacturing share is also significantly positively correlated with Protestantism (correlation coefficient .10, significant at 8%). The same is true for the combined share of the labor force that has moved out of agriculture into manufacturing and services, which we will refer to as the nonagricultural share in the remainder of the paper (correlation coefficient .13, significant at 2%).5 In this sample, the nonagricultural share of 32.0% in the 223 mostly-Protestant counties is 3.5 percentage points higher than in the 45 mostly-Catholic counties. Finally, using individual-level data from Germany today (see Section VI.E for details), we still observe that Protestants earn 6.9% higher incomes than Catholics.
In sum, there is clear evidence that Protestantism was (and is) associated with economic prosperity, as purported by the Weber thesis. Motivated by these descriptive patterns, we turn to the question of how these basic facts can be explained.
III. THE PROTESTANT ETHIC VERSUS THE HUMAN CAPITAL HYPOTHESIS
This section presents two alternative theoretical approaches to understanding the history of Protestants’ relative economic progressiveness: Weber’s thesis based on a Protestant work ethic and our human capital theory of Protestant economic history. The two approaches are not mutually exclusive; in fact, they may well be complementary.
III.A. The Weber Thesis of a Protestant Ethic
Max Weber (2001) proposed the “most famous link between culture and economic development” (Acemoglu, Johnson, and Robinson 2005, p. 401), namely that the Protestant Reformation
5. When the two provinces are disregarded, the correlation between income taxes and Protestantism also increases to .30.
was instrumental in facilitating industrial capitalism in Western Europe.6 The descriptive observation of greater economic prosperity of Protestants had been the subject of a long-running discussion, traceable at least as far back as Menschenfreund (1772). The distinctive feature of Weber’s main thesis is that it is the specific ethic of Protestantism that affected economic outcomes.7 Weber argued that the Reformation introduced the crucial notion of the “calling” (“Beruf”), with the current use of the word originating in Luther’s translation of the Bible. The notion of the calling carries the suggestion of a religious conception, the sanctification of labor to a task set by God. This notion, according to Weber, created a particular Protestant work ethic, which—in contrast to the Catholic ideal of surpassing worldly morality in monastic asceticism—valued the fulfillment of worldly duties as the highest moral achievement. According to Weber (2001, p. 40), “The only way of living acceptably to God was . . . solely through the fulfillment of the obligations imposed upon the individual by his position in the world. That was his calling.”8 The Protestant work ethic approved of the accumulation of wealth and thus, according to Weber’s argument, provided the moral foundation for capitalist industrialization. Success in a calling came to be regarded as a sign of being among the select group that God will save from damnation (cf. Giddens’s introduction to Weber [2001]). Thus, Weber provides an ethics-based theory for economic development.9
6. On a cautionary note, we stress that there is considerable controversy about what Weber’s own main hypothesis about Protestantism and the development of capitalism actually was. However, it is undisputed that the core of his argument is that there is a difference in ethical disposition between Protestants and Catholics that had a significant bearing on economic outcomes.
7. Doepke and Zilibotti (2008) endogenize preferences and cultural values, suggesting that the preindustrial professional distribution generated class-specific attitudes (among them “the spirit of capitalism” in the middle class) that help to explain the socioeconomic transformation that occurred during the industrial revolution.
8. Weber explicitly traces this central notion back to Luther, whereas later it was most rigorously developed in certain Protestant communities, such as Calvinists, Puritans, Methodists, and Baptists. In Prussia, the vast majority of Protestants were Lutherans. The distinction between Lutherans and the second Protestant community, Reformists, was dropped in official statistics when they were merged into the Protestant Church in Prussia (Evangelische Kirche in Preußen) in 1817. Just before the merger, 94% of Protestants in Prussia were Lutherans (Mützell 1825).
9. Weber’s work has been criticized as misinterpreting Protestant doctrine, Catholic doctrine, and the development of specific forms of capitalism (cf. Giddens’s introduction to Weber [2001]). Critics also pointed out historical inconsistencies: most capitalist institutions preceded the Reformation (Tawney 1926), early Reformation leaders were uninterested in or even hostile to economic issues and ignorant of the working of capitalist institutions (Samuelsson 1993),
One mechanism underlying the Weber thesis is that the work ethic drives Protestants to simply work harder. Another mechanism is that their belief system compels them to save more in order to defer gratification, which transforms into investments and thus higher productivity in the longer run.10 Because there is substantial controversy as to whether Weber was trying to explain economic disparities existing in his time or just the initial origin of capitalism, we aim our analysis at what has been called the “Common Interpretation” (Delacroix and Nielsen 2001) of the Protestant Ethic, which has taken on a life of its own, namely the simple emphasis on a “connection between Protestantism and economic progress” (Coleman’s 1959 introduction to Samuelsson [1993]) in general.
III.B. A Human Capital Theory of Protestant Economic History
It is well known that Martin Luther produced the first widely used German translation of the Bible. He opposed the Roman Catholic practice of reading out the Gospel in the scholarly language of Latin and wanted everyone to be able to understand God’s Word. What is less well known today is that Luther also very explicitly promoted the expansion of education (cf. Rupp [1996, 1998]). Quite obviously, if one wants to read the Bible, one must be able to read. Already in his very early preaching, Luther (1888, pp. 461–462) explicitly demanded that every town should have both a boys’ and a girls’ school where every child should learn to read the Holy Scriptures. Luther’s call to teach everyone so that they are able to read God’s Word by themselves is the key feature of our alternative theory of the relative economic affluence of Protestants, because—as a mere coincidence—the literacy that was created also had a significant use in the economic sphere.11 It should be stressed, though, that Luther never had an economic use in mind. The
and several selective regional examples of economic development went counter to the Weber thesis (cf. also Iannaccone [1998]).
10. Merton (1936) stressed the importance of Protestantism for the development of science. Blum and Dudley (2001) interpret the Weber thesis in terms of information networks and model the adoption of Protestantism as an increase in the cost of defection in contractual relationships with strangers that increased trade.
11. In a closely related argument, Botticini and Eckstein (2005, 2007) suggest a human capital interpretation of Jewish history, where the ultimate root of Jewish economic prosperity as merchants lies in a centuries-old Judaic rule that required male Jews to be able to read the Torah in the synagogue and to teach the reading of the Torah to their sons.
increased education of Protestants was purely religiously motivated; instruction, learning, and scientific engagement did not carry a value of their own for Luther. “Luther’s prime concern in this area was the creation of elementary schools for the people as a means of providing all Christians with access to the word of God, as contained in the Bible” (Rupp 1996, p. 618).12 This relates both to the authority of a book, the Bible, for Protestantism and to Luther’s general theological tenet of the “universal priesthood of all believers” (cf. Pelikan [2005]). Rather than relying on injunctions by specifically ordained priests, ceremonial exercises, and sacerdotal imagery, each Christian was urged to read the sacred text for himself or herself. This required breaching the clerics’ privilege of education in favor of universal basic education. Rupp (1998, p. 172) summarizes the basic line of reasoning:
because the divine revelation had quasi materialized itself in the Holy Scripture, each Christian, each Protestant believer was indispensably referred to getting to know and reading this scripture. But this, in turn, made it necessary that everybody could indeed read this scripture—and this, of course, had corresponding efforts of education in schools, which had still to be established, as its precondition. . . .
Luther addressed his educational demands to two different audiences. First, as is most evident in a 1524 pamphlet, he pressured the Protestant rulers to build and maintain schools. In To the Councilmen of All Cities in Germany That They Establish and Maintain Christian Schools, Luther (1899) assigned the duty of operating schools to the rulers and territorial authorities. If parents did not take care of schools, Luther argued, it would be the duty of the rulers to incur the effort and cost of running schools. In his practical implementation of educational reforms, Melanchthon also made the authorities responsible for organizing the new education system (Rupp 1996). As a consequence, the costs of schooling might have been lower for individuals in Protestant regions than in Catholic regions. Owing to the higher prevalence of public schools in Protestant regions, the commuting costs to schools would be lower. Depending on the incidence of the ruler’s financing of the costs of schools, part or all of the financial burden might also not have to be carried by the individual in terms of taxes, but
12. Woodberry (2004) uses a similar argument to show that Protestant (rather than Catholic) missionaries were central in expanding mass education in the colonial world.
might have come, for example, from reduced spending on amenities for the ruler and his protégés. Second, most evident in his 1530 Sermon on Keeping Children in School, Luther (1909, p. 526) also demanded that each individual, especially the parents, put emphasis on education and send children to school:
I see that the common people are dismissive to maintaining the schools and that they withdraw their children from instruction altogether and turn solely to the care for food and bellies, and besides they either will not or cannot consider what a horrible and un-Christian thing they are doing and what great and murderous harm they are doing everywhere in so serving the devil.
Thus, in line with the universal priesthood of all believers, all Christians are called on to ensure that their children receive a decent education. As a consequence of Luther’s postulations, the individual (religious) benefit of schooling would have been higher for Protestants than for Catholics. Luther’s educational postulations might also have induced Protestants to view learning as less of a strain and more of an enjoyment, thereby reducing the individual costs of schooling. Combining the two effects, a simple economic model predicts that, when individuals maximize utility, Protestants will in equilibrium acquire more education on average than Catholics because they face lower costs and higher benefits of schooling (see Section II.C of Becker and Woessmann [2007] for a depiction of this argument in a simple human capital model).
The fact that the Reformation was one of the leading origins of elementary schooling in Germany is well accepted in the study of German educational history (cf. Spranger [1949]; Flitner [1954]; Reble [2002]).13 The leading reformers were very active in putting Luther’s educational preaching into practice. Protestant cities and territories instituted new Church and School Ordinances that postulated universal education of all children and required building new schools (cf. Green [1979] for examples). Regular visitation by leading reformers ensured the implementation of these ordinances. Green (1979) documents the rapidly increasing number of schools in the Protestant region of Brandenburg from the first decades of the Reformation until 1600.
13. In the post-Luther era of the Counter-Reformation, it was particularly the Jesuits who tried to advance education among the Catholic population as well. However, as our evidence below shows, this was far less encompassing than the Protestant urge for education.
The final step in our argument is that such educational expansion was useful beyond religion, in our case for economic productivity. In economics, the supreme importance of education for economic prosperity has received particular emphasis since the emergence of the theory of human capital in the early 1960s. The key idea is that education is an investment that yields higher labor-market earnings because it increases productivity. The linguistic and methodical skills created by the teaching of God’s Word—reading, understanding, and knowing the Bible, including its exegetical comprehension—are thus valuable in other tasks beyond the religious realm. In sum, Luther’s educational postulations give rise to a simple alternative theory of the historical economic success of Protestant regions: Protestants acquired more schooling than Catholics for religious reasons, and as a side effect, this higher schooling then translated into higher economic prosperity (cf. Becker and Woessmann [2007] for additional details). Of course, such a theory does not preclude other effects of Protestantism. For example, it may well be complementary to the Weber thesis in the sense that Protestants might become more educated because of a better work ethic. But the innovation of our theory is to stress education as a key channel in the Protestant economic lead.
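The cost-benefit logic of this section can be made concrete with a deliberately minimal sketch. It assumes, purely for illustration, that an individual chooses schooling s to maximize a linear benefit b·s minus a quadratic cost c·s²/2, so the first-order condition b − c·s = 0 gives s* = b/c: a higher benefit (Luther's call to read the Bible) and a lower cost (more nearby, ruler-financed schools) both raise s*. The economic side effect is represented by an assumed Mincer-style log-earnings return r per unit of schooling. These functional forms, the names Group, optimal_schooling, and log_earnings_gap, and all parameter values are hypothetical and are not taken from Becker and Woessmann (2007).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    """Hypothetical schooling parameters for one denomination (illustrative only)."""
    name: str
    benefit: float  # b > 0: assumed marginal benefit per unit of schooling
    cost: float     # c > 0: assumed curvature of the schooling cost c * s**2 / 2


def optimal_schooling(g: Group) -> float:
    """Maximize b*s - c*s**2/2 over s >= 0; the first-order condition gives s* = b/c."""
    if g.benefit <= 0 or g.cost <= 0:
        raise ValueError("benefit and cost parameters must be positive")
    return g.benefit / g.cost


def log_earnings_gap(s_high: float, s_low: float, r: float) -> float:
    """Assumed Mincer-style side effect: log-earnings gap r * (s_high - s_low)."""
    return r * (s_high - s_low)


# Assumed values: Protestants face a higher benefit and a lower cost of schooling.
protestant = Group("Protestant", benefit=1.2, cost=0.15)
catholic = Group("Catholic", benefit=1.0, cost=0.20)

s_p = optimal_schooling(protestant)       # 1.2 / 0.15 = 8.0
s_c = optimal_schooling(catholic)         # 1.0 / 0.20 = 5.0
gap = log_earnings_gap(s_p, s_c, r=0.05)  # 0.05 * 3.0 = 0.15 log points

print(f"optimal schooling: {protestant.name} {s_p:.1f}, {catholic.name} {s_c:.1f}")
print(f"implied earnings premium: {math.exp(gap) - 1:.1%}")
```

Raising b or lowering c for either group moves s* in the direction the text predicts; the quadratic cost is chosen only because it yields a closed-form optimum, not because the argument depends on it.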
IV. EVIDENCE ON PROTESTANTISM AND EDUCATION
To validate the basic tenet of our human capital hypothesis, this section provides evidence on the association between Protestantism and education. After a brief discussion of the cross-country pattern and existing evidence, we turn to our analysis of county-level data from late-nineteenth-century Prussia.
IV.A. The International Pattern
We derive cross-country data on literacy rates in 1900 mainly from the UNESCO (1953) compilation based on national population censuses, reporting the share of persons above 10 or 15 years of age who could read in 1900 (or a close year). We supplement this with additional literacy data in Cipolla (1969) and Flora (1983), some of which stem from military records on literacy among recruits and from marriage registers on the share of newly married bridegrooms and brides who could sign their marriage
FIGURE II The Cross-Country Pattern of Protestantism and Literacy, 1900 See Appendix III for data details.
certificates.14 We follow Tabellini (2005) in combining literacy data from different sources in a cross-country comparison, fully aware that there are severe limits to cross-country comparability due to differences in literacy definitions and samples (cf. UNESCO [1953]). Although such limitations often will not allow establishing literacy rates within a few percentage points, the broad cross-country pattern, as depicted in Figure II, should not be affected.15 The figure reveals clearly that among the countries where Protestants and Catholics together accounted for the majority of
14. Both additional sources of literacy data have disadvantages: Military records generally refer only to a specific age group of the able-bodied male population, and marriage registers to varying age compositions and proportions of people who got married.
15. For example, the literacy data for Germany, the Netherlands, and Switzerland refer to military recruits and may thus overestimate literacy among the adult population. The data for Argentina and Colombia refer to 1914 and 1917, respectively, and the data for Cuba count all people attending school as able to read, all of which might bias the estimates upward. All of this suggests that, if anything, the positive cross-country correlation between Protestantism and literacy may be understated in the graph.
the population, all countries with Protestant majorities had nearly universal literacy in 1900, whereas no Catholic country reached full literacy, and many Catholic countries fell far short of it.16 The correlation coefficient between Protestantism and literacy across the 22 countries with available data in 1900 is as high as .78 (statistically significant at the 1% level).17 Existing within-country data provide a similar pattern. Goldin and Katz (2002) show that in the United States in 1910–1938, areas that led in secondary education had higher shares of Protestant population. Similarly, Go and Lindert (2007) report that in some specifications, Protestantism had a positive effect relative to Catholicism on several schooling outcomes in 1850 across U.S. counties. In Ireland in 1871, illiteracy among the different Protestant communities was between 7% and 14%, whereas it was 40% among Catholics (Cipolla 1969). In Finland in 1880, only 1.3% of Lutherans were unable to read or write, as against 54.4% among Catholics (Markussen 1990). Even today in Germany, Protestants have 0.8 more years of education than Catholics (with an average of 12.4 years of education; see Section VI.E for details).
IV.B. Data for Prussian Counties in the Late Nineteenth Century
Prussia in the late nineteenth century is the obvious place to probe the association between Protestantism and education more deeply, using subnational data. First, nineteenth-century Prussia has the birthplace of the Reformation at its center. Luther proclaimed his 95 Theses in Wittenberg, and the Prussian territory conserved Protestantism in its purest form. Second, Prussia was Max Weber’s birthplace, and his views were shaped by what he observed across Germany. Third, Prussia had rather uniform laws and institutional frameworks, with the possible exception of recent annexations (dealt with below).
By contrast, cross-country comparisons, which constitute the existing literature, are notoriously plagued by the difficulty of netting out the effects of other fundamental causes, such as institutions and geography (cf. Acemoglu, Johnson, and Robinson [2005]). Fourth, Prussia was well divided between Protestants and Catholics, with Protestants constituting roughly two-thirds and Catholics roughly one-third
16. Although there are no encompassing literacy data for Denmark and Norway, there is suggestive evidence that these two Protestant countries also reached universal literacy by 1900 (cf. Markussen [1990]).
17. Dummies indicating the different data sources do not enter statistically significantly into a regression framework and do not change this pattern of results.
of the total population, so that no denomination was an extreme minority. This differs from the more lopsided denominational distributions of most other countries. What is more, Prussia was exceptional in granting freedom of religion to each individual at least as early as the mid-eighteenth century. Frederick the Great, the enlightened monarch of Prussia, had famously declared in 1740 that in his country, everybody might find his salvation in his own way.18 Fifth, with a population of about 24.6 million in 1871, Prussia was one of the largest European countries and accounted for 60 percent of the inhabitants of the German Empire. Sixth, proverbial Prussian orderliness and thoroughness yielded high-quality data at the county level in the second half of the nineteenth century.
We thus build our database on Protestantism and literacy in nineteenth-century Prussia mainly from census material collected by the Prussian Statistical Office in 1871, which we supplement with additional survey data from the 1870s and 1880s, all available at the county level. Our data cover all 452 Prussian counties (Kreise) at the time, divided into 35 districts (Regierungsbezirke) and 11 provinces (Provinzen); see Appendix I for details. The 1871 Population Census provides data on religious affiliation and literacy, as well as a set of standard demographic variables such as gender and age. The descriptive statistics, reported in Table I, reveal that the average share of Protestants in a county was 64.2%, against 34.5% Catholics (the remaining shares being Jews at 1.1% and other Christian denominations at 0.2%). There are two things to note. First, neither Protestants nor Catholics are just a small minority; each constitutes a sizeable fraction of the Prussian population. Second, there is substantial variation across counties, essentially ranging from zero to 100% Protestants or Catholics, which provides the variation for our empirical analysis.
In fact, more than 75% (60%) of the counties have a share of at least 80% (90%) of either Protestants or Catholics. Figure III provides a rough impression of the geographical distribution of religious affiliation across the 452 counties, revealing a mostly concentric pattern of the diffusion of Protestantism with Wittenberg at the center. Protestant diffusion came to a halt in the
18. “. . . hier mus ein jeder nach Seiner Façon Selich werden.” Frederick also wrote that “all religions are equal and good.” Uniquely for the eighteenth century, a Protestant and a Catholic church stood next to each other in the Forum Fridericianum at the origin of the central boulevard “Unter den Linden” in Berlin.
TABLE I
DESCRIPTIVE STATISTICS IN NINETEENTH-CENTURY PRUSSIA
Variable                                                 Mean (1)    Std. dev. (2)  Min (3)   Max (4)
Economic outcome variables
  Income tax revenue per capita (1877)a                  1.98        0.70           0.21      5.63
  Income of male elem. school teachers (1886)            982.83      200.42         711.96    1,954.19
  % of labor force in manufacturing and services (1882)  33.91       15.31          7.93      81.53
  % of labor force in manufacturing (1882)               27.65       13.41          6.12      71.76
  % of labor force in services (1882)                    6.26        3.55           1.80      24.46
Main explanatory variables
  % Protestants                                          64.18       37.83          0.26      99.89
  % Catholics                                            34.48       37.54          0.04      99.73
  % literate                                             87.51       12.67          37.40     99.33
  % pupils with distance to school over 3 km             2.99        3.42           0.00      19.79
  Distance to Wittenberg in km                           326.19      148.77         0.00      731.46
Control variables
  % age below 10                                         24.71       2.48           15.33     29.87
  % Jews                                                 1.14        1.33           0.00      12.87
  % females                                              51.00       1.51           43.97     54.63
  % born in municipality                                 58.97       12.39          32.01     87.23
  % of Prussian origin                                   99.07       1.97           74.22     100.00
  Average household size                                 4.79        0.34           3.83      5.86
  Total population size                                  54,426.16   42,078.42      11,609    826,341
  Popul. growth 1867–1871 (in %)                         1.60        4.93           −7.76     33.83
  % missing education info                               1.69        1.10           0.00      6.72
  % blind                                                0.09        0.03           0.03      0.24
  % deaf-mute                                            0.10        0.05           0.02      0.42
  % insane                                               0.23        0.17           0.02      1.56
  Distance to Berlin (in km)                             332.89      146.61         0.00      650.04
  Latitude (in rad)                                      90.88       2.53           83.93     97.24
  Longitude (in rad)                                     22.08       8.17           10.52     39.40
  Polish-speaking provinces                              0.26        0.44           0.00      1.00
  % of labor force in mining                             2.54        7.57           0.00      54.19
  % of county population in urban areas                  27.53       21.90          0.00      100.00
  Year in which annexed by Prussia                       1,751.69    111.05         1525      1866
Source. Data for Prussian counties (452 observations) from the 1871 Population Census, the 1877 Income Tax Statistics, the 1882 Occupation Census, and the 1886 Education Census; see main text and Appendix I for details.
Note. Monetary variables are in Marks (at current prices).
a 426 observations (data not available for urban counties).
FIGURE III Protestantism in Nineteenth-Century Prussia County-level depiction based on 1871 Population Census. See Appendix I for data details.
western provinces (Rhineland and Westphalia) and in the eastern parts, which were predominantly Polish-speaking.19 The 1871 census was the very first census ever to survey literacy in Prussia.20 People were coded as literate if they could read and write. The question was only to be answered by people aged 10 years or older. As a measure of educational outcome, literacy may be a more informative measure of accumulated human capital than standard enrollment data, which may partly capture years in school that did not lead to effective educational
19. Note that the diffusion of Protestantism was intimately related to Luther’s German-language Bible translation and his German-language texts. It is thus no surprise that the Reformation was less successful in the Polish-speaking districts. The German-speaking districts of Königsberg and Gumbinnen in the far east of Prussia, however, have been an integral part of the Prussian mainland for centuries and are again predominantly Protestant. Our regression results are robust when a dummy for the three predominantly Polish-speaking provinces Pommern, Posen, and Schlesien is included.
20. Other parts of the German Empire did not survey literacy in the 1871 census; nor was literacy surveyed again in any later Prussian census (Hesse 1911).
548
QUARTERLY JOURNAL OF ECONOMICS
outcomes. Average literacy across the counties was as high as 87.5% (Table I).21 This mirrors the fact that Prussia was well known for its primary education system in the second half of the nineteenth century, which is often viewed as a key feature responsible for the transfer of industrial leadership from Britain to Germany (cf. Landes [1969, pp. 339–348]). Still, there is substantial cross-county variation in the literate share of the population, ranging from 37.4% to 99.3%, and 16% of the counties had more than one-fourth of their adult population illiterate. As a measure of the supply of schools, the 1886 Education Census provides county-level information on the share of students who had a distance to school of more than 3 kilometers. Although the information is limited to those who were students in 1886 (rather than the adult population), the measure may still provide a useful proxy for the supply of schools. Note also that the measure applies only to those children who actually attended school; it may underestimate the true average distance to school if there are children who did not attend school because the distance was too far. Data from the 1886 Education Census also show that the vast majority of students (95.5%) went to schools affiliated with a single religious denomination. Most children attended a school of their own denomination, but schools were open to children from other denominations. Although schools were denominationally affiliated, funding was mostly independent of official church sources. Nearly half of the average funding for teaching staff came from local public authorities, 16.7% from school fees, and slightly above 10% each from endowment funds, trusts, and needs-based central government grants. Thus, local communities and authorities could develop and maintain significant educational differences along denominational lines. 
The demographic control variables from the 1871 Population Census include age structure, gender, native population, household size, and county size (Table I).22 We routinely include population growth between 1867 and 1871 as a control variable to capture possible effects of the Franco-Prussian War of 1870/1871.23

21. This made West German regions the most literate in Western Europe at the time (Tabellini 2005).

22. All our qualitative results are unaffected by excluding control variables that are correlated with Protestantism, such as the population share aged below 10 and the average household size.

23. Although the war had little impact on Prussian territory in general, with a relatively low death toll of 40,000 soldiers in the Prussian army, and although nearly a year passed between the end of the war in January and the census in December, the control variable for recent population growth may capture any remaining cross-county differences in migration or death toll.

IV.C. The Association between Protestantism and Literacy in Nineteenth-Century Prussia

The first column of Table II reveals a strong positive association between literacy and the share of Protestants in a county. On average, all-Protestant counties have a literacy rate that is 8.0 percentage points higher than that of all-Catholic counties. Viewed against an average literacy rate of 87.5%, this is a substantial difference across religious denominations. Column (2) adds the basic control variables, estimating the model

$$\mathit{LIT} = \alpha_1 + \beta_1\,\mathit{PROT} + X\gamma_1 + \varepsilon_1, \qquad (1)$$

where LIT is the share of literates in a county’s population aged 10 or older, PROT is the share of Protestants in the county, and X is the set of demographic control variables: the shares of Jews and of females in the county; the shares of the county population below 10 years of age, born in the specific municipality, and of Prussian origin; the shares of the population with physical or mental disabilities (blind, deaf-mute, and insane); average household size; size of the county; population growth over the four preceding years; and the share of the population with missing information on literacy (only 1.7% on average; cf. Table I).24 In the multivariate specification, the significant association between Protestantism and literacy becomes even larger, with literacy rates 9.9 percentage points higher in all-Protestant than in all-Catholic counties on average.25

24. Given that the dependent variable in this model is clustered near its upper bound of 100%, the linear model might be inadequate and suffer from heteroscedasticity. We therefore also estimated the model on a logit-transformed dependent variable and with heteroscedasticity-consistent weighted least squares, yielding the same qualitative results (available from the authors). Furthermore, although the supply-side point of our theoretical model (that costs of schooling may be lower in Protestant regions) requires a model specified on aggregate data, the demand-side point (that Protestants may get additional nonmonetary benefits from literacy) raises the issue of ecological inferences of individual associations from aggregate data (cf. Robinson [1950]). However, special tables in the 1871 Population Census on literacy rates by religious denomination within each county show that Protestants are indeed more literate than Catholics, ruling out an ecological fallacy. Unfortunately, the other variables are not reported in a breakdown by religious denomination, so our lowest possible unit of analysis is the county. Our individual-level analyses of contemporary German data in Section VI.E also confirm the association at the individual level. Note that Robinson (1950) showed that the difference between ecological and individual inference will usually be smaller the more the variables are clustered within regions, and our variables (especially Protestantism) are very highly clustered in Prussian counties (cf. Table I).

TABLE II
THE ASSOCIATION BETWEEN PROTESTANTISM AND LITERACY IN NINETEENTH-CENTURY PRUSSIA

Variable | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9)
Dependent variable | Share literate | Share literate | Share literate | Share literate | Share literate | Share literate | Share literate | Share literate | Distance to school
Sample | All counties | All counties | All counties | All counties | All counties | All counties | Excluding free cities | Counties with % Protestants > 80% or < 20% | All counties
% Protestants | 0.080 (0.015)∗∗∗ | 0.099 (0.010)∗∗∗ | 0.117 (0.011)∗∗∗ | 0.069 (0.012)∗∗∗ | 0.101 (0.010)∗∗∗ | 0.075 (0.012)∗∗∗ | 0.103 (0.011)∗∗∗ | 0.089 (0.010)∗∗∗ | −0.026 (0.004)∗∗∗
% age below 10 | | −1.936 (0.158)∗∗∗ | −1.734 (0.155)∗∗∗ | −1.438 (0.174)∗∗∗ | −1.950 (0.159)∗∗∗ | −1.552 (0.151)∗∗∗ | −1.952 (0.183)∗∗∗ | −1.711 (0.176)∗∗∗ | 0.148 (0.067)∗∗
% Jews | | −0.965 (0.284)∗∗∗ | −1.712 (0.264)∗∗∗ | −0.307 (0.288) | −1.013 (0.290)∗∗∗ | 0.270 (0.287) | −0.966 (0.302)∗∗∗ | −0.970 (0.416)∗∗ | −0.097 (0.119)
% females | | −1.280 (0.300)∗∗∗ | 0.239 (0.317) | −0.046 (0.286) | −1.256 (0.302)∗∗∗ | −1.065 (0.284)∗∗∗ | −1.263 (0.318)∗∗∗ | −1.077 (0.339)∗∗∗ | 0.458 (0.126)∗∗∗
% born in municipality | | 0.484 (0.033)∗∗∗ | 0.193 (0.046)∗∗∗ | 0.095 (0.045)∗∗ | 0.478 (0.034)∗∗∗ | 0.281 (0.039)∗∗∗ | 0.488 (0.035)∗∗∗ | 0.455 (0.037)∗∗∗ | −0.108 (0.014)∗∗∗
% of Prussian origin | | −0.324 (0.181)∗ | −0.029 (0.166) | 0.260 (0.159) | −0.298 (0.184) | −0.041 (0.174) | −0.345 (0.237) | −0.474 (0.229)∗∗ | 0.031 (0.076)
Average household size | | −1.812 (1.273) | 1.129 (1.211) | −2.747 (1.393)∗∗ | −1.701 (1.280) | −3.926 (1.292)∗∗∗ | −2.498 (1.364)∗ | −0.339 (1.337) | 3.067 (0.528)∗∗∗
ln(population size) | | −1.183 (0.873) | −0.090 (0.790) | 0.576 (0.712) | −1.026 (0.892) | 0.607 (0.821) | −0.906 (0.938) | −0.016 (0.915) | −0.079 (0.366)
Popul. growth 1867–1871 (in %) | | 0.186 (0.093)∗∗ | −0.091 (0.100) | −0.154 (0.087)∗ | 0.184 (0.094)∗∗ | −0.047 (0.086) | 0.159 (0.101) | −0.141 (0.105) | −0.166 (0.039)∗∗∗
% missing education info | | −0.307 (0.320) | −0.002 (0.238) | −0.269 (0.276) | −0.294 (0.321) | 0.020 (0.297) | −0.090 (0.336) | −0.172 (0.350) |
ln(distance to Berlin in km) | | | −1.716 (0.924)∗ | −0.474 (0.571) | | | | |
Latitude (in rad) × 100 | | | −0.686 (1.007) | −0.824 (0.393)∗∗ | | | | |
Longitude (in rad) × 100 | | | −0.925 (3.687) | −0.597 (1.540) | | | | |
Latitude × longitude × 100 | | | 0.155 (4.014) | −0.074 (1.672) | | | | |
Polish-speaking provinces | | | 2.878 (0.998)∗∗∗ | −13.207 (4.832)∗∗∗ | | | | |
% of labor force in mining | | | 0.139 (0.050)∗∗∗ | 0.076 (0.045)∗ | | | | |
% of county pop. in urban areas | | | 0.054 (0.018)∗∗∗ | 0.032 (0.017)∗ | | | | |
35 district dummies | | | | yes | | | | |
Year when annexed by Prussia | | | | | 0.003 (0.003) | | | |
36 dummies for years when annexed by Prussia | | | | | | yes | | |
Obs. | 452 | 452 | 452 | 452 | 452 | 452 | 406 | 343 | 452
R² | .057 | .737 | .810 | .885 | .737 | .826 | .751 | .718 | .356

Source. Data for Prussian counties from the 1871 Population Census and the 1886 Education Census; see main text and Appendix I for details. Further controls: % blind, % deaf-mute, % insane.
Notes. Dependent variable in column (9): % Pupils with distance to school over 3 km. Dependent variable in all other columns: % Literate among those aged ≥10. Standard errors in parentheses. Significance at ∗ 10, ∗∗ 5, ∗∗∗ 1 percent.

The following columns probe the robustness of the association between Protestantism and literacy in more extensive specifications. To exclude the possibility that the result is driven by geographical differences across the Prussian counties, column (3) adds a set of geographical control variables. These include the distance of the county capital from the Prussian capital of Berlin (measured as the great-circle distance) to account for periphery; latitude (measured in rad) to proxy for distance to the North and Baltic Seas in the Prussian north; longitude to trace out the westward expansion of Prussia over the centuries; an interaction of latitude and longitude; a dummy for the three predominantly Polish-speaking provinces Pommern, Posen, and Schlesien to proxy for Slavic languages (results are equivalent when using a dummy for counties located in Poland today); the fraction of the workforce employed in mining, to control for the availability of natural resources; and the fraction of the county population living in urban municipalities.26 Although several of the geographical controls enter the model significantly, the estimated association between Protestantism and literacy is hardly affected.

Column (4) adds a full set of 35 district dummies to the model, thereby removing all variation across districts and exploiting only within-district variation. To the extent that there is unobserved regional heterogeneity, district dummies should capture most of it. Although the estimated association between Protestantism and literacy is reduced in magnitude, it remains significant at the 1 percent level. For territories annexed by Prussia shortly before the 1870s, the assumption of an effectively uniform institutional setting might be questioned, giving rise to potential issues of unobserved heterogeneity in effective institutions.
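Because LIT and PROT in equation (1) are both measured in percent, β1 is the percentage-point change in literacy per percentage-point change in the Protestant share, so the gap between an all-Protestant and an all-Catholic county quoted in the text is simply 100 × β1 (for example, 100 × 0.099 = 9.9 for column (2)). The Python sketch below is a reader’s illustration, not the authors’ code: it applies that scaling to a few % Protestants coefficients from Table II and recovers the significance stars from the ratio of coefficient to standard error. The critical values (1.645, 1.960, 2.576) are an assumption: they are two-sided standard-normal cutoffs, a close approximation with 452 counties but not necessarily the exact reference distribution used in the paper.

```python
"""Reader's sketch: interpreting Table II coefficients (not the authors' code)."""

# Coefficient and standard error on % Protestants, copied from Table II.
TABLE_II_PROTESTANT = {
    "(1) bivariate": (0.080, 0.015),
    "(2) basic controls": (0.099, 0.010),
    "(4) district dummies": (0.069, 0.012),
    "(9) distance to school": (-0.026, 0.004),
}

# Assumed two-sided critical values from the standard normal distribution.
CRITICAL_VALUES = [(2.576, "***"), (1.960, "**"), (1.645, "*")]


def all_protestant_gap(beta: float) -> float:
    """Percentage-point difference between a 100% and a 0% Protestant county.

    Outcome and PROT are both in percent, so moving PROT from 0 to 100
    changes the fitted value by 100 * beta percentage points.
    """
    return round(100 * beta, 1)


def stars(coef: float, se: float) -> str:
    """Significance stars from |coef / se| using normal critical values."""
    t_stat = abs(coef / se)
    for critical, label in CRITICAL_VALUES:
        if t_stat >= critical:
            return label
    return ""


if __name__ == "__main__":
    for column, (beta, se) in TABLE_II_PROTESTANT.items():
        print(f"{column}: gap = {all_protestant_gap(beta):+.1f} pp, "
              f"t = {beta / se:.2f}{stars(beta, se)}")
```

Applied to other Table II entries, the rule reproduces the table’s stars: `stars(-1.716, 0.924)` returns a single star for ln(distance to Berlin) in column (3), and `stars(-0.298, 0.184)` returns no star for % of Prussian origin in column (5).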
Column (5) therefore controls for the year in which a county came to Prussia as a linear variable, and column (6) includes 36 dummies for all the rounds of annexation after 1525. Our qualitative results are unaffected by these controls. The first specification also shows that there is no significant linear effect of the year of annexation, suggesting that more recently annexed territories do not perform systematically differently from earlier Prussian territories.27

Free imperial and Hanseatic cities, economic and educational hubs in Luther’s time, may have been more inclined to follow the Reformation (cf. Section V.B). Free cities were virtually self-ruling enclaves independent of the rule of regional princes. Many of them had accumulated substantial wealth through trade, and they were well known for their liberal thinking, which might have been conducive to adopting the Reformation. However, excluding from our sample all Prussian counties that contain former free imperial or Hanseatic cities hardly affects the qualitative result (column (7)).

To account for possible nonrandom migration of different denominations, column (8) restricts the analysis to the subsample of counties that are hardly intermixed denominationally, that is, counties that are either mostly Protestant or mostly Catholic (defined as having more than 80% or less than 20% Protestants). Given the limited pattern of migration, the dominant denomination in these counties will derive from the historical choices of local rulers (see below), not from migration. Our results are unaffected by restricting the analysis to this sample of 343 counties.28

Column (9) uses distance to school as an alternative measure of human capital. The results show that the share of Protestants in a county is negatively related to the share of students with a long distance to school, indicating a denser supply of schools in Protestant counties. The same result holds in a bivariate association, and it is robust to all the robustness specifications just discussed, including controlling for urbanity (not shown).

25. As the data from the 1871 Population Census are available separately for urban municipalities and for rural areas in each county (where a population size of 2,000 is used to classify municipalities into urban and rural; see Appendix I for details), we can estimate this association separately for rural and urban areas. Although the association is statistically significant in both subpopulations, it is more pronounced in rural areas, as might be expected with average literacy rates in urban municipalities as high as 91.0% (cf. Becker and Woessmann [2007] for details).

26. Table A.2 in the Appendices shows that the share of Protestants is virtually identical on average in urban municipalities and in rural areas (64.6% and 64.7%, respectively). There is no tendency for Protestants or Catholics to live predominantly in urban (or rural) areas.

27. As an alternative robustness check, we restricted the analysis, in an increasingly restrictive manner, to subsamples of counties that had been part of Prussia for a long time. We start with the 361 counties that had been part of Prussia for more than 50 years in 1871, and then restrict to the 235 counties that had been part of Prussia before 1800, 179 counties before 1750, and 89 before 1650. Again, our main qualitative results are perfectly robust in these subsamples.

28. Note also that our basic model already controls for patterns of migration over the lifespan of the 1871 population by including the share of the population born in the respective municipality and the share of the population that is of Prussian origin. In addition, our results are robust to restricting the sample to counties with above-average shares of inhabitants born in their specific municipality, as well as to counties with less than 1% of inhabitants of non-Prussian origin (not shown).

V. THE HISTORICAL ORIGIN OF THE REFORMATION AS AN EXOGENOUS SHOCK

This section addresses potential endogeneity issues in the spread of Protestantism. It exploits the historically concentric dispersion of Protestantism around Luther’s city of Wittenberg to use distance to Wittenberg as an instrumental variable (IV) that yields exogenous variation in Protestantism.

V.A. Instrumental Variable Results Exploiting the Concentric Dispersion of Protestantism around Wittenberg

Several concerns may emerge in interpreting the association between Protestantism and literacy presented above as a causal effect. For example, if poor areas with more prevalent opposition to the Catholic establishment had been more likely to convert to Protestantism during the Reformation, and if economic development was correlated over time, the residual term in equation (1) might not be exogenous to the spread of Protestantism. We therefore need a strategy to deal with potential endogeneity.

In principle, however, several historical facts suggest that the spread of the Reformation in Prussia can be viewed as an exogenous shock (see Becker and Woessmann [2007] for greater detail). The vast majority of the regional denominational variation that existed in late-nineteenth-century Prussia had already been determined at the time of the Reformation in the sixteenth and early seventeenth centuries.29 The Imperial Diet held in Augsburg in 1555 had adopted the principle “Cuius regio, eius religio” (“Whose rule, his religion”), which meant that denominational choices were made only by the rulers of the many territories that constituted the fragmented German empire at the time of the Reformation. The citizens were forced to accept their respective sovereigns’ denominational choices, which were mostly driven by power politics, following or seceding from the worldly forces supporting the Pope. There is little room for denominational choices being endogenous to literacy at that time, because literacy in Germany around 1500 is estimated to have been as low as 1 percent of the population, restricted exclusively to the nobility and some townsmen (Engelsing 1973, p. 19).30 Also, Luther’s theses were mostly distributed to the general public through caricatures denouncing the unethical behavior of the Pope and his allies (Scribner 1994). Finally, the regional origination of Protestantism from Luther’s city of Wittenberg was triggered by a specific shock, a particularly vicious example of indulgence practice to which many of Luther’s parishioners succumbed. Bishop Albrecht of Brandenburg initiated the selling of indulgences in the province of Magdeburg in 1517, officially to support the construction of St. Peter’s in Rome, but in reality half of the revenues were used to pay off Albrecht’s debts to the Fugger dynasty.

Although this historical origin of denominations rules out the most obvious forms of potential endogeneity, some possible sources of endogeneity might remain. For example, Ekelund, Hébert, and Tollison (2002) hypothesize that the diffusion of the Reformation might have been facilitated in societies characterized by the decline of feudalism and a relatively unstable distribution of wealth. This hypothesis is explicitly aimed at diffusion across countries, though, and may be less relevant for diffusion within Prussia. Similarly, although the idea that choice of denomination may be endogenous to education (Glaeser and Sacerdote 2008) in principle provides an additional source of endogeneity, this source also seems less of an issue in our case, because there was hardly any effective individual denominational choice in the nineteenth century.31 However, wealthy regions may have been less likely to select into Protestantism at the time of the Reformation because they benefited more from the hierarchical Catholic structure, because the opportunities provided by indulgences appealed to them, and because indulgence costs weighed less heavily on them. When education became more widespread in subsequent centuries, these regions could more easily afford to educate their children. The fact that “Protestantism” was initially a “protest” movement involving peasant uprisings that reflected social discontent is suggestive of such a negative selection bias.32

29. This means that the religious variation far predates industrialization and thus any manufacturing occupation that constitutes our sectoral outcome measure below.

30. If there was any systematic aspect to the spread of Protestantism, it might have been centered in cities. However, as discussed above, the shares of Protestants in rural and urban areas were in fact identical in the late nineteenth century, and controlling for urbanity and excluding all free imperial and Hanseatic cities does not change our results.

31. On average over 1859–1867, only 766 adult Catholics per year out of more than 7 million Catholics converted to Protestantism, mostly in the course of marriage to a Protestant partner (Hilse 1869).

32. There is illustrative evidence that rich regions may have been less likely to join the Protestant movement at the time of the Reformation. With the exception of Hanseatic cities, all our indicators of economic and educational development before the Reformation discussed in the next section are negatively associated with the share of Protestants, also after controlling for distance to Wittenberg (to compare regions of different wealth at a similar distance to Wittenberg). Two of the negative associations (with imperial cities and universities in 1517) reach standard levels of statistical significance; the other three (urbanization, monasteries, and schools in 1517) are marginally significant.

To rule out such potential remaining worries about endogeneity, we use a particular aspect of the historical diffusion of Protestantism across the German Empire to restrict the variation in Protestantism used in the estimation to a part that is credibly exogenous. Reformation historians liken the diffusion of Protestantism to the propagation of a wave caused by a stone thrown into water.33 Luther’s work had its most immediate effect in the area surrounding his city of Wittenberg, and its impact tended to diminish with distance from Wittenberg. In effect, in the German Empire, Protestantism dispersed around Wittenberg in a mostly concentric pattern.34 As evidenced in Figure III above, the Reformation seems to have spread out from Wittenberg in all directions, but then came to a halt after some distance.

The main reasons for a circular dispersion around Wittenberg may have been the costs of travel and of information diffusion through space, and these transportation and transaction costs played a crucial role at the time. Electoral Saxony, the principality around Wittenberg, was an early leader in implementing Luther’s visions of reform, serving as a role model of practical and political implementation for princes in other areas (Dixon 2002). This gave places closer to Wittenberg the advantage of being able to observe the Reformation ideals put into practice and to form alliances of Protestant territories against Catholic powers more easily.35 Furthermore, thousands of students came to Wittenberg to hear Luther’s sermons and speeches, and they spread the word as preachers back in their home regions (Peters 1969; Bunkowske 1985). In fact, starting in 1535 everyone who wanted to become a priest in Electoral Saxony had to be centrally ordained in Wittenberg. Although this was not legally compulsory for ordinands in other Protestant territories, many came to Wittenberg for ordination to obtain the seal of approval connected to the prestige of Luther and Melanchthon (Krarup 2007). Given the arduousness of travel in the early 1500s, the propensity to come to Wittenberg to listen to Luther and his successors likely declined with distance to Wittenberg. Finally, the fact that German regions spoke increasingly different dialects the farther apart they were may also have contributed to a concentric pattern of the dispersion of Protestantism, through both oral and written means of dissemination, and in dissemination both to rulers and to the population at large.

33. Luther himself likened the spreading of the sermon to “throwing a stone into the water which makes waves, circles, and streaks around it, and the waves push each other further and further; one pushes the other . . . ” (Luther 1905, p. 140). He also stressed that the preaching “will be disseminated further and further and that from the Church which is located in a certain place many others will be drawn to the Word” (Luther 1902, p. 224). In the latter source, Luther explicitly refers to Wittenberg as the place from which a creek irrigates the neighboring regions.

34. We do not claim that the dispersion was concentric outside the German Empire. Other countries had other Protestant reformers, who, for example, provided the first Bible translations in their native languages. Our argument strictly refers to diffusion within Prussia and the German Empire.

The geographically concentric pattern of the dispersion of the Reformation provides a means to obtain a specific variation in Protestantism that is credibly exogenous to economic and educational considerations: the variation due to distance to Wittenberg. We thus use distance to Wittenberg as an instrument for the share of Protestants in a county in nineteenth-century Prussia. The first two columns of Table III report the IV estimate of the effect of Protestantism on literacy, where Protestantism is instrumented by distance to Wittenberg. As is evident from the first-stage F-statistic of the instrument, distance to Wittenberg is a strong instrument for the share of Protestants in a county. Each 100 km of distance to Wittenberg is associated with a Protestant share that is 9.5 percentage points lower. The second stage uses only the part of the Protestant share that is due to distance to Wittenberg to predict the literacy rate.
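In a just-identified model like this one (one instrument, one endogenous regressor), the IV estimate equals the reduced-form effect of the instrument on the outcome divided by its first-stage effect on the Protestant share, and the first-stage F-statistic equals the squared t-statistic of the instrument. The Python sketch below illustrates both points; it is not the authors’ estimation code, omits the control variables, and runs on simulated data whose parameters (a hypothetical unobserved `wealth` variable, the noise levels, a true effect of 0.19) are invented for illustration. Because `wealth` lowers the Protestant share and raises literacy, mimicking the negative selection discussed above, OLS understates the true effect while the distance instrument recovers it.

```python
"""Reader's sketch: just-identified IV as a Wald ratio, on simulated data.

All parameters are hypothetical; they only mimic the structure
distance to Wittenberg -> Protestant share -> literacy. No controls.
"""
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)


def ols_slope(x, y):
    """Bivariate OLS slope of y on x."""
    return cov(x, y) / cov(x, x)


def iv_slope(z, x, y):
    """Wald ratio: reduced form cov(z, y) over first stage cov(z, x)."""
    return cov(z, y) / cov(z, x)


def first_stage_f(coef, se):
    """With a single instrument, the first-stage F equals the squared t."""
    return (coef / se) ** 2


def simulate(n=20_000, true_effect=0.19, seed=1517):
    """Draw hypothetical counties; shares are not clipped to [0, 100]."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        distance = rng.uniform(0, 600)  # km to Wittenberg (hypothetical)
        wealth = rng.gauss(0, 1)        # unobserved confounder
        # Richer places are less Protestant (negative selection) ...
        prot = 90 - 0.095 * distance - 5 * wealth + rng.gauss(0, 2)
        # ... and more literate for reasons unrelated to religion.
        lit = 70 + true_effect * prot + 3 * wealth + rng.gauss(0, 1)
        z.append(distance)
        x.append(prot)
        y.append(lit)
    return z, x, y


if __name__ == "__main__":
    z, x, y = simulate()
    print(f"OLS slope: {ols_slope(x, y):.3f}")   # biased toward zero here
    print(f"IV slope:  {iv_slope(z, x, y):.3f}")  # close to the true 0.19
    print(f"F implied by -0.095 (0.011): {first_stage_f(-0.095, 0.011):.1f}")
```

The `first_stage_f` helper also gives a rough consistency check on Table III: the rounded first-stage coefficient −0.095 (0.011) implies (0.095/0.011)² ≈ 74.6, close to the reported F-statistic of 74.19; a gap of that size is what rounding of the coefficient and standard error can produce.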
The positive effect of Protestantism on literacy is highly robust in the IV specification. In fact, the point estimate is significantly higher, implying a difference in literacy of 18.9 percentage points between an all-Protestant and an all-Catholic county, which is indicative of a negative selection bias. Similarly, the negative effect of Protestantism on distance to school is confirmed in the IV specification (column (3) of Table III). The IV results are robust to the set of robustness specifications discussed above, such as including the geographic controls and indicators for recent annexations and excluding free cities and denominationally intermixed counties (not shown).

35. To rule out concerns of endogeneity, we ensured that our results are robust to dropping from our analysis those counties that belonged to Electoral Saxony in Reformation times.

TABLE III
THE EFFECT OF PROTESTANTISM ON LITERACY: IV RESULTS BASED ON DISTANCE TO WITTENBERG

Variable | (1) 1st stage | (2) 2nd stage | (3) 2nd stage
Dependent variable | Share Protestants | Share literates | Distance to school
Distance to Wittenberg (in km) | −0.095 (0.011)∗∗∗ | |
% Protestants | | 0.189 (0.028)∗∗∗ | −0.025 (0.011)∗∗
% age below 10 | 0.205 (0.702) | −1.952 (0.170)∗∗∗ | 0.147 (0.066)∗∗
% Jews | −7.264 (1.242)∗∗∗ | −0.437 (0.341) | −0.094 (0.130)
% females | −0.557 (1.343) | −1.073 (0.327)∗∗∗ | 0.459 (0.126)∗∗∗
% born in municipality | −1.390 (0.134)∗∗∗ | 0.607 (0.050)∗∗∗ | −0.107 (0.019)∗∗∗
% of Prussian origin | −1.935 (0.802)∗∗ | −0.181 (0.199) | 0.032 (0.076)
Average household size | −14.610 (5.778)∗∗ | 0.885 (1.573) | 3.084 (0.595)∗∗∗
ln(population size) | −0.977 (3.883) | −1.318 (0.936) | −0.080 (0.361)
Popul. growth 1867–1871 (in %) | −1.962 (0.404)∗∗∗ | 0.410 (0.119)∗∗∗ | −0.164 (0.046)∗∗∗
% missing education info | 1.729 (1.418) | −0.505 (0.348) |
Observations | 452 | 452 | 452
R² | .419 | .689 | .356
1st-stage F statistic | 74.19 | |

Source. Data for Prussian counties from the 1871 Population Census and the 1886 Education Census; see main text and Appendix I for details. Further controls: % blind, % deaf-mute, % insane.
Note. Standard errors in parentheses. Significance at ∗ 10, ∗∗ 5, ∗∗∗ 1 percent.

V.B. Is Distance to Wittenberg a Valid Instrument?

The fact that Wittenberg is generally accepted to have been an “unimportant place” (Holborn 1942, p. 133) until 1517 suggests that distance to Wittenberg should be unrelated to a county’s economic and educational state before it adopted Protestantism. We probe this more rigorously in a set of empirical tests that can shed light on the validity of the instrument, despite the obvious limits of data on the economic or educational situation at the time of the Reformation (cf. Appendix II for details on the data sources for the following analyses).36 The first three tests examine whether several indicators of economic development in Luther’s time are related to distance to Wittenberg, and the next three tests perform the same analysis for several indicators of educational development in Luther’s time (the economic and educational situation may of course be interrelated).

First, free imperial cities (Reichsstädte) were major economic hubs in Luther’s time. In the Holy Roman Empire, free imperial cities had particular political systems and legal independence, controlled their own trade, and built up an extraordinary amount of wealth. The group of free imperial cities included such important cities as Aachen, Cologne, Frankfurt am Main, Gdansk, and Hamburg. As Table IV reveals, in our sample of 452 Prussian counties, distance to Wittenberg is completely insignificant in predicting the probability of being a free imperial city, measured in pre-Reformation status. Second, a similar argument about economic advancement can be made for free Hanseatic cities. Although some of them, such as Cologne, Gdansk, and Hamburg, were also free imperial cities, other important cities such as Hanover, Königsberg, and Magdeburg were not, but still belonged to the Hanseatic League. Again, distance to Wittenberg is uncorrelated with the probability of being a Hanseatic city.
Third, economic historians often use urban population as a proxy for preindustrial economic prosperity because cities could only be supported in areas with high agricultural productivity, advanced economic specialization, and developed transport systems (cf. Bairoch [1988]; Acemoglu, Johnson, and Robinson [2002]). We 36. The 1871 Population Census is explicitly the first occasion on which consistent data on literacy were surveyed. There is only scattered historical evidence on the spread of literacy and schooling in Prussia between 1500 and 1871, discussed in Appendix B of Becker and Woessmann (2007), which suggests that Luther’s educational postulations did have a long-lasting effect. Our analysis has to leave open whether Luther’s educational postulations had economic effects already in the agrarian economy of the sixteenth to eighteenth centuries, or whether they had to wait for the industrial revolution to raise the economic payoff to education, as has been argued for Sweden (Sandberg 1979).
0.0008 (0.0084)
452 .00002
0.0034 (0.0071)
452 .0005
(2)
452 .0004
0.00006 (0.00013)
(3)
148 .014
0.0059 (0.0042)
(4)
City population in 1500
16 .003
−1.626 (7.998)
−0.0019 (0.0047)
452 .0004
(6)
(5)
452 .049
0.0020 (0.0110) 0.578 (0.121)∗∗∗
(7)
Year of Monasteries University foundation per km2 in 1517 of university in 1517
Universities
333 .002
−0.0073 (0.0099)
(8)
School in 1517
59 .001
0.0610 (0.2384)
(9)
Year of foundation of school
Schools
Indicators of pre-Reformation educational development
a Distance
Note. Standard errors in parentheses. Column (7) uses data on monasteries in the German Empire which is available only for municipalities starting with letters A–L. See main text for details. See Appendix III for data sources. to Wittenberg measured in 100 km. Significance at ∗ 10, ∗∗ 5, ∗∗∗ 1 percent.
Distance to Wittenberga Share of municipalities beginning with letter A to L Obs. R2
(1)
Imperial Hanseatic Urban pop. city city per km2 in 1517 in 1517 in 1500
Urbanization
Indicators of pre-Reformation economic development
TABLE IV EXOGENEITY OF THE DISTANCE-TO-WITTENBERG INSTRUMENT
WAS WEBER WRONG?
561
562
QUARTERLY JOURNAL OF ECONOMICS
use the data on urban population by Bairoch, Batou, and Chèvre (1988) to construct a measure of urban population per square kilometer in 1500 for each county in our 1871 data.37 As column (3) of Table IV reveals, the extent of urbanization just before the Reformation is uncorrelated with distance to Wittenberg. Likewise, we use all 148 cities of the 1871 German Empire (or the 75 cities of 1871 Prussia) contained in the Bairoch, Batou, and Chèvre (1988) data and estimate whether their population size is correlated with distance to Wittenberg; the point estimate is even positive and approaches standard levels of statistical significance (column (4)).38 In addition, city growth between 1400 and 1500 is uncorrelated with distance to Wittenberg. Fourth, one measure of education available for the pre-Luther period is the existence of universities. We use data from Eulenburg (1994) on German universities founded before 1517 to estimate whether distance to Wittenberg predicts whether a county in our sample had a university before 1517 (column (5)). We also regress the year of foundation of the universities that existed before 1517 on the territory of the 1871 German Empire on distance to Wittenberg (column (6); results are equivalent when restricting to 1871 Prussian territory). In both exercises, we find that distance to Wittenberg is completely unrelated to the spread of universities before Lutheran times. Fifth, throughout medieval times, monasteries were the guardians of learnedness, preserving the skill of literacy and often containing substantial libraries (cf., e.g., Marry [1953]; Frank [1993]).39 The density of monasteries can thus serve as another important correlate of literacy before Luther. Grote (1881) provides an encyclopedia of monasteries in the German Empire, detailing their location, year of foundation, and (if applicable) year of abandonment. Its major advantage is that it covers the whole of Prussia within its borders of the late nineteenth century. A

37. The adjustment for county size in square kilometers accounts for the fact that county boundaries in the late nineteenth century were drawn so that counties are smaller in areas of higher population density.
38. Given that the city sample of Bairoch, Batou, and Chèvre (1988) is defined by all cities that had at least 5,000 inhabitants at some point between 800 and 1800, we also performed analyses on the restricted sample of cities that had at least 5,000 inhabitants in 1500 to rule out issues of sample selectivity. Results are the same.
39. The importance of monasteries as centers of literacy is also documented by the medieval dictum of Gottfried of St. Barbe-en-Auge in 1170, “claustrum sine armario quasi castrum sine armamentario” (“a monastery without a library is like a fortress without an armory,” Migne [1855, Vol. 205, Col. 845A]).
WAS WEBER WRONG?
drawback is that the available volume covers only locations beginning with the letters A to L; the planned second volume was never completed because of Grote’s untimely death. We control for this by adding the share of municipalities in each county beginning with the letters A to L as a control variable, derived from the complete list of municipalities in the 1871 Prussian Population Census. Although this control enters significantly into a regression of the density of monasteries that were in operation in 1517, distance to Wittenberg does not (column (7)). Sixth, an explicit “List of the oldest schools in the German-speaking area” is provided on the German-language Wikipedia. Although this list is not necessarily exhaustive, we can use it to identify 59 schools that existed before the Reformation, as an additional indicator of pre-Lutheran literacy. In contrast to all previous measures, this measure is available only for German-speaking territory rather than for the whole of Prussia, so we exclude the Polish-speaking provinces from the analysis. Note that if this list suffered from reporting bias, in the sense that schools that survived and prospered after 1517 are more likely to be included, the fact that the Reformation gave a boost to schools means that our test would be biased toward finding more schools closer to Wittenberg. Yet despite this possible bias, distance to Wittenberg is not significantly correlated with the existence of a school in a county before 1517 or with the years of foundation of schools (columns (8) and (9) of Table IV). To see whether these tests of the validity of our instrument are strong tests, we can examine whether the different indicators of pre-Reformation economic and educational development are correlated with our measure of literacy in the late nineteenth century. Indeed, all six indicators are positively correlated with 1871 literacy.
Statistically significant positive correlations exist between 1871 literacy and the existence of schools before 1517 (correlation coefficient .13, statistically significant at the 2% level), the existence of universities before 1517 (.08, 10%), and the probability of having been a free imperial or Hanseatic city (.09, 5%). Thus, these indicators of economic and educational development before Luther’s time are indeed measures relevant to literacy in the late nineteenth century, and our instrument is orthogonal to all of them, providing confidence in its plausibility and validity.
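Each validity check in Table IV is, in effect, a balance (placebo) regression: a pre-1517 characteristic is regressed on distance to Wittenberg, and a small, insignificant coefficient is read as evidence that the instrument is unrelated to pre-Reformation development. The following is a minimal sketch on synthetic data. It assumes a binary pre-1517 indicator, a linear probability model without further controls, and conventional homoskedastic standard errors. The helper names (`ols_with_se`, `balance_test`) and the simulated data are hypothetical and do not come from the authors' code or data.

```python
import numpy as np


def ols_with_se(y, X):
    """OLS coefficients with conventional (homoskedastic) standard errors."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se


def balance_test(pre_indicator, distance):
    """Regress a pre-treatment indicator on the instrument; return (coef, t-stat)."""
    X = np.column_stack([distance, np.ones_like(distance)])
    beta, se = ols_with_se(pre_indicator, X)
    return beta[0], beta[0] / se[0]


rng = np.random.default_rng(42)
n = 5_000                                    # synthetic counties
distance = rng.uniform(0.0, 1.0, n)          # hypothetical rescaled distance to Wittenberg

# Case 1: indicator independent of distance (what a valid instrument requires).
balanced = (rng.uniform(size=n) < 0.10).astype(float)
# Case 2: indicator that declines with distance (what would cast doubt on it).
unbalanced = (rng.uniform(size=n) < 0.20 - 0.15 * distance).astype(float)

coef_b, t_b = balance_test(balanced, distance)
coef_u, t_u = balance_test(unbalanced, distance)
print(f"balanced:   coef={coef_b:+.3f}, t={t_b:+.2f}")
print(f"unbalanced: coef={coef_u:+.3f}, t={t_u:+.2f}")
```

Adding county-level control variables would simply mean appending columns to `X` before the intercept.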
QUARTERLY JOURNAL OF ECONOMICS
VI. THE IMPORTANCE OF EDUCATION FOR THE ECONOMIC PERFORMANCE OF PROTESTANTS

The results so far provide empirical support for the conclusion that Protestantism led to a better-educated population, which is the cornerstone of our human capital theory of Protestant economic history. This section turns to an empirical analysis of the importance of education in accounting for the higher economic prosperity of Protestants.

VI.A. Measures of Economic Outcomes in Nineteenth-Century Prussia

Our main measure of economic outcome is a proxy for county income based on income tax statistics. The advantage of income measures is that they are arguably the most encompassing measure of economic prosperity. Our best proxy for county income close in time to our data on religion and literacy is income tax revenue per capita, available from the income tax statistics of 1877 (see Appendix I for details on the different data sources). For historical reasons, there were two types of income tax in 1877: the so-called class tax for annual incomes from 420 to 3,000 Marks and the so-called classified income tax for incomes above that. However, these two effectively constituted a single income tax with slightly increasing tax rates up to incomes of 3,000 Marks and a constant tax rate for higher incomes. Hill (1892, p. 214) stresses that at least since the 1873 tax reform, “The [class tax] was . . . recognized as being, in fact, an income tax which was to be assessed ‘on the basis of the estimated value of the annual income.’ ” The classified income tax was a pure income tax. Although very low incomes were exempted from income tax40 and although the tax rate was linear only for incomes above 3,000 Marks, we think that total income tax revenue per capita is a reasonable proxy for county income at the time, and certainly the best available one.41 Average income tax revenue amounted to 2.0 Marks per county inhabitant, ranging from 0.2 to 5.6 Marks across counties.

40. While we do not know what fraction of the population was subject to income tax in 1877, in the income tax statistics of 1892, 41.8% of the population lived in households that were subject to income tax (ranging from 21.9% to 68.4% across counties). However, the minimum taxable income had been more than doubled (from 420 to 900 Marks) in the early 1880s with the explicit aim of benefiting the poorer classes (cf. Hill [1892]), suggesting that a much larger fraction of households was subject to income taxation in 1877.
41. Income tax data are not available for the 26 city counties in 1877.
Drawbacks of this main measure of economic outcome are that it does not capture the low end of the income distribution and that the underlying income cannot be perfectly inferred from the amount of tax revenue, because the tax scale has some progressivity at the low end (although most of the 1877 tax scale is linear).42 More generally, the per capita income tax measure is not a direct measure of income; rather, it infers income indirectly from the income tax revenue of each county.43 We therefore also use an alternative measure that overcomes this drawback. Although general county-level income data are not available for nineteenth-century Prussia, the 1886 Education Census offers a unique source of direct income data for one specific occupation, namely the average annual income of male elementary school teachers. This measure is the only direct measure of income available for all counties at the time and has been used as a proxy for income in general in other studies (e.g., Lee, Galloway, and Hammel [1994]). In nineteenth-century Prussia, teachers’ salaries were almost entirely financed from local contributions and therefore reflect overall income in the county (cf. Schleunes [1989]). The correlation of the teacher income measure with the 1877 per capita income tax measure is .60. The downsides of the teacher income measure are that it refers to one occupational group only, that teacher salaries may be affected by how much education is valued in a county, and that there may be reverse

42. To overcome these drawbacks, we also constructed a more refined proxy of average income from more detailed data available only in later years (for the 60 largest Prussian cities in 1892 and for all Prussian counties in 1901). These measures combine data on the share of households not paying income tax with data on daily wages of unskilled day laborers from social security statistics and with detailed income tax data that make it possible to infer underlying income directly from taxes paid (see Appendix I for details on the income tax statistics and the social security statistics of 1892 and 1901). Their downsides are the longer time lag relative to our other data and several necessary approximations. Our result that Protestantism does not have a significant effect on income once income is adjusted for differences in literacy is fully robust to these encompassing average income measures in both bounding analyses reported below, suggesting that tax progressivity and the omission of low earners from the income proxy reported here are not driving the results. Although the estimated effect of Protestantism on the income measure is statistically significant without adjusting for literacy differences in the full sample (available only in 1901), it does not reach statistical significance in the smaller sample of large cities (in 1892).
43. Measures of average income may only partly capture the specific traits of entrepreneurship, a dimension of economic outcomes sometimes associated with the Weber thesis. As one measure that may capture entrepreneurial income in particular, we also used only the classified income tax part of the total 1877 income tax, which captures only high incomes, covering roughly 4% of the population. Qualitative results are the same.
causation from teacher income to literacy, giving rise to problems of endogeneity.44 As a second alternative measure of economic development, we also use the sectoral structure, derived from the 1882 Occupation Census. The average share of the labor force outside agriculture is 33.9% (27.7% in manufacturing and 6.3% in services; cf. Table IV). This measure captures much of what economic development in historical perspective in general, and the Weber thesis in particular, is explicitly about: modernization, the advancement of capitalism, and the division of labor. Weber was not proposing a theory of income levels but a theory of the advancement of modern capitalism. Sectoral shares may capture such concepts even better than standard income measures. The correlation of the size of the nonagricultural sector with 1877 per capita income tax is .42, and its correlation with 1886 teacher income is .74. The main drawback of sectoral shares as measures of economic prosperity is that they may miss important dimensions of variation in economic output, such as productivity, hours worked, or entrepreneurship, which may well be related to Protestantism. Although each of our measures of economic outcomes has its specific advantages and drawbacks, our results below prove very robust across all of them, which makes it unlikely that they are driven by the drawbacks of any one measure.

VI.B. The Association between Protestantism and Economic Outcomes

Table V provides results of regressions of our measures of economic outcome Y in the 452 Prussian counties on the share

44. The downside of occupation specificity is overcome in the 1892 and 1901 measures based on income tax data for higher incomes and wage data for low-income earners, discussed in footnote 42. Another measure of occupation-specific income that should not be subject to the endogeneity problems is the income of city mayors. A special survey collected data on the annual salaries of mayors and other paid members of city magistrates in 1879 in the 138 Prussian cities with more than 10,000 inhabitants, with the explicit aim of obtaining first information on the “cost of labor” (Blenck 1880; see Appendix I for details on the 1879 Survey of Mayor Incomes). To the extent that the salary of mayors was financed from local taxes, it likely constitutes a decent proxy for average income. Using mayoral income or the income of all upper-rank civil servants in the 138-county sample also never yields significant positive effects of Protestantism on income after adjusting for literacy differences. However, the association of the income proxy with Protestantism is again not as clear in this sample of big cities, although it does show a significant positive association with literacy. The nominal income measures may also be affected by differences in price levels across counties. However, the Balassa–Samuelson hypothesis suggests that prices are higher in economically advanced areas (Balassa 1964; Samuelson 1964), so that nominal income differences may still provide a good proxy for economic affluence.
TABLE V
PROTESTANTISM AND ECONOMIC OUTCOMES IN NINETEENTH-CENTURY PRUSSIA

Estimator: columns (1)–(3) OLS; columns (4)–(6) IV(a); columns (7)–(9) OLS.
Outcome: columns (1), (4), and (7) per capita income tax(b) (main outcome); columns (2), (5), and (8) ln(teacher income)(b) and columns (3), (6), and (9) share of the labor force in manufacturing and services (alternative outcomes).

                                 (1)         (2)         (3)         (4)         (5)         (6)         (7)         (8)         (9)
% Protestants                  0.154       0.063       0.035       0.586       0.105       0.082      −0.068       0.001      −0.013
                            (0.091)∗  (0.019)∗∗∗   (0.015)∗∗   (0.236)∗∗   (0.050)∗∗   (0.039)∗∗     (0.097)     (0.020)     (0.015)
% literate                                                                                                   2.460       0.636       0.490
                                                                                                        (0.424)∗∗∗  (0.086)∗∗∗  (0.066)∗∗∗
% age below 10                −4.721      −1.816      −0.440      −5.301      −1.827      −0.452      −0.812      −0.573       0.507
                          (1.810)∗∗∗  (0.302)∗∗∗    (0.232)∗  (1.881)∗∗∗  (0.304)∗∗∗    (0.235)∗     (1.874)    (0.331)∗   (0.254)∗∗
% Jews                         4.236       1.258      −0.005       7.388       1.494       0.262       8.963       1.917       0.463
                             (3.018)   (0.538)∗∗     (0.413)   (3.479)∗∗   (0.601)∗∗     (0.464)  (3.045)∗∗∗  (0.520)∗∗∗     (0.399)
% females                    −20.086      −4.320      −2.857     −18.772      −4.230      −2.755     −16.544      −3.477      −2.234
                          (2.992)∗∗∗  (0.572)∗∗∗  (0.439)∗∗∗  (3.143)∗∗∗  (0.583)∗∗∗  (0.451)∗∗∗  (2.935)∗∗∗  (0.553)∗∗∗  (0.425)∗∗∗
% born in municipality        −0.155       0.264       0.361       0.446       0.321       0.425      −1.362      −0.043       0.124
                             (0.305)  (0.063)∗∗∗  (0.049)∗∗∗     (0.435)  (0.090)∗∗∗  (0.069)∗∗∗  (0.361)∗∗∗     (0.073)   (0.056)∗∗
% of Prussian origin           1.534      −0.133      −0.373       2.473      −0.071      −0.302       2.998       0.091      −0.217
                             (1.813)     (0.345)     (0.265)     (1.921)     (0.354)     (0.274)    (1.772)∗     (0.329)     (0.252)
Average household size       −50.042      −7.149     −10.818     −37.441      −5.954      −9.465     −42.230      −5.738      −9.963
                         (11.834)∗∗∗  (2.397)∗∗∗  (1.840)∗∗∗ (13.698)∗∗∗   (2.753)∗∗  (2.127)∗∗∗ (11.581)∗∗∗   (2.305)∗∗  (1.769)∗∗∗
ln(population size)            7.618       6.013       5.217       8.680       5.972       5.170       7.993       6.674       5.806
                             (8.814)  (1.663)∗∗∗  (1.276)∗∗∗     (9.068)  (1.672)∗∗∗  (1.292)∗∗∗     (8.513)  (1.580)∗∗∗  (1.213)∗∗∗
Popul. growth 1867–1871       −1.002       1.002       1.642       0.292       1.106       1.759      −0.940       0.887       1.550
(in %)                       (0.960)  (0.179)∗∗∗  (0.137)∗∗∗     (1.180)  (0.213)∗∗∗  (0.165)∗∗∗     (0.924)  (0.170)∗∗∗  (0.130)∗∗∗
Obs.                             426         452         452         426         452         452         426         452         452
R²                              .328        .534        .611        .291        .529        .602        .383        .586        .654

Source: Data for Prussian counties from the 1871 Population Census, the 1877 Income Tax Statistics, the 1882 Occupation Census, and the 1886 Education Census; see main text and Appendix I for details.
Note. Standard errors in parentheses. Further controls: % blind, % deaf-mute, % insane, and (in columns (7)–(9)) % missing education info. (a) % Protestants is instrumented by distance to Wittenberg; see the first column of Table III for the corresponding first-stage result. (b) Coefficients multiplied by 100. Significance at ∗ 10, ∗∗ 5, ∗∗∗ 1 percent.
of Protestants PROT in the county, as well as our set of control variables X:

$$Y = \alpha_2 + \beta_2\,\mathit{PROT} + X\gamma_2 + \varepsilon_2 . \tag{2}$$
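Equation (2) is estimated both by OLS and by IV, with distance to Wittenberg as the excluded instrument for PROT. The sketch below, on synthetic data with a single control variable, shows the just-identified IV estimator (Z′X)⁻¹Z′y recovering a known coefficient when an unobserved county shock lowers the Protestant share and raises the outcome. That configuration pushes OLS downward, which is the direction of bias the text reports. All variable names, coefficients, and the data-generating process are hypothetical illustrations rather than the authors' specification.

```python
import numpy as np


def ols(y, X):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def iv_just_identified(y, X, Z):
    """Just-identified IV (2SLS): beta = (Z'X)^{-1} Z'y, with Z and X of equal width."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)


def simulate_equation_2(n=20_000, beta_true=0.6, seed=0):
    rng = np.random.default_rng(seed)
    distance = rng.uniform(0.0, 1.0, n)     # hypothetical rescaled distance
    x = rng.normal(size=n)                  # one exogenous control (stands in for X)
    shock = rng.normal(size=n)              # unobserved county shock
    prot = 0.7 - 0.5 * distance + 0.1 * x - 0.05 * shock + 0.1 * rng.normal(size=n)
    y = 1.0 + beta_true * prot + 0.5 * x + 0.3 * shock + 0.5 * rng.normal(size=n)

    const = np.ones(n)
    X = np.column_stack([prot, x, const])      # regressors of equation (2)
    Z = np.column_stack([distance, x, const])  # distance replaces PROT as instrument
    return ols(y, X)[0], iv_just_identified(y, X, Z)[0]


beta_ols, beta_iv = simulate_equation_2()
print(f"OLS: {beta_ols:.3f}  IV: {beta_iv:.3f}  (true value 0.6)")
```

Under these assumptions OLS lands well below the true 0.6 while IV lands close to it. In the paper's data, the same sign of bias appears as IV estimates in columns (4)–(6) of Table V that exceed the OLS estimates in columns (1)–(3).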
As in the bivariate setting of Section II, the results show that counties with larger shares of Protestants exhibit a more advanced degree of economic progressiveness, consistently across the different measures. The first three columns of Table V report OLS estimates.45 Columns (4)–(6) report IV estimates, where Protestantism is again instrumented by distance to Wittenberg. These coefficients reflect the total causal effect of Protestantism on economic outcomes, including any indirect effect running through literacy. The IV estimates are larger than the OLS estimates, in line with the negative bias of OLS estimates of the effect of Protestantism on literacy discussed above. They suggest that income tax revenue per capita increases significantly with the share of Protestants in a county. On average, an all-Protestant county has per capita income taxes 0.59 Mark higher than an all-Catholic county.46 This is equivalent to 29.6% of the average per capita income tax across all counties, an economically and statistically significant difference. Results are similar for our two alternative measures of economic outcome. The only direct income measure available for all Prussian counties, the annual income of teachers, also increases significantly with the share of Protestants in a county: an all-Protestant county has 10.5% higher income than an all-Catholic county on this measure. Similarly, an all-Protestant county has a nonagricultural share of its labor force that is 8.2 percentage points larger than that of an all-Catholic county.47 Viewed against

45. These results are robust to the different robustness checks discussed in Section IV.C, including the addition of geographical controls, recognition of different waves of Prussian annexations, exclusion of free cities, and migration analyses (cf. Becker and Woessmann [2007]).
An additional way to test whether migration and spillovers across neighboring counties affect our results is to include the average share of Protestants in neighboring counties as an additional control variable. Our results do not change in such a specification, and the share of Protestants in the neighboring counties does not enter significantly into predicting our measures of economic outcomes (not shown).
46. Per capita income tax is used as a level variable because inspection of kernel densities shows that it is roughly normally distributed. In contrast, teacher income is used in logarithms because it is roughly log-normally distributed.
47. Separate regressions show that this combines a manufacturing sector that is 6.5 percentage points larger and a service sector that is 1.7 percentage points larger (2.1 and 1.4, respectively, in the case of the OLS estimate). Estimates for the male labor force, reported in Becker and Woessmann (2007), are even higher.
the average share of the nonagricultural sector in total employment of 33.9%, the average difference in economic progressiveness between Protestants and Catholics appears modest but is both economically and statistically significant. In sum, there is robust evidence of a significant positive effect of Protestantism on economic outcomes.

VI.C. The Effect of Protestantism after Adjustment for Literacy: A Bounding Analysis

The main tenet of our human capital theory of Protestant economic history is that Protestantism affected economic outcomes largely via human capital accumulation. We therefore now consider the extent to which the causal effect of Protestantism on literacy shown above can account for the association between Protestantism and economic outcomes just described. To do so, we would in principle like to estimate a model with both Protestant shares and literacy rates on the right-hand side:

$$Y = \alpha_3 + \beta_3\,\mathit{PROT} + \chi_3\,\mathit{LIT} + X\gamma_3 + \varepsilon_3 . \tag{3}$$
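Columns (7)–(9) of Table V estimate equation (3) by OLS, and the coefficient on PROT shrinks toward zero once LIT is included. The simulation below illustrates that pattern in a hypothetical setting where Protestantism affects the outcome only through literacy and nothing is endogenous. The regression without LIT recovers the total effect (the literacy gain times the return to literacy), while the regression with LIT returns a PROT coefficient near zero and the true return. The function name, the literacy gain of 0.19, the return of 2.0, and the rest of the data-generating process are assumptions chosen for illustration only.

```python
import numpy as np


def ols(y, X):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def simulate_equation_3(n=20_000, lit_gain=0.19, chi_true=2.0, seed=1):
    rng = np.random.default_rng(seed)
    prot = rng.uniform(0.0, 1.0, n)                  # Protestant share (exogenous here)
    x = rng.normal(size=n)                           # one control variable
    lit = 0.3 + lit_gain * prot + 0.1 * x + 0.2 * rng.normal(size=n)
    y = chi_true * lit + 0.5 * x + 0.5 * rng.normal(size=n)  # no direct PROT term

    const = np.ones(n)
    total = ols(y, np.column_stack([prot, x, const]))[0]              # equation (2)
    beta3, chi3 = ols(y, np.column_stack([prot, lit, x, const]))[:2]  # equation (3)
    return total, beta3, chi3


total, beta3, chi3 = simulate_equation_3()
print(f"total effect {total:.3f} (expected 0.38), beta3 {beta3:.3f}, chi3 {chi3:.3f}")
```

The pattern is only suggestive. Nothing in this simulation is endogenous, and that is precisely the assumption the text goes on to relax.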
For descriptive purposes and for comparison with the following exercises, OLS estimates of such a model are reported in the last three columns of Table V. Literacy has a large and significant association both with the main measure of economic outcome and with the two alternatives. Once this association is controlled for, the share of Protestants loses all its association with economic outcomes. The problem with such a model is that not only Protestantism but also literacy may be endogenous in this setting. Shocks that affect economic outcomes may also affect literacy rates, biasing least-squares estimates of χ3. Although distance to Wittenberg provides us with exogenous variation in Protestantism, no independent instrument for literacy is at our disposal. We therefore restrict the literacy effect to values χ that are consistent with evidence from other, well-identified studies in the literature:

$$Y - \chi\,\mathit{LIT} = \alpha_4 + \beta_4\,\mathit{PROT} + X\gamma_4 + \varepsilon_4 . \tag{4}$$
This strategy allows us to obtain estimates of the effect of Protestantism (instrumented by distance to Wittenberg) on economic outcomes net of the literacy effect.
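The estimator behind equation (4) involves three steps. First, estimate χ3 by OLS from equation (3). Second, subtract a chosen multiple of that estimate times LIT from Y. Third, run the IV regression of the adjusted outcome on PROT, with distance as the instrument. The following is a minimal numpy sketch on synthetic data. The function name `bounding_analysis` and the data-generating process are hypothetical. The factor grid, from 40% below to 40% above the OLS estimate, follows the range described next. Because the IV estimator is linear in the dependent variable, β4 falls linearly in the assumed factor.

```python
import numpy as np


def ols(y, X):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def iv_just_identified(y, X, Z):
    """Just-identified IV (2SLS): beta = (Z'X)^{-1} Z'y."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)


def bounding_analysis(y, prot, lit, distance, controls, factors):
    """Return chi_hat and {factor: beta_4}, with chi = factor * chi_hat in equation (4)."""
    const = np.ones(len(y))
    # Step 1: auxiliary OLS of equation (3); chi_hat is the coefficient on LIT.
    chi_hat = ols(y, np.column_stack([prot, lit, controls, const]))[1]
    X = np.column_stack([prot, controls, const])
    Z = np.column_stack([distance, controls, const])
    betas = {}
    for f in factors:
        # Step 2: net out the assumed literacy effect; Step 3: IV on PROT.
        y_net = y - f * chi_hat * lit
        betas[f] = iv_just_identified(y_net, X, Z)[0]
    return chi_hat, betas


rng = np.random.default_rng(2)
n = 20_000
distance = rng.uniform(0.0, 1.0, n)
x = rng.normal(size=n)
prot = 0.7 - 0.5 * distance + 0.1 * x + 0.1 * rng.normal(size=n)
lit = 0.3 + 0.19 * prot + 0.1 * x + 0.2 * rng.normal(size=n)
y = 2.0 * lit + 0.5 * x + 0.5 * rng.normal(size=n)   # PROT works only through LIT

factors = (0.6, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4)
chi_hat, betas = bounding_analysis(y, prot, lit, distance, x, factors)
for f in factors:
    print(f"factor {f:.1f}: beta_4 = {betas[f]:+.3f}")
```

The printed β4 values decline by equal amounts per 0.1 step in the factor, the same pattern visible across the rows of Table VI.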
In his survey of the extensive literature on the causal economic return to education, Card (1999, p. 1802) concludes that “the average . . . return to education is not much below the estimate that emerges from a standard human capital earnings function fit by OLS.”48 He suggests that studies based on identical twins place the causal return at about 10% below the OLS estimate, whereas studies using institutional changes in the education system as instruments estimate returns that are 20%–40% higher than the corresponding OLS estimates. A plausible explanation of the latter is that marginal returns are higher for people with low education (who are most affected by the institutional changes), which makes a downward bias in OLS estimates more likely in our setting. Assuming that the finding of only weakly biased OLS estimates of educational returns also applies in our setting, we proceed by putting upper and lower bounds around the reference estimate of χ3 obtained by estimating equation (3) by OLS, as reported in the last columns of Table V. This auxiliary regression allows us to estimate β4 in equation (4) for a range of χ values. Specifically, we bound the economic return to literacy between 40% below and 40% above its OLS estimate, which at the lower bound is substantially more conservative than suggested by the Card (1999) review. Table VI reports estimates of β4 from this exercise for our three outcome measures. All estimates of the effect of Protestantism in this conservative range are small and statistically insignificant. Results for our main measure of economic outcome, per capita income tax receipts in 1877, generally point to very small effects of Protestantism after adjustment for literacy differences.
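Because the IV estimator is linear in the dependent variable, each cell of Table VI should equal, up to rounding, a simple combination of three published numbers. It is the total IV effect from columns (4)–(6) of Table V, minus the assumed return to literacy (a factor times the OLS coefficient in columns (7)–(9)), times the IV effect of Protestantism on literacy. That last effect is the 18.9-percentage-point literacy gain of an all-Protestant county, cited from Table III later in this section. The check below uses only these rounded, published values, in the tables' display units. It assumes that the Table III literacy regression shares the instrument, controls, and sample of Table VI. That assumption holds only approximately for the income tax column, which uses 426 rather than 452 counties. The helper names, and the standard error of about 0.23 used for the threshold, are illustrative assumptions.

```python
# Published inputs in the display units of Tables III and V (coefficients as printed).
PI_LIT = 0.189  # IV effect of Protestant share on literacy: 18.9 pp per 100 pp (Table III)

TABLE_V = {
    # outcome: (total IV effect, cols (4)-(6); OLS literacy coefficient, cols (7)-(9))
    "income_tax": (0.586, 2.460),
    "ln_teacher_income": (0.105, 0.636),
    "share_manuf_serv": (0.082, 0.490),
}
FACTORS = (0.6, 0.8, 0.9, 1.0, 1.1, 1.2, 1.4)  # 40% below ... 40% above the OLS estimate

TABLE_VI = {
    "income_tax": (0.309, 0.216, 0.170, 0.124, 0.078, 0.031, -0.061),
    "ln_teacher_income": (0.033, 0.010, -0.002, -0.014, -0.026, -0.038, -0.062),
    "share_manuf_serv": (0.027, 0.009, -0.0002, -0.009, -0.018, -0.028, -0.046),
}


def net_effect(beta_total, literacy_return, pi_lit=PI_LIT):
    """Protestantism effect net of an assumed return to literacy (IV is linear in Y)."""
    return beta_total - literacy_return * pi_lit


def reconstruct_table_vi():
    return {
        name: [net_effect(beta, f * chi) for f in FACTORS]
        for name, (beta, chi) in TABLE_V.items()
    }


def significance_threshold(beta_total, chi, se, z=1.96, pi_lit=PI_LIT):
    """Factor on chi below which the net effect exceeds z * se (se held fixed)."""
    return (beta_total - z * se) / (chi * pi_lit)


rows = reconstruct_table_vi()
for name, values in rows.items():
    for f, mine, published in zip(FACTORS, values, TABLE_VI[name]):
        print(f"{name:18s} factor {f:.1f}: {mine:+.4f} vs published {published:+.4f}")

# Table VII analogue: return to literacy = r * y, e.g. 11% per year times 5 years.
print(f"Table VII, r=11%, y=5: {net_effect(0.105, 0.11 * 5):+.4f}")
# Footnote 49 analogue, assuming a standard error of about 0.23 for the income tax column.
print(f"threshold factor: {significance_threshold(0.586, 2.460, se=0.23):.2f}")
```

The reconstructed cells differ from the published ones by at most about 0.004, which is consistent with rounding of the inputs. The income tax cell at 90% of the OLS return (about 0.168) is roughly 29% of the total effect of 0.586. The threshold factor of about 0.29 corresponds to a literacy effect 71% below the OLS estimate.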
For example, assuming that the causal effect of education is 90% of the OLS estimate (equivalent to the bias suggested by twin studies) yields a point estimate that is only 29% of the total effect of Protestantism reported in Table V and statistically no longer distinguishable from zero. Assuming instead that the OLS estimate is downward biased in the range of 20%–40% (equivalent to the bias suggested by IV studies), the point estimate of the effect of Protestantism independent of literacy is very close to zero. The estimates are relatively small and not significantly

48. More recent evidence confirms that there is little ability bias in OLS estimates of the rate of return to education, which is found to be around 10 percent in many developed countries (Leigh and Ryan 2008).
TABLE VI
EFFECT OF PROTESTANTISM ON ECONOMIC OUTCOMES AFTER ADJUSTING FOR LITERACY: BOUNDING ANALYSIS

Columns: (1) per capita income tax(a) (main outcome); (2) ln(teacher income)(a) and (3) share in manufacturing and services (alternative outcomes).

Assumed return to literacy               (1)             (2)             (3)
40% below OLS estimate         0.309 (0.226)   0.033 (0.047)   0.027 (0.036)
20% below OLS estimate         0.216 (0.224)   0.010 (0.047)   0.009 (0.036)
10% below OLS estimate         0.170 (0.223)  −0.002 (0.047) −0.0002 (0.036)
OLS estimate                   0.124 (0.223)  −0.014 (0.047)  −0.009 (0.036)
10% above OLS estimate         0.078 (0.222)  −0.026 (0.047)  −0.018 (0.036)
20% above OLS estimate         0.031 (0.222)  −0.038 (0.047)  −0.028 (0.036)
40% above OLS estimate        −0.061 (0.223)  −0.062 (0.048)  −0.046 (0.036)
Source: Data for Prussian counties from the 1871 Population Census, the 1877 Income Tax Statistics, the 1882 Occupation Census, and the 1886 Education Census; see main text and Appendix I for details. Further controls: % age below 10, % Jews, % females, % born in municipality, % of Prussian origin, average household size, ln(population size), population growth 1867–1871 (in %), % blind, % deaf-mute, % insane. Note. Each cell reports the result of a separate regression. Reported coefficients are the instrumental-variable estimates on % Protestants, where distance to Wittenberg is the instrument. Dependent variable is the economic outcome measure reported at the top of each column minus % literate times the return to literacy. The return to literacy stems from an OLS estimate on % literate in an auxiliary regression of the economic outcomes on % literate, % Protestants, and the control variables (as reported in columns (7)–(9) of Table V), multiplied by the adjustment factor reported in the first column (to provide bounds for the potentially biased OLS estimate). Standard errors in parentheses. a Coefficients multiplied by 100. Significance at ∗ 10, ∗∗ 5, ∗∗∗ 1 percent.
different from zero even when we move down to a return to literacy that is 40% below the estimated OLS return to literacy, which would imply an upward OLS bias that is actually inconsistent with the existing literature.49 Results are similar for our two alternative measures of economic outcome. In both cases, when a literacy effect 10% below its OLS estimate is assumed, the point estimate of β4 is close

49. In fact, we can calculate the threshold value of χ below which the coefficient on Protestantism becomes statistically significant (at the 5% level): only if the true effect of literacy were at least 71% below the OLS estimate would the effect of Protestantism independent of literacy become statistically significant, at a size of 0.452 (77% of the total Protestantism effect). Of course, if we go all the way to assuming that literacy has no economic effect, we are back to the full total Protestantism effect of Table V.
to zero. For example, the corresponding sectoral share specification implies that a 10-percentage-point increase in the share of Protestants in a county lowers the fraction of the workforce in manufacturing and services by 0.002 percentage points, a negligible effect.50 Although the point estimates suggest that most of the effect of Protestantism on economic outcomes may be attributable to higher literacy, the limited statistical power of some of the IV estimates of the independent Protestantism effect does not allow us to rule out substantial effects of Protestantism that come from sources other than literacy.51 In the case of the income tax measure, the upper bound (of 0.609) of the 95% confidence band around the Protestantism point estimate that assumes a literacy effect 10% below its OLS estimate lies just above the total Protestantism effect of Table V. However, when a literacy effect 40% above its OLS estimate is assumed, the upper bound of the 95% confidence band already allows us to reject the possibility that much more than half of the total Protestantism effect stems from sources other than literacy. In the case of the two alternative outcome measures, the upper bounds of the 95% confidence bands make it possible to rule out the Table V estimate, but not much more, when assuming a 10% upward bias of the OLS literacy coefficient. But when a 40% downward bias is assumed, they allow us to reject the possibility that even one-third of the total Protestantism effect stems from nonliteracy sources. An alternative way to perform the bounding analysis is to use direct estimates of the causal effect of education on earnings from other studies in the literature for χ in equation (4). This is feasible only for our direct earnings measure, based on teacher salaries. Although there is no evidence on returns to literacy in nineteenth-century Prussia, Mitch (1984), in the scenario closest to ours, calculates an internal rate of return to literacy in

50. Estimating the threshold value of statistical significance, the effect of Protestantism independent of literacy would become statistically significant only if the true effect of literacy were assumed to be at most 6% of its OLS estimate in the case of teacher income and 7% in the case of the nonagricultural share. Given that the nature of the possible endogeneity in the case of the sectoral-share measure may differ from that of the two income measures, the bounds based on the biases found for income measures in the literature may not directly apply to the sectoral-share measure.
51. Note that the OLS estimates of columns (7)–(9) of Table V are more precise and allow us to rule out magnitudes for the independent effect of Protestantism that are economically significant. There, the 95-percent confidence bands around the estimates allow us to rule out that, once literacy is controlled for, the Protestant lead in per capita income taxes is larger than 0.12 Marks (and larger than 1.65 percentage points in the nonagricultural share).
TABLE VII
EFFECT OF PROTESTANTISM ON INCOME AFTER ADJUSTING FOR LITERACY: BOUNDING ANALYSIS

Return to one year  Years to achieve literacy (y)
of schooling (r)              4 (1)           5 (2)           6 (3)           7 (4)
8%                     0.045 (0.048)   0.030 (0.047)   0.015 (0.047) 0.00006 (0.047)
9%                     0.037 (0.047)   0.021 (0.047)   0.004 (0.047)  −0.013 (0.047)
10%                    0.030 (0.047)   0.011 (0.047)  −0.007 (0.047)  −0.026 (0.047)
11%                    0.022 (0.047)   0.002 (0.047)  −0.019 (0.047)  −0.039 (0.047)
12%                    0.015 (0.047)  −0.007 (0.047)  −0.030 (0.047)  −0.052 (0.047)
13%                    0.008 (0.047)  −0.017 (0.047)  −0.041 (0.047)  −0.065 (0.048)
14%                  0.00006 (0.047)  −0.026 (0.047)  −0.052 (0.047)  −0.078 (0.048)
15%                   −0.007 (0.047)  −0.035 (0.047)  −0.063 (0.048) −0.091 (0.049)∗
Source. Data for Prussian counties from the 1871 Population Census and the 1886 Education Census; see main text and Appendix I for details. Further controls: % age below 10, % Jews, % females, % born in municipality, % of Prussian origin, average household size, ln(population size), population growth 1867–1871 (in %), % blind, % deaf-mute, % insane. Note. Each cell reports the result of a separate regression. Reported coefficients are the instrumental-variable estimates on % Protestants, where distance to Wittenberg is the instrument. Dependent variable: ln(teacher income) − r × y × % literate. Each cell refers to a different assumption on the return to literacy r × y. The average return to one year of schooling r varies across rows. The average number of years of schooling required to achieve literacy y varies across columns. Standard errors in parentheses. Coefficients multiplied by 100. Significance at ∗ 10, ∗∗ 5, ∗∗∗ 1 percent.
England of 59% in 1839–1843 and 49.5% in 1869–1873. This is, in fact, very close to our OLS estimate of χ3 of 63.6%, reported in column (8) of Table V. The IV estimate of β4 that results from estimating equation (4) with the return to literacy restricted to 55% (in the middle of Mitch’s estimates) is reported in the fourth row of column (2) of Table VII. The estimate is statistically insignificant and very close to zero: the point estimate of 0.0019 is less than 2% of the total effect of Protestantism on earnings reported in Table V. A return to literacy of 55% is also in line with the extensive literature on the causal return to a year of education, combined with reasonable assumptions on the years required to achieve literacy. As an estimate not too distant in time from our observation period, Goldin
and Katz (2000) estimate a rate of return to (high school or college) education of 11% in 1915 Iowa. The survey by Psacharopoulos and Patrinos (2004) suggests that returns to primary education may be substantially higher than those to subsequent levels of education. Their average estimate of the social return to primary education, drawn mainly from developing countries today, is as high as 18.9%. In his review, which is more concerned with causality but focuses mostly on (higher) education in developed countries today, Card (1999) places the average causal return to a year of education at slightly below 10% and interprets recent IV estimates as showing that returns may be higher for people with low education. Similar estimates have also been used to depict effects of education on countrywide income in extensions of the macroeconomic growth literature following Barro (1991) and others, where Hall and Jones (1999) assume a rate of return of 13.4% for the first four years of education.52 Because a larger return to literacy implies a smaller independent effect of Protestantism on earnings, the comparatively low estimate of 11% serves as a sensible, conservative parameter choice. To transform returns to years of education into returns to literacy, we need an estimate of how many years of schooling it takes to reach literacy. The Prussian census coded people as literate if they could read and write. The literacy question was asked only of people at least 10 years of age. Given that children tended to enroll in school at age five, this amounts to an implicit assumption by the Prussian census statisticians that it took at least five years to reach literacy. This is in line with the fact that progress toward the Millennium Development Goal of universal primary education is measured interchangeably by literacy rates (similar to our Prussian measure) and by completion of a primary school cycle of usually five or six years (cf. Filmer, Hasan, and Pritchett [2006]).
Combining this with a return of 11% per year again yields a return to literacy of at least 55% (5 years × 11%). Although we thus think that 55% is a sensible conservative estimate of the return to literacy, Table VII shows that our qualitative result is not sensitive to wide bounds of reasonable alternative assumptions about returns to schooling and about years to achieve literacy.
52. The assumption that education has the same effect on individual-level and group-level income (relevant in our county-level analysis) is supported by the studies of Acemoglu and Angrist (2000) and Ciccone and Peri (2006), who find no evidence for externalities of education (see also Lange and Topel [2006]). By contrast, Moretti (2004) finds evidence that social returns are significantly larger than private returns, which would make our parameter choice even more conservative.
QUARTERLY JOURNAL OF ECONOMICS
Another way of assessing the importance of education for the higher economic prosperity of Protestants, assuming that OLS estimates of the literacy effect are hardly biased, is to perform a descriptive accounting exercise. Recall from Table III that, on moving from an all-Catholic to an all-Protestant county, the average literacy rate increases by 18.9 percentage points, and from Table V that per capita income taxes increase by 0.59 Mark (and the share of the nonagricultural sector by 8.2 percentage points). The (statistically highly significant) OLS coefficients on literacy in regressions predicting economic outcomes (cf. Table V) imply a 0.46 Mark higher per capita income tax (and a 9.3-percentage-point larger nonagricultural share) for an 18.9-percentage-point increase in the literacy rate. Based on OLS estimates of the literacy effect, Protestants' higher literacy can thus account for roughly the whole gap in economic outcomes between the two denominations (about 78% of the income tax gap and more than the entire gap in the nonagricultural share). The point estimates of our analyses suggest that once income differences are adjusted for literacy differences, the remaining difference is no longer systematically related to Protestantism. If education had the same effect on economic outcomes here as it has been shown to have in other settings, our results suggest that the higher literacy of Protestant regions can account for at least some of their economic advantage over Catholic regions, and they are consistent with the hypothesis that literacy can account for most or even all of the advantage.
VI.D. A Three-Stage Model
Given the result that Protestantism does not have an impact on economic outcomes independent of its effect on literacy, we can estimate the following system of three equations:
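$$
\begin{aligned}
Y &= \alpha_4 + \beta_4\,LIT + X\gamma_4 + \varepsilon_4 \\
LIT &= \alpha_5 + \beta_5\,PROT + X\gamma_5 + \varepsilon_5 \\
PROT &= \alpha_6 + \beta_6\,WITT + X\gamma_6 + \varepsilon_6
\end{aligned}
\tag{5}
$$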
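System (5) chains three linear equations that share a single excluded instrument, distance to Wittenberg (WITT). The Python sketch below illustrates the "double-IV" logic on simulated data; it is not the authors' estimation code. Assumptions, all hypothetical: the coefficients, the 0 to 600 km distance range, and an unobserved confounder u that raises PROT, LIT, and Y together; the controls X are omitted and no standard errors are computed. Because each equation is exactly identified, 3SLS point estimates coincide with equation-by-equation IV estimates (a standard result), so each slope reduces to a ratio of covariances with WITT.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def chained_iv(y, lit, prot, witt):
    """Equation-by-equation estimates of system (5) without the controls X.

    Each equation is exactly identified by WITT, so every slope is a ratio of
    covariances with the instrument.
    """
    beta6 = cov(prot, witt) / cov(witt, witt)  # PROT on WITT (OLS first stage)
    beta5 = cov(lit, witt) / cov(prot, witt)   # LIT on PROT, instrumented by WITT
    beta4 = cov(y, witt) / cov(lit, witt)      # Y on LIT, instrumented by WITT
    return beta6, beta5, beta4


def simulate(n=20_000, seed=1871):
    """Hypothetical data: an unobserved factor u raises PROT, LIT and Y together."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.uniform(0, 600)                           # distance to Wittenberg (km)
        u = rng.gauss(0, 1)                               # unobserved confounder
        p = 90 - 0.1 * w + 10 * u + rng.gauss(0, 5)       # true beta6 = -0.1
        li = 70 + 0.2 * p + 5 * u + rng.gauss(0, 5)       # true beta5 = 0.2
        yi = 1 + 0.03 * li + 0.5 * u + rng.gauss(0, 0.5)  # true beta4 = 0.03
        rows.append((yi, li, p, w))
    y, lit, prot, witt = (list(col) for col in zip(*rows))
    return y, lit, prot, witt


if __name__ == "__main__":
    y, lit, prot, witt = simulate()
    b6, b5, b4 = chained_iv(y, lit, prot, witt)
    naive = cov(y, lit) / cov(lit, lit)  # plain OLS of Y on LIT, biased upward by u
    print(f"beta6={b6:.3f} beta5={b5:.3f} beta4={b4:.4f} naive OLS={naive:.4f}")
```

In this simulation the chained estimates recover the assumed slopes, while naive OLS of Y on LIT roughly doubles the true effect because u is left in the error term. Adding the controls X would amount to partialling them out of every variable, the instrument included, before forming the same ratios.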
In this system, the first stage predicts the share of Protestants in a county by its distance from Wittenberg (WITT). The part of the variation in Protestantism that is due to distance to Wittenberg is then used in the second stage to predict the literacy rate of the county, as in our IV model above. Finally, in the third stage, this variation in literacy is used to predict economic progressiveness. In effect, this system of three equations specifies a "double-IV" estimation, which can be estimated via three-stage least squares (3SLS). Such a model accentuates the three-stage character of our main argument and allows us to provide an estimate of the economic return to literacy in our setting. The 3SLS results, reported in Table VIII for our main outcome measure, support our previous findings. Distance to Wittenberg is negatively associated with Protestantism; the part of Protestantism that is due to distance to Wittenberg has a positive effect on literacy; and the part of literacy driven by this Wittenberg-induced variation in Protestantism has a positive effect on economic outcomes.
TABLE VIII
PROTESTANTISM, LITERACY, AND ECONOMIC OUTCOME: A 3SLS MODEL
                                  1st stage        2nd stage      3rd stage
Dependent variable                % Protestants    % literate     per capita income tax (a)
                                  (1)              (2)            (3)
Distance to Wittenberg in km      −0.097∗∗∗
                                  (0.011)
% Protestants                                      0.190∗∗∗
                                                   (0.028)
% literate                                                        3.242∗∗∗
                                                                  (1.169)
Obs.                              426              426            426
R²                                .442             .699           .374
Source. Data for Prussian counties from the 1871 Population Census and the 1877 Income Tax Statistics; see main text and Appendix I for details. Note. Standard errors in parentheses. Further controls: % age below 10, % Jews, % females, % born in municipality, % of Prussian origin, average household size, ln(population size), population growth 1867–1871 (in %), % missing education info, % blind, % deaf-mute, % insane. a Coefficients multiplied by 100. Significance at ∗ 10, ∗∗ 5, ∗∗∗ 1 percent.
VI.E. An Addendum: Protestantism, Education, and Individual Earnings in Contemporary Germany
As a sequel to the historical analysis, we briefly analyze the association between Protestantism, education, and economic outcomes in contemporary Germany. The German Socio-Economic
Panel (GSOEP) provides data on religious affiliation, years of education, and individual income for a representative sample of Germans in 1997. On a descriptive basis, Protestants have 5.4% higher income and 0.8 years more education than Catholics even today. These associations are confirmed in a standard regression framework (cf. the first four columns of Table IX).53 However, once we adjust income for the effects of education, the income difference between Protestants and Catholics vanishes. Similar to the bounding analyses above, we purge the income measure on the left-hand side of the effect of education by subtracting years of schooling times a rate of return taken from the literature. Columns (5) and (6) of Table IX use a rate of return to education of 9.4%, a causal estimate provided for Germany by Ichino and Winter-Ebmer (2004), whereas columns (7) and (8) use rates of return of 5.2% and 6.0%, respectively, equivalent to 90% of the estimate from an auxiliary OLS regression using our data (assuming a 10% upward OLS bias, in line with the Card [1999] review). Just as in the historical analysis, Protestants perform better economically even in contemporary Germany, but again, the whole gap can be accounted for by different levels of human capital.54 This observation goes largely unnoticed in present-day Germany because few datasets collect information on religious denomination, education, and income together. Still, the current Catholic education gap is not completely surprising, considering that family background plays an important role in human capital accumulation, which perpetuates the education gap over time. Even after more than a hundred years of a public school system that provides equal access to schooling independent of religious affiliation, Protestants are still better educated. The results suggest that Luther's educational postulations may have had very long-term repercussions. The precise nature of the contemporary associations is a matter for future research, however.
53. The sample share of Protestants (Catholics) is 29% (40%), average schooling 12.4 (11.6) years, and average gross monthly income 5,061 (4,802) Marks.
54. In a similar vein, the estimate on Protestantism in Table IX approaches zero as soon as years of schooling are added as a control variable to the model (see Becker and Woessmann [2007]). Given that migration waves after World War II, increased mobility, and voluntary Church secessions and conversions may undermine the validity of the historical spread of the Reformation as an instrument for Protestantism today, the contemporary analysis of the association between Protestantism and earnings does not necessarily draw on exogenous variation and thus remains purely descriptive. The three most recent GSOEP waves that collected data on religious affiliation are 1990, 1997, and 2003. We find the same pattern as reported for 1997 in 1990, but not in 2003. Whether this is due to data problems (e.g., a refreshment sample with relatively young households and thus more volatile incomes) or a true change in economic associations is left for future investigation.
TABLE IX
RELIGION, EDUCATION, AND EARNINGS IN CONTEMPORARY GERMANY
Dependent variable: columns (1)–(2) ln(gross monthly earnings); columns (3)–(4) years of schooling; columns (5)–(6) ln(gross monthly earnings) adjusted for a 9.4% return to schooling (a); columns (7)–(8) ln(gross monthly earnings) adjusted for a 5.2%/6.0% return to schooling (b).
                                  (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)
Protestant                       0.048∗∗    0.051∗∗∗   0.806∗∗∗   0.706∗∗∗  −0.028     −0.016      0.006      0.009
                                (0.020)    (0.018)    (0.128)    (0.118)    (0.019)    (0.016)    (0.018)    (0.016)
Other Christian denomination    −0.033     −0.004     −0.579∗∗   −0.306      0.022      0.025     −0.003      0.014
                                (0.034)    (0.032)    (0.242)    (0.225)    (0.035)    (0.031)    (0.034)    (0.031)
Non-Christian religious         −0.169∗∗∗  −0.171∗∗∗  −1.826∗∗∗  −1.818∗∗∗   0.002      0.0002    −0.075∗∗∗  −0.062∗∗
  affiliation                   (0.025)    (0.021)    (0.202)    (0.188)    (0.029)    (0.026)    (0.028)    (0.026)
No religious affiliation         0.154∗∗∗   0.156∗∗∗   0.805∗∗∗   0.892∗∗∗   0.078∗∗∗   0.072∗∗∗   0.112∗∗∗   0.103∗∗∗
                                (0.024)    (0.022)    (0.148)    (0.136)    (0.021)    (0.019)    (0.021)    (0.019)
Female                                     −0.268∗∗∗             −0.412∗∗∗             −0.229∗∗∗             −0.243∗∗∗
                                           (0.016)               (0.107)               (0.015)               (0.015)
Potential experience in years               0.021∗∗∗             −0.202∗∗∗              0.040∗∗∗              0.033∗∗∗
                                           (0.004)               (0.027)               (0.004)               (0.004)
Square of potential experience             −0.054∗∗∗              0.183∗∗∗             −0.072∗∗∗             −0.065∗∗∗
  in years                                 (0.009)               (0.061)               (0.008)               (0.008)
Firm tenure                                 0.012∗∗∗              0.018                 0.011∗∗∗              0.011∗∗∗
                                           (0.003)               (0.018)               (0.003)               (0.003)
Square of firm tenure                      −0.016∗                0.027                −0.018∗∗              −0.017∗∗
                                           (0.009)               (0.059)               (0.008)               (0.008)
Obs.                             2,566      2,566      2,566      2,566      2,566      2,566      2,566      2,566
R²                               .040       .183       .073       .210       .009       .223       .018       .207
Source. GSOEP 1997. Sample: full-time employed workers, ages 20–55. Note. Standard errors in parentheses. Left-out category of religious affiliation is "Catholic." a Dependent variable is ln(gross monthly earnings) − r × years of schooling, using a rate of return to schooling r of 9.4 percent, a causal estimate for Germany provided by Ichino and Winter-Ebmer (2004, Table 6). b Dependent variable is ln(gross monthly earnings) − r × years of schooling, using a rate of return to schooling r of 5.2 percent (6.0 percent in column (8)), equivalent to 90 percent of the OLS estimate on years of schooling from an auxiliary OLS regression (assuming a 10 percent upward bias in the OLS estimate). Significance at ∗ 10, ∗∗ 5, ∗∗∗ 1 percent.
VII. CONCLUSIONS
This paper advances an alternative to the Weber thesis: an explanation for the historically greater economic prosperity of Protestant regions based on a standard human capital argument. As an unintended side effect of Luther's exhortation that everyone be able to read the Gospel, Protestants acquired literacy skills that functioned as human capital in the economic sphere. This human capital theory of Protestant economic history is consistent with Luther's preaching, with the cross-country pattern in 1900, and with county-level evidence from late nineteenth-century Prussia. Using the roughly concentric dispersion of Protestantism around Luther's city of Wittenberg during the Reformation to obtain exogenous variation in Protestantism, we find that Protestantism led to substantially higher literacy across Prussian counties in the late nineteenth century. Our results are consistent with the hypothesis that this higher literacy in Protestant regions can account for the major part of their edge in economic progressiveness over Catholic regions.
So, was Weber wrong? Or, more precisely, is what has come to be known as the Weber thesis, as commonly interpreted, wrong? Given the complexity and multifaceted character of the thesis, there can of course be no simple answer to this question. Within the scope of this paper, there are at least three aspects to the question, with three different answers. First, is the Weber thesis wrong in the main descriptive pattern of its argument? In contrast to the conclusion of some existing cross-country research (Delacroix and Nielsen 2001), we show that Weber was right in his observation that Protestant regions were economically more affluent than Catholic regions, both across countries in 1900 and within Prussia in the second half of the nineteenth century.
Second, is the Weber thesis wrong with respect to the main channel through which this pattern arises? Our evidence suggests that in this aspect, his thesis as commonly interpreted is likely wrong. We find that the key channel appears to be the acquisition of literacy (a factor not generally associated with the Weber thesis), which seems able to account for a major part of the association between Protestantism and economic prosperity in late nineteenth-century Prussia. The results are in accord with an explanation in which the main channels advanced by the Weber thesis, namely the pure effects of work effort and thrift, do not have substantial effects.
Third, is the Weber thesis wrong with respect to the importance of ethical considerations in the association between religious denomination and economic success? This aspect is hard to answer (as always when dealing with topics of ethics), given the virtual impossibility of observing ethical considerations, particularly centuries after the fact. We cannot exclude the possibility that Protestants achieved higher literacy partly because of a different work ethic. In this sense, our human capital theory may be complementary to ethical explanations. However, our result that the spread of Protestantism, and with it the spread of literacy, can be traced back to events that occurred centuries before our period of observation and lay beyond the influence of individual citizens driven by differential ethics suggests that ethical channels may be limited. The relative importance of ethics in advancing literacy remains an important question for future research.
Our findings from nineteenth-century Prussia reveal that the Protestant Reformation had very long-lived economic consequences, spanning several centuries. Protestantism led to substantially higher literacy, which in turn led to economic progress. The link between cultural factors and economic development, although clearly present, may thus work quite differently from what is generally assumed, in ways going beyond the Weber thesis.
APPENDIX I: COUNTY-LEVEL DATA FOR PRUSSIA IN THE LATE NINETEENTH CENTURY
Three major censuses and several additional surveys in Prussia provide the data for our analysis: the 1871 Population Census, the 1882 Occupation Census, and the 1886 Education Census, as well as the 1877, 1892, and 1901 Income Tax Statistics, the 1879 Survey of Mayoral Incomes, and the 1892 and 1901 Social Security Statistics. By the second half of the nineteenth century, the Prussian Statistical Office collected huge amounts of demographic and socioeconomic data, and the quality of the statistical material is generally viewed as outstanding. Knodel (1974, p. 28) concludes that the quality of Prussian demographic data was very high by the 1860s. Similarly, Wojtun (1968) reports that population counts were virtually complete by 1864. Demographers have found county-level data for Prussia at the end of the nineteenth century to be a unique source of highest-quality data for analyses at a disaggregate level (cf. Galloway, Hammel, and Lee [1994]; Lee, Galloway, and Hammel [1994]). We have compiled the county-level data from the respective archives. In 1871, Prussia consisted of 452 counties,55 organized into 35 districts and 11 provinces. Table A.1 lists the names of the Prussian provinces and districts, together with a count of counties in each district.
A. 1871 Population Census
The 1871 Population Census took place on December 1, 1871. Questionnaires were to be filled out by household heads after personal instruction by an agent of the Prussian Statistical Office. The agent assisted in filling out the questionnaire, where requested, and made sure the information provided was correct. Questionnaires were available in the different languages spoken by the Prussian population. The Census surveyed standard demographic variables such as sex and age, but also religion and literacy. Religious affiliation was surveyed in four categories: Catholic, Protestant, other Christian denominations, and Jews. Literacy was surveyed for the first time ever in Prussia in 1871. It is measured as the ability of those aged 10 years or older to read and write. In the volume detailing the results of the Census, the Prussian Statistical Office attested to the unexpectedly high quality of the literacy question: the state of literacy is unknown for only slightly more than 1 percent of respondents (captured by our variable "% missing education info"). The Statistical Office expressed surprise that more than 10 percent of all males were illiterate, given the authorities' longstanding official educational objectives.
In contrast to the other data sources, the data from the 1871 Population Census allow a separate analysis of urban and rural areas in each county, where a population size of 2,000 was used to classify municipalities as urban or rural. Table A.2 reports descriptive statistics of our Population Census data separately for urban and rural municipalities in the counties. The source of the Population Census data is the Königliches Statistisches Bureau, Die Gemeinden und Gutsbezirke des Preussischen Staates und ihre Bevölkerung: Nach den Urmaterialien der allgemeinen Volkszählung vom 1. December 1871 (Berlin: Verlag des Königlichen Statistischen Bureaus, 1874).
55. We combined Communionharz, a tiny county of 690 inhabitants, with the neighboring county Zellerfeld, as the Occupation and Education Censuses do. After 1871, some bigger counties were split into two separate counties; we aggregated the post-1871 data up to the 452 counties existing in 1871.
TABLE A.1
PRUSSIAN PROVINCES AND DISTRICTS
Province (Provinz)      District (Regierungsbezirk)   Counties (Kreise)
Brandenburg             Frankfurt/Oder                18
                        Potsdam                       16
Hannover                Aurich                         3
                        Hannover                       7
                        Hildesheim                     7
                        Lüneburg                       7
                        Osnabrück                      5
                        Stade                          8
Hessen                  Kassel                        23
                        Wiesbaden                     12
Hohenzollern            Hohenzollern                   4
Pommern                 Köslin                        12
                        Stettin                       13
                        Stralsund                      4
Posen                   Bromberg                       9
                        Posen                         18
Preußen                 Danzig                         8
                        Gumbinnen                     16
                        Königsberg                    20
                        Marienwerder                  13
Rheinprovinz            Aachen                        11
                        Düsseldorf                    21
                        Koblenz                       13
                        Köln                          11
                        Trier                         13
Sachsen                 Erfurt                        10
                        Magdeburg                     15
                        Merseburg                     17
Schlesien               Breslau                       24
                        Liegnitz                      20
                        Oppeln                        19
Schleswig-Holstein      Schleswig                     20
Westphalen              Arnsberg                      14
                        Minden                        10
                        Münster                       11
TABLE A.2
RURAL-URBAN BREAKDOWN OF POPULATION CENSUS DATA
                                  Total                   Rural                   Urban
                                  (1)                     (2)                     (3)
% Protestants                     64.62 (38.09)           64.57 (39.32)           64.69 (35.33)
% literate                        87.33 (12.56)           86.27 (13.78)           91.00 (8.13)
% age below 10                    24.91 (2.18)            25.56 (2.25)            22.82 (2.48)
% Jews                             1.08 (1.18)             0.43 (0.63)             3.44 (4.18)
% females                         51.09 (1.42)            51.16 (1.44)            51.13 (2.32)
% born in municipality            59.65 (11.93)           61.03 (13.16)           55.34 (10.23)
% of Prussian origin              99.16 (1.85)            99.35 (1.46)            98.69 (2.31)
Average household size             4.79 (0.34)             4.89 (0.40)             4.51 (0.36)
Total population size             51,965.22 (19,124.29)   38,736.12 (14,255.86)   13,229.11 (10,659.00)
Popul. growth 1867–1871 (in %)     1.03 (4.08)             1.79 (25.98)            2.26 (6.16)
% missing education info           1.71 (1.11)             1.87 (1.30)             1.18 (0.98)
% blind                            0.09 (0.03)             0.09 (0.03)             0.11 (0.06)
% deaf-mute                        0.10 (0.05)             0.10 (0.05)             0.12 (0.16)
% insane                           0.23 (0.18)             0.22 (0.15)             0.27 (0.52)
Number of observations             452                     427                     437
Source. Data for Prussian counties from the 1871 Population Census. Column (1) displays county totals. Column (2) displays values for rural municipalities (≤2,000 inhabitants) in these counties. Column (3) is for urban municipalities (>2,000 inhabitants) in these counties. Note. All columns show means. Standard deviations in parentheses.
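The 2,000-inhabitant rule used to split each county into rural and urban parts can be sketched as follows. This is a minimal illustration under stated assumptions: the municipality records and county labels are hypothetical placeholders (the census volumes are not reproduced here), and the threshold follows the Table A.2 note (at most 2,000 inhabitants counts as rural, more than 2,000 as urban).

```python
from collections import defaultdict

URBAN_THRESHOLD = 2_000  # more than 2,000 inhabitants = urban (Table A.2 note)


def classify(population: int) -> str:
    """Rural if the municipality has at most 2,000 inhabitants, urban otherwise."""
    return "urban" if population > URBAN_THRESHOLD else "rural"


def county_breakdown(municipalities):
    """Aggregate (county, population) pairs into rural, urban and total populations."""
    out = defaultdict(lambda: {"rural": 0, "urban": 0, "total": 0})
    for county, population in municipalities:
        out[county][classify(population)] += population
        out[county]["total"] += population
    return dict(out)


if __name__ == "__main__":
    # Hypothetical municipalities; "A" and "B" are placeholder county names.
    sample = [("A", 1_500), ("A", 2_000), ("A", 8_000), ("B", 12_000)]
    print(county_breakdown(sample))
```

By construction, the rural and urban parts add up to the county total, as the Table A.2 means for total population do. A county with only urban municipalities, such as county "B" here, ends up with an empty rural part.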
B. 1877 Income Tax Statistics
Financial statistics of Prussian counties provide income tax data for the budget year 1877/1878 (April 1877 to March 1878). They contain information on the total amount of class tax (Klassensteuer) and classified income tax (classifizierte Einkommensteuer) collected in the county. The division into two types of income taxes, which existed until 1891, has historical reasons: the class tax was the successor of the poll tax of 1811, whereas the classified income tax was introduced later, in 1851 (cf. Hill [1892] for details). The class tax was collected on yearly incomes between a minimum taxable income of 420 Marks and a maximum of 3,000 Marks. There were twelve income classes, with tax payments ranging from 3 to 72 Marks. The implied rates on the minimum income of each class increased gradually from 5/7% in the lowest class to 2 2/3% in the highest class. The relevant 1873 tax law states that income was to be assessed on the basis of the estimated value of the annual income (cf. Engel [1875]). Even before this, instructions by the finance minister explicitly specified that incomes were meant to constitute the principal determining factor in the assessment of the class. Incomes above 3,000 Marks were subject to the classified income tax, which was assessed solely on the basis of income. Its rates were equivalent to 3% of the minimum income of each of a large number of increasing income brackets. To obtain the amount of income tax paid per capita, we divided total tax revenues by the total population of the county in 1877, available from the same source. The financial statistics are not available for the 26 counties that were city counties (where the county equaled one big city) in 1877, so that the total number of observations in these data is 426 counties.
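The class-tax rates and the per capita measure described above reduce to simple ratios, sketched here in Python with exact fractions so that rates like 5/7% stay exact. Assumptions: the function names are hypothetical; the 2,700-Mark minimum for the highest class is our inference from the 72-Mark payment and the stated 2 2/3% rate, not a figure given in the text; the county in the last example is invented.

```python
from fractions import Fraction


def implied_rate_pct(tax_marks: int, class_min_income_marks: int) -> Fraction:
    """Class-tax payment as an exact percentage of the class's minimum income."""
    return Fraction(tax_marks, class_min_income_marks) * 100


def tax_per_capita(total_tax_marks: float, population: int) -> float:
    """County income-tax revenue (class + classified tax) per inhabitant."""
    if population <= 0:
        raise ValueError("population must be positive")
    return total_tax_marks / population


if __name__ == "__main__":
    # Lowest class: 3 Marks on the 420-Mark minimum taxable income.
    print(implied_rate_pct(3, 420))        # 5/7
    # Highest class: 72 Marks; the 2,700-Mark class minimum is an inference.
    print(implied_rate_pct(72, 2_700))     # 8/3, i.e. 2 2/3 percent
    # Hypothetical county: 30,000 Marks collected from 50,000 inhabitants.
    print(tax_per_capita(30_000, 50_000))  # 0.6
```

Dividing by total population rather than by the number of taxpayers means the measure also reflects how many inhabitants fell below the 420-Mark minimum.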
The source of the 1877 Income Tax Statistics data is the Preussisches Statistisches Bureau, "Finanzstatistik der Kreise des preussischen Staates für das Jahr 1877/78," Zeitschrift des Preussischen Statistischen Landesamtes, Ergänzungshefte, 7 (1878), 113–174.
C. 1879 Survey of Mayoral Incomes
In 1879, the Ministry of the Interior mandated a special survey of the incomes of mayors and other paid members of city magistracies. The survey was restricted to all municipalities with more than 10,000 inhabitants. It thus covered 159 towns with a total of 5.2 million inhabitants, constituting 59% of the total urban population (defined by the Prussian Statistical Office as municipalities with more than 2,000 inhabitants) and more than 20% of the total Prussian population at the time. All 159 cities reported data on mayoral incomes, but not necessarily on incomes of other members of the magistracy. Obviously, mayoral incomes are not a perfect measure of incomes of the population at large, but, to the extent that their salaries were financed from local taxes, they are likely to reflect the local income level to an acceptable degree. In 121 cases, there is only one city with more than 10,000 inhabitants in a county. Only 17 counties host two or three cities with more than 10,000 inhabitants; for these counties, we take the average mayoral income of the cities. This leaves a total of 138 counties with mayoral income information. The source of the 1879 Survey of Mayoral Incomes data is Blenck, Emil, "Die Gehaltsverhältnisse der höheren Gemeindebeamten in den preussischen Stadtgemeinden mit mehr als 10000 Einwohnern," Zeitschrift des Preussischen Statistischen Bureaus, 20 (1880), 271–283.
D. 1882 Occupation Census
The 1882 Occupation Census collected information on employment and self-employment across two-digit sectors. We calculate the share of the total labor force, as well as the share of the male labor force, working in the manufacturing sector and in the service sector. We use the classification provided by the Prussian Statistical Office to classify the two sectors. The manufacturing sector (Sector B in the 1882 classification) includes mining, construction, and manufacture of metals, machinery, equipment, chemicals, textiles, paper, leather, food products, and wood. The service sector (Sector C in the 1882 classification) includes trade business, insurance, transport, lodging, and restaurants.
Note that the service sector C does not include servants and housemaids, nor does it include those working in public administration and the military. Our results are robust with respect to dropping or including certain subsectors in the analysis, for example, the mining industry, which in modern sector classifications would not be included in the manufacturing sector.
The source of the Occupation Census data is the Preussische Statistik, Vol. 76b, pp. 232–695 and Vol. 76c, p. 239.
E. 1886 Education Census
The 1886 Education Census collected information on both primary schools and secondary schools. From the Education Census, we derive the average annual income of full-time male elementary school teachers in a county. Given that teacher incomes were almost entirely financed from local contributions, they should provide a reasonable proxy for the average income of the county (cf. Schleunes [1989]). The Education Census also provides county-level information on the share of students who lived more than 3 kilometers from school. Although the information applies to students (rather than the adult population) in 1886 and does not include school-aged children who did not attend school, the measure may still provide a useful proxy for the supply of schools in the different counties in our analysis. The source of the Education Census data is Preussische Statistik, Vol. 101, pp. 2–391.
F. 1892 and 1901 Social Security Statistics
Starting in 1892, wages of day laborers were systematically collected after an amendment of the Health Insurance System, one of the main pillars of the Prussian social security system. The April 1892 version of the Health Insurance Law decreed that payments to the compulsory Health Insurance System be 1.5% of the customary wage paid to day laborers. The fact that the Prussian authorities used these measures of day-laborer wages as reference values to determine contributions shows that they were considered sufficiently representative of wages in low-income households, and thus a useful proxy for the local standard of living of this segment. Data were collected at the municipality level, separately for male and female workers and for those below and above 16 years of age. A ministerial directive explicitly required that only unskilled labor be considered, that annual averages be computed when wages varied seasonally, and that any in-kind benefits be added to the cash rate at local prices. The wage data are available at the county level, separately for urban and rural municipalities, for the years 1892 and 1901. Individual annual incomes can be computed by multiplying the daily wage by 300, as the official implementation regulation of the Health Insurance Law assumes 300 working days per year. The source of the Social Security Statistics data is Neuhaus, Georg, "Die ortsüblichen Tagelöhne gewöhnlicher Tagearbeiter in Preußen 1892 und 1901," Zeitschrift des Königlich Preussischen Statistischen Bureaus, 44 (1904), 310–346.
G. 1892 Income Tax Statistics
The 1892 Income Tax Statistics go beyond the 1877 Income Tax Statistics described above by detailing, for 60 urban counties, the number of taxpayers in every one of the 297 income tax brackets, from the lowest income bracket (900–1,050 Marks) to the highest bracket existing in that year (10,900,000–10,905,000 Marks). By multiplying the number of taxpayers in an income bracket by the average income in that bracket (approximated by the midpoint of the interval) and summing over all income brackets, we compute the total income of income-tax-paying households in a county. Several changes occurred in the tax code between 1877 and 1892. After an increase in the minimum taxable income from 420 to 900 Marks in the early 1880s, the income tax law of June 1891 brought further changes (cf. Finanzarchiv [1891]). The class tax was removed and subsumed into the classified income tax to form a new combined income tax. The former cutoff point of 3,000 Marks between the class tax and the classified income tax remained relevant only insofar as it marked the threshold above which an official tax declaration by the taxpayer became compulsory. The new tax schedule, with a finer classification of income brackets, started with a tax of 6 Marks in the lowest income tax bracket, corresponding to a tax rate of roughly 0.6%. Tax rates increased progressively to reach 3% for incomes of 10,500 Marks, remained flat at 3% for brackets up to 30,500 Marks, and then increased again up to 4% for top income brackets.
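The two income constructions described above (annualizing the customary day wage with 300 working days, and summing taxpayers times bracket midpoints) can be sketched as follows. Assumptions: the function names are hypothetical, only the 900–1,050-Mark bracket comes from the text, and the second bracket and all taxpayer counts are invented for illustration.

```python
from typing import Iterable, Tuple

WORKING_DAYS_PER_YEAR = 300  # as assumed by the Health Insurance Law regulation


def annual_income_from_daily_wage(daily_wage_marks: float) -> float:
    """Annualize a customary day-laborer wage using 300 working days."""
    return daily_wage_marks * WORKING_DAYS_PER_YEAR


def total_taxpayer_income(brackets: Iterable[Tuple[float, float, int]]) -> float:
    """Approximate total income of tax-paying households in a county.

    Each bracket is (lower bound, upper bound, number of taxpayers); every
    taxpayer in a bracket is assigned the bracket midpoint.
    """
    total = 0.0
    for lower, upper, n_taxpayers in brackets:
        if upper < lower or n_taxpayers < 0:
            raise ValueError(f"invalid bracket {lower}-{upper} with {n_taxpayers} taxpayers")
        total += n_taxpayers * (lower + upper) / 2
    return total


if __name__ == "__main__":
    # Hypothetical county: the lowest 1892 bracket (900-1,050 Marks) plus one
    # invented higher bracket; taxpayer counts are made up.
    county = [(900, 1_050, 1_000), (1_050, 1_200, 500)]
    print(total_taxpayer_income(county))       # 1537500.0
    print(annual_income_from_daily_wage(2.0))  # 600.0
```

The midpoint rule is exact only if incomes are spread symmetrically within each bracket. With the fine 1892 bracket grid, the approximation error is likely small relative to between-county differences.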
The source of the 1892 Income Tax Statistics data is the Statistik der preussischen Einkommensteuer-Veranlagung für das Jahr 1892/93, Mittheilungen aus der Verwaltung der direkten Steuern im preussischen Staate (1892), 212–281.
H. 1901 Income Tax Statistics
In a special Festschrift on the occasion of its centenary in 1905, the Prussian Statistical Office published a volume containing both the income tax receipts and the total number of income taxpayers for all Prussian counties, averaged over the tax years 1899 through 1903. The total volume of tax receipts is available but, unlike in 1892, the distribution of tax payments across income brackets is not. The data on the size of the taxpayer population allow us to infer the size of the nontaxpayer population. The source of the 1901 Income Tax Statistics data is the Königlich Preussisches Statistisches Bureau, Festschrift des Königlich Preussischen Statistischen Bureaus zur Jahrhundertfeier seines Bestehens (Berlin: Verlag des Königlich Preussischen Statistischen Bureaus, 1905).
APPENDIX II: ECONOMIC AND EDUCATIONAL DATA BEFORE 1517
The data sources for the proxies of economic and educational development in Luther's time are as follows. Imperial cities in 1517: Oestreich and Holzer (1973) contains a list of the free imperial cities (Reichsstädte). We derive the sample of cities that preserved their status as free imperial cities in 1517. Hanseatic cities in 1517: Hammel-Kiesow (2000) provides a map of the Hanseatic cities with the dates of their last participation in the Hanseatic Diet. We derive the sample of cities that had participated in Hanseatic Diets at least until 1535. Urban population in 1500: Bairoch, Batou, and Chèvre (1988) provide data on the population in 1500 of European cities that had at least 5,000 inhabitants at some point between 800 and 1800. Universities in 1517: Eulenburg (1994) documents all German universities with their year of foundation, from which we derive the sample of universities founded before 1517. Monasteries in 1517: Grote (1881) provides an encyclopedia of monasteries, cloisters, preceptories, and convents in the German Empire, detailing their locations, years of foundation, and (if applicable) years of abandonment. Of an envisaged two-volume work, only the first volume was published, covering locations beginning with the letters A to L. The volume includes male, female, and mixed monasteries. We derive a list of monasteries in existence in 1517, leaving out all monasteries established after 1517 or abandoned before 1517, also drawing on additional information added by P. Adalrich Arnold in 1939. Schools in 1517: The German version of the Wikipedia encyclopedia website provides a "List of the Oldest Schools in the German-Speaking Area" (Liste der ältesten Schulen im deutschen Sprachraum) at http://de.wikipedia.org/wiki/Liste_der_%C3%A4ltesten_Schulen_im_deutschen_Sprachraum (accessed September 3, 2007). From this list, which is likely nonexhaustive and covers only German-speaking territory, we draw the sample of schools founded before 1517.
APPENDIX III: CROSS-COUNTRY DATA IN 1900
We restrict our cross-country analyses to the sample of countries in which Protestant and Catholic Christians together accounted for the majority of the population. The data sources are as follows. GDP per Capita in 1900: Maddison (2006) provides data on per capita GDP in 1900 for a total of 29 countries with a majority of Protestant and Catholic Christians. GDP is measured in 1990 international Geary–Khamis dollars. Religious Population Shares: Barrett, Kurian, and Johnson (2001) provide data on the shares of the population adhering to each of 11 religious groups in 1900 for the 29 countries with available GDP data in which Protestants and Catholics together formed a majority. In addition to Protestant and Catholic, the remaining groups are Orthodox Christian, other Christian, Jewish, Muslim, Hindu, Buddhist, other Eastern religions, other religions, and nonreligious. Literacy Rates: UNESCO (1953) compiles data from national population censuses on the share of persons older than 10 or 15 years who could read in 1900 (or a close year). Among the 29 countries in the above sample, 11 have literacy data for 1900 (or a directly adjacent year) in the UNESCO compilation, and an additional three countries have somewhat later data (Chile in 1907, Argentina in 1914, and Colombia in 1917). Flora (1983) has 1900 literacy data for an additional four countries in our sample: the figure for Austria is based on censuses, those for the Netherlands and Sweden on military records of recruits, and that for the United Kingdom on marriage registers recording the share of newly married bridegrooms and brides who could sign their marriage certificates. Cipolla (1969) provides 1900 literacy data from military records of recruits for another two countries in the sample (Switzerland and Germany).
We follow Tabellini (2005) in combining literacy data from different sources in a cross-country comparison but caution that there are severe limits to cross-country comparability due to differences in literacy definitions and samples (cf. UNESCO [1953]).
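Adding up the literacy sources listed above shows how much of the 29-country sample has a literacy figure at all. A small tally, assuming (as the words "additional" and "another" suggest) that the four groups of countries listed above do not overlap; the dictionary labels are our own shorthand:

```python
# Countries (out of the 29-country sample) with literacy data, by source,
# as counted in the text of Appendix III.
SAMPLE_SIZE = 29
literacy_sources = {
    "UNESCO (1953), 1900 or adjacent year": 11,
    "UNESCO (1953), later year (Chile 1907, Argentina 1914, Colombia 1917)": 3,
    "Flora (1983): Austria, Netherlands, Sweden, United Kingdom": 4,
    "Cipolla (1969): Switzerland, Germany": 2,
}

# Assumes no country is counted under more than one source.
covered = sum(literacy_sources.values())
missing = SAMPLE_SIZE - covered
print(f"{covered} of {SAMPLE_SIZE} countries have a literacy figure; {missing} do not")
```

Under that assumption, 20 of the 29 countries have a literacy figure from one of the sources.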
UNIVERSITY OF STIRLING, IFO INSTITUTE, CESIFO, AND INSTITUT ZUR ZUKUNFT DER ARBEIT
UNIVERSITY OF MUNICH, IFO INSTITUTE, CESIFO, AND INSTITUT ZUR ZUKUNFT DER ARBEIT
REFERENCES
Acemoglu, Daron, and Joshua D. Angrist, “How Large Are the Social Returns to Education? Evidence from Compulsory Schooling Laws,” in NBER Macroeconomics Annual 2000, Ben S. Bernanke and Kenneth Rogoff, eds. (Cambridge, MA: MIT Press, 2000).
Acemoglu, Daron, Simon Johnson, and James A. Robinson, “The Colonial Origins of Comparative Development: An Empirical Investigation,” American Economic Review, 91 (2001), 1369–1401.
——, “Reversal of Fortune: Geography and Institutions in the Making of the Modern World Income Distribution,” Quarterly Journal of Economics, 117 (2002), 1231–1294.
——, “Institutions as a Fundamental Cause of Long-Run Growth,” in Handbook of Economic Growth, Volume 1A, Philippe Aghion and Steven N. Durlauf, eds. (Amsterdam, the Netherlands: North-Holland, 2005).
Bairoch, Paul, Cities and Economic Development: From the Dawn of History to the Present (Chicago, IL: University of Chicago Press, 1988).
Bairoch, Paul, Jean Batou, and Pierre Chèvre, The Population of European Cities from 800 to 1850: Data Bank and Short Summary of Results (Genève: Librairie Droz/Centre of International Economic History, 1988).
Balassa, Bela, “The Purchasing-Power Parity Doctrine: A Reappraisal,” Journal of Political Economy, 72 (1964), 584–596.
Barrett, David B., George T. Kurian, and Todd M. Johnson, World Christian Encyclopedia, 2nd ed. (Oxford, UK: Oxford University Press, 2001).
Barro, Robert J., “Economic Growth in a Cross Section of Countries,” Quarterly Journal of Economics, 106 (1991), 407–443.
Barro, Robert J., and Rachel M. McCleary, “Religion and Economic Growth across Countries,” American Sociological Review, 68 (2003), 760–781.
——, “Which Countries Have State Religions?” Quarterly Journal of Economics, 120 (2005), 1331–1370.
Becker, George, “Replication and Reanalysis of Offenbacher’s School Enrollment Study: Implications for the Weber and Merton Theses,” Journal for the Scientific Study of Religion, 36 (1997), 483–495.
Becker, Sascha O., and Ludger Woessmann, “Was Weber Wrong? A Human Capital Theory of Protestant Economic History,” CESifo Working Paper No. 1987, 2007.
Blenck, Emil, “Die Gehaltsverhältnisse der höheren Gemeindebeamten in den preussischen Stadtgemeinden mit mehr als 10000 Einwohnern,” Zeitschrift des Preussischen Statistischen Bureaus, 20 (1880), 271–283.
Blum, Ulrich, and Leonard Dudley, “Religion and Economic Growth: Was Weber Right?” Journal of Evolutionary Economics, 11 (2001), 207–230.
Botticini, Maristella, and Zvi Eckstein, “Jewish Occupational Selection: Education, Restrictions, or Minorities?” Journal of Economic History, 65 (2005), 922–948.
——, “From Farmers to Merchants, Conversions and Diaspora: Human Capital and Jewish History,” Journal of the European Economic Association, 5 (2007), 885–926.
Bunkowske, Eugene W., “Was Luther a Missionary?” Concordia Theological Quarterly, 49 (1985), 161–179.
Card, David, “The Causal Effect of Education on Earnings,” in Handbook of Labor Economics, Volume 3A, Orley Ashenfelter and David Card, eds. (Amsterdam, the Netherlands: North-Holland, 1999).
Cavalcanti, Tiago V., Stephen L. Parente, and Rui Zhao, “Religion in Macroeconomics: A Quantitative Analysis of Weber’s Thesis,” Economic Theory, 32 (2007), 105–123.
Ciccone, Antonio, and Giovanni Peri, “Identifying Human Capital Externalities: Theory with Applications,” Review of Economic Studies, 73 (2006), 381–412.
Cipolla, Carlo M., Literacy and Development in the West (Harmondsworth, UK: Penguin, 1969).
Delacroix, Jacques, and Francois Nielsen, “The Beloved Myth: Protestantism and the Rise of Industrial Capitalism in Nineteenth-Century Europe,” Social Forces, 80 (2001), 509–553.
Dixon, C. Scott, The Reformation in Germany (Oxford, UK: Blackwell, 2002).
Doepke, Matthias, and Fabrizio Zilibotti, “Occupational Choice and the Spirit of Capitalism,” Quarterly Journal of Economics, 123 (2008), 747–793.
Easterlin, Richard A., “Why Isn’t the Whole World Developed?” Journal of Economic History, 41 (1981), 1–19.
Ekelund, Robert B. Jr., Robert F. Hébert, and Robert D. Tollison, “An Economic Analysis of the Protestant Reformation,” Journal of Political Economy, 110 (2002), 646–671.
Engel, Ernst, “Die Klassen- und klassifizierte Einkommensteuer und die Einkommensverteilung im preussischen Staat in den Jahren 1852 bis 1875,” Zeitschrift des Preussischen Statistischen Bureaus, 15 (1875), 105–148.
Engelsing, Rolf, Analphabetentum und Lektüre: Zur Sozialgeschichte des Lesens in Deutschland zwischen feudaler und industrieller Gesellschaft (Stuttgart: Metzler, 1973).
Erlinghagen, Karl, Katholisches Bildungsdefizit in Deutschland (Freiburg, Germany: Herder, 1965).
Eulenburg, Franz, Die Frequenz der deutschen Universitäten von ihrer Gründung bis zur Gegenwart (Berlin, Germany: Akademie Verlag, [1904] 1994).
Filmer, Deon, Amer Hasan, and Lant Pritchett, “A Millennium Learning Goal: Measuring Real Progress in Education,” Center for Global Development Working Paper No. 97, 2006.
Finanzarchiv, “Preussisches Einkommensteuergesetz vom 24. Juni 1891 nebst Anweisung des Finanzministers vom 5. August 1891 zur Ausführung desselben,” Finanzarchiv, 8 (1891), 331–451.
Flitner, Wilhelm, Die vier Quellen des Volksschulgedankens, 3rd ed. (Stuttgart, Germany: Ernst Klett, [1941] 1954).
Flora, Peter, State, Economy, and Society in Western Europe 1815–1975: A Data Handbook in Two Volumes, Volume I: The Growth of Mass Democracies and Welfare States (London, UK: Macmillan, 1983).
Frank, Karl Suso, “Lesen, Schreiben und Bücher im frühen Mönchtum,” in Schriftlichkeit im frühen Mittelalter, Ursula Schaefer, ed. (Tübingen, Germany: Gunter Narr, 1993).
Galloway, Patrick R., Eugene A. Hammel, and Ronald D. Lee, “Fertility Decline in Prussia, 1875–1910: A Pooled Cross-Section Time Series Analysis,” Population Studies, 48 (1994), 135–158.
Galor, Oded, “From Stagnation to Growth: Unified Growth Theory,” in Handbook of Economic Growth, Volume 1A, Philippe Aghion and Steven N. Durlauf, eds. (Amsterdam, the Netherlands: North-Holland, 2005).
Glaeser, Edward L., and Spencer Glendon, “Incentives, Predestination and Free Will,” Economic Inquiry, 36 (1998), 429–443.
Glaeser, Edward L., Rafael La Porta, Florencio Lopez-de-Silanes, and Andrei Shleifer, “Do Institutions Cause Growth?” Journal of Economic Growth, 9 (2004), 271–303.
Glaeser, Edward L., and Bruce I. Sacerdote, “Education and Religion,” Journal of Human Capital, 2 (2008), 188–215.
Go, Sun, and Peter H. Lindert, “The Curious Dawn of American Public Schools,” NBER Working Paper No. 13335, 2007.
Goldin, Claudia, “The Human-Capital Century and American Leadership: Virtues of the Past,” Journal of Economic History, 61 (2001), 263–292.
Goldin, Claudia, and Lawrence F. Katz, “Education and Income in the Early Twentieth Century: Evidence from the Prairies,” Journal of Economic History, 60 (2000), 782–818.
——, “Why the United States Led in Education: Lessons from Secondary School Expansion, 1910 to 1940,” NBER Working Paper No. 6144, revised version, 2002; forthcoming in Human Capital and Institutions: A Long-Run View, David Eltis, Frank D. Lewis, and Kenneth L. Sokoloff, eds. (New York, NY: Cambridge University Press).
Green, Howell, “The Education of Women in the Reformation,” History of Education Quarterly, 19 (1979), 93–116.
Grote, Otto Freiherr, Lexicon Deutscher Stifter, Klöster und Ordenshäuser—Erste Abtheilung: Das Heutige Deutsche Reich (Osterwieck a. Harz: CommisionVerlag von A. W. Zickfeldt, 1881).
Guiso, Luigi, Paola Sapienza, and Luigi Zingales, “Does Culture Affect Economic Outcomes?” Journal of Economic Perspectives, 20 (2006), 23–48.
Hall, Robert E., and Charles I. Jones, “Why Do Some Countries Produce So Much More Output per Worker than Others?” Quarterly Journal of Economics, 114 (1999), 83–116.
Hammel-Kiesow, Rolf, Die Hanse (Munich, Germany: C.H. Beck, 2000).
Hanushek, Eric A., and Ludger Woessmann, “The Role of Cognitive Skills in Economic Development,” Journal of Economic Literature, 46 (2008), 607–668.
Herder-Korrespondenz, “Meldungen aus der katholischen Welt: Eine Konfessionsstatistik der westdeutschen Studentenschaft,” Herder-Korrespondenz (1954), 99–101.
Hesse, A., “Analphabeten,” in Wörterbuch der Volkswirtschaft in zwei Bänden, Vol. 1, 3rd ed., Ludwig Elster, ed. (Jena, Germany: Gustav Fischer, 1911).
Hill, Joseph A., “The Prussian Income Tax,” Quarterly Journal of Economics, 6 (1892), 207–226.
Hilse, Carl, “Beiträge zur Kenntnis der Bewegung der Bevölkerung innerhalb der evangelischen und der römisch-katholischen Landeskirche des preussischen Staats in den Jahren 1859 bis 1867,” Zeitschrift des Königlich Preussischen Statistischen Bureaus, 9 (1869), 305–318.
Holborn, Louise W., “Printing and the Growth of a Protestant Movement in Germany from 1517 to 1524,” Church History, 11 (1942), 123–137.
Iannaccone, Laurence R., “Introduction to the Economics of Religion,” Journal of Economic Literature, 36 (1998), 1465–1496.
Ichino, Andrea, and Rudolf Winter-Ebmer, “The Long-Run Educational Cost of World War II,” Journal of Labor Economics, 22 (2004), 57–86.
Knodel, John E., The Decline in Fertility in Germany, 1871–1939 (Princeton, NJ: Princeton University Press, 1974).
Krarup, Martin, Ordination in Wittenberg: Die Einsetzung in das kirchliche Amt in Kursachsen zur Zeit der Reformation (Tübingen, Germany: Mohr Siebeck, 2007).
Landes, David S., The Unbound Prometheus: Technological Change and Industrial Development in Western Europe from 1750 to the Present (London, UK: Cambridge University Press, 1969).
Lange, Fabian, and Robert Topel, “The Social Value of Education and Human Capital,” in Handbook of the Economics of Education, Volume 1, Eric A. Hanushek and Finis Welch, eds. (Amsterdam, the Netherlands: North-Holland, 2006).
Lee, Ronald D., Patrick R. Galloway, and Eugene A. Hammel, “Fertility Decline in Prussia: Estimating Influences on Supply, Demand, and Degree of Control,” Demography, 31 (1994), 347–373.
Leigh, Andrew, and Chris Ryan, “Estimating Returns to Education Using Different Natural Experiment Techniques,” Economics of Education Review, 27 (2008), 149–160.
Lindert, Peter H., “Voice and Growth: Was Churchill Right?” Journal of Economic History, 63 (2003), 315–350.
Luther, Martin, “An den christlichen Adel deutscher Nation von des christlichen Standes Besserung (To the Christian Nobility of the German Nation Concerning the Reform of the Christian Estate),” in Dr. Martin Luthers Werke: Kritische Gesamtausgabe, Vol. 6 (Weimar, Germany: Verlag Hermann Böhlaus Nachfolger [1520] 1888).
——, “An die Ratsherren aller Städte deutschen Landes, dass sie christliche Schulen aufrichten und halten sollen (To the Councilmen of All Cities in Germany That They Establish and Maintain Christian Schools),” in Dr. Martin Luthers Werke: Kritische Gesamtausgabe, Vol. 15 (Weimar, Germany: Verlag Hermann Böhlaus Nachfolger [1524] 1899).
——, “Vorlesung über Jesaja (Lecture on Isaiah),” in Dr. Martin Luthers Werke: Kritische Gesamtausgabe, Vol. 25 (Weimar, Germany: Verlag Hermann Böhlaus Nachfolger [1528] 1902).
——, “Sermon am Auffahrtstage (Ascension Day Sermon),” in Dr. Martin Luthers Werke: Kritische Gesamtausgabe, Vol. 10, Part 3 (Weimar, Germany: Verlag Hermann Böhlaus Nachfolger [1522] 1905).
——, “Eine Predigt, daß man Kinder zur Schule halten solle (A Sermon on Keeping Children in School),” in Dr. Martin Luthers Werke: Kritische Gesamtausgabe, Vol. 30, Part 2 (Weimar, Germany: Verlag Hermann Böhlaus Nachfolger [1530] 1909).
Maddison, Angus, The World Economy, Vol. 1: A Millennial Perspective, Vol. 2: Historical Statistics (Paris, France: Organisation for Economic Co-operation and Development, 2006).
Markussen, Ingrid, “The Development of Writing Ability in the Nordic Countries in the Eighteenth and Nineteenth Centuries,” Scandinavian Journal of History, 15 (1990), 37–63.
Marry, Winifred, “The Mediaeval Scribe,” Classical Journal, 48 (1953), 207–214.
Menschenfreund, Christian Friedrich, Untersuchung der Frage: Warum ist der Wohlstand der protestantischen Länder so gar viel größer als der catholischen? (Salzburg, Austria: Freisingen, 1772).
Merton, Robert K., “Puritanism, Pietism, and Science,” Sociological Review, 28 (1936), 1–30.
Migne, Jacques Paul, Patrologiae, Cursus Completus, Vol. 205 (Paris, France: Imprimerie Catholique de Migne, 1855).
Mitch, David, “Underinvestment in Literacy? The Potential Contribution of Government Involvement in Elementary Education to Economic Growth in Nineteenth-Century England,” Journal of Economic History, 44 (1984), 557–566.
Moretti, Enrico, “Workers’ Education, Spillovers, and Productivity: Evidence from Plant-Level Production Functions,” American Economic Review, 94 (2004), 656–690.
Mützell, Alexander A., Neues Topographisch-statistisch-geographisches Wörterbuch des Preussischen Staats (Halle: Karl August Kümmel, 1825).
Oestreich, Gerhard, and E. Holzer, “Übersicht über die Reichsstände,” in Handbuch der Deutschen Geschichte, 9th ed., Vol. 2, Bruno Gebhardt, ed. (Stuttgart, Germany: Ernst Ketler Verlag, 1973).
Offenbacher, Martin, Konfession und soziale Schichtung: Eine Studie über die wirtschaftliche Lage der Katholiken und Protestanten in Baden (Tübingen, Germany: Mohr, 1900).
Pelikan, Jaroslav, Whose Bible Is It? A History of the Scriptures (New York, NY: Penguin, 2005).
Peters, Paul, “Luthers weltweiter Missionssinn,” Lutherischer Rundblick, 17 (1969), 162–175.
Psacharopoulos, George, and Harry A. Patrinos, “Returns to Investment in Education: A Further Update,” Education Economics, 12 (2004), 111–134.
Reble, Albert, Geschichte der Pädagogik, 20th ed. (Stuttgart, Germany: Klett-Cotta, 2002).
Robinson, W.S., “Ecological Correlations and the Behavior of Individuals,” American Sociological Review, 15 (1950), 351–357.
Rupp, Horst F., “Philipp Melanchthon (1497–1560),” Prospects: The Quarterly Review of Comparative Education, 26 (1996), 611–621.
——, “Der Bildungsbegriff: Konsequenzen für die enzyklopädische Frage der Theologie,” in Religionspädagogik und Theologie: Enzyklopädische Aspekte, pp. 167–183, Werner H. Ritter and Martin Rothgangel, eds. (Stuttgart, Germany: Kohlhammer, 1998).
Samuelson, Paul A., “Theoretical Notes on Trade Problems,” Review of Economics and Statistics, 46 (1964), 145–154.
Samuelsson, Kurt, Religion and Economic Action: The Protestant Ethic, the Rise of Capitalism, and the Abuses of Scholarship (Toronto: University of Toronto Press, [1957] 1993).
Sandberg, Lars G., “The Case of the Impoverished Sophisticate: Human Capital and Swedish Economic Growth before World War I,” Journal of Economic History, 39 (1979), 225–241.
Schleunes, K.A., Schooling and Society: The Politics of Education in Prussia and Bavaria 1750–1900 (London, UK: St. Martin’s Press, 1989).
Scribner, Robert W., For the Sake of the Simple Folk: Popular Propaganda for the German Reformation, new ed. (Oxford, UK: Oxford University Press, 1994).
Spranger, Eduard, Zur Geschichte der deutschen Volksschule (Heidelberg, Germany: Quelle & Meyer, 1949).
Tabellini, Guido, “Culture and Institutions: Economic Development in the Regions of Europe,” CESifo Working Paper No. 1492, 2005.
Tawney, Richard H., Religion and the Rise of Capitalism (New York, NY: Harper and Row, 1926).
UNESCO, Progress of Literacy in Various Countries: A Preliminary Statistical Study of Available Census Data since 1900 (Paris, France: United Nations Educational, Scientific and Cultural Organization (UNESCO), 1953).
Weber, Max, “Die protestantische Ethik und der ‘Geist’ des Kapitalismus,” Archiv für Sozialwissenschaft und Sozialpolitik, 20 (1904/1905), 1–54, and 21, 1–110; reprinted in Gesammelte Aufsätze zur Religionssoziologie, pp. 17–206, 1920; English translation as The Protestant Ethic and the Spirit of Capitalism, translated by Talcott Parsons (London, UK: Routledge Classics, [1930] 2001).
Wojtun, B.S., “Demographic Transition in West Poland, 1816–1914,” Ph.D. dissertation, Department of Economics, University of Pennsylvania, Philadelphia, 1968.
Woodberry, Robert D., “The Shadow of Empire: Christian Missions, Colonial Policy, and Democracy in Postcolonial Societies,” Ph.D. dissertation, Department of Sociology, University of North Carolina, Chapel Hill, NC, 2004.
DOES MEDICARE SAVE LIVES?∗
DAVID CARD, CARLOS DOBKIN, AND NICOLE MAESTAS
Health insurance characteristics shift at age 65 as most people become eligible for Medicare. We measure the impacts of these changes on patients who are admitted to hospitals through emergency departments for conditions with similar admission rates on weekdays and weekends. The age profiles of admissions and comorbidities for these patients are smooth at age 65, suggesting that the severity of illness is similar on either side of the Medicare threshold. In contrast, the number of procedures performed in hospitals and total list charges exhibit small but statistically significant discontinuities, implying that patients over 65 receive more services. We estimate a nearly 1-percentage-point drop in 7-day mortality for patients at age 65, equivalent to a 20% reduction in deaths for this severely ill patient group. The mortality gap persists for at least 9 months after admission.
∗ We are extremely grateful to the editor and four referees for comments and suggestions on an earlier draft. We also thank the California Department of Health Services for providing the data used in the paper. This research was supported by the National Institute on Aging through Grant 1 R01 AG026290-01A1. The contents of this paper are solely the responsibility of the authors and do not necessarily represent the official views of the National Institute on Aging. © 2009 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, May 2009
I. INTRODUCTION
Medicare pays nearly one-fifth of total health care costs in the United States. Yet evidence on the health effects of the program is limited. Studies of aggregate death rates before and after the introduction of Medicare show little indication of a program impact (Finkelstein and McKnight 2005). The age profiles of mortality and self-reported health in the population as a whole are likewise remarkably smooth around the eligibility threshold at age 65 (Card, Dobkin, and Maestas 2004; Dow 2004). Although existing research has shown that the utilization of health care services increases once people become eligible for Medicare (e.g., Decker and Rapaport [2002], McWilliams et al. [2003, 2007], Card, Dobkin, and Maestas [2004]), the health impact of these additional services remains uncertain. This paper presents new evidence on the health effects of Medicare, based on differences in mortality for severely ill people who are admitted to California hospitals just before and just after their 65th birthdays. Specifically, we focus on unplanned admissions through the emergency department (ED) for “nondeferrable” conditions—those with very similar weekend and weekday admission rates. These admissions include patients with diagnoses such as obstructive chronic bronchitis, acute myocardial infarction (AMI), and stroke, and represent 12% of hospital admissions for people between the ages of 60 and 70 and 25% of all deaths between these ages. We argue that the decision to present at an ED for these conditions is unlikely to depend on insurance status. Consistent with this assertion, the arrival rate is essentially identical for patients just under and just over age 65. In contrast, admission rates for all causes jump 7% once people reach 65, and total ED admissions rise by 3%.
Focusing on nondeferrable admissions, we turn to an analysis of the age profiles of patient characteristics and outcomes, testing for discontinuities at age 65. The demographic composition and comorbidities of the sample trend smoothly through the age-65 barrier, as would be expected under the assumption of no differential sample selection before and after Medicare eligibility. On the other hand, the fraction of patients with Medicare as their primary insurer rises by about 50 percentage points, whereas the fraction with no insurance drops by 8 percentage points. Associated with these changes in insurance, we find a small but statistically significant increase in the number of procedures performed in the hospital and a 3% rise in total list charges.
Using death records matched to our sample of hospital admissions, we find a clearly discernible drop in mortality once people become eligible for Medicare. Relative to people who are just under 65 when admitted, those who are just over 65 have about a 1-percentage-point lower likelihood of death within a week of admission, or roughly a 20% reduction in 7-day mortality. A similar absolute reduction in mortality is registered at longer horizons and persists for at least 9 months, suggesting that the differential treatment afforded to those with Medicare coverage has an important impact on patient survival.
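Read together, the two headline figures (a drop of nearly 1 percentage point in 7-day mortality, described as roughly a 20% reduction) pin down the implied 7-day mortality rate for patients just under 65. A back-of-the-envelope check using the rounded numbers, so the result is an approximation and not an estimate reported in the paper; $\Delta$ (the drop at 65) and $p_0$ (7-day mortality just below 65) are symbols introduced here for illustration:

```latex
\Delta \approx 0.01, \qquad \frac{\Delta}{p_0} \approx 0.20
\quad\Longrightarrow\quad
p_0 \approx \frac{0.01}{0.20} = 0.05
```

That is, the baseline 7-day mortality rate in this severely ill group is on the order of 5%.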
We conclude by discussing potential channels for the Medicare effect. One mechanism is an increase in services for the relatively small fraction ( 6.62 and 2.54 < t < 6.62) show clear evidence of a jump at age 65, whereas the age profile for diagnoses in our preferred group, with |t| < 0.965, shows no visible evidence of an increase in admissions. Formal testing results are summarized in Table II. Each pair of columns in this table presents the estimated discontinuities at age 65 from two alternative RD models for the log of the number of admissions by age (in days) of the admitted patient. We limit the sample to people between the ages of 60 and 70, resulting in 3,652 observations—one for each potential value of age in days. Both specifications include a dummy for age over 65 and a quadratic polynomial fully interacted with the post-65 dummy. We have also fit the models with cubic polynomials and found no significant differences in the estimated values of the post-65 effects (see Table A in the Online Appendix). Although we know each patient’s exact age (measured in days) at the time of admission, we do not know patients’ birthdates or the exact admission date.18 Because Medicare eligibility begins on the first day of the month that a person turns 65, people who are admitted in the period up to 31 days before reaching their 65th birthdays may or may not be eligible for Medicare. Figure C in the Online Appendix shows the fraction of admitted patients in our nondeferrable sample who are recorded as having Medicare as their primary insurance provider, by age in days for a narrow window around age 65. This fraction is relatively flat for people up to a month before their 65th birthdays, and then rises linearly in the 31 days before they reach age 65, as would be expected given Medicare eligibility rules and a uniform distribution of birthdates. 
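The linear rise in the Medicare share over the 31 days before the 65th birthday follows from the eligibility rule combined with uniformly distributed birthdates. A minimal sketch of the share of patients who are already eligible, assuming that eligibility starts on the first day of the birthday month, that every month has 31 days, and that the birthday's day of the month is uniform on 1 to 31 (all simplifications; `share_eligible` is a hypothetical helper):

```python
from fractions import Fraction

DAYS_IN_MONTH = 31  # simplification: every birthday month has 31 days


def share_eligible(days_before_65th: int) -> Fraction:
    """Share of patients admitted `days_before_65th` days before their 65th
    birthday who are already Medicare-eligible, assuming eligibility begins on
    the 1st of the birthday month and the birthday's day of month is uniform
    on 1..DAYS_IN_MONTH."""
    if days_before_65th <= 0:
        return Fraction(1)  # admitted on or after the 65th birthday
    # Someone whose birthday falls on day `dom` of the month became eligible
    # dom - 1 days before the birthday, so they are eligible at admission
    # iff dom - 1 >= days_before_65th.
    eligible = sum(
        1 for dom in range(1, DAYS_IN_MONTH + 1) if dom - 1 >= days_before_65th
    )
    return Fraction(eligible, DAYS_IN_MONTH)


for d in (40, 31, 30, 15, 1, 0):
    print(d, share_eligible(d))
```

Under these assumptions the share is (31 - d)/31 for admissions d = 0 to 31 days before the birthday and zero earlier, which is the linear ramp described above; real month lengths would make the ramp slightly less regular.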
Because we do not know the Medicare eligibility status of patients who are admitted within 31 days of their 65th birthdays, the specifications reported in the even-numbered columns of Table II include a dummy for these observations. The addition of this dummy has relatively little impact on the estimated coefficients.19 Looking at the models in columns (1)–(4), we estimate that non-ED and planned ED admissions rise by about 12% at 65, whereas unplanned ED admissions rise by 2.6%. The remaining columns report the results for the four quartiles of unplanned ED admissions shown in Figure III. As suggested by the graph, the estimated models for our preferred subgroup of diagnoses (in columns (11) and (12)) show no evidence of a rise in admissions at age 65.
Although the number of admissions in our nondeferrable sample trends smoothly at age 65, we also tested for a change in the health characteristics of admissions at age 65. Specifically, we constructed a Charlson comorbidity score from the secondary diagnoses listed on each discharge record.20 Figure D in the Online Appendix shows the age profile of the Charlson comorbidity scores for the nondeferrable sample. There is no discernible evidence of a drop in the severity of comorbidities at age 65; in fact, formal RD analysis on the age profile indicates a small but statistically insignificant increase in severity (see Table B in the Online Appendix). If we interpret the rise as a true measure of the change in severity (and not a result of upcoding incentives in the Medicare payment system), it suggests that, if anything, our sample becomes slightly less healthy at age 65. We have also checked for discontinuities in the case mix and demographic composition of the nondeferrable subsample at age 65. Counts of admissions for each primary diagnosis code included in our subsample trend smoothly with age, which suggests that the case mix is very stable through the age-65 boundary. The test statistics for jumps in the racial composition, sex, and fraction of Saturday or Sunday admissions (available on request) are all far below conventional critical values. To increase the power to detect differences in patient health, we used all the available covariates for an admission (including age, race/ethnicity, sex, year, month, and day of admission, and principal diagnosis fixed effects) to fit linear probability models for mortality over 7, 14, 28, 90, and 365 days.
18. This restriction was imposed by the California Department of Health and Human Services as a condition for access to the discharge files.
19. For all six models presented in the even-numbered columns of Table II, the t-statistic for a test that the coefficient of the dummy is 0 is well below the usual critical value.
TABLE II
REGRESSION DISCONTINUITY MODELS FOR LOG (NUMBER OF ADMISSIONS) TO CALIFORNIA HOSPITALS BY AGE OF PATIENT AT ADMISSION
Sample                                        Column   Age over 65 (×100)   Dummy for just under 65
Non-ED or planned                             (1)      11.9 (0.5)           No
                                              (2)      12.0 (0.5)           Yes
ED and unplanned                              (3)       2.4 (0.5)           No
                                              (4)       2.6 (0.5)           Yes
ED and unplanned, weekend t-stat > 6.62       (5)       3.2 (1.0)           No
                                              (6)       3.3 (1.1)           Yes
ED and unplanned, weekend t-stat 2.54–6.62    (7)       3.6 (0.9)           No
                                              (8)       3.7 (1.0)           Yes
ED and unplanned, weekend t-stat 0.96–2.54    (9)       2.7 (0.9)           No
                                              (10)      3.0 (1.0)           Yes
ED and unplanned, weekend t-stat < 0.96       (11)      0.6 (0.9)           No
                                              (12)      0.6 (0.9)           Yes
Notes. Standard errors in parentheses. Dependent variable in all models is the log of the number of admissions by patient’s age (in days) at admission, for patients between 60 and 70 years of age (3,652 observations). Count of admissions is based on hospital discharge records for California and includes admissions from January 1, 1992, to November 30, 2002. All models include a second-order polynomial in age (in days) fully interacted with a dummy for age over 65. Models in even-numbered columns include a dummy for people who are within 1 month of their 65th birthdays, whose Medicare eligibility status at the time of admission is indeterminate. Sample in columns (1) and (2) includes admissions that either were planned or did not occur through the ED. Samples in all other columns include admissions that were unplanned and occurred through the ED. Samples in columns (5)–(12) are further restricted by including only admissions for diagnoses (ICD-9 classifications) for which the t-test for equality of weekend and weekday admission rates is in the range indicated. In columns (1)–(12) the coefficient on “age over 65” and its standard error have been multiplied by 100.
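The Table II specification (log admissions by age in days, a second-order polynomial in age fully interacted with a post-65 dummy, and in some models a dummy for the month just below 65) can be sketched with ordinary least squares. This is an illustrative sketch, not the authors' code: it uses NumPy's `numpy.linalg.lstsq`, the helpers `rd_design` and `rd_jump` are hypothetical, the cutoff is a rounded 23,741 days (about 65 × 365.25), age is rescaled to years from the cutoff for numerical convenience (which does not change the estimated jump), and the data are synthetic with a built-in 0.03 jump. It computes no standard errors.

```python
import numpy as np


def rd_design(age_days, cutoff_days, window_days=31):
    """Regressors for the RD model: intercept, post-cutoff dummy, linear and
    quadratic age terms (in years, centered at the cutoff), those two terms
    interacted with the post-cutoff dummy, and a dummy for the `window_days`
    days just below the cutoff."""
    d = np.asarray(age_days, dtype=float) - cutoff_days  # days from cutoff
    x = d / 365.25                                       # years from cutoff
    post = (d >= 0).astype(float)
    just_under = ((d < 0) & (d >= -window_days)).astype(float)
    return np.column_stack(
        [np.ones_like(x), post, x, x**2, post * x, post * x**2, just_under]
    )


def rd_jump(log_counts, age_days, cutoff_days, window_days=31):
    """OLS coefficient on the post-cutoff dummy (the estimated discontinuity)."""
    X = rd_design(age_days, cutoff_days, window_days)
    y = np.asarray(log_counts, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


# Synthetic data: one cell per age in days between 60 and 70 (3,652 cells),
# smooth in age apart from a 0.03 jump in log admissions at the cutoff.
age_days = np.arange(21915, 25567)
cutoff = 23741  # about 65 * 365.25 days
d = age_days - cutoff
x = d / 365.25
true_jump = 0.03
log_counts = 6.0 + 0.02 * x - 0.001 * x**2 + true_jump * (d >= 0)

estimate = rd_jump(log_counts, age_days, cutoff)
print(f"age over 65 (x100): {100 * estimate:.1f}")
```

Because the synthetic outcome is noise-free and exactly quadratic on each side, the post-65 coefficient recovers the built-in jump; with real counts, the same design matrix would be combined with an appropriate variance estimator.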
We then took the predicted mortality rates from these models and conducted an RD analysis, looking for any evidence that the predictors of mortality shift at age 65. The age profiles of 7-day and 28-day predicted mortality are shown in Figure E of the Online Appendix. (Results for other follow-up periods are very similar.) These age profiles are extremely smooth and show no jump at age 65. In an RD specification with a quadratic in age and a dummy for over 65, interacted with the linear and quadratic terms, the t-statistics for the post-65 coefficient are 0.4 (7-day mortality) and 0.25 (28-day mortality), providing no evidence that the observable health of the sample changes at age 65. A final piece of indirect evidence that patients with severe conditions are not more (or less) likely to present at the ED once they become Medicare-eligible comes from studies of the effect of cost-sharing (i.e., copayments and deductibles) on the use of the ED.21 In the RAND health insurance experiment, patients with cost-sharing insurance plans were no less likely to present at the ED for the most serious conditions (chest pain/acute heart disease, surgical abdominal disease, acute eye injuries, second-degree burns) than those with the “free” plan (O’Grady et al. 1985). Similarly, two recent studies of the introduction of cost-sharing for ED visits by patients in a health maintenance organization (HMO) (Selby, Fireman, and Swain 1996; Wharam et al. 2007) conclude that copayments have no large or significant effect on ED visits for patients with severe conditions. The sample sizes in these studies are modest, however, and one certainly could not detect changes on the order of 3% in the use of the ED, which is the magnitude of the jump at age 65 for all unplanned ED admissions in the California discharge data. A fourth study, also HMO-based, by Hsu et al. (2006) has a much larger sample and concludes that the introduction of modest copayment requirements reduces ED visits. This study does not break down ED visits by the severity of patients’ conditions, though the authors find that patients with copayment requirements have somewhat better clinical outcomes, which suggests that patients with life-threatening conditions such as AMI, stroke, and chronic obstructive pulmonary disease are obtaining appropriate medical services. Overall, we interpret these studies as supporting the hypothesis that people with severe conditions present at the ED independent of their insurance status.
20. We are grateful to a referee for suggesting this analysis. The Charlson comorbidity index is a weighted count of the presence of 19 diseases (e.g., diabetes with end organ damage is weighted 2, whereas peptic ulcer disease is weighted 1). We use the STATA coding for the index developed by Stagg (2006).
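The Charlson comorbidity score used in this section is described in footnote 20 as a weighted count of the presence of 19 diseases. A minimal sketch of that weighted count, using only the two example weights quoted in the footnote; the full weight table and the mapping from ICD-9 secondary diagnoses to conditions (which the authors take from Stagg's Stata coding) are not reproduced, and the condition labels and `charlson_score` are hypothetical:

```python
from typing import Iterable, Mapping

# Only the two example weights quoted in the text; a complete index assigns
# a weight to each of 19 conditions.
EXAMPLE_WEIGHTS = {
    "diabetes_with_end_organ_damage": 2,
    "peptic_ulcer_disease": 1,
}


def charlson_score(conditions: Iterable[str],
                   weights: Mapping[str, int] = EXAMPLE_WEIGHTS) -> int:
    """Weighted count of the conditions present on a record.

    Each condition counts once, however many secondary diagnoses map to it;
    conditions missing from `weights` contribute 0 here.
    """
    return sum(weights.get(c, 0) for c in set(conditions))


record = ["peptic_ulcer_disease", "diabetes_with_end_organ_damage",
          "peptic_ulcer_disease"]
print(charlson_score(record))
```

Deduplicating with `set` is what makes this a count of conditions present rather than of diagnosis codes; production codings may add further rules (for example, not double-counting milder and more severe forms of the same disease), which this sketch does not attempt.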
21. We are grateful to a referee for suggesting that we consider this literature.

V. SHIFTS IN INSURANCE, HEALTH SERVICES, AND MORTALITY AT 65

V.A. Insurance

We now turn to the impact of the Medicare eligibility age on health-related outcomes. We begin by looking at health insurance coverage. Figure IV shows the age profiles of the fractions of people with various primary insurers (private, Medicaid, Medicare, other, and none) in the nondeferrable admissions subsample.

FIGURE IV
Primary Insurance Coverage of Admitted Patients
[Figure: proportion of patients with each class of primary insurance (private, Medicare, Medicaid, other insurance, uninsured), actual and fitted, by age at admission, 60 to 70.]
See notes for Figure II. In this figure the y-axis represents the fraction of patients with different classes of primary insurance coverage. Sample includes 425,315 patients with nondeferrable primary diagnoses, defined as unplanned admissions through the emergency department for diagnoses with a t-statistic for the test of equal weekday and weekend admission rates of 0.965 or less. Medicare eligibility status of patients within one month of their 65th birthdays is uncertain and we have excluded these observations.

Consistent with the patterns in Figure I for the overall population, we see a big increase in the fraction of patients with Medicare as their primary insurer at 65, coupled with a decline in the fraction with no insurance. RD models for health insurance outcomes are presented in Table III. The models follow the same specification as in Table II, although we now include a set of covariates (year, month and day of admission, race/ethnicity, sex, and admission diagnosis fixed effects) in the specifications shown in even-numbered columns. For reference, the specifications in the odd-numbered columns exclude these controls and also exclude the dummy for admissions in the 31 days before a patient’s 65th birthday. The regression results confirm the visual impressions conveyed in Figure IV. At age 65, the fraction of patients with Medicare as their primary insurer rises by about 47 percentage points, whereas the fractions with private insurance and Medicaid
TABLE III
REGRESSION DISCONTINUITY MODELS FOR PROBABILITY OF DIFFERENT FORMS OF PRIMARY INSURANCE COVERAGE

                                      Medicare         Private          Medicaid         Uninsured
                                    (1)     (2)      (3)     (4)      (5)     (6)      (7)     (8)
Age over 65 (×100)                 43.9    47.5    −24.8   −26.8    −10.1   −10.8     −7.4    −8.0
                                  (0.4)   (0.4)    (0.4)   (0.4)    (0.3)   (0.3)    (0.2)   (0.2)
Additional controls                 No      Yes      No      Yes      No      Yes      No      Yes
Mean of dependent variable for
patients aged 64–65 (×100)             24.0             43.3             43.3              9.7

Notes. Standard errors in parentheses. Dependent variable is indicator for type of insurance listed as “primary insurer” on discharge record. Sample includes 425,315 observations on patients between the ages of 60 and 70 admitted to California hospitals between January 1, 1992, and November 30, 2002 for an unplanned admission through the emergency department, with a diagnosis (ICD-9) for which the t-test for equality of weekend and weekday admission rates is less than 0.96 in absolute value. All models include second-order polynomial in age (in days) fully interacted with dummy for age over 65 and are fit by OLS. Models in even-numbered columns include the following additional controls: a dummy for people who are within one month of their 65th birthday; dummies for month, year, sex, race/ethnicity, and admission on Saturday or Sunday; and a complete set of unrestricted fixed effects for each ICD-9 admission diagnosis. In columns (1)–(8) the coefficient on “age over 65” and its standard error have been multiplied by 100.
both fall.22 Note that in the sample of nondeferrable admissions the Medicare coverage rate at age 64 is 24%, substantially higher than in the overall population (shown in Figure I). Presumably this reflects the fact that many of these patients are chronically ill and on DI prior to 65. The percentage with no insurance at 64 is correspondingly a little lower than in the overall population (10% versus about 13%), and the reduction in the rate of noninsurance at 65 is a little smaller (−8 percentage points in the nondeferrable subsample, versus −9.5 percentage points for the population as a whole). Nevertheless, as in the population as a whole, patients with nondeferrable conditions have very different insurance coverage just after age 65 than just before.

V.B. Intensity of Treatment

We have three basic measures of the intensity of treatment offered to patients: the length of their stays in hospital, the number of procedures performed, and total hospital list charges.23 Figure V shows the age profiles for these measures, and Table IV presents RD models similar to the specifications in Table III. The age profile for mean length of stay is somewhat noisier than the other two profiles, but all three profiles suggest an upward jump at 65. The estimation results in Table IV show that mean length of stay increases by 0.4 days (about 4.5%) at 65, though the estimated gain is not statistically significant. The number of procedures jumps by 0.1, or approximately 4% (with a t-ratio of around 4), and log list charges jump by 2.5% (with a t-ratio of around 2.6). One concern with an RD model for the logarithm of list charges is that the dispersion in charges may increase (or decrease) once patients become Medicare-eligible. If this is true, then the expected level of charges may rise by more (or less) than 3% at age 65 (see Manning [1998] for a general discussion of modeling

22. Unfortunately we have no information on secondary coverage. We suspect that many of the 45% who have private coverage prior to age 65 enroll in Medicare and a supplementary policy at 65.

23. We sum the duration of stay, list charges, and number of procedures for all consecutive stays. List charges are accounting charges, and do not represent the charges actually billed to insurers or patients. They also exclude charges for physician services, and are not reported for patients at Kaiser hospitals (hence the smaller sample size in columns (5)–(6) in Table IV). We interpret list charges as a convenient “price-weighted” summary of services rendered, albeit at artificial prices. Note that if list prices are a constant proportional markup over actual costs, then the percentage change in list charges will be a good indicator of the change in the costs of the services provided to patients who are just under or just over 65.
FIGURE V
Three Measures of Inpatient Treatment Intensity
[Figure: length of stay (days) and number of procedures (left axis, "Days and number of procedures"), and log list charges (right axis), actual and fitted, by age at admission.]
See notes to Figure IV. Sample includes unplanned admissions through the emergency department for diagnoses with a t-statistic for the test of equal weekday and weekend admission rates of 0.965 or less. In this figure the sample is further restricted to patients with valid SSNs (407,386 observations). Sample for log list charges excludes patients admitted to Kaiser hospitals. Length of stay, number of procedures, and list charges are cumulated over all consecutive hospitalizations. List charges are measured in 2002 dollars.
health spending). To evaluate this concern we fit RD models to the standard deviation of log list charges for patients in each age-at-admission cell (with age measured in days). These models showed no large or statistically significant effect on the dispersion in list charges at age 65. We also fit RD models to the 75th and 90th percentiles of list charges for each age-at-admission cell. These models showed that both the 75th and 90th percentiles of list charges increase by about 2%–3% at age 65. (The regression models, and associated graphs, are presented in the Online Appendix, Table C and Figures F and G.) Overall, we conclude that there are modest but statistically significant increases in the intensity of treatment at age 65 for patients in our nondeferrable admissions sample, on the order of 3%. These increases are much smaller than the 10%–25% increases in rates of elective procedures such as hip and knee replacements observed in the overall population (Card, Dobkin, and Maestas
TABLE IV
REGRESSION DISCONTINUITY MODELS FOR CHANGES IN TREATMENT INTENSITY

                                    Length of stay     Number of         Log list charges
                                       (days)          procedures            (×100)
                                    (1)      (2)      (3)      (4)       (5)      (6)
Age over 65                        0.37     0.35     0.09     0.11      2.5      2.6
                                  (0.24)   (0.26)   (0.03)   (0.03)    (1.1)    (1.0)
Additional controls                 No       Yes      No       Yes       No       Yes
Mean of dependent variable for
patients aged 63 or 64                8.12              2.50              9.87

Notes. Standard errors in parentheses. Dependent variable is length of stay in days (columns (1) and (2)), number of procedures performed (columns (3) and (4)), and log of total list charges (columns (5) and (6)). Sample includes 407,386 (352,652 in columns (5) and (6)) observations on patients with valid SSNs between the ages of 60 and 70 admitted to California hospitals between January 1, 1992, and November 30, 2002 for an unplanned admission through the ED. Data on list charges are missing for Kaiser hospitals. See note to Table III for additional details on sample, and list of additional covariates included in even-numbered columns. In columns (5) and (6) the coefficient on “age over 65” and its standard error have been multiplied by 100.
2008) but suggest that the availability of Medicare affects the utilization of health care services even for severely ill patients. We also performed a more detailed analysis of the changes in the use of specific procedures at age 65 for two major sets of diagnoses: obstructive chronic bronchitis with acute exacerbation (the most common ICD-9 diagnosis in our nondeferrable sample, shown in row (1) of Table I) and acute myocardial infarction (AMI), which combines the various detailed AMI diagnoses in our nondeferrable sample. The results are summarized in Table D of the Online Appendix. For AMI admissions we see a relatively large and precisely estimated increase in the overall number of procedures at age 65 (a rise of 0.44 on a base rate of 5.0 among 64-year-olds, or approximately 9%) and significant increases in the use of several important diagnostic procedures, including coronary arteriography, cardiac catheterization, and angiocardiography.24 In contrast, for obstructive chronic bronchitis patients, we see no change in the overall number of procedures and only small increases or decreases in the incidence of specific procedures. This analysis suggests that the relatively small increase in the overall number of procedures for all admission diagnoses in Table IV masks larger increases for certain “procedure-intensive” diagnoses, such as AMI, and near-constancy for other diagnoses. Unfortunately, the sample sizes for other diagnoses are too small to permit a more extensive investigation. We conclude, however, that the onset of Medicare eligibility is associated with an increase in the use of specific potentially life-saving procedures.

V.C. Transfers and Readmissions

Patients who are initially admitted for acute care may be transferred (i.e., discharged and immediately readmitted) to another care/treatment unit in the same hospital, to another hospital, or to nonhospital care (e.g., a nursing home).25 Because our data are derived from hospital discharge records, we cannot measure transfers to standalone skilled nursing facilities or to other care options that may be substitutable for postacute care in a hospital setting. Nevertheless, we find that within- and

24. Cutler and McClellan (2001) have estimated that invasive diagnosis and treatment procedures as a whole (including catheterization, angioplasty, and bypass surgery) are cost-effective in the treatment of AMI. The efficacy of specific procedures is less clear: see, for example, McClellan, McNeil, and Newhouse (1994) and Cutler, McClellan, and Newhouse (1999).

25. Note that to avoid double counting we have collapsed all consecutive hospital stays to single records.
FIGURE VI
Patient Mortality Rates over Different Follow-Up Intervals
[Figure: fraction of patients who die within 7, 14, 28, 90, 180, and 365 days of admission, actual and fitted, by age at admission, 60 to 70.]
See notes to Figure IV. Sample includes unplanned admissions through the emergency department for diagnoses with a t-statistic for the test of equal weekday and weekend admission rates of 0.965 or less. In this figure the sample is further restricted to patients with valid SSNs (407,386 observations). Deaths include in-hospital and out-of-hospital deaths.
between-hospital transfer rates rise at age 65, with a particularly large rise in within-hospital transfers (25%) that appears to be driven by a jump in the rate of transfer to skilled nursing facilities within the same hospital. We also find a marginally significant reduction in the probability that patients are readmitted to a California hospital within 28 days after their initial admission (point estimate: a 0.6-percentage-point reduction on a base readmission rate of 17.0% for 64-year-olds; t = 1.70). Graphs and estimates for transfers and readmissions are presented in the Online Appendix (Figures H and I and Table E).

V.D. Mortality

Figure VI plots the age profiles for the probability of death within 7, 14, 28, 90, 180, and 365 days of admission to the hospital, and the first two rows of Table V present estimates from RD regression models corresponding to each of these outcomes. Inspection of Figure VI shows that each of the mortality
TABLE V
REGRESSION DISCONTINUITY ESTIMATES OF CHANGES IN MORTALITY RATES

                                                        Death rate in
                                       7 days  14 days  28 days  90 days  180 days  365 days
Estimated discontinuity at age 65 (×100)
(1) Fully interacted quadratic with     −1.1    −1.0     −1.1     −1.1     −1.2      −1.0
    no additional controls              (0.2)   (0.2)    (0.3)    (0.3)    (0.4)     (0.4)
(2) Fully interacted quadratic plus     −1.0    −0.8     −0.9     −0.9     −0.8      −0.7
    additional controls                 (0.2)   (0.2)    (0.3)    (0.3)    (0.3)     (0.4)
(3) Fully interacted cubic plus         −0.7    −0.7     −0.6     −0.9     −0.9      −0.4
    additional controls                 (0.3)   (0.2)    (0.4)    (0.4)    (0.5)     (0.5)
(4) Local linear regression procedure   −0.8    −0.8     −0.8     −0.9     −1.1      −0.8
    fit separately to left and right    (0.2)   (0.2)    (0.2)    (0.2)    (0.3)     (0.3)
    with rule-of-thumb bandwidths
Mean of dependent variable (%)           5.1     7.1      9.8     14.7     18.4      23.0

Notes. Standard errors in parentheses. Dependent variable is indicator for death within interval indicated by column heading. Entries in rows (1)–(3) are estimated coefficients of dummy for age over 65 from models that include a quadratic polynomial in age (rows (1) and (2)) or a cubic polynomial in age (row (3)) fully interacted with a dummy for age over 65. Models in rows (2) and (3) include the following additional controls: a dummy for people who are within 1 month of their 65th birthdays, dummies for year, month, sex, race/ethnicity, and Saturday or Sunday admissions, and unrestricted fixed effects for each ICD-9 admission diagnosis. Entries in row (4) are estimated discontinuities from a local linear regression procedure, fit separately to the left and right, with independently selected bandwidths from a rule-of-thumb procedure suggested by Fan and Gijbels (1996). Sample includes 407,386 observations on patients between the ages of 60 and 70 admitted to California hospitals between January 1, 1992, and November 30, 2002, for unplanned admission through the ED who have nonmissing Social Security numbers. All coefficients and their SEs have been multiplied by 100.
measures has a drop on the order of 1.0 percentage point at age 65. The regression estimates in Table V confirm this: we observe a reduction in 7-day mortality of about 0.7 to 1.0 percentage points that persists over the longer follow-up periods. The effect is relatively precisely measured in the shortest time intervals but has an increasing sampling error as the follow-up window is extended, yielding t-ratios of about 5 at 7 days, about 3 at 28 days, and around 1.8 at 365 days. We have performed extensive robustness checks to ensure that the mortality results are not an artifact of a particular specification of the RD model. As shown in row (3) of Table V, specifications with a cubic age polynomial yield estimates that are similar to the simpler quadratic models, though typically a little smaller, particularly for the 28-day follow-up window. (Figure J in the Online Appendix compares the fits of linear, quadratic, and cubic RD models for 28-day mortality.) We also refit the models using logits (rather than linear probability specifications) and obtained essentially identical estimates of the change in the probability of death at age 65 (see Table F of the Online Appendix). Finally, we used local linear regression (LLR) models to obtain “nonparametric” estimates of the mortality rates for patients just under and just over 65 (Hahn, Todd, and van der Klaauw 2001; Imbens and Lemieux 2008). Row (4) of Table V presents LLR-based estimates using a triangular kernel and the rule-of-thumb bandwidth selection procedure suggested by Fan and Gijbels (1996). These are very similar to the parametric estimates using a quadratic polynomial but a little more precise. 
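Table V compares a fully interacted polynomial specification with a local linear regression (LLR) estimator. The sketch below shows both estimators on simulated data, not on the California discharge records, and makes the following assumptions:

- It uses a linear probability model for a binary death indicator.
- Age is continuous and measured in years.
- The mortality profile is made up: a quadratic in age with a true drop of 1 percentage point at 65.
- A fixed 3-year bandwidth replaces the Fan and Gijbels (1996) rule-of-thumb selector.

It omits covariates and standard errors. The function names are hypothetical.

```python
import numpy as np


def rd_quadratic(age, y, cutoff=65.0):
    """Fully interacted quadratic RD fit by OLS (linear probability model).

    Regressors: constant, post, x, x^2, post*x, post*x^2, with x = age - cutoff.
    Returns the coefficient on the post-cutoff dummy, i.e. the jump at the cutoff.
    """
    x = age - cutoff
    post = (x >= 0).astype(float)
    X = np.column_stack([np.ones_like(x), post, x, x**2, post * x, post * x**2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


def _local_linear_intercept(x, y, h):
    """Triangular-kernel weighted linear fit; returns the fitted value at x = 0."""
    w = np.clip(1.0 - np.abs(x) / h, 0.0, None)  # triangular kernel weights
    keep = w > 0
    X = np.column_stack([np.ones(keep.sum()), x[keep]])
    sw = np.sqrt(w[keep])
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y[keep] * sw, rcond=None)
    return coef[0]


def rd_local_linear(age, y, bandwidth, cutoff=65.0):
    """LLR discontinuity: separate kernel-weighted fits to the left and right of the cutoff."""
    x = age - cutoff
    right = x >= 0
    left_limit = _local_linear_intercept(x[~right], y[~right], bandwidth)
    right_limit = _local_linear_intercept(x[right], y[right], bandwidth)
    return right_limit - left_limit


# Simulated patients aged 60-70 with a hypothetical mortality profile and a true jump of -0.010.
rng = np.random.default_rng(0)
n = 1_000_000
age = rng.uniform(60.0, 70.0, n)
x = age - 65.0
p = 0.09 + 0.004 * x + 0.0002 * x**2 - 0.010 * (x >= 0)
death = (rng.uniform(size=n) < p).astype(float)

est_quad = rd_quadratic(age, death)
est_llr = rd_local_linear(age, death, bandwidth=3.0)
print(f"quadratic RD:    {est_quad:+.4f}")
print(f"local linear RD: {est_llr:+.4f}")
```

The simulated baseline is exactly quadratic, so both estimates should fall near −0.010, differing only by sampling error. With real data the two estimators trade off the bias of a global polynomial against the variance of a local fit. Comparing rows (1)–(4) of Table V is one way to check that the estimate is not driven by either choice.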
Figure K of the Online Appendix presents LLR-based estimates of the 28-day mortality rate for patients on each side of the age 65 boundary, using all possible bandwidths between 1 month and 5 years.26 The estimated mortality rates for patients just under 65 and patients just over 65 stabilize once the bandwidth reaches about 1 year, centering on values close to the predicted values from our basic quadratic specification. We used the specification from row (2) of Table V to fit parametric RD models for the probability of death in all possible follow-up windows between 1 day and 2 years. The resulting estimates of the jump in mortality at age 65 are plotted in Figure VII, along with the associated 95% confidence intervals. The data show a robust pattern of reductions in mortality for windows of up to

26. Here we follow the recommendation of Imbens and Lemieux (2008) and use a rectangular kernel for the local linear regressions on each side.
FIGURE VII
Estimates of the Discontinuity in Mortality Rates at Age 65 over Various Follow-Up Periods
[Figure: estimated discontinuity in mortality at age 65 (vertical axis, −0.020 to 0.010), with the upper and lower bounds of the 95% confidence interval, plotted against the length of the follow-up period since admission (horizontal axis, 0 to 2 years).]
See notes to Figure VI. The estimates in this figure are from a regression with a quadratic polynomial in age fully interacted with a dummy for age over 65. The regressions also include a dummy for patients within one month of their 65th birthdays, month and year dummies, fixed effects for the primary diagnosis, and dummies for race, sex, and admissions on Saturday or Sunday. Regression discontinuities are estimated for probability of death within 7, 14, and 28 days and then at 30-day intervals. The estimates and associated 95% confidence intervals from each regression are then linearly interpolated in the figure.
9 months on the order of 0.8–1.0 percentage points, with somewhat smaller point estimates for windows of 1–2 years. Our estimates of the mortality effect of Medicare eligibility are relatively large: they represent a 14%–20% reduction in 7-day mortality, a 7%–9% reduction in 28-day mortality, and a 2%–4% reduction in 1-year mortality relative to death rates among 64-year-olds with similar conditions at admission. The emergence of the effect within 7 days of admission suggests that the extra services or changes in the quality of services provided to Medicare-eligible patients have an immediate life-saving impact. It is also worth noting that the mortality reductions estimated in Table V appear to reflect changes in the treatment of patients with Medicare within the same hospital, rather than patient sorting to higher-quality hospitals at 65. The fractions of patients with nondeferrable conditions entering different kinds of hospitals show only small changes at age 65. The largest change is a reduction of about three percentage points in the fraction entering county hospitals. Interestingly, the 28-day mortality rate for 63–64-year-olds is actually lower at county hospitals (6.8%) than at nonprofit (9.2%), for-profit (9.0%), or district (9.7%) hospitals in our data, so it is implausible that such a small shift of patients out of county hospitals could have much effect on average mortality.27 Thus, it does not appear that Medicare reduces mortality by shifting patients to better hospitals.

V.E. Robustness of Mortality Estimates: A Bounding Approach

To further probe our estimated mortality effects, we used a simple bounding procedure to obtain lower-bound estimates of the (absolute) mortality effect of Medicare eligibility on broader samples of hospital admissions, including the entire patient population. The basis of this procedure is the observation that in any sample of sick people close to age 65 there are two subgroups: one group (which we index with subscript 1) who enter the hospital regardless of whether they are Medicare-eligible or not; and a second group (indexed by subscript 2) who will only enter the hospital if they are over 65. Let $\alpha \ge 0$ represent the sample fraction of the second group. We have argued that among people with nondeferrable conditions, $\alpha = 0$. In more general patient populations, however, $\alpha > 0$, and a comparison of mortality between patients just over and just under 65 contains a selectivity bias. Let $m_1$ denote the mortality rate of the first group if they enter the hospital just before their 65th birthdays and let $m_1'$ denote the mortality rate if they enter after 65. The causal effect of Medicare eligibility for group 1 is $\Delta = m_1' - m_1$. The observed mortality rate of the patient population just over 65 is an average for groups 1 and 2,

$$\bar{m} = (1-\alpha)\,m_1' + \alpha\,m_2 = (1-\alpha)(m_1 + \Delta) + \alpha\,m_2,$$

where $m_2$ is the post-65 mortality rate of group 2.
Using this expression it is easy to show that

$$\bar{m} - m_1 = \Delta - \frac{\alpha}{1-\alpha}\,(\bar{m} - m_2). \tag{3}$$
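The algebra behind "it is easy to show" can be written out using only the definitions above. First express $\bar{m} - m_2$ in terms of $\Delta$, then substitute it into $\bar{m} - m_1$:

$$
\begin{aligned}
\bar{m} - m_2 &= (1-\alpha)(m_1 + \Delta) + \alpha m_2 - m_2 = (1-\alpha)(m_1 + \Delta - m_2),\\
\bar{m} - m_1 &= (1-\alpha)\Delta - \alpha m_1 + \alpha m_2 = \Delta - \alpha\,(m_1 + \Delta - m_2) = \Delta - \frac{\alpha}{1-\alpha}\,(\bar{m} - m_2).
\end{aligned}
$$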
27. To see that the effect of a small amount of sorting is negligible, note that even if (contrary to fact) the mortality rate at county hospitals were 50% higher than that of private hospitals (about 13.5% rather than 9%), a three-percentage-point shift out of county hospitals could account for at most a negligible amount of the estimated mortality gain: 0.03 × 0.045 = 0.00135, or about 0.14 percentage points.
Thus, the mortality differential between the post-65 patient population and the pre-65 patient population is equal to $\Delta$, the causal effect of Medicare eligibility on group 1, plus a bias term, $\text{Bias} = -\frac{\alpha}{1-\alpha}(\bar{m} - m_2)$. This term depends on the fraction of group 2 and on the deviation of their post-65 mortality rate from the average of groups 1 and 2. Because $m_2 \ge 0$, a lower bound on the (signed) bias caused by the presence of group 2 in the post-65 patient population, that is, the largest possible downward bias, is

$$\text{Worst-case Bias} = -\frac{\alpha}{1-\alpha}\,\bar{m}. \tag{4}$$
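The arithmetic for the worst-case bias in equation (4), and for the lower bound built from it, can be checked directly against Table VI. This sketch takes rows (1)–(3) of Table VI as inputs, all multiplied by 100. It applies the formulas stated in the table notes:

- m̄ is row (2) + row (3).
- Row (4) is −row (1) × m̄ / 100.
- Row (5) is row (2) − row (4).

It does not reproduce the delta-method standard errors, which need sampling covariances that are not reported here. The `Column` type and the function names are illustrative, not from the paper.

```python
from dataclasses import dataclass


@dataclass
class Column:
    label: str
    log_admissions_rd: float  # row (1): estimate of alpha/(1 - alpha), x100
    mortality_rd: float       # row (2): estimate of m_bar - m_1, x100
    pre65_mortality: float    # row (3): estimate of m_1, x100


def worst_case_bias(col: Column) -> float:
    """Equation (4): -alpha/(1 - alpha) * m_bar, with m_bar = row (2) + row (3)."""
    m_bar = col.mortality_rd + col.pre65_mortality
    return -col.log_admissions_rd * m_bar / 100.0


def lower_bound(col: Column) -> float:
    """Row (5): observed RD in mortality minus the worst-case bias, row (2) - row (4)."""
    return col.mortality_rd - worst_case_bias(col)


# Rows (1)-(3) of Table VI as reported (all multiplied by 100).
TABLE_VI = [
    Column("All admissions", 7.12, -0.46, 5.10),
    Column("Planned or non-ED", 11.85, -0.29, 3.29),
    Column("Unplanned ED", 2.43, -0.49, 6.86),
    Column("t > 6.62", 3.23, 0.27, 2.68),
    Column("2.54 < t < 6.51", 3.57, -0.63, 7.41),
    Column("0.96 < t < 2.54", 2.70, -0.45, 6.87),
    Column("t < 0.96", 0.56, -1.09, 10.25),
]

for col in TABLE_VI:
    print(f"{col.label:<20} worst-case bias {worst_case_bias(col):+.3f}  "
          f"lower bound {lower_bound(col):+.3f}")
```

The recomputed values agree with rows (4) and (5) of Table VI to within about 0.01. The small gaps reflect rounding of the published inputs.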
The worst-case bias in equation (4) tends to 0 as $\alpha \to 0$ and is proportional to $\bar{m}$. Table VI presents estimates of the various terms in equation (3) for the 28-day mortality rates of various patient populations:

- all patients (column (1));
- those who enter the hospital via a route other than the ED, or for a planned hospitalization, which we call “elective” admissions (column (2));
- those who enter via the ED for an unplanned hospitalization (column (3));
- the four subgroups of the unplanned-ED group, based on admission diagnoses with different ranges of weekend versus weekday admissions, i.e., the four subgroups graphed in Figure III (columns (4)–(7)).

The first row of Table VI presents the estimated RD in the log of the number of hospital admissions at age 65, which is an estimate of $\alpha/(1-\alpha)$.28 Row (2) shows the estimated change in the mortality rate of patients at age 65 (i.e., the estimate of $\bar{m} - m_1$), obtained from an RD model with an interacted quadratic function of age fit to aggregated mortality rates by age in days.29 Row (3) shows an estimate of the constant in the mortality regression, which is our estimate of the mortality rate for people just under 65. (The implied estimate of $\bar{m}$ is therefore the sum of the entries in rows (2) and (3).) Row (4) shows our estimate of the worst-case selectivity bias, based on equation (4), whereas

28. If $\alpha$ is the share of all potential patients who are only admitted after age 65, then the proportional increase in admissions at age 65 is $(1 - (1-\alpha))/(1-\alpha) = \alpha/(1-\alpha)$, so the RD in log admissions is an estimate of $\alpha/(1-\alpha)$.

29. To construct a standard error for our lower bound we need a standard error for the sum of the estimated RD in mortality and $\frac{\alpha}{1-\alpha}\bar{m}$. Here $\alpha/(1-\alpha)$ is the estimated RD in log admissions. $\bar{m}$ is the estimated mortality rate for those just over 65, which can be estimated as the constant in the mortality regression (assuming age is normalized to 0 at age 65) plus the value of the RD in mortality. Because we need the covariance between the estimated parameters from the mortality and admissions models, we fit the two RDs as seemingly unrelated regressions using grouped age cells and use the delta method to construct the sampling error.
TABLE VI
ESTIMATED LOWER BOUNDS FOR CHANGE IN 28-DAY MORTALITY AT AGE 65

Columns: (1) all admissions; (2) planned or non-ED admissions; (3) unplanned ED admissions; (4)–(7) unplanned ED admissions by range of the t-test for the weekend admission rate: (4) t > 6.62, (5) 2.54 < t < 6.51, (6) 0.96 < t < 2.54, (7) t < 0.96.

                                               (1)     (2)     (3)     (4)     (5)     (6)     (7)
(1) Estimated discontinuity in log of number
    of admissions at age 65 (×100)            7.12   11.85    2.43    3.23    3.57    2.70    0.56
(2) Estimated discontinuity in probability
    of death at age 65 (×100)                −0.46   −0.29   −0.49    0.27   −0.63   −0.45   −1.09
(3) Mean probability of death for people
    aged 64 (×100)                            5.10    3.29    6.86    2.68    7.41    6.87   10.25
(4) Estimated “worst case” bound on
    selection bias component of change
    in death rate at age 65                  −0.33   −0.36   −0.15   −0.10   −0.24   −0.17   −0.05
(5) Estimated lower bound for causal
    effect of reaching age 65 on death rate  −0.13    0.06   −0.33    0.37   −0.39   −0.27   −1.04
(6) (Standard error)                         (0.07)  (0.09)  (0.12)  (0.16)  (0.25)  (0.24)  (0.29)
(7) Share of admissions in column sample      1.00    0.50    0.50    0.11    0.13    0.12    0.12

Notes. Row (1) presents estimated regression discontinuities in the log of the number of hospital admissions at age 65, from a specification fit to admission counts by age for patients age 60 to 70 (3,652 observations) that includes a quadratic polynomial in age, fully interacted with a dummy for age over 65. Row (2) presents estimated regression discontinuities in the probability of death at age 65 from a specification fit to mean death rates by age at admission data for patients age 60 to 70 (3,652 observations) that includes a quadratic polynomial in age, fully interacted with a dummy for age over 65. Row (3) presents estimated mean probabilities of death within 28 days of admission for patients who are 64 years old. Row (4) presents the estimated worst case bound on the selectivity component of the change in death rates at age 65, computed as −row (1) × (row (2) + row (3))/100. Row (5) is the estimated lower bound on the causal effect of reaching age 65 on the 28-day death rate, computed as row (2) − row (4). The standard error (row (6)) is computed by the delta method, using the sampling errors of the estimates of the entries in rows (1), (2), and (3), and their sampling covariances. In rows (1)–(3) the coefficients have been multiplied by 100.
row (5) shows our lower-bound estimate of the effect of Medicare eligibility on the patient population, and row (6) shows an estimated sampling error for this bound. Finally, for reference, row (7) shows the fraction of patients in each subgroup. Three key conclusions emerge from the table. First, the lower-bound estimate of the overall effect of Medicare on the 28-day death rate of the entire patient population is −0.13 percentage points (and only marginally significant). This is about 1/10 as large as our estimate of the effect on the nondeferrable admissions group, who represent 12% of the overall patient population. Second, for “elective” admissions (column (2)), our point estimate of the lower-bound mortality effect is actually positive (as it is for the top quartile of diagnoses with the lowest weekend admission rates, in column (4)). For these admissions we cannot rule out the possibility that selection bias explains the entire (relatively small) drop in mortality we see after age 65. Even for the two middle quartiles of weekend/weekday admission codes, the estimated lower bounds on the Medicare effect are small. Thus, virtually all of the (lower-bound) mortality effect we observe for the overall patient sample is attributable to the reduction in mortality for the nondeferrable subgroup. A third observation is that the unadjusted change in mortality at age 65 for the top quartile diagnosis group (column (4)) is actually positive (+0.27 percentage points). This is reassuring in two ways. First, it shows that no mechanical data problem is causing us to measure lower death rates for all patients over 65.30 Second, the diagnoses in this quartile are relatively non-life-threatening. In particular, the 28-day mortality rate for 64-year-old patients in this group is only 2.7%, somewhat below the death rate for patients admitted on an elective basis.
It would be surprising if Medicare eligibility had much effect on mortality for such a relatively healthy group, and the estimates imply that it does not. VI. DISCUSSION Our empirical results point to a significant positive effect of Medicare eligibility on the intensity of treatment for acutely ill patients with nondeferrable conditions and a negative effect on 30. We believe that any such data problems are likely to bias the results in the opposite direction. In particular, because the in-hospital mortality rate of people without SSNs is higher, at worst we would add to the sample at 65 a small group with higher potential mortality, which would lead to a rise in the measured death rate for people over 65.
DOES MEDICARE SAVE LIVES?
629
patient mortality. In this section we discuss the possible channels for this effect. To aid in this discussion, it is helpful to consider a simplified causal model in which Medicare eligibility affects insurance characteristics, insurance affects health care services, and health services affect mortality. Building on the analysis in Section II, suppose that patient $i$ has a health insurance package with a vector of characteristics $z_i$, including whether $i$ has any coverage, whether he or she has Medicare or some other form of primary coverage, and (possibly) other characteristics. Assume that the age profile for $z_i$ is generated by a model of the form

$$z_i = g(a_i, \gamma_z) + \text{Post65}_i\,\pi + \upsilon_{zi}, \qquad (5)$$

where $g$ is a smooth function of age ($a_i$) with parameters $\gamma_z$, $\upsilon_{zi}$ is an error term that is mean-independent of the dummy $\text{Post65}_i$, and $\pi$ represents the vector of discontinuities in insurance characteristics at age 65. Suppose that the quality-adjusted health care services delivered to patient $i$ ($S_i$) depend on age, an error term $\upsilon_{si}$, and the characteristics of the insurance package:³¹

$$S_i = h(a_i, \gamma_s) + \theta z_i + \upsilon_{si}. \qquad (6)$$

Finally, assume that the probability that patient $i$ dies ($y_i = 1$) depends on age and on quality-adjusted health services:

$$y_i = k(a_i, \gamma_y) + \lambda S_i + \upsilon_{yi}. \qquad (7)$$

Equations (5), (6), and (7) yield reduced-form models such as equation (1), with a discontinuity in health care services at age 65 equal to

$$\beta_s = \theta\pi \qquad (8a)$$

and a discontinuity in mortality equal to

$$\beta_y = \lambda\theta\pi. \qquad (8b)$$
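A quick numerical check of (8a) and (8b) can make the channel logic concrete. The sketch below is a minimal illustration, not the authors' code: it sets the error terms to zero, assumes two hypothetical insurance characteristics (any coverage, and Medicare as primary payer), and uses made-up values for $\pi$, $\theta$, and $\lambda$ together with simple linear age profiles standing in for $g$, $h$, and $k$. Because the smooth age terms are identical on both sides of the threshold, the jumps reduce to $\theta\pi$ and $\lambda\theta\pi$, and the mortality jump is the sum of the per-characteristic terms $\lambda\theta_k\pi_k$.

```python
# Minimal sketch of equations (5)-(8b) with the error terms set to zero.
# All parameter values and age profiles below are hypothetical.


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def insurance(age, post65, pi):
    """Eq. (5): smooth age profile g(a) for two characteristics plus a jump pi at 65."""
    g = [0.90 + 0.001 * (age - 65), 0.10 + 0.002 * (age - 65)]
    return [gj + post65 * pj for gj, pj in zip(g, pi)]


def services(age, post65, pi, theta):
    """Eq. (6): S = h(a) + theta . z, with a linear stand-in for h."""
    h = 1.0 + 0.01 * (age - 65)
    return h + dot(theta, insurance(age, post65, pi))


def mortality(age, post65, pi, theta, lam):
    """Eq. (7): y = k(a) + lambda * S, with a linear stand-in for k."""
    k = 0.05 + 0.001 * (age - 65)
    return k + lam * services(age, post65, pi, theta)


def discontinuities(pi, theta, lam, age=65.0):
    """Jumps at the cutoff: switch Post65 from 0 to 1 holding age fixed,
    so the smooth terms g, h, and k cancel."""
    beta_s = services(age, 1, pi, theta) - services(age, 0, pi, theta)
    beta_y = mortality(age, 1, pi, theta, lam) - mortality(age, 0, pi, theta, lam)
    return beta_s, beta_y


pi = [0.08, 0.60]   # hypothetical jumps: any coverage, Medicare as primary payer
theta = [0.5, 0.1]  # hypothetical effects of each characteristic on services
lam = -0.02         # hypothetical effect of services on mortality

beta_s, beta_y = discontinuities(pi, theta, lam)
per_channel_y = [lam * t * p for t, p in zip(theta, pi)]  # lambda * theta_k * pi_k
print(beta_s, dot(theta, pi))       # (8a): equal up to floating-point rounding
print(beta_y, sum(per_channel_y))   # (8b): equal up to floating-point rounding
```

The point of the design is that only the terms multiplied by $\text{Post65}_i$ survive the differencing, which is why each channel enters the reduced form additively.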
31. This equation simplifies health care services to a single dimension. In fact, changes in insurance can cause the use of some types of services to rise and the use of other types of services to fall (or stay constant).

In this simplified setup, each element of the insurance package represents a separate “channel” that contributes additively to the reduced-form effects on services and mortality. For example, the $k$th element of $z_i$ contributes $\theta_k\pi_k$ to the RD in services
QUARTERLY JOURNAL OF ECONOMICS
and $\lambda\theta_k\pi_k$ to the RD in mortality. Unfortunately, we have no information on the individual components of $\theta$, and only limited information on the vector $\pi$ of insurance changes at age 65. For example, we do not observe secondary coverage, or whether the primary insurance is managed care. Nevertheless, it is possible to shed some light on the mortality effect associated with one key insurance characteristic: whether the patient has any insurance coverage at all. In particular, the maximum contribution of the “any coverage” channel cannot exceed $\pi_c$ (the jump in coverage at 65) times the average mortality rate of uninsured 64-year-olds, because extending coverage to the previously uninsured can at most reduce their mortality rate to 0. The average 7-day mortality rate of uninsured patients just under 65 years of age in our nondeferrable admission subsample is 0.05, whereas $\pi_c = 0.08$ (Table III, column (8)). Thus the maximum reduction in mortality attributable to the reduction in the number of people with no health insurance is $0.08 \times 0.05 = 0.004$, about 40% of the 7-day mortality effect we estimate. This is an extreme bound because it assumes that none of the previously uninsured would die if they were covered. A more plausible bound is that insurance coverage reduces the death rate by no more than one-half: in this case the “any coverage” channel can explain at most 20% of the total mortality effect.

In principle we can gain some additional insight by comparing changes in health insurance, the intensity of treatment, and mortality for different subgroups of patients.³² Unfortunately, the limited demographic variables in our discharge data make this a challenging exercise. Comparisons across race/ethnicity groups are uninformative, because the sample sizes for blacks (n = 41,000) and Hispanics (n = 66,280) are too small to obtain useful estimates.
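The bound on the “any coverage” channel is simple arithmetic, sketched below. The inputs $\pi_c = 0.08$ and the 0.05 uninsured 7-day mortality rate are taken from the text; the total 7-day mortality effect of 0.010 is an assumption chosen from the 0.8 to 1.0 percentage point range reported in the conclusions (with 0.008 instead, the extreme share would be 50%). The function name `coverage_channel_bound` is hypothetical.

```python
def coverage_channel_bound(jump_in_coverage, uninsured_mortality, total_effect,
                           max_relative_reduction=1.0):
    """Upper bound on the mortality reduction (and its share of the total RD)
    attributable to the 'any coverage' channel.

    jump_in_coverage: pi_c, the rise in the fraction insured at 65.
    uninsured_mortality: mortality rate of uninsured patients just under 65.
    total_effect: estimated total mortality reduction at 65 (positive number).
    max_relative_reduction: largest proportional cut in the uninsured rate
        that coverage could produce (1.0 means nobody newly covered dies).
    """
    max_reduction = jump_in_coverage * uninsured_mortality * max_relative_reduction
    return max_reduction, max_reduction / total_effect


# pi_c and the uninsured rate come from the text; the total 7-day effect of
# 0.010 is an assumed value within the reported 0.8-1.0 point range.
extreme, extreme_share = coverage_channel_bound(0.08, 0.05, 0.010)
plausible, plausible_share = coverage_channel_bound(0.08, 0.05, 0.010, 0.5)
print(f"extreme bound: {extreme:.3f} ({extreme_share:.0%} of the effect)")
print(f"half-reduction bound: {plausible:.3f} ({plausible_share:.0%} of the effect)")
```

With these inputs the two bounds come out at 0.004 (40%) and 0.002 (20%), matching the figures in the text.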
32. In particular, assume that $\pi$ varies by subgroup, with a value of $\pi(g)$ for subgroup $g$. If the parameters $\lambda$ and $\theta$ are constant across groups, then the discontinuity in services for group $g$ is $\theta\pi(g)$ and the discontinuity in mortality is $\lambda\theta\pi(g)$. By comparing the relative sizes of the discontinuities in insurance, treatment intensity, and mortality across subgroups, it is possible to judge whether the data are consistent with a “1-channel” explanation.

We also tried dividing patients into two groups based on the average fraction of 55–64-year-old patients from the same ZIP code who had no insurance coverage. Even here, we could not detect systematic differences in the changes in treatment intensity or mortality outcomes at age 65 between residents of “low-insurance” and “high-insurance” ZIP codes. We do find
significant increases in the number of procedures and significant reductions in mortality even for patients from the high-insurance ZIP codes, suggesting that an increase in insurance coverage per se is not the explanation for the impacts of Medicare. An alternative explanation for the measured mortality effects is that for most people Medicare imposes fewer restrictions than private insurance or Medicaid, leading to more (and possibly higher-quality) services for patients over 65 than for those under 65.³³ Card, Dobkin, and Maestas (2008) find clear evidence of this mechanism for a wide range of nonurgent medical procedures, such as surgery to insert a stent in a blocked coronary artery (which rises by 11% at age 65 in California, Florida, and New York), hip and knee replacement surgery (which rises by 23%), and gall bladder removal surgery (which rises by 18%). Consistent with this, Table IV presents evidence of small increases (3%–4%) in the number of procedures and in total list charges at age 65 for patients with nondeferrable conditions. Arguably, however, such small increases in the intensity of treatment are unlikely to generate a one-percentage-point reduction in mortality, though as we noted above, the increase in procedures for AMI cases, which may be more sensitive to medical intervention, is closer to 10%.³⁴ Thus, the precise mechanisms for the mortality effect remain unclear, though we believe the evidence points to a combination of channels.

33. This net effect is likely a mix of some people obtaining more generous coverage and others receiving less generous coverage relative to their pre-65 insurance plans. Even if Medicare is more generous with respect to case review procedures, it may be less generous on other dimensions, such as prescription drug coverage.

34. At the suggestion of a referee we looked at differences in the magnitude of the changes in treatment intensity at age 65 for patients from California counties with relatively high rates of managed care among nonelderly patients, versus patients from counties with relatively low rates of managed care. Specifically, we used data from the 1998 Area Resource File to split counties into two groups, based on whether the fraction of non-Medicare patients in HMOs was under or over 44% (the HMO penetration rate of the median individual’s county). A specification parallel to the RD model in column (6) of Table IV yields an estimated jump in log list charges of 3.6% (standard error 1.8%) for counties with above-median HMO penetration, and 2.4% (standard error 1.1%) for counties with below-median HMO penetration. Models for the RDs in the number of procedures show the opposite pattern across the two groups of counties, but the difference is not statistically significant.

VII. SUMMARY AND CONCLUSIONS

A longstanding question in health economics is whether health insurance affects health. This question is particularly
relevant for Medicare, the largest medical insurance program in the country, which provides nearly universal coverage to people once they turn 65. We focus on measuring the health effects of Medicare eligibility for a relatively sick population: specifically, people who are admitted to hospitals through the ED with diagnoses that have similar admission rates on weekdays and weekends. In contrast to elective hospitalizations, there is no jump in these “nondeferrable” hospital admissions at age 65. Moreover, the predicted mortality rate of admitted patients (based on demographics and admission diagnoses) trends smoothly through age 65. These findings suggest that the underlying health of patients admitted with nondeferrable conditions is very similar whether the patients are just under or just over 65. In light of this conclusion, we use a regression discontinuity approach to measure the impacts of reaching age 65 on the intensity of treatment in the hospital and on mortality for up to 2 years after the hospital admission. We find modest but statistically significant increases in measures of treatment intensity at age 65, including the number of procedures performed in the hospital and total list charges. Alongside these changes, we find a large reduction in patient mortality at age 65. Medicare eligibility reduces 7-day mortality by about 0.8 to 1.0 percentage points, with similar-sized and statistically significant reductions at windows of up to 9 months. We probe the robustness of these findings by using a bounding procedure to evaluate the lower-bound effect of Medicare eligibility on the entire hospital patient population. The bounds for the overall population are consistent with the magnitude of the effect we estimate for patients with nondeferrable conditions, lending further credence to our basic results.
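As an illustration of the kind of estimator involved, the sketch below computes a sharp RD jump at age 65 with a uniform-kernel local linear regression (separate linear age trends on each side of the cutoff). It is a generic textbook estimator and not necessarily the specification used in the paper; the bandwidth, the function name `rd_jump`, and the simulated data (a 6% death rate with an assumed true drop of 1 percentage point at 65) are all assumptions for illustration.

```python
import numpy as np


def rd_jump(age, y, cutoff=65.0, bandwidth=5.0):
    """Sharp RD estimate of the jump in E[y | age] at `cutoff`.

    Within `bandwidth` years of the cutoff, fits by OLS
        y = a + b*(age - c) + beta*post + d*post*(age - c),
    i.e. separate linear age trends on each side (a uniform-kernel local
    linear estimator), and returns beta.
    """
    age = np.asarray(age, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.abs(age - cutoff) <= bandwidth
    x = age[keep] - cutoff
    post = (x >= 0).astype(float)  # ages at or above the cutoff count as treated
    X = np.column_stack([np.ones_like(x), x, post, post * x])
    coef, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
    return coef[2]


# Simulated illustration (not the OSHPD data): a death rate trending smoothly
# in age, with an assumed true drop of 1 percentage point at 65.
rng = np.random.default_rng(0)
age = rng.uniform(60, 70, size=200_000)
p_death = 0.06 + 0.002 * (age - 65) - 0.010 * (age >= 65)
died = rng.random(age.size) < p_death
est = rd_jump(age, died)
print(f"estimated jump at 65: {est:.4f}")
```

Because the outcome is binary, this is a linear probability model; a real application would also need careful choices of bandwidth, polynomial order, and standard errors, none of which this sketch addresses.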
The magnitude of the estimated mortality effect of Medicare eligibility is too large to be driven solely by changes among the 8% of the patient population who move from no health insurance coverage to Medicare when they reach 65. This is an important distinction between our analysis and existing studies that have attributed much larger mortality gains to insurance status in specialized populations such as auto accident victims (Doyle 2005) and former Medi-Cal recipients (Lurie et al. 1984, 1986). Instead, our findings point to a more widespread effect of Medicare, including an impact on patients who were insured prior to 65. Given the relatively modest increases in the intensity
of treatment we measure at age 65, however, we conclude that the actual mechanism for this effect is unclear. An important limitation of our analysis is that it focuses on just one health outcome, albeit an important one. Certainly, Medicare might affect other dimensions of health and other patient populations, with effects that arise through channels other than hospitalization (e.g., outpatient care and prescription drug use) and that persist over a longer time horizon than that supported by our research design. Nonetheless, our analysis illustrates an important lesson for future research. Any plausible effect of insurance on health status in the general population will likely be small and easily confounded by selection effects in observational settings. Indeed, the only randomized health insurance experiment ever mounted found insignificant impacts of insurance on the health status of the overall population (Newhouse 1993). Further progress on this question will require research designs based on samples larger than those typically available for health services research, along with particular attentiveness to the selection problem.
APPENDIX I

We use annual hospital discharge files from the state of California for the period 1992–2002. These files represent a census of discharges from state-regulated acute care hospitals, collected by OSHPD, the Office of Statewide Health Planning and Development, under the California Code of Regulations, Title 22. The data set excludes people admitted to federally regulated hospitals such as VA hospitals. Our working sample includes people admitted to the hospital between January 1, 1992, and November 30, 2002, whose age at admission was between 60 and 70. The discharge files include the patient’s sex, race/ethnicity, ZIP code of residence, date of admission, source of admission, route of admission (ED or not), type of admission (scheduled or unscheduled), principal diagnosis, other diagnoses present at admission, a list of procedures performed, total list charges, expected payer code, disposition of patient (routine discharge, transfer to another hospital, etc.), and date of discharge. We obtained a restricted use version of the file that also includes the patient’s age in days at admission (calculated from the exact date of admission and
exact date of birth). The file also includes a unique patient identifier and an indicator for whether the patient provided a valid SSN. We also obtained from OSHPD a separate file, constructed by OSHPD by matching individuals in the discharge database to state death records by name/SSN/date of birth. We merged this file with the discharge records by patient identifier. In our analysis we combine all consecutive hospital admissions into a single record defined by the first date of admission (and by the descriptors associated with this first admission). We cumulate the length of stay, the number of procedures, and list charges across all stays.

UNIVERSITY OF CALIFORNIA, BERKELEY
UNIVERSITY OF CALIFORNIA, SANTA CRUZ
RAND CORPORATION
REFERENCES

Autor, David H., and Mark G. Duggan, “The Rise in the Disability Rolls and the Decline in Unemployment,” Quarterly Journal of Economics, 118 (2003), 157–205.
Canto, John G., William J. Rogers, Nisha C. Chandra, William J. French, Hal V. Barron, Paul D. Frederick, Charles Maynard, and Nathan R. Every, “The Association of Sex and Payer Status on Management and Subsequent Survival in Acute Myocardial Infarction,” Archives of Internal Medicine, 162 (2000), 587–593.
Card, David, Carlos Dobkin, and Nicole Maestas, “The Impact of Nearly Universal Insurance Coverage on Health Care Utilization and Health: Evidence from Medicare,” NBER Working Paper No. 10365, 2004.
Card, David, Carlos Dobkin, and Nicole Maestas, “The Impact of Nearly Universal Insurance Coverage on Health Care Utilization: Evidence from Medicare,” American Economic Review, 98 (2008), 2242–2258.
Cohen, Robin A., and Michael E. Martinez, “Impact of Medicare and Medicaid Probe Questions on Health Insurance Estimates in the National Health Interview Survey, 2005,” Report, Centers for Disease Control, National Center for Health Statistics, May 2007.
Currie, Janet, and Jonathan Gruber, “Health Insurance Eligibility, Utilization of Medical Care, and Child Health,” Quarterly Journal of Economics, 111 (1996a), 431–466.
Currie, Janet, and Jonathan Gruber, “Saving Babies: The Efficacy and Cost of Recent Changes in the Medicaid Eligibility of Pregnant Women,” Journal of Political Economy, 104 (1996b), 1263–1296.
Cutler, David, and Mark McClellan, “Is Technological Change in Medicine Worth It?” Health Affairs, 20 (2001), 11–29.
Cutler, David, Mark McClellan, and Joseph Newhouse, “The Costs and Benefits of Intensive Treatment for Cardiovascular Disease,” in Measuring the Prices of Medical Treatments, J. E. Triplett, ed. (Washington, DC: Brookings Institution Press, 1999).
Decker, Sandra, “Medicare and Inequalities in Health Outcomes: The Case of Breast Cancer,” Contemporary Economic Policy, 20 (2002), 1–11.
Decker, Sandra, and Carol Rapaport, “Medicare and Disparities in Women’s Health,” NBER Working Paper 8761, 2002.
Dobkin, Carlos, “Hospital Staffing and Inpatient Mortality,” UC Santa Cruz Unpublished Working Paper, 2003.
Dow, William H., “The Introduction of Medicare: Effects on Elderly Health,” Unpublished Working Paper, University of California, Berkeley, 2004.
Doyle, Joseph J., Jr., “Health Insurance, Treatment and Outcomes: Using Auto Accidents as Health Shocks,” Review of Economics and Statistics, 87 (2005), 256–270.
Fan, Jianqing, and Irene Gijbels, Local Polynomial Modelling and Its Applications (London: Chapman and Hall, 1996).
Finkelstein, Amy, and Robin McKnight, “What Did Medicare Do (and Was It Worth It)?” NBER Working Paper 11609, 2005.
General Accounting Office, “Hospital Emergency Departments: Crowded Conditions Vary among Hospitals and Communities,” Report GAO-03-460, U.S. General Accounting Office, 2003.
Hahn, Jinyong, Petra Todd, and Wilbert van der Klaauw, “Identification and Estimation of Treatment Effects with a Regression Discontinuity Design,” Econometrica, 69 (2001), 201–209.
Hiestand, Brian C., Dawn M. Prall, Christopher J. Lindsell, James W. Hoekstra, Charles V. Pollack, Judd E. Hollander, Brian R. Tiffany, W. Frank Peacock, Deborah B. Diercks, and W. Brian Gibler, “Insurance Status and the Treatment of Myocardial Infarction at Academic Centers,” Academic Emergency Medicine, 11 (2004), 343–348.
Horowitz, Joel L., and Charles F. Manski, “Identification and Robustness with Contaminated and Corrupted Data,” Econometrica, 63 (1995), 281–302.
Hsu, John, Mary Price, Richard Brand, G. Thomas Ray, Bruce Fireman, Joseph P. Newhouse, and Joseph Selby, “Cost-Sharing for Emergency Care and Unfavorable Clinical Events: Findings from the Safety and Financial Ramifications of ED Copayments Study,” Health Services Research, 41 (2006), 1801–1820.
Imbens, Guido, and Thomas Lemieux, “Regression Discontinuity Designs: A Guide to Practice,” Journal of Econometrics, 142 (2008), 615–635.
Lee, David, “Randomized Experiments from Non-random Selection in U.S. House Elections,” Journal of Econometrics, 142 (2008), 675–697.
Levy, Helen, and David Meltzer, “What Do We Really Know about Whether Health Insurance Affects Health?” in Health Policy on the Uninsured: Setting the Agenda, Catherine McLaughlin, ed. (Washington, DC: Urban Institute Press, 2004).
Lichtenberg, Frank R., “The Effects of Medicare on Health Care Utilization and Outcomes,” Frontiers in Health Policy Research, 5 (2001), 27–52.
Lurie, Nicole, Nancy B. Ward, Martin F. Shapiro, and Robert H. Brook, “Termination from Medi-Cal: Does It Affect Health?” New England Journal of Medicine, 311 (1984), 480–484.
Lurie, Nicole, Nancy B. Ward, Martin F. Shapiro, Claudio Gallego, Rati Vaghaiwalla, and Robert H. Brook, “Termination of Medi-Cal Benefits. A Follow-up Study One Year Later,” New England Journal of Medicine, 314 (1986), 1266–1268.
Manning, Willard G., “The Logged Dependent Variable, Heteroscedasticity, and the Retransformation Problem,” Journal of Health Economics, 17 (1998), 283–295.
McClellan, Mark, Barbara J. McNeil, and Joseph P. Newhouse, “Does More Intensive Treatment of Acute Myocardial Infarction in the Elderly Reduce Mortality? Analysis Using Instrumental Variables,” Journal of the American Medical Association, 272 (1994), 859–866.
McCrary, Justin, “Manipulation of the Running Variable in the Regression Discontinuity Design: A Density Test,” Journal of Econometrics, 142 (2008), 698–714.
McWilliams, J. Michael, Alan M. Zaslavsky, and John Z. Ayanian, “Use of Health Services by Previously Uninsured Medicare Beneficiaries,” New England Journal of Medicine, 357 (2007), 143–153.
McWilliams, J. Michael, Alan M. Zaslavsky, Ellen Meara, and John Z. Ayanian, “Impact of Medicare Coverage on Basic Clinical Services for Previously Uninsured Adults,” Journal of the American Medical Association, 290 (2003), 757–764.
Newhouse, Joseph P., Free for All? Lessons from the RAND Health Insurance Experiment (Cambridge, MA/London: Harvard University Press, 1993).
O’Grady, Kevin F., Willard G. Manning, Joseph P. Newhouse, and Robert H. Brook, “The Impact of Cost Sharing on Emergency Department Use,” New England Journal of Medicine, 313 (1985), 484–490.
Ruhm, Christopher J., “Are Recessions Good for Your Health?” Quarterly Journal of Economics, 115 (2000), 617–650.
Selby, Joseph V., Bruce Fireman, and Bix E. Swain, “The Effect of a Copayment on the Use of the Emergency Department in a Health Maintenance Organization,” New England Journal of Medicine, 334 (1996), 635–641.
Smith, Nicole M., Joseph S. Bresee, David K. Shay, Timothy M. Uyeki, Nancy J. Cox, and Raymond A. Strikas, “Prevention and Control of Influenza: Recommendations of the Advisory Committee on Immunization Practices,” Morbidity and Mortality Weekly Report (MMWR) (2006; http://www.cdc.gov/mmwr/preview/mmwrhtml/rr5510a1.htm).
Stagg, Viki, “CHARLSON: Stata Module to Calculate Charlson Index of Comorbidity” (2006; http://fmwww.bc.edu/repec/bocode/c/charlson.ado).
Von Wachter, Till M., “The End of Mandatory Retirement in the US: Effects on Retirement and Implicit Contracts,” University of California Berkeley, Center for Labor Economics Working Paper 49, 2002.
Wharam, J. Frank, Bruce E. Landon, Alison A. Galbraith, Ken P. Kleinman, Stephen B. Soumerai, and Dennis Ross-Degnan, “Emergency Department Use and Subsequent Hospitalizations among Members of a High-Deductible Health Plan,” Journal of the American Medical Association, 297 (2007), 1093–1102.
DO HIGHER PRICES FOR NEW GOODS REFLECT QUALITY GROWTH OR INFLATION?∗

MARK BILS

Much of Consumer Price Index (CPI) inflation for consumer durables reflects shifts to newer product models that display higher prices, not price increases for a given set of goods. I examine how these higher prices for new models should be divided between quality growth and price inflation based on (a) whether consumer purchases shift toward or away from the new models and (b) whether new-model price increases generate higher relative prices that persist through the model cycle. I conclude that two-thirds of the price increases with new models should be treated as quality growth. This implies that CPI inflation for durables has been overstated by almost 2 percentage points per year, with quality growth understated by the same magnitude.
I. INTRODUCTION

Much of economic growth occurs through growth in quality as new models of consumer goods replace older, sometimes inferior, models. Moulton and Moses (1997) estimate that BLS methods allowed for perhaps as much as 1 percent average quality growth in goods in 1995. It is often argued, however, that BLS methods miss much of the growth in goods’ quality.¹ I employ the Consumer Price Index (CPI) Commodities and Services Survey, the microdata underlying the CPI, to show that the introduction of new models of durable goods generates large increases in unit prices. How these price increases are attributed to quality growth versus CPI inflation dramatically affects measured price inflation for durables. I present evidence that these price increases should largely be treated as quality growth. I conclude that price inflation for durables has been overstated by nearly 2 percentage points per year. Because most consumption deflators for the National Income and Product Accounts are based on BLS measures of CPI inflation, any measurement error in CPI inflation will lead to an opposite error in rates of real growth in consumption and productivity. Thus my findings imply that measured consumption and productivity growth for consumer durables has been understated by almost 2 percentage points per year.²

∗ This work was conducted while I was visiting the Bureau of Labor Statistics (BLS) under an Intergovernmental Personnel Act (IPA) agreement. A number of people at the BLS have been extremely helpful. I particularly thank David Johnson, Robert McClelland, Teague Ruder, Paul Liegey, Michael Hoke, and Walter Lane. Any interpretations presented are my own and should not be associated with the BLS. I also thank Mark Aguiar, Gordon Dahl, Pete Klenow, the editor, and three referees for their comments; this paper originated in discussions with Pete Klenow. The research was supported by a grant from the National Science Foundation.

© 2009 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, May 2009

1. Hausman (2003) and Pakes (2003) are two prominent examples. Shapiro and Wilcox (1996) review much of the previous evidence. The Boskin Commission Report (1996) suggests that the BLS overstates inflation by perhaps 1% per year. Unmeasured growth in the quality of goods is put forth as the most important component, contributing an overstatement of inflation of 0.6% per year overall and 1.0% per year for durables. But these estimates are based on examining a fairly limited set of goods and allow for no bias for a number of goods.

In the next section, I show that from January 1988 to December 2006 unit prices for consumer durables, excluding computers, increased at an annual rate of 2.5%. For the most part, the prices collected by the BLS can be compared to the price of the same item priced in a previous month at the same outlet. But for durables these matched items display a very different rate of price change from the overall rate of 2.5%, averaging deflation of 3.7% per year. The gap of more than 6 percentage points between unit-price inflation of 2.5% and this matched-model rate of −3.7% reflects changes in the models being priced. At scheduled rotations the BLS draws a new sample of goods (and outlets) to better reflect current spending. In addition, an outlet may stop selling the priced item, forcing the BLS agent to substitute another model. Both scheduled rotations and forced substitutions create shifts to models that are typically newer to the market and higher priced. For durables I find that scheduled rotations generate increases of a little over 2% annually in unit prices, while forced substitutions, which occur much more frequently, generate increases of nearly 4% annually. The most important contributor to price increases with forced substitutions for durables, weighting by goods’ expenditure shares, is the model-year turnover for vehicles. These shifts in price quotes from last year’s models to newer models generate price increases averaging more than 5% per year.
How to allocate the price increases that accompany model changes between quality growth and inflation is an open question. In any month, a large majority of collected quotes follow prices of the same model. Because quality is fixed for these matched items, their rate of price increase provides one natural quality-adjusted measure of price inflation. As noted above, this rate has
2. For example, multifactor productivity growth for motor vehicles and other transportation equipment (SIC 37) is estimated by the BLS to be only about 0.5% per year for my sample period. My estimates suggest that productivity growth for vehicles has been understated by many times this.
DO HIGHER PRICES REFLECT QUALITY GROWTH?
averaged −3.7% since 1988. Adopting this as the measure of price inflation therefore implies that durables, even excluding computers, have exhibited a dramatic rate of quality growth of just over 6% per year (equal to the excess of unit-price inflation over this −3.7% rate of price inflation). Pakes (2003) suggests that even this is likely to understate quality growth because goods that exit the market are obsolete and, absent the substitution, were likely to experience a relative fall in price. By contrast, Triplett (1997), among others, has argued that sellers use periods of model turnover to raise prices by more than is justified by quality improvements. In principle, hedonic pricing equations as developed by Adelman and Griliches (1961) and Griliches (1961) might be used to split goods’ rates of unit-price inflation between quality growth and true declines in purchasing power. But in practice the detailed data on product characteristics that this requires are typically not collected.³ The BLS treats price increases at forced substitutions very differently from those at scheduled rotations, even though they reflect the same economic phenomenon: newer versions of goods sell at higher prices. The higher prices across scheduled rotations are not incorporated in measured inflation, so these price increases are implicitly treated as quality growth. By contrast, price increases accompanying forced substitutions, which are nearly twice as important, are largely attributed to CPI inflation, not quality growth. I calculate that BLS methods attributed less than one-fifth of the price increases from forced substitutions, only 0.7 percentage point per year, to quality growth, with 3.1 percentage points attributed to inflation.
Had the BLS instead measured inflation using only price changes for models available in consecutive periods, measured CPI inflation for these goods would have been 3.1 percentage points per year lower, and measured quality growth correspondingly 3.1 percentage points per year higher.

3. Hausman (2003) discusses practical limitations of hedonics given that the analyst typically observes only a small number of relevant characteristics. He notes that the shadow prices of characteristics are notoriously unstable and sometimes even appear perverse in sign. For instance, hedonic equations for automobiles can exhibit a negative coefficient for fuel efficiency, presumably reflecting a negative correlation between fuel efficiency and unmeasured quality. The use of hedonic price equations by the BLS in constructing the CPI has been fairly limited. Computer equipment is one good where data on several relevant characteristics are collected (e.g., RAM, processor speed) and hedonic prices play an important role. But even here, hedonics are employed at substitutions only if it is possible to match brand and all but a small number of characteristics to the base-period product.
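The allocation arithmetic discussed above can be reproduced in a few lines. In this sketch the unit-price (2.5%) and matched-model (−3.7%) rates and the 0.7/3.1 split at forced substitutions come from the text; the rotation contribution is backed out as a residual (about 2.4 points), which is an assumption consistent with the “little over 2%” figure reported earlier rather than a number stated directly. All rates are in percentage points per year.

```python
# Annual rates (percentage points) for durables excluding computers, 1988-2006.
unit_price_inflation = 2.5
matched_inflation = -3.7
forced_sub_quality = 0.7     # BLS-attributed quality growth at forced substitutions
forced_sub_inflation = 3.1   # BLS-attributed inflation at forced substitutions

# Total price increase associated with new models (rotations + substitutions).
new_model_increase = unit_price_inflation - matched_inflation
forced_total = forced_sub_quality + forced_sub_inflation
# Assumption: the rotation contribution is whatever is left over.
rotations = new_model_increase - forced_total

# BLS treatment: rotation increases are implicitly quality; only 0.7 of the
# forced-substitution increases are. Matched-model treatment: all of it is quality.
bls_quality = rotations + forced_sub_quality
matched_quality = new_model_increase

print(f"share of forced-substitution increase treated as quality: "
      f"{forced_sub_quality / forced_total:.2f}")
print(f"quality growth, BLS treatment: {bls_quality:.1f}")
print(f"quality growth, matched-model treatment: {matched_quality:.1f}")
print(f"gap: {matched_quality - bls_quality:.1f}")
```

The gap equals the 3.1 points that BLS methods attribute to inflation at forced substitutions, which is the counterfactual reduction in CPI inflation described in the text.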
This BLS treatment of price changes at forced substitutions implies that goods with model changes raise their prices dramatically, net of quality growth, relative to goods that do not change models. If consumer demands across competing goods do not shift systematically toward those exhibiting model changes, then these price increases, net of quality change, should cause the goods with model changes to lose market share. I test this prediction in Section III. Using Ward’s Auto data for vehicles and scanner data for consumer electronics, I examine growth in market shares for goods with versus without model substitutions. These goods generate more than 80% of the price increases from model substitutions for durables excluding computers. For all goods I find that market share increases with model turnover. The finding that goods with new models gain market share suggests that, on average, the price increases with model changes, and possibly more, should be attributed to increased quality. This corresponds to quality growth for durables of 6% per year, 3 percentage points greater than measured by BLS methods. Interpreting increased market share for new goods as a relative price decline assumes that relative demands are stable across model substitutions. If the demand for new cars or other durables has a fashion-like component, with the good valued for its time on the market separately from its other qualities, then a new model could exhibit increases in both relative price and market share even if it were not improved. Similarly, intertemporal price discrimination across buyers could generate price declines over a durable’s model cycle that are reversed with the next model cycle. I test for the importance of these model-cycle demands in Section IV by examining whether the size of price increases for new models persists across the model cycle.
I find that, for the most part, price changes at model substitutions persist in explaining the value of the good throughout its model cycle. There is, however, some important regression, especially for price increases that the BLS labels price inflation rather than quality change. Allowing for these factors, I calculate that about a third of the price changes with model substitutions might reflect transitory demand increases over the model cycle, with two-thirds persistently valued as quality. Even so, this implies that average quality growth for durables has been understated by nearly 2 percentage points per year.
II. NEW-MODEL PRICE CHANGES

To calculate the CPI the BLS tracks the prices of about 90,000 nonhousing goods and services each month.⁴ These prices are contained in the CPI Commodities and Services Survey, which is the primary source of data employed in the empirical work to follow. The goods followed change for two principal reasons. At scheduled rotations, roughly every four years, the BLS draws a new sample of stores and products within a geographic area to better reflect current spending.⁵ In addition, a store may stop selling the particular product being priced; the BLS agent then substitutes another model of that brand or a similar product. These (forced) substitutions occur about once every three to four years on average across all nonhousing CPI items, and much more frequently, nearly once per year, for consumer durables.

I first show that the price changes associated with new models, those from both scheduled rotations and forced substitutions, contribute greatly to goods’ rates of increase in unit prices. For this reason, how these unit-price changes are divided between quality growth and price inflation has a dramatic impact on overall measured growth. The balance of the section discusses how BLS methods have divided these price changes between quality change and price inflation. I compare this division to the alternative of measuring price inflation based purely on price changes for goods that experience no model change. Unit-price inflation can be broken into the contribution from price changes at scheduled rotations plus the contribution from following a particular price quote within a rotation. In turn, the latter component reflects the rate of inflation for continuously followed (matched) models plus the systematically higher unit-price increases at forced substitutions due to model changes.
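$$\pi_{\text{unit price}} = \pi_{\text{rotations}} + \pi_{\text{within-rotations}} = \pi_{\text{rotations}} + \pi_{\text{matched}} + s\left(\pi_{\text{changes}} - \pi_{\text{matched}}\right) \qquad (1)$$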
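A minimal Python sketch of the accounting in equation (1) follows; it is an illustration, not the author's code. The function name `decompose_unit_price_inflation` and the inputs are hypothetical, rates are assumed to be annual percentages, and s is the share of quotes with forced substitutions. The sketch also returns π_unit price − π_matched, the part of unit-price growth tied to new models (rotations plus forced substitutions), which is the quality-growth measure implied by basing inflation only on matched models.

```python
def decompose_unit_price_inflation(pi_rotations, pi_matched, s, pi_changes):
    """Split unit-price inflation into the pieces of equation (1).

    All rates are annual percentages; s is the share of quotes with
    forced substitutions (a fraction between 0 and 1).
    """
    # Extra unit-price inflation from forced substitutions: s * (pi_changes - pi_matched)
    excess_forced = s * (pi_changes - pi_matched)
    # Inflation within a sample rotation: matched models plus the forced-substitution excess
    pi_within = pi_matched + excess_forced
    # Unit-price inflation adds the price changes across scheduled rotations
    pi_unit = pi_rotations + pi_within
    return {
        "excess_forced": excess_forced,
        "within_rotations": pi_within,
        "unit_price": pi_unit,
        # Price increases tied to new models: pi_unit - pi_matched
        "new_models": pi_unit - pi_matched,
    }


# Hypothetical inputs chosen only to illustrate the arithmetic
parts = decompose_unit_price_inflation(pi_rotations=2.0, pi_matched=-4.0, s=0.125, pi_changes=28.0)
for name, value in parts.items():
    print(f"{name:>16}: {value:5.1f}")
```

With these illustrative inputs the forced-substitution term adds 4.0 points, so within-rotation inflation is 0.0, unit-price inflation is 2.0, and new models account for 6.0 points of unit-price growth relative to matched models.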
4. Prices are collected from about 22,000 outlets in 87 Primary Sampling Units across 45 geographic areas. About half of the goods are priced monthly, with the others priced bimonthly. The BLS chooses outlets probabilistically based on household point-of-purchase surveys (POPS), and chooses items within outlets based on estimates of their relative sales. The BLS sampling methods are described in detail by Armknecht, Lane, and Stewart (1997) and the BLS Handbook of Methods (U.S. Department of Labor, 1997).
5. These rotations occurred every 5 years historically, including much of my sample period. The BLS has moved to even more frequent sample rotations for consumer electronics.
QUARTERLY JOURNAL OF ECONOMICS
The component πrotations reflects both the share of quotes experiencing rotations and the percentage increase in price between the old and new quotes. I express the extra unit-price inflation associated with forced model changes as the share of quotes with forced model substitutions (s) multiplied by the excess inflation rate at these substitutions (πchanges − πmatched). Subscripts denoting a particular category of good and time period are implicit. I examine the importance of each of these components for fifty separate categories of consumer durables using information from the CPI Commodities and Services Survey for January 1988 through December 2006. The fifty categories of consumer durables are listed in Table I. I focus on durables because forced substitutions are much more frequent for durables and because I am able to examine data on quantities for vehicles and a number of other durables. Goods with a strong seasonal fashion cycle exhibit large price reductions as the seasons change. I exclude apparel and other goods with important seasonal or fashion cycles, such as motorboats or entertainment CDs, to limit the importance of these fashion changes. I also exclude used vehicles.6 I observe a total of 987,086 price changes within sample rotations for the fifty durables: 95% reflect changes over one or two months (the average duration is 1.7 months); 90.7% of price quotes follow the same model version, whereas 9.3% are forced substitutions. For the remainder of the paper, I present weighted statistics. In each year a category is weighted by its expenditure share as measured from the Consumer Expenditure Surveys.7 The second column of Table I

6. The BLS does not collect price information on used vehicles. Price information in the CPI for used cars and trucks comes from the National Automobile Dealers Association Official Used Car Guide. These prices are adjusted for estimated depreciation.
Because prices are not collected at outlets, there are no forced substitutions. The sample is updated by one model year each fall to maintain the same ages of vehicles over time. These updates create increases in unit prices. The quality adjustments made for used vehicles reflect the same rates of quality adjustments that were made for those vehicles by the BLS when the vehicles were priced as new models. So to the extent this paper concludes that the BLS understates quality growth for new vehicles, that result can be translated precisely one-to-one to used-car prices. The expenditure share for used vehicles is about 30% that of new vehicles. The results combining all durables would suggest modestly more growth if this additional weight is given to vehicles. In particular, the conclusion in Section IV that quality has increased by 2.5% per year would increase to 2.7%.

7. Expenditure shares by category were obtained from the BLS for each year for 1988 to 1995 and for 1999 to 2004. The CPI reflects weights that are only periodically updated. For instance, in 2002 the CPI weights began reflecting expenditures by category from the 1999–2000 period, replacing weights that reflected expenditures during 1993–1995. (Since 2002 the BLS has updated more frequently.) Related to this, and to a revision in expenditure categories that occurred between 1997 and 1998, it was not possible to obtain disaggregate expenditure shares for
TABLE I
DURABLE GOODS STUDIED

Good | Spending share (% of CPI)
Watches | 0.069
Jewelry | 0.416
Personal computers & equipment | 0.370
Telephone & equipment | 0.080
Calculators, typewriters, etc. | 0.014
Electric personal care products | 0.021
Luggage | 0.027
Infant’s equipment | 0.018
Curtains & drapes | 0.064
Window coverings | 0.053
Mattresses & springs | 0.146
Bedroom furniture | 0.193
Sofas & slipcovers | 0.276
Living room chairs | 0.122
Living room tables | 0.057
Kitchen & dining room furniture | 0.162
Infant’s furniture | 0.025
Occasional furniture | 0.148
Refrigerators & home freezers | 0.083
Washers & dryers | 0.103
Stoves | 0.030
Microwaves | 0.029
Vacuums | 0.064
Small kitchen appliances | 0.034
Other electric appliances | 0.079
Lamps & lighting | 0.040
Clocks & decorative items | 0.325
Dishes | 0.083
Flatware | 0.014
Nonelectric cookware | 0.039
Tableware & nonelectric kitchenware | 0.057
Power tools | 0.058
Misc. hardware | 0.096
Nonpowered hand tools | 0.026
Medical equipment for general use | 0.011
Supportive & convalescent equipment | 0.031
Televisions | 0.246
Other video equipment | 0.104
Audio equipment | 0.164
Bicycles | 0.044
General sports equipment | 0.229
Hunting, fishing, & camping equipment | 0.086
Photography equipment | 0.057
Sewing machines | 0.044
Musical instruments & accessories | 0.069
New cars | 3.265
Pickups & vans | 1.992
New motorcycles | 0.072
Tires | 0.270
Other vehicle equipment accessories | 0.212

Source. Data from CPI Commodities and Services Survey.
provides the average annual spending share for each category for 1988 to 2006. The combined share for the fifty goods is 10.3% of the CPI, with vehicles making up about half of this. Weighting increases the share of forced substitutions to 12.5%. Panel A of Table II breaks down unit-price changes according to equation (1). I calculate the overall rate of increase in unit prices as follows. I first construct the average price (in logs) for each year for each category of good, then calculate its annual rate of growth. For each year, I then construct an average growth rate by weighting each category’s rate by its expenditure share for that year. The overall average is then given by averaging these annual averages over the nineteen years of data. Looking at column (1), unit prices for durables (excluding computers) increased at a rate of 2.5% per year for January 1988 to December 2006. BLS treatment of price changes associated with model changes for computers markedly differs from other durables. The statistics to follow are for the 49 durables excluding computer equipment, with results for computers presented separately. This 2.5% unit-price inflation can be broken into contributions from price changes across scheduled rotations and the rate

1996 to 1998 or after 2004. I employ the 1995 relative shares for 1996 and 1997, 1999 shares for 1998, and 2004 shares for 2005 and 2006. I also examined results employing fixed weights by category, based on 1995 expenditure shares. For the most part, results are very similar to those reported here. With fixed weights, scheduled rotations are associated with less unit-price inflation for electronics and computers. In constructing mean inflation rates at the category and year levels, I weight the price quote’s inflation rate by the duration it covers (usually one or two months). I exclude price changes that are measured, due say to repeated stockouts, over a period of more than six months.
The BLS selects outlets proportionally to their importance in a somewhat wider product category than an Entry Level Item (ELI), for instance, based on men’s clothing, not the specific ELI men’s shirts. In constructing ELI-level statistics, I weight by the percentage of sales within the broader category at the outlet corresponding to that ELI. The BLS refers to this as the percent of POPS category.
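The averaging procedure described above (category growth rates from average log prices, weighted by expenditure shares within each year, then averaged across years) can be sketched as follows. The function `average_annual_growth` and the two-category data are hypothetical; normalizing by the total share of categories observed in a year is an assumption of this sketch, and the duration weighting and six-month exclusion described in footnote 7 are omitted.

```python
def average_annual_growth(log_prices, shares):
    """Expenditure-weighted average annual growth in unit prices, in percent.

    log_prices: {category: {year: average log price for that category-year}}
    shares:     {year: {category: expenditure share in that year}}
    """
    annual_averages = []
    for year in sorted(shares):
        weighted_sum = 0.0
        total_weight = 0.0
        for category, weight in shares[year].items():
            series = log_prices[category]
            if year in series and year - 1 in series:
                # Annual growth rate of the average log price, in percent
                growth = 100.0 * (series[year] - series[year - 1])
                weighted_sum += weight * growth
                total_weight += weight
        if total_weight > 0:
            # Weight each category's growth by its expenditure share for that year
            annual_averages.append(weighted_sum / total_weight)
    # Overall average: a simple mean of the annual averages
    return sum(annual_averages) / len(annual_averages)


# Two hypothetical categories and two years of growth
log_prices = {
    "A": {1988: 0.00, 1989: 0.02, 1990: 0.05},
    "B": {1988: 0.00, 1989: -0.01, 1990: -0.01},
}
shares = {1989: {"A": 3.0, "B": 1.0}, 1990: {"A": 1.0, "B": 1.0}}
print(round(average_annual_growth(log_prices, shares), 3))
```

Here the share-weighted growth is 1.25% in 1989 and 1.5% in 1990, so the function returns their mean, 1.375.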
TABLE II
ANNUAL PRICE INFLATION AND QUALITY GROWTH, JANUARY 1988 TO DECEMBER 2006

Row | Durables excluding computers (1) | Cars, vans, trucks, SUVs (2) | Video, audio, telephones (3) | Computers and equipment (4)
Panel A (%)
(1) π_unit price = (2) + (3) | 2.5 | 3.6 | −0.6 | −0.1
(2) π_sample rotations | 2.3 | 2.1 | 4.3 | 2.9
(3) π_within-rotations = (4) + (5) | 0.2 | 1.4 | −4.9 | −2.9
(4) π_matched | −3.7 | −3.8 | −10.9 | −19.6
(5) s(π_forced − π_matched) | 3.8 | 5.2 | 6.0 | 16.7
Panel B (%)
Quality for forced (BLS) | 0.7 | 0.7 | 3.2 | 17.8
Quality for forced (BLS) + π_sample rotations | 2.9 | 2.8 | 7.5 | 20.7
π_unit price − π_matched | 6.1 | 7.4 | 10.3 | 19.6
Panel C
Substitution rate (%) | 11.8 | 15.3 | 15.0 | 31.3
Number of quotes | 966,242 | 170,480 | 114,450 | 20,844
Source. Data are from CPI Commodities and Services Survey.
of growth in unit prices within a rotation sample (Table II, rows (2) and (3)). The rate of price increases within sample rotations is first calculated separately by category and year, then aggregated based on expenditure shares. The average annual rate of price increase within rotations equals 0.2%. This implies that almost all of the average growth rate in unit prices for the 49 durables, 2.3 percentage points per year, reflected price increases across sample rotations. But the small average rate of price inflation of 0.2% within rotations hides big differences across price quotes without and with forced substitutions (rows (4) and (5)). Absent model substitutions, the average price change was quite negative, translating into an inflation rate averaging −3.7% per year. By contrast, across forced substitutions unit prices increase on average by 4.2%. Although these forced substitutions constituted only 12% of the price quotes, their price increases relative to inflation for matched models added 3.8% annually to inflation in unit prices. Price increases with new models therefore add 6.1% per year to unit prices: 2.3% from sample rotations and 3.8% from forced substitutions. How these price increases are divided between price inflation and quality change dramatically affects measured growth for durables. How do BLS methods treat the
components of price changes? Price inflation for the goods without model changes, −3.7%, is obviously treated as part of CPI inflation. The price increases associated with sample rotations, 2.3% per year, are implicitly treated as quality growth. For time t, prices are collected for both the outgoing and incoming samples. The rate of inflation for t − 1 to t is based on price changes for the outgoing sample; the rate for t to t + 1 is based on price changes for the incoming sample. Because there is no direct comparison of prices across the two samples, price increases from sample rotations have no impact on measured inflation. By contrast, for many forced substitutions the BLS does compare prices across the old and new versions. The BLS treats forced substitution by several methods. These are described in detail by Armknecht and Weyback (1989). The Appendix shows the prevalence of each procedure and associated price increases. For more than a third of substitutions the new models were judged strictly comparable to the former ones, with quality growth set to zero. These substitutions had new-model prices that were 2.7% higher on average, with this entirely allocated to CPI inflation. The other common method for durables, also employed in over a third of substitutions, is to make a quality adjustment based on certain characteristics of the old and new models. New models averaged 4.5% higher prices for these substitutions. Despite the quality adjustments, by my calculations only one-ninth of this 4.5% was attributed to quality growth, with most attributed to inflation. One-sixth were treated with less direct quality adjustments; these exhibited 7.4% higher prices after substitution, with only one-third of this allocated to quality growth. Finally, for 10% of forced substitutions the BLS omitted the price change in calculating CPI inflation. 
This parallels the treatment of price changes with sample rotations, with these quotes implicitly assigned the CPI inflation for other quotes in that category. These averaged price increases of 2.3%, with all of this attributed to quality growth. Putting these together I calculate that, of the price increases of 3.8% per year from forced substitutions, BLS methods attributed only 0.7 percentage points to quality growth, with the balance attributed to CPI inflation. Measured quality growth depends critically on this treatment as illustrated in Panel B of Table II. BLS methods result in measured quality growth of 2.9% per year. But much of this reflects the price increases of 2.3% per year from sample rotations. The quality growth attributed by the
BLS to the newer goods introduced with forced substitutions contributes less than one-fourth of the 2.9%. By contrast, suppose price inflation were based solely on the rate of price changes for goods without model changes. This treatment has intuitive appeal. When products are replaced, it treats the increased price for the newer model, relative to increases for matched models, as a measure of quality change. With this measure of inflation, quality growth would have averaged 3.2 percentage points higher, at 6.1% per year. Growth at 2.9% per year implies that quality of durables increased by about 70% over the nineteen-year period, January 1988 to December 2006. But, if growth had been 6.1% per year, the cumulative increase over the nineteen years would exceed 200%.

Based on the analysis of market shares to follow, I separate out two sets of durables: (i) vehicles and (ii) consumer electronics (video, audio, and telephone equipment). This is done in columns (2) and (3) of Table II. Both vehicles and consumer electronics display frequent forced substitutions, each with rates of 15% compared to 7% for the balance of the 49 durables. Forced substitutions generated unit-price increases of 5.2% per year for vehicles; BLS methods attributed only a small part, 0.7 percentage points per year, to quality growth.8 Forced substitutions generated 6.0% annual increases in unit prices for consumer electronics; I calculate that BLS methods attributed just over half of this, 3.2 percentage points, to quality growth. If we consider measuring price inflation based on the inflation rate for goods without model changes, the implied rates of quality growth would be higher by 4.5 percentage points per year for vehicles and by 2.8 percentage points for electronics.

These calculations omit computer equipment. Column (4) of Table II presents results for computers. The substitution rate for computing equipment is 31%. The matched-model rate of inflation is very negative, −20% per year.
But prices jump up greatly with substitutions, adding nearly 17% to annual growth in unit prices, so that unit prices decline within rotations by only 3% per year. Unlike other durables, price increases with substitutions do not translate into CPI inflation. For computers the BLS often uses

8. Table II reports a matched rate of inflation of −3.8% per year for vehicles. Based on data from J.D. Power and Associates for model years 1999 to 2003, Corrado, Dunn, and Otoo (2006) report a rate of price change for vehicles within the model year of about −6%. Most of this difference reflects sample period: for years 1999 to 2003, I find a matched-model rate of −5.6%. The small remaining difference may largely reflect that the J.D. Power data, unlike the CPI data, do not control for changes in how the vehicle is equipped with options.
hedonic adjustments or omits price changes across model changes. My calculations show that BLS methods imputed about 18% annual quality growth from substitutions for computing equipment, slightly more than the associated price increases of 17%. So the measured rate of quality growth would actually have been lower by about 1 percentage point if the BLS had based CPI inflation just on matched-model price changes. Although computing equipment is a small share of the CPI for these durables, its inclusion more than doubles quality growth from forced substitutions using BLS methods, from 0.7 to about 1.5% per year.

III. GROWTH IN MARKET SHARE WITH NEW GOODS

The price increases with new models caused unit prices for durables to increase by 6% per year relative to the rate of price change for matched models, that is, those without model changes. So if we measure price inflation simply by the rate of price increase for the matched models, we would infer that quality growth averaged 6%. As just discussed, BLS methods do not take this approach, instead attributing much of the price increases with forced substitutions to price inflation. As a result, measured inflation is higher by 3 percentage points per year, with quality growth reduced by the same amount. Note that BLS methods imply that goods experiencing forced substitutions exhibit large price increases relative to the matched models. If we assume that goods’ demands are decreasing in relative price, and relatively stable across substitutions, then this predicts that consumers will substitute away from goods with model changes. In this section, I test this prediction for vehicles and consumer electronics. I find that consumer purchases move toward the models experiencing model changes, suggesting that the rate of price inflation for matched models does not understate inflation. The goal is to measure inflation, allowing for possible quality changes. Let $\Pi$ be the rate of price inflation for a particular category of goods. (Indices for the goods category and time period are implicit.) The rate of price inflation is a weighted average of inflation rates for distinct product models within the category, $\Pi = \sum_{i=1}^{N} s_i \Pi_i$, where $i$ indexes models, each weighted by $s_i$ to reflect its expenditure share. Number the models so that the first $M$ correspond to matched models, those without model substitutions; $M + 1$ to $N$ are those with substitutions. Let $\Pi_{\text{matched}}$ equal the average inflation rate for matched models. Inflation can be
expressed in terms of this inflation rate for matched models plus any differential inflation for models with substitutions:

$$\Pi = \Pi_{\text{matched}} + \sum_{i=M+1}^{N} s_i\left(\Pi_i - \Pi_{\text{matched}}\right) \qquad (2)$$
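Equation (2) is an accounting identity, and a short sketch makes the condition behind it explicit: it holds when the shares sum to one and Π_matched is the share-weighted mean inflation rate of the matched models (an assumption of this sketch about how that average is formed). The function name `decompose_inflation` and the three-model example are hypothetical.

```python
def decompose_inflation(rates, shares, substituted):
    """Check equation (2): Pi = Pi_matched + sum over substituted models of s_i * (Pi_i - Pi_matched).

    rates:       inflation rate for each model (percent)
    shares:      expenditure share of each model (assumed to sum to 1)
    substituted: True for models M+1..N that have a model substitution
    """
    # Share-weighted inflation for the whole category
    total = sum(s * r for s, r in zip(shares, rates))
    # Pi_matched: share-weighted mean inflation over models without substitutions
    matched_weight = sum(s for s, sub in zip(shares, substituted) if not sub)
    pi_matched = sum(s * r for s, r, sub in zip(shares, rates, substituted) if not sub) / matched_weight
    # Differential inflation contributed by models with substitutions
    differential = sum(s * (r - pi_matched) for s, r, sub in zip(shares, rates, substituted) if sub)
    return total, pi_matched, differential


total, pi_matched, differential = decompose_inflation(
    rates=[-4.0, -2.0, 5.0], shares=[0.5, 0.3, 0.2], substituted=[False, False, True]
)
print(total, pi_matched + differential)
```

Both printed values equal −1.6 up to floating-point rounding: Π_matched is −3.25, and the substituted model, with a 0.2 share and inflation of 5.0, adds 0.2 × 8.25 = 1.65. If that model's inflation were no higher than Π_matched, the differential term would be zero or negative, which is why matched-model inflation understates Π only when substituted models systematically display higher inflation.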
Consider measuring inflation purely by matched models’ price changes: would this understate inflation and overstate quality growth? The answer is “no,” unless models with substitutions systematically display higher inflation. To test for that scenario, I examine what happens to the market share, in physical units, for models with versus without model substitutions. The intuition is straightforward. If the inflation rate for goods with substitutions exceeds that for matched products, then we expect their market share to fall as some consumers will be driven toward the goods without model changes. Hausman (2003) argues that constructing a price index requires data on quantities, so that price changes accompanying quality changes can be based on a good’s change in quantity demanded divided by its price elasticity of demand. The approach here is directly related, but more conservative in that it provides a bound, rather than an estimate, of quality growth.9 But it has an important practical advantage in not requiring knowledge of goods’ demand elasticities. Furthermore, despite only providing a bound, it suggests annual growth that is dramatically higher, by more than 3 percentage points per year, than growth measured under BLS methods. The key identifying assumption is that relative shifts in demands across competing models are orthogonal to the timing of

9. In Bils (2004), I consider a discrete-choice model for purchasing a durable. There I explicitly tie an estimate of price inflation for goods with quality changes to the inflation rate for matched models and the change in market share for the goods with versus without substitutions. But, as in Hausman (2003), this hinges on knowing elasticities of substitution across durable models that are not readily estimated. There I follow the literature on discrete choice across differentiated models (e.g., Anderson, de Palma, and Thisse [1992]) in assuming that consumer demands are defined symmetrically over the competing models.
In discrete-choice models all competing models are gross substitutes. So, if we see consumers moving toward the models that change, then this implies that their relative price has decreased. Without symmetry across models, there is a caveat. Suppose there are three competing brand models—A, B, and C—but A competes only with B for consumers, and B only with C. Suppose C experiences a substitution. Then it would be possible to construct a scenario where market share rises for C, because its price falls relative to B; but its price does not fall relative to a weighted average of A and B. Given the findings to follow that, averaging over many products and over many months, goods with model substitution increase market share, this caveat should not seriously qualify the conclusions.
model substitutions. Introducing a new model of a good presumably takes time to implement. This suggests that model substitutions are plausibly predetermined with respect to innovations to relative product demands for that month. Of greater concern is that demands shift systematically in response to the product cycle. The discussion here assumes that a good is valued based on its characteristics, separately from how long the product has existed on the market. If consumers place a value on consuming a new-to-the-market product, everything else equal, then this can violate the assumption that shifts in model demands are orthogonal to the timing of product substitutions. As an example, consider novels. People may prefer to read a novel shortly after its arrival on the market, perhaps because they wish to discuss the book with others currently reading it (or avoid hearing it discussed before they read it). New releases might sell more copies even if their prices are higher, but we would not want to infer from this that novels are getting better and better. I allow for this in two ways in the empirical work. First, I exclude goods, including all apparel, that are likely to have an important seasonal or fashion cycle. Second, I test for this fashion component by observing whether larger price increases with substitutions fail to persist across the product cycle. Such a sawtooth pattern in prices is predicted by the fashion story. These tests are described more fully in Section IV. I first examine how growth in market share for automobiles, vans, pickup trucks, and SUVs responds to a spike in forced substitutions for that vehicle model. Second, I present market-share results based on sales scanner data for televisions, audio goods, and other consumer electronics. Together the vehicles and consumer electronic goods make up more than 50% of consumer spending on all the durables detailed in Table I.
Because substitution rates are skewed toward these goods, they constitute over 80% of extra price increases accompanying forced substitutions for durables, excluding computers, weighting by spending share.

III.A. Model Substitutions and Changes in Market Shares for Automobiles and Other Vehicles

Data on monthly unit sales by car, van, and pickup model are compiled by Ward’s Auto for their Automotive Yearbook; I obtained these data for January 1988 to January 2005. I then construct a data table of substitution rates by vehicle model from
TABLE III
RESPONSE OF UNIT-PRICE INFLATION, CPI PRICE INFLATION, AND MARKET SHARE OF UNIT SALES TO SUBSTITUTION RATE FOR CARS AND OTHER LIGHT VEHICLES

Dependent variable | Automobiles (1) | Vans, pickups, and SUVs (2)
ln(unit price) | 5.1 (0.09) | 4.5 (0.13)
ln(CPI price) | 4.4 (0.08) | 4.1 (0.11)
ln(market share of sales) | 14.2 (2.0) | 1.2 (2.3)
Number of model–month observations | 21,344 | 8,336
Sources. Data are taken from the CPI Commodities and Services Survey and Ward’s Automotive Sales Data, both for January 1988 to January 2005. Notes. Independent variable is the monthly rate of forced substitutions for that vehicle model. Standard errors are in parentheses. Regressions include monthly time-period dummies.
the CPI data covering the same period.10 Combining this table with the Ward’s data provides a panel data set with 29,680 observations on forced substitution rates, price increases, and sales growth by vehicle model.11 Table III presents results on how a vehicle model’s monthly rates of growth in prices and unit sales respond to its rate of forced substitutions. Results are presented separately for automobiles and other light vehicles. (Each observation is weighted by the number of BLS price quotes underlying that month’s substitution rate for the vehicle model.) The regressions include

10. The BLS field agent records some descriptive information for an item when it is selected for pricing. I am able to identify the vehicle model for 98% of price quotes for automobiles and 96% of other vehicles. A model-year change accompanies 89% of substitutions for cars and 87% for other vehicles. Less than 1% of quotes result in a change in the vehicle model being priced. These are not reflected in the substitution rates in the regressions below. Only quotes in the CPI data covering one or two months are included. For quotes of two-month duration, I allocate inflation equally between the months. For two-month quotes with substitutions I allocate slightly over one-half substitution to each month. The amount over one-half reflects the small probability of exhibiting substitutions in consecutive months, with this probability estimated from quotes that are monthly in duration.

11. The Ward’s data combine leased vehicles with regular sales. Leased vehicles are not incorporated into the CPI Commodities and Services Survey until 1998, and then only gradually. Also, lease quotes are not readily separated between cars and other vehicles. Therefore, my analysis of substitution rates, here and in Section II, is based on purchased vehicles. But model turnover dates for a specific model should be similar for vehicles leased versus purchased.
Furthermore, estimates are very similar for the first and second halves of the sample, strongly suggesting that vehicle leasing, which is much less important in the first half of the sample, is not driving the results.
time-period dummies, and so the coefficients should be interpreted as the growth rate in the dependent variable for models that experienced a 100% rate of forced substitutions for that month relative to the growth rate for models experiencing no substitutions. (Aggregate variations in sales of cars or in sales of other vehicles are not a factor.) The first two rows show the impact of substitutions on unit-price inflation and on unit-price inflation net of the BLS adjustment for quality growth. The findings are consistent with the results reported in Section II. For cars, substitutions are associated with 5.1% greater price increases with only about one-seventh of this captured as an increase in quality. For vans, pickups, and SUVs, substitutions are associated with 4.5% greater price increases with only about one-tenth captured as increased quality. The third row shows the impact on market share. For automobiles, forced substitutions are associated with a considerable increase in market share of units sold of 14.2% (with standard error of 2.0%). I also examined results separately by five classes of vehicle, ranging from subcompact to luxury; the positive impact of substitutions on market share is quite consistent across classes. For vehicles other than cars, the estimated impact of substitutions on the rate of growth in market share is positive, but very small and insignificant, equaling 1.2% (with standard error of 2.3%). I also estimated the impact of substitutions on growth in market share separately for vans, pickups, and SUVs, but results look similar across these categories.12 There is also no evidence that model changes that generate larger price increases lose market share. I estimated further specifications that interact the monthly relative price change for a vehicle model with its substitution rate for that month. For both automobiles and other vehicles, a price increase, absent model

12. Nearly half of forced substitutions for vehicles occur in the two fall months of October and November. (Model changes have been less skewed to the autumn during the last twenty years than historically.) The timing of these forced substitutions might be viewed as more exogenous. If I allow a differential impact for these two months, the impact on market share of a substitution is more positive for October and November. For cars, a fall substitution increases market share by 17.6% (standard error 3.1%); but substitutions in other months still have a large and statistically significant impact on market share of 11.9% (standard error 2.6%). For other vehicles the effect of a substitution is statistically insignificant both for fall substitutions and for those outside of October and November. I also used this seasonality in substitutions to estimate the effect by instrumental variables, instrumenting a model’s rate of forced substitutions with its rates eleven, twelve, and thirteen months earlier. This actually yields a more positive impact of substitutions on market share, but standard errors for the estimates are much larger.
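Because the regressions in Table III include monthly time-period dummies and weight each model–month by its number of BLS price quotes, each coefficient compares growth for vehicle models with and without substitutions in the same month. A minimal sketch of that estimator for a single regressor is below; it absorbs the month dummies by subtracting weighted month means (the Frisch–Waugh–Lovell result). The function `fe_weighted_slope` and the synthetic panel are hypothetical, not the author's code or data.

```python
import random
from collections import defaultdict


def fe_weighted_slope(y, x, groups, weights):
    """Weighted least-squares slope of y on x with a dummy for each group (e.g., month).

    The group dummies are absorbed by subtracting weighted group means from
    y and x before computing the slope.
    """
    w_sum = defaultdict(float)
    wy_sum = defaultdict(float)
    wx_sum = defaultdict(float)
    for yi, xi, g, wi in zip(y, x, groups, weights):
        w_sum[g] += wi
        wy_sum[g] += wi * yi
        wx_sum[g] += wi * xi
    num = 0.0
    den = 0.0
    for yi, xi, g, wi in zip(y, x, groups, weights):
        x_dev = xi - wx_sum[g] / w_sum[g]  # x net of its weighted month mean
        y_dev = yi - wy_sum[g] / w_sum[g]  # y net of its weighted month mean
        num += wi * x_dev * y_dev
        den += wi * x_dev * x_dev
    return num / den


# Synthetic model-month panel: growth = month effect + 14.2 * substitution rate
rng = random.Random(0)
months = [rng.randrange(24) for _ in range(500)]
month_effect = {m: rng.gauss(0.0, 5.0) for m in range(24)}
sub_rate = [rng.random() for _ in months]
quotes = [rng.randint(1, 6) for _ in months]  # weight: number of price quotes
growth = [month_effect[m] + 14.2 * s for m, s in zip(months, sub_rate)]
print(round(fe_weighted_slope(growth, sub_rate, months, quotes), 6))
```

Because the synthetic growth rates contain no noise, the slope is 14.2 up to rounding, however large the month effects are. A real application would include an error term and compute standard errors, which this sketch leaves out.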
TABLE IV
DYNAMIC RESPONSES OF MARKET SHARE TO SUBSTITUTIONS FOR VEHICLES

Variable | Automobiles (1) | Vans, pickups, SUVs (2)
ln(market share of sales) at t − 1 | −0.24 (0.007) | −0.24 (0.011)
ln(market share of sales) at t − 2 | −0.18 (0.008) | −0.17 (0.012)
ln(market share of sales) at t − 3 | −0.12 (0.008) | −0.10 (0.012)
ln(market share of sales) at t − 4 | −0.06 (0.007) | −0.05 (0.011)
Substitution rate at t | 14.2 (2.3) | −1.8 (2.5)
Substitution rate at t − 1 | 6.3 (2.4) | 5.3 (2.6)
Substitution rate at t − 2 | 3.3 (2.4) | 3.9 (2.5)
Substitution rate at t − 3 | 6.6 (2.2) | 1.6 (2.4)
Substitution rate at t − 4 | 1.9 (2.1) | −0.1 (2.2)
Adjusted R2 | .26 | .40
Number of observations | 18,747 | 7,946
Source. Data are taken from CPI Commodities and Services Survey and Ward’s Automotive Sales Data, both for January 1988 to January 2005.
Notes. Dependent variable is share of sales. Standard errors are in parentheses. Regressions include monthly time-period dummies. The p-value for the set of substitution variables for autos is …

… $\sigma_o$). In this case $\beta^j = \alpha - \sigma_o - x^j(\sigma_y - \sigma_o)$. Even in the absence of a differential direct effect for violent movies, the level of violence in a movie can affect crime. If violent movies are more likely to attract the violent subpopulation (i.e., $x^v > x^m > x^n$), as we document empirically below, then the effect of exposure becomes more negative with the violence level of the movie: $\beta^v < \beta^m < \beta^n$. Exposure to violent movies can lower crime relative to nonviolent movies simply because violent movies induce more substitution away from dangerous activities for the violent subgroup.

In addition to this selection effect, there can be a direct effect of movie violence, as suggested by the arousal and catharsis theories. To capture this possibility, modify the example in the preceding paragraph so that strongly violent movies have a direct effect $\alpha^v$ (with nonviolent and mildly violent movies still having impact $\alpha$). Then the impact of exposure to a violent movie is $\beta^v = (\alpha^v - \alpha) + (\alpha - \sigma_o) - x^v(\sigma_y - \sigma_o)$. If we could observe the selection of criminals $x^j$ into the different types of movies, we could estimate the differential direct effect of violent movies (the parameter captured in the laboratory experiments) as

$$\alpha^v - \alpha = \beta^v - \left[\beta^n + \frac{x^v - x^n}{x^m - x^n}\left(\beta^m - \beta^n\right)\right]. \tag{5}$$
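Equation (5) can be checked numerically. The sketch below is an illustration under assumed values, not the authors' code: the values of $\alpha$, $\alpha^v$, $\sigma_y$, $\sigma_o$ and the young-audience shares $x^j$ are hypothetical. It builds each field effect $\beta^j$ from the expression for $\beta^v$ given above (with $\alpha^j = \alpha$ for mildly violent and nonviolent movies) and then recovers $\alpha^v - \alpha$ from the $\beta^j$ and $x^j$ alone.

```python
def field_effect(alpha_j: float, alpha: float, sigma_y: float, sigma_o: float, x_j: float) -> float:
    """Field impact of exposure to movie type j:
    beta^j = (alpha^j - alpha) + (alpha - sigma_o) - x^j (sigma_y - sigma_o),
    where x^j is the share of the audience that is young (violence-prone)."""
    return (alpha_j - alpha) + (alpha - sigma_o) - x_j * (sigma_y - sigma_o)


def direct_violence_effect(beta_v: float, beta_m: float, beta_n: float,
                           x_v: float, x_m: float, x_n: float) -> float:
    """Equation (5): alpha^v - alpha = beta^v - [beta^n + (x^v - x^n)/(x^m - x^n) (beta^m - beta^n)]."""
    # Impact of strongly violent movies predicted from selection alone:
    # a linear extrapolation in x from the nonviolent and mildly violent movies.
    predicted_from_selection = beta_n + (x_v - x_n) / (x_m - x_n) * (beta_m - beta_n)
    return beta_v - predicted_from_selection


# Hypothetical parameters: only strongly violent movies have a direct effect.
alpha, alpha_v = 0.0, 0.02
sigma_y, sigma_o = 0.05, 0.01        # the young respond more to the alternative activity
x = {"v": 0.6, "m": 0.4, "n": 0.2}   # violent movies attract more of the young

beta_v = field_effect(alpha_v, alpha, sigma_y, sigma_o, x["v"])
beta_m = field_effect(alpha, alpha, sigma_y, sigma_o, x["m"])
beta_n = field_effect(alpha, alpha, sigma_y, sigma_o, x["n"])

recovered = direct_violence_effect(beta_v, beta_m, beta_n, x["v"], x["m"], x["n"])
print(f"beta_v={beta_v:.3f}, beta_m={beta_m:.3f}, beta_n={beta_n:.3f}")
print(f"recovered alpha_v - alpha = {recovered:.3f}")
```

With these made-up values, $\beta^v$ (about −0.014) is less negative than $\beta^m$ (about −0.026) even though strongly violent movies draw the most violence-prone audience, because the assumed direct effect $\alpha^v - \alpha = 0.02$ offsets part of the selection effect.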
The solution for $\alpha^v - \alpha$ is the difference between the actual impact of strongly violent movies ($\beta^v$) and the impact predicted from selection alone (the term in square brackets). If strongly violent movies trigger additional aggression through arousal or imitation ($\alpha^v - \alpha > 0$), the impact of strongly violent movies, $\beta^v$, can be less negative than that of mildly violent movies, $\beta^m$. In Section IV.A, we provide an estimate of $\alpha^v - \alpha$ under the assumptions outlined above.

Finally, although we have emphasized the impact of movies on potential criminals, exposure to movies can also have a parallel effect on potential victims. While they are at the movie, potential victims are likely to be protected from crime. After the movie, they may be more or less susceptible to assaults, depending on whether their alternative activity would have placed them in a more or less volatile situation (accounting for any arousal or catharsis effects). Therefore, although we cannot distinguish between effects on the supply side and on the demand side of criminal activity, the interpretation of the results and the policy implications remain essentially unchanged. In fact, any effect of movie attendance, such as a reduction in alcohol consumption, is likely to operate symmetrically on offenders and victims.

II.C. Comparison of Lab to Field

Before continuing, a brief comparison to the psychology experiments is in order. Three factors differ between the laboratory and the field. The first and most important is the comparison group. In the experiments, exposure to violent and nonviolent movies is manipulated as part of the treatment, whereas in the field, subjects optimally choose relative to a comparison activity, $a_s$. Hence, in the laboratory, the treatment effect is estimated as the difference between the effect of violent movies and that of nonviolent movies. In the field, by contrast, the effect of exposure is measured as the difference between the effect of movie violence and the effect of the forgone alternative activity.

The second factor is selection. Subjects in the laboratory are a representative sample of the (student) population, while moviegoers in the field are a self-selected sample. The sorting of violent individuals into violent movies, which could result in large displacement effects in the field, is not present in the lab.

The third factor is the type of violence. The clips used in the experiments typically consist of five to ten minutes of selected sequences of extreme violence. In the field, by contrast, a violent movie also includes meaningful acts of reconciliation, apprehension of criminals, and nonviolent sequences. Exposure to acts of violence shown out of context may induce different effects from exposure to violence viewed in a broader context.

Within our empirical specification, an estimate of $\beta^v$ in the laboratory experiment yields
$$\beta^v_{lab} = \frac{N_y}{N_y + N_o}\,\alpha^v_y + \left(1 - \frac{N_y}{N_y + N_o}\right)\alpha^v_o.$$

Comparing this estimate with the estimate obtained from field data in (4) makes the first two differences discussed above apparent. First, the impact of media violence in the lab does not include the indirect effect of $\sigma$, which operates through the alternative activity: by virtue of experimental control, the indirect effect is shut down. Second, the weights on the young and old coefficients
DOES MOVIE VIOLENCE INCREASE VIOLENT CRIME?
are different (compare $N_y/(N_y + N_o)$ to $x^v$). The laboratory experiments capture the reaction to media violence of a representative sample, whereas the field evidence assigns more weight to the parameters of the individuals who sort into violent movies (the “young”). Hence, the laboratory setting is not representative of exposure to movie violence in most field settings, where consumers choose what media to watch. However, it is representative of instances of unexpected exposure, such as a violent advertisement or a trailer placed within family programming.

Recognizing these differences is important not only to better understand the effect of media on violence, but also, more generally, to understand the relationship between experimental and field evidence (Levitt and List 2007). In our setting, the field findings are relevant for evaluating policies that would restrict access to violent movies, because such policies would lead to substitution toward alternative activities in the short run. The results of the laboratory experiments, in contrast, are useful for evaluating other policies, such as those concerning the short-run impact of unexpected exposure to media violence. In addition, some of the differences between laboratory and field can be reduced by changes in the laboratory design. For instance, laboratory experiments can incorporate sorting into violent movies (Lazear, Malmendier, and Weber 2006) to replicate the selection in the field, or can extend exposure to a full-length movie. One important limitation of both the laboratory and field designs is that neither provides evidence on the long-term effects of repeated exposure to violent media. These cumulative effects could be substantial, yet they are difficult to estimate causally.

III. DATA

In this section we introduce our data sets, provide summary statistics, and describe general patterns of movie attendance and violent crime.

III.A. Movie Data

Data on box-office revenue are from the-numbers.com, which uses the studios and Exhibitor Relations as data sources. Information on total weekend box-office sales is available consistently for the top fifty movies from January 1995 on. Daily revenue is available for the top ten movies beginning in mid-August 1997. We focus on daily data for Friday, Saturday, and Sunday because movie attendance, and therefore the identifying variation for our analysis, is concentrated on weekends (see Table I). To estimate movie theater attendance, we deflate both the weekend and the daily box-office sales by the average price of a ticket. For the period January 1995 to mid-August 1997, and for all movies that do not make the daily top-ten list, we impute daily box-office revenue (see Appendix I).

We match the box-office data to violence ratings from kids-in-mind.com, a site recognized by Time Magazine in 2006 as one of the “Fifty Coolest Websites.” Since 1992, this nonprofit organization has assigned a 0- to 10-point violence rating to almost all movies with substantial sales. The ratings are performed by trained volunteers who, after watching a movie, follow guidelines to assign a violence rating. In Table A.1, we illustrate the rating system by listing the three movies with the highest weekend audiences within each rating category.

For most of the analysis, we group movies into three categories: strongly violent, mildly violent, and nonviolent. Movies with ratings between 0 and 4, such as Toy Story and Runaway Bride, have very little violence; their MPAA ratings range from G to R (for sexual content or profanity). Movies with ratings between 5 and 7 contain a fair amount of violence, with some variability across titles (Spider-Man versus The Mummy Returns). These movies are typically rated PG-13 or R. Movies with a rating of 8 and above are violent, are almost uniformly rated R, and are disproportionately likely to be in the “Action/Adventure” and “Horror” genres. Examples are Hannibal and Saving Private Ryan. For a very small number of movies, typically with limited audiences, a rating is not available.
We define the number of people (in millions) exposed to movies of violence level $k$ on day $t$ as $A^k_t = \sum_{j \in J} d_{jk}\, a_{j,t}$, where $a_{j,t}$ is the audience of movie $j$ on day $t$, $d_{jk}$ is an indicator for film $j$ belonging to violence level $k$, and $J$ is the set of all movies. The violence level varies between 0 and 10.¹ We define three summary measures for movies with differing levels of violence. The measure of exposure to strongly violent movies on day $t$ is the audience for movies with violence levels between 8 and 10, $A^v_t = \sum_{k=8}^{10} A^k_t$.

1. The rereleases of Star Wars V and VI in 1997 were not rated because the original movies predate kids-in-mind.com. We assigned them the violence rating 5, the same rating as for the other Star Wars movies. To deal with the small number of remaining movies with missing violence ratings, we assume ratings are missing at random with respect to the level of violence in a movie, and inflate each day’s exposure variables $A^k_t$ accordingly. The average share of missing ratings is 4.1% across days.
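A minimal sketch of how the daily exposure measures might be assembled from movie-level data. The input format (one list of (audience, rating) pairs per day), the function names, and the inflation rule are assumptions: footnote 1 says only that exposure is inflated “accordingly” under missing-at-random ratings, and scaling rated exposure by total audience over rated audience is one natural reading of that.

```python
from collections import defaultdict
from typing import Optional


def violence_group(rating: int) -> str:
    """Map a 0-10 kids-in-mind.com violence rating to the three groups used in the text."""
    if not 0 <= rating <= 10:
        raise ValueError(f"rating out of range: {rating}")
    if rating >= 8:
        return "v"  # strongly violent: ratings 8-10
    if rating >= 5:
        return "m"  # mildly violent: ratings 5-7
    return "n"      # nonviolent: ratings 0-4


def daily_exposure(movies: list[tuple[float, Optional[int]]]) -> dict[str, float]:
    """Compute A^v_t, A^m_t, A^n_t (millions of viewers) for one day.

    movies: (audience in millions, violence rating or None if unrated).
    Hypothetical inflation rule: with ratings missing at random, the rated
    audience in each group is scaled by total audience / rated audience.
    """
    exposure = defaultdict(float)
    total = rated = 0.0
    for audience, rating in movies:
        total += audience
        if rating is None:
            continue
        exposure[violence_group(rating)] += audience  # add a_{j,t} to its violence group
        rated += audience
    scale = total / rated if rated > 0 else 0.0
    return {g: exposure[g] * scale for g in ("v", "m", "n")}


# Hypothetical day: four movies, one of them unrated.
day = [(2.0, 9), (1.5, 6), (0.5, 2), (1.0, None)]
print(daily_exposure(day))
```

Here the unrated movie accounts for one fifth of the day's audience, so each group's rated audience is scaled up by 5.0/4.0.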
TABLE I
SUMMARY STATISTICS

                                                          Assaults (per day)
                                                          Entire      6 A.M. to   6 P.M. to   12 A.M. to
                                                          day (1)     6 P.M. (2)  12 A.M. (3) 6 A.M. (4)
Assault data for all days
  Weekend (Friday–Sunday)                                 1,454       569         531         354
  Friday                                                  1,589       614         543         432
  Saturday                                                1,564       557         558         449
  Sunday                                                  1,209       536         491         182
  Weekday (Monday–Thursday)                               1,293       608         480         205
Assault data for weekends (Friday–Sunday)
  Share of weekend assaults in each category
  By gender of offender
    Share with male offender                              0.779       0.755       0.784       0.811
  By age of offender
    Share with offender of ages 18 to 29                  0.378       0.340       0.359       0.467
  Alcohol-related assaults
    Share with offender suspected of using alc. or drugs  0.170       0.082       0.185       0.290
    Share with offender of ages 17 to 20 (underage)       0.133       0.125       0.139       0.138
    Share with offender of ages 21 to 24 (over-age)       0.135       0.118       0.123       0.182
Number of observations: N = 1,563 days, 2,272,999 assaults, 1,781 agencies

TABLE I (CONTINUED)

                                                          Movie audience (millions of tickets or rentals per day)
                                                          Theater audience (5)   VHS/DVD rentals (6)
Movie audience data for all days
  Weekend (Friday–Sunday)                                 6.29                   3.92
  Friday                                                  5.74                   4.13
  Saturday                                                7.90                   4.82
  Sunday                                                  5.24                   2.82
  Weekday (Monday–Thursday)                               2.00                   2.09
Movie audience data for weekends (Friday–Sunday)
  By kids-in-mind.com rating
    Strongly violent movies                               0.87                   0.64
    Mildly violent movies                                 2.43                   1.56
    Nonviolent movies                                     2.99                   1.72

Notes. An observation is a day over the years 1995–2004. Assault data come from the National Incident Based Reporting System (NIBRS), and the sample includes agencies that do not have missing data on any crime (not just assaults) for more than seven consecutive days for that year. The movie audience numbers are obtained from the-numbers.com and are daily box-office revenue divided by the average price per ticket. The ratings of violent movies are from kids-in-mind.com. The audience of mildly violent movies is the audience of all movies with a violence rating 5–7. The audience of strongly violent movies is the audience of all movies with a violence rating 8–10. VHS/DVD rental data come from Video Store Magazine.
Similarly, exposure to mildly violent ($A^m_t$) and nonviolent ($A^n_t$) movies on day $t$ is defined as the aggregate audience for movies with violence levels 5–7 and 0–4, respectively.

Figure Ia plots the measure of strong movie violence, $A^v_t$, over the sample period 1995 to 2004. To improve readability, we plot the weekend audience (the sum from Friday to Sunday) instead of the daily audience. In the graph, we label the top ten weekends with the name of the movie responsible for the spike. The series exhibits sharp fluctuations. Several weekends have close to zero violent movie audience; on other weekends, over twelve million people watch violent movies. The spikes in the violent movie series are distributed fairly uniformly across the years and decay within two to three weeks of the release of a violent blockbuster. Figure Ib plots the corresponding information for the measure of mild movie violence, $A^m_t$. Because more movies are included in this category, the average weekend audience for mildly violent movies is higher than for strongly violent movies, with peaks of up to 25 million people. There is some seasonality in the release of violent movies, with generally lower exposure to movie violence between February and May. This seasonality is less pronounced for strongly violent movies than for mildly violent movies. To put audience size into perspective, note that blockbuster movies are viewed by a sizable fraction of the U.S. population. Over a weekend, strongly violent and mildly violent blockbusters attract up to 4% and 8%, respectively, of the U.S. population (roughly 300 million). This extensive exposure provides the identifying variation in our setup.
III.B. Violent Crime Data

Our source for violent crime data is the NIBRS, chosen for two important features. First, it reports violent acts known to police, such as verbal intimidation or fistfights, that do not necessarily result in an arrest. Second, it reports the date and time of the crime, allowing us to match movie attendance and violent crime at the daily level. Alternative large-scale data sets on crime, such as the Uniform Crime Report and the National Crime Victimization Survey, do not contain this type of detailed information at the daily level.

The NIBRS data collection effort is part of the Uniform Crime Reporting Program. Submission of NIBRS data is still voluntary, and over time the number of reporting agencies has increased substantially. In 1995 (the first year of NIBRS data), only 4% of the U.S. population was covered, but by August 2005, 29 states were certified to report NIBRS data to the FBI, for a coverage rate of 22% of the U.S. population (reporting is not always 100% within a state). This 22% coverage represents 17% of the nation’s reported crime, which reflects the fact that NIBRS coverage is weighted toward smaller cities and counties (where crime rates are lower). One limitation of NIBRS is that it does not cover crime in the nation’s largest cities, although it does include medium-size cities such as Memphis and Cincinnati.

We use data from 1995 to 2004 for NIBRS city and county reporting agencies, which include local police forces and county sheriffs’ offices. Because not all agencies report consistently, in each year we exclude agencies that have missing data on crime (not just assaults) for more than seven consecutive days, where a report of zero counts as nonmissing data. This filter eliminates 12.5% of reported assaults. If no crime is reported on a given day after this filter, we set that day’s assault count to zero. Our main violence measure is the total daily number of assaults, $V_t$, defined as the sum of aggravated assaults, simple assaults, and intimidation² across all agencies on day $t$. In some specifications, we separate assaults into four time blocks: 6 A.M.–12 P.M., 12 P.M.–6 P.M., 6 P.M.–12 A.M., and 12 A.M.–6 A.M. We assign assaults occurring between 12 A.M. and 6 A.M. to the previous calendar day to match them to movies played the previous evening.
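The two data-preparation rules above, the seven-day missing-data filter and the reassignment of 12 A.M.–6 A.M. assaults to the previous day, can be sketched as follows. The record layout, block labels, and function names are hypothetical, and the sketch works on a single agency's report dates; it is not the authors' processing code.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def time_block(ts: datetime) -> tuple[date, str]:
    """Return (analysis day, time block) for an assault timestamp.

    Assaults between 12 A.M. and 6 A.M. are attributed to the previous
    calendar day, so they line up with the previous evening's movies.
    """
    if ts.hour < 6:
        return ts.date() - timedelta(days=1), "12AM-6AM"
    if ts.hour < 12:
        return ts.date(), "6AM-12PM"
    if ts.hour < 18:
        return ts.date(), "12PM-6PM"
    return ts.date(), "6PM-12AM"


def longest_missing_run(reported_days: set[date], year: int) -> int:
    """Longest run of consecutive days in `year` without any report from an agency.

    A report of zero crimes counts as a report, so `reported_days` should
    contain every day on which the agency submitted anything.
    """
    day, end = date(year, 1, 1), date(year, 12, 31)
    run = longest = 0
    while day <= end:
        run = 0 if day in reported_days else run + 1
        longest = max(longest, run)
        day += timedelta(days=1)
    return longest


def keep_agency(reported_days: set[date], year: int, max_gap: int = 7) -> bool:
    """Keep an agency-year unless data are missing for more than `max_gap` consecutive days."""
    return longest_missing_run(reported_days, year) <= max_gap


# Hypothetical assault timestamps for one agency
assaults = [datetime(2004, 7, 2, 21, 15), datetime(2004, 7, 3, 1, 40), datetime(2004, 7, 3, 14, 5)]
counts = Counter(time_block(ts) for ts in assaults)
print(counts)
```

The 1:40 A.M. assault on July 3 is counted with July 2, alongside the 9:15 P.M. assault from that evening.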
To provide graphical evidence on this series, we construct the residual of log daily assaults after controlling for an extensive set of indicator variables for year, month, day-of-week, day-of-year, and holidays, as well as weather and TV audience measures (the same set of variables used in our main specification and described in Appendix I). Figure Ic plots the average of the Friday to Sunday residuals (the days with the highest movie audience) over time. The residuals behave approximately like white noise: only 44 weekends differ from the mean by more than 0.05 log points, and just one differs by more than 0.10 log points.

2. Aggravated assault is an unlawful attack by one person upon another wherein the offender uses a weapon or displays it in a threatening manner, or the victim suffers obvious severe or aggravated injury. Simple assault is also an unlawful attack but does not involve a weapon or obvious severe or aggravated bodily injury. Intimidation is placing a person in reasonable fear of bodily harm without a weapon or physical attack.
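A sketch of the residualization behind Figure Ic, assuming NumPy is available. It regresses log assaults on an intercept and sets of indicator variables by least squares and keeps the residuals. The synthetic series and the short list of indicators (day-of-week and a crude month label) are stand-ins for the NIBRS data and the full control set, which this sketch does not reproduce.

```python
import numpy as np


def residualize(y: np.ndarray, groupings: list[np.ndarray]) -> np.ndarray:
    """Residuals of y after OLS on an intercept plus indicator variables.

    Each array in `groupings` holds one categorical label per observation
    (e.g. year, month, day-of-week); every level except the first gets a dummy.
    """
    columns = [np.ones(len(y))]
    for labels in groupings:
        for level in np.unique(labels)[1:]:  # drop one level to avoid collinearity with the intercept
            columns.append((labels == level).astype(float))
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef


# Synthetic Friday-Sunday series (hypothetical numbers, not NIBRS data)
rng = np.random.default_rng(0)
n_weekends = 520
dow = np.tile([0, 1, 2], n_weekends)                    # Fri, Sat, Sun
month = np.repeat(np.arange(n_weekends) // 4 % 12, 3)   # crude month label per weekend
log_assaults = 7.3 + 0.1 * (dow < 2) + 0.05 * np.sin(month) + rng.normal(0, 0.04, dow.size)

resid = residualize(log_assaults, [dow, month])
weekend_avg = resid.reshape(n_weekends, 3).mean(axis=1)  # Friday-Sunday average per weekend
print("weekends beyond 0.05 log points:", int(np.sum(np.abs(weekend_avg - weekend_avg.mean()) > 0.05)))
```

Because every level of each grouping is spanned by the regressors, the residuals sum to zero within each day-of-week and each month, which is what makes the plotted series free of those seasonal patterns.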
The figure also labels the top ten weekends for the audience of strongly violent (see Figure Ia) and mildly violent movies (see Figure Ib). Interestingly, Figure Ic suggests a negative relationship between violent movies and crime. For both mildly violent and strongly violent movies, seven of the top ten weekends have residuals below the median. (One of the positive residuals is for The Passion of the Christ, a violent movie that is atypical both in its target audience and in its potential effect on crime.) In addition, of the twenty weekends with a residual more negative than −0.05 log points, two are among the top ten weekends for strongly violent movies, and two are among the top ten weekends for mildly violent movies. We examine the relationship between violent movies and violent crime in detail in the next section.

III.C. Summary Statistics

After matching the movie and crime data, the resulting data set includes 1,563 weekend-day (Friday through Sunday) observations, covering the period from January 1995 to December 2004. The data set contains a total of 2,272,999 assaults and 1,781 reporting agencies. Table I reports summary statistics. The average number of assaults on a weekend day is 1,454. Assaults occur mostly in the evening (6 P.M.–12 A.M.), but are also common in the afternoon (12 P.M.–6 P.M.) and at night (12 A.M.–6 A.M.). Assaults are highest on Friday and Saturday, and lower on Sundays and other weekdays. Male offenders account for more than three times as many assaults as female offenders, and assaults decrease with the age of the offender (for ages above 18). The share of assaults in which the offender is suspected of using alcohol or drugs is 17.0% over the whole day, with a much higher incidence in the night hours. Table I also reports summary statistics for movie attendance. The average daily movie audience on a weekend day is 6.29 million people, with a peak on Saturday.
The audiences for strongly and mildly violent movies are 0.87 million and 2.43 million, respectively. The table also presents information on VHS and DVD movie rentals.

IV. EMPIRICAL RESULTS

IV.A. Theater Audience—Daily

To test for the short-run effects of exposure to violent movies, we focus on same-day exposure, a short time horizon similar to the one considered in the psychology experiments. The outcome variable of interest is $V_t$, the number of assaults on day $t$. Although the number of assaults is a count variable, specifying the count process explicitly (as in a Poisson regression) is not essential because the number of daily assaults is large. Hence, we adopt an OLS specification, which also makes it easier to instrument for movie exposure later in the paper. The benchmark specification that follows from the model developed in Section II is

$$\ln V_t = \beta^v A^v_t + \beta^m A^m_t + \beta^n A^n_t + \Gamma X_t + \varepsilon_t. \tag{6}$$
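Equation (6) is a linear regression, so it can be estimated by OLS. The sketch below, which assumes NumPy, computes OLS coefficients with standard errors clustered by weekend, the clustering the text describes next for Table II. The data are synthetic, the “true” coefficients are arbitrary, and the finite-sample adjustment factor is a common convention rather than something the paper specifies.

```python
import numpy as np


def ols_clustered(y: np.ndarray, X: np.ndarray, clusters: np.ndarray):
    """OLS point estimates with cluster-robust (sandwich) standard errors.

    y: (n,) outcome, e.g. log daily assaults
    X: (n, k) regressors, including a constant and control dummies
    clusters: (n,) cluster id, e.g. one id per weekend shared by Fri/Sat/Sun
    """
    n, k = X.shape
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ (X.T @ y)
    resid = y - X @ beta
    meat = np.zeros((k, k))
    ids = np.unique(clusters)
    for g in ids:
        in_g = clusters == g
        score = X[in_g].T @ resid[in_g]  # sum of x_i * e_i within the cluster
        meat += np.outer(score, score)
    G = len(ids)
    correction = G / (G - 1) * (n - 1) / (n - k)  # common finite-sample adjustment
    vcov = correction * bread @ meat @ bread
    return beta, np.sqrt(np.diag(vcov))


# Synthetic weekend data (hypothetical; true effects chosen for illustration)
rng = np.random.default_rng(1)
weeks = 520
week_id = np.repeat(np.arange(weeks), 3)
dow = np.tile([0, 1, 2], weeks)
A = rng.gamma(2.0, 0.5, size=(week_id.size, 3))   # audiences A^v, A^m, A^n in millions
true_beta = np.array([-0.010, -0.008, -0.003])
week_shock = rng.normal(0, 0.02, weeks)[week_id]  # errors correlated within a weekend
log_v = (7.3 + A @ true_beta + 0.1 * (dow == 0) + 0.1 * (dow == 1)
         + week_shock + rng.normal(0, 0.03, week_id.size))

X = np.column_stack([A, np.ones(week_id.size), dow == 0, dow == 1]).astype(float)
beta_hat, se = ols_clustered(log_v, X, week_id)
for name, b, s in zip(["beta_v", "beta_m", "beta_n"], beta_hat[:3], se[:3]):
    print(f"{name}: {b:.4f} ({s:.4f})")
```

Clustering by weekend lets the three Friday-Sunday errors share the common `week_shock` without understating the standard errors.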
The number of assaults depends on exposure to strongly violent movies, $A^v_t$, mildly violent movies, $A^m_t$, and nonviolent movies, $A^n_t$. The coefficient $\beta^v$ can be interpreted as the (approximate) percent increase in assaults for each million people watching strongly violent movies on day $t$, with a similar interpretation for the coefficients $\beta^m$ and $\beta^n$. Identification of the parameters relies on time-series variation in the violence content of movies at the theater (see Figures Ia and Ib). By comparing the estimates of $\beta^v$ and $\beta^m$ with the estimate of $\beta^n$, one can obtain a difference-in-differences estimate of the effect of violent versus nonviolent movies. The variables $X_t$ are a set of seasonal control variables: indicators for year, month, day-of-week, day-of-year, holidays, weather, and TV audience. Because new movie releases and movie attendance are concentrated on weekends, we restrict the sample to Friday, Saturday, and Sunday. All standard errors are robust and clustered by week, to allow for arbitrary correlation of errors across the three observations on the same weekend.

In column (1) of Table II we begin by estimating equation (6) with only year controls included. The year controls are necessary because the set of cities and counties in the sample varies from year to year. In this specification, exposure to media violence appears to increase crime. However, we also obtain the puzzling result that exposure to nonviolent movies increases crime significantly, suggesting that at least part of this correlation is due to omitted variables. Einav (2007) documents seasonality in movie release dates and underlying demand, with the biggest ticket sales at the beginning of the summer and during holidays. Because assaults are also elevated during summers and holidays, it is important to control for seasonal factors. In columns (2) and (3), we include indicators for month-of-year and for day-of-week.
Although introducing these coarse seasonal variables increases the R2 substantially, from .9344 to .9846, these variables do not control for additional effects
TABLE II
EFFECT OF MOVIE VIOLENCE ON SAME-DAY ASSAULTS

Specification:                    OLS regressions                                                   IV regressions
Dep. var.:                        Log (number of assaults in day t)
                                  (1)        (2)        (3)        (4)        (5)        (6)        (7)
Audience of strongly violent movies (millions of people in day t)
                                  0.0324***  0.0005     −0.0061*   −0.0051    −0.0072**  −0.0091*** −0.0106***
                                  (0.0053)   (0.0053)   (0.0033)   (0.0033)   (0.0033)   (0.0026)   (0.0031)
Audience of mildly violent movies (millions of people in day t)
                                  0.0246***  0.0017     −0.0084*** −0.0042    −0.0056**  −0.0079*** −0.0102***
                                  (0.0030)   (0.0029)   (0.0020)   (0.0026)   (0.0027)   (0.0022)   (0.0028)
Audience of nonviolent movies (millions of people in day t)
                                  0.0082***  −0.0164*** −0.0062*** −0.0023    −0.0029    −0.0035    −0.0050*
                                  (0.0029)   (0.0030)   (0.0021)   (0.0024)   (0.0026)   (0.0024)   (0.0029)
Control variables
  Year indicators                 X          X          X          X          X          X          X
  Day-of-week indicators                     X          X          X          X          X          X
  Month indicators                                      X          X          X          X          X
  Day-of-year indicators                                           X          X          X          X
  Holiday indicators                                                          X          X          X
  Weather and TV audience controls                                                       X          X
F-test on additional controls     1,934.02   1,334.31   88.56      13.37      15.05      18.58
Audience instrumented with predicted audience using next weekend’s audience
                                                                                                    X
R2                                0.9344     0.9711     0.9846     0.9904     0.9912     0.9931
N                                 1,563      1,563      1,563      1,563      1,563      1,563      1,563

Notes. An observation is a Friday, Saturday, or Sunday over the years 1995–2004. Assault data come from the National Incident Based Reporting System (NIBRS), where the sample includes agencies that do not have missing data on any crime (not just assaults) for more than seven consecutive days for that year. The movie audience numbers are obtained from the-numbers.com and are daily box-office revenue divided by the average price per ticket. The ratings of violent movies are from kids-in-mind.com. The audience of strongly violent movies is the audience of all movies with a violence rating 8–10. The audience of mildly violent movies is the audience of all movies with a violence rating 5–7. The specifications in columns (1) through (6) are OLS regressions with the log(number of assaults occurring in day t) as the dependent variable. The specification in column (7) instruments the audience numbers with the predicted audience numbers based on next weekend’s audience. Details on the construction of the predicted audience numbers are in the text. Robust standard errors clustered by week are in parentheses. * Significant at 10%; ** significant at 5%; *** significant at 1%.
such as the Christmas season in the second half of December or holidays such as Independence Day. In columns (4) and (5), we therefore add 365 day-of-year indicators (dropping February 29 in leap years) and holiday indicators (see Appendix I), raising the R2 further to .9912. As we add these variables, the coefficients $\beta^v$ and $\beta^m$ on the violent movie measures flip sign and become negative, significantly so in column (5). This suggests that the seasonality in movie releases and in crime biases the estimates upward.

This negative correlation, however, may still be due to an unobserved shock, captured in $\varepsilon_t$, that simultaneously raises violent movie attendance and lowers violence. For example, on rainy days assaults are lower, but movie attendance is higher. To address this possibility, we use two strategies. First, we add a set of weather controls to account for hot and cold temperatures, humidity, high winds, snow, and rain. We also control for distractors that could affect both crime and movie attendance by controlling for the day of the Super Bowl and for the other days with TV shows having an audience in excess of fifteen million households according to Nielsen Media Research. (These controls are described in Appendix I.) Adding these controls makes the estimates more negative (column (6)).

Second, we instrument for movie audience on day $t$ using information on the following weekend’s audience for the same movie. This instrumental variable strategy exploits the predictability of the weekly decrease in attendance. At the same time, it removes the effect of any shocks that affect violence and attendance in week $w(t)$ but are not present in week $w(t) + 1$. Examples include one-time TV events or transient weather shocks that are not already captured by our TV and weather controls. This procedure, detailed in Appendix II, generates predictors for the audience of strongly violent, mildly violent, and nonviolent movies on day $t$.
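The instrumental-variable step can be sketched as two-stage least squares, again assuming NumPy. This is a generic 2SLS illustration on synthetic data, not the procedure in Appendix II: the variable `z` simply stands in for a predicted audience, and the transient shock `u` plays the role of an unmodeled rainy day that raises attendance and lowers assaults, which biases OLS.

```python
import numpy as np


def two_sls(y: np.ndarray, X_endog: np.ndarray, X_exog: np.ndarray, Z: np.ndarray):
    """Two-stage least squares point estimates.

    X_endog: (n, p) endogenous regressors (actual audiences)
    X_exog:  (n, q) exogenous controls, including a constant
    Z:       (n, p) excluded instruments (predicted audiences)
    Returns (coefficients on [X_endog, X_exog], first-stage coefficients).
    """
    W = np.column_stack([Z, X_exog])                  # all exogenous variables
    first, *_ = np.linalg.lstsq(W, X_endog, rcond=None)
    X_hat = W @ first                                 # fitted audiences
    R = np.column_stack([X_hat, X_exog])
    beta, *_ = np.linalg.lstsq(R, y, rcond=None)
    # Standard errors would need residuals y - [X_endog, X_exog] @ beta,
    # not the second-stage residuals; they are omitted in this sketch.
    return beta, first


# Synthetic example (hypothetical): shock u raises attendance and lowers assaults.
rng = np.random.default_rng(2)
n = 1500
z = rng.gamma(2.0, 0.5, n)                 # stand-in for a predicted audience
u = rng.normal(0, 1, n)
audience = z + 0.5 * u + rng.normal(0, 0.1, n)
log_v = 7.3 - 0.010 * audience - 0.03 * u + rng.normal(0, 0.02, n)

const = np.ones((n, 1))
beta_iv, first = two_sls(log_v, audience[:, None], const, z[:, None])
beta_ols, *_ = np.linalg.lstsq(np.column_stack([audience, const]), log_v, rcond=None)
print(f"OLS: {beta_ols[0]:.4f}  2SLS: {beta_iv[0]:.4f}  first stage on z: {first[0, 0]:.3f}")
```

In this setup the OLS estimate should land well below the assumed −0.010 because of the shock, while the 2SLS estimate should be close to it and the first-stage coefficient on `z` near 1, loosely echoing the pattern reported in Panel B of Table III.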
Panel B in Table III shows that these predictors are strongly correlated with the actual audience numbers they are instrumenting for. In the first stage for the audience of strongly violent movies (column (1)), the coefficient on the predicted audience for strongly violent movies is highly significant and close to 1 (.9145), as predicted. The other two coefficients in this regression are close to 0, though also significant. We obtain similar first stages for the audience of mildly violent movies (column (2)) and nonviolent movies (column (3)). Column (7) in Table II presents the IV estimates, where we have instrumented for the movie audience variables with their
TABLE III
EFFECT OF MOVIE VIOLENCE ON SAME-DAY ASSAULTS BY TIME OF DAY

A. Benchmark results
Specification:                   Instrumental variable regressions
Dep. var.:                       Log (number of assaults in day t in time window)
                                 (1)             (2)             (3)             (4)
Time of day                      6 A.M.–12 P.M.  12 P.M.–6 P.M.  6 P.M.–12 A.M.  12 A.M.–6 A.M. next day
Audience of strongly violent movies (millions of people in day t)
                                 −0.0050         −0.0030         −0.0130***      −0.0192***
                                 (0.0066)        (0.0050)        (0.0049)        (0.0060)
Audience of mildly violent movies (millions of people in day t)
                                 −0.0106*        −0.0001         −0.0109***      −0.0205***
                                 (0.0060)        (0.0045)        (0.0040)        (0.0052)
Audience of nonviolent movies (millions of people in day t)
                                 −0.0033         0.0016          −0.0063         −0.0060
                                 (0.0060)        (0.0046)        (0.0043)        (0.0054)
Control variables
  Full set of controls           X               X               X               X
Audience instrumented with predicted audience using next week’s audience
                                 X               X               X               X
N                                1,563           1,563           1,563           1,562

TABLE III (CONTINUED)

B. First stage
Specification:                   IV regression, first stage
Dep. var.:                       (1) Audience of strongly violent movies
                                 (2) Audience of mildly violent movies
                                 (3) Audience of nonviolent movies
                                 (1)             (2)             (3)
Pred. audience of strongly violent movies (millions of people in day t)
                                 0.9145***       −0.1431***      −0.1694***
                                 (0.0196)        (0.0210)        (0.0281)
Pred. audience of mildly violent movies (millions of people in day t)
                                 −0.0399***      0.8532***       −0.1817***
                                 (0.0101)        (0.0255)        (0.0296)
Pred. audience of nonviolent movies (millions of people in day t)
                                 −0.0480***      −0.1363***      0.8138***
                                 (0.0097)        (0.0195)        (0.0309)
Control variables
  Full set of controls           X               X               X
F-test on instruments            1,050.89        889.02          730.85
N                                1,563           1,563           1,563

Notes. See notes to Table II. The number of observations in column (4) of Panel A is one fewer than in columns (1)–(3) of Panel A because we are missing the assault data for January 1, 2006, for the hours between 12 A.M. and 6 A.M. * Significant at 10%; ** significant at 5%; *** significant at 1%.
predicted values. Instrumenting makes the correlation between movie violence and violent crime more negative. An increase of one million in the audience for violent movies decreases violent crime by 1.06% (strongly violent movies) and 1.02% (mildly violent movies), which are substantial effects. Nonviolent movies have a smaller (marginally significant) negative effect on assaults. The IV estimates do not change noticeably if the weather controls are excluded (not reported), suggesting that the instruments account for temporary shocks, such as those due to weather.

IV.B. Theater Audience—Time of Day

Table II implies that exposure to violent movies diminishes crime in the short run. To clarify this potentially puzzling result (relative to the findings in the laboratory experiments), we separately examine the effect of violent movies on violent crime by time of day. In these and all subsequent specifications, we include the full set of controls $X_t$ and instrument for the actual audiences $A^v_t$, $A^m_t$, and $A^n_t$ using the predicted audiences.

In Table III, we present our baseline estimates by time of day: assaults committed in the morning (6 A.M.–12 P.M.), afternoon (12 P.M.–6 P.M.), evening (6 P.M.–12 A.M.), and night (12 A.M.–6 A.M.). Because audiences rarely watch movies in the morning and afternoon, especially violent movies, we expect little or no effect of exposure to violent movies in the first two time blocks. In the morning hours (column (1)), the effects are small, negative, and at most marginally significant. This appears to be due to a spillover from the previous day’s movie exposure (which is highly correlated with today’s movie exposure). Exposure to violent movies has no differential impact on assaults in the afternoon (column (2)).
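Because the dependent variable is in logs, a coefficient times an audience change gives an approximate percent change; the exact change is $100(e^{\beta \Delta A} - 1)$. The sketch below applies both to the Table II, column (7) estimate and to the Table III evening and night coefficients for a Hannibal-sized audience. Spreading Hannibal's 10.1 million opening-weekend audience evenly over Friday to Sunday is an assumption made here; the paper describes its own calculation in its footnote 14, which is not part of this excerpt.

```python
import math


def pct_effect(beta: float, audience_millions: float) -> tuple[float, float]:
    """Percent change in assaults implied by a log-linear coefficient.

    Returns (log-point approximation 100*beta*A, exact 100*(exp(beta*A) - 1)).
    """
    log_change = beta * audience_millions
    return 100 * log_change, 100 * math.expm1(log_change)


# Table II, column (7): one million more viewers of strongly violent movies
approx, exact = pct_effect(-0.0106, 1.0)
print(f"per million: {approx:.2f}% (approx.), {exact:.2f}% (exact)")

# Assumption: Hannibal's 10.1 million opening-weekend audience spread evenly over Fri-Sun
daily = 10.1 / 3
for window, beta in [("evening", -0.0130), ("night", -0.0192)]:  # Table III, Panel A, cols (3)-(4)
    approx, exact = pct_effect(beta, daily)
    print(f"Hannibal, {window}: {approx:.1f}% (approx.), {exact:.1f}% (exact)")
```

Under this assumption the log-point approximations round to the 4.4% and 6.5% reductions quoted for Hannibal in the next subsection; the exact percent changes are slightly smaller in magnitude.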
Because we consistently find similar effects for these two time periods (small negative effects in the early morning and no effect in the afternoon), we pool them in subsequent tables to save space. During the evening hours (column (3)), by contrast, we find a significant negative effect of exposure to violent movies. An increase of one million in the audience of mildly violent movies decreases violent crime by 1.09%. Exposure to strongly violent movies has a slightly larger effect: exposure of one million additional people reduces assaults by 1.30%. Exposure to nonviolent movies is negatively correlated with violent crime, but the point estimate is smaller than for violent movies and not significant. Over the night hours following exposure to a movie (column (4)), violent movies have an even stronger negative impact on violent crime. Exposure of one million people to mildly and strongly violent movies decreases violent crimes by 2.05% and 1.92%, respectively. The impact of nonviolent movies is also negative but substantially smaller and not significantly different from 0.

To put these estimates into perspective, on an unseasonably cold day (20–32 degrees Fahrenheit) assaults go down by 11% in the evening hours and 8% in the night hours.³ In comparison, the strongly violent blockbuster Hannibal (with an audience of 10.1 million on its opening weekend) is predicted to account for a 4.4% reduction in assaults in the evening hours and a 6.5% reduction in the night hours (see footnote 14 for details on this calculation). In Section V, we provide interpretations of these findings.

IV.C. Theater Audience—Timing of Effects

So far, we have estimated the impact of exposure to movie violence on same-day violent crimes. We now estimate whether there is a delayed impact at various time intervals. If violent movies increase violent crime in the medium run, or if they lead to intertemporal substitution of crime (as in the case of weather shocks in Jacob, Lefgren, and Moretti [2007]), violent crime is likely to be higher in the period following movie exposure.

Monday and Tuesday. In columns (1) and (2) of Table IV, we estimate the impact of the average weekend movie audience on violent crime on the Monday and Tuesday following the weekend. Because the movie audience on these weekdays is limited, to a first approximation this specification captures the delayed effect of movie exposure one to three days later. We find no evidence of an increase in violent crime due to either imitation or intertemporal substitution. Most coefficients are close to zero, and the only marginally significant coefficient indicates a delayed negative impact of mildly violent movies.

One Week, Two Weeks, and Three Weeks Later.
In the following specifications, we estimate the impact one, two, and three weeks after the original exposure, controlling for contemporaneous exposure. Separate identification is made possible by new releases occurring after the initial exposure. Lagged movie attendance is instrumented using a methodology similar to that for the other movie attendance variables, except for the one-week lag (columns (3) and (4)). For this specification, we report OLS results, because the instrument for lagged exposure would be essentially collinear with contemporaneous exposure. Across the three specifications (columns (3)–(8)), we find no evidence of a delayed effect of movie exposure: of the eighteen coefficients on lagged exposure, only one is significant at the 5% level, and it is negative. At the same time, we find strong evidence of a negative impact of contemporaneous exposure to violent movies, as in our benchmark specifications. These results suggest that exposure to movie violence has no medium-run effect through either imitation or intertemporal substitution.
3. These are coefficients from the baseline IV regression, with 33–79 degrees Fahrenheit as the omitted category.

TABLE IV
MEDIUM-RUN EFFECT OF MOVIE VIOLENCE
Columns (timing; specification; lag; time of day):
(1) Next Monday and Tuesday; IV; lag: weekend before; 6 P.M.–12 A.M.
(2) Next Monday and Tuesday; IV; lag: weekend before; 12 A.M.–6 A.M. next day
(3) Next week; OLS; lag: 7 days before; 6 P.M.–12 A.M.
(4) Next week; OLS; lag: 7 days before; 12 A.M.–6 A.M. next day
(5) Two weeks later; IV; lag: 14 days before; 6 P.M.–12 A.M.
(6) Two weeks later; IV; lag: 14 days before; 12 A.M.–6 A.M. next day
(7) Three weeks later; IV; lag: 21 days before; 6 P.M.–12 A.M.
(8) Three weeks later; IV; lag: 21 days before; 12 A.M.–6 A.M. next day
Dep. var.: columns (1)–(2), log(number of assaults on Monday and Tuesday in time window); columns (3)–(8), log(number of assaults in day t in time window).
Regressors (millions of people in day t): lagged audience of strongly violent, mildly violent, and nonviolent movies; in columns (3)–(8), also the audience of strongly violent, mildly violent, and nonviolent movies in day t.
Coefficients (standard errors in parentheses):
−0.0142 −0.0209 −0.0136 −0.0199 (0.0051)∗∗∗ (0.0067)∗∗∗ (0.0051)∗∗∗ (0.0063)∗∗∗
−0.0018 (0.0026) −0.0007 (0.0028)
−0.0146 (0.0076)∗ −0.0065 (0.0074)
−0.007 (0.0050); 0.0012 (0.0054); 0.0046 (0.0042); −0.0004 (0.0087); −0.0027 (0.0033); 0.0031 (0.0041)
0.0001 (0.0037); −0.0017 (0.0054); 0.0030 (0.0050); −0.0060 (0.0042); −0.0061 (0.0037); −0.0028 (0.0047)
−0.0050 (0.0046); 0.0012 (0.0055); −0.0056 (0.0049); 0.0020 (0.0062); −0.0079 (0.0061); 0.0011 (0.0036)
0.0002 (0.0031); 0.0017 (0.0044); −0.0070 (0.0044); −0.0049 (0.0048); −0.0105 (0.0045)∗∗; −0.0065 (0.0056)
−0.0076 (0.0056)
−0.0061 −0.0087 −0.0096 −0.0194 −0.0114 −0.0199 (0.0031)∗∗ (0.0043)∗∗ (0.0042)∗∗ (0.0056)∗∗∗ (0.0041)∗∗∗ (0.0052)∗∗∗
−0.0127 −0.0081 (0.0045)∗∗∗ (0.0060)
0.0019 (0.0058)
Control variables: full set of controls in all columns. Audience instrumented with predicted audience using following week's audience: columns (1), (2), and (5)–(8); not in columns (3) and (4).
N: (1) 1,041; (2) 1,041; (3) 1,559; (4) 1,558; (5) 1,556; (6) 1,555; (7) 1,553; (8) 1,552.
Notes. See notes to Table II. The specifications are IV regressions with the log(number of assaults occurring in day t) as the dependent variable. The specifications in columns (3) and (4) are not instrumented, because the predictors for the audience of the previous week are highly collinear with the contemporaneous audience. ∗Significant at 10%; ∗∗significant at 5%; ∗∗∗significant at 1%.

IV.D. Theater Audience—Robustness
Before discussing how to interpret the results, in Table V we assess the robustness of the benchmark estimates of Table III, reproduced in column (1). In column (2), we use a different set of instruments for movie attendance: information on the production budget and the number of theaters in which a movie is playing in week w(t) (see Appendix II for details). Production budgets are decided far in advance, whereas the number of screens is finalized one or two weeks in advance (Moretti 2008). These instruments, like our baseline instruments, should purge the estimates of short-term shocks affecting both attendance and crime. We supplement these instruments with an additional instrument for total movie audience size, based on our standard procedure.⁴ The results are remarkably similar to the benchmark IV results. Column (3) uses the standard instrument but includes all seven days of the week instead of just the weekend. Many of the point estimates for the effect of movie violence in the evening and night (Panels B and C) become more negative, including the estimate for nonviolent movies, which is now significant.
The latter finding may reflect an impact of nonviolent movies for the same reasons as for violent movies (with smaller magnitudes), for example by incapacitating potential criminals. An alternative possibility is that the instrument, which is based on next weekend's audience, does not completely remove the impact of short-term shocks, especially for Wednesdays and Thursdays, which fall immediately before the next weekend.
Column (4) assesses the robustness of the standard errors to autocorrelation. One may worry that violent crime is positively correlated across weeks, even after controlling flexibly for seasonality. In this case, clustering by week (which assumes independence across weeks) may lead to standard errors that are too small. To address this concern, we replicate the specification of column (3) using Newey-West standard errors with a 28-day window.⁵ The Newey-West standard errors are on average 5% lower than the clustered standard errors, suggesting that autocorrelation is a minor issue.
Next, we use an alternative measure of movie violence. In addition to rating movies (R, PG, etc.), the MPAA summarizes in one sentence the reason for its rating. We classify as mildly violent those movies whose MPAA rating reason contains the word "Violence" or "Violent," with two exceptions. If the reference to violence is qualified with "Brief," "Mild," or "Some," we classify the movie as nonviolent. If it is qualified with "Bloody," "Brutal," "Disturbing," "Graphic," "Grisly," "Gruesome," or "Strong," we classify the movie as strongly violent. The kids-in-mind.com and MPAA-based measures have correlations of .68 (mild violence) and .66 (strong violence).⁶ The correlation is also apparent in Table A.1, which lists the violence ratings for blockbuster movies. Using this MPAA-based measure of movie violence yields similar results (column (5)). When we include both measures of violence (not shown), however, the effects on assaults load almost exclusively onto the kids-in-mind.com measures.
4. We supplement with total movie audience size because the new instruments do not predict overall movie audience well. This is because the total number of theaters is essentially fixed in any given week, and production budgets do not provide much identifying variation. The joint F-tests for the first stages of this instrument set range from 280 to 378, with most of the power coming from the variables for the number of theaters.

TABLE V
ROBUSTNESS
Columns (robustness specification):
(1) Benchmark IV specification
(2) IV with instruments based on budget and number of theaters
(3) Benchmark + include Monday–Thursday
(4) Benchmark + include Monday–Thursday + Newey-West, 28-day correlation
(5) Benchmark + use MPAA measure of movie violence
(6) Benchmark + dependent variable is all crimes against a person
(7) OLS regression (no instruments)
(8) Poisson regression (no instruments)
Specification: instrumental variables regressions in columns (1)–(6); OLS regression in column (7); Poisson regression in column (8).
Dep. var.: log(number of violent crimes in day t in time window) in columns (1)–(7); number of assaults in column (8).
Regressors: audience of strongly violent, mildly violent, and nonviolent movies (millions of people in day t).
Coefficients (standard errors in parentheses):
Column (1), benchmark:
A. Effects in morning and afternoon (6 A.M.–6 P.M.): strongly violent −0.0037 (0.0046); mildly violent −0.003 (0.0041); nonviolent 0.0003 (0.0041)
B. Effects in the evening (6 P.M.–12 A.M.): strongly violent −0.013 (0.0049)∗∗∗; mildly violent −0.0109 (0.0040)∗∗∗; nonviolent −0.0063 (0.0043)
C. Effects in the night (12 A.M.–6 A.M.): strongly violent −0.0192 (0.0060)∗∗∗
C. Effects in the night (12 A.M.–6 A.M.), columns (1)–(8):
Mildly violent: −0.0205 (0.0052)∗∗∗; −0.0202 (0.0054)∗∗∗; −0.0245 (0.0040)∗∗∗; −0.0245 (0.0039)∗∗∗; −0.0187 (0.0050)∗∗∗; −0.0205 (0.0052)∗∗∗; −0.0089 (0.0041)∗∗; −0.0106 (0.0029)∗∗∗
Nonviolent: −0.006 (0.0054); −0.0047 (0.0056); −0.0103 (0.0042)∗∗; −0.0103 (0.0041)∗∗; −0.0104 (0.0053)∗; −0.0075 (0.0053); 0.0045 (0.0043); 0.0005 (0.0029)
Other estimates:
A. Effects in morning and afternoon (6 A.M.–6 P.M.): −0.0046 0.0005 0.0005 −0.0075 (0.0045) (0.0039) (0.0037) (0.0056)
B. Effects in the evening (6 P.M.–12 A.M.): −0.0158 −0.0144 −0.0144 −0.0139 (0.0048)∗∗∗ (0.0046)∗∗∗ (0.0044)∗∗∗ (0.0063)∗∗
C. Effects in the night (12 A.M.–6 A.M.): −0.0206 −0.0206 −0.0252 (0.0054)∗∗∗ (0.0055)∗∗∗ (0.0068)∗∗∗
−0.0202 (0.0059)∗∗∗; −0.0098 (0.0040)∗∗; −0.0062 (0.0044); −0.008 (0.0042)∗; −0.0098 (0.0036)∗∗∗; −0.0211 (0.0066)∗∗∗
−0.0069 (0.0040)∗; −0.0119 (0.0038)∗∗∗; −0.0109 (0.0039)∗∗∗; −0.0165 (0.0035)∗∗∗; −0.0107 (0.0042)∗∗; −0.0165 (0.0032)∗∗∗
−0.0153 (0.0044)∗∗∗; −0.0013 (0.0044); −0.0012 (0.0034); 0 (0.0039); −0.0012 (0.0035); −0.0028 (0.0039)
−0.0012 (0.0042); −0.0006 (0.0033); −0.003 (0.0040); −0.0006 (0.0033); −0.0046 (0.0042); −0.0047 (0.0044)
Column (7), remaining estimates: −0.0098 (0.0052)∗; −0.0026 (0.0030); −0.0065 (0.0029)∗∗; −0.0099 (0.0037)∗∗∗; −0.0079 (0.0028)∗∗∗; −0.0088 (0.0027)∗∗∗; −0.0096 (0.0035)∗∗∗
Column (8), remaining estimates: −0.0133 (0.0035)∗∗∗; −0.003 (0.0024); −0.0075 (0.0023)∗∗∗; −0.0081 (0.0030)∗∗∗; −0.0098 (0.0023)∗∗∗; −0.0102 (0.0023)∗∗∗; −0.0081 (0.0029)∗∗∗
Control variables: full set of controls in all columns. Audience instrumented with predicted audience using following week's audience: columns (1) and (3)–(6).
N: (1) 1,563; (2) 1,563; (3) 3,645; (4) 3,645; (5) 1,539; (6) 1,563; (7) 1,563; (8) 1,563.
Notes. This table presents a series of robustness checks to the results in Table III, reproduced in column (1). Column (2) uses instruments constructed as in the benchmark instruments, but using the number of theaters showing the movie in week w(t) and the production budget (when available) as predictors. This specification also includes the instrument for overall movie audience constructed with the benchmark instruments. (See text for additional details.) Column (3) uses data also from Monday–Thursday, in addition to Friday–Sunday. Column (4) uses the same sample as column (3) but with Newey-West standard errors with a 21-day lag. Column (5) presents the results for an alternative measure of movie violence based on the MPAA ratings. The number of observations is smaller because in the first weeks of 1995, the MPAA rating is missing for a number of movies; we set the MPAA violence measure missing for the ten weeks in which the rating is available for less than 70% of the movie audience. In column (6) the definition of crimes against a person, in addition to assaults and intimidation, includes robbery, homicide, and sex offenses. Column (7) presents an OLS specification, and column (8) presents a Poisson regression (also not instrumented). The number of observations in Panel C is one fewer than in Panels A and B because we are missing the assault data for January 1, 2006, for the hours between 12 A.M. and 6 A.M. See also notes to Table II. ∗Significant at 10%; ∗∗significant at 5%; ∗∗∗significant at 1%.
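The MPAA keyword rule used for column (5) can be written down as a small classifier. The sketch below is a hypothetical illustration, not the authors' code: the function name `classify_mpaa_reason`, the choice to look only at the two words immediately before "violence"/"violent," and the rule that a strong qualifier wins when several mentions disagree are all assumptions of this example. The qualifier lists are the ones given in the text.

```python
import re

# Qualifier lists from the text. Looking only at the `window` words right
# before "violence"/"violent" is an assumption of this sketch.
STRONG_QUALIFIERS = {"bloody", "brutal", "disturbing", "graphic",
                     "grisly", "gruesome", "strong"}
WEAK_QUALIFIERS = {"brief", "mild", "some"}
RANK = {"nonviolent": 0, "mildly violent": 1, "strongly violent": 2}


def classify_mpaa_reason(reason: str, window: int = 2) -> str:
    """Map a one-sentence MPAA rating reason to a violence category.

    Each mention of "violence"/"violent" is classified from the `window`
    words preceding it; the most violent category across mentions wins.
    A reason that never mentions violence is classified as nonviolent.
    """
    words = re.findall(r"[a-z]+", reason.lower())
    best = "nonviolent"
    for i, word in enumerate(words):
        if word not in ("violence", "violent"):
            continue
        qualifiers = set(words[max(0, i - window):i])
        if qualifiers & STRONG_QUALIFIERS:
            label = "strongly violent"
        elif qualifiers & WEAK_QUALIFIERS:
            label = "nonviolent"
        else:
            label = "mildly violent"
        if RANK[label] > RANK[best]:
            best = label
    return best


print(classify_mpaa_reason("Rated R for strong violence and language"))
print(classify_mpaa_reason("Rated PG for some mild violence"))
```

Restricting qualifiers to a short look-back window keeps a word such as "some" in "violence and some sexual content" from downgrading the violence mention; how the authors resolved such cases is not stated in the text.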
We also consider an alternative definition of violent crimes that includes any type of crime against a person (column (6)). In addition to assaults and intimidation, this definition includes robbery, homicide, and sex offenses. The results are very similar to the benchmark ones.⁷ We find qualitatively similar results for the three component categories of our assault measure (intimidation, simple assault, and aggravated assault), for assaults with and without injury, for assaults occurring at home and away from home, and for crimes involving a weapon (see Online Appendix Tables 1 and 4). We find larger effects for assaults against a known person than for assaults against a stranger. We find small negative but statistically insignificant effects for property crimes (burglary, theft, motor vehicle theft, and vandalism).⁸
Finally, we estimate two specifications that do not instrument for movie audience: OLS (column (7)) and Poisson MLE (column (8)). In these specifications, the effects in the evening and night hours are qualitatively similar to the benchmark estimates, though somewhat smaller. Exposure to all types of movies in the morning and afternoon has a negative (significant) effect on violent crime. These small differences are likely due to omitted variables that are correlated with overall movie audience and crime. Indeed, if one considers the differential impact of violent versus nonviolent movies, the results mirror the IV results: no differential effect in the morning and afternoon, and large negative effects in the evening and night.
An Online Appendix presents additional robustness checks, including (i) the use of 52 week-of-year indicators instead of 365 day-of-year indicators, (ii) estimates using only the audience for the first week of release, (iii) estimates for the set of agencies that report consistently over the entire sample, (iv) separate estimates for violence levels 0 through 10, and (v) estimates in two-hour blocks. The pattern of findings is similar in these specifications.
5. We use the data for all seven days of the week rather than the benchmark three-day weekend data because Newey-West standard errors imply a decay that is a function of the temporal distance between observations.
6. These are the correlations of the residuals from OLS regressions on the standard set of control variables appearing in column (6) of Table II, excluding the movie violence measures.
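To make concrete what the Newey-West correction in column (4) of Table V does, here is a minimal sketch for the simplest case, the standard error of a sample mean, using the standard Bartlett-kernel weights 1 − l/(L + 1). This is an illustration under assumptions, not the authors' regression code: the function name is hypothetical, and reading the 28-day window as the `lags` argument is our assumption. With `lags = 0` it reduces to the usual independence-based standard error; positive autocovariances at nearby lags raise it.

```python
import math
from typing import Sequence


def newey_west_se_of_mean(x: Sequence[float], lags: int) -> float:
    """Newey-West (Bartlett-kernel) standard error of the sample mean of x.

    Autocovariances up to `lags` periods apart enter the long-run variance
    with weights 1 - lag / (lags + 1), so serially correlated observations
    are not treated as independent.
    """
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    long_run_var = sum(d * d for d in dev) / n  # lag-0 autocovariance
    for lag in range(1, lags + 1):
        weight = 1.0 - lag / (lags + 1)
        autocov = sum(dev[t] * dev[t - lag] for t in range(lag, n)) / n
        long_run_var += 2.0 * weight * autocov
    return math.sqrt(long_run_var / n)


series = [1.0, 2.0, 3.0, 4.0]
print(newey_west_se_of_mean(series, 0), newey_west_se_of_mean(series, 1))
```

In a regression the same weighting is applied to the score contributions rather than to deviations from a mean; the mean case just keeps the sketch short.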
In addition, the Online Appendix includes two placebo tests: one that reassigns movie attendance to the other date in the sample that falls on the same day of year and the same day of week, and another that examines whether future exposure, controlling for current exposure, affects violent crime. We find no systematic impact for either set of placebo variables, suggesting that our findings are not due to unobserved seasonal factors.
7. Homicide and sex offenses are relatively infrequent, and their estimates are not individually significant. Regressions for robbery alone yield negative estimates that are significant in the evening hours but not in the night hours.
8. Insofar as alcohol plays an important role (Section V.B), the smaller effects for property crimes are consistent with Carpenter and Dobkin (forthcoming), who find a smaller spike in property crimes than in violent crimes around the legal drinking age. It is also possible that movie attendance creates additional opportunities for property crimes because property owners may be in the theater.
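The first placebo test needs, for each date, another sample date with the same day of year and the same day of week. The sketch below shows one way to find such dates; the function name `placebo_dates` is hypothetical, and matching on the day-of-year index (so that in leap years the calendar date shifts by one after February) is our reading of "same day of year," not something the text specifies.

```python
from datetime import date, timedelta
from typing import List


def placebo_dates(d: date, first_year: int, last_year: int) -> List[date]:
    """Dates in other sample years with the same day of year and weekday as d."""
    day_of_year = d.timetuple().tm_yday  # 1 for January 1
    matches = []
    for year in range(first_year, last_year + 1):
        if year == d.year:
            continue
        candidate = date(year, 1, 1) + timedelta(days=day_of_year - 1)
        # Day 366 exists only in leap years; skip it if it rolled into the next year.
        if candidate.year == year and candidate.weekday() == d.weekday():
            matches.append(candidate)
    return matches


# January 1, 1995 was a Sunday; of the years 1995-2006, only 2006 also starts on a Sunday.
print(placebo_dates(date(1995, 1, 1), 1995, 2006))
```

Because the weekday pattern of a calendar repeats only every few years, most dates have at most one such match in a sample of about a decade, which is consistent with the text's reference to "the other date."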
IV.E. DVD and VHS Rental Audience
While this paper focuses on the effect of movies shown in theaters, a similar design exploits the releases of movie rentals on VHS and DVD. These releases occur several months after the theatrical release, and rentals of newly released VHSs and DVDs peak in the first week of release, with the top one or two movies capturing a substantial share of total rental revenue. We use data on weekly DVD and VHS rental revenue from Video Store Magazine covering the top 25 movies over the period January 1995–December 2004.⁹ The average number of rentals on a weekend day is 3.92 million (Table I). Weekend rentals of strongly violent (mildly violent) movies total 0.64 (1.56) million. Although rentals are 30% to 40% smaller than theater attendance, these numbers underestimate the audience reached, because several people often view a single rented movie. The violent audience size for DVD and VHS rentals is positively correlated with the box-office measure in the corresponding week: the conditional correlation between the two measures of strong (mild) violence is .15 (.39) (see footnote 6).
In columns (1)–(3) of Table VI, we estimate equation (6) using DVD and VHS rentals instead of box-office audience. We include the full set of controls and instrument using a predictor based on next week's rentals. As might be expected, we find no effect of exposure to violent movies in the morning and afternoon hours (column (1)). In the evening hours (column (2)), we find a large negative impact of exposure to mildly violent movies (a 1.48% decrease in assaults per million rentals) and a smaller, insignificant impact of strongly violent movies. In the night hours (column (3)), we find large negative effects of exposure to rentals of violent movies, but also a significant negative effect of the rental audience of nonviolent movies. These estimates are less precise than the estimates for box-office releases, with standard errors about 30% larger. When we also control for box-office movie audience in the regressions, the results are similar, although with larger standard errors (columns (4)–(6)).
9. To convert revenue data into an estimated number of rentals, we deflate rental revenue by the average price of a rental estimated using the Consumer Expenditure Survey. We impute daily rentals using the within-week distribution of rentals in the Consumer Expenditure Survey. As with the box-office data, we focus on weekend rentals. Data are missing for twenty weeks in which the magazine did not publish the relevant numbers.
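The revenue-to-rentals conversion in footnote 9 amounts to two steps: divide weekly revenue by the average rental price, then allocate the weekly total to Friday, Saturday, and Sunday with the within-week distribution of rentals. The sketch below assumes that reading; the function name, the $40 million revenue, the $4 price, and the daily shares are all hypothetical numbers chosen only to show the arithmetic, not values from the Consumer Expenditure Survey.

```python
from typing import Dict

WEEKEND = ("Fri", "Sat", "Sun")


def weekend_daily_rentals(weekly_revenue: float, avg_price: float,
                          within_week_share: Dict[str, float]) -> Dict[str, float]:
    """Estimate rentals on each weekend day from one week of rental revenue.

    weekly_revenue / avg_price is the number of rentals in the week; each
    weekend day receives its share of that total according to the
    within-week distribution of rentals.
    """
    if avg_price <= 0:
        raise ValueError("average rental price must be positive")
    weekly_rentals = weekly_revenue / avg_price
    return {day: weekly_rentals * within_week_share[day] for day in WEEKEND}


# Hypothetical inputs: $40 million of weekly revenue, a $4 average price, and
# made-up daily shares (the remaining 45% of rentals falls on Monday-Thursday).
shares = {"Fri": 0.20, "Sat": 0.20, "Sun": 0.15}
print(weekend_daily_rentals(40e6, 4.0, shares))
```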
TABLE VI
EFFECT OF DVD/VHS MOVIE VIOLENCE ON SAME-DAY ASSAULTS
Specification: instrumental variable regressions
Dep. var.: log(number of assaults in day t in time window)
Audience variables are in millions of people in day t.

                                     (1)             (2)             (3)             (4)             (5)             (6)
Time of day                          6 A.M.–6 P.M.   6 P.M.–12 A.M.  12 A.M.–6 A.M.  6 A.M.–6 P.M.   6 P.M.–12 A.M.  12 A.M.–6 A.M.
                                                                     next day                                        next day
DVD/VHS rentals of strongly          −0.0042         −0.0078         −0.0148         −0.0051         −0.0044         −0.0107
  violent movies                     (0.0058)        (0.0063)        (0.0078)∗       (0.0101)        (0.0104)        (0.0120)
DVD/VHS rentals of mildly            −0.0041         −0.0148         −0.0311         −0.0034         −0.0227         −0.0193
  violent movies                     (0.0059)        (0.0052)∗∗∗     (0.0071)∗∗∗     (0.0103)        (0.0092)∗∗      (0.0102)∗
DVD/VHS rentals of nonviolent        −0.0029         −0.0043         −0.0225         −0.0054         −0.0041         −0.0199
  movies                             (0.0066)        (0.0060)        (0.0076)∗∗∗     (0.0115)        (0.0106)        (0.0114)∗
Theater audience of strongly                                                         0.0017          −0.0098         −0.0192
  violent movies                                                                     (0.0082)        (0.0077)        (0.0089)∗∗
Theater audience of mildly                                                           0.0034          −0.0119         −0.0202
  violent movies                                                                     (0.0076)        (0.0070)∗       (0.0080)∗∗
Theater audience of nonviolent                                                       0.0042          −0.0049         −0.0071
  movies                                                                             (0.0078)        (0.0070)        (0.0079)
Full set of controls                 X               X               X               X               X               X
Rental and theater audiences         X               X               X               X               X               X
  instrumented with predicted
  audiences using next week's
  audiences
N                                    1,475           1,475           1,475           1,475           1,475           1,475

Notes. The daily audience numbers are computed from weekly data on DVD and VHS rental revenue from Video Store Magazine. The weekly revenue is divided by the average price of a rental and proportionately attributed to the Friday, Saturday, and Sunday window using the average within-week distribution of rentals in the CEX diaries. The specifications are IV regressions with the log(number of assaults occurring in day t) as the dependent variable. See also notes to Table II. ∗Significant at 10%; ∗∗significant at 5%; ∗∗∗significant at 1%.
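The percentage effects quoted in the text read a coefficient on audience (in millions) as log points, so that a coefficient of −0.0148 is described as a 1.48% decrease per million rentals. The sketch below, with hypothetical function names, contrasts that reading with the exact percent change implied by a log-linear model. It also reproduces the Hannibal projections under an assumption of ours, not stated in this section: that the 10.1 million opening-weekend viewers are spread evenly over Friday, Saturday, and Sunday and multiplied by the benchmark strongly violent coefficients (−0.0130 in the evening, −0.0192 at night). Footnote 14 of the paper gives the authors' own calculation.

```python
import math


def percent_effect_log_points(beta: float, audience_millions: float) -> float:
    """Log-point reading of a coefficient: 100 * beta * audience (in millions)."""
    return 100.0 * beta * audience_millions


def percent_effect_exact(beta: float, audience_millions: float) -> float:
    """Exact percent change implied by a log-linear model: 100 * (exp(beta * A) - 1)."""
    return 100.0 * math.expm1(beta * audience_millions)


# One million rentals of mildly violent movies in the evening (coefficient -0.0148).
print(percent_effect_log_points(-0.0148, 1.0), percent_effect_exact(-0.0148, 1.0))

# Hannibal, assuming an even split of opening-weekend viewers over three days.
daily_audience = 10.1 / 3
for window, beta in (("evening", -0.0130), ("night", -0.0192)):
    print(window, round(percent_effect_log_points(beta, daily_audience), 1))
```

For effects of a few percent the two readings differ only in the first decimal, which is why the log-point shorthand is harmless here.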
The results on DVD and VHS releases are consistent with a negative impact of violent movies on violent crime, especially over the evening hours. The similarity with the results from theater releases is interesting in light of the differences in setting (e.g., alcohol consumption is possible at home but not at the theater).
V. INTERPRETATION AND ADDITIONAL EVIDENCE
We summarize the findings so far as follows: (i) exposure to violent movies lowers same-day violent crime in the evening; (ii) this exposure also lowers violent crime in the night after exposure; (iii) in the night, strongly violent movies have a somewhat smaller effect on crime than mildly violent movies; (iv) nighttime hours have larger negative effects than evening hours; (v) there is no lagged effect of exposure in the weeks following movie attendance. We now provide interpretations and additional evidence for the first four of these findings (the fifth finding is straightforward to interpret). We stress that, because of data limitations, the interpretations in this section are based on ecological inference rather than individual-level analysis. As such, alternative explanations for the findings are also possible. For example, whereas the decrease in crime in the evening hours has a natural interpretation as incapacitation of criminals, an alternative, complementary interpretation is protection of potential victims.
V.A. Lower Crime in the Evening—Voluntary Incapacitation and Sorting
We interpret the first finding, that violent movies lower crime in the evening hours, as voluntary incapacitation. Because it is virtually impossible to commit an assault while in the theater, violent acts fall relative to the counterfactual as movie attendance rises. Interestingly, simple as this explanation is, incapacitation has largely been ignored in discussions of the effect of movie violence.
This voluntary incapacitation differs from the standard incapacitation in the literature because it is optimally chosen by consumers rather than imposed, as in the case of school closings (Jacob and Lefgren 2003) or incarceration (Levitt 1996).
Although the qualitative findings are consistent with incapacitation, are the magnitudes also consistent with this interpretation? Suppose watching a movie (including time spent buying tickets, waiting in the lobby, and traveling to and from the theater) occupies roughly one-half of the 6 P.M.–12 A.M. time period and fully incapacitates individuals. For the rest of the time block, assume that crime rates are the same as for the alternative activity. Using the framework of Section II, denoting criminals with a $y$ subscript, and assuming no crime is committed by nonviolent individuals ($\sigma_o = 0$) yields $\beta^j = -0.5\,x^j\sigma_y$. If criminals were equally represented in the audience of a movie with one million viewers, about 1/300th (i.e., 1 million out of a total population of 300 million) of the criminals would be incapacitated, leading to $\beta^v_{equal} = -0.5 \cdot (1/300) \approx -0.0017$, compared to the observed values $\hat{\beta}^v = -0.0130$ and $\hat{\beta}^m = -0.0109$. This implies that violent individuals are overrepresented by a factor of about 0.0130/0.0017 = 7.6 in strongly violent movies and 0.0109/0.0017 = 6.4 in mildly violent movies. Although this is a substantial amount of selection, it is not implausibly large.
To provide evidence on the sorting of more violent individuals into more violent movies, we turn to data from the Consumer Expenditure Survey (CEX). We take advantage of the fact that the CEX diaries record all expenditures of surveyed households day by day for a period of one or two weeks, including demographic information about the households that purchase movie tickets. For each day t in the years 1995–2004, we compute the share of interviewed households that watch a movie at the theater, $share^{CEX}_t$. We regress this share on the shares of the population attending movies of different violence levels according to our primary movie attendance data:¹⁰

$$share^{CEX}_t = \alpha + \beta^v \frac{A^v_t}{Pop_t} + \beta^m \frac{A^m_t}{Pop_t} + \beta^n \frac{A^n_t}{Pop_t} + \Gamma X_t + \varepsilon_t, \qquad (7)$$

where $Pop_t$ is the U.S. population in year t (Table VII). Because $share^{CEX}_t$ and $A^j_t/Pop_t$ are both measures of the share of the population attending a movie on day t, we expect, and indeed find, that the estimated regression coefficients $\beta^j$ are statistically indistinguishable from 1 when we include all demographic groups (column (1)).
10. The regressions include Friday, Saturday, and Sunday and are weighted by the number of households reporting consumption expenditures for day t, which averages 157.88. We include the standard set of controls $X_t$. We obtain similar results when using an imputed individual-level measure of movie attendance, and similar, but less precisely estimated, results if we instrument for movie attendance.
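The incapacitation benchmark discussed earlier in this subsection is simple arithmetic, and the sketch below reproduces it: half of the evening block times the audience share of the population gives the coefficient expected under equal representation of criminals, and dividing the observed coefficients by that benchmark gives the implied overrepresentation. The function name is hypothetical. Following the text, the benchmark is rounded to −0.0017 before taking ratios; with the unrounded value the ratios would be slightly larger.

```python
def equal_representation_beta(share_of_block: float, audience_millions: float,
                              population_millions: float) -> float:
    """Coefficient implied if criminals attend in proportion to their population share.

    A fraction audience / population of criminals is in the theater, and each
    of them is incapacitated for `share_of_block` of the time window.
    """
    return -share_of_block * audience_millions / population_millions


beta_equal = equal_representation_beta(0.5, 1.0, 300.0)  # -0.5 * (1/300)
beta_equal_rounded = round(beta_equal, 4)                 # -0.0017, as in the text
observed = {"strongly violent": -0.0130, "mildly violent": -0.0109}
overrepresentation = {k: b / beta_equal_rounded for k, b in observed.items()}
for k, ratio in overrepresentation.items():
    print(k, round(ratio, 1))
```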
TABLE VII
PATTERNS OF MOVIE ATTENDANCE BY DEMOGRAPHICS (CEX DATA)
Specification: OLS regressions
Dep. var.: share of households interviewed watching a movie at the theater in day t
Audience variables are in share of the U.S. population in day t.

                                     (1)            (2)            (3)            (4)            (5)
Demographic group                    All            Ages 18–29     Ages 30–44     Ages 45+       Single males
  (by head of household)                                                                         age 18–29
Share of audience of strongly        0.9469         2.094          1.146          0.4323         2.7751
  violent movies                     (0.1883)∗∗∗    (0.5602)∗∗∗    (0.3328)∗∗∗    (0.2580)∗      (1.4550)∗
Share of audience of mildly          0.7736         1.4642         1.4499         0.1259         2.7825
  violent movies                     (0.1419)∗∗∗    (0.4407)∗∗∗    (0.2623)∗∗∗    (0.1711)       (1.3110)∗∗
Share of audience of nonviolent      0.7614         1.0786         1.1555         0.392          0.4031
  movies                             (0.1440)∗∗∗    (0.4652)∗∗     (0.2491)∗∗∗    (0.1741)∗∗     (1.2926)
Full set of controls                 X              X              X              X              X
Regressions weighted by number       X              X              X              X              X
  of households interviewed
  in day t
Average number of households in      157.88         22.61          53.94          81.29          3.96
  demographic group interviewed
  on day t
N                                    1,563          1,558          1,560          1,563          1,474

Notes. An observation is a Friday, Saturday, or Sunday over the years 1995–2004. The dependent variable is the share of the households in the diary CEX sample that reported attending a movie on day t. The audience shares are obtained from daily box-office revenues divided by the average price per ticket and then divided again by the U.S. population. Because both the dependent variable and the independent variables are measures of attendance to the theater in shares, the coefficients in column (1) should be close to 1. The coefficients in columns (2)–(4) indicate the degree of self-selection of different demographic categories into movies of different violence levels. See also notes to Table II. ∗Significant at 10%; ∗∗significant at 5%; ∗∗∗significant at 1%.
Although different types of movies should have the same impact on overall attendance, we expect differential sorting when we split the data by demographics (columns (2)–(5)). Indeed, younger households (heads ages 18 to 29, column (2)) have larger estimated coefficients, indicating that they attend the movies more often than older people. Younger households also select disproportionately into violent movies: they are 2.094/0.9469 = 2.2 times oversampled in strongly violent movies and 1.4642/0.7736 = 1.9 times oversampled in mildly violent movies, but only 1.0786/0.7614 = 1.4 times oversampled in nonviolent movies. Middle-aged households (heads ages 30 to 44, column (3)) and especially older households (heads over 45 years, column (4)) attend the movie theater less and display a flatter attendance pattern with respect to the violence content of movies. The age groups with higher crime rates (Table I), therefore, select into violent movies, a result consistent with selective incapacitation. Because men also have higher assault rates than women (Table I), it would be useful to differentiate by gender. Although this is generally problematic in the CEX data (which only report purchases at the household level), we can consider single men ages 18–29. In this group (column (5)), we find even greater evidence of selection. Single young males are 2.7751/0.9469 = 2.9 times oversampled in strongly violent movies and 2.7825/0.7736 = 3.6 times oversampled in mildly violent movies. Although the estimates for this small group should be taken with caution given the large standard errors, they indicate substantial sorting into violent movies.¹¹ We find substantial sorting even using relatively poor correlates of criminal behavior (age and gender). In addition to between-group sorting, we expect substantial within-group sorting.
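The oversampling factors in this paragraph are ratios of a group's Table VII coefficient to the all-household coefficient for the same type of movie. A short sketch with the Table VII estimates reproduces them; the dictionary keys and the function name are illustrative only.

```python
from typing import Tuple

# Table VII coefficients: (strongly violent, mildly violent, nonviolent).
COEFFICIENTS = {
    "All": (0.9469, 0.7736, 0.7614),
    "Ages 18-29": (2.094, 1.4642, 1.0786),
    "Single males 18-29": (2.7751, 2.7825, 0.4031),
}


def oversampling(group: str) -> Tuple[float, ...]:
    """Group coefficient divided by the all-household coefficient, by violence level."""
    return tuple(g / a for g, a in zip(COEFFICIENTS[group], COEFFICIENTS["All"]))


for group in ("Ages 18-29", "Single males 18-29"):
    print(group, [round(r, 1) for r in oversampling(group)])
```

A ratio above 1 means the group is overrepresented in the audience of that type of movie relative to its share of all households; the ratio rising with violence content is what the text reads as sorting.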
The combination of between- and within-group sorting can plausibly generate overrepresentation of potential criminals by a factor of 6 or 7, as implied by the effect on assaults.
11. When we split households by income (results not shown), we find strong evidence of selection into more violent movies by lower-income households, a selection pattern consistent with research documenting that the poor are more likely to be victims of aggravated assaults.
V.B. Lower Crime after Exposure—Sobriety
The second result is that exposure to movie violence also lowers violent crime in the night. We interpret this to mean that an evening spent at the movies leads to less dangerous activities in the night hours following exposure (i.e., $\alpha_i < \sigma$ in expression (4)). This could be because a visit to the movie theater involves less alcohol consumption, disrupts and alters an evening's activities, or places potential criminals in relatively safer environments once the movie is over. This is not a trivial finding, because movie theaters could have provided a meeting point for potential criminals, leading to an increase in crime.
Alcohol is a prominent factor that has been linked to violent crimes, and to assaults in particular (Carpenter and Dobkin forthcoming). Alcohol is banned in almost all movie theaters in the United States, so sobriety could well be a mechanism for reduced crime in the nighttime. To test this explanation, we examine whether the displacement is larger for assaults involving alcohol or drugs (columns (1) and (2) of Table VIII) than for assaults not involving such substances (columns (3) and (4)). Indeed, although the negative impact of movie violence on assaults is present in both samples, the estimates are on average 1.5 times larger for assaults involving alcohol or drugs. We also find large displacement effects in the night hours for assaults in bars and nightclubs and for arrests for drunkenness, although these estimates are imprecise (Online Appendix Table 3). To further test the impact of alcohol, in columns (5)–(8) we separately estimate the effect for offenders just under the legal drinking age (ages 17–20) and offenders just over the legal drinking age (ages 21–24). If the effect is due to alcohol consumption, it should be larger for the latter group, because the younger group is less likely to drink as part of its displaced alternative activity. Indeed, the effect of violent movies is two to three times larger for the over-age group. Finally, to provide direct evidence that movie attendance lowers alcohol consumption, we use data from the CEX diaries.
We examine whether exposure to violent movies reduces the share of respondents consuming alcohol away from home (column (9)). We find suggestive evidence that violent movies may have reduced alcohol consumption, though the estimates are not significantly different from zero.
V.C. Nonmonotonicity in Violent Content—Arousal
The third finding is that the negative effect in the night hours is not monotonic: strongly violent movies have a slightly smaller effect than mildly violent movies (−0.0192 versus −0.0205). This at
(1)
(2)
X X
1,560
X X
1,563
Assaults involving alcohol or drugs 6 P.M.– 12 A.M.– 12 A.M. 6 A.M. next day
1,563
X X
1,562
X X
Assaults not involving alcohol or drugs 6 P.M.– 12 A.M.– 12 A.M. 6 A.M. next day
1,563
X X
1,562
X X
Assault by offender ages 21–24 (over drinking age) 6 P.M.– 12 A.M.– 12 A.M. 6 A.M. next day
−0.02 −0.0213 (0.0089)∗∗ (0.0110)∗
0.0011 (0.0139)
1,563
X X
1,561
X X
Assault by offender ages 17–20 (under drinking age) 6 A.M.– 12 A.M.– 12 A.M. 6 A.M. next day
0.0065 (0.0106)
−0.0039 (0.0060)
−0.0057 (0.0048)
(8)
−0.0171 (0.0133)
(7)
−0.0197 −0.0229 −0.0338 −0.0112 (0.0059)∗∗∗ (0.0084)∗∗∗ (0.0107)∗∗∗ (0.0100)
(6)
−0.0088 (0.0046)∗
(5) −0.0058 (0.0149)
(4)
−0.0137 −0.0164 −0.0239 −0.0376 −0.0125 (0.0056)∗∗ (0.0070)∗∗ (0.0103)∗∗ (0.0115)∗∗∗ (0.0114)
(3)
Reg. (CEX data)
1,563
X
Alcohol consumption away from home Same day
−0.0271 (0.1993)
−0.1921 (0.2077)
−0.3303 (0.2696)
(9)
Share consuming alcohol away from home
Notes. The specifications are in IV regressions for specific types of assaults using NIBRS data in columns (1)–(8). Column (9) uses the CEX data used in Table VIII; the dependent variable is the share of the households in the diary CEX sample that reported consuming alcohol away from home. In column (9), the movie exposure variables are in share of the total population. See also notes to Table II. ∗ Significant at 10%; ∗∗ significant at 5%; ∗∗∗ significant at 1%.
Control variables Full set of controls Audience instrumented with predicted audience using next week’s audience N
Time of day
Type of crime
Instrumental variable regressions
Log (number of assaults of a type in day t in time window)
Audience of strongly −0.012 −0.0287 violent movies (millions (0.0080) (0.0109)∗∗∗ of people in day t) Audience of mildly −0.0183 −0.025 violent movies (millions (0.0071)∗∗ (0.0107)∗∗ of people in day t) Audience of −0.0068 −0.0102 nonviolent movies (millions (0.0076) (0.0114) of people in day t)
Dep. var.:
Specification:
TABLE VIII TEST OF SOBRIETY: EFFECT OF ALCOHOL CONSUMPTION
DOES MOVIE VIOLENCE INCREASE VIOLENT CRIME?
717
718
QUARTERLY JOURNAL OF ECONOMICS
first is puzzling, because strongly violent movies attract more potential criminals, and the additional selection should render the effect more negative. As discussed in Section II, however, this puzzle can be explained if strongly violent movies have a differential direct impact. We estimate the differential impact of strongly violent movies, α v − α, under the assumptions used to derive expression (5). Estimation of α v − α requires information about the selection of potential criminals x j into different movies. Although this selection is unobservable, we do observe selection along dimensions that correlate with criminal behavior, age, and gender. As Table I indicates, crimes are committed disproportionately by young males. We make the assumption that the selection of potential criminals into movie theaters, x j , is an affine transformation of the selection of young males, yi ; that is, x j = λ0 + λ1 yi . We can then estimate expression (5) by substituting the term ( yv − yn) / ( ym − yn) for the unobserved (x v − x n) / (x m − x n) . To estimate the sorting of young males, we turn to an auxiliary source of data, the IMDb.12 IMDb maintains a popular website for movie-goers, which invites its users to rate movies. A typical blockbuster movie is rated by tens of thousands of viewers. IMDb displays, for each movie, statistics on the rating for each combination of gender (male, female) and four age groups (under 18, 18 to 29, 30 to 44, and over 45). As a measure of the attractiveness of a movie to potential criminals, we use the share of raters that are male and are ages 18 to 29, a group disproportionately likely to commit crimes (see Table I). Figure II shows that the share of young male reviewers is fairly linear in the 0 to 10 violence ratings for movies from kids-in-mind.com. 
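The linear-selection prediction plotted as dotted lines in Figure III can be sketched in a few lines of Python. This is a minimal sketch under our own reading of the calculation, not the authors' code: the no-arousal coefficient for strongly violent movies extrapolates linearly from the nonviolent and mildly violent estimates, scaling their gap by the IMDb-based ratio $(y_v - y_n)/(y_m - y_n) = 1.718$ reported in the text. The inputs are the Table III, Panel A estimates as plotted in Figure III; the names `ACTUAL`, `predicted_strong`, and `arousal_gap` are hypothetical, and the bootstrap used for inference is not reproduced.

```python
"""Linear-selection prediction for strongly violent movies (Figure III sketch).

Assumption (our reading, not the authors' code): if x_j = lambda0 + lambda1 * y_j,
the unobserved selection ratio (x_v - x_n) / (x_m - x_n) equals the observed
(y_v - y_n) / (y_m - y_n), so the no-arousal prediction for strongly violent
movies extrapolates linearly from the nonviolent and mildly violent estimates.
"""

SELECTION_RATIO = 1.718  # (y_v - y_n) / (y_m - y_n), from the IMDb rater shares

# Estimated effects per million viewers (Table III, Panel A, as plotted in Figure III).
ACTUAL = {
    "evening": {"nonviolent": -0.0063, "mild": -0.0109, "strong": -0.0130},
    "night": {"nonviolent": -0.0060, "mild": -0.0205, "strong": -0.0192},
}


def predicted_strong(beta_n: float, beta_m: float, ratio: float) -> float:
    """No-arousal effect of strongly violent movies implied by selection alone."""
    return beta_n + ratio * (beta_m - beta_n)


def arousal_gap(block: str) -> tuple:
    """Return (predicted strong effect, observed minus predicted) for a time block."""
    b = ACTUAL[block]
    pred = predicted_strong(b["nonviolent"], b["mild"], SELECTION_RATIO)
    return pred, b["strong"] - pred


if __name__ == "__main__":
    for block in ("evening", "night"):
        pred, gap = arousal_gap(block)
        print(f"{block}: predicted {pred:.4f}, observed - predicted {gap:.4f}")
```

Under these assumptions the sketch prints a predicted effect of −0.0142 (evening) and −0.0309 (night), with observed-minus-predicted gaps of 0.0012 and 0.0117, which match the values shown in Figure III.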
The extent of selection is substantial: while the fraction of raters of nonviolent movies that are young males, $y_n$, is 0.459, the corresponding fraction for strongly violent movies, $y_v$, is 0.546. These data allow us to estimate $(y_v - y_n)/(y_m - y_n)$ as 1.718. Figure III displays both the actual impact of movie violence $\hat{\beta}_j$ (solid lines) and the predicted impact purely due to sorting (dotted lines). The two estimates are very close for crime in the evening hours, and one cannot reject the hypothesis that they are the same. This is to be expected, because a large share of the evening is spent inside the movie theaters, which mechanically implies $\alpha_v \approx \alpha \approx 0$. In the night hours, instead, the observed effect of strongly violent movies is substantially less negative than the effect predicted from selection alone, and the difference is marginally significant (p-value of .08).13 The estimated differential impact of movie violence, $\alpha_v - \alpha$, is sizable (.011) and equal to about one-third of the predicted impact of strongly violent movies due to sorting. We therefore detect some evidence that, after accounting for selection, violent movies induce more violent crime relative to nonviolent movies, consistent with an arousal effect. This may occur for the same reasons as in the laboratory: an emotional effect of arousal, or short-term imitation of violent acts. As in the laboratory, we find no evidence of a cathartic effect, which would have made the effect of strongly violent movies even more negative.
12. The CEX data used in Table VIII also indicate substantial selection: young households (with heads ages 18–29) select into strongly violent movies at a rate 43% higher than into mildly violent movies. We use the IMDb data because they provide a substantially more precise estimate.
13. Bootstrap standard errors take into account the sampling variability associated with $(y_v - y_n)/(y_m - y_n)$.
[Figure II plot: share of young males among IMDb reviewers of a movie (vertical axis, 0.35 to 0.6) against the rating of violent content of the movie from kids-in-mind.com (horizontal axis, 0 to 10), with lower and upper confidence bounds and reference lines for the averages of nonviolent, mildly violent, and strongly violent movies.]
FIGURE II Share of Young Males in Audience as Function of Movie Violence (Internet Movie Database Data) This plot employs IMDb rating data to provide a measure of the attractiveness to young males of movies of varying degrees of violence (0 is least violent, 10 is most violent). The measure of attractiveness to young males is the share of raters of a movie that report being male and ages 18 to 29. The plotted variable is the average share across all movies of a given violence level, weighted by the number of raters for the movie. The violence rating of movies is from kids-in-mind.com. The dotted lines are pointwise 95% confidence intervals.
[Figure III plot: effect of exposure to movie violence on assaults, by type of movie. Evening (6 P.M.–12 A.M.): actual estimates −.0063 (nonviolent), −.0109 (mildly violent), and −.0130 (strongly violent); predicted estimate with selection (no arousal) for strongly violent movies −.0142; difference between observed and predicted estimates = .0012, p-value = .85. Night (12 A.M.–6 A.M.): actual estimates −.0060, −.0205, and −.0192; predicted estimate for strongly violent movies −.0309; difference = .0117 (arousal effect), p-value = .08.]
FIGURE III Effect of Movie Violence on Assaults: Selection and Arousal Effects This figure displays both the actual impact of movie exposure on violent crime (solid lines) and the predicted impact with linear selection (dotted lines) by type of movie (nonviolent/mildly violent/strongly violent) and by time block (evening 6 P.M.–12 A.M./night 12 A.M.–6 A.M.). The estimates of the actual impact (solid lines) are reproduced from columns (3) and (4) of Table III, Panel A, and can be interpreted as the percent change in violent crime due to the exposure of one million people to movies of type j in time period t. For example, an increase of one million in the audience of mildly violent movies lowers violent crime by 1.09% in the evening time block and by 2.05% in the nighttime block. The estimates of the predicted impact with linear selection (dotted lines) are computed using the estimates for nonviolent and mildly violent movies, taking into account the increased selection of criminals into strongly violent movies and assuming that all types of movies have the same direct effect on violent crime. The (unobserved) selection of criminals into movies is assumed to be related linearly to the (observed) selection of young males into movies. The comparison between the predicted and the actual effect of violent movies provides an estimate of the differential effect of strongly violent movies relative to mildly violent and nonviolent movies. The figure shows a marginally significant difference in the actual and predicted impact for the nighttime block: compared to the predicted impact, strongly violent movies cause more crime, consistent with an arousal effect of strongly violent movies. Details on the calculations of the difference are in the text.
Our field evidence, hence, provides a natural comparison of the size of the arousal effect to the other main impact of movie violence, time use. Although the estimated arousal effect on violence is sizable, it is one-third as large as the foregone violence associated with the alternative activity. We also point out that this evidence should be considered suggestive, given the assumptions involved. Other explanations for this nonmonotonic pattern are also possible. For example,
a potential offender may attend a mildly violent movie with a girlfriend and a strongly violent movie with drinking buddies. This could have an independent effect on the level of violence.
V.D. Larger Nighttime Estimates—Compositional Effects
The fourth finding is that the impact of movie violence on assaults is larger in the night hours following movie exposure (12 A.M.–6 A.M.) than in the evening hours (6 P.M.–12 A.M.). This finding might seem puzzling, because the largest decrease in crime should occur when potential criminals are in the movie theater, where committing crimes is nearly impossible. However, the composition of crimes in the two time periods is different, making a direct comparison of the size of the effects difficult. For example, assaults involving alcohol or drugs and assaults committed by offenders just over the legal drinking age are much more common in the night hours than in the evening hours (Table I). As previously noted, alcohol-related assaults respond more to violent movie exposure (Table VIII). Hence, the decrease in alcohol consumption, a primary mechanism for the effects, is likely to prevent a higher fraction of violent crimes in the night (when inebriation would have the most impact) than in the evening. In other words, the activities prevented by movie attendance in the night hours are more dangerous (in the model, have a larger σ) than the activities prevented in the evening hours. Broadly speaking, we obtain similar compositional differences in the pattern of assaults by demographics (shown in Online Appendix Table 5). The impact of exposure to violent movies is larger (i.e., more negative) for male offenders than for female offenders, especially in the night hours, and male offenders commit a higher share of the assaults at night than in the evening hours (Table I).
We also find a relatively monotonic decline in effect sizes with offender age (with the exception of the 45–54 age group), which contributes to explaining the findings, because younger offenders also account disproportionately for nighttime assaults (Table I).
V.E. Additional Evidence on Selection
In both the evening and the night hours, violent movies lower crime more than nonviolent movies. Our explanation for these facts is selection: violent movies are more likely to attract potential criminals. We now test another implication of selection, that
movies that draw young men tend to decrease violent crime, even if the movies are not violent. We divide movies into thirds based on the fraction of young men rating a movie in the IMDb (see Figure II), and label the categories as Not Liked, Liked, and Highly Liked by young males. Table IX reports information on the blockbusters within the three categories, holding constant the kids-in-mind.com violence rating. Among nonviolent movies, Runaway Bride is not liked by young males, while Austin Powers in Goldmember is highly liked. For mildly violent movies, Save The Last Dance and Spider-Man are best sellers in the Not Liked and Highly Liked categories, respectively. Among strongly violent movies, there are essentially no blockbusters that are not liked by young males, because movie violence and liking by young males are highly correlated. However, the IMDb information distinguishes between movies in the middle group such as Passion of the Christ and movies in the top group such as Hannibal. To estimate the impact of movie attendance on violence within each of the nine cells, we estimate $\ln V_t = \sum_{j=1}^{9} \beta_j A^j_t + \Gamma X_t + \varepsilon_t$, where $j = 1, \ldots, 9$ denotes the nine cells and $A^j_t$ is the audience (in millions of people) of movies in cell $j$ on day $t$. We adopt the full set of controls and use the baseline instrument. Table IX reports within each cell the coefficients for the evening time block and for the nighttime block. Moving down within a column shows that more violent movies are generally associated with lower crime, even holding constant the liking by young males (except for movies not liked by young males, where the violent movie category is very sparse and hence the estimates are very noisy). For example, among the movies highly liked by young males, the estimated parameters $\hat{\beta}_j$ are −.0090 (nonviolent), −.0111 (mild violence), and −.0140 (strong violence) for the evening hours. These patterns are broadly consistent with the interpretations discussed in Sections V.A–V.D.
More important for a test of selection, moving along a row the coefficients also generally become more negative. In nine of twelve pairwise comparisons, the estimates become more negative as the liking by males increases (seven of ten if we exclude the bottom-left group, which is very sparse). Movies that attract more young males, therefore, appear to lower the incidence of violent crimes more, even holding constant the level of violence in a movie. These results underscore the importance of selection. Exposure to movies that attract more violent groups (along observable lines) is associated with lower rates of violent crime.
TABLE IX
MOVIE BLOCKBUSTER BY IMDB RATING AND VIOLENCE
Rows: violence rating from kids-in-mind.com (column (1)). Within each row: blockbuster movies not liked (column (2)), liked (column (3)), and highly liked (column (4)) by young males, each listed with (date, millions of people). "Effect on crime" gives the coefficient (standard error) for the evening (6 P.M.–12 A.M.) and night (12 A.M.–6 A.M.) time blocks.
0–4 Nonviolent movies
  (2) Not liked by young males: Top 1 Harry Potter and the Chamber of Secrets (11/16/02, 15.2); Top 2 Harry Potter and the Chamber of Secrets (11/23/02, 7.3); Top 3 Runaway Bride (7/31/99, 6.8); Top 4–6 Sweet Home Alabama, America’s Sweethearts, Erin Brockovich. Effect on crime: −0.0041 (0.0062) evening; 0.0049 (0.0071) night.
  (3) Liked by young males: Top 1 Shrek 2 (5/22/04, 17.4); Top 2 Harry Potter and the Sorcerer’s Stone (11/17/01, 15.9); Top 3 Shrek 2 (5/29/04, 11.8); Top 4–6 Finding Nemo, Toy Story 2, Monsters Inc. Effect on crime: −0.0035 (0.0042) evening; −0.0057 (0.0055) night.
  (4) Highly liked by young males: Top 1 Austin Powers in Goldmember (7/27/02, 12.6); Top 2 Incredibles (11/6/04, 11.3); Top 3 Bruce Almighty (5/24/03, 11.2); Top 4–6 Ace Ventura: When Nature Calls, Waterboy, Big Daddy. Effect on crime: −0.0090∗ (0.0053) evening; −0.0079 (0.0063) night.
5–7 Mildly violent movies
  (2) Not liked by young males: Top 1 Double Jeopardy (9/25/99, 4.6); Top 2 Save The Last Dance (1/13/01, 4.1); Top 3 Double Jeopardy (10/2/99, 3.3); Top 4–6 Absolute Power, Random Hearts, Unfaithful. Effect on crime: 0.0049 (0.0111) evening; −0.0268∗ (0.0141) night.
  (3) Liked by young males: Top 1 Harry Potter and the Prisoner of Azkaban (6/5/04, 15.1); Top 2 Mummy Returns (5/5/01, 12.4); Top 3 Planet of the Apes (7/28/01, 12.3); Top 4–6 Day after Tomorrow, Independence Day, Pearl Harbor. Effect on crime: −0.0099∗∗ (0.0047) evening; −0.0177∗∗∗ (0.0057) night.
  (4) Highly liked by young males: Top 1 Spider-Man (5/4/02, 19.8); Top 2 Matrix Reloaded (5/17/03, 15.2); Top 3 Lost World: Jurassic Park (5/24/97, 14.3); Top 4–6 Spider-Man 2, X2: X-Men, Star Wars 2. Effect on crime: −0.0111∗∗∗ (0.0039) evening; −0.0179∗∗∗ (0.0052) night.
8–10 Strongly violent movies
  (2) Not liked by young males: Top 1 Missing (11/29/03, 1.8); Top 2 Nurse Betty (9/9/00, 1.3); Top 3 Copycat (11/4/95, 1.2); Top 4–6 Jade, In Dreams, A Rich Man’s Wife. Effect on crime: 0.0625 (0.0384) evening; 0.0526 (0.0549) night.
  (3) Liked by young males: Top 1 Passion of the Christ (2/28/04, 13.5); Top 2 Passion of the Christ (3/6/04, 8.5); Top 3 Air Force One (7/26/97, 7.9); Top 4–6 Ransom, Sleepy Hollow, General’s Daughter. Effect on crime: −0.0084 (0.0082) evening; −0.0252∗∗∗ (0.0087) night.
  (4) Highly liked by young males: Top 1 Hannibal (2/10/01, 10.1); Top 2 Jurassic Park 3 (7/21/01, 9.1); Top 3 Scary Movie (7/8/00, 8.2); Top 4–6 Bad Boys 2, Troy, Terminator 3. Effect on crime: −0.0140∗∗∗ (0.0047) evening; −0.0150∗∗ (0.0061) night.
Notes. We divide movies into thirds using the fraction of IMDb raters of a movie that are male and of ages 18–29. Movies not liked by young males are those in the bottom third of this distribution, movies liked by young males are in the middle third, and movies highly liked by young males are in the top third. The ratings of movie violence are from kids-in-mind.com. The table divides movies into nine categories defined by the interaction of how liked the movie is by young males and the violence level. The top three movies with the highest weekend audience are reported for each category, along with the next three largest distinct blockbuster movies. The "Effect on crime" rows report the coefficients on the audience sizes for each of the nine categories from two separate regressions for the evening (6 P.M.–12 A.M.) and nighttime hours (12 A.M.–6 A.M.), where the dependent variable is log(number of assaults occurring in day t in the specified time block) and the independent variables are the audiences in millions of people for movies in each of the nine categories. See also notes to Table II. ∗ Significant at 10%; ∗∗ significant at 5%; ∗∗∗ significant at 1%.
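The pairwise count reported in Section V.E (nine of twelve, or seven of ten without the sparse bottom-left cell) can be checked against the Table IX coefficients. This is a minimal sketch that assumes "pairwise comparisons" means adjacent liking categories (not liked to liked, liked to highly liked) within each violence row and time block; under that reading it reproduces both counts. The dictionary layout and the function `count_more_negative` are illustrative, not the authors' code.

```python
# Table IX coefficients: TABLE_IX[violence][liking] = (evening, night).
# Liking categories are ordered from least to most liked by young males.
LIKING = ("not_liked", "liked", "highly_liked")

TABLE_IX = {
    "nonviolent": {
        "not_liked": (-0.0041, 0.0049),
        "liked": (-0.0035, -0.0057),
        "highly_liked": (-0.0090, -0.0079),
    },
    "mild": {
        "not_liked": (0.0049, -0.0268),
        "liked": (-0.0099, -0.0177),
        "highly_liked": (-0.0111, -0.0179),
    },
    "strong": {
        "not_liked": (0.0625, 0.0526),
        "liked": (-0.0084, -0.0252),
        "highly_liked": (-0.0140, -0.0150),
    },
}


def count_more_negative(table, skip=frozenset()):
    """Count adjacent comparisons along each row (and time block) in which the
    coefficient becomes more negative as liking by young males increases.

    `skip` holds (violence, liking) cells to exclude; any comparison that
    involves an excluded cell is dropped. Returns (hits, comparisons).
    """
    hits = total = 0
    for violence, row in table.items():
        for block in (0, 1):  # 0 = evening, 1 = night
            for lo, hi in zip(LIKING, LIKING[1:]):
                if (violence, lo) in skip or (violence, hi) in skip:
                    continue
                total += 1
                if row[hi][block] < row[lo][block]:
                    hits += 1
    return hits, total


print(count_more_negative(TABLE_IX))  # (9, 12)
print(count_more_negative(TABLE_IX, skip={("strong", "not_liked")}))  # (7, 10)
```

Dropping the sparse strongly violent, not-liked cell removes two comparisons, both of which point in the predicted direction, which is why the count falls from 9/12 to 7/10.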
VI. CONCLUSION
We have provided causal evidence on the short-run effect of exposure to media violence on violent crime. We exploit the natural experiment induced by time-series variation in the violence of movies at the box office. We show that exposure to violent movies has three main effects on violent crime: (i) it significantly reduces violent crime in the evening on the day of exposure; (ii) by an even larger percent, it reduces violent crime during the night hours following exposure; (iii) it has no significant impact in the days and weeks following the exposure. We interpret the first finding as voluntary incapacitation: potential criminals who choose to attend the movie theater forego other activities that have higher crime rates. As simple as this finding is, it has been neglected in the literature, despite its quantitative importance. We interpret the second finding as substitution away from a night of more volatile activities, in particular, a reduction in alcohol consumption. The third finding implies that the same-day impact on crime is not offset by intertemporal substitution of crime. An important component of these interpretations is the sorting of more violent individuals into violent movie attendance. These findings appear to contradict evidence from laboratory experiments that document an increase in violent behavior following exposure to movie violence. However, the field and laboratory findings are not contradictory. Exposure to movie violence can lower violent behavior relative to the foregone alternative activity (the field finding), even if it increases violent behavior relative to exposure to nonviolent movies (the laboratory finding). In fact, we document suggestive evidence that, after accounting for selection, violent movies induce more violent crime relative to nonviolent movies, consistent with an arousal effect.
This example suggests that other apparent discrepancies between laboratory and field studies (see Levitt and List [2007]) might be reconciled if differences in treatment and setup are taken into account. In addition, the field evidence provides a bound for the laboratory finding of an arousal effect, which we estimate in the field to be one-third as large as the time-use effect. Given that movie attendance occupies a significant portion of leisure time use, our findings imply first-order welfare effects. We can calculate the change in assaults that would occur if the audience of violent movies did not go to the movies but instead
engaged in their next best alternative. The total number of evening and nighttime assaults prevented is 997 assaults per weekend, adding up to almost 52,000 weekend assaults prevented yearly.14 With an estimated (in year 2007 dollars) direct monetary cost of $2,217 and an estimated intangible quality-of-life cost of $11,154 per assault (Miller, Cohen, and Wiersema 1996), this implies a benefit of roughly $695 million each year. Our estimates suggest that a strongly violent blockbuster movie such as Hannibal (with 10.1 million viewers on opening weekend) reduced assaults by 1,056 on its opening weekend, which amounts to a 5.2% decrease in assaults, about half the impact of the reduction in crime due to a cold day. This substantial short-term impact of violent movies had been overlooked by the previous literature. Of course, if strongly violent movies were banned as a matter of public policy, our estimated short-term effects could be partly offset if studios respond by producing more mildly violent movies. The degree to which this would temper our findings depends on how substitutable strongly and mildly violent movies are for each other. This substitution, however, is likely to be imperfect; a regression of strongly violent movie attendance on mildly violent movie attendance (including all the baseline controls of Table III) yields a coefficient of −.196 (s.e. .028). This implies that there will be substantial substitution to other nonmovie activities as well, and our empirical results suggest that these nonmovie activities are more conducive to violent behavior. In this paper, we find no impact of violent movies in the days and weeks following exposure. Still, our design (like the laboratory experiments) cannot address the important question of the long-run effect of exposure to movie violence. As such, this paper does not provide evidence on the long-term effects of a policy limiting the level of violence allowed in the media.
However, it does indicate that in the short run these policies will likely increase violent crime, because they induce substitution toward more dangerous activities.
Finally, a central point of our paper is that the merits of any particular activity must be viewed relative to the next best activity in utility terms. As such, our findings are relevant beyond the case of movies. For example, violent video games may well increase aggression, but they also incapacitate potential offenders for a substantial period of time. More generally, we hypothesize that other activities with a controlled, alcohol-free environment that attract young men, such as Midnight Basketball, should also reduce crime in the short run.
14. We assume: (i) no impact of media violence on assaults beyond the evening and night of the media exposure, (ii) no substitution toward other movies, and (iii) effects for the whole population being the same as for the set of cities in the NIBRS sample. We calculate the effect separately for each time block (evening and night) and level of violence (strong and mild). We multiply the estimated baseline coefficient by the assault rate in NIBRS data times the U.S. population (300 million), times average violent movie attendance.
APPENDIX I: DATA APPENDIX
A. Imputation of Daily Box-Office Audience
The daily box-office movie revenue for the ten highest-selling movies is available starting in September 1997. To extend coverage to January 1995–August 1997 and to movies that do not make the daily top-ten list, we make use of weekend revenue for the fifty highest-selling movies, because this is available throughout the whole sample. We take advantage of the regularity in the within-week pattern of sales and impute the daily data, whenever missing, using the weekend box-office data for the same movie in the same week. Denote by $a_{j,t}$ the daily audience of movie $j$ on date $t$, and by $a^w_{j,w(t)}$ the weekend audience of movie $j$ on weekend $w(t)$ corresponding to date $t$. (Because most movies are released on Friday, the function $w(t)$ assigns the days from Monday through Thursday to the previous weekend.) We assume that the daily audience is a share $s$ of the weekend audience, where the share is allowed to depend on a set of controls $Y$: $a_{j,t} = s(Y)\, a^w_{j,w(t)}$. In logs, the model can be written as $\ln a_{j,t} = \ln s(Y) + \ln a^w_{j,w(t)}$. The most important control for the share $\ln s(Y)$ is the set of day-of-week indicators $d^d_t$, because different days of the week capture a different share of the overall revenue (Table I).
In addition, we use the following controls $X_{j,t}$ for the weekday share: month indicators (in the summer the Monday–Thursday audience is larger), a linear time trend, indicators for the level of violence (nonviolent versus mildly violent versus strongly violent), indicators for rating type (G/PG/PG-13/R/NC-17/Unrated/Missing Rating), indicators for week of release (up to week 26), and indicators for audience size in week $w(t)$ (audience 10 and ≤20, >20 and ≤32), the heat index falling in one of three categories (>100 and ≤115, >115 and ≤130, >130), the wind speed falling in one of two categories (>17 and ≤21, >21), any rain, and any snow.
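The imputation model of Appendix I can be sketched as follows. This is a simplified illustration, not the paper's specification: it keeps only the day-of-week indicators, so the least-squares estimate of $\ln s(Y)$ for each weekday is simply the average log ratio of daily to weekend audience on that weekday. The function names (`weekend_of`, `fit_log_shares`, `impute_daily`) and the toy audiences in the usage example are hypothetical.

```python
"""Sketch of the Appendix I imputation: the daily audience is a share of the
weekend audience, with the log share estimated from day-of-week indicators only.
Simplification: the paper's model also includes month, trend, violence, rating,
week-of-release, and audience-size controls, which are omitted here."""
import math
from collections import defaultdict
from datetime import date, timedelta


def weekend_of(d: date) -> date:
    """w(t): the Friday that opens the weekend assigned to date d.

    Friday-Sunday map to their own weekend; Monday-Thursday map to the
    previous weekend (date.weekday(): Monday = 0, ..., Friday = 4).
    """
    return d - timedelta(days=(d.weekday() - 4) % 7)


def fit_log_shares(obs):
    """OLS of ln(a_jt) - ln(aw_j,w(t)) on day-of-week dummies alone.

    With a full set of mutually exclusive dummies, each day's coefficient is
    the mean of the dependent variable on that day.
    `obs` is an iterable of (date, daily_audience, weekend_audience).
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for d, daily, weekend in obs:
        dow = d.weekday()
        sums[dow] += math.log(daily) - math.log(weekend)
        counts[dow] += 1
    return {dow: sums[dow] / counts[dow] for dow in sums}


def impute_daily(d: date, weekend_audience: float, log_shares) -> float:
    """a_hat_jt = exp(ln s_hat(day of week) + ln aw_j,w(t))."""
    return math.exp(log_shares[d.weekday()] + math.log(weekend_audience))


if __name__ == "__main__":
    # Toy data (hypothetical): weekend audiences keyed by the Friday opening each weekend.
    weekend_audience = {date(2002, 5, 3): 10.0, date(2002, 5, 10): 4.0, date(2002, 5, 17): 8.0}
    observed_daily = {date(2002, 5, 6): 1.0, date(2002, 5, 13): 0.4}  # two Mondays
    obs = [(d, a, weekend_audience[weekend_of(d)]) for d, a in observed_daily.items()]
    shares = fit_log_shares(obs)
    missing_day = date(2002, 5, 20)  # a Monday with no daily figure
    print(round(impute_daily(missing_day, weekend_audience[weekend_of(missing_day)], shares), 6))  # 0.8
```

In the toy data both observed Mondays draw 10% of their weekend audience, so the imputed Monday audience for a weekend of 8 million is 0.8 million. Working in logs turns the multiplicative share into an additive term, which is what lets the paper add further controls linearly.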
APPENDIX II: INSTRUMENTS
A. Benchmark Instrument
Our set of instruments uses information on the following weekend’s audience for the same movie to predict movie attendance, and then aggregates these predictors across all movies of a given violence level. The procedure is similar to the imputation procedure described in Appendix I. We assume the daily audience of movie $j$ on day $t$, $a_{j,t}$, is a share of the weekend audience in the same week $w(t)$, where the share is allowed to depend on a set of controls. In addition, we assume that the weekend audience decays each week at a rate that is also a function of the controls. This specification allows the decay rate to vary by weekday and differentially so for different types of movies. We use the same controls (including interactions with day of week) as for the imputation procedure described in Appendix I with three differences: (i) the indicators for audience size refer to week $w(t)+1$ (as opposed to week $w(t)$); (ii) we add two indicators for slow releases, that is, indicators for the cases in which the weekend audience for week $w(t)$ is less than 3 and less than 5 times smaller than in week $w(t)+1$; (iii) we add 365 day-of-year indicators $\eta_{d(t)}$ (not interacted with day of week). As in Appendix I, we estimate a log model, with $\ln a_{j,t} - \ln a^w_{j,w(t)+1}$ as the dependent variable. The regression uses observations with nonimputed movie audience and is weighted by next weekend’s audience $a^w_{j,w(t)+1}$. We obtain the predicted daily audience using $\hat{a}_{j,t} = \exp[\ln a^w_{j,w(t)+1} + \widehat{(\ln a_{j,t} - \ln a^w_{j,w(t)+1})}]$, where the wide hat denotes the fitted value of the log ratio from this regression. To generate the predicted audiences $\hat{A}^n_t$, $\hat{A}^m_t$, and $\hat{A}^v_t$, we aggregate across movies in the relevant violence category. We note that a coarser, but simpler, approach is to use as instruments the audience in week $w(t)+1$ of all movies in a category (strongly violent, mildly violent, and nonviolent). The empirical results using this approach are similar, although somewhat noisier (see Online Appendix Table 1).
B. Instrument for DVD/VHS Rentals
The instrument for DVD and VHS rentals is constructed similarly to the benchmark instrument, except that Video Store Magazine publishes DVD and VHS rentals only at the weekly level. Hence, we estimate the equivalent of the predictive specification for the benchmark instrument, but without day-of-week dummies and day-of-week interaction variables. The regression is weighted by the next week’s rentals $a^w_{j,w(t)+1}$. The set of controls, as for the standard instrument, includes month indicators, a linear time trend, indicators for the level of violence, indicators for rating type, and indicators for rentals in week $w(t)+1$. The holiday controls are separate indicators for whether the week $w(t)$ includes any of the holidays described in Appendix I, and whether the week $w(t)+1$ includes any of these holidays. The predicted values from the regressions are used to generate the predicted weekly rentals $\hat{a}_{j,t}$. These predicted rentals are then apportioned to each day of week using the within-week shares of rentals from the CEX time diaries.
C. Theaters and Budget Instrument
The estimates in column (2) of Table V use instruments based on the number of theater screens on which a movie plays and its production budget (Moretti 2008). We use data from thenumbers.com and renormalize the number of screens and budget by the corresponding 90th percentile of each variable for that year.
We use the number of screens in levels and take the log of production budget (setting it equal to zero for missing production budgets and adding an indicator variable for missing production budgets). Because the predictability of audience using number of screens and budget varies with both the weekday and the number of weeks a movie has been out, we interact these screen and
budget variables with indicators for day of week as well as number of weeks out (0 weeks, 1 week, 2–4 weeks, 5–9 weeks, 10–19 weeks, 20–26 weeks, >26 weeks). We estimate a log model, with $\ln a_{j,t}$ as the dependent variable, using observations with nonimputed movie audience and weighting by the number of screens next week. The set of controls is the same as for the standard instrument, except that we do not use information on the audience next week.
UC SAN DIEGO AND NBER
UC BERKELEY AND NBER
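The benchmark instrument of Appendix II.A can be sketched in the same spirit. Under the simplifying assumption that the log ratio $\ln a_{j,t} - \ln a^w_{j,w(t)+1}$ depends only on day of week, the weighted regression reduces to a weighted mean per weekday, with next weekend’s audience as the weight; predicted daily audiences are then summed within each violence category to form the category-level predicted audiences. The paper's richer control set (month, trend, violence, rating, release week, slow-release, and day-of-year indicators) is omitted, and all function names and data in the example are hypothetical.

```python
"""Benchmark-instrument sketch (Appendix II.A), heavily simplified.

Assumption: the log ratio ln(a_jt) - ln(aw_j,w(t)+1) depends only on day of
week, so weighted least squares on day-of-week dummies reduces to a weighted
mean per weekday, with next weekend's audience aw_j,w(t)+1 as the weight.
"""
import math
from collections import defaultdict


def fit_weighted_log_ratio(rows):
    """rows: iterable of (day_of_week, daily_audience, next_weekend_audience).

    Returns {day_of_week: fitted log ratio}, weighting each observation by
    next weekend's audience.
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for dow, daily, aw_next in rows:
        num[dow] += aw_next * (math.log(daily) - math.log(aw_next))
        den[dow] += aw_next
    return {dow: num[dow] / den[dow] for dow in num}


def predicted_category_audience(movies, coef):
    """movies: iterable of (violence_category, day_of_week, next_weekend_audience)
    for the movies showing on one day t.

    Returns {category: predicted audience}, the sum over movies of
    exp[ln(aw_next) + fitted log ratio].
    """
    totals = defaultdict(float)
    for category, dow, aw_next in movies:
        totals[category] += math.exp(math.log(aw_next) + coef[dow])
    return dict(totals)


if __name__ == "__main__":
    FRIDAY = 4
    # Hypothetical nonimputed observations: (day of week, a_jt, aw_j,w(t)+1), millions.
    rows = [(FRIDAY, 2.0, 8.0), (FRIDAY, 2.0, 4.0)]
    coef = fit_weighted_log_ratio(rows)
    # Hypothetical movies on one Friday: (category, day of week, next weekend's audience).
    movies = [("strong", FRIDAY, 8.0), ("mild", FRIDAY, 4.0), ("mild", FRIDAY, 2.0)]
    print(predicted_category_audience(movies, coef))
```

Weighting by next weekend's audience lets large releases dominate the fitted within-week pattern; in the toy data the weighted log ratio is $-\tfrac{5}{3}\ln 2$ rather than the unweighted $-\tfrac{3}{2}\ln 2$.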
REFERENCES
American Academy of Pediatrics, American Academy of Child and Adolescent Psychiatry, American Medical Association, American Psychological Association, American Academy of Family Physicians, and American Psychiatric Association, Joint Statement on the Impact of Entertainment Violence on Children, Congressional Public Health Summit, July 26, 2000. Available at http://www.aap.org/advocacy/releases/jstmtevc.htm.
Anderson, Craig, Leonard Berkowitz, Edward Donnerstein, L. Rowell Huesmann, James D. Johnson, Daniel Linz, Neil M. Malamuth, and Ellen Wartella, “The Influence of Media Violence on Youth,” Psychological Science in the Public Interest, 4 (2003), 81–110.
Anderson, Craig A., and Brad J. Bushman, “Effects of Violent Video Games on Aggressive Behavior, Aggressive Cognition, Aggressive Affect, Physiological Arousal, and Prosocial Behavior: A Meta-analytic Review of the Scientific Literature,” Psychological Science, 12 (2001), 353–359.
Ariely, Dan, and George Loewenstein, “The Heat of the Moment: The Effect of Sexual Arousal on Sexual Decision Making,” Journal of Behavioral Decision Making, 18 (2005), 1–12.
Bandura, Albert, Dorothea Ross, and Sheila A. Ross, “Imitation of Film-Mediated Aggressive Models,” Journal of Abnormal and Social Psychology, 66 (1963), 3–11.
Besley, Timothy, and Robin Burgess, “The Political Economy of Government Responsiveness: Theory and Evidence from India,” Quarterly Journal of Economics, 117 (2002), 1415–1451.
Bushman, Brad J., “Moderating Role of Trait Aggressiveness in the Effects of Violent Media on Aggression,” Journal of Personality and Social Psychology, 69 (1995), 950–960.
Card, David, and Gordon Dahl, “The Impact of Emotional Cues on Domestic Violence,” Rochester Center for Economic Research Working Paper No. 546, 2009.
Carpenter, Christopher, and Carlos Dobkin, "The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age," American Economic Journal: Applied Economics, 1 (2009), 164–182.
DellaVigna, Stefano, and Ethan Kaplan, "The Fox News Effect: Media Bias and Voting," Quarterly Journal of Economics, 122 (2007), 1187–1234.
Einav, Liran, "Seasonality in the U.S. Motion Picture Industry," RAND Journal of Economics, 38 (2007), 127–145.
Federal Trade Commission, Marketing Violent Entertainment to Children: A Review of Self-Regulation and Industry Practices in the Motion Picture, Music Recording, and Electronic Game Industries, 2000. Available online at http://www.ftc.gov/reports/violence/vioreport.pdf.
Gentzkow, Matthew, "Television and Voter Turnout," Quarterly Journal of Economics, 121 (2006), 931–972.
Gentzkow, Matthew, and Jesse Shapiro, "Preschool Television Viewing and Adolescent Test Scores: Historical Evidence from the Coleman Study," Quarterly Journal of Economics, 123 (2008), 279–323.
Jacob, Brian, and Lars Lefgren, "Are Idle Hands the Devil's Workshop? Incapacitation, Concentration and Juvenile Crime," American Economic Review, 93 (2003), 1560–1577.
Jacob, Brian, Lars Lefgren, and Enrico Moretti, "The Dynamics of Criminal Behavior: Evidence from Weather Shocks," Journal of Human Resources, 42 (2007), 489–527.
Johnson, Jeffrey G., Patricia Cohen, Elizabeth M. Smailes, Stephanie Kasen, and Judith S. Brook, "Television Viewing and Aggressive Behavior during Adolescence and Adulthood," Science, 295 (2002), 2468–2471.
Lazear, Edward, Ulrike Malmendier, and Roberto Weber, "Sorting in Experiments with Application to Social Preferences," NBER Working Paper No. 12041, 2006.
Levitt, Steven, "The Effect of Prison Population Size on Crime Rates: Evidence from Prison Overcrowding Litigation," Quarterly Journal of Economics, 111 (1996), 319–352.
Levitt, Steven, and John List, "What Do Laboratory Experiments Measuring Social Preferences Tell Us about the Real World?" Journal of Economic Perspectives, 21 (2007), 153–174.
Loewenstein, George, and Jennifer Lerner, "The Role of Affect in Decision Making," in Handbook of Affective Sciences, Richard J. Davidson, Klaus R. Scherer, and H. Hill Goldsmith, eds. (Oxford, UK: Oxford University Press, 2003).
Miller, Ted, Mark Cohen, and Brian Wiersema, Victim Costs and Consequences: A New Look, Research report prepared for National Institute of Justice, U.S. Department of Justice, 1996. Available at http://www.ncjrs.gov/pdffiles/victcost.pdf.
Moretti, Enrico, "Social Learning and Peer Effects in Consumption: Evidence from Movie Sales," NBER Working Paper No. 13832, 2008.
Rees, Daniel, and Kevin Schnepel, "College Football Games and Crime," Journal of Sports Economics, 10 (2009), 68–87.
Strömberg, David, "Radio Impact on Public Spending," Quarterly Journal of Economics, 119 (2004), 189–221.
POWER TO THE PEOPLE: EVIDENCE FROM A RANDOMIZED FIELD EXPERIMENT ON COMMUNITY-BASED MONITORING IN UGANDA∗
MARTINA BJÖRKMAN AND JAKOB SVENSSON
This paper presents a randomized field experiment on community-based monitoring of public primary health care providers in Uganda. Through two rounds of village meetings, localized nongovernmental organizations encouraged communities to be more involved with the state of health service provision and strengthened their capacity to hold their local health providers to account for performance. A year after the intervention, treatment communities are more involved in monitoring the provider, and the health workers appear to exert higher effort to serve the community. We document large increases in utilization and improved health outcomes (reduced child mortality and increased child weight) that compare favorably to some of the more successful community-based intervention trials reported in the medical literature.
I. INTRODUCTION
Approximately eleven million children under the age of five die each year, and almost half of these deaths occur in sub-Saharan Africa. More than half of these children die of diseases (e.g., diarrhea, pneumonia, malaria, measles, and neonatal disorders) that could easily have been prevented or treated if the children had had access to a small set of proven, inexpensive services (Black, Morris, and Bryce 2003; Jones et al. 2003). Why are these services not provided? Anecdotal and, more recently, systematic evidence points to one possible reason: ineffective systems of monitoring and weak accountability relationships.1
∗ This project is a collaborative exercise involving many people. Foremost, we are deeply indebted to Frances Nsonzi and Ritva Reinikka for their contributions at all stages of the project. We would also like to acknowledge the important contributions of Gibwa Kajubi, Abel Ojoo, Anthony Wasswa, James Kanyesigye, Carolyn Winter, Ivo Njosa, Omiat Omongin, Mary Bitekerezo, and the field and data staff with whom we have worked over the years. We thank the Uganda Ministry of Health, Planning Division, the World Bank's Country Office in Uganda, and the Social Development Department, the World Bank, for their cooperation. We are grateful for comments and suggestions by Paul Gertler, Esther Duflo, Abhijit Banerjee, and seminar and conference participants at Stanford, Berkeley, LSE, Oxford, IGIER, MIT, World Bank, NTNU, Namur, UPF, CEPR/EUDN conference in Paris, and BREAD & CESifo conference in Venice. We also thank three anonymous referees and the editor, Lawrence Katz, for very constructive suggestions. Financial support from the Bank-Netherlands Partnership Program, the World Bank Research Committee, the World Bank Africa Region division, and the Swedish International Development Agency, Department for Research Cooperation is gratefully acknowledged. Björkman also thanks Jan Wallander's and Tom Hedelius' Research Foundation for funding.
© 2009 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, May 2009
This paper focuses on one of these accountability relationships, citizen-clients' ability to hold providers accountable, using primary health care provision in rural Uganda as a testing ground. To examine whether community-based monitoring works, we designed and conducted a randomized field experiment in fifty communities from nine districts in Uganda. In the experiment, local nongovernmental organizations (NGOs) facilitated village and staff meetings in which members of the communities discussed baseline information on the status of health service delivery relative to other providers and the government standard. Community members were also encouraged to develop a plan identifying key problems and steps the providers should take to improve health service provision. The primary objective of the intervention was to initiate a process of community-based monitoring that was then up to the community to sustain and lead.
The community-based monitoring project increased the quality and quantity of primary health care provision. A year after the first round of meetings, we found a significant difference in the weight of infants (a 0.14 z-score increase) and a markedly lower number of deaths among children under five (a 33 percent reduction in under-5 mortality) in the treatment communities. Utilization of general outpatient services was 20 percent higher in the treatment facilities than in the control facilities, and the overall effect across a set of utilization measures is large and significantly positive. Treatment practices, including immunization of children, waiting time, examination procedures, and absenteeism, improved significantly in the treatment communities, suggesting that the changes in the quality and quantity of health care provision are due to behavioral changes of the staff. We find evidence that the treatment communities became more engaged and began to monitor the health unit more extensively.
Using variation in treatment intensity across districts, we show that there is a significant relationship between the degree of community monitoring and health utilization and health outcomes, consistent with the community-based monitoring mechanism.
Community-based, randomized, controlled field trials have been used extensively in medical research to evaluate the effectiveness of various health interventions (see footnote 14). Our paper is related but differs in one important dimension. Whereas the medical field trials address the impact of a biological agent or treatment practice when the health workers competently carry out their tasks, we focus on how to ensure that the health workers actually carry out their tasks and on the impact this may have on health utilization and health outcomes.
1. For anecdotal and case study evidence, see World Bank (2003). Chaudhury et al. (2006) provide evidence on the rates of absenteeism. On misappropriation of public funds and drugs, see McPake et al. (1999) and Reinikka and Svensson (2004).
This paper also relates to a small literature on improving governance and public service delivery through community participation. Olken (2007) finds minor effects of an intervention aimed at increasing community participation in the monitoring of corruption in Indonesia. Our work differs in several ways. First, the intervention we evaluate was structured to reduce the risk of elite capture. Second, unlike corruption, which is not easily observable, the information discussed in the meetings consisted of basic facts on the utilization and quality of services based on the community's own experience. Finally, the intervention sought to address two constraints highlighted in the literature on community monitoring: lack of relevant information and inadequate participation. Banerjee, Deaton, and Duflo (2004) evaluate a project in Rajasthan, India, in which a member of the community was paid to check whether the nurse-midwife assigned to the health center was present at the center. The intervention had no impact on attendance, and the authors speculate that a key reason is that the individual community member did not manage to use his or her information on absenteeism to mobilize community participation. Here, by contrast, we explicitly try to address the participation constraint by involving a large number of community members and encouraging them to jointly develop a monitoring plan.
Finally, the paper links to a growing empirical literature on the relationship between information dissemination and accountability (Besley and Burgess 2002; Strömberg 2004; Ferraz and Finan 2008). Our paper differs from this literature in three respects. First, we focus on mechanisms through which citizens can make providers, rather than politicians, accountable; thus, we do not study the design or allocation of public resources across communities, but rather how these resources are utilized. Second, we use microdata from households and clinics rather than disaggregated national accounts data. Finally, we identify impact using an experimental design.
The next section describes the institutional environment. The community-based monitoring intervention is described in
Section III. Section IV lays out the evaluation design, and the results are presented in Section V. Section VI concludes. Details about the experiment and additional results are reported in the Online Supplemental Appendix.
II. INSTITUTIONAL SETTING
Uganda, like many newly independent countries in Africa, had a functioning health care system in the early 1960s. The 1970s and 1980s saw the collapse of government services as the country underwent political upheaval. Health indicators fell dramatically during this period until peace was restored in the late 1980s. Since then, the government has been implementing major infrastructure rehabilitation programs in the public health sector.
The health sector in Uganda is composed of four types of facilities: hospitals, health centers, dispensaries, and aid posts or subdispensaries. These facilities can be government-owned and -operated, private for-profit, or private not-for-profit. The impact evaluation focuses on public dispensaries. Dispensaries are the lowest tier of the health system at which a professional interaction between users and providers takes place. Most dispensaries are rural. According to the government health sector strategic plan, the standard for dispensaries includes preventive, promotional, outpatient care, maternity, general ward, and laboratory services (Republic of Uganda 2000). Since 2001, public health services have been free. In our sample, a dispensary was on average staffed by an in-charge or clinical officer (a trained medical worker), two nurses, and three nursing aides or other assistants.
The health sector in Uganda is decentralized, and a number of actors are responsible for supervision and control of the dispensaries. At the lowest tier, the Health Unit Management Committee (HUMC) is supposed to be the main link between the community and the facility. Each dispensary has an HUMC, which consists of both health workers and nonpolitical representatives from the community.
The HUMC should monitor the day-to-day running of the facility, but it has no authority to sanction workers. The next level in the institutional hierarchy is the health subdistrict, which monitors funds, drugs, and service delivery at the dispensary. Supervision meetings by the health subdistrict are supposed to take place quarterly, but in practice monitoring is infrequent. The health subdistrict has the authority to reprimand,
but not dismiss, staff for indiscipline. Severe cases of indiscipline are therefore referred to the chief administrative officer of the district and the District Service Commission, which are the appointing authorities for the district and have the authority to suspend or dismiss staff. Various local NGOs, so-called community-based organizations (CBOs), focusing primarily on health education, are also active in the sector.
III. EXPERIMENTAL DESIGN AND DATA
III.A. Overview
In response to perceived weak health care delivery at the primary level, a pilot project (citizen report cards) aimed at enhancing community involvement and monitoring in the delivery of primary health care was initiated in 2004. The project was designed by staff from Stockholm University and the World Bank and implemented in cooperation with a number of Ugandan practitioners and eighteen community-based organizations. The main objective of the intervention was to strengthen providers' accountability to citizen-clients by initiating a process, using trained local actors (CBOs) as facilitators, that the communities themselves could manage and sustain.
Based on a small but rigorous empirical literature on community participation and oversight, and on extensive piloting in the field, our conjecture was that two factors were holding back initiatives to pressure and monitor the provider: a lack of relevant information on the status of service delivery and the community's entitlements, and a failure to agree on, or coordinate expectations of, what is reasonable to demand from the provider. Although individual community members have private information (for example, they know whether their own child has died and whether the health workers did anything to help), they typically have no information on aggregate outcomes, such as how many children in their community did not survive beyond the age of 5, where citizens on average seek care, or what the community can expect in terms of the quality and quantity of service provision (Khemani 2006). Partly as a response to this information problem, and partly because monitoring a public facility is a public good that may be subject to serious free-rider problems, few people actively participate in monitoring their service providers. Relaxing
these two constraints was therefore the main objective of the intervention.
The key behavioral change expected from more extensive community-based monitoring was increased effort by the health unit staff to serve the community. In Uganda, as in many other developing countries, health workers have few pecuniary incentives to exert high effort. Public money does not follow patients, and hiring, salaries, and promotions are largely determined by seniority and educational qualifications, not by how well the staff performs. An individual worker may of course still put in high effort if shirking deviates from her ideal choice (Akerlof and Kranton 2005). The effort choice may also be influenced by social rewards from community members or social sanctions against shirking workers. Social rewards and sanctions are key instruments available to the community to boost health workers' effort.
III.B. Experimental Design
The experiment involved fifty public dispensaries, and health care users in the corresponding catchment areas, in nine districts covering all four regions of Uganda. All project facilities were located in rural areas. We define a facility's catchment area, or the community, as the five-kilometer radius around the facility.2 A community in our sample has, on average, 2,500 households residing within the five-kilometer radius of the clinic, of which 350 live within a one-kilometer radius. For the experimental design, the facilities were first stratified by location (district) and then by population size. From each stratum, half of the units were randomly assigned to the treatment group and the remaining units to the control group, giving 25 facilities in each group.
III.C. Data
Data collection was governed by two objectives. First, data were required to assess how the community at large views the quality and efficacy of service delivery and to contrast the citizens' view with that of the health workers. Second, data were required to evaluate impact.
To meet these objectives, two surveys were implemented: a survey of the fifty providers and a survey of users. Both surveys were implemented prior to the intervention (data from these surveys formed the basis for the intervention) and one year after the project had been initiated. A quantitative service delivery survey was used to collect data from the providers. Because agents in the service delivery system may have a strong incentive to misreport key data, the data were obtained directly from the records kept by facilities for their own needs (i.e., daily patient registers, stock cards, etc.) rather than from administrative records. The former, often available in a highly disaggregated format, were considered the least affected by incentive problems in record keeping. Data were also collected through visual checks by enumerators.
The household survey collected data on both households' health outcomes and health facility performance as experienced by the household. A stratified random sample of households within the catchment area of each facility was surveyed. In total, roughly 5,000 households were surveyed in each round.3 Where possible, the household's responses were supported by patient records (i.e., patient exercise books and immunization cards). The postintervention household survey also included a shorter module on health outcomes. Specifically, data on under-5 mortality were collected, and we measured the weight of all infants in the surveyed households.
2. Dispensaries are designed to serve households in a catchment area roughly corresponding to the five-kilometer radius around the facility (Republic of Uganda 2000).
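The stratified assignment described in Section III.B (stratify by district, then by population size, and assign half of each stratum to treatment) can be sketched as follows. The paper does not spell out how the population-size strata were formed, so this sketch assumes, hypothetically, that facilities within a district are sorted by population and grouped into consecutive pairs, with one member of each pair drawn at random for treatment. The function name `stratified_assignment` and the facility data are invented for illustration.

```python
import random
from collections import defaultdict


def stratified_assignment(facilities, seed=0):
    """Return {facility_id: 1 (treatment) or 0 (control)}.

    facilities: iterable of (facility_id, district, population) tuples.
    Strata (assumed scheme): within each district, facilities are sorted by
    population and grouped into consecutive pairs; one unit per pair is
    randomly assigned to treatment.
    """
    rng = random.Random(seed)
    by_district = defaultdict(list)
    for facility_id, district, population in facilities:
        by_district[district].append((population, facility_id))

    assignment = {}
    for district in sorted(by_district):
        units = sorted(by_district[district])  # ascending population size
        for start in range(0, len(units), 2):
            stratum = [facility_id for _, facility_id in units[start:start + 2]]
            rng.shuffle(stratum)
            if len(stratum) == 2:
                assignment[stratum[0]] = 1  # treatment
                assignment[stratum[1]] = 0  # control
            else:
                # Leftover unit in an odd-sized district: coin flip (assumption).
                assignment[stratum[0]] = rng.randint(0, 1)
    return assignment


# Hypothetical data: 50 facilities in nine districts, each district with an even count.
district_sizes = [6, 6, 6, 6, 6, 6, 6, 4, 4]
data_rng = random.Random(42)
facilities = [
    (f"D{d}-F{k}", d, data_rng.randint(1000, 4000))
    for d, size in enumerate(district_sizes)
    for k in range(size)
]

assignment = stratified_assignment(facilities, seed=1)
n_treated = sum(assignment.values())
print(n_treated, len(assignment) - n_treated)  # 25 25
```

Pairing within districts guarantees an exact 25/25 split whenever every district contains an even number of facilities; if some district had an odd count, the coin flip for the leftover unit would let the totals deviate from 25/25.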
III.D. Intervention
A subset of the findings from the preintervention surveys, including utilization, quality of services, and comparisons vis-à-vis other health facilities, was assembled in report cards. Each treatment facility and its community had a unique report card, translated into the main language spoken in the community, summarizing the key findings from the surveys conducted in their area. The process of disseminating the report card information, and encouraging participation, was initiated through a series of meetings: a community meeting, a staff meeting, and an interface meeting. Staff from various local NGOs (CBOs) acted as facilitators in these meetings.4 A time line of the intervention is depicted in Figure I.
FIGURE I Timing of the Project (time line for treatment and control areas: end of 2004, collection of baseline household and facility data; beginning of 2005, five-day intervention in treatment areas consisting of report card dissemination and facilitating the agreement of a community contract, followed by community-based monitoring; beginning of 2006, collection of household and facility data for program evaluation)
The community meeting was a two-afternoon event with approximately 100 invited participants from the community. To avoid elite capture, the invited participants consisted of a selection of representatives from different segments of society (i.e., young, old, disabled, women, mothers, leaders). The facilitators mobilized the village members by cooperating with village council representatives in the catchment area. Invited participants were asked to spread the word about the meeting, and in the end a large number of uninvited participants also attended. More than 150 participants per day attended a typical village meeting. In the community meeting, the facilitators used a variety of participatory methods to disseminate the information in the report cards and encouraged community members to develop a shared view on how to improve service delivery and monitor the provider. Information on patients' rights and entitlements was also discussed. The participants were divided into focus groups so that more marginalized groups, such as women and youth, could also raise their voices and discuss issues specific to their group.
3. The sampling strategy for the baseline household survey was designed to generate representative information on the core user variables in each community (such as the proportion of patients being examined with equipment). In total, 88% of the households surveyed in the baseline survey were resurveyed in the ex post survey. The households that could not be surveyed were replaced.
4. The eighteen participating CBOs had been active in 64% of the treatment communities and half of the control communities prior to the intervention. A handful of them covered more than one treatment community. The CBOs were primarily focused on health, including issues of health education and HIV/AIDS prevention, although other objectives such as agricultural development, women's empowerment, support of orphans and vulnerable children, and peace-building initiatives were also common. The CBO facilitators were trained for seven days in data interpretation and dissemination, utilization of the participatory methodology, and conflict resolution and management. Various other CBOs also operate in the project communities.
At the end of the meeting, the community's suggestions for improvements, and for how to achieve them without additional resources, were summarized in an action plan. The action plan contained information on the health issues and services that the community had identified as the most important to address, how these issues could be addressed, and how the community could monitor improvements (or lack thereof). Although the issues raised in the action plans differed across communities, common concerns included high rates of absenteeism, long waiting times, weak attention from health staff, and differential treatment.
The health facility meeting was a one-afternoon event held at the facility with all staff present. In the meeting, the facilitators contrasted the information on service provision reported by the provider with the findings from the household survey.
An interface meeting between health workers and community representatives, chosen at the community meeting, followed the community and health facility meetings. During the interface meeting, the community representatives and the health workers discussed suggestions for improvements, as well as their rights and responsibilities as patients or medical staff. The outcome was a shared action plan, or contract, outlining the community's and the service provider's agreement on what needs to be done, and how, when, and by whom. The "community contract" also identified how the community could monitor the agreements and set out a time plan. Because the problems raised in the community meetings constituted the core issues discussed during the interface meetings, the community contract was in many respects similar to the community's action plan.
The three separate meetings were intended to kick-start the process of community monitoring. After the initial meetings, the communities themselves were in charge of establishing ways of monitoring the provider.
After six months, the communities and health facilities were revisited. The CBOs facilitated a one-afternoon community meeting and a one-afternoon interface meeting with the aim of tracking the implementation of the community contract. Health facility staff and community members jointly discussed suggestions for sustaining or improving progress or, where there had been no improvement, why not.5
5. Details on the report cards and the participatory methods used, as well as an example of an action plan, are provided in the Online Supplemental Appendix.
IV. EVALUATION DESIGN AND EXPECTED OUTCOMES
IV.A. Outcomes
The main outcome of interest is whether the intervention increased the quantity and quality of health care provision and hence resulted in improved health outcomes. We are also interested in evaluating changes at every step of the accountability chain: Did the treatment communities become more involved in monitoring the health workers? Did the intervention change the health workers' behavior?
As a robustness test, we also assess alternative explanations. One concern is spillovers. Another concern is that the intervention did not only (or not primarily) increase the extent of community monitoring, but also had an impact on other agents in the service delivery chain, such as the health subdistrict. The intervention could also have affected the health workers' behavior directly, or through the actions of the CBOs, rather than through more intense community-based monitoring as we hypothesize. Although this would not invalidate the causal effect of the intervention, it would affect its interpretation. Therefore, these alternative hypotheses are also subjected to a battery of tests.
IV.B. Statistical Framework
To assess the causal effect of the intervention, we estimate
$$y_{ijd} = \alpha + \beta T_{jd} + X_{jd}\pi + \theta_d + \varepsilon_{ijd}, \qquad (1)$$
where $y_{ijd}$ is the outcome of household $i$ (when applicable) in community/health facility $j$ in district $d$, $T_{jd}$ is an indicator variable for assignment to treatment, and $\varepsilon_{ijd}$ is an error term. Equation (1) also includes a vector, $X$, of preintervention facility-specific covariates and district fixed effects ($\theta_d$).6 Because of random assignment, $T$ should be orthogonal to $X$, and the consistency of $\beta$ does not depend on the inclusion of $X$ in the model. The regression adjustment is used to improve estimation precision and to account for stratification and chance differences between groups in the distribution of pre-random assignment characteristics (Kling, Liebman, and Katz 2007). We report the results of estimating equation (1) with $X$ and $\theta$ excluded in the Online Supplemental Appendix.
6. The baseline covariates included are number of villages in the catchment area, number of days without electricity in the past month, indicator variable for whether the facility has a separate maternity unit, distance to nearest public health provider, number of staff with less than advanced A-level education, indicator variable for whether the staff could safely drink from the water source, and average monthly supply of quinine.
For a subset of variables, we can also stack the pre- and postintervention data and explore the difference-in-differences in outcomes; that is, we estimate7
$$y_{ijt} = \gamma\, POST_t + \beta_{DD}\,(T_j \times POST_t) + \mu_j + \varepsilon_{ijt}, \qquad (2)$$
where $POST_t$ is an indicator variable for the postintervention period, $\mu_j$ is a facility/community-specific fixed effect, and $\beta_{DD}$ is the difference-in-differences estimate (program impact). For some outcomes we have several outcome measures. To form a judgment about the impact of the intervention on a family of $K$ related outcomes, we follow Kling et al. (2004) and estimate a seemingly unrelated regression system,
$$Y = [I_K \otimes (T\;X)]\,\theta + \upsilon, \qquad (3)$$
where $I_K$ is a $K \times K$ identity matrix. We then derive average standardized treatment effects,
$$\tilde{\beta} = \frac{1}{K}\sum_{k=1}^{K}\frac{\hat{\beta}_k}{\hat{\sigma}_k},$$
where $\hat{\beta}_k$ and $\hat{\sigma}_k$ are the point estimate and standard error, respectively, for each effect (see Duflo, Glennerster, and Kremer [2007]). The point estimate, standard error, and p-value for $\tilde{\beta}$ are based on the parameters $\hat{\beta}_k$ and $\hat{\sigma}_k$, jointly estimated as elements of $\theta$ in (3).
V. RESULTS
V.A. Preintervention Differences
The treatment and control groups were similar on most characteristics prior to the intervention. Average standardized pretreatment effects are estimated for each family of outcomes (utilization, utilization pattern, quality, catchment area statistics, health facility characteristics, citizen perceptions, supply of resources, and user charges) using preintervention data. As shown in Table I, we cannot reject the null hypothesis of no difference between the treatment and the control group.8
7. It is a subset of variables because the postintervention surveys collected information on more variables and outcomes.
8. We report the test of difference in means across control and treatment groups for each individual variable in the Online Supplemental Appendix.
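The average standardized effect defined in Section IV.B is a linear combination of the jointly estimated coefficients, with weights $1/(K\hat{\sigma}_k)$, so its standard error follows from the cross-equation covariance matrix delivered by the SUR system in (3). The sketch below assumes that covariance matrix is already available and, as a simplification, treats the $\hat{\sigma}_k$ as fixed constants. The function name `average_standardized_effect` and all numbers are hypothetical, not taken from the paper.

```python
import math


def average_standardized_effect(betas, sigmas, cov):
    """Return (beta_tilde, se) for beta_tilde = (1/K) * sum_k betas[k] / sigmas[k].

    betas:  K point estimates, estimated jointly (e.g., as a SUR system).
    sigmas: K scale factors; here the per-effect standard errors, treated as
            fixed constants (an assumption of this sketch).
    cov:    K x K covariance matrix of the betas (list of lists).
    """
    k = len(betas)
    if not (len(sigmas) == k == len(cov)):
        raise ValueError("betas, sigmas, and cov must all have K entries")
    # beta_tilde is the linear combination w'beta with w_k = 1 / (K * sigma_k).
    weights = [1.0 / (k * s) for s in sigmas]
    beta_tilde = sum(w * b for w, b in zip(weights, betas))
    # Var(w'beta) = w' Cov(beta) w, which picks up cross-equation covariances.
    variance = sum(weights[i] * weights[j] * cov[i][j]
                   for i in range(k) for j in range(k))
    return beta_tilde, math.sqrt(variance)


# Hypothetical family of K = 2 outcomes (illustrative numbers only).
betas = [2.0, 1.0]
sigmas = [1.0, 0.5]
cov = [[1.0, 0.2],
       [0.2, 0.25]]
beta_tilde, se = average_standardized_effect(betas, sigmas, cov)
print(f"beta_tilde = {beta_tilde:.3f}, se = {se:.3f}, z = {beta_tilde / se:.2f}")
```

If the $K$ estimates were uncorrelated and each $\hat{\sigma}_k$ were exactly the standard error of $\hat{\beta}_k$, the standard error of $\tilde{\beta}$ would be $1/\sqrt{K}$; positive covariance across equations raises it, which is what the joint estimation in (3) accounts for.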
TABLE I
PRETREATMENT FACILITY AND CATCHMENT AREA CHARACTERISTICS AND AVERAGE STANDARDIZED EFFECTS
Variables                                        Treatment group   Control group   Difference
Key characteristics
  Outpatient care                                593 (75)          675 (57)        −82 (94)
  Delivery                                       10.3 (2.2)        7.5 (1.4)       2.8 (2.6)
  No. of households in catchment area            2,140 (185)       2,224 (204)     −84.4 (276)
  No. of households per village                  93.9 (5.27)       95.3 (6.32)     −1.42 (8.23)
  Drank safely today                             0.40 (0.10)       0.32 (0.10)     0.08 (0.14)
  No. of days without electricity in past month  18.3 (2.95)       20.4 (2.90)     −2.12 (4.14)
Average standardized pretreatment effects
  Utilization                                                                      0.11 (0.77)
  Utilization pattern                                                              −0.48 (0.33)
  Quality measures                                                                 −0.35 (0.84)
  Catchment area statistics                                                        0.11 (0.66)
  Health facility characteristics                                                  0.14 (0.31)
  Citizen perceptions                                                              0.37 (0.67)
  Supply of drugs                                                                  0.73 (0.83)
  User charges                                                                     −0.65 (0.63)
Notes. Key characteristics are catchment area/health facility averages for the treatment and control groups and the difference in averages. Robust standard errors in parentheses. Description of variables: Outpatient care is the average number of patients visiting the facility per month for outpatient care. Delivery is the average number of deliveries at the facility per month. Number of households in catchment area and number of households per village are based on census data and Uganda Bureau of Statistics maps. Drank safely today is an indicator variable for whether the health facility staff could safely drink from the water source at the time of the preintervention survey. Number of days without electricity in the month prior to the preintervention survey is measured out of 31 days. Average standardized pretreatment effects are derived by estimating equation (3) on each family of outcomes. Utilization summarizes outpatients and deliveries. Utilization pattern summarizes the seven measures in Supplemental Appendix Table A.I, reversing the sign of traditional healer and self-treatment. Quality measures summarize the two measures in Table A.I, reversing the sign of waiting time. Catchment area statistics summarize the four measures in Table A.I. Health facility characteristics summarize the eight measures in Table A.I plus drank safely today and days without electricity, reversing the sign of days without electricity and distance to nearest local council. Citizen perceptions summarize the four measures in Table A.I. Supply of drugs summarizes the five measures in Table A.I. User charges summarize the four measures in Table A.I, reversing all signs. The χ² test statistic on the joint hypothesis that all average standardized effects are 0 is 4.70, with p-value = .79.
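As a rough consistency check on Table I, the sketch below recomputes each difference in means and its standard error from the published group means and standard errors, treating the two groups as independent samples so that the standard error of the difference is the square root of the sum of squared group standard errors. This is an approximation to the regression-based robust standard errors in the table (it gives about 94, 2.6, 275, 8.23, 0.14, and 4.14), and small gaps in the differences (e.g., −84 versus −84.4) come from rounding of the published means. The second function recovers the reported p-value for the joint test, assuming, as the notes suggest, one degree of freedom per outcome family (eight in total); it uses the closed-form chi-square tail that holds for an even number of degrees of freedom. The helper names are hypothetical.

```python
import math

# Group means and standard errors transcribed from Table I:
# (treatment mean, treatment SE, control mean, control SE).
TABLE_I = {
    "Outpatient care": (593, 75, 675, 57),
    "Delivery": (10.3, 2.2, 7.5, 1.4),
    "No. of households in catchment area": (2140, 185, 2224, 204),
    "No. of households per village": (93.9, 5.27, 95.3, 6.32),
    "Drank safely today": (0.40, 0.10, 0.32, 0.10),
    "No. of days without electricity": (18.3, 2.95, 20.4, 2.90),
}


def balance_checks(table):
    """Difference in means, its standard error, and t-statistic for each row.

    Treats the two groups as independent samples (an approximation to the
    regression-based robust SEs reported in the table).
    """
    results = {}
    for name, (mean_t, se_t, mean_c, se_c) in table.items():
        diff = mean_t - mean_c
        se_diff = math.sqrt(se_t ** 2 + se_c ** 2)
        results[name] = (diff, se_diff, diff / se_diff)
    return results


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For even df, P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)**i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term            # add (x/2)**i / i!
        term *= half / (i + 1)   # next term of the series
    return math.exp(-half) * total


for name, (diff, se, t) in balance_checks(TABLE_I).items():
    print(f"{name:38s} diff = {diff:8.2f}  se = {se:7.2f}  t = {t:5.2f}")

# Joint test from the table notes: statistic 4.70 with 8 df (assumed).
print(f"p-value = {chi2_sf_even_df(4.70, 8):.2f}")  # p-value = 0.79
```

All six t-statistics fall well below 1.96 in absolute value, in line with the balance reported in Section V.A.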
V.B. Processes
The initial phase of the project, that is, the three separate meetings, followed a predesigned structure. A parallel system whereby a member of the survey team originating from the district participated as part of the CBO team also confirmed that the initial phase of the intervention was properly implemented. After these initial meetings, it was up to the community to sustain and lead the process. In this section we study whether the treatment communities became more involved in monitoring the providers. To avoid influencing local initiatives, we did not have external agents visit the communities and therefore could not document all actions taken by the communities in response to the intervention. Still, we have some information on how processes in the community changed. Specifically, the CBOs submitted reports on the types of changes they observed in the treatment communities, and we also surveyed the local councils in the treatment communities. We use facility and household survey data to corroborate these reports. According to the CBO reports and the local council survey, the community-based monitoring process that followed the first set of meetings was a joint effort managed mainly by the local councils, the HUMCs, and community members. A typical village in the treatment group held, on average, six local council meetings in 2005. In those meetings, 89% of the villages discussed issues concerning the project health facility. The main subject of discussion in the villages was the community contract or parts of it, such as staff behavior. The CBOs reported that concerns raised by village members were carried forward by the local council to the facility or the HUMC. However, although the HUMC should play an important role in monitoring the provider, it was in many cases viewed as ineffective. As a result, mismanaged HUMCs were dissolved and new members elected.
These claims are confirmed in the survey data: more than one-third of the HUMCs in the treatment communities were dissolved and new members were elected or received following the intervention, whereas we observed no dissolved HUMCs in the control communities. Further, the CBOs report that the community, or individual members, also monitored the health workers during visits to the clinic, when they rewarded progress on issues in the community contract that had been addressed and questioned issues that had not, suggesting a
more systematic use of nonpecuniary rewards. Monitoring tools such as suggestion boxes, numbered waiting cards, and duty rosters were also reported to have been put in place in several treatment facilities. In Table II, we formally examine the program impact on these monitoring tools, using data collected through visual checks by enumerators during the postintervention facility survey. As shown in columns (1) and (2), one year into the project, treatment facilities are significantly more likely to have suggestion boxes (no control facility had these, but 36% of the treatment facilities did) and numbered waiting cards (only one control facility had them, but 20% of the treatment facilities did). Columns (3) and (4) show that a higher share of the treatment facilities also posted information on free services and on patients’ rights and obligations, although only the difference for free services is statistically significant. The enumerators could visually confirm that 70% of the treatment facilities had at least one of these monitoring tools, whereas only 4 of 25 control clinics had at least one of them. The difference is statistically significant (Online Supplemental Appendix, Table A.II). Column (5) reports the average standardized effect of the monitoring tools; the estimate is significantly different from zero at the 1% level. The results based on household data mirror the findings reported in columns (1)–(5). The performance of the staff is more often discussed in local council meetings in the treatment communities, as shown in column (6), and community members in the treatment group are, on average, better informed about the HUMC’s roles and responsibilities, as reported in column (7). Combining the evidence from the CBO reports and the household survey data thus suggests that both the “quantity” of discussions about the project facility and their subject, which shifted from general discussion to specific discussion of the community contract, changed in response to the intervention.

V.C. Treatment Practices
The qualitative evidence from the CBOs and, to the extent that we can measure them, the findings reported in Table II suggest that the treatment communities became more involved in monitoring the provider. Did the intervention also affect the health workers’ behavior and performance? We turn to this next, starting with examination procedures. The estimate based on equation (2) with the dependent variable being
TABLE II
PROGRAM IMPACT ON MONITORING AND INFORMATION

Specification   Dependent variable                  Program impact    Mean control group   Observations
(1)             Suggestion box                      0.32∗∗∗ (0.08)    0                    50
(2)             Numbered waiting cards              0.16∗ (0.09)      0.04                 50
(3)             Poster informing free services      0.27∗∗∗ (0.09)    0.12                 50
(4)             Poster on patients’ rights          0.14 (0.10)       0.12                 50
(5)             Average standardized effect         2.55∗∗∗ (0.55)    –                    50
(6)             Discuss facility in LC meetings     0.13∗∗∗ (0.03)    0.33                 3,119
(7)             Received information about HUMC     0.04∗∗∗ (0.01)    0.08                 4,996

Notes. Robust standard errors in parentheses. Disturbance terms are clustered by catchment areas in columns (6)–(7). Point estimates, standard errors, and average standardized effect, columns (1)–(5), are derived from equation (3). Program impact measures the coefficient on the assignment to treatment indicator. Outcome measures in columns (1)–(4) are based on data collected through visual checks by the enumerators during the postintervention facility survey. Outcome measures in columns (6) and (7) are from the postintervention household survey. The estimated equations all include district fixed effects and the following baseline covariates: number of villages in catchment area, number of days without electricity in the past month, indicator variable for whether the facility has a separate maternity unit, distance to nearest public health provider, number of staff with less than advanced A-level education, indicator variable for whether the staff could safely drink from the water source, and average monthly supply of quinine. Specification: (1) indicator variable for whether the health facility has a suggestion box for complaints and recommendations; (2) indicator variable for whether the facility has numbered waiting cards for its patients; (3) indicator variable for whether the facility has a poster informing about free health services; (4) indicator variable for whether the facility has a poster on patients’ rights and obligations; (5) average standardized effect of the estimates in columns (1)–(4); (6) indicator variable for whether the household discussed the functioning of the health facility at a local council meeting during the past year; (7) indicator variable for whether the household has received information about the Health Unit Management Committee’s (HUMC’s) roles and responsibilities. *Significant at 10%. **Significant at 5%. ***Significant at 1%.
TABLE III
PROGRAM IMPACT ON TREATMENT PRACTICES AND MANAGEMENT

Spec.  Dep. variable                   Model   Program impact    2005              Mean control group 2005   Obs.
(1)    Equipment used                  DD      0.08∗∗ (0.03)     −0.07∗∗∗ (0.02)   0.41                      5,280
(2)    Equipment used                  OLS     0.01 (0.02)                         0.41                      2,758
(3)    Waiting time                    DD      −12.3∗ (7.1)      −12.4∗∗ (5.2)     131                       6,602
(4)    Waiting time                    OLS     −5.16 (5.51)                        131                       3,426
(5)    Absence rate                    OLS     −0.13∗∗ (0.06)                      0.47                      46
(6)    Management of clinic            OLS     1.20∗∗∗ (0.33)                      −0.49                     50
(7)    Health information              OLS     0.07∗∗∗ (0.02)                      0.32                      4,996
(8)    Importance of family planning   OLS     0.06∗∗∗ (0.02)                      0.31                      4,996
(9)    Stockouts                       OLS     −0.15∗∗ (0.07)                      0.50                      42

Notes. Each row is based on a separate regression. The DD model is from equation (2). The OLS model is from equation (1) with district fixed effects and baseline covariates as listed in Table II. Robust standard errors in parentheses, clustered by catchment areas in specifications (1)–(4) and (7)–(8). Program impact measures the coefficient on the assignment to treatment indicator in the OLS models and the assignment to treatment indicator interacted with an indicator variable for 2005 in the DD models. Specifications: (1) and (2) indicator variable for whether the staff used any equipment during examination when the patient visited the health facility; (3) and (4) difference between the time the citizen left the facility and the time the citizen arrived at the facility, minus the examination time; (5) ratio of workers not physically present at the time of the postintervention survey to the number of workers employed preintervention (see text for details); (6) first component from a principal components analysis of the variables Condition of the floors of the health clinic, Condition of the walls, Condition of furniture, and Smell of the facility, where each condition is ranked from 1 (dirty) to 3 (clean) by the enumerators; (7) indicator variable for whether the household has received information about the importance of visiting the health facility and the danger of self-treatment; (8) indicator variable for whether the household has received information about family planning; (9) share of months in 2005 in which stock cards indicated no availability of drugs (see text for details). *Significantly different from zero at 90% confidence level. **Significantly different from zero at 95% confidence level. ***Significantly different from zero at 99% confidence level.
an indicator variable for whether any equipment (for instance, a thermometer) was used during the examination is shown in the first row of Table III. Fifty percent (41 percent) of the patients in the treatment (control) communities reported that equipment was used the last time the respondent (or the respondent’s child) visited the project clinic. The difference-in-differences estimate, a 20% increase, is significant at the 5% level. The cross-sectional estimate in row (2), based on equation (1), is less precisely estimated.
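The percentage effects quoted in the text are program-impact coefficients divided by the control-group mean. The sketch below reproduces the equipment-use figure from Table III, specification (1), and adds a 2×2 difference-in-differences helper. The helper assumes the simplest two-group, two-period comparison of cell means without covariates, which need not coincide with the paper's equation (2), and its example inputs are hypothetical.

```python
def relative_effect(coef: float, control_mean: float) -> float:
    """Program impact expressed as a share of the control-group mean."""
    if control_mean == 0:
        raise ValueError("control mean must be nonzero")
    return coef / control_mean


def diff_in_diff(treat_pre: float, treat_post: float,
                 ctrl_pre: float, ctrl_post: float) -> float:
    """2x2 difference-in-differences of cell means:
    change in the treatment group minus change in the control group."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)


# Table III, specification (1): DD estimate 0.08 against a control mean of 0.41.
print(f"{relative_effect(0.08, 0.41):.0%}")  # 20%

# Hypothetical cell means, for illustration only.
print(diff_in_diff(treat_pre=1.0, treat_post=3.0, ctrl_pre=2.0, ctrl_post=2.5))  # 1.5
```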
In row (3) we report the result for an alternative measure of staff performance, the waiting time, defined as the difference between the time the user left the facility and the time the user arrived at the facility, minus the examination time. On average, the waiting time was 131 minutes in the control facilities and 119 minutes in the treatment facilities; the difference-in-differences estimate is significant at the 10% level. The estimate based on equation (1), shown in row (4), is less precisely estimated. The results on absenteeism are shown in row (5).9 The point estimate suggests a substantial treatment effect. On average, the absence rate, defined as the ratio of workers not physically present at the time of the postintervention survey to the number of workers on the list of employees as reported in the preintervention survey, is 13 percentage points lower in the treatment facilities. Thus, in response to the intervention, health workers are more likely to be at work. Enumerators also visually checked the condition of the health clinics, that is, whether floors and walls were clean, the condition of the furniture, and the smell of the facility. We combine these variables through principal components analysis into a summary score. Treatment clinics appear to have put more effort into keeping their premises in decent condition in response to the intervention. The point estimate, reported in row (6), implies a 0.56 standard deviation improvement in the summary score in the treatment compared to the control facilities. According to the government health sector strategic plan, preventive care is one of the core tasks for health providers at the primary level. A significantly larger share of households in the treatment communities has received information about the dangers of self-treatment, reported in row (7), and the importance of family planning, reported in row (8). The differences are 7 and 6 percentage points, respectively.
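The waiting-time and absence-rate definitions in this paragraph, together with the exclusion rules in footnote 9, translate into a few lines of Python. This is a sketch under stated assumptions: treating outreach staff as removed from both the numerator and the denominator is our reading of "omitted from the absence calculation", and the function names and example counts are hypothetical.

```python
from datetime import datetime
from typing import Optional


def waiting_time_minutes(arrived: datetime, left: datetime,
                         exam_minutes: float) -> float:
    """Time from arrival to departure, minus the examination time."""
    total_minutes = (left - arrived).total_seconds() / 60.0
    return total_minutes - exam_minutes


def absence_rate(staff_pre: int, present: int, on_outreach: int) -> Optional[float]:
    """Share of preintervention staff not physically present at the unannounced visit.

    Staff reported on outreach are left out (assumed: removed from both the
    numerator and the denominator). Returns None when present + outreach
    exceeds the preintervention list, mirroring the facilities dropped in
    footnote 9, or when no staff remain to be counted.
    """
    if present + on_outreach > staff_pre:
        return None
    eligible = staff_pre - on_outreach
    if eligible == 0:
        return None
    return (eligible - present) / eligible


print(waiting_time_minutes(datetime(2005, 6, 1, 9, 0),
                           datetime(2005, 6, 1, 11, 30), exam_minutes=10))  # 140.0
print(absence_rate(staff_pre=10, present=6, on_outreach=2))  # 0.25
print(absence_rate(staff_pre=5, present=5, on_outreach=1))   # None
```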
There is no systematic difference in the supply of drugs between the treatment and control groups (see Section V.F). However, as shown in row (9), stockouts of drugs occur more frequently in the control facilities even though, as reported below, the control facilities treat significantly fewer patients. These findings suggest that more drugs leaked from health facilities in the control group.10 The findings on immunization of children under five are reported in Table IV. We have information on how many times (doses) in total each child has received polio, DPT, BCG, and measles vaccines and vitamin A supplements. On the basis of the recommended immunization plan, we create indicator variables taking the value of 1 if child i of cohort (age) j had received the required dose(s) of measles, DPT, BCG, and polio vaccines, respectively, and 0 otherwise.11 We then estimate (3) for each age group and calculate average standardized effects. The average standardized effects are significantly positive for the younger cohorts (newborns, infants under one year, and one-year-olds) and for three-year-olds, but not for two- and four-year-olds. Looking at individual effects (Online Supplemental Appendix Table A.IV), there are significant positive differences between households in the treatment and control communities for all five measures (the four vaccines and vitamin A), although not for all cohorts. For example, twice as many newborns in the treatment group have received vitamin A supplements, 46% more newborns have received the first dose of BCG vaccine, and 42% more newborns have received the first dose of polio vaccine as compared to the control group.

9. The postintervention survey was not announced in advance. At the start of the survey, the enumerators physically verified the provider’s presence. A worker was counted as absent if, at the time of the visit, he or she was not in the clinic. Staff reported to be on outreach were omitted from the absence calculation. Four observations were dropped because the total number of workers verified to be present or reported to be on outreach exceeded the total number of workers on the preintervention staff list. Assuming instead no absenteeism in these four facilities yields a point estimate (standard error) of −0.20 (0.065).
10. The dependent variable is the share of months in 2005 in which stock cards indicated no availability of drugs, averaged over erythromycin, mebendazole, and septrin. We find no significant difference between treatment and control clinics for chloroquine, the least expensive of the drugs on which we have data. Not all clinics had accurate stock cards, and these clinics were therefore omitted.
11. According to the Uganda National Expanded Program on Immunization, each child in Uganda is supposed to be immunized against measles (one dose at nine months and two doses in case of an epidemic); DPT (three doses at six, ten, and fourteen weeks); BCG (one dose at birth or during the first contact with a health facility); and polio (three doses, or four if delivery takes place at the facility, at six, ten, and fourteen weeks). Because measles vaccination should not be given at birth, we exclude measles immunization from the plan for infants under twelve months.

V.D. Utilization
To the extent we can measure it, the evidence presented so far suggests that treatment communities began to monitor the health unit more extensively in response to the intervention and that the health workers improved the provision of health services. We now turn to the question of whether the intervention also resulted in improved quantity and quality of care. Cross-sectional estimates based on equation (3) are given in Table V, Panel A. For outpatients and deliveries, we have
TABLE IV
PROGRAM IMPACT ON IMMUNIZATION

Specification   Group          Average standardized effect   Observations
(1)             Newborn        1.30∗ (0.70)                  173
(2)             Under 1 year   1.44∗∗ (0.72)                 929
(3)             1 year old     1.24∗∗ (0.63)                 940
(4)             2 years old    0.72 (0.58)                   951
(5)             3 years old    2.01∗∗∗ (0.67)                1,110
(6)             4 years old    0.86 (0.80)                   526

Notes. Average standardized effects are derived from equation (3) with the dependent variables being indicator variables for whether the child has received at least one dose of measles, DPT, BCG, and polio vaccines and vitamin A supplement, respectively (see text for details), and with district fixed effects and baseline covariates listed in Table II included. Robust standard errors clustered by catchment areas in parentheses. Groups: (1) Children under 3 months; (2) Children 0–12 months; (3) Children 13–24 months; (4) Children 25–36 months; (5) Children 37–48 months; (6) Children 49–60 months. *Significant at 10% level. **Significant at 5% level. ***Significant at 1% level.
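The immunization indicators behind Table IV are built from the dosing plan in footnote 11. A minimal Python sketch follows, assuming ages measured in weeks, nine months approximated as 39 weeks, and the extra polio dose for facility deliveries given at birth (the footnote does not say when it is given). The schedule dictionary and function names are hypothetical; this is not the authors' code.

```python
from typing import Dict, List, Optional

# Hypothetical encoding of the schedule in footnote 11, in weeks of age.
# Nine months is approximated as 39 weeks (an assumption).
SCHEDULE_WEEKS: Dict[str, List[int]] = {
    "BCG": [0],            # one dose at birth
    "DPT": [6, 10, 14],    # three doses
    "polio": [6, 10, 14],  # three doses; see born_at_facility below
    "measles": [39],       # one dose at nine months
}


def required_doses(vaccine: str, age_weeks: int,
                   born_at_facility: bool = False) -> int:
    """Number of doses due by a given age under the encoded schedule."""
    ages = list(SCHEDULE_WEEKS[vaccine])
    if vaccine == "polio" and born_at_facility:
        ages = [0] + ages  # assumed: the fourth polio dose is given at birth
    return sum(1 for a in ages if a <= age_weeks)


def has_required_doses(vaccine: str, age_weeks: int, doses_received: int,
                       born_at_facility: bool = False) -> Optional[int]:
    """1 if every dose due has been received, 0 if not, None if not applicable."""
    if vaccine == "measles" and age_weeks < 52:
        return None  # footnote 11: measles excluded for infants under twelve months
    due = required_doses(vaccine, age_weeks, born_at_facility)
    if due == 0:
        return None  # no dose due yet at this age
    return int(doses_received >= due)


print(required_doses("DPT", 12))                            # 2
print(required_doses("polio", 12, born_at_facility=True))   # 3
print(has_required_doses("DPT", 20, doses_received=2))      # 0
print(has_required_doses("measles", 40, doses_received=1))  # None
```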
TABLE V
PROGRAM IMPACT ON UTILIZATION/COVERAGE

                                             A: Cross-sectional data            B: Panel data
Dep. variable                                Spec.   Program impact    Obs.    Spec.   Program impact     Obs.    Mean control group 2005
Outpatients                                  (1)     130.2∗∗ (60.8)    50      (9)     189.1∗∗∗ (67.2)    100     661
Delivery                                     (2)     5.3∗∗ (2.1)       50      (10)    3.48∗ (1.96)       100     9.2
Antenatal                                    (3)     15.0 (11.2)       50                                         78.9
Family planning                              (4)     3.4 (3.2)         50                                         15.2
Average std effect                           (5)     1.75∗∗∗ (0.63)    50      (11)    2.30∗∗∗ (0.69)     100     –
Use of project facility                      (6)     0.026∗ (0.016)    50      (12)    0.031∗ (0.017)     100     0.24
Use of self-treatment/traditional healers    (7)     −0.014 (0.011)    50      (13)    −0.046∗∗ (0.021)   100     0.36
Average std effect                           (8)     1.43∗ (0.87)      50      (14)    1.96∗∗ (0.89)      100     –

Notes. Panel A reports program impact estimates from cross-sectional models with district fixed effects and baseline covariates as listed in Table II, with robust standard errors in parentheses. Panel B reports program impact estimates from difference-in-differences models with robust standard errors clustered by facility in parentheses. Point estimates, standard errors, and average standardized effects in specifications (1)–(5), (6)–(8), (9)–(11), and (12)–(14) are derived from equation (3). Program impact measures the coefficient on the assignment to treatment indicator in the OLS models and the assignment to treatment indicator interacted with an indicator variable for 2005 in the DD models. Dependent variables: Outpatients is the average number of patients visiting the facility per month for outpatient care; Delivery is the average number of deliveries at the facility per month; Antenatal is the average number of antenatal visits at the facility per month; Family planning is the average number of family planning visits at the facility per month; the first Average std effect is the average standardized effect of estimates in specifications (1)–(4) and (9)–(10), respectively; Use of project facility is the share of visits to the project facility of all health visits, averaged over catchment area; Use of self-treatment/traditional healers is the share of visits to traditional healers and self-treatment of all health visits, averaged over catchment area; the second Average std effect is the average standardized effect of estimates in specifications (6)–(7) and (12)–(13), respectively, reversing the sign of use of self-treatment/traditional healers. *Significant at 10% level. **Significant at 5% level. ***Significant at 1% level.
preintervention data and can also estimate difference-in-differences models, shown in Panel B, and value-added models, shown in Table A.V in the Online Supplemental Appendix.12 One year into the program, utilization (for general outpatient services) is 20% higher in the treatment facilities, as shown in specification (1). For the difference-in-differences and the value-added models (reported in specification (9) in Table V and specification (ix) in Table A.V), the coefficients on the treatment indicator are larger both in absolute magnitude and relative to their standard errors. Thus, controlling for baseline outcomes, $y_{j,t-1}$, improves the precision of the treatment effect, which is to be expected given the persistent nature of the outcome variable. The increase in the number of deliveries, shown in specification (2), is 58%, albeit from a low level, and is fairly precisely estimated. There are also positive differences in the number of patients seeking antenatal care (19% increase) and family planning (22% increase), although these estimates are not individually significantly different from zero. The average standardized effect, reported in specification (5), however, is highly significant. The last three columns in Table V, Panels A and B, report changes in utilization patterns based on household data. We collected data on where each household member sought care during 2005 in case of illness that required treatment and aggregated this information by community. There is an 11%–13% increase, specifications (6) and (12), in the use of the project facility in the treatment group as compared to the control group, a result consistent with that reported in specification (1) using facility records.
Households in the treatment communities also reduced the number of visits to traditional healers and the extent of self-treatment, specifications (7) and (13), although the reduction is statistically significant only in the panel estimate; there are no statistically significant differences across the two groups in the use of other providers (not reported). Thus, as summarized in the average standardized treatment effects, specifications (8) and (14), households in the treatment communities switched from traditional healers and self-treatment to the project facility in response to the intervention.
12. The value-added specification is $y_{jt} = \alpha^{VA} + \beta^{VA} T_j + \lambda y_{j,t-1} + \varepsilon_{jt}$.
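The value-added specification in footnote 12 is an OLS regression of the current outcome on an intercept, the treatment indicator $T_j$, and the lagged outcome $y_{j,t-1}$. The sketch below fits it on simulated data by solving the normal equations directly, so it needs only the Python standard library. The simulated treatment effect (2.0) and persistence (0.8) are invented for illustration and are not the paper's estimates, and the sketch leaves out any fixed effects, covariates, and robust standard errors the paper's regressions may include.

```python
import random
from typing import List, Sequence


def ols(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """OLS coefficients from the normal equations (X'X) b = X'y, via Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))  # partial pivoting
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


rng = random.Random(0)
n = 2000
treat = [i % 2 for i in range(n)]                       # T_j: half of the units treated
y_lag = [rng.gauss(10.0, 3.0) for _ in range(n)]        # y_{j,t-1}
y = [1.0 + 2.0 * t + 0.8 * yl + rng.gauss(0.0, 1.0)     # simulated y_{jt}
     for t, yl in zip(treat, y_lag)]

X = [[1.0, float(t), yl] for t, yl in zip(treat, y_lag)]
alpha_va, beta_va, lam = ols(X, y)
print(f"beta_VA = {beta_va:.2f}, lambda = {lam:.2f}")
```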
V.E. Health Outcomes
We collected data on births, pregnancies, and deaths of children under five years in 2005. We also measured the weight of all infants (i.e., under age 18 months) and children (between ages 18 and 36 months) in the surveyed households. Health outcomes could have improved for several reasons. As noted in the Introduction, access to a small set of proven, inexpensive services could, worldwide, have prevented more than half of all deaths of children under age 5. For a country with an epidemiological profile like Uganda’s, the estimated share of preventable deaths is 73 percent (Jones et al. 2003).13 In the community monitoring project specifically, increased utilization, and patients switching from self-treatment and traditional healers to care at the treatment facility, could have had an effect. Holding utilization constant, better service quality, increased immunization, and more extensive use of preventive care could also have improved health status. As a reference point, we review the set of health interventions feasible for delivery at high coverage in low-income settings with sufficient evidence of effect on reducing mortality from the major causes of under-5 deaths (Jones et al. 2003). We focus on community-based, randomized, controlled field trials, which resemble our project in that they are community-based. Several of these field trials document reductions in under-5 mortality rates of 30%–50% one to two years into the project.14 There is, however, a fundamental difference between the interventions discussed in footnote 14 and our work. The medical field trials study the impact of a biological agent or treatment practice in a community setting when the community health workers and medical personnel competently carry out their tasks. In the experiment we consider, by contrast, no new health interventions were introduced and the supply of health inputs was unchanged. Instead, we focused on incentivizing health workers to carry out their tasks through strengthened local accountability.

Estimates for births and pregnancies are given in Table VI, columns (1) and (2). To the extent that the intervention had an effect on fertility, for example, through increased use of family planning services, it would primarily affect the incidence of pregnancies in 2005, given the forty-week period between conception and birth. The incidence of births is not significantly different across treatment and control groups. However, the treatment groups had 10% fewer pregnancies in 2005. Column (3) shows the treatment effect on under-5 mortality.15 The point estimate suggests a substantial treatment effect. The average under-5 mortality rate in the control group is 144, close to the official figure of 133 for 2005 (UNICEF 2006). In the treatment group, the under-5 mortality rate is 97, a 33% reduction in under-5 mortality. The difference is significant at the 10% level (and somewhat larger in absolute magnitude) when controlling for district fixed effects, as reported in column (3). Although the effect is large, it is worth emphasizing that the 90% confidence interval of our estimate also includes much lower effects (90% CI: 8%–64% reduction in the under-5 mortality rate). With a total of approximately 55,000 households residing in the treatment communities, the treatment effect corresponds to approximately 550 averted under-5 deaths in the treatment group in 2005.

13. This is likely to be a conservative number because only medical interventions for which cause-specific evidence of effect was available were included in the estimation. For example, increased birth spacing, which has been estimated to reduce under-5 mortality by 19 percent in India, was not considered. Several perinatal and neonatal health interventions that could be implemented in low-income countries were not included either (Darmstadt et al. 2005).
14. For example, a project in Tigray, Ethiopia, in which coordinators, supported by a team of supervisors, were trained to teach mothers to recognize symptoms of malaria in their children and provide antimalarials, reduced under-5 mortality by 40% (Kidane and Morrow 2000). Bang et al. (1990) document a 30% reduction in under-5 mortality from an intervention that included mass education about childhood pneumonia and case management of pneumonia by trained village health workers, a result similar to the meta-analysis estimate by Sazawal and Black (2003). Bang et al. (1999) evaluate a project in which trained village health workers, assisted by birth attendants and supervisory visits, provided home-based neonatal care, including treatment of sepsis. Two years into the project, they document a reduction in infant mortality of nearly 50 percent. Rahmathullah et al. (2003) assess the impact of a community-based project in two rural districts of Tamil Nadu, India, where newborn infants in the treatment group were allocated oral vitamin A after delivery. The intervention resulted in a 22% reduction in total mortality at age 6 months. Manandhar et al. (2004) evaluate a project in which a facilitator convened nine women’s group meetings every month in the Makwanpur district in Nepal, at which perinatal problems were identified and strategies to address them were formulated. Two years into the project they document a 30% reduction in neonatal mortality. Rahman et al. (1982) evaluate the impact of immunization of women with tetanus injections during pregnancy in rural Bangladesh. The intervention reduced neonatal mortality by 45%. Mtango and Neuvians (1986) evaluate a project in rural Tanzania in which trained village health workers visited families at their homes every six to eight weeks, giving health education on recognition and prevention of acute respiratory infections, and treated children with pneumonia with antibiotics or referred them to the next higher level of care. Within a two-year period, they document a 27% reduction in under-5 mortality, slightly lower than the reduction found in a similar study in rural Bangladesh (Fauveau et al. 1992).
15. The under-5 mortality rate is the sum of the death rates for each cohort (age groups 0–1, 1–2, 2–3, 3–4, and 4–5) per community in 2005, expressed per thousand live births.
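Footnote 15 defines the community under-5 mortality rate as a sum of one-year cohort death rates. A minimal Python sketch follows, assuming each cohort's rate is its 2005 deaths divided by the number of children in that cohort (the footnote's "per thousand live births" does not fix the denominator for older cohorts); the counts are hypothetical. The last line recomputes the 33% reduction quoted in the text from the reported rates of 144 and 97.

```python
from typing import Sequence


def under5_mortality_rate(deaths: Sequence[int], children: Sequence[int]) -> float:
    """Sum of death rates for cohorts 0-1, 1-2, 2-3, 3-4 and 4-5, per 1,000.

    Cohorts with no children contribute zero (an assumption).
    """
    if len(deaths) != 5 or len(children) != 5:
        raise ValueError("expected five one-year cohorts")
    return 1000.0 * sum(d / n for d, n in zip(deaths, children) if n > 0)


# Hypothetical community: deaths and cohort sizes for ages 0-1, ..., 4-5.
print(under5_mortality_rate([2, 1, 1, 0, 0], [50, 50, 50, 50, 50]))

# Relative reduction implied by the reported rates (control 144, treatment 97).
print(f"{(144 - 97) / 144:.0%}")  # 33%
```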
TABLE VI
PROGRAM IMPACT ON HEALTH OUTCOMES

Spec.  Dependent variable         Coefficient                            Estimate           Mean control group 2005   Observations
(1)    Births                     Program impact                         −0.016 (0.013)     0.21                      4,996
(2)    Pregnancies                Program impact                         −0.03∗∗ (0.014)    0.29                      4,996
(3)    U5MR                       Program impact                         −49.9∗ (26.9)      144                       50
(4)    Child death                Program impact × year of birth 2005    −0.026∗∗ (0.013)   0.029                     5,094
                                  Program impact × year of birth 2004    −0.019∗∗ (0.008)
                                  Program impact × year of birth 2003    0.003 (0.009)
                                  Program impact × year of birth 2002    0.000 (0.006)
                                  Program impact × year of birth 2001    0.002 (0.006)
(5)    Weight-for-age z-scores    Program impact                         0.14∗∗ (0.07)      −0.71                     1,135
(6)    Weight-for-age z-scores    Program impact                         0.14∗∗ (0.07)      −0.71                     1,135
                                  Child age (log)                        −1.27∗∗∗ (0.07)
                                  Female                                 0.27∗∗∗ (0.09)

Notes. Estimates from equation (1) with district fixed effects and baseline covariates as listed in Table II included. Specification (4) also includes a full set of year-of-birth indicators. Robust standard errors in parentheses (3), clustered by catchment area (1)–(2), (4)–(6). Program impact measures the coefficient on the assignment to treatment indicator. Specifications: (1) Number of births in the household in 2005; (2) indicator variable for whether any women in the household are or were pregnant in 2005; (3) U5MR is under-5 mortality rate in the community expressed per 1,000 live births (see text for details); (4) indicator variable for child death in 2005; (5)–(6) weight-for-age z-scores for children under 18 months excluding observations with recorded weight above the 90th percentile in the growth chart reported in Cortinovis et al. (1997). *Significant at 10% level. **Significant at 5% level. ***Significant at 1% level.
Column (4) shows the age range of the mortality effects. We have information on the birth year of all children (under age 5) alive at the beginning of 2005 and the birth year of all children who died in 2005. Using these data, we estimate (1), replacing the treatment indicator with a full set of year-of-birth indicators and year-of-birth-by-treatment interactions. We can then address the question: conditional on having a child of age x at the end of 2004, or a child born in 2005, what is the probability that the child died in 2005? As is evident, children younger than two years old drive the reduction in under-5 mortality. The point estimate for the youngest cohort, for example, implies a 35% reduction in the likelihood of death of a child born in 2005 in the treatment compared to the control group. The program impact on the weight of infants is reported in columns (5) and (6). Measured by weight-for-age z-scores, Ugandan infants weigh far less than the international reference of the U.S. National Center for Health Statistics of the Centers for Disease Control and Prevention (CDC), and the gap increases for older infants, consistent with the findings in Cortinovis et al. (1997).16 The difference in mean z-scores of infants between the treatment and the control group is reported in column (5): the estimated effect is 0.14 standard deviations in weight-for-age. Figure II plots the distribution of z-scores for the treatment and control groups. The difference in measured weight is most apparent for underweight children.
This pattern is consistent with a positive treatment effect arising from improved access to and quality of health care rather than a general improvement in nutritional status. Underweight status weakens immune and nonimmune host defenses; underweight children are therefore at higher risk of infectious diseases and of severe complications from them, and consequently in greater need of health care. In column (6) of Table VI, we add controls for age and gender. The results remain qualitatively unchanged. The treatment effect is quantitatively important. For this purpose, the baseline proportion of infants in each risk category
16. The z-score is a normally distributed measure of growth, defined as the difference between the weight of an individual and the median weight of the reference population (2000 CDC Growth Reference in the United States) for the same age, divided by the standard deviation of the reference population. We exclude z-scores with |z| > 4.5 as implausible and omit observations with a recorded weight above the 90th percentile in the growth chart reported in Cortinovis et al. (1997). Because weight is measured by trained enumerators, the reporting error is likely due to misreported age of the child. The coefficient estimate (standard error) on the treatment indicator is 0.16 (0.09) when including these outliers.
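The z-score in footnote 16 and its two exclusion rules can be written directly. This Python sketch follows the simplified definition stated in the footnote (deviation from the reference median divided by the reference standard deviation); the published 2000 CDC reference is distributed as LMS parameters, which the sketch does not use. The reference medians, standard deviations, and 90th-percentile cutoffs in the example are hypothetical placeholders.

```python
from typing import List, Tuple


def weight_for_age_z(weight_kg: float, ref_median_kg: float, ref_sd_kg: float) -> float:
    """(weight - reference median for the same age) / reference standard deviation."""
    return (weight_kg - ref_median_kg) / ref_sd_kg


def clean_z_scores(obs: List[Tuple[float, float, float, float]],
                   bound: float = 4.5) -> List[float]:
    """Keep z-scores with |z| <= bound, dropping weights above a 90th-percentile cutoff.

    Each observation is (weight_kg, ref_median_kg, ref_sd_kg, p90_weight_kg),
    all for the child's age; the cutoff stands in for the growth chart in
    Cortinovis et al. (1997).
    """
    kept = []
    for weight, median, sd, p90 in obs:
        if weight > p90:
            continue  # recorded weight implausibly high for the stated age
        z = weight_for_age_z(weight, median, sd)
        if abs(z) <= bound:
            kept.append(z)
    return kept


sample = [
    (7.0, 8.0, 0.5, 9.5),   # z = -2.0, kept
    (5.0, 8.0, 0.5, 9.5),   # z = -6.0, excluded as implausible
    (10.0, 8.0, 0.5, 9.5),  # above the 90th-percentile cutoff, excluded
]
print(clean_z_scores(sample))  # [-2.0]
```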
FIGURE II
Distributions of Weight-for-Age z-Scores for Treatment and Control Groups
[Plot: density (vertical axis, 0 to 0.3) of weight-for-age z-scores (horizontal axis, −4 to 4) for the treatment and control groups.]
Weight-for-age z-scores for children under 18 months, excluding observations with recorded weight above the 90th percentile in the growth chart reported in Cortinovis et al. (1997). Sample size is 1,135 children. The solid line depicts the distribution for the treatment group and the dashed line the distribution for the control group. The vertical solid line denotes the mean in the treatment group; the vertical dashed line denotes the mean in the control group.
(severe, […]

[…] 0, the parent could increase the utility of each child by adopting the income-maximizing human capital policy and adjusting the transfer to finance this change. It follows that the cost to the parent is the same and the child is made better off. In the unconstrained case, it is possible to fully characterize the solution to the income maximization problem. The main features of the solution are as follows (see Manuelli and Seshadri [2008] for details):

1. The optimal allocation of time implies that n(a) = 1 (the individual spends all his time producing human capital) for a finite number of years. This period, whose length we denote by s, corresponds to years of schooling.
2. For a > 6 + s, the individual is working, but he continues to invest in human capital, at lower rates.
3. Higher wages result in more schooling and in an increase in the amount of human capital per year of schooling.

Given the interest rate, retirement age, and wage rate, human capital is independent of fertility decisions (in the unconstrained case). Thus, any effect of fertility on quality is driven by general equilibrium effects. To characterize the solution to the household problem, we need to describe the optimal choice of consumption (this is standard) and the optimal choice of fertility. The first-order condition corresponding to the optimal choice of f is

$$
\begin{aligned}
&\alpha_1 e^{-\alpha_0+\alpha_1 f} e^{-\rho B}\left[\int_0^{I} e^{-\rho(a-I)}\,u(c_k(a))\,da + V_k\big(h_k(I), b_k, g_k\big)\right] \\
&\quad = u'(c(I))\,e^{f} e^{-rB}\left[\int_0^{I} e^{-r(a-I)}\,c_k(a)\,da + e^{-r(6-I)} x_E + b_k + g_k e^{rI} - \int_6^{I} e^{-r(a-I)}\big(w h_k(a)(1-n_k(a)) - x_k(a)\big)\,da\right]
\end{aligned}
\tag{6}
$$
EXPLAINING INTERNATIONAL FERTILITY DIFFERENCES

The interpretation is simple. The left-hand side corresponds to the marginal benefit of a child: his utility multiplied by the effective discount factor. The right-hand side corresponds to the cost, measured in utility units, of an additional child. This cost is the sum of consumption expenditures, investment in early childhood capital and health capital, and bequests, less the child's net income. Note that both the costs and the potential benefits (i.e., if net income is positive) are considered only during the period that the child spends attached to his parent. In the steady state, it must be the case that

$$V_k\big(h_k(I), b_k, g_k\big) = V(h, b, g), \qquad h_k(I) = h, \quad b_k = b, \quad g_k = g.$$
Moreover, in the steady state, the effective discount factor (in logs), −α0 + α1 f − ρB, equals f − rB. Using these steady-state restrictions, (6) becomes

$$\frac{\alpha_1}{1-e^{f-rB}}\,\frac{1}{u'(c(I))}\int_0^{T(g)} e^{-\rho(a-I)}\,u(c(a))\,da = \int_0^{I} e^{-r(a-I)}\Big[c(a) - \big(w h(a)(1-n(a)) - x(a)\big)\Big]\,da + e^{-r(6-I)} x_E + b + e^{rI} g_k. \tag{7}$$

The left-hand side of (7), the benefit of an additional child, is the goods equivalent of an infinite sequence of life-cycle utility, using the effective discount factor (f − rB) to discount future flows, multiplied by the semi-elasticity of the value of a child (α1). The right-hand side contains the same cost items that we discussed before: consumption, expenditures on human and health capital, and transfers, net of child labor income. The optimal choice of health capital satisfies

$$e^{-\rho(T(g)-I)}\,u\big(c_k(T(g))\big)\,T'(g) = u'(c(I))\,e^{rI}. \tag{8}$$
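The identity used just before (7), that in the steady state −α0 + α1 f − ρB equals f − rB, follows from the interior interest-rate condition r = ρ + [α0 + (1 − α1)f]/B stated in the Appendix. A short Python check is sketched below. ρ = 0.04, B = 25, and the fertility rate 2 × e^f = 2.1 come from the calibration section; the altruism parameters α0 and α1 are hypothetical values chosen only for illustration.

```python
import math

RHO = 0.04             # discount rate fixed in the calibration
B = 25.0               # B = 25 as assumed in the calibration
F = math.log(2.1 / 2)  # f implied by a fertility rate 2 * e**f = 2.1


def interior_interest_rate(alpha0, alpha1, f, rho=RHO, b=B):
    """Appendix condition: r = rho + [alpha0 + (1 - alpha1) * f] / B."""
    return rho + (alpha0 + (1.0 - alpha1) * f) / b


def log_effective_discount(alpha0, alpha1, f, rho=RHO, b=B):
    """Exponent of the effective discount factor: -alpha0 + alpha1 * f - rho * B."""
    return -alpha0 + alpha1 * f - rho * b


# Hypothetical altruism parameters, chosen only for illustration.
alpha0, alpha1 = 0.2, 0.55
r = interior_interest_rate(alpha0, alpha1, F)
lhs = log_effective_discount(alpha0, alpha1, F)
rhs = F - r * B
print(f"r = {r:.4f}; -a0 + a1*f - rho*B = {lhs:.6f}; f - r*B = {rhs:.6f}")
```

The two exponents coincide for any α0 and α1, because substituting r into f − rB cancels the (1 − α1)f and α0 terms. With these inputs e^{f − rB} is also below one, which keeps the factor 1/(1 − e^{f − rB}) in (7) finite and positive.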
The second-order condition requires that the left-hand side of (8) be a decreasing function of g. Given that this is satisfied (more on this later), the condition implies that decreases in the marginal utility of income (driven, for example, by increases in productivity) result in increases in health capital and longer life expectancy. To obtain a more intuitive characterization of the solution, we assume that the utility function is isoelastic and given by

$$u(c) = \frac{c^{1-\theta}}{1-\theta}.$$
The optimal choice of consumption is given by

$$c(a) = c(I)\,e^{\frac{r-\rho}{\theta}(a-I)}, \qquad a \in [0, T(g)]. \tag{9}$$
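Equation (9) is the consumption Euler equation under isoelastic utility: with u'(c) = c^{−θ}, log consumption grows at the constant rate (r − ρ)/θ. The sketch below evaluates (9) and confirms that slope numerically. ρ = 0.04 and θ = 1/1.6 follow the calibration section (θ is backed out from the reported intertemporal elasticity of 1.6); r = 0.05, c(I) = 1, and I = 20 are hypothetical inputs chosen only for illustration.

```python
import math

RHO = 0.04       # discount rate from the calibration section
THETA = 1 / 1.6  # inverse of the calibrated intertemporal elasticity of 1.6


def consumption_path(c_I, r, ages, I, rho=RHO, theta=THETA):
    """Equation (9): c(a) = c(I) * exp((r - rho) / theta * (a - I))."""
    growth = (r - rho) / theta
    return [c_I * math.exp(growth * (a - I)) for a in ages]


# Hypothetical inputs: r = 0.05, normalization c(I) = 1, reference age I = 20.
ages = list(range(0, 61, 10))
path = consumption_path(1.0, 0.05, ages, I=20)

# Log consumption is linear in age, with slope (r - rho) / theta.
slope = (math.log(path[-1]) - math.log(path[0])) / (ages[-1] - ages[0])
print(round(slope, 4))
```

Because r > ρ in this example, consumption rises with age; with r < ρ the same formula would produce a declining profile.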
To compute the right-hand side of (7), we need the equilibrium values of the endogenous variables.6 For a ≥ 0, let net income be defined as y(a) = wh(a)(1 − n(a)) − x(a). Given our demographic structure, y(a) satisfies 0 0≤a 0. The system formed by equations (10) (after (27a), (27b), and (27c) are substituted in), (11), and (17) defines a solution for the triplet (c(I), g, f) once the relationship between fertility and prices, as captured in (14), (15), and (16), is taken into account.

It is possible to use equation (10) to discuss the role of TFP informally. To simplify, let us view the equilibrium level of g as a function of household income, I, the wage rate, w, and the interest rate, r. Consider an increase in TFP, holding the interest rate constant. If this shock is to decrease fertility, the marginal benefit of a child must increase by less than the marginal cost. This corresponds to (assuming differentiability) dg dL(y, r, 0) ∂ (g, r) −1 0 dz ∂ I dz ∂w dz and

$$\frac{dL(y, r, 0)}{dz} = \frac{\partial L(y, r, 0)}{\partial w}\,\frac{dw}{dz} > 0.$$

It follows that a necessary condition for (19) to hold is that ∂ (g, r)/∂g < 1, which is equivalent to

$$e^{-\lambda(r)T(g)} + \frac{T''(g)}{\lambda(r)\,\big(T'(g)\big)^2}\left(e^{-\lambda(r)T(g)} - 1\right) < \frac{1}{\alpha_1 + \theta - 1}. \tag{20}$$
This expression establishes that the impact of TFP changes on fertility is a quantitative issue. Simple algebra shows that the left-hand side of (20) is greater than one.7 Thus, for values of α1 + θ close to 2, where the right-hand side of (20) approaches one, the inequality will be violated. For increases in TFP to induce decreases in the number of children, it must be the case that the degree of imperfect altruism is sufficiently low (i.e., α1 is small). Equation (19) also highlights the role played by the response of health capital to increases in income: if the term ∂g/∂I is sufficiently large, then increases in TFP are more likely to lower fertility.

III. CALIBRATION

We use standard functional forms for the utility function and the final goods production function. As indicated before, the utility function is assumed to take the form

$$u(c) = \frac{c^{1-\theta}}{1-\theta}, \qquad 0 < \theta < 1.$$
The production function is assumed to be Cobb–Douglas:

$$F(K, H) = zK^{\alpha}H^{1-\alpha}.$$

We assume that the mapping between health capital and life expectancy is

$$T(g) = \bar{T}\left(1 - g^{-\mu}\right), \qquad \mu > 0.$$
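The lifespan mapping above is increasing and concave in g, and the second-order condition (footnote 7), λ(r)T'(g)^2 + T''(g) < 0, is what makes the left-hand side of (20) exceed one. Dividing that condition by λ(r)T'(g)^2 > 0 gives T''(g)/(λ(r)T'(g)^2) < −1. Because e^{−λ(r)T(g)} − 1 < 0, the left-hand side of (20) is then larger than e^{−λT} + (1 − e^{−λT}) = 1. The Python sketch below checks both properties numerically. The function λ(r) is defined in a part of the text not reproduced here, so it is treated as a given positive number; T̄, µ, and λ are hypothetical illustrative values, not the paper's calibration.

```python
import math


def lifespan(g, T_bar, mu):
    """T(g) = T_bar * (1 - g**(-mu)), the assumed map from health capital to lifespan."""
    return T_bar * (1.0 - g ** (-mu))


def lifespan_d1(g, T_bar, mu):
    """First derivative T'(g) = T_bar * mu * g**(-mu - 1), positive for g > 0."""
    return T_bar * mu * g ** (-mu - 1.0)


def lifespan_d2(g, T_bar, mu):
    """Second derivative T''(g) = -T_bar * mu * (mu + 1) * g**(-mu - 2), negative for g > 0."""
    return -T_bar * mu * (mu + 1.0) * g ** (-mu - 2.0)


def soc_holds(g, lam, T_bar, mu):
    """Second-order condition from footnote 7: lam * T'(g)**2 + T''(g) < 0."""
    return lam * lifespan_d1(g, T_bar, mu) ** 2 + lifespan_d2(g, T_bar, mu) < 0.0


def lhs_20(g, lam, T_bar, mu):
    """Left-hand side of (20), with lam standing in for lambda(r)."""
    T = lifespan(g, T_bar, mu)
    d1 = lifespan_d1(g, T_bar, mu)
    d2 = lifespan_d2(g, T_bar, mu)
    decay = math.exp(-lam * T)
    return decay + d2 / (lam * d1 ** 2) * (decay - 1.0)


# Hypothetical illustrative values (not the paper's calibration).
T_BAR, MU, LAM = 100.0, 0.5, 0.01
for g in (1.5, 2.0, 5.0, 20.0, 50.0):
    if soc_holds(g, LAM, T_BAR, MU):
        print(f"g={g:5.1f}  T={lifespan(g, T_BAR, MU):6.2f}  LHS(20)={lhs_20(g, LAM, T_BAR, MU):.3f}")
```

The values are restricted to g > 1 so that T(g) is positive; at every point printed the second-order condition holds and the left-hand side of (20) is above one.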
Our calibration strategy involves choosing the parameters so that the steady-state implications of the model economy presented above match the appropriate moments for the United States (circa 2000). The only exogenous variable that we allow to differ across economies is the level of TFP. It is chosen so that the model’s predictions for output per worker match the observed value for each group (decile) of countries. Consequently, although the model matches output per worker by construction, it makes predictions about years of schooling, expenditures on health capital, expenditures on education, and lifespan. We set some parameters consistent with the values commonly accepted in the macro literature. The discount factor is fixed at ρ = 0.04 and the depreciation rate is set at δk = 0.06. Capital’s share of income is set at 0.33. Less information is available on the fraction of job training expenditures that are not reflected in wages. Following Manuelli and Seshadri (2008), we assume that half of the investments in human capital in the postschooling period are not recorded as such in the NIPA. The parameter α1 determines the degree of curvature in the altruism function of the individual. We choose the level of α1 in the United States to match a fertility rate (corresponding to 2 × e^f in the model) of 2.1. Finally, we assume that B = 25.

Our theory implies that only the ratio $h_B^{1-\gamma}/\big(z_h^{1-\upsilon}\,w^{\gamma_2-\upsilon(1-\gamma_1)}\big)$ matters for the moments of interest. Consequently, we can choose z, pk (which determine w), and hB arbitrarily and calibrate zh to match a desired moment. The calibrated value of zh is common to all countries. Thus, the model does not assume any cross-country differences in an individual’s “ability to learn” or in the ability to produce lifespan. This leaves us with ten parameters: α0, δh, zh, γ1, γ2, υ, α1, θ, T̄, and µ. The moments we seek in order to pin down these parameters are as follows:
1. Earnings at age R/Earnings at age 55 of 0.8. Source: SSA.8
2. Earnings at age 50/Earnings at age 25 of 2.17. Source: SSA.
3. Years of schooling of 12.08. Source: Barro and Lee, 2000.
4. Schooling expenditures as a percentage of GDP of 3.77%. Source: OECD, Education at a Glance, 2003.
5. Pre-primary expenditures per pupil relative to GDP per capita of 0.14. Source: OECD, Education at a Glance, 2003.
6. Fertility rate of 2.1. Source: UNDP.
7. Lifetime intergenerational transfers/GDP of 4.5%. Source: Gale and Scholz, 1994.9
8. Capital–output ratio of 2.52. Source: NIPA.
9. Health expenditures/GDP of 10%.10
10. Lifespan of 74 years. Source: World Health Statistics, 2005.

7. To see this, note that the second-order condition requires that λ(r)(T'(g))^2 + T''(g) < 0.

8. The earnings data come from restricted-access Social Security earnings records for the cohort born between 1931 and 1941 (in the Health and Retirement Study) and are adjusted for overall earnings growth.

9. We experimented with 1% and 7%, and the main results remained unchanged.

10. U.S. data for health expenditures as a fraction of GDP seem like an outlier. We chose to use a lower number, 10%, which corresponds to the average of the OECD countries.
Theory implies that when bequests are interior, the human capital allocations that result from the solution to the parent’s problem correspond to the allocations that result from the simpler income maximization problem. Consequently, we proceed in two steps, because the ten equations in ten unknowns are “block-separable.” For a given real interest rate and wage rate, we calibrate the parameters δh, zh, γ1, γ2, and ν to match moments 1, 2, 3, 4, and 5. Thus we follow a long-standing tradition in labor economics and use the properties of the age–earnings profile to identify the parameters of the production function of human capital. We then choose the other five parameters to match moments 6, 7, 8, 9, and 10. The parameter values that result in a perfect match between model and data are as follows:

δh zh γ1 γ2 ν α1 θ T̄ µ α0
0.24 0.018 0.361 0.63 0.3 0.55 0.65 0.62 101.2 0.22

Of some interest are our estimates of αi. Because α0 is positive and α1 is less than one, our individuals are imperfectly altruistic. Further, α1 + θ > 1, which is necessary for the existence of an equilibrium with positive fertility. Our estimates of the parameters governing the human capital production function are in line with the best available estimates. The degree of returns to scale in the human capital production function is 0.93, which is exactly what Heckman, Lochner, and Taber (1998) and Kuruşçu (2006) find. Nevertheless, our central results (explaining international fertility differences) go through even with a lower value of 0.8. For more on how the human capital estimates compare with the existing literature, we refer the reader to Manuelli and Seshadri (2008). Our calibrated value of the intertemporal elasticity of substitution (1/θ) is 1.6. Available estimates range from zero to values that are substantially higher than the one we calibrate. Hall (1988) and Campbell (1999) argue that it is significantly less than one.
Hansen and Singleton (1982), Attanasio and Weber (1989), and Attanasio and Vissing-Jørgensen (2003) find that the elasticity of intertemporal substitution is greater than one. Gruber (2006) discusses the biases associated with the different estimation strategies. He estimates the IES using (arguably exogenous) variability in the tax rates on capital income. Even though he cannot control for the correlation between the variance of consumption and his measures of changes in the intertemporal price of consumption, we view Gruber’s estimates as more reliable. His preferred value of the IES is 2. Given the recent empirical evidence, our estimate is, if anything, somewhat conservative, as it lies close to the middle of the point estimates.11 Finally, Barro (2007) argues that the intertemporal elasticity of substitution needs to be greater than 1 (that is, θ < 1) to generate the reasonable condition that greater uncertainty lowers the price–earnings ratio for unlevered consumption claims. All these arguments are consistent with the calibrated value of θ.

IV. RESULTS

Before turning to the results, we first describe the data in order to get a feel for the observations of interest. We start with the countries in Penn World Tables (PWT) 6.1 and group them into deciles according to their output per worker, y. Next, we combine them with observations on years of schooling (s), expenditures on primary and secondary schooling relative to GDP (xs), life expectancy at age 1 (T), and the total fertility rate (2 × e^f) for each of these deciles. The population values are displayed in Table I.

TABLE I
WORLD DISTRIBUTION

                                  Schooling
Decile   Relative output   Years, s   Expenditures, xs   Lifespan, T   Fertility, 2×e^f
         y/yUS                                                          TFR×(1−inf)
90–100   0.921             10.93      3.8                78            1.74
80–90    0.852              9.94      4.0                76            2.1
70–80    0.756              9.72      4.3                73            2.28
60–70    0.660              8.70      3.8                71            2.50
50–60    0.537              8.12      3.1                69            2.82
40–50    0.437              7.54      2.9                64            3.37
30–40    0.354              5.88      3.1                57            3.92
20–30    0.244              5.18      2.7                54            4.76
10–20    0.146              4.64      2.5                51            5.32
0–10     0.052              2.45      2.8                46            5.66

Table I illustrates the wide disparities in incomes across countries. The United States has an output per worker that is about 20 times as high as that of countries in the bottom decile.

11. Bansal and Yaron (2004) show that if the true model (assuming Epstein–Zin preferences) has an intertemporal elasticity of substitution greater than one (1.5 in their example), it is possible for Hall’s regression to get an estimate close to zero.
Years of schooling also vary systematically with the level of income, from about 2.5 years in the bottom decile to about 11 at the top. The quality of education, as proxied by expenditures on primary and secondary schooling as a fraction of GDP, also seems to increase with the level of development. This measure should be viewed with some caution, as it includes only public inputs and not private inputs (including the time and resources that parents invest in their children). Next, notice that demographic variables also vary systematically with the level of development: higher-income countries enjoy longer life expectancies and lower fertility rates. More important, although demographics vary substantially in the lower half of the income distribution, they do not move much in the top half.

IV.A. Accounting for International Differences in Fertility

We now examine the ability of the model to simultaneously match the cross-country variation in output per capita and years of schooling. To be clear, we choose the level of TFP in a particular country/decile to match output per worker.12 We then check whether the predictions for the fertility rate, lifespan, and schooling are in accordance with the data. In the process we do not assume that the solution is interior; we allow the constraint on the nonnegativity of bequests to bind, and, in fact, it binds for the poorer countries. Table II presents the predictions of the model and the data. The model captures reasonably well the cross-country variation in the quantity of children, as measured by the fertility rate, and in the quality of children, as reflected in years of schooling. As we move from the bottom to the top decile of the world income distribution, fertility in the model decreases from 5.82 to 2.22, which compares very favorably with the decline from 5.66 to 1.74 observed in the data. The change in lifespan from 48 years to 76 is also in line with the data.13 Furthermore, the model captures the variation in schooling quantity and quality across countries. We conclude that differences in TFP can go a long way toward accounting for the observed cross-country differences in the quality and quantity of children.

12. We assume that R = min{64, T}.

13. We could have used life expectancy at age 5 instead of age 1. Doing so increases T by a little more than 2.5 years in the lowest decile, by 2 years in the second lowest, and by 1 year in the third lowest decile. The rest are not significantly affected. Such a change only brings the model’s predictions closer to the data.
TABLE II
FERTILITY, LIFE EXPECTANCY, AND SCHOOLING—DATA AND MODEL

                          Schooling, s      Fertility, 2×e^f
Decile   y/yUS   TFP     Data    Model     Data    Model
90–100   0.921   0.99    10.93   11.24     1.74    2.22
80–90    0.852   0.98     9.94   10.56     2.1     2.31
70–80    0.756   0.96     9.72   10.11     2.28    2.49
60–70    0.660   0.94     8.70    8.92     2.50    2.73
50–60    0.537   0.92     8.12    7.96     2.82    2.99
40–50    0.437   0.89     7.54    6.44     3.37    3.71
30–40    0.354   0.86     5.88    5.52     3.92    3.98
20–30    0.244   0.83     5.18    4.24     4.76    4.58
10–20    0.146   0.81     4.64    2.94     5.32    5.22
0–10     0.052   0.73     2.45    2.12     5.66    5.82

                          Schooling exp., xs   Health exp./GDP   Lifespan, T
                          (% of GDP)           (%)               (years)
Decile   y/yUS   TFP     Data    Model        Data    Model     Data   Model
90–100   0.921   0.99    3.8     3.73         9.59    9.11      78     76
80–90    0.852   0.98    4.0     3.86         9.04    7.79      76     74
70–80    0.756   0.96    4.3     3.94         6.7     6.26      73     72
60–70    0.660   0.94    3.8     4.12         7.1     6.07      71     68
50–60    0.537   0.92    3.1     4.54         6.62    5.63      69     67
40–50    0.437   0.89    2.9     4.13         6.24    5.22      64     64
30–40    0.354   0.86    3.1     3.83         5.72    4.61      57     59
20–30    0.244   0.83    2.7     3.46         5.44    5.71      54     56
10–20    0.146   0.81    2.5     2.88         5.61    6.22      51     53
0–10     0.052   0.73    2.8     2.29         6.34    7.51      46     48
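To put a number on the fit described in Section IV.A, one can compare the two fertility columns of Table II decile by decile. The summary statistics below (largest and mean absolute gap) are our own choice for illustration, not measures reported in the paper.

```python
# Fertility (2 x e^f) by decile from Table II, top decile (90-100) first.
data_tfr = [1.74, 2.1, 2.28, 2.50, 2.82, 3.37, 3.92, 4.76, 5.32, 5.66]
model_tfr = [2.22, 2.31, 2.49, 2.73, 2.99, 3.71, 3.98, 4.58, 5.22, 5.82]

gaps = [m - d for m, d in zip(model_tfr, data_tfr)]        # model minus data
worst = max(range(len(gaps)), key=lambda i: abs(gaps[i]))  # index of the largest miss
mean_abs_gap = sum(abs(g) for g in gaps) / len(gaps)

print(f"largest |model - data| = {abs(gaps[worst]):.2f} at decile index {worst}")
print(f"mean |model - data| = {mean_abs_gap:.3f}")
```

On these numbers the largest gap is in the top decile, where the model's 2.22 exceeds the observed 1.74.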
The Role of Human Capital. What is the role played by human capital? In order to examine this, imagine shutting down human capital—specifically, let zh = 0 and recalibrate the model using moments 6 through 10 to pin down the five parameters (recall that five parameters were specific to the human capital sector). With the new parameters at hand, we can now gauge the sensitivity of the economic environment to changes in TFP just as we did in Table II. The results for the top, middle, and bottom deciles are presented in Table III. Shutting down human capital dramatically lowers the sensitivity of fertility rates to changes in TFP. Specifically, without human capital, the model predicts that birth rates in the lowest decile are only about 2.56, which is far from the 5.66 in the data. In the model, human capital amplifies the effect of changes in TFP, affecting the incentives to invest in quality.
TABLE III
FERTILITY: THE EFFECT OF SHUTTING DOWN HUMAN CAPITAL

Decile   Baseline   zh = 0
90–100   2.22       2.11
40–50    3.71       2.23
0–10     5.82       2.56
FIGURE IV
Capital–Output Ratio vs. Fertility—Model and Data
[Plot: total fertility rate (vertical axis, 0 to 7) against the capital–output ratio (horizontal axis, 2 to 2.6), model and data.]
The Relationship between Capital–Output Ratios and Fertility Rates. Recall from equations (14) and (15) that one of the key predictions of a Barro–Becker-type model is the tight link between interest rates (or capital–output ratios) and total fertility rates (under the assumption of a Cobb–Douglas production function). Figure IV displays the predictions of the model and the data for the total fertility rate and the capital–output ratio for the 10 deciles.14 We find this reasonably close association between the model’s predictions and the data quite comforting, because this relationship is central to the mechanism at work.15

IV.B. Accounting for U.S.–Europe Fertility Differences

The previous section demonstrated the ability of the model to capture the variation in fertility rates across the different stages of development. Nevertheless, there is one glaring failure: the inability to capture the low fertility rates observed in many European countries. Indeed, this feature of the data has been puzzling: Why would the United States and European countries, which are at similar stages of economic development, have dramatically different fertility rates? In this section we examine whether differences in tax rates on labor income can generate such differential behavior in fertility rates in the model. Figure V shows the marked divergence in the fertility rates of the United States and the European nations starting around 1976.16 While American fertility rates increased by more than 17% over the next two decades, European fertility rates fell by a little more than 11%. At the same time, while taxes on labor income in the United States stayed virtually constant, tax rates on labor income in most European countries, as well as Japan and Canada, went up. Prescott (2003) argues that these higher taxes explain the lower hours worked in Europe relative to the United States. Davis and Henrekson (2004) present evidence in support of the negative effect of taxes on labor supply. Table IV presents data on tax rates, total fertility rates, and GDP per worker, and the percentage change in these variables between 1975 and 1995. The marginal tax rate numbers that we report use data from both Barro and Sahasakul (1986) and Prescott (2003). Barro and Sahasakul report that marginal tax rates on labor income did not move between 1975 and 1994, as does Prescott (2003). However, the levels are different. Consequently, we adjust Prescott’s numbers (for each country) so that the marginal tax rates (for the United States) exactly match those of Barro and Sahasakul.17

FIGURE V
U.S.–Europe fertility differences, 1970–1995
[Plot: total fertility rate (vertical axis, 1.5 to 2.3) by year (horizontal axis, 1971 to 2001) for the United States and Europe.]

The Effects of Taxes on Fertility. To study the effects of taxes on fertility, imagine adding a tax on labor income (τh) and on capital income (τk) to the baseline model. The effective prices that the consumer faces are $\tilde r = r(1-\tau_k)$ and $\tilde w = w(1-\tau_h)$. Further, assume that the revenues from these taxes are rebated to individuals in a lump-sum fashion.

14. To construct the capital–output ratio, we closely followed the methodology of Hsieh and Klenow (2007). We obtain investment-to-GDP ratios (in domestic prices) from the PWT. We then divide this ratio by the sum of the depreciation rate (which we assume to be 0.06 in every country) and the population growth rate to get the capital–output ratio.

15. Our data come from the PWT and are adjusted for differences in the relative price of capital. This is the measure used by Hsieh and Klenow (2007). More recently, Caselli and Feyrer (2007) report two measures of K/Y. One of their measures coincides with the Hsieh–Klenow measure. The other measure (available only for a subset of the countries we study), which includes adjustments for the value of agricultural and urban land to better measure capital, behaves a little differently than the Hsieh–Klenow estimate. Our model’s predictions lie (almost exactly) in between the Hsieh–Klenow measure and the Caselli–Feyrer measure. We think this is very reasonable for at least two reasons. First, there are many issues with the World Bank data that Caselli and Feyrer use that can potentially create biases. Second, the fact that the capital–output ratio is relatively flat in output per capita is specific to 2000 data. If one were to use 1960 or 1970 data, a much stronger relationship between capital–output ratios and development would emerge.

16. The European nations included in Figure V are Finland, France, Germany, Hungary, Italy, the Netherlands, Norway, Portugal, Sweden, Switzerland, and the United Kingdom.

17. Prescott starts with average tax rates and then multiplies them by 1.6 in order to convert them to marginal rates. He acknowledges that the 1.6 number is questionable. A measure of the marginal tax rate that follows the same methodology as Barro and Sahasakul for the other G7 countries does not seem to be available. Nevertheless, we tried various other adjustments to Prescott’s numbers. In our preferred adjustment, we used the ratio of marginal tax rates to average tax rates for 2000 (the earliest year for which we could locate data, from the OECD Tax Database) instead of the 1.6 that Prescott uses. The results were very similar to what we report in Table IV.
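The construction in footnote 14 can be written as K/Y = (I/Y)/(δ + n): in a steady state with no growth in output per worker, the capital stock grows at the population growth rate n, so gross investment is I = (δ + n)K. A minimal Python sketch of that formula follows, with δ = 0.06 as in the footnote; the investment share and population growth rate in the example are hypothetical.

```python
DELTA = 0.06  # depreciation rate assumed for every country (footnote 14)


def capital_output_ratio(investment_share, pop_growth, delta=DELTA):
    """Steady-state K/Y = (I/Y) / (delta + n), following footnote 14's description."""
    return investment_share / (delta + pop_growth)


# Hypothetical country: investment is 20% of GDP, population grows 1% per year.
print(round(capital_output_ratio(0.20, 0.01), 3))
```

Holding the investment share fixed, faster population growth lowers the implied capital–output ratio, because more investment is needed just to equip new workers.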
TABLE IV
TAXES AND FERTILITY, G7—DATA

          Tax rate on labor income   Total fertility rate   GDP per worker
Country   1975     % change          1975     % change      1975     % change
Germany   0.54     19                1.77     −28           0.76     41
France    0.43     35                2.36     −29           0.81     39
Italy     0.38     100               2.37     −49           0.81     57
Canada    0.43     34                2.06     −20           0.93     22
U.K.      0.47     −5                2.2      −21           0.71     42
Japan     0.23     67                2.02     −22           0.53     75
U.S.      0.45     2                 2.1      −3            1.00     41
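Combining the effective wage w̃ = w(1 − τh) with the labor-tax columns of Table IV shows how large the implied changes in after-tax wages are. The sketch below recovers each country's 1995 rate from the 1975 level and the reported percentage change; these implied rates round to the 1995 labor-tax column of Table VII. Holding the pre-tax wage fixed is a simplifying assumption made only for this illustration.

```python
# Labor-income tax rates from Table IV: 1975 level and percentage change to 1995.
TAX_1975 = {"Germany": 0.54, "France": 0.43, "Italy": 0.38, "Canada": 0.43,
            "U.K.": 0.47, "Japan": 0.23, "U.S.": 0.45}
TAX_PCT_CHANGE = {"Germany": 19, "France": 35, "Italy": 100, "Canada": 34,
                  "U.K.": -5, "Japan": 67, "U.S.": 2}


def tax_1995(country):
    """Implied 1995 rate: tau_1975 * (1 + pct_change / 100)."""
    return TAX_1975[country] * (1 + TAX_PCT_CHANGE[country] / 100)


def after_tax_wage_factor(tau_h):
    """Effective wage relative to the pre-tax wage: w_tilde / w = 1 - tau_h."""
    return 1 - tau_h


for country in TAX_1975:
    t0, t1 = TAX_1975[country], tax_1995(country)
    change = after_tax_wage_factor(t1) / after_tax_wage_factor(t0) - 1
    print(f"{country:8s} tau_h {t0:.2f} -> {t1:.2f}; after-tax wage factor {100 * change:+.0f}%")
```

Italy stands out: doubling a 0.38 rate to 0.76 cuts the after-tax share of the wage from 0.62 to 0.24.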
In order to describe the equilibrium in a model with taxes and transfers, let 0 0≤a− . ∂g dp dp However, as a first approximation, dL(q, r, p)/dp ≈ dI/dp and dg/dp ≈ (∂g/∂I)(dI/dp). Thus, such a redistribution increases fertility if

$$\left[\left(\frac{\partial\,(g, r)}{\partial g} - 1\right)\frac{\partial g}{\partial I} + 1\right]\frac{dI}{dp} > 0.$$

TABLE VI
EFFECT OF SOCIAL SECURITY ON FERTILITY—MODEL

τh       0.40   0.45   0.50   0.55   0.60
2×e^f    2.10   1.95   1.84   1.76   1.66
Because dI/dp < 0, it must be the case that the term in square brackets is negative. As before, a large response of health capital to changes in income guarantees that this inequality holds. To summarize, increases in tax rates decrease fertility, whereas a redistribution of tax proceeds increases fertility. To evaluate the quantitative impact of expanding a social-security-like regime, we consider an increase in tax rates with the proceeds allocated to individuals aged 65 or older. Table VI displays the results. Notice that social security has a negative effect on fertility rates. This is driven by the way it is financed (using distortionary labor income taxes) and not by the timing of payments. Note also that Ricardian equivalence fails due to the presence of distortionary taxation. Social security leads to a decrease in fertility rates only because the presence of human capital makes the supply of effective labor elastic. Absent human capital, or if the social security program were financed using lump-sum taxes, there would be only one effect at play: a rise in social security receipts would lead to a fall in the stock of capital, which would lead to a fall in the capital–output ratio and hence raise the fertility rate. Indeed, this is the argument in Boldrin, De Nardi, and Jones (2005). Thus, the addition of human capital and the major role played by taxation in this version of the Barro–Becker model imply that more generous social security regimes financed by higher taxes on labor income have a negative net effect on fertility.
Taking Stock: Fertility Change in the G7 Countries. The above discussion suggests that distortionary taxes on human capital acquisition can have large effects on fertility choice. We now see whether the model is capable of accounting for the changes in fertility rates between the 1970s and the 1990s for the G7 countries, including the United States. We take as a starting point the parameterization (for the United States) from the previous section. Then, as we move across countries, we vary three things:
• TFP (to match output per worker),
• Taxes on labor income (using the estimates in Prescott [2003]), and
• Ratio of Social Security payments to GDP (OECD, Quarterly Labour Force Statistics).
Each of these countries saw a rise in the share of GDP spent on social security payments, with some, such as Japan, experiencing almost a threefold rise, and others, such as Germany, seeing only a modest increase. Table VII contains relevant data from the G7 countries and the predictions of the model. The match between model and data, though not perfect, is reasonably good, especially given that the model abstracts from differences in the prices of many services (e.g., child care subsidies) and transfers (e.g., maternity leave) that may well affect fertility choices. The predictions of the model for 1975 agree with the evidence with two exceptions: France, where the model underpredicts fertility, and Japan, where the model overpredicts fertility. In all cases the model predicts that the combination of observed tax increases and changes in the social security regime results in fewer children per person. The magnitude of the decrease is in line with the data, with two exceptions: Japan and Italy. As before, the errors have different signs, which suggests that other factors that differentiate these countries probably play a major role. The model implies that lower fertility rates are associated with higher capital–output ratios.
With the exception of the United Kingdom, this is exactly what we see. The correlation between capital–output ratios and total fertility rates is −0.72 in the model and −0.59 in the data. The model performs reasonably well in matching health statistics (the data source is World Health Statistics and includes both public and private expenditures on health), both lifespan and expenditures on life extension. The match between model
1995
9.64 10.13 9.37 10.67 10.12 9.21 12.08
1975
Germany 7.86 France 7.15 Italy 6.34 Canada 9.55 U.K. 7.34 Japan 6.21 U.S. 10.11
Country 7.73 6.08 5.28 9.54 8.01 7.36 10.01
1975
1995 9.06 7.94 6.60 11.18 9.03 9.44 12.08
Data
67 66 66 68 67 63 69
1975 75 75 77 76 76 75 78
1995
Model
Life span
−0.28 −0.29 −0.49 −0.20 −0.21 −0.22 −0.03
1.77 2.30 2.33 2.10 2.20 2.02 2.10
−0.21 −0.32 −0.61 −0.21 −0.13 −0.09 −0.09
1.86 1.92 2.23 2.11 2.30 2.45 2.20
Model
0.64 0.58 0.76 0.58 0.45 0.38 0.46
Change in fertility rate Data
Total fertility rate
Model—1975 Data—1975 Model
Schooling
0.54 0.43 0.38 0.43 0.47 0.23 0.45
Germany France Italy Canada U.K. Japan U.S.
1995
Schooling
1975
Tax rate on labor
Country
TABLE VII TAXES AND FERTILITY, G7—-MODEL VERSUS DATA
70 71 68 71 71 70 68
74 74 76 76 75 76 74
2002
Data
4.86 9.35 10.90 5.06 8.76 9.70 4.59 8.40 8.50 4.85 9.18 9.60 5.53 8.86 7.70 5.06 8.67 7.90 6.87 10.16 14.60
1995
Model
Health exp./GDP
1975 1995 1975
Data
Life exp.
and data in terms of lifespan is worse in the 1975 steady state than in 1995. The model also does suggest, counterfactually (in 1975 and in 1995), that Americans should enjoy greater life spans than their counterparts in the G7 nations. The fact that households in the United States spend substantially more than other nations and yet enjoy a slightly lower lifespan is puzzling. Aside from these counterfactual predictions, the model performs reasonably well.21 We find it remarkable that changes in tax rates and TFP can go a long way toward understanding the puzzling behavior of fertility rates in the richer nations over the last few decades. V. THE UNITED STATES: 1900 VS. 2000 In this section we use the calibrated model to predict life expectancy, fertility, and schooling for the United States in 1900. To be precise, we take our base calibrated model (with taxes) as a good description of the (steady state) of the United States economy circa 2000. We take the (extreme) view that the U.S. economy circa 1900 was in a steady state as well. Clearly, this is not realistic, but accounting for transition effects is beyond the scope of this paper and, in any case, the purpose of this exercise is to evaluate how large changes in taxes and TFP can affect fertility. For that purpose, the steady state assumption is not a bad approximation. The only differences between 2000 and 1900 are the tax rates (which are assumed to be 0 in 1900) and TFP (which is chosen to match output per worker). Total factor productivity increases by about 36% in order to generate the fivefold increase in output per worker. The results of the experiment are in Table VIII. The model does a reasonable job of accounting for the changes over the last century. The predictions of the model for fertility, schooling, and life expectancy are not perfect but they are reasonably close to the data. 
If anything, relative to the data, the model predicts larger responses of the endogenous variables to changes in TFP.22

21. Some caution is needed in comparing model and data in terms of health expenditures. In reality, not all health expenditures are necessarily for the purpose of life extension. On the other hand, the value of time spent exercising would have to be imputed into health outlays to permit a more accurate comparison with the model’s predictions.
22. The astute reader will note that the elasticity of output per worker with respect to TFP is significantly smaller in the time-series experiment than in the cross-section experiment in Table II. The reason for this is the distortionary effect
of taxes. Taxes have a large effect on output; hence, to offset the negative influence of rising taxes over the twentieth century, TFP needs to rise by more to generate a fivefold increase in output per worker. Had taxes remained constant, output per worker in the United States would have increased by a factor of 19!

TABLE VIII
UNITED STATES: 1900 AND 2000, DATA AND MODEL

                   s               g/y             2×e^f           T
Period   y/y_US    Data    Model   Data    Model   Data    Model   Data    Model
1900     0.19      5.4     4.83    na      0.04    3.8     4.32    52      48
2000     1         12.08   12.08   0.10    0.10    2.1     2.1     78      78

VI. CONCLUSIONS
This paper integrates a life-cycle model of human and physical capital accumulation, in which life expectancy is endogenous, with the Barro–Becker framework. This permits an interesting tradeoff between the quantity and quality of children and the quantity and quality of life. The model is able to capture the wide variation in fertility rates seen across the income distribution. Further, the model suggests that a substantial part of the lower fertility in the G7 countries is due to their higher labor income tax rates. The fit between the model and the data is surprisingly good, and it is natural to ask what role is left for other factors to influence fertility and life span. Our view is that the next generation of models will need to incorporate a richer structure to better study the role of TFP versus policies. We view our analysis of the role of distortionary taxes as a first step in that direction.

APPENDIX
In this Appendix we show that, in the interior case, utility maximization and lifetime income maximization coincide. To be precise, we assume that $r = \rho + [\alpha_0 + (1 - \alpha_1) f]/B$. In this case, the solution to the optimal human capital accumulation problem corresponding to the maximization of (1) subject to (2)–(4) is identical to the solution of the income maximization problem

$$\max \int_6^R e^{-r(a-6)}\bigl[w\,h(a)\bigl(1-n(a)\bigr) - x(a)\bigr]\,da \;-\; x_E, \tag{22}$$
subject to

$$\dot h(a) = z_h\,[n(a)h(a)]^{\gamma_1}\,x(a)^{\gamma_2} - \delta_h h(a), \qquad a \in [6, R), \tag{23}$$

and

$$h(6) = h_E = h_B\,x_E^{\upsilon}, \tag{24}$$

with $h_B$ given. To see this, we show that the first-order conditions corresponding to both problems coincide. Because the problems are convex, this suffices to establish the result. Consider first the first-order conditions of the income maximization problem given the stock of human capital at age 6, $h(6) = h_E$. Let $q(a)$ be the costate variable. A solution satisfies

$$\begin{aligned}
w h n &\le q\,\gamma_1 z_h (nh)^{\gamma_1} x^{\gamma_2}, \quad \text{with equality if } n < 1, && \text{(25a)}\\
x &= q\,\gamma_2 z_h (nh)^{\gamma_1} x^{\gamma_2}, && \text{(25b)}\\
\dot q &= r q - q\bigl[\gamma_1 z_h (nh)^{\gamma_1} x^{\gamma_2} h^{-1} - \delta_h\bigr] - w(1-n), && \text{(25c)}\\
\dot h &= z_h (nh)^{\gamma_1} x^{\gamma_2} - \delta_h h, && \text{(25d)}
\end{aligned}$$
where $a \in [6, R]$. The transversality condition is $q(R) = 0$. Let $\lambda$ be the Lagrange multiplier associated with the budget constraint (3). Then the problem solved by a parent that is relevant for the decision to accumulate human capital is

$$\max\;\lambda\Bigl\{\int_I^R e^{-r(a-I)}\bigl[w h(a)\bigl(1-n(a)\bigr) - x(a)\bigr]\,da + e^{f}\int_B^{B+I} e^{-r(a-I)}\bigl[w h^k(a)\bigl(1-n^k(a)\bigr) - x^k(a)\bigr]\,da - e^{f}e^{-rB}b^k - e^{f}e^{-r(B+6)}x_E\Bigr\} + e^{-\alpha_0+\alpha_1 f}\,e^{-\rho B}\,V^k\bigl(h^k(B+I), b^k\bigr),$$
where, in this notation, $a$ stands for the parent’s age. It follows that the first-order conditions corresponding to the choice of $[h(a), n(a), x(a), q^p(a)]$ are identical to those corresponding to the income maximization problem (25), including the transversality condition $q^p(R) = 0$, for $a \in [I, R]$. Hence $q^p(a) = q(a)$. Simple algebra shows that the first-order conditions corresponding to the optimal choices of $[h^k(a), n^k(a), x^k(a), q^k(a)]$ also satisfy (25) for $a \in [6, I)$. However, the appropriate transversality condition for this problem is

$$q^k(B+I) = e^{-[\alpha_0 + (1-\alpha_1) f]}\,e^{-(\rho - r)B}\,\frac{1}{\lambda}\,\frac{\partial V^k\bigl(h^k(B+I), b^k\bigr)}{\partial h^k(B+I)}.$$
However, given (5) and the envelope condition

$$\frac{\partial V^k\bigl(h^k(B+I), b^k\bigr)}{\partial h^k(B+I)} = \lambda^k q^p(I),$$

evaluated at the steady state, where $\lambda = \lambda^k$, it follows that $q^k(B+I) = q^p(I)$. (Under the assumption $r = \rho + [\alpha_0 + (1-\alpha_1)f]/B$, the exponent $-[\alpha_0 + (1-\alpha_1)f] - (\rho - r)B$ equals zero, so the two exponential factors in the transversality condition cancel.) Thus, the program solved by the parent (for $a \in [I, R]$) is just the continuation of the problem he solves for his children for $a \in [6, I)$. Clearly, if (5) does not hold, there is a “wedge” between how the child values his human capital after he becomes independent, $q^p(I)$, and the valuation that his parent puts on the same unit of human capital, $q^k(B+I)$.

Manuelli and Seshadri (2008) prove that the solution to the income maximization problem is given by the following conditions:

(a) Time allocated to human capital accumulation:

$$n(a) = \frac{m(a)^{\frac{1}{1-\gamma}}}{e^{-\delta_h(a-s-6)}\,m(6+s)^{\frac{1}{1-\gamma}} + \dfrac{(r+\delta_h)\,e^{-\delta_h(a-R)}}{\gamma_1\,\delta_h}\displaystyle\int_{e^{\delta_h(6+s-R)}}^{e^{\delta_h(a-R)}}\Bigl(1 - x^{\frac{r+\delta_h}{\delta_h}}\Bigr)^{\frac{\gamma}{1-\gamma}}dx}, \qquad a \in [6+s, R]. \tag{26}$$

(b) Market goods allocated to human capital accumulation:
$$x(a) = \frac{\gamma_2\,w}{r+\delta_h}\,C_h(z_h, w, r)\,m(6+s)^{\frac{1}{1-\gamma}}\,e^{\frac{r+\delta_h(1-\gamma_1)}{1-\gamma_2}(a-s-6)}, \qquad a \in [6, 6+s), \tag{27a}$$

$$x(a) = \frac{\gamma_2\,w}{r+\delta_h}\,C_h(z_h, w, r)\,m(a)^{\frac{1}{1-\gamma}}, \qquad a \in [6+s, R), \tag{27b}$$

1 γ (1−γ ) γ γ γ 1−γ (1−γ2 ) 2 γ1 1 γ2 1 2 zh1 w (1−γ1 )(1−γ2 ) m(6 + s) 1−γ (27c) . xE = υ (r + δh)(1−γ2 ) e(r+δh(1−γ1 ))s
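(c) Level of human capital of an individual of age $a$ in the postschooling period (i.e., $a \ge 6+s$):

$$h(a) = C_h(z_h, w, r)\Bigl[\frac{\gamma_1}{r+\delta_h}\,e^{-\delta_h(a-s-6)}\,m(6+s)^{\frac{1}{1-\gamma}} + \frac{e^{-\delta_h(a-R)}}{\delta_h}\int_{e^{\delta_h(6+s-R)}}^{e^{\delta_h(a-R)}}\Bigl(1 - x^{\frac{r+\delta_h}{\delta_h}}\Bigr)^{\frac{\gamma}{1-\gamma}}dx\Bigr], \qquad a \in [6+s, R). \tag{28}$$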
(d) Stock of human capital at age 6, hE : (29)
hE = υ υ hB
γ (1−γ2 )
γ1 1
γ γ
γ
γ2 1 2 zh1 w (1−γ1 )(1−γ2 ) (r + δh)(1−γ2 )
× e−υ(r+δh(1−γ1 ))s m(6 + s)
υ(1−γ2 ) 1−γ
υ 1−γ
.
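(e) Supply of human capital to the market by an individual of age $a$ (for $a \ge 6+s$):

$$h(a)\bigl(1-n(a)\bigr) = C_h(z_h, w, r)\Bigl[\frac{\gamma_1}{r+\delta_h}\,e^{-\delta_h(a-6-s)}\,m(6+s)^{\frac{1}{1-\gamma}} - \frac{\gamma_1}{r+\delta_h}\,m(a)^{\frac{1}{1-\gamma}} + \frac{e^{-\delta_h(a-R)}}{\delta_h}\int_{e^{\delta_h(6+s-R)}}^{e^{\delta_h(a-R)}}\Bigl(1 - x^{\frac{r+\delta_h}{\delta_h}}\Bigr)^{\frac{\gamma}{1-\gamma}}dx\Bigr], \tag{30}$$

where

$$C_h(z_h, w, r) = \left(\frac{\gamma_1^{\gamma_1}\,\gamma_2^{\gamma_2}\,z_h\,w^{\gamma_2}}{(r+\delta_h)^{\gamma}}\right)^{\frac{1}{1-\gamma}}.$$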
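(f) Income during the working years satisfies

$$y(a) = C_h(z_h, w, r)\,w\Bigl[\frac{\gamma_1}{r+\delta_h}\,e^{-\delta_h(a-6-s)}\,m(6+s)^{\frac{1}{1-\gamma}} - \frac{\gamma_1+\gamma_2}{r+\delta_h}\,m(a)^{\frac{1}{1-\gamma}} + \frac{e^{-\delta_h(a-R)}}{\delta_h}\int_{e^{\delta_h(6+s-R)}}^{e^{\delta_h(a-R)}}\Bigl(1 - x^{\frac{r+\delta_h}{\delta_h}}\Bigr)^{\frac{\gamma}{1-\gamma}}dx\Bigr]$$

during the working life (i.e., $6+s < a < R$).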
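As a numerical sanity check on the closed-form conditions in (a)–(f), the Python sketch below evaluates (26), (27b), (28), and (30) for hypothetical parameter values. It assumes γ = γ1 + γ2 and m(a) = 1 − e^{−(r+δh)(R−a)}: with (25a) at equality and the standard Ben-Porath costate equation, the post-schooling costate solves q̇ = (r + δh)q − w with q(R) = 0, giving q(a) = w m(a)/(r + δh). Because m(a) is not defined in this excerpt, treat that as an assumption. The sketch checks that n(a)h(a) = C_h γ1 m(a)^{1/(1−γ)}/(r + δh), that n(6 + s) = 1, that (30) equals h(a)(1 − n(a)), and that the law of motion (23) holds along the solution up to finite-difference error. The trapezoid rule, the parameter values, and all helper names are illustrative choices, not the authors' code.

```python
import math

# Hypothetical parameter values, for illustration only.
G1, G2 = 0.5, 0.3          # gamma_1, gamma_2
G = G1 + G2                # gamma, assumed to equal gamma_1 + gamma_2
RATE, DELTA = 0.04, 0.02   # r and delta_h
W, ZH = 1.0, 0.5           # wage w and productivity z_h
RET, S = 65.0, 12.0        # retirement age R and years of schooling s
P = 1.0 / (1.0 - G)        # the recurring exponent 1/(1 - gamma)

# C_h(z_h, w, r) as defined after (30).
CH = (G1 ** G1 * G2 ** G2 * ZH * W ** G2 / (RATE + DELTA) ** G) ** P


def m(a):
    """Assumed m(a) = 1 - exp(-(r + delta_h)(R - a)), i.e. q(a) = w m(a)/(r + delta_h)."""
    return 1.0 - math.exp(-(RATE + DELTA) * (RET - a))


def integrand(v):
    return (1.0 - v ** ((RATE + DELTA) / DELTA)) ** (G / (1.0 - G))


def integral(a, steps=2000):
    """Trapezoid-rule value of the integral shared by (26), (28), and (30)."""
    lo = math.exp(DELTA * (6 + S - RET))
    hi = math.exp(DELTA * (a - RET))
    dv = (hi - lo) / steps
    inner = sum(integrand(lo + i * dv) for i in range(1, steps))
    return dv * (inner + 0.5 * (integrand(lo) + integrand(hi)))


def tail(a):
    """Integral term inside the brackets of (28) and (30)."""
    return math.exp(-DELTA * (a - RET)) / DELTA * integral(a)


def h(a):
    """Post-schooling human capital, equation (28)."""
    school = G1 / (RATE + DELTA) * math.exp(-DELTA * (a - S - 6)) * m(6 + S) ** P
    return CH * (school + tail(a))


def n(a):
    """Share of time spent accumulating human capital, equation (26)."""
    denom = (math.exp(-DELTA * (a - S - 6)) * m(6 + S) ** P
             + (RATE + DELTA) / G1 * tail(a))
    return m(a) ** P / denom


def x(a):
    """Market goods invested after schooling, equation (27b)."""
    return G2 * W / (RATE + DELTA) * CH * m(a) ** P


def supply(a):
    """Human capital supplied to the market, equation (30)."""
    gap = math.exp(-DELTA * (a - 6 - S)) * m(6 + S) ** P - m(a) ** P
    return CH * (G1 / (RATE + DELTA) * gap + tail(a))


def law_of_motion_gap(a, eps=1e-3):
    """Relative gap between a central-difference h'(a) and the right side of (23)."""
    dh = (h(a + eps) - h(a - eps)) / (2.0 * eps)
    rhs = ZH * (n(a) * h(a)) ** G1 * x(a) ** G2 - DELTA * h(a)
    return abs(dh - rhs) / abs(rhs)


if __name__ == "__main__":
    for age in (20.0, 40.0, 60.0):
        print(f"a={age:.0f}  n={n(age):.4f}  h={h(age):.2f}  "
              f"gap(23)={law_of_motion_gap(age):.1e}")
```

Because the same trapezoid value enters h(a) and its finite difference, the residual in (23) stays small even with a modest grid; a tiny residual suggests the reconstructed formulas are mutually consistent under the stated assumptions.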
DEPARTMENT OF ECONOMICS, WASHINGTON UNIVERSITY IN ST. LOUIS
DEPARTMENT OF ECONOMICS, UNIVERSITY OF WISCONSIN–MADISON
REFERENCES
Acemoglu, D., and J. Robinson, “Disease and Development: The Effect of Life Expectancy on Economic Growth,” NBER Working Paper No. 12269, 2006.
Attanasio, O. P., and A. Vissing-Jorgensen, “Stock Market Participation, Intertemporal Substitution and Risk Aversion,” American Economic Review, 93 (2003), 383–391.
Attanasio, O. P., and G. Weber, “Intertemporal Substitution, Risk Aversion and the Euler Equation for Consumption,” Economic Journal, 99 (1989), 59–73.
Bansal, R., and A. Yaron, “Risks for the Long Run: A Potential Resolution of Asset Pricing Puzzles,” Journal of Finance, 59 (2004), 1481–1509.
Barro, R. J., “Rare Disasters, Asset Prices and Welfare Costs,” NBER Working Paper No. 13690, 2007.
Barro, Robert, and Gary S. Becker, “Fertility Choice in a Model of Economic Growth,” Econometrica, 57 (1989), 481–501.
Barro, Robert J., and Jong-Wha Lee, “International Data on Educational Attainment: Updates and Implications,” CID Working Paper No. 42, 2000.
Barro, R. J., and C. Sahasakul, “Average Marginal Tax Rates from Social Security and the Individual Income Tax,” Journal of Business, 59 (1986), 555–566.
Becker, Gary S., Kevin M. Murphy, and Robert Tamura, “Human Capital, Fertility, and Economic Growth,” Journal of Political Economy, 98 (1990), S12–S37.
Behrman, J., S. Duryea, and M. Székely, “Decomposing Fertility Differences across World Regions and through Time,” OCE Working Paper Series No. 406, Research Department, Inter-American Development Bank, 1999.
Ben-Porath, Y., “The Production of Human Capital and the Life Cycle of Earnings,” Journal of Political Economy, 75 (1967), 352–365.
Birchenall, Javier, “Escaping High Mortality,” Journal of Economic Growth, 12 (2007), 351–387.
Boldrin, Michele, Mariacristina De Nardi, and Larry Jones, “Fertility and Social Security,” NBER Working Paper No. 11146, 2005.
Burtless, G., “Increasing the Retirement Age for Social Security Pensions,” Testimony to the Senate Special Committee on Aging (Washington, DC: The Brookings Institution, 1998).
Campbell, John Y., “Asset Prices, Consumption, and the Business Cycle,” in Handbook of Macroeconomics Vol. 1, John B. Taylor and Michael Woodford, eds., 1231–1304 (Amsterdam: Elsevier, 1999).
Caselli, F., and J. Feyrer, “The Marginal Product of Capital,” Quarterly Journal of Economics, 122 (2007), 535–568.
Chakraborty, S., “Endogenous Lifetime and Economic Growth,” Journal of Economic Theory, 116 (2004), 119–137.
Cigno, A., “Fertility Decisions When Infant Survival Is Endogenous,” Journal of Population Economics, 11 (1998), 21–28.
Davis, Stephen J., and Magnus Henrekson, “Tax Effects on Work Activity, Industry Mix and Shadow Economy Size: Evidence From Rich-Country Comparisons,” University of Chicago Graduate School of Business, mimeo, 2004.
Doepke, M., “Accounting for Fertility Decline during the Transition to Growth,” Journal of Economic Growth, 9 (2004), 347–383.
Fernandez-Villaverde, Jesus, “Was Malthus Right? Economic Growth and Population Dynamics,” Department of Economics, University of Pennsylvania, mimeo, 2001.
Gale, William G., and John Karl Scholz, “Inter-generational Transfers and the Accumulation of Wealth,” Journal of Economic Perspectives, 8 (1994), 145–160.
Galor, Oded, “From Stagnation to Growth: Unified Growth Theory,” Handbook of Economic Growth, P. Aghion and S. Durlauf, eds. (Amsterdam: North Holland, 2005).
Galor, Oded, and Omer Moav, “Natural Selection and the Origin of Economic Growth,” Quarterly Journal of Economics, 117 (2002), 1133–1191.
Gruber, J., “A Tax-Based Estimate of the Elasticity of Intertemporal Substitution,” NBER Working Paper No. 11945, 2006.
Hall, R. E., “Intertemporal Substitution in Consumption,” Journal of Political Economy, 96 (1988), 339–357.
Hansen, Lars, and Kenneth Singleton, “Generalized Instrumental Variables Estimation of Nonlinear Rational Expectations Models,” Econometrica, 50 (1982), 1269–1286.
Heckman, J., L. Lochner, and C. Taber, “Explaining Rising Wage Inequality: Explorations with a Dynamic General Equilibrium Model of Labor Earnings with Heterogeneous Agents,” Review of Economic Dynamics, 1 (1998), 1–58.
Hsieh, C.-T., and P. J. Klenow, “Relative Prices and Relative Prosperity,” American Economic Review, 97 (2007), 562–585.
Kalemli-Ozcan, Sebnem, “Does Mortality Decline Promote Economic Growth?” Journal of Economic Growth, 7 (2002), 411–439.
Kuruşçu, Burhanettin, “Training and Lifetime Income,” American Economic Review, 96 (2006), 832–846.
Lucas, Robert E., “The Industrial Revolution: Past and Future,” in Lectures on Economic Growth, Robert E. Lucas, ed. (Cambridge, MA: Harvard University Press, 2002).
Manuelli, R., and A. Seshadri, “Neoclassical Miracles,” University of Wisconsin–Madison, mimeo, 2007.
Manuelli, R., and A. Seshadri, “Human Capital and the Wealth of Nations,” University of Wisconsin–Madison, mimeo, 2008.
Mincer, J., Schooling, Experience, and Earnings (New York: Columbia University Press, 1974).
OECD, Education at a Glance: OECD Indicators (OECD, 2003).
Prescott, Edward, “Why Do Americans Work So Much More Than Europeans?” Federal Reserve Bank of Minneapolis Staff Report No. 321, 2003.
Soares, Rodrigo, “Mortality Reductions, Educational Attainment, and Fertility Choice,” American Economic Review, 95 (2005), 580–601.
World Health Statistics 2005 (Geneva: WHO Press).
GENETIC VARIATION IN PREFERENCES FOR GIVING AND RISK TAKING∗
DAVID CESARINI
CHRISTOPHER T. DAWES
MAGNUS JOHANNESSON
PAUL LICHTENSTEIN
BJÖRN WALLACE

In this paper, we use the classical twin design to provide estimates of genetic and environmental influences on experimentally elicited preferences for risk and giving. Using standard methods from behavior genetics, we find strong prima facie evidence that these preferences are broadly heritable, and our estimates suggest that genetic differences explain approximately twenty percent of individual variation. The results thus shed light on an important source of individual variation in preferences, a source that has hitherto been largely neglected in the economics literature.
∗ This paper has benefited from discussions with Coren Apicella, Samuel Bowles, Terry Burnham, Bryan Caplan, Tore Ellingsen, James Fowler, Jon Gruber, Garett Jones, Moses Ndulu, Matthew Notowidigdo, Niels Rosenquist, Paul Schrimpf, Steven Pinker, Tino Sanandaji, Örjan Sandewall, Vernon Smith and Robert Östling. Thanks to Larry Katz and six anonymous referees for very helpful comments. We thank the Jan Wallander and Tom Hedelius Foundation, the Swedish Research Council, and the Swedish Council for Working Life and Social Research for financial support. The Swedish Twin Registry is supported by grants from the Swedish Research Council, the Ministry for Higher Education, and Astra Zeneca. Rozita Broumandi and Camilla Björk at the Swedish Twin Registry responded to a number of queries, for which we are grateful. Patrik Ivert, Niklas Kaunitz, and Benjamin Katzeff provided excellent research assistance. We are grateful to a number of colleagues, especially Håkan Jerker Holm and Fredrik Carlsson, for help with subject recruitment outside Stockholm. The paper was completed while David Cesarini was visiting the Research Institute of Industrial Economics in Stockholm; he gratefully acknowledges their hospitality.
© 2009 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, May 2009

I. INTRODUCTION
Writing in 1875, the prolific Francis Galton concluded the first scientific inquiry into the behavior of twins by remarking that “There is no escape from the conclusion that nature prevails enormously over nurture” (Galton 1875, p. 576). In fact, Galton was so taken with his results that he continued, “My only fear is that my evidence seems to prove too much and may be discredited on that account, as it seems contrary to all experience that nurture should go for so little.” Although his methodology would be considered dubious, if not flawed, by modern standards, Galton’s work laid the conceptual basis for behavior genetics (Bouchard and Propping 1993; Plomin et al. 2001a), the study of genetic and
environmental influences on variation in human behavior. Today ample evidence for the importance of genetic influences (“nature”) on variation in human behavioral traits has accumulated. However, the debate about the rather nebulous concepts of “nature” and “nurture” still rages.

In economics, there is a small but growing research field using behavior genetic techniques. The seminal paper is due to Taubman (1976), who employed the twin design to estimate the heritability of earnings for U.S. males. Later papers in this tradition, based on either twins or adoptees, include Behrman and Taubman (1989), Sacerdote (2002, 2007), Plug and Vijverberg (2003), Björklund, Lindahl, and Plug (2006), and Björklund, Jäntti, and Solon (2007). In short, these studies find that both “nature” and “nurture” are important determinants of life outcomes, and they uniformly corroborate the importance of genetic influences on educational attainment and earnings.1 Some recent work in economics also focuses on the intergenerational transmission of preferences. Cipriani, Giuliani, and Jeanne (2007) report mother–son correlations for contributions in a standard public goods game and find no significant associations, interpreting this as evidence that peer effects influence contributions. Dohmen et al. (2006), on the other hand, use survey evidence on attitudinal questions and find modest intergenerational correlations in self-reported trust and risk attitudes. Naturally, these papers suffer from the limitation that it is impossible to separate genetic transmission (parents passing on genes for a certain trait to their biological children) from cultural transmission. In this paper, we move beyond the computation of intergenerational correlations and offer a direct test of the hypothesis that economic preferences are under genetic influence. We elicit preferences experimentally with a subject pool of twins recruited from the population-based Swedish Twin Registry.
The virtue of this approach is that by comparing monozygotic (MZ) twins, who share the same set of genes, to dizygotic (DZ) twins, whose genes are imperfectly correlated, we can estimate the proportion of variance in experimental behavior due to genetic effects and to shared and unique environmental effects. The measures of economic preferences that we use are based on behavior actually observed in experiments conducted under controlled circumstances, with financial incentives attached to performance. For risk taking, we also present some supplementary survey-based evidence derived from hypothetical questions that have been behaviorally validated (Dohmen et al. 2005, 2006).

1. For an extensive collection of essays on the intergenerational transmission of economic opportunity, see the volume edited by Bowles, Gintis, and Osborne Groves (2005).

This paper is the first to use the twin methodology to study (i) experimentally elicited risk preferences and (ii) giving behavior in a dictator game. Outside economics, two papers have used the twin methodology to shed light on individual variation in the ultimatum game (Wallace et al. 2007) and the trust game (Cesarini et al. 2008). Two earlier papers used twins as a subject pool (Loh and Elliott 1998; Segal and Hershberger 1999), but the experiments therein were designed to test whether cooperation varied by genetic relatedness, as predicted by inclusive fitness theory (Hamilton 1964). Twins therefore played against their co-twins, and consequently it is not possible to estimate heritability from these studies.

We find strong evidence that preferences for risk taking and giving are broadly heritable. Our point estimates from the best-fitting models suggest that approximately twenty percent of individual variation can be explained by genetic differences. Furthermore, our results suggest only a modest role for common environment as a source of variation. We argue that the significance of these results extends well beyond documenting an important, but hitherto largely ignored, source of preference heterogeneity. For example, although it is widely accepted that parent–offspring correlations in isolation cannot be used to discriminate between theories of genetic and cultural transmission, much economic research is carried out under the presumption that genetic transmission is small enough to be safely ignored. Such an assumption is not consistent with our findings.
Importantly, the estimates we report are in line with the behavior genetics literature, where survey-based studies have documented substantial genetic influences on variation in economically relevant abilities, preferences, and behaviors such as intelligence (Bouchard et al. 1990), personality (Jang, Livesley, and Vernon 1996), addiction (True et al. 1997), prosociality (Rushton et al. 1986; Rushton 2004), sensation seeking (Stoel, De Geus, and Boomsma 2006), religiosity (Bouchard et al. 1999; Kirk et al. 1999; Koenig et al. 2005), political preferences (Alford, Funk, and Hibbing 2005), and political participation (Fowler, Dawes, and Baker 2008). The remainder of this paper is structured as follows:
in Sections II and III, we describe the data collection, the experiments, and the method in detail; in Section IV, we report the results; and in Section V, we discuss our findings. Section VI concludes.

II. DATA COLLECTION
II.A. Subject Recruitment
The study was undertaken in collaboration with the Swedish Twin Registry at Karolinska Institutet.2 The registry, which is the largest twin registry in the world, has been described in detail elsewhere (Lichtenstein et al. 2006). All of our invitees were same-sex twin pairs that had previously participated in the Web-based survey STAGE, an acronym for “the Study of Twin Adults: Genes and Environment.” This survey was administered between November 2005 and March 2006 to all twins born in Sweden between 1959 and 1985, and it attained a response rate of 61%. Its primary purpose was to study environmental and genetic influences on a number of diseases (Lichtenstein et al. 2006), but it also contains self-reported data on marital, employment, and fertility status, as well as information on how often the twins are in contact. To allow further examination of how our recruitment methods affect the representativeness of our sample, we also merged the STAGE cohort with a specially requested data set of socioeconomic and demographic variables compiled by Statistics Sweden.

In a first recruitment effort, during the summer and fall of 2006, a total of 658 twins (71 DZ and 258 MZ pairs) took part in sessions held in the Swedish cities of Stockholm, Gothenburg, Uppsala, Malmö, Lund, Linköping, Norrköping, Helsingborg, Örebro, Västerås, and Kristianstad. Because of the relatively small sample of DZ twins, a second round of data collection took place in February 2008. Both MZ and DZ twins were invited to participate, but DZ twins were pursued somewhat more vigorously, with personalized invitations and reminders sent to those who did not respond.
This recruitment effort was successful in augmenting the sample of DZ twins, and the complete data set comprises 920 twins: 141 DZ pairs and 319 MZ pairs. A large majority of subjects, approximately 80%, are female. For the second data collection round, twins were recruited in the cities of Stockholm, Gothenburg, Uppsala, Malmö, Lund, Jönköping, Borlänge, Helsingborg, Örebro, Växjö, Västerås, and Umeå. In all of the experimental sessions, a condition for participation was that both twins in a pair be able to attend the same session. Moreover, invitations were extended only to twins who both lived in the same city or its surrounding area. Zygosity was determined by questionnaire items that have been shown to have a reliability of 95% to 98% (Lichtenstein et al. 2006).

2. The study and subject recruitment were approved by the Ethics Committee for Medical Research in Stockholm.

II.B. Experimental Procedures
When subjects arrived at an experimental session, they were seated apart and given general instructions orally. They were asked not to talk to one another during the experiment and to alert the experimenter if they had any questions (questions were rare and were answered in private). Subjects were also told about the strong norm against deception in experimental economics. After filling out a form with information for the administration of payments, subjects were given instructions for the first experiment (the modified dictator game; see below). There were no time constraints; once all participants had finished making their decisions, the next set of instructions was handed out. Subjects participated in a total of five different experiments. The experiment phase was followed by a short questionnaire with survey questions, a personality test, and a test of cognitive ability. On average, experimental sessions lasted a little more than an hour, and average earnings were SEK 325 (at an exchange rate of about SEK 6 to $1).

II.C. Giving
We used a modified dictator game to measure preferences for giving (“altruism”).3 In a standard dictator game (Forsythe et al. 1994), a subject decides how to split a sum of money between herself and another person (see Camerer [2003] for an overview of dictator game results). In a variant of this approach, first used by Eckel and Grossman (1996), the subject decides how to allocate a sum of money between herself and a charity.
As donations to charity may be more strongly related to empathy and altruism than donations in the standard dictator game, we opted for this approach. Fong (2007) has shown that empathy is a more important motivation for dictator game giving when recipients are perceived to be in great need (in that study, welfare recipients). In the present study, subjects decided how to allocate SEK 100 (about $15) between themselves and a charity called “Stadsmissionen.” Stadsmissionen’s work is predominantly focused on helping the homeless in Sweden. All subjects responded to the dictator game question and are included in the analysis below (319 MZ pairs and 141 DZ pairs).

3. Independently, Bardsley (2007) and List (2007) have shown that augmenting the choice set of the dictator to allow him or her to take money from the partner dramatically reduces generosity. This suggests that people’s behavior in the standard dictator game is sensitive to cues about social norms in experimental settings. Regardless of one’s favored interpretation of giving in dictator games, we will provide evidence suggesting that such giving is heritable.

II.D. Risk Taking
To measure risk aversion, subjects were presented with six choices, each between a certain payoff and a 50/50 gamble for SEK 100 (about $15). The certain payoffs were set to SEK 20, 30, 40, 50, 60, or 80. After subjects had made their six choices, one of these was randomly chosen for payment by rolling a die. The gamble was resolved with a coin toss in front of the participants. The pattern of choices places the certainty equivalent of the gamble in one of seven intervals. A similar question has been used by Holt and Laury (2002). Nineteen subjects provided inconsistent responses (2% of the total sample), and these were dropped (leaving 307 MZ pairs and 135 DZ pairs for the analysis).4 We refer to this measure as risk aversion; it is our primary measure of risk preferences.

We supplement this first measure of risk preferences with two hypothetical questions designed to measure risk attitudes. The first question, which we denote risk investment, asks the subjects to assume that they have won SEK 1 million in a lottery and that they are then given the opportunity to invest some of this money in a risky asset with an equal probability of doubling the investment or losing half of it.
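The incentivized risk-aversion measure, in which six binary choices place the gamble's certainty equivalent in one of seven intervals, can be sketched in Python. The sketch assumes that the 50/50 gamble pays SEK 100 or nothing (so the outer intervals are bounded by 0 and 100), lists the six choices in increasing order of the certain payoff, and treats any return to the gamble after taking a smaller certain payoff as the inconsistent pattern described in footnote 4. The function names are hypothetical; the second helper simply computes the final wealth implied by the hypothetical risk-investment question.

```python
from typing import Optional, Sequence, Tuple

# Certain payoffs (SEK) offered against the gamble, in increasing order.
SURE_AMOUNTS = (20, 30, 40, 50, 60, 80)
# Assumed gamble: SEK 100 with probability 1/2, SEK 0 otherwise.
GAMBLE_LOW, GAMBLE_HIGH = 0, 100


def ce_interval(chose_gamble: Sequence[bool]) -> Optional[Tuple[int, int]]:
    """Map six choices (True = gamble) to a certainty-equivalent interval.

    Returns None for an inconsistent pattern, i.e. taking a sure amount and
    then preferring the gamble to a larger sure amount.
    """
    if len(chose_gamble) != len(SURE_AMOUNTS):
        raise ValueError("expected one choice per sure amount")
    k = sum(chose_gamble)  # number of times the gamble was chosen
    # A consistent subject switches at most once: gamble first, then sure.
    if list(chose_gamble) != [True] * k + [False] * (len(SURE_AMOUNTS) - k):
        return None
    bounds = (GAMBLE_LOW,) + SURE_AMOUNTS + (GAMBLE_HIGH,)
    return bounds[k], bounds[k + 1]


def investment_outcomes(invested: int, endowment: int = 1_000_000):
    """Final wealth (loss, gain, expected) for the risk-investment question."""
    low = endowment - invested / 2   # the investment loses half its value
    high = endowment + invested      # the investment doubles
    return low, high, (low + high) / 2


if __name__ == "__main__":
    print(ce_interval([True, True, True, False, False, False]))  # (40, 50)
    print(investment_outcomes(400_000))
```

Since a consistent subject switches at most once, exactly seven intervals are possible, matching the text. In the investment question each krona invested has an expected value of 0.5 × 2 + 0.5 × 0.5 = 1.25 kronor, so a risk-neutral subject would invest the full SEK 1 million.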
Subjects can then choose between six different levels of investment: SEK 0, 200,000, 400,000, 600,000, 800,000, or 1 million. This question is similar to the question with real monetary payoffs but involves much larger (although hypothetical) stakes.

4. An inconsistent response is one in which the certainty equivalent is not uniquely defined; for example, an individual who chooses SEK 20 rather than the gamble in the first question and then chooses the gamble rather than SEK 30 in the second question. Such behavior is a strong indication that the subject either has misunderstood the question or has failed to take it seriously.

The second question,
risk assessment, measures general risk attitudes on a 0–10 scale, where 0 is complete unwillingness to take risks and 10 is complete willingness to take risks. This scale question measures general rather than monetary risk attitudes. Dohmen et al. (2005) showed that all three measures of risk attitudes are significantly related to each other, and established the behavioral validity of the two hypothetical questions with respect to real risk taking.

III. TWIN METHODOLOGY
Comparing the behavior of identical and nonidentical twins is a form of quasi-controlled experiment: MZ and DZ twins differ in their genetic relatedness. If a trait is heritable, then it must be the case that the correlation in MZ twins is higher than the correlation in DZ twins. We start by examining the MZ and DZ correlations, which serves two purposes. A number of authors (Loehlin 1965; Goldberger 1977, 1979) have noted that moving from a crude comparison of correlations to a full-fledged variance decomposition requires some strong independence and functional form assumptions. The first purpose is therefore to examine whether a significant difference in correlations exists, which serves as a diagnostic of whether the traits in question are under genetic influence. Second, as explained below, the workhorse models in behavior genetics imply certain restrictions on the MZ and DZ correlations. Correlations that fall significantly outside the space of permissible correlations indicate model misspecification, so the raw correlations can be used to test for such misspecification. To explain why, it is necessary to introduce some basic concepts from behavior genetics (see Chapter 3 in Neale and Maes [2004]). By phenotype, we simply mean the observed outcome variable. The location of a gene on a chromosome is known as a locus. Alleles are the alternative forms of a gene that may occupy the same locus on a chromosome.
Finally, the genotype of an individual consists of the alleles he or she has at a locus. Suppose that the phenotype of twin $j \in \{1, 2\}$ in family $i$ can be written as the sum of four independent influences,

$$\chi_{ij} = C_{ij} + E_{ij} + A_{ij} + D_{ij}, \tag{1}$$

where $C_{ij}$ is the common environmental factor, $E_{ij}$ is the individually experienced unique environment factor, $A_{ij}$ is an additive
genetic factor, and $D_{ij}$ is a dominance factor. Common environmental influences are defined as those influences shared by both twins, for example the home environment, so that $C_{i1} = C_{i2}$. Unique environmental influences, by contrast, are defined as environmental experiences idiosyncratic to each twin. Behavior geneticists distinguish between additive genetic effects and dominance effects. For an intuitive illustration of the difference, consider the simple case where there are two possible alleles, $a_1$ and $a_2$, so that each individual, getting one allele from each parent, has genotype $(a_1, a_1)$, $(a_1, a_2)$, or $(a_2, a_2)$. Dominance is then present whenever the effect of having genotype $(a_1, a_2)$ is not equal to the mean effect of genotypes $(a_1, a_1)$ and $(a_2, a_2)$. In other words, dominance can be thought of as an interaction effect. Because the influences are assumed to be independent, the model predicts that the covariance in MZ twins is equal to

$$\mathrm{COV}_{MZ} = \sigma_A^2 + \sigma_D^2 + \sigma_C^2, \tag{2}$$

because identical twins share the same genes and were reared together. The phenotypic covariance between DZ twins is derived in Mather and Jinks (1977) as

$$\mathrm{COV}_{DZ} = \tfrac{1}{2}\sigma_A^2 + \tfrac{1}{4}\sigma_D^2 + \sigma_C^2. \tag{3}$$
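Equations (2) and (3) translate directly into moment conditions. The Python sketch below is a minimal illustration, not the authors' estimator: it simulates twin pairs from model (1) with hypothetical variance components and σ_D² = 0, splitting the DZ additive factor into a shared half and a twin-specific half (the same split that appears in equation (6) later in this section). It then computes Pearson correlations, runs the pair-resampling bootstrap of ρ_MZ − ρ_DZ described in this section (Pearson only; the nonparametric version is omitted), and solves (2) and (3) with unit total variance for the classical Falconer shares a² = 2(ρ_MZ − ρ_DZ), c² = 2ρ_DZ − ρ_MZ, and e² = 1 − ρ_MZ. The sample sizes mirror the 319 MZ and 141 DZ pairs; the variance values, seed, and function names are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]


def simulate_pairs(n: int, mz: bool, var_a: float, var_c: float,
                   var_e: float, rng: random.Random) -> List[Pair]:
    """Draw n twin pairs from model (1) with sigma_D^2 = 0 (illustrative)."""
    pairs = []
    for _ in range(n):
        c = rng.gauss(0.0, math.sqrt(var_c))            # shared: C_i1 = C_i2
        if mz:
            g1 = g2 = rng.gauss(0.0, math.sqrt(var_a))  # identical genes
        else:
            # Half the additive variance is shared, half twin-specific.
            shared = rng.gauss(0.0, math.sqrt(var_a / 2))
            g1 = shared + rng.gauss(0.0, math.sqrt(var_a / 2))
            g2 = shared + rng.gauss(0.0, math.sqrt(var_a / 2))
        e1 = rng.gauss(0.0, math.sqrt(var_e))           # unique environment
        e2 = rng.gauss(0.0, math.sqrt(var_e))
        pairs.append((g1 + c + e1, g2 + c + e2))
    return pairs


def pearson(pairs: Sequence[Pair]) -> float:
    """Pearson correlation between twin 1 and twin 2 across pairs."""
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_p_value(mz: Sequence[Pair], dz: Sequence[Pair], draws: int,
                      rng: random.Random) -> float:
    """One-sided p-value: share of draws with rho_MZ - rho_DZ < 0."""
    negative = 0
    for _ in range(draws):
        mz_star = [rng.choice(mz) for _ in mz]  # resample complete pairs
        dz_star = [rng.choice(dz) for _ in dz]
        if pearson(mz_star) - pearson(dz_star) < 0:
            negative += 1
    return negative / draws


def falconer_shares(rho_mz: float, rho_dz: float) -> Tuple[float, float, float]:
    """Solve (2)-(3) with sigma_D^2 = 0 and unit variance for (a2, c2, e2)."""
    a2 = 2 * (rho_mz - rho_dz)
    c2 = 2 * rho_dz - rho_mz
    return a2, c2, 1 - rho_mz


if __name__ == "__main__":
    rng = random.Random(2009)
    # Hypothetical variance components summing to one.
    mz = simulate_pairs(319, True, 0.2, 0.05, 0.75, rng)
    dz = simulate_pairs(141, False, 0.2, 0.05, 0.75, rng)
    r_mz, r_dz = pearson(mz), pearson(dz)
    print(f"rho_MZ={r_mz:.3f}  rho_DZ={r_dz:.3f}")
    print("bootstrap p =", bootstrap_p_value(mz, dz, 1000, rng))
    print("DZ at least half of MZ:", r_dz >= r_mz / 2)
    print("(a2, c2, e2) =", falconer_shares(r_mz, r_dz))
```

Resampling whole pairs keeps the within-pair dependence intact, which is what makes the bootstrap distribution of the correlation difference meaningful. With samples of this size, the Falconer shares are noisy; the sketch is meant to show the mechanics, not to suggest precision.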
The coefficients of genetic relatedness for DZ twins in equation (3) thus imply that DZ twins share half the additive genetic effects and a quarter of the dominance effects. Notice that the parameters of this model cannot be identified with twin data alone, because there is one equation fewer than the number of parameters to be estimated: the phenotypic variance, $\sigma_A^2 + \sigma_D^2 + \sigma_C^2 + \sigma_E^2$, together with (2) and (3) provides three moments for four variance components. This ambiguity is typically resolved in twin research by assuming that all gene action is additive, so that $\sigma_D^2 = 0$. Behavior geneticists distinguish between broad heritability, defined as $(\sigma_A^2 + \sigma_D^2)/(\sigma_A^2 + \sigma_D^2 + \sigma_C^2 + \sigma_E^2)$, and narrow heritability, defined simply as $\sigma_A^2/(\sigma_A^2 + \sigma_D^2 + \sigma_C^2 + \sigma_E^2)$. The identifying restriction that $\sigma_D^2$ equals zero can be tested by examining whether $\rho_{DZ}$ is at least half of $\rho_{MZ}$, and the greatest difference in correlation allowed by the model arises when $\sigma_C^2 = 0$ and $\sigma_A^2 = 0$, in which case $\rho_{MZ}$ is four times as large as $\rho_{DZ}$.

In our empirical analysis, we start by comparing the correlations of MZ and DZ twins using the bootstrap. Letting $N_{MZ}$ be the number of complete MZ pairs, we draw $N_{MZ}$ pairs with replacement 1,000 times and calculate both parametric and
GENETIC VARIATION IN PREFERENCES
817
nonparametric correlation each time. We proceed analogously for DZ twins and then create a 1,000×1 vector where the DZ correlation is subtracted from the MZ correlation for each draw. This gives a distribution for the difference in correlations between the two samples. The p-value for the test of the hypothesis that the two correlations are equal is then the number of negative entries in the vector divided by 1,000. The use of a one-sided test is theoretically justified in our case because the notion that the DZ correlation could be greater than the MZ correlation is not a particularly interesting alternative hypothesis. We also use the same bootstrap technique to test the hypothesis that the DZ correlation is at least half as large as the MZ correlation. The result of the latter exercise will inform our choice of identifying restrictions. For our two main outcome variables, we estimate mixedeffects Bayesian ACE models.5 We report results treating outcome variables as continuous as well as ordinal. Using the same notation as previously, the model is written as (4)
yi∗j = χi j ,
where χi j is the sum of genetic, shared environment, and unshared environment random effects. For MZ twins the latent variable is the sum of three random effects, (5)
χiMZ j = Ai + Ci + Ei j ,
where Ai is the family genetic factor, Ci is the family-shared environment factor, and Ei j is the individually experienced unshared environment factor. For DZ twins the latent variable is a function of four random effects variables, (6)
χiDZ j = A1i + A2i j + Ci + Ei j ,
where A1i is the family genetic factor shared by both twins, A2i j is the individually inherited genetic factor that is unique to each twin, and Ci and Ei j are the same as for MZ twins. In the continuous models, we take the outcome variables in the experiment to 5. Researchers have increasingly used Bayesian methods, implemented using Markov chain Monte Carlo (MCMC) algorithms, to estimate the variance components in ACE models. The likelihood functions in genetic models often present computational challenges for maximum likelihood approaches because they contain high-dimension integrals that cannot be evaluated in closed form and thus must be evaluated numerically. For a detailed discussion of Bayesian ACE models, we refer to van den Berg, Beem, and Boomsma (2006).
818
QUARTERLY JOURNAL OF ECONOMICS
be yi∗j . In the ordered models, the outcome variables are instead modeled under the assumption that yi∗j is not directly observed. Instead, the observed variable yi j is assumed to be one of k + 1 ordered categories separated by k thresholds that are estimated as part of the model. The three risk measures naturally fall into categories, and hence these categories are used in the analysis. A visual inspection of Figure I shows that the distribution of dictator game responses is roughly trimodal, with peaks at the three focal points: donating the entire endowment, donating half the endowment, or keeping the entire endowment. Approximately 80% of responses are in one of those three categories. Consequently we construct an ordinal variable where individuals who donate between 0 and 33 are coded as 0, individuals who donate between 33 and 66 are coded as 1, and individuals who donate more than 66 are coded as 2. We use the variances of the random effects to generate estimates of heritability, common environment, and unique environment. Because the underlying components are not constrained, the estimated proportions can range anywhere from 0 (the component has no effect on variance) to 1 (the component is solely responsible for all observed variance). Replicating the methods used in this literature, we assume that our unobserved random effects are normally distributed and independent: (7) A ∼ N 0, σ A2 , A1 ∼ N 0, σ A2 /2 , (8) A2 ∼ N 0, σ A2 /2 , (9) C ∼ N 0, σC2 , (10) E ∼ N 0, σ E2 . (11) The variance of A1 , the family genetic effect for DZ twins, is fixed to be half the variance of A, the family genetic effect for MZ twins, reflecting the fact that MZ twins on average share twice as many genes as DZ twins. Moreover, DZ twins are also influenced by individually specific genes A2 that are drawn from the same distribution as the shared genes, because on average half their genes are shared and half are not. 
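The random-effects structure in equations (5) through (11) can be checked by simulation. The Python sketch below (function name and parameter values are hypothetical; it simulates the continuous latent variable only and performs no estimation) draws MZ and DZ pairs. Splitting the DZ genetic effect into a shared $A_1$ and a twin-specific $A_2$, each with variance $\sigma_A^2/2$, reproduces a DZ covariance of $\sigma_A^2/2 + \sigma_C^2$ while keeping the total variance equal to that of MZ twins.

```python
import numpy as np


def simulate_latent(n_pairs: int, var_a: float, var_c: float, var_e: float,
                    zygosity: str, rng: np.random.Generator) -> np.ndarray:
    """Draw chi_ij for twins j = 1, 2 in n_pairs families under equations (5)-(11)."""
    c = rng.normal(0.0, np.sqrt(var_c), size=n_pairs)        # C_i: common to both twins
    e = rng.normal(0.0, np.sqrt(var_e), size=(n_pairs, 2))   # E_ij: unique to each twin
    if zygosity == "MZ":
        a = rng.normal(0.0, np.sqrt(var_a), size=n_pairs)    # A_i: identical for both twins
        genetic = np.column_stack([a, a])
    elif zygosity == "DZ":
        a1 = rng.normal(0.0, np.sqrt(var_a / 2), size=n_pairs)       # A_1i: shared half
        a2 = rng.normal(0.0, np.sqrt(var_a / 2), size=(n_pairs, 2))  # A_2ij: twin-specific half
        genetic = a1[:, None] + a2
    else:
        raise ValueError("zygosity must be 'MZ' or 'DZ'")
    return genetic + c[:, None] + e


rng = np.random.default_rng(0)
var_a, var_c, var_e = 0.3, 0.1, 0.6  # illustrative values, not estimates from this study
mz = simulate_latent(200_000, var_a, var_c, var_e, "MZ", rng)
dz = simulate_latent(200_000, var_a, var_c, var_e, "DZ", rng)

# Expected within-pair covariance: var_a + var_c = 0.40 (MZ), var_a / 2 + var_c = 0.25 (DZ).
# Expected total variance: var_a + var_c + var_e = 1.0 in both groups.
print(np.cov(mz[:, 0], mz[:, 1])[0, 1], np.cov(dz[:, 0], dz[:, 1])[0, 1])
print(mz.var(), dz.var())
```

With 200,000 simulated pairs per group, the printed covariances should land close to 0.40 and 0.25, matching equations (2) and (3) with $\sigma_D^2 = 0$.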
These assumptions about the genetic variance help to distinguish shared genes from the shared environment variable $C$, which is assumed to have the same variance for both MZ and DZ twin families, and from the residual unique environment variable $E$, from which a unique draw is made for each individual. The contribution of a variance component is simply estimated as $\sigma_i^2/(\sigma_A^2 + \sigma_C^2 + \sigma_E^2)$, where $i \in \{A, C, E\}$.6
We estimate three types of models in addition to the ACE model. An AE model accounts only for heritability and unique environment, a CE model accounts only for common and unique environment, and an E model accounts only for unique environment. Procedurally, the difference between the ACE model and these submodels is that one or more variances are restricted to equal zero; for example, in the AE model the random effect for the common environment is not estimated. Estimating submodels allows us to test whether a parameter restriction results in a significant deterioration in fit. To compare the fit of the ACE, AE, CE, and E models we use the deviance information criterion (DIC), a Bayesian method for model comparison analogous to the Akaike information criterion (AIC) in maximum likelihood estimation. Models with smaller DIC are considered to have better out-of-sample predictive power (Gelman et al. 2004). The DIC is defined as the sum of the posterior mean deviance ($\bar{D}$), a measure of model fit, and the effective number of parameters ($p_D$), which captures model complexity.7
In our Markov chain Monte Carlo procedure we use vague, or flat, prior distributions to ensure that they do not drive our results. For the thresholds, $\tau_i$, we use a mean-zero normal distribution with variance 1,000,000, and for the precision parameters associated with $\sigma_A^2$, $\sigma_E^2$, and $\sigma_C^2$, we use a Pareto distribution with shape parameter equal to 1 and scale parameter equal to 0.001, which is equivalent to putting a uniform (0, 1,000) prior on the variances. A Pareto distribution has proven to work well for variance components in genetic models (Burton et al. 1999; Scurrah, Palmer, and Burton 2000). In addition, we use convergence diagnostics to make sure that the stationary posterior distribution has been reached: we began sampling from the joint posterior distribution only after convergence was established using the Brooks and Gelman (1998) statistic (values of less than 1.1 on all parameters indicate convergence). For all of the models the “burn-in” period was 100,000 iterations and the chains were thinned by 100.
6. If we tried to estimate all three components of variance simultaneously in the ordered model, the model would not be identified, so we fix the variance of the unshared environment, $\sigma_E^2$, to be one.
7. Letting $\theta$ be the parameter vector, $y$ the data, $p$ the likelihood function, and $f(y)$ a standardizing term that is a function of the data alone, the deviance is defined as $D(\theta) = -2\ln p(y \mid \theta) + 2\ln f(y)$. Then $\bar{D}$ is defined as $\bar{D} = E_\theta[D(\theta)]$, and $p_D$ is defined as $p_D = \bar{D} - D(\bar{\theta})$, where $\bar{\theta}$ is the posterior expectation of $\theta$. The deviance information criterion can then be calculated as $\mathrm{DIC} = p_D + \bar{D}$. For further details, see Spiegelhalter et al. (2002).
FIGURE I
Panel A: The distribution of giving (percent donated), by zygosity. Panel B: The distribution of risk aversion (certainty equivalent), by zygosity. [Histograms of density, plotted separately for MZ and DZ twins.]
TABLE I
EXPERIMENTAL BEHAVIOR
              Giving                Risk aversion         Risk investment       Risk assessment
              Mean   S.D.   n       Mean   S.D.   n       Mean   S.D.   n       Mean  S.D.  n
MZ twins      53.60  37.27  638     52.38  18.53  625     30.25  21.22  638     4.98  1.98  636
DZ twins      54.43  37.94  282     51.88  17.80  276     33.19  21.28  279     5.25  1.96  279
p-value       .77                   .71                   .08                   .07
Notes. The p-value is for the test of the hypothesis that the means of the MZ and DZ distributions are the same. Standard errors are adjusted to take nonindependence into account (Liang and Zeger 1986).
IV. RESULTS
In Table I we report some background statistics. On average, subjects donated 54% of their endowment in the dictator game to the charity, and the average certainty equivalent in the risky gamble was 52 SEK.8 Results from the first hypothetical question reveal that subjects invest on average 31% of their endowment. Finally, on a scale from 0 to 10, subjects report an average willingness to take risks of just above 5. Tests of equality for all four variables fail to reject the null hypothesis that the MZ and DZ means are equal at the 5% level.
8. To facilitate interpretation, in Table I we define the certainty equivalent as the midpoint between the lowest sure amount that the subject is willing to accept and the category immediately below. For example, a subject who chooses the gambles at 20, 30, and 40 and then prefers 50 SEK with certainty is assigned a certainty equivalent of 45.
To give an impression of individual variation in responses, in Figure I we plot histograms of the distributions for risk aversion and giving, separately for DZ and MZ twins. A visual inspection reveals ample variation in responses and lends little support to the hypothesis that the frequency distributions vary by zygosity. Histograms and scatterplots for the survey-based risk measures are provided in Figures A1 and A2 in the Online Appendix.
In Table II, we report parametric and nonparametric correlations for MZ and DZ twins. Pearson correlations do not differ appreciably from Spearman correlations. Because a purely environmental model cannot account for any differences between MZ and DZ correlations, these correlations serve as a preliminary diagnostic of whether the preferences in question are in part under genetic influence. For giving, the Spearman correlation is .319 for MZ twins and .106 for DZ twins, consistent with a genetic effect. Similarly, for risk aversion, the Spearman correlation is .222 for MZ twins and .025 for DZ twins, whereas for risk investment the corresponding figures are .264 and .096. For risk assessment, the separation is larger, with an MZ correlation of .367 and a DZ correlation of −.034. Because the sample is smaller for DZ twins, their correlations are estimated with less precision, yielding wider confidence intervals. Yet when the equality of the correlations is tested using the bootstrap, the one-sided p-value is less than 2% for giving, risk aversion, and risk assessment.
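The bootstrap comparison of MZ and DZ correlations described earlier can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: it uses Pearson correlations only, resamples complete pairs within each zygosity group, and computes the one-sided p-value as the share of draws in which $k\,\rho_{MZ} - \rho_{DZ}$ is negative, with $k = 1$ for the equality test and $k = 1/2$ for the test of whether the DZ correlation is at least half the MZ correlation. The data are simulated bivariate-normal pairs whose sample sizes mirror the giving column of Table II; all helper names are hypothetical.

```python
import numpy as np


def pair_corr(pairs: np.ndarray) -> float:
    """Pearson correlation between twin 1 (column 0) and twin 2 (column 1)."""
    return float(np.corrcoef(pairs[:, 0], pairs[:, 1])[0, 1])


def bootstrap_twin_test(mz_pairs, dz_pairs, mz_weight=1.0, n_boot=1000, seed=0):
    """One-sided bootstrap p-value: share of draws with mz_weight * rho_MZ - rho_DZ < 0.

    mz_weight=1.0 tests rho_MZ = rho_DZ against rho_MZ > rho_DZ;
    mz_weight=0.5 tests rho_DZ >= rho_MZ / 2 against rho_DZ < rho_MZ / 2.
    """
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample complete pairs with replacement, separately by zygosity.
        mz_draw = mz_pairs[rng.integers(0, len(mz_pairs), size=len(mz_pairs))]
        dz_draw = dz_pairs[rng.integers(0, len(dz_pairs), size=len(dz_pairs))]
        diffs[b] = mz_weight * pair_corr(mz_draw) - pair_corr(dz_draw)
    return float(np.mean(diffs < 0))


def simulate_pairs(rho: float, n: int, rng) -> np.ndarray:
    """Hypothetical bivariate-normal twin data with correlation rho."""
    return rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)


rng = np.random.default_rng(42)
mz = simulate_pairs(0.32, 319, rng)  # pair counts mirror Table II (giving); values are simulated
dz = simulate_pairs(0.11, 141, rng)
print("p, H0: rho_MZ = rho_DZ:     ", bootstrap_twin_test(mz, dz))
print("p, H0: rho_DZ >= rho_MZ / 2:", bootstrap_twin_test(mz, dz, mz_weight=0.5))
```

A Spearman version would replace `pair_corr` with a rank-based correlation; the resampling logic is unchanged.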
TABLE II
PARAMETRIC AND NONPARAMETRIC CORRELATIONS FOR MZ AND DZ TWIN PAIRS
Measure           Twin pairs         Spearman               Pearson                n
Giving            MZ                 .319*** (.211–.426)    .317*** (.208–.424)    319
                  DZ                 .106 (−.067–.292)      .099 (−.075–.279)      141
                  p-value of diff.   .015                   .013
Risk aversion     MZ                 .222*** (.118–.341)    .222*** (.099–.342)    307
                  DZ                 .025 (−.150–.189)      .024 (−.135–.179)      135
                  p-value of diff.   .020                   .024
Risk investment   MZ                 .264*** (.149–.364)    .304*** (.177–.408)    319
                  DZ                 .096 (−.077–.277)      .110 (−.079–.315)      139
                  p-value of diff.   .066                   .057
Risk assessment   MZ                 .367*** (.266–.468)    .384*** (.280–.481)    317
                  DZ                 −.034 (−.217–.148)     −.043 (−.237–.139)     139
                  p-value of diff.   .001                   .001
Notes. ***, **, * significantly different from zero at 1%, 5%, and 10% levels. All results are bootstrapped. p-values are one-sided. 95% confidence intervals within parentheses.
FIGURE II
Scatterplots (jittered for expositional clarity). Panel A: Scatterplot for the dictator game, percent donated, MZ twins. Panel B: Scatterplot for the dictator game, percent donated, DZ twins. Panel C: Scatterplot for risk aversion, certainty equivalent, MZ twins. Panel D: Scatterplot for risk aversion, certainty equivalent, DZ twins. [Each panel plots the response of twin 1 (horizontal axis) against that of twin 2 (vertical axis).]
Though the MZ correlation is also higher than the DZ correlation for risk investment (the hypothetical investment question), the difference is not significant at the 5% level (p = .07). The robust separation of MZ and DZ correlations is illustrated in Figure II, where we plot the response of twin 1 against the response of twin 2 separately for MZ and DZ twins. Hence, the evidence is very compelling that genes do contribute to phenotypic variation in both giving and risk aversion. We also used the same bootstrapping method to test the null hypothesis that the DZ correlation is at least half the MZ correlation, as implied by the ACE specification. For neither risk aversion (p = .16), risk investment (p = .36), nor giving (p = .30) can we reject the null hypothesis. On the other hand, we can reject the null hypothesis for risk assessment (p = .02), suggesting that estimating an ACE model for this measure is inappropriate. Notice that even though we cannot reject the hypothesis at conventional levels of significance in three out of four cases, it is still striking that the estimated DZ correlations are always less than half the MZ correlations. In what follows, we restrict our attention to the results from our experiments with monetary incentives; results for the supplemental risk measures are reported in Tables A3–A5 in the Online Appendix.
Because we cannot reject the null hypothesis that the DZ correlation is at least half the MZ correlation for our two main experimental measures, we do not depart from the convention of estimating ACE models. In Tables III and IV we present the estimates of the variance components of the ACE model and its nested submodels. Parameter estimates are similar regardless of whether the outcome variable is treated as continuous or ordinal. The estimate of genetic influences on giving is 0.22 (0.28) in the most general version of the continuous (ordered) model. Corresponding estimates for risk aversion are 0.14 and 0.16, whereas the contribution of the common environment is closer to zero, both in our modified dictator game and for risk aversion. It is interesting to contrast these results with those that have previously been reported for other outcome variables of interest to economists. For example, Björklund, Jäntti, and Solon (2005) estimated the heritability of earnings in Sweden using multiple sibling types and obtained heritability estimates for income in the range 10% to 30%, whereas Taubman’s original estimates based on a sample of white U.S. war veterans were slightly higher (Taubman 1976). The estimates for trust and trustworthiness reported in previous papers, though imprecise, are also in the neighborhood of 20% in both U.S. and Swedish data (Cesarini et al. 2008). Generally, the estimated heritabilities for our experimentally elicited preferences are a little lower than the reported broad heritabilities for personality, which tend to be around 50% (Plomin et al. 2001a), and lower still than the estimates of the heritability of IQ (Neisser et al. 1996). In making the comparison to psychological variables it is, however, important to bear in mind that the reliability of the measurement instruments used by psychometricians in IQ and personality research may differ from the reliability of behavior in economic experiments.
In light of these results, it is not surprising to find that both for giving and for risk aversion, the diagnostics of model fit repeatedly point to the AE model as the most appropriate. Setting C to equal zero is potentially a drastic step, but is consistent with the fairly low DZ correlations that we observe. When the AE submodel is estimated, the estimates of A for giving are 0.31 (0.39) in the continuous (ordered) models. The corresponding figure for risk aversion is 0.21 (0.25). We also report the results from CE and E models. CE models always have fit diagnostics worse than the AE and ACE models. Not surprisingly, the E model fits the data very poorly.
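Because footnote 7 defines DIC as $\bar{D} + p_D$, the model rankings behind this conclusion can be recomputed from the deviance and effective-parameter columns of Tables III and IV. The short Python check below transcribes those values (the dictionary layout and function names are our own) and picks the model with the smallest DIC in each specification. Because the published $\bar{D}$ values are rounded, a recomputed DIC can differ from the printed one by about one unit (for example, 7,713 + 160.8 = 7,873.8 against a reported 7,873), which does not affect the ranking.

```python
# (Dbar, pD) for each model, transcribed from Tables III (giving) and IV (risk aversion).
FITS = {
    ("giving", "continuous"): {"ACE": (4719, 227.3), "AE": (4706, 234.9),
                               "CE": (4783, 184.8), "E": (5043, 2.0)},
    ("giving", "ordered"): {"ACE": (1693, 236.0), "AE": (1688, 238.7),
                            "CE": (1761, 189.8), "E": (2023, 2.0)},
    ("risk aversion", "continuous"): {"ACE": (7713, 160.8), "AE": (7707, 163.9),
                                      "CE": (7752, 130.6), "E": (7914, 2.0)},
    ("risk aversion", "ordered"): {"ACE": (2760, 181.4), "AE": (2752, 186.3),
                                   "CE": (2804, 149.1), "E": (2985, 5.9)},
}


def dic(dbar: float, p_d: float) -> float:
    """Deviance information criterion as in footnote 7: DIC = pD + Dbar."""
    return p_d + dbar


def best_by_dic(models: dict) -> tuple:
    """Return the name of the lowest-DIC model and the DIC of every model."""
    scores = {name: dic(*fit) for name, fit in models.items()}
    return min(scores, key=scores.get), scores


for (outcome, spec), models in FITS.items():
    winner, scores = best_by_dic(models)
    listing = ", ".join(f"{name} {score:,.1f}" for name, score in scores.items())
    print(f"{outcome}, {spec}: lowest DIC = {winner} ({listing})")
```

In all four specifications the AE model has the lowest DIC, with ACE second, which is the pattern the text describes.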
TABLE III
RESULTS OF THE ACE MODEL AND ITS NESTED SUBMODELS FOR GIVING
Continuous
Model   A                    C                    E                    Dbar    pD      DIC
ACE     0.22 (0.05, 0.36)    0.09 (0.01, 0.23)    0.70 (0.60, 0.79)    4,719   227.3   4,946
AE      0.31 (0.21, 0.40)    –                    0.69 (0.60, 0.79)    4,706   234.9   4,941
CE      –                    0.25 (0.16, 0.33)    0.75 (0.67, 0.84)    4,783   184.8   4,968
E       –                    –                    1.00 (1.00, 1.00)    5,043   2.0     5,045
Ordered
Model   A                    C                    E                    Dbar    pD      DIC
ACE     0.28 (0.06, 0.46)    0.11 (0.01, 0.30)    0.61 (0.50, 0.73)    1,693   236.0   1,929
AE      0.39 (0.27, 0.51)    –                    0.61 (0.49, 0.74)    1,688   238.7   1,927
CE      –                    0.32 (0.21, 0.43)    0.68 (0.57, 0.79)    1,761   189.8   1,951
E       –                    –                    1.00 (1.00, 1.00)    2,023   2.0     2,025
Notes. A is the genetic contribution; C is the common environment contribution; E is the unique environment contribution; – indicates that the component is not included in the model. Dbar: Deviance. pD: Effective number of parameters. DIC: Bayesian deviance information criterion. 95% credible intervals within parentheses.
TABLE IV
RESULTS OF THE ACE MODEL AND ITS NESTED SUBMODELS FOR RISK AVERSION
Continuous
Model   A                    C                    E                    Dbar    pD      DIC
ACE     0.14 (0.02, 0.27)    0.07 (0.00, 0.18)    0.80 (0.69, 0.89)    7,713   160.8   7,873
AE      0.21 (0.11, 0.31)    –                    0.79 (0.70, 0.89)    7,707   163.9   7,871
CE      –                    0.17 (0.08, 0.26)    0.83 (0.74, 0.93)    7,752   130.6   7,883
E       –                    –                    1.00 (1.00, 1.00)    7,914   2.0     7,916
Ordered
Model   A                    C                    E                    Dbar    pD      DIC
ACE     0.16 (0.01, 0.30)    0.09 (0.01, 0.22)    0.75 (0.65, 0.86)    2,760   181.4   2,941
AE      0.25 (0.14, 0.36)    –                    0.75 (0.64, 0.86)    2,752   186.3   2,938
CE      –                    0.20 (0.10, 0.30)    0.80 (0.70, 0.90)    2,804   149.1   2,953
E       –                    –                    1.00 (1.00, 1.00)    2,985   5.9     2,991
Notes. A is the genetic contribution; C is the common environment contribution; E is the unique environment contribution; – indicates that the component is not included in the model. Dbar: Deviance. pD: Effective number of parameters. DIC: Bayesian deviance information criterion. 95% credible intervals within parentheses.
IV.A. Equal Environment Assumption
Critics of the classical twin design cite a number of alleged failures of the equal environment assumption, including that MZ twins are more likely to interact, and that parents, on average, give MZ twins more similar treatment (Pam et al. 1996). Indeed, Björklund, Jäntti, and Solon (2005) have shown, using a data set with nine different sibling types, that estimates of the variance components in income do change substantially when the equal environment assumption is relaxed. In the context of research on personality and IQ, the evidence is, however, fairly convincing that any bias arising from the equal environment assumption is not of first order. Most importantly, for measures of personality and cognitive ability, studies of MZ and DZ twins reared apart tend to produce estimates of heritability similar to those using twins reared together (Bouchard 1998). Because studies of twins reared apart do not rely on the equal environment assumption, it is unlikely that the assumption is a major source of bias. Second, although it is true that MZ twins report a higher frequency of contact with one another than DZ twins, twin similarity has been shown to cause greater contact rather than vice versa (Posner et al. 1996). Other studies have failed to find a significant relationship between similarity and contact. For example, one large study found that frequency of contact is not correlated with similarity in social attitudes (Martin et al. 1986). Third, the claim that the greater similarity of MZ twins is due to more uniform parental influences rests on fairly weak empirical ground. Measures of the degree of similarity in parental treatment turn out not to be correlated with similarity in IQ or personality measures (Bouchard et al. 1990). Also, in the relatively rare cases where parents miscategorize their twins as MZ instead of DZ (or the converse), differences in cognitive ability and personality persist (Bouchard and McGue 2003). Finally, we note that our estimated Cs are very low, and it would appear that the Bayesian estimator, if anything, overstates the importance of shared environment compared to other standard estimators.9
9. It is clear by inspection that a method of moments estimator would produce nonsensical negative estimates of common environment. When continuous ACE models are estimated by maximum likelihood in MPLUS (Muthén and Muthén 2006) with bootstrapped standard errors, the estimated Cs are always equal to zero, and the estimated heritabilities are 0.21 for risk aversion, 0.31 for giving, 0.29 for risk investment, and 0.35 for risk assessment. All estimates of A are significant at the 5% level.
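The claim in footnote 9 that a method-of-moments estimator would produce negative common-environment estimates can be verified directly from Table II. Under the ACE model with standardized phenotypes, equations (2) and (3) with $\sigma_D^2 = 0$ give $\rho_{MZ} = a^2 + c^2$ and $\rho_{DZ} = a^2/2 + c^2$, which solve to Falconer's formulas $a^2 = 2(\rho_{MZ} - \rho_{DZ})$ and $c^2 = 2\rho_{DZ} - \rho_{MZ}$. The Python sketch below applies them to the Spearman correlations; it is a textbook estimator used here for illustration, not one the authors report.

```python
def falconer_ace(r_mz: float, r_dz: float) -> dict:
    """Method-of-moments ACE decomposition for standardized phenotypes (sigma_D^2 = 0).

    Solves r_MZ = a2 + c2 and r_DZ = a2 / 2 + c2; the remainder 1 - r_MZ is e2.
    """
    a2 = 2.0 * (r_mz - r_dz)
    c2 = 2.0 * r_dz - r_mz
    e2 = 1.0 - r_mz
    return {"A": a2, "C": c2, "E": e2}


# Spearman correlations (MZ, DZ) from Table II.
TABLE_II_SPEARMAN = {
    "giving": (0.319, 0.106),
    "risk aversion": (0.222, 0.025),
    "risk investment": (0.264, 0.096),
    "risk assessment": (0.367, -0.034),
}

for measure, (r_mz, r_dz) in TABLE_II_SPEARMAN.items():
    est = falconer_ace(r_mz, r_dz)
    print(f"{measure:16s} A = {est['A']:.3f}  C = {est['C']:.3f}  E = {est['E']:.3f}")
```

For giving, for example, this yields $a^2 \approx 0.43$ and $c^2 \approx -0.11$. The $c^2$ estimate is negative for all four measures, because $2\rho_{DZ} - \rho_{MZ} < 0$ exactly when the DZ correlation is less than half the MZ correlation.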
IV.B. Measurement Error
In the simplest case, where the studied preference is observed with mean-zero random error, we can think of the unique environment component as being composed of two terms, $E_{ij} = E^*_{ij} + \varepsilon_{ij}$, where $\varepsilon_{ij}$ is a mean-zero variable with variance $\sigma_\varepsilon^2$ that is i.i.d. across time. Under these assumptions, and with the total phenotypic variance normalized to one, it is easy to show that the estimates of A and C need to be scaled up by a factor of $1/(1 - \sigma_\varepsilon^2)$. For example, under the conservative assumption of a retest correlation of .8, this would imply a $\sigma_\varepsilon^2$ of .2, and therefore the estimates of A and C would need to be scaled up by 25%, that is, to somewhere between 0.18 and 0.41 for A in our ACE models. There is surprisingly little evidence on test–retest stability in economic experiments. One recent paper (Brosig, Riechmann, and Weimann 2007) examined the temporal stability of individual behavior in modified dictator and prisoner’s dilemma games and found that individual behavior is unstable across time in a given game. However, the authors used a concept of stability that is not easily mapped to an estimate of $\sigma_\varepsilon^2$. Other papers have estimated error rates from subjects’ responses to repeated, identical items, typically finding reversal rates on the order of 10%–20% (Harless and Camerer 1994; Hey and Orme 1994).
IV.C. Representativeness
Compared to most experimental work, our sample is more representative because we draw our subjects from a population-based registry rather than from a pool of college students. Yet it is important to establish the “selectivity” of our sample. In particular, three questions arise. First, are the MZ and DZ twins who agree to participate drawn from similar environments? Second, to what extent does our method of sampling lead to overrecruitment of subjects with certain characteristics? If any such characteristics are associated with heritability, then estimates of variance components will be biased.
Third, in light of the fairly skewed ratio of MZ twins to DZ twins in our sample, are there any reasons to believe that this has affected our estimates?
A basic assumption of the ACE model is that MZ twins and DZ twins are drawn from the same environment. We have already shown that, in terms of experimental outcomes, the MZ and DZ distributions appear to be the same. To further investigate this hypothesis, we conducted a battery of tests for equality of background variables including gender, years of education, employment status, health, income, and marital status. With the exception of age, we did not find any significant differences between the MZ and DZ samples. The results are reported in Table V.
TABLE V
MZ–DZ COMPARISON FOR BACKGROUND VARIABLES
                          MZ twins               DZ twins
Variable                  Mean       S.D.        Mean       S.D.       p-value   Data source
Female                    0.77       0.42        0.82       0.39       .24       Multiple
Age                       34.30      7.35        35.95      7.81       .03       Multiple
Education                 13.70      2.22        13.63      2.18       .69       Stat. Sweden
Income                    201,973    152,674     217,548    119,997    .19       Stat. Sweden
Employed full time        0.54       0.50        0.60       0.49       .23       STAGE
Unemployed                0.03       0.18        0.04       0.19       .80       STAGE
Self-employed             0.04       0.20        0.07       0.25       .32       STAGE
On sick leave             0.04       0.19        0.02       0.12       .10       STAGE
Government employee       0.40       0.49        0.45       0.50       .26       STAGE
Cognitive ability         0.03       0.99        −0.06      1.02       .30       Exp. session
Emotional stability       −0.04      1.00        0.10       0.99       .09       Exp. session
Agreeableness             0.02       0.98        −0.04      1.04       .55       Exp. session
Extraversion              −0.04      0.98        0.08       1.04       .16       Exp. session
Conscientiousness         −0.02      1.01        0.04       0.98       .55       Exp. session
Health                    1.87       0.81        1.88       0.79       .86       STAGE
Marital status            0.25       0.43        0.29       0.46       .26       Stat. Sweden
Number of children        0.70       0.99        0.76       0.99       .55       Stat. Sweden
Notes. Education refers to years of education. Income is the sum of wage income, taxable transfers, and income from own company for the year 2005 (in SEK). Employment information was gathered when the subject responded to the STAGE questionnaire. Psychological measures were adjusted to have mean 0 and standard deviation 1 for the whole sample. Health is self-reported on a scale from 1 to 5. Marital status is a dummy variable taking the value 1 if the subject is married. Number of children is the number of children under 18 living in the respondent’s household in the year 2005. The p-value is for the test of the hypothesis that the means of the MZ and DZ distributions are the same. We utilized adjusted Wald tests for equality taking into account nonindependence within twin families (Liang and Zeger 1986).
Second, it is possible that the twins who participated are not representative of the population as a whole. Like most twin studies (Lykken, McGue, and Tellegen 1986), our method of recruitment led to an oversampling of women and of MZ twins. Comparing our participants to the STAGE cohort as a whole on a number of background variables, we find few economically meaningful differences. These results are also reported in the Online Appendix. A comparison to the entire STAGE cohort is only an imperfect measure of representativeness, however, because STAGE respondents are also a self-selected group. We have therefore merged our experimental data with information on educational attainment, marital status, and income from Statistics Sweden and can thus further examine how our sample compares to the population mean for the cohort born 1959 to 1985. The population marriage rate is 36% for women and 29% for men, slightly higher than what we observe in our experimental sample. For income, the population averages are close to those of our participants: on average, men earn 247,000 SEK, whereas our male subjects earn 244,000 SEK; for women the corresponding figures are 181,000 and 197,000 SEK. Finally, the average years of education in the cohort as a whole are 12.09 for men and 12.49 for women, slightly more than one year less than the average for our experimental sample. The upshot of this discussion is that our method of sampling leads to mild overrecruitment of subjects who are younger than average, are less likely to be married, and have fewer children on average. There is also modest overrecruitment of subjects with better than average educational attainment.
Is this above-average educational attainment of our subjects a cause for concern? For instance, it has been suggested that the heritability of intelligence might be moderated by social stratum (Turkheimer et al. 2003), at least in children, and a similar argument might apply to the effect of educational attainment on our outcome variables. To investigate this, we modify the continuous version of our baseline model to allow for an interaction between A and years of education.10 The fit of the new model is slightly better for risk aversion and slightly worse for the other three variables, suggesting that the interaction between A and education should not be included. For risk aversion, heritability increased somewhat, to 0.21 (95% CI 0.02, 0.39), compared to the baseline model.11
Finally, there is a third, more subtle way in which recruitment bias may be affecting our estimates.
A plausible explanation for the overrecruitment of MZ twins is that because MZ twins are in more frequent contact with each other, it is easier for them to coordinate on a date and time. The concern here is that coordination costs, or willingness to participate more generally, might be associated with behavioral similarity. If so, this will inflate correlations, leading to an upward bias in the estimates of A and C. If this form of selection is more severe for MZ or DZ twins, it will also bias the estimates of the relative importance of common environmental and genetic influences. A reasonable proxy variable for costs of coordination is the frequency of contact between twins. Self-reported data on frequency of contact are available in STAGE.12 When we compare twins who took part in our study with those who did not, there is a practically and statistically significant difference in the anticipated direction. MZ twins who participated in the study report a frequency of contact of 260 interactions per year, whereas those who did not participate report 234 interactions per year. The corresponding figures for DZ twins are 199 and 155. These differences are highly significant. In other words, frequency of contact is a robust predictor of participation. The crucial question, however, is whether frequency of contact predicts behavioral similarity. To test this, we regress the absolute value of the within-pair difference in giving and in each of the three measures of risk on the average self-reported frequency of contact. Controlling for zygosity, the coefficient on frequency of contact is never significant. In other words, a reasonable proxy variable for “costs of coordination” does not seem to be related to behavioral similarity.
A second robustness test is to take variables that are available for the STAGE cohort in its entirety and ask whether there are any systematic differences in within-pair correlations between subjects who participated in our experiments and those who did not. If correlations in health, income, years of education, and the numerous other variables we investigate were consistently higher in the experimental sample, this would suggest that participants are a self-selected group with greater concordance in general. The results from this exercise are reported in Table A2 of the Online Appendix of this paper. There is no tendency for the patterns of correlations to differ between the two groups.
10. This model is $\chi^{MZ}_{ij} = A_i + \beta A_i \times \mathrm{Education}_{ij} + C_i + E_{ij}$ for MZ twins and $\chi^{DZ}_{ij} = A_{1i} + A_{2ij} + \beta (A_{1i} + A_{2ij}) \times \mathrm{Education}_{ij} + C_i + E_{ij}$ for DZ twins.
11. The DICs for the risk aversion, risk investment, risk assessment, and dictator game interaction models are 7,813, 3,881, 3,698, and 4,919, respectively. New baseline models were run to account for the fact that the interaction models were based on fewer observations due to missing values for the years of education variable. The baseline DICs are 7,824, 3,872, 3,695, and 4,915.
12. We construct the frequency of contact variable as follows. Subjects who report at least one interaction (by e-mail, telephone, or letter) per day are assigned a value of 365. Subjects who report less than one interaction per day are simply assigned a value equal to the number of interactions per year. Interestingly, frequency of contact also provides a falsification test of the basic twin model. Because this variable is the same for both twins in a pair, it cannot possibly be heritable. A higher MZ correlation than DZ correlation would then suggest that measurement errors are more correlated in MZ twins. Fortunately, this turns out not to be the case. In our experimental sample, the MZ correlation is .76 and the DZ correlation is .71. In STAGE as a whole, the correlations are .77 and .75.
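The first robustness test above, together with the coding rule in footnote 12, can be sketched as a pair-level regression. In the Python sketch below, the raw contact reports and giving outcomes are simulated placeholders; only the pair counts (319 MZ and 141 DZ giving pairs, from Table II) come from the paper. Raw reports are assumed to be expressed as interactions per year, so that "at least one interaction per day" becomes a cap at 365, and the standard errors are conventional OLS ones, a simplifying assumption. The helper names are hypothetical.

```python
import numpy as np


def contact_per_year(reported_per_year: np.ndarray) -> np.ndarray:
    """Footnote 12 coding: daily (or more frequent) contact is assigned 365."""
    return np.minimum(reported_per_year, 365)


def ols(y: np.ndarray, regressors: np.ndarray):
    """OLS with an intercept; returns coefficients and conventional standard errors."""
    X = np.column_stack([np.ones(len(y)), regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se


rng = np.random.default_rng(7)
n_mz, n_dz = 319, 141
is_mz = np.r_[np.ones(n_mz), np.zeros(n_dz)]             # zygosity dummy (1 = MZ)
reports = rng.integers(10, 800, size=(n_mz + n_dz, 2))    # hypothetical raw reports, twin 1 and twin 2
contact = contact_per_year(reports).mean(axis=1)         # average of the two capped reports
giving = rng.uniform(0, 100, size=(n_mz + n_dz, 2))      # simulated percent donated, unrelated to contact
abs_diff = np.abs(giving[:, 0] - giving[:, 1])           # within-pair behavioral dissimilarity

beta, se = ols(abs_diff, np.column_stack([contact, is_mz]))
print(f"contact coefficient = {beta[1]:.4f}, t = {beta[1] / se[1]:.2f}")
```

With real data, the same regression would be run once for giving and once for each risk measure; in this simulation contact is unrelated to dissimilarity by construction.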
IV.D. Genetic Nonadditivity
The models we use, like most behavior genetic models, assume that genes influence a trait in an additive manner. That is, the genetic effect is simply the sum of all individual effects. This is by far the most common way to achieve identification. It has long been known that the twin model suffers from parameter indeterminacy when, for example, dominance effects are present, because the number of parameters to be estimated exceeds the number of independently informative equations (Keller and Coventry 2005). The fact that our DZ correlations are less than half of the MZ correlations could be the result of sampling variation. But it could also be an indication that some nonadditive genetic variation is present. For one of our risk measures, risk assessment, we are in fact able to reject the hypothesis that the DZ correlation is at least half the MZ correlation. In Table A5 of the Online Appendix to this paper, we report the results of an ADE model and show that this model fits the data better, as judged by the DIC. A more rigorous way to test for nonadditivity would be to extend the data set to include sibling, parent–child, or even cousin data. Though our data do not contain such information, Coventry and Keller (2005) recently completed a major review of all published parameter estimates using the extended family design compared to classical twin design estimates derived from the same data. The authors report that the estimates of broad heritability in twin studies are fairly accurate. However, the classical twin design overestimates the importance of additive genetic variation and underestimates the importance of nonadditive genetic variation. Evidence from studies of adoptees points in the same direction. In a recent meta-study, Loehlin (2005) reports average correlations of .13 for personality and .26 for attitudes in families with children reared by their biological parents.
However, the correlations for personality and attitudes are .04 and .07, respectively, between adopted children and their nonbiological parents, but .13 and .20 between adopted children and their biological parents (Loehlin 2005). Because only additive genetic variance is transmissible across generations (Fisher 1930), the genetic part of a parent–child correlation is half the narrow heritability, and any shared environment only adds to the correlation; doubling the parent–child correlation therefore yields an upper bound on narrow heritability. The fact that this upper bound is lower than estimates derived from twin studies reinforces the point that there is probably nonadditive variation in personality
and attitudes. The low DZ correlations we observe suggest that a similar situation obtains for economic preferences. We thus concur with the conclusion of Coventry and Keller (2005), namely that estimates from the classical twin design should not be interpreted literally, but are nevertheless very useful because they produce reasonably accurate estimates of broad heritability, and hence of the importance of genes as a source of phenotypic variation.
V. DISCUSSION
In this paper, we have used standard behavior genetic techniques to decompose variation in preferences for giving and risk taking into environmental and genetic components. We document a significant genetic effect on risk taking and giving, with genes explaining approximately twenty percent of phenotypic variation in the best-fitting models. The estimated effect of common environment, by contrast, is smaller. Though these results are clearly in line with the behavior genetic literature (Turkheimer 2000), their implications in the context of modern economics merit further comment. In particular, it is important to exercise great care in interpreting the estimates of variance components. Contrary to what is sometimes supposed, they are estimates of the proportion of variance explained and thus do not shed any direct light on the determinants of the average phenotype. This distinction is important. For instance, if genetic endowments are uniform across the studied population, a trait that is primarily acquired through genes may nonetheless show low, or even zero, heritability. The same argument applies to common environment: a low estimated C could simply mean that there is little variation in how parents culturally transmit preferences or values to their children. This caveat is especially important to bear in mind when interpreting heritability estimates from a study population such as ours, where it seems plausible to assume that environmental variation between families is modest.
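The distinction between variance shares and average levels can be illustrated with a small simulation. This is a sketch under stated assumptions, not the paper's method: it assumes a purely additive phenotype P = G + E with independent components, and the functions `heritability` and `simulate`, as well as all variances, are hypothetical. Case (a) gives every individual the same large genetic contribution, so heritability is zero even though genes set the level of the trait; case (c) raises every environment by the same amount, which moves the mean phenotype but leaves heritability unchanged.

```python
import random
import statistics


def heritability(genetic, environment):
    """Var(G) / Var(P) for an additive phenotype P = G + E."""
    phenotype = [g + e for g, e in zip(genetic, environment)]
    return statistics.pvariance(genetic) / statistics.pvariance(phenotype)


def simulate(n=10_000, seed=0):
    """Hypothetical populations illustrating what heritability does and does not measure."""
    rng = random.Random(seed)
    env = [rng.gauss(0.0, 1.0) for _ in range(n)]  # Var(E) = 1

    # (a) Genes set a large common level (10 units), but nobody differs genetically.
    uniform_genes = [10.0] * n
    # (b) Genes differ across individuals: Var(G) = 0.25, so h2 should be near 0.2.
    varying_genes = [rng.gauss(0.0, 0.5) for _ in range(n)]
    # (c) Same genes as (b), but every environment improves by the same 5 units.
    shifted_env = [e + 5.0 for e in env]

    return {
        "h2_uniform_genes": heritability(uniform_genes, env),
        "h2_varying_genes": heritability(varying_genes, env),
        "h2_shifted_env": heritability(varying_genes, shifted_env),
        "mean_before_shift": statistics.fmean(g + e for g, e in zip(varying_genes, env)),
        "mean_after_shift": statistics.fmean(g + e for g, e in zip(varying_genes, shifted_env)),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

With these settings, case (a) returns exactly zero and case (b) should land close to 0.25 / 1.25 = 0.2. The uniform shift in case (c) raises the mean phenotype by 5 units without changing heritability, which is why a heritability statistic says little about how far a trait can be moved by environmental change.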
Like any other descriptive statistic, a heritability estimate is specific to the population for which it is estimated. Although our findings are probably informative about heritability in other modern Western societies, we caution against further extrapolation. Variation in our study population is in all likelihood small relative to cross-country differences or historical environmental
differences that could potentially generate greater variation in risk preferences and giving. Perhaps the most striking and intuitive illustration of this point comes from the study of income, which is moderately heritable in Sweden as well as in the United States (Björklund, Jäntti, and Solon 2005; Taubman 1976). Over recent centuries, incomes have increased manyfold, and even today an individual’s country of origin is by far the most important determinant of that individual’s income (Sala-i-Martin 2006). In other words, a heritability statistic says little about the malleability of a trait with respect to environmental interventions (Goldberger 1979). Caution should also be exercised in interpreting our estimate of unique environment (E), because unique environment and measurement error cannot be separately identified without knowledge of test–retest correlations (Plomin and Daniels 1987; Plomin et al. 2001b): any noise in the elicitation of preferences is subsumed under the estimate of unique environmental effects.13 Further, a number of important sources of unique environmental effects, such as accidents, are nonsystematic in nature. The observation that the human genome could not possibly specify every synaptic connection in the brain, and that random events could lead to different developmental outcomes even in genetically identical individuals, falls into this category (Molenaar, Boomsma, and Dolan 1993; Jensen 1997). Economists have traditionally been agnostic about the causal mechanisms behind individual differences in preferences. Although choosing to overlook genetic explanations is often well motivated on the grounds of parsimony, especially in studies taking a historical or geographical perspective, our findings, combined with the preexisting behavior genetics literature, point to a distinct and potentially important source of preference heterogeneity.
Despite ample experimental evidence, the origins of individual behavioral variation in economic games have thus far remained elusive, and many attempts to find theoretically appealing and empirically stable correlates of experimentally elicited preferences have yielded contradictory results (Camerer 2003).
13. This result also has implications for the genome-wide association studies currently under way, which examine the association between genetic variation across the human genome and behavior in experimental games. Noise in the elicitation of, for instance, social preferences is likely to frustrate these efforts. Repeated measurement would be one way of dealing with the problem.
If preferences are indeed under moderate genetic influence, any attempt to understand heterogeneity in preferences without taking this into account will be incomplete. Recently, much interest has been directed toward finding biological or neurological correlates of experimental behavior. Of course, such correlations do not necessarily imply either causality or a genetically mediated association. However, the fact that many of the biological variables with known associations to individual differences in strategies or preferences are strongly heritable does lend some support, if only circumstantial, to our findings. For instance, financial risk taking has been claimed to vary over the menstrual cycle in women (Bröder and Hohmann 2003; Chen, Katuscak, and Ozdenoren 2005) and to correlate both with facial masculinity and with circulating testosterone levels in men (Apicella et al. 2008). A number of imaging studies have also explored the neural correlates of both giving and financial risk taking. One study found activation in the striatum both on receiving money and on donating to charity (Moll et al. 2006). Another study found similar activation patterns and demonstrated enhanced activation when the charitable donation was voluntary (Harbaugh, Mayr, and Burghart 2007). In the context of financial risk taking, Kuhnen and Knutson (2005) demonstrated that risk seeking is associated with activation in the nucleus accumbens, whereas risk aversion is associated with activation in the insula. In general, brain structure is under strong genetic influence, though there are substantial regional differences in heritability (Thompson et al. 2001; Toga and Thompson 2005). The same is true for hormone levels (Harris, Vernon, and Boomsma 1998; Bartels et al. 2003).
VI. CONCLUSIONS
In this paper, we have presented an empirical investigation into the relative contributions of individual differences in genes and environment to observed variation in economic preferences for risk and giving.
Even though both MZ and DZ twins share their age and were raised together in the same family, genetically identical MZ twins still exhibit much greater similarity in their preferences for risk and giving than do DZ twins. Although our results do not allow us to be as assertive as Sir Francis Galton, they do suggest that humans are endowed with genetic variation in their proclivity to donate money to charity and to take risks.
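The MZ–DZ contrast behind this conclusion maps into variance components through Falconer's classical moment formulas, sketched here for illustration only. The paper's own estimates come from fitted models compared by the DIC, not from these formulas, and the correlations in the example are hypothetical. When the DZ correlation falls below half the MZ correlation, as discussed in Section IV.D, the ACE formulas return a negative common-environment share, whereas the ADE formulas return admissible values. In both models, broad heritability equals the MZ correlation.

```python
def falconer_ace(r_mz, r_dz):
    """Moment estimates (a2, c2, e2) under an ACE model.

    Assumes r_mz = a2 + c2 and r_dz = a2 / 2 + c2.
    """
    a2 = 2 * (r_mz - r_dz)
    c2 = 2 * r_dz - r_mz
    e2 = 1 - r_mz
    return a2, c2, e2


def falconer_ade(r_mz, r_dz):
    """Moment estimates (a2, d2, e2) under an ADE model.

    Assumes r_mz = a2 + d2 and r_dz = a2 / 2 + d2 / 4.
    """
    a2 = 4 * r_dz - r_mz
    d2 = 2 * r_mz - 4 * r_dz
    e2 = 1 - r_mz
    return a2, d2, e2


if __name__ == "__main__":
    # Hypothetical correlations with r_dz below half of r_mz.
    r_mz, r_dz = 0.30, 0.10
    a2, c2, e2 = falconer_ace(r_mz, r_dz)
    print(f"ACE: a2={a2:.2f}, c2={c2:.2f}, e2={e2:.2f}")  # c2 < 0 is inadmissible
    a2, d2, e2 = falconer_ade(r_mz, r_dz)
    print(f"ADE: a2={a2:.2f}, d2={d2:.2f}, e2={e2:.2f}")
    print(f"broad heritability (either model): {r_mz:.2f}")
```

With the hypothetical correlations 0.30 and 0.10, the ACE formulas imply a negative C, while the ADE formulas split the same broad heritability of 0.30 into A = 0.10 and D = 0.20.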
By now there is a plethora of studies exploring the sources of individual variation in economic experiments and games, yet until recently genetic influences have received relatively little consideration. Here we have argued that this failure to consider genes obscures an important source of preference heterogeneity. Ultimately, we hope that a better understanding of the underlying individual genetic heterogeneity14 in economic preferences, and of the adaptive pressures under which these preferences evolved, will lead to a more comprehensive economic science that can bridge some of the unexplained gaps between empirical data and economic theory (Cosmides and Tooby 1994; Burnham 1997). Finally, our findings suggest a number of directions for future research. In recent years we have witnessed rapid advancement in the field of molecular genetics, including the initial tentative steps toward uncovering the complex genetic architecture underlying variation in individual personality and preferences. In fact, we are aware of one paper that has already uncovered a polymorphism in the AVPR1a gene that is associated with generosity in the dictator game (Knafo et al. 2008). Two recent papers also report that carriers of the 7R allele of the Dopamine Receptor D4 gene (DRD4) take greater financial risks in laboratory experiments (Dreber et al. 2009; Kuhnen and Chiao 2009). The identification of specific genes, or more likely combinations of genes, associated with particular traits holds promise for economic research. Most importantly, as noted by Benjamin et al. (2007), it will allow researchers to study interactions between genotypes and policies and thereby better predict the consequences of policy for individuals. A second direction for future research is to look beyond the laboratory and instead consider field proxies for the underlying preferences.
There are well-known issues associated with the generalizability of laboratory findings (Levitt and List 2007), and documenting similar genetic influences in the field therefore ought to be a priority. A third, and perhaps most natural, direction is to try to disentangle additive and nonadditive genetic variation. We anticipate that studies employing the extended family design will shed more light on this issue. The fairly low DZ correlations we observe provide some tentative, but far from conclusive, evidence for nonadditivity.
14. Genetic variation can be maintained in equilibrium for a number of reasons. For a discussion of this difficult subject in the context of personality differences, see two recent papers by Dall, Houston, and McNamara (2004) and Penke, Denissen, and Miller (2007).
DEPARTMENT OF ECONOMICS, MASSACHUSETTS INSTITUTE OF TECHNOLOGY
POLITICAL SCIENCE DEPARTMENT, UNIVERSITY OF CALIFORNIA AT SAN DIEGO
DEPARTMENT OF ECONOMICS, STOCKHOLM SCHOOL OF ECONOMICS
DEPARTMENT OF MEDICAL EPIDEMIOLOGY AND BIOSTATISTICS, KAROLINSKA INSTITUTET
DEPARTMENT OF ECONOMICS, STOCKHOLM SCHOOL OF ECONOMICS
REFERENCES
Alford, John R., Carolyn L. Funk, and John R. Hibbing, “Are Political Orientations Genetically Transmitted?” American Political Science Review, 99 (2005), 153–167.
Apicella, Coren L., Anna Dreber Almenberg, Benjamin Campbell, Peter Gray, Moshe Hoffman, and Anthony C. Little, “Testosterone and Financial Risk-Taking,” Evolution and Human Behavior, 29 (2008), 384–390.
Bardsley, Nicholas, “Dictator Game Giving: Altruism or Artefact?” Experimental Economics, doi 10.1007/s10683-007-9172-2, 2007.
Bartels, Meike, Stéphanie M. Van den Berg, Frans Sluyter, Dorret I. Boomsma, and Eco J. C. de Geus, “Heritability of Cortisol Levels: Review and Simultaneous Analysis of Twin Studies,” Psychoneuroendocrinology, 28 (2003), 121–137.
Behrman, Jere R., and Paul Taubman, “Is Schooling Mostly in the Genes? Nature–Nurture Decomposition Using Data on Relatives,” Journal of Political Economy, 97 (1989), 1425–1446.
Benjamin, Daniel J., Christopher F. Chabris, Edward L. Glaeser, Vilmundur Gudnason, Tamara B. Harris, David Laibson, Lenore Launer, and Shaun Purcell, “Genoeconomics,” in Biosocial Surveys, Maxine Weinstein, James W. Vaupel, and Kenneth W. Wachter, eds. (Washington, DC: National Academies Press, 2007).
Björklund, Anders, Markus Jäntti, and Gary Solon, “Influences of Nature and Nurture on Earnings Variation: A Report on a Study of Various Sibling Types in Sweden,” in Unequal Chances: Family Background and Economic Success, Samuel Bowles, Herbert Gintis, and Melissa Osborne Groves, eds. (Princeton, NJ: Princeton University Press, 2005).
——, “Nature and Nurture in the Intergenerational Transmission of Socioeconomic Status: Evidence from Swedish Children and Their Biological and Rearing Parents,” Advances in Economic Analysis and Policy, 7 (2007), 1753–1753.
Björklund, Anders, Mikael Lindahl, and Erik Plug, “The Origins of Intergenerational Associations: Lessons from Swedish Adoption Data,” Quarterly Journal of Economics, 121 (2006), 999–1028.
Bouchard, Thomas J. Jr., “Genetic and Environmental Influences on Adult Intelligence and Special Mental Abilities,” Human Biology, 70 (1998), 257–279.
Bouchard, Thomas J. Jr., David T. Lykken, Matt McGue, Nancy L. Segal, and Auke Tellegen, “Sources of Human Psychological Differences: The Minnesota Study of Twins Reared Apart,” Science, 250 (1990), 223–228.
Bouchard, Thomas J. Jr., and Matt McGue, “Genetic and Environmental Influences on Human Psychological Differences,” Journal of Neurobiology, 54 (2003), 4–45.
Bouchard, Thomas J. Jr., Matt McGue, David T. Lykken, and Auke Tellegen, “Intrinsic and Extrinsic Religiousness: Genetic and Environmental Influences and Personality Correlates,” Twin Research, 2 (1999), 88–98.
Bouchard, Thomas J. Jr., and Peter Propping, eds., Twins as a Tool of Behavioral Genetics: Report of the Dahlem Workshop on What Are the Mechanisms Mediating the Genetic and Environmental Determinants of Behavior? (Chichester, UK: John Wiley, 1993).
Bowles, Samuel, Herbert Gintis, and Melissa Osborne Groves, eds., Unequal Chances: Family Background and Economic Success (Princeton, NJ: Princeton University Press, 2005).
Bröder, Arndt, and Natalia Hohmann, “Variations in Risk Taking Behavior over the Menstrual Cycle: An Improved Replication,” Evolution and Human Behavior, 24 (2003), 391–398.
Brooks, Stephen P., and Andrew Gelman, “General Methods for Monitoring Convergence of Iterative Simulations,” Journal of Computational and Graphical Statistics, 7 (1998), 434–455.
Brosig, Jeannette, Thomas Riechmann, and Joachim Weimann, “Selfish in the End? An Investigation of Consistency and Stability of Individual Behavior,” MPRA Paper 2035, University Library of Munich, 2007.
Burnham, Terence C., Essays on Genetic Evolution and Economics, Ph.D. thesis, Harvard University, 1997.
Burton, Paul R., Katrina J. Tiller, Lyle C. Gurrin, William O. C. M. Cookson, Q. A. William Musk, and Lyle J. Palmer, “Genetic Variance Components Analysis for Binary Phenotypes Using Generalized Linear Mixed Models (GLMMs) and Gibbs Sampling,” Genetic Epidemiology, 17 (1999), 118–140.
Camerer, Colin F., Behavioral Game Theory: Experiments in Strategic Interaction (Princeton, NJ: Princeton University Press, 2003).
Cesarini, David, Christopher T. Dawes, James H. Fowler, Magnus Johannesson, Paul Lichtenstein, and Björn Wallace, “Heritability of Cooperative Behavior in the Trust Game,” Proceedings of the National Academy of Sciences, 105 (2008), 3721–3726.
Chen, Yan, Peter Katuscak, and Emre Ozdenoren, “Why Can’t a Woman Bid More Like a Man?” Mimeo, University of Michigan, 2005.
Cipriani, Marco, Paola Giuliano, and Olivier Jeanne, “Like Mother Like Son? Experimental Evidence on the Transmission of Values from Parents to Children,” IZA Discussion Paper No. 2768, 2007.
Cosmides, Leda, and John Tooby, “Better Than Rational: Evolutionary Psychology and the Invisible Hand,” American Economic Review, 84 (1994), 327–332.
Coventry, William L., and Matthew C. Keller, “Estimating the Extent of Parameter Bias in the Classical Twin Design: A Comparison of Parameter Estimates from Extended Twin–Family and Classical Twin Designs,” Twin Research and Human Genetics, 8 (2005), 214–223.
Dall, Sasha R. X., Alasdair I. Houston, and John M. McNamara, “The Behavioral Ecology of Personality: Consistent Individual Differences from an Adaptive Perspective,” Ecology Letters, 7 (2004), 734–739.
Dohmen, Thomas, Armin Falk, David Huffman, and Uwe Sunde, “The Intergenerational Transmission of Risk and Trust Attitudes,” IZA Discussion Paper No. 2380, 2006.
Dohmen, Thomas, Armin Falk, David Huffman, Uwe Sunde, Jürgen Schupp, and Gert G. Wagner, “Individual Risk Attitudes: New Evidence from a Large, Representative, Experimentally-Validated Survey,” IZA Discussion Paper No. 1730, 2005.
Dreber, Anna, Coren L. Apicella, Daniel T. A. Eisenberg, Justin R. Garcia, Richard Zamore, J. Kuji Lum, and Benjamin C. Campbell, “The 7R Polymorphism in the Dopamine Receptor D4 Gene (DRD4) Is Associated with Financial Risk-Taking in Men,” Evolution and Human Behavior, 30 (2009), 85–92.
Eckel, Catherine C., and Philip J. Grossman, “Altruism in Anonymous Dictator Games,” Games and Economic Behavior, 16 (1996), 181–191.
Fisher, Ronald A., The Genetical Theory of Natural Selection (Oxford, UK: Oxford University Press, 1930).
Fong, Christina M., “Evidence from an Experiment on Charity to Welfare Recipients: Reciprocity, Altruism and the Empathic Responsiveness Hypothesis,” Economic Journal, 117 (2007), 1008–1024.
Forsythe, Robert, Joel L. Horowitz, N. E. Savin, and Martin Sefton, “Fairness in Simple Bargaining Experiments,” Games and Economic Behavior, 6 (1994), 347–369.
Fowler, James H., Christopher T. Dawes, and Laura Baker, “Genetic Variation in Political Participation,” American Political Science Review, 102 (2008), 233–248.
Galton, Francis, “The History of Twins, as a Criterion of the Relative Powers of Nature and Nurture,” Fraser’s Magazine, 12 (1875), 566–576.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin, Bayesian Data Analysis (New York: Chapman & Hall/CRC, 2004).
Goldberger, Arthur S., “Twin Methods: A Skeptical View,” in Kinometrics: Determinants of Socioeconomic Success within and between Families, Paul Taubman, ed. (Amsterdam: North-Holland, 1977).
——, “Heritability,” Economica, 46 (1979), 327–347.
Hamilton, William D., “The Genetical Evolution of Social Behaviour I and II,” Journal of Theoretical Biology, 7 (1964), 1–52.
Harbaugh, William T., Ulrich Mayr, and Daniel R. Burghart, “Neural Responses to Taxation and Voluntary Giving Reveal Motives for Charitable Donations,” Science, 316 (2007), 1622–1625.
Harless, David W., and Colin F. Camerer, “The Predictive Utility of Generalized Expected Utility Theories,” Econometrica, 62 (1994), 1251–1289.
Harris, Julie Aitken, Philip A. Vernon, and Dorret I. Boomsma, “The Heritability of Testosterone: A Study of Dutch Adolescent Twins and Their Parents,” Behavior Genetics, 28 (1998), 165–171.
Hey, John D., and Chris Orme, “Investigating Generalizations of Expected Utility Theory Using Experimental Data,” Econometrica, 62 (1994), 1291–1326.
Holt, Charles A., and Susan K. Laury, “Risk Aversion and Incentive Effects,” American Economic Review, 92 (2002), 1644–1655.
Jang, Kerry L., W. John Livesley, and Philip A. Vernon, “Heritability of the Big Five Personality Dimensions and Their Facets: A Twin Study,” Journal of Personality, 64 (1996), 577–592.
Jensen, Arthur R., “The Puzzle of Nongenetic Variance,” in Heredity, Intelligence, and Environment, Robert J. Sternberg and Elena L. Grigorenko, eds. (Cambridge, UK: Cambridge University Press, 1997).
Keller, Matthew C., and William L. Coventry, “Quantifying and Addressing Parameter Indeterminacy in the Classical Twin Design,” Twin Research and Human Genetics, 8 (2005), 201–213.
Kirk, Kathrine M., Hermine H. Maes, Michael C. Neale, Andrew C. Heath, Nicholas G. Martin, and Lyndon J. Eaves, “Frequency of Church Attendance in Australia and the United States: Models of Family Resemblance,” Twin Research, 2 (1999), 99–107.
Knafo, Ariel, Salomon Israel, Ariel Darvasi, Rachel Bachner-Melman, Florina Uzefovsky, Lior Cohen, Esti Feldman, Elad Lerer, Efrat Laiba, Yael Raz, Lubov Nemanov, Inga Gritsenko, Christian Dina, Galila Agam, Brian Dean, Gary Bornstein, and Richard P. Ebstein, “Individual Differences in Allocation of Funds in the Dictator Game Associated with Length of the Arginine Vasopressin 1a Receptor (AVPR1a) RS3 Promoter Region and Correlation between RS3 Length and Hippocampal mRNA,” Genes, Brain and Behavior, 7 (2008), 266–275.
Koenig, Laura B., Matt McGue, Robert F. Krueger, and Thomas J. Bouchard Jr., “Genetic and Environmental Influences on Religiousness: Findings for Retrospective and Current Religiousness Ratings,” Journal of Personality, 73 (2005), 471–488.
Kuhnen, Camelia M., and Joan Y. Chiao, “Genetic Determinants of Financial Risk Taking,” PLoS ONE, 4 (2009).
Kuhnen, Camelia M., and Brian Knutson, “The Neural Basis of Financial Risk Taking,” Neuron, 47 (2005), 763–770.
Levitt, Steven D., and John A. List, “What Do Laboratory Experiments Measuring Social Preferences Tell Us about the Real World?” Journal of Economic Perspectives, 21 (2007), 153–174.
Liang, Kung-Yee, and Scott L. Zeger, “Longitudinal Data Analysis Using Generalized Linear Models,” Biometrika, 73 (1986), 13–22.
Lichtenstein, Paul, Patrick F. Sullivan, Sven Cnattingius, Margaret Gatz, Sofie Johansson, Eva Carlström, Camilla Björk, Magnus Svartengren, Alicja Volk, Lars Klareskog, Ulf de Faire, Martin Schalling, Juni Palmgren, and Nancy L. Pedersen, “The Swedish Twin Registry in the Third Millennium: An Update,” Twin Research and Human Genetics, 9 (2006), 875–882.
List, John A., “On the Interpretation of Giving in Dictator Games,” Journal of Political Economy, 115 (2007), 482–494.
Loehlin, John C., “Some Methodological Problems in Cattell’s Multiple Abstract Variance Analysis,” 72 (1965), 156–161.
——, “Resemblance in Personality and Attitudes between Parents and Their Children: Genetic and Environmental Contributions,” in Unequal Chances: Family Background and Economic Success, Samuel Bowles, Herbert Gintis, and Melissa Osborne Groves, eds. (Princeton, NJ: Princeton University Press, 2005).
Loh, Cheng Yin, and John M. Elliot, “Cooperation and Competition as a Function of Zygosity in 7- to 9-Year-Old Twins,” Evolution and Human Behavior, 19 (1998), 397–411.
Lykken, David T., Matthew K. McGue, and Auke Tellegen, “Recruitment Bias in Twin Research: The Rule of Two-Thirds Reconsidered,” Behavior Genetics, 17 (1986), 343–362.
Martin, Nicholas G., Lyndon J. Eaves, Andrew C. Heath, Rosemary Jardine, Lynn M. Feingold, and Hans J. Eysenck, “Transmission of Social Attitudes,” Proceedings of the National Academy of Sciences, 83 (1986), 4364–4368.
Mather, Kenneth, and John L. Jinks, An Introduction to Biometrical Genetics (London: Chapman and Hall, 1977).
Molenaar, Peter C. M., Dorret I. Boomsma, and Conor V. Dolan, “A Third Source of Developmental Differences,” Behavior Genetics, 23 (1993), 519–524.
Moll, Jorge, Frank Krueger, Roland Zahn, Matteo Pardini, Ricardo de Oliveira-Souza, and Jordan Grafman, “Human Fronto-Mesolimbic Networks Guide Decisions about Charitable Donation,” Proceedings of the National Academy of Sciences, 103 (2006), 15,623–15,628.
Muthén, Linda K., and Bengt O. Muthén, “Mplus. Statistical Analysis with Latent Variables. User’s Guide,” Version 4.1 (Los Angeles, CA: 2006).
Neale, Michael C., and Hermine H. M. Maes, Methodology for Genetic Studies of Twins and Families (Dordrecht, the Netherlands: Kluwer Academic, 2004).
Neisser, Ulric, Gwyneth Boodoo, Thomas J. Bouchard Jr., A. Wade Boykin, Nathan Brody, Stephen J. Ceci, Diane F. Halpern, John C. Loehlin, Robert Perloff, Robert J. Sternberg, and Susana Urbina, “Intelligence: Knowns and Unknowns,” American Psychologist, 51 (1996), 77–101.
Pam, Alvin, Susan S. Kemker, Colin A. Ross, and R. Golden, “The Equal Environments Assumption in MZ–DZ Twin Comparisons: An Untenable Premise of Psychiatric Genetics?” Acta Geneticae Medicae et Gemellologiae (Roma), 45 (1996), 349–360.
Penke, Lars, Jaap J. A. Denissen, and Geoffrey F. Miller, “The Evolutionary Genetics of Personality,” European Journal of Personality, 21 (2007), 549–587.
Plomin, Robert, Kathryn Asbury, P. G. Dip, and Judith Dunn, “Why Are Children in the Same Family So Different? Unshared Environment a Decade Later,” Canadian Journal of Psychiatry, 46 (2001b), 225–233.
Plomin, Robert D., and Denise Daniels, “Why Are Children in the Same Family So Different from Each Other?” Behavioral and Brain Sciences, 10 (1987), 1–16.
Plomin, Robert D., John C. DeFries, Gerald E. McClearn, and Peter McGuffin, Behavioral Genetics, 4th ed. (New York: Freeman, 2001a).
Plug, Erik, and Wim Vijverberg, “Schooling, Family Background, and Adoption: Is It Nature or Is It Nurture?” Journal of Political Economy, 111 (2003), 611–641.
Posner, Samuel F., Laura Baker, Andrew Heath, and Nicholas G. Martin, “Social Contact and Attitude Similarity in Australian Twins,” Behavior Genetics, 26 (1996), 123–133.
Rushton, J. Philippe, “Genetic and Environmental Contributions to Pro-social Attitudes: A Twin Study of Social Responsibility,” Proceedings of the Royal Society B, 271 (2004), 2583–2585.
Rushton, J. Philippe, David W. Fulker, Michael C. Neale, David K. B. Nias, and Hans J. Eysenck, “Altruism and Aggression: The Heritability of Individual Differences,” Journal of Personality and Social Psychology, 50 (1986), 1192–1198.
Sacerdote, Bruce, “The Nature and Nurture of Economic Outcomes,” American Economic Review, 92 (2002), 344–348.
——, “How Large Are the Effects from Changes in Family Environment? A Study of Korean American Adoptees,” Quarterly Journal of Economics, 122 (2007), 119–157.
Sala-i-Martin, Xavier, “The World Distribution of Income: Falling Poverty and . . . Convergence, Period,” Quarterly Journal of Economics, 121 (2006), 351–397.
Scurrah, Katrina J., Lyle J. Palmer, and Paul R. Burton, “Variance Components Analysis for Pedigree-Based Censored Survival Data Using Generalized Linear Mixed Models (GLMMs) and Gibbs Sampling in BUGS,” Genetic Epidemiology, 19 (2000), 127–148.
Segal, Nancy L., and Scott L. Hershberger, “Cooperation and Competition between Twins: Findings from a Prisoner’s Dilemma Game,” Evolution and Human Behavior, 20 (1999), 29–51.
Spiegelhalter, David J., Nicola G. Best, Bradley P. Carlin, and Angelika van der Linde, “Bayesian Measures of Model Complexity and Fit,” Journal of the Royal Statistical Society, Series B (Statistical Methodology), 64 (2002), 583–639.
Stoel, Reinoud D., Eco J. C. De Geus, and Dorret I. Boomsma, “Genetic Analysis of Sensation Seeking with an Extended Twin Design,” Behavior Genetics, 36 (2006), 229–237.
Taubman, Paul, “The Determinants of Earnings: Genetics, Family, and Other Environments: A Study of White Male Twins,” American Economic Review, 66 (1976), 858–870.
Thompson, Paul, Tyrone D. Cannon, Kathrine L. Narr, Theo van Erp, Veli-Pekka Poutanen, Matti Huttunen, Jouko Lönnqvist, Carl-Gustaf Standertskjöld-Nordenstam, Jaakko Kaprio, Mohammad Khaledy, Rajneesh Dail, Chris I. Zoumalan, and Arthur W. Toga, “Genetic Influences on Brain Structure,” Nature Neuroscience, 4 (2001), 1–6.
Toga, Arthur W., and Paul M. Thompson, “Genetics of Brain Structure and Intelligence,” Annual Review of Neuroscience, 28 (2005), 1–23.
True, William R., Andrew C. Heath, Jeffrey F. Scherrer, Brian Waterman, Jack Goldberg, Nong Lin, Seth A. Eisen, Michael J. Lyons, and Ming T. Tsuang, “Genetic and Environmental Contributions to Smoking,” Addiction, 92 (1997), 1277–1288.
Turkheimer, Eric, “Three Laws of Behavior Genetics and What They Mean,” Current Directions in Psychological Science, 9 (2000), 160–164.
Turkheimer, Eric, Andreana Haley, Brian D’Onofrio, Mary Waldron, and I. Irving Gottesman, “Socioeconomic Status Modifies Heritability of IQ in Young Children,” Psychological Science, 14 (2003), 623–628.
Van den Berg, Stéphanie M., Leo Beem, and Dorret I. Boomsma, “Fitting Genetic Models Using Markov Chain Monte Carlo Algorithms with Bugs,” Twin Research and Human Genetics, 9 (2006), 334–342.
Wallace, Björn, David Cesarini, Paul Lichtenstein, and Magnus Johannesson, “Heritability of Ultimatum Game Responder Behavior,” Proceedings of the National Academy of Sciences, 104 (2007), 15,631–15,634.
REVISITING THE GERMAN WAGE STRUCTURE∗
CHRISTIAN DUSTMANN
JOHANNES LUDSTECK
UTA SCHÖNBERG
This paper shows that wage inequality in West Germany has increased over the past three decades, contrary to common perceptions. During the 1980s, the increase was concentrated at the top of the distribution; in the 1990s, it occurred at the bottom end as well. Our findings are consistent with the view that both in Germany and in the United States, technological change is responsible for the widening of the wage distribution at the top. At the bottom of the wage distribution, the increase in inequality is better explained by episodic events, such as supply shocks and changes in labor market institutions. These events happened a decade later in Germany than in the United States.
I. INTRODUCTION
The United States witnessed a sharp increase in wage and earnings inequality during the 1980s (e.g., Bound and Johnson [1992]; Levy and Murnane [1992]; Murphy and Welch [1992]; Juhn, Murphy, and Pierce [1993]; Katz and Murphy [1992]; Acemoglu [2002]). Upper-tail inequality, measured as the 90–50 wage gap, continued to rise at a similar pace during the 1990s, whereas lower-tail inequality, measured as the 50–10 wage gap, has been falling or flat since the late 1980s (e.g., Autor, Katz, and Kearney [2008]).1 A similar increase in inequality in the 1980s has also been observed in other Anglo-Saxon countries, such as the United Kingdom (e.g., Gosling, Machin, and Meghir [2000]) and Canada (e.g., Boudarbat, Lemieux, and Riddell [2006]). In contrast, most countries in continental Europe seem to have witnessed much smaller increases in inequality in the 1980s, or no increases at all (see, for example, Freeman and Katz [1995] and OECD [1996] for a summary of trends in inequality in European countries). In particular, West Germany, the third largest economy and the largest exporter in the world, has been singled out as a country characterized by a stable wage distribution during the 1980s (see, for example, Steiner and Wagner [1998]; Prasad [2004]).2 Numerous scholars cite this stability as evidence against the hypothesis that the growth of inequality observed in the United States and United Kingdom is primarily due to skill-based technological change, as firms in continental Europe had access to the same technologies as firms in the United States or United Kingdom (e.g., Card, Kramarz, and Lemieux [1999]; Piketty and Saez [2003]; Saez and Veall [2005]). Possible explanations for this puzzle include a larger expansion in the relative supply of the high-skilled in Germany (e.g., Abraham and Houseman [1995]; Acemoglu [2003]), unions and other labor market institutions (e.g., Krugman [1994]; Abraham and Houseman [1995]),3 and, more recently, social norms (e.g., Piketty and Saez [2003]).
This paper revisits the changes in the wage structure in (West) Germany over the past three decades, between 1975 and 2004. Most existing studies on the German wage structure, such as OECD (1996) and Steiner and Wagner (1998), are based on the German Socio-Economic Panel (GSOEP). We instead use a 2% random sample of social security records, the IABS. We show that the common perception that Germany’s wage structure remained largely stable during the 1980s is inaccurate. We find that wage inequality increased in the 1980s, but mostly in the top half of the distribution. In the early 1990s, wage inequality also started to rise in the bottom half of the distribution. This pattern holds for both men and women.4 Our analysis highlights that, while the United States and Germany experienced similar changes at the top of the distribution during the 1980s and 1990s, the two countries differed markedly at the lower end of the wage distribution. The rise in lower-tail inequality happened in the 1980s in the United States, but in the 1990s in Germany.
© 2009 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, May 2009
∗ We would like to thank our editor, three referees, David Autor, David Card, Bernd Fitzenberger, Thomas Lemieux, Alexandra Spitz-Oener, and seminar participants at the Australian National University, ESPE, Frankfurt University, the Institute for Employment Research (IAB), Mannheim University, and the University of Melbourne for helpful comments and suggestions. We gratefully acknowledge financial support from the German Research Foundation (DFG) and the Anglo-German Foundation (AGF). We thank Bernd Fitzenberger, Alexandra Spitz-Oener, and Joachim Wagner for sharing their programs and/or data with us.
1. Lemieux (2006b, 2008) also emphasizes that the increase in inequality in the United States is increasingly concentrated at the top of the wage distribution.
2. Drawing on a variety of data sources, Atkinson (2008) illustrates developments in earnings inequality in Germany dating back to the 1920s. His figures show some increase in overall earnings dispersion over the past two decades.
3. Acemoglu (2002) emphasizes an interesting link between technological change and institutions. If unions compress wages, then firms have greater incentives to adopt labor-complementary technologies, which will reinforce wage compression.
4. The first of these findings has also been documented by Fitzenberger (1999), using an earlier version of our data for the years 1975–1990. The second finding is in line with recent papers by Kohn (2006) and Gernandt and Pfeiffer (2006), who document a similar increase in lower-tail inequality in the IABS and GSOEP, respectively. However, we are not aware of any paper that jointly analyzes changes in inequality in both the 1980s and 1990s and compares these trends with those in the United States.
REVISITING THE GERMAN WAGE STRUCTURE
We investigate several explanations for the changes in wage inequality in Germany. First, we use the kernel reweighting procedure first proposed by DiNardo, Fortin, and Lemieux (1996) to analyze whether the changes in inequality are explained by mechanical changes in the workforce composition, or whether they reflect changes in skill prices. In line with Lemieux (2006a), we show that it is important to account for changes in workforce composition, in particular at the upper end of the wage distribution. However, these changes cannot fully account for the divergent paths of upper- and lower-tail inequality in the 1980s, or for the divergent paths of lower-tail inequality in the 1980s and 1990s. Second, we document a sharp decline in unionization rates in the late 1990s: the share of workers covered by union agreements declined from 87.3% in 1995 to 72.8% in 2004. There is little evidence of a similar decline during the 1980s. Using the same decomposition method as above, we find that between 1995 and 2004, de-unionization can account for 28% of the rise in inequality at the lower end of the wage distribution, but only 11% at the upper end. Third, we document a rise in the wage differential of medium-skilled workers (i.e., those with an apprenticeship degree) relative to the low-skilled (i.e., those with no postsecondary education) starting in the late 1980s, around the same time that lower-tail inequality started to increase. There is, however, no clear trend in the wage differential of high-skilled workers (i.e., those with a college degree) relative to the medium-skilled. We also document that the decline in the share of the low-skilled started to slow down in the late 1980s, whereas the share of the high-skilled increased at a roughly linear rate from 4.7% in 1975 to 14.8% in 2004.
Using a nested constant elasticity of substitution (CES) production framework based on that of Goldin and Katz (2007b, 2008), we show that fluctuations in relative supply explain the evolution of the wage differential between the low- and medium-skilled very well, but do a poor job of predicting the evolution of the wage differential between the medium- and high-skilled. Fourth, building on the analysis of Spitz-Oener (2006), we provide evidence that is consistent with a polarization of work: during the 1980s and 1990s, occupations with high median wages in 1980 experienced the highest growth rate, whereas occupations in the middle of the 1980 wage distribution lost ground relative to occupations at the bottom. Moreover, occupations at the high end of the 1980 wage distribution predominantly use nonroutine
analytic and interactive skills, whereas routine task usage is highest in the upper middle of the wage distribution. This is consistent with Autor, Levy, and Murnane’s (2003) hypothesis that computer technology decreases the demand for jobs that require routine manual or clerical skills (and are found in the middle of the wage distribution) and increases the demand for jobs that require nonroutine cognitive and interpersonal skills (and are found at the top of the wage distribution). This paper thus adds to the growing evidence that technology does not simply increase the demand for skilled labor relative to that of unskilled labor, but instead asymmetrically affects the bottom and the top of the wage distribution (see, for example, Autor, Katz, and Kearney [2006, 2008] for the United States and Goos and Manning [2007] for the United Kingdom). This may begin to supply the unifying international evidence on technological change that so far has been absent. The evidence provided in this paper is consistent with the idea that technological change is an important driving force behind the widening of the wage distribution, particularly at the top. This conclusion is reinforced by our finding that for occupations above the median, employment and wage changes by wage percentile are positively correlated. In contrast, below-median employment and wage changes are negatively correlated. The rise in lower-tail inequality may therefore be better explained by episodic events, such as supply shocks and changes in labor market institutions. We argue that these shocks happened a decade later in Germany than in the United States. The plan of this paper is as follows. Section II describes the data used for the analysis. Section III documents the major changes in the German wage structure between 1975 and 2004. 
We then analyze four possible explanations for the increase in inequality: changes in the workforce composition (Section IV.A), a potential decline in unionization (Section IV.B), supply shocks (Section IV.C), and polarization (Section IV.D). We conclude with a discussion of our findings in Section V.

II. DATA DESCRIPTION

Our empirical analysis is based on two data sets: the IABS, a 2% random sample of social security records, and the LIAB, a linked employer–employee data set. We describe each data set in turn.
II.A. The IABS: A 2% Random Sample of Social Security Records, 1975–2004

Our main data set is a 2% sample of administrative social security records in Germany for the years 1975 to 2004. The data are representative of all individuals covered by the social security system, roughly 80% of the German workforce. The sample excludes the self-employed, civil servants, individuals currently doing their (compulsory) military service, and individuals in so-called “marginal jobs” (i.e., jobs with at most fifteen hours per week or temporary jobs that last no longer than six weeks). This data set (or earlier versions of it) has been used to study wage inequality by, among others, Steiner and Wagner (1998), Fitzenberger (1999), Möller (2005), Fitzenberger and Kohn (2006), and Kohn (2006). The IABS has several advantages over the German Socio-Economic Panel, the data set most often used to analyze trends in inequality in Germany (e.g., OECD [1996]; Steiner and Wagner [1998]; Prasad [2004]). First, the IABS is available from 1975 onward, as opposed to 1984 for the GSOEP. Second, the sample size is much larger (more than 200,000 observations per year, as opposed to around 2,000 in the GSOEP). Third, wages are likely to be measured much more precisely in the IABS than in the GSOEP, as misreporting by firms in the IABS is subject to severe penalties. Fourth, attrition rates in the GSOEP are large enough that results may not be representative of the population as a whole (see, e.g., Spieß and Pannenberg [2003]). In contrast, although workers can also be followed over time in the IABS, each year the original sample is supplemented by a random sample of new labor market entrants. This guarantees that the IABS is representative of workers who pay social security contributions. The main disadvantage of the IABS is that it is right-censored at the highest level of earnings subject to social security contributions.
Overall, each year between 9.4% and 14.2% of the male wage distribution is censored. Because of censoring, this paper mostly focuses on the changes in the uncensored part of the wage distribution, up to the 85th percentile. In the United States, much of the action in rising wage inequality since the mid-1980s has been above the 85th percentile (e.g., Piketty and Saez [2003]; Autor, Katz, and Kearney [2008]); consequently, topcoding in our data may lead us to substantially understate inequality growth. Another difficulty in our data is a structural break in the wage measure in 1984. From 1984 on, our measure includes
bonus payments as well as other one-time payments (Steiner and Wagner [1998]). We follow Fitzenberger (1999) and correct for the break (see Appendix I.A for details). Further, our data set does not contain precise information on the number of hours worked; we only observe whether a worker is working full- or part-time (part-time being defined as working less than 30 hours per week). We therefore restrict the wage analysis to full-time workers and use the daily wage, averaged over the number of days the worker was working in the year, as our wage measure. Robustness checks against the GSOEP suggest that this does not affect our results. From this database, we select all men and women between 21 and 60 years of age. Because the level and structure of wages differ substantially between East and West Germany, we concentrate on West Germany (which we usually refer to simply as Germany). Although we provide a descriptive overview of the evolution of inequality for both men and women, our main analysis focuses on men only. Further details on the sample selection and variable description can be found in Appendix I.B.

II.B. The LIAB: Linked Employer–Employee Data, 1995–2004

The data set just described provides no information on union coverage, and thus cannot be used to analyze the impact of de-unionization on the wage structure. Our analysis of de-unionization is therefore based on the LIAB, a linked employer–employee data set provided by the Institute for Employment Research (IAB). It combines information from the IAB Establishment Panel with information on all workers who were employed in one of these firms as of the 30th of June. The information on workers is drawn from the same social security records as our main data. A detailed description of this data set can be found in Herrlinger, Müller, and Bellmann (2005). Although data are available from 1993 to 2004, we use only waves from 1995 onward, for which information on union recognition is consistent.
In Germany, a firm recognizes a union either by joining an employer federation (Arbeitgeberverband) or by engaging in bilateral negotiations with the union. In the first case, union wages are negotiated at a regional and industry level, typically on an annual basis. Our union variable distinguishes between firm- and industry-level agreements. The IAB Establishment Panel oversamples large establishments. To make our results representative of the German economy as a whole, we weight our results using the cross-sectional weights provided by the LIAB. In Table B.1 in Online Appendix B,
we compare median wages as well as interquantile differences for men in the LIAB and the IABS. The two data sources draw a very similar picture of the developments in the wage structure over this period.

III. TRENDS IN WAGE INEQUALITY

Next, we describe the major changes in wage inequality in Germany from 1975 to 2004 (Section III.A). We then compare our findings with those reported in other studies in Section III.B. Because of wage censoring, we focus on the changes in the uncensored part of the wage distribution and impose no distributional assumptions on the error term in the wage regression. However, some of our findings, such as the evolution of the standard deviation of log wages and log wage residuals, require distributional assumptions. We assume that the error term is normally distributed, with a different variance for each education group and each age group, and impute the censored part of the wage distribution under this assumption. We prefer to work with imputed wages rather than with censored wages because wage residuals can be computed in a straightforward manner. A comparison between OLS estimates based on imputed wages and Tobit estimates based on censored wages shows that both the estimates and the standard errors are almost identical. More details on the imputation method can be found in Appendix I.C. We have conducted extensive robustness checks regarding alternative distributional assumptions, including an upper-tail Pareto distribution. Our results are highly robust to alternative imputation methods. Findings for alternative imputation methods can be found in Section 1 of the Online Appendix.

III.A. Basic Facts

Standard Deviation of Log Wages. Figure I displays the evolution of the standard deviations of log wages and log wage residuals. Panel A refers to men, Panel B to women. The standard deviation is obtained from standard OLS regressions on imputed wages, estimated separately for each year.
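The imputation of censored wages described earlier in this section can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' procedure (which is documented in Appendix I.C): it takes the log censoring point (`cap`) and each worker's cell-specific mean and standard deviation of log wages as already known, whereas in practice these would be estimated from the censored data, for example by a Tobit regression. The function names are hypothetical. Each top-coded observation is replaced by a draw from the normal distribution truncated from below at the censoring point, using the inverse-CDF method.

```python
import random
from statistics import NormalDist


def impute_censored(mean, sd, cap, rng):
    """Draw a log wage from N(mean, sd**2) truncated from below at `cap`.

    Inverse-CDF method: draw u uniformly on [F(cap), 1) and return F^{-1}(u),
    where F is the normal CDF with the given mean and standard deviation.
    """
    dist = NormalDist(mean, sd)
    lower = dist.cdf(cap)  # probability mass below the censoring point
    u = rng.uniform(lower, 1.0)
    # Keep u strictly inside (0, 1); inv_cdf is undefined at 0 and 1.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return dist.inv_cdf(u)


def impute_sample(obs, cap, seed=0):
    """Impute top-coded log wages (hypothetical helper).

    obs: list of (log_wage, cell_mean, cell_sd) tuples, where cell_mean and
         cell_sd are the assumed-known mean and standard deviation of log
         wages in the worker's education/age cell.
    cap: log of the censoring point; observations at or above it are censored.
    """
    rng = random.Random(seed)  # seeded so the imputation is reproducible
    imputed = []
    for log_wage, cell_mean, cell_sd in obs:
        if log_wage >= cap:
            imputed.append(impute_censored(cell_mean, cell_sd, cap, rng))
        else:
            imputed.append(log_wage)  # uncensored values are kept as observed
    return imputed


if __name__ == "__main__":
    sample = [(4.0, 4.2, 0.3), (5.0, 4.2, 0.3), (5.0, 4.8, 0.5)]
    print(impute_sample(sample, cap=5.0, seed=1))
```

Drawing from the truncated distribution, rather than assigning every censored worker the same value, preserves dispersion above the cap, which matters for statistics such as the standard deviation of log wages.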
We control for three education categories, eight age categories, and all possible interactions between these two. For men, the figure shows a continuous rise in both overall and residual inequality during the 1980s, with an acceleration in the 1990s. A simple within–between decomposition indicates that the majority of the increase in inequality
[FIGURE I panels. Panel A: Men. Panel B: Women. Each panel plots the standard deviation of log wages and of log wage residuals (vertical axis, 0.25 to 0.5) by year (horizontal axis, 1975 to 2003).]
FIGURE I Evolution of the Standard Deviation of Log Wages and Log Wage Residuals Source. 2% IABS Sample for full-time workers between 21 and 60 years of age. The figures plot the evolution of the standard deviation of log wages and log wage residuals. Results are based on imputed wages that assume that the error term in the log wage regression is normally distributed, with a different variance for each education and each age group. Regressions control for three education categories and eight age categories, as well as all possible interactions between these two variables.
occurred within age and education groups (86% between 1975 and 1989, and 65% between 1990 and 2004). For women, in contrast, the standard deviation of log wages and log wage residuals remained roughly constant during the 1980s, and started to increase only in the mid-1990s. A further difference between men and women is that age and education explain a smaller portion of the overall variance of log wages for women. As with men, most (82%) of the increase in overall inequality between 1990 and 2004 is due to a rise in within-group inequality. The Top versus the Bottom. Next, we separately analyze changes in inequality at the bottom and top of the wage distribution. Figure II displays the wage growth of the 15th, 50th, and 85th percentiles of the wage distribution. We distinguish between the pre- and postunification periods (1975 to 1989 and 1990 to 2004). For men, the 15th and 50th percentiles evolved similarly between 1975 and 1989, and increased by about 16%. Over the same time period, the 85th percentile rose by 27.2% (Panel A). The picture looks very different during the 1990s (Panel C): between 1993 and 2004, the 15th percentile declined by almost 5%, whereas the 50th and 85th percentiles increased by 4% and 13%, respectively. The pattern for women is somewhat different: between 1975 and 1989, wage gains were highest for the 15th percentile (about 25%, compared to only 16% for men). Over the same time
[FIGURE II panels. Panel A: Men, 1975–1989. Panel B: Women, 1975–1989. Panel C: Men, 1990–2004. Panel D: Women, 1990–2004. Each panel plots the indexed wage growth of the 15th, 50th, and 85th percentiles (vertical axis) by year (horizontal axis).]
FIGURE II Indexed Wage Growth of the 15th, 50th, and 85th Percentiles: The Pre- versus the Postunification Period Source. 2% IABS Sample for full-time workers between 21 and 60 years of age. The figures show the indexed (log) real wage growth of the 15th, 50th, and 85th percentiles of the wage distribution. Panels A and B refer to the pre-unification period between 1975 and 1989, with 1975 as the base year. Panels C and D refer to the post-unification period between 1990 and 2004, with 1990 as the base year.
period, both the 50th and the 85th percentile grew by about 22%, compared to 16% and 27% for men (Panel B). In the postunification period, in contrast, wages at the 15th percentile stagnated, while the 85th percentile experienced the highest wage growth (17%, Panel D). Unlike the 1980s, in the 1990s wages of women caught up to those of men throughout the entire wage distribution. Figure III illustrates the divergent developments of the lower and upper ends of the wage distribution during the 1980s and 1990s in a different manner. It shows log real wage growth across the wage distribution, for the period between 1980 and 1990, as well as between 1990 and 2000. In the 1980s, male wages grew across the distribution, but substantially more so at the upper than at the lower tail. Wage growth accelerates beyond the 65th percentile. In contrast, between 1990 and 2000, wage growth has been negative below the 18th percentile, with wage losses at the 5th percentile of more than 10 log wage points. Starting from the
[FIGURE III panels. Panel A: Men. Panel B: Women. Each panel plots the change in log real wage (vertical axis) by percentile (horizontal axis, 5th to 95th) for 1980–1990 and 1990–2000.]
FIGURE III Wage Growth by Percentile: The 1980s versus the 1990s Source. 2% IABS Sample for full-time workers between 21 and 60 years of age. The figures plot wage growth by percentile from 1980 to 1990 and from 1990 to 2000. Due to censoring, we plot wage growth for men up to the 85th percentile only.
15th percentile, wage growth increases roughly linearly along the wage distribution (Panel A). For women (Panel B), the 1980s are characterized by wage compression at the lower tail of the wage distribution, whereas wage growth at the very top (i.e., the 95th percentile) exceeds that at the median by about 6%.⁵ In the 1990s, in contrast, wage growth increases roughly linearly along the wage distribution.

5. Because for women less than 5% of wages are censored, we plot wage growth up to the 95th percentile.

How do these findings compare with developments in the United States? Both countries show an increase in inequality at the top of the wage distribution during the 1980s and 1990s, although in Germany the increase is more pronounced for men than for women. The two countries differ sharply with respect to developments at the bottom of the wage distribution. In the United States, the 50–10 wage gap rose substantially in the 1980s, but ceased to increase in the 1990s. In Germany, the pattern is reversed. What about the magnitude of the changes? Because our wage measure is the full-time daily wage, our findings are probably most comparable to those based on the March CPS for weekly full-time earnings. Autor, Katz, and Kearney (2008) report that between 1975 and 2004, the difference between the 90th and 50th percentiles of the male earnings distribution increased by about one log point per year (their Figure III). We find that over the same time period, the 85–50 wage gap in Germany rose by about 0.6 log points per year. However, it is important to bear in mind that in the United States, much of the action in rising wage inequality
since the mid-1980s has been above the 85th percentile.6 Hence, topcoding of our data could lead us to substantially underestimate the rise in upper-tail inequality during the 1980s and 1990s. III.B. Comparison with Existing Studies These results seem to contradict the usual view that wage inequality in Germany has been largely stable over the past two decades, and in particular during the 1980s. What explains this discrepancy? The reason is that the majority of existing studies on inequality trends in Germany, such as OECD (1996), Steiner and Wagner (1998), and Prasad (2004), are based on a different data set, the German Socio-Economic Panel. Studies based on the IABS are generally consistent with our findings. In particular, Fitzenberger (1999) emphasizes that wage inequality rose during the 1980s, and that the increase was concentrated at the top of the distribution. His study uses data from 1975 to 1990 only and was therefore not able to detect the large increase in lower-tail inequality in the 1990s.7 Existing studies based on the GSOEP and our study based on the IABS thus seem to draw different pictures of the trends in inequality in Germany. We have investigated three possible explanations for the discrepancy between our findings and those based on the GSOEP. First, the GSOEP includes civil servants and the self-employed, but these workers are excluded from the IABS. Second, the wage measure in the IABS includes bonuses as well as other one-time annual payments. In contrast, studies based on the GSOEP typically do not include one-time payments, although they are available. Third, and most importantly, most studies based on the GSOEP construct an hourly wage rate, whereas the wage measure in the IABS is a daily wage. Here, we provide only a brief overview, focusing on men. A detailed comparison between the GSOEP and IABS can be found in Section 3 of the Online Appendix. 
6. See, for example, Piketty and Saez (2003), Dew-Becker and Gordon (2005), Goldin and Katz (2007a), Gordon and Dew-Becker (2007), and Autor, Katz, and Kearney (2008).
7. Other studies using the IABS focus on other aspects of the wage structure. For instance, Kohn (2006) concentrates on the recent developments in the 1990s as well as differences between East and West Germany (see also Möller [2005]), whereas Fitzenberger and Kohn (2006) analyze trends in the returns to education.

Our findings indicate similar trends in inequality whether or not we include civil servants or the self-employed, and whether or not we include bonuses and other one-time payments in our wage measure. Importantly, inequality
trends based on monthly wages are also similar to those based on hourly wages. Differences between the GSOEP and IABS are therefore not adequately explained by differences in the sample used or in the wage measure. Our analysis further indicates that inequality rose during the 1990s in the GSOEP, in particular at the bottom of the wage distribution, which has also been stressed by Gernandt and Pfeiffer (2006). Our analysis also highlights that measures of inequality are very noisily estimated in the GSOEP. The changes in the 50–15 and 85–50 wage gaps, as well as the changes in the standard deviation of log wages between two years, observed in the IABS are almost always within the 95% confidence intervals of those observed in the GSOEP. For instance, using the specification that most closely resembles that in the IABS, the 95% confidence intervals for the changes in the 50–15 and 85–50 wage gaps between 1993 and 2002 are [0.044, 0.154] and [−0.039, 0.103], respectively. Over the same period, the 50–15 and 85–50 wage gaps rose by 0.059 and 0.058 in the IABS. Given the large standard errors in the GSOEP, it is not surprising that earlier studies, such as the 1996 OECD Employment Report, concluded that the German wage structure was largely stable from the mid-1980s to the mid-1990s. Next, we explore several explanations for the rising wage inequality in Germany. Here, we restrict the analysis to men, for two reasons. First, female labor force participation rates rose considerably during the 1980s and 1990s; this is likely to have changed the selection of women into work, which may have had an independent impact on the female wage structure.⁸ Second, although the basic patterns in the wage structure (i.e., upper-tail inequality increased during the 1980s and 1990s, whereas lower-tail inequality mostly increased in the 1990s) are similar for men and women, there are also important differences.
For instance, wage gains are substantially larger for women than for men, especially in the 1990s. Moreover, the increase in upper-tail inequality is more pronounced for men than for women, especially in the 1980s. Explaining these differences between men and women would be beyond the scope of this paper.
8. Mulligan and Rubinstein (2008) demonstrate that in the United States it is important to account for the changing selection of women into the workforce when computing male–female wage differentials.
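The GSOEP–IABS comparison above can be made concrete with the numbers reported in the text. The sketch below checks whether each IABS change lies inside the corresponding GSOEP 95% confidence interval and backs out the standard error implied by that interval. The implied standard error rests on an assumption that is ours, not the source's: that the intervals are symmetric and normal-based, so the half-width equals 1.96 standard errors. The names `compare`, `GSOEP_CI`, and `IABS_CHANGE` are hypothetical.

```python
# 95% confidence intervals for GSOEP changes, 1993-2002 (reported in the text)
GSOEP_CI = {"50-15": (0.044, 0.154), "85-50": (-0.039, 0.103)}
# Changes in the same wage gaps over the same period in the IABS
IABS_CHANGE = {"50-15": 0.059, "85-50": 0.058}


def compare(ci_by_gap, change_by_gap, z=1.96):
    """For each wage gap, report whether the IABS change lies in the GSOEP CI,
    the CI midpoint, and the standard error implied by a symmetric normal CI."""
    results = {}
    for gap, (lo, hi) in ci_by_gap.items():
        results[gap] = {
            "inside_ci": lo <= change_by_gap[gap] <= hi,
            "midpoint": (lo + hi) / 2,
            "implied_se": (hi - lo) / (2 * z),
        }
    return results


for gap, r in compare(GSOEP_CI, IABS_CHANGE).items():
    print(gap, r["inside_ci"], round(r["midpoint"], 3), round(r["implied_se"], 3))
# 50-15 True 0.099 0.028
# 85-50 True 0.032 0.036
```

Under this assumption, the implied standard error for the 85–50 change is about 3.6 log points, so the GSOEP interval includes zero: a stable 85–50 gap cannot be rejected in that sample even though the IABS records a rise of about 6 log points.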
IV. WHY DID WAGE INEQUALITY INCREASE?

IV.A. The Role of Composition and Prices

Is the increase in inequality described in the previous section explained by changes in the workforce composition, or does it reflect changes in skill prices? To see why it is important to account for compositional changes in the workforce, suppose that the variance of log wages is increasing in education and age. If the employment share of educated and older workers increases over time, then this will lead to a mechanical rise in inequality, even if skill prices do not change. Lemieux (2006a) stresses that in the United States, a large fraction of the rise in residual wage inequality between 1973 and 2003 (and all of it since 1988) can be attributed to such changes in the workforce composition.

This section employs the kernel reweighting approach developed by DiNardo, Fortin, and Lemieux (1996) to recover the counterfactual wage distribution that we would have observed if the workforce composition had remained unchanged. Like Autor, Katz, and Kearney (2008), we focus on the divergent paths of upper- and lower-tail inequality in the 1980s and 1990s, rather than on the variance of log wage residuals, as does Lemieux (2006a). The following expressions decompose the observed density of log wages $w$ in years $t$ and $t'$ into a “price” function $g(\cdot)$ and a “composition” function $h(\cdot)$ (see also Autor, Katz, and Kearney [2008]):

$$f(w \mid t) = \int g_t(w \mid x, T = t)\, h_t(x \mid T = t)\, dx \quad \text{and} \quad f(w \mid t') = \int g_{t'}(w \mid x, T = t')\, h_{t'}(x \mid T = t')\, dx.$$

Here, $g(w \mid x, T = t)$ is the density of log wages in year $t$ for observable characteristics $x$, and $h(x \mid T = t)$ is the density of characteristics $x$ in year $t$. To compute the counterfactual wage distribution in year $t'$ that would have prevailed if the workforce composition were the same as in year $t$, we simply need to reweight the price function $g_{t'}(\cdot)$ in year $t'$ by the ratio $h_t(\cdot)/h_{t'}(\cdot)$ of the densities of characteristics $x$ in years $t$ and $t'$.⁹ In our application, all regressors (i.e., all possible interactions between three education and eight age groups) are categorical, and the reweighting function is therefore straightforward to compute.

9. This ratio can be calculated as

$$\frac{h(x \mid T = t)}{h(x \mid T = t')} = \frac{\Pr(T = t \mid x)}{\Pr(T = t' \mid x)} \cdot \frac{1 - \Pr(T = t)}{\Pr(T = t)},$$

where the probabilities refer to the pooled sample of years $t$ and $t'$, so that $\Pr(T = t') = 1 - \Pr(T = t)$.
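The reweighting step can be sketched for categorical characteristics, in the spirit of DiNardo, Fortin, and Lemieux (1996). This is a minimal sketch, not the authors' code: all function names are hypothetical, the cells stand in for the education-by-age groups, and the weighted quantile uses one simple convention (the smallest value at which the cumulative weight share reaches q) among several in use. Each year-t' observation receives the weight h(x | T = t)/h(x | T = t'), computed either directly from cell shares or, equivalently, from the pooled-sample probabilities in footnote 9. With continuous characteristics, Pr(T = t | x) would instead have to be estimated, typically with a probit or logit. Like the decomposition in the text, the sketch ignores general equilibrium effects.

```python
from collections import Counter


def cell_shares(cells):
    """Empirical density h(x | T) of categorical characteristics x."""
    n = len(cells)
    return {c: k / n for c, k in Counter(cells).items()}


def dfl_weights(cells_t, cells_tp):
    """Weights h(x|T=t) / h(x|T=t') for each cell observed in year t'.

    Applying them to year-t' wages reproduces the year-t composition.
    Cells that do not occur in year t get weight zero.
    """
    h_t, h_tp = cell_shares(cells_t), cell_shares(cells_tp)
    return {c: h_t.get(c, 0.0) / share for c, share in h_tp.items()}


def dfl_weights_bayes(cells_t, cells_tp):
    """Same ratio via footnote 9, using probabilities in the pooled sample."""
    n_t, n_tp = Counter(cells_t), Counter(cells_tp)
    pr_t = len(cells_t) / (len(cells_t) + len(cells_tp))  # Pr(T = t)
    weights = {}
    for c in n_tp:
        pr_t_given_x = n_t[c] / (n_t[c] + n_tp[c])  # Pr(T = t | x)
        weights[c] = (pr_t_given_x / (1 - pr_t_given_x)) * (1 - pr_t) / pr_t
    return weights


def weighted_quantile(values, weights, q):
    """Smallest value at which the cumulative weight share reaches q."""
    pairs = sorted(zip(values, weights))
    target = q * sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= target:
            return value
    return pairs[-1][0]


def gaps(values, weights):
    """Return the (85-50, 50-15) gaps of a weighted log-wage distribution."""
    p15, p50, p85 = (weighted_quantile(values, weights, q) for q in (0.15, 0.50, 0.85))
    return p85 - p50, p50 - p15


def counterfactual_gaps(sample_tp, cells_t):
    """Gaps in year t' if the workforce composition were that of year t.

    sample_tp: list of (cell, log_wage) pairs observed in year t'.
    cells_t:   list of cells observed in year t.
    """
    weights = dfl_weights(cells_t, [cell for cell, _ in sample_tp])
    wages = [w for _, w in sample_tp]
    return gaps(wages, [weights[cell] for cell, _ in sample_tp])


if __name__ == "__main__":
    # Toy data: cell "a" earns less than cell "b"; year t has more "a" workers.
    sample_tp = [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0)]
    cells_t = ["a", "a", "a", "b"]
    print(gaps([w for _, w in sample_tp], [1.0] * 4))  # observed: (2.0, 1.0)
    print(counterfactual_gaps(sample_tp, cells_t))     # reweighted: (1.0, 1.0)
```

In the toy example, shifting the composition toward the low-wage cell compresses the upper tail (the 85–50 gap falls from 2.0 to 1.0) while leaving the 50–15 gap unchanged, which illustrates how composition can matter more for one tail than for the other.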
This decomposition method applies to calculating counterfactuals for overall inequality. To recover counterfactuals for residual inequality, we replace the pricing function $g_t(w \mid x, T = t)$ with the residual pricing function $g_t(\varepsilon \mid x, T = t)$, where $\varepsilon$ denotes the log wage residual. The residuals are obtained from OLS regressions on imputed wages that control for all possible interactions between three education and eight age groups. Note that we do not need to impose any distributional assumptions on the error term to obtain the uncensored part of the counterfactual distribution of overall inequality. However, distributional assumptions are required to compute the counterfactual distribution of residual inequality. Our results are robust to alternative imputation methods (see Section 1 of the Online Appendix for details). It is also important to stress that the decomposition ignores general equilibrium effects, as it is based on the assumption that changes in quantities do not affect changes in prices.

Table I provides a first overview of how wage dispersion, measured as the 50–15 and 85–50 wage gaps, and employment shares vary by age and education group. We distinguish three education groups, which we label low, medium, and high. The low-skilled are workers who enter the labor market without postsecondary education. The medium-skilled are workers who completed an apprenticeship or a high school degree (Abitur). The high-skilled are workers who graduated from a university or college. Due to severe censoring for the high-skilled, we report only the 50–15 wage gap for this group. Note that this may lead us to understate the increase in within-group inequality, as in the United States much of the growth in inequality is found in the top half of the high-skilled group. Results are based on imputed wages, and cells where the 85th or 50th percentile is censored are marked.
Similar to the United States, wage dispersion is increasing in education and, with the exception of the low-skilled, in age. The share of the low-skilled decreased by 13 percentage points between 1976 and 1990, but only by 3.6 percentage points between 1990 and 2004. The share of the high-skilled rose monotonically from 4.7% in 1976 to 14.7% in 2004. The share of workers below the age of 36 rose from 38.9% in 1976 to 41.6% in 1990 and declined to 30.9% in 2004. Table I also highlights that wage dispersion rose within education and age groups, suggesting that mechanical changes in the workforce composition cannot fully account for the rise in inequality. Between 1976 and 1990, the medium-skilled above the
age of 45 experienced the sharpest rise in inequality, whereas between 1990 and 2004, the rise in inequality is strongest for the young low-skilled. For this group, the 50–15 wage gap increases by more than 20 log points. Here, it is important to stress that our data include only employees covered by the social security system; if temporary and marginal employment were included in the data, the increase might be even larger.

TABLE I
WITHIN-GROUP WAGE DISPERSION BY AGE AND EDUCATION (MEN)

                               Within-group wage dispersion        Worker share
Education   Age     Gap        1976      1990      2004        1976     1990     2004
Low         ≤35     50–15      0.242     0.286     0.500       0.085    0.046    0.034
                    85–50      0.215     0.231     0.354
            36–45   50–15      0.226     0.233     0.395       0.088    0.025    0.029
                    85–50      0.215     0.233     0.275
            >45     50–15      0.227     0.206     0.335       0.083    0.054    0.026
                    85–50      0.217     0.245     0.257
            All     50–15      0.232     0.248     0.474       0.256    0.125    0.089
                    85–50      0.217     0.238     0.294
Medium      ≤35     50–15      0.239     0.241     0.326       0.285    0.336    0.238
                    85–50      0.270     0.269     0.327
            36–45   50–15      0.250     0.268     0.307       0.223    0.188    0.280
                    85–50      0.284     0.346     0.374
            >45     50–15      0.249     0.260     0.314       0.189    0.261    0.247
                    85–50      0.297∗    0.361     0.408
            All     50–15      0.252     0.261     0.327       0.697    0.785    0.764
                    85–50      0.286     0.348     0.379
High        ≤35     50–15      0.313     0.283     0.365       0.019    0.034    0.037
            36–45   50–15      0.400∗    0.344     0.376       0.016    0.029    0.063
            >45     50–15      0.388∗    0.364∗    0.414       0.011    0.027    0.046
            All     50–15      0.426∗    0.343∗    0.410       0.047    0.090    0.147

Source. 2% IABS Sample for men between 21 and 60 years of age working full-time.
Notes. The first three columns of the table report the 50–15 and 85–50 wage gaps for each education/age cell. Results are based on imputed wages. Due to severe censoring for the high-skilled, we report only the 50–15 wage gap for this group. Cells where the 85th (or 50th) percentile is censored are marked (∗). The second set of columns shows the worker share of each cell. The low-skilled are workers who enter the labor market with no postsecondary education. The medium-skilled are workers who completed an apprenticeship or have a high school degree (Abitur). The high-skilled are workers who graduated from a university or college.

Table II reports trends in observed and counterfactual overall and residual inequality. We distinguish three interquantile ranges: 85–15 (Panel A), 85–50 (Panel B), and 50–15 (Panel C). For each wage gap, the first row shows the observed change. The
0.058 0.037 0.039 0.026
0.025 0.018 0.017 0.019
Observed 1980 Xs 1990 Xs 2000 Xs
Observed 1980 Xs 1990 Xs 2000 Xs
0.056 0.061 0.057 0.046
0.051 0.023 0.031 0.035
0.107 0.085 0.087 0.082
1990–2000 0.065 0.056 0.056 0.057 0.039 0.033 0.033 0.035 0.026 0.023 0.023 0.023
Panel B: 85/50 0.167 0.081 0.077 0.060 Panel C: 50/15 0.117 0.114 0.107 0.093
1980–1990
Panel A: 85/15 0.284 0.195 0.184 0.154
1975–2004
0.043 0.045 0.041 0.037
0.041 0.026 0.028 0.027
0.084 0.071 0.069 0.065
1990–2000
Residual inequality
0.092 0.092 0.085 0.077
0.121 0.089 0.090 0.090
0.213 0.181 0.174 0.167
1975–2004
Source. 2% IABS Sample for men between 21 and 60 years of age working full-time. Notes. In each panel, the first row reports the observed change in the difference between the 85th and 15th (Panel A), 85th and 50th (Panel B), and 50th and 15th (Panel C) percentiles of the overall and residual wage distributions. The next rows show the change that would have prevailed if the age and education distributions were the same as in 1980, 1990, or 2000, respectively. The residuals are obtained from an OLS regression on imputed wages that controls for three education and eight age groups as well as the interaction between these two variables. The imputation assumes that the error term in the wage regression is normally distributed with different variances for each education and each age group.
0.083 0.055 0.057 0.045
Observed 1980 Xs 1990 Xs 2000 Xs
1980–1990
Overall inequality
TABLE II OBSERVED VERSUS COMPOSITION-CONSTANT OVERALL AND RESIDUAL WAGE INEQUALITY (MEN)
858 QUARTERLY JOURNAL OF ECONOMICS
REVISITING THE GERMAN WAGE STRUCTURE
859
next rows show the counterfactual change that would have prevailed if the workforce composition were the same as in 1980, 1990, or 2000. The table shows that the overall 85–15 wage gap increased by about 8.3 log points between 1980 and 1990 and by 10.7 log points between 1990 and 2000. If the labor force composition had remained the same as in 1980, the 85–15 wage gap would have risen by 5.5 log points between 1980 and 1990 and by 8.5 log points between 1990 and 2000. The results are similar when we use the workforce composition in 1990 or 2000 to calculate the composition-constant increase in overall inequality. Table II also illustrates that composition effects play a more important role for the upper tail than for the lower tail of the wage distribution. During both the 1980s and 1990s, changes in workforce composition can explain up to 50% of the increase in upper-tail overall inequality, but at most 15% of the increase in lower-tail overall inequality. This differs from findings for the United States, where the impact of changes in workforce composition is concentrated at the lower end of the earnings distribution (Autor, Katz, and Kearney 2008). Turning to residual inequality, the qualitative patterns are very similar. However, composition effects account for a considerably smaller share of the rise in the residual 85–50 wage gap than in the overall 85–50 wage gap (e.g., 15% versus 37% for 1980 characteristics). What are the principal factors that explain the role of composition in increasing upper-tail inequality, rising education or population aging? When we account for changes in the education structure, but not in the age structure, the composition-adjusted increase in the 85–50 wage gap is similar to the one when we additionally account for changes in the age structure, during both the 1980s and 1990s. This suggests that rising education is the driving factor. 
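One common way to hold workforce composition fixed is to reweight a year's sample so that its education-by-age cell shares match those of a base year, in the spirit of DiNardo, Fortin, and Lemieux (1996). The sketch below assumes that cell-reweighting approach with simple frequency weights; it is not necessarily the exact procedure used in this section, and it ignores the censoring and wage imputation the paper relies on. All function names and data are hypothetical.

```python
from collections import Counter


def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight reaches q times the total weight."""
    pairs = sorted(zip(values, weights))
    threshold = q * sum(weights)
    cum = 0.0
    for value, weight in pairs:
        cum += weight
        if cum >= threshold:
            return value
    return pairs[-1][0]


def cell_shares(cells):
    """Employment share of each (education, age) cell."""
    counts = Counter(cells)
    n = len(cells)
    return {cell: k / n for cell, k in counts.items()}


def reweight_to_base(cells_t, cells_base):
    """Weights giving the year-t sample the base-year cell composition.
    Cells absent from the base year receive weight zero."""
    share_t = cell_shares(cells_t)
    share_base = cell_shares(cells_base)
    return [share_base.get(c, 0.0) / share_t[c] for c in cells_t]


def wage_gap(log_wages, weights, upper, lower):
    """Interquantile gap, e.g. upper=0.85, lower=0.15 for the 85-15 gap."""
    return (weighted_quantile(log_wages, weights, upper)
            - weighted_quantile(log_wages, weights, lower))


# Hypothetical example with two education cells.
cells_1980 = ["low"] * 5 + ["high"] * 5
cells_2000 = ["low"] * 2 + ["high"] * 8
wages_2000 = [1.0, 1.2, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7]

observed = wage_gap(wages_2000, [1.0] * len(wages_2000), 0.85, 0.15)
weights = reweight_to_base(cells_2000, cells_1980)
counterfactual = wage_gap(wages_2000, weights, 0.85, 0.15)
print(round(observed, 3), round(counterfactual, 3))  # 1.4 1.5
```

A composition-constant change of the kind reported in Table II would then be the reweighted gap in the later year minus the observed gap in the earlier year.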
These results demonstrate that it is important to account for changes in the workforce composition, as emphasized by Lemieux (2006a). However, mechanical changes in the workforce composition do not fully explain the increase in upper-tail inequality in the 1980s, nor do they account for the divergent path of lower-tail inequality in the 1980s and 1990s.

IV.B. Decline in Unionization

Several papers in the United States argue that part of the increase in inequality in the 1980s can be linked to a decline in the minimum wage and unionization (e.g., DiNardo, Fortin, and Lemieux [1996]; Lee [1999]; Card and DiNardo [2002]; Card, Lemieux, and Riddell [2004]). We now explore this hypothesis for Germany using the LIAB data.

The German system of collective bargaining differs in several respects from that in the United States. Most importantly, in Germany the recognition of trade unions for collective bargaining purposes is at the discretion of the employer. Once a firm has recognized a union, collective bargaining outcomes apply de facto to all workers in that firm, whether or not they are union members. A firm recognizes a union either by joining an employer federation (Arbeitgeberverband) or by engaging in bilateral negotiations with the union. In the first case, union wages are negotiated at a regional and industry level, typically on an annual basis. Another key difference from the United States is that there is no legal minimum wage in Germany. However, union contracts in Germany specify wage levels for specific groups in specific sectors and can be considered an elaborate system of minimum wages.

TABLE III
DECLINE IN UNION COVERAGE (MEN)

Year   No agreement (%)   Firm-level (%)   Industry-level (%)
1995         12.7              10.1              77.2
1996         13.1              10.6              76.3
1997         13.6              11.4              75.0
1998         19.1               7.7              73.2
1999         22.1               8.3              69.6
2000         24.5               7.3              68.2
2001         24.7               8.2              67.1
2002         25.2               7.9              66.9
2003         25.3               8.6              66.1
2004         27.2               7.1              65.7

Source. LIAB (1995–2004) for men between 21 and 60 years of age working full-time. Notes. The column "No agreement" reports the share of workers covered neither by firm-level nor by industry-level agreements. The columns "Firm-level" and "Industry-level" display the shares of workers covered by firm-level and industry-level agreements, respectively. Entries are weighted to be representative for workers.

Table III, based on the LIAB data set, shows a remarkable decline in union coverage during the mid-1990s and early 2000s: between 1995 and 2004, the share of workers covered by an industry-level agreement declined by about 12 percentage points, and the share of workers covered by a firm-level agreement decreased by 3 percentage points. Unfortunately, comparable data on union coverage do not exist before 1995. For the 1980s, only
data on union membership are available. Schnabel and Wagner (2006) report that throughout the 1980s, about 40% of men were union members.10 By 2000, however, union membership had dwindled to about 31%. This suggests that the decline in unionization in Germany is mostly a phenomenon of the 1990s.

There is strong evidence that unions compress the wage structure in Germany, and more so at the lower end of the wage distribution (see, for example, Fitzenberger and Kohn [2005]; Gerlach and Stephan [2005, 2006]; Dustmann and Schönberg [forthcoming]). A natural question is whether the de-unionization of the 1990s contributed to the rise in inequality over this period, in particular at the lower tail of the wage distribution. To test this hypothesis, we employ the same decomposition method as in Section IV.A and include as regressors all possible interactions between the recognition of an industry- or firm-level agreement and the three education and eight age groups. It is again important to stress that the decomposition method ignores general equilibrium effects; in our application, this means that the union–nonunion wage differential is assumed to be independent of union coverage. Moreover, the decomposition assumes that unionization is exogenous and not itself determined by the same factors that raise wage inequality. A further assumption behind the decomposition method is that there are no spillover effects from the unionized to the nonunionized sector.

Figure IV plots the observed wage changes between 1995 and 2004 as well as the counterfactual wage changes that would have prevailed if unionization rates had remained at their 1995 level across the wage distribution. The figure illustrates that workers throughout the wage distribution would have experienced higher wage growth if unionization rates had not declined. However, the impact of de-unionization is substantially stronger at the lower end of the wage distribution. For instance, wages in 2004 would have been 5.5% higher at the 5th percentile, but only 0.2% higher at the 85th percentile.

We provide more details in Table IV. The first set of columns refers to overall inequality, whereas the second set refers to residual inequality. The residuals are obtained from OLS regressions on imputed wages. In each pair of columns, we first hold only unionization constant. We then additionally keep the age and education distribution constant. We again distinguish two interquantile differences: 85–50 and 50–15. We first report the observed change, and then the counterfactual change if the unionization, age, and education distribution had been the same as in 1995 or 2004, respectively. Between 1995 and 2004, the overall 85–50 wage gap rose by 6.8 log points. If unionization rates had remained at their 1995 level, the increase in upper-tail inequality would have been 5.9 log points, a reduction of 13%. Unionization plays a more important role at the lower end of the distribution: de-unionization can account for 28% of the increase in the overall 50–15 wage gap. The findings are similar for residual inequality. In line with the results in Table II, workforce characteristics also play an important role, particularly at the upper end of the distribution. These results indicate that the decline in union recognition in the 1990s had a profound impact on the wage structure, especially at the lower end of the distribution.

10. Because in Germany collectively bargained agreements apply to all workers in a firm that recognizes the union, union membership is much smaller than union coverage.

[Figure IV: Men, 1995–2004. Vertical axis: change in log real wage; horizontal axis: percentile (5th to 85th). Series: observed change; unionization at 1995 level.]
FIGURE IV Observed versus Composition-Constant Wage Inequality: The Role of De-unionization. Source. LIAB (1995–2004) for men between 21 and 60 years of age working full-time. The figure plots actual wage growth by percentile from 1995 to 2004, as well as the wage growth that would have prevailed if unionization had remained at its 1995 level.
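The percentages quoted above are the share of the observed 1995–2004 increase that disappears when unionization is held at its 1995 level. A small sketch of that arithmetic using the three-decimal entries of Table IV (overall inequality, "Unionization only", 1995 Xs); with these rounded entries it gives about 13% and 29%, close to the 13% and 28% reported in the text. The helper name is hypothetical.

```python
# Entries from Table IV: overall inequality, 1995-2004.
observed = {"85-50": 0.068, "50-15": 0.063}
unionization_at_1995 = {"85-50": 0.059, "50-15": 0.045}


def share_explained(observed_change, counterfactual_change):
    """Share of the observed increase that disappears in the counterfactual."""
    return (observed_change - counterfactual_change) / observed_change


for gap in observed:
    share = share_explained(observed[gap], unionization_at_1995[gap])
    print(f"{gap}: {100 * share:.1f}%")
# 85-50: 13.2%
# 50-15: 28.6%
```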
TABLE IV
OBSERVED VERSUS COMPOSITION-CONSTANT OVERALL AND RESIDUAL WAGE INEQUALITY: THE ROLE OF DE-UNIONIZATION (MEN, 1995–2004)

                      Overall inequality               Residual inequality
               Unionization only     All       Unionization only     All
Panel A: 85/50
Observed              0.068         0.068             0.046         0.046
1995 Xs               0.059         0.026             0.038         0.026
2004 Xs               0.057         0.043             0.035         0.020
Panel B: 50/15
Observed              0.063         0.063             0.043         0.043
1995 Xs               0.045         0.038             0.034         0.032
2004 Xs               0.044         0.036             0.030         0.022

Source. LIAB (1995–2004) for men between 21 and 60 years of age working full-time. Notes. In each panel, the first row reports observed changes in the difference between the 85th and 50th (Panel A) and the 50th and 15th (Panel B) percentiles of the overall and residual wage distribution. Column "Unionization only" shows the changes that would have prevailed if unionization were the same as in 1995 or 2004, respectively. Column "All" shows the corresponding changes that would have prevailed if unionization as well as the education and age distributions were the same as in 1995 or 2004. The residuals are obtained from an OLS regression on imputed wages that controls for three unionization groups (industry-level agreement, firm-level agreement, no agreement), three education groups, and eight age groups, as well as all interactions between these variables. The imputation assumes that the error term in the wage regression is normally distributed with different variances for each education and each age group.
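Because the residual regressions include all interactions between the unionization, education, and age groups, the set of regressors is saturated: there is one dummy per cell, so the OLS fitted value for each worker is the mean log wage of that worker's cell, and the residual is the deviation from that mean. A minimal sketch, assuming unweighted OLS and ignoring the imputation of censored wages; the cell labels and wages are hypothetical.

```python
from collections import defaultdict


def saturated_residuals(log_wages, cells):
    """Residuals from OLS of log wages on a full set of cell dummies.

    With one dummy per (agreement, education, age) combination, the fitted
    value for every worker equals the mean log wage of the worker's cell.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for w, cell in zip(log_wages, cells):
        sums[cell] += w
        counts[cell] += 1
    cell_means = {cell: sums[cell] / counts[cell] for cell in sums}
    return [w - cell_means[cell] for w, cell in zip(log_wages, cells)]


# Hypothetical cells: (agreement type, education group, age group).
cells = [("industry", "medium", "31-35")] * 2 + [("none", "high", "41-45")] * 2
log_wages = [4.0, 4.2, 5.0, 5.4]
residuals = saturated_residuals(log_wages, cells)
print([round(r, 3) for r in residuals])  # [-0.1, 0.1, -0.2, 0.2]
```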
IV.C. The Role of Relative Skill Supplies

An important component of the rise in inequality in the United States is the remarkable increase in the return to education. We now provide evidence on recent trends in the skill premium in Germany and analyze the explanatory power of demand and supply factors. We focus on the wage differential between medium-skilled workers (i.e., those who completed an apprenticeship) and low-skilled workers (i.e., those who lack postsecondary education). For completeness, we also report results for the wage differential between high-skilled workers (i.e., those with a university degree) and the medium-skilled. However, due to the high incidence of censoring among the high-skilled, these results have to be viewed with considerable caution.

Panel A of Figure V plots the wage differential between the low- and medium-skilled (left y-axis) and the medium- and high-skilled (right y-axis). Our results are based on imputed wages, and our regressions control for all possible interactions between three education and eight age groups. The medium–low and the high–medium wage premiums are age-adjusted and are computed as a weighted average of the respective premium in each age group, where the weights are the employment-weighted worker share in each age group, averaged over the entire sample period. The medium–low wage differential declined slightly between 1975 and 1989 and then increased sharply, by about 0.7 percentage points a year. This timing coincides with the sharp rise in lower-tail wage inequality. It also coincides with the deceleration in the decline of the share of the low-skilled during the 1990s: whereas during the 1970s and 1980s the share of low-skilled workers declined from 25.6% in 1976 to 12.5% in 1990, it decreased by only 3.6 percentage points between 1990 and 2004 (Table I). This suggests that fluctuations in supply may have played an important role in the rise of the medium–low skill premium. The high–medium wage differential declined between 1975 and 1980, remained roughly constant during the 1980s and mid-1990s, and started to increase in the late 1990s. Unlike the share of low-skilled workers, the share of university graduates rose at a roughly linear rate during the 1980s and 1990s, from 4.7% in 1976 to 14.7% in 2004 (Table I).

[Figure V. Panel A: Education wage premiums by year, 1975–2003; medium versus low (left axis) and high versus medium (right axis). Panel B: Predicted versus observed skill premiums by year, 1975–2003; predicted gap and observed gap, medium versus low.]
FIGURE V Fluctuations in Relative Supply and Skill Premiums (Men). Source. 2% IABS Sample for men between 21 and 60 years of age working full-time. Panel A depicts on the left axis the fixed-weighted wage ratio of the medium-skilled (apprenticeship degree) and the low-skilled (no postsecondary education) for a composition-constant set of age groups (eight age categories). On the right axis, the figure plots the fixed-weighted wage ratio of the high-skilled (university degree) and medium-skilled for a composition-constant set of age groups. Panel B plots the observed wage gap as well as the gap predicted by the two-level CES production function with three factors, based on the estimates in column (1) of Table V.

To analyze the importance of fluctuations in labor supply more formally, we adopt a two-level CES production function framework (Goldin and Katz 2007b, 2008). Suppose that output Y depends only on quantities S and U of "skilled" and "unskilled" workers, defined as workers with and without university degrees,
respectively:

$$Y_t = A_t\left[\lambda_t S_t^{\rho} + (1-\lambda_t)\,U_t^{\rho}\right]^{1/\rho}.$$

In this expression, $A_t$ is total factor productivity and $\lambda_t$ represents a shift in technology. The aggregate elasticity of substitution between "skilled" and "unskilled" workers is given by $\sigma_{SU} = 1/(1-\rho)$. "Unskilled" labor is itself a CES subaggregate that depends on the quantities $L$ and $M$ of low- and medium-skilled workers:11

$$U_t = \left[\theta_t M_t^{\eta} + (1-\theta_t)\,L_t^{\eta}\right]^{1/\eta}. \tag{1}$$

The elasticity of substitution between the medium- and the low-skilled is given by $\sigma_{ML} = 1/(1-\eta)$. Under the assumption that labor is paid its marginal product, the skilled–unskilled and medium–low wage differentials satisfy (2) and (3):

$$\log\frac{w_{St}}{w_{Ut}} = \log\frac{\lambda_t}{1-\lambda_t} - \frac{1}{\sigma_{SU}}\log\frac{S_t}{U_t}, \tag{2}$$

$$\log\frac{w_{Mt}}{w_{Lt}} = \log\frac{\theta_t}{1-\theta_t} - \frac{1}{\sigma_{ML}}\log\frac{M_t}{L_t}. \tag{3}$$

Relative wages depend on the demand shifters $\lambda_t$ and $\theta_t$, on the relative supplies $\log(S_t/U_t)$ and $\log(M_t/L_t)$, and on the respective elasticities of substitution $\sigma_{SU}$ and $\sigma_{ML}$. We estimate (2) and (3) in two steps. In the first step, we estimate (3) by OLS, replacing $\log(\theta_t/[1-\theta_t])$ with a linear time trend. We then use the estimate of $\sigma_{ML}$ to compute the quantity $U_t$ of unskilled labor supplied in (1). In the second step, we estimate (2) by OLS, this time replacing $\log(\lambda_t/[1-\lambda_t])$ with a linear time trend. To account for generated-regressor bias from the first step, we bootstrap standard errors in the second step. Although our wage differentials refer to men only, we include women in our supply measures.12

Results are reported in Table V, columns (1) and (2). For the medium- versus low-skilled, we obtain an estimate of the elasticity of substitution of about $\sigma_{ML} = 5$ (1/0.206). This estimate is considerably larger than the estimate of around 1.4 typically found in the United States, but this is likely because the typical U.S. estimate refers to the elasticity of substitution between low- and high-skilled workers, who are presumably less perfect substitutes than the low- and medium-skilled considered here.13 This model can explain 94% of the time variation in the wage premium of the medium-skilled relative to the low-skilled.

11. This assumption implies that an increase in skilled labor relative to unskilled labor does not affect the wage premium of the medium-skilled relative to the low-skilled.
12. See Appendix I.A for a detailed description of how the wage premiums and the relative supply measures are computed.
13. A complementary explanation for the higher elasticity of substitution in Germany is that wages in Germany are less responsive to supply and demand shocks than wages in the United States, due to higher unionization rates.

TABLE V
REGRESSION MODELS FOR THE EDUCATION WAGE GAP, 1975–2004 (MEN)

                            Two-level CES                        Two-factor CES
                  Medium versus low   Skilled versus unskilled   High/medium versus low
                         (1)                  (2)                        (3)
Relative supply    −0.206 (0.018)        0.482 (0.195)             −0.252 (0.020)
Time                0.012 (0.001)       −0.016 (0.006)              0.016 (0.001)
R²                  .941                  .184                       .938

Source. 2% IABS Sample for men between 21 and 60 years of age working full-time. Notes. Columns (1) and (2) report results based on a two-level CES production function model that combines the low- and medium-skilled into one CES aggregate. Estimation proceeds in two steps. We first regress the wage premium of the medium-skilled relative to the low-skilled on a linear time trend and on the supplied quantities of the medium-skilled relative to those of the low-skilled. We then use these estimates to compute the quantities supplied by the "unskilled" (i.e., a mixture of the low- and medium-skilled). In the second step, we regress the wage premium of the "skilled" (i.e., the high-skilled) relative to that of the unskilled on a linear time trend and the supplied quantities of the skilled relative to those of the unskilled. Column (3) reports results based on a CES production function model with two factors (high/medium versus low). Here, we regress the wage premium of the high- or medium-skilled relative to that of the low-skilled on a linear time trend and the respective relative supplied quantities. See Appendix I.A for a detailed description of how relative skill premiums and relative supplies are constructed.

Figure V, Panel B, provides a visual illustration of the relationship between relative supplies and relative wages for the low- versus the medium-skilled. The panel plots the observed relative wage gap as well as the gap predicted by the two-level CES production function using the estimates in Table V, column (1), against time. The figure confirms our previous conclusions: the model predicts trends in the wage differential between the medium- and low-skilled very well. In contrast, for the "skilled" (i.e., university graduates) versus the "unskilled" (i.e., a CES aggregate of the low- and medium-skilled), the model performs poorly: the relative supply coefficient estimate is positive, and the coefficient on the linear time trend is negative. The model can explain only 18% of the time variation in
the relative wage premium between the high- and the combined medium- and low-skilled (Table V, column (2)). This could be a sign that a model that combines the low- and medium-skilled into one CES aggregate is misspecified. In the third column of Table V, we report results based on a two-factor CES production function that includes the medium- and the high-skilled in one category and assumes that there is only one skill premium: high/medium versus low. This model appears to perform well and can explain about 94% of the time variation in the wage premium of the high-/medium-skilled relative to the low-skilled.

These results suggest that the deceleration in the decline of low-skill employment shares in the 1990s had a profound impact on skill prices, and thus on the wage structure, particularly in the lower tail of the wage distribution.14 What caused this deceleration? Although more research is needed on this issue, the timing suggests that it is a consequence of the breakdown of the communist regimes in Eastern Europe, as well as the reunification of East and West Germany. These events led to a large inflow of East Germans, Eastern Europeans, and ethnic Germans from Eastern Europe into the West German labor market. Many of these immigrants were low-skilled; see Bauer et al. (2005) and Glitz (2007) for more details.

14. Existing studies on skill premiums in Germany, such as Abraham and Houseman (1995) and Acemoglu (2003), focus on the wage differential of college graduates relative to that of non-college graduates, and use data only until the early or mid-1990s. This explains why these studies fail to detect the deceleration in the decline of low-skill employment shares in the 1990s.

Next, we provide some evidence of a rising demand for the high-skilled, relative to the medium- and low-skilled, by computing between-occupation demand shifts for each education group relative to a base year (see Katz and Murphy [1992]):

$$D_k = \sum_j \frac{E_{jk}}{E_k}\,\frac{\Delta E_j}{E_j}.$$

Here $k$ indexes skill groups and $j$ indexes occupations, $E_j$ is total labor input measured in efficiency units in occupation $j$, $\Delta E_j$ is its change relative to the base year, and $E_{jk}/E_k$ is group $k$'s employment share (in efficiency units) in occupation $j$ in the base year. We prefer this measure of demand shifts over that implied by the CES production function framework because it is not based on relative wage differentials, which may be seriously compromised by the high incidence of wage censoring among the high-skilled, and because it does not require an estimate of the elasticity of substitution. It is important to stress, however, that this index does not account for the impact of price changes on observed employment shifts. Thus, if positive skill supply shocks cause expansions of high-skill occupations, this demand index will overstate the "demand" shock. Conversely, if skill premiums rise due to demand shifts, occupational shifts will be smaller than the price-constant counterfactual, thus leading the index to understate the "demand" shift.

[Figure VI: Demand shifts (vertical axis) by year, 1975–2003; series: medium versus low, high versus medium.]
FIGURE VI Between-Occupation Demand Shifts: Medium/Low versus High/Medium (Men). Source. 2% IABS Sample for men between 21 and 60 years of age working full-time. The figure plots between-occupation demand shifts for the medium-skilled relative to the low-skilled, and for the high-skilled relative to the medium-skilled, with 1975 as the base year. The demand shift of group k relative to base year 1975 is computed as D_k = Σ_j (E_jk/E_k)(ΔE_j/E_j), where k indexes education and j indexes occupation, E_j is total labor input measured in efficiency units in occupation j, and E_jk/E_k is group k's employment share (in efficiency units) in occupation j in the base year. We distinguish 82 occupations.

Figure VI plots the between-occupation demand shifts of the medium- versus low-skilled, and of the high- versus the medium-skilled. The figure indicates a considerable demand shift favoring the high-skilled relative to the medium-skilled during the 1980s and 1990s. This demand shift was substantially larger than that favoring the medium-skilled relative to the low-skilled. To put these numbers into perspective, Katz and Murphy (1992)
report a between-industry demand shift of college graduates relative to high school graduates of 0.067 between 1979 and 1987.15 Over the same period, we find a between-occupation demand shift of 0.157 and a between-industry demand shift of 0.084 for the high- relative to the medium-skilled.

IV.D. Polarization

Our findings highlight the importance of distinguishing between changes in lower- and upper-tail inequality. Moreover, Figure VI suggests that demand shifts for the high-skilled relative to the medium-skilled exceed those for the medium-skilled relative to the low-skilled. Autor, Katz, and Kearney (2006, 2008) provide a simple demand-based explanation for this pattern (see also Autor, Levy, and Murnane [2003]). The idea is that technological change, in particular the implementation of computer technology, affects the bottom and top ends of the skill distribution differently. Suppose that computerization decreases the demand for jobs that require routine analytical or clerical skills, and increases the demand for abstract, nonroutine cognitive and interpersonal skills. Computer technology neither strongly complements nor strongly substitutes for manual skills. If routine analytical skills are predominantly used in the middle, and manual and interactive skills at the bottom and top of the wage distribution, then technological change may lead to "polarization" (Goos and Manning 2007) and thus affect the wage distribution differently at the bottom and the top. For Germany, Spitz-Oener (2006) provides evidence that between 1979 and 1999, the demand for interactive and nonroutine analytical skills increased, while the demand for routine cognitive skills declined. Much of this change can be linked to computerization. This section further investigates this hypothesis for Germany.
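The demand-shift numbers compared above rest on the between-occupation index defined earlier in this section, together with the conversion exp(0.029 + 0.036) − 1 described in footnote 15. The sketch below illustrates both under loudly stated assumptions: the three occupations and all shares are hypothetical, labor input is measured as shares of total input rather than in efficiency units, and expressing a relative shift as a log difference is an illustrative choice, not necessarily the paper's.

```python
import math


def demand_shift(group_shares, E_base, E_t):
    """Between-occupation demand shift for one skill group:
    D_k = sum_j (E_jk / E_k) * (E_j,t - E_j,base) / E_j,base,
    where group_shares[j] = E_jk / E_k is the group's base-year share in j.
    """
    return sum(share * (E_t[j] - E_base[j]) / E_base[j]
               for j, share in group_shares.items())


# Hypothetical three-occupation economy (shares of total labor input).
E_1975 = {"A": 0.5, "B": 0.3, "C": 0.2}
E_1990 = {"A": 0.4, "B": 0.3, "C": 0.3}
high = {"A": 0.1, "B": 0.2, "C": 0.7}    # base-year distribution of the high-skilled
medium = {"A": 0.6, "B": 0.3, "C": 0.1}  # base-year distribution of the medium-skilled

d_high = demand_shift(high, E_1975, E_1990)
d_medium = demand_shift(medium, E_1975, E_1990)
# Illustrative relative shift of high versus medium, as a log difference.
relative = math.log(1 + d_high) - math.log(1 + d_medium)

# Converting a sum of log changes into a proportional change (footnote 15).
converted = math.exp(0.029 + 0.036) - 1
print(round(d_high, 3), round(d_medium, 3), round(converted, 3))  # 0.33 -0.07 0.067
```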
We first test a key assumption behind this approach: nonroutine cognitive tasks are predominantly used at the upper end, whereas routine and manual tasks are more common in the middle and at the lower end of the wage or skill distribution. To this end, we rank the 340 occupations in our data according to their median wages and group them into 100 equal-sized groups. Figure VII, Panel A, plots the smoothed share of men performing manual, abstract, and routine tasks in each occupation, using the 1991 wave of the German Qualification and Career Survey.16 The share of workers performing manual tasks declines monotonically with the occupational wage. Interestingly, abstract tasks are somewhat more common at the lower end of the occupational wage distribution than in the middle, and (as expected) most common at the high end. The relationship between routine task usage and occupational wages is likewise nonmonotone, with the share of workers performing routine tasks being highest around the 80th wage percentile.

15. This number is based on the between-industry demand shifts reported in Table VI and computed as exp(0.029 + 0.036) − 1.
16. The survey is conducted jointly by the Federal Institute for Vocational Education and Training (BIBB) and the Institute for Employment (IAB), with the goal of tracking skill requirements of occupations. See Spitz-Oener (2006) for a detailed description of the data. Routine tasks include calculating and bookkeeping; correcting texts/data; equipping machines; shipping and transporting; and filing and archiving. Manual tasks include repairing or renovating houses/apartments/machines/vehicles; restoring art/monuments; serving or accommodating; cleaning; and rubbish removal. Abstract tasks include research, evaluation, and planning; making plans, constructing, designing, and sketching; working out rules/prescriptions; using and interpreting rules; lobbying, coordinating, and organizing; teaching and training; selling, buying, and advertising; entertaining and presenting; and employing and managing personnel.

[Figure VII. Panel A: Task input by occupation skill percentile (ranking: wage, 1980); vertical axis: task usage; horizontal axis: skill percentile; series: routine, abstract, and manual tasks. Panel B: Smoothed changes in employment (ranking: wages, 1980); vertical axis: 100 × change in employment share; horizontal axis: skill percentile.]
FIGURE VII Change in Occupation's Employment Shares and Task Usage by Skill Percentile in 1980 (Men). Panel A source: German Qualification and Career Survey, 1991, full-time men between 21 and 60 years of age. Panel A depicts the share of workers performing routine, manual, and abstract tasks by the 1980 occupational percentile, using a locally weighted smoothing regression with bandwidth 0.5 and 100 observations. Occupations at the 3-digit level are ranked according to employment-duration weighted median wages and then grouped into 100 equal-sized groups, using the IAB data. Routine tasks include calculating and bookkeeping; correcting texts/data; equipping machines; shipping and transporting; and filing and archiving. Manual tasks include repairing or renovating houses/apartments/machines/vehicles; restoration of art/monuments; serving or accommodating; cleaning; and rubbish removal. Abstract tasks include research, evaluation, and planning; making plans, constructing, designing, and sketching; working out rules/prescriptions; using and interpreting rules; lobbying, coordinating, and organizing; teaching and training; selling, buying, and advertising; entertaining and presenting; and employing and managing personnel. Panel B source: 2% IABS Sample for men between 21 and 60 years of age working full-time. Panel B depicts log changes in employment shares, where the 3-digit occupations in our data are ranked according to their mean years of education (employment-duration weighted) and then grouped into 100 equal-sized groups. We employ locally weighted smoothing regressions with 100 observations and bandwidth 0.8.

Following Goos and Manning (2007) and Autor, Katz, and Kearney (2008), we next test whether occupations in the middle of the skill distribution experienced lower growth rates than occupations at the bottom and top of the skill distribution. We again rank occupations by their median wages and group them into 100 equal-sized groups. Figure VII, Panel B, plots the smoothed log changes in the employment shares by occupational skill for two periods: 1980–1990 (when wage inequality rose mostly at the top) and 1990–2000 (when wage inequality rose at both the top and the bottom). Both decades are characterized by polarization in employment growth: employment shares of occupations at the top of the wage distribution increased substantially in both periods. Employment shares of occupations in the middle of the wage
distribution declined. Occupations at the low end of the wage distribution experienced neither strong losses nor strong gains.

To analyze more formally the relationship between changes in occupational employment shares by skill percentile (measured by median wages) and wage changes by wage percentile in each decade, we estimate OLS regressions of the following form (see also Autor, Katz, and Kearney [2008]):

$$\Delta E_{p\tau} = \alpha_\tau + \beta_\tau\,\Delta w_{p\tau} + \varepsilon_{p\tau}.$$

In this expression, $\Delta E_{p\tau}$ denotes the change in log occupational employment at skill percentile $p$ in decade $\tau$, and $\Delta w_{p\tau}$ represents the change in the daily wage at wage percentile $p$ in the same decade. We estimate β80 = 0.43 (t-value: 1.28) for the 1980s and β90 = 0.65 (t-value: 1.56) for the 1990s, suggesting that employment and wage changes are weakly positively correlated. However, in both the 1980s and the 1990s this masks important differences between the lower and upper ends of the wage distribution: below the median, the correlation between employment and wage changes is negative (we estimate β80, p50 = 3.25 (2.74) for the 1990s). These findings differ somewhat from those for the United States, where changes in employment and changes in wages are strongly positively correlated in both decades throughout the entire distribution (Autor, Katz, and Kearney 2008). These results are difficult to reconcile with a simple theory of skill-biased technological change, according to which technology symmetrically affects the bottom and the top of the wage distribution. They are consistent with a more nuanced view of skill-biased technological change, according to which technology substitutes for routine tasks but complements nonroutine tasks, and thereby increases the demand for workers located mostly at the top of the wage distribution. Moreover, the negative correlation between employment and wage changes below the median speaks against a demand-based explanation for the rise in lower-tail inequality.
Our previous findings are consistent with the view that the rise in lower-tail inequality may be better explained by episodic changes, most importantly the decline in unionization and the changes in the skill mix of the workforce in the 1990s.
V. DISCUSSION AND CONCLUSIONS
This paper challenges the common view that the rise in wage inequality is a phenomenon observed only in a handful of countries, such as the United States, the United Kingdom, and Canada. In particular, we revisit trends in wage inequality in (West) Germany, a country that has so far been singled out as having a stable wage distribution. Based on a large administrative data set, we find that wage inequality in Germany increased in the 1980s, but mostly at the top of the distribution. In the early 1990s, wage inequality started to rise at the bottom of the distribution as well. This holds for both men and women, although the rise in upper-tail inequality is somewhat more pronounced among men. Hence, although the United States and Germany experienced similar changes at the top of the distribution during the 1980s and 1990s, the two countries differ markedly at the lower end of the wage distribution: the rise in lower-tail inequality that occurred in the United States in the 1980s came a decade later in Germany. We explore several explanations for the increase in inequality. Changes in workforce composition play an important role in explaining changes in the wage structure. However, they cannot fully account for the divergent paths of upper- and lower-tail inequality in the 1980s, or for the divergent paths of lower-tail inequality in the 1980s and 1990s. Moreover, our results are consistent with a polarization of work: occupations that were at the top of the 1980 wage distribution experienced the largest growth rates, whereas occupations in the middle declined relative to occupations at the bottom. This speaks against a simple theory of skill-biased technological change, according to which technology increases the demand for skilled jobs relative to that of unskilled jobs.
It is, however, consistent with a more nuanced view of technological change, according to which technology asymmetrically affects the bottom and the top of the wage distribution, by substituting for routine tasks and complementing nonroutine tasks (e.g., Autor, Levy, and Murnane [2003]). Because results consistent with a polarization of labor demand have now been found in three advanced countries,17 this may begin to provide the unifying international evidence on technological change that so
17. See Autor, Katz, and Kearney (2006, 2008) for the United States and Goos and Manning (2007) for the United Kingdom.
far has been absent—although more research for other advanced countries is needed to fully assess this hypothesis. Can the polarization of work alone account for the finding that in the 1980s inequality mostly rose at the top, whereas in the 1990s inequality rose both at the top and at the bottom? One piece of evidence against this hypothesis is that in Germany, employment and wage changes are negatively correlated for occupations below the median. The widening of the wage distribution at the bottom may therefore be better explained by episodic events, such as supply shocks and changes in labor market institutions. The hypothesis we put forward here is that these episodic events happened in the 1980s in the United States, but in the 1990s in Germany. First, the 1980s in the United States are characterized by an erosion of labor market institutions such as labor unions, as well as a declining minimum wage. In Germany, this process appears to have started only in the 1990s. Several papers show that these changes are important in explaining changes in inequality in the United States, in particular at the lower end of the wage distribution (e.g., DiNardo, Fortin, and Lemieux [1996]; Lee [1999]; and Card and DiNardo [2002]). We find that between 1995 and 2004, de-unionization can explain 28% of the increase in lower-tail inequality. Second, skill upgrading started to slow down in the United States in the early 1980s. In Germany, there is little evidence for a slowdown in skill upgrading during the 1980s. In the 1990s, however, the decline in the share of the low-skilled started to decelerate, whereas the share of the high-skilled kept increasing at a similar rate as during the 1980s. Several U.S. studies show that fluctuations in relative labor supply play an important role in explaining trends in the skill premium (e.g., Katz and Murphy [1992]; Card and Lemieux [2001]; Autor, Katz, and Kearney [2008]). 
We find that fluctuations in relative supply go a long way in explaining trends in the wage differential between the medium- and low-skilled, but do not predict trends in the wage differential between the high- and the combined medium/low-skilled. Why did the slowdown in skill upgrading happen a decade earlier in the United States than in Germany? Although more research is needed on this topic, the relative increase in the share of the low-skilled that started in 1990 in Germany is likely to be a consequence of the breakdown of the communist regimes in Eastern Europe and of the reunification of East and West Germany.
These events led to a large inflow of East Germans, Eastern Europeans, and ethnic Germans from Eastern Europe into the West German labor market, many of whom were relatively low-skilled (see Bauer et al. [2005] and Glitz [2007] for more details).
APPENDIX I: IABS, 1975–2004
A. The Structural Break in 1984
Starting in 1984, one-time payments, such as bonuses, are included in our wage measure (see Bender et al. [1996] for more details); as Steiner and Wagner (1998) point out, ignoring this structural break results in a spurious increase in inequality. We correct for this break in a way similar to that of Fitzenberger (1999). The correction is based on the idea that higher quantiles appear to be more affected by the structural break than lower quantiles, and thus have to be corrected upward before 1984. To this end, we estimate locally weighted regressions of the wage ratio between 1982 and 1983 (i.e., before the break) and between 1983 and 1984 (i.e., across the break) on the wage percentiles in 1983 and 1984, respectively, using a bandwidth of 0.2. We then compute the correction factor as the difference between the smoothed values of the wage ratios in 1983 and 1984. To account for differential overall wage growth between 1982–1983 and 1983–1984, we subtract from our correction factor the smoothed value of the wage ratio in 1983, averaged between the second and fortieth quantiles. In a final step, we correct wages prior to 1984 by multiplying them by 1 plus the correction factor.
B. Sample Selection and Variable Description
Sample Selection. In addition to the selection criteria described in Section II, we drop wage spells of workers in apprenticeship training. We further require that daily wages (in 1995 DM) be at least 20 DM. For the wage analysis, we use full-time spells only. Part-time spells are included in our relative supply measures, though with a lower weight (see below).
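The break correction described in Section A can be sketched as follows, assuming NumPy. `local_linear_smooth` is a hand-rolled LOWESS-style smoother (tricube weights over the nearest 20 percent of points, no robustness iterations) standing in for whatever locally weighted regression routine the authors used. The text is terse about the normalization step; this sketch assumes it amounts to centering the correction factor on its average over the 2nd to 40th percentiles, so that quantiles barely affected by the inclusion of bonuses receive no correction. All function names are hypothetical.

```python
import numpy as np


def local_linear_smooth(x, y, frac=0.2):
    """LOWESS-style smoother (no robustness iterations).

    At each x[i], fit a weighted straight line to the nearest ceil(frac * n)
    points using tricube weights and return the fitted value at x[i].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(int(np.ceil(frac * n)), 3)
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]                            # window radius: k-th nearest point
        w = (1.0 - np.clip(dist / h, 0.0, 1.0) ** 3) ** 3   # tricube weights, 0 outside window
        X = np.column_stack([np.ones(n), x - x[i]])         # local intercept and slope
        XtW = X.T * w                                       # scale each observation by its weight
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        fitted[i] = beta[0]                                 # intercept = fitted value at x[i]
    return fitted


def break_correction(pct, ratio_before, ratio_across, low=(2, 40), frac=0.2):
    """Percentile-specific correction factor for pre-1984 wages.

    pct:          wage percentiles, e.g. 1..99
    ratio_before: 1983/1982 wage ratio at each percentile (no break)
    ratio_across: 1984/1983 wage ratio at each percentile (spans the break)
    """
    pct = np.asarray(pct, dtype=float)
    raw = (local_linear_smooth(pct, ratio_across, frac)
           - local_linear_smooth(pct, ratio_before, frac))
    # Assumed normalization: remove common wage growth by centering on the
    # 2nd-40th percentiles, which the break is taken to leave largely unaffected.
    in_low = (pct >= low[0]) & (pct <= low[1])
    return raw - raw[in_low].mean()


def correct_pre1984(wages, correction):
    """Multiply pre-1984 wages by (1 + correction factor at their percentile)."""
    return np.asarray(wages, dtype=float) * (1.0 + np.asarray(correction, dtype=float))
```

With hypothetical ratios that differ only above the median, for example `ratio_before = 1.02` at every percentile and `ratio_across = 1.03 + 0.0005 * max(p - 50, 0)`, the resulting correction is zero at the low percentiles and about 0.0245 at the 99th percentile.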
Wages. Our wage variable is the average daily wage. If a worker worked for more than one employer in a year, we compute a weighted average, where the weights are the shares of days worked for each employer. Our results are employment-duration-weighted: a worker who works 365 days a year gets a weight of 365, whereas
a worker who works only 7 days a year gets a weight of 7. Wages are deflated by the Consumer Price Index, with 1995 as the base year. From 1999 on, wages are reported in euros; we convert them to Deutschmarks at the rate of 1 euro = 1.95583 DM. Wages are censored at the highest level of earnings that is subject to social security contributions. Censoring can also affect wages observed slightly below this limit: an individual may receive a wage increase from her current firm in the middle of the year that puts her wage above the censoring limit. In this case, the wage we observe is an employment-duration-weighted average of the precensoring wage and the censoring limit.18 For this reason, we code wages as censored if they lie no more than three Deutschmarks below the censoring limit. Our results are very similar if we further reduce the censoring limit by six Deutschmarks.
Education. Our education variable distinguishes three groups, which we label low, medium, and high. The low-skilled are workers who enter the labor market without postsecondary education. The medium-skilled are workers who completed an apprenticeship or a high school degree (Abitur). The high-skilled are workers who graduated from a university or college (Universität or Fachhochschule). In the raw data, the education variable is missing for 10.6% of observations. However, because our data are longitudinal, we can impute a value by looking at past and future values of the education variable. The analysis in this paper is based on the education variable IP1 provided by Fitzenberger, Osikominu, and Völter (2006). This variable is missing for only 1.3% of observations. We code the remaining missing values as low-skilled. Our findings are similar if individuals with missing education are dropped.
Relative Supply Measures. Quantities supplied are measured in efficiency units. To compute the efficiency units, we calculate the mean real wage (based on imputed wages) by year, sex, education, and age.
In each year, we normalize wages by the mean wage of medium-skilled men aged 31 to 35. The efficiency unit for each group is the arithmetic average of these normalized wages over all years. The quantity supplied by each education group in a given year is the number of days worked in that year multiplied by the respective efficiency unit, summed over all workers in the education group. Part-time work is included, but weighted down by 0.67 (“long” part-time) or 0.5 (“short” part-time).
18. If the individual switches firms and earned a wage below the censoring limit at her old firm, but a wage above the censoring limit at her new firm, we observe the true wage at the old firm, and the censoring limit at the new firm.
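A minimal sketch of the efficiency-unit calculation and the resulting supply measure, using only the Python standard library. Cell keys such as `("m", "medium", "31-35")` and hours labels such as `"long_part"` are hypothetical; the inputs are assumed to be precomputed mean real wages by cell (based on imputed wages) and worker spells with days worked.

```python
from collections import defaultdict
from statistics import mean

# Down-weighting of part-time spells as described in the text; labels are hypothetical.
HOURS_WEIGHT = {"full": 1.0, "long_part": 0.67, "short_part": 0.5}
REFERENCE = ("m", "medium", "31-35")  # medium-skilled men aged 31-35


def efficiency_units(cell_means):
    """cell_means: {(year, sex, educ, age_band): mean real wage}.

    Normalizes each cell mean by the same year's reference cell, then averages
    the normalized wages over years for each (sex, educ, age_band) group.
    """
    normalized = defaultdict(list)
    for (year, sex, educ, age), wage in cell_means.items():
        ref = cell_means[(year, *REFERENCE)]
        normalized[(sex, educ, age)].append(wage / ref)
    return {group: mean(vals) for group, vals in normalized.items()}


def relative_supply(spells, units):
    """spells: iterable of (year, sex, educ, age_band, days_worked, hours_type).

    Returns {(year, educ): efficiency-weighted days supplied}.
    """
    supply = defaultdict(float)
    for year, sex, educ, age, days, hours in spells:
        supply[(year, educ)] += days * HOURS_WEIGHT[hours] * units[(sex, educ, age)]
    return dict(supply)
```

For example, if low-skilled men aged 31 to 35 earn 80% and 90% of the reference wage in two years, their efficiency unit is 0.85, so a full-year spell of 365 days contributes 310.25 efficiency-weighted days.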
Education Wage Differentials. The medium–low and high–medium wage differentials are based on imputed wages that assume that the error term is normally distributed with the same variance for each education and age group. This differs from our baseline imputation rule, which allows different variances across education and age groups. We chose this restriction because it makes the high–medium wage differential less sensitive to the chosen censoring limit.19 Our OLS regressions control for three education and eight age groups as well as all possible interactions; this specification therefore allows different wage premiums by age group. The medium–low and high–medium wage premiums are computed as a weighted average of the respective premium in each age group, where the weights are the employment-weighted worker shares in each age group, averaged over the entire sample period. To compute the wage differential between “skilled” (i.e., workers with a university degree) and “unskilled” workers (i.e., a mixture of the low- and medium-skilled), we predict age-adjusted wages for each education group in the same way. The wage of the “unskilled” is a weighted average of the wages of the low- and medium-skilled, where the weights are the employment-weighted worker shares of each education group, averaged over the entire sample period. The wage of the mixture of the high- and medium-skilled (Table V, column (3)) is computed in the same way.
C. Imputation of Censored Wages
We impute censored wages in the IABS and LIAB under the assumption that the error term in the wage regression is normally distributed, with different variances for each education and each age group. To this end, we estimate censored regressions (allowing a different variance for each education and age group) separately for each year (thereby allowing the variance in each group to vary across years). We control for all possible interactions between three education and eight age groups.
For each year, we impute censored wages as the sum of the predicted wage and a random component, drawn from a normal distribution with mean zero and a separate variance for each education and age group, obtained from the standard error of the forecast. A comparison between OLS estimates based on imputed wages and Tobit estimates based on censored wages shows that both the estimates and the standard errors are almost identical. We have also imputed wages under four additional distributional assumptions. First, we continue to assume that the error term is normally distributed, but we either restrict the variance to be the same across all education and age groups or allow a different variance for each education–age cell. Second, we assume that the upper tail of the unconditional wage distribution follows a Pareto distribution. Third, similarly to U.S. studies based on the CPS, we replace censored observations with 1.5 times the censoring limit. Imputations based on the normal distribution, however, suggest that this imputation factor is too high and is in fact closer to 1.2. As a fourth robustness check, we therefore replace censored wages with 1.2 times the censoring limit. Findings for these alternative imputation methods can be found in Section 1 of the Online Appendix.
We use a third data set, the GSES (German Structure of Earnings Survey), to evaluate which imputation method best recovers the censored part of the wage distribution. The GSES is a survey of about 22,000 establishments conducted by the German Federal Statistical Office. A scientific use file is currently available for the year 2001. The main advantage of the GSES compared to the IABS is that wages are not censored. We find that the imputation method that assumes that the error term is normally distributed with a different variance by age and education works somewhat better than the other imputation methods. This method was therefore chosen for our baseline results.
19. For instance, in 1975 the high–low wage gap is 12 percentage points higher when the censoring limit is reduced by 6 Deutschmarks and the variance is allowed to vary by education and age. If the variance is restricted to be the same by education and age, lowering the censoring limit makes little difference.
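The censoring rule and the imputation step can be sketched as below with the standard library's `statistics.NormalDist`. The predicted log wage `xb` and the group-specific standard deviation `sigma` are assumed to come from the year-by-year censored regressions, which are not estimated here. The text describes the random component only as mean-zero normal; restricting the draw so that the imputed wage lies above the censoring point (inverse-CDF sampling from a truncated normal) is an assumption of this sketch, as is working in log wages. The fixed-factor rule mirrors the 1.2 robustness check.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def is_censored(wage, limit, margin=3.0):
    """Treat a daily wage as censored if it lies within `margin` DM below the limit."""
    return wage >= limit - margin


def impute_censored_wage(xb, sigma, limit, margin=3.0, rng=random):
    """Draw an imputed daily wage for a censored observation.

    xb:    predicted log wage from a censored (Tobit) regression (assumed input)
    sigma: error standard deviation for the worker's education-age group
    The draw is restricted to lie above the margin-adjusted censoring point.
    """
    threshold = math.log(limit - margin)          # censoring point in log wages
    a = (threshold - xb) / sigma                  # standardized truncation point
    lower = STD_NORMAL.cdf(a)
    u = lower + (1.0 - lower) * rng.random()      # uniform on [Phi(a), 1)
    u = min(max(u, 1e-12), 1.0 - 1e-12)           # keep inv_cdf's argument in (0, 1)
    return math.exp(xb + sigma * STD_NORMAL.inv_cdf(u))


def impute_fixed_factor(limit, factor=1.2):
    """Robustness alternative: replace a censored wage with factor times the limit."""
    return factor * limit
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible.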
APPENDIX II: LIAB, 1995–2004
In addition to the selection criteria described in Section II, we discard all firms for which the union variable is missing. The maximum number of establishments lost due to this restriction is 72 (around 0.8%), in 2001. We further restrict our sample to firms that employ at least one man working full time between the ages of 21 and 60. Most of the variables in the LIAB closely correspond to those in the IABS, but there are a few exceptions. First, the wage variable in the LIAB refers to the first of July, as opposed to an annualized average as in the IABS. Second, because the LIAB does not contain complete biographies of workers, it is impossible to impute
missing observations in the education variable. We therefore code missing observations as an additional education category.
UNIVERSITY COLLEGE LONDON
INSTITUT FÜR ARBEITSMARKT- UND BERUFSFORSCHUNG, NÜRNBERG
UNIVERSITY COLLEGE LONDON AND INSTITUT FÜR ARBEITSMARKT- UND BERUFSFORSCHUNG, NÜRNBERG
REFERENCES
Abraham, Katherine G., and Susan N. Houseman, “Earnings Inequality in Germany,” in Differences and Changes in Wage Structures, Richard B. Freeman and Lawrence F. Katz, eds. (Chicago and London: University of Chicago Press, 1995).
Acemoglu, Daron, “Technical Change, Inequality and the Labor Market,” Journal of Economic Literature, 40 (2002), 7–72.
——, “Cross-Country Inequality Trends,” Economic Journal, 113 (2003), F121–F149.
Atkinson, Anthony B., The Changing Distribution of Earnings in OECD Countries (Oxford, UK: Oxford University Press, 2008).
Autor, David H., Lawrence F. Katz, and Melissa S. Kearney, “The Polarization of the U.S. Labor Market,” American Economic Review Papers and Proceedings, 96 (2006), 189–194.
——, “Trends in U.S. Wage Inequality: Re-assessing the Revisionists,” Review of Economics and Statistics, 90 (2008), 300–323.
Autor, David H., Frank Levy, and Richard J. Murnane, “The Skill Content of Recent Technological Change: An Empirical Investigation,” Quarterly Journal of Economics, 118 (2003), 1279–1333.
Bauer, Thomas, Barbara Dietz, Klaus F. Zimmermann, and E. Zwintz, “German Migration: Development, Assimilation, and Labour Market Effects,” in European Migration. What Do We Know? Klaus F. Zimmermann, ed. (Oxford, UK: Oxford University Press, 2005).
Bender, Stefan, Jürgen Hilzendegen, Götz Rohwer, and Helmut Rudolph, Die Beschäftigtenstichprobe 1975–1990, Beiträge zur Arbeitsmarkt- und Berufsforschung 197 (Nuremberg, Germany: IAB, 1996).
Boudarbat, Brahim, Thomas Lemieux, and W. Craig Riddell, “Recent Trends in Wage Inequality and the Wage Structure in Canada,” in Dimensions of Inequality in Canada, David A. Green and Jonathan R. Kesselman, eds. (Vancouver, BC: UBC Press, 2006).
Bound, John, and George Johnson, “Changes in the Structure of Wages in the 1980s: An Evaluation of Alternative Explanations,” American Economic Review, 82 (1992), 371–392.
Card, David, and John E. DiNardo, “Skill-Biased Technological Change and Rising Wage Inequality: Some Problems and Puzzles,” Journal of Labor Economics, 20 (2002), 733–783.
Card, David, Francis Kramarz, and Thomas Lemieux, “Changes in the Relative Structure of Wages and Employment: A Comparison of the United States, Canada, and France,” Canadian Journal of Economics, 32 (1999), 843–877.
Card, David, and Thomas Lemieux, “Can Falling Supply Explain the Rising Return to College for Younger Men?” Quarterly Journal of Economics, 116 (2001), 705–746.
Card, David, Thomas Lemieux, and W. Craig Riddell, “Unions and Wage Inequality,” Symposium on “What Do Unions Do—A Retrospective after Two Decades,” Journal of Labor Research, 25 (2004), 519–562.
DiNardo, John E., Nicole Fortin, and Thomas Lemieux, “Labor Market Institutions and the Distribution of Wages, 1973–1992: A Semiparametric Approach,” Econometrica, 64 (1996), 1001–1044.
Dew-Becker, Ian, and Robert J. Gordon, “Where Did the Productivity Growth Go? Inflation Dynamics and the Distribution of Income,” Brookings Papers on Economic Activity, 2 (2005), 67–127.
Dustmann, Christian, and Uta Schönberg, “Training and Union Wages,” Review of Economics and Statistics, forthcoming.
Fitzenberger, Bernd, Wages and Employment Across Skill Groups: An Analysis for West Germany (Heidelberg: Physica/Springer, 1999).
Fitzenberger, Bernd, and Karsten Kohn, “Gleicher Lohn für gleiche Arbeit? Zum Zusammenhang zwischen Gewerkschaftsmitgliedschaft und Lohnstruktur in Westdeutschland 1985–1997,” Journal for Labour Market Research, 38 (2005), 125–146.
——, “Skill Wage Premia, Employment, and Cohort Effects: Are Workers in Germany All of the Same Type?” IZA Discussion Paper No. 2185, 2006.
Fitzenberger, Bernd, Aderonke Osikominu, and Robert Völter, “Imputation Rules to Improve the Education Variable in the IAB Employment Subsample,” Schmollers Jahrbuch (Journal of the Applied Social Sciences), 126 (2006), 405–436.
Freeman, Richard B., and Lawrence F. Katz, “Introduction and Summary,” in Differences and Changes in Wage Structure, Richard B. Freeman and Lawrence F. Katz, eds. (Chicago: The University of Chicago Press, 1995).
Gerlach, Knut, and Gesine Stephan, “Bargaining Regimes and Wage Dispersion,” Jahrbücher für Nationalökonomie und Statistik, 226 (2006), 629–645.
——, “Wage Settlements and Wage Setting—Results from a Multi-level Model,” Applied Economics, 37 (2005), 2297–2306.
Gernandt, Johannes, and Friedhelm Pfeiffer, “Rising Wage Inequality in Germany,” ZEW Discussion Paper No. 06-019, 2006.
Glitz, Albrecht, “The Labour Market Impact of Immigration: Quasi-experimental Evidence,” University of Pompeu Fabra, Mimeo, 2007.
Goldin, Claudia, and Lawrence F. Katz, “Long-Run Changes in the Wage Structure: Narrowing, Widening, Polarizing,” Brookings Papers on Economic Activity, 2 (2007a), 135–165.
——, “The Race between Education and Technology: The Evolution of U.S. Educational Wage Differentials, 1890 to 2005,” NBER Working Paper No. 12984, 2007b.
——, The Race between Education and Technology (Cambridge, MA: Harvard University Press, 2008).
Goos, Maarten, and Alan Manning, “Lousy and Lovely Jobs: The Rising Polarization of Work in Britain,” Review of Economics and Statistics, 89 (2007), 118–133.
Gordon, Robert J., and Ian Dew-Becker, “Selected Issues in the Rise of Income Inequality,” Brookings Papers on Economic Activity, 2 (2007), 169–192.
Gosling, Amanda, Stephen Machin, and Costas Meghir, “The Changing Distribution of Male Wages, 1966–1992,” Review of Economic Studies, 67 (2000), 635–666.
Herrlinger, Dagmar, Dana Müller, and Lutz Bellmann, “Codebuch zum IAB Betriebspanel. Version 1: Längsschnitt 1993–2001 (5. überarbeitete Auflage),” FDZ Datenreport Nr. 5 (2005).
Juhn, Chinhui, Kevin M. Murphy, and Brooks Pierce, “Wage Inequality and the Rise in Returns to Skill,” Journal of Political Economy, 101 (1993), 410–441.
Katz, Lawrence F., and Kevin M. Murphy, “Changes in Relative Wages, 1963–1987: Supply and Demand Factors,” Quarterly Journal of Economics, 107 (1992), 35–78.
Kohn, Karsten, “Rising Wage Dispersion, After All! The German Wage Structure at the Turn of the Century,” IZA Discussion Paper No. 2098, 2006.
Krugman, Paul, “Past and Prospective Causes of Unemployment,” Economic Review, Federal Reserve Bank of Kansas City, QIV (1994), 23–43.
Lee, David S., “Wage Inequality in the United States in the 1980s: Rising Dispersion or Falling Minimum Wage?” Quarterly Journal of Economics, 114 (1999), 977–1023.
Lemieux, Thomas, “Increasing Residual Wage Inequality: Composition Effects, Noisy Data, or Rising Demand for Skill?” American Economic Review, 96 (2006a), 1–64.
——, “Post-secondary Education and Increasing Wage Inequality,” American Economic Review Papers and Proceedings, 96 (2006b), 1–23.
——, “The Changing Nature of Wage Inequality,” Journal of Population Economics, 21 (2008), 21–48.
Levy, Frank, and Richard J. Murnane, “U.S. Earnings Levels and Earnings Inequality: A Review of Recent Trends and Proposed Explanations,” Journal of Economic Literature, 30 (1992), 1333–1381.
Möller, Joachim, “Die Entwicklung der Lohnspreizung in West- und Ostdeutschland,” in Institutionen, Löhne und Beschäftigung, Lutz Bellmann, Olaf Hübler, Wolfgang Meyer, and Gesine Stephan, eds. (Nürnberg: Beiträge zur Arbeitsmarkt- und Berufsforschung, 2005).
Mulligan, Casey B., and Yona Rubinstein, “Selection, Investment, and Women’s Relative Wages over Time,” Quarterly Journal of Economics, 123 (2008), 1061–1110.
Murphy, Kevin M., and Finis Welch, “The Structure of Wages,” Quarterly Journal of Economics, 107 (1992), 285–326.
OECD, “Earnings Inequality, Low-Paid Employment and Earnings Mobility,” OECD Employment Outlook (1996), 59–107.
Piketty, Thomas, and Emmanuel Saez, “Income Inequality in the United States, 1913–1998,” Quarterly Journal of Economics, 118 (2003), 1–39.
Prasad, Eswar S., “The Unbearable Stability of the German Wage Structure: Evidence and Interpretation,” IMF Staff Papers, 51 (2004), 354–385.
Saez, Emmanuel, and Michael R. Veall, “The Evolution of High Incomes in Northern America: Lessons from the Canadian Experience,” American Economic Review, 95 (2005), 831–849.
Schnabel, Claus, and Joachim Wagner, “The Persistent Decline in Unionization in Western and Eastern Germany, 1980–2004: What Can We Learn from a Decomposition Analysis?” IZA Discussion Paper No. 2388, 2006.
Spieß, Martin, and Markus Pannenberg, “Documentation of Sample Sizes and Panel Attrition in the German Socio Economic Panel (GSOEP),” Research Notes, 28 (2003), German Institute for Economic Research, Berlin.
Spitz-Oener, Alexandra, “Technical Change, Job Tasks and Rising Educational Demands: Looking outside the Wage Structure,” Journal of Labor Economics, 24 (2006), 235–270.
Steiner, Viktor, and Kersten Wagner, “Has Earnings Inequality in Germany Changed in the 1980s?” Zeitschrift für Wirtschafts- und Sozialwissenschaften, 118 (1998), 29–59.
THE VULNERABILITY OF AUCTIONS TO BIDDER COLLUSION*
ROBERT C. MARSHALL AND LESLIE M. MARX
Previous work has addressed the relative vulnerability of different auction schemes to collusive bidding. The common wisdom is that ascending-bid and second-price auctions are highly susceptible to collusion. We show that the details of ascending-bid and second-price auctions, including bidder registration procedures and procedures for information revelation during the auction, can be designed to completely inhibit, or unintentionally facilitate, certain types of collusion. If auctions are designed without acknowledging the possibility of collusion, then the design will ignore key features that affect the potential success of colluding bidders.
I. INTRODUCTION
Bidder collusion is a pervasive problem (Pesendorfer 2000) and a focus of competition enforcement. A General Accounting Office report in 1990 noted that from 1982 to 1988, over half of the criminal restraint of trade cases filed by the U.S. Department of Justice’s Antitrust Division involved auction markets.1
The economics literature views ascending-bid and second-price auctions as similarly susceptible to bidder collusion because they are strategically equivalent (“logically isomorphic” in Vickrey’s [1961] language).2 At a second-price auction, a cartel could have its highest-valuing member bid its value while all other cartel members bid some amount below the auctioneer’s reserve price. Similarly, the same cartel at an ascending-bid auction could have its highest-valuing member remain active up to its value while all other cartel members do not bid beyond the auctioneer’s reserve price. In each case, the highest-valuing cartel member acts just as it would have had the bidding been noncooperative, so other cartel members have no incentive to bid against it (Marshall and Meurer 2004).
∗ The authors thank the Human Capital Foundation (http://www.hcfoundation.ru) for support. We thank Pino Lopomo, Peng Sun, Larry Katz (the editor), Martha Stancill, and five anonymous referees for helpful comments. We are indebted to Barry Ickes, Andrei Karavaev, Alexey Kuchaev, Vladimir Kreyndel, Alexey Pomanskiy, Mark Roberts, Georgy Trofimov, and Andrey Vavilov for numerous discussions regarding the Russian oil and gas lease auctions. Andrei Karavaev also provided skillful research assistance.
1. “GAO Report: Changes in Antitrust Enforcement Policies and Activities,” GAO/GGD-91-2, October 1990, available at http://archive.gao.gov/d22t8/142779.pdf.
2. The economics profession has generally credited Vickrey (1961) with being the first to propose the second-price auction format (McAfee and McMillan 1987; Milgrom 1989; Rothkopf, Teisberg, and Kahn 1990; Lucking-Reiley 2000); however, there are examples of the second-price auction being used in practice long before Vickrey’s paper (Moldovanu and Tietzel 1998; Lucking-Reiley 2000).
© 2009 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, May 2009
Despite the view that opportunities for collusion may be similar in ascending-bid and second-price auctions, in this paper we show that the two auction formats can differ in their susceptibility to collusion. We show that ascending-bid and second-price auctions can be designed to be robust to certain types of collusion. Whether or not an auction is designed to be robust to collusion can have a large impact on auction revenue. Most other design issues concern relatively small margins around the second-highest valuation. In contrast, by inhibiting collusion the designer can be confident that the seller receives the second-highest valuation (or something relatively close to it) rather than, say, the fifth-, sixth-, or tenth-highest valuation, which might be the outcome of effective collusion.
In this paper, we consider the effects of various auction rules on the ability of bidders to collude. As we show, aspects of an auction that are inconsequential for noncooperative behavior may be material when bidders collude. As an illustration, several years ago the U.S. Federal Communications Commission (FCC) conducted auctions for spectrum licenses in which bids were large in dollar magnitude but there was no constraint on the exact amounts that could be submitted.3 When bids are in the hundreds of millions, no noncooperative bidder is too concerned about the last three digits of its bid, so FCC auction designers who were not focused on deterring collusion did not anticipate problems with the design.
However, bidders quickly realized that the last three digits offered an opportunity for anticompetitive signaling.4 In the standard model with noncollusive bidders, the last three digits of a bid have no strategic value; once the assumption of noncollusive bidders is relaxed, however, such details become important.
3. For discussions of FCC auctions, including the susceptibility of some FCC auctions to collusion, see McMillan (1994), McAfee and McMillan (1996), Weber (1997), Klemperer (1998, 2002, 2003), Cramton and Schwartz (2000, 2002), Kwasnica and Sherstyuk (2001), Brusco and Lopomo (2002), and Milgrom (2004a). For an analysis of inefficiencies induced by FCC auction design choices, see Bajari and Fox (2007).
4. As described in Weber (1997), this kind of signaling occurred at the FCC’s PCS A & B Block Spectrum Auction (FCC Auction 4).
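The cartel strategy for a second-price auction described in the introduction, and its revenue consequence, can be illustrated with a small sketch. It assumes private values, noncooperative bidders who bid their values (a weakly dominant strategy in a sealed-bid second-price auction with private values), and cartel members other than the highest-valuing one bidding zero, that is, below any positive reserve. The function names and values are hypothetical.

```python
def second_price_outcome(bids, reserve):
    """Sealed-bid second-price auction with a reserve price.

    Returns (winner_index, price), or (None, None) if no bid meets the reserve.
    The winner pays the highest competing eligible bid, or the reserve if there is none.
    """
    eligible = sorted(((b, i) for i, b in enumerate(bids) if b >= reserve), reverse=True)
    if not eligible:
        return None, None
    winner = eligible[0][1]
    price = eligible[1][0] if len(eligible) > 1 else reserve
    return winner, price


def cartel_bids(values, cartel):
    """Noncooperative bidders bid their values; within the cartel only the
    highest-valuing member bids its value and the other members bid 0."""
    top = max(cartel, key=lambda i: values[i])
    return [0.0 if (i in cartel and i != top) else v for i, v in enumerate(values)]


values = [100.0, 90.0, 80.0, 70.0, 60.0, 50.0]      # hypothetical private values
print(second_price_outcome(values, 10.0))            # noncooperative bidding
print(second_price_outcome(cartel_bids(values, {0, 1, 2, 3}), 10.0))
```

With six bidders valued 100, 90, ..., 50 and a cartel formed by the four highest-valuing bidders, the same bidder wins but the price falls from the second-highest valuation (90) to the fifth-highest (60), the kind of outcome the introduction contrasts with noncooperative bidding.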
We focus on two components of an auction’s design that can be important for the susceptibility of the auction to collusion. First, we show that the information disclosed during an ascending-bid auction can affect its susceptibility to collusion. A common theme in the auction literature assuming affiliated values is that revenue will be enhanced by designing an auction scheme that reveals as much information as possible during the auction (McAfee and McMillan 1987; Milgrom 2004b). In addition, there may be good economic reasons for greater information disclosure in certain applications.5 However, these arguments for revealing information presume that collusion is not an issue. We show that the revelation of information during an ascending-bid auction can facilitate collusion. But we also show that some minor changes in the rules of the auction, coupled with careful thought about the information revealed when bidders register, can reverse that result completely, leaving the ascending-bid auction immune to certain types of bidder collusion.
Second, we show that the information disclosed about the identities of participating bidders can affect an auction’s susceptibility to collusion. Bidding cartels often organize themselves so that payments are required only from the cartel member that wins the object, with nonwinning cartel members receiving payments from the cartel.6 This type of cartel organization is facilitated when auctioneers release detailed information about the identities of the registered bidders. By withholding that information, the auction designer can potentially create opportunities for the winning cartel member to circumvent payments to its co-conspirators.7
5. For example, in multiobject auctions such as the FCC’s spectrum license auctions, revealing information on bidder identities prior to the auction and on the identities of current high bidders during the auction might provide information about the technological standards that are likely to be adopted, which are relevant for roaming possibilities and the cost of mobile units because of economies of scale (Marx 2006). In such cases, revealing information can potentially improve the efficiency of the auction.
6. This is a commonly observed characteristic of bidding cartels. See, for example, U.S. v. Ronald Pook (No. 87-274, 1988 U.S. Dist. LEXIS 3398; E.D. Pa. April 18, 1988); U.S. v. Seville Industrial Machinery Corp. (696 F.Supp. 986; D.N.J. 1988); District of Columbia v. George Basiliko (No. 91-2518, 1992 U.S. Dist. LEXIS 1260; D.C. February 10, 1992); and NY et al. v. Feldman et al. (No. 01-cv-6691, S.D.N.Y.).
7. As an example of a cartel member not making its agreed-on payment to its co-conspirators, see U.S. v. Portac, Inc. (869 F.2d 1288; 1989 U.S. App. LEXIS 2816). As described in that case, three companies conspired to rig bids at a government timber auction known as the “Up and Adam” timber sale held on March 22, 1985. The companies, Portac, Inc., Hoh River Timber Inc., and Astoria Plywood Corp.,
The cartel will observe these
QUARTERLY JOURNAL OF ECONOMICS
opportunities ex ante and either have to alter its preferred mechanism or give up on collusion entirely. Thus, anticollusive auction design can make collusion more difficult and less palatable to a typical bidding cartel.

To give an example of how bidders have manipulated the bidder registration process to their advantage, consider the Russian oil and gas lease auctions of the past few years.8 In these auctions, at least two registered bidders are required for the auction to proceed. Three empirical regularities in the data are worth noting. First, in the auctions with more than two bidders, competition often appears to be vigorous. Second, in the large plurality of auctions that have only two bidders, many end after the submission of only one bid. Third, many of the bidders that participate in the two-bidder auctions never win any oil or gas leases in our data. In summary, many of the two-bidder auctions appear to be one-bidder auctions in which the single bidder has registered twice or arranged for an agent to register in order to satisfy the requirement that there be at least two registered bidders.

We incorporate the possibility of this type of manipulation into our model by explicitly allowing bidders to register multiple bidder IDs. Also, similar to the Russian oil and gas lease auctions, we consider the possibility that the registration process does not provide complete information about the identities of the registered bidders, making it difficult for bidders to detect duplicate registrations by other bidders.

A number of lessons for auction designers emerge from our analysis. First, designers should consider limiting how much information is released on the number and identities of registered bidders. Second, designers of ascending-bid auctions should consider limiting the information released during the course of the auction.
Third, if registration information can be limited, but circumstances dictate that information on the identity of the current

agreed that Astoria would win the auction, with Portac and Hoh River suppressing their bids, and they agreed that Hoh River would get the Hemlock and Portac would get a share of the Douglas Fir. The sale was indeed won by Astoria, but as stated in the case, “The agreed division of logs from the Up and Adam Sale never came to pass.” In the end, it was the head of Hoh River, who did not receive his agreed-on cartel transfer, who became the government’s prime witness at trial. The auction format used at the Up and Adam Sale facilitated collusion in that it released sufficient information to the bidders so that the cartel knew it would be able to observe whether Astoria won and so whether, according to their agreement, it was supposed to make a transfer to Hoh River. We show that auction formats that withhold information can be less vulnerable to collusion.

8. For more details, see the Online Appendix associated with this paper.
high bidder cannot be suppressed in an ascending-bid format, then a second-price auction may be more robust to collusion than an ascending-bid auction.

Of course, limiting the amount of auction information that is released to the public can potentially increase the scope for corruption by the auctioneer. However, in light of auction automation in recent years, there are many auction environments where opportunities for auctioneer corruption can be minimized by automating the auction process. For example, we are unaware of any concerns regarding auctioneer corruption at FCC spectrum license auctions, where automation has replaced human discretion in bid taking.

Our results suggest that the steps described above make one-shot auctions more robust to collusion. One-shot auctions arise in a variety of contexts, and in many bid rigging cases the illegal behavior described involves only a single auction or procurement.9 The literature on collusion at repeated auctions shows that if a fixed set of bidders participates in an infinite sequence of similar auctions, with bidders’ values drawn from the same distributions at each auction, then they may be able to improve upon their noncooperative payoffs by forming an all-inclusive cartel if they are sufficiently patient (and, for some results, if they have access to a public randomization device).10

In practice, collusion at auctions originates with two different kinds of cartels. Large international market share cartels, such as the citric acid cartel of the 1990s,11 rig bids at procurement auctions, but the extensive repeated interaction of the cartel members on a range of issues beyond bid rigging, along with their focus on a market share agreement, is likely to make limiting registration and auction information a relatively ineffective tool to thwart their attempts to suppress interfirm competition. However, there are other cartels whose sole focus (or nearly so) is on bid rigging.
Collusion at the FCC auctions, as well as the kinds of bid rigging seen in U.S. v. Seville, U.S.

9. Examples include U.S. v. Metropolitan Enterprises, Inc. (728 F.2d 444, 1984); U.S. v. A-A-A Elec. Co., Inc. (788 F.2d 242, 4th Cir. 1986); U.S. v. W.F. Brinkley & Son Construction Company, Inc. (783 F.2d 1157, 4th Cir. 1986); and Finnegan v. Campeau Corp. (722 F.Supp. 1114, S.D.N.Y. 1989).

10. See Fudenberg, Levine, and Maskin (1994); Aoyagi (2003); Hörner and Jamison (2004); Skrzypacz and Hopenhayn (2004); and Blume and Heidhues (2006, 2008). In some cases, even for large discount factors, a history-dependent strategy cannot deter deviations in the absence of equilibrium-path punishment phases (Matsushima 2004).

11. European Commission Decision of 5 December 2001, Case No COMP/E1/36 604—Citric acid (2002/742/EC).
v. Ron Pook, and NY v. Feldman, are examples. In the latter three cases, the cartels were not all-inclusive and had fluid member participation. All cartel transfers were completed at the end of each auction. Repetition was not needed for the ring to capture the full collusive surplus. In these cases, reduced information revelation is likely to have the biggest impact on thwarting collusion. Moreover, the heterogeneity of the ring membership at any auction, as well as the heterogeneity of the objects being sold at different auctions, makes the use of repetition to overcome an anticollusive auction design quite difficult.

The paper proceeds as follows. Section II describes the model. Results are in Section III, including implications of our findings for collusion deterrence. A summary of the main results is contained in Section IV. Section V provides concluding discussion.

II. MODEL

We are interested in bidding cartels that operate in single-object auction environments, where the bidding is only in terms of the price and where the auctioneer is nonstrategic except for setting a fixed reserve price r.12 We use the standard independent private values (IPV) formulation. Bidder i’s value is assumed to be drawn from distribution $F_i$ with density $f_i$ and support $[\underline{v}, \bar{v}]$, where $\underline{v} \geq 0$.13 All bidders are risk-neutral.

We model auctions as involving a registration process and a bidding process in which only registered bidders may bid. We assume there are n ≥ 2 auction participants (as opposed to the number of registered bidders, which could be different), and that participants {1, . . . , k} are eligible to participate in a cartel, where 2 ≤ k ≤ n. We assume that the identities of the k cartel participants are common knowledge within the cartel, but that the total number of auction participants n may not be known to cartel members.
Specifically, we assume that either it is known by all that the cartel is all-inclusive, or it is known by all that the cartel might not be all-inclusive, in which case we assume cartel

12. We assume no resale, but for a discussion of resale in our model, see the working paper version of this paper, Marshall and Marx (2008). See Garratt, Tröger, and Zheng (2007) on the susceptibility of the English auction to collusion when resale is allowed.

13. The heterogeneous independent private values framework has been analyzed by Marshall et al. (1994), Lebrun (1999, 2006), Maskin and Riley (2000), and Bajari (2001). Assuming all distributions have a common interval support simplifies the analysis because, for example, we avoid environments in which only certain bidders could possibly have a value above the reserve price.
members have a common belief distribution over the number of non-cartel bidders, where the distribution is assumed to have unbounded support {0, 1, 2, . . .}. We focus on ascending-bid and second-price auction formats, which we describe below.
II.A. Bidding Formats

Ascending-Bid Auctions. A variety of ascending-bid environments are used in practice and in theory. In this paper, we focus on four that are distinguished by whether reentry is possible and whether the bidder IDs of the active bidders are revealed. We describe these four variations below. In all cases, we assume that if bidders are identified during the auction it is only through their bidder IDs, not the underlying identities behind those bidder IDs.

In many modeling environments, the ascending-bid auction is borrowed from Milgrom and Weber (1982). In that variant, no reentry is possible. Once a bidder ID withdraws from the bidding it cannot reenter. In addition, the number of active bidders is publicly displayed, but the bidder IDs for the active bidders are not revealed. Following Milgrom and Weber (1982), we refer to this variant as the “Japanese English auction without identities” or “JEA without identities.” As a variant of the JEA, one could also have the bidder IDs of the currently active bidders revealed during the auction. We refer to this as a “JEA with identities.”

In other ascending-bid formats, reentry is costless and always possible, as is typically the case at many oral ascending-bid auctions (Izmalkov 2002). In these formats, it may be possible for bidders to observe the bidder ID of the current high bidder. We will refer to this as the “Standard English auction with identities” or “SEA with identities.” As a variant of the SEA, bidders might not observe the bidder ID of the current high bidder, for example, if Internet-based or telephone bids are allowed or if the bidders are able to disguise the fact that they are bidding.
We refer to this as an “SEA without identities.” As an additional example, in some oral ascending-bid livestock auctions, although the identity of the winner is revealed after the auction concludes, the identities of the active bidders and current high bidder are obscured through the use of “ring masters” who accept bids from bidders seated in their assigned areas and transmit those bids to the auctioneer.
We assume that any information that is revealed during the bidding process is in terms of the bidder IDs, not their underlying identities. We assume that in an SEA, the auctioneer always signals when the bid ascent has stopped and allows some brief period for bidding before closing the auction.

In the ascending-bid formats, the amount of the current high bid is observed by all bidders. In particular, the price paid by the winner is observed by all bidders. The winning bidder must be able to observe that it has won and losing bidders must be able to observe that they did not win. In a JEA with identities or an SEA with identities, the bidder ID of the winner is revealed through the auction process. In a JEA without identities and an SEA without identities, the bidder ID of the winner may or may not be revealed to all the bidders. In what follows, we will specify whether the bidder ID of the winner is revealed where necessary. In addition, for studying collusion at ascending-bid auctions it may also be important to specify how bid increments are determined; however, we abstract from this by assuming a continuous price ascent in all of the auction formats we consider.14

Second-Price Auctions. We consider a standard second-price auction in which bidders simultaneously submit bids, with the high bidder winning the object and paying the amount of the second-highest bid or the reserve price, whichever is higher. As with the ascending-bid auction formats, at the conclusion of the auction, the winning bidder must be able to observe that it has won and how much it must pay, and losing bidders must be able to observe that they did not win. In contrast to an ascending-bid auction, in a second-price auction the price paid may only be observed by the winning bidder. However, in what follows, to maintain comparability between second-price and ascending-bid auctions, we assume that at second-price auctions the price paid is observed by all bidders.
In what follows, we will specify where necessary whether the bidder ID of the winner is revealed.

II.B. Registration Regimes

For any of the ascending-bid or second-price auction formats, we assume that before the auction, bidders participate in

14. See Avery (2002) on strategic jump bidding at ascending-bid auctions.
TABLE I
REGISTRATION REGIMES

Regime            List of bidder IDs revealed   Bidder IDs linked to the underlying bidder
Transparent       Yes                           Yes
Semitransparent   Yes                           No, but IDs can be claimed by the bidders
Nontransparent    No                            No, but IDs can be claimed by the bidders
a registration process in which each bidder chooses how many bidder IDs to request and is randomly assigned that number of bidder IDs from an infinite set D. Each bidder ID in D is assigned to at most one bidder. Thus, for each i ∈ {1, . . . , n}, bidder i has a set $D_i$ of bidder IDs for which it is the underlying identity. We assume that bidder i controls the bidding of all bidder IDs in $D_i$. In practice, multiple registrations may be accomplished through formal or informal agreements with another entity, perhaps specifying the terms of resale following success at the auction. We define three possible registration regimes:

1. Transparent registration: Prior to the auction, the auctioneer announces the set of all assigned bidder IDs, $D \equiv \bigcup_{i=1}^{n} D_i$, and their underlying identities; that is, the auctioneer announces the list $\{(i, d) \mid i \in \{1, \ldots, n\},\ d \in D_i\}$.
2. Semitransparent registration: Prior to the auction, the auctioneer announces the set of all assigned bidder IDs, $D \equiv \bigcup_{i=1}^{n} D_i$, but does not reveal their underlying identities.
3. Nontransparent registration: The auctioneer does not reveal the set of assigned bidder IDs nor any information linking bidder IDs with their underlying identities.

These registration regimes are summarized in Table I. We assume that any information revealed by the auctioneer must be accurate, although the auctioneer may choose not to reveal certain information. Under transparent registration, bidders know which auction participant is associated with every bidder ID. Thus, if one registrant has more than one bidder ID, that is revealed to all the bidders. Under semitransparent registration, an all-inclusive cartel can assure itself that no cartel member has multiple bidder IDs if there are exactly k bidder IDs, with each bidder ID claimed by one of the cartel members. Under nontransparent registration,
bidders do not even know the set of assigned bidder IDs prior to the auction.

II.C. Timing

We take as given the bidding format (SEA with identities, SEA without identities, JEA with identities, JEA without identities, or second price) and the registration regime (transparent, semitransparent, or nontransparent) and consider the ability of a cartel to operate successfully. The timing and description of the stages are as follows:

1. Cartel formation: A cartel mechanism is announced (there is commitment to the mechanism). Potential cartel members observe the mechanism and join if and only if their expected payoff from participation in the mechanism is greater than their expected payoff from noncooperative play. Cartel members observe whether all potential cartel members join or not.15 If all potential cartel members join, then the cartel mechanism operates, and otherwise it does not, in which case all bidders participate in the auction noncooperatively.16
2. Values: Bidders learn their values.
3. Cartel mechanism: If the cartel mechanism is operating, cartel members participate in the cartel mechanism. The formal definition of the cartel mechanism is given in the Appendix, but we can describe the cartel mechanism as follows: Each cartel member makes a report to a “center,” which is a standard Myerson (1983) incentiveless mechanism agent. Based on these reports, the center makes nonbinding registration and bid recommendations privately to each cartel member and announces the transfer payments to be required after the auction as a function

15. As described below, we assume non-cartel bidders use the non–weakly dominated strategy of bidding their values, so it is not necessary that they observe the cartel mechanism or whether all potential cartel members join.

16. This is a common assumption in the auction literature. The assumption affects the statement of the individual rationality constraint, but is not necessary for the results of this paper.
An alternative assumption is that refusal by one potential cartel member to join implies that the remaining potential cartel members form a cartel of size k − 1; however, given that we focus on ascending-bid and second-price auctions, the non–weakly dominated bidding strategy of a potential cartel member that does not join is not affected by whether a cartel of the other k − 1 forms or not.
of the reports and observed outcomes. We require that the center’s budget be balanced in expectation. The bid recommendations can be functions of information released as part of the registration and bidding processes as long as the information is available at the time the bid must be submitted. We restrict attention to mechanisms that satisfy incentive compatibility and obedience conditions so that it is a best reply for all cartel members to truthfully report their values to the center and to follow the registration and bid recommendations of the center.
4. Registration process: Bidders participate in a registration process.
5. Release of registration-related information: The auctioneer releases registration-related information as specified by the registration regime.
6. Claiming of bidder IDs: Under semitransparent or nontransparent registration, bidders may claim to have a particular bidder ID, although this is not verifiable. If a cartel member claims a particular bidder ID, then the cartel may use that information. For example, if cartel member i claims bidder ID d, and if the cartel mechanism requires that a cartel member make a payment to the cartel if it wins the object, then the payment can be collected from cartel member i if bidder ID d wins the auction. A bidder cannot credibly communicate that a particular bidder ID is not associated with it.
7. Bidding process: Registered bidders participate in the bidding process, with non-cartel bidders using the non–weakly dominated strategy of bidding their values.17
8. Cartel transfers: Any within-cartel transfer payments required by the mechanism are made.18 We assume that the cartel can compel cartel members to make their required payments.
17. We assume noncolluding bidders follow non–weakly dominated strategies, but cartel members are not so constrained. This assumption is also made in Robinson (1985), Graham and Marshall (1987), and Mailath and Zemsky (1991). The assumption is consistent with observed behavior in U.S. v. Ronald Pook; U.S. v. Seville; and District of Columbia v. George Basiliko.

18. As in Graham, Marshall, and Richard (1990) and Asker (2007), differential transfer payments are possible to account for heterogeneity among cartel members.
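The release-of-information and claiming stages described above depend on the registration regime in Table I. The Python sketch below is an illustration only. It models what the auctioneer announces under each regime, together with the check by which an all-inclusive cartel of size k can rule out duplicate registrations under semitransparent registration. The dictionary representation of registrations, the names `announce` and `duplicates_ruled_out`, and the bidder IDs such as "a17" are hypothetical choices for this sketch, not notation from the model.

```python
from enum import Enum


class Regime(Enum):
    TRANSPARENT = "transparent"
    SEMITRANSPARENT = "semitransparent"
    NONTRANSPARENT = "nontransparent"


def announce(regime, registrations):
    """Return what the auctioneer reveals before the auction.

    registrations: hypothetical dict mapping each bidder i to its set D_i of IDs.
    """
    if regime is Regime.TRANSPARENT:
        # The list {(i, d) | i in {1, ..., n}, d in D_i}: each ID linked to its owner.
        return {(i, d) for i, ids in registrations.items() for d in ids}
    if regime is Regime.SEMITRANSPARENT:
        # The set D, the union of the D_i, with owners withheld.
        return set().union(*registrations.values())
    # Nontransparent: nothing about the assigned IDs is revealed.
    return None


def duplicates_ruled_out(announced_ids, claims, k):
    """Check available to an all-inclusive cartel of size k under semitransparent
    registration: exactly k IDs were assigned and each is claimed by a different
    cartel member. claims: dict mapping each cartel member to its claimed ID.
    """
    claimed = set(claims.values())
    return len(claims) == k and len(announced_ids) == k and claimed == set(announced_ids)


regs = {1: {"a17"}, 2: {"b03", "c88"}}  # hypothetical: bidder 2 holds two IDs
public = announce(Regime.SEMITRANSPARENT, regs)
print(duplicates_ruled_out(public, {1: "a17", 2: "b03"}, k=2))  # False: 3 IDs but k = 2
```

Under transparent registration no such check is needed, because the announced pairs reveal bidder 2's second ID directly. Under nontransparent registration, `announce` returns nothing for the cartel to check against.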
III. RESULTS

As a benchmark, we begin by defining the first-best collusive outcomes. For second-price and ascending-bid auctions, the first-best collusive outcome is for the highest-valuing cartel member to win the object whenever its value exceeds that of the highest-valuing outside bidder and to pay the maximum of the reserve price and the highest outside value. In the context of a second-price auction, this is achieved, for example, when the highest-valuing cartel member bids its value and all other cartel members bid below the reserve or do not bid. In the context of a JEA, the first-best collusive outcome is achieved when the highest-valuing cartel member remains active up to its value and all other cartel members exit at a price less than the highest-valuing cartel member’s value and at a price no greater than the price at which the highest-valuing outside bidder exits. In an SEA, the first-best collusive outcome is achieved when the highest-valuing cartel member bids up to its value and the non-highest-valuing cartel members either do not bid or follow the rule of not bidding when the highest-valuing cartel member is the current high bidder, and of not bidding when an outside bidder is the current high bidder until the highest-valuing cartel member has had an opportunity to bid.

III.A. No Restrictions on Payments

If we allow payments from all cartel members, regardless of whether they win the auction, then a bidding cartel can suppress all within-cartel competition at a second-price or ascending-bid auction using the mechanism of Mailath and Zemsky (1991) or Marshall and Marx (2007). The mechanism of Mailath and Zemsky is ex post budget-balanced but may require payments from multiple cartel members, including those instructed not to bid at the auction. The mechanism of Marshall and Marx is ex ante budget-balanced, but only requires a payment from the highest-reporting cartel member.
In that mechanism, the highest-reporting cartel member pays the center an amount equal to the expected surplus that a bidder with value equal to the second-highest report would receive if it were to bid at the auction against the outside bidders, and the expected value of this payment is distributed among
all the cartel members so that the mechanism satisfies ex ante budget balance.19 It is an equilibrium for all cartel members to report their values truthfully and follow the bid recommendations of the center. To see this, note that we can view cartel members as competing in a second-price auction for the right to be the sole cartel member to attend the auction. The usual second-price logic implies that it is a best reply for cartel members to report truthfully to the mechanism. Once the mechanism has identified the highest-valuing cartel member, cartel members have no incentive to deviate from the recommended bids. In addition, one can easily show that individual rationality is satisfied strictly.

Because the mechanisms of Mailath and Zemsky (1991) and Marshall and Marx (2007) do not rely on any information from the auction itself, they are not affected by the details of the auction rules, including registration and bidding procedures. Thus, we have the following benchmark result.

PROPOSITION 1. When a cartel is unrestricted in its ability to collect payments from cartel members, the first-best collusive outcome can be achieved at any second-price or ascending-bid auction, regardless of registration transparency and regardless of auction details.

Proof. The results follow from either Mailath and Zemsky (1991) or Marshall and Marx (2007). Our distributional assumptions are stronger than those in Mailath and Zemsky, who allow bidders’ value distributions to have different supports, and although Marshall and Marx place additional restrictions on the densities $f_i$, these additional assumptions are not necessary for their results for second-price auctions. QED
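The first-best benchmark defined at the start of this section can be made concrete numerically. The sketch below implements its second-price version: the highest-valuing cartel member bids its value, the other members stay out, and outside bidders bid their values. For comparison, it also computes the noncooperative outcome in which every bidder bids its value. The labels ("c", j) and ("o", j) for cartel and outside bidders, the function names, and the tie-breaking rule are assumptions made for this example.

```python
def second_price(bids, r):
    """Second-price auction with reserve r over a dict {bidder: bid}.

    Returns (winner, price), or (None, None) if no bid reaches r.
    Ties go to the bidder listed first (an arbitrary choice for this sketch).
    """
    eligible = [(b, x) for b, x in bids.items() if x >= r]
    if not eligible:
        return None, None
    eligible.sort(key=lambda bx: bx[1], reverse=True)  # stable sort
    winner = eligible[0][0]
    price = max(r, eligible[1][1]) if len(eligible) > 1 else r
    return winner, price


def noncooperative(cartel_values, outside_values, r):
    """Every bidder, inside or outside the cartel, bids its value."""
    bids = {("c", j): v for j, v in enumerate(cartel_values)}
    bids.update({("o", j): v for j, v in enumerate(outside_values)})
    return second_price(bids, r)


def first_best_collusive(cartel_values, outside_values, r):
    """Only the highest-valuing cartel member bids; the other members do not bid."""
    top = max(range(len(cartel_values)), key=lambda j: cartel_values[j])
    bids = {("c", top): cartel_values[top]}
    bids.update({("o", j): v for j, v in enumerate(outside_values)})
    return second_price(bids, r)


cartel, outside, r = [0.9, 0.7], [0.5], 0.1
print(noncooperative(cartel, outside, r))        # (('c', 0), 0.7)
print(first_best_collusive(cartel, outside, r))  # (('c', 0), 0.5)
```

Under collusion, the winning member pays max{r, highest outside value}, 0.5 here, rather than the second cartel value, 0.7. The difference is the within-cartel competition that a collusive mechanism suppresses.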
19. Specifically, in the mechanism of Marshall and Marx (2007), if $r$ is the reserve price, $\tilde{v}^{out}_{(1)}$ is the highest value among the bidders outside the cartel (zero if the cartel is all-inclusive), and $s_{(2)}$ is the second-highest report from a cartel member, then the highest-reporting cartel member pays the center

$$E_{\tilde{v}^{out}}\left[\left(s_{(2)} - \max\{r, \tilde{v}^{out}_{(1)}\}\right) \mathbf{1}_{\{s_{(2)} \geq \max\{r, \tilde{v}^{out}_{(1)}\}\}}\right] - p, \qquad (1)$$
where p is 1/k of the ex ante expected payment by the high-valuing cartel member, and all other cartel members receive payment p. The mechanism then recommends that the highest-reporting cartel member bid its report at a second-price auction or bid up to its report at an ascending-bid auction, with all other cartel members bidding some amount below the reserve price.
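The first term of equation (1) is an expectation over outside values, so it can be approximated by simulation. The Python sketch below does this under assumptions that are ours, not the paper's: a known number m of outside bidders, and all values, inside and outside the cartel, i.i.d. uniform on [0, 1]. It also estimates p as 1/k of the ex ante expected payment by the highest-reporting member. The function names are hypothetical.

```python
import random


def realized_surplus(s2, r, outside_values):
    """(s2 - max{r, v_out}) * 1{s2 >= max{r, v_out}} for one draw of outside values.

    v_out is the highest outside value, taken as 0 when there are no outsiders.
    """
    hurdle = max(r, max(outside_values, default=0.0))
    return s2 - hurdle if s2 >= hurdle else 0.0


def expected_surplus(s2, r, m, n_draws=100_000, seed=0):
    """Monte Carlo estimate of the expectation in equation (1), given s2.

    Assumption: m outside bidders with values i.i.d. Uniform[0, 1].
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        outside = [rng.random() for _ in range(m)]
        total += realized_surplus(s2, r, outside)
    return total / n_draws


def ex_ante_share(k, r, m, n_draws=100_000, seed=1):
    """Estimate p: 1/k of the ex ante expected payment by the highest reporter.

    Assumption: k truthful cartel members with values i.i.d. Uniform[0, 1].
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        reports = sorted((rng.random() for _ in range(k)), reverse=True)
        outside = [rng.random() for _ in range(m)]
        total += realized_surplus(reports[1], r, outside)
    return total / n_draws / k


k, r, m = 2, 0.2, 1
p = ex_ante_share(k, r, m)
print(expected_surplus(0.6, r, m) - p)  # net payment of the top reporter when s2 = 0.6
```

Because the highest reporter's expected gross payment is k·p and every member, the highest reporter included, receives p, the center's expected net receipts are zero. This is the ex ante budget balance described in the text.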
III.B. Payments Only from Winners

The mechanisms of Mailath and Zemsky (1991) and Marshall and Marx (2007) allow first-best collusion at a second-price or ascending-bid auction regardless of whether the identity of the winner or price paid is revealed. However, a cartel might prefer a mechanism that only requires a payment from the highest-valuing cartel member when that cartel member wins the object at the auction. This is particularly relevant for procurement auctions, where cartel members may wish to fund transfer payments from auction proceeds or use subcontracting arrangements with other cartel members. The cartel may also prefer payments only from winners if the liquidity required to make the payment will come from the object being sold. In many prosecuted bidding cartels, only the winner made payments to the cartel.20 In a number of bidding cartels using post-auction knockouts, only the cartel member ultimately receiving the object made payments to the cartel.21

If the auctioneer or auction process reveals the identity of the winner, then a bidding cartel can condition transfer payments on that information. The mechanism of Graham and Marshall (1987) allows a bidding cartel to suppress all within-cartel competition while only requiring a payment from a cartel member if that cartel member wins the auction. In this mechanism, cartel members make reports to the center and the center recommends that non-highest-reporting cartel members bid below the reserve price at the auction, whereas the highest-reporting cartel member bids its report at a second-price auction or up to its report at an ascending-bid auction. If the cartel member wins the auction, it pays the center nothing if the auction price is greater than the second-highest report from the cartel. If the second-highest cartel report exceeds the price paid at the auction, then the winning cartel bidder pays the center the difference between the second-highest report and the price at the auction.
Specifically, if the cartel members submit reports $s_1 \geq s_2 \geq \cdots \geq s_k$, a cartel member that wins the auction at price $p$ must pay the center $\max\{0, s_2 - p\}$. Ex ante

20. Examples include the collectable stamp cartel described in Asker (2007); U.S. v. A-A-A Elec. Co., Inc. (788 F.2d 242; 4th Cir. 1986), where A-A-A did not make payments to its co-conspirators until after receiving final payment from the buyer; U.S. v. Metropolitan Enterprises, Inc. (728 F.2d 444, 1984); and U.S. v. Inryco, Inc. (642 F.2d 290, 1981), where subcontracting arrangements were used to transfer payments between cartel members.

21. Examples include those prosecuted in U.S. v. Seville Industrial Machinery Corp., U.S. v. Ronald Pook, and District of Columbia v. George Basiliko.
budget balance is achieved by having the center make a payment to each cartel member equal to 1/k times the expected revenue to the center as a result of payments by winning cartel members. Given this payment rule, a cartel member has no incentive to overreport. If doing so makes the difference between the cartel member’s report being highest and not, then the second-highest report is greater than the cartel member’s value, and the payment rule guarantees that the cartel member will have to pay, in total, an amount greater than its value if it wins the object. Similarly, there is no incentive to underreport. If doing so makes the difference between the cartel member’s report being highest and not, then, because the highest-reporting cartel member bids truthfully at the auction, the deviating cartel member obtains no collusive gain.

We assume that any results of the bidding process are reported to the bidders only in terms of the bidder IDs, not the underlying identities behind those bidder IDs. Thus, when we say that a cartel can only collect payments from a cartel member that wins the auction, we mean that in semitransparent and nontransparent registration regimes, the cartel can only collect payments from a cartel member if that cartel member claims a bidder ID d during the “claiming of bidder IDs” phase and bidder ID d is observed to win the auction.

PROPOSITION 2. When a cartel can only collect payments from a cartel member that wins the auction, the first-best collusive outcome can be achieved at any second-price or ascending-bid auction that has transparent registration and that reveals the winning bidder ID.

Proof. In this environment (transparent registration and the auctioneer reveals the winning bidder ID), the cartel can identify whether a particular cartel member has won the auction.
In an ascending-bid auction, bidders observe the price paid as part of the bidding process, and in a second-price auction we assume the auctioneer reveals the price paid. Thus, the cartel can use the mechanism of Graham and Marshall (1987) to achieve the first-best collusive outcome. Although Graham and Marshall assume symmetric bidders, Graham, Marshall, and Richard (1990) show that their collusive mechanism continues to be incentive-compatible and individually rational in our environment with potentially asymmetric bidders. QED
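The Graham and Marshall (1987) payment rule used in this proof is simple enough to state in a few lines. The Python sketch below assumes a second-price auction in which outside bidders bid their values. It computes the winning member's payment max{0, s_2 − p} and shows, for one hypothetical value profile, why overreporting does not pay. The ex ante rebate of 1/k of expected center revenue is left out because it is fixed before reports are made. The function names and the tie-breaking rule are assumptions of this illustration.

```python
def auction_outcome(top_report, outside_values, r):
    """Second-price auction in which only the top cartel reporter bids (its report)
    and outside bidders bid their values. Returns (cartel_wins, price).

    Ties go to the cartel bidder, an arbitrary choice for this sketch.
    """
    bids = [("cartel", top_report)] + [("outside", v) for v in outside_values]
    bids = [b for b in bids if b[1] >= r]  # bids below the reserve are ignored
    if not bids:
        return False, None
    bids.sort(key=lambda b: b[1], reverse=True)  # stable, so ties keep list order
    price = max(r, bids[1][1]) if len(bids) > 1 else r
    return bids[0][0] == "cartel", price


def gm_payment(reports, cartel_wins, price):
    """A winning cartel member pays max{0, s_2 - p}; nothing is owed otherwise."""
    if not cartel_wins:
        return 0.0
    s = sorted(reports, reverse=True)
    s2 = s[1] if len(s) > 1 else 0.0
    return max(0.0, s2 - price)


def member_payoff(member, value, reports, outside_values, r):
    """Realized payoff of cartel member `member` (an index into reports) when all
    members obey the center; the ex ante rebate is left out."""
    top = max(range(len(reports)), key=lambda j: reports[j])
    if member != top:
        return 0.0  # told to bid below the reserve, so it cannot win
    wins, price = auction_outcome(reports[top], outside_values, r)
    if not wins:
        return 0.0
    return value - price - gm_payment(reports, wins, price)


outside, r = [0.25], 0.0
print(member_payoff(0, 0.875, [0.875, 0.625], outside, r))   # 0.25: truthful top member
print(member_payoff(1, 0.625, [0.875, 0.9375], outside, r))  # -0.25: member 1 overreports
```

In the second call, member 1 (value 0.625) displaces member 0 by reporting 0.9375. It then wins at 0.25 but owes 0.875 − 0.25 = 0.625, so its total outlay exceeds its value. This matches the overreporting argument in the text.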
Proposition 2 shows that under transparent registration, a restriction that the cartel only collect a payment from a cartel member who wins does not affect the profitability of collusion if the identity of the winner is revealed. However, if information suppression by the auctioneer, shill bidding, multiple registrations, subcontracting, or other arrangements interfere with the ability of cartel members to learn the true identity of the winner, then the result changes.

PROPOSITION 3. When a cartel can only collect payments from a cartel member that wins the auction, the first-best collusive outcome cannot be achieved at a second-price auction or ascending-bid auction without identities (JEA or SEA) that has nontransparent registration (even if the auctioneer reveals the winning bidder ID).

Proof. Assume a second-price auction with nontransparent registration, and consider a collusive mechanism that achieves the first-best collusive outcome and that only collects payments from a cartel member that wins the auction. Relying on the Revelation Principle, assume the mechanism is incentive-compatible in terms both of the truthful revelation of values and of obedience to the mechanism’s recommended registration and bidding behavior (Myerson 1985). To achieve the first-best collusive outcome, the highest-valuing cartel member must bid its value and non-highest-valuing cartel members must bid below the reserve price or not bid. (If the cartel is all-inclusive, then the first-best collusive outcome can also be achieved by having the highest-valuing cartel member bid above its value.) If the cartel does not require any payments from cartel members, then a cartel member with value greater than the reserve price can profitably deviate by reporting a value equal to the upper support of the value distribution and then bidding its value at the auction.
In this case, because we assume that all bidders have a common upper support of their value distributions, all other cartel members would be instructed by the cartel to bid below the reserve price or not bid, and so the deviation would increase the deviating cartel member’s payoff whenever its value was greater than those of the outside bidders but not the highest in the cartel. Thus, with positive probability, the collusive mechanism must require a payment from a cartel member that wins the auction. But if a cartel member has a positive expected payment to the cartel in the event that it wins the auction, and no payment if it does not win, then a cartel member can profitably deviate by reporting a value equal to the upper support of the value distribution and also registering a bidder ID that it does not reveal to the cartel (with nontransparent registration, no inference is possible by the cartel regarding multiple registrations by its members). The deviating cartel member can use that bidder ID to bid its value at the auction, while bidding zero with any other bidder IDs it has. The deviation allows the deviating cartel member to avoid having to make a payment to the cartel and is profitable whenever the cartel member’s value is greater than those of the outside bidders. Because the JEA without identities and the SEA without identities provide no information during the auction process that can be used to identify the current high bidder, the proof for those auction formats proceeds as in the case of a second-price auction. QED
Proposition 3 shows that rules exist for second-price auctions and ascending-bid auctions without identities that prevent a cartel from achieving the first-best collusive outcome using a mechanism that only collects payments from a cartel member that wins the auction. When cartel members can register bidders whose underlying identities cannot be traced to them, cartel members prefer to use such a bidder to avoid having to make a payment to the center in the event that they win. Thus, first-best collusion cannot be sustained. In particular, with nontransparent registration, the mechanism of Graham and Marshall (1987) no longer works because the highest-valuing cartel member can use a bidder ID that is not recognized by the cartel and thereby avoid having to make a payment to the cartel. In the environment of Proposition 3, correlating devices with no transfers are the only available mechanisms for collusion in a one-shot environment.
Although the cartel cannot achieve the first-best collusive outcome, the cartel mechanism can still play the role of an equilibrium selection device if there are multiple equilibria and can allow the cartel to implement a correlated equilibrium.22
22. If within-cartel payments can only be required from a cartel member who wins the auction, and if the auction process does not reveal the underlying identity of the winner, then a cartel member winning the auction has no incentive to pay (absent repeated-game incentives). Thus, if the auction process does not reveal the underlying identity of the winner, a cartel at a second-price auction must rely on correlated equilibria with no transfers among cartel members. For more discussion of this case, see the working paper version of this paper, Marshall and Marx (2008). For the development of this type of mechanism in an environment with resale, see Garratt, Tröger, and Zheng (2007).
Comparing Propositions 2 and 3, we see that there may be an incentive for a cartel to convert nontransparent registration to transparent registration if possible. For example, at the FCC’s Nationwide Narrowband (PCS) Auction (FCC Auction 1), the FCC’s intention was to hold an ascending-bid auction with identities, but with nontransparent registration. However, bidders were able to observe movements in and out of bidding booths and connect those with the timing of the posting of bids to figure out which bidder IDs were associated with which auction participants. The following proposition considers semitransparent registration. With a non-all-inclusive cartel, the proposition’s result depends on whether non-cartel bidders, that is, bidders that are truly independent non-cartel bidders, can and do identify themselves and claim their bidder IDs in a credible way. When a cartel member claims a bidder ID, it commits to making any payments required based on the observed bidding behavior of that bidder ID, but when a non-cartel member claims a bidder ID, that information is only useful to the cartel if it represents a credible statement that the claimed bidder ID is not actually the bidder ID associated with one of the cartel members, so the credibility of the claim becomes important.
PROPOSITION 4. When a cartel can only collect payments from a cartel member that wins the auction, the first-best collusive outcome cannot be achieved at a second-price auction or ascending-bid auction without identities (JEA or SEA) that has semitransparent registration and that reveals the winning bidder ID, unless the cartel is all-inclusive or the cartel is not all-inclusive and all non-cartel participants identify themselves and claim their bidder IDs in a credible way.
Proof. If the cartel is all-inclusive or the cartel is not all-inclusive and all non-cartel participants identify themselves and claim their bidder IDs in a credible way, then the cartel can achieve the first-best collusive outcome by recommending reversion to noncooperative bidding with no transfers unless it is observed that all bidder IDs are claimed by a cartel member or outside bidder. If all bidder IDs are claimed, then bidding and transfers are defined as in Graham and Marshall (1987). In this environment, a cartel member cannot profitably deviate by registering a bidder ID that it does not claim in an attempt to avoid having to make a payment to the cartel because that deviation would result in an unclaimed bidder ID and, thus, reversion to noncooperative bidding. However, if the cartel is not all-inclusive and non-cartel participants either cannot or do not credibly identify themselves and credibly claim their bidder IDs, then as in the proof of Proposition 3, the first-best collusive outcome cannot be achieved, because any mechanism achieving the first-best collusive outcome is vulnerable to deviations in which a cartel member registers a second bidder ID that it does not claim but that it uses to submit its bid. Because the JEA without identities and the SEA without identities provide no information during the auction process that can be used to identify the current high bidder, the proof for those auction formats proceeds as in the case of a second-price auction. QED
Comparing the results of Propositions 2, 3, and 4 for second-price auctions, we have the following result.
COROLLARY 1. At a second-price auction or ascending-bid auction without identities, transparent registration can be procollusive relative to semitransparent registration, which can be procollusive relative to nontransparent registration.
A second-price or ascending-bid format that releases detailed information about the registered bidders prior to the auction can be procollusive because it can allow a cartel to police attempts by cartel members to set up alternative bidder identities that might allow them to disrupt the ability of the collusive mechanism to collect payments from a winning cartel member. Corollary 1 suggests that subcontracting and resale agreements arranged prior to an auction might be anticollusive if they established a second identity under which a cartel member could bid without being recognized as the underlying identity.
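The deviation used in the proofs of Propositions 3 and 4 can be illustrated with a deliberately simple, hypothetical payment rule: the cartel collects a fixed fee from a winning member whose bidder ID it recognizes and pays nothing to anyone else. This rule is an assumption for illustration, not the Graham and Marshall (1987) scheme. The deviator reports the upper support of the value distribution so that the other members abstain, then bids its value under an extra, unclaimed bidder ID. With nontransparent registration the cartel cannot detect that ID; under the Proposition 4 rule, an unclaimed ID triggers reversion to noncooperative bidding. Outside bidders are assumed to bid their values, ties are ignored, and all names are illustrative.

```python
from typing import Iterable, Sequence


def cartel_mode(registered_ids: Iterable[str], claimed_ids: Iterable[str]) -> str:
    """Proposition 4-style rule: collude only if every registered bidder ID has been claimed."""
    return "collude" if set(registered_ids) <= set(claimed_ids) else "noncooperative"


def clearing_price(competing_bids: Sequence[float], reserve: float) -> float:
    """Price paid by the winner of a second-price auction: the best competing bid or the reserve."""
    return max([reserve, *competing_bids])


def obey_payoff(values, i, outside, reserve, fee):
    """Member i's payoff when all members follow the hypothetical winner-pays-a-fee mechanism."""
    top = max(range(len(values)), key=lambda j: values[j])
    price = clearing_price(outside, reserve)  # the other members abstain
    if i == top and values[i] > price:
        return values[i] - price - fee
    return 0


def hidden_id_payoff(values, i, outside, reserve, reversion_rule):
    """Member i's payoff from bidding its value under an unclaimed bidder ID."""
    if reversion_rule:
        # The unclaimed ID is noticed, so every bidder reverts to bidding its value.
        competing = [v for j, v in enumerate(values) if j != i] + list(outside)
    else:
        # Nontransparent registration: the extra ID goes unnoticed and the other members abstain.
        competing = list(outside)
    price = clearing_price(competing, reserve)
    # No fee is paid because the winning bidder ID is not recognized by the cartel.
    return values[i] - price if values[i] > price else 0


values, outside, reserve, fee = [10, 8], [6], 2, 1
print(cartel_mode(["A", "B", "C", "H"], ["A", "B", "C"]))  # "H" is unclaimed
for i in range(len(values)):
    print(i,
          obey_payoff(values, i, outside, reserve, fee),
          hidden_id_payoff(values, i, outside, reserve, reversion_rule=False),
          hidden_id_payoff(values, i, outside, reserve, reversion_rule=True))
```

In this example both members gain from the hidden ID when the cartel cannot detect it (4 versus 3, and 2 versus 0), while under the reversion rule the deviation earns only the noncooperative payoff (2 and 0). Whether reversion deters a particular member depends on the payment rule; with this hypothetical fee it does.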
However, such arrangements can be procollusive in other contexts, such as if subcontracting can be used to implement transfer payments among cartel members (Kovacic et al. 2006). In contrast to the above results, at an ascending-bid auction with identities the presence of bidders whose underlying identities cannot be observed need not eliminate the possibility of first-best collusion. In some environments, we can construct a collusive mechanism, which we refer to as a “responsive to outside bidders” or “ROB” mechanism, that employs the payment scheme of Graham and Marshall (1987) but requires active bidding by non-highest-valuing cartel members and thereby restores the possibility of first-best collusion at ascending-bid auctions when registration is not transparent. In the case of a JEA with identities, the ROB mechanism instructs the cartel members to claim their bidder IDs and instructs non-highest-valuing cartel members to stay active up to their values or until the last bidder that is not identifiable as a cartel member exits, whichever comes first. Under this mechanism, if a cartel member attempts to win the object using an unclaimed identity to avoid making a payment to the cartel, the other cartel members remain active up to their values and there is no collusive gain. In the case of an SEA with identities, the ROB mechanism once again instructs the highest-valuing cartel member to reveal its bidder ID to the other cartel members, and it instructs the non-highest-valuing cartel members to bid if the price is less than their values and the auctioneer has signaled that the auction is about to close and the current high bidder is not identifiable as the highest-valuing cartel member. The highest-valuing cartel member is instructed to bid promptly whenever it is not the current high bidder and the price is less than its value. Again, under this mechanism, if the highest-valuing cartel member attempts to win through a disguised identity to avoid making payments to the cartel, the collusive gain is lost because the other cartel members bid up to their values. Under the ROB mechanism, cartel members bid up to their values as long as they perceive competition from bidder IDs not claimed by the cartel, and this deters deviations based on disguised identities.
PROPOSITION 5. When a cartel can collect payments only from a cartel member that wins the auction, the first-best collusive outcome can be achieved at a JEA with identities or an SEA with identities, even with semitransparent or nontransparent registration.
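The JEA version of the ROB mechanism can be sketched with drop-out prices, under stated assumptions: values are distinct, there is no reserve price, and in a JEA with identities the cartel sees exactly which bidder IDs remain active. Each ID's drop-out price is the price at which it exits; the last remaining ID wins at the highest drop-out price among the others. Non-highest-valuing members exit at the smaller of their value and the exit price of the last unclaimed ID, taken here to be the highest value among unclaimed IDs (if that ID wins, it never exits, but the members then reach their own values first anyway). The labels and function name are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def jea_rob_outcome(top_value: float, member_values: Sequence[float],
                    outside_values: Sequence[float], deviate: bool) -> Tuple[str, float]:
    """Winner label and price at a JEA with identities under the ROB rule.

    deviate=False: the highest-valuing member bids under its claimed bidder ID.
    deviate=True: it bids instead under a hidden, unclaimed bidder ID.
    """
    unclaimed: Dict[str, float] = {f"outside{k}": v for k, v in enumerate(outside_values)}
    dropouts: Dict[str, float] = {}
    if deviate:
        unclaimed["hidden"] = top_value
    else:
        dropouts["top"] = top_value
    dropouts.update(unclaimed)  # unclaimed IDs stay active up to their values
    last_unclaimed_exit = max(unclaimed.values(), default=0.0)
    for j, v in enumerate(member_values):
        # ROB: stay active up to own value or until the last unclaimed ID exits.
        dropouts[f"member{j}"] = min(v, last_unclaimed_exit)
    ranked = sorted(dropouts.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price


print(jea_rob_outcome(10, [8], [6], deviate=False))  # top wins at the outside bidder's value
print(jea_rob_outcome(10, [8], [6], deviate=True))   # hidden ID wins at the competitive price
```

With values 10 and 8 in the cartel and 6 outside, noncooperative bidding would give a price of 8. Bidding under the claimed ID yields a price of 6, while bidding through the hidden ID restores the price of 8, so the collusive gain of 2 disappears.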
As shown in Proposition 5, the cartel’s ability to eliminate cartel members’ use of disguised identities as a profitable strategy at ascending-bid auctions does not depend on whether reentry is possible—effective cartel strategies exist for both the JEA with identities and the SEA with identities, namely the ROB mechanism. As this argument shows, in both a JEA and SEA with identities, auction rules may permit cartel strategies that prevent disguised identities from being used by cartel members to cheat on the cartel. In such environments, these ascending-bid auctions are more susceptible to collusion than a second-price auction.
COROLLARY 2. The susceptibility of ascending-bid auctions to collusion depends on whether the auctions are with or without identities but not on whether reentry is allowed (SEA allows reentry and JEA does not).
Proposition 5 contrasts with Propositions 3 and 4 and shows that in some environments ascending-bid auctions are more susceptible to collusion than second-price auctions.
COROLLARY 3. With nontransparent registration and in some cases with semitransparent registration, ascending-bid auctions with identities are more susceptible to collusion than second-price auctions.
Because the economics literature on bidder collusion has typically focused on transparent registration, the result of Corollary 3 is in stark contrast with some existing results. For example, Graham and Marshall (1987, p. 1234) state, “Models of single-object second-price and English auctions have been proposed in which cooperative behavior is permitted and in which the auctioneer is allowed to respond strategically to such behavior. . . . Therefore, the revenue equivalence result for the second-price and English auctions within the IPV context extends to cooperative behavior.” As Corollary 3 shows, the revenue equivalence result does not extend to cooperative behavior in environments with nontransparent or semitransparent registration.
III.C. Implications for Bid Data
In the ROB mechanism described in the previous section, at a JEA one would expect to observe cartel members exiting the auction at the same time as the last outside bidder. Traces of this simultaneous exit might be detected in bid data.
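A simple retrospective screen for such simultaneous exits might group exits by price and list the bidder IDs that drop out at exactly the same price as a bidder of interest (for example, the last bidder not identified as a cartel member). The input format, a list of (bidder ID, exit price) pairs from a JEA-style auction, and the function names are assumptions for illustration; real bid data would usually call for a tolerance or round-based matching, and a flagged cluster is at most suggestive.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def simultaneous_exit_groups(exit_log: Iterable[Tuple[str, float]],
                             min_group: int = 3) -> Dict[float, List[str]]:
    """Exit prices at which at least `min_group` bidder IDs dropped out together."""
    by_price: Dict[float, List[str]] = defaultdict(list)
    for bidder_id, price in exit_log:
        by_price[price].append(bidder_id)
    return {p: ids for p, ids in sorted(by_price.items()) if len(ids) >= min_group}


def exits_alongside(exit_log: Iterable[Tuple[str, float]], target_id: str) -> List[str]:
    """Bidder IDs that exited at exactly the same price as `target_id`."""
    log = list(exit_log)
    target_price = dict(log)[target_id]
    return sorted(b for b, p in log if p == target_price and b != target_id)


exits = [("A", 5.0), ("B", 7.0), ("C", 7.0), ("D", 7.0), ("E", 9.0)]
print(simultaneous_exit_groups(exits))  # {7.0: ['B', 'C', 'D']}
print(exits_alongside(exits, "B"))      # ['C', 'D']
```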
For example, a paper submitted to the FCC in 2007 by Gregory Rose alleges that in FCC Auction 66 for Advanced Wireless Services, there was a mass simultaneous exit of incumbent wireless providers at the point when Wireless DBS LLC, a joint venture of the two leading satellite TV companies and a potential new competitor to the existing wireless providers, exited the bidding for the large F Block spectrum licenses.23 Although there are many possible explanations for the bidding behavior at FCC Auction 66, this example demonstrates that such bidding behavior potentially can be detected through a retrospective analysis of the data.
23. Gregory Rose, “How Incumbents Blocked New Entrants in the AWS-1 Auction: Lessons for the Future,” FCC Docket No. 06-150, filed on behalf of Public Interest Spectrum Coalition by Media Access Project, April 23, 2007.

TABLE II
SUMMARY OF RESULTS
Can the first-best collusive outcome be achieved?

Payment restriction / auction format | Nontransparent registration | Semitransparent registration | Transparent registration
Unrestricted ability to collect payments | Yes (Prop. 1) | Yes (Prop. 1) | Yes (Prop. 1)
Restricted to payments only from winners: | | |
   Ascending with IDs (JEA or SEA) | Yes (Prop. 5) | Yes (Prop. 5) | Yes (Prop. 2)
   Ascending without IDs^a (JEA or SEA) | No (Prop. 3) | No (Prop. 4) (exceptions^b) | Yes (Prop. 2)
   Second price | No (Prop. 3) | No (Prop. 4) (exceptions^b) | Yes (Prop. 2)

^a Assume the auctioneer reveals the identity of the winner.
^b Yes, if the cartel is all-inclusive or all outside bidders credibly identify themselves and claim their bidder IDs.

IV. SUMMARY OF RESULTS
Our results show that one auction format may be more or less susceptible to collusion depending on the details of the auction rules and environment as well as the strength of the cartel, in particular the cartel’s ability to collect payments from its members. The difference between the second-price and ascending-bid auction formats arises when the cartel restricts attention to mechanisms in which only the winner pays and when the use of disguised identities is possible. Differences within ascending-bid formats depend on the informational environment, that is, on whether the auction is with identities or without identities, but not on whether reentry is possible: collusive opportunities are the same at a JEA and an SEA as long as both are with identities or both are without identities. Our results are summarized in Table II. As shown by the last column of the table, in an environment with transparent registration, the auction formats we consider are all equally susceptible to collusion. But reading down the other columns, we see that for other registration regimes, auction design decisions can affect the susceptibility of the auction to collusion. Reading across the rows, we see that for a given auction format, the registration regime can affect the susceptibility of the auction to collusion. Finally, in some cases comparisons “along the diagonal” in the table may be relevant. For example, if a given auction format necessitates a particular registration regime, then the relevant comparisons involve changes in both the auction format and the registration regime. The results shown in Table II that are based on Proposition 1 are supported by the collusive mechanism of Marshall and Marx (2007), which involves transfer payments that do not depend on the outcome of the auction and may be required of a cartel member that does not win the auction. The results based on Proposition 2 are supported by the collusive mechanism of Graham and Marshall (1987), which involves a transfer payment only from a cartel member that wins the auction. The other two “yes” results in Table II, which are based on Proposition 5, are supported by the ROB collusive mechanism described in Section III.B, which requires that all cartel members participate in the auction and bid in a way that does not reduce the collusive gain, but does prevent bidder IDs not recognized as belonging to a cartel member from winning the auction at a price less than the values of the cartel members. The negative results in Table II, which reference Propositions 3 and 4, are novel in that the literature on bidder collusion typically views ascending-bid and second-price auctions as equally susceptible to collusion, a view that we show follows from the literature’s focus on transparent registration.
V. DISCUSSION
Many results in the auction literature that hold for second-price auctions also hold for ascending-bid auctions, and vice versa. However, the results of this paper show that ascending-bid and second-price auctions differ in their susceptibility to collusion. Specifically, a cartel operating at an ascending-bid auction with identities need not be disrupted by nontransparent registration, but we show that under certain conditions nontransparent registration forces a cartel at a second-price auction to revert to noncollusive play.
The results of this paper suggest that both the design of an auction and the actions of auctioneers can affect the profitability of collusion. Auction designs and auctioneer actions that reduce the profitability of collusion can be expected to inhibit collusion. Prior to an auction, steps can be taken to facilitate the use of disguised identities by potential cartel members, such as using nontransparent registration. During ascending-bid auctions, information on the identities of the active bidders and the current high bidder can be suppressed. After an auction, if possible, the auctioneer can keep the identity of the winner anonymous. Also, after an auction, bid data can be reviewed for evidence of simultaneous exit that might be suggestive of cartel behavior. Although reduced information disclosure has potential benefits in terms of inhibiting collusion, these gains must be balanced against the costs in terms of potentially increasing the scope for corruption by the auctioneer and potential costs in terms of decreased efficiency in environments with externalities (see footnote 5). With regard to the increased potential for auctioneer corruption, many auctions that were historically conducted by human bid takers can now be run with computer-automated procedures. This incremental design change greatly mitigates possibilities for auctioneer corruption and thereby makes our recommendations implementable without confronting a substantial trade-off. “Transparency in bidding” has been touted by the federal government. As we show in this paper, preauction transparency in the form of transparent registration, and real-time transparency in the form of revelation of the identities of the active bidders in a JEA and the identity of the current high bidder in an SEA, increase susceptibility to collusion. Thus, pre-auction and real-time transparency can be procollusive. 
If the primary motivation for “transparency in bidding” is concern about the possibility of corruption by the auctioneer, then post-auction transparency, where auction results are made public after the conclusion of the auction, may provide sufficient information to monitor the auction process without being as procollusive as pre-auction or real-time transparency. Additional benefits associated with suppressing information on the identities of active bidders and current high bidders are possible in simultaneous multiple-object auctions. For example, in the FCC’s spectrum license auctions, information on the identities of bidders can potentially facilitate retaliatory bidding, signaling, gaming of the auction’s activity rule, and other attempts to deter or foreclose entry into markets (Brusco and Lopomo 2002; Reitsma et al. 2002; Marx 2006). Recently the FCC announced that in some cases it would modify its simultaneous multiple round auction (a multiobject variant of an English auction) so that bidders could no longer observe which bidder had submitted which bids. The FCC argued that this change would make its auctions less susceptible to collusion, a conclusion that is supported by the analysis of this paper.
APPENDIX
In this Appendix, we define a collusive mechanism for a second-price auction when no registration-related information is revealed. The definition can be adapted for other auction formats and registration regimes. We focus on incentive-compatible, ex ante budget-balanced, strictly individually rational collusive mechanisms.
Assume bidder $i$ draws its values from distribution $F_i$ with interval support $S$, where $S$ is common to all bidders. Let $K \equiv \{1, \ldots, k\}$. To allow the possibility that cartel members can submit multiple bids, for $i \in K$, let cartel member $i$'s bid $b_i$ be a finite-dimensional vector. If the mechanism recommends that cartel member $i$ submit bid vector $b_i$ with dimension $m_i$, we interpret that as a recommendation that cartel member $i$ should register $m_i$ bidders with itself as the underlying identity and submit bids accordingly. Let $B$ be the set of possible vectors of bid recommendations. For $i \in K$, let $\pi_i(v_i, b_1, \ldots, b_k)$ be cartel member $i$'s expected payoff when its value is $v_i$, cartel members bid $b_1, \ldots, b_k$, and outside bidders bid their values, taking the expectation over the outside bidders' values (and the number of outside bidders if that is not known) and over any randomization in the auction mechanism, such as a random tie-breaking rule.
We define a collusive mechanism by $(\mu, p)$, where $\mu : S^k \to \Delta(B)$ is the distribution over recommended bids and $p_i : S^k \times I \to \mathbb{R}$ is the transfer payment required of cartel member $i$ as a function of the reports made to the cartel center and the information $I$ revealed as part of the auction process. It will also be useful to define the associated expected transfer payment for cartel member $i$ given its report as $\tilde{p}_i : \mathbb{R} \to \mathbb{R}$. A collusive mechanism $(\mu, p)$ is incentive-compatible if for all $i \in K$, all $(v_i, v_i') \in S^2$, and all $\psi_i : B_i \to B_i$,
$$
E_{v_{-i}} \int_B \pi_i(v_i, b_i, b_{-i}) \, d\mu(b_1, \ldots, b_k \mid v_i, v_{-i}) - \tilde{p}_i(v_i)
\;\ge\;
E_{v_{-i}} \int_B \pi_i\bigl(v_i, \psi_i(b_i), b_{-i}\bigr) \, d\mu(b_1, \ldots, b_k \mid v_i', v_{-i}) - \tilde{p}_i(v_i'). \tag{1}
$$
Condition (1) captures two types of incentive compatibility constraints. It ensures that cartel members report truthfully to the mechanism, and it also ensures that cartel members follow the recommendation of the center when they register and bid at the auction. (We interpret a $\psi_i$ that maps an $m_i$-dimensional bid recommendation onto a bid vector with different dimension as capturing a deviation by the cartel member in the number of bidders it registers with itself as the underlying identity.) Cartel members use the information contained in their recommendation to update their beliefs about the recommendations made to the other cartel members and to determine their optimal registration and bidding behavior. In an incentive-compatible mechanism, it is optimal for cartel members to obey the recommendation of the center given their posterior beliefs.
The mechanism is ex ante budget-balanced if
$$
E_{v_1, \ldots, v_k} \sum_{i \in K} \tilde{p}_i(v_i) = 0,
$$
and participation in $\mu$ is strictly individually rational if for all $i \in K$,
$$
E_{v} \int_B \pi_i(v_i, b_i, b_{-i}) \, d\mu(b_1, \ldots, b_k \mid v_i, v_{-i}) - \tilde{p}_i(v_i)
$$
is greater than cartel member $i$'s ex ante expected payoff when all bidders play noncooperatively.
PENN STATE UNIVERSITY
DUKE UNIVERSITY
REFERENCES
Aoyagi, Masaki, “Bid Rotation and Collusion in Repeated Auctions,” Journal of Economic Theory, 112 (2003), 79–105.
Asker, John, “A Study of the Internal Organisation of a Bidding Cartel,” working paper, New York University, 2007.
Avery, Christopher, “Strategic Jump Bidding in English Auctions,” in Game Theory in the Tradition of Bob Wilson (Berkeley, CA: Berkeley Electronic Press, 2002).
Bajari, Patrick, “Comparing Competition and Collusion: A Numerical Approach,” Economic Theory, 18 (2001), 187–205.
Bajari, Patrick, and Jeremy T. Fox, “Measuring the Efficiency of an FCC Spectrum Auction,” working paper, University of Minnesota, 2007.
Blume, Andreas, and Paul Heidhues, “Private Monitoring in Auctions,” Journal of Economic Theory, 131 (2006), 179–211.
——, “Modeling Tacit Collusion in Auctions,” Journal of Institutional and Theoretical Economics, 164 (2008), 163–184.
Brusco, Sandro, and Giuseppe Lopomo, “Collusion via Signaling in Simultaneous Ascending Bid Auctions with Multiple Objects and Complementarities,” Review of Economic Studies, 69 (2002), 407–436.
Cramton, Peter, and Jesse A. Schwartz, “Collusive Bidding: Lessons from the FCC Spectrum Auctions,” Journal of Regulatory Economics, 17 (2000), 229–252.
——, “Collusive Bidding in the FCC Spectrum Auctions,” Contributions to Economic Analysis and Policy, 1 (2002), Article 11.
Fudenberg, Drew, David Levine, and Eric Maskin, “The Folk Theorem with Imperfect Public Information,” Econometrica, 62 (1994), 997–1040.
Garratt, Rod, Thomas Tröger, and Charles Z. Zheng, “Collusion via Resale,” working paper, University of California at Santa Barbara, 2007.
Graham, Daniel A., and Robert C. Marshall, “Collusive Bidder Behavior at Single-Object Second Price and English Auctions,” Journal of Political Economy, 95 (1987), 1217–1239.
Graham, Daniel A., Robert C. Marshall, and Jean-Francois Richard, “Differential Payments within a Bidder Coalition and the Shapley Value,” American Economic Review, 80 (1990), 493–510.
Hörner, Johannes, and Julian S. Jamison, “Collusion with (Almost) No Information,” working paper, Northwestern University, 2004.
Izmalkov, Sergei, “English Auctions with Reentry,” working paper, Pennsylvania State University, 2002.
Klemperer, Paul, “Auctions with Almost Common Values,” European Economic Review, 42 (1998), 757–769.
——, “What Really Matters in Auction Design,” Journal of Economic Perspectives, 16 (2002), 169–189.
——, “Why Every Economist Should Learn Some Auction Theory,” in Advances in Economics and Econometrics: Invited Lectures to Eighth World Congress of the Econometric Society, Mathias Dewatripont, Lars Hansen, and Stephen Turnovsky, eds. (Cambridge, UK: Cambridge University Press, 2003).
Kovacic, William E., Robert C. Marshall, Leslie M. Marx, and Matthew E. Raiff, “Bidding Rings and the Design of Anti-Collusion Measures for Auctions and Procurements,” in Handbook of Procurement, Nicola Dimitri, Gustavo Piga, and Giancarlo Spagnolo, eds. (Cambridge, UK: Cambridge University Press, 2006).
Kwasnica, Anthony M., and Katerina Sherstyuk, “Collusion via Signaling in Multiple Object Auctions with Complementarities: An Experimental Test,” working paper, Penn State University, 2001.
Lebrun, Bernard, “First Price Auctions in the Asymmetric N Bidder Case,” International Economic Review, 40 (1999), 125–142.
——, “Uniqueness of the Equilibrium in First-Price Auctions,” Games and Economic Behavior, 55 (2006), 131–151.
Lucking-Reiley, David, “Vickrey Auctions in Practice: From Nineteenth-Century Philately to Twenty-First-Century E-Commerce,” Journal of Economic Perspectives, 14 (2000), 183–192.
Mailath, George, and Peter Zemsky, “Collusion in Second Price Auctions with Heterogeneous Bidders,” Games and Economic Behavior, 3 (1991), 467–486.
Marshall, Robert C., and Leslie M. Marx, “Bidder Collusion,” Journal of Economic Theory, 133 (2007), 374–402.
——, “The Vulnerability of Auctions to Bidder Collusion,” working paper, Duke University, 2008.
Marshall, Robert C., and Michael J. Meurer, “Bidder Collusion and Antitrust Law: Refining the Analysis of Price Fixing to Account for the Special Features of Auction Markets,” Antitrust Law Journal, 72 (2004), 83–118.
Marshall, Robert C., Michael J. Meurer, Jean-Francois Richard, and William Stromquist, “Numerical Analysis of Asymmetric First Price Auctions,” Games and Economic Behavior, 7 (1994), 193–220.
Marx, Leslie M., “Economics at the Federal Communications Commission,” Review of Industrial Organization, 29 (2006), 349–368.
Maskin, Eric S., and John G. Riley, “Asymmetric Auctions,” Review of Economic Studies, 67 (2000), 413–438.
Matsushima, Hitoshi, “Repeated Games with Private Monitoring: Two Players,” Econometrica, 72 (2004), 823–852.
McAfee, R. Preston, and John McMillan, “Auctions and Bidding,” Journal of Economic Literature, 25 (1987), 699–738.
——, “Analyzing the Airwaves Auctions,” Journal of Economic Perspectives, 10 (1996), 159–175.
McMillan, John, “Selling Spectrum Rights,” Journal of Economic Perspectives, 8 (1994), 145–162.
Milgrom, Paul, “Auctions and Bidding: A Primer,” Journal of Economic Perspectives, 3 (1989), 3–22.
——, Auction Theory for Privatization (Cambridge, UK: Cambridge University Press, 2004a).
——, Putting Auction Theory to Work (Cambridge, UK: Cambridge University Press, 2004b).
Milgrom, Paul R., and Robert J. Weber, “A Theory of Auctions and Competitive Bidding,” Econometrica, 50 (1982), 1089–1122.
Moldovanu, Benny, and Manfred Tietzel, “Goethe’s Second-Price Auction,” Journal of Political Economy, 106 (1998), 854–859.
Myerson, Roger B., “Mechanism Design by an Informed Principal,” Econometrica, 51 (1983), 1767–1797.
——, “Bayesian Equilibrium and Incentive Compatibility,” in Social Goals and Social Organization, L. Hurwicz, D. Schmeidler, and H. Sonnenschein, eds. (Cambridge, UK: Cambridge University Press, 1985).
Pesendorfer, Martin, “A Study of Collusion in First-Price Auctions,” Review of Economic Studies, 67 (2000), 381–411.
Reitsma, Paul S. A., Peter Stone, János A. Csirik, and Michael L. Littman, “Self-Enforcing Strategic Demand Reduction,” in Agent-Mediated Electronic Commerce IV. Designing Mechanisms and Systems, Julian Padget, Onn Shehory, David Parkes, Norman Sadeh, and William E. Walsh, eds. (Berlin, Germany: Springer-Verlag, 2002).
Robinson, Marc S., “Collusion and the Choice of Auction,” RAND Journal of Economics, 16 (1985), 141–145.
Rothkopf, Michael H., Thomas J. Teisberg, and Edward P. Kahn, “Why Are Vickrey Auctions Rare?” Journal of Political Economy, 98 (1990), 94–109.
Skrzypacz, Andrzej, and Hugo Hopenhayn, “Tacit Collusion in Repeated Auctions,” Journal of Economic Theory, 114 (2004), 153–169.
Vickrey, William, “Counterspeculation, Auctions, and Competitive Sealed Tenders,” Journal of Finance, 16 (1961), 8–37.
Weber, Robert J., “Making More from Less: Strategic Demand Reduction in the FCC Spectrum Auctions,” Journal of Economics & Management Strategy, 6 (1997), 529–548.