Editor’s Letter
Mike Larsen,
Executive Editor
Dear Readers,

This issue of CHANCE contains articles about a variety of interesting topics. Three articles are associated by their concern with missing data and methods for imputation, or filling in the missing values. Tom Krenzke and David Judkins use data from the National Education Longitudinal Survey to illustrate a semiparametric approach to imputation in complex surveys. In the area of health, Michael Elliott describes a study of childhood obesity and some of its associated complications. Novel mixture model and multiple imputation approaches are used to address unusual observations, probable transcription errors, and missing data. His medical collaborator, Nick Stettler, comments on aspects of their interaction that enhanced the consulting experience for all parties involved. Mark Glickman brings us an article using multiple imputation in his Here’s to Your Health column. In this issue, Yulei He, Recai Yucel, and Alan Zaslavsky model the relationship between cancer registry data and survey data and use multiple imputation to improve the quality and quantity of information available for analysis.

Two articles have sports themes. Eric Bradlow, Shane Jensen, Justin Wolfers, and Adi Wyner address the debate about baseball pitcher Roger Clemens and whether he used steroids. They examine a broad set of comparison pitchers on several dimensions. Read the article to learn their conclusions! Phil Everson, in his A Statistician Reads the Sports Pages column, examines the importance of offense versus defense in Women’s World Cup Soccer. In particular, did the United States make a strategic mistake in the 2007 final?
Peter Freeman, Joseph Richards, Chad Schafer, and Ann Lee illustrate the mass of data coming from the field of astrostatistics. The quality and quantity of information will allow examination of fundamental questions. Simo Puntanen and George Styan discuss postage stamps with a probability and statistics theme. There are many more such stamps, and they will be described in upcoming articles. Peter Olofsson critiques arguments made by supporters of the idea of intelligent design on grounds of probability and hypothesis testing logic.

Two additional columns and a letter to the editor complete the issue. Grace Lee, Paul Velleman, and Howard Wainer take on claims by computerized dating services in Visual Revelations; Donald Berry comments on the previous Visual Revelations column in a letter to the editor; and Jonathan Berkowitz brings us his first puzzle as column editor of Goodness of Wit Test. Some guidance on solving this and other puzzles accompanies this first column.

In other news, CHANCE cosponsored eight sessions at the recent Joint Statistical Meetings. (See the online program at www.amstat.org/meetings/jsm/2008 and select “CHANCE” as the sponsor.) We hope to encourage submissions to CHANCE in diverse areas on significant issues such as those discussed in these sessions. Plans for current issues of CHANCE to go online (in addition to the print version) for subscribers and libraries in 2009 are moving along. I think this will be a positive development for readers and authors of CHANCE.

I look forward to your comments, suggestions, and article submissions. Enjoy the issue!

Mike Larsen
Letter to the Editor

The Best Graph May Be No Graph

Dear Editor,

In CHANCE 21(2), Howard Wainer writes about “Improving Graphic Displays by Controlling Creativity.” He makes good suggestions. In one example (Figure 4), he offers 10 improvements (Figure 5) on a report of “five-year survival rates from various kinds of cancer, showing the improvements over the past two decades” (from the National Cancer Institute). Indeed, the latter figure is neater. But, he missed the most important improvement: not showing the figure in the first place! It’s terribly misleading and doesn’t necessarily reflect any real improvement “over the past two decades.” The three cancers with survival improvement over time (i.e., breast, prostate, colorectal) are those with intensified screening programs over these two decades. Much, if not all, of the higher survival rates is due to what are called the lead-time and length biases of screening. These biases are elementary and fundamental in cancer epidemiology.

Lead-time bias is the easier of the two to understand. Someone whose cancer is detected n years early in a screening program lives up to n years longer after her tumor is discovered. The pure bias of n years adds to the cancer survival time of everyone whose tumors were detected by screening. Because of the heterogeneity of cancer, the value of n is highly variable and unknown for any particular tumor. The average of n is also unknown, but it is substantial; it is commonly estimated to be 3–5 years in breast cancer.

The “length” in length bias refers to the tumor’s pre-symptomatic period, when it is detectable by screening, called the sojourn time. Aggressive tumors have shorter sojourn times because they grow faster. Indolent tumors have longer sojourn times. Screening finds tumors in proportion to the lengths of their sojourn times. Screening preferentially selects tumors with longer sojourn times and, therefore, tumors detected through screening are slower growing and less lethal. An extreme form of length bias is overdiagnosis, in which some cancers are found by screening that would not have caused symptoms or death.

There are many analogues that may help one’s intuition regarding length bias, and these should be familiar to statisticians. When you look into the sky and see a shooting star, it’s more likely to be one with a longer arc, simply because it’s the one you saw. Or, when you select a potato chip from a newly opened bag, it’s more likely to be a bigger one, simply because bigger ones are more likely to be selected. Waiting time paradoxes are standard examples. Suppose the interarrival times of buses at a certain bus stop are independently exponentially distributed, all with mean m. You arrive at the stop at an arbitrary time and catch a bus. What is the mean time between the arrival of the bus you caught and that of the previous bus? The answer is 2m.

I don’t mean to suggest that we have not made important strides in treating cancer over the last two decades. We have. But, although Figures 4 and 5 are literally correct, they reflect mostly artifact and greatly exaggerate these strides. Similar figures have been misinterpreted by policymakers and the press and have led to inappropriate recommendations regarding screening, with potentially deleterious effects. The only good use of these figures is as an example for teaching, to demonstrate how easy it is to lie with statistics.

Donald Berry
Head, Division of Quantitative Sciences & Chair, Department of Biostatistics & Frank T. McGraw Memorial Chair of Cancer Research, The University of Texas M.D. Anderson Cancer Center
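A quick Monte Carlo check makes the 2m answer concrete. The sketch below is illustrative only and is not part of the original letter; the mean of 10 and the sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10.0            # mean interarrival time (arbitrary illustrative value)
n_buses = 1_000_000

# Bus arrival times are cumulative sums of exponential interarrival gaps.
gaps = rng.exponential(scale=m, size=n_buses)
arrivals = np.cumsum(gaps)

# Riders show up at uniformly random times; record the full length of the
# gap (previous bus to next bus) that contains each rider's arrival.
rider_times = rng.uniform(arrivals[0], arrivals[-1], size=100_000)
next_bus = np.searchsorted(arrivals, rider_times)   # index of the bus each rider catches
containing_gap = gaps[next_bus]                     # that bus's arrival minus the previous one

print(f"mean gap overall:            {gaps.mean():.2f}   (theory: m = {m})")
print(f"mean gap containing a rider: {containing_gap.mean():.2f}   (theory: 2m = {2 * m})")
```

The first average settles near m and the second near 2m, reflecting the size-biased sampling of long gaps that the letter describes.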
Howard Wainer responds:

I am delighted that Professor Berry raised this issue. While I was preparing this column, I debated this very point with myself (and my colleague, Brian Clauser) and decided not to include it, for it seemed an aside from my main point (fixing graphs to communicate better) and confused the goals of description with those of causation. As a descriptive graph, the figures are correct—survival times are increasing. But the causal inference, why they are increasing, is what Professor Berry addresses. The issue is how much of the improvement is due to earlier detection and how much is due to improved treatment. This seems to me to be hard to partition. Perhaps adjusting survival rates for, say, the maturity of the tumor at the time of discovery would provide some help. I would be interested in other schemes that could help us measure the causal effect of the changes in treatment.
Correction
According to Steve Stigler of The University of Chicago, the picture represented on Page 29 of CHANCE, volume 21, number 2, is an 1842 posthumous painting of Laplace, not of Chevalier de Mere. We have not located a confirmed image of Chevalier de Mere.
Filling in the Blanks: Some Guesses Are Better Than Others
Illustrating the impact of covariate selection when imputing complex survey items
Tom Krenzke and David Judkins
Imputation is the statistical process of filling in missing values with educated guesses to produce a complete data set. Among the objectives of imputation is the preservation of multivariate structure. What is the impact of common naïve imputation approaches when compared to that of a more sophisticated approach?

Fully imputing responses to a survey questionnaire in preparation for data publication can be a major undertaking. Common challenges include complex skip patterns, complex patterns of missingness, a large number of variables, a variety of variable types (e.g., normal, transformable to normal, other continuous, count, Likert, other discrete ordered, Bernoulli, and multinomial), and both time and budget constraints. Faced with such challenges, a common approach is to simplify imputation by focusing on the preservation of a small number of multivariate structural features. For instance, a hot deck imputation scheme randomly selects respondents as donors for missing cases, and, similarly, a hot deck within cells procedure randomly selects donors within the same cell defined by a few categorical variables. To simplify the hot deck procedure, a separate hot deck with cells defined by a small common set of variables (e.g., age, race, and sex) might be used for each variable targeted for imputation. Another example in the context of a longitudinal survey might be to simply carry forward the last reported value for each target variable. Although such procedures are inexpensive and adequately preserve some important multivariate structural features, they may blur many other such features. Such blurring, of course, diminishes the value of the published data for researchers interested in a different set of structural features than those preserved by the data publisher’s imputation process.

We have been working on imputation algorithms that preserve a larger number of multivariate structural features. Our algorithms allow some advance targeting of features to be preserved, but also try to discover and preserve strong unanticipated features in the hopes of better serving secondary data analysts. The discovery process is designed to work without human intervention and with only minimal human guidance. In this article, we illustrate the effect of our imputation algorithm compared to simpler algorithms. To do so, we use data from the National Education Longitudinal Survey (NELS), which is a longitudinal study of students conducted for the U.S. Department of Education’s National Center for Education Statistics.
The NELS provides data about the experiences of a cohort of 8th-grade students in 1988 as they progress through middle and high schools and enter post-secondary institutions or the work force. The 1988 baseline survey was followed up at two-year intervals, from 1990 through 1994. In addition to student responses, the survey also collected data from parents, teachers, and principals. We use parent data (family income and religious affiliation) from the second follow up (1992) and student data (e.g., sexual behavior and expected educational attainment) from the third follow up (1994), by which time the modal student age was 20 years. This results in a sample size of approximately 15,000 student-parent dyads. We chose this set of variables for the variety of measurement scales and because the multivariate structure of this group of variables was not a central interest of the NELS. The example is, therefore, illustrative of what can happen when secondary analysts investigate new issues with existing data sets.

There are interesting features in this subset of the NELS data. For example, there is a moderate correlation (0.31) between family income (reported by a parent) and the student’s expected education level at age 30 (self-reported). Family income is also moderately correlated (-0.28) with whether the student ever dropped out of school. The statistics provided in this article are for illustration purposes and are not intended to be official statistics. Weights were not used in the generation of results, and response categories for some variables are collapsed to simplify the presentation.
Challenges in Survey Data

Some survey data complexities are described here to explain why data publishers may abridge their imputation approach. Among the issues to address are skip patterns, which begin with a response to a trigger item (referred to as the “skip controller”) and continue by leading the respondent through a certain series of questions dependent upon the response to the skip controller. For example, a “yes” response to the NELS item, “Have you had sexual intercourse?” would lead to a question about the date of first occurrence. A “no” response would result in the question about the date of first occurrence being skipped. Survey data become increasingly complex as dozens or hundreds of items are nested within questions. Figure 1 shows a designed skip pattern for nine observations and three generic variables—A, B, and C—that creates a monotone pattern of missing values. In the figure, a -1 is used to show an inapplicable value, and missing values (shaded in the original figure) are shown as a dot. For example, case 6 reports a 1 for question A, a 2 for question B, and then skips question C.

Case | A | B | C
1 | · | · | ·
2 | 1 | · | ·
3 | 1 | 1 | ·
4 | 1 | 1 | 1
5 | 1 | 1 | 2
6 | 1 | 2 | -1
7 | 1 | 2 | -1
8 | 2 | -1 | -1
9 | 2 | -1 | -1

Figure 1. Skip pattern illustration for sequential questions A, B, and C on nine cases. A dot (·) marks a missing value. Values of -1 indicate the question is not asked because it is not applicable.

Another complexity is the “Swiss cheese” (or nonmonotone) pattern of missing data. The Swiss cheese pattern causes havoc in attempts to preserve relationships between variables. The number of distinct missing data patterns escalates as the number of survey items grows. Furthermore, Swiss cheese patterns may occur within a pool of items controlled by the same skip controller. Each item to be imputed may be associated with a different set of key covariate predictor variables. Figure 2 shows a Swiss cheese pattern for six observations and three generic variables—A, B, and C—where x denotes a nonmissing value and a dot denotes a missing value. For instance, case 11 reports a value for each of questions A and C, but not for question B.

Case | A | B | C
11 | x | · | x
12 | x | x | ·
13 | · | x | x
14 | x | · | ·
15 | x | x | x
16 | x | x | ·

Figure 2. Swiss cheese missing data pattern illustration for questions A, B, and C on six cases. A dot (·) marks a missing value.

Managing different types of variables and retaining their univariate, bivariate, and multivariate distributions represents another set of challenges. Variables can be ordinal (e.g., year of first occurrence of sexual intercourse) or nominal (e.g., religious affiliation and race). Their distributions can be discrete, continuous, or semi-continuous (e.g., family income, where several modes appear in an otherwise continuous distribution due to rounding of reported values). The distribution for expected income at age 30 is semi-continuous, and a portion of the distribution is displayed in Figure 3, extracted for values between $10,000 and $100,000. Due to disclosure control, values lower than $10,000 and greater than $100,000 are suppressed, as well as values between spikes with small percentages. Note the spikes caused by respondent rounding. Figure 4 shows the distribution for month of first occurrence of sexual intercourse. Perhaps others knew of the summer peak, but it is not a pattern we anticipated, and it is an example of the sort of fascinating discoveries that can be made in secondary analysis provided the data publisher has not blurred these features through poor imputation procedures.

Figure 3. Distribution of expected income at age 30, between $10,000 and $100,000. Due to disclosure control, values lower than $10,000 and greater than $100,000 are suppressed, as well as values between spikes with small percentages.

Figure 4. Month of first occurrence of sexual intercourse. Respondents having had intercourse reported values 1 (January) to 12 (December). Respondents not having had intercourse have a value of -1 recorded. The heights of all bars sum to 100.0.
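To see how quickly distinct missingness patterns can multiply, the short sketch below (Python with pandas) tabulates the patterns in a toy data frame that mirrors Figure 2; it is illustrative only and uses no NELS data.

```python
import numpy as np
import pandas as pd

# Toy data mirroring Figure 2: an x (reported) becomes a value, a dot becomes NaN.
df = pd.DataFrame(
    {"A": [1, 1, np.nan, 1, 1, 1],
     "B": [np.nan, 1, 1, np.nan, 1, 1],
     "C": [1, np.nan, 1, np.nan, 1, np.nan]},
    index=[11, 12, 13, 14, 15, 16],
)

# Tabulate the distinct item-missingness patterns (1 = missing, 0 = reported).
patterns = df.isna().astype(int).value_counts()
print(f"{len(patterns)} distinct missingness patterns among {len(df)} cases")
print(patterns)
```

With dozens or hundreds of items, the number of distinct patterns grows far faster than any fixed set of imputation cells can accommodate.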
Covariate Selection and the Semiparametric Approach
We compare simple and data-driven semiparametric covariate selection options in our illustration. For the simple covariate selection option, three demographic variables (race, age, and sex) are cross-classified to form hot-deck imputation cells. For each target variable and observation with a missing value, a donor is selected at random within the hot-deck cell, and the donor’s value is used to fill in the missing value of the target variable. We refer to this approach as the simple hot deck. The second, more sophisticated, option includes a more extensive search for covariates. This is the data-driven, semiparametric approach.
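To make the simple hot deck concrete, here is a minimal sketch in Python with pandas. It is not the authors’ code; the column names (race, age_cat, sex), the default cell variables, and the decision to leave cells without donors missing are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def simple_hot_deck(df, target, cell_vars=("race", "age_cat", "sex"), seed=0):
    """Random within-cell donor imputation: cross-classify a few categorical
    variables into cells and copy a randomly chosen donor's reported value
    to each missing value of the target within the same cell."""
    rng = np.random.default_rng(seed)
    imputed = df[target].copy()
    for _, cell in df.groupby(list(cell_vars)):
        donors = cell.loc[cell[target].notna(), target].to_numpy()
        recipients = cell.index[cell[target].isna()]
        if len(recipients) and len(donors):
            # Donors are drawn with replacement; cells without donors stay missing.
            imputed.loc[recipients] = rng.choice(donors, size=len(recipients))
    return imputed

# Example use (hypothetical column names):
# df["income_imputed"] = simple_hot_deck(df, "income")
```

In practice, cells with recipients but no donors would be collapsed with neighboring cells rather than left missing.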
The impetus behind the semiparametric approach was a friendly competition in 1993 between teams of design-based and Bayesian statisticians. The competition was the brainchild of Meena Khare and Trena Ezzati-Rice, both then at the National Center for Health Statistics. The goal was imputation of data for the National Health and Nutrition Examination Survey (NHANES) III. The Bayesian team’s driving force was Joe Schafer, who had developed imputation software based on Gibbs sampling. He was backed by Rod Little and Don Rubin. On the design-based team were David Judkins and Mansour Fahimi, backed up by Joe Waksberg and Katie Hubbell. Judkins had worked with others on specialized iterative semiparametric procedures in the early 1990s, but these were not yet ready for general use. Instead, the design-based team used more traditional methods—one nonparametric approach and one that was semiparametric but did not involve iteration. The Bayesians were declared the winners of that competition at JSM in San Francisco that summer. Design-based statisticians such as Ralph Folsom, the session discussant, were impressed. Judkins also admired the ability of the new Bayesian approach to preserve multivariate structure. However, semiparametric techniques are known to perform better at preserving unusual marginal distributions of single variables.

Since that time, Judkins and others at Westat have been working to develop imputation software that preserves both complex multivariate structures and marginal distributions with unusual shapes. A solution for imputing ordinal and interval-valued variables is to use cyclic n-partition hot decks, where the partition for any variable is formed by coarsening the predictions from a parametric model for the variable in terms of reported and currently imputed data. The general approach is defined by several features. First, a simple methodology is used to create a first version of a complete data set. Second, each variable is sequentially re-imputed using a partition optimized for it in terms of the other variables. After every variable has been re-imputed once, the process is repeated. This goes on until some measure of convergence is satisfied. Our current procedure uses unguided stepwise regression and stratification on predicted values to form the partitions, but other methods—such as hand-crafted models—are possible. Recent simulation studies have confirmed that this method can preserve pair-wise relationships that are nonlinear but monotonic, as well as highly unusual marginal shapes.
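The cyclic scheme just described can be sketched in a few lines. The following Python code is only an illustration of the general idea, not the Westat software: ordinary least squares stands in for unguided stepwise regression, quantile bins of the predicted values form the partition, a fixed number of cycles replaces a convergence test, and the target variables are assumed to be numerically coded.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def cyclic_partition_hot_deck(df, targets, n_strata=20, n_cycles=5, seed=0):
    """Cyclic hot deck for ordinal/interval variables: start from crude fills,
    then repeatedly re-impute each variable by regressing it on the others,
    coarsening the predictions into strata, and drawing a donor's reported
    value at random within the recipient's stratum."""
    rng = np.random.default_rng(seed)
    observed = {v: df[v].notna().to_numpy() for v in targets}
    work = df.copy()

    # Step 1: a simple start -- fill each variable with random draws from its donors.
    for v in targets:
        donors = df.loc[observed[v], v].to_numpy()
        work.loc[~observed[v], v] = rng.choice(donors, size=(~observed[v]).sum())

    # Step 2: cycle through the variables, re-imputing each within strata of
    # predicted values computed from the other (reported or currently imputed) variables.
    for _ in range(n_cycles):
        for v in targets:
            others = [c for c in targets if c != v]
            pred = LinearRegression().fit(work[others], work[v]).predict(work[others])
            strata = pd.qcut(pred, q=n_strata, labels=False, duplicates="drop")
            for s in np.unique(strata):
                in_s = strata == s
                donors = work.index[in_s & observed[v]]
                recipients = work.index[in_s & ~observed[v]]
                if len(recipients) and len(donors):
                    work.loc[recipients, v] = df.loc[
                        rng.choice(donors, size=len(recipients)), v
                    ].to_numpy()
    return work
```

Stratifying on model predictions is what tailors the hot-deck partition to each target variable in turn.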
The solution we developed for the imputation of nominal variables is more complex and involves the following steps:

• Create a vector of indicator variables for the levels of a nominal variable
• Create a separate parametric model (stepwise) for each indicator variable (in terms of other variables in the data set)
• Using the estimated models, estimate the indicator-variable propensities (the probabilities that an indicator has a value of 1) for each indicator for each observation
• Run a k-means clustering algorithm on the propensity vectors
• Randomly match donors and recipients within the resulting clusters
• Copy the reported value of the target variable from the donor to the recipient

As a fillip, the clustering algorithm is actually run several times with different numbers of clusters. Then, if a small cluster happens to have a very low ratio of reported values to missing values, the search for a donor automatically retreats to a cluster from a coarser partition. This avoidance of the overuse of a small number of reported values tends to improve variances on reported marginal distributions at the cost of some decrease in the preservation of covariance structure. A simulation study by David Judkins, Andrea Piesse, Tom Krenzke, Zizhong Fan, and Wen-Chau Haung that was published in the 2007 JSM Proceedings also demonstrated that this procedure can preserve odd relationships between nominal variables. For example, when imputing religion, a series of separate models is fit for the probabilities of being Catholic, Lutheran, Jewish, Buddhist, atheist, and so on. The same covariates are not required to be used in each model. A particular cluster may contain people who have a high probability of belonging to one particular religion, or it may contain people who have a high probability of belonging to one of a small number of religions, depending on the number of clusters in the partition and the strength of the covariates.
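The steps above translate fairly directly into code. The rough sketch below uses scikit-learn’s LogisticRegression and KMeans in place of the stepwise indicator models, runs a single cluster resolution rather than the nested partitions described in the text, and leaves clusters without donors missing; the function and argument names are illustrative assumptions, not the authors’ software.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def impute_nominal(df, target, covariates, n_clusters=8, seed=0):
    """Propensity-and-cluster hot deck for a nominal variable: fit one model
    per level, cluster all cases on their vector of predicted propensities,
    and copy a random donor's reported value within each cluster."""
    rng = np.random.default_rng(seed)
    observed = df[target].notna().to_numpy()
    X = pd.get_dummies(df[covariates], drop_first=True).to_numpy(dtype=float)

    # One 0/1 indicator, and one model, per observed level of the target.
    levels = df.loc[observed, target].unique()
    propensities = np.column_stack([
        LogisticRegression(max_iter=1000)
        .fit(X[observed], (df.loc[observed, target] == level).astype(int))
        .predict_proba(X)[:, 1]
        for level in levels
    ])

    # Cluster donors and recipients together on their propensity vectors.
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(propensities)

    imputed = df[target].copy()
    for c in np.unique(clusters):
        in_c = clusters == c
        donors = df.index[in_c & observed]
        recipients = df.index[in_c & ~observed]
        if len(recipients) and len(donors):
            # Clusters without donors are skipped here; the procedure in the
            # article instead retreats to a coarser partition in that case.
            imputed.loc[recipients] = df.loc[rng.choice(donors, size=len(recipients)), target].to_numpy()
    return imputed
```

Running the clustering at several resolutions and retreating to a coarser cluster when donors are scarce, as the article describes, would be the natural refinement of this sketch.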
Table 1—Data Dictionary for Illustration Variables

Variable | Values | Type | Item Nonresponse Rate
Total family income from all sources 1991 | 15 categories: none to $200,000 or more | O | 18%
Expected income at age 30 | $0 to $1,000,000 or more | O | 14%
Reported previous sexual intercourse occurrence | Yes, No | O | 5%
Month of first intercourse occurrence | Not applicable, January – December | N | 13%
Age of first intercourse occurrence | Not applicable, 1–23 years | O | 13%
Respondent’s religion | Jewish, Mormon, Roman Catholic/Orthodox, Other Christian, Other, None | N | 12%
Expected occupation at age 30 | Blue collar, Clerical, Manager, Owner, Professional, Protection, Teacher, Other, Not working | N | 10%
Age categories | Less than 19.25, between 19.25 and 20.25, more than 20.25 | O | 0%
Race/Ethnicity | Hispanic, Black, Other | N | 0%
Sex | Male, Female | O | 0%
Highest level of education expected | 9 categories: Some high school to college/PhD and college/professional degree | O | 2%
Ever dropped out flag | Never dropped out of high school, dropped out of high school at least once | O | 0%
Current marital status | 5 categories: Single never married, Married, Divorced/Separated, Widowed, Marriage-like relationship | N