Editor’s Letter Mike Larsen, Executive Editor
Dear Readers,

This issue of CHANCE contains articles from four areas in which statistics and probability make a difference. The first is statistical consulting. H. Dean Johnson, Sarah Boslaugh, Murray Clayton, Christopher Holloman, and Linda Young discuss compensation for statistical consulting in their university settings. Tom Louis provides recommendations for improving the status, experience, and compensation for statistical consultants, and Janice Derr illustrates the benefits to students of being involved in effective consulting experiences.

The second area is politics. Arlene Ash and John Lamperti present statistical arguments concerning the disputed 2006 Florida Congressional District 13 election. Could statistical arguments be used to overturn an election? Joseph Hall and Walter Mebane comment and provide their perspectives on statistical and other issues pertaining to U.S. elections.

In the area of health, Jane Gentleman celebrates the 50th anniversary of the National Health Interview Survey by relating details about its design, statistical qualities, and innovations over time. Way to go, NHIS! Mark Glickman brings us the Here’s to Your Health column. In this issue, Gerardo Chowell, N. W. Hengartner, Catherine Ammon, and Mac Hyman employ statistical models to speculate on the effect of hypothetical interventions on the 1918 influenza pandemic. David Banks describes five statistical challenges researchers in the area of metabolomics face. As David explains, metabolomics is an expanding dimension of the genomics revolution and a rich domain for statistical contributions. Phil Dixon and Geng Ding give a detailed examination of a study in plant metabolomics.

The fourth area of content is graphics and, in particular, communication issues related to graphics. In his Visual Revelations column, Howard Wainer uses a variety of examples to warn of the dangers of too much creativity with insufficient education of the reader in large reports. Linda Pickle, the author of a report praised by Howard, comments on her experiences researching and negotiating the improvement of graphics in government publications. A letter from John Tukey to Linda is included in supplemental online material.

This issue is completed with a reflection by Roger Pinkham on the paradox of de Méré (and a favorite recipe) and the Puzzle Corner column by Tom Jabine. Tom has been writing the puzzle column for CHANCE for more than a decade. It is with deep appreciation for his service that I announce this puzzle to be his last regular contribution. Tom is turning Puzzle Corner over to Jonathan Berkowitz, and we look forward to his first puzzle in the next issue. In addition to the puzzle, an interview with Tom is included. In his answers, he expresses his enthusiasm for statistics and his enjoyment in helping and interacting with others throughout a fruitful career. Tom, thanks for all you have done for CHANCE and the field of statistics. Best wishes.

I also want to thank Albyn Jones for his service as an editor of CHANCE. Albyn has joined his college’s institutional review board, and we wish him the best with this new position. Further, I want to welcome two new editors to CHANCE: Herbie Lee of the University of California at Santa Cruz and Sam Kou of Harvard University. Welcome aboard.

As I mentioned in my first editorial, the articles in CHANCE cover a diverse set of topics in which statistics and probability play important roles, so get ready for some interesting reading! I look forward to your comments, suggestions, and article submissions. Enjoy the issue!

Mike Larsen
Statisticians and Metabolomics: Collaborative Possibilities for the Next *omics Revolution? David Banks
Metabolomics is a new and rapidly expanding bioinformatics science that attempts to use measurements on metabolite abundance to understand physiology and health. The measurement technology has some similarity to proteomics, using mass spectrometry for data collection. It also has great potential to improve therapies, accelerate drug development, and enhance our understanding of biological processes.

There are other *omics sciences, including genomics and proteomics. Genomics focuses on microarray data and attempts to understand which genes are associated with disease status. There are about 25,000 genes in the human genome, and most of these have unknown function. This means there is a very large number of explanatory variables—with complex interactions—and little contextual information to support data mining efforts.
Proteomics uses mass spectrometry to estimate the abundance of proteins (or groups of proteins) in biological samples. There are between 1 million and 2.5 million proteins, depending on how one counts small variations, and most of these have unknown function and unknown chemical structure. Again, there is little contextual information to guide data mining.

Metabolomics represents the next step after genomics and proteomics. It focuses on health inferences based on metabolites, which are lightweight molecules produced (mostly by enzymes) during the process of creating and destroying proteins and other more complex molecules. In contrast to the other *omics sciences, there are only about 900 main metabolites in the human body. All of them have known chemical structure, and their biological activity in metabolic pathways, expressed in KEGG diagrams, is fairly well understood. So, although there is less biological information in metabolomic data than there is in genomic or proteomic data, there is a great deal more scientific context. For this reason, there is hope that, for many kinds of disease, diagnoses based on metabolite data can provide an interpretable and robust tool for physicians.

Some common metabolites include the following:
• cholesterol (implicated in heart disease)
• glucose, sucrose, and fructose (sugars that describe diabetes status)
• amino acids (which can identify infections, as a sign of proteolysis)
• lactic acid and uric acid (high levels of which point to failures in specific organs)
• ATP and ADP (which determine the efficiency with which energy is provided to cells)

Additionally, there are synthetic metabolites not produced by normal organic processes. Important ones include breakdown products from medical drugs and illegal drugs.
Figure 1. The portion of the metabolic pathway that pertains to the Krebs cycle
The former often cause liver toxicity, which is a common reason a new drug fails to win approval from the FDA. The latter often appear in the popular press, when athletes are tested for illegal use of performance-enhancing supplements.

Figure 1 shows a portion of the KEGG diagram for the citrate, or Krebs, cycle, which drives the production and use of energy in the body. The circles are known metabolites; the rectangles are known enzymes; and the ovals are summaries of complex processes that involve enzymes, metabolites, and proteins. The enzymes convert one metabolite to another as building blocks for more complex molecules, such as proteins. The KEGG pathways contain important information relating to the abundance of each metabolite, including the following:
• Stoichiometric equations that show how much material is produced in a given chemical reaction (i.e., mass balance equations)
• Rate equations that govern the speed at which reactions occur, determining the location of the Gibbs equilibrium (when it exists)
Of course, for many portions of the metabolic pathway, there is no Gibbs equilibrium; for example, the amount of glucose changes as a function of what and when you eat. But other portions of the diagram are more homeostatically buffered from diet and exercise effects, such as one’s amino acid abundances over the course of a normal day for a healthy person.

Potential applications of metabolomics include the following:
• Detecting disease—necrosis, ALS, first-stage Alzheimer’s—and infection or inflammation
• Assessing toxicity in experimental drugs
• Understanding the biochemical effects of diet (such as vegetarian, macrobiotic, Scarsdale, etc.)
• Improving knowledge of biochemical pathways

Different applications appeal to different kinds of scientists. For commercial use, the dominant application is fast screening of new drugs before expensive clinical trials reveal disqualifying adverse effects.
Figure 2. An illustration of a possible medical diagnosis (a “biochemical profile”) showing the patient is abnormal in the region of the pathways associated with vitamins and cofactors
From a medical standpoint, the Holy Grail is to be able to take a sample of tissue, run a metabolomic analysis, and then inspect which regions of the metabolic pathway show excess or deficient amounts of specific metabolites. Physicians suspect one could create a visualization, such as that shown in Figure 2, that indicates which parts of a patient’s metabolic pathways are unusual, and possibly unhealthy.

To achieve these applications, one must address the following five major statistical problems:
• Estimating uncertainty in measurement
• Building statistical models for replication and cross-laboratory calibration
• Estimating abundance
• Data mining to match metabolic signals to disease status
• Using compartmental models to describe flow through the metabolic pathways

Each of these invites new statistical research.
Measurement Issues
Data are obtained in the following way. A tissue sample (say, fasting blood or a liver biopsy) is taken from a patient. The sample is liquefied (with the medical equivalent of a blender) and certain chemicals are added to halt chemical interactions
and slow evaporation. Then, a small amount of the liquid is put onto wells in a silicon plate. The aliquot in each well is subjected to liquid or gas chromatography. After separation in the column, the sample is directed to a mass spectrometer. The mass spectrometer reports the mass-to-charge ratio of each ionized fragment that emerges from the chromatograph column, and a magnet directs the ions to a counter that measures the quantity. From this, sophisticated software estimates the abundances.

The sample preparation involves the addition of stabilizing compounds, but it also involves the addition of “spiked-in” calibrants. These are known amounts of nonorganic molecules (such as antifreeze) that can be used as “anchors” for the estimation of abundance. Since the scientist knows exactly how much was added, the scientist also knows how much should be extracted. The preparation step is done by robot arms in a nitrogen atmosphere. It also entails the production of multiple aliquots of the same tissue, some of which are used for replication, some of which are destined for liquid or gas chromatography, and some of which are given somewhat different stabilizers.
Figure 3. A bar chart showing the estimated abundances of metabolites with different m/z ratios for the 327 metabolites commonly measured with gas chromatography. The four tallest bars represent abundant ions.
The main sources of measurement error in this step are the following (in probable decreasing order of magnitude):
• Within-subject variation (from day to day)
• Within-tissue variation (e.g., the liver is not perfectly homogeneous)
• Evaporation of volatile molecules (despite the use of stabilizers and refrigeration)
• Contamination by cleaning solvents (used to clean the robotic fingers between preparations)
• Uncertainty in the amount of the spiked-in calibrants

The first two are the dominant sources of error.

For gas chromatography, the scientist creates an ionized aerosol, and each droplet contains (in principle) a single ion (which may be a metabolite or a fragment of a metabolite molecule). The droplets evaporate during the passage through the tube, so at the end, one is left with a naked ion being directed by a magnetic field. This is separated by mass in the column, and then ejected to the mass spectrometer. The main sources of error in this step, in probable order of magnitude, are the following:
• Imperfect evaporation in the column
• Adhesion to the walls of the column, slowing the ion
• Ion fragmentation and adductance (the metabolite molecules do not always fragment in exactly the same way, and sometimes they collide with other ions to produce rare ions)

These errors are probably all of smaller magnitude than the within-subject and within-tissue variation. The liquid chromatography is significantly different, but poses
similar statistical problems. Here, we will focus on gas chromatography for simplicity and clarity. Once the ion leaves the chromatograph, it is channeled by the magnetic field to the mass spectrometer. The traditional devices (MALDI-TOF and SELDI-TOF) measure the mass-to-charge ratio of the ion by the amount of time it takes the ion to move through the field to impact an ion counter. Heavy ions with low charges are less bent by the field, and thus have a longer time-of-flight (TOF). However, the new generation of technology uses Fourier mass spectrometry. Here, the exit time is irrelevant. The machine uses an oscillating magnetic field to channel the ions in a cyclotron and measures the strength of the current induced by the ions in a pair of plates. The primary source of error in this step is the field strength alone, rather than both the field strength and the time of exit, which were sources of error in the previous generation of equipment.

The result of all these steps is a set of mass-to-charge ratios (m/z) with timestamps for each ion. These can be viewed as a two-dimensional histogram in the m/z · time plane. From this, one estimates the amount of each metabolite; high peaks in the histogram correspond to abundant ions, which correspond to abundant metabolites. This requires software to normalize the histogram and map to metabolites. This also introduces error. Nonetheless, when all the analysis and post-processing is done, the scientist sees data such as that in Figure 3. This bar chart has the m/z ratio on the x-axis and the
estimated abundance of metabolites with specific m/z values on the y-axis. Additionally, the scientist may have access to uncertainty estimates on the heights of the bars, though this is not the usual situation and poses an opportunity for statistical contribution. Many of the error sources that arise in metabolomics also occur in proteomics. For this reason, similar caveats apply: one must extract artifact signals associated with the alternating current in the wall socket, account for uncertainty in baseline correction, and allow for the higher variance associated with measurement of lightweight ions.
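To make the peak-to-abundance step concrete, the following sketch bins simulated ion events into an m/z-by-time histogram and integrates the counts in a window around a known peak. It is purely illustrative: the event counts, peak location, and window boundaries are invented, and real instrument software does considerably more (baseline correction, deconvolution of overlapping peaks, calibration against the spiked-in standards).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ion events: (m/z, retention time) pairs for one aliquot.
# A "peak" near m/z = 180, t = 240 s sits on top of uniform background noise.
background = np.column_stack([rng.uniform(50, 500, 5000),
                              rng.uniform(0, 600, 5000)])
peak = np.column_stack([rng.normal(180.06, 0.05, 2000),
                        rng.normal(240.0, 2.0, 2000)])
events = np.vstack([background, peak])

# Two-dimensional histogram in the m/z x time plane.
counts, mz_edges, t_edges = np.histogram2d(
    events[:, 0], events[:, 1], bins=[450, 120],
    range=[[50, 500], [0, 600]])

# Crude abundance estimate: total counts in a window around the expected
# peak location, minus an estimate of the local background level.
mz_win = (mz_edges[:-1] >= 179.5) & (mz_edges[1:] <= 180.5)
t_win = (t_edges[:-1] >= 230) & (t_edges[1:] <= 250)
window = counts[np.ix_(mz_win, t_win)]
background_rate = np.median(counts)          # per-bin background estimate
abundance = window.sum() - background_rate * window.size
print(f"estimated peak abundance (ion counts): {abundance:.0f}")
```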
Statistical Problems
Using the data obtained from the metabolomic measurement process described in the previous section, statisticians want to be able to do the following:
• Understand the uncertainty budget, identifying the sources and magnitudes of different kinds of error and their correlation structure
• Support cross-platform comparisons, enabling independent replication and quality control
• Identify and estimate the peaks in the m/z · t plane, with corresponding statements about the uncertainty in the metabolite abundances
• Find markers for disease or toxicity or physiological change, typically through the application of some kind of data mining method

The Uncertainty Budget
The National Institute of Standards and Technology has developed a paradigm for approaching the uncertainty budget problem. The strategy includes the following:
• Building a model for the error terms
• Doing a designed experiment with replicated measurements, structured to disentangle the different error terms to the extent possible
• Fitting a measurement equation to the data

In an article titled “Error Analysis” published in Encyclopedia of Statistical Sciences, J. M. Cameron describes this process in more detail, with examples from analytic chemistry. However, the metabolomics problem is more complex, and statisticians are needed to scale up the procedure to handle these multivariate applications with complex error structure.

To see how the uncertainty budget approach works, let Z be the vector of raw data, consisting of m/z ratio counts for specific time stamps (i.e., the information in the two-dimensional histogram). Then, the measurement equation is g(Z) = X = μ + ε. Here, μ is the vector of unknown true metabolite abundances and ε is (one hopes) decomposable into separate components corresponding to different sources of uncertainty. For the ith metabolite, the abundance estimate is:

$$ g_i(\mathbf{Z}) = \ln \sum_{j=1}^{J} w_{ij} \iint s\bigl(\mathbf{Z} - c(m,t)\bigr)\, dm\, dt. $$
“Richard Doll, a famous epidemiologist, once said that if the effect is strong, you don’t need a big study.”
Here, the w_ij is a weight on the jth peak with respect to metabolite i; it represents the contribution of that peak (and for most peaks, that contribution is approximately zero, since the metabolite probably does not fragment in ways that produce ions with that m/z ratio). The integrals are taken over the m/z · time plane. The s(Z) function smoothes the counts in ways that reflect asymmetric measurement error (e.g., ions can be delayed in the chromatograph, but not accelerated). It can also account for the refractory period in the ion counter and saturation of the counter. The c(m,t) term addresses baseline correction; it is known that lightweight ions tend to be overcounted. The true zero level decreases as the m/z value increases, and this is estimated by the use of the spiked-in calibrants. Drifting error in the lightweight ions is one of the effects that caused significant problems in the proteomic analysis of ovarian cancer data written about in an article titled “Proteomics Patterns in Serum and Identification of Ovarian Cancer” published in The Lancet.

It is customary in this area to work with the natural log of the abundance estimate. The main reason for this is that measured abundance is affected by dilution of the sample, but the ratios of the metabolites are not. Since ratios are stable, so are differences on the log scale.

If the study is well designed, with replication and balancing of factor levels to account for the order in which the sample aliquots are prepared and the kinds of spiked-in calibrants that are used, then one can estimate the variance in the measurements. This is based on the propagation of error formula, which statisticians know as the delta method:

$$ \operatorname{Var}[X_i] \approx \sum_{j} \left(\frac{\partial g_i}{\partial z_j}\right)^{\!2} \operatorname{Var}[z_j] \;+\; 2 \sum_{j \neq k} \frac{\partial g_i}{\partial z_j}\,\frac{\partial g_i}{\partial z_k}\, \operatorname{Cov}[z_j, z_k]. $$
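As a toy illustration of the propagation-of-error calculation, the sketch below applies the delta method to a made-up abundance function of two raw counts; the gradient is obtained numerically, and the variances and covariance are invented values chosen only to show the mechanics.

```python
import numpy as np

# Toy example: X = g(z1, z2) = ln(z1 + 2*z2), evaluated at the raw counts z.
z = np.array([400.0, 150.0])          # hypothetical raw peak counts
cov_z = np.array([[25.0, 5.0],        # hypothetical count variances
                  [5.0, 16.0]])       # and covariance

def g(z):
    return np.log(z[0] + 2.0 * z[1])

# Numerical gradient of g at z (central finite differences).
eps = 1e-5
grad = np.array([(g(z + eps * e) - g(z - eps * e)) / (2 * eps)
                 for e in np.eye(2)])

# Delta method / propagation of error: Var[g(Z)] ~ grad' Cov(Z) grad,
# which is the sum-of-squared-partials-plus-cross-terms formula above.
var_X = grad @ cov_z @ grad
print(f"g(z) = {g(z):.4f}, approximate s.d. = {np.sqrt(var_X):.4f}")
```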
This gets complicated and typically requires numerical solution.

Cross-Platform Experiments
If one takes the aliquots from a single tissue sample and runs them through the metabolomic measurement systems in different laboratories, the results will be systematically different. In order to put metabolomics research on a sound footing, one needs statistical calibration procedures that allow results from different platforms to be compared. The general strategy for such cross-platform calibration is the performance of a key comparison design. Here, different aliquots from the same sample (or different aliquots from an artificial standard solution, such as GROB) are taken to different laboratories and measured. Each laboratory produces its own mass spectrogram. It is impossible to determine which laboratory is best, although, for many years, metrology scientists argued about
this. But that question is not statistically identifiable. Instead, what one can estimate is the systematic difference from one laboratory to another. This is sufficient to support the scientific need for independent replication of results at separate sites.

The Mandel bundle-of-lines model is commonly used for such interlaboratory comparisons. With univariate data, the model is:

$$ X_{ik} = \alpha_k + \beta_k T_i + \varepsilon_{ik}. $$

Here, X_ik is the estimated abundance of metabolite i obtained in laboratory k. The α_k and the β_k are the intercept and slope, respectively, of the linear calibration needed to align laboratory k with an overall standard. The T_i is the true amount of metabolite i, and the ε_ik is approximated as normal random error with mean zero and variance σ²_ik. As written, the problem is unidentifiable; the value of T_i is confounded with the coefficients in the calibration line. Usually, one imposes constraints, such as requiring the average of all laboratory intercepts to be zero and the average of all laboratory slopes to be one.

For metabolomics problems, one needs a multivariate version of the Mandel bundle-of-lines model. This is not conceptually difficult, but it has not yet become common practice in the national metrology laboratories. Additionally, in metabolomics, one probably needs to extend the model to account for the different volatilization rates for the compounds. This is not trivial work, and it represents an important contribution that statisticians can make.

Peak Identification
This is an area of active statistical research, largely prompted by the work in proteomics. The Bayesian approach in the area has been developed by Merlise Clyde, Leanna House, and Robert Wolpert. Other approaches by Yutaka Yasui, Dale McLerran, Bao-Ling Adam, Marcy Winget, Mark Thornquist, and Ziding Feng are also available. In most metabolomics laboratories, the inference on peak location and size is not done in a principled statistical way. Instead, a black-box software program applies a set of ad hoc rules to generate the estimates. Often, the code is proprietary, and the code almost certainly does not incorporate information from mass balance constraints on ion fragmentation. Additionally, this application is one of the areas in which metabolomics should enjoy a significant advantage over proteomics. In metabolomics, one knows the chemical formula for each metabolite and its common fragments; thus, one knows exactly where the peaks ought to be. That is a major piece of information for Bayesian exploitation, and a ripe opportunity for statistics.

Data Mining
Metabolomic analysis is a problem in data mining. One has about 900 explanatory variables and wants to find classification rules for the health status of patients. Often, the sample of patients is not large. In our work, we have focused on identifying ALS patients and women in preterm labor who will have different kinds of outcomes. The main tools we have used are random forests and various kinds of support vector machines. Of these, we found random forests had superior performance, and this article emphasizes that methodology.
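The sketch below shows the general shape of such a random forest analysis using scikit-learn's out-of-bag error; the simulated sample sizes and the three informative metabolites are invented for illustration and do not reproduce the ALS or preterm labor analyses described next.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Simulated study: 63 subjects, 317 log-abundance measurements each.
# Three metabolites are shifted in the "disease" group; the rest are noise.
n, p = 63, 317
y = np.array([0] * 32 + [1] * 31)            # 0 = healthy, 1 = disease
X = rng.normal(0.0, 1.0, size=(n, p))
X[y == 1, :3] += 1.5                         # informative metabolites

forest = RandomForestClassifier(
    n_estimators=1000, oob_score=True, random_state=0)
forest.fit(X, y)

print(f"out-of-bag accuracy: {forest.oob_score_:.3f}")
# Variable-importance scores should point at the informative metabolites.
top = np.argsort(forest.feature_importances_)[::-1][:5]
print("top metabolites by importance:", top)
```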
Table 1—The Misclassification Matrix for the Random Forest, Out-of-Bag Predictions on Preterm Labor Outcome

                          Predicted
True             Term    Infection    No Infection
Term               39            1               0
Infection           7           32               1
No Infection        2            2              29
ALS Diagnosis
For the ALS diagnosis problem, we had data on 317 metabolites for 63 subjects. Of these, 32 were healthy, 22 had ALS and were not on medication, and nine had ALS and were taking medication. Since the number of explanatory variables exceeds the number of patients, this is a classic data mining application. The random forest procedure correctly classified 29 of the healthy patients, 20 of the ALS patients without medication, and all nine of the ALS patients who were taking medication. The out-of-bag error rate, an unbiased measure of predictive accuracy, was 7.94%. Random forests used 20 of the 317 metabolites, and three were of dominant importance. Physicians believe all three are reasonable in the context of the nature of the disease.

Random forests can detect outliers via proximity scores. There were four outliers among the 22 ALS patients who were not taking medication. It is suspected that these form a subgroup for the disease pathology, since all four had similar patterns for some of the key metabolites.

For the support vector machine analysis, we considered a number of approaches: linear SVM, polynomial SVM, Gaussian SVM, L1 SVM, and SCAD SVM. These different SVM methods showed a wide range of performance, but we believe random forests gave the best answer. No doubt reasonable statisticians will disagree over which approach is best for metabolomic data, and further examination is warranted.

Preterm Labor
The National Institutes of Health had metabolic data on samples of amniotic fluid from women who presented at emergency rooms in preterm labor. They wanted to use data mining to find three categories of patient: those whose labor would cease and who would subsequently deliver at the usual time (Term), those who would deliver early and had infection (Infection), and those who would deliver early but did not have infection (No Infection). The sample had 113 women. As before, random forests gave the best results, with an out-of-bag accuracy rate of 88.49%; the various SVMs had accuracy rates that were 5% to 10% lower. The main diagnostic information was contained in the amino acids and carbohydrates. The physicians felt the metabolite indicators were reasonable, since high levels of amino acids indicated proteolysis associated with tissue breakdown during infection, and this was the marker for the Infection category. Low levels of carbohydrates indicated nutritional problems for the fetus, and this
was the marker for the No Infection category. Term delivery had low levels of amino acids, but high levels of carbohydrates. Roberto Romero, Ricardo Gomez, Jyh Kae Nien, Bo Hyun Yoon, R. Luo, Chris Beecher, and Moshe Mazor undertook work using this approach in their paper published in the American Journal of Obstetrics and Gynecology.

Compartmental Modeling
If the KEGG pathways in Figure 1 were perfectly known, and if the stoichiometric equations were all accurate, and if the metabolite measurements had very small error, then one could use compartmental modeling with differential equations to describe the dynamics of an individual metabolism. But this kind of breakthrough is not on the scientific horizon. However, statistical models may be sufficient to provide good approximations to the dynamics along some key pathways.

A specific challenge in this work is that the same metabolite can appear in many pathways, so the abundance estimate obtained from metabolomics is actually a sum from many sources; some sources may reflect healthy pathways, but others may not. This aliasing is not necessarily intractable; some metabolites occur uniquely in a single pathway, which pins down that activity. Those unique measurements can act as anchors in untangling the summed measurements for more common metabolites that precede or follow them. A separate strategy for untangling the signal uses multivariate data on sets of common metabolites that co-occur in a pathway. This problem area is the least developed aspect of metabolomics. It poses significant methodological problems, but a solution would have value in other fields for which compartmental modeling is used.
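As a hedged sketch of what a compartmental model for one small piece of a pathway might look like, the fragment below integrates a hypothetical two-compartment system of differential equations with SciPy; the compartments and rate constants are invented and are not taken from the KEGG diagram in Figure 1.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical two-compartment pathway: metabolite A is produced at a
# constant rate, converted to B by an enzyme, and B is cleared.
k_in, k_ab, k_out = 1.0, 0.3, 0.15   # made-up rate constants (per hour)

def pathway(t, y):
    a, b = y
    da = k_in - k_ab * a             # production minus conversion
    db = k_ab * a - k_out * b        # conversion minus clearance
    return [da, db]

sol = solve_ivp(pathway, t_span=(0.0, 48.0), y0=[0.0, 0.0],
                t_eval=np.linspace(0.0, 48.0, 7))

for t, a, b in zip(sol.t, sol.y[0], sol.y[1]):
    print(f"t = {t:5.1f} h   A = {a:5.2f}   B = {b:5.2f}")
```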
Looking Ahead
Metabolomics is an important new tool in the bioinformatics of health sciences. While it offers less information than genomics or proteomics, the signal is often stronger and more interpretable. Metabolomics poses fresh technical challenges to statisticians on a number of problems. We tried to lay out the current state of the art in the problems associated with data mining, determining the uncertainty budget, and enabling cross-platform calibration. Peak identification was slighted because it is well covered in the proteomics literature; however, with metabolomic data, one has the advantage of knowing where each metabolite ion fragment should be located, and this advantage has only been partially exploited. Regarding compartmental models, this problem is completely fresh and any progress would be important.
Further Reading

Baggerly, K., Morris, J., Wang, J., Gold, D., Xiao, L., and Coombes, K. (2003) “A Comprehensive Approach to the Analysis of Matrix-Assisted Laser Desorption/Ionization-Time of Flight Proteomics Spectra from Serum Samples.” Proteomics, 3:1667–1672.

Bradley, P., and Mangasarian, O. (1998) “Feature Selection via Concave Minimization and Support Vector Machines.” Proceedings of the 15th International Conference on Machine Learning, 82–90. San Francisco: Morgan Kaufmann Publishers.

Breiman, L. (2001) “Random Forests.” Machine Learning, 45:5–32.

Cameron, J. (1982) “Error Analysis.” In Encyclopedia of Statistical Sciences, Vol. 2, edited by S. Kotz, N. Johnson, and C. Read, 545–551. New York: Wiley.

Clyde, M., House, L., and Wolpert, R. (2007) “Nonparametric Models for Proteomic Peak Identification and Quantification.” In Bayesian Inference for Gene Expression and Proteomics, edited by K.-A. Do, P. Müller, and M. Vannucci (to appear).

Fan, J., and Li, R. (2001) “Variable Selection via Penalized Likelihood.” Journal of the American Statistical Association, 96:1348–1360.

Marshall, A.G., Hendrickson, C.L., and Jackson, G.S. (1998) “Fourier Transform Ion Cyclotron Resonance Mass Spectrometry: A Primer.” Mass Spectrometry Review, 17:1–35.

Milliken, G., and Johnson, D. (1982) Analysis of Messy Data I. Boca Raton: Chapman & Hall/CRC Press.

Petricoin, E., Mills, G., Kohn, E., and Liotta, L. (2003) “Proteomics Patterns in Serum and Identification of Ovarian Cancer.” The Lancet, 360:170–171.

Romero, R., Gomez, R., Nien, J., Yoon, B., Luo, R., Beecher, C., et al. (2004) “Metabolomics in Premature Labor: A Novel Approach to Identify Patients at Risk for Preterm Delivery.” American Journal of Obstetrics and Gynecology, 191:S2.

Tibshirani, R. (1996) “Regression Shrinkage and Selection via the LASSO.” Journal of the Royal Statistical Society, Series B, 58:267–288.

Truong, Y., Lin, X., Beecher, C., Cutler, A., and Young, S. (2004) “Learning Metabolomic Datasets with Random Forests and Support Vector Machines.” Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Yasui, Y., McLerran, D., Adam, B.-L., Winget, M., Thornquist, M., and Feng, Z. (2003) “An Automated Peak Identification/Calibration Procedure for High-Dimensional Protein Measures from Mass Spectrometers.” Journal of Biomedical Biotechnology, 4:242–248.
A Metabolomic Data Uncertainty Budget for the Plant Arabidopsis thaliana Philip M. Dixon and Geng Ding
In “Statisticians and Metabolomics,” David Banks discusses five places for collaboration between statisticians and biologists collecting and interpreting metabolomic data. Here, we illustrate the first of those: the construction of an uncertainty budget. Our example comes from plant metabolomics.

In plant metabolomics, the measurements are the same as in human metabolomics: the concentrations of cellular metabolites, usually with a molecular weight less than 500. Although it could be used in the same way the human metabolome is used—as a fingerprint for rapid identification of disease—the primary motivation for studying the plant metabolome is its usefulness to basic science. The metabolome is the intermediary between enzyme activity, which ultimately is a consequence of the plant genome, and the phenotype, the observable characteristics of individual plants. The metabolome provides a tool for understanding the function of a gene, even if that gene has minimal or no effect on the phenotype.

Using reverse genetic techniques, it is possible to create a knockout mutant, in which the DNA sequence for a specific gene is changed and the gene product is disabled. Plants with the knockout mutant are then compared to wild-type controls. The knockout sometimes kills the plant, sometimes changes the visible phenotype, and sometimes produces plants that look identical to the wild-type. When the knockout is not lethal, comparing metabolomes of knockout and wild-type individuals provides a way to discover whether the gene of interest has a function and to understand the metabolic origin of phenotypic changes.
The Data Set
The data used here are part of Geng Ding’s investigation of knockout mutants for an enzyme that degrades an amino acid. The plants being studied are Arabidopsis thaliana, a model organism widely used in plant science. For each of two mutants, Ding has plants of three genotypes, differing in the number of copies (zero, one, or two) of the knockout DNA. These genotypes have subtle differences in the phenotype, but the differences are tiny during vegetative growth. Ding’s biological goal is to compare mean concentrations of each of 18 amino acids among the six combinations of two mutants and three genotypes, called “id’s” henceforth.

Two plants of each id were grown in a homogeneous environment. At harvest, tissue from each of the 12 plants was split into two containers, yielding 24 samples. Because it wasn’t feasible to extract amino acids from all 24 samples at the same time, extraction was done in two batches. The first 12 containers, from one plant of each of the six id’s, were extracted in one batch. Then, the remaining 12 samples (and six plants) were extracted. Each of the 24 samples (6 id’s × 2 plants per id × 2 extracts per plant) was then measured.

Amino acid concentrations were measured by a gas chromatograph with a flame ionization detector (GC-FID). This detector measured the amount of carbon-containing compounds coming out of the GC every few seconds. Specific amino acids were identified by comparing their retention time in the GC to known standards. Each amino acid was quantified by integrating the signal from a distinct peak, normalizing by an internal standard, and using a calibration curve to determine the amino acid concentration. Concentrations were expressed as micromoles of amino acid per gram of plant fresh weight.

This data collection scheme, a complete block design, provides a measure of variability between extraction batches, a measure of the biological variability between plants, and measures of the differences among id’s. The variability between the two samples from the same plant includes the within-plant variability, the variability between extractions, and the variability between measurements. The entire study was then repeated to give a total of 48 samples. The variability between repetitions provides a measure of the repeatability of the results, including long-term drift in the measuring process and the growth environment.

One extract was measured twice. The two measurements for this extract provide an estimate of the variability between measurements. Because only one extract was measured twice, the second measurement is omitted from most of the analyses described here.
Figure 1. Variability between replicates, plants within replicates, and measurements within plants for one id (vertical axis: THR concentration, nm/mg). The two dots labeled “Replicate” are the averages for replicate 1 and replicate 2. The four dots labeled “Plants” are the averages for the two plants from each replicate, sorted by replicate average and labeled by the replicate number. The eight dots labeled “Measurements” are the measurements from each plant, sorted by plant average and labeled by replicate and plant number. Each column average is indicated by –.
Components of the Uncertainty Budget
Our goal here is to quantify components of the uncertainty budget. Each level of replication (two repetitions of the study, two extraction batches, two plants per repetition and treatment, and two extractions/measurements per plant) is a component of the uncertainty budget that can be quantified by estimating variance components. Extraction batches are nested within repetitions, id’s are crossed with extraction batches, plants are nested within extraction batches, and extractions/measurements are nested within plants. Although Banks indicates many specific reasons for sampling and measurement variability in his article, most are confounded in this study and cannot be separated. Ding’s sampling design provides an estimate of the biological variability between plants, which is crucial for the comparison of genotypes and mutants. It also provides an estimate of the variability between extracts. The single extract measured twice provides an estimate of the variability between measurements.
Many designs to estimate variance components use only nested sampling. One example would be a design that grows plants of one genotype in three pots. Three plants are individually harvested and extracted from each pot, and then each extract is measured twice. Ding’s design introduces crossed effects because of the blocking by replicate and extraction batch. This blocking provides more precise estimates of differences between genotypes and mutants, but it complicates the analysis of variance components.

We will first illustrate a typical analysis of nested effects by considering data from only one id. Then, we will analyze the entire data set. Both analyses use data for one of the 18 amino acids measured by Ding: Threonine.
A Model for Nested Random Effects
The data for a single id include only one plant per extraction batch, so there are three nested random effects: between repetitions, between plants in a repetition, and between confounded extracts and measurements. One commonly used model for nested random effects can be written as:

$$ Y_{ijk} = \mu + \alpha_i + \beta_{ij} + \gamma_{ijk}. \qquad (1) $$

The concentration of Threonine in replicate i, plant j, and measurement k is denoted Y_ijk. The overall mean Threonine concentration is denoted μ. The deviation from the mean associated with replicate i is α_i. The deviation from the mean of replicate i associated with plant ij is β_ij. Within plant ij, the deviation of measurement ijk about the plant mean is γ_ijk. The terms α_i, β_ij, and γ_ijk are considered random effects when the goal of the analysis is to estimate the magnitude of their variability. It is common to assume all random effects are independent normal random variables with constant variance: α_i ~ N(0, σ²_rep), β_ij ~ N(0, σ²_plant), and γ_ijk ~ N(0, σ²_meas). The variance between observations from a randomly chosen replicate, plant, and measurement is the sum of the three variance components, σ²_rep + σ²_plant + σ²_meas. In this sense, the variance components partition the random variation among observations into components associated with each source of uncertainty.

The data for one id are shown in Figure 1. The two replicate averages are similar, the four plant averages are quite different—even when compared within the replicate—and the measurements from the same plant are very similar.
Table 1—ANOVA Table and Expected Mean Squares for Data From a Single Id

Source                         d.f.   Sum-of-Squares   Mean Square   Expected Mean Square
Replicates                       1        0.004857       0.004857    σ²_meas + 2σ²_plant + 4σ²_rep
Plants(Reps)                     2        0.109532       0.054766    σ²_meas + 2σ²_plant
Measurements(Plants, Reps)       4        0.001480       0.000370    σ²_meas
Corrected Total                  7        0.115869
Figure 2. Plot of the standard deviation (s.d.) and average of the two measurements per plant. Both X and Y axes are log scaled.
Figure 3. Plot of the standard deviation and average of the log-transformed measurements per plant. The y-axis is log scaled; the x-axis is not because some averages are less than zero.
The pooled variance between measurements on the same plant estimates σ²_meas, but the pooled variance between plant averages, using dots as subscripts to indicate averaging (i.e., Ȳ_ij· = (Y_ij1 + Y_ij2)/2), overestimates σ²_plant. This is because, within a replicate (i.e., conditional on α_i), the variance between plant averages is Var(Ȳ_ij·) = σ²_plant + σ²_meas/2, which is larger than σ²_plant if σ²_meas > 0. Similarly, the variance between replicate averages, Var(Ȳ_i··) = σ²_rep + σ²_plant/2 + σ²_meas/4, overestimates σ²_rep.
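A brief simulation (with arbitrary, made-up variance components) confirms these relationships numerically: the average within-replicate variance of the plant means comes out close to σ²_plant + σ²_meas/2 rather than σ²_plant.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate the nested model (1) many times with known, arbitrary components.
s2_rep, s2_plant, s2_meas = 0.01, 0.03, 0.0004
n_sim, n_rep, n_plant, n_meas = 20000, 2, 2, 2

a = rng.normal(0, np.sqrt(s2_rep), (n_sim, n_rep, 1, 1))
b = rng.normal(0, np.sqrt(s2_plant), (n_sim, n_rep, n_plant, 1))
g = rng.normal(0, np.sqrt(s2_meas), (n_sim, n_rep, n_plant, n_meas))
y = 0.6 + a + b + g

plant_avg = y.mean(axis=3)                  # the plant averages Y_ij.
# Variance of plant averages within a replicate: sigma2_plant + sigma2_meas/2.
within_rep_var = plant_avg.var(axis=2, ddof=1).mean()
print(f"observed {within_rep_var:.4f}  vs  "
      f"sigma2_plant + sigma2_meas/2 = {s2_plant + s2_meas / 2:.4f}")
```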
Estimators of Variance Components
Although there are many estimators of the variance components—σ²_rep, σ²_plant, and σ²_meas—the two most commonly used are the ANOVA and REML estimators. The ANOVA, or method-of-moments, estimator starts with an ANOVA table quantifying the observed variability for each component. The variance components are estimated by equating the observed
mean squares to their expected values—the expected mean squares—and solving for the variance components. For the Threonine data in Figure 1, the ANOVA table and expected mean squares are given in Table 1. The estimated variance components are σ̂²_meas = 0.00037, σ̂²_plant = 0.027, and σ̂²_rep = −0.012. The variance component for plants is much larger than that for measurements, consistent with the pattern in Figure 1. The negative estimate for replicates is disconcerting, since a variance must be non-negative. Negative ANOVA estimates often occur when the parameter is close to zero, when the degrees of freedom for the effect are small, when there are outliers, or when the model is wrong. However, ANOVA estimates are unbiased when the model is correct and robust to the assumption of normality because they are computed from variances only.

REML, restricted maximum likelihood, estimates are always non-negative because the estimates are constrained to lie within the parameter space for a variance. REML differs from standard maximum likelihood (ML) in correctly accounting for the estimation of any fixed effects. As a simple example, if Y_1, ..., Y_n are independent N(μ, σ²), the ML estimator of the variance of a single sample, Σ(Y_i − Ȳ)²/n, is biased. The REML estimator, Σ(Y_i − Ȳ)²/(n − 1), is the usual unbiased variance estimator. However, when data have multiple levels of variation, REML estimates of variance components are often biased. The bias arises for two reasons: the constraint that an estimate is non-negative and the adjustment to other variance components that occurs when a negative ANOVA estimate is shifted to zero. For example, the REML estimates for the data in Figure 1 are σ̂²_meas = 0.00037, σ̂²_plant = 0.019, and σ̂²_rep = 0. The replicate variance is estimated as zero, but that forces a shift in the plant-plant variance component (from 0.027 to 0.019). However, the replicate variance is estimated from only two replicates (one degree of freedom), so one should expect a poor estimate.

There is no consensus among statisticians as to which estimator is better. I prefer the ANOVA estimates because they are less dependent on a model and because estimates at one level are not adjusted because of insufficient data at another level. Others prefer REML estimators.

The previous analysis uses only one-sixth of the data: just four plants and eight measurements. The entire data set includes 24 plants and 48 measurements. Pooled estimates of variance components using all the data will be more precise, which may eliminate the problem of a negative estimated variance component if it is reasonable to assume variance components are the same for all id’s. We will separately consider the measurement variance and the plant-plant variance.
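The ANOVA (method-of-moments) calculation is simple enough to do by hand; the sketch below reproduces the estimates quoted above using only the mean squares and expected mean squares from Table 1.

```python
# ANOVA (method-of-moments) estimates from the Table 1 mean squares.
# Expected mean squares (2 measurements per plant, 2 plants per replicate):
#   MS_meas  = sigma2_meas
#   MS_plant = sigma2_meas + 2*sigma2_plant
#   MS_rep   = sigma2_meas + 2*sigma2_plant + 4*sigma2_rep
ms_rep, ms_plant, ms_meas = 0.004857, 0.054766, 0.000370

sigma2_meas = ms_meas
sigma2_plant = (ms_plant - ms_meas) / 2
sigma2_rep = (ms_rep - ms_plant) / 4          # can be negative

print(f"sigma2_meas  = {sigma2_meas:.5f}")    # 0.00037
print(f"sigma2_plant = {sigma2_plant:.5f}")   # about 0.027
print(f"sigma2_rep   = {sigma2_rep:.5f}")     # about -0.012 (negative!)
```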
Characteristics of the Measurement Variance
The assumption of equal measurement variance is easy to assess using a plot of the average of the two measurements per plant against the standard deviation of those two measurements (Figure 2). There is a lot of variability because each standard deviation is computed from two measurements, but it is clear the measurement standard deviation tends to increase with the average. When this happens, using log Y instead of Y often equalizes the variances. As Banks indicates in his article, metabolomic data are usually log-transformed because of the biological focus on ratios naturally expressed on a log scale.
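A small sketch of this diagnostic on hypothetical duplicate measurements: generate pairs whose spread grows with their mean (a roughly constant coefficient of variation), then compare the mean–s.d. relationship on the raw and log scales.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical duplicate measurements for 24 plants, generated so the
# s.d. is roughly proportional to the mean (constant c.v. of about 15%).
true_means = rng.uniform(0.4, 2.5, 24)
pairs = true_means[:, None] * np.exp(rng.normal(0.0, 0.15, (24, 2)))

for label, y in [("raw", pairs), ("log", np.log(pairs))]:
    avg = y.mean(axis=1)
    sd = y.std(axis=1, ddof=1)
    # Correlation between s.d. and mean: typically sizable on the raw
    # scale and near zero after the log transformation.
    r = np.corrcoef(avg, sd)[0, 1]
    print(f"{label:3s} scale: corr(mean, s.d.) = {r:+.2f}")
```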
Table 2—ANOVA Table and Expected Mean Squares for Data From All Six Id’s

Source            d.f.   Sum-of-Squares   Mean Square   Expected Mean Square
Replicates          1         4.077          4.0770     σ²_meas + 2σ²_plant + 12σ²_batch + 24σ²_rep
Extraction          2         0.428          0.2138     σ²_meas + 2σ²_plant + 12σ²_batch
Id                  5         1.922          0.3842     σ²_meas + 2σ²_plant + 8Σδ²_k/5
Plants             15         2.908          0.1939     σ²_meas + 2σ²_plant
Measurements       24         0.540          0.0225     σ²_meas
Corrected Total    47
Figure 4. Plot of the plant-plant standard deviation (s.d.) and average, after log transforming the measurements
The Threonine data illustrate another reason for a transformation—to equalize variances. A useful characteristic of a random variable with a lognormal distribution is that the coefficient of variation is a function of the log-scale variance alone. If log Y ~ N(μ, σ²), then the mean and variance of the untransformed Y are E Y = e^(μ+σ²/2) and Var Y = e^(2μ+2σ²) − e^(2μ+σ²), so the coefficient of variation is √(Var Y)/(E Y) = √(e^(σ²) − 1). Hence, assuming a constant variance on the log scale is equivalent to assuming a constant coefficient of variation for the untransformed values.

After using a transformation, one should check that it worked as intended. This can be done by plotting the average and standard deviation of the two log-transformed measurements per plant (Figure 3). While there is much less pattern after the transformation, there is still a tendency for the standard deviation to increase with the mean. A stronger transformation in the Box-Cox family, perhaps 1/Y, would do a better job of equalizing the measurement variances for this specific data set. However, a transformation of Y affects all aspects of the model. Before making a final choice, it would be good to assess the characteristics of the plant-plant variability.
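For completeness, the algebra connecting the lognormal moments to the coefficient of variation is the standard calculation:

```latex
\frac{\operatorname{Var} Y}{(E\,Y)^2}
  = \frac{e^{2\mu+2\sigma^2}-e^{2\mu+\sigma^2}}{\left(e^{\mu+\sigma^2/2}\right)^2}
  = \frac{e^{2\mu+\sigma^2}\left(e^{\sigma^2}-1\right)}{e^{2\mu+\sigma^2}}
  = e^{\sigma^2}-1,
\qquad\text{so}\qquad
\mathrm{c.v.} = \frac{\sqrt{\operatorname{Var} Y}}{E\,Y} = \sqrt{e^{\sigma^2}-1}.
```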
Figure 5. Plot of the plant-plant standard deviation (s.d.) and average, after using a 1/Y transformation of each measurement
Characteristics of Plant-Plant Variation
It is harder to assess the characteristics of plant-plant variability (or any variability other than the residual variation) because the plant-plant variation is not directly observed. The only direct information about characteristics of the plant-plant variation comes from averages of the two measurements for each plant. Because these are averages of measurements, characteristics of the plant-plant variation are confounded with those of the measurement variation.
Two approaches can be used to investigate plant-plant variation. One is to assume a model and, based on that model, calculate the best linear unbiased predictor (BLUP) of each random effect β_ij associated with plant ij (i.e., predict the random effect). The other is to ignore the measurement variability and use traditional diagnostics to evaluate the averages for each plant. The second approach is reasonable when the contribution of the measurement variance is approximately the same for all plants. This is the case here for log-transformed data, so we use plant averages to investigate the plant-plant variability. If observations are log transformed, the standard deviation (s.d.) between plant averages is approximately constant (Figure 4). But, if observations are transformed using the stronger 1/Y transformation, the s.d. between plant averages is clearly not constant (Figure 5). Hence, the analysis will use a log transformation because it provides an approximately constant measurement variance and a constant plant-plant variance.

A Model for All Observations
The model for all 48 observations is then:

$$ \log Y_{ijkl} = \mu + \alpha_i + \theta_{ij} + \delta_k + \beta_{ijk} + \gamma_{ijkl}. \qquad (2) $$

The Threonine concentration in replicate i, extraction batch j, id k, and measurement l is Y_ijkl. The overall mean Threonine concentration is denoted μ. The deviation from the mean associated with replicate i is α_i. The deviation from the mean of replicate i associated with extraction batch ij is θ_ij. The deviation from the mean associated with id k is δ_k. The deviation associated with plant ijk, for id k in extraction batch ij, is β_ijk. Within plant ijk, γ_ijkl is the deviation of observation ijkl from the plant mean. The variability described by the γ_ijkl includes the variability among extracts and the variability among measurements, because there is only one measurement per extract in the 48-observation data set. All the random effects are assumed to be independent and normally distributed. Each source of variation has its own variance component: α_i ~ N(0, σ²_rep), θ_ij ~ N(0, σ²_batch), β_ijk ~ N(0, σ²_plant), and γ_ijkl ~ N(0, σ²_meas).

Fitting model (2) to the Threonine data gives the ANOVA table in Table 2. The estimated variance components are σ̂²_rep = 0.16, σ̂²_batch = 0.0017, σ̂²_plant = 0.086, and σ̂²_meas = 0.022. The REML estimates of the variance components are, in this case, exactly the same because the data are balanced and all estimated variance components are positive.
Table 3—Standard Error (s.e.) of the Difference of Two Treatment Means for Different Choices of Sample Size, Assuming σ²_plant = 0.086, σ²_extract = 0.022, and σ²_tech = 0.00034

Number of Plants   Extracts per Plant   Measurements per Extract   s.e. of Difference
       4                   2                       1                    0.220
       4                   2                      10                    0.220
       4                   4                       1                    0.214
       8                   2                       1                    0.156
Estimating the Variability Between Measurements of the Same Extract
Ding re-measured one of the 48 extracts used in the above analysis. The two measurements are 2.244 and 2.187. The variance of these values estimates the technical measurement variance (i.e., the variability between measurements made on the same extract). Using log-transformed values, this is σ̂²_tech = 0.00034, which is two orders of magnitude less than the combination of measurement and extraction variability. Given an estimate of the technical measurement variance, it is possible to estimate the contribution to the error due to extraction. Because each of the 48 extracts in the original data set was measured once, σ²_meas = σ²_tech + σ²_extract, where σ²_extract is the variance component between extracts of the same plant. The estimated variance component is σ̂²_extract = σ̂²_meas − σ̂²_tech = 0.022 − 0.00034 = 0.022. Although σ̂²_tech is not precise because it is a one degree of freedom (d.f.) estimate, it is clear that essentially all the variability between measurements is due to variability between different extractions of a single plant. Almost none of the variability comes from the instrument measurement.
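The calculation behind σ̂²_tech is a one-line, one-degree-of-freedom sample variance; with the two measurements as printed here (which are presumably rounded), it comes out near 0.0003, consistent with the 0.00034 used above.

```python
import numpy as np

# Sample variance of the two log-scale measurements of the same extract.
y = np.log([2.244, 2.187])
print(f"sigma2_tech estimate: {y.var(ddof=1):.5f}")   # about 0.00033
```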
The Uncertainty Budget
Consistent with the earlier results for one id, the biological variance between plants is roughly four times larger than the variance between extractions and two orders of magnitude larger than the technical variability between measurements. The variability between different extracts is small, but the variability between the two replicates of the study is surprisingly large. The data indicate why the replicate variance component is so large. The average Threonine concentrations are 0.64 and 0.76 nm/mg for the two extractions in the first replicate and 1.23 and 1.52 nm/mg for the two extractions in the second replicate. The large variance component between replicates makes sense, but the biological reasons for such a large variation are, as yet, unknown.

Since the model assumes log-transformed values are normally distributed, the variance components can be converted into coefficients of variation for each component of error, as described previously. The technical measurement c.v. is √(exp(0.00034) − 1) = 1.8%, the extraction c.v. is 15.1%, the plant-plant c.v. is 29.9%, the batch c.v. is 4.1%, and the replicate c.v. is 42%.

The uncertainty budget and estimated variance components provide useful information for designing subsequent studies. The goal of Ding’s work is to compare metabolite concentrations among genotypes and mutants. Blocking by extraction and replicate (i.e., measuring all id’s [combinations of genotypes and mutants] in the same extraction batch and same replicate) increases the precision of comparisons among id’s. When the average metabolite concentration for an id is calculated from r replicates, b extraction batches per replicate, e extractions per plant, and m measurements per extract, the variance of the difference between two id averages is:

$$ \operatorname{Var}\bigl(\bar{Y}_{\cdot\cdot 1\cdot} - \bar{Y}_{\cdot\cdot 2\cdot}\bigr) = 2\left(\frac{\sigma^2_{\mathrm{plant}}}{rb} + \frac{\sigma^2_{\mathrm{extract}}}{rbe} + \frac{\sigma^2_{\mathrm{tech}}}{rbem}\right). $$

When comparisons are made within blocks, neither the replicate nor the batch variance contributes to the variance of the difference. The only variance components that matter are those for plants, extracts, and measurements.

Increasing the number of plants—by increasing either the number of replicates, r, or the number of extraction batches, b—decreases the contribution of all three variance components, σ²_plant, σ²_extract, and σ²_tech. This effect is sometimes called hidden replication because increasing the number of plants also increases the numbers of extracts and measurements. An alternative is to retain the same number of plants, but increase the number of extracts or measurements per plant. Assuming the variance components estimated from these data apply to a new study, the expected precision can be calculated for various combinations of the number of plants, the number of extractions per plant, and the number of measurements per extract (Table 3). Because the technical measurement variance is so small relative to the other sources of variability, increasing the number of measurements per extract tenfold has essentially no effect on the precision. Doubling the number of extracts per plant leads to a small increase in precision, but doubling the number of plants markedly increases the precision of the difference. The general advice for designing a study with multiple sources of error is to replicate “as high up as possible.” In this study, that means increasing the number of plants, since the plant-to-plant variance is the largest component of the uncertainty budget.
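The entries of Table 3 follow directly from the variance-of-a-difference formula and the estimated components; the short sketch below reproduces them (here "number of plants" plays the role of rb in the formula).

```python
import math

# Estimated variance components from the Threonine uncertainty budget.
s2_plant, s2_extract, s2_tech = 0.086, 0.022, 0.00034

def se_difference(n_plants, extracts_per_plant, meas_per_extract):
    """Standard error of the difference of two id means (n_plants = r*b)."""
    n_extracts = n_plants * extracts_per_plant
    n_meas = n_extracts * meas_per_extract
    var = 2 * (s2_plant / n_plants + s2_extract / n_extracts + s2_tech / n_meas)
    return math.sqrt(var)

for plants, extracts, meas in [(4, 2, 1), (4, 2, 10), (4, 4, 1), (8, 2, 1)]:
    print(f"{plants} plants, {extracts} extracts, {meas} meas.: "
          f"s.e. = {se_difference(plants, extracts, meas):.3f}")
```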
Final Thoughts
Plant metabolomics has given us new biological data for studying the relationship between genotype and phenotype, and thereby for learning about basic scientific processes. Using data from one metabolite, we have explored the characteristics of measurement and plant-plant variability, constructed an uncertainty budget, and used the estimated variance components to evaluate design choices. We found that the biological variability between plants is larger than the variability between extractions, and considerably larger than the variability between measurements of the same extract. Similar sorts of evaluations are possible whenever there are replicated observations for each important source of variability, but the details of the statistical model will depend on the experimental design (i.e., whether random effects are crossed or nested). Estimating variance components and identifying the important parts of the uncertainty budget help design more precise and cost-effective studies.
Further Reading
Variance components analysis is described in many intermediate-level applied statistics books. Two of many good chapter-length treatments are in Angela M. Dean and Daniel Voss’ Design and Analysis of Experiments and George E. P. Box, J. Stuart Hunter, and William G. Hunter’s Statistics for Experimenters. Details and many extensions of what has been described here are presented in Shayle R. Searle, George Casella, and Charles E. McCulloch’s book, Variance Components, and D. R. Cox and P. J. Solomon’s book, Components of Variance.
Florida 2006: Can Statistics Tell Us Who Won Congressional District-13? Arlene Ash and John Lamperti
Figure 1. Map of Congressional District 13
Elections seem simple. People go to the polls. They make choices about one or more contests or issues. The votes are counted. What can go wrong with that? Unfortunately, many things can go wrong. In the United States, voters are often confronted with bewildering numbers of issues. Ballot choices and designs vary from election to election and district to district—or even within a district. People may have trouble casting the votes they intend. Both machine and human issues affect how votes are recorded and counted. Especially in a close race, the official results may not reflect the actual choices of the voting public.
Florida’s 13th Congressional District 2006 Election The 2006 contest for the U.S. House of Representatives in Florida’s District 13 was such a race. The Republican candidate, Vern Buchanan, was declared the winner by just 369 votes, triggering a “mandatory recount.” Unsurprisingly, re-querying the same “touch screen” machines that delivered the vote the first time changed nothing. The Democrat, Christine Jennings, challenged the result well into 2008. The problem is not that the race was close. It is that, in Sarasota County, an area of relative Democratic strength, some 18,000 people—almost 15% of those who went to the polls and cast ballots—had no choice recorded for their representative to Congress. A cast ballot with no recorded choice in a race is called an “undervote.” The rest of the district contributed about half the total vote, but fewer than 3,000 undervotes. Jennings believes the excess missing votes in Sarasota would have tipped the race to her. Can statistical analysis help evaluate that claim?
Figure 2. Screen shots of the first two (of 21) pages of the Sarasota County 2006 touch screen ballot
Congressional District 13 (CD-13) is geographically diverse (see Figure 1), including all of Sarasota; all or most of DeSoto, Hardee, and Manatee Counties; and a small part of Charlotte County. About half the district's population (a count of about 370,000 people) is in Sarasota. Manatee has a population of 310,000. DeSoto and Hardee together contribute 65,000 residents. Some issues and candidates are county-specific, so voters in different parts of the district faced different ballots. George Bush received 56% of the entire CD-13 vote in 2004. However, Sarasota County leans Democratic,
and, of course, the broader political climate also shifted between 2004 and 2006. In 2006, all voters in CD-13 participated in the House race plus five statewide elections—for U.S. Senate and four state offices: gubernatorial (for a combined governor/lieutenant governor slate), attorney general, chief financial officer, and commissioner of agriculture. They were also presented with numerous county-specific races and issues. Indeed, each District 13 voter faced a ballot presenting anywhere from 28 to 40 choices. Voting occurred in one of three ways: absentee
ballot, early in-person voting, or traditional Election Day voting. Touch screen voting machines (also known as Direct Recording Electronic, or DRE) were used at all polling stations in Sarasota County for both early and same-day voting. Except for the absentee ballots, the machine totals are the only record of the vote. What accounts for the 18,000 missing votes for U.S. representative? What would their effect have been?

Figure 3. Undervotes in the House race by voting venue and partisanship of other votes among 104,631 ballots with votes recorded in all five statewide contests. (The chart plots the percent of ballots missing a House vote, for early voting and for Election Day voting, against the number of votes for Democrats in the five statewide elections.)
Undervotes Undervotes may be intentional—for example, in little-contested local races, where voters have no knowledge or preference. They also may be unintentional—the voters accidentally do not register a vote in a particular race. Finally, they may be entirely “false”—the voters choose, but no choice registers, as with the famous hanging chads of 2000. In well-publicized statewide or national races, undervoting is normally in the 1% to 3% range, with unknown contributions of intentional, unintentional, and false. The campaign for this important, open U.S. House seat had been intense and, by many accounts, dirty. Yet, in Sarasota County, about one out of every seven ballots cast by touch screen recorded no vote in this race. Why? State officials at first echoed the explanation offered by aides of the declared winner: voters must have abstained due to disgust at the nasty campaign. However, none of the other counties had unusual undervotes in the same race. Manatee County, for example, reported normal undervoting of only 2%. Why would voter disgust stop at the county line? Moreover, the undervote on absentee ballots was low everywhere; only ballots in Sarasota County that had been voted on touch screens displayed abnormally high undervoting. In Sarasota County, the highest undervote rate occurred in early voting. Thus, the huge undervote in Sarasota was specific to that county, applied to in-person voting but not absentee
ballots, and moderated somewhat between early and election-day voting. There is at least one obvious explanation for this pattern—a ballot design (faced by touch-screen voters in Sarasota County only) that made it more difficult to vote for U.S. Representative there than elsewhere in CD-13. Indeed, the Sarasota Herald-Tribune cited contacts from “more than 120 Sarasota County voters” reporting problems, mainly with ballot screens that “hid the race or made it hard to verify if they had cast their votes.” This alone would hurt Jennings, since Sarasota County voters were more favorable to her than were voters in the other counties. The ballot design in Sarasota County certainly caused problems. Computer Screen 1 was devoted entirely to Florida’s U.S. senatorial race, with seven lines of choices presented, immediately beneath a bright blue banner labeled “Congressional.” The undervote rate in this race was normal (that is, low). But Screen 2 presented the House race at the top with only two voting lines and no special banner. The bulk of the page, following a second bright blue banner (“State”) listed seven choices on 13 lines for the gubernatorial election. See Figure 2. Laurin Frisina and three collaborators believe the CD-13 undervote in Sarasota County was due to the ballot screen layout. They point out that abnormally high undervote rates (ranging from 17% to 22%) also were found in the attorney general’s race, and just in one part of CD-13: Charlotte County. On that ballot (only), it was the attorney general race with only two candidates that shared a screen with 13 lines of choices for the gubernatorial election. Other factors likely contributed, as well. For example, there were abnormally slow machine response times that could have led people to “unvote” while trying to ensure their vote registered. This was flagged as a problem by the voting machine supplier the previous August, but not fixed prior to early voting. Furthermore, there are strong patterns in the undervote CHANCE
Table 1—Florida's CD-13 Race in Sarasota County for All with Votes in Five out of Five Statewide Contests

Columns: number of Democratic votes in the other five contests; total ballots; recorded and missing votes in the CD-13 contest for the U.S. House of Representatives (Buchanan, Jennings, no vote); % for Buchanan; % undervote; proportional allocation of the undervote (Buchanan, Jennings); and the change in the Buchanan-minus-Jennings tally from including the undervotes.

Early Voting
Dem votes   Total     Buchanan   Jennings   No vote   % Buchanan   % Undervote   Alloc. B   Alloc. J   Change
5           10,764       122      8,655      1,987       1.4%        18.5%           28      1,959     -1,932
4            2,789       151      2,250        388       6.3%        13.9%           24        364       -339
3            1,170       174        831        165      17.3%        14.1%           29        136       -108
2            1,167       346        664        157      34.3%        13.5%           54        103        -49
1            2,173     1,227        657        289      65.1%        13.3%          188        101         87
0            9,455     8,059        435        961      94.9%        10.2%          912         49        863

Election Day
5           25,326       468     21,541      3,317       2.1%        13.1%           71      3,246     -3,176
4            7,637       561      6,261        815       8.2%        10.7%           67        748       -681
3            3,629       691      2,529        409      21.5%        11.3%           88        321       -233
2            3,847     1,387      2,022        438      40.7%        11.4%          178        260        -82
1            7,305     4,402      2,116        787      67.5%        10.8%          532        255        276
0           29,364    25,676      1,359      2,329      95.0%         7.9%        2,212        117      2,095
within Sarasota County (see below), despite all Sarasota voters facing the same ballot. Walter Mebane and David Dill, after extensive study, believe the cause of “the excessive CD-13 undervote rate in Sarasota County is not yet well understood and will not be understood without further investigation.” In any case, problems became evident during early voting, eventually leading Sarasota County’s supervisor of elections to issue warnings to precinct captains. On election day, undervoting on these machines was lower than in early voting, but still exceeded 10%. This much is beyond dispute.
Consequences of the Undervote
But did it matter that 18,000 Sarasota voters had no recorded votes in the House race? Assuming a normal rate of intended undervotes, the choices of some 15,000 voters were not counted. What inferences can be made about how those votes would have divided between the candidates if they had been recorded? Would they have changed the outcome? There are several ways to tackle this question, and we'll describe perhaps the simplest one.
Imagine a group of N voters, with R of them intending to vote for the Republican candidate and D for the Democrat, so that R + D = N. Suppose a random group of N − n votes are "lost," creating an undervote. Thus, n votes are actually counted: r Republican votes and d Democratic ones (d = n − r). Let's think of these n recorded votes as a random sample taken without replacement from the population of N would-be voters. Of course, we often make inferences from samples to the whole population. Usually, the sample size, n, is a small fraction of the population size, N. Here, we have a very large sample; n is more than 85% as large as N! Never mind, the calculations are the same. The r Republican votes in the sample are viewed as the result of n "trials," draws without replacement from a population of size N, where the "success" probability is p = R/N, here approximately 1/2. Thus, the expected value of r and its variance are computed in the familiar way:

E(r) = n·p;   Var(r) = np(1 − p)·(N − n)/(N − 1) ≈ n(N − n)/(4N).
The multiplier (N − n)/(N − 1) is the familiar "finite population correction factor" for sampling without replacement, found in any survey sampling text. It can often be neglected—but not here! Both N − n and n are large, so the distribution of r is nearly normal. In this case, all we need do to estimate the Republican advantage (possibly negative) in the whole population is "inflate" r − d, the Republican advantage in the counted votes, by N/n, the factor by which the whole population exceeds the counted vote. Thus, a statistically unbiased estimator of R − D is:

Estimated (R − D) = (N/n)(r − d) = (N/n)(2r − n).
Refining the Estimate
The associated standard error is SE = √(N(N − n)/n). This translates easily into a 95% confidence interval for R − D:

(N/n)(r − d) − 2√(N(N − n)/n) ≤ R − D ≤ (N/n)(r − d) + 2√(N(N − n)/n).

How does this result apply to the District 13 election? First, let's imagine that, say, 20,000 nonvoters were randomly chosen from the whole voting population of the district, which was roughly N = 240,000 in 2006. The counted ballots gave Republican Buchanan an edge of 369 votes; that's the value of (r − d). By the above formula, the 95% confidence interval for R − D ranges from a low of just more than 100 to a high of nearly 700. Since the interval contains only positive numbers, we conclude with (greater than) 95% confidence that there would not be enough Democratic votes among the missing 20,000 to shift the outcome. Thus, despite the tiny winning margin (less than 1/6 of 1%) and the huge number of missing votes—if the missing votes were distributed just like the whole population—random error due to their loss would not threaten the outcome.
Of course, the missing votes were not chosen randomly from the whole district. For starters, the vast majority came from Sarasota County where Jennings had an advantage. Suppose there was a "normal" intentional undervote of 2.5% among the 120,000 voters in that county, so that only 15,000 (of the 18,000) undervotes were unintentional. Assume the 15,000 uncounted votes were chosen randomly from the county. Would that matter? Indeed, it would! In Sarasota, the recorded votes gave Jennings an edge of 6,833, so r − d = −6,833. If R − D now stands for the true Republican advantage among 117,000 would-be voters in Sarasota County, the point estimate for R − D is −7,838, with a 95% confidence interval ranging from about −8,100 to −7,575. Elsewhere in the district, Buchanan had an advantage of 7,202 votes. If we treat the votes in the other parts of the district as error-free, the estimate indicates a win for Jennings by 636 votes, with a 95% confidence interval for R − D ranging from −898 to −373. Again, the interval does not cross zero, and so, with more than 95% confidence, we conclude that Jennings should have won. In fact, had we used ±4 SE instead of ±2 SE, the confidence interval still would not include zero; this raises the confidence level to 99.9%. Moreover, in the context of a one-sided question—did Buchanan really get more votes than Jennings?—one-sided confidence bounds could be used, raising the level of certainty even higher.
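A minimal Python sketch of these two interval calculations, using the estimator and standard error given above; the function name is ours, and the inputs are the article's stylized counts (N = 240,000 with about 20,000 lost votes district-wide; N = 117,000 with 15,000 lost votes in Sarasota).

```python
import math

def margin_estimate(N, n, recorded_margin):
    """Estimate R - D for a population of N would-be voters from the margin
    r - d observed among n counted ballots, treating the N - n uncounted
    ballots as a simple random sample of the population."""
    est = (N / n) * recorded_margin
    se = math.sqrt(N * (N - n) / n)    # finite-population correction included
    return est, se

# Whole district: N about 240,000, roughly 20,000 votes lost, recorded margin +369.
est, se = margin_estimate(240_000, 220_000, 369)
print(round(est), round(est - 2 * se), round(est + 2 * se))   # ~403, interval ~(107, 698)

# Sarasota only: 117,000 would-be voters, 15,000 lost, recorded margin -6,833.
est, se = margin_estimate(117_000, 102_000, -6_833)
print(round(est), round(est - 2 * se), round(est + 2 * se))   # ~-7838, interval ~(-8100, -7575)
```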
In making this estimate, we assumed 15,000 unintentional undervoters in Sarasota County differ from those who did vote only in that their votes were not recorded. Can this assumption be tested? Table 1 and Figure 3 are based on "ballot image" data from Walter Mebane that show the sets of choices for the 104,631 Sarasota County ballots with touch screen votes recorded in all five statewide contests. The data are arranged by early versus Election Day voting and by the number of Democrats chosen in the five statewide contests. We'll soon see how useful such data can be.
First, in both early and Election Day balloting, there is a steep gradient associating partisan voting in the other races and the preference of voters—those whose choices were captured—in the House race. For example, in early voting among otherwise "straight ticket" Democrats, only 1.4% of votes for the House race went to Buchanan, as opposed to 94.9% of recorded votes among early voting Republican stalwarts. Second, it was far easier to "lose" Democratic votes than Republican ones in this race. For example, the straight ticket Democrats had 18% uncounted votes in early voting as opposed to "only" 10% for their early voting Republican counterparts. Understanding what caused these differences is crucial for the legal challenge to this election and for avoiding future voting debacles. For our purposes, we merely note that—in contrast to our previous assumption—not all Sarasota voters were equally at risk for unintentional undervotes. We'll return in a minute to the more refined calculation of the expected effect of the lost votes these data allow.
A third important fact that emerges (see Figure 3) is that the undervote declined substantially within all categories of voters between early voting and Election Day voting. Apparently, many voters were helped by actions taken to mitigate the problems seen in early voting. A study exploring associations between corrective actions taken at individual precincts and undervote rates could be very informative. We do not have such data.
What we do have in the ballot image data leads to a sharper estimate of the likely disposition of most of the missing congressional votes. First, it is hard to imagine that many of the 12,000 voters who expressed a choice in all five statewide races (including commissioner of agriculture and chief financial officer), but had no vote recorded in the House race, intentionally undervoted. Let's suppose they all intended to vote. How would they have voted? A good guess is that the people with missing House votes in each of the 12 strata in Table 1 would have voted in the same proportions as those in the same stratum whose votes were recorded. That is, we perform the same calculations as above, this time within each
stratum of Table 1. Then, we sum the estimates of the “full” vote across the strata, leading to a new estimate of R–D that represents the Republican advantage after imputing values for the undervote among these 12,000 people. This calculation suggests Jennings’ advantage among these lost votes alone was almost certainly greater than 3,000. It swamps Buchanan’s original 369-vote winning margin. For whatever reasons, it was harder to cast a successful vote for Jennings than for Buchanan in Sarasota County. The higher observed undervote among presumed Democrats means our previous confidence interval calculation was conservative; the conclusion that Jennings was the real winner in CD-13 becomes even surer. The study by Frisina uses two methods to analyze the CD-13 undervote. Both infer undervoters’ choices from their votes for other candidates. One uses precinct-level data from Sarasota County. The other involves matching Sarasota voters with counterparts in Charlotte County. Both show that Jennings was almost certainly the preferred choice among the majority of CD-13 voters. These different estimates may seem confusing. However, the key point is that all plausible models of what the lost votes would have been point to the same conclusion. Furthermore, the more carefully we examine the data, the more support we see for that conclusion. While poor ballot design may or may not fully account for the Sarasota undervote, it is clear that those missing votes switched the outcome of the congressional race from Jennings to Buchanan.
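The stratum-by-stratum allocation is easy to reproduce from the recorded counts in Table 1. The Python sketch below carries out the proportional allocation shown in the table's last columns, a simplified version of the within-stratum estimation described here; the dictionary layout and names are ours.

```python
# Recorded (Buchanan, Jennings, undervote) counts from Table 1, keyed by the
# number of Democratic choices made in the five statewide races.
early = {5: (122, 8655, 1987), 4: (151, 2250, 388), 3: (174, 831, 165),
         2: (346, 664, 157),   1: (1227, 657, 289), 0: (8059, 435, 961)}
eday  = {5: (468, 21541, 3317), 4: (561, 6261, 815), 3: (691, 2529, 409),
         2: (1387, 2022, 438),  1: (4402, 2116, 787), 0: (25676, 1359, 2329)}

def net_change(strata):
    """Allocate each stratum's undervotes in proportion to its recorded split and
    return the resulting change in the Buchanan-minus-Jennings tally."""
    change = 0.0
    for buchanan, jennings, undervote in strata.values():
        p_buchanan = buchanan / (buchanan + jennings)
        change += undervote * (2 * p_buchanan - 1)
    return change

print(round(net_change(early) + net_change(eday)))
# about -3,279: a net gain of more than 3,000 votes for Jennings
```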
What Happens Now? Finally, two questions. How should Florida and other states fix their flawed electoral processes? Requiring a paper record is useful, but not enough, since recounting such a record in District 13 might have simply confirmed that 18,000 Sarasota County voters recorded no choice for their U.S. representative. The paper record, therefore, must at least be confirmed by each voter. We favor paper ballots, plus optical scanners to read them—the method familiar to us all from grading tests and used now for elections in many states. It is relatively inexpensive and foolproof. It does not require new, possibly fragile, technology or big capital investments. It provides an independent check on what is going on inside the machines that tally the votes. Optical scan ballots are also easier to read and less prone to the design problems that disfigured the CD-13 House race. Indeed, optical scanning was used in 2006 in Sarasota County for the absentee ballots and it worked well. The second question, of course, is what to do about that dubious 2006 election. The statistical evidence shows, beyond any reasonable doubt, that more voters wanted Jennings than Buchanan. However, there is—as yet—no precedent for a court overturning an electoral “count” based on a statistical analysis. We have recommended doing this election over—and doing it right. For the future, statisticians and voting experts should work together to develop guidelines for the appropriate use of statistical evidence to confirm, or overturn, elections.
Further Reading
Adams, Greg (2001). "Voting Irregularities in Palm Beach, Florida." CHANCE, 14:22–24.
Frisina, L., Herron, M., Honaker, J., and Lewis, J. (in press). "Ballot Formats, Touchscreens, and Undervotes: A Study of the 2006 Midterm Election in Florida." Election Law Journal. Draft at www.dartmouth.edu/~herron/cd13.pdf.
Marker, D., Gardenier, J., and Ash, A. (2007). "Statistics Can Help Ensure Accurate Elections." Amstat News, (360):2–3. Online at www.amstat.org/publications/AMSN/index.cfm?fuseaction=pres062007.
McCarthy, J., Stanislevic, H., Lindeman, M., Ash, A.S., Addona, V., and Batcher, M. (2008). "Percentage-Based Versus Power-Based Vote Tabulation Statistical Audits." The American Statistician, 62(1):1–6. (A more detailed version is available at www.verifiedvotingfoundation.org/auditcomparison as "Percentage-Based Versus SAFE Vote Tabulation Auditing: A Graphic Comparison.")
Mebane, W. Jr., and Dill, D.L. (2007). "Factors Associated with the Excessive CD-13 Undervote in the 2006 General Election in Sarasota County, Florida." www-personal.umich.edu/~wmebane.
Meyer, Mary C. (2002). "Uncounted Votes: Does Voting Equipment Matter?" CHANCE, 15:33–38.
Pynchon, S., and Garber, K. (2008). "Sarasota's Vanished Votes: An Investigation into the Cause of Uncounted Votes in the 2006 Congressional District 13 Race in Sarasota County, Florida." Florida's Fair Election Coalition. www.floridafairelections.org.
Wallace, J. (2006). "Political Operatives Gather for Recount." Herald Tribune, www.heraldtribune.com/apps/pbcs.dll/article?AID=/20061111/NEWS/611110643. Special Section: District 13 Election, www.heraldtribune.com/apps/pbcs.dll/section?CATEGORY=NEWS0521&template=ovr2.
Wolter, K., Jergovic, D., Moore, W., Murphy, J., and O'Muirheartaigh, C. (2003). "Reliability of the Uncertified Ballots in the 2000 Presidential Election in Florida." The American Statistician, 57(1):1–14.
Statistical Solutions to Election Mysteries Joseph Lorenzo Hall
Elections in the United States are strange. While other nations have problems with violence at the polls or seemingly insurmountable logistical issues, the problems in our country cluster around complexity. No other country votes so frequently, for so many contests at all levels of government, using dozens of methods to enfranchise all eligible voters. Naturally, such complexity results in frequent errors and a few genuine mysteries. Arlene Ash and John Lamperti confidently (with greater than 99.9% confidence) conclude that the wrong candidate is currently holding the CD-13 office. This is perhaps the worst possible outcome in an election, with a close second being that there is no discernable winner. Although Ash and Lamperti don't address it, the case of the disputed Florida 2000 presidential election was similar. Florida 2000 received a lot of attention in the political science literature. Researchers such as Walter Mebane arrived at similar conclusions, but due to a different mechanism: Instead of mysterious undervotes changing the outcome of the race, the problem in Florida 2000 was with spurious overvotes—where ballots show too many choices recorded for a particular race.
Both of these cases enjoy peculiar features that many election mysteries do not. First, the underlying data in terms of ballot image data and precinct-level vote data could be obtained by using Florida's public records laws. Florida permits some of the highest access to the inner workings of its government via the Florida Public Records Act. In both of these cases, researchers were able to obtain crucial data that would typically not be made publicly available in other states. Second, when these data were analyzed, researchers found a definitive answer with respect to the disposition of the outcome. Many election mysteries remain mystifying, even after forensic investigation. A case in point is the search for an answer to a different question about the same CD-13 race: What was the cause of the prodigious undervote? As Ash and Lamperti point out, a team of academic computer security experts examined the software that runs the voting machines used in Sarasota's CD-13 race and could not find a software-based cause.
The problem that Ash and Lamperti address is a subset of a more general problem: measuring how confident we are that an election has been decided correctly. In hindsight, one would think mechanisms to ensure election confidence would have been designed into our electoral system, given its fundamentally adversarial nature. Unfortunately, in many cases, the only checks performed on election results are recounts, which can have significant costs and legal barriers and be noninformative. Part of the answer proposed by Ash and Lamperti is to regularize checking the math behind our elections. This requires two elements: There needs to be something to audit—an audit trail—and there needs to be the appropriate regulatory and procedural infrastructure to conduct election audits.
For auditability, voting systems must produce an independent, indelible, and secure record of each ballot that voters check for correctness. Fortunately, only a minority of states (12) currently do not require their voting systems to produce such records. However, a recent study by Sarah Everett provides compelling evidence that people don't check these records, and, when they do, they don't notice errors. To improve auditability, we need a combination of voter education about audit record verification and further usability research to make these records easily verifiable. Unfortunately, despite states overwhelmingly moving toward producing audit records, audits of these records are only performed in one-third of all states, and then they are performed under a wide variety of standards. A white paper authored by Lawrence Norden, Aaron Burstein, Joseph Lorenzo Hall, and Margaret Chen (see www.brennancenter.org/dynamic/subpages/download_file_50227.pdf) with the input of a blue-ribbon technical panel highlights this disparity and reviews the various types of post-
election audit models in theory and practice. To address this imbalance and inconsistency, it appears we need federal legislation that mandates election audits and audit standards for federal elections. One solution that comes to my mind, which Ash and Lamperti do not propose, is that of user testing of ballot styles. User testing would involve usability testing each ballot style with a number of actual users to detect strange or unintended behavior. This kind of testing would discover both problems with particular ballot styles and other types of interaction problems, including software bugs. An analogy could be made to the use of focus groups and pilot studies to test survey instruments. Usability testing on this scale, where ballot styles can number in the thousands for certain jurisdiction’s primary elections, would be prohibitively expensive in terms of time and resources. Thousands of ballot styles can result when factors including political party, level of election (e.g., federal, state, local), ballot status (official or provisional), and language are crossed. Limited user testing would certainly be less expensive, but it would be much less effective. Ash and Lamperti propose a less intense, but equally radical, solution to these kinds of mysteries. They advocate allowing elections to be overturned based on statistical evidence. Compared to regularizing post-election audits, this proposal is obviously more complex, involving legal line-drawing standards about when to consider an election suspect based on statistical evidence. For example, is 95% confidence that the election was decided incorrectly enough? 90%? 99.9%? Should the standard be overwhelming statistical evidence, indisputable statistical evidence, or something else? And how will assumptions about undervotes, such as those discussed by Ash and Lamperti, and overvotes be evaluated? Who will do the evaluation? Different assumptions, in some cases, will make a difference. Developing such guidelines for statistical challenges to elections will be difficult, but it might be exactly what judges look for in future litigation involving election mysteries. Joseph Lorenzo Hall can be reached at
[email protected].
Counting Frustrated Voter Intentions Walter R. Mebane Jr.
People go to the polls to vote, and then what happens? Recent elections in the United States have seen many cases where voters voted in circumstances that left too many of them doubting whether their votes were counted. In the 2004 election, this happened not only in Ohio, but in several states that used electronic, touch screen voting technology. In 2006, there were relatively minor problems in various jurisdictions, but initial reports suggested voters' experiences were, in general, better than they had been during 2004, according to a Washington Post article by Howard Schneider, Bill Brubaker, and Peter Slevin. The election for the U.S. House of Representatives in Florida's District 13 helped shatter the illusion of normalcy, reliability, and success. As Arlene Ash and John Lamperti observed, more than 18,000 votes cast on iVotronic touch screen machines in Sarasota County in that race were unaccountably missing (an iVotronic machine like the ones used in Sarasota is described at www.srqelections.com/ivotronic/ivotronic.htm). Ash and Lamperti show that any of several reasonable conjectures about the intentions of the voters who cast these undervotes imply the missing votes are sufficient to have changed the outcome of the election. This finding agrees with the conclusions reached by experts on both sides of one of the lawsuits filed to challenge the outcome based on allegations of defects in the voting machines.
Voter intent is, at first glance, a straightforward idea. Out of a set of candidates or a set of options regarding a ballot initiative, each voter has, at the moment of voting, decided to choose one or has decided not to make a choice. The voter's intention is to have that choice conveyed accurately into the final vote count, or if the voter abstained, the intention is to not have an effect on the final vote count. The voter undertakes some physical gesture—for example, marking on a paper ballot or touching a video screen—with the idea that gesture will ultimately cause the final vote count to be changed, or not, in the way the voter intended.
Nuances come to light when we think about different ways a voter's intentions may be frustrated. Once the voter is at the moment of voting, there are, broadly speaking, two ways things can go wrong. Something can prevent the voter from making the gesture that would express the voter's choice. Or, something can prevent the voter's gesture from having the desired effect on the final vote count. In both cases, there are further important distinctions pertaining to where the difficulty occurs. When a voter is unable to make the appropriate gesture, is that something about the voter or something about the circumstances? Was the Election Day environment the same for all voters, but this particular voter was somehow unable to do the right thing in that setting? Or, were different voters somehow treated differently? When an appropriate gesture does not have the desired effect, is the obstacle something occurring immediately in the voting machine the voter is using or something that happens later in the process, perhaps long after the voter has left the polling place?
Table 1—Sarasota 2006, District 13 Election Day Undervote Rate by Occurrence of Event 18 ("Invalid Vote PEB") on Machine and Event 36 ("Low Battery Lockout") in Precinct

                                        Event 18 = No    Event 18 = Yes
Event 36 = No   CD-13 Undervote Rate        13.7%            14.6%
                Total Ballot Count         67,748            9,879
Event 36 = Yes  CD-13 Undervote Rate        14.6%            15.0%
                Total Ballot Count          9,716            1,699

Note: Rates are the proportion of the Election Day ballots in each category that have a CD-13 undervote.

Maybe all of these ways voters' intentions may be frustrated can serve equally well to motivate a what-if exercise designed to see what would have happened had all votes been counted as they were intended. Ash and Lamperti do not try to decide among the several explanations that have been suggested for the excessively high rate of undervoting. But, it may be important to take a stronger stand on this. Suppose one believes the high number of undervotes was the result of some voters being unable to make an appropriate gesture in an environment that was the same for all voters. Someone with such beliefs may be skeptical that these undervotes are unintentional. After all, following the 2000 election debacle in Florida, an elections supervisor stated that the blame for spoiled ballots falls on the voters, wondering, "Where does their stupidity enter into the picture?" Such people seem to believe that would-be voters who fail to solve perceptual or procedural puzzles that all voters have been given to solve do not deserve to have their votes counted. To fully motivate what-if exercises such as the ones Ash and Lamperti carried out, it may be important to demonstrate that the frequency of undervotes varied with circumstances that varied across voters.
So, one can have two attitudes about the claim Laurin Frisina et al. make, that "the exceptionally high Sarasota undervote rate in the 13th Congressional District race was almost certainly caused by the way Sarasota County's electronic voting machines displayed on a single ballot screen for the congressional contest and the Florida gubernatorial race." One view is that, because the ballot's format varied across Florida counties, voters in different counties did face different circumstances. Such a perspective may carry the implication that voters' experiences within each county were homogeneous. One might argue, then, that any variations in the undervote rate among voters within each county must trace back to something about the voters. The cross-county heterogeneity perspective might lead one to think the what-if exercises are well motivated, but the within-county homogeneity perspective might point in the opposite direction. In fact, different voters in Sarasota faced significantly different circumstances, because different voting machines
performed differently. Recent reports by Susan Pynchon and Kitty Garber document problems that afflicted iVotronic touch screen voting machines not only in Sarasota County, but wherever they were used throughout the state. These reports go beyond previous investigations that considered a limited range of evidence regarding software failures. Pynchon and Garber document extensive problems ranging from low battery errors and power failures to poor security for critical voting machine hardware. They also show that, across the state, undervote rates were higher for many races where iVotronic touch screen machines were used, regardless of the ballot format. The force of these recent reports is to suggest not only that machines, and not voters, were responsible for excessive undervotes, but that it is possible that security failures allowed vote counts to be altered long after the polls closed. The reports do not demonstrate that manipulations before, during, or after the election definitely occurred, but they do document striking security failures and show that previous investigations were not sufficient to rule out such possibilities.
Using data from Sarasota, one can show that readily measurable problems with the voting machines correlate with significant variations in the frequency of undervotes in the election for the U.S. House of Representatives in District 13. I consider variations in this undervote rate across four categories, defined by two kinds of error conditions. One is an error indicating that an invalid Personalized Electronic Ballot (PEB) was used with the voting machine. PEBs are electronic devices used to conduct all transactions with the iVotronic touch screen machines, including the action of loading the ballot each voter will see and enabling the voter to vote. PEBs are described at www.srqelections.com/ivotronic/ivotronic.htm (click on #1). In records produced to show all of the transactions on each voting machine, an "invalid vote PEB" error is denoted as event 18. Walter Mebane and David Dill highlight the relationship between this error and variations in the undervote rate at www-personal.umich.edu/~wmebane/smachines1.pdf. The second kind of error is whether any voting machine in a precinct had a power failure. Such an event for a voting machine is indicated by event 36 ("low battery lockout") in the machine's transaction log. Garber observes that voting machines were often not plugged directly into a wall socket to receive power, but daisy-chained, with one machine plugged into another machine. She and Pynchon also point out that low power conditions or power failure may cause a variety of machine performance failures.
Table 1 shows that on election day in Sarasota, the District 13 undervote rate was lowest (13.7%) on machines not subject to either of the two kinds of error, and the undervote rate was highest (15.0%) on machines on which both kinds of error occurred. Having only the invalid vote PEB error on a machine and having only the low battery lockout error on a machine in the same precinct are each associated with an increase of almost 1% in the undervote rate (to 14.6%). These percentage differences are arguably small, relative to the overall undervote rate, but even they are enough to potentially have had a significant impact on the election outcome. If all four categories of votes shown in Table 1 had had the lowest displayed undervote rate, there would have been 202 fewer undervotes—a number about two-thirds of the margin of victory in the election. By presenting Table 1, I do not mean to suggest the undervote problem mostly traces to circumstances unrelated to voting machine performance. Especially in view of the wide range of concerns Pynchon and Garber document, Table 1 should be viewed as expressing a lower bound on the share of the undervotes caused by mechanical failures.
The final message about undervoting in the 2006 election in Florida is that we still don't know precisely what caused the problem. In Sarasota, more than 18,000 votes effectively vanished into thin air, but, across Florida, the number of mysteriously missing votes is several times that number. Without paper ballots to recount and inspect, and barring purely statistical adjustments, it is difficult to know what can be done practically to remedy the situation in a way that inspires everyone's full confidence. The worst fear is that, as bad as they are, the problems we can see are only a small part of what's really wrong.
Further reading can be found in the supplemental material at www.amstat.org/publications/chance. Walter R. Mebane Jr. can be reached at
[email protected].
A Historical Puzzle: de Mere’s Paradox Revisited Roger Pinkham
One of my favorite urban legends is about a food company that sponsored a cooking contest. The woman who won did so by submitting a baked ham. The ham had a ginger ale pineapple glaze, was pierced with whole cloves, and had garlic inserted under the skin. At the award ceremony, the winner was questioned in detail about the preparation of the entree. Someone in the audience asked why she cut an inch from the bone end. The contest winner said she really didn't know; it was her mother's recipe. The woman's mother was still alive, and the next time she visited her mother, she asked about the recipe. In fact, it was her grandmother who originated the recipe. Amongst dusty retired lampshades, old suitcases, and things too dear to throw away in her mother's garage, they found her grandmother's recipe box. On the ham recipe was the instruction, "Cut 1 inch plus from bone end. The pan is too short." (You can find other versions of this legend at www.snopes.com/weddings/newlywed/secret.htm and elsewhere on the web.)
Statisticians, trained to examine data relevant to evaluating statements, would not have waited as long in an analogous statistical situation as the contest winner waited before asking about the recipe detail. Or would they? A case that has puzzled me for years comes to mind, one in which the statistics and probability community has, it seems to me, been cutting an inch off the proverbial bone end for years. Most of us have read some version of the story involving Blaise Pascal, Pierre Fermat, and the Chevalier de Mere, which can be found in Isaac Todhunter's Mathematical Theory of Probability and pages 223 and 224 of Statistics by David Freedman, Robert Pisani, and Roger Purves. In the 1600s, French gamblers bet on whether at least one 'one' would appear when four six-sided dice were rolled. They also bet on whether a pair of 'ones' would appear at least once
when two six-sided dice were rolled 24 times. The Chevalier de Mere was one among probably many who thought these two games had an equal chance of success. The explanation given for de Mere’s belief is as follows:
When throwing one dice [sic], there is a 1-in-6 chance of a 'one' appearing. In four rolls, therefore, there is a 4-in-6 chance to get at least one 'one' value. If a pair of dice is rolled, there is a 1-in-36 chance of two 'ones' resulting. In 24 rolls, the chance of at least one pair of 'ones' is thus 24-in-36. Both equal 2/3. It is claimed that de Mere noticed the outcome in question occurred more often in the first scenario (one die four times) than in the second (two dice 24 times). The apparent difference between probability theory and reality is referred to as the Paradox of Chevalier de Mere.
The supposed thinking of de Mere, repeated over and over in the years since de Mere contacted Pascal, rested on the following two points:

1. Probability [at least one occurrence in n trials] = n x Probability [occurrence in one trial]

2. Experience shows the two chances under consideration are not equal.

Freedman, Pisani, and Purves point out that de Mere's reasoning should have appeared obviously false, because the chance of getting at least one 'one' in six rolls of a six-sided die certainly is not 6/6, or 100%. We know the conclusion—both chances are 2/3—is wrong because elementary computation shows the probability of at least one 'one' in four rolls of a die is 1 − Probability [no 'ones' in four rolls] = 1 − (5/6)⁴ = 0.518, whereas the probability of at least one pair of 'ones' in 24 rolls of two dice is 1 − Probability [no pair of 'ones' in 24 rolls] = 1 − (35/36)²⁴ = 0.491.
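Both probabilities are a one-line check; for instance, in Python:

```python
p_one_die   = 1 - (5 / 6) ** 4      # at least one 'one' in four rolls of one die
p_double_1s = 1 - (35 / 36) ** 24   # at least one pair of 'ones' in 24 rolls of two dice
print(round(p_one_die, 3), round(p_double_1s, 3))   # 0.518 0.491
```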
Photographic reproduction of Philippe de Champaigne’s painting of Blaise Pascal (1623–1662)
Photographic reproduction of painting of Pierre de Fermat (1601–1665)
History will attest to de Mere possessing genuine mathematical talent. Thus it seems to me the reasoning in Point 1 attributed to him is most unlikely. Further, if one considers how many games it would take to actually notice a difference in the chances, then one is likely to doubt the veracity of Point 2. Repeating the story is akin to repeatedly cutting off the bone end when preparing the baked ham recipe. First, let me comment on Point 2. As a young man, much to the distress of my parents, I hung out with the older guys in the local pool room. On occasion, this resulted in my being in on all-night crap games (a dice game) after Sam’s Pool Hall closed. The room was usually noisy and smoke-filled, and the scene was a bit frenetic. Certainly no one was keeping records, least of all the participants. Thus, to me, it seems exceedingly unlikely that de Mere, much less anyone else in such conditions, could have determined the difference between 0.518 and 0.491 on the basis of empirical evidence. The simple question, “How many trials would one have to watch to render reasonable certainty to the two probabilities being different?” can be answered by tools presented in an elementary statistics class. The answer leaves little doubt that the result was not discovered empirically.
To be more precise, suppose one were to proceed by means of simulation. How many trials of each game would be necessary to distinguish 0.518 from 0.491 with reasonable certainty? This can be viewed as a standard problem in hypothesis testing. The null hypothesis is that the probabilities of the two events are equal. The alternative hypothesis is that they do not have equal chances. Two errors can be made. One can decide the chances are not equal when, in fact, they are equivalent (a so-called Type I error). Or one can conclude the chances are equal when they truly differ (a so-called Type II error). The test proceeds by playing both games independently several times. Based on the difference in the proportion of occurrences of the events divided by an estimate of the natural expected uncertainty in the games, one can compute a test statistic. If this statistic is sufficiently far from zero, then the null hypothesis is rejected and the games are said to have different chances. If the statistic is not too far from zero, then one cannot conclude with certainty that a difference exists. In the former, the chances might be equal, but we have seen a very unlikely series of events. In the latter, there might be a very small real difference, but one cannot necessarily discern it above the level of random chance.
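As a rough illustration of the procedure just described, the Python sketch below simulates playing both games and computes the usual pooled two-proportion z statistic; the 5,000 plays per game and the function names are arbitrary choices for the sketch, not the author's.

```python
import random

def game_one():
    """At least one 'one' in four rolls of a single die."""
    return any(random.randint(1, 6) == 1 for _ in range(4))

def game_two():
    """At least one double-'one' in 24 rolls of a pair of dice."""
    return any(random.randint(1, 6) == 1 and random.randint(1, 6) == 1
               for _ in range(24))

def z_statistic(x1, x2, n):
    """Pooled two-sample z statistic for testing equal success probabilities."""
    p1, p2 = x1 / n, x2 / n
    pooled = (x1 + x2) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return (p1 - p2) / se

n = 5000   # games of each type "watched"
wins1 = sum(game_one() for _ in range(n))
wins2 = sum(game_two() for _ in range(n))
print(z_statistic(wins1, wins2, n))   # often, but not always, beyond +/-1.96
```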
Chevalier de Mere, a.k.a. Antoine Gombaud (1607–1684)
Computations of Necessary Number of Trials
Computations use formulas such as those in Section 10.5 of Bernard Rosner's Fundamentals of Biostatistics. If Δ = 0.03 is the expected difference between the chances of success, the two proportions are p1 = 0.52 and p2 = 0.49 (with an average of p̄ = 0.505), a Type I error of 0.05 corresponds to a (two-sided) critical value of 1.960, and a Type II error of 0.05 corresponds to a critical value of 1.645, then one should play each game

n = [√(0.505 × 0.495 × 2) × 1.960 + √(0.52 × 0.48 + 0.49 × 0.51) × 1.645]² / 0.03² = 7216.3

times. One would naturally round up to 7,217 times. If 1.645 is replaced by 1.036 for a Type II error of 0.15, then the value is 4984.6, which rounds to 4,985.
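A minimal Python version of the sidebar's calculation, assuming the standard two-proportion sample-size formula it cites; the function name is ours.

```python
from math import ceil, sqrt

def n_per_game(p1, p2, z_alpha, z_beta):
    """Trials of each game needed to distinguish success probabilities p1 and p2
    (two-sided Type I critical value z_alpha, Type II critical value z_beta)."""
    p_bar = (p1 + p2) / 2
    numerator = (sqrt(2 * p_bar * (1 - p_bar)) * z_alpha
                 + sqrt(p1 * (1 - p1) + p2 * (1 - p2)) * z_beta)
    return (numerator / abs(p1 - p2)) ** 2

print(ceil(n_per_game(0.52, 0.49, 1.960, 1.645)))   # 7217 (Type II error 0.05)
print(ceil(n_per_game(0.52, 0.49, 1.960, 1.036)))   # 4985 (Type II error 0.15)
```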
Stephen Stigler of The University of Chicago, an authority on the history of probability and statistics

Once a few stipulations are made, one can calculate the number of trials of each game one should play to determine if the chances are different with reasonable certainty. If the two proportions are 0.52 and 0.49 in reality and Type I and Type II errors are limited to 5% (0.05), then it is necessary to play each game 7,217 times! See Computations of Necessary Number of Trials for more on the computation. If one is willing to be more relaxed about the Type II errors and accept an error rate of 15% (0.15), then one still needs 4,985 trials under each scenario. Even for a dedicated gambler, keeping record of some 10,000 trials amidst shouts, confusion, and alcohol seems a virtual impossibility. Of course, such feats of human endurance have been known to occur, but we have no direct evidence that de Mere was such a dedicated experimenter.
Armed with this bit of statistics, I submitted an article to CHANCE. What came back was an exceedingly interesting and informative response that addresses Point 1 of de Mere's supposed reasoning. By asking Stephen Stigler of The University of Chicago, an authority on the history of probability and statistics, an editor learned someone had picked up on this anomaly many years before, unbeknownst to most of the statistical community. The mathematician Oystein Ore, in his article "Pascal and the Invention of Probability," makes the obvious arithmetical argument above and locates relevant references in both French and Italian of which I was completely unaware. Ore points out in the following passage what he believes to be the origin of de Mere's reasoning. In the passage below, n0 and n1 are the critical number of throws in each game, with N0 and N1 being the number of equally likely outcomes in the respective games.
"Pascal does not understand de Mere's reasoning, and the passage [from a letter by de Mere quoted by Ore] also has been unintelligible to the biographers of Pascal. However, de Mere bases his objections on an ancient gambling rule which Cardano also made use of. One wants to determine the critical number of throws, that is, the number of throws required to have an even chance for at least one success. If in one case there is one chance out of N0 in a single trial, and in another one chance out of N1, then the ratio of the corresponding critical numbers is as N0 : N1. That is, we have n0 : N0 = n1 : N1. This immediately gives the proportion stated by de Mere."
In the games under consideration here, N0 = 6 for the game of trying to get a 'one' on a single six-sided die and N1 = 36 for the game focused on getting a pair of 'ones' on two six-sided dice. Thus, in de Mere's case, N1 = 6N0. The rule Cardano used gives the equality n1 = 6n0. Since we are throwing four dice in the first version of the game, n0 = 4. Cardano's rule implies then that n1 should equal 24 to have the same chance of a success. Our previous calculations show the chances are not the same, and this means Cardano's rule is incorrect.
If one reads Todhunter's version carefully, one will see that what de Mere may have been saying is he could compute the odds to be approximately 491:509 (491 to 509 against) for at least one pair of 'ones' in 24 rolls of a pair of six-sided dice. See Computations for Odds for computation details. Cardano's rule, however, says they should have been more like 518:482, which they are in the simple case of at least one 'one' in four rolls of a single die. Perhaps he was writing to Pascal to
question the validity of the rules of probability theory as then understood. To me, this seems a more satisfying and rational explanation for the few facts we possess. It does not make de Mere seem so simplistic after all. Having read Ore's discussion of the matter, I am more convinced than ever that de Mere should not be hastily maligned. I no longer feel as if my approach to this problem is akin to using a baked ham recipe appropriate for a short pan. One has here a statistical question involving real people and very real controversy that would make an admirable homework question for an introductory class. Ore's article could be assigned as delicious collateral reading.

Computations for Odds
If p is the probability of an event, then p/(1 − p) is called the odds for the event. A statement a:b means that the probability of the event happening is a/(a + b) and the probability that it does not happen is b/(a + b). It also means the odds for the event is a/b. If you want the sum a + b to equal 1000, then set a/(1000 − a) equal to p/(1 − p) and solve for a. The solution is a = 1000p, rounded to be an integer value. A probability of 0.491 therefore has odds of 491:509, whereas a probability of 0.518 has odds of 518:482.
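The sidebar's conversion amounts to a couple of lines of Python:

```python
def odds_out_of_1000(p):
    """Express probability p as odds a:b with a + b = 1000."""
    a = round(1000 * p)
    return a, 1000 - a

print(odds_out_of_1000(0.491), odds_out_of_1000(0.518))   # (491, 509) (518, 482)
```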
Further Reading
Freedman, D., Pisani, R., and Purves, R. (1980). Statistics. New York: Norton.
Ore, O. (1960). "Pascal and the Invention of Probability." American Mathematical Monthly, 67(5):409–419.
Rosner, B. (2000). Fundamentals of Biostatistics, 5th ed. Pacific Grove, CA: Brooks/Cole.
Stigler, S.M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, Massachusetts: Belknap Press of Harvard University Press.
Todhunter, I. (1949). A History of the Mathematical Theory of Probability. New York: Chelsea.
Compensation for Statistical Consulting Services: Approaches Used at Four U.S. Universities
H. Dean Johnson
Four speakers from a session on statistical consulting at the 2006 Joint Statistical Meetings were asked to provide descriptions of the statistical consulting services offered at the universities they represent. The descriptions provided are presented below.
WASHINGTON UNIVERSITY SCHOOL OF MEDICINE Sarah Boslaugh is a performance review analyst for BJC HealthCare in Saint Louis, Missouri, and an adjunct faculty member at the Washington University School of Medicine, where she co-teaches a two-semester course in statistics for the health sciences. She recently edited The Encyclopedia of Epidemiology for Sage, has published books on SPSS programming and secondary data analysis, and is currently writing a book about basic statistics for O’Reilly.
Washington University School of Medicine (WUSM) is a large, prestigious, and research-oriented medical school. It was ranked among the top five medical schools in the United States in 2007 by U.S. News & World Report. WUSM has more than 1,000 students (~600 in MD programs), more than 8,000 employees (more than 20,000 when including affiliated institutions such as Barnes Jewish Hospital and the Siteman Cancer Center), and more than 1,600 faculty members. The school had clinical revenues of $450 million in 2005 and held grants and contracts worth $465 million in fiscal year 2004–2005. There were 33 applicants for each place in the 2005–2006 medical school class, and tuition was $41,910 in 2006–2007. The Department of Pediatrics of the Washington University School of Medicine had about 130 faculty members in 2006, plus numerous residents, fellows, and medical students affiliated with the department. From 2004–2006, one statistician was employed in the Department of Pediatrics to provide biostatistical consulting. Previously, these needs were met by contracting with members of the Department
of Biostatistics within Washington University or outside statisticians. However, those in the department thought they needed more access to a statistician for which they would not have to pay directly and voted in 2004 to fund a statistical position. About 25% of the statistician's salary was funded by grants in 2005–2006, which functioned as salary buy-back (i.e., the money was paid to the department to reimburse them for services provided). There was no intention of allowing the position to become fully grant-funded because then the statistician would be working only for the people on whose grants he/she was included and the rest of the department would lose access. The department also provided the statistician travel and development funds at a level similar to a faculty member. Anyone who wanted biostatistical assistance contacted the statistician directly, and the statistician scheduled a time to meet. People requesting assistance were asked to fill out a basic form describing the project, deadlines, services required, approximate time required, whether grant support was available, and if the statistician would be included on future publications. The latter proved to be a frequent source of contention, and the consultant's supervisor had to be called in several times to mediate authorship disputes. After a few such experiences, the consultant made it a policy to get the authorship agreement settled before beginning work. Advantages for the department included total access to the statistician, including basic assistance for students just beginning to do research. Another advantage was that those in the department could ask any kind of question, from "How do I put data into an electronic file?" on up. Unfortunately, this often led to the expectation that the statistician could answer any question, not understanding that modern statistics is as specialized as medicine, and as no physician is equally knowledgeable in clinical emergency medicine, oncology, genomics, and developmental biology, no single statistician is an expert in every branch of statistics.
Advantages for the statistician included having a staff position funded by hard money, freedom from the tenure review process, and access to university resources. Disadvantages included the lack of a peer group and established career path within the department, the denigration of staff members
versus faculty within the medical school, the need to constantly advocate for authorship credit, and existing outside the social and hierarchical structures of the school (i.e., not being part of a lab, and therefore not having a principal investigator to look out for the statistician’s interests).
UNIVERSITY OF WISCONSIN-MADISON Murray Clayton earned a bachelor of mathematics degree in statistics and pure mathematics from the University of Waterloo, Ontario, in 1979 and a PhD in statistics from the University of Minnesota in 1983. Clayton's research interests include the application of statistics to the agricultural, biological, and environmental sciences, focusing especially on spatial statistics. He is the author of more than 100 publications in the statistical and scientific literature.
The primary statistical consulting groups at the University of Wisconsin-Madison owe much to a vision developed by George Box. When Box first started Wisconsin's Department of Statistics, his view was that the department should have a number of jointly appointed faculty members who actively collaborate with researchers from other disciplines. Today, roughly half of the faculty members in statistics have joint appointments in other programs. The majority of these are in one of two consulting groups: biostatistics, comprising faculty with joint appointments in the medical school, and biometry, comprising faculty jointly appointed in the College of Agricultural and Life Sciences (CALS) and the College of Letters and Science (L&S) and focusing principally on biology not involving human medicine. Here, I primarily discuss the biometry group, whose funding has seen major changes in the last few years. In biometry, there are five faculty members: two jointly appointed in botany and one each jointly appointed in horticulture, plant pathology, and soil science. These appointments are split 50/50 and carry joint tenure in the two departmental homes. There is an expectation that faculty members will routinely engage in consulting and collaborative research, and, accordingly, they have a reduced teaching load. Put another way, they carry 50% of the usual teaching load in statistics, and their "teaching" in their other department corresponds to their consulting and collaborative research. In addition to the faculty, the facility also employs two full-time staff members who focus on computing support, a manager with an MS in statistics, and two graduate students. The faculty and computing staff members are funded entirely (on 12-month appointments) through CALS and L&S and are not required to generate their own salaries. This reflects college commitment to the program as a value-added research resource. The funding for the manager and graduate students is more complex. All members of the 20 departments in CALS and all members of botany and zoology (which are housed in L&S) have free,
unlimited access to the biometry facility. Users of the facility receive advice on grant preparation, study design, data collection, data analysis, and writing, as well as access to computing and computing advice. The focus is on collaborative projects, not “service,” and, in particular, members of the facility do not typically perform the actual data analyses (no more than a principal investigator would perform lab assays). Meetings are conducted primarily on an appointment basis, and drop-in visits are infrequent. In total, about 800 to 1,000 meetings take place per year, and the success of the facility is due in large part to strong college and campus appreciation of collaborative research and the value of statistics. Despite college commitment, two major changes occurred a few years ago. Until recently, the statistics staff consisted of four graduate assistants and five faculty members. Major budget cuts in CALS meant a considerable loss of funding and, although the faculty lines were secure, there remained only enough funds for one graduate student and 50% of the salary of a manager. To fund the remainder of the manager’s salary, and to hire at least one other graduate student, it became necessary to generate funds outside of those provided by the colleges. To construct a new funding model, a few important principles were articulated. It was thought that charging a fee on an hourly basis would be contrary to the spirit of engaging in collaborative research (i.e., an hourly fee can lead to “clock watching,” but good science requires a more unfettered approach). Also, typical collaborators in CALS and L&S are not generally supported by grants that include salary money. Therefore, there was a decision to not employ a model often used in biostatistics, in which NIH funding is common and includes salary money directly. With these ideas in mind, a simple fee structure was constructed: A principal investigator and their associated laboratory/program staff could have access to the facility for an annual fee of $3,000. Someone so enfranchised would receive exactly the same as those included for free (i.e., CALS, botany, and zoology), namely advice on statistics and computing and access to computing resources. The entire School of Veterinary Medicine wanted access to the facility, and, for them, $30,000 per year would be charged (the cost of one graduate assistant). Monitoring “user statistics” suggests no abuse of this funding approach. Of course, an important advantage of this new model is that extra income is generated, which is used to hire graduate assistants, supplement the manager’s salary, and cover occasional expenses. It also permits working with groups that would otherwise not be enfranchised. However, open questions remain: In the future, will we be forced to generate more and more of our own income? Should we extend our model to parties outside the university walls? What best serves the research interests of the campus? What best serves our needs? We are on unexplored territory, and it remains to be seen how our operation will function and flourish in the coming years.
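The arithmetic of the subscription model is simple enough to sketch. In the minimal example below, the per-investigator fee, the School of Veterinary Medicine fee, and the equivalence of $30,000 with one graduate assistant come from the description above; the number of subscribing investigators is purely hypothetical.

```python
# Rough annual-income sketch for the biometry facility's subscription fee model.
# Fee amounts are from the article; the count of subscribing PIs is invented.

PER_PI_FEE = 3_000            # annual fee for one PI and associated lab/program staff
VET_SCHOOL_FEE = 30_000       # flat annual fee for the School of Veterinary Medicine
GRAD_ASSISTANT_COST = 30_000  # the article equates the vet school fee with one assistant

def annual_fee_income(n_subscribing_pis: int, vet_school_subscribes: bool) -> int:
    """Total outside income generated by the subscription model."""
    income = n_subscribing_pis * PER_PI_FEE
    if vet_school_subscribes:
        income += VET_SCHOOL_FEE
    return income

if __name__ == "__main__":
    income = annual_fee_income(n_subscribing_pis=10, vet_school_subscribes=True)
    print(f"Income: ${income:,}")                                        # $60,000
    print(f"Graduate assistants covered: {income // GRAD_ASSISTANT_COST}")  # 2
```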
THE OHIO STATE UNIVERSITY Chris Holloman is the current director of The Ohio State University’s statistical consulting service. He obtained his PhD in statistics from Duke University in 2002. Since then, he has worked in a consulting capacity at a contract research firm, a major national bank, and a university. In his current position, his primary responsibilities include mentoring graduate students and assisting university researchers and external clients in the planning, development, execution, and analysis of scientific experiments.
At The Ohio State University, the Statistical Consulting Service (SCS) has been in existence for more than 30 years. During this time, the organizational and compensation structures have varied. In most cases, past directors of the SCS were faculty members granted release time from teaching in exchange for managing a small group of graduate student consultants. These graduate students would provide consulting support to the wider university community and occasionally to researchers external to the university. Three years ago, the SCS was reorganized under a new management structure with a new philosophy—the SCS would be run like a private consulting business. To create this new structure, Thomas Bishop, a consultant who managed his own private statistical consulting business for more than 20 years, was hired to manage the SCS. He brought with him the operations manager from his private firm to assist with billing and project organization. Additional SCS staff members have been added over the last two years. These include an associate director and five graduate students who work as practicing consultants in the SCS for one year. The basic compensation strategy used by the SCS is to charge clients an hourly rate for work performed by SCS consultants. In its original form, all external (non-university) clients were charged the hourly rate directly. In contrast, Ohio State graduate students and faculty members were not billed directly for these services, although the hours spent working for them were recorded. During the first two years of operation, the College of Mathematical and Physical Sciences (MAPS) reimbursed the Department of Statistics for these consulting services. This original funding strategy was never intended to last indefinitely. The College of MAPS only agreed to subsidize the consulting service long enough to prove the worth of the SCS to other colleges; the long-term strategy would be to charge other colleges the same hourly rates being charged to external clients. Unfortunately, an agreement between the deans of the various colleges could not be reached, so the College of MAPS was faced with either eliminating funding for Ohio State consulting projects or committing a revenue stream to maintaining the SCS. Due to budgetary constraints, the College of MAPS was forced to eliminate the subsidization of consulting for Ohio State researchers.
The SCS has recently developed a system to reinstate consulting services to graduate students within the university. For graduate students seeking help from the SCS, a new independent study course was created. When students sign up for this course, they are provided access to a graduate student consultant within the SCS. They can work with this consultant for up to 20 hours during the quarter, and no other work (e.g., attending lectures, tests, homework) is required for credit. This strategy returns compensation to the SCS based on the Ohio State budget system, as each department receives funding based on the number of students enrolled in classes offered by that department. By enrolling in the independent studies class, graduate students receive the benefit of the educational opportunity to work with an SCS consultant on their research project and the SCS is paid indirectly through the university budget system. This new compensation system seems to be working well. Ohio State graduate students and their advisors are pleased to have SCS services reinstated. During the winter quarter of 2007, the SCS provided consulting support to 50 graduate students across the university—enough to support five graduate student research assistants in the Department of Statistics. There are, however, several disadvantages to this strategy. We are limited in the speed with which we can address client needs, as students must enroll during the quarter before the one in which they need consultation. Also, the students working in the SCS have little opportunity to work on external or faculty projects, since their time is consumed by student projects. Although the independent study class provides a solution to the funding difficulties for graduate student projects, it does not provide a solution to the lack of funding for faculty projects. Since the College of MAPS ceased subsidization, the number of projects performed for faculty members in the university has declined dramatically. Currently, faculty members have two options for funding projects: they can pay for the consulting directly using grant or departmental funds or they can work with the SCS to include statistical consulting as a direct cost on grant applications. Although this funding strategy provides direct compensation to the SCS, it allows access only for those faculty members with grant or departmental funding. Ultimately, the SCS hopes to obtain central university funding for the services provided to university researchers. Since the SCS provides a service that benefits the entire university, it makes the most sense for funding to be provided at a higher level than the colleges. Current marketing efforts are focusing on securing this source of funding. For clients external to the university, the SCS continues to charge hourly rates. The SCS does not charge to meet with clients initially or to write formal proposals for projects. Once a proposal has been accepted, clients are simply invoiced monthly for the hours of consulting support provided. This compensation strategy sets up a very clear business relationship between the external client and the SCS, so both parties know what to expect and compensation is not disputed. Formally, all of the compensation provided to the SCS and its consultants is monetary. Since the SCS is an earnings unit within the university, it is responsible for covering its own
overhead, including salaries, benefits, office space, equipment, and supplies. In contrast to other university consulting services with faculty in managerial positions, publications do not factor strongly into the success of the SCS. As a result, coauthorship
on papers is not considered compensation for consulting services provided. Nonetheless, SCS consultants are occasionally invited to coauthor papers when they have made a substantial intellectual contribution to the research.
UNIVERSITY OF FLORIDA Linda J. Young is a professor of statistics at the University of Florida, where she teaches, consults, and conducts research on statistical methods for studies in agricultural, environmental, ecological, and public health settings. Young has been the editor of the Journal of Agricultural, Biological, and Environmental Statistics. She is currently associate editor for Biometrics and Sequential Analysis. Young has served in various offices within the professional statistical societies, including the Eastern North American Region of the International Biometric Society, American Statistical Association, Committee of Presidents of Statistical Societies, and National Institute of Statistical Science. Young is a Fellow of the American Statistical Association and an elected member of the International Statistical Institute.
The University of Florida has groups of statisticians in four colleges: College of Liberal Arts and Sciences (CLAS), Institute of Food and Agricultural Sciences (IFAS), College of Public Health and Health Professions (PHHP), and College of Medicine (COM). The groups in CLAS and IFAS form one department with a single department chair. Although administratively separate, the statisticians in PHHP have strong ties to CLAS and IFAS statisticians, largely through common interests in the academic programs in statistics and biostatistics. Statisticians in COM collaborate with the researchers in the Health Sciences Center. Because the missions differ from group to group, the approach to funding statistical consulting also differs. In CLAS, there is no core funding for statistical consulting. Yet, the faculty members believe it is important to provide this service to college researchers. This is accomplished largely through a consulting course for statistics students who have had at least a year of graduate course work. Some CLAS statisticians have designated consulting hours. CLAS faculty and students may make a consulting appointment by contacting the statistics office. Students in the consulting class are notified of each appointment and are to attend at least one per week. In addition, each week, students, working in pairs, have walk-in hours for people with quick consulting questions. The consulting class meets weekly to discuss consulting projects and topics relevant to consulting. The CLAS approach to consulting provides free consulting to CLAS students and faculty. Statistics students gain
consulting experience, and faculty members who participate get a course reduction. However, consulting is available only in the fall and spring semesters. Although some students take the course more than once, most are new each semester so there is little continuity in the consulting support. Further, it is difficult to develop good collaborative relationships in this setting. Statistical consulting in IFAS is in transition. Historically, core funding has been provided for faculty, a master’s-level statistician, and students to provide consulting to IFAS faculty and students. The current statistical consulting atmosphere is quite different. Research assistantships are no longer funded. Faculty members are strongly encouraged to avoid service consulting and to focus on developing collaborative relationships with fellow scientists. The service consulting is largely handled by the master’s-level statistician. Students from the consulting class may also work with these faculty members and students. This new model has caused frustration because IFAS faculty and students are accustomed to having core statistical support, and it is taking some time for them to consider having a statistician as a co-principal investigator on their proposals. Because statistics faculty members increasingly depend on grant support for at least a portion of their salaries, they are likely to spend much less time on service consulting. Instead, the focus is on sustained consulting that can lead to collaborative efforts. The master’s-level statistician has a heavy consulting load. Because there is so much consulting—both service and collaborative—within this unit, it is a rich learning environment for the consulting students. Both PHHP and COM are in the Health Sciences Center. Most of the statistical consulting is fee-based because there is no core support. The only exception is that some students in the consulting class often opt to gain experience by holding their walk-in hours in PHHP. This is extremely popular with health science faculty and staff because it is the only free consulting. In both colleges, the fee-based consulting is conducted by a mixture of faculty and master’s-level statisticians. The model used in the Health Sciences Center leads to self-supporting units. Because it is clear that statistical consulting is not free, researchers often think to include statistical support in grant proposals. They also make an effort to establish collaborative ties so the statistician can acquire an understanding of the science and subsequently make innovative contributions to the research. The consulting students gain experience in the health sciences. However, some faculty members and students do not have funding for the services, making it difficult, if not impossible, to obtain appropriate statistical support. Even those who do have funding find the process somewhat impersonal.
Clearly, at the University of Florida, there are four colleges with different approaches to funding statistical consulting. IFAS provides the strongest statistical support for faculty and students. This may change if core funding for service consulting is not maintained and creative solutions to replacing that funding are not identified. The other colleges do not provide sustained, quality consulting for faculty members and students unless they are able to pay for the services. The challenge is to discover ways to provide statistical consulting more broadly without compromising the ability of the statisticians to fulfill other aspects of their appointments.

Statistical Consultants and Funding

Many fields require the use of statistics for research. As a result, most faculty and graduate students in fields that require the use of statistical techniques for analyzing data have either taken or will take one or more statistics courses. Unfortunately, they often do not achieve the level of statistical expertise required by research journals today. In such instances, statistical consultants are often called upon to assist persons with statistical issues related to their research. Consultants, by definition, are persons who provide expert advice. Examples include the physician who provides medical advice, the lawyer who provides legal advice, and, lo and behold, the statistician who provides consultation on statistical methodologies. For the advice provided, the consultant receives compensation from the client. A physician or lawyer receives a fee paid by the client or a third party (e.g., a health insurance company or the losing side in a lawsuit). For statistical consultants, however, especially those working in a university setting, compensation can come in various forms and may not necessarily come from the client. As one example, instead of receiving money, a statistics professor might be promised authorship on any publications resulting from the collaboration. As another example, consultants working in a statistical consulting center might receive their compensation through university funding, and not directly through the clients (i.e., professors and graduate students), who receive consulting free of charge. Regardless of the form in which compensation comes, it is important that appropriate rewards for consulting services are provided. If proper compensation is not provided, statistical consulting will cease to exist in universities. An important implication of this is that researchers might use inappropriate statistical techniques to analyze their data, calling into question the results reported in research journals. Another important implication is that statisticians would be at risk of being left out of the main branches of science. Given these implications, one cannot overstate the importance of providing proper compensation for statistical consulting services.

Compensation Comes in Many Forms

In a university setting, statistical consultants play the important role of helping graduate students and faculty members with statistical issues related to their research. They help researchers with the design of their studies. They provide consultation on statistical methodology, and they also can be involved in the data analysis. Research done at colleges and universities is greatly enhanced through the collaboration of statistical consultants and researchers. It is essential, however, for statisticians to be compensated for the services they provide. Here, several forms of compensation have been mentioned, including compensation through university funding, coauthorship on publications, and hourly fees. Having universities fund statistical consulting seems to be the most logical choice. Universities benefit greatly from the research produced through the collaboration of researchers and statistical consultants, and, as such, they should be responsible for covering the costs of the consulting services provided. Unfortunately, administrators are not always aware of the importance of statistical consulting, and, in these cases, it is the job of statistics departments to convince them. In periods of budget cuts, this task can be difficult. In cases where the collaboration of the consultant and researcher(s) leads to a publication, coauthorship can be granted to the consultant as a means of compensation. To avoid conflict at later stages of publication, it is important for all parties to reach a clear agreement on coauthorship during the initial stages of collaboration. In the absence of university support and coauthorship, an hourly fee can be paid to the consultant by the client as a means of compensation. For help in deciding what fee to charge clients, one can refer to the Spring 2006 edition of The Statistical Consultant, published by the American Statistical Association (www.amstat.org/sections/cnsl). One problem with charging the client an hourly fee is that some clients are in a better position than others to make the payment. For example, a person with a grant could set aside money from the grant to pay for the services, whereas a person without a grant would not have this luxury. The quality of research conducted at colleges and universities is greatly improved through collaboration of statistical consultants with researchers in other fields, and, thus, it is essential for these collaborations to continue. They will not continue, however, if statistical consultants are not appropriately compensated for the services they provide. As stated earlier, “It is clear statistical consulting is not free.”
Further Reading
Cabrera, J., and McDougall, A. (2002). Statistical Consulting. Springer-Verlag.
Derr, J.A. (2000). Statistical Consulting: A Guide to Effective Communication. Duxbury Press.
Sahai, H., and Khurshid, A. (1999). “A Bibliography on Statistical Consulting and Training.” Journal of Official Statistics, 15(4):587–629.
Compensation for Statistical Consulting Services: Observations and Recommendations Thomas A. Louis
Thomas A. Louis is a professor of biostatistics at Johns Hopkins Bloomberg School of Public Health. He is an elected member of the International Statistical Institute and a Fellow of the American Statistical Association and American Association for the Advancement of Science. From 2000 through 2003, he was coordinating editor of the Journal of the American Statistical Association. Louis has served as president of the Eastern North American Region of the International Biometric Society (IBS) and president of the IBS. He has published more than 200 articles, books/chapters, monographs, and discussions. Louis dedicates this article to colleague, advisor, and friend Jim Boen. He highly recommends Jim’s book with Doug Zahn, The Human Side of Statistical Consulting.
In “Compensation for Statistical Consulting Services,” H. Dean Johnson, Sarah Boslaugh, Murray Clayton, Christopher Holloman, and Linda Young describe and evaluate several roles and compensation schemes for statistical consultants in a university setting. Each has its roots in local traditions, goals, and realities. Each setup both shapes and reflects the role and standing of the consultant; each has its benefits and drawbacks. All underscore that achieving and maintaining scientific and organizational success in a university setting is challenging. Rather than evaluate each setup, I offer the following sociological and management observations and recommendations. Most recommendations are quite hard-nosed, and some may be controversial. However, even partial fulfillment will enhance the status and effectiveness of university consulting centers and improve the “compensation” of participating faculty and staff members and students. Communicating and implementing them increases the likelihood of fruitful, satisfying, and long-term relations, as well as helping to recruit and retain high-quality statisticians.
Observations and Recommendations
Compensation entails far more than financial support through salary and benefits. Of course, financial compensation commensurate with skills and productivity is centrally important, but, in a university setting, salary is unlikely to be competitive with that available through a private consulting firm or other industrial employment. Therefore, other forms of compensation are vitally important. These include the opportunity to work with university researchers and students on important clinical, scientific, and policy studies; to educate students and consultees/collaborators; to upgrade the level of statistical and scientific practice; to develop professionally through participation in conferences and short courses; to engage in individually initiated research; to enjoy the status conferred by university employment; and to have the opportunity for career advancement. These last require job titles to be sufficiently prestigious and job expectations to line up with university requirements for advancement—that consulting center appointments are “structured for success.”

Budgets and Billing
Consultees and their administrative units need to know the anticipated and actual cost of a consulting project in dollars, personnel time, and other inputs. Therefore, clear and realistic budgets need to be prepared and monthly bills issued that list current charges and past due amounts. Budget preparation and communication document the scope and needs of a project, itself a benefit. Even if the consulting is not fee for service, budgets and bills should be prepared. The charges can be forgiven, but it is important to document what was required to produce “Table 1” or one t-test. Such documentation has the added benefit of providing important information for negotiating the next round of funding and space. Charges should include all it takes to run the operation. Hourly or daily charges should include salary and fringe (with different rates for different training and experience), overhead, equipment depreciation, and staff professional development (e.g., conferences, short courses, books). Include the cost of training students. Most universities require that a service center break even and not make a profit, but most allow rates to include these expenses.

Billing by the Hour: While it is true that keeping track of hours can influence a consultant’s behavior, it is precisely what consulting requires. In recording and reporting hours, we should be honest, but not shy. For example, if productive time is spent thinking about a problem while commuting, the time should be billed. For projects of moderate or large expense and duration, consider charging a daily, rather than an hourly, rate. This arrangement better reflects the scientific realities and reduces, but does not eliminate, clock watching.

University Realities: The charges a university allows for short-term consulting, even with a “fully loaded” rate, generally are not sufficient to maintain a consulting operation. Usually, some form of subsidy is needed from a dean, a department, or, indeed, consulting center staff members. One solution is to have the center’s portfolio include both within-university and outside projects charged at higher rates. Care is needed here, because the outside projects are financially important for covering operating expenses, but too much time on them takes away from the primary mission of the consulting unit.
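As a rough illustration of how a “fully loaded” hourly or daily rate might be assembled from the cost components listed above, here is a minimal sketch; every salary figure, rate, and billable-hour count in it is hypothetical rather than taken from the article.

```python
# Illustrative "fully loaded" break-even hourly rate built from the components
# named above: salary and fringe, overhead, equipment depreciation, and staff
# professional development. All numbers are hypothetical.

def fully_loaded_hourly_rate(
    annual_salary: float,
    fringe_rate: float = 0.30,            # fringe benefits as a fraction of salary
    overhead_rate: float = 0.25,          # institutional overhead on compensation
    annual_fixed_costs: float = 6_000.0,  # depreciation, software, conferences, books
    billable_hours_per_year: float = 1_200.0,
) -> float:
    """Break-even hourly rate for one consultant (no profit, per service-center rules)."""
    compensation = annual_salary * (1.0 + fringe_rate)
    total_cost = compensation * (1.0 + overhead_rate) + annual_fixed_costs
    return total_cost / billable_hours_per_year

if __name__ == "__main__":
    for salary in (60_000, 90_000):  # e.g., MS-level vs. PhD-level consultant
        rate = fully_loaded_hourly_rate(salary)
        print(f"salary ${salary:,}: ~${rate:,.0f}/hour, ~${8 * rate:,.0f}/day")
```

Different rates for different levels of training and experience fall out naturally, and multiplying by a working day gives the daily rate Louis suggests for larger projects.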
The Two-Way Street As Lincoln Moses and I communicate in our article, “Statistical Consulting in Clinical Research: The Two-Way Street,” effective consultations and collaborations require statisticians to understand the subject-area science, goals, and constraints. Similarly, consultees and collaborators should be expected to understand at least basic statistical concepts and methods—to have a basic education in statistical science. Almost everything
requires collaborative input; almost everything has a statistical component; almost nothing is purely statistical. Joint understanding and ownership of the statistical and other scientific issues is important. For example, statisticians should not have to shoulder all the blame for needing to increase a sample size by a factor of four to reduce confidence interval width by a factor of two. Similarly, statisticians need to take some responsibility for improving the science, especially with regard to measurement issues and framing research aims.
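The sample-size arithmetic behind that example is worth making explicit; the display below assumes a standard large-sample confidence interval for a mean, which is not stated in the article but serves as an illustration.

```latex
% Large-sample confidence-interval width and its dependence on the sample size n.
\[
\text{width}(n) \;=\; 2\, z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}
\qquad\Longrightarrow\qquad
\frac{\text{width}(kn)}{\text{width}(n)} \;=\; \frac{1}{\sqrt{k}},
\]
% so halving the width requires $1/\sqrt{k} = 1/2$, i.e., $k = 4$: a fourfold
% increase in sample size, exactly as in the example above.
```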
Rules of Engagement We need to avoid agreeing to consulting/collaborative arrangements and roles that are not conducive to success. Doing so lowers our status and effectiveness. We owe it to ourselves, to our colleagues, to the project, and to our discipline to insist on proper relations, funding processes, and roles.
Staffing and Roles Insist on proper staffing and involvement in most “nonstatistical” aspects. Ensure clarity about who does what and when (deliverables and timing). Who does data entry and data cleaning? Who runs statistical analyses, and are they up to the task? If some are to be conducted by the consultee’s staff, capabilities (e.g., staffing, hardware, software) must be assessed, as must the cost of working with staff members outside the statistical center. If an experienced master’s-level person can handle the job, don’t have a PhD do it. If clients complain that they want the PhD, discuss the additional cost and, possibly, the delay in finding an available consultant. We need to insist that the project leader attend at least the first few meetings. A lack of interest or time commitment from the leader practically ensures frustration and unsatisfactory outcomes. We should use the commitment of the leaders as an upper bound for our interests: “I won’t be any more interested in this project than you are.”
Authorship Discuss rules for authorship up front, including the right to refuse authorship or acknowledgement. Determine who will participate in writing reports and presentations and what the review process for these will be. Generally, a statistician should review all such communications. Authorship should never—absolutely never—be used for or considered compensation. “Authorship or payment” is a demeaning, unethical, and destructive dichotomy. Authorship must be based on scientific and other intellectual contributions; financial compensation should be based on effort. There is no better way to ensure that individuals and our profession are perceived and treated as second-class citizens—as necessary disposables—than to agree to authorship as payment.
Final Thoughts Johnson, Boslaugh, Clayton, Holloman, and Young report on effective models for organizing and operating statistical consulting services. Operating a successful consulting enterprise is challenging, especially in a university setting, but meeting this challenge is in itself satisfying. My observations and recommendations will have varying degrees of relevance and viability in different settings and so need to be adapted and tuned. Attention to some version of them will increase the likelihood of organizational, scientific, educational, and personal success. They communicate and implement roles and procedures of respected professionals. Indeed, the respect accorded to statisticians and statistical science by others will be no greater than the respect we accord ourselves and our profession.
Further Reading
Boen, J.R., and Zahn, D.A. (1982). The Human Side of Statistical Consulting. Wadsworth Publishing Company.
Moses, L., and Louis, T.A. (1984). “Statistical Consulting in Clinical Research: The Two-Way Street.” Statistics in Medicine, 3:1–5.
Statistical Consulting in a University Setting: Don’t Forget the Students Janice Derr
Janice Derr, who has a PhD in biology and a master’s degree in statistics, is employed as a statistician with the U.S. Food and Drug Administration. She joined the FDA in 1998, after working as managing director of the Statistical Consulting Center at Penn State University. While at Penn State, she participated in research teams and taught statistical consulting at the graduate level. Her book/video, Statistical Consulting: A Guide to Effective Communication, is a result of these experiences. There is a certain excitement in collaborative research projects that students can only experience by participating in them. In my opinion, programs in statistics can and should foster these opportunities. When I read “Compensation for Statistical Consulting Services,” I grew concerned about the impact today’s lean economy is having on the quality of opportunities for students in statistics and other quantitative disciplines. These research experiences have a value to statistics students that is not captured solely by a dollars-for-hours calculation. A statistical consulting unit is a good place to connect statistics students with opportunities to participate in actual, ongoing research projects. In my experience running the statistical consulting program at Penn State University, investigators who involve their own students in their research projects are often open to collaboration with statistics faculty and their students. The logistics of doing this effectively will vary in different academic settings and have been discussed in other articles. Here are descriptions of two of the many rich examples from my experiences at Penn State.
The “PSU-Mouse” project was funded partly by a research grant to Roy Hammerstedt of Penn State’s Department of Biochemistry. An aim of this grant was to use the techniques of response surface methodology to identify a set of conditions that would most successfully freeze and preserve the sperm of genetically engineered mice. Hammerstedt was instrumental in involving statistics students with the students in his biochemistry lab. Brenda Gaydos, a doctoral student in statistics, worked on the response surface design and analysis of PSU-Mouse studies. She also supervised an undergraduate statistics student who developed a Minitab training session on response surface graphics. This collaboration led to a road trip to Washington, DC. The entire PSU-Mouse team presented their findings to scientists at the National Zoo who were interested in preserving the sperm of endangered species. Everyone participated in the hands-on training session. In the category of unanticipated benefits was an excellent tour of the Zoo by Zoo scientists for the PSU-Mouse team and accompanying family members. The undergraduate student on the PSU-Mouse project obtained a bachelor’s and then a master’s degree in statistics and accepted a position with a pharmaceutical company. Gaydos earned her PhD in statistics and is now a statistician at Eli Lilly and Company. At Lilly, she is part of a small group of senior statisticians whose objective is to improve the efficiency and quality of all phases of clinical research by providing scientific support in statistical methods applications and development. Students who work together in multidisciplinary project teams gain valuable experience in the creative and problem-solving areas of research. One such team of students worked on a study of tethered human exercise in simulated microgravity. This team consisted of undergraduate and graduate students in biobehavioral health, statistics, exercise science, and mechanical engineering. Peter Cavanagh, who at the time of the study was director of Penn State’s Center for Locomotion Studies, initiated this project as part of a broader research program. The students worked together to design the study, recruit and test the subjects, and analyze and report the data. The project involved evaluating an exercise treadmill for use in the zero-gravity environment of space flight. To simulate zero gravity, the treadmill was mounted on a wall, and subjects were suspended by bungee cords from the ceiling and tethered to the treadmill (Figure 1). The graduate student in statistics, Sandy Balkin, volunteered to experience the treadmill first-hand, so he could understand what the sequence of walking and running exercises would be like for a subject in the study. The team appreciated Balkin’s efforts, especially because he used the entire range of the “discomfort” scale to report on his sensations; the other subjects, who were in the Reserve Officers’ Training Corps, chose not to use the upper (more discomfort) end of the scale. Balkin also participated in the preparation of a manuscript that was published in the leading aviation and space medicine journal. Following is what Balkin and lead student Jean McCrory said about the value of this collaborative experience: “This experience allowed me to see what a consulting statistician actually does. I had to meet with other team members
Figure 1. Tethered human exercise in simulated microgravity. A student tries out the wall-mounted treadmill at Penn State’s Center for Locomotion Studies. Photo courtesy of Penn State University
from other disciplines, understand their aspects of the project, and be able to explain the statistical aspects to them. I had to learn enough about the mechanics of running on a treadmill to be able to hold and understand conversations concerning what was expected to come out of each experiment and what form the information was going to be in.” Balkin, a master’s student in statistics at the time of the project, also obtained a doctorate in management science and information systems and is now a health care investment analyst. “This was my first exposure to conducting research with an interdisciplinary cadre of colleagues … Regardless of academic experience, each student taught many fundamental principles of his/her major discipline to the other students on the grant. Also, because each student was considered the ‘expert’ in his/her discipline, that student felt a certain responsibility that his/her contributions to the research be correct.” McCrory, a doctoral student in biobehavioral health at the time of this project, earned her PhD and is now on the faculty of the Health and Physical Activity Department at the University of Pittsburgh. Statistics students emphasize their consulting experiences in their resumes and often are asked to describe this aspect of their training during job interviews. For this reason, I am glad the authors of “Compensation for Statistical Consulting Services” and their colleagues found creative strategies for providing some level of statistical consulting activity on campus, even with limited resources.
Creating and maintaining a consulting unit within an academic program does take faculty time and staff support. Involving statistics students in research teams takes resources, networking, and an appreciation of uncertainty. (Who better to appreciate uncertainty than statisticians?) Some teams, such as the ones I have described here, will have very rewarding experiences, and others will have less happy endings. All of them, with good management, can have reasonable learning experiences. An academic statistical consulting unit can have far-reaching benefits to statistics students by improving their preparation for future careers in statistics. I urge statistics faculty members and administrators to continue to find ways to fund this important activity. Editor’s Note: The views expressed in this editorial do not reflect policy or views of the U.S. Food and Drug Administration.
Further Reading
McCrory, J.D., Baron, H.A., Balkin, S., and Cavanagh, P.R. (2002). “Locomotion in Simulated Microgravity: Gravity Replacement Loads.” Aviation, Space, and Environmental Medicine, 73(7):625–631.
The National Health Interview Survey: 50 Years and Going Strong Jane F. Gentleman
It is 5 p.m. in October, and it’s already dark outside. The interviewer for the National Health Interview Survey (NHIS) drives slowly along the street, looking for one of the houses on her interview assignment list, but finding it difficult to see the house numbers. At last, she identifies the house she is looking for, parks her car, grabs a pile of papers, retrieves her laptop from the car’s trunk, walks up to the front door, and rings the doorbell. When there is no answer, the interviewer places a folder of NHIS information on the front doorstep and leaves, planning to come back and try again later. Three hours later, she returns, and, this time, an elderly man answers the door. The interviewer introduces herself as a U.S. Census Bureau employee who is conducting NHIS interviews on behalf of the National Center for Health Statistics. The man recalls previously receiving a letter saying this household was randomly selected to participate in the NHIS. The letter briefly described the NHIS, informed the reader of the confidentiality protection the gathered data would receive, and stated that participation is voluntary. The man agrees to be interviewed, but he lives alone and is unwilling to let the interviewer come inside his house. So, for the next 30 minutes, he and the interviewer stand on their respective sides of the open front door while the many NHIS questions are asked and answered. During the interview, the interviewer places her left arm around her open laptop, braces that arm against the outside wall of the house, and supports the laptop from below by jutting out her stomach. The sophisticated software with which the laptop is equipped helps her methodically work her way through the circuitous “skip patterns” of the NHIS questionnaire. By porch light, she reads aloud the questions that appear on the laptop screen and types in the man’s responses with her right hand. A notebook containing flashcards—lists of possible responses to questions whose
possible responses are too numerous and/or too complex to grasp and remember—rests on the sidewalk and is handed to the man whenever a question designed to use a flashcard is asked. Conducting NHIS interviews is a hard job, and interviewers from the U.S. Census Bureau have been doing that for NCHS ever since the survey began. The census’ team of full- and part-time professional NHIS interviewers—about 450 in recent years—successfully conducts tens of thousands of NHIS interviews every year. The average NHIS interview is about 30 minutes longer than the one described above, because many households have more than one resident, and the NHIS asks questions about all household residents of all ages. Also, some people have a lot more to report about their health than others. Often, quite a few attempts to gain admission to a household are necessary, and sometimes the door is never answered or the interview is refused.
The 50th Anniversary of the NHIS Over the last 50 years, the NHIS has provided data that are of great use to a variety of communities. The dedicated staff members who plan and conduct the NHIS—including all who process, document, analyze, and disseminate the data and analytic products—marked the 50th anniversary of the NHIS with many commemorative events and products. Now that the NHIS’ 50th anniversary has passed, NCHS looks forward to many more years of conducting this invaluable survey. So, happy 50th anniversary to the National Health Interview Survey, and many happy returns.
Origin of the NHIS In 2007, the National Center for Health Statistics (NCHS), which conducts the NHIS and a number of other health surveys, observed the 50th anniversary of the NHIS. NCHS is the country’s official health statistics agency and is part of the Centers for Disease Control and Prevention. The NHIS was created as a result of the 1956 National Health Survey Act, Public Law 652, which was signed by President Dwight Eisenhower on July 3, 1956. The act stipulated that “The Surgeon General is authorized to make, by sampling or other appropriate means, surveys and special studies of the population of the United States to determine the extent of illness and disability and related information … and … in connection therewith, to develop and test new or improved methods for obtaining current data on illness and disability and related information.” The NHIS went into the field for the first time in July 1957. From the beginning,
the survey was designed to serve a diverse community of data users, rather than focusing solely on selected policy or program needs. Topics presently covered by the relatively stable core of the survey include health status, utilization of health care services, health insurance coverage, health-related behaviors (such as use of tobacco and alcohol), risk factors, and demographic and socioeconomic information. In addition, supplemental questions about special topics are sponsored by government agencies other than NCHS and added to the NHIS questionnaire each year. Users of NHIS data and analytic products include researchers, policymakers, government and nongovernment programs, teachers, students, journalists, and the general public.
Sample Sizes and Precision NHIS interviewers are in the field collecting data almost continuously,
stopping only for a couple of weeks each January to participate in refresher training, at which they learn about what is new in the next year’s NHIS, and brush up on difficult parts of the interview. The NHIS sample is designed to be representative of the civilian noninstitutionalized population of the United States. Figure 1 is a graph of the number of persons in the NHIS sample from 1962/1963 (the earliest survey year for which microdata files were permanently retained) to 2006. The NHIS operated on a fiscal year basis (July 1–June 30) from 1957/1958 through 1967/1968 and on a calendar year basis starting in 1968. The sample size has ranged from a low of 62,052 in 1986 to a high of 139,196 in 1966/1967. The obvious downward trend in Figure 1 is due to the increasingly high cost of conducting an in-person survey. The dips in sample sizes in 1985 and 1986 occurred because budget considerations required the sample to be reduced. Another low NHIS sample size, in 1996 (63,402 persons), was the result of using a large number of the 1996 interviews for a pre-test of the new computer-assisted personal interview (CAPI) system (which replaced the old paper-and-pencil method of administering the questionnaire) and the new leaner and meaner questionnaire that were debuted in 1997. To balance the NHIS’ budget, the new NHIS sample design implemented in 2006 indefinitely reduced the NHIS sample size by about one-eighth, and from 2002–2007 (except for 2005), the sample size was reduced by about another one-eighth by cancelling interviewing assignments. Consequently, the 2005 sample size was 98,649, the 2006 sample size was 75,716, and the 2007 sample size (not yet known at the time of writing) will be similar to that of 2006. The sample size reductions were made in such a way that the representativeness of the NHIS sample was maintained. When the sample size decreases, the variances of estimates calculated from NHIS data increase. It is useful to produce and compare estimates for subgroups of the population, but the ability to do that is limited by the sample size. It seems that no matter how large the sample size, there is always a desire to disaggregate estimates further and to conduct more sensitive hypothesis tests, but the ability to increase the sample size is limited by budget and staff resources and the need to carry out other important projects.
Figure 1. Number of persons (in thousands) in the NHIS samples, 1962/1963 to 2006

Table 1. Percentages of Selected Racial/Ethnic Subgroups, According to Census Figures for the 2006 U.S. Civilian Noninstitutionalized Population, and in the Unweighted 2006 NHIS Sample

Subgroup               2006 Census Figure    2006 NHIS Sample (Unweighted)
Hispanics              15%                   24%
Non-Hispanic Blacks    13%                   16%
Non-Hispanic Asians    4%                    6%
Non-Hispanic Others    68%                   54%
For example, in 2004, the NHIS was taken out of the field for three weeks to reduce its cost. NCHS assesses the precision of its published estimates using the relative standard error (RSE), which also is known as the coefficient of variation. The RSE is the ratio of the estimated standard error (square root of the estimated variance) of an estimate to the estimate itself. Thus, large values of the RSE are undesirable; NCHS usually warns users if RSEs are above 30% and suppresses publication of estimates with RSEs above 50%. Everything else equal, the RSE for estimates from a 47-week sample would be equal to the RSE from a 50-week sample multiplied by the square root of 50/47 (i.e., multiplied by 1.0314). Fortunately for analysts of NHIS data, the RSE does not increase linearly with a reduction in sample size; the standard error is inversely proportional to the square root of the sample size, which means reductions have an important, but much less dramatic, impact. Sometimes, the sample size is increased for selected types of persons. The NHIS has been oversampling black persons since 1985, Hispanic persons
since 1995, and Asian persons since 2006. The NHIS collection procedures have been methodically designed so the probabilities of members of these minority subgroups being in the NHIS sample are disproportionately high. Oversampling results in increased precision of estimates for those subgroups. To accomplish oversampling, selected households on the NHIS interviewers’ assignment lists are designated for screening. Those households receive a full NHIS interview only if the interviewer ascertains at the beginning of the interview that at least one member of one of the minority subgroups being oversampled resides in the household. Otherwise, the interviewer discontinues the interview (i.e., the household is screened out). Table 1 compares the percentages of selected subgroups in the U.S. civilian noninstitutionalized population with the corresponding percentages in the 2006 NHIS sample. The table shows how, for example, the percentage of Hispanics in the 2006 U.S. noninstitutionalized civilian population was 15%, whereas oversampling yielded a 2006 NHIS sample that was 24% Hispanic (unweighted). Because
the NHIS sample size was designed to remain fixed, oversampling blacks, Hispanics, and Asians caused other subgroups to be undersampled, as shown in Table 1 by the lowered percentage of non-Hispanic others in the NHIS sample. Use of survey weights in calculating estimates from NHIS data is necessary to properly adjust for those differences.
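To see why the weights matter, here is a minimal simulation sketch; it is illustrative only, not NHIS estimation methodology. The population and unweighted sample shares are taken loosely from Table 1, and the subgroup prevalences are invented.

```python
# Minimal illustration of why survey weights are needed when some subgroups are
# oversampled. Shares loosely follow Table 1: Hispanics are ~15% of the population
# but ~24% of the unweighted sample. The health-outcome prevalences are made up.
import random

random.seed(1)
POP_SHARE = {"Hispanic": 0.15, "Other": 0.85}     # population composition
SAMPLE_SHARE = {"Hispanic": 0.24, "Other": 0.76}  # unweighted sample composition
PREVALENCE = {"Hispanic": 0.30, "Other": 0.10}    # hypothetical outcome by subgroup

n = 100_000
sample = []
for group, share in SAMPLE_SHARE.items():
    for _ in range(int(n * share)):
        y = 1 if random.random() < PREVALENCE[group] else 0
        # Design weight is proportional to (population share) / (sample share),
        # i.e., inversely proportional to the selection probability.
        sample.append((y, POP_SHARE[group] / share))

unweighted = sum(y for y, _ in sample) / len(sample)
weighted = sum(y * w for y, w in sample) / sum(w for _, w in sample)
true_value = sum(POP_SHARE[g] * PREVALENCE[g] for g in POP_SHARE)

print(f"true prevalence      {true_value:.3f}")  # 0.130
print(f"unweighted estimate  {unweighted:.3f}")  # biased upward, roughly 0.148
print(f"weighted estimate    {weighted:.3f}")    # close to 0.130
```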
Sample Design and Interviewing Protocols To control the cost of going door-to-door to gather data and the time it takes to do so, the NHIS is designed to collect data from hierarchically clustered geographical areas. Once a household is selected for the sample, the survey is administered to all families in the household and to all members of each family. Thus, the NHIS sample is a highly stratified multistage probability sample. It would be prohibitively costly and/or time consuming to administer the survey instead to a simple random sample of the same number of households, because they would be scattered all over the country. The complex design of the NHIS helps control costs, but there are inevitable tradeoffs. For example, clustering has the effect of introducing covariances between units of analysis, increasing the variance of most estimates calculated from NHIS data. Analysts of data from complex surveys such as the NHIS are strongly advised to use specialized software to calculate variance estimates; such software packages account for the complex sample design by using survey weights and adjusting for covariances when calculating estimates and variance estimates. The “design effect” of an estimate calculated from complex survey data is the estimated variance of the estimate divided by what the estimated variance would have been if the survey data had been a simple random sample of the same size. Thus, the design effect is the factor by which the survey’s complex design inflates the estimate of the variance of the estimate. Equivalently, the design effect is the factor by which the sample size would need to be inflated to achieve the same variance with a complex sample design as with a simple random sample. Design effects are estimate-specific, so there is no one design effect value that applies to all NHIS estimates. Design effects on the order of three are common, but they can be much larger.
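The definitions above translate directly into a few back-of-the-envelope calculations. The sketch below is illustrative: the helper functions simply encode the definitions in the text, and all inputs are hypothetical except the 2006 sample size and the 50-versus-47-week factor mentioned earlier.

```python
# Back-of-the-envelope helpers for the quantities discussed above: the relative
# standard error (RSE) and the design effect. Formulas follow the definitions in
# the text; example values are hypothetical except where noted.
import math

def rse(estimate: float, std_error: float) -> float:
    """Relative standard error = SE / estimate (NCHS flags >30%, suppresses >50%)."""
    return std_error / estimate

def design_effect(var_complex: float, var_srs: float) -> float:
    """Variance under the complex design divided by the variance under a simple
    random sample of the same size."""
    return var_complex / var_srs

def effective_sample_size(n: int, deff: float) -> float:
    """Size of a simple random sample with the same precision."""
    return n / deff

# A design effect around 3 (common, per the text) shrinks the 75,716 interviews
# of 2006 to an effective simple-random-sample size of roughly 25,000.
print(round(effective_sample_size(75_716, 3.0)))

# Shortening the field period from 50 to 47 weeks inflates the RSE by sqrt(50/47),
# the 1.0314 factor quoted in the text (standard errors scale as 1/sqrt(n)).
print(round(math.sqrt(50 / 47), 4))  # 1.0314
```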
Since 1997, the NHIS questionnaire has included three main modules. In the family module, at least one adult answers questions about everyone in the family. In the sample child module, a knowledgeable adult answers questions about one randomly selected child (the “sample child”). In the sample adult module, one adult (the “sample adult”) is randomly selected to answer questions. No proxy answers are permitted for the sample adult module; that is, only the person randomly selected by the computer is permitted to respond. Many questions in that section of the questionnaire are most accurately answered by the person to whom they apply, not by proxy. For example, the sample adult is asked questions such as “Has your doctor ever told you that you had [some chronic condition]?” or “What race(s) do you consider yourself to be?” Researchers often prefer to analyze data for the sample children or the sample adults because of the richness and high quality of those data, even though the sample sizes are smaller because the information was collected about only one child and from only one adult in the family. NHIS interviewers have repeatedly expressed their frustration over being required to administer the sample adult module only to the adult who was randomly selected by the computer, and not to some other adult in the family who might be more available or more willing to be interviewed. Interviewers also have asked why they sometimes have to travel to very remote places, such as the Aleutian Islands in Alaska, to collect data, when it would be quicker and cheaper to collect data in a more convenient location. And interviewers have often asked their supervisors to extend the deadline for completing interviews of the households on their assignment lists. The answers to those three issues are really the same; the interviewers must adhere to the interviewing protocols for the survey results to be representative of the general population. The NCHS even wrote and produced an interviewer training video, titled “The Right Time, The Right Place, The Right Person,” that demonstrates with humor why representativeness of the time of year, the geographical area, and the member of the population is important in NHIS data. Once a year, NCHS disseminates one year of NHIS microdata—appropriately
processed to remove some types of inconsistencies and gaps and to ensure confidentiality—on the NCHS web site at www.cdc.gov/nchs/nhis.htm. The more timely the release of data, the more useful the data, and as the major NHIS milestones listed in Table 2 for survey years 2004, 2005, and 2006 show, the timeliness of the annual NHIS microdata releases is extraordinarily good.
Table 2. Selected Major Milestones in the History of the NHIS: 1957–2008

Survey Year*   NHIS Event
1957/58        First went into the field in July 1957
1959/60        First asked about health insurance
1961/62        First asked about occupation and industry
1962/63        New sample design, utilizing information from the 1960 census
1962/63        First microdata files created and permanently retained
1962/63        First annual Current Estimates report (descriptive statistics)
1966/67        Largest sample (139,196 persons)
1973           New sample design, utilizing new information from the 1970 census
1976           First collection of respondent-reported race and Hispanic origin information
1976           First identification of multiple races (for adults only prior to 1982)
1985           New sample design, utilizing new information from the 1980 census
1985           First oversampling of blacks
1986           Smallest sample (62,052 persons)
1995           New sample design, utilizing new information from the 1990 census
1995           First oversampling of Hispanics (oversampling of blacks continues)
1995           First supplement to track progress toward achieving national health objectives of the Healthy People program (this one for Healthy People 2000)
1997           First use of computer-assisted personal interviewing (CAPI) for the core questions
1997           Questionnaire completely revised
1997           First set of three annual Summary Health Statistics reports (on population, adults, and children), replacing annual Current Estimates report
2001           First quarterly release of estimates for key health indicators via Early Release (ER) program
2001           First release of public use microdata files on the internet
2003           First questions about cell phone usage
2004           Blaise software replaces CASES software in laptops
2004           Implementation of Contact History Instrument (CHI) to record characteristics of interviewers’ attempts to contact household occupants
2004           First annual microdata files to be released less than one year after the end of the survey year
2004           First quarterly release of detailed health insurance estimates via Early Release (ER) program
2005           First annual microdata files to be released less than seven months after the end of the survey year
2006           New sample design, utilizing new information from the 2000 census
2006           First oversampling of Asians (oversampling of blacks and Hispanics continues)
2006           First oversampling of certain persons aged ≥65: increased probability of selection as the sample adult if they are Hispanic, black, or Asian
2006           First annual microdata files to be released less than six months after the end of the survey year
2007           First semiannual release of report on cell phone usage via Early Release (ER) program
2007           Revamped income and Social Security number questions
2008           Release of first microdata file of paradata (metadata about the data-gathering process)

*Two consecutive NHIS years separated by a slash indicate one fiscal year of the survey, which operated on a fiscal year basis (July 1 to June 30) through 1967/68.

NHIS History and Evolution It is beyond the scope of this article to cover the history of the NHIS in detail, but we present here several examples of how the survey’s approach has evolved over 50 years, along with evolutions in public health, technology, survey methodology, laws, and societal values. Table 2 and the article, “History of the National Health Interview Survey,” by Eve Powell-Griner and Jennifer Madans provide summaries of the NHIS history. For example, the manner in which the NHIS ascertained the race and/or ethnicity of household members has changed greatly over time. Jacqueline Wilson Lucas of NCHS has prepared a detailed web site about this topic at www.cdc.gov/nchs/about/major/nhis/rhoi/rhoi.htm. In the earliest years, the NHIS did not ask respondents about the race of household members; the interviewer observed the respondent’s race (but not the ethnicity) and recorded it. Interviewers were instructed to record the race of the respondent and family members as “White,” “Negro,” or “Other,” probing verbally only when the race could not be “determined” by observation. Figure 2 is an excerpt from the 1967/1968 NHIS Interviewers’ Manual, showing that interviewers were further told to “assume that the race of all persons related to the respondent is the same as the race of the respondent” and “Report … persons of mixed Negro and other parentage as Negro” (the so-called “one drop rule”). The NHIS started collecting respondent-reported race and Hispanic origin information in 1976, following guidelines developed by the federal government’s Office of Management and Budget (OMB) for federal statistical systems. In response to changing demographics and the growing need of persons with multiple racial heritages to indicate that on federal surveys, the OMB issued revised standards in 1997 for race and Hispanic origin data
The new standards stipulated new categories to be used to identify race and ethnicity, again mandated the placement of Hispanic origin questions before race questions, and required that respondents to federal surveys be able to select more than one group when answering questions about race. The various changes were fully implemented in the 1999 NHIS, although the NHIS has permitted respondents to identify multiple races since 1976 (for adults only prior to 1982). Table 3 shows the Hispanic ethnicity categories and race categories used in 2007 NHIS flashcards.

Time also has changed the sex-specificity of some NHIS questions. For example, between 1957 and 1981, persons aged 17 and older were asked about their usual activity. The questions were phrased as follows (with interviewer instructions shown in square brackets):

    What were you doing most of the past 12 months –
    [For males:] working or doing something else?
    [For females:] keeping house, working or doing anything else?
Figure 2. Excerpt from the 1967/1968 NHIS Interviewers’ Manual
Figure 3. Excerpt from the 1959/1960 NHIS paper questionnaire
Figure 3 shows how, in 1959/1960, the NHIS questionnaire asked how a motor vehicle accident happened. In those days, it was necessary to list “Accident between motor vehicle and person riding on … horse-drawn vehicle” as one of the possible responses. The interviewer marked the answer on the paper questionnaire using a pencil. Although such accidents still happen occasionally, they would now be recorded as “Other” types of accidents.

Data analysts can use NHIS information about demographic characteristics and socioeconomic status to study similarities and disparities in the health of population subgroups. For example, income questions on the NHIS permit analysis of the well-known positive association between income and health. The phrasing of NHIS income questions has, of course, had to change over time. For example, a 1962/1963 question that asked for total family income in the last 12 months had nine response categories, the two lowest being “Under $500” and “$500–$999” and the two highest being
“$7,000–$9,999” and “$10,000 and over.” In the 2008 NHIS family module, the respondent is asked, “What is your best estimate of the total income of all family members from all sources, before taxes, in calendar year 2007?” The response can be any amount, but if it is ≥$999,995, it is “top-coded” (recorded only as greater than or equal to that amount) to protect confidentiality. If the respondent refuses to provide a specific amount, a succession of questions is asked to try to obtain an interval containing the family income, beginning with a question about whether the family income is