Methods in Molecular Biology
Series Editor John M. Walker School of Life Sciences University of Hertfordshire Hatfield, Hertfordshire, AL10 9AB, UK
For other titles published in this series, go to www.springer.com/series/7651
Statistical Methods in Molecular Biology Edited by
Heejung Bang, Xi Kathy Zhou, and Madhu Mazumdar Weill Medical College of Cornell University, New York, NY, USA
Heather L. Van Epps Rockefeller University Press, New York, NY, USA
Editors
Heejung Bang, Division of Biostatistics and Epidemiology, Department of Public Health, Weill Cornell Medical College, 402 East 67th St., New York, NY 10065, USA
Xi Kathy Zhou, Division of Biostatistics and Epidemiology, Department of Public Health, Weill Cornell Medical College, 402 East 67th St., New York, NY 10065, USA
Heather L. Van Epps, Rockefeller University Press, Journal of Experimental Medicine, 1114 First Ave., 3rd Floor, New York, NY 10021, USA
Madhu Mazumdar, Division of Biostatistics and Epidemiology, Department of Public Health, Weill Cornell Medical College, 402 East 67th St., New York, NY 10065, USA
ISSN 1064-3745
e-ISSN 1940-6029
ISBN 978-1-60761-578-1
e-ISBN 978-1-60761-580-4
DOI 10.1007/978-1-60761-580-4
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2009942427
© Springer Science+Business Media, LLC 2010
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Cover illustration: Adapted from Figure 19 of Chapter 2. Traditional MDS map showing genes clustered by coregulation (background) and significance of the uni-variate p-values (size of red circles). The overlayed network indicates the “most significant” gene pairs (green) and trios (blue) (right). Font size indicating the smallest of the uni-, bi- and tri-variate p-values.
Printed on acid-free paper
Humana Press is part of Springer Science+Business Media (www.springer.com)
Knowing is not enough; we must apply. Willing is not enough; we must do. -Johann Wolfgang Von Goethe
Preface
This book is intended for molecular biologists who perform quantitative analyses on data emanating from their field and for the statisticians who work with molecular biologists and other biomedical researchers. There are many excellent textbooks that provide fundamental components for statistical training curricula. There are also many “by experts for experts” books in statistics and molecular biology, which require in-depth knowledge of both subjects to be used to full advantage. So far, no book in statistics has been published that provides the basic principles of proper statistical analyses and progresses to more advanced statistics in response to rapidly developing technologies and methodologies in the field of molecular biology. Responding to this situation, our book aims at bridging the gap between these two extremes. Molecular biologists will benefit from the progressive style of the book, in which basic statistical methods are introduced and gradually elevated to an intermediate level. Similarly, statisticians will benefit from learning about the various kinds of biological data generated in the field of molecular biology, the types of questions of interest to molecular biologists, and the statistical approaches to analyzing the data. The statistical concepts and methods relevant to studies in molecular biology are presented in a simple and practical manner. Specifically, the book covers basic and intermediate statistics that are useful for classical and molecular biology settings and advanced statistical techniques that can be used to help solve problems commonly encountered in modern molecular biology studies, such as supervised and unsupervised learning, hidden Markov models, manipulation and analysis of data from high-throughput microarray and proteomic platforms, and synthesis of this evidence. A tutorial-type format is used to maximize learning in some chapters.
Advice from journal editors on peer-reviewed publication and some useful information on software implementation are also provided. This book is recommended for use as supplementary material both inside and outside classrooms or as a self-learning guide for students, scientists, and researchers who deal with numeric data in molecular biology and related fields. Those who start as beginners, but desire to be at an intermediate level, will find this book especially useful in their learning pathway.
We want to thank John Walker (series editor); Patrick Marton, David Casey, and Anne Meagher (editors at Springer and Humana); and Shanthy Jaganathan (Integra-India). The following persons provided useful advice and comments on selection of topics, referral to experts in each topic, and/or chapter reviews that we truly appreciate: Stephen Looney (a former editor of this book), Stan Young, Dmitri Zaykin, Douglas Hawkins, Wei Pan, Alexandre Almeida, John Ho, Rebecca Doerge, Paula Trushin, Kevin Morgan, Jason Osborne, Peter Westfall, Jenny Xiang, Ya-lin Chiu, Yolanda Barron, Huibo Shao, Alvin Mushlin, and Ronald Fanta. Drs. Bang, Zhou, and Mazumdar were partially supported by a Clinical Translational Science Center (CTSC) grant (UL1-RR024996).
Heejung Bang
Contents
Preface
Contributors
PART I BASIC STATISTICS
1. Experimental Statistics for Biological Sciences
   Heejung Bang and Marie Davidian
2. Nonparametric Methods for Molecular Biology
   Knut M. Wittkowski and Tingting Song
3. Basics of Bayesian Methods
   Sujit K. Ghosh
4. The Bayesian t-Test and Beyond
   Mithat Gönen
PART II DESIGNS AND METHODS FOR MOLECULAR BIOLOGY
5. Sample Size and Power Calculation for Molecular Biology Studies
   Sin-Ho Jung
6. Designs for Linkage Analysis and Association Studies of Complex Diseases
   Yuehua Cui, Gengxin Li, Shaoyu Li, and Rongling Wu
7. Introduction to Epigenomics and Epigenome-Wide Analysis
   Melissa J. Fazzari and John M. Greally
8. Exploration, Visualization, and Preprocessing of High-Dimensional Data
   Zhijin Wu and Zhiqiang Wu
PART III STATISTICAL METHODS FOR MICROARRAY DATA
9. Introduction to the Statistical Analysis of Two-Color Microarray Data
   Martina Bremer, Edward Himelblau, and Andreas Madlung
10. Building Networks with Microarray Data
   Bradley M. Broom, Waree Rinsurongkawong, Lajos Pusztai, and Kim-Anh Do
PART IV ADVANCED OR SPECIALIZED METHODS FOR MOLECULAR BIOLOGY
11. Support Vector Machines for Classification: A Statistical Portrait
   Yoonkyung Lee
12. An Overview of Clustering Applied to Molecular Biology
   Rebecca Nugent and Marina Meila
13. Hidden Markov Model and Its Applications in Motif Findings
   Jing Wu and Jun Xie
14. Dimension Reduction for High-Dimensional Data
   Lexin Li
15. Introduction to the Development and Validation of Predictive Biomarker Models from High-Throughput Data Sets
   Xutao Deng and Fabien Campagne
16. Multi-gene Expression-based Statistical Approaches to Predicting Patients’ Clinical Outcomes and Responses
   Feng Cheng, Sang-Hoon Cho, and Jae K. Lee
17. Two-Stage Testing Strategies for Genome-Wide Association Studies in Family-Based Designs
   Amy Murphy, Scott T. Weiss, and Christoph Lange
18. Statistical Methods for Proteomics
   Klaus Jung
PART V META-ANALYSIS FOR HIGH-DIMENSIONAL DATA
19. Statistical Methods for Integrating Multiple Types of High-Throughput Data
   Yang Xie and Chul Ahn
20. A Bayesian Hierarchical Model for High-Dimensional Meta-analysis
   Fei Liu
21. Methods for Combining Multiple Genome-Wide Linkage Studies
   Trecia A. Kippola and Stephanie A. Santorico
PART VI OTHER PRACTICAL INFORMATION
22. Improved Reporting of Statistical Design and Analysis: Guidelines, Education, and Editorial Policies
   Madhu Mazumdar, Samprit Banerjee, and Heather L. Van Epps
23. Stata Companion
   Jennifer Sousa Brennan
Subject Index
Contributors CHUL AHN • Division of Biostatistics, Department of Clinical Sciences, The Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA SAMPRIT BANERJEE • Division of Biostatistics and Epidemiology, Department of Public Health, Weill Cornell Medical College, New York, NY, USA HEEJUNG BANG • Division of Biostatistics and Epidemiology, Weill Cornell Medical College, New York, NY, USA MARTINA BREMER • Department of Mathematics, San Jose State University, San Jose, CA, USA JENNIFER SOUSA BRENNAN • Department of Biostatistics, Yale University, New Haven, CT, USA BRADLEY M. BROOM • Department of Bioinformatics and Computational Biology, University of Texas M. D. Anderson Cancer Center, Houston, TX, USA FABIEN CAMPAGNE • HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medical College, New York, NY, USA FENG CHENG • Department of Biophysics, University of Virginia, Charlottesville, VA, USA SANG-HOON CHO • Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA YUEHUA CUI • Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA MARIE DAVIDIAN • Department of Statistics, North Carolina State University, Raleigh, NC, USA XUTAO DENG • HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, USA KIM-ANH DO • Department of Biostatistics, University of Texas M. D. Anderson Cancer Center, Houston, TX, USA MELISSA J. FAZZARI • Division of Biostatistics, Department of Epidemiology and Population Health, Department of Genetics, Albert Einstein College of Medicine, Bronx, NY, USA SUJIT K. 
GHOSH • Department of Statistics, North Carolina State University, Raleigh, NC, USA MITHAT GÖNEN • Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, NY, USA JOHN M. GREALLY • Department of Genetics, Department of Medicine, Albert Einstein College of Medicine, Bronx, NY, USA EDWARD HIMELBLAU • Department of Biological Science, California Polytechnic State University, San Luis Obispo, CA, USA KLAUS JUNG • Department of Medical Statistics, Georg-August-University Göttingen, Göttingen, Germany
SIN-HO JUNG • Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA TRECIA A. KIPPOLA • Department of Statistics, Oklahoma State University, Stillwater, OK, USA CHRISTOPH LANGE • Center for Genomic Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA; Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA JAE K. LEE • Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA YOONKYUNG LEE • Department of Statistics, The Ohio State University, Columbus, OH, USA GENGXIN LI • Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA LEXIN LI • Department of Statistics, North Carolina State University, Raleigh, NC, USA SHAOYU LI • Department of Statistics and Probability, Michigan State University, East Lansing, MI, USA FEI LIU • Department of Statistics, University of Missouri, Columbia, MO, USA ANDREAS MADLUNG • Department of Biology, University of Puget Sound, Tacoma, WA, USA MADHU MAZUMDAR • Division of Biostatistics and Epidemiology, Department of Public Health, Weill Cornell Medical College, New York, NY, USA MARINA MEILA • Department of Statistics, University of Washington, Seattle, WA, USA AMY MURPHY • Channing Laboratory, Center for Genomic Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA REBECCA NUGENT • Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, USA LAJOS PUSZTAI • Department of Breast Medical Oncology, University of Texas M. D. Anderson Cancer Center, Houston, TX, USA WAREE RINSURONGKAWONG • Department of Biostatistics, University of Texas M. D. Anderson Cancer Center, Houston, TX, USA STEPHANIE A. SANTORICO • Department of Mathematical and Statistical Sciences, University of Colorado, Denver, CO, USA TINGTING SONG • Center for Clinical and Translational Science, The Rockefeller University, New York, NY, USA HEATHER L. 
VAN EPPS • Journal of Experimental Medicine, Rockefeller University Press, New York, NY, USA SCOTT T. WEISS • Channing Laboratory, Center for Genomic Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA KNUT M. WITTKOWSKI • Center for Clinical and Translational Science, The Rockefeller University, New York, NY, USA JING WU • Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, USA RONGLING WU • Departments of Public Health Sciences and Statistics, Pennsylvania State University, Hershey, PA, USA ZHIJIN WU • Center for Statistical Sciences, Brown University, Providence, RI, USA ZHIQIANG WU • Department of Electrical Engineering, Wright State University, Dayton, OH, USA JUN XIE • Department of Statistics, Purdue University, West Lafayette, IN, USA
YANG XIE • Division of Biostatistics, Department of Clinical Sciences, The Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA
Part I Basic Statistics
Chapter 1
Experimental Statistics for Biological Sciences
Heejung Bang and Marie Davidian
Abstract
In this chapter, we cover basic and fundamental principles and methods in statistics, from “What are Data and Statistics?” to “ANOVA and linear regression,” which are the basis of any statistical thinking and undertaking. Readers can easily find the selected topics in most introductory statistics textbooks, but we have tried to assemble and structure them in a succinct and reader-friendly manner in a stand-alone chapter. This text has long been used in real classroom settings for both undergraduate and graduate students, whether or not they major in statistical sciences. We hope that from this chapter readers will understand the key statistical concepts and terminology, how to design a study (experimental or observational), how to analyze the data (e.g., describe the data and/or estimate the parameter(s) and make inferences), and how to interpret the results. This text is most useful as supplemental material while readers take their own statistics courses; it can also serve, alongside a manual for any statistical software, as a reference text for self-teaching.
Key words: ANOVA, correlation, data, estimation, experimental design, frequentist, hypothesis testing, inference, regression, statistics.
1. Introduction to Statistics
1.1. Basis for Statistical Methodology
The purpose of the discussion in this section is to stimulate you to start thinking about the important issues upon which statistical methodology is based.
Heejung Bang adapted Dr. Davidian’s lecture notes used for the course “Experimental Statistics for Biological Sciences” at the North Carolina State University. The complete version of the lecture notes is available at http://www4.stat.ncsu.edu/~davidian/.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_1, © Springer Science+Business Media, LLC 2010
Statistics: The development and application of theory and methods to the collection (design), analysis, and interpretation of observed information from planned (or unplanned) experiments.
Typical Objectives: Some examples are as follows:
(i) Determine which of three fertilizer compounds produces the highest yield
(ii) Determine which of two drugs is more effective for controlling a certain disease in humans
(iii) Determine whether an activity such as smoking causes a response such as lung cancer
Examples (i) and (ii) represent situations where the scientist has the opportunity to plan (design) the experiment. Such a preplanned investigation is called a controlled experiment. The goal is to compare treatments (fertilizers, drugs). In example (iii), the scientist may only observe the phenomenon of interest. The treatment is smoking, but the experimenter has no control over who smokes. Such an investigation is called an observational study. In this section, we will focus mostly on controlled experiments, which leads us to think about the design of experiments.
Key Issue: We would like to draw conclusions based on the data arising as the result of an experiment. Moreover, we would like the conclusions to apply in general. For example, in (i), we would like to claim that the fertilizers produce different yields in general based on the particular data from a single experiment. Let us introduce some terminology.
Population: The entire “universe” of possibilities. For example, in (ii), the population is all patients afflicted with the disease.
Sample: A part of the population that we can observe. Observation of a sample gives rise to information on the phenomenon of interest, the data.
Using this terminology, we may refine the statement of our objective. We would like to make statements about the population based on observation of samples. For example, in (ii), we obtain two samples of diseased patients and subject one to drug 1 and the other to drug 2. In agricultural, medical, and other biological applications, the most common objective is the comparison of two or more treatments. We will thus often talk about statistical inference in the context of comparing treatments.
Problem: A sample we observe is only one of many possible samples we might have seen instead. That is, in (ii), one sample of patients would be expected to be similar to another, but not identical. Plants will differ due to biological variability, which may cause different reactions to fertilizers in example (i). There is uncertainty about any inference we make on a population based on observation of samples.
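The point that one sample is only one of many possible samples can be illustrated with a short simulation. The sketch below is not part of the original notes: the population values, the sample size of 25, and the function name `draw_sample_mean` are all invented for illustration. It draws several random samples from the same hypothetical population and shows that their averages are similar but not identical.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: one response value per patient (invented numbers).
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

def draw_sample_mean(size):
    """Draw one random sample from the population and return its average."""
    return statistics.mean(rng.sample(population, size))

# Five different samples of 25 patients, all from the same population.
sample_means = [draw_sample_mean(25) for _ in range(5)]
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i}: average response = {m:.2f}")
```

Each run of the loop plays the role of a different experiment on the same population; the spread among the printed averages is the kind of uncertainty that statistical inference has to account for.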
The premise of statistical inference is that we attempt to control and assess the uncertainty of inferences we make on the population of interest based on observation of samples.
Key: Allow explicitly for variation in and among samples. Statistics is thus often called the “study of variation.”
First Step: Set up or design experiments to control variability as much as possible. This is certainly possible in situations such as field trials in agriculture, clinical trials in medicine, and reliability studies in engineering. It is not entirely possible in observational studies, where the samples “determine themselves.”
Principles of Design: Common sense is the basis for most of the ideas for designing scientific investigations:
• Acknowledgment of potential sources of variation. Suppose it is suspected that males may react to a certain drug differently from females. In this situation, it would make sense to assign the drugs to the samples with this in mind instead of with no regard to the gender of participants in the experiment. If this assignment is done correctly, differences in treatments may be assessed despite differences in response due to gender. If gender is ignored, actual differences in treatments could be obscured by differences due to gender.
• Confounding. Suppose in example (i), we select all plants to be treated with fertilizer 1 from one nursery, all for fertilizer 2 from another nursery, etc. Or, alternatively, suppose we keep all fertilizer 1 plants in one greenhouse, all fertilizer 2 plants in another greenhouse, etc. These choices may be made for convenience or simplicity, but they introduce problems: under such an arrangement, we will not know whether any differences we might observe are due to actual differences in the effects of the treatments or to differences in the origin or handling of plants. In such a case, the effects of treatments are said to be confounded with the effects of, in this case, nursery or greenhouse.
Here is another example. Consider a clinical trial to compare a new, experimental treatment to the standard treatment. Suppose a doctor assigns patients with advanced cases of disease to a new experimental drug and assigns patients with mild cases to the standard treatment, thinking that the new drug is promising and should thus be given to the sicker patients. The new drug may then appear to perform poorly relative to the standard drug simply because its patients were sicker to begin with: the effects of the drugs are confounded with the severity of the disease.
To take adequate account of variation and to avoid confounding, we would like the elements of our samples to be as alike as possible except for the treatments. Assignment of treatments to the samples should be done so that potential sources of variation do not obscure treatment differences. No amount of fancy statistical analysis can help an experiment that was conducted without paying attention to these issues!
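The clinical trial example can be made concrete with a small simulation. This is only a sketch under invented assumptions, not something taken from the notes: the response scale, the assumed true benefit of 5 units for the new drug, the assumed penalty of 20 units for advanced disease, and the helper names `response` and `estimated_benefit` are all hypothetical. It contrasts the doctor's assignment rule with a coin-flip assignment.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible

TRUE_DRUG_BENEFIT = 5.0   # assumed: the new drug raises the response by 5 units
SEVERITY_PENALTY = 20.0   # assumed: advanced disease lowers the response by 20

def response(new_drug, advanced):
    """Simulated response for one patient (higher is better)."""
    base = 60.0 + rng.gauss(0.0, 5.0)
    return base + TRUE_DRUG_BENEFIT * new_drug - SEVERITY_PENALTY * advanced

def estimated_benefit(assign):
    """Mean response on the new drug minus mean on the standard drug.

    `assign` maps a patient's severity (True = advanced) to a treatment
    (True = new drug). 2000 hypothetical patients are simulated.
    """
    new, standard = [], []
    for _ in range(2000):
        advanced = rng.random() < 0.5
        new_drug = assign(advanced)
        (new if new_drug else standard).append(response(new_drug, advanced))
    return statistics.mean(new) - statistics.mean(standard)

# Doctor's rule: advanced cases get the new drug, mild cases the standard one.
confounded = estimated_benefit(lambda advanced: advanced)
# Coin-flip rule: treatment does not depend on severity.
randomized = estimated_benefit(lambda advanced: rng.random() < 0.5)

print(f"doctor's rule:  estimated benefit = {confounded:.1f}")
print(f"coin-flip rule: estimated benefit = {randomized:.1f}")
```

Under the doctor's rule the new drug looks harmful even though it was built into the simulation as beneficial, because drug and severity cannot be separated. Under the coin-flip rule the estimate should land near the assumed benefit.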
Second Step: Interpret the data by taking appropriate account of sources and magnitude of variation and the design, that is, by assessing variability.
Principle of Statistical Inference: Because variation is involved, and samples represent only a part of the population, we may not make statements that are absolute. Rather, we must temper our statements by acknowledging the uncertainty due to variation and sampling. Statistical methods incorporate this uncertainty into statements interpreting the data. The appropriate methods to use are dictated to a large extent by the design used to collect the data. For example,
• In example (ii) comparing two drugs, if we are concerned about possible gender differences, we might obtain two samples of men and two samples of women and treat one of each with drug 1, the other of each with drug 2. With such a design, we should be able to gain insight into differences actually due to the drugs as well as variability due to gender. A method to assess drug differences that takes into account the fact that part of the variation observed is attributable to a known feature, gender, would then be appropriate. Intuitively, a method that did not take this into account would be less useful.
Moral: Design and statistical analysis go hand in hand.
1.2. First Notions of Experimental Design
It is obvious from the above discussion that a successful experiment will be one where we consider the issues of variation, confounding, and so on prior to collecting data. That is, ideally, we design an experiment before performing it. Some fundamental concepts of experimental design are as follows.
Randomization: This device is used to ensure that samples are indeed as alike as possible except for the treatments. Randomization is a treatment assignment mechanism: rather than assigning the treatments systematically, we assign them so that, once all acknowledged sources of variation are accounted for, it can be assumed that no obscuring or confounding effects remain.
Here is an example. Suppose we wish to compare two fertilizers in a certain type of plant (we restrict attention to just two fertilizers for simplicity). Suppose we are going to obtain plants and then assign them to receive one fertilizer or the other. How best to do this? We would like to perform the assignment so that no plant will be more likely to get a potentially “better” treatment than another. We’d also like to be assured that, prior to getting treatment, plants are otherwise basically alike. This way, we may feel confident that differences among the treatments that might show up reflect a “real” phenomenon. A simple way to do this is to obtain a sample of plants from a single nursery, which ensures that they are basically alike, and then determine two samples by a coin flip:
• Heads = Fertilizer 1, Tails = Fertilizer 2
• Ensures that all plants have an equal chance of getting either fertilizer, that is, all plants are alike except for the treatment
• This is a random process
Such an assignment mechanism, based on chance, is called randomization. Randomization is the cornerstone of the design of controlled experiments.
Potential Problem: Ultimately, we wish to make general statements about the efficacy of the treatments. Although using randomization ensures the plants are as alike as possible except for the treatments and that the treatments were fairly assigned, we have only used plants from one nursery. If plants are apt to respond to fertilizers differently because of nursery of origin, this may limit our ability to make general statements.
Scope of Inference: The scope of inference for an experimental design is limited to the population from which the samples are drawn. In the design above, the population is plants from the single nursery. To avoid limited scope of inference, we might like to instead use plants from more than one nursery. However, to avoid confounding, we do not assign the treatments systematically by nursery.
Extension: Include more nurseries but use the same principles of treatment assignment. For example, suppose we identify three nurseries. By using plants from all three, we broaden the scope of inference. If we repeated the above process of randomization to assign the treatments for each nursery, we would have increased our scope of inference, but at the same time ensured that, once we have accounted for the potential source of variation, nursery, plants receiving each treatment are basically alike except for the treatments. One can imagine that this idea of recognizing potential sources of variation and then assigning treatments randomly to ensure fair comparisons, thus increasing the scope of inference, may be extended to more complicated situations.
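The three-nursery extension can be sketched in a few lines of code. Everything named here is hypothetical: the plant labels, the six plants per nursery, and the helper `randomize` are invented for illustration. Note one deliberate departure from the text: instead of an independent coin flip per plant, the sketch shuffles the plants and deals them out in turn, which keeps the chance-based assignment but also guarantees equal group sizes.

```python
import random

def randomize(units, treatments, rng):
    """Randomly split `units` into equal-sized groups, one per treatment."""
    shuffled = list(units)
    rng.shuffle(shuffled)  # random order, so position carries no information
    return {unit: treatments[i % len(treatments)]
            for i, unit in enumerate(shuffled)}

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
treatments = ["Fertilizer 1", "Fertilizer 2"]

# Hypothetical plants: six from each of three nurseries.
plants_by_nursery = {
    nursery: [f"{nursery}-plant{k}" for k in range(1, 7)]
    for nursery in ("A", "B", "C")
}

# Randomize separately within each nursery, so that every nursery
# contributes the same number of plants to each fertilizer.
assignment = {}
for nursery, plants in plants_by_nursery.items():
    assignment.update(randomize(plants, treatments, rng))

for plant, fertilizer in sorted(assignment.items()):
    print(plant, "->", fertilizer)
```

Because the randomization is repeated inside each nursery, nursery of origin cannot line up with fertilizer, which is the confounding the text warns against.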
For instance, in our example, another source of variation might be the greenhouses in which the treated plants are kept, and there may be more than two treatments of interest. We will return to these issues later. For now, keep in mind that sound experimental design rests on these key issues:
• Identifying and accounting for potential sources of variation
• Randomly assigning treatments after accounting for sources of variation
• Making sure the scope of the experiment is sufficiently broad
Aside: In an observational study, the samples are already determined by the treatments themselves. Thus, a natural question is whether there is some hidden (confounding) factor that
causes the responses observed. For example, in (iii), is there some underlying trait that in fact causes both smoking and lung cancer? This possibility limits the inference we may make from such studies. In particular, we cannot infer a causal relationship between treatment (smoking) and response (lung cancer, yes or no). We may only observe that an association appears to exist. Because the investigator does not have control over how the treatment is applied, interpretation of the results is not straightforward. If aspects of an experiment can be controlled, that is, the experiment may be designed up-front (treatment assignment determined in advance), there is an obvious advantage in terms of what conclusions we may draw.
1.3. Data
The purpose of this section is to introduce concepts and associated terminology regarding the collection, display, and summary of data.
Data: The actual observations of the phenomenon of interest based on samples from the population.
A Goal of Statistics: Present and summarize data in a meaningful way.
Terminology: A variable is a characteristic that changes (i.e., shows variability) from unit to unit (e.g., subjects, plots). Data are observations of variables. The types of variables are as follows.
Qualitative Variables: Numerical measurements on the phenomenon of interest are not possible. Rather, the observations are categorical.
Quantitative Variables: The observations are in the form of numerical values.
• Discrete: Possible values of the variable differ by fixed amounts.
• Continuous: All values in a given range are possible. We may be limited in recording the exact values by the precision and/or accuracy of the measuring device.
Appropriate statistical methods are dictated in part by the nature of the variable in question.
1.3.1. Random Samples
We have already discussed the notion of randomization in the design of an experiment. The underlying goal is to ensure that samples may be assumed to be “representative” of a population of interest. In our nursery example, then, one population of interest might be that of plants subjected to treatment 1. The random assignment to determine which plants receive treatment 1 thus may be thought of as an attempt to obtain a representative sample of plants from this population, so that data obtained from the sample will be free from confounding and bias. In general, the way in which we will view a “representative sample” is one chosen so that any member of the population has an equal chance
of being in the sample. In the nursery example, it is assumed that the plants ending up in the sample receiving treatment 1 may be thought of as being chosen from the population of all plants were they to receive treatment 1. Any plant in the overall population was equally likely to end up in this sample. The justification for this assumption is that randomization was used to determine the sample. The idea that samples have been chosen randomly is a foundation of much of the theory underlying the statistical methods that we will study. To shape our thinking for later, when we talk about the formal theory, it is instructive to consider a conceptual model for random sampling.
1.3.2. Sampling – A Model
A simple way to think about random sampling is to think about drawing from a box.
Population: Slips of paper in a box, one slip per individual (plant, patient, plot, etc.).
Sample: To obtain a random sample, draw slips of paper from the box such that, on each draw, all slips have an equal chance of being chosen, i.e., completely random selection.
Two Ways to Draw:
• With replacement: On each draw, all population members have the same chance of being in the sample. This is simple to think about, but the drawback is that an individual could conceivably end up in the sample more than once.
• Without replacement: The number of slips in the box decreases with each draw, because slips are not replaced. Thus, the chance of being in the sample increases with the number of draws (size of the sample). If the population is large relative to the size of the sample to be chosen, this increase is negligible. Thus, we may for practical purposes view drawing without replacement as drawing with replacement. The populations of interest are usually huge relative to the size of the samples we use for experimentation.
Why Is This Important? To simplify the theory, standard statistical methods are predicated on the notion that the samples are completely random, which would follow if the samples were indeed chosen with replacement. Because drawing without replacement from a large population is practically equivalent, we may view the samples as effectively having been drawn in this way, i.e., with replacement from populations of interest. This is the model that we will use when thinking about the properties of data in order to develop statistical methods.
Practical Implementation: We have already talked about using the flip of a coin to randomize treatments and hence determine random samples. When there are several treatments, the randomization is accomplished by a random process that is a little more sophisticated than a simple coin flip, but is still in the same spirit. We will assume that if sampling from the population has been
Bang and Davidian
carried out with careful objectivity in procuring members of the population to which to assign treatments, and the assignment has been made using these principles, then this model applies. Henceforth, then, we will use the term sample interchangeably with random sample, and we will develop methods based on the assumption of random sampling. We will use the term sample to refer both to the physical process and to the data arising from it.

1.3.3. Descriptive Statistics
Idea: Summarize data by quantifying the notions of “center” and “spread” or “dispersion.” That is, define relevant measures that summarize these notions.

One Objective: By quantifying center and spread for a sample, we hope to get an idea of these same notions for the population from which the sample was drawn. As above, if we could measure the diameter of every cherry tree in the forest, and quantify the “center” and “spread” of all of these diameters, we would hope that the “center” and “spread” values for our sample of 31 trees from this population would bear resemblance to the population values.

Notation: For the remainder of this discussion, we adopt the following standard notation to describe a sample of data:
n = size of the sample,
Y = the variable of interest,
$Y_1, Y_2, \ldots, Y_n$ = observations on the variable for the sample.

Measures of Center: There are several different ways to define the notion of “center.” Here, we focus on the two most common:
• Mean or average: This is the most common notion of center. For our data, the sample mean is defined as
$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \text{sample mean}. \qquad [1]$$
That is, the sample mean is the average of the sample values. The notation “ Y¯ ” is standard; the “bar” indicates the “averaging” operation being performed. The population mean is the corresponding quantity for the population. We often use the Greek symbol μ to denote the population mean: μ = mean of all possible values for the variable for the population. One may think of μ as the value that would be obtained if the averaging operation in [1] were applied to all the values in the population.
Experimental Statistics for Biological Sciences
• Median: The median is the value such that, when the data are put into ascending order, 50% of the observations are above and 50% are below this value. Thus, the median quantifies the notion of center differently from the mean – the assessment of “center” is based on the likelihood of the observations (their distribution) rather than on averaging. It should be clear that, with this definition, the value chosen as the median of a sample need not be unique. If n is odd, then the definition may be applied exactly – the median is the “center” observation. When n is even, the median can be defined as the average of the two “middle” values. The median better represents a typical value for skewed data.

Fact: It is clear that the mean and median do not coincide in general.

Measures of Spread: Two data sets (or populations) may have the same mean, but may be “spread” about this mean value very differently. For a set of data, a measure of spread for an individual observation is the sample deviation $(Y_i - \bar{Y})$. Intuition suggests that the mean or average of these deviations ought to be a good measure of how the observations are spread about the mean for the sample. However, it is straightforward to show that $\sum_{i=1}^{n} (Y_i - \bar{Y}) = 0$ for any set of data, because positive and negative deviations cancel. A sensible measure of “spread” is concerned with how spread out the observations are; direction is not important. The most common way to ignore the sign of the deviations is to square them.
• Sample variance: For a sample of size n, this is defined as
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2. \qquad [2]$$
This looks like an average, but with (n − 1) instead of n as the divisor – the reason for this will be discussed later. [(n − 1) is called the degrees of freedom (df), as we’ll see.] Note that if the original data are measured in some units, then the sample variance has (units)$^2$. Thus, sample variance does not measure spread on the same scale of measurement as the data.
• Sample standard deviation (SD): To get a measure of spread on the original scale of measurement, take the square root of the sample variance. This quantity, which thus measures spread in data units, is called the sample SD:
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2}.$$
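As a rough illustration of the measures of center and spread defined so far, the following Python sketch computes them for a small sample. The data values and variable names are made up for illustration and do not come from the text; only the standard library is assumed.

```python
import math
import statistics

# Hypothetical sample of n = 6 measurements (illustrative values only).
y = [8.3, 8.6, 8.8, 10.5, 10.7, 10.8]
n = len(y)

y_bar = sum(y) / n                          # sample mean, as in [1]
deviations = [yi - y_bar for yi in y]       # sample deviations (Y_i - Y_bar)
dev_sum = sum(deviations)                   # zero, up to rounding error

s2 = sum(d ** 2 for d in deviations) / (n - 1)   # sample variance, divisor (n - 1), as in [2]
s = math.sqrt(s2)                                # sample SD, in the units of the data

# n is even here, so the median is the average of the two middle values.
median = statistics.median(y)

print(f"mean={y_bar:.3f}  median={median:.3f}  variance={s2:.3f}  SD={s:.3f}")
```

The hand-computed variance and SD agree with `statistics.variance` and `statistics.stdev`, which also use the divisor (n − 1).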
The sample variance and SD are generally available on handheld calculators. It is instructive, however, to examine hand calculation of them. First, it can be shown that
$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} Y_i\right)^2. \qquad [3]$$
This formula is of interest for several reasons:
• Formula [3] is preferred over [2] when doing hand calculation, to avoid propagation of error.
• Breaking the sum of squared deviations into two pieces highlights a concept that will be important to our thinking later.

Write $SS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$. This is the sum of squared deviations, but is often called the sum of squares adjusted for the mean. The reason may be deduced from [3]. The two components are called
$$\sum_{i=1}^{n} Y_i^2 = \text{unadjusted sum of squares of the data},$$
$$\frac{1}{n}\left(\sum_{i=1}^{n} Y_i\right)^2 = \text{correction or adjustment for the mean} = n\bar{Y}^2.$$
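The decomposition in [3] can be checked numerically. The sketch below uses a small made-up data set (an assumption for illustration) and computes SS both directly from the deviations and from the unadjusted sum of squares minus the correction for the mean.

```python
# Hypothetical data (illustrative values only).
y = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
n = len(y)
y_bar = sum(y) / n

# Left side of [3]: sum of squared deviations about the sample mean.
ss_deviations = sum((yi - y_bar) ** 2 for yi in y)

# Right side of [3]: unadjusted sum of squares minus the correction for the mean.
unadjusted_ss = sum(yi ** 2 for yi in y)
correction = sum(y) ** 2 / n          # equals n * y_bar**2
ss_shortcut = unadjusted_ss - correction

s2 = ss_shortcut / (n - 1)            # sample variance from the shortcut form

print(ss_deviations, ss_shortcut, s2)
```

Both routes give the same SS; the shortcut needs only the running totals of $Y_i$ and $Y_i^2$, which is why it suits hand calculation.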
Thus, we may interpret SS as taking each observation and “centering” (correcting) its magnitude about the sample mean, which itself is calculated from the data.
• Range: Difference between the highest and lowest values, another simple measure of spread.
• Interquartile range: Difference between the third and first quartiles, also called the midspread or middle fifty.

1.3.4. Statistical Inference
We already discussed the notion of a population mean, denoted by μ. We may define in an analogous fashion population variance and population SD. These quantities may be thought of as the measures that would be obtained if we could calculate variance and SD based on all the values in the population.

Terminology: A parameter is a quantity describing the population, e.g.,
• μ = population mean
• $\sigma^2$ = population variance

Parameter values are usually unknown. Thus, in practice, sample values are used to get an idea of the values of the parameters for the population.
• Estimator: An estimator is a quantity describing the sample that is used as a “guess” for the value of the corresponding population parameter. For example, $\bar{Y}$ is an estimator for μ and $s^2$ is an estimator for $\sigma^2$.
– The term estimator is usually used to denote the “abstract” quantity, i.e., the formula.
– The term estimate is usually used to denote the actual numerical value of the estimator.
• Statistic: A quantity derived from the sample observations, e.g., $\bar{Y}$ and $s^2$ are statistics.

The larger the sample, the closer estimates will be to the true (but unknown) population parameter values, assuming that sampling was done correctly.
• If we are going to use a statistic to estimate a parameter, we would like to have some sense of “how close.”
• As the data values exhibit variability, so do statistics, because statistics are based on the data.
• A standard way of quantifying “closeness” of an estimator to the parameter is to calculate a measure of how variable the estimator is.
• Thus, the variability of the statistic is with respect to the spread of all possible values it could take on.
– So we need a measure of how variable sample means from samples of size n are across all the possible data sets of size n we might have ended up with.
• Think now of the population of all possible values of the sample mean $\bar{Y}$ – this population consists of all $\bar{Y}$ values that may be calculated from all possible samples of size n that may be drawn from the population of data Y values.
• This population will itself exhibit spread and itself may be thought of as having a mean and variance.
• Statistical theory may be used to show that this mean and variance are as follows:
$$\text{Mean of population of } \bar{Y} \text{ values} = \mu \overset{\text{def}}{=} \mu_{\bar{Y}}, \qquad [4]$$
$$\text{Variance of population of } \bar{Y} \text{ values} = \frac{\sigma^2}{n} \overset{\text{def}}{=} \sigma^2_{\bar{Y}}, \qquad [5]$$
where the symbols on the far right in each expression are the customary notation for representing these quantities – the subscript “$\bar{Y}$” emphasizes that these are the mean and variance of the population of $\bar{Y}$ values, not the population of Y values.
• Equation [4] shows that the mean of $\bar{Y}$ values is the same as the mean of Y values! This makes intuitive sense – we expect $\bar{Y}$ to be similar to the mean of the population of Y values, μ. Equation [4] confirms this and suggests that $\bar{Y}$ is a sensible estimator for μ.
• Furthermore, the quantity $\sigma^2/n$ in [5] represents how variable $\bar{Y}$ values are. Note that this depends on the sample size n. In practice, the quantity $\sigma^2_{\bar{Y}}$ depends on $\sigma^2$, which is unknown. Thus, if we want to provide a measure of how variable $\bar{Y}$ values are, the standard approach is to estimate $\sigma^2_{\bar{Y}}$ by replacing $\sigma^2$, the unknown variance of the population of Y values, by its estimate, the sample variance $s^2$.
• That is, calculate the following and use it as an estimate of $\sigma^2_{\bar{Y}}$:
$$s^2_{\bar{Y}} = \frac{s^2}{n}.$$
• Define
$$s_{\bar{Y}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}}. \qquad [6]$$
$s_{\bar{Y}}$ is referred to as the standard error (SE) (of the mean) and is an estimate of the SD of all possible $\bar{Y}$ values from samples of size n.
• It is important to keep the distinction between $s$ and $s_{\bar{Y}}$ clear: $s$ = SD of Y values and $s_{\bar{Y}}$ = SD of $\bar{Y}$ values.
• The SE is an estimate of the variability (on the original scale of measurement) associated with using the sample mean $\bar{Y}$ as an estimate of μ.

Important: As n increases, $\sigma_{\bar{Y}}$ and $s_{\bar{Y}}$ decrease with $1/\sqrt{n}$. Thus, the larger the n, the less variable (more precise, reliable) the sample mean $\bar{Y}$ will be as an estimator of μ! As intuition suggests, larger n’s give “better” estimates.

Coefficient of Variation: Often, we wish to get an idea of variability on a relative basis, that is, we would like to have a unitless measure that describes variation in the data (population).
• This is particularly useful if we wish to compare the results of several experiments for which the data are observations on the same variable.
• The problem is that different experimental conditions may lead to different variability. Thus, even though the variable Y may be the same, the variability may be different.
• The coefficient of variation is a relative measure that expresses variability as a proportion (percentage) of the mean. For a population with mean μ and SD σ, the definition is
$$CV = \frac{\sigma}{\mu} \text{ as a proportion} = \frac{\sigma}{\mu} \times 100\% \text{ as a percentage}.$$
• As μ and σ are unknown parameters, CV is estimated by replacing them by their estimates from the data, e.g.,
$$CV = \frac{s}{\bar{Y}}.$$
• CV is also a useful quantity when attention is focused on a single set of data – it provides an impression of the amount of variation in the data relative to the size of the thing being measured; thus, if CV is large, it is an indication that we will have a difficult time learning about the “signal” (μ) because of the magnitude of the “noise” (σ).

1.4. Probability Distributions

1.4.1. Probability
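Before moving on, the standard error in [6] and the estimated CV from the preceding subsection can be sketched in a few lines of Python. The population values (mean 50, SD 10), the sample size, and the use of normally distributed data are assumptions made purely for illustration. The second half repeats the sampling many times to show that the SD of the resulting sample means is close to $\sigma/\sqrt{n}$, as stated in [5].

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

mu, sigma, n = 50.0, 10.0, 25   # hypothetical population mean, SD, and sample size

# One sample: estimate the SE of the mean and the CV from the data alone.
sample = [random.gauss(mu, sigma) for _ in range(n)]
y_bar = statistics.mean(sample)
s = statistics.stdev(sample)     # sample SD, divisor (n - 1)
se = s / math.sqrt(n)            # standard error of the mean, as in [6]
cv = s / y_bar                   # estimated CV as a proportion

# Many samples: collect the sample means and look at their spread.
means = []
for _ in range(5000):
    draw = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(draw) / n)

mean_of_means = statistics.mean(means)   # should be near mu, as in [4]
sd_of_means = statistics.stdev(means)    # should be near sigma / sqrt(n) = 2

print(f"SE={se:.3f}  CV={100 * cv:.1f}%  SD of means={sd_of_means:.3f}")
```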
In order to talk about chance associated with random samples, it is necessary to talk about probability. It is best to think about things very simply at first. This may seem simplistic and even irrelevant, but it is standard to develop the terminology and properties in very simple situations first and then extend the ideas behind them to real situations. Thus, we describe the ideas in terms of what is probably the simplest situation where chance is involved, the flip of a coin. The ideas, however, are more generally applicable.

Terminology: We illustrate the generic terminology in the context of flipping a coin.
• (Random) Experiment: A process for which no outcome may be predicted with certainty. For example, if we toss a coin once, the outcome, “heads” (H) or “tails” (T), may not be declared prior to the coin landing. With this definition, we may think of choosing a (random) sample and observing the results as a random experiment – the eventual values of the data we collect may not be predicted with certainty.
• Sample space: The sample space is the set of all possible (mutually exclusive) outcomes of an experiment. We will use the notation S in this section to denote the sample space. For example, for the toss of a single coin, the sample space is
$$S = \{H, T\}.$$
By mutually exclusive, we mean that the outcomes do not overlap, i.e., they describe totally distinct possibilities. For example, the coin comes up either H or T on any toss – it can’t do both!
• Event: An event is a possible result of an experiment. For example, if the experiment consists of two tosses of a coin, the sample space is
$$S = \{HH, TH, HT, TT\}.$$
Each element in S is a possible result of this experiment. We will use the notation E in this section to denote an event. We may think of other events as well, e.g.,
$E_1$ = {see exactly 1 H in 2 tosses} = {TH, HT},
$E_2$ = {see at least 1 H in 2 tosses} = {HH, TH, HT}.
Thus, events may be combinations of elements in the sample space.
• Probability function: P assigns a number between 0 and 1 to an event. Thus, P quantifies the notion of the chance of an event occurring. The properties are as follows:
– For any event E, 0 ≤ P(E) ≤ 1.
– If S is composed of mutually exclusive outcomes denoted by $O_i$, i.e., $S = \{O_1, O_2, \ldots\}$, then
$$P(S) = \sum_{i} P(O_i) = 1.$$
Thus, S describes everything that could possibly happen; intuitively, we assign the probability “1” to an event that must occur. A probability of “0” (zero) implies that an event cannot occur. When the outcomes in S are equally likely, we may think of the probability of an event E occurring intuitively as
$$P(E) = \frac{\text{no. of outcomes in } S \text{ associated with } E}{\text{total no. of possible outcomes in } S}.$$
For example, in our experiment consisting of two tosses of a coin,
$$P(E_1) = \frac{2}{4} = \frac{1}{2}, \qquad P(E_2) = \frac{3}{4}.$$

1.4.2. Random Variables
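The two-toss probabilities computed at the end of the previous subsection can be reproduced by enumerating the sample space. This is a minimal sketch that assumes a fair coin, so that all four outcomes are equally likely; the helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a fair coin; the four outcomes are equally likely.
S = ["".join(toss) for toss in product("HT", repeat=2)]   # ['HH', 'HT', 'TH', 'TT']

def prob(event):
    """P(E) = (no. of outcomes in S associated with E) / (total no. of outcomes in S)."""
    favourable = sum(1 for outcome in S if event(outcome))
    return Fraction(favourable, len(S))

p_e1 = prob(lambda outcome: outcome.count("H") == 1)   # E1: exactly 1 H in 2 tosses
p_e2 = prob(lambda outcome: outcome.count("H") >= 1)   # E2: at least 1 H in 2 tosses

print(p_e1, p_e2)
```

Using `Fraction` keeps the results as exact ratios rather than rounded decimals.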
The development of statistical methods for analyzing data has its foundations in probability. Because our sample is chosen at random, an element of chance is introduced, as noted above. Thus, our observations are themselves best viewed as random, that is, subject to chance. We thus term the variable of interest a random variable (often denoted by r.v.) to emphasize this principle. Data represent observations on the r.v. These may be (for quantitative r.v.’s) discrete or continuous. Events of interest may be formulated in terms of r.v.’s.
In our coin toss experiment, let Y = no. of H in 2 coin tosses. Then we may represent our events as
$$E_1 = \{Y = 1\}, \qquad E_2 = \{Y \ge 1\}.$$
Furthermore, the probabilities of the events may also be written in terms of Y, e.g., $P(E_1) = P(Y = 1)$, $P(E_2) = P(Y \ge 1)$. If we did our coin toss experiment n times (each time consists of two tosses of the coin), and recorded the value of Y each time, we would have data $Y_1, \ldots, Y_n$, where $Y_i$ is the number of heads seen the ith time we did the experiment.

1.4.3. Probability Distribution of a Random Variable
To understand this concept, it is easiest to first consider the case of a discrete r.v. Y. Thus, Y takes on values that we can think about separately. Our r.v. Y corresponding to the coin tossing experiment is a discrete r.v.; it may only take on the values 0, 1, or 2.

Probability Distribution Function: Let y denote a possible value of Y. The function f(y) = P(Y = y) is the probability distribution function for Y. f(y) is the probability we associate with an observation on Y taking the value y. If we think in terms of data, that is, observations on the r.v. Y, then we may think of the population of all possible data values. From this perspective, f(y) may be thought of as the relative frequency with which the value y occurs in the population. A histogram for a sample summarizes the relative frequencies with which the data take on values, and these relative frequencies are represented by area. This gives a pictorial view of how the values are distributed. If we think of probabilities as the relative frequencies with which Y would take on values, then it seems natural to think of representing probabilities in a similar way.

Binomial Distribution: Consider more generally the following experiment:
(i) k unrelated trials are performed.
(ii) Each trial has two possible outcomes, e.g., for a coin toss, H or T; for a clinical trial, outcomes can be dead or alive.
(iii) For each trial, the probability of the outcome we are interested in is equal to some value p, 0 ≤ p ≤ 1.
For the k trials, we are interested in the number of trials resulting in the outcome of interest. Let S = “success” denote this outcome. Then the r.v. of interest is Y = no. of S in k trials. Y may thus take on the values 0, 1, . . . , k. To fix ideas, consider an experiment consisting of k coin tosses, and suppose we are interested in the number of H observed in the k tosses. For more realistic situations, the principles are the same, so we use coin tossing as an easy, “all-purpose” illustration. Then Y = no. of H in k tosses and S = H. Furthermore, if our coin is “fair,” then $p = \frac{1}{2} = P(H)$. It turns out that the form of the probability distribution function f(y) of Y may be derived mathematically. The expression is
$$f(y) = P(Y = y) = \binom{k}{y} p^{y} (1 - p)^{k - y}, \qquad y = 0, 1, \ldots, k,$$
where $\binom{k}{y} = k!/\{y!(k - y)!\}$. The notation x! is “x factorial” $= x(x - 1)(x - 2) \cdots (2)(1)$, that is, the product of x with all positive whole numbers smaller than x. By convention, 0! = 1. If we do k trials, and y are “successes,” then k − y of them are not “successes.” There are a number of ways that this can happen in k trials – the expression $\binom{k}{y}$ turns out to quantify this number of ways. We are not so interested in this particular f(y) here – our main purpose is to establish the idea that probability distribution functions do exist and may be calculated.

Mean and Variance: Thinking about f(y) as a model for the population suggests that we consider other population quantities. In particular, for data from an experiment like this, what would the population mean μ and variance $\sigma^2$ be like? It turns out that mathematical derivations may be used to show that
$$\mu = kp, \qquad \sigma^2 = kp(1 - p).$$
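To make the binomial formulas concrete, the following sketch evaluates f(y) for every possible y and then computes the mean and variance directly from the distribution, for comparison with kp and kp(1 − p). The choice k = 10 and p = 1/2, and the function name `binom_pmf`, are assumptions for illustration.

```python
from math import comb

def binom_pmf(y, k, p):
    """f(y) = C(k, y) * p^y * (1 - p)^(k - y) for y = 0, 1, ..., k."""
    return comb(k, y) * p ** y * (1 - p) ** (k - y)

k, p = 10, 0.5   # hypothetical: 10 tosses of a fair coin

f = [binom_pmf(y, k, p) for y in range(k + 1)]

total = sum(f)                                             # probabilities sum to 1
mean = sum(y * fy for y, fy in enumerate(f))               # compare with k * p
var = sum((y - mean) ** 2 * fy for y, fy in enumerate(f))  # compare with k * p * (1 - p)

print(f"total={total:.6f}  mean={mean:.4f}  variance={var:.4f}")
```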
We would thus expect, if we did the experiment n times, that the sample mean $\bar{Y}$ and sample variance $s^2$ would be “close” to these values (and be estimators for them).

Continuous Random Variables: Many of the variables of interest in scientific investigations are continuous; thus, they can take on any value within some range. For example, suppose we obtain a sample of n pigs and weigh them. Thus, the r.v. of interest is Y = weight of a pig and the data are $Y_1, \ldots, Y_n$, the observed weights for our n pigs.
Y is a r.v. because the pigs were drawn at random from the population of all pigs. Furthermore, all pigs do not weigh exactly the same; they exhibit random variation due to biological and other factors.

Goal: Find a function like f(y) for a discrete r.v. that describes the probability of observing a pig weighing y units. This function, f, would thus serve as a model describing the population of pig weights – how they are distributed and how they vary.

Probability Density: A function f(y) that describes the distribution of values taken on by a continuous r.v. is called a probability density function. If we could determine f for a particular r.v. of interest, its graph would have the same interpretation as a probability histogram did for a discrete r.v. If we were to take a sample of size n and construct the sample histogram, we would expect it to have roughly the same shape as f – as n increases, we’d expect it to look more and more like f. Thus, a probability density function describes the population of Y values, where Y is a continuous r.v.

Normal Approximation to the Binomial Distribution: Recall that, as the number of trials k grows, the probability histogram for a binomial r.v. with p = 1/2 has a shape suspiciously like the normal density function. The former is jagged, while the normal density is a smooth curve; however, as k gets larger, the jagged edges get smoother. Mathematically, it may be shown that, as k gets large, for p near 1/2, the probability distribution function and histogram for the binomial distribution look more and more like the normal probability density function and its graph! Thus, if one is dealing with a discrete binomial r.v., and the number of trials is relatively large, the smooth, continuous normal distribution is often used to approximate the binomial. This allows one to take advantage of the statistical methods we will discuss that are based on the assumption that the normal density is a good description of the population.
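The quality of the normal approximation can be checked numerically. The sketch below compares an exact binomial probability with the corresponding area under a normal curve having the same mean kp and variance kp(1 − p). The values k = 100 and p = 1/2 and the interval 45 to 55 are illustrative assumptions. The half-unit widening of the interval (a continuity correction) is a common refinement that is not discussed in the text.

```python
from math import comb, sqrt
from statistics import NormalDist

k, p = 100, 0.5   # hypothetical: 100 tosses of a fair coin

def binom_pmf(y):
    return comb(k, y) * p ** y * (1 - p) ** (k - y)

# Exact binomial probability P(45 <= Y <= 55).
exact = sum(binom_pmf(y) for y in range(45, 56))

# Normal distribution with matching mean k*p and SD sqrt(k*p*(1-p)).
approx_dist = NormalDist(mu=k * p, sigma=sqrt(k * p * (1 - p)))

# Area under the normal curve; the +/- 0.5 is the continuity correction (an added assumption).
approx = approx_dist.cdf(55.5) - approx_dist.cdf(44.5)

print(f"exact={exact:.4f}  normal approximation={approx:.4f}")
```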
We thus confine much of our attention to methods for data that may be viewed (at least approximately) as arising from a population that may be described by a normal distribution.

1.4.4. The Standard Normal Distribution
The probability density function f for a normal distribution has a very complicated form. Thus, it is not possible to evaluate probabilities for a normal r.v. Y as easily as we could for a binomial. Luckily, however, these probabilities may be calculated on a computer. Some computer packages have these probabilities “built in”; they are also widely available in tables in the special case when μ = 0 and $\sigma^2 = 1$, e.g., in statistical textbooks. It turns out that such tables are all that is needed to evaluate probabilities for any μ and $\sigma^2$. It is instructive to learn how to use such tables, because the necessary operations give one a better understanding of how probabilities are represented by area. We will learn how to evaluate normal probabilities when μ and $\sigma^2$ are known first; we
will see later how we may use this knowledge to develop statistical methods for estimating them when they are not known.

Suppose $Y \sim N(\mu, \sigma^2)$. We wish to calculate probabilities such as
$$P(Y \le y), \quad P(Y \ge y), \quad P(y_1 \le Y \le y_2), \qquad [7]$$
that is, probabilities for intervals associated with values of Y.

Technical Note: When dealing with probabilities for any continuous r.v., we do not make the distinction between strict inequalities like “<” and “>” and inequalities like “≤” and “≥”. Limitations on our ability to see the exact values make it impossible to distinguish between Y being exactly equal to a value y and Y being equal to a value extremely close to y. Thus, the probabilities in [7] could equally well be written
$$P(Y < y), \quad P(Y > y), \quad P(y_1 < Y < y_2). \qquad [8]$$
The Standard Normal Distribution: Consider the event $(y_1 \le Y \le y_2)$. If we think of the r.v. Y − μ, it is clear that this is also a normal r.v., but now with mean 0 and the same variance $\sigma^2$. If we furthermore know $\sigma^2$, and thus the SD σ (same units as Y), suppose we divide every possible value of Y by the value σ. This will yield all possible values of the r.v. Y/σ. Note that this r.v. is “unitless”; e.g., if Y is measured in grams, so is σ, so the units “cancel.” Rather, this r.v. has “units” of SD. In particular, the SD of Y/σ is 1. Define
$$Z = \frac{Y - \mu}{\sigma}.$$
Then Z will have mean zero and SD 1. (It has units of SD of the original r.v. Y; that is, if Z takes the value 1, the value of Y is 1 SD to the right of the mean.) It will also be normally distributed, just like Y, as all we have done is shift the mean and scale by the SD. Hence, we call Z a standard normal r.v., and we write $Z \sim N(0, 1)$. Applying this, we see that
$$(y_1 \le Y \le y_2) \Leftrightarrow \left(\frac{y_1 - \mu}{\sigma} \le Z \le \frac{y_2 - \mu}{\sigma}\right) \quad \text{and} \quad (Y \ge y) \Leftrightarrow \left(Z \ge \frac{y - \mu}{\sigma}\right).$$
If we want to find probabilities about events concerning Y, and we know μ and σ, all we need is a table of probabilities for a standard normal r.v. Z.

1.4.5. Statistical Inference
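The standardization described in the preceding subsection can be sketched as follows. The values μ = 100 and σ = 15 and the helper name `z_score` are hypothetical, and `NormalDist(0, 1).cdf` from the Python standard library plays the role of the standard normal table.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0      # hypothetical known population mean and SD
Z = NormalDist(0.0, 1.0)     # standard normal; Z.cdf(z) gives P(Z <= z)

def z_score(y):
    """Transform a value of Y into units of SD: z = (y - mu) / sigma."""
    return (y - mu) / sigma

# P(85 <= Y <= 115) = P(-1 <= Z <= 1)
p_between = Z.cdf(z_score(115.0)) - Z.cdf(z_score(85.0))

# P(Y >= 130) = P(Z >= 2)
p_above = 1.0 - Z.cdf(z_score(130.0))

# Cross-check against the N(mu, sigma^2) distribution evaluated directly.
Y = NormalDist(mu, sigma)
p_between_direct = Y.cdf(115.0) - Y.cdf(85.0)

print(f"P(85 <= Y <= 115)={p_between:.4f}  P(Y >= 130)={p_above:.4f}")
```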
We assume that we are interested in a r.v. Y that may be viewed (exactly or approximately) as following a N (μ, σ 2 ) distribution.
Suppose that we have observed data $Y_1, \ldots, Y_n$. In real situations, we may be willing to assume that our data arise from some normal distribution, but we do not know the values of μ or $\sigma^2$. As we have discussed, one goal is to use $Y_1, \ldots, Y_n$ to estimate μ and $\sigma^2$. We use statistics like $\bar{Y}$ and $s^2$ as estimators for these unknown parameters. Because $\bar{Y}$ and $s^2$ are based on observations on a r.v., they themselves are r.v.’s. Thus, we may think about the populations of all possible values they may take on (from all possible samples of size n). It is natural to thus think of the probability distributions associated with these populations.

The fundamental principle behind the methods we will discuss is as follows. We base what we are willing to say about μ and $\sigma^2$ on how likely the values of the statistics $\bar{Y}$ and $s^2$ we saw from our data would be if some values $\mu_0$ and $\sigma_0^2$ were the true values of these parameters. To assess how likely, we need to understand the probabilities with which $\bar{Y}$ and $s^2$ take on values. That is, we need the probability distributions associated with $\bar{Y}$ and $s^2$.

Probability Distribution of $\bar{Y}$: It turns out that if Y is normal, then the distribution of all possible values of $\bar{Y}$ is also normal! Thus, if we want to make statements about “how likely” it is that $\bar{Y}$ would take on certain values, we may calculate these using the normal distribution. We have
$$\bar{Y} \sim N(\mu_{\bar{Y}}, \sigma^2_{\bar{Y}}), \qquad \mu_{\bar{Y}} = \mu, \qquad \sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}}.$$
We may use these facts to transform events about $\bar{Y}$ into events about a standard normal r.v. Z.

1.4.6. The $\chi^2$ Distribution
Probability Distribution of $s^2$: When the data are observations on a r.v. $Y \sim N(\mu, \sigma^2)$, then it may be shown mathematically that the values taken on by
$$\frac{(n - 1)s^2}{\sigma^2}$$
are well represented by another distribution different from the normal. This quantity is still continuous, so the distribution has a probability density function. The distribution with this density, and thus the distribution that plays a role in describing probabilities associated with $s^2$ values, is called the Chi-square distribution with (n − 1) df. This is often written $\chi^2_{n-1}$.
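A small simulation gives a feel for this quantity. The sketch below draws many normal samples, computes $(n - 1)s^2/\sigma^2$ for each, and summarizes the results. The population values, sample size, and number of repetitions are assumptions for illustration. The comparison values in the comments (mean equal to the df, variance equal to twice the df) are standard properties of the Chi-square distribution that are not stated in the text.

```python
import random

random.seed(11)  # fixed seed so the sketch is reproducible

n, mu, sigma, reps = 10, 5.0, 2.0, 4000   # hypothetical settings

def sample_variance(y):
    """Sample variance with divisor (n - 1)."""
    y_bar = sum(y) / len(y)
    return sum((yi - y_bar) ** 2 for yi in y) / (len(y) - 1)

values = []
for _ in range(reps):
    y = [random.gauss(mu, sigma) for _ in range(n)]
    values.append((n - 1) * sample_variance(y) / sigma ** 2)

avg = sum(values) / reps                                   # expected to be near n - 1 = 9
spread = sum((v - avg) ** 2 for v in values) / (reps - 1)  # expected to be near 2(n - 1) = 18

print(f"average={avg:.2f}  variance={spread:.2f}  smallest={min(values):.2f}")
```

The simulated values are never negative and are skewed to the right, which is one way to see that the normal distribution would be a poor description of them.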
1.4.7. Student’s t Distribution
Recall that one of our objectives will be to develop statistical methods to estimate μ, the population mean, using the obvious estimator, $\bar{Y}$. We would like to be able to make statements about “how likely” it is that $\bar{Y}$ would take on certain values. We saw above that this involves appealing to the normal distribution, as $\bar{Y} \sim N(\mu, \sigma^2_{\bar{Y}})$. A problem with this in real life is, of course, that
$\sigma^2$, and hence $\sigma^2_{\bar{Y}}$, is not known, but must itself be estimated. Thus, even if we are not interested in $\sigma^2$ in its own right, if we are interested in μ, we still need to estimate $\sigma^2$ to make the inferences we desire! An obvious approach would be to replace $\sigma_{\bar{Y}}$ in our standard normal statistic
$$\frac{\bar{Y} - \mu}{\sigma_{\bar{Y}}}$$
by the obvious estimator, $s_{\bar{Y}}$ in [6], and consider instead the statistic
$$\frac{\bar{Y} - \mu}{s_{\bar{Y}}}. \qquad [9]$$
The value for μ would be a “candidate” value for which we are trying to assess the likelihood of seeing a value of $\bar{Y}$ like the one we saw. It turns out that when we replace $\sigma_{\bar{Y}}$ by the estimate $s_{\bar{Y}}$, the resulting statistic [9] no longer follows a standard normal distribution. The presence of the estimate $s_{\bar{Y}}$ in the denominator serves to add variation. Rather, the statistic has a different distribution, which is centered at 0 and has a symmetric bell shape similar to that of the normal, but whose probabilities in the extreme “tails” are larger than those of the normal distribution.

Student’s Distribution: The probability distribution describing the probabilities associated with the values taken on by the quantity [9] is called the (Student’s) t distribution with (n − 1) df for a sample of size n.

1.4.8. Degrees of Freedom
For both the statistics
$$\frac{(n - 1)s^2}{\sigma^2} \quad \text{and} \quad \frac{\bar{Y} - \mu}{s_{\bar{Y}}},$$
which follow the $\chi^2$ and t distributions, respectively, the notion of df has arisen. The probabilities associated with each of these statistics depend on n through the df value (n − 1). What is the meaning of this? Note that both statistics depend on $s^2$, and recall that $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$. Recall also that it is always true that $\sum_{i=1}^{n} (Y_i - \bar{Y}) = 0$. Thus, if we know the values of (n − 1) of the deviations in our sample, we may always compute the last one, because the deviations about $\bar{Y}$ of all n observations must sum to zero. Thus, $s^2$ may be thought of as being based on (n − 1) “independent” deviations – the final deviation can be obtained from the other (n − 1). The term df thus has to do with the fact that there are (n − 1) “free” or “independent” quantities upon which the r.v.’s above are based. Thus, we would expect their distributions to depend on this notion as well.
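Both ideas above can be illustrated with a short sketch. The first part shows that the last deviation is determined by the other (n − 1). The second part simulates the statistic [9] for small samples and compares the chance of a value beyond ±2 with the corresponding standard normal probability. The data values, the sample size n = 5, the cutoff of 2, and the number of repetitions are all assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

random.seed(7)  # fixed seed so the sketch is reproducible

# Part 1: only (n - 1) of the deviations are "free".
y = [4.1, 5.3, 6.0, 7.2, 9.4]                 # hypothetical sample
y_bar = sum(y) / len(y)
deviations = [yi - y_bar for yi in y]
last_from_others = -sum(deviations[:-1])      # recovers the final deviation

# Part 2: the statistic (Y_bar - mu) / s_Ybar has heavier tails than N(0, 1).
n, mu, sigma, reps, cutoff = 5, 0.0, 1.0, 20000, 2.0
exceed = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    m = sum(sample) / n
    s = math.sqrt(sum((v - m) ** 2 for v in sample) / (n - 1))
    t_stat = (m - mu) / (s / math.sqrt(n))
    if abs(t_stat) > cutoff:
        exceed += 1

t_tail = exceed / reps                                  # simulated P(|t| > 2) with 4 df
normal_tail = 2 * (1 - NormalDist().cdf(cutoff))        # P(|Z| > 2) for a standard normal

print(f"last deviation={deviations[-1]:.3f}  recovered={last_from_others:.3f}")
print(f"P(|t| > 2)={t_tail:.3f}  P(|Z| > 2)={normal_tail:.3f}")
```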
2. Estimation and Inference

2.1. Estimation, Inference, and Sampling Distributions
Estimation: A particular way to say something about the population based on a sample is to assign likely values based on the sample to the unknown parameters describing the population. We have already discussed this notion of estimation, e.g., $\bar{Y}$ is an estimator for μ, $s^2$ is an estimator for $\sigma^2$. Now we get a little more precise about estimation. Note that these estimators are not the only possibilities. For example, recall that
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2;$$
an alternative estimator for $\sigma^2$ would be
$$s_*^2 = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2,$$
where we have replaced the divisor (n − 1) by n. An obvious question would be, if we can identify competing estimators for population parameters of interest, how can we decide among them? Recall that estimators such as $\bar{Y}$ and $s^2$ may be thought of as having their own underlying populations (that may be described by probability distributions). That is, for example, we may think of the population of all possible $\bar{Y}$ values corresponding to all of the possible samples of size n we might have ended up with. For this population, we know that
$$\text{Mean of population of } \bar{Y} = \mu_{\bar{Y}} = \mu. \qquad [10]$$
The property [10] says that the mean of the probability distribution of Y¯ values is equal to the parameter we try to estimate by Y¯ . This seems intuitively like a desirable quality. Unbiasedness: In fact, it is and has a formal name! An estimator is said to be “unbiased” if the mean of the probability distribution is equal to the population parameter to be estimated by the estimator. Thus, Y¯ is an unbiased estimator of μ. Clearly, if we have two competing estimators, then we would prefer the one that is unbiased. Thus, unbiasedness may be used as a criterion for choosing among competing estimators. Minimum Variance: What if we can identify two competing estimators that are both unbiased? On what grounds might
we prefer one over the other? Unbiasedness is clearly a desirable property, but we can also think of other desirable properties an estimator might have. For example, as we have discussed previously, we would also like our estimator to be as “close” to the true value as possible – that is, in terms of the probability distribution of the estimator, we’d like it to have small variance. This would mean that the possible values that the estimator could take on (across all possible samples we might have ended up with) exhibit only small variation. Thus, if we have two unbiased estimators, choose the one with smaller variance. Ideally, then, we’d like to use an estimator that is unbiased and has the smallest variance among all such candidates. Such an estimator is given the name minimum variance unbiased. It turns out that, for normally distributed data Y, the estimators $\bar{Y}$ (for μ) and $s^2$ (for $\sigma^2$) have this desirable property.

2.1.1. Confidence Interval for μ
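Before turning to intervals, the unbiasedness criterion from the preceding discussion can be illustrated by simulation. The sketch below compares the two variance estimators with divisors (n − 1) and n over many samples. The population values, the sample size n = 5, and the number of repetitions are assumptions for illustration. The comparison value $(n - 1)\sigma^2/n$ for the divisor-n estimator is a standard result that is not derived in the text.

```python
import random

random.seed(5)  # fixed seed so the sketch is reproducible

n, mu, sigma2, reps = 5, 10.0, 4.0, 20000   # hypothetical settings; sigma2 is the true variance
sigma = sigma2 ** 0.5

sum_s2 = 0.0        # running total of the divisor (n - 1) estimates
sum_s2_star = 0.0   # running total of the divisor n estimates
for _ in range(reps):
    y = [random.gauss(mu, sigma) for _ in range(n)]
    y_bar = sum(y) / n
    ss = sum((yi - y_bar) ** 2 for yi in y)
    sum_s2 += ss / (n - 1)
    sum_s2_star += ss / n

mean_s2 = sum_s2 / reps             # expected to be near sigma2 = 4 (unbiased)
mean_s2_star = sum_s2_star / reps   # expected to be near (n - 1) / n * sigma2 = 3.2 (biased low)

print(f"average s^2={mean_s2:.3f}  average s*^2={mean_s2_star:.3f}  true variance={sigma2}")
```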
An estimator is a “likely” value. Because of chance, it is too much to expect that $\bar{Y}$ and $s^2$ would be exactly equal to μ and $\sigma^2$, respectively, for any given data set of size n. Although they may not be “exactly” equal to the value they are estimating, because they are “good” estimators in the above sense, they are likely to be “close.”
• Instead of reporting only the single value of the estimator, we report an interval (based on the estimator) and state that it is likely that the true value of the parameter is in the interval.
• “Likely” means that probability is involved.

Here, we discuss the notion of such an interval, known as a confidence interval (CI), for the particular situation where we wish to estimate μ by the sample mean $\bar{Y}$. Suppose for now that $Y \sim N(\mu, \sigma^2)$, where μ and σ are unknown. We have a random sample of observations $Y_1, \ldots, Y_n$ and wish to estimate μ. Of course, our estimator is $\bar{Y}$. If we wish to make probability statements involving $\bar{Y}$, this is complicated by the fact that $\sigma^2$ is unknown. Thus, even if we are not interested in $\sigma^2$ in its own right, we cannot ignore it and must estimate it anyway. In particular, we will be interested in the statistic
$$\frac{\bar{Y} - \mu}{s_{\bar{Y}}}$$
if we wish to make probability statements about $\bar{Y}$ without knowledge of $\sigma^2$. What kind of probability statement do we wish to make? The chance or randomness that we must contend with arises because $\bar{Y}$ is based on a random sample. The value μ we wish to estimate is a fixed (but unknown) quantity. Thus, our probability statements intuitively should have something to do with the uncertainty of trying to get an understanding of the fixed
value of μ using the variable estimator $\bar{Y}$. We have
$$P\left(-t_{n-1,\alpha/2} \le \frac{\bar{Y} - \mu}{s_{\bar{Y}}} \le t_{n-1,\alpha/2}\right) = 1 - \alpha. \qquad [11]$$
If we rewrite [11] by algebra, we obtain
$$P\left(\bar{Y} - t_{n-1,\alpha/2}\, s_{\bar{Y}} \le \mu \le \bar{Y} + t_{n-1,\alpha/2}\, s_{\bar{Y}}\right) = 1 - \alpha. \qquad [12]$$
It is important to interpret [12] correctly. Even though the μ appears in the middle, this is not a probability statement about μ – remember, μ is a fixed constant! Rather, the probability in [12] has to do with the quantities on either side of the inequalities – these quantities depend on $\bar{Y}$ and $s_{\bar{Y}}$, and thus are subject to chance.
Definition: The interval $(\bar{Y} - t_{n-1,\alpha/2}\, s_{\bar{Y}},\ \bar{Y} + t_{n-1,\alpha/2}\, s_{\bar{Y}})$ is called a 100(1 − α)% CI for μ. For example, if α = 0.05, then (1 − α) = 0.95, and the interval would be called a 95% CI. In general, the value (1 − α) is called the confidence coefficient.
Interpretation: As noted above, the probability associated with the CI has to do with the endpoints, not with probabilities about the value μ. We might be tempted to say “the probability that μ falls in the interval is 0.95,” thinking that the endpoints are fixed and μ may or may not be between them. But the endpoints are what varies here, while μ is fixed (see Fig. 1.1). Once a particular interval has been constructed from the data, it doesn’t make sense to talk about the probability that it covers μ: that interval either covers μ or it does not. We instead talk about our confidence in the statement “the interval covers μ.”
Fig. 1.1. Illustration of Confidence Interval.
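The interpretation above can be illustrated with a small simulation. The sketch below is illustrative only and rests on assumptions that are not from the text: the settings μ = 10, σ = 2, n = 5 and the helper names `t_interval` and `coverage` are hypothetical; the critical value 2.776 is $t_{4,0.025}$ read from a t table; and $s_{\bar{Y}}$ is computed as $s/\sqrt{n}$, the usual estimated standard error of the mean. The fixed value μ never changes; only the endpoints move from sample to sample, and roughly 95% of the intervals should cover μ.

```python
import math
import random
import statistics


def t_interval(sample, t_crit):
    """Return (lower, upper) = Ybar -/+ t_crit * s_Ybar for one sample."""
    n = len(sample)
    ybar = statistics.mean(sample)
    s = statistics.stdev(sample)      # sample standard deviation (n - 1 divisor)
    se = s / math.sqrt(n)             # s_Ybar, assumed to be s / sqrt(n)
    return ybar - t_crit * se, ybar + t_crit * se


def coverage(mu, sigma, n, t_crit, reps, seed=0):
    """Fraction of simulated intervals whose endpoints enclose the fixed mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = t_interval(sample, t_crit)
        if lower <= mu <= upper:
            hits += 1
    return hits / reps


# Hypothetical population: mu = 10, sigma = 2; samples of size n = 5.
# t_crit = 2.776 is t_{4, 0.025} from a t table (95% confidence).
print(coverage(mu=10.0, sigma=2.0, n=5, t_crit=2.776, reps=5000))
```

The printed fraction should be close to 0.95, which is the sense in which we are “95% confident” in any one interval.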
2.1.2. Confidence Interval for a Difference of Population Means
Rarely in real life is our interest confined to a single population. Rather, we are usually interested in conducting experiments to compare populations. For example, an experiment may be set up because we wish to gather evidence about the difference among yields (on the average) obtained with several different rates of fertilizer application. More precisely, we would like to make a
statement about whether treatments give truly different responses. The simplest case is where we wish to compare just two such treatments. We will develop this case here first, then discuss extension to more than two treatments later.
Experimental Procedure: Take two random samples of experimental units (plants, subjects, plots, rats, etc.). Each unit in the first sample receives treatment 1, each in the other receives treatment 2. We would like to make a statement about the difference in responses to the treatments based on this setup. Suppose we wish to compare the effects of two concentrations of a toxic agent on weight loss in rats. We select a random sample of rats from the population of interest and then randomly assign each rat to receive either concentration 1 or concentration 2. The variable of interest is Y = weight loss for a rat. Until the rats receive the treatments, we may assume them all to have arisen from a common population for which Y has some mean μ and variance $\sigma^2$. Because these are continuous measurement data, it is reasonable to assume that $Y \sim N(\mu, \sigma^2)$. Once the treatments are administered, however, the two samples become different. Populations 1 and 2 may be thought of as the original population with all possible rats treated with treatment 1 and 2, respectively. We may thus regard our samples as being randomly selected from these two populations. Because of the nature of the data, it is further reasonable to think about two r.v.’s, $Y_1$ and $Y_2$, one corresponding to each population, and to think of them as being normally distributed:
Population 1: $Y_1 \sim N(\mu_1, \sigma_1^2)$
Population 2: $Y_2 \sim N(\mu_2, \sigma_2^2)$
Notation: Because we are now thinking of two populations, we must adjust our notation accordingly, so we may talk about two different r.v.’s and observations on each of them. Write $Y_{ij}$ to denote the observation from the jth unit receiving the ith treatment, that is, the jth value observed on r.v. $Y_i$.
With this definition, we may thus view our data as follows:
$Y_{11}, Y_{12}, \ldots, Y_{1n_1}$, where $n_1$ = no. of units in sample from population 1
$Y_{21}, Y_{22}, \ldots, Y_{2n_2}$, where $n_2$ = no. of units in sample from population 2.
In this framework, we may now cast our question as follows:
$(\mu_1 - \mu_2) = 0$: no difference
$(\mu_1 - \mu_2) \ne 0$: difference.
An obvious strategy would be to base investigation of this population mean difference on the data from the two samples by
estimating the difference. It may be shown mathematically that, if both of the r.v.’s (one for each treatment) are normally distributed, then the following facts are true:
• The r.v. $(Y_1 - Y_2)$ satisfies $(Y_1 - Y_2) \sim N(\mu_1 - \mu_2, \sigma_D^2)$, where $\sigma_D^2 = \sigma_1^2 + \sigma_2^2$.
• Define
$$\bar{D} = \bar{Y}_1 - \bar{Y}_2, \quad \text{where } \bar{Y}_1 = \frac{1}{n_1}\sum_{j=1}^{n_1} Y_{1j}, \quad \bar{Y}_2 = \frac{1}{n_2}\sum_{j=1}^{n_2} Y_{2j}.$$
That is, $\bar{D}$ is the difference in sample means for the two samples. Then, just as for the sample mean from a single sample, the difference in sample means is also normally distributed, i.e.,
$$\bar{D} \sim N(\mu_1 - \mu_2, \sigma_{\bar{D}}^2), \qquad \sigma_{\bar{D}}^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$
Thus, the mean of the population of all possible differences in sample means from all possible samples from the two populations is the difference in means for the original populations, by analogy to the single population case. Thus, this statistic would follow a standard normal distribution:
$$\frac{\bar{D} - (\mu_1 - \mu_2)}{\sigma_{\bar{D}}}.$$
Intuitively, we can use $\bar{D}$ as an estimator of $(\mu_1 - \mu_2)$. As before, for a single population, we would like to report an interval assessing the quality of the sample evidence, that is, give a CI for $(\mu_1 - \mu_2)$. In practical situations, $\sigma_1^2$ and $\sigma_2^2$ will be unknown. The obvious strategy would be to replace them by estimates. We will consider this first in the simplest case:
• $n_1 = n_2 = n$, i.e., the two samples are of the same size.
• $\sigma_1^2 = \sigma_2^2 = \sigma^2$, i.e., the variances of the two populations are the same. This essentially says that application of the two different treatments affects only the mean of the response, not the variability. In many circumstances, this is not unreasonable.
These simplifying assumptions are not necessary but simply make the notation a bit easier so that the concepts will not be obscured by so many symbols. Under these conditions,
$$\bar{Y}_1 = \frac{1}{n}\sum_{j=1}^{n} Y_{1j}, \quad \bar{Y}_2 = \frac{1}{n}\sum_{j=1}^{n} Y_{2j}, \quad \text{and } \sigma_{\bar{D}}^2 = \frac{2\sigma^2}{n}.$$
If we wish to replace $\sigma_{\bar{D}}^2$, and hence $\sigma^2$, by an estimate, we must first determine a suitable estimate under these conditions. Because the variance is considered to be the same in both populations, it makes sense to use the information from both samples to come up with such an estimate. That is, pool information from the two samples to arrive at a single estimate. A “pooled” estimate of the common $\sigma^2$ is given by
$$s^2 = \frac{(n - 1)s_1^2 + (n - 1)s_2^2}{2(n - 1)} = \frac{s_1^2 + s_2^2}{2}, \qquad [13]$$
where $s_1^2$ and $s_2^2$ are the sample variances from each sample. We use the same notation as for a single sample, $s^2$, as again we are estimating a single population variance (but now from two samples). Note that the pooled estimate is just the average of the two sample variances, which makes intuitive sense. We have written $s^2$ as we have in the middle of [13] to highlight the general form – when we discuss unequal n’s later, it will turn out that the estimator for a common $\sigma^2$ will be a “weighted average” of the two sample variances. Here, the weighting is equal because the n’s are equal. Thus, an obvious estimator for $\sigma_{\bar{D}}^2$ is $s_{\bar{D}}^2 = 2s^2/n$. Just as in the single population case, we would consider the statistic
$$\frac{\bar{D} - (\mu_1 - \mu_2)}{s_{\bar{D}}}, \quad \text{where } s_{\bar{D}} = s\sqrt{\frac{2}{n}}. \qquad [14]$$
It may be shown that this statistic has a Student’s t distribution with 2(n − 1) df.
CI for $(\mu_1 - \mu_2)$: By the same reasoning as in the single population case, then, we may use the last fact to construct a CI for $(\mu_1 - \mu_2)$. In particular, by writing down the same type of probability statement and rearranging, and using the fact that the statistic [14] has a $t_{2(n-1)}$ distribution, we have for confidence coefficient (1 − α),
$$P\left(-t_{2(n-1),\alpha/2} \le \frac{\bar{D} - (\mu_1 - \mu_2)}{s_{\bar{D}}} \le t_{2(n-1),\alpha/2}\right) = 1 - \alpha,$$
$$P\left(\bar{D} - t_{2(n-1),\alpha/2}\, s_{\bar{D}} \le \mu_1 - \mu_2 \le \bar{D} + t_{2(n-1),\alpha/2}\, s_{\bar{D}}\right) = 1 - \alpha.$$
The CI for $(\mu_1 - \mu_2)$ thus is
$$\left(\bar{D} - t_{2(n-1),\alpha/2}\, s_{\bar{D}},\ \bar{D} + t_{2(n-1),\alpha/2}\, s_{\bar{D}}\right).$$
2.2. Inference on Means and Hypothesis Testing
In the last section, we began to discuss formal statistical inference, focusing in particular on inference on a population mean for a single population and on the difference of two means under some simplifying conditions (equal n’s and variances). We saw in these situations how to
• Estimate a single population mean or difference of means. • Make a formal, probabilistic statement about the sampling procedure used to obtain the data, i.e., how to construct a CI for a single population mean or difference of means. Both estimation and construction of CIs are ways of getting an idea of the value of a population parameter of interest (e.g., mean or difference of means) and taking into account the uncertainty involved because of sampling and biological variation. In this section, we delve more into the notions of statistical inference. As with our first exposure to these ideas, probabilistic statements will play an important role. Normality Assumption: As we have previously stated, all of the methods we will discuss are based on the assumption that the data are “approximately” normally distributed, that is, the normal distribution provides a reasonable description of the population(s) of the r.v.(s) of interest. It is important to recognize that this is exactly that, an assumption. It is often valid, but does not necessarily have to be. Always keep this in mind. The procedures we describe may lead to misleading inferences if this assumption is seriously violated. 2.2.1. Hypothesis Tests or Tests of Significance
Problem: Often, we take observations on a sample with a specific question in mind. For example, consider the data on weight gains of rats treated with vitamin A discussed in the last section. Suppose that we know from several years of experience that the average (mean) weight gain of rats of this age and type during a 3-week period when they are not treated with vitamin A is 27.8 mg.
Question: If we treat rats of this age and type with 2.5 units of vitamin A, how does this affect 3-week weight gain? That is, if we could administer 2.5 units of vitamin A to the entire population of rats of this age and type, would the (population) mean weight gain change from what it would be if we did not? Of course, we cannot administer vitamin A to all rats, nor are we willing to wait for several years of accumulated experience to comment. The obvious strategy, as we have discussed, is to plan to obtain a sample of such rats, treat them with vitamin A, and view the sample as being drawn (randomly) from the (unobservable) population of rats treated with vitamin A. This population has (unknown) mean μ. We carry out this procedure and obtain data (weight gains for each rat in the sample). Clearly, our question of interest may be regarded as a question about μ. Either
(i) μ = 27.8 mg, that is, vitamin A treatment does not affect weight gain, and the mean is what we know from vast past experience, 27.8 mg, despite administration of vitamin A, or
(ii) μ ≠ 27.8 mg, that is, vitamin A treatment does have an effect on weight gain.
Statements (i) and (ii) are called statistical hypotheses, H0: μ = 27.8 vs. H1: μ ≠ 27.8, which are the “null” hypothesis and the “alternative” hypothesis (often denoted by HA as well), respectively. A formal statistical procedure for deciding between H0 and H1 is called a hypothesis test or test of significance. We base our “decision” on observation of a sample from the population with mean μ; thus, our decision is predicated on the quality of the sampling procedure and the inherent biological variation in the thing we are studying in the population. Thus, as with CIs, probability will be involved. Suppose in truth that μ = 27.8 mg (i.e., vitamin A has no effect, H0). For the particular sample we ended up with, recall that we observed $\bar{Y}$ = 41.0 mg, say, from n = 5. The key question would be: How “likely” is it that we would see a sample yielding a sample mean $\bar{Y}$ = 41.0 mg if it is indeed true that the population mean μ = 27.8 mg?
• If it is likely, then 41.0 is not particularly unusual; thus, we would not discount H0 as an explanation for what is going on. (We do not reject H0 as an explanation.)
• If it is not likely, then 41.0 is unusual and unexpected. This would cause us to think that perhaps H0 is not a good explanation for what is going on. (Reject H0 as an explanation, as it seems unlikely.)
Consider the generic situation where H0: μ = μ0 vs. H1: μ ≠ μ0, where μ0 is the value of interest (μ0 = 27.8 mg in the rat example). If we assume (“pretend”) H0 is true, then we assume that μ = μ0, and we would like to determine the probability of seeing a sample mean $\bar{Y}$ (our “best guess” for the value of μ) like the one we ended up with. Recall that if the r.v. of interest Y is normal, then we know that, under our assumption that μ = μ0,
$$\frac{\bar{Y} - \mu_0}{s_{\bar{Y}}} \sim t_{n-1}. \qquad [15]$$
That is, a sample mean calculated from a sample drawn from the population of all Y values, when “centered” and “scaled,” behaves like a t r.v. Intuitively, a “likely” value of $\bar{Y}$ would be one for which $\bar{Y}$ is “close” to $\mu_0$. Equivalently, we would expect the value of the statistic [15] to be “small” (close to zero) in magnitude, i.e., $\left|\frac{\bar{Y} - \mu_0}{s_{\bar{Y}}}\right|$ close to 0. To formalize the notion of “unlikely,” suppose we decide that if the probability of seeing the value of $\bar{Y}$ that we saw is less than some small value α, say α =
0.05, then things are sufficiently unlikely for us to be concerned that our pretend assumption μ = μ0 may not hold. We would thus feel that the evidence in the sample (the value of $\bar{Y}$ we saw) is strong enough to refute our original assumption that H0 is true. We know that the probabilities corresponding to values of the statistic in [15] follow the t distribution with (n − 1) df. We thus know that there is a value $t_{n-1,\alpha/2}$ such that
$$P\left(\left|\frac{\bar{Y} - \mu_0}{s_{\bar{Y}}}\right| > t_{n-1,\alpha/2}\right) = \alpha.$$
Values of the statistic that are greater in magnitude than $t_{n-1,\alpha/2}$ are thus “unlikely” in the sense that the chance of seeing them is at most α, the “cut-off” probability for “unlikeliness” we have specified. The value of the statistic [15] we saw in our sample is a realization of a t r.v. Thus, if the value we saw is greater in magnitude than $t_{n-1,\alpha/2}$, we would consider the $\bar{Y}$ we got to be “unlikely,” and we would reject H0 as the explanation for what is really going on in favor of the other explanation, H1.
Implementation: Compare the value of the statistic [15] to the appropriate value $t_{n-1,\alpha/2}$. In the rat example, we take μ0 = 27.8 mg (H0 assumed true). If we have n = 5 and $s_{\bar{Y}}$ = 4.472, then
$$\left|\frac{\bar{Y} - \mu_0}{s_{\bar{Y}}}\right| = \frac{41.0 - 27.8}{4.472} = 2.952.$$
From the table of the t distribution, if we take α = 0.05, we have $t_{4,0.025}$ = 2.776. Comparing the value of the statistic we saw to this value gives 2.952 > 2.776. We thus reject H0 – the evidence in our sample is strong enough to support the contention that mean weight gain is different from μ0 = 27.8, that is, H1, i.e., vitamin A does have an effect on weight gain.
Terminology: The statistic
$$\frac{\bar{Y} - \mu_0}{s_{\bar{Y}}}$$
is called a test statistic. A test statistic is a function of the sample information that is used as a basis for “deciding” between H0 and H1. Another, equivalent way to think about this procedure is in terms of probabilities rather than the value of the test statistic. Our test statistic is a r.v. with a $t_{n-1}$ distribution.
Thus, instead of finding the “cut-off” value for our chosen α and comparing the magnitude of the statistic for our sample to this value, we find the probability of seeing a value of the statistic with the same magnitude as that we saw, and compare this probability to α. That is, if tn−1 represents a r.v. with the t distribution with (n − 1) df, find P (|tn−1 | > value of test statistic we saw)
and compare this probability to α. In our example, from the t table with n − 1 = 4, we find 0.02 < P(|$t_4$| > 2.952) < 0.05. Thus, the probability of seeing what we saw is between 0.02 and 0.05 (small enough) and thus less than α = 0.05. The value of the test statistic we saw, 2.952, is sufficiently unlikely, and we reject H0. These two ways of conducting the hypothesis test are equivalent.
• In the first, we think about the size of the value of the test statistic for our data. If it is “large,” then it is “unlikely.” “Large” depends on the probability α we have chosen to define “unlikely.”
• In the second, we think directly about the probability of seeing what we saw. If the probability is “small” (smaller than α), then the test statistic value we saw was “unlikely.”
• A large test statistic and a small probability are equivalent.
• An advantage of performing the hypothesis test the second way is that we calculate the probability of seeing what we saw – this is useful for thinking about just how “strong” the evidence in the data really is.
Terminology: The value α, which is chosen in advance to quantify the notion of “likely,” is called the significance level or error rate for the hypothesis test. Formally, because we perform the hypothesis test assuming H0 is true, α is thus the probability of rejecting H0 when it really is true. When will we reject H0? There are two scenarios:
(i) H0 really is not true, and this caused the large value of the test statistic we saw (equivalently, the small probability of seeing a statistic like the one we saw).
(ii) H0 is in fact true, but it turned out that we ended up with an unusual sample that caused us to reject H0 nonetheless.
The situation in (ii) is a mistake – we end up making an incorrect judgment between H0 and H1. Unfortunately, because we are dealing with a chance mechanism (random sampling), it is always possible that we might make such a mistake because of uncertainty.
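The two equivalent ways of carrying out the test can be sketched for the rat example as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names `t_pdf` and `two_sided_p` are hypothetical, and instead of a printed t table the tail probability is approximated by numerically integrating the Student's t density with Simpson's rule. The inputs ($\bar{Y}$ = 41.0, μ0 = 27.8, $s_{\bar{Y}}$ = 4.472, n = 5, critical value 2.776) are the values quoted in the text.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_obs, df, steps=2000):
    """Approximate P(|t_df| > |t_obs|) with Simpson's rule (steps must be even)."""
    b = abs(t_obs)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3           # approximates P(0 < t_df < |t_obs|)
    return 1 - 2 * central            # the density is symmetric about zero


# Rat example from the text.
ybar, mu0, se, n = 41.0, 27.8, 4.472, 5
alpha = 0.05
t_crit = 2.776                        # t_{4, 0.025} from a t table

t_stat = (ybar - mu0) / se

# Way 1: compare the magnitude of the statistic to the critical value.
reject_by_critical_value = abs(t_stat) > t_crit

# Way 2: compare the probability of a statistic at least this large to alpha.
p_value = two_sided_p(t_stat, n - 1)
reject_by_p_value = p_value < alpha

print(round(t_stat, 3), reject_by_critical_value, reject_by_p_value)
# prints: 2.952 True True
```

Both comparisons lead to the same decision, and the computed probability falls between 0.02 and 0.05, in agreement with the bounds read from the table.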
A mistake like that in (ii) is called a Type I Error. The hypothesis testing procedure above ensures that we make a Type I error with probability at most α. This explains why α is often called the “error rate.” Terminology: When we reject H0 , we say formally that we “reject H0 at level of significance α.” This states clearly what criterion we used to determine “likely”; if we do not state the level of significance, others have no sense of how stringent or lenient we were in our determination. An observed value of the test statistic
leading to rejection of H0 is said to be (statistically) significant at level α. Again, stating α is essential.
One-Sided and Two-Sided Tests: For the weight gain example, we just considered the particular set of hypotheses H0: μ = 27.8 vs. H1: μ ≠ 27.8 mg. Suppose we are fairly hopeful that vitamin A not only has an effect of some sort on weight gain but in fact causes rats to gain more weight than they would if they were untreated. Under these conditions, it might be of more interest to specify a different alternative hypothesis: H1: μ > 27.8. How would we test H0 against this alternative? As we now see, the principles underlying the approach are similar to those above, but the procedure must be modified to accommodate the particular direction of a departure from H0 in which we are interested. Using the same intuition as before, a “likely” value of $\bar{Y}$ under H0 would be one where the value of the statistic would be close to zero. On the other hand, a value of the statistic that we would expect if H0 were not true but instead H1 were would be large and positive. We thus are interested only in the situation where $\bar{Y}$ is sufficiently far from μ0 in the positive direction. We know that there is a value $t_{n-1,\alpha}$ such that
$$P\left(\frac{\bar{Y} - \mu_0}{s_{\bar{Y}}} > t_{n-1,\alpha}\right) = \alpha.$$
In the rat example with α = 0.05 and n = 5, the table gives $t_{4,0.05}$ = 2.132; because 2.952 > 2.132, we reject H0 in favor of H1: μ > 27.8.
Terminology: The test of hypotheses of the form H0: μ = μ0 vs. H1: μ ≠ μ0 is called a two-sided hypothesis test – the alternative hypothesis specifies that μ is different from μ0, but may be on “either side” of it. Similarly, a test of hypotheses of the form H0: μ = μ0 vs. H1: μ > μ0 or H0: μ = μ0 vs. H1: μ < μ0 is called a one-sided hypothesis test.
Terminology: For either type of test, the value we look up in the t distribution table to which we compare the value of the test statistic is called the critical value for the test. In the one-sided test above, the critical value was 2.132; in the two-sided test, it was 2.776. Note that the critical value depends on the chosen level of significance α and the sample size n.
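The difference between the one-sided and two-sided rules can be sketched in a few lines. The function name `one_sample_t_decision` and the labels "two-sided", "greater", and "less" are hypothetical conveniences, not from the text, and the second sample mean (38.5 mg) is an invented value chosen only to show a case where the two rules disagree. The critical values 2.776 and 2.132 are those quoted above for n = 5 and α = 0.05.

```python
def one_sample_t_decision(ybar, mu0, se, t_crit, alternative):
    """Return (t, reject) for H0: mu = mu0.

    alternative is "two-sided" (H1: mu != mu0), "greater" (H1: mu > mu0)
    or "less" (H1: mu < mu0). The caller supplies t_crit: t_{n-1, alpha/2}
    for the two-sided test and t_{n-1, alpha} for a one-sided test.
    """
    t = (ybar - mu0) / se
    if alternative == "two-sided":
        reject = abs(t) > t_crit
    elif alternative == "greater":
        reject = t > t_crit
    elif alternative == "less":
        reject = t < -t_crit
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return t, reject


# Rat example (n = 5, alpha = 0.05): both rules reject H0.
print(one_sample_t_decision(41.0, 27.8, 4.472, 2.776, "two-sided"))
print(one_sample_t_decision(41.0, 27.8, 4.472, 2.132, "greater"))

# Hypothetical sample mean of 38.5 mg: the statistic lies between the two
# critical values, so only the one-sided test rejects H0.
print(one_sample_t_decision(38.5, 27.8, 4.472, 2.776, "two-sided"))
print(one_sample_t_decision(38.5, 27.8, 4.472, 2.132, "greater"))
```

The second pair of calls shows why the alternative must be fixed before seeing the data: the same sample can lead to different decisions under the two rules.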
The region of the t distribution that leads to rejection of H0 is called the critical region. For example, in the
one-sided test above, the critical region was $\frac{\bar{Y} - \mu_0}{s_{\bar{Y}}} > 2.132$. If we think in terms of probabilities, there is a similar notion. Consider the two-sided test. The probability
$$P\left(\left|\frac{\bar{Y} - \mu_0}{s_{\bar{Y}}}\right| > \text{what we saw}\right)$$
is called the p-value. In the alternative method of testing the hypotheses, this two-sided p-value is compared to α. Reporting a p-value gives more information than just reporting whether or not H0 was rejected. For example, if α = 0.05 and the p-value = 0.049, yes, we might reject H0, but the evidence in the sample might be viewed as “borderline.” On the other hand, if the p-value = 0.001, clearly, we reject H0; the p-value indicates that the chance of seeing what we saw is very small (1/1000).
How to Choose α: So far, our discussion has assumed that we have a priori specified a value α that quantifies our feelings about “likely.” How does one decide on an appropriate value for α in real life? Recall that we mentioned the notion of a particular type of mistake we might make, that of a Type I error. Because we perform the hypothesis test under the assumption that H0 is true, this means that the probability we reject H0 when it is true is at most α. Choosing α thus has to do with how serious a mistake a Type I error might be in the particular applied situation.
Important Example: Suppose the question of interest concerns the efficacy of a costly new drug for the treatment of a certain disease in humans, and the new drug has potentially dangerous side effects. Suppose a study is conducted where sufferers of the disease are randomly assigned to receive either the standard treatment or the new drug (this is called a randomized clinical trial), and suppose that the r.v. of interest Y is survival time for sufferers of the disease. It is known from years of experience with the standard drug that the mean survival time is some value μ0. We hope that the new drug is more effective in the sense that it increases survival, in which case it would be worth its additional expense and the risk of side effects. We thus consider H0: μ = μ0 vs. H1: μ > μ0, where μ = mean survival time under treatment with the new drug.
The data are analyzed, and suppose, unbeknownst to us, our sample leads us to commit a Type I error – we end up rejecting H0 when it is in fact true and claim that the new drug is more effective than the standard drug when in reality it is not! Because the new drug is so expensive and carries the possibility of dangerous side effects, this could be a costly mistake, as patients would be paying more with the risk of dangerous side effects for no real gain over the standard treatment. In a situation like this, it is intuitively clear that we would probably like α to be very small, so that the chance we end
up rejecting H0 when we really shouldn’t is small. In situations where the consequences are not so serious if we make a Type I error, we might choose α to be larger.
Another Kind of Mistake: The sample data might be “unusual” in such a way that we end up not rejecting H0 when it really is not true (so we should have rejected). This type of mistake is called a Type II Error. Because a Type II error is also a mistake, we would like the probability of committing such an error, β, say, to also be small. In many situations, a Type II error is not as serious a mistake as a Type I error (think about a verdict “innocent” vs. “guilty”!). In our drug example, if we commit a Type II error, we infer that the new drug is not effective when it really is. Although this, too, is undesirable, as we are discarding a potentially better treatment, we are no worse off than before we conducted the test, whereas, if we commit a Type I error, we will unduly expose patients to unnecessary costs and risks for no gain.
General Procedure for Hypothesis Testing: Here we summarize the steps in conducting a test of hypotheses about a single population mean. The same principles we discuss here will be applicable in any testing situation.
1. Determine the question of interest. This is the first and foremost issue – no experiment should be conducted unless the scientific questions are well formulated.
2. Express the question of interest in terms of null and alternative (one- or two-sided) hypotheses about μ (before collecting/seeing data!).
3. Choose the significance level α, usually a small value like 0.05. The particular situation (severity of making a Type I error) will dictate the value.
4. Conduct the experiment, collect the data, determine the critical value, and calculate the test statistic. Perform the hypothesis test, either rejecting or not rejecting H0 in favor of H1.
Remarks:
• We do not even collect data until the question of interest has been established!
• You will often see the phrase “Accept H0” used in place of “do not reject H0.” This terminology may be misleading. If we do reject H0, we are saying that the sample evidence is sufficiently strong to suggest that H0 is probably not true. On the other hand, if we do not reject H0, it is because the sample does not contain enough evidence to say it is probably not true. This does not imply that there is enough evidence to say that it probably is true! Tests of hypotheses are set up so that we assume H0 is true and then try to refute it – if we can’t, this doesn’t mean the assumption is true, only that we couldn’t reject it. It could well be that the
alternative hypothesis H1 is indeed true, but, because we got an “unusual” sample, we couldn’t reject H0 – this doesn’t make H0 true. Another way to think of this: We conduct the test based on the presumption of some value μ0 , in the rat example, μ0 = 27.8 . Suppose that we conducted a hypothesis test and did not reject H0 . Now suppose that we changed the value of μ0 say, in the rat example, to μ0 = 27.7 and performed a hypothesis test and, again, did not reject H0 . Both 27.8 and 27.7 cannot be true! If we “accepted” H0 in each case, we have a conflict. • The significance level and critical region are not cast in stone. The results of hypothesis tests should not be viewed with absolute yes/no interpretation, but rather as guidelines for aiding us in interpreting experimental results and deciding what to do next. Often, experimental conditions are so complicated that we can never be entirely assured that the assumptions necessary to validate exactly our statistical methods are satisfied. For example, the assumption of normality may be only an approximation, or may in fact be downright unsuitable. It is thus important to keep this firmly in mind. It has become very popular in a number of applied disciplines to do all tests of hypotheses at level 0.05 regardless of the setting and to strive to find “p-value less than 0.05”; however, if one is realistic, a p-value of, say, 0.049 must be interpreted with these cautionary remarks in mind. Also multiplicity of testing should be accounted for as necessary (See Section 3.2). 2.2.2. Relationship between Hypothesis Testing and Confidence Intervals
Before we discuss more exotic hypotheses and hypothesis tests, we point out something alluded to at the beginning of our discussion. We have already seen that a CI for a single population mean is based on the probability statement
$$P\left(-t_{n-1,\alpha/2} \le \frac{\bar{Y} - \mu}{s_{\bar{Y}}} \le t_{n-1,\alpha/2}\right) = 1 - \alpha. \qquad [16]$$
As we have seen, a two-sided hypothesis test is based on a probability statement of the form
$$P\left(\left|\frac{\bar{Y} - \mu}{s_{\bar{Y}}}\right| > t_{n-1,\alpha/2}\right) = \alpha. \qquad [17]$$
Comparing [16] and [17], a little algebra shows that they are the same (except for the strict vs. not-strict inequalities ≤ and >, which are irrelevant for continuous r.v.’s). Thus, choosing a “small” level of significance α in a hypothesis test is the same as choosing a “large” confidence coefficient (1 − α). Furthermore,
for the same choice α, [16] and [17] show that the following two statements are equivalent: (i) Reject H0 :μ = μ0 at level α based on Y¯ . (ii) μ0 is not contained in a 100(1 − α)% CI for μ based on Y¯ . That is, two-sided hypothesis tests about μ and CIs for μ yield the same information. A similar notion/procedure can be extended to one-sided hypothesis counterparts. 2.2.3. Tests of Hypotheses for the Mean of a Single Population
We introduced the basic underpinnings of tests of hypotheses in the context of a single population mean. Here, we summarize the procedure for convenience. The same reasoning underlies the development of tests for other situations of interest; these are described in subsequent sections.
General form: H0: μ = μ0 vs.
one-sided: H1: μ > μ0 or H1: μ < μ0,
two-sided: H1: μ ≠ μ0.
Test statistic:
$$t = \frac{\bar{Y} - \mu_0}{s_{\bar{Y}}}.$$
Procedure: For level of significance α, reject H0 if
one-sided: $t > t_{n-1,\alpha}$ (for H1: μ > μ0) or $t < -t_{n-1,\alpha}$ (for H1: μ < μ0),
two-sided: $|t| > t_{n-1,\alpha/2}$.
2.2.4. Testing the Difference of Two Population Means
As we have discussed, the usual situation in practice is that in which we would like to compare two competing treatments or compare a treatment to a control.
Scenario: With two populations,
Population 1: $Y_1 \sim N(\mu_1, \sigma_1^2)$, $Y_{11}, Y_{12}, \ldots, Y_{1n_1} \Rightarrow \bar{Y}_1, s_1^2$
Population 2: $Y_2 \sim N(\mu_2, \sigma_2^2)$, $Y_{21}, Y_{22}, \ldots, Y_{2n_2} \Rightarrow \bar{Y}_2, s_2^2$
Because the two samples do not involve the same “experimental units” (the things giving rise to the responses), they may be thought of as independent (totally unrelated).
General form: H0: $\mu_1 - \mu_2 = \delta$ vs.
one-sided: H1: $\mu_1 - \mu_2 > \delta$,
two-sided: H1: $\mu_1 - \mu_2 \ne \delta$,
where δ is some value. Most often, δ = 0, so that the null hypothesis corresponds to the hypothesis of no difference between the population means.
Test statistic: As in the case of constructing CIs for $\mu_1 - \mu_2$, intuition suggests that we base inference on $\bar{Y}_1 - \bar{Y}_2$. The test statistic is
$$t = \frac{\bar{D} - \delta}{s_{\bar{D}}},$$
where $\bar{D} = \bar{Y}_1 - \bar{Y}_2$ and $s_{\bar{D}}$ is an estimate of $\sigma_{\bar{D}}$. Note that the test statistic is constructed under the assumption that the mean of $\bar{D}$, $\mu_1 - \mu_2$, is equal to δ. This is analogous to the one-sample case – we perform the test under the assumption that H0 is true.
The Case of Equal Variances, $\sigma_1^2 = \sigma_2^2 = \sigma^2$: As we have already discussed, it is often reasonable to assume that the two populations have common variance $\sigma^2$. One interpretation is that the phenomenon of interest (say, two different treatments) affects only the mean of the response, not its variability (“signal” may change with changing treatment, but “noise” stays the same). In this case, we “pool” the data from both samples to estimate the common variance $\sigma^2$. The obvious estimator is
$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)}, \qquad [18]$$
where $s_1^2$ and $s_2^2$ are the sample variances for each sample. Thus, [18] is a weighted average of the two sample variances, where the “weighting” is in accordance with the n’s. We have already discussed such “pooling” when the n is the same, in which case this reduces to a simple average. [18] is a generalization to allow differential weighting of the sample variances when the n’s are different. Recall that in general,
$$\sigma_{\bar{D}}^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$
When the variances are the same, this reduces to
$$\sigma_{\bar{D}}^2 = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right),$$
which can be estimated by plugging in the “pooled” estimator for $\sigma^2$. We thus arrive at
$$s_{\bar{D}}^2 = s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right), \qquad s_{\bar{D}} = s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}.$$
Note that the total df across both samples is $(n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$, so the $t_{n_1+n_2-2}$ distribution is relevant.
Procedure: For level of significance α, reject H0 if
one-sided: $t > t_{n_1+n_2-2,\alpha}$,
two-sided: $|t| > t_{n_1+n_2-2,\alpha/2}$.
CIs: Extending our previous results, a $100(1 - \alpha)\%$ CI for $\mu_1 - \mu_2$ would be
$$ \left( \bar{D} - t_{n_1+n_2-2,\,\alpha/2}\, s_{\bar{D}},\ \ \bar{D} + t_{n_1+n_2-2,\,\alpha/2}\, s_{\bar{D}} \right). $$
Also, a one-sided lower confidence bound for $\mu_1 - \mu_2$ would be
$$ \bar{D} - t_{n_1+n_2-2,\,\alpha}\, s_{\bar{D}}. $$
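The equal-variance procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are invented, the function name `pooled_two_sample_t` is ours, and the critical value $t_{8,0.025} = 2.306$ is read from a t table rather than computed.

```python
import math
from statistics import mean, variance


def pooled_two_sample_t(y1, y2, delta=0.0):
    """Return (d_bar, se, t, df) for two independent samples with a common variance."""
    n1, n2 = len(y1), len(y2)
    s1_sq, s2_sq = variance(y1), variance(y2)  # sample variances, divisor n - 1
    df = (n1 - 1) + (n2 - 1)
    s_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df  # pooled estimator [18]
    se = math.sqrt(s_sq * (1.0 / n1 + 1.0 / n2))  # estimated SE of D-bar
    d_bar = mean(y1) - mean(y2)
    t = (d_bar - delta) / se
    return d_bar, se, t, df


# Hypothetical responses, invented for illustration only.
y1 = [10.0, 12.0, 14.0, 16.0, 18.0]
y2 = [9.0, 10.0, 11.0, 12.0, 13.0]

d_bar, se, t, df = pooled_two_sample_t(y1, y2)

# Critical value t_{8, 0.025} taken from a t table (assumes alpha = 0.05, two-sided).
T_CRIT = 2.306
ci = (d_bar - T_CRIT * se, d_bar + T_CRIT * se)
reject_two_sided = abs(t) > T_CRIT

print(f"D-bar = {d_bar:.3f}, SE = {se:.3f}, t = {t:.3f}, df = {df}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), reject H0: {reject_two_sided}")
```

For these invented data the interval contains 0 and the two-sided test does not reject, which illustrates that the test and the CI lead to the same conclusion.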
The relationship between hypothesis tests and CIs is the same as in the single sample case.

The Case of Unequal Variances, $\sigma_1^2 \neq \sigma_2^2$: Cases arise in practice where it is unreasonable to assume that the variances of the two populations are the same. Unfortunately, things get a bit more complicated when the variances are unequal. In this case, we may not “pool” information. Instead, we use the two sample variances to estimate $\sigma_{\bar{D}}$ in the obvious way:
$$ s_{\bar{D}} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}. $$
Note that because we can’t pool the sample variances, we can’t pool df’s, either. It may be shown mathematically that, under these conditions, if we use $s_{\bar{D}}$ calculated in this way in the denominator of our test statistic $(\bar{D} - \delta)/s_{\bar{D}}$, the statistic no longer has exactly a t distribution! In particular, it is not clear what to use for df, as we have estimated two variances separately! It turns out that an approximation is available that may be used under these circumstances. One calculates the following quantity, which is generally not an integer:
$$ \text{“effective df”} = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}{\left( \dfrac{s_1^2}{n_1} \right)^2 \Big/ (n_1 - 1) + \left( \dfrac{s_2^2}{n_2} \right)^2 \Big/ (n_2 - 1)}. $$
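As a rough sketch, the unequal-variance standard error and the “effective df” formula can be evaluated as follows. The data and the function name `unequal_variance_t` are hypothetical, and the critical value would still have to be looked up in a t table.

```python
import math
from statistics import mean, variance


def unequal_variance_t(y1, y2, delta=0.0):
    """Return (t, se, edf) without pooling the two sample variances."""
    n1, n2 = len(y1), len(y2)
    v1 = variance(y1) / n1  # s1^2 / n1
    v2 = variance(y2) / n2  # s2^2 / n2
    se = math.sqrt(v1 + v2)  # estimated SE of D-bar
    t = (mean(y1) - mean(y2) - delta) / se
    # "effective df": generally not an integer
    edf = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, se, edf


# Hypothetical responses, invented for illustration only.
y1 = [10.0, 12.0, 14.0, 16.0, 18.0]
y2 = [9.0, 10.0, 11.0, 12.0, 13.0]

t_stat, se, edf = unequal_variance_t(y1, y2)
edf_rounded = round(edf)  # nearest integer, used as the df of the t distribution

print(f"t = {t_stat:.3f}, SE = {se:.3f}, effective df = {edf:.2f} -> {edf_rounded}")
```

Here the effective df falls between $n_1 - 1 = 4$ and $n_1 + n_2 - 2 = 8$, as one would expect when the two sample variances differ.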
One then rounds the “effective df” to the nearest integer. The approximate effective df are then used as if they were exact; for the critical value, use $t_{edf,\,\alpha}$ (for a one-sided test) and $t_{edf,\,\alpha/2}$ (for a two-sided test), where edf is the rounded “effective df.” It is important to recognize that this is only an approximation – the true distribution of the test statistic is no longer exactly t, but we use the t distribution and the edf as an approximation to the true distribution. Thus, care must be taken in interpreting the results; one should be aware that “borderline” results may not be trustworthy.

2.2.5. Testing Equality of Variances
It turns out that it is possible to construct hypothesis tests to investigate whether or not two variances for two independent samples drawn from two populations are equal.
Warning: Testing hypotheses about variances is a harder problem than testing hypotheses about means. This is because it is easier to get an understanding of the “signal” in a set of data than the “noise” – sample means are better estimators of the population means than sample variances are of the population variances using the same n. Moreover, for the test we are about to discuss to be valid, the assumption of normality is critical. Thus, tests for equality of variances should be interpreted with caution. We wish to test
$$ H_0: \sigma_1^2 = \sigma_2^2 \quad \text{vs.} \quad H_1: \sigma_1^2 \neq \sigma_2^2. $$
It turns out that the appropriate test statistic is the ratio of the two sample variances
$$ F = \frac{\text{larger of } (s_1^2, s_2^2)}{\text{smaller of } (s_1^2, s_2^2)}. $$
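A minimal sketch of this ratio, keeping track of which sample supplies the numerator so that the df are ordered correctly. The data and the name `variance_ratio` are invented for illustration, the smaller sample variance is assumed to be nonzero, and the critical value must be taken from an F table.

```python
from statistics import variance


def variance_ratio(y1, y2):
    """Return (F, numerator df, denominator df) with the larger variance on top."""
    s1_sq, s2_sq = variance(y1), variance(y2)
    if s1_sq >= s2_sq:
        return s1_sq / s2_sq, len(y1) - 1, len(y2) - 1
    return s2_sq / s1_sq, len(y2) - 1, len(y1) - 1


# Hypothetical responses, invented for illustration only.
y1 = [10.0, 12.0, 14.0, 16.0, 18.0]
y2 = [9.0, 10.0, 11.0, 12.0, 13.0]

f_ratio, df_num, df_den = variance_ratio(y1, y2)
print(f"F = {f_ratio:.2f} on ({df_num}, {df_den}) df")
```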
The F Distribution: The sampling distribution of the test statistic F may be derived mathematically, just as for the distributions of our test statistics for means. In general, for normal data, a ratio of two sample variances from independent samples with sample sizes $n_N$ (numerator) and $n_D$ (denominator) has what is known as the F distribution with $(n_N - 1)$ and $(n_D - 1)$ df’s, denoted as $F_{n_N-1,\,n_D-1}$. This distribution has shape similar to that of the $\chi^2$ distribution. Tables of the probabilities associated with this distribution are widely available in statistical textbooks.

Procedure: Reject $H_0$ at level of significance $\alpha$ if $F > F_{r,s,\alpha/2}$, with r (numerator) and s (denominator) df.

2.2.6. Comparing Population Means Using Fully Paired Comparisons
As we will see over and over, the analysis of a set of data is dictated by the design. For example, in the developments so far on testing for differences in means and variances, it is necessary that the two samples be independent (i.e., completely unrelated); this is a requirement for the underlying mathematical theory to be valid. Furthermore, although we haven’t made much of it, the fact that the samples are independent is a consequence of experimental design – the experimental units in each sample do not overlap and were assigned treatments randomly. Thus, the methods we have discussed are appropriate for the particular experimental design. If we design the experiment differently, then different methods will be appropriate.

If it is suspected in advance that $\sigma_1^2$ and $\sigma_2^2$ may not be equal, an alternative strategy is to use a different experimental design to conduct the experiment. It turns out that for this design, the appropriate methods of analysis do not depend on whether the variances for the two populations are the same. In addition, the design may make more efficient use of experimental resources. The idea is to make comparisons within pairs of experimental units that may tend to be more alike than other pairs. The effect is that appropriate methods for analysis are based on considering differences within pairs rather than differences between the two samples overall, as in our previous work. The fact that we deal with pairs may serve to eliminate a source of uncertainty, and thus lead to more precise comparisons. The type of design is best illustrated by example.

Example: (Dixon and Massey, 1969, Introduction to Statistical Analysis, p. 122). A certain stimulus is thought to produce an increase in mean systolic blood pressure in middle-aged men. One way to design an experiment to investigate this issue would be to randomly select a group of middle-aged men and then randomly assign each man to either receive the stimulus or not. We would think of two populations, those of all middle-aged men with and without the stimulus, and would be interested in testing whether the mean for the stimulated population is greater than that for the unstimulated population. We would have two independent samples, one from each population, and the methods of the previous sections would be applicable.

In this setup, variability among all men as well as variability within the two groups may make it difficult for us to detect differences. In particular, recall that variability in the sample mean difference $\bar{D}$ is characterized by the estimate $s_{\bar{D}}$, which appears in the denominator of our test statistic. If $s_{\bar{D}}$ is large, the test statistic will be small, and it is likely that $H_0$ will not be rejected. Even if there is a real difference, the statistical procedure may have a difficult time identifying it because of all the variability. If we could eliminate the effect of some of the variability inherent in experimental material, we might be able to overcome this problem.
In particular, if we designed the experiment in a different way, we might be able to eliminate the impact of a source of variation, thus ending up with a more sensitive statistical test (that will be more likely to detect a real difference if one exists). A better design that is in this spirit is as follows, and seems like a natural approach in a practical sense as well. Rather than assigning men to receive one treatment or the other, obtain a response from each man under each treatment! That is, obtain a random sample of middle-aged men and take two readings on each man, with and without the stimulus. This might be carried out using a before–after strategy, or, alternatively, the ordering for each man could be different. We will thus assume for simplicity that measurements on each man are taken in a before–after fashion and that there is no consequence to order. To summarize,

Design    Type of difference    Sources of variation
1         Across men            Among men, within men
2         Within men            Within men
In this second design, we still may think of two populations, those of all men with and without the stimulus. What changes in the second design is how we have “sampled” from these populations. The two samples are no longer independent, because they involve the same men. Thus, different statistical methods are needed.

Let $Y_1$ and $Y_2$ be the two r.v.’s representing the paired observations, e.g., in our example,
$Y_1$ = systolic blood pressure after stimulus,
$Y_2$ = systolic blood pressure before stimulus.
The data are the pairs $(Y_{1j}, Y_{2j})$, pairs of observations from the jth man, $j = 1, \ldots, n$. Let
$$ D_j = Y_{1j} - Y_{2j} = \text{difference for the } j\text{th pair}. $$
The relevant population is thus the population of the r.v. D, on which we have observations $D_1, \ldots, D_n$! If we think of the r.v.’s $Y_1$ and $Y_2$ as having the means $\mu_1$ and $\mu_2$, our hypotheses are
$$ H_0: \mu_1 - \mu_2 = \delta \quad \text{vs.} \quad H_1: \mu_1 - \mu_2 \neq \delta $$
for a two-sided test, where, as before, $\delta$ is often 0. These hypotheses may be regarded as a test about the mean of the population of all possible differences (i.e., the r.v. D), which has this same mean. To remind ourselves of our perspective, we could think of the mean of the population of differences as $\mathrm{diff} = \mu_1 - \mu_2$ and express our hypotheses equivalently in terms of diff, e.g.,
$$ H_0: \mathrm{diff} = \delta \quad \text{vs.} \quad H_1: \mathrm{diff} \neq \delta. $$
Note that once we begin thinking this way, it is clear what we have – we are interested in testing hypotheses concerning the value of a single population mean, diff (that of the hypothetical population of differences)! Thus, the appropriate analysis is that for a single population mean based on a single sample applied to the observed differences $D_1, \ldots, D_n$. Compute
$$ \bar{D} = \frac{1}{n} \sum_{j=1}^{n} D_j = \text{sample mean} $$
and
$$ s_D^2 = \frac{1}{n-1} \left( \sum_{j=1}^{n} D_j^2 - \frac{\left( \sum_{j=1}^{n} D_j \right)^2}{n} \right) = \text{sample variance}. $$
It turns out that the sample mean of the differences is algebraically equivalent to the difference of the individual sample means, that is,
$$ \bar{D} = \bar{Y}_1 - \bar{Y}_2, $$
so that the calculation may be done either way. The SE for the sample mean $\bar{D}$ is, by analogy to the single sample case,
$$ s_{\bar{D}} = \sqrt{\frac{s_D^2}{n}} = \frac{s_D}{\sqrt{n}}. $$
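The paired calculation can be sketched as follows, with the t statistic formed from the differences exactly as in the single sample case. The blood pressure readings are invented and the function name `paired_t` is ours; the sketch also checks the computational formula for $s_D^2$ and the identity $\bar{D} = \bar{Y}_1 - \bar{Y}_2$.

```python
import math
from statistics import mean, stdev


def paired_t(y1, y2, delta=0.0):
    """Return (d_bar, s_d, se, t, df) computed from within-pair differences."""
    if len(y1) != len(y2):
        raise ValueError("paired samples must have the same length")
    d = [a - b for a, b in zip(y1, y2)]  # D_j = Y_1j - Y_2j
    n = len(d)
    d_bar = mean(d)
    s_d = stdev(d)  # sample SD of the differences
    se = s_d / math.sqrt(n)  # SE of D-bar
    t = (d_bar - delta) / se
    return d_bar, s_d, se, t, n - 1


# Hypothetical systolic blood pressures (after, before) for n = 5 men.
after = [128.0, 131.0, 125.0, 140.0, 136.0]
before = [120.0, 128.0, 124.0, 131.0, 130.0]

d_bar, s_d, se, t, df = paired_t(after, before)

# Computational ("shortcut") form of the sample variance of the differences.
d = [a - b for a, b in zip(after, before)]
shortcut_var = (sum(x * x for x in d) - sum(d) ** 2 / len(d)) / (len(d) - 1)

print(f"D-bar = {d_bar:.2f}, s_D = {s_d:.3f}, SE = {se:.3f}, t = {t:.3f}, df = {df}")
```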
Note that we have used the same notation, $s_{\bar{D}}$, as we did in the case of two independent samples, but the calculation and interpretation are different here. We thus have the following:

Test statistic:
$$ t = \frac{\bar{D} - \delta}{s_{\bar{D}}} $$
Note that the denominator of our test statistic depends on $s_D$, the sample SD of the differences. This shows formally the important point: the relevant variation for comparing differences is that within pairs, and this sample SD measures precisely this quantity, the variability among differences on pairs.

Procedure: For level of significance $\alpha$, reject $H_0$ if
one-sided: $t > t_{n-1,\,\alpha}$,
two-sided: $|t| > t_{n-1,\,\alpha/2}$.

2.2.7. Power, Sample Size, and Detection of Differences
We now return to the notion of incorrect inferences in hypothesis tests. Recall that there are two types of “mistakes” we might make when conducting such a test: Type I error: reject H0 when it really is true, P(Type I error) = α Type II error: do not reject H0 when it really isn’t true, P(Type II error) = β. Because both Type I and II errors are mistakes, we would ideally like both α and β to be small. Because Type I error is often more serious, the usual approach is to fix the Type I error (i.e., that level of significance α) first. So far, we have not discussed the implications of Type II error and how it might be taken into account when setting up an experiment. Power of a Test: If we do not commit a Type II error, then we reject H0 when H0 is not true, i.e., infer H1 is true when H1 really is true. Thus, if we do not commit a Type II error, we have made a correct judgment; moreover, we have done precisely what we
hoped to do – detect a departure from $H_0$ when in fact such a departure (difference) really does exist. We call
$$ 1 - \beta = P(\text{reject } H_0 \text{ when } H_0 \text{ is not true}) $$
the power of the hypothesis test. Clearly, high power is a desirable property for a test to have:
low probability of Type II error ⇔ high power ⇔ high probability to detect a difference if one exists.

Intuition: Power of a test is a function of how different the true value of $\mu$ or $\mu_1 - \mu_2$ is from the null hypothesis value $\mu_0$ or $\delta$. If the difference between the true value and the null hypothesis value is small, we might not be too successful at detecting this. If the difference is large, we are apt to be more successful. We can be more formal about this. To illustrate, we will consider the simple case of testing hypotheses about the value of the mean of a single population, $\mu$. The same principles hold for tests on differences of means and in fact any test. Consider in particular the one-sided test
$$ H_0: \mu = \mu_0 \quad \text{vs.} \quad H_1: \mu > \mu_0. $$
Recall that the test statistic is
$$ \frac{\bar{Y} - \mu_0}{s_{\bar{Y}}}. $$
This test statistic is based on the idea that a large observed value of $\bar{Y}$ is evidence that the true value of $\mu$ is greater than $\mu_0$. We reject $H_0$ when
$$ \frac{\bar{Y} - \mu_0}{s_{\bar{Y}}} > t_{n-1,\,\alpha}. $$
To simplify our discussion, let us assume that we know $\sigma^2$, the variance of the population, and hence we know $\sigma_{\bar{Y}}$. Then
$$ \frac{\bar{Y} - \mu_0}{\sigma_{\bar{Y}}} \sim N(0, 1), \qquad [19] $$
if $H_0$ is true. In this situation, rather than compare the statistic to the t distribution, we would compare it to the standard normal distribution. Let $z_\alpha$ denote the value satisfying $P(Z > z_\alpha) = \alpha$, for a standard normal r.v. Z. Then we would conduct the test at level $\alpha$ by rejecting $H_0$ when
$$ \frac{\bar{Y} - \mu_0}{\sigma_{\bar{Y}}} > z_\alpha. $$
Now if $H_0$ is not true, then it must be that $\mu \neq \mu_0$ but, instead, $\mu$ = some other value, say $\mu_1 > \mu_0$. Under these conditions, in reality, the statement in [19] is not true. Instead, the statistic that really has a standard normal distribution is
$$ \frac{\bar{Y} - \mu_1}{\sigma_{\bar{Y}}} \sim N(0, 1). \qquad [20] $$
What we would like to do is evaluate power; thus, we need probabilities under these conditions (when $H_0$ is not true). Thus, the power of the test is
$$ 1 - \beta = P(\text{reject } H_0 \text{ when } \mu = \mu_1) = P\left( \frac{\bar{Y} - \mu_0}{\sigma_{\bar{Y}}} > z_\alpha \right) = P\left( \frac{\bar{Y} - \mu_1}{\sigma_{\bar{Y}}} + \frac{\mu_1 - \mu_0}{\sigma_{\bar{Y}}} > z_\alpha \right), $$
where this last expression is obtained by adding and subtracting the quantity $\mu_1/\sigma_{\bar{Y}}$ on the left-hand side of the inequality. Rearranging, we get
$$ 1 - \beta = P\left( \frac{\bar{Y} - \mu_1}{\sigma_{\bar{Y}}} > z_\alpha - \frac{\mu_1 - \mu_0}{\sigma_{\bar{Y}}} \right). $$
Now, under what is really going on, the quantity on the left-hand side of the inequality in this probability statement is a standard normal r.v., as noted in [20]. Thus, the probability $(1 - \beta)$ is the probability that a standard normal r.v. Z exceeds the value $z_\alpha - (\mu_1 - \mu_0)/\sigma_{\bar{Y}}$, i.e.,
$$ 1 - \beta = P\left( Z > z_\alpha - \frac{\mu_1 - \mu_0}{\sigma_{\bar{Y}}} \right). $$
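As a sketch, this power formula can be evaluated with the standard normal distribution from Python's standard library. The function name `power_one_sided` and the numerical inputs are illustrative assumptions; we also assume the usual relation $\sigma_{\bar{Y}} = \sigma/\sqrt{n}$ with $\sigma$ known.

```python
import math
from statistics import NormalDist


def power_one_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the one-sided z test of H0: mu = mu0 when the true mean is mu1."""
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1.0 - alpha)  # P(Z > z_alpha) = alpha
    sigma_ybar = sigma / math.sqrt(n)  # assumes sigma is known
    return 1.0 - z.cdf(z_alpha - (mu1 - mu0) / sigma_ybar)


# Hypothetical settings: true mean half an SD above mu0.
power_25 = power_one_sided(mu0=0.0, mu1=0.5, sigma=1.0, n=25)
power_100 = power_one_sided(mu0=0.0, mu1=0.5, sigma=1.0, n=100)
size = power_one_sided(mu0=0.0, mu1=0.0, sigma=1.0, n=25)  # equals alpha when mu1 = mu0

print(f"power (n = 25) = {power_25:.3f}, power (n = 100) = {power_100:.3f}")
```

With these inputs the power is roughly 0.80 for n = 25 and increases with n, and when $\mu_1 = \mu_0$ the expression reduces to $\alpha$.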
That is, power is a function of $\mu_1$, what is really going on.

2.2.8. Balancing α and β and Sample Size Determination
The theoretical results of the previous subsection show that n helps to determine power. We also discussed the idea of a “meaningful” (in a scientific sense) difference to be detected. This suggests that if, for a given application, we can state what we believe to be a scientifically meaningful difference, we might be able to determine the appropriate n to ensure high power to detect this difference. We now see that, once $\alpha$ has been set, we might also like to choose $\beta$, and thus the power $1 - \beta$ of detecting a meaningful difference at this level of significance. It follows that once we have determined
• The level of significance, $\alpha$
• A scientifically meaningful departure from $H_0$
• Power, $1 - \beta$, with which we would like to be able to detect such a difference
we would like to determine the appropriate n to achieve these objectives. For example, we might want $\alpha = 0.05$ and an 80% chance of detecting a particular difference of interest. Let $z_\alpha$ be the value such that $P(Z > z_\alpha) = \alpha$ for a standard normal r.v. Z.

Procedure: For a test between two population means $\mu_1$ and $\mu_2$, choose the n so that
$$ \text{one-sided test:} \quad n = \frac{(z_\alpha + z_\beta)^2\, \zeta_D}{\mathrm{diff}^2}, $$
$$ \text{two-sided test:} \quad n = \frac{(z_{\alpha/2} + z_\beta)^2\, \zeta_D}{\mathrm{diff}^2}. $$
Here, diff is the meaningful difference we would like to detect, and depending on the type of design
Two independent samples: $\zeta_D = 2\sigma^2$,
Paired comparison: $\zeta_D = \sigma_D^2$,
where $\sigma^2$ is the (assumed common) variance for the two populations and $\sigma_D^2$ is the true variance of the population of differences $D_j$. For the two independent samples case, the value of n obtained is the number of experimental units needed in each sample. For the paired comparison case, n is the total number of experimental units (each will be seen twice).

Slight Problem: We usually do not know $\sigma_D^2$ or $\sigma^2$. Some practical solutions are as follows:
• Rather than express the “meaningful difference,” diff, in terms of the actual units of the response (e.g., diff = 5 mg for the rat experiment), express it in units of the SD of the appropriate response. For example, for a test based on two independent samples, we might state that we wish to detect a difference the size of one SD of the response. We would thus take diff = $\sigma$, so that the factor
$$ \frac{\zeta_D}{\mathrm{diff}^2} = \frac{2\sigma^2}{\sigma^2} = 2. $$
For a paired comparison, we might specify the difference to be detected in terms of SD of a response difference D. If we wanted a one SD difference, we would take diff = $\sigma_D$.
• Another approach is to substitute estimates for $\sigma_D^2$ or $\sigma^2$ from previous studies.

Another Slight Problem: The tests we actually carry out in practice are based on the t distribution, because we don’t know $\sigma^2$ or $\sigma_D^2$ but rather estimate them from the data. In the procedures above, however, we used the values $z_\alpha$, $z_{\alpha/2}$, and $z_\beta$ from the standard normal distribution. There are several things one can do:
• Nothing formal. The n’s calculated this way are only rough guidelines, because so much approximation is involved, e.g., having to estimate or “guess at” $\sigma^2$ or $\sigma_D^2$, assume the data are normal, and so on. Thus, one might regard the calculated n as a conservative choice and use a slightly bigger n in real application.
• Along these lines, theoretical calculations may be used to adjust for the fact that the tests are based on the t rather than standard normal distribution. Specifically, one may calculate an appropriate “correction factor” to inflate the n slightly.
(i) Use the appropriate formula to get n
(ii) Multiply n by the correction factor
$$ \frac{2n + 1}{2n - 1} \ \text{for two independent samples and} \ \frac{n + 2}{n} \ \text{for a paired design.} $$

Remark: Calculation of an appropriate n, at least as a rough guideline, should always be undertaken. There is no point in spending resources (subjects/animals, time, and money) to do an experiment that has very little chance of detecting the scientifically meaningful difference in which one is interested! This should always be carried out in advance of performing an experiment – knowing the n was too small after the fact is not very helpful. Also note that we estimate “minimal n” rather than suitable n.

Fun Fact about History of t-test (from Wikipedia): The t statistic was introduced by William Sealy Gosset for cheaply monitoring the quality of beer brews (“Student” was his pen name). Gosset was a statistician for the Guinness brewery in Dublin, Ireland, and was hired due to Claude Guinness’s innovative policy of recruiting the best graduates from Oxford and Cambridge to apply biochemistry and statistics to Guinness’ industrial processes. Gosset published the t-test in Biometrika in 1908, but was forced to use a pen name by his employer, who regarded the fact that they were using statistics as a trade secret. In fact, Gosset’s identity was unknown not only to fellow statisticians. Today, the t-test is more generally applied to the confidence that can be placed in judgments made from small samples.
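The sample size procedure of this subsection, including the correction factor, can be sketched as follows. The function names are ours, and two choices are assumptions rather than anything the text prescribes: the correction factor is applied to the unrounded n, and the result is rounded up to a whole number of experimental units at the end.

```python
import math
from statistics import NormalDist


def raw_sample_size(diff, zeta_d, alpha=0.05, power=0.80, two_sided=True):
    """n = (z_a + z_beta)^2 * zeta_D / diff^2, using standard normal quantiles."""
    z = NormalDist()
    tail = alpha / 2.0 if two_sided else alpha
    z_a = z.inv_cdf(1.0 - tail)  # z_{alpha/2} or z_alpha
    z_beta = z.inv_cdf(power)  # P(Z > z_beta) = beta = 1 - power
    return (z_a + z_beta) ** 2 * zeta_d / diff ** 2


def corrected_sample_size(n, paired):
    """Inflate n to allow for using the t rather than the normal distribution."""
    factor = (n + 2.0) / n if paired else (2.0 * n + 1.0) / (2.0 * n - 1.0)
    return n * factor


# Detect a one-SD difference (sigma = 1 is an arbitrary scale) with alpha = 0.05
# two-sided and 80% power.
n_indep = raw_sample_size(diff=1.0, zeta_d=2.0)  # zeta_D = 2 * sigma^2, per sample
n_paired = raw_sample_size(diff=1.0, zeta_d=1.0)  # zeta_D = sigma_D^2, total pairs

n_indep_final = math.ceil(corrected_sample_size(n_indep, paired=False))
n_paired_final = math.ceil(corrected_sample_size(n_paired, paired=True))

print(f"two independent samples: raw n = {n_indep:.2f}, corrected -> {n_indep_final} per sample")
print(f"paired design: raw n = {n_paired:.2f}, corrected -> {n_paired_final} pairs")
```

Because diff is expressed in SD units, the result does not depend on the value chosen for $\sigma$; only the ratio $\zeta_D/\mathrm{diff}^2$ matters.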
3. Analysis of Variance (ANOVA)

3.1. One-Way Classification and ANOVA
The purpose of an experiment is often to investigate differences among treatments. In particular, in our statistical model framework, we would like to compare the (population) means of the responses to each treatment. We have already discussed designs (two independent samples, pairing) for comparing two treatment means. In this section, we begin our study of more complicated problems and designs by considering the comparison of more than two treatment means. Recall that in order to detect differences if they really exist, we must try to control the effects of experimental error, so that any variation we observe can be attributed mainly to the effects of the treatments rather than to differences among the experimental units to which the treatments are applied. We discussed the idea that designs involving meaningful grouping of experimental units are the key to reducing the effects of experimental error,
by identifying components of variation among experimental units that may be due to something besides inherent biological variation among them. The paired design for comparing two treatments is an example of such a design. Before we can talk about grouping in the more complicated scenario involving more than two treatments, it makes sense to talk about the simplest setting in which we compare several treatment means. This is basically an extension of the “two independent samples” design to more than two treatments. One-Way Classification: Consider an experiment to compare several treatment means set up as follows. We obtain (randomly, of course) experimental units for the experiment and randomly assign them to treatments so that each experimental unit is observed under one of the treatments. In this situation, the samples corresponding to the treatment groups are independent (the experimental units in each treatment sample are unrelated). We do not attempt to group experimental units according to some factor (e.g., gender). In this experiment, then, the only way in which experimental units may be “classified” is with respect to which treatment they received. Other than the treatments, they are viewed as basically alike. Hence, such an arrangement is often called a one-way classification. Example: In a laboratory experiment for which the experimental material may be some chemical mixture to be divided into beakers (the experimental units) to which treatments will be applied, and experimental conditions are the same for all beakers, we would not expect much variation among the beakers before the treatments were applied. In this situation, grouping beakers would be pointless. We would thus expect that, once we apply the treatments, any variation in responses across beakers will mainly be due to the treatments, as beakers are pretty much alike otherwise. 
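The random assignment of experimental units to treatments described above can be sketched as follows. The beaker labels, treatment names, and the function `assign_at_random` are hypothetical, and equal replication (r units per treatment) is assumed.

```python
import random


def assign_at_random(units, treatments, r, seed=None):
    """Randomly assign r experimental units to each treatment, each unit used once."""
    if len(units) != r * len(treatments):
        raise ValueError("need exactly r units per treatment")
    rng = random.Random(seed)  # seed only makes the illustration reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)  # every ordering of the units is equally likely
    return {trt: shuffled[i * r:(i + 1) * r] for i, trt in enumerate(treatments)}


# Hypothetical experiment: 12 beakers, t = 3 treatments, r = 4 replicates each.
beakers = [f"beaker-{k}" for k in range(1, 13)]
allocation = assign_at_random(beakers, ["A", "B", "C"], r=4, seed=2010)

for treatment, assigned in allocation.items():
    print(treatment, sorted(assigned))
```

Shuffling the whole list and cutting it into blocks of length r gives every unit the same chance of receiving any of the treatments.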
Complete Randomization: When experimental units are thought to be basically alike, and are thus expected to exhibit a small amount of variation from unit-to-unit, grouping them really would not add much precision to an experiment. If there is no basis for grouping, and thus treatments are to be simply assigned to experimental units without regard to any other factors, then, as noted above, this should be accomplished according to some chance (random) mechanism. All experimental units should have an equal chance of receiving any of the treatments. When randomization is carried out in this way, it is called complete randomization. This is to distinguish the scheme for treatment allocation from more complicated methods involving grouping, which we will talk about later. Advantages: • Simplicity of implementation and analysis.
• The size of the experiment is limited only by the availability of experimental units. No special considerations for different types of experimental units are required.

Disadvantages:
• Experimental error, our assessment of variation believed to be inherent among experimental units (not systematic), includes all (both inherent and potential systematic) sources. If it turns out unexpectedly that some of the variation among experimental units is indeed due to a systematic component, it will not be possible to “separate it out” of experimental error, and comparisons will lack precision. In such a situation, a more complicated design involving grouping should have been used up-front. Thus, we run the risk of low precision and power if something unexpected arises.

3.1.1. ANOVA
We wish to determine if differences exist among means for responses to treatments; however, the general procedure for inferring whether such differences exist is called “analysis of variance”.

ANOVA: This is the name given to a general class of procedures that are based roughly on the following idea. We have already spoken loosely of attributing variation to treatments as being “equivalent” to determining if a difference exists in the underlying population treatment means. It turns out that it may be shown that there is actually a more formal basis to this loose way of speaking, and it is this basis that gives the procedure its name. It is easiest to understand this in the context of the one-way classification; however, the basic premise is applicable to more complicated designs that we will discuss later.

Notation: To facilitate our further development, we will change slightly our notation for denoting a sample mean. As we will see shortly, we will need to deal with several different types of means for the data, and this notation would be a bit easier. Let t denote the number of treatments. Let
$Y_{ij}$ = response on the jth experimental unit on treatment i.
Here, $i = 1, \ldots, t$. (We consider first the case where only one observation is taken on each experimental unit, so that the experimental unit = the sampling unit.) We will consider for simplicity the case where the same number of experimental units, that is, replicates, are assigned to each treatment. To highlight the term replication, we let
r = number of experimental units, or replicates, per treatment.
Remark: Thus, r replaces our previous notation n. We will denote the sample mean for treatment i by
$$ \bar{Y}_{i\cdot} = \frac{1}{r} \sum_{j=1}^{r} Y_{ij}. $$
The only difference between this notation and our previous notation is the use of the “·” in the subscript. This usage is fairly standard and reminds us that the mean was taken by summing over the subscript in the second position, j. Also define
$$ \bar{Y}_{\cdot\cdot} = \frac{1}{rt} \sum_{i=1}^{t} \sum_{j=1}^{r} Y_{ij}. $$
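To make the dot notation concrete, here is a small sketch with invented data in which `y[i][j]` plays the role of $Y_{ij}$ (with Python's zero-based indices); the variable names are ours.

```python
from statistics import mean

# Hypothetical responses: t = 3 treatments (rows), r = 4 replicates each.
y = [
    [20.0, 22.0, 19.0, 23.0],
    [25.0, 27.0, 26.0, 30.0],
    [18.0, 17.0, 21.0, 20.0],
]

t = len(y)     # number of treatments
r = len(y[0])  # replicates per treatment

# Y-bar_{i.}: average over the second subscript j within treatment i.
treatment_means = [mean(row) for row in y]

# Y-bar_{..}: average over both subscripts, i.e., over all r * t observations.
grand_mean = mean([value for row in y for value in row])

print("treatment means:", treatment_means)
print(f"grand mean over {r * t} observations: {grand_mean:.4f}")
```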
Note that the total number of observations in the experiment is $r \times t = rt$, r replicates on each of t treatments. Thus, $\bar{Y}_{\cdot\cdot}$ represents the sample mean of all the data, across all replicates and treatments. The double dots make it clear that the summing has been performed over both subscripts.

Setup: Consider first the case of t = 2 treatments with two independent samples. Suppose that the population variance is the same for each treatment and equal to $\sigma^2$. Recall that our test statistic for the hypotheses
$$ H_0: \mu_1 - \mu_2 = 0 \quad \text{vs.} \quad H_1: \mu_1 - \mu_2 \neq 0 $$
was (in our new notation)
$$ \frac{\bar{Y}_{1\cdot} - \bar{Y}_{2\cdot}}{s_{\bar{D}}}, $$
where now $\bar{D} = \bar{Y}_{1\cdot} - \bar{Y}_{2\cdot}$. Here, we have taken $\delta = 0$ and considered the two-sided alternative hypothesis, as we are interested in just a difference. For t > 2 treatments, there is no obvious generalization of this setup. Now, we have t population means, say $\mu_1, \mu_2, \ldots, \mu_t$. Thus, the null and alternative hypotheses are now
$H_0$: the $\mu_i$ are all equal vs. $H_1$: the $\mu_i$ are not all equal

Idea: The idea to generalize the t = 2 case is to think instead about estimating variances, as follows. This may seem totally irrelevant, but when you see the end result, you’ll see why! Assume that the data are normally distributed with the same variance for all t treatment populations, that is, $Y_{ij} \sim N(\mu_i, \sigma^2)$. How would we estimate $\sigma^2$? The obvious approach is to generalize what we did for two treatments and “pool” the sample variances across all t treatments. If we write $s_i^2$ to denote the sample variance for the data on treatment i, then
$$ s_i^2 = \frac{1}{r-1} \sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{i\cdot})^2. $$
The estimate would be the average of all t sample variances (because r is the same for all samples), so the “pooled” estimate would be
$$ \frac{(r-1)s_1^2 + (r-1)s_2^2 + \cdots + (r-1)s_t^2}{t(r-1)} = \frac{\sum_{j=1}^{r} (Y_{1j} - \bar{Y}_{1\cdot})^2 + \sum_{j=1}^{r} (Y_{2j} - \bar{Y}_{2\cdot})^2 + \cdots + \sum_{j=1}^{r} (Y_{tj} - \bar{Y}_{t\cdot})^2}{t(r-1)}. \qquad [21] $$
As in the case of two treatments, this estimate makes sense regardless of whether $H_0$ is true. It is based on deviations from each mean separately, through the sample variances, so it doesn’t matter whether the true means are different or the same – it is still a sensible estimate. Now recall that, if a sample arises from a normal population, then the sample mean is also normally distributed. Thus, this should hold for each of our t samples, which may be written as
$$ \bar{Y}_{i\cdot} \sim N\left( \mu_i, \frac{\sigma^2}{r} \right) \qquad [22] $$
(that is, $\sigma_{\bar{Y}_{i\cdot}}^2 = \sigma^2/r$). Now consider the null hypothesis; under $H_0$, all the treatment means are the same and thus equal the same value, $\mu$, say. That is, under $H_0$, $\mu_i = \mu$, $i = 1, \ldots, t$. Under this condition, [22] becomes
$$ \bar{Y}_{i\cdot} \sim N\left( \mu, \frac{\sigma^2}{r} \right) $$
for all $i = 1, \ldots, t$. Thus, if $H_0$ really were true, we could view the sample means $\bar{Y}_{1\cdot}, \bar{Y}_{2\cdot}, \ldots, \bar{Y}_{t\cdot}$ as being just a random sample from a normal population with mean $\mu$ and variance $\sigma^2/r$. Consider under these conditions how we might estimate the variance of this population, $\sigma^2/r$. The obvious estimate would be the sample variance of our “random sample” from this population, the t sample means. Recall that sample variance is just the sum of squared deviations from the sample mean, divided by (sample size − 1). Here, our sample size is t, and the sample mean is
$$ \frac{1}{t} \sum_{i=1}^{t} \bar{Y}_{i\cdot} = \frac{1}{t} \sum_{i=1}^{t} \left( \frac{1}{r} \sum_{j=1}^{r} Y_{ij} \right) = \frac{1}{rt} \sum_{i=1}^{t} \sum_{j=1}^{r} Y_{ij} = \bar{Y}_{\cdot\cdot}. $$
That is, the mean of the sample means is just the sample mean of all the data (this is not always true, but is in this case because r is the same sample size for all treatments). Thus, the sample variance we would use as an estimator for $\sigma^2/r$ is
$$ \frac{1}{t-1} \sum_{i=1}^{t} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2. $$
This suggests another estimator for $\sigma^2$, namely, r times this, or
$$ r \times \frac{1}{t-1} \sum_{i=1}^{t} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2. \qquad [23] $$
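The two estimators of $\sigma^2$, the pooled estimator [21] and the estimator [23] built from the treatment sample means, can be computed side by side. This is a sketch on invented data; the function name `variance_estimates` is ours and equal replication is assumed.

```python
from statistics import mean, variance


def variance_estimates(y):
    """Return (pooled, from_means): estimators [21] and [23] of sigma^2.

    y[i][j] is the response on the jth unit on treatment i, with equal replication r.
    """
    r = len(y[0])
    if any(len(row) != r for row in y):
        raise ValueError("equal replication is assumed")
    # [21]: with equal r, the pooled estimator is the average of the t sample variances.
    pooled = mean([variance(row) for row in y])
    # [23]: r times the sample variance of the t treatment means.
    from_means = r * variance([mean(row) for row in y])
    return pooled, from_means


# Hypothetical responses: t = 3 treatments, r = 4 replicates each.
y = [
    [20.0, 22.0, 19.0, 23.0],
    [25.0, 27.0, 26.0, 30.0],
    [18.0, 17.0, 21.0, 20.0],
]

pooled, from_means = variance_estimates(y)
print(f"pooled [21] = {pooled:.3f}, from means [23] = {from_means:.3f}")
print(f"ratio [23]/[21] = {from_means / pooled:.2f}")
```

In this invented example the treatment means are far apart relative to the within-treatment scatter, so [23] comes out much larger than [21].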
Remark: Note that we derived the estimator for $\sigma^2$ given in [23] under the assumption that the treatment means were all the same. If they really were not the same, then, intuitively, this estimate of $\sigma^2$ would tend to be too big, because the deviations about the sample mean $\bar{Y}_{\cdot\cdot}$ of the $\bar{Y}_{i\cdot}$ will include two components:
1. A component attributable to ‘random variation’ among the $\bar{Y}_{i\cdot}$s
2. A component attributable to the ‘systematic difference’ among the means $\mu_i$
The first component will be present even if the means are the same; the second component will only be present when they differ.

Result: We now have derived two estimators for $\sigma^2$:
• The first, the “pooled” estimate given in [21], will not be affected by whether or not the means are different. This estimate reflects how individual observations differ from their means, regardless of the values of those means; thus, it reflects only variation attributable to how experimental units differ among themselves.
• The second, derived assuming the means are the same, and given in [23], will be affected by whether the means are different. This estimate reflects not only how individual observations, through their sample means, differ but also how the means might differ.

Implication – The F Ratio: Recall that we derived the second estimator for $\sigma^2$ under the assumption that $H_0$ is true. Thus, if $H_0$ really were true, we would expect both estimators for $\sigma^2$ to be about the same size, since in this case both would reflect only variation attributable to experimental units. If, on the other hand, $H_0$ really is not true, we would expect the second estimator to be larger. With this in mind, consider the ratio
$$ F = \frac{\text{estimator for } \sigma^2 \text{ based on sample means [23]}}{\text{estimator for } \sigma^2 \text{ based on individual deviations [21]}}. $$
We now see that if $H_0$ is true, the ratio will be small (close to 1). If $H_0$ is not true, the ratio will be large. The result is that we may base inference on treatment means (whether they differ) on this ratio of estimators for variance. We may use this ratio as a test statistic for testing $H_0$ vs. $H_1$.

Recall that a ratio of two sample variances for two independent populations has an F distribution (Sec. 2.2.5). Recall also that our approach to hypothesis testing is to assume $H_0$ is true, look at the value of a test statistic, and evaluate how “likely” it is if $H_0$ is true. If $H_0$ is true in our situation here, then
• The numerator is a sample variance of data $\bar{Y}_{i\cdot}$, $i = 1, \ldots, t$, from a $N(\mu, \sigma^2/r)$ population.
• The denominator is a (pooled) sample variance of the data $Y_{ij}$.
It turns out, as we will see shortly, that if $H_0$ is true we may further view these two sample variances as independent, even though they are based on the same observations. It thus follows that we have the ratio of two “independent” sample variances if $H_0$ is true, so that
$$ F \sim F_{(t-1),\,t(r-1)}. $$

Interesting Fact: It turns out that, in the case of t = 2 treatments, the ratio F reduces to
$$ F = \frac{(\bar{Y}_{1\cdot} - \bar{Y}_{2\cdot})^2}{s_{\bar{D}}^2} = t^2, $$
the square of the usual t statistic. Here, F will have an $F_{1,\,2(r-1)}$ distribution. It is furthermore true that when the numerator df for an F distribution is equal to 1 and the denominator df = some value $\nu$, say, then the square root of the F r.v. has a $t_\nu$ distribution. Thus, when t = 2, comparing the ratio F to the F distribution is the same as comparing the usual t statistic to the t distribution.

3.1.2. Linear Additive Model
It is convenient to write down a model for an observation to highlight the possible sources of variation. For the general one-way classification with t treatments, we may classify an individual observation as being on the jth experimental unit in the ith treatment group as

$$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, \ldots, t, \quad j = 1, \ldots, r_i,$$
where
• $t$ = number of treatments
• $r_i$ = number of replicates on treatment i. In general, this may be different for different treatments, so we add the subscript i. (What we called $n_i$ in the case $t = 2$ is now $r_i$.)
• $\mu_i = \mu + \tau_i$ is the mean of the population describing responses on experimental units receiving the ith treatment.
• $\mu$ may be thought of as the “overall” mean with no treatments
• $\tau_i$ is the change in mean (deviation from $\mu$) associated with treatment i

We have written the model generally to allow for unequal replication. We will see that the idea of an F ratio may be generalized to this case. This model is just an extension of the one we used in the case of two treatments and, as in that case, shows that we may think of observations varying about an overall mean because of the systematic effect of treatments and the random variation in experimental units.

3.1.3. Fixed vs. Random Effects
Recall that τi represents the “effect” (deviation) associated with getting treatment i. Depending on the situation, our further interpretation of τi , and in fact of the treatments themselves, may differ. Consider the following examples. Example 1: Suppose t = 3 and that each treatment is a different fertilizer mixture for which mean yields are to be compared. Here, we are interested in comparing three specific treatments. If we repeated the experiment again, these three fertilizers would always constitute the treatments of interest. Example 2: Suppose a factory operates a large number of machines to produce a product and wishes to determine whether the mean yield of these machines differs. It is impractical for the company to keep track of yield for all of the many machines it operates, so a random sample of five such machines is selected, and observations on yield are made on these five machines. The hope is that the results for the five machines involved in the experiment may be generalized to gain insight into the behavior of all of the machines. In Example 1, there is a particular set of treatments of interest. If we started the experiment next week instead of this week, we would still be interested in this same particular set – it would not vary across other possible experiments we might do. In Example 2, the treatments are the 5 machines chosen from all machines at the company, chosen by random selection. If we started the experiment next week instead of this week here, we might end up with a different set of five machines with which to do the experiment. In fact, whatever five machines we end up with,
these particular machines are not the specific machines of interest. Rather, interest focuses on the population of all machines operated by the company. The question of interest now is not about the particular treatments involved in the experiment, but the population of all such treatments.
• In a case like Example 1, the $\tau_i$ are best regarded as fixed quantities, as they describe a particular set of conditions. In this situation, the $\tau_i$ are referred to as fixed effects.
• In a case like Example 2, the $\tau_i$ are best regarded as r.v.’s. Here, the particular treatments in the experiment may be thought of as being drawn from a population of all such treatments, so there is chance involved. We hence think of the $\tau_i$ as r.v.’s with some mean and variance $\sigma_\tau^2$. This variance characterizes the variability in the population of all possible treatments, in our example, the variability across all machines owned by the company. If machines are quite different in terms of yield, $\sigma_\tau^2$ will be large. In this situation, the $\tau_i$ are referred to as random effects.

You might expect that these two situations would lead to different considerations for testing. In the random treatment effects case, there is additional uncertainty involved, because the treatments we use aren’t the only ones of interest. It turns out that in the particular simple case of assessing treatment differences for the one-way classification, the methods we will discuss are valid for either case. However, in more complicated designs, this is not necessarily the case.

3.1.4. Model Restriction
We have seen that $\bar{Y}_{i\cdot}$ is an estimator for $\mu_i$ for a sample from population i. $\bar{Y}_{i\cdot}$ is our “best” indication of the mean response for population i. But if we think about our model, $\mu_i = \mu + \tau_i$, which breaks $\mu_i$ into two components, we do not know how much of what we see, $\bar{Y}_{i\cdot}$, is due to the original population of experimental units before treatments were applied ($\mu$) and how much is due to the effect of the treatment ($\tau_i$). In particular, the linear additive model we write down to describe the situation actually contains elements we can never hope to get a sense of from the data at hand. More precisely, the best we can do is estimate the individual means $\mu_i$ using $\bar{Y}_{i\cdot}$; we cannot hope to estimate the individual treatment effects $\tau_i$ without additional knowledge or assumptions.

Terminology: Mathematically speaking, a model that contains components that cannot be estimated is said to be overparameterized. We have more parameters than we can estimate from the available information. Thus, although the linear additive model is a nice device for focusing our thinking about the data, it is overparameterized from a mathematical point of view.
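To make the idea of overparameterization concrete, the following minimal Python sketch shows two different settings of $(\mu, \tau_1, \ldots, \tau_t)$ that produce exactly the same treatment means $\mu_i = \mu + \tau_i$. The numerical values and the helper name `treatment_means` are invented for illustration only; they do not come from any data set in this chapter.

```python
def treatment_means(mu, tau):
    """Return the treatment means mu_i = mu + tau_i, one per treatment."""
    return [mu + tau_i for tau_i in tau]

# Parameter set A (hypothetical values chosen for illustration).
mu_a = 10.0
tau_a = [1.0, -2.0, 4.0]

# Parameter set B: shift mu up by c and every tau_i down by the same c.
c = 3.0
mu_b = mu_a + c
tau_b = [tau_i - c for tau_i in tau_a]

means_a = treatment_means(mu_a, tau_a)
means_b = treatment_means(mu_b, tau_b)

# The data can only inform us about the mu_i, and these are identical,
# so no amount of data can distinguish set A from set B.
print(means_a)
print(means_b)
```

Because any constant c gives the same $\mu_i$, the individual $\mu$ and $\tau_i$ cannot be recovered from the treatment means alone; only an added assumption can single out one representation.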
One Approach: It may seem that this is an “artificial” problem – why write down a model for which one can’t estimate all its components? The reason is, as above, to give a nice framework for thinking about the data – for example, the model allows us to think of fixed or random treatment effects, depending on the type of experiment. To reconcile our desire to have a helpful model for thinking and the mathematics, the usual approach is to impose some sort of assumption. A standard way to think about things is to suppose that the overall mean $\mu$ can be thought of as the mean or average of the individual treatment means $\mu_i$, that is,

$$\mu = \frac{1}{t}\sum_{i=1}^{t} \mu_i, \quad \text{just as} \quad \bar{Y}_{\cdot\cdot} = \frac{1}{t}\sum_{i=1}^{t} \bar{Y}_{i\cdot}.$$

This implies that

$$\mu = \frac{1}{t}\sum_{i=1}^{t} (\mu + \tau_i) = \mu + \frac{1}{t}\sum_{i=1}^{t} \tau_i,$$

so that it must be that

$$\sum_{i=1}^{t} \tau_i = 0.$$

The condition $\sum_{i=1}^{t} \tau_i = 0$ goes along with the interpretation of the $\tau_i$ as “deviations” from an overall mean. This restriction is one you will see often in work on ANOVA. Basically, it has no effect on our objective, investigating differences among treatment means. All the restriction does is to impose a particular interpretation on our linear additive model. You will often see the null and alternative hypotheses written in terms of $\tau_i$ instead of $\mu_i$. Note that, under this interpretation, if all treatment means were the same ($H_0$), then the $\tau_i$ must all be zero.

This interpretation is valid in the case where the $\tau_i$ are fixed effects. When they are random effects, the interpretation is similar. We think of the $\tau_i$ themselves as having population mean 0, analogous to them averaging to zero above. This is the analog of the restriction in the case of random effects. If there are no differences across treatments, then they do not vary, that is, $\sigma_\tau^2 = 0$.

3.1.5. Assumptions for ANOVA
Before we turn to using the above framework and ideas to develop formal methods, we restate for completeness the assumptions underlying our approach.
• The observations, and hence the errors, are normally distributed.
• The observations have the same variance $\sigma^2$.
• All observations, both across and within samples, are unrelated (independent).
The assumptions provide the basis for concluding that the sampling distribution of the statistic F upon which we will base our inferences is really the F distribution.

Important: The assumptions above are not necessarily true for any given situation. In fact, they are probably never exactly true. For many data sets, they may be a reasonable approximation, in which case the methods we will discuss will be fairly reliable. In other cases, they may be seriously violated; here, the resulting inferences may be misleading. If the underlying distribution of the data really is not normal and/or the variances across treatment groups are not the same, then the rationale we used to develop the statistic is lost. If the data really are not normal, hypothesis tests may be flawed in the sense that the true level of significance is greater than the chosen level $\alpha$, and we may claim there is a difference when there really is not a difference. We may think we are seeing a difference in means, when actually we are seeing lack of normality.

For some data, it may be possible to get around these issues somewhat. So far, we have written down our model in an additive form. However, there are physical situations where a more plausible model is one that has error enter in a multiplicative way:

$$Y_{ij} = \mu^* \tau_i^* \varepsilon_{ij}^*.$$

Such a model is often appropriate for growth data, or many situations where the variability in response tends to get larger as the response gets larger. If we take logarithms, we may write this as

$$\log Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad \mu = \log \mu^*, \quad \tau_i = \log \tau_i^*, \quad \varepsilon_{ij} = \log \varepsilon_{ij}^*.$$

Thus, the logarithms of the observations satisfy a linear, additive model. This is the rationale behind the common practice of transforming the data. It is often the case that many types of biological data seem to be close to normally distributed with constant variance on the logarithm scale, but not at all on their original scale.
The data are thus analyzed on this scale instead. Other transformations may be more appropriate in some circumstances.

3.1.6. ANOVA for One-Way Classification with Equal Replication
We begin with the simplest case, where $r_i = r$ for all i. We will also assume that the $\tau_i$ are fixed effects. Recall our argument to derive the form of the F ratio statistic

$$F = \frac{\text{estimator for } \sigma^2 \text{ based on sample means [23]}}{\text{estimator for } \sigma^2 \text{ based on individual deviations [21]}}.$$
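In particular, the components may be written as follows:
• Numerator:

$$\frac{r \sum_{i=1}^{t} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2}{t - 1} = \frac{\text{Treatment SS}}{\text{df for treatments}}$$

• Denominator:

$$\frac{\sum_{i=1}^{t} \sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{i\cdot})^2}{t(r - 1)} = \frac{\text{Error SS}}{\text{df for error}}.$$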
Here, we define the quantities Treatment SS and Error SS and their df as given above, where, as before, SS = sum of squares. These names make intuitive sense. The Treatment SS is part of the estimator for $\sigma^2$ that includes a component due to variation in treatment means. The use of the term Error SS is as before – the estimator for $\sigma^2$ in the denominator only assesses apparent variation across experimental units.

“Correction” Term: It is convenient to define

$$C = \frac{\left( \sum_{i=1}^{t} \sum_{j=1}^{r} Y_{ij} \right)^2}{rt}.$$
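Algebraic Facts: It is convenient for getting insight (and for hand calculation) to express the SS’s differently. It is possible to show that

$$\text{Treatment SS} = r \sum_{i=1}^{t} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 = \frac{\sum_{i=1}^{t} \left( \sum_{j=1}^{r} Y_{ij} \right)^2}{r} - C,$$

$$\text{Error SS} = \sum_{i=1}^{t} \sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{i\cdot})^2 = \sum_{i=1}^{t} \sum_{j=1}^{r} Y_{ij}^2 - \frac{\sum_{i=1}^{t} \left( \sum_{j=1}^{r} Y_{ij} \right)^2}{r}.$$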
Consider that the overall, “total” variation in all the data, if we do not consider that different treatments were applied, would obviously be well-represented by the sample variance for all the data, lumping them all together without regard to treatment. There are $rt$ total observations; thus, this sample variance would be

$$\frac{\sum_{i=1}^{t} \sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2}{rt - 1};$$

each deviation is taken about the overall mean of all $rt$ observations.
Algebraic Facts: The numerator of the overall sample variance may be written as

$$\sum_{i=1}^{t} \sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 = \sum_{i=1}^{t} \sum_{j=1}^{r} Y_{ij}^2 - C.$$

This quantity is called the Total SS. Because it is the numerator of the overall sample variance, it may be thought of as measuring how observations vary about the overall mean, without regard to treatments. That is, it measures the total variation. We are now in a position to gain insight. From the algebraic facts above, note that

$$\text{Treatment SS} + \text{Error SS} = \text{Total SS}. \qquad [24]$$
Equation [24] illustrates a fundamental point – the Total SS, which characterizes overall variation in the data without regard to the treatments, may be partitioned into two independent components:
• Treatment SS, measuring how much of the overall variation is in fact due to the treatments (in that the treatment means differ).
• Error SS, measuring the remaining variation, which we attribute to inherent variation among experimental units.

F Statistic: If we now define

$$\text{MST} = \text{Treatment MS} = \frac{\text{Treatment SS}}{\text{df for treatments}} = \frac{\text{Treatment SS}}{t - 1},$$

$$\text{MSE} = \text{Error MS} = \frac{\text{Error SS}}{\text{df for error}} = \frac{\text{Error SS}}{t(r - 1)},$$

then we may write our F statistic as

$$F = \frac{\text{Treatment MS}}{\text{Error MS}}.$$
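The calculations above can be traced with a small numerical sketch. The Python function below (the name `sums_of_squares` and the toy data are illustrative assumptions, not taken from the chapter) computes the sums of squares both from their definitions and from the shortcut formulas involving the correction term C, so that the partition of the Total SS and the resulting F ratio can be checked by hand.

```python
def sums_of_squares(groups):
    """Sums of squares, mean squares and F for equal replication.

    groups: list of t lists, each holding the r observations of one treatment.
    """
    t = len(groups)
    r = len(groups[0])
    if any(len(g) != r for g in groups):
        raise ValueError("this sketch assumes equal replication")

    # Definitional forms: sums of squared deviations about means.
    means = [sum(g) / r for g in groups]
    grand_mean = sum(sum(g) for g in groups) / (r * t)
    treatment_ss = r * sum((m - grand_mean) ** 2 for m in means)
    error_ss = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    total_ss = sum((y - grand_mean) ** 2 for g in groups for y in g)

    # Shortcut forms based on the correction term C.
    C = sum(sum(g) for g in groups) ** 2 / (r * t)
    treatment_ss_shortcut = sum(sum(g) ** 2 for g in groups) / r - C
    total_ss_shortcut = sum(y * y for g in groups for y in g) - C

    mst = treatment_ss / (t - 1)        # Treatment MS
    mse = error_ss / (t * (r - 1))      # Error MS
    return {
        "treatment_ss": treatment_ss,
        "error_ss": error_ss,
        "total_ss": total_ss,
        "treatment_ss_shortcut": treatment_ss_shortcut,
        "total_ss_shortcut": total_ss_shortcut,
        "MST": mst,
        "MSE": mse,
        "F": mst / mse,
    }

# Toy data: t = 3 treatments, r = 3 replicates each (invented values).
data = [[4, 5, 6], [7, 8, 9], [1, 2, 3]]
result = sums_of_squares(data)
print(result)
```

For these toy data the treatment means are 5, 8, and 2 about a grand mean of 5, so Treatment SS = 54, Error SS = 6, and Total SS = 60, which gives MST = 27, MSE = 1, and F = 27. The critical value against which F is compared still has to be taken from an F table or statistical software.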
We now get some insight into why F has an $F_{t-1,\,t(r-1)}$ distribution. The components in the numerator and denominator are “independent” in the sense that they partition the Total SS into two “orthogonal” components. (A formal mathematical argument is possible.) We summarize this information in Table 1.1.

Table 1.1 One-way ANOVA table – Equal replication

| Source of variation | DF | Definition | SS | MS | F |
|---|---|---|---|---|---|
| Among Treatments | $t - 1$ | $r \sum_{i=1}^{t} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$ | $\sum_{i=1}^{t} \left( \sum_{j=1}^{r} Y_{ij} \right)^2 / r - C$ | MST | F |
| Error (within treatments) | $t(r - 1)$ | $\sum_{i=1}^{t} \sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{i\cdot})^2$ | by subtraction | MSE | |
| Total | $rt - 1$ | $\sum_{i=1}^{t} \sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2$ | $\sum_{i=1}^{t} \sum_{j=1}^{r} Y_{ij}^2 - C$ | | |

Statistical Hypotheses: The question of interest in this setting is to determine if the means of the t treatment populations are different. We may write this formally as
$$H_{0,T}: \mu_1 = \mu_2 = \cdots = \mu_t \quad \text{vs.} \quad H_{1,T}: \text{The } \mu_i \text{ are not all equal.}$$

This may also be written in terms of the $\tau_i$ under the restriction $\sum_{i=1}^{t} \tau_i = 0$ as

$$H_{0,T}: \tau_1 = \tau_2 = \cdots = \tau_t = 0 \quad \text{vs.} \quad H_{1,T}: \text{The } \tau_i \text{ are not all equal,}$$

where the subscript “T” is added to remind ourselves that this particular test is with regard to treatment means. It is important to note that the alternative hypothesis does not specify the way in which the treatment means (or deviations) differ. The best we can say based on our statistic is that they differ somehow. The numerator of the statistic can be large because the means differ in a huge variety of different configurations. Some of the means may be different while the others are the same, all might differ, and so on.

Test Procedure: Reject $H_{0,T}$ in favor of $H_{1,T}$ at level of significance $\alpha$ if

$$F > F_{t-1,\,t(r-1),\,\alpha}.$$

This is analogous to a two-sided test when $t = 2$ – we do not state in $H_{1,T}$ the order in which the means differ, only that they do. The range of possibilities of how they differ is just more complicated when $t > 2$. We use $\alpha$ instead of $\alpha/2$ here because we have no choice as to which MS appears in the numerator and which appears in the denominator. (Compare to the test of equality of variance in the case of two treatments.)

3.1.7. ANOVA for One-Way Classification with Unequal Replication
We now generalize the ideas of ANOVA to the case where the $r_i$ are not all equal. This may be the case by design or because of mishaps during the experiment that result in lost or unusable data. Here, again, our discussion assumes that the $\tau_i$ are fixed effects. When the $r_i$ are not all equal, we redefine the correction factor and the total number of observations as

$$C = \frac{\left( \sum_{i=1}^{t} \sum_{j=1}^{r_i} Y_{ij} \right)^2}{\sum_{i=1}^{t} r_i} = \frac{\left( \sum_{i=1}^{t} \sum_{j=1}^{r_i} Y_{ij} \right)^2}{N}, \quad N = \sum_{i=1}^{t} r_i.$$
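Using the same logic as before, we arrive at the following ANOVA table (Table 1.2):

Table 1.2 One-way ANOVA table – Unequal replication

| Source of variation | DF | Definition | SS | MS | F |
|---|---|---|---|---|---|
| Among Treatments | $t - 1$ | $\sum_{i=1}^{t} r_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$ | $\sum_{i=1}^{t} \left( \sum_{j=1}^{r_i} Y_{ij} \right)^2 / r_i - C$ | MST | F |
| Error (within treatments) | $\sum_{i=1}^{t} r_i - t$ $(= N - t)$ | $\sum_{i=1}^{t} \sum_{j=1}^{r_i} (Y_{ij} - \bar{Y}_{i\cdot})^2$ | by subtraction | MSE | |
| Total | $N - 1$ | $\sum_{i=1}^{t} \sum_{j=1}^{r_i} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2$ | $\sum_{i=1}^{t} \sum_{j=1}^{r_i} Y_{ij}^2 - C$ | | |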
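As a rough illustration of how the entries of the unequal-replication table fit together, the following Python sketch fills in the table from raw data using the SS column (correction term, then Error SS by subtraction). The function name `one_way_anova` and the data values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def one_way_anova(groups):
    """One-way ANOVA table entries; group sizes r_i may differ.

    groups: list of t lists of observations, one list per treatment.
    """
    t = len(groups)
    N = sum(len(g) for g in groups)              # total number of observations

    # Correction term: (grand total)^2 / N.
    C = sum(sum(g) for g in groups) ** 2 / N

    total_ss = sum(y * y for g in groups for y in g) - C
    treatment_ss = sum(sum(g) ** 2 / len(g) for g in groups) - C
    error_ss = total_ss - treatment_ss           # "by subtraction"

    df_treatment = t - 1
    df_error = N - t
    mst = treatment_ss / df_treatment
    mse = error_ss / df_error
    return {
        "df_treatment": df_treatment,
        "df_error": df_error,
        "df_total": N - 1,
        "treatment_ss": treatment_ss,
        "error_ss": error_ss,
        "total_ss": total_ss,
        "MST": mst,
        "MSE": mse,
        "F": mst / mse,
    }

# Invented data with r_1 = 2, r_2 = 3, r_3 = 4 (N = 9).
unequal = [[2, 4], [5, 6, 7], [9, 10, 11, 12]]
table = one_way_anova(unequal)
print(table)
```

With these values C = 484, Total SS = 92, Treatment SS = 83, and Error SS = 9 on 6 df, so MST = 41.5 and MSE = 1.5. When every group has the same size r, the same function reproduces the equal-replication quantities, since then N = rt.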
The hypotheses are the same as under equal replication.

Test Procedure: Reject $H_{0,T}$ in favor of $H_{1,T}$ at level of significance $\alpha$ if $F > F_{t-1,\,N-t,\,\alpha}$.

3.2. Multiple Comparisons
When we test the usual hypotheses regarding differences among more than two treatment means, if we reject the null hypothesis, the best we can say is that there is a difference among the treatment means somewhere. Based on this analysis, we cannot say how these differences occur. For example, it may be that all the means are the same except one. Alternatively, all means may differ from all others. However, on the basis of this test, we cannot tell. The concept of multiple comparisons is related to trying to glean more information from the data on the nature of the differences among means. For reasons that will become clear shortly, this is a difficult issue, and even statisticians do not always agree. As you will see, the issue is one of “philosophy” to some extent. Understanding the principles and the problem underlying the idea of multiple comparisons is thus much more important than being familiar with the many formal statistical procedures. In this section, we will discuss the issues and then consider only a few procedures. The discussion of multiple comparisons is really only meaningful when the number of treatments t ≥ 3.
3.2.1. Principles – “Planned” Vs. “Families” of Comparisons
Recall that when we perform an F test (a test based on an F ratio) in the ANOVA framework for the difference among t treatment means, the only inference we make as a result of the test is that the means considered as a group differ somehow if we reject the null hypothesis:

$$H_{0,T}: \mu_1 = \cdots = \mu_t \quad \text{vs.} \quad H_{1,T}: \text{The } \mu_i \text{ are not all equal.}$$
We do not and cannot make inference as to how they differ. Consider the following scenarios:
• If $H_{0,T}$ is not rejected, it could be that there is a real difference between, say, two of the t treatments, but it is “getting lost” by being considered with all the other possible comparisons among treatment means.
• If $H_{0,T}$ is rejected, we are naturally interested in the specific nature of the differences among the t means. Are they all different from one another? Are only some of them different, the rest all the same? One is naturally tempted to look at the sample means for each treatment – are there differences that are “suggested” by these means?

To consider these issues, we must recall the framework in which we test hypotheses. Recall that the level of significance for any hypothesis test is

$$\alpha = P(\text{reject } H_0 \text{ when } H_0 \text{ is true}) = P(\text{Type I error}).$$

Thus, the level $\alpha$ specifies the probability of making a mistake and saying that there is evidence to suggest the alternative hypothesis is true when it really isn’t. The probability of making such an error is controlled to be no more than $\alpha$ by the way we do the test.

Important: The level $\alpha$, and thus the probability of making such a mistake, applies only to the particular $H_0$ under consideration. Thus, in the ANOVA situation above, where we test $H_{0,T}$ vs. $H_{1,T}$, $\alpha$ applies only to the comparison among all the means together – either they differ somehow or they do not. The probability that we end up saying they differ somehow when they don’t is thus no more than $\alpha$, that is all. $\alpha$ does not pertain to any further consideration of the means, say, comparing them two at a time. To see this, consider some examples.

Example – t = 3 Treatments: Suppose there are $t = 3$ treatments under consideration. We are certainly interested in the question “do the means differ?” but we may also be interested in how. Specifically, is it the case that, say, $\mu_1 = \mu_2$, but $\mu_3 \neq \mu_1$ or $\mu_2$? Here, we are interested in what we can say about three separate comparisons. We’d like to be able to combine the three comparisons somehow to make a statement about how the means differ. Suppose, to address this, we decide to perform three separate t-tests for the three possible differences of means, each with level of significance $\alpha$. Then, we have

$$P(\text{Type I error comparing } \mu_1 \text{ vs. } \mu_2) = \alpha$$
$$P(\text{Type I error comparing } \mu_1 \text{ vs. } \mu_3) = \alpha$$
$$P(\text{Type I error comparing } \mu_2 \text{ vs. } \mu_3) = \alpha.$$
That is, in each test, we have probability of α of inferring that the two means under consideration differ when they really do not. Because we are performing more than one hypothesis test, the probability we make this mistake in at least one of the tests is no longer α! This is simply because we are performing more than one test. For example, it may be shown mathematically that if α = 0.05, P(Make at least one Type I error in the 3 tests) ≈ 0.14! Suppose we perform the three separate tests, and we reject the null hypothesis in the first two tests, but not in the third (μ2 vs. μ3 ), with each test at level α = 0.05. We then proclaim at the end that “there is sufficient evidence in these data to say that μ1 differs from μ2 and μ1 differs from μ3 . We do not have sufficient evidence to say that μ2 and μ3 differ from each other.” Next to this statement, we say “at level of significance α = 0.05.” What is wrong with doing this? From the calculation above, the chance that we said something wrong in this statement is not 0.05! Rather, it is 0.14! The chance we have made a mistake in claiming a difference in at least one of the tests is 0.14 – almost three times the chance we are claiming! In fact, for larger values of t (more treatments), if we make a separate test for each possible pairwise comparison among the t means, each at level α, things only get worse. For example, if α = 0.05 and t = 10, the chance we make at least one Type I error, and say in our overall statement that there is evidence for a difference in a pair when there isn’t, is almost 0.90! That is, if we try to combine the results of all these tests together into a statement that tries to sort out all the differences, it is very possible (almost certain) that we will claim a difference that really doesn’t exist somewhere in our statement. 
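The figures quoted above (about 0.14 for the three tests when t = 3, and almost 0.90 when t = 10) are consistent with a simple back-of-the-envelope calculation. The Python sketch below makes the simplifying assumption that the individual tests are independent, which is an assumption introduced here for illustration; pairwise comparisons that share a treatment are not strictly independent, so the results should be read as approximations. The function names are hypothetical.

```python
from math import comb

def n_pairwise(t):
    """Number of distinct pairwise comparisons among t treatment means."""
    return comb(t, 2)

def prob_at_least_one_type1(alpha, n_tests):
    """P(at least one Type I error) over n_tests tests, each at level alpha.

    ASSUMPTION (for illustration): the tests are mutually independent and
    every null hypothesis is true.
    """
    return 1 - (1 - alpha) ** n_tests

alpha = 0.05
for t in (3, 10):
    m = n_pairwise(t)
    p = prob_at_least_one_type1(alpha, m)
    print(f"t = {t:2d}: {m:2d} pairwise tests, P(at least one error) = {p:.2f}")
```

With t = 3 there are 3 pairwise tests and the probability is 1 − 0.95³; with t = 10 there are 45 pairwise tests, which is why the chance of at least one false claim grows so quickly as treatments are added.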
If we wish to “sort out” differences among all the treatment means, we cannot just compare them all separately without having a much higher chance of concluding something wrong!

Another Problem: Recognizing the problems associated with “sorting out” differences above, suppose we decide to look at the sample treatment means $\bar{Y}_{i\cdot}$ and compare only those that appear to possibly be different. Won’t this get around this problem, as we’re not likely to do as many separate tests? No! To see this, suppose we inspect the sample means and decide to conduct a t-test at level $\alpha = 0.05$ for a difference in the two treatment means observed to have the highest and lowest sample means $\bar{Y}_{i\cdot}$ among all those in the experiment. Recall that the data are just random samples from the treatment populations of interest. Thus, the sample means $\bar{Y}_{i\cdot}$ could have ended up the way they did because
• There really is a difference in the population means, OR
• We got “unusual” samples, and there really is no difference!
Because of the chance mechanism involved, either of these explanations is possible. Returning to our largest and smallest sample means, then, the large difference we have observed could be due just to chance – we got some unusual samples, even though the means really don’t differ. Because we have already seen in our samples a large difference, however, it turns out that, even if this is the case, we will still be more likely to reject the null hypothesis of no difference in the two means. That is, although we claim that we only have a 5% chance of rejecting the null hypothesis when it’s true, the chance we actually do this is higher. Here, with $\alpha = 0.05$, it turns out that the true probability that we reject the null hypothesis that the means with smallest and largest observed $\bar{Y}_{i\cdot}$ values are the same when it is true is
• Actually 0.13 if $t = 3$
• Actually 0.60 if $t = 10$!
As a result, if we test something on the basis of what we observed in our samples, our chance of making a Type I error is much greater than we think. The above discussion shows that there are clearly problems with trying to get a handle on how treatment means differ. In fact, this brings to light some philosophical issues.

Planned Comparisons: This is best illustrated by an example. Suppose that some university researchers are planning to conduct an experiment involving four different treatments:
• The standard treatment, which is produced by a certain company and is in widespread use.
• A new commercial treatment manufactured by a rival company, which hopes to show it is better than the standard, so that they can market the new treatment for enormous profits.
• Two experimental treatments developed by the university researchers. These are being developed by two new procedures, one designed by the university researchers, the other by rival researchers at another, more prestigious university.
The university researchers hope to show that their treatment is better. The administration of the company is trying to decide whether they should begin planning marketing strategy for their treatment; they are thus uninterested for this purpose in the two university treatments. Because the university researchers are setting up an experiment, the company finds it less costly to simply pay the researchers to include their treatment and the standard in the study rather than for the company to do a separate experiment of their own. On the other hand, the university researchers are mainly interested in the experimental treatments. In fact, their main question is which of them shows more promise. In this situation, the main comparison of interest as far as the company is
concerned is that between their new treatment and the standard. Thus, although the actual experiment involves four treatments, they really only care about the pairwise comparison of the two, that is, the difference

$$\mu_{\text{standard}} - \mu_{\text{new (company)}}. \qquad [25]$$

Similarly, for the university researchers, their main question involves the difference in the pair

$$\mu_{\text{experimental (us)}} - \mu_{\text{experimental (them)}}.$$

In this situation, the comparison of interest depends on the interested party. As usual, each party would like to control the probability of making a Type I error in their particular comparison. For example, the company may be satisfied with level of significance $\alpha = 0.05$, as usual, and probably would have used this same significance level had they conducted the experiment with just the two treatments on their own.

In this example, prior to the conduct of the experiment, specific comparisons of interest among various treatment means have been identified, regardless of the overall outcome of the experiment. Because these comparisons were identified in advance, we do not run into the problem we did in comparing the treatments with the largest and smallest sample means after the experiment, because the tests will be performed regardless. Nor do we run into the problem of sorting out differences among all means. As far as the company is concerned, their question about [25] is a separate test, for which they want the probability of concluding a difference when there isn’t one to be at most 0.05. A similar statement could be made about the university researchers.

Of course, this made-up scenario is a bit idealized, but it illustrates the main point. If specific questions involving particular treatment means are of independent interest, are identified as such in advance of seeing experimental results, and would be investigated regardless of outcome, then it may be legitimate to perform the associated tests at level of significance $\alpha$, without concern about the outcome of other tests. In this case, the level of significance $\alpha$ applies only to the test in question and may be proclaimed only in talking about that single test!
If a specific comparison is made in this way from data from a larger experiment, we are controlling the comparisonwise (Type I) error rate at $\alpha$. For each comparison of this type that is made, the probability of a Type I error is $\alpha$. The results of several comparisons may not be combined into a single statement with claimed Type I error rate $\alpha$.

Families of Comparisons: Imagine a pea section experiment in which there were four sugar treatments and a control (no sugar).
Suppose that the investigators would like to make a single statement about the differences. This statement involves four pairwise comparisons – each sugar treatment against the control. From our previous discussion, it is clear that testing each pairwise comparison against the control at level of significance $\alpha$ and then making such a statement would involve a chance of declaring differences that really don’t exist greater than that indicated by $\alpha$.

Solution: For such a situation, then, the investigators would like to avoid this difficulty. They are interested in ensuring that the family of four comparisons they wish to consider (all sugars against the control) has overall probability of making at least one Type I error to be no more than $\alpha$, i.e.,

$$P(\text{at least one Type I error among all four tests in the family}) \leq \alpha.$$

Terminology: When a question of interest about treatment means involves a family of several comparisons, we would like to control the family-wise error rate at $\alpha$, so that the overall probability of making at least one mistake (declaring a difference that doesn’t exist) is controlled at $\alpha$. It turns out that a number of statistical methods have been developed that ensure that the level of significance for a family of comparisons is no more than a specified $\alpha$. These are often called multiple comparison procedures. Various procedures are available in statistical software (e.g., PROC MULTTEST in SAS, “mtest” command in Stata).

3.2.2. The Least Significant Difference
First, we consider the situation where we have planned in advance of the experiment to make certain comparisons among treatment means. Each comparison is of interest in its own right and thus is to be viewed as separate. Statements regarding each such comparison will not be combined. Thus, here, we wish to control the comparisonwise error rate at $\alpha$.

Idea: Despite the fact that each comparison is to be viewed separately, we can still take advantage of all information in the experiment on experimental error. To fix ideas, suppose we have an experiment involving t treatments and we are interested in comparing two treatments a and b, with means $\mu_a$ and $\mu_b$. That is, we wish to test

$$H_0: \mu_a = \mu_b \quad \text{vs.} \quad H_1: \mu_a \neq \mu_b.$$

We wish to control the comparisonwise error rate at $\alpha$.

Test Statistic: As our test statistic for $H_0$ vs. $H_1$, use instead

$$\frac{\bar{Y}_{a\cdot} - \bar{Y}_{b\cdot}}{s_{\bar{Y}_{a\cdot} - \bar{Y}_{b\cdot}}}, \quad s_{\bar{Y}_{a\cdot} - \bar{Y}_{b\cdot}} = s \sqrt{\frac{1}{r_a} + \frac{1}{r_b}}, \quad s = \sqrt{\text{MSE}}.$$

That is, instead of basing the estimate of $\sigma^2$ on only the two treatments in question, use the estimate from all t treatments.
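A minimal Python sketch of this statistic follows, assuming the data are held as one list of observations per treatment. The function name `pooled_comparison` and the data are invented for illustration; the critical value against which the statistic is compared is not computed here and would have to be read from a t table with the returned error df.

```python
from math import sqrt

def pooled_comparison(groups, a, b):
    """Compare treatments a and b (0-based indices) using MSE from ALL groups.

    Returns (statistic, standard error of the difference, error df).
    """
    t = len(groups)
    N = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]

    # Error SS pools within-treatment deviations over all t treatments.
    error_ss = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    df_error = N - t
    s = sqrt(error_ss / df_error)                 # s = sqrt(MSE)

    r_a, r_b = len(groups[a]), len(groups[b])
    se_diff = s * sqrt(1 / r_a + 1 / r_b)         # SE of the mean difference
    statistic = (means[a] - means[b]) / se_diff
    return statistic, se_diff, df_error

# Invented data: three treatments with 2, 3 and 4 replicates.
groups = [[2, 4], [5, 6, 7], [9, 10, 11, 12]]
stat, se, df = pooled_comparison(groups, 0, 1)
print(stat, se, df)
```

In this toy example the error df is N − t = 6, whereas a test based only on the first two treatments would have 2 + 3 − 2 = 3 df, which is the gain in information the text describes.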
Result: This will lead to a more precise test: The estimate of the SD of the treatment mean difference from all t treatments will be a more precise estimate, because it is based on more information! Specifically, the SE based on only the two treatments will have only $r_a + r_b - 2$ df, while that based on all t will have df of $\sum_{i=1}^{t} r_i - t = N - t$.

Test Procedure: Perform the test using all information by rejecting $H_0$ in favor of $H_1$ if

$$\frac{\left| \bar{Y}_{a\cdot} - \bar{Y}_{b\cdot} \right|}{s_{\bar{Y}_{a\cdot} - \bar{Y}_{b\cdot}}} > t_{N-t,\,\alpha/2}.$$

A quick look at the t table will reveal the advantage of this test. The df, $N - t$, for estimating $\sigma^2$ (experimental error) are greater than those from only two treatments. The corresponding critical value is thus smaller, giving more chance of rejecting $H_0$ when it is really false.

Terminology: If we decide to compare two treatment means from a larger experiment involving t treatments, the value

$$\text{LSD} = s_{\bar{Y}_{a\cdot} - \bar{Y}_{b\cdot}} \, t_{N-t,\,\alpha/2} = s \sqrt{\frac{1}{r_a} + \frac{1}{r_b}} \; t_{N-t,\,\alpha/2}, \quad s = \sqrt{\text{MSE}},$$

is called the least significant difference (LSD) for the test of $H_0$ vs. $H_1$ above, based on the entire experiment. From above, we reject $H_0$ at level $\alpha$ if

$$\left| \bar{Y}_{a\cdot} - \bar{Y}_{b\cdot} \right| > \text{LSD}.$$

Warning: It is critical to remember that the LSD procedure is only valid if the paired comparisons are genuinely of independent interest.

3.2.3. Contrasts
Before we discuss the notion of multiple comparisons (for families of comparisons), we consider the case where we are interested in comparisons that cannot be expressed as a difference in a pair of means.
Example: Suppose that for the pea section experiment, a particular question of interest was to compare sugar treatments containing fructose to those that do not:
• $\mu_3$, $\mu_4$: do contain fructose
• $\mu_2$, $\mu_5$: do not contain fructose
It is suspected that treatments 3 and 4 are similar in terms of resulting pea section length, treatments 2 and 5 are similar, and lengths from 3 and 4, on the average, are different from lengths from 2 and 5, on the average. Thus, the question of interest is to compare the average mean pea section length for treatments 3 and 4 to the average mean length for treatments 2 and 5. In terms of the treatment means, we are thus interested in the comparison between $(\mu_3 + \mu_4)/2$ and $(\mu_2 + \mu_5)/2$, that is, the average of $\mu_3$ and $\mu_4$ vs. the average of $\mu_2$ and $\mu_5$. We may express this formally as a set of hypotheses:
$$H_0: \frac{\mu_3 + \mu_4}{2} = \frac{\mu_2 + \mu_5}{2} \quad \text{vs.} \quad H_1: \frac{\mu_3 + \mu_4}{2} \neq \frac{\mu_2 + \mu_5}{2}.$$
These may be rewritten by algebra as
$$H_0: \mu_3 + \mu_4 - \mu_2 - \mu_5 = 0 \quad \text{vs.} \quad H_1: \mu_3 + \mu_4 - \mu_2 - \mu_5 \neq 0. \qquad [26]$$
Similarly, suppose a particular question of interest was whether sugar treatments differ on average in terms of mean pea section length from the control. We thus would like to compare the average mean length for treatments 2–5 to the mean for treatment 1. We may express this as a set of hypotheses:
$$H_0: \frac{\mu_2 + \mu_3 + \mu_4 + \mu_5}{4} = \mu_1, \ \text{or} \ 4\mu_1 - \mu_2 - \mu_3 - \mu_4 - \mu_5 = 0$$
$$\text{vs.} \quad H_1: \frac{\mu_2 + \mu_3 + \mu_4 + \mu_5}{4} \neq \mu_1, \ \text{or} \ 4\mu_1 - \mu_2 - \mu_3 - \mu_4 - \mu_5 \neq 0. \qquad [27]$$
Terminology: A linear function of treatment means of the form
$$\sum_{i=1}^{t} c_i \mu_i$$
such that the constants $c_i$ sum to zero, i.e., $\sum_{i=1}^{t} c_i = 0$, is called a contrast. Both of the functions in [26] and [27] are contrasts:
• $\mu_3 + \mu_4 - \mu_2 - \mu_5$: $c_1 = 0$, $c_2 = -1$, $c_3 = 1$, $c_4 = 1$, $c_5 = -1$, so that $\sum_{i=1}^{5} c_i = 0$.
• $4\mu_1 - \mu_2 - \mu_3 - \mu_4 - \mu_5$: $c_1 = 4$, $c_2 = -1$, $c_3 = -1$, $c_4 = -1$, $c_5 = -1$, so that $\sum_{i=1}^{5} c_i = 0$.
Interpretation: Note that, in each case, if the means really were all equal, then, because the coefficients ci sum to zero, the contrast itself will be equal to zero. If they are different, it will not. Thus, the null hypothesis says that there are no differences. The alternative says that the particular combination of means is different from zero, which reflects real differences among functions of the means. Note that, of course, a pairwise comparison is a contrast, e.g., μ2 − μ1 is a contrast with c1 = −1, c2 = 1, c3 = c4 = c5 = 0.
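As a minimal sketch (in Python, which the chapter itself does not use), the defining property of a contrast can be checked numerically. The helper names and the common mean value of 10.0 are illustrative assumptions, not part of the pea section analysis.

```python
def is_contrast(coeffs, tol=1e-12):
    """Return True when the coefficients c_i sum to zero."""
    return abs(sum(coeffs)) <= tol


def contrast_value(coeffs, means):
    """Evaluate sum_i c_i * mu_i for the given treatment means."""
    return sum(c * m for c, m in zip(coeffs, means))


# Coefficients (c1, ..., c5) for [26], [27], and the pairwise comparison mu2 - mu1.
fructose_vs_not = [0, -1, 1, 1, -1]
control_vs_sugars = [4, -1, -1, -1, -1]
pair_2_vs_1 = [-1, 1, 0, 0, 0]

# Hypothetical situation in which all five treatment means are equal.
equal_means = [10.0] * 5

for name, coeffs in [("[26]", fructose_vs_not),
                     ("[27]", control_vs_sugars),
                     ("mu2 - mu1", pair_2_vs_1)]:
    print(name, is_contrast(coeffs), contrast_value(coeffs, equal_means))
```

Each set of coefficients sums to zero, so each contrast evaluates to zero when the means are all equal.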
Estimation: As intuition suggests, the best (in fact, unbiased) estimator of any contrast $\sum_{i=1}^{t} c_i \mu_i$ is
$$Q = \sum_{i=1}^{t} c_i \bar{Y}_{i\cdot}.$$
It turns out that, for the population of all such Q’s for a particular set of $c_i$’s, the variance of a contrast is
$$\sigma_Q^2 = \sigma^2 \sum_{i=1}^{t} \frac{c_i^2}{r_i},$$
which may be estimated by replacing $\sigma^2$ by the pooled estimate, $MS_E$. Thus, an SE estimate for the contrast is
$$s_Q = s\sqrt{\sum_{i=1}^{t} \frac{c_i^2}{r_i}}, \qquad s = \sqrt{MS_E}.$$
It may be easily verified (try it) that this expression reduces to our usual expression when the contrast is a pairwise comparison.
Test Procedure: For testing hypotheses of the general form
$$H_0: \sum_{i=1}^{t} c_i \mu_i = 0 \quad \text{vs.} \quad H_1: \sum_{i=1}^{t} c_i \mu_i \neq 0$$
at level of significance α when the comparison is planned, we use the same procedure as for the LSD test. We reject $H_0$ in favor of $H_1$ if
$$|Q| = \left|\sum_{i=1}^{t} c_i \bar{Y}_{i\cdot}\right| > s_Q\, t_{N-t,\alpha/2}.$$
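The estimation and test steps for a planned contrast can be sketched in Python as follows. The treatment means, replication numbers, and MSE are hypothetical values chosen for illustration (they are not the pea section data), the function names are our own, and the critical value is taken from a standard t table rather than computed.

```python
import math


def contrast_estimate(coeffs, means):
    """Q = sum_i c_i * Ybar_i."""
    return sum(c * m for c, m in zip(coeffs, means))


def contrast_se(coeffs, reps, mse):
    """s_Q = s * sqrt(sum_i c_i^2 / r_i), with s = sqrt(MSE)."""
    return math.sqrt(mse) * math.sqrt(sum(c * c / r for c, r in zip(coeffs, reps)))


def reject_h0(coeffs, means, reps, mse, t_crit):
    """Planned-comparison test: reject H0 when |Q| > s_Q * t_crit."""
    q = contrast_estimate(coeffs, means)
    return abs(q) > contrast_se(coeffs, reps, mse) * t_crit


# Hypothetical summary statistics for t = 5 treatments with 4 replicates each.
means = [10.0, 8.0, 7.0, 7.5, 9.0]
reps = [4, 4, 4, 4, 4]
mse = 1.0
t_crit = 2.131  # table value of t_{15, 0.025}, since N - t = 20 - 5 = 15

fructose_vs_not = [0, -1, 1, 1, -1]
pairwise = [-1, 1, 0, 0, 0]

# For a pairwise contrast, s_Q * t_crit is the LSD of the previous section.
lsd = contrast_se(pairwise, reps, mse) * t_crit
print("LSD:", lsd)
print("Reject for fructose contrast:", reject_h0(fructose_vs_not, means, reps, mse, t_crit))
```

For the pairwise coefficients, `contrast_se` returns $s\sqrt{1/r_a + 1/r_b}$, which is the reduction the text asks the reader to verify.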
The result of this test is not to be compared with any other. The level of significance α pertains only to the question at hand.
3.2.4. Families of Comparisons
We now consider the problem of combining statements. Suppose we wish to make a single statement about a family of contrasts (e.g., several pairwise comparisons) at some specified level of significance α. There are a number of methods for making statements about families of comparisons that control the overall family-wise level of significance at a value α. We discuss three of these here. The basic premise behind each is the same.
Bonferroni Method: This method is based on modifying individual t-tests. Suppose we specify c contrasts $C_1, C_2, \ldots, C_c$ and wish to test the hypotheses
$$H_{0,k}: C_k = 0 \quad \text{vs.} \quad H_{1,k}: C_k \neq 0, \qquad k = 1, \ldots, c,$$
while controlling the overall family-wise Type I error rate for all c tests as a group to be ≤ α. It may be shown mathematically that, if one makes c tests, each at level of significance α/c, then P(at least 1 Type I error in the c tests) ≤ α. Thus, in the Bonferroni procedure, we use for each test the t value corresponding to the appropriate df and α/c and conduct each test as usual. That is, for each contrast $C_k = \sum_{i=1}^{t} c_i \mu_i$, reject $H_{0,k}$ if
$$|Q_k| > s_Q\, t_{N-t,\alpha/(2c)}, \qquad Q_k = \sum_{i=1}^{t} c_i \bar{Y}_{i\cdot}.$$
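The arithmetic behind the Bonferroni adjustment can be sketched in Python. The function names are our own. The independence calculation is only an illustrative assumption: comparisons that share a control are not independent, and the Bonferroni bound itself does not require independence.

```python
def bonferroni_level(alpha, c):
    """Per-test level of significance alpha / c."""
    return alpha / c


def familywise_upper_bound(per_test_level, c):
    """Bound on P(at least one Type I error): c times the per-test level."""
    return min(1.0, c * per_test_level)


def familywise_if_independent(per_test_level, c):
    """P(at least one Type I error) if the c tests were independent (assumption)."""
    return 1.0 - (1.0 - per_test_level) ** c


alpha, c = 0.05, 4  # four comparisons against the control

unadjusted = familywise_if_independent(alpha, c)
adjusted_level = bonferroni_level(alpha, c)
adjusted = familywise_if_independent(adjusted_level, c)

print("Each test at alpha:      family-wise rate =", unadjusted)
print("Each test at alpha / c:  family-wise rate =", adjusted)
print("Bonferroni bound:", familywise_upper_bound(adjusted_level, c))
```

Under the independence assumption, testing each of the four comparisons at 0.05 gives a family-wise rate of about 0.19, while testing each at 0.0125 keeps it below 0.05.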
The Bonferroni method is valid with any type of contrast, and may be used with unequal replication.
Scheffé’s Method: The idea is to ensure that, for any family of contrasts, the family level is α by requiring that the probability of a Type I error is ≤ α for the family of all possible contrasts. That is, control things so that P(at least 1 Type I error among tests of all possible contrasts) ≤ α. Because the level is controlled for any family of contrasts, it will be controlled for the one of interest. The procedure is to compute the quantity
$$S = \sqrt{(t - 1)\, F_{t-1,N-t,\alpha}}.$$
For all contrasts of interest, $C_k$, reject the corresponding null hypothesis $H_{0,k}$ if $|Q_k| > s_Q S$.
Tukey’s Method: For this method, we must have equal replication, and the contrasts of interest must all be pairwise comparisons. Because only pairwise contrasts are of interest, the method takes advantage of this fact. It ensures that the family-wise error rate for all possible pairwise comparisons is controlled at α; if this is true, then any subset of pairwise comparisons will also have this property. The procedure is to compute
$$T = \frac{1}{\sqrt{2}}\, q_\alpha(t, N - t),$$
where $q_\alpha(t, N - t)$, a percentage point of the studentized range distribution, can be found in a table available in relevant statistical textbooks. For example, $q_{0.05}(5, 45) \approx 4.04$ (the value closest in the table). If $C_k$ is the kth pairwise comparison, we would reject the corresponding null hypothesis $H_{0,k}$ if $|Q_k| > s_Q T$.
It is legitimate to conduct the hypothesis tests using all of these methods (where applicable) and then choose the one that rejects most often, that is, the one with the smallest cut-off. This is valid because the “cut-off” values S, T, and $t_{N-t,\alpha/(2c)}$ do not depend on the data. Thus, we are not choosing on the basis of observed data (which of course would be illegitimate, as discussed earlier)!
A Problem With Multiple Comparisons: Regardless of which method one uses, there is always a problem when conducting multiple comparisons. Because the number of comparisons being made may be large (e.g., all possible pairwise comparisons for t = 10 treatments!), and we wish to keep the overall probability of at least one Type I error small, we are likely to have low power (that is, likely to have a difficult time detecting real differences among the comparisons in our family). This is for the simple reason that, in order to ensure we achieve overall level α, we must use critical values for each comparison larger than we would if the comparisons were each made separately at level α (inspect the pea section results for an example). Thus, although the goal of trying to sort out differences is worthy, we may be quite unsuccessful at achieving it! This problem has tempted some investigators to try to figure out ways around the issue, for example, claiming that certain comparisons were of interest in advance when they really weren’t, so as to salvage an experiment with no “significant” results. This is, of course, inappropriate! The only way to ensure enough power to test all questions of interest is to design the experiment with a large enough sample size!
3.3. Multi-Way Classification and ANOVA
Recall that when no sources of variation other than the treatments are anticipated, grouping observations will probably not add very much precision, i.e., reduce our assessment of experimental error. If the experimental units are expected to be fairly uniform, then a completely randomized design will probably be sufficient. In many situations, however, other sources of variation are anticipated.
Examples:
• In an agricultural field experiment, adjacent plots in a field will tend to be “more alike” than those far apart.
• Observations made with a particular measuring device or by a particular individual may be more alike than those made by different devices or individuals.
• Plants kept in the same greenhouse may be more alike than those from different greenhouses.
• Patients treated at the same hospital may be more alike than those treated at different hospitals.
In such cases, clearly there is a potential source of systematic variation we may identify in advance. This suggests that we may wish to group experimental units in a meaningful way on this basis. When experimental units are considered in meaningful groups, they may be thought of as being classified not only according to treatment but also according to
• Position in field
• Device or observer
• Greenhouse
• Hospital
Objective: In an experiment, we seek to investigate differences among treatments. By accounting for differences due to effects of phenomena such as those above, a possible source of variation will (hopefully) be excluded from our assessment of experimental error. The result will be increased ability to detect treatment differences if they exist. Designs involving meaningful grouping of experimental units are the key to reducing the effects of experimental error, by identifying components of variation among experimental units that may be due to something besides inherent biological variation among them. The paired design for comparing two treatments is an example of such a design.
Multi-Way Classification: If experimental units may be classified not only according to treatment but also according to other meaningful factors, things obviously become more complicated. We will discuss designs involving more than one way of classifying experimental units. In particular,
• Two-way classification, where experimental units may be classified by treatment and another meaningful grouping factor.
• A form of three-way classification (one of which is treatment) called a Latin square (we will not cover this design here).
3.3.1. Randomized Complete Block Design
When experimental units may be meaningfully grouped, clearly, a completely randomized design will be suboptimal. In this situation, an alternative strategy for assigning treatments to experimental units, which takes advantage of the grouping, may be used.
Randomized Complete Block Design:
• The groups are called blocks.
• Each treatment appears the same number of times in each block; hence, the term complete block design.
• The simplest case is that where each treatment appears exactly once in each block. Here, because number of replicates = number of experimental units for each treatment, we have number of replicates = number of blocks = r.
• Blocks are often called replicates for this reason.
• To set up such a design, randomization is used in the following way:
  - Assign experimental units to blocks on the basis of the meaningful grouping factor (greenhouse, device, etc.).
  - Now randomly assign the treatments to experimental units within each block.
Hence, the term randomized complete block design: each block is complete, and randomization occurs within each block.
Rationale: Experimental units within blocks are as alike as possible, so observed differences among them should be mainly attributable to the treatments. To ensure this interpretation holds, in the conduct of the experiment, all experimental units within a block should be treated as uniformly as possible:
• In a field, all plots should be harvested at the same time of day.
• All measurements using a single device should be made by the same individual if different people use it in a different way.
• All plants in a greenhouse should be watered at the same time of day or by the same amount.
Advantages:
• Greater precision is possible than with a completely randomized design with one-way classification.
• Increased scope of inference is possible because more experimental conditions may be included.
Disadvantages:
• If there is a large number of treatments, a large number of experimental units per block will be required. Thus, large variation among experimental units within blocks might still arise, with the result that no precision is gained, but experimental procedure is more complicated. In this case, other designs may be more appropriate.
3.3.2. Linear Additive Model for Two-way Classification
We assume here that one observation is taken on each experimental unit (i.e., sampling unit = experimental unit). Assume that a randomized complete block design is used with exactly one experimental unit per treatment per block. For the two-way classification with t treatments, we may classify an individual observation as being from the jth block on the ith treatment:
$$Y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij} = \mu_i + \beta_j + \varepsilon_{ij} = \mu_{ij} + \varepsilon_{ij}, \qquad i = 1, \ldots, t; \ j = 1, \ldots, r,$$
where
• t = number of treatments
• r = number of replicates on treatment i, that is, the number of blocks
• $\mu$ = overall mean (as before)
• $\tau_i$ = effect of the ith treatment (as before)
• $\mu_i = \mu + \tau_i$ = mean of the population for the ith treatment
• $\beta_j$ = effect of the jth block
• $\mu_{ij} = \mu + \tau_i + \beta_j = \mu_i + \beta_j$ = mean of the population for the ith treatment in the jth block
• $\varepsilon_{ij}$ = “error” describing all other sources of variation (e.g., inherent variation among experimental units not attributable to treatments or blocks)
In this model, the effect of the jth block, $\beta_j$, is a deviation from the overall mean $\mu$ attributable to being an experimental unit in that block. It is a systematic deviation, the same for all experimental units in the block, thus formally characterizing the fact that the experimental units in the same block are “alike.” In fact, this model is just an extension of that for the paired design with two treatments considered previously. In that model, we had a term $\rho_j$, the effect of the jth pair. Here, it should be clear that this model just extends the idea behind meaningful pairing of observations to groups larger than two (and more than two treatments). Thus, the paired design is just a special case of a randomized complete block design in the case of two treatments.
Fixed and Random Effects: As in the one-way classification, the treatment and block effects, $\tau_i$ and $\beta_j$, may be regarded as fixed or random. We have already discussed the notion of regarding treatments as having fixed or random effects. We may also apply the same reasoning to blocks. There are thus a number of possible combinations. It turns out that, unlike in the case of the one-way classification with one sampling unit per experimental unit, the distinction between fixed and random effects becomes very important in higher-way classifications.
In particular, although the computation of quantities in an ANOVA table may be the same, the interpretation of these quantities, i.e., what the MSs involved estimate, will depend on what’s fixed and what’s random. Model Restriction: As in the one-way classification, the linear additive model above is overparameterized. If we think about the mean for the ith treatment in the jth block, μij , it should be clear that we do not know how much of what we see is attributable to each of the components μ, τi , and βj .
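The overparameterization can be seen in a small Python sketch: two different sets of values for $\mu$ and the $\tau_i$ produce exactly the same cell means $\mu_{ij}$, so data on the cell means alone cannot tell them apart. All numerical values and the function name are hypothetical, chosen only for illustration.

```python
def cell_means(mu, tau, beta):
    """Return the t x r table of mu_ij = mu + tau_i + beta_j."""
    return [[mu + tau_i + beta_j for beta_j in beta] for tau_i in tau]


# One hypothetical parameterization: t = 3 treatments, r = 2 blocks.
mu, tau, beta = 10.0, [1.0, -1.0, 0.0], [0.5, -0.5]

# A second parameterization: move a constant from the tau_i into mu.
shift = 2.0
mu_alt = mu + shift
tau_alt = [tau_i - shift for tau_i in tau]

original = cell_means(mu, tau, beta)
shifted = cell_means(mu_alt, tau_alt, beta)

print(original)
print(shifted)
print("Same cell means:", original == shifted)
```

The same argument applies to shifting a constant between $\mu$ and the $\beta_j$, which is why some restriction on the parameters is needed before they can be interpreted.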
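One Approach: The usual approach is to impose the restrictions
$$\sum_{i=1}^{t} \tau_i = 0 \quad \text{and} \quad \sum_{j=1}^{r} \beta_j = 0.$$
One way to think about this is to suppose the overall mean $\mu$ is thought of as the average of the $\mu_{ij}$:
$$\mu = \frac{1}{rt} \sum_{i=1}^{t} \sum_{j=1}^{r} \mu_{ij} = \frac{1}{rt} \sum_{i=1}^{t} \sum_{j=1}^{r} (\mu + \tau_i + \beta_j) = \mu + \frac{1}{t} \sum_{i=1}^{t} \tau_i + \frac{1}{r} \sum_{j=1}^{r} \beta_j.$$
The restrictions make the last two terms vanish, so that $\mu$ is indeed the average of the $\mu_{ij}$.
3.3.3. ANOVA for Two-Way Classification – Randomized Complete Block Design with No Subsampling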
Assumptions: Before we turn to the analysis, we reiterate the assumptions that underlie the validity of the methods:
• The observations, and hence the errors, are normally distributed.
• The observations have the same variance.
These assumptions are necessary for the F ratios we construct to have the F sampling distribution. Recall our discussion on the validity of assumptions; the same issues apply here (and, indeed, always).
Idea: As for the one-way classification, we partition the Total SS, which measures all variation in the data from all sources, into “independent” components describing variation attributable to different sources. These components are the numerators of estimators for “variance.” We develop the idea by thinking of both the treatment and the block effects $\tau_i$ and $\beta_j$ as being fixed.
Notation: Define
$$\bar{Y}_{\cdot\cdot} = \frac{1}{rt} \sum_{i=1}^{t} \sum_{j=1}^{r} Y_{ij} = \text{overall sample mean},$$
$$\bar{Y}_{i\cdot} = \frac{1}{r} \sum_{j=1}^{r} Y_{ij} = \text{sample mean for treatment } i \text{ (over all blocks)},$$
$$\bar{Y}_{\cdot j} = \frac{1}{t} \sum_{i=1}^{t} Y_{ij} = \text{sample mean for block } j \text{ (over all treatments)},$$
$$C \ (\text{correction factor}) = \frac{\left(\sum_{i=1}^{t} \sum_{j=1}^{r} Y_{ij}\right)^2}{rt}.$$
The Total SS is, as usual, the numerator of the overall sample variance for all the data, without regard to treatments or, now, blocks:
$$\text{Total SS} = \sum_{i=1}^{t} \sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2 = \sum_{i=1}^{t} \sum_{j=1}^{r} Y_{ij}^2 - C.$$
As before, this is based on rt − 1 independent quantities (df). As before, the numerator of an F ratio for testing treatment mean differences ought to involve a variance estimate containing a component due to variation among the treatment means. This quantity will be identical to that we defined previously, by the same rationale. We thus have
$$\text{Treatment SS} = r \sum_{i=1}^{t} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2 = \sum_{i=1}^{t} \frac{\left(\sum_{j=1}^{r} Y_{ij}\right)^2}{r} - C.$$
This will have t − 1 df, again by the same argument. This assessment of variation effectively ignores the blocks, as it is based on averaging over them. By an entirely similar argument with blocks instead of treatments, we arrive at analogous quantities:
$$\text{Block SS} = t \sum_{j=1}^{r} (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2 = \sum_{j=1}^{r} \frac{\left(\sum_{i=1}^{t} Y_{ij}\right)^2}{t} - C.$$
This will have r − 1 df. This assessment of variation effectively ignores the treatments, as it is based on averaging over them.
Error SS: Again, we need a SS that represents the variation we are attributing to experimental error. If we are to have the same situation as in the one-way classification, where all SSs are additive and sum to the Total SS, we must have
Block SS + Treatment SS + Error SS = Total SS.
Solving this equation for Error SS and doing the algebra, we arrive at
$$\text{Error SS} = \sum_{i=1}^{t} \sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot})^2. \qquad [28]$$
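The partition can be checked numerically with a short Python sketch. The 3 × 4 data table below is hypothetical (three treatments in four blocks, one observation per cell) and the function name is our own. Each SS is computed directly from its definition, with Error SS taken from [28].

```python
def rcbd_sums_of_squares(y):
    """y[i][j] = observation on treatment i in block j (one per cell)."""
    t, r = len(y), len(y[0])
    grand = sum(map(sum, y)) / (r * t)                       # Ybar..
    trt_mean = [sum(row) / r for row in y]                    # Ybar_i.
    blk_mean = [sum(row[j] for row in y) / t for j in range(r)]  # Ybar_.j
    cells = [(i, j) for i in range(t) for j in range(r)]
    total = sum((y[i][j] - grand) ** 2 for i, j in cells)
    treatment = r * sum((m - grand) ** 2 for m in trt_mean)
    block = t * sum((m - grand) ** 2 for m in blk_mean)
    # Error SS from [28], not by subtraction, so the partition can be checked.
    error = sum((y[i][j] - trt_mean[i] - blk_mean[j] + grand) ** 2 for i, j in cells)
    return {"total": total, "treatment": treatment, "block": block, "error": error}


# Hypothetical data: rows are treatments, columns are blocks.
y = [[9.0, 11.0, 10.0, 12.0],
     [7.0, 8.0, 8.0, 9.0],
     [12.0, 14.0, 13.0, 15.0]]

ss = rcbd_sums_of_squares(y)
print(ss)
print("Block + Treatment + Error:", ss["block"] + ss["treatment"] + ss["error"])
```

Up to floating-point rounding, the three component SSs add to the Total SS, and the Total SS agrees with the computing formula $\sum\sum Y_{ij}^2 - C$.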
That is, if it were indeed true that we could partition Total SS into these components, then [28] would have to be the quantity characterizing experimental error. Inspection of this quantity seems pretty hopeless for attaching such an interpretation. In fact,
it is not hopeless. Recall our model restrictions
$$\sum_{i=1}^{t} \tau_i = 0 \quad \text{and} \quad \sum_{j=1}^{r} \beta_j = 0$$
and what they imply, namely, that the $\tau_i$ and $\beta_j$ may be regarded as deviations from an overall mean $\mu$. Consider then estimation of these quantities under the restrictions and this interpretation: the obvious estimators are
$$\hat{\mu} = \bar{Y}_{\cdot\cdot}, \qquad \hat{\tau}_i = \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}, \qquad \hat{\beta}_j = \bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}.$$
The “hats” denote estimation of these quantities. Thus, if we wanted to estimate $\mu_{ij} = \mu + \tau_i + \beta_j$, the estimator would be
$$\hat{\mu}_{ij} = \bar{Y}_{\cdot\cdot} + (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}) + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot}).$$
Because in our linear additive model $\varepsilon_{ij} = Y_{ij} - \mu_{ij}$ characterizes whatever variation we are attributing to experimental error, we would hope that an appropriate Error SS would be based on an estimate of $\varepsilon_{ij}$. Such an estimate is
$$Y_{ij} - \hat{\mu}_{ij} = Y_{ij} - \{\bar{Y}_{\cdot\cdot} + (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot}) + (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})\} = Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot}$$
by algebra. Note that this quantity is precisely that in [28]; thus, our candidate quantity for Error SS does indeed make sense. It is the sum of squared deviations of the individual observations, $Y_{ij}$, from their “sample mean,” that is, the sum of squares of the estimates $\hat{\varepsilon}_{ij}$ of all “left over” variation. The df associated with Error SS is the number of independent quantities on which it depends. Our partition and associated df are
Block SS + Treatment SS + Error SS = Total SS,
(r − 1) + (t − 1) + (t − 1)(r − 1) = rt − 1.
An ANOVA table is as follows (Table 1.3). Here,
$$MS_T = \frac{\text{Treatment SS}}{t - 1}, \qquad MS_B = \frac{\text{Block SS}}{r - 1}, \qquad MS_E = \frac{\text{Error SS}}{(t - 1)(r - 1)}.$$
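Given the three sums of squares, the df, mean squares, and F ratios follow by simple arithmetic, as the Python sketch below shows. The SS values are hypothetical and the function name is our own. The F ratios must still be compared with tabled critical values, which this sketch does not compute.

```python
def rcbd_anova_table(ss_block, ss_treatment, ss_error, t, r):
    """Return (df, ms, f) dictionaries for a randomized complete block design."""
    df = {"block": r - 1, "treatment": t - 1, "error": (t - 1) * (r - 1)}
    ms = {
        "block": ss_block / df["block"],              # MS_B
        "treatment": ss_treatment / df["treatment"],  # MS_T
        "error": ss_error / df["error"],              # MS_E
    }
    f = {
        "block": ms["block"] / ms["error"],           # F_B = MS_B / MS_E
        "treatment": ms["treatment"] / ms["error"],   # F_T = MS_T / MS_E
    }
    return df, ms, f


# Hypothetical sums of squares for t = 3 treatments in r = 4 blocks.
t, r = 3, 4
df, ms, f = rcbd_anova_table(ss_block=12.0, ss_treatment=60.0, ss_error=3.0, t=t, r=r)

print("df:", df)
print("MS:", ms)
print("F:", f)
```

The three df add to rt − 1, matching the partition of the Total SS.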
Table 1.3 Two-way ANOVA table – Randomized complete block design

| Source of variation | DF | Definition | SS | MS | F |
|---|---|---|---|---|---|
| Among blocks | $r - 1$ | $t \sum_{j=1}^{r} (\bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot})^2$ | $\sum_{j=1}^{r} \left(\sum_{i=1}^{t} Y_{ij}\right)^2 / t - C$ | $MS_B$ | $F_B = MS_B / MS_E$ |
| Among treatments | $t - 1$ | $r \sum_{i=1}^{t} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$ | $\sum_{i=1}^{t} \left(\sum_{j=1}^{r} Y_{ij}\right)^2 / r - C$ | $MS_T$ | $F_T = MS_T / MS_E$ |
| Error | $(t - 1)(r - 1)$ | $\sum_{i=1}^{t} \sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot})^2$ | by subtraction | $MS_E$ | |
| Total | $rt - 1$ | $\sum_{i=1}^{t} \sum_{j=1}^{r} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2$ | $\sum_{i=1}^{t} \sum_{j=1}^{r} Y_{ij}^2 - C$ | | |

Statistical Hypotheses: The primary question of interest in this setting is to determine if the means of the t treatment populations are different. We may write this formally as
$$H_{0,T}: \mu_1 = \mu_2 = \cdots = \mu_t \quad \text{vs.} \quad H_{1,T}: \text{The } \mu_i \text{ are not all equal.}$$
This may also be written in terms of the $\tau_i$ under the restriction $\sum_{i=1}^{t} \tau_i = 0$ as
$$H_{0,T}: \tau_1 = \tau_2 = \cdots = \tau_t = 0 \quad \text{vs.} \quad H_{1,T}: \text{The } \tau_i \text{ are not all equal.}$$
Test Procedure: Reject $H_{0,T}$ in favor of $H_{1,T}$ at level of significance α if $F_T > F_{t-1,(t-1)(r-1),\alpha}$.
A secondary question of interest might be whether there is a systematic effect of blocks. We may write a set of hypotheses for this under our model restrictions as
$$H_{0,B}: \beta_1 = \cdots = \beta_r = 0 \quad \text{vs.} \quad H_{1,B}: \text{The } \beta_j \text{ are not all equal.}$$
Test Procedure: Reject $H_{0,B}$ in favor of $H_{1,B}$ at level of significance α if $F_B > F_{r-1,(t-1)(r-1),\alpha}$.
Note: In most experiments, whether or not there are block differences is not really a main concern, because, by considering blocks up-front, we have acknowledged them as a possible nontrivial source of variation. If we test whether block effects are different and reject $H_{0,B}$, then, by blocking, we have probably increased the precision of our experiment, our original objective.
Expected Mean Squares: We developed the above in the case where both $\tau_i$ and $\beta_j$ have fixed effects. It is instructive to examine
the expected mean squares under our linear additive model under this condition as well as in the cases where (i) both $\tau_i$ and $\beta_j$ are random and (ii) $\tau_i$ is fixed but $\beta_j$ is random (the “mixed” case). This allows insight into the suitability of the test statistics. Here, $\sigma_\varepsilon^2$ is the variance associated with the $\varepsilon_{ij}$, that is, corresponding to what we are attributing to experimental error in this situation. From Table 1.4, we have
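Table 1.4 Expected mean square

| Source of variation | Both fixed | Both random | Mixed |
|---|---|---|---|
| $MS_B$ | $\sigma_\varepsilon^2 + t\,\dfrac{\sum_{j=1}^{r} \beta_j^2}{r - 1}$ | $\sigma_\varepsilon^2 + t\sigma_\beta^2$ | $\sigma_\varepsilon^2 + t\sigma_\beta^2$ |
| $MS_T$ | $\sigma_\varepsilon^2 + r\,\dfrac{\sum_{i=1}^{t} \tau_i^2}{t - 1}$ | $\sigma_\varepsilon^2 + r\sigma_\tau^2$ | $\sigma_\varepsilon^2 + r\,\dfrac{\sum_{i=1}^{t} \tau_i^2}{t - 1}$ |
| $MS_E$ | $\sigma_\varepsilon^2$ | $\sigma_\varepsilon^2$ | $\sigma_\varepsilon^2$ |
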
• Both $\tau_i$ and $\beta_j$ fixed: $F_T$ and $F_B$ are both appropriate. The MSs in the numerators of these statistics, $MS_T$ and $MS_B$, estimate $\sigma_\varepsilon^2$ (estimated by $MS_E$) plus an additional term that is equal to zero under $H_{0,T}$ and $H_{0,B}$, respectively.
• Both $\tau_i$ and $\beta_j$ random: Note here that $MS_T$ and $MS_B$ estimate $\sigma_\varepsilon^2$ plus an extra term involving the variances $\sigma_\tau^2$ and $\sigma_\beta^2$, respectively, characterizing variability in the populations of all possible treatments and blocks. Thus, under these conditions, $F_T$ and $F_B$ are appropriate for testing
$$H_{0,T}: \sigma_\tau^2 = 0 \quad \text{vs.} \quad H_{1,T}: \sigma_\tau^2 > 0,$$
$$H_{0,B}: \sigma_\beta^2 = 0 \quad \text{vs.} \quad H_{1,B}: \sigma_\beta^2 > 0.$$
These hypotheses are the obvious ones of interest when $\tau_i$ and $\beta_j$ are random.
• “Mixed” model: For $\tau_i$ fixed and $\beta_j$ random, the same observations as above apply. $F_T$ and $F_B$ are appropriate, respectively, for testing
$$H_{0,T}: \tau_1 = \tau_2 = \cdots = \tau_t = 0 \quad \text{vs.} \quad H_{1,T}: \text{The } \tau_i \text{ are not all equal}$$
and
$$H_{0,B}: \sigma_\beta^2 = 0 \quad \text{vs.} \quad H_{1,B}: \sigma_\beta^2 > 0.$$
• Blocking may be an effective means of explaining variation (increasing precision) so that differences among treatments that may really exist are more likely to be detected.
• The data from an experiment set up according to a particular design should be analyzed according to the appropriate procedure for that design! The above shows that if we set up the experiment according to a randomized complete block design, but then analyzed it as if it had been set up according to a completely randomized design, erroneous inference results, in this case, failure to identify real differences in treatments. The design of an experiment dictates the analysis.
Remark: For more advanced or complex designs, readers should refer to Dr. Davidian’s original lecture notes or relevant statistical textbooks.
3.3.4. More on Violation of Assumptions
Throughout the section, we have made the point that the statistical methods we are studying are based on certain assumptions. ANOVA methods rely on the assumptions of normality and constant variance, with additive error. We have already discussed the notion that these may be reasonable assumptions for many forms of continuous measurement data. We have also discussed that a logarithmic transformation is often useful for achieving approximate normality and constant variance for many types of continuous data as well. However, there are many situations where our data are in the form of counts or proportions, which are not continuous across a large range of values. In these situations, our interest still lies in assessing differences among treatments; however, the assumptions of normality and constant variance are certainly violated. It is well known for both count and proportion data that variance is not constant but rather depends on the size of the mean. Furthermore, histograms for such data are often highly asymmetric. The methods we have discussed may still be used in these situations provided that a suitable transformation is used. That is, although the distribution of Y may not be normal with constant variance for all treatments and blocks, it may be possible to transform the data and analyze them on the transformed scale, where these assumptions are more realistic. Selection of an appropriate transformation of the data, h, say, is often based on the type of data. The values $h(Y_{ij})$ are treated as the data and analyzed in the usual way. Some common transformations are
• Square root: $h(Y) = \sqrt{Y}$. This is often appropriate for count data with small values.
• Logarithmic: $h(Y) = \log Y$. We have already discussed the use of this transformation for data where errors tend to have a multiplicative effect, such as growth data. Sometimes, the log transformation is useful for count data over a large range.
• Arc sine: $h(Y) = \arcsin(\sqrt{Y})$ or $\sin^{-1}\sqrt{Y}$. This transformation is appropriate when the data are in the form of percentages or proportions.
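The three transformations can be sketched in Python as below. The function names and the example values are our own, and we assume the natural logarithm and that percentages are first converted to proportions in [0, 1] before the arc sine transformation is applied.

```python
import math


def sqrt_transform(count):
    """Square root transformation, h(Y) = sqrt(Y), for counts."""
    return math.sqrt(count)


def log_transform(value):
    """Logarithmic transformation, h(Y) = log(Y); natural log assumed, Y > 0."""
    return math.log(value)


def arcsine_transform(proportion):
    """Arc sine transformation, h(Y) = arcsin(sqrt(Y)), for proportions in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return math.asin(math.sqrt(proportion))


# Hypothetical observations, one of each data type.
counts = [0, 1, 4, 9]
growth = [1.0, 2.5, 10.0]
percentages = [5.0, 25.0, 50.0, 95.0]

print([sqrt_transform(c) for c in counts])
print([log_transform(g) for g in growth])
print([arcsine_transform(p / 100.0) for p in percentages])  # percent -> proportion
```

The transformed values, not the original ones, would then be used as the $Y_{ij}$ in the ANOVA computations.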
4. Simple Linear Regression and Correlation
So far, we have focused our attention on problems where the main issue is identifying differences among treatment means. In this setting, we based our inferences upon observations on a r.v. Y under the various experimental conditions (treatments, blocks, etc.). Another problem that arises in the biological and physical sciences, economics, industrial applications, and biomedical settings is that of investigating the relationship between two (or more) variables. Depending on the nature of the variables and the observations on them (more on this in a moment), the methods of regression analysis or correlation analysis are appropriate. In reality, the methods we developed for identifying differences among treatment means, those of ANOVA, are in fact very similar to regression analysis methods, as will become apparent in our discussion. Both sets of methods are predicated on representing the data by a linear, additive model, where the model includes components representing both systematic and random sources of variation. The common features will become evident over the course of our discussion. In this section, we will restrict our study to the simplest case, in which we have two variables and the relationship between them is reasonably assumed to be a straight line. Note, however, that the areas of regression analysis and correlation analysis are much broader than indicated by our introduction here. Terminology: It is important to clarify the usage of the term linear in statistics. Linear refers to how a component of an equation describing a relationship enters that relationship. For example, in the one-way classification model, recall that we represented an observation as Yij = μ + τi + εij . This equation is said to be linear because the components μ, τi , and εij enter directly. Contrast this with an example of an equation nonlinear in μ and τi : Yij = exp (μ + τi ) + εij . 
This equation still has an additive error, but the parameters of interest μ and τi enter in a nonlinear fashion, through the exponential function. The term linear is thus used in statistical applications in this broad way to indicate that parameters enter in a straightforward rather than complicated way. The term linear regression has a similar interpretation, as we will see. The term simple linear regression refers to the particular case where the relationship is a straight line. The use of the term linear thus refers
to how parameters come into the model for the data in a general sense and does not necessarily mean that the relationship needs to be a straight line, except when prefaced by the term simple.
Scenario: We are interested in the relationship between two variables, which we will call X and Y. We observe pairs of X and Y values on each of a sample of experimental units, and we wish to use them to say something about the relationship. How we view the relationship is dictated by the situation:
“Experimental” Data: Here, observations on X and Y are planned as the result of an experiment, laboratory procedure, etc. For example,
• X = dose of a drug, Y = response such as change in blood pressure for a human subject
• X = concentration of toxic substance, Y = number of mutant offspring observed for a pregnant rat
In these examples, we are in control of the values of X (e.g., we choose the doses or concentrations) and we observe the resulting Y.
“Observational” Data: Here, we observe both X and Y values, neither of which is under our control. For example,
• X = weight, Y = height of a human subject
• X = average heights of plants in a plot, Y = yield
In the experimental data situations, there is a distinction between what we call X and what we call Y, because the former is dictated by the investigator. It is standard to use these symbols in this way. In the observational data examples, there is not necessarily such a distinction. In any event, if we use the symbols in this way, then what we call Y is always understood to be something we observe, while X may or may not be.
Relationships Between Two Variables: In some situations, scientific theory may suggest that two variables are functionally related, e.g., Y = g(X), where g is some function. The form of g may follow from some particular theory. 
Even if there is no suitable theory, we may still suspect some kind of systematic relationship between X and Y and may be able to identify a function g that provides a reasonable empirical description. Objective: Based on a sample of observations on X and Y , formally describe and assess the relationship between them. Practical Problem: In most situations, the values we observe for Y (and sometimes X, certainly in the case of observational data) are not exact. In particular, due to biological variation among experimental units and the sampling of them, imprecision and/or inaccuracy of measuring devices, and so on, we may only
observe values of Y (and also possibly X) with some error. Thus, based on a sample of (X,Y) pairs, our ability to see the relationship exactly is obscured by this error. Random vs. Fixed X: Given these issues, it is natural to think of Y (and perhaps X) as r.v.’s. How we do this is dictated by the situation, as above: • Experimental data: Here, X (dose, concentration) is fixed at predetermined levels by the experimenter. Thus, X is best viewed as a fixed quantity (like treatment in our previous situations). Y , on the other hand, which is subject to biological and sampling variation and error in measurement, is a r.v. Clearly, the values for Y we do get to see will be related to the fixed values of X. • Observational data: Consider Y = height, X = weight. In this case, neither weight nor height is a fixed quantity; both are subject to variation. Thus, both X and Y must be viewed as r.v.’s. Clearly, the values taken on by these two r.v.’s are related or associated somehow. Statistical Models: These considerations dictate how we think of a formal statistical model for the situation: • Experimental data: A natural way to think about Y is by representing it as Y = g(X ) + ε.
[29]
Here, then, we believe the function g describes the relationship, but values of Y we observe are not exactly equal to g(X) because of the errors mentioned above. The additive “error” ε characterizes this, just as in our previous models. In this situation, the following terminology is often used:
Y = response or dependent variable
X = concomitant or independent variable, covariate
• Observational data: In this situation, there is really not much distinction between X and Y, as both are seen with error. Here, the terms independent and dependent variable may be misleading. For example, if we have observed pairs of X = weight, Y = height, it is not necessarily clear whether we should be interested in a relationship Y = g(X) or X = h(Y), say. Even if we have in our mind that we want to think of the relationship in a particular way, say Y = g(X), it is clear that the above model [29] is not really appropriate, as it does not take into account “error” affecting X.
4.1. Simple Linear Regression Model
Straight Line Model: Consider the particular situation of experimental data, where it is legitimate to regard X as fixed. It is often reasonable to suppose that the relationship between Y and X, which we have called in general g, is in fact a straight line. We may write this as
Y = β0 + β1 X + ε
[30]
for some values β0 and β1. Here, then, g is a straight line with
• Intercept β0: the value taken on at X = 0.
• Slope β1: expresses the rate of change in Y, i.e., β1 = change in Y brought about by a change of one unit in X.
Issue: The problem is that we do not know β0 or β1. To get information on their values, the typical experimental setup is to choose values Xi, i = 1, ..., n, and observe the resulting responses Y1, ..., Yn, so that the data consist of the pairs (Xi, Yi), i = 1, ..., n. The data are then used to estimate β0 and β1, i.e., to fit the model to the data, in order to
• quantify the relationship between Y and X.
• use the relationship to predict a new response Y0 we might observe at a given value X0 (perhaps one not included in the experiment).
• use the relationship to calibrate: given a new Y0 value we might see, for which the corresponding value X0 is unknown, estimate the value X0.
The model [30] is referred to as a simple linear regression model. The term regression refers to the postulated relationship.
• The regression relationship in this case is the straight line β0 + β1 X.
• The parameters β0 and β1 characterizing this relationship are called regression coefficients or regression parameters.
If we think of our data (Xi, Yi), we may thus think of a model for Yi as follows:
Yi = β0 + β1 Xi + εi = μi + εi,  μi = β0 + β1 Xi.
This looks very much like our linear additive model in the ANOVA, but with only one observation on each “treatment.” That is, μi = β0 + β1 Xi is the mean of observations we would see at the particular setting Xi. In the one-way classification situation, we would certainly not be able to estimate each mean μi with a single observation; however, here, because we also have the variable X, and are willing to represent μi as a particular function of X (here, the straight line), we will be able to take advantage of this functional relation to estimate μi. Hence, we only need the single subscript i.

The “errors” εi characterize all the sources of inherent variation that cause Yi to not exactly equal its mean, μi = β0 + β1 Xi. We may think of this as experimental error: all unexplained inherent variation due to the experimental unit. At any Xi value, the mean of responses Yi we might observe is μi = β0 + β1 Xi. The means are on a straight line over all X values. Because of “error”, at any Xi value, the Yi vary about the mean μi, so do not lie on the line, but are scattered about it.

Objective: For the simple linear regression model, fit the line to the data to serve as our “best” characterization of the relationship based on the available data. More precisely, estimate the parameters β0 and β1 that characterize the mean at any X value.

Remark: We may now comment more precisely on the meaning of the term linear regression. In practice, the regression need not be a straight line, nor need there be a single independent variable X. For example, the underlying relationship between Y and X (that is, the mean) may be more adequately represented by a curve like
β0 + β1 X + β2 X².
[31]
Or, if Y is some measure of growth of a plant and X is time, we would eventually expect the relationship to level off when X gets large, as plants cannot continue to get large without bound! A popular model for this is the logistic growth function
$$\frac{\beta_1}{1 + \beta_2 e^{\beta_3 X}}.$$
[32]
The curve for this looks much like that above, but would begin to “flatten out” if we extended the picture for large values of X. In the quadratic model [31], note that, although the function is no longer a straight line, it is still a straightforward function of the regression parameters characterizing the curve, β0 , β1 , β2 . In particular, β0 , β1 , and β2 enter in a linear fashion. Contrast this with the logistic growth model [32]. Here, the regression parameters characterizing the curve do not enter the model in a straightforward fashion. In particular, the parameters β2 and β3 appear in a quite complicated way, in the denominator. This function is thus not linear as a function of β1 , β2 , and β3 ; rather, it is better described as nonlinear. It turns out that linear functions are much easier to work with than nonlinear functions. Although we will work strictly with the simple linear regression model, be aware that the methods we discuss
extend easily to more complex linear models like [31] but they do not extend as easily to nonlinear models, which need advanced techniques.
4.1.1. The Bivariate Normal Distribution
Consider the situation of observational data. Because both X and Y are subject to error, both are r.v.’s that are somehow related. Recall that a probability distribution provides a formal description of the population of possible values that might be taken on by a r.v. So far, we have only discussed this notion in the context of a single r.v. It is possible to extend the idea of a probability distribution to two r.v.’s. Such a distribution is called a bivariate probability distribution. This distribution describes not only the populations of possible values that might be taken on by the two r.v.’s, but also how those values are taken on together. Consider our X and Y. Formally, we would think of a probability distribution function f(x, y) that describes the populations of X and Y and how they are related; i.e., how X and Y vary together.

Bivariate Normal Distribution: Recall that the normal distribution is often a reasonable description of a population of continuous measurements. When both X and Y are continuous measurements, a reasonable assumption is that they are both normally distributed. However, we also expect them to vary together. The bivariate normal distribution is a probability distribution with a probability density function f(x, y) for both X and Y such that
• The two r.v.’s X and Y each have normal distributions with means μX and μY, and variances $\sigma_X^2$ and $\sigma_Y^2$, respectively.
• The relationship between X and Y is characterized by a quantity ρXY such that −1 ≤ ρXY ≤ 1.
• ρXY is referred to as the correlation coefficient between the two r.v.’s X and Y and measures the linear association between values taken on by X and values taken on by Y.
   ρXY = 1: all possible values of X and Y lie on a straight line with positive slope
   ρXY = −1: all possible values of X and Y lie on a straight line with negative slope
   ρXY = 0: there is no relationship between X and Y
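To make the role of ρXY concrete, the following is a minimal simulation sketch, not part of the original analysis. It draws pairs from a bivariate normal distribution with a chosen correlation and then computes the usual sample correlation coefficient, SXY/√(SXX SYY), from them. The population values (means, SDs, ρ = 0.8), the seed, and the function names are hypothetical choices made only for illustration.

```python
import math
import random


def simulate_bivariate_normal(n, mu_x, mu_y, sd_x, sd_y, rho, seed=0):
    """Draw n (X, Y) pairs from a bivariate normal with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x = mu_x + sd_x * z1
        # rho*z1 + sqrt(1 - rho^2)*z2 is standard normal with correlation rho to z1
        y = mu_y + sd_y * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        pairs.append((x, y))
    return pairs


def sample_correlation(pairs):
    """Sample correlation coefficient S_XY / sqrt(S_XX * S_YY)."""
    n = len(pairs)
    x_bar = sum(x for x, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
    s_yy = sum((y - y_bar) ** 2 for _, y in pairs)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical population: X = weight (kg), Y = height (cm), rho = 0.8.
pairs = simulate_bivariate_normal(5000, 70.0, 170.0, 10.0, 8.0, rho=0.8)
r = sample_correlation(pairs)
print(f"sample correlation = {r:.3f}")
```

With a sample this large, the sample correlation should land close to the value of ρXY used to generate the data; rerunning with rho near 0 or near ±1 illustrates the three cases listed above.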
Objective: Given our observed data pairs (Xi, Yi), we would like to quantify the degree of association. To do this, we estimate ρXY.
4.1.2. Comparison of Regression and Correlation Models
We have identified two appropriate statistical models for thinking about the problem of assessing association between two variables X and Y . These may be thought of as
• Fixed X: Postulate a model for the mean of the r.v. Y as a function of the fixed quantity X (in particular, we focused on a straight line in X). Estimate the parameters in the model to characterize the relationship. • Random X: Characterize the (linear) relationship between X and Y by the correlation between them (in a bivariate normal probability model) and estimate the correlation parameter. It turns out that the arithmetic operations for regression analysis under the first scenario and correlation analysis under the second are the same! That is, to fit the regression model by estimating the intercept and slope parameters and to estimate the correlation coefficient, we use the same operations on our data! The important issue is in the interpretation of the results. Subtlety: In settings where X is best regarded as a r.v., many investigators still want to fit regression models treating X as fixed. This is because, although correlation describes the “degree of association” between X and Y , it doesn’t characterize the relationship in a way suitable for some purposes. For example, an investigator may desire to predict the yield of a plot based on observing the average height of plants in the plot. The correlation coefficient does not allow this. He thus would rather fit a regression model, even though X is random. Is this legitimate? If we are careful about the interpretation, it may be. If X and Y are really both observed r.v.’s, and we fit a regression to characterize the relationship, technically, any subsequent analyses based on this are regarded as conditional on the values of X involved. This means that we essentially regard X as “fixed,” even though it isn’t. However, this may be okay for the prediction problem above. Conditional on having seen a particular average height, he wants to get a “best guess” for yield. 
He is not saying that he could control heights and thereby influence yields, only that, given he sees a certain height, he might be able to say something about the associated yield. This subtlety is an important one. Inappropriate use of statistical techniques may lead one to erroneous or irrelevant inferences. It’s best to consult a statistician for help in identifying both a suitable model framework and the conditions under which regression analysis may be used with observational data.
4.1.3. Fitting a Simple Linear Regression Model – The Method of Least Squares
Now that we have discussed some of the conceptual issues involved in studying the relationship between two variables, we are ready to describe practical implementation. We do this first for fitting a simple linear regression model. Throughout this discussion, assume that it is legitimate to regard the X’s as fixed. For observations (Xi , Yi ), i = 1, . . . , n, we postulate the simple linear regression model Yi = β0 + β1 Xi + εi , i = 1, . . . , n.
We wish to fit this model by estimating the intercept and slope parameters β0 and β1 . Assumptions: For the purposes of making inference about the true values of intercept and slope, making predictions, and so on, we make the following assumptions. These are often reasonable, and we will discuss violations of them later in this section. 1. The observations Y1 , . . . , Yn are independent in the sense that they are not related in any way. [For example, they are derived from different animals, subjects, etc. They might also be measurements on the same subject, but taken far apart enough in time to where the value at one time is totally unrelated to that at another.] 2. The observations Y1 , . . . , Yn have the same variance, σ 2 . [Each Y i is observed at a possibly different Xi value, and is thought to have mean μi = β0 + β1 Xi . At each Xi value, we may thus think of the possible values for Y i and how they might vary. This assumption says that, regardless of which Xi we consider, this variation in possible Y i values is the same.] 3. The observations Y i are each normally distributed with mean μi = β0 + β1 Xi , i = 1 . . . , n, and variance σ 2 (the same, as in 2 above). [That is, for each Xi value, we think of all the possible values taken on by Y as being well represented by a normal distribution.] The Method of Least Squares: There is no one way to estimate β0 and β1 . The most widely accepted method is that of least squares. This idea is intuitively appealing. It also turns out to be, mathematically speaking, the appropriate way to estimate these parameters under the assumption of normality 3 above. For each Y i , note that Yi − (β0 + β1 Xi ) = εi , that is, the deviation Yi − (β0 + β1 Xi ) is a measure of the vertical distance of the observation Y i from the line β0 + β1 Xi that is due to the inherent variation (represented by εi ). This deviation may be negative or positive. 
A natural way to measure the overall deviation of the observed data Yi from their means, the regression line β0 + β1 Xi, due to this error is
$$\sum_{i=1}^{n} \{Y_i - (\beta_0 + \beta_1 X_i)\}^2.$$
This has the same appeal as a sample variance – we ignore the signs of the deviations but account for their magnitude. The method of least squares derives its name from thinking about this measure. In particular, we want to find the estimates of β0 and β1 that are the “most plausible” to have generated the data. Thus,
Fig. 1.2. Illustration of Linear Least Squares.
a natural way to think about this is to choose as estimates the values $\hat\beta_0$ and $\hat\beta_1$ that make this measure of overall variation as small as possible (that is, which minimize it). This way, we are attributing as much of the overall variation in the data as possible to the assumed straight line relationship. Formally, then, $\hat\beta_0$ and $\hat\beta_1$ minimize
$$\sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2.$$
This is illustrated in Fig. 1.2. The line fitted by least squares is the one that makes the sum of the squares of all the vertical discrepancies as small as possible. To find the form of the estimators $\hat\beta_0$ and $\hat\beta_1$, calculus may be used to solve this minimization problem. Define
$$S_{XY} = \sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y) = \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n},$$
$$S_{XX} = \sum_{i=1}^{n} (X_i - \bar X)^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n},$$
$$S_{YY} = \sum_{i=1}^{n} (Y_i - \bar Y)^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n},$$
where $\bar X$ and $\bar Y$ are the sample means of the Xi and Yi values, respectively. Then the calculus arguments show that the values $\hat\beta_0$ and $\hat\beta_1$ minimizing the sum of squared deviations above satisfy
$$\hat\beta_1 = \frac{S_{XY}}{S_{XX}}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X.$$
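The calculus argument referred to above can be sketched as follows. This is the standard derivation rather than one spelled out in the text, and the symbol Q for the sum of squared deviations is introduced here only for convenience.

```latex
% Q is the sum of squared vertical deviations (notation assumed for this sketch).
\[
Q(\beta_0,\beta_1) = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 .
\]
% Setting both partial derivatives to zero gives the two "normal equations".
\[
\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i) = 0,
\qquad
\frac{\partial Q}{\partial \beta_1} = -2\sum_{i=1}^{n} X_i\,(Y_i - \beta_0 - \beta_1 X_i) = 0 .
\]
% The first equation yields the intercept in terms of the slope.
\[
\hat\beta_0 = \bar Y - \hat\beta_1 \bar X .
\]
% Substituting into the second equation:
\[
\sum_{i=1}^{n} X_i\,(Y_i - \bar Y) = \hat\beta_1 \sum_{i=1}^{n} X_i\,(X_i - \bar X).
\]
% Because \sum (Y_i - \bar Y) = 0 and \sum (X_i - \bar X) = 0, the left side
% equals S_{XY} and the sum on the right equals S_{XX}, so
\[
\hat\beta_1 = \frac{S_{XY}}{S_{XX}} .
\]
```

Because Q is a sum of squares that is quadratic in β0 and β1, this stationary point is a minimum whenever the Xi are not all equal (so that SXX > 0).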
Thus, the fitted straight line is given by $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i$. The “hat” on the Yi emphasizes the fact that these values are our “best guesses” for the means at each Xi value and that the actual values Y1, ..., Yn we observed may not fall on the line. The $\hat Y_i$ are often called the predicted values; they are the estimated values of the means at the Xi.

Example: (Zar, Biostatistical Analysis, p. 225) The following data are rates of oxygen consumption of birds (Y) measured at different temperatures (X). Here, the temperatures were set by the investigator, and Y was measured, so the assumption of fixed X is justified.

X [°C]:          −18    −15    −10    −5     0      5      10     19
Y [(ml/g)/hr]:   5.2    4.7    4.5    3.6    3.4    3.1    2.7    1.8
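Calculations: We have n = 8.
$$\sum_{i=1}^{n} Y_i = 29, \quad \bar Y = 3.625, \quad \sum_{i=1}^{n} Y_i^2 = 114.04,$$
$$\sum_{i=1}^{n} X_i = -14, \quad \bar X = -1.75, \quad \sum_{i=1}^{n} X_i^2 = 1160, \quad \sum_{i=1}^{n} X_i Y_i = -150.4.$$
$$S_{XY} = -150.4 - \frac{(29)(-14)}{8} = -99.65, \quad S_{XX} = 1160 - \frac{(-14)^2}{8} = 1135.5, \quad S_{YY} = 114.04 - \frac{29^2}{8} = 8.915.$$
Thus, we obtain
$$\hat\beta_1 = \frac{-99.65}{1135.5} = -0.0878, \qquad \hat\beta_0 = 3.625 - (-0.0878)(-1.75) = 3.4714.$$
The fitted line is $\hat Y_i = 3.4714 - 0.0878 X_i$.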
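The hand calculation for the oxygen consumption example can be checked with a short script. This is a minimal sketch using only the Python standard library; the function name fit_simple_linear_regression is our own choice, and the formulas are the ones for SXY, SXX, and the least squares estimates given above.

```python
# Oxygen consumption data from the example (Zar):
# X = temperature in deg C (set by the investigator), Y = rate in (ml/g)/hr.
x = [-18, -15, -10, -5, 0, 5, 10, 19]
y = [5.2, 4.7, 4.5, 3.6, 3.4, 3.1, 2.7, 1.8]


def fit_simple_linear_regression(x, y):
    """Return the least squares (intercept, slope) for Y = b0 + b1*X + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx            # slope = S_XY / S_XX
    b0 = y_bar - b1 * x_bar     # intercept = Ybar - slope * Xbar
    return b0, b1


b0, b1 = fit_simple_linear_regression(x, y)
fitted = [b0 + b1 * xi for xi in x]   # predicted values at each observed X
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
# prints: intercept = 3.4714, slope = -0.0878
```

The deviation form of SXY and SXX is used here rather than the computing formulas; the two are algebraically identical, but the deviation form tends to be less sensitive to rounding on a computer.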
Remark: It is always advisable to plot the data before analysis, to ensure that the model assumptions seem valid.
4.1.4. Assessing the Fitted Regression
Recall that for a single sample, we use $\bar Y$ as our estimate of the mean and use the SE $s_{\bar Y}$ as our estimate of precision of $\bar Y$ as an estimator of the mean. Here, we wish to do the same thing. How precisely have we estimated the intercept and slope parameters, and, for that matter, the line overall? Specifically, we would like to quantify
• The precision of the estimate of the line
• The variability in the estimates $\hat\beta_0$ and $\hat\beta_1$.
Consider the identity
$$Y_i - \bar Y = (\hat Y_i - \bar Y) + (Y_i - \hat Y_i).$$
Algebra and the fact that $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i = \bar Y + \hat\beta_1 (X_i - \bar X)$ may be used to show that
$$\sum_{i=1}^{n} (Y_i - \bar Y)^2 = \sum_{i=1}^{n} (\hat Y_i - \bar Y)^2 + \sum_{i=1}^{n} (Y_i - \hat Y_i)^2.$$
[33]
The quantity on the left-hand side of this expression is one you should recognize: it is the Total SS for the set of data. For any set of data, we may always compute the Total SS as the sum of squared deviations of the observations from the (overall) mean, and it serves as a measure of the overall variation in the data. Thus, [33] represents a partition of our assessment of overall variation in the data, Total SS, into two independent components.
• $(\hat Y_i - \bar Y)$ is the deviation of the predicted value of the ith observation from the overall mean. $\bar Y$ would be the estimate of mean response at all X values we would use if we did not believe X played a role in the values of Y. Thus, this deviation measures the difference between going to the trouble to have a separate mean for each X value and just using a single, common mean as the model. We would expect the sum of squared deviations
$$\sum_{i=1}^{n} (\hat Y_i - \bar Y)^2$$
to be large if using separate means via the regression model is much better than using a single mean. Using a single mean effectively ignores the Xi, so we may think of this as measuring the variation in the observations that may be explained by the regression line β0 + β1 Xi.
• $(Y_i - \hat Y_i)$ is the deviation between the predicted value for the ith observation (our “best guess” for its mean) and the observation itself (that we observed). Hence, the sum of squared deviations
$$\sum_{i=1}^{n} (Y_i - \hat Y_i)^2$$
measures any additional variation of the observations about the regression line, that is, the inherent variation in the data at each Xi value that causes observations not to lie on the line.
Thus, the overall variation in the data, as measured by Total SS, may be broken down into two components that each characterize parts of the variation:
• Regression SS = $\sum_{i=1}^{n} (\hat Y_i - \bar Y)^2$, which measures that portion of the variability that may be explained by the regression relationship (so is actually attributable to a systematic source, the assumed straight line relationship between Y and X).
• Error SS (also called Residual SS) = $\sum_{i=1}^{n} (Y_i - \hat Y_i)^2$, which measures the inherent variability in the observations (e.g., experimental error).
Result: We hope that the (straight line) regression relationship explains a good part of the variability in the data. A large value of Regression SS would in some sense indicate this.
Coefficient of Determination: One measure of this is the ratio
$$R^2 = \frac{\text{Regression SS}}{\text{Total SS}}.$$
R² is called the coefficient of determination or the (squared) multiple correlation coefficient. (This second name arises from the fact that it turns out to be algebraically the square of the value we would use to “estimate” the correlation between the Yi and $\hat Y_i$ values and is not to be confused with correlation as we have discussed it previously.) Intuitively, R² is a measure of the “proportion of total variation in the data explained by the assumed straight line relationship with X.” Note that we must have 0 ≤ R² ≤ 1, because both components are nonnegative and the numerator can be no larger than the denominator. Thus, an R² value close to 1 is often taken as evidence that the regression model does “a good job” at describing the variability in the data, better than if we just assumed a common mean (and ignored Xi).

It is critical to understand what R² does and does not measure. R² is computed under the assumption that the simple linear regression model is correct, i.e., that it is a good description of the underlying relationship between Y and X. Thus, it assesses, if the relationship between X and Y really is a straight line, how much of the variation in the data may actually be attributed to that relationship rather than just to inherent variation. If R² is small it may be that there is a lot of random inherent variation in the data, so that, although the straight line is a reasonable model, it can only explain so much of the observed overall variation.

ANOVA: The partition of Total SS above has the same interpretation as in the situations we have already discussed. Thus, it is common practice to summarize the results in an ANOVA table as in Table 1.5.

Table 1.5. ANOVA for simple linear regression
Source       DF      SS                                          MS                   F
Regression   1       $\sum_{i=1}^{n} (\hat Y_i - \bar Y)^2$      MSR = SS/1           FR = MSR/MSE
Error        n − 2   $\sum_{i=1}^{n} (Y_i - \hat Y_i)^2$         MSE = SS/(n − 2)
Total        n − 1   $\sum_{i=1}^{n} (Y_i - \bar Y)^2$

Note that Total SS has n − 1 df, as always. It may be shown that
$$\text{Regression SS} = \sum_{i=1}^{n} (\hat Y_i - \bar Y)^2 = \hat\beta_1^2 \sum_{i=1}^{n} (X_i - \bar X)^2.$$
$\hat\beta_1$ is a single function of the Yi, thus it is a single independent quantity. Thus, we see that Regression SS has 1 df. By subtraction, Error SS has n − 2 df.

Calculations:
$$\text{Regression SS} = \frac{S_{XY}^2}{S_{XX}}.$$
Total SS (= SYY) is calculated in the usual way. Thus, Error (Residual) SS may be found by subtraction. It turns out that the expected mean squares, that is, the values estimated by MSR and MSE, are
$$MS_R: \ \sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar X)^2, \qquad MS_E: \ \sigma^2.$$
Thus, if β1 ≈ 0, the two MSs we observe should be about the same, and we would expect FR to be small. However, note that β1 = 0 implies that the true regression line is Yi = β0 + εi, that is, there is no association with X (slope = 0) and thus all Yi have the same mean β0, regardless of the value of Xi. There is a straight line relationship, but it has slope 0, which effectively means no relationship.
Table 1.6. ANOVA table for the oxygen data
Source             DF    SS       MS       F
Regression         1     8.745    8.745    308.927
Error (Residual)   6     0.170    0.028
Total              7     8.915
If β1 ≠ 0, then we would expect the ratio FR to be large. It is possible to show mathematically that a test of the hypotheses
H0: β1 = 0 vs. H1: β1 ≠ 0
may be carried out by comparing FR to the appropriate value from an F1,n−2 distribution. That is, the statistic FR may be shown to have this distribution if H0 is true. Thus, the procedure would be
Reject H0 at level of significance α if FR > F1,n−2,α.
The interpretation of the test is as follows. Under the assumption that a straight line relationship exists, we are testing whether or not the slope of this relationship is in fact zero. A zero slope means that there is no systematic change in mean along with change in X, that is, no association. It is important to recognize that if the true relationship is not a straight line, then this may be a meaningless test.

Example: For the oxygen consumption data, assume that the data are approximately normally distributed with constant variance. We have Total SS = SYY = 8.915, SXY = −99.65, SXX = 1135.5 from before, with n = 8. Thus,
$$\text{Regression SS} = \frac{(-99.65)^2}{1135.5} = 8.745, \qquad \text{Error (Residual) SS} = 8.915 - 8.745 = 0.170$$
(see Table 1.6). We have F1,6,0.05 = 5.99. FR = 308.927 > 5.99; thus, we reject H0 at level of significance α = 0.05. There is strong evidence in these data to suggest that, under the assumption that the simple linear regression model is appropriate, the slope is not zero, so that an association appears to exist. Note that
$$R^2 = \frac{8.745}{8.915} = 0.981;$$
thus, as the straight line assumption appears to be consistent with the visual evidence in the plot of the data, it is reasonable to conclude that the straight line relationship explains a very high proportion of the variation in the data (the fact that Y i values are different is mostly due to the relationship with X). Estimate of σ 2 : If we desire an estimate of the variance σ 2 associated with inherent variation in the Y i values (due to variation among experimental units, sampling, and measurement error), from the expected mean squares above, the obvious estimate is the Error (Residual) MS. That is, denoting the estimate by s 2 , s 2 = MSE .
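The ANOVA quantities for the oxygen consumption example, including R² and the estimate s² = MSE, can be reproduced with a few lines of code. This is a minimal sketch using only the Python standard library; the variable names are our own, and the critical value F1,6,0.05 = 5.99 is simply the value quoted in the text rather than something computed here.

```python
# Oxygen consumption data from the example: X = temperature, Y = rate.
x = [-18, -15, -10, -5, 0, 5, 10, 19]
y = [5.2, 4.7, 4.5, 3.6, 3.4, 3.1, 2.7, 1.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Partition of Total SS into Regression SS (1 df) and Error SS (n - 2 df).
total_ss = s_yy
regression_ss = s_xy ** 2 / s_xx
error_ss = total_ss - regression_ss

ms_regression = regression_ss / 1
ms_error = error_ss / (n - 2)          # this is s^2, the estimate of sigma^2
f_ratio = ms_regression / ms_error
r_squared = regression_ss / total_ss

F_CRIT = 5.99                          # F_{1,6,0.05}, as quoted in the text
reject_h0 = f_ratio > F_CRIT

print(f"Regression SS = {regression_ss:.3f}, Error SS = {error_ss:.3f}")
print(f"MSE = {ms_error:.3f}, F = {f_ratio:.1f}, R^2 = {r_squared:.3f}")
print("Reject H0: slope = 0?", reject_h0)
```

Carrying full precision through the calculation gives an F ratio of about 309, in agreement with Table 1.6 up to rounding of the intermediate sums of squares.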
4.1.5. Confidence Intervals for Regression Parameters and Means
Because β0, β1, and in fact the entire regression line, are population parameters that we have estimated, we wish to attach some measure of precision to our estimates of them.

SEs and CIs for Regression Parameters: It turns out that it may be shown, under our assumptions, that, if the relationship really is a straight line, the SDs of the populations of all possible $\hat\beta_1$ and $\hat\beta_0$ values are
$$SD(\hat\beta_1) = \frac{\sigma}{\sqrt{S_{XX}}}, \qquad SD(\hat\beta_0) = \sigma \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n S_{XX}}},$$
respectively. Because σ is not known, we estimate these SDs by replacing σ by the estimate s. We thus obtain the estimated SDs
$$\text{EST SD}(\hat\beta_1) = \frac{s}{\sqrt{S_{XX}}}, \qquad \text{EST SD}(\hat\beta_0) = s \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{n S_{XX}}},$$
respectively. These are often referred to, analogous to a single sample mean, as the SEs of $\hat\beta_1$ and $\hat\beta_0$. It may also be shown under our assumptions that
$$\frac{\hat\beta_1 - \beta_1}{\text{EST SD}(\hat\beta_1)} \sim t_{n-2} \quad \text{and} \quad \frac{\hat\beta_0 - \beta_0}{\text{EST SD}(\hat\beta_0)} \sim t_{n-2}.$$
[34]
These results are similar in spirit to those for a single mean and difference of means; the t distribution is relevant rather than the normal because we have replaced σ by an estimate (with n − 2 df). Because we are estimating the true parameters β1 and β0 by these estimates, it is common practice to provide a CI for the true values β1 and β0 , just as we did for a sample mean or difference of means. The derivation and the interpretation are the same. By
“inverting” probability statements about the quantities in [34] in the same fashion, we arrive at the following 100(1 − α)% CIs:
Interval for β1: $\{\hat\beta_1 - t_{n-2,\alpha/2}\,\text{EST SD}(\hat\beta_1),\ \hat\beta_1 + t_{n-2,\alpha/2}\,\text{EST SD}(\hat\beta_1)\}$
Interval for β0: $\{\hat\beta_0 - t_{n-2,\alpha/2}\,\text{EST SD}(\hat\beta_0),\ \hat\beta_0 + t_{n-2,\alpha/2}\,\text{EST SD}(\hat\beta_0)\}$.
The interpretation is as follows: Suppose that zillions of experiments were conducted using the same (fixed) Xi values as those in the observed experiment. Suppose that for each of these, we fitted the regression line by the above procedures and calculated 100(1 − α)% CIs for β1 (and β0). Then for 100(1 − α)% of these, the true value of β1 (β0) would fall between the endpoints. The endpoints are a function of the data; thus, whether or not β1 (β0) falls within the endpoints is a function of the experimental procedure. Thus, just as in our earlier cases, the CI is a statement about the quality of the experimental procedure for learning about the value of β1 (β0).

SE and CI for the Mean: Our interest in the values of β0 and β1 is usually because we are interested in the characteristics of Y at particular X values. Recall that μi = β0 + β1 Xi is the mean of the Yi values at the value of Xi. Thus, just as we are interested in estimating the mean of a single sample to give us an idea of the “center” of the distribution of possible values, we may be interested in estimating the mean of Yi values at a particular value of X. For example, an experiment may have been conducted to characterize the relationship between concentration of a suspected toxicant (X) and a response like number of mutated cells (Y). Based on the results, the investigators may wish to estimate the numbers of mutations that might be seen on average at other concentrations not considered in the experiment. That is, they are interested in the “typical” (mean) number of mutations at particular X values. Consider this problem in general.
Suppose we have fitted a regression line to data, and we wish to estimate the mean response at a new value X0. That is, we wish to estimate μ0 = β0 + β1 X0. The obvious estimator for μ0 is the value of this expression with the estimates of the regression parameters plugged in, that is,
$$\hat\mu_0 = \hat\beta_0 + \hat\beta_1 X_0.$$
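Returning briefly to the SEs and CIs for the regression parameters given earlier in this section, the following minimal sketch applies them to the oxygen consumption data. It uses only the Python standard library; the variable names are our own, and the critical value t6,0.025 = 2.447 is taken from a standard t table as an assumption rather than computed.

```python
import math

# Oxygen consumption data from the example: X = temperature, Y = rate.
x = [-18, -15, -10, -5, 0, 5, 10, 19]
y = [5.2, 4.7, 4.5, 3.6, 3.4, 3.1, 2.7, 1.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

b1 = s_xy / s_xx                 # estimated slope
b0 = y_bar - b1 * x_bar          # estimated intercept

# s estimates sigma: square root of Error MS, which has n - 2 df.
s = math.sqrt((s_yy - s_xy ** 2 / s_xx) / (n - 2))

# Estimated SDs (SEs) of the slope and intercept estimates.
se_b1 = s / math.sqrt(s_xx)
se_b0 = s * math.sqrt(sum(xi ** 2 for xi in x) / (n * s_xx))

T_CRIT = 2.447                   # assumed table value t_{6, 0.025} for a 95% CI

ci_b1 = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)
ci_b0 = (b0 - T_CRIT * se_b0, b0 + T_CRIT * se_b0)

print(f"slope:     {b1:.4f}, SE {se_b1:.4f}, 95% CI ({ci_b1[0]:.4f}, {ci_b1[1]:.4f})")
print(f"intercept: {b0:.4f}, SE {se_b0:.4f}, 95% CI ({ci_b0[0]:.4f}, {ci_b0[1]:.4f})")
```

The interval for the slope lies entirely below zero, which is consistent with the F test rejecting H0: β1 = 0 at α = 0.05; for simple linear regression the square of the t ratio for the slope equals FR.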
In the example, $\hat\mu_0$ is thus our estimate of the average number of mutations at some concentration X0. Note that, of course, the estimate of the mean will depend on the value of X0. Because $\hat\mu_0$ is an estimate based on our sample, we again would like to attach to it some estimate of precision. It may be shown mathematically that the variance of the distribution of all possible $\hat\mu_0$ values (based on the results of all possible experiments giving rise to all possible values of $\hat\beta_0$ and $\hat\beta_1$) is
$$\sigma^2 \left\{ \frac{1}{n} + \frac{(X_0 - \bar X)^2}{S_{XX}} \right\}.$$
We may thus estimate the SD of $\hat\mu_0$ by
$$\text{EST SD}(\hat\mu_0) = s \sqrt{\frac{1}{n} + \frac{(X_0 - \bar X)^2}{S_{XX}}}.$$
It may be shown that, under our assumptions,
$$\frac{\hat\mu_0 - \mu_0}{\text{EST SD}(\hat\mu_0)} \sim t_{n-2}.$$
Thus, using our standard argument to construct a 100(1 − α)% CI for a population parameter based on such a result, we have that a 100(1 − α)% CI for μ0, the true mean of all possible Y values at the fixed value X0, is
$$\{\hat\mu_0 - t_{n-2,\alpha/2}\,\text{EST SD}(\hat\mu_0),\ \hat\mu_0 + t_{n-2,\alpha/2}\,\text{EST SD}(\hat\mu_0)\}.$$
Length of CI: The CI for μ0 will of course be different depending on the value of X0. In fact, the expression for EST SD($\hat\mu_0$) above will be smallest if we choose X0 = $\bar X$ and will get larger the farther X0 is from $\bar X$ in either direction. This implies that the precision with which we expect to estimate the mean value of Y decreases the farther X0 is from the “middle” of the original data. This makes intuitive sense: we would expect to have the most “confidence” in our fitted line as an estimate of the true line in the “center” of the observed data. The result is that the CIs for μ0 will be wider the farther X0 is from $\bar X$.

Implication: If the fitted line will be used to estimate means for values of X besides those used in the experiment, it is important to use a range of X’s which contains the future values of interest, X0, preferably more toward the “center.”

Extrapolation: It is sometimes desired to estimate the mean based on the fit of the straight line for values of X0 outside the range of X’s used in the original experiment. This is called extrapolation. In order for this to be valid, we must believe that the straight line relationship holds for X’s outside the range where we have observed data! In some situations, this may be reasonable;
in others, we may have no basis for making such a claim without data to support it. It is thus very important that the investigator has an honest sense of the relevance of the straight line model for values outside those used in the experiment if inferences such as estimating the mean for such X0 values are to be reliable. In the event that such an assumption is deemed to be relevant, note from the above discussion that the quality of the estimates of the μ0 for X0 outside the range is likely to be poor. Note that we may in fact be interested in the mean at values of X that were included in the experiment. The procedure above is valid in this case.
4.1.6. Prediction and Calibration
Prediction: Sometimes, depending on the context, we may be interested not in the mean of possible Y values at a particular $X_0$ value, but in fact the actual value of Y we might observe at $X_0$. This distinction is important. The estimate of the mean at $X_0$ provides just a general sense about values of Y we might see there – just the “center” of the distribution. This may not be adequate for some applications. For example, consider a stockbroker who would like to learn about the value of a stock based on observed previous information. The stockbroker does not want to know about what might happen “on the average” at some future time $X_0$; she is concerned with the actual value of the stock at that time, so that she may make sound judgments for her clients. The stockbroker would like to predict or forecast the actual value, $Y_0$ say, of the stock that might be observed at $X_0$.

In this kind of situation, we are interested not in the population parameter $\mu_0$, but rather the actual value that might be taken on by a r.v., $Y_0$, say. In the context of our model, we are thus interested in the “future” observation

$$Y_0 = \beta_0 + \beta_1 X_0 + \varepsilon_0,$$

where $\varepsilon_0$ is the “error” associated with $Y_0$ that makes it differ from the mean at $X_0$, $\mu_0$. It is important to recognize that $Y_0$ is not a parameter but a r.v.; thus, we do not wish to estimate a fixed quantity, but instead learn about the value of a random quantity.

Now, our “best guess” for the value $Y_0$ is still our estimate of the “central” value at $X_0$, the mean $\mu_0$. We will write

$$\hat Y_0 = \hat\beta_0 + \hat\beta_1 X_0$$

to denote this “best guess.” Note that this is identical to our estimate for the mean, $\hat\mu_0$; however, we use a different symbol in this context to remind ourselves that we are interested in $Y_0$, not $\mu_0$. We call $\hat Y_0$ a prediction or forecast rather than an “estimate” to make the distinction clear. Just as we do in estimation of fixed parameters, we would still like to have some idea of how well we can predict/forecast.
To get an idea, we would like to characterize the uncertainty that we have about $\hat Y_0$ as a guess for $Y_0$, but, because it is not a parameter, it is not clear what to do. Our usual notion of a SE and CI does not seem to apply. We can write down an assessment of the likely size of the error we might make in using $\hat Y_0$ to characterize $Y_0$. Intuitively, there will be two sources of error:
• Part of the error in $\hat Y_0$ comes from the fact that we don’t know $\beta_0$ and $\beta_1$ but must estimate them from the observed data.
• Additional error arises from the fact that what we are really doing is trying to “hit a moving target!” That is, $Y_0$ itself is a r.v., so itself is variable! Thus, additional uncertainty is introduced because we are trying to characterize a quantity that itself is uncertain.

The assessment of uncertainty thus should be composed of two components. An appropriate measure of uncertainty is the SD of $\hat Y_0 - Y_0$, that is, the variability in the deviations between $\hat Y_0$ and the thing we are trying to “hit,” $Y_0$. The variance of this difference turns out to be

$$\sigma^2 + \sigma^2\left\{\frac{1}{n} + \frac{(X_0 - \bar X)^2}{S_{XX}}\right\}.$$

The extra $\sigma^2$ added on is accounting for the additional variation above (i.e., the fact that $Y_0$ itself is variable). We estimate the associated SD by substituting $s^2$ for $\sigma^2$ and taking the square root:

$$\text{EST ERR}(\hat Y_0) = s\sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar X)^2}{S_{XX}}}.$$

We call this “EST ERR” to remind ourselves that this is an estimate of the “error” between $\hat Y_0$ and $Y_0$, each of which is random. The usual procedure is to use this estimated uncertainty to construct what might be called an “uncertainty interval” for $Y_0$ based on our observed data. Such an interval is usually called a ‘prediction interval’. A 100(1 − α)% interval is given by

$$\{\hat Y_0 - t_{n-2,\alpha/2}\,\text{EST ERR}(\hat Y_0),\ \hat Y_0 + t_{n-2,\alpha/2}\,\text{EST ERR}(\hat Y_0)\}.$$

Note that this interval is wider than the CI for the mean $\mu_0$. This is because we are trying to forecast the value of a r.v. rather than estimate just a single population parameter.
Understandably, we cannot do the former as well as the latter, because $Y_0$ varies as well.

Calibration: Suppose we have fitted the regression line and now, for a value $Y_0$ of Y we have observed, we would like to estimate the unknown corresponding value of X, say $X_0$.
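Before turning to a calibration example, it may help to see the CI for the mean and the prediction interval from the preceding discussion side by side. The following is a minimal Python sketch and not part of the original text: the data values, the function names `fit_line` and `intervals_at`, and the use of the tabulated critical value 2.306 (the upper 2.5% point of the t distribution with 8 degrees of freedom) are illustrative assumptions.

```python
import math

def fit_line(x, y):
    """Least-squares straight line; returns (b0, b1, s, xbar, sxx)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(rss / (n - 2))  # residual SD on n - 2 degrees of freedom
    return b0, b1, s, xbar, sxx

def intervals_at(x0, x, y, t_crit):
    """Return (fitted value, CI for the mean, prediction interval) at x0.

    t_crit is the t critical value for n - 2 degrees of freedom,
    supplied by the caller (e.g., read from a t table).
    """
    n = len(x)
    b0, b1, s, xbar, sxx = fit_line(x, y)
    center = b0 + b1 * x0
    spread = 1.0 / n + (x0 - xbar) ** 2 / sxx
    sd_mean = s * math.sqrt(spread)         # EST SD of the estimated mean
    err_pred = s * math.sqrt(1.0 + spread)  # EST ERR of the prediction
    ci = (center - t_crit * sd_mean, center + t_crit * sd_mean)
    pi = (center - t_crit * err_pred, center + t_crit * err_pred)
    return center, ci, pi

# Hypothetical data (n = 10), invented purely for illustration.
x_obs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

for x0 in (5.5, 12.0):  # at the mean of X, and beyond the observed range
    center, ci, pi = intervals_at(x0, x_obs, y_obs, t_crit=2.306)
    print(x0, round(center, 2), ci, pi)
```

Under these assumptions, the prediction interval is wider than the CI at every $X_0$, and both widen as $X_0$ moves away from $\bar X$, which is the behavior described in the text.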
As an example, consider a situation where interest focuses on two different methods of calculating the age of a tree. One way is by counting tree rings. This is considered to be very accurate, but requires sacrificing the tree. Another way is by a carbon-dating process. Suppose that data are obtained for n trees on X = age by the counting method, Y = age by carbon dating. Here, technically, both of these might be considered r.v.’s; however, the goal of the investigators was to determine the relationship of the more variable, less reliable carbon-dating method to the accurate counting method. That is, given a tree is of age X (which could be determined exactly by the counting method), what would the associated carbon method value look like? Thus, for their purposes, regression analysis is appropriate. From the observed pairs $(X_i, Y_i)$, a straight line model seems reasonable and is fitted to the data.

Now suppose that the carbon-dating method is applied to a tree not in the study, yielding an age $Y_0$. What can we say about the true age of the tree, $X_0$ (that is, its age by the very accurate counting method), without sacrificing the tree? The idea is to use the fitted line to estimate $X_0$. Note that $X_0$ is a fixed value; thus, it is perfectly legitimate to want to estimate it. The obvious choice for an estimator, $\hat X_0$, say, is found by “inverting” the fitted regression line:

$$\hat X_0 = \frac{Y_0 - \hat\beta_0}{\hat\beta_1}.$$

Because $\hat X_0$ is an estimate of a parameter, based on information from our original sample, we ought to also report SE and/or CI. It turns out that deriving such quantities is a much harder mathematical exercise than for estimating a mean or for prediction, and a description of this is beyond the scope of our discussion here. This is because the estimated regression parameters appear both in the numerator and denominator of the estimate, which leads to mathematical difficulties.
Be aware that this may indeed be done; if you have occasion to make calibration inference, you should consult a statistician for help with attaching estimates of precision to your calibrated values (the estimates $\hat X_0$).

4.1.7. Violation of Assumptions
Earlier in this section, we stated assumptions under which the methods for simple linear regression we have discussed yield valid inferences. It is often the case in practice that one or more of the assumptions is violated. As in those situations, there are several ways in which the assumptions may be violated.
• Nonconstant variance: Recall in the regression situation that the mean response changes with X. Thus, in a given experiment, the responses may arise from distributions with means across a large range. The usual assumption is that the variance of these distributions is the same. However, it is often
the case that the variability in responses changes, most commonly in an increasing fashion, with changing X and mean values. This is thus likely to be of concern in problems where the response means cover a large range. We have already discussed the idea of data transformation as a way of handling this violation of the usual assumptions. In the regression context, this may be done in a number of ways. One way is to invoke an appropriate transformation and then postulate a regression model on the transformed scale. Sometimes, in fact, it may be that, although the data do not appear to follow a straight line relationship with X on the original scale, they may on some transformed scale. Another approach is to proceed with a modified method known as weighted least squares. This method, however, requires that the variances are known, which is rarely the case in practice. A number of diagnostic procedures have been developed for helping to determine if nonconstant variance is an issue and how to handle it. Other approaches to transforming data are also available. Discussion of these and of weighted least squares is an advanced topic. The best strategy is to consult a statistician to help with both diagnosis and identification of the best methods for a particular problem.
• Nonnormality: Also, as we have discussed previously, the normal distribution may not provide a realistic model for some types of data, such as those in the form of counts or proportions. Transformations may be used in the regression context as well. In addition, there are other approaches to investigating the association between such responses Y and a covariate X that we have not discussed here. Again, a statistician can help you determine the best approach.

Remark-Outliers: Another phenomenon that can make the normal approximation unreasonable is the problem of outliers, i.e., data points that do not seem to fit well with the pattern of the rest of the data.
In the context of straight line regression, an outlier might be an observation that falls far off the apparent approximate straight line trajectory followed by the remaining observations. Practitioners may often “toss out” such anomalous points, which may or may not be a good idea, depending on the problem. If it is clear that an “outlier” is the result of a mishap or a gross recording error, then this may be acceptable. On the other hand, if no such basis may be identified, the outlier may in fact be a genuine response; in this case, it contains information about the process under study and may be reflecting a legitimate phenomenon. In this case, “throwing out” an outlier may lead to misleading conclusions, because a legitimate feature is being ignored. Again, there are sophisticated diagnostic procedures for identifying outliers and deciding how to handle them.
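To make the influence of a single anomalous point concrete, one simple check is to fit the line with and without it and compare the results. The sketch below is only an illustration under invented assumptions: the data, the helper name `least_squares`, and the size of the shift applied to one response are hypothetical, and comparing two fits is not a substitute for the formal diagnostic procedures mentioned above.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares straight line."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical data: eight points lying exactly on Y = 1 + 2X ...
x_vals = [1, 2, 3, 4, 5, 6, 7, 8]
y_clean = [1 + 2 * xi for xi in x_vals]
# ... and the same data with the last response shifted upward by 20.
y_shifted = y_clean[:-1] + [y_clean[-1] + 20]

_, slope_clean = least_squares(x_vals, y_clean)
_, slope_shifted = least_squares(x_vals, y_shifted)
print("slope without the shifted point:", slope_clean)
print("slope with the shifted point:   ", slope_shifted)
```

Whether the shifted point should be kept or dropped cannot be decided from the fit alone; as the text stresses, that depends on whether a mishap or recording error can be identified.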
4.2. Correlation Analysis
Throughout this discussion, we regard both Y and X as r.v.’s such that the bivariate normal distribution provides an appropriate, approximate model for their joint distribution.

Correlation: Recall that the correlation coefficient $\rho_{XY}$ is a measure of the degree of (linear) association between two r.v.’s.

Interpretation: It is very important to understand what correlation does not measure. Investigators sometimes confuse the value of the correlation coefficient and the slope of an apparent underlying straight-line relationship. These do not have anything to do with each other:
• The correlation coefficient may be virtually equal to 1, implying an almost perfect association. But the slope may be very small at the same time. Although there is indeed an almost perfect association, the rate of change of Y values with X values may be very slow.
• The correlation coefficient may be very small, but the apparent “slope” of the relationship could be very steep. In this situation, it may be that, although the rate of change of Y values with X values is fast, there is large inherent variation in the data.

Estimation: For a particular set of data, of course, $\rho_{XY}$ is unknown. We may estimate $\rho_{XY}$ from a set of n pairs of observations $(X_i, Y_i)$, i = 1, . . . , n, by the sample correlation coefficient

$$r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar X)^2 \sum_{i=1}^{n} (Y_i - \bar Y)^2}} = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}}.$$

Note that the same calculations are involved as for regression analysis! Recall in regression analysis that

$$\text{Regression SS} = \frac{S_{XY}^2}{S_{XX}},$$

so that $r_{XY}^2 = \text{Regression SS}/S_{YY}$. The quantity $r_{XY}^2$ is thus often called the coefficient of determination (like $R^2$) in this setting, where correlation analysis is appropriate. However, it is important to recognize that the interpretation is different. Here, we are not acknowledging a straight-line relationship; rather, we are just modeling the data in terms of a bivariate normal distribution with correlation $\rho_{XY}$. Thus, the former interpretation for the quantity $r_{XY}^2$ has no meaning here. Likewise, the idea of correlation really only has meaning when both variables Y and X are r.v.’s.

CI: Because $r_{XY}$ is an estimator of the population parameter $\rho_{XY}$, it would be desirable to report, along with the estimate itself, a CI for $\rho_{XY}$. There is no one way to carry out these analyses. One common approach is an approximation known as Fisher’s Z transformation.
The method is based on the mathematical result that the quantity

$$Z = 0.5\,\log\left(\frac{1 + r_{XY}}{1 - r_{XY}}\right)$$

has an approximate normal distribution with mean and variance

$$0.5\,\log\left(\frac{1 + \rho_{XY}}{1 - \rho_{XY}}\right) \quad\text{and}\quad \frac{1}{n-3},$$

respectively, when n is large. This result may be used to construct an approximate 100(1 − α)% CI for the mean

$$0.5\,\log\left(\frac{1 + \rho_{XY}}{1 - \rho_{XY}}\right),$$

where log is natural logarithm. This CI is

$$\left\{ Z - z_{\alpha/2}\sqrt{\frac{1}{n-3}},\ Z + z_{\alpha/2}\sqrt{\frac{1}{n-3}} \right\}, \qquad \text{[35]}$$

where, as before, $z_{\alpha/2}$ is the value such that, for a standard normal r.v. $\mathcal{Z}$, $P(\mathcal{Z} > z_{\alpha/2}) = \alpha/2$. This CI may then be transformed to obtain a CI for $\rho_{XY}$ itself as follows. Let $Z_L$ and $Z_U$ be the lower and upper endpoints of the interval [35]. We illustrate the approach for $Z_L$. To obtain the lower endpoint for an approximate CI for $\rho_{XY}$ itself, calculate

$$\frac{\exp(2Z_L) - 1}{\exp(2Z_L) + 1}.$$

This value is the lower endpoint of the $\rho_{XY}$ interval; to obtain the upper endpoint, apply the same formula to $Z_U$.

Hypothesis Test: We may also be interested in testing hypotheses about the value of $\rho_{XY}$. The usual hypotheses tested are analogous in spirit to what is done in straight line regression – the null hypothesis is the hypothesis of “no association.” Here, then, we test

$$H_0: \rho_{XY} = 0 \quad\text{vs.}\quad H_1: \rho_{XY} \neq 0.$$

It is important to recognize what is being tested here. The alternative states simply that $\rho_{XY}$ is different from 0. The true value of the correlation coefficient could be quite small and $H_1$ would be true. Thus, if the null hypothesis is rejected, this is not necessarily an indication that there is a “strong” association, just there is evidence that there is some association. Of course, as always, if we do not reject $H_0$, this does not mean that we do have enough evidence to infer that there is not an association! This is particularly critical here. The procedure for testing $H_0$ vs. $H_1$ is
intuitively reasonable: we reject H0 if the CI does not contain 0. It is possible to modify this procedure to test whether ρXY is equal to some other value besides zero. However, be aware that most statistical packages provide by default the test for H0 :ρXY = 0 only.

Warning: This procedure is only approximate, even under our bivariate normal assumption. It is an example of the type of approximation that is often made in difficult problems, that of approximating the behavior of a statistic under the condition that n is large. If n is small, the procedure is likely to be unreliable. Moreover, it is worth noting that, intuitively, trying to understand the underlying association between two r.v.’s is likely to be very difficult with a small number of pairs of observations. Thus, testing aside, one should be very wary of over-interpretation of the estimate of ρXY when n is small – one “outlying” or “unusual” observation could be enough to affect the computed value substantially! It thus may be very difficult to detect when ρXY is different from 0 with a small n!
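The Fisher’s Z interval and the associated test described in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than a reference implementation: the function names (`sample_correlation`, `fisher_ci`, `rejects_no_association`) and the example values of r and n are hypothetical, and the standard normal quantile is taken from the `statistics.NormalDist` class of the Python standard library.

```python
import math
from statistics import NormalDist

def sample_correlation(x, y):
    """Sample correlation r = S_XY / sqrt(S_XX * S_YY)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(r, n, alpha=0.05):
    """Approximate 100(1 - alpha)% CI for rho via Fisher's Z (needs |r| < 1, n > 3)."""
    z = 0.5 * math.log((1 + r) / (1 - r))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # upper alpha/2 normal quantile
    half_width = z_crit * math.sqrt(1.0 / (n - 3))

    def back_transform(v):
        # Maps an endpoint on the Z scale back to the correlation scale.
        return (math.exp(2 * v) - 1) / (math.exp(2 * v) + 1)

    return back_transform(z - half_width), back_transform(z + half_width)

def rejects_no_association(r, n, alpha=0.05):
    """Reject H0: rho = 0 when the approximate CI excludes 0."""
    lower, upper = fisher_ci(r, n, alpha)
    return not (lower <= 0.0 <= upper)

# The same estimate r = 0.5 (hypothetical) with a larger and a smaller sample.
for n in (28, 10):
    print(n, fisher_ci(0.5, n), rejects_no_association(0.5, n))
```

With the same estimate, the interval for the smaller sample is much wider and includes 0, in line with the warning above about small n.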
Chapter 2

Nonparametric Methods for Molecular Biology

Knut M. Wittkowski and Tingting Song

Abstract

In 2003, the completion of the Human Genome Project (1) together with advances in computational resources (2) were expected to launch an era where the genetic and genomic contributions to many common diseases would be found. In the years following, however, researchers became increasingly frustrated as most reported ‘findings’ could not be replicated in independent studies (3). To improve the signal/noise ratio, it was suggested to increase the number of cases to be included to tens of thousands (4), a requirement that would dramatically restrict the scope of personalized medicine. Similarly, there was little success in elucidating the gene–gene interactions involved in complex diseases or even in developing criteria for assessing their phenotypes. As a partial solution to these enigmata, we here introduce a class of statistical methods as the ‘missing link’ between advances in genetics and informatics. As a first step, we provide a unifying view of a plethora of nonparametric tests developed mainly in the 1940s, all of which can be expressed as u-statistics. Then, we will extend this approach to reflect categorical and ordinal relationships between variables, resulting in a flexible and powerful approach to deal with the impact of (1) multiallelic genetic loci, (2) poly-locus genetic regions, and (3) oligo-genetic and oligo-genomic collaborative interactions on complex phenotypes.

Key words: Genome-wide Association Study (GWAS), Family-based Association Test (FBAT), High-Density Oligo-Nucleotide Assay (HDONA), coregulation, collaboration, multiallelic, multilocus, multivariate, gene–gene interaction, epistasis, personalized medicine.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_2, © Springer Science+Business Media, LLC 2010

1. Introduction

As functional genetics and genomics advance and prices for sequencing and expression profiling drop, new possibilities arise, but also new challenges. The initial successes in identifying the causes of rare, mono-causal diseases serve as a proof-of-concept that new diagnostics and therapies can be developed. Common diseases have even more impact on public health, but as they typically involve genetic epistasis, genomic pathways, and proteomic patterns, they place new requirements on database systems and statistical analysis tools.

Biological systems are regulated by various, often unknown feedback loops so that the functional form of relationships between measurement and activity or efficacy is typically unknown, except within narrowly controlled experiments. Still, many statistical methods are based on the linear model (5), i.e., on the assumption that the above (unknown) relationship is linear. The linear model has the advantage of computational efficiency. Moreover, assuming independence and additivity yields conveniently bell-shaped distributions and parameters of alluring simplicity. The prayer that biology be linear, independent, and additive, however, is rarely answered and the Central Limit Theorem (CLT) neither applies to small samples nor rescues from model misspecification.

When John Arbuthnot (1667–1735) argued that the chance of more male than female babies being born in London for the last 82 consecutive years was only $1/2^{82}$ “if mere Chance govern’d”, and, thus, “it is Art, not Chance, that governs”, (6) he was arguably the first to ever apply the concept of hypothesis testing to obtain what is now known as a ‘p-value’ and, interestingly, he applied it to a problem in molecular biology. The test, now known as the sign or McNemar (7) test (see below), belongs to the class of ‘nonparametric’ tests, which differ from their ‘parametric’ counterparts, in that the distribution of the data is not assumed to be known, except for a single parameter to be estimated. As they require fewer unjustifiable assumptions to be made, nonparametric methods are more adequate for biological systems, in general. Moreover, they tend to be easier to understand.
The median, for instance, is easily explained as the cut-off where as many observations are above as below, and the interquartile range as the range with 25% of the data both below and above. In contrast to the mean and standard deviation, these ‘quartiles’ do not change with (monotonic) scale transformations (such as log and inverse), are robust to outliers, and often reflect more closely the goals of the investigator (8): ‘Do people in this group tend to score higher than people in that?’, ‘Is the order on this variable similar to the order on that?’ If the questions are ordinal, it seems preferable to use ordinal methods to answer them (9). The reason for using methods based on the linear model, ranging from the t-test(s) and analysis of variance (ANOVA) to stepwise linear regression and factor analysis, is not their biological plausibility, but merely their computational efficiency. While mean and standard deviation are easily computed with a pocket calculator, quartiles are not. Another disadvantage of ordinal methods is the relative scarcity of experimental designs that can be
analyzed. Moreover, nonparametric methods are often presented as a confusing ‘hodgepodge’ of seemingly unrelated methods. On the one hand, equivalent methods can have different names, such as the Wilcoxon (rank-sum) (10), the Mann–Whitney (u-) (11), and the Kruskal–Wallis (12) test (when applied to two groups). On the other hand, substantially different tests may be attributed to the same author, such as Wilcoxon’s rank-sum and signed-rank tests (10).

In the field of molecular biology, the need for computational (bioinformatics) approaches in response to a rapidly evolving technology producing large amounts of data from small samples has created a new field of “in silico” analyses, the term a hybrid of the Latin in silice (from silex, silicis m.: hard stone (13, 14)) and phrases such as in vivo, in vitro, and in utero, which seem to be better known than, for instance, in situ, in perpetuum, in memoriam, in nubibus, and in capite. Many in silico methods have added even more confusion to the field. For instance, a version of the t-test with a more conservative variance estimate is often referred to as SAM (significance analysis for microarrays) (15), and adding a similar “fudge factor” to the Wilcoxon signed-rank test yields SAM-RS (16). Referring to ‘new’ in silico approaches by names intended to highlight the particular application or manufacturer, rather than the underlying concepts, has often prevented these methods from being thoroughly evaluated. With sample sizes increasing and novel technologies (e.g., RNAseq) replacing less robust technologies (e.g., microarrays) (17) that often relied on empirically motivated standardization and normalization, the focus can now shift from ad hoc bioinformatics approaches to well-founded biostatistical concepts.
Below we will present a comprehensive theory for a wide range of nonparametric in silice methods based on recent advances in understanding the underlying fundamentals, novel methodological developments, and improved algorithms.
2. Univariate Unstructured Data

2.1. Independence

2.1.1. Introduction
For single-sample designs, independence of the observations constituting the sample is a key principle to guarantee when applying any statistical test to observed data. All asymptotic (large-sample) versions of the tests discussed here are based on the CLT, which applies when (i) many (ii) independent observations contribute in an (iii) additive fashion to the test statistic. Of these three requirements, additivity is typically fulfilled by the way the test statistic is formed, which may be, for instance, based on the sum of the data, often after rank or log transformation. Independence, then,
applies to how the data are being aggregated into the test statistic. Often, one will allow only a single child per family to be included, or at least only one of identical twins. Finally, the rule that ‘more is better’ underlies the Law of Large Numbers, which states that the accuracy of the estimates increases with the square root of the number of independent samples included.

Beyond these three requirements, one will try to minimize the effect of unwanted ‘confounders.’ For instance, when one compares the effects of two interventions within the same subject, one would typically aim at a ‘cross-over’ design, where the two interventions are applied in a random order to minimize ‘carry-over’ effects, where the first intervention’s effects might affect the results observed under the subsequent intervention. Another strategy is ‘stratification’, where subjects are analyzed as subsamples forming ‘blocks’ of subjects that are comparable with respect to the confounding factor(s), such as genetic factors (e.g., race and ethnicity), exposure to environmental factors (smoking) or experimental conditions. A third strategy would be to ‘model’ the confounding factor, e.g., by subtracting or dividing by its presumed effect on the outcome. As the focus of this chapter is on nonparametric (aka ‘model-free’) methods, this strategy will be used only sparingly, i.e., when the form of the functional relationship among the confounding and observed variables is, in fact, known with certainty.

2.1.2. The Sign Test for Untied Observations
The so-called sign test is the simplest of all statistical methods, applying to a single sample of binary outcomes from independent observations. The outcomes can be discrete categories, such as the sexes ‘M’ or ‘F’ in Arbuthnot’s case (6), or ‘correct’ vs. ‘false’ answers, as in R.A. Fisher’s famous (Lady Tasting) Tea test (7), where a lady claims to be able to decide whether milk was added to tea, or vice versa. When applied to paired observations (‘increase’ vs. ‘decrease’), the sign test is often named after McNemar (18). (Here, we assume that all sexes, answers, and changes can be unambiguously determined. The case of ‘ties’ will be discussed in Section 2.1.3 below.)

When the observations are independent, the number of positive ‘signs’ follows a binomial distribution, so that the probability of results deviating at least as much as the result observed from the result expected under the null hypothesis H0 (the ‘p-value’) is easily obtained. Let X be the sum of positive signs among n replications, each having a ‘success’ probability of p (e.g., p = 1/2). Then, the probability of having k ‘successes’ (see Fig. 2.1) is

$$b_{n,p}(k) = P(X = k \mid n, p) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}.$$
Fig. 2.1. Binomial distribution for n=10 under the hypothesis p=0.5.
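The (one-sided) p-value is easily computed as $\sum_{x=k}^{n} b_{n,p}(x)$. Let $x_\circ = \frac{1}{n}\sum_{i=1}^{n} x_i$; then, for binomial random variables, the parameter estimate $X/n = \hat p$ and its standard deviation

$$s_n(\hat p) = \sqrt{\frac{1}{n}\cdot\frac{1}{n}\sum_{i=1}^{n} (x_i - x_\circ)^2} = \sqrt{\frac{1}{n}\left[\hat p\,(1 - \hat p)^2 + (1 - \hat p)(0 - \hat p)^2\right]} = \sqrt{\frac{1}{n}\,\hat p\,(1 - \hat p)}$$

are also easily derived. More than 100 years after Arbuthnot (6), Gauss (19) proved the CLT, by which the binomial distribution can be approximated asymptotically (as.) with the familiar Gaussian ‘Normal’ distribution

$$\frac{(\hat p - p_0)\sqrt{n}}{\sqrt{\hat p\,(1 - \hat p)}} \sim_{as.} N(0, 1), \quad\text{i.e.,}\quad P\left(\frac{(\hat p - p_0)\sqrt{n}}{\sqrt{\hat p\,(1 - \hat p)}} \le z\right) \xrightarrow[n\to\infty]{} \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du.$$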
The former (exact) form of the sign test can be derived and applied using only elementary mathematical operations, while the latter, like the parametric equivalent, requires theoretical results that most statisticians find challenging to derive, including an integral without a closed form solution. For $p_0 = 1/2$ McNemar (18) noted that this yields a (two-sided) test statistic in a particularly simple form

$$\frac{(N_+ - N_-)^2}{N_+ + N_-} \sim_{as.} \chi^2_1.$$
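The exact and asymptotic forms of the sign test can be placed side by side in a short Python sketch. The function names below are hypothetical, the second example (nine positive signs out of ten) is invented for illustration, and the chi-square tail probability for one degree of freedom is computed from the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$ using only the standard library; none of this code is taken from the original text.

```python
from math import comb, erfc, sqrt

def binom_pmf(k, n, p):
    """b_{n,p}(k): probability of exactly k positive signs among n."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def sign_test_one_sided(k, n, p0=0.5):
    """Exact one-sided p-value: P(X >= k) under H0: p = p0."""
    return sum(binom_pmf(x, n, p0) for x in range(k, n + 1))

def mcnemar_statistic(n_pos, n_neg):
    """(N+ - N-)^2 / (N+ + N-), asymptotically chi-square with 1 df."""
    return (n_pos - n_neg) ** 2 / (n_pos + n_neg)

def chi2_1df_upper_tail(x):
    """P(chi-square with 1 df > x), via the complementary error function."""
    return erfc(sqrt(x / 2))

# Arbuthnot's argument: 82 'male' years out of 82 under p0 = 1/2.
print(sign_test_one_sided(82, 82))

# Invented example: nine positive and one negative sign.
print(sign_test_one_sided(9, 10))
stat = mcnemar_statistic(9, 1)
print(stat, chi2_1df_upper_tail(stat))
```

The exact computation needs only sums and products, whereas the asymptotic p-value relies on a special function, which mirrors the contrast drawn in the text.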
2.1.3. Sign Tests for Tied Observations
In reality, the situations where the sign test is to be applied are often more complicated, because the sign of an observation (or pair of observations) may be neither positive nor negative. When such ‘ties’ are present, the McNemar test (18) yields the same results for two distinctly different situations:
• either we may have nine positive and one negative observations out of a total of ten,
• or we have also nine positive and one negative observation, but also 9,990 tied observations.

A ‘tie’ might be a subject being heterozygous or, in Arbuthnot’s case, having gonadal mosaicism, the rare condition where a subject has cells with both XX and XY chromosomes. Ignoring ties is often referred to as ‘correction for ties.’ Still one may feel uncomfortable with a result being ‘significant’, although only nine observations ‘improved’ and one even ‘worsened’, while 9,990 were (more or less) ‘unchanged.’

This observation has long puzzled statisticians. In fact, several versions of ‘the’ sign test have been proposed (20). Originally, Dixon, in 1946 (21), suggested that ties be split equally between the positive and negative signs, but in 1951 (22) followed McNemar (18) in suggesting that ties be dropped from the analysis. To explicate the discussion, we divide the estimated proportion of positive signs by their standard deviation to achieve a common asymptotic distribution. Let $N_-$, $N_+$, and $N_=$ denote the number of negative, positive, and tied observations, respectively, among a total of n. The original sign test (21) can then be written as

$$T^* = \frac{(N_+ + p_0 N_=)/n - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{(N_+ + p_0 N_=) - p_0 n}{\sqrt{p_0(1 - p_0)\,n}} \underset{\{\text{if } p_0 = 1/2\}}{=} \frac{N_+ - N_-}{\sqrt{n}} \sim_{as.} N(0, 1)$$

and the alternative (18, 22) as

$$T = \frac{N_+/(n - N_=) - p_0}{\sqrt{p_0(1 - p_0)/(n - N_=)}} = \frac{(N_+ + p_0 N_=) - p_0 n}{\sqrt{p_0(1 - p_0)(n - N_=)}} \underset{\{\text{if } p_0 = 1/2\}}{=} \frac{N_+ - N_-}{\sqrt{N_+ + N_-}} \sim_{as.} N(0, 1).$$

The first and last term in the above series of equations show that the ‘correction for ties’ increases the test statistic by a factor of $\sqrt{n/(n - N_=)}$, thereby yielding more ‘significant’ results. The center term is helpful in understanding the nature of the difference. $T^*$ distributes the ties as the
numbers of observations expected under H0 and then uses the unconditional variance (21). T excludes the ties from the analysis altogether, thereby ‘conditioning’ the test statistic. The sign test is often applied when the distribution of the data (or differences of data) cannot be assumed to be symmetric. (Otherwise, the paired t-test would be a robust alternative (5).) Assume, for simplicity of the argument, that the differences follow a triangular distribution with a median of zero and a symmetric discretization interval with (band-) width 2b around the origin (Fig. 2.2).
Fig. 2.2. Triangular distribution, discretized at zero with bandwidth b=0.5.
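Returning to the two situations listed at the start of this section (nine positive and one negative observation, with either no ties or 9,990 ties), the unconditional and the conditional statistic can be compared directly. The following Python sketch is an illustration only; the function name `sign_statistics` is hypothetical, and the formulas are the center terms of the two series of equations for the unconditional and conditional statistics.

```python
from math import sqrt

def sign_statistics(n_pos, n_neg, n_tie, p0=0.5):
    """Return (T_star, T): unconditional and conditional sign test statistics."""
    n = n_pos + n_neg + n_tie
    # Shared numerator: ties are counted as p0 * N= 'positive' signs.
    numerator = (n_pos + p0 * n_tie) - p0 * n
    t_star = numerator / sqrt(p0 * (1 - p0) * n)        # unconditional variance
    t = numerator / sqrt(p0 * (1 - p0) * (n - n_tie))   # ties dropped
    return t_star, t

for ties in (0, 9990):
    t_star, t = sign_statistics(9, 1, ties)
    print(ties, round(t_star, 4), round(t, 4))
```

In this sketch the conditional statistic is the same in both situations, while the unconditional statistic shrinks as ties accumulate; the ratio of the two equals $\sqrt{n/(n - N_=)}$.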
Then, with T, ‘significance’ increases with the discretization bandwidth, i.e., with the inaccuracy of the measurements; the test statistic increases from 0.00 to 2.34 (Table 2.1). With T*, in contrast, the estimate for the probability of a positive sign remains within a narrow limit between 0.4 and 0.5 and the test statistic never exceeds the value of 1.0.

Table 2.1 Expected estimates and test statistics (n=25) by bandwidth b

| $b$ | $p_=$ | $p_+$ | $p_+/(1-p_=)$ | $T_{25}$ | $p_+ + p_=/2$ | $T^*_{25}$ | $T_{25}/T^*_{25}$ |
|---|---|---|---|---|---|---|---|
| 0.00 | 0.00 | 0.50 | 0.50 | 0.00 | 0.50 | 0.00 | 1.0 |
| 0.25 | 0.21 | 0.39 | 0.49 | 0.06 | 0.49 | 0.05 | 1.1 |
| 0.50 | 0.41 | 0.27 | 0.46 | 0.28 | 0.48 | 0.21 | 1.3 |
| 0.75 | 0.62 | 0.14 | 0.37 | 0.78 | 0.45 | 0.48 | 1.6 |
| 0.90 | 0.75 | 0.06 | 0.23 | 1.38 | 0.43 | 0.69 | 2.0 |
| 0.95 | 0.79 | 0.03 | 0.14 | 1.68 | 0.42 | 0.77 | 2.2 |
| 0.99 | 0.82 | 0.01 | 0.03 | 1.98 | 0.42 | 0.84 | 2.4 |
| 1.00 | 0.83 | 0 | 0 | 2.07 | 0.41 | 0.86 | 2.4 |
| 1.50 | 0.93 | 0 | 0 | 2.34 | 0.46 | 0.36 | 3.7 |
| >2.41 | 1.00 | 0 | – | – | 0.50 | 0.00 | 8.2 |

To resolve the seeming discrepancy between theoretical results suggesting a specific treatment of ties as ‘optimal’ and the counterintuitive consequences of using this ‘optimal’ strategy, one needs to consider the nature of ties. In genetics, for instance, tied observations can often be assumed to indicate identical phenomena (e.g., mutation present or absent (23)). Often, however, a thorough formulation of the problem refers to an underlying continuous or unmeasurable factor, rather than the observed discretized variable. For instance, ties may be due to rounding of continuous variables (temperature, Fig. 2.2) or to the use of discrete surrogate variables for continuous phenomena (parity for fertility), in which case the unconditional sign test should be used. Of course, when other assumptions can be reasonably made, such as the existence of a relevance threshold (24), or a linear model for the paired comparison preference profiles (25), ties could be treated in many other ways (26).

2.2. Applications

2.2.1. Family-Based Association Studies
Family-based association tests (FBAT) control for spurious associations between disease and specific marker alleles due to population stratification (27). Thus, the transmission/disequilibrium test (TDT) for bi-allelic markers, proposed in 1993 by Spielman et al. (28), has become one of the most frequently used statistical methods in genetics. Part of its appeal stems from its computationally simple form $(b - c)^2/(b + c)$, which resembles the conditional sign test (Section 2.1.3). Here, b and c represent the number of transmitted wild-type (P) and risk (Q) alleles, respectively, whose parental origin can be identified in affected children. Let $n_{XY}$ denote the number of affected XY children and let XY ∼ X′Y′ denote a parental mating type (X, Y, X′, Y′ ∈ {P, Q}; the order of genotypes and parental genders is arbitrary). Alleles transmitted from homozygous parents are noninformative, so that children with two homozygous parents are excluded from the analysis as ‘exact ties’ (29, 30). If one parent is homozygous, only the allele transmitted from the other parent is informative. When both parents are heterozygous, the numbers of PP and QQ children, $n_{PP}$ and $n_{QQ}$, respectively, represent the number
Nonparametric Methods for Molecular Biology
113
of P or Q alleles transmitted from either parent; $n_{PQ}$ represents the number of P alleles transmitted from one and the number of Q alleles transmitted from the other parent, yet the origin of the alleles is not known. To compare “the frequency with which [among heterozygous parents] the associated allele P or its alternate Q is transmitted to the affected offspring” (28), the term b − c in the numerator of the TDT can be decomposed into the contributions from families stratified by the parental mating types:

$$ b - c = [n_{PQ} - n_{QQ}]_{PQ \sim QQ} + [2(n_{PP} - n_{QQ}) + (n_{PQ} - n_{PQ})]_{PQ \sim PQ} + [n_{PP} - n_{PQ}]_{PP \sim PQ}. $$

Of course, $[n_{PQ} - n_{PQ}]_{PQ \sim PQ} = 0$, so that this equation can be rewritten as (31)

$$ b - c = [n_{PQ} - n_{QQ}]_{PQ \sim QQ} + 2\,[n_{PP} - n_{QQ}]_{PQ \sim PQ} + [n_{PP} - n_{PQ}]_{PP \sim PQ}. \qquad [1] $$
As an ‘exact tie’ (29, 30), the term $[n_{PQ} - n_{PQ}]_{PQ \sim PQ}$ can be ignored when computing the variance of [1], because it is as noninformative as the alleles transmitted from homozygous parents. As noted above, independence of observations is a key concept in building test statistics. While “the contributions from both parents are independent” (28), the observations are not. Because the effects of the two alleles transmitted to the same child are subject to the same genetic and environmental confounders, one does not have a sample of independently observed alleles to build the test statistic on, but “a sample of n affected children” (28). As a consequence, the factor ‘2’ in [1] does not increase the sample size (32), but is a weight indicating that the PP and the QQ children are ‘two alleles apart’, which implies a larger risk difference under the assumption of co-dominance. Each of the three components in [1] follows a binomial distribution with a probability of success equal to 1/2, so that the variance of [1] follows from “the standard approximation to a binomial test of the equality of two proportions” (28)

$$ \sigma_0^2(b - c) = [n_{PQ} + n_{QQ}]_{PQ \sim QQ} + 4\,[n_{PP} + n_{QQ}]_{PQ \sim PQ} + [n_{PP} + n_{PQ}]_{PP \sim PQ}. \qquad [2] $$
Dividing the square of estimate [1] by its variance [2] yields a stratified McNemar test (SMN) which combines the estimates
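and variances of three McNemar tests in a fashion typical for stratified tests (33),

$$ \frac{(b - c)^2}{\sigma_0^2(b - c)} = \frac{\left( [n_{PQ} - n_{QQ}]_{PQ \sim QQ} + 2\,[n_{PP} - n_{QQ}]_{PQ \sim PQ} + [n_{PP} - n_{PQ}]_{PP \sim PQ} \right)^2}{[n_{PQ} + n_{QQ}]_{PQ \sim QQ} + 4\,[n_{PP} + n_{QQ}]_{PQ \sim PQ} + [n_{PP} + n_{PQ}]_{PP \sim PQ}} \sim_{as.} \chi_1^2. \qquad [3] $$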
(‘Stratification’, i.e., blocking children by parental mating type, here is merely a matter of computational convenience. Formally, one could treat each child as a separate block, with the same results.) The TDT, in contrast, divides the same numerator by a “variance estimate” (28)

$$ \hat{\sigma}_0^2(b - c) = b + c = [n_{PQ} + n_{QQ}]_{PQ \sim QQ} + \left\{ 2\,[n_{PP} + n_{QQ}]_{PQ \sim PQ} + 2\,[n_{PQ}]_{PQ \sim PQ} \right\} + [n_{PP} + n_{PQ}]_{PP \sim PQ}, \qquad [4] $$
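To make the difference between [2] and [4] concrete, the following Python sketch computes the shared numerator [1] and both variances from genotype counts grouped by parental mating type. It is only an illustration: the function name `tdt_smn` and its argument layout are hypothetical (they are not part of muStat or any other package), and the example counts describe three children of PQ∼PQ parents with genotypes PP, PP, and PQ.

```python
from math import sqrt


def tdt_smn(pq_qq=(0, 0), pq_pq=(0, 0, 0), pp_pq=(0, 0)):
    """Hypothetical helper: numerator [1] with variances [2] (SMN) and [4] (TDT).

    pq_qq = (n_PQ, n_QQ)        children of PQ~QQ parents
    pq_pq = (n_PP, n_PQ, n_QQ)  children of PQ~PQ parents
    pp_pq = (n_PP, n_PQ)        children of PP~PQ parents
    """
    a_pq, a_qq = pq_qq
    b_pp, b_pq, b_qq = pq_pq
    c_pp, c_pq = pp_pq
    # Equation [1]: PP and QQ children of two heterozygous parents get weight 2.
    diff = (a_pq - a_qq) + 2 * (b_pp - b_qq) + (c_pp - c_pq)
    # Equation [2]: the noninformative PQ children of PQ~PQ parents are ignored.
    var_smn = (a_pq + a_qq) + 4 * (b_pp + b_qq) + (c_pp + c_pq)
    # Equation [4]: b + c, which also counts the PQ children of PQ~PQ parents.
    var_tdt = (a_pq + a_qq) + 2 * (b_pp + b_qq) + 2 * b_pq + (c_pp + c_pq)
    return diff, var_smn, var_tdt


def z_score(diff, variance):
    """Signed square root of the chi-square statistic (0 if nothing is informative)."""
    return diff / sqrt(variance) if variance > 0 else 0.0


# Illustrative counts only: children PP, PP, PQ of PQ~PQ parents.
diff, var_smn, var_tdt = tdt_smn(pq_pq=(2, 1, 0))
z_smn = z_score(diff, var_smn)
z_tdt = z_score(diff, var_tdt)
```

For these counts the numerator is 4, with a variance of 8 under [2] and 6 under [4], so the two statistics differ (4/√8 versus 4/√6) even though they share the same numerator.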
which relies on the counts of noninformative PQ children. Replacing the observed $n_{PP} + n_{QQ}$ by its estimate $n_{PQ}$ under H0 would require an adjustment, similar to the replacement of the Gaussian by the t-distribution when using the empirical standard deviation with the t-test (34). Hence, the SMN has advantages over the TDT for finite samples (31), in part, because it has a smaller variance (more and, thus, lower steps, see Fig. 2.3) under H0 (see Table 2.2). Because the TDT estimates the variance under the assumption of co-dominance, it overestimates the variance (has lower power) for alleles with high dominance, i.e., when being heterozygous for the risk allele carries (almost) as much risk as being homozygous (31). On the other hand, the TDT underestimates the variance (yielding higher power) for recessive alleles (31).

[Figure: step functions F(x) for the TDT and the SMN; x from −3 to 3, F(x) from 0.0 to 1.0]
Fig. 2.3. Cumulative distribution function for the example of Table 2.2.

Table 2.2 Exact distribution of the TDT (top) and SMN (bottom) for a sample of three children

TDT
g1  g2  g3   b   p1    p2    p3    f       nP  nQ  nP−nQ  nP+nQ   x
PP  PP  PP   1   0.25  0.25  0.25  0.016   6   0    6     6       2.449
PP  PP  PQ   3   0.25  0.25  0.50  0.094   5   1    4     6       1.633
PP  PQ  PQ   3   0.25  0.50  0.50  0.188   4   2    2     6       0.816
PP  PP  QQ   3   0.25  0.25  0.25  0.047   4   2    2     6       0.816
PQ  PQ  PQ   1   0.50  0.50  0.50  0.125   3   3    0     6       0.000
PP  PQ  QQ   6   0.25  0.50  0.25  0.188   3   3    0     6       0.000
PP  QQ  QQ   3   0.25  0.25  0.25  0.047   2   4   -2     6      -0.816
PQ  PQ  QQ   3   0.50  0.50  0.25  0.188   2   4   -2     6      -0.816
PQ  QQ  QQ   3   0.50  0.25  0.25  0.094   1   5   -4     6      -1.633
QQ  QQ  QQ   1   0.25  0.25  0.25  0.016   0   6   -6     6      -2.449
Total                              1.000                          0.000

SMN
g1  g2  g3   b   p1    p2    p3    f       nPP  nQQ  nPP−nQQ  nPP+nQQ   x       f*
PP  PP  PP   1   0.25  0.25  0.25  0.016   3    0     3       3         1.732   0.018
PP  PP  QQ   3   0.25  0.25  0.25  0.047   2    1     1       3         0.577   0.054
PP  QQ  QQ   3   0.25  0.25  0.25  0.047   1    2    -1       3        -0.577   0.054
QQ  QQ  QQ   1   0.25  0.25  0.25  0.016   0    3    -3       3        -1.732   0.018
PP  PP  PQ   3   0.25  0.25  0.50  0.094   2    0     2       2         1.414   0.107
PP  PQ  QQ   6   0.25  0.50  0.25  0.188   1    1     0       2         0.000   0.214
PQ  QQ  QQ   3   0.50  0.25  0.25  0.094   0    2    -2       2        -1.414   0.107
PP  PQ  PQ   3   0.25  0.50  0.50  0.188   1    0     1       1         1.000   0.214
PQ  PQ  QQ   3   0.50  0.50  0.25  0.188   0    1    -1       1        -1.000   0.214
PQ  PQ  PQ   1   0.50  0.50  0.50  –       0    0     0       0         –       –
Total                              0.875                                0.000   1.000

Legend: $g_i$: child i genotype; b: number of combinations; $p_i$: $E_0(g_i)$; f: $b \times \prod_i p_i$; $n_X$/$n_{XY}$: number of alleles/children; x: $(n_P - n_Q)/\sqrt{n_P + n_Q}$ (TDT), $(n_{PP} - n_{QQ})/\sqrt{n_{PP} + n_{QQ}}$ (SMN)

2.2.2. Extensions of the Stratified McNemar
Understanding the role of parental mating types, the sensitivity of the SMN can be easily focused on either dominant or recessive alleles. When screening for recessive alleles, one excludes trios where one parent is homozygous for the wild type and assigns equal weights to the remaining trios (35). Conversely, when screening for dominant alleles, one excludes trios where one parent is homozygous for the putative risk allele. A better understanding of the statistical principles underlying family-based association studies leads not only to test statistics with better small sample properties and better targeted tests for alleles with low or high dominance but also to extensions which open new areas of application.
For decades it has been known that the HLA-DRB1 gene is a major factor in determining the risk for multiple sclerosis (36). As HLA-DRB1 is one of the few genes where more than two alleles per locus are routinely observed, the SMN is easily generalized to multiallelic loci by increasing the number of parental mating-type strata and identifying within each stratum the number of informative children (35). In early 2009, applying the extension of the SMN for multiallelic loci allowed (Table 2.3) the narrowing down of risk determinants to amino acid 13 at the center of the HLA-DRB1 P4 binding pocket, while amino acid 60, which had earlier been postulated based on structural features (37), was seen as unlikely to play a major role (35, 38) (Fig. 2.4).
Table 2.3 Nucleotides (N#: nucleotide number, #A: amino acid number) found in 94 HLA-DRB1 SNPs discriminating the 13 main two-digit allelic groups. The nucleotide found in the HLA-DRB1∗ 15 (15) allele, is highlighted across allelic groups (modified from BMC Medical Genetics, Ramagopalan et al. (35) available from http://www.biomedcentral.com/1471-2350/10/10/ © 2009, with permission from BioMed Central Ltd). Allelotype: AlleleNucleotide 85 –1 13 124 125 126 60 265 266 96 373 375 101 390 133 485 142 511 166 584 585 #Cases:
02 15 T A G G T A C A A T A G A 3994
01
04
16 G G A T G T G T T T A A C G A G A G T G A G G G A G 208 1396
G C A T T A T T G G G G G 1982
03 05 17/18 11 12 G T C T T A C T G G G G G 1873
G T C T T A C T G G G G G 1068
06 13
G G G T G C T T T T AC AC C C T T G G G G G G G G G G 146 1519
08
10
09
07
G T T T T A C A G G G A G 80
G G T T T A T T T T C C C C T T G G G G G G G G G G 98 1653
14 G G TG G CG G T T CT T A A C C T T G G G G G G G G G G 243 408
2.2.3. Microarray Quality Control
Another simple method based on u-statistics is the median, on which the Bioconductor package Harshlight, a program to identify and mask artifacts on Affymetrix microarrays, is based (39, 40). After extensive discussion (41, 42), this approach was recently adopted for Illumina bead arrays as “BASH: a tool for managing BeadArray spatial artifacts” (43) (Fig. 2.6).
2.2.4. Implementation and Availability
Figure 2.5 demonstrates the relationships between the different versions of the sign and McNemar tests and how easily even their exact versions, where all possible permutations of data need to be considered, can be implemented in code compatible with R
Fig. 2.4. (a) Extension of the stratified McNemar to multi-allelic loci identifies amino acid 13 (at the center of the P4 pocket) as a major risk factor for multiple sclerosis (modified from BMC Medical Genetics, Ramagopalan et al. (35) available from http://www.biomedcentral.com/1471-2350/10/10/ © 2009, with permission from BioMed Central Ltd). (b) Positions of amino acids identified in Fig. 2.4a within β2 domain
Fig. 2.5. Implementation of the conditional and unconditional sign tests as part of the ‘muStat’ package for both R and S-PLUS. SMN: stratified McNemar, TDT: transmission disequilibrium test, MCN: McNemar, DMM: Dixon/Mood/Massey. The function MC() avoids scoping problems in S-PLUS and is part of the compatibility package muS2R.
and S-PLUS. For dominant and recessive alleles, the parameters wP/wQ are set to 0.0/0.5 and 0.5/0.0, respectively. Harshlight and BASH (as part of the beadarray package) are available from http://bioconductor.org.
3. Univariate Data

3.1. u-Scores

3.1.1. Definitions
Our aim is first to develop a computationally efficient procedure for scoring data based on ordinal information only. We will not make any assumptions regarding the functional relationship between the variable and the latent factor of interest, except that the variable has an orientation, i.e., that an increase in this variable is either ‘good’ or ‘bad.’ Without loss of generality, we will assume that for each of the variables ‘more’ means ‘better.’ Here, the index k is used for subjects. Whenever this does not cause confusion, we will identify patients with their vector of L ≥ 1 observations to simplify the notation. The scoring mechanism is based on the principle that each patient

$$ \left\{ x_k = (x_{k1}, \ldots, x_{kL}) \right\}_{k = 1, \ldots, m_j} $$

is compared to every other patient in a pairwise fashion. More specifically, a score $u(x_k)$ is assigned to each patient $x_k$ by counting the number of patients being inferior and subtracting the number of patients being superior:

$$ u(x_k) = \sum_{k'} I(x_{k'} < x_k) - \sum_{k'} I(x_{k'} > x_k). \qquad [5] $$

Fig. 2.6. Quality control images generated by Harshlight for Affymetrix chips (top, modified from BMC Bioinformatics (39, 40) © 2005, with permission from BioMed Central Ltd) and by the ‘Bead Array Subversion of Harshlight’ (bottom, modified from Bioinformatics (43) © 2008, with permission from Oxford University Press).

3.1.2. Computational Aspects
When Gustav Deuchler (44) in 1914 developed what more than 30 years later became known as the Mann–Whitney test (11) (although he missed one term in the asymptotic variance), he proposed a computational scheme of striking simplicity, namely to create a square table with the data as both the row and the column headers (Fig. 2.7). With this, the row sums of the signs of paired comparisons yield the u-scores. Since the matrix is antisymmetric, only one half (minus the main diagonal) needs to be filled. Moreover, as Kehoe and Cliff (46) pointed out, even some of this information is redundant. The interactive program Interord computes which additional entries could be implied by transitivity from the entries already made.
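The tabular scheme can be sketched in a few lines of Python (used here only for illustration; the chapter's own software is written for R and S-PLUS). The function name `u_scores` is hypothetical, the sketch covers a single variable only, and the data are the two variables shown in Fig. 2.7.

```python
def sign(difference):
    """Return -1, 0, or 1 according to the sign of the difference."""
    return (difference > 0) - (difference < 0)


def u_scores(values):
    """Fill the square table of paired comparisons and return its row sums.

    Entry (row, column) is sign(row value - column value), so each row sum is
    the number of inferior observations minus the number of superior ones.
    """
    table = [[sign(row - col) for col in values] for row in values]
    return [sum(row) for row in table], table


y1 = [3, 3, 2, 2]   # subjects A-D, variable Y1 of Fig. 2.7
y2 = [2, 1, 1, 1]   # subjects A-D, variable Y2 of Fig. 2.7
scores_y1, table_y1 = u_scores(y1)
scores_y2, table_y2 = u_scores(y2)
```

The sketch fills the whole table for clarity, although only one triangle is needed because entry (k, k′) is the negative of entry (k′, k).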
[Figure: paired-comparison matrices; entries are the signs of (row value − column value)]

      Y1 |  3   3   2   2            Y2 |  2   1   1   1
A     3  |  0   0   1   1      A     2  |  0   1   1   1
B     3  |  0   0   1   1      B     1  | –1   0   0   0
C     2  | –1  –1   0   0      C     1  | –1   0   0   0
D     2  | –1  –1   0   0      D     1  | –1   0   0   0
Fig. 2.7. Matrices of paired comparisons for two variables (Y1, Y2), each observed in four subjects (modified from Statistical Applications in Genetics and Molecular Biology, Morales et al. (45) available at http://www.bepress.com/sagmb/vol7/iss1/art19/, with permission from The Berkeley Electronic Press, © 2008).
3.2. Randomization and Stratification

3.2.1. Introduction
When several conditions are to be compared, the key principle to be observed is that of randomization. In most situations, randomization is the only practical strategy to ensure that the results seen can be causally linked to the interventions to be compared. When observational units vary, it is often useful to reduce the effect of confounding variables through stratification. A special case of stratification is the sign test, when two interventions are to be compared within a stratum of two closely related subjects (e.g., twins). Here we will consider situations where more than two subjects form a stratum, e.g., where subjects are stratified by sex, race, education, or the presence of genetic or environmental risk factors. The fundamental concept underlying the sign test is easily generalized to situations where a stratum contains more than two observations. Under H0, the expectation of $T_j = \sum_{k=1}^{M_j} u(x_{jk})$ is zero. For the two-sample case, Mann and Whitney, in 1947 (11), reinvented the ‘u-test’ (this time with the correct variance). As u-scores are a linear function of ranks (u = 2r − (n + 1)), their test is equivalent (47) to the rank-sum test proposed in 1945 by Wilcoxon (10). The Wilcoxon–Mann–Whitney (WMW) test, in turn, is a special case of the Kruskal–Wallis 1952 (KW) rank test for >2 groups (12). The notation used below will prove to be particularly useful to generalize these results. The test statistic $W_{KW}$ is based on the sum of u-scores $U_j = \sum_{k=1}^{M_j} u(X_{jk})$ within each group j = 1, . . . , p. It can be computed as a ‘quadratic form’:

$$ W_{KW} = U' V^{-} U = \sum_{j=1}^{p} \sum_{j'=1}^{p} U_j \, v^{-}_{jj'} \, U_{j'} \sim_{as.} \chi^2_{p-1}, \qquad [6] $$
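where $V^{-}$ is a generalized inverse of the variance–covariance matrix V. For the KW test, the matrix V, which describes the experimental design and the variance among the scores, is:

$$ V_{KW} = s_0^2 \left( \begin{pmatrix} M_1 & & \\ & \ddots & \\ & & M_p \end{pmatrix} - \begin{pmatrix} \frac{M_1 M_1}{M_+} & \cdots & \frac{M_1 M_p}{M_+} \\ \vdots & \ddots & \vdots \\ \frac{M_p M_1}{M_+} & \cdots & \frac{M_p M_p}{M_+} \end{pmatrix} \right), \qquad [7] $$

$$ s_0^2 = \frac{1}{M_+ - 1} \begin{cases} \sum_{x=1}^{M_+} u^2(x) & \text{unconditional} \\ \sum_{j=1}^{p} \sum_{k=1}^{M_j} u^2(X_{jk}) & \text{conditional on ties.} \end{cases} $$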
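Traditionally, the WMW and KW tests have been presented with formulae designed to facilitate numerical computation at the expense of conceptual clarity, e.g.:

$$ V^{-}_{KW} = \frac{1}{s_0^2} \begin{pmatrix} \frac{1}{M_1} & & \\ & \ddots & \\ & & \frac{1}{M_p} \end{pmatrix}, \qquad s_0^2 = \frac{M_+ (M_+ + 1)}{3} \begin{cases} 1 & \text{unconditional} \\ 1 - \sum_{g=1}^{G} \left( W_g^3 - W_g \right) / \left( M_+^3 - M_+ \right) & \text{conditional on ties,} \end{cases} $$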
where $W_g$ indicates the size of the g-th tie (group of identical observations). (The constant ‘3’ in the denominator of $s_0^2$ is replaced by ‘12’ for ranks.) The unconditional version (equivalent to the conditional version in the absence of ties) then yields the well known and computationally simple form:

$$ W_{KW} = \frac{3}{M_+ (M_+ + 1)} \sum_{j=1}^{p} \frac{T_j^2}{M_j} \sim_{as.} \chi^2_{p-1}. $$
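A minimal Python sketch of this computation is given below, assuming a single block and using the variance conditional on ties from [7]; without ties it reduces to the simple form above. The function names are hypothetical and the three groups of three observations are made-up illustration data, not results from the chapter.

```python
def u_scores(values):
    """u-score of each observation: number inferior minus number superior."""
    return [sum((v > w) - (v < w) for w in values) for v in values]


def kruskal_wallis_u(groups):
    """Kruskal-Wallis statistic computed from u-scores (hypothetical helper).

    groups: list of lists of observations, one list per group.
    Uses s0^2 = sum(u^2) / (M+ - 1), i.e., the version conditional on ties,
    and the diagonal generalized inverse (1 / s0^2) * diag(1 / M_j).
    """
    pooled = [value for group in groups for value in group]
    scores = u_scores(pooled)
    s0_squared = sum(u * u for u in scores) / (len(pooled) - 1)
    statistic, start = 0.0, 0
    for group in groups:
        group_sum = sum(scores[start:start + len(group)])   # T_j
        statistic += group_sum ** 2 / len(group)
        start += len(group)
    return statistic / s0_squared


example_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # illustrative data only
w_kw = kruskal_wallis_u(example_groups)
```

For this example the group sums of u-scores are −18, 0, and 18, and $s_0^2 = 9 \cdot 10 / 3 = 30$, so the statistic agrees with the classical rank-based formula.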
With the now abundant computer power one can shift focus to the underlying design aspects and features of the score function represented in the left and right part of [7], respectively. In particular, the form presented in [6] can be more easily extended to other designs (below).

3.2.2. Complete Balanced Design
To extend u-tests to stratified designs, we add an index i = 1, . . . , n for the blocks to be analyzed. With this, the (unconditional) sign test can be formally rewritten as
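$$ V_{ST} = n s_0^2 \left( \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \begin{pmatrix} \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} \end{pmatrix} \right) = n s_0^2 \left( I - \tfrac{1}{2} J \right), \qquad V^{-}_{ST} = \frac{1}{n s_0^2} I, \qquad s_0^2 = \sum_{j=1}^{2} u_j^2 = 2, $$

$$ W_{ST} = \frac{1}{2n} \sum_{j=1}^{2} T_j^2 = \frac{1}{n} (N_+ - N_-)^2 \sim_{as.} \chi^2_{p-1}. $$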
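Coming from a background in econometrics, the winner of the 1976 Nobel prize in economics, Milton Friedman, had presented a similar test (48) in 1937, yet for more than two conditions:

$$ V_{FM} = n s_0^2 \left( \begin{pmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{pmatrix} - \begin{pmatrix} \frac{1}{p} & \cdots & \frac{1}{p} \\ \vdots & \ddots & \vdots \\ \frac{1}{p} & \cdots & \frac{1}{p} \end{pmatrix} \right) = n s_0^2 \left( I - \tfrac{1}{p} J \right), \qquad V^{-}_{FM} = \frac{1}{n s_0^2} I, \qquad s_0^2 = \frac{1}{p - 1} \sum_{j=1}^{p} u_j^2 = \frac{p (p + 1)}{3}, $$

$$ W_{FM} = \frac{3}{n p (p + 1)} \sum_{j=1}^{p} T_{+j}^2 \sim_{as.} \chi^2_{p-1}. $$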
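The Friedman statistic in this form can be sketched as follows in Python. This is an illustration under simplifying assumptions: a complete balanced design with one observation per cell, no ties within blocks, and the unconditional variance. The function name and the four example blocks are hypothetical.

```python
def friedman_u(blocks):
    """Friedman statistic from within-block u-scores (hypothetical helper).

    blocks: list of n blocks, each a list of p observations (one per condition).
    Assumes no ties within a block, so that s0^2 = p (p + 1) / 3.
    """
    n, p = len(blocks), len(blocks[0])
    totals = [0] * p                      # T_{+j}: u-scores summed over blocks
    for block in blocks:
        for j, value in enumerate(block):
            totals[j] += sum((value > other) - (value < other) for other in block)
    return 3 * sum(t * t for t in totals) / (n * p * (p + 1))


example_blocks = [[1, 2, 3], [1, 2, 3], [1, 3, 2], [2, 1, 3]]   # illustrative data
w_fm = friedman_u(example_blocks)
```

With these data the column totals of u-scores are −6, 0, and 6, and the result coincides with the familiar rank-based version $12 \sum_j R_j^2 / (n p (p + 1)) - 3 n (p + 1)$.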
In either case, noninformative blocks are excluded with ‘conditioning on ties’. In the field of voting theory, the blocks represent voters and the u-scores are equivalent to the Borda counts, originally proposed in 1781 (49), around the time of Arbuthnot’s contributions, although the concept of summing ranks across voters was already introduced by Ramon Llull (1232–1315) (50).

3.2.3. Incomplete and/or Unbalanced Designs
In the 1950s (51, 52), it was demonstrated that the Kruskal–Wallis (12) and Friedman (48) tests, and also the tests of Durbin (53) and Bradley–Terry (54), can be combined and extended by allowing for both several strata and several observations within each cell (combination of stratum and group). However, when blocks represent populations of different size (proportionate to $m_i$, say) and/or have missing data ($M_i < m_i$), the problem arises of how to deal with unbalanced and/or incomplete designs. Between 1979 and 1987, several approaches for weighting blocks were suggested (33). In 1988, Wittkowski (33) used the marginal likelihood principle to prove that Mike Prentice’s ‘intuition’ (55) was correct, i.e., that u-scores (or ranks) should be scaled by a factor of $(m_+ + 1)/(M_+ + 1)$ to reflect differences in population sizes and missing data. In 2005, Alvo et al. (56, 57) confirmed these results using a different approach:

$$ u(x_{ijk}) = \sum_{j'} \sum_{k'} I(x_{ij'k'} < x_{ijk}) - \sum_{j'} \sum_{k'} I(x_{ij'k'} > x_{ijk}), $$

$$ U_{ij} = \frac{m_{i+} + 1}{M_{i+} + 1} \sum_{k=1}^{M_{ij}} u(X_{ijk}), $$

$$ V_i = s_{i0}^2 \left( \begin{pmatrix} M_{i1} & & \\ & \ddots & \\ & & M_{ip} \end{pmatrix} - \begin{pmatrix} \frac{M_{i1} M_{i1}}{M_{i+}} & \cdots & \frac{M_{i1} M_{ip}}{M_{i+}} \\ \vdots & \ddots & \vdots \\ \frac{M_{ip} M_{i1}}{M_{i+}} & \cdots & \frac{M_{ip} M_{ip}}{M_{i+}} \end{pmatrix} \right), $$

$$ s_{i0}^2 = \frac{1}{M_{i+} - 1} \left( \frac{m_{i+} + 1}{M_{i+} + 1} \right)^2 \times \begin{cases} \sum_{x=1}^{M_{i+}} u^2(x) & \text{unconditional} \\ \sum_{j=1}^{p} \sum_{k=1}^{M_{ij}} u^2(X_{ijk}) & \text{conditional on ties.} \end{cases} \qquad [8] $$
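In contrast to the above special cases, no generalized inverse with a closed form solution is known, so that one has to rely on a numerical solution:

$$ W = U_+' V_+^{-} U_+ = \sum_{j, j' = 1}^{p} \left( \sum_{i=1}^{n} U_{ij} \right) v^{-}_{jj'} \left( \sum_{i=1}^{n} U_{ij'} \right) \sim_{as.} \chi^2_{p-1}. \qquad [9] $$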
When the conditions to be compared are genotypes, a situation may arise where the genotypes of some subjects are not known. Still, phenotype information from those subjects can be used when computing the scores by assigning these subjects to a pseudo group j = 0, which is used for the purpose of scoring only.

3.3. Applications

3.3.1. Binary Data/Mantel Haenszel

Genome-wide association studies (GWAS) often aim at finding a locus where the proportions of alleles differ between cases and controls. For most human single-nucleotide polymorphisms (SNPs), only two alleles have been seen. Thus, the data can be organized in a 2 × 2 table.

            Control      Cases
Allele 0    M_1^(0)      M_2^(0)      M_+^(0)
Allele 1    M_1^(1)      M_2^(1)      M_+^(1)
            M_1          M_2          M_+
While data for the sign test can also be arranged in a 2 × 2 table, the question here is not whether $M_1^{(1)} = M_2^{(0)}$, but whether $M_1^{(1)}/M_1 = M_2^{(1)}/M_2$. Thus, the appropriate test here is not the sign test, but Fisher’s exact or, asymptotically, the χ² test
for independence or homogeneity. This χ² test is asymptotically equivalent to the WMW test for two response categories. As with the sign test, ties are considered ‘exact’ in genetics and, thus, the variance conditional on the ties is applied. When the data are stratified, e.g., by sex, one obtains a 2 × 2 table for each block and the generalization of the χ² test is known as the Cochran–Mantel–Haenszel (CMH) test, which can also be seen as a special case of the W test:

$$ W_{CMH} = \frac{\left[ \sum_{i=1}^{n} \frac{m_{i+} + 1}{M_{i+} + 1} \, M_{i1} M_{i2} \, (P_{i1} - P_{i2}) \right]^2}{\sum_{i=1}^{n} \left( \frac{m_{i+} + 1}{M_{i+} + 1} \right)^2 \frac{M_{i+}^2}{M_{i+} - 1} \, M_{i1} M_{i2} \, P_{i+} (1 - P_{i+})} = \frac{\left[ \sum_{i=1}^{n} \frac{m_{i+} + 1}{M_{i+} + 1} \left( M_{i1}^{(1)} M_{i2}^{(0)} - M_{i1}^{(0)} M_{i2}^{(1)} \right) \right]^2}{\sum_{i=1}^{n} \left( \frac{m_{i+} + 1}{M_{i+} + 1} \right)^2 \frac{1}{M_{i+} - 1} \, M_{i+}^{(1)} M_{i+}^{(0)} M_{i1} M_{i2}} \sim_{as.} \chi_1^2. $$
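As a rough numerical check of the stratified statistic, the Python sketch below evaluates the count-based form of $W_{CMH}$ for a list of 2 × 2 tables. The function name, the table layout, and the example counts are hypothetical; when no population sizes $m_{i+}$ are supplied, the sketch assumes $m_{i+} = M_{i+}$ (no missing data), so that every block weight equals one.

```python
def w_cmh(tables, population_sizes=None):
    """Stratified 2 x 2 statistic in its count-based form (hypothetical helper).

    tables: one entry per block, ((M1_1, M1_0), (M2_1, M2_0)), i.e., the number
            of alleles 1 and 0 in group 1 and in group 2.
    population_sizes: optional list of m_{i+}; defaults to the observed M_{i+}.
    """
    numerator, denominator = 0.0, 0.0
    for i, ((g1_a1, g1_a0), (g2_a1, g2_a0)) in enumerate(tables):
        m1, m2 = g1_a1 + g1_a0, g2_a1 + g2_a0          # group sizes M_i1, M_i2
        a1, a0 = g1_a1 + g2_a1, g1_a0 + g2_a0          # allele totals M_i+^(1), M_i+^(0)
        total = m1 + m2                                # M_i+
        size = total if population_sizes is None else population_sizes[i]
        weight = (size + 1) / (total + 1)
        numerator += weight * (g1_a1 * g2_a0 - g1_a0 * g2_a1)
        denominator += weight ** 2 * a1 * a0 * m1 * m2 / (total - 1)
    return numerator ** 2 / denominator


single_block = [((10, 5), (4, 11))]      # illustrative counts only
w_single = w_cmh(single_block)
```

For a single block the result equals Pearson's χ² for the table multiplied by $(M_+ - 1)/M_+$, which is one way to sanity-check an implementation.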
3.3.2. Degrees of Freedom in GWAS
For most statistical applications, the degrees of freedom for χ² tests is the rank of the variance–covariance matrix V. In genetic and genomic screening studies, however, the dimension of the variance–covariance matrix may vary depending on the number of groups present for a particular gene or locus. Clearly, it would be inappropriate in a screening study to decrease the degrees of freedom (df) for the χ² distribution if one (or more) of the groups is (are) missing, but revert to the full df when a single observation for this group is available for another gene or locus.
3.3.3. Implementation and Availability
Among the many obstacles that have prevented statistical methods based on u-scores from being used more widely is that they are traditionally presented as an often confusing hodgepodge of procedures, rather than a unified approach backed by a comprehensive theory. For instance, the Wilcoxon rank-sum and the Wilcoxon signed-rank tests are easily confused. Both were published in the same paper (10) but they are not special cases of a more general approach. The Lam–Longnecker test (58), on the other hand, directly extends the WMW to paired data. Finally, the Wilcoxon rank-sum and the Mann–Whitney u tests are equivalent (47), although they were independently developed based on different theoretical approaches. Above, we have demonstrated that several rank or u-tests can be presented as a quadratic form $W = U_+' V_+^{-} U_+ \sim_{as.} \chi^2_{p-1}$, where the variance–covariance matrix $V_+$ reflects the various designs. The muStat package (available for both R and S-PLUS) takes advantage of this fact, both by providing the unified
#----------------------------------------------------------------------
# mu.test          (y, groups, blocks, …)   # most general form
# mu.friedman.test (y, groups, blocks, …)   # one observation per cell
# mu.kruskal.test  (y, groups, …)           # single block
# mu.wilcox.test   (y, groups, …)           # single block/two groups
#----------------------------------------------------------------------
#   y,                # data (NA allowed)
#   groups,           # groups (unbalanced allowed)
#   blocks            # blocks (unequal sz allowed)
#   score = "rank",   # NULL: y already scored
#   paired = FALSE,   # wilcox only
#   exact = NULL,     # wilcox only
#   optim = TRUE      # optimize for special cases
#   df = -1,          # >0: as entered, =0: by design, -1: by data
#----------------------------------------------------------------------
mu.test

T0 | θ] ≤ α for all θ ∈ Θ0, then it follows that the same T0 would also satisfy BE1(T0) ≤ α for any prior π(θ). In a similar manner one also defines the Bayesian type II error rate as BE2(T0) = Pr[T(X) ≤ T0 | θ ∈ Θa]. Clearly, as T0 increases BE1(T0) decreases while BE2(T0) increases. One possible way to control both types of errors would be to find a T0 that minimizes the total weighted error, TWE(T0) = w1 BE1(T0) + w2 BE2(T0), for some suitable nonnegative weights w1 and w2. In other words, we can determine T̂0 = arg min TWE(T0). Notice that if we choose w2 = 0, this would correspond to controlling only the (Bayesian) type I error rate, while choosing a large value for w2 would be equivalent to controlling the (Bayesian) power 1 − BE2(T0). The above procedure can also be used to determine the required sample size. In other words, once T̂0 is determined for a given sample size n, we can find the optimal (minimum) sample size so that the total error rate BE1(T̂0) + BE2(T̂0) ≤ α.
4. Region Estimation

In addition to computing a point estimator for η or testing the value of θ in a given region, we can determine an interval and, more generally, a region (as a subset of the parameter space Θ) such that the posterior probability of that region exceeds a given threshold value (e.g., 0.90). In other words, we would like to determine a subset R = R(X) such that Pr[θ ∈ R(X) | X] ≥ 1 − α for a given value α ∈ (0, 1). The set R(X) is said to be a 100(1 − α)% credible set for θ. The definition of a credible set may sound similar to the traditional confidence set (computed by means of frequentist methods), but these two sets have very different interpretations. A 100(1 − α)% credible set R = R(X) guarantees that the probability that θ is in R(X) is at least 1 − α, whereas a 100(1 − α)% confidence set C(X) for θ merely suggests that if the method
168
Ghosh
of computing the confidence set is repeated many times then at least a 1 − α proportion of those confidence sets would contain θ, i.e., Pr[C(X) ∋ θ | θ] ≥ 1 − α for all θ ∈ Θ. Notice that given an observed data vector X the chance that a confidence set C(X) contains θ is either 0 or 1. In many applied sciences this fundamental distinction between the interpretations of credible and confidence sets is often overlooked, and many applied researchers wrongly interpret the confidence set as the credible set. Given a specific level 1 − α of a credible set R(X), it might be of interest to find the “smallest” such set or region that maintains the given level. Such a region is obtained by computing the highest probability density (HPD) region. A region R(X) is said to be an HPD region of level 1 − α, if

$$ R(X) = \{ \theta \in \Theta : K(\theta; X) > K_0(X) \}, \qquad [10] $$
where K0(X) > 0 is chosen to satisfy Pr[θ ∈ R(X) | X] ≥ 1 − α and K(θ; X) denotes the posterior kernel function as defined in [2]. In practice, it might not be straightforward to compute the HPD region R(X) as defined in [10], but many numerical methods (including Monte Carlo methods) are available to compute such regions (see Section 5 for more details). Notice that even when η = η(θ) is a real-valued parameter of interest, the HPD region for η may not be a single interval but may consist of a union of intervals. For example, when the posterior density is bimodal, the HPD region may turn out to be the union of two intervals, one around each of the two modes. However, when the posterior density of a real-valued parameter η is unimodal, the HPD region will necessarily be an interval of the form R(X) = (a(X), b(X)) ⊆ Θ, where −∞ ≤ a(X) < b(X) ≤ ∞.
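One simple way to approximate an HPD region for a real-valued parameter is to evaluate the kernel on a fine, equally spaced grid and keep the grid points with the highest kernel values until the required share of the total mass is reached. The Python sketch below does this; it is only an illustration, the function name `hpd_region` is hypothetical, and the bimodal kernel (two unit-variance normal bumps centered at −3 and 3) is made up to show a region that is a union of two intervals.

```python
from math import exp


def hpd_region(grid, kernel, level=0.90):
    """Approximate HPD region on an equally spaced grid (hypothetical helper).

    Returns a list of (lower, upper) intervals; the kernel need not be normalized.
    """
    heights = [kernel(theta) for theta in grid]
    target = level * sum(heights)
    keep = [False] * len(grid)
    mass = 0.0
    # Add grid points in order of decreasing kernel value until the level is reached.
    for index in sorted(range(len(grid)), key=lambda i: heights[i], reverse=True):
        if mass >= target:
            break
        keep[index] = True
        mass += heights[index]
    # Collapse runs of kept grid points into intervals.
    intervals, start = [], None
    for index, kept in enumerate(keep):
        if kept and start is None:
            start = grid[index]
        elif not kept and start is not None:
            intervals.append((start, grid[index - 1]))
            start = None
    if start is not None:
        intervals.append((start, grid[-1]))
    return intervals


def bimodal_kernel(theta):
    """Made-up unnormalized posterior kernel with modes at -3 and 3."""
    return exp(-0.5 * (theta + 3) ** 2) + exp(-0.5 * (theta - 3) ** 2)


theta_grid = [-8 + 0.01 * i for i in range(1601)]
region = hpd_region(theta_grid, bimodal_kernel, level=0.90)
```

For this kernel the region consists of two intervals, one around each mode. The accuracy depends on the grid spacing, and the grid approach becomes impractical for high-dimensional θ.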
5. Numerical Integration Methods
Almost any posterior inference based on the kernel K(θ; X) involves high-dimensional integration when the parameter space Θ is itself a high-dimensional subset of the Euclidean space $R^m$. For instance, in order to compute the posterior mean η̂(X) as defined in [5], we need to compute two (possibly) high-dimensional integrals, one for the numerator and another for the denominator. Similarly, to compute the posterior probabilities in [6] we need to compute two high-dimensional integrals. Also, to compute the HPD region in [10] we again need to compute high-dimensional integrals. Thus, in general, given a generic function g(θ) of the parameter θ, posterior inference requires the computation of the integral
Basics of Bayesian Methods
$$ I = I(X) = \int g(\theta) \, K(\theta; X) \, d\theta. \qquad [11] $$
Notice that we can take g(θ) = 1 to compute the denominator of [5] or [6] and g(θ) = η(θ) to compute the numerator of [5]. Similarly, we can choose g(θ) = $I_j(\theta)$ to compute the numerator of [6]. There are primarily two numerical approaches to compute integrals of the form I(X) as defined in [11]. First, when the dimension m of the parameter space is small (say m < 5), the classical (deterministic) numerical methods usually perform quite well in practice, provided the functions g(θ) and K(θ; X) satisfy certain smoothness conditions. Second, when the dimension m is large (say m ≥ 5), the stochastic integration methods, more popularly known as the MC methods, are more suitable. One of the advantages of the MC methods is that very few regularity conditions are required for the function g(θ). It is well known that if a (deterministic or stochastic) numerical method requires N function evaluations to compute the integral I in [11], then the order of accuracy of a deterministic numerical integration is generally $O(N^{-2/m})$, whereas the order of accuracy of a stochastic integration is $O(N^{-1/2})$. Thus, the rate at which the error of the deterministic integral converges to zero depends on the dimension m, and hence any such deterministic numerical integration suffers from the so-called curse of dimensionality. On the other hand, the rate at which the error of the Monte Carlo integration converges (in probability) to zero does not depend on the dimension m. However, whenever a deterministic numerical integration can be performed, it usually provides a far more accurate approximation to the targeted integral than that provided by the MC integration using the same number of function evaluations. In recent years, much progress has been made with both types of numerical integration methods, and we provide a very brief overview of the methods and related software to perform such integrations.

5.1. Deterministic Methods
The commonly used deterministic numerical methods to compute integrals usually depend on some adaptation of the quadrature rules that have been known for centuries and have been primarily developed for one-dimensional problems (i.e., when the dimension m of the parameter space is unity). In order to understand the basic ideas we first consider the one-dimensional (i.e., m = 1) case, where the goal is to compute the integral $I = \int_a^b h(\theta) \, d\theta$, where h(θ) = g(θ)K(θ; X) and we have suppressed the dependence on the data X (as once X is observed it is fixed for the rest of the posterior analysis). The basic idea is to find suitable weights $w_j$ and ordered knot points $\theta_j$ such that

$$ I = \int_a^b h(\theta) \, d\theta \approx I_N = \sum_{j=1}^{N} w_j \, h(\theta_j), \qquad [12] $$
where N = 2, 3, . . . is chosen large enough to ensure that $|I - I_N| \to 0$ as $N \to \infty$. Different choices of the weights and the knot points lead to different integration rules. A majority of the numerical quadrature rules fall into two broad categories: (i) Newton–Cotes rules, which are generally based on equally spaced knot points (e.g., $\theta_j - \theta_{j-1} = (b - a)/N$) and (ii) Gaussian quadrature rules, which are based on knot points obtained via the roots of orthogonal polynomials and hence are not necessarily equi-spaced. The Gaussian quadrature rules usually provide more accurate results, especially when the function h(·) can be approximated reasonably well by a sequence of polynomials. The simplest examples of Newton–Cotes rules are the trapezoidal rule (a two-point rule), the mid-point rule (a one-point rule), and Simpson’s rule (a three-point rule). In general, if a (2k − 1)-point (or 2k-point) rule with N equally spaced knots with suitably chosen weights is used, it can be shown that $|I - I_N| = C_k (b - a)^{2k+1} h^{(2k)}(\theta^*)/N^{2k}$ for some $\theta^* \in (a, b)$ and a constant $C_k > 0$. The Gaussian quadratures go a step further and suitably choose not only the weights but also the knots at which the function h(·) is evaluated. The Gaussian quadrature rules based on only N knots can produce exact results for polynomials of degree 2N − 1 or less, whereas Newton–Cotes rules usually produce exact results for polynomials of degree N (if N is odd) or degree (N − 1) (if N is even). Moreover, the Gaussian quadrature rules can be specialized to use K(θ; X) as the weight function. The literature on different rules is huge and is beyond the scope of this chapter (see (31) and Chapter 12 (pp. 65–169) of (33) for more details). In the software R (see Section 6.1) such one-dimensional integration is usually carried out using a function known as integrate, which is based on the popular QUADPACK routines dqags and dqagi (see (33) for further details).
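The error orders quoted above can be checked numerically. The Python sketch below implements the composite trapezoidal and Simpson rules and a two-point Gauss–Legendre rule; the function names are hypothetical, and the integrand $e^{\theta}$ on (0, 1) is chosen only because its integral, e − 1, is known exactly. This is a teaching sketch and not a substitute for adaptive routines such as those in QUADPACK.

```python
from math import e, exp, sqrt


def trapezoid(h, a, b, n):
    """Composite trapezoidal rule with n equal subintervals."""
    step = (b - a) / n
    interior = sum(h(a + j * step) for j in range(1, n))
    return step * (0.5 * h(a) + interior + 0.5 * h(b))


def simpson(h, a, b, n):
    """Composite Simpson rule; n must be even."""
    step = (b - a) / n
    odd = sum(h(a + j * step) for j in range(1, n, 2))
    even = sum(h(a + j * step) for j in range(2, n, 2))
    return step / 3 * (h(a) + 4 * odd + 2 * even + h(b))


def gauss_legendre_2(h, a, b):
    """Two-point Gauss-Legendre rule, exact for polynomials up to degree 3."""
    mid, half = (a + b) / 2, (b - a) / 2
    offset = half / sqrt(3)
    return half * (h(mid - offset) + h(mid + offset))


exact = e - 1   # integral of exp(theta) over (0, 1)
trapezoid_ratio = (abs(trapezoid(exp, 0, 1, 10) - exact)
                   / abs(trapezoid(exp, 0, 1, 20) - exact))
simpson_ratio = (abs(simpson(exp, 0, 1, 10) - exact)
                 / abs(simpson(exp, 0, 1, 20) - exact))
cubic_integral = gauss_legendre_2(lambda t: t ** 3, 0, 1)
```

Doubling the number of subintervals reduces the trapezoidal error by a factor of about 4 and the Simpson error by a factor of about 16, in line with the $N^{-2k}$ behavior for k = 1 and k = 2, while the two-point Gaussian rule reproduces the integral of a cubic up to rounding error.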
An illustration of using the R function integrate is given in Section 6.1. The above one-dimensional numerical integration rules can be generalized to the higher-dimensional case; in general, if the multidimensional trapezoidal rule is used, one can show that I = I_N + O(N^{−2/m}), where N denotes the number of knots at which we have to evaluate the function h(θ), and θ = (θ_1, . . . , θ_m) is now an m-dimensional vector. Thus the accuracy of the integration rule diminishes rapidly with increasing dimension m. One may try to replace the multidimensional trapezoidal rule by a multidimensional version of a Simpson-type rule, but such a rule would still lead to an error bound of O(N^{−4/m}). In the next section we will see that the
error bound of the Monte Carlo methods is O_p(N^{−1/2}) irrespective of the dimension m, and hence for very large dimensions Monte Carlo integration methods are becoming more and more popular. In summary, although the deterministic numerical integration methods are very accurate (sometimes even exact for certain classes of functions), their usefulness declines rapidly with increasing dimension. Moreover, when the parameter space is a complicated region (e.g., embedded in a lower-dimensional space) or the function to be integrated is not sufficiently smooth, the theoretical error estimates mentioned above are no longer valid. Despite such limitations, a lot of progress has been made in recent years, and an R package known as adapt offers generic tools to perform numerical integration in higher dimensions (e.g., for 2 ≤ m ≤ 20). Many innovative high-dimensional integration methods and related software are available online at http://www.math.wsu.edu/faculty/genz/software/software.html, maintained by Alan Genz at Washington State University.

5.2. Monte Carlo Methods
MC statistical methods are used in statistics and various other fields (physics, operational research, etc.) to solve problems by generating pseudo-random numbers and exploiting the fact that sample analogues computed from those numbers converge to their population versions. The method is useful for obtaining numerical solutions to problems that are too complicated to solve analytically or by deterministic methods. The most common application of the Monte Carlo method is Monte Carlo integration. The catchy name "Monte Carlo," derived from the famous casino in Monaco, was probably popularized by notable physicists such as Stanislaw Marcin Ulam, Enrico Fermi, John von Neumann, and Nicholas Metropolis. S. M. Ulam (whose uncle was a gambler) noticed a similarity between casino activities and the random, repetitive nature of the simulation process and coined the term Monte Carlo to honor his uncle (see (34)). The success of MC methods of integration rests primarily on two celebrated results in Statistics: (i) the (Strong or Weak) Law of Large Numbers (LLN) and (ii) the Central Limit Theorem (CLT). For a moment suppose that we can generate a sequence of iid observations θ^(1), θ^(2), . . . by using the kernel K(θ;X) as defined in [2]. In many cases it is possible to generate pseudo-random samples using only the kernel of a density (e.g., by accept–reject sampling), and several innovative algorithms are available (see (35) and (36)). Later we will relax this assumption when we describe an algorithm known as importance sampling.
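The accept–reject idea mentioned above can be sketched in R as follows. The sketch assumes a binomial-type kernel K(θ) = θ^s (1 − θ)^(n−s) with illustrative values s = 15 and n = 20, a Uniform(0, 1) proposal, and the bound M = K(s/n); the function name accept_reject is hypothetical. Only the kernel is evaluated, never the normalizing constant.

```r
set.seed(1)
s <- 15; n <- 20                                      # illustrative data (assumed)
K <- function(theta) theta^s * (1 - theta)^(n - s)    # kernel of the target density
M <- K(s / n)                                         # maximum of K on [0, 1], attained at s/n

# Accept-reject sampler with a Uniform(0, 1) proposal:
# a candidate is kept with probability K(candidate) / M
accept_reject <- function(N) {
  draws <- numeric(0)
  while (length(draws) < N) {
    cand <- runif(N)
    u <- runif(N)
    draws <- c(draws, cand[u <= K(cand) / M])
  }
  draws[seq_len(N)]
}

theta <- accept_reject(5000)
mc_mean <- mean(theta)               # Monte Carlo estimate of the mean of the target
exact_mean <- (s + 1) / (n + 2)      # mean of Beta(s + 1, n - s + 1), for comparison
```

Because this kernel is proportional to a Beta(s + 1, n − s + 1) density, the exact mean is available here and serves only as a check; in realistic problems no such closed form exists.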
First assume that we can generate θ^(l) iid from p(θ|X) for l = 1, 2, . . . , N by using only K(θ;X). Then it follows from the (Strong/Weak) LLN that, as N → ∞,

$$\bar{g}_N = \frac{1}{N}\sum_{l=1}^{N} g(\theta^{(l)}) \xrightarrow{p} E[g(\theta)\mid X] = \int g(\theta)\,p(\theta\mid X)\,d\theta, \qquad [13]$$
where p(θ|X) denotes the posterior density and it is assumed that E[|g(θ)| | X] < ∞, i.e., the posterior mean of g(θ) is finite. Notice that by using the above result we can avoid computing the numerator and denominator of [5], provided we are able to generate random samples by using only the kernel of the posterior density. In other words, the sample mean ḡ_N defined in [13] converges to the population mean E[g(θ)|X] as N becomes large (with probability one under the Strong LLN). Also notice that almost no smoothness condition is required on the function g(·) to apply the MC method. Although N can be chosen as large as we wish, it would be nice to know in practice how large is good enough for a given application. To determine approximately the smallest value of N that we need, we can use the CLT, which states that as N → ∞,

$$\sqrt{N}\,\bigl(\bar{g}_N - E[g(\theta)\mid X]\bigr) \xrightarrow{d} N(0, \Sigma_g), \qquad [14]$$
where Σ_g = E[(g(θ) − E[g(θ)|X])(g(θ) − E[g(θ)|X])^T | X] is the posterior variance (matrix) of g(θ), provided E[‖g(θ)‖² | X] < ∞. Hence it follows that ḡ_N = E[g(θ)|X] + O_p(N^{−1/2}), and this (stochastic) error bound does not depend on the dimension m of θ. For simplicity, we illustrate how to obtain an approximate value of N using the CLT when g(·) is a real-valued function, so that Σ_g reduces to a scalar variance σ_g². Suppose that, for a given small number ε > 0, we want to determine N that guarantees with 95% probability that ḡ_N is within ±ε of the posterior mean E[g(θ)|X]. By the CLT, the 95% confidence interval for E[g(θ)|X] is given by ḡ_N ± 1.96σ_g/√N. So we want to find N such that 1.96σ_g/√N ≤ ε or, equivalently,

$$N \ge \frac{(1.96)^2\,\sigma_g^2}{\epsilon^2}.$$

Although σ_g² is also unknown, we can "estimate" it by the sample variance σ̂_g² = Σ_{l=1}^{N} (g(θ^(l)) − ḡ_N)² / (N − 1). In summary, if we want an accuracy of ε > 0 with about 95% confidence, we can find N that roughly satisfies N ≥ 4σ̂_g²/ε² (rounding (1.96)² ≈ 3.84 up to 4). For example, if we assume σ_g² = 1 and we require first-decimal accuracy (i.e., ε ≈ 0.1), then we need to generate at least N = 400 random samples from the posterior distribution. However, if we require second-decimal accuracy (i.e., ε ≈ 0.01), then we need to generate at least N = 40,000 samples! In general, one may need to generate a very large number of samples
from the posterior distribution to obtain reasonable accuracy of the posterior means (or generalized moments). The success of the MC method of integration comes at a price: the resulting estimate ḡ_N of the posterior mean is no longer a fixed value and depends on the particular random number generating mechanism (e.g., the "seed" used to generate the random samples). In other words, two persons using the same data and the same model (sampling density and prior) will not necessarily obtain exactly the same estimate of the posterior summary. MC integration provides only a probabilistic error bound, as opposed to the fixed error bound of the deterministic integration methods discussed in the previous section. Also, the stochastic error bound of the MC method can hardly be improved (though there exist methods to reduce the variability of sampling), and MC integration will generally be inferior to the deterministic integration methods when m ≤ 4 and the functions are sufficiently smooth. Hence, if the function g(θ) is smooth and the dimension m of θ is not very large, it is more efficient to use the deterministic integration methods (e.g., using the adapt function available in R). On the other hand, if the dimension is large, g(·) is nonsmooth, or the parameter space is complicated, it is more advantageous to use the MC integration methods.

The next question is what happens when it is not possible to generate samples from the posterior density just by using the kernel function K(θ;X). In such situations we can use what is known as importance sampling, which uses another density to generate samples and then uses a weighted sample mean to approximate the posterior mean. The idea is based on the following simple identity:

$$\int g(\theta)\,K(\theta;X)\,d\theta = \int g(\theta)\,\frac{K(\theta;X)}{q(\theta)}\,q(\theta)\,d\theta = \int g(\theta)\,w(\theta)\,q(\theta)\,d\theta, \qquad [15]$$
where q(θ) is a probability density function whose support contains that of K(θ;X), i.e., {θ ∈ Θ: K(θ;X) > 0} ⊆ {θ ∈ Θ: q(θ) > 0}. The function w(θ) = K(θ;X)/q(θ) as defined in [15] is called the importance weight function and the density q(θ) is called the importance proposal density. Thus, it follows from the identity in [15] that we can use the following algorithm to estimate I = I(X) as defined in [11]:
Generate θ^(l) iid from q(θ) for l = 1, 2, . . . , N.
Compute the importance weights w^(l) = w(θ^(l)) for l = 1, 2, . . . , N.
Compute Ī_N = (1/N) Σ_{l=1}^{N} g(θ^(l)) w^(l).
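The three steps above can be sketched in R as follows. The kernel K(θ) = θ^s (1 − θ)^(n−s) with s = 15 and n = 20, the function g(θ) = log(θ/(1 − θ)), and the Beta(2, 2) proposal q are illustrative assumptions; the closed-form value I_exact is available only because this particular kernel is a Beta kernel, and is included purely as a check.

```r
set.seed(2)
s <- 15; n <- 20                                      # illustrative data (assumed)
K <- function(theta) theta^s * (1 - theta)^(n - s)    # kernel K(theta; X)
g <- function(theta) log(theta / (1 - theta))         # function whose integral is wanted

N <- 20000
theta <- rbeta(N, 2, 2)                # step 1: draws from the proposal q = Beta(2, 2)
w <- K(theta) / dbeta(theta, 2, 2)     # step 2: importance weights w = K / q
I_bar <- mean(g(theta) * w)            # step 3: estimate of I = int g(theta) K(theta; X) dtheta

# Closed form for this kernel: B(s + 1, n - s + 1) * (digamma(s + 1) - digamma(n - s + 1))
I_exact <- beta(s + 1, n - s + 1) * (digamma(s + 1) - digamma(n - s + 1))
Z_bar <- mean(w)                       # estimates the normalizing constant int K(theta; X) dtheta
```

The proposal was chosen so that the weights K/q stay bounded on (0, 1); a proposal with lighter tails than the kernel would make the weights, and hence the estimate, unstable.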
Again it follows from the LLN that Ī_N → I in probability as N → ∞. Notice that we can also estimate the posterior mean E[g(θ)|X] by the quantity ḡ_N = Σ_{l=1}^{N} g(θ^(l)) w^(l) / Σ_{l=1}^{N} w^(l), as Σ_{l=1}^{N} w^(l)/N converges in probability to the normalizing constant ∫ K(θ;X)dθ. From the description above, importance sampling may sound like an all-purpose method to compute posterior summaries, as in theory the importance density q(·) can be any density whose support contains the support of the posterior density. In practice, however, the choice of q(θ) is often not that easy, especially when θ is high-dimensional. In such cases MCMC methods are used to generate (dependent) samples from the posterior distribution, again making use only of the posterior kernel K(θ;X) (37).

An MCMC method is based on sampling from the path of a (discrete-time) Markov chain {θ^(l): l = 1, 2, . . .} whose stationary distribution is the posterior distribution. In other words, given a posterior distribution, an MCMC method creates a suitable transition kernel of a Markov chain that has this posterior as its stationary distribution. One of the most popular approaches is to use the Metropolis algorithm, which was later generalized by Hastings and is now referred to as the Metropolis–Hastings algorithm (38). It is beyond the scope of this chapter to explain the details of the Markov chain theory related to the MCMC methods. One striking difference between the direct MC method described above and the MCMC method is that the samples generated by the MCMC algorithm are dependent, whereas the samples obtained via a direct MC method (e.g., rejection sampling or importance sampling) are independent. Further, the samples generated by the MCMC method are not obtained directly from the posterior distribution; they follow it only in an asymptotic sense. This is the primary reason for discarding the first few thousand samples generated by an MCMC procedure as burn-in. Thus, if {θ^(l): l = 1, 2, . . .} denotes a sequence of samples generated from an MCMC procedure, then one only uses {θ^(l): l = B + 1, B + 2, . . .} to approximate the posterior summaries for a suitably large integer B > 1. In other words, if we want to estimate η = η(θ) based on MCMC samples θ^(l), we can compute η̄ = Σ_{l=B+1}^{N} η(θ^(l)) / (N − B) to approximate its posterior mean E[η|X] for some large integers N > B > 1. WinBUGS is freely available generic software that implements MCMC methods to obtain posterior estimates for a variety of problems. The software can be downloaded from the site: http://www.mrc-bsu.cam.ac.uk/bugs/.
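As a minimal sketch of the burn-in recipe above, the following R code runs a random-walk Metropolis sampler (the special case of Metropolis–Hastings with a symmetric proposal) and averages η(θ) = log(θ/(1 − θ)) over the retained draws. The kernel θ^s (1 − θ)^(n−s) with s = 15 and n = 20, the starting value 0.5, the proposal standard deviation 0.2, and the values N = 20000 and B = 2000 are all illustrative assumptions rather than recommendations from the chapter.

```r
set.seed(3)
s <- 15; n <- 20                       # illustrative data (assumed)

# Log of the posterior kernel; -Inf outside (0, 1) so such candidates are always rejected
log_K <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  s * log(theta) + (n - s) * log(1 - theta)
}

N <- 20000; B <- 2000                  # chain length and burn-in
theta <- numeric(N)
theta[1] <- 0.5                        # arbitrary starting value
for (l in 2:N) {
  cand <- theta[l - 1] + rnorm(1, sd = 0.2)          # symmetric random-walk proposal
  log_ratio <- log_K(cand) - log_K(theta[l - 1])     # only the kernel is needed
  theta[l] <- if (log(runif(1)) < log_ratio) cand else theta[l - 1]
}

eta <- log(theta / (1 - theta))
eta_bar <- mean(eta[(B + 1):N])        # average over l = B + 1, ..., N
eta_exact <- digamma(s + 1) - digamma(n - s + 1)     # closed form for this Beta kernel
```

The draws are dependent, so the effective sample size is smaller than N − B; the closed-form value is shown only to confirm that the chain targets the intended distribution.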
6. Examples

Let us consider again the vitamin C example, where we have f(X|θ) = L(θ;X) = L(θ|s) = θ^s (1 − θ)^{n−s} and K(θ;X) = K(θ|s) = θ^s (1 − θ)^{n−s} k(θ), where k(θ) is a kernel of a (prior) density on [0, 1]. If we are interested in estimating the odds ρ = θ/(1 − θ) or the log-odds η = log ρ, then we can compute the posterior estimator using [5] as follows:

$$\hat{\eta} = \frac{\int_0^1 \log\left(\frac{\theta}{1-\theta}\right)\theta^{s}(1-\theta)^{n-s}\,k(\theta)\,d\theta}{\int_0^1 \theta^{s}(1-\theta)^{n-s}\,k(\theta)\,d\theta}, \qquad [16]$$

for a given choice of the prior kernel k(θ). We now illustrate the computation of [16] using the following kernels:
(a) Beta prior: k(θ) = θ^{a−1} (1 − θ)^{b−1}, where a, b > 0. In particular, a = b = 1 yields the Uniform prior, a = b = 0.5 yields Jeffreys' prior, and a = b = 0 yields Haldane's prior (which is improper). When using Haldane's prior, one should check that s(n − s) > 0; otherwise the prior will lead to an improper posterior distribution.
(b) Zellner's prior: k(θ) = θ^θ (1 − θ)^{1−θ}.
For illustration, suppose 15 out of 20 randomly selected patients reported that their common cold was cured within a week using vitamin C. This leads to n = 20 and s = 15. See Section 6.1 for code to obtain posterior estimates of θ and log(θ/(1 − θ)).

Now suppose that, in addition to the responses x_i ∈ {0, 1}, we also have patient-level predictor variables such as gender, race, age, etc., which we denote by the vector z_i. In such a case we need to use a regression model (e.g., a logistic model) to account for subject-level heterogeneity. Let θ_i = Pr[x_i = 1 | z_i] = θ(z_i) for i = 1, . . . , n. We may use several parametric models for the regression function θ(·), e.g., θ(z) = (1 + exp{−β^T z})^{−1} (which corresponds to a logistic regression model) or θ(z) = Φ(β^T z) (which corresponds to a probit regression model) or, more generally, θ(z) = F_0(β^T z), where F_0(·) is a given distribution function with the real line as its support. In general we can express the framework as a BHM as follows:
x_i | z_i ∼ Ber(θ_i)
θ_i = F_0(β^T z_i)
β ∼ N(0, c_0 I),
where c_0 > 0 is a large constant and I denotes the identity matrix. In Section 6.2 we have provided generic code using the WinBUGS language to implement the model.

6.1. R
To compute the posterior mean estimate of the log-odds η we can use the following code:

    #Prior kernel functions:
    #Beta prior: fix a and b
    #a=1;b=1
    #k=function(theta){
    #theta^(a-1)*(1-theta)^(b-1)}
    #Zellner's prior:
    k=function(theta){
    theta^theta*(1-theta)^(1-theta)}
    #Set observed values of s and n:
    s=15;n=20
    #Likelihood function:
    L=function(theta){
    theta^s*(1-theta)^(n-s)}
    #Posterior kernel function:
    K=function(theta){
    k(theta)*L(theta)}
    #Set the parameter of interest:
    #Odds:
    #num.fun=function(theta){
    #theta/(1-theta)*K(theta)}
    #Log-odds:
    num.fun=function(theta){
    log(theta/(1-theta))*K(theta)}
    #Obtain the numerator and
    #denominator to compute posterior mean:
    num=integrate(num.fun,lower=0,upper=1)
    den=integrate(K,lower=0,upper=1)
    post.mean=num$value/den$value
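As a cross-check on a quadrature result of this kind, the Monte Carlo recipe of Section 5.2 can be applied to the same data. The sketch below assumes the uniform Beta prior (a = b = 1), under which the posterior is Beta(s + a, n − s + b) and can be sampled directly with rbeta; the pilot size of 1,000 and the target accuracy eps = 0.01 are illustrative choices, and the variable names are hypothetical.

```r
set.seed(4)
s <- 15; n <- 20; a <- 1; b <- 1                 # data from the example, uniform prior (assumed)
g <- function(theta) log(theta / (1 - theta))    # log-odds

# Pilot run to estimate the posterior variance of g(theta)
pilot <- g(rbeta(1000, s + a, n - s + b))
sigma2_hat <- var(pilot)

# Sample size from N >= 4 * sigma2_hat / eps^2 (about 95% confidence)
eps <- 0.01
N_needed <- ceiling(4 * sigma2_hat / eps^2)

draws <- g(rbeta(N_needed, s + a, n - s + b))
g_bar <- mean(draws)                                  # Monte Carlo posterior mean
half_width <- 1.96 * sd(draws) / sqrt(N_needed)       # approximate 95% error bound

# Deterministic counterpart, computed by quadrature as in the chapter's code
K <- function(theta) theta^(s + a - 1) * (1 - theta)^(n - s + b - 1)
quad <- integrate(function(th) g(th) * K(th), 0, 1)$value /
  integrate(K, 0, 1)$value
```

The two estimates should agree to within roughly eps, while a different seed would change g_bar but not quad, which is the trade-off between stochastic and deterministic error bounds discussed in Section 5.2.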
Notice that the above code can be easily modified to compute the posterior mean of a one-dimensional parameter of interest for any other likelihood and prior kernel.

6.2. WinBUGS
To compute the posterior mean estimate of the regression coefficient β we can use the following code:

    model{
    for(i in 1:n){
    x[i]~dbern(theta[i])
    theta[i] [...]

[...] BF > 1 the data provide evidence for H0, and when BF < 1 the data provide evidence for H1 (and against H0). Jeffreys (7) suggests BF < 0.1 provides "strong" evidence against H0 and BF < 0.01 provides "decisive" evidence. The posterior probability is simply related to the BF as

$$P(H_0 \mid \text{data}) = \left[1 + \frac{\pi_1}{\pi_0}\,\frac{1}{\mathrm{BF}}\right]^{-1}. \qquad [6]$$
While [4] is simple and elegant, it is not immediately obvious how to compute the key quantity P(data|Hj), because [4] obscures the fact that each hypothesis posits a set of parameters that describe the distribution of the data. When considering the two-sample case in particular, these parameters are μ = μ1 = μ2 under H0 and (μ1, μ2) under H1. P(data|H0) can now be thought of as having two parts: P(data|μ) and P(μ|H0). The relationship between these three distributions is given by

$$P(\text{data} \mid H_0) = \int P(\text{data} \mid \mu)\,P(\mu \mid H_0)\,d\mu. \qquad [7]$$
Gönen
Similarly,

$$P(\text{data} \mid H_1) = \int\!\!\int P(\text{data} \mid \mu_1, \mu_2)\,P(\mu_1, \mu_2 \mid H_1)\,d\mu_1\,d\mu_2. \qquad [8]$$
It is easy to see that [7] and [8] can be used to compute [4]. The important consideration here is that the elements of [7] and [8] are directly available: P(data|μ) and P(data|μ1, μ2) are simply the normal distribution functions with the specified means, and P(μ) and P(μ1, μ2) are the two prior distributions. It turns out there is a simple, closed-form expression for the Bayesian t-test which can be derived from the material presented so far. Its simplicity depends heavily on a particular way of assigning the prior distributions. Section 2.2 deals with this issue and Section 2.3 presents the Bayesian t-test along with simple statistical software to calculate it.

2.2. Choice of Prior Distributions
By way of notation, let N(y|a, b) denote the normal (Gaussian) distribution function with mean a and variance b, and let χ²_d(u) denote the chi-square distribution with d df. We informally stated in Section 1 that we have two groups with normally distributed data. This can be made more precise by stating that the data are conditionally independent with Y_r | {μ_r, σ²} ∼ N(μ_r, σ²). The goal is to test the null hypothesis H0: δ = μ1 − μ2 = 0 against the two-sided alternative H1: δ ≠ 0. The first key step in thinking about the priors is to change the parameters of the problem from (μ1, μ2, σ²) to (μ, δ, σ²) by defining μ = (μ1 + μ2)/2. This is consistent with the fact that we have been using μ to denote the common mean under H0. It also has the advantage of defining a suitable μ for H1 as well, enabling us to use one set of parameters to represent the distribution of the data under either of the hypotheses. The second key step is modeling the prior information in terms of δ/σ, called the standardized effect size, rather than in terms of δ. For Jeffreys (7), one of the founding fathers of Bayesian hypothesis testing, dependence of the prior for δ on the value of σ is implicit in his assertion "from conditions of similarity, it [the mean] must depend on σ, since there is nothing in the problem except σ to give a scale for [the mean]." We will start by assigning a N(λ, σ_δ²) prior to δ/σ. Since the standardized effect size is a familiar dimensionless quantity, lending itself to easy prior modeling, choosing λ and σ_δ² should not be too difficult. For example, Cohen (8) reports that |δ/σ| values of 0.20, 0.50, and 0.80 are "small," "medium," and "large," respectively, based on a survey of studies reported in the social sciences literature. These benchmarks can be used to check whether the specifications of the prior parameters λ and σ_δ² are reasonable; a simple check based
The Bayesian t-Test and Beyond
on λ ± 3σ_δ can determine whether the prior allows unreasonably large effect sizes. The remaining parameters (μ, σ²) are assigned a standard noninformative prior, no matter whether δ = 0 or δ ≠ 0. As explained in Chapter 3, the standard noninformative prior (which also happens to be improper) is constant in μ and constant in log σ². Since these priors require no prior input we will not dwell further on this choice. It suffices to say that they ensure that the BF depends on the data only through the two-sample t-statistic. To summarize, the prior is as follows:

$$P(\delta/\sigma \mid \mu, \sigma^2, \delta \neq 0) = N(\delta/\sigma \mid \lambda, \sigma_\delta^2), \qquad [9]$$

with the nuisance parameters assigned the improper prior

$$P(\mu, \sigma^2) \propto 1/\sigma^2. \qquad [10]$$

Finally, the prior is completed by specifying the probability that H0 is true:

$$\pi_0 = P(\delta = 0), \qquad [11]$$
where π0 is often taken to be 1/2 as an "objective" value (12). However, π0 can simply be assigned by the experimenter to reflect prior belief in the null; it can be assigned to differentially penalize more complex models (7); it can be assessed from multiple comparisons considerations (7, 9); and it can be estimated using empirical Bayes methods (10). The next section provides a case study for prior assessment. It should be mentioned prominently that Jeffreys, who pioneered the Bayesian testing paradigm, derived a Bayesian test for H0: μ1 = μ2 that is also a function of the two-sample t-statistic [2]. However, his test uses an unusually complex prior that partitions the simple alternative H1: μ1 ≠ μ2 into three disjoint events depending upon a hyperparameter μ: H11: μ2 = μ ≠ μ1, H12: μ1 = μ ≠ μ2, and H13: μ1 ≠ μ2 and neither equals μ. Jeffreys further suggests prior probabilities in the ratio 1:1/4:1/4:1/8 for H0, H11, H12, and H13, respectively, adding another level of avoidable complexity. An additional concern with Jeffreys' two-sample t-test is that it does not accommodate prior information about the alternative hypothesis.

2.3. The Bayesian t-Test
For the two-sample problem with normally distributed, homoscedastic, and independent data, with prior distributions as specified in Section 2.2, the BF for testing H0: μ1 = μ2 = μ vs. H1: μ1 ≠ μ2 is

$$\mathrm{BF} = \frac{T_\nu(t \mid 0, 1)}{T_\nu(t \mid n_\delta^{1/2}\lambda,\; 1 + n_\delta\sigma_\delta^2)}. \qquad [12]$$
Here t is the pooled-variance two-sample t-statistic [2], λ and σ_δ² denote the prior mean and variance of the standardized effect size (μ1 − μ2)/σ under H1, and T_ν(·|a, b) denotes the noncentral t probability density function (pdf) having location a, scale b, and df ν (specifically, T_ν(·|a, b) is the pdf of the random variable N(a, b)/√(χ²_ν/ν), with numerator independent of denominator). The mathematical derivation of [12] is available in (11). The data enter the BF only through the pooled-variance two-sample t-statistic [2], providing a Bayesian motivation for its use. For the case where the prior mean λ of the effect size is assumed to be zero, the BF requires only the central T distribution and is calculated very simply, e.g., using a spreadsheet, as

$$\mathrm{BF} = \left[\frac{1 + t^2/\nu}{1 + t^2/\{\nu(1 + n_\delta\sigma_\delta^2)\}}\right]^{-(\nu+1)/2} (1 + n_\delta\sigma_\delta^2)^{1/2}. \qquad [13]$$
Gene expression example revisited: How would the conclusions differ if one applies the Bayesian t-test to the gene expression data from gastrointestinal stromal tumors? An easy starting point is the calculation of n_δ, which turns out to be 7.639. We then need to specify λ, the prior mean of the standardized effect size. Not knowing anything about the context, a good first guess is λ = 0, giving equal footing to either mean. What would be a reasonable value for σ_δ? This is clearly the most difficult (and perhaps the least attended) parameter. While gene expression scales are somewhat artificial, a difference of 2 or more on the log-scale is traditionally considered substantial. This suggests using log(Y1) and log(Y2) as the data, and choosing σ_δ = 1 gives us a small but nonzero chance, a priori, that the means can differ by at least two-fold on the log-scale (the probability that δ, with mean 0 and variance 1, will be greater than 2 or less than −2 is approximately 0.05). It is possible to use other values for σ_δ without resorting to a two-fold difference. If the gene expression study was powered to detect a particular difference, the same calculation can be repeated using the detectable difference. Suppose the study was powered to detect a difference of 50% (1.5-fold) on the log-scale. Then σ_δ = 0.75 gives the same probability of the means differing by at least 1.5-fold on the log-scale.
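The calibration argument above is a two-line normal tail calculation. The R sketch below reproduces it; the helper name sigma_for is hypothetical and simply inverts the tail probability for a given threshold.

```r
# Prior probability that a N(0, sigma_delta^2) effect exceeds a threshold d in absolute value
p_two      <- 2 * pnorm(-2,   mean = 0, sd = 1)      # sigma_delta = 1,    threshold 2
p_one_half <- 2 * pnorm(-1.5, mean = 0, sd = 0.75)   # sigma_delta = 0.75, threshold 1.5

# Inverse problem: the sigma_delta for which P(|effect| > d) equals a chosen probability p
sigma_for <- function(d, p) d / qnorm(1 - p / 2)
sigma_for(1.5, p_two)   # recovers 0.75
```

Both tail probabilities equal 2Φ(−2), which is about 0.046 and is what the text rounds to 0.05.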
In this case the BF turns out to be 0.36 and the posterior probability of the null is given by

$$\left[1 + \frac{\text{Prior Odds}}{0.36}\right]^{-1},$$

where, in keeping with [6], the prior odds is (1 − π)/π, the odds against the null a priori, with π the prior probability that the null is true. If one is willing to grant that the two hypotheses are equally likely before the data are collected (i.e., prior odds = 1), then the posterior probability of the null hypothesis is roughly one-fourth. While the data favor the alternative hypothesis by lifting its chances from one-half to three-fourths, the resulting posterior probability is not smoking-gun evidence to discard the null. Let us recall the findings of the frequentist analysis. Using the log-transform, t was 4.59, corresponding to a p-value of 0.00006, a highly significant finding by any practical choice of type I error. In contrast, the Bayesian test offers a much more cautious finding: the data support the alternative more than they do the null, but the conclusion that the means are different is relatively weak, with the null retaining a posterior probability of 0.26. It is tempting to directly compare 0.00006 with 0.26, but it is misleading, because the two probabilities refer to different events. Most Bayesians prefer to choose the hypothesis that has the higher posterior probability, so in this analysis the alternative hypothesis, with a probability of 0.74, will be preferred over the null. A common criticism of this Bayesian analysis would be the impact of the choice of prior on the Bayesian results. The different results from the frequentist and Bayesian methods, not to mention the cavalier way we chose the value for σ_δ, call for a sensitivity analysis. Here is a table that can serve as a template for analyses reporting Bayesian results (Table 4.1); alternatively, a display like Fig. 4.1 can be used to convey the sensitivity analyses.
Table 4.1. Sensitivity analysis for the Bayesian t-test

σ_δ      0.5    1      1.5    2.0    2.5    3.0
BF       0.46   0.36   0.31   0.28   0.26   0.25
P(H0)    0.32   0.26   0.24   0.22   0.21   0.20
While there is some sensitivity to the specification of σ_δ, the conclusions are very similar and remain quite different from the frequentist analysis. It seems highly unlikely that the culprit is the prior for σ_δ. Of course one can vary λ or even π, but in the context of this analysis other values for these two prior parameters hardly make sense. A possible explanation for the differential results is the so-called irreconcilability of p-values and posterior probabilities (12). It turns out that disagreements between p-values and posterior probabilities are very common when testing a point null hypothesis (when the null hypothesis is a single point, such as δ = 0). A justification of this requires technical arguments that are beyond the scope of this chapter, but it is important to bear in mind when one is trying to compare the findings of a Bayesian analysis with those of a frequentist one.

Fig. 4.1. Sensitivity analysis for the Bayesian t-test. Horizontal axis is σ_δ (axis label: Prior Variance for the Effect Size) and the vertical axis (axis label: Posterior Probability or the Bayes Factor) is either the Bayes factor (circles) or the posterior probability of the null (triangles).
3. Simultaneous t-Tests

An important aspect of most genetic data is its multivariable nature. In a gene expression study we will often have more than one gene of interest to test. It is widely recognized in frequentist statistics that performing multiple tests simultaneously inflates the nominal type I error, so if each test is performed at the 5% level, the combined type I error probability will be higher than 5%. There is substantial literature on obtaining tests so that the combined error rate is controlled at 5%. Our goal is to provide the Bayesian counterpart for simultaneous evaluation of multiple t-tests, therefore we will not discuss the frequentist approaches further. While the literature on this topic has multiplied over the past few years (13–16), we will mention only one important concept from
this literature. Most of the frequentist discussions center on the type of error to protect. The most conservative route is the so-called family-wise error rate (FWER), which defines type I error as the probability of making at least one rejection if all the nulls are true; see Chapters 1 and 5 for more information on FWER. Clearly FWER should be used if it is plausible that all the nulls are true. As we noted in the previous section, the development of a Bayesian hypothesis test is not concerned with type I and type II errors. Indeed it is often said that Bayesians do not need any multiplicity adjustments. This is not entirely correct, and the technicalities behind these arguments require the use of statistical decision theory. A good starting point for this literature is Berger's book (2), although the Bayesian section of Miller's book (13) is also very helpful. There is also a practical side of performing multiple tests, based on the assignment of prior probabilities, which is more relevant for our development. Suppose we have k genes and, continuing with the theme of the GIST example, we want to see if their expressions differ between tumors originating from the stomach and those originating from the small bowel. In the previous section we chose the prior probability of the null hypothesis P(H) = 0.5 without much discussion. With several tests, to be called H1, . . . , Hk, can we find a similar "automatic" choice of prior? Suppose we set P(Hj) = 0.5 for all j, which seems to be a natural starting point. What is, then, the probability that all nulls are true, that is, P(all) = P(H1 = H2 = · · · = Hk = 1)? Supposing for the time being that the hypotheses are independent, the probability of all nulls being true is 0.5^k. As k increases this quantity quickly approaches 0. Table 4.2 gives P(all) for some selected values of k.
Table 4.2. Probability that all the k nulls are true when they are independent a priori

k        1     2      3       4        5        10       100
P(all)   0.5   0.25   0.125   0.0625   0.0313   0.0010   < 0.0001
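The entries of Table 4.2 follow directly from P(all) = 0.5^k under independence, and can be recomputed in one line of R. The vector names below are illustrative.

```r
# Probability that all k independent nulls are true when each has prior probability 0.5
k <- c(1, 2, 3, 4, 5, 10, 100)
p_all <- 0.5^k
data.frame(k = k, p_all = signif(p_all, 3))
```

Replacing 0.5 by another common prior probability shows how quickly the joint probability of all nulls collapses as k grows.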
[...] P(Hj) > 0 for all j. We also want a flexible way to specify nonzero Corr(Hi, Hj), the correlations between the null hypotheses. It will be helpful to consider two independent latent vectors u and v:

$$u \sim \mathrm{MVN}_k(0, \Sigma_1), \qquad [14]$$

$$v \sim \mathrm{MVN}_k(\lambda, \Sigma_2), \qquad [15]$$

and define the prior for δ as follows:

$$\delta_j = \begin{cases} 0 & \text{if } u_j < c_j, \\ v_j & \text{otherwise,} \end{cases}$$

where c_j is a constant selected to give the desired prior probabilities π_j. Hence the elements of u determine the a priori probability of the corresponding δ_j being zero (or Hj being true). In a particular realization of δ, some δ_j's will be given the value of 0 by this procedure and others will take on the values supplied by v. The resulting prior is a mixture of two multivariate normals which has the following desirable properties:
(i) P(Hj) > 0 for all j.
(ii) Corr(Hi, Hj), i ≠ j, is given by the ij-th element of Σ1.
(iii) The correlation between nonzero effects δ_i and δ_j, i ≠ j, is determined by the ij-th element of Σ2.
It is not difficult to derive the posterior probabilities for this general setup (see next section and also (17)), but we find it helpful to consider a model further simplified as follows. Assume λ = (λ, . . . , λ), where λ represents the "anticipated effect size." We will also assume that Σ1 = Σ and Σ2 = σ_δ² Σ, where Σ is a correlation matrix with 1 on the diagonal and ρ on the off-diagonals, with −1/(k − 1) < ρ < 1 (this requirement ensures that the resulting matrix is positive-definite, a necessary condition for a covariance or correlation matrix). Under this simplified model both correlations, in (ii) and in (iii), equal ρ. This is not unreasonable for most settings; many times there is no reason to expect that the correlation structure between the endpoints will differ depending on the existence of an effect. Before proceeding, an analyst needs to input values for λ, ρ, and σ_δ². One can think of u as representing the structure corresponding to the null hypothesis and v as representing the structure corresponding to the alternative hypothesis. Hence the choice of λ, ρ, and σ_δ² can be simplified and based on the parameters of the study design. We will give examples of this in Section 4. At this point it is instructive to revisit Table 4.2 by a simple implementation of this procedure. Using the simplified version where all the correlations between nulls are constant at ρ, and specifying σ_δ to be 1 as in the univariate gene expression example, we can compute the probability that all nulls are true if each null has a prior probability of 0.5, corresponding to c_j = 0 for all j. All we need for this is the vector u and whether its elements are positive or not. Since u has mean 0, each null hypothesis has a prior probability of 0.5.
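A simulation of this kind can be sketched in base R as follows. This is not the chapter's Appendix code: the function name p_all_sim is hypothetical, the construction of equicorrelated normals through a shared component is an assumption that works only for 0 ≤ ρ ≤ 1, and a null is counted as true when the corresponding element of u is negative (c_j = 0).

```r
set.seed(5)

# Estimate P(all k nulls true) when u ~ MVN_k(0, Sigma) with unit variances and
# constant off-diagonal correlation rho, and H_j is true when u_j < 0
p_all_sim <- function(k, rho, reps = 10000) {
  stopifnot(rho >= 0, rho <= 1)
  z0 <- rnorm(reps)                               # component shared by all k elements
  z  <- matrix(rnorm(reps * k), nrow = reps)      # element-specific components
  u  <- sqrt(rho) * z0 + sqrt(1 - rho) * z        # z0[i] is added to every entry of row i
  mean(apply(u < 0, 1, all))                      # fraction of replicates with all u_j < 0
}

p_indep <- p_all_sim(k = 4, rho = 0)      # independence: close to 0.5^4
p_k2    <- p_all_sim(k = 2, rho = 0.5)    # k = 2: exact value is 1/4 + asin(rho) / (2 * pi)
p_full  <- p_all_sim(k = 10, rho = 1)     # perfect correlation: close to 0.5
```

Evaluating p_all_sim over a grid of rho and k values gives the prior probability that all nulls are true for each candidate specification, which is the quantity an analyst may want to inspect before fixing ρ and c_j.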
By choosing 0 to have a constant correlation of ρ in the off-diagonals we can compute the probability that all nulls are true. While this can be done analytically it is easy and instructive to simulate. We first generate 10, 000 replicates of u and then count the number of times when all the elements are negative. A simple R code for this is given in the Appendix. Figure 4.2 displays P(all) as a function of ρ and k. As expected, increasing ρ increases P(all) and in the limit ρ = 1, P(all) = 0.5 = P(each). The effect of k becomes less pronounced as ρ increases. The results are worth considering seriously: when the hypotheses are moderately correlated (ρ ≤ 0.5) the probability that all nulls are true remain below one-sixth for k ≥ 4. Only when ρ is over 0.5 the chances of nulls being true start approaching 0.5, yet even when ρ = 0.9, a value that seems unlikely in practice, the probability that all nulls are true is about one-third for moderate to large values of k. I strongly recommend going through this exercise before choosing a value of ρ and cj . In fact it seems equally reasonable, if
not more so, to strive for the probability of all nulls being true to be 0.5 and let the value of ρ decide $c_j$.

Fig. 4.2. Prior probability that all the k nulls are true (vertical axis) as a function of the prior correlation ρ (horizontal axis) and k. Different values of k correspond to different lines, with the highest line representing k = 2 and the lowest k = 10.

3.2. Computing the Posterior Probabilities for Simultaneous t-Tests
This section presents the technical details that are given by (17). It is not essential for a practical understanding and can be skipped.

Suppose the data are k-dimensional random vectors, where k is the number of endpoints: $Y_{ir} \mid (\mu_i, \Sigma) \sim N(\mu_i, \Sigma)$, for $i = 1, 2$ (treatment regimens) and $r = 1, \ldots, n_i$ (replications). Let

$$\begin{aligned}
D &= \bar{Y}_1 - \bar{Y}_2, \\
M &= (n_1 \bar{Y}_1 + n_2 \bar{Y}_2)/n_{\cdot}, \\
C &= \sum_i \sum_r (Y_{ir} - \bar{Y}_i)(Y_{ir} - \bar{Y}_i)' = \nu S, \\
\delta &= \mu_1 - \mu_2, \\
\mu &= (n_1 \mu_1 + n_2 \mu_2)/n_{\cdot}.
\end{aligned}$$

Note that (D, M, C) are minimal sufficient complete statistics, distributed as

$$P(D, M, C \mid \mu_1, \mu_2, \Sigma) = N(D \mid \delta, n_\delta^{-1}\Sigma) \times N(M \mid \mu, n_{\cdot}^{-1}\Sigma) \times W_\nu(C \mid \Sigma),$$

where $W_\nu(\cdot \mid \cdot)$ denotes the Wishart distribution with $\nu$ df, and where $n_\delta^{-1} = n_1^{-1} + n_2^{-1}$, $n_{\cdot} = n_1 + n_2$, and $\nu = n_{\cdot} - 2$. Define indicator variables $H_j$, for which $H_j = 0$ when $\delta_j = 0$ and $H_j = 1$ when $\delta_j \neq 0$. Let $H = \mathrm{diag}(H_j)$ and $\sigma = \mathrm{diag}^{1/2}(\Sigma)$.
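We consider the following hierarchical prior:

$$P(\delta \mid \mu, \Sigma, H) = P(\delta \mid \Sigma, H) = N(\delta \mid H\sigma m_\xi,\ \sigma H S_\xi H \sigma), \qquad [16]$$

$$P(\mu, \Sigma \mid H) \propto |\Sigma|^{-(k+1)/2}, \qquad [17]$$

$$P(H = h) = \pi_h. \qquad [18]$$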
Equation [16] is best viewed as a $N(H m_\xi, H S_\xi H)$ prior for the vector of effect sizes $\xi = \sigma^{-1}\delta$. It requires a prior mean $m_{\xi j}$ and prior variance $s_{\xi j}^2$ for each effect size, but correlations between the effect sizes also must be specified to obtain the full prior covariance matrix $S_\xi$. The prior [16] can be simplified considerably if one is willing to assume exchangeability of the nonzero $\xi_j$. At the second level [17] of the hierarchy we use a flat prior for $\mu$ and a standard noninformative prior for $\Sigma$. At the third stage [18] we specify the prior probability that model h is true. Since there are $2^k$ possible models, one might generically assume $P(H = h) \equiv 2^{-k}$ (18), as would result from independent $H_j$ with $\pi_j = P(H_j = 0) \equiv 0.5$. However, this implies in particular that the probability that all nulls are true is $2^{-k}$, which may be unreasonably low. Independence may be unreasonable as well. Our third stage hierarchy [18] takes all these into account using the prior inputs $\pi_j$, $j = 1, \ldots, k$, and $\rho_H$, where $\rho_H$ is a specified tetrachoric correlation matrix of H. Given these prior specifications, the values $\pi_h$ can be calculated from the multivariate normal distribution either analytically, in simple cases, or numerically using algorithms (19). As is the case for [16], determining [18] is simplified by assuming exchangeability; in this case only two parameters (the prior null probability and the tetrachoric correlation) must be specified.

The parameters $\mu$ and $\delta$ can be integrated analytically, leaving

$$P(D, M, C \mid H) \propto \int N(D \mid \sigma H m_\xi,\ n_\delta^{-1}\Sigma + \sigma H S_\xi H \sigma)\, |C|^{(\nu-k-1)/2}\, |\Sigma|^{-(\nu+k+1)/2} \exp(-0.5\, \mathrm{tr}\, \Sigma^{-1} C)\, d\Sigma. \qquad [19]$$

While analytic evaluation of this expression seems impossible, it can be simulated easily and stably by averaging $N(D \mid \sigma H m_\xi,\ n_\delta^{-1}\Sigma + \sigma H S_\xi H \sigma)$ over independent draws of $\Sigma$ from the inverse Wishart distribution with parameter matrix $C^{-1}$ and df $\nu$. An instructive alternative representation of the multivariate likelihood is

$$N(D \mid \sigma H m_\xi,\ n_\delta^{-1}\Sigma + \sigma H S_\xi H \sigma) = n_\delta^{k/2}\, |\sigma|^{-1}\, N(z \mid n_\delta^{1/2} H m_\xi,\ \rho_z + n_\delta H S_\xi H),$$

where $Z = n_\delta^{1/2} \sigma^{-1} D$ is the vector of two-sample z-statistics and $\rho_z = \sigma^{-1}\Sigma\sigma^{-1}$ is its correlation matrix. Thus, for large $\nu$, where the inverse Wishart distribution of $\Sigma$ is concentrated around its mean $C/(\nu - k - 1)$, $P(D, M, C \mid H)$ becomes closer in proportionality to $N(\hat{z} \mid n_\delta^{1/2} H m_\xi,\ \rho_{\hat{z}} + n_\delta H S_\xi H)$. This yields an approximate method for calculating posterior probabilities (justified in Theorem 2 of the Appendix in reference (17)) that has been previously described (20), and for which software is available (16).

Letting $\mathcal{H}$ denote the set containing the $2^k$ k-tuples h, we now have

$$\pi_{h \mid Y} = P(H = h \mid D, M, C) = \frac{\pi_h\, P(D, M, C \mid H = h)}{\sum_{h' \in \mathcal{H}} \pi_{h'}\, P(D, M, C \mid H = h')}.$$

Thus $P(H_j = 0 \mid D, M, C) = \sum_{h \in \mathcal{H},\, h_j = 0} \pi_{h \mid Y}$, which is the sum of the $2^{k-1}$ posterior probabilities of models for which $\delta_j$ is zero. In the univariate case, all integrals can be evaluated analytically and one recovers the Bayesian t-test described in Section 2.
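To make the large-$\nu$ approximation concrete, the sketch below evaluates the four posterior model probabilities for k = 2 endpoints from a pair of z-statistics. It is only an illustration under stated assumptions: Python is used for convenience, the prior on effect sizes is taken to be exchangeable (common mean `m_xi`, variance `s2_xi`, and correlation `rho_xi`), and the function names and all numerical inputs (the z-values, `rho_z`, and the uniform prior over models) are hypothetical.

```python
import math
from itertools import product


def bvn_pdf(z, mean, cov):
    """Bivariate normal density at z; cov is a 2x2 nested list."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    d0, d1 = z[0] - mean[0], z[1] - mean[1]
    quad = (d * d0 * d0 - (b + c) * d0 * d1 + a * d1 * d1) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def posterior_models(z, rho_z, n_delta, m_xi, s2_xi, rho_xi, prior):
    """Approximate P(H = h | data) for h in {0, 1}^2 from z-statistics."""
    weights = {}
    for h in product((0, 1), repeat=2):
        # Mean n_delta^(1/2) * H * m_xi and covariance rho_z + n_delta * H S_xi H.
        mean = [math.sqrt(n_delta) * h[j] * m_xi for j in range(2)]
        cov = [[(1.0 if i == j else rho_z)
                + n_delta * h[i] * h[j] * s2_xi * (1.0 if i == j else rho_xi)
                for j in range(2)] for i in range(2)]
        weights[h] = prior[h] * bvn_pdf(z, mean, cov)
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def marginal_null(posterior, j):
    """P(H_j = 0 | data): sum over models whose j-th indicator (0-based) is zero."""
    return sum(p for h, p in posterior.items() if h[j] == 0)


if __name__ == "__main__":
    n_delta = 1.0 / (1.0 / 11 + 1.0 / 23)  # harmonic-type sample size for n1 = 11, n2 = 23
    prior = {h: 0.25 for h in product((0, 1), repeat=2)}
    post = posterior_models((4.0, 0.0), 0.2, n_delta, 0.0, 1.0, 0.3, prior)
    for h, p in sorted(post.items()):
        print(h, round(p, 4))
    print("P(first null true | data) =", round(marginal_null(post, 0), 4))
```

With a large first z-statistic and a second one at zero, the model in which only the first null is false should receive most of the posterior mass, which is the behavior the marginal sum in the text is designed to summarize.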
4. Simultaneous Testing in the GIST Microarray Example
We now return to the microarray example introduced in Section 1. Signal transduction pathways have long been recognized as key elements of various cellular processes such as cell proliferation, apoptosis, and gene regulation. As such, they could play important roles in the formation of malignancies. Four genes from a certain signal transduction pathway are chosen for this analysis, primarily based on previously reported differential expression. These genes are KIAA0334, DDR2, RAMP2, and CIC, and the grouping variable of interest, once more, is the location of the primary tumor (small bowel vs. stomach). The relevant summary statistics for the analysis are calculated as

$$D = \begin{pmatrix} -0.2578 \\ 0.1379 \\ 1.1359 \\ 1.5772 \end{pmatrix}, \qquad C = \begin{pmatrix} 0.341 & 0.018 & 0.131 & 0.021 \\ 0.018 & 0.271 & 0.068 & -0.033 \\ 0.131 & 0.068 & 0.544 & 0.038 \\ 0.021 & -0.033 & 0.038 & 0.124 \end{pmatrix}.$$

[These summary statistics are calculated on the logarithm of the gene expression.] We also note that $n_1 = 11$ and $n_2 = 23$.

The first order of business is to line up the H matrix. With four hypotheses, H has 16 rows and 4 columns, depicted in the first four columns of Table 4.3. Each row of H represents a possible combination of the true states of nature for the four hypotheses at hand. Since the likelihood expression [19] is conditional on H, our likelihood function takes on 16 values, one for each row of H. These values were computed using Monte Carlo integration on $\Sigma$ and are presented in Table 4.3 (column L, for likelihood). The
code to perform this analysis is available at mskcc.org/gonenm.

Table 4.3. All the 16 possible states of nature for the four null hypotheses from Section 4. L is the likelihood for each possible combination (up to a proportionality constant), prior is the prior probability assigned for that combination, and posterior is the resulting posterior probability

h1   h2   h3   h4   L       Prior   Posterior
0    0    0    0    0.953   0.16    0.494
0    0    0    1    0.131   0.27    0.114
0    0    1    0    0.128   0.27    0.112
0    0    1    1    0.018   0.14    0.008
0    1    0    0    0.129   0.27    0.112
0    1    0    1    0.018   0.14    0.008
0    1    1    0    0.018   0.14    0.008
0    1    1    1    0.003   0.27    0.002
1    0    0    0    0.132   0.27    0.114
1    0    0    1    0.018   0.14    0.008
1    0    1    0    0.018   0.14    0.008
1    0    1    1    0.003   0.27    0.002
1    1    0    0    0.018   0.14    0.008
1    1    0    1    0.003   0.27    0.002
1    1    1    0    0.003   0.27    0.002
1    1    1    1    0.001   0.16    0.002

We also need to generate prior probabilities for each row of this table, which requires some care. We continue to center the prior at 0, as we did in Section 2 for testing a single gene, since there is no a priori expectation of the direction of the effect (whether the expression will be higher in the stomach or in the small bowel). Using similar considerations we also take $\sigma_\delta = 1$. The prior correlation between effect sizes is somewhat harder to model. We assumed that if the effect sizes were correlated, then the hypotheses $H_j$ also were correlated; and we assumed that the correlation between effect sizes was equal to the correlation between hypotheses. To model correlation among hypotheses, we used latent multivariate normal variables as follows: we let $U \sim N(0, \Sigma_\rho)$, where $\Sigma_\rho$ is the compound symmetric correlation matrix with parameter $\rho$, and we defined $c = \Phi^{-1}(\pi)$. If $H_j = I(U_j \geq c)$, then the $\{H_j\}$ are exchangeable, with $P(H_j = 0) \equiv \pi$ and tetrachoric correlation $\rho$. Exchangeability implies
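$$P(H = h) = E\left[\left\{\Phi\!\left(\frac{c - \sqrt{\rho}\,U}{\sqrt{1-\rho}}\right)\right\}^{k - \sum_j h_j} \left\{1 - \Phi\!\left(\frac{c - \sqrt{\rho}\,U}{\sqrt{1-\rho}}\right)\right\}^{\sum_j h_j}\right], \qquad [20]$$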
where the expectation is taken with respect to a standard normal U. This expression is easily evaluated using one-dimensional integration. An example in R using the function integrate can be found at mskcc.org/gonenm.

The correlation $\rho$ is suggested by the discussion in Section 3.1. Two parameters of particular interest are $\pi = P(H_j = 0)$ and $\pi_0 = P(H = 0)$. While these four genes are selected from a pathway which is known to affect malignant processes, there is no reason to think that these particular genes are differentially expressed in stomach tumors vs. small bowel tumors. For this reason we take $\pi_0 = P(H = 0) = 0.5$. Using exchangeability once more, we set $\pi = P(H_j = 0) = 0.8$ for the four genes. These selections imply a value of $\rho = 0.31$ (see Section 3.1 for computing $\rho$ when $\pi$ and $\pi_0$ are given). Now we can use [20] to generate the prior probability for each row of Table 4.3.

Once the likelihood and the prior are determined, the posterior is trivially calculated from Bayes' theorem:

$$P(H = h \mid D, M, C) = \frac{P(D, M, C \mid H = h)\, P(H = h)}{\sum_{h'} P(D, M, C \mid H = h')\, P(H = h')},$$

which are presented in the last column of Table 4.3, labeled as the posterior. It is clear that the most likely state of nature is that all nulls are true (posterior probability of 0.494). There is essentially no support for two or more nulls (in any combination) being false; all of these states have posterior probabilities less than 0.01. The probability that exactly one of the nulls is false is about 0.45 (add rows 2, 3, 5, and 9). The data, however, do not favor any one of the nulls over another in the case that one of them is false.

Table 4.3 gives us the joint posterior probabilities of the nulls for all possible combinations. It could be more useful to look at the marginal probabilities, as they help us answer the question for a given gene without any reference to the others. These can also be deduced from Table 4.3. For the first gene, for example, rows 1 through 8 represent the cases where the first null is true, namely that the gene KIAA0334 is not differentially expressed between the stomach GISTs and small bowel GISTs. Adding the posterior probabilities of these rows gives us P(h1 = 0) = 0.858, so we can safely conclude that the expression of KIAA0334 is not significantly different between the sites. Going through a similar exercise one gets P(h2 = 0) = 0.860, P(h3 = 0) = 0.860, and P(h4 = 0) = 0.858. Therefore we conclude that none of the genes are differentially expressed.
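The two calculations in this section, the prior probabilities from [20] and the marginal posterior probabilities from Table 4.3, can be checked numerically. The following is a minimal sketch in Python (the chapter's own R example is at the URL given above and is not reproduced here). It assumes $H_j = I(U_j \geq c)$ with $c = \Phi^{-1}(\pi)$, so that conditional on U each null is true with probability $\Phi((c - \sqrt{\rho}\,U)/\sqrt{1-\rho})$. The trapezoidal grid, the function names, and the constant `POSTERIOR` (copied from the last column of Table 4.3 as printed) are choices made for this illustration.

```python
from itertools import product
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Posterior column of Table 4.3, rows ordered as in the table (0000, 0001, ..., 1111).
POSTERIOR = [0.494, 0.114, 0.112, 0.008, 0.112, 0.008, 0.008, 0.002,
             0.114, 0.008, 0.008, 0.002, 0.008, 0.002, 0.002, 0.002]


def prior_prob(h, pi_null, rho, n_grid=2001, half_width=8.0):
    """P(H = h) under the exchangeable latent-normal prior (0 <= rho < 1)."""
    k, n_false = len(h), sum(h)
    c = STD_NORMAL.inv_cdf(pi_null)
    step = 2.0 * half_width / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        u = -half_width + i * step
        # Conditional probability that a single null is true, given U = u.
        p_null = STD_NORMAL.cdf((c - rho ** 0.5 * u) / (1.0 - rho) ** 0.5)
        term = p_null ** (k - n_false) * (1.0 - p_null) ** n_false * STD_NORMAL.pdf(u)
        total += term * (0.5 if i in (0, n_grid - 1) else 1.0)  # trapezoidal weights
    return total * step


def marginal_null_probs(posterior, k=4):
    """Marginal P(h_j = 0 | data), j = 1..k, from joint posterior model probabilities."""
    models = list(product((0, 1), repeat=k))
    return [sum(p for h, p in zip(models, posterior) if h[j] == 0) for j in range(k)]


if __name__ == "__main__":
    print(round(prior_prob((0, 0, 0, 0), 0.8, 0.31), 3))  # prior that all four nulls are true
    print([round(m, 3) for m in marginal_null_probs(POSTERIOR)])
```

The first printed value is the prior probability that all four nulls are true for π = 0.8 and ρ = 0.31, to be compared with the target $\pi_0 = 0.5$; the second line reproduces the marginal sums quoted in the text.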
5. Conclusions

In this chapter we reviewed a Bayesian approach to a very common problem in applied statistics, namely the comparison of group means. Chapter 3 had already listed various reasons why a Bayesian approach can be beneficial over traditional approaches. The particular Bayesian t-test reviewed here has all those benefits, in addition to being closely related to the traditional t-test. It also has the advantage of being naturally extended to simultaneous t-tests. Simultaneous testing of hypotheses is a contentious and difficult problem. While this Bayesian approach does not alleviate the analytic difficulties, it clarifies the need for adjustment (a major source of confusion in frequentist multiple testing methods) and produces an output, such as Table 4.3, that is simple to communicate and explain. A particular disadvantage is that this is not an automatic procedure that can be extended to very high dimensions. It requires careful and thoughtful prior modeling of the anticipated effect sizes and directions, as well as correlations between the hypotheses. Although we chose our example from a microarray study, this approach would not be useful with thousands of genes.
Appendix

Calculation of the Bayesian t-Test. Calculation of [12] requires evaluation of the noncentral T pdf with a general scale parameter. Since software typically provides the pdf of the noncentral t having scale parameter 1.0, a simple modification is needed for the general case:

$$T_\nu(t \mid a, b) = T_\nu(t/b^{1/2} \mid a/b^{1/2}, 1)/b^{1/2}.$$

Thus, for example, in SAS the BF can be computed using

    pdf('T',t,n1+n2-2)/(pdf('T',t/sqrt(postv),n1+n2-2,nc)/sqrt(postv));    [21]

or in R or Splus, using the density function dt,

    dt(t,n1+n2-2)/(dt(t/sqrt(postv),n1+n2-2,nc)/sqrt(postv))    [22]

where t is the value of the two-sample t-statistic, $\mathrm{postv} = 1 + n_\delta \sigma_\delta^2$, and $\mathrm{nc} = n_\delta^{1/2} \lambda/(1 + n_\delta \sigma_\delta^2)^{1/2}$.

Calculation of the Probability that All Nulls Are True. The following R code computes the probability that all nulls are true. This example sets ρ = 0.1; other cases can be obtained by simply re-running the code from the point where ρ is defined, setting it to the desired value. The mvtnorm library is required to generate random replicates of multivariate normal vectors using the rmvnorm function. This example uses 10,000 replicates, which gives accurate estimates for all practical purposes. If a value of k other than 5 is needed, then the three occurrences of 5 (twice in the definition of sigmat and once in the call to rmvnorm) need to be replaced by the desired value of k. The output will be a (k + 1)-dimensional vector of probabilities. The jth element of this vector is the probability that j nulls are false, so the first element (for j = 0) is the desired probability.

    library(mvtnorm)
    Jmat

0) and those with zero ($\hat{\alpha}_i = 0$). If a data point falls outside the margin, $y_i(\hat{\beta}' x_i + \hat{\beta}_0) > 1$, then the corresponding Lagrange multiplier must be $\hat{\alpha}_i = 0$, and thus it plays no role in determining $\hat{\beta}$. On the other hand, the attribute vectors of the data points with $\hat{\alpha}_i > 0$ expand $\hat{\beta}$, and such data points are called the support vectors. The proportion of the support vectors depends on λ, but typically for a range of values of λ, only a fraction of the data points are support vectors. In this sense, the SVM
Lee
solution admits a sparse expression in terms of the data points. This sparsity is due to the singularity of the hinge loss at 1. A simple analogy of the sparsity can be made to median regression with the absolute deviation loss, which has a singular point at 0. For classification of a new point x, the following linear discriminant function is used:

$$\hat{f}_\lambda(x) = \hat{\beta}' x + \hat{\beta}_0 = \sum_{i:\, \hat{\alpha}_i > 0} \hat{\alpha}_i y_i x_i' x + \hat{\beta}_0. \qquad [7]$$
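As a small illustration of this sparsity, the sketch below evaluates the discriminant function using only the support vectors. It is written in Python for convenience, and the toy data points, the multipliers, and the intercept are hypothetical values chosen so that the two support vectors sit exactly on the margin; they are not the output of a fitted SVM routine.

```python
def linear_discriminant(x, support_vectors, beta0):
    """Evaluate f(x) = sum_i alpha_i * y_i * <x_i, x> + beta0 over support vectors."""
    total = beta0
    for alpha, y, x_i in support_vectors:
        total += alpha * y * sum(a * b for a, b in zip(x_i, x))
    return total


# (alpha_i, y_i, x_i) for the points with alpha_i > 0 only (hypothetical values).
SUPPORT_VECTORS = [(0.25, +1, (1.0, 1.0)), (0.25, -1, (-1.0, -1.0))]
BETA0 = 0.0

if __name__ == "__main__":
    for point in [(1.0, 1.0), (-1.0, -1.0), (3.0, 2.0)]:
        value = linear_discriminant(point, SUPPORT_VECTORS, BETA0)
        print(point, value, "class", 1 if value > 0 else -1)
```

In this toy configuration a point such as (3, 2) lies well outside the margin, so it would carry a zero multiplier and is not needed to evaluate the rule.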
Note that the final form of $\hat{f}_\lambda$ does not depend on the dimensionality of x explicitly but depends on the inner products of $x_i$ and x as in the dual problem. This fact enables construction of hyperplanes even in infinite-dimensional Hilbert spaces (p. 406, (4)). In addition, [7] shows that all the information necessary for discrimination is contained in the support vectors. As a consequence, it affords efficient data reduction and fast evaluation at the testing phase.

2.4. Nonlinear Generalization
In general, hyperplanes in the input space may not be sufficiently flexible to attain the smallest possible error rate for a given problem. As noted earlier, the linear SVM solution and the prediction of a new case x depend on the $x_i$'s only through the inner products $x_i' x_j$ and $x_i' x$. This fact leads to a straightforward generalization of the linear SVM to the nonlinear case by taking a basis expansion. The main idea of the nonlinear extension is to map the data in the original input space to a feature space and find the hyperplane with a large margin in the feature space. For an enlarged feature space, consider transformations of x, say, $\phi_m(x)$, $m = 1, \ldots, M$. Let $\Phi(x) := (\phi_1(x), \ldots, \phi_M(x))'$ be the so-called feature mapping from $R^p$ to a higher dimensional feature space, which can even be infinite dimensional. Then by replacing the inner product $x_i' x_j$ with $\Phi(x_i)' \Phi(x_j)$, the formulation of the linear SVM can be easily extended.

For instance, suppose the input space is $R^2$ and $x = (x_1, x_2)'$. Define $\Phi: R^2 \to R^3$ as $\Phi(x) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2)'$. Then the mapping gives a new dot product in the feature space

$$\Phi(x)' \Phi(t) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2)(t_1^2, t_2^2, \sqrt{2} t_1 t_2)' = (x_1 t_1 + x_2 t_2)^2 = (x' t)^2.$$

In fact, for this generalization to work, the feature mapping does not need to be explicit. Specification of the bivariate function $K(x, t) := \Phi(x)' \Phi(t)$ would suffice. With K, the nonlinear discriminant function is then given as $\hat{f}_\lambda(x) = \sum_{i=1}^n \hat{\alpha}_i y_i K(x_i, x) + \hat{\beta}_0$, which is in the span of $K(x_i, x)$, $i = 1, \ldots, n$. So, the shape
Support Vector Machines for Classification
of the classification boundary $\{x \in R^p : \hat{f}_\lambda(x) = 0\}$ is determined by K. From the property of the dot product, it is clear that such a bivariate function is non-negative definite. Replacing the Euclidean inner product in a linear method with a non-negative-definite bivariate function K(x, t), known as a kernel function, to obtain its nonlinear generalization is often referred to as the 'kernel trick' in machine learning. The only condition for a kernel to be valid is that it is a symmetric non-negative (semi-positive)-definite function: for every $N \in \mathbb{N}$, $a_i \in R$, and $z_i \in R^p$ ($i = 1, \ldots, N$), $\sum_{i,j}^{N} a_i a_j K(z_i, z_j) \geq 0$. In other words, $K_N := [K(z_i, z_j)]$ is a non-negative-definite matrix. Some kernels in common use are polynomial kernels of degree d, $K(x, t) = (1 + x' t)^d$ or $(x' t)^d$ for some positive integer d, and the radial basis (or Gaussian) kernel, $K(x, t) = \exp(-\|x - t\|^2/2\sigma^2)$ for $\sigma > 0$.

It turns out that this generalization of the linear SVM is closely linked to the function estimation procedure known as the reproducing kernel Hilbert space (RKHS) method in statistics (16, 17). The theory behind the RKHS methods, or kernel methods in short, provides a unified view of smoothing splines, a classical example of the RKHS methods for nonparametric regression, and the kernelized SVM. The connection allows more abstract treatment of the SVM, offering a different perspective on the methodology, in particular, the nonlinear extension.

2.5. Kernel Methods
Kernel methods can be viewed as a method of regularization in a function space characterized by a kernel. A brief description of the general framework for the regularization method is given here for advanced readers in order to elucidate the connection, to show how seamlessly the SVM sits in the framework, and to broaden the scope of its applicability in a wide range of problems. Consider a Hilbert space (complete inner product space) $\mathcal{H}$ of real-valued functions defined on a domain $\mathcal{X}$ (not necessarily $R^p$), with an inner product $\langle f, g \rangle_{\mathcal{H}}$ for $f, g \in \mathcal{H}$. A Hilbert space is an RKHS if there is a kernel function (called reproducing kernel) $K(\cdot, \cdot): \mathcal{X}^2 \to R$ such that (i) $K(x, \cdot) \in \mathcal{H}$ for every $x \in \mathcal{X}$, and (ii) $\langle K(x, \cdot), f(\cdot) \rangle_{\mathcal{H}} = f(x)$ for every $f \in \mathcal{H}$ and $x \in \mathcal{X}$. The second condition is called the reproducing property for the obvious reason that K reproduces every f in $\mathcal{H}$. Let $K_x(t) := K(x, t)$ for a fixed x. Then the reproducing property gives a useful identity that $K(x, t) = \langle K_x(\cdot), K_t(\cdot) \rangle_{\mathcal{H}}$. Consequently, reproducing kernels are non-negative definite. For a comprehensive treatment of the RKHS, see (18). Conversely, by the Moore–Aronszajn theorem, for every non-negative-definite function K(x, t) on $\mathcal{X}$, there corresponds a unique RKHS $\mathcal{H}_K$ that
has K(x, t) as its reproducing kernel. So, non-negative definiteness is the defining property of kernels.

Now, consider a regularization method in the RKHS $\mathcal{H}_K$ with reproducing kernel K:

$$\min_{f \in \{1\} \oplus \mathcal{H}_K} \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \lambda \|h\|^2_{\mathcal{H}_K}, \qquad [8]$$

where $f(x) = \beta_0 + h(x)$ with $h \in \mathcal{H}_K$ and the penalty J(f) is given by $\|h\|^2_{\mathcal{H}_K}$. In general, the null space can be extended to a larger linear space than {1}. As an example, take $\mathcal{X} = R^p$ and $\mathcal{H}_K = \{h(x) = \beta' x \mid \beta \in R^p\}$ with $K(x, t) = x' t$. For $h_1(x) = \beta_1' x$ and $h_2(x) = \beta_2' x \in \mathcal{H}_K$, $\langle h_1, h_2 \rangle_{\mathcal{H}_K} = \beta_1' \beta_2$. Then for $h(x) = \beta' x$, $\|h\|^2_{\mathcal{H}_K} = \|\beta' x\|^2_{\mathcal{H}_K} = \|\beta\|^2$. Taking $f(x) = \beta_0 + \beta' x$ and the hinge loss $L(f(x), y) = (1 - y f(x))_+$ gives the linear SVM as a regularization method in $\mathcal{H}_K$. So, encompassing the linear SVM as a special case, the SVM can be cast as a regularization method in an RKHS $\mathcal{H}_K$, which finds $f(x) = \beta_0 + h(x) \in \{1\} \oplus \mathcal{H}_K$ minimizing

$$\frac{1}{n} \sum_{i=1}^n (1 - y_i f(x_i))_+ + \lambda \|h\|^2_{\mathcal{H}_K}. \qquad [9]$$
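To see how the objective in [9] is evaluated in practice, the sketch below writes h as a finite kernel expansion $h(x) = \sum_j c_j K(x_j, x)$, for which the reproducing property gives $\|h\|^2_{\mathcal{H}_K} = \sum_{i,j} c_i c_j K(x_i, x_j)$. Restricting h to this finite form is an assumption of the sketch, and the Python function names, toy data, and coefficients are hypothetical. The block also checks the identity $\Phi(x)'\Phi(t) = (x't)^2$ for the explicit feature map of Section 2.4.

```python
import math


def dot(x, t):
    return sum(a * b for a, b in zip(x, t))


def linear_kernel(x, t):
    return dot(x, t)


def poly2_kernel(x, t):
    """Homogeneous polynomial kernel of degree 2, K(x, t) = (x't)^2."""
    return dot(x, t) ** 2


def gaussian_kernel(x, t, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, t))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def feature_map(x):
    """Explicit map R^2 -> R^3 whose dot product reproduces poly2_kernel."""
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2.0) * x[0] * x[1])


def svm_objective(b, c, xs, ys, kernel, lam):
    """(1/n) * sum of hinge losses + lam * ||h||^2 for h = sum_j c_j K(x_j, .)."""
    n = len(xs)
    gram = [[kernel(xi, xj) for xj in xs] for xi in xs]
    hinge = 0.0
    for i in range(n):
        f_i = b + sum(c[j] * gram[j][i] for j in range(n))
        hinge += max(0.0, 1.0 - ys[i] * f_i)
    penalty = sum(c[i] * c[j] * gram[i][j] for i in range(n) for j in range(n))
    return hinge / n + lam * penalty


if __name__ == "__main__":
    xs = [(1.0, 1.0), (-1.0, -1.0), (3.0, 2.0)]
    ys = [1, -1, 1]
    c = [0.25, -0.25, 0.0]  # hypothetical coefficients
    print(svm_objective(0.0, c, xs, ys, linear_kernel, lam=0.1))
    print(svm_objective(0.0, c, xs, ys, gaussian_kernel, lam=0.1))
    x, t = (1.0, 2.0), (3.0, -1.0)
    print(dot(feature_map(x), feature_map(t)), poly2_kernel(x, t))
```

Swapping the `kernel` argument is the only change needed to move from the linear to a nonlinear objective, which is the modularity the kernel formulation is meant to provide.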
The representer theorem in (19) says that the minimizer of [8] has a representation of the form

$$\hat{f}_\lambda(x) = b + \sum_{i=1}^n c_i K(x_i, x), \qquad [10]$$

where b and $c_i \in R$, $i = 1, \ldots, n$. As previously mentioned, the kernel trick leads to the expression of the SVM solution:

$$\hat{f}_\lambda(x) = \hat{\beta}_0 + \sum_{i=1}^n \hat{\alpha}_i y_i K(x_i, x).$$
It agrees with what the representer theorem generally implies for the SVM formulation. Finally, the solution in [10] can be determined by minimizing

$$\frac{1}{n} \sum_{i=1}^n \left\{ 1 - y_i \left( b + \sum_{j=1}^n c_j K(x_j, x_i) \right) \right\}_+ + \lambda \sum_{i,j=1}^n c_i c_j K(x_i, x_j)$$

over b and the $c_i$'s. For further discussion of the perspective, see (17). Notably, the abstract formulation of kernel methods places no restriction on input domains or the form of kernel functions. Because of that, the SVM in combination with a variety of kernels
is modular and flexible. For instance, kernels can be defined on non-numerical domains such as strings of DNA bases, text, and graphs, expanding the realm of applications well beyond Euclidean vector spaces. Many applications of the SVM in computational biology capitalize on the versatility of the kernel method; see (20) for examples.

2.6. Statistical Properties
Contrasting the SVM with more traditional approaches to classification, we discuss statistical properties of the SVM and their implications. Theoretically, the 0–1 loss criterion defines the rule that minimizes the error rate over the population as optimal. With the symmetric labeling of ±1 and conditional probability $\eta(x) := P(Y = 1 \mid X = x)$, the optimal rule, namely the Bayes decision rule, is given by $\phi_B(x) := \mathrm{sgn}(\eta(x) - 1/2)$, predicting the label of the most likely class. In the absence of knowledge of η(x), there are two different approaches for building a classification rule that emulates the Bayes classifier. One is to construct a probability model for the data first and then use the estimate of η(x) from the model for classification. This yields such model-based plug-in rules as logistic regression, LDA, QDA, and other density-based classification methods. The other is to aim at direct minimization of the error rate without estimating η(x) explicitly. Large margin classifiers with a convex surrogate of the 0–1 loss fall into the second type, and the SVM with the hinge loss is a typical example.

The discrepancy between the 0–1 loss and the surrogate loss that is actually used for training a classifier in the latter approach generated an array of theoretical questions regarding necessary conditions for the surrogate loss to guarantee the Bayes risk consistency of the resulting rules. References (21–23) delve into the issues and provide proper conditions for a convex surrogate loss. It turns out that only minimal conditions are necessary in the binary case to ensure the risk consistency, while much care has to be taken in the multiclass case (24–26). In particular, it is shown that the hinge loss is class-calibrated, meaning that it satisfies a weak notion of consistency known as Fisher consistency. Furthermore, the Bayes risk consistency of the SVM has been established under the assumption that the space generated by a kernel is sufficiently rich (21, 27).
A simple way to see the effect of each loss criterion on the resulting rule is to look at its population analog and identify the limiting discriminant function, which is defined as the population risk minimizer among measurable functions f. Interestingly, for the hinge loss, the population minimizer of $E(1 - Y f(X))_+$ is $f^*_{SVM}(x) := \mathrm{sgn}(\eta(x) - 1/2)$, the Bayes classifier itself, while that of the negative log-likelihood loss for logistic regression is $f^*_{LR}(x) := \log\{\eta(x)/(1 - \eta(x))\}$, the true logit, for comparison. This difference is illustrated in Fig. 11.3. For 300 equally spaced
$x_i$ in (−2, 2), $y_i$'s were generated with the probability of class 1 equal to η(x) in Fig. 11.3 (the solid line is 2η(x) − 1). The dotted line is the estimate of 2η(x) − 1 by penalized logistic regression and the dashed line is the SVM. The radial basis kernel was used for both methods. Note that the logistic regression estimate is very close to the true function 2η(x) − 1 while the SVM is close to sgn(η(x) − 1/2). Nonetheless, the resulting classifiers are almost identical.

Fig. 11.3. Comparison of the SVM and logistic regression. The solid line is the true function, $2\eta(x) - 1$, the dotted line is $2\hat{\eta}_{LR}(x) - 1$ from penalized logistic regression, and the dashed line is $\hat{f}_{SVM}(x)$ of the SVM.

If prediction is of primary concern, then the SVM can be an effective choice. However, there are many applications where accurate estimation of the conditional probability η(x) is required for making better decisions than just prediction of a dichotomous outcome. In those cases, the SVM offers very limited information as there is no principled way to recover the probability from the SVM output in general.

However, the remark pertains only to the SVM with a flexible kernel since it is based on the property that the asymptotic discriminant function is sgn(η(x) − 1/2). The SVM with simple kernels, the linear SVM for one, needs to be analyzed separately. A recent study (28) shows that under the normality and equal variance assumption on the distribution of attributes for two classes, the linear SVM coincides with the LDA in the limit. Technically, the analysis exploits a close link between the SVM and median regression yet with categorical responses. At least in this case, the probability information would not be masked and can be recovered from the linear discriminant function with additional computation. However, it is generally advised that the SVM is a tool for prediction, not for modeling of the probabilistic mechanism underlying the data.
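The population minimizers quoted in this section can be checked pointwise. For a fixed value of η = P(Y = 1 | X = x), the sketch below minimizes the conditional hinge risk and the conditional logistic risk over a grid of candidate values f and compares the results with sgn(η − 1/2) and log{η/(1 − η)}. It is a numerical illustration in Python only; the grid limits and function names are arbitrary choices, and η = 1/2 (where the hinge minimizer is not unique) is avoided.

```python
import math


def hinge_risk(f, eta):
    """E[(1 - Y f)_+ | X = x] when P(Y = 1 | X = x) = eta."""
    return eta * max(0.0, 1.0 - f) + (1.0 - eta) * max(0.0, 1.0 + f)


def logistic_risk(f, eta):
    """E[log(1 + exp(-Y f)) | X = x] when P(Y = 1 | X = x) = eta."""
    return eta * math.log1p(math.exp(-f)) + (1.0 - eta) * math.log1p(math.exp(f))


def grid_minimizer(risk, eta, lo=-5.0, hi=5.0, n=10001):
    """Value of f on an equally spaced grid that minimizes risk(f, eta)."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(grid, key=lambda f: risk(f, eta))


if __name__ == "__main__":
    for eta in (0.2, 0.4, 0.6, 0.9):
        print(eta,
              grid_minimizer(hinge_risk, eta),               # compare with sgn(eta - 1/2)
              round(grid_minimizer(logistic_risk, eta), 3),  # compare with the logit
              round(math.log(eta / (1.0 - eta)), 3))
```

The hinge minimizer only reports which side of 1/2 the probability falls on, while the logistic minimizer varies smoothly with η, which is why probabilities can be read off the latter but not the former.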
3. Data Example

Taking breast cancer data in (29) as an example, we illustrate the method and discuss various aspects of its application and some practical issues. The data consist of expression levels of 24,481 genes collected from patients with primary breast tumors who were lymph node negative at the time of diagnosis. The main goal of the study was to find a gene expression signature prognostic of distant metastases within 5 years, which can be used to select patients who would benefit from adjuvant therapy such as chemotherapy or hormone therapy. Out of 78 patients in the training data, 34 developed metastasis within 5 years (labeled as poor prognosis) and 44 remained metastasis free for at least 5 years (labeled as good prognosis). Following a similar preprocessing step in the paper, we retained those genes that exhibited at least a twofold change in the expression from the pooled reference sample with a p-value < 0.01 in five or more tumors in the training data and discarded two additional genes with missing values, yielding 4,008 genes. Sample 54, with more than 20% of missing values, was removed before filtering.

First, we applied the linear SVM to the training data (77 observations), varying the number of genes from large to relatively small (d = 4,008, 70, and 20) to see the effect of the input dimension on error rates and the number of support vectors. Whenever a subset of genes was used, we included in classification those top-ranked genes by the p-value of a t-test statistic for marginal association with the prognostic outcomes. The value 70 is the number of genes selected for the prediction algorithm in the original paper, although the selection procedure was not based on the p-values. The tuning parameter λ affects classification accuracy and the number of support vectors as well. To elucidate the effect of λ, we obtained all the possible solutions indexed by the tuning parameter λ for each fixed set of genes (using R package svmpath).
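The gene-ranking step can be sketched as follows. This is an illustration in Python, not the code used for the analysis: it ranks genes by the absolute value of a pooled two-sample t-statistic, which gives the same ordering as the two-sided p-value only under the assumption that every gene has the same degrees of freedom (no missing values). The function names and the toy expression values are hypothetical.

```python
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t-statistic with pooled variance."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / (sp2 * (1.0 / nx + 1.0 / ny)) ** 0.5


def top_genes(expr, labels, d):
    """Names of the d genes with the largest |t| between the two prognosis groups."""
    scores = {}
    for gene, values in expr.items():
        good = [v for v, lab in zip(values, labels) if lab == "good"]
        poor = [v for v, lab in zip(values, labels) if lab == "poor"]
        scores[gene] = abs(pooled_t(good, poor))
    return sorted(scores, key=scores.get, reverse=True)[:d]


if __name__ == "__main__":
    labels = ["good"] * 3 + ["poor"] * 3
    expr = {  # toy log-expression values, one list per gene
        "g1": [1.0, 1.1, 0.9, 3.0, 3.1, 2.9],
        "g2": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
        "g3": [0.0, 0.1, -0.1, 0.2, 0.0, 0.1],
    }
    print(top_genes(expr, labels, d=2))
```

Because the ranking uses the class labels, it should be carried out on the training data only when the resulting error rates are to be assessed on separate test cases.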
Figure 11.4 shows the error rate curves as a function of λ. The dotted lines are the apparent error rates of the linear SVM over the training data set itself and the solid lines are the test error rates evaluated over the 19 test patients, where 7 remained
metastasis free for at least 5 years and 12 developed metastasis within 5 years. Clearly, when all of the 4,008 genes are included in the classifier, the training error rates can be driven to zero as λ decreases to zero, that is, as classifiers get less regularized. On the other hand, the corresponding test error rates in the same panel for small values of λ are considerably higher than the training error rates, exemplifying the well-known phenomenon of overfitting. Hence, to attain the best test error rate, the training error rate and the complexity of a classifier need to be properly balanced. However, for smaller input dimensions, the relationship between the apparent error rates and test error rates is quite different. In particular, when only 20 genes are used, in other words, when the feature space is small, regularization provides little benefit in minimizing the test error rate and the two error rate curves are roughly parallel to each other. In contrast, for d = 70 and 4,008, penalization ('maximizing the margin') does help in reducing the test error rate. The overall minimum error rate of around 20% was achieved when d = 70.

Fig. 11.4. Error rate curves of the linear SVMs with three input dimensions (left: 4,008, center: 70, and right: 20). The dotted lines are the apparent error rates over 77 training patients and the solid lines are the test error rates over 19 test patients.

Like the error rates, the number of support vectors also depends on the tuning parameter, the degree of overlap between two classes, and the input dimensionality among other factors. Figure 11.5 depicts how it varies as a function of λ for the three cases of high to relatively low dimension. When d = 4,008, the number of support vectors is approximately constant and, except for a few, almost all the observations are support vectors. A likely reason is that the dimension is so high compared to the sample size that nearly every observation is close to the classification boundary. However, for the lower dimensions, as λ approaches zero, a smaller fraction of observations come out to be support vectors.

Generally, changing the kernel from linear to nonlinear leads to a reduction in the overall training error rate, and it often
translates into a lower test error rate.

Fig. 11.5. Relationship between the input dimension (left: 4,008, center: 70, and right: 20) and the number of support vectors for the linear SVM (vertical axis: number of support vectors; horizontal axis: λ).

As an example, we obtained the training and test error rate curves for the Gaussian kernel, $K(x, t) = \exp(-\|x - t\|^2/2\sigma^2)$, with the 70 genes as shown in Fig. 11.6. The bandwidth σ, which is another tuning parameter, was set to be the median of pairwise distances between two classes in the left panel, half of it in the center, and nearly a third of it in the right panel. Figure 11.6 illustrates that with a smaller bandwidth, the training error rates can be made substantially smaller over a range of λ. Moreover, for σ = 1.69 and 1.20, if λ is properly chosen, then fewer mistakes are made in prediction for the test cases by the nonlinear SVM than by the linear SVM.

As emphasized before, generally the SVM output values cannot be mapped to class-conditional probabilities in a theoretically justifiable way, perhaps with the only exception of the linear SVM in a limited situation. For comparison of logistic regression and the SVM, we applied penalized logistic regression to the breast cancer data with the expression levels of the 70 genes as linear predictors. For simplicity, the optimal penalty size for logistic regression was
[Figure 11.6: three panels (σ = 3.38, σ = 1.69, σ = 1.20); x-axis: λ; y-axis: Error rate.]
Fig. 11.6. Error rate curves of the nonlinear SVM with 70 genes and the Gaussian kernel for three bandwidths. The dotted lines are the apparent error rates and the solid lines are the test error rates.
[Figure 11.7: scatter plot; x-axis: SVM; y-axis: logistic regression.]
Fig. 11.7. Scatter plot of the estimated probabilities of good prognosis from penalized logistic regression versus the values of the discriminant function from the linear SVM for training data. The grey dots indicate the patients with good prognosis and the black dots indicate those with poor prognosis.
again determined by the minimum test error rate. Figure 11.7 is a plot of the estimated probabilities of good prognosis from the logistic regression versus the values of the discriminant function from the linear SVM evaluated for the observations in the training data. It shows a monotonic relationship between the output values of the two methods, which could be used for calibration of the results from the SVM with class-conditional probabilities. When each method was best tuned in terms of the test error rate, logistic regression gave a training error rate of 10% and a test error rate of 30%, while both error rates were around 20% for the SVM. For more comparison between the two approaches, see (30, 31). The statistical issue of finding an optimal choice of the tuning parameter has not been discussed adequately in this data example. Instead, by treating the test set as if it were a validation set, the size of the penalty was chosen to minimize the test error rate directly for simple exposition. In practice, cross-validation is commonly used for tuning in the absence of a separate validation set. On a brief note, in the original paper, a correlation-based classifier was constructed on the basis of 70 genes that were selected sequentially, and its threshold was adjusted for increased sensitivity to poor prognosis. With the adjusted threshold, only 2 incorrect predictions out of 19 were reported. This low test error rate could be explained as a result of the threshold adjustment. Recall that
the good prognosis category is the majority for the training data set (good/poor = 44/33) while the opposite is true for the test set (good/poor = 7/12). As in this example, if two types of error (misclassifying good prognosis as poor or vice versa) are treated differentially, then the optimal decision boundary would be different from the region where two classes are equally likely, that is, η(x) = 1/2. For estimation of a different level of probability, say, η0 ≠ 1/2, with the SVM method, the hinge loss has to be modified with weights that are determined according to the class labels. This modification leads to a weighted SVM, and more details can be found in (32).
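To make the role of class-dependent weights concrete, the following Python sketch evaluates the expected weighted hinge loss at a single point x with η(x) = P(Y = +1 | x) and searches a grid for the minimizing output value. The weights 1 − η0 for class +1 and η0 for class −1 are an illustrative choice (any common rescaling gives the same minimizer), under which the sign of the minimizer switches at η(x) = η0. The function names and the grid are assumptions made for this example and are not taken from (32).

```python
def weighted_hinge_risk(f, eta, eta0):
    """Expected weighted hinge loss at a point where P(Y = +1 | x) = eta.

    Illustrative weights: (1 - eta0) for class +1 and eta0 for class -1.
    """
    weight_pos = 1.0 - eta0
    weight_neg = eta0
    loss_pos = max(0.0, 1.0 - f)   # hinge loss if the true label is +1
    loss_neg = max(0.0, 1.0 + f)   # hinge loss if the true label is -1
    return eta * weight_pos * loss_pos + (1.0 - eta) * weight_neg * loss_neg


def best_output(eta, eta0, grid_size=401):
    """Grid search over f in [-2, 2] for the risk-minimizing output value."""
    grid = [-2.0 + 4.0 * j / (grid_size - 1) for j in range(grid_size)]
    return min(grid, key=lambda f: weighted_hinge_risk(f, eta, eta0))


# With eta0 = 0.3, a point with eta(x) = 0.4 is assigned to class +1,
# whereas the unweighted case (eta0 = 0.5) assigns it to class -1.
sign_weighted = best_output(eta=0.4, eta0=0.3)
sign_standard = best_output(eta=0.4, eta0=0.5)
```

With equal weights (η0 = 1/2) the sketch reduces to the standard hinge loss, whose minimizer is the sign of η(x) − 1/2.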
4. Further Extensions
So far, the standard SVM for the binary case has been the main focus. Since its inception, various methodological extensions have been considered, expanding its utility to many different settings and applications. Just to provide appropriate pointers to references for further reading, some of the extensions are briefly mentioned here. First, consider situations that involve more than two classes. A proper extension of the binary SVM to the multiclass case is not as straightforward as a probability model-based approach to classification, as evident in the special nature of the discriminant function that minimizes the hinge loss in the binary case. References (24, 25) discuss some extensions of the hinge loss that would carry the desired consistency of the binary SVM to the multicategory case. Second, identification of the variables that discriminate given class labels is often crucial in many applications. There have been a variety of proposals to either combine or integrate variable or feature selection capability with the SVM for enhanced interpretability. For example, recursive feature elimination (33) combines the idea of backward elimination with the linear SVM. Similar to the ℓ1 penalization approach to variable selection in regression, such as the LASSO and the basis pursuit method (34), (35) and later (36) modified the linear SVM with the ℓ1 penalty for feature selection, and (37) further considered the ℓ0 penalty. For a nonlinear kernel function, (38, 39) introduced a scale factor for each variable and chose the scale factors by minimizing generalization error bounds. As an alternative, (40, 41) suggested a functional analysis of variance approach to feature selection for the nonlinear SVM, motivated by the nonparametric generalization of the LASSO in (42).
On the computational front, numerous algorithms to solve the SVM optimization problem have been developed for fast computation with enhanced algorithmic efficiency and for the capacity to cope with massive data. Reference (43) provides a historical perspective of the development in terms of computational issues relevant to the SVM optimization. Some of the implementations are available at http://www.kernel-machines.org, including SVMlight (44) and LIBSVM. The R package e1071 is an R interface to LIBSVM and (45) is another R implementation of the SVM. Note that the aforementioned implementations are mostly for getting a solution at a given value of the tuning parameter λ. However, as seen in the data example, the classification error rate depends on λ, and thus, in practice, it is necessary to consider a range of λ values and get the corresponding solutions in pursuit of an optimal solution. It turns out that characterization of the entire solution path as a function of λ is possible, as demonstrated in (46) for the binary case and (47) for the multicategory case. The solution path algorithms in the references provide a computational shortcut to obtain the entire spectrum of solutions, facilitating the choice of the tuning parameter. The scope of extensions of kernel methods in current use is, in fact, far beyond classification. Details of other methodological developments with kernels for regression, novelty detection, clustering, and semi-supervised learning can be found in (7).
References
1. Hastie, T., Tibshirani, R., and Friedman, J. (2001) The Elements of Statistical Learning. Springer Verlag, New York.
2. Duda, R. O., Hart, P. E., and Stork, D. G. (2000) Pattern Classification (2nd Edition). Wiley-Interscience, New York.
3. McLachlan, G. J. (2004) Discriminant Analysis and Statistical Pattern Recognition. Wiley-Interscience, New York.
4. Vapnik, V. (1998) Statistical Learning Theory. Wiley, New York.
5. Boser, B., Guyon, I., and Vapnik, V. (1992) A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory 5, 144–152.
6. Cristianini, N. and Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines. Cambridge University Press, Cambridge.
7. Schölkopf, B. and Smola, A. (2002) Learning with Kernels – Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, MA.
8. Cortes, C. and Vapnik, V. (1995) Support-vector networks. Machine Learning 20(3), 273–297.
9. Rosenblatt, F. (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65, 386–408.
10. Burges, C. (1998) A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2), 121–167.
11. Bennett, K. P. and Campbell, C. (2000) Support vector machines: Hype or hallelujah? SIGKDD Explorations 2(2), 1–13.
12. Moguerza, J. M. and Munoz, A. (2006) Support vector machines with applications. Statistical Science 21(3), 322–336.
13. Hoerl, A. and Kennard, R. (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67.
14. Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B 58(1), 267–288.
15. Mangasarian, O. (1994) Nonlinear Programming. Classics in Applied Mathematics, Vol. 10, SIAM, Philadelphia.
16. Wahba, G. (1990) Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia.
17. Wahba, G. (1998) Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In Schölkopf, B., Burges, C. J. C., and Smola, A. J. (ed.), Advances in Kernel Methods: Support Vector Learning, MIT Press, pp. 69–87.
18. Aronszajn, N. (1950) Theory of reproducing kernels. Transactions of the American Mathematical Society 68, 337–404.
19. Kimeldorf, G. and Wahba, G. (1971) Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications 33(1), 82–95.
20. Schölkopf, B., Tsuda, K., and Vert, J. P. (ed.) (2004) Kernel Methods in Computational Biology. MIT Press, Cambridge, MA.
21. Zhang, T. (2004) Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics 32(1), 56–85.
22. Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006) Convexity, classification, and risk bounds. Journal of the American Statistical Association 101, 138–156.
23. Lin, Y. (2002) A note on margin-based loss functions in classification. Statistics and Probability Letters 68, 73–82.
24. Lee, Y., Lin, Y., and Wahba, G. (2004) Multicategory Support Vector Machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association 99, 67–81.
25. Tewari, A. and Bartlett, P. L. (2007) On the consistency of multiclass classification methods. Journal of Machine Learning Research 8, 1007–1025.
26. Liu, Y. and Shen, X. (2006) Multicategory SVM and ψ-learning-methodology and theory. Journal of the American Statistical Association 101, 500–509.
27. Steinwart, I. (2005) Consistency of support vector machines and other regularized kernel machines. IEEE Transactions on Information Theory 51, 128–142.
28. Koo, J.-Y., Lee, Y., Kim, Y., and Park, C. (2008) A Bahadur representation of the linear Support Vector Machine. Journal of Machine Learning Research 9, 1343–1368.
29. van’t Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415(6871), 530–536.
30. Zhu, J. and Hastie, T. (2004) Classification of gene microarrays by penalized logistic regression. Biostatistics 5(3), 427–443.
31. Wahba, G. (2002) Soft and hard classification by reproducing kernel Hilbert space methods. Proceedings of the National Academy of Sciences 99, 16524–16530.
32. Lin, Y., Lee, Y., and Wahba, G. (2002) Support vector machines for classification in nonstandard situations. Machine Learning 46, 191–202.
33. Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002) Gene selection for cancer classification using support vector machines. Machine Learning 46(1–3), 389–422.
34. Chen, S. S., Donoho, D. L., and Saunders, M. A. (1999) Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing 20(1), 33–61.
35. Bradley, P. S. and Mangasarian, O. L. (1998) Feature selection via concave minimization and support vector machines. In Shavlik, J. (ed.), Machine Learning Proceedings of the Fifteenth International Conference, Morgan Kaufmann, San Francisco, California, pp. 82–90.
36. Zhu, J., Rosset, S., Hastie, T., and Tibshirani, R. (2004) 1-norm support vector machines. In Thrun, S., Saul, L., and Schölkopf, B. (ed.), Advances in Neural Information Processing Systems 16, MIT Press, Cambridge, MA.
37. Weston, J., Elisseeff, A., Schölkopf, B., and Tipping, M. (2003) Use of the zero-norm with linear models and kernel methods. Journal of Machine Learning Research 3, 1439–1461.
38. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., and Vapnik, V. (2001) Feature selection for SVMs. In Solla, S. A., Leen, T. K., and Muller, K.-R. (ed.), Advances in Neural Information Processing Systems 13, MIT Press, Cambridge, MA, pp. 668–674.
39. Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. (2002) Choosing multiple parameters for support vector machines. Machine Learning 46(1–3), 131–159.
40. Zhang, H. H. (2006) Variable selection for support vector machines via smoothing spline ANOVA. Statistica Sinica 16(2), 659–674.
41. Lee, Y., Kim, Y., Lee, S., and Koo, J.-Y. (2006) Structured Multicategory Support Vector Machine with ANOVA decomposition. Biometrika 93(3), 555–571.
42. Lin, Y. and Zhang, H. H. (2006) Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics 34, 2272–2297.
43. Bottou, L. and Lin, C.-J. (2007) Support Vector Machine Solvers. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J. (ed.), Large Scale Kernel Machines, MIT Press, Cambridge, MA, pp. 301–320.
44. Joachims, T. (1998) Making large-scale support vector machine learning practical. In Schölkopf, C. B. (ed.), Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA.
45. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008) LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, 1871–1874.
46. Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. (2004) The entire regularization path for the support vector machine. Journal of Machine Learning Research 5, 1391–1415.
47. Lee, Y. and Cui, Z. (2006) Characterizing the solution path of Multicategory Support Vector Machines. Statistica Sinica 16(2), 391–409.
Chapter 12
An Overview of Clustering Applied to Molecular Biology
Rebecca Nugent and Marina Meila
Abstract
In molecular biology, we are often interested in determining the group structure in, e.g., a population of cells or microarray gene expression data. Clustering methods identify groups of similar observations, but the results can depend on the chosen method’s assumptions and starting parameter values. In this chapter, we give a broad overview of both attribute- and similarity-based clustering, describing both the methods and their performance. The parametric and nonparametric approaches presented vary in whether or not they require knowing the number of clusters in advance as well as the shapes of the estimated clusters. Additionally, we include a biclustering algorithm that incorporates variable selection into the clustering procedure. We finish with a discussion of some common methods for comparing two clustering solutions (possibly from different methods). The user is advised to devote time and attention to determining the appropriate clustering approach (and any corresponding parameter values) for the specific application prior to analysis.
Key words: Cluster analysis, K-means, model-based clustering, EM algorithm, similarity-based clustering, spectral clustering, nonparametric clustering, hierarchical clustering, biclustering, comparing partitions.
H. Bang et al. (eds.), Statistical Methods in Molecular Biology, Methods in Molecular Biology 620, DOI 10.1007/978-1-60761-580-4_12, © Springer Science+Business Media, LLC 2010
1. Introduction
In many molecular biology applications, we are interested in determining the presence of “similar” observations. For example, in flow cytometry, fluorescent tags are attached to mRNA molecules in a population of cells and passed in front of a single wavelength laser; the level of fluorescence in each cell (corresponding, for example, to level of gene expression) is recorded. An observation is comprised of the measurements taken from the different tags or channels. We might be interested in discovering groups of cells that have high fluorescence levels for
multiple channels (e.g., gating) or groups of cells that have different levels across channels. We might also define groups of interest a priori and then try to classify cells according to those group definitions. In microarray analysis, gene expression levels can be measured across different samples (people, tissues, etc.) or different experimental conditions. An observation might be the different expression levels across all measured genes for one person; a group of interest might be a group of patients whose gene expression patterns are similar. We could also look for different patterns among a group of patients diagnosed with the same disease; subgroups of patients that display different gene expression might imply the presence of two different pathologies on the molecular level within patients showing the same symptoms (e.g., T-cell/B-cell acute lymphoblastic leukemia (1)). An observation could also be the expression levels for one gene across many experimental conditions. Similarly expressed genes are used to help identify coregulated genes for use in determining disease marker genes. In these and other applications, we might ask: How many groups are there? Where are they? How are the grouped observations similar or dissimilar? How can we describe them? How are the groups themselves similar or dissimilar? In statistics, clustering is used to answer these types of questions. The goal of clustering is to identify distinct groups in a data set and assign a group label to each observation. Observations are partitioned into subsets, or clusters, such that observations in one subset are more similar to each other than to observations in different subsets. Ideally, we would like to find these clusters with minimal input from the user. There is a wide array of clustering approaches, each with its strengths and weaknesses. 
Generally, an approach can be characterized by (1) the type of data available (observation attributes or (dis)similarities between pairs of observations) and (2) the prior assumptions about the clusters (size, shape, etc.). In this chapter, we will give an overview of several common clustering methods and illustrate their performance on two running two-dimensional examples for easy visualization (Fig. 12.1). Figure 12.1a contains simulated data generated from four groups, two spherical and two curvilinear. Figure 12.1b contains the flow cytometry measurements of two fluorescence markers applied to Rituximab, a therapeutic monoclonal antibody, in a drug-screening project designed to identify agents to enhance its antilymphoma activity (2, 3). Cells were stained, following culture, with the agents anti-BrdU and the DNA binding dye 7-AAD. In Section 2, we introduce attribute-based clustering and some commonly used methods: K-means and K-medoids, popular algorithms for partitioning data into spherical groups; model-based clustering; and some nonparametric approaches including cluster trees, mean shift methods, and Dirichlet mixture models. Section 3 begins with an overview of pairwise (dis)similarity-based clustering and then discusses hierarchical clustering, spectral clustering, and affinity propagation. Section 4 gives a brief introduction to biclustering, a method that clusters observations and measured variables simultaneously. In Section 5, we review several approaches to compare clustering results. We conclude the chapter with an overview of the methods’ strengths and weaknesses and some recommendations.
[Figure 12.1: two scatter plots. (a) x-axis: x1; y-axis: x2. (b) x-axis: Anti-BrdU FITC; y-axis: 7 AAD.]
Fig. 12.1. (a) Data set with four apparent groups; (b) two fluorescence channels (7-AAD, anti-BrdU) from the Rituximab data set.
Prior to describing the clustering approaches (and their behaviors), we first introduce some basic notation that will be used throughout the chapter. Each observation $x_i$ is denoted by a vector of length p, $x_i = \{x_{i1}, x_{i2}, x_{i3}, \ldots, x_{ip}\}$, where p is the number of variables or different measurements on the observation. The observations are indexed over i = 1, 2, ..., n, where n is the total number of observations. Clusters are denoted by $C_k$ and are indexed over k = 1, 2, ..., K, where K is the total number of clusters. The cardinality (or size) of cluster $C_k$ is denoted by $|C_k|$. The indicator function $I_{x_i \in C_k}$ equals 1 (or 0) for all observations $x_i$ that are currently (or not currently) assigned to cluster $C_k$. When summing over observations assigned to a cluster, the notations $\sum_{i=1}^{n} I_{x_i \in C_k}$ and $\sum_{x_i \in C_k}$ can be used interchangeably. In addition, the methods presented in this chapter are for use with continuous real-valued data, $x_i \in \mathbb{R}^p$.
2. Attribute-Based Clustering
In attribute-based clustering, the user has p different measurements on each observation. The values are stored in a matrix X of dimension n × p: $X = \{x_1, x_2, x_3, \ldots, x_n\}$, with each $x_i \in \mathbb{R}^p$. Row i of the matrix contains the measurements (or values) for all p variables for the ith observation; column j contains the measurements (or values) for the jth variable for all n observations. Attribute-based clustering methods take the matrix X as an input and then look for groups of observations with similar-valued attributes. The groups themselves could be of varying shapes and sizes. The user may have reason to believe that the observations clump together in spherical or oval shapes around centers of average attribute values; the observations could also be grouped in curved shapes. For example, in Fig. 12.1a, there are two spherical groups of observations and two curvilinear groups of observations. Clustering methods are often designed to look for clusters of specific shapes; the user should keep these designations in mind when choosing a method.
2.1. K-Means and K-Medoids
K-means is a popular method that uses the squared Euclidean distance between attributes to determine the clusters (4, 5). It tends to find spherical clusters and requires the user to specify the desired number of clusters, K, in advance. Each cluster is defined by its assigned observations ($I_{x_i \in C_k} = 1$) and their center $\bar{x}_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$ (the vector of average attribute values of the assigned observations). We measure the “quality” of the clustering solution by a within-cluster (WC) squared-error criterion:
$$WC = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \bar{x}_k\|^2 \qquad [1]$$
The term $\sum_{x_i \in C_k} \|x_i - \bar{x}_k\|^2$ represents how close the observations are to the cluster center $\bar{x}_k$. We then sum over all K clusters to measure the overall cluster compactness. Tight, compact clusters correspond to low WC values. K-means searches for a clustering solution with a low WC criterion with the following algorithm:
K-MEANS Algorithm
Input: Observations $x_1, \ldots, x_n$; the number of clusters K.
• Select K starting centers $\bar{x}_1, \ldots, \bar{x}_K$.
• Iterate until cluster assignments do not change:
  1. for i = 1:n, assign each observation $x_i$ to the closest center $\bar{x}_k$.
  2. for k = 1:K, re-compute each cluster center as $\bar{x}_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$.
We give the algorithm a starting set of centers; the observations are assigned to their respective closest centers, the centers are then re-computed, and so on. Each cycle lowers the WC criterion until it converges to a minimum (or the observations’ assignments do not change from cycle to cycle). Usually, the first few cycles correspond to large drops in the criterion and big changes in the positions of the cluster centers. The following cycles make smaller changes as the solution becomes more refined. Figure 12.2 shows four- and eight-cluster K-means solutions. The four-cluster solution (WC = 25.20946) finds the lower left group, combines the two spheres, and splits the upper left curvilinear group. While it might be surprising that the two spherical groups are not separated, note that, given the fixed number of clusters (K = 4), if one group is split, two groups are forced to merge. The eight-cluster solution (WC = 4.999) separates the two spherical groups and then splits each curvilinear group into multiple groups. The criterion is much lower but at the expense of erroneously splitting the curved groups.
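The two alternating steps can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names, the random choice of K observations as starting centers, the `seed` and `max_iter` parameters, and the rule of leaving a center unchanged when its cluster becomes empty are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centers, wc) for a list of points."""
    rng = random.Random(seed)
    # Starting centers: K randomly chosen observations.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 1: assign each observation to its closest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged: stop
            break
        labels = new_labels
        # Step 2: re-compute each center as the mean of its observations.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Within-cluster squared-error criterion, Eq. [1].
    wc = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wc


# Two well-separated groups of three observations each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, wc = kmeans(data, k=2)
```

Because the result can depend on the starting centers, one might call `kmeans` with several values of `seed` and keep the solution with the lowest `wc`.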
[Figure 12.2: two scatter plots of the four-group data with observations labeled by cluster number; x-axis: x1; y-axis: x2.]
Fig. 12.2. (a) K-means: K = 4; (b) K-means: K = 8.
Although the number of clusters may be predefined as part of the problem, in practice, the number of clusters, K, may be unknown, particularly when the dimensionality of the data prohibits easy visualization. Moreover, the criterion can always be lowered by increasing K, since splitting an already tight cluster into two smaller clusters reduces the criterion. As it is often assumed that overestimating the number of clusters is not a practical solution, we instead search for a solution with a reasonably low WC criterion (one that does not decrease substantially with an increase in the number of clusters). We plot the WC criterion against the number of clusters; the “elbow” in the graph corresponds to a reasonable solution. Figure 12.3a contains the “elbow” graph for the flow cytometry data (Section 1). The criterion drops as we increase the number of clusters; around K = 12 or 13, the drops become relatively negligible. Figure 12.3b presents the 13-cluster solution. As expected, the observations are partitioned into spherical groups. In fact, the large population of cells in the lower left corner is split into several small clusters. (A solution with six or seven clusters, corresponding to the end of the steeper criterion drops, combines several of these smaller clusters.) One drawback of the K-means method is that the solution could change depending on the starting centers. There are several common choices. Probably the most common is to randomly pick K observations as the starting centers, $\bar{x}_k$. Figure 12.4 contrasts the original 13-cluster results and another 13-cluster solution given a different set of starting observations. The results are similar; some of the more densely populated areas are separated into spheres in slightly different positions. Users could also use a set of problem-defined centers; a set of centers generated
[Figure 12.3: two panels. (a) x-axis: K = No. of Clusters; y-axis: Within Cluster Criterion. (b) x-axis: Anti-BrdU FITC; y-axis: 7 AAD.]
Fig. 12.3. (a) Elbow graph; (b) K-means: K = 13.
[Figure 12.4: two scatter plots of the flow cytometry data with observations labeled by cluster number; x-axis: Anti-BrdU FITC; y-axis: 7 AAD.]
Fig. 12.4. (a) K = 13; original starting centers; (b) K = 13; different starting centers.
from another clustering algorithm (e.g., hierarchical clustering, Section 3.2) is another common choice. Recall though that each clustering solution corresponds to a WC criterion; if desired, the user could run K-means using several different sets of starting centers and choose the solution corresponding to the lowest criterion. A related method is K-medoids. If outlier observations are present, their large distance from the other observations influences the cluster centers, $\bar{x}_k$, by pulling them disproportionately toward the outliers. To reduce this influence, instead of re-estimating the cluster center as the average of the assigned observations in Step 2, we replace the center with the observation that would correspond to the lowest criterion value, i.e.,
$$\bar{x}_k = \arg\min_{x_i \in C_k} \sum_{x_j \in C_k} \|x_j - x_i\|^2 \qquad [2]$$
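As a small illustration of Eq. [2], the sketch below computes the medoid of one cluster by trying every assigned observation as the candidate center, and compares it with the mean-based center used by K-means. The function names and the toy cluster (three nearby points and one outlier) are hypothetical and chosen only for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def medoid(cluster):
    """Observation in the cluster minimizing the criterion in Eq. [2]."""
    def total_squared_distance(candidate):
        return sum(squared_distance(other, candidate) for other in cluster)
    return min(cluster, key=total_squared_distance)


def mean_center(cluster):
    """Average attribute vector, as used in Step 2 of K-means."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))


# Three nearby observations and one outlier at x1 = 100.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (100.0, 0.0)]
center_mean = mean_center(cluster)   # lies between the bulk and the outlier
center_medoid = medoid(cluster)      # always one of the observations
```

The nested search over candidates and cluster members makes the cost of this update quadratic in the cluster size.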
This method is computationally more difficult since at each Step 2, the criterion for each cluster is optimized over all choices for the new center (i.e., the observations currently assigned to the cluster).
2.2. Model-Based Clustering
The statistical approach to clustering assumes that the observations are a sample from a population with some density f (x) and that the groups (clusters) in the population can be described by properties of this density. In model-based clustering (6, 7), we assume that each population group (i.e., cluster) is represented by a density in some parametric family fk (x) and that the population density f (x) is a
weighted combination (or mixture) of the group densities:
$$f(x) = \sum_{k=1}^{K} \pi_k \cdot f_k(x; \theta_k) \qquad [3]$$
where $\pi_k \geq 0$, $\sum_{k=1}^{K} \pi_k = 1$ are called mixture weights, and the densities $f_k$ are called mixture components. The procedure needs to choose the number of components (or clusters) K and the weights $\pi_k$ and to estimate the parameters of the densities $f_k$. Most often the component densities are assumed to be Gaussian with parameters $\theta_k = \{\mu_k, \Sigma_k\}$ (i.e., elliptical shapes, symmetric, no heavy tails). The covariance matrices $\Sigma_k$ give the “shape” (relative magnitude of the eigenvalues), “volume” (absolute magnitude of the eigenvalues), and “orientation” (of the eigenvectors of $\Sigma_k$) of the clusters. Depending on what is known, one can choose each of these values to be the same or different among clusters. There are ten possible combinations, from which arise the following ten models [illustrated in Fig. 12.5 from (8)]:
[Figure 12.5: ten panels, one per candidate model: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV.]
Fig. 12.5. Illustrating the group density components in the candidate models (from (8)).
• EII: equal volume, round shape (spherical covariance)
• VII: varying volume, round shape (spherical covariance)
• EEI: equal volume, equal shape, axis parallel orientation (diagonal covariance)
• VEI: varying volume, equal shape, axis parallel orientation (diagonal covariance)
• EVI: equal volume, varying shape, axis parallel orientation (diagonal covariance)
• VVI: varying volume, varying shape, axis parallel orientation (diagonal covariance)
• EEE: equal volume, equal shape, equal orientation (ellipsoidal covariance)
• EEV: equal volume, equal shape, varying orientation (ellipsoidal covariance)
• VEV: varying volume, equal shape, varying orientation (ellipsoidal covariance)
• VVV: varying volume, varying shape, varying orientation (ellipsoidal covariance)

Each model is fit using an Expectation-Maximization (EM) algorithm (9).

EXPECTATION-MAXIMIZATION (EM) Algorithm
Input: Data $\{x_i\}_{i=1:n}$, the number of clusters $K$.
Initialize parameters $\pi_{1:K} \in \mathbb{R}$, $\mu_{1:K} \in \mathbb{R}^p$, $\Sigma_{1:K}$ at random. The $\Sigma_k$ will be symmetric, positive definite matrices, parametrized according to the chosen covariance structure (e.g., EII, VII, etc.).
Iterate until convergence:
E step (estimate data assignments to clusters): for $i = 1:n$, $k = 1:K$,

$$\gamma_{ki} = \frac{\pi_k f_k(x_i)}{f(x_i)}.$$

M step (estimate mixture parameters): denote $\Gamma_k = \sum_{i=1}^{n} \gamma_{ki}$, $k = 1:K$ (note that $\sum_{k} \Gamma_k = n$). Then, for $k = 1:K$,

$$\pi_k = \frac{\Gamma_k}{n}, \qquad \mu_k = \frac{\sum_{i=1}^{n} \gamma_{ki}\, x_i}{\Gamma_k}, \qquad \Sigma_k = \frac{\sum_{i=1}^{n} \gamma_{ki}\, (x_i - \mu_k)(x_i - \mu_k)^T}{\Gamma_k},$$

or another update according to the covariance structure chosen (10).
The algorithm alternates between estimating the soft assignments $\gamma_{ki}$ of each observation $i$ to cluster $k$ and estimating the mixture parameters $\pi_k$, $\mu_k$, $\Sigma_k$. The values $\gamma_{ki}$ are always between 0 and 1, with 1 representing certainty that observation $i$ is in cluster $k$. The value $\Gamma_k$ represents the total "number of observations" in cluster $k$. In the algorithm, the equation for estimating $\Sigma_k$ corresponds to the VVV model, where there are no shared parameters between the covariance matrices; the estimation of the other covariance structures is detailed in (10).
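The E and M steps above can be sketched in a few lines of Python. This is a minimal illustration only, restricted to univariate data (so each covariance matrix reduces to a single variance, updated without constraints as in the VVV case). The function names, the initialization scheme, the fixed iteration count, and the small variance floor are all assumptions made for the sketch, not part of the chapter's method or of any package.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gaussian_mixture(xs, K, n_iter=200, seed=0):
    """Fit a K-component univariate Gaussian mixture by EM.

    Returns (pi, mu, var, gamma), where gamma[k][i] is the soft
    assignment of observation i to component k.
    """
    rng = random.Random(seed)
    n = len(xs)
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n
    # Assumed initialization: equal weights, means drawn from the data,
    # every variance set to the overall variance of the sample.
    pi = [1.0 / K] * K
    mu = rng.sample(xs, K)
    var = [var_all] * K
    gamma = [[0.0] * n for _ in range(K)]
    for _ in range(n_iter):
        # E step: gamma_ki = pi_k f_k(x_i) / f(x_i)
        for i, x in enumerate(xs):
            num = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(K)]
            total = sum(num)
            for k in range(K):
                gamma[k][i] = num[k] / total
        # M step: update weights, means, and unconstrained variances
        for k in range(K):
            big_gamma = sum(gamma[k])  # effective number of observations
            pi[k] = big_gamma / n
            mu[k] = sum(g * x for g, x in zip(gamma[k], xs)) / big_gamma
            var[k] = sum(g * (x - mu[k]) ** 2
                         for g, x in zip(gamma[k], xs)) / big_gamma
            var[k] = max(var[k], 1e-6)  # assumed guard against collapse
    return pi, mu, var, gamma
```

A hard assignment can be read off the result by giving each observation to the component with the largest $\gamma_{ki}$; keeping the whole column of `gamma` instead gives the soft assignment.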
The number of clusters $K$ and clustering model (e.g., EII) are chosen to maximize the Bayesian Information Criterion (BIC) (10), a function of the likelihood of the observations given the chosen model penalized by a complexity term (which aims to prevent over-fitting). For $n$ observations, if the current candidate model $M$ has $p$ parameters,

$$\mathrm{BIC} = 2 \cdot \log L(x \mid \theta) - \log(n) \cdot p. \qquad [4]$$
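As a small illustration of Eq. [4], the sketch below computes the BIC from a log-likelihood and a parameter count. The helper `n_params_vvv` is a hypothetical name; it uses the usual count of free parameters for a mixture with unconstrained (VVV) covariances in `dim` dimensions, which is stated here as an assumption since the chapter does not spell the count out.

```python
import math


def n_params_vvv(K, dim):
    """Free parameters of a K-component Gaussian mixture with unconstrained
    covariances: (K - 1) weights, K*dim means, K*dim*(dim + 1)/2 covariance
    entries."""
    return (K - 1) + K * dim + K * dim * (dim + 1) // 2


def bic(log_likelihood, n_params, n_obs):
    """BIC = 2 log L - p log n; larger is better under this sign convention."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)


def best_model(candidates, n_obs):
    """candidates maps a label to (log_likelihood, n_params); return the
    label with the largest BIC."""
    return max(candidates, key=lambda name: bic(*candidates[name], n_obs))
```

With this sign convention the penalty grows with both the number of parameters and $\log n$, so a richer model is preferred only if its log-likelihood gain outweighs the added penalty.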
The BICs are calculated for all candidate models. Although only one final model is chosen, some model-based clustering software packages [e.g., mclust in R (6)] will produce graphs showing the change in BIC for an increasing number of clusters given a model, pairwise scatterplots of the variables with observations color-coded by cluster, and two-dimensional (projection) scatterplots of the component densities. When a model is chosen, parameter values (centers, covariances) are estimated, and observations are assigned to the mixture components by Bayes Rule. This assignment is a hard assignment (each $x$ belongs to one cluster only); however, we could relax the hard assignment by instead looking at the vector $\gamma_i$ (of length $K$) for each observation, representing the probabilities that the observation comes from each mixture component. If an observation has several "likely" clusters, the user could return the soft assignments $\gamma_{ki}$.

Figure 12.6 contains the model-based clustering results for the four-group example data. The procedure searches all ten candidate models over a range of 1–12 clusters (range chosen by user). Figure 12.6a shows the change in the BIC as the number of clusters increases for each of the models (legend in figure caption). The top three models are EEV, 8 (BIC = 961.79); EEV, 9 (954.24); and EII, 10 (935.12). The observations are classified according to the best model (EEV, 8) in Fig. 12.6b. The EEV model fits ellipsoidal densities of equal volume, equal shape, and varying orientation. The spherical groups are appropriately modeled; the curvilinear groups are each separated into three clusters. This behavior highlights an oft-cited weakness of model-based clustering: its performance depends on adherence to the Gaussian assumptions. Skewed or non-Gaussian data are often over-fit by too many density components (clusters). The reader should keep in mind, however, that the mixture components $f_k$ do not need to be Gaussians, and that the EM algorithm can be modified easily (by changing the M step) to work with other density families. Also note in Fig. 12.6a that several models may have similar BIC values; users may want to look at a subgroup of models and choose based on practicality or interpretability.

[Figure 12.6: (a) BIC against number of clusters; (b) scatterplot of x1 and x2 with observations labeled by cluster.]
Fig. 12.6. (a) Number of clusters vs. BIC: EII (A), VII (B), EEI (C), VEI (D), EVI (E), VVI (F), EEE (G), EEV (H), VEV (I), VVV (J); (b) EEV, eight-cluster solution.

Figure 12.7 contains the results for the flow cytometry example. Model-based clustering chooses a VEV, 11 model (BIC = −37920.18) with ellipsoidal densities of varying volume, equal shape, and varying orientation. The two closest models are VEV, 12 (−37946.37) and VVV, 7 (−38003.07). Note in Fig. 12.7b that the density components can overlap, which may lead to overlapping or seemingly incongruous clusters (e.g., the "9" cluster).
[Figure 12.7: (a) BIC against number of clusters; (b) observations labeled by cluster, with axes 7 AAD and Anti-BrdU FITC.]
Fig. 12.7. (a) Number of clusters vs. BIC: EII (A), VII (B), EEI (C), VEI (D), EVI (E), VVI (F), EEE (G), EEV (H), VEV (I), VVV (J); (b) VEV, eleven cluster solution.
2.3. Nonparametric Clustering
In contrast, nonparametric clustering assumes that groups in a population correspond to modes of the density $f(x)$ (11, 12, 13, 14). The goal then is to find the modes and assign each observation to the "domain of attraction" of a mode.

Nonparametric clustering methods are akin to "mode-hunting" [e.g., (15)] and identify groups of all shapes and sizes. Finding modes of a density is often based on analyzing cross-sections of the density or its level sets. The level set at height $\lambda$ of a density $f(x)$ is defined as all areas of feature space whose density exceeds $\lambda$, i.e.,

$$L(\lambda; f(x)) = \{x \mid f(x) > \lambda\}. \qquad [5]$$

The connected components, or pieces, of the level set correspond to modes of the density, particularly modes whose peaks are above the height $\lambda$. For example, Fig. 12.8a below contains a grey-scale heat map of a binned kernel density estimate (BKDE) of the four-group example data (16, 17). The BKDE is a piecewise constant kernel density estimate found over a grid (here 20 × 20); the bandwidth $h$ is chosen by least squares cross-validation (here 0.0244). High-density areas are indicated by black or grey; low-density areas are in white. We can clearly see the presence of four high-frequency areas, our four groups. Figure 12.8b shows the cross-section or level set at $\lambda = 0.00016$. Only the bins whose density estimate is greater than 0.00016 remain in the level set (black); the remaining bins drop out (white). At this height, we have evidence of at least two modes in the density estimate. Increasing the height to $\lambda = 0.0029$ gives us the level set in Fig. 12.8c. We now have four sections of feature space with density greater than 0.0029, giving us evidence of at least four modes.
[Figure 12.8: three panels (a), (b), (c), each with axes x1 and x2.]
Fig. 12.8. (a) BKDE, 20 × 20 grid; h = 0.0244; (b) L(0.00016; BKDE); (c) L(0.0029; BKDE).
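To make the level set idea concrete, the sketch below thresholds a binned density estimate stored as a grid (a list of rows) and counts the connected components of the surviving bins. It is an illustration under assumptions: the function names are hypothetical, and two bins are treated as connected when they share an edge (4-connectivity), a choice the chapter does not specify.

```python
def level_set(density, lam):
    """Boolean grid marking the bins whose density estimate exceeds lam."""
    return [[value > lam for value in row] for row in density]


def connected_components(mask):
    """Label the 4-connected components of the True bins.

    Returns (count, labels), where labels[r][c] is 0 for bins outside the
    level set and 1, 2, ... for bins in each component.
    """
    n_rows, n_cols = len(mask), len(mask[0])
    labels = [[0] * n_cols for _ in range(n_rows)]
    count = 0
    for r in range(n_rows):
        for c in range(n_cols):
            if not mask[r][c] or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            stack = [(r, c)]
            while stack:  # flood fill from the newly found bin
                i, j = stack.pop()
                for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if (0 <= a < n_rows and 0 <= b < n_cols
                            and mask[a][b] and not labels[a][b]):
                        labels[a][b] = count
                        stack.append((a, b))
    return count, labels
```

Calling `connected_components(level_set(grid, lam))` for several values of `lam` shows how the number of components changes with the height of the cross-section.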
Figure 12.8 also shows an inherent limitation of a one-level-set approach. One single cross-section of a density may not show all the density's modes. The height $\lambda$ either may not be tall enough to separate all the modes (Fig. 12.8b) or might be too tall and so not identify modes whose peaks are lower than $\lambda$ (not shown) (11–14). We can take advantage of the concentric nature of level sets (level sets for higher $\lambda$ are contained in level sets for lower $\lambda$) to construct a cluster tree (18, 19). The cluster tree combines information from all the density's level sets by analyzing consecutive level sets for the presence of multiple modes. For example, the level set at $\lambda = 0.00016$ in Fig. 12.8b above is the first level set with two connected components. We would split the cluster tree into two branches at this height to indicate the presence of two modes and then examine the subsequent level sets for further splits. Figure 12.9 shows the cluster tree and final assignments of our example data. Each leaf of the tree (numbered node) corresponds to a mode of the density estimate. The first split corresponds to an artifact of the density estimate, a mode with no observations. The "1" cluster is the first split identified in Fig. 12.8b that contains observations and so corresponds to its own branch. The remaining observations and modes stay together in the tree until subsequent level sets indicate further splits.
[Figure 12.9: (a) cluster tree with vertical axis "Density Estimate λ"; (b) scatterplot of x1 and x2 with observations labeled by cluster.]
Fig. 12.9. (a) Cluster tree of the BKDE; (b) cluster assignments based on final "leaves".
Figure 12.10 illustrates the cluster tree and subsequent assignments for the flow cytometry example. There are 13 leaves in the cluster tree (five are modal artifacts of the binned kernel density estimate). In the cluster assignments, the observations have been labeled as either "core" (•) or "fluff" (x). Core observations are those that lie in bins belonging to the connected components (i.e., black bins in Fig. 12.8b; the cluster centers); fluff observations are in bins that were dropped prior to finding a split in the level set (i.e., the white bins in Fig. 12.8b; the edges of the clusters). Here we would report one large cluster and several smaller clusters. The two most similar clusters are "5" and "6"; they are the last clusters/modes to be separated. This approach does not put any structure or size restrictions on the clusters; it assumes only that they correspond to modes of the density. The hierarchical structure of the cluster tree also identifies "similar" clusters by grouping them; modes that are very close together will be on the same branch of the cluster tree.
[Figure 12.10: (a) cluster tree with vertical axis "Density Estimate λ"; (b) cluster assignments with axes 7 AAD and Anti-BrdU FITC.]
Fig. 12.10. (a) Cluster tree with 13 leaves (8 clusters and 5 artifacts); (b) cluster assignments: "fluff" = x; "core" = •.
Nonparametric clustering procedures in general are dependent on the density estimate. Although a method may find all of the modes of a density estimate, these modes may not correspond to modes of the density (i.e., groups in the population). Spurious modes of a poor density estimate will show up in the cluster tree. For example, the first split on the cluster tree in Fig. 12.9a (far right node) actually corresponds to a modal artifact in the density estimate (very small mode with no observations). Also, multiple modes are found in the curvilinear groups due to inherent noise in the density estimate. Often, pruning or merging techniques are used to combine similar clusters on the same branch (18).

2.4. Mean Shift Methods
Another category of nonparametric clustering methods explicitly finds, for each observation $x_i$, the corresponding mode. All the observations under one mode form a cluster. Again, one assumes that a kernel density estimate (KDE) is available. Typically the kernel used in this estimate is the Gaussian kernel with bandwidth $h$,

$$K(z) = \frac{1}{h^p (2\pi)^{p/2}} \, e^{-\|z\|^2 / 2h^2}. \qquad [6]$$

Hence, the KDE has the form

$$f(x) = \frac{1}{n h^p (2\pi)^{p/2}} \sum_{i=1}^{n} e^{-\|x - x_i\|^2 / 2h^2}. \qquad [7]$$

If $x$ is a mode of this density, the gradient at $x$ is equal to 0, hence satisfying the relation

$$x = \frac{\dfrac{1}{n h^p (2\pi)^{p/2}} \displaystyle\sum_{i=1}^{n} e^{-\|x - x_i\|^2 / 2h^2} \, x_i}{\dfrac{1}{n h^p (2\pi)^{p/2}} \displaystyle\sum_{i=1}^{n} e^{-\|x - x_i\|^2 / 2h^2}} \equiv m(x). \qquad [8]$$
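The step from the zero-gradient condition to the fixed-point relation [8] can be written out as follows. This is a short worked derivation using only the Gaussian KDE in [7]; the abbreviation $w_i(x)$ is introduced here for convenience and is not the chapter's notation.

```latex
% Abbreviation (assumption of this sketch): w_i(x) = exp(-||x - x_i||^2 / (2 h^2))
\begin{align*}
\nabla f(x)
  &= \frac{1}{n h^p (2\pi)^{p/2}} \sum_{i=1}^{n}
     e^{-\|x - x_i\|^2 / 2h^2} \, \frac{x_i - x}{h^2} = 0 \\
\Longrightarrow\quad
  \sum_{i=1}^{n} w_i(x)\, x_i &= x \sum_{i=1}^{n} w_i(x) \\
\Longrightarrow\quad
  x &= \frac{\sum_{i=1}^{n} w_i(x)\, x_i}{\sum_{i=1}^{n} w_i(x)} = m(x).
\end{align*}
```

The normalizing constant $1/(n h^p (2\pi)^{p/2})$ appears in both numerator and denominator of [8] and cancels, so $m(x)$ is simply a kernel-weighted average of the observations.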
The quantity $m(x)$ above is called the mean shift, giving its name to this class of methods. The idea of the mean shift clustering algorithms is to start with an observation $x_i$ and to iteratively "shift" it to $m(x_i)$ until the trajectory converges to a fixed point; this point of convergence will be common to multiple observations; all the observations $x_i$ that converge to the same point form a cluster. The Simple Mean Shift algorithm is described below.

SIMPLE MEAN SHIFT Algorithm
Input: Observations $x_1, \ldots, x_n$, bandwidth $h$
1. for $i = 1:n$ do
   (a) $x \leftarrow x_i$
   (b) iterate $x \leftarrow m(x)$ until convergence to $m_i$.
2. group observations with same $m_i$ in a cluster.
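The Simple Mean Shift steps can be sketched in plain Python as below. This is an illustrative sketch, not a reference implementation: observations are tuples of floats, and the convergence tolerance `tol`, the iteration cap `max_iter`, and the tolerance `merge_tol` used to decide that two trajectories ended at the "same" point are assumptions introduced for the example.

```python
import math


def mean_shift_point(x, data, h):
    """m(x): Gaussian-kernel weighted average of the observations."""
    weights = [math.exp(-math.dist(x, xi) ** 2 / (2.0 * h * h)) for xi in data]
    total = sum(weights)
    return tuple(sum(w * xi[d] for w, xi in zip(weights, data)) / total
                 for d in range(len(x)))


def simple_mean_shift(data, h, tol=1e-8, max_iter=1000, merge_tol=1e-3):
    """Return (labels, modes): a cluster label per observation and the
    distinct convergence points found."""
    modes, labels = [], []
    for xi in data:
        x = tuple(xi)
        for _ in range(max_iter):  # iterate x <- m(x) until it stops moving
            x_new = mean_shift_point(x, data, h)
            moved = math.dist(x, x_new)
            x = x_new
            if moved < tol:
                break
        # Group trajectories that ended at (numerically) the same point.
        for k, m in enumerate(modes):
            if math.dist(x, m) < merge_tol:
                labels.append(k)
                break
        else:
            modes.append(x)
            labels.append(len(modes) - 1)
    return labels, modes
```

Each call to `mean_shift_point` touches all $n$ observations, which is the cost the text refers to when it notes that the algorithm can be slow for large samples.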
This algorithm can be slow when the sample size $n$ is large, because it has to evaluate the mean shift, which depends on all the $n$ observations, many times for each individual observation. In (20) a faster variant is given, which is recommended for large sample sizes. The Gaussian Blurring Mean Shift (GBMS) Algorithm (21, 25) is a variant of Simple Mean Shift, where instead of following the trajectory $x \leftarrow m(x)$ from an observation to a mode of the KDE, the observations themselves are "shifted" at each step.

GAUSSIAN BLURRING MEAN SHIFT Algorithm
Input: Observations $x_1, \ldots, x_n$, bandwidth $h$, radius $\epsilon$.
• Iterate until STOP
  1. for $i = 1:n$ compute $m(x_i)$
  2. for $i = 1:n$ compute $x_i \leftarrow m(x_i)$.
• Cluster all the observations within $\epsilon$ of each other in the same cluster.
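A minimal sketch of GBMS follows, under stated assumptions: the STOP condition is replaced by a fixed number of iterations `n_iter`, the radius is taken as a fraction `eps_fraction` of the bandwidth, and "within the radius of each other" is interpreted transitively (chains of close points are joined). All names and defaults are hypothetical choices for the illustration.

```python
import math


def blur_step(points, h):
    """One GBMS iteration: every point is replaced by its mean shift m(x_i),
    computed from the current (already shifted) set of points."""
    new_points = []
    for x in points:
        weights = [math.exp(-math.dist(x, y) ** 2 / (2.0 * h * h))
                   for y in points]
        total = sum(weights)
        new_points.append(tuple(
            sum(w * y[d] for w, y in zip(weights, points)) / total
            for d in range(len(x))))
    return new_points


def gbms(data, h, n_iter=6, eps_fraction=0.1):
    """Return (labels, points) after n_iter blurring iterations."""
    points = [tuple(x) for x in data]
    for _ in range(n_iter):  # assumed STOP rule: fixed iteration budget
        points = blur_step(points, h)
    eps = eps_fraction * h
    labels = [-1] * len(points)
    n_clusters = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        labels[i] = n_clusters
        stack = [i]
        while stack:  # join every point reachable through eps-close links
            a = stack.pop()
            for b in range(len(points)):
                if labels[b] == -1 and math.dist(points[a], points[b]) <= eps:
                    labels[b] = n_clusters
                    stack.append(b)
        n_clusters += 1
    return labels, points
```

Unlike Simple Mean Shift, the weights in `blur_step` are computed from the shifted points themselves, so the data set contracts at every iteration.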
Typically, the parameter $\epsilon$, which specifies how close two observations must be for us to assume they are "identical," is a fraction of the kernel bandwidth, e.g., $\epsilon = 0.1h$. It is very important to note that, if the GBMS algorithm is run to convergence, all the observations converge to the same point! Therefore, the algorithm must be stopped before convergence, which is signified by the condition STOP in the algorithm's body. If the algorithm is stopped earlier, we obtain more clusters than if we run it for more iterations, because at each iteration some clusters may coalesce. Hence, the number of clusters can be controlled by the number of iterations we let the algorithm run. The GBMS algorithm converges very fast; typically one can expect to find clusters after as few as 6 iterations (of course, if one waits longer, one will find fewer and larger clusters). A more detailed analysis of GBMS and of the stopping condition is in (23). Simple Mean Shift and GBMS do not generally produce the same clustering, but both have been tested and found to give good results in practice.

As with the level set methods, mean shift methods also depend, non-critically, on the kernel bandwidth $h$. We assume that this parameter was selected according to the standard methods for bandwidth selection for KDE. As is known from KDE theory, a small $h$ leads to many peaks in the KDE, and therefore to many small clusters, while a large $h$ will have the opposite effect.

2.5. Dirichlet Process Mixture Models Clustering
Dirichlet process mixture models (DPMM) are similar to the mixture models in Section 2.2, with the difference that the number of clusters $K$ is unspecified, and the algorithm returns a number of clusters that varies depending on (1) a model parameter $\alpha$ that controls the probability of creating a new cluster and (2) the number of observations $n$. The number of clusters grows slowly with $n$. So, this method is parametric, in the sense that the shapes of the clusters are given (e.g., Gaussian round or elliptic), but it is nonparametric because the number of clusters depends on the observed data. The resulting cluster sizes are unequal, typically with a few large clusters and a large number of small and very small clusters, including single observation clusters. DPMMs have registered many successes in recent years, especially because they are very flexible and can model outliers elegantly (by assigning them to single observation clusters). Clustering data by DPMM is done via Markov Chain Monte Carlo, and the reader is referred to (24) for more details.

3. Pairwise (Dis)Similarity Clustering

In similarity- or dissimilarity-based clustering, the user has a real-valued similarity or dissimilarity measure for each pair of observations $x_i, x_j$. There are $\binom{n}{2} = \frac{n(n-1)}{2}$ pairs of observations, but it is more common to store the (dis)similarities in an $n \times n$ symmetric matrix (i.e., row 1, col 2 element = row 2, col 1 element). Most pairwise clustering methods will accept this matrix (or some function of it) as an input.

We first introduce some notation specific to Section 3. The similarity between two observations $x_i, x_j$ can be written $s(x_i, x_j)$ where $s()$ is a function denoting proximity or closeness. A large value of $s(x_i, x_j)$ indicates that two observations are very similar; a small value indicates very different observations. These similarities are then stored in a matrix $S$ of dimension $n$ by $n$. The element $s_{ij}$ is the similarity between $x_i$ and $x_j$; $s_{ii}$ is the similarity between an observation $x_i$ and itself. However, because a natural inclination is to associate the term "distance" with measuring proximity, large values are often intuitively assigned to objects that are "far apart." For this reason, we often use dissimilarity between two observations instead as a more natural measure of closeness. The dissimilarity between two observations $x_i$ and $x_j$ is then $d(x_i, x_j)$. Small values of $d(x_i, x_j)$ indicate "close" or similar observations; large $d(x_i, x_j)$ values correspond to observations that are "far apart" (or very different) from each other. We can store the dissimilarities in an $n$ by $n$ matrix $D$ where $d_{ij}$ is the dissimilarity between $x_i$ and $x_j$; $d_{ii}$ is the dissimilarity between an observation $x_i$ and itself. The relationship between the similarity measure and a dissimilarity measure could be described as inverse. If one were to order all pairs of observations by their proximity, the most similar pair of observations would be the least dissimilar pair of observations.

We can correspondingly indicate the similarity and dissimilarity between clusters $C_k$ and $C_l$ as $s(C_k, C_l)$, $d(C_k, C_l)$, which allows us to order and/or merge clusters hierarchically. Their (dis)similarity matrices, $S_C$, $D_C$, are of dimension $K$ by $K$.

3.1. Measuring Similarity/Dissimilarity
The most common measure of dissimilarity between two observations is Euclidean distance, i.e., $d(x_i, x_j) = \sqrt{\sum_{l=1}^{p} (x_{il} - x_{jl})^2} = \|x_i - x_j\|$. Here the off-diagonal values of $D$ are non-negative entries and the diagonal values of $D$ are zero (since the Euclidean distance between an observation and itself is zero). This measure is commonly the default dissimilarity measure in some clustering algorithms (e.g., K-means, hierarchical clustering) in statistical software packages. However, there are several other ways to indicate the dissimilarity between two observations. We motivate a few commonly used ones here (5, 25, 26).

The Manhattan, or "city-block", distance is the sum of the absolute value of the differences over the $p$ attributes, i.e., $d(x_i, x_j) = \sum_{l=1}^{p} |x_{il} - x_{jl}|$. Again, observations that are further apart have higher values of $d()$; the diagonal values of the corresponding $D$ matrix are zero. Note that both the Euclidean distance and the Manhattan distance are special cases of the Minkowski distance where for $r \ge 1$,

$$d_r(x_i, x_j) = \left( \sum_{l=1}^{p} |x_{il} - x_{jl}|^r \right)^{1/r}. \qquad [9]$$
A more detailed discussion of different distances in this vein can be found in (5).
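As a small illustration of Eq. [9], the sketch below computes the Minkowski distance and assembles the symmetric $n \times n$ dissimilarity matrix $D$ from it. The function names are hypothetical; `r = 2` recovers the Euclidean distance and `r = 1` the Manhattan distance.

```python
def minkowski(x, y, r):
    """Minkowski distance d_r(x, y) for r >= 1."""
    if r < 1:
        raise ValueError("r must be at least 1")
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def dissimilarity_matrix(data, r=2):
    """Symmetric n x n matrix D with D[i][j] = d_r(x_i, x_j) and a zero
    diagonal."""
    n = len(data)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = minkowski(data[i], data[j], r)
    return D
```

Only the $n(n-1)/2$ pairs above the diagonal are computed; the rest of the matrix is filled in by symmetry.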
In spectral clustering (11), it is common to transform a dissimilarity measure into a similarity measure using a formula akin to $s(x_i, x_j) = e^{-g(d(x_i, x_j))}$. The user needs to choose the particular function $g()$ to transform the distance between observations. This choice can depend on the application and the algorithm. One can also measure the two observations' correlation $r(x_i, x_j)$ over the $p$ variables,

$$r(x_i, x_j) = \frac{\sum_{l=1}^{p} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{p} (x_{il} - \bar{x}_i)^2 \, \sum_{l=1}^{p} (x_{jl} - \bar{x}_j)^2}}, \qquad [10]$$

where $\bar{x}_i = \frac{1}{p} \sum_{l=1}^{p} x_{il}$, the average $x_{il}$ value over the $p$ variables. A correlation of 1 would indicate perfect agreement (or two observations that have zero dissimilarity). Here we are clustering based on similarity; the corresponding $S$ matrix has the value 1 along the diagonal.

In Section 2.3, we described nonparametric methods that assign observations to a mode of a density. With this view, we might also associate closeness with the contours of the density between two observations. One choice might be

$$s(x_i, x_j) = \min_{t \in [0,1]} f(t \cdot x_i + (1 - t) \cdot x_j) \qquad [11]$$
or the minimum of the density along the line segment connecting the two observations (19). Two high-density observations in the same mode then have a high similarity; two observations in different modes whose connecting line segment passes through a valley are dissimilar. Measuring the similarity as a function of a connecting path is also done in diffusion maps where observations surrounded by many close observations correspond to a high number of short connecting paths between observations and so a high similarity (27).

Regardless of choice of (dis)similarity measure, if we suspect that variables with very different scales may unduly influence the measure's value (and so affect the clustering results), we can weight the different attributes appropriately. For example, if the variance of one attribute is much larger than that of another, small differences in the first attribute may contribute a similar amount to the chosen measure as large differences in the second attribute. One choice would be to standardize using the Karl Pearson distance:

$$d(x_i, x_j) = \sqrt{\sum_{l=1}^{p} \frac{(x_{il} - x_{jl})^2}{s_l^2}}, \qquad [12]$$

where $s_l^2$ is the variance of the $l$th variable. Under a change of scale, this distance is invariant. Another common weighted distance is the Mahalanobis distance, which effectively weights the observations by their covariance matrix $\Sigma$:

$$d(x_i, x_j) = \sqrt{(x_i - x_j)^T \, \Sigma^{-1} (x_i - x_j)}. \qquad [13]$$
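The scale invariance of the Karl Pearson distance in Eq. [12] can be checked numerically with a short sketch. The function names are hypothetical, and the use of the sample variance with divisor $n - 1$ is an assumption (the invariance holds for either divisor). When $\Sigma$ is diagonal with entries $s_l^2$, the Mahalanobis distance in Eq. [13] reduces to this same quantity.

```python
import math


def variances(data):
    """Per-variable sample variances s_l^2 (divisor n - 1, an assumption)."""
    n, p = len(data), len(data[0])
    means = [sum(row[l] for row in data) / n for l in range(p)]
    return [sum((row[l] - means[l]) ** 2 for row in data) / (n - 1)
            for l in range(p)]


def pearson_distance(x, y, var):
    """Karl Pearson distance: squared differences weighted by 1 / s_l^2."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, var)))


def rescale(data, factors):
    """Multiply each variable by a constant, e.g., to change its units."""
    return [tuple(value * f for value, f in zip(row, factors)) for row in data]
```

Computing `pearson_distance` on a data set and on `rescale(data, factors)` (with the variances recomputed after rescaling) gives the same value up to rounding, whereas the unweighted Euclidean distance would change.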
Prior to clustering, the user should identify the appropriate (dis)similarity measure and/or distance for the particular application. In addition, attention should be paid to whether or not to scale the variables prior to finding and clustering the (dis)similarity matrix.

3.2. Hierarchical Clustering
Given a pairwise dissimilarity measure, hierarchical linkage clustering algorithms (5, 12–14) "link up" groups in order of closeness to form a tree structure (or dendrogram) from which we extract a cluster solution. Euclidean distance is most commonly used as the dissimilarity measure but not required. There are two types of hierarchical clustering algorithms: divisive and agglomerative. In divisive clustering, a "top-down" approach is taken. All observations start as one group (cluster); we break the one group into two groups according to a criterion and continue breaking apart groups until we have $n$ groups, one for each observation. Agglomerative clustering, the "bottom-up" approach, is more common and the focus of this section. We start with all $n$ observations as groups, merge two groups together according to the criterion, and continue merging until we have one group of $n$ observations. The merge criterion is based on a continually updated inter-group (cluster) dissimilarity matrix $D_C$ where $d(C_k, C_l)$ is the current inter-group distance between clusters $C_k$, $C_l$. The exact formulation of $d(C_k, C_l)$ depends on the choice of linkage method. The basic algorithm is as follows:

HIERARCHICAL AGGLOMERATIVE CLUSTERING Algorithm
Input: A dissimilarity matrix $D$ of dimension $n$ by $n$; the choice of linkage method.
• All observations start as their own group; $D$ is the inter-group dissimilarity matrix.
• Iterate until there is just one group with all observations.
  1. Merge the closest two groups.
  2. Update the inter-group dissimilarities.
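The agglomerative loop above can be sketched as follows for two linkage choices: the smallest pairwise dissimilarity between two groups (single linkage) or the largest (complete linkage). This is an illustrative sketch, not an efficient implementation: for simplicity it recomputes each inter-group dissimilarity from $D$ rather than updating a stored matrix, and the function names and the returned merge history format are assumptions.

```python
def agglomerative(D, linkage="single"):
    """Return the merge history [(members_a, members_b, height), ...]."""
    combine = {"single": min, "complete": max}[linkage]
    clusters = {i: [i] for i in range(len(D))}  # every observation alone
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for idx, a in enumerate(keys):
            for b in keys[idx + 1:]:
                # inter-group dissimilarity under the chosen linkage
                d = combine(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)  # merge closest groups
    return merges


def cut(merges, n, K):
    """Replay the first n - K merges to obtain a K-cluster solution."""
    groups = [[i] for i in range(n)]
    for a, b, _ in merges[:n - K]:
        groups = [g for g in groups if sorted(g) != a and sorted(g) != b]
        groups.append(a + b)
    return sorted(sorted(g) for g in groups)
```

The heights stored in the merge history are the values at which groups would be joined in a dendrogram, and `cut` corresponds to cutting that tree so that `K` branches remain.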
The algorithm requires an a priori definition of the dissimilarity between two groups, i.e., the linkage method. Three commonly used linkage methods are single linkage, complete linkage, and average linkage. Single linkage defines the dissimilarity between two groups as the smallest dissimilarity between a pair of observations, one from each group, i.e., for Euclidean distance,

$$d(C_k, C_l) = \min_{i \in C_k,\, j \in C_l} d(x_i, x_j) = \min_{i \in C_k,\, j \in C_l} \|x_i - x_j\|. \qquad [14]$$

It is characterized by a "chaining" effect and tends to walk through the data linking up each observation to its nearest neighbor without regard to the overall cluster shape. Complete linkage defines the dissimilarity between two groups as the largest dissimilarity between a pair of observations, one from each group, i.e., for Euclidean distance,

$$d(C_k, C_l) = \max_{i \in C_k,\, j \in C_l} d(x_i, x_j) = \max_{i \in C_k,\, j \in C_l} \|x_i - x_j\|. \qquad [15]$$
It tends to partition the data into spherical shapes. Average linkage then defines the dissimilarity between two groups as the average dissimilarity between a pair of observations, one from each group. Other linkage methods include Ward’s, median, and centroid (5). The order in which the groups are linked up is represented in a tree structure called a dendrogram. Two groups are merged at the tree height that corresponds to the inter-group dissimilarity between them. Very similar groups are then connected at very low heights. Dissimilar groups are connected toward the top of the tree. Once constructed, we extract K clusters by cutting the tree at the height corresponding to K branches; any cluster solution with K = 1, 2, ..., n is possible. Choosing K here is a subjective decision; usually we use the dendrogram as a guide and look for natural groupings within the data that correspond to groups of observations where the observations themselves are connected to each other at very low heights, while their groups are connected at taller heights (which would correspond to tight, well-separated clusters). If our groups are well-separated, linkage methods usually show nice separation. For example, Fig. 12.11 has the results for single linkage for the four well-separated groups example using Euclidean distance as a dissimilarity measure. The dendrogram in Fig. 12.11a shows evidence of four groups since there are four groups of observations grouped at low heights that are merged to each other at greater heights. The two spherical groups are on the left of the dendrogram; the two curvilinear groups are on the right. Note that the order that the observations are plotted on the dendrogram is chosen simply for illustrative purposes (i.e., the two spherical groups could have been plotted on the right side of
0.2
478 443 454 408 472 489 490 485 436 426 442 424 466 451 457 458 461 410 474 494 445 453 452 481 459 470 422 439 492 440 444 406 447 421 488 450 404 449 435 416 419 409 493 480 499 423 427 468 500 462 486 412 463 476 448 483 402 441 414 460 437 438 487 491 471 473 425 430 498 446 464 418 495 484 428 433 456 417 467 413 429 403 477 482 432 407 434 401 475 415 469 455 479 496 420 465 431 411 405 497 586 566 591 569 519 579 504 580 534 517 543 590 520 593 521 556 514 553 562 508 549 558 563 576 503 548 507 555 545 537 588 592 518 547 564 578 581 572 505 512 522 540 584 554 529 561 530 574 582 552 600 544 560 598 516 565 597 513 509 536 511 559 575 587 546 523 589 535 583 502 533 570 557 515 541 551 595 532 577 531 538 594 525 573 539 550 585 542 568 571 596 510 567 527 501 528 526 599 506 524 281 269 351 256 257 363 226 315 211 364 245 298 345 282 325 297 350 288 322 268 377 395 399 265 387 203 296 321 277 306 255 287 249 352 217 230 305 360 397 300 301 396 232 357 383 260 358 348 219 238 279 286 272 320 365 303 359 304 331 346 361 371 394 291 234 379 355 218 316 220 271 292 332 247 241 295 205 248 308 214 225 212 261 221 342 222 239 354 233 259 330 389 264 276 390 314 324 367 2 224 335 356 231 400 326 336 386 37542 334 223 236 273 309 378 243 307 278 294 393 229 374 289 263 369 201 338 202 347 340 213 329 323 285 293 317 290 254 215 283 246 284 353 209 312 362 328 237 274 392 240 370 270 208 216 372 319 227 366 210 391 275 373 244 381 251 267 302 252 258 327 339 344 384 398 235 262 337 341 250 299 228 206 380 343 349 207 204 280313 376 311 333 253 382 318 368 385 388 266 310 73 104 57 129 139 141 128 106 120 100 91 190 63 153 82 140 11 43 33 182 32 168 59 110 96 51 167 24 36 22 169 187 23 44 18315 184 199 112 165 115 170 80 159 46 189 118 35 111 52 191 133 174 74 136 146 37 49 94 195 2 48 55 176 90 198 53 76 54 19 193 84 188 14 145 40 10 20 17 79 5 81 157 67 164 114 156 103 99 180 68 172 194 45 181 69 70 123 131 13 7 1051 119 121 3 127 4 97 107 151 62 60 125 113 122 
134 177 25 72 155 161 160 38 83 95 150 26 108 126 71 75 135 149 97192 186 124 21 50 178 4 163 56 78 185 87 179 8 162 27 102 144 58 158 116 18 30 34 39 41 154 117 86 130 132 148 142 175 197 6 98 137 66 109 89 138 29 92 16 85 61 12 143 15 96 42 77 152 166 101 173 171 28 64 93 31 200 88 147
0.0
0.2
0.4
0.4
x2
Height
0.6
0.6
0.8
0.8
1.0
1.0
1.2
0.05
0.2
0.4
x2
An Overview of Clustering Applied to Molecular Biology
Fig. 12.11. (a) Single linkage dendrogram; (b) extracted clusters: K = 4.
Fig. 12.12. (a) Complete linkage dendrogram; (b) extracted clusters: K = 4.
the dendrogram as well). Figure 12.11b contains the clustering results if we cut the tree to extract four branches (around height = 0.08). In this case, we recover the true clusters. Figure 12.12 contains the corresponding results for complete linkage. Note that the four groups are not as well separated in the dendrogram. That is, the differences between the group merge heights and the observation merge heights are not as large, a direct result of the curvilinear groups. (Recall that complete linkage tends to partition the observations into spheres.) Again, we recover the true clusters when we cut the tree into four branches.
The ease with which we extracted the true clusters from the example data may be misleading. When group separation is not obvious, it can be very difficult to determine the number of clusters in the observations. Moreover, groups of observations with high variance or the presence of noise in the data can increase this difficulty. Figure 12.13 contains the single and complete linkage dendrograms for the flow cytometry data. Figure 12.13a exhibits the chaining commonly associated with single linkage and no obvious cluster structure. The complete linkage results in Fig. 12.13b are more promising; however, it is not clear where to cut the tree. Figure 12.14 contrasts two possible complete linkage solutions for K = 4, 5. Note that the observations have been partitioned into “spheres” or symmetric shapes. When faced with an unclear dendrogram, one solution might be to try several different possible values of K and determine the best solution according to another criterion. In similarity-based clustering, we have the analog of parametric and nonparametric clustering in spectral clustering and affinity propagation.
Fig. 12.13. (a) Single linkage dendrogram; (b) complete linkage dendrogram.
3.3. Spectral Clustering
Spectral clustering methods use the eigenvectors and eigenvalues of a matrix that is obtained from the similarity matrix S in order to cluster the data. They have an elegant mathematical formulation and have been found to work well when the similarities reflect the structure of the data well and when there are not too many clusters of roughly equal sizes. Spectral clustering can also tolerate a very small number of outliers.
Fig. 12.14. (a) Complete linkage K = 4; (b) Complete linkage K = 5. Axes: Anti-BrdU FITC and 7 AAD.
Another great advantage of spectral clustering, especially for data with continuous features, is that the cluster shapes can be arbitrary. We exemplify spectral clustering with a simple algorithm, based on (28) and (29).
SPECTRAL CLUSTERING Algorithm
Input: Similarity matrix $S$, number of clusters $K$.
1. Transform $S$: let $D_i = \sum_{j=1}^{n} S_{ij}$ for $i = 1:n$, and $P_{ij} \leftarrow S_{ij}/D_i$ for $i, j = 1:n$. Form the transition matrix $P = [P_{ij}]_{i,j=1}^{n}$.
2. Compute the largest $K$ eigenvalues $\lambda_1 = 1 \geq \lambda_2 \geq \ldots \geq \lambda_K$ and eigenvectors $v_1, \ldots, v_K$ of $P$.
3. Form the matrix $V = [\, v_2 \; v_3 \; \ldots \; v_K \,]$ with $n$ rows and $K - 1$ columns. Let $x_i$ denote the $i$-th row of $V$. The $n$ vectors $x_i$ are a new representation of the observations, and these will be clustered by K-means.
4. Find $K$ initial centers in the following way:
(a) take $c_1$ randomly from $x_1, \ldots, x_n$
(b) for $k = 2, \ldots, K$, set $c_k = \operatorname{argmin}_{x_i} \max_{k}$ […]
[…] $> 0$ and $0$ otherwise.
(b) for all $i$, $a_{ik} \leftarrow \min\{0,\; r_{kk} + \sum_{i' \neq i,k} [r_{i'k}]_+\}$
4. assign an exemplar to $i$ by $k(i) \leftarrow \operatorname{argmax}_{k'} (r_{ik'} + a_{ik'})$
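Step 1 of the spectral clustering algorithm above can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' code; the function name `transition_matrix` and the small similarity matrix `S` are hypothetical. It row-normalizes S into the transition matrix P and checks that P maps the all-ones vector to itself, which is why the largest eigenvalue in step 2 is 1.

```python
def transition_matrix(S):
    """Row-normalize a similarity matrix S (list of lists): P[i][j] = S[i][j] / D[i]."""
    P = []
    for row in S:
        d = sum(row)  # D_i, the total similarity of observation i
        if d <= 0:
            raise ValueError("each row of S needs a positive sum")
        P.append([s / d for s in row])
    return P


# Hypothetical similarity matrix for 4 observations forming two obvious groups.
S = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
P = transition_matrix(S)

# Every row of P sums to 1, so P times the all-ones vector is the all-ones
# vector again: 1 is an eigenvalue of P.
P_times_ones = [sum(p * 1.0 for p in row) for row in P]
```

The eigenvectors needed in steps 2 and 3 would come from a numerical linear algebra routine, which is outside the scope of this sketch.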
The diagonal elements of S representing self-similarities have a special meaning for AP. The larger sii , the more likely i will be to want to become an exemplar. If all sii values are large, then there will be many small clusters; if the sii ’s are decreased, then items will tend to group easily in large clusters. Hence the diagonal values sii are a way of controlling the granularity of the clustering, or equivalently the number of clusters K, and can be set by the user accordingly.
4. Biclustering
A more recent topic of interest is incorporating variable selection into clustering. For example, if there are three variables and only the first two contain clustering information (and, say, the third variable is just random noise), the clustering may be improved if we only cluster on the first two variables. Moreover, it may be that different groups of observations cluster on different variables. That is, the variables that separate or characterize a subset of observations might change depending on the subset. In this section, we briefly mention a biclustering algorithm (36) that simultaneously selects variables and clusters observations. This particular approach is motivated by the analysis of microarray gene expression data where we often have a number of genes n and a number of experimental conditions p. We would like to identify “coherent subsets” of co-regulated genes that show similarity under a subset of conditions. In addition, in contrast to other methods presented in this chapter, we allow overlapping clusters. Define a bicluster $B_k$ as a subset of $n_k$ observations $X_k \subseteq X = \{x_1, x_2, x_3, \ldots, x_n\}$ and $P_k$, a subset of $p_k$ variables from the original $p$ variables. $B_k$ is then represented by an attribute matrix of dimension $n_k \times p_k$. Define the residue of an element $b_{ij}$ in this bicluster matrix $B_k$ as
$$ res_{ij} = b_{ij} - \frac{1}{p_k}\sum_{j=1}^{p_k} b_{ij} - \frac{1}{n_k}\sum_{i=1}^{n_k} b_{ij} + \frac{1}{n_k \cdot p_k}\sum_{j=1}^{p_k}\sum_{i=1}^{n_k} b_{ij}. \qquad [16] $$
This formula equals the element value minus its bicluster row mean minus its bicluster column mean plus the overall bicluster mean. In the gene expression example, $b_{ij}$ might be the logarithm of the relative abundance of the mRNA of a gene $i$ under a specific condition $j$.
We can measure a bicluster by its mean squared residue score $H(X_k, P_k)$:
$$ H(X_k, P_k) = \frac{1}{n_k \cdot p_k} \sum_{i \in X_k,\, j \in P_k} (res_{ij})^2. \qquad [17] $$
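The residue and the mean squared residue score are easy to compute directly from their definitions. The sketch below is only an illustration in plain Python; the function name `mean_squared_residue` and the two small matrices are hypothetical. It shows that a matrix whose rows differ only by a constant shift (expression levels that fluctuate in unison) has a score of zero, while a matrix with crossing patterns does not.

```python
def mean_squared_residue(B):
    """Return (residue matrix, H) for a bicluster matrix B given as a list of rows."""
    n, p = len(B), len(B[0])
    row_mean = [sum(row) / p for row in B]
    col_mean = [sum(B[i][j] for i in range(n)) / n for j in range(p)]
    all_mean = sum(row_mean) / n
    # Residue: element minus row mean minus column mean plus overall mean.
    res = [
        [B[i][j] - row_mean[i] - col_mean[j] + all_mean for j in range(p)]
        for i in range(n)
    ]
    H = sum(r * r for row in res for r in row) / (n * p)
    return res, H


# Rows 2 and 3 are row 1 shifted by a constant: a perfectly coherent bicluster.
additive = [[1.0, 2.0, 4.0], [3.0, 4.0, 6.0], [0.0, 1.0, 3.0]]
_, H_additive = mean_squared_residue(additive)

# The two rows move in opposite directions across the two columns.
crossing = [[1.0, 2.0], [2.0, 1.0]]
_, H_crossing = mean_squared_residue(crossing)
```

For the crossing matrix every residue is plus or minus 0.5, so the score is 0.25.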
We would like to find a bicluster of maximal size with a low mean squared residue. A bicluster $B_k$ is called a δ-bicluster if $H(X_k, P_k) < \delta$, where δ is some amount of variation that we are willing to see in the bicluster. Note that the lowest possible mean squared residue score of $H(X_k, P_k) = 0$ corresponds to a bicluster in which all the gene expression levels fluctuate in unison. Trivial biclusters of one element (one observation, one variable) also have a zero mean squared residue score and so are biclusters for all values of δ. In general, we do a greedy search over the observations and the variables to find the bicluster with the lowest H() score. Briefly, the algorithm is as follows:
BICLUSTERING Algorithm
Input: An attribute matrix X of dimension n by p; the number of biclusters K; δ
• for k in 1:K
1. Set $H(X_k, P_k) = H(X, P)$, the mean squared residue score for all observations, all variables; if $H(X_k, P_k) < \delta$, go to 4.
2. While $H(X_k, P_k) \geq \delta$,
(a) for all $i \in X_k$, compute each row’s contribution to $H(X_k, P_k)$: $d(i) = \frac{1}{p_k} \sum_{j \in P_k} (res_{ij})^2$; for all $j \in P_k$, compute each column’s contribution to $H(X_k, P_k)$: $d(j) = \frac{1}{n_k} \sum_{i \in X_k} (res_{ij})^2$.
(b) remove the row or column that corresponds to the largest d(), i.e., the largest decrease in $H(X_k, P_k)$
(c) update $X_k$, $P_k$, $H(X_k, P_k)$.
3. While $H(X_k, P_k) < \delta$,
(a) for all $i \in X$, compute each row’s possible contribution to $H(X_k, P_k)$: $d(i) = \frac{1}{p_k} \sum_{j \in P_k} (res_{ij})^2$; for all $j \in P$, compute each column’s possible contribution to $H(X_k, P_k)$: $d(j) = \frac{1}{n_k} \sum_{i \in X_k} (res_{ij})^2$.
(b) add the row/column with largest d() that satisfies $H(X_k, P_k) + d() < \delta$. If nothing is added, return the current $X_k$, $P_k$; else, update $X_k$, $P_k$, $H(X_k, P_k)$.
4. “Mask” the final bicluster in the matrix X.
The algorithm essentially first removes observations and/or variables that are not part of a coherent subset, and then, once the mean squared residue score is below the threshold, cycles back through the discarded observations and variables to check if any could be added back to the bicluster (looking for maximal size). It is a greedy algorithm and can be computationally expensive. Alternative algorithms are presented in (36) to handle simultaneous deletion/addition of multiple observations and/or variables. Once a bicluster has been found, we “mask” it in the original matrix to prevent its rediscovery; one suggested technique is to replace the bicluster elements in the matrix with random numbers. There are several other clustering methods that simultaneously cluster observations and variables including Plaid Models (37), Clustering Objects on Subsets of Attributes (38), and Variable Selection in Model-Based Clustering (39). The user should determine a priori the appropriate algorithm for the application.
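The removal phase described above can be sketched as a short greedy loop. This is a hedged illustration in plain Python rather than the implementation of (36): the function names `msr` and `delete_until_coherent`, the toy matrix and the threshold are all hypothetical, and the loop assumes that rows and columns are removed one at a time for as long as the score is at or above δ. The add-back phase and the masking step are not shown.

```python
def msr(X, rows, cols):
    """Score H plus per-row and per-column contributions d() for a sub-matrix of X."""
    row_mean = {i: sum(X[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(X[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    sq = {
        (i, j): (X[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in rows
        for j in cols
    }
    H = sum(sq.values()) / (len(rows) * len(cols))
    d_row = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    d_col = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return H, d_row, d_col


def delete_until_coherent(X, delta):
    """Greedily drop the row or column with the largest d() until H < delta."""
    rows = list(range(len(X)))
    cols = list(range(len(X[0])))
    H, d_row, d_col = msr(X, rows, cols)
    while H >= delta and (len(rows) > 1 or len(cols) > 1):
        worst_row = max(rows, key=d_row.get)
        worst_col = max(cols, key=d_col.get)
        # Remove whichever contributes more; never remove the last row or column.
        if len(cols) == 1 or (len(rows) > 1 and d_row[worst_row] >= d_col[worst_col]):
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
        H, d_row, d_col = msr(X, rows, cols)
    return rows, cols, H


# Rows 0-2 rise in unison across the columns; row 3 does not follow the pattern.
X = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [4.0, 5.0, 6.0],
    [9.0, 0.0, 5.0],
]
kept_rows, kept_cols, H_final = delete_until_coherent(X, delta=0.1)
```

On this toy matrix the incoherent fourth row has by far the largest contribution, so it is removed first and the remaining three rows form a bicluster with a score of essentially zero.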
5. Comparing Clusterings
To assess the performance of a clustering algorithm by comparing its output to a given “correct” clustering, one needs to define a “distance” on the space of partitions of a data set. Distances between clusterings are rarely used alone. For instance, one may use a “distance” d to compare clustering algorithms. A likely scenario is that one has a data set D, with a given “correct” clustering. Algorithm A is used to cluster D, and the resulting clustering is compared to the correct one via d. If the algorithm A is not completely deterministic (e.g., the result may depend on initial conditions, as in K-means), the operation may be repeated several times, and the resulting distances to the correct clustering may be averaged to yield the algorithm’s average performance. Moreover, this average may be compared to another average distance obtained in the same way for another algorithm A'. Thus, in practice, distances between clusterings are subject to addition, subtraction, and even more complex operations. As such, we want to have a clustering comparison criterion that will license such operations, inasmuch as it makes sense in the context of the application.
Virtually all criteria for comparing clusterings can be described using the so-called confusion matrix, or association matrix or contingency table, of the pair $C, C'$. The contingency table is a $K \times K'$ matrix whose $kk'$-th element is the number of observations in the intersection of clusters $C_k$ of $C$ and $C'_{k'}$ of $C'$:
$$ n_{kk'} = |C_k \cap C'_{k'}|. $$
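As a small illustration of the contingency table, the sketch below counts co-occurrences of cluster labels in plain Python. The function name `contingency_table` and the two label vectors are hypothetical; each clustering is assumed to be given as one label per observation, in the same observation order.

```python
from collections import Counter


def contingency_table(labels_a, labels_b):
    """Rows follow the sorted labels of the first clustering, columns the second."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same observations")
    ks = sorted(set(labels_a))
    kps = sorted(set(labels_b))
    counts = Counter(zip(labels_a, labels_b))  # missing label pairs count as 0
    return [[counts[(k, kp)] for kp in kps] for k in ks]


# Six observations, clustered into K = 3 groups and into K' = 2 groups.
C = [1, 1, 1, 2, 2, 3]
C_prime = ["a", "a", "b", "b", "b", "b"]
table = contingency_table(C, C_prime)
```

Here `table` is a 3 by 2 matrix; its row sums give the cluster sizes of the first clustering and its column sums those of the second.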
5.1. Comparing Clusterings by Counting Pairs
An important class of criteria for comparing clusterings is based on counting the pairs of observations on which two clusterings agree/disagree. A pair of observations from D can fall under one of the four cases described below:
$N_{11}$: number of pairs that are in the same cluster under both $C$ and $C'$
$N_{00}$: number of pairs in different clusters under both $C$ and $C'$
$N_{10}$: number of pairs in the same cluster under $C$ but not under $C'$
$N_{01}$: number of pairs in the same cluster under $C'$ but not under $C$
The four counts always satisfy $N_{11} + N_{00} + N_{10} + N_{01} = n(n - 1)/2$. They can be obtained from the contingency table $[n_{kk'}]$. For example, $2N_{11} = \sum_{k,k'} n_{kk'}^2 - n$. See (40) for details.
Fowlkes and Mallows (40) introduced a criterion which is symmetric and is the geometric mean of the probabilities that a pair of observations which are in the same cluster under $C$ (respectively, $C'$) are also in the same cluster under the other clustering. With $n_k = |C_k|$ and $n'_{k'} = |C'_{k'}|$,
$$ F(C, C') = \sqrt{ \frac{N_{11}}{\sum_{k} n_k (n_k - 1)/2} \cdot \frac{N_{11}}{\sum_{k'} n'_{k'} (n'_{k'} - 1)/2} }. \qquad [18] $$
It can be shown that this index represents a scalar product (41).
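The four pair counts and the Fowlkes–Mallows index can be checked on a small example. The following plain-Python sketch is illustrative only; the function names and label vectors are hypothetical. It enumerates all pairs directly, which is fine for small n but quadratic in the number of observations, and it assumes each clustering has at least one cluster with two or more observations so that the denominator is not zero.

```python
from itertools import combinations
from math import sqrt


def pair_counts(a, b):
    """Return (N11, N00, N10, N01) for two label vectors over the same observations."""
    n11 = n00 = n10 = n01 = 0
    for i, j in combinations(range(len(a)), 2):
        same_a = a[i] == a[j]
        same_b = b[i] == b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n00, n10, n01


def fowlkes_mallows(a, b):
    n11, _, n10, n01 = pair_counts(a, b)
    # N11 + N10 counts pairs together under the first clustering,
    # N11 + N01 counts pairs together under the second.
    return n11 / sqrt((n11 + n10) * (n11 + n01))


C = [1, 1, 1, 2, 2, 3]
C_prime = [1, 1, 2, 2, 2, 2]
counts = pair_counts(C, C_prime)
F = fowlkes_mallows(C, C_prime)
F_identical = fowlkes_mallows(C, C)
```

For these six observations there are 15 pairs in total, of which 2 are together under both clusterings.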
The Fowlkes–Mallows index F has a baseline that is the expected value of the criterion under a null hypothesis corresponding to “independent” clusterings (40). The index is used by subtracting the baseline and normalizing by the range, so that the expected value of the normalized index is 0 while the maximum (attained for identical clusterings) is 1. Note that some pairs of clusterings may theoretically result in negative indices under this normalization. A similar transformation was introduced by (42) for Rand’s index (43)
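$$ R(C, C') = \frac{N_{11} + N_{00}}{n(n - 1)/2}. \qquad [19] $$
The resulting adjusted Rand index has the expression
$$ AR(C, C') = \frac{R(C, C') - E[R]}{1 - E[R]} = \frac{\sum_{k=1}^{K} \sum_{k'=1}^{K'} \binom{n_{kk'}}{2} - \left[ \sum_{k=1}^{K} \binom{n_k}{2} \right] \left[ \sum_{k'=1}^{K'} \binom{n'_{k'}}{2} \right] \Big/ \binom{n}{2}}{\left[ \sum_{k=1}^{K} \binom{n_k}{2} + \sum_{k'=1}^{K'} \binom{n'_{k'}}{2} \right] \Big/ 2 - \left[ \sum_{k=1}^{K} \binom{n_k}{2} \right] \left[ \sum_{k'=1}^{K'} \binom{n'_{k'}}{2} \right] \Big/ \binom{n}{2}}. \qquad [20] $$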
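Both indices can be computed from cluster sizes and contingency counts without enumerating pairs. The sketch below is a plain-Python illustration with hypothetical names and data; it uses `math.comb`, available in Python 3.8 and later, and assumes the two clusterings are not both the trivial one-cluster partition, for which the adjusted index is undefined.

```python
from collections import Counter
from math import comb


def rand_indices(a, b):
    """Return (Rand index, adjusted Rand index) for two label vectors."""
    total = comb(len(a), 2)                                            # all pairs
    sum_joint = sum(comb(c, 2) for c in Counter(zip(a, b)).values())   # N11
    sum_a = sum(comb(c, 2) for c in Counter(a).values())               # N11 + N10
    sum_b = sum(comb(c, 2) for c in Counter(b).values())               # N11 + N01
    n00 = total - sum_a - sum_b + sum_joint
    rand = (sum_joint + n00) / total
    expected = sum_a * sum_b / total      # baseline term shared by numerator and denominator
    adjusted = (sum_joint - expected) / ((sum_a + sum_b) / 2 - expected)
    return rand, adjusted


C = [1, 1, 1, 2, 2, 3]
C_prime = [1, 1, 2, 2, 2, 2]
R, AR = rand_indices(C, C_prime)
R_same, AR_same = rand_indices(C, C)
```

On this example the unadjusted index is 8/15 while the adjusted index is close to zero, which illustrates how much of the raw agreement is attributed to the baseline.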
The main motivation for adjusting indices like R, F is the observation that the unadjusted R, F do not range over the entire [0, 1] interval (i.e., min R > 0, min F > 0). In practice, the R index concentrates in a small interval near 1; this situation was well illustrated by (40). The use of adjusted indices is not without problems. First, some researchers (44) have expressed concerns about the plausibility of the null model. Second, the value of the baseline for F varies sharply, between near 0.6 and near 0, for n/K > 3. The useful range of the criterion thus varies from approximately [0, 1] to approximately [0.6, 1] (40). The adjusted Rand index baseline, as shown by the simulations of (40), varies even more: 0.5–0.95. This variation makes these criteria hard to interpret; one needs more information than the index value alone to judge whether two clusterings are close or far apart.
5.2. Comparing Clusterings by Set Matching
A second category of criteria is based on set cardinality alone and does not make any assumption about how the clusterings may have been generated. The misclassification error criterion is widely used in the engineering and computer science literature. It represents the probability of the labels disagreeing on an observation, under the best possible label correspondence. Intuitively, one first finds a “best match” between the clusters of $C$ and those of $C'$. If $K = K'$, then this is a one-to-one mapping; otherwise, some clusters will be left unmatched. Then, H is computed as the total “unmatched” probability mass in the confusion matrix. More precisely,
$$ H(C, C') = 1 - \frac{1}{n} \max_{\pi} \sum_{k=1}^{K} n_{k,\pi(k)}. \qquad [21] $$
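For a small number of clusters, the maximum over label correspondences in [21] can be found by brute force. The sketch below is an illustration in plain Python under that assumption; the function name and label vectors are hypothetical. It tries every one-to-one assignment of the labels of the clustering with fewer clusters to labels of the other, so its cost grows factorially with the number of clusters; for larger problems the same maximum is an assignment problem that dedicated solvers handle efficiently.

```python
from collections import Counter
from itertools import permutations


def misclassification_error(a, b):
    """1 minus the largest matched fraction over all one-to-one label correspondences."""
    ks, kps = sorted(set(a)), sorted(set(b))
    if len(ks) > len(kps):
        # Make the first clustering the one with fewer clusters (K <= K').
        a, b, ks, kps = b, a, kps, ks
    counts = Counter(zip(a, b))
    best = max(
        sum(counts[(k, kp)] for k, kp in zip(ks, match))
        for match in permutations(kps, len(ks))
    )
    return 1 - best / len(a)


C = [1, 1, 1, 2, 2, 3]
C_prime = [1, 1, 2, 2, 2, 2]
H = misclassification_error(C, C_prime)

# Relabeling the clusters of C does not change the partition.
C_relabeled = [2, 2, 2, 3, 3, 1]
H_relabeled = misclassification_error(C, C_relabeled)
```

In the first comparison the best correspondence matches 4 of the 6 observations, so one third of the observations are counted as misclassified.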
In the above, it is assumed without loss of generality that $K \leq K'$, π is an injective mapping of $\{1, \ldots, K\}$ into $\{1, \ldots, K'\}$, and the maximum is taken over all such mappings. In other words, for each π we have a (partial) correspondence between the cluster labels in $C$ and $C'$; now looking at clustering as a classification task with the fixed label correspondence, we compute the classification error of $C'$ with respect to $C$. The minimum possible classification error under all correspondences is H. The index is symmetric and takes value 0 for identical clusterings. Further properties of this index are discussed in (45, 46).
5.3. Information Theoretic Measure
Imagine the following game: if we were to pick an observation from D, how much uncertainty is there about which cluster it is going to be assigned to? Assuming that each observation has an equal probability of being picked, it is easy to see that the probability of the outcome being in cluster $C_k$ equals
$$ P(k) = \frac{n_k}{n}. \qquad [22] $$
Thus we have defined a discrete random variable taking K values that is uniquely associated with the clustering $C$. The uncertainty in our game is equal to the entropy of this random variable:
$$ H(C) = -\sum_{k=1}^{K} P(k) \log P(k). \qquad [23] $$
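The entropy in [23] depends only on the relative cluster sizes, which a few lines of code make concrete. This is a minimal plain-Python sketch with a hypothetical function name; it uses base-2 logarithms so that the result is in bits.

```python
from collections import Counter
from math import log2


def clustering_entropy(labels):
    """Entropy (in bits) of the cluster-membership distribution P(k) = n_k / n."""
    n = len(labels)
    # p * log2(1 / p) equals -p * log2(p); written this way the sum is never negative zero.
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


H_two_equal = clustering_entropy([1, 1, 2, 2])          # two equal clusters
H_one = clustering_entropy([7, 7, 7, 7, 7])             # a single cluster
H_four_equal = clustering_entropy([1, 2, 3, 4] * 10)    # four equal clusters, n = 40
H_four_small = clustering_entropy([1, 2, 3, 4])         # same proportions, n = 4
```

Two equal clusters give 1 bit, a single cluster gives 0, and the two four-cluster examples give the same value because only the proportions matter.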
We call $H(C)$ the entropy associated with clustering $C$. For more details about the information theoretical concepts presented here, the reader is invited to consult (47). Entropy is always non-negative. It takes value 0 only when there is no uncertainty, namely when there is only one cluster. Entropy is measured in bits when the logarithm is taken in base 2. The uncertainty of 1 bit corresponds to a clustering with K = 2 and P(1) = P(2) = 0.5. Note that the uncertainty does not depend on the number of observations in D but on the relative proportions of the clusters. We now define the mutual information between two clusterings, that is, the information that one clustering has about the other. Denote by $P(k)$, $k = 1, \ldots, K$ and $P'(k')$, $k' = 1, \ldots, K'$ the random variables associated with the clusterings $C, C'$. Let $P(k, k')$ represent the probability that an observation belongs to $C_k$ in clustering $C$ and to $C'_{k'}$ in $C'$, namely, the joint distribution of the random variables associated with the two clusterings:
$$ P(k, k') = \frac{|C_k \cap C'_{k'}|}{n}. \qquad [24] $$
We define $I(C, C')$, the mutual information between the clusterings $C, C'$, to be equal to the mutual information between the associated random variables:
$$ I(C, C') = \sum_{k=1}^{K} \sum_{k'=1}^{K'} P(k, k') \log \frac{P(k, k')}{P(k) P'(k')}. \qquad [25] $$
Intuitively, we can think of $I(C, C')$ in the following way. We are given a random observation in D. The uncertainty about its cluster in $C$ is measured by $H(C)$. Suppose now that we are told which cluster the observation belongs to in $C'$. How much does this knowledge reduce the uncertainty about $C$? This reduction in uncertainty, averaged over all observations, is equal to $I(C, C')$. The mutual information between two random variables is always non-negative and symmetric. The quantity
$$ VI(C, C') = H(C) + H(C') - 2I(C, C') = \sum_{k=1}^{K} \sum_{k'=1}^{K'} P(k, k') \log \frac{P(k) P'(k')}{P(k, k')^2} \qquad [26] $$
is called the variation of information (48) between the two clusterings. On closer examination, this is the sum of two non-negative terms:
$$ VI(C, C') = [H(C) - I(C, C')] + [H(C') - I(C, C')]. \qquad [27] $$
The two terms represent the conditional entropies $H(C|C')$, $H(C'|C)$. The first term measures the amount of information about $C$ that we lose, while the second measures the amount of information about $C'$ that we have to gain, when going from clustering $C$ to clustering $C'$. The VI satisfies the triangle inequality and has other naturally desirable properties. For instance, if between two clusterings $C, C'$ some clusters are kept identical, and some other clusters are changed, then the VI does not depend on the partitioning of the data in the clusters that are identical between $C$ and $C'$. Unlike the previous distances, the VI is not bounded between 0 and 1. If $C$ and $C'$ have at most $K^*$ clusters each, with $K^* \leq \sqrt{n}$, then $VI(C, C') \leq 2 \log K^*$.
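The mutual information and the variation of information follow directly from the marginal and joint cluster proportions. The sketch below is a plain-Python illustration with hypothetical names and data, using base-2 logarithms. It checks two extreme cases: identical clusterings, and two clusterings that carry no information about each other.

```python
from collections import Counter
from math import log2


def variation_of_information(a, b):
    """Return (VI, mutual information), both in bits, for two label vectors."""
    n = len(a)
    pa = {k: c / n for k, c in Counter(a).items()}             # P(k)
    pb = {k: c / n for k, c in Counter(b).items()}             # P'(k')
    pab = {kk: c / n for kk, c in Counter(zip(a, b)).items()}  # P(k, k')
    h_a = sum(p * log2(1 / p) for p in pa.values())
    h_b = sum(p * log2(1 / p) for p in pb.values())
    mi = sum(p * log2(p / (pa[k] * pb[kp])) for (k, kp), p in pab.items())
    return h_a + h_b - 2 * mi, mi


C = [1, 1, 2, 2]
VI_same, I_same = variation_of_information(C, C)

# Each cluster of C is split evenly across the clusters of C_cross.
C_cross = [1, 2, 1, 2]
VI_cross, I_cross = variation_of_information(C, C_cross)
```

Identical clusterings give a VI of 0. The crossed pair has zero mutual information and a VI of 2 bits, which equals the bound 2 log K* for K* = 2 and n = 4.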
5.4. Comparison Between Criteria
The vast literature on comparing clusterings suggests that criteria like R, F, K, J need to be shifted and rescaled in order to allow their values to be compared. The existing rescaling methods are based on a null model which, although reasonable, is nevertheless artificial. By contrast, the variation of information and the misclassification error make no assumptions about how the clusterings may be generated and require no rescaling to compare values of $VI(C, C')$ for arbitrary pairs of clusterings of a data set. Moreover, the variation of information and misclassification error do not directly depend on the number of observations in the set. This feature gives a much stronger ground for comparisons across data sets, something we need to do if we want to compare clustering algorithms against each other. Just as one cannot define a “best” clustering method out of context, one cannot define a criterion for comparing clusterings that fits every problem optimally. Here we present a comprehensible picture of the properties of the various criteria, in order to allow a user to make informed decisions.
6. Concluding Remarks
In this chapter, we have given a brief overview of several common clustering methods, both attribute-based and similarity-based. Each clustering method can be characterized by what type of input is needed and what types or shapes of clusters it discovers. Table 12.1 summarizes some of the basic features of the different algorithms. Prior to clustering, the user should think carefully about the application and identify the type of data available (e.g., attributes, similarities), whether or not there are any preconceived notions of the number or shape of the hypothesized clusters, and the computational power required. Clustering is a powerful tool to discover underlying group structure in data; however, careful attention must be paid to the assumptions inherent in the choices of clustering method and any needed parameters. Users should not simply rely on suggested parameter values but determine (and validate) choices that fit their particular application problem.
Table 12.1 Overview of clustering methods

Method                       | Type       | Chooses K                      | Cluster shapes
K-means                      | Attribute  | User                           | Spherical
K-medoids                    | Attribute  | User                           | Spherical
Model-based Clustering       | Attribute  | Method (user can define range) | Spherical, elliptical
Nonparametric Clustering     | Attribute  | Method                         | No restrictions
Simple Mean Shift            | Attribute  | Method                         | No restrictions
Gaussian Blurring Mean Shift | Attribute  | Method                         | No restrictions
Dirichlet Mixture Model      | Attribute  | Method                         | Spherical, elliptical
Hierarchical Clustering      | Similarity | User                           | Linkage dependent
  Single                     |            |                                | No restrictions
  Complete                   |            |                                | Spherical
Spectral Clustering          | Similarity | User                           | No restrictions
Affinity Propagation         | Similarity | Method                         | No restrictions
Biclustering                 | Attribute  | User                           | No restrictions (can be overlapping)

7. R Code
The statistical software package R has been used to find the clustering results shown in this chapter. R is a freely available, multiplatform (Windows, Linux, Mac OS) program that is widely used in statistics/quantitative courses. It provides a programming language that can be easily supplemented by the user. It also allows the production of publication-quality graphics. See http://www.r-project.org for more details.
#####R Code Used to Cluster Data and Create Pictures
#####Code included for a generic data set: data
#####For help in R using a function (fxn.name), type help(fxn.name) at the prompt
#####Initially reading the data into R and plotting the data
data

$$\widehat{FP}(d) = \sum_{b=1}^{B} \#\{i : z_i^{(b)} > d\} / B,$$
where $z_i^{(b)}$ is the test statistic calculated from the bth permuted data set and B is the total number of permutations. This standard method may overestimate FP (16), and furthermore, the magnitude of the induced bias may depend on the test statistic being used (21). Hence, it is not appropriate to use the resulting FDR estimates to evaluate different statistics. In this chapter, we will use a modified method proposed by Xie et al. (21) to estimate FP. The idea is quite simple: the overestimation of the standard method is mainly caused by the existence of target genes; if we use only the predicted non-target genes to estimate the null distribution, it will reduce the impact of target genes and improve the estimation of FP. Specifically, we use only non-significant genes to estimate FP:
$$\widehat{FP}(d) = \sum_{b=1}^{B} \#\{i : z_i^{(b)} > d \ \&\ Z_i \le d\} / B.$$
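The two estimators can be compared side by side in a short sketch. This is only an illustration under stated assumptions, not the authors' code: the function names `fp_standard` and `fp_modified`, the array layout (permutations in rows, genes in columns), and the toy numbers are all hypothetical. "Non-significant" is read here as $Z_i \le d$, and raw rather than absolute statistics are compared with d.

```python
import numpy as np

def fp_standard(z_perm, d):
    """Standard estimate: permuted statistics exceeding d, averaged over B permutations."""
    z_perm = np.asarray(z_perm, dtype=float)   # shape (B, G): B permutations, G genes
    return np.sum(z_perm > d) / z_perm.shape[0]

def fp_modified(Z, z_perm, d):
    """Modified estimate: count only genes whose observed statistic is not significant."""
    Z = np.asarray(Z, dtype=float)             # shape (G,): observed statistics
    z_perm = np.asarray(z_perm, dtype=float)
    non_significant = Z <= d                   # assumption: "non-significant" means Z_i <= d
    return np.sum(z_perm[:, non_significant] > d) / z_perm.shape[0]

# Toy data (hypothetical): 3 genes, 2 permutations
Z_obs = np.array([3.0, 0.5, 0.1])
z_null = np.array([[2.5, 1.5, 0.2],
                   [0.3, 2.2, 1.2]])

fp_std = fp_standard(z_null, d=1.0)
fp_mod = fp_modified(Z_obs, z_null, d=1.0)
```

In this toy case the first gene is called significant, so its permuted statistics are left out of the modified count, which lowers the estimate relative to the standard one.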
Xie et al. (21) gave more detailed descriptions and justifications for this method. In testing a null hypothesis, a false rejection occurs because just by chance the test statistic value (or its absolute value) is too large. Hence, if we know a priori that the null hypothesis is likely to be true, we can shrink the test statistic (or its absolute value) toward zero, which will reduce the chance of making a false positive. In the current context, among all the genes, suppose that based on some prior information we can specify a list of genes for which the null hypothesis is likely to hold, thus their test and null statistics are to be shrunken. If gene i is in the list, then for a given threshold value s, its test and null statistics are shrunken:
$$Z_i(s) = \mathrm{sign}(Z_i)\,(|Z_i| - s)_+, \qquad z_i^{(b)}(s) = \mathrm{sign}(z_i^{(b)})\,(|z_i^{(b)}| - s)_+,$$
where $f_+ = f$ if $f > 0$ and $f_+ = 0$ if $f \le 0$. On the other hand, if gene i is not in the list, then $Z_i(s) = Z_i$ and $z_i^{(b)}(s) = z_i^{(b)}$.
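The shrinkage rule above can be sketched in a few lines. The function name `soft_threshold`, the boolean mask `in_list` marking genes on the prior list, and the example values are assumptions made for illustration only.

```python
import numpy as np

def soft_threshold(stat, s, in_list):
    """Shrink statistics of listed genes toward zero by s; leave other genes unchanged."""
    stat = np.asarray(stat, dtype=float)
    shrunk = np.sign(stat) * np.maximum(np.abs(stat) - s, 0.0)  # sign(Z)(|Z| - s)_+
    return np.where(in_list, shrunk, stat)

# Hypothetical observed statistics for four genes; gene 3 is not on the prior list
Z = np.array([2.5, -0.3, 1.0, -1.8])
in_list = np.array([True, True, False, True])
Z_s = soft_threshold(Z, s=0.4, in_list=in_list)
```

Because the mask broadcasts over the last axis, the same function can be applied to a (B, G) array of permuted statistics, so test and null statistics are shrunken in the same way.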
We proceed as before to draw statistical inference using $Z_i(s)$ and $z_i^{(b)}(s)$. The shrinkage method we use is called soft thresholding (22, 23), in contrast to the usual hard thresholding. In hard thresholding, when $|Z_i|$ is larger than s, the statistic remains unchanged rather than being shrunken toward zero by the amount s, as in soft thresholding. This property makes the statistics generated by hard thresholding “jumpy,” since as the threshold s changes, the statistics of some genes may suddenly jump between zero and their original values. How many genes to shrink and how much to shrink are important parameters to be determined in data analysis. Xie et al. (14) proposed taking multiple trials using various parameter values and then using the estimated FDR as a criterion to choose the optimal parameter values. It has been shown to work well. In practice, we suggest using the area under the curve (AUC) as a measure to compare estimated FDRs and tune the parameter. Specifically, we try different s values, for example, s = 0, 0.2, 0.4, and 0.6. For each s value, we estimate FDRs for different numbers of total positive genes and then plot FDR vs. the number of total positive genes. We thus obtain one curve for each s value and calculate the AUC for each curve. We choose the s value with the lowest AUC as the optimal parameter value. The idea of using shrinkage to combine two data sets is simple: in the example with gene expression data and DNA–protein binding data, we first use the expression data to generate a prior gene list in which the genes are more likely to be “non-targets” of the protein, and then we shrink the statistics of these genes toward null values in the binding data analysis. We treat all the genes in the gene list equally; this simplicity makes the shrinkage method flexible and thus applicable to combining various types of data or prior knowledge.
For example, if we can generate a list of genes for which the null hypothesis is more likely to hold based on a literature review or some relevant databases, such as Gene Ontology (GO), then we can incorporate this prior information into the following analysis by using our proposed shrinkage method. An alternative way of a combined analysis is to make the amount of shrinkage for each gene depend on the probability of that gene in the gene list (or equivalently, on the amount of statistical evidence of rejecting the null hypothesis for the gene based on prior information): the higher probability, the more we will shrink it toward zero. This method can possibly use the prior information more efficiently, but it also requires a stronger link and association between the two sources of data; otherwise, it may not perform well. Xie et al. (14) illustrated that the quality of the prior gene list influenced the amount of shrinkage we should have and the final performance of the shrinkage method. The more the gene
list agrees with the truth, the larger the amount of shrinkage that should be taken, and the better the shrinkage method performs. On the other hand, when the gene list is very unreliable, shrinking does not help. This phenomenon is consistent with the general Bayesian logic: how much information we should use from the prior knowledge (i.e., the gene list here) depends on the quality of the prior knowledge; if the prior knowledge is very vague, then we should use a flat prior (here s = 0) so that the posterior information comes largely or only from the data itself.
3.2. Incorporate Gene Pathway and Network Information for Statistical Inference
Gene functional groups and pathways, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (24), play critical roles in biomedical research. Recently, genome-wide gene networks, represented by undirected graphs with genes as nodes and gene–gene interactions as edges, have been constructed using high-throughput data. Lee et al. (25) constructed functional networks for the yeast genome using a probabilistic approach. Franke et al. (26) constructed a human protein–protein interaction network. It is reasonable to assume that neighboring genes in a network are more likely to share biological functions and thus to participate in the same biological processes; therefore, their expression levels are more likely to be similar to each other. Some recent work has attempted to incorporate genome-wide gene network information into statistical analysis of microarray data to increase the analysis power. Wei and Li (27) proposed integrating KEGG pathways or other gene networks into the analysis of differential gene expression via a Markov random field (MRF) model. In their model, the state of each gene was directly modeled via an MRF. Spatial statistical models have been increasingly used to incorporate other information into microarray data analysis. Xiao et al. (28) applied a hidden Markov model to incorporate gene chromosome location information into gene expression data analysis. Broet et al. (29) applied a spatial mixture model to introduce gene-specific prior probabilities to analyze comparative genomic hybridization (CGH) data. Wei and Pan (30) extended the work of Broet et al. from incorporating one-dimensional chromosome locations to two-dimensional gene networks. They utilized existing biological knowledge databases, such as KEGG pathways, or computationally predicted gene networks from integrated analysis (25), to construct gene functional neighborhoods and incorporate them into a spatially correlated normal mixture model.
The basic rationale for their model is that functionally linked genes tend to be co-regulated and co-expressed, which is thus incorporated into analysis. This is an efficient way to incorporate network information into statistical inference and we will introduce their method in this section.
3.2.1. Standard Normal Mixture Model
We want to identify binding targets or differentially expressed genes in this analysis. We use a latent variable $T_i$ to indicate whether gene i is a true binding target (or differentially expressed) gene. Suppose that the distribution functions of the data (e.g., Z-scores) for the genes with $T_i = 1$ and $T_i = 0$ are $f_1$ and $f_0$, respectively. Assuming that a priori all the genes are independently and identically distributed (iid), we have a marginal distribution of $Z_i$ as a standard mixture model:
$$f(z_i) = \pi_0 f_0(z_i) + (1 - \pi_0) f_1(z_i), \qquad [1]$$
where $\pi_0$ is the prior probability that $H_{0,i}$ (null hypothesis) holds. It is worth noting that the prior probabilities are the same for all genes. The standard mixture model has been widely used in microarray data analysis (17, 31–33). The null and non-null distributions $f_0$ and $f_1$ can be approximated by finite normal mixtures:
$$f_0 = \sum_{k_0=1}^{K_0} \pi_{0k_0}\,\phi(\mu_{k_0}, \sigma_{k_0}^2) \quad \text{and} \quad f_1 = \sum_{k_1=1}^{K_1} \pi_{1k_1}\,\phi(\mu_{k_1}, \sigma_{k_1}^2),$$
where $\phi(\mu, \sigma^2)$ is the density function of a normal distribution with mean $\mu$ and variance $\sigma^2$. For Z-scores, using $K_j = 1$ often suffices (33). Wei and Pan (30) demonstrated that $K_0 = 2$ and $K_1 = 1$ worked well in most cases. The standard mixture model can be fitted via maximum likelihood with the Expectation-Maximization (EM) algorithm (34). Once the parameter estimates are obtained, statistical inference is based on the posterior probability that $H_{1,i}$ (alternative hypothesis) holds:
$$\Pr(T_i = 1 \mid z_i) = \pi_1 f_1(z_i) / f(z_i),$$
where $\pi_1 = 1 - \pi_0$.
3.2.2. Spatial Normal Mixture Model
In a spatial normal mixture model, Wei and Pan (30) introduced gene-specific prior probabilities $\pi_{i,j} = \Pr(T_i = j)$ for $i = 1, \ldots, G$ and $j = 0, 1$. The marginal distribution of $z_i$ is
$$f(z_i) = \pi_{i,0} f_0(z_i) + \pi_{i,1} f_1(z_i), \qquad [2]$$
where $f_0(z_i)$ and $f_1(z_i)$ represent density distributions under null and alternative hypotheses, respectively. Note that the prior probability specification in a stratified mixture model (35, 36) is a special case of Equation [2]: a group of the genes with the same function share a common prior probability while different groups have possibly varying prior probabilities; in fact, a partition of the genes by their functions can be regarded as a special gene network. Based on a gene network, the prior probabilities $\pi_{i,j}$ are related to two latent Markov random fields $x_j = \{x_{i,j}; i = 1, \ldots, G\}$ by a logistic transformation:
$$\pi_{i,j} = \exp(x_{i,j}) / [\exp(x_{i,0}) + \exp(x_{i,1})].$$
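As a rough numerical illustration of the logistic transformation and of how gene-specific priors enter the mixture in Equation [2], consider the sketch below. It assumes single-component normal densities for $f_0$ and $f_1$ with made-up parameters, and it carries the posterior formula of the standard mixture model over to gene-specific priors. The helper names (`normal_pdf`, `prior_from_fields`, `posterior_target`) are hypothetical.

```python
import numpy as np

def normal_pdf(z, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at z."""
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def prior_from_fields(x0, x1):
    """pi_{i,j} = exp(x_{i,j}) / [exp(x_{i,0}) + exp(x_{i,1})], computed stably."""
    m = np.maximum(x0, x1)                      # subtract the max to avoid overflow
    e0, e1 = np.exp(x0 - m), np.exp(x1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)

def posterior_target(z, pi0, pi1, f0, f1):
    """Pr(T_i = 1 | z_i) = pi_{i,1} f1(z_i) / [pi_{i,0} f0(z_i) + pi_{i,1} f1(z_i)]."""
    num = pi1 * f1(z)
    return num / (pi0 * f0(z) + num)

# Hypothetical latent field values and z-scores for three genes
x0 = np.array([0.0, 1.0, -1.0])
x1 = np.array([0.0, -1.0, 1.0])
z = np.array([0.0, 0.0, 3.0])

pi0, pi1 = prior_from_fields(x0, x1)
post = posterior_target(z, pi0, pi1,
                        f0=lambda v: normal_pdf(v, 0.0, 1.0),   # assumed null density
                        f1=lambda v: normal_pdf(v, 2.0, 1.0))   # assumed non-null density
```

Genes 1 and 2 have the same z-score but different field values, so their posterior probabilities differ only through the prior; setting all field values equal recovers a common prior as in Equation [1].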
Each of the G-dimensional latent vectors $x_j$ is distributed according to an intrinsic Gaussian conditional autoregression (ICAR) model (37). One key feature of ICAR is the Markovian interpretation of the latent variables’ conditional distributions: the distribution of each spatial latent variable $x_{i,j}$, conditional on $x_{(-i),j} = \{x_{k,j}; k \ne i\}$, depends only on its direct neighbors. More specifically, we have
$$x_{i,j} \mid x_{(-i),j} \sim N\left(\frac{1}{m_i} \sum_{l \in \delta_i} x_{l,j},\ \frac{\sigma_{Cj}^2}{m_i}\right),$$
where $\delta_i$ is the set of indices for the neighbors of gene i, and $m_i$ is the corresponding number of neighbors. To allow identifiability, they imposed $\sum_i x_{i,j} = 0$ for $j = 0, 1$. In this model, the parameter $\sigma_{Cj}^2$ acts as a smoothing prior for the spatial field and consequently controls the degree of dependency among the prior probabilities of the genes across the genome: a smaller $\sigma_{Cj}^2$ induces more similar $\pi_{i,j}$’s for those genes that are neighbors in the network.
3.3. Joint Modeling Approaches for Statistical Inference
Now we will discuss a joint modeling approach to integrate different sources of data. The benefit of joint modeling is that it can potentially improve the statistical power to detect target genes. For example, a gene may be supported in each source of data with some but not overwhelming evidence to be a target gene. In other words, this gene will not be identified as statistically significant based on either source of data alone; however, by integrating different sources of data in a joint model, the gene may be found to be significant. Pan et al. (38) proposed a nonparametric empirical Bayes approach to joint modeling of DNA–protein binding data and gene expression data. Simulated data show the improved performance of the proposed joint modeling approach over that of other approaches, including using binding data or expression data alone, taking an intersection of the results from the two separate analyses, and a sequential Bayesian method that inputs the results of analyzing expression data as priors for the subsequent analysis of binding data. Application to a real data example shows the effects of the joint modeling. The nonparametric empirical Bayes approach is attractive due to its flexibility. Xie (39) proposed a parametric Bayesian approach to jointly modeling DNA–protein binding data (ChIP-chip data), gene expression data, and DNA sequence data to identify the binding target genes of a transcription factor. We will focus on this method here.
3.3.1. Analyzing Binding Data Alone
We use a Bayesian mixture model (40) to analyze binding data. Specifically, suppose $X_{ij}$ is the log ratio of the intensities of test and control samples in a ChIP-chip experiment for gene i ($i = 1, \ldots, G$) and replicate j ($j = 1, \ldots, K$). We specify the model as follows:
$$X_{ij} \mid \mu_{ix} \overset{iid}{\sim} N(\mu_{ix}, \sigma_{ix}^2),$$
$$\mu_{ix} \mid I_{ix} = 0 \overset{iid}{\sim} N(0, \tau_{0x}^2),$$
$$\mu_{ix} \mid I_{ix} = 1 \overset{iid}{\sim} N(\lambda_x, \tau_{1x}^2),$$
$$I_{ix} \mid p_x \overset{iid}{\sim} \mathrm{Ber}(p_x),$$
where $\mu_{ix}$ is the mean of the log ratio for gene i and $I_{ix}$ is an indicator variable: $I_{ix} = 0$ means that gene i is a non-binding target gene and $I_{ix} = 1$ means that gene i is a binding target gene. We assume that the mean log ratios of non-target genes concentrate around 0 with small variance ($\tau_{0x}^2$), while the expected mean log ratios of target genes follow a normal distribution with a positive mean. The prior distribution for the indicator $I_{ix}$ is a Bernoulli distribution with probability $p_x$. The advantage of this hierarchical mixture model is that we can borrow information across genes to estimate the expected mean intensity, and we can use the posterior probability of being a binding target gene directly to do inference.
3.3.2. Joint Modeling
Similar to binding data, we used mixture models to fit expression data $Y_{ij}$ and sequence data $z_i$. For expression data,
$$Y_{ij} \mid \mu_{iy} \overset{iid}{\sim} N(\mu_{iy}, \sigma_{iy}^2),$$
$$\mu_{iy} \mid I_{iy} = 0 \overset{iid}{\sim} N(0, \tau_{0y}^2),$$
$$\mu_{iy} \mid I_{iy} = 1 \overset{iid}{\sim} N(\lambda_y, \tau_{1y}^2),$$
$$\mu_{iy} \mid I_{iy} = 2 \overset{iid}{\sim} N(-\lambda_y, \tau_{1y}^2),$$
$$I_{iy} \mid I_{ix} = 0 \overset{iid}{\sim} \mathrm{Multinomial}(p_{y00}, p_{y10}, p_{y20}),$$
$$I_{iy} \mid I_{ix} = 1 \overset{iid}{\sim} \mathrm{Multinomial}(p_{y01}, p_{y11}, p_{y21}),$$
where $I_{iy}$ is a three-level categorical variable: $I_{iy} = 0$ indicates an equally expressed gene, $I_{iy} = 1$ represents an up-regulated gene, and $I_{iy} = 2$ means a down-regulated gene. Here we used conditional probabilities to connect the binding data and the expression data. Intuitively, the probability of being equally expressed for a non-binding target gene, $p_{y00}$, should be higher than the probability of being equally expressed for a binding target gene, $p_{y01}$. The difference between the conditional probabilities measures the correlation between the binding data and the expression data. If the two data sets are independent, the two sets of conditional probabilities will be the same. Therefore, this model is flexible enough to accommodate the correlations between data.
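To make the hierarchy concrete, one can simulate binding and expression data from the model above. The sketch below is illustrative only: every numerical value is made up, and for brevity it uses a common replicate standard deviation (`sigma_x`, `sigma_y`) instead of the gene-specific variances $\sigma_{ix}^2$ and $\sigma_{iy}^2$ of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
G, K = 1000, 3                                   # genes and replicates (hypothetical sizes)

# Hypothetical parameter values, chosen only for illustration
p_x = 0.1
lam_x, tau0x, tau1x, sigma_x = 2.0, 0.1, 0.5, 0.5
lam_y, tau0y, tau1y, sigma_y = 1.5, 0.1, 0.5, 0.5
p_y0 = np.array([0.90, 0.05, 0.05])              # (p_y00, p_y10, p_y20): non-targets
p_y1 = np.array([0.40, 0.30, 0.30])              # (p_y01, p_y11, p_y21): targets

# Binding data: I_ix ~ Ber(p_x), mu_ix | I_ix, X_ij | mu_ix
I_x = rng.binomial(1, p_x, size=G)
mu_x = rng.normal(np.where(I_x == 1, lam_x, 0.0), np.where(I_x == 1, tau1x, tau0x))
X = rng.normal(mu_x[:, None], sigma_x, size=(G, K))

# Expression status I_iy | I_ix, drawn by inverting the cumulative probabilities
probs = np.where(I_x[:, None] == 1, p_y1, p_y0)  # shape (G, 3)
u = rng.random(G)
I_y = np.minimum((u[:, None] > np.cumsum(probs, axis=1)).sum(axis=1), 2)

# Expression data: mu_iy | I_iy, Y_ij | mu_iy
mean_y = np.array([0.0, lam_y, -lam_y])[I_y]
sd_y = np.array([tau0y, tau1y, tau1y])[I_y]
mu_y = rng.normal(mean_y, sd_y)
Y = rng.normal(mu_y[:, None], sigma_y, size=(G, K))

# Share of equally expressed genes among non-targets and among targets
frac_equal_nontarget = np.mean(I_y[I_x == 0] == 0)
frac_equal_target = np.mean(I_y[I_x == 1] == 0)
```

With these made-up values, the two fractions should land near $p_{y00}$ and $p_{y01}$; setting `p_y1` equal to `p_y0` would make the expression status independent of the binding status.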
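Similarly, we model the sequence data as
$$z_i \mid I_{iz} = 0 \overset{iid}{\sim} N(\lambda_{z1}, \tau_{1z}^2),$$
$$z_i \mid I_{iz} = 1 \overset{iid}{\sim} N(\lambda_{z2}, \tau_{2z}^2),$$
$$I_{iz} \mid I_{ix} = 0 \overset{iid}{\sim} \mathrm{Ber}(p_{z0}),$$
$$I_{iz} \mid I_{ix} = 1 \overset{iid}{\sim} \mathrm{Ber}(p_{z1}),$$
where $I_{iz} = 1$ indicates that gene i is a potential target gene based on sequence data (we will call it a potential gene), and $I_{iz} = 0$ means that gene i is a non-potential gene.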
Fig. 19.1. The graphical overview of the hierarchical structure of the joint model.
Figure 19.1 gives a graphical overview of this model. In summary, the model combines expression data and sequence data with binding data through the indicator variables. This model can automatically account for heterogeneity and different specificities of multiple sources of data. The posterior distribution of being a binding target can be used to explain how this model integrates different data together. For example, if we combine binding and expression data, the posterior distribution of being a binding target gene $I_{ix}$ is
$$I_{ix} \mid \cdot \sim \mathrm{Ber}(p_{ix}), \qquad p_{ix} = \frac{A}{A + B},$$
$$A = p_x\,(\tau_{1x}^2)^{-1/2} \exp\left(-\frac{(\mu_{ix} - \lambda_x)^2}{2\tau_{1x}^2}\right) p_{y01}^{1(I_{iy}=0)}\, p_{y11}^{1(I_{iy}=1)}\, p_{y21}^{1(I_{iy}=2)},$$
$$B = (1 - p_x)\,(\tau_{0x}^2)^{-1/2} \exp\left(-\frac{\mu_{ix}^2}{2\tau_{0x}^2}\right) p_{y00}^{1(I_{iy}=0)}\, p_{y10}^{1(I_{iy}=1)}\, p_{y20}^{1(I_{iy}=2)},$$
where $I_{ix} \mid \cdot$ represents the posterior distribution of $I_{ix}$ conditional on all other parameters in the model and the data, and $1(\cdot)$ equals 1 when its argument holds and 0 otherwise. We define $\vec{p}_{y0} = (p_{y00}, p_{y10}, p_{y20})$ and $\vec{p}_{y1} = (p_{y01}, p_{y11}, p_{y21})$. If the expression data do not contain information about binding, then $\vec{p}_{y0} = \vec{p}_{y1}$, which makes all the terms containing Y in the formula cancel out. In
this case, only information contained in binding data X is used to do inference. On the other hand, the difference between $\vec{p}_{y0}$ and $\vec{p}_{y1}$ will be large when the expression data contain information about binding. In this case, the information in the expression data, $I_{iy}$, will also be used to make inference.
3.3.3. Statistical Inference
Assuming that the binding, expression, and sequence data are conditionally independent (conditional on the indicator $I_{ix}$), we can get the joint likelihood for the model. Based on the joint likelihood, we can obtain the closed form of the full conditional posterior distribution for most of the parameters (except λx, λy, and dz). A Gibbs sampler was used to do Markov chain Monte Carlo simulations for the parameters having the closed form. For λx, λy, and dz, the Metropolis–Hastings algorithm was applied to draw the simulation samples. The iterations after the burn-in samples were used as posterior simulation samples for statistical inference.
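A single Gibbs update of the binding indicator, using the full conditional $p_{ix} = A/(A + B)$ given in Section 3.3.2 for the binding-plus-expression case, might look like the sketch below. This is an illustration under assumptions rather than the sampler used by Xie (39): the function name, argument names, and parameter values are hypothetical, and all other parameters are treated as fixed at their current values.

```python
import numpy as np

def indicator_full_conditional(mu_x, I_y, p_x, lam_x, tau0x2, tau1x2, p_y0, p_y1):
    """p_ix = A / (A + B) for each gene, given current means and expression states."""
    mu_x = np.asarray(mu_x, dtype=float)
    I_y = np.asarray(I_y, dtype=int)
    # Indexing p_y1 / p_y0 by I_y picks the one factor whose indicator exponent equals 1
    A = p_x * tau1x2 ** -0.5 * np.exp(-(mu_x - lam_x) ** 2 / (2.0 * tau1x2)) * p_y1[I_y]
    B = (1.0 - p_x) * tau0x2 ** -0.5 * np.exp(-mu_x ** 2 / (2.0 * tau0x2)) * p_y0[I_y]
    return A / (A + B)

# Three hypothetical genes with the same mean log ratio but different expression states
mu_x = np.array([1.0, 1.0, 1.0])
I_y = np.array([0, 1, 2])
p_y0 = np.array([0.90, 0.05, 0.05])              # (p_y00, p_y10, p_y20)
p_y1 = np.array([0.40, 0.30, 0.30])              # (p_y01, p_y11, p_y21)

p_joint = indicator_full_conditional(mu_x, I_y, 0.1, 2.0, 0.25, 0.25, p_y0, p_y1)
p_binding_only = indicator_full_conditional(mu_x, I_y, 0.1, 2.0, 0.25, 0.25, p_y0, p_y0)

# One Gibbs draw of the indicators from their Bernoulli full conditionals
rng = np.random.default_rng(1)
I_x_new = (rng.random(mu_x.size) < p_joint).astype(int)
```

When the two probability vectors coincide, the expression terms cancel and all three genes receive the same probability; with informative expression data, the up- and down-regulated genes receive a higher probability than the equally expressed one. For extreme values of `mu_x` both A and B can underflow, so a practical implementation would likely work on the log scale.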
3.3.4. The Effects of Joint Modeling
Xie (39) illustrated that when using the binding data alone, the estimated posterior probabilities were positively associated with the mean binding intensities from binding data. In this model, the posterior probability does not depend on the expression data or sequence data. On the other hand, after joint modeling, the posterior probabilities of the genes with high expression values increased compared to using binding data alone, but the sequence score did not have much influence on the inference for this data set. In summary, this model can automatically account for heterogeneity and different specificities of multiple sources of data. Even if an additional data type does not contain any information about binding, the model approximately reduces to that using binding data alone.
4. Integrative Analysis for Classification Problem
Statistical classification methods, such as support vector machine (SVM) (41), random forest (42), and Prediction Analysis for Microarrays (PAM) (9), have been widely used for diagnosis and prognosis of breast cancer (10, 43), prostate cancer (44, 45), lung cancer (7, 46), and leukemia (47). Meanwhile, biological functions and relationships of genes have been explored intensively by the biological research community, and this information has been stored in databases, such as those with the Gene Ontology annotations (48) and the Kyoto Encyclopedia of Genes and Genomes (24). In addition, as mentioned before, prior experiments with similar biological objectives may have generated data that are relevant to the current study. Hence, integrating information from prior data or biological knowledge has the potential to increase the classification and prediction performance. The standard classification methods treat all the genes equally a priori in the process of model building, ignoring biological knowledge of gene functions, which may result in a loss of their effectiveness. For example, some genes have been identified or hypothesized to be related to cancer by previous studies; others may be known to have the same function as, or be involved in a pathway with, some known/putative cancer-related genes. Hence, we may want to treat these genes differently from other genes a priori when choosing genes to predict cancer-related outcomes. Some recent research has taken advantage of prior information for classification problems. Lottaz and Spang (49) proposed a structured analysis of microarray data (StAM), which utilized the GO hierarchical structure. Biological functions of genes in the GO hierarchical structure are organized as a directed acyclic graph: each node in the graph represents a biological function, and a child node has a more specific function while its parent node has a more general one. StAM first built classifiers for every leaf node based on an existing method, such as PAM, then propagated their classification results by a weighted sum to their parent nodes. The weights in StAM are related to the performance of the classifiers, and a shrinkage scheme is used to shrink the weights toward zero so that a sparse representation is possible. This process is repeated until the results are propagated to the root node. Because the final classifier is built based on the GO tree, StAM greatly facilitates the interpretation of a final result in terms of identifying biological processes that are related to the outcome.
StAM uses only genes that are annotated in the leaf nodes (i.e., with the most detailed biological functions) as predictors, so it may miss some important predictive genes. Wei and Li (27) proposed a modified boosting method, nonparametric pathway-based regression (NPR), to incorporate gene pathway information into a classification model. NPR assumed that the genes can be first partitioned into several groups or pathways, and only pathway-specific new classifiers (i.e., using only the genes in each of the pathways) were built for the boosting procedure. More recently, Tai and Pan (50) proposed a flexible statistical method to incorporate prior knowledge of genes into prediction models. They adopted group-specific penalty terms in a penalized method, allowing genes from different groups to have different prior distributions (e.g., different prior probabilities of being related to the cancer). Their model is similar to NPR with regard to grouping genes, but their approach applies to any penalized method through the use of group-specific penalty terms, while NPR only applies to boosting. Garrett-Mayer et al. (51) proposed using meta-analytic
approaches to combine several studies using pooled estimates of effect sizes.
5. Useful Databases and Program Code
5.1. Databases
In order to meet the urgent need for integrated analysis of high-throughput data, the National Institutes of Health (NIH) and the European Molecular Biology Laboratory (EMBL) have made concrete efforts to build and maintain several large-scale, publicly accessible databases. These databases are very valuable for both biomedical research and methodology development. We will introduce several of them briefly.
5.1.1. ArrayExpress
Founded by EMBL, ArrayExpress is a public repository for transcriptomics data. It stores gene-indexed expression profiles from a curated subset of experiments in the repository. Public data in ArrayExpress are made available for browsing and querying on experiment properties, submitter, species, etc. Queries return summaries of experiments, and complete data or subsets can be retrieved. A subset of the public data is re-annotated to update the array design annotation and curated for consistency. These data are stored in the ArrayExpress Warehouse and can be queried on gene, sample, and experiment attributes. Results return graphed gene expression profiles, one graph per experiment. Complete information about ArrayExpress can be found from the web site http://www.ebi.ac.uk/Databases/.
5.1.2. Gene Expression Omnibus (GEO)
Founded by the National Center for Biotechnology Information (NCBI), GEO is a gene expression/molecular abundance repository supporting Minimum Information About a Microarray Experiment (MIAME) compliant data submissions and a curated, online resource for gene expression data browsing, query, and retrieval. Currently it has 261,425 records, including 4,931 platforms, 247,016 samples, and 9,478 series. All data are accessible through the web site http://www.ncbi.nlm.nih.gov/geo/.
5.1.3. Oncomine
Founded by Drs. Arul Chinnaiyan and Dan Rhodes at the University of Michigan and currently maintained by Compendia Bioscience Company, the Oncomine Research Platform is a suite of products for online cancer gene expression analysis dedicated to the academic and non-profit research community. Oncomine combines a rapidly growing compendium of 20,000+ cancer transcriptome profiles with an analysis engine and a web application for data mining and visualization. It currently includes over 687 million data points, 25,447 microarrays, 360 studies, and 40 cancer types. Oncomine access is available through the Oncomine Research Edition and the Oncomine Research Premium Edition. Information is available via the web site http://www.oncomine.org/main/index.jsp.
5.1.4. The Cancer Genome Atlas (TCGA)
A joint effort by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) of the NIH, TCGA is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing. The TCGA Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. This portal contains all TCGA data pertaining to clinical information associated with cancer tumors and human subjects, genomic characterization, and high-throughput sequencing analysis of the tumor genomes. In addition, the Cancer Molecular Analysis Portal provides the ability for researchers to use analytical tools designed to integrate, visualize, and explore genome characterization from TCGA data. TCGA data can be downloaded from the web site http://tcgadata.nci.nih.gov/tcga/.
5.2. WinBUGS Codes for Incorporating Gene Network Information into Statistical Inference
model { for( i in 1:N ){ Z[i] ∼dnorm(muR[i], tauR[i]) # z-scores muR[i]