Editor’s Letter Mike Larsen, Executive Editor
Dear Readers,
This issue of CHANCE begins with an article by Jana Asher on collecting data in challenging settings. In particular, Asher describes her experiences conducting in-person survey interviews in East Timor. She gives us personal anecdotes, practical statistical advice, and an interesting story.

Qi Zheng explains the origins of the Luria-Delbrück distribution and its role in studying evolutionary change in E. coli. The statistical reasoning underlying the phenomenon has a connection to the distribution of slot machine returns.

Holmes Finch’s article, “Using Item Response Theory to Understand Gender Differences in Opinions on Women in Politics,” compares and contrasts item response models and how they describe a data set. The models are explained using formulas, pictures, and examples.

In Volume 22, Number 4, Jürgen Symanzik proposed a puzzle based on 10 data points and a set of seven instructions. Contest winner Stephanie Kovalchik, a graduate student at UCLA, provided a solution in the form of an amusing letter and an illustrative graphic. The 10 data values were flight times of the Space Shuttle Challenger, in seconds, recorded on the log10 scale. Brad Thiessen earned honorable mention for his graph that included temperature and historical facts.

Bernard Dillard asks, “Who turned out the lights?” We are all concerned with energy demand and production. Bernard uses a discrete wavelet transformation to analyze electricity consumption data measured on a frequent time scale. The fit of the model is used in multiscale statistical process control. The ultimate goal is to be able to accurately predict points of extreme energy demand and respond appropriately.

Students in virtually all statistics courses learn something of least squares estimation when studying prediction of an outcome from an explanatory variable. Ivo Petras and Igor Podlubny ask whether there is a reasonable alternative to the default criterion. “Least circles” is presented for your consideration.
To introduce students to concepts of design of experiments, instructors sometimes have students conduct taste tests of various food items, such as gummy bears (see Vol. 23, No. 1). John Bohannon, Robin Goldstein, and Alexis Herschkowitsch compared dog food and pâté. Really, they did. Read about their design and the results in this issue.

Ronald Smeltzer shows us an early time-line bar graph by Philippe Buache depicting the water level of the Seine River in Paris from 1760 to 1766. The picture creatively and effectively depicts data in print before the advent of the modern printing techniques that we enjoy today.

Howard Wainer, in his Visual Revelations column, writes about the graphics in the 2008 National Healthcare Quality Report and State Snapshots. Usefully and accurately displaying information graphically is important and challenging. Wainer makes suggestions for improving some of the displays.

Continuing a series of articles on postage stamps, Peter Loly and George P. H. Styan discuss stamps issued in sheets with 5×5 Latin square designs. Color versions of the stamps, as well as previous articles on stamps, are available online at www.amstat.org/publications/chance.

Jonathan Berkowitz’s puzzle celebrates the 2010 Winter Olympics, which were held in his home city of Vancouver, British Columbia. The puzzle, titled “Employs Magic,” is actually five smaller puzzles, each a cryptic five-square of 10 words. Mark Glickman’s Here’s to Your Health column will appear in the next issue.

Through your email you can get a table of contents notification for CHANCE. Go to www.springer.com/mathematics/probability/journal/144 and add your email address in the box that says “Alerts For This Journal.” The web site also has a place where you can recommend CHANCE to your library.

In other news, the Executive Committee of the ASA met recently and made decisions that impact CHANCE. First, the committee voted to continue CHANCE for another three years in both print and online versions. The next executive editor will serve 2011–2013. I’ll enjoy reading CHANCE in the years to come. Second, the Executive Committee voted to make the online version of CHANCE free to the ASA’s certified student members. This is a great development, because students are potential long-term subscribers and future authors.
They also can be inspired by the significant role that probability and statistics can play in major studies and activities. I hope that other professionals will be motivated to submit articles to CHANCE to entertain and influence this group. I look forward to your suggestions and submissions. Enjoy the issue! Mike Larsen
Collecting Data in Challenging Settings In the Global South—say, East Timor—data collection is not for the faint of heart or weak of stomach. Jana Asher
It is very early in the morning when water falling on my face jolts me awake. The raindrops are blowing into the small bedroom through a gap between the roof and side wall, and, unfortunately, the platform bed on which I am curled is occupied by four other women. Having nowhere to move, I coil up into a tighter ball under our shared blanket and attempt to sleep …despite the small pool of liquid that has formed a sheen on my cheek.
Truck wreck in Sierra Leone. Good thing the truck was big, or the pothole might have swallowed it whole.
Local villagers use machetes to remove a tree trunk from the road in East Timor. There were no governmental services available to clear the tree; the villagers were responsible for keeping their own passages off the mountainside clear.
The next morning, the occupants of the small village in which we are stranded will use machetes to cut out a portion of tree trunk that is blocking our path down the mountainside. We will make it about one-third of the way down the mountain in our truck before we are blocked by a second tree trunk, bigger than the first, and are therefore forced to abandon our vehicle. We will walk the rest of the way down the mountain in a state of hyper-alertness, listening to the creaking of the trees and hoping that one of them does not decide to fall on top of us. In the evening, we will finally make it to Dili, the capital of East Timor. That night, I will call my irate and somewhat despondent husband and assure him I am fine, despite having missed our planned phone call the night before. And then I will collapse in my warm hotel room bed, half a world away from my family and friends. To be fair, when we started out that morning, the weather was hot and dry. By an accident of fate we just happened to select a village on top of a mountain as a test site for our questionnaire on the very day that the rainy season decided to arrive. We completed our testing in a thunderstorm, and by the time we were done, the path we had taken up the mountain no longer existed, having been subsumed by a tree with a diameter greater than a man’s height. All for the sake of testing a questionnaire.
Figure 1. The data collection communication process
Data collection in the Global South is not for the faint of heart or weak of stomach. Nor is it for the overly rule-bound or uninventive. It requires an intimate understanding of your own fragility and mortality, a sense of adventure, a respect for the knowledge and wisdom that each person you encounter provides, and a limitless supply of patience. Finally, it requires a desire to bring the highest intellectual and scientific rigor to the most difficult of circumstances … coupled with the understanding that you will never truly succeed at doing so.

But we are getting ahead of ourselves. Let us start by discussing why the questionnaire design and data-collection processes are so essential to the quality of the resulting data, and then review the ingredients that constitute good questionnaire design and data collection. Then we can return to East Timor—by way of the Middle East and Africa.
What Can Go Wrong?

It seems like a deceptively simple process, one perhaps that you first tried in an elementary or junior high school class. You have a research question that can be answered by survey data, so you
write some questions, slap them onto a form, copy, distribute, collect, and presto: instant data! Well, yes and no. Any data-collection process represents a complex chain of communications, and as with any other chain, one weak link can break it. The more complicated the data-collection process, the more links in the chain, and the more opportunities for communication to break down. For a traditional interviewer-administered survey, multiple individuals must have a nearly identical conceptual understanding of a question, as shown in Figure 1. When everything goes well, the chain is a circle as presented on the left: That is, the interpretation of what information is desired is identical across the many individuals in the chain, leading to accurate reporting and recording of information. What can happen, however, is that the chain forms an outward spiral like the one on the right—and the information collected is not identical to that which was originally desired by the researcher. Two main issues that can arise in the process of communication are misinterpretations (errors in the communication between people) and inaccuracies in transmission (such as error caused by the interviewer writing down the wrong code on the survey form). The
misinterpretation type of error is best explained by examples. Let us start with a basic question that is fraught with interpretation issues and would therefore almost never appear on a pretested questionnaire.
Avoiding Ambiguous Questions

Take the question “What is your income?” To a critical observer, there are obvious ambiguities in this question. Within what time period do we mean? Do we mean household or individual income? Reported taxable income? Are alimony payments, gambling winnings, and other types of income included? What about bartered items? How is a respondent to know?

Here is another real-life example: the testing of a web-based survey that was to be administered to expatriate Iraqi physicians so that they could describe their professional and other experiences in their new countries of residence. The research group that hired us to test it had been advised by a well-trained and highly reputable Iraqi physician that all Iraqi medical schools were conducted in English. Therefore, the survey could be administered in English and the doctors would have little difficulty understanding the questions. Unfortunately, he was wrong. As an illustration, note the following transcription from a pretest interview. The volunteer was instructed to read the question aloud and then tell the interviewer all thoughts he had as he determined his answer to the question. The goal of the question was to determine whether the physician was required to become recertified to practice medicine in his new country.

Volunteer interviewee (reading): If you had to go through a credentialing process or are currently in a credentialing process in your new country, how many years did it take? Please do not answer if you did not need to go through a credentialing process in your new country.
Joahannes Kawa interviews in less than ideal circumstances. Note that privacy is not a common commodity in the rural areas of Sierra Leone, where the arrival of the interviewing team was considered a major event.
Volunteer interviewee (responding): Do you know if the credentialing process is the asylum process or the red card or green card? Interviewer: I'm not allowed to say anything, because it has to be how you interpret it. Volunteer interviewee: Yeah, I need to understand what means …This is asking me…like through the embassy or through the …I don't understand what mean [sic] “credentialing.” (Looks it up in English-Arabic dictionary.) Here credential is like ambassador or delegation …yes. What mean this [sic]? Interviewer: So how would you answer this question? Volunteer interviewee: This is paper of ambassador of delegation [sic] …I don't know.
News from the field in Sierra Leone. Team leader Bafara Jawara calls via satellite phone to report on progress in the Bonthe district.
You do not need to be a seasoned questionnaire designer to see a big problem in the respondent's interpretation. We and our clients determined that their “cultural expert” on Iraqi physicians was familiar only with the “elite”—that is, those physicians who attended the best Iraqi medical schools and held the most prestigious positions in their home country. In fact, of the four Iraqi physicians we interviewed during the pretest of the questionnaire, only one came close to interpreting all the questions accurately. As a result of the pretest, the research team members developed an Arabic version of the questionnaire, thus saving themselves from collecting data that would be, for all intents and purposes, garbage. In general, a rigorous testing procedure will be required to ensure that a survey collects the data it intends to collect—in other words, that the data collection process ends as a circle instead of a spiral.

An interview during the field testing in East Timor. The interviewer, Jacinta Gonsalves, is sitting at the table; her supervisor, Silvia Verdial da Silva Lopes, is sitting to her right; and the respondent is the man sitting to her left. Note that privacy was virtually impossible during the interviewing process.
Best Practices for Data Collection

So what is the right way to design and test a questionnaire and interview respondents? Well, the short answer is that it depends, but there are some best practices that have been developed over the past few decades. Here is a look at the process of developing the questionnaire and training interviewers to administer it.

Testing the questionnaire

Testing a questionnaire requires multiple steps. A first round of testing should include review of the survey instrument by at least three different types of experts: an expert on the subject of the survey, an expert on the population to be surveyed, and an expert on questionnaire design. Although those roles might overlap—for example, your questionnaire design reviewer might be the same as your expert on the population to be surveyed—none of those experts should be the person who initially designed the questionnaire. Even the most experienced questionnaire designers make mistakes. Once that initial review is completed and the questionnaire has been modified based on the expert comments, the real testing begins. There are several possibilities for the next round of review.
One of the most successful techniques is called “cognitive interviewing.” Cognitive interviewing allows the survey designer some insight into the thinking processes of the respondent. During a cognitive interview, two trained professionals administer the survey to a test respondent. One of the professionals serves as the interviewer, and one records his/her impressions of the interaction between the interviewer and respondent, including tone of voice and body language that might indicate confusion or a strong emotional reaction. The respondent is asked to “think aloud”—that is, to verbalize his/her thoughts while responding to the question. In addition, the interviewer might ask probing questions—questions about the respondent's interpretation of or thinking about the survey. Those probing questions might be either ad-lib or carefully developed prior to the cognitive interviewing process. If possible, the cognitive interview will be tape- or video-recorded for study later. As an example, the transcription of the interview of the Iraqi physician given earlier in this article was taken from a cognitive interview. Cognitive interviewing can occur as a single process or in waves. When cognitive interviewing occurs in waves, between each wave the survey developer modifies the survey on the basis of what was learned during the previous wave. Following the example of the question
“What is your income?” we can imagine a first round of cognitive interviews might uncover the issue that there is no time frame given. The survey designer might respond by reformulating the question as “What was your income over the past 12 months?” A second round of cognitive interviews might reveal that some respondents are including interest income and some are not. The question might then be reformulated as “What was your income from gainful employment over the past 12 months?” and so on, until the cognitive interviews indicate that the respondents to the question are interpreting it as the survey developer intended.

Limitations of cognitive interviewing

One issue with the cognitive interviewing process is that it is time and labor intensive, so only a limited number of cognitive interviews can be completed during any wave. Therefore, it is a good idea to perform a final field test after the cognitive interviewing. A field test is a small run of the fieldwork for the survey, after which the results are tabulated to make sure that they seem appropriate. The field test allows both input from a large number of individuals from the population of interest and an opportunity for the interviewers to practice with the survey before the real fieldwork begins. One hopes that any problems found at this stage will be minimal and easily corrected, and the survey will be ready to commence in earnest.

A team in Sierra Leone reviewing questionnaires under the supervision of team leader Mohamed Daboh. From left to right: Andrew Simbo, Mohamed Daboh, Antoinette Licon, Nancy Joseph, and Joahannes Kawa. Note that the team is working by candlelight; electricity in the field was a luxury, as were flush toilets. Also note the spiffy Joint Statistical Meetings bags, donated by the American Statistical Association.

Transportation in the Bonthe district of Sierra Leone. The team members who covered the Bonthe district spent the majority of their mission being transported by boat. For this reason, they were issued life preservers and special containers in which to store the survey forms and their equipment and personal belongings. Here team members rest as they travel to the next village. The Bonthe team interviewed in the remotest village sampled for the survey. It was on an island that belonged to the district of Pujehun, and the Pujehun team had been unable to reach it. The Bonthe team members, led by Bafara Jawara, were required to ride in a boat for 16 hours and then hike for 10 miles to get to the village.

Another issue that frequently arises during questionnaire design is the need to translate the questionnaire into one or more languages. The base minimum for translation is a combination of forward and backward translation—that is, one individual performs a translation between the original and target language, and another individual translates it back into the original language. The pretranslation is then compared to the version that has been forward- and back-translated to find inconsistencies, and the translation is then corrected. Although this has been an industry standard, recent research has suggested that a more rigorous technique is needed, especially in the case of a questionnaire that is being translated into multiple languages simultaneously. One option is to develop the questionnaire in the multiple languages at the same time, with individual teams for each language starting from a base set of concepts. In that case, cognitive interviewing and other testing methods will occur in all languages, not just the base language.

Training the interviewers

An essential aspect of the questionnaire design and testing process is the appropriate training of the interviewers. In most large survey projects in the United States, the interviewers are trained to “stick to the script”—in other words, to read the questions on the survey instrument exactly as written, to follow the questions in the order given, and not to offer additional explanation of the questions unless it has been preapproved by the survey manager. This protocol is designed to minimize interviewer bias in the answers—that is, changes in how respondents will answer questions that are due to some behavior of the interviewer. Interviewers may inadvertently change the meaning of a question if they do not read that question exactly, or they may cause respondents to favor one response over another due to the interviewer’s clear preference for that response. For that reason, interviewers are also taught to present the questions in a neutral way and to not implicitly or explicitly express their own opinions as to an appropriate answer. Interviewers must learn the appropriate way to administer the survey.
However, there are other aspects of interviewer behavior that must be addressed, including appropriate voice control and body language. Interviewers are trained on techniques for building rapport, asking sensitive questions, and maintaining the confidentiality of the responses of the individuals interviewed. Finally, interviewers need to understand methods for keeping themselves safe in the field.
What Makes Survey Practice Different in a Developing or Transitional Country?

Much of our understanding of what constitutes good survey practice has grown out of research that has taken place in the United States, Canada, and Europe—all vastly different environments from those of the Global South. There are several reasons why the data-collection methods discussed above might need significant alteration to be useful in the developing context. A particularly pernicious problem—one that has only recently become a research priority of government statistical organizations in the developed world—is the need to administer a survey to a diverse population composed of multiple ethnic groups that speak a diverse set of languages. In the past, questionnaires have often been developed without consideration of how particular questions will be interpreted or understood across cultural groups, or whether particular concepts are directly translatable at all: an issue that the forward-/back-translation industry standard does not adequately address. Populations in the Global South might be more varied than those in the Global North for other reasons—including varying levels of literacy and different understanding and tracking of time (e.g., reliance on agricultural cycles rather than a Gregorian calendar). In the context of sensitive questions, ensuring privacy might be close to impossible in the Global South, where entire families might share a one-room home or apartment and the arrival of interviewers is a villagewide event. And the cultural preferences of many Global South countries lead to higher rates of “acquiescence bias”: the higher likelihood of respondents answering yes to a question, not because the affirmative is true but because they want to please the interviewer, who is perceived as being in a position of authority.
Are There Best Practices for Data Collection in the Global South?

The answer to this question is yes and no. There are several organizations—including the World Bank and United Nations—that have compiled the state of the art in random sample surveys in the Global South. However, there is still much research to be done, as well as ongoing issues with the indiscriminate transfer of data-collection methodology.

Cognitive interviewing in the field in East Timor. From left to right: Jana Asher, Duarte da Silva, and the survey respondent with her daughter.

One important difference between fieldwork in the Global North and Global South is the availability of support infrastructure in the field. Common developed-world conveniences like cellular phone networks or landlines, hospitals and clinics, restaurants and hotels, and regular electricity and running water simply might not be available in some parts of the Global South. Teams of interviewers can be outfitted with satellite phones and carry their own medical supplies into the field. Interviewers can be vaccinated against local diseases and carry food and shelter to remote locations. And plenty of matches and candles will allow interviewers to complete their activities after dark. In addition, interviewers must be prepared in advance for difficulties such as flat tires, large potholes, or the need to abandon the car or truck altogether and travel by motorcycle, boat, or foot. And you never know when your vehicle will be unable to proceed because there is a fallen tree in your way!

What about the design of the questionnaire in the Global South context? In this researcher’s experience and opinion, in Global South countries where multiple tribes or cultures may be part of the sampled population, training the interviewer to recite the questionnaire verbatim, with no ability to explain if the respondent is confused, does not lead to quality data. In fact, a reluctance to adapt to the interviewee can harm the interview process. Rather, the interviewer needs greater training and greater ability to improvise in the field. Additionally, although they are not often used, techniques like cognitive interviewing and more recently developed language translation techniques can and should be used in the Global South.

Many cultures in the Global South are more attuned to agricultural cycles than to Gregorian divisions of time. In many cultures, even significant dates, like one's birth date, are not known or not important. That can significantly impact the quality of surveys that require recall of events, unless specific techniques are developed for those populations. A technique called the “calendar method” has shown great promise: The interviewer assists the respondent in developing a written calendar of important “landmark” events in his/her life to aid recall of events asked about during the interview. However, in its typical form, the calendar method requires the respondent to be literate and familiar with the Gregorian calendar. The following section describes an alternate method of assisting illiterate respondents in recalling date information: a method that the author developed in her fieldwork.
A Small Research Example
Team leader Mohamed Daboh watches as a local mechanic in Bo Town attempts to fix his team’s car during fieldwork. Car problems plague fieldwork in a country like Sierra Leone, where there is insufficient infrastructure to maintain the roads—or even pave the majority of them.
Sahr Gbondent holds up the remains of an interview team’s tire during fieldwork in Sierra Leone. During that project, each vehicle was equipped with two extra tires for each two-week mission. Very often, they weren’t enough.
The survey was a national survey of human rights violations that occurred during the armed internal conflict of Sierra Leone. The U.S. State Department sponsored the survey as part of a general program of documenting war crimes. The participating organizations were the American Bar Association and Benetech. All three groups wanted the data to be used by the Sierra Leone Special Court during the prosecution of war criminals. For that to happen, each violation needed to be associated with a perpetrator group and a date. In addition, we were asked to collect age of the victim at the time of violation, the duration of the violation (including duration until death if appropriate), and the current age of all members of the respondent's household. However, a large percentage of the population of Sierra Leone is illiterate and does not regularly consult the Gregorian calendar of years, months, and weeks. Rather, the people are attuned to the seasons (rainy and dry) and other important events such as religious holidays and school terms. Our solution was twofold: We provided the interviewers with more latitude to probe the respondent for information (not sticking to the script of the questionnaire), and we crafted a series of prescripted probe questions that asked the respondents to determine whether a violation occurred before or after a national date of importance. Although the first decision deviated from standard survey practice regarding interviewers, we felt that the complexity of the information sought required that interviewers have the ability to be more creative in the field, and that the additional information elicited far outweighed the potential interviewer bias introduced. The second decision, however, rested firmly on current understanding of memory and cognition by psychologists. 
Current theory suggests that our recall of date information is based on a few “landmark” events for which we have memorized the date on which the event occurred, combined with the storage in memory of events in series form. In other words, an interviewee might remember that her birthday is October 17 and then remember that about two weeks later she had her furnace serviced—but it is very unlikely she will remember that her furnace was serviced on October 29.

In the case of the Sierra Leone survey, we created a probe question for each of eight events of national importance that were spread through time in a way that allowed interviewers to determine the year in which a particular respondent's reported human rights violation had occurred. For example, the probe “Did that happen before or after the invasion of Freetown?” allowed the interviewers to determine whether an event occurred (roughly) prior to 1999, during 1999, or later. In addition, the interviewers were provided with several scripted probes to help them narrow the time frame during which the violation could have occurred. Those probes referenced seasons (rainy versus dry); religious events (before or after Christmas or Ramadan); school terms (before or after a particular term started or ended); and, if needed, age of the respondent (both age of the respondent when the event happened and age now). The interviewers were free to use any combination of probes they felt was warranted, but they were required to record which probes they used (or note if they had not used any). The strong advantage of this method versus more traditional calendar-based methods was its ability to be used with illiterate individuals.

Table 1—Time Probe Results for Sierra Leone War Crimes Documentation Survey

Probe                     | Victim Age During Violation | Violation Start Date | Violation Duration | Death Duration | Resident Age
No probe needed + 1 probe |    766 |    171 |     85 |     20 |    701
No probe needed           | 44,168 | 11,040 | 57,488 | 10,018 | 32,013
Missing value             |  3,229 |  2,107 |  4,899 |    370 |  1,825
1 probe code recorded     | 16,528 | 32,538 |  2,217 |    380 |  4,310
2 probe codes recorded    |     26 | 18,577 |     28 |      0 |      1
3 probe codes recorded    |      1 |    283 |      0 |      0 |      0
4 probe codes recorded    |      0 |      1 |      0 |      0 |      0
More probe codes          |      0 |      0 |      0 |      0 |      0
Total probe events        | 19,784 | 51,399 |  2,245 |    380 |  4,311

Did it work? The evidence given in Table 1 would suggest so. Using the recorded information about which probes
were used to ask about which violations, we determined that 79.4% of violation start dates relied on probing to be determined. Durations were easier to remember: as the results for violation duration and time from violation until death show, only 3.5% of those cases required probes. Note that typically only one probe was required to determine dates. We believe that respondents, once introduced to the probing method, engaged in self-probing to recall ages and dates later in the interview. This theory is supported by observations from the cognitive interviewing for the survey, during which respondents began to self-probe. Please also note that the use of landmark-based probe events is strongly supported by the response rates to the time-related questions of the survey: Response rates were very high for resident age (99.9%), victim age (99.9%), violation start date (99.9%), and violation duration (96.9%).
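The headline percentages can be checked directly against the Table 1 counts. The sketch below is illustrative rather than definitive: it assumes that each percentage's denominator is the full column total (including missing values and the ambiguous "no probe needed + 1 probe" row) and that the numerator counts cases with at least one probe code recorded. Under those assumptions, the arithmetic reproduces the quoted figures.

```python
# Table 1 counts, top to bottom: "no probe needed + 1 probe", "no probe
# needed", "missing value", then 1, 2, 3, 4, and more probe codes recorded.
start_date = [171, 11_040, 2_107, 32_538, 18_577, 283, 1, 0]
violation_duration = [85, 57_488, 4_899, 2_217, 28, 0, 0, 0]
death_duration = [20, 10_018, 370, 380, 0, 0, 0, 0]

def probed_share(*columns):
    """Percentage of cases whose answer required at least one probe."""
    probed = sum(sum(col[3:]) for col in columns)  # rows with probe codes
    total = sum(sum(col) for col in columns)       # all cases in the column(s)
    return 100 * probed / total

print(round(probed_share(start_date), 1))                          # 79.4
print(round(probed_share(violation_duration, death_duration), 1))  # 3.5
```

The two duration columns are pooled in the second call, matching the combined 3.5% figure the article quotes for durations.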
Summary

Many people required to analyze data do not see those data until they are nice and neat, either in a spreadsheet or another organized format. They might not realize how much error can be introduced through the data-collection process if the process is not carefully planned. You now know what can and will go wrong when appropriate questionnaire design and data collection protocols are not followed—or even when they are.
Scientists are hard at work determining new and better techniques for eliciting information from a variety of people across the world. So the next time you are given a data set to analyze, be sure to ask how the data were acquired. You might be surprised by how little or how much was done to produce data of the highest quality possible—whether or not the questionnaire designer ended up stuck on the top of a mountain.
Further Reading

Bulmer, M., and Warwick, D. P., eds. 1993. Social research in developing countries: Surveys and censuses in the Third World. London: University College London Press.
Casley, D. J., and Lury, D. A. 1981. Data collection in developing countries. London: Oxford University Press.
Tourangeau, R., Rips, L. J., and Rasinski, K. 2000. The psychology of survey response. Cambridge, UK: Cambridge University Press.
United Nations. 2005. Household surveys in developing and transitional countries. http://unstats.un.org/unsd/Hhsurveys/ (accessed October 30, 2009).
Willis, G. B. 1999. Cognitive interviewing: A “how to” guide. http://appliedresearch.cancer.gov/areas/cognitive/interview.pdf.
CHANCE
13
The Luria-Delbrück Distribution
Early statistical thinking about evolution

Qi Zheng
In the 1940s, biologists were puzzling over the origin of certain mutations that confer survival advantages to bacteria living under harsh environmental conditions. A well-known example is a mutation that confers on Escherichia coli cells resistance to phage—viruses that infect and kill wild-type bacterial cells. Some biologists believed that such mutations occur spontaneously, or randomly, in the sense that they occur regardless of their usefulness to the organism. Others held that such mutations occur in response to the environment, for example, to the assault of phage. The former hypothesis, dubbed the “random mutation hypothesis,” supports Darwin’s theory of natural selection. The latter hypothesis, called the “directed mutation hypothesis” among other names, cannot be easily reconciled with Darwinism. The Luria-Delbrück probability distribution was developed to describe experimental results produced to address the contentious debate over these two hypotheses.
Photo courtesy of AP/Wide World Photos
King Gustaf Adolf, right, presents the Nobel Prize in Physiology or Medicine to German-born American biologist Max Delbrück in Stockholm, Sweden, December 10, 1969. Delbrück, of the California Institute of Technology, shares the prize with American biologist Alfred D. Hershey and Italian-American biologist Salvador E. Luria for their discoveries concerning the replication mechanism of viruses and their genetic structure.
Luria’s Experiment

Salvador Luria, an Italian-born microbiologist, was responsible for several important advances in modern biology. An indirect contribution of Luria to biology was his decision to send James Watson, his first doctoral student, to Europe to pursue postdoctoral research, which culminated in the discovery of the molecular structure of DNA by James Watson and Francis Crick in 1953. In February 1943, after months of preoccupation with the controversy over the two hypotheses, Luria invented a new type of experiment that would shed light on the controversy. A diagram of the experiment is shown in Figure 1. Each test tube contains a liquid culture into which a few wild-type cells are seeded. During an ensuing incubation period cells grow and divide freely in each tube.
Under the random mutation hypothesis, when a wild-type cell divides, there is a small chance that one of the two daughter cells is a mutant. Since backward mutation is negligible, all offspring of that mutant would be mutants. To help understand how the cell population in a test tube evolves, consider synchronous cell growth. As depicted in Figure 2, let a cell population start from a common ancestor (top row). Each succeeding generation doubles in size. The offspring of a mutant type (black) are also of the mutant type. At the end of the incubation period, the numbers of mutants in the test tubes are determined by transferring the contents of each tube onto a solid culture in a dish containing a selective agent (e.g., phage). This transferring process is termed plating, which
eliminates wild-type cells but allows each mutant cell to grow and form a visible colony on a solid culture.
Overdispersion and the Slot Machine

Luria’s experiment is a classic example of applying a simple statistical principle to an important biological problem. If mutations occur randomly, one would expect mutations to occur earlier in some tubes than in other tubes. Because an early-occurring mutation in general generates a larger number of mutant cells than a late-occurring mutation does, one is likely to observe a considerable amount of variation in the number of mutant cells across the tubes.
Figure 1. A simplified illustration of the fluctuation test. An actual experiment consists of around 30 test tubes. Wild-type cells are seeded into a test tube containing a liquid culture. After an incubation period, the cells in each tube are transferred to a dish where phage destroy the wild-type cells. The surviving mutant cells are fixed in place and grow to visible colonies, which are counted.
Figure 2. One of 36 pedigrees that have 16 mutant cells in the fifth generation. In the above diagram, a white circle stands for a wild-type cell, whereas a black disk stands for a mutant cell. Starting from a common ancestor (top row), each succeeding generation doubles in size. Under the random mutation hypothesis, mutations can occur in any generation. Thus, a large variation in the number of mutant cells across test tubes is expected.
On the other hand, if mutations occur only after the plating procedure brings wild-type cells into contact with phage, then, because cells lose mobility on a solid culture, the number of mutant colonies represents the number of mutations that occurred after plating (see Figure 3). As each wild-type cell has an equal chance to mutate upon coming into contact with phage, the number of mutations would obey the binomial law, which should be well approximated by the Poisson law due to the large number of wild-type cells and the exceedingly small probability of mutation.
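The binomial-to-Poisson approximation invoked here is easy to verify numerically. The following sketch in Python compares the two laws for a large cell population and a tiny per-cell mutation probability; the numbers are purely illustrative, not Luria's actual counts:

```python
import math

def binom_pmf(k, n, p):
    """Exact Binomial(n, p) probability of exactly k successes."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson(lam) probability of exactly k events."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# A large number of plated cells, each with a tiny chance of mutating
# on contact with phage (illustrative values):
n, p = 10**6, 3e-6
lam = n * p  # matching Poisson rate: the expected number of mutations

for k in range(6):
    print(f"k={k}: binomial={binom_pmf(k, n, p):.6f}  poisson={poisson_pmf(k, lam):.6f}")
```

With n this large and p this small, the two probability functions agree to several decimal places for every k, which is why the number of post-plating mutations can be treated as Poisson.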
VOL. 23, NO. 2, 2010
The idea of measuring this kind of variability to test the random mutation hypothesis struck Luria when he was watching a colleague putting dimes into a slot machine at a faculty dance at Indiana University. As Luria later recounted in his autobiography, at that moment he vividly saw a striking similarity between the large variation in slot machine returns and the variation in the number of mutant cells across the tubes—if random mutations did occur. To put it another way, a jackpot is to the slot machine return what an
early-occurring mutation is to the number of mutants in a test tube. Luria eagerly communicated his novel idea and experimental results to Max Delbrück, who, equally excited, formulated a mathematical model to describe the variation in the number of mutants. Thus was born the Luria-Delbrück experiment, also known as the fluctuation test or the fluctuation experiment. The latter names are due to the fact that in their paper, published in the November 1943 issue of Genetics, Luria and Delbrück used the term “fluctuation” to refer to what a statistician would today
Figure 3. In contrast to Figure 2, mutations occur only after plating under the directed mutation hypothesis.
call “variance” or “variation.” A prominent feature of a Poisson random variable is that the mean and the variance are equal, and hence the variance-to-mean ratio is unity. This would be the case for the distribution of the number of mutants under the directed mutation hypothesis. However, as the slot machine analogy suggests, the ratio can far exceed unity under the random mutation hypothesis. From a modern perspective, the distribution of the number of mutants is overdispersed under the random mutation hypothesis, compared with the distribution under the directed mutation hypothesis. When Luria performed the world’s first fluctuation tests to investigate mutations that confer on E. coli cells resistance to phage infection, the observed variance-to-mean ratios greatly exceeded unity. Luria and Delbrück thus argued for the occurrence of random mutations in their experiments. In their classic paper, Luria and Delbrück also demonstrated the usefulness of the fluctuation test in measuring microbial mutation rates in the laboratory. In the ensuing six decades, the fluctuation test would gradually be regarded more as a means of estimating mutation rates than as a tool for unraveling the controversy for which the experiment was invented.
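The variance-to-mean argument can also be illustrated with a small simulation. The sketch below is a deliberately simplified model of the fluctuation test under each hypothesis; the tube count, number of generations, and mutation rate are illustrative assumptions, not values from the 1943 paper:

```python
import math
import random
import statistics

def sample_poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for the small rates used here."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def fluctuation_test(n_tubes, generations, mu, rng):
    """Mutant counts per tube under each hypothesis (illustrative model)."""
    random_counts, directed_counts = [], []
    n_final = 2 ** generations
    for _ in range(n_tubes):
        # Random mutation: a mutation in generation g founds a clone
        # that doubles every generation until plating.
        mutants = 0
        for g in range(generations):
            n_new = sample_poisson((2 ** g) * mu, rng)
            mutants += n_new * 2 ** (generations - g - 1)
        random_counts.append(mutants)
        # Directed mutation: each plated cell mutates independently,
        # so the count is binomial, approximately Poisson.
        directed_counts.append(sample_poisson(n_final * mu, rng))
    return random_counts, directed_counts

rng = random.Random(7)
rand_c, dir_c = fluctuation_test(100, 25, 1e-7, rng)
vmr_random = statistics.variance(rand_c) / statistics.mean(rand_c)
vmr_directed = statistics.variance(dir_c) / statistics.mean(dir_c)
```

Under the directed hypothesis the Poisson counts give a variance-to-mean ratio near 1, while the clone structure under the random hypothesis inflates the ratio far above 1, mirroring the jackpots Luria observed.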
Unpublished First Efforts

A key step in estimating mutation rates using the Luria-Delbrück experiment is to express the distribution of the number of mutants (also called the “mutant distribution”) in terms of the mutation rate or related quantities. Several prominent statisticians of the time regarded the method adopted by Delbrück as
too crude for the purpose of estimating mutation rates. J. B. S. Haldane was among the first to seek algorithms to calculate the distribution. Haldane used the synchronous growth model as illustrated in Figure 2 and took a combinatorial approach to tackle the distribution. For example, to calculate the probability of 16 mutants in the fifth generation, Haldane first enumerates all five-generation pedigrees having 16 mutants in the fifth generation. Figure 2 shows one of 36 pedigrees that have 16 mutants in the fifth generation. If μ is the probability that a cell division produces a mutant daughter cell, then the probability of that pedigree is μ^5(1 − μ)^26, as in total 31 cell divisions have occurred and five divisions were accompanied by a mutation. The desired probability is the sum of the probabilities of the 36 pedigrees. This approach is simple in principle but can be unwieldy in practice. For instance, a wild-type cell can give rise to 374 pedigrees that have 46 mutant cells in the sixth generation. No efficient algorithms exist for identifying all these pedigrees. A modern approach is to treat the model as a Markov process, computing probabilities for the nth generation using probabilities for the (n−1)st generation. Haldane did not publish his results. The original manuscript, written in 1946, is now part of a large collection of Haldane’s papers archived by University College Library in London. Almost at the same time, around 1947, the distribution drew the attention of another giant figure in genetics and statistics. Finding Delbrück’s mathematical treatment of the mutant distribution
less than satisfactory, the young geneticist James Crow posed the question of how to find the distribution of the number of mutants to his newly acquainted friend R. A. Fisher. Fisher, upon hearing the question, leaned back in his chair to think for about a minute and then wrote on a scrap of paper a generating function that Crow could not immediately understand. Crow put aside the piece of paper for later study but could never find it again. As this valuable scrap of paper is unlikely ever to be recovered, the mathematical model that Fisher used to obtain his generating function will remain a mystery. However, it is improbable that Fisher’s generating function was arrived at in the space of a minute or so, Fisher’s legendary intellectual prowess notwithstanding. The raging controversy and the refreshing statistical argument put forward by Luria and Delbrück grabbed the attention of many a contemporary geneticist. It would be hard to imagine that Fisher, then pioneering at the frontiers of both genetics and statistics, had not pondered the controversy and the mutant distribution before the episode that Crow recounted in 1990.
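The generation-by-generation Markov recursion mentioned above is straightforward to implement. This sketch computes the exact distribution of the number of mutants under the synchronous-growth model of Figure 2, assuming at most one mutant daughter per wild-type division; the mutation probability is illustrative:

```python
import math

def mutant_distribution(mu, generations):
    """P(number of mutant cells) after the given number of synchronous
    generations, starting from one wild-type cell."""
    probs = {0: 1.0}  # one ancestral wild-type cell, zero mutants
    cells = 1
    for _ in range(generations):
        nxt = {}
        for k, p in probs.items():
            wild = cells - k
            # each wild-type division yields one mutant daughter with prob mu;
            # existing mutants simply double
            for j in range(wild + 1):
                q = math.comb(wild, j) * mu**j * (1 - mu) ** (wild - j)
                nxt[2 * k + j] = nxt.get(2 * k + j, 0.0) + p * q
        probs = nxt
        cells *= 2
    return probs

# Distribution over 0..31 mutants in a 32-cell, fifth-generation population
dist = mutant_distribution(0.01, 5)
```

Rather than identifying pedigrees one by one, the recursion aggregates all pedigrees that share the same mutant-count trajectory, so it reaches the same probabilities Haldane's enumeration would.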
A Published Distribution

D. E. Lea and C. A. Coulson were responsible for the mutant distribution that is still widely used today in laboratories around the world. Lea and Coulson completed their work in June 1947, and their paper appeared in the December 1949 issue of Journal of Genetics. The Lea-Coulson model is a modification of Delbrück’s model reported in the 1943 classic paper
Photo courtesy of AP/Wide World Photos
Salvador E. Luria in his Massachusetts Institute of Technology Laboratory October 16, 1969, after word that he shared the 1969 Nobel Prize with two other bacteriologists for research on viruses.
of Luria and Delbrück. In Delbrück’s model, cell growth is asynchronous, and the number of wild-type cells in a tube at time t is approximated by an exponential growth function of the form N0e^(βt). Here β is the cell-growth rate and N0 is the initial number of cells. It is assumed that mutations occur in accordance with a Poisson process with a time-dependent rate proportional to e^(βt). Furthermore, Delbrück used the exponential function e^(β(t−t')) as a continuous approximation to the number of mutant cells at time t generated by a mutation occurring at an earlier time t'. It is easy to see that Delbrück’s model induces a continuous mutant distribution. As the number of mutant cells generated by a mutation in a typical experiment is relatively small, the second approximation is a crude one. It was due to this second approximation that Delbrück’s model was generally regarded as inadequate. In the Lea-Coulson model, a stochastic birth process with birth rate β replaces the exponential function e^(β(t−t')). The resulting mutant distribution is a discrete one. Lea and Coulson derived a generating function for this distribution. In their mathematical development, however,
Lea and Coulson inadvertently made an assumption that effectively treated the number of wild-type cells immediately before plating (NT) as an infinitely large quantity. This unintentional assumption allows the distribution to be indexed by a single parameter m, the expected number of mutations that occur in a test tube. As in a typical experiment NT is often on the order of 10^8, the effect of this simplifying assumption was largely negligible in practice. The exact form of the generating function allowing for finite NT is believed to have been derived independently by D. G. Kendall and M. S. Bartlett. Details about the origins of this so-called exact distribution were retold in the 1999 Mathematical Biosciences review article “Progress of a Half Century in the Study of the Luria-Delbrück Distribution,” by Q. Zheng. Note that the algorithm proposed by Lea and Coulson to calculate their distribution function was too laborious for routine application.
Recent Developments

The year 1988 saw an explosive resurgence of interest in the directed mutation hypothesis, which intensified interest in the Lea-Coulson distribution. As a result, an improved algorithm for computing the Lea-Coulson distribution function was proposed by Ma and colleagues in 1992. The generating function of Lea and Coulson, along with this improved algorithm, entered the third edition of the authoritative monograph Univariate Discrete Distributions, by Johnson, Kemp, and Kotz, in 2005. This solidified the habit of calling Lea and Coulson’s distribution the Luria-Delbrück distribution. Among biologists, however, this distribution is often fondly called the “jackpot distribution,” for reasons now clear. Today the fluctuation test is introduced to biology students as an important biological experiment of the 20th century. The reader can gain a deeper understanding of the fluctuation test by consulting the popular genetics text Introduction to Genetic Analysis, by Griffiths et al., or several other biology textbooks. Continued interest in the Lea-Coulson and related distributions is due to the growing demand by biologists for better methods to estimate mutation rates using the fluctuation test. Recent work on mutant distributions focuses on the
improvement of point and interval estimation of the parameter m, from which the mutation rate is readily obtained. The reader can catch a glimpse of relevant computational issues from a tutorial written by microbiologists W. A. Rosche and P. L. Foster in 2000 or from a more specialized account written by Zheng in 2005. The Luria-Delbrück distribution remains an awe-inspiring subject among biologists because of the seminal contributions Luria and Delbrück made to modern biology. It is widely believed that Luria and Delbrück won the 1969 Nobel Prize in Physiology or Medicine (with A. D. Hershey) in part due to their fluctuation test, from which the Luria-Delbrück distribution arose.
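For readers who want to compute the Lea-Coulson distribution themselves, here is a compact sketch in the spirit of the convolution-based algorithms cited above. It uses the standard compound-Poisson representation, in which the number of mutations is Poisson with mean m and each mutation founds a clone of size k with probability 1/(k(k+1)):

```python
import math

def luria_delbruck_pmf(m, n_max):
    """p[n] = P(n mutants) under the Lea-Coulson model, where m is the
    expected number of mutations per tube, via the compound-Poisson recursion."""
    p = [0.0] * (n_max + 1)
    p[0] = math.exp(-m)
    for n in range(1, n_max + 1):
        # a mutation founds a clone of size k with probability 1/(k(k+1))
        p[n] = (m / n) * sum(p[n - k] / (k + 1) for k in range(1, n + 1))
    return p

p = luria_delbruck_pmf(0.3, 2000)
```

The heavy right tail is visible in how slowly the probabilities sum toward 1: even 2,000 terms leave a small but non-negligible tail, the mathematical signature of the jackpot tubes.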
Further Reading

Crow, J. F. 1990. R. A. Fisher, a centennial view. Genetics 124:207–211.
Griffiths, A. J. F., Wessler, S. R., Lewontin, R. C., and Carroll, S. B. 2007. Introduction to genetic analysis, 9th ed. New York: W. H. Freeman and Co.
Johnson, N. L., Kemp, A. W., and Kotz, S. 2005. Univariate discrete distributions, 3rd ed. Hoboken, NJ: Wiley.
Lea, D. E., and Coulson, C. A. 1949. The distribution of the numbers of mutants in bacterial populations. Journal of Genetics 49:264–285.
Luria, S. E. 1984. A slot machine, a broken test tube: An autobiography. New York: Harper & Row.
Luria, S. E., and Delbrück, M. 1943. Mutations of bacteria from virus sensitivity to virus resistance. Genetics 28:491–511.
Ma, W. T., Sandri, G. v. H., and Sarkar, S. 1992. Analysis of the Luria-Delbrück distribution using discrete convolution powers. Journal of Applied Probability 29:255–267.
Rosche, W. A., and Foster, P. L. 2000. Determining mutation rates in bacterial populations. Methods 20:4–17.
Zheng, Q. 1999. Progress of a half century in the study of the Luria-Delbrück distribution. Mathematical Biosciences 162:1–32.
Zheng, Q. 2005. New algorithms for Luria-Delbrück fluctuation analysis. Mathematical Biosciences 196:198–214.
Using Item Response Theory to Understand Gender Differences in Opinions on Women in Politics

Holmes Finch
The analysis and assessment of items on standardized tests is the purview of a subgroup of statisticians known as psychometricians. These individuals study the statistical quality of test items and estimates of examinee ability in the content area those items measure. In large-scale testing programs, such as those used by state departments of education, decisions regarding which items to retain for use with the student population are often made with the help of statistics. These statistics estimate such things as the difficulty level of items and how well these items differentiate students with relatively high proficiencies in the content area from those students with lower proficiencies. In addition, psychometric analyses can be used to obtain estimates of examinee proficiency—estimates that are often used in these testing programs to make academic competency decisions. For such decisions to be valid statistically and defensible legally, the items used on these tests must be of the highest possible quality. A part of the quality question focuses on the level of item difficulty and discrimination. Another question is whether an item is fair for all populations that might take the test. Of specific interest is the question of whether individual items exhibit any particular “bias” in favor of one group over another.
Item Response Theory

Item response theory (IRT) is the most common method used by psychometricians for analyzing item-level testing data that can be scored dichotomously: as either correct or incorrect. Typically, correct responses are coded as 1 and incorrect as 0. IRT is characterized by a set of logistic models that differ in terms of the amount of information provided about individual items. For example, the 1 parameter logistic (1PL) model links performance on an individual item only to the examinee’s proficiency, or ability, on the trait being measured and the difficulty of the item. One of the primary advantages of the IRT modeling framework is that item difficulty is on the same scale as an examinee’s ability on the latent trait being measured (e.g., proficiency in math), making a direct comparison between the two possible. The 1PL item response function (IRF) takes the following form:

P(Ui = 1 | θ) = 1 / (1 + e^(−(θ − bi))),

where Ui has a value of 1 for a correct answer and 0 for an incorrect answer, θ is the ability of the examinee on the trait measured by the test, and bi is the difficulty for item i. Both θ and bi range from −∞ to ∞. When bi is smaller than θ, the probability of a correct answer is greater than one-half. We can visualize this model using an item characteristic curve (ICC). Take, for example, an item with a difficulty value of 0, which would be considered average. Theoretically, difficulty ranges from −∞ to ∞. The ICC for this model appears in Figure 1. As an individual’s score on the latent trait being measured increases (x-axis), so does the probability of correctly answering the item (y-axis). In this case, the item difficulty of 0 corresponds to the value of the latent trait for which the probability of a correct response is 0.5. For the 1PL model, this interpretation of difficulty will always be the case, although for the more complex models discussed below this is not true.

Figure 1. Item characteristic curve (ICC) for the one parameter logistic (1PL) item response theory (IRT) model. The x-axis is the value of a latent trait. The y-axis is the probability of a correct response.

The 1PL model makes the assumption that all items are equally good (or bad) at discriminating examinees who have high proficiency from those who do not. If we do not want to make such an assumption, we can modify the 1PL model slightly by including item discrimination to get
the 2 parameter logistic (2PL) model. The 2PL item response function is

P(Ui = 1 | θ) = 1 / (1 + e^(−1.7ai(θ − bi))),

where ai is the discrimination parameter for item i and the other terms are defined as before. Figure 2 contains examples of 2PL ICCs for two items. Note that the item discrimination parameter value (ai) impacts the steepness of the slope, whereas the difficulty parameter value (bi) indicates the location of the IRF. In this case, item 1 has a=2.0 and b=−0.5, whereas item 2 has a=0.5 and b=0.5. Thus, given these values, we can conclude that it is easier to obtain a correct answer to item 1 than item 2 and that item 1 is better at differentiating among examinees in terms of their overall location on the latent trait being measured.

The scaling constant of −1.7 in the 2PL model is included in order to align this model with a similar one based on the normal ogive. The normal ogive function, as described by R. P. McDonald in Test Theory: A Unified Treatment, is really just the cumulative normal distribution function. A given value of the normal ogive is simply the area to the left of a specific standard normal (Z) value. The inclusion of the constant makes the logistic function and the normal ogive essentially the same.

Figure 2. Item characteristic curve (ICC) for the two parameter logistic (2PL) item response theory (IRT) model. The x-axis is the value of a latent trait. The y-axis is the probability of a correct response. Item 1: a=2.0, b=−0.5. Item 2: a=0.5, b=0.5.

Figure 3. Item characteristic curve (ICC) for the three parameter logistic (3PL) item response theory (IRT) model. The x-axis is the value of a latent trait. The y-axis is the probability of a correct response. Item 1: a=2.0, b=−0.5, c=0. Item 2: a=2.0, b=−0.5, c=0.25.

Finally, if it is possible for examinees to correctly answer an item by chance (i.e., guessing), the 3 parameter logistic (3PL) model can be used by including a chance parameter. The 3PL IRF is

P(Ui = 1 | θ) = ci + (1 − ci) / (1 + e^(−1.7ai(θ − bi))),
where ci is the chance parameter for item i and is between zero and 1. Figure 3 contains ICCs for two 3PL items. In this case, item 1 has a=2.0, b=−0.5, and c=0, whereas item 2 has a=2.0, b=−0.5, and c=0.25. The difficulty and discrimination parameter values have the same function in the 3PL as they did in the other IRT models. The ci parameter value serves as the minimum probability of a correct item response and is generally known as the pseudo-guessing parameter. For this example, there is a 0 probability that an examinee will answer item 1 correctly due strictly to chance, whereas for item 2 the probability of a correct response by chance is 0.25. The 3PL model is only appropriate in cases where a chance correct response to an item is possible. In many survey situations in which our interest is in item endorsement and correct or incorrect responding is not an issue, one of the other IRT models would be more appropriate. An example might include assessing subjects’ political views by asking them to answer yes or no to statements regarding their attitudes on current events.
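The three nested models above can be collected into a single function. A minimal sketch, with the 1.7 scaling constant applied as described (set D=1 to drop it for the plain 1PL form):

```python
import math

def icc(theta, b, a=1.0, c=0.0, D=1.7):
    """Probability of a correct (or endorsed) response under the 3PL model.
    Setting c=0 gives the 2PL; setting a=1, c=0, D=1 gives the 1PL."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Figure 3's two items: identical a and b, but item 2 has a chance floor
p1 = icc(-0.5, b=-0.5, a=2.0, c=0.0)   # 0.5 at theta = b when c = 0
p2 = icc(-0.5, b=-0.5, a=2.0, c=0.25)  # 0.25 + 0.75 * 0.5 = 0.625
```

At theta = b, item 1 gives probability 0.5 while item 2 gives 0.625: the chance floor plus half of the remaining range, exactly the relationship the Figure 3 caption describes.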
Women’s Rights Survey for Ninth-Grade Students

In 1999, the National Center for Education Statistics (NCES) conducted a large-scale, nationwide study of ninth-grade students’ knowledge of democratic practices and governance and their attitudes regarding diversity, international relations, and national identity. (The interested reader can learn more about this study at the NCES web site http://nces.ed.gov/surveys/CivEd/.) Among the scales used in this study was one devoted to attitudes regarding the rights and roles of women in society.
A total of eight Likert items (statements to which a respondent expresses agreement or disagreement) were included, with four possible responses ranging from Strongly Disagree to Strongly Agree, with Don’t Know as a fifth option. These items appear in the sidebar “Items on the Women's Rights Subscale.” For the purpose of the following analyses, student responses were recoded as either Agree (Strongly Agree and Agree) or Disagree (Strongly Disagree and Disagree). Furthermore, these dichotomous items were coded so that more traditional views of women were given the value 1 and less traditional were given a 0. Don’t Know responses were coded as missing for the purposes of the following analyses. The resulting scale can be seen, therefore, as measuring the degree to which respondents hold traditional views of women in society. A nationwide sample of 3,022 ninth-grade students was selected for inclusion in this study, using a complex survey design involving clustering and stratification. A total of 2,767 students responded to the eight items with one of the four Disagree to Agree responses, leading to a response rate of 91.6%. For the purposes of this analysis and to keep the length of the article reasonable, the item “Men are better qualified as political leaders” was selected as the target for IRT modeling. This item was chosen because of a belief by civics education professionals that it may prove to be interesting with regard to differences between female and male respondents. Specifically, it was felt that gender-specific response patterns might diverge—even for male respondents who had generally similar views to females in terms both of equal pay for equal work and the need for men and women to have equal opportunities in society.
Item Response Theory Results for Women’s Rights

The analysis of the target item, “Men are better qualified as political leaders,” may first involve the estimation of item difficulty and discrimination using the 2PL IRT model. In a testing context, item difficulty reflects the location of the item on the latent trait scale. Easier items have lower difficulty parameter values, indicating that individuals with lower levels of proficiency have a higher probability of answering the item correctly. “Discrimination” refers to the ability of the item to differentiate among students based on their underlying proficiency on the trait being measured (e.g., knowledge of math). Higher discrimination values suggest that the item is better able to separate those students with higher proficiencies from those with lower proficiencies. In the current case, the eight items on the Women’s Rights scale can be thought of as measuring the latent opinion of respondents regarding the rights and roles of women in society. The total scale score is an estimate of the actual opinion score, which remains latent and unobserved. As described above, this opinion score can also be estimated using IRT. As an example of the type of information that this analysis can provide, let’s consider the target item in this study. The item parameter estimates for this item for the entire group of subjects, as well as results by gender, appear in Table 1. When the groups are considered together, as described by R. J. de Ayala in The Theory and Practice of Item Response Theory, item 8 would be considered to have a large bi value with a relatively high ai value. In this case, the larger bi would indicate that survey respondents are relatively unlikely to agree or strongly
Table 1—Item difficulty (level of agreement) and discrimination parameter estimates (standard errors) for the target item “Men are better qualified as political leaders”

Group              Difficulty (Agreement)   Discrimination
All (n=2,767)      1.238 (0.055)            1.327 (0.112)
Male (n=1,375)     0.865 (0.061)            1.263 (0.133)
Female (n=1,392)   2.179 (0.216)            0.867 (0.128)
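As a rough illustration (my calculation, not the article's), plugging the gender-specific estimates from Table 1 into the 2PL response function with the 1.7 scaling constant gives the model-implied probability that a student of average opinion (theta = 0) endorses the item:

```python
import math

def endorse_prob(theta, a, b, D=1.7):
    """2PL probability of endorsing an item with discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

male = endorse_prob(0.0, a=1.263, b=0.865)    # Table 1 male estimates
female = endorse_prob(0.0, a=0.867, b=2.179)  # Table 1 female estimates
```

At theta = 0 the male probability (about 0.13) is several times the female probability (about 0.04). These figures differ from the observed endorsement rates of 0.252 and 0.081 because the observed rates average over the whole trait distribution rather than evaluating at a single trait value.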
Items on the Women’s Rights Subscale

A scale devoted to attitudes regarding the rights and roles of women in society was included in the 1999 survey of ninth-grade students by the U.S. Department of Education. A total of eight Likert items were included, with four possible responses ranging from Strongly Disagree to Strongly Agree, with Don’t Know as a fifth option. Here are the items:

1. Girls have fewer chances in life than boys.
2. Women have fewer chances in life than men.
3. Women should run for political office.
4. Women have the same rights as men.
5. Women should stay out of politics.
6. Men should have more rights than women when jobs are scarce.
7. Men and women should receive equal pay for the same job.
8. Men are better qualified as political leaders.
agree with the item, whereas the large ai means that this item is able to accurately discriminate between those with relatively high versus low scores on the Women’s Rights scale. Average bi is generally considered to be around 0, whereas “easy” items have difficulty values roughly below -2.0 and “hard” items have difficulty values above 2.0. Although the notion of difficulty on a cognitive assessment is intuitively clear, “difficulty” in the context of an opinion item such as this one might not be. It is important to remember that the difficulty parameter is really a measure of the likelihood of a particular item being endorsed by respondents, with higher difficulty values corresponding to a respondent needing a higher level of proficiency to endorse the item. When the task at hand is a cognitive assessment, this translates into the level of proficiency needed to answer the item correctly. In the case of an opinion survey such as this one, on the other hand, endorsing an item means agreeing with it. Therefore, a high difficulty value would indicate that the respondent would need a high level of the latent trait (i.e., a more traditional opinion regarding the role of women in society) to endorse the item. In this context, discrimination refers to the extent to which the item is able to differentiate those with a more traditional view of women from those with a less traditional view, with higher values indicating more discriminatory power.

An examination of these values in Table 1 suggests that this item has a larger bi than average for the sample as a whole, meaning that respondents must have a very traditional view of women’s roles in society in general to endorse the idea that men are better qualified to be political leaders. We can compare this value with the proportion of individuals who endorsed the item, which was 0.165, or 457 of the 2,767 respondents. The ai value for the full sample is also large, suggesting that this item is effective at separating those with relatively more traditional views of women from those with less traditional views. Because we hypothesized that female students may feel differently than males about the relative merits of women in positions of political leadership, it is worthwhile to compare the bi and ai values for the sample by gender. To obtain these parameter estimates by gender, it is necessary to fit the model for males and females separately. The item bi and ai values appear by gender in Table 1. In the sample, this item had a higher bi value for females than for males, suggesting that for a female student to endorse this item, she would need to have a very traditional view of the role of women in society. Males would also need a more traditional view than average to endorse the item, but in the sample it would not need to be as traditional as that of the females. Of the 1,392 female respondents, 113 (0.081) endorsed the notion that males are better qualified as political leaders, while 347 of the 1,375 male respondents (0.252) did the same. The item has a higher discrimination value for male respondents than for females, suggesting that it might be more effective at differentiating more and less traditional boys from one another than it would be for girls.

An issue that might be of some interest to researchers is whether there are differences between genders in these item parameter values, after controlling for respondents’ levels on the Women’s Rights scale.

Figure 4. Example of uniform DIF for a 2PL model. The x-axis is the value of the latent trait (from -4 to 4); the y-axis is the probability of a correct response. For group 1, item 1 has a = 1.0, b = 2.0; for group 2, item 1 has a = 1.0, b = -2.0.
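The uniform DIF pattern that Figure 4 illustrates is easy to reproduce numerically. The sketch below is an illustration, not the author's code: it evaluates the 2PL item characteristic curve at the parameter values given in the figure caption and confirms that one group's curve lies above the other's at every level of the latent trait.

```python
# 2PL item characteristic curves evaluated at Figure 4's parameter values.
import numpy as np

def icc_2pl(theta, a, b):
    """Probability of endorsing/answering the item under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 81)
p_g1 = icc_2pl(theta, a=1.0, b=2.0)    # group 1: a "hard" item
p_g2 = icc_2pl(theta, a=1.0, b=-2.0)   # group 2: same slope, much lower difficulty
# Uniform DIF: group 2's curve is higher at every point on the trait scale.
print(bool(np.all(p_g2 > p_g1)))  # True
```

Because the two curves share the same slope (a = 1.0) and differ only in location, the gap between them never closes, which is exactly what "uniform" means here.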
VOL. 23, NO. 2, 2010
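As a quick arithmetic check, the endorsement proportions reported for this item can be recomputed from the counts given in the text (the snippet below is illustrative helper code, not part of the published analysis):

```python
# Recompute the endorsement proportions quoted in the text.
counts = {"full sample": (457, 2767), "female": (113, 1392), "male": (347, 1375)}
for label, (endorsed, total) in counts.items():
    print(label, round(endorsed / total, 3))
# full sample 0.165 / female 0.081 / male 0.252
```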
Differential Item Functioning Analysis: Comparing IRFs Across Subsamples

Despite the fact that there are clearly differences in the difficulty parameter estimates between male and female respondents, this does not necessarily mean that such differences are present in the population. It is entirely possible that the apparently higher value exhibited by the males in the sample is due only to sampling variation. For this reason, some statistical analysis must be conducted to ascertain whether there may be a difference in the population as a whole. In the context of IRT, the comparison of item parameter values between two groups is referred to as “differential item functioning” (DIF) analysis. There are two types of DIF that can be present for a given item: uniform and nonuniform. Uniform DIF is focused on the difficulty parameter (bi), whereas nonuniform DIF is focused on the discrimination parameter (ai). In the context of educational measurement, “uniform DIF” refers to the situation in which, after holding constant the latent trait being measured by the test as a whole, the probability of members of one group providing a correct response is lower than the probability of members of the other group doing so. In other words, an item displays DIF when members of the two groups who are matched on the ability being measured by the instrument as a whole have different probabilities of responding correctly. Another way to think about DIF is that it represents a difference in model parameters between two groups. In this context, uniform DIF occurs when the location of the item (bi) differs between the two groups, while the presence of nonuniform DIF means that the slope (ai) of the curve relating the latent trait to the probability of endorsing the item differs between the groups. Figure 4 provides a generic example of uniform DIF.
We can see that although the shape of the ICCs is the same for the two groups, they have very different location parameter (difficulty) values. Nonuniform DIF occurs when the probability of a correct response to the target item differs between the two groups but that difference is not constant across the latent trait continuum. It can be thought of as an interaction between the latent trait and the group membership, and indeed it is often tested in that way, as will become evident shortly. An example of nonuniform DIF appears in Figure 5. Unlike the case in Figure 4, the location of the item for the two groups is the same; however, their curves cross. For individuals with proficiency levels below 0, those in group 2 have a higher probability of correctly responding to the item, while for those with proficiency above 0, members of group 1 are more likely to respond correctly than are those in group 2. A key element of DIF analysis is the matching of individuals on the latent trait. If two groups differ on the trait being measured, it would only be expected that their rate of correct responses to an item measuring that trait would differ as well. For example, imagine that a college physics instructor gives her class an introductory physics proficiency test at the beginning of the semester. Some of the students in the class will
have already had physics and would thus be expected to have a higher proficiency in the subject overall. Therefore, for any given item on the test, it would not be surprising to find that the students who had taken physics previously would have a higher probability of answering the item correctly. Indeed, if the item is a good measure of physics knowledge, we would expect students with experience in physics to answer the item correctly at a higher rate than those who had never taken such a course. The issue of controlling for differences in the latent trait before comparing item parameter values is at the core of DIF analysis. For an item to demonstrate DIF, we must first match individuals in the two groups on the latent trait being measured by the instrument. Then, if there are differences in the probability of a correct response to the item for individuals who have the same level of the trait being measured, we can conclude that DIF is present. Thus, if we are able to match individuals from the two groups on their physics proficiency and still find a difference in the probability of a correct response to an item, we can conclude that DIF exists, which may be indicative of a problem that favors one group over the other, such as the way the item is written.
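The crossing-curves pattern that Figure 5 illustrates can be verified directly from its parameter values (group 1: a = 1.0, b = 0; group 2: a = 0.5, b = 0). The snippet below is a hypothetical sketch under the plain 2PL model, not code from the article:

```python
# Nonuniform DIF: 2PL curves with equal difficulty but different slopes cross.
import math

def icc_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Figure 5 values: group 1 has a = 1.0, group 2 has a = 0.5; both have b = 0.
below = (icc_2pl(-1.0, 1.0, 0.0), icc_2pl(-1.0, 0.5, 0.0))
above = (icc_2pl(+1.0, 1.0, 0.0), icc_2pl(+1.0, 0.5, 0.0))
print(below[0] < below[1])  # True: below 0, group 2 is more likely to respond correctly
print(above[0] > above[1])  # True: above 0, group 1 is more likely
```

The two curves meet at the shared difficulty value (probability 0.5 at theta = 0) and trade places on either side of it, which is why nonuniform DIF behaves like a group-by-trait interaction.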
Logistic Regression for DIF Analysis

There are a number of approaches available for assessing DIF. In this example, we will use logistic regression (LR). This choice is based on a number of factors, including research such as that by Swaminathan and Rogers showing that LR is an effective tool for DIF assessment. Additionally, it is widely used in areas outside of psychometrics, with attendant familiarity for most statisticians. LR also has the ability to simultaneously assess both uniform and nonuniform DIF effectively. The LR model for DIF detection is

ln[p(ui) / (1 − p(ui))] = β0 + β1θ + β2g + β3θg.
In the full model, p(ui) is the probability of person i responding correctly to the item (coded dichotomously so that 1 = correct and 0 = incorrect); θ represents proficiency on the trait being measured; g represents the group identifier; and θg represents the interaction between group membership and proficiency. For LR, proficiency is estimated as the total score on the instrument, excluding the item being tested for DIF, and group is typically coded as 0 or 1. The βs represent the intercept and the slopes for proficiency, group, and the interaction, respectively. The comparison of the log-odds of a correct response for the two groups is done controlling for the proficiency level of each member of the sample. In this way, we ensure that differences on this proficiency that might affect the response to the item are factored out when group response patterns are compared. A similar argument can be made for the testing of the interaction term. The examination of items for DIF with LR involves the estimation of three models and the comparison of the resulting chi-square fit statistics. The models differ with respect to the inclusion of the main effect for group and the interaction of group and the proficiency value. Significant differences among these models indicate the presence of uniform and/or nonuniform DIF. The reader interested in the details of this testing should refer to Bruno Zumbo’s book, A Handbook on the Theory and Methods of Differential Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework for Binary and Likert-Type (Ordinal) Item Scores.

Figure 5. Example of nonuniform DIF for a 2PL model. The x-axis is the value of the latent trait (from -4 to 4); the y-axis is the probability of a correct response. For group 1, item 2 has a = 1.0, b = 0; for group 2, item 2 has a = 0.5, b = 0.

In DIF research, an effect size is typically used in conjunction with the hypothesis test to determine the degree of DIF that is present. The preferred effect size measure for LR in the context of DIF, according to Zumbo, is the change in R² values among the models (∆R²). According to Michael G. Jodoin and Mark Gierl in their Applied Measurement in Education article, “Evaluating Type I Error and Power Rates Using an Effect Size Measure with the Logistic Regression Procedure for DIF Detection,” the most current guidelines for interpreting this value suggest that ∆R² < .035 indicates negligible DIF; ∆R² between .035 and .070 indicates moderate DIF; and ∆R² > .070 indicates large DIF. In general practice, the effect size values are only used for items that exhibit DIF through at least one statistically significant test.
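The three-model procedure can be sketched in code. The example below is an illustration on simulated data, not the survey data analyzed in this article: it fits the nested logistic regressions by Newton-Raphson maximum likelihood in plain NumPy, forms the two likelihood-ratio chi-square tests, and computes an effect size. Note one loud assumption: the published ∆R² guidelines are based on a different R² measure; the McFadden pseudo-R² used here is merely a stand-in for illustration.

```python
# Three nested LR models for DIF detection, fit on simulated data.
# ASSUMPTION: McFadden pseudo-R^2 stands in for the R^2 of the guidelines.
import numpy as np

rng = np.random.default_rng(42)
n = 2000
theta = rng.normal(size=n)            # proficiency (in practice, the rest score)
group = rng.integers(0, 2, size=n)    # group indicator coded 0/1
# Simulate uniform DIF: same slope, but the item is "harder" for group 1.
true_logit = 1.2 * theta - 0.9 * group - 0.4
u = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

def fit_logistic(X, y, iters=25):
    """Newton-Raphson ML fit; returns the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

ones = np.ones(n)
ll0 = fit_logistic(ones[:, None], u)                          # intercept only
ll1 = fit_logistic(np.column_stack([ones, theta]), u)         # proficiency
ll2 = fit_logistic(np.column_stack([ones, theta, group]), u)  # + group
ll3 = fit_logistic(np.column_stack([ones, theta, group,
                                    theta * group]), u)       # + interaction

# Likelihood-ratio chi-square statistics, each on 1 df (3.84 = .05 cutoff).
g2_uniform = 2 * (ll2 - ll1)      # uniform DIF: group main effect
g2_nonuniform = 2 * (ll3 - ll2)   # nonuniform DIF: group-by-theta interaction

# Effect size: change in pseudo-R^2 between the smallest and fullest models,
# classified with the Jodoin-Gierl cutoffs quoted above.
r2 = lambda ll: 1.0 - ll / ll0
delta_r2 = r2(ll3) - r2(ll1)
label = ("negligible" if delta_r2 < 0.035 else
         "moderate" if delta_r2 <= 0.070 else "large")
print(round(g2_uniform, 1), round(g2_nonuniform, 1), label)
```

Because the data were simulated with a group main effect and no interaction, the first statistic should be large while the second behaves like noise; in a real analysis both tests would be read together with the effect size, as described above.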
DIF Analysis of Women’s Rights Scale

As noted above, the sample estimates of difficulty and discrimination differed between male and female respondents for the item “Men are better qualified as political leaders.” However, it is not known whether these differences are due to sampling variation or to some systematic divergence between the groups in the population. LR will be used to determine which is likely to be the case. Using the methodology outlined above, the chi-square values appearing in Table 2 on the following page were obtained. The difference between the values for the full model and the model without the interaction between group and the total score is not statistically significant (p=0.66); however, the difference between this second model and the model containing only the proficiency estimate is (p