Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7691–7697, July 1997 Colloquium Paper
This paper serves as an introduction to the following papers which were presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala and Walter M. Fitch, held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Genetics and the origin of species: An introduction FRANCISCO J. AYALA*
AND
WALTER M. FITCH
Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92697-2525
Theodosius Dobzhansky (1900–1975) was a key author of the Synthetic Theory of Evolution, also known as the Modern Synthesis of Evolutionary Theory, which embodies a complex array of biological knowledge centered around Darwin’s theory of evolution by natural selection couched in genetic terms. The epithet ‘‘synthetic’’ primarily alludes to the artful combination of Darwin’s natural selection with Mendelian genetics, but also to the incorporation of relevant knowledge from biological disciplines. In the 1920s and 1930s several theorists had developed mathematical accounts of natural selection as a genetic process. Dobzhansky’s Genetics and the Origin of Species, published in 1937 (1), refashioned their formulations in language that biologists could understand, dressed the equations with natural history and experimental population genetics, and extended the synthesis to speciation and other cardinal problems omitted by the mathematicians. The current Synthetic Theory has grown around that original synthesis. It is not just one single hypothesis (or theory) with its corroborating evidence, but a multidisciplinary body of knowledge bearing on biological evolution, an amalgam of well-established theories and working hypotheses, together with the observations and experiments that support accepted hypotheses (and falsify rejected ones), which jointly seek to explain the evolutionary process and its outcomes. These hypotheses, observations, and experiments often originate in disciplines such as genetics, embryology, zoology, botany, paleontology, and molecular biology. Currently, the ‘‘synthetic’’ epithet is often omitted and the compilation of relevant knowledge is simply known as the Theory of Evolution. This is still expanding, just like one of those ‘‘holding’’ business corporations that have grown around an original enterprise, but continue incorporating new profitable enterprises and discarding unprofitable ones.
have the best chance of surviving and of procreating their kind? On the other hand, we may feel sure that any variation in the least degree injurious would be rigidly destroyed. This preservation of favorable variations and the rejection of injurious variations, I call Natural Selection.’’ Darwin’s argument is that natural selection emerges as a necessary conclusion from two premises: (i) the assumption that hereditary variations useful to organisms occur, and (ii) the observation that more individuals are produced than can possibly survive. The most serious difficulty facing Darwin’s evolutionary theory was the lack of an adequate theory of inheritance that would account for the preservation through the generations of the variations on which natural selection was supposed to act. Theories then current of ‘‘blending inheritance’’ proposed that offspring merely struck an average between the characteristics of their parents. As Darwin became aware, blending inheritance could not account for the conservation of variations, because differences among variant offspring would be halved each generation, rapidly reducing the original variation to the average of the preexisting characteristics. The missing link in Darwin’s argument was provided by Mendelian genetics. About the time the Origin of Species was published, the Augustinian monk Gregor Mendel was performing a long series of experiments with peas in the garden of his monastery in Bru ¨nn, Austria-Hungary (now Brno, Czech Republic). Mendel’s paper, published in 1866, formulated the fundamental principles of a theory of heredity that accounts for biological inheritance through particulate factors (now called ‘‘genes’’) inherited one from each parent, which do not mix or blend but segregate in the formation of the sex cells, or gametes (3). Mendel’s discoveries, however, remained unknown to Darwin and, indeed, did not become generally known until 1900, when they were simultaneously rediscovered by several scientists. In the meantime, Darwinism in the latter part of the 19th century faced an alternative evolutionary theory known as neo-Lamarckism. This hypothesis shared with Lamarck’s original theory the importance of use and disuse in the development and obliteration of organs, and it added the notion that the environment acts directly on organic structures, which explained their adaptation to the ways of life and environments of each organism. Adherents of this theory rejected natural selection as an explanation for adaptation to the environment. The rediscovery in 1900 of Mendel’s theory of heredity led to an emphasis on the role of heredity in evolution. In the Netherlands, Hugo de Vries (4) proposed a new theory of evolution known as mutationism, which essentially did away
Darwin to Dobzhansky Darwin summarized the theory of evolution by natural selection in the Origin of Species (2) as follows: ‘‘As many more individuals are produced than can possibly survive, there must in every case be a struggle for existence, either one individual with another of the same species, or with the individuals of distinct species, or with the physical conditions of life. . . . Can it, then, be thought improbable, seeing that variations useful to man have undoubtedly occurred, that other variations, useful in some way to each being in the great and complex battle of life, should sometimes occur in the course of thousands of generations? If such do occur, can we doubt (remembering that many more individuals are born than can possibly survive) that individuals having any advantage, however slight, over others, would
Abbreviation: MHC, major histocompatibility complex. *To whom reprint requests should be addressed at: Department of Ecology and Evolutionary Biology, University of California, 321 Steinhaus Hall, Irvine, CA 92697-2525. e-mail: FJAYALA@ UCI.EDU.
© 1997 by The National Academy of Sciences 0027-8424y97y947691-7$2.00y0 PNAS is available online at http:yywww.pnas.org.
7691
7692
Colloquium Paper: Ayala and Fitch
with natural selection as a major evolutionary process. According to de Vries (joined by other geneticists such as William Bateson in England), there are two kinds of variation in organisms. One is the ‘‘ordinary’’ variation observed among individuals of a species, which is of no lasting consequence in evolution because, according to de Vries, it could not ‘‘lead to a transgression of the species border even under conditions of the most stringent and continued selection.’’ The other consists of the changes brought about by mutations, spontaneous alterations of genes that yield large modifications of the organism and give rise to new species. According to de Vries, a new species originates suddenly, produced by the existing one without any visible preparation and without transition. Mutationism was opposed by many naturalists, and in particular by the so-called biometricians, led by Briton Karl Pearson, who defended Darwinian natural selection as the major cause of evolution through the cumulative effects of small, continuous, individual variations (which the biometricians assumed passed from one generation to the next without being subject to Mendel’s laws of inheritance). The controversy between mutationists (also referred to at the time as Mendelians) and biometricians approached a resolution in the 1920s and 1930s through the theoretical work of several geneticists (5). These geneticists used mathematical arguments to show, first, that continuous variation (in such characteristics as size, number of eggs laid, and the like) could be explained by Mendel’s laws; and second, that natural selection acting cumulatively on small variations could yield major evolutionary changes in form and function. Distinguished members of this group of theoretical geneticists were R. A. Fisher and J. B. S. Haldane in Britain and Sewall Wright in the United States (6–8). Their work contributed to the downfall of mutationism and, most importantly, provided a theoretical framework for the integration of genetics into Darwin’s theory of natural selection. Yet their work had a limited impact on contemporary biologists because it was formulated in a mathematical language that most biologists could not understand; because it was almost exclusively theoretical, with little empirical corroboration; and because it was limited in scope, largely omitting many issues, such as speciation, that were of great importance to evolutionists. Dobzhansky’s Genetics and the Origin of Species advanced a reasonably comprehensive account of the evolutionary process in genetic terms, laced with experimental evidence supporting the theoretical arguments. It had an enormous impact on naturalists and experimental biologists, who rapidly embraced the new understanding of the evolutionary process as one of genetic change in populations. Interest in evolutionary studies was greatly stimulated, and contributions to the theory soon began to follow, extending the synthesis of genetics and natural selection to a variety of biological fields. The main writers who, together with Dobzhansky, may be considered the architects of the synthetic theory were the zoologists Ernst Mayr (9) and Julian Huxley (10), the paleontologist George G. Simpson (11), and the botanist George Ledyard Stebbins (12). [The National Academy of Sciences held in January 1994 a colloquium (13) to commemorate the 50th anniversary of the publication of Simpson’s seminal book, Tempo and Mode in Evolution (11).] These researchers contributed to a burst of evolutionary studies in the traditional biological disciplines and in some emerging ones—notably population genetics and, later, evolutionary ecology. By 1950 acceptance of Darwin’s theory of evolution by natural selection was universal among biologists, and the synthetic theory had become widely adopted. The line of thought of Genetics and the Origin of Species is surprisingly modern—in part, no doubt, because it established the pattern that successive evolutionary investigations and treatises largely would follow. Dobzhansky writes in the preface: ‘‘The problem of evolution may be approached in two
Proc. Natl. Acad. Sci. USA 94 (1997) different ways. First, the sequence of the evolutionary events as they have actually taken place in the past history of various organisms may be traced. Second, the mechanisms that bring about evolutionary changes may be studied. . . . The present book is dedicated to a discussion of the mechanisms of species formation in terms of the known facts and theories of genetics.’’ The book starts with a consideration of organic diversity and discontinuity. Successively, it deals with mutation as the origin of hereditary variation, the role of chromosomal rearrangements, variation in natural populations, natural selection, the origin of species by polyploidy, the origin of species through gradual development of reproductive isolation, physiological and genetic differences between species, and the concept of species as natural units. The book’s organization was largely preserved in the second (1941) and third (1951) editions, and in Genetics of the Evolutionary Process (14), published in 1970, a book that Dobzhansky thought of as the fourth edition of the earlier one, but had changed too much for publication under the same title. Dobzhansky sought to extend the evolutionary synthesis to mankind in numerous articles and several books, most notably Mankind Evolving (15), published in 1962, a book that many judge to be as important as Genetics and the Origin of Species. Dobzhansky was a leading experimentalist and prolific writer, who published several books and nearly 600 papers dealing with leading questions in population and evolutionary genetics, as well as with philosophical problems and humanistic issues. The experimental organisms of most of his research were Drosophila fruitflies. A Man for All Seasons Theodosius Dobzhansky was born on January 25, 1900, in Nemirov, a small town 200 km southeast of Kiev in the Ukraine. He was the only child of Sophia Voinarsky and Grigory Dobrzhansky (precise transliteration of the Russian family name includes the letter ‘‘r’’), a teacher of high school mathematics. In 1910 the family moved to the outskirts of Kiev, where Dobzhansky lived through the tumultuous years of World War I and the Bolshevik revolution. In those times the family was often beset by various privations, including hunger. In his unpublished autobiographical Reminiscences for the ‘‘Oral History Project’’ of Columbia University, Dobzhansky states that his decision to become a biologist was made about 1912. Through his early high school years, Dobzhansky became an avid butterfly collector. A school teacher gave him access to a microscope that Dobzhansky used particularly during the long winter months. In the winter of 1915–1916 he met Victor Luchnik, a 25-year-old college drop-out, who was a dedicated entomologist specializing in Coccinellidae beetles. Luchnik convinced Dobzhansky that butterfly collecting would not lead anywhere and that he should become a specialist. Dobzhansky chose to work with ladybird beetles, which would be the subject of his first scientific publication in 1918. (Reference to Dobzhansky’s publications can be found in the extensive bibliography published by the National Academy of Sciences, ref. 16.) Dobzhansky graduated in biology from the University of Kiev in 1921. Before his graduation, he was hired as an instructor in zoology at the Polytechnic Institute in Kiev. He taught there until 1924, when he became an assistant to Yuri Filipchenko, head of the new department of genetics at the University of Leningrad. Filipchenko was familiar with Thomas Hunt Morgan’s work in the United States and had started a Drosophila laboratory, where Dobzhansky was encouraged to investigate the pleiotropic effects of genes. In 1927, Dobzhansky obtained a fellowship from the International Education Board (Rockefeller Foundation) and arrived in New York on December 27 to work with Thomas Hunt Morgan at Columbia University. In the summer of 1928 he
Colloquium Paper: Ayala and Fitch followed Morgan to the California Institute of Technology, where Dobzhansky was appointed assistant professor of genetics in 1929, and professor of genetics in 1936. In 1940 he returned to New York as professor of zoology at Columbia University, where he remained until 1962, when he became professor at the Rockefeller Institute (renamed Rockefeller University in 1965) also in New York City. On July 1, 1970, Dobzhansky became professor emeritus at Rockefeller University; in September 1971, he moved to the Department of Genetics at the University of California, Davis, where he was adjunct professor until his death in 1975. On August 8, 1924, Dobzhansky married Natalia (Natasha) Sivertzev, a geneticist in her own right, who was at the time working with the famous Russian biologist I. I. Schmalhausen in Kiev. Natasha was Dobzhansky’s faithful companion and occasional scientific collaborator until her death from coronary thrombosis on February 22, 1969. The Dobzhanskys had only one child, Sophie, married until her recent death to Michael D. Coe, professor of anthropology at Yale University. In a routine medical check-up on June 1, 1968, it was discovered that Dobzhansky suffered from chronic lymphatic leukemia, the least malignant form of leukemia. He was given a prognosis of ‘‘a few months to a few years’’ of life expectancy. Over the following 7 years, the progress of the leukemia was unexpectedly slow and, surprising to his physicians, it had little if any noticeable effect on his energy and work habits. However, the disease took a conspicuous turn for the worse in the summer of 1975. In mid-November Dobzhansky started to receive chemotherapy, but continued living at home and working at the laboratory. He was convinced that the end of his life was near and dreaded that he might become unable to work and to care for himself. This never came to pass. He died of heart failure on the morning of December 18, 1975. The previous day, he had been working in the laboratory. Dobzhansky was an excellent teacher and distinguished educator of scientists. Throughout his academic career he had more than 30 graduate students and an even greater number of postdoctoral and visiting associates, many of them from foreign countries. Some of the most distinguished geneticists and evolutionists in the United States and abroad are his former students. Dobzhansky spent long periods of time in foreign academic institutions, and was largely responsible for the establishment or development of genetics and evolutionary biology in various countries, notably Brazil, Chile, and Egypt. Dobzhansky gave generously of his time to other scientists, particularly to young ones and to students. But he resented time spent in committee activities, which he shunned as often as he reasonably could. Throughout his academic career, he avoided administrative posts, alleging, perhaps correctly, that he had neither temperament nor ability for management. Most certainly, he preferred to dedicate his working time to research and writing rather than to administration. Dobzhansky was a world traveler and an accomplished linguist able to speak fluently six languages and to read several more. He was a good naturalist and never lacked time for a hike in the California Sierras, the New England forests, or the Amazon jungles. He loved horseback riding but practiced no other sports. Dobzhansky’s interests included the visual arts, music, history, Russian literature, cultural anthropology, philosophy, religion, and, of course, science. His artistic preferences were unsystematic and definitely traditional. His favorite composer was Beethoven followed by Bach and other baroques; he loved Italian operas, but had little appreciation for most twentieth century music and a definite distaste for atonalism. (Of electronic and computer-composed music, he said that it is fit only for computers to listen to it.) In art, Dobzhansky admired the Italian Renaissance painters as well as the Dutch and Spanish masters of the seventeenth century; he appreciated the French Impressionists but detested cubism and all subsequent styles and schools of modern art.
Proc. Natl. Acad. Sci. USA 94 (1997)
7693
Dobzhansky’s obvious personality traits were magnanimity and expansiveness. He recognized and generously praised the achievements of other scientists; he admired the intellect of his colleagues, even when admiration was alloyed with disagreement. He made many long-lasting friendships, usually started by professional interaction. Many of Dobzhansky’s friends were scientists younger than himself, who either had worked in his laboratory as students, postdoctorals, or visitors, or had met him during his travels. He was conspicuously affectionate and loyal toward his friends; he expected affection and loyalty in return. Dobzhansky’s exuberant personality was manifest not only in his friendships but also in his antipathies, which he was seldom able, or willing, to hide. Dobzhansky was a religious man, although he apparently rejected fundamental beliefs of traditional religion, such as the existence of a personal God and of life beyond physical death. His religiosity was grounded on the conviction that there is meaning in the universe. He saw that meaning in the fact that evolution has produced the stupendous diversity of the living world and has progressed from primitive forms of life to mankind. Dobzhansky held that, in man, biological evolution has transcended itself into the realm of self-awareness and culture. He believed that somehow mankind would eventually evolve into higher levels of harmony and creativity. He was a metaphysical optimist. Dobzhansky’s prodigious scientific productivity was made possible by incredible energy and very disciplined work habits. His enormous success as the creator of new ideas and as a synthesizer was, at least in part, based on his broad knowledge, phenomenal memory, and an incisive mind able to see the relevance that a new discovery or a new theory might have with respect to other theories or problems. His success as an experimentalist depended on a wise blending of field and laboratory research; whenever possible he combined both in the study of a problem, often using laboratory studies to ascertain or to confirm the causal processes involved in the phenomena discovered in nature. He obtained the collaboration of mathematicians to design theoretical models for experimental testing and to analyze statistically his empirical observations. He was no inventor or gadgeteer, but he had an uncanny ability to exploit the possibilities of any suitable experimental apparatus or experimental method. Dobzhansky received many honors and awards. He was president of several professional organizations, including the Genetics Society of America (1941), the American Society of Naturalists (1950), the Society for the Study of Evolution (1951), the American Society of Zoologists (1963), the American Teilhard de Chardin Association (1969), and the Behavior Genetics Association (1973). He was a member of the National Academy of Sciences, the American Academy of Arts and Sciences, the American Philosophical Society, and of many foreign academies, such as the Royal Society of London. He received more than 20 honorary degrees from universities in the United States and abroad. He received the Daniel Giraud Elliot Medal (1946) and the Kimber Genetics Award (1958) from the National Academy of Sciences and numerous other medals, including the National Medal of Science, which he received in January 1964 from President Lyndon Baines Johnson (16, 17). The 16 papers that follow were presented at a colloquium sponsored by the National Academy of Sciences to celebrate the 60th anniversary of the publication of Dobzhansky’s Genetics and the Origin of Species. These papers are organized into four successive sections: Genetic Variation and Its Origins, Adaptation and Natural Selection, Population Differentiation and Speciation, and Patterns of Evolution. Genetic Variation and Its Origins In 1937, when Dobzhansky published Genetics and the Origin of Species (1), the DNA structure was not yet discovered, nor
7694
Colloquium Paper: Ayala and Fitch
were there any grounds to anticipate the tremendous impact that molecular biology would have on evolutionary research. We now know how genes are organized and function, and we can ask primeval questions such as what the original organisms were like or how ur-genes were organized. Walter Gilbert advanced in 1987 ‘‘the exon theory of genes’’ (18; see also 19) contending that introns have been around since the progenote, the earliest genetic organism, as spacers between the early, simple genes, and were thereafter used to assemble the complex genes that would later evolve as coalitions of the primitive ones. This hypothesis has been challenged with the alternative proposal that introns came about late in evolution and had nothing to do with the arrangement and rearrangement of gene pieces. Walter Gilbert, S. J. de Souza, and M. Long in ‘‘Origin of Genes’’ (20) review the two theories, as well as an intermediate position proposing that introns arose at the beginning of multicellularity and played a major role during the Cambrian explosion in creating new genes by exon shuffling. The authors argue that if exon shuffling originated with the progenote, exons should consist of an integer number of codons and should be correlated with compact regions of polypeptides. The evidence that they now present, they say, strongly supports the case. The ultimate source of genetic variation was thought to be, at the time of the publication of Genetics and the Origin of Species, gene mutation. Dobzhansky was soon to realize that chromosomal mutations could also play important roles in the evolution sweepstakes. The significance of the transposable elements, first discovered by Barbara McClintock in the 1940s, would become apparent only several decades later. Transposable elements, say Margaret G. Kidwell and Damon Lisch (21), are ubiquitous in many kinds of organisms and account for 10–15% of the Drosophila’s genome and more than 50% of maize’s. Transposable elements provide, indeed, genetic variation on a scale and variety that could hardly have been imagined even a few years ago. Kidwell and Lisch point out the manifold effects of transposable elements. In the genotype, they are involved in many gene mutations, are ubiquitous, and incessantly shift their numbers and locations. Transposable elements modify phenotypes as well, subtly in some cases, causing drastic alternations of development and organization in others. From an evolutionary perspective, transposable elements may be seen as parasites of genomes, but like with other parasites, organisms have often become coadapted with them and have even learned to subvert them for their own benefit. The word ‘‘virus’’ does not appear in the index of any of the three editions of Genetics and the Origin of Species. By 1970, when Genetics of the Evolutionary Process (14) was published, viruses had become a favored organism of molecular genetics, and the term ‘‘viruses’’ is represented by six entries in the index, mostly referring to bacteriophages, but there is also a discussion of the myxomatosis virus, introduced in 1950 in Australia to control a rabbit population explosion. Two decades later, the accumulation of virus gene sequences combined with the development of new phylogenetic methodologies has brought viruses into the mainstream of molecular evolution. Important insights that have been gained concern evolutionary processes but also epidemiology, public health, and geographic patterns of human migrations. Walter M. Fitch and colleagues (22) investigate the HA1 domain of the hemagglutinin gene from human influenza A viruses isolated throughout the world from 1984 to 1996. The gene is evolving at a rate of 5.7 3 1023 substitutions per site per year, about one million times faster than cellular genes. In several positions of hemagglutinin a majority of the nucleotide substitutions are nonsynonymous—i.e., result in amino acid replacements—which strongly supports positive Darwinian selection rather than neutral evolution. The authors aver that
Proc. Natl. Acad. Sci. USA 94 (1997) gene sequence phylogenies may manifest which isolates are most likely to cause future epidemics and might therefore be used for vaccine production. Dobzhansky’s interest in human genetic diversity was motivated by science but also by his enduring concern with the human predicament. He saw that the pervasiveness of genetic diversity was the foundation of human individuality but provided no grounds for any sort of discrimination. Equality—as in equality in law and equality of opportunity—‘‘pertains to the rights and the sacredness of life of every human being’’ (ref. 23, p. 4). In Mankind Evolving (15, p. 18) he wrote that ‘‘Human evolution has two components, the biological or organic, and the cultural or superorganic. These components are neither mutually exclusive nor independent, but interrelated and interdependent. Human evolution cannot be understood as a purely biological process, nor can it be adequately described as a history of culture. It is the interaction of biology and culture. There exists a feedback between biological and cultural processes.’’ For more than three decades, L. L. Cavalli-Sforza has sought to elucidate the geographic origins and dispersal patterns of human populations by investigating gene frequency distributions. Genetic information has accumulated exponentially, encompassing protein-encoding genes, nuclear and mitochondrial, as well as microsatellite and other DNA sequences. ‘‘Genes, Peoples, and Languages’’ (24) emphasizes the African origin of modern humans, whence the other continents were colonized starting '100,000 years ago, first West Asia, then East Asia and Oceania, both probably through the coastal route of South Asia, and later Europe and America, both from East Asia and the latter certainly from the north, via the Bering land passage created in the ice ages. Cavalli-Sforza sees that the genetic conclusions are confirmed by trees of linguistic families, although these are temporally shallow. Adaptation and Natural Selection Starting in the late 1960s gel electrophoresis of soluble enzymes uncovered stores of genetic variation, much greater than had been suspected, in all sorts of animal and plant populations, as well as bacteria and other microorganisms. Whether this variation is adaptively important or just neutral noise became a matter of debate. The 1980s ushered in populational DNA sequencing. Much additional variation was discovered in the form of nucleotide differences between haplotypes. We now know that any two haplotypes of any gene differ on the average by several nucleotide substitutions, although most do not yield amino acid differences in the encoded protein. The neutral-selection controversy rages on. Richard R. Hudson and collaborators (25) investigate the Sod gene (coding for the Cu,Zn superoxide dismutase) in Drosophila melanogaster, where an unusual polymorphism prevails. At the protein level two alleles, Fast and Slow, are discerned, with Slow absent in some populations but reaching frequencies '5–15% in many others. It turns out that all Slow alleles have identical DNA sequences (with trivial exceptions) even when they originate from different world continents. The Fast alleles fall into two categories: roughly half of them are identical, whether they come from Europe, Asia, or the Americas; the other half are heterogeneous, most of them distinguished from each other by several nucleotide differences. Adding to the puzzle is that the Fast alleles that are identical to each other are also identical to the Slow alleles except for the one nucleotide substitution that accounts for their different amino acid composition. Hudson and collaborators conclude that within the last few thousand years a previously rare allele has rapidly risen in frequency to the present levels. The process was driven by fairly strong natural selection.
Colloquium Paper: Ayala and Fitch Polymorphisms shared between species were investigated long and hard by Dobzhansky, mostly chromosomal rearrangements present in two closely related species, Drosophila pseudoobscura and Drosophila persimilis. DNA sequencing has uncovered numerous trans-specific polymorphisms, notably in the genes of the major histocompatibility complex (MHC) of mammals, where some shared alleles have persisted for millions of years. In plants of the family Solanaceae, alleles that are self-incompatible in fertilization have persisted across species barriers for 70 million years. Drosophila species also share DNA sequence polymorphisms that are several million years old. Andrew G. Clark (26) develops mathematical models seeking to elucidate the causes of trans-specific shared polymorphisms. The shared self-incompatibility polymorphisms of plants and MHC alleles of humans and other primates are maintained by strong natural selection, because the protein products accrue a fitness advantage to the bearer of those alleles if they are different. The Drosophila polymorphisms, however, are recent enough that they might have persisted by neutral drift. Three decades ago, Zuckerkandl and Pauling (27) conjectured that morphological evolution is largely caused by changes in the expression of genes, rather than in the amino acid sequence of the encoded polypeptides. Natalia A. Tamarina, Michael Z. Ludwig, and Rollin C. Richmond (28) explore the issue in two homologous genes in two species, D. melanogaster (Est-6) and D. pseudoobscura (Est-5B). The coding regions of these two genes share 80% of their nucleotide and amino acid sequences. The regions flanking the genes are, in contrast, so different that it is difficult to align their sequences to ascertain homology. Tamarina and colleagues (28) make recourse to the magician’s bag of tricks available to Drosophila geneticists. They pick up regulatory DNA segments from D. pseudoobscura and introduce them in the appropriate locations of D. melanogaster. The expression of the gene in the D. melanogaster transgenic flies becomes substantially altered. The expression of the two genes in normal flies follows similar patterns, yet the generegulating apparatus has become different in the two species. It was not until the 1970s that demography was integrated into the theory of the dynamics of natural selection. Population genetics theory had until then treated all individuals in a population as effectively equivalent, without corroboration of longevity, age-dependent fecundity, and other life history parameters. The beginnings of an integration of the theories of population ecology and population genetics appeared in the 1970s, although this integration never engaged much attention from theorists or experimentalists, perhaps because of the many complexities involved. Dobzhansky and some of his students and collaborators made important experimental contributions to the problems (see refs. 29–35). Wyatt W. Anderson and Takao K. Watanabe (36) analyze life history schedules of births and deaths to investigate the outcome of laboratory population experiments involving several chromosomal arrangements of D. pseudoobscura in various combinations. Coincidentally, it happens that all possible genetic outcomes occur: stable polymorphic equilibrium, unstable polymorphic equilibrium, and fixation for one of the alternatives. The authors conclude that, in these populations, both viability and fertility are important fitness components. Age-dependent female fecundity plays a particularly significant role in the outcome. Population Differentiation and Speciation The concept of species is fundamental in evolutionary theory. The modern understanding of this concept can be traced to 1935 when Dobzhansky introduced what is now known as the ‘‘biological species concept’’ (37). Dobzhansky defined species
Proc. Natl. Acad. Sci. USA 94 (1997)
7695
as ‘‘that stage in the evolutionary process at which the once actually or potentially interbreeding array of forms becomes segregated in two or more separate arrays which are physiologically incapable of interbreeding’’ (37; also ref. 1, p. 312). Dobzhansky saw that the species is not only a category of classification, but in sexual organisms also a natural unit defined by the ability to interbreed or its absence. He called attention to the determining role played by reproduction ‘‘isolating mechanisms,’’ a term that he created. The biological species concept has recently been challenged on the grounds that it unduly neglects phylogeny. John C. Avise and Kurt Wollenberg (38) examine this criticism by bringing to bear recent gene coalescence theory with an analysis of multiple, gender-defined pathways in genealogical pedigrees. They conclude that the supposed sharp distinction between the biological species concept and the phylogenetic constructs favored by the critics is illusory. ‘‘Historical descent and reproductive ties,’’ they write, ‘‘are related aspects of phylogeny, and jointly illuminate biotic discontinuity.’’ Among the reproductive isolating mechanisms identified by Dobzhansky (1) there was one, later called ‘‘gametic isolation,’’ occurring when the ‘‘spermatozoa fail to reach the eggs, or to penetrate into the eggs; in higher plants, the pollen tube growth may be arrested if foreign pollen is placed on the stigma of the flower.’’ Therese Markow (39) notes that the investigation of gametic isolation as an evolutionary mechanism has been unduly neglected. In the genus Drosophila alone, a huge diversification exists in the size and pattern of gametes and other internal reproductive traits affecting fertilization. For fertilization to occur, ‘‘sperm must successfully enter the female and be transported to the storage organs. . . [and] must stay alive with adequate motility until they are utilized by the female.’’ Markow examines how these steps fail in different cases and draws a richly patterned quilt that one can see will likely be much extended as other organisms are investigated. The evolutionary possibilities by which these variegations may come about are virtually infinite. Coniferous forests and oak woodlands along the North American Pacific Coast are inhabited by Ensatina terrestrial salamanders. Several species were thought to occur in California, but detailed morphological and coloration analysis led to the conclusion in 1949 that various forms were parts of only one polytypic species arranged in the form of a ring around the Central Valley of California (40). Dobzhansky (41) saw that virtually all stages in a speciation process could be identified along the ring, with complete reproductive isolation between the terminal populations meeting in the southern part of the valley. In Dobzhansky’s view speciation was thwarted by ongoing gene flow via the intermediate populations around the ring. Wake and colleagues demonstrated in the late 1980s (42–44) that gene flow could not hold the complex together: an analysis of protein variation in 19 populations along the ring disclosed great genetic differentiation among populations. David B. Wake (45) reviews mitochondrial DNA and other variation. The Ensatina population array is old, consisting of a number of geographically and genetically distinct components that have reached or approximate full species level. The evolutionary history elucidated is extremely complex, with repeated interludes of geographic separation and genetic interactions upon renewed contact. Peter R. Grant and Rosemary Grant (46) see that Dobzhansky’s Genetics and the Origin of Species is an appropriate starting point for investigating the speciation process and the underlying genetic changes. But in one respect, they note, Dobzhansky’s book is disappointing because it says nothing about the genetics of birds, which are their consuming research interest. Birds are made to serve a good purpose for illustrating geographical patterns of morphological variation within species, adaptation to newly colonized habitats, rapid radiation in
7696
Proc. Natl. Acad. Sci. USA 94 (1997)
Colloquium Paper: Ayala and Fitch
archipelagos, and interspecies competition. The evolution of reproductive isolation is considered, but ‘‘the genetics of speciation are the genetics of other organisms, mainly Drosophila.’’ Peter and Rosemary Grant note differences between speciation in birds and speciation in Drosophila. It is significant that in birds the behavioral barriers that prevent mating evolve first, whereas post-mating isolation typically evolves much later, perhaps after gene exchange has all but ceased. Premating isolation in birds may arise from nongenetic causes, often from factors such as song, which in many groups of birds is culturally inherited through an imprinting-like process. Of the factors involved in pre-mating isolation, such as plumage, morphology, and behavior, some are under single-gene control, but most are polygenetically determined. Patterns of Evolution The universal tree of life consists of three domains, or ‘‘empires,’’ bacteria, archaea, and eukarya. The three multicellular kingdoms, animals, plants, and fungi, are just 3 of the 10–12 extant major branches of the eukaryote domain. Molecular evolutionary investigations in the last decade have elucidated the large genetic diversity encompassed by the set of all eukaryotes and, hence, the reduced proportion represented by the multicellular kingdoms. The existence and great genetic heterogeneity of the archaea have been discovered by molecular evolutionists also in the last few years, and so have been most of the species and higher taxonomic groups. The reconstruction of the universal tree and the assessment of the genetic diversity of each branch are buttressed by the hypothesis of the molecular clock of evolution, which has multifarious other applications in other evolutionary studies. How good is the molecular clock? It has been known for some time that the time variance of molecular evolutionary events is larger than would be expected if the molecular clock were a stochastic clock, like the radioactive decay of isotopes. Francisco Ayala in ‘‘Vagaries of the Molecular Clock’’ (47) reviews two clocks, the genes Gpdh and Sod, investigated in his laboratory. Gpdh evolves in Drosophila very slowly, at a rate of 1.1 3 10210 amino acid replacements per site per year. But the rate is much faster, '4.5 3 10210 in mammals, between Dipteran families, between animal phyla, or between plants, animals, and fungi. On the other hand, Sod evolves very fast in Drosophila, '16 3 10210, which is also the rate in mammals and between Dipteran families; but the rate becomes much slower, 5.3 3 10210, between animal phyla, and still slower, 3.3 3 10210, between plants, animals, and fungi. If one were to assume that Gpdh and Sod are good clocks and project the Drosophila rate to estimate the time of divergence of the three multicellular kingdoms, Gpdh would yield an estimate of 3,990 million years, Sod an estimate of 224 million years, both very much off the commonly accepted divergence time of '1,100 million years. It is unlikely that many molecular clocks are as erratic as Gpdh or Sod, but molecular clocks should be applied with caution, particularly when remote extrapolations are made. The hypothesis of the molecular clock was originally predicated on the assumption that the evolutionary replacement of one amino acid for another, or one nucleotide for another is most often of no adaptive consequence. If such assumption would obtain, the process of molecular evolution would be governed by a time-dependent stochastic process. The assumption of adaptive inconsequence seems safest in the case of synonymous nucleotide substitutions, which do not change the amino acids encoded by a gene. Jeffrey Powell and Etsuko Moriyama (48) explore a vexing problem, namely that organisms do not use alternative synonymous codons with the frequencies expected if synonymous substitutions were inconsequential. The deviations from random expectations are large
in Drosophila genes and they often persist through long periods of evolution. Powell and Moriyama (48) exclude differential mutation rates as the cause of the codon bias. Rather, they conclude that natural selection is the cause. The determining factor is the relative abundance of the tRNAs that execute the translation of genes into proteins: genes that are expressed at high rates favor codons that match those tRNAs that are more abundant. The genes in the nucleus of plants often occur as ‘‘families’’—i.e., a gene encoding a particular polypeptide may exist in several copies of more or less remote evolutionary origin. Michael Clegg, Michael P. Cummings, and Mary L. Durbin investigate ‘‘The Evolution of Plant Nuclear Genes’’ (49) by focusing on three gene families, rbcS, Chs, and Adh. Additional copies are recruited at different rates in these families: new Chs and rbcS genes are recruited 20 times faster than Adh genes. The multiplication of gene copies and their divergence is particularly notable for Chs genes in the evolution of flowering plants. The evolution of Adh in monocot plants is not consistent with the molecular clock hypothesis even for synonymous nucleotide substitutions. Clegg and colleagues conclude that natural selection plays a significant role in driving the evolutionary divergence of duplicated genes. They add that new alleles often arise by intragene recombination (49). Multigene families occur in animals as well as in plants. Notable in humans and other mammals are genes associated with the immune system, such as the MHC genes and immunoglobulin (Ig) genes. Some multigene families, in animals as in plants, arise by concerted evolution, a process that generates new genes by interlocus recombination or gene conversion. Masatoshi Nei, Xun Gu and Tatyana Sitnikova (50) raise the question whether concerted evolution may account for the MHC and Ig families, as some authors have suggested. They note that member genes of these families are often more similar to homologous genes from different species than they are to other member genes within the same species. This would not be expected if concerted evolution were the main originating process of gene multiplication within a family. Phylogenetic analyses of several MHC and Ig multigene families display patterns inconsistent with the concerted evolution hypothesis. The evidence favors the conclusion that the creation of new genes by gene duplication has repeatedly occurred in the evolutionary history of organisms. Some duplicated genes persist in the diversified descendant species for a long time; others effectively disappear, either because they are deleted or have become nonfunctional by deleterious mutations. We are grateful to the National Academy of Sciences for the generous grant that financed the colloquium and to Kenneth Fulton and Edward Patte, and the staff of the Arnold and Mabel Beckman Center for their skill and generous assistance during the colloquium and its preparation. Special gratitude is owed to Denise Chilcote, who was responsible for the colloquium’s logistics at all stages, and for her gracious and dedicated performance. Most of all, we are grateful to the speakers and their co-authors for their wonderful contribution to the colloquium and in the papers that follow. We have borrowed extensively from ref. 16 in the preparation of Dobzhansky’s biographical statement. 1. 2. 3. 4. 5. 6.
Dobzhansky, Th. (1937) Genetics and the Origin of Species (Columbia Univ. Press, New York); 2nd Ed., 1941; 3rd Ed., 1951. Darwin, C. (1859) On the Origin of Species by Means of Natural Selection (Murray, London). Mendel, G. (1866) Verh. Naturforsch. Vereines Abhandlungen Bru ¨nn 4, 3–47. de Vries, H. (1900) Rev. Gen. Bot. 12, 257–271. Provine, W. G. (1971) The Origins of Theoretical Population Genetics (Univ. of Chicago Press, Chicago). Fisher, R. A. (1930) The Genetical Theory of Natural Selection (Clarendon, Oxford).
Proc. Natl. Acad. Sci. USA 94 (1997)
Colloquium Paper: Ayala and Fitch 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27.
Haldane, J. B. S. (1932) The Causes of Evolution (Harper, New York). Wright, S. (1931) Genetics 16, 97–159. Mayr, E. (1942) Systematics and the Origin of Species (Columbia Univ. Press, New York). Huxley, J. S. (1942) Evolution: The Modern Synthesis (Harper, New York). Simpson, G. G. (1944) Tempo and Mode in Evolution (Columbia Univ. Press, New York). Stebbins, G. L. (1950) Variation and Evolution in Plants (Columbia Univ. Press, New York). Fitch, W. M. & Ayala, F. J., eds. (1995) Tempo and Mode in Evolution (National Academy Press, Washington, DC). Dobzhansky, Th. (1970) Genetics of the Evolutionary Process (Columbia Univ. Press, New York). Dobzhansky, Th. (1962) Mankind Evolving (Yale Univ. Press, New Haven, CT). Ayala, F. J. (1985) Biogr. Mem. Natl. Acad. Sci. U.S.A 55, 163–213. Ayala, F. J. (1990) in Dictionary of Scientific Biography, ed. Gillespie, C. C. (Scribner’s, New York), Vol. 17, Suppl. II, pp. 233–242. Gilbert, W. (1987) Cold Spring Harbor Symp. Quant. Biol. 52, 901–905. Doolittle, W. F. (1978) Nature (London) 272, 581–582. Gilbert, W., de Souza, S. J., & Long, M. (1997) Proc. Natl. Acad. Sci. USA 94, 7698–7703. Kidwell, M. G. & Lisch, D. (1997) Proc. Natl. Acad. Sci. USA 94, 7704–7711. Fitch, W. M., Bush, R. M., Bender, C. A. & Cox, N. J. (1997) Proc. Natl. Acad. Sci. USA 94, 7712–7718. Dobzhansky, Th. (1973) Genetic Diversity and Human Equality (Basic Books, New York). Cavalli-Sforza, L. L. (1997) Proc. Natl. Acad. Sci. USA 94, 7719–7724. Hudson, R. R., Sa´ez, A. G. & Ayala, F. J. (1997) Proc. Natl. Acad. Sci. USA 94, 7725–7729. Clark, A. G. (1997) Proc. Natl. Acad. Sci. USA 94, 7730–7734. Zuckerkandl, E. & Pauling, L. (1965) in Evolving Genes and Proteins, eds. Bryson, V. & Vogel, H. J. (Academic, New York), pp. 97–166.
28. 29. 30. 31.
32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44.
45. 46. 47. 48. 49. 50.
7697
Tamarina, N. A., Ludwig, M. Z. & Richmond, R. C. (1997) Proc. Natl. Acad. Sci. USA 94, 7735–7741. Beardmore, J. A., Dobzhansky, Th. & Pavlovsky, O. A. (1960) Heredity 14, 19–33. Dobzhansky, Th., Lewontin, R. C. & Pavlovsky, O. (1964) Heredity 22, 169–186. Ayala, F. J. (1970) in Essays in Evolution and Genetics in Honor of Theodosius Dobzhansky, eds. Hecht, M. K. & Steere, W. C. (Appleton–Century–Drofts, New York), pp. 121–158. Ayala, F. J. (1969) Can. J. Genet. Cytol. 11, 439–456. Mueller, L. D. & Ayala, F. J. (1981) Genetics 97, 667–677. Mueller, L. D. & Ayala, F. J. (1981) Proc. Natl. Acad. Sci. USA 78, 1303–1305. Mueller, L. D., Guo, P. & Ayala, F. J. (1991) Science 253, 433–435. Anderson, W. W. & Watanabe, T. K. (1997) Proc. Natl. Acad. Sci. USA 94, 7742–7747. Dobzhansky, Th. (1935) Philos. Sci. 2, 344–355. Avise, J. C. & Wollenberg, K. (1997) Proc. Natl. Acad. Sci. USA 94, 7748–7755. Markow, T. A. (1997) Proc. Natl. Acad. Sci. USA 94, 7756–7760. Stebbins, R. C. (1949) Univ. Calif. Publ. Zool. 48, 377–526. Dobzhansky, Th. (1958) A Century of Darwin, ed. Barnett, S. A. (Harvard Univ. Press, Cambridge, MA), pp. 19–55. Wake, D. B. & Yanev, K. P. (1986) Evolution 40, 702–715. Wake, D. B., Yanev, K. P. & Brown, C. W. (1986) Evolution 40, 866–868. Wake, D. B., Yanev, K. P. & Frelow, M. M. (1989) in Speciation and its Consequences, eds. Otte, D. & Endler, J. A. (Sinauer, Sunderland, MA), pp. 134–157. Wake, D. B. (1997) Proc. Natl. Acad. Sci. USA 94, 7761–7767. Grant, P. R. & Grant, B. R. (1997) Proc. Natl. Acad. Sci. USA 94, 7768–7775. Ayala, F. J. (1997) Proc. Natl. Acad. Sci. USA 94, 7776–7783. Powell, J. R. & Moriyama, E. N. (1997) Proc. Natl. Acad. Sci. USA 94, 7784–7790. Clegg, M. T., Cummings, M. P. & Durbin, M. L. (1997) Proc. Natl. Acad. Sci. USA 94, 7791–7798. Nei, M., Gu, X. & Sitnikova, T. (1997) Proc. Natl. Acad. Sci. USA 94, 7799–7806.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7698–7703, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Origin of Genes (intronyexonymoduleyevolution)
WALTER GILBERT*, SANDRO J. DE SOUZA,
AND
MANYUAN L ONG
Department of Molecular and Cellular Biology, Biological Laboratories, Harvard University, 16 Divinity Avenue, Cambridge, MA 02138
view is that introns were added very late in evolution, even in the last few million years, and thus have nothing to do with the rearrangement of pieces of genes. There is no exon shuffling on this picture. A third, intermediate view, popular in its own right, is that the introns arose at the initiation of multicellularity. In this picture, the Cambrian explosion used introns to create exon shuffling and a profusion of new genes. The idea of exon shuffling is that introns are as hot spots for genetic recombination, which is a property that introns would have solely because of their length. Introns affect the rate of homologous recombination between exons in a way that scales with length, but, more importantly, they affect nonhomologous recombination as the square of their length. Consider a new gene made by a new combination of regions of earlier genes by an unequal crossing-over event, a rare event at the DNA level, that matches small, similar sequences between two DNAs. To make a new protein that contains the first part of one protein with the second part of another requires such a rare, and in frame, event. However, if the regions that encode parts of the protein are separated by 1,000–10,000-base-long introns along the DNA, a process of unequal crossing-over occurring anywhere within that intron between the exons will create a new combination of exons. There is a combinatorial number of ways to find the matching of short sequences to initiate the unequal crossing over, and thus the recombination process will go a million to a hundred million times faster in the presence of an intron. This is a great enhancement of the rate of creation of new genes. The Exon Theory of Genes (2) is a specific statement of the idea that the first genes were made of small pieces. The crucial elements of that theory are that the very first genes and exons represented small polypeptide chains '15–20 amino acids long, that the basic method used by evolution to make new genes was to shuffle the exons, and that a major trend of evolution was then to lose introns and to fuse small exons together to make complicated exons. (The first enzymes probably were aggregates of such short gene products, but these ur-exons were soon tied together by an intronyexon system so that the proteins would have a covalently connected backbone.) The dominant evolutionary processes are thus to be recombination within introns, the sliding and drift of introns to change amino acid sequence around their borders, and the loss of introns, which can change the gene structure but does not affect protein structure. The strength of this concept is its argument that one searches sequence space not by amino acids and point mutations but by larger elements. We might compare a protein to a sentence. It is easier to understand the sentence as made up of words rather than simply as a string of letters.
ABSTRACT We discuss two tests of the hypothesis that the first genes were assembled from exons. The hypothesis of exon shuff ling in the progenote predicts that intron phases will be correlated so that exons will be an integer number of codons and predicts that the exons will be correlated with compact regions of polypeptide chain. These predictions have been tested on ancient conserved proteins (proteins without introns in prokaryotes but with introns in eukaryotes) and hold with high statistical significance. We conclude that introns are correlated with compact features of proteins 15-, 22-, or 30-amino acid residues long, as was predicted by ‘‘The Exon Theory of Genes.’’ The role of introns and exons in the history of genes has been the subject of debate between two extreme positions. One side holds that introns were used to assemble the first genes, an ‘‘introns-early’’ view (1, 2), and the other side maintains that introns were added during evolution to break up previously continuous genes, an ‘‘introns-late’’ view (3, 4). This discussion has a significant impact on our conceptions about the way genes were constructed in the first cells. Unfortunately, the two sides make opposing judgments about each piece of evidence, and no decisive evidence has yet been agreed upon. For example, in the context of phylogenies, that bacteria have no introns whereas vertebrates have many introns is interpreted differently by the two sides. One view is that introns were there originally and were simply lost; the alternative view is that they were gained. In homologous genes, one often finds introns in similar but not identical positions between genes separated by great evolutionary distances. The early-intronists say that these positions represent the same original intron, possibly moved slightly in position (intron drift or sliding). The late-intronists say that it is obvious that introns could not have existed in such closely neighboring positions in a single original gene and that, because introns could not have moved, these near coincidences must be evidence of insertion. There have been efforts to correlate introns with the threedimensional structure of proteins. The introns-late view denies that there are any such correlations and asserts that introns behave as though they were inserted randomly into the structure of genes (5). Alternatively, the early-intron position generally affirms such a connection but, up to now, has not been able to muster any strong statistical evidence. Recently, however, we have defined such a correlation in a way that yields strong statistical support (6). There are three possible scenarios for the evolutionary history of introns. One is that there were introns at the very beginning of evolution and that during evolution they were lost or, possibly, mostly lost and some added. This complex of ideas is ‘‘The Exon Theory of Genes’’ (2). The extreme alternative © 1997 by The National Academy of Sciences 0027-8424y97y947698-6$2.00y0 PNAS is available online at http:yywww.pnas.org.
*To whom reprint requests should be addressed.
7698
Colloquium Paper: Gilbert et al. How are introns lost? The most direct way is retroposition. A spliced RNA transcript of a gene with an intronyexon structure is copied back into cDNA by a reverse transcriptase, and that DNA is inserted into the chromosome within an intron of a previously existing gene. Splicing can now make that element serve as a complex exon in a new gene. (This process makes pseudogenes, if the reinsertion does not fall under a promoter.) A clear example of this process was worked out in the jingwei gene system in Drosophila (7). The argument that the first exons were 15–20 amino acids long does not have direct support in today’s exon distribution, which peaks at lengths of 35–40-amino acid residues (8). In terms of exon fusion, we expect that there has been, on average, two or three acts of fusion in going from the original 15–20-amino acid long exons to the pieces that are being shuffled today. That protein evolution begins with 15–20-residue polypeptides, essentially small ORFs, whose products are just long enough to have some shape in solution (or as an aggregate) provides an answer to the classic problem of how long proteins evolved. Although it is impossible to find one of (20)200 sequences by a random process (there is not enough carbon in the universe), all short fragments 15–20-residues long can be found in a few mols of material. Although we have described the Exon Theory of Genes as involving DNA-based introns and exons, the theory flows naturally out of an RNA world view (9) that (i) pictures RNA genetic material creating (by splicing) RNA enzymes to do all of the biochemistry, (ii) introduces then activated amino acids, one by one, to build up oligopeptides to support ribozyme function, and, finally, (iii) uses 20 amino acids, short exons, and mRNA splicing to create protein enzymes. This RNA world picture is supported by the ribosome’s RNA-based peptidebond catalysis, by the spliceosome’s RNA enzyme-based splicing mechanism, and by the essential RNA involvement in DNA synthesis and the biosynthesis of the DNA precursors (10). How can one devise any proofs or disproofs of these attitudes about the origin of genes? The polar views make different predictions, which can be tested (8, 11). The theory that introns are present today because there was exon shuffling in the original genes makes certain predictions about intron position and phase whereas theories that the introns were added to DNA sequence by a random process make different predictions. One example of such an introns-added theory is the hypothesis of a transposable element that bears splicing signals on its ends. If such an element were to insert into a gene, its RNA transcript would be spliced out, and the gene product would be unaffected. An element of this kind could spread through the genome as selfish DNA and put introns everywhere. Intron Phase Predictions The first set of predictions involves intron phase, the position of the intron within a codon. Even though there is no signature on the message after an intron has been spliced out, the intron position along the DNA can be referenced to the ultimate protein sequence. An intron can lie either between the codons, phase 0, after the first base, phase 1, or after the second base, phase 2. This is an evolutionarily conserved property if the intron remains present in the gene. If the introns had been inserted into the DNA, there would be no ‘‘phase’’ preference at the point of insertion. That insertion, as a DNA process, could take note of DNA sequence but not protein sequence. If, on the other hand, the exons had been shuffled and exchanged, the simplest model would have all introns in the same phase so that every combination between exons would work. Thus, introns-early predicts phase bias, and introns-late predicts (in its simplest form) equal numbers in each phase.
Proc. Natl. Acad. Sci. USA 94 (1997)
7699
A second property is phase correlation. Consider an exon bounded by introns. If the two introns had been inserted into a continuous gene, there could be no necessary relation between the phases of the intron that lies before and the intron that lies after the exon. The two events of insertion, and hence the phases, should be uncorrelated. On the other hand, if the exon had been inserted into a previously existing intron, then the phases of the intron on either side should be the same so that the reading frame will continue across it. That is, exon shuffling suggests that exons should be multiples of three bases. Intron addition makes no such commitment. How might one test these predictions? We constructed a database of exons by going to GenBank, identifying all genes with introns, purging that set to remove related genes, and getting a set of quasi-independent genes: 1636 genes with 9192 internal exons from GenBank 84 (8). We then looked at a special subset of those eukaryotic genes: those that have homologous sequences in the prokaryotes. In our database, there were 296 such genes with 1496 introns (8). These genes have the following essential property: They are prokaryotic genes that are colinear with a region of an eukaryotic gene. The prokaryotic gene has no introns; the region of eukaryotic gene has introns. Any introns-late model requires that all of these introns be inserted because there cannot have been exon shuffling for these sequences. Although one might argue in general, for eukaryotic sequences, that they could have been made by exon shuffling, these particular parts of eukaryotic sequence cannot be so made because they are orthologous and thus derive from the cenancestor. In an introns-late picture, all the introns in these homologous regions must be derived. They must have been inserted, and so they should show no phase bias and no phase correlations. According to the introns-early model, there should be such correlations because some or all of these introns originated in the progenote where these genes were assembled by exon shuffling. In fact, this subset of introns in ancient, conserved regions does show a phase bias: 55% are in phase 0, 24% are in phase 1, and 21% are in phase 2. (The alternative model predicts 33%, 33%, and 33%.) Still more interesting, from a biological viewpoint, there is an excess of symmetric exons, symmetric pairs of exons, triples, and quadruples of exons. Table 1 shows these data (8). All of these sets show an excess of multiples of three, significant at about the 1% level. This is the first strong argument for the existence of ancient introns. The excess of symmetric exons in these ancient conserved regions is predicted in a simple way by the idea that the introns were used to assemble the first gene, but it is not predicted by an insertional model without special biochemical pleadings that forces this result to happen. (One might, in principle, argue that introns inserted into special sequences on the DNA, like AGuGT, and that these sequences might show a bias relative to amino acid sequence to generate a phase bias. To this one might then add the ad hoc assumption that the splicing mechanism sees both ends of the exon and likes it to be a Table 1.
Intron correlations in ancient conserved regions Observedyexpected
Sets of exons
Symmetric
Asymmetric
x2
P
1 2 3 4
562y515 (9%) 439y400 (10%) 348y309 (13%) 267y238 (12%)
725y772 530y569 400y439 312y341
7.1 6.5 8.4 6.0
0.008 0.011 0.004 0.014
The symmetric exons or exon sets begin and end in the same phase. The asymmetric sets begin and end in different phases. The expectations for the single exons were calculated from the observed intron phases. The expectations for the sets of exons were calculated from the prior observed frequencies of the subset exons. The percentage difference with the symmetric exon or exon sets is calculated as (observed 2 expected)yexpected.
7700
Colloquium Paper: Gilbert et al.
Proc. Natl. Acad. Sci. USA 94 (1997)
multiple of three.) A simpler interpretation is that there was exon shuffling in the formation of the first genes. Introns Correlate with Protein Structure If genes had been put together from small shuffled pieces, proteins should be made up of repeated small elements, possibly elements of folding, although the evolutionary argument only requires elements that can be subject to natural selection. Evolution requires only function; it does not require biochemical structure. The prediction of the Exon Theory of Genes is that there should be such elements that evolution has selected, which we call modules, and that they should be coextensive to exons. By module, we mean a region of polypeptide chain that can be circumscribed in space by a maximum diameter. If one traces the Ca positions along the backbone of the polypeptide chain in space and requires that all of the pairwise distances be less than some maximum value, then this region of the chain must fold back and forth in space. Putting a maximum length over the Ca distances, roughly speaking, means that, as that region of chain travels through space, it can be circumscribed by a sphere of that diameter. How might one define the boundaries of such modules? The module notion was first introduced by Mitiko Go in the early 1980s and was used to predict the existence of novel introns. She used it (12) to suggest that there should be a novel intron in globin and that the intron was later discovered in the leghemoglobin of plants. We used that same idea to predict the existence of positions in which one might find introns in triosephosphate isomerase (13, 14). One difficulty with this notion, and a challenge that the introns-late supporters have made, is that this concept of compactness does not provide a sharp view of where the boundaries are to be. The ‘‘spheres’’ overlap, and one does not have a clean definition of where one module stops and the next begins. We have converted this difficulty into a virtue (6) by suggesting that one should take the overlap regions between the spheres that surround the folded portions and, rather than asking for a single intron to be added at a precise position, define these overlaps as “boundary regions” within which introns might lie. Such boundary regions are designed such that, if one put an intron into each of those regions along the gene, the gene product would be dissected into modules less than the specified diameter. This notion is well defined, and one can now write a computer program to define those regions. Fig. 1 shows this definition of the boundary regions for globin. By constructing a distance plot, a Go plot, of all pairwise distances between Ca positions along the protein and marking all distances .28 Å in black, one can easily see that the five large triangles along the diagonal identify the longest segments of polypeptide chain that lie in 28-Å modules and that the four small overlap triangles define the boundary regions. The gene thus is divided into two portions, one that corresponds to modules and another that corresponds to boundary regions. This yields a very simple statistical test. The boundary regions are approximately one-third of a gene: Do introns lie preferentially in these regions, or are their positions random? Again, we considered ancient conserved regions, choosing ones that correspond to three-dimensional structures homologous to bacterial genes without introns and to eukaryotic sequences with introns, to ask: Do those intron positions in the eukaryotic homologs tend to lie in the boundary regions or do they not? The two theories predict quite opposite effects. These are all derived introns on the introns-late model, they were added to the preexisting gene, and their positions should be random. The early-intron model predicts that these positions should fall in the boundary regions.
FIG. 1. Go plot for horse hemoglobin. The black spots represent pairs of amino acids whose a-carbons are separated by 28 Å or more. The five large triangles correspond to modules. Boundary regions (BR) are defined by the overlap of these triangles.
Using 28 Å to define the modules, a size that was used before to define modules for triosephosphate isomerase or globin, we examined a set of 32 ancient proteins and a corresponding set of 570 intron positions. The random expectation was to find 182.5 introns in the boundary regions, but we found 214. That is a 17% excess, not a big number, but there are so many positions that the x2 is 8 and the P value is less than 0.005. One might wonder if there could be some other reason, rather than ancient introns, that introns might lie within the boundary regions. One possibility might be the existence of some special sequences in the boundary regions, or some sequence biases, that could serve as targets for insertion. We have examined the sequences in the boundary regions and do not find any particular sequence or compositional bias at the amino acid level or the DNA level. Occasionally, people conjecture that introns might have targeted sequences like AGG or AGGT, ‘‘proto-splicing’’ sequences, but there is no excess of those sequences in the boundary regions. Craik and his coworkers once suggested that introns might lie on the surface of proteins (15), thus one might think that the boundary regions perhaps are on the surface of the proteins and that is why introns are in those regions. However, in this set of proteins, neither the boundary regions nor the introns are biased toward the surface (6). So far, we have not been able to identify any bias-dependent model that would put introns into the boundary regions. The hypothesis we are testing, the Exon Theory of Genes, says that intron positions should lie within these boundary regions. Even though some introns may have been added in the course of evolution, even though some introns may have been lost, even though some introns may have moved, and even though the protein structure may have altered since it was put together, one can still see an excess there. A further argument that the excess of intron positions in the boundary regions is due to intron antiquity is found in the examination of an ‘‘ancient’’ subset of the intron positions. We examined those introns that have the same, or similar, positions in three of the four groups: plants, vertebrates, invertebrates, and fungi. Of the 20 introns in this subset, 13 lie in the boundary regions whereas only 6.5 are expected. That is a 100% excess, as opposed to the 17% excess overall. Thus, in a group that is selected to be ancient, we found a higher bias. (That bias was significantly different from the 17%; the x2 for the difference between 100% and 17% was 6.5, a P value ' 0.01.) This finding is further support for the idea that the
Colloquium Paper: Gilbert et al. underlying signal is due to ancient introns. If the pattern is simply one of biased insertion, then any subset should simply have a value ranging around the 17% excess. The 28 Å size was purely arbitrary, a particular one that we had used historically. It worked, and we had chosen that size before we knew that this analysis would work, but there was no profound reason for that size. Because we have a computer program that can take any diameter and decompose the protein into modules corresponding to that diameter, we can ask: Is there some optimal decomposition? Fig. 2 shows the results of varying the module diameters from 6 Å, which is one amino acid apart along the chain, out to 50 Å and plotting the x2 values for the significance of the excess of introns within the boundary regions. Fig. 2 shows three peaks of significance: one peak corresponding to an '21-Å diameter, one of an '28-Å diameter, and one of an '33-Å diameter. The peaks rise to probabilities ' 0.001. This is a strong statistical argument that there are three differently sized structural elements in these proteins that are correlated with intron positions. (One might worry that the curve shows a statistical calculation repeated a thousand times; if the phenomenon had been purely random, at least one of the points should have yielded a P value of 0.001. If one examines the underlying distribution of the excess of intron positions, one sees that it is robust: Smooth peaks appear in the excess of the observed intron positions over the expectations.) Thus, we conclude that intron positions are correlated with modules of three different diameters: 21 Å, 28 Å, and 33 Å. Can we understand these modules in a more informative way? We can ask about the average length of the polypeptide chain contained within each of these modules, which is equivalent to asking for the average length of the hypothetical exons predicted by the computer program. Fig. 3 shows that the 21-Å modules have an average length of 15 amino acid residues; the 28-Å modules have an average length of 22 residues; and the
FIG. 2. x2 distribution for the matching of intron positions to the boundary regions of 32 ancient proteins as a function of module diameter. The 570 intron positions were drawn from version 90 of GenBank. There are three major peaks of significance around module diameters of 21, 28, and 33 Å.
Proc. Natl. Acad. Sci. USA 94 (1997)
7701
FIG. 3. Lengths of predicted modules for the peaks of significance around 21, 28, and 33 Å. The three peaks correspond to distributions centered around 15, 22, and 30 amino acid residues in length.
33-Å modules have an average length of 30 residues (with a considerable spread). We have given a very strong statistical argument, with P values ' 0.001, that introns define elements of protein structure with sizes of 15, 22, and 30 residues. This feature is exactly what the Exon Theory of Genes suggested back in 1987. Recently, we went back to the database. Since the calculation was first done, there are 90 more introns, 662 in total, so we can redo the calculation to see if it is better or worse. Most of the novel intron positions have come in through the Caenorhabditis elegans project, so they represent great evolutionary distances from many that were in the database before. With the new data, the peaks improve in statistical significance. Fig. 4 shows that the peak at 21 Å rises to a x2 ' 19, and both it and the peak at 28 Å rise to a P value less than 0.0001. Currently, we are analyzing the shapes that make up these peaks. The peak at 21 Å, for the set of 32 proteins, arises from a set of 822 modules. Because we know the structures of the proteins, we know the three-dimensional structures of each of these modules, and we can search for signs of exon shuffling. The hypothesis that we are testing not only says that there should be correlations of ancient introns with these modules but also that there should be a pattern of reuse of these elements. What we expect to find is that some 21-Å regions, some 28-Å regions, and some 33-Å regions will have been used over and over again. Once we have a classification of shapes that are reused, we will ask for further evidence that those shapes correspond to shuffled exons. Such evidence would be that those modules that have been reused are ones correlated with introns or that those modules that have been reused show sequence similarities that would suggest a divergent evolution. At this time, we know these patterns only very crudely. The most common module is an a-helix followed by a turn and a strand; '8–10% of all shapes at 21 Å are of that form. Then there are strand–turn–helix shapes, helix–turn–helix, and strand–turn–strand shapes repeating in the 21 Å peak. The other peaks contain more complicated shapes.
7702
Colloquium Paper: Gilbert et al.
Proc. Natl. Acad. Sci. USA 94 (1997) introns have sharp statistical significance. As the databases continue to increase in the future, these tests will become even more convincing. Issues of Selection and Adaptation
FIG. 4. The same analysis shown in Fig. 2 (dashed line) was repeated using a database of intron positions based on GenBank version 96 (662 intron positions, continuous line). The peaks around 21, 28, and 33 Å now reach x2 values around 19, 15, and 13, respectively.
DISCUSSION We have reviewed here two strong, statistical arguments that there were ancient introns used to shuffle exons in the first genes. Both arguments detect a signal of the presence of ancient introns in today’s intron spectrum, over a background that could be due to new introns, to moved introns, or to mutation and change of the protein structures. Both arguments, intron phase correlations and intron correlation to modules, were applied to ancient conserved regions of gene sequence. These are regions of sequence conserved between prokaryotes and eukaryotes; thus, these genes, on any theory, came into existence early in evolution, possibly in the progenote, certainly in the cenancestor, the last common ancestor. These regions are colinear between the prokaryotic forms and their eukaryotic homologs. It is for these ancient conserved regions especially that the two theories make the most divergent predictions. All forms of introns-late theories assert that these genes came into existence before there were spliceosomal introns. Hence, all of the introns in their eukaryotic counterparts had to be inserted during the course of evolution; they must be derived characters because the prokaryotic form, on those theories, was created as a continuous whole. No exon shuffling can have intervened for these eukaryotic counterparts because they are colinear to the prokaryotic forms. Conversely, all introns-early theories predict that these proteins actually were assembled from exons in the progenote or later by exon shuffling. During the evolution of the prokaryotes, these theories predict that all of the introns were lost. Only in the eukaryotic forms did (some of) these introns survive. Thus, for these introns, one theory says all were added, and thus should obey random statistics, and the other theory predicts that the current introns will show correlations due to their ancient origin. The databases of gene sequences have so increased in size that one can show that these traces of ancient
Are the introns under selection? In general, we argue that they are not. The hypothesis that the role of introns was to speed up evolution by increasing the recombination between exons is not based on the idea that they therefore were selected for that use. Such an idea would be a wrong teleological view, i.e., that they are present because they aid future selection. Rather, our view is that they are present because the easy path in the past that lead to the creation of a gene used them and that they have not yet been removed by selective pressure. Although the introns are not under any selective pressure in general, where there has been pressure on DNA size there would have been loss of introns, such as in prokaryotes, Arabidopsis, or other small genome organisms. Drosophila, for example, recently has been shown to have a high deletion frequency for unneeded DNA (16) associated with genome slimming, which suggests that many current introns in Drosophila may be adaptive and be maintained by such features as enhancers or gene expression timing. Many introns in Drosophila are very small, which may reflect the deletion pressure for loss of sequence that still does not go to completion because of the difficulty of removing the intron exactly. Gerald Fink (17) suggested that, in S. cerevisae, a special mechanism (in that case a runaway reverse transcriptase) led to the loss of introns as a result of bombarding the genome with cDNA copies of spliced messengers. Could natural selection on added introns create the observed correlation between introns and protein features? Such models fail because, for these ancient conserved genes, they involve selection for a future purpose. One such model, for example, hypothesizes that, as introns are being added to these ancient (continuous) genes, a well formed exon is shuffled off for use in some other gene and hence selected for. In reality, selection could fix that novel exon in the new gene in the population, but that selection would fail to fix the correct ancestral (donor) form of the gene in the population. (If the organisms had sex, then the donor form is unlinked and hence not fixed. If the organism had only one linkage group, so that the donor form would be fixed by piggybacking, so too would all of the wrongly inserted introns everywhere in the genome.)
CONCLUSION We have examined a large set of introns in ancient conserved regions. All of these introns should have been derived, late features if the first genes had been continuous. We found, however, that these introns show patterns of correlation to the gene sequence and to the protein structure of the gene products that are consistent with the predictions of The Exon Theory of Genes. We thank the National Institutes of Health for support (Grant GM 37997). S.J.d.S. was supported by Fundacao de Amparo a Pesquisa do Estado de Sao Paulo and the PEW–Latin American Fellows Program. 1. 2. 3. 4. 5. 6. 7.
Doolittle, W. F. (1978) Nature (London) 272, 581–582. Gilbert, W. (1987) Cold Spring Harbor Symp. Quant. Biol. 52, 901–905. Palmer, J. D. & Logsdon, J. M. J. (1991) Curr. Opin. Genet. Dev. 1, 470–477. Cavalier-Smith, C. C. F. (1978) J. Cell Sci. Stoltzfus, A., Spencer, D. F., Zuker, M., Logsdon, J. M. J. & Doolittle, W. F. (1994) Science 265, 202–207. de Souza, S. J., Long, M., Schoenbach, L., Roy, S. W. & Gilbert, W. (1996) Proc. Natl. Acad. Sci. USA 93, 14632–14636. Long, M. & Langley, C. (1993) Science 260, 91–95.
Colloquium Paper: Gilbert et al. 8. 9. 10. 11. 12.
Long, M., Rosenberg, C. & Gilbert, W. (1995) Proc. Natl. Acad. Sci. USA 92, 12495–12499. Gilbert, W. (1986) Nature (London) 319, 618. Gesteland, R. F. & Atkins, J. F., eds. (1993) The RNA World, (Cold Spring Harbor Lab. Press, Plainview, NY). Long, M., de Souza, S. J. & Gilbert, W. (1995) Curr. Opin. Genet. Dev. 5, 774–778. Go, M. (1981) Nature (London) 291, 90–93.
Proc. Natl. Acad. Sci. USA 94 (1997) 13. 14. 15. 16. 17.
7703
Straus, D. & Gilbert, W. (1985) Mol. Cell. Biol. 5, 3497–3506. Gilbert, W., Marchionni, M. & McKnight, G. (1986) Cell 46, 151–154. Craik, C. S., Sprang, S., Fletterick, R. & Rutter, W. J. (1982) Nature (London) 299, 180–182. Petrov, D. A., Lozovskaya, E. R. & Hartl, D. L. (1996) Nature (London) 384, 346–349. Fink, G. R. (1987) Cell 49, 5–6.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7704–7711, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30 to February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Transposable elements as sources of variation in animals and plants MARGARET G. K IDWELL*
AND
DAMON LISCH
Department of Ecology and Evolutionary Biology and The Center for Insect Science, University of Arizona, Tucson, AZ 85721
and that he would have been an active participant in the continuing debate about their role in evolution.
ABSTRACT A tremendous wealth of data is accumulating on the variety and distribution of transposable elements (TEs) in natural populations. There is little doubt that TEs provide new genetic variation on a scale, and with a degree of sophistication, previously unimagined. There are many examples of mutations and other types of genetic variation associated with the activity of mobile elements. Mutant phenotypes range from subtle changes in tissue specificity to dramatic alterations in the development and organization of tissues and organs. Such changes can occur because of insertions in coding regions, but the more sophisticated TE-mediated changes are more often the result of insertions into 5* flanking regions and introns. Here, TE-induced variation is viewed from three evolutionary perspectives that are not mutually exclusive. First, variation resulting from the intrinsic parasitic nature of TE activity is examined. Second, we describe possible coadaptations between elements and their hosts that appear to have evolved because of selection to reduce the deleterious effects of new insertions on host fitness. Finally, some possible cases are explored in which the capacity of TEs to generate variation has been exploited by their hosts. The number of well documented cases in which element sequences appear to confer useful traits on the host, although small, is growing rapidly.
Distribution and Classification TEs are discrete segments of DNA that are distinguished by their ability to move and replicate within genomes. Since their discovery by Barbara McClintock '50 years ago (1), TEs have been found to be ubiquitous in most living organisms. They comprise a major component of the middle repetitive DNA of genomes of animals and plants. They are present in copy numbers ranging from just a few elements to tens, or hundreds, of thousands per genome. In the latter case, they can represent a major fraction of the genome, especially in some plants. For example, TEs recently have been estimated to make up .50% of the maize genome (2). In Drosophila, '10–15% of the genome is estimated to be made up of TEs, most of which are found in distinct regions of centric heterochromatin (3). TEs are classified in families according to their sequence similarity. Two major classes are distinguished by their differing modes of transposition (4). Class I elements are retroelements that use reverse transcriptase to transpose by means of an RNA intermediate. They include long terminal repeat retrotransposons and long and short interspersed elements (LINES and SINES, respectively). Long terminal repeat retrotransposons are closely related to other retroelements of major interest, such as retroviruses (5). The gypsy element in Drosophila is an example of a rare type of retrotransposon that can sometimes also behave as a retrovirus (6). Class II elements transpose directly from DNA to DNA and include transposons such as the Activator-Dissociation (Ac-Ds) family in maize, the Tam element in Antirrhinum, the P element in Drosophila, and the Tc1 element in the worm, Caenhorabditis elegans. Recently. a category of TEs has been discovered (7) whose transposition mechanism is not yet known. These miniature inverted-repeat TE (MITEs) have some properties of both class I and II elements. They are short (100–400 bp in length), and none so far has been found to have any coding potential. They are present in high copy number (3,000–10,000) per genome and have target site preference for TAA or TA in plants. MITEs such as the Tourist element in maize and the Stowaway element in Sorghum (7) are found frequently in the 59 and 39 noncoding regions of genes and are frequently associated with the regulatory regions of genes of diverse flowering plants. TEs with similar properties also have been described in Xenopus (8), humans (9, 10), and the yellow fever mosquito, Aedes aegypti (11). Most, but not all, TE families are made up of both autonomous and nonautonomous elements. Whereas autonomous
The book whose publication we are celebrating in this colloquium indicates that Theodosius Dobzhansky had a very special interest in gene mutation and its causes. Dobzhansky recognized mutation as the ‘‘raw material’’ on which natural selection acts and as the first of three steps necessary for evolution to take place. However, the discovery of transposable elements (TEs) in the 1940s by Barbara McClintock occurred a decade later, and it was a further 30 years before the significance of her findings started to be fully appreciated. Sixty years ago, Dobzhansky was well aware of the mutagenic properties of ionizing radiation discovered in 1927 by H. J. Muller but acknowledged that much less than 1% of spontaneous mutations were attributable to this cause. He distinguished between spontaneous and induced mutations: ‘‘The former are those which arise in strains not consciously exposed to known or suspected mutation-producing agents.’’ He also pointed out that ‘‘since the name spontaneous constitutes only a thinlyveiled [sic] admission of the ignorance of the phenomenon to which it is applied, the quest for the causes of mutation has always occupied the attention of geneticists.’’ Although at that time no clues to its nature were yet available, Dobzhansky realized that a major piece of the mutation puzzle was still missing. We believe he would have been intrigued with the discoveries of TEs in natural populations that have taken place during the last 20 years
Abbreviations: TE, transposable elements; MITE, miniature invertedrepeat TE. *To whom reprint requests should be addressed. e-mail:
[email protected].
© 1997 by The National Academy of Sciences 0027-8424y97y947704-8$2.00y0 PNAS is available online at http:yywww.pnas.org.
7704
Colloquium Paper: Kidwell and Lisch elements code for their own transposition, nonautonomous elements lack this ability and usually depend on autonomous elements from the same, or a different, family to provide a reverse transcriptase or transposase in trans. This paper aims first to provide a brief, general description of the types of genetic variation caused by TEs in animals and plants and then to examine this variation within an evolutionary framework: (i) direct selection on TEs at the level of the DNA sequence (parasitic DNA); (ii) coevolution of TEs and their animal and plant hosts to avoid or mitigate the deleterious effects of insertion; and (iii) positive selection on elements that have evolved to provide some positive benefit to their hosts in addition to simply minimizing the harm they do. Types of TE-Induced Genetic Variation Like new mutations produced by any mutator mechanism, the majority of new TE-induced mutations are expected to be deleterious to their hosts. Those mutations that survive over long periods of evolutionary time are expected to be a small subsample of newly induced mutations. The property that distinguishes TE-induced mutations from those produced by other mutational mechanisms is their remarkable diversity and the degree to which their induction is regulated by both the host and the TE itself. The genetic variability resulting from TEs ranges from changes in the size and arrangement of whole genomes to changes in single nucleotides. It may produce major effects on phenotypic traits or small silent changes detectable only at the DNA sequence level. It is important to note that TEs produce their mutagenic effects not simply on initial insertion into host DNA. TEs may also produce mutations when they excise, leaving either no identifying sequence or only small ‘‘footprints’’ of their previous presence. In addition, some TE-induced mutations that may be of evolutionary significance to their hosts, such as mutations in regulatory sequences (12), may take long periods of time to evolve new functions or these new functions may have been acquired a long time ago. Consequently, they may have lost their original identification as TEs. For these reasons, the reliance solely on the distribution of TE sequences in the genomes of contemporary species of animals and plants to deduce the long term evolutionary importance of TEs may produce a biased result that may not adequately reflect TE-associated events that occurred long in the past. Deleterious effects of TEs can result not only from mutations caused by the insertion or excision of these elements at a single chromosomal site but also from genomic-level disruptive effects associated with TE transposition. For example, massive chromosome breakage in larval cells resulting from excision and transposition of genomic P elements has been implicated as the cause of temperature-dependent pupal lethality and sterility in hybrid dysgenesis in Drosophila melanogaster (13, 14). A brief description of the types of genetic variability caused by TE activity follows, based largely on the types of host DNA involved. Some of the mutations described were generated in the laboratory and have been subjected to artificial selection under unnatural and noncompetitive conditions. Although these are generally not the class of mutations that are of interest from an evolutionary perspective, we include them here to provide some indication of the potentially wide spectrum of phenotypic changes associated with TE activity. Insertions of TEs into Exons of Host Genes. On average, TEs that insert within the exons of genes are most likely to result in null mutations because of the sensitivity of these regions to frame shift mutations and the lack of tolerance of highly conserved regions to most mutations of any kind. However, those mutations that are not simply inviable can provide interesting and sometimes spectacular phenotypic variability. In Drosophila, a series of null alleles at the X-linked, white locus allowed the first identification of the P element in D. melanogaster as the causal agent of P–M hybrid dysgenesis
Proc. Natl. Acad. Sci. USA 94 (1997)
7705
(15). The insertion of both the P element and the copia element into exon sequences interrupted the coding sequences and the production of the red eye pigment by the wild-type gene. The result is a bleached white eye phenotype that reflects the lack of pigmentation. Such a null mutation can be maintained in the laboratory but is unlikely to survive in natural populations. A good classic example is the insertion of an element of the Ac-Ds family into wx-m9, an allele of the waxy locus in maize first discovered by McClintock (16). The mutation is caused by the insertion of Ds (Dissociator) into the 10th exon of the waxy locus. This was the first element to be cloned from maize, and it is of continuing interest because it is spliced, resulting in partial revertant activity (17, 18). In this case, the effect of the insertion is attenuated by the loss through splicing of the TE after transcription. Insertions into Regulatory Regions of Genes. An excellent example of this type is the insertion of gypsy into the 59 upstream region of the yellow gene in Drosophila, which causes a loss of expression of the yellow gene in specific tissues (19). The loss of expression in some tissues and not others in this case is the result of the interaction of the element, tissuespecific enhancers upstream of the element, and specific host factors. In Antirrhinum, a Tam3 element was observed to insert into a region 59 of the niv gene, which is involved in the synthesis of anthocyanin pigments. The initial insertion was observed to down-regulate expression of the gene. However, a series of rearrangements mediated by this element resulted in a change in the level and tissue specificity of expression of niv (Fig. 1). The net effect is a new and novel distribution of anthocyanin pigment in the flower tube (20). This series of mutations exemplifies the potential for TE-mediated ‘‘rewiring’’ of regulatory networks, in this case by bringing new regulatory sequences in proximity to exonic sequences via an inversion, followed by an imprecise excision event. TE activity can result in even more complex rearrangements that can have effects on gene regulation. In maize, the insertion of a Mu (Mutator) element into the TATA box of Adh1 changes the tissue specificity of RNA expression (21). Of interest, excision of this element caused a complex series of duplications and inversions whose net effect was to cause additional changes in tissue specificity (22). Kloeckener–Gruissem and Freeling (22) suggest that this kind of ‘‘promoter scrambling’’ may represent a more general process by which transposons produce variants of a type not produced by other mechanisms. Furthermore, in this case, like that of the Tam element in the niv gene of Antirrhinum, the TE footprint left behind after the element has generated the new mutation is small enough to be invisible to TE probes. Insertions in Introns. TEs that insert into introns generally have a greater chance to survive because these insertions are less visible to natural selection. Many of them are probably successfully spliced out during mRNA processing and have no obvious effect on the function of the gene. Even when spliced, however, introns are sometimes the site of regulatory sequences. In these cases, TE insertions into introns can affect gene regulation in surprising ways. For instance, the insertion of Mu elements into an intron of the Knotted locus in maize induces ectopic expression of the gene, suggesting that the intron carries sequences normally required to repress expression of the gene in certain tissues (23). Similarly, in Antirrhinum, complementary floral homeotic phenotypes result from opposite orientations of a Tam3 transposon in an intron of the ple gene (24) . Insertions in Heterochromatin. Middle repetitive DNA sequences, including TEs, are an important component of b heterochromatin in Drosophila, and retrotransposons constitute a considerable fraction of this DNA (25). Recent work by Pimpinelli and coworkers (3) has revealed that TE clustering into discrete regions of heterochromatin is a general property of elements in Drosophila. The cause of this distribution pattern is an open question. In some cases, a heterochromatic location probably reduces the probability of elimination of inserted se-
7706
Colloquium Paper: Kidwell and Lisch
Proc. Natl. Acad. Sci. USA 94 (1997) stimulation of site-specific gene conversion and recombination mediated by P element transposition. As a consequence of the selection against the negative effects of ectopic recombination, this is postulated to be the mechanism chiefly responsible for the removal of certain subsets of TEs from genomes and a means for controlling copy number (26, 31). It is worth noting, however, that not all rearrangements caused by ectopic recombination are necessarily selected against. From an evolutionary perspective, rare surviving chromosomal rearrangements could be of significance. For example, Lyttle and Haymer (32) demonstrated the presence of the hobo element at the breakpoints of several endemic, but not cosmopolitan, inversions in D. melanogaster natural populations from Hawaii. This result is consistent with the recent introduction of hobo elements to D. melanogaster by horizontal transfer (33) and the subsequent production of these inversions as a consequence of the activity of these elements. Effects on Quantitative Variability. The P element in Drosophila provides one of the most compelling demonstrations of TE-induced genetic variability in quantitative genetic traits. A series of experiments (34, 35) has shown that quantitative variability for bristle number is induced by P–M hybrid dysgenesis and is demonstrable by directional artificial selection. Furthermore, in well controlled experiments (36), a dramatic increase in new additive genetic variation in abdominal bristle number was observed that was 30 times greater than that expected from spontaneous mutation. In another Drosophila study (37), an excess of P element mutations having large effects on metabolic characters was observed relative to those expected. Significant among-line heterogeneity indicated that the mutational target site for enzyme activity is large and that most of the mutations must be regulatory. It was concluded that the large pleiotropic effects observed had important consequences for metabolic characters.
FIG. 1. Rearrangements associated with TE activity near a gene result in altered tissue specificity. (A) In the original isolate, a Tam3 element had inserted 64 bp upstream of the start of transcription of the niv gene of Antirrhinum. The result of this insertion was a reduced level of expression. (B) A derivative of the initial insertion allele carries an inversion flanked by two copies of the transposon. This allele confers an additional reduction in the level of expression. (C) Excision of the element closest to the niv gene left a short (26 bp) ‘‘footprint.’’ This rearrangement resulted in an increase in the level of niv gene expression as well as a novel pattern of expression, presumably due to the juxtaposition of a novel sequence with the niv gene TATA box and coding sequences [adapted from Lister et al. (20)].
quences from the genome by ectopic recombination, the mechanism believed to be largely responsible for controlling TE copy number in chromosome arms (26). However, some centric heterochromatic regions have been described as a graveyard for dead elements, rather than a safe haven for active elements, because the majority found there appear to be inactive and highly diverged sequences. For example, LINE-like I elements cloned from the Charolles-reactive strain of D. melanogaster contain no active euchromatic I factors, only defective copies that are embedded in clusters of defective copies of other retroelements (27). In contrast, elements such as mdg1 have been found to have a nested arrangement within other retrotransposons located in euchromatic chromosome regions (28, 29). These sites appear to exhibit properties of intercalary heterochromatin (25) and may be responsible for the properties of ectopic pairing, susceptibility to breakage, and late replication that are characteristic of this type of chromatin. Mediation of Recombination. TE-mediated increases in the rate of recombination have consequences not only for genetic variation at individual loci. This recombination activity can also result in more general changes in both fine and gross structural characteristics of chromosomes. For instance, analysis of mutations in the mei-41 and mus302 genes required for normal postreplication repair in D. melanogaster (30) revealed a striking
Evolutionary Considerations The idea that TEs are primarily parasitic is not at all inconsistent with a role for these elements in the evolution of their hosts. Indeed, as documented below and elsewhere (12, 38– 40), there is a growing body of evidence for coadaptation by both elements and their hosts to the long term presence in the genome of these parasitic sequences. In some cases, it appears that this coadaptation even may have lead to the use of TE sequences for essential and beneficial host functions. In this section, we explore all three perspectives. TEs as Genomic Parasites The intrinsically parasitic nature of active TEs (41–43) accounts for their undisputed ability to invade new species, increase in copy number, and survive over long periods of evolutionary time. The replicative advantage of TEs (44) is responsible for this ability, which is facilitated by their generally compact structure and inclusion of the coding capacity for transposition within their sequences. Natural selection acting on TEs at the level of the DNA sequence is responsible for maintaining their essentially parasitic properties. For example, P elements can rapidly invade a naive population of D. melanogaster, despite extremely strong negative selection at the host level in the form of high frequencies of temperature-dependent gonadal sterility (45). A proclivity for horizontal transfer is consistent with the role of TEs as genomic parasites. The life cycle of TEs in any single phylogenetic lineage can apparently last for many thousands or millions of years and can be considered as a succession of three phases: dynamic replication, inactivation, and degradation (46, 47). The transposition of both major classes of elements is error-prone and produces nonautonomous elements that often repress the transposition rate of active elements. Over long periods of evolutionary time, there is a tendency for a family of elements to degrade in coding capacity, but horizontal transfer to another host lineage provides the opportunity for active TEs to
Proc. Natl. Acad. Sci. USA 94 (1997)
Colloquium Paper: Kidwell and Lisch move to another lineage and begin the cycle over again (46, 48, 49). There is evidence that TEs do transfer horizontally more frequently than nonmobile genes (46). Class II elements such as mariner and P elements provide good examples of TE horizontal transfer (50–52), but a major puzzle remains regarding the mechanism by which horizontal transfer is achieved (49). Coadaptations to Mitigate Reduced Host Fitness Like viruses, TEs are dependent on their host organisms for survival, but unlike viruses, most TEs do not have a phase in their life cycle in which they can survive independent of their hosts. Therefore, coevolution and coadaptation of TEs with host genomes is expected to play a particularly important role in the long term survival of these element families. Given that the majority of new insertions tend to be deleterious to hosts, it is in the interests of both parties to mitigate or remove such deleterious effects. TEs can insert in many locations other than exons. In these noncoding sequences, they are likely to increase their probability of survival because of less visibility to natural selection. Examples of some of the ways that coadaptations by both mobile elements and their hosts appear to have evolved are described below and are summarized in Table 1. Insertion Bias for Noncoding Regions. A dramatic demonstration of the preference for new P element insertions into potentially regulatory regions of genes, rather than exons, is provided by P elements in Drosophila (53). Only a small minority of P elements was observed to have inserted in coding sequences, and these elements were in the 59 portion. Thus, there is a strong bias in favor of insertion in the 59 end of the gene and especially for the 59 untranslated region. Because these insertions all caused an obvious mutant phenotype (they failed to compliment a deficiency), this result is actually an understatement of the true number of insertions into or near regions of genes often involved in regulation. This preference is supported by the remarkable observation that, in another experiment, .65% of the 500 independent P element enhancer trap insertions were expressed in a spatially and temporally restricted fashion (54) . Another good example of preferential insertion of elements outside of coding regions is provided in yeast (55). Of over 100 new insertions of Ty1 observed in chromosome III, nearly all were inserted into or near either tRNA genes or preexisting long terminal repeats; only 3% were found in ORFs. Distribution patterns favoring the 59 noncoding sequences of genes were observed for MITE elements in plants (7) and animals (e.g., see ref. 11). However, it is not clear whether this pattern represents an insertion preference or whether it results from strong earlier selection against insertions into other regions. Similarly, in view of the large number of group II and group III introns present in the chloroplast genome of E. gracilis, the complete absence of introns in rRNA and tRNA genes is striking (56). One possibility is that secondary structure features of rRNA and intron RNA or tRNA and intron RNA (if they were present in the same Table 1.
pre-mRNA molecule) would interact in such a way as to prevent one or both from functioning. Preference for Insertion into Preexisting Elements. In some cases, it is clear that unrestricted transposition would be absolutely disastrous for the host. As mentioned above, a remarkable number of retroelements, probably representing well over 50% of the genome, are found between genes in maize (2). The five most abundant families make up more than 25% of the genome, and one family alone, Opie, makes up 10–15% of the maize genome. Despite this ubiquity, few homologies in the database were found among maize genes; these elements appeared to have a pronounced preference for regions outside of genes. Indeed, fully half of the elements examined were found nested within other elements. Splicing from Pre-mRNA Transcripts. It has been suggested that splicing of TEs from exonic sequences has evolved as a means by which TEs can minimize their deleterious host impact. We refer back to the waxy mutation described earlier (18). In that case, splicing of the Ds element inserted into the waxy gene results in partial reversion of the mutation caused by the original insertion. Presumably, even a partial amelioration of the mutant phenotype can provide some selective advantage to those TEs capable of providing it. Additional examples are found in Drosophila, C. elegans, and other plants (57–59). In addition, Marillonnet and Wessler have observed tissue-specific splicing of an element (S. Wessler, personal communication), suggesting that TEs could potentially play a role in the evolution of tissue-specific regulation of some genes. These examples may be illustrating an evolutionary spectrum, from purely parasitic behavior to functional significance for the host. The original capacity to be spliced may have arisen as a way to minimize the impact of insertions into coding sequences but on the road from poorly spliced variants that simply ameliorate mutant phenotypes to fully effective and selectively invisible splicing may have come the opportunity to develop new classes of regulation, such as tissue specificity. Tissue Specificity of TE Activity. A good example of a likely adaptation of a TE to its host is the restriction of transposition of P and I elements to the germ line (14). It is of mutual benefit for an element to transpose in those tissues that will ensure transmission to the next host generation, but to curtail activity in somatic tissues is likely to result in loss of host fitness without providing any benefit to the transposon. Repression of P element transposition in somatic cells occurs on the level of RNA processing (60). The 2–3 intron is spliced only in the germ cells, resulting in the absence of transposase in somatic cells. Splicing of this intron is prevented in the somatic cells by an 87-kDa protein that binds to a site in exon 2 located 12–31 bases from the 59 splice site (61). An existing host-splicing mechanism apparently has been coopted for this purpose (61, 62) that has been highly conserved during evolution. Host Regulation of TE Copy Number. Good examples of host regulation of copy number are found in maize and
Possible coevolved mechanisms to mitigate reduction in host fitness Mechanism
Insertion bias for noncoding regions
Pre-mRNA splicing Tissue specificity of transposition TE copy number regulation
Regulators of mutant phenotype expression
7707
Examples Preferential insertion in regulatory regions (1); nested retrotransposons in maize (2); clustered I elements (27); mdg1 in Drosophila (28, 29) Splicing from the maize wx gene (18); various genes in Drosophila (57), C. elegans (59), and plants (58) Repression of P element transposition in somatic cells at the level of RNA processing (60) Methylation of maize Ac, Spm, and Mu by host factors (63, 64); type I and type II P element-encoded repressors (13, 68, 70–72); Ac dosage effects (16); Spm repressor action (73–75) Transposase-dependent expression of mutant phenotype caused by insertion of Spm in maize (73); masking of mutant phenotypes by alleles of host suppressor genes in Drosophila (40)
7708
Proc. Natl. Acad. Sci. USA 94 (1997)
Colloquium Paper: Kidwell and Lisch
Drosophila. Unknown host-encoded factors specifically methylate Ac, Spm, and Mutator elements in maize (63, 64). With Mu (64), the example is particularly striking because it represents the global methylation of dozens of previously active elements simultaneously in a single generation. The methylation is not simply related to structural features of insertion sites of Mu elements because they become specifically methylated even when inside of genes; modification is rarely detected in the flanking sequences within the genes (65) . In D. melanogaster females, expression of the gypsy element envelope gene is strongly repressed by one copy of the nonpermissive allele of flamenco [reviewed by Bucheton (6)]. A less dramatic reduction in the accumulation of other transcripts and retrotranscripts also is observed. These effects correlate well with the inhibition of gypsy transposition in the progeny of these females and are therefore likely to be responsible for this phenomenon. The effects of flamenco on gypsy expression apparently are restricted to the somatic follicle cells that surround the maternal germline. Self-Regulation of TE Copy Number. It has been shown theoretically that self-regulation of TEs cannot evolve if it is assumed that deleterious effects on host fitness are caused by increased copy number alone or are not caused by dominant lethals (66). However, if the deleterious effects are immediate and occur as a direct consequence of transposition itself, then there may be a selective advantage to elements with reduced transposition rates that still allow them to spread in the genome but at a reduced cost to their host [reviewed by Brookfield (67)]. The activity of the P elements in D. melanogaster is regulated by element-encoded repressor products. These repressors fall into two discrete categories, type I and type II. Type I repressors are responsible for a cellular condition known as P cytotype, which depends on a 66-kDa, P element-encoded, repressor of transposition and excision (68). The genomic position of repressor elements determines the maternal vs. zygotic inheritance of P cytotype (69). Type II repressors usually have large internal deletions, are sensitive to genomic location, but show no maternal inheritance (13, 70–72) . In plants, the Ac element shows dosage effects; an increase in number of elements results in a decreased number of transpositions of the element (16). This could be interpreted as a response of the plant to increases in the level of Ac transposase or as an autoregulatory mechanism. Similarly, Spm:tnpA can protect Spm from methylation but may also act as a repressor of Spm (73, 74). Additionally, some deleted Spm elements can repress full length Spm elements in trans (75) . Regulators of Expression of Mutant Phenotypes. Very early in the investigation of mutable alleles in maize, it was discovered that the expression of some alleles depended on the presence or absence of a second factor (76). In these cases, that factor was the Table 2.
source of the transposase. However, as is clear from the example of some gypsy insertions whose mutant phenotype is only manifested in the presence of both Su(Hw) and a second host encoded factor, mod(mdg4) (77), TE-induced mutant alleles can also become ameliorated by other factors as well. The mutant phenotypes associated with many retrotransposon insertions are masked by alleles of host suppressor genes that act as trans-regulators of retrotransposon expression (40). The argument made is that such suppressor action may allow insertion mutations to partially, or completely, escape the action of purifying selection and allow them to persist or even increase in frequency in natural populations. There is evidence for the presence of host genes with suppressor function in natural populations of Drosophila (78). TE-Induced Characters Having Benefit to the Host There has been considerable debate whether, in addition to deleterious effects on fitness, TE-induced variability has any significance for host organisms over evolutionary time (40). The generally unpredictable nature of TE movements, coupled with the paucity of fixed insertion sites for TEs in species such as Drosophila (26), has lead some to reject the possibility of TEs having any significant evolutionary importance, other than as molecular parasites. However, there is a rapidly growing list of possible examples of TEs having evolved highly sophisticated functions, as shown by the examples briefly described below and summarized in Table 2. Insertions with Host Gene Regulatory Functions. It has been speculated for some time that changes in cis-regulatory regions of duplicated genes may be more important for the evolution and divergence of functional and morphological characters than mutations in coding sequences (see, e.g., refs. 79 and 80). However, only recently has evidence started to accumulate to support this hypothesis. In Drosophila, for instance, the three homeotic genes paired (prd), gooseberry (gsb), and gooseberry neuro (gsbn) have evolved from a single ancestral gene, following gene duplication. They now have distinct developmental functions during embryogenesis. The three corresponding proteins PRD, GSB and GSBN are transcription factors. Li and Noll (81) demonstrated that the three proteins are interchangeable with respect to their regulatory functions and that their distinct developmental functions are a consequence of changes in the regulatory sequences rather than in the proteins themselves. Because they lend themselves so well to changes in the architecture of promoter regions, it is likely that TE mutations have been involved in this kind of regulatory evolution (82). The potential importance of TEs as modifiers of the expression of normal plant genes has been highlighted by recent findings in plants. Long terminal repeat retrotransposons and MITEs have
Examples of TEs having functions that benefit their hosts New function
Insertions with regulatory functions “Molecular domestication” Source of new introns Replacement of normal host functions A role in host cell repair mechanisms Mediation of concerted evolution Possible functions of heterochromatic TE clusters
DSB, double-strand chromosome break.
Examples More than 20 examples of insertions into regulatory regions of genes (12, 38, 39) P element tandem repeats in D. obscura group may provide a new host gene function (89) Introns and twintrons in Euglena gracilis plastids (24) Repair of damaged chromosome ends by HET-A and TART in Drosophila (92, 93) Endogenous retroelements associated with repair of DSBs in yeast (94, 95) P element-mediated changes in subtelomeric repeat numbers in D. melanogaster (96) Developmentally programmed changes in DNA content; expression of heterochromatically embedded loci; genomic housekeeping functions (25)
Colloquium Paper: Kidwell and Lisch been found to be associated with the genes of many plants where some of these TEs contribute regulatory sequences (7). Furthermore, the MITE elements recently discovered in Aedes aegypti (11) also are associated closely with genes. In domesticated rice, Oryza sativa, a computer-based search revealed 32 common sequences belonging to nine putative mobile element families (83). Four of these families had been previously described, but five families were first discovered through this computer search, and four of these five had characteristics of MITES. New Patterns of Tissue-Specific Expression. TEs can contribute to the functional diversification of genes by supplying cisregulatory domains altering expression patterns. Earlier, we described the insertion of the gypsy element into a 59 upstream region of the yellow gene in Drosophila causing a loss of expression of this gene in specific tissues (84). The tissue-specific alterations in expression (a kind of mutation that is more subtle than simply knocking out a gene) is due to the presence of a specific sequence of DNA that is bound by Su(Hw), which is thought to be a transcription factor. More interesting, the Su(Hw) binding sequence seems to act as a general ‘‘buffer’’ that helps to define structural domains in the chromatin (77). Thus, gypsy may serve to introduce domains of regulation into given regions of the chromosome. This may have arisen initially as a means by which the TE could buffer itself from its chromosomal environment, but this kind of domain alteration could certainly also result in interesting variations in gene regulation as well. In addition to simply buffering chromosomal regions, many TEs are specifically expressed only in particular tissues at particular times. Based on recent findings, it appears that tissue specificity is a general feature of all retrotransposons in Drosophila. The expression patterns of 15 different families of long terminal repeat-containing retrotransposons were examined by Ding and Lipshitz (85) during normal development in different wild-type strains of D. melanogaster. Each family exhibited a pattern typical of spatial and temporal expression during embryogenesis, suggesting that each TE harbors cisregulatory factors that interact specifically with host transcription factors. These mobile cis-regulatory factors could potentially act to modify the expression of any number of host genes. Other Types of Insertions with Regulatory Functions. Some of the examples given thus far are anecdotal; they represent laboratory observations as to the kinds of changes that TEs can introduce into the host genome, rather than changes that have actually contributed to the evolution of the host. However, Britten (12, 38, 39) has used stringent criteria for the identification of strong cases of the involvement of TEs in the actual evolution of gene regulation. He maintains that a long term perspective is necessary in identifying and understanding mutations important for gene regulation. The number of cases he has identified is small, but growing. In addition to the plant MITE examples discussed above, he includes cases involving Alu-containing, T cell-specific enhancers in the human CD8a gene (86), the association of a retrovirus-related element with androgen regulation of the sex-limited protein (Slp) gene in mouse (87), and inverted repeats in the CyIIIa actin gene of sea urchin (88). Tandem Repeats of P-Related Sequences in Drosophila. A number of tandem P element repeats in three closely related species of the obscura group provides a very interesting example of several unrelated TE sequences evolving together that may provide a type of host gene function, which Miller et al. have termed ‘‘molecular domestication’’ (89). In this case, the P elements have lost all of their terminal repeats and thus can no longer transpose. Remarkably, each cluster unit consists of a cis-regulating section composed of insertion sequences derived from unrelated TEs, followed by the first three exons that, in mobile P elements, code for a 66-kDa protein that represses P element transposition. In contrast to this normal repressor function, these stationary P element repeats are hypothesized to have evolved the function of transcription factors (90).
Proc. Natl. Acad. Sci. USA 94 (1997)
7709
A Source of New Introns. Some retroelements are apparently fully adapted to their niche within exonic sequences. For example, the 143-kb Euglena gracilis plastid genome contains 155 group II and group III introns (56), nearly 10 times the number in any other known plastid DNA. The original introns were likely mobile, retrotransposable genetic elements that invaded the genome from another organism, relying in part on internally encoded enzyme activities for mobility. The group III introns appear to be streamlined versions of group II introns, sharing a common evolutionary ancestor with a group II intron. Among the E. gracilis introns are a number of introns-within-introns (twintrons), suggesting that these elements themselves have been targets of intron insertions. In one particularly interesting example (91), a group III intron is formed from domains of two individual group II introns. The authors suggest the possibility that ‘‘the introduction of one catalytic RNA into a functional domain of another catalytic RNA, through a process similar to twintron formation, can result in new combinations of sequences and structural domains that might lead to new RNA catalyzed reactions significant for RNA evolution.’’ Telomeres in Drosophila. An unusually finely tuned system between the host genome and mobile elements has evolved in Drosophila to take over a basic cellular function. Several retroelements, such as HET-A and TART, carry out the function of replacing damaged chromosome ends that is performed by telomerase in other insects (92, 93). The insertion frequency of the TEs involved has become adapted to match the average rate of telomere loss to maintain constant chromosome size. This is the best example to date of a TE providing a vital function to its host. As a Repair Mechanism of DNA Double-Strand Breaks. Although SINEs and LINEs and pseudogenes are abundant in eukaryotic genomes, indicating that reverse transcriptasemediated phenomena are important in genome evolution, the mechanisms responsible for their spread are largely unknown. The results of two recent experiments with the yeast Saccharomyces cerevisiae (94, 95) have linked reverse transcriptasemediated events with double-strand chromosome breaks in the absence of normal repair. This suggests a possible role for endogenous retroelements in the repair of double-strand chromosome breaks under certain circumstances. Note that, in this case, as in others described here, coadaptation may have grown out of apparently parasitic element behavior; doublestrand chromosome breaks may simply represent an especially good target for efficient TE insertion, and in turn, these insertions may sometimes be the most efficient repair pathway available to the host. The net effect, rapid insertional repair of breaks, is expected to benefit both host and TE. Mediation of the Concerted Evolution of Repetitive Gene Families. Evidence for the ability of TEs to directly influence the constitution of repetitive DNA was provided by experiments using genetically marked P elements located in a subtelomeric repeat of D. melanogaster (96). After P element mobilization, the number of repeats frequently was observed to be altered, with decreases being more common than increases, due to unequal gene conversion events. Therefore, TEs may play an important role in the evolution of heterochromatin. Changes in Genome Size. As described above, TEs may represent a variable and sometimes surprisingly high proportion of genomes, particularly in plants. By means of variation in sheer bulk, it is possible that TEs affect variability in life history traits and related characteristics because of the correlation between genome size, cell size, and various aspects of plant life form, such as growth rate and developmental time (97). Other Possible Functions. The idea that heterochromatic clusters of nomadic elements are merely graveyards of dead transposons appears to be giving way to the idea that these regions may also be involved in a number of important cellular processes (25). These include developmentally programmed changes in DNA content, expression of heterochromatically embedded gene
7710
Proc. Natl. Acad. Sci. USA 94 (1997)
Colloquium Paper: Kidwell and Lisch
loci, and housekeeping functions such as chromosome pairing, sister chromatid adhesion, and centromere function. The TE content of these regions may be important for these processes in ways that are not yet understood. Discussion One of the most compelling questions that arise when considering the new data on the preponderance of TE-derived sequences in some plant genomes is how enormous numbers of TE copies can accumulate in a single genome. For example, how is it that a single TE can make up 10% of the maize genome? Obviously, the recombinogenic properties of TEs that are hypothesized to maintain a relatively low, constant copy number in other organisms are not relevant in these cases. It may be that, in the case of low copy elements transposing at relatively low frequencies, recombination is able to purge some elements from euchromatin, but the maize example suggests that there may be a vast number of elements interspersed between genes in many locations. Recombination, then, may not be a particularly effective mechanism for purging the vast majority of repetitive sequences, many of which are clearly not located in heterochromatin. It can be argued that, wherever there is a concentration of TE sequences, heterochromatinlike structural features begin to evolve to down-regulate their expression. In turn, this would also tend to reduce the frequency of removal by ectopic recombination. These considerations motivate us to postulate that there are at least two types of elements that occupy two very different niches in the ecology of the genome: first, a type that preferentially inserts into regions distant from host gene sequences, such as heterochromatin or the regions between genes [e.g., the many retrotransposons found inserted between the genes on the third chromosome in maize (2)]; and second, a type that lives more dangerously by being more prone to insert into, or near, single copy sequences. We suggest that the first type escapes the ‘‘trap’’ of inactivation (via methylation or heterochromatinization) in regions outside of single copy host genes through the use of various buffer sequences; it has become specifically adapted to (or even makes up much of) these regions. As a strategy to minimize their potentially devastating effects on their hosts, these elements target regions in which recombination is minimal and where essential genes are scarce. The second type travels light and has evolved to take advantage of relatively accessible chromosomal architecture, a high concentration of transcription factors, host enhancer sequences, and horizontal transfer to maximize replication advantage. This type, represented by elements like Mu (which target single copy sequences) and P elements (at least 65% of insertions are located near enhancers) trades the disadvantage of an increased risk of negative selection for the advantages of occupying regions which are enriched for factors promoting efficient transcription and replication. This second type is postulated to be the one most likely to be discovered by geneticists (it is more likely to cause mutations) and also the one most likely to be lost through recombination (by targeting actively transcribed regions of the genome in which recombination is more frequent). We suggest that, when these elements insert in heterochromatin, they become inactive because they are not well adapted to that environment. We therefore need to consider the possibility that there may be more than one strategy to being a transposon and that each strategy, although successful from an evolutionary perspective, has a very different dynamic. Each type would be expected to affect host evolution in a different way. Type 1 would affect the overall architecture of the host chromosome, rather than the specific expression characteristics of individual genes. In contrast, type 2 would participate more directly in changes in gene regulation, such as is observed at the Adh1 locus in maize. A second area of considerable interest from an evolutionary perspective is the stress-induced mutability that is character-
istic of some TEs (98). A gradualist argument leveled against the idea that regulatory changes resulting from TE-induced mutations may be important in evolution is that such ‘‘macromutations,’’ like Goldschmit’s hopeful monsters, would be unlikely to arise at the precise time when a new ecological niche became available (40). However, there is increasing evidence that TE-induced mutation rates are far from constant. High frequencies of mutations are expected to appear in waves, such as those resulting from hybrid dysgenesis that accompany element invasions of new populations or species. TE-induced mutations have been recorded to occur in transpositional bursts (99) whose cause is not well understood but is likely related to inbreeding and other forms of genomic or environmental stress, possibly akin to the genomic stress referred to by McClintock (100). For example, it appears that plant retroelements are normally quiescent but can be activated by stress (98), such as cell culturing (101) or microbial infection (102). We suggest that the proximal, or adaptive, function in these cases is to increase element copy number during periods of stress to ensure a high probability of transmission by those host variants that happen to survive. With respect to the evolution of the host, however, the preadaptive, or exaptive function is to provide variation during periods of stress. In this case, as in the other cases outlined above, the transposon does not have to ‘‘know’’ that it is contributing to the evolution of its host nor has it evolved to do so, but out of its elemental parasitic behavior arises the potential for both dramatic and subtle changes in the genome of its host. Conclusions We are only just beginning to glimpse the complexity of possible interactions in the coevolution of TEs and their hosts. A full understanding of the population and evolutionary dynamics of these interactions, and the consequences to hosts, must await the results of further research. However some tentative conclusions can be made on the basis of current information. The primary parasitic nature of these sequences during their invasion of host populations is beyond dispute, but we believe that this does not by any means represent the whole story. A number of features of both TEs and their hosts can be interpreted as coadaptations to mitigate or abolish the reduction of fitness due to unbridled transposition. Furthermore, the number of well documented cases in which TE sequences have been coopted successfully by the host to provide a useful function is small but is growing rapidly. We suggest that the process by which elements and their hosts coevolve mutually beneficial strategies may lend itself to the production of genetic variation that would not otherwise have arisen. Although the role of TEs in evolution may not turn out to be precisely what McClintock had in mind when she first described controlling elements in maize, the importance of their role in the evolution of gene regulation and other host functions may yet surprise us. To paraphrase Dobzhansky’s famous phrase, there is good reason to believe that ‘‘Nothing about mobile elements makes sense except in the light of evolution.’’ We thank Dr. Zhijian Tu for comments on the manuscript. This work was supported by National Science Foundation Grant DEB9119349 to M.K. D.L. was supported by National Institutes of Health Training Program in Insect Science 1T32 AI07475. 1. 2. 3. 4. 5. 6.
McClintock, B. (1948) Carnegie Inst. Wash. Yearbook 47, 155–169. SanMiguel, P., Tikhonov, A., Jin, Y. K., Motchoulskaia, N., Zakharov, D., Melake-Berhan, A., Springer, P. S., Edwards, K. J., Lee, M., Avramova, Z. & Bennetzen, J. L. (1996) Science 274, 765–768. Pimpinelli, S., Berloco, M., Fanti, L., Dimitri, P., Bonaccorsi, S., Marchetti, E., Caizzi, R., Caggesse, C. & Gatti, M. (1995) Proc. Natl. Acad. Sci. USA 92, 3804–3808. Finnegan, D. J. (1992) Curr. Opin. Genet. Dev. 2, 861–867. McClure, M. A. (1993) in Reverse Transcriptase (Cold Spring Harbor Lab. Press, Plainview, NY), pp. 425–443. Bucheton, A. (1995) Trends Genet. 11, 349–353.
Proc. Natl. Acad. Sci. USA 94 (1997)
Colloquium Paper: Kidwell and Lisch 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29.
30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54.
Wessler, S., Bureau, T. E. & White, S. E. (1995) Curr. Opin. Genet. Dev. 5, 814–821. Unsal, K. & Morgan, G. T. (1995) J. Mol. Biol. 248, 812–823. Morgan, J. T. (1995) J. Mol. Biol. 254, 1–5. Smit, A. F. A. & Riggs, A. D. (1996) Proc. Natl. Acad. Sci. USA 93, 1443–1448. Tu, J. (1997) Proc. Natl. Acad. Sci. USA, in press. Britten, R. J. (1996) Proc. Natl. Acad. Sci. USA 93, 9374–9377. Engels, W. R. (1996) in Transposable Elements, eds. Saedler, H. & Gierl, A. (Springer, Berlin), pp. 103–123. Bregliano, J. C. & Kidwell, M. G. (1983) in Mobile Genetic Elements, ed. Shapiro, J. A. (Academic, New York), pp. 363–410. Rubin, G. M., Kidwell, M. G. & Bingham, P. M. (1982) Cell 29, 987–994. McClintock, B. (1951) Cold Spring Harbor Symp. Quant. Biol. 16, 13–47. Wessler, S., Baran, G., Varagona, M. & Dellaporta, S. (1986) EMBO J. 5, 2427–2432. Wessler, S., Baran, G. & Varagona, M. (1987) Science 237, 916–918. Corces, V. G. & Geyer, P. K. (1991) Trends Genet. 7, 86–90. Lister, C., Jackson, D. & Martin, C. (1993) Plant Cell 5, 1541–1553. Kloeckener-Gruissem, B., Vogel, J. M. & Freeling, M. (1992) EMBO J. 11, 157–166. Kloeckener-Gruissem, B. & Freeling, M. (1995) Proc. Natl. Acad. Sci. USA 92, 1836–1840. Greene, B., Walko, R. & Hake, S. (1994) Genetics 138, 1275–1285. Bradley, D., Carpenter, R., Sommer, H., Hartley, N. & Coen, E. (1993) Cell 72, 85–95. Arkhipova, I. R., Lyubomirskaya, N. V. & Ilyin, Y. V. (1995) Drosophila Retrotransposons (Landes, Austin, TX). Charlesworth, B. & Langley, C. H. (1989) Annu. Rev. Genet. 23, 251–287. Vaury, C., Bucheton, A. & Pelisson, A. (1989) Chromosoma 98, 215–224. Tchurikov, N. A., Zelentsova, E. S. & Georgiev, G. P. (1980) Nucleic Acids Res. 8, 1243–1258. Tchurikov, N. A., Ilyin, Y. V., Skryabin, K. G., Anan’ev, E. V., Bayev, A. A., Krayev, A. S., Zelentsova, E. S., Kulguskin, V. V., Lyubomirskaya, N. V. & Georgiev, G. P. (1981) Cold Spring Harbor Symp. Quant. Biol. 45, 655–665. Banga, S. S., Velazquez, A. & Boyd, J. B. (1991) Mutat. Res. 255, 79–88. Charlesworth, B., Sniegowski, P. & Stephan, W. (1994) Nature (London) 371, 215–220. Lyttle, T. W. & Haymer, D. S. (1993) in Transposable Elements and Evolution, ed. McDonald, J. F. (Kluwer, Dordrecht, The Netherlands). Simmons, G. M. (1992) Mol. Biol. Evol. 9, 1050–1060. Mackay, T. F. C. (1987) Genet. Res. 49, 225–233. Mackay, T. F., Lyman, R. F. & Jackson, M. S. (1992) Genetics 130, 315–332. Torkamanzehi, A., Moran, C. & Nicholas, F. W. (1992) Genetics 131, 73–78. Clark, A. G., Wang, L. & Hulleberg, T. (1995) Genetics 139, 337–348. Britten, R. J. (1996) Mol. Phylogenet. Evol. 5, 13–17. Britten, R. J. (1997) Gene, in press. McDonald, J. F. (1995) Trends Ecol. Evol. 10, 123–126. Doolittle, W. F. & Sapienza, C. (1980) Nature (London) 284, 601– 603. Orgel, L. E. & Crick, F. H. C. (1980) Nature (London) 284. Hickey, D. A. (1982) Genetics 101, 519–531. Plasterk, R. A. (1995) in Mobile Genetic Elements, ed. Sherratt, D. J. (IRL, Oxford), pp. 18–37. Kiyasu, P. K. & Kidwell, M. G. (1984) Genet. Res. 44, 251–259. Kidwell, M. G. (1993) Ann. Rev. Genet. 27, 235–256. Miller, W. J., Kruckenhauser, L. & Pinsker, W. (1996) in Transgenic Organisms: Biological and Social Implications, eds. Tomiuk, J., Woehrmann, K. & Sentker, A. (Birkhaeuser, Basel), pp. 21–35. Hurst, G. D. D., Hurst, L. D. & Majerus, M. E. N. (1992) Nature (London) 356, 659–660. Kidwell, M. G. (1994) J. Hered. 85, 339–346. Robertson, H. M. (1993) Nature (London) 362, 241–245. Lohe, A. R., Moriyama, E. N., Lidholm, D. A. & Hartl, D. L. (1995) Mol. Biol. Evol. 12, 62–72. Clark, J. B., Maddison, W. P. & Kidwell, M. G. (1994) Mol. Biol. Evol. 11, 40–50. Spradling, A. C., Stern, D. M., Kiss, I., Roote, J., Laverty, T. & Rubin, G. M. (1995) Proc. Natl. Acad. Sci. USA 92, 10824–10830. Bellen, H., O’Kane, C. J., Wilson, C., Grossniklaus, U., Pearson, R. K. & Gehring, W. J. (1989) Genes Dev. 3, 1288–1300.
55. 56.
57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98. 99. 100. 101. 102.
7711
Ji, H., Moore, D. P., Blomberg, M. A., Braiterman, L. T., Voytas, D. F., Natsoulis, G. & Boeke, J. D. (1993) Cell 73, 1007–1018. Hallick, R. B., Hong, L., Drager, R. G., Favreau, M. R., Montfort, A., Orsat, B., Spielmann, A. & Stutz, E. (1993) Nucleic Acids Res. 21, 3537–3544. Fridell, R. A., Pret, A. M. & Searles, L. L. (1990) Genes Dev. 4, 559–566. Purugganan, M. & Wessler, S. (1992) Genetica 86, 295–303. Rushforth, A. M. & Anderson, P. (1996) Mol. Cell. Biol. 16, 422–429. Laski, F. A., Rio, D. C. & Rubin, G. M. (1986) Cell 44, 7–19. Tseng, J. C., Zollman, S., Chain, A. C. & Laski, F. A. (1991) Mech. Dev. 35, 65–72. Bingham, P. M., Chou, T. B., I., M. & Zachar, Z. (1988) Trends Genet. 4, 134–138. Chomet, P. S., Wessler, S. & Dellaporta, S. L. (1987) EMBO J. 6, 295–302. Chandler, V. & Walbot, V. (1986) Proc. Natl. Acad. Sci. USA 83, 1767–1771. Bennetzen, J. L. (1996) in Transposable Elements, eds. Saedler, H. & Gierl, A. (Springer, Berlin), pp. 195–229. Charlesworth, B. & Langley, C. H. (1986) Genetics 112, 359–383. Brookfield, J. F. Y. (1995) in Mobile Genetic Elements, ed. Sherratt, D. J. (IRL, Oxford), pp. 130–153. Robertson, H. M. & Engels, W. R. (1989) Genetics 123, 815–824. Misra, S. & Rio, D. C. (1990) Cell 62, 269–284. Black, D. M., Jackson, M. S., Kidwell, M. G. & Dover, G. A. (1987) EMBO J. 6, 4125–4135. Jackson, M. S., D. M. Black & G. A. Dover. (1988) Genetics 120, 1003–1013. Rasmusson, K. E., Raymond, J. D. & Simmons, M. J. (1993) Genetics 133, 605–622. Fedoroff, N., Schlappi, M. & Raina, R. (1995) Bioessays 17, 291–297. Schlappi, M., Raina, R. & Fedoroff, N. (1994) Cell 77, 427–437. Cuypers, H., Dash, S., Peterson, P. A., Saedler, H. & Gierl, A. (1988) EMBO J. 7, 2953–2960. McClintock, B. (1958) Carnegie Inst. Wash. Yearbook 57, 415–429. Gdula, D. A., Gerasimova, T. I. & Corces, V. G. (1996) Proc. Natl. Acad. Sci. USA 93, 9378–9383. Csink, A. & McDonald, J. F. (1990) Genetics 126, 375–385. King, M. C. & Wilson, A. C. (1975) Science 188, 107–116. Britten, R. J. & Davidson, E. H. (1971) Q. Rev. Biol. 46, 111–138. Li, X. & Noll, M. (1994) Nature (London) 367, 83–87. Fincham, J. R. S. & Sastry, G. R. K. (1974) Annu. Rev. Genet. 8, 15–50. Bureau, T. E., Ronald, P. C. & Wessler, S. R. (1996) Proc. Natl. Acad. Sci. USA 93, 8524–8529. Pelisson, A., Song, S. U., Prud’homme, N., Smith, P. A., Bucheton, A. & Corces, V. G. (1994) EMBO J. 13, 4401–4411. Ding, D. & Lipshitz, H. D. (1994) Genet. Res. 64, 167–181. Hambor, J. E., Mennone, J., Coon, M. E., Hanke, J. H. & Kavathas, P. (1993) Mol. Cell. Biol. 13, 7056–7070. Stavenhagen, J. B. & Robins, D. M. (1988) Cell 55, 247–254. Anderson, R., Britten, R. J. & Davidson, E. H. (1994) Dev. Biol. 163, 11–18. Miller, W. J., McDonald, J. F. & Pinsker, W. (1997) Genetica, in press. Miller, W. J., Paricio, N., Hagemann, S., Martinez-Sebastian, M. J., Pinsker, W. & de Frutos, R. (1995) Gene 156, 167–174. Hong, L. & Hallick, R. B. (1994) Genes Dev. 8, 1589–1599. Biessmann, H., Valgeirsdottir, K., Lofsky, A., Chin, C., Ginther, B., Levis, R. W. & Pardue, M. L. (1992) Mol. Cell. Biol. 12, 3910–3918. Pardue, M. L., Danilevskaya, O. N., Lowenhaupt, K., Slot, F. & Traverse, K. L. (1996) Trends Genet. 12, 48–52. Moore, J. K. & Haber, J. E. (1996) Nature (London) 383, 644–646. Teng, S. C., Kim, B. & Gabriel, A. (1996) Nature (London) 383, 641–644. Thompson-Stewart, D., Karpen, G. H. & Spradling, A. C. (1994) Proc. Natl. Acad. Sci. USA 91, 9042–9046. Smyth, D. R. (1993) in Control of Plant Gene Expression, ed. Verma, D. P. S. (CRC, Boca Raton, FL). Wessler, S. R. (1996) Curr. Biol. 6, 959–961. Gerasimova, T. I., Matjunima, L. V., Mizrokhi, L. J. & Georgiev, G. P. (1985) EMBO J. 4, 3773–3779. McClintock, B. (1984) Science 226, 792–801. Pouteau, S., Huttner, E., Grandbastien, M. A. & Caboche, M. (1991) EMBO J. 10, 1911–1918. Pouteau, S., Grandbastien, M. A. & Boccara, M. (1994) Plant J. 5, 535–542.
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 7712–7718, July 1997 Colloquium Paper
This paper was presented at a colloquium entitled ‘‘Genetics and the Origin of Species,’’ organized by Francisco J. Ayala (Co-chair) and Walter M. Fitch (Co-chair), held January 30–February 1, 1997, at the National Academy of Sciences Beckman Center in Irvine, CA.
Long term trends in the evolution of H(3) HA1 human influenza type A WALTER M. FITCH*†, ROBIN M. BUSH*, CATHERINE A. BENDER‡,
AND
NANCY J. COX‡
*Department of Ecology and Evolutionary Biology, University of California, Irvine, CA 92692; and ‡Influenza Branch, Center for Disease Control and Prevention, Atlanta, GA 30333
porate new mutations, mutations that cause changes in the hemagglutinin (HA) molecule, the major molecule to which the immune system makes its humoral response. The effect of these mutations is to change the HA molecule so that, at least temporarily, it is no longer recognized by these antibodies, thereby permitting the virus to multiply. For this reason, old vaccines lose their efficacy and new ones must be made with viruses having the altered HAs. That in turn raises the question of which of today’s strains should be used to make that vaccine; that is, which of today’s strains is most likely to be the progenitor of the of next year’s epidemic strains. Vaccines are now selected on the basis of knowing which of the current strains are least reactive to current antibodies (on the assumption that these strains have the best opportunity to spread) and which strains seem currently to be spreading the most effectively. Nevertheless, it would be useful to develop better predictive methods of deciding that question. It is the purpose of this paper to begin exploring how a knowledge of the evolutionary history of the HA gene might contribute to improved prediction of epidemic strains and of the strain of choice for the next vaccine. We shall explore the issues of (i) the rate of evolution, (ii) mutations occurring during virus propagation in the laboratory, (iii) the intercontinental spread of the influenza virus, (iv) the mutabilities of the different coding sites along the gene, and (v) sites subjected to strong selection.
ABSTRACT We have studied the HA1 domain of 254 human inf luenza A(H3N2) virus genes for clues that might help identify characteristics of hemagglutinins (HAs) of circulating strains that are predictive of that strain’s epidemic potential. Our preliminary findings include the following. (i) The most parsimonious tree found requires 1,260 substitutions of which 712 are silent and 548 are replacement substitutions. (ii) The HA1 portion of the HA gene is evolving at a rate of 5.7 nucleotide substitutionsyyear or 5.7 3 1023 substitutionsysite per year. (iii) The replacement substitutions are distributed randomly across the three positions of the codon when allowance is made for the number of ways each codon can change the encoded amino acid. (iv) The replacement substitutions are not distributed randomly over the branches of the tree, there being 2.2 times more changes per tip branch than for non-tip branches. This result is independent of how the virus was amplified (egg grown or kidney cell grown) prior to sequencing or if sequencing was carried out directly on the original clinical specimen by PCR. (v) These excess changes on the tip branches are probably the result of a bias in the choice of strains to sequence and the detection of deleterious mutations that had not yet been removed by negative selection. (vi) There are six hypervariable codons accumulating replacement substitutions at an average rate that is 7.2 times that of the other varied codons. (vii) The number of variable codons in the trunk branches (the winners of the competitive race against the immune system) is 47 6 5, significantly fewer than in the twigs (90 6 7), which in turn is significantly fewer variable codons than in tip branches (175 6 8). (viii) A minimum of one of every 12 branches has nodes at opposite ends representing viruses that reside on different continents. This is, however, no more than would be expected if one were to randomly reassign the continent of origin of the isolates. (ix) Of 99 codons with at least four mutations, 31 have ratios of non-silent to silent changes with probabilities less than 0.05 of occurring by chance, and 14 of those have probabilities