EVOLUTION AFTER GENE DUPLICATION
Edited by
Katharina Dittmar SUNY at Buffalo Buffalo, New York
David Liberles University of Wyoming Laramie, Wyoming
A JOHN WILEY & SONS, INC., PUBLICATION
Copyright © 2010 by Wiley-Blackwell. All rights reserved. Wiley-Blackwell is an imprint of John Wiley & Sons, formed by the merger of Wiley’s global Scientific, Technical and Medical business with Blackwell Publishing. Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission. Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages. 
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Wiley also publishes its books in a variety of electronic formats. Some Content that appears in print may not be available in electronic formats. For information about Wiley products, visit our web site at www.wiley.com. Library of Congress Cataloging-in-Publication Data: Dittmar, Katharina. Evolution after gene duplication / Katharina Dittmar and David Liberles. p. cm. Includes bibliographical references and index. Summary: “Gene duplication has long been believed to have played a major role in the rise of biological novelty through evolution of new function and gene expression patterns. The first book to examine gene duplication across all levels of biological organization, Evolution after Gene Duplication presents a comprehensive picture of the mechanistic process by which gene duplication may have played a role in generating biodiversity. Key Features: Explores comparative genomics, genome evolution studies and analysis of multi-gene families such as Hox, globins, olfactory receptors and MHC (immune system). A complete post-genome treatment of the topic originally covered by Ohno’s 1970 classic, this volume extends coverage to include the fate of associated regulatory pathways. Taps the significant increase in multi-gene family data that has resulted from comparative genomics. Comprehensive coverage that includes opposing theoretical viewpoints, comparative genomics data, theoretical and empirical evidence and the role of bioinformatics in the study of gene duplication. This up-to-date overview of theory and mathematical models along with practical examples is suitable for scientists across various levels of biology as well as instructors and graduate students”— Provided by publisher. ISBN 978-0-470-59382-0 (hardback) 1. 
Evolutionary genetics. 2. Mutation (Biology) 3. Variation (Biology) I. Liberles, David II. Title. QH390.D58 2010 572.8'38–dc22 2010031097 Printed in the United States of America 10 9 8 7 6 5 4 3 2 1
CONTENTS
Contributors
Preface
1 Understanding Gene Duplication Through Biochemistry and Population Genetics
David A. Liberles, Grigory Kolesov, and Katharina Dittmar
2 Functional Divergence of Duplicated Genes
Takashi Makino, David G. Knowles, and Aoife McLysaght
3 Duplicate Retention After Small- and Large-Scale Duplications
Steven Maere and Yves Van de Peer
4 Gene Dosage and Duplication
Fyodor A. Kondrashov
5 Myths and Realities of Gene Duplication
Austin L. Hughes and Robert Friedman
6 Evolution After and Before Gene Duplication?
Tobias Sikosek and Erich Bornberg-Bauer
7 Protein Products of Tandem Gene Duplication: A Structural View
William R. Taylor and Michael I. Sadowski
8 Statistical Methods for Detecting Functional Divergence of Gene Families
Xun Gu
9 Mapping Gene Gains and Losses Among Metazoan Full Genomes Using an Integrated Phylogenetic Framework
Athanasia C. Tzika, Raphaël Helaers, and Michel C. Milinkovitch
10 Reconciling Phylogenetic Trees
Oliver Eulenstein, Snehalata Huzurbazar, and David A. Liberles
11 On the Energy and Material Cost of Gene Duplication
Andreas Wagner
12 Fate of a Duplicate in a Network Context
Orkun S. Soyer
13 Evolutionary and Functional Aspects of Genetic Redundancy
Ran Kafri and Tzachi Pilpel
14 Phylogenomic Approach to the Evolutionary Dynamics of Gene Duplication in Birds
Chris L. Organ, Matthew D. Rasmussen, Maude W. Baldwin, Manolis Kellis, and Scott V. Edwards
15 Gene and Genome Duplications in Plants
Pamela S. Soltis, J. Gordon Burleigh, Andre S. Chanderbali, Mi-Jeong Yoo, and Douglas E. Soltis
16 Whole Genome Duplications and the Radiation of Vertebrates
Shigehiro Kuraku and Axel Meyer
Index
CONTRIBUTORS
Maude W. Baldwin, Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts
Erich Bornberg-Bauer, Evolutionary Bioinformatics Group, Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany
J. Gordon Burleigh, Department of Biology, University of Florida, Gainesville, Florida
Andre Chanderbali, Florida Museum of Natural History, University of Florida, Gainesville, Florida; Department of Biology, University of Florida, Gainesville, Florida
Katharina Dittmar, Department of Biological Sciences, SUNY at Buffalo, Buffalo, New York
Scott V. Edwards, Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts
Oliver Eulenstein, Department of Computer Science, Iowa State University, Ames, Iowa
Robert Friedman, Department of Biological Sciences, University of South Carolina, Columbia, South Carolina
Xun Gu, Department of Genetics, Development and Cell Biology, Center for Bioinformatics and Biological Studies, Iowa State University, Ames, Iowa
Raphaël Helaers, Department of Biology, Facultés Universitaires Notre-Dame de la Paix, Namur, Belgium
Austin L. Hughes, Department of Biological Sciences, University of South Carolina, Columbia, South Carolina
Snehalata Huzurbazar, Department of Statistics, University of Wyoming, Laramie, Wyoming
Ran Kafri, Department of Systems Biology, Harvard Medical School, Boston, Massachusetts
Manolis Kellis, Computer Science and Artificial Intelligence, Massachusetts Institute of Technology, Cambridge, Massachusetts
David G. Knowles, Smurfit Institute of Genetics, University of Dublin, Trinity College, Dublin, Ireland
Grigory Kolesov, Department of Molecular Biology, University of Wyoming, Laramie, Wyoming
Fyodor A. Kondrashov, Center for Genomic Regulation, Barcelona, Spain
Shigehiro Kuraku, Evolutionary Biology and Zoology, Department of Biology, University of Konstanz, Konstanz, Germany
David A. Liberles, Department of Molecular Biology, University of Wyoming, Laramie, Wyoming
Steven Maere, Department of Plant Systems Biology, VIB, Ghent, Belgium; Department of Molecular Genetics, Ghent University, Ghent, Belgium
Takashi Makino, Smurfit Institute of Genetics, University of Dublin, Trinity College, Dublin, Ireland
Aoife McLysaght, Smurfit Institute of Genetics, University of Dublin, Trinity College, Dublin, Ireland
Axel Meyer, Evolutionary Biology and Zoology, Department of Biology, University of Konstanz, Konstanz, Germany
Michel C. Milinkovitch, Department of Genetics and Evolution, Laboratory of Natural and Artificial Evolution, Sciences III, Geneva, Switzerland
Chris L. Organ, Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts
Tzachi Pilpel, Department of Molecular Genetics, Weizmann Institute of Science, Rehovot, Israel
Matthew D. Rasmussen, Computer Science and Artificial Intelligence, Massachusetts Institute of Technology, Cambridge, Massachusetts
Michael A. Sadowski, Division of Mathematical Biology, MRC National Institute for Medical Research, London, UK
Tobias Sikosek, Evolutionary Bioinformatics Group, Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany
Douglas E. Soltis, Department of Biology, University of Florida, Gainesville, Florida
Pamela S. Soltis, Florida Museum of Natural History, University of Florida, Gainesville, Florida
Orkun S. Soyer, Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter, UK
William R. Taylor, Division of Mathematical Biology, MRC National Institute for Medical Research, London, UK
Athanasia C. Tzika, Laboratory of Natural and Artificial Evolution, Department of Genetics and Evolution, Sciences III, Geneva, Switzerland; Evolutionary Biology and Ecology, Université de Bruxelles, Brussels, Belgium
Yves Van de Peer, Department of Plant Systems Biology, VIB, Ghent, Belgium; Department of Molecular Genetics, Ghent University, Ghent, Belgium
Andreas Wagner, Department of Biochemistry, University of Zurich, Zurich, Switzerland
Mi-Jeong Yoo, Florida Museum of Natural History, University of Florida, Gainesville, Florida; Department of Biology, University of Florida, Gainesville, Florida
PREFACE

The duplication of genes and genomes was postulated to be an important process for the evolution of functional and organismal diversity long before we entered the genome sequencing era. Many of the groundbreaking intellectual concepts for this hypothesis come from the work of Susumu Ohno. However, only with the recent availability of many genome sequences are we able to gather supporting data to develop hypotheses and test them on a large scale. Current research on the topic spans scientific disciplines from bioinformatics to organismal biology and touches on different aspects of gene and genome duplication, ranging from the molecular mechanics of the duplication process to the fate of duplicated genomes. Naturally, a variety of approaches goes hand in hand with differences in opinion, and presently, a prolific and at times overwhelming body of research has accumulated. Thus, it was our idea to provide a systematic examination of current thought on gene duplication and its importance to biological diversification across multiple levels.

This edited volume began with a review article published in the Journal of Experimental Zoology. With the expansion of concepts into full book chapters, we hoped to cover approaches from a range of fields, starting with molecular and structural biology, leading to computer science and statistics to cellular and, ultimately, organismal-level biology. It is our intention to lay out a hierarchy of chapters extending from evolutionary principles and molecular details out to increasingly higher levels of biological organization. This setup is designed to make the reader appreciate the interconnectedness of these levels. It is important to us to present this work as a platform for diverse scientific approaches and reasoning.

One clear point that will emerge in reading the book is that chapters come from authors in diverse disciplines, some of whom disagree with each other on underlying evolutionary forces, on experimental procedures, and on data interpretation. Occasionally, authors use terms in different ways. In particular, concepts of redundancy (and its maintenance) differ: Genes that one set of authors consider to have been selected for retention due to redundant function others consider to have diverged and to no longer be redundant.

The first section of the book deals with models of gene duplication and retention. Five chapters provide an overview (with some overlap in description, but involving different interpretations) of the mechanisms for the retention of duplicate genes. This includes diverse perspectives on mutational opportunities, dosage compensation, evolutionary processes, and their interrelationship with molecular processes and resulting selection. These fine-scale aspects of interpreting genomic data lay the foundation for the cell/systems level of biology of duplication as well as effects on speciation and biodiversity.

The second section, on gene/protein structure and duplication, includes two chapters. Chapter 6 links the evolutionary process associated with gene duplication through structure to function, paying particular attention to adaptive processes in proteins that have not yet undergone duplication. Chapter 7 describes the link between the process of gene duplication and protein structure, concentrating on a review of the genetic mechanisms creating fused tandem duplicates.

Comparative genomic methodologies for characterizing gene duplicates are treated in the third section. Chapter 8 overviews methodology for characterizing the functional divergence of duplicate genes, Chapter 9 presents a procedure for linking changes in gene copy number through evolution to functional and gene expression evolution, and Chapter 10 presents an overview of model and parsimony-based approaches for gene tree/species tree reconciliation coupled to a detailed presentation of most parsimonious reconciliation.

The fourth section, involving systems biology considerations of gene duplication, includes three chapters. Chapter 11 describes the energetic costs of gene duplication, arguing for a nonneutral process of fixation. The next two chapters describe the interplay between systems-level constraints and duplicate gene retention as well as the complementary interplay between duplication and the structure of biological networks.

The last section builds up to species-level characterizations of gene duplication. The first two chapters in this section treat research on duplication in birds and plants, addressing their roles in the speciation process, as well as developmental and morphological novelty. Chapter 16 concludes this work with a portrayal of the role of duplication in vertebrate speciation.

As editors, we would like to extend our thanks to all the researchers who contributed to this volume. First and foremost, we thank them for their timely delivered scientific insights, but also for their good humor and patience as we worked through the publishing process. We also thank all the colleagues, students, and friends who joined our discussions on this topic. Finally, we would like to thank Karen Chambers, at John Wiley, who was enthusiastic and supportive of our idea, and thankfully, a very patient editor in the long and at times exhausting process of assembling this volume. We hope the book will be a useful tool for researchers and students alike to learn about current research on duplication and inspire continued discussion about this important topic.

David Liberles
Katharina Dittmar
Chapter 1, Figure 4 Change of binding interaction in the two-component system. Shown in red blobs are three amino acid substitutions described by Skerker et al. (2008) that completely switch from EnvZ histidine-kinase to OmpR HK signal transduction. The HK homodimer is on the bottom (green and blue domains), and the response regulator domain (brown, top) was computationally docked to the 2C2A HK structure by Marina et al. (2005). The phosphotransfer histidine is shown in magenta.
Chapter 4, Figure 2 Dependence of fitness on gene dosage (the number of gene copies or genotype). Four types of functions are displayed here: in red, a linear fitness function; in blue, a diminishing returns function; in green, a function with a well-defined optimum; and a concave function is shown in magenta. The area between the two dashed lines defines an area where fitness is similar to the fitness of an individual with one gene copy. [Axes: Fitness (phenotype) against Protein concentration (genotype), with genotypes 0 (aa), 1/2 (Aa), 1 (AA), and 2 (AA,AA).]
[Figure, caption not recovered: observed KS distributions, panels (A) and (B), x-axis KS, with labels SSD, α, β, and γ. Panel annotations in the order extracted:]
α0 = 1.55 ± 0.28 α1 = 0.60 ± 0.04 α2 = 0.45 ± 0.03 α3 = 0.75 ± 0.05
development
α0 = 1.25 ± 0.02 α1 = 0.90 ± 0.03 α2 = 0.65 ± 0.01 α3 = 0.85 ± 0.01
whole paranome
α0 = 1.25 ± 0.16 α1 = 0.60 ± 0.04 α2 = 0.30 ± 0.02 α3 = 0.55 ± 0.04
TF activity
α0 = 1.40 ± 0.12 α1 = 0.45 ± 0.15 α2 = 0.15 ± 0.07 α3 = 0.55 ± 0.11
secondary metabolism
α0 = 0.85 ± 0.15 α1 = 0.80 ± 0.11 α2 = 0.85 ± 0.11 α3 = 0.75 ± 0.08
DNA metabolism
α0 = 3.05 ± 1.07 α1 = 0.35 ± 0.03 α2 = 0.10 ± 0.02 α3 = 0.35 ± 0.03
signal transduction
1), indicating directional selection, whereas the other copy should evolve under purifying selection (Ka /Ks < 1). This is, however, rarely observed (Van de Peer et al., 2001; Conant and Wagner, 2003; Zhang et al.,
2003). A much more prevalent scenario is that both copies still evolve under purifying selection, but the selection on one of the copies is relaxed (He and Zhang, 2005b; Byrne and Wolfe, 2007). One study of recent SSD duplicates in several prokaryotic and eukaryotic species (Ks < 0.5) found little evidence for asymmetric protein evolution (Kondrashov et al., 2002). In contrast, another study on recent SSD pairs in human (Ks < 0.3) found that about 60% of the pairs evolve asymmetrically (Zhang et al., 2003). However, a substantial portion of the fast-evolving genes identified in the latter study had accumulated amino acid substitutions evenly across their sequence and at nearly neutral rates, indicating that they may be on their way to pseudogenization rather than neofunctionalization. Another study on recent SSD duplicates in Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster, and C. elegans found that 20 to 30% of the duplicate pairs exhibit asymmetric sequence divergence (Conant and Wagner, 2003). The lack of consistency between these results may be explained in part by the fact that all of these studies investigate relatively small samples, on the order of tens of genes. It is doubtful whether such small-scale studies can generate enough statistical power to detect rate asymmetry trends reliably, especially for recent duplicates (Lynch and Katju, 2004). Moreover, assessment of rate asymmetry in recent duplicates is probably not the best way to look for neofunctionalization, since the majority of recent fast-evolving duplicates is bound to be on its way to nonfunctionalization rather than neofunctionalization (Lynch and Conery, 2000). It probably makes more sense to look for traces of asymmetric protein evolution in older duplicates. One class of duplicates that have been studied extensively in this respect are duplicates retained from ancient WGDs. 
Byrne and Wolfe (2007) found evidence of asymmetric protein evolution rates in 56% of the duplicate pairs remaining from the WGD in yeast. The faster-evolving copy tends to be the same across the post-WGD species, indicating that the asymmetry was established soon after the WGD and before speciation occurred. A study of 26 zebrafish duplicate pairs remaining from the teleost-specific WGD uncovered 13 cases in which the evolutionary rate of one of the paralogs was increased (Van de Peer et al., 2001). In a larger-scale study on Tetraodon, another teleost fish, Brunet et al. (2006) estimated that at least 36% of the WGD pairs have diverged asymmetrically. Also, Blanc and Wolfe (2004a) found evidence for asymmetric protein evolution in more than 20% of the duplicate pairs retained from the most recent WGD in the Arabidopsis ancestor. In contrast to yeast, fish, and Arabidopsis, a recent study in the allotetraploid frog Xenopus laevis found evidence for asymmetric protein sequence divergence in a mere 6% of duplicate pairs that were retained from the WGD (Chain and Evans, 2006). The differences in the extent of asymmetric divergence between species may be correlated with differences in their effective population size or that of their ancestors. Organisms with a large effective population size, such as yeasts, are predicted to be more permissive to duplicate neofunctionalization than organisms with a small effective population size (Lynch et al., 2001). Other factors may be age differences between the WGDs in the different species and differences in selective pressure (i.e., the biological need for neofunctionalization). Except in Xenopus laevis, there seems to be more evidence for asymmetric protein evolution or relaxed selection after WGD than after SSD. One possible explanation may be that WGDs facilitate the establishment of phenotypically neutral duplicates by avoiding the fixation hurdle that plagues neutral SSD duplicates. 
But even for asymmetrically evolving WGD duplicates that have been retained for tens of millions
of years, there is no guarantee that they are on their way to neofunctionalization. Scannell and Wolfe (2008) showed that the protein evolution rate for WGD duplicates in yeast has not yet dropped back to the level for single-copy genes, indicating that the neofunctionalization process may still not have come to an end some 100 million years (My) after the WGD. There are indications that many of the copies undergoing relaxed selection may never acquire a novel function and are eventually lost. In duplicate pairs remaining from the yeast WGD, the faster-evolving copy is almost never essential and has frequently been lost in several of the post-WGD lineages (Byrne and Wolfe, 2007). Similarly, Blomme et al. (2006) showed that vertebrate gene duplicates that have been maintained for hundreds of millions of years can still be lost. Protein neofunctionalization is generally believed to be a slow process. However, neofunctionalization of a gene is not necessarily accomplished by gradual mutation of the coding sequence. It can also be caused by insertion of preexisting domains, which can occur much faster (Björklund et al., 2005; Drea et al., 2006; Song et al., 2008). Genes may also acquire a novel function through changes in their expression or activation patterns (regulatory neofunctionalization), leading, for example, to their expression in other tissues or at other time points. Regulatory neofunctionalization may be faster and much more common than protein neofunctionalization. Several studies of expression divergence after duplication reported a bias toward asymmetric expression divergence, but without suitable outgroup expression patterns, the involvement of neofunctionalization is hard to assess (Gu et al., 2005; Casneuf et al., 2006; Chung et al., 2006; Conant and Wolfe, 2006; Ganko et al., 2007). Tirosh and Barkai (2007) looked for regulatory neofunctionalization of WGD duplicate pairs in S. 
cerevisiae by comparing the duplicate expression profiles with those of their preduplication ortholog in Candida albicans. Starting from 96 WGD pairs showing conserved expression with C. albicans for at least one of the two genes, they found 43 WGD pairs (45%) where the other gene had diverged, consistent with regulatory neofunctionalization. Twenty-eight pairs showed significant expression conservation of both duplicates. In contrast, when investigating 46 SSD pairs, they found only one example of asymmetrical divergence, which might indicate that regulatory neofunctionalization is more common among WGD duplicates than among SSD duplicates. In contrast, Casneuf et al. (2006) found that WGD duplicates in Arabidopsis have more similar expression patterns than SSD duplicates of comparable age. Furthermore, as with evolution of the coding sequence, asymmetric expression divergence does not necessarily imply neofunctionalization. The copy experiencing relaxed selection on its expression pattern could also be on its way to being silenced. Accordingly, Casneuf et al. (2006) found that among diverging SSD duplicate pairs, one copy is frequently expressed in a much lower number of tissues than the other copy, but this could also reflect asymmetric subfunctionalization. Clear, well-studied examples of neofunctionalization are difficult to find. A fertile research area in this respect is the evolution of the primate and, in particular, the human brain. For instance, GLUD2 , a duplicate glutamate dehydrogenase gene in humans and apes important for glutamate detoxification after neuron firing, appears to have gained expression in the brain and testes after the human–Old World monkey split, and it also shows signs of directional selection on its protein sequence (Shashidharan et al., 1994; Burki and Kaessmann, 2004). 
Similarly, a cluster of tandem Ret finger protein-like genes (hRFPL1,2,3) has gained expression in the neocortex and cerebellum in humans and other primates since the divergence from their murine ortholog (Bonnefont et al., 2008). Positive selection on the protein sequence was also observed. The importance
of the novel RFPL1,2,3 function in brains may be gleaned from the presence of a tandem cluster in the large-brained Catarrhini, which is conceivably due to selection for increased dosage (see Section 4). Also, the cortical expression of these proteins is significantly higher in humans than in other primates (Bonnefont et al., 2008).

3.2 Subfunctionalization

The general consensus is that protein neofunctionalization happens on such a large time scale that most duplicates are lost or degenerated through nonfunctionalizing mutations before neofunctionalization can occur, except perhaps in species with very large effective population sizes, such as yeasts (Lynch et al., 2001). Under the assumption that neofunctionalization is a slow process, there must be other mechanisms at work that preserve duplicates long enough to enable them to acquire novel functions. In the 1990s, subfunctionalization was advanced as an alternative way to achieve duplicate preservation (Hughes, 1994; Force et al., 1999; Stoltzfus, 1999). The subfunctionalization hypothesis states that after duplication, the functionality of an ancestral protein can get partitioned over the duplicates, so that both copies are needed to perform the complete ancestral function. In contrast to neofunctionalization, subfunctionalization does not necessarily require the action of selective forces. According to the duplication–degeneration–complementation (DDC) model, degenerative mutations can occur neutrally in both copies as long as the duplicates as a pair still retain all ancestral functionality (Force et al., 1999). In theory, the partitioning of subfunctions over different copies is enough to preserve duplicate genes indefinitely, even if there is no associated phenotypic advantage. However, subfunctionalization can also facilitate the development of beneficial features. 
For example, subfunctionalization may free a multifunctional gene of its pleiotropic constraints: interactions between its subfunctions that prohibit the gene from the optimal exercise of any of them (Hughes, 1994, 2005). If these constraints are lifted, natural selection can fine-tune the subfunctions separately. This subfunctionalization model is sometimes referred to as the escape from adaptive conflict (EAC) or adaptive conflict resolution model (Hittinger and Carroll, 2007; Conant and Wolfe, 2008; Des Marais and Rausher, 2008). A similar model, here referred to as the IAD (innovation, amplification, divergence) model, was proposed by several authors, who considered it more of a neofunctionalizing mechanism that amplifies and optimizes a beneficial secondary function present in trace amounts in an ancestral gene (Hendrickson et al., 2002; Francino, 2005; Bergthorsson et al., 2007). The difference with EAC is an initial amplification step in which an increase in the number of gene copies is favored by selection for increased dosage (see further). Both EAC and IAD are examples of mechanisms by which duplicate genes can get coopted for preexisting secondary functions, as discussed by Conant and Wolfe (2008). Subfunctionalization might also buy a duplicate pair the time necessary for one or both of the copies to acquire a novel function. On the basis of an analysis of protein interaction divergence and expression divergence of yeast and human duplicates, respectively, He and Zhang (2005b) proposed a “subneofunctionalization” model in which a period of rapid subfunctionalization is followed by prolonged neofunctionalization. A similar model was proposed by Rastogi and Liberles (2005) based on evolutionary simulations on duplicated lattice model proteins. Following reports of widespread asymmetric evolution and protein neofunctionalization after the WGD in
yeasts (Kellis et al., 2004; Byrne and Wolfe, 2007), Scannell et al. (2008) observed that both the fast- and slow-evolving copies of asymmetrically diverging WGD pairs underwent a burst of protein evolution soon after the WGD, consistent with initial rapid subfunctionalization. For classical subfunctionalization to occur on the protein level, the duplicated protein must have multiple separable functions, so not all proteins can be preserved in duplicate through this mechanism. The aforementioned study of X. laevis WGD duplicates (Chain and Evans, 2006) recovered evidence for complementary patterns of amino acid substitutions, indicative of enhancement or degradation of different functional domains, in only 3% of the cases. A more prevalent mechanism may be quantitative subfunctionalization, in which a single function becomes quantitatively partitioned over the duplicate copies: for example, through activity-reducing mutations in both copies (Stoltzfus, 1999; Scannell and Wolfe, 2008). Subfunctionalization, both qualitative and quantitative, can also occur on the regulatory level, and it is expected to be more frequent on this level. For example, a pair of duplicate genes could subfunctionalize through complementary loss of regulatory elements and concomitant specialization of the duplicate expression patterns in time or in space (Force et al., 1999). A myriad of studies have investigated the expression patterns or transcription factor–binding profiles of duplicate genes in different organisms, and most of them agree that there is substantial expression divergence among duplicates (Gu et al., 2002, 2005; Makova and Li, 2003; Papp et al., 2003b; Blanc and Wolfe, 2004a; Evangelisti and Wagner, 2004; Haberer et al., 2004; Zhang et al., 2004; Kim et al., 2005, 2006; Li et al., 2005; Casneuf et al., 2006; Chung et al., 2006; Morin et al., 2006; Ganko et al., 2007; Ha et al., 2007; Hellsten et al., 2007; Hughes and Friedman, 2007; Chain et al., 2008). 
Both tissue specialization and differential expression in response to stress treatments have been observed, as well as quantitative expression divergence. However, because of the lack of ancestral or outgroup expression patterns in these studies, it is difficult to assess the extent to which the expression divergence observed is due to subfunctionalization. A notable exception is a recent study comparing the tissue-specific expression of WGD pairs in X. laevis with the expression of their orthologs in the non-WGD frog X. tropicalis, in which it was estimated that 1.2 to 11% of the retained WGD pairs underwent tissue-specific expression subfunctionalization (Semon and Wolfe, 2008). A similarly conceived study in yeast (Tirosh and Barkai, 2007) found, in addition to the class of asymmetrically diverging duplicate expression patterns mentioned before, a class of duplicates exhibiting symmetrical divergence in expression. Although it is likely that these duplicates are undergoing some form of subfunctionalization, it is difficult to pinpoint the exact mechanism involved. Taking a different approach, Duarte et al. (2006) attempted to reconstruct the ancestral expression profiles of duplicated MADS-box genes in Arabidopsis and found indications for both regulatory sub- and neofunctionalization. A form of regulatory subfunctionalization that does not necessarily involve mutation is based on epigenetic repatterning of the duplicate genes. Rodin and Riggs (2003) proposed the epigenetic complementation (EC) model as a mechanism to save newborn duplicates from pseudogenization. The EC model invokes a specific subfunctionalization mechanism to mediate the exposure of both duplicate partners to purifying selection: namely, complementary tissue- or developmental stage-specific epigenetic silencing of duplicated genes via methylation or other processes involving heritable
DUPLICATE RETENTION AFTER SMALL- AND LARGE-SCALE DUPLICATIONS
chromatin structure. According to Rodin et al. (2005a, 2005b), one of the main conditions for EC-mediated survival of duplicates is their repositioning to ectopic sites. Epigenetic silencing, genome rearrangements and translocations have been shown to come into play soon after polyploid formation. Polyploidization events in plants are followed by intensive genomic rearrangements and enhanced activity of transposable elements (Wendel, 2000; Adams and Wendel, 2005b). Moreover, it has been shown that polyploidization can influence epigenetic silencing patterns. Studies on synthetic allopolyploid cotton and Arabidopsis thaliana have shown that reciprocal, developmentally regulated silencing of duplicates can occur during or soon after polyploid formation (Adams et al., 2003; Wang et al., 2004; Adams and Wendel, 2005a). The subfunctionalization model makes testable predictions about the types of genes that should be preferentially preserved after gene duplication. For example, it predicts that duplicates of genes with higher numbers of separable subfunctions (i.e., more complex genes) should be retained at a higher frequency. He and Zhang (2005a) investigated the relationship between gene complexity and gene duplicability in more detail. They found that duplicated genes in yeast, from both WGD and SSD, have on average longer protein sequences, more functional domains, and more regulatory motifs than singleton genes. Especially regarding domain and motif content, the difference is striking: An average duplicated yeast gene has approximately four times as many domains and twice as many motifs as a singleton. The preferential retention of SSD duplicates of longer proteins is even more striking given that shorter proteins are presumed to be more likely to generate functional duplicates through small segmental duplications. Longer proteins may more often spawn truncated duplicates that are nonfunctional or even deleterious if they give rise to dominant negative phenotypes. 
On the other hand, the production of truncated forms of longer, more complex proteins through SSD may also facilitate subfunctionalization. The observation that more complex genes are preferentially retained after duplication appears to support the subfunctionalization model. However, one caveat is that the subfunctionalization process is expected to lower the complexity of the duplicated genes, thereby eroding any difference in complexity between duplicates and singletons. He and Zhang proposed that the complexity of subfunctionalized duplicates can be maintained through subsequent neofunctionalization, arguing that the protein domains and regulatory motifs involved in subfunctionalization need not deteriorate completely but may evolve to adopt new functions (He and Zhang, 2005a). Chapman et al. (2006) also observed that longer and more complex genes were preferentially retained in duplicate after successive WGDs in Arabidopsis. However, they noticed that there were fewer SNPs in these duplicate pairs than in genes that reverted to single-copy status, and that the impact of those SNPs on protein structure was generally less severe. Moreover, they found indications that the sequence conservation of WGD duplicates is actively maintained by homogenization processes. These observations do not support the increased subfunctionalization of more complex genes. Instead, Chapman et al. conjectured that complex genes are performing crucial functions and advocated the theory of genetic buffering of these functions through redundancy.

3.3 Buffering

A possible advantage conferred by gene duplication is a buffering effect against null mutations. Several observations argue for this hypothesis. Molecular networks appear
to be extremely robust to single-gene deletions: Fewer than 20% of yeast genes are essential, and deletion of one of the other genes very often has little or no phenotypic effect, at least under rich media conditions (Giaever et al., 2002). Similar observations have been made in plants and C. elegans (Kamath et al., 2003). The robustness against deletion of many genes is commonly associated with the presence of backup copies (i.e., closely related paralogs) (Gu et al., 2003; Wagner, 2008). Correlation studies have shown that S. cerevisiae genes with retained paralogs are indeed less likely to show a growth defect upon deletion (Gu et al., 2003). However, completely redundant duplicates are predicted to be evolutionarily unstable, because either one of the copies can be deleted without phenotypic consequences and is therefore invisible to selection (Brookfield, 1992; Cooke et al., 1997; Nowak et al., 1997). Although perfect redundancy is predicted to be unstable, redundant duplicates may persist in an organism for millions of generations (Nowak et al., 1997). Moreover, duplicate redundancy may be maintained even longer by sequence homogenization mechanisms such as gene conversion. Gao and Innan (2004) found several examples of slowly evolving WGD duplicates in yeast which they attributed to gene conversion. After making similar observations on WGD duplicates in Arabidopsis, Chapman et al. (2006) suggested that the buffering of crucial functions might be one of the principal advantages of genome duplications. They went on to speculate that this effect might even cause the apparently cyclical reoccurrence of genome duplications in angiosperm plants. If the buffering of crucial genes were a major factor governing retention of duplicates, one would expect important genes to be duplicated more often than dispensable genes. Several studies have investigated the relationship between gene duplicability and gene essentiality or dispensability. 
He and Zhang (2006) reported that less important genes in yeast, as measured by the fitness effect of their homozygous deletion, intrinsically have a higher duplicability than that of important genes. To avoid the confounding effect on deletion fitness of functional compensation among duplicates (Gu et al., 2003; Kamath et al., 2003; Conant and Wagner, 2004), He and Zhang looked only at singleton genes in S. cerevisiae and measured their duplicability by assessing whether or not any of the orthologs in four other yeast species have retained duplicates, but by doing so, they probably introduced bias as well. Moreover, their findings may be explained partially by an underrepresentation of complex-forming proteins among the genes with retained duplicates in other species. Complex-forming proteins tend to be less dispensable (Jeong et al., 2001) and frequently cause dosage effects when duplicated through SSD (see Section 3.6). Seemingly at odds with their previous finding, He and Zhang also found that singleton genes with retained duplicates in other species are more conserved in sequence than across-species singletons. Jordan et al. also reported that slowly evolving genes in yeast appear to be retained preferentially after duplication, and linked slow evolution to importance (Jordan et al., 2004). However, slowly evolving genes are not necessarily more important. As Davis and Petrov (2004) pointed out, slow evolution of genes may be caused by high codon bias, a hallmark of selection for expression efficiency, and slowly evolving genes may be preferentially duplicated because of selection for increased dosage (see Section 4). None of these studies made a distinction between SSD and WGD duplicates. In a recent very elegant study, Ihmels et al. (2007) compared the genetic interaction patterns of yeast duplicate genes and calculated that the presence of duplicates accounts for only about 25% of the robustness observed against single-gene deletions. 
Their study also revealed that even duplicate genes that can buffer for each other’s loss
typically exhibit rich genetic interaction patterns, indicative of limited backup capacity. Indeed, in the case of perfect backup, one would expect the gene to have at most one genetic interaction: with its backup copy. Moreover, most genetic interaction patterns of duplicate genes are divergent, indicating divergence in function. So although duplicate genes may often serve as backup copies under specific circumstances, this effect alone appears to be insufficient to secure their conservation. Lin et al. (2007) made a distinction between WGD and SSD duplicates and found that WGD duplicates in yeast are on average more dispensable than SSD duplicates. Moreover, whereas SSD duplicates become less dispensable as the protein sequence divergence (Ka) with their closest paralog increases, the dispensability of WGD duplicates appears to be remarkably little influenced by their sequence divergence. A similar conclusion was reached by Guan et al. (2007). These results indicate that WGD duplicates remain intrinsically more able to back up each other, even when their sequences have diverged substantially. A recent study showed that more than one in three WGD pairs in yeast exhibits phenotypic buffering under standard laboratory conditions, and that other WGD pairs show epistatic effects only under particular stress conditions (Musso et al., 2008).

3.4 Increased Dosage

Buffering is not the only potential reason why functionally identical duplicates may be maintained indefinitely. Duplication may serve to increase the expression levels of certain gene products that are needed in large amounts, such as ribosomal proteins or histones. Seoighe and Wolfe (1999) noticed that highly expressed genes, such as ribosomal genes, were retained preferentially in duplicate after the WGD in yeasts. 
Following the work of Gao and Innan (2004) mentioned above, Sugino and Innan (2006) linked selection for increased dosage to the occurrence of gene conversion in yeast WGD pairs, arguing that selection may favor gene conversion of highly dosed duplicate genes. Lin et al. (2006) reanalyzed 56 low-Ks WGD gene pairs studied by Gao and Innan (2004) and found that all of these slowly evolving WGD pairs exhibit strong codon-usage bias, a hallmark of selection for translational efficiency. Only approximately half of the pairs showed signs of gene conversion. Therefore, Lin et al. suggested that gene conversion is not so much favored by increased dosage selection, but that prolonged gene conversion is instead facilitated by the reduced rate of sequence divergence caused by codon-usage bias. Aury et al. (2006) found a strong correlation between expression levels and WGD duplicate retention rate in Paramecium. A small subset of duplicates containing, for example, ribosomal constituents, histones, and cytoskeleton components exhibited not only high expression levels but also low protein sequence divergence, low levels of synonymous substitution, and optimized codon usage, in accordance with the increased dosage hypothesis. A particularly interesting example of selection for increased dosage has been uncovered in yeast. Recently, Conant and Wolfe (2007) hypothesized that retention of specific glycolytic genes after the WGD in yeasts has caused an increased glycolytic flux that gave post-WGD yeast species a growth advantage by increasing their glucose fermentation speed.

3.5 Dosage Balance

All of the mechanisms described above, except for the increased dosage hypothesis, have in common that they assume that newborn duplicates are phenotypically neutral.
However, this is frequently not the case. Often, severe fitness defects are observed upon gene duplication. For example, many human genetic diseases are associated with duplication of single genes, larger segments, or chromosomes (Chen et al., 2005; Kondrashov and Kondrashov, 2006; Conrad and Antonarakis, 2007). In most cases, the cause of the deleterious effect is a change in dosage balance (i.e., the stoichiometry of the cellular components gets upset). Early versions of the dosage balance hypothesis focused mainly on the effects of stoichiometric imbalances in regulatory protein complexes, especially those affecting transcription (Birchler et al., 2001; Veitia, 2002, 2003). More recently, dosage balance effects have also been linked to structural proteins, signal transduction cascades, and complex-forming proteins in general (Papp et al., 2003a; Kondrashov and Koonin, 2004; Liang et al., 2008; Veitia et al., 2008). In a landmark study, Papp et al. (2003a) showed that an imbalance in the concentration of the components of protein complexes in yeast generally leads to lower fitness, demonstrating the pervasiveness of dosage balance effects. A corollary of the balance hypothesis is that duplication of individual protein complex subunits would be harmful and thus selected against. Papp et al. found that members of large gene families in yeast are indeed less frequently involved in protein complexes than members of small gene families. Another study (Yang et al., 2003) suggested that in humans, dosage sensitivity increases and subunit duplicability decreases with an increasing number of subunits in a complex. Moreover, in yeast, subunits of heterogeneous protein complexes are significantly less duplicable than homocomplex subunits, consistent with the dosage balance hypothesis (Lin et al., 2007). 
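The logic of the balance hypothesis can be made concrete with a toy calculation. The sketch below assumes a heteromeric complex with one subunit of each type, assumes that the amount of assembled complex is limited by the scarcest subunit, and charges a hypothetical linear fitness cost for every unit of unassembled subunit. The function names and the cost value are illustrative assumptions, not quantities taken from the studies cited above.

```python
def assembled_complex(subunit_doses):
    """Amount of complete complex, limited by the scarcest subunit (1:1 stoichiometry)."""
    return min(subunit_doses.values())

def unassembled_excess(subunit_doses):
    """Total amount of subunits left over after complex assembly."""
    formed = assembled_complex(subunit_doses)
    return sum(dose - formed for dose in subunit_doses.values())

def toy_fitness(subunit_doses, cost_per_excess_unit=0.1):
    """Hypothetical fitness: 1 minus a linear penalty for unassembled subunits."""
    return 1.0 - cost_per_excess_unit * unassembled_excess(subunit_doses)

# Relative doses of three subunits of one hypothetical complex.
baseline = {"A": 1, "B": 1, "C": 1}
ssd_of_a = {**baseline, "A": 2}                    # only subunit A duplicated
wgd = {name: 2 * dose for name, dose in baseline.items()}  # all subunits duplicated
wgd_then_loss_of_a = {**wgd, "A": 1}               # one duplicate lost after WGD

if __name__ == "__main__":
    for label, doses in [
        ("baseline", baseline),
        ("SSD of A", ssd_of_a),
        ("WGD", wgd),
        ("WGD, then loss of one A copy", wgd_then_loss_of_a),
    ]:
        print(f"{label}: excess={unassembled_excess(doses)}, fitness={toy_fitness(doses):.2f}")
```

In this toy setting, duplicating a single subunit creates an excess, whereas duplicating all subunits together does not. Losing one duplicate after a whole-complex duplication again creates an imbalance.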
A significant effect of increasing heterogeneity of heterocomplexes on the duplicability of the subunits could not be established, possibly due to the small sample size (Lin et al., 2007). Papp et al. restricted their analysis of dosage effects primarily to complex-forming genes but suggested that other classes of genes, specifically transcription factors and developmental genes, might be particularly prone to cause dosage effects. They did not find any evidence of that in yeast, which they ascribed to the fact that yeast transcription factors influence relatively few genes and that yeast lacks the long regulatory cascades that underlie multicellular development in higher eukaryotes. In contrast, Yang et al. (2003) pointed out that higher eukaryotes may be less sensitive than yeast to dosage changes, because of their higher intrinsic robustness against expression variations and more sophisticated systems to control gene and protein expression levels. Additionally, alternative splicing in higher eukaryotes might play an important role in fixing imbalance effects, and the smaller effective population size of higher eukaryotes, combined with the greater potential for duplicate subfunctionalization through tissue specialization, may facilitate the fixation and retention of duplicates of dosage-sensitive genes (Liang et al., 2008). Liang et al. (2008) examined protein underwrapping as a potential cause of dosage balance effects. The underwrapping parameter quantifies the extent to which the backbone of a protein is accessible to water. Highly underwrapped proteins are structurally unstable because the backbone hydrogen bonds that determine the structural integrity of the protein may be dissolved through solvent hydration of the polar groups. Therefore, underwrapped proteins are predicted to be part of protein complexes that shield the underwrapped backbone from the surrounding water. 
When the stoichiometry of such a complex is upset, excess underwrapped proteins frequently show a tendency to aggregate, often with detrimental consequences, as in Alzheimer’s disease and Parkinson’s disease (Fernández et al., 2003; Conrad and Antonarakis, 2007). Consistent with
the dosage balance hypothesis, Liang et al. (2008) found that highly underwrapped proteins are less likely to be retained after duplication, and that they are more often retained after WGD than after SSD. They also found that the effect of protein underwrapping on gene duplicability decreases with increasing organismal complexity, consistent with the hypothesis that higher eukaryotes may be less sensitive to dosage changes (Yang et al., 2003). However, protein underwrapping is only one cause of dosage effects. Gene dosage appears to be one of the main factors influencing retention of gene duplicates, even in multicellular eukaryotes (see below).

3.6 Functional Bias

Dosage balance is the only mechanism that can adequately explain one of the most salient features of gene duplicate retention: the striking difference in functional bias seen in duplicates originating from SSD and WGD. Papp et al. (2003a) conjectured that constituents of protein complexes should be retained preferentially after WGD. Indeed, because all members of a protein complex are duplicated simultaneously by WGD, imbalance effects could be circumvented more easily. Moreover, following WGD, the members of a balance-sensitive duplicated protein complex should be lost or retained together to avoid deleterious imbalance effects caused by the loss of single complex constituents. Several studies have assessed the impact of SSD and WGD on the gene complement of an organism. Given their high frequency of (paleo)polyploidization, plants are particularly attractive study objects for this purpose. Blanc and Wolfe (2004a) found that genes retained in duplicate after the most recent WGD in A. thaliana (α) were not distributed evenly over all gene ontology (GO) categories (Ashburner et al., 2000). Regulatory genes, such as transcription factors, signal transducers, protein kinases, protein phosphatases, and developmental genes were found to be enriched in the set of duplicates retained from the α event. 
Some complex-forming genes (e.g., ribosomal genes, proteasome subunits, and the photosystem II oxygen-evolving complex) were also found to be overretained. Genes involved in several highly conserved processes, such as DNA replication and repair, tRNA charging, and mitochondrial and chloroplast function, were underretained. Seoighe and Gehring (2004) also investigated the survivability of duplicates after the α duplication. They found that genes retained in duplicate after the γ or β polyploidization events are significantly more likely to have retained duplicates from the α event as well, suggesting that some genes are inherently more duplicable through WGD than are others. Similar to Blanc and Wolfe (2004a), Seoighe and Gehring (2004) found that the set of duplicates retained after α is biased toward transcriptional regulators and signal transducers. A limiting factor in these studies is that the resolution of WGD duplicates is confined to the pairs that can still be found in duplicated blocks with conserved gene content and order (Vision et al., 2000; Simillion et al., 2002; Blanc et al., 2003; Bowers et al., 2003), which compromises their use on the older events, γ and β, that have faded signatures. Also, the studies mentioned above do not compare the effects of WGD and SSD. Maere et al. (2005) developed a method to model the population dynamics of duplicate genes in Arabidopsis, taking into account both the WGD and SSD duplication modes. One of the advantages of the modeling approach is that individual gene duplications do not need to be attributed to a specific mode of duplication, circumventing the WGD identification problem.
For many GO categories, duplicate decay rates after WGD and SSD were found to be strikingly different (see Figure 1). Most notably, genes involved in transcriptional regulation, signal transduction, and development generally show high retention of duplicates in WGD modes, in accordance with previous studies, but low retention in the SSD mode. The same behavior is observed for protein-binding genes and transporter categories such as ion, channel/pore class, and electron transporters. Developmental genes show strong retention after γ and β, but not after α and SSD. Cell cycle and morphogenesis genes also show more retention after the WGDs than after SSD, but at a lower amplitude, indicating less conservation of those duplicates in general. In the same vein, genes involved in DNA metabolism and RNA binding are not well retained, regardless of the mode of duplication. The same goes for structural ribosomal genes, except after the α event, consistent with Blanc and Wolfe’s results (2004a). Overall, it seems that well-conserved basic cellular processes generally accumulate fewer duplicates. Metabolism and stress response categories generally show higher duplicate retention after SSD. Markedly, genes involved in biotic stress responses and related processes (e.g., secondary metabolism, lipid binding, oxygen binding, cell death) retain duplicates at a high rate after WGD and SSD alike. Compared with biotic stress-response genes, genes involved in the response to abiotic stimuli show lower duplicate retention after γ, α, and SSD and higher retention after the β event. It has recently been shown that the γ event was a hexaploidization rather than a tetraploidization event (Jaillon et al., 2007; Tang et al., 2008). Although this may affect the model parameters learned by Maere et al., the qualitative results, in particular the reciprocal relationship of duplicate retention after WGD and SSD for regulatory and developmental genes, are unlikely to change. 
Recently, Michael Freeling (2009) compared the retention of duplicate genes after SSD and the α event using different methods and found the same reciprocal relationship for many regulatory and complex-forming gene classes. All four studies on Arabidopsis find that regulatory genes and complex-forming genes are preferentially retained after WGDs (Blanc and Wolfe, 2004a; Seoighe and Gehring, 2004; Maere et al., 2005; Freeling, 2009). Studies on several other organisms have arrived at similar conclusions. Seoighe and Wolfe observed that signal transduction genes and ribosomal genes were overretained after the WGD that occurred in the S. cerevisiae lineage approximately 100 Mya (Seoighe and Wolfe, 1999). Later, Davis and Petrov (2005) confirmed these results and found that transcription factors were also retained in excess after the yeast WGD. Blomme et al. (2006) constructed phylogenetic trees for more than 8000 gene families in seven vertebrate species, from fish to human, and investigated the gain and loss of duplicate genes during 600 million years of vertebrate evolution. From the position of the duplication events in these trees relative to the speciation events, the duplications can be attributed to certain branches on the species tree. Not surprisingly, large amounts of duplicate genes were found to have been created on branches that coincide with proposed genome duplication events [one or two rounds (1R/2R) at the base of the vertebrate tree, and three rounds (3R) in the teleost fish lineage]. Furthermore, when the retention pattern of these duplicates was investigated, a strong bias was uncovered toward retention of regulatory genes (e.g., transcription factors, signal transducers, developmental genes), protein-binding genes, and ion transporters, both for 1R/2R and 3R and across several species. 
The enrichment of transcription factors and signaling genes among polyploidy-derived gene duplicates had already been noticed earlier in smaller-scale studies, both for the WGDs
[Figure 1: graphic and caption not recovered. Recoverable labels: axes "number of retained duplicates" and "KS"; legend "observed Ks distr."; event markers SSD, α, β, and γ; panels (A) and (B). Panel titles with the parameter values printed next to them, paired in extraction order: development, α0 = 1.55 ± 0.28, α1 = 0.60 ± 0.04, α2 = 0.45 ± 0.03, α3 = 0.75 ± 0.05; whole paranome, α0 = 1.25 ± 0.02, α1 = 0.90 ± 0.03, α2 = 0.65 ± 0.01, α3 = 0.85 ± 0.01; TF activity, α0 = 1.25 ± 0.16, α1 = 0.60 ± 0.04, α2 = 0.30 ± 0.02, α3 = 0.55 ± 0.04; secondary metabolism, α0 = 1.40 ± 0.12, α1 = 0.45 ± 0.15, α2 = 0.15 ± 0.07, α3 = 0.55 ± 0.11; DNA metabolism, α0 = 0.85 ± 0.15, α1 = 0.80 ± 0.11, α2 = 0.85 ± 0.11, α3 = 0.75 ± 0.08; signal transduction, α0 = 3.05 ± 1.07, α1 = 0.35 ± 0.03, α2 = 0.10 ± 0.02, α3 = 0.35 ± 0.03.]
98%) identity level, and many fewer more diverged gene copies (Lynch and Conery, 2000; Kondrashov et al., 2002) generally supports models of gene duplications that assume neutral fixation. On the other hand, these data are also consistent with a large fraction of very similar gene copies undergoing frequent gene conversion, which appears to be the case in baker’s yeast (Gao and Innan, 2004) and Drosophila (Osada and Innan, 2008). If gene conversion were found to be pervasive among gene copies in other species as well, it would undermine a major argument for neutrality of recent gene duplications. Evidence in support of the hypothesis that the majority of gene duplications are fixed by positive selection rather than genetic drift is also scarce on a genomewide level. Moore and Purugganan (2003) found a reduced level of polymorphisms around recently duplicated genes in Arabidopsis thaliana, implying that their fixations occurred under positive selection. It is unfortunate that no other study attempts to examine the selection pressure acting on gene duplications in the course of their fixation because such population genetic studies are probably the best way to resolve the issue at hand. Another study observed a higher number of recently duplicated genes in the mouse genome compared to the human genome, which can be explained by a stronger selection for these gene copies in the mouse because of the larger effective population size in that species (Shiu et al., 2006). Finally, the observation that recent gene duplications, including CNVs, are enriched for environmentally sensitive, stress-induced, and
GENE DOSAGE AND DUPLICATION
defensive functions in a wide variety of species (Kondrashov et al., 2002; Gu et al., 2002; Hooper and Berg, 2003; Francino, 2005; Nguyen et al., 2006; Shiu et al., 2006; Cooper et al., 2007; Emerson et al., 2008; Hanada et al., 2008; Korbel et al., 2008; Ponting, 2008; Powell et al., 2008) currently provides the broadest support for the hypothesis that most gene copies that have emerged independent of WGD events are fixed by positive selection in response to a changing environment. Two interesting observations have been made on a genomewide level that support neither the neutral nor the adaptive view of gene duplications. The first is the observation that gene duplications that remain after WGD events are functionally different from gene duplications that occur individually (Davis and Petrov, 2005; Hakes et al., 2007). It is likely that different evolutionary mechanisms determine which genes are retained after a WGD or after a single-gene duplication, but this observation in itself does not reveal the selection regime present in the course of fixation of single-gene duplications. The second observation is that there is a substantial relaxation of selection in gene copies after a gene duplication (Lynch and Conery, 2000; Kondrashov et al., 2002). This observation is consistent with both neutral fixation and selective fixation: Genes that are fixed by selection for increased gene dosage will experience a period of relaxed selection as long as fitness after a gene duplication has not increased more than twofold (Kondrashov et al., 2002). In addition, the acquisition of a functional novelty may occur rapidly: that is, before enough substitutions accumulate in the diverging lineages to make the Ka/Ks ratio informative.
7 DOSAGE AND GENETIC DOMINANCE
As we have seen from examples described in the literature, an increase in the number of gene copies can lead to an increase or decrease in fitness. Similarly, a decrease in the number of gene copies can have different consequences for fitness. In the 1930s, Sewall Wright related the concept of gene dosage to the fitness of heterozygous deleterious mutations, which in the case of a loss-of-function mutation corresponds to a loss of one copy of a gene (Wright, 1934). The aim of Wright’s model was to explain why some alleles are recessive whereas others are dominant, and he modeled different fitness functions that showed interdependence between dosage (genotype) and fitness (phenotype). Wright’s ideas on how the action of genes and their dosage is manifested in the relationship between genotype and fitness have been developed for certain functions of genes (Kacser and Burns, 1981; Veitia, 2002, 2004; Conrad and Antonarakis, 2007) and extended beyond the concept of dominance to include gene duplications (Kondrashov and Koonin, 2004; Veitia, 2004; Conrad and Antonarakis, 2007). The mathematical intricacies of these models are important for the understanding of the reasons why and how dosage may affect fitness. Wright (1934) and others (Kacser and Burns, 1981; Veitia, 2002, 2004) have shown that within one unifying theory, two types of functions lead to different consequences of decreasing gene dosage, which correspond to recessive and dominant loss-of-function alleles. Similarly, understanding the interplay between the increased dosage of a gene and its impact on fitness provides a basis for predicting the evolutionary fate of gene duplications (Papp et al., 2003; Kondrashov and Koonin, 2004; Birchler and Veitia, 2007; Conrad and Antonarakis, 2007; Lehner, 2008).
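The link between dosage and dominance can be illustrated with a small calculation in the spirit of the metabolic argument of Kacser and Burns (1981). The sketch below assumes a strongly simplified linear pathway in which flux is proportional to the reciprocal of the summed reciprocal enzyme activities, with all activities normalized to 1 in the wild type. This formula and the function names are assumptions made here for illustration, not a reproduction of the cited models.

```python
def pathway_flux(activities):
    """Flux through a simplified linear pathway (assumed form: 1 / sum of 1/activity)."""
    return 1.0 / sum(1.0 / a for a in activities)

def relative_flux(n_enzymes, dose):
    """Flux relative to wild type when the dose of one enzyme is scaled.

    dose = 1.0 corresponds to the wild-type genotype (AA),
    dose = 0.5 to a heterozygous loss-of-function genotype (Aa),
    dose = 2.0 to a duplicated gene (AA,AA).
    """
    wild_type = pathway_flux([1.0] * n_enzymes)
    altered = pathway_flux([dose] + [1.0] * (n_enzymes - 1))
    return altered / wild_type

if __name__ == "__main__":
    for n in (1, 10):
        for dose in (0.5, 1.0, 2.0):
            print(f"enzymes={n:2d} dose={dose:.1f} relative flux={relative_flux(n, dose):.3f}")
```

With a single enzyme the response to dosage is linear, so halving the dose halves the flux. With ten enzymes sharing control, halving one dose reduces the flux to 10/11 of the wild-type value and doubling it raises the flux to 10/9.5, which is the diminishing returns behavior associated with recessive loss-of-function alleles.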
8 DOSAGE THEORY AND GENE DUPLICATIONS
Wright’s idea has been generalized for gene duplications with four different types of fitness functions. Perhaps the most straightforward function is a simple linear one, where a decrease of gene dosage is deleterious and an increase is beneficial. The diminishing returns function proposed by Wright describes instances where both a decrease and an increase of gene dosage have little or no effect. A function with a clear optimum describes the opposite case of dosage balance (Veitia, 2002), where both a decrease and an increase of gene dosage are deleterious. Finally, a concave-shaped function describes the case where a decrease of gene dosage may be benign whereas an increase is deleterious (Figure 2). It seems that most genes should be classifiable within these four categories, including those genes that may be showing complete redundancy. Thus, modeling the fates of gene duplications within a context of gene dosage is possible for a wide variety of fitness outcomes of changes in gene dosage and copy number. The theoretical treatment of gene duplications through dosage, which includes the possibility of complete redundancy as an extreme case, appears to be more universal than the neutral theory, which assumes only complete redundancy.
Figure 2 Dependence of fitness on gene dosage (the number of gene copies or genotype). Horizontal axis: protein concentration (genotype), marked at 0 (aa), 1/2 (Aa), 1 (AA), and 2 (AA,AA); vertical axis: fitness (phenotype). Four types of functions are displayed: in red, a linear fitness function; in blue, a diminishing returns function; in green, a function with a well-defined optimum; and in magenta, a concave function. The area between the two dashed lines defines an area where fitness is similar to the fitness of an individual with one gene copy. (Adapted from Wright, 1934; Kondrashov and Koonin, 2004; Conrad and Antonarakis, 2007.) (See insert for color representation of the figure.)
However, two major obstacles limit our ability to apply this theory to real data. First, we have only a vague idea of how to determine which type of fitness function is appropriate for a particular gene. In theory, enzymes should be described by a diminishing returns type of function (Kacser and Burns, 1981), and genes encoding isoforms of a large protein should be described by a function with an optimum (Veitia, 2002), whereas proteins with a propensity to aggregate should be described by the concave function (Conrad and Antonarakis, 2007). It is less clear what types of gene functions may exhibit a linear relationship between dosage and fitness; however, empirical observations point to genes with binding or regulatory function (Kondrashov and Koonin, 2004). Despite these theoretical considerations, empirical observations of many enzymes providing an adaptive response to stressful environments show that not all enzymes follow the diminishing returns dosage rule. Moreover, it is likely that for many genes the real fitness function will be somewhere in between the four characterizations displayed here.
Another conceptual problem with determining the relationship between dosage and fitness for individual genes is that this relationship can depend drastically on the environment. Duplications of genes can be beneficial under stressful conditions but deleterious in a benign environment (Brown et al., 1998; Raymond et al., 1998; Guillemaud et al., 1999; Kondrashov et al., 2002; Foster et al., 2003, 2005; Bourguet et al., 2004; Francino, 2005; Lawrence, 2005); thus, under stressful conditions the fitness function is closer to linear, or at least to a fast-growing diminishing returns function, whereas in a normal setting it is closer to the concave or optimum type. It is possible that for a vast majority of genes, the relationship between dosage and fitness may resemble a very flat diminishing returns function, which is essentially the expectation of the complete redundancy model. On the other hand, there seem to be plenty of data that support a stronger dependence of fitness on dosage. However, many of the genome-level inferences of the fitness impacts of recent, small-scale duplications are obtained through indirect observations, and thus whether or not flat, diminishing returns functions describe most of the genes in the genome remains a matter of opinion.
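The four fitness-function shapes described above can be sketched numerically. The functional forms, parameter values, and the 5% tolerance band (standing in for the dashed lines of Figure 2) are assumptions chosen only to reproduce the qualitative behavior in the text; they are not taken from Wright (1934) or the other cited sources, and all names are hypothetical.

```python
import math

# Illustrative fitness functions of gene dosage d, each scaled so that
# fitness equals 1 at the normal dosage d = 1 ("AA" in Figure 2).
# d = 0.5 is the heterozygote "Aa"; d = 2 is a duplication.


def linear(d):
    return d


def diminishing_returns(d, k=0.02):
    # Saturating (hyperbolic) curve; a small k means early saturation.
    return d * (1 + k) / (d + k)


def optimum(d, width=0.5):
    # Bell-shaped curve peaking at the normal dosage.
    return math.exp(-((d - 1) / width) ** 2)


def concave(d):
    # Flat up to the normal dosage, then declining.
    return 1 - max(0.0, d - 1) ** 2


FUNCTIONS = {
    "linear": linear,
    "diminishing returns": diminishing_returns,
    "optimum": optimum,
    "concave": concave,
}


def effect(f, dosage, tolerance=0.05):
    """Classify fitness at `dosage` relative to fitness at d = 1."""
    delta = f(dosage) - f(1.0)
    if abs(delta) <= tolerance:
        return "neutral"
    return "beneficial" if delta > 0 else "deleterious"


def classify_all(tolerance=0.05):
    """Map each function name to (effect of halving, effect of doubling)."""
    return {
        name: (effect(f, 0.5, tolerance), effect(f, 2.0, tolerance))
        for name, f in FUNCTIONS.items()
    }


if __name__ == "__main__":
    for name, (loss, gain) in classify_all().items():
        print(f"{name}: halving is {loss}, doubling is {gain}")
```

With these particular shapes the classification matches the verbal description: halving is deleterious and doubling beneficial for the linear function, both are effectively neutral for diminishing returns, both are deleterious for the optimum function, and only doubling is deleterious for the concave function. An environment-dependent model could be sketched the same way by switching the function assigned to a gene between stressful and benign conditions.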
In addition, evidence that the functional repertoire of genes duplicated by recent small-scale events differs from that of genes retained from WGD events (Davis and Petrov, 2005; Hakes et al., 2007) suggests that different fitness functions may be appropriate for different scales of gene duplication.
How are small-scale gene duplications fixed in natural populations? Almost a century after the importance of gene duplications was first realized, we still do not have a solid answer to this question. In this chapter I have presented a synopsis of evidence supporting the assertion that positive selection may play a decisive role in the fixation of many gene duplications. Many other authors support the traditional view that most gene duplications are fixed by genetic drift. It seems that only genome-wide population genetics studies aimed at probing the selection pressures acting on gene duplications in the course of their fixation may provide a straightforward answer.
REFERENCES
Anderson RP, Roth JR. 1977. Tandem genetic duplications in phage and bacteria. Annu Rev Microbiol 31:473–505.
Anderson RP, Roth JR. 1979. Gene duplication in bacteria: alteration of gene dosage by sister-chromosome exchanges. Cold Spring Harb Symp Quant Biol 43(Pt 2):1083–1087.
Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Ségurens B, Daubin V, Anthouard V, Aiach N, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.
Birchler JA, Veitia RA. 2007. The gene balance hypothesis: from classical genetics to modern genomics. Plant Cell 19:395–402.
Birchler JA, Riddle NC, Auger DL, Veitia RA. 2005. Dosage balance in gene regulation: biological implications. Trends Genet 21:219–226.
Bourguet D, Guillemaud T, Chevillon C, Raymond M. 2004. Fitness costs of insecticide resistance in natural breeding sites of the mosquito Culex pipiens. Evolution 58:128–135.
Bridges CA. 1935. Salivary chromosome maps. J Hered 26:60–64.
Brown CJ, Todd KM, Rosenzweig RF. 1998. Multiple duplications of yeast hexose transport genes in response to selection in a glucose-limited environment. Mol Biol Evol 15:931–942.
Buckland PR. 2003. Polymorphically duplicated genes: their relevance to phenotypic variation in humans. Ann Med 35:308–315.
Chartier-Harlin MC, Kachergus J, Roumier C, Mouroux V, Douay X, Lincoln S, Levecque C, Larvor L, Andrieux J, Hulihan M, et al. 2004. Alpha-synuclein locus duplication as a cause of familial Parkinson’s disease. Lancet 364:1167–1169.
Conrad B, Antonarakis SE. 2007. Gene duplication: a drive for phenotypic diversity and cause of human disease. Annu Rev Genom Hum Genet 8:17–35.
Cooper GM, Nickerson DA, Eichler EE. 2007. Mutational and selective effects on copy-number variants in the human genome. Nat Genet 39:S22–S29.
Craven SH, Neidle EL. 2007. Double trouble: medical implications of genetic duplication and amplification in bacteria. Future Microbiol 2:309–321.
Croce CM. 2008. Oncogenes and cancer. N Engl J Med 358:502–511.
Cuervo AM, Stefanis L, Fredenburg R, Lansbury PT, Sulzer D. 2004. Impaired degradation of mutant alpha-synuclein by chaperone-mediated autophagy. Science 305:1292–1295.
Davis JC, Petrov DA. 2005. Do disparate mechanisms of duplication add similar genes to the genome? Trends Genet 21:548–551.
Degenhardt YY, Wooster R, McCombie RW, Lucito R, Powers S. 2008. High-content analysis of cancer genome DNA alterations. Curr Opin Genet Dev 18:68–72.
Derti A, Roth FP, Church GM, Wu CT. 2006. Mammalian ultraconserved elements are strongly depleted among segmental duplications and copy number variants. Nat Genet 38:1216–1220.
Devonshire AL, Field LM. 1991. Gene amplification and insecticide resistance. Annu Rev Entomol 36:1–23.
Devonshire AL, Moores GD. 1982. A carboxylesterase with broad substrate specificity causes organophosphorus, carbamate and pyrethroid resistance in peach–potato aphids (Myzus persicae). Pestic Biochem Physiol 18:235–246.
Djogbénou L, Chandre F, Berthomieu A, Dabiré R, Koffi A, Alout H, Weill M. 2008. Evidence of introgression of the ace-1(R) mutation and of the ace-1 duplication in West African Anopheles gambiae ss. PLoS ONE 3:e2172.
Dopman EB, Hartl DL. 2007. A portrait of copy-number polymorphism in Drosophila melanogaster. Proc Natl Acad Sci USA 104:19920–19925.
Emerson JJ, Cardoso-Moreira M, Borevitz JO, Long M. 2008. Natural selection shapes genome-wide patterns of copy-number polymorphism in Drosophila melanogaster. Science 320:1629–1631.
Field LM, Devonshire AL, Forde BG. 1988. Molecular evidence that insecticide resistance in peach–potato aphids (Myzus persicae Sulz.) results from amplification of an esterase gene. Biochem J 251:309–312.
Field LM, Blackman RL, Tyler-Smith C, Devonshire AL. 1999. Relationship between amount of esterase and gene copy number in insecticide-resistant Myzus persicae (Sulzer). Biochem J 339:737–742.
Fisher RA. 1935. The sheltering of lethals. Am Nat 69:446–455.
Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545.
Foster SP, Young S, Williamson MS, Duce I, Denholm I, Devine GJ. 2003. Analogous pleiotropic effects of insecticide resistance genotypes in peach–potato aphids and houseflies. Heredity 91:98–106.
Foster SP, Denholm I, Thompson R, Poppy GM, Powell W. 2005. Reduced response of insecticide-resistant aphids and attraction of parasitoids to aphid alarm pheromone; a potential fitness trade-off. Bull Entomol Res 95:37–46.
Francino MP. 2005. An adaptive radiation model for the origin of new gene functions. Nat Genet 37:573–577.
Freeling M. 2008. The evolutionary position of subfunctionalization, downgraded. Genome Dyn 4:25–40.
Freeling M, Thomas BC. 2006. Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Res 16:805–814.
Gao LZ, Innan H. 2004. Very low gene duplication rate in the yeast genome. Science 306:1367–1370.
Groot PC, Bleeker MJ, Pronk JC, Arwert F, Mager WH, Planta RJ, Eriksson AW, Frants RR. 1989. The human alpha-amylase multigene family consists of haplotypes with variable numbers of genes. Genomics 5:29–42.
Gu Z, Cavalcanti A, Chen FC, Bouman P, Li WH. 2002. Extent of gene duplication in the genomes of Drosophila, nematode, and yeast. Mol Biol Evol 19:256–262.
Guillemaud T, Raymond M, Tsagkarakou A, Bernard C, Rochard P, Pasteur N. 1999. Quantitative variation and selection of esterase gene amplification in Culex pipiens. Heredity 83:87–99.
Hakes L, Pinney JW, Lovell SC, Oliver SG, Robertson DL. 2007. All duplicates are not equal: the difference between small-scale and genome duplication. Genome Biol 8:R209.
Haldane JBS. 1933. The part played by recurrent mutation in evolution. Am Nat 67:5–9.
Hanada K, Zou C, Lehti-Shiu MD, Shinozaki K, Shiu SH. 2008. Importance of lineage-specific expansion of plant tandem duplicates in the adaptive response to environmental stimuli. Plant Physiol 148:993–1003.
Hastings PJ. 2007. Adaptive amplification. Crit Rev Biochem Mol Biol 42:271–283.
Hemingway J. 2000. The molecular basis of two contrasting metabolic mechanisms of insecticide resistance. Insect Biochem Mol Biol 30:1009–1015.
Hemingway J, Hawkes NJ, McCarroll L, Ranson H. 2004. The molecular basis of insecticide resistance in mosquitoes. Insect Biochem Mol Biol 34:653–665.
Hendrickson H, Slechta ES, Bergthorsson U, Andersson DI, Roth JR. 2002. Amplification-mutagenesis: evidence that “directed” adaptive mutation and general hypermutability result from growth with a selected gene amplification. Proc Natl Acad Sci USA 99:2164–2169.
Hoebler C, Karinthi A, Devaux MF, Guillon F, Gallant DJ, Bouchet B, Melegari C, Barry JL. 1998. Physical and chemical transformations of cereal food during oral digestion in human subjects. Br J Nutr 80:429–436.
Hooper SD, Berg OG. 2003. On the nature of gene innovation: duplication patterns in microbial genomes. Mol Biol Evol 20:945–954.
Hughes AL. 1999. Adaptive Evolution of Genes and Genomes. New York: Oxford University Press.
Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C. 2004. Detection of large-scale variation in the human genome. Nat Genet 36:949–951.
Ibáñez P, Bonnet AM, Débarges B, Lohmann E, Tison F, Pollak P, Agid Y, Dürr A, Brice A. 2004. Causal relation between alpha-synuclein gene duplication and familial Parkinson’s disease. Lancet 364:1169–1171.
Jiang Z, Tang H, Ventura M, Cardone MF, Marques-Bonet T, She X, Pevzner PA, Eichler EE. 2007. Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat Genet 39:1361–1368.
Kacser H, Burns JA. 1981. The molecular basis of dominance. Genetics 97:639–666.
Kimura M, King JL. 1979. Fixation of a deleterious allele at one of two “duplicate” loci by mutation pressure and random drift. Proc Natl Acad Sci USA 76:2858–2861.
King PH, Waldrop R, Lupski JR, Shaffer LG. 1998. Charcot–Marie–Tooth phenotype produced by a duplicated PMP22 gene as part of a 17p trisomy-translocation to the X chromosome. Clin Genet 54:413–416.
Koch AL. 1981. Evolution of antibiotic resistance gene function. Microbiol Rev 45:355–378.
Kondrashov FA, Kondrashov AS. 2006. Role of selection in fixation of gene duplications. J Theor Biol 239:141–151.
Kondrashov FA, Koonin EV. 2004. A common framework for understanding the origin of genetic dominance and evolutionary fates of gene duplications. Trends Genet 20:287–290.
Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV. 2002. Selection in the evolution of gene duplications. Genome Biol 3:R8.
Korbel JO, Kim PM, Chen X, Urban AE, Weissman S, Snyder M, Gerstein MB. 2008. The current excitement about copy-number variation: how it relates to gene duplications and protein families. Curr Opin Struct Biol 18:366–374.
Kugelberg E, Kofoid E, Reams AB, Andersson DI, Roth JR. 2006. Multiple pathways of selected gene amplification during adaptive mutation. Proc Natl Acad Sci USA 103:17319–17324.
Labbé P, Berthomieu A, Berticat C, Alout H, Raymond M, Lenormand T, Weill M. 2007a. Independent duplications of the acetylcholinesterase gene conferring insecticide resistance in the mosquito Culex pipiens. Mol Biol Evol 24:1056–1067.
Labbé P, Berticat C, Berthomieu A, Unal S, Bernard C, Weill M, Lenormand T. 2007b. Forty years of erratic insecticide resistance evolution in the mosquito Culex pipiens. PLoS Genet 3:e205.
Lawrence JG. 2005. Common themes in the genome strategies of pathogens. Curr Opin Genet Dev 15:584–588.
Lehner B. 2008. Selection to minimise noise in living systems and its implications for the evolution of gene expression. Mol Syst Biol 4:170.
Li WH. 1980. Rate of gene silencing at duplicate loci: a theoretical study and interpretation of data from tetraploid fishes. Genetics 95:237–258.
Li WH. 1997. Molecular Evolution. Sunderland, MA: Sinauer Associates.
Li X, Schuler MA, Berenbaum MR. 2007. Molecular mechanisms of metabolic resistance to synthetic and natural xenobiotics. Annu Rev Entomol 52:231–253.
Liang H, Fernández A. 2008. Evolutionary constraints imposed by gene dosage balance. Front Biosci 13:4373–4378.
Liang H, Plazonic KR, Chen J, Li WH, Fernández A. 2008. Protein under-wrapping causes dosage sensitivity and decreases gene duplicability. PLoS Genet 4:e11.
Lupski JR, Stankiewicz P. 2005. Genomic disorders: molecular mechanisms for rearrangements and conveyed phenotypes. PLoS Genet 1:e49.
Lupski JR, de Oca-Luna RM, Slaugenhaupt S, Pentao L, Guzzetta V, Trask BJ, Saucedo-Cardenas O, Barker DF, Killian JM, Garcia CA, et al. 1991. DNA duplication associated with Charcot–Marie–Tooth disease type 1A. Cell 66:219–232.
Lupski JR, Wise CA, Kuwano A, Pentao L, Parke JT, Glaze DG, Ledbetter DH, Greenberg F, Patel PI. 1992. Gene dosage is a mechanism for Charcot–Marie–Tooth disease type 1A. Nat Genet 1:29–33.
Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155.
Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473.
Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, Van de Peer Y. 2005. Modeling gene and genome duplications in eukaryotes. Proc Natl Acad Sci USA 102:5454–5459.
Maynard Smith J. 1978. The Evolution of Sex. Cambridge, UK: Cambridge University Press.
Meisler MH, Ting CN. 1993. The remarkable evolutionary history of the human amylase genes. Crit Rev Oral Biol Med 4:503–509.
Moore RC, Purugganan MD. 2003. The early stages of duplicate gene evolution. Proc Natl Acad Sci USA 100:15682–15687.
Moore RC, Purugganan MD. 2005. The evolutionary dynamics of plant duplicate genes. Curr Opin Plant Biol 8:122–128.
Mouchès C, Pasteur N, Bergé JB, Hyrien O, Raymond M, de Saint Vincent BR, de Silvestri M, Georghiou GP. 1986. Amplification of an esterase gene is responsible for insecticide resistance in a Californian Culex mosquito. Science 233:778–780.
Newcomb RD, Gleeson DM, Yong CG, Russell RJ, Oakeshott JG. 2005. Multiple mutations and gene duplications conferring organophosphorus insecticide resistance have been selected at the Rop-1 locus of the sheep blowfly, Lucilia cuprina. J Mol Evol 60:207–220.
Nguyen DQ, Webber C, Ponting CP. 2006. Bias of selection on human copy-number variants. PLoS Genet 2:e20.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Ohta T. 1990. How gene families evolve. Theor Popul Biol 37:213–219.
Osada N, Innan H. 2008. Duplication and gene conversion in the Drosophila melanogaster genome. PLoS Genet 4:e1000305.
Papp B, Pál C, Hurst LD. 2003. Dosage sensitivity and the evolution of gene families in yeast. Nature 424:194–197.
Pasteur N, Raymond M. 1996. Insecticide resistance genes in mosquitoes: their mutations, migration, and selection in field populations. J Hered 87:444–449.
Patel PI, Roa BB, Welcher AA, Schoener-Scott R, Trask BJ, Pentao L, Snipes GJ, Garcia CA, Francke U, Shooter EM, et al. 1992. The gene for the peripheral myelin protein PMP-22 is a candidate for Charcot–Marie–Tooth disease type 1A. Nat Genet 1:159–165.
Paton MG, Karunaratne SH, Giakoumaki E, Roberts N, Hemingway J. 2000. Quantitative analysis of gene amplification in insecticide-resistant Culex mosquitoes. Biochem J 346:17–24.
Payne SR, Kemp CJ. 2005. Tumor suppressor genetics. Carcinogenesis 26:2031–2045.
Perry GH, Dominy NJ, Claw KG, Lee AS, Fiegler H, Redon R, Werner J, Villanea FA, Mountain JL, Misra R, et al. 2007. Diet and the evolution of human amylase gene copy number variation. Nat Genet 39:1256–1260.
Polymeropoulos MH, Lavedan C, Leroy E, Ide SE, Dehejia A, Dutra A, Pike B, Root H, Rubenstein J, Boyer R, et al. 1997. Mutation in the alpha-synuclein gene identified in families with Parkinson’s disease. Science 276:2045–2047.
Ponting CP. 2008. The functional repertoires of metazoan genomes. Nat Rev Genet 9:689–698.
Powell AJ, Conant GC, Brown DE, Carbone I, Dean RA. 2008. Altered patterns of gene duplication and differential gene gain and loss in fungal pathogens. BMC Genom 9:147.
Prince VE, Pickett FB. 2002. Splitting pairs: the diverging fates of duplicated genes. Nat Rev Genet 3:827–837.
Pronk JC, Frants RR, Jansen W, Eriksson AW, Tonino GJ. 1982. Evidence of duplication of the human salivary amylase gene. Hum Genet 60:32–35.
Qian W, Zhang J. 2008. Gene dosage and gene duplicability. Genetics 179:2319–2324.
Rapoport IA. 1940. Mnogokratnye linejnye povtoreniya uchastkov khromosom i ikh evolyucionnoe znachenie. [Multiple linear repeats of chromosome segments and their evolutionary significance.] Zh Obshch Biol 1:235–270.
Raymond M, Chevillon C, Guillemaud T, Lenormand T, Pasteur N. 1998. An overview of the evolution of overproduced esterases in the mosquito Culex pipiens. Philos Trans R Soc Lond B 353:1707–1711.
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, et al. 2006. Global variation in copy number in the human genome. Nature 444:444–454.
Roossinck MJ. 1997. Mechanisms of plant virus evolution. Annu Rev Phytopathol 35:191–209.
Romero D, Palacios R. 1997. Gene amplification and genomic plasticity in prokaryotes. Annu Rev Genet 31:91–111.
Schwab M. 1999. Oncogene amplification in solid tumors. Semin Cancer Biol 9:319–325.
Scott JG. 1999. Cytochromes P450 and insecticide resistance. Insect Biochem Mol Biol 29:757–777.
Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Månér S, Massa H, Walker M, Chi M, et al. 2004. Large-scale copy number polymorphism in the human genome. Science 305:525–528.
Shackelton LA, Holmes EC. 2004. The evolution of large DNA viruses: combining genomic information of viruses and their hosts. Trends Microbiol 12:458–465.
Shiu SH, Byrnes JK, Pan R, Zhang P, Li WH. 2006. Role of positive selection in the retention of duplicate genes in mammalian genomes. Proc Natl Acad Sci USA 103:2232–2236.
Singleton AB, Farrer M, Johnson J, Singleton A, Hague S, Kachergus J, Hulihan M, Peuralinna T, Dutra A, Nussbaum R, et al. 2003. α-Synuclein locus triplication causes Parkinson’s disease. Science 302:841.
Sonti RV, Roth JR. 1989. Role of gene duplications in the adaptation of Salmonella typhimurium to growth on limiting carbon sources. Genetics 123:19–28.
Stark GR, Wahl GM. 1984. Gene amplification. Annu Rev Biochem 53:447–491.
Stoltzfus A. 1999. On the possibility of constructive neutral evolution. J Mol Evol 49:169–181.
Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, et al. 2007. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315:848–853.
Tabashnik BE. 1990. Implications of gene amplification for evolution and management of insecticide resistance. J Econ Entomol 83:1170–1176.
Taylor M, Feyereisen R. 1996. Molecular biology and evolution of resistance of toxicants. Mol Biol Evol 13:719–734.
Taylor JS, Raes J. 2004. Duplication and divergence: the evolution of new genes and old ideas. Annu Rev Genet 38:615–643.
Thomas BC, Pedersen B, Freeling M. 2006. Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Res 16:934–946.
Valdez IH, Fox PC. 1991. Interactions of the salivary and gastrointestinal systems: I. The role of saliva in digestion. Dig Dis 9:125–132.
Veitia RA. 2002. Exploring the etiology of haploinsufficiency. Bioessays 24:175–184.
Veitia RA. 2004. Gene dosage balance in cellular pathways: implications for dominance and gene duplicability. Genetics 168:569–574.
Velkov VV. 1982. Gene amplification in prokaryotic and eukaryotic systems. Genetika 18:529–543.
Vontas JG, Small GJ, Hemingway J. 2000. Comparison of esterase gene amplification, gene expression and esterase activity in insecticide susceptible and resistant strains of the brown planthopper, Nilaparvata lugens (Stål). Insect Mol Biol 9:655–660.
Wagner A. 1998. The fate of duplicated genes: loss or new function? Bioessays 20:785–788.
Walsh JB. 1995. How often do duplicated genes evolve new functions? Genetics 139:421–428.
Webb C. 2003. A complete classification of Darwinian extinction in ecological interactions. Am Nat 161:181–205.
Weir B, Zhao X, Meyerson M. 2004. Somatic alterations in the human cancer genome. Cancer Cell 6:433–438.
Widholm JM, Chinnala AR, Ryu JH, Song HS, Eggett T, Brotherton JE. 2001. Glyphosate selection of gene amplification in suspension cultures of 3 plant species. Physiol Plant 112:540–545.
Wilson TG. 2001. Resistance of Drosophila to toxins. Annu Rev Entomol 46:545–571.
Wolf JB, Brodie ED III, Wade MJ (eds.). 2000. Epistasis and the Evolutionary Process. Oxford, UK: Oxford University Press.
Wright S. 1934. Physiological and evolutionary theories of dominance. Am Nat 68:24–53.
Yasui K, Mihara S, Zhao C, Okamoto H, Saito-Ohara F, Tomida A, Funato T, Yokomizo A, Naito S, Imoto I, et al. 2004. Alteration in copy numbers of genes as a mechanism for acquired drug resistance. Cancer Res 64:1403–1410.
5 Myths and Realities of Gene Duplication
AUSTIN L. HUGHES and ROBERT FRIEDMAN
Department of Biological Sciences, University of South Carolina, Columbia, South Carolina
1 INTRODUCTION
According to Li (1983, p. 14), “gene duplication is probably the most important mechanism for generating new genes and new biochemical processes that have facilitated the evolution of complex organisms from primitive ones.” Many evolutionary theorists have emphasized the importance of gene duplication in evolution (Nei, 1969; Ohno, 1973; Kimura and Ohta, 1974; Hughes, 1999a). Although the availability of complete genome sequences, particularly of eukaryotes, has provided us with an extraordinarily rich database on the gene duplication events that have occurred over evolutionary history, the mechanisms by which new gene functions evolve in connection with gene duplication remain elusive. In this chapter we briefly review some of the main theoretical ideas that have been proposed regarding the evolution of new gene and protein function after gene duplication. We then review some data regarding the evolution of new gene functions in the light of theory. We emphasize in particular the results of our own studies over the past decade, which illustrate the complexity of gene evolution and the difficulty of making general statements that are applicable to every case. Indeed, our view of the evolutionary process is that by its very nature, it defies easy generalization. Evolution depends on the haphazard and unpredictable raw material of mutation, filtered through such population processes as genetic drift and natural selection. What natural selection favors will be what works in terms of reproductive success, not necessarily what is “well designed” by any standard derived from human engineering. And what works may come about by many different pathways, suggesting that evolutionary biologists must always adopt a pluralistic mindset, ready to acknowledge that in biology just about every general rule has exceptions.
Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles Copyright © 2010 Wiley-Blackwell
Unfortunately, evolutionary biology at the present time remains shackled by an outmoded way of thinking inherited from the Neo-Darwinists of the early twentieth century. The Neo-Darwinists tended to view natural selection as a kind of magic that can accomplish whatever is needed to enable an organism to achieve optimal adaptation to a given environment. This naive view was based on the oft-stated belief that most natural populations contain sufficient heritable variation to respond to any challenge that the environment might impose. The reality is far different. The ability to respond to environmental change can rarely rely on preexisting variation within a population but, rather, depends on new mutations. The numerous species that have been driven to extinction as a result of human disruption of their environments in the past few centuries provide a dramatic illustration of the impotence of natural selection to respond to environmental challenges in the absence of appropriate mutations.
Moreover, because evolutionary biology is a historical science, it depends on the type and quality of evidence available. There may be certain questions about the past history of life on Earth that we will never be able to answer with any certainty, simply because the evidence that might enable us to answer them is no longer available to us. In such cases we can hypothesize and can deem certain hypotheses to be more plausible than others on the basis of what data are available. But the elaboration and testing of evolutionary hypotheses always require a certain degree of humility, the ability to distinguish what we know for certain from what is only a plausible story. In our view, hypothesis testing is the essence of the scientific process; and no hypothesis should be accepted without rigorous and thorough testing against alternatives. Because evolutionary biology is a historical science, scientists working in this field need to be particularly rigorous and careful in testing hypotheses in order to distinguish what is solidly established from what remains mere speculation. Unfortunately, evolutionary biologists have frequently been remiss in this regard. In a famous paper, Gould and Lewontin (1979) decried the “adaptive storytelling” rampant among Neo-Darwinists. One would hope that the situation would have improved with the advent of molecular data, but in many ways it has gotten worse.
The proliferation of numerous ill-founded statistical methods has given rise to a kind of “computer-assisted storytelling” that purports to test hypotheses but in fact does not adequately consider alternatives (Hughes et al., 2006). As a result of the uncritical acceptance of untested hypotheses, modern molecular evolutionary biology has become riddled with mythical thinking. In order to come to an accurate understanding of the role of gene duplication in evolution, it will be necessary to free ourselves from myths and accept only what is solidly established. As well as a review of some major hypotheses and relevant data in the field of gene duplication, our chapter represents a plea for critical thinking in the field of gene duplication.
2 HYPOTHESES OF NEW GENE FUNCTION
2.1 Mutation During Redundancy
The cytogeneticist Susumu Ohno (1973) proposed that the origin of a new function after gene duplication must involve a stage in which the duplicate gene is completely redundant and therefore free to vary at random. In his description of this process, Ohno seemed to envisage that a duplicate gene would be functionally completely redundant, allowing the accumulation of random mutations, which might by chance give rise to a new function. He described the proposed scenario as follows:
The mechanism of gene duplication provides a temporary escape from the relentless pressure of natural selection to a duplicated copy of a functional gene locus. While being ignored by natural selection, a duplicated and thus redundant copy is free to accumulate all manner of randomly sustained mutations. As a result, it may become a degenerate, nonsense DNA base sequence. Occasionally, however, it may acquire a new active site sequence, therefore a new function and emerge triumphant as a new gene locus. (Ohno, 1973, p. 259)
Ohno’s model became the dominant one over the next two decades, although occasionally other authors expressed or implied different views. Hughes (1994a) noted a number of reasons why the scenario proposed by Ohno [the mutation during redundancy (MDR) model] is unlikely to apply in most cases where a new function has evolved after gene duplication. First, the MDR model implies an absence of constraint on duplicate genes. Yet comparison of the pattern of synonymous and nonsynonymous nucleotide substitution in duplicate genes in a polyploid animal, the frog Xenopus laevis, supported the hypothesis that duplicate genes are subject to purifying selection and thus are not freed from all constraint (Hughes and Hughes, 1993). In addition, there were at least some cases in which credible evidence supported the hypothesis that functional divergence between members of multigene families occurred as a result of positive Darwinian selection favoring multiple amino acid changes in functionally important regions (Hughes, 1994a). Ohno’s (1973) model assumed that mutations giving rise to a new function occur at random during a period of nonfunctionality, and positive selection is not really compatible with this model. Hughes (1999a) reviewed examples of apparent positive selection leading to diversification of the members of multigene families. Since that time, there has been an explosion in the literature of cases in which it has been claimed that positive Darwinian selection has been involved in the diversification of gene family members. Unfortunately, most of these claims have been based on “codon-based” methods of testing for positive selection, which rest on a false assumption (Hughes, 2007). Therefore, the majority of these claims are probably not valid (see below). In addition, there was evidence that functional differentiation might precede rather than follow gene duplication. The most striking example involved the lens crystallins of animals.
In the evolution of eyes, it appears that a wide variety of different proteins have been recruited to act as crystallins, the major structural protein of the lens. Moreover, Piatigorsky and colleagues provided evidence that this evolutionary process regularly involves a stage they called gene sharing, where a single gene encodes both a crystallin and a protein performing a completely different function (Piatigorsky and Wistow, 1991). For example, the δ-crystallins of certain birds do double duty as argininosuccinate lyases (Piatigorsky and Wistow, 1991). A pattern in which gene sharing precedes gene duplication suggests a very different model than that proposed by Ohno, which envisages the new function as a result of random exploration of functional space. It also contradicts the pronouncement of Kimura and Ohta (1974, p. 429) that “gene duplication must always precede the emergence of a gene having a new function.” Finally, molecular biology made available the sequences of many pseudogenes. It seems obvious that most pseudogenes are too severely damaged by mutation for Ohno’s concept of the resurrection of a functionless gene to be plausible in most cases. Loss of function in the case of genes seems to be in general a one-way street. However, since one should never say “never” when it comes to evolution, it is still possible that Ohno’s scenario or something like it has occurred in a small number of cases. One possibility might involve cases where a certain type of mutation that damages one function of a gene might thereby give rise to a new function. As a possible example, there are several known cases where a competitive inhibitor or
antagonist of a given signaling molecule is encoded by a gene that is homologous to the gene encoding that signaling molecule. For example, the interleukin-1 receptor antagonist (IL-1RA) in mammals is clearly homologous to the functional interleukins IL-1α and IL-1β (Eisenberg et al., 1991). One possible scenario for the origin of an antagonist from a homologous functional cytokine would be gene duplication followed by mutation in one gene copy that abolished the encoded protein’s signaling function without damaging its ability to bind the IL-1 receptor (Hughes, 1994b). On the other hand, it is worth noting that damage to a duplicate gene may occur during the process of gene duplication itself. Thus, the prolonged period of redundancy during which mutations accumulate, as envisaged by Ohno, may not be necessary at all. Analysis of evolutionarily recent gene duplicates in the nematode worm Caenorhabditis elegans has revealed a surprising degree of structural heterogeneity between duplicates (Katju and Lynch, 2006). Gene duplications may be partial, and the loss of one or more exons may immediately create a gene with an altered function. Similarly, duplicate genes may pick up new exons from a variety of sources, including both coding and noncoding sequences (Katju and Lynch, 2006).

2.2 Gene-Sharing Model

Hughes (1994a) pointed out that gene sharing in crystallins suggests a general model for the evolution of new gene (and protein) functions: namely, that in cases where new functions have evolved after gene duplication, a period of gene sharing typically precedes duplication (Figure 1). When a gene is bifunctional (or multifunctional), gene duplication permits partitioning of the ancestral functions between the daughter genes, allowing them to specialize for distinct subsets of the ancestral function.
Because the functions divided among the daughter genes are already present in the ancestral gene, this model requires no period of redundancy during which a new function may emerge by chance. On this model, gene sharing is not just a peculiarity of crystallins but represents the standard pathway by which new gene functions evolve. The idea that a period of gene sharing might precede duplications that lead to functional differentiation was, in fact, not new. Several previous authors had developed or implied a similar scenario. For example, Goodman and colleagues (1975) hypothesized
Figure 1 Subdivision of ancestral functions (function 1 and function 2) between daughter genes (A′ and A′′) after duplication of an ancestral bifunctional gene (A).
that the vertebrate hemoglobin molecule was originally a homotetramer, prior to the gene duplication that gave rise to separate α-chain and β-chain genes. Along similar lines, Jensen (1976) and Jensen and Byng (1981) proposed that two enzymes that are specialized to catalyze two separate reactions might evolve, after gene duplication, from an enzyme capable of catalyzing both reactions. And the biochemist Leslie Orgel (1977), in a brief paper, specifically proposed that functional differentiation might precede rather than follow gene duplication. The GAL genes of brewer’s yeast, Saccharomyces cerevisiae, provide a potential example (Hughes, 1999a). These genes, which encode proteins involved in the metabolism of galactose, are inducible by galactose and repressible by glucose (Johnston, 1987). The protein product of GAL1, designated Gal1p, is a galactokinase, which catalyzes the production of galactose-1-phosphate from ATP and galactose, thereby initiating the pathway of galactose metabolism. Gal3p, the product of the GAL3 gene, shows clear evidence of a close evolutionary relationship with Gal1p but has no galactokinase activity (Bajwa et al., 1988; Bhat et al., 1990). Gal4p, the major transcription factor for the GAL genes, is inhibited by Gal80p (Figure 2A). In the presence of galactose and ATP, Gal3p acts as a coinducer for GAL gene expression, removing the inhibition by binding Gal80p and thus enabling Gal4p to activate transcription (Yano and Fukasawa, 1997; Figure 2A). In a related fungus species, Kluyveromyces lactis, there is no GAL3 gene; and Gal1p functions as both a coinducer and a galactokinase (Meyer et al., 1991). Interestingly, in S. cerevisiae itself, there are mutants lacking GAL3 expression in which Gal1p appears to be able to take over the regulatory role of Gal3p. Phylogenetic analysis supports the hypothesis that the common ancestor of S.
cerevisiae GAL1 and GAL3 encoded a bifunctional protein that, like that of K. lactis, functioned as both a coinducer and a galactokinase (Hughes, 1999a; Figure 2B). S. cerevisiae Gal3p is characterized by the deletion of two amino acid residues (SerAla) relative to known functional galactokinases (Hughes, 1999a); and it has been shown that experimental reinsertion of these two residues restores galactokinase activity to Gal3p (Platt et al., 2000). Thus, after gene duplication, the functions of the ancestral gene were partitioned between GAL1 and GAL3 , and the loss of enzymatic function accompanied specialization of Gal3p as a coinducer (Hughes, 1999a). A recent series of experiments by Hittinger and Carroll (2007) reveals the complexity of the functional changes after gene duplication in this apparently simple case. First, they showed that reinsertion of the deleted Ser-Ala dipeptide into Gal3p, while restoring galactokinase activity, caused the resulting molecule to be a subpar coinducer. Both K. lactis Gal1p and S. cerevisiae Gal1p were shown to be worse coinducers than the intact Gal3p; but the modified Gal3p was a still worse coinducer than either of those two functional galactokinases (Hittinger and Carroll, 2007). These results indicate that gene duplication has been followed by specialization on the part of Gal3p and that loss of its galactokinase activity has made possible increased effectiveness as a coinducer. Moreover, S. cerevisiae GAL1 and GAL3 have specialized in their promoters as well as in their coding sequences. The promoter for GAL3 has only one binding site for Gal4p, resulting in weakly inducible expression (Hittinger and Carroll, 2007). By contrast, the promoter for GAL1 has four binding sites for Gal4p, resulting in strongly inducible expression (Hittinger and Carroll, 2007). Moreover, the promoter of S. cerevisiae GAL1 has been structurally remodeled in comparison to that of K. 
lactis GAL1 , which presumably represents the ancestral state prior to gene duplication in the
Figure 2 (A) Function of Gal3p in brewer’s yeast, Saccharomyces cerevisiae. Gal3p de-represses expression of the GAL genes, a role that can be assumed by Gal1p in the absence of Gal3p (bottom). (B) Phylogeny of galactokinases, showing K. lactis Gal1p, S. carlsbergensis Gal1p, S. cerevisiae Gal1p, and S. cerevisiae Gal3p together with other eukaryotic galactokinases.
Saccharomyces lineage. The result is a more strongly inducible Gal1p in S. cerevisiae than in K. lactis (Hittinger and Carroll, 2007). Thus, the GAL1 /GAL3 example suggests that specialization by daughter genes has made possible more precise adaptation to functions that were previously shared by a bifunctional ancestral protein. Some other examples suggest that a similar process has occurred (Hughes, 2005). Note that the gene-sharing model of the evolution of new functions does not imply that all or even most gene duplications involve bifunctional genes. Gene duplication occurs continually in the evolution of genomes, and most duplicate genes are probably eventually lost (Lynch and Conery, 2000). Rather, it predicts that in the limited number of cases where gene duplication has given rise to functionally distinct daughter genes, the ancestor performed both functions (Hughes, 1994a). Thus, the immediate ancestor of each pair of functionally distinct paralogs is predicted to have performed both of
Figure 3 How the DDC model might operate in the case of a gene with two transcription factor–binding sites (TF binding sites A and B). After gene duplication, complementary mutations knock out each binding site in one of the daughter genes.
the daughter genes’ distinct functions, although not necessarily as well as the daughter genes do. Biologists may have been reluctant to see in gene sharing a general model for gene duplication because of a tendency to expect each gene (or protein) to have a single function. However, in the present era of systems biology, such a view seems outmoded (Hughes, 2005). Data on gene expression patterns and on networks of gene and protein interaction have provided striking evidence that the function of every gene is multidimensional. Thus, partitioning of the multidimensional functional space seems a possible occurrence after the duplication of any gene (Hughes, 2005; Piatigorsky, 2007). However, it remains uncertain how widespread the gene-sharing scenario has in fact been in the history of life.

2.3 DDC Model

Lynch and Force (2000) proposed an additional model of how gene functions might be shared among daughter genes after gene duplication, a process they called subfunctionalization. Their model, called duplication–degeneration–complementation (DDC), envisages complementary loss of function in the two duplicates, thereby rendering the loss of either duplicate disadvantageous. The simplest example of this model might be to imagine a gene that is expressed in two different tissues (say, liver and kidney) because it has binding sites for two different tissue-specific transcription factors (Figure 3). After duplication, a mutation might occur in the promoter region of one gene copy that damages the liver-specific transcription factor–binding site and thus eliminates gene expression in liver. But such a mutation is not strongly deleterious because the other gene copy retains expression in liver. Thus, a loss-of-function mutation in the liver-specific promoter of one of the gene copies may drift to fixation.
Meanwhile, a similar loss-of-function mutation in the kidney-specific promoter of the other gene may drift to fixation; again, such a mutation is not strongly deleterious because it complements the loss-of-function mutation in the other gene (Figure 3). However, the result is one liver-specific gene and one kidney-specific gene. Assuming that the organism requires expression of the protein in both tissues, purifying selection will oppose any further mutation eliminating the expression or protein function of either of the two genes. Note that although the DDC model is most easily illustrated in terms of tissue-specific expression, it might apply to any type of gene or protein function.
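The liver/kidney scenario can be sketched as a toy simulation. The following is only an illustration under simplifying assumptions that are not part of the DDC model as published: each gene copy has two binding sites and one coding region, every knockout that leaves both ancestral subfunctions covered is treated as fixing by drift with equal probability, and every other knockout is treated as removed by purifying selection. All names (`site_A`, `simulate_pair`, and so on) are hypothetical.

```python
import random

# Each copy (0 or 1) carries two binding sites and one coding region.
ELEMENTS = [(copy, part) for copy in (0, 1)
            for part in ("site_A", "site_B", "coding")]


def functions_covered(intact):
    """True if expression via site A and via site B is each still provided
    by at least one copy whose coding region is also intact."""
    for site in ("site_A", "site_B"):
        if not any((c, site) in intact and (c, "coding") in intact
                   for c in (0, 1)):
            return False
    return True


def simulate_pair(rng):
    """Knock out tolerated elements one at a time until none is left."""
    intact = set(ELEMENTS)
    while True:
        # Assumption: a knockout "drifts to fixation" only if both
        # subfunctions survive it; all tolerated knockouts are equally likely.
        tolerated = [e for e in sorted(intact)
                     if functions_covered(intact - {e})]
        if not tolerated:
            break
        intact.remove(rng.choice(tolerated))
    both_coding = (0, "coding") in intact and (1, "coding") in intact
    return "subfunctionalization" if both_coding else "nonfunctionalization"


def fraction_subfunctionalized(trials=2000, seed=1):
    rng = random.Random(seed)
    outcomes = [simulate_pair(rng) for _ in range(trials)]
    return outcomes.count("subfunctionalization") / trials


print(fraction_subfunctionalized())
```

Under these toy assumptions the fraction should come out near 2/9: two-thirds of first hits land on a binding site rather than a coding region, and one of the three knockouts tolerated next is the complementary one. The number itself has no biological meaning; the point is only that complementary degeneration can preserve both copies without any positively selected change.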
The DDC model cannot be applied straightforwardly to several well-studied cases, including that of the yeast galactokinases discussed above. The loss of galactokinase activity in S. cerevisiae Gal3p represents an apparent example of the type of “degeneration” envisaged in the DDC scenario, as does the loss of Gal4p-binding sites in the GAL3 promoter. However, there is no evidence of a “complementary” loss of function on the part of S. cerevisiae GAL1. The evidence suggests that S. cerevisiae Gal1p is no worse as a coinducer than is the bifunctional Gal1p of K. lactis (Hittinger and Carroll, 2007). Moreover, the change in the promoter of S. cerevisiae GAL1 has involved a change in the helical phasing of Gal4p binding sites, a change that can in no way be considered “degeneration” of the sort assumed by DDC. Thus, as with other models discussed here, the DDC model may be somewhat oversimplified. Nonetheless, the DDC model is important because it shows how shared ancestral functions can be parceled out between duplicates by processes involving only mutation and genetic drift, without requiring any role for positive Darwinian selection. If the “degenerative” mutations involved are either neutral or slightly deleterious, their chance of fixation will be greater when the effective population size is small. The effective population sizes of multicellular eukaryotes are smaller in general than those of unicellular eukaryotes or prokaryotes; and the difference in population size may be one factor contributing to the fact that gene families tend to be much larger in the former than in the latter (Lynch and Conery, 2003). However, because there are other factors at work, it is difficult to attribute the difference in gene family size between multicellular and unicellular organisms to population size difference alone.
For one thing, it seems likely that multicellularity creates more opportunities for subfunctionalization than does unicellularity; for example, gene expression can be subdivided by tissue. Second, the circular chromosomes typical of prokaryotes may impose some upper bound on genome size, resulting in some degree of purifying selection against gene duplication that is probably absent in most multicellular eukaryotes. It is interesting that in prokaryotes, the mean gene family size and the total nucleotide content of the genome are strongly correlated (Hughes et al., 2005; Figure 4). Such a close relationship seems unlikely in the case of eukaryotes, where
Figure 4 Mean number of genes per gene family as a function of genome size in 99 prokaryotic genomes. (From Hughes et al., 2005.)
most variation in genome size is probably due to variation in the content of repeating DNA in the genome (Hughes and Piontkivska, 2005). In prokaryotes themselves, it seems very unlikely, in fact, that species with large genomes, and thus large gene families, also have small long-term effective population sizes.
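The earlier remark that slightly deleterious “degenerative” mutations fix more readily when the effective population size is small can be made concrete with Kimura’s diffusion approximation for the fixation probability of a new mutation. The sketch below assumes a diploid population whose effective size equals its census size N, genic selection with coefficient s (negative for a deleterious mutation), and an initial frequency of 1/(2N); the function names and the numerical values are illustrative only and are not taken from the studies cited in this section.

```python
import math


def fixation_probability(s, n):
    """Kimura's approximation for a new mutation with selection coefficient s
    in a diploid population of size n (assumption: Ne equals n)."""
    if s == 0:
        return 1.0 / (2 * n)  # neutral limit
    # (1 - exp(-2s)) / (1 - exp(-4ns)), written with expm1 for accuracy
    return math.expm1(-2.0 * s) / math.expm1(-4.0 * n * s)


def relative_to_neutral(s, n):
    """Fixation probability divided by the neutral value 1/(2n)."""
    return fixation_probability(s, n) * 2 * n


# Hypothetical slightly deleterious mutation, s = -0.0001
for n in (1_000, 10_000, 100_000):
    print(n, relative_to_neutral(-1e-4, n))
```

With these illustrative values the mutation behaves almost neutrally in the smallest population and has essentially no chance of fixation in the largest one, which is the qualitative dependence on population size that the argument relies on.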
3 ROLE OF NATURAL SELECTION

In present-day organisms, we can recognize certain gene duplications that occurred in the past and gave rise to important new organismal adaptations. However, the role of positive Darwinian selection (that is, natural selection favoring advantageous mutations) in this process is not always clear. Part of the reason for this is that many such gene duplications occurred in the quite distant past, and the statistical methods often used to test for positive Darwinian selection (which involve comparing synonymous and nonsynonymous substitutions) are not applicable because synonymous sites are saturated with changes. Further confusion arises from the fact that certain widely used methods of testing for positive selection depend on seriously flawed assumptions and thus do not provide trustworthy evidence. Moreover, many (probably most) of the mutations that give rise to new protein functions leave no identifiable “signature” of positive selection. In a few cases, natural selection may favor a series of changes in the amino acid sequence of a protein that fit it better for a specialized task. A powerful tool in testing hypotheses regarding this type of natural selection is a comparison of the patterns of synonymous and nonsynonymous (amino acid–altering) nucleotide substitution (Hughes and Nei, 1988). Unfortunately, this approach has been widely abused in recent years (Hughes, 2007), and published claims that natural selection has acted to diversify members of multigene families need to be treated with caution. One of the earliest published cases that compared synonymous and nonsynonymous substitutions among members of a multigene family involved a highly unusual kind of multigene family, the variable region genes (or, more strictly speaking, gene segments) of mammalian immunoglobulins (Tanaka and Nei, 1989).
In the portion of these gene segments encoding the CDR region of the immunoglobulin, which binds antigens, the number of nonsynonymous nucleotide differences per nonsynonymous site (pN ) was found to exceed the number of synonymous differences per synonymous site (pS ). By contrast, in the remainder of the gene segment (encoding the framework region) the reverse pattern was seen, as in most genes (Tanaka and Nei, 1989). The fact that this highly unusual pattern of nucleotide substitution was seen in the CDR, where a diversity of amino acid sequences is likely to be advantageous because it enables the host to bind a diverse array of foreign antigens, supports the hypothesis that natural selection has acted to favor amino acid changes in the CDR region (Tanaka and Nei, 1989). Frequently, it is stated that a pattern whereby, in some set of codons, the number of nonsynonymous substitutions per nonsynonymous site (dN ) exceeds the number of synonymous substitutions per synonymous site (dS ) is a signature of positive selection, but such a statement is misleading. Note that in Tanaka and Nei’s (1989) study of immunoglobulins, as in Hughes and Nei’s (1988) study of major histocompatibility complex genes, the authors tested for a pattern of dN > dS in a set of codons where there was a biological reason for predicting that selection would favor amino acid
diversity. This is not the same thing as simply searching in a set of coding sequences for one or more codons in which dN > dS, using codon-based methods of testing for positive selection. Such methods no doubt are able to identify codons with this property, but a certain proportion of such codons are likely to occur by chance in most coding sequences, even under strong purifying selection (Hughes and Friedman, 2005a). Codon-based methods depend on the false assumption that the existence of one or more codons with dN > dS implies positive selection, and therefore these methods do not provide a valid test of the hypothesis of positive selection (Hughes, 2007). Unfortunately, codon-based methods have been used in the vast majority of published cases in which it has been claimed that positive selection has acted to favor amino acid changes within multigene families. Therefore, it remains unclear how widespread this phenomenon is. One problem with the comparison of synonymous and nonsynonymous substitutions is that it is only applicable over a fairly short time frame. Suppose that immediately after gene duplication, natural selection favors a series of nonsynonymous substitutions between two genes (Figure 5). If we can compare the two genes at that point, we will find that dN exceeds dS. However, once all the amino acid changes required to adapt the daughter genes to their specialized functions have occurred, no more amino acid changes will occur; and purifying selection at the amino acid sequence level will predominate, as in most genes (Figure 5). Assuming the neutrality or near-neutrality of most synonymous substitutions, dS between the two genes will thus eventually equal and finally overtake dN (Figure 5). Thus, even in cases where a pattern of dN > dS has occurred, it may not be detectable unless the events involved were relatively recent.
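The time course described here can be illustrated with a minimal numerical sketch. The functional forms and every parameter value below are assumptions chosen for illustration, not estimates from any data set: dS is taken to grow linearly with time, while dN consists of a burst of selectively favored changes that saturates quickly plus a slow background rate under purifying selection.

```python
import math


def d_s(t, rate_s=0.01):
    """Synonymous divergence: assumed to accumulate at a constant rate."""
    return rate_s * t


def d_n(t, adaptive_total=0.05, tau=1.0, rate_n=0.001):
    """Nonsynonymous divergence: an adaptive burst that levels off with
    time constant tau, plus a slow background rate (all values hypothetical)."""
    return adaptive_total * (1.0 - math.exp(-t / tau)) + rate_n * t


def crossover_time(t_lo=1.0, t_hi=50.0, tol=1e-6):
    """Bisection for the time at which dS catches up with dN.
    Assumes dN > dS at t_lo and dN < dS at t_hi."""
    while t_hi - t_lo > tol:
        mid = 0.5 * (t_lo + t_hi)
        if d_n(mid) > d_s(mid):
            t_lo = mid
        else:
            t_hi = mid
    return 0.5 * (t_lo + t_hi)


for t in (2.0, 20.0):
    print(t, d_n(t) > d_s(t))
print(crossover_time())
```

With these made-up parameters dN exceeds dS shortly after duplication, and the relationship is reversed once the adaptive burst is over. A comparison made only at the later time would show no excess of nonsynonymous change even though positive selection had acted.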
The pregnancy-associated glycoproteins (PAGs) of ruminants provide an example of a recently duplicated mammalian gene family in which a pattern of dN > dS can be observed in a potentially functionally important region of the molecule (Hughes et al., 2000). PAGs are homologous to aspartic proteinases; they have apparently lost proteinase function, although they retain the ability to bind peptides. The PAG family has undergone massive gene duplication in the ruminant lineage, resulting in 100 or more genes, which are expressed in the placenta. In most comparisons between PAG
Figure 5 Effect on dS and dN of directional selection favoring a series of amino acid differences between two genes duplicated at T0. At a certain time (T1), dN will exceed dS. But after the selectively favored amino acid changes have all been made, dS will catch up with and eventually exceed dN (T2 and T3).
Figure 6 Plots of dN vs. dS: (A) putative peptide-binding region; (B) remainder of pregnancy-associated glycoprotein genes of ruminants. (From Hughes et al., 2000.)
genes, dN exceeds dS in the codons encoding the putative peptide-binding region of the protein (Figure 6A). By contrast, in the remainder of the gene, dS exceeds dN, as in most genes (Figure 6B). The duplications of PAG genes have all occurred in the ruminant lineage within the past 50 million years or so (Hughes et al., 2000, 2003a). Thus, diversification of these genes has occurred recently enough that the acceleration of dN relative to dS in the putative peptide-binding region is clearly detectable. The PAGs cannot really be considered a convincing example of positive selection until the function of these molecules is understood. Only if we understand the function of PAGs can we know why repeated amino acid changes in a portion of the PAG protein might be favored. Nonetheless, the PAGs provide a vivid illustration of the fact that a pattern of dN > dS is most easily detected in the case of recently diverged paralogs. Both the immunoglobulin and PAG examples illustrate another point that is worth emphasizing: Comparison of nonsynonymous and synonymous substitutions is only
suitable for testing the hypothesis that natural selection has favored repeated amino acid changes at a limited set of positions (Hughes, 2007). But there is no reason to believe that this is a particularly common mode of selection on duplicated genes. Rather, in many cases, mutations that adapt duplicate genes for specialized functions may include such mutational events as the following: (1) replacement of a single amino acid; (2) deletion or insertion of one or more codons; (3) creation of a chimeric gene either by recombination with another protein-coding locus or by “capture” into coding exons of previously noncoding sequence; (4) loss of the appropriate splice signals and thus of expression of one or more exons; and (5) changes in regulatory regions leading to changes in gene expression. However, there are no statistical methods to test for positive selection in any of these cases, which almost certainly account for the overwhelming majority of cases of positive selection on duplicate genes. It is worth remarking that even degenerative changes, such as those envisaged by the DDC model, may in some cases be positively selected. Known examples of adaptive evolution at the molecular level very often involve loss-of-function mutations (Hoekstra and Coyne, 2007). Moreover, in the case of bifunctional proteins, loss of one function may sometimes be selectively favored if the presence of that function imposes a constraint that limits the effectiveness of the other function (Hughes, 2005). It seems to us that the most reasonable course for evolutionary biologists to take is to concentrate on understanding the functional differences between duplicates and not worry about uncovering evidence of past positive selection. For example, in the case of the yeast GAL genes discussed previously, it is plausible that natural selection played a role in favoring changes in the S.
cerevisiae GAL1 promoter that led to strongly inducible expression, given the well-understood adaptive advantage this confers (Hittinger and Carroll, 2007). But if it occurred, this selection has left no signature that we are likely to be able to detect. Understanding the functional differences between gene duplicates and identifying the mutations that caused those functional differences represents a much more important contribution to the advancement of biology as a science than does any bioinformatic search for supposed signatures of positive selection.
4 EXPRESSION DIFFERENTIATION
Theoretically, there are two distinct pathways by which duplicated protein-coding genes can become differentiated: (1) by differences in expression pattern, and (2) by differences in the amino acid sequence of the encoded protein, leading to functional change or specialization at the amino acid level. Intuitively, it seems likely that both types of differentiation can occur in the course of the evolution of a given duplicate gene pair, either simultaneously or at different stages of the process of functional differentiation. But it is unclear whether these two modes of diversification tend to occur in a mutually exclusive fashion or whether, on the contrary, they tend to go hand in hand. The availability of data from a number of sources regarding the patterns of gene expression and its regulation by transcription factors has made it possible to address questions of this sort.
4.1 Duplicated Genes in Arabidopsis Root Development

Hughes and Friedman (2005b) approached the question of how coding sequence divergence and gene expression patterns relate in multigene families using data from a study of gene expression at three developmental stages in five different tissue types of the developing root in Arabidopsis thaliana (Birnbaum et al., 2003). Expression data were obtained using the ATH1 GeneChip (Affymetrix, Santa Clara, CA) covering three developmental stages (stages 1, 2, and 3) in the following cell zones: (1) stele, (2) endodermis, (3) endodermis + cortex, (4) epidermal atrichoblasts, and (5) lateral root cap. Birnbaum et al. (2003) provided a data set giving raw expression scores (mean of three replicates) for 5717 transcripts that were shown to be regulated differentially across the 15 separate subzones (three stages × five cell types). Hughes and Friedman (2005b) estimated the 15 × 15 linear correlation matrix among the subzones, and principal components were extracted from this correlation matrix. The purpose of principal components analysis is to reduce a large number of variables (in the present case, 15 variables corresponding to the 15 subzones) to a smaller number of variables that explain most of the variance in the larger set. This amounts to rotating the original coordinate system in multivariate space to define new axes. The first two principal components (PC1 and PC2) extracted from the correlation matrix of expression scores in the 15 subzones accounted together for 81.3% of the trace of the matrix, 66.3% in the case of PC1 and 15.0% in the case of PC2. PC1 appeared to be a measure of overall level of expression, as shown by nearly equal loadings on each of the 15 subzones (Table 1). PC2 evidently provided a contrast between early and late expression, as shown by negative loadings on stage 1 and positive loadings on stage 3 (Table 1). PC2 also showed negative loadings on stage 2 in all but one cell type, stele (Table 1).
TABLE 1 Loadings for the First Two Principal Components (PC1 and PC2) on Variables Corresponding to Expression Levels in Arabidopsis Root Subzones

Cell Type                  Stage    PC1       PC2
Stele                      1        0.292     −0.166
Stele                      2        0.294      0.032
Stele                      3        0.178      0.498
Endodermis                 1        0.284     −0.200
Endodermis                 2        0.294     −0.063
Endodermis                 3        0.209      0.442
Endodermis + cortex        1        0.282     −0.190
Endodermis + cortex        2        0.296     −0.038
Endodermis + cortex        3        0.193      0.474
Epidermal atrichoblasts    1        0.280     −0.244
Epidermal atrichoblasts    2        0.285     −0.159
Epidermal atrichoblasts    3        0.243      0.181
Lateral root cap           1        0.229     −0.186
Lateral root cap           2        0.274     −0.096
Lateral root cap           3        0.187      0.243

Source: Data from Hughes and Friedman (2005b).
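The extraction of a first principal component from a correlation matrix, and the meaning of “percentage of the trace,” can be sketched in a few lines. The example below does not use the Arabidopsis data: it generates synthetic expression scores for six hypothetical subzones in which every gene has one overall expression level plus independent noise, and it finds the leading component by power iteration rather than by whatever statistical software the original authors used.

```python
import math
import random


def correlation_matrix(rows):
    """Pearson correlation matrix among the columns of rows."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(k)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(k)] for r in rows]
    return [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(k)]
            for a in range(k)]


def first_component(corr, iterations=200):
    """Leading eigenvector (loadings) and eigenvalue by power iteration."""
    k = len(corr)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(corr[a][b] * v[b] for b in range(k))
                     for a in range(k))
    return v, eigenvalue


def synthetic_expression(genes=300, zones=6, seed=0):
    """Hypothetical data: overall level per gene plus zone-specific noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(genes):
        level = rng.gauss(0.0, 1.0)
        rows.append([level + rng.gauss(0.0, 0.5) for _ in range(zones)])
    return rows


corr = correlation_matrix(synthetic_expression())
loadings, eigenvalue = first_component(corr)
# The trace of a correlation matrix equals the number of variables.
share_of_trace = eigenvalue / len(corr)
print([round(x, 3) for x in loadings], round(share_of_trace, 3))
```

Because the synthetic genes differ mainly in overall level, the loadings come out nearly equal, each close to 1/√6. For 15 variables, perfectly equal loadings would each be 1/√15 ≈ 0.258, which is the neighborhood of the PC1 column in Table 1 and is the basis for reading PC1 as overall expression level.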
These interpretations of PC1 and PC2 based on variable loadings (Table 1) were supported by analyses comparing PC1 and PC2 with ad hoc composite variables designed to reflect similar aspects of the data. The correlation coefficient between PC1 score and the mean expression level for the 15 subzones was 0.999 (p < 0.001), supporting the hypothesis that PC1 reflects overall expression level. Similarly, PC2 was strongly positively correlated with two composite variables, reflecting a contrast between early and late developmental stages: (1) the mean score for stage 3 (across the five zones) minus the mean score for stage 1 (r = 0.902; p < 0.001), and (2) the mean score for stage 3 minus the mean score of stages 1 and 2 (r = 0.890; p < 0.001). To provide an index of divergence at the coding sequence level, Hughes and Friedman (2005b) estimated dS and dN between the two members of two-member families (428 families). The ratio dN /dS was then compared with the scores on PC1 and PC2 to examine how divergence at the amino acid sequence level relates to expression divergence. For each of the families, the range of PC1 (i.e., the absolute difference in PC1 score between the two family members) was plotted against the ratio dN /dS (Figure 7A). There was a modest negative correlation between the two variables (Spearman’s rank correlation coefficient, rS = −0.235; p < 0.001) (Figure 7A). The evident cause of this negative correlation was the fact that there were a number of families with relatively high ranges of PC1 but low dN /dS (Figure 7A). When the range of PC2 was plotted against the ratio dN /dS , a negative correlation was again observed (rS = −0.107; p = 0.027) (Figure 7B). As with PC1, this negative correlation can be explained by the occurrence of a number of families with relatively high ranges of PC2 but low dN /dS (Figure 7B). 
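The comparison described above (range of a principal component score within each two-member family against dN/dS, summarized by Spearman’s rank correlation) can be sketched as follows. The six families and all of their values are invented for illustration; only the procedure is meant to mirror the text, and ties are handled by average ranks, which is an assumption about a detail the text does not specify.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rS: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical families: (PC1 score of member 1, PC1 score of member 2, dN/dS)
families = [
    (12.0, 3.5, 0.05),
    (6.0, 5.0, 0.20),
    (9.0, 2.0, 0.10),
    (4.0, 4.5, 0.45),
    (7.5, 5.5, 0.30),
    (3.0, 3.2, 0.60),
]
pc1_range = [abs(a - b) for a, b, _ in families]
dn_ds = [ratio for _, _, ratio in families]
print(spearman(pc1_range, dn_ds))
```

A rank-based coefficient is a natural choice here because dN/dS ratios and score ranges are strongly skewed; a negative value, as in this toy data set, means that the families with the greatest expression divergence tend to have the lowest dN/dS.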
Examination of the individual families showed that ribosomal protein families provided the most striking cases of two-member families having low dN/dS values along with high expression divergence (Hughes and Friedman, 2005b). When the ribosomal proteins were removed from the data set, the correlation between the range of PC1 and dN/dS became more modest (rS = −0.105; p = 0.038); and the correlation between the range of PC2 and dN/dS was no longer significant (rS = −0.022; n.s.). Using the same gene expression data set, dN and dS were estimated between the members of 190 phylogenetically independent sister pairs of sequences from the 41 largest families (out of 820 families in the data set). In these comparisons, dN/dS was not significantly correlated with the absolute difference in PC1 scores between the pair members (rS = 0.059; n.s.) nor with the absolute difference in PC2 scores between the pair members (rS = 0.062; n.s.). Nonetheless, these 190 comparisons included some pairs with relatively low dN/dS but high absolute difference in PC1 or PC2 scores. Thus, in the majority of cases, Hughes and Friedman (2005b) found that duplicate genes in Arabidopsis have diverged in the coding sequence but not to a great extent in expression pattern, at least in the cell types analyzed. There were exceptions to this generalization, most notably the ribosomal proteins, for which the opposite was true. But overall, these data suggest caution regarding any model of the evolution of new functions after gene duplication that places a great deal of emphasis on differences of expression pattern.

4.2 Transcription Factor Binding

Another way to examine the change of expression after gene duplication is to compare transcription factor binding by paralogous genes. Hughes and Friedman (2007) used
Figure 7 (A) Range of PC1 (see Table 1) plotted against dN/dS for 428 Arabidopsis two-member families (rS = −0.235; p < 0.001); (B) range of PC2 (see Table 1) plotted against dN/dS for 428 Arabidopsis two-member families (rS = −0.107; p = 0.027). Vertical axes: absolute difference in PC1 (A) and in PC2 (B); horizontal axes: dN/dS. (From Hughes and Friedman, 2005b.)
data on transcription factors associated with brewer’s yeast (Saccharomyces cerevisiae) genes to compare the transcription factors targeting 190 pairs of duplicate genes. Overall, there was a negative correlation between sequence divergence, at either synonymous or nonsynonymous sites, and the degree to which the duplicated gene pair shared the same transcription factors. However, there was a great deal of scatter in the data. Some gene pairs with little divergence in the coding region were found to bind similar sets of transcription factors, whereas other gene pairs with highly similar coding regions were found to have essentially no overlap in the sets of transcription factors they bind (Hughes and Friedman, 2007). The genome of brewer’s yeast contains a number of apparently duplicated segments, hypothesized by some authors to be relics of an ancient polyploidization event (see below; Wolfe and Shields, 1997). It was of interest that gene pairs in duplicated segments tended to share transcription factors to a greater extent than gene pairs outside duplicated segments (Hughes and Friedman, 2007). This pattern may be explained in part by the fact that when a large genomic segment is duplicated, transcription
factor–binding sites are likely to be duplicated along with the protein-coding genes. On the other hand, when a single gene is duplicated, there is a greater chance that not all of the transcription factor–binding sites will be duplicated. Of the 190 duplicate gene pairs analyzed by Hughes and Friedman (2007), the sets of transcription factors bound matched perfectly in 17 duplicated gene pairs. Ten of these pairs were ribosomal protein genes. Some information was available from the published literature regarding the function of three of the other gene pairs. DMA1 (YHR115C) and DMA2 (YNL116W) encode proteins that function in the positioning of the mitotic spindle (Fraschini et al., 2004). Mutants lacking both genes showed aberrant spindle positioning, but there was no detectable phenotypic effect if either of the two genes was deleted but the other was not (Fraschini et al., 2004). Thus, in terms of their role in spindle positioning, the two duplicated genes seem to be functionally redundant or nearly so. Nonetheless, the two proteins are quite divergent at the amino acid level, and the correlation of their expression scores across a large number of microarray experiments was actually rather low (Hughes and Friedman, 2007). PDI1 (YCL043C) and EUG1 (YDR518W) both encode protein disulfide isomerases found in the lumen of the endoplasmic reticulum (Noiva and Lennarz, 1992; Tachibana and Stevens, 1992). Although the two proteins are functionally similar, their functions appear not to be identical. Deletion of PDI1 renders the cell nonviable, but deletion of EUG1 has no such effect (Tachibana and Stevens, 1992). Moreover, in cells in which the protein encoded by PDI1 was depleted, overexpression of EUG1 compensated only partially for the defect (Tachibana and Stevens, 1992).
Finally, SAS5 (YOR213C) and TAF14 (YPL129W) are functionally similar in that both encode subunits of multisubunit complexes that regulate transcription, but the two proteins form part of structurally and functionally very different complexes. SAS5 encodes a protein that is part of a trimeric complex involved in transcriptional silencing of heterochromatin (Sutton et al., 2003; Shia et al., 2005). By contrast, the protein encoded by TAF14 (also known as ANC1 and TFG3) forms a part of the multisubunit RNA polymerase II holoenzyme, essential for transcription of protein-coding genes (Meyer and Young, 1998). The proteins are also very divergent at the amino acid level. These three examples of duplicate gene pairs with shared transcription factors show a range of patterns of functional divergence at the protein level, extending from apparent near-redundancy in the case of DMA1 and DMA2 to distinct and even opposing functions in the case of SAS5 and TAF14. The different ways in which these duplicate gene pairs have differentiated provide a vivid illustration of the fact that gene expression and protein sequence constitute a multidimensional space in which duplicates can differentiate along one or more axes (Hughes, 2005). Thus, different gene pairs can differentiate functionally in very different ways by exploiting different dimensions in this space of possible patterns of expression and protein function.
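One way to make "the degree to which a duplicated gene pair shares the same transcription factors" concrete is a set-overlap score. The sketch below uses the Jaccard index, which is an assumption made for illustration and not necessarily the measure used by Hughes and Friedman (2007); the gene names and transcription factor sets are hypothetical.

```python
def tf_sharing(tfs_a, tfs_b):
    """Jaccard index of two transcription factor sets (1.0 = identical sets)."""
    a, b = set(tfs_a), set(tfs_b)
    if not a and not b:
        return None  # undefined when neither gene has a recorded factor
    return len(a & b) / len(a | b)

# Hypothetical binding data: gene -> transcription factors bound at its promoter
binding = {
    "geneA1": {"TF1", "TF2", "TF3"},
    "geneA2": {"TF1", "TF2", "TF3"},
    "geneB1": {"TF1", "TF4"},
    "geneB2": {"TF5"},
}
pairs = [("geneA1", "geneA2"), ("geneB1", "geneB2")]

# Sharing score for each duplicate pair, and the pairs whose sets match perfectly
scores = {pair: tf_sharing(binding[pair[0]], binding[pair[1]]) for pair in pairs}
perfect_matches = [pair for pair, score in scores.items() if score == 1.0]
```

A score of 1.0 corresponds to the "matched perfectly" case described in the text, and a score of 0.0 to a pair with no overlap in the factors bound.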
5 SEGMENTAL DUPLICATION AND ITS AFTERMATH
5.1 The Polyploidization Obsession
One of the most persistent (and, to our way of thinking, pernicious) myths in the field of gene duplication has been the late Susumu Ohno’s belief that the duplication
of entire genomes by polyploidization is a key to the origin of new adaptations (Ohno, 1970). Ohno probably developed this theory because at the time he was working, the only known model for gene expression in any organism was that of the lac operon in Escherichia coli (Hughes, 2000). Assuming that eukaryotic genes are similarly organized into operons, Ohno evidently reasoned that duplication of a single gene was unlikely to lead to anything productive because the regulatory region for the operon would not be duplicated as well. Ohno (1970) dismissed the lungfish as a supposed evolutionary failure, but it was problematic for his theory that the lungfish has a large genome. Thus, using circular reasoning, Ohno (1970) argued that the lungfish must have achieved its large genome size through tandem duplication rather than whole-genome duplication because, in the case of the poor lungfish, no “progress” has resulted. He did not seem to realize that a lungfish is at least as well adapted to its environment as is any tetrapod, despite its lack of recent phenotypic innovation. One idea that traces its descent to Ohno is the 2R hypothesis, the hypothesis that two rounds of genome duplication occurred early in the vertebrate lineage, before the origin of jawed vertebrates (Hughes, 2000). These supposed rounds of genome duplication have been implicated as playing a key role in the origin of a number of key vertebrate adaptations, including the vertebrate-specific (“adaptive”) immune system (Flajnik and Kasahara, 2001). Vertebrate-specific immunity involves a number of gene families unique to jawed vertebrates: namely, immunoglobulins, T-cell receptors, and the class I and II molecules of the major histocompatibility complex (MHC). It has been argued that there are four clusters of genes in the typical vertebrate genome that are homologous to a set of genes linked to the class I and II MHC genes in mammals (Flajnik and Kasahara, 2001). 
In fact, phylogenetic analyses show that many of the genes in these clusters were duplicated early in the history of life, well before the origin of vertebrates (Hughes, 1998; Yeager and Hughes, 1999). But even if it is true that certain genes in these clusters were duplicated as a result of polyploidization events early in vertebrate history, those events explain nothing about the origin of the MHC, since the class I and II genes are linked to only one of the clusters, and no homolog to the MHC genes has been found in any species outside the jawed vertebrates. Advocates of the 2R hypothesis have focused on certain aspects of vertebrate genomes that are consistent with two rounds of polyploidization, notably the presence of four HOX gene clusters. But they have not subjected the hypothesis to rigorous testing. Every rigorous attempt to uncover an unambiguous signal of two rounds of genome duplication in vertebrates has failed to do so (Hughes, 1999b; Friedman and Hughes, 2001; Hughes et al., 2001; Hughes and Friedman, 2003a, 2004). In fact, all of the features of vertebrate genomes that have been attributed to 2R can be explained as easily or more easily by multiple separate events of duplication of individual genes or genomic segments. An unfortunate consequence of Ohno’s influence is the tendency to attribute to ancient polyploidization genomic features that might just as easily be explained by other mechanisms. Typically adduced as evidence of ancient polyploidization are the following: (1) regions of double synteny, that is, two or more genomic regions containing members of the same set of gene families (Wolfe and Shields, 1997); and (2) the existence of multiple pairs of duplicate genes with evidence (either from phylogenies or from evolutionary sequence distances) that they duplicated within roughly the same time frame (McLysaght et al., 2002).
Among genomes of model organisms, both brewer’s yeast and Arabidopsis undoubtedly show double synteny, but whether this is due to polyploidization remains debatable. Certainly, if polyploidization occurred in either case, the organisms have since become re-diploidized; and re-diploidization is hypothesized to have involved a massive loss of duplicated genes. In brewer’s yeast, it has been estimated that 85% of genes duplicated by a hypothetical polyploidization must subsequently have been lost (Wolfe and Shields, 1997). There has been only one rigorous test of the polyploidization hypothesis in yeast: that of Martin et al. (2007), who used standard algorithms for counting minimal numbers of rearrangements to compare the hypothesis of polyploidization with that of a series of segmental duplications. The hypothesis of multiple segmental duplications was found to provide a much more parsimonious explanation than polyploidization (Martin et al., 2007). There are several problems with studies that present evidence of a “burst” or “peak” of gene duplication within a particular time frame as evidence for polyploidization. First, many such studies are based simply on frequency distributions of dS values in comparisons of paralogs (Cui et al., 2006). But because nucleotide substitution is a discrete process, the distribution of dS will tend to form a number of peaks that do not reflect any biological reality. Especially when dS is greater than 1.0 (i.e., when synonymous sites are saturated with changes), peaks in the frequency distribution of dS values are very likely artifactual. Yet many studies have based claims of polyploidization on such high dS values. A further point is that any “peak” of gene duplication in the evolutionary past is not really a peak of gene duplication but rather a peak of retention of duplicate genes.
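The discreteness argument about dS can be illustrated with a deliberately simplified sketch. It assumes a Jukes-Cantor correction applied to the proportion of differing synonymous sites, which is simpler than the estimators normally used for dS, and a hypothetical gene pair with 150 synonymous sites. Because the number of observed differences is an integer, only a finite set of corrected distances is attainable, and the gaps between neighbouring attainable values widen as the sites approach saturation.

```python
import math

def jc_distance(k, n_sites):
    """Jukes-Cantor corrected distance for k observed differences at n_sites sites."""
    p = k / n_sites
    if p >= 0.75:
        raise ValueError("proportion of differences is at or beyond saturation")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

N_SITES = 150  # hypothetical number of synonymous sites in one gene pair

# All distances attainable with an integer number of differences below saturation
attainable = [jc_distance(k, N_SITES) for k in range(0, math.ceil(0.75 * N_SITES))]
gaps = [b - a for a, b in zip(attainable, attainable[1:])]

# Compare the spacing of attainable values near zero with the spacing above 1.0
first_gap = gaps[0]
gaps_above_one = [gap for dist, gap in zip(attainable, gaps) if dist > 1.0]
print(f"gap near 0: {first_gap:.4f}; smallest gap above 1.0: {min(gaps_above_one):.4f}")
```

When many gene pairs of similar length are pooled into a histogram, coarsely spaced attainable values of this kind can pile up in a few bins, which is one way that apparent peaks could arise without any biological cause.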
It is probable that gene duplication, like other forms of mutation, occurs at a steady rate over evolutionary time (Lynch and Conery, 2003), but that at certain periods duplicate genes are more likely to be retained (Friedman and Hughes, 2003). The role of retention of gene duplicates in evolution is illustrated by comparing the genomes of brewer’s yeast Saccharomyces cerevisiae and fission yeast Schizosaccharomyces pombe (Hughes and Friedman, 2003b). These two species are only very distantly related, their last common ancestor being estimated at 420 million years ago (Sipiczki, 2000). Both have evolved independently a single-celled “yeast” lifestyle. As mentioned previously, the genome of S. cerevisiae includes a number of duplicated segments; but there is no evidence of such segmental duplication in S. pombe. Phylogenetic analyses of individual gene families showed that the same genes were duplicated independently in S. cerevisiae and S. pombe to a far greater extent than expected by chance (Figure 8; Hughes and Friedman, 2003b). Many of these are families likely to play roles in the unicellular life cycle (Hughes and Friedman, 2003b). In both S. cerevisiae and S. pombe, there were “bursts” of retention of duplicate genes, yet the bursts occurred at different times and involved different mechanisms, since segmental duplication was a factor only in the former species (Hughes and Friedman, 2003b). If duplicate genes represent an important raw material of evolution, it might be argued that their origin matters little: whether duplication of single genes, duplication of genomic segments, or duplication of entire genomes. From this perspective, debating whether or not whole-genome duplication occurred in the past history of this or that species may appear a rather pointless exercise. In a sense this is true, since we often cannot distinguish segmental duplication from whole-genome duplication followed by massive loss of duplicate genes.
Nonetheless, we feel that the recent fad for publishing claims of ancient polyploidization has had a detrimental effect on the progress of
Figure 8 Numbers of gene families with one or more duplications observed after the last common ancestor of S. pombe and S. cerevisiae, illustrating the numbers of families duplicated in each species separately and in both species. Separate analyses were conducted for a data set of all proteins (650 families) and for a data set excluding ribosomal proteins (623 families). Numbers observed were compared with numbers expected, calculated by multiplying the proportion of families with one or more duplications in S. pombe by the proportion of families with one or more duplications in S. cerevisiae. For both data sets the numbers observed and expected were significantly different (χ2 = 117.2 and 82.3, respectively; p < 0.00001 in both cases). Panels: all proteins and excluding ribosomal proteins, each showing observed and expected counts for S. cerevisiae and S. pombe. (From Hughes and Friedman, 2003b.)
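The expected numbers in Figure 8 follow from an independence assumption: if duplication in one species tells us nothing about duplication in the other, the expected number of families duplicated in both is the total number of families multiplied by the two proportions. A minimal sketch of that calculation follows, using hypothetical counts rather than the published values, with a Pearson chi-square on the resulting 2 × 2 table as one plausible form of the test (the caption reports χ2 values but not the table layout).

```python
def expected_both(n_total, n_dup_a, n_dup_b):
    """Expected number of families duplicated in both species under independence.

    Equivalent to n_total * (n_dup_a / n_total) * (n_dup_b / n_total).
    """
    return n_dup_a * n_dup_b / n_total

def chi_square_2x2(n_total, n_dup_a, n_dup_b, n_both):
    """Pearson chi-square for duplicated/not in species A by duplicated/not in species B."""
    observed = [
        [n_both, n_dup_a - n_both],
        [n_dup_b - n_both, n_total - n_dup_a - n_dup_b + n_both],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n_total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts (not taken from Figure 8)
n_total, n_dup_a, n_dup_b, n_both = 500, 100, 50, 30
exp_both = expected_both(n_total, n_dup_a, n_dup_b)
chi2 = chi_square_2x2(n_total, n_dup_a, n_dup_b, n_both)
```

With these illustrative counts, three times as many families are duplicated in both species as independence predicts, and the chi-square statistic is far above the 3.84 threshold for one degree of freedom at the 5% level.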
evolutionary genomics, both because it has caused a general lack of interest in the evolutionary importance of mechanisms for segmental duplication and because it has implied the acceptance of some seriously inappropriate models for major phenotypic change. Molecular biology has revealed a number of exciting possible mechanisms of segmental duplication that have been almost entirely ignored in the evolutionary literature. One is the role of transposable elements. The fact that duplicated segments in the genome of Arabidopsis are associated with transposable elements to a far greater degree than expected by chance suggests that transposon-mediated segmental duplication, rather than polyploidization, may have structured the genome of that species (Hughes et al., 2003b). There are a number of ways that transposable elements can potentially mediate segmental duplication in genomes. First, transposable elements can provide sites of homology for unequal crossing over (Fedoroff, 2000). Recombination between transposable elements on different chromosomes can lead to translocation of a large genomic segment from one chromosome to another (Bennetzen, 2000). If a chromosome that has received such a translocation ends up, as a result of independent assortment of chromosomes, in the same genome with the wild-type version of the
donor chromosome, the net effect will be a segmental duplication. The emphasis on polyploidization has led to an almost total disregard for the potential role of transposable elements as agents of segmental duplication both in Arabidopsis and in other species. Examination of the human genome has revealed that it differs from other sequenced mammalian genomes in possessing large blocks of interspersed duplications. These duplications have occurred fairly recently, within the primate lineage, and they have occurred in a complicated fashion involving duplications within duplications, as elucidated in recent years by Eichler and colleagues (e.g., Jiang et al., 2007). Moreover, extensive gene copy number polymorphisms have been found in the human population, resulting from segmental duplications that are so recent that they have not yet been fixed in the population (Fredman et al., 2004). Thus, in our own species we can see both the recent evolution of paralogs and the raw material for evolution of future paralogs, and it seems reasonable to suppose that similar processes of segmental duplication have occurred in other lineages in the history of life.
5.2 Models of Phenotypic Innovation
Stebbins (1971, p. 132) considered that “polyploidy has contributed little to progressive evolution.” In plants, polyploids have larger cells and slower development times, traits that are probably often deleterious but may under certain ecological circumstances confer advantages (Levin, 1983). In animals it has not been possible to associate known cases of polyploidy with any major phenotypic adaptation (Otto and Whitton, 2000). Despite this overall negative balance, numerous evolutionary biologists remain infected by an Ohno-inspired enthusiasm for the supposed innovative power of polyploidy. We believe that this view is seriously in error and harmful to progress in the understanding of genomic evolution.
Here we illustrate the futility of studying polyploid organisms as models of evolutionary innovation by comparing two possible models: (1) the frogs of the genus Xenopus and (2) the hominids. The frog genus Xenopus (family Pipidae) provides a well-studied example of an animal taxon in which polyploidization has been a frequent occurrence (Cannatella and de Sá, 1993). Xenopus laevis, a widely used laboratory animal, is known to be an ancient allotetraploid, which underwent an allopolyploidization event 30 to 40 million years ago (Bisbee et al., 1977). X. laevis has become re-diploidized, and many of the duplicate genes have been lost, perhaps as many as 50 to 75% (Hughes and Hughes, 1993; Hellsten et al., 2007). The genus Xenopus also includes octoploid and dodecaploid species (Bisbee et al., 1977; Cannatella and de Sá, 1993). Hellsten et al. (2007) compared duplicated genes in X. laevis with those of X. tropicalis, a related species that has not undergone polyploidization. They showed that in many cases, one of the two X. laevis genes has diverged from the X. tropicalis gene at nonsynonymous sites somewhat more than has the other; but most such change appears to be due to the stochastic nature of the mutational process. In only 28 of 578 gene pairs (4.8%) was there significant asymmetry in amino acid evolution between the two X. laevis duplicates in comparison with a random (binomial) model (Hellsten et al., 2007). Chain and Evans (2006) detected a similar level of asymmetric divergence (18 of 290 gene pairs, or 6.2%). Thus, there is evidence for the possible occurrence of subfunctionalization among some of those duplicate gene pairs in X. laevis that have not been lost. In the case of
the duplicated developmental gene hairy2, there is evidence that the two copies are expressed differentially, as predicted by the DDC model (Murato et al., 2007). On the other hand, the two duplicates of another developmental regulatory gene, foxi1 in X. laevis, are expressed in identical ways (Matsuo-Takasaki et al., 2005). Moreover, there is no evidence that subfunctionalization in X. laevis has given rise to any significant evolutionary novelty. X. laevis and X. tropicalis are morphologically, physiologically, and behaviorally very similar. Similarly, there are no pronounced phenotypic differences between these species and the octoploid and dodecaploid members of the genus Xenopus. Contrast this situation with that of the hominids. In the 5 to 7 million years since their last common ancestor with the chimpanzee, the hominids have undergone numerous morphological changes, including (to name but a few) reduction of the canines, the evolution of upright posture, and a massive increase in brain size. Polyploidization was not a factor in any of this, of course. But it is at least a plausible hypothesis that segmental duplication did play a major role, given the high frequency of recently retained segmental duplications in the human genome (Jiang et al., 2007). It is at least suggestive that of all the mammalian species whose genomes have been sequenced, the species that has undergone the most extensive recent adaptive change is also the one with the most recent segmental duplication. Thus, it seems obvious to us that if one is seeking a model organism for the origin of new phenotypes, our own species represents a much more appropriate model than does any recent polyploid (such as Xenopus).
For example, there was a period of rapid morphological change in the early history of vertebrates (Carroll, 1988); to understand what happened during that period, what has gone on in the past five to seven million years in hominids seems a much more reasonable starting place than is polyploidy in organisms that have changed little in over 30 million years. But evolutionary biologists will need to liberate themselves from Ohnoist mythology if they are to appreciate the striking model of phenotypic innovation that is literally right before their eyes, at least when they look in the mirror each morning.
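The asymmetry comparisons cited earlier in this section for X. laevis duplicates rest on a binomial null model: if both copies evolve at the same rate, each amino acid change that separates them from the X. tropicalis ortholog is equally likely to fall on either copy. A minimal sketch of a two-sided exact binomial test under that assumption follows; the substitution counts are hypothetical, and the published analyses may differ in detail.

```python
from math import comb

def asymmetry_p_value(changes_copy1, changes_copy2):
    """Two-sided exact binomial test of equal change on two duplicate copies (p = 0.5)."""
    n = changes_copy1 + changes_copy2
    if n == 0:
        return 1.0  # no changes observed, so no evidence of asymmetry
    k = max(changes_copy1, changes_copy2)
    # Probability of a split at least as uneven as the one observed, in either direction
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts of nonsynonymous changes on each X. laevis copy
p_even = asymmetry_p_value(12, 10)   # roughly balanced pair
p_skewed = asymmetry_p_value(25, 5)  # strongly asymmetric pair

# Proportions of significantly asymmetric pairs, from the counts given in the text
print(f"{28 / 578:.1%}, {18 / 290:.1%}")
```

Under this null model a roughly balanced pair is unremarkable, whereas a strongly skewed pair would be flagged as asymmetric; when hundreds of pairs are tested, a few percent are expected to pass a 5% threshold by chance alone.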
6 DUPLICATE GENES IN NETWORKS
In recent years there has been a great deal of interest in biological networks, including gene interaction networks, protein interaction networks, and metabolic networks (Kanehisa, 2000; Ravasz et al., 2002). It is generally acknowledged that gene duplication has played a major role in shaping biological networks as we see them today (Wagner, 2001), and is responsible for the distinctive properties of biological networks, such as their scale-free (Barabási and Albert, 1999) and modular (Ravasz et al., 2002) nature. However, these properties will arise only if gene duplication occurs in a differential fashion, whereby certain multiply connected “hub” genes in a network remain unduplicated while less connected “spoke” genes are duplicated (Hughes and Friedman, 2005c). The genome of brewer’s yeast provides some intriguing evidence that natural selection may act to eliminate duplication of multiply connected “hub” genes. In the analysis of a yeast genetic network (Tong et al., 2004), Hughes and Friedman (2005c) found that of 68 single-member yeast families with 25 or more network connections, 28 (44.4%) were located in duplicated segments of the genome (Seoighe and Wolfe, 1999). Each
of these 28 loci was thus presumably duplicated along with the genomic segment to which it belongs. However, the fact that each of these 28 families now contains a single member implies that after segmental duplication, one duplicate member of each family was deleted from the genome. In addition, there is evidence that network connections are remarkably labile over evolutionary time (Hughes and Friedman, 2005c). Immediately after gene duplication, it seems reasonable to suppose that gene duplicates will often have the same network connections. There might be exceptions if one duplicate is partial or if an exon-shuffling event or other recombinational event has accompanied gene duplication. But data from both yeast and the nematode worm Caenorhabditis elegans suggested that those pairs of duplicated genes that have been retained in these genomes generally have quite distinct sets of connections (Hughes and Friedman, 2005c). Moreover, there is evidence that such changes can happen soon after duplication, as indicated by examples in C. elegans of closely related genes that shared no common network connections (Hughes and Friedman, 2005c). Phylogenetic analyses of multigene families were used to examine the relationship between phylogenetic relatedness and sharing of network connections in a yeast genetic network (Hughes and Friedman, 2005c). Figure 9A shows the phylogenetic tree of MAP kinases included in this network; this was the family showing the greatest within-family contrasts of all those included in the network. The two genes in this family with the highest numbers of connections, YPL031C (with 62 connections) and YHR030C (with 60 connections), were not sisters (Figure 9A). There was strong (100%) bootstrap support for clustering of YHR030C with YLR113W, which had only 24 connections (Figure 9A). When sharing of connections among these genes was examined, YHR030C was found to share only a single connection with the closely related YLR113W (Figure 9B).
On the other hand, YHR030C shared 14 connections with YPL031C (Figure 9B). All other members of this family included in the yeast network shared at most a single connection (Figure 9B).
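Counting shared connections, as in Figure 9B, amounts to intersecting the sets of interaction partners of two genes in the network. The sketch below uses a small hypothetical edge list (the gene and partner names are placeholders, not data from Tong et al., 2004) to compute each family member's number of connections and the number shared by every pair of members.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical interaction edges: (family member, interaction partner)
edges = [
    ("kinA", "p1"), ("kinA", "p2"), ("kinA", "p3"), ("kinA", "p4"),
    ("kinB", "p2"), ("kinB", "p3"), ("kinB", "p4"), ("kinB", "p5"),
    ("kinC", "p1"), ("kinC", "p6"),
]

# Build the set of partners for each family member
neighbours = defaultdict(set)
for gene, partner in edges:
    neighbours[gene].add(partner)

# Number of connections per family member
degree = {gene: len(partners) for gene, partners in neighbours.items()}

# Number of connections shared by each pair of family members
shared = {
    (g1, g2): len(neighbours[g1] & neighbours[g2])
    for g1, g2 in combinations(sorted(neighbours), 2)
}
```

Comparing the `shared` counts with the branching order of the family tree is what reveals cases like the one described in the text, where close relatives share few connections while more distant relatives share many.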
7 CONCLUSIONS
The current era is one of exciting potential for increased understanding of evolutionary processes at the genomic level. Genomic sequences have provided a vast new source of information that is only beginning to be tapped. However, the exploitation of this information has been seriously hampered by misuse of the scientific method (storytelling instead of hypothesis testing) and by invalid statistical methods (particularly the codon-based tests for positive selection). Evolutionary biologists who seek a more accurate understanding of gene duplication and its aftermath will need to liberate themselves from outmoded ideas and approaches. The most important step toward an advanced understanding in this area, as in any field of science, is to apply methods that test critically among hypotheses. The mere fact that certain data are consistent with a given hypothesis provides no real support for that hypothesis unless and until one is able to rule out reasonable alternatives. Evolution should be understood as a fundamentally opportunistic process. Present-day genomes show the accumulated effects of many generations of mutation, drift, and natural selection. Because these processes are fundamentally unpredictable, few generalizations can be made regarding the evolutionary process. Rather than seeking
Figure 9 (A) Maximum parsimony tree of yeast MAP kinases included in the genetic interaction network. Numbers in parentheses after each gene name are numbers of network connections. Numbers on the branches show the percentage of 1000 bootstrap samples supporting the branch. (B) Network indicating numbers of network connections shared by yeast MAP kinases. Numbers in parentheses after each gene name are numbers of network connections. Genes shown: YLR113W (24), YHR030C (60), YKL126W (2), YDR477W (1), YPL031C (62), YKL139W (2), YPL042C (1). (From Hughes and Friedman, 2005c.)
general laws of genomic evolution, evolutionary biologists can contribute most to advancing our knowledge of genome organization by patient reconstruction of the evolutionary events that have structured individual gene families, genomic regions, and genomes. In doing so, bioinformaticists will need to work closely with experimentalists, because much remains to be learned regarding the multidimensional functional space occupied by each gene.
REFERENCES
Bajwa W, Torchia TE, Hopper JE. 1988. Yeast regulatory gene GAL3: carbon regulation; UASGal elements in common with GAL1, GAL2, GAL7, GAL10, GAL80, and MEL1; encoded protein strikingly similar to yeast and Escherichia coli galactokinases. Mol Cell Biol 8:3439–3447.
Barabási AL, Albert R. 1999. Emergence of scaling in random networks. Science 286:509–512.
Bennetzen JL. 2000. Transposable element contributions to plant gene and genome evolution. Plant Mol Biol 42:251–269.
Bhat PJ, Oh D, Hopper JE. 1990. Analysis of the GAL3 signal transduction pathway activating Gal4 protein-dependent transcription in Saccharomyces cerevisiae. Genetics 125:281–291.
Birnbaum K, Shasha DE, Wang JY, Jung JW, Lambert GM, Galbraith DW, Benfey PN. 2003. A gene expression map of the Arabidopsis root. Science 302:1956–1960.
Bisbee CA, Baker MA, Wilson AC, Irandokht H-A, Fischberg M. 1977. Albumin phylogeny for clawed frogs (Xenopus). Science 195:785–787.
Cannatella DC, de Sá RO. 1993. Xenopus laevis as a model organism. Syst Biol 42:476–507.
Carroll RL. 1988. Vertebrate Paleontology and Evolution. New York: W.H. Freeman.
Chain FJ, Evans BJ. 2006. Multiple mechanisms promote the retained expression of gene duplicates in the tetraploid frog Xenopus laevis. PLoS Genet 2(4):e56.
Cui L, Wall PK, Leebens-Mack JH, Lindsay BG, Soltis DE, Doyle JJ, Soltis PS, Carlson JE, Arumuganathan K, Barakat A, et al. 2006. Widespread genome duplications throughout the history of flowering plants. Genome Res 16:738–749.
Eisenberg SP, Brewer MT, Verderber E, Heimdal P, Brandhuber BJ, Thompson RC. 1991. Interleukin 1 receptor antagonist is a member of the interleukin 1 gene family: evolution of a cytokine control mechanism. Proc Natl Acad Sci USA 88:5232–5236.
Fedoroff N. 2000. Transposons and genome evolution in plants. Proc Natl Acad Sci USA 97:7002–7007.
Flajnik MF, Kasahara M. 2001. Comparative genomics of the MHC: glimpses into the evolution of the adaptive immune system. Immunity 15:351–362.
Fraschini R, Bilotta D, Lucchini G, Piatti S. 2004. Functional characterization of Dma1 and Dma2, the budding yeast homologues of Schizosaccharomyces pombe Dma1 and human Chfr. Mol Biol Cell 15:3796–3810.
Fredman D, White SJ, Potter S, Eichler EE, Den Dunnen JT, Brookes AJ. 2004. Complex SNP-related sequence variation in segmental genome duplications. Nat Genet 36:861–866.
Friedman R, Hughes AL. 2001. Pattern and timing of gene duplication in animal genomes. Genome Res 11:1842–1847.
Friedman R, Hughes AL. 2003. The temporal distribution of gene duplication events in a set of highly conserved human gene families. Mol Biol Evol 20:154–161.
Goodman M, Moore GW, Matsuda G. 1975. Darwinian evolution in the genealogy of haemoglobin. Nature 253:603–608.
Gould SJ, Lewontin RC. 1979. The spandrels of San Marco and the Panglossian paradigm: a critique of the adaptationist programme. Proc R Soc Lond B 205:581–598.
Hellsten U, Khokha MK, Grammer TC, Harland RM, Richardson P, Rokhsar DS. 2007. Accelerated gene evolution and subfunctionalization in the pseudotetraploid frog Xenopus laevis. BMC Biol 5:31.
Hittinger CT, Carroll SB. 2007. Gene duplication and the adaptive evolution of a classic genetic switch. Nature 449:677–681.
Hoekstra HE, Coyne JA. 2007. The locus of evolution: evo devo and the genetics of adaptation. Evolution 61:995–1016.
Hughes AL. 1994a. The evolution of functionally novel proteins after gene duplication. Proc R Soc Lond B 256:119–124.
Hughes AL. 1994b. Evolution of the interleukin-1 gene family in mammals. J Mol Evol 39:6–12.
Hughes AL. 1998. Phylogenetic tests of the hypothesis of block duplication of homologous genes on human chromosomes 6, 9, and 1. Mol Biol Evol 15:854–870.
Hughes AL. 1999a. Adaptive Evolution of Genes and Genomes. New York: Oxford University Press.
Hughes AL. 1999b. Phylogenies of developmentally important proteins do not support the hypothesis of two rounds of genome duplication early in vertebrate history. J Mol Evol 48:565–576.
Hughes AL. 2000. Polyploidization and vertebrate origins: a review of the evidence. In Sankoff D, Nadeau JH (eds.), Comparative Genomics. Dordrecht, The Netherlands: Kluwer, pp. 493–502.
Hughes AL. 2005. Gene duplication and the origin of novel proteins. Proc Natl Acad Sci USA 102:8791–8792.
Hughes AL. 2007. Looking for Darwin in all the wrong places: the misguided quest for positive selection at the nucleotide sequence level. Heredity 99:364–373.
Hughes AL, Friedman R. 2003a. 2R or not 2R: testing hypotheses of genome duplication in early vertebrates. J Struct Funct Genom 3:85–93.
Hughes AL, Friedman R. 2003b. Parallel evolution by gene duplication in the genomes of two unicellular fungi. Genome Res 13:794–799.
Hughes AL, Friedman R. 2004. Pattern of divergence of amino acid sequences encoded by paralogous genes in human and pufferfish. Mol Phylogenet Evol 32:337–343.
Hughes AL, Friedman R. 2005a. Variation in the pattern of synonymous and nonsynonymous difference between two fungal genomes. Mol Biol Evol 22:1320–1324.
Hughes AL, Friedman R. 2005b. Expression patterns of duplicate genes in the developing root in Arabidopsis thaliana. J Mol Evol 60:247–256.
Hughes AL, Friedman R. 2005c. Gene duplication and the properties of biological networks. J Mol Evol 61:758–764.
Hughes AL, Friedman R. 2007. Sharing of transcription factors after gene duplication in the yeast Saccharomyces cerevisiae. Genetica 129:301–308.
Hughes AL, Nei M. 1988. Pattern of nucleotide substitution at MHC class I loci reveals overdominant selection. Nature 335:167–170.
Hughes AL, Piontkivska H. 2005. DNA repeat arrays in chicken and human genomes and the adaptive evolution of avian genome size. BMC Evol Biol 5:12.
Hughes AL, Green JA, Garbayo JA, Roberts RM. 2000. Adaptive diversification within a large family of recently duplicated, placentally expressed genes. Proc Natl Acad Sci USA 97:3319–3323.
Hughes AL, da Silva J, Friedman R. 2001. Ancient genome duplications did not structure the human Hox-bearing chromosomes. Genome Res 11:771–780.
Hughes AL, Green JA, Piontkivska H, Roberts RM. 2003a. Aspartic proteinase phylogeny and the origin of pregnancy-associated glycoproteins. Mol Biol Evol 20:1940–1945.
Hughes AL, Friedman R, Ekollu V, Rose JR. 2003b. Non-random association of transposable elements with duplicated genomic blocks in Arabidopsis thaliana. Mol Phylogenet Evol 29:410–416.
Hughes AL, Ekollu V, Friedman R, Rose JR. 2005. Gene family content-based phylogeny of prokaryotes: the effect of search criteria. Syst Biol 54:268–276.
Hughes AL, Friedman R, Glenn NL. 2006. The future of data analysis in evolutionary genomics. Curr Genom 7:227–234.
Hughes MK, Hughes AL. 1993. Evolution of duplicate genes in a tetraploid animal, Xenopus laevis. Mol Biol Evol 10:1360–1369.
Jensen RA. 1976. Enzyme recruitment in the evolution of new function. Annu Rev Microbiol 30:409–425.
Jensen RA, Byng GS. 1981. The partitioning of biochemical pathways with isozyme systems. Isozymes 5:143–174. Jiang Z, Tang H, Ventura M, Cardone MF, Marques-Bonet T, She X, Pevzner PA, Eichler EE. 2007. Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat Genet 39:1361–1368. Johnston M. 1987. A model fungal gene regulatory system: the GAL genes of Saccharomyces cerevisiae. Microbiol Rev 51:458–476. Kanehisa M. 2000. Post-genome Informatics. Oxford, UK: Oxford University Press. Katju V, Lynch M. 2006. On the formation of novel genes by duplication in the Caenorhabditis elegans genome. Mol Biol Evol 23:1056–1063. Kimura M, Ohta T. 1974. On some principles governing molecular evolution. Proc Natl Acad Sci USA 71:2848–2852. Levin DA. 1983. Polyploidy and novelty in flowering plants. Am Nat 122:1–25. Li W-H. 1983. Evolution of duplicate genes and pseudogenes. In Nei M, Koehn RK (eds.), Evolution of Genes and Proteins. Sunderland, MA: Sinauer Associates, pp. 14–37. Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155. Lynch M, Conery JS. 2003. The origins of genome complexity. Science 302:1401–1404. Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473. Martin N, Ruedi EA, LeDuc R, Sun F-J, Caetano-Anollés G. 2007. Gene-interleaving patterns of synteny in the Saccharomyces cerevisiae genome: Are they proof of an ancient genome duplication event? Biol Direct 2:3. Matsuo-Takasaki M, Matsumura M, Sasai Y. 2005. An essential role of Xenopus Foxi1a for ventral specification of the cephalic ectoderm during gastrulation. Development 132:3885–3894. McLysaght A, Hokamp K, Wolfe KH. 2002. Extensive genomic duplication during early chordate evolution. Nat Genet 31:200–204. Meyer J, Walker-Jonah A, Hollenberg CP. 1991.
Galactokinase encoded by GAL1 is a bifunctional protein required for induction of the GAL genes in Kluyveromyces lactis and is able to suppress the gal3 phenotype in Saccharomyces cerevisiae. Mol Cell Biol 11:5454–5461. Meyer VE, Young RA. 1998. RNA polymerase II holoenzymes and subcomplexes. J Biol Chem 273:27757–27760. Murato Y, Nagatomo K, Yamaguti M, Hashimoto C. 2007. Two alloalleles of Xenopus laevis hairy2 gene: evolution of duplicated gene function from a developmental perspective. Dev Genes Evol 217:665–673. Nei M. 1969. Gene duplication and nucleotide substitution in evolution. Nature 211:40–42. Noiva R, Lennarz WJ. 1992. Protein disulfide isomerase: a multifunctional protein resident in the lumen of the endoplasmic reticulum. J Biol Chem 267:3553–3556. Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag. Ohno S. 1973. Ancient linkage groups and frozen accidents. Nature 244:259–262. Orgel LE. 1977. Gene-duplication and the origin of proteins with novel functions. J Theor Biol 67:773. Otto SP, Whitton J. 2000. Polyploid incidence and evolution. Annu Rev Ecol Syst 34:401–407. Piatigorsky, J. 2007. Gene Sharing and Evolution: the Diversity of Protein Functions. Cambridge, MA: Harvard University Press. Piatigorsky J, Wistow G. 1991. The recruitment of crystallins: new functions precede gene duplication. Science 252:1078–1079.
Platt A, Ross HC, Hankin S, Reece RJ. 2000. The insertion of two amino acids into a transcriptional inducer converts it into a galactokinase. Proc Natl Acad Sci USA 97:3154–3159. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabási AL. 2002. Hierarchical organization of modularity in metabolic networks. Science 297:1551–1555. Seoighe C, Wolfe KH. 1999. Updated map of duplicated regions in the yeast genome. Gene 238:253–261. Shia W-J, Osada S, Florens L, Swanson SK, Washburn MP, Workman JL. 2005. Characterization of the yeast trimeric–SAS acetyltransferase complex. J Biol Chem 280:11987–11994. Sipiczki M. 2000. Where does fission yeast sit on the tree of life? Genome Biol 1(2):1011. Stebbins GL. 1971. Chromosomal Evolution in Higher Plants. London: Edward Arnold. Sutton A, Shia W-J, Band D, Kaufman PD, Osada S, Workman JL, Sternglanz R. 2003. Sas4 and Sas5 are required for the histone acetyltransferase activity of Sas2 in the SAS complex. J Biol Chem 278:16887–16892. Tachibana C, Stevens TH. 1992. The yeast EUG1 gene encodes an endoplasmic reticulum protein that is functionally related to protein disulfide isomerase. Mol Cell Biol 12:4601–4611. Tanaka T, Nei M. 1989. Positive Darwinian selection observed at the variable-region genes of immunoglobulin. Mol Biol Evol 6:447–459. Tong AHY, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, et al. 2004. Global mapping of the yeast genetic interaction network. Science 303:808–813. Wagner A. 2001. The yeast protein interaction network evolves rapidly and contains few redundant duplicated genes. Mol Biol Evol 18:1283–1292. Wolfe KH, Shields DC. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708–713. Yano K-I, Fukasawa T. 1997. Galactose-dependent reversible interaction of Gal3p with Gal80p in the induction pathway of Gal4p-activated genes of Saccharomyces cerevisiae. Proc Natl Acad Sci USA 94:1721–1726. Yeager M, Hughes AL. 1999.
Evolution of the mammalian MHC: natural selection, recombination, and convergent evolution. Immunol Rev 167:45–58.
6
Evolution After and Before Gene Duplication? TOBIAS SIKOSEK and ERICH BORNBERG-BAUER Evolutionary Bioinformatics Group, Institute for Evolution and Biodiversity, University of Muenster, Muenster, Germany
1 INTRODUCTION
1.1 Where Do Proteins Come From?
Life, as we know it today, depends on protein function. Proteins in large part constitute the phenotype that evolution acts upon through genetic mutations and natural selection. To understand how evolution works, it is fundamental to know how proteins evolve. In a constantly changing environment and under constant competition between individuals, proteins with new functions often determine how successfully an organism can reproduce. It is now generally accepted that new proteins evolve from existing ones, either through small-scale mutations (i.e., single-nucleotide substitutions that change the encoded amino acid) or, more fundamentally, larger-scale mutations (such as domain rearrangements and gene duplications). A protein domain can be considered a structurally as well as functionally independent unit or building block of a protein (unless domains catalyze codependent steps in a multistep reaction and have therefore become fused into one protein). The same domain can be found in various arrangements with other domains in different proteins. Therefore, the term protein is used below interchangeably with the term single domain (unless stated otherwise). The emergence of new proteins and, in particular, new functional protein domains is still an unsolved problem [for a review, see Vogel et al. (2004)]. Thanks to the wide availability of genomic data, it has become possible to reconstruct in considerable detail how proteins have rearranged their domains. The bottom line is that ancestral proteins were probably mostly single-domain proteins which, at a later stage, became fused in different combinations (Björklund et al., 2005). This fusion has been iterated to give rise to more complex arrangements (architectures). The second major driving force is the loss of domains, in particular at the C-termini, due to nonsense mutations (Weiner et al., 2006).
Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles. Copyright © 2010 Wiley-Blackwell.
If one tries to trace back evolution even further, one has to ask how protein domains themselves came into existence. Again, the mechanism of fusion and fission of smaller
fragments (polypeptides) is a possibility that has been investigated (Riechmann and Winter, 2000, 2006; Lupas et al., 2001; Söding and Lupas, 2003; Alva et al., 2007). Those polypeptides would have been too short to fold natively on their own, but possibly were useful as cofactors to ribozymes (Söding and Lupas, 2003). Upon fusion, however, those peptides may have led to stable folds, perhaps due primarily to their repetitiveness (Söding and Lupas, 2003). It is still not clear whether or not some domains still evolve from existing ones by fusion and fission of small peptide fragments, but it seems possible that this mechanism would facilitate the evolution of new protein functions. This is because single amino acid substitutions usually change a protein rather gradually and slowly, whereas larger-scale changes can make “jumps” in sequence space to reach more “distant” phenotypes. Since such larger-scale mutations (also called insertions/deletions, or simply indels) occur primarily during recombination events, this mechanism could provide an answer to the unsolved question of why sexual reproduction (which frequently leads to recombination) has been so “successful” during evolution, despite its costs (Barton and Charlesworth, 1998; Kouyos et al., 2007). Another source for insertions and deletions is the slippage of the DNA strand or of DNA polymerase during replication (Garcia-Diaz and Kunkel, 2006). There is evidence that the existing protein families did not evolve from one common ancestor, but that the multiple-birth model holds instead. This model states that proteins evolved by independent recombination of subdomain fragments or supersecondary structural elements (Choi and Kim, 2006). All known protein structures can be assigned to one of only four major classes [all-α, all-β, α/β, and α + β, as defined by SCOP (Murzin et al., 1995)].
Each of these classes is defined by the supersecondary structural elements (ββ-hairpin, αα-hairpin, and βαβ-element) it contains (Söding and Lupas, 2003). These relatively short elements occur in varying numbers of repeats, possibly due to multiple recombination events, and might therefore have been the building blocks of ancient proteins.
1.2 Effects of Point Mutations on the Emergence of New Protein Function
The problem with any kind of mutation is that in most cases genes are under negative or purifying selection, because their activity is needed by the cell. This puts a considerable constraint on evolution, because it requires proteins to remain unchanged (i.e., conserved). At the same time, organisms need to adapt to new environmental conditions by modifying proteins or generating new ones via mutations. They can do so with astounding speed: for example, when pathogens develop antibiotic resistance. Wagner (2008) discusses this apparent paradox. The extent to which point mutations are neutral, advantageous, or detrimental is still a matter of debate. A mutation in the active site of an enzyme, for example, will almost certainly affect its function. It is more difficult to infer the fitness effects of mutations at sites that are not directly involved in function. Most mutations alter the stability with which a protein sequence folds into its three-dimensional conformation, especially if a hydrophobic residue in the protein core is replaced by a polar residue. But there are at least two other properties of protein evolution that balance the detrimental effects of a mutation: (1) compensatory mutations (DePristo et al., 2005) and (2) the general robustness of protein structures to mutations (Wagner, 2008). Compensatory mutations can correct the destabilizing effects that a mutation has on protein structure.
This means that the fitness effect that a mutation has is always relative to its genetic background (i.e., to amino acids at other sites in the protein sequence). Mutational robustness (i.e., the ability to maintain a functioning phenotype under genetic mutations) is a property observed in proteins (Chan and Bornberg-Bauer, 2002; Xia and Levitt, 2004; Tokuriki et al., 2007) [as well as in RNAs (Wagner and Stadler, 1999)] that in itself has been proposed to be subjected to adaptive evolution because it reduces the overall proportion of detrimental mutations in favor of neutral mutations. These and other aspects [such as selection for increased translational efficiency (Drummond et al., 2005)] have to be considered when estimating mutational effects on protein fitness. However, the more conserved a protein (i.e., the more important its correct, uncompromised functioning is to the organism), the higher the chance of a mutation being detrimental. On the other hand, some proteins are more dispensable than others, and therefore their malfunctioning might be more tolerable (Pál et al., 2006). Therefore, fitness effects of mutations vary not only along the nucleotide sequence of a protein-coding gene [expressed as its fitness density (Drummond et al., 2005; Pál et al., 2006)] but also between different genes [referred to as the dispensability of the encoded proteins (Pál et al., 2006)]. Still, it seems reasonable to assume that many, if not most, point mutations in a gene will have a negative effect on the fitness of the protein.
1.3 Emergence of New Proteins via Gene Duplication
The predominant view is that a gene duplication is required for biological innovation because it provides the opportunity for evolution to “try out” alternative protein designs without sacrificing an existing design. There are different mechanisms that may produce gene duplicates.
Small-scale duplications (SSDs) of one or a few genes can happen, for example, by unequal crossing-over or retrotransposition. Unequal crossing-over occurs when two homologous sequences on different chromosomes misalign during recombination. Retrotransposition is carried out by mobile genetic elements that copy themselves into other regions of the genome, using RNA intermediates and reverse transcriptase. Occasionally, adjacent genes are copied with those mobile elements. Such SSDs have been estimated to occur quite frequently: about 0.01 per gene per million years (Lynch and Conery, 2000). Whole-genome duplications (WGDs) are another possible mechanism, although much rarer. WGDs are the result of mitotic cell divisions that only duplicate the genome but fail to separate the cytoplasm as well, or alternatively, WGDs can be caused by two rounds of replication without a cell division in between. In diploid organisms this results in a tetraploid cell. When this cell enters meiosis it produces gametes that are diploid, and a zygote produced by the fusion of two such gametes will be tetraploid. Polyploidy is very common in plants but also occurs in animals, mostly in fish and amphibians. In mammals, polyploid zygotes are not viable, probably because polyploidy interferes with the chromosome-based sex determination mechanisms. After a WGD, the organism usually returns to its original ploidy level. For example, a diploid organism that has become tetraploid will return to its diploid state (diploidization) by, for example, losing excess chromosomes or by fusing them. This has been shown for the plant Arabidopsis thaliana, for which evidence of several WGDs could be found (Bowers et al., 2003). The result of the WGD event is a number of paralogs
that have been retained for various reasons, as explained below. WGDs are thought to have played key roles in large evolutionary transitions which increased organismal complexity and led to major innovations in the larger branches of the tree of life. Above all, a WGD provides a high degree of redundancy in the genome, since every gene then exists twice. This also has consequences for the interactions between proteins: for example, in gene regulatory networks, where the duplicated version of a network can assume entirely different regulatory interactions. Potentially, this enabled the evolution of more complex developmental processes. Duplications of genomic regions can also occur in a way that cuts off parts of a gene (Figure 1A). The duplicate might then miss parts or all of its regulatory and promoter regions, which means that it is regulated differently or cannot be expressed at all. Alternatively, the duplicate might miss some part of its C-terminus: for example, when a retrotransposon lies within an intron. Gene duplications, at whatever scale, occur in single individuals. A population-wide fixation has to follow for evolution to take advantage of the innovative potential of the duplicates. (The fixation of a WGD might actually result in a new species, as reproductive incompatibilities are often found between individuals of different ploidy levels.) This can occur either randomly by drift or because the duplicates provide an immediate positive fitness effect. It is also possible, and may even be likely, that a duplication event has a negative effect on fitness. This could occur, for example, by bringing an imbalance into a metabolic equilibrium by doubling the amount of the gene’s product or, in the case of a regulatory protein, by overregulating a certain cellular process (Papp et al., 2003). These are called dosage effects (Figure 1B).
It is arguable whether or not a WGD escapes the negativity of dosage effects by doubling the amount of all gene products at the same time, thus reducing the probability of imbalances. However, since traces of gene duplications can be found abundantly in all organisms, it can be assumed that a sufficient proportion of gene duplicates reaches fixation (Force et al., 1999; Blomme et al., 2006; Brunet et al., 2006).
1.4 What Happens After Gene Duplication?
For the period after fixation, several possible fates of gene duplicates have been proposed, and they are summarized in Figure 1C. Most duplicates will be lost again (Lynch and Conery, 2000). Since they are redundant and therefore can accumulate mutations without any effects on fitness, mutations in the promoter region will eventually prevent transcription of the gene. It then becomes a pseudogene, a gene that is still homologous to its former paralogs and orthologs but that is no longer transcribed. Pseudogenes eventually lose all resemblance to their homologs because they accumulate mutations at a relatively high rate. Finally, the pseudogenization process is completed when the duplicated gene becomes fixed in the population. For retention, a gene duplicate will have to acquire properties that render it beneficial to the organism. There are several alternative, nonexclusive models for how this could happen [for an excellent review on this, see a recent article by Conant and Wolfe (2008)]. The classical view, as presented by Ohno (1970), states that one of the gene copies remains as it was before the duplication, while the other copy develops a new function by randomly “exploring” protein sequence space (which contains all possible protein sequences) through neutral drift until it “finds” a new function that provides a fitness advantage. This hypothesis, referred to as neofunctionalization, is
Figure 1 Possible outcomes of gene duplication. (A) A gene duplication can copy genes either in their entirety (complete duplication) or as a fragment of the original gene (incomplete duplication). An incomplete duplication can result either in a nonfunctional copy or in a copy with modified function and/or expression. (B) Dosage effects can play a role for some duplicated genes if the level of expressed protein is under selection. The second copy of the gene might therefore get lost eventually. Alternatively, mutations in the protein-coding region or in the regulatory region of a gene can compensate for increased protein levels caused by the duplication. The increase in dosage can also be advantageous or simply irrelevant for fitness. (C) In case of a complete duplication, both copies initially have redundant functions and expression patterns. According to the currently predominant models, pseudogenization, neofunctionalization, and subfunctionalization are the most common fates of duplicate genes. (Based on Hughes, 2007.)
illustrated in Figure 2. The new function of the duplicate can result from an actual change in the amino acid sequence of the protein, but it might also result from a mutation in the regulatory region of the gene (thus altering expression level, time, and/or location) or potentially even from a mutation that alters the splicing pattern of the gene. Since the formulation of the neofunctionalization model, examples have been found where the percentage of retention after a WGD is quite high (Ahn and Tanksley, 1993; Hughes and Hughes, 1993; Van de Peer et al., 2003; Blomme et al., 2006): for example, 15% in teleost fish (Brunet et al., 2006). This suggests that neofunctionalization cannot be the only explanation for retention, because it would take too long to acquire so many new functions via beneficial mutations, which are generally assumed to be very rare. Another model is subfunctionalization [also known as the duplication–degeneration–complementation (DDC) model (Force et al., 1999)], where a protein with two functions hands down one function to each of its duplicates by degrading complementary regions of the genes via deleterious mutations (Force et al., 1999; Lynch and Force, 2000). This model does not require the assumption of beneficial mutations. Instead, retention occurs only because the duplicates have become nonredundant in terms of function. Subfunctionalization can occur either in the regulatory region of a gene (thus altering expression patterns) or in the protein-coding region. Subfunctionalization of the coding region in a multidomain protein happens when the two duplicates accumulate changes in different domains. In one duplicate one of
Figure 2 Gene duplication event followed by neofunctionalization within a gene tree. The ancestral gene XYZ diverges via two speciation events into the orthologs X1, Y1, and Z1 (X, Y, and Z representing different species). As the gene is under purifying selection, it accumulates only neutral mutations (horizontal axis). Additionally, a duplication of the gene occurs in one of the branches after the first speciation. One copy evolves toward a new beneficial function via adaptive mutations (vertical axis), which then becomes conserved (also during the second speciation). The result is two pairs of paralogs: (X1, X2) and (Y1, Y2).
the domains keeps its original function while another domain changes, and vice versa in the other duplicate [see, e.g., Chain and Evans (2006) and Figure 1C]. The question of how the conversion from one domain into another might occur is an important one and is addressed later. Domains are likely to play an important role in the diversification of protein functions by subfunctionalization and neofunctionalization. Since domains are functionally distinct from each other, sub- and neofunctionalization can occur only for a subset of domains of the same protein while leaving the other domains unchanged. An example of this can be found in the bHLH (basic helix–loop–helix) transcription factor family in humans, where two domains are the same for all family members (the basic DNA-binding domain, as well as the helix–loop–helix dimerization domain), whereas a third domain (which is also involved in dimerization) varies among groups within the family (Amoutzias et al., 2004). These groups coincide with subnetworks of the bHLH regulatory network, because dimerization can occur only between monomers with the same dimerization domains. A very similar example can be found in regulatory networks relying on the widely spread MADS domain (Veron et al., 2007). The evolution of regulatory networks might therefore, at least in some cases, rely on subfunctionalization events involving the rearrangement and modification of domains (Amoutzias et al., 2004; Veron et al., 2007). Another possible mechanism for duplicate retention, especially after WGDs, is dosage balance. As described above, dosage effects might be less severe if all interaction partners are duplicated, as in WGDs. Those genes that are especially prone to dosage effects might therefore be dependent on the retention of their duplicates in order to maintain their newly found balance after duplication. 
Dosage effects might in some cases even be of adaptive value in that a gene that is already transcribed at its maximum rate can further increase its transcription rate by having additional copies of itself in the genome. However, dosage-related effects are probably only relevant for some genes, so that subfunctionalization remains the most likely candidate among the models explaining gene retention after duplication. A related mechanism of potential importance is the retention of gene duplicates due to “activity-reducing mutations” (Scannell and Wolfe, 2008). Dosage effects might be reduced relatively quickly by accumulation of slightly deleterious mutations (Figure 1B). Since most mutations are thought to reduce stability (possibly accompanied by a decrease in catalytic efficiency), the activity of both copies might be reduced to the original (preduplication) level, which will improve the chances of duplicate retention. It is obvious that subfunctionalization by itself (i.e., the redistribution of existing functions) is not a satisfying explanation for innovation of protein function, but only for duplicate retention. More promising, therefore, is a combination of neo- and subfunctionalization [subneofunctionalization (He and Zhang, 2005; Rastogi and Liberles, 2005)], according to which subfunctionalization occurs rapidly after gene duplication and leads to retention of the duplicate, followed by neofunctionalization. At least for SSDs in mammals, neo- or subneofunctionalization seems to be the dominant fate of gene duplicates (Hughes and Liberles, 2007). Neither neofunctionalization nor subfunctionalization can yet account for the high probability of pseudogenization (i.e., the loss of function of a gene relieved from selection and therefore exposed to the accumulation of deleterious mutations). If adaptive evolution can only occur after a gene has already been duplicated, there might not be
enough time for drift to produce a mutation that eventually provides a fitness advantage before pseudogenization occurs. Directly after gene duplication the rate of pseudogenization is high but also decreases rapidly with time (Hughes and Liberles, 2007), so that a quick mechanism of subfunctionalization is required to retain a sufficient number of gene duplicates that can then undergo further neofunctionalization. This problem—the retention of the duplicate at frequencies and over time periods sufficient for the accumulation of adaptive mutations—has recently been termed Ohno’s dilemma, and the proposed solution is similar to the one outlined here (Bergthorsson et al., 2007). In this chapter we present experimental and theoretical evidence that might explain how evolution toward different protein functions can start before the corresponding gene undergoes duplication and how this “preduplication” evolution might therefore facilitate the following sub- or neofunctionalization process. A key mechanism might be the exploitation of the promiscuous (or latent) protein functions that many proteins seem to have and that are free to change and adapt. For several decades the concept of single-domain proteins performing more than one function has been known as gene sharing, and the possibility of such proteins preceding and facilitating evolution after gene duplication has been proposed (Piatigorsky and Wistow, 1989; Hughes, 1994) and is now beginning to gain some experimental support (McLoughlin and Copley, 2008). This model has, however, received little attention compared with the models of evolution after gene duplication, probably because only a few very specific examples were known. Only recently has this concept reappeared under the name escape from adaptive conflict (Hittinger and Carroll, 2007; Conant and Wolfe, 2008; DesMarais and Rausher, 2008), proposing that it might be a more common evolutionary mechanism than previously thought. 
This adaptive conflict arises when a certain gene is under selection pressure to both conserve an old function and acquire a new one at the same time. This conflict might lead to the phenomenon of gene sharing via promiscuous protein functions associated with a reduced efficiency for each function. The escape from this conflict can only occur after gene duplication, when each copy specializes in only one of the ancestral functions. The shift between protein functions is likely to be associated with changes in protein-folding stabilities, and we therefore begin with a discussion of protein stability and proceed with the role that stability plays in the evolution to new functional phenotypes. We then discuss population genetic aspects as well as the possible role of phenotypic mutations in adaptation. Finally, we compare the evolution of ribozymes with the evolution of proteins to find potential similarities.
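The logic of the subfunctionalization (DDC) model described in Section 1.4 can be made concrete with a small Monte Carlo sketch in Python. This is a toy illustration only, not the model of Force et al. (1999): the function names, the relative mutation rates `u_reg` and `u_null`, and the rule that mutations destroying the last intact copy of a subfunction are simply never offered (standing in for purifying selection) are all assumptions made here for illustration.

```python
import random

def simulate_ddc(n_subfunctions=2, u_reg=1.0, u_null=1.0, rng=random):
    """Follow one pair of duplicates until its fate is resolved.

    Each copy starts with every ancestral subfunction intact. Degenerative
    mutations either knock out one subfunction in one copy (relative rate
    u_reg) or inactivate a whole copy (relative rate u_null). Mutations that
    would leave a subfunction without any intact copy are assumed to be
    removed by purifying selection and are therefore never offered.
    """
    everything = frozenset(range(n_subfunctions))
    copies = [set(everything), set(everything)]
    while True:
        a, b = copies
        if not a or not b:
            return "nonfunctionalization"   # one copy has lost all function
        if not (a & b):
            return "subfunctionalization"   # copies have become complementary
        events, weights = [], []
        for idx, (own, other) in enumerate(((a, b), (b, a))):
            if other == everything:         # the other copy covers every role
                events.append((idx, None))
                weights.append(u_null)
            for sub in own & other:         # still redundant, so loss is tolerated
                events.append((idx, sub))
                weights.append(u_reg)
        idx, sub = rng.choices(events, weights=weights)[0]
        if sub is None:
            copies[idx].clear()             # null mutation silences the copy
        else:
            copies[idx].discard(sub)        # loss of a single subfunction


def fate_frequencies(trials=10_000, seed=1, **kwargs):
    """Estimate how often each fate occurs over many simulated duplicate pairs."""
    rng = random.Random(seed)
    counts = {"nonfunctionalization": 0, "subfunctionalization": 0}
    for _ in range(trials):
        counts[simulate_ddc(rng=rng, **kwargs)] += 1
    return {fate: n / trials for fate, n in counts.items()}


if __name__ == "__main__":
    print(fate_frequencies(n_subfunctions=2, u_reg=1.0, u_null=1.0))
```

In this sketch a gene with a single subfunction can never be subfunctionalized, whereas increasing `n_subfunctions` or lowering `u_null` relative to `u_reg` shifts outcomes toward complementary retention, without any beneficial mutation being involved.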
2
STABILITY OF PROTEIN STRUCTURES
Each protein folds into the conformation with the lowest free energy of all possible conformations, which is therefore the thermodynamically most stable one. This is called the native conformation, and it is traditionally associated with protein function. However, there is now compelling evidence from multiple studies that the structural dynamics of a protein are essential for its function (Boehr et al., 2006; Vendruscolo and Dobson, 2006; Henzler-Wildman et al., 2007). Although proteins are usually thought of as having only one native conformation, it is possible for a protein sequence to have more than one native conformation (i.e., several conformations that share the same lowest free energy).
A remarkable property of protein structures is their robustness toward single amino acid changes (Ferrada and Wagner, 2008; Wagner, 2008). Many amino acids in a protein sequence can be substituted by other amino acids without compromising the overall structure or function of the protein. This means that in protein sequence space (which contains all possible protein sequences) a sequence has many neutral neighbors, one-error mutants that all fold into the same conformation. The theoretical construct representing all the “mutationally connected” protein sequences folding into the same conformation is called a neutral network, a concept first used for RNA structures (Schuster et al., 1994) and then for proteins (Bornberg-Bauer, 1997). Neutral networks are often drawn as graphs in two dimensions (Figure 3); this is a strong simplification, however, as sequence space is multidimensional (one dimension for each amino acid in the peptide chain). Therefore, neutral networks, and the distances between them, usually cannot be drawn to the correct scale. According to Bornberg-Bauer and Chan (1999), every neutral network is associated with one conformation and organized around a prototype sequence. The prototype sequence of a neutral network is, in theory, the sequence with the highest thermodynamic stability for the associated conformation and might also coincide with the consensus sequence of a protein family (Figure 3). The prototype sequence usually lies at the center of its network, and thermodynamic stability is supposed to decrease smoothly when moving toward the edge of the network (Figure 4). This funnel-like distribution of thermodynamic stability in neutral networks of proteins, predicted by
[Figure 3 image; labels: Protein family, Protein 1, Protein 2, Protein 3; axes: Sequence space, Free energy (Unstable to Stable)]
Figure 3 Members of a protein family within the same neutral network. The nodes in the neutral network are protein sequences connected by single amino acid substitutions (edges). The plane represents sequence space, the vertical axis is the free energy of protein folding. The members of this hypothetical protein family recently evolved from the same ancestral protein via gene duplication and subsequent neutral point mutations and all fold into the same native structure. Therefore, they all inhabit the same neutral network. The prototype sequence (node surrounded by a circle) of this network has the highest number of neutral neighbors and might be equivalent to a consensus sequence derived from all sequences belonging to this protein family.
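The notions of neutral neighbors and neutral networks can be illustrated with a minimal lattice-model sketch in Python. It assumes a two-letter (H/P) alphabet on a two-dimensional square lattice, an energy of -1 per non-bonded H-H contact, and, as a simplification introduced here, it treats conformations with identical contact maps as the same structure. All function names are hypothetical, the chain lengths are far shorter than those used in published lattice studies, and the candidate prototype is chosen by its number of neutral neighbors, as in the caption of Figure 3.

```python
from itertools import product

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))

def conformations(n):
    """Yield all self-avoiding walks of n residues (n >= 2) on the square lattice.
    The first bond is fixed along +x, which removes rotational copies."""
    def grow(path):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in path:
                path.append(nxt)
                yield from grow(path)
                path.pop()
    yield from grow([(0, 0), (1, 0)])

def contact_map(path):
    """Pairs (i, j) of residues that are lattice neighbors but not bonded."""
    pos = {p: i for i, p in enumerate(path)}
    contacts = set()
    for i, (x, y) in enumerate(path):
        for dx, dy in MOVES:
            j = pos.get((x + dx, y + dy))
            if j is not None and j > i + 1:
                contacts.add((i, j))
    return frozenset(contacts)

def energy(seq, cmap):
    """Assumed energy function: -1 for every H-H contact."""
    return -sum(1 for i, j in cmap if seq[i] == "H" and seq[j] == "H")

def native_structure(seq, maps):
    """Contact map of the unique lowest-energy structure, or None if tied."""
    scored = sorted((energy(seq, cmap), k) for k, cmap in enumerate(maps))
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return maps[scored[0][1]]

def one_mutants(seq):
    """All sequences that differ from seq by a single H/P substitution."""
    for i, a in enumerate(seq):
        yield seq[:i] + ("P" if a == "H" else "H") + seq[i + 1:]

def neutral_networks(n):
    """Map every sequence to its native structure and group uniquely folding
    sequences into networks connected by single substitutions."""
    maps = list({contact_map(c) for c in conformations(n)})
    fold = {"".join(s): native_structure("".join(s), maps)
            for s in product("HP", repeat=n)}
    networks, seen = [], set()
    for seq, struct in fold.items():
        if struct is None or seq in seen:
            continue
        net, stack = set(), [seq]
        seen.add(seq)
        while stack:
            s = stack.pop()
            net.add(s)
            for m in one_mutants(s):
                if m not in seen and fold[m] == struct:
                    seen.add(m)
                    stack.append(m)
        networks.append((struct, net))
    return fold, networks

def neutral_degree(seq, fold):
    """Number of one-error mutants that keep the native structure of seq."""
    return sum(fold[m] == fold[seq] for m in one_mutants(seq))

if __name__ == "__main__":
    fold, networks = neutral_networks(8)
    if networks:
        struct, net = max(networks, key=lambda item: len(item[1]))
        prototype = max(sorted(net), key=lambda s: neutral_degree(s, fold))
        print(len(net), prototype, neutral_degree(prototype, fold))
```

For chains this short the networks are tiny, so the sketch only shows how the quantities are defined; the funnel-like stability distribution discussed in the text concerns much larger networks.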
Figure 4 Energy landscape of two adjacent neutral networks. (A) The large plane is a two-dimensional representation of sequence space. The vertical axis represents the free energy (G) of the protein structures x1 and x2 associated with the neutral networks [here, molecular structures of the Arc-repressor are taken as an example (Cordes et al., 2000)]. The lower the energy, the higher the thermodynamic stability of the structure. Each symbol represents a protein sequence; the lines between them represent amino acid substitutions. Neutral networks are distinguished by symbol shape. Sequences that fold uniquely into one conformation are shown in black, those that are equally stable in more than one conformation are shown in white. The sequence in the middle of the two nets folds equally well into both structures, x1 and x2 . A path connecting the prototype sequences (framed symbols) of the two networks is drawn. (B) Frontal view of the energy landscape, showing that the two structures x1 and x2 coexist in an equilibrium for the protein sequences lying in between the two neutral networks. The locations of the prototype sequences are indicated in parts A and B by dashed lines. (See insert for color representation of the figure.)
theoretical studies with lattice models∗ (Bornberg-Bauer and Chan, 1999), has recently been confirmed experimentally (Bloom et al., 2006). In the simple world of lattice proteins it is possible to assign the highest fitness to the prototype sequence if structural stability is to be taken as a proxy for functional efficiency (Cui et al., 2002; Wroe et al., 2007). Real proteins, however, function only within a certain window of stability. On the one hand, proteins need some conformational flexibility to bind to their ligands or interaction partners. On the other hand, too much instability leads to unfolding, aggregation, and degradation (DePristo et al., 2005). Many enzymes undergo structural changes while carrying out their normal functions. Adenylate kinase, for example, undergoes the same steps of conformational changes that it would need for processing its substrate, even in its unbound state (Henzler-Wildman et al., 2007). Most mutations alter the thermodynamic stability of proteins (Alber, 1989; Pakula and Sauer, 1989; Matthews, 1995; Wilson et al., 1992). The effect that a mutation has on protein stability, however, depends largely on the genetic context (i.e., on other stability-changing mutations). The same mutation can therefore be either neutral, advantageous, or deleterious in terms of stability. It is neutral if it does not alter protein stability in a way that compromises its correct folding or functioning. Also, an adaptive mutation might be temporarily disadvantageous regarding stability but can be compensated for by another mutation that restores the original stability of the protein (DePristo et al., 2005). For each deleterious mutation there is a number of potentially compensatory mutations, as demonstrated for the bacterium Salmonella typhimurium (Maisnier-Patin et al., 2002). 
This bacterium carried a mutation in the ribosomal protein S12 that confers antibiotic resistance but is otherwise detrimental (it slows the rate of protein synthesis, due to an increased proofreading rate). Of 80 lineages carrying this mutant, 77 independently evolved additional compensatory mutations, so that the antibiotic resistance was retained while the deleterious side effects were reduced. In total, the compensatory mutations found comprised 35 different amino acid substitutions (Maisnier-Patin et al., 2002).
Several mutations can also be introduced simultaneously during crossing-over (recombination) of homologous DNA sequences. Instead of only one amino acid substitution, an entire part of a protein sequence can be exchanged, carrying all the substitutions within it. How many amino acids are changed depends, of course, on how different the homologous DNA sequences are. Crossing-over can also maintain epistatic effects between amino acids if they are copied together (Barton and Charlesworth, 1998; Kouyos et al., 2007). Recombination has been shown to speed up evolutionary transitions in computer simulations of lattice proteins by “tunneling” through sequence space, thus reaching more distant structures (Cui et al., 2002).
∗ Lattice models use very simplified representations of proteins to test general properties of protein folding, which are then extrapolated to the folding of real proteins. The use of such simple models is necessary because an algorithm for sequence-to-structure mapping is still not available for proteins. In lattice models the polymer chain of a protein can assume discrete positions only on a two- or three-dimensional lattice. The monomer alphabet is usually reduced from 20 to only two monomers, with the properties “hydrophobic” and “polar” (or H and P). Those are the two properties thought to be most relevant in defining a protein’s structure. Alternative models use compact squares or cubes and contact interactions which are drawn randomly from a continuous energy distribution.
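The lattice models described in the footnote, and the idea of neutral neighbors, can be made concrete with a very small sketch. It assumes a two-dimensional square lattice, an energy of -1 for every contact between two non-consecutive H monomers, and a "native fold" defined as a unique lowest-energy conformation; the function names (conformations, energy, native_fold, neutral_neighbors) are hypothetical and chosen only for this illustration. It is a toy enumeration for short chains, not the procedure used in the studies cited.

```python
# Toy HP lattice model on a 2D square lattice (illustrative assumptions only).
MOVES = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # right, up, left, down


def conformations(n):
    """Yield self-avoiding walks of n >= 2 residues, one per symmetry class.

    The first step always goes right and the first turn always goes up,
    which removes rotated and mirrored copies of the same conformation.
    """
    def extend(path, turned):
        if len(path) == n:
            yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            if not turned and (dx, dy) == (0, -1):
                continue  # a first turn downward is the mirror image of upward
            nxt = (x + dx, y + dy)
            if nxt not in path:  # self-avoidance
                yield from extend(path + [nxt], turned or (dx, dy) != (1, 0))

    yield from extend([(0, 0), (1, 0)], False)


def energy(seq, path):
    """Return -1 for each lattice contact between non-consecutive H monomers."""
    e = 0
    for i in range(len(seq)):
        for j in range(i + 2, len(seq)):
            if seq[i] == "H" and seq[j] == "H":
                (x1, y1), (x2, y2) = path[i], path[j]
                if abs(x1 - x2) + abs(y1 - y2) == 1:
                    e -= 1
    return e


def native_fold(seq):
    """Return the unique lowest-energy conformation, or None if there is none."""
    best_e, best, unique = 0, None, False
    for path in conformations(len(seq)):
        e = energy(seq, path)
        if e < best_e:
            best_e, best, unique = e, path, True
        elif e == best_e:
            unique = False  # degenerate ground state (or no contact at all)
    return best if unique else None


def neutral_neighbors(seq):
    """List the one-error mutants that fold into the same native conformation."""
    fold = native_fold(seq)
    if fold is None:
        return []
    neighbors = []
    for i, monomer in enumerate(seq):
        mutant = seq[:i] + ("P" if monomer == "H" else "H") + seq[i + 1:]
        if native_fold(mutant) == fold:
            neighbors.append(mutant)
    return neighbors
```

For the four-residue chain HPPH, the only stabilizing conformation is the square that brings the two terminal H monomers into contact, and neutral_neighbors("HPPH") returns the two mutants HHPH and HPHH, which keep that fold. Exhaustive enumeration grows quickly with chain length, so the sketch is only practical for short sequences.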
3 STRUCTURAL PROMISCUITY OF PROTEINS
To understand how one domain can be converted into another, as expected during neofunctionalization (see Section 1.4), it is important to consider aspects of protein structure and stability. If a domain eventually assumes a new fold by amino acid substitutions, there must be a point where enough mutations have accumulated to switch from one fold to another. Determining this point (i.e., the number of amino acid changes necessary) has been a challenge for some time now, resulting, for example, in the Paracelsus challenge, which called for a conversion between two protein folds while retaining 50% of the original amino acid composition (Rose and Creamer, 1994; Jones et al., 1996). This goal was achieved [with the Janus protein (Dalal et al., 1997)] and even exceeded recently by a pair of proteins sharing 88% of their amino acids but assuming different folds (Alexander et al., 2007). These are, however, extreme, artificially produced examples and are not necessarily the rule for structural transitions in nature. Instead of changing entire folds, smaller conformational changes might be more common during neofunctionalization. As already discussed, structural transitions via single amino acid substitutions can be drawn as neutral paths through sequence space. The more mutational steps a protein sequence takes from the center of its neutral network toward the edge (Figure 4), the less stably it folds into its native conformation∗ . It has been shown that some proteins occur in an equilibrium state of two different folds, and that a very small number of mutations is sufficient to change this equilibrium in either direction. 
This means that the two folds are thermodynamically equally (or at least similarly) stable and that such proteins either stay in their respective fold once they have begun one particular folding pathway [such as Arc-repressor (Cordes et al., 2000), Rop, or the human prion protein; see Meier and Ozbek (2007) for an extended list], or they switch back and forth (fluctuate) constantly between the two structures. For minor conformational changes, fluctuations would be more likely and, as mentioned above, are to a certain degree necessary for the function of many proteins. In those cases, the general fold (i.e., the relative distribution of secondary structural elements) remains the same. Examples of ambiguity in the secondary structure can be found in proteins with chameleon sequences, which are equally likely to form an α-helix or a β-sheet, such as the Janus protein (Dalal et al., 1997). This illustrates how close two very different structures can be in terms of the amino acid sequences that form them.
Interestingly, some evidence has already been gathered of in vivo protein structures that appear to lie in between neutral networks. One example is the Arc-repressor, which carries a region folding into a β-sheet in the wild-type form. One amino acid substitution in that region, however, leads to a mutant form that can carry an α-helix instead. In this mutant, both conformations occur in equilibrium, switching back and forth dynamically (Cordes et al., 2000). Another example is the prion protein, which is thought to be an evolutionary intermediate of a transmembrane protein “on its way” to becoming a soluble globular protein. This would be the reason for its occasional folding into an insoluble, aggregating form, which causes the pathogenic symptoms of Creutzfeldt–Jakob disease and related diseases (Tompa et al., 2001). Similar properties have been found in proteins with cysteine-rich domains (CRDs), which provide physical strength through disulfide bridges and occur in the nematocysts of Hydra. The structure of one such domain could be transformed into the structure of another naturally occurring CRD (with a completely different pattern of disulfide bridges) by introducing only two point mutations in vitro (Meier et al., 2007). The first mutation already led to an equilibrium state of both conformations; the second then completed the transition. Also, the Rop (repressor of primer) protein folds into two different four-helix-bundle structures that form two different dimers (Levy et al., 2005). These and other examples [reviewed by Meier and Ozbek (2007)] provide a continuously growing body of evidence for the plasticity of protein structures.
As mentioned earlier, another type of mutation is that of indels (insertions/deletions). Estimations of structural transitions have recently been attempted with bioinformatics approaches, which focused primarily on insertions of small peptides into existing structures (Jiang and Blouin, 2007; Viksna and Gilbert, 2007). These studies found that structural transitions via insertions seem to be an important mechanism of protein evolution.
∗ The concept of neutrality has to be used very carefully here, because even within neutral networks, mutations are never entirely neutral. Each mutation alters stability, but not all mutations are so disruptive that they lead to a new native conformation. The fitness effects of those minor stability changes are relatively small, so they can be called neutral.
4 EVOLUTIONARY TRANSITIONS BETWEEN PROTEIN PHENOTYPES
The emergence of new protein phenotypes (in terms of structure, function, or both) has long been an unsolved problem and to some extent still is. The main difficulty in finding a solution to this problem might be the fact that most mutations are thought to be deleterious and that, consequently, adaptive mutations are much too rare to explain how a protein can evolve from one phenotype to another (let alone how to create a de novo protein from a random sequence). As already mentioned, the crucial step is to maintain a viable protein simultaneously with inventing a new one. The evolution from one protein structure x1 to another protein structure x2 by single amino acid substitutions can be drawn as a path through sequence space. This path starts within the neutral network of x1, somewhere near its prototype sequence, and ends near the prototype sequence of the neutral network of x2. The two neutral networks have to be adjacent (i.e., one mutation can change the dominant fold of the protein) or overlapping [i.e., there is one (or more) protein sequence(s) that can fold into the conformations associated with both nets] for the transition to be smooth enough in terms of fitness. This is possible when both phenotypes provide some fitness advantage. As the sequence “moves” (by means of mutations) along the path between the two neutral networks, x1 becomes less stable and therefore ceases to be the dominant fold. At the same time the stability of x2 increases, and x2 becomes the dominant fold of the protein. This means that the protein occurs in one or the other conformation during the transition, so that the protein’s fitness cannot drop too low. Such neutral paths would be suitable for guiding the evolution of new phenotypes through regions of sequence space that are “safe enough” (i.e., the intermediate states between two phenotypes are not too detrimental and therefore remain in the population). 
Such phenotypic transitions could be observed in both computational simulations (Wroe et al., 2007) and laboratory evolution experiments (Amitai et al., 2007). Conceivably, there are two different “routes” that the transition from one phenotype to another can take: either via a promiscuous generalist intermediate or directly
from one specialist phenotype to another (Khersonsky et al., 2006). The route of the promiscuous intermediate is more likely when there is not a high fitness cost for having both promiscuous phenotypes at the same time. The direct route is favored when there is strong dual selection pressure that acts to promote one phenotype and demote the other one at the same time [e.g., in the context of signaling and quorum sensing (Collins et al., 2006)], so that the transition between the phenotypes is very abrupt. Both cases, generalist intermediates (Rothman and Kirsch, 2003) and dual selection (Varadarajan et al., 2005; Collins et al., 2006), have been observed, although the first type seems to be the predominant one (Khersonsky et al., 2006). After having focused on the structural implications of phenotypic transitions, we look next into aspects of protein function and the probable correlation between the two.
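The gradual shift of the dominant fold from x1 to x2 along a path in sequence space can be illustrated with a two-state Boltzmann distribution. The sketch below assumes that a sequence populates only the two conformations x1 and x2 and that they are in thermodynamic equilibrium; the function name fold_populations and any free-energy values passed to it are hypothetical, chosen only for illustration.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def fold_populations(g1, g2, temperature=298.15):
    """Equilibrium fractions of conformations x1 and x2.

    g1, g2: folding free energies in kcal/mol (lower means more stable).
    Assumes a two-state equilibrium with no other populated states.
    """
    rt = R_KCAL * temperature
    g_min = min(g1, g2)  # shift energies to avoid overflow in exp()
    w1 = math.exp(-(g1 - g_min) / rt)
    w2 = math.exp(-(g2 - g_min) / rt)
    total = w1 + w2
    return w1 / total, w2 / total
```

With equal free energies the two folds are populated equally, which corresponds to a sequence lying between the two neutral networks. A difference of only a few kcal/mol in either direction is enough to make one fold strongly dominant, which is consistent with the observation that a very small number of mutations can shift such an equilibrium.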
5 FUNCTIONAL PROMISCUITY OF ENZYMES
For some time now, enzymes have been known to catalyze reactions with substrates other than their native ones, albeit at very low rates (O’Brien and Herschlag, 1999; Khersonsky et al., 2006). This phenomenon is referred to as catalytic or functional promiscuity. One of the best-studied examples in terms of promiscuous protein function is serum paraoxonase (PON1), for which four promiscuous activities have been described (Aharoni et al., 2005) (Figure 5A). Furthermore, these activities could be increased by factors of 10¹ to 10² [or even more in other examples (Khersonsky et al., 2006)] through directed laboratory evolution, without substantially reducing the native activity (Aharoni et al., 2005; Amitai et al., 2007). The native function of PON1 is that of a lipo-lactonase, which hydrolyzes esters in lactone rings. All promiscuous functions are esterase activities as well, but for slightly different substrates. As is typical for enzyme promiscuity, just a few amino acid changes in or around the active site of PON1 are needed to increase promiscuous activity by several orders of magnitude (Aharoni et al., 2005). Promiscuous activities can also be called latent when their effects are negligible but have the potential to be increased dramatically by just a few mutations in the right places. Often, promiscuous enzymes have more than one latent activity besides their native activity. This can be interpreted as being close to more than one neighboring neutral network of a functional phenotype (as demonstrated for PON1 in Figure 1A). Experimentally, it was observed for some protein families that their members share one promiscuous function or that the native function of one member is the promiscuous function of another, and so on (Khersonsky et al., 2006). This indicates that all the functions found within a family are located within the same evolutionary neighborhood in sequence space. 
In the course of evolution, different family members might have explored this neighborhood in different directions. It has been proposed that at the origin of every protein family there was a generalist that could perform some or all of the functions of its descendants, which eventually became specialists as they diverged (Jensen, 1976; Khersonsky et al., 2006): for example, through subfunctionalization. The fitness landscape of the hypothetical neutral network of PON1 (and the adjacent nets of the promiscuous functions) (Amitai et al., 2007) is shown in Figure 5B and C. In each neutral network there is a region with maximum fitness. Between
Figure 5 Hypothetical sequence neighborhood of the enzyme PON1. (A) Putative neutral network of PON1 with adjacent networks associated with promiscuous functions. The symbols represent different protein sequences connected by amino acid substitutions (lines). Symbols of different shapes belong to different neutral networks. Symbols with circles around them represent sequences on maxima in the fitness landscape. Large dashed circles delimit the neutral networks of the promiscuous functions. The native function of PON1 is that of a lipo-lactonase (circles). Promiscuous functions are thiolactonase (hexagons), aryl-esterase (squares), phosphotriesterase (triangles), and drug resistance (stars). (B) Fitness landscape of the same neutral network. The plane represents protein sequence space and the vertical axis is the fitness of an individual expressing the corresponding protein. Over the centre of each neutral network lies a fitness maximum. Paths between the neutral networks correspond to ridges in the fitness landscape, which connect the maxima of neighboring networks. (C) Frontal view of the fitness landscape, showing the maxima in profile. A hypothetical diagram under a highlighted part of the fitness landscape shows how catalytic activities of a promiscuous function (thio-lactonase) and the native function decrease and increase along the connecting path in sequence space [see (Aharoni et al. 2005) for experimental data]. Along this path, overall fitness does not decrease much if both native and promiscuous functions can be maintained by the same enzyme. For a complete transition from one network to another, however, a gene duplication event might be necessary (see Figure 6). [(A) Adapted from Amitai et al., 2007.] (See insert for color representation of the figure.)
those local fitness maxima, the fitness of the corresponding protein sequences (i.e., points in sequence space) does not drop completely, but forms a fitness “ridge” that allows a path of nearly neutral mutations between two neutral networks. An important question remaining is whether or not structural and functional promiscuity generally evolve simultaneously (James and Tawfik, 2003; Tokuriki and Tawfik, 2009). Whereas evolution of promiscuous functions can be explained without major conformational changes, the opposite case seems very unlikely, because a new structure is useless without function. The transition between neutral networks would then be correlated not only with different activities (as described for PON1 above) but also with different structures. A promiscuous protein could then be considered to be located at the edge of the neutral network of its native structure, bordering onto the neutral network of another structure (Figure 4B). Each structure would then correspond to a different activity: for example, the binding of different substrates. An intriguing example of such a link between promiscuous structures and functions has been found in antibodies. As shown in vitro, the same antibody, named SPE7, could bind a small peptide as well as a large protein surface. Each substrate was bound via a different backbone conformation of the antibody (James et al., 2003). Other, very recent examples of functionally promiscuous proteins include ubiquitin (Lange et al., 2008) and cytochrome P450 (Muralidhara et al., 2008), for both of which distinct crystal structures were resolved corresponding to the conformations necessary for binding different ligands.
6 GENE DUPLICATIONS AND PHENOTYPIC TRANSITIONS AT THE POPULATION LEVEL
It is important to emphasize that gene duplications occur within single organisms. Whether or not such a duplication will become a permanent part of the genetic repertoire of one species (or even lead to the formation of a new species) also depends on the dynamics occurring at the population level. It is likely that a gene duplication becomes lost immediately due to its lethal or detrimental effects, or simply by chance.
Population size has a considerable effect on the fate of gene duplicates as estimated by computer simulations (Lynch and Force, 2000). In large populations, pseudogenization (also called nonfunctionalization) as well as the determination of gene duplicate fate (by subfunctionalization or neofunctionalization) take longer. This should provide an advantage for organisms that occur in smaller populations (e.g., multicellular eukaryotes). Since positive mutations are fixed more rapidly in the population, those organisms can profit from beneficial gene duplications more quickly than can organisms that occur in large populations (such as protists or bacteria) (Kimura, 1983). Also, the molecular events leading to gene duplications are usually more frequent in organisms with larger genomes, which also tend to occur in populations of smaller size. In eukaryotes there are more repetitive genomic regions suitable for recombination errors, and transposable elements are also more abundant than in prokaryotes. Furthermore, horizontal gene transfer is common in bacteria but very rare in eukaryotes. Therefore, bacteria have very different options when it comes to the acquisition of new proteins and do not rely on the mechanisms of gene duplication and divergence in the same way that eukaryotes do (Lerat et al., 2005). There are also population genetic consequences for the phenotypic transitions via latent protein functions, as described above. If the latent function of a protein provides a fitness advantage and its gene is duplicated, a very small number of amino acid changes will be sufficient to increase the efficacy of the latent function in one of the duplicates. The mechanism by which this happens might be either sub- or neofunctionalization. The result is two paralogs, one with the original native function and one with the former latent function as its native function. 
After a gene duplication, the duplicates have been found to accumulate (possibly adaptive) mutations at an elevated rate (Zhang et al., 2003; Johnston et al., 2007; Scannell and Wolfe, 2008). If adaptive mutations, which can be measured as the ratio of nonsynonymous over synonymous substitution rates, occur before and simultaneously with gene duplication (e.g., via the optimization of a latent protein function), there should be traces of adaptive evolution. Indeed, this could be measured for a number of proteins in chordates, for which higher rates of adaptive mutations could be detected in those branches of the gene tree that lead toward (i.e., precede) a duplication event (Johnston et al., 2007). Another possibility for preduplication adaptation is that the divergence between native and latent functions occurs on different alleles first. In diploid organisms, alleles provide slightly different versions of the same gene (at the same locus on homologous chromosomes), much like the two copies that arise through gene duplication. The major difference is that gene duplicates always occur within the same individual in a population, whereas alleles may also occur in varying combinations. But if an allele carries a mutation that promotes the latent activity and if this activity is advantageous to the organism, the frequency of the allele will increase in the population. If a gene duplication then occurs in an individual carrying both alleles (i.e., one with low latent activity and one with elevated latent activity), so that the allele from one homologous chromosome gets copied onto the other homologous chromosome, both copies would be inherited together. Allelic divergence before gene duplication should be expected to occur under high selection pressures for functional or expressional divergence of a protein. Figure 6 shows a fitness landscape of a protein under selection for divergence. 
Such a selection pressure applies if the function of a protein needs to be optimized for performance in
[Figure 6 axis labels: fitness (vertical axis), sequence space (horizontal plane).]
Figure 6 Mechanism for functional divergence of a protein. Two neutral networks (wide circles) associated with the protein functions and structures A and B are shown. Sequence a1 is the fittest within its network, because it is most efficient at performing its native function. However, there would be a selective advantage in performing the function of the adjacent neutral network B as well, which is not yet available. Sequence a1 therefore adapts toward sequence a2 (solid arrow), which still folds into structure A, but also into structure B, albeit with low stability. Sequence a2 therefore has a tolerably high fitness, even though it is less efficient at function A. A gene duplication would be advantageous at this point, because not only would it instantly increase the rates of functions A and B (as there is now twice the amount of the gene product), but it would also allow for further functional divergence (dashed arrows). One gene copy could complete the transition toward sequence b, which has a high efficiency for function B, and the other copy could return to the also highly efficient sequence a1 . This process has been described as EAC (escape from adaptive conflict) subfunctionalization (Conant and Wolfe, 2008). The end result, however, is that of neofunctionalization.
different tissues or under different conditions (as, for example, caused by a changing environment). If those conditions do not differ much (i.e., if the pressure for divergence is low), the same protein will be optimized for performance under both conditions, probably with compromised efficiency. The more the two conditions differ, the more beneficial a gene duplication at the protein’s gene locus will be, enabling diversification of the two copies. Finally, if the selection pressure for divergence is high, a gene duplication might take too long, so that adaptation cannot take place when it would be advantageous. In this case, different alleles might adapt to the different conditions instead, so that some individuals (those that carry different alleles, i.e., that are heterozygous) will obtain a higher level of fitness. These population genetic dynamics for diploid organisms have so far been observed in computer simulations (Proulx and Phillips, 2006).
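Adaptive evolution along the branches of a gene tree, as discussed earlier in this section, is conventionally assessed by comparing the nonsynonymous substitution rate (dN) with the synonymous rate (dS). The sketch below shows only the direction of that ratio. It assumes that the substitution counts and the numbers of nonsynonymous and synonymous sites have already been estimated by some other method, and it applies no correction for multiple substitutions at the same site; the function name dn_ds and its parameters are hypothetical.

```python
def dn_ds(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Return omega = dN/dS from precomputed counts.

    dN: nonsynonymous substitutions per nonsynonymous site.
    dS: synonymous substitutions per synonymous site.
    Simplification: raw proportions, no multiple-hit correction.
    """
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    d_n = nonsyn_subs / nonsyn_sites
    d_s = syn_subs / syn_sites
    if d_s == 0:
        raise ValueError("dS is zero; the ratio is undefined")
    return d_n / d_s
```

A ratio well above 1 is usually read as a sign of positive (adaptive) selection, a ratio near 1 as neutral evolution, and a ratio well below 1 as purifying selection. Comparing this ratio between branches that precede a duplication and branches that follow it is one way to look for preduplication adaptation, although real analyses rely on dedicated codon-substitution models.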
Another population genetic effect on phenotypic transitions has been proposed for fast-replicating organisms such as bacteria and viruses, which occur in vast numbers. In such enormous populations one would expect to find a high tolerance toward mutations, since the sheer number of individuals will compensate for any deleterious effects of mutations within single individuals. So if the genotypic diversity of such populations could be plotted as the distance from the wild-type genotype, something resembling a “cloud” (rather than a “dot”) of genotypes would emerge. This has been referred to as a quasispecies, a term used originally for viruses (Eigen, 1993, 1996; Domingo, 1998; Tolou et al., 2002). It has been proposed that such quasispecies could be spread out over a rather flat region in a fitness landscape and still outcompete populations located on a high but narrow fitness peak. This hypothesis was at first demonstrated only for digital organisms (Wilke et al., 2001), but then further evidence was found in plant viroids (Codoñer et al., 2006) and viruses (van Nimwegen, 2006). The explanation for this could be that “flat” populations are better suited for adaptation to changing conditions than populations focusing on only one genotype with a very high level of fitness. Long-term evolvability thus outcompetes short-term adaptedness, leading to the “survival of the flattest” (Wilke et al., 2001). Whether or not this hypothesis also applies to organisms such as bacteria still needs to be shown, but it integrates well with the notion of protein evolution proceeding along neutral paths of single amino acid changes. Each variant protein provides a potential starting point for the transition toward a different neighboring protein phenotype. Viruses such as HIV seem to benefit greatly from the quasispecies effect, because it allows them to evade the host’s immune response very rapidly (Eigen, 1993; Ribeiro et al., 1998). 
The replication machinery of viruses is inherently very error-prone (Preston, 1996), thus leading to the high mutation rates required to form a quasispecies. In higher eukaryotes there are also ways of increasing the phenotypic diversity under selection. The molecular chaperone Hsp90, for example, is capable of suppressing morphological diversity in Drosophila melanogaster under normal circumstances. Under environmental stress or mutations, however, Hsp90’s function is compromised and the genetic diversity accumulated (which was suppressed before) becomes apparent in the phenotype. Proteins with such properties can therefore be used as “switches” between conservation and adaptation, depending on the prevalent selection pressures (Rutherford and Lindquist, 1998; Sangster et al., 2004). In addition to the mechanism of genetic mutations presented so far, the concept of phenotypic mutations might explain, at least for some cases, how alternative protein structures can be “tested” for fitness advantages without changing the predominant phenotype of that protein (Whitehead et al., 2008) (Figure 7). To be effective, some phenotypic traits require two mutations: for example, disulfide bridges. Having only one of these mutations will not increase an organism’s fitness. Phenotypic mutations happen every time the transcription and translation machineries make a mistake that goes “unnoticed” and results in an amino acid change. In an individual carrying only one mutation in the genome, a small percentage of the proteins might carry the second mutation by mistake. If the fitness advantage of those proteins is high enough, the genotype will spread in the population and the second mutation is more likely to occur on the genotypic level. Whitehead et al. (2008) named this hypothetical mechanism the look-ahead effect of phenotypic mutations.
Figure 7 Adaptation toward a high-fitness double mutant via phenotypic mutations. This scheme illustrates how phenotypic mutations might aid evolution toward a mutant protein carrying two codependent point mutations (A and B). Both mutations are required for an increase in fitness, so it would take a long time for the double mutant to arise by neutral drift. One of the two mutations is, however, likely to arise by chance, and individuals carrying it might produce the second mutation at a low rate via phenotypic mutations (see the text). The fitness benefit resulting from this might be sufficient to drive the single mutant A to fixation. Finally, if A has become fixed, the population will eventually acquire B by drift, resulting in a high-fitness phenotype that will quickly spread through the population.
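The look-ahead effect summarized in Figure 7 can be illustrated with a piece of simple arithmetic. The sketch below is a toy model and not the formulation of Whitehead et al. (2008): it assumes that the fitness of a single-mutant genotype rises linearly with the fraction of its protein copies that carry the second change through a phenotypic mutation. The function name look_ahead_fitness, the parameter names, and any numerical values are hypothetical.

```python
def look_ahead_fitness(fraction_double, benefit, base_fitness=1.0):
    """Expected fitness of a genotype carrying only mutation A.

    fraction_double: fraction of protein copies that also carry change B
        because of a transcription or translation error (assumed value).
    benefit: relative fitness gain if every copy were the double mutant.
    Assumption: fitness scales linearly with the double-mutant fraction.
    """
    if not 0.0 <= fraction_double <= 1.0:
        raise ValueError("fraction_double must lie between 0 and 1")
    return base_fitness * (1.0 + fraction_double * benefit)
```

Under these assumptions, a genotype in which one protein copy in a thousand carries the second change, with a 50% benefit for the full double mutant, would gain a relative fitness advantage of 0.05%. Whether an advantage of that size is enough to drive the single mutant to fixation depends on population size and on the real error rates, neither of which this sketch attempts to model.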
7 EVOLUTION OF RIBOZYME STRUCTURES
Proteins are not the only catalytic molecules in the cell. Ribozymes, RNA molecules that have enzymatic activity and are not translated into protein, have been found to carry out important functions, mostly regulatory in nature. Large portions of eukaryotic genomes, previously termed junk DNA, have recently been found to contain noncoding RNAs, ribozymes, and riboswitches, which are important in posttranscriptional regulation (Serganov and Patel, 2007). Those noncoding RNAs can exert their function by recognizing and binding other nucleic acid sequences (DNA or RNA) and by forming enzyme-like three-dimensional conformations with catalytically active sites. Ribozymes thus combine functions of DNA and proteins, a property that has led to the hypothesis of an ancient RNA world (Gesteland et al., 2006) where RNA molecules had both information-storing and catalytic functions. Later then, according to this hypothesis, DNA took over the role of information storage because its double helix is much more stable than that of RNA, and proteins took over the role of catalytic agents because their alphabet of 20 amino acids (compared with the four nucleotides found in RNA)
allowed for more diverse chemical activity. However, RNA remained at key positions in such basic cellular activities as transcription and translation. Because RNA molecules have the property of folding into distinct catalytically active conformations and also exhibit mutational robustness (Wagner and Stadler, 1999; Borenstein and Ruppin, 2006; Wagner, 2008), it is reasonable to assume that they, too, evolve via latent or promiscuous activities, from one phenotype to another. Computer simulations suggest that RNA structures form sparse but extended neutral networks that are likely to be in close spatial proximity. This means that several entirely different structures can be just one mutation apart. The neutral networks of proteins seem to be different in that they are more compact and more isolated from each other. [Protein neutral networks have been likened to plums in plum pudding, as opposed to RNA neutral networks, which are more like a bowl of spaghetti (Goldstein, 2008).] Also, according to current knowledge, a completely different fold is required for a new ribozyme to evolve (Curtis and Bartel, 2005), whereas in proteins the same overall conformation can be used to catalyze very different reactions (since mostly amino acids in the active site change) (Holmquist, 2000). In vivo examples of ribozyme promiscuity are still rare [e.g., the Tetrahymena group I ribozyme (Forconi and Herschlag, 2005)]. Most insights on ribozyme functional promiscuity come from in vitro experiments. It is possible to produce artificial RNA sequences that lie exactly between two neighboring neutral networks and that fold into the structures of both nets at equal frequencies (Schultes and Bartel, 2000), as illustrated for proteins in Figure 4. 
Neutral paths (i.e., via mutations that do not change fitness) that connect sequences uniquely folding into one or the other conformation can be reconstructed, showing that a gradual phenotypic transition is at least possible, similar to the way proposed for proteins. It has also been shown that under directed laboratory evolution, ribozymes can evolve toward new functions with few mutations (Curtis and Bartel, 2005). Therefore, it is possible that, just like proteins, ribozymes can recruit promiscuous activities but need more mutations to do so. Unfortunately, it seems more difficult to elucidate functional promiscuity in ribozymes because, compared with proteins, more mutations are needed to change from one phenotype (i.e., conformation) to another. Therefore, even though two ribozymes might share a common evolutionary origin, such relationships might not be recognized.
8 CONCLUSIONS
In this contribution we have collected evidence pointing toward a mechanism of gene duplicate retention that extends the current view on sub- and neofunctionalization. Most approaches so far have focused on the processes following a gene duplication. Adaptive evolution, however, does not seem to be limited to postduplication processes, but can act on protein sequences at any time. In fast-replicating microbes, the strategy is to generate a high level of genetic diversity through random mutations, regardless of the fitness effects for the individual. For organisms with lower replication rates, however, this strategy does not work, as the survival of the individual is more important. Therefore, a mechanism that conserves a functional phenotype at all times, while still allowing for innovation, can be beneficial.
We suggest that the conformational variability of protein structures plays an important role in protein evolution. Proteins seem to have evolved to a state where different structural phenotypes are mutationally closely related. At the same time, some enzymes exhibit functional promiscuity, which can be altered with just a few mutations. Taken together, these two observations can be used to formulate a mechanism of phenotypic transitions of proteins by point mutations. Those transitions can occur gradually via functionally promiscuous intermediates. The most important aspect of the phenotypic transition is the maintenance of the native protein function while increasing another promiscuous function. In the terminology of neutral networks, this allows a protein under selection to be positioned near the neutral network of a different beneficial phenotype. This is the potential starting point for subfunctionalization, because two subfunctions are available for selection to work on. At this point, a gene duplication could be very advantageous because it would allow one copy to complete the phenotypic transition while letting the other copy return to its original native state. Therefore, the end result of this process is more consistent with the original model of neofunctionalization, because one gene copy retains its original function while the other copy adopts a new one. This applies to single-protein domains. Since phenotypic transitions are likely to occur domain independently in multidomain proteins, subfunctionalization might also be the result of neofunctionalization occurring in different “directions” for individual domains. This means that on the same gene copy (or paralog) one domain might keep its original function while another loses it, and vice versa in the other copy. The loss of a function in one domain can be accompanied by the simultaneous gain of a new function, but not necessarily. 
If complementary domains are conserved between the two copies, “pure” subfunctionalization might be the immediate fate, potentially followed by neofunctionalization later. In the absence of gene duplication and under a high selection pressure, allelic divergence might be another preduplication step toward a new advantageous phenotype. A high selection pressure is necessary in order to increase the frequency of subfunctionalized alleles in the population, so that a gene duplication is more likely to unite the two subfunctions on one chromosome. With the current advances in genomics it should soon be possible to sample entire populations under selection pressure to detect allelic divergence of protein subfunctions systematically. Until then, computer simulations yield the most promising insights into these processes (Proulx and Phillips, 2006). Also on the level of structural protein evolution, computer simulations with simple lattice models (Bornberg-Bauer and Chan, 1999) have proven to be well suited in predicting phenomena that could be shown in the lab afterward (Bloom et al., 2006). Fortunately, lattice models have helped greatly to elucidate the properties of protein neutral networks (Chan and Bornberg-Bauer, 2002; Xia and Levitt, 2004; Goldstein, 2008), so that these are no longer “terra incognita” (Meier and Ozbek, 2007). In general, simulations have the potential to provide insights into evolutionary processes and can be used to formulate research hypotheses in the lab. However, more evidence is still needed for the link between structural and functional promiscuity, and the possibility of phenotypic transitions via promiscuous intermediates. As of now, only one or the other of those properties has been demonstrated in real proteins. 
Although the possibility of such transitions seems likely, as demonstrated by the joint computational and experimental efforts of Wroe et al. (2007) and Amitai et al. (2007), there is still a need for a “smoking gun example” (DePristo, 2007).
REFERENCES
Aharoni A, Gaidukov L, Khersonsky O, Gould SM, Roodveldt C, et al. 2005. The “evolvability” of promiscuous protein functions. Nat Genet 37(1):73–76.
Ahn S, Tanksley SD. 1993. Comparative linkage maps of the rice and maize genomes. Proc Natl Acad Sci USA 90(17):7980–7984.
Alber T. 1989. Mutational effects on protein stability. Annu Rev Biochem 58:765–798.
Alexander PA, He Y, Chen Y, Orban J, Bryan PN. 2007. The design and characterization of two proteins with 88% sequence identity but different structure and function. Proc Natl Acad Sci USA 104(29):11963–11968.
Alva V, Ammelburg M, Söding J, Lupas AN. 2007. On the origin of the histone fold. BMC Struct Biol 7:17.
Amitai G, Gupta RD, Tawfik DS. 2007. Latent evolutionary potentials under the neutral mutational drift of an enzyme. HFSP J 1(1):67–78.
Amoutzias GD, Robertson DL, Oliver SG, Bornberg-Bauer E. 2004. Convergent evolution of gene networks by single-gene duplications in higher eukaryotes. EMBO Rep 5(3):274–279.
Barton NH, Charlesworth B. 1998. Why sex and recombination? Science 281(5385):1986–1990.
Bergthorsson U, Andersson DI, Roth JR. 2007. Ohno’s dilemma: evolution of new genes under continuous selection. Proc Natl Acad Sci USA 104(43):17004–17009.
Björklund AK, Ekman D, Light S, Frey-Skött J, Elofsson A. 2005. Domain rearrangements in protein evolution. J Mol Biol 353(4):911–923.
Blomme T, Vandepoele K, Bodt SD, Simillion C, Maere S, et al. 2006. The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol 7(5):R43.
Bloom JD, Labthavikul ST, Otey CR, Arnold FH. 2006. Protein stability promotes evolvability. Proc Natl Acad Sci USA 103(15):5869–5874.
Boehr DD, McElheny D, Dyson HJ, Wright PE. 2006. The dynamic energy landscape of dihydrofolate reductase catalysis. Science 313(5793):1638–1642.
Borenstein E, Ruppin E. 2006. Direct evolution of genetic robustness in microRNA. Proc Natl Acad Sci USA 103(17):6593–6598.
Bornberg-Bauer E. 1997. How are model protein structures distributed in sequence space? Biophys J 73(5):2393–2403.
Bornberg-Bauer E, Chan HS. 1999. Modeling evolutionary landscapes: mutational stability, topology, and superfunnels in sequence space. Proc Natl Acad Sci USA 96(19):10689–10694.
Bowers JE, Chapman BA, Rong J, Paterson AH. 2003. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422(6930):433–438.
Brunet FG, Crollius HR, Paris M, Aury JM, Gibert P, et al. 2006. Gene loss and evolutionary rates following whole-genome duplication in teleost fishes. Mol Biol Evol 23(9):1808–1816.
Chain FJJ, Evans BJ. 2006. Multiple mechanisms promote the retained expression of gene duplicates in the tetraploid frog Xenopus laevis. PLoS Genet 2(4):e56.
Chan HS, Bornberg-Bauer E. 2002. Perspectives on protein evolution from simple exact models. Appl Bioinf 1(3):121–144.
Choi IG, Kim SH. 2006. Evolution of protein structural classes and protein sequence families. Proc Natl Acad Sci USA 103(38):14056–14061.
Codoñer FM, Darós JA, Solé RV, Elena SF. 2006. The fittest versus the flattest: experimental confirmation of the quasispecies effect with subviral pathogens. PLoS Pathog 2(12):e136.
Collins CH, Leadbetter JR, Arnold FH. 2006. Dual selection enhances the signaling specificity of a variant of the quorum-sensing transcriptional activator LuxR. Nat Biotechnol 24(6):708–712.
Conant GC, Wolfe KH. 2008. Turning a hobby into a job: how duplicated genes find new functions. Nat Rev Genet 9(12):938–950.
Cordes MH, Burton RE, Walsh NP, McKnight CJ, Sauer RT. 2000. An evolutionary bridge to a new protein fold. Nat Struct Biol 7(12):1129–1132.
Cui Y, Wong WH, Bornberg-Bauer E, Chan HS. 2002. Recombinatoric exploration of novel folded structures: a heteropolymer-based model of protein evolutionary landscapes. Proc Natl Acad Sci USA 99(2):809–814.
Curtis EA, Bartel DP. 2005. New catalytic structures from an existing ribozyme. Nat Struct Mol Biol 12(11):994–1000.
Dalal S, Balasubramanian S, Regan L. 1997. Protein alchemy: changing beta-sheet into alpha-helix. Nat Struct Biol 4(7):548–552.
DePristo MA. 2007. The subtle benefits of being promiscuous: adaptive evolution potentiated by enzyme promiscuity. HFSP J 1(2):94–98.
DePristo MA, Weinreich DM, Hartl DL. 2005. Missense meanderings in sequence space: a biophysical view of protein evolution. Nat Rev Genet 6(9):678–687.
DesMarais DL, Rausher MD. 2008. Escape from adaptive conflict after duplication in an anthocyanin pathway gene. Nature 454(7205):762–765.
Domingo E. 1998. Quasispecies and the implications for virus persistence and escape. Clin Diagn Virol 10(2–3):97–101.
Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. 2005. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci USA 102(40):14338–14343.
Eigen M. 1993. Viral quasispecies. Sci Am 269(1):42–49.
Eigen M. 1996. On the nature of virus quasispecies. Trends Microbiol 4(6):216–218.
Ferrada E, Wagner A. 2008. Protein robustness promotes evolutionary innovations on large evolutionary time-scales. Proc Biol Sci 275(1643):1595–1602.
Force A, Lynch M, Pickett FB, Amores A, Yan YL, et al. 1999. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151(4):1531–1545.
Forconi M, Herschlag D. 2005. Promiscuous catalysis by the Tetrahymena group I ribozyme. J Am Chem Soc 127(17):6160–6161.
Garcia-Diaz M, Kunkel TA. 2006. Mechanism of a genetic glissando: structural biology of indel mutations. Trends Biochem Sci 31(4):206–214.
Gesteland RF, Cech TR, Atkins JF. 2006. The RNA World, 3rd ed. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory.
Goldstein RA. 2008. The structure of protein evolution and the evolution of protein structure. Curr Opin Struct Biol 18(2):170–177.
He X, Zhang J. 2005. Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics 169(2):1157–1164.
Henzler-Wildman KA, Thai V, Lei M, Ott M, Wolf-Watz M, et al. 2007. Intrinsic motions along an enzymatic reaction trajectory. Nature 450(7171):838–844.
Hittinger CT, Carroll SB. 2007. Gene duplication and the adaptive evolution of a classic genetic switch. Nature 449(7163):677–681.
Holmquist M. 2000. Alpha/beta-hydrolase fold enzymes: structures, functions and mechanisms. Curr Protein Pept Sci 1(2):209–235.
Hughes AL. 1994. The evolution of functionally novel proteins after gene duplication. Proc Biol Sci 256(1346):119–124.
Hughes MK, Hughes AL. 1993. Evolution of duplicate genes in a tetraploid animal, Xenopus laevis. Mol Biol Evol 10(6):1360–1369.
Hughes T. 2007. Computational analysis of the evolutionary dynamics of proteins on a genomic scale. Ph.D. dissertation, University of Bergen.
Hughes T, Liberles DA. 2007. The pattern of evolution of smaller-scale gene duplicates in mammalian genomes is more consistent with neo- than subfunctionalisation. J Mol Evol 65(5):574–588.
James LC, Tawfik DS. 2003. Conformational diversity and protein evolution–a 60-year-old hypothesis revisited. Trends Biochem Sci 28(7):361–368.
James LC, Roversi P, Tawfik DS. 2003. Antibody multispecificity mediated by conformational diversity. Science 299(5611):1362–1367.
Jensen RA. 1976. Enzyme recruitment in evolution of new function. Annu Rev Microbiol 30:409–425.
Jiang H, Blouin C. 2007. Insertions and the emergence of novel protein structure: a structure-based phylogenetic study of insertions. BMC Bioinf 8:444.
Johnston CR, O’Dushlaine C, Fitzpatrick DA, Edwards RJ, Shields DC. 2007. Evaluation of whether accelerated protein evolution in chordates has occurred before, after, or simultaneously with gene duplication. Mol Biol Evol 24(1):315–323.
Jones DT, Moody CM, Uppenbrink J, Viles JH, Doyle PM, et al. 1996. Towards meeting the Paracelsus challenge: the design, synthesis, and characterization of paracelsin-43, an alpha-helical protein with over 50% sequence identity to an all-beta protein. Proteins 24(4):502–513.
Khersonsky O, Roodveldt C, Tawfik DS. 2006. Enzyme promiscuity: evolutionary and mechanistic aspects. Curr Opin Chem Biol 10(5):498–508.
Kimura M. 1983. The Neutral Theory of Molecular Evolution. Cambridge, UK: Cambridge University Press.
Kouyos RD, Silander OK, Bonhoeffer S. 2007. Epistasis between deleterious mutations and the evolution of recombination. Trends Ecol Evol 22(6):308–315.
Lange OF, Lakomek NA, Farès C, Schröder GF, Walter KFA, et al. 2008. Recognition dynamics up to microseconds revealed from an RDC-derived ubiquitin ensemble in solution. Science 320(5882):1471–1475.
Lerat E, Daubin V, Ochman H, Moran NA. 2005. Evolutionary origins of genomic repertoires in bacteria. PLoS Biol 3(5):e130.
Levy Y, Cho SS, Shen T, Onuchic JN, Wolynes PG. 2005. Symmetry and frustration in protein energy landscapes: a near degeneracy resolves the Rop dimer-folding mystery. Proc Natl Acad Sci USA 102(7):2373–2378.
Lupas AN, Ponting CP, Russell RB. 2001. On the evolution of protein folds: Are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J Struct Biol 134(2–3):191–203.
Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290(5494):1151–1155.
Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154(1):459–473.
Maisnier-Patin S, Berg OG, Liljas L, Andersson DI. 2002. Compensatory adaptation to the deleterious effect of antibiotic resistance in Salmonella typhimurium. Mol Microbiol 46(2):355–366.
Matthews BW. 1995. Studies on protein stability with T4 lysozyme. Adv Protein Chem 46:249–278.
McLoughlin SY, Copley SD. 2008. A compromise required by gene sharing enables survival: implications for evolution of new enzyme activities. Proc Natl Acad Sci USA 105(36):13497–13502.
Meier S, Ozbek S. 2007. A biological cosmos of parallel universes: Does protein structural plasticity facilitate evolution? Bioessays 29(11):1095–1104.
Meier S, Jensen PR, David CN, Chapman J, Holstein TW, et al. 2007. Continuous molecular evolution of protein-domain structures by single amino acid changes. Curr Biol 17(2):173–178.
Muralidhara BK, Sun L, Negi S, Halpert JR. 2008. Thermodynamic fidelity of the mammalian cytochrome P450 2B4 active site in binding substrates and inhibitors. J Mol Biol 377(1):232–245.
Murzin AG, Brenner SE, Hubbard T, Chothia C. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247(4):536–540.
O’Brien PJ, Herschlag D. 1999. Catalytic promiscuity and the evolution of new enzymatic activities. Chem Biol 6(4):R91–R105.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Pakula AA, Sauer RT. 1989. Genetic analysis of protein stability and function. Annu Rev Genet 23:289–310.
Pál C, Papp B, Lercher MJ. 2006. An integrated view of protein evolution. Nat Rev Genet 7(5):337–348.
Papp B, Pál C, Hurst LD. 2003. Dosage sensitivity and the evolution of gene families in yeast. Nature 424(6945):194–197.
Piatigorsky J, Wistow GJ. 1989. Enzyme/crystallins: gene sharing as an evolutionary strategy. Cell 57(2):197–199.
Preston BD. 1996. Error-prone retrotransposition: rime of the ancient mutators. Proc Natl Acad Sci USA 93(15):7427–7431.
Proulx SR, Phillips PC. 2006. Allelic divergence precedes and promotes gene duplication. Int J Org Evol 60(5):881–892.
Rastogi S, Liberles DA. 2005. Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol 5(1):28.
Ribeiro RM, Bonhoeffer S, Nowak MA. 1998. The frequency of resistant mutant virus before antiviral therapy. AIDS 12(5):461–465.
Riechmann L, Winter G. 2000. Novel folded protein domains generated by combinatorial shuffling of polypeptide segments. Proc Natl Acad Sci USA 97(18):10068–10073.
Riechmann L, Winter G. 2006. Early protein evolution: building domains from ligand-binding polypeptide segments. J Mol Biol 363(2):460–468.
Rose GD, Creamer TP. 1994. Protein folding: predicting predicting. Proteins 19(1):1–3.
Rothman SC, Kirsch JF. 2003. How does an enzyme evolved in vitro compare to naturally occurring homologs possessing the targeted function? Tyrosine aminotransferase from aspartate aminotransferase. J Mol Biol 327(3):593–608.
Rutherford SL, Lindquist S. 1998. Hsp90 as a capacitor for morphological evolution. Nature 396(6709):336–342.
Sangster TA, Lindquist S, Queitsch C. 2004. Under cover: causes, effects and implications of Hsp90-mediated genetic capacitance. Bioessays 26(4):348–362.
Scannell DR, Wolfe KH. 2008. A burst of protein sequence evolution and a prolonged period of asymmetric evolution follow gene duplication in yeast. Genome Res 18(1):137–147.
Schultes EA, Bartel DP. 2000. One sequence, two ribozymes: implications for the emergence of new ribozyme folds. Science 289(5478):448–452.
Schuster P, Fontana W, Stadler PF, Hofacker IL. 1994. From sequences to shapes and back: a case study in RNA secondary structures. Proc Biol Sci 255(1344):279–284.
Serganov A, Patel DJ. 2007. Ribozymes, riboswitches and beyond: regulation of gene expression without proteins. Nat Rev Genet 8(10):776–790.
Söding J, Lupas AN. 2003. More than the sum of their parts: on the evolution of proteins from peptides. Bioessays 25(9):837–846.
Tokuriki N, Tawfik DS. 2009. Protein dynamism and evolvability. Science 324(5924):203–207.
Tokuriki N, Stricher F, Schymkowitz J, Serrano L, Tawfik DS. 2007. The stability effects of protein mutations appear to be universally distributed. J Mol Biol 369(5):1318–1332.
Tolou H, Nicoli J, Chastel C. 2002. Viral evolution and emerging viral infections: what future for the viruses? A theoretical evaluation based on informational spaces and quasispecies. Virus Genes 24(3):267–274.
Tompa P, Tusnády GE, Cserzo M, Simon I. 2001. Prion protein: evolution caught en route. Proc Natl Acad Sci USA 98(8):4431–4436.
Van de Peer Y, Taylor JS, Meyer A. 2003. Are all fishes ancient polyploids? J Struct Funct Genomics 3(1–4):65–73.
van Nimwegen E. 2006. Epidemiology: influenza escapes immunity along neutral networks. Science 314(5807):1884–1886.
Varadarajan N, Gam J, Olsen MJ, Georgiou G, Iverson BL. 2005. Engineering of protease variants exhibiting high catalytic activity and exquisite substrate selectivity. Proc Natl Acad Sci USA 102(19):6855–6860.
Vendruscolo M, Dobson CM. 2006. Structural biology: dynamic visions of enzymatic reactions. Science 313(5793):1586–1587.
Veron AS, Kaufmann K, Bornberg-Bauer E. 2007. Evidence of interaction network evolution by whole-genome duplications: a case study in MADS-box proteins. Mol Biol Evol 24(3):670–678.
Viksna J, Gilbert D. 2007. Assessment of the probabilities for evolutionary structural changes in protein folds. Bioinformatics 23(7):832–841.
Vogel C, Bashton M, Kerrison ND, Chothia C, Teichmann SA. 2004. Structure, function and evolution of multidomain proteins. Curr Opin Struct Biol 14(2):208–216.
Wagner A. 2008. Robustness and evolvability: a paradox resolved. Proc R Soc Lond B 275(1630):91–100.
Wagner A, Stadler PF. 1999. Viral RNA and evolved mutational robustness. J Exp Zool 285(2):119–127.
Weiner J, Beaussart F, Bornberg-Bauer E. 2006. Domain deletions and substitutions in the modular protein evolution. FEBS J 273(9):2037–2047.
Whitehead DJ, Wilke CO, Vernazobres D, Bornberg-Bauer E. 2008. The look-ahead effect of phenotypic mutations. Biol Direct 3:18.
Wilke CO, Wang JL, Ofria C, Lenski RE, Adami C. 2001. Evolution of digital organisms at high mutation rates leads to survival of the flattest. Nature 412(6844):331–333.
Wilson KP, Malcolm BA, Matthews BW. 1992. Structural and thermodynamic analysis of compensating mutations within the core of chicken egg white lysozyme. J Biol Chem 267(15):10842–10849.
Wroe R, Chan HS, Bornberg-Bauer E. 2007. A structural model of latent evolutionary potentials underlying neutral networks in proteins. HFSP J 1(1):79–87.
Xia Y, Levitt M. 2004. Simulating protein evolution in sequence and structure space. Curr Opin Struct Biol 14(2):202–207.
Zhang P, Gu Z, Li WH. 2003. Different evolutionary patterns between young duplicate genes in the human genome. Genome Biol 4(9):R56.
7
Protein Products of Tandem Gene Duplication: A Structural View
WILLIAM R. TAYLOR and MICHAEL I. SADOWSKI
Division of Mathematical Biology, MRC National Institute for Medical Research, London, UK
1 INTRODUCTION
Since the early proposal of Ohno (1970), it has become clear from many studies in the fields of protein structure analysis and comparative genomics that one of the major mechanisms by which proteins evolve is gene duplication followed by modification of the resulting duplicated gene products. Many proteins contain domains that have clearly arisen through duplication and subsequent fusions [for a review, see an article by Bajaj and Blundell (1984)]. Duplication events often result in spatially and sequentially distinct domains which can then be regulated separately, such as when a protein expressed on the X-chromosome in humans is required in spermatozoa and must be copied to an autosomal chromosome to ensure its presence [discussed by Patthy (2008)]. In the less common circumstance that the duplication event overwrites a stop codon, a fusion between two duplicates of the same gene can be created. Depending on the subsequent fate of the two copies, this can have little or no effect on the structure or result in such large changes that the relationship between the novel protein and its ancestor can be obscured completely. The chapter begins with a brief review of the genetic mechanisms that create tandem duplicates of this type and control their genomic fate. We then describe the effects that these events can have on protein structures and review the examples of each type of event that have so far been observed. The underlying issue of detecting symmetries in protein sequences and structures is also considered. Finally, we synthesize a general evolutionary model from these experimental findings and discuss how it can be tested.
2 GENETIC MECHANISMS
Tandem duplications are most commonly the result of homologous recombination [see Shapiro et al. (1977) and following papers]. Other mechanisms that can result in gene duplication, such as lateral transfer by a retrotransposon, have no bias for the duplicated copy to remain adjacent and are not considered here. Short repeats generated
by slippage and loop-out events are also not considered. Nor do we consider somatic duplications, since it is only duplication events in the germline that are passed down the generations. During meiosis, the original father and mother copies of the diploid genome are aligned side by side with their centromeres at the equator of the spindle apparatus before being pulled apart toward the poles. During separation, strands can become reconnected, either by chance breakage or by active exchange. The result is the four-way Holliday junction, which under pressure from the separating chromosomes (and aided by proteins) can migrate away from the centromeres, either reaching the end or becoming resolved back into separate strands again. The result is an exchange of father/mother DNA between the two separating genome copies called recombination (Figure 1). Recombination cannot rely simply on the rough physical positioning of the chromosomes to provide a match between strand exchange sites, and more active mechanisms have been proposed to ensure better registration between the strands. This might involve nicks cut at matching sites by a specific restriction enzyme, allowing the two free ends to cross-hybridize (with subsequent ligation). Alternatively, a nick in one strand would leave a single free end that could invade an adjacent strand without the need to have synchronized nick sites. This mechanism could equally include two free ends from
Figure 1 Homologous recombination: normal crossover events.
the same strand caused by a double-stranded DNA break (as might be prone to occur under the stress of chromosome separation).∗ All these crossover mechanisms rely on strands locating their complementary match on the opposite chromosome. Given that there will always be some uncertainty in this process, there is the chance that the strands cannot find their correct locations. Errors in this process are more likely to occur if there are multiple homologous regions in the vicinity, increasing the chance of incorrect strand hybridization. Down the generations, this can lead to repeated duplications, as earlier-duplicated segments can engender further duplication. Otherwise, with respect to gene structure, strand breakage and recombination are random occurrences,† and it is a matter of chance whether an intact gene is copied, or multiple genes, along with their control elements and intervening sequences, are copied. The result of incorrect matching depends on how the Holliday junction resolves itself (Figure 2). In one direction there is simply a reciprocal exchange of genetic material, whereas in the other, one chromosome becomes longer while the other becomes correspondingly shorter. In the longer duplex, a duplication has occurred at two positions (mother3father3 and MOTHER5FATHER5), with a symmetric loss of FATHER5 and mother3 in the shorter product. Since duplication events are initiated by incorrect strand hybridization, the regions exchanged will not be a perfect match to each other, and DNA mismatch repair mechanisms will attempt to correct any resulting base mismatches in a process known as gene conversion. In the example above, this would entail the conversion of one of the strands in the FATHER4/mother5 pairing to match the other. It is possible that the changes resulting from gene conversion might involve the loss or gain of a stop codon (or an intron splice site), leading to a dramatic change in gene structure (Figure 3).
In this example, if the father3father4 stretch corresponds to a gene with a STOP codon (at 4), conversion of this to a non-STOP codon from the MOTHER5 strand would result in a read-through into the father5 segment or beyond until another STOP codon was hit by chance, either in some intervening “junk” DNA or after a read-through of a following gene. In this situation, the two original adjacent genes would become fused as a single gene product. A number of these possibilities involve the incorporation of intergenic regions of DNA into a new coding region, where they appear as linkers between previous gene regions. The lengths of these segments might well be quite large and in the new coding region would constitute a new “random” protein sequence. As suggested previously, these novel proteins might be protected sufficiently from selection pressure by their flanking functional domains to allow some structure and function to be acquired.
∗ Double-stranded breaks in DNA where the broken ends are close in sequence can be repaired directly by enzymes that do not make use of hybridization (called nonhomologous end joining), but if there is a length mismatch between the broken ends, as might result from strand separation between two distant single-stranded breaks, the repair mechanisms rely on the hybridization of the two complementary single-stranded tails. Typically, the 3′ broken ends are trimmed back until two matching single-stranded segments are exposed, resulting in loss of genetic material. However, during replication, when there is an intact DNA copy close by, the loose ends from the strand with a staggered break can locate their complementary sequences in the newly synthesized copy (strand invasion), resulting in an arrangement that is similar to the crossover events that occur in meiosis. Indeed, the final stage in this process is a pair of Holliday junctions that can resolve either with or without crossover.
† Recombination hot-spots are observed, but except for the specialized system in the immunoglobulin region, these do not have any relation to protein structure.
Figure 2 Misaligned crossover events, potentially causing duplications.
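The length change produced by a misaligned crossover can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not a model of the events drawn in Figure 2: each chromosome is a list of six labeled segments (the labels `m1` to `f6` and the helper names `crossover` and `positions` are hypothetical), and a crossover is reduced to swapping everything downstream of the two break points, with gene conversion ignored.

```python
def crossover(mother, father, i, j):
    """Swap the downstream parts of two chromosomes.

    The break lies after segment i on `mother` and after segment j on
    `father` (both counted from 0). Returns the two recombinant products.
    """
    return mother[:i] + father[j:], father[:j] + mother[i:]


def positions(product):
    """Strip the parental letter and keep only the segment numbers."""
    return [int(label[1:]) for label in product]


# Hypothetical six-segment chromosomes (assumption for illustration only).
mother = ["m1", "m2", "m3", "m4", "m5", "m6"]
father = ["f1", "f2", "f3", "f4", "f5", "f6"]

# Correct registration: both products carry each segment exactly once.
aligned = crossover(mother, father, 3, 3)

# Registration off by one segment: one product gains a copy of segment 4,
# the other loses it.
misaligned = crossover(mother, father, 4, 3)

for name, pair in (("aligned", aligned), ("misaligned", misaligned)):
    for product in pair:
        print(name, len(product), " ".join(product))
```

In this sketch the misaligned exchange yields one product with seven segments, in which segment 4 is present as a tandem pair (`m4 f4`), and one with five segments that lacks segment 4 altogether, mirroring the longer and shorter chromosomes described in the text.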
They might then eventually be cut loose as a distinct protein by some future gene rearrangement. The incorporation of intergenic regions into a new open reading frame will also result in a loss of reading frame between the two genes, assuming that matching intron splice sites are not copied. Given that there is no mechanism to synchronize these frames, loss of coherence would be expected in two out of three events. As suggested above, translation of the downstream gene would produce a “random” amino acid sequence until a STOP codon was encountered by chance. Such a novel gene product might survive by chance on the back of its neighbor, but it is not impossible to imagine that the random reading frame could acquire some structure and perhaps function. Given the nature of the genetic code, a shift in reading frame does not generate a completely random sequence. This simplified survey of genetic mechanisms has shown that there are ways to create fully or partially duplicated genes (or adjacent gene fusions) that lie in tandem. Despite extensive research into the fate of duplicated genes at the population and evolutionary levels, the importance of the various contributing factors that are involved in the initial generation of tandem duplicated copies remains poorly quantified. The relative contributions of errors in double-stranded break repair by homologous strand invasion relative to simple reannealing, both of which can occur in the diploid germ cells, will affect the balance between gene expansion and contraction, and both may provide different contributions in different species. In eukaryotic species, the balance
DUPLICATED PROTEINS
137
Figure 3 Overwritten stop codons: leading to fused transcripts.
of these relative to homologous recombination is largely unquantified, and it may be difficult to separate the events, as double-stranded breakage can act to promote recombination. In prokaryotes, the balance of processes will be different again as the promotion of recombination through haploid generation will be absent, with other specialized mechanisms coming into play. Other complications include the influence of introns (Whamond and Thornton, 2006) either with a passive role in facilitating duplication events (Street et al., 2006) or with a possibly more active ancient past (deRoos, 2005). It must also be remembered that the genes we observe are those that have become fixed in a population, implicating an entire variety of new factors, such as population size (Lynch et al., 2001; Lynch, 2002). This may explain why duplication is more common in eukaryotes than in prokaryotes, which typically have much larger populations and much shorter generation times.
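The expectation that two out of three fusion events lose frame coherence can be illustrated with a short sketch. It assumes, hypothetically, that the length of the incorporated segment is equally likely to fall in each residue class modulo 3, and it ignores splicing; the function names and the range of lengths are illustrative only.

```python
def downstream_in_frame(linker_length: int) -> bool:
    """True if the downstream gene keeps its original reading frame.

    linker_length is the number of nucleotides lying between the overwritten
    STOP codon of the upstream gene and the first codon of the downstream gene.
    The frame is preserved only when that length is a multiple of three.
    """
    return linker_length % 3 == 0


def fraction_out_of_frame(lengths) -> float:
    """Fraction of linker lengths that shift the downstream reading frame."""
    lengths = list(lengths)
    shifted = sum(1 for n in lengths if not downstream_in_frame(n))
    return shifted / len(lengths)


if __name__ == "__main__":
    # Every linker length from 1 to 300 nt, each taken as equally likely.
    print(fraction_out_of_frame(range(1, 301)))
```

With the lengths spread evenly over the three residue classes, the shifted fraction is two thirds, which is the figure quoted in the text.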
3 DUPLICATED PROTEINS
Following a tandem duplication there are several structural possibilities, depicted in Figure 4. If a domain that does not form a homodimer is fully duplicated, the two copies may remain as independent folding units with no structural association (“beads on a string”). Duplication of a single domain that does homodimerize can form a fused homodimer (pseudodimeric domains) as a single unit or in pairs. If the functional residues in the protein become distributed between the two subunits, neither copy alone would be functional and the situation becomes one of inseparable domains. The evolutionary origin of this is particularly obscure if the functional contributions of the two domain copies become asymmetric and the original single-chain ancestor becomes extinct. Additional embellishments to the resulting fold by deletions or insertions and frozen domain swaps or deletions to particular parts lead to the generation of new domain structures, which now contain only a relic of the original duplicate as part of the core, and the evolutionary path to the modern fold is all but undetectable (entangled domains). Circular permutations can also develop, obscuring the traces of evolution still further, and this can in rare cases lead to the formation of topological knots. We deal next with each of these possibilities.
Figure 4 Structural options for tandem duplicates: (A) beads on a string, illustrated with the structure of titin (2r15); (B) pseudodimer, illustrated with the archaeal histone structure (1f1e); (C) domain swap, illustrated with the structure of cyanovirin (1l5b); (D) inseparable domains, illustrated with the structure of aspartate protease (1e81); (E) entangled domain, illustrated with the structure of myoglobin (101m). In each case the first domain is shown in blue, the second in red. (Figures created in RasMol.) (See insert for color representation of the figure.)
3.1 Beads on a String
The simplest but least interesting structural option for duplicated domains is to remain as isolated folding units. These might derive selective advantage by all the methods available to isolated gene copies, such as increased binding or enzymatic turnover, but have the additional option of producing a bivalent molecule. This confers the novel function of being able to cross-link, and although such a function could be contained in a dimer (from a single gene copy), the incorporation of a covalent bond into the link provides additional strength without the complexity of introducing explicit cross-links (such as disulfide bonds or shared metal-ion coordination).
Extensive tandem repeats are commonly found in proteins that perform a structural function, such as providing extensions for receptors to allow intercellular contact (e.g., ICAMs), or most dramatically, in the 1000 (mixed) repeats in titin that are required to bridge the sarcomere length. Such multiple repetition merges, at smaller repeat lengths, into the more classic fibrous class of protein, ranging from spectrin or the ankyrins associated with the cytoskeleton to the tripeptide repeat of collagen.
3.2 Pseudodimeric Domains
If the original protein undergoing duplication forms a functional homodimer, then given a linker of sufficient length to achieve the same orientation of the two domains, the two domains in the fusion protein will interact in the same way, giving rise to proteins known as pseudodimers. We can estimate the frequency of these events by looking at a nonredundant set of proteins containing homologous duplicate domains of known structure. From such data we find that in roughly 70% of proteins containing exactly two domains the pair is very unlikely to be homologous (not sharing a SCOP class), with about 30% of fusions containing repeated domains (with the same fold, superfamily, or family designation). Although there are several ways to try to ensure that the sample is less biased than the PDB as a whole or the SCOP classification, three different methods produce similar results: using SCOP as a whole, using a 25% nonredundant set of chains, and using a 40% nonredundant domain set. However, in this as in most other cases, we
cannot distinguish between fixed tandem duplicates and more complicated evolutionary events, but tandem duplication is the most parsimonious explanation.
Why do such fusions become the dominant form in some cases? The principal reason must relate to the molecular function of the protein. Binding sites and the active sites of enzymes are typically associated with a depression (cleft or hole) on the surface of a protein. This allows functional side chains to be brought into contact with the substrate from a variety of angles, enabling the development of increased binding specificity and catalytic options. A natural source of a binding cleft is in the region between protein molecules, either between two subunits or between two domains of a larger single protein. A homodimer has the disadvantage that a site formed between two monomers will have constraints imposed by symmetry. As most substrates will be asymmetric, this can create problems, as a residue change to improve binding to one part of the substrate will create a symmetric change in the other copy that may be disruptive. This effect can be avoided by moving the site away from the molecular twofold axis, but symmetry then implies that this will create a second site, which may not always be desirable. Subtle effects may come into play, as, for example, in the triose phosphate isomerase dimer, in which only one site is active. The symmetry can be broken in this situation by the binding of the substrate itself with communication through to the other site. An alternative is to make use of a heterodimer to avoid the constraints of symmetry, and it is not obvious that two covalently linked domains should be any different from an equally diverged pair of proteins that constitute a heterodimer.
The constraint to produce stoichiometric equivalents might provide some advantage to the linked domains, as they are literally tied to be in the correct 1:1 ratio, but two copies under one operon would also give good control over the heterodimer composition. Perhaps the only difference is that if the domains or dimers are not greatly diverged, the heterodimer will be prone to create two less productive homodimer variants. The complexity of adding constraints to the sequence to avoid homodimer formation could hand the advantage to the covalent domain linkage.
The relative energetics of the two systems also plays a part, as it is often required that the two sides of an active site cleft should be relatively free to move in a hingelike manner to facilitate entry of substrate and release of product. To make a heterodimer interface specific (over homodimer alternatives) requires additional interactions that would militate against flexibility. By contrast, the strong flexible covalent domain linkage is well suited to allow relative motion. In addition, an entropic component must be considered. The formation of a dimer (homo or hetero) corresponds to a great decrease in the potential number of states (entropy) of the system. The two domains in the duplicated gene are already confined and do not have such a large degree of freedom to lose on adopting their functional form.
Protein structures that function as homodimers would be the most likely candidates for gene duplication into a fused protein, as they have already evolved complementary interacting surfaces. The dimers most susceptible to duplication and fusion would be those in which the two ends to be joined (the N-terminus of one subunit with the C-terminus of its symmetric half) lie close together. Without this, some unwinding of the chain at each terminus would be necessary, or an additional linking segment would be needed. Both would give rise to new interactions, with the probability of these being unfavorable.
A direct implication of this is that the remaining free ends (now the termini of the fused-gene product) must, because of the twofold symmetry, lie close
together. Interestingly, this would explain the frequent proximity of termini in protein domains (Thornton and Sibanda, 1983), a phenomenon that is largely unexplained by other effects.
3.3 Inseparable Domains
Some dimerlike domain pairs are so closely linked that it is difficult to say whether they should be treated as a single domain. This becomes a particularly difficult case to call when the domains have never been observed in isolation, since under some definitions being observed in isolation is a requirement for being considered a domain. Well-known examples of this type include the β-barrel folds of the serine and aspartyl proteases. Higher-order repeats have also been observed in repeat proteins such as the β-propellers.
Proteases
The aspartyl protease family has a fold consisting of two remotely related domains with considerable differences in loop lengths and subdomain packing (Figure 5). Each domain, however, contributes an aspartic acid to the active site, which in high-resolution structures can be seen to have an almost exact twofold (180°) relationship (Blundell et al., 1990). In addition, the same twofold axis corresponds closely to the symmetric relationship of the two domains, suggesting a precursor molecule that functioned as a dimer (Tang et al., 1978). For many years the double-domain form was the only known structure, but with the sequencing of the HIV genome, a possible aspartic protease active site was identified (Toh et al., 1985) and shown to be consistent with an intact half-domain (Pearl and Taylor, 1987). It was predicted that this would form a dimeric enzyme in the virus, as was later shown to be so by x-ray crystallography (Wlodawer et al., 1989). Given the uncertainties associated with viral origins and evolution, it is difficult to argue that the monomer was the ancestral form of the protein that duplicated in the distant past. It might equally well be argued that pressure of viral genome size induced a shift in the reverse direction to create a monomeric protein.
Figure 5 Aspartyl protease 1e81. The two halves are colored red and blue to distinguish them; the two active-site aspartic acids are shown in a lighter color. (See insert for color representation of the figure.)
A similar situation is found in the other large protease family, the serine proteases. In this family, two six-stranded barrels pack together to create an active site for the protein, but despite the family being very widespread, no dimeric equivalent has yet been seen. This does not mean that one will never be found, but as the number of sequenced genomes increases it becomes increasingly less likely. A reduction to a monomeric form seems possible in the aspartyl proteases because of the high symmetry in the active site between the two aspartates, whereas the serine proteases have an asymmetric catalytic triad, consisting of histidine, aspartate, and serine residues, which would not easily allow an equivalent reduction.
β-Propellers
β-Propeller structures are stacked arrays of β-sheets in which the edges of the sheets form a hub from which the sheets radiate (Figure 6). Because of the twist of the β-sheet, this gives the appearance of a ship’s propeller (Murzin, 1992). Over the years, proteins with different numbers of sheets have been found and there are now structures displaying all sheet numbers from four to eight. Each sheet in the propeller is self-contained and there is a clear sequence motif, implying that the structure arose through tandem duplication of a single sheet. Depending on the family, the repeats are variously referred to as the WD40 motif (after conserved tryptophan and aspartate residues in the motif of 40 residues) or the Kelch repeat. A recent, thorough analysis of the known members of the family has concluded that these proteins have a very active recent evolutionary history in which an entire protein has evolved from duplication of a single repeat (Chaudhuri et al., 2008).
The β-sheet blade of the propellers has never been observed in isolation; attempts to produce single propellers artificially have found that some regions on either side of the blade are also necessary (Yadid and Tawfik, 2007). This may help to explain the mechanism by which the propellers are closed, although it is also possible that the sequence signal for stability of an independent blade has disappeared, since in the context of a multibladed protein there would be no selection pressure to maintain it.
RNA Polymerase
Unlike that of any viral RNA polymerase, the structure of the RNAi polymerase (Salgardo et al., 2006) is distantly homologous to the DNA polymerases, suggesting that their common structure may have been the ancestor polymerase before the shift from RNA to DNA as the prime genetic material [see also Jones (2006)]. The catalytic core in this polymerase is a pair of double ψ β-barrel domains, which by their duplication must also have had an even earlier single-domain dimeric form. This would suggest a common ancestor perhaps over 3.5 billion years ago at the earliest boundary of cellular life. The core domains of the RNA-dependent RNA polymerase provide another example of two core-conserved β-barrel domains that have never been observed apart.
3.4 Beta/Alpha Class
The β/α class of proteins contains more folds than the all-β class and all-α class together, and not surprisingly, tandem duplications are very common.
Figure 6 Beta-propeller structures with (A) four, (B) five, (C) six, and (D) seven blades: structures 1hxn, 1tl2, 1f8d, and 2bbk, respectively. (See insert for color representation of the figure.)
Rossmann Fold
The Rossmann fold (Rao and Rossmann, 1973), which is found extensively throughout dinucleotide-binding proteins, consists of two subdomains arranged about a pseudo-twofold axis, often corresponding to the center of the binding cleft. It has been argued that this fold is an example of an ancient duplication event bringing about the fusion of two mononucleotide-binding domains. Each individual domain consists of three β/α units in tandem, and although this is a common enough motif in other proteins, it is not seen as an isolated domain.∗
∗ The original Rossmann fold was half a dinucleotide-binding domain. It is used here to refer to the double fold, which constitutes an intact domain.
TIM Barrel
One of the most ubiquitous β/α folds is the eightfold β/α barrel, typified by the enzyme triosephosphate isomerase (TIM) (Banner et al., 1975). The high symmetry of the eightfold barrel has been linked to an ancient duplication event, and there is evidence that there is a preferred relic twofold associated with the two barrel halves. Unlike the all-β barrel folds discussed above, which would retain their individual barrel topologies if split in two, half a TIM barrel would appear to be less structurally stable. There are some candidates for a half barrel, but it is difficult to assess whether these really have any evolutionary connection to the true barrels or are just alternative simple structural solutions that have been arrived at independently. This problem will be returned to below.
4 ENTANGLED DOMAINS
4.1 Domain-Swapped Dimers
In true dimeric interactions involving proteins with multiple domains, the packing between the domains within each chain forms identical contacts (as they must, being dimeric). If the linker between domains is sufficiently long, this allows domains to be swapped between monomers. Because of the equivalence of the two sets of interactions, the only determining factor in whether this rearrangement occurs is the length of the connecting loop and any “spurious” interactions it might make in facilitating the exchange (Bennet et al., 1995). The balance is so subtle that even different crystal forms of the same protein can be domain-swapped. In some systems, such as cyanovirin, the balance can be controlled by external conditions, producing either swapped or unswapped dimers (Barrientos et al., 2002). After tandem duplication, if a dimeric protein had previously been able to form a domain swap, this can become “frozen” into the new protein structure as diversification of the copies shifts the previously fine balance away from equilibrium. This has been postulated to account for the fold of a number of proteins that appear to contain domain-swapped relics.
Histone Fold
The histone fold consists of three α-helixes: two short and one long. Functional histones bind DNA as octamers, which ultimately consist of heterodimers with the basic histone fold. Since the hydrophobic patches required for assembly of both the heterodimeric and tetrameric forms are exposed, there would be no means to prevent a much larger superassembly from forming. Thus, some copies of these genes do not have these parts and are used for chain termination. In most organisms these are found as separate copies, but in some archaeal organisms, such as the methanomicrobia, only a single histone gene is found that is a pseudodimeric fusion of a dimerization-competent and a dimerization-incompetent gene (Sandaman and Reeve, 2006). The identification of a helix–strand–helix motif that is common to the histones and other ancient proteins with related functions suggests that the original core protein consisted only of this short motif, which then underwent a series of duplications combined with domain swapping. The problem with this and similar analyses is that it is impossible to verify these relationships using sequence data, as the original events must have occurred long ago. Even structural similarity, which has the capacity to probe more deeply back in time, cannot be considered significant when the core motif is simply a pair of common secondary structures. Nevertheless, it is useful to see that, in principle, these basic folds can be related by a simple mechanism, irrespective of whether this reflects their true history.
Globin Fold
The globin fold might also be explained as a domain-swapped relic. The α-helixes that constitute the globin fold can be labeled A to H, but one pair of these (CD) is small and poorly conserved and can be neglected, leaving six major helixes, of which the EF pair forms the main binding cleft for the heme group, and each contributes one of the coordinating histidines. The separation between the E and F helixes required to fit the heme makes it unlikely that this pair was ever an independent protein with any structure in the absence of a heme. The existence of a protoglobin based around the EF pair is supported circumstantially by the observation that these helixes constitute a separate exon and that the EF pair may correspond to a common heme-binding ancestor of both the globins and cytochromes (Craik et al., 1981).
Addition of sequence to the amino and carboxy ends of the EF core could convert the fold to a Greek-key four-helical bundle and, through further additions, into the modern fold. If such accretion occurred, it would predict that the Greek-key fold should remain as a core in the modern fold, but there is no nucleation center that would give rise to this motif. An alternative explanation of how the fold might have evolved can be based on the rough twofold symmetry of the globin fold. The globin fold can be viewed as a short segment of a double helix of α-helixes, which is more easily seen in greatly simplified representations. This equates the BE hairpin with the GH hairpin and places the A and F helixes in equivalent positions across the end of each hairpin (Figure 7). A superposition of the myoglobin halves based on this correspondence superposes 60 residues in each half within 4.5-Å RMSD. With a limited number of helixes, this correspondence might well be due to chance; however, additional support can be found in folding studies, which have predicted the early formation of the BE and GH hairpins (Ptitsyn and Rashin, 1975; Bashford et al., 1988). This model implies that the protoglobin was equivalent to half the modern fold but with the difference that a single heme would be held between a dimer, with coordination just to the B helix. The location of the A and F helixes relative to this core would suggest that their positions were swapped across the dimer interface, possibly being acquired as later stabilizing additions to the core. Tandem duplication of this dimer would produce the modern fold except that the heme coordination would need to shift from the H to the F helix. As predicted by Go (1978), a third intron was found in the gene of a leghemoglobin (Jensen et al., 1981) that splits the E and F helices. It could be argued that this exon junction is the oldest and has been lost in the other globins, but more recent observations indicate that intron gain is more prevalent than was originally believed, making this argument considerably weaker without the support of a quantitative evolutionary model.
Figure 7 Proposed duplicated structural relics in myoglobin.
Proteasome Fold
The αββα-layer protein 1ryp1 from the proteasome contains an internal structural duplication. This symmetry runs through three of the four layers of secondary structure, and although it is not clear to the eye, it was identified as a twofold repeat using the SAPit program. The sequence identity over the repeats is less than 10%, which would not be detected by any sequence-based method. A nontrivial set of relationships has been analyzed for a small family of proteins with the αββα-architecture called the DOM-fold (Cheng and Grishin, 2005). The structure of the molybdenum cofactor binding protein (1jroB) in this family can be explained by a series of two duplication events with domain swapping from a core αβ-domain which has the same architecture as the αβ-layers in the other members of the family. At the sequence level, these αβ-subdomains can be aligned with two other members of the family, giving a good correspondence of secondary structure elements and hydrophobic positions but no highly conserved residues. Similar highly divergent relationships can also be seen in smaller all-β folds (Theobald and Wuttke, 2006).
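To make the notion of sequence identity between two repeats concrete, a minimal sketch of percent identity over a pairwise alignment follows. The aligned strings are invented for illustration and are not the proteasome repeats; gap columns are excluded from the denominator, which is only one of several conventions in use, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


if __name__ == "__main__":
    # Invented toy alignment: 9 ungapped columns, 7 of them identical.
    print(round(percent_identity("ACDEFGHIKL", "ACDQ-GHWKL"), 1))
```

Because the value depends on the alignment and on the choice of denominator, identities quoted for highly divergent repeats are best compared only when computed in the same way.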
4.2 Membrane Proteins
Extensive duplication and subsequent modifications have resulted in many membrane proteins being functional heteromers of subunits which are only minor modifications of one another (the most obvious examples being the very large and ubiquitous families of voltage- and ligand-gated ion channels). In some cases the subunits are found in separate genes and in others proteins containing several subunits within a single transcript are observed (Yu and Catterall, 2003). These may be the consequence either of tandem duplication events or fusions of paralogous proteins after some divergence has occurred. The recent accumulation of high-resolution three-dimensional structures across several membrane protein families has led to several interesting observations of symmetry, which build on the observations described above.
Structural Observations of Symmetry in Membrane Proteins
Since high-resolution structural data have become available for several families of helical membrane proteins, there have been many observations of twofold pseudosymmetry in the orientations of regions of the tertiary structure. The first such observation was made in the aquaporin family of water-transporting pores. These are part of the large Major Intrinsic Protein (MIP) family, whose members generally have six transmembrane (TM) segments. It was recognized early that a conserved motif (GAXNPAX[ST][AG]) occurs twice in the sequences of several family members, suggesting a duplication in the sequence (Wistow et al., 1991). Subsequently, with larger numbers of sequences available, it was found that a second motif, known as the AEF-box, could also be suggested to have been duplicated, although in the second repeat the motif has degenerated to a conserved glutamate (Zardoya and Villalba, 2001). A more recent analysis has broadened the scope of this motif to the closely related glycerol transporter family (Zardoya, 2005).
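A search for a motif that occurs twice in one sequence can be sketched with a regular expression. The sketch assumes that X in the motif stands for any residue and that bracketed groups list alternatives; the test sequence is invented and is not a real MIP family member, and the function names are hypothetical.

```python
import re


def motif_to_regex(motif: str):
    """Compile a motif written with X for 'any residue' into a regex.

    Bracketed alternatives such as [ST] are already valid regex syntax.
    This simple conversion assumes X never appears inside brackets.
    """
    return re.compile(motif.replace("X", "."))


def motif_starts(sequence: str, motif: str):
    """Return the 0-based start positions of non-overlapping motif matches."""
    return [m.start() for m in motif_to_regex(motif).finditer(sequence)]


if __name__ == "__main__":
    # Invented sequence carrying two copies of the motif.
    toy = "MSKLLGAHNPAVTGWWLVFIAGASNPARSAELLK"
    print(motif_starts(toy, "GAXNPAX[ST][AG]"))
```

Two hits in one sequence are only suggestive of a duplication, since a short functional motif could in principle arise more than once.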
Interpretation of the meaning of these repeated motifs is not entirely straightforward, however, since they are directly involved in the function of the channel and may possibly be the result of convergence (Zardoya and Villalba, 2001). Regardless of whether the sequence evidence was indeed interpreted correctly, the publication of high-resolution structures for aquaporin (Fu et al., 2000; Murata et al., 2000) showed strong evidence that the receptor could be separated into two 3-TM halves, which were rotated copies of one another and validated the symmetrical location of the functional residues observed from sequence analyses. This strongly implies that an ancient duplication event occurred in the evolution of this “superfamily” of membrane proteins. The structures of ClC chloride channels from Salmonella enterica typhimurium and Escherichia coli also show patterns consistent with duplication and structural embellishment: these channels use 16-TM regions to span the membrane with a twofold pseudosymmetry apparent between two subcomponents of 7-TM regions, the other two presumably having arisen subsequently (Dutzler et al., 2002). In a publication describing the structure the authors observe that some weak sequence similarities exist between the putative duplicated regions, but in the absence of structural evidence it would be impossible to distinguish these from chance similarity. Similar reports have emerged at almost the same rate as new membrane protein structures. The list presently includes the antiporter from E. coli (Hunte et al., 2005), the BtuCD vitamin B12 transporter from E. coli (Locher et al., 2002), the bacterial homolog of the human neurotransmitter uptake proteins (Yamashita et al., 2005), the
Sec61/SecY-facilitated diffusion channel for protein translocation (Van den Berg et al., 2004), and members of the hydroxycarboxylate family of secondary transporters. In the latter case there is also substantial sequence-based evidence for the existence of a duplication event in the ancestor of the 12-TM family members (Lolkema et al., 2005; Sobczak and Lolkema, 2005) and structural information by direct assessment of membrane topology (Saaf et al., 2001).
Dual-Topology Membrane Proteins
Clearly, the observations above suggest that duplications or fusions have occurred commonly in the evolution of membrane proteins. However, a major difficulty with this is that in many cases the repeated segment contains an odd number of transmembrane helices arranged in an antiparallel fashion for a twofold pseudosymmetrical relationship. If a transcript containing this repeat has evolved by duplication, this implies an ancestor that was indifferent to its membrane orientation. The observation by von Heijne (1992) that positively charged amino acids are found in loops on the inside of the cell would predict that this could occur only if the ancestral protein had an equal distribution of such amino acids on either side. A recent survey of the membrane topologies of the E. coli membrane proteome (Daley et al., 2005) using a dual-reporter assay to determine the location of the termini found a candidate for this, a protein known as EmrE. This is a four-helix protein that is responsible for the efflux of a variety of toxins in E. coli and many other bacteria. As such, it is extremely interesting as a potential drug target and has been the subject of a considerable amount of biochemical experimentation. The results of the global study indicated that EmrE could adopt either of two topologies, with the N-terminus inside or outside the cell.
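The charge argument above can be expressed as a simple bias score. The sketch assumes that the loop sequences have already been delimited and that consecutive loops lie on opposite sides of the membrane; the loop strings, the threshold of 2, and the function names are hypothetical choices made for illustration, not values taken from the studies cited.

```python
POSITIVE = set("KR")  # lysine and arginine


def count_positive(segment: str) -> int:
    """Number of Lys and Arg residues in a loop segment."""
    return sum(1 for residue in segment if residue in POSITIVE)


def charge_bias(loops) -> int:
    """K+R count in even-indexed loops minus K+R count in odd-indexed loops.

    loops lists the loop sequences from N- to C-terminus; consecutive loops
    are taken to lie on opposite sides of the membrane.
    """
    side_a = sum(count_positive(loop) for loop in loops[0::2])
    side_b = sum(count_positive(loop) for loop in loops[1::2])
    return side_a - side_b


def predicted_inside(loops, threshold: int = 2) -> str:
    """Guess which set of loops faces the cytoplasm (arbitrary threshold)."""
    bias = charge_bias(loops)
    if bias >= threshold:
        return "even loops inside"
    if bias <= -threshold:
        return "odd loops inside"
    return "no clear preference"


if __name__ == "__main__":
    print(predicted_inside(["MKR", "DGS", "KRK", "EN"]))  # strongly biased toy case
    print(predicted_inside(["MK", "GR", "SK", "TR"]))     # balanced toy case
```

A protein scoring near zero on such a measure would be a candidate for the topologically indifferent behavior described for EmrE, although a real analysis would need a proper topology prediction rather than hand-assigned loops.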
On the basis of this result and an earlier study (Saaf et al., 1999), the authors proposed that EmrE forms a functional homodimer in an antiparallel organization, with the two subunits in opposite membrane orientations (Figure 8). Although this was a controversial suggestion, later structural studies were consistent with this model (Tate et al., 2001; Ma and Chang, 2004; Pornillos et al., 2005) and the authors followed this up with a second assay on EmrE aimed specifically at testing the proposal that it could form a functional heterodimer, which was found to be positive (Rapp et al., 2007a). However, the controversy has continued since it has been argued that their assay may disrupt formation of the native transmembrane topology (Schuldiner, 2007a) and that earlier biochemical results contradict these findings. The retraction of the x-ray structures that same year (Chang et al., 2006) added some weight to these contentions (Schuldiner, 2007b), although the authors argued that the finding of duplicate copies with opposite topologies and concomitant charge distributions in the same family presented a strong case for this mechanism (Rapp et al., 2007b). Subsequent biochemical studies have been reported that both support (Nara et al., 2007; Lehner et al., 2008) and refute (Steiner-Mordoch et al., 2008; McHaourab et al., 2008) the proposition that EmrE functions as an antiparallel dimer, leaving the question without definitive resolution for the present. Publication of the corrected x-ray structures has also been found to favor the antiparallel organization (Chen et al., 2007), but it remains difficult to reconcile the conflicting nature of the evidence in this specific case. Another, indirect line of argument advanced by the authors is that Bacillus subtilis encodes two homologs of EmrE, ebrA and ebrB, on the same operon. These are 4-TM proteins with charge distributions consistent with opposite topologies. They
ENTANGLED DOMAINS
I
II
III
N
I
IV
II
III
IV
C
N
C
151
(A)
N
I
N
II
III
IV
C
I
II
III
IV
C (B)
Figure 8
(A) Parallel and (B) antiparallel dimer configurations of EmrE.
therefore propose that this is a case where a duplication has occurred, with the copies being coexpressed to form the functional transporter. Since EmrE has four helixes, it would require an extra helix to adopt an antiparallel membrane orientation, which suggests that separate copies in the same operon would be a more likely route. Another E. coli protein identified by Rapp et al. belonging to the PFam DUF606 family has been found to have several homologs which are fusions of two copies of the 5-TM domain in either order, each copy having adopted a charge bias opposite to that of the other (Lolkema et al., 2008). Whatever the status of EmrE in particular, there is now a large body of evidence that may support the general evolutionary mechanism proposed by Rapp et al. (2007a) (Figure 9). As described above, it has frequently been observed that the two parts of the transmembrane structure of a helical membrane protein are related by twofold pseudosymmetry. Although in general the evolutionary mechanisms that have actually occurred are at present invisible, having taken place very early, this strongly suggests that the proteins are either fusions of separate genes or tandem duplicates. Generality of Dual-Topology Fusions So far the majority of evidence for ancestral fusions by “flip-flopping” membrane proteins has been found in proteins with
Figure 9 Evolution of topological preferences. A possible chain of events leading to pseudosymmetric membrane proteins is shown. An ancestor with unbiased positive charge distribution (A) duplicates and can adopt either of two topologies (B). Selective loss of positive charges from one set of loops can then lead to a choice of one or the other topology (C). Panel labels: (A) topologically indifferent ancestor; (B) topologically indifferent duplicate; (C) duplicates with defined topology.
transporter functions. Whether or not these events have happened in other large membrane protein families, such as those with receptor or enzymatic functions, is not apparent. An argument for ancient duplication in the formation of the very common seven-transmembrane architecture, which would account for similarities between bacteriorhodopsin and eukaryotic G-protein-coupled receptors, was advanced many years ago (Taylor and Agarwal, 1993); the publication of the rhodopsin structure has been suggested to offer some support for this (Palczewski et al., 2000). However, whether this represents duplication or exon shuffling remains unclear. Although the fusion of a repeated polytopic motif with no strong topology preference may yet prove to underlie the evolution of many or most families of helical membrane proteins, it is possible that it has only been used where there is a clear functional advantage for doing so. A recent article raises this possibility: The serotonin transporter SERT is proposed to have a mechanism in which it alternately exposes its 5-HT binding site to either side of the membrane, thereby enabling controlled cross-membrane transport. On the basis of the crystal structure for the bacterial homolog, LeuT, the authors argued that orientation of the symmetric parts of the structure can be used to deduce the residues that would be exposed to the cytoplasm following the conformational change, which is not observed in the structure. Cysteine mutagenesis followed by labeling to determine accessible residues on the cytoplasmic face of the transporter supports this prediction (Forrest et al., 2008). If this observation stands and similar observations are made for other transporters, it may prove that the large number of such symmetries observed could be selected for functional reasons. 
Alternatively, it is possible to argue that this mechanism is quite general, but symmetry has been maintained only in cases where it is functionally useful, although how such a hypothesis would be tested is difficult to imagine.
The absence of observations of potential fusions for some families may also be of interest. Given the growing evidence that the extremely large and diverse family of G-protein-coupled receptors are functional dimers, it is curious to note that no GPCR fusion proteins have been observed. It is not possible to preclude their existence, but it is possible to explain this observation and predict that it will remain the case for two reasons. First, the argument applied by Rapp et al. also applies to these proteins, albeit in reverse: GPCRs dimerize in a parallel arrangement, but since they have an odd number of transmembrane segments, to maintain their topological relationships they would require an extra membrane-spanning region. This does not rule out their existence (see above); however, it makes finding them more difficult. A second possible objection is that it may be necessary (or convenient) to maintain separate copies to allow for a large number of possible dimerization events to occur, rather than constraining interaction partners by fusion. Nonetheless, so far, this has not been studied explicitly. The existence of duplicate repeats in membrane proteins has reached the status of general acceptance given the number of published structures, which cover many apparently unrelated lineages. This accords well with the observations already discussed for globular proteins, since it would be difficult to justify the existence of separate mechanisms in the two cases; although embedding a protein structure into a membrane introduces additional constraints, it remains a protein nonetheless and is subject to the same general rules of evolutionary change. To what extent the dual-topology evolutionary trajectory will be supported by future observations is, of course, uncertain, but at present it holds a great deal of promise as an explanation for how modern membrane protein folds came into existence. 
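The topological arguments in this section can be made concrete with a small sketch. It assumes the positive-inside rule (von Heijne, 1992), under which loops rich in lysine and arginine tend to stay on the cytoplasmic side, and it treats every TM helix as a single membrane crossing. The function names, the scoring by raw K/R counts, and the loop sequences are illustrative assumptions and are not the procedure used by Rapp et al.

```python
POSITIVE = set("KR")  # assumption: only Lys and Arg are counted; His is ignored


def charge_bias(loops):
    """Return (even, odd) K+R counts for the two sides of the membrane.

    `loops` lists the non-membrane segments in sequence order: the N-terminal
    tail, each inter-helix loop, and the C-terminal tail. Successive segments
    lie on alternating sides, so even-indexed ones share the N-terminal side.
    """
    even = sum(res in POSITIVE for loop in loops[0::2] for res in loop)
    odd = sum(res in POSITIVE for loop in loops[1::2] for res in loop)
    return even, odd


def predicted_n_terminus(loops):
    """Predict the side of the N-terminus: 'in', 'out', or 'indifferent'."""
    even, odd = charge_bias(loops)
    if even > odd:
        return "in"  # positive loops on the N-terminal side: N-terminus cytoplasmic
    if odd > even:
        return "out"
    return "indifferent"  # no bias, as for the ancestor in Figure 9(A)


def fusion_needs_extra_helix(n_tm, arrangement):
    """True if fusing two n_tm-helix units requires an additional TM helix."""
    if arrangement not in ("parallel", "antiparallel"):
        raise ValueError("arrangement must be 'parallel' or 'antiparallel'")
    c_terminus_same_side = n_tm % 2 == 0  # an even helix count returns to the start side
    second_unit_same_side = arrangement == "parallel"
    return c_terminus_same_side != second_unit_same_side


# Hypothetical loop sequences for a 4-TM protein (five non-membrane segments).
print(predicted_n_terminus(["MKRK", "GS", "RKK", "DE", "KR"]))  # in
print(predicted_n_terminus(["MS", "KRK", "DE", "RK", "GS"]))    # out
print(fusion_needs_extra_helix(4, "antiparallel"))  # True: the 4-TM EmrE case
print(fusion_needs_extra_helix(7, "parallel"))      # True: the 7-TM GPCR case
print(fusion_needs_extra_helix(5, "antiparallel"))  # False: the 5-TM DUF606 case
```

The parity check reproduces the three cases discussed in the text; the charge count is only a crude proxy for topology and ignores loop length and other determinants.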
4.3 Knotted Folds
KARI Family
The class II ketol-acid reductoisomerase (KARI) has a 250-residue αβ nucleotide-binding domain at its amino terminus with a large all-α domain following. It was found that in the plant protein acetohydroxy acid isomeroreductase (1yve and homologs 1yrl and 1qmg), this C-terminal domain contains a figure-of-eight knot (Taylor, 2000). The knot is the most deeply embedded known, with an amino-terminal domain on one side and 70 residues trailing on its carboxy terminus. It was proposed that the knotted domain can best be explained by a duplication followed by a helix swap (Taylor, 2000) (with deletion of the second αβ domain). Unlike the dimers discussed above, where it was suggested that short N-C connections would be preferred, the duplicated halves are connected by a long loop through which the C-terminus must pass before folding is complete. A dimeric precursor of the knotted domain can be found in the class I KARI structure (Ahn et al., 2003) (PDB code: 1np3), in which the terminal segments of the monomers in the dimer make a closest approach of just over 10 Å, suggesting that a simple duplication (which requires these ends to join) would be quite viable. The subsequent deletion of one of the nucleotide-binding domains would then leave a larger gap of over 20 Å, but this can be closed easily by remodeling a few of the residues on each terminus at either end of the gap. The resulting (knotted) dimeric fusion has an RMSD with a true knotted domain of 5 Å over 260 residues.
Other duplication, swapping, and deletion (DSD) events have led to a number of related folds, including glycerol-3-phosphate dehydrogenase, 6-phosphogluconate dehydrogenase (PGDH), and similar oxidoreductases (Andreeva and Murzin, 2006),
none of which are knotted. These enzymes, collectively referred to as the PGDH-like oxidoreductases, all contain a conserved nucleotide-binding domain, without which the relationships among the all-α catalytic domains would be very difficult to deconvolute. The all-α domain of PGDH and the corresponding knotted domain in the KARI class II structure both contain a clear internal duplication, but in the knotted domain, the core helixes are swapped across the pseudo-twofold. The relationship between these domains is not just a simple exchange of two helix positions, and there is no single rearrangement that would transform one fold to the other. However, an indirect link can be traced through the all-α dimerization domains of two further dehydrogenases, UDP-glucose dehydrogenase (1dliA, UDPGDh) and GDP-mannose dehydrogenase (1mv8A, GPDMDh), which are related by a swapped pair of helixes (Andreeva and Murzin, 2006). When a dimeric fusion is constructed from the GPDMDh structure, there is a plausible superposition over the core. This comprises the long central helixes at the start of each duplication and two following helixes, but the link between these symmetric halves has no correspondence.
4.4 Cyclic Permutation
The combinations of duplications, deletions, and domain swaps discussed above can make it very difficult to discern the sequence of evolutionary events and obscure relationships between protein families. Another duplication-based mechanism whereby cryptic relationships of this type can evolve is through the cyclic permutation (CP) of a sequence over the structure. Such a shift is difficult to rationalize by any single evolutionary mechanism, but if the fold is duplicated and partially deleted from each terminus, the remaining core appears as a CP. This process appears to be more probable if it is accompanied by domain swapping. Consider a compact domain formed by two subdomains, A and B.
After duplication, a subdomain swap could create a compact domain from the B subdomain of the first copy and the A subdomain of the second, and deletion of the terminal subdomains would leave this domain, in which B precedes A and which would therefore appear as a CP in the sequence. The extent of rotation through the sequence would depend on the relative sizes of A and B.
Globular Proteins
Using a sequence alignment algorithm designed to detect cyclic permutation, Weiner and colleagues (Weiner et al., 2005) describe some novel examples and make the distinction between proper CPs as described above and incomplete CPs, where the deletion has been made at only one of the termini. The permutations examined in this way all occur at the level of intact domains in multidomain proteins, and it is always possible that other mechanisms of gene rearrangement (such as exon shuffling) might produce the same change. When the cyclic permutations occur within a domain, sequence-based methods are often not sensitive enough to detect them. This level of permutation has been investigated using a novel structure comparison algorithm (FASE), with several new examples being reported (Vesterstrom and Taylor, 2006). These include not only new examples of the familiar permutation of strands in a β-barrel architecture but also examples in βα proteins. One has a shift through the topology by one strand position in the β-sheet (32145 to 21345), while the other has the more dramatic shift of strand positions 12534 to 45312. Studies have shown that many possible intermediate steps created by cyclic permutations can be functional, allowing for the possibility that this is a general mechanism by which proteins can evolve different structures (Peisajovich et al., 2006).
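The duplicate-then-delete route to a cyclic permutation can be illustrated with plain strings. This is a minimal sketch, assuming identical copies and exact matching; real homologs diverge, so detection in practice relies on alignment methods such as those of Weiner et al. (2005). The function names and the toy subdomains are hypothetical.

```python
def duplicate_and_trim(domain, start):
    """Tandemly duplicate `domain`, then keep a window of the original length.

    `start` is the number of residues deleted from the N-terminus of the
    tandem repeat; the same total length is removed from the C-terminus.
    """
    if not 0 <= start <= len(domain):
        raise ValueError("start must lie within the first copy")
    tandem = domain + domain
    return tandem[start:start + len(domain)]


def rotation_offset(original, candidate):
    """Return the shift that turns `original` into `candidate`, or None."""
    if len(original) != len(candidate):
        return None
    index = (original + original).find(candidate)
    return index if index >= 0 else None


def is_cyclic_permutation(original, candidate):
    """True if `candidate` is a rotation of `original`."""
    return rotation_offset(original, candidate) is not None


# Toy subdomains of unequal size; the letters stand for residues.
sub_a, sub_b = "AAAA", "BBBBBB"
domain = sub_a + sub_b
permuted = duplicate_and_trim(domain, len(sub_a))
print(permuted)                           # BBBBBBAAAA
print(rotation_offset(domain, permuted))  # 4
```

The offset equals the length of subdomain A, which is the sense in which the extent of rotation depends on the relative sizes of A and B.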
Membrane Proteins
The question of cyclic permutations in the evolution of membrane proteins has received surprisingly little attention. Since these are well known to exist in globular proteins and it is possible for antiparallel copies to fuse, it seems that there is nothing to prevent these from forming should a partial deletion occur following a duplication event. A further piece of supporting evidence is the demonstration that, at least for one superfamily (the GPCRs), it is possible to cut the chain at a certain point and coexpress the two fragments to create a functional receptor (Schoneberg et al., 1995; Ridge et al., 1996), which is also the case for globular domains shown to permute circularly (Carey et al., 2007). Since the topology of the permuted variant with respect to the membrane would be liable to change in many cases as a consequence of changing the charge distribution in the loop regions, it is possible that the range of duplication–deletion pairs which are potentially functional as circularly permuted membrane proteins is narrower than for globular proteins. On this basis, units of two membrane-spanning domains with partial loops would be more likely as a basic unit than single helixes; additionally, where the N-terminal region has accepted other domains that function at a particular location (cytoplasmic or extracellular), permutation is likely to be less probable. However, some proteins (such as the M1 muscarinic receptor from Drosophila melanogaster) are apparently able to accept extremely large substitutions between TM regions, so even this seemingly unlikely possibility cannot be ruled out entirely. One limitation may well be that the termini of the protein need to be close, which would be impossible for proteins with odd numbers of TM regions.
Circular permutants of the β-barrel transmembrane protein OmpX (in which the termini are close) have been used in bacterial display experiments (Rice et al., 2006), demonstrating the possibility in the case of β-barrel membrane proteins. Whether helical membrane proteins will prove to be different seems an interesting unresolved question.
5 INTRINSIC PROTEIN SYMMETRIES
Most of the cases discussed earlier are considered to be the consequence of gene duplication events on the basis that their structures and/or sequences are symmetrical. Although there is frequently other evidence that supports this, it is not clear how strongly the observation of symmetry in protein structures indicates an earlier duplication event. An internal symmetry that appears to have arisen by duplication may be due to intrinsic physical constraints on protein folds as a consequence of preferences for chirality and compactness. It is therefore worth considering the results of analyses of symmetry in protein structures from a theoretical standpoint.
5.1 β/α Proteins
The clear chiral preference in connections between secondary structure units (the connection βαβ is almost never left-handed) can provide a strong bias toward symmetric structures. These structural constraints imply that the different βα folds we see cannot be random, as there will always be the possibility of finding some symmetric arrangements of secondary structures by chance. This is particularly likely when both the
degree of symmetry (two-, three-, fourfold) is not specified beforehand and there is scope to neglect arbitrary “disordered” parts of the structure.
5.2 All-β Proteins
Symmetries can be found in the all-β structure class. Typically, these are seen in structures consisting of a β-sheet (or sheets) with a closed connection forming a barrel structure. If the barrel were opened up (as in a Mercator projection of the world), the whole can be depicted in two dimensions. In this representation, some of the chiral symmetries resemble a decorative motif commonly used in classical Greece, and the arrangement was accordingly named the Greek key (Richardson, 1977). The extension of this spiral has been called a jelly roll and consists of eight strands in a closed barrel, with two connections across the top and two below. It has been suggested that the Greek-key motif (and the jelly roll) might have arisen from the symmetric folding of an elongated hairpin β-structure in the form of a double helix.
5.3 All-α Proteins
Folding symmetries are also found in the α/α class, but their relationship to the local chiral preferences of the substructures is less clear. Much of the apparent symmetry within this class probably results simply from the more limited packing arrangements available with fewer secondary structures. A bundle of four or five helixes will have some regularity almost no matter how they are packed. Symmetry becomes more apparent in the all-α superhelixes and barrels, which have a simple solenoid fold. In contrast to the superhelixes, which have clear sequence repeats and function primarily as structural proteins, the barrels are all enzymes and do not have a repeating motif.
5.4 Fourier Analysis of Structural Symmetry
The cases above have generally been established on the basis of multiple sources of evidence for duplication, including structural symmetry, sequence repeats, conserved positioning of functional residues, and the maintenance of unusual structural features.
Since structural symmetry has in many cases led to the observation of previously undetected sequence repeats and therefore identified plausible new evolutionary relationships involving duplications and fusions, it is interesting to search for structural repeats on a large scale. The SAP program for structural comparison (Taylor and Orengo, 1989; Taylor, 1999) provides the option to align a structure to itself and find self-similarities indicative of symmetry. Using the technique of Fourier analysis, the periodicity of “ridges” that such self-similarities create in the comparison matrix can be identified and used to define repeat boundaries (Taylor et al., 2002). Calibrating this method using real and artificial repetitive proteins and searching a subset of the PDB found a significant fraction (17%) that were highly repetitive, dominated by the β-propeller and TIM-barrel folds. Once obvious sequence repeats are removed from this list, the remainder are almost exclusively globular β/α class proteins. This is a surprising result, as there is no obvious structural reason why these should be more likely to generate such repetitive folds.
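The idea of reading a repeat length from periodic ridges can be sketched with a plain discrete Fourier transform. This is not the SAP implementation or the published method; it assumes a hypothetical one-dimensional profile in which entry i is the summed self-similarity at sequence offset i, so that ridges recur at multiples of the repeat length. The function name and the example profile are illustrative only.

```python
import cmath
import math


def dominant_period(profile):
    """Return the period (in residues) carrying the most spectral power.

    Returns None when the profile is too short or completely flat.
    """
    n = len(profile)
    if n < 2:
        return None
    mean = sum(profile) / n
    centred = [value - mean for value in profile]  # drop the constant term
    best_k, best_power = None, 0.0
    for k in range(1, n // 2 + 1):  # frequencies up to the Nyquist limit
        coeff = sum(
            value * cmath.exp(-2j * math.pi * k * i / n)
            for i, value in enumerate(centred)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return None if best_k is None else n / best_k


# Hypothetical profile: a ridge every 8 residues, repeated six times.
one_repeat = [1.0, 0.6, 0.2, 0.0, 0.0, 0.0, 0.2, 0.6]
profile = one_repeat * 6
print(dominant_period(profile))
```

For this idealized input the strongest frequency corresponds to a period of eight residues. Real profiles are noisy and repeats vary in length, which is why calibration against known repetitive proteins, as described above, is needed.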
One possible explanation of the predominant symmetries of the globular β/α proteins might be based on the relative sizes and degrees of structural freedom that are available to the various supersecondary structure types. All-β proteins have a geometric regularity imposed by the plane of the β-sheet but are otherwise relatively topologically unconstrained, thus giving rise to few symmetries by chance. The all-α protein structures lack the spatial register imposed by a hydrogen-bonded sheet and so will naturally be less symmetric in their packing. However, as the α-helix is a relatively large structure, smaller proteins (with fewer than six helixes) will stand a good chance of having a symmetric arrangement. The β/α unit combines symmetry-inducing attributes of the previous types, having the spatial register of the β-sheet, while being relatively large, so there will not be too many unsymmetric arrangements in a protein of typical size. Alternatively, the reason for this bias might be a result of the evolutionary history of these folds (Phillips et al., 1978; Lupas et al., 2001). The most obviously repeating structures are of relatively recent origin (within the last 500 million years) and so retain their sequence signal, whereas those in the βα class tend to be ancient metabolic enzymes often common to all known life. This suggests that their structural symmetry may be a relic of duplications in the far distant past, far enough back in time that no trace of detectable sequence similarity remains. Such ideas are difficult, if not impossible, to prove (Phillips et al., 1978). However speculative, they nonetheless provide one of the few glimpses into the distant origins of protein structures.
6 DUPLICATE AND DESTROY
The survey above indicates that tandem duplication, optionally followed by partial deletion, is a key mechanism for the generation of structural novelty. Mechanisms such as circular permutation in particular are key to generating new topologies in a nearly neutral fashion. Another mechanism, which is not relevant here but is likely to be of high importance to the question of fold change in structures generally, is the existence of “chameleon sequences,” which equilibrate between two very different conformations and may therefore form evolutionary bridges between apparently unrelated folds (reviewed in Taylor, 2007). Mutations occur at different rates in different proteins (Luz and Vingron, 2006) and at different locations within a protein sequence; the same is true for insertions and deletions. These events can lead to the gradual accretion (Pan and Bardwell, 2006) or embellishment of substructure around a conserved core (Reeves et al., 2006). One major unresolved question in the study of protein evolution is the extent to which these mechanisms have operated in generating the structural diversity that we observe at present. Since “jumps” involving chameleon sequences are clearly possible, it is not simple to determine this (not least because we cannot at present predict when and where they can happen). However, we can suggest two reasons why gene duplications and deletions may be a better explanation. First, the mechanisms that involve gradual mutations, accretions around structural cores, and chameleon sequences are likely to operate very slowly. Mutational studies of proteins have only rarely observed radical changes to occur, which suggests that in a given protein at most only a handful of proteins can exist. Additionally, such large-scale structural changes as can be introduced by chameleon sequences are more
likely to have negative than positive functional consequences. On this basis, duplication and deletion, which can generate structural transitions much more quickly, are more probable. Second, it is not yet clear whether protein structure space is sufficiently well connected for enough accessible paths to exist between folds to permit such mechanisms to make the journey between them (see Whitehead et al., 2008 and Chapter 6). The concept that proteins evolve primarily by the duplicate-and-destroy method (in which agglomerations of proteins are built up and then unnecessary components gradually removed) has the advantage that it is less risky: If two units that fold independently are joined together, we expect the initial result to be two joined, independently folding units. If an independently folding unit experiences an insertion, deletion, or point mutation, a disruption is far more likely. The only risk in the former mechanism is that posed by an increased “dosage” of the duplicated domain. The “periodic table” representation of protein fold space (Taylor, 2002) provides one way to represent and visualize structural transitions. In this representation a duplication event would be represented as a leap forward, with deletion events corresponding to steps back to smaller structures, not unlike radioactive decay in the more familiar table of elements. It would be possible to tune such a model to recapitulate known evolutionary events; this would provide a tool to evaluate the probability of an evolutionary relationship between two proteins, not unlike the work of Shakhnovich and colleagues at the residue level using simple lattice models (Dokholyan et al., 2003; Zeldovich et al., 2006). We have seen how some relationships between proteins can be traced back through a series of duplications and domain rearrangements to very basic elements of protein structure, often incorporating only a pair or a few secondary structure elements.
Although difficult to prove, it is tempting to speculate that these basic cores once corresponded to the earliest functional units and that all known protein folds can be derived through the mechanisms of duplication and deletion that have been described above.
Acknowledgment
The work was supported by the Medical Research Council (UK).
REFERENCES
Ahn HJ, Eom SJ, Yoon HJ, Lee BI, Cho H, Suh SW. 2003. Crystal structure of class I acetohydroxy acid isomerase from Pseudomonas aeruginosa. J Mol Biol 328:505–515.
Andreeva A, Murzin AG. 2006. Evolution of protein fold in the presence of functional constraints. Curr Opin Struct Biol 16:399–408.
Bajaj M, Blundell T. 1984. Evolution and the tertiary structure of proteins. Annu Rev Biophys Bioeng 13:453–492.
Banner DW, Bloomer AC, Petsko GA, Phillips DC, Pogson CI, Wilson IA. 1975. Structure of chicken muscle triose phosphate isomerase determined crystallographically at 2.5 Å resolution. Nature 255:609–614.
Barrientos LG, Louis JM, Botos I, Mori T, Han Z, O’Keefe BR, Boyd MR, Wlodawer A, Gronenborn AM. 2002. The domain-swapped dimer of cyanovirin-n is in a metastable folded state. Structure 10:673–686.
Bashford D, Cohen FE, Karplus M, Kuntz ID, Weaver DL. 1988. Diffusion–collision model for the folding kinetics of myoglobin. Protein Struct Funct Genet 4:211–227.
Bennet MJ, Schlunegger MP, Eisenberg D. 1995. 3D domain swapping: a mechanism for oligomer assembly. Protein Sci 4:2455–2468.
Blundell TL, Jenkins JA, Sewell BT, Pearl LH, Cooper JB, Tickle IJ, Veerapandian B, Wood SP. 1990. X-ray analyses of aspartic proteinases: the 3-dimensional structure at 2.1 Å resolution of endothiapepsin. J Mol Biol 211:919–941.
Bofkin L, Goldman N. 2007. Variation in evolutionary processes at different codon positions. Mol Biol Evol 24:513–521.
Carey J, Lindman S, Bauer M, Linse S. 2007. Protein reconstitution and three-dimensional domain swapping: benefits and constraints of covalency. Protein Sci 16:2317–2333.
Chang G, Roth CB, Reyes CL, Pornillos O, Chen Y, Chen AP. 2006. Retraction of Pornillos et al., Science 310(5756) 1950-1953; retraction of Reyes and Chang, Science 308(5724) 1028-1031; retraction of Chang and Roth, Science 293(5536) 1793-1800. Science 314:1875.
Chaudhuri I, Soding J, Lupas AN. 2008. Evolution of the beta-propeller fold. Protein Struct Funct Genet 71:795–803.
Chen YJ, Pornillos O, Lieu S, Ma C, Chen AP, Chang G. 2007. X-ray structure of emre supports dual topology model. Proc Natl Acad Sci USA 104:18999–19004.
Cheng H, Grishin NV. 2005. DOM-fold: a structure with crossing loops found in DmpA ornithine acetyltransferase and molybdenum cofactor-binding protein domain. Protein Sci 14:1902–1910.
Craik CS, Buchman SR, Beychok S. 1981. O2 binding properties of the product of the central exon of beta globin gene. Nature 291:87–90.
Daley DO, Rapp M, Granseth E, Melen K, Drew D, von Heijne G. 2005. Global topology analysis of the Escherichia coli inner membrane proteome. Science 308:1321–1323.
deRoos ADG. 2005. Origins of introns based on the definition of exon modules and their conserved interfaces. Bioinformatics 21:2–9.
Dokholyan NV, Deeds EJ, Shakhnovich EI. 2003. Protein evolution within a structural space. Biophys J 85:2962–2972.
Dutzler R, Campbell EB, Cadene M, Chait BT, MacKinnon R. 2002. X-ray structure of a clc chloride channel at 3.0 Å reveals the molecular basis of anion selectivity. Nature 415:287–294.
Forrest LR, Zhang Y-W, Jacobs MT, Gesmonde J, Xie L, Honig BH, Rudnick G. 2008. Mechanism for alternating access in neurotransmitter transporters. Proc Natl Acad Sci USA 105:10338–10343.
Fu D, Libson A, Miercke LJ, Weitzman C, Nollert P, Krucinski J, Stroud RM. 2000. Structure of a glycerol-conducting channel and the basis for its selectivity. Science 290:481–486.
Go M. 1978. Correlation of DNA exonic regions with protein structural units in haemoglobin. Nature 291:90–92.
Hunte C, Screpanti E, Venturi M, Rimon A, Padan E, Michel H. 2005. Structure of a Na+/H+ antiporter and insights into mechanism of action and regulation by pH. Nature 435:1197–1202.
Jensen EO, Paludan K, Hyldig-Nielsen JJ, Jorgensen P, Marcker KA. 1981. The structure of a chromosomal leghaemoglobin gene from soybean. Nature 291:677–679.
Jones R. 2006. RNA silencing sheds light on the RNA world. PLoS Biol 4:1.
Lecomte JTJ, Vuletich DA, Lesk AM. 2005. Structural divergence and distant relationships in proteins: evolution of the globins. Curr Opin Struct Biol 15:290–301.
Lehner I, Basting D, Meyer B, Haase W, Manolikas T, Kaiser C, Karas M, Glaubitz C. 2008. The key residue for substrate transport (glu(14)) in the emre dimer is asymmetric. J Biol Chem 283:3281–3288.
Locher KP, Lee AT, Rees DC. 2002. The E. coli btucd structure: a framework for abc transporter architecture and mechanism. Science 296:1091–1098.
Lolkema JS, Sobczak I, Slotboom D-J. 2005. Secondary transporters of the 2hct family contain two homologous domains with inverted membrane topology and trans re-entrant loops. FEBS Lett 272:2334–2344.
Lolkema JS, Dobrowolski A, Slotboom D. 2008. Evolution of antiparallel two-domain membrane proteins: tracing multiple gene duplications events in the duf606 family. J Mol Biol 378:596–606.
Lupas AN, Ponting CP, Russell RB. 2001. On the evolution of protein folds: Are similar motifs in different protein folds the result of convergence insertion or relics of an ancient peptide world? J Struct Biol 134:191–203.
Luz H, Vingron M. 2006. Family specific rates of protein evolution. Bioinformatics 22:1106–1171.
Lynch M. 2002. Gene duplication and evolution. Science 297:945–947.
Lynch M, O’Hely M, Walsh B, Force A. 2001. The probability of preservation of a newly arisen gene duplication. Genetics 159:1789–1804.
Ma C, Chang G. 2004. Structure of the multidrug resistance efflux transporter emre from Escherichia coli. Proc Natl Acad Sci USA 101:2852–2857.
McHaourab BS, Mishra S, Koteiche HA, Amadi SH. 2008. Role of sequence bias in the topology of the multidrug transporter EMRE. Biochemistry 47:7980–7982.
Murata K, Mitsuoka K, Hirai T, Walz T, Agre P, Heymann JB, Engel A, Fujiyoshi Y. 2000. Structural determinants of water permeation through aquaporin-1. Nature 407:599–605.
Murzin AG. 1992. Structural principles for the propeller assembly of β-sheets: the preference for seven-fold symmetry. Protein Struct Funct Genet 14:191–201.
Nara T, Kouyama T, Kurata Y, Kikukawa T, Miyauchi S, Kamo N. 2007. Anti-parallel membrane topology of a homo-dimeric multidrug transporter, emre. J Biol Chem 142:621–625.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Palczewski K, Kumasaka T, Hori T, Behnke CA, Motoshima H, Fox BA, LeTrong I, Teller DC, Okada T, Stenkamp RE, et al. 2000. Crystal structure of rhodopsin: a G-protein coupled receptor. Science 289:739–745.
Pan JL, Bardwell JCA. 2006. The origami of thioredoxin-like folds. Protein Sci 15:2217–2227.
Patthy L. 2008. Protein Evolution, 2nd ed. Oxford: Blackwell.
Pearl LH, Taylor WR. 1987. A structural model for the retroviral proteases. Nature 329:351–354.
Peisajovich SG, Rockah L, Tawfik DS. 2006. Evolution of new protein topologies through multistep gene rearrangements. Nat Genet 38:168–174.
Phillips DC, Sternberg MJE, Thornton JM, Wilson IA. 1978. An analysis of the structure of triose phosphate isomerase and its comparison with lactate dehydrogenase. J Mol Biol 119:329–351.
Pornillos O, Chen Y, Chen AP, Chang G. 2005. X-ray structure of the emre multidrug transporter in complex with a substrate. Science 310:1950–1953.
Ptitsyn OB, Rashin AA. 1975. A model of myoglobin self-organisation. Biophys Chem 3:1–20.
Rao ST, Rossmann MG. 1973. Comparison of super-secondary structures in proteins. J Mol Biol 76:241–256.
Rapp M, Seppala S, Granseth E, von Heijne G. 2007a. Emulating membrane protein evolution by rational design. Science 315:1282–1284.
Rapp M, Seppala S, Granseth E, von Heijne G. 2007b. Reply to Schuldiner 2007. Science 317:748–751.
Reeves GA, Dallman TJ, Redfern OC, Akpor A, Orengo CA. 2006. Structural diversity of domain superfamilies in the CATH database. J Mol Biol 360:725–741.
Rice JJ, Schohn A, Bessette PH, Boulware KT, Daugherty PS. 2006. Bacterial display using circularly permuted outer membrane protein ompx yields high affinity peptide ligands. Protein Sci 15:825–836.
Richardson JS. 1977. β-Sheet topology and the relatedness of proteins. Nature 268:495–500.
Ridge KD, Lee SSJ, Abdulaev NG. 1996. Examining rhodopsin folding and assembly through expression of polypeptide fragments. J Biol Chem 271:7860–7867.
Saaf A, Baars L, von Heijne G. 2001. The internal repeats in the Na+/Ca2+ exchanger-related Escherichia coli protein yrbg have opposite membrane topologies. J Biol Chem 276:18905–18907.
Saaf A, Johansson M, Wallin E, von Heijne G. 1999. Divergent evolution of membrane protein topology: the Escherichia coli RnfA and RnfE homologues. Proc Natl Acad Sci USA 96:8540–8544.
Salgado PS, Koivunen MRL, Makeyev EV, Bamford DH, Stuart DI, Grimes JM. 2006. The structure of an RNAi polymerase links RNA silencing and transcription. PLoS Biol 4:2274–2281.
Sandaman K, Reeve JN. 2006. Archaeal histones and the origin of the histone fold. Curr Opin Microbiol 9:520–525.
Schoneberg T, Liu J, Wess J. 1995. Plasma-membrane localization and functional rescue of truncated forms of a G-protein coupled receptor. J Biol Chem 270:18000–18006.
Schuldiner S. 2007a. Controversy over emre structure. Science 317:748–751.
Schuldiner S. 2007b. When biochemistry meets structural biology: the cautionary tale of emre. TIBS 32:252–258.
Shapiro JA, Adhya SL, Bukhari AI. 1977. Introduction: New pathways in the evolution of chromosome structure. In Bukhari AI, Shapiro JA, Adhya SL (eds.), DNA Insertion Elements, Plasmids and Episomes. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, pp. 3–11.
Sobczak I, Lolkema JS. 2005. The 2-hydroxycarboxylate transporter (2hct) family: physiology structure and mechanism. Microbiol Mol Biol Rev 69:665–695.
Steiner-Mordoch S, Soskiine M, Solomon D, Rotem D, Gold A, Yechieli M, Adam Y, Schuldiner S. 2008. Parallel topology of genetically fused emre homodimers. EMBO J 27:17–26.
Street TO, Rose GD, Barrick D. 2006. The role of introns in repeat protein gene formation. J Mol Biol 360:258–266.
Tang J, James MNG, Hsu IN, Jenkins JA, Blundell TL. 1978. Structural evidence for gene duplication in the evolution of the acid proteases. Nature 271:619–621.
Tate CG, Kunji ERS, Lebendiker M, Schuldiner S. 2001. The projection structure of emre, a proton-linked multidrug transporter from Escherichia coli, at 7 Å resolution. EMBO J 20:77–81.
Taylor WR. 1999. Protein structure alignment using iterated double dynamic programming. Protein Sci 8:654–665.
Taylor WR. 2000. A deeply knotted protein and how it might fold. Nature 406:916–919.
Taylor WR. 2002. A periodic table for protein structure. Nature 416:657–660.
Taylor WR. 2007. Evolutionary transitions in protein fold space. Curr Opin Struct Biol 17:354–361.
Taylor EW, Agarwal A. 1993. Sequence homology between bacteriorhodopsin and G-protein coupled receptors: exon shuffling or evolution by duplication? FEBS Lett 325:161–166. Taylor WR, Orengo CA. 1989. Protein structure alignment. J Mol Biol 208:1–22. Taylor WR, Heringa J, Baud F, Flores TP. 2002. A Fourier analysis of symmetry in protein structure. Protein Eng 15:79–89. Theobald DL, Wuttke DS. 2006. Divergent evolution within protein superfolds inferred from profile-based phylogenetics. J Mol Biol 354:722–737. Thornton JM, Sibanda BL. 1983. Amino and carboxy-terminal regions in globular proteins. J Mol Biol 167:443–460. Toh H, Ono M, Saigo K, Miyata T. 1985. Retroviral protease-like sequence in the yeast transposon ty1. Nature 315:691. Van den Berg B, Clemons WM, Collinson I, Modis Y, Hartmann E, Harrison SC, Rapoport TA. 2004. X-ray structure of a protein-conducting channel. Nature 427:36–44. Vesterstrom J, Taylor WR. 2006. Flexible secondary structure based protein structure comparison applied to the detection of circular permutation. J Comput Biol 13:43–62. von Heijne G. 1992. Membrane-protein structure prediction: hydrophobicity analysis and the positive-inside rule. J Mol Biol 225:487–494. Weiner J, Thomas G, Bornberg-Bauer E. 2005. Rapid motif-based prediction of circular permutations in multi-domain proteins. Bioinformatics 21:932–937. Whamond GS, Thornton JM. 2006. An analysis of intron positions in relation to nucleotides amino acids and protein secondary structure. J Mol Biol 359:238–247. Whitehead DJ, Wilke CO, Vernazobres D, Bornberg-Bauer E. 2008. The look-ahead effect of phenotypic mutations. Biol Direct 3:18. Wistow GJ, Pisano MM, Chepelinsky AB. 1991. Tandem sequence repeats in transmembrane channel proteins. TIBS 16:170–171. Wlodawer A, Miller M, Jaskolski M, Sathyanarayana BK, Baldwin E, Weber IT, Selk LM, Clawson L, Schneider J, Kent SBH. 1989. Conserved folding in retroviral proteases: crystal structure of a synthetic HIV-1 protease. 
Science 245:616–621. Yadid I, Tawfik DS. 2007. Reconstruction of functional β-propeller lectins via homo-oligomeric assembly of shorter fragments. J Mol Biol 365:10–17. Yamashita A, Singh SK, Kawate T, Jin Y, Gouaux E. 2005. Crystal structure of a bacterial homologue of Na+/Cl−-dependent neurotransmitter transporters. Nature 437:215–223. Yu FH, Catterall WA. 2003. Overview of the voltage-gated sodium channel family. Genome Biol 4:207. Zardoya R. 2005. Phylogeny and evolution of the major intrinsic protein family. Biol Cell 97:397–414. Zardoya R, Villalba S. 2001. A phylogenetic framework for the aquaporin family in eukaryotes. J Mol Evol 52:391–404. Zeldovich KB, Berezovsky IN, Shakhnovich EI. 2006. Physical origins of protein superfamilies. J Mol Biol 357:1335–1343.
8 Statistical Methods for Detecting Functional Divergence of Gene Families
XUN GU
Department of Genetics, Development and Cell Biology, Center for Bioinformatics and Biological Statistics, Iowa State University, Ames, Iowa
1 INTRODUCTION Many organisms, from yeast to human, have undergone genomewide or local chromosome duplication events during their evolution (Ohno, 1970; Lundin, 1993; Holland et al., 1994; Spring, 1997; Wolfe and Shields, 1997). After gene duplication, one gene copy maintains the original function, whereas the other copy is free to accumulate amino acid changes toward functional divergence (Li, 1983). As a result, many genes are represented as several paralogs in the genome with related but distinct functions. Since gene family proliferation is thought to have provided the raw materials for functional innovations, it is desirable, from sequence analysis, to identify amino acid sites that are responsible for the functional diversity. This approach has great potential for functional genomics because it is cost-effective, and these predictions can be tested further by experimentation. Since most amino acid changes are not related to functional divergence but represent neutral evolution, it is crucial to develop appropriate statistical methods to distinguish between these two possibilities. Indeed, when sequences of a gene family are available, the identification of functionally important residues can be approached computationally (e.g., Casari et al., 1995; Lichtarge et al., 1996; Livingstone and Barton, 1996; Gu, 1999, 2001; Landgraf et al., 1999). In particular, Gu (1999, 2001) has developed a novel probabilistic model, based on the underlying principle that functional divergence after gene duplication is correlated strongly with the change of evolutionary rate. This correlation is a complement to a fundamental rule in molecular evolution: Functional importance is correlated strongly with evolutionary conservation (Kimura, 1983). A site-specific profile based on the posterior probability was then developed to predict critical residues for functional divergences between two gene clusters. 
Many authors (e.g., Wang and Gu, 2001; Gu et al., 2002; Mathews, 2005) have applied this method successfully to the study of functional diversity in gene families. For example, Wang and Gu (2001) studied the caspase gene family and found that their predictions are supported by experimental data. In this chapter we review the statistical basis for testing functional divergence after gene duplication and for predicting the amino acid residues that are responsible for these divergences.
Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles. Copyright © 2010 Wiley-Blackwell.
2 TWO-STATE MODEL FOR FUNCTIONAL DIVERGENCE
Consider a multiple alignment of a gene family with two sets of homologous genes, 1 and 2 (Figure 1). Although various terminologies were used previously (e.g., Casari et al., 1995; Lichtarge et al., 1996; Livingstone and Barton, 1996; Gu, 1999; Landgraf et al., 1999), amino acid patterns can be classified tentatively as follows (Gu, 2001). Type 0 represents amino acid patterns that are universally conserved through the entire gene family, implying that these residues are important for the common function shared by all member genes. Type I represents amino acid patterns that are highly conserved in gene set 1 but highly variable in gene set 2, or vice versa, implying that these residues have experienced altered functional constraints. Type II represents amino acid patterns that are highly conserved in both gene sets but whose biochemical properties are very different (e.g., positive vs. negative charge), implying that these residues may be responsible for functional specification. Finally, amino acid patterns at many residues are not clear-cut and have to be regarded as unclassified (type U). After gene duplication, functional divergence between duplicates is likely to occur in the early stage. There are two basic types of functional divergence. The first type results in site-specific altered functional constraints (i.e., different evolutionary rates) between duplicate genes. We named it type I functional divergence, as it typically generates type I amino acid patterns.

Figure 1 Two gene clusters after gene duplication. [Tree diagram not reproduced; nodes are labeled x0–x6 in cluster 1 and y0–y6 in cluster 2.]

The second type results in no altered functional constraints but in a radical change in amino acid property between duplicates (e.g., charge,
hydrophobicity). We named it type II functional divergence, as it typically generates type II amino acid patterns. For two gene clusters generated by a gene duplication, the two-state model assumes that, in each cluster, a site has two possible states: S0 (unrelated to functional divergence, i.e., under functional constraint) and S1 (functional divergence). When a site is under S0, the evolutionary rate at this site is virtually the same in the two clusters (i.e., λ1 = λ2). In contrast, under state S1, λ1 and λ2 are statistically independent (Gu, 1999). The assumption of rate independence for functional divergence means that knowing the evolutionary rate at such a site in one cluster provides no information for predicting the intensity of functional constraint in the other cluster.

3 TESTING TYPE I FUNCTIONAL DIVERGENCE AFTER GENE DUPLICATION
Gu (1999, 2001) developed statistical approaches to estimating type I functional divergence, which have been implemented in the software DIVERGE (Gu and Vander Velden, 2002). The principal difference between the two methods is that the method of Gu (2001) is based on the Markov chain model, whereas that of Gu (1999) is based on the Poisson model. Figure 2 outlines the pipeline of statistical analysis.

3.1 Markov Chain Model
Under the Markov chain model, the likelihood for sequence evolution can be derived as follows (Felsenstein, 1981; Kishino et al., 1990). First, the transition probability matrix for a given time period t can be computed as P = exp(λRt), where the rate matrix R represents the pattern of amino acid substitutions, which can be determined empirically by, for example, the Dayhoff model (Dayhoff et al., 1978).

Figure 2 Chart for statistical analysis of functional divergence. The chart contains the following boxes:
1. Input: aligned amino acid sequences of two clusters (A, B) and the phylogeny.
2. Probabilistic model of a site in each cluster: f(XA|λA) and f(XB|λB), where XA and XB are the amino acid configurations in A and B. In the fast algorithm of Gu (1999), XA and XB are simplified to the expected number of substitutions (Gu and Zhang, 1997), so that f(XA|λA) and f(XB|λB) are Poisson processes. See Gu (2001) for a formal likelihood treatment.
3. Conditional joint probability: assume that the rates λA and λB vary among sites according to a gamma distribution. Under S0, λA = λB = λ, so f(X|S0) = E[f(XA|λ) f(XB|λ)]. Under S1, λA and λB are independent, so f(X|S1) = E[f(XA|λA)] E[f(XB|λB)], where X = (XA, XB) and E denotes expectation.
4. Joint probability of X = (XA, XB): f(X) = (1 − θ) f(X|S0) + θ f(X|S1); the likelihood over all sites (k) is L = Πk f(Xk).
5. Site-specific profile (posterior analysis): P(S1|X) = θ f(X|S1) / f(X).

The evolutionary rate (λ)
may vary among sites because of different functional constraints. Usually, λ is treated as a random variable that follows a gamma distribution; that is,

$$\phi(\lambda) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\lambda^{\alpha-1} e^{-\beta\lambda} \tag{1}$$

(Uzzell and Corbin, 1971). The shape parameter, α, describes the strength of rate variation among sites (i.e., a small value of α means strong rate heterogeneity among sites, and α = ∞ means no rate variation among sites), whereas β is a scale constant (Gu et al., 1995). Consider the phylogenetic tree in Figure 1. Let X = (x1, x2, x3, x4) and Y = (y1, y2, y3, y4) be the amino acid patterns observed at a site in clusters 1 and 2, respectively. For the (unrooted) subtree of cluster 1 or 2, the conditional probability of observing X or Y at a site can be written as follows:

$$\begin{aligned}
f(X\mid\lambda) &= \sum_{x_5=1}^{20}\sum_{x_6=1}^{20} b_{x_5} P_{x_5 x_1} P_{x_5 x_2} P_{x_5 x_6} P_{x_6 x_3} P_{x_6 x_4} \\
f(Y\mid\lambda) &= \sum_{y_5=1}^{20}\sum_{y_6=1}^{20} b_{y_5} P_{y_5 y_1} P_{y_5 y_2} P_{y_5 y_6} P_{y_6 y_3} P_{y_6 y_4}
\end{aligned} \tag{2}$$
where Pij = Pij(vij) is the transition probability from node i to node j, vij is the branch length between them, and bi is the frequency of amino acid i. By integrating out the random variable λ, the probability of observing X or Y at a site is given by

$$\begin{aligned}
p(X) &= \int_0^{\infty} f(X\mid\lambda)\,\phi(\lambda)\,d\lambda \\
p(Y) &= \int_0^{\infty} f(Y\mid\lambda)\,\phi(\lambda)\,d\lambda
\end{aligned} \tag{3}$$

respectively. Let P(S1) = θI be the probability of a site being in state S1 (functional divergence) and P(S0) = 1 − θI be the probability of a site being in state S0 (functional constraint). We call θI the coefficient of type I functional divergence between clusters 1 and 2 (Gu, 1999). Let X and Y be the amino acid patterns of a site in clusters 1 and 2, respectively. Since the evolutionary rates (λ1 and λ2) at an S1 site (i.e., a site under S1) are statistically independent between the two clusters, whereas they are completely correlated (λ1 = λ2, without loss of generality) at an S0 site, the joint probability of the subtrees conditional on S0 or S1 is given by

$$\begin{aligned}
f(X,Y\mid S_0) &= \int_0^{\infty} f(X\mid\lambda)\,f(Y\mid\lambda)\,\phi(\lambda)\,d\lambda = E[f(X\mid\lambda)\,f(Y\mid\lambda)] \\
f(X,Y\mid S_1) &= p(X)\,p(Y) = E[f(X\mid\lambda_1)] \times E[f(Y\mid\lambda_2)]
\end{aligned} \tag{4}$$
where f(X|λ1) or f(Y|λ2) is the likelihood of each unrooted subtree, respectively [e.g., it is given by Eq. (2) for the phylogeny in Figure 1], and E denotes expectation. From the two-state model, one can easily show that the joint probability of the two subtrees can be written as

$$p(X,Y) = (1-\theta_I)\,f(X,Y\mid S_0) + \theta_I\,f(X,Y\mid S_1) \tag{5}$$

Then, under the assumption of site independence, the likelihood function over all sites (gaps excluded) is given by

$$L(x\mid \text{data}) = \prod_{k} p\left(X^{(k)}, Y^{(k)}\right) \tag{6}$$

where k runs over the sites and x is the set of unknown parameters.

3.2 Poisson Model
Gu (1999) developed a Poisson-based model to estimate the coefficient of functional divergence, which is computationally efficient. At a given site, the number of amino acid changes (Xi, i = 1, 2 for gene clusters 1 and 2, respectively) follows a Poisson distribution; that is, the probability that Xi = k is given by

$$p_i(k) = \frac{(\lambda_i T_i)^{k}}{k!}\,e^{-\lambda_i T_i}, \qquad i = 1, 2 \tag{7}$$
where T1 and T2 are the total evolutionary times of clusters 1 and 2, respectively. The joint distribution of the number of changes, P(X1, X2), can be derived as follows. For any S1 site, the evolutionary rate is statistically independent between the two clusters, whereas it is completely correlated at an S0 site. Thus, the probability of X1 = i in cluster 1 and X2 = j in cluster 2 under state S0 or S1 is given by

$$\begin{aligned}
P(X_1 = i, X_2 = j \mid S_1) &= Q_1(i)\,Q_2(j) \\
P(X_1 = i, X_2 = j \mid S_0) &= K_{12}(i,j)
\end{aligned} \tag{8}$$

The analytical forms of Q1, Q2, and K12 were derived by Gu (1999). Then the joint distribution can be expressed as

$$P(X_1, X_2) = (1-\theta_I)\,K_{12} + \theta_I\,Q_1 Q_2 \tag{9}$$
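For readers who want explicit expressions, the following sketch shows how Q_i and K_12 can be obtained if one assumes that λ follows the gamma distribution of Eq. (1) and that the counts follow Eq. (7). These are the standard gamma-Poisson mixture results; the parameterization used by Gu (1999) may differ, so they are offered as an illustration rather than as the published formulas.

```latex
\begin{aligned}
Q_i(k) &= \int_0^{\infty} \frac{(\lambda T_i)^{k}}{k!}\,e^{-\lambda T_i}\,\phi(\lambda)\,d\lambda
        = \frac{\Gamma(k+\alpha)}{k!\,\Gamma(\alpha)}
          \left(\frac{T_i}{T_i+\beta}\right)^{k}
          \left(\frac{\beta}{T_i+\beta}\right)^{\alpha}, \\
K_{12}(i,j) &= \int_0^{\infty} \frac{(\lambda T_1)^{i}}{i!}\,\frac{(\lambda T_2)^{j}}{j!}\,
          e^{-\lambda (T_1+T_2)}\,\phi(\lambda)\,d\lambda
        = \frac{\Gamma(i+j+\alpha)}{i!\,j!\,\Gamma(\alpha)}\,
          \frac{T_1^{\,i}\,T_2^{\,j}\,\beta^{\alpha}}{(T_1+T_2+\beta)^{\,i+j+\alpha}} .
\end{aligned}
```

Under these assumptions Q_i is a negative binomial distribution, and the shared rate under S0 makes the counts in the two clusters positively correlated, which is what distinguishes K_12 from the product Q_1 Q_2.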
To estimate θI we need to know the number of changes at each site for each gene cluster (i.e., X1 and X2). Since X1 and X2 cannot be observed directly from the sequence data, a conventional solution is to use the minimum number of required changes (m) as an approximation, which can be inferred by parsimony under a known phylogenetic tree (Fitch, 1971). However, m is a biased estimate of the true number of changes because it does not account for multiple hits. This problem has been solved by using a combination of ancestral sequence inference and maximum likelihood estimation (Gu and Zhang, 1997). Extensive computer simulation has shown that the estimate of the mean of the expected number of changes, as well as that of the variance, is asymptotically unbiased and robust to errors in ancestral amino acid inference.
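The estimation of θI under the Poisson model can be sketched numerically. The example below is a minimal illustration and not the DIVERGE implementation. It assumes that λ follows the gamma distribution of Eq. (1), so that Q_i is a negative binomial distribution and K12 is its shared-rate bivariate counterpart (standard gamma-Poisson results; the forms in Gu (1999) may be parameterized differently). It also assumes that α, β, T1, and T2 are known and that the number of changes per site is given, whereas in practice these quantities are estimated. All function names and the example counts are hypothetical.

```python
import math


def log_q(k, alpha, beta, t):
    """Log P(X = k) in one cluster: Poisson(lambda * t) mixed over
    lambda ~ Gamma(shape=alpha, rate=beta), i.e. a negative binomial."""
    return (math.lgamma(k + alpha) - math.lgamma(k + 1) - math.lgamma(alpha)
            + k * math.log(t / (t + beta))
            + alpha * math.log(beta / (t + beta)))


def log_k12(i, j, alpha, beta, t1, t2):
    """Log P(X1 = i, X2 = j | S0): both clusters share the same lambda."""
    return (math.lgamma(i + j + alpha) - math.lgamma(i + 1)
            - math.lgamma(j + 1) - math.lgamma(alpha)
            + i * math.log(t1) + j * math.log(t2)
            + alpha * math.log(beta)
            - (i + j + alpha) * math.log(t1 + t2 + beta))


def log_likelihood(theta, sites, alpha, beta, t1, t2):
    """Sum over sites of log[(1 - theta) * K12 + theta * Q1 * Q2], as in Eq. (9)."""
    total = 0.0
    for i, j in sites:
        f_s0 = math.exp(log_k12(i, j, alpha, beta, t1, t2))
        f_s1 = math.exp(log_q(i, alpha, beta, t1) + log_q(j, alpha, beta, t2))
        total += math.log((1.0 - theta) * f_s0 + theta * f_s1)
    return total


def estimate_theta(sites, alpha, beta, t1, t2, steps=1000):
    """Grid-search maximum likelihood estimate of theta on [0, 1]."""
    grid = [s / steps for s in range(steps + 1)]
    return max(grid, key=lambda th: log_likelihood(th, sites, alpha, beta, t1, t2))


# Hypothetical numbers of changes (cluster 1, cluster 2) at each site.
concordant = [(0, 0)] * 5 + [(5, 5)] * 5   # similar rates in both clusters
discordant = [(0, 8)] * 3 + [(8, 0)] * 3   # conserved in one, variable in the other
theta_hat = estimate_theta(concordant + discordant,
                           alpha=1.0, beta=1.0, t1=2.0, t2=2.0)
print(f"theta_hat = {theta_hat:.3f}")
```

A grid search is used only for transparency; because each site contributes the logarithm of a function that is linear in θ, the log-likelihood is concave in θ and any one-dimensional optimizer would do. With purely concordant sites the estimate falls to 0, and with purely discordant sites it rises to 1.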
4 PREDICTING CRITICAL RESIDUES FOR TYPE I FUNCTIONAL DIVERGENCE
It is of great interest to predict (statistically) which sites are likely to be responsible for functional differences. Indeed, these sites can be tested further by experimentation using molecular, biochemical, or transgenic approaches. We have developed site-specific profiles for this purpose, which can be obtained using posterior analysis.

4.1 Markov Chain Model
For the simple two-cluster case, there are only two states: S0 and S1. We wish to know the probability of S1 for a given site when the amino acid configuration (X, Y) is observed [i.e., P(S1|X, Y)]. The prior probability of S1 is P(S1) = θI. According to Bayes' rule, we have

$$P(S_1\mid X,Y) = \frac{\theta_I\,f(X,Y\mid S_1)}{p(X,Y)} \tag{10}$$
where f(X, Y|S1) and p(X, Y) are given by Eqs. (4) and (5), respectively.

4.2 Poisson Model
In the case of strong statistical evidence supporting functional divergence after gene duplication (i.e., θI > 0), it is of great interest to predict which sites are likely to be responsible for these (type I) functional differences. Indeed, these sites can be tested further using molecular, biochemical, or transgenic approaches. Remember that in the two-state model, each site has two possible states, S0 (functional constraint) and S1 (functional divergence), with the (prior) probabilities P(S1) = θI and P(S0) = 1 − θI, respectively. To provide a statistical basis for predicting which state is more likely at a given site, we need to compute the (posterior) probability of state S1 at this site with X1 (and X2) changes in cluster 1 (and 2), P(S1|X1, X2). Obviously, P(S0|X1, X2) = 1 − P(S1|X1, X2). According to Bayes' rule, one can show that

$$P(S_1\mid X_1,X_2) = \frac{\theta_I\,Q_1 Q_2}{(1-\theta_I)\,K_{12} + \theta_I\,Q_1 Q_2} \tag{11}$$
We may use this formula to identify those amino acid sites that may be responsible for the functional divergence, given a cutoff value. In practice, the choice of a cutoff value is somewhat arbitrary, ranging from P(S1|X1, X2) > 0.5 (equivalently, a posterior ratio Rij = P(S1|X1, X2)/P(S0|X1, X2) > 1) to P(S1|X1, X2) > 0.95 (Rij > 19, or roughly 20). As will be seen below, it depends on how much information we can obtain.
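The posterior calculation and the cutoff rule apply to both the Markov chain and the Poisson model, since Eqs. (10) and (11) share the same form. The sketch below is a minimal illustration under that reading: it takes the site likelihoods under S0 and S1 as given inputs, and the function names, the prior θ = 0.3, and the likelihood values are all hypothetical.

```python
def posterior_s1(theta, f_s0, f_s1):
    """Posterior probability of state S1 at a site (form of Eqs. 10 and 11).

    theta: prior probability P(S1), the coefficient of functional divergence.
    f_s0:  site likelihood under S0 (e.g. K12, or f(X, Y | S0)).
    f_s1:  site likelihood under S1 (e.g. Q1 * Q2, or f(X, Y | S1)).
    """
    joint = (1.0 - theta) * f_s0 + theta * f_s1
    return theta * f_s1 / joint


def posterior_ratio(p_s1):
    """Posterior odds P(S1 | data) / P(S0 | data)."""
    return p_s1 / (1.0 - p_s1)


def predicted_sites(profile, cutoff):
    """Return the 1-based alignment positions whose posterior exceeds the cutoff."""
    return [pos for pos, p in enumerate(profile, start=1) if p > cutoff]


# Hypothetical site likelihoods (f_s0, f_s1) for four alignment positions.
likelihoods = [(0.20, 0.11), (0.0053, 0.0019), (0.00013, 0.0043), (0.02, 0.02)]
theta = 0.3  # hypothetical prior, for illustration only
profile = [posterior_s1(theta, f0, f1) for f0, f1 in likelihoods]

print(predicted_sites(profile, cutoff=0.5))
print(predicted_sites(profile, cutoff=0.95))
```

In this toy profile only position 3 passes the 0.5 cutoff (its posterior is about 0.93), and no position passes the stricter 0.95 cutoff. Position 4, where the two likelihoods are equal, keeps a posterior equal to the prior, which shows that uninformative sites cannot be called either way.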
5 IMPLEMENTATION AND CASE STUDY
These methods have been implemented in the software Diverge, which is available at www.xgu.gdcb.iastate.edu. Diverge is a GUI-based, user-friendly software package that provides an integrated analytical tool for functional prediction from protein sequence data and runs under both the Windows and Linux operating systems (Figure 3). Using Diverge, Wang and Gu (2001) analyzed the caspase gene family to explore the structural–functional basis for site-specific rate shifts (type I functional divergence) of protein sequences between major caspase subfamilies. The key component of the apoptotic (programmed cell death) machinery is a cascade of cysteine aspartyl proteases (caspases).

Figure 3 Interface of the software Diverge.

Figure 4 Phylogenetic tree of the caspase gene family. [Tree not reproduced; clades are labeled CASP-1 to CASP-14, with group labels E-Casp, I-Casp, CED-3, and ICE; scale bar 0.05.]

To date, 14 members of the caspase gene family have been identified in mammals, which can be classified into two major subfamilies, CED-3 (including
caspase-2, -3, -6, -7, -8, -9, -10, and -14) and ICE (including caspase-1, -4, -5, -11, -12, and -13). CED-3-type caspases are essential for most apoptotic pathways, whereas the major function of the ICE-type caspases is to mediate the immune response. Based on the inferred tree of caspases (Figure 4), Wang and Gu (2001) found that type I functional divergence is statistically significant between the two major subfamilies, CED-3 and ICE (θI = 0.29). The posterior profile (Figure 5) predicts crucial amino acid residues that are responsible for functional divergence between them. Of the 21 amino acid residues predicted for type I functional divergence between CED-3 and ICE, 4 have been verified by experimental or structural evidence.
[Panel (A) plot not reproduced: posterior probability P(S1|X) (vertical axis, 0 to 1) against alignment position (horizontal axis).]
Site    Feature                 CED-3                                                          ICE
161     Sequence conservation   An invariant Trp (W)                                           Highly variable
        Structural features     Form a narrow pocket with an extra loop; form a H-bond         No extra loop; a shallow depression found
86/88   Substrate specificity   Network with a group of amino acids; hydrophilic side chains   Hydrophobic side chains
131     Structural features     No surface loop                                                Lie in an extra surface loop
        Sequence conservation   Highly variable                                                Highly conserved
        Structural features     Not a cleavage site                                            Cleavage site for proenzyme processing
(B)
Figure 5 (A) Site-specific profile for predicting critical amino acid residues responsible for functional divergence between the CED-3 and ICE subfamilies, measured by the posterior probability that each site is related to functional divergence [P(S1|X)]. The arrows point to four amino acid residues at which functional divergence between the two subfamilies has been verified by experimentation. (B) Four predicted sites that have been verified by experimentation.
REFERENCES
Casari G, Sander C, Valencia A. 1995. A method to predict functional residues in proteins. Nat Struct Biol 2:171–178.
Dayhoff MO, Schwartz RM, Orcutt BC. 1978. A model of evolutionary change in proteins. In Dayhoff MO (ed.), Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3. Washington, DC: National Biomedical Research Foundation, pp. 342–352.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376.
Fitch WM. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst Zool 20:406–416.
Gu X. 1999. Statistical methods for testing functional divergence after gene duplication. Mol Biol Evol 16:1664–1674.
Gu X. 2001. Maximum likelihood approach for gene family evolution under functional divergence. Mol Biol Evol 18:453–464.
Gu X, Vander Velden K. 2002. DIVERGE: Phylogeny-based analysis for functional–structural divergence of a protein family. Bioinformatics 18:500–501.
Gu X, Zhang J. 1997. A simple method for estimating the parameter of substitution rate variation among sites. Mol Biol Evol 14:1106–1113.
Gu X, Fu YX, Li WH. 1995. Maximum likelihood estimation of the heterogeneity of substitution rate among nucleotide sites. Mol Biol Evol 12:546–557.
Gu J, Wang Y, Gu X. 2002. Pattern of functional divergence in JAK tyrosine protein kinase family. J Mol Evol 54:725–733.
Holland PWH, Garcia-Fernández J, Williams NA, Sidow A. 1994. Gene duplication and the origins of vertebrate development. Development 1994 Suppl. pp. 125–133.
Kimura M. 1983. The Neutral Theory of Molecular Evolution. Cambridge, UK: Cambridge University Press.
Kishino H, Miyata T, Hasegawa M. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J Mol Evol 31:151–160.
Landgraf R, Fischer D, Eisenberg D. 1999. Analysis of heregulin symmetry by weighted evolutionary tracing. Protein Eng 12:943–951.
Li WH. 1983. Evolution of duplicated genes. In Nei M, Koehn RK (eds.), Evolution of Genes and Proteins. Sunderland, MA: Sinauer Associates.
Lichtarge O, Bourne HR, Cohen FE. 1996. An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 257:342–358.
Livingstone CD, Barton G. 1996. Identification of functional residues and secondary structure from protein sequence alignment. Methods Enzymol 266:497–512.
Lundin LG. 1993. Evolution of the vertebrate genome as reflected in paralogous chromosomal regions in man and the house mouse. Genomics 16:1–19.
Mathews S. 2005. Analytical methods for studying the evolution of paralogs using duplicate gene datasets. Methods Enzymol 395:724–745.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Spring J. 1997. Vertebrate evolution by interspecific hybridisation: Are we polyploid? FEBS Lett 400:2–8.
Uzzell T, Corbin KW. 1971. Fitting discrete probability distributions to evolutionary events. Science 172:1089–1096.
Wang Y, Gu X. 2001. Functional divergence in the caspase gene family and altered functional constraints: statistical analysis and prediction. Genetics 158:1311–1320.
Wolfe KH, Shields DC. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708–713.
9 Mapping Gene Gains and Losses Among Metazoan Full Genomes Using an Integrated Phylogenetic Framework
ATHANASIA C. TZIKA
Laboratory of Natural and Artificial Evolution, Department of Genetics and Evolution, Sciences III, Geneva, Switzerland; Evolutionary Biology and Ecology, Université Libre de Bruxelles, Brussels, Belgium
RAPHAËL HELAERS
Department of Biology, Facultés Universitaires Notre-Dame de la Paix, Namur, Belgium
MICHEL C. MILINKOVITCH
Laboratory of Natural and Artificial Evolution, Department of Genetics and Evolution, Sciences III, Geneva, Switzerland
1 INTRODUCTION
Although a rough increase in maximum phenotypic complexity across the entire range of evolutionary time is indisputable, this general trend is not distributed homogeneously throughout the tree of life. Multiple lineages, such as myzostomes, flatworms, and tunicates, even exhibit simplified body plans that are probably derived rather than ancestral. Conversely, many branches in the tree of life at diverse phylogenetic scales are characterized by an accelerated acquisition of new and complex physiological and morphological characters [e.g., Aburomia et al. (2003), but see Donoghue and Purnell (2005)], some of which had a major impact on the ability of these lineages to diversify and thrive. The temptation to correlate phenotypic complexity with genomic complexity is both obvious and unsubstantiated. Although the absolute amount of DNA in a haploid cell is poorly correlated with organismal complexity (Gregory, 2002), notable and gradual increases in gene number (resulting from the retention of duplicated genes) and more abrupt increases in the abundance of spliceosomal introns and mobile genetic elements through evolutionary time have been suggested (Lynch and Conery, 2003). More generally, it is possible that the emergence of new genes [through one or a combination of processes involving exon shuffling, gene duplication, mobile elements, lateral
gene transfer, gene fusion/fission, and de novo origination; see Long et al. (2003) for a review] is involved in the development of phenotypic novelties in organismal evolution. Furthermore, the single or multiple round(s) of full-genome duplication [see Van de Peer (2004) for a review] in several major lineages, such as yeast, vertebrates, and plants, might explain major evolutionary leaps and adaptive radiations [Ohno, 1970; Aburomia et al., 2003; Van de Peer, 2004; but see Donoghue and Purnell (2005)]. A recent major competing hypothesis is that phenotypic transitions are explained by shifts in the precise spatial and temporal expression patterns of genes [see, e.g., Carroll (2001, 2005) and Carroll et al. (2004)] rather than by changes in their protein-coding regions: Gains or losses of cis-regulatory noncoding modules (CRMs) would cause shifts in the regulation of discrete tissue-specific and developmental-stage-specific expression of genes while avoiding deleterious pleiotropic effects of protein sequence modification. The structural mutation (mutation within the coding region) and regulatory mutation (mutations outside the coding region) models (Hoekstra and Coyne, 2007) are, however, by no means incompatible. For example, compartmentation of specialized gene functions can be brought about by duplication of the protein-coding sequence with its regulatory module(s) followed by subfunctionalization (Lynch and Conery, 2000; Lynch and Force, 2000); that is, the two gene copies specialize to perform complementary functions, for example, through protein sequence changes and/or evolution of the respective sets of CRMs (Force et al., 1999; Greer et al., 2000). 
Note that the increased probability of survival of subfunctionalized duplicates (Lynch and Conery, 2000) provides an extended time period during which neofunctionalization [i.e., one copy acquiring a new function whereas the other retains the ancestral function (Ohno, 1970)] can occur through coding sequence modifications (He and Zhang, 2005; Rastogi and Liberles, 2005). Furthermore, several studies corroborate the importance of lineage-specific positive selection (hence, potential neofunctionalization) for the retention of duplicates (Hurles, 2004; Kondrashov and Kondrashov, 2006; Shiu et al., 2006). Hoekstra and Coyne (2007) recently provided an extensive and articulated discussion suggesting that embracing the recent cis-regulatory paradigm of adaptive evolution as the single dominating mechanism for explaining the emergence of adaptations is theoretically unsubstantiated and is not supported experimentally. Rather, they suggest that “changes in both the structure and regulation of genes have been important in adaptation, that their relative importance will not be known for a considerable time, and that the role of structural mutations in morphological evolution—and other adaptive change—is unlikely to be trivial.” The increasing number of fully sequenced genomes and large-scale expression studies, accompanied by a constantly growing number of software and databases for better integration and exploitation of this wealth of data, should help investigate correlations between genome and phenotype evolution. However, whole-genome comparisons among eukaryotic species have proven more problematic than among prokaryotes, not only due to extensive gene duplication events and the multidomain structure of most proteins, but also because of the low-coverage sequencing of several genomes (Milinkovitch et al., 2010a). Furthermore, the broad field of comparative genomics currently suffers from two major biases. 
First, a striking taxonomic bias in the choice of model species and genome sequencing projects is noteworthy (Milinkovitch and Tzika, 2007); for example, only 3% of full-genome sequencing projects use the localization of the corresponding species in the tree of life as a primary
motivation (Liolios et al., 2006). As a result, a database such as Ensembl (Hubbard et al., 2007), which generates and maintains automatic annotation of selected eukaryotic genomes (www.ensembl.org), includes 21 mammalian and five teleost fish genomes, but only one bird and no reptile (v45). Current proposals for full-genome sequencing (www.genome.gov/10002154) correct the problem only very partially. Second, many of the methods and databases available for identifying duplication events and assessing orthology relationships of genetic elements among genomes avoid the heavy computational cost of phylogenetic trees inference and the difficulties associated with their interpretation, even though phylogeny-based orthology/paralogy identification is widely accepted as the most valid approach (Li et al., 2003; Alexeyenko et al., 2006). Recently, the problem has, however, been largely recognized and partially addressed by the comparative genomics community. For example, Ensembl (Hubbard et al., 2007) and the Human Phylome (Huerta-Cepas et al., 2007) are automated pipelines in which orthologs and paralogs are identified through the estimation of gene family phylogenetic trees. Furthermore, the recently developed MANTiS relational database (www.mantisdb.org) (Tzika et al., 2008) integrates phylogeny-based orthology/paralogy assignments with functional and expression data, allowing users to explore phylogeny-driven (focusing on any set of branches), gene-driven (focusing on any set of genes), function/process-driven, and expression-driven questions (Milinkovitch et al., 2010b). 
Application systems that integrate into an explicit evolutionary framework the mapping of gene gains and losses with functional and expression data should help in investigating whether the gene duplication phenomenon is generally relevant to adaptive evolution (i.e., beyond the well-known examples of diversification in globins, olfactory receptors, opsins, and transcription factors) and might even provide a means of investigating the causal relationship between genome evolution and an increase in phenotypic complexity. Furthermore, even if adaptations involve gene duplication, structural mutations, and regulatory mutations at drastically different relative frequencies (in the phylogenetic tree as a whole), many evolutionists will still be interested in identifying the genetic basis of adaptive traits at specific lineages of interest. Here we compare the efficiency of MANTiS against those of InParanoid (O’Brien et al., 2005), MultiParanoid (Alexeyenko et al., 2006), OrthoMCL (Li et al., 2003), and RoundUp (DeLuca et al., 2006) for the localization of gene gains and losses and duplication events within the metazoan phylogeny. First, InParanoid is a program that identifies putative ortholog clusters seeded by a reciprocally best-matching ortholog pair, around which in-paralogs are gathered and out-paralogs are excluded on the basis of their similar pairwise scores resulting from NCBI BLAST (Remm et al., 2001; O’Brien et al., 2005). InParanoid clusters generated from different pairs of genomes can then be merged using MultiParanoid. InParanoid was one of the first programs to refine best reciprocal hits for ortholog clustering. 
Second, OrthoMCL is an algorithm that groups putative ortholog protein sequences by (1) distinguishing between putative in-paralog and ortholog pairs through comparisons of reciprocal best hits within and between genomes, (2) correcting for differences in evolutionary distances between pairs of sequences, and (3) using the Markov clustering algorithm to split megaclusters. OrthoMCL was the first database to allow detection of genes present in a set of genomes and absent from another set. Third, the RoundUp database detects putative orthologs using the reciprocal smallest distance algorithm (RSD) based on global sequence alignment and maximum likelihood estimation of evolutionary distances. RoundUp incorporates the greatest number of sequenced genomes. Note that contrary
PHYLOGENY-BASED MAPPING OF GENE GAINS AND LOSSES
to the three resources mentioned above, MANTiS (1) incorporates the mapping of gains and losses of genes as well as of duplication events into an explicit phylogenetic framework, and (2) allows the user to perform elaborate queries combining parameters pertaining to gene identity, phylogeny, function, and expression.
2 DATA MINING
Data mining and the construction of the relational database were accomplished using the MANTiS (v1.0.15) pipeline (Tzika et al., 2008), available at www.mantisdb.org. MANTiS performs automated downloads from Ensembl (www.ensembl.org), extracts information relevant to protein family trees from the Compara database (Vilella et al., 2009), and defines characters for the generation of a full data set that includes orthologous gene presence/absence information for all species selected. Note that orthology is not assigned on the basis of simple best-reciprocal BLAST hits (BRH). Indeed, the existence of a BRH does not guarantee that orthology is inferred correctly in all cases (Theissen, 2002), because the BRH criterion ignores gene loss and differential rates of evolution. Here, orthology/paralogy are assigned in Ensembl (Hubbard et al., 2007, 2008) through the use of a pipeline (www.ensembl.org/info/data/compara/homology_method.html) that includes (1) the identification of gene families (using gene-relation graphs based on BRH), (2) tree inference after multiple protein sequence alignment within each gene family, and (3) identification of duplication and speciation events through gene tree vs. species tree reconciliation. MANTiS builds two data sets: the “with duplications” data set, combining all characters (de novo gains and duplication events), and the “families only” data set, which excludes characters corresponding to duplication events (i.e., we merge the characters within each protein tree). After each duplication event, the “ancestral” (vs. “derived”) character is identified as the child subtree with the smallest mean distance between the duplication node and all leaf nodes (Tzika et al., 2008).
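The rule for labeling the “ancestral” and “derived” characters after a duplication can be illustrated with a minimal Python sketch. This is not the MANTiS code: the nested-dictionary tree representation, the function names, the gene names, and the branch lengths are all hypothetical and chosen only to make the rule concrete.

```python
from statistics import mean

def leaf_distances(node):
    """Return the distances from `node` to every leaf below it.

    A node is a dict with an optional "children" list of
    (branch_length, child_node) pairs; a node without children is a leaf.
    """
    children = node.get("children", [])
    if not children:
        return [0.0]
    distances = []
    for branch_length, child in children:
        distances.extend(branch_length + d for d in leaf_distances(child))
    return distances

def label_duplication_children(duplication_node):
    """Label the child subtrees of a duplication node.

    The child whose leaves are, on average, closest to the duplication
    node is labeled "ancestral"; the others are labeled "derived".
    Ties are broken in favor of the first child (an arbitrary choice).
    """
    means = [
        mean(branch_length + d for d in leaf_distances(child))
        for branch_length, child in duplication_node["children"]
    ]
    ancestral_index = means.index(min(means))
    labels = [
        "ancestral" if i == ancestral_index else "derived"
        for i in range(len(means))
    ]
    return labels, means

# Toy duplication node with two child subtrees (hypothetical values).
toy = {"children": [
    (0.10, {"children": [(0.05, {"name": "geneA_human"}),
                         (0.07, {"name": "geneA_mouse"})]}),
    (0.30, {"children": [(0.20, {"name": "geneB_human"}),
                         (0.25, {"name": "geneB_mouse"})]}),
]}
labels, means = label_duplication_children(toy)
print(labels, means)
```

In this toy tree the first subtree has the shorter mean path to its leaves, so it is the one labeled “ancestral”.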
As functional and expression data are associated with a single specific Ensembl gene but a MANTiS character can correspond to a set of several Ensembl orthologous genes, all the relevant orthologs are assigned to a single MANTiS character corresponding to an Ensembl gene (and associated functional data) from the species with the largest amount of functional and expression data available (called priority species). All nonpriority species genes associated with a given character are considered as synonyms of the corresponding MANTiS character except when functional information is available via the Panther database (e.g., for Mus musculus, Rattus norvegicus, and Drosophila melanogaster genes). See an article by Tzika et al. (2008) for details on the character assignment method. Orthology assignment problems are expected to decrease as genome assembly and annotation improve. Note, however, that the annotation quality of a given genome does not depend solely on genome sequence coverage but also on its phylogenetic proximity with model species for which experimental data assisting in genome annotation (e.g., EST and SAGE data) are available. For example, the high-quality annotation of the human genome is more easily exploited for annotation of the macaque genome than for annotation of the opossum genome.
3 CHARACTER MAPPING
Gains and losses of orthologs are mapped by MANTiS v1.0.15 (www.mantisdb.org) on the “true” species tree [i.e., the topology best supported (Halanych, 2004; Springer et al., 2004; Bashir et al., 2005)]. MANTiS maps characters as follows: (1) the character presence/absence matrix for all species (built in the character-mining phase; see above) is used for computing a distance matrix following a modified Jukes–Cantor model; (2) the distance matrix is used to compute the branch lengths of the true species topology using the least-squares approach under minimum evolution; (3) the gain of a character is assigned to the corresponding internal or tip branch of the true species tree; and (4) a recursive maximum likelihood approach is used to identify, for each character, the exact most likely combination of branch(es) on which gene loss(es) is (are) assigned. Once gains and losses have been mapped, MANTiS builds the genome content of each internal node. See the article by Tzika et al. (2008) for details on the character mapping method and genome content view of MANTiS.
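Step (1) above can be sketched in a few lines of Python. The exact modification of the Jukes–Cantor model used by MANTiS is not described here, so the sketch assumes the standard two-state symmetric correction, d = −(1/2) ln(1 − 2p), where p is the proportion of characters whose presence/absence state differs between two species. The species profiles and function names are hypothetical.

```python
import math

def two_state_distance(profile_a, profile_b):
    """Two-state Jukes-Cantor-type distance between presence/absence profiles.

    Assumption: symmetric two-state model, d = -0.5 * ln(1 - 2p), with p the
    fraction of characters that differ. Returns infinity when p >= 0.5,
    where the correction is undefined (saturation).
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same number of characters")
    mismatches = sum(a != b for a, b in zip(profile_a, profile_b))
    p = mismatches / len(profile_a)
    if p >= 0.5:
        return math.inf
    return -0.5 * math.log(1.0 - 2.0 * p)

def distance_matrix(presence):
    """Pairwise distances for a {species: [0/1, ...]} presence/absence matrix."""
    species = sorted(presence)
    return {
        (s, t): two_state_distance(presence[s], presence[t])
        for s in species
        for t in species
    }

# Hypothetical toy matrix: 10 characters scored in three species.
presence = {
    "human":   [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    "mouse":   [1, 1, 1, 0, 1, 0, 0, 1, 1, 1],
    "chicken": [1, 0, 1, 1, 1, 0, 0, 1, 0, 1],
}
distances = distance_matrix(presence)
for pair in [("human", "mouse"), ("human", "chicken"), ("chicken", "mouse")]:
    print(pair, round(distances[pair], 4))
```

The resulting matrix is the kind of input that a least-squares branch-length fit on a fixed topology, as in step (2), would consume.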
4 COMPARISONS WITH OTHER DATABASES FOR THE LOCALIZATION OF GAINS AND LOSSES
We compared the character mapping generated by MANTiS against similar information extracted from InParanoid (O’Brien et al., 2005), MultiParanoid (Alexeyenko et al., 2006), OrthoMCL (Li et al., 2003), and RoundUp (DeLuca et al., 2006). We focused on genes present in human, mouse, and rat because these are the only three mammalian species present in all the databases mentioned above, and their genomes are well annotated. Mapping of specific genes was retrieved using the query system available within MANTiS. Indeed, MANTiS allows building elaborate queries [performed on one or several “statement(s)” executed following priorities and logical operators in a user-friendly interface] concerning gene identity, mapping, and function parameters (biological processes, molecular functions, and gene expression) [see the articles by Tzika et al. (2008) and Milinkovitch et al. (2010b) for details]. The data sets originating from other databases were extracted as follows. First, MultiParanoid was used to merge the SQL tables of orthologs generated by InParanoid (v5.0) for the pairwise species comparisons Homo–Mus, Homo–Rattus, and Mus–Rattus. We retained only the clusters with a confidence value of 1 within, and no discrepancy among, the three comparisons of species pairs. All proteins of the three species were converted to MANTiS characters both for the “with duplications” and “families only” data sets. Second, using the “phyletic pattern form” view of OrthoMCL (v1), we extracted ortholog groups present in H. sapiens, M. musculus, and R. norvegicus and absent from all other species in the database. All cluster representatives were converted to MANTiS characters for the “with duplications” and “families only” data sets. Third, transitively closed phylogenetic profiles were retrieved from the RoundUp (July 2007) orthology database for H. sapiens, M. musculus, and R.
norvegicus, using their most stringent conditions (BLAST e-value 20) gene families in
EVOLUTIONARY DYNAMICS OF GENE DUPLICATION IN BIRDS
Figure 3 Count of gene families vs. the number of chicken genes in amniote gene trees.
the chicken. Gene family size has been found to follow a power-law distribution in animals, a pattern thought to be produced by differential rates of pseudogenization among families (Huynen and van Nimwegen, 1998; Hughes and Liberles, 2008). Moreover, birth–death models and purifying selection appear to account for much of the conservation seen within lineage-specific paralogs (Ota and Nei, 1994; Piontkivska et al., 2002; Piontkivska and Nei, 2003; Eirin-Lopez et al., 2004). Substitution rates within amniote lineages are known to be quite variable, and we expect similar rate variation among paralogous members of chicken gene families. We estimated dates of divergence for each node in our chicken multigene family trees by using a penalized likelihood model, as implemented in the r8s rate analysis program (Sanderson, 2003), and by using 310 million years as an estimate for the Gallus/Homo common ancestor (Benton and Donoghue, 2007). The penalized likelihood model finds the optimal trade-off between maximizing the likelihood of a Poisson process for nucleotide substitution and a penalty term for rate variation between neighboring branches. The weighting of these two terms is determined by a coefficient λ, such that higher values of λ greatly penalize rate variation in favor of a clocklike model, and lower values allow a large amount of rate variation. Using a cross-validation procedure (Sanderson, 2003), we determined the optimal choice of λ from the possible values $10^{-2}$, $10^{-1}$, $1$, $10$, $10^{2}$, and $10^{3}$. Over 54% of our amniote gene trees were best explained by a λ value of $10^{-2}$, indicating that most gene families exhibit substantial rate variation among lineages (Figure 4). This rate variation probably stems from two sources: natural deviations in the clock as commonly found, for example, in phylogenetic analyses of different species; and bursts of adaptive evolution among newly evolved gene family members.
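The trade-off controlled by λ can be made concrete with a small Python sketch. It is a simplified illustration of a penalized-likelihood objective (Poisson log-likelihood minus λ times a squared-difference roughness penalty between parent and child branch rates), not the r8s implementation, and it does not perform the cross-validation step. The branch indices, substitution counts, durations, and the `parent_of` mapping are hypothetical.

```python
import math

def poisson_log_likelihood(counts, rates, durations):
    """Log-likelihood of observed substitution counts per branch.

    Each branch k is assumed to accumulate a Poisson number of substitutions
    with mean rates[k] * durations[k].
    """
    total = 0.0
    for x, r, t in zip(counts, rates, durations):
        expected = r * t
        total += x * math.log(expected) - expected - math.lgamma(x + 1)
    return total

def roughness_penalty(rates, parent_of):
    """Sum of squared rate differences between each branch and its parent."""
    return sum((rates[child] - rates[parent]) ** 2
               for child, parent in parent_of.items())

def penalized_log_likelihood(counts, rates, durations, parent_of, lam):
    """Objective to maximize: log-likelihood minus lam * roughness."""
    return (poisson_log_likelihood(counts, rates, durations)
            - lam * roughness_penalty(rates, parent_of))

# Hypothetical three-branch tree: branch 0 is the parent of branches 1 and 2.
counts = [10, 30, 12]                # substitutions observed per branch
durations = [100.0, 100.0, 100.0]    # branch durations (My)
parent_of = {1: 0, 2: 0}
variable_rates = [0.10, 0.30, 0.12]  # each branch has its own rate
clock_rates = [52 / 300] * 3         # one shared (clocklike) rate

preferred = {}
for lam in (1e-2, 1e3):
    v = penalized_log_likelihood(counts, variable_rates, durations, parent_of, lam)
    c = penalized_log_likelihood(counts, clock_rates, durations, parent_of, lam)
    preferred[lam] = "variable" if v > c else "clock"
print(preferred)
```

With these toy numbers the variable-rate assignment scores higher at λ = $10^{-2}$, whereas the clocklike assignment scores higher at λ = $10^{3}$, which is the behavior described in the text.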
Under the first hypothesis, it might be expected that larger gene families would exhibit more
Figure 4 Branch length rate variation (molecular clock; λ) for amniote gene trees (25 mammals and chicken). (A) Distribution of branch length rate variation binned in different λ values. Low values of λ represent highly variable rates, and high values of λ represent clocklike rates. (B) Comparison of each gene tree’s best fit λ to its size shows little correlation ($r^2$ = 0.0003, p = 0.08, n = 6,901). Values of λ were visually dithered to illustrate density.
rate variation than would small ones, and that the incidence of rate variation would increase with family size. However, we did not find this trend when we regressed λ on gene family size (Figure 4). Many of the gene families best explained by a log λ value of −2 showed levels of divergence that suggest duplication since the Cenozoic era, especially since the Neogene period. For this reason we suspect that much of the rate variation among gene family members may in fact be due to adaptive bursts, because generation time effects among different lineages of birds are expected to influence rate variation only for those gene families that duplicated prior to the chicken’s divergence from other lineages. Of the other categories of substitution rate variation among gene family members, the class best explained by log λ = 3 was the next most common. Substitution rates among gene family members in this category are fairly clocklike. The ages of gene duplications in chicken are distributed exponentially, with most duplications occurring recently (Figure 5). This pattern is consistent with previous analyses, regardless of whether synonymous substitutions or phylogenetic analyses are used (Lynch and Conery, 2000), and suggests that, assuming a relatively constant rate of gene duplication, most duplicates are pseudogenized or eliminated from the genome soon after duplication. However, these results are also consistent with widespread gene conversion between paralogs (Gao and Innan, 2004; Osada and Innan, 2008), in which the duplication event between highly similar sequences would be older than direct sequence comparisons would suggest. With full genome sequences available for only two birds, it is difficult to untangle the relative contributions of gene loss and gene conversion in producing the skewed age distribution of gene duplications. Moreover, this pattern could also suggest that concerted evolution might be more common among chicken gene families than in other groups.
For example, concerted evolution among major histocompatibility complex (MHC) paralogs in birds is thought to occur more frequently,
Figure 5 Age of paralogs on the lineage leading to Gallus gallus that evolved after the divergence between Gallus and Homo. Pivotal events during the evolution of this lineage are noted on the figure. Pg, Paleogene; Ng, Neogene.
and over a shorter time scale, than in mammals (Hess and Edwards, 2002), and the phylogenetic scale over which MHC orthologs can be identified may be smaller in birds than in mammals (but see Burri et al., 2008).
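The exponential age distribution discussed above lends itself to a short worked example. The Python sketch below assumes, purely for illustration, that duplicates arise at a constant rate and are lost at a constant per-gene rate, so that the ages of surviving duplicates follow an exponential distribution whose rate can be estimated by maximum likelihood as the reciprocal of the mean age. It ignores gene conversion and the truncation of ages at the Gallus/Homo divergence, and the list of ages is hypothetical rather than taken from our data.

```python
import math

def fit_exponential_rate(ages):
    """Maximum likelihood estimate of an exponential rate: n / sum(ages)."""
    if not ages or any(a <= 0 for a in ages):
        raise ValueError("ages must be a non-empty list of positive values")
    return len(ages) / sum(ages)

def half_life(rate):
    """Time after which half of the duplicates are expected to be lost."""
    return math.log(2) / rate

# Hypothetical duplication ages in millions of years (Mya).
ages_mya = [2, 5, 5, 8, 12, 20, 30, 45, 67, 106]

loss_rate = fit_exponential_rate(ages_mya)
print("estimated loss rate per My:", round(loss_rate, 4))
print("implied half-life (My):", round(half_life(loss_rate), 1))
```

Under the gene conversion alternative raised in the text, the same skewed distribution would arise without any loss, so a fitted rate of this kind should be read as an upper bound on loss rather than as a direct estimate.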
4 EXAMPLES OF FAMILIES WITH CHICKEN-SPECIFIC DUPLICATIONS
Gene family composition is shaped both by gene gain and loss, yet as other researchers have noted (Furlong, 2005), gene family expansion is easier to detect, especially when annotation is not complete and gaps remain in recent genome builds. We examined several gene families containing lineage-specific expansions in the chicken, using the amniote gene tree rooted at the chicken–human divergence. In Table 1 we describe the dynamics of five representative families: Toll-like receptors, hemoglobin, ovalbumin-related serpins, four subfamilies of olfactory receptors, and keratin. These families were selected for their variety in size, age, and function and because the annotation and family membership could be at least partially cross-validated with recent studies.
4.1 Toll-like Receptors
Temperley and colleagues (2008) describe the evolutionary history and chromosomal location of chicken Toll-like receptors (TLRs), a family that is part of the innate immune system and is characterized by an ancient, highly conserved pathogen-recognition
TABLE 1 Summary of Properties for Specific Gene Families in the Amniotes^a

| Family Name | Number of Amniote Paralogs | Number of Chicken Paralogs | Mean Branch Length from Tips to Duplication in Chicken | Estimated Duplication Time (Mya) | Sequence Divergence Before Chicken Duplications After Amniote Divergence | Log λ |
| Toll-like receptors (TLR2A and TLR2B) | 18 | 2 | 0.075 | 67 | 0.260 | −2 |
| Ovalbumin B serpins (gene X, gene Y, and ovalbumin) | 73 | 3 | 0.141 | 107 | 0.227 | −2 |
| Ovalbumin B serpins (MENT, serpinb10) | 24 | 2 | 0.183 | 203 | 0.094 | −2 |
| Hemoglobin β-globin (βH, ρ, ε) | 69 | 3 | 0.075 | 122 | 0.086 | −2 |
| Olfactory receptors (orthologous to OR5U1 and OR5BF1 in Homo) | 511 | 196 | 0.205 | 281 | 0.320 | −2^b |
| Olfactory receptors (related to COR7 in Gallus) | 12 | 3 | 0.014 | 15 | 0.394 | −2 |
| Olfactory receptors (related to COR1–6 in Gallus) | 60 | 4 | 0.275 | 238 | 0.084 | 2 |
| Olfactory receptors (small cluster on chromosome 1 in Gallus) | 207 | 3 | 0.049 | 30 | 0.421 | −2 |
| β-keratin | 117 | 117 | 0.635 | NA^c | NA | NA |

^a Number of amniote paralogs is defined as the number of genes across all species in the amniote tree (25 mammals and chicken; see Ensembl for exact species). Mean branch length is the average path length (in substitutions/site) between chicken paralogs (tips) and their common ancestor (the first chicken duplication). Across all families this length is on average 0.235 substitution/site. Estimated duplication time is the time of first chicken duplication in millions of years. Sequence divergence before the duplications in chicken is given in substitutions/site between the root of the amniote tree (Gallus/Homo common ancestor) and the first chicken duplication. λ is a molecular-rate variation parameter (low values are highly variable rates; high values are clocklike).
^b This λ is calculated from Homo and Gallus for computational limitations due to large family size.
^c NA, cannot be computed because of chicken-only expansions.
domain that triggers an inflammatory response. These authors discovered that while chickens and humans both have 10 receptors, only four genes in chicken maintain one-to-one orthology with mammalian genes; much gain and loss has occurred in every lineage. In the chicken, their study suggested that a duplication event estimated at 66 million years ago (Mya) gave rise to TLR2A and TLR2B, orthologs to the single TLR2 in mammals. Three other genes, two that duplicated in tandem (TLR1LA and TLR1LB, estimated duplication time 147 Mya), as well as TLR15, have no mammalian counterpart. Other mammalian members have been pseudogenized or fully lost in chicken. Using our automated phylogenetic approach, we are able to analyze one of these chicken-specific expansions: the duplication event that gave rise to TLR2A and TLR2B in chickens. We obtained nearly the same estimate of duplication time (67 Mya) as did Temperley et al. (2008). As with most of the families that we examined in detail, there was considerable rate variation among gene lineages (log λ = −2). The sequence divergence in the Toll-like receptor family is greater than the sequence divergence found for 50.8% of other chicken gene families. We could not analyze the other duplication in chicken, as our data set from Ensembl was missing one of the chicken-specific genes (TLR1LB). TLR15, another gene unique to birds, had no mammalian ortholog, so it also was not included in our amniote gene tree.
4.2 Ovalbumin-Related Serpins
Another gene family with documented chicken-specific expansions is the ov-serpin family, also called ovalbumin-related serpins or clade B serpins. Benarafa and Remold-O’Donnell (2005) examine the phylogenetic relationship between the chicken members (some of which function as egg-white storage proteins) and their mammalian counterparts (involved in diverse roles such as embryogenesis, inflammation regulation, and angiogenesis).
The initial duplication is thought to have occurred very early in the vertebrate lineage; and, like TLRs, the family is also marked by recent lineage-specific expansions and losses. Chickens have 10 members and humans have 13 members. Three genes in chicken (ovalbumin and ovalbumin-like genes X and Y) are paralogs and lack a human ortholog. Another gene, with a single human ortholog, seems to have duplicated to produce the chicken genes serpinb10 and MENT (mature erythrocyte nuclear termination stage-specific protein). The remaining family members from chicken each have single human orthologs. Among the two subfamilies with chicken-specific expansions, rate variation is substantial (log λ = −2). Moreover, the sequence divergence in the subfamily containing ovalbumin and ovalbumin-like genes X and Y is greater than the sequence divergence found for 69.6% of other chicken gene families that have duplicated since the chicken–mammalian split. The sequence divergence in the other ovalbumin subfamily (serpinb10 and MENT) is greater than the sequence divergence found for 77.6% of other chicken gene families that have duplicated since the chicken–mammalian split.
4.3 Hemoglobin
Metabolic rate is an important trait that governs many organismal characters, from growth strategies to sustained physical activity. In amniotes, an elevated metabolism (endothermy) has only evolved within two extant groups, birds and mammals, although paleontologists suspect that many extinct dinosaurian lineages possessed endothermy
(de Ricqlès et al., 2001; Horner et al., 2001). Whereas the typical mammalian and avian condition is homeothermy (roughly constant body temperature), some birds, such as swifts, hummingbirds, and nightjars, are facultatively poikilothermic, a condition in which their usually elevated body temperature can vary over a wider range than that seen in mammals (e.g., Lane et al., 2004). The hemoglobin multigene family is closely associated with metabolism and the respiratory system. Hemoglobin, a multisubunit protein, has rapidly diversified within vertebrate lineages (Gribaldo et al., 2003; Cooper et al., 2006; Opazo et al., 2008; Alev et al., 2009). For example, α-globin underwent rapid duplication and deletion in mammals (Hoffmann et al., 2008). Based on an analysis of the platypus genome, which incorporated information from flanking loci, a recent model (Patel et al., 2008) proposes that the β-globin paralogs arose from a single transposition in the amniote ancestor followed by independent duplication in birds and mammals. In our data set, the β-globin paralogs βH, ρ, and ε appear as a chicken-specific expansion, consistent with this model. Rate diversity is high in this family as well, and sequence evolution (point mutations) in β-globin paralogs is similar to that seen in the Toll-like receptor family (the sequence divergence in the β-globin family is greater than the sequence divergence found for 50.8% of other chicken gene families).
4.4 Olfactory Receptors
Olfaction has recently gained much recognition as an important sensory modality for birds (Nevitt et al., 2008; O’Dwyer et al., 2008; Steiger et al., 2008, 2009, 2010; Warren et al., 2010). Historically, birds were assumed to communicate primarily via the visual or auditory systems, but behavioral and genomic data suggest that chemosensory perception plays a larger role.
The chicken genome paper remarked on the surprisingly large group of avian-specific olfactory receptors (218 genes were identified), whereas Steiger et al. (2009) found 479 genes, including 111 pseudogenes. Our bioinformatic approach detected 196 genes belonging to this subfamily; the discrepancy is perhaps due in part to different genome builds, but no doubt also to our fully automated approach, which involves no manual inspection. As in other recent work (Lagerström et al., 2006), we also identified other families of olfactory receptors with small expansions in chicken (three to four genes in our analysis). These include a cluster associated with the previously identified COR1–6 genes (chicken olfactory receptor genes) on chromosome 5, a second cluster on chromosome 10 related to COR7 (Figure 2), and a third cluster on chromosome 1. Interestingly, the subfamilies have very different extents of sequence divergence ranging from 26.7% (family containing COR7) to 88.4% (family containing COR1–6). The latter gene family also had a more clocklike rate than other families, which together with the large sequence divergence suggests that it is among the oldest gene duplications in chicken.
4.5 Keratin
The evolution of feathers in theropod dinosaurs was a major innovation that probably provided insulation for metabolically active animals and ornamentation for display, and in one lineage transformed arms into wings (Ji et al., 1998; Zhang and Zhou, 2000; Currie and Chen, 2001; Norell et al., 2002; Sawyer and Knapp, 2003). β-Keratins differ from the keratins found in nonavian reptiles and are the basic structural elements of
feathers and therefore a gene family vital for the success of birds. In the publication of the first genome draft, the International Chicken Genome Sequencing Consortium (2004) noted the large expansion of the avian-specific keratin gene family, estimated at around 150 members. This avian keratin family, which encodes proteins forming feathers and scales (Sawyer et al., 2000), is functionally and evolutionarily distinct from the mammalian hair-specific α-keratin, but recent work suggests that chickens possess α-keratin genes and that these genes are expressed in avian digits (Eckhart et al., 2008). Other components of hair, the keratin-associated proteins, have no members in the chicken genome (Wu et al., 2008). Within nonavian reptiles, β-keratins probably duplicated by retrotransposition, resulting in the loss of introns in some paralogs; in birds, all paralogs have lost introns. Unequal crossover is also thought to expand and contract keratin gene arrays in birds, resulting in a tandem organization of multiple paralogs (Toni et al., 2007). We find that the amount of sequence evolution in keratin is among the highest for any chicken gene family (greater than the sequence divergence found for 98% of other chicken gene families). This high divergence is likely a consequence of the absence of comparisons with other reptiles, but it could also be due to the adaptive significance of these proteins within birds.
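The percentile statements used throughout this section (for example, divergence “greater than the sequence divergence found for 98% of other chicken gene families”) can be computed as in the following minimal Python sketch. The function name and the divergence values are hypothetical, and which divergence measure is ranked (for instance, the mean tip-to-duplication branch length of Table 1) is an assumption left to the user.

```python
def percent_of_families_exceeded(value, other_values):
    """Percentage of other families whose divergence is strictly below `value`."""
    if not other_values:
        raise ValueError("need at least one other family for comparison")
    below = sum(v < value for v in other_values)
    return 100.0 * below / len(other_values)

# Hypothetical divergences (substitutions/site) for five other gene families.
other_families = [0.05, 0.10, 0.20, 0.30, 0.50]
focal_family = 0.26

rank = percent_of_families_exceeded(focal_family, other_families)
print(f"focal family exceeds {rank:.1f}% of other families")
```

A strict comparison is used here, so families tied with the focal family are not counted as exceeded; other tie conventions would shift the percentage slightly.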
5 PROSPECTS AND CONCLUSIONS
Reptilia, including birds and nonavian reptiles, is the sister group of mammals and as such holds an important phylogenetic position for shedding light on patterns of gene duplication in amniotes (Wang et al., 2006). Reptiles are arguably more diverse than mammals in many traits; with about 17,000 species (about 10,000 birds and 7,000 nonavian reptiles) they are substantially more species-rich than mammals (about 5,000 species) and possess a greater diversity of sex chromosome and sex determination systems (Organ and Janes, 2008). The chicken is currently the sole member of Reptilia with a draft genome and as such provides the only point of comparison of genome dynamics between mammals and their sister group. A greater understanding of genome and multigene family dynamics in mammals will undoubtedly require greater genome sampling and characterization in Reptilia. Gene duplications and the families they produce are vital for generating the thread with which evolution weaves new adaptations and species. We have developed a pipeline for phylogenomic analysis of gene duplication in the chicken lineage, but our approach can be applied easily to any particular clade of interest. Our approach rests on the assumption that gene orthology and paralogy are best identified through phylogenetic analysis, and we delimit chicken-specific gene duplications by an approach (Figure 1) that combines initial identification and collection of gene copies across many vertebrates that show significant sequence similarity in Ensembl, followed by phylogenetic analysis of these gene sets; identification of particular nodes in these gene trees that correspond to gene duplications, in our case the mammal–bird divergence; identification of those gene clusters that diversify from these particular nodes; and statistical analysis of the gene trees collected.
Many of the duplications we have identified here as “chicken-specific” in fact will be found to have duplicated in ancestors of the chicken, since orthologs of many chicken genes will no doubt be discovered in other reptile genomes as they emerge. Nonetheless, using an approximate time scale (Figure 5) we can estimate which chicken gene paralogs might be found in upcoming
reptilian genome projects based on their estimated timing of duplication relative to the divergence times of species whose genomes are being compared. Our approach has the advantage of providing an objective means of identifying chicken-specific gene duplications, but of course when conducted on a genome-wide scale, it will miss some gene family members that manual curation will identify; we have illustrated this with some specific examples (Table 1). The loss of detail for some gene families is offset by the ability to study genome-wide distributions of multigene family dynamics; both approaches are required to provide an informed view of the dynamics of multigene family evolution in birds and relatives. Phylogenomic approaches such as those presented here have only just begun to provide a window into the dynamics and importance of gene duplication within organisms. For example, nonprotein coding RNA paralogs are dispersed throughout the chicken genome; this, along with an unusual paucity of nonprotein coding RNA pseudogenes, suggests that they may not undergo the same processes of duplication (unequal crossover and retrotransposition) that characterize protein coding genes (Hillier et al., 2004). Currently available data are insufficient to address this and other hypotheses, because as of the time of this writing the genome of only one reptile species has been sequenced. But progress is quickly being made with the publication of the zebra finch (Taeniopygia guttata) genome (Warren et al., 2010) and the release of the anole lizard (Anolis carolinensis) genome. An increase in the number of genomes will permit more detailed quantitative comparison of the evolutionary dynamics of gene duplication in amniotes and other lineages and will help clarify the role of these gene duplications in organismal diversification.
REFERENCES
Alev C, et al. 2009. Genomic organization of zebra finch alpha and beta globin genes and their expression in primitive and definitive blood in comparison with globins in chicken. Dev Genes Evol 219:353–360.
Benarafa C, Remold-O’Donnell E. 2005. The ovalbumin serpins revisited: perspective from the chicken genome of clade B serpin evolution in vertebrates. Proc Natl Acad Sci USA 102(32):11367–11372.
Benton MJ, Donoghue PCJ. 2007. Paleontological evidence to date the tree of life. Mol Biol Evol 24(1):26–53.
Burri R, et al. 2008. Evolutionary patterns of MHC class II B in owls and their implications for the understanding of avian MHC evolution. Mol Biol Evol 25:1180–1191.
Cooper SJB, et al. 2006. The mammalian alpha(D)-globin gene lineage and a new model for the molecular evolution of alpha-globin gene clusters at the stem of the mammalian radiation. Mol Phylogenet Evol 38:439–448.
Currie PJ, Chen P-J. 2001. Anatomy of Sinosauropteryx prima from Liaoning, northeastern China. Can J Earth Sci 38(12):1705–1727.
de Ricqlès AJ, et al. 2001. The bone histology of basal birds in phylogenetic and ontogenetic perspectives. In Gauthier J, Gall LF (eds.), New Perspectives on the Origin and Early Evolution of Birds: Proceedings of the International Symposium in Honor of John H. Ostrom. New Haven, CT: Peabody Museum of Natural History, pp. 411–426.
Eckhart L, et al. 2008. Identification of reptilian genes encoding hair keratin-like proteins suggests a new scenario for the evolutionary origin of hair. Proc Natl Acad Sci USA 105:18419–18423.
Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797.
Eirin-Lopez JM, et al. 2004. Birth-and-death evolution with strong purifying selection in the histone H1 multigene family and the origin of orphon H1 genes. Mol Biol Evol 21(10):1992–2003.
Eisen JA. 1998. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 8:163–167.
Furlong RF. 2005. Insights into vertebrate evolution from the chicken genome sequence. Genome Biol 6(2).
Gao L-Z, Innan H. 2004. Very low gene duplication rate in the yeast genome. Science 306:1367–1370.
Gregory TR. 2002. A bird’s-eye view of the C-value enigma: genome size, cell size, and metabolic rate in the class Aves. Evolution 56(1):121–130.
Gribaldo S, et al. 2003. Functional divergence prediction from evolutionary analysis: a case study of vertebrate hemoglobin. Mol Biol Evol 20(11):1754–1759.
Gu Z, et al. 2002. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet 18(12):609–613.
Guindon S, Gascuel O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52(5):696–704.
Haas NB, et al. 2001. Subfamilies of CR1 non-LTR retrotransposons have different 5′ UTR sequences but are otherwise conserved. Gene 265:175–183.
Heger A, Ponting CP. 2007. Evolutionary rate analyses of orthologs and paralogs from 12 Drosophila genomes. Genome Res 17(12):1837–1849.
Hess CM, Edwards SV. 2002. The evolution of the major histocompatibility complex in birds. Bioscience 52(5):423–431.
Hillier LW, et al. 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432(7018):695–716.
Hittinger CT, Carroll SB. 2007. Gene duplication and the adaptive evolution of a classic genetic switch. Nature 449:677–681.
Hoffmann FG, et al. 2008. Rapid rates of lineage-specific gene duplication and deletion in the alpha-globin gene family. Mol Biol Evol 25(3):591–602.
Horner JR, et al. 2001. Comparative osteology of some embryonic and perinatal archosaurs: developmental and behavioral implications for dinosaurs. Paleobiology 27(1):39–58.
Hubbard TJP, et al. 2007. Ensembl 2007. Nucleic Acids Res 35(database issue):D610–D617.
Huerta-Cepas J, et al. 2007. The human phylome. Genome Biol 8:R109.
Hughes T, Liberles DA. 2008. The power-law distribution of gene family size is driven by the pseudogenisation rate’s heterogeneity between gene families. Gene 414(1–2):85–94.
Huynen MA, van Nimwegen E. 1998. The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol 15(5):583–589.
Ji Q, et al. 1998. Two feathered dinosaurs from northeastern China. Nature 393:753–761.
Kondrashov FA, et al. 2002. Selection in the evolution of gene duplications. Genome Biol 3(2):1–9.
Lagerström MC, et al. 2006. The G protein–coupled receptor subset of the chicken genome. PLoS Comput Biol 2(6):493–507.
Lane JE, et al. 2004. Daily torpor in free-ranging whip-poor-wills (Caprimulgus vociferus). Physiol Biochem Zool 77:297–304.
Li W-H. 2006. Molecular Evolution. Sunderland, MA: Sinauer Associates.
Li H, et al. 2006. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res 34:D572–D580.
Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155.
Lynch M, Force A. 2000. The probability of duplicate-gene preservation by subfunctionalization. Genetics 154:459–473.
Maniatis T, Tasic B. 2002. Alternative pre-mRNA splicing and proteome expansion in metazoans. Nature 418(6894):236–243.
Nei M, Gu X, Sitnikova T. 1997. Evolution by the birth-and-death process in multigene families of the vertebrate immune system. Proc Natl Acad Sci 94(15):7799–7806.
Nevitt GA, et al. 2008. Evidence for olfactory search in wandering albatross, Diomedea exulans. Proc Natl Acad Sci 105(12):4576–4581.
Norell MA, et al. 2002. “Modern” feathers on a non-avian dinosaur. Nature 416:36–37.
Nowak MA, et al. 1997. Evolution of genetic redundancy. Nature 388(6638):167–171.
O’Dwyer TW, et al. 2008. Examining the development of individual recognition in a burrow-nesting procellariiform, the Leach’s storm-petrel. J Exp Biol 211(3):337–340.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Opazo JC, Hoffmann FG, Storz JF. 2008. Differential loss of embryonic globin genes during the radiation of placental mammals. Proc Natl Acad Sci 105:12950–12955.
Organ CL, Janes DE. 2008. Evolution of sex chromosomes in Sauropsida. Integrative Comparative Biol 48(4):512–519.
Osada N, Innan H. 2008. Duplication and gene conversion in the Drosophila melanogaster genome. PLoS Genet 4(12):e1000305.
Ota T, Nei M. 1994. Divergent evolution and evolution by the birth-and-death process in the immunoglobulin V-H gene family. Mol Biol Evol 11(3):469–482.
Patel VS, et al. 2008. Platypus globin genes and flanking loci suggest a new insertional model for beta-globin evolution in birds and mammals. BMC Biol 6(34).
Piontkivska H, Nei M. 2003. Birth-and-death evolution in primate MHC class I genes: divergence time estimates. Mol Biol Evol 20(4):601–609.
Piontkivska H, et al. 2002. Purifying selection and birth-and-death evolution in the histone H4 gene family. Mol Biol Evol 19(5):689–697.
Rasmussen MD, Kellis M. 2007. Accurate gene-tree reconstruction by learning gene- and species-specific substitution rates across multiple complete genomes. Genome Res 17(12):1932–1942.
Sanderson MJ. 2003. r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics 19(2):301–302.
Sarrias MR, et al. 2004. The scavenger receptor cysteine-rich (SRCR) domain: an ancient and highly conserved protein module of the innate immune system. Crit Rev Immunol 24(1):1–37.
Sawyer RH, Knapp LW. 2003. Avian skin development and the evolutionary origin of feathers. J Exp Zool B 298B(1):57–72.
Sawyer RH, et al. 2000. The expression of beta (β) keratins in the epidermal appendages of reptiles and birds. Am Zool 40(4):530–539.
Shedlock AM. 2006. Phylogenomic investigation of CR1 LINE diversity in reptiles. Syst Biol 55(6):902–911.
Shedlock AM, et al. 2007. Phylogenomics of non-avian reptiles and the structure of the ancestral amniote genome. Proc Natl Acad Sci 104:2767–2772.
Steiger SS, et al. 2008. Avian olfactory receptor gene repertoires: evidence for a well-developed sense of smell in birds? Proc R Soc B 275(1649):2309–2317.
Steiger SS, et al. 2009. A comparison of reptilian and avian olfactory receptor gene repertoires: species-specific expansion of group gamma genes in birds. BMC Genomics 10:446.
Steiger SS, et al. 2010. Evidence for adaptive evolution of olfactory receptor genes in 9 bird species. J Hered 101(3):325–333.
Storm CEV, Sonnhammer ELL. 2003. Comprehensive analysis of orthologous protein domains using the HOPS database. Genome Res 13:2353–2362.
Temperley ND, et al. 2008. Evolution of the chicken Toll-like receptor gene family: a story of gene gain and gene loss. BMC Genomics 9(62).
Toni M, et al. 2007. Hard (beta-) keratins in the epidermis of reptiles: composition, sequence, and molecular organization. J Proteome Res 6(9):3377–3392.
Wang Z, et al. 2006. Tuatara (Sphenodon) genomics: BAC library construction, sequence survey, and application to the DMRT gene family. J Hered 97(6):541–548.
Wapinski I, et al. 2007. Natural history and evolutionary principles of gene duplication in fungi. Nature 449:54–61.
Warren WC, et al. 2010. The genome of a songbird. Nature 464:757–762.
Wu DD, et al. 2008. Molecular evolution of the keratin associated protein gene family in mammals, role in the evolution of mammalian hair. BMC Evol Biol 8(241).
Yuri T, et al. 2008. Duplication of accelerated evolution and growth hormone gene in passerine birds. Mol Biol Evol 25(2):352–361.
Zelano B, Edwards SV. 2002. An Mhc component to kin recognition and mate choice in birds: predictions, progress, and prospects. Am Nat 160:S225–S237.
Zhang J. 2003. Evolution by gene duplication: an update. Trends Ecol Evol 18(6):292–298.
Zhang F, Zhou Z. 2000. A primitive enantiornithine bird and the origin of feathers. Science 290(5498):1955–1959.
Zhang J, et al. 1998. Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proc Natl Acad Sci USA 95:3708–3713.
Zmasek CM, Eddy SR. 2002. RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinf 3:14.
15 Gene and Genome Duplications in Plants
PAMELA S. SOLTIS Florida Museum of Natural History, University of Florida, Gainesville, Florida
J. GORDON BURLEIGH Department of Biology, University of Florida, Gainesville, Florida
ANDRE S. CHANDERBALI and MI-JEONG YOO Florida Museum of Natural History, University of Florida, Gainesville, Florida; Department of Biology, University of Florida, Gainesville, Florida
DOUGLAS E. SOLTIS Department of Biology, University of Florida, Gainesville, Florida
Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles. Copyright © 2010 Wiley-Blackwell
1 INTRODUCTION
Many plants have large and complex genomes, with genome size varying 1000-fold (Bennett and Leitch, 2004). Although large genome size in conifers has been attributed to expansion of retrotransposons, the large size and complexity of most plant genomes appear to be due to gene duplication, both through expansion of gene families [e.g., SKP1, which shows extensive gene birth and death (Kong et al., 2004); MADS-box genes, which are more numerous in plants than in other eukaryotes (Becker and Theissen, 2003)] and through whole-genome duplication (polyploidy; e.g., Vision et al., 2000). Current research is exploring the role of gene and genome duplication in genetic interactions, floral development, morphological diversification, speciation, and adaptation. These once disparate areas have recently been unified by the availability of genomic data and by conceptual and informatic developments that allow genomic methods to be applied to nonmodel plants. The result is a paradigm shift in our understanding of the importance and pervasiveness of gene and genome duplication.
Gene duplication has long been recognized as the ultimate source of evolutionary change (Ohno, 1970), and recent theoretical developments have clarified the possible fates of duplicate genes (Lynch and Conery, 2000). Both members of a duplicate gene pair may be maintained, retaining their original function, or they may diverge. Divergence may follow any of the following paths: retention of one copy and loss/silencing of the other, leading to a pseudogene; retention of the ancestral function by one copy
and acquisition of a new function by the other (neofunctionalization; Lynch and Conery, 2000); or partitioning of the original function or domain of expression between the two copies (subfunctionalization; Lynch and Conery, 2000; Lynch and Force, 2000; e.g., if the ancestral copy had functioned in both flowers and leaves, one copy of the duplicate pair might function in the flower and the other might function in the leaves). Clarification of hypotheses of gene fate has helped to stimulate new areas of research in gene expression and gene function (e.g., Adams et al., 2003, 2004; Adams and Wendel, 2005a).
But how do gene duplications arise? What is the ultimate source of novel genetic material? Some plants, such as Clarkia (Onagraceae, evening primrose family), are prone to extensive chromosomal rearrangements (Lewis and Lewis, 1955), which have generated dozens of gene duplications (e.g., Gottlieb, 1974, 1977; Soltis et al., 1987). Such duplicate gene pairs are typically unlinked because they arose via translocation of one chromosomal segment onto another chromosome. Linked duplicates may arise via tandem duplication; the evolutionary dynamics of linked duplicate genes would be expected to differ from those of unlinked genes. Perhaps the greatest source of duplicate genes in plants is whole-genome duplication (WGD; i.e., polyploidy). Evolutionary dynamics of genes duplicated via polyploidy are also likely to differ from those accompanying either tandem duplications or small-scale chromosomal duplications, most importantly because the stoichiometry of duplicate genes is maintained following WGD, allowing for divergence and modification not only of single genes but, conceivably, of gene networks.
Polyploidy has long been considered an important mechanism of speciation and of genomic change (e.g., Darlington, 1937; Clausen et al., 1945; Stebbins, 1950; Grant, 1981), but its fundamental role in genome evolution, not only in plants but also in other eukaryotes such as yeast (Kellis et al., 2004) and vertebrates [reviewed by Furlong and Holland (2004)], is only beginning to be appreciated.
In this chapter we provide an overview of WGD in plants: How prevalent is it? How have our views changed with the acquisition of genomic data? How can it be detected? We follow these questions with an exploration of the consequences of gene duplication (regardless of its origin), with regard to (1) retention or loss of duplicate genes and (2) changes in expression or function. We address particularly the role of duplicate genes in development, with an eye to possible shifts in developmental programs and to morphological novelty associated with gene duplication. We conclude with a brief perspective on the relative contributions of duplicate gene retention, gene loss/silencing, neofunctionalization, and subfunctionalization in plant genomes. We focus on angiosperms because there are more data for this large clade than for any other, but we encourage the investigation of gene duplication and genome structure and evolution in nonangiosperms as well.
2 WHOLE-GENOME DUPLICATION (POLYPLOIDY) IN PLANTS
2.1 Traditional Views
It has long been recognized that polyploidy is an important evolutionary force in plants, particularly ferns and angiosperms. Polyploidy has been studied in plants for 100 years (Lutz, 1907; Gates, 1909; Kuwada, 1911), with early investigators of the topic now comprising a “who’s who” of prominent plant evolutionists and geneticists
(e.g., Winge, 1917; Müntzing, 1936; Darlington, 1937; Clausen et al., 1945; Stebbins, 1947, 1950; Löve and Löve, 1949; Lewis, 1980a; Grant, 1981). Following Stebbins’s seminal (1940, 1947, 1950) overviews of polyploidy, considerable research effort was focused on polyploid complexes and groups of hybridizing species. Allopolyploids are those formed via hybridization between two parental species, coupled with genome doubling; autopolyploids form from within a single species. Hence, polyploidy represented a major portion of biosystematic research during this “classic period,” up until the late 1980s. At that time, the ease of application of DNA methods resulted in fewer and fewer studies of polyploidy, as plant systematists focused much more attention on phylogeny reconstruction [reviewed by Soltis et al. (2003)].
The considerable research conducted on polyploid systems during the classic period resulted in the establishment of what may now best be termed the traditional tenets of polyploid evolution. One of the prominent early views was that although polyploidy was relatively common, the genetic fate of polyploid species was not promising, leading to the view of polyploids as evolutionary dead-ends. For example, the parental genomes were considered largely static following polyploidy. In addition, autopolyploids were considered rare and maladaptive, and each polyploid species, whether allo- or autopolyploid, was considered to have a single origin. All of these views have been turned on their heads: Today, polyploidy is not viewed as maladaptive, polyploid genomes are known to be dynamic, autopolyploids are common and evolutionarily diverse, and nearly all polyploid species examined to date show evidence of recurrent formation.
There was also considerable effort during the classic period to provide estimates of the frequency of polyploidy in plants, particularly within angiosperms.
A number of prominent authors attempted this exercise by using published chromosome numbers and establishing hypotheses for the presumed cutoff between “diploid” and “polyploid” chromosome numbers. Thus, estimates varied depending on the base chromosome number cutoff used as well as on the taxa considered. For example, both Müntzing (1936) and Darlington (1937) suggested that about 50% of all angiosperm species were polyploid, while Stebbins (1950) later estimated the frequency of polyploidy in angiosperms at 30 to 35%. Using a cutoff point of n = 14, Grant (1963, 1981) inferred that 47% of all flowering plants were of polyploid origin and proposed that 58% of monocots and 43% of “dicots” (his usage) were polyploid. Using additional chromosome counts and the same methods and cutoff as Grant, Goldblatt (1980) subsequently recalculated the frequency of polyploidy in the monocots to be 55%. Goldblatt also suggested that Grant’s (1963) estimate was too conservative; he thought that taxa with chromosome numbers above n = 9 or 10 probably have polyploidy in their evolutionary history. Using these lower numbers, he calculated that at least 70%, and perhaps 80%, of monocots are of polyploid origin. Lewis (1980b) applied an approach similar to Goldblatt’s to dicots and estimated that 70 to 80% were polyploid. More recently, Masterson (1994) used the novel approach of comparing leaf guard cell size in fossil and extant taxa from a few angiosperm families (Platanaceae, Lauraceae, Magnoliaceae) to estimate polyploid occurrence through time. Because guard cell size is often much larger in polyploids than in diploids, this provided a method for estimating whether the fossil taxa were diploid (smaller guard cells than extant taxa) or polyploid (the same or larger guard cell sizes vs. extant species). From these
comparisons, Masterson (1994) estimated that 70% of all angiosperms had experienced one or more episodes of polyploidy in their ancestry.
2.2 Genetic and Genomic Approaches to Understanding Polyploidy
Genomic data have provided unprecedented insights into the genetic and genomic consequences of genome doubling, dramatically changing our views of polyploid evolution and resulting in the formulation of a new polyploid paradigm. For example, genomic data have provided novel insights into the frequency and timing of ancient polyploid events in angiosperms. Genomic investigations reveal that flowering plants possess genomes with considerable gene redundancy; much of this redundancy may be the result of ancient episodes of polyploidy. Complete sequencing of the very small genome of Arabidopsis thaliana, long considered the archetype of diploidy, revealed numerous duplicate genes and suggested two or three rounds of WGD (Vision et al., 2000; Bowers et al., 2003). Because Arabidopsis has only five chromosomes, it was not previously classified as a polyploid using the common cutoff criteria noted above. But genomic data clearly indicate a recent round of duplication, perhaps during the early evolution of the Brassicaceae, with an earlier round of duplication that occurred deeper in Brassicales, and a third event that may coincide with the early diversification of eudicot angiosperms [reviewed by D.E. Soltis et al. (2009)]. Significantly, all other angiosperms whose nuclear genomes have been completely sequenced show evidence of WGD events: Oryza (Paterson et al., 2004), Populus (Tuskan et al., 2006), Vitis (Jaillon et al., 2007; Velasco et al., 2007), and Carica (Ming et al., 2008). These genomic investigations provide evidence for a number of phylogenetically important ancient genome doubling events, including a proposed paleohexaploid event that may have occurred close to the origin of the eudicots [reviewed by D.E. Soltis et al. (2009)].
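The chromosome-number cutoffs described in Section 2.1 can be made concrete with a short sketch. The Python below is illustrative only: the function names, the inclusive treatment of the cutoff (a haploid number at or above the cutoff is scored as polyploid), and the list of haploid numbers are assumptions made for this example, not data or procedures taken from the studies cited. It shows why the estimated frequency of polyploidy shifts with the cutoff chosen, and why a species with five chromosomes, such as Arabidopsis, is scored as diploid under either cutoff.

```python
def is_polyploid_by_cutoff(haploid_number, cutoff):
    """Score a species as polyploid if its haploid number n reaches the cutoff.

    Assumption: the cutoff is inclusive (n >= cutoff); the cited authors
    may have drawn the line slightly differently.
    """
    return haploid_number >= cutoff


def polyploid_fraction(haploid_numbers, cutoff):
    """Fraction of species scored as polyploid under a given cutoff."""
    scored = [is_polyploid_by_cutoff(n, cutoff) for n in haploid_numbers]
    return sum(scored) / len(scored)


# Hypothetical haploid chromosome numbers for eight species (not real data).
example_counts = [5, 7, 9, 11, 12, 14, 18, 21]

for cutoff in (14, 10):
    share = polyploid_fraction(example_counts, cutoff)
    print(f"cutoff n = {cutoff}: {share:.1%} scored as polyploid")

# A species with n = 5 (the chromosome number given for Arabidopsis above)
# is scored as diploid under both cutoffs.
print(is_polyploid_by_cutoff(5, 14), is_polyploid_by_cutoff(5, 10))
```

Lowering the cutoff from 14 to 10 raises the share of this made-up sample scored as polyploid, which mirrors the way the published estimates rose when lower cutoffs were adopted. No cutoff of this kind can flag a lineage whose chromosome number has been reduced since an ancient duplication.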
ESTs (expressed sequence tags), now available for many angiosperm species, are another major source of genomic data that can be used to infer ancient polyploidy. The thousands of ESTs available provide a useful genomic “snapshot,” permitting determination of ancient genome duplication events as well as a rough approximation of the timing of those events. Lynch and Conery (2000) developed a method that evaluates the frequency distribution of per-site synonymous divergence levels (Ks) for pairs of duplicate genes (see below). A genomewide duplication event results in thousands of paralogous pairs that are all duplicated simultaneously. Evidence of past genome duplications can be seen as peaks in the distribution of Ks values for sampled paralogous pairs (Lynch and Conery, 2000). Importantly, this method does not require information on the position of genes within the genome, and can therefore be applied to any species for which there are moderate-to-large EST sets. When the Ks approach was applied to ESTs from diverse angiosperms, most species showed evidence of ancient polyploidy, and sometimes there was evidence for multiple events. For example, Blanc and Wolfe (2004) used Ks values and found evidence of ancient polyploidy in Zea (maize), Glycine (soybean), Gossypium (cotton), and Solanum (tomato and potato). Similarly, Schlueter et al. (2004) found evidence in eight major crop species, including Glycine, Medicago (alfalfa), Solanum, Zea, Sorghum, Oryza, and Hordeum (barley), and inferred multiple independent genome duplications in Fabaceae (legumes), Solanaceae (potatoes and tomatoes), and Poaceae (grasses). A recent survey of ESTs from Asteraceae suggests that members of this family are also ancient polyploids (Barker et al., 2008).
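The idea of reading duplication events from peaks in a Ks distribution can be sketched in a few lines. The Python below is a simplified illustration, not the procedure of Lynch and Conery (2000) or of the EST studies cited: it assumes that Ks values for paralogous pairs have already been estimated, it bins them into a histogram, and it flags bins that interrupt the otherwise steady decline with a simple local-maximum rule. The function names, bin width, threshold (`min_ratio`), and pair counts are all hypothetical choices made for the example.

```python
def ks_histogram(ks_values, bin_width=0.1, max_ks=2.0):
    """Count paralog pairs per Ks bin; values outside [0, max_ks) are ignored."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if 0 <= ks < max_ks:
            # min() guards against floating-point rounding at the upper edge.
            counts[min(int(ks / bin_width), n_bins - 1)] += 1
    return counts


def secondary_peaks(counts, min_ratio=1.2):
    """Return indices of bins that break the expected steady decline.

    A bin is reported when it holds more pairs than the next bin and at
    least min_ratio times as many as the previous bin.
    """
    peaks = []
    for i in range(1, len(counts) - 1):
        left, mid, right = counts[i - 1], counts[i], counts[i + 1]
        if left > 0 and mid >= min_ratio * left and mid > right:
            peaks.append(i)
    return peaks


# Hypothetical pair counts per 0.1-wide Ks bin: a steady decline with an
# excess of pairs centred on Ks of about 0.85 (not real data).
example_counts = [181, 148, 121, 99, 81, 67, 55, 75, 97, 60,
                  24, 20, 16, 13, 11, 9, 7, 6, 5, 4]

# Expand the counts into one Ks value per pair, placed at each bin centre.
example_ks = [(i + 0.5) * 0.1 for i, c in enumerate(example_counts) for _ in range(c)]

for i in secondary_peaks(ks_histogram(example_ks)):
    print(f"possible duplication peak in Ks bin {i * 0.1:.1f} to {(i + 1) * 0.1:.1f}")
```

With these made-up counts, the only bin flagged is the one covering Ks 0.8 to 0.9. Real Ks distributions are noisier than this, so a rule this simple would need smoothing or a statistical model before its peaks could be trusted.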
Cui et al. (2006) applied the Ks approach to ESTs from several basal angiosperms and found evidence for episodes of ancient polyploidy in Nuphar advena (Nymphaeaceae; water lilies), the sister to all other living angiosperms following Amborella (Soltis et al., 2005). They also found the signature of ancient polyploidy in two members of the magnoliid clade [Persea americana (avocado: Lauraceae) and Liriodendron tulipifera (Magnoliaceae)] and Saruma henryi (Aristolochiaceae). In addition, Cui et al. (2006) detected WGDs in the basal eudicot Eschscholzia californica (California poppy; Papaveraceae) and the basal monocot Acorus americanus (Acoraceae). In fact, several genomewide duplication events appear to have occurred in Nuphar (Cui et al., 2006). One of these events appears to be restricted to Nymphaeaceae, but an older event is evident, and this may date to the common ancestor of all angiosperms except Amborella. Significantly, however, Amborella lacks evidence of ancient polyploidy, despite the use of very large EST data sets (D.E. Soltis et al., 2009). The analysis of Ks values by Cui et al. (2006) also provided weak and inconclusive evidence of a still older polyploidization in Persea that may correspond to the old event suggested for the common ancestor of all angiosperms except Amborella. Alternatively, it could perhaps even predate the angiosperms, but testing these hypotheses will require comprehensive transcriptome sequencing for additional basal angiosperms and a complete Amborella genome sequence (Soltis et al., 2008). Investigation of other taxa using other genetic methods suggests additional ancient polyploidy events.
For example, “diploid” members of Brassica are, at the least, ancient tetraploids (Kowalski, 1994; Lan et al., 2000; Quiros, 2001) and perhaps ancient hexaploids based on analyses of linkage maps: a number of genes (and blocks of genes) are clearly represented multiple times (e.g., Lagercrantz and Lydiate, 1996; Lukens et al., 2004). Genomic data provide evidence for other lineage-specific duplications: one within the legumes (Schlueter et al., 2004; Pfeil et al., 2005; Cannon et al., 2006) and another occurring in Capparaceae (Schranz and Mitchell-Olds, 2006).
Genomic studies now raise the question: Are there really any true diploids (D.E. Soltis et al., 2009a)? Furthermore, the classic question of what percentage of angiosperms is of polyploid origin now appears moot. The evidence strongly suggests that all angiosperms may be ancient polyploids. The major question is no longer “How many angiosperms are polyploid?” but rather, “How many episodes of genome duplication have various angiosperm lineages experienced?” Finally, the role that polyploidy is envisioned to play in angiosperm (and plant) evolution has changed dramatically over the past 15 to 20 years. For example, in a recent study, Fawcett et al. (2009) estimated that ancient polyploid events occurred at the same time (about 65 million years ago) in several diverse angiosperm lineages, suggesting the possibility of a shared causal factor. Interestingly, this estimate corresponds with the Cretaceous–Tertiary (K-T) boundary. Hence, the authors propose that genome doubling was a catalyst for the survival and/or diversification of these lineages following the extinction event at the K-T boundary.
Similarly, the correspondence of ancient polyploid events to the origin of many species-rich plant clades, including Fabaceae, Asteraceae, eudicots, monocots, and even angiosperms as a whole, has also prompted speculation about the role of polyploidy in stimulating major bursts of plant diversification (D.E. Soltis et al., 2009a). In the light of such speculation, it is worthwhile recalling that only 25 years ago, polyploids were commonly viewed as “evolutionary dead-ends” [reviewed by Soltis and Soltis (1993–2000)].
2.3 Detecting Ancient Whole-Genome Duplications
Analyses of large-scale genomic data have provided evidence of many previously undetected, ancient whole-genome duplication (WGD) events (e.g., Blanc and Wolfe, 2004; Schlueter et al., 2004; Cui et al., 2006); however, there is little consensus on the number and timing of ancient WGDs in plants. The task of identifying and placing ancient WGDs is greatly complicated by the diploidization process following a WGD, during which rapid gene loss and chromosomal rearrangements obscure, if not erase, evidence of the WGD. Still, numerous promising approaches have been developed to identify the remaining signals of ancient WGDs from different types of genomic data, including gene maps, pairs of duplicated genes, and gene trees. Although these methods are not designed specifically for plants, the frequency of polyploidy in plants and the wealth of genomic data make plants among the most useful systems for testing methods to detect ancient WGDs.
The presence of large, syntenic (duplicated) blocks within a genome provided the first evidence of ancient WGDs in Arabidopsis and Brassicaceae (e.g., Vision et al., 2000; Simillion et al., 2002; Blanc et al., 2003; Bowers et al., 2003; Schranz and Mitchell-Olds, 2006), rice (e.g., Guyot and Keller, 2004; Paterson et al., 2004; Wang et al., 2005; Yu et al., 2005), legumes (Shoemaker et al., 1996; Cannon et al., 2006), poplar (Tuskan et al., 2006), and Vitis (Jaillon et al., 2007; Velasco et al., 2007). Although these duplicated chromosomal segments may provide direct and unambiguous evidence of a past WGD, in practice, rapid gene losses and rearrangements after polyploidy can make it extremely difficult to detect such duplications (Lynch and Conery, 2000; Eckhardt, 2001; Simillion et al., 2002).
Different methods of detecting duplicated blocks and use of different criteria for defining a syntenic block can greatly affect interpretations of the history of large-scale duplications (see Durand and Hoberman, 2006). For example, the gene-order data in rice have been interpreted as ancient aneuploidy (Vandepoele et al., 2003) and as ancient polyploidy (e.g., Yu et al., 2005). Once duplicated chromosomal blocks have been identified, the timing of the WGD event(s) that created the duplication can be estimated using the sequence divergence of paralogous genes on each block, usually based on silent (synonymous) substitution rates. Variation in rates of molecular evolution among both genes and taxa can also make it extremely difficult to date the corresponding WGD events accurately or precisely. However, perhaps the greatest current limitation to identifying WGDs from genetic map data in plants is the lack of such data across a phylogenetically diverse sampling of taxa.
Still, without evidence of duplicated blocks of genes, or any genetic map data at all, it is possible to detect ancient polyploidy based on the age distributions of pairs of paralogous genes throughout a genome (e.g., Lynch and Conery, 2000; Vision et al., 2000). If gene duplication and loss occur at a constant rate, the frequency of duplicated genes in a genome will decrease exponentially with time. In contrast, a large-scale duplication event such as a WGD should result in an overrepresentation of duplicated gene pairs at the time corresponding to the large-scale duplication event. Thus, if one plots the age distribution, as represented by sequence divergence, of duplicated genes from a genome, peaks in the age distribution curves may indicate WGDs (Figure 1). As with the map-based methods for detecting WGDs, the date of the large-scale duplication (peak in the graph) is usually estimated from the molecular divergence of the overrepresented gene pairs. Since the age is most often represented by
Figure 1 Example of Ks curve (x-axis: age, as Ks divergence; y-axis: frequency; the peak is labeled “Genome duplication”). Under a constant rate of gene duplication and loss, the frequency of duplicate (paralogous) genes should decrease exponentially with age, as shown in the bottom line. In the thinner, top line, a WGD event will result in an overabundance of duplicate genes at the age of the WGD.
the synonymous (or silent) substitutions between duplicate genes, the age-distribution plots are often called Ks plots. Ks plots can be built from any large gene-sequence data set. In fact, as noted above, the Ks plot approach has identified ancient WGDs from EST data sets in many angiosperm lineages (Blanc and Wolfe, 2004; Schlueter et al., 2004; Sterck et al., 2005; Cui et al., 2006; Barker et al., 2008), and even the gymnosperm Welwitschia (Cui et al., 2006) and the moss Physcomitrella (Rensing et al., 2007). Still, there are several disadvantages to inferring WGDs from Ks plots. First, the age distribution plots can be difficult to interpret, and in some cases, analyses of Ks plots have failed to detect known WGDs (e.g., Blanc and Wolfe, 2004; Paterson et al., 2004). Furthermore, the age distribution of duplicate genes is indirect evidence for WGDs, and in theory, a period of reduced gene loss (Lynch, 2007) or a large-scale, but not whole-genome, duplication event can be mistaken for a WGD on a Ks plot. Finally, as with the map-based analyses, it is difficult to place a duplication event precisely just from divergence time estimates from duplicated genes. Fawcett et al. (2009) addressed this issue by using a rate-smoothing technique that does not assume a molecular clock to date the divergences (Sanderson, 2002).
Examining gene sequences from multiple taxa in a phylogenetic context provides an approach to date WGD events in relation to speciation events rather than relying on the potentially problematic divergence time estimates (e.g., Bowers et al., 2003; Langkjaer et al., 2003; Vandepoele et al., 2003; Chapman et al., 2004). For example, in the simple three-taxon case, a gene tree is constructed with a pair of paralogous genes from the test taxon, and the best homologs from a second taxon and from an outgroup taxon (Bowers et al., 2003; Langkjaer et al., 2003; Vandepoele et al., 2003; Chapman et al., 2004).
If the paralogs from the test taxon form a clade, they diverged after the common ancestor of the test taxon and the second taxon; if they do not, they diverged before the last common ancestor. This three-taxon phylogenetic approach has been used to determine the timing of WGDs in Arabidopsis relative to its divergence with pines, rice, and other eudicots (Bowers et al., 2003) and rice relative to its divergence with
Figure 2 Example of LCA mapping to identify gene duplications (left panel: gene tree, labels a–d; right panel: species tree, labels A–D). In lowest common ancestor (LCA) mapping, each gene (node) of the gene tree on the left is mapped to the lowest node in the species tree on the right that could have included the gene. A duplication exists when parent and child nodes on the gene tree map to the same node on the species tree, as shown with the arrows. The node on the species tree indicated by an asterisk marks the lowest possible location of the gene duplication event.
pines, Arabidopsis, and other monocots (Vandepoele et al., 2003; Chapman et al., 2004).
There also has been much interest in identifying WGDs by mapping gene duplications from a collection of gene trees, containing genes from many taxa, onto a species tree (e.g., Guigó et al., 1996; Fellows et al., 1998; Page and Cotton, 2002; Burleigh et al., 2008). A gene in the gene tree can be interpreted as a duplication if it has a child with the same lowest common ancestor mapping (LCA mapping) on a species tree (Figure 2; Eulenstein, 1998; see Bansal et al., 2007). The LCA mapping associates every gene in the gene tree with the most recent species in the species tree that could have contained the preduplication ancestral gene. However, this does not mean that the LCA mapping on the species tree indicates the location of the duplication event; in many cases, the duplication event could, in theory, predate the location of the LCA mapping. There are several proposed ways to define the range of possible location(s) of a gene duplication on a species tree (e.g., Guigó et al., 1996; Fellows et al., 1998; Page and Cotton, 2002). Because there is often a range of possible locations for each duplication, the number of possible locations for the set of all duplications can be exponentially large in the size of the input trees. The challenge is to identify a mapping, or set of locations for all duplications, that will highlight WGDs. One such approach is to seek a mapping that implies the minimum number of locations in the species tree where all duplications in the gene trees can be placed. Burleigh et al. (2008) demonstrated that this approach can be used to identify WGDs across angiosperms with relatively few gene trees. Alternatively, Bansal and Eulenstein (2008) described an efficient algorithm to find the mapping that minimizes the number of gene duplication events, or episodes, that can include all gene duplications, but this has not been tested in plants.
Furthermore, it may be useful to identify WGDs based on the size (number of duplications) of an episode, but such approaches are not yet developed. Although there appears to be much potential for gene mapping approaches to identifying WGDs, they all rely on the accuracy of the gene tree topologies. Specifically, error in the gene trees often appears like duplications toward the root of the species tree. Therefore, it may be difficult to distinguish WGDs near the root of the species tree from gene tree error (Hahn, 2007; Burleigh et al., 2008).
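The LCA mapping described above can be illustrated with a minimal sketch. It assumes trees are written as nested Python tuples whose leaves are species names, and that each gene-tree leaf is labelled with the species from which the gene was sampled; the function names (clusters, lca, find_duplications) and the toy trees are hypothetical and are not taken from any of the cited methods. The sketch reports only the species-tree node to which each duplication maps, which, as noted above, is the lowest possible location of the event and not necessarily its true location.

```python
# Trees are nested tuples; leaves are species names (strings).
# A gene-tree leaf is labelled with the species the gene was sampled from.
# All names here are illustrative assumptions, not a published implementation.

def clusters(tree):
    """Return the leaf set (cluster) of every node, in post-order."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    found, leaves = [], frozenset()
    for child in tree:
        sub = clusters(child)
        found.extend(sub)
        leaves |= sub[-1]          # last entry is the child's own cluster
    found.append(leaves)
    return found


def lca(species_clusters, species):
    """Smallest species-tree cluster that contains every species given."""
    return min((c for c in species_clusters if species <= c), key=len)


def find_duplications(gene_tree, species_tree):
    """List the species-tree nodes to which duplication nodes map."""
    species_clusters = clusters(species_tree)
    duplications = []

    def visit(node):
        # Returns (LCA mapping of node, species found below node).
        if isinstance(node, str):
            leaf = frozenset([node])
            return lca(species_clusters, leaf), leaf
        child_maps, leaves = [], frozenset()
        for child in node:
            mapped, below = visit(child)
            child_maps.append(mapped)
            leaves |= below
        here = lca(species_clusters, leaves)
        if here in child_maps:     # parent and child share a mapping
            duplications.append(tuple(sorted(here)))
        return here, leaves

    visit(gene_tree)
    return duplications


species_tree = (("A", "B"), "C")
gene_tree = ((("A", "B"), ("A", "B")), "C")
print(find_duplications(gene_tree, species_tree))
```

In this toy example the gene tree contains two copies of the (A, B) clade, so the node joining them maps to the same species-tree node as its children and is flagged as a duplication. A gene tree identical to the species tree yields no duplications.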
Because the process of diploidization rapidly erases evidence of WGDs, identifying and locating ancient WGDs is an inherently difficult task. Our current understanding of WGDs throughout the evolution of plants will doubtlessly be improved with large-scale genomic data from a wider range of taxa and refinement of the current methods, perhaps through the development of hypothesis-testing frameworks, which are unfortunately uncommon in WGD analyses. Yet the fact that we still find evidence of WGDs from even hundreds of millions of years ago in current, relatively simple analyses of gene maps, distribution of duplicated genes, and topology of gene trees from plants demonstrates the profound influence of WGDs on plant genome structure. The future challenge for studying WGDs in plants is not only to seek more accurate estimates of the number and location of WGDs, but to better characterize the effects of WGDs on plant genomes through time.
3 FATES AND CONSEQUENCES OF GENE DUPLICATION IN PLANTS
3.1 Homoeolog Loss vs. Retention in Polyploids
It is generally agreed that the majority of homoeologs (genes duplicated via polyploidy) will be under relaxed selective constraints due to redundancy, allowing mutations to accumulate in one duplicate copy (Ohno, 1970; Nowak et al., 1997). Most mutations will be deleterious, causing a loss of function, so nonfunctionalization and gene silencing are expected to be the ultimate fate of one homoeolog for the majority of duplicate gene pairs. For populations with effective sizes less than the reciprocal of the loss-of-function mutation rate, nonfunctionalization is expected to occur in less than approximately 1 million generations (Lynch and Force, 2000). However, the observed frequency of homoeologs retained in polyploid organisms is higher than would be expected under this model (Wendel, 2000), suggesting that natural selection may be acting to maintain gene duplicates (Shiu et al., 2006). 
Selection for retention of homoeologs could occur via several mechanisms. First, if gene copy number correlates with transcript production (“dosage dependence”), selection for a certain level of expression can act upon gene copy number. Such selection might be especially important in networks requiring precise stoichiometry of interacting gene products (Birchler et al., 2005; Maere et al., 2005; Aury et al., 2006; Thomas et al., 2006). Second, retention of homoeologs can provide fixed heterozygosity in allopolyploids, providing interlocus heterodimers and increased protein diversity in an individual (Roose and Gottlieb, 1976; Levin, 1983; Soltis and Rieseberg, 1986; Hedrick, 1987; Soltis and Soltis, 1989, 1993; Udall and Wendel, 2006) and perhaps rendering it more resistant to the deleterious effects of repeated self-fertilization (Barrett and Shore, 1987; Soltis and Soltis, 1990; Pannell et al., 2004). Third, duplicates of a gene with more than one function or expressed in more than one tissue may both degenerate to carry out complementary subsets of the ancestral gene’s role, so that both copies are maintained by selection (subfunctionalization; Force et al., 1999a, 1999b). Fourth, gene duplication may provide an “escape from adaptive conflict,” where two functions of an ancestral singleton gene are both freed to improve in different gene duplicates (Des Marais and Rausher, 2008). Fifth, rare beneficial mutations could cause one homoeolog to carry out a new function that is favored by natural selection (neofunctionalization; Ohno, 1970; Ohta, 1988; Walsh, 1995; Lynch and Conery, 2000).
GENE AND GENOME DUPLICATIONS IN PLANTS
Recently, it has been suggested that for certain genes, selection could favor rapid loss of one homoeolog (Paterson et al., 2006). This scenario could occur due to dosage-dependent effects. First, some gene products may be effective in low concentrations, but tend to form nonfunctional aggregates at high concentrations (Conrad and Antonarakis, 2007). Second, disrupted stoichiometry of gene products can occur if one gene is duplicated but another is not; this might occur within the nucleus if a segment of the genome is duplicated, or between nuclear and cytoplasmic genes in the case of a whole (nuclear)-genome duplication. In Homo sapiens, gene duplications cause a significant number of genetic diseases that tend to involve dosage-sensitive genes and genes encoding proteins with a propensity to aggregate (Conrad and Antonarakis, 2007). Selection for singleton status could also act when whole-genome duplication accompanies hybridization (allopolyploidy), in which case an organism’s genome comprises two divergent parental genomes. For example, products of genes A and B in one parental genome form a protein heterodimer and have coevolved, and genes A′ and B′ have also coevolved in the other genome so that their products can form a dimer with one another. If A′–B and A–B′ interlocus heterodimers fail, selection could favor the loss or silencing of genes A and B or A′ and B′ (Comai et al., 2003). Comparative studies in putative paleopolyploids provide some evidence that certain genes, if duplicated, are predisposed for a return to singleton status. In a study comparing genome duplication in Oryza (rice), Arabidopsis, Tetraodon (puffer fish), and Saccharomyces (yeast), Paterson et al. (2006) found 16 protein family (Pfam) domains in Arabidopsis and 12 Pfam domains in Oryza that contain a higher-than-expected percentage of singleton genes. 
In five cases, these Pfam domains were the same in the two species, and in one case a singleton-enriched Pfam domain in Oryza was the same as a singleton-enriched Pfam domain in Tetraodon. Paterson et al. (2006) suggested that this convergent reversion to singleton status points to certain domain-containing proteins being maladaptive in duplicate copy. Similarly, Leebens-Mack et al. (2006) estimated that 727 strict ortholog sets exist as singletons in the Arabidopsis, Oryza, and Populus genomes, but that if reversion to singleton status were independent in the three lineages, the number of strict ortholog sets should be 99. Again, this high rate of convergence could suggest the action of natural selection driving duplicate gene loss and return of the same genes to singleton status.
A few studies have documented homoeolog loss in recent polyploids. In the allopolyploid species Tragopogon miscellus, loss of homoeologs is found in about fortieth-generation natural populations (Tate et al., 2006; Buggs et al., 2009), but is not found in F1 hybrids (Tate et al., 2006) or first-generation synthetics (Buggs et al., 2009). Loss of coding (Kashkush et al., 2002) and noncoding (Liu et al., 1998b; Shaked et al., 2001; Ozkan et al., 2002) DNA sequence has been shown in synthetic Triticum allopolyploids. Genomic changes in synthetic allopolyploid lines in Brassica (Song et al., 1995; Gaeta et al., 2007) and genome downsizing in several polyploid species (Leitch and Bennett, 2004; Eilam et al., 2009) also suggest that rapid homoeolog loss may occur. One mechanism by which such rapid losses may occur is homoeologous nonreciprocal transpositions (Udall et al., 2005; Gaeta et al., 2007) or activation of transposons due to genomic shock (McClintock, 1984; Comai, 2000). Loss of some nonprotein coding sequences seems to follow a more concerted mechanism. 
In Tragopogon and Nicotiana, homogenization of rDNA repeats through gene conversion occurs gradually after allopolyploidization (Matyásek et al., 2003, 2007; Kovarík et al., 2004, 2005).
In Triticum allopolyploids, low-copy noncoding DNA, some of which is chromosome specific, is eliminated in a well-orchestrated and reproducible fashion in F1 hybrids and synthetic allopolyploid lines (Liu et al., 1998b; Shaked et al., 2001; Ozkan et al., 2002).
3.2 Transposon Activation in Polyploids
Hybridization and polyploidization may cause “genomic shock,” activating mobile elements in the genome (McClintock, 1984; Comai, 2000; Comai et al., 2003). These may cause genome restructuring (Lonnig and Saedler, 2002) and/or changes in gene expression (McClintock, 1984; Weil and Martienssen, 2008). Increased transposon activity has been found in allopolyploids of Nicotiana (Petit et al., 2007), Triticum (Kashkush et al., 2003; Dong et al., 2005), and Arabidopsis (Madlung et al., 2002, 2005). It is not known to what extent this is due to hybridization as opposed to genome doubling, as homoploid hybridization can activate mobile elements (Shan et al., 2005; Ungerer et al., 2006). In early-generation autopolyploid Arabidopsis, the transposon Sunfish was activated, but its transcription was repressed in advanced autopolyploid generations, whereas it remained active in allotetraploids (Madlung et al., 2002).
3.3 Gene Expression Changes in Polyploids
Changes in gene expression due to polyploidization have been the subject of several recent reviews (Osborn et al., 2003; Adams and Wendel, 2005b; Chen and Ni, 2006; Adams, 2007; Chen, 2007; Hegarty et al., 2008; Hegarty and Hiscock, 2008). Most studies compare expression in different species, such as polyploids versus parental diploids, homoploid hybrids versus allopolyploids, or autopolyploids versus allopolyploids. 
Comparisons may be between (1) expression of genes without distinguishing between homoeologs at each locus (Hegarty et al., 2005, 2006; Wang et al., 2006a); or (2) expression of each homoeolog at individual loci (Adams et al., 2003, 2004; Adams and Wendel, 2005a; Tate et al., 2006; Liu and Adams, 2007; Flagel et al., 2008; Buggs et al., 2009). Comparison 1 may help us understand the extent and timing of the effects of polyploidy on the transcriptome, whereas comparison 2 allows us to understand the contributions of the individual genomes to those effects, and the influence that this may have on the evolution of duplicated genes. In synthetic Arabidopsis thaliana and A. suecica polyploids, microarray studies showed changes in gene expression in over 5% of 26,090 genes in the allopolyploids relative to their parents (Wang et al., 2006a). Of these genes, 68% were expressed at different levels in the parental species (Wang et al., 2006a), and only 41% of changes were common to two independent allopolyploid lineages. An autopolyploid of A. thaliana showed altered expression in only 0.3% of genes relative to its parent (Wang et al., 2006a). In synthetic Senecio cambrensis allopolyploids, anonymous microarrays with about 6000 cDNA clones showed that gene expression differs extensively in diploid hybrids compared to the parental species, but that these differences are ameliorated by polyploidization (Hegarty et al., 2005, 2006). Patterns of expression in the synthetic allopolyploid were maintained over five generations (Hegarty et al., 2006). Both of these studies suggest that hybridization has a greater instantaneous effect on global patterns of gene expression than polyploidization does.
Differences in expression between homoeologs are likely to have important consequences, and studies that distinguish between homoeologs show that such differences do occur. In natural cotton polyploids, a microarray study of 1383 genes found biased expression in 70% of homoeolog pairs, only 24% of which seem to have biased expression immediately on diploid hybridization of the parental species (Flagel et al., 2008). Detailed study of expression of homoeologs in cotton shows examples of environment-specific expression (Liu and Adams, 2007) and tissue-specific expression (Adams et al., 2003, 2004), one example of which occurs in diploid hybrids (Adams and Wendel, 2005a). Silencing of homoeologs, including tissue-specific homoeologs, has also been shown in hexaploid wheat (Bottley et al., 2006). In Tragopogon miscellus, silencing of homoeologs in leaf tissue has been found in natural allopolyploids about 80 years old, but not in diploid hybrids or synthetic allopolyploids (Tate et al., 2006; Buggs et al., 2009). Homoeolog silencing has also been found in the natural allopolyploid Arabidopsis suecica (Lee and Chen, 2001). Genes thought to have been duplicated by polyploidy before the split of A. thaliana and A. arenosa show higher levels of expression divergence between the two species than singleton genes, and a greater proportion of them were nonadditively expressed in resynthesized and natural allotetraploids formed from the two species (Ha et al., 2009). Together, these studies suggest that modulation of gene expression occurs at temporally distinct stages. Some effects are instantaneous effects of hybridization (Adams and Wendel, 2005a; Hegarty et al., 2006; Flagel et al., 2008), whereas others evolve in generations subsequent to polyploidization (Tate et al., 2006; Flagel et al., 2008; Buggs et al., 2009). 
Thus, it appears that polyploidy may provide an initial saltation in gene expression, followed by gradual changes in expression permitted by the presence of genes in duplicate form. Changes in gene expression between homoeologs may lead to divergence of gene sequence. Silenced genes will not be under selection, and so may accumulate deleterious mutations and become nonfunctionalized (see above). Homoeologs that are expressed in different tissues may be showing incipient subfunctionalization (see above).
3.4 Epigenetics in Polyploids
Polyploidy appears to have epigenetic effects that are responsible for changes in gene expression that are stable across generations (Comai, 2000; Liu and Wendel, 2002, 2003; Osborn et al., 2003; Rapp and Wendel, 2005; Chen and Ni, 2006). Ni and colleagues (2009) have recently provided intriguing evidence that growth vigor and increased biomass in hybrid and allopolyploid plants of Arabidopsis are caused by epigenetic modulation of parental alleles and homologous loci of the internal circadian clock regulators: this alters the amplitude of downstream gene expression and metabolic flux in clock-mediated photosynthesis and carbohydrate metabolism. A theoretical study (Rodin and Riggs, 2003) suggests that epigenetic tissue-specific silencing may enhance the evolution of genes to divergent functions, especially in small populations, promoting subfunctionalization. Below we review the various types of epigenetic changes that have been found in polyploids.
Methylation
Methylation-sensitive AFLP analysis in Brassica (Song et al., 1995), Arabidopsis (Madlung et al., 2002), and Triticum (Liu et al., 1998a,b; Shaked et al., 2001) showed widespread changes in genomic methylation patterns
upon polyploidization. In 20 accessions of allotetraploid Gossypium hirsutum, methylation–polymorphism diversity was greater than genetic diversity (Keyte et al., 2006). A study of 49 synthetic first-generation Brassica napus allopolyploids and their parental diploids found methylation changes to be much more common (35 of 73 markers) in the allopolyploids than insertions and deletions in the DNA (3 of 76 markers) (Lukens et al., 2006). It seems likely that the causes of these epigenetic changes are themselves epigenetic. Fulnecek and colleagues (2009) examined three major DNA methyltransferase families (MET1, CMT3, and DRM) in Nicotiana tabacum and found that both homoeologs of each gene were retained and expressed in the allopolyploid.
Small Interfering RNA (siRNA)
Recent evidence suggests that siRNA may play a role in controlling methylation of DNA in polyploids. Chen and colleagues (2008) found that accumulation of centromeric siRNA in A. suecica correlated with centromere methylation. When Preuss and colleagues (2008) knocked out two genes required for the biogenesis of siRNAs (RDR2 and DCL3) in A. suecica, nucleolar dominance was disrupted, suggesting that the methylation of 45S rRNA is directed by siRNA.
Chromatin Remodeling
Chromatin modification appears to play a significant role in changes in gene expression in polyploids (reviewed in Chen and Tian, 2007). Histone acetylation, together with DNA methylation, has been shown to play a role in repressing rRNA genes in Brassica (Chen and Pikaard, 1997) and Arabidopsis (Lawrence et al., 2004; Earley et al., 2006) allopolyploids, causing nucleolar dominance (Pikaard, 1999). Late flowering in Arabidopsis synthetic allotetraploids is correlated with activation of the flowering repressor gene FLC by histone acetylation and methylation (Wang et al., 2006b). 
Ni and colleagues (2009) suggest that in allotetraploid Arabidopsis the expression of clock regulators is altered by chromatin modifications, including rhythmic changes in histone acetylation. It has been suggested that histone modification could be influenced in polyploids by dosage effects of the gene products involved in histone modifier complexes (Birchler et al., 2005).
Alternative Splicing
Recent evidence suggests that allopolyploidy in wheat can affect gene regulation via changes in alternative splicing efficiency. Terashima and Takumi (2009) compared levels of the alternatively spliced forms of WDREB2, a transcription factor involved in abiotic stress response, among wheats of different ploidal levels. In diploids, the level of the nonfunctional transcript gradually decreased due to splicing in response to drought stress, but in hexaploid wheat lines, including both cultivars and synthetic lines, the nonfunctional form failed to decrease, suggesting that allopolyploidization inhibited efficient alternative splicing of the transcripts.
4 DUPLICATIONS IN THE MADS-BOX GENE FAMILY AND THEIR ROLES IN FLORAL DEVELOPMENT
Gene duplication and diversification are among the most important genetic raw materials for evolutionary change. Many genes involved in floral development have undergone duplication over the course of angiosperm evolution. For example, gene families such as the MADS-box genes and the TCP genes were duplicated multiple times in the
[Figure 3 labels. Clades: Gymnosperms, Angiosperms, Eudicots, Monocots, Basal Angiosperms, Core Eudicots. Taxa: Arabidopsis, Brassica, Asterids, Aquilegia, Maize, Tulip, Persea, Nuphar, Amborella, Cycas, Pinus. Duplications marked on branches: AP1+CAL, SHP1+SHP2, SEP1+2, AP1+FUL, AG+PLE, AP3+TM6, SEP1,2+4, AP3+PI, AG+STK, SEP3+SEP1,2,4. ABCE model key: A, B, C, E over Sep, Pet, Stm, Car.]
Figure 3 Phylogenetic distribution of MADS-box gene duplications across the angiosperms, as shown by the shaded bars on the branches of the tree. The shades of the bars correspond to the gene class of the ABCE model, which is shown at the lower left of the figure. Sep, sepals; Pet, petals; Stm, stamens; Car, carpels.
history of flowering plants (Howarth and Donoghue, 2006; reviewed by P.S. Soltis et al., 2009). MADS-box genes encode transcription factors that contain a DNA-binding domain (the MADS domain) and regulate a wide variety of developmental processes. They are found in three eukaryotic kingdoms (plants, animals, and fungi) but have undergone a significant amount of gene duplication in plants, which, coupled with the recruitment of duplicate genes to new roles, is likely to have played a fundamental role in plant evolution (Theissen et al., 2000; Parenicova et al., 2003; Irish and Litt, 2005; Martinez-Castilla and Alvarez-Buylla, 2005). In flowering plants, especially, MADS-box genes have a wide range of functions, including roles in the transition from vegetative growth to flowering and in the development of flowers themselves. Over 70 MADS-box genes are present in the genomes of the angiosperm genetic models Arabidopsis and rice, while far fewer MADS-box genes have been found in the gymnosperms (Nam et al., 2003). Phylogenetic reconstructions suggest that the MADS-box gene family has diversified in angiosperms, with duplications of many gene lineages in angiosperm ancestors or within specific angiosperm clades (Becker and Theissen, 2003; Figure 3).
4.1 Gene Duplications and Diversification in MADS-Box Floral Organ Identity Genes
According to the ABCE model, the overlapping influences of four functions (A, B, C, and E) regulate floral organ identity. In A. thaliana, A function is involved in the specification of sepals and petals, B function in petal and stamen specification, C function in stamen and carpel specification, and E function participates in the specification of
all floral organs (Coen and Meyerowitz, 1991; Colombo et al., 1995; Pelaz et al., 2000; Ditta et al., 2004). These functions are all encoded by members of separate MADS-box gene lineages, each of which shows evidence of multiple duplication events (Theissen, 2001). On the basis of functional studies, most notably in the model plant A. thaliana, it has been demonstrated that ancestral role retention, role swapping, and the acquisition of novel roles in floral development are among the functional consequences of these duplication events.
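The overlapping influences of the four functions can be made concrete with a small sketch. It encodes only the A. thaliana assignments stated above (which organs each function helps specify) and inverts them to list the functions acting on each organ; the names FUNCTION_TO_ORGANS and functions_for are hypothetical, and the sketch makes no attempt to model mutant phenotypes or interactions between functions.

```python
# Organs each function helps specify, as stated for A. thaliana in the text.
# The data structure and function name are illustrative assumptions.
FUNCTION_TO_ORGANS = {
    "A": {"sepal", "petal"},
    "B": {"petal", "stamen"},
    "C": {"stamen", "carpel"},
    "E": {"sepal", "petal", "stamen", "carpel"},
}


def functions_for(organ):
    """Return the sorted list of functions involved in specifying an organ."""
    return sorted(f for f, organs in FUNCTION_TO_ORGANS.items() if organ in organs)


for organ in ("sepal", "petal", "stamen", "carpel"):
    print(organ, "+".join(functions_for(organ)))
```

Read this way, sepals involve A and E, petals A, B, and E, stamens B, C, and E, and carpels C and E, which is the overlap the model describes.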
A-Function and APETALA1
The A-function lineage is represented by APETALA1 (AP1), CAULIFLOWER (CAL), and FRUITFUL (FUL) in Arabidopsis, of which AP1 is the de facto A-function gene and has undergone at least two duplications in angiosperm history (Litt and Irish, 2003). The most recent is probably confined to the Brassicaceae (Lowman and Purugganan, 1999) and produced the Arabidopsis paralogs AP1 and CAL, which are nearly identical in sequence and redundant for specifying floral meristem identity (Bowman et al., 1993; Kempin et al., 1995). A more ancient duplication event at the base of the core eudicots produced two distinct core eudicot gene clades, the euAP1 clade, which includes AP1 (and CAL), and the euFUL clade, which includes FUL (Litt and Irish, 2003; Litt, 2007). FUL has no known role in organ identity, but shares a function in floral meristem specification with AP1 and CAL (Mandel and Yanofsky, 1995; Ferrandiz et al., 2000) and also has a unique function in fruit development (Ferrandiz, 2002). Thus, there are three paralogous members of the A-function lineage in the Arabidopsis genome, each making varying contributions to floral meristem identity. Efforts to evaluate the functions of orthologs and paralogs of AP1/CAL and FUL in other species suggest that members of this lineage may play a conserved role in floral meristem identity. However, despite similar expression patterns of euAP1 orthologs, the AP1 function of A. thaliana is not conserved. For example, orthologs of AP1 (members of the euAP1 clade) in Antirrhinum, pea (Pisum sativum), and tomato (Solanum lycopersicum) do not appear to have a role in sepal and petal specification. Instead, mutants display decreased flowering and increased inflorescence branching, indicating a role in determining floral meristem identity (Huijser et al., 1992; Berbel et al., 2001; Taylor et al., 2002; Vrebalov et al., 2002). 
Within the euFUL gene clade there have been numerous duplications within various core eudicot groups, but few FUL orthologs have been functionally characterized. Reported expression patterns suggest little evidence for a conserved role in meristem identity and/or fruit development, but data are still limited (Litt, 2007). Angiosperm taxa that diverged before the core eudicot duplication (that is, “below” the euAP1 and euFUL duplication) have genes with greater sequence similarity to euFUL genes than to euAP1 genes. These FUL-like genes have undergone numerous duplications in different angiosperm clades, and as with euFUL genes, functional data are generally lacking. Expression patterns tend to be broad and varied, but might support a conserved role in floral meristem identity. Expression levels of the FUL-like genes in the basal angiosperms Persea, Nuphar, and Magnolia were greater in leaves than mature floral organs (Kim et al., 2005; Chanderbali et al., 2006; Yoo et al., unpublished data), but the highest levels were measured in the emerging inflorescences of Persea and pre-meiotic buds of Nuphar, developmental stages enriched for floral meristems (Chanderbali et al., 2009; Yoo et al., unpublished data).
B-Function and APETALA3 and PISTILLATA
The two B-function genes APETALA3 (AP3) and PISTILLATA (PI) belong to two paralogous gene lineages that resulted from a duplication event prior to the origin of the angiosperms (Kim et al., 2004). In addition, the AP3 lineage underwent another duplication event at the base of the core eudicots, giving rise to two AP3 sublineages: the euAP3 and the TOMATO MADS BOX GENE6 (TM6) gene lineages (Kramer et al., 1998). As with the FUL-like genes in noncore eudicot angiosperms, the angiosperms that diverged prior to this duplication event have “paleoAP3” genes, which share greater sequence similarity with TM6 than with euAP3. TM6 genes have been independently lost in Arabidopsis and Antirrhinum, but most core eudicots have both paralogs. Functional studies suggest that these paralogous lineages have diversified in function. For example, TM6 is involved in the development of stamens but not petals, while the ortholog of AP3 regulates both stamen and petal development in tomato (de Martino et al., 2006). In contrast, in petunia, petals are transformed into sepals with little effect on stamens after loss of the AP3 ortholog, and stamens are only affected when mutations affect orthologs of both AP3 and TM6 (Rijpkema et al., 2006). Outside the core eudicots, duplicate paleoAP3 genes in poppy have apparently specialized to promote the development of either stamens or petals, but not both. Similarly, three paleoAP3 paralogs in Aquilegia appear to have temporal as well as spatial partitioning in their roles in stamen and/or petal development (Kramer et al., 2007). The only other functionally characterized paleoAP3 genes are from the monocots rice and maize, which do not seem to have a history of duplications, and function in both stamen and petal (= lodicule in grasses) development (Whipple et al., 2004, 2007). 
The expression patterns of paleoAP3 genes in basal angiosperms suggest an ancestral role in stamen and perianth specification (Kim et al., 2005; Chanderbali et al., 2006; Soltis et al., 2006, 2007; Yoo et al., in preparation), although duplications may have resulted in neo- or subfunctionalization. For example, one of two paleoAP3 paralogs in Nuphar advena is restricted to stamens during early development while the other is expressed in both stamens and inner tepals (petals), although both paralogs are detected throughout the flower during mature floral stages (Kim et al., 2005). Functional diversification following duplication is also suggested by expression shifts of three paleoAP3 paralogs in Illicium floridanum. One paralog is expressed in the outer tepals, inner tepals, and stamens (the typical paleoAP3 expression), the second is restricted to the inner tepals and stamens, and the third is limited to the inner tepals (Kim et al., 2005).
Unlike the AP3 lineage, the PI lineage has not undergone a duplication event at the base of the core eudicots, but there are ample examples of dynamic patterns of evolution in other angiosperms. For instance, there is clear evidence for ancient duplications in the magnoliid clade of basal angiosperms (Stellari et al., 2003), but expression data do not suggest functional diversification (Chanderbali et al., 2006, 2009), although functional data are unavailable. Numerous recent duplications have occurred in the Ranunculaceae (Kramer et al., 2003) and monocots (Winter et al., 2002), but the limited expression data do not suggest functional diversification.
C-Function and AGAMOUS
The evolutionary history of the C-function gene lineage demonstrates recent duplications in various angiosperm taxa, an older duplication event placed early in the history of the core eudicots, and an even more ancient duplication
event early in angiosperm history after the divergence of the angiosperms and gymnosperms (Kramer et al., 2004; Zahn et al., 2006). Three sequential gene duplications in the C-function lineage have resulted in four paralogs in the Arabidopsis genome. The first occurred before the radiation of flowering plants and gave rise to the SEEDSTICK (STK) and AGAMOUS (AG) lineages. The second duplication event occurred at the base of the core eudicots, producing the PLENA lineage as sister to the euAG lineage. Duplicate members of the PLENA lineage in Arabidopsis, SHATTERPROOF1 (SHP1) and SHATTERPROOF2 (SHP2), appear to have resulted from a recent duplication in the Brassicaceae. AG functions in floral meristem determinacy as well as stamen and carpel development, and is expressed in the floral meristem from developmental stage 3 (as defined by Smyth et al., 1990), in stamens and carpels from primordial to mature stages, and later in the developing seed coat (Bowman et al., 1991; Drews et al., 1991). SHP1 and SHP2 are expressed in the ovules and in the developing pistil and fruit, where they share largely redundant functions in specifying the fruit dehiscence zone required for seed-pod shattering, by controlling the formation of specialized valve margin cells that are found only in fruits of the Brassicaceae (Liljegren et al., 2000). STK is expressed in the developing ovule primordia and seeds and functions in specifying ovule identity (D function) along with the AG and SHP1/SHP2 genes (Rounsley et al., 1995; Favaro et al., 2003; Pinyopich et al., 2003). Remarkably, in a demonstration of how gene function can be unpredictably partitioned between products of a gene-duplication event, AG and PLENA (PLE), its functional counterpart in Antirrhinum, respectively, belong to the paralogous euAG and PLENA lineages that descended from the core eudicot duplication event. 
Thus, PLE is orthologous to SHP1/SHP2 while functionally more similar to AG (Bradley et al., 1993; Davies et al., 1999). Other functionally characterized members of the euAG lineage from petunia and morning glory function similarly to AG, even though these species are more closely related to Antirrhinum than to Arabidopsis. The Antirrhinum AG ortholog, FARINELLI (FAR), has a lesser role in organ identity and meristem determinacy than AG or PLE, and instead plays a greater role in late stamen development than PLE (Davies et al., 1999). Members of paralogous euAG and PLE lineages in the core eudicots therefore display redundancy, subfunctionalization, and/or neofunctionalization in different taxa, and demonstrate that the functional roles of duplicate genes can vary considerably. Outside the core eudicots, a duplication event in the monocots has also produced paralogous AG lineages that may have become subfunctionalized. For example, the maize gene ZAG1 is more strongly expressed in carpels, while the paralogous ZMM2 is restricted to stamens (Mena et al., 1996). Similarly, duplication events have produced three AG homologs in the magnoliid Persea, one of which is restricted to late-stage stamens and carpels while the other two are expressed in all floral organs at induction and maturity (Chanderbali et al., 2006, 2009).
E-Function and SEPALLATA
Sequential duplications similar to those in the AG lineage gave rise to four SEPALLATA (SEP1 to SEP4) genes, which may be largely functionally redundant in Arabidopsis. All are flower-specific and expressed in all floral organs, and only the quadruple mutant lacking all four genes exhibits a complete loss of floral organ identity (Ditta et al., 2004). The first duplication event predates the radiation of extant angiosperms and produced the SEP3 and SEP1/2/4 lineages. A second duplication in the latter lineage occurred at the base of the core eudicots and
GENE AND GENOME DUPLICATIONS IN PLANTS
separated the SEP4 and SEP1/2 lineages, while a subsequent duplication resulted in duplicate SEP1 and SEP2 in the Brassicaceae. The expression of SEP homologs in angiosperms is generally conserved, and the entire SEP subfamily may have a conserved function in controlling the identity of all floral organs. Additionally, members of the SEP lineage may have a conserved role in floral meristems and ovules (Zahn et al., 2005). However, lineage-specific duplications followed by functional diversification are evident. Several additional gene duplication events were detected within the SEP1/2/4 and SEP3 lineages in monocots and eudicots. Zahn et al. (2005) detected at least five distinct grass clades in the monocots: three in the SEP1/2/4 lineage and two in the SEP3 lineage. After the origin of the eudicots but before the radiation of the core eudicots, an early duplication event in the SEP1/2 lineage (prior to the Brassicaceae duplication event) produced the FLORAL BINDING PROTEIN9 (FBP9) lineage that has apparently been lost in Arabidopsis (Zahn et al., 2005). The expression patterns in the basal angiosperms are similar to those of Arabidopsis, although duplication events have also occurred. For example, a recent duplication in the SEP3 lineage has produced two paralogs in Persea (Chanderbali et al., 2006, 2009). Despite this dynamic evolutionary history of gene duplications and diversification, SEP genes may generally be conserved in specifying meristem and floral organ identity in angiosperms. However, it is noteworthy that they range from being developmentally redundant, as in Arabidopsis, to having unique roles in Gerbera of the sunflower family (Asteraceae). The Gerbera GRCD1 and GRCD2 genes, of the SEP3 and SEP1/2/4 lineages, respectively, are expressed in all floral organs but have become subfunctionalized.
Down-regulation of GRCD1 results in transformation of the staminodes of female flowers into petals, while GRCD2 is needed for carpel development (Ulmari et al., 2004).

4.2 Gene Duplications and Morphological Novelty: Is There a Connection?

It is tempting to hypothesize that gene and genome duplications, while providing new raw material for evolution, also provide the catalyst for morphological innovation. Although such hypotheses are probably oversimplifications of the underlying genetic requirements for morphological evolution and data are limited, recent evidence from Aquilegia (Kramer et al., 2007; Rasmussen et al., 2009) suggests that varying patterns of expression of three AP3 paralogs and PI control petaloidy in these flowers. That is, different expression patterns of the duplicated genes contribute to the novel features of columbine flowers. Although not extensive evidence for the hypothesis that morphological novelty can arise through the action of duplicate genes, the Aquilegia example is certainly intriguing and suggests that other cases should be investigated.
5 CONCLUSIONS
Plant genes and genomes are replete with duplications, due to both local and whole-genome duplications. These duplications span nearly the age of angiosperms themselves: some WGDs date back to the early nodes of angiosperm phylogeny, and more recent WGDs are superimposed on these ancient events. The result is a complex genomic structure in all species investigated to date. The gene pairs resulting from various processes of duplication provide an immense data set for exploring the
consequences of gene and genome duplication and the fate of duplicate genes. As predicted by theory, some duplicate genes are retained in duplicate, each continuing to perform the ancestral function. Other genes are silenced, or nonfunctionalized. Still other pairs undergo neofunctionalization, in which one member of the pair acquires a new function. Finally, data are beginning to accumulate in support of the concept of subfunctionalization—the parsing of ancestral function between members of a duplicate pair. One of the most surprising observations is the propensity for homoeolog loss, as observed in the hexaploid wheat and Tragopogon allotetraploids, in particular. Such homoeolog loss is more rapid than those processes that rely on the accumulation of point mutations for gene silencing or changes in gene function; loss may occur very soon after gene or genome duplication (e.g., Tate et al., 2006; Buggs et al., 2009). The fate of duplicate genes may be, to some extent, lineage-specific, as some allotetraploids undergo homoeolog loss and others do not. The factors that contribute to homoeolog loss are unknown. Following stabilization after homoeolog loss, longer-term processes involving point mutations take over, and duplicate gene pairs within the same genome may experience alternative fates, from silencing of one copy to neofunctionalization to subfunctionalization. The genomic attributes and selective pressures that result in one fate versus another have not been addressed. As patterns of duplicate gene fate begin to emerge, we should turn our attention next to those genomic features that may lead to one path versus another. For example, are genes duplicated by polyploidy more likely to undergo loss than those duplicated via tandem duplication? Do duplicate genes involved in the same pathway or network respond in the same way? And ultimately, what are the links between duplicate genes, morphological novelty, and organismal diversification? 
These questions are just beginning to be addressed (see, e.g., DeBodt et al., 2005; Maere et al., 2005; Freeling and Thomas, 2006; Semon and Wolfe, 2007; Fawcett et al., 2009; Freeling, 2009; Van de Peer et al., 2009). The next few years offer extremely exciting opportunities for further study of duplicate genes in plants.

Acknowledgments

This work was supported in part by National Science Foundation grants MCB-0346437, DEB-0608268, EF-0431266, PGR-0115684, and DBI-0638595 and by the University of Florida. We appreciate the contributions and helpful discussion of R. J. A. Buggs. We thank two anonymous reviewers for their comments on an earlier draft of the chapter.

REFERENCES

Adams KL. 2007. Evolution of duplicate gene expression in polyploid and hybrid plants. J Hered 98:136–141.
Adams KL, Wendel JF. 2005a. Allele-specific, bidirectional silencing of an alcohol dehydrogenase gene in different organs of interspecific diploid cotton hybrids. Genetics 171:2139–2142.
Adams KL, Wendel JF. 2005b. Novel patterns of gene expression in polyploid plants. Trends Genet 21:539–543.
Adams KL, Cronn R, Percifield R, Wendel JF. 2003. Genes duplicated by polyploidy show unequal contributions to the transcriptome and organ-specific reciprocal silencing. Proc Natl Acad Sci USA 100:4649–4654.
Adams KL, Percifield R, Wendel JF. 2004. Organ-specific silencing of duplicated genes in a newly synthesized cotton allotetraploid. Genetics 168:2217–2226.
Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, et al. 2006. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature 444:171–178.
Bansal MS, Eulenstein O. 2008. The multiple gene duplication problem revisited. Bioinformatics 24:i132–i138.
Bansal MS, Burleigh JG, Eulenstein O, Wehe A. 2007. Heuristics for the gene-duplication problem: a Θ(n) speed-up for the local search. In Speed TP, Huang H (eds.), Proceedings of the 11th Annual International Conference on Research in Computational Molecular Biology (RECOMB’07), Vol. 4453 of Lecture Notes in Computer Science. New York: Springer-Verlag, pp. 238–252.
Barker MS, Kane NC, Matvienko M, Kozik A, Michelmore RW, Knapp SJ, Rieseberg LH. 2008. Multiple paleopolyploidizations during the evolution of the Compositae reveal parallel patterns of duplicate gene retention after millions of years. Mol Biol Evol 25:2445–2455.
Barrett SCH, Shore JS. 1987. Variation and evolution of breeding systems in the Turnera ulmifolia complex (Turneraceae). Evolution 41:340–354.
Becker A, Theissen G. 2003. The major clades of MADS-box genes and their role in the development and evolution of flowering plants. Mol Phylogenet Evol 29:464–489.
Bennett MD, Leitch IJ. 2004. Plant DNA C-values database (release 3.0, Dec. 2004). www.rbgkew.org.uk/cval/homepage.html.
Berbel A, Navarro C, Ferrandiz C, Canas LA, Madueno F, Beltran JP. 2001. Analysis of PEAM4, the pea AP1 functional homologue, supports a model for AP1-like genes controlling both floral meristem and floral organ identity in different plant species. Plant J 25:441–451.
Birchler JA, Riddle NC, Auger DL, Veitia RA. 2005. Dosage balance in gene regulation: biological implications. Trends Genet 21:219–226.
Blanc G, Wolfe KH. 2004. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16:1667–1678.
Blanc G, Hokamp K, Wolfe KH. 2003. A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 13:137–144.
Bottley A, Xia GM, Koebner RMD. 2006. Homoeologous gene silencing in hexaploid wheat. Plant J 47:897–906.
Bowers JE, Chapman BA, Rong J, Paterson AH. 2003. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422:433–438.
Bowman JL, Drews GN, Meyerowitz EM. 1991. Expression of the Arabidopsis floral homeotic gene AGAMOUS is restricted to specific cell types late in flower development. Plant Cell 3:749–758.
Bowman JL, Alvarez J, Weigel D, Meyerowitz EM, Smyth DR. 1993. Control of flower development in Arabidopsis thaliana by APETALA1 and interacting genes. Development 119:721–743.
Bradley D, Carpenter R, Sommer H, Hartley N, Coen E. 1993. Complementary floral homeotic phenotypes result from opposite orientations of a transposon at the plena locus of Antirrhinum. Cell 72:85–95.
Buggs RJA, Doust AN, Tate JA, Koh J, Soltis K, Feltus FA, et al. 2009. Gene loss and silencing in Tragopogon miscellus (Asteraceae): comparison of natural and synthetic allotetraploids. Heredity 103:73–81.
Burleigh JG, Bansal MS, Wehe A, Eulenstein O. 2008. Locating multiple gene duplications through reconciled trees. RECOMB, LNBI 4955:273–284.
Cannon SB, Sterck L, Rombauts S, Sato S, Cheung F, Gouzy J, et al. 2006. Legume genome evolution viewed through the Medicago truncatula and Lotus japonicus genomes. Proc Natl Acad Sci USA 103:14959–14964.
Chanderbali AS, Kim S, Buzgo M, Zheng Z, Oppenheimer DG, Soltis DE, Soltis PS. 2006. Genetic footprints of stamen ancestors guide perianth evolution in Persea (Lauraceae). Int J Plant Sci 167:1075–1089.
Chanderbali AS, Albert VA, Leebens-Mack J, Altman NS, Soltis DE, Soltis PS. 2009. Transcriptional signatures of ancient floral developmental genetics in avocado (Persea americana; Lauraceae). Proc Natl Acad Sci USA 106:8929–8934.
Chapman BA, Bowers JE, Schulze SR, Paterson AH. 2004. A comparative phylogenetic approach for dating whole genome duplication events. Bioinformatics 20:180–185.
Chen ZJ. 2007. Genetic and epigenetic mechanisms for gene expression and phenotypic variation in plant polyploids. Annu Rev Plant Biol 58:377–406.
Chen ZJ, Ni Z. 2006. Mechanisms of genomic rearrangements and gene expression changes in plant polyploids. Bioessays 28:240–252.
Chen ZJ, Pikaard CS. 1997. Epigenetic silencing of RNA polymerase I transcription: a role for DNA methylation and histone modification in nucleolar dominance. Genes Dev 11:2124–2136.
Chen ZJ, Tian L. 2007. Roles of dynamic and reversible histone acetylation in plant development and polyploidy. Biochim Biophys Acta 1769:295–307.
Chen M, Ha M, Lackey E, Wang JL, Chen ZJ. 2008. RNAi of met1 reduces DNA methylation and induces genome-specific changes in gene expression and centromeric small RNA accumulation in Arabidopsis allopolyploids. Genetics 178:1845–1858.
Clausen J, Keck DD, Hiesey WM. 1945. Experimental studies on the nature of species: II. Plant evolution through amphiploidy and autopolyploidy, with examples from the Madiinae. Washington, DC: Carnegie Institute of Washington.
Coen ES, Meyerowitz EM. 1991. The war of the whorls: genetic interactions controlling flower development. Nature 353:31–37.
Colombo L, Franken J, Koetje E, van Went J, Dons HJ, Angenent GC, van Tunen AJ. 1995. The petunia MADS box gene FBP11 determines ovule identity. Plant Cell 7:1859–1868.
Comai L. 2000. Genetic and epigenetic interactions in allopolyploid plants. Plant Mol Biol 43:387–399.
Comai L, Madlung A, Josefsson C, Tyagi A. 2003. Do the different parental “heteromes” cause genomic shock in newly formed allopolyploids? Philos Trans R Soc Lond B 358:1149–1155.
Conrad B, Antonarakis SE. 2007. Gene duplication: a drive for phenotypic diversity and cause of human disease. Annu Rev Genom Hum Genet 8:17–35.
Cui L, Wall PK, Leebens-Mack J, Lindsay BG, Soltis D, Doyle JJ, et al. 2006. Widespread genome duplications throughout the history of flowering plants. Genome Res 16:738–749.
Darlington CD. 1937. Recent Advances in Cytology, 2nd ed. Philadelphia: P. Blakiston’s Son and Co.
Davies B, Motte P, Keck E, Saedler H, Sommer H, Schwarz-Sommer Z. 1999. PLENA and FARINELLI: redundancy and regulatory interactions between two Antirrhinum MADS-box factors controlling flower development. EMBO J 18:4023–4034.
DeBodt S, Maere S, van de Peer Y. 2005. Genome duplication and the origin of angiosperms. Trends Ecol Evol 20:591–597.
de Martino G, Pan I, Emmanuel E, Levy A, Irish VF. 2006. Functional analyses of two tomato APETALA3 genes demonstrate diversification in their roles in regulating floral development. Plant Cell 18:1833–1845.
Des Marais DL, Rausher MD. 2008. Escape from adaptive conflict after duplication in an anthocyanin pathway gene. Nature 454:762–765.
Ditta G, Pinyopich A, Robles P, Pelaz S, Yanofsky MF. 2004. The SEP4 gene of Arabidopsis thaliana functions in floral organ and meristem identity. Curr Biol 14:935–940.
Dong Y, Liu Z, Shan X, Qiu T, He M, Liu B. 2005. Allopolyploidy in wheat induces rapid and heritable alterations in DNA methylation patterns of cellular genes and mobile elements. Russ J Genet 41:890–896.
Drews GN, Bowman JL, Meyerowitz EM. 1991. Negative regulation of the Arabidopsis homeotic gene AGAMOUS by the APETALA2 product. Cell 65:991–1002.
Durand D, Hoberman R. 2006. Diagnosing duplications: Can it be done? Trends Genet 22:156–164.
Earley K, Lawrence RJ, Pontes O, Reuther R, Enciso AJ, Silva M, et al. 2006. Erasure of histone acetylation by Arabidopsis HDA6 mediates large-scale gene silencing in nucleolar dominance. Genes Dev 20:1283–1293.
Eckhardt N. 2001. A sense of self: the role of DNA sequence elimination in allopolyploidization. Plant Cell 13:1699–1704.
Eilam T, Anikster Y, Millet E, Manisterski J, Feldman M. 2009. Genome size in natural and synthetic autopolyploids and in a natural segmental allopolyploid of several Triticeae species. Genome 52:275–285.
Eulenstein O. 1998. Vorhersage von Genduplikationen und deren Entwicklung in der Evolution. GMD Research Series, Vol. 20. Sankt Augustin, Germany.
Favaro R, Pinyopich A, Battaglia R, Kooiker M, Borghi L, Ditta G, et al. 2003. MADS-box protein complexes control carpel and ovule development in Arabidopsis. Plant Cell 15:2603–2611.
Fawcett JA, Maere S, van de Peer Y. 2009. Plants with double genomes might have had a better chance to survive the Cretaceous–Tertiary extinction event. Proc Natl Acad Sci USA 106:5737–5742.
Fellows M, Hallett M, Stege U. 1998. On the multiple gene duplication problem. ISAAC’98, LNCS 1533:347–357.
Ferrandiz C. 2002. Regulation of fruit dehiscence in Arabidopsis. J Exp Bot 53:2031–2038.
Ferrandiz C, Gu Q, Martienssen R, Yanofsky MF. 2000. Redundant regulation of meristem identity and plant architecture by FRUITFULL, APETALA1 and CAULIFLOWER. Development 127:725–734.
Flagel LE, Udall J, Nettleton D, Wendel J. 2008. Duplicate gene expression in allopolyploid Gossypium reveals two temporally distinct phases of expression evolution. BMC Biol 6:16.
Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. 1999a. Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545.
Force A, Lynch M, Postlethwait J. 1999b. Preservation of duplicate genes by subfunctionalization. Am Zool 39:460.
Freeling M. 2009. Bias in plant gene content following different sorts of duplication: tandem, whole-genome segmental, or by transposition. Annu Rev Plant Biol 60:433–533.
Freeling M, Thomas BC. 2006. Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Res 16:805–814.
Fulneček J, Matyášek R, Kovařík A. 2009. Faithful inheritance of cytosine methylation patterns in repeated sequences of the allotetraploid tobacco correlates with the expression of DNA methyltransferase gene families from both parental genomes. Mol Genet Genom 281:407–420.
Furlong RF, Holland PW. 2002. Were vertebrates octoploid? Philos Trans R Soc Lond B 357:531–544.
Gaeta RT, Pires JC, Iniguez-Luy F, Leon E, Osborn TC. 2007. Genomic changes in resynthesized Brassica napus and their effect on gene expression and phenotype. Plant Cell 19:3403–3417.
Gates RR. 1909. The stature and chromosomes of Oenothera gigas De Vries. Arch Zellforsch 3:525–552.
Goldblatt P. 1980. Polyploidy in angiosperms: monocotyledons. In Lewis WH (ed.), Polyploidy: Biological Relevance. New York: Plenum Press, pp. 219–239.
Gottlieb LD. 1974. Gene duplication and fixed heterozygosity for alcohol dehydrogenase in the diploid plant Clarkia franciscana. Proc Natl Acad Sci USA 71:1816–1818.
Gottlieb LD. 1977. Evidence for duplication and divergence of the structural gene for phosphoglucoisomerase in diploid species of Clarkia. Genetics 86:289–307.
Grant V. 1963. The Origin of Adaptations. New York: Columbia University Press.
Grant V. 1981. Plant Speciation, 2nd ed. New York: Columbia University Press.
Guigó R, Muchnik I, Smith TF. 1996. Reconstruction of ancient molecular phylogeny. Mol Phylogenet Evol 6:189–213.
Guyot R, Keller B. 2004. Ancestral genome duplication in rice. Genome 47:610–614.
Ha M, Kim E-D, Chen ZJ. 2009. Duplicate genes increase expression diversity in closely related species and allopolyploids. Proc Natl Acad Sci USA 106:2295–2300.
Hahn M. 2007. Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biol 8:R141.
Hedrick PW. 1987. Genetic load and the mating system in homosporous ferns. Evolution 41:1282–1289.
Hegarty M, Jones J, Wilson I, Barker G, Sanchez-Baracaldo P, et al. 2005. Development of anonymous cDNA microarrays to study changes to the Senecio floral transcriptome during hybrid speciation. Mol Ecol 14:2493–2510.
Hegarty M, Barker G, Wilson I, Abbott RJ, Edwards KJ, Hiscock SJ. 2006. Transcriptome shock after interspecific hybridization in Senecio is ameliorated by genome duplication. Curr Biol 16:1652–1659.
Hegarty MJ, Hiscock SJ. 2008. Genomic clues to the evolutionary success of polyploid plants. Curr Biol 18:R435–R444.
Hegarty MJ, Barker GL, Brennan AC, Edwards KJ, Abbott RJ, Hiscock SJ. 2008. Changes to gene expression associated with hybrid speciation in plants: further insights from transcriptomic studies in Senecio. Philos Trans R Soc Lond B 363:3055–3069.
Howarth DG, Donoghue MJ. 2006. Phylogenetic analysis of the “ECE” (CYC/TB1) clade reveals duplications predating the core eudicots. Proc Natl Acad Sci USA 103:9101–9106.
Huijser P, Klein J, Lönnig WE, Meijer H, Saedler H, Sommer H. 1992. Bracteomania, an inflorescence anomaly, is caused by the loss of function of the MADS-box gene squamosa in Antirrhinum majus. EMBO J 11:1239–1249.
Irish VF, Litt A. 2005. Flower development and evolution: gene duplication, diversification and redeployment. Curr Opin Genet Dev 15:454–460.
Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, et al. 2007. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449:463–467.
Kashkush K, Feldman M, Levy AA. 2002. Gene loss, silencing and activation in a newly synthesized wheat allotetraploid. Genetics 160:1651–1659.
Kashkush K, Feldman M, Levy AA. 2003. Transcriptional activation of retrotransposons alters the expression of adjacent genes in wheat. Nat Genet 33:102–106.
Kellis M, Birren BW, Lander ES. 2004. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428:617–624.
Kempin S, Savidge B, Yanofsky MF. 1995. Molecular basis of the cauliflower phenotype in Arabidopsis. Science 267:522–525.
Keyte AL, Percifield R, Liu B, Wendel JF. 2006. Intraspecific DNA methylation polymorphism in cotton (Gossypium hirsutum L.). J Hered 9:444–450.
Kim S, Soltis DE, Albert V, Yoo MJ, Farris JS, Soltis PS, Soltis DE. 2004. Phylogeny and diversification of B-function MADS-box genes in angiosperms: evolutionary and functional implications of a 260-million-year-old duplication. Am J Bot 91:2102–2118.
Kim S, Koh J, Yoo MJ, Kong H, Hu Y, Ma H, et al. 2005. Expression of floral MADS-box genes in basal angiosperms: implications for the evolution of floral regulators. Plant J 43:724–744.
Kong H, Leebens-Mack J, Ni W, dePamphilis CW, Ma H. 2004. Highly heterogeneous rates of evolution in the SKP1 gene family in plants and animals: functional and evolutionary implications. Mol Biol Evol 21:117–128.
Kovařík A, Matyášek R, Lim KY, Skalicka K, Koukalova B, Knapp S, et al. 2004. Concerted evolution of 18-5.8-26S rDNA repeats in Nicotiana allotetraploids. Biol J Linn Soc 82:615–625.
Kovařík A, Pires JC, Leitch AR, Lim KY, Sherwood AM, Matyášek R, et al. 2005. Rapid concerted evolution of nuclear ribosomal DNA in two Tragopogon allopolyploids of recent and recurrent origin. Genetics 169:931–944.
Kowalski S, Lan T-H, Feldmann K, Paterson A. 1994. Comparative mapping of Arabidopsis thaliana and Brassica oleracea chromosomes reveals islands of conserved gene order. Genetics 138:499–510.
Kramer EM, Dorit RL, Irish VF. 1998. Molecular evolution of petal and stamen development: gene duplication and divergence within the APETALA3 and PISTILLATA MADS-box gene lineages. Genetics 149:765–783.
Kramer EM, Di Stilio VS, Schlüter PM. 2003. Complex patterns of gene duplication in the APETALA3 and PISTILLATA lineages of the Ranunculaceae. Int J Plant Sci 164:1–11.
Kramer EM, Jaramillo MA, Di Stilio VS. 2004. Patterns of gene duplication and functional evolution during the diversification of the AGAMOUS subfamily of MADS box genes in angiosperms. Genetics 166:1011–1023.
Kramer EM, Holappa L, Gould B, Jaramillo MA, Setnikov D, Santiago PM. 2007. Elaboration of B gene function to include the identity of novel floral organs in the lower eudicot Aquilegia. Plant Cell 19:750–766.
Kuwada Y. 1911. Meiosis in the pollen mother cells of Zea mays L. Bot Mag 25:163–181.
Lagercrantz U, Lydiate DJ. 1996. Comparative genome mapping in Brassica. Genetics 144:1903–1910.
Lan TH, DelMonte TA, Reischmann KP, Hyman J, Kowalski SP, McFerson J, Kresovich S, Paterson AH. 2000. An EST-enriched comparative map of Brassica oleracea and Arabidopsis thaliana. Genome Res 10:776–788.
Langkjaer RB, Cliften PF, Johnston M, Piskur J. 2003. Yeast genome duplication was followed by asynchronous differentiation of duplicated genes. Nature 421:848–852.
Lawrence RJ, Earley K, Pontes O, Silva M, Chen ZJ, Neves N, et al. 2004. A concerted DNA methylation/histone methylation switch regulates rRNA gene dosage control and nucleolar dominance. Mol Cell 13:599–609.
Lee HS, Chen ZJ. 2001. Protein-coding genes are epigenetically regulated in Arabidopsis polyploids. Proc Natl Acad Sci USA 98:6753–6758.
Leebens-Mack JH, Wall K, Duarte J, Zheng Z, Oppenheimer D, dePamphilis C. 2006. A genomics approach to the study of ancient polyploidy and floral developmental genetics. Adv Bot Res 44:528–549.
Leitch IJ, Bennett MD. 2004. Genome downsizing in polyploid plants. Biol J Linn Soc 82:651–663.
Levin DA. 1983. Polyploidy and novelty in flowering plants. Am Nat 122:1–25.
Lewis WH. 1980a. Polyploidy in species populations. In Lewis WH (ed.), Polyploidy: Biological Relevance. New York: Plenum Press, pp. 103–144.
Lewis WH. 1980b. Polyploidy in angiosperms: dicotyledons. In Lewis WH (ed.), Polyploidy: Biological Relevance. New York: Plenum Press, pp. 241–268.
Lewis H, Lewis ME. 1955. The genus Clarkia. Univ Calif Publ Bot 20:241–392.
Liljegren SJ, Ditta GS, Eshed Y, Savidge B, Bowman JL, Yanofsky MF. 2000. SHATTERPROOF MADS-box genes control seed dispersal in Arabidopsis. Nature 404:766–770.
Litt A. 2007. An evaluation of A-function: evidence from the APETALA1 and APETALA2 gene lineages. Int J Plant Sci 168:73–91.
Litt A, Irish VF. 2003. Duplication and diversification in the APETALA1/FRUITFULL floral homeotic gene lineage: implications for the evolution of floral development. Genetics 165:821–833.
Liu Z, Adams KL. 2007. Expression partitioning between genes duplicated by polyploidy under abiotic stress and during organ development. Curr Biol 17:1669–1674.
Liu B, Wendel JF. 2002. Non-mendelian phenomena in allopolyploid genome evolution. Curr Genom 3:489–505.
Liu B, Wendel JF. 2003. Epigenetic phenomena and the evolution of plant allopolyploids. Mol Phylogenet Evol 29:365–379.
Liu B, Vega JM, Feldman M. 1998a. Rapid genomic changes in newly synthesized amphiploids of Triticum and Aegilops: II. Changes in low-copy coding DNA sequences. Genome 41:535–542.
Liu B, Vega JM, Segal G, Abbo S, Rodova M, Feldman M. 1998b. Rapid genomic changes in newly synthesized amphiploids of Triticum and Aegilops: I. Changes in low-copy noncoding DNA sequences. Genome 41:272–277.
Lonnig WE, Saedler H. 2002. Chromosome rearrangements and transposable elements. Annu Rev Genet 36:389–410.
Löve A, Löve D. 1949. The geobotanical significance of polyploidy: I. Polyploidy and latitude. Portugaliae Acta Biol Ser A: 273–352.
Lowman AC, Purugganan MD. 1999. Duplication of the Brassica oleracea APETALA1 floral homeotic gene and the evolution of domesticated cauliflower. Genetics 90:514–520.
Lukens LN, Quijada PA, Udall J, Pires JC, Schranz ME, Osborn TC. 2004. Genome redundancy and plasticity within ancient and recent Brassica crop species. Biol J Linn Soc 82:665–674.
Lukens LN, Pires JC, Leon E, Vogelzang R, Oslach L, Osborn T. 2006. Patterns of sequence loss and cytosine methylation within a population of newly resynthesized Brassica napus allopolyploids. Plant Physiol 140:336–348.
Lutz AM. 1907. A preliminary note on the chromosomes of Oenothera lamarckiana and one of its mutants, O. gigas. Science 26:151–152.
Lynch M. 2007. The Origins of Genome Architecture. Sunderland, MA: Sinauer Associates.
Lynch M, Conery JS. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155.
Lynch M, Force A. 2000. The probability of duplicate gene preservation by subfunctionalization. Genetics 154:459–473.
Madlung A, Masuelli RW, Watson B, Reynolds SH, Davison J, Comai L. 2002. Remodeling of DNA methylation and phenotypic and transcriptional changes in synthetic Arabidopsis allotetraploids. Plant Physiol 129:733–746.
Madlung A, Tyagi AP, Watson B, Jiang HM, Kagochi T, Doerge RW, et al. 2005. Genomic changes in synthetic Arabidopsis polyploids. Plant J 41:221–230.
Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, et al. 2005. Modeling gene and genome duplications in eukaryotes. Proc Natl Acad Sci USA 102:5454–5459.
Mandel MA, Yanofsky MF. 1995. The Arabidopsis AGL8 MADS box gene is expressed in inflorescence meristems and is negatively regulated by APETALA1. Plant Cell 7:1763–1771.
Martinez-Castilla LP, Alvarez-Buylla ER. 2003. Adaptive evolution in the Arabidopsis MADS-box gene family inferred from its complete resolved phylogeny. Proc Natl Acad Sci USA 100:13407–13412.
Masterson J. 1994. Stomatal size in fossil plants: evidence for polyploidy in majority of angiosperms. Science 264:421–423.
Matyášek R, Lim KY, Kovařík A, Leitch AR. 2003. Ribosomal DNA evolution and gene conversion in Nicotiana rustica. Heredity 91:268–275.
Matyášek R, Tate JA, Lim YK, Srubarova H, Koh J, Leitch AR, et al. 2007. Concerted evolution of rDNA in recently formed Tragopogon allotetraploids is typically associated with an inverse correlation between gene copy number and expression. Genetics 176:2509–2519.
McClintock B. 1984. The significance of responses of the genome to challenge. Science 226:792–801.
Mena M, Ambrose BA, Meeley RB, Briggs SP, Yanofsky MF, et al. 1996. Diversification of C-function activity in maize flower development. Science 274:1537–1540.
Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, et al. 2008. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya L.). Nature 452:991–996.
Müntzing A. 1936. The evolutionary significance of autopolyploidy. Hereditas 21:263–378.
Nam J, dePamphilis CW, Ma H, Nei M. 2003. Antiquity and evolution of the MADS-box gene family controlling flower development in plants. Mol Biol Evol 20:1435–1447.
Ni Z, Kim E-D, Ha M, Lackey E, Liu J, Zhang Y, et al. 2009. Altered circadian rhythms regulate growth vigour in hybrids and allopolyploids. Nature 457:327–331.
Nowak MA, Boerlijst MC, Cooke J, Smith JM. 1997. Evolution of genetic redundancy. Nature 388:167–171.
Ohno S. 1970. Evolution by Gene Duplication. New York: Springer-Verlag.
Ohta T. 1988. Time for acquiring a new gene by duplication. Proc Natl Acad Sci USA 85:3509–3512.
Osborn TC, Pires JC, Birchler JA, Auger DL, Chen JZ, Lee HS, et al. 2003. Understanding mechanisms of novel gene expression in polyploids. Trends Genet 19:141–147.
Ozkan H, Levy AA, Feldman M. 2002. Rapid differentiation of homeologous chromosomes in newly-formed allopolyploid wheat. Isr J Plant Sci 50:S65–S76.
Page RDM, Cotton JA. 2002. Vertebrate phylogenomics: reconciled trees and gene duplications. Pacific Symposium on Biocomputing, pp. 536–547.
Pannell JR, Obbard DJ, Buggs RJA. 2004. Polyploidy and the sexual system: What can we learn from Mercurialis annua? Biol J Linn Soc 82:547–560.
Parenicova L, de Folter S, Kieffer M, Horner DS, Favalli C, Busscher J, et al. 2003. Molecular and phylogenetic analyses of the complete MADS-box transcription factor family in Arabidopsis: new openings to the MADS world. Plant Cell 15:1538–1551.
Paterson AH, Bowers JE, Chapman BA. 2004. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc Natl Acad Sci USA 101:9903–9908.
Paterson AH, Chapman BA, Kissinger JC, Bowers JE, Feltus FA, Estill JC. 2006. Many gene and domain families have convergent fates following independent whole-genome duplication events in Arabidopsis, Oryza, Saccharomyces and Tetraodon. Trends Genet 22:597–602.
Pelaz S, Ditta GS, Baumann E, Wisman E, Yanofsky MF. 2000. B and C floral organ identity functions require SEPALLATA MADS-box genes. Nature 405:200–203.
Petit M, Lim KY, Julio E, Poncet C, Dorlhac de Borne F, Kovarik A, et al. 2007. Differential impact of retrotransposon populations on the genome of allotetraploid tobacco (Nicotiana tabacum). Mol Genet Genom 278:1–15.
Pfeil BE, Schlueter JA, Shoemaker RC, Doyle JJ. 2005. Placing paleopolyploidy in relation to taxon divergence: a phylogenetic analysis in legumes using 39 gene families. Syst Biol 54:441–454.
Pikaard CS. 1999. Nucleolar dominance and silencing of transcription. Trends Plant Sci 4:478–483.
Pinyopich A, Ditta GS, Savidge B, Liljegren SJ, Baumann E, Wisman E, Yanofsky MF. 2003. Assessing the redundancy of MADS-box genes during carpel and ovule development. Nature 424:85–88.
Preuss SB, Costa-Nunes P, Tucker S, Pontes O, Lawrence RJ, Mosher R, et al. 2008. Multimegabase silencing in nucleolar dominance involves siRNA-directed DNA methylation and specific methylcytosine-binding proteins. Mol Cell 32:673–684.
Quiros CF, Grellet F, Sadowski J, Suzuki T, Li G, Wroblewski T. 2001. Arabidopsis and Brassica comparative genomics: sequence, structure and gene content in the ABI1-Rps2-Ck1 chromosomal segment and related regions. Genetics 157:1321–1330.
Rapp RA, Wendel JF. 2005. Epigenetics and plant evolution. New Phytol 168:81–91.
Rasmussen DA, Kramer EM, Zimmer EA. 2009. One size fits all? Molecular evidence for a commonly inherited petal identity program in Ranunculales. Am J Bot 96:96–109.
Rensing SA, Ick J, Fawcett JA, Lang D, Zimmer A, Van de Peer Y, Reski R. 2007. An ancient genome duplication contributed to the abundance of metabolic genes in the moss Physcomitrella patens. BMC Evol Biol 7:130.
Rijpkema AS, Royaert S, Zethof J, van der Weerden G, Gerats T, Vandenbussche M. 2006. Analysis of the Petunia TM6 MADS box gene reveals functional divergence within the DEF/AP3 lineage. Plant Cell 18:1819–1832.
Rodin SN, Riggs AD. 2003. Epigenetic silencing may aid evolution by gene duplication. J Mol Evol 56:718–729.
Roose ML, Gottlieb LD. 1976. Genetic and biochemical consequences of polyploidy in Tragopogon. Evolution 30:818–830.
Rounsley SD, Ditta GS, Yanofsky MF. 1995. Diverse roles for MADS box genes in Arabidopsis development. Plant Cell 7:1259–1269.
Sanderson MJ. 2002. Estimating absolute rates of molecular evolution and divergence times: a penalized likelihood approach. Mol Biol Evol 19:101–109.
Schlueter JA, Dixon P, Granger C, Grant D, Clark L, Doyle JJ, Shoemaker RC. 2004. Mining EST databases to resolve evolutionary events in major crop species. Genome 47:868–876.
Schranz EM, Mitchell-Olds T. 2006. Independent ancient polyploidy events in the sister families Brassicaceae and Cleomaceae. Plant Cell 18:1152–1165.
Semon M, Wolfe KH. 2007. Consequences of genome duplication. Curr Opin Genet Dev 17:505–512.
Shaked H, Kashkush K, Ozkan H, Feldman M, Levy AA. 2001. Sequence elimination and cytosine methylation are rapid and reproducible responses of the genome to wide hybridization and allopolyploidy in wheat. Plant Cell 13:1749–1759.
Shan X, Liu Z, Dong Z, Wang Y, Chen Y, Lin X, Long L, Han F, Dong Y, Liu B. 2005. Mobilization of the active MITE transposons mPing and Pong in rice by introgression from wild rice (Zizania latifolia Griseb.). Mol Biol Evol 22:976–990.
Shiu S-H, Byrnes JK, Pan R, Zhang P, Li W-H. 2006. Role of positive selection in the retention of duplicate genes in mammalian genomes. Proc Natl Acad Sci USA 103:2232–2236.
Shoemaker RC, Polzin K, Labate J, Specht J, Brummer EC, Olson T, et al. 1996. Genome duplication in soybean (Glycine subgenus soja). Genetics 144:329–338.
Simillion C, Vandepoele K, Van Montagu MC, Zabeau M, Van de Peer Y. 2002. The hidden duplication past of Arabidopsis thaliana. Proc Natl Acad Sci USA 99:13627–13632.
Smyth DR, Bowman JL, Meyerowitz EM. 1990. Early flower development in Arabidopsis. Plant Cell 2:755–767. Soltis DE, Rieseberg LH. 1986. Autopolyploidy in Tolmiea menziesii (Saxifragaceae): genetic insights from enzyme electrophoresis. Am J Bot 73:310–318. Soltis DE, Soltis PS. 1989. Genetic consequences of autopolyploidy in Tolmiea (Saxifragaceae). Evolution 43:586–594. Soltis PS, Soltis DE. 1990. Evolution of inbreeding and outcrossing in ferns and fern-allies. Plant Species Biol 5:1–12. Soltis DE, Soltis PS. 1993. Molecular data and the dynamic nature of polyploidy. Crit Rev Plant Sci 12:243–273. Soltis DE, Soltis PS. 1999. Polyploidy: recurrent formation and genome evolution. Trends Ecol Evol 14:348–352. Soltis PS, Soltis DE. 2000. The role of genetic and genomic changes in the success of polyploids. Proc Natl Acad Sci USA 97:7051–7057. Soltis PS, Soltis DE, Gottlieb LD. 1987. Phosphoglucomutase gene duplications and their phylogenetic implications in Clarkia (Onagraceae). Evolution 41:667–671. Soltis DE, Soltis PS, Tate J. 2003. Advances in the study of polyploidy since Plant Speciation. New Phytol 161:173–191. Soltis DE, Soltis PS, Endress P, Chase MW. 2005. Phylogeny and Evolution of Angiosperms. Sunderland, MA: Sinauer Associates. Soltis PS, Soltis DE, Kim S, Chanderbali A, Buzgo M. 2006. Expression of floral regulators in basal angiosperms and the origin and evolution of the ABC model. Adv Bot Res 44:483–506. Soltis DE, Chanderbali AS, Kim S, Buzgo M, Soltis PS. 2007. The ABC model and its applicability to basal angiosperms. Ann Bot 100:155–163. Soltis DE, Albert VA, Leebens-Mack J, Palmer J, Wing R, dePamphilis C, et al. 2008. The Amborella genome initiative: a genome for understanding the evolution of angiosperms. Genome Biol 9:402. Soltis DE, Albert VA, Leebens-Mack J, Bell CD, Paterson A, Zheng C, et al. 2009a. Polyploidy and angiosperm diversification. Am J Bot 96:336–348. Soltis PS, Brockington SF, Yoo M-J, Piedrahita A, Latvis M, Moore MJ, et al. 
2009b. Floral variation and floral genetics in basal angiosperms. Am J Bot 96:110–128. Song K, Lu P, Tang K, Osborn TC. 1995. Rapid genome change in synthetic polyploids of Brassica and its implications for polyploid evolution. Proc Natl Acad Sci USA 92:7719–7723. Stebbins GL. 1940. The significance of polyploidy in plant evolution. Am Nat 74:54–66. Stebbins GL. 1947. Types of polyploids: their classification and significance. Adv Genet 1:403–429. Stebbins GL. 1950. Variation and Evolution in Plants. New York: Columbia University Press. Stellari G, Jaramillo MA, Kramer E. 2004. Evolution of the APETALA3 and PISTILLATA lineages of MADS-box-containing genes in the basal angiosperms. Mol Biol Evol 21:506–519. Sterck L, Rombauts S, Jansson S, Sterky F, Rouzé P, Van de Peer Y. 2005. EST data suggest that poplar is an ancient polyploid. New Phytol 167:165–170. Tate JA, Ni ZF, Scheen AC, Koh J, Gilbert CA, Lefkowitz D, et al. 2006. Evolution and expression of homeologous loci in Tragopogon miscellus (Asteraceae), a recent and reciprocally formed allopolyploid. Genetics 173:1599–1611. Taylor SA, Hofer JMI, Murfet IC, Sollinger JD, Singer SR, Knox MR, Ellis THN. 2002. PROLIFERATING INFLORESCENCE MERISTEM, a MADS-box gene that regulates floral meristem identity in pea. Plant Physiol 129:1150–1159.
Terashima A, Takumi S. 2009. Allopolyploidization reduces alternative splicing efficiency for transcripts of the wheat DREB2 homolog, WDREB2. Genome 52:100–105. Theissen G. 2001. Development of floral organ identity: stories from the MADS house. Curr Opin Plant Biol 4:75–85. Theissen G, Becker A, Di Rosa A, Kanno A, Kim JT, Munster T, Winter KU, Saedler H. 2000. A short history of MADS-box genes in plants. Plant Mol Biol 42:115–149. Thomas BC, Pedersen B, Freeling M. 2006. Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Res 16:934–946. Tuskan GA, DiFazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, et al. 2006. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313:1596–1604. Udall JA, Wendel JF. 2006. Polyploidy and crop improvement. Crop Sci 46:S3–S14. Udall JA, Quijada PA, Osborn TC. 2005. Detection of chromosomal rearrangements derived from homeologous recombination in four mapping populations of Brassica napus L. Genetics 169:967–979. Ulmari A, Kotilainen M, Elomaa P, Yu D, Albert VA, et al. 2004. Integration of reproductive meristem fates by a SEPALLATA-like MADS-box gene. Proc Natl Acad Sci USA 101:15817–15822. Ungerer MC, Strakosh SC, Zhen Y. 2006. Genome expansion in three hybrid sunflower species is associated with retrotransposon proliferation. Curr Biol 16:R872–R873. Van de Peer Y, Maere S, Meyer A. 2009. The evolutionary significance of ancient genome duplications. Nat Rev Genet 10:725–732. Vandepoele K, Simillion C, van de Peer Y. 2003. Evidence that rice and other cereals are ancient aneuploids. Plant Cell 15:2192–2202. Velasco R, Zharkikh A, Troggio M, Cartwright DA, Cestaro A, Preuss D, et al. 2007. A high quality draft consensus sequence of the genome of a heterozygous grapevine variety. PLoS ONE 2:e1326. Vision TJ, Brown DG, Tanksley SD. 2000. The origins of genomic duplications in Arabidopsis. 
Science 290:2114–2117. Vrebalov J, Ruezinsky D, Padmanabhan V, White R, Medrano D, Drake R, Schuch W, Giovannoni J. 2002. MADS-box gene necessary for fruit ripening at the tomato ripening-inhibitor (rin) locus. Science 296:343–346. Walsh JB. 1995. How often do duplicated genes evolve new functions?. Genetics 139:421–428. Wang X, Shi X, Hao B, Ge S, Luo J. 2005. Duplication and DNA segmental loss in the rice genome: implications for diploidization. New Phytol 165:937–946. Wang J, Tian L, Lee H-S, Wei NE, Jiang H, Watson B, et al. 2006a. Genomewide nonadditive gene regulation in Arabidopsis allotetraploids. Genetics 172:507–517. Wang JL, Tian L, Lee HS, Chen ZJ. 2006b. Nonadditive regulation of FRI and FLC loci mediates flowering-time variation in Arabidopsis allopolyploids. Genetics 173:965–974. Weil C, Martienssen R. 2008. Epigenetic interactions between transposons and genes: lessons from plants. Curr Opin Genet Dev 18:188–192. Wendel JF. 2000. Genome evolution in polyploids. Plant Mol Biol 42:225–249. Whipple CJ, Ciceri P, Padilla CM, Ambrose BA, Bandong SL, Schmidt RJ. 2004. Conservation of B-class floral homeotic gene function between maize and Arabidopsis. Development 131:6083–6091. Whipple CJ, Zanis MJ, Kellogg EA, Schmidt RJ. 2007. Conservation of B class gene expression in the second whorl of a basal grass and outgroups links the origin of lodicules and petals. Proc Natl Acad Sci USA 104:1081–1086.
Winge O. 1917. The chromosomes: their number and general importance. C R Trav Lab Carlsberg 13:131–275. Winter KU, Weiser C, Kaufmann K, Bohne A, Kirchner C, Kanno A, Saedler H, Theissen G. 2002. Evolution of class B floral homeotic proteins: obligate heterodimerization originated from homodimerization. Mol Biol Evol 19:587–596. Yu J, Wang J, Lin W, Li S, Li H, et al. 2005. The genomes of Oryza sativa: a history of duplication. PLoS Biol 3:266–281. Zahn LM, Kong H, Leebens-Mack JH, Kim S, Soltis PS, Landherr LL, et al. 2005. The evolution of the SEPALLATA subfamily of MADS-box genes: a pre-angiosperm origin with multiple duplications throughout angiosperm history. Genetics 169:2209–2223. Zahn LM, Leebens-Mack JH, Arrington JM, Hu Y, Landherr LL, dePamphilis CW, et al. 2006. Conservation and divergence in the AGAMOUS subfamily of MADS-box genes: evidence of independent sub- and neofunctionalization events. Evol Dev 8:30–45.
16
Whole Genome Duplications and the Radiation of Vertebrates
SHIGEHIRO KURAKU
Evolutionary Biology and Zoology, Department of Biology, University of Konstanz, Konstanz, Germany
AXEL MEYER
Evolutionary Biology and Zoology, Department of Biology, University of Konstanz, Konstanz, Germany; Center for Advanced Study, Berlin, Germany
1 INTRODUCTION
Almost 40 years ago, Susumu Ohno (1970) famously said: “Duplication created where selection merely modified.” Ohno made this statement, which most researchers still consider rather heretical, virtually in the absence of empirical data, at least by the standards of the genomic age (see Meyer and Van de Peer, 2003). However, during the last four decades, and particularly the last 10 years, both the profusion and the importance of all kinds of duplications in the genome have become more widely recognized. Duplication as a construction principle of evolution has gained acceptance even outside the field of genomics. Researchers in the field of “evo-devo” increasingly see it as an important mechanism that permits organisms to experiment with the evolution of novel gene functions without having to do so slowly or having to lose the original function of a gene (copy) altogether. Gene and genome duplications might increase the potential of evolutionary lineages to produce diverse phenotypes (Ohno, 1970).
One could classify genomic duplications by their size or by the mechanism that produced them. The first category would then be duplications of individual nucleotides, followed by duplications of small numbers of base pairs, such as the dinucleotide motifs of microsatellites (e.g., CA repeats). A further category could be duplications of small sets of contiguous functional nucleotides, such as enhancers and promoters. This category might be followed by exon duplications and entire gene duplications, which might occur through tandem duplication or retroposition. Duplications of chromosomal regions larger than a single gene might include chromosome arms or even entire chromosomes. The largest, and presumably rarest, form of duplication would be duplication of the whole genome.
Evolution After Gene Duplication, Edited by Katharina Dittmar and David Liberles. Copyright © 2010 Wiley-Blackwell
Whole genome duplications (WGDs) can apparently be recognized in the genomes of “higher” organisms that have been sequenced completely. Usually, up to three WGDs (but, interestingly, never more than that) have so far been inferred in the genomes of higher animals, fungi, and plants. WGD might be a major force that not only changes the genome content dramatically but also creates a surplus of newly duplicated genes, which might in turn result in new genetic networks and, possibly, evolutionary phenotypic novelties [reviewed by Semon and Wolfe (2007a)].
Before whole genome sequences became available, whole genome duplications had been analyzed primarily for early vertebrate evolution, based on evidence obtained through molecular phylogenetic analyses of biologically important gene families (e.g., Hox genes, genes involved in the adaptive immune system). In such analyses, one commonly recognizes an increase in the number of genes in various gene families (Holland et al., 1994; Kasahara et al., 1996; Wittbrodt et al., 1998). More recently, analyses of larger sets of gene sequences, from either complete or incomplete genome sequences, showed that similarly arranged sets of genes (synteny) are located on different chromosomes within a single genome (e.g., Pebusque et al., 1998) and that many chromosomal segments within a genome are similar to one another (reviewed in Kasahara, 2007). Intragenome redundancy was later revealed for teleost fishes as well (reviewed in Meyer and Van de Peer, 2005).
In this chapter we briefly describe current knowledge regarding the large-scale gene duplications that occurred in the basal lineages of vertebrates. We focus on the teleost-specific genome duplication (TSGD), which is also called the third-round (3R) genome duplication because the basal lineage of vertebrates had already experienced two earlier rounds (1R and 2R) of whole genome duplication. Here we give specific attention to the 1R, 2R, and 3R WGDs in the basal vertebrate lineages.
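The kind of intragenomic redundancy described in the Introduction can be illustrated with a minimal sketch. It assumes a hypothetical toy genome in which each gene is recorded only as a (chromosome, gene family) pair, and it simply counts the gene families shared between pairs of chromosomes as a crude proxy for paralogous segments. The chromosome names, family labels, and the threshold of three shared families are all illustrative assumptions; real analyses also take gene order, orientation, and statistical significance into account.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data (not from the chapter): one (chromosome, gene family)
# pair per gene.
genes = [
    ("chrA", "fam1"), ("chrA", "fam2"), ("chrA", "fam3"), ("chrA", "fam9"),
    ("chrB", "fam1"), ("chrB", "fam2"), ("chrB", "fam3"), ("chrB", "fam7"),
    ("chrC", "fam5"), ("chrC", "fam6"),
]


def shared_families(genes, min_shared=3):
    """Return chromosome pairs that share at least `min_shared` gene families."""
    families = defaultdict(set)
    for chrom, fam in genes:
        families[chrom].add(fam)

    result = {}
    for a, b in combinations(sorted(families), 2):
        common = families[a] & families[b]
        if len(common) >= min_shared:
            result[(a, b)] = sorted(common)
    return result


# Chromosome pairs that look like candidate paralogous regions in the toy data.
paralogons = shared_families(genes)
print(paralogons)
```

In this toy example only the pair chrA and chrB passes the threshold, mirroring the observation that similar sets of genes sit on different chromosomes of one genome.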
2 TELEOST-SPECIFIC GENOME DUPLICATION
2.1 Background
The idea of a teleost-specific whole genome duplication event in the actinopterygian lineage was proposed much later than that of the 1R/2R WGDs. This is probably because molecular studies of animals on the lineage leading to humans (mouse, chicken, and Xenopus) go back further than those of actinopterygians. The identification of higher numbers of genes in some gene families in teleost fishes than in tetrapods was the first DNA-based evidence on this issue (Wittbrodt et al., 1998). Some fish lineages (e.g., salmonids and cyprinids) and amphibian lineages, including Xenopus laevis, have experienced additional, independent whole genome duplication(s) more recently (see Gregory, 2005; Semon and Wolfe, 2008). The clustering of duplicated genes in specific genomic segments of modern fishes suggested a genome doubling that is not shared by tetrapods (e.g., Amores et al., 1998; reviewed by Meyer and Van de Peer, 2005). More recently, this fish-specific genome duplication has been shown with certainty through analyses of large-scale sequence data (Taylor et al., 2001a, 2003), including the draft genome sequences of the teleost models pufferfish and medaka (Aparicio et al., 2002; Jaillon et al., 2004; Kasahara et al., 2007).
The question of how far back in the evolution of fishes the 3R duplication occurred remained open because the initial comparative genomic analyses were based on the genomes of rather modern “model” teleost fishes. Studies of more basal lineages of Actinopterygii revealed that their divergences from the stem lineage of fishes preceded this genome duplication event (Crow et al., 2006; Hoegg et al., 2004). It is now thought that the TSGD occurred in the lineage leading to all extant teleost fishes, after the separation of the more basal actinopterygian lineages: the Polypteriformes (bichir), Acipenseriformes (sturgeons and paddlefish), Amiiformes (bowfin), and Semionotiformes (gars) (Hoegg et al., 2004) (Figure 1).
Figure 1 Phylogenetic and genomic properties of key lineages representing pre- and post-WGD conditions for the teleost-specific whole genome duplication. Phylogenetic relationships are based on Kikugawa et al. (2004) and Inoue et al. (2003). Divergence times are based on Azuma et al. (2008). Animal groups with a pre-WGD state are shown in white boxes, while those with a post-WGD state are shown in black boxes. Information regarding C-values and chromosome numbers was retrieved from the Animal Genome Size Database (www.genomesize.com; Gregory et al., 2007). Note that multiple entries for the same species in this genome size database are included in the graphs.
The event was formerly called the fish-specific genome duplication (FSGD). Based on more precise knowledge of its phylogenetic timing, it is now more correctly called the teleost-specific genome duplication (TSGD; Kuraku and Meyer, 2009). The latter term is also more accurate because the word fish is applied to a paraphyletic assemblage that includes cyclostomes, chondrichthyans, lungfishes, and coelacanths, none of which experienced this genomic event.
The phylogenetic timing of the TSGD has been estimated using two different approaches. First, in a gene family tree approach, in which the timings of gene duplications are estimated directly, an analysis of duplicated sets of paralogs in the Fugu genome derived from the TSGD suggested that it occurred 320 ± 67 million years ago (Mya) (Vandepoele et al., 2004). Second, the relative timing of the TSGD was estimated from the absolute timings of the splits of the nonteleost actinopterygians, and the event was inferred to have occurred approximately 380 to 300 Mya. This second approach has been applied to whole mitochondrial DNA (mtDNA) genome sequences (Azuma et al., 2008) as well as to a combination of mtDNA and nuclear protein-coding genes; the latter study yielded a more recent estimate (316 to 226 Mya; Hurley et al., 2007).
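As a small worked example, the three timing estimates quoted above can be compared by treating each as a plain interval and computing the range common to all of them. This is only an arithmetic sketch: reading 320 ± 67 Mya as the interval from 253 to 387 Mya is a simplifying assumption, and the dictionary keys are illustrative labels rather than terms from the cited studies.

```python
# TSGD timing estimates from the text, as (youngest, oldest) bounds in Mya.
# Assumption: "320 ± 67 Mya" is read as a simple interval of 253 to 387 Mya.
estimates = {
    "gene_family_tree": (320 - 67, 320 + 67),  # Vandepoele et al., 2004
    "divergence_bracket": (300, 380),          # approx. 380 to 300 Mya
    "combined_mtdna_nuclear": (226, 316),      # Hurley et al., 2007
}


def common_interval(intervals):
    """Return the (youngest, oldest) range shared by all intervals, or None."""
    intervals = list(intervals)
    youngest = max(lo for lo, _hi in intervals)
    oldest = min(hi for _lo, hi in intervals)
    return (youngest, oldest) if youngest <= oldest else None


common = common_interval(estimates.values())
print(common)
```

Under these assumptions the three ranges overlap only between 300 and 316 Mya, which illustrates how much the estimates differ while remaining mutually compatible.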
2.2 Evolution Before the TSGD
As already mentioned, the TSGD occurred after the separation of several ancient lineages from the actinopterygian stem lineage that led to the teleosts. Phylogenetic relationships among these pre-TSGD lineages have been explored using mitochondrial genes, nuclear genes, and morphological characters (Inoue et al., 2003; Kikugawa et al., 2004; reviewed by Meyer and Zardoya, 2003). Both types of genes produced consistent results: they identified the Polypteriformes as the most basal actinopterygian lineage and found that bowfin and gars are more closely related to each other than to any other group. Previously, Holostei was considered a monophyletic group based on morphological observations (Nelson, 1969). However, the position of the Amia-gar group in relation to that of the Acipenseriformes is still not resolved consistently (Inoue et al., 2003; Kikugawa et al., 2004) (see Figure 1).
According to information from the Animal Genome Size Database (www.genomesize.com; Gregory et al., 2007), some members of the Polypteriformes have much larger genomes than most other actinopterygian fishes, except for some species of the Acipenseriformes. Many sturgeons have hugely increased their genome sizes and chromosome numbers as a result of repeated lineage-specific polyploidization events (Gregory, 2005; Peng et al., 2007). It is interesting to note that all of these pre-TSGD actinopterygian lineages show a low level of species diversity: only about 40 extant species belong to the four most basal actinopterygian lineages (Figure 1). The finding that the ancestors of some basal lineages of actinopterygian fishes did not experience the TSGD is valuable, as it allows us to study their genomes in order to examine a pre-TSGD (i.e., 2R) genomic condition.
Although the amount of data is still limited, the pre-TSGD condition of gene repertoires, or of selected genomic segments harboring them, has been confirmed in several more cases (e.g., Chiu et al., 2004; Hoegg and Meyer, 2007). For example, for the ParaHox gene family, Amia calva, an extant member of one of the lineages that diverged immediately before the TSGD, seems to possess a gene organization (Gsx, Xlox, and Cdx), with the same transcriptional orientation, similar to that of human and amphioxus (Mulley et al., 2006). The ParaHox gene repertoire of teleost fishes, whose ancestor experienced the third vertebrate genome duplication, is different. Their ancestral set of ParaHox gene clusters became dispersed across different genomic regions as a result of successive gene losses after the TSGD (Siegel et al., 2007).
2.3 Evolution After the TSGD
It has been suggested that the TSGD somehow permitted the remarkable species diversification seen in extant teleost fishes. It will be interesting to explore the relevance of this genome doubling for morphological diversification by identifying and studying fish lineages that diverged from the stem lineage immediately after this event. The most interesting lineage in this regard will be the Osteoglossomorpha (e.g., arowana, arapaima), the earliest post-TSGD lineage (Hoegg et al., 2004; Azuma et al., 2008). The time that elapsed between the TSGD and the split of these fishes from the stem lineage is thought to be less than 10 million years, a short time compared with the subsequent history of the post-TSGD fish lineages (approximately 300 million years). This estimate was based on evolutionary rates of Hox genes (Crow et al., 2006).
The Osteoglossomorpha (