The Origins of Evolutionary Innovations
The Origins of Evolutionary Innovations: A Theory of Transformative Change in Living Systems
Andreas Wagner, Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Switzerland
Great Clarendon Street, Oxford ox2 6dp Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide in Oxford New York Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto With offices in Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries Published in the United States by Oxford University Press Inc., New York © Andreas Wagner 2011 The moral rights of the author have been asserted Database right Oxford University Press (maker) First published 2011 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above You must not circulate this book in any other binding or cover and you must impose the same condition on any acquirer British Library Cataloguing in Publication Data Data available Library of Congress Cataloging in Publication Data Data available Typeset by SPI Publisher Services, Pondicherry, India Printed in Great Britain on acid-free paper by CPI Antony Rowe, Chippenham, Wiltshire ISBN 978-0-19-969259-0 (Hbk.) 978-0-19-969260-6 (Pbk.) 1 3 5 7 9 10 8 6 4 2
If you want to have a good invention, have a lot of them.
Attributed to T.A. Edison
Acknowledgments
Research is a social endeavor. The research leading to this book is no exception. The book’s bibliography comprises almost 900 items. This large number reflects the size of my debt to a community that has accumulated the knowledge on which I build. And still, the bibliography is not complete. Any attempt at being exhaustive would have led to a tome many times the current size. Please accept my apologies if your work is not cited here. I did not omit it for willful negligence, but to keep the exposition focused, for the benefit of the non-expert reader. A significant portion of the book relies on research by PhD students and postdocs in my own laboratory over more than ten years. Their work is cited throughout. No one person could assemble a body of work this size in such a limited time. I am in great debt to my co-workers, not least because of the trust they placed in an unorthodox research program
based on innovation. Special thanks also go to my collaborator Olivier Martin. His expertise has been instrumental in analyzing the structure of large genotype spaces. Allan Drummond, Angela Hay, Miltos Tsiantis, Danny Tawfik, and Nobuhiko Tokuriki have provided illustrations. Several trusted colleagues reviewed individual chapters of this book. They include Homayoun Bagheri, Peter and Rosemary Grant, Lukas Keller, Marcelo Sánchez, and Daniel Segrè. Thanks to all of them, as well as to Ian Sherman and Helen Eaton for their editorial work. Finally, Johannes Jaeger and Alessandro Minelli, as well as an anonymous reviewer, went far beyond the call of duty and read and critiqued the entire volume. I followed most of their advice, which helped improve the book considerably. Where I decided otherwise, it may have been for the worse and only I am to blame.
Contents
Acknowledgments
1 Introduction
2 Metabolic innovation
3 Innovation through regulation
4 Novel molecules
5 The origins of evolutionary innovation
6 Genotype networks, self-organization, and natural selection
7 A synthesis of neutralism and selectionism
8 The role of robustness for innovation
9 Gene duplications and innovation
10 The role of recombination
11 Environmental change in adaptation and innovation
12 Evolutionary constraints and genotype spaces
13 Phenotypic plasticity and innovation
14 Towards continuous genotype spaces
15 Evolvable technology and innovation
16 Summary and outlook
References
Index
CHAPTER 1
Introduction
The history of life is a history of innovations. We are all familiar with countless examples, but are there principles behind them? Is there a property that facilitates innovations, regardless of their physical manifestation? I here argue that the answer is yes, and I characterize this property—I will call it innovability.
Innovations everywhere
Every macroscopic organism has visible traits that were dramatic, transformative innovations when they first became fully formed. They changed not only organismal lifestyles, but also the future evolutionary path of life. Examples include plants with flowers, animals with a hard skeleton, birds and insects with wings, organisms living in groups, and, most fundamentally, multicellularity itself. Others include teeth to digest hard foodstuffs, vascular systems of plants and animals, syringes to deliver venoms, the endosperm storage tissues of seeds, and the silk production of arthropods [807]. Underneath this surface of macroscopically visible innovations is a universe of microscopic and submicroscopic innovations. Ultimately, they are the basis of all macroscopic innovations. An example is oxygen-producing photosynthesis. It originated with light-harvesting molecules that can split water to produce oxygen, and with mechanisms to incorporate carbon dioxide into biomass. By allowing oxygen to accumulate in the atmosphere, it changed not only the entire geochemistry of the planet, but also the future trajectory of life [410]. It permitted the macroscopic innovations of higher plant life, and ultimately supports most of the 1000 billion tons of biomass that exist today on earth [229]. Other similarly profound innovations involve the ability of organisms to thrive on unusual (for
us) food sources, such as minerals, natural gas, or crude oil; the ability to synthesize keratins, a critical component of the outer covering of many animals, such as the scales of reptiles, the feathers of birds, and the hairs of mammals; the ability to incorporate gaseous nitrogen—an otherwise growth-limiting element for many plants—directly into biomass; the origin of myelin, an electrical insulator that allows mammalian neurons to conduct electrical signals efficiently, and that may have promoted the evolution of complex brains [264, 620, 667]. It may be difficult to define rigorously what an evolutionary innovation is [538, 616]. However, these and countless other examples show that it is usually easy to recognize: a new feature that endows its bearer with qualitatively new, often game-changing abilities. These may not only mean the difference between life and death in a given environment (just think of biosynthetic abilities), they may also create broad platforms for future innovations, as did the innovations of photosynthesis and of complex nervous systems.
Towards a theory of innovation
During Charles Darwin’s era, molecular innovations were inaccessible to science. In his theory of evolution by natural selection, Darwin thus focused on complex macroscopic innovations, such as our eyes, “organs of extreme perfection and complication,” which he acknowledged as potential difficulties for his theory [Ch. 6 of ref. 162]. At the same time Darwin emphasized his conviction that such complex innovations could evolve from simpler antecedents through gradual variation that is preserved by natural selection. Since then, eyes have become a textbook example of evolutionary innovation. We now know that they have evolved multiple times independently [213].
While Darwin’s theory rightly emphasized the role of natural selection in preserving useful variation, it left untouched the question how new and useful variation originated. As the geneticist Hugo de Vries put it in 1904 [170], “Natural selection may explain the survival of the fittest, but it cannot explain the arrival of the fittest.” This question about the origins of new things is still fundamentally unanswered. What is it about life that allows innovation through random changes in its parts? This ability becomes especially striking when we contrast it with the properties of most man-made, engineered systems. Would random changes in a typical complex technological system, say, a computer or an airplane, be a sensible recipe to improve the system? Hardly. There is something special about the architecture of life that makes it amenable to improvement through random change. This something is the subject of this book. I here provide evidence that it is more than a combination of natural selection and random change. Both are necessary but not sufficient for innovation. A deep understanding of innovation would have been inaccessible in Darwin’s time. First, he and his contemporaries knew little about the nature of inheritance—they were ignorant of the nature of the inherited material, how it is transmitted between generations, and how it changes over time. Second, they also knew nothing about the molecular events that are key to innovation in general (and to eyes in particular, such as the evolution of photoreceptors and of lens proteins). These events are changes in the interactions among biological molecules, as well as in the molecules themselves. 150 years later, we are in a completely different position. We understand the nature of genotypes, the genetic material (DNA or RNA) of organisms. In addition, we have amassed much information concerning the structure and function of biological molecules, and how they change over time. 
We are also beginning to understand the interactions between these molecules, and the large molecular networks that they form. These molecules and networks together ultimately determine all observable characteristics of organisms, their phenotype. Many evolutionary innovations have been studied individually in great detail. They provide fascinating case studies of natural history. However, no
number of case studies can add up to the deeper and general insights that would answer how organisms can innovate. Case studies cannot provide the general perspective needed to answer this question. By themselves, they are a heap of observations without a principle that unifies them. This principle could only come from an overarching explanatory framework for evolutionary innovations. One might call such a framework a theory of evolutionary innovation, an “innovability theory.” I argue here that if such a theory exists, the concepts I discuss will be its necessary (and perhaps sufficient) building blocks.
What must a theory of innovation accomplish?
At first sight, the very search for such a theory may seem utterly quixotic, even absurd. Does not the very nature of innovation defy prediction, and is not one main purpose of a theory to predict? An analogy with Darwin’s theory is instructive. Darwin explained many apparently unrelated phenomena with the key organizing principle of his theory, natural selection. Although population genetics and quantitative genetics have seen limited success in predicting evolutionary trajectories, such prediction is elusive for many real-life evolutionary processes. Even small, biochemically well-understood molecules with completely known genotypes, evolved in the laboratory under minutely controlled chemical conditions, can take surprising evolutionary turns [317, 379, 380]. Yet such unpredictability does not cast Darwin’s theory into doubt. The theory may fail to predict any one phenomenon, but it succeeds in organizing a myriad disparate phenomena. It has value as a unifying framework. Similarly, a theory of innovation may have little to say about any one specific innovation. Instead, it may provide an explanatory framework for innovations in general. It would fit the definition of a theory as a “small body of general principles that work together to explain a large number of empirical observations, often by describing an underlying mechanism common to all of them” [Ch. 5 of ref. 657]. It can be powerful in its generality, without trivializing the individual innovation and its marvelous uniqueness. Here is a minimal list of what a theory of innovation should accomplish.
1. (The paramount problem.) It should explain how biological systems can preserve existing, well adapted phenotypes while exploring myriad new phenotypes. This is perhaps the most fundamental challenge of biological innovation, because destroying the old before finding something new and better may spell death. In addition, finding an innovation may require exploration of many different inferior phenotypes, before a new and superior phenotype is uncovered. The inventor Thomas Alva Edison’s adage “if you want to have a good invention, have a lot of them” holds for organisms, perhaps even more so than for human inventors.
2. It should unify innovations that involve different levels of biological organization. Some innovations are caused by new molecules with new structures and functions; others are caused by regulatory changes, for example, in the expression of molecules; yet others occur through combining existing molecules into new pathways. A theory of innovation should be general enough to accommodate different kinds of innovation. It needs to be dissociated from any particular substrate of innovation, but it must apply to each such substrate.
3. It should be able to capture the combinatorial nature of innovation. Biological systems have parts that are elementary units of system function. They include the amino acids that compose proteins, the enzymes that compose metabolic pathways and networks, and many others. Innovation usually involves new combinations of these “modules” and of higher order units of organization. Because combinatorial change is at the heart of many innovations, it must also be central to a theory of innovation.
4. It should be able to capture that the same problem can be solved by different innovations. Innovations can be viewed as solutions to a problem an organism faces. In the history of life, these problems have often been solved multiple times, and in very different ways. Examples include the evolution of image-forming eyes, tetrapod wings, aerobic respiration, and carbon fixation [807]. For instance, the last problem is that of incorporating inert atmospheric CO2 into biomass. It has been solved through the Calvin–Benson cycle, the reductive citric acid cycle, and the hydroxypropionate cycle, in quite different ways [661].
5. It should enable us to study how environmental change influences innovability. The environment determines whether any one novel phenotype is an innovation. Some aspects of an organism’s environment may be constant; others may change rapidly or slowly, predictably or unpredictably. We do not know whether these differences can affect the rate of innovation. Because environments can change in many different ways, universal answers may not exist. However, a theory of innovation must at least provide a framework to study this environmental influence.
6. It should be applicable to technological systems. A theory of innovation dissociated from any concrete material substrate should also apply to non-biological systems. In doing so, it might help develop technologies that can use evolutionary principles to accelerate innovation.
What kind of information does a theory of innovation need?
I stated earlier that a deep understanding of innovation was inaccessible at Darwin’s time, because essential information was lacking. What is this essential information? In my view, it has at least four elements. The first element is a systematic and comprehensive understanding of genotypes. Ultimately, evolutionary innovations are caused by genotypic change, change in DNA or RNA molecules. (An apparently contradictory view holds that innovations begin with phenotypic change. I discuss this view in Chapter 11, where I argue that the contradiction is more apparent than real.) Our ability to understand genotypes is becoming nearly limitless with technologies to sequence entire genomes in single experiments. Genotypes can be organized into vast genotype spaces. Albeit astronomically large, these spaces have countably many member genotypes, which allows their systematic analysis. The second element is a systematic and comprehensive understanding of realistically complex phenotypes. The phenotypes of biological systems range from molecular phenotypes, such as protein structures,
all the way up to macroscopic phenotypes, such as the body plans of organisms. Each level of organization, from molecules to whole organisms, can have astronomically many different phenotypes. I write “realistically complex” with this observation in mind. Innovations create new complex phenotypes from existing complex phenotypes. To define a comprehensive “phenotype space” that is also amenable to systematic analysis is an enormous challenge. It is obvious how to meet this challenge for some molecular phenotypes, such as protein structures, but unclear for others. The more comprehensive our understanding of phenotype is on any level of organization, the better our chances for a framework that can apply to all innovations on this level. The third element is the ability to link phenotype to genotype. Innovations originate with a genotypic change whose effects translate into a phenotypic change. Thus, if we do not understand how exactly genotypic change maps into phenotypic change, we cannot hope to develop a comprehensive explanatory framework for innovation. The link between them can be provided through experiments, through comparative data, or through computational and mathematical modeling. The fourth element is an understanding of population-level processes. Any evolutionary process involves populations of reproducing objects with heritable differences that affect their reproductive success [452]. Population-level processes can limit or enhance the efficacy of natural selection [402, 476] and they affect the exploration of novel phenotypes, depending on factors such as population sizes and mutation rates. They thus also affect the emergence of innovations. Any theory of innovation needs to take them into consideration. Fortunately, these processes are well-understood, largely to the credit of the “modern synthesis” of evolutionary biology that brought forth population genetics in the early twentieth century [503]. 
Which areas of biological knowledge already fulfill these four requirements? The answer to this question might help identify the nuclei of a theory of innovation. I will next discuss what many would consider the best candidate areas. Unfortunately, neither of them currently has all of the above elements.
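To give a sense of scale for the genotype spaces mentioned under the first element, the following minimal sketch counts the sequences of a given length over a given alphabet. The alphabet sizes (4 nucleotides, 20 amino acids) are standard. The sequence length of 100 is an arbitrary illustrative assumption, and the function name `genotype_space_size` is hypothetical rather than taken from any library.

```python
def genotype_space_size(alphabet_size: int, length: int) -> int:
    """Count the distinct sequences of `length` letters over an alphabet.

    Each position can hold any of `alphabet_size` letters independently,
    so the count is alphabet_size ** length. The result is finite, and
    every genotype can in principle be enumerated.
    """
    return alphabet_size ** length


def order_of_magnitude(n: int) -> int:
    """Return the exponent k such that 10**k <= n < 10**(k + 1), for n >= 1."""
    return len(str(n)) - 1


# Illustrative assumption: sequences of length 100.
LENGTH = 100
rna_space = genotype_space_size(4, LENGTH)       # RNA or DNA sequences
protein_space = genotype_space_size(20, LENGTH)  # amino acid sequences

print(f"RNA genotypes of length {LENGTH}: about 10^{order_of_magnitude(rna_space)}")
print(f"Protein genotypes of length {LENGTH}: about 10^{order_of_magnitude(protein_space)}")
```

Even for these short sequences the counts are far too large to list exhaustively, which is why systematic analysis of such spaces relies on sampling and computation rather than enumeration.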
Population genetics and evolutionary developmental biology
Population genetics and quantitative genetics constitute a body of quantitative evolutionary theory that emerged decades after Darwin with the modern synthesis [503]. Through its ability to handle a potentially infinite number of genotypes [310, 402] this body of theory is well-equipped to encapsulate the richness of possible genotypes of DNA and RNA sequences (element 1 above). Population processes (element 4) lie of course at its very heart, and pose no problem for it. However, models in population genetics and quantitative genetics usually make simple assumptions about the relationship between genotype and phenotype. By design, they thus make phenotype easy to predict from genotype (element 3). Unfortunately, this raises a serious problem with the phenotypic complexity (element 2) represented in these models. Many population genetic models, for example, represent phenotype only through a (scalar) fitness. Even quantitative genetic models that consider multivariate quantitative traits represent complex phenotypes only through correlations among each dimension of such a trait. They can thus only capture statistical dependencies among the constituents of complex phenotypes. These representations are not well-suited for realistically complex phenotypes. There is a world of difference, for example, between this statistical representation of a phenotype, and the myriad numbers of possible protein structures. The latter are best represented as atomic coordinates for amino acids, and not as correlations among quantitative characters. For this reason, population genetics and quantitative genetics are missing a critical element. While clearly necessary for a theory of innovation, they are not sufficient, at least in their current state. In contrast to population genetics, evolutionary developmental biology tackles complex phenotypes head-on.
Its phenotypes are the most complex phenotypes of all, the macroscopic phenotypes— tissues, organs, body plans—of multicellular organisms. Evolutionary developmental biology has elucidated many beautiful and fascinating examples of phenotypic change and its roots in genotypic change. We will encounter some of them below. Nonetheless, the very complexity of organismal phenotypes presents two problems. The first of
them, perhaps less serious, regards the systematic account of phenotypes a theory of innovation would require (element 2 above). It may be possible to overcome this problem, for example, through concepts such as the morphospace of paleontologists [507, 634]. The second problem, however, poses a more formidable obstacle. Given how complex the phenotypes are that developmental biology is studying, and how many genes contribute to them, we currently are unable to determine phenotype from genotype for them. Thus, element 3 above is missing. Hundreds of genes may influence even the simplest organismal phenotypes, such as the shape of a bacterial cell wall or the structure of a human hair. The fact that organismal phenotypes are not static but unfold over time, and that they are intricately organized in space, would further complicate the task of linking genotype to phenotype. In addition, the understanding of population-level processes (element 4) is less advanced in developmental biology. In sum, these considerations show that the phenotypes of population genetics are still too simple, and those of developmental biology still too complex to become part of a theory of innovation, given our current knowledge. They also show that key in trying to develop a theory is to find a middle ground: Phenotypes that are sufficiently rich to capture the astronomical diversity of actual phenotypes, yet manageable enough to understand how genotypes translate into phenotypes, while at the same time being crucial for many different kinds of innovation. This book revolves around such phenotypes.
Three classes of tractable phenotypes are involved in most innovations
I will next discuss three broad and very different classes of biological systems in which innovations occur. Here and elsewhere, I view a system as a set of elements or parts that cooperate to perform a task. An example is a protein whose parts—amino acids—cooperate to catalyze a chemical reaction. The phenotype of a protein is the three-dimensional folded structure it assumes, and the biological function it performs. Together, all proteins constitute a system class. The three classes of systems central to innovation are large metabolic networks, regulatory circuits, and macromolecules—proteins and RNA. They correspond to three broad classes of innovations that have played a key role in the history of life: innovations involving new metabolic pathways, involving new patterns of gene activity in regulatory circuits, and involving new molecules. Most innovations in macroscopic traits can ultimately be traced to molecular innovations in these three system classes, or to combinations thereof. Any fundamental shared principles they reveal may thus apply to the complex, macroscopic phenotypes of developmental biologists, once we will be able to study these phenotypes with the same level of rigor. These three classes of systems are also special for a different reason: We have either massive amounts of empirical data linking genotypes and phenotypes, or we can predict phenotype from genotype. In other words, they fulfill the above requirement 3 for a theory of innovation. Admittedly, predictions of these phenotypes are far from perfect. They involve mathematical modeling based on limited empirical data, and a heavy dose of computation. This holds especially true for the kinds of analyses I will discuss here, analyses needed to study new phenotypes systematically. They require us to map thousands to millions of genotypes to their phenotypes. Experimental genotyping on this scale is routine, but analysis of this many phenotypes is still difficult. Until such phenotyping becomes possible, computational approaches remain essential. They may not be sufficiently good to predict any one phenotype with very high accuracy, but they are sufficient to tackle the broad questions a theory of innovation needs to answer. They give us a place to start.
Innovation through metabolism
Innovation often arises through combining enzymes—or, more specifically, enzyme-coding genes—into new metabolic pathways. Such pathways can make new energy sources available to the organism, or they can synthesize new compounds useful for self-defense, protection, and communication. A new metabolic pathway can mean the difference between life and death, either by allowing its carrier to subsist on new food sources, defend itself against an enemy, or survive in a hostile environment.
Figure 1.1 Degradation of pentachlorophenol as a metabolic innovation. Shown are four enzymatic steps in the degradation of pentachlorophenol. The enzymes written in light gray type have probably been recruited to pentachlorophenol degradation from pathways that are involved in the degradation of naturally occurring chlorophenols, such as 2,6-dichlorophenol, which are produced naturally by some fungi and insects. The last reaction leads to cleavage of the aromatic ring shown. The reactions marked with dark gray arrows are catalyzed by maleylacetoacetate isomerase, an enzyme involved in the degradation of phenylalanine and tyrosine in some organisms, including some bacteria, fungi, and humans [141].
Microbial metabolism provides dramatic examples of how novel combinations of enzymes, and the reactions they catalyze, can lead to innovations. Microbes can use a bewildering variety of substances as food, including many man-made compounds not known to occur in nature [620]. For instance, microbial isolates from pristine soils that have been minimally exposed to humans can use several antibiotics as sole carbon sources, including fully synthetic compounds such as ciprofloxacin [160]. Microbes also thrive on many xenobiotic substances of industrial importance, such as polychlorinated biphenyls, a highly toxic, now banned class of industrial compounds [638]; chlorobenzenes, organic solvents [796, 797]; or pentachlorophenol, a synthetic pesticide first produced in 1936 [126, 141]. Just take the last chemical. Not known to occur in nature, pentachlorophenol can nonetheless be digested by the bacterium Sphingomonas chlorophenolica. The necessary metabolic pathway involves four reactions that this organism assembled, using enzymes that process naturally occurring chlorinated chemicals, as well as an enzyme involved in tyrosine metabolism [141] (Figure 1.1). In microbes, horizontal gene transfer is an extremely effective and abundant way of creating such new combinations of reactions. A second example regards halophilic bacteria and algae, some of which can survive in saturating salt concentrations of 30 percent, or even in fluid inclusions of growing salt crystals. In contrast, drinking seawater with its paltry 3 percent of sodium chloride kills many other organisms [620]. Several complementary strategies allow halophilic bacteria to survive in such high salt concentrations [75, 179, 180, 620]. One of them involves the production of “compatible solutes,” such as ectoine or glycine betaine. These substances stabilize proteins to keep them functioning, and they neutralize the high external osmotic pressure caused by high salt concentrations. Ectoine and glycine betaine are produced by a short chain of reactions that starts from ubiquitous molecules, such as the amino acid aspartate. Yet another metabolic innovation occurred in the origin of oxygen-producing photosynthesis. Although the evolution of photosynthesis was not a one-step process [661, 869], one associated key innovation was the evolution of the light-harvesting pigment chlorophyll. Chlorophylls are tetrapyrrole compounds like heme and vitamin B12, whose biosyntheses share many features. Microbes provide the most dramatic examples of metabolic innovations, because they exchange genes so readily. However, such innovations also occur in higher organisms. A case in point regards the detoxification of ammonia, a waste product of animal metabolism. Water-living animals can excrete it directly into the water, but land-living organisms cannot do so. To avoid poisoning themselves, they convert it into a less toxic compound for excretion. Many do so through the production of urea, made possible by another metabolic innovation, the urea cycle (Figure 1.2). The urea cycle illustrates a key theme of metabolic innovations: The individual reactions are not necessarily new, but their combination is. The urea cycle arose when a set of four reactions involved in arginine biosynthesis combined with arginase, a reaction involved in arginine degradation [753]. All the reactions involved are widespread in both prokaryotes and eukaryotes [753].
Figure 1.2 The urea cycle as a metabolic innovation. The figure shows the urea cycle, whose enzymes are expressed in the mammalian liver. They serve to convert ammonia into urea, which can be excreted in liquid form. The four reactions marked with light gray arrows constitute an arginine biosynthesis pathway, and are expressed in various tissues other than the liver for arginine biosynthesis. The reaction marked in dark gray is the first reaction involved in arginine degradation. The five reactions occur in many different organisms from prokaryotes to human, and are thus not themselves mammalian innovations [753].
In sum, metabolism provides a treasure trove of innovations, new metabolic abilities that have enabled new lifestyles. They often arise through new combinations of enzymes that already exist in other organisms. They have played important roles in the earliest history of life, and continue to play such roles to this day. As we shall see, metabolic phenotypes are one kind of complex phenotype that can help understand innovation in a principled way.
Innovation through regulation

For my purpose, I define regulation as a process that changes the abundance or activity of a gene product at a particular time and place (but does not change its encoding DNA, RNA, or amino acid sequence). Examples include changes in the rate of transcription of a gene into its RNA product, the rate of translation of a messenger RNA into protein, or the modification of a protein that changes the protein’s activity, for example, through phosphorylation.

Suggestions that regulatory changes play an important role in evolution date back many years. In 1975 King and Wilson, for example, noted the small amount of sequence divergence between humans and chimpanzees (≈1 percent). Homologous proteins in these two species, so their argument went, are too similar to explain what makes us human. Thus, they argued, changes in regulatory DNA that affect gene expression are responsible for these differences [404]. Many traits that distinguish us from chimpanzees are evolutionary innovations. They include bipedalism and qualitatively new cognitive abilities, such as symbolic communication. And while many researchers continue to search for genetic changes responsible for these species differences, others have focused on innovations outside primates [125, 190, 376, 397, 558, 867]. As a result, we now have plenty of candidates for innovations that involve regulatory change. Several examples follow.

Some butterflies use an ingenious innovation to scare off would-be predators many times their size [736–738]. Their wings harbor spots that resemble the eyes of animals much larger than their predators (Figure 1.3a). A display of these eyespots is a bluff that may save the butterfly’s life when it is attacked. (Eyespots belong to a much larger class of color-patterning innovations. Such patterns often serve to inform friends or deceive foes.) In developing butterfly larvae and pupae, eyespots form in a prospective wing region called the eyespot focus. One feature that distinguishes eyespot foci from their surrounding tissue is the expression of a key regulatory molecule, the transcription factor Distal-less [86]. In many animals Distal-less plays a role in the development of several body structures, including legs and wings [104]. Its expression in the eyespot focus is an early key event that demarcates the eyespot (Figure 1.3). Even though butterflies vary in the numbers and positions of eyespots, Distal-less is expressed in all eyespot foci. Conversely, grafts of Distal-less-expressing eyespots to developing wing tissue suffice to cause eyespot formation in the recipient tissue [86]. Other regulatory molecules are also expressed in eyespot foci [396], and some of them in turn drive surrounding cells to produce the pigments that give eyespots their striking appearance. We may never know whether one of these regulators or Distal-less first changed their expression in the origin of eyespots. However, the key point is that a change in the expression of one or more already existing molecules is critical to form these defensive innovations.

Figure 1.3 Butterfly eyespots as regulatory innovations. (a) Butterfly eyespots on the wings of the moth Automeris io (image from http://commons.wikimedia.org). (b) Eyespots on the ventral surface of the forewing (upper) and hindwing (lower) of the butterfly Bicyclus anynana. From figure 3 of [54]. (c) Distal-less expression in a B. anynana hindwing imaginal disc (seven small white spots in upper-left panel), the larval structure from which wings form. Distal-less expression is visible in seven spots that correspond to the future position of seven eyespots on the adult hindwing (upper-right). Distal-less expression during development of the Cyclops mutant (lower-left) occurs in a single stripe corresponding to the sole eyespot that will form in this mutant (lower right). From figure 3 of [86], used with permission from Nature Publishing Group.

The lenses of vertebrate eyes are marvelous innovations [431]. They are able to form images with minimal aberration, the distortion of an image as light passes through a lens. The materials responsible for this ability and for a lens’ glassy transparency are crystallins. They comprise a class of proteins with multiple functions elsewhere in the body [612], many of them enzymes. What unites them is that they can be highly expressed while remaining soluble and transparent. These properties make them ideal materials for eye lenses. Regardless of their function elsewhere in the body, regulatory mutations have caused them to be highly expressed in the lens.
Many crystallins have undergone gene duplication, but non-duplicated crystallins also exist. They include ε-crystallin, which is the same molecule as lactate dehydrogenase, and τ-crystallin, which is the same molecule as α-enolase [611, 612, 781]. In such non-duplicated crystallins changes in regulatory DNA regions have allowed enhanced gene expression in the lens.
The lenses of water-living animals face a particularly stiff challenge [431]. To bend light’s trajectory, lenses take advantage of the difference in refractive index as light passes from one medium to another. In land-living animals, light passes from air into the water-rich biological tissue of the lens. But in water-living animals light already travels through water, so their lenses cannot take advantage of the air–water difference in refractive index. Lenses of water-living animals thus need to bend light much more strongly than those of land-living animals, and they suffer greater aberration. To minimize this aberration, fish and squid have lenses with a graded refractive index. Their lenses are built of many onion-like layers. Central layers have a higher refractive index (higher crystallin concentration). Peripheral layers have a lower index. This lens architecture allows high power with little aberration. Regulatory mutations are key to achieving it [431, 611, 612, 747, 781]. In sum, regulatory changes in the expression of existing proteins are responsible both for the existence of transparent lenses and for their sophisticated fine structure.

Some plant leaves are simple in shape, others are highly complex or dissected, consisting of multiple small leaflets (Figure 1.4a). The first flowering plants most likely had simple leaves [60]. Leaf dissection is an innovation that can serve many purposes, among them to prevent leaf overheating in hot environments, and to increase CO2 uptake in water [275, 299]. The developing leaflets of most flowering plants with complex leaves show a marked increase in the expression of KNOX (KNOTTED1-like homeobox) transcription factors [60] (Figure 1.4b). This association is causal, as shown in the lamb’s cress Cardamine hirsuta, which has dissected leaves. Reducing the activity of KNOX genes severely impairs leaflet formation, whereas an increase in their expression is sufficient to produce additional leaflets [316].
Thus here again, a change in the expression of regulatory molecules is closely associated with an evolutionary innovation.

Figure 1.4 Leaf dissection as a regulatory innovation. (a) A simple and a dissected leaf; (b) left panels: dissected leaves of Cardamine hirsuta (top) and simple leaves of Arabidopsis thaliana (bottom); right panels: accumulation of class 1 KNOX proteins in the primordia of dissected leaves of C. hirsuta (top), but not of A. thaliana (bottom), as revealed by antibody staining (dark spots in right panels) of KNOX proteins [316]. The central region, marked with an asterisk, is the shoot apical meristem, from which the shoot forms, and which shows KNOX expression in both species [316]. The enclosed areas, two of which are indicated by arrows, indicate initiating leaf cells of leaf primordia, which do not show KNOX expression in either species. The exceptions are the dark-staining small regions within C. hirsuta indicated by arrowheads, which correspond to initiating leaflets where KNOX proteins are expressed. After figure 1 of [316], used with permission from Nature Publishing Group.

Many animals use highly specialized body parts as tools to access food. The availability of a tool can have dramatic consequences on the animal’s survival probability in times of food scarcity. One such tool is a bird’s beak. Beaks come in many shapes and sizes. They range from the long and narrow hummingbird beak, specialized to access deep and narrow flowers, to the wide and squat beak of seed-crushing birds. The beaks of Darwin’s finches on the Galapagos and Cocos islands provide well-studied and diverse examples [288]. These finches include cactus finches, such as Geospiza scandens with long, pointed beaks, specialized in feeding on cactus flowers or the insects therein, and ground finches, such as G. magnirostris, capable of crushing hard and large seeds. Beak-shape differences have great adaptive significance. For example, only the largest ground finch G. magnirostris can feed on the largest occurring seeds in its habitats, because only its beak can exert the necessary force to crush them [288; Ch.6]. Other ground finches are restricted to smaller seeds in their diet. During some periods of drought, when small seeds can get depleted quickly, only large and hard seeds remain on the ground [287].

Recent studies linked a change in the expression of two regulatory molecules to beak shape and size. One of them is bone morphogenetic protein 4 (bmp4), a signaling protein with a role in skeleton and jaw development [44, 383, 572]. Its expression is highest in the developing deep and wide beaks of seed-crushing species [4]. The other protein is calmodulin, a signaling protein that mediates signals of changing calcium concentration to a variety of proteins. Calmodulin is most highly expressed in the developing elongate beaks of cactus finches. When bmp4 expression or calmodulin-mediated calcium signaling are artificially increased in chicken embryos, the embryos’ beaks change shape in the same way as they do among different finch species [3, 4]. Bmp4 and calmodulin expression thus probably play a causal role in changing beak shape.
The last example may seem different from the preceding ones. On the one hand, it was not about a qualitatively new feature (presence or absence of an eyespot), but a quantitative modification of an existing feature (beak shape). From this perspective, it may not seem like the qualitative change an innovation requires. On the other hand, if only one kind of beak—exemplified by that of G. magnirostris—can crush the hardest seeds, having this beak will make a qualitative, life-preserving difference whenever only hard seeds are available. From this perspective, the beak can be viewed as an innovation. I included this example on purpose, to remind us of an oft-overlooked fact: Many innovations, when examined closely, are of this kind [528], although it is usually not obvious from the final product. They fall into a large gray area between unambiguously qualitative and merely quantitative phenotypic change. However, one key feature unites all examples in this section: they revolve around regulatory change.
Innovation through new molecules

It is difficult to know even where to begin. Every single one of thousands of highly specific enzymes in our body was a molecular innovation when it first arose, many millions of years ago. The same holds true for other proteins and RNA molecules that are involved in metabolism, development, mechanical support, and communication. Innovations in them arise through mutations of individual nucleotides and recombination. They are facilitated by the functional promiscuity of some proteins (Chapter 11), and by gene duplications (Chapter 9) that can liberate the molecules from functional constraints [135, 368]. Here I will highlight only a few well-studied cases.

The first example shows how even the smallest possible change in a protein can lead to qualitatively new functions. It concerns the bacterial enzyme L-ribulose-5-phosphate 4-epimerase (L-Ru5P). This enzyme from Escherichia coli catalyzes the interconversion of L-ribulose-5-phosphate and D-xylulose-5-phosphate. The enzyme links arabinose metabolism and the pentose phosphate pathway. It allows bacterial cells to survive on arabinose as a carbon and energy source. The enzyme is a homotetramer with four identical subunits, one of which is shown in Figure 1.5a [475]. The active site of this enzyme includes a histidine residue at position 97, which is also shown in Figure 1.5a. A single mutation at this position from histidine to asparagine gives rise to a new catalytic activity, that of an aldolase (Figure 1.5b), while preserving the structure shown in Figure 1.5a [371]. Specifically, the mutant enzyme is able to join one molecule each of dihydroxyacetone phosphate and glycolaldehyde phosphate in a condensation reaction. There are other known (and probably many unknown) enzymes where single amino acid changes give rise to new catalytic activities [566, 849].

Isocitrate dehydrogenases (IDHs) are enzymes in the energy-producing citric acid cycle; β-isopropylmalate dehydrogenases (IMDHs) are their distant relatives that catalyze a reaction in leucine biosynthesis. Despite their common ancestry, these enzymes have very different biological roles. A key distinction between them is their use of cofactors. IDHs can use either nicotinamide adenine dinucleotide (NAD) or NADP, whereas IMDHs can use only NAD. Because NAD and NADP play very different roles in metabolism (providing electrons for ATP production and biosynthesis, respectively), the question of what causes this functional shift is intriguing. It turns out that fewer than ten amino acid differences are sufficient to dramatically shift the cofactor preferences of these enzymes [116, 171, 277, 474].

The next example revolves around the threats posed by freezing temperatures. When ice crystals grow, they kill cells. They incorporate the liquid water molecules that proteins need to function, and they slice through cell membranes [620]. Organisms that can survive this threat include arctic and antarctic fish, as well as overwintering terrestrial insects and plants. They have independently evolved a class of proteins called antifreeze proteins (Figure 1.6).
These proteins bind the surface of small ice crystals and prevent them from growing [118, 166, 246]. For example, many fish adapted to cold waters can survive ice-laden seawater at almost –2 °C, about 1 °C lower than the freezing temperature of their body fluids [246].
Figure 1.5 A single amino acid change can create a novel enzymatic function. (a) The structure of one subunit of the homotetrameric L-ribulose-5-phosphate 4-epimerase from Escherichia coli. A histidine residue (His97) in the catalytic site is highlighted. The structure is rendered from information in Protein Data Bank file 1K0W [475]. (b) Schematic drawing of the chemical reaction catalyzed by the epimerase shown in (a), as well as for a mutant with a single histidine to asparagine amino acid change at position 97 (after [566]). The mutant can catalyze a new aldolase reaction.
One important observation about this molecular innovation is that it occurred repeatedly: Antifreeze proteins fall into five classes [118] that show very little similarity in sequence or structure. Their ancestors are very different proteins, for example, serine proteases and chitinases. Arctic and antarctic fish have evolved antifreeze proteins with very similar sequences independently [115]. In addition, antifreeze proteins can evolve very rapidly. For example, the arctic glaciation, which drove antifreeze protein evolution in arctic fish, occurred less than 3 million years ago [691]. Sister species in the same genus Myoxocephalus (sculpins) have even independently evolved two different classes of antifreeze proteins [118]. Antifreeze proteins stand for a much larger class of innovations that occurred repeatedly, rapidly, from different ancestors, and sometimes with very different solutions to the same problem [661, 807]. They underscore how readily innovations can arise in living systems.

Figure 1.6 Antifreeze proteins. Antifreeze protein of (a) the longsnout poacher Brachyopsis rostratus, a benthic fish living off the northeast coast of Japan, and (b) the mealworm beetle Tenebrio molitor. Note the very different structures of the two proteins, which are rendered from Protein Data Bank files 2ZIB [560] and 1EZG [463].

The last three examples were ordered by the amount of change (from minimal to drastic) required for a new protein function. The next and last example illustrates again the sliding scale between a quantitative change in an existing phenotype, and the qualitative change characteristic of innovation. It requires minimal change in a protein, modifies an existing protein’s function, but can open a completely new habitat to an organism, and can thus set the stage for the conquest of new environments. At Mount Everest’s peak, the air contains only one-third of the amount of oxygen compared to sea level. Because oxygen is so limited, exercise becomes very strenuous at high altitudes. This is why many human high-altitude climbers need supplementary oxygen. The bar-headed goose (Anser indicus) does not have this luxury. This bird lives in central Asia and migrates over the Himalayas, at altitudes exceeding 10 kilometers. It is one of the highest flying birds known. The ability to migrate over a mountain range this high is an amazing adaptation that can greatly expand an organism’s habitat range. How does this bird do it? The answer is multi-faceted, but an important aspect regards oxygen transport [466, 531]. The bar-headed goose has a hemoglobin molecule with higher oxygen affinity than its lowland relatives. A proline to alanine substitution in one of the hemoglobin subunits is important for this change [277, 459]. It eliminates a key contact between the hemoglobin subunits, which shifts the equilibrium of hemoglobin towards a conformation that has higher affinity to oxygen.
Genotype networks and their history

The preceding three sections illustrate three classes of innovation in different kinds of phenotypes. Together, they form the basis of most evolutionary innovations. Some innovations may arise in a single large step, but many arise more gradually, through a series of changes with individually modest effects. Such innovations will typically involve hopelessly entangled changes in all three phenotypic classes. For example, a new metabolic ability may arise in an organism through the “import” of new enzyme-coding genes via horizontal gene transfer, together with changes in the regulation of metabolic enzymes already encoded in an organism’s genome, and the evolution of new enzymatic activities through mutations in existing genes. Similarly, complex macroscopic innovations, such as the evolution of new body parts, may involve changes in the regulation of multiple molecules, and the evolution of new molecules. Known macroscopic innovations are so complex that we do not yet understand all required changes for any one of them.

Despite these complexities, it is useful to keep these three classes of innovations conceptually separate, and not only because some innovations fall within a single class. Such separation allows us to ask whether different classes of biological systems have similarities relevant to innovation, even though they may differ in most other respects. I argue in this book that such deep similarities exist, and that they are key to understanding innovation. One important similarity is that their phenotypes can be organized into genotype networks. A genotype network is a set of genotypes that have the same phenotype. Genotypes in such a network are connected in the following sense: you can reach each genotype from any other by a series of small mutational changes, each of which leaves the phenotype unchanged. Each such small change affects only a single part of a genotype, such as one amino acid in a protein. (I will call two genotypes that differ in only a single part neighbors.)

All human understanding requires abstraction from the unfathomable complexity of the world around us. If one tries to understand a particular phenomenon, one needs to ask about the level of abstraction on which this phenomenon can be understood. In my view, the concepts of genotype spaces and genotype networks are the right level of abstraction to understand evolutionary innovation comprehensively and systematically. The reasons are spelled out throughout this book. I will explain the concept of a genotype network and its importance for innovation in much greater detail in later chapters. For now I will just say a few words about its history.
To my knowledge, the concept was first foreshadowed in a 1970 paper on protein spaces (now usually called genotype spaces or sequence spaces) by the late John Maynard Smith. The paper stated “. . . if evolution by natural selection is to occur, functional proteins must form a continuous network which can be traversed by unit mutational steps without passing through nonfunctional intermediates” [498]. Maynard Smith’s interest in this paper did not regard the origins of evolutionary innovation, but whether natural selection could plausibly lead to any functional proteins. He argued that this was the case for real proteins. The network Maynard Smith had in mind differed from genotype networks in this book in another important respect. His is not a network of proteins with the same phenotype, but of proteins with any phenotype (function). To understand innovation, however, it is important to distinguish between different phenotypes and the genotype networks each forms. We shall see that the organization of these networks in the space of all possible genotypes is important for innovation.

After Maynard Smith’s paper, it took another twenty years and considerable advances in computational technology before computational studies first showed that genotype networks may exist, at least for simple models of “coarse-grained” structural phenotypes that can be computationally estimated from genotypes. The genotypes of these studies [434, 464] were coarse models of amino acid strings that consist of only two types of amino acids, hydrophobic and hydrophilic amino acids. The phenotypes were geometric models of protein structure, where each amino acid occupies a different position on a regular geometric lattice. Lipman and Wilbur [464] showed that such a structure is typically adopted by a large number of genotypes. Many of these genotypes can be reached from one another through series of single amino acid changes that do not change the phenotype.

A few years later, unrelated work on RNA genotypes by Peter Schuster and his associates provided further support for the existence of genotype networks [688]. The phenotypes in this work were RNA secondary structures, the planar shapes that such sequences can adopt through internal base-pairing (more about them in Chapter 4). These authors showed that RNA molecules with the same secondary structure typically can have very different sequences. In addition, sequences with the same phenotype typically form large sets whose sequences can be reached from one another through a series of single nucleotide changes [687, 688]. Their work remains among the most detailed characterizations of a genotype space.
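The definition of a genotype network given earlier in this section can be made concrete with a small computation. The sketch below is only an illustration under stated assumptions: it is not the lattice protein model of [464] nor the RNA folding model of [688]. It borrows the idea of a two-letter genotype (H and P), but the phenotype map (the number of uninterrupted runs of H) and all function names are hypothetical, chosen because the whole genotype space can then be enumerated. The code groups genotypes by phenotype, splits each group into sets connected by single-letter changes, and computes the fraction of a genotype's neighbors that share its phenotype.

```python
from collections import defaultdict, deque
from itertools import product

ALPHABET = "HP"  # two-letter toy genotypes (assumption for illustration)


def phenotype(genotype: str) -> int:
    """Hypothetical phenotype map: number of maximal runs of 'H'."""
    runs, previous = 0, "P"
    for letter in genotype:
        if letter == "H" and previous != "H":
            runs += 1
        previous = letter
    return runs


def neighbors(genotype: str):
    """Yield all genotypes that differ from `genotype` in exactly one position."""
    for i, letter in enumerate(genotype):
        for alternative in ALPHABET:
            if alternative != letter:
                yield genotype[:i] + alternative + genotype[i + 1:]


def genotype_networks(length: int) -> dict:
    """Map each phenotype to its list of connected genotype sets."""
    by_phenotype = defaultdict(set)
    for letters in product(ALPHABET, repeat=length):
        g = "".join(letters)
        by_phenotype[phenotype(g)].add(g)

    networks = {}
    for p, members in by_phenotype.items():
        unseen = set(members)
        components = []
        while unseen:
            start = unseen.pop()
            component, queue = {start}, deque([start])
            while queue:  # breadth-first search within one phenotype
                for n in neighbors(queue.popleft()):
                    if n in unseen:
                        unseen.remove(n)
                        component.add(n)
                        queue.append(n)
            components.append(component)
        networks[p] = components
    return networks


def robustness(genotype: str) -> float:
    """Fraction of one-mutant neighbors with the same phenotype."""
    nbrs = list(neighbors(genotype))
    same = sum(phenotype(n) == phenotype(genotype) for n in nbrs)
    return same / len(nbrs)


if __name__ == "__main__":
    for p, comps in sorted(genotype_networks(6).items()):
        print(p, sorted((len(c) for c in comps), reverse=True))
```

In this toy map, each connected set returned by `genotype_networks` plays the role of one genotype network, and `robustness` measures how many neighbors of a genotype stay on it. Real genotype-phenotype maps are far larger and cannot be enumerated exhaustively in this way.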
Each of the last two lines of work was limited. It either considered only model proteins, or only partial phenotypes—RNA secondary structure phenotypes are necessary but not sufficient for RNA function. However, the concepts that emerge from this work remain important. This book shows that these concepts apply not just to model or partial phenotypes. In addition, they are important far beyond molecules like protein and RNA. They apply to different levels of biological organization, and can tie innovations on different levels together. The importance of this unifying power is hard to overstate. Such unification is essential for any comprehensive theory of innovation.
Neutral versus genotype networks

Schuster and collaborators coined the term “neutral network” for the genotype networks they studied [688]. “Neutrality” in their sense means invariance of a well-defined phenotype among all genotypes on a neutral network. The term “neutral network” is widely used; it is evocative, and has alliterative appeal. In evolutionary biology, however, neutrality has a different meaning: a change in a genotype that is invisible to natural selection, because it does not affect fitness (more about that neutrality in Chapter 7). Neutrality in the first sense does not imply neutrality in the second. To avoid confusion, I will thus use the term “neutral network” sparingly, and only where its meaning, in the first sense above, should be unambiguous from the context. Elsewhere, I will refer to “genotype networks.”

Most phenomena I will discuss do not require that the genotypes on the same genotype network have exactly the same fitness. For example, many mutations in proteins of well-studied organisms are deleterious, but weakly so [227, 676]. Such weakly deleterious mutations can rise to high frequency in a population by chance events (Chapter 7), or they can persist until other mutations arise that compensate for their deleterious effects and thus preserve them [393, 428]. They are no strong impediment to evolutionary change on one genotype network. Conversely, many mutations that increase fitness do so only very slightly, and their fate can be determined by the same forces that determine the fate of neutral mutations [310, 676].
In sum, because the term “neutral network” insinuates that its genotypes have the same fitness, it is too narrow for the purpose of studying innovation, and I will prefer the term “genotype network.”
Innovability versus evolvability or phenotypic variability

A few words are now necessary to motivate my use of the neologism “innovability.” Perhaps a more popular word, such as “evolvability,” might be a better choice? The most widely used meaning of evolvability is the ability to produce heritable phenotypic variation. Why, then, not just use this notion here, or simply “phenotypic variability”? The reason is that phenotypic variability can merely refer to quantitative variation in existing phenotypes (body height, thermotolerance, etc.). When studying innovation, however, qualitative variation becomes important. The approaches I use below to analyze different phenotypes all aim to distinguish such qualitative differences. We currently do not have a good word to refer to such qualitative differences. This is the main motivation for using a new word, innovability.

In addition, many authors use evolvability to describe some aspect of their study system. Unfortunately, the word’s meaning has thus become rather muddled by overuse. Moreover, evolvability has many aspects that I do not discuss here [264]. This is another reason to sidestep this word in the book.
Chapter overview

Each of the next three chapters will focus on one of the three main system classes important for innovation. Specifically, Chapter 2 will focus on metabolic systems, Chapter 3 will focus on regulatory circuits, and Chapter 4 will focus on protein and RNA molecules. Each chapter will provide evidence for the existence of genotype networks; it will also characterize these networks. The emphasis is on common features, not an exhaustive review. These features include that genotype networks typically have vast size, that they extend far through genotype space, and that the neighbors of different genotypes on any one such network form very different novel phenotypes. If you are not interested in the technical details of these chapters, you may wish to skip to Chapter 5, which summarizes and synthesizes information from the earlier chapters. It explains why these and other features are crucial for innovation, and probably were crucial since the origin of life. The chapter is as self-contained as I knew how to write it.

Chapter 6 concerns perhaps the most puzzling observation left unexplained by Chapter 5. It is that the three very different system classes from Chapters 2 to 4 have key commonalities that are important for innovation. The chapter shows that a simple fact is both necessary and sufficient for these commonalities: in all three system classes many neighbors of any one genotype G typically have the same phenotype as G itself. In other words, genotypes are to some extent robust to mutations. This chapter is the only mathematical chapter of the book, although the mathematics are elementary and used to make largely qualitative statements.

Taken together, Chapters 2–6 show that the framework I propose here accomplishes three of the five basic goals for a theory of innovation, including the most important goal: to explain how life can preserve the old while exploring the new. The remaining two goals are the subject of later chapters.

Subsequent chapters deal with several apparently disparate phenomena and problems in evolutionary biology, and show how the concepts of the earlier chapters allow us to unify them and resolve tensions between them. Some of the chapters summarize large bodies of work; others mainly outline directions for future research.

Chapter 7 regards the tension between selectionism and neutralism. Selectionism emphasizes the role of natural selection in evolution, whereas neutralism ascribes an important role to neutral change that is invisible to natural selection. The tension between them has permeated the field of molecular evolutionary biology at least since Motoo Kimura proposed the neutral theory of molecular evolution [402]. The chapter shows how we can resolve this tension.
It argues that neutral or nearly neutral changes may be frequent, but that most such changes will later become subject to selection. Such neutrality, albeit ephemeral, is indispensable for innovation, because it allows the exploration of novel phenotypes. Chapter 8 is about robustness, a biological system’s ability to preserve—in any one environment—
its phenotype under perturbations such as mutations [825]. The chapter first makes the elementary qualitative observation that robustness causes the existence of genotype networks (as shown in Chapter 6), and is thus essential for innovation. Quantitatively however, the relationship between robustness and innovability is more complex. On the one hand, and almost by definition, the more robust a genetic system is, the less phenotypic variation it produces in response to perturbation. From this perspective, robustness hinders innovability. On the other hand, both experimental and computational studies show that robustness can promote innovability in some system classes. The chapter resolves this tension and shows how robust phenotypes can promote innovation. Whether they do depends on details of genotype space organization for a system class. The closely related Chapter 9 focuses on gene duplication, a kind of mutation linked to dramatic innovations. The chapter shows that we can understand this link by considering that gene duplications are mutations which increase robustness in a particular way: Without destroying old phenotypes, they greatly facilitate the exploration of new phenotypes around a genotype network. Chapter 10 will discuss recombination, an important class of mutation that causes large-scale genotypic change. Recombination can be highly effective in producing novel phenotypes. However, it can also destroy existing, well-adapted phenotypes. I will show here that this destructive potential of recombination may be smaller than its creative potential. For example, molecules or regulatory circuits can be highly resilient to recombination, especially if they are exposed to it frequently. Recombination can help exploring the new without destroying the old. Chapter 11 returns to a key remaining question about robustness. Robustness brings forth genotype networks, but why does it exist in the first place, and why in such very different systems? 
The answer leads to the role of environmental change. I argue that coping with changing environments may require systems to increase their size, and that this increase in system complexity causes robustness in any one environment. In other words, the need to cope with environmental change has been a driving force behind the enormous complexity of present
day biological systems. This complexity entails robustness, which is behind the existence of genotype networks. The chapter also shows how genotype networks can help study the quantitative influence of environmental change on innovation: systems able to cope with multiple environments exist in intersections of multiple genotype networks, which affects their ability to innovate. Chapter 12 discusses constraints on phenotypic evolution, which are biases or limitations in the phenotypic variation a system produces. It shows that genotype networks are useful to understand and unify several apparently unrelated causes of such constraints. These causes emerge from an underlying “developmental” cause, the processes that produce phenotypes from genotypes. Chapter 13 focuses on phenotypic plasticity, a genotype’s ability to produce multiple phenotypes. Genotype networks can facilitate the origin of genotypes that have a novel phenotype in their plastic repertoire of phenotypes. The chapter also discusses genetic assimilation and related phenomena that may lead to the fixation of plastic phenotypes after their origin. When characterizing genotype networks, one usually represents both genotypes and phenotypes as discrete objects, which facilitates enumeration and comparison. Chapter 14 discusses systems that are best represented by continuously valued phenotypes and genotypes. Such systems
are a research frontier, because they have not been rigorously studied. What little we know, however, suggests that the main principles I described apply to them as well. Chapter 15 discusses technological systems. A general innovability theory should apply to both biological and technological substrates of innovation. This chapter shows one technological application. It focuses on reconfigurable hardware, a commercially important class of electronic circuitry whose internal wiring (“genotype”) can be altered to compute different functions (“phenotypes”). The chapter shows that such circuitry can display key features of biological systems. It suggests that the biological principles I explored earlier are transferable to technology, and may promote technological adaptation and innovation. Chapter 16 is a short summary of key points and an outlook on future challenges. Taken together, the material in Chapters 2–6 shows that the framework I propose here meets the above minimal requirements 1–3 for a theory of innovation. Chapter 11 shows that it meets requirement 4, and Chapter 15 shows that it meets the last minimal requirement, number 5. Although some data in this book come from previously unpublished research, most of what originated in my own research group is scattered throughout some 30 articles that are cited where appropriate.
CHAPTER 2
Metabolic innovation
Metabolic genotype: An organism’s enzyme-coding genes and the enzymes they produce.
Metabolic phenotype: An organism’s ability to synthesize biomass, produce essential molecules, and extract energy from a chemical environment.
In this chapter, I examine innovation in metabolic phenotypes. I focus on the most fundamental such phenotypes: those that allow an organism to survive in a given chemical environment. I will discuss how vast genotype networks, sets of metabolic genotypes with the same phenotype, facilitate the evolution of novel metabolic phenotypes. My discussion applies to all organisms, but especially to microbes, where metabolic network evolution is rapid and mediated through horizontal gene transfer.
Metabolic networks
The genome of a free-living organism encodes metabolic enzymes that catalyze most of the hundreds to thousands of chemical reactions needed to sustain life. These reactions convert food into biochemical building blocks or energy, build biomass out of light, air, and minerals, and produce chemicals that serve in storage, self-defense, communication, and other processes. Typically, each metabolic process involves the joint action of multiple enzymes. Traditional biochemistry has taught us to think of such processes as linear chains of reactions catalyzed by enzymes, or as simple cycles. Now that we have complete, or nearly complete, information about the metabolisms of well-studied organisms, we realize that it is better to think of them as highly reticulate metabolic networks [207, 208, 324, 562]. Many metabolic abilities of organisms surely were game-changing when they first arose. These include abilities I already mentioned, such as the
ability to synthesize a protective cell wall, to produce communication molecules, or to synthesize diverse storage compounds for times of nutrient scarcity [842]. The most fundamental of these is the ability to use new foods (new chemicals in the environment) to synthesize building blocks for cell or biomass growth. The chemicals in question may serve both as sources of energy and of essential building materials, especially of the elements nitrogen, phosphorus, sulfur, and carbon. Because carbon is the most abundant of these four elements, it is of central importance in this regard. Over its four-billion-year history, life, and in particular prokaryotic life, has evolved the ability to thrive on a myriad different carbon sources, however toxic they may be to humans. Chapter 1 already mentioned some astounding innovations in carbon metabolism, where organisms learned to feed on xenobiotic carbon sources that include industrial chemicals and antibiotics [126, 141, 160, 638, 796, 797]. Other elements can also be provided by a broad range of sources [195, 257, 375, 515, 588, 783]. Whether an organism has the ability to survive in an environment where only one nutrient or energy source occurs is a life-or-death matter. The ability to produce biomass from a given set of nutrients is thus perhaps the most fundamental requirement metabolism must fulfill. I will focus on it here, but much of what I say may apply to a broader spectrum of metabolic phenotypes. More specifically, I will focus on organisms that feed on organic nutrients, and on carbon metabolism, because of carbon’s centrality to life. Although I will refer to these nutrients as carbon sources, it should be understood that they are at the same time energy sources. What I say here about carbon metabolism also holds for other chemical elements [653].
Metabolic genotypes
The biosphere contains enzyme-coding genes whose products catalyze of the order of $10^4$ or more chemical reactions [571]. The genome of any one organism encodes enzymes that catalyze some of the reactions in this reaction “universe.” We can view this collection of enzyme-coding genes as an organism’s metabolic genotype. This genotype is ultimately a string of DNA, but representing it as such is not effective to study metabolism. More compact representations are necessary, representations that focus on the set of chemical reactions that the enzymes encoded by this genotype can catalyze. One simple such representation is a binary string whose length is the number of reactions in the known reaction universe (Figure 2.1). This string contains a “1” at position i if
the organism encodes a gene for reaction i, and a “0” if it does not. This representation is gene-centered, which reflects the fact that the gene and not the enzyme is the unit of metabolic evolution: only genes are subject to mutations that are passed from parents to offspring. Some enzymes catalyze multiple reactions, which can simply be represented through the gene encoding them. Conversely, some reactions are catalyzed by multiple enzymes; these can be represented through one of their enzyme-coding genes. This genotype representation is discrete, which is a great advantage if one wants to enumerate metabolic genotypes. Such enumeration is useful to develop the genotype network concept. It focuses on whether one or more reactions can be catalyzed at
[Figure 2.1. Panel (a), Genotype (determines metabolic reaction network): a column of stoichiometric equations, each marked 1 or 0. Panel (b), Phenotype (survival on food source): the sole carbon sources alanine, citrate, ethanol, glucose, melibiose, and xanthosine, each marked 1 or 0. An arrow labeled Flux Balance Analysis leads from (a) to (b).]
Figure 2.1 The concept of metabolic genotypes and phenotypes. (a) The metabolic genotype of a genome-scale metabolic network can be represented in discrete form as a binary string, each of whose entries corresponds to one biochemical reaction in a “universe” of known reactions. Individual entries indicate the presence (“1,” black type in stoichiometric equation) and absence (“0,” gray type) of an enzyme-coding gene whose product catalyzes the respective reaction. The binary string is as long as the number of known enzyme-catalyzed reactions, and for any one organism, only a small fraction of its entries may be equal to one. (b) Qualitatively, metabolic phenotypes can be represented by a binary string. The entries of this string correspond to individual carbon sources. The string contains a “1” for every carbon source (black type), for which a metabolic network can synthesize all major biomass molecules, if this source is the only available carbon source. Flux balance analysis (arrow) can be used to determine the metabolic phenotype from the genotype.
all, rather than on merely quantitative differences in the rates at which they can be catalyzed. In doing so, it allows us to focus on qualitatively new metabolic abilities (the ability to survive on new food sources) rather than on quantitative changes in existing abilities. We can think of all metabolic genotypes as forming a vast genotype space. Even with our current, limited knowledge of some $10^4$ metabolic reactions, the size of this space is astronomical. Given the genotype representation I use here, it contains of the order of $2^{10^4}$ metabolic network genotypes. Each metabolic network is a point in this space. It will be useful to have a measure of the distance between two metabolic genotypes (networks) in this space. The measure I will use is the number or fraction D of reactions that are catalyzed by one of the two metabolic networks but not by both. I will refer to D as the genotype distance. Two metabolic genotypes would have the maximal distance if they have no reactions in common. Conversely, two metabolic genotypes are neighbors if they differ in exactly one reaction, i.e., if one metabolic network catalyzes all the reactions that the other also catalyzes, except for one. The neighborhood of a metabolic genotype comprises all the metabolic networks that differ from it by one reaction [652]. More generally, one can define the k(-mutant)-neighborhood of a metabolic genotype as the set of metabolic networks that differ from it by k reactions.
By the end of the twentieth century, the first genome-scale metabolic networks of well-studied organisms had been thoroughly characterized [207, 208]. To characterize such a network is to assemble a list of enzyme-catalyzed chemical reactions known to occur in a given organism. These reactions are usually represented by their stoichiometric equations (Figure 2.1a). To characterize an organism’s metabolic network is easier if its genome has been completely sequenced and if its genes have been identified.
In this case, one can compare these genes to all known genes that encode specific enzymes in other organisms, and thus infer the likely functions of the enzymes they encode. The resulting information about metabolic genotypes is usually still incomplete and needs to be complemented by biochemical information. Such information is easiest to come by for well-studied model organisms such as E. coli and the yeast S. cerevisiae, whose metabolism
has been studied for decades, and where mountains of primary literature on metabolic enzymes exist [208, 254, 324, 637]. However, metabolic networks are also being characterized for many other organisms, including humans [232]. The genome-scale metabolic networks that result from this effort may not include every single reaction catalyzed in an organism, but are usually comprehensive enough to cover the synthesis and recycling of all major biomass molecules.
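The genotype representation, the genotype distance D, and the notion of a neighborhood described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the five-reaction toy universe are hypothetical, and the fractional form of D is normalized by the number of reactions present in at least one of the two networks, a choice made here so that D equals 1 when two networks share no reactions.

```python
def genotype_distance(g1, g2, as_fraction=True):
    """Genotype distance D between two binary genotype strings.

    Counts reactions catalyzed by one network but not by both.
    Assumption: the fractional form divides by the number of reactions
    present in at least one of the two networks.
    """
    if len(g1) != len(g2):
        raise ValueError("genotypes must cover the same reaction universe")
    differing = sum(a != b for a, b in zip(g1, g2))
    if not as_fraction:
        return differing
    in_either = sum(1 for a, b in zip(g1, g2) if a or b)
    return differing / in_either if in_either else 0.0


def neighbors(genotype):
    """All genotypes that differ from `genotype` in exactly one reaction."""
    return [
        genotype[:i] + (1 - genotype[i],) + genotype[i + 1:]
        for i in range(len(genotype))
    ]


# Hypothetical toy universe of five reactions (real universes hold ~10^4).
g1 = (1, 1, 0, 1, 0)
g2 = (1, 0, 0, 1, 1)

print(genotype_distance(g1, g2, as_fraction=False))  # number of differing reactions
print(genotype_distance(g1, g2))                     # the same distance as a fraction
print(len(neighbors(g1)))                            # size of the 1-neighborhood
print(len(str(2 ** 10_000)))                         # decimal digits of 2^(10^4)
```

In this toy case the two genotypes differ in two of the four reactions that occur in either network. The last line gives a feel for the size of genotype space: with some 10^4 reactions, the number of genotypes is a decimal number with several thousand digits.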
Metabolic phenotypes
I will now turn from metabolic genotypes to metabolic phenotypes related to the synthesis of essential biomass molecules. For free-living (non-parasitic) organisms, these biomass molecules include all amino acid and nucleotide precursors, lipids, many carbohydrates, and multiple cofactors for metabolic enzymes. For example, for E. coli, they comprise some 50 different molecules [231]. One could define a metabolic network’s phenotype as the number or fraction of these biomass molecules it can synthesize. This definition, however, has a serious limitation: unless a network synthesizes every single essential biomass molecule, it cannot sustain life. In other words, if we want to understand phenotypes of living organisms and their metabolic innovations, this definition is too limited.
Here is an alternative definition of metabolic phenotypes. It is motivated by the observation that the nutrients available to a cell determine whether the cell can synthesize all biomass molecules. Many organisms, such as E. coli, can survive in minimal environments that contain a terminal electron acceptor (such as O2), a source of nitrogen (e.g., NH3), sulfur (SO4), phosphorus (PO4), carbon, and energy. A simple way of characterizing a metabolic phenotype is to ask whether a metabolic network can synthesize all biomass molecules in a given chemical medium, such as a minimal medium with glucose as the sole carbon source. In other words, can the network sustain cell growth in this environment?
The above definition is useful but still has drawbacks, because astronomically many combinations of carbon sources might allow a network to sustain growth. A more systematic categorization of metabolic phenotypes is necessary, where these combinations are represented in a simple way. To arrive at such a categorization, consider a minimal environment like that above, where all molecules except the carbon source do not vary in their availability. That is, some environments may harbor one carbon source, others may harbor another, yet others may harbor multiple carbon sources. With this categorization in mind, I propose the following representation of a metabolic phenotype. Let us write this phenotype as a binary string. The length of this string corresponds to the number of molecules that could potentially serve as sole carbon source for some metabolic network. For any one metabolic network, this string contains a “1” at position i if the network can synthesize all biomass molecules whenever carbon source i is provided as the only carbon source, in an otherwise minimal environment (Figure 2.1b). A string with multiple ones corresponds to a network that can synthesize biomass in multiple minimal environments that differ in the sole carbon source they contain. The length of this string will be much smaller than the total number of carbon-containing molecules [202]. The reasons are that, first, organisms can import a limited number of such molecules; second, some carbon-containing molecules may be highly unstable; third, some molecules may be toxic reaction intermediates of metabolism. For brevity, I will call a network that can synthesize all biomass molecules in a given environment a viable network. An advantage of this representation is that it accounts for environments that contain many carbon molecules: a network that can synthesize all biomass molecules on each of several sole carbon sources is also likely to do so if all these carbon sources occur together. Obviously, this kind of reasoning can also be applied to categorizing metabolic phenotypes with respect to sources of sulfur, nitrogen, and phosphorus, when these sources vary in their availability.
The same holds for different energy sources, even though sources of these elements often provide energy as well. In sum, for my purpose an organism’s metabolic genotype is the totality of biochemical reactions catalyzed by enzymes of the organism. It is but a point in a vast space of metabolic networks. A network’s metabolic phenotype is the spectrum of alternative carbon (or other food) sources that the
network can use to synthesize all of an organism’s biomass molecules.
From metabolic genotype to phenotype
With the above definitions in hand, how can we determine phenotype from genotype? The time-honored experimental approach is to expose a specific organism to a minimal environment with a sole carbon source, and determine whether it can grow and divide. By exploring many different sole carbon sources, the metabolic phenotype can then be elucidated experimentally. However, to understand metabolic innovation, we will need to explore the metabolic phenotypes of many thousands of well-defined genotypes. To create and characterize such genotypes is beyond current experimental techniques, and thus requires computational approaches. Fortunately, the method of flux balance analysis allows us to compute metabolic phenotypes from genotypes for very large metabolic networks comprising thousands of reactions [678]. This method is widely used to optimize metabolic properties of industrially important microbes [232, 233, 624]. I will not discuss its mathematical details but merely highlight some central features. Flux balance analysis requires a list of all stoichiometric equations (Figure 2.1a) in a network. The method assumes that a network operates in steady-state conditions, as, for example, in a growing population with a constant nutrient supply.
Flux balance analysis is commonly used for two purposes. First, it can identify, for all reactions in a network, the set of allowed metabolic fluxes through these reactions in a given chemical environment, where a flux is the rate at which the substrates of a reaction are converted into products. Not all possible fluxes are allowed. The reason is that metabolism needs to conserve mass. It cannot produce more than it consumes. In calculating allowed fluxes, flux balance analysis must take into consideration that environmental nutrients can flow into a network at a limited rate. The task of calculating allowed fluxes amounts to solving a large set of linear equations, one for each kind of molecule in the network.
The set of allowed fluxes typically forms a large connected region of a high-dimensional space with as many dimensions as there are reactions in a metabolic network.
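In symbols, the constraints just described can be written compactly. The notation below is an assumption introduced for illustration and is not taken from the text: $N$ denotes the stoichiometric matrix, whose entry $N_{ij}$ is the stoichiometric coefficient of metabolite $i$ in reaction $j$, and $v$ is the vector of fluxes through the $r$ reactions of the network.

```latex
\[
  N v = 0,
  \qquad
  v_j^{\min} \le v_j \le v_j^{\max}
  \quad (j = 1, \dots, r)
\]
```

The equation $Nv = 0$ contains one linear mass-balance equation per kind of molecule and expresses the steady-state assumption. The bounds $v_j^{\min}$ and $v_j^{\max}$ express, among other things, that nutrients can enter the network only at a limited rate. The flux vectors $v$ that satisfy both conditions form the set of allowed fluxes.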
The second purpose of flux balance analysis is to identify those fluxes among all allowed fluxes that have a property of interest. For example, they might allow efficient synthesis of a biotechnologically interesting molecule, or they might allow production of all of a cell’s essential biomass molecules. To achieve the second purpose, flux balance analysis maximizes or minimizes linear functions of fluxes using a numerical optimization technique called linear programming [144]. In practice, the predictions of flux balance analysis are often in good agreement with experimental data for well-studied organisms [64, 248, 348, 593, 690]. The usual exceptions involve situations where the organism may not have adapted to a nutrient environment during its recent evolutionary past [146, 243, 348]. However, even in this case, laboratory evolution experiments can rapidly create microbial genotypes whose metabolic properties match the predictions of flux balance analysis [146, 247, 248, 348]. The most common reason for the original mismatch is suboptimal regulation of necessary enzymes in a novel laboratory environment. Such regulatory constraints are easily and rapidly overcome, more easily than the complete absence of an enzyme for an essential reaction. (I will examine regulation extensively in Chapter 3 and later chapters.) For my purpose, the key question about any one metabolic genotype is whether it can synthesize all essential biomass molecules in a given environment. To understand metabolic innovation, it is important to understand how this qualitative ability—regardless of the rate of synthesis—can change as a genotype changes. It is prudent to ignore the rate of synthesis, because the fast cell growth that a high synthesis rate supports is not necessarily the best indicator of an organism’s fitness, especially for microbes. For example, some highly successful microbial species, such as Mycobacterium tuberculosis, grow slowly in the wild [150, 405, 843]. 
Slow-growing microbes may even outcompete faster growing microbes under conditions often found in nature [172]. Nonetheless, most of what I will say below also holds for metabolic networks with high biomass synthesis rates [670]. In order to determine a metabolic network’s phenotype, as defined above, one can apply flux balance analysis n times, that is, to n different minimal chemical media that differ only in the sole carbon sources
they contain. Each medium in which an organism can synthesize all biomass molecules from the respective carbon source is assigned a “1” in the metabolic phenotype (Figure 2.1b). In sum, we can think of an organism’s metabolic genotype as a set of biochemical reactions represented by enzyme-coding genes. These reactions form a metabolic network whose most basic task is to synthesize all biomass molecules, small molecules essential for cell growth. As defined here, a metabolic phenotype is the ability to synthesize all of these molecules from chemicals found in the environment. Because of carbon’s centrality, I here focus on chemicals that can serve as carbon (and energy) sources in heterotrophic organisms. I categorize phenotypes according to those chemicals that can serve as sole carbon sources in an otherwise minimal chemical environment. Flux balance analysis allows us to predict metabolic phenotypes from genotypes. In closing this section, I note that the concepts and tools I just introduced fulfill several of the requirements I posited in Chapter 1. First, they can capture the combinatorial nature of metabolic innovations, which arise through novel combinations of enzymatic reactions; second, they explicitly represent qualitatively different phenotypes that regard the ability or inability to sustain life, and thus allow us to focus on the qualitative phenotypic change that characterizes innovation; third, they allow us to predict phenotype from genotype. I will next turn to what I earlier called the paramount problem of innovation: how organisms can explore myriad qualitatively new phenotypes while preserving their existing phenotypes.
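The procedure summarized above, one flux balance computation per sole carbon source, can be illustrated with a deliberately tiny example. The sketch below is not the implementation used in the studies cited here. The three-metabolite network, the reaction names, the uptake limit, and the viability tolerance are all hypothetical, and the linear program is solved with SciPy's `linprog`, which minimizes its objective, so the biomass flux is maximized by minimizing its negative.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy system (not from the text).
# Rows: internal metabolites A, B, C; C stands in for all biomass molecules.
# Columns: glucose uptake, ethanol uptake, r1 (A -> C), r2 (B -> C), biomass drain.
STOICH = np.array([
    [1.0, 0.0, -1.0,  0.0,  0.0],   # A
    [0.0, 1.0,  0.0, -1.0,  0.0],   # B
    [0.0, 0.0,  1.0,  1.0, -1.0],   # C
])
CARBON_SOURCES = ["glucose", "ethanol"]
UPTAKE_COLUMN = {"glucose": 0, "ethanol": 1}
ENZYME_COLUMNS = [2, 3]      # genotype bit k switches reaction ENZYME_COLUMNS[k] on or off
BIOMASS_COLUMN = 4
MAX_UPTAKE = 10.0            # assumed limit on nutrient inflow
MAX_INTERNAL = 1000.0        # assumed loose cap on internal fluxes


def max_biomass_flux(genotype, carbon_source):
    """Maximal steady-state biomass flux on one sole carbon source."""
    n = STOICH.shape[1]
    upper = np.zeros(n)                          # absent reactions carry no flux
    upper[UPTAKE_COLUMN[carbon_source]] = MAX_UPTAKE
    for present, column in zip(genotype, ENZYME_COLUMNS):
        if present:
            upper[column] = MAX_INTERNAL
    upper[BIOMASS_COLUMN] = MAX_INTERNAL
    objective = np.zeros(n)
    objective[BIOMASS_COLUMN] = -1.0             # maximize biomass flux
    result = linprog(
        objective,
        A_eq=STOICH,
        b_eq=np.zeros(STOICH.shape[0]),          # steady state: no net accumulation
        bounds=list(zip(np.zeros(n), upper)),
        method="highs",
    )
    if result.status != 0:
        raise RuntimeError(result.message)
    return -result.fun


def phenotype(genotype, tol=1e-6):
    """Binary phenotype string: one entry per sole carbon source."""
    return tuple(int(max_biomass_flux(genotype, s) > tol) for s in CARBON_SOURCES)


for g in [(1, 0), (0, 1), (1, 1), (0, 0)]:
    print(g, "->", phenotype(g))
```

In this toy network, a genotype that encodes only r1 is viable on glucose alone, one that encodes only r2 is viable on ethanol alone, and one that encodes both is viable on either source. Only the qualitative question (is any biomass flux possible?) enters the phenotype, in line with the decision above to disregard the rate of synthesis.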
Evolution of metabolic networks
The unit of evolutionary change in metabolic networks is the enzyme-coding gene. I will postpone discussing changes in individual genes to Chapter 4, and consider here instead two larger scale changes; both of them are more appropriate for the level of resolution at which I represent metabolic networks. First, I consider changes that arise through the addition of enzyme-coding genes (and thus reactions) to a metabolic network. An important driver of such change is horizontal or lateral gene transfer. It occurs in both pro- and eukaryotes. It is so frequent in prokaryotes that it can change genome organization on short evolutionary time scales [122, 163, 419, 445, 550, 568, 569, 589]. Second, I consider elimination of individual reactions from a network, as might occur through gene deletions or through loss-of-function mutations in enzyme-coding genes.
Even though much of what I will say below applies to all metabolic networks, it is useful to be aware of how fast metabolic networks, especially microbial ones, may change in evolution. Here are some relevant observations from the well-studied genome of the bacterium Escherichia coli and its relatives. DNA is transferred into the E. coli genome at a rate of 64 kilobase pairs per million years [437]. With an average gene length of approximately 1 kilobase pair, this amounts to the transfer of 64 genes per million years [65]. Even different E. coli strains can differ by more than 1 Mbp of DNA, or more than 20% of their genome, and may have experienced of the order of 100 gene additions through horizontal transfer relative to other strains [567, 590]. Because some 30 percent of E. coli genes have metabolic functions, the effects of such horizontal gene transfer on metabolism are profound [65, 231]. The addition of new DNA is compensated by the deletion of other DNA, and many newly added genes reside in the genome only for short amounts of time [437, 590]. Gene turnover in microbial genomes can thus be very high. Environmental demands on the organism in general, and on metabolism in particular, play an important role in such turnover [437, 590].
Over long time-scales, the accumulated effects of this gene turnover on metabolic network organization are staggering. Figure 2.2a shows the distribution of the genotype distance D between metabolic genotypes that are representatives of 222 prokaryotic genera whose complete genome sequences are known [831]. These genomes include both bacteria and archaea, and they span a broad range of prokaryotic diversity.
Each network reaction is represented through its encoding gene, as represented by orthologs of metabolic enzyme-coding genes in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, a curated database of metabolic networks [571]. Such databases have limitations, but they are currently the only viable means to centralize and manage the enormous amounts of data required for such an analysis [831]. As defined earlier, the genotype distance D is the fraction of reactions that occur in only one, but not both, of two networks in a pair. (Networks that
[Figure 2.2. Panel (a): histogram of the number of network pairs against metabolic network genotype distance D (0 to 1); legend labels: anaerobic, aquatic, aerobic, terrestrial, thermophilic, halophilic, marine. Panel (b): metabolic network distance D on the vertical axis; annotation: Spearman's ρ = 0.39.]
Figure 3.1 […] ($w_{ij} > 0$), patterned rectangles, or repressing ($w_{ij} < 0$), solid rectangles. Any given gene’s expression may be unaffected by most regulators in the circuit ($w_{ij} = 0$, open rectangles). Rectangles drawn in different shades of gray and different patterns correspond to different magnitudes of regulatory interactions $w_{ij}$. The highly regular correspondence of matrix entries to transcription factor binding sites serves the purpose of illustration and is not normally found, because such binding sites often function regardless of their position in a regulatory region. (b) Neighbors in genotype space. The middle panel shows a hypothetical circuit of five genes (top) and its genotype w of regulatory interactions (bottom), if genes are numbered clockwise from the uppermost gene. Light gray arrows indicate activating interactions and dark gray lines terminating in a circle indicate repressive interactions. The left-most circuit and the middle circuit differ in one repressive interaction from gene four to gene three (dashed thick gray line, black cross, large open rectangle). The right-most circuit and the middle circuit differ in one activating interaction from gene one to gene five (dashed thick line, black cross, large open rectangle). Each of the three circuit topologies corresponds to one point, indicated by the large circle around it, in a vast regulatory genotype space. These genotypes (circles) are connected because they are neighbors, that is, they differ by one regulatory interaction. After [123].
[…] transcriptional regulation circuitry systematically. Computational approaches are needed, approaches that allow us to characterize many regulatory genotypes and their expression phenotypes. I will now turn to a computational model of transcriptional regulation circuits that captures general features of the cross-regulatory interactions in such circuits [820]. The model is concerned with a circuit of S transcriptional regulators. Here and elsewhere S stands for the size of a system, that is, the number of its
parts. These regulators are represented by their expression patterns $E_t = E(t) = (E_1(t), E_2(t), \ldots, E_S(t))$ at some time t during a developmental or cell-biological process, and in one cell or region of an embryo. The model’s transcriptional regulators can influence each other’s expression through cross-regulatory and autoregulatory interactions, which are encapsulated in a matrix $w = (w_{ij})$. The entries $w_{ij}$ of this matrix indicate the strength of the regulatory influence that gene j has on gene i (Figure 3.1a). This influence can be activating ($w_{ij} > 0$), repressing ($w_{ij} < 0$), or absent ($w_{ij} = 0$). The entries of w represent cis-regulatory elements or transcription factor binding sites on DNA. Each row $w_{i\cdot}$ of this matrix represents the regulatory region of an entire gene i. The entire matrix w represents the (regulatory) genotype of this system. This model does not represent the genotype of the transcriptional regulators themselves. It focuses instead on the regulatory interactions encapsulated in the genotype w, and on evolutionary changes in this genotype. In doing so, it can disentangle the evolution of regulatory interactions from the evolution of the interacting molecules, the transcriptional regulators. For conceptual clarity, the evolution of these and other molecules is best examined separately, as I will do in Chapter 4. Also, cis-regulatory elements evolve much more rapidly, and may thus be more important for circuit evolution than evolutionary change in transcriptional regulators, which are often highly conserved [683, 775, 856, 857, 889]. The regulatory interactions of circuit genes change the expression state E(t) of the circuit over time t, a change that is modeled by the difference equation
$$E_i(t + \tau) = \sigma\left[\sum_{j=1}^{S} w_{ij} E_j(t)\right], \tag{3.1}$$
where $\tau$ is a time constant that is determined by the time scale characteristic for transcriptional regulation, which is of the order of minutes. The function $\sigma(\cdot)$ is a sigmoidal function whose values lie in the interval (–1, +1). Equation 3.1 reflects the extent to which circuit genes contribute to the regulation of any circuit gene i. The sigmoidal function on the right-hand side of Equation 3.1 reflects the common observation that transcriptional regulators regulate the expression of their target genes cooperatively [101, 438, 577, 862]. The influence of a regulator j on the expression of gene i is reflected in the relative magnitude of the $w_{ij}$ in Equation 3.1. The equation is analogous to equations used in neural computation [19, 325]. This model is different from earlier models of generic regulatory circuits in that it specifically models transcriptional regulation, and not just any type of regulatory process [387]. More details on its biological motivation can be found in reference [820].
I here consider the limit where $\sigma(\cdot)$ has a very steep slope at the origin, and becomes a step function, giving (–1) for negative arguments and (+1) for positive ones, with $\sigma(0) = 0$. This limit is used for computational convenience, but much of what I discuss below would also hold for steep sigmoidal functions. This limit means that the model represents two gene expression states, one where a gene is not expressed (–1), and one where it is expressed (+1). This choice of variables exists for computational convenience [123, 820]. Other modeling choices that have been examined, such as Boolean (0–1) expression states, lead to phenomena similar to those I discuss below [341, 481]. The model is concerned with circuits whose expression dynamics start from a pre-specified initial state $E_0$ at some time t = 0 during development, and arrive at a “target” equilibrium expression state $E_\infty$. The initial state can be viewed as being determined by regulatory factors upstream of the circuit, which may represent signals from the cell’s environment, or from other regions of a developing embryo. Transcriptional regulators that are expressed in the equilibrium state $E_\infty$ can affect the expression of genes downstream of the circuit. I will here focus on stable equilibrium states $E_\infty$, but extensions to states that vary over time are straightforward [704].
The expression state E∞ is the expression phenotype of a regulatory circuit. Because any such state is only attained from some initial expression state E0, the pair (E0, E∞) needs to be considered jointly when one studies expression phenotypes. Given that there is an astronomical number of 2^S × 2^S = 2^{2S} such pairs for circuits with S genes, it may seem hopeless to make general statements about them and the genotypes that express them. However, one can show
that most properties of the model encapsulated in Equation 3.1 that are relevant for my purpose depend only on the number of genes whose expression state differs between the initial and the equilibrium state [123]. This means that instead of characterizing 2^{2S} possible pairs of expression states, it is sufficient to consider those (S + 1) classes of pairs (E0, E∞) for which anywhere between 0 and S genes differ in their expression, regardless of the identity of these genes, and regardless of the specific initial and equilibrium expression state. This model is abstract, which is an advantage for characterizing generic properties of transcriptional regulation circuits. It may seem like a disadvantage for modeling specific circuits in any one organism. However, variants of this model are highly successful in such modeling. For example, they can predict the regulatory dynamics of early developmental genes in the fruit fly Drosophila, including their mutant phenotypes [362, 527, 645, 646, 696]. These model variants have also helped address a broad range of conceptual questions in evolutionary biology. They include why mutants often show a release of genetic variation that is cryptic in the wild-type, how adaptive evolution of robustness occurs in genetic networks of a given topology, and how sexual reproduction can enhance robustness to recombination [34, 55, 83, 704, 820]. For brevity, I will refer to my observations below as properties of transcriptional regulation circuits.
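The reduction from 2^{2S} pairs to (S + 1) classes can be made concrete with a small counting sketch. It assumes only that each gene is either expressed or not, so that a class with k differing genes contains 2^S × C(S, k) pairs (2^S choices of E0, times the choice of which k genes differ); the function names are illustrative.

```python
from math import comb


def differing_genes(e0, e_inf):
    """Number of genes whose expression state differs between E0 and E_inf."""
    return sum(a != b for a, b in zip(e0, e_inf))


def pairs_per_class(S):
    """Number of (E0, E_inf) pairs in each class k = 0..S of differing genes."""
    return [2 ** S * comb(S, k) for k in range(S + 1)]


S = 20
classes = pairs_per_class(S)
# The S + 1 classes partition all 2^(2S) pairs of expression states.
print(len(classes), sum(classes) == 2 ** (2 * S))
```

The classes are far from equal in size: the class with half of all genes differing holds the most pairs, which is one reason the figures in this chapter report results for d = 0.5.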
Genotype space, neighborhoods, and numbers of genotypes and phenotypes

I will refer to the structure of a regulatory genotype matrix w as the topology of a circuit, the pattern of existing (wij ≠ 0) and absent (wij = 0) regulatory interactions (Figure 3.1). Changes in topology correspond to the loss of regulatory interactions (wij → 0), or to the appearance of new regulatory interactions that were previously absent. A variety of evidence, some of it discussed above, shows that such topological changes can occur on very short evolutionary timescales [490, 740, 758, 786]. My analysis in this chapter will focus on such topological changes. Part of the motivation is biological [123]: the biochemical parameters determining the behavior of cellular circuitry vary continually,
because a cell’s internal and external environment varies constantly. This variation makes less variable circuit topologies (instead of the more variable interaction strengths within a topology) an obvious and important focus of study [349, 512, 812]. A second motivation for this focus on topology is conceptual. It will allow us to see deep similarities between innovation in regulatory circuits and in other systems. I will refer to the set of all possible circuit topologies as the genotype space of regulatory circuits, and will call two circuits (topologies, genotypes) neighbors in this space if they differ in the presence or absence of exactly one regulatory interaction (Figure 3.1b). The entire neighborhood of a circuit consists of all circuits that differ from it in exactly one regulatory interaction. That is, this neighborhood consists of circuits that either have one additional interaction or lack one interaction. More generally, one can define a circuit’s k-neighborhood as comprising all circuits that differ from it in k regulatory interactions. By extension, one can define the genotype distance D of two circuits as the fraction of non-zero regulatory interactions in which they differ [349, 512, 812]. This distance ranges from zero to a maximal value of D = 1 for two circuits that do not share a single interaction. A k-neighborhood around any one circuit can also be viewed as a shell or ball of some small radius D around the circuit. For my purpose, it is useful to consider a simplified and discrete regulatory genotype, where individual regulatory interactions can only assume values of wij = +1 (activation), wij = –1 (repression), or wij = 0 (no interaction). However, much of what I will say holds for continuous interactions [123, 124, 491]. The main purpose of this simplification is that it allows enumeration of the possible genotypes in this model. This number of genotypes is the size of the circuits’ genotype space.
It is equal to 3^{S^2} for circuits with S genes. This number remains large even if one considers only circuits with some given number I of (non-zero) regulatory interactions. The number of such circuits is given by:

$$2^I \binom{S^2}{I} = 2^I \, \frac{S^2!}{(S^2 - I)!\, I!}, \tag{3.2}$$
where X! = 1 × 2 × . . . × X denotes the factorial function for positive integers X. It is easy to see (for example, by applying Stirling’s formula [2] to approximate S^2!) that this number of circuits grows exponentially in S^2. Thus, even a modest number of regulators can interact in an astronomical number of different ways, and form a vast genotype space of possible regulatory circuits. Below I will restrict myself to circuits where not all genes regulate each other’s expression; that is, circuits where the number of interactions I is much smaller than S^2. This restriction reflects biological reality. For example, the circuit involved in the patterning of the syncytial Drosophila embryo comprises 15 genes with merely 32 interactions among them [104]. Among 76 transcriptional regulators in the yeast Saccharomyces cerevisiae, 106 putative regulatory interactions exist in the form of transcription factors bound to the promoter regions of other transcription factor coding genes [225, 440]. Although the number of regulatory genotypes grows exponentially in S^2, the number of expression phenotypes, as I mentioned earlier, grows only exponentially in 2S, that is, as 2^{2S}. This means that
there are exponentially more regulatory genotypes than phenotypes. It follows that there must be many circuits with the same expression phenotype. This excess of regulatory genotypes over phenotypes does not mean, however, that every phenotype has the same large number of associated genotypes. Some expression phenotypes may be abundant; that is, they may have many more circuits adopting them than other “rare” phenotypes. For the generic circuits I study here, the number of genotypes per phenotype depends only on the number of genes whose activity differs between the initial expression state E0 and the final equilibrium expression state E∞ [123]. If we view a regulatory circuit as a device that “computes” the state E∞ from E0, then this computation becomes more difficult as more genes need to change their activity during the computation. In consequence, fewer circuit genotypes can execute the computation. Figure 3.2 shows this dependency of circuit number on phenotype [222]. The figure’s vertical axis shows the fraction of genotype space occupied by circuits with a given phenotype. This fraction decreases rapidly as more genes need to change their activity during a circuit’s regulatory dynamics.

[Figure 3.2 plot. Vertical axis: fraction of genotype space (0 to 6 × 10^-6). Horizontal axis: fraction d of expression state differences (0 to 1.0).]
Figure 3.2 Some phenotypes have many more associated genotypes than others. The horizontal axis shows the fraction d of genes which differ in their expression between a circuit’s initial expression state E0 and its equilibrium expression state E∞ [123]. The vertical axis shows, for each value of d, the fraction of a random sample of 1.7 × 10^6 circuits with this value of d, divided by the total number of expression phenotypes (E0, E∞) with this value of d. (A value of d = 0 corresponds to circuits where E0 = E∞, and where thus the only requirement on the circuit is that E∞ is a stable equilibrium state.) It can be shown that the number of circuit genotypes per phenotype (E0, E∞) depends only on the fraction d. The data is shown for circuits of size S = 20 genes and I = 55 regulatory interactions, but similar relationships would hold for other circuits [123, 222].
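The counting argument of this section (a genotype space of size 3^{S^2}, the count in Equation 3.2 for circuits with exactly I interactions, and 2^{2S} expression state pairs) can be checked numerically. The sketch below is only an arithmetic illustration; the function names are assumptions, and the values S = 20 and I = 55 are borrowed from the caption of Figure 3.2.

```python
from math import comb, log10


def genotype_space_size(S: int) -> int:
    """All matrices with S*S entries, each -1, 0, or +1."""
    return 3 ** (S * S)


def circuits_with_interactions(S: int, I: int) -> int:
    """Equation 3.2: choose I of the S*S entries, then a sign for each."""
    return 2 ** I * comb(S * S, I)


def phenotype_pairs(S: int) -> int:
    """Pairs (E0, E_inf) of two-state expression patterns."""
    return 2 ** (2 * S)


S, I = 20, 55
print(f"log10 genotypes          : {log10(genotype_space_size(S)):.1f}")
print(f"log10 circuits with I={I}: {log10(circuits_with_interactions(S, I)):.1f}")
print(f"log10 phenotype pairs    : {log10(phenotype_pairs(S)):.1f}")
```

Summing Equation 3.2 over all possible I recovers the full genotype space size, by the binomial theorem, which is a useful consistency check on the reconstructed formula.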
Circuit genotypes with the same phenotype form very large connected sets

An excess of regulatory genotypes over phenotypes has another consequence for the number of circuits that have a given expression phenotype, that is, for the genotype set of this phenotype: this genotype set may be very large, yet it may constitute a very small fraction of genotype space. As an example, Figure 3.3 shows how the fraction (left vertical axis) and number (right vertical axis) of circuits with a given expression phenotype depend on circuit size. For example, for networks with merely S = 10 genes, there are approximately 10^39 circuits with any one expression phenotype, but these circuits occupy only a small fraction 10^-5 of genotype space. We will see similar relationships when we discuss phenotypes of molecules in Chapter 4. I next focus on the genotype set of a given phenotype. This genotype set can be viewed as a graph. Graphs are mathematical objects that consist of nodes and edges, and that are used to represent networks in many areas of science, from physics to sociology. The nodes in our genotype set are genotypes (regulatory circuits). Two nodes are connected by an edge if they are neighbors; that is, if they differ in exactly one regulatory interaction. A fundamental question about this graph is whether it is connected; that is, whether there is a path of edges connecting any two circuits that does not leave this graph. Alternatively, this graph might consist of only isolated nodes (nodes not neighboring any other circuit on the graph) or of some
[Figure 3.3 plot. Left vertical axis: log10 (fraction of genotypes). Right vertical axis: log10 (number of genotypes). Horizontal axis: number S of circuit genes.]
Figure 3.3 Circuits with any one expression phenotype represent a small fraction of genotype space. The horizontal axis shows the circuit size S, that is, the number of genes in a circuit. The left vertical axis shows the fraction of circuits in genotype space that have a given expression phenotype (see below), on a logarithmic scale. The right vertical axis shows the number of these circuits. Note the logarithmic scales, the exponential decrease in the fraction of genotypes with increasing S, and the greater than exponential increase in the number of genotypes, which is caused by the exponential scaling of genotype space size with the square of circuit size S [123]. The relationship shown here can be shown to hold quantitatively for all phenotypes where the fraction d of genes which differ in their expression between E0 and E∞ is equal to d = 0.5, regardless of the actual expression states E0 and E∞ [123]. The number of genotypes per phenotype would decrease as d increases (Figure 3.2), but otherwise qualitatively identical relationships would hold for other values of d. The data is based on circuits where each gene is regulated by half of all circuit genes, but similar scaling relationships would hold for different numbers of regulatory interactions. The data was obtained through random sampling of circuits from circuit genotype space. Sampling errors are at least one order of magnitude smaller than the estimated values, and thus invisible on the plot.
number of connected components, subsets of nodes that are connected to one another but isolated from all other nodes. I will refer to any such component as a genotype network. A closely related and equally important question is whether genotype networks typically occur in small, highly localized “islands” of genotype space, or whether they extend far through this space. To determine connectedness of a genotype network, one can pursue two different strategies. First, for very small circuits (S < 7 genes), one can exhaustively enumerate all circuits with a given phenotype. For larger circuits, such exhaustive enumeration is no longer possible [123]. In this case, one needs to sample circuits from a genotype set and ask whether there are paths through genotype space that connect these circuits without leaving this set. In practice, such sampling allows one to estimate connectivity of a genotype set to arbitrary accuracy, given sufficiently large samples. A combination of enumeration and sampling approaches demonstrates the following property [123]: regardless of the specific phenotype studied, the vast
majority of all circuits belong in the same genotype network, and only a tiny fraction form islands of largely isolated circuits. For example, for circuits with 12 genes and an average of three regulatory interactions per gene, 99.8 percent of all circuits with the same phenotype are connected in a single genotype network. This percentage increases further for larger circuits, and is similarly high for networks with different phenotypes [123]. Such high connectedness of a genotype set is only possible if circuits in the set have on average more than one neighbor with the same phenotype. In fact, most circuits have vastly more such “neutral” neighbors. As an example, Figure 3.4 shows the distribution of the fraction of a circuit’s neutral neighbors for circuits of S = 20 genes. Most circuits of this size have dozens to hundreds of neutral neighbors, regardless of their specific expression phenotype. The next question is how far individual genotype networks extend through genotype space. To obtain a first answer, one can sample circuits chosen at random from the same genotype network, and ask how different they are from each other. Put differently,
[Figure 3.4 plot. Vertical axis: number of circuits. Horizontal axis: fraction n of neighbors with the same phenotype (0 to 1.0).]
Figure 3.4 Most circuits have many neighbors with the same phenotype. The horizontal axis shows the fraction n of a circuit’s neighbors with the same phenotype, out of a total of 2S(S – 1) possible neighbors. The data are based on a sample of 10^4 networks (S = 20 genes) with approximately S/2 regulatory interactions per gene, sampled at random from a set of genotypes with the same phenotype [123]. The distribution shown applies to any phenotype where half of all genes differ in their expression between E0 and E∞ [123], regardless of the specific phenotype. While the shape of the distribution depends on some of the parameters just mentioned, the fact that most circuits have many neighbors with the same phenotype does not change as these parameters are varied [123].
[Figure 3.5 plot. Main panel: vertical axis, number of circuit pairs (×10^4); horizontal axis, genotype distance D (0 to 1). Inset: vertical axis, mean genotype distance D; horizontal axis, number S of network genes.]
Figure 3.5 Randomly chosen circuit pairs with identical gene expression phenotype have very different topology. The horizontal axis shows genotype distance D, the fraction of regulatory interactions that differ between two circuits with topologies w and w’. The histogram is based on a sample of 5 × 10^5 circuit pairs with S = 24 genes, and an average of 6 regulatory interactions per gene. The same distribution exists for all phenotypes where 50 percent of genes differ in their expression between E0 and E∞, regardless of the specific phenotype. Similarly large distances are observed when these circuit properties are varied [123, 124]. The inset shows, as an example, the mean (and standard deviation) of the distribution shown, but for circuits with different numbers S of genes. The genotype distance D is defined as

$$D(w, w') = \frac{1}{2M_+} \sum_{i,j} \left(1 - \delta\left[\mathrm{sign}(w_{ij}), \mathrm{sign}(w'_{ij})\right]\right),$$

where M_+ denotes the maximally allowed number of regulatory interactions, and where δ[x, y] = 1 if and only if x = y, and δ[x, y] = 0 otherwise [124].
this approach asks how different circuits with the same phenotype typically are [123, 124]. Figure 3.5 shows an example. The example illustrates that randomly chosen pairs of circuits with the same phenotype typically differ in approximately 80% of their regulatory interactions. This observation suggests that circuits with the same phenotype can be found in very distant parts of genotype space. A similar approach can be used to estimate the maximal genotype distance of circuits within one genotype network. Specifically, a lower bound for this maximal distance is given by the maximal genotype distance among pairs of circuits in a large random sample of a genotype network. When estimating this maximum, one finds that it is very large and often equal to the diameter of genotype space, i.e., D = 1 [124]. Two circuits at this extreme distance have the same expression phenotype, yet they do not share a single
regulatory interaction. These observations hold not only for some circuits and distance measures, but over a broad range of circuit sizes, circuit phenotypes, and for different measures of genotype distance [124]. They are a generic property of the circuits I analyze here. I note that these observations are broadly consistent with the qualitative empirical observations that I discussed earlier, and which showed that very different regulatory interactions (regulatory genotypes) can produce the same gene expression phenotype. In sum, transcriptional regulation circuits with the same phenotype typically form vast sets of genotypes. Most or all of these genotypes are connected in one giant genotype network that can be traversed through small individual steps, changing one regulatory interaction at a time, and affecting as little as
a single transcription factor binding site. The genotype networks of different phenotypes extend far through genotype space, and many such networks even traverse this space completely [123, 124].
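The exhaustive-enumeration strategy mentioned above for very small circuits can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of [123]: it assumes a synchronous step-function update with s(0) = 0, treats two circuits as neighbors if one entry of w changes between 0 and ±1 (gain or loss of one interaction), and uses a hypothetical pair (E0, E∞) for S = 3 genes. Results for such a tiny circuit need not resemble those reported for larger ones.

```python
from collections import deque
from itertools import product


def step_sign(x):
    return (x > 0) - (x < 0)


def equilibrium(w, e0):
    """Fixed point reached from e0 under synchronous updates, or None."""
    S = len(e0)
    state = tuple(e0)
    for _ in range(3 ** S + 1):  # enough steps to exhaust any transient
        nxt = tuple(
            step_sign(sum(w[i * S + j] * state[j] for j in range(S)))
            for i in range(S)
        )
        if nxt == state:
            return state
        state = nxt
    return None


def neighbors(w):
    """Circuits that gain or lose exactly one regulatory interaction."""
    for pos, value in enumerate(w):
        for new in ((1, -1) if value == 0 else (0,)):
            yield w[:pos] + (new,) + w[pos + 1:]


def genotype_set(e0, target):
    """All circuits (flattened S x S matrices) that reach target from e0."""
    S = len(e0)
    return {
        w for w in product((-1, 0, 1), repeat=S * S)
        if equilibrium(w, e0) == target
    }


def component_sizes(nodes):
    """Sizes of connected components (genotype networks), largest first."""
    unseen, sizes = set(nodes), []
    while unseen:
        queue, size = deque([unseen.pop()]), 1
        while queue:
            for nb in neighbors(queue.popleft()):
                if nb in unseen:
                    unseen.remove(nb)
                    queue.append(nb)
                    size += 1
        sizes.append(size)
    return sorted(sizes, reverse=True)


e0, target = (1, -1, 1), (1, 1, -1)  # hypothetical phenotype pair
gset = genotype_set(e0, target)
sizes = component_sizes(gset)
print(len(gset), len(sizes), sizes[0] / len(gset))
```

The last printed value is the fraction of the genotype set contained in its largest genotype network, the quantity that the text reports as 99.8 percent for circuits with 12 genes.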
Accessing novel gene expression phenotypes

Thus far, I have focused on circuits with the same expression phenotype. I will next turn to different (new) expression phenotypes, and how they arise through regulatory mutations. Many new expression phenotypes may have no consequence or deleterious consequences for the organism, but some fraction of them may give rise to regulatory innovations. Whether a new phenotype is an innovation will depend on many circumstances, but this much is certain: circuits that can produce many new variants over time while preserving their existing phenotype will increase their odds of bringing forth regulatory innovations. I will next show that the sprawling genotype networks characteristic of regulatory circuits facilitate such innovations. Consider two different circuits that have very different genotypes (large genotype distance D), but that lie on the same genotype network; that is, they share the same gene activity phenotype P. The two immediate neighborhoods around each of these circuits contain many circuits that also have the same phenotype P. However, both neighborhoods may also contain many circuits whose phenotypes are different from P. The same holds for two larger neighborhoods; that is, k-neighborhoods that contain circuits differing in some small number k of regulatory interactions [124]. If the neighborhoods of two very distant circuits in genotype space typically contain exactly the same novel phenotypes, then the extendedness of a genotype network is irrelevant for the accessibility of such phenotypes. No matter where a circuit is located on a genotype network, small changes in it will produce the same novel phenotypes. However, if such neighborhoods contain different novel phenotypes, the very existence of a genotype network becomes important to evolutionary innovation. In this case, a series of small genotypic changes that preserve a circuit’s phenotype may give the circuit access to many novel phenotypes.
The following observations show that this is the case.
The horizontal axis of Figure 3.6a shows the genotype distance between circuit pairs drawn at random from the same genotype network. The data in this figure is based on more than 10^3 such circuit pairs [124]. The vertical axis shows the fraction of novel phenotypes unique to the k-neighborhood (k ≤ 3) of one circuit. Here and elsewhere, I use the term unique phenotype in the sense that a phenotype occurs in the neighborhood of one but not the other circuit in a pair. The fraction of unique, novel phenotypes is generally large, with a mean of greater than 0.7. Figure 3.6b shows a similar relationship but for circuit pairs at smaller genotype distances. The figure displays only the mean fraction of novel and unique phenotypes in a neighborhood as a function of genotype distance among circuits in a pair. This mean fraction increases very rapidly with increasing genotype distance. For instance, for circuits of S = 8 genes, at a genotype distance of merely D = 0.06, 34 percent of new phenotypes occurring in the 2-neighborhood of two circuits are different. This number increases further until it exceeds 50 percent [124]. Taken together, these data show that small neighborhoods around different circuits with the same phenotype contain many novel phenotypes that are unique to each neighborhood.
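The notion of novel phenotypes that are unique to one neighborhood can be sketched for 1-neighborhoods. The code below is an illustration under assumptions, not the analysis of [124]: it assumes a synchronous step-function update, gain-or-loss neighbors, and it defines U as the fraction of novel phenotypes near the first circuit that do not occur near the second. The two three-gene circuits are hypothetical.

```python
def step_sign(x):
    return (x > 0) - (x < 0)


def equilibrium(w, e0):
    """Fixed point reached from e0 under synchronous updates, or None."""
    S = len(e0)
    state = tuple(e0)
    for _ in range(3 ** S + 1):
        nxt = tuple(
            step_sign(sum(w[i * S + j] * state[j] for j in range(S)))
            for i in range(S)
        )
        if nxt == state:
            return state
        state = nxt
    return None


def neighbors(w):
    """Circuits that gain or lose exactly one regulatory interaction."""
    for pos, value in enumerate(w):
        for new in ((1, -1) if value == 0 else (0,)):
            yield w[:pos] + (new,) + w[pos + 1:]


def novel_phenotypes(w, e0):
    """Stable +1/-1 equilibria found in the 1-neighborhood that differ from w's own."""
    own = equilibrium(w, e0)
    found = set()
    for nb in neighbors(w):
        p = equilibrium(nb, e0)
        # Skip neighbors without a stable equilibrium or with undefined (0) states.
        if p is not None and 0 not in p and p != own:
            found.add(p)
    return found


def unique_fraction(w1, w2, e0):
    """Fraction of novel phenotypes near w1 that do not occur near w2."""
    n1, n2 = novel_phenotypes(w1, e0), novel_phenotypes(w2, e0)
    return len(n1 - n2) / len(n1) if n1 else 0.0


e0 = (1, -1, 1)
w1 = (1, 0, 0, 1, 0, 0, 0, -1, 0)   # gene 1 represses gene 2
w2 = (1, 0, 0, 1, 0, 0, -1, 0, 0)   # gene 0 represses gene 2 instead
print(equilibrium(w1, e0) == equilibrium(w2, e0), unique_fraction(w1, w2, e0))
```

Extending `novel_phenotypes` to k-neighborhoods would mean applying `neighbors` repeatedly up to k times; that is omitted here to keep the sketch short.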
Genotype networks are highly interwoven

An important final question regards how far one must typically travel in genotype space to reach a circuit with an arbitrary novel phenotype. This is a question about the shortest distance between two genotype networks in this space. It is analogous to a question I asked in Chapter 2 for metabolic systems. To address it, one can choose at first two random circuits that have the same initial expression state E0 but arrive at different and arbitrary (random) equilibrium states E∞ and E’∞. Starting from the first circuit, one can then perform a random walk that aims to reach the second circuit, while leaving the first circuit’s phenotype E∞ unchanged. After a large number of mutations in this random walk, a distance Dmin is reached that cannot be reduced further without changing the phenotype E∞. Repeating this procedure with more than 10^3 pairs of random networks yields the distribution of Dmin shown in Figure 3.7. To be precise, this distribution is only an upper bound for the minimum distance
[Figure 3.6 plots. Panels (a) and (b): vertical axis, fraction U of phenotypes unique to neighborhood; horizontal axis, genotype distance D. Panel (a) legend: numbers of circuit pairs, binned from (0,5] to > 45; inset: two genotypes G1 and G2 with phenotype P, their neighborhoods, and unique phenotypes U. Panel (b) legend: S = 12 genes, S = 8 genes.]
Figure 3.6 Different neighborhoods of a genotype network contain different expression phenotypes. (a) The mean fraction U of novel and unique phenotypes that occur in the neighborhood (circles in the inset on the lower right) of one but not the other regulatory circuit (gray color in the inset, symbol U for unique), for pairs of regulatory circuits whose genotype distance D is shown on the horizontal axis. Sizes of small circles represent numbers of circuit pairs with a given distance and fraction of unique phenotypes. Data are based on the k-neighborhoods (k ≤ 3) of a sample of 2210 circuit pairs chosen at random from the same genotype network. Notice the large fraction of unique new phenotypes for almost all circuit pairs shown (mean U = 0.73). (b) As in (a) but for the mean U of genotypes at smaller distances from one another [124], and for k = 2. Standard deviations around each data point are no greater than 8 × 10^-3 and error bars are thus too small to be visible. Data are for circuits of S = 8 genes (in addition, S = 12 genes for b), an average of 2 regulatory interactions per gene, for any phenotype where 50% of genes differ in their expression between E0 and E∞ (regardless of the specific phenotype) and for circuits that have between 45 and 60% of neutral neighbors. Qualitatively similar patterns hold over a broad range of these circuit properties [124].
[Figure 3.7 plot. Main panel: vertical axis, number of circuit pairs; horizontal axis, minimal genotype network distance (Dmin). Inset: vertical axis, minimum genotype distance (upper bound); horizontal axis, number N of circuit genes.]
Figure 3.7 Genotype networks of arbitrary phenotypes are close together in genotype space. The figure shows a histogram for an upper bound on the minimum genotype distance Dmin of two circuits with different phenotypes. It is based on circuits with S = 20 genes, 5 regulatory interactions per gene on average, and on 1600 random pairs of phenotypes, where 50 percent of genes differ in their expression between E0 and E∞, and where the expression states of individual genes in E∞ and E’∞ are uncorrelated. The inset shows the mean (error bars: one standard deviation) of Dmin as a function of the number of genes. The mean Dmin decreases with increasing number of genes S (inset) and with an increasing number of regulatory interactions per gene (data not shown). See also [124].
between two typical genotype networks. The actual Dmin may still be lower. Nonetheless, the figure shows that Dmin is small and that it decreases as circuit size increases. For example, for circuits of merely twenty genes, the average minimum distance to circuits with arbitrary different phenotypes (expression patterns) is only D = 0.14. In genotype space, a region with this radius contains only a tiny fraction of all networks. For example, there are an estimated 10^128 networks of S = 20 genes with an average of 5 interactions per gene. Only a fraction 10^-102 of them is contained in a neighborhood of D = 0.14 around any one circuit. In sum, the genotype networks of an arbitrary phenotype can typically be found in a tiny region around any given genotype network. This observation shows that typical genotype networks not only span large distances, but they also are highly interwoven [124].
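The idea of approaching a second circuit while preserving the first circuit's phenotype can be sketched with a simple greedy descent. This is a deliberately simplified stand-in for the random walk described above, not the procedure of [124]: it assumes a synchronous step-function update, lets a single mutation set one entry of w to any of –1, 0, or +1, and only ever accepts changes that copy an entry from the target circuit. The distance follows the definition given in the caption of Figure 3.5, with `m_plus` (the maximally allowed number of interactions) chosen arbitrarily for the hypothetical example.

```python
import random


def step_sign(x):
    return (x > 0) - (x < 0)


def equilibrium(w, e0):
    """Fixed point reached from e0 under synchronous updates, or None."""
    S = len(e0)
    state = tuple(e0)
    for _ in range(3 ** S + 1):
        nxt = tuple(
            step_sign(sum(w[i * S + j] * state[j] for j in range(S)))
            for i in range(S)
        )
        if nxt == state:
            return state
        state = nxt
    return None


def distance(w1, w2, m_plus):
    """Genotype distance D: entries whose signs differ, divided by 2 * m_plus."""
    differing = sum(step_sign(a) != step_sign(b) for a, b in zip(w1, w2))
    return differing / (2 * m_plus)


def greedy_min_distance(w1, w2, e0, m_plus, rng):
    """Move w1 toward w2 one entry at a time while keeping w1's phenotype.

    Returns an upper bound on the minimum distance, as a greedy search can stall.
    """
    target = equilibrium(w1, e0)
    w = list(w1)
    improved = True
    while improved:
        improved = False
        differing = [p for p in range(len(w)) if w[p] != w2[p]]
        rng.shuffle(differing)
        for p in differing:
            old, w[p] = w[p], w2[p]
            if equilibrium(tuple(w), e0) == target:
                improved = True      # accept: phenotype preserved, distance reduced
            else:
                w[p] = old           # reject: phenotype would change
    return distance(tuple(w), w2, m_plus)


e0 = (1, -1, 1)
w1 = (1, 0, 0, 1, 0, 0, 0, -1, 0)    # reaches (1, 1, -1) from e0
w2 = (1, 0, 0, -1, 0, 0, 0, 0, 1)    # reaches (1, -1, 1) from e0
rng = random.Random(0)
print(distance(w1, w2, 3), greedy_min_distance(w1, w2, e0, 3, rng))
```

Because every accepted circuit keeps the first phenotype, the walk can never reach the second circuit itself, so the returned value is strictly positive whenever the two phenotypes differ.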
Summary

The analyses I have summarized here suggest that transcriptional regulation circuits have several generic properties. First, there are many more circuits (genotypes) than phenotypes. Second, typically almost all genotypes with any one expression phenotype form one vast, connected genotype network. Third, typical circuits have a large number of neighbors with the same expression phenotype. Their phenotype is robust to regulatory change. Some of the experimental data I discussed above speak to these properties. For example, the fact that one can readily alter regulatory interactions in the E. coli transcriptional regulation circuit without detrimental effects suggests that this kind of robustness also exists for the E. coli regulatory circuit [355]. Fourth, the longest distance of two genotypes in such a network is close to the diameter of genotype space. This means that two circuits with the same expression phenotype may have few regulatory interactions in common. As I mentioned earlier, these observations are broadly consistent with empirical evidence that very different cis-regulatory regions and transcription factors can indeed convey the same expression pattern on many different genes in different species [490, 758, 786, 847]. Also, transitional forms of regulation in species intermediate to those
studied show how such highly diverged circuitry may arise mechanistically: new regulatory interactions arise before old interactions are eliminated, and a sequence of such additions and eliminations of interactions may eventually cause regulatory circuitry to diverge beyond recognition, yet leave an expression phenotype largely unchanged [490, 758, 786]. Fifth, the neighborhoods of different circuits on the same genotype network contain many unique new phenotypes. Sixth, the shortest distance between genotype networks in genotype space is small. No experimental evidence is yet available that speaks to these last two observations. The models on which these observations are based contain assumptions that facilitate enumeration of genotypes and phenotypes. However, relaxing these assumptions is not likely to affect these observations. For example, wide-ranging and vast genotype networks still exist for circuits with continuous regulatory interactions, with different representations of gene expression states, and for signaling circuits, circuits that do not just involve transcriptional regulation [123, 124, 481, 491, 561, 823]. Neighborhood diversity, a large fraction of unique novel phenotypes in different neighborhoods of a genotype network, can also exist for such circuitry (Chapter 14). Taken together, all this evidence suggests that the properties I discuss here are general properties of regulatory circuitry.
CHAPTER 4
Novel molecules
Genotype: A sequence of ribonucleotides or amino acids.
Phenotype: Secondary and tertiary structure, biochemical activity.

Every distinct and specific enzymatic activity we see in organisms today was an evolutionary innovation when it first arose. These innovations permit the use of new energy sources, the biosynthesis of essential molecules from unusual food, or protection against a hostile world. The same holds for many non-enzymatic proteins that serve in structural support, communication, or defense. The relationship between macromolecular genotypes and their phenotypes is thus very important to understand evolutionary innovation. I will examine this relationship here for two important macromolecules: proteins and RNA. A previous book of mine also discussed parts of this material [825].
RNA and protein genotypes

In RNA molecules, genotypes can be represented as nucleotide strings. As in previous chapters, I will let the letter S stand for system size. It corresponds to the length of a nucleotide string, its number of nucleotide monomers. RNA molecules exist in a vast genotype space. Specifically, because there are four different RNA nucleotide monomers, this genotype space comprises 4^S possible genotypes for a molecule of length S. Because all proteins are encoded by RNA or DNA molecules, one could also represent the genotype of a protein by the encoding nucleotide sequence. However, it is often more convenient to represent it directly as an amino acid sequence. This is unproblematic, because a nucleotide string usually encodes only one amino acid string. Like RNA genotype space, protein genotype space is vast. For example, it contains 20^S proteins with S amino acids. In both protein and RNA genotype spaces, a simple measure of the distance D between two genotypes exists. It is the number or fraction of nucleotides or amino acids in which they differ. Two genotypes are (1-mutant) neighbors in this space if they differ in only one nucleotide or amino acid. The (1-mutant) neighborhood of a genotype comprises all its neighbors. More generally, the k-mutant neighborhood of a genotype comprises all molecules differing in no more than k monomers from itself.
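The definitions of sequence space size, genotype distance, and 1-mutant neighborhood can be written down directly. The sketch below uses illustrative function names and an arbitrary four-nucleotide example sequence; it considers only substitutions, in line with the definitions above.

```python
RNA_ALPHABET = "ACGU"
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def genotype_space_size(alphabet: str, S: int) -> int:
    """Number of sequences of length S over the given alphabet."""
    return len(alphabet) ** S


def distance(a: str, b: str, as_fraction: bool = False):
    """Number (or fraction) of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    differing = sum(x != y for x, y in zip(a, b))
    return differing / len(a) if as_fraction else differing


def one_mutant_neighbors(seq: str, alphabet: str):
    """All sequences that differ from seq in exactly one monomer."""
    for pos, current in enumerate(seq):
        for monomer in alphabet:
            if monomer != current:
                yield seq[:pos] + monomer + seq[pos + 1:]


seq = "GCAU"
nbrs = list(one_mutant_neighbors(seq, RNA_ALPHABET))
print(len(nbrs), genotype_space_size(RNA_ALPHABET, len(seq)))
```

A sequence of length S over an alphabet of A monomers has S × (A – 1) such neighbors, so the neighborhood grows only linearly with S while the space itself grows exponentially.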
RNA and protein phenotypes

Because proteins have many different roles in a cell, they have multifaceted phenotypes. An enormous amount of work has focused on characterizing secondary structure and tertiary structure phenotypes of proteins, the arrangement of their amino acid sequence in three-dimensional space. This aspect of a protein’s phenotype is also known as its fold (Figures 1.5 and 1.6 showed examples). Compared to the astronomical number of protein sequences, the number of protein folds is small. It varies according to the estimation method but is less than 20,000 according to available predictions [145, 286, 418, 579, 838, 860, 885, 886]. When studying protein function, the fold is not always the best representation of phenotype. Enzymes provide a case in point. An enzyme’s catalytic site is formed by few surface amino acids that are responsible for the enzyme’s substrate specificity, and for the kind of chemical reaction it catalyzes. Enzymes with very similar folds can have a great variety of catalytic functions [660, 774, 776, 777, 849]. Thus, at least for some proteins, the
tertiary structure provides only a scaffold; different biochemical functions are built on top of this scaffold by subtle modifications. Despite this limitation of studying protein folds as phenotypes, disrupting a fold will also generally disrupt protein function. Thus, a properly formed tertiary structure is a necessary ingredient for the function of many proteins. It therefore deserves study in its own right. RNA phenotypes are also attractive study objects, because RNA is involved in most of life’s key processes. Examples include: small nuclear RNAs, which are key parts of the splicing machinery; guide RNAs important in RNA editing; telomerase RNA necessary for maintaining chromosome ends; and small RNA molecules regulating gene expression through RNA interference [13]. Like proteins, RNA forms secondary and tertiary structures. RNA secondary structure is an elaborate planar shape that is formed when an RNA molecule folds onto itself, thus forming hydrogen bonds of complementary base-pairs within the molecule (A–U, G–C, and, to a lesser extent, G–U nucleotides; Figure 4.1). The three-dimensional RNA tertiary structure brings distant secondary structure elements into proximity through non-standard base-pairing, pseudoknots, and bivalent ions such as Mg²⁺. Compared to solved protein structures, which number in the thousands, the number of well-characterized RNA tertiary structures is puny, because RNA tertiary structure is more difficult to determine [191]. Thus, more evolutionary work has focused on RNA secondary structure phenotypes. Another motivation for this focus is that secondary structure is critical to the function of many RNA molecules: destroy this structure and you destroy RNA function. Examples include many viruses whose genome consists of RNA.
Parts of their genome form secondary structure motifs, such as the so-called transactivation responsive and Rev-1 responsive elements of HIV, the internal ribosomal entry site of picornaviruses, and the 3’ untranslated region of flavivirus genomes [48, 169, 361, 486, 622]. The RNA structures they form interact with parts of the protein machinery necessary to complete the viral life cycle. Another class of examples concerns the secondary structures formed by messenger RNA (mRNA). They are important for chemical modifications of messenger RNA that affect its half-life, and how efficiently it is translated into protein [489, 575, 599, 802]. Evolutionary conservation underscores the functional importance of many RNA secondary structures. Distantly related species harbor many RNAs with diverged nucleotide sequences but conserved secondary structures. Natural selection maintains these secondary structure phenotypes. Examples come from ribosomal RNAs, transfer RNAs, catalytic RNAs such as ribonuclease P, and viral RNA genomes [191, 256, 465, 687, 832].
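The base-pairing rules mentioned in this section (A–U, G–C, and G–U) and the dot-parenthesis notation of Figure 4.1b can be made concrete with a small parser. The sketch below is illustrative only: the function names and the short example sequence are assumptions, and it checks only whether a structure is well formed and compatible with a sequence, not whether it is thermodynamically stable.

```python
# Illustrative sketch: read a dot-parenthesis string and check that every
# base pair it implies is one of the pairs named in the text.
ALLOWED_PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def parse_dot_parenthesis(structure):
    """Return the sorted list of (i, j) base pairs encoded by `structure`."""
    open_positions = []
    pairs = []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unmatched ')' at position %d" % position)
            pairs.append((open_positions.pop(), position))
        elif symbol != ".":
            raise ValueError("unexpected symbol %r" % symbol)
    if open_positions:
        raise ValueError("unmatched '(' at position %d" % open_positions[-1])
    return sorted(pairs)


def is_compatible(sequence, structure):
    """True if each paired position holds two nucleotides that can pair."""
    if len(sequence) != len(structure):
        return False
    return all(sequence[i] + sequence[j] in ALLOWED_PAIRS
               for i, j in parse_dot_parenthesis(structure))


if __name__ == "__main__":
    print(parse_dot_parenthesis("((...))"))
    print(is_compatible("GCAAAGU", "((...))"))
```

Because matching parentheses are nested by construction, a well-formed string automatically satisfies both conditions on secondary structure: each nucleotide pairs with at most one partner, and no two pairs cross.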
Figure 4.1 Two equivalent representations of RNA secondary structure. (a) Two-dimensional graphical representation. Stacks (regions of paired bases) and various kinds of loops (unpaired regions; the figure labels a hairpin loop, internal loops, and a multiloop) are indicated. Base pairs in an RNA secondary structure have to meet two conditions. First, each nucleotide must be paired with at most one other nucleotide. Second, two pairings cannot cross in the planar projection of the structure, otherwise planarity would be violated. (b) Dot-parenthesis representation. A dot stands for an unpaired base, and a pair of matching parentheses corresponds to a base pair. The two representations are equivalent. That is, as one reads the RNA string in (a) beginning from GGGU . . ., one can read the representation in (b) from left to right to find which bases are paired and unpaired. From figure 4.1 in [825], used with permission from Princeton University Press.

Genotype networks in proteins and RNA
Proteins with the same phenotype can be extremely diverse in genotype. A first and prominent example involves the globin fold, whose elucidation won Max Perutz and John Kendrew the Nobel prize. The globin fold is a structural phenotype characteristic of oxygen-binding proteins, such as myoglobin and hemoglobin. These proteins have numerous distant relatives in many vertebrates, invertebrates (mollusks, arthropods, and annelids), and even plants, where they bind oxygen to facilitate nitrogen fixation. All these proteins bind oxygen, albeit with different affinities and kinetics [264, pp. 38–40]. The tertiary structures of even distant globin representatives are very similar. For instance, the three-dimensional structure of whale myoglobin and the hemoglobin of the clam Lucina pectinata can be superimposed almost exactly [651]. This great phenotypic similarity stands in stark contrast to the high genotypic diversity of globins. For example, clam hemoglobin has only 18 percent amino acid identity to vertebrate hemoglobin. A study of 6 hemoglobins from plants and animals found that as few as 12.4 percent of amino acid residues were identical between any pair of these proteins [31]. In addition, only 4 out of 97 amino acids in the protein were unchanged in all of these proteins. If the sequences of two proteins have diverged this dramatically, it is difficult to determine from their sequences alone whether they share a common ancestor. However, a variety of other criteria, including details of a protein’s structure, can help. Taken together, such criteria argue for a common origin of globins [305]. The phylogenetic tree in Figure 4.2 shows that amino acid similarities of globins among different species reflect the evolutionary relatedness of the species. More distantly related species harbor globins with more dissimilar amino acid sequences [279]. Globins are not unusual in being highly diverse yet connected evolutionarily, as the next examples will illustrate.

A protein domain is a distinct, compact and stable unit of protein structure that folds independently of other such units. It often also has a unique biochemical function. A protein can consist of a single domain, like hemoglobin, or of multiple domains. As structural information on proteins is accumulating, comparative analyses of protein evolution are
focusing increasingly on domains themselves. Such comparisons reveal patterns very similar to those of the whole proteins I have just discussed. A case in point is the fibronectin type III domain, which forms a tertiary structure similar to one also found in immunoglobulins. It is widespread in animals and has also been found in some bacteria [79]. The proteins in which it occurs include extracellular matrix proteins such as fibronectin (involved in processes as diverse as tissue repair, blood clotting, and cell migration), intracellular proteins, and many kinds of membrane receptor proteins, such as the human growth hormone receptor. Despite their highly similar tertiary structures, amino acid sequence similarities of fibronectin domains in different species are as low as 9 percent, although they may have a common origin [79].

Figure 4.2 Evolutionary relationships among globins. The phylogenetic tree shown is a maximum parsimony tree [237] based on aligned amino acid sequences of globins from 5 plants, 9 invertebrates, and 3 vertebrates [279]. The numbers along each tree branch represent the numbers of substitutions that took place along the branch. The left vertical axis represents time in million years before present (MYR BP). Acronyms and names of genera denote the following. Anadara broughtonii and Anadara trapezia (bivalves); Aplysia spp. (gastropods); Artemia salina (brine shrimp); Busycon canaliculatum and Cerithidea rhizophorarum (gastropods); CTT: Chironomus thummi thummi (a midge) and its various globins; Dolabella auricularia (a sea hare); Glycera dibranchiata (a polychaete); Lampetra sp. (lamprey); LUM: Lumbricus terrestris (earthworm); Myxine sp. (hagfish); Glycine sp., Lupinus sp., Parasponia sp., Phaseolus sp., Vicia sp. (legumes); Scapharca inaequivalvis (a bivalve); TYL: Tylorrhynchus heterochaetus (a polychaete). From ref. [279], used with permission from Springer Science+Business Media.
Among many other examples, the triosephosphate isomerase (TIM)-barrel domain stands out, because it is one of the most abundant protein structures in nature. Named after a glycolytic enzyme that harbors this domain, it has a barrel-like
structure whose “planks” are made up of secondary structure elements. It may occur in as many as 10 percent of all enzymes. The enzymes with this domain have a broad diversity of functions [849]. The TIM-barrel domain may derive from a single ancestor, but many proteins that harbor this domain have no recognizable sequence similarity to one another [140].
Genotype sets and genotype networks
For RNA molecules and proteins, a genotype set comprises all genotypes (sequences) with the same phenotype. In the case of globins, for example, it would comprise all proteins with an oxygen-binding globin fold. In a genotype network, additionally, any pair of sequences can be connected through a series of single amino acid (or nucleotide) changes that do not change the phenotype. As I mentioned in earlier chapters, not all sequences in a genotype set necessarily form a single genotype network; this also holds for proteins. We may never know whether all proteins with oxygen-binding globin folds belong to exactly one genotype network, because their number may be astronomically large. However, phylogenetic analyses like that of Figure 4.2 show that the genotype network on which known globins lie is vast [279]. Because different globins have close to maximally dissimilar genotypes, their genotype network spans genotype space, or nearly so. To see this, it is useful to remember that the species in Figure 4.2 are but a few leaves on a huge tree of globin-possessing species connected through common ancestry. If we could identify all their (extinct) ancestors, we would see the full continuum of genetically diverse molecules that constitute this network. Examples such as those of globins and the TIM-barrel domain show that evolutionary explorations of sequence space in the billion-year-long history of eukaryotes can range very far without compromising a protein’s structure. However, any one such example is essentially an anecdote. It does not show how representative this phenomenon is of all proteins. Counterexamples exist of proteins with conserved structure and very similar sequences in widely divergent organisms; they include actins, tubulins, and histones, with up to 98 percent
sequence identity in organisms as dissimilar as humans and plants [188]. To find out whether such counterexamples are the rule or the exception, statistical surveys of many protein structures are necessary. With increasing amounts of available protein structure information, such surveys have become possible. However, a difficulty is that many proteins with similar structures are close together in genotype space, because they share a recent common ancestor (a single sequence) from which they have diversified. This means that known proteins with the same structure are not unbiased representatives of all sequences that fold into the structure. In a statistical survey that alleviates this problem, Rost first identified 272 proteins with different folds in the Protein Data Bank, a database of thousands of protein structures at atomic resolution [201, 659]. He then used each such protein with a unique structure as a reference protein, and identified all other proteins that were so dissimilar to the reference protein in amino acid sequence (< 25 percent identity) that their common ancestry with the reference protein is doubtful [189]. He found that many of these proteins have a fold that is similar to that of the reference protein. Because of their low sequence similarity to the reference protein, this set of proteins is not biased towards highly similar proteins that fold into the same structure. It is striking that the proteins with similar structure but different sequence thus identified shared on average only 8.5 percent of their amino acids, far fewer than the 25 percent threshold pre-imposed on the analysis. This number is only slightly higher than the 5.6 percent amino acid identity between any two proteins chosen at random from the database (which may well have different structures). Other surveys make similar observations: many protein structures can be realized by very different amino acid sequences [47, 774, 776].
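The percentages quoted above are sequence identities. A minimal sketch of how such a number can be computed for two already aligned, equal-length sequences follows, together with the identity expected between two random sequences drawn from a given amino acid composition. The function names and the toy sequences are assumptions of this sketch. It ignores alignment gaps, and it is not the procedure used in the survey described in the text.

```python
# Illustrative sketch: percent identity of two aligned sequences, and the
# identity expected by chance for a given amino acid composition.

def percent_identity(a, b):
    """Percentage of aligned positions at which two sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty, equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def expected_random_identity(composition):
    """Chance identity (in percent) of two independent random sequences.

    `composition` maps each amino acid to its relative frequency; the
    probability that two independent draws agree is the sum of squared
    frequencies.
    """
    total = float(sum(composition.values()))
    return 100.0 * sum((count / total) ** 2 for count in composition.values())


if __name__ == "__main__":
    print(percent_identity("ACDE", "ACDF"))
    uniform = {aa: 1 for aa in "ACDEFGHIKLMNPQRSTVWY"}
    print(expected_random_identity(uniform))  # 20 equally frequent letters
```

With 20 equally frequent amino acids the chance identity is 5 percent, and any unevenness in composition raises it. That unevenness might be one reason why an empirical random baseline lies slightly above 5 percent, but this is a suggestion of the sketch rather than a claim made in the text.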
When we turn to RNA and its phenotypes, the situation is similar, although our knowledge is more limited. Very different sequences can have similar form and function. For example, only seven nucleotides are conserved among group I self-splicing introns, a prominent class of RNAs with catalytic activity. Nevertheless, secondary (and probably tertiary) structures within the core of these catalytic RNAs are conserved [465]. Similarly great diversity is observed for the catalytic RNA
molecules of ribonuclease P, an enzyme necessary for the biosynthesis of transfer RNA molecules. This RNA contains a 200-nucleotide-long core that is shared by eukaryotes and prokaryotes [256]. Only 10 percent of these 200 nucleotides do not change in known RNA molecules. More generally, RNA sequence conservation among molecules with the same phenotype is typically restricted to short stretches of fewer than 10 nucleotides [191]. This limited similarity can raise serious problems when trying to identify new molecules based on sequence similarity alone [191]. It is often the conserved structure phenotype itself that needs to be used as a guide to identify such sequences. In sum, any one RNA or protein phenotype can typically be formed by many highly dissimilar sequence genotypes. For well-studied molecules with many known genotypes, these genotypes form a mutationally connected genotype network.
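The distinction between a genotype set and a genotype network can be illustrated computationally: given a set of sequences that share a phenotype, its genotype networks are the connected components of the graph in which two sequences are linked if they differ in a single monomer. The following sketch assumes a small, explicitly listed genotype set (the sequences are invented for illustration), and the pairwise comparison it uses is only practical for small sets.

```python
# Illustrative sketch: split a genotype set into genotype networks, that is,
# into groups of sequences connected by single-monomer changes.
from collections import deque


def differ_by_one(a, b):
    """True if two equal-length sequences differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def genotype_networks(genotype_set):
    """Return the connected components of `genotype_set`, largest first."""
    remaining = set(genotype_set)
    networks = []
    while remaining:
        start = remaining.pop()
        component = {start}
        queue = deque([start])
        while queue:  # breadth-first search over 1-mutant links
            current = queue.popleft()
            linked = {g for g in remaining if differ_by_one(current, g)}
            remaining -= linked
            component |= linked
            queue.extend(linked)
        networks.append(component)
    return sorted(networks, key=len, reverse=True)


if __name__ == "__main__":
    toy_set = {"AAAA", "AAAU", "AAUU", "GGGG"}  # hypothetical genotype set
    for network in genotype_networks(toy_set):
        print(sorted(network))
```

In this toy set, three sequences form one network and the fourth is isolated, so the genotype set comprises two genotype networks.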
Systematic explorations of genotype networks
Examples from naturally occurring molecules powerfully demonstrate how molecules with similar phenotypes can have very different genotypes. However, they have difficulty answering more general questions, some of which are important to understand innovability. Do genotype networks have similar sizes for molecules with different phenotypes? Which kinds of novel phenotypes (candidate evolutionary innovations) can typically be found near a given genotype network? To answer these questions, one would ideally want exhaustive information about all possible genotypes and their phenotypes, but all we have is a modest and biased set of naturally occurring genotypes. They are a few of the leaves on a vast phylogenetic tree. Many other leaves, and the internal branches connecting them, are usually unknown, because they have not been discovered yet, or because they correspond to sequences from extinct species. To avoid this limitation of available data, one can either explore a small genotype space exhaustively, or one can sample genotype space in a random, unbiased way. (For sufficiently large sample sizes, results from sampling will be arbitrarily close to those of exhaustive exploration.) Unfortunately, experimental determination of phenotypes is too laborious to create large samples. In addition, computational prediction of all aspects of a molecule’s phenotype is either too slow or impossible with today’s methods. We thus need to focus on aspects of phenotypes that are complex enough to reflect important properties of actual molecules, but simple enough to predict for vast numbers of genotypes. Doing so involves computational approaches, such as computational models of protein structure. I will next introduce some of the relevant models for proteins, before returning to RNA.
Protein folding models
Computational models of protein folding rest on the notion that proteins will fold into a native conformation or tertiary structure that is compact in space and that minimizes the protein’s free energy. An important contribution to this free energy is non-covalent interactions (hydrogen bonds, hydrophobic interactions, and ionic bonds) between amino acids that are not adjacent in the amino acid chain [87]. Some such interactions are favorable and reduce the protein’s free energy, whereas others are unfavorable and increase it. The native conformation has the largest number of favorable and strong interactions. It is usually also a compact conformation, in the sense that amino acids are densely packed in it. The principal obstacle to brute force calculation of this native conformation is the astronomical number of non-native conformations any one protein can form. Computational approaches take various shortcuts to alleviate this problem. One prominent approach uses simplified and tractable models of protein folding called lattice proteins [35, 80–82, 90, 97, 108–110, 156, 185, 186, 220, 283–286, 434, 444, 464, 668, 669, 718, 759, 760, 805, 853]. In lattice proteins, folding is constrained in one important way: in the native conformation, individual amino acids can only assume positions on a discrete grid, a lattice. This grid can be either two-dimensional or three-dimensional (Figure 4.3). The advantage of this discrete representation of tertiary structure is that all possible tertiary structures of an amino acid chain can be enumerated. There are, for example, fewer than $10^5$ possible tertiary structures that completely fill the three-dimensional cubic lattice of Figure 4.3b [455].

Figure 4.3 Lattice proteins. A protein is represented by a chain of black and white beads, corresponding to hydrophobic and hydrophilic amino acids. (a) A protein of 36 amino acids folded onto a 6 × 6 two-dimensional square lattice. (b) A protein of 27 amino acids folded onto a 3 × 3 × 3 three-dimensional cubic lattice. From figure 1 in [455], used with permission from AAAS.

To identify the most thermodynamically stable among these structures, lattice protein models make several assumptions. For instance, they may represent proteins as chains of only two types of amino acids: polar (P) and hydrophobic (H). Any protein’s amino acid sequence is then completely determined by choosing one of these types for every position along the chain. When folded compactly on a lattice, a protein’s free energy E can then be calculated by adding the individual energy contributions $E_{a_i a_j}$ made by all pairs of amino acids $a_i$ and $a_j$ that are adjacent to one another on the lattice, but not adjacent on the amino acid chain. (Amino acids that are adjacent on the chain do not contribute to the free energy, because they are always adjacent, regardless of the protein’s fold.) If there are only two types (H and P) of amino acids, this interaction energy can assume only three values, $E_{HH}$, $E_{HP}$, and $E_{PP}$. These values can be chosen to
mimic features of actual proteins, e.g., that amino acids of the same type interact preferentially (e.g., $E_{HP} > E_{PP}, E_{HH}$) [455]. Even this simplest of models captures important aspects of protein folding. Specifically, it is the tendency of hydrophobic amino acids to avoid water that drives different proteins to fold into compact shapes with a core of hydrophobic amino acids [87]. Some real proteins can be designed or redesigned merely by choosing suitable hydrophobic and polar amino acids along the chain [143, 382]. These observations justify a restriction to hydrophobic and hydrophilic amino acids as a first-order approximation to characterize protein structures. Extensions of this model can capture increasingly subtle aspects of protein folding. Such extensions include larger alphabets of up to 20 amino acids, empirically determined interaction energies for these amino acids, and modifications to incorporate the effects of the solvent in which the protein folds [90]. Lattice proteins are not the only models of protein folding. Another class of models folds proteins without confining them to a lattice [549, 565]. Yet another class does not represent protein conformations in space, but only their free energies [35, 90, 97]. Because models of protein folding minimize free energy, they implicitly rest on the assumption that a protein’s native phenotype is its minimum free-energy fold [28]. But how does a protein find this fold? The total number of possible folds is astronomical, and a protein cannot possibly explore all of them [894]. The likely answer is that the protein’s minimum free-energy structure is surrounded by a “folding funnel” of similar folds with higher free energies [444, 565, 894]. This folding funnel guides the folding protein through states of increasingly lower free energy to the minimum free-energy structure. Thus, aside from exceptions, most proteins are able to form the minimum free-energy fold as their native fold [57].
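The energy calculation described in this section, a sum of contact energies over pairs of amino acids that are lattice neighbors but not chain neighbors, can be sketched for a two-dimensional lattice as follows. The three energy values are hypothetical numbers chosen only so that like amino acids interact more favorably than unlike ones; they are not the parameters of any cited model, and the function name is likewise an assumption.

```python
# Illustrative sketch of the contact energy of an HP lattice protein.
# The energy values are hypothetical; they merely satisfy
# E_HP > E_PP and E_HP > E_HH (lower energy = more favorable).
CONTACT_ENERGY = {"HH": -2.0, "PP": -1.0, "HP": 0.0, "PH": 0.0}


def fold_energy(sequence, coordinates, energies=CONTACT_ENERGY):
    """Sum contact energies over lattice neighbors that are not chain neighbors.

    `sequence` is a string of 'H' and 'P'; `coordinates` lists the (x, y)
    lattice site of each amino acid in chain order.
    """
    if len(sequence) != len(coordinates):
        raise ValueError("one lattice site is needed per amino acid")
    if len(set(coordinates)) != len(coordinates):
        raise ValueError("two amino acids cannot occupy the same site")
    index_at = {site: i for i, site in enumerate(coordinates)}
    total = 0.0
    for i, (x, y) in enumerate(coordinates):
        # Looking only right and up visits every lattice edge exactly once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and abs(i - j) > 1:
                total += energies[sequence[i] + sequence[j]]
    return total


if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]    # chain folded into a square
    straight = [(0, 0), (1, 0), (2, 0), (3, 0)]  # fully extended chain
    print(fold_energy("HPPH", square))    # one H-H contact between the ends
    print(fold_energy("HPPH", straight))  # no contacts at all
```

For the four-residue chain in the example, the square fold brings the two terminal hydrophobic residues into contact, whereas the extended chain has no contacts, so the square fold has the lower energy.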
Genotype networks in protein folding models
The simplest lattice protein models with only hydrophobic and hydrophilic amino acids already demonstrate the existence of genotype networks [80, 455]. For instance, Li and collaborators examined all possible sequences of hydrophobic and polar amino acids in three-dimensional (3 × 3 × 3) and two-dimensional lattice proteins (for various lattice sizes) that have a unique minimum free-energy structure. Here is what they found. First, there are many fewer protein phenotypes than genotypes. This observation is consistent with the modest numbers of folds ($< 2 \times 10^4$) in real proteins [449]. For instance, for three-dimensional lattice proteins, the average phenotype (structure) is realized by 62 of the $2^{3 \times 3 \times 3} = 2^{27} \approx 1.3 \times 10^8$ possible sequences. Second, some phenotypes are formed by many sequences (up to 3794), whereas others are formed by few [455]. These qualitative observations are largely independent of the protein model used and are thus likely to hold for real proteins [91, 218, 456, 457, 519, 692]. Third, where a phenotype is formed by many amino acid sequences, these sequences can be very dissimilar [455]. These observations are also consistent with evidence from real proteins. The majority of protein structures are “unifolds,” realized by only one family of usually diverse proteins with the same, unique evolutionary origin [418]. In addition, multiple especially “frequent” tertiary structures exist [418]. These include the TIM-barrel I mentioned above, or the Rossmann fold, a tertiary structure found in nucleotide-binding proteins [87, 546]. These structures are frequent in the sense that they occur in multiple families of proteins. Members of a family have significant amino acid sequence similarity and thus a common ancestor. However, little such similarity exists among families. Although even highly dissimilar sequences can share a common ancestor, some frequent folds may have originated multiple times independently [79, 140, 279, 594]. Fourth, most model protein phenotypes have a single network of connected genotypes, whereas only a minority have multiple disconnected networks [156]. In general, the larger the number of genotypes that form a phenotype, the more likely it is that they form one large connected genotype network.
Such networks can only exist if any one genotype typically has one or more neighboring genotypes with the same phenotype. This is the case, but within any one genotype network, the distribution of the number of neighbors is very heterogeneous. Some genotypes have many neighbors with the same phenotype, others have few [107]. All these observations have to be taken with a grain of salt, especially because of the small size of
the amino acid “alphabet” (two amino acids) in these models, and because other aspects of protein folding are sensitive to the number of different amino acids and their interaction energies [91]. However, explorations of more extensive amino acid alphabets and real protein structures confirm these observations [35].
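The kind of exhaustive analysis described in this section can be imitated on a very small scale. The sketch below enumerates every conformation of a short chain on the two-dimensional square lattice (one representative per rotation and reflection), folds every H/P sequence, and groups the sequences that have a unique minimum-energy conformation by that conformation. It is a toy version under stated assumptions: the energy values are hypothetical, all conformations rather than only compact ones are considered, and the function names are invented here.

```python
# Illustrative sketch: genotype sets of very short two-dimensional HP
# lattice proteins, obtained by exhaustive enumeration.
from collections import defaultdict
from itertools import product

# Hypothetical contact energies (like residues attract; lower is better).
ENERGY = {"HH": -2.0, "PP": -1.0, "HP": 0.0, "PH": 0.0}
STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def conformations(n):
    """Self-avoiding walks of n >= 2 beads, one per rotation/reflection class."""
    found = []

    def extend(path, turned):
        if len(path) == n:
            found.append(tuple(path))
            return
        x, y = path[-1]
        for dx, dy in STEPS:
            site = (x + dx, y + dy)
            if site in path:
                continue
            if not turned and dy == -1:
                continue  # the first turn must go up: removes mirror images
            extend(path + [site], turned or dy != 0)

    extend([(0, 0), (1, 0)], False)  # fixing the first step removes rotations
    return found


def contacts(path):
    """Pairs (i, j) that are lattice neighbors but not chain neighbors."""
    index_at = {site: i for i, site in enumerate(path)}
    pairs = []
    for i, (x, y) in enumerate(path):
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and abs(i - j) > 1:
                pairs.append((i, j))
    return pairs


def genotype_sets(n):
    """Map each uniquely preferred conformation to the sequences adopting it."""
    folds = conformations(n)
    fold_contacts = [contacts(f) for f in folds]
    sets = defaultdict(list)
    for letters in product("HP", repeat=n):
        seq = "".join(letters)
        energies = [sum(ENERGY[seq[i] + seq[j]] for i, j in c)
                    for c in fold_contacts]
        lowest = min(energies)
        if energies.count(lowest) == 1:  # unique minimum-energy conformation
            sets[folds[energies.index(lowest)]].append(seq)
    return dict(sets)


if __name__ == "__main__":
    for fold, sequences in genotype_sets(4).items():
        print(fold, len(sequences), sequences)
```

For chains of four beads there are five distinct conformations, and only the square one contains a contact. With the hypothetical energies used here, the eight sequences whose two end residues are of the same type fold uniquely into it, while the other eight have no unique minimum-energy conformation. Even this toy case shows a many-to-one map from genotypes to phenotypes.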
Different neighborhoods of the same protein genotype network contain different novel phenotypes
Just like comparative data from real proteins, protein folding models thus support the notion that amino acid sequences are organized into extensive genotype networks. But an important question about innovability remains. Are the phenotypes in the neighborhood of different genotypes on the same genotype network similar? This question is important, because its answer determines how broad the spectrum of phenotypes is that one or few mutations can reach from a genotype network. Available knowledge about the biochemical function of thousands of proteins can help answer this question. Figure 4.4 shows the answer for enzymes, the most prominent and especially well-studied class of proteins. The figure shows an analysis of small neighborhoods around two proteins with the same structure, where the proteins have varying genotype distance D. It shows that even for proteins with moderate genotype distance D one finds more than 50 percent of enzymatic phenotypes unique to one neighborhood, a percentage that increases for larger distances D [239]. Similar observations would hold for small neighborhoods around protein genotypes with conserved function instead of conserved structure [239]. The data in this figure come from more than 16,000 proteins with known structure and enzymatic functions. Although these data cover 750 different enzyme functions, they are still a very sparse sample of sequence space. For example, sequence pairs with small D are highly underrepresented in them [239]. This explains the lack of data at the smallest values of D, the large uncertainty at moderate D (long error bars), and the necessity to analyze k-neighborhoods of radius k = 5 and not k = 1 for this figure. As sequence space becomes better characterized, these gaps will undoubtedly be filled. Despite these limitations, the data clearly show that different neighborhoods in sequence space contain different novel phenotypes.

Figure 4.4 Small neighborhoods around different proteins with a given structure contain proteins with many unique functions. The horizontal axis shows the genotype distance D (from 0 to 1) of two single-domain protein genotypes with the same structure. The vertical axis shows the fraction U (from 0 to 1) of proteins with enzymatic functions that occur in a small sequence neighborhood of one but not the other of the two genotypes. The neighborhood in question covers a radius of k = 5 point mutations around each sequence. The data for this figure are based on the 30 most abundant protein folds (the folds with the highest number of sequences) in a dataset with 16,574 single-domain proteins of known structure and enzymatic function. This dataset includes 705 different types of enzymes from all six enzyme classes of the enzyme commission (EC) nomenclature [239]. The moderately large value of k, and the large uncertainty (long error bars) at small distances D result from the low number of known enzyme pairs at low genotype distance D. Error bars represent standard errors of the mean [239].
RNA structure prediction
I will next return to RNA molecules and their organization in genotype space. Systematic explorations of RNA phenotypes that avoid the biases inherent in comparative studies of natural RNA molecules demonstrate patterns of organization similar to those of proteins. Such explorations sample RNA phenotypes and genotypes at random from sequence space, explore their properties, and compare them to those of naturally occurring RNA molecules. The organization of RNA phenotypes in genotype space is even better studied than that of proteins. The focus in such studies is on RNA secondary structure phenotypes, because tertiary structure data is very limited, and because, as discussed above, secondary structure is often essential to RNA function. Experimental techniques to analyze RNA secondary structure are too laborious to analyze thousands of RNA genotypes and phenotypes, as is necessary to understand their organization in genotype space [408, 530]. In consequence, computational approaches are important to determine RNA secondary structures. Two categories of approaches exist. The first predicts secondary
structures by comparing RNA sequences with conserved function from multiple different organisms [255, 364, 365, 543, 587, 595, 859]. Because it relies on available sequences, it is subject to the biases discussed above, which makes it less suited for a systematic analysis of sequence space. The second category predicts RNA secondary structure from thermodynamic principles. Such prediction can be made on several levels of resolution. First, one can determine an RNA molecule’s secondary structure with the smallest free energy, that is, the most stable secondary structure. The task of predicting this minimum free-energy structure is simplified by the fact that each secondary structure consists of only two kinds of elements: loops, that is, regions of unpaired bases; and stacks, regions of paired bases (Figure 4.1). Loops destabilize a secondary structure, whereas stacks stabilize it. I note that the most stable secondary structure is generally not the structure with the largest number of paired bases. Part of the reason is that each stack, although it stabilizes secondary structure, by necessity creates a loop, which destabilizes secondary structure. For instance, a transfer RNA responsible for attaching histidine to a nascent polypeptide has more than $10^5$ secondary structures with 26 base pairs, the maximum number of base pairs possible for this RNA. However, it has only one minimum free-energy structure, which has 22 base pairs, fewer than the maximum number of 26 base pairs [864]. Most of a structure’s stabilizing energy comes from interactions between the aromatic rings of adjacent pyrimidine and purine base-pairs, so-called stacking interactions. Secondary structure prediction algorithms [563, 841, 892, 893] take advantage of experimentally determined energy contributions of stacks and loops [318, 363, 497, 789, 835].
Albeit not perfect, their predictions often agree well with experimentally determined secondary structures; most importantly for my purpose, their characterizations of genotype network structure are insensitive to algorithmic details [750]. Some computational approaches to predict RNA structure take into account that each possible structure—including an RNA’s minimum free-energy structure—is only metastable. That is, thermal fluctuations cause an RNA molecule to unfold and
refold constantly, and thus to assume a whole spectrum of different secondary structures. The lower a secondary structure’s free energy compared to other structures, the more time the RNA molecule will spend in this structure. The molecule thus spends the relative majority of its time in its minimum free-energy structure. Algorithms to calculate the free energies of secondary structures within an energy range above the minimum free energy exist [864]. I will discuss some observations made with these computationally more demanding algorithms later on (Chapter 11). For the moment, I will focus on minimum free-energy structures, because they expose the key features of genotype network organization most clearly. Yet another class of algorithms to predict RNA structure also takes into account an RNA’s folding kinetics, the temporal order in which base pairs form as an RNA molecule is synthesized. Folding kinetics is an important determinant of structure, especially for long RNA sequences [326, 532]. Algorithms that take folding kinetics into account are computationally demanding [245, 298, 356, 523]. Thus, they are not yet extensively used to study genotype networks. The currently most comprehensive analyses of RNA secondary structures have been carried out by Peter Schuster and his associates [687].
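The statement that an RNA molecule spends more time in structures of lower free energy can be made quantitative under the standard assumption of thermodynamic equilibrium, in which the fraction of time spent in a structure is proportional to its Boltzmann weight. The sketch below assumes this, uses free energies in kcal/mol at 37 °C, and takes a made-up list of energies as input. It is not one of the suboptimal-folding algorithms cited in the text, which also have to enumerate the structures themselves.

```python
# Illustrative sketch: equilibrium fractions of time spent in alternative
# structures, assuming Boltzmann statistics. Energies are hypothetical.
import math

GAS_CONSTANT = 0.0019872  # kcal / (mol K)


def boltzmann_fractions(free_energies, temperature=310.15):
    """Fraction of time in each structure, given free energies in kcal/mol."""
    rt = GAS_CONSTANT * temperature
    lowest = min(free_energies)  # subtracting the minimum avoids overflow
    weights = [math.exp(-(e - lowest) / rt) for e in free_energies]
    partition_sum = sum(weights)
    return [w / partition_sum for w in weights]


if __name__ == "__main__":
    # A minimum free-energy structure and two suboptimal alternatives.
    energies = [-10.0, -9.5, -8.0]
    for energy, fraction in zip(energies, boltzmann_fractions(energies)):
        print(energy, round(fraction, 3))
```

The minimum free-energy structure always receives the largest single share, but when many suboptimal structures lie close in energy, that share can be well below one half, which is why the text speaks of a relative majority.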
RNA genotype networks have highly nonuniform sizes

The most basic prerequisite for the existence of genotype networks is that multiple genotypes can form the same phenotype. Like proteins, RNA molecules fulfill this prerequisite. The average number of genotypes per phenotype can be determined exhaustively for short RNA sequences. It can be estimated through combinatorial analysis for longer sequences [331, 686]. This number is astronomical, even for sequences of moderate length. For instance, there are $4^{20} = 1.10 \times 10^{12}$ RNA sequences with 20 nucleotides, but no more than 2741 distinct minimum free-energy structures of 20 nucleotides. This implies that, on average, there are more than 400 million RNA sequences per structure, even for sequences as short as 20 nucleotides. The discrepancy between number of sequences and number of structures becomes much greater for longer sequences: the number of RNA sequences of a given length S is $4^S$, whereas the number of RNA structures scales with S as approximately $1.8^S$ [688]. There are therefore over $2^S$-fold more sequences than structures. Analogously to proteins, a genotype set is the set of RNA sequences that fold into the same structure. A genotype network is a connected set of RNA sequences with the same phenotype. That is, every pair of RNA sequences in this network can be connected through a series of single nucleotide changes that do not change the structure. The size of a genotype set depends strongly on the associated structure: most structures have few sequences folding into them, but the vast majority of sequences fold into a small number of structures with large genotype set sizes. Figure 4.5 shows a plot of the distribution of genotype set sizes (expressed both as a fraction of genotype space size, and in absolute numbers of genotypes) for structures found in a random sample of one million RNA sequences that are each 30 nucleotides long [830]. The plot clearly illustrates that this distribution is highly skewed and far from uniform. Some structures have genotype sets that are 1000 times larger than those of other structures, even in this modest sample of sequences that comprises only a tiny fraction $10^6/4^{30} = 9 \times 10^{-13}$ of sequence space. For even shorter sequences or sequences with a restricted “alphabet” of only two nucleotides, the size distribution of genotype sets can be determined exactly through exhaustive folding of all sequences [294, 685, 686]. For instance, there are more than $10^9$ sequences of S = 30 nucleotides that consist only of G and C nucleotides. These sequences fold into one of approximately $2 \times 10^5$ structures, yielding on average 5000 sequences per structure. If one defines a frequent structure as one that is realized by more sequences than this average, then only 10.4 percent of structures are frequent. However, 93 percent of sequences fold into these frequent structures.
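The counting arguments above can be checked with a few lines of arithmetic. This sketch only reuses numbers quoted in the text (2741 structures for length 20, roughly $2 \times 10^5$ structures for GC-only sequences of length 30, and the approximate $1.8^S$ scaling); the function names are my own and purely illustrative.

```python
def sequence_space_size(length, alphabet_size=4):
    """Number of distinct sequences of the given length."""
    return alphabet_size ** length


def sequences_per_structure_scaling(length, structure_base=1.8):
    """Approximate ratio of sequences to structures, 4^S / 1.8^S.
    The base 1.8 is the approximate scaling quoted in the text."""
    return (4 / structure_base) ** length


# Length 20, four nucleotides: 4^20 sequences versus at most 2741 structures.
n20 = sequence_space_size(20)
per_structure_20 = n20 / 2741

# Length 30, GC-only alphabet: 2^30 sequences versus about 2e5 structures.
gc30 = sequence_space_size(30, alphabet_size=2)
per_structure_gc30 = gc30 / 2e5

# Fraction of sequence space covered by a sample of one million 30-mers.
sample_fraction = 1e6 / sequence_space_size(30)

print(f"4^20 = {n20:.3e}; sequences per structure > {per_structure_20:.2e}")
print(f"2^30 = {gc30:.3e}; about {per_structure_gc30:.0f} sequences per structure")
print(f"sample fraction = {sample_fraction:.1e}")
print(f"ratio at S = 100: {sequences_per_structure_scaling(100):.2e}")
```

Because 4/1.8 is slightly larger than 2, the ratio of sequences to structures grows faster than $2^S$, which is what "over $2^S$-fold" expresses.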
At the other end of the structure spectrum, one finds 12,362 structures formed by only one sequence, and more than half of all possible structures are formed by fewer than 100 sequences [294, 685, 686]. As a sequence’s length increases, the frequent structures occupy an increasingly large fraction of sequence space. For very long sequences, almost all
sequences fold into a vanishingly small fraction of structures. These properties are not peculiarities of small molecules or molecules with restricted nucleotide composition. They are typical of RNA molecules.

Figure 4.5 The number of sequences folding into one structure has a highly skewed distribution. For this figure, I randomly sampled $10^6$ RNA sequences of length 30, determined their minimum free-energy secondary structure, and ranked each structure according to how frequently it occurred in this sample [332]. For a structure that occurs sufficiently frequently, one can estimate the structure’s genotype set size from its number of occurrences in the sample. The plot shows structure rank (horizontal axis) plotted against estimated genotype set size (vertical axes), expressed as a fraction of the size of sequence space, $4^{30}$ (left vertical axis), and in terms of absolute numbers (right vertical axis). For rarely found structures that occur only once or few times, this procedure overestimates genotype network size [830]. Note the logarithmic scale on the vertical axis. Structure frequencies vary by more than a factor $10^3$ even in this modest sample of short sequences.

With these observations in mind, it becomes clear that rare phenotypes, phenotypes with small genotype networks, may not play an important role in evolution. They are difficult to find in a genotype space that is almost completely filled with the large genotype networks of frequent phenotypes. Are there any obvious features that render a structure frequent? It has been suggested that structures with large genotype networks show a balance between stacked regions that provide thermodynamic stability and looped regions that can be realized by many different RNA sequences [249]. However, heuristic algorithms to predict genotype network size from these and other simple structural characteristics have so far had limited success [147, 149, 378]. A next question regards the organization of all genotypes with the same phenotype in genotype space. The only phenotypes of any practical importance here are frequent phenotypes with large genotype sets. It has been shown that three main possibilities for their organization exist [640, 687]. First, these genotypes can be connected in a single large genotype network; second, the vast majority may fall into a single genotype network, with a small minority being organized into much smaller networks or isolated genotypes; and third, they
may form a small number of genotype networks that are each very large in size. A mathematical observation that I will revisit in Chapter 6 shows that a surprisingly simple property can predict whether the first possibility holds, i.e., whether a single genotype network exists. This property is the fraction n of a sequence’s neighbors that have the same phenotype as the sequence itself, averaged over all sequences with the same phenotype [640, 641]. Specifically, if this average fraction of neutral neighbors exceeds a value of 0.37, then the sequences with the phenotype in question will generally be connected in a single genotype network. The more frequent a phenotype is, the greater the average number of neutral neighbors of its genotypes, and the more likely it is that all of these genotypes form a single connected network [830]. In genotypes with frequent phenotypes, the average fraction of neutral neighbors will often exceed 0.37 [378, 640]. In sum, regardless of whether all genotypes with the same phenotype are connected in a single genotype network, a large fraction of them can typically be reached from one another through single-point mutations.
In addition to being organized into one or few genotype networks, the genotypes that form a frequent phenotype are also extremely diverse. Computationally, this can be shown by randomly changing some genotype G in single mutational steps, while requiring that each mutation preserves the minimum free-energy structure, until a genotype maximally distant from G has been reached. By repeating this procedure for different starting genotypes with the same phenotype, one can obtain a distribution of the maximal distance D of genotypes with the same phenotype. This distance D is conveniently expressed as the fraction of nucleotides that differ among these genotypes. Even for short random sequences (thus random phenotypes) of 100 nucleotides, this maximal distance is on average greater than 0.95 [687]. This means that for typical phenotypes, genotypes can differ in more than 95 percent of their nucleotides while preserving their phenotype. Figure 4.6 illustrates that very different genotypes (with the same phenotype) can be reached even with a limited number of mutations away from a starting genotype. To create this figure, I used the
approach I have just discussed and applied it to 400 randomly chosen sequences of length 100. All these observations complement empirical evidence I discussed earlier from extremely diverse natural RNA molecules. The ability to diversify genotypically while preserving a phenotype is thus not a peculiarity of some natural RNA molecules. It is a generic feature of RNA.

Figure 4.6 RNA molecules with the same structural phenotype can have very different genotypes. The distribution of genotype distances based on 466 RNA molecules with “typical” secondary structures, i.e., structures formed by 466 sequences drawn at random from genotype space. For each such sequence, a random walk in genotype space was carried out. This random walk comprised 5000 mutations. Each step in each walk had to preserve the sequence’s structure, and it was not allowed to decrease the distance to the starting sequence. The figure shows the distribution of genotype distances to the starting sequence at the endpoints of these random walks, expressed as the fraction of nucleotides in which the two sequences differed. The mean genotype distance is 0.87. For most genotypes, the maximally achievable distance could be even higher than shown here, because only a limited number of mutations was used in this analysis. (Histogram; horizontal axis: genotype distance, from 0 to 1; vertical axis: number of sequences.)
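The structure-preserving random walk described above (each step must keep the phenotype and must not decrease the distance to the starting genotype) can be sketched as follows. This is a minimal illustration under loud assumptions: `toy_fold` is a hypothetical stand-in that pairs mirror-image positions and is not a thermodynamic folding algorithm, and the sequence length, seed, and function names are my own choices rather than the procedure used for Figure 4.6.

```python
import random

BASES = "ACGU"
# Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def toy_fold(seq, min_loop=3):
    """HYPOTHETICAL stand-in for a folding algorithm (not thermodynamic):
    pair position i with its mirror position n-1-i whenever the two bases
    can pair and more than `min_loop` positions separate them."""
    n = len(seq)
    structure = ["."] * n
    for i in range(n // 2):
        j = n - 1 - i
        if j - i > min_loop and (seq[i], seq[j]) in PAIRS:
            structure[i], structure[j] = "(", ")"
    return "".join(structure)


def genotype_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def neutral_walk(start, fold, steps, rng):
    """Random walk of point mutations that preserves fold(start) and never
    decreases the distance to the starting sequence."""
    target = fold(start)
    current = start
    for _ in range(steps):
        pos = rng.randrange(len(current))
        new_base = rng.choice([b for b in BASES if b != current[pos]])
        candidate = current[:pos] + new_base + current[pos + 1:]
        if fold(candidate) != target:
            continue  # mutation changes the phenotype: reject
        if genotype_distance(start, candidate) < genotype_distance(start, current):
            continue  # mutation moves back toward the start: reject
        current = candidate
    return current


rng = random.Random(0)
start = "".join(rng.choice(BASES) for _ in range(30))
end = neutral_walk(start, toy_fold, steps=5000, rng=rng)
print(toy_fold(start))
print(start, end, genotype_distance(start, end))
```

To approximate the analysis in the text, one would replace `toy_fold` with a minimum free-energy folding routine and repeat the walk from many random starting sequences.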
Individual genotype networks are highly heterogeneous

Similar to my earlier discussion of proteins, I will now turn to features of individual RNA genotype networks that are important for innovability. The first of these regards the neighborhoods of individual sequences. I have already discussed that individual genotypes on a genotype network usually have multiple neighbors with the same phenotype. The actual number, however, varies greatly among members of the same genotype network; Figure 4.7 shows an example. It is based on sequences sampled at random from the genotype network of a transfer RNA that incorporates the amino acid phenylalanine into nascent proteins [745]. The figure shows the fraction n of neighbors with the same phenotype as this transfer RNA. For a molecule of this length (S = 76 nucleotides), a fraction n = 0.1 would correspond to approximately 23 neighbors with the same phenotype. The distribution in Figure 4.7 is clearly quite broad. Just as in proteins, some sequences have many neutral neighbors whereas others have few. What about the remainder of a genotype’s neighborhood, the sequences that form different phenotypes? Are these phenotypes very similar to one another? Just as it is possible to measure distance among sequence genotypes, various measures of phenotype (structure) distance exist [250]. Perhaps the simplest one determines the fraction of different symbols in the dot-parenthesis representation (Figure 4.1) of two secondary structures. It indicates the number of base pairs in which two structures differ. Figure 4.8a shows as an example an arbitrary RNA sequence of length S = 30 and its minimum free-energy secondary structure. Seventeen percent of the sequence’s 3S = 90 neighbors have the same structure as the sequence itself. The remainder covers a broad spectrum of structures that differ in between 1 and 19 base pairs from the reference structure shown on the left. Examples of these structures are shown near the individual sections of the pie chart. Figure 4.8b [745] shows a similar pattern for a natural RNA molecule. It is the Hammerhead ribozyme of peach
latent mosaic viroid, a simple plant parasite whose RNA genome can cleave itself [323]. The figure shows the distribution (black bars) of the structure distances for the neighborhoods of sequences sampled from the genotype network of this ribozyme [745]. This distribution is not a peculiarity of the particular phenotype shown, it also holds for random genotypes of the same length (gray bars), and independently of the particular structure distance measure used [745].

Figure 4.7 RNA sequences have many neighbors with the same phenotype. The figure is based on more than $10^4$ sequences sampled at random from the genotype network of the cloverleaf RNA structure characteristic of tRNAPhe (S = 76), a transfer RNA responsible for transporting phenylalanine to the ribosome during translation [745]. The figure shows the distribution of the fraction of neighbors with the same minimum free-energy structure as the sampled sequences. The generally large number of such neighbors is not a peculiarity of tRNAPhe, but typical of RNA structures [745, 830]. (Histogram; horizontal axis: fraction of neighbors with the same phenotype, from 0.1 to 0.5; vertical axis: number of sequences.)

Figure 4.8 The neighborhood of a genotype contains a broad spectrum of phenotypes. (a) The pie chart shows the distribution of the distance d between the RNA secondary structure phenotype on the left, and the RNA phenotypes in the neighborhood of the sequence shown. Distance d between two structures is measured as the number of differences in their dot-parenthesis representation (Figure 4.1), and indicated by gray shading in the pie chart. Encircling the pie chart are graphical examples of some structures, together with their distances and the percentage of the neighborhood that their genotypes occupy. (b) For a 54nt “hammerhead” RNA structure involved in the self-cleavage of peach latent mosaic viroid [18], the histogram shows the distribution of structure distances, for phenotypes found in the neighborhoods of many sequences sampled at random from the genotype network of this structure (black bars). The gray bars show an analogous histogram, but for randomly chosen RNA molecules of length 54, and thus for random phenotypes [745]. The figure demonstrates that the broad distribution of structures with a variety of distances d in a sequence neighborhood is not a peculiarity of individual sequences or structures, but a typical feature of RNA molecules. Similar observations have been made for RNA molecules of different length, and using different distance measures [745]. (Panel a labels: neighborhood of 3S = 90 neighbors, 32 different structures, 17 different distances; labeled pie segments: neutral, d = 0, 17%; d = 1, 26%; d = 2, 7%; d = 6, 7%; d = 9, 8%; d = 19, 8%. Panel b: horizontal axis, structure distance; vertical axis, number of sequences; legend, Hammerhead (54-mer) and Random 54-mer.)

To summarize, the genotypes of RNA phenotypes are typically organized into vast genotype networks that reach far through sequence space. Different genotypes on a genotype network vary in their number of neutral neighbors. The remaining (nonneutral) neighbors form a broad spectrum of phenotypes. These properties are a generic feature of RNA phenotypes. They hold both for biological RNA molecules, and for RNA molecules sampled at random from genotype space.
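The two quantities discussed in this section, the fraction of neutral neighbors and the spectrum of structure distances in a sequence's neighborhood, can be computed as sketched below. The distance follows the definition in the caption of Figure 4.8 (number of differing symbols in the dot-parenthesis strings). As an assumption made only for illustration, the phenotype is given by a hypothetical `toy_fold` that pairs mirror-image positions; it is not a minimum free-energy prediction, and the example sequence is arbitrary.

```python
from collections import Counter

BASES = "ACGU"
# Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def toy_fold(seq, min_loop=3):
    """HYPOTHETICAL stand-in for a folding algorithm (not thermodynamic):
    pair position i with its mirror position n-1-i whenever the bases can
    pair and more than `min_loop` positions separate them."""
    n = len(seq)
    structure = ["."] * n
    for i in range(n // 2):
        j = n - 1 - i
        if j - i > min_loop and (seq[i], seq[j]) in PAIRS:
            structure[i], structure[j] = "(", ")"
    return "".join(structure)


def one_mutant_neighbors(seq):
    """Yield all 3S sequences that differ from seq in exactly one position."""
    for i, old in enumerate(seq):
        for base in BASES:
            if base != old:
                yield seq[:i] + base + seq[i + 1:]


def structure_distance(s1, s2):
    """Number of positions at which two dot-parenthesis strings differ."""
    if len(s1) != len(s2):
        raise ValueError("structures must have equal length")
    return sum(a != b for a, b in zip(s1, s2))


def neighborhood_spectrum(seq, fold):
    """Count the neighbors of seq by structure distance to fold(seq);
    distance 0 corresponds to neutral neighbors."""
    reference = fold(seq)
    return Counter(structure_distance(reference, fold(nb))
                   for nb in one_mutant_neighbors(seq))


seq = "GAAAAAAC"  # arbitrary example sequence
spectrum = neighborhood_spectrum(seq, toy_fold)
neutral_fraction = spectrum[0] / sum(spectrum.values())
print(toy_fold(seq), dict(spectrum), round(neutral_fraction, 2))
```

For this toy example, 17 of the 24 neighbors are neutral and the remaining 7 lie at distance 2. A realistic folding routine produces the much broader spectrum of distances shown in Figure 4.8.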
Different neighborhoods of the same RNA genotype network contain different novel phenotypes

I next turn to the question of how diverse the phenotypes in the neighborhood of different RNA genotypes are, a question that Figure 4.4 has addressed already for proteins. An early pertinent study carried out long random walks on a genotype network starting from a given sequence [347]. It then determined the cumulative number of unique phenotypes that the random walker encountered in its neighborhood, and found that this number increases nearly linearly, without showing any signs of leveling off for long random walks. This observation suggests that different sequence neighborhoods on a genotype network may contain very different phenotypes. The results of an analysis shown in Figure 4.9 underscore this observation. The starting point of the analysis is a reference genotype G that has a given phenotype. One then generates genotypes $G_D$ that differ in D nucleotides from G and that lie on the same genotype network as G. That is, they have the same phenotype as G. One then examines the 1-neighborhoods of both G and $G_D$, and counts the fraction of phenotypes in the neighborhood of $G_D$ that do not also occur in the neighborhood of G. In other words, one asks how many phenotypes are unique to the neighborhood of $G_D$. Figures 4.9b and 4.9c show the answer for two different RNA molecules with known biological function, as a function of the number of nucleotide differences D. They demonstrate that the fraction of unique phenotypes is greater than 50 percent even for distances as small as D = 2, and eventually approaches a value of greater than 0.8 for large D [745]. The neighborhoods of different sequences on a genotype network thus contain very different phenotypes. This does not only hold for the two biological molecules shown here. It is a generic feature of RNA sequences [745]. Thus, RNA molecules are similar to proteins in the diversity of their neighborhoods.
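The fraction of phenotypes unique to the neighborhood of $G_D$ can be computed as sketched below. A strong caveat applies: as a stand-in for minimum free-energy folding, the sketch uses a simple base-pair maximization (a Nussinov-style dynamic program), which, as noted earlier in this chapter, generally does not yield the most stable structure. The two example sequences, which differ in D = 2 positions, and all function names are hypothetical choices for illustration.

```python
BASES = "ACGU"
# Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairing_fold(seq, min_loop=3):
    """STAND-IN fold: maximize the number of nested base pairs
    (Nussinov-style dynamic program), NOT a free-energy minimization."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: position j stays unpaired
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    # Deterministic traceback that recovers one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if left + 1 + best[k + 1][j - 1] == best[i][j]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return "".join(structure)


def novel_phenotypes(seq, fold):
    """Phenotypes in the 1-neighborhood of seq that differ from fold(seq)."""
    own = fold(seq)
    found = set()
    for i, old in enumerate(seq):
        for base in BASES:
            if base != old:
                found.add(fold(seq[:i] + base + seq[i + 1:]))
    found.discard(own)
    return found


def unique_fraction(g, g_d, fold):
    """Fraction U of novel phenotypes near g_d that do not occur near g."""
    if fold(g) != fold(g_d):
        raise ValueError("G and G_D must have the same phenotype")
    near_gd = novel_phenotypes(g_d, fold)
    if not near_gd:
        return 0.0
    return len(near_gd - novel_phenotypes(g, fold)) / len(near_gd)


g = "GGGGAAAACCCC"    # hypothetical reference genotype G
g_d = "GGGGACCACCCC"  # hypothetical genotype G_D, D = 2 differences from G
print(max_pairing_fold(g), unique_fraction(g, g_d, max_pairing_fold))
```

Repeating this calculation for many genotypes $G_D$ at each distance D, and averaging U, would give curves of the kind shown in Figure 4.9, provided a realistic folding routine is substituted for the stand-in.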
The close proximity of different genotype networks

In previous chapters on metabolism and regulation, I showed that genotype networks of different phenotypes are close together and tightly interwoven in genotype space. We do not know whether this holds for protein genotype space, but RNA genotype space shows this feature. Consider a randomly chosen RNA sequence that folds into a frequent structure. Now choose a completely different frequent structure, and ask how far one has to step away from the original sequence to find a sequence that folds into this second structure. This question can be asked in more general terms: How large is the radius k of the neighborhood (sphere) around one sequence that is sufficient to find a representative of any common structure? Recall that a neighborhood of radius k is a collection of sequences that differ in no more than k nucleotides from the neighborhood’s center sequence. If this radius k were large, such as half that of the entire sequence space (k≤S/2), one would have to traverse at least half of the sequence space to find a representative of every structure. However, k, which can be estimated, is much smaller than (S/2) [685, 686, 688]. For instance, for RNAs of length S = 100 nucleotides, a sphere of k = 15 mutational steps contains with probability one a sequence for any frequent structure. This implies that one has to search only a vanishingly small fraction of sequence space (one $4.52 \times 10^{37}$th for sequences of length 100) to find all common structures. This phenomenon has been called shape space covering [688].
Figure 4.9 The neighborhoods of different RNA sequences on a genotype network contain very different phenotypes. (a) The bottom symbols G and $G_D$ stand for two RNA genotypes (sequences) that differ in D nucleotides, but that form the same phenotype P. The left and right circles symbolize the phenotypes different from P that occur in the neighborhoods of G and $G_D$. Of special interest is the shaded circle segment. It contains the fraction (among all phenotypes in the neighborhood of $G_D$) that occur only in the neighborhood of $G_D$, but not in the neighborhood of G. For brevity, I refer to it as the fraction U of phenotypes unique to the neighborhood of $G_D$. The lower two panels show the distance D (horizontal axes) of two genotypes on the same genotype network plotted against U (vertical axes) for genotypes that form (b) the 54nt hammerhead structure involved in the self-cleavage of peach latent mosaic viroid [323], and (c) the tRNAPhe cloverleaf structure. Note that D here corresponds to a number, not a fraction of changed nucleotides, as in most other figures of the paper. The figure shows that U approaches a value of one rapidly as D increases. The same pattern holds for random RNA structure phenotypes [745]. (Panels b and c: horizontal axis, genotype distance D; vertical axis, fraction U of unique structures, from 0 to 1; data points are mean ± SE.)
Vast network size and tiny occupancy of sequence space

The vast sizes of genotype spaces for both proteins and RNA give rise to counterintuitive properties. One of them is that there is no contradiction between the observation that a genotype network may occupy a tiny fraction of genotype space, yet be astronomical in size. For example, the frequency at which ATP-binding proteins are found in random protein libraries suggests that a tiny fraction $10^{-11}$ of random amino acid sequences can bind ATP [389]. Because the proteins in question are 80 amino acids long, the sequence space comprises $20^{80} = 1.2 \times 10^{104}$ sequences. Thus, the tiny fraction of $10^{-11}$ translates into a huge number of $1.2 \times 10^{93}$ genotypes with ATP-binding phenotypes. Another example comes from experiments with randomized chorismate mutase, a metabolic enzyme necessary for amino acid biosynthesis [761]. These experiments suggest that a fraction $10^{-24}$ of random proteins that are 93 amino acids long encode a protein with the same structure and activity. This fraction is 13 orders of magnitude smaller than the $10^{-11}$ above, probably because it is more difficult to design a protein with specific catalytic activity than a protein that just binds a given molecule. However, even this tiny fraction translates into a genotype network size of $10^{-24} \times 20^{93} = 9.9 \times 10^{96}$ genotypes [761]. For part of the λ repressor, a transcriptional regulator that plays an important role in the life-cycle of the bacteriophage λ, a fraction $10^{-63}$ of all amino acid sequences may yield a functional protein [639]. The genotype network(s) of this protein, which comprises 92 amino acids, thus have $20^{92} \times 10^{-63} = 5 \times 10^{56}$ sequences. If $10^{-63}$ is unimaginably small, then $10^{56}$ surely is unimaginably large.
For instance, a random protein library that contains only one copy of each of $10^{56}$ protein sequences of length 92 would have a mass of $10^{37}$ g, more than a billion times the mass of the earth ($5.97 \times 10^{27}$ g), or several thousand times the mass of the sun ($2 \times 10^{33}$ g). Analogous observations hold for RNA molecules with specific phenotypes [673]. Common lore has it that molecules with some specific phenotype are very rarefied in sequence space. The above observations support this notion. However, they also show that the genotype network(s) associated with such molecules are astronomically large.
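The estimates above multiply a tiny fraction by an enormous sequence space, which is easiest to follow on a logarithmic scale. The sketch below reproduces the three protein examples from the text. The mass estimate additionally assumes an average residue mass of about 110 Da, which is my assumption and not a figure from the text; the function names are illustrative.

```python
import math

AVOGADRO = 6.022e23  # molecules per mole


def network_size_log10(fraction_log10, length, alphabet_size=20):
    """log10 of (fraction of sequence space) x (number of sequences)."""
    return fraction_log10 + length * math.log10(alphabet_size)


def library_mass_grams(size_log10, length, residue_mass_da=110.0):
    """Mass of one copy of each sequence, ASSUMING ~110 Da per residue."""
    return 10 ** size_log10 * length * residue_mass_da / AVOGADRO


# (log10 of the fraction, protein length) as quoted in the text.
examples = {
    "ATP-binding protein (80 aa)": (-11, 80),
    "chorismate mutase (93 aa)": (-24, 93),
    "lambda repressor fragment (92 aa)": (-63, 92),
}
for name, (fraction_log10, length) in examples.items():
    log_size = network_size_log10(fraction_log10, length)
    exponent = math.floor(log_size)
    mantissa = 10 ** (log_size - exponent)
    print(f"{name}: about {mantissa:.1f} x 10^{exponent} genotypes")

repressor_log_size = network_size_log10(-63, 92)
mass = library_mass_grams(repressor_log_size, 92)
print(f"library mass: about {mass:.1e} g, "
      f"{mass / 5.97e27:.1e} earth masses, {mass / 2e33:.1e} solar masses")
```

Under the stated residue-mass assumption, the library mass comes out just below $10^{37}$ g, in line with the order of magnitude given in the text.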
Molecules with a given structure and function may thus be both rare and frequent, in the sense that they have large genotype networks. These observations raise the additional question of whether natural RNA or protein molecules with some biological function are unusually rare compared to other molecules, for example, molecules with typical phenotypes chosen at random from genotype space. That is, are the genotype networks of biological RNA molecules unusually small? To my knowledge, this question is still unanswered for proteins. However, for RNA structures, we have developed an approach that allows estimation of genotype network size for RNA structures of length up to circa S = 100 [378]. Applying this approach to some 80 biological RNA molecules with diverse functions showed not only that their genotype networks are very large (while occupying a tiny fraction of genotype space). It also showed that the genotype networks of these structures are larger than those of randomly chosen RNA structures [378]. In other words, the structure phenotypes of biological RNA molecules are not rarer, but even more frequent, than that of generic RNA structures. I will revisit these observations in Chapter 8, together with a candidate explanation.
Differences between RNA and protein phenotypes

The evidence discussed above highlights multiple similarities between protein and RNA phenotypes. Most importantly, both kinds of phenotypes occur in vast genotype networks that span genotype space. However, there are also some differences between protein and RNA phenotypes, one of which I will now highlight. The majority of random RNA molecules form a well-defined secondary structure, which is often essential to their function [687]. This does not necessarily hold for proteins. Estimates of the fraction of proteins that fold into some structure are based on experimental work on random protein libraries and on theoretical calculations. Such estimates vary broadly between 0.01 and 10 percent, but they generally suggest that a minority of proteins fold [164, 241, 389]. The reason may be that there are fewer amino acid sequences that can fold into a compact hydrophobic core, as is required for protein folding, than there are nucleotide sequences that can form some stable internal base pairing pattern [80, 87, 107, 729]. The consequence of this difference is that protein space is more sparsely populated with folded proteins. Some authors suggest that proteins with different structures are more isolated from one another in protein space, and that genotype networks of different protein structures do not lie close to one another as they do for RNA [80, 107]. From this perspective, folding proteins thus occupy distinct, localized, and isolated islands in sequence space. This perspective, however, cannot be the whole truth. The reason is that proteins with the same function, structure, and common ancestry often have unrecognizably diverse sequences, as I mentioned earlier. Where common ancestry of diverse proteins is unclear, sequence intermediates that demonstrate such common ancestry often exist [594]. This suggests that genotype networks of proteins still span sequence space or nearly so, just as they do for RNA [687]. None of these differences between protein and RNA argue against the importance of genotype networks for evolutionary innovations in proteins. Many innovations, for example, occur without transforming the scaffold provided by a given structure [239, 371, 849]. However, we still have much to learn about the organization of protein phenotypes in sequence space.
Experimental evidence on genotype networks and their role in innovation

I have opened this chapter discussing empirical evidence for the extreme diversity of proteins with the same function or structure. I will close it with a focus on relevant experimental data. Such data are currently available only for the evolution of novel RNA phenotypes. I will first discuss an experiment that demonstrates the importance of genotype networks in the creation of new molecular functions [684]. The subjects of this experiment are one natural and one synthetic ribozyme, that is, an RNA molecule that can catalyze chemical reactions. The natural ribozyme (Figure 4.10a, right) is encoded by the hepatitis delta virus, a human pathogen with a single-stranded RNA genome. This ribozyme catalyzes its own cleavage, a reaction that is necessary to complete the viral lifecycle. The synthetic RNA is the class III self-ligating ribozyme (Figure 4.10a, left), which joins an oligonucleotide substrate to its own 5’ end, and was isolated in the laboratory from a pool of random RNAs. The sequences of these two ribozymes have no more than the 25 percent sequence identity expected by chance alone, and no structural similarities that might favor the nearness of their respective genotype networks [684]. Nevertheless, Schultes and Bartel [684] were able to design an RNA molecule that simultaneously has both catalytic activities, that of the self-cleaving ribozyme and that of the ligase. This sequence is more than 40 mutational steps away from both the prototype ligase and from the prototype self-cleaving ribozyme. Its activity is substantially lower than that of the prototype ribozymes (Figure 4.10b), but still 70 times higher than that of uncatalyzed RNA cleavage, and 460 times higher than that of the uncatalyzed ligase reaction. Importantly, this hybrid sequence can be linked via a series of point mutations to both prototype ribozymes, without reducing its activity. Two point mutations in the direction of the ligase restore near wild-type levels of ligase activity, and two point mutations in the other direction restore near wild-type levels of the self-cleavage activity. The remaining, approximately 40 point mutations in either direction keep the catalytic activity close to the level of the prototype ligase and the self-cleaving ribozyme (Figure 4.10b). By constructing a hybrid ribozyme and constructing a path through sequence space back to its ancestors, this work makes two key points. First, many changes in a genotype are possible that do not affect an RNA’s (catalytic) phenotype. Second, these changes can be very important intermediate steps in creating a new catalytic function. Similar principles have been suggested for other ribozymes [50]. A third point of this experiment is that RNAs with different functions can be near each other in sequence space.
This notion is independently supported by a different study that started from an RNA molecule whose phenotype was the ability to bind ATP with high specificity [338]. The study’s authors changed this molecule through random mutation followed by selection for a new phenotype, namely the ability to bind a different molecule related to GTP. The experiment revealed several RNA molecules with this ability. These RNA molecules had different structures, but they were close to
the original ATP-binding molecule in sequence space. A final experimental example provides a hint that the neighborhoods of different but similar genotypes may contain very different novel phenotypes. The phenotype at issue is a ribozyme with a new catalytic activity. The experiment started from a ribozyme capable of modifying its own 3’ end by adding an adenylated phenylalanine [157]. This reaction is an aminoacylation, which is important to load amino acids onto transfer RNA molecules for protein synthesis. The goal of the experiment was to completely change the enzymatic activity of the starting ribozyme into that of a kinase that adds a thiophosphate group to its own 5’ end. Aminoacylation and kinase reactions were chosen for the experiment, because they are both biochemically important yet very different reactions [157]. By mutagenizing the starting ribozyme and selecting for the new biochemical activity, the study’s authors readily found 23 different kinases whose structures were different from the parent’s structure. The probability of finding such ribozymes increased with the distance from the starting ribozyme. That is, ribozymes with the novel activity were less likely to be found very close to the starting ribozyme, and more likely at moderate distances (10–15 mutations out of 90 nucleotides) from the parent.

Figure 4.10 Mutational paths on genotype networks lead to new ribozyme functions. (a) The two starting ribozymes discussed in the text. (b) A series of mutations that do not destroy catalytic activity connect a hybrid ribozyme with the two starting ribozymes. The horizontal axis shows the distance (in single nucleotide changes) of the hybrid ribozyme (positioned at distance 0) from its ancestors, the ligase (towards the left) and the hepatitis delta virus ribozyme (towards the right). The vertical axis shows the reaction rate of each ribozyme (gray = ligation, black = self-cleavage) as a fraction of the rate realized by the respective ancestor. The relative rate for the uncatalyzed ligation reaction is indicated by the short-dashed line (ligation with formation of a 2’-5’ linkage) and the dotted line (ligation with formation of a 3’-5’ linkage). The rate of the uncatalyzed cleavage reaction is indicated by the long-dashed line. From figure 3A in [684], used with permission from AAAS. (Panel a: secondary-structure drawings of the class III self-ligating ribozyme and the hepatitis delta virus ribozyme. Panel b: vertical axes, relative ligation rate and relative cleavage rate on a logarithmic scale; labeled points, LIG1, LIG2, Intersection, HDV1, HDV2.)
Summary
A growing body of evidence points to the existence of genotype networks in both RNA and protein phenotypes, and to their importance for evolutionary innovation. This evidence comes from comparative analysis of protein and RNA genotypes and phenotypes, from laboratory evolution experiments, and from computational analysis of RNA and model protein structures. This evidence shows that many genotypes can form the same phenotype, even for molecules of modest size. Some phenotypes have a much larger set of associated genotypes (the genotype set) than others. Only phenotypes with a large genotype set are of practical importance, because only they could be found in a vast genotype space through an evolutionary search driven by random mutations. The vast majority of genotypes in a genotype set falls into one or few genotype networks of astronomical size. These networks often nearly span genotype space. This means that two molecules with the same phenotype may share little or no sequence similarity. A series of mutational changes on a genotype network can preserve a phenotype while exploring an ever-changing spectrum of new phenotypes. Laboratory evolution experiments show that these properties facilitate the evolution of new function by allowing exploration of new phenotypes while leaving an existing phenotype unchanged. All these principles are most easily explained if one focuses on native structures or phenotypes, to the neglect of the continuous unfolding and refolding of individual conformations caused by thermal noise and other environmental influences. I will discuss such phenotypic plasticity and its role for innovation in Chapter 13.
CHAPTER 5
The origins of evolutionary innovation
In the preceding chapters, I examined three very different classes of biological systems. These are large, metabolic networks, biological circuits that regulate gene activity, as well as protein and RNA molecules. Most evolutionary innovations arise through changes in these systems. Any theory of innovation thus needs to apply to systems as different as these. At first sight, this may seem impossible, precisely because these systems are so different; but on a deeper level, they also share important similarities. These similarities can help us understand the ability of living things to innovate. Here I will summarize key material from the previous chapters, highlight these similarities, and point out how they affect innovability. You can find most relevant literature references in the previous chapters.
Genotypes and phenotypes
When discussing biological macromolecules, I focused on protein and RNA molecules, because they perform most catalytic, transport, support, regulation, and communication functions in a cell. An RNA molecule’s genotype is a sequence of ribonucleotides (Figure 5.1) or, equivalently, the DNA sequence encoding this RNA. Proteins are also encoded by RNA or DNA sequences; their genotype is the encoding DNA sequence. However, for my purpose, an amino acid based representation of protein genotype is more economical. With this representation, we are spared having to conceptually translate the encoding nucleotide sequence into an amino acid sequence for every single protein genotype. Such translation becomes especially tedious when studying change in protein genotypes, because the genetic code’s redundancy causes many nucleotide changes to have no effect on the amino acid sequence and thus on the protein [100]. Moreover, most of the complexity in forming protein phenotypes does not lie
in the steps leading from DNA to amino acid sequences, but from the spatial folding of amino acid sequences into secondary and tertiary structures. Both RNA and protein phenotypes have two key aspects. One is the arrangement of a sequence in two- and three-dimensional space. The second aspect is a molecule’s biological function, be it the chemical reaction it catalyzes, the structural support it provides, or any other process it is a part of. Because structure is usually a prerequisite for function, it is a worthy subject of study in and by itself. The second class of system I explored was regulatory circuits. The DNA sequences that encode all parts of a circuit and the interactions between these parts comprise a circuit’s genotype. These interactions may be encoded in the parts themselves, such as for interactions between proteins; or they may involve DNA that does not encode proteins, such as the short DNA motifs that bind transcriptional regulators. It would be cumbersome to understand a circuit directly from its encoding DNA sequence, even more so than for proteins. The problem is analogous to that faced by a computer novice learning how a computer program works. She can wade through the binary numbers that are the program’s instructions to the hardware; or she could study the program in its higher-level programming language. The latter would be much more effective. Both representations are correct, but one of them is better for the purpose at hand. Similarly, if we want to understand what regulatory circuits do, it is best to represent their genotypes on a higher level than that of a DNA string. The representation I chose earlier encodes the regulatory interactions of circuit parts through numerical parameters that indicate the strengths of these interactions. After all, it is these interactions that
determine what a circuit does. For a regulatory circuit with S regulatory molecules, there are $S^2$ possible pairwise interactions, and the strengths of these interactions (many of which may be absent and thus have zero strength) represent a circuit’s genotype. I focused on a particularly important class of circuits, namely transcriptional regulation circuits. Here, the regulatory molecules are transcriptional regulators encoded by circuit genes. Their pairwise regulatory interactions involve a regulator’s binding of DNA near a circuit gene, and the activation or repression of that gene. The phenotype of such a circuit is a pattern of gene expression that the circuit’s regulatory interactions produce. (In Chapter 14, I will discuss other kinds of circuits.)

Figure 5.1 An overview of analogous notions of genotype and phenotype in metabolic networks, regulatory circuits, proteins, and RNA.
Metabolism
  Genotype: DNA encoding enzyme-catalyzed metabolic reactions
  Phenotype: ability to synthesize biomass molecules from a given set of nutrients
Regulation
  Genotype: DNA encoding regulatory interactions among molecules
  Phenotype: gene expression pattern, concentration or activity of regulatory molecules
Molecules: Protein
  Genotype: amino acid sequence
  Phenotype: protein fold or biochemical activity
Molecules: RNA
  Genotype: nucleotide sequence
  Phenotype: RNA fold or biochemical activity

On the next level of organization, I explored genome-scale metabolic networks. The main task of a metabolic network is to synthesize all of a
cell’s molecular building blocks, including amino acids, nucleotides, sugars, and lipids. I refer to these building blocks as a cell’s biomass components or biomass precursors. An organism’s metabolic genotype encodes the metabolic enzymes that catalyze all chemical reactions in its metabolic network. The most effective genotype representation for my purpose reflects which reactions are present in a given metabolic network (Figure 2.1, Figure 5.1). These reactions form part of a much larger “universe” of possible enzyme-catalyzed reactions. The phenotype of a metabolic network is the ability to synthesize all biomass components in one or more chemical environments. Because of the central role carbon plays in life, I here focused on chemical environments differing in their carbon sources; these carbon sources can also serve as energy
sources. Most of what I say would also hold for sources of other elements [653]. I classified metabolic phenotypes according to the carbon/energy sources that a network can use as the only source to synthesize all biomass compounds in an otherwise minimal chemical environment. In this context, a metabolic phenotype can be most simply represented as a binary string, each of whose entries corresponds to one of the carbon/energy sources that an organism can import from the environment. If an organism can synthesize all biomass compounds from carbon source i, then this string will contain a one at position i, and otherwise a zero. I emphasize that these and other discrete representations of genotypes and phenotypes are abstractions that serve to develop important concepts more clearly. Further, below I will revisit their merits and limitations. I will now summarize important commonalities of all but the smallest metabolic networks, regulatory circuits, and molecules.
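The binary-string representation of a metabolic phenotype described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the carbon source names, the set of viable sources, and the function name `metabolic_phenotype` are hypothetical and are not taken from the analyses in the earlier chapters. In practice, whether a network can synthesize all biomass compounds from a given source would have to come from a separate viability computation.

```python
from typing import Callable, Sequence


def metabolic_phenotype(
    carbon_sources: Sequence[str],
    can_grow_on: Callable[[str], bool],
) -> str:
    """Return a binary string with '1' at position i if all biomass
    compounds can be synthesized from carbon source i, and '0' otherwise."""
    return "".join("1" if can_grow_on(src) else "0" for src in carbon_sources)


# Hypothetical example data (an assumption for illustration only).
sources = ["glucose", "acetate", "glycerol", "succinate"]
viable = {"glucose", "glycerol"}

phenotype = metabolic_phenotype(sources, lambda src: src in viable)
print(phenotype)  # 1010
```

Two networks with different reaction sets have the same phenotype in this representation whenever their strings are identical.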
Many more genotypes than phenotypes
The above three systems share a simple feature with far-reaching consequences: they have many more genotypes than phenotypes. In metabolic networks, the number $2^S$ of possible genotypes is determined by the size S of the “universe” of enzyme-catalyzed reactions. The known universe currently comprises more than $S = 5 \times 10^3$ such reactions, and may well be much larger [571]. The number of possible metabolic network genotypes is thus also very large. To see that there are fewer metabolic phenotypes than genotypes, recall (Chapter 2) that there are many possible pathways of synthesizing all biomass components from a given sole carbon source. Each of these pathways corresponds to one genotype. A large body of evidence shows that even central functions of metabolism cannot be executed in just one optimal way, but by different pathways of metabolic reactions. Furthermore, in any given environment, many reactions in a metabolic network can be eliminated, thus changing a genotype without necessarily changing a metabolic phenotype [64, 208, 248, 309, 690, 730, 839]. Taken together, these observations demonstrate an excess of genotypes over phenotypes. Because they hold for any one carbon source
(and for other elemental sources), they also apply to any combination of carbon sources. In addition, empirical data on metabolic network composition of a broad spectrum of organisms suggests that they can utilize only a small fraction C of all possible carbon sources, rendering the total number of phenotypes ($2^C$) vastly smaller than the number of genotypes [202]. In a transcriptional regulation circuit of size S, that is, with S genes, there are $S^2$ possible pairwise regulatory interactions (and many more higher order interactions). Even if we restrict ourselves to the simple abstraction of admitting only three kinds of interactions (activating, repressing, or absent), there are $3^{S^2}$ genotypes such a circuit can have. In contrast, the total number of phenotypes (gene expression states) such a circuit could have in a single cell would be of the order of $2^S$, if genes are counted as being either on or off. These numbers of genotypes and phenotypes would rise dramatically, if we admitted finer gradations of interaction strengths and gene expression levels. However, because the number of possible interactions is proportional to the square of the number of genes, whereas the number of expression states is proportional only to the number of genes itself, the number of possible regulatory genotypes would generally be exponentially larger than the number of possible phenotypes. This would hold for any kind of regulatory system: the number of phenotypes scales with the number of molecules, whereas the number of regulatory genotypes scales with the much larger number of possible interactions among molecules. Let us now turn to protein molecules. Proteins with S amino acids have $20^S$ possible genotypes. Even for short genotypes of 100 amino acids, this number ($\approx 10^{130}$) may be many orders of magnitude larger than the number of hydrogen atoms in the universe.
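To make the counting argument for regulatory circuits concrete, the following sketch compares the number of circuit genotypes with the number of on/off expression states for a few circuit sizes. It works with base-10 logarithms so that the numbers stay manageable. The function names are my own, and the restriction to three interaction kinds and two expression states per gene is the simple abstraction used in the text, not a claim about real circuits.

```python
import math


def log10_circuit_genotypes(n_genes: int, kinds: int = 3) -> float:
    """log10 of kinds ** (n_genes ** 2): every one of the n_genes^2
    possible pairwise interactions takes one of `kinds` values."""
    return n_genes ** 2 * math.log10(kinds)


def log10_expression_states(n_genes: int, levels: int = 2) -> float:
    """log10 of levels ** n_genes: every gene is in one of `levels` states."""
    return n_genes * math.log10(levels)


for n_genes in (5, 10, 20, 40):
    genotypes = log10_circuit_genotypes(n_genes)
    phenotypes = log10_expression_states(n_genes)
    print(
        f"S = {n_genes:2d}: about 10^{genotypes:.1f} genotypes, "
        f"10^{phenotypes:.1f} phenotypes, "
        f"excess 10^{genotypes - phenotypes:.1f}"
    )
```

The exponent of the genotype count grows with the square of the circuit size, while that of the phenotype count grows only linearly, which is why the excess widens so quickly.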
To see that there are fewer phenotypes than genotypes, let us first focus on the protein fold (tertiary structure) aspect of phenotype. Here, available structural evidence shows that there are of the order of $10^4$ protein folds (Section 5.2). Protein folding models, where genotypes and phenotypes can be exhaustively enumerated, also show many fewer phenotypes than genotypes. The number of protein function phenotypes is different from that of structure phenotypes. The
reason is that structure and function do not show a one-to-one relationship. For example, the catalytic sites of enzymes are formed by a precise local juxtaposition of few amino acids relative to the total size of a protein. They can be thought of as local “decorations” of a global fold. Any one fold may harbor different such decorations. Conversely, different folds may have the same enzymatic activity. Our understanding of the universe of protein function phenotypes is still limited, but some order-of-magnitude statements can be made about its size. A comprehensive effort to classify functions of proteins and other molecules is the widely used “gene ontology” classification system [32]. It currently recognizes $8 \times 10^3$ different molecular functions. Another is a classification of enzymatic functions that is long-established and supported by more than a century of biochemical research, and thus perhaps better founded than that of gene ontology [133]. The currently most comprehensive metabolic reaction database lists fewer than $10^4$ enzymatic functions [571]. These estimates should be taken with a grain of salt because classifying protein functions, let alone counting them, is difficult. Also, these numbers may not include many as yet undiscovered functions. However, even if the universe of protein or enzyme function were a million times as large as our current knowledge indicates, the total number of functions ($10^{10}$) would still be paltry compared to the total number of protein genotypes. The second class of functionally important macromolecules, RNA, also has an astronomical number of $4^S$ genotypes for molecules comprising S nucleotides.
Although little is known about the number of RNA tertiary structures, the number of possible secondary structure phenotypes scales approximately as $1.8^S$ (Chapter 4). This means that as the length of an RNA molecule increases, there are exponentially more RNA genotypes than phenotypes (approximately $(4/1.8)^S$). Many fewer catalytic functions are known for RNA than for proteins, perhaps because RNA molecules have fewer building blocks (four instead of twenty), and are thus more restricted in the spectrum of molecular shapes required for catalysis. In sum, in disparate classes of biological systems, there are more genotypes than phenotypes. Where sufficient information exists to enumerate these
phenotypes, there are exponentially more genotypes than phenotypes, as a function of the number S of system parts. This means that any one phenotype typically has many genotypes that form it. Many of the more complex phenomena below rest on this deceptively simple fact.
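The molecular numbers quoted in this section can be checked with a short calculation. The sketch below computes the base-10 logarithm of the number of protein genotypes of a given length, and of the approximate excess $(4/1.8)^S$ of RNA genotypes over secondary structure phenotypes. The function names are hypothetical, and the scaling constant 1.8 is simply the approximate value quoted in the text.

```python
import math


def log10_protein_genotypes(length: int, alphabet: int = 20) -> float:
    """log10 of alphabet ** length, the number of amino acid sequences."""
    return length * math.log10(alphabet)


def log10_rna_excess(length: int, alphabet: int = 4, structure_base: float = 1.8) -> float:
    """log10 of (alphabet / structure_base) ** length, the approximate
    average number of RNA genotypes per secondary structure phenotype."""
    return length * math.log10(alphabet / structure_base)


print(f"20^100 is about 10^{log10_protein_genotypes(100):.1f}")
for length in (20, 50, 100):
    print(f"S = {length:3d}: (4/1.8)^S is about 10^{log10_rna_excess(length):.1f}")
```

For a protein of 100 amino acids this gives an exponent of about 130, in line with the figure quoted earlier in this section.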
Genotype networks
An important mode of evolutionary change in metabolic networks is the elimination and addition of chemical reactions, for example, through horizontal gene transfer. One can ask how many such additions and deletions a network can sustain without changing its metabolic phenotype. The answer is best expressed in terms of the maximal genotypic distance D (Figure 5.2) of two metabolic networks with the same phenotype. For networks whose size is typical of that of free-living organisms ($\approx 10^3$ reactions), this distance is greater than D = 0.75. This means that two networks can share fewer than 25 percent of their reactions, while having the same metabolic phenotype. This great divergence is not sensitive to the specific metabolic phenotype considered (Chapter 2). By generating many such diverse networks, one finds that metabolic networks of the same phenotype form vast genotype networks that extend far through genotype space. Regulatory circuits, in turn, evolve through mutations that create and destroy regulatory interactions between circuit molecules. For example, small genetic changes can alter a transcriptional regulator’s binding sites on DNA. An analysis similar to that of metabolic genotype space shows that the genotype distance of two circuits with the same gene expression phenotype can be as high as D = 1. At this maximal genotype distance, two circuits have no regulatory interactions in common (Figure 5.2). Two otherwise random circuits with the same phenotype typically share only of the order of 20 percent of regulatory interactions (D = 0.8). Moreover, the vast majority or all circuits with the same phenotype form one gigantic genotype network that traverses genotype space completely or nearly so (Chapter 3). Protein and RNA molecules undergo evolutionary change in their individual nucleotide or amino acid building blocks. A combination of empirical evidence and computational studies shows that
protein and RNA molecules with the same phenotype, whether defined structurally or functionally, are often not recognizably similar. Their maximal genotype distance D (Figure 5.2) is typically close to one. RNA molecules with the same structure phenotypes may even have no nucleotides in common (D = 1). As in the other study systems, the genotypes forming a typical phenotype are connected in one or few vast genotype networks.

Figure 5.2 An overview of the analogous concepts of neighbors and genotype distance used in describing metabolic networks, regulatory circuits, proteins, and RNA.
Metabolism
  Neighbors: networks differing in one reaction
  Genotype distance D: fraction of reactions not shared by two networks
Regulation
  Neighbors: circuits differing in one regulatory interaction
  Genotype distance D: fraction of regulatory interactions not shared by two networks
Molecules: Protein
  Neighbors: proteins differing in one amino acid
  Genotype distance D: fraction of amino acids not identical in two proteins
Molecules: RNA
  Neighbors: RNA molecules differing in one nucleotide
  Genotype distance D: fraction of nucleotides not identical in two RNA molecules

There may be many differences between metabolic, regulatory, and molecular phenotypes, as well as within each class of system. For example, their genotype networks may differ in how far they extend through genotype space, or in whether the genotypes of typical phenotypes form one or more genotype networks. The depth of our knowledge also varies for these system classes, which affects our ability to compare them. For example, discovering new reactions in the biochemical reaction universe can only increase the already large flexibility in metabolic network organization we observe. Such new knowledge can affect how far metabolic genotype networks appear to reach through genotype space, and whether they differ from regulatory genotype networks in this regard. Similarly, we have more information about the connectivity of genotype networks for molecules than for metabolisms, simply because we have studied molecules much longer than large metabolic networks. For my purpose, these and many other differences, whether real or caused by the gaping holes in our knowledge, are differences in details. They do not affect the commonality most important for evolutionary innovation: that genotypes with the same phenotype form genotype networks that spread far and wide through genotype space.
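The notion of genotype distance D used throughout this section can be illustrated with a minimal sketch. For sequences, D is the fraction of positions at which two equally long molecules differ. For metabolic networks and circuits, D is the fraction of parts not shared. The normalization by the union of both part sets in `set_distance` is an assumption made here for illustration, and the reaction labels are hypothetical. The earlier chapters may normalize differently.

```python
def set_distance(parts_a: set, parts_b: set) -> float:
    """Fraction of parts (reactions or interactions) not shared by two
    genotypes, normalized by the union of both part sets (an assumption)."""
    union = parts_a | parts_b
    if not union:
        return 0.0
    return 1.0 - len(parts_a & parts_b) / len(union)


def sequence_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of positions at which two equally long sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        return 0.0
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


# Hypothetical reaction labels and sequences, for illustration only.
network_1 = {"r1", "r2", "r3", "r4"}
network_2 = {"r3", "r4", "r5", "r6"}
print(set_distance(network_1, network_2))     # two of six reactions shared
print(sequence_distance("GGCAU", "GGGAA"))    # 0.4
```

With either definition, D = 0 means identical genotypes and D = 1 means that two genotypes have no parts in common.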
Common features, different mechanisms
An approach that asks how genotypes with the same phenotype are organized in genotype space creates a global, statistical perspective on this space. A complementary, mechanistic perspective would ask why the same phenotype can be built in so many different ways; conversely, why must some parts of a genotype not change? The question arises because two genotypes highly or maximally different from each other (D = 1) may not be able to form the same phenotype. In contrast to the first, statistical question, the answer to the second, mechanistic question depends strongly on the study system. For example, the flexibility of metabolic organization comes from the multiple ways in which biomass components can be synthesized. This flexibility is limited, because some reactions or pathways do not admit alternatives, as dictated by principles of organic chemistry. For proteins, the formation of a specific structure phenotype typically requires a conserved core of hydrophobic amino acids in the center of this structure. These interactions provide the glue for a protein’s spatial structure [87, 186]. The need to have such a hydrophobic core can limit the extent to which protein genotypes can vary. Least well understood are the reasons why some regulatory circuits with a given phenotype may need to preserve a small fraction of regulatory interactions, a phenomenon that occurs in the biological evolution of regulatory circuitry [165]. One candidate explanation is that the preserved interactions provide resistance of an expression phenotype to gene expression noise, but our knowledge in this area is very limited [561]. Despite such limitations, it is clear that limited genotypic conservation has different mechanistic causes in different systems. These differences make the commonality of connected genotype networks even more remarkable.
Neutral neighbors, robustness, and continuity properties
A system’s ability to preserve its phenotype while exploring genotype space requires that not all small changes in genotype cause large changes in phenotype. This is the case for genotypic changes in proteins, regulatory circuits, and metabolism. Even more, many such changes have no effect on phenotype. This means that genotypes typically have a substantial fraction of neighbors in genotype space with the same phenotype. For brevity, I will for now refer to such neighbors as neutral neighbors. A biological system’s robustness to change is its ability to preserve phenotype in the face of change. Thus, a genotype with many neutral neighbors is to some extent robust to genetic change in individual system parts. As I will explore in more detail in Chapter 6, such robustness is a prerequisite for the existence of genotype networks. This requirement for robustness also hints at the positive role robustness can play in evolutionary innovation. I have explored this role in an earlier book [825], and will discuss it in more detail in Chapter 8. It is tempting to cast this property of genotypes in the mathematical language of continuous functions [435]. In a continuous function, a small change in the function’s argument causes a small or no change in the function’s value. The function F at issue is one that takes a genotype G as an argument and produces a phenotype from it (P = F(G)). Such a function F is often called a genotype-phenotype map [11]. Thus, on the surface, the genotype-phenotype maps I studied have properties akin to continuity. However, the analogy is limited: when applying strict mathematical definitions of continuity, all functions in a discrete space are continuous [689, p.131]. I note in passing that many man-made, technological systems differ in this respect from biological systems: if you change a part, the function of the whole often changes dramatically, and usually for the worse. This difference may be why most technological systems cannot readily innovate through random change [636]. However, this limitation can be overcome with the right kind of technology (Chapter 15).
Some consequences of genotype space and genotype network size
Genotype spaces are vast. Even the genotype networks of individual phenotypes we discussed are astronomical in size. The large numbers one encounters in studying genotype spaces are best illustrated with examples from molecules, because approaches to estimate
Some consequences of genotype space and genotype network size Genotype spaces are vast. Even the genotype networks of individual phenotypes we discussed are astronomical in size. The large numbers one encounters in studying genotype spaces are best illustrated with examples from molecules, because approaches to estimate
these numbers are most advanced here [294, 378, 389, 639, 761]. Consider the guide RNA of the protozoan parasite Leishmania tarentolae shown in Figure 5.3. Guide RNAs are important in RNA editing, a process that changes the nucleotide sequence of already transcribed RNA molecules [710]. There are approximately $5 \times 10^{22}$ sequences forming the minimum free-energy structure of the RNA molecule in Figure 5.3 [378]. Turning to proteins, consider again the bacteriophage λ repressor from Chapter 4, where a protein’s structure and function is achieved by an estimated $5 \times 10^{56}$ sequences [639]. Numbers as large as these are not exceptions to a rule. Many other proteins and RNAs of the same size have comparable or even larger genotype networks. Their enormity can perhaps be better appreciated if we consider that the number of stars in our galaxy ($10^{11}$–$10^{12}$) is puny compared to them. We currently can estimate genotype network sizes less well for regulatory circuits and metabolic networks, but they can be just as large (Figure 3.3 and ref. [670]). This results simply from the excess of genotypes over phenotypes.
Our intuition easily fails us when thinking about very large numbers. Here are a few consequences of the vast numbers of genotypes with the same phenotype and of the genotype networks they form. First, because genotype space is vast, phenotypes with small genotype networks have limited biological relevance, because a (blind) evolutionary search is very unlikely to find them. Recall the example of RNA (Chapter 4), where the vast majority of genotype space is filled with phenotypes that have large genotype networks. These phenotypes are “typical” in the sense that genotypes chosen at random from genotype space are highly likely to form one of them. With increasing length of a molecule, the likelihood to find atypical phenotypes decreases rapidly. In this regard, it is also relevant that the genotype networks of RNA phenotypes with known biological functions are somewhat larger than those of RNA molecules chosen at random from genotype space [378]. The perspective I propose here provides a ready explanation. If a given biological function can be carried out by two phenotypes, one with a small genotype network, and the other with a larger genotype network, then a blind search through genotype space is more likely to discover the larger genotype network.
Figure 5.3 The number of genotypes in a genotype network may be astronomical. The left panel shows the secondary structure of a guide RNA from the protozoan parasite Leishmania tarentolae. We estimated the number of genotypes with this secondary structure using a replica exchange Monte Carlo algorithm [378]. The protein structure in the right panel is the 92 amino acid long N-terminal part of the λ repressor, a transcriptional repressor of bacteriophage λ. It is displayed using atomic coordinates in Protein Databank File 3BDN [727]. The number of genotypes displayed was estimated from a large-scale mutagenesis experiment with this protein [639].
Left panel (guide RNA): number of genotypes $5 \times 10^{22}$; fraction of genotype space $4 \times 10^{-8}$.
Right panel (λ repressor): number of genotypes $5 \times 10^{56}$; fraction of genotype space $4 \times 10^{-63}$.
Second, a genotype network can be astronomically large, yet it may occupy a tiny fraction of an even larger genotype space. For example, the astronomical number of RNA sequences forming the structure of Figure 5.3 constitutes only a tiny fraction $4 \times 10^{-8}$ ($\approx 5 \times 10^{22}/4^{50}$) of this space. The even greater number of protein sequences folding into the λ repressor occupies a fraction $10^{-63}$ of their genotype space. A biologically important phenotype may have a vast number of associated genotypes, yet still be rare in genotype space. Third, a biological system that serves multiple functions does not necessarily have a small genotype network. Many enzymes can have more than one enzymatic activity, or even some non-enzymatic functions [8, 367]. Regulatory circuits are exposed to different regulatory signals in different cells, and they respond by producing different cell-specific gene expression patterns that influence physiology and development. Metabolic networks may synthesize different compounds in different tissues or at different times. We can think of each such function as having an associated genotype network. Genotypes with multiple functions occur in the intersection of these networks. Because each genotype network is vast, the intersection may still be very large. Consider the model regulatory circuits I discussed in Chapter 3, where circuits with 6 genes and 3 regulatory interactions per gene have $8.6 \times 10^{13}$ possible genotypes. The average number of networks producing only one specific gene expression pattern is equal to $5.92 \times 10^{10}$. (The average is taken over different expression patterns.) Bifunctional networks, that is, networks that can produce two specific expression patterns, have fewer but still very many genotypes ($1.96 \times 10^{7}$) that can produce both expression patterns [491]. Fourth, because genotype networks are so vast, there is much room for heterogeneity inside them. This internal structure, however, is still poorly understood.
One exception to our general ignorance is that in some regions of a genotype network, genotypes have many more neutral neighbors than in other regions. This holds for all three classes of systems I discussed so far. It has evolutionary implications that I will discuss in more detail later (Chapter 8).
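The fractions of genotype space quoted above for the guide RNA and the λ repressor follow from simple arithmetic, which the following sketch reproduces in base-10 logarithms. The genotype network sizes and the sequence lengths (50 nucleotides, 92 amino acids) are those given in the text and in Figure 5.3. The function name is my own.

```python
import math


def log10_fraction_of_space(network_size: float, length: int, alphabet: int) -> float:
    """log10 of network_size / alphabet ** length."""
    return math.log10(network_size) - length * math.log10(alphabet)


rna = log10_fraction_of_space(5e22, length=50, alphabet=4)
protein = log10_fraction_of_space(5e56, length=92, alphabet=20)

print(f"guide RNA: 10^{rna:.2f}, that is, about {10 ** rna:.1e}")
print(f"lambda repressor: 10^{protein:.2f}, that is, about {10 ** protein:.1e}")
```

Both genotype networks are astronomically large, yet the exponents show how small a corner of their genotype space they occupy.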
In closing this section, I restate what is perhaps the most important consequence of the vast size of genotype space. Because this veritable universe can hold an enormous number of different phenotypes, it provides the ideal starting point to develop an account of innovation that is systematic instead of anecdotal; because it has room for myriad phenotypes, it is sufficiently rich to encapsulate the enormous diversity of molecular innovations in the history of life. It does not just reduce their diversity and complexity to a simple caricature, as other contemporary models do. It qualifies as a solid foundation for a theory of innovation.
Neighborhoods and their phenotypic diversity
The existence of vast genotype networks ensures that a genotype can change substantially without changing its phenotype. This feature, however, is not sufficient to explain innovability. To produce evolutionary innovations, biological systems must explore many different phenotypic variants before finding one that may become an innovation. Such phenotypic variants are produced by mutations. Among all possible variants, those accessible from any one genotype via a single mutation are the most important, because they are most easily reached. For the systems I focus on, such variants differ in a single amino acid, an RNA nucleotide, a regulatory interaction, or an enzymatic reaction. Together, all of a genotype’s single mutants comprise a genotype’s 1(-mutant) neighborhood. Thus far, the language I used insinuated that individual systems change and explore different phenotypes. That is only part of the truth, because all evolution takes place in populations. Although most of the principles I discussed apply to both populations and their members, large populations have the following advantage in exploring new phenotypes. If a population is sufficiently large or if mutation rates are sufficiently high, then the mutations occurring every generation produce not only single mutant variants but also variants that carry several mutations [845]. A case in point is viruses, such as HIV, with small genomes, high mutation rates, and enormous population sizes. In the human body, a single round of replication of a viral population would suffice to produce all single mutants of a viral genome, as well as many double and triple
mutants [488, 608, 629]. In such populations, not only members of a genotype’s 1-neighborhood, but many members of its 2- and 3-neighborhoods are accessible. The genotypic neighborhoods of metabolic networks, regulatory circuits, and molecules share two important features, as shown schematically in Figure 5.4. First, different neighborhoods on the same genotype network contain very different novel phenotypes. Specifically, consider two such genotypes G1 and G2, and the fraction U of “unique” phenotypes accessible to only one of them. These are phenotypes that occur only in the neighborhood of one but not the other genotype. The fraction U increases with increasing genotype distance between G1 and G2, and it reaches a plateau at genotype distances much smaller than the diameter (the maximally possible genotype distance) of a genotype network (Figure 5.4a). At this plateau, a 1-neighborhood contains between 40 percent and more than 90 percent of unique phenotypes, depending on the system and the neighborhood considered. This percentage increases further for larger neighborhoods. The second observation (Figure 5.4b) regards individual genotypes or entire populations that evolve on a given genotype network through cycles of mutation and natural selection. Mutations allow individuals to explore the network and its surroundings in a random walk; selection preserves the population’s well-adapted phenotype and thus confines it to the genotype network. In consequence, a population spreads through genotype space like a cloud of genotype “particles” diffusing through a porous medium, the population’s genotype network. As mutations accumulate generation after generation (while leaving the phenotype unchanged), individuals and populations gain access to an ever-increasing number of novel phenotypes. These are phenotypes in their neighborhoods that were not contained in any neighborhood encountered in previous generations.
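The notions of a 1-neighborhood, of neutral neighbors, and of the fraction U of unique phenotypes can be made concrete with a deliberately simple sketch. The genotype-phenotype map `toy_phenotype` below is a hypothetical stand-in that I invented for illustration: it splits a sequence into blocks of three letters and records whether each block contains at least two G or C. It is not an RNA folding algorithm and has none of the richness of the systems discussed in the text. Here U is computed as the share of novel phenotypes found in only one of the two neighborhoods, which is one possible reading of the definition above.

```python
ALPHABET = "ACGU"
BLOCK = 3  # block length of the toy map (an arbitrary assumption)


def toy_phenotype(genotype: str) -> str:
    """Hypothetical map: one bit per block, '1' if the block has >= 2 G/C."""
    blocks = [genotype[i:i + BLOCK] for i in range(0, len(genotype), BLOCK)]
    return "".join(
        "1" if sum(ch in "GC" for ch in block) >= 2 else "0" for block in blocks
    )


def neighbors(genotype: str) -> list:
    """All genotypes that differ from `genotype` in exactly one letter."""
    return [
        genotype[:i] + new + genotype[i + 1:]
        for i, old in enumerate(genotype)
        for new in ALPHABET
        if new != old
    ]


def neutral_fraction(genotype: str) -> float:
    """Fraction of 1-mutant neighbors with the same phenotype."""
    own = toy_phenotype(genotype)
    nbrs = neighbors(genotype)
    return sum(toy_phenotype(n) == own for n in nbrs) / len(nbrs)


def novel_phenotypes(genotype: str) -> set:
    """Phenotypes in the 1-neighborhood that differ from the genotype's own."""
    return {toy_phenotype(n) for n in neighbors(genotype)} - {toy_phenotype(genotype)}


def unique_fraction(g1: str, g2: str) -> float:
    """Share of novel phenotypes found in only one of the two neighborhoods."""
    p1, p2 = novel_phenotypes(g1), novel_phenotypes(g2)
    union = p1 | p2
    return len(p1 ^ p2) / len(union) if union else 0.0


g1, g2 = "GCAUUAGGC", "GGCAUUGCA"   # same toy phenotype, different genotypes
print(toy_phenotype(g1), toy_phenotype(g2))        # 101 101
print(neutral_fraction(g1))                        # 23 of 27 neighbors are neutral
print(novel_phenotypes(g1), novel_phenotypes(g2))  # {'001'} {'100'}
print(unique_fraction(g1, g2))                     # 1.0
```

In this toy example the two genotypes share a phenotype but reach different novel phenotypes through single mutations, which is the qualitative pattern of Figure 5.4a.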
Because genotype networks are typically vast in size, and because the total number of possible phenotypes is very large, this number of novel phenotypes does not level off even when most system parts (nucleotides, amino acids, regulatory interactions, metabolic reactions) have changed multiple times. Eventually, the number of novel phenotypes would have to level off, because genotype networks have a finite size. However, a population of realistic size could hardly explore a large genotype network on realistic evolutionary time scales, and thus experience this exhaustion of phenotypic variation. Taken together, these two properties ensure that molecules, regulatory circuits, and metabolic networks meet an indispensable requirement to produce evolutionary innovations: the ability to access vast amounts of novel phenotypes while leaving their own phenotype unchanged.

Figure 5.4 Different genotypic neighborhoods on the same genotype network contain very different phenotypes. Both panels are schematic illustrations of features common to metabolic networks, regulatory circuits, and molecular genotypes discussed in earlier chapters. (a) The fraction U of phenotypes that occur in the neighborhood of one but not the other genotype on a neutral network (as illustrated by the intersecting circuits in the inset), as a function of the genotype distance D between two genotypes. Dmax indicates the maximal genotype distance of two genotypes on the same genotype network, which is often close to or equal to the diameter of genotype space. (b) The cumulative number of different phenotypes (vertical axis) in the neighborhood of a genotype that changes its composition gradually through mutations (horizontal axis), while preserving its phenotype. This cumulative number increases linearly or nearly so for a number of mutations that is much greater than the system size S (number of nucleotides, amino acids, regulatory genes, or metabolic enzymes), as indicated by the label S on the horizontal axis.
Two necessary prerequisites for innovability

Figure 5.5 summarizes and illustrates how these two features conspire to allow metabolic innovation. The rectangle in this figure stands for genotype space. The gray open circles correspond to genotypes with some common phenotype. Genotypes are connected by straight lines if they are neighbors. Symbols of different shapes and shading correspond to different phenotypes that occur as neighbors of some genotype on the gray genotype network. Each symbol stands for a different phenotype. Figure 5.5 illustrates a genotype network that spans genotype space and that is connected. By virtue of its connectedness, genotypes evolving on it can access many different novel phenotypes, while fulfilling the key requirement of not changing their own phenotype.

Molecules, regulatory circuits, and metabolic networks may differ in many details, but they all share these organizational features. First, their phenotypes are typically organized into vast genotype networks that traverse a large fraction of genotype space. Second, different neighborhoods on these networks contain very different novel phenotypes.

Figures 5.6a through 5.6c illustrate that these features are essential by exploring several counterfactual scenarios; that is, scenarios that are not typical for the system classes I examined. In the first scenario (Figure 5.6a), the number of genotypes that form the same phenotype is just as large as in Figure 5.5. These genotypes are also as widely distributed through sequence space. However, these genotypes are either isolated from one another or they form only small groups of connected genotypes. Their disconnectedness hinders access to new phenotypic variants, because evolving genotypes remain confined to small regions of this space. They can no longer explore large regions of this space through mutations that leave the phenotype unchanged. The second scenario (Figure 5.6b) shows a genotype network that is connected, but that does not span a large fraction of genotype space. Instead, it is localized in a smaller region of this space. Therefore, many novel phenotypes occurring elsewhere in genotype space remain inaccessible from it. The fundamental underlying reason is again the requirement to retain old phenotypes, and thus to remain close to a genotype network, while exploring new phenotypes.
Figure 5.5 Connected genotype networks facilitate accessibility of diverse phenotypes. The figure schematically represents a set of genotypes (gray circles) in genotype space (rectangle) that share the same phenotype and form a genotype network; neighboring genotypes are connected by gray lines. Symbols of different shapes and shading indicate genotypes with different phenotypes. The figure illustrates that many different novel phenotypes can be accessed from a connected genotype network that spreads far through genotype space.
Figure 5.6 Three counterfactual scenarios for genotype network organization. Each panel indicates a set of genotypes (gray circles) in genotype space (rectangle) that share the same phenotype; neighboring genotypes are connected by gray lines. Symbols of different shapes and shading indicate genotypes with different phenotypes: (a) a disconnected genotype network, (b) a highly localized genotype network, and (c) a genotype network where different neighborhoods contain the same novel phenotypes. See text for details.
A final counterfactual scenario is shown in Figure 5.6c, where a sprawling and connected genotype network exists, but where the phenotypes in its neighborhoods are all the same. In this case, the network is irrelevant for evolutionary innovation, because regardless of where a genotype occurs on this network, and regardless of how far a population spreads through this network, it has access to the same novel phenotypes. Taken together, these images highlight that both the extension of neutral networks in genotype space and the phenotypic diversity of their neighborhoods are essential for the exploration of many different phenotypes, which allows evolutionary innovation. Whatever else a theory of evolutionary innovation might include, these elements will be essential. Whether they are also sufficient for innovation is an intriguing open question.

Before going on, I need to caution that Figure 5.5 and other low-dimensional representations are mere caricatures of genotype networks. Genotype spaces are very high-dimensional. They are closely related to hypercubes, n-dimensional analogues of three-dimensional cubes [641]. Such high-dimensional spaces have many counterintuitive features. I will discuss them in more detail in Chapter 6. For now, I will merely highlight two features that make low-dimensional representations misleading. First, in our familiar three-dimensional space, we can move in three orthogonal directions, but in a genotype space we can move in as many “directions” as there are dimensions. In consequence, the immediate neighborhood of a genotype contains many different genotypes. For example, any protein genotype with S = 100 amino acids has 19 × 100 = 1900 immediate neighbors. A two-dimensional projection cannot capture such large neighborhoods well. Second, despite the enormous size of the corresponding genotype space, one can walk through this space in a few steps; that is, in as many mutations as there are dimensions. Because each step can take multiple possible directions, the number of paths through this space is astronomical. Many paths lead to genotypes that are maximally different from the starting genotype, yet they are also maximally different from each other. Again, two-dimensional images represent these features poorly. They only serve as visual crutches to aid our understanding.
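The arithmetic behind these two features can be checked with a short sketch. The function names are hypothetical, and the symbol B for the number of building blocks per position (20 for proteins) is introduced here for illustration only.

```python
def num_neighbors(S, B):
    """Number of immediate (1-mutant) neighbors: each of S positions
    can change into one of B - 1 other building blocks."""
    return (B - 1) * S


def genotype_space_size(S, B):
    """Total number of genotypes of length S over B building blocks."""
    return B ** S


S, B = 100, 20  # a protein of 100 amino acids
print(num_neighbors(S, B))        # 1900
print(genotype_space_size(S, B))  # an astronomically large integer
print(S)  # at most S mutations separate any two genotypes
```

The contrast between the last two numbers is the point of the passage above: the space is enormous, yet no two genotypes are more than S mutations apart.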
Genotype networks are highly interwoven

One further general feature of molecules, regulatory circuits, and metabolic networks is worth highlighting: The genotype networks of any two typical phenotypes, P1 and P2, are close together in genotype space. More specifically, the minimal number of mutations necessary to go from a genotype with phenotype P1 to a genotype with phenotype P2 comprises only a small fraction of the diameter of genotype space. (The diameter is the maximal distance between two genotypes.) In other words, there is at least one point in genotype space where two genotype networks are close together. This holds for any two typical phenotypes.

For an example, consider first the regulatory circuits of merely twenty genes I discussed earlier (Chapter 3). Here, the average minimum genotype distance between circuits with arbitrary different expression phenotypes is only D = 0.14. There are approximately $10^{128}$ circuits of S = 20 genes with an average of 5 regulatory interactions per gene. Only a tiny fraction $10^{-102}$ of them is contained in a neighborhood of D = 0.14 around any one circuit. Yet this tiny region of genotype space around a circuit (around any circuit) contains most expression phenotypes. The radius D and size of a neighborhood with this property would further decrease as circuit size increases. Genotype networks of larger circuits thus become increasingly interwoven. As a second example, consider RNA molecules of size S = 100 nucleotides. Here, a region of D = 0.15 around any one genotype contains with near certainty one sequence for any common structure. Such a region comprises only one part in $4.52 \times 10^{37}$ of genotype space (Chapter 4). Lastly, let us turn to metabolic networks. To reach any one metabolic phenotype from any other metabolic phenotype, one typically does not need to go further than a genotypic distance of D = 0.1, i.e., change 10 percent of a metabolic network’s reactions (Chapter 2). For the genotype space of more than 5000 reactions I discussed earlier, a genotype neighborhood with this radius would contain much less than a fraction $10^{-500}$ of genotype space.

As strikingly tiny as the fractions I just cited may seem, one should bear in mind that walking a hundred biochemical reactions, fourteen regulatory interactions, or fifteen nucleotides away from a genotype network is no small feat. If old phenotypes must not be destroyed for evolutionary innovations to occur, then genotypes this far away are certainly not readily accessible. One might thus argue that this phenomenon, however striking, may be of limited importance for evolutionary innovation. I nonetheless mention it here, because it speaks to the diversity of phenotypes that we find in different neighborhoods of a genotype network (Figure 5.5). By studying it more closely, we may learn more about the causes for this diversity.
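To get a feeling for how small such neighborhoods are, one can count the genotypes within a given distance of a reference genotype and compare this count with the size of genotype space. The sketch below does this under simplifying assumptions: every one of S positions can take one of B values, and distance is the number of differing positions. The function name is hypothetical, and the results need not reproduce the figures quoted above, which rest on details from earlier chapters.

```python
from math import comb, log10


def log10_ball_fraction(S, B, D):
    """log10 of the fraction of genotype space lying within distance
    D * S (rounded to the nearest integer) of a reference genotype."""
    radius = int(round(D * S))
    # Number of genotypes that differ from the reference in at most `radius` positions.
    inside = sum(comb(S, k) * (B - 1) ** k for k in range(radius + 1))
    # Work with logarithms because the fraction can underflow a float.
    return log10(inside) - S * log10(B)


print(log10_ball_fraction(100, 4, 0.15))   # RNA-like: S = 100, B = 4
print(log10_ball_fraction(5000, 2, 0.1))   # metabolism-like: S = 5000, B = 2
```

Returning a logarithm rather than the fraction itself is a deliberate choice, since fractions of this magnitude are too small to be represented as ordinary floating-point numbers.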
Minimal requirements for a theory of innovation

Chapter 1 listed several minimal requirements for a theory of innovation. I will briefly revisit them to show that the elements I discussed so far meet these requirements. The first requirement is to explain how the old can be preserved while the new is being explored. Extended genotype networks with diverse phenotypic neighborhoods allow precisely this kind of exploration. The second requirement is to unify different kinds of innovations. Because most innovations involve changes in three classes of systems that share organizational features of genotype spaces, this requirement is also met. The third requirement is to capture the combinatorial nature of innovation. The framework I use is also ideal in this regard, because it explicitly captures innovation as new combinations of system parts (elementary modules) that give rise to novel genotypes. Depending on the system class, these elementary units of organization are enzymes, regulatory interactions, amino acids, or nucleotides. In each system, small modules may form higher order modules, such as regulatory circuit motifs, or enzyme complexes. The role of such higher order modularity is much studied but still poorly understood [224, 384, 424, 681, 774, 813, 834]. The genotype space framework can help us study it systematically.

The fourth requirement is to capture that the same problem can be solved through different innovations. This is important, because many innovations in the history of life occurred multiple times and in different ways [807]. The existence of extended genotype networks captures this feature well. For example, it shows that the problem of synthesizing biomass from a single energy source can be solved by many and very different sets of chemical reactions, i.e., metabolic network genotypes. Any two such networks can be viewed as different solutions to the same problem. The same holds for two different regulatory circuits that produce the same molecular activity phenotype, or two different proteins with the same enzymatic function. However, genotype networks help us see much more than that: if a problem has a solution at all, it usually has astronomically many solutions. In addition, these solutions are connected in genotype space. Which of these solutions an organism discovers depends on its evolutionary history, that is, its past trajectory and location in genotype space. The fifth requirement regarded environmental change. I will discuss it in Chapter 11. The sixth and last requirement regarded applicability to technology; it is the focus of Chapter 15.
The merits and price of abstraction

One of the most influential concepts in evolutionary biology is that of the adaptive landscape [259]. It is commonly visualized as a landscape of rolling hills or steep ravines. Its peaks represent trait combinations or genotypes with high fitness. This concept is an abstraction derived from an immensely complex reality. Such abstraction is necessary for any human understanding. Yet like any other abstraction, it also has limitations. One of them derives from the fact that genotype space is high-dimensional. For example, a single mountain peak in three dimensions can become a much stranger object in higher dimensions. Any phenotype (metabolic, regulatory, or molecular) that confers high fitness on its carrier could serve as an example. It is a peak in an adaptive landscape. But because many connected genotypes typically form any such phenotype, this peak is spread out through genotype space. In other words, a single peak of a three-dimensional fitness landscape becomes a connected, vast, and sprawling genotype network in a higher dimensional space.

Just as low-dimensional representations of fitness landscapes have limitations, so do their refinement to many dimensions and the concept of genotype networks. The most elementary abstraction I made here is to consider discrete genotypes and phenotypes. It has obvious merits. First, it is well-suited to study the qualitatively different phenotypes that are important for innovation. Second, it also helps us understand the combinatorial nature of innovation. Third, it gives rise to clear concepts about proximity, neighborhoods, robustness, the spreading of genotype networks, and unique phenotypes in a neighborhood. But this abstraction also has limitations. For example, one might argue that many systems can have an infinite continuum of phenotypes. Examples include the ever-changing conformations of proteins and RNA molecules inside a cell. Although continuous systems and their organization in genotype space are poorly studied, Chapter 14 will hint that important observations from discrete systems also apply to continuous systems. On a more general note, a continuum of phenotypes may well fall into discrete classes defined by distinct biological features. We may live in a continuous world, but most efforts at understanding this world involve classification of its objects, as much in biology as in any other area of human life, from the classification of biological species, to the classification of objects by our retina and visual cortex. Classification is a form of discretization. Discretization, with all its limitations, is thus central for our orientation in the world.

The framework I have developed thus far also contains other, more hidden simplifications, on which subsequent chapters will focus. I have thus far neglected the role of changing environments and phenotypic plasticity for innovation (Chapters 11 and 13), as well as the role of recombination (Chapter 10), population dynamics (Chapters 7, 8), and gene duplications (Chapter 9). As these chapters will show, the framework can easily accommodate these phenomena. They can enhance the power of genotype networks to explore new phenotypes.
Validation

To help validate any principle that organizes natural phenomena, it is essential to think about the limits of its applicability, including the kinds of evidence that could prove this principle wrong. Although some of these limitations were implicit in my previous discussions, I will now briefly revisit them to make them more explicit. Below, I will use the term “system class” for a particular kind of genotype and phenotype, like the molecules, regulatory circuits, and metabolic networks I discuss throughout.

First, the framework I suggest would face a problem if we found that phenotypes with tiny genotype networks generally are more innovative than phenotypes with large such networks. What I have in mind are phenotypes unusually rare in genotype space, yet nonetheless highly abundant in organisms, and at the same time highly innovative. Second, highly innovative systems or phenotypes where many genotypes form the same phenotype, but where most of these genotypes are isolated in genotype space, would present a problem to the theory (Figure 5.6a). Third, the same would hold if we found systems or phenotypes whose genotype networks are highly localized in a small region of genotype space, and that are nonetheless highly innovative (Figure 5.6b), more so than genotype networks that reach far through genotype space. Fourth, it would be problematic if we found systems or highly innovative phenotypes where distant neighborhoods of a genotype network contain mostly or exclusively identical phenotypes (Figure 5.6c).

Lastly, the framework does not apply to systems with as many phenotypes as genotypes, or more. (This is not the same issue as raised by phenotypic plasticity, where environmental variation produces several phenotypes from one genotype, as Chapter 13 discusses.) There is at least one prominent system class that may fall into this category. It comprises systems involved in self-recognition and immunity. An organism’s antibody repertoire, for example, is most effective in recognizing many antigens if its antibodies have highly diverse surface properties. Organisms achieve such high diversity through several mechanisms, including hypermutation of small hypervariable regions in the genes encoding these antibodies [13]. To maximize the antibody diversity that results from a given number of mutations, it is best if each mutation generates an antibody with new surface properties. In other words, it is best if there is one phenotype for every genotype. The same line of reasoning applies to pathogens capable of producing many gene variants encoding different surface proteins. This great diversity of surface proteins can help them avoid detection by the immune system [537]. Thus, when every phenotypic variant is useful, the genotype network framework becomes useless. In this case, “innovation” also becomes trivially synonymous with variation, and is no longer challenging to explain.

Our abilities to study genotypes and phenotypes are growing rapidly with the advance of whole-genome sequencing and functional genomic technology. Thus, in the years and decades to come, we will undoubtedly learn more about the incidence of anomalies like those I just described, whether they are the exception to a rule, or whether they lead to new principles we do not yet appreciate.
Innovation at the origin of life

The importance of innovation undoubtedly began with the origin of life. We do not know whether the earliest life involved the information carrier RNA, or some simple metabolic network [535]. From an innovability perspective, it may not matter: both system classes have the prerequisites for innovability I discussed here. They can support the countless innovations that must have occurred until life even vaguely resembled its present cellular form. One could even take the perspective that genotype networks predate natural selection and thus life itself. From this point of view, they transcend biology. They exist regardless of whether a life form explores them. And life was able to take advantage of them as soon as natural selection started to operate.
Summary

Figure 5.7 contains a summary of six important commonalities that we find between metabolic networks, regulatory circuits, and molecules, key system classes underlying all evolutionary innovation. It highlights the last two commonalities, because they are essential for the ability to innovate: genotype networks extend far through genotype space and their different neighborhoods contain a universe of different phenotypes. In the next chapter, we will see that these features emerge from a remarkably simple property of innovable systems.

Innovation is combinatorial in nature.
Genotypes have many neighbors with the same phenotype.
Many or all genotypes with the same phenotype are connected in genotype networks.
Genotype networks of different phenotypes have different sizes.
Typical genotype networks traverse a large part of genotype space.
Different neighborhoods of a genotype network contain different phenotypes.

Figure 5.7 Six common features of metabolic networks, regulatory circuits, and molecules.
CHAPTER 6
Genotype networks, self-organization, and natural selection
The previous chapter showed that extended genotype networks with phenotypically diverse neighborhoods are necessary for innovability. But it left a fundamental question unanswered: Why do they exist in the first place, and in system classes as different as metabolic networks, regulatory circuits, and molecules? This chapter suggests an answer. They emerge from one core common property: individual genotypes typically have many neighbors with the same phenotype. I will show that this property is sufficient for the existence of connected genotype networks that occupy a small fraction of genotype space, but that extend far through this space. I will also show that this property is necessary. Subsequently, I will show that the great phenotypic diversity of different neighborhoods is not surprising, but expected for systems with many phenotypes. Next, I will discuss the interdependency of self-organization and natural selection in evolution. Finally, I will point out that genotype networks render phenotypic change non-random in ways that facilitate innovation.
Genotype networks as graphs

The chemistry and physics of how molecules fold, of how genes regulate their expression, and of how large metabolic networks synthesize biomass differ in many details. Commonalities among them will thus probably emerge neither from physics nor chemistry, but from more fundamental, mathematical principles. Principles from graph theory are most important in this regard [70, 304]. A graph is a mathematical object that consists of nodes, and of edges that link these nodes. A graph is connected if one can reach any node from any other node by traversing a path of edges, and disconnected otherwise. Two kinds of graphs are important for my purpose. The first is a genotype space. It can be viewed as a graph whose nodes are genotypes. Edges connect nearest (1-mutant) neighbors in this space. The second kind of graph is a genotype network. The nodes in a genotype network are also genotypes, but only genotypes with a common phenotype; an edge again connects two genotypes if they are 1-mutant neighbors. A genotype network graph typically does not contain all genotypes in a genotype space. It can thus be viewed as a subgraph of genotype space.

My treatment of genotype networks below emphasizes intuition over mathematical rigor, for three reasons. First, doing so will render the text as accessible as possible to the non-expert reader. Second, mathematically rigorous graph theory has produced deep insights, but mostly about highly idealized graphs with a simple structure, not the highly heterogeneous and “messy” real-world graphs such as genotype networks [70–73, 77, 262, 263, 642]. Third, although rigorous and highly technical mathematical proofs exist [70, 641–644], they are not essential to gain intuition about qualitative genotype network properties. Only some elementary probability theory and some terminology are necessary, to which I will now turn. I am using some terms that are non-standard in graph theory, but that will facilitate comprehension by non-experts [70, 304].
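The two kinds of graphs can be made concrete with a small sketch. It enumerates a tiny genotype space of binary strings, extracts the genotype network of one phenotype, and tests its connectedness by breadth-first search. The two phenotype maps (`at_least_two` and `even_parity`) are hypothetical toy examples chosen only to contrast a connected with a disconnected genotype network; they do not model any real system.

```python
from collections import deque
from itertools import product


def neighbors(genotype, alphabet="01"):
    """Yield all 1-mutant neighbors of a genotype string."""
    for i, old in enumerate(genotype):
        for new in alphabet:
            if new != old:
                yield genotype[:i] + new + genotype[i + 1:]


def genotype_network(S, phenotype_of, target, alphabet="01"):
    """All genotypes of length S whose phenotype equals `target`."""
    space = ("".join(g) for g in product(alphabet, repeat=S))
    return {g for g in space if phenotype_of(g) == target}


def is_connected(nodes, alphabet="01"):
    """Breadth-first search that only follows edges between members of `nodes`."""
    if not nodes:
        return True
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        g = queue.popleft()
        for h in neighbors(g, alphabet):
            if h in nodes and h not in seen:
                seen.add(h)
                queue.append(h)
    return len(seen) == len(nodes)


# Hypothetical toy phenotype maps on binary genotypes.
def at_least_two(g):
    return g.count("1") >= 2


def even_parity(g):
    return g.count("1") % 2 == 0


net_a = genotype_network(4, at_least_two, True)
net_b = genotype_network(4, even_parity, True)
print(len(net_a), is_connected(net_a))  # 11 True
print(len(net_b), is_connected(net_b))  # 8 False
```

In the second toy map, every single mutation changes the phenotype, so no two genotypes with the same phenotype are neighbors and the genotype network falls apart into isolated nodes.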
Hypercubes The genotype spaces of molecules, regulatory circuits, and metabolic networks are closely related to hypercube graphs. I will now explain this concept and its relationship to genotype spaces.
83
84
T H E O R I G I N S O F E V O L U T I O N A RY I N N O V A T I O N S
In doing so, I will use metabolic networks as an example. Recall that a metabolic genotype can be encapsulated in a binary string whose entries represent presence (“1”) or absence (“0”) of a chemical reaction in a metabolic network. Figure 6.1 shows cubes in various dimensions. The line and square of Figure 6.1a and 6.1b can be viewed as one- and two-dimensional cubes. Figure 6.1c shows the conventional three-dimensional cube. Figure 6.1d shows a three-dimensional representation of a four-dimensional cube. Such a representation, at least one that provides geometric intuition, becomes impossible for higher-dimensional cubes. Such higher dimensional cubes are called hypercubes. The vertices of the cubes in Figure 6.1 are labeled by binary strings whose length corresponds to the dimension of each cube. We can interpret these strings as representations of a genotype; for example, a metabolic network genotype. In this context, the vertices of the hypercube become the set of possible genotypes in a genotype space. For example,
(a) (b)
0
1
01
11
00
10
(d)
(c)
110 111
100
101 010 011
000 001
Figure 6.1 Cubes and hypercubes. (a) and (b) “Cubes” in one and two dimensions. (c) A cube in three dimensions. The vertices of each cube are labeled with the binary strings that they correspond to. (d) A three-dimensional representation of a four-dimensional cube, where the labeling of vertices with binary strings (of length four) is omitted for clarity. The vertices of cubes in higher dimensions (hypercubes) form the nodes of hypercube graphs.
the ends of the line in Figure 6.1a correspond to a trivially small genotype space that contains only one reaction, which can be absent or present in any one genotype. Figure 6.1d shows a four-dimensional genotype space whose metabolic network genotypes can contain up to four reactions. The concept of a hypercube graph extends these ideas to higher dimensions. A hypercube graph is a graph whose nodes (here: genotypes) correspond to vertices of a hypercube. Two nodes are neighbors (connected by an edge), if they correspond to adjacent vertices in the hypercube. In the context of metabolic networks, there is a one-to-one correspondence between each vertex of a hypercube, and each genotype in metabolic genotype space. Two adjacent vertices correspond to two metabolic genotypes that are neighbors, and that differ in a single entry of their binary representation, that is, in a single enzymatic reaction. I now review some very basic facts about hypercube graphs that we will need later on. (Figure 6.2 shows some of the notation I will use here.) For the purpose of this chapter, the distance between two genotypes corresponds to the number of mutations (additions or eliminations of reactions) necessary to transform one genotype into the other. The diameter of a graph is the maximum distance (number of edges) between any two genotypes. For the hypercube graph, this distance is equal to S, the number of mutations that are necessary to create from a genotype G its complement G̅ that differs in every single reaction. This diameter S is vastly smaller than the total number of possible genotypes 2S. This simple fact will go a long way towards explaining important properties of genotype networks. Each genotype G in genotype space has exactly S immediate neighbors. One can show that its number of k-neighbors, genotypes that differ from it in exactly k system parts, is given by the binomial coefficients: S! ⎛ S⎞ . ⎜⎝ k ⎟⎠ = ( S − k )!S !
(6.1)
This can be most easily seen by considering a hypothetical genotype G in which all chemical reactions are present, corresponding to the binary string of only ones. The k-neighbors of G are the strings that have exactly k zeroes. The number of ways to choose
G EN OTY P E N ETW ORKS, SEL F - OR GANIZATION, AND NATURAL SEL ECTION
S
85
System size Size of the “universe”of possible biochemical reactions Number of possible regulatory interactions in a regulatory circuit Length (number of monomers) of an RNA or protein molecule
B
Number of different system “building blocks” B=2 for metabolic networks (reaction presence/absence) B≥3 for regulatory circuits B=4 for RNA B=20 for proteins
G
genotype
G
genotype differing from G in every one of S building blocks
u
fraction of a genotype’s neighbors with the same phenotype
D
distance between two nodes in a graph or between two genotypes in a genotype network, expressed as a fraction of system size S. (0≤D≤1)
KP
Total number of phenotypes.
MP
The number of genotypes with a phenotype P
Figure 6.2 Some notation used in this chapter. I note a minor deviation from notation in earlier chapters. For regulatory circuits, I earlier used the variable S to indicate the number of circuit genes (Chapter 3), which renders the total number of possible pairwise regulatory interactions equal to S2. In this chapter, I will use the variable S for this number of possible pairwise regulatory interactions. I do so to keep the mathematical notation for molecules, regulatory circuits, and metabolic networks commensurate.
k out of S elements, regardless of their order is given by the binomial coefficients (Equation 6.1). The same argument applies to any genotype G. The fundamental reason is that hypercubes are highly symmetric: The nodes of a hypercube graph have identical roles in it, much like the vertices in a cube. The number of k-neighbors given by Equation 6.1 increases with increasing k until k≈S/2, and it declines thereafter, as k approaches S. In the metabolic networks I discussed, genotypes can be characterized by the presence or absence of metabolic reactions. In regulatory circuits, one needs to distinguish at least three kinds of regulatory interactions (activating, absent, or repressing). Finer gradations of interaction strengths may be useful or necessary for many purposes. In RNA molecules, there are four kinds of nucleotides, whereas in proteins, there are 20 kinds of amino acids. Thus, the elementary parts of genotypes may have very different numbers of building blocks, be they metabolic reactions, strengths of regulatory interactions, nucleotides, or amino acids. I will refer to the number of such building blocks with the
variable B, and will consider here predominantly the values B=2 (metabolism), B=3 (transcriptional regulation), B=4 (RNA), and B=20 (protein). Wherever the number of building blocks exceeds B=2, an extension of the hypercube concept is necessary. For B>2, there are not just $2^S$ but $B^S$ genotypes, because each of S system parts can be made of B different building blocks. In even just one dimension (Figure 6.1a), each part of a genotype may then assume not just two values but B different values. The resulting hypercube graphs are even more difficult to understand geometrically than for B=2, although some features remain the same. Most importantly, despite the larger number of genotypes when B>2, the graph diameter remains at the small value of S. Genotypes are again connected by an edge if they differ in exactly one building block. However, each genotype now has (B–1)S 1-neighbors, because each of its S parts can change into one of (B–1) different building blocks. The number of k-mutant neighbors becomes:

$$(B-1)^k \binom{S}{k}. \tag{6.2}$$
The additional factor $(B-1)^k$ arises because each of the k changed parts of a genotype can change into B–1 other building blocks. In another difference from hypercube graphs for B=2, any one genotype G has not just one maximally different complement (or S-neighbor) G̅, but a large number $(B-1)^S$ of such maximally distant neighbors. The reason is, again, that each constituent part may adopt one of (B–1) values different from those in G.
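Equation 6.2 is easy to check by brute force for a small genotype space. The following Python sketch enumerates all genotypes, tallies them by their distance from a reference genotype, and compares the tallies with $(B-1)^k \binom{S}{k}$. The function names and the toy values S=6 and B=3 are illustrative choices, not taken from the text.

```python
from itertools import product
from math import comb


def k_neighbor_count(S: int, B: int, k: int) -> int:
    """Number of genotypes differing from a given genotype in exactly k of S parts (Equation 6.2)."""
    return (B - 1) ** k * comb(S, k)


def brute_force_counts(S: int, B: int) -> list:
    """Enumerate all B**S genotypes and tally their distance to the all-zero genotype."""
    counts = [0] * (S + 1)
    for genotype in product(range(B), repeat=S):
        distance = sum(1 for part in genotype if part != 0)
        counts[distance] += 1
    return counts


S, B = 6, 3  # small illustrative values, chosen so that full enumeration is cheap
formula = [k_neighbor_count(S, B, k) for k in range(S + 1)]
assert formula == brute_force_counts(S, B)
assert sum(formula) == B ** S  # the distance classes partition genotype space
print(formula)
```

Summed over all k from 0 to S, the counts add up to $B^S$, so every genotype falls into exactly one distance class around G.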
The existence of neutral neighbors suffices for the existence of genotype networks

In molecules, regulatory circuits, and metabolic networks, a genotype G typically has many neighbors with the same phenotype as G itself. As in Chapter 5, I will refer to such neighbors as neutral neighbors for brevity. I will denote the fraction of a genotype’s neighbors that are neutral as ν. It can vary widely among different phenotypes and genotypes, but typically ranges between 0.1 and greater than 0.5, as we saw in Chapters 2–4. I will now show that this feature is a sufficient condition for the existence of genotype networks that comprise a tiny fraction of genotype space, but that extend far through this space. To keep the mathematics simple, I will treat the problem as if all genotypes had the same number of neutral neighbors, and neglect the fact that different genotypes on a genotype network may have different numbers of neutral neighbors. This simplification is appropriate for my qualitative analysis. The analysis rests on the following idea. I construct different kinds of random graphs in which an approximate fraction ν […]

$$D > 1 - \frac{1}{\nu (B-1) S}, \tag{6.4}$$
as the graph diameter beyond which the addition of new nodes to one node $G_k$ of our random graph is expected to add fewer than one edge that increases the graph diameter further. This estimate focuses on a single genotype $G_k$ from which the graph is extended further. It neglects that each iteration may produce multiple genotypes $G_k$ that have the same maximal distance k from G. Each of them (and not just one of them) has a probability given by Equation 6.3 to increase the graph diameter further in the next iteration. The estimate also neglects that for large ν, small B, and $G_k$ with large k, the graph construction process I outlined may produce more than one edge connecting (k–1)-neighbors of G to $G_k$, thus reducing the number of (k+1)-neighbors that can be added to $G_k$ while keeping its fraction of neighbors on the graph approximately equal to ν. For these reasons, the estimate is rather crude, but it provides some qualitative insight. It tells us that as S and (B–1) increase, the diameter of such a random graph would move ever closer to the maximally possible value of one. The same holds for increasing values of the fraction ν of neutral neighbors, because as ν increases, the likelihood increases that at least one of the newly added neighbors increases the graph diameter. Even for modest S, small B, and values of ν close to the lower range of what we observe in typical metabolic, regulatory, and molecular genotype networks, D can be large. For example, even for a modest genotype size of S=100, merely B=2 building blocks, and ν=0.1, the above estimate yields D>0.9. The random graph so constructed may thus span some 90 percent of genotype space. In sum, random graphs in which each genotype has some modest fraction ν of neighbors will typically have large diameters. This means that genotype networks with this property will typically reach far through genotype space. The fundamental reason is the huge discrepancy between the total number $B^S$ of genotypes in a genotype space, which grows exponentially with genotype size S, and the diameter of the hypercube, which grows merely linearly.

Next, I will consider the typical size of a random graph whose nodes have an appreciable fraction ν of neutral neighbors. Naively, one might think that such a graph would also occupy a large fraction of genotype space. The purpose of this section is to show that this is not the case. To estimate the number of genotypes in such a graph, I will revisit the first two steps in the random graph construction process. A fraction ν of 1-neighbors of G are connected to G, and of those again a fraction ν connect to their neighbors, such that the fraction of 2-neighbors of G that become connected to G is of the order of $\nu^2$.

More precisely, consider an arbitrary 2-neighbor of G, regardless of whether it is part of our random graph. It is easy to see that any such 2-neighbor has exactly two edges that connect to it from 1-neighbors of G. The probability that this 2-neighbor of G is part of our random graph is then equal to the probability that at least one of these two edges links it to 1-neighbors of G. From the argument in the previous paragraph, the probability that none of these edges links it to 1-neighbors of G is of the order of $(1-\nu^2)$. The probability that at least one of these edges links it to 1-neighbors of G then simply becomes $\nu^2$. This argument holds for all 2-neighbors of G, which become connected to G via 1-neighbors of G (independently from one another and with the same probability, as prescribed in the construction procedure for our random graph). Thus, the number of 2-neighbors of G in our random graph becomes:
$$\nu^2 (B-1)^2 \binom{S}{2}.$$

The argument extends analogously to neighbors of increasing distance k of G, yielding:

$$\nu^k (B-1)^k \binom{S}{k} \tag{6.5}$$

for the number of k-neighbors of G that form part of this random graph. Overall, the number of nodes in the random graph is approximated by the sum of these individual contributions:

$$\sum_{k=0}^{S} \nu^k (B-1)^k \binom{S}{k}. \tag{6.6}$$
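As a rough numerical illustration, the sum in Equation 6.6 can be evaluated directly. The Python sketch below does so with exact rational arithmetic and expresses the result as a fraction of the $B^S$ genotypes in genotype space. The function names are illustrative, and the closed form $(1+\nu(B-1))^S$ used as a cross-check follows from the binomial theorem; it is an added observation, not a statement from the text.

```python
from fractions import Fraction
from math import comb, log10


def network_size(nu: Fraction, B: int, S: int) -> Fraction:
    """Estimated number of nodes in the random graph (the sum in Equation 6.6)."""
    return sum(nu ** k * (B - 1) ** k * comb(S, k) for k in range(S + 1))


def log10_fraction_of_space(nu: Fraction, B: int, S: int) -> float:
    """log10 of the estimated network size divided by the B**S genotypes in genotype space."""
    fraction = network_size(nu, B, S) / B ** S
    return log10(fraction.numerator) - log10(fraction.denominator)


S = 100
nu = Fraction(1, 10)
# Cross-check: by the binomial theorem the sum collapses to (1 + nu*(B-1))**S.
assert network_size(nu, 2, S) == (1 + nu * (2 - 1)) ** S

for B in (2, 3, 4, 20):
    print(B, round(log10_fraction_of_space(nu, B, S), 1))
```

For ν=0.1 and S=100, the printed exponents are strongly negative for all four values of B, which is in line with the contour range of Figure 6.3.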
I emphasize that Equation 6.6 is again a crude approximation. It neglects, for example, that only a fraction of neighbors of randomly chosen k-neighbors of G connect to (k+1)-neighbors of G. The simplification, however, may not affect the order-of-magnitude estimate (Equation 6.6) dramatically, if one considers that this fraction of neighbors decreases only linearly in k (Equation 6.3), whereas the fraction of (k+1)-neighbors added to the graph in each construction step changes exponentially in k. Figure 6.3 shows how the estimate (Equation 6.6) depends on the fraction ν of neighbors (horizontal axis), and on the number of building blocks B (vertical axis) for a moderately sized system of S=100 metabolic reactions, regulatory interactions, nucleotides, or amino acids. This genotype network size is expressed as a fraction of the total size $B^S$ of genotype space. The figure shows that this fractional size is very small. It would decrease further with increasing S. Equation 6.6 may be an imprecise estimate of genotype network size, but it serves to make the main point. Even if a genotype typically has many neutral neighbors, its genotype network may occupy a tiny fraction of genotype space. This also means that genotype space could host a myriad different genotype networks (one for each phenotype) that span this space or nearly so (Equation 6.4).

Figure 6.3 Genotype space-spanning random graphs can occupy a small fraction of sequence space. The horizontal axis shows the average fraction ν of neighbors in a random graph constructed as described in the text (ν from 0.1 to 0.5); the vertical axis shows the number B of building blocks that each genotype can be constructed of (B from 2 to 20). In the systems I discussed here, B included values of B=2 (metabolic networks), B=3 (regulatory circuits), B=4 (RNA), and B=20 (proteins). The contour lines indicate random graph sizes, expressed as fractions of the total size $B^S$ of the hypercube for S=100; the labeled contours range from $10^{-20}$ to $10^{-90}$.

A remaining aspect of the organization of genotype networks regards their connectedness. The observation that genotypes typically have many neighbors with the same phenotype makes it seem highly likely that genotypes form large connected networks. However, whether all genotypes would belong to one such network, or whether there might be more than one network, is less clear. Some work in graph theory aims at answering this question. This work typically focuses on genotype networks in the limit where the size S of a system approaches infinity. It demonstrates graph properties that exist with probability one in this limit. One such property is the existence of a giant component. A component is a part of a graph where every two nodes can be connected through a path of edges. A giant component contains a finite fraction of a graph’s nodes [70]. This means that if a graph’s size approaches infinity, then so does the size of a giant component. In this limit, components that are not giant contain only an infinitesimal fraction of a graph’s nodes. A graph can have more than one giant component. Relevant for my purpose is a graph theoretical result applying to random graphs on the hypercube for which an average fraction ν>0 of each node’s neighbors lie also on the graph.
In this case, the probability that a graph has a giant component becomes equal to one as the graph’s size approaches infinity [640]. If ν exceeds the threshold $\nu > 1 - (1/B)^{1/(B-1)}$, then there exists exactly one such component in the limit where S approaches infinity. This component spans the entire genotype space, that is, its diameter is equal to D=1 [640]. For metabolic networks, where B=2, this threshold is ν>0.5, for regulatory circuits (B=3), it is ν>0.42, for RNA (B=4), it would be ν>0.37, and for proteins (B=20), it would be ν>0.15. Sets of genotypes whose members have more than this fraction of neutral neighbors would be connected and span genotype space. Sets with a smaller fraction would not, but they would still form a giant component that contains most genotypes. The more building blocks B a system has, the smaller the average fraction ν of neighbors that suffices to generate a single genotype space-spanning genotype network. I emphasize again that all these considerations do not imply that actual genotype networks behave exactly like random graphs. They may differ in many respects, for example, in how the fraction ν of neutral neighbors varies from genotype to genotype. However, they show that even simple random graphs with some typical fraction ν>0 of neutral neighbors per node typically form vast connected sets that extend far (or all the way) through genotype space, despite comprising only a tiny fraction of this space. The root cause for these properties is the huge discrepancy between the diameter of this space (S) and the number of genotypes ($B^S$) in it.
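The two numerical claims of this section can be reproduced in a few lines. The Python sketch below assumes that Equation 6.4 reads $D > 1 - 1/(\nu(B-1)S)$ and that the connectivity threshold reads $\nu > 1 - (1/B)^{1/(B-1)}$. Both readings are reconstructions: the first reproduces the worked example D>0.9 for S=100, B=2, ν=0.1, and the second reproduces the four threshold values quoted above. The function names are illustrative.

```python
def diameter_lower_bound(nu: float, B: int, S: int) -> float:
    """Estimated relative diameter D (1.0 = all of genotype space), reading Equation 6.4 as stated above."""
    return 1.0 - 1.0 / (nu * (B - 1) * S)


def single_component_threshold(B: int) -> float:
    """Fraction of neutral neighbors above which a single space-spanning component is expected."""
    return 1.0 - (1.0 / B) ** (1.0 / (B - 1))


print(diameter_lower_bound(0.1, 2, 100))
for B in (2, 3, 4, 20):
    print(B, round(single_component_threshold(B), 2))
```

Both functions make the same qualitative point as the text: larger S, B, or ν push the diameter estimate toward one, and larger B lowers the fraction of neutral neighbors needed for a single connected network.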
Neutral neighbors are necessary for the existence of genotype networks

Thus far, I have shown that neutrality (ν>0) is sufficient for the existence of genotype networks. I will now show that it is also necessary. To see this, consider a typical phenotype P. It will be formed by some very large number $M_P$ of genotypes that typically constitute a very small fraction of genotype space. Let us assume that this set of $M_P$ genotypes consists of genotypes chosen at random from genotype space, without requiring that each genotype has many neighbors with the same phenotype. This assumption is another variant of a null hypothesis comparing genotype networks to random graphs. The question is whether many or most of these genotypes would be connected in a genotype network. To address this question, let us examine one such genotype G and its S(B–1) neighbors. What is the probability that this genotype is isolated, that is, that none of its neighbors are members of P’s genotype set? To answer this question, consider first the probability that a randomly chosen genotype from the $B^S-1$ genotypes different from G is not one of the S(B–1) neighbors of G. This probability is equal to one minus the number of neighbors of G divided by the $B^S-1$ genotypes different from G, i.e., $1-S(B-1)/(B^S-1)$. Similarly, the probability that a second genotype chosen at random from the now remaining $B^S-2$ genotypes is not a neighbor of G is $1-S(B-1)/(B^S-2)$. The same argument applies for a third, fourth, and further genotypes, until one reaches genotype number $(M_P-1)$, for which the probability that it is not a neighbor of G is given by $1-S(B-1)/(B^S-(M_P-1))$. From these expressions, we can calculate the probability that none of the $(M_P-1)$ genotypes different from G are neighbors of G as their product:

$$\prod_{i=1}^{M_P-1} \left(1 - \frac{S(B-1)}{B^S - i}\right).$$

I note that each factor in this product is at least as large as the last factor ($i=M_P-1$), such that the entire product is at least as large as the expression:

$$\left(1 - \frac{S(B-1)}{B^S - M_P + 1}\right)^{M_P-1} \approx 1 - \frac{S(B-1)(M_P-1)}{B^S - M_P + 1}. \tag{6.7}$$
The approximation in Equation 6.7 takes advantage of the relationship that $(1-x)^y \approx 1-yx$ for small x. The ratio $S(B-1)/[B^S-(M_P-1)]$ (corresponding to x) will indeed be very small, even for moderately large system size S. The reason is that the numerator of this ratio is linear in S, whereas the denominator is dominated by the term $B^S$, which is exponential in S. For the same reason, and because the number $M_P$ of genotypes with phenotype P is typically very small compared to the size $B^S$ of genotype space, the right-hand side of Equation 6.7 will be extremely close to 1. Thus, the probability that any one genotype in a set of genotypes with the same phenotype has no neighbors that are also in this set is extremely close to one. Because this holds for any genotype in this set, such a set would consist mostly of isolated nodes, and would not form a connected genotype network. More generally, one can show that a set of random genotypes must contain at least of the order of a fraction 1/S of all genotypes in genotype space before a giant connected component arises [74, 643]. In a genotype space of $B^S$ genotypes, this is a gigantic number of genotypes, larger than the genotype sets for most phenotypes in the systems I study. In sum, even a large set of random genotypes with the same phenotype would mostly consist of isolated nodes. Thus, the condition of neutrality, that many neighbors of a genotype have the same phenotype, is essential for the existence of connected genotype networks.
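To see how close to one the isolation probability is, the product above and the right-hand side of Equation 6.7 can be compared numerically. The Python sketch below uses exact fractions. The function names and the toy values (S=30, B=2, and 1000 genotypes with phenotype P) are hypothetical and chosen only to keep the loop short; realistic values of $M_P$ would be far larger.

```python
from fractions import Fraction


def isolation_probability(S: int, B: int, M_P: int) -> Fraction:
    """Exact product: probability that none of the other M_P - 1 random genotypes is a neighbor of G."""
    neighbors = S * (B - 1)
    prob = Fraction(1)
    for i in range(1, M_P):
        prob *= 1 - Fraction(neighbors, B ** S - i)
    return prob


def isolation_lower_bound(S: int, B: int, M_P: int) -> Fraction:
    """Right-hand side of Equation 6.7 (first-order approximation of the bound)."""
    return 1 - Fraction(S * (B - 1) * (M_P - 1), B ** S - M_P + 1)


# Hypothetical toy values: 2**30 genotypes, of which 1000 share phenotype P.
S, B, M_P = 30, 2, 1000
exact = isolation_probability(S, B, M_P)
bound = isolation_lower_bound(S, B, M_P)
assert bound <= exact <= 1
print(float(bound), float(exact))
```

Because $(1-x)^y \geq 1-yx$ for x between 0 and 1, the linearized expression is itself a lower bound on the product, so the assertion holds for any valid choice of parameters and not only for these toy values.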
High phenotypic diversity in genotype network neighborhoods is expected

I will now turn to the second major feature of genotype networks that is crucial for evolutionary innovation: their diverse genotypic neighborhoods. Specifically, I will show here that we can expect these neighborhoods to be diverse, even if different phenotypes were organized completely randomly in genotype space. By organized randomly I mean that if there are $K_P$ phenotypes in total, then each genotype is equally likely to adopt any one of these phenotypes. Actual phenotypes are certainly not distributed in this way, but this scenario serves again as a useful null hypothesis. Consider two genotypes, $G_1$ and $G_2$, on the same genotype network, where a fraction ν of each genotype’s S(B–1) neighbors have the same phenotype as $G_i$ itself. Let us focus on all the (1–ν)S(B–1) genotypes in the neighborhood of $G_1$ that have a phenotype different from $G_1$. I will assume for the moment that all these phenotypes are also different from each other. The question is what fraction of these phenotypes we would expect to find also in the neighborhood of $G_2$. The number of genotypes in the neighborhood of $G_2$ whose phenotype is different from that of $G_2$ itself is (1–ν)S(B–1). Under the null hypothesis, the probability p that any one randomly chosen phenotype in the neighborhood of $G_2$ is also found in the neighborhood of $G_1$ is equal to the number (1–ν)S(B–1) of phenotypes in the neighborhood of $G_1$, divided by the total number of phenotypes $K_P$. That is, $p=(1-\nu)S(B-1)/K_P$. Under this null hypothesis, the number of $G_2$’s neighbors whose phenotypes are identical to phenotypes in the 1-neighborhood of $G_1$ is then binomially distributed with (1–ν)S(B–1) trials and success probability p. The properties of the binomial distribution [236] imply that the expected number of phenotypes that are identical in the two neighborhoods is given by (1–ν)S(B–1)p, which is equal to:

$$\frac{(1-\nu)^2 S^2 (B-1)^2}{K_P}. \tag{6.8}$$
The numerator of Equation 6.8 is dominated by the term $S^2$. The denominator, the total number of phenotypes $K_P$, is generally much larger than $S^2$. It scales exponentially with the number of nutrients for metabolic phenotypes (Chapter 2); it also scales exponentially for gene expression phenotypes (Chapter 3) and for the RNA phenotypes I have discussed (Chapter 4); and, albeit not scaling exponentially, it may also be large for protein functions (Chapter 5). This means that, for most systems, the ratio in Equation 6.8 will be much smaller than 1. Thus, under the null hypothesis, the two neighborhoods are expected to share fewer than one common phenotype. If some of the phenotypes in a genotype’s neighborhood that are different from that of $G_1$ and $G_2$ are identical to one another (as is usually observed), then the numerator of Equation 6.8 would be even smaller. This means that the two neighborhoods would contain even fewer common phenotypes.

In reality, phenotypic neighborhoods are diverse, but not quite as diverse as Equation 6.8 suggests. In metabolic networks, regulatory circuits, and molecules, the neighborhoods even of very distant genotypes on the same genotype network share some fraction of phenotypes, while other phenotypes are unique to one neighborhood. The reason is that neighbors of a genotype G tend to adopt phenotypes similar to that of G itself. Specifically, many neighbors of a regulatory circuit genotype G typically have gene expression patterns similar to that of G; the metabolic phenotype of a metabolic genotype G’s neighbor must be similar to that of G, because the neighbor differs from G in only one reaction; and the structure of an RNA genotype G’s neighbors is often similar to that of G itself (Figure 4.8). Actual genotype networks thus violate the null hypothesis. But again, this hypothesis merely serves to show that in systems with many phenotypes, even a random organization of phenotypes in genotype space will lead to genotypic neighborhoods with highly diverse phenotypes.
Such diversity is thus expected, and not unusual for systems with many phenotypes.
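The expectation in Equation 6.8 is simple enough to evaluate for concrete numbers. The following Python sketch does so. The function name and the example values (an RNA-like system with S=100, B=4, ν=0.5, and one million phenotypes) are hypothetical and serve only to show the order of magnitude involved.

```python
def expected_shared_phenotypes(nu: float, S: int, B: int, K_P: float) -> float:
    """Expected number of novel phenotypes common to two neighborhoods (Equation 6.8)."""
    novel_neighbors = (1 - nu) * S * (B - 1)  # non-neutral neighbors of each genotype
    p = novel_neighbors / K_P                 # chance that a given phenotype recurs in the other neighborhood
    return novel_neighbors * p                # mean of a binomial with novel_neighbors trials


# Hypothetical example values, not taken from the text.
print(expected_shared_phenotypes(0.5, 100, 4, 1e6))
```

Because the numerator grows only quadratically in S while the number of phenotypes is typically far larger, the result stays well below one for any phenotype count that is large relative to $S^2$.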
Self-organization and natural selection

The word self-organization has many meanings [99, 210, 303, 557, 739, 746, 854]. Here I use it in the sense that collections of objects and their interactions bring forth structures on a higher level of organization. Such structures form from the bottom up, merely through properties of the objects and their interactions, and without any order imposed from the outside. Genotype networks are examples of self-organized structures. Here, the lower level objects are genotypes of molecules, regulatory circuits, and metabolic networks. The higher order structures are individual genotype networks, and their organization in genotype space. They emerge from the principles that guide how phenotypes form from genotypes, together with some basic features of genotype space, such as that it has a small diameter and many phenotypes, but fewer phenotypes than genotypes (Chapter 5). For more than a century, evolutionary biology has focused on natural selection as the key process explaining life’s enormous diversity. The occasional suggestion that self-organization may be equally, or more, important than natural selection has been decidedly heterodox [388]. In light of what I have said thus far, it is useful to re-examine the relationship between natural selection and self-organization. This relationship, it turns out, is not difficult to understand. The self-organization of genotype networks is essential for evolutionary innovation. If genotype networks did not exist, or if they were organized differently (Figure 5.5b–d), a vast world of molecules, regulatory circuits, and metabolic network phenotypes would be inaccessible to evolution. Conversely, imagine a world with self-organized genotype networks, but without natural selection. There would be no force preserving phenotypes.
Driven by mutations, a population of molecules, regulatory circuits, or metabolic networks with a given phenotype would drift aimlessly through genotype space, and thus lose this phenotype. The integrated organization of even the simplest organism would be unsustainable. These simple considerations show that natural selection and self-organization are equally necessary in evolution. The success of one in bringing forth innovation depends entirely on the other. Self-organization of genotype networks ensures that mutations can produce innovations, and natural selection ensures that innovations can be preserved. Self-organization is as essential for innovation as natural selection is for its preservation. Our insights into the self-organization of genotype networks are recent. In contrast, the discovery of natural selection goes back to Charles Darwin and Alfred Russel Wallace in the nineteenth century [162, 503]. It is thus little surprise that much more attention has been paid to selection in the history of evolutionary biology. Compared to what we know about selection, our ignorance about genotype networks is nearly complete, save for the few qualitative features I discuss here. Perhaps it is time to refocus our efforts to better understand this second key ingredient to life’s great success.
Innovation and “random” change

We generally think of evolution as driven by “random” mutations. I will next examine briefly how genotype networks may affect our views on randomness in evolution. I will only make some qualitative observations, because a nuanced discussion could itself fill a book. Randomness is made precise through the notion of a random variable, a mathematical object that can adopt a range of possible values, each of them with some probability drawn from a probability distribution [236]. Randomness can only be properly defined with such a set of values and a probability distribution in mind. One can distinguish two connotations of randomness pertinent to my subject. The first connotation regards the effect of mutations on an organism’s fitness. Here, the random variable is the change in fitness a mutation causes. It can assume three categories of values: beneficial, detrimental, and neutral. When biologists maintain that mutations are random, they often mean that mutations do not preferentially increase their carrier’s fitness. Mutations do not serve the interests of the organism in which they occur. There is widespread consensus among evolutionary biologists that mutations are random in this sense [168, 204, 259, 501, 706, 715, 716].
A second connotation, more important here, regards phenotypes. How do they change through a random mutation, for example, one that affects each system part (nucleotide, amino acid, regulatory interaction, or biochemical reaction) with the same probability? Here, the random variable is a phenotype. And its possible “values” are all possible phenotypes. If we call a mutation random that produces all phenotypes with equal probability, then mutations are decidedly non-random. First, a single mutation can only bring forth a minute fraction of phenotypes, whose identity depends on the mutated genotype and its neighborhood. Second, a substantial fraction of mutations do not affect phenotype at all. Both aspects of non-randomness are important, but let me highlight the second. As we saw here, it brings forth genotype networks with an organization that facilitates innovation. From this perspective, mutations are non-random in a way that enables evolutionary innovation.
Summary

Genotype networks that are large, occupy a small fraction of genotype space, traverse a large fraction of this space, and show highly diverse phenotypic neighborhoods are self-organized, emergent features of genotype spaces. The existence of many neutral neighbors per genotype gives rise to the first three features. It is both necessary and sufficient for them. The last feature, different genotypic neighborhoods that contain different novel phenotypes, merely requires that a system has many different phenotypes. Thus, the properties of the metabolic, regulatory, and molecular systems that facilitate innovation are self-organized features of genotype space. Mutations are non-random in a way that brings forth these features, and thus promotes evolutionary innovation. Natural selection and self-organization are both essential for evolutionary innovation. In Chapter 1, I quoted the geneticist de Vries who stated that natural selection cannot explain the origin of novel phenotypes [170]. More than 100 years later we can say this: Genotype networks help explain the arrival of the fittest, and natural selection permits their survival.
CHAPTER 7
A synthesis of neutralism and selectionism
Neutralism and selectionism are two opposing perspectives on evolutionary change. In the broadest sense, they apply to all evolutionary change, including evolutionary innovation. Any theory of innovation thus needs to have a position towards them. In this chapter, I first explain these two perspectives (see also ref. [829]). I then provide some background material on the population dynamics of neutral change. After that, I propose a synthetic view on neutralism and selectionism that can resolve the tension between them. In this view, neutralism and selectionism capture complementary aspects of biological reality. Genotype networks play a central role in it. This view also clarifies the role of molecular exaptations in innovation, an important concept that I will also discuss here [281, 282]. Most pertinent data for this chapter comes from molecules, but the major principles hold for all three major system classes of this book.
Selectionism and neutralism in a broad and narrow sense

With respect to innovations, a strict selectionist would maintain that all innovations arise through beneficial mutations. These mutations change a trait for the better when they first arise, and they constitute the innovation. For my purpose, the relevant traits are the structure and function of molecules, the expression phenotypes of genes, and a metabolic network’s biosynthetic abilities. In contrast, a neutralist would argue that mutations without any effect when they first arise might facilitate such innovation. I refer to these perspectives as selectionism and neutralism in the broad sense. Selectionism and neutralism are also used in a narrower sense. I need to explain this usage here, because it played an important role in the history of molecular evolutionary biology. In this usage, selectionism
and neutralism offer competing explanations of what causes observed genetic variation in populations. To understand this usage, one needs to be aware that the DNA of a genome is subject to three possible kinds of mutations. The first kind comprises deleterious mutations, which are harmful and subject to purifying selection. The second kind comprises neutral mutations, which do not affect fitness. The third kind comprises beneficial mutations, which increase fitness and are subject to positive selection. Neutralism and selectionism in the narrow sense agree that deleterious mutations are frequent in the evolution of genes and proteins. However, they profoundly disagree on the relative importance of neutral and beneficial mutations. In the words of Motoo Kimura, one of neutralism’s principal proponents, “. . . random fixation of selectively neutral or slightly deleterious mutations occur far more frequently in evolution than positive Darwinian selection of definitely advantageous mutants” [403, 536]. In contrast, selectionism posits that most mutations that attain high frequency or become fixed in a population would be beneficial, or be linked to abundantly occurring beneficial mutations. (An allele’s frequency is the number of copies of the allele in the population, divided by the number of individuals in haploid populations, or divided by twice the number of individuals in diploid populations. Fixation means that an allele attains a frequency of one, that is, it replaces all other alleles.) Strict selectionists, such as Ernst Mayr, dismiss the importance of neutral evolutionary change altogether [502, pp.204–214]. Neutralism and selectionism in this narrower sense originated early in the twentieth century [Ch.1 of ref. 402], but a debate about them only gained momentum in the 1960s. At that time, the first systematic observations on enzyme polymorphisms indicated that many wild populations contain great amounts of genetic variation [308, 453]. Neutralists proposed that most of this variation was caused by neutral mutations, whereas selectionists attributed it to beneficial mutations. The narrow and the broad usages of neutralism and selectionism are linked. For example, a strict selectionist view on molecular variation would also tend to favor selectionism with respect to evolutionary innovation. This is because mutations that dominate populations would tend to be the mutations that produce most evolutionary innovation. In recent years, the narrow-sense neutralist–selectionist debate has abated, for reasons I will discuss below, but the broader tension remains. After a brief introduction to concepts central to understanding the evolution of neutral mutations, I will first discuss experimental data that clearly supports the selectionist perspective on molecular evolution. I will then juxtapose this data with evidence that neutral mutations are critical to evolutionary innovation. Finally, I will suggest how to reconcile these lines of evidence into a synthetic perspective. Three predictions emerge from this synthesis, and I will discuss supporting evidence for them.
Evolutionary dynamics of neutral mutations

As mentioned above, a neutral allele is a genetic variant that does not affect an organism’s fitness. Such an allele’s frequency p in a population is influenced by genetic drift, a force of random evolutionary change that is strongest in small populations [402]. In a population of constant, finite size, genetic drift causes this frequency to fluctuate from generation to generation, because alleles get sampled from the previous, parental generation to form the next, offspring generation. In haploid organisms, the variance in the amount of change in allele frequency from generation to generation is given by V=p(1–p)/N [310]. The quantity N here is the effective population size, which reflects how many individuals actually contribute alleles to the next generation [310]. This, and all mathematical expressions below, hold for haploid populations, but they apply to diploid populations if one replaces every occurrence of N with 2N. The above expression for V shows that in small populations, allele frequencies fluctuate more than in large populations. Over time, these fluctuations will cause an allele to become either extinct (frequency p=0) or fixed (p=1). An allele newly arisen through mutation has a probability of 1/N to go to fixation. If it goes to fixation, it will take on average 2N generations to do so [310]. Over time, genetic drift will thus reduce genetic variation in a population. Mutations, on the other hand, continually introduce new genetic variation. Genetic drift and mutations are thus opposing evolutionary forces. They will reach a balance over time. This mutation–drift balance is influenced by the rate μ at which neutral mutations occur per generation in a gene or in any stretch of DNA. A population will most of the time be monomorphic (it will contain only one allele) if the product of population size N and mutation rate μ is much smaller than one (Nμ ≪ 1). […] Nμ > 1. These and many other predictions are made by the neutral theory of molecular evolution, a widely accepted body of work about the effects of genetic drift on populations [402].

Alleles that are not neutral have a selection coefficient s that indicates by how much their carrier’s fitness differs from a reference, non-mutant genotype. The fate of mutations whose selection coefficient s is much smaller than 1/(2N) is determined by drift rather than by selection, because generation-to-generation random allele frequency fluctuations are stronger than the influence of selection. Such mutations are also called effectively neutral [401]. However, even mutations whose selection coefficient is greater than 1/(2N) are influenced by drift. Specifically, weakly deleterious mutations can go to fixation, whereas weakly beneficial mutations can be lost, all through the influence of drift [227, 310, 676]. These considerations show that population size N is of central importance for the “visibility” of alleles to selection, and thus for their fate.
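The drift results quoted above (a fixation probability of 1/N and a mean fixation time of about 2N generations for a new neutral allele in a haploid population) can be illustrated with a small simulation. The Python sketch below assumes a simple Wright-Fisher sampling scheme, in which each of the N offspring draws its allele at random from the parental generation. The text does not name this model, so it is an assumption, as are the function names, the seed, and the toy population size N=20.

```python
import random


def fixation_outcome(N: int, rng: random.Random) -> tuple:
    """Follow one new neutral allele in a haploid population of constant size N.

    Returns (fixed, generations); `fixed` is True if the allele reached
    frequency one and False if it was lost.
    """
    copies, generations = 1, 0  # a new mutation starts as a single copy
    while 0 < copies < N:
        p = copies / N
        # Each of the N offspring draws its allele from the parental generation.
        copies = sum(rng.random() < p for _ in range(N))
        generations += 1
    return copies == N, generations


def simulate(N: int, trials: int, seed: int = 1) -> tuple:
    """Return the fraction of trials ending in fixation and the mean time to fixation."""
    rng = random.Random(seed)
    outcomes = [fixation_outcome(N, rng) for _ in range(trials)]
    fixation_times = [g for fixed, g in outcomes if fixed]
    fixation_fraction = len(fixation_times) / trials
    mean_time = sum(fixation_times) / len(fixation_times) if fixation_times else float("nan")
    return fixation_fraction, mean_time


fraction, mean_time = simulate(N=20, trials=5000)
print(fraction, 1 / 20)    # compare with the predicted fixation probability 1/N
print(mean_time, 2 * 20)   # compare with the predicted mean fixation time of about 2N
```

With so small a population the agreement is only approximate, because the 2N result is itself an approximation that works best for large N; the sketch is meant to convey the orders of magnitude, not to test the theory.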
Effective population sizes vary among species by more than five orders of magnitude, from typical values of 10^4 for vertebrates to values of up to 10^9 for prokaryotes. Effective population sizes generally decrease in larger and multicellular organisms [476]. Many alleles whose fate would be dominated by selection in prokaryotes would thus be evolving neutrally in vertebrates [226, 476]. The consequences may be far-reaching. For example, neutral evolution may
influence the size and complexity of genomes [476]. Genome complexity is much greater in higher eukaryotes, where drift is stronger, than in prokaryotes, where more mutations altering genome structure would be deleterious and get eliminated. Population size also affects the fates of the kinds of genotypes I study here, those of molecules, regulatory circuits, or metabolic networks. Whether the fitness of two such genotypes is indistinguishable depends on population size. Similarly, the number of members in a set of genotypes with indistinguishable fitness would expand or shrink in magnitude with changing population size. Again, the reason is that small fitness differences visible to selection in large populations become neutral in small populations. (Because the size of such neutral genotype sets is often astronomical, however, they may still be very large even when shrunk.) The organization of genes into chromosomes adds a further layer of complication. On the one hand, if a neutral mutation occurs physically close to a beneficial mutation, then the neutral mutation may be rapidly swept to a high frequency or to fixation, if recombination does not break up its association with the beneficial mutation. This phenomenon is also called “genetic draft” or “hitchhiking” [270, 271, 500]. Genomic regions where hitchhiking is frequent show reduced amounts of neutral genetic variation, similar to those caused by a reduced population size [270]. On the other hand, if a neutral mutation occurs close to a region where deleterious mutations segregate, the neutral mutation may be dragged to extinction along with the deleterious mutations. This phenomenon of “background selection” can affect polymorphisms and the time neutral alleles need to go to fixation [113]. Because recombination rates vary substantially among organisms and chromosomal regions, the impact of these phenomena on allelic variation may also vary [476, 478].
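The dependence of an allele's "visibility" to selection on population size can be illustrated numerically. The sketch below is an assumption-laden illustration: it uses the criterion stated earlier (drift dominates when s is much smaller than 1/(2N)), replaces "much smaller" with a sharp cutoff at 2N|s|=1 for simplicity, and uses a hypothetical selection coefficient of 10^-6. The function names are not from any library.

```python
def drift_selection_ratio(s, pop_size):
    """Return 2N|s|; values far below 1 mean that drift dominates the allele's fate."""
    return 2 * pop_size * abs(s)


def regime(s, pop_size):
    """Classify an allele, using a sharp cutoff at 2N|s| = 1 (a simplification)."""
    if drift_selection_ratio(s, pop_size) < 1:
        return "drift-dominated"
    return "visible to selection"


if __name__ == "__main__":
    s = 1e-6  # hypothetical selection coefficient
    for label, n in [("N = 10^4 (vertebrate-like)", 10**4),
                     ("N = 10^9 (prokaryote-like)", 10**9)]:
        print(label, drift_selection_ratio(s, n), regime(s, n))
```

Under these assumptions the same allele is classified as drift-dominated at N=10^4 and as visible to selection at N=10^9, which mirrors the contrast between vertebrates and prokaryotes drawn above.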
strongly depend on the environment. In any one organism, some fitness components may be unknown, whereas the influence of others on fitness may be missed, because they are manifest only in environments different from standard laboratory environments. A second problem is that experimental measurements of fitness components, such as microbial cell division rates, can resolve selection coefficients to a resolution of at most s=10^{-3}, but much smaller selection coefficients of s< [...]
[...] (ν>0) is both necessary and sufficient for the existence of genotype networks. Such networks, in turn, are essential for evolutionary innovation, because they allow access to vast amounts of phenotypic variation while preserving existing phenotypes. In sum, mutational robustness (ν>0) ensures that an evolving system can explore many different phenotypes, which facilitates innovation. Where does the naive argument that robustness should impede innovation go wrong? To see this, consider the extreme case of a minimally robust genotype G (ν=0) for any one of the systems we examined earlier. The total number of its neighbors is (B–1)S. Here, S is again the size of the system (number of amino acids, regulatory interactions, metabolic reactions), and B is the number of elementary building blocks (B=20 for proteins, B=4 for RNA, etc.). If
all of these neighbors’ phenotypes are different from each other, then the neighborhood of our hypothetical genotype contains the maximally possible number (B–1)S of novel phenotypes. It certainly contains more novel phenotypes than the neighborhood of a genotype with greater robustness (ν>0). Put differently, a greater number of novel phenotypes are just one mutation away from the minimally robust genotype, and thus immediately accessible. This observation encapsulates the gist of the earlier argument: the lower a genotype’s robustness, the greater its phenotypic variability in response to mutations. The argument has two flaws. First, typically the vast majority of mutations with a new phenotype— and thus perhaps all (B–1)S neighbors—are deleterious [458, 477]. To see the second flaw, consider a genotype G like the one we just considered, but where exactly one neighbor G’ has the same phenotype (ν=1/(B–1)S). Without changing the phenotype, novel phenotypes that lie in G’s neighborhood, and in the neighborhood of its neutral neighbor G’ are accessible to G. There are of the order of 2(B–1)S–1 such variants, approximately twice as many as if ν=0. In addition, the neutral neighbor G’ may itself have neutral neighbors, through which even more novel phenotypes become accessible, and so on. Earlier, we saw that ν is typically much greater than the minimal value of 1/(B–1)S I just assumed (Chapters 2–4). This larger fraction of neutral neighbors renders accessible a number of phenotypic variants that is astronomically larger than (B–1)S; it thus increases the chances to find rare beneficial variants, and it facilitates preservation of the initial phenotype. Robustness may well reduce the number of new phenotypes in the immediate neighborhood of a genotype, but it also allows access to its neutral neighbors, their neutral neighbors, and so forth, thus vastly expanding accessible variation. 
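The counting argument above can be made concrete by enumerating neighborhoods explicitly for a small system. The following sketch is illustrative only: the sequences, the four-letter alphabet, and the function names are hypothetical, and which neighbor is neutral is simply assumed rather than computed from a real genotype–phenotype map. For S=5 and B=4, each genotype has (B–1)S=15 neighbors. A genotype G and a single neutral neighbor G’ together give access to 26 distinct other genotypes. This is somewhat fewer than 2(B–1)S–1=29, because the two neighborhoods overlap at the mutated position, but it is still nearly twice the 15 neighbors available when ν=0.

```python
def neighbors(genotype, alphabet):
    """All genotypes that differ from `genotype` at exactly one position."""
    result = set()
    for i, current in enumerate(genotype):
        for letter in alphabet:
            if letter != current:
                result.add(genotype[:i] + letter + genotype[i + 1:])
    return result


def accessible_via_neutral_neighbor(genotype, neutral_neighbor, alphabet):
    """Genotypes one mutation away from G or from its (assumed) neutral neighbor G'."""
    pool = neighbors(genotype, alphabet) | neighbors(neutral_neighbor, alphabet)
    # G and G' share the same phenotype, so they are not counted as novel variants.
    return pool - {genotype, neutral_neighbor}


if __name__ == "__main__":
    alphabet = "ACGU"       # B = 4, as for RNA
    g = "AAAAA"             # hypothetical genotype G, S = 5
    g_neutral = "CAAAA"     # hypothetical neutral neighbor G'
    print(len(neighbors(g, alphabet)))                                   # 15
    print(len(accessible_via_neutral_neighbor(g, g_neutral, alphabet)))  # 26
```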
An essential condition for this positive role of robustness is that the number of phenotypes accessible by a point mutation must be much smaller than the total number of phenotypes [193]. This condition, however, is trivially fulfilled for any system of realistic complexity. In sum, the naive argument that robustness impedes phenotypic variability neglects this: robustness brings forth vast connected genotype networks, and renders an astronomical number of phenotypes accessible through them.
Mutual reinforcement on different levels of organization
I will next briefly point out that the hierarchical organization of biological systems can reinforce the positive effect of robustness on innovation. In previous chapters, I have focused on different levels of biological organization and discussed them separately. However, molecules, regulatory circuits, and metabolic networks interact. Specifically, robustness on one level of organization can reinforce robustness on another level. Consider, for example, a regulatory change that reduces an enzyme’s expression, and thus its concentration and activity. This enzyme may be embedded in a metabolic network where its function is dispensable, because metabolites that this enzyme cannot process can be rerouted through other reactions. If so, then this regulatory change affects the protein’s activity, but not the organism’s metabolic phenotype. The same may hold for an amino acid change in the enzyme that creates or increases one activity but reduces another. Such changes are not uncommon [68, 135]. If this change can be readily buffered by a metabolic network, it will be neutral on the level of the metabolic phenotype. It is not difficult to see that interactions like these, which mutually reinforce robustness on different levels of organization, can further enhance the accessibility of novel phenotypes, while leaving an existing phenotype unchanged.
Lessons from artificial systems
In studying biological systems, we are trying to understand an existing relationship between genotype and phenotype. This relationship determines system properties such as robustness. One therefore cannot easily manipulate robustness alone to find out how it affects evolutionary innovation. To do so, one would have to change the very relationship between genotype and phenotype, the “map” between them. Such manipulation, however, is easy in artificial mappings from genotype to phenotype. Their details may have little bearing on biology, but they have one key merit: they can be designed with tunable robustness. They allow us to ask, everything else being equal, whether robustness facilitates evolutionary adaptation and innovation. I will next briefly discuss a few relevant examples from artificial systems.
One pertinent study used an artificial genotype-to-fitness map, where phenotypes are only represented through fitness values [554]. By design, the chosen map had tunable robustness, as represented by the number of genotypes with the same fitness. In this study, evolutionary searches encountered genotypes with high fitness more readily if robustness was high [554]. Another study focused on a different kind of map between genotype and phenotype. It examined two scenarios. In the first scenario, an evolving population can explore a genotype network neutrally to find a superior new phenotype. In the second scenario, the robustness implied by such neutrality is absent, and a population needs to cross a fitness “valley” of inferior genotypes to reach a new phenotype [799]. The neutrally evolving population found the new phenotypes faster by orders of magnitude. A third study focused on kinds of maps important in computer science, including cellular automata, a class of dynamical systems with discrete state “phenotypes”; their “genotype” corresponds to simple logic rules that prescribe how the system’s state changes [205]. This study showed that evolutionary searches involving maps with greater robustness find novel phenotypes more easily [205]. A final example involves self-replicating computer programs whose “phenotype” is their ability to compute a set of logic functions. In long evolutionary searches, programs more robust to random changes in their computing instructions can be more effective in discovering new phenotypes [215]. These observations from artificial systems, where robustness can be tuned, support what we learn from biological systems. With the insight that robustness creates genotype networks, it is easy to understand observations like these from one common unifying framework.
Robust phenotypes may facilitate innovation
The argument I have presented thus far is qualitative and based on the very existence of genotype networks. Once we know about this existence, we can go one step further: We can ask about finer-grained aspects of genotype space organization, and how they affect a system’s ability to explore novel phenotypes. One such aspect regards the size of different genotype networks, and how this size
affects the ability to explore novel phenotypes quantitatively. As I discussed earlier (Chapters 2-4), the sizes of different genotype networks vary by many orders of magnitude. Moreover, genotypes in larger genotype networks have, on average, more neighbors with the same phenotype [124, 640, 830]. In addition, within any one genotype network, some genotypes have few neighbors, others have many neighbors with the same phenotype. I now discuss how these properties affect a system’s ability to explore novel phenotypes. The first part of my analysis focuses on molecules, because they have been studied in the most detail. In the second part, I show how observations on molecules translate to other system classes. In this part of my analysis I will compare properties of different genotype networks. This amounts to comparing properties of phenotypes, because each genotype network is associated with a different phenotype. It is thus useful to introduce a new concept: the robustness of a phenotype. Recall that the robustness of a genotype is the fraction ν of its neutral neighbors. By extension, the robustness of a phenotype is the average fraction ν of neutral neighbors for each of the genotypes with this phenotype. This robustness generally increases with the number of genotypes that adopt a phenotype, as Figure 8.1a shows. The data in the figure are based on RNA molecules sampled at random from genotype space, but the same holds for RNA molecules with known biological functions. For example, in a sample of 82 RNA molecules with known and diverse biological functions, the number of sequences folding into a structure, and their average robustness show a high Spearman’s rank correlation coefficient of 0.92 [378]. In other words, phenotypes with large genotype networks are also more robust. 
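The definitions of genotype robustness and phenotype robustness given above can be written down directly for a genotype space small enough to enumerate. The sketch below is a toy illustration under stated assumptions: the binary genotypes of length 4 and the phenotype rule (at least two "1" letters) are hypothetical and stand in for a real map such as RNA folding; the function names are likewise invented for this example.

```python
from itertools import product


def neighbors(genotype, alphabet):
    """All genotypes that differ from `genotype` at exactly one position."""
    return {genotype[:i] + letter + genotype[i + 1:]
            for i, current in enumerate(genotype)
            for letter in alphabet if letter != current}


def genotype_robustness(genotype, phenotype_of, alphabet):
    """Fraction of neighbors with the same phenotype (the quantity called nu in the text)."""
    nbrs = neighbors(genotype, alphabet)
    same = sum(phenotype_of(n) == phenotype_of(genotype) for n in nbrs)
    return same / len(nbrs)


def phenotype_robustness(length, phenotype_of, alphabet):
    """Map each phenotype to (genotype network size, mean nu over its genotypes)."""
    values = {}
    for letters in product(alphabet, repeat=length):
        g = "".join(letters)
        values.setdefault(phenotype_of(g), []).append(
            genotype_robustness(g, phenotype_of, alphabet))
    return {p: (len(v), sum(v) / len(v)) for p, v in values.items()}


def toy_phenotype(genotype):
    # Hypothetical rule: the phenotype is True if the genotype has at least two "1" letters.
    return genotype.count("1") >= 2


if __name__ == "__main__":
    for phenotype, (size, robustness) in phenotype_robustness(4, toy_phenotype, "01").items():
        print(phenotype, size, round(robustness, 3))
```

In this toy map, the phenotype with 11 genotypes has a mean ν of 8/11, whereas the phenotype with 5 genotypes has a mean ν of 0.4, so the larger genotype set is also the more robust one. This agreement with the pattern described above holds for the toy rule only and says nothing about real molecules.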
My qualitative observations from the previous sections showed that the mere existence of neutral networks makes many different phenotypes accessible without the need to change an existing phenotype. In a more quantitative approach, one could ask how many different phenotypes occur in the neighborhood of an entire genotype network. This neighborhood comprises all genotypes one mutant away from a genotype network, and thus genotypes immediately accessible from it. Such an
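[Figure residue: panels (a) and (b) of a plot that is not recoverable; panel (b) is annotated with Spearman's s=0.64.]
[Figure 9.1, image: simplified phylogeny with the taxon labels Dicots, Monocots, Basal Angiosperms, Gymnosperms, Bryophytes, and Green algae, and the outgroups Fungi and Animals; gene-count labels 1 and 1-4.]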
Figure 9.1 Duplications of MADS box genes are associated with innovations in flowering plants. The plant phylogeny shown is highly simplified. It contains major plant groups, as well as fungi and animals as outgroups [667]. Superimposed on this phylogeny are the numbers of known MADS box genes from organisms that include yeast, nematodes, and fruit flies; the green alga Coleochaete scutata; and the eudicotyledon Arabidopsis thaliana (thale cress) [649, 755, 766]. The lower bound of MADS box genes in eudicotyledons is based on the MADS MIKC subfamily, as suggested by [808]. Numbers of MADS box genes are minimal numbers and could fluctuate within taxonomic groups. The images depict C. scutata [370], and a flower of A. thaliana [352] (Courtesy of Vivian Irish, Yale University.) Figure and legend adapted from [828].
Example 2: The rise of vertebrates
Much like the radiation of flowering plants, the radiation of vertebrates created a spectacularly successful and diverse group of organisms. Hox genes played a key role in this diversification. These genes encode transcriptional regulators and are named after the homeobox, a DNA sequence they contain and that encodes their DNA-binding domain. Hox genes pattern many structures along the head–tail body axis, including the hindbrain, the vertebral column, and the limbs [508]. Most animals contain multiple Hox genes that are adjacent to one another, forming one or more clusters of genes close together on a chromosome. Their spatiotemporal expression pattern along the head–tail axis corresponds to their chromosomal order in a Hox gene cluster. Many invertebrates have a single cluster of Hox genes. This cluster underwent at least two duplications during vertebrate evolution, duplications that led to four Hox gene clusters in many vertebrates [442].
Vertebrates have numerous innovations relative to their chordate ancestors [699]. Examples include not only the sophisticated brain of higher vertebrates, but also cartilage, teeth, and bone. These tissues serve many roles, ranging from support to feeding. The evolution of bone in turn gave rise to the most obvious and striking vertebrate innovations, which include hinged jaws, the vertebral column, and paired appendages. The latter allowed new forms of swimming, walking, and flying that made many ecological niches accessible. Various duplicate Hox genes are critical for the development of vertebrate-specific traits, suggesting that Hox genes also play important roles in the evolution of these traits. Some of the Hox genes that were duplicated during vertebrate evolution have evolved new functions. Their functional divergence often involved acquiring novel gene expression rather than novel biochemical activities [89, 131, 292,
790]. A case in point is the duplicate Hox genes Hoxa3 and Hoxd3 [292]. The effect of a mutation in one of these two genes depends on which gene is mutated. For example, Hoxa3 mutants show defects in pharyngeal tissues, whereas Hoxd3 mutants have malformed cervical vertebrae [121, 136]. However, expressing one gene where the other is normally expressed, and vice versa, indicates that the two genes can substitute for each other’s function, as long as they are expressed in the same way. Their differences may therefore be caused by quantitative expression changes [292]. Similarly, the protein products of the Hoxa1 and Hoxb1 genes have different functions in hindbrain development. Nonetheless, one of them can substitute for the other when it is expressed at the right time and place [790].
Example 3: Heart evolution
The preceding examples are about spectacular evolutionary radiations from which many diverse species emerged, but gene duplications may also facilitate innovations in individual traits. One example regards the heart, a pump that drives fluid through the body. Such a pump becomes necessary in organisms too large to distribute nutrients and oxygen through diffusion. Hearts originated with a simple architecture, a contractile tube with bidirectional blood flow. A prototypical example is the heart of amphioxus (lancelets), which is thought to resemble that of a basal vertebrate. More advanced hearts are more complex. For instance, the heart of amniotes, including mammals and birds, is a complex four-chambered pump with two atria and two ventricles that separate oxygen-poor from oxygen-rich blood. The heart acquired its sophisticated structure during vertebrate evolution. Fish hearts have a single atrium and a single ventricle; amphibian hearts have two atria and one ventricle; vertebrate hearts additionally acquired septa to separate the heart’s chambers, valves to enforce unidirectional flow, and a conduction system for synchronized and forceful heart contraction [667]. A core circuit of transcription factors controls heart development in vertebrates and invertebrates. These factors include proteins named NK2, MEF2, GATA, Tbx, and Hand [154, 578]. Their coding genes have duplicated during vertebrate evolution
[578] (Figure 9.2). One of them, MEF2 (myocyte enhancer factor 2), regulates the expression of contractile muscle proteins. The fruit fly Drosophila has only one MEF2 gene. If this gene loses its function, the expression of contractile proteins in muscle cells ceases [391, 631]. Vertebrates, in contrast, have four MEF2 duplicates with partly divergent functions [62]. A case in point is MEF2c. If it loses its function, a subset of contractile proteins in the heart ceases to be expressed. In addition, the right ventricle no longer forms [461]. Because the population of cells from which the right ventricle forms occurs only in amniotes, the function of MEF2c in its development is arguably novel. Another, particularly striking example involves the Hand gene. Zebrafish (Danio rerio) and amphibians express a single copy of this gene. Both kinds of animals have only one ventricle. The zebrafish Hand gene is necessary for the formation of this single ventricle [877]. In contrast to these organisms, mice express two duplicates of Hand. Among other defects, loss-of-function mutants in Hand1 cannot form the left ventricle, whereas Hand2 mutants fail to form the right ventricle [92, 242, 650, 723]. Thus, the two duplicates have acquired functions that are not only specialized to one of two ventricles. Their functions are also novel, because the structures they help form did not yet exist in fish. All three examples I have just discussed—flowering plant radiation, vertebrate radiation, and heart evolution—share a conspicuous association between gene duplication and complex evolutionary innovations. Based on such associations, many researchers argue that gene duplication is key to innovation, and reasonably so [135, 354, 573, 833]. Unfortunately, such associations are no proof that gene duplications were necessary for innovation. 
And because the processes I have discussed unfolded over tens to hundreds of millions of years, far beyond the time scales of laboratory evolution experiments, we may never have such proof. Although this observation is an important note of caution, gene duplications are too abundant, and their association with innovation too striking to dismiss their importance for innovation. I will thus show next how they fit into the framework I propose in this book.
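[Figure 9.2, upper panel, reconstructed as a table: numbers of duplicate cardiac transcription factor genes and of heart chambers in four chordate groups]

                 Cephalochordates   Fish   Amphibians   Amniotes
                 (Amphioxus)                            (reptiles, birds, mammals)
Hand             1                  1      1            2
MEF2             1                  4      4            4
GATA             2                  3      3            3
Tbx              ?                  ≥4     ≥5           ≥7
NK2              1                  ≥3     ≥3           ≥2
Heart chambers   1                  2      3            4

[Figure 9.2, lower panel: schematic hearts ordered by increasing complexity; the four-chambered heart is labeled Left Atrium, Right Atrium, Left Ventricle, and Right Ventricle.]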
Figure 9.2 Duplications of genes involved in heart development. The upper panel (gray) shows the number of duplicates in different groups of chordates (below the panel, together with their characteristic number of heart chambers) for five genes (Hand, MEF2, GATA, Tbx, NK2) encoding transcriptional regulators with central functions in heart development. The lower panel shows a highly simplified vertebrate phylogeny, together with schematic illustrations of a primitive heart (left), and the four-chambered vertebrate heart (right). From [578, 828]. Used with permission from AAAS.
Gene duplications cause robustness
After a gene duplication that creates identical duplicates, both duplicates have redundant functions. Mutations in one of them are thus less likely to be deleterious than before the duplication. Among the many kinds of mutations that can affect a genome (from single nucleotide changes to chromosome rearrangements), gene duplications are unique in this way: only they systematically increase robustness to mutations. This increase in robustness is evident from two complementary lines of experimental evidence. The first comes from efforts to study the function of individual genes. An important approach to study gene function is to eliminate (“knock out”) a gene or its expression, and to examine the phenotype of the resulting mutant. In many genes whose knockouts have little or no phenotypic effect, gene duplications are behind the absence of such effects [16, 134, 266, 295, 768, 824, 851]. A second line of evidence comes from molecular evolution studies. Duplicate genes can tolerate more nucleotide changes than their single copy counterparts, and are thus under relaxed selection. The phenomenon is most evident if one examines duplicate genes on a genome-wide scale. Here, recent gene duplicates in various eukaryotes can tolerate 10-fold more amino acid changes than older duplicates [417, 479]. I note that remnants of such robustness still exist for the duplicate genes discussed in the preceding sections on organismal innovation. For example, in the thale cress Arabidopsis thaliana, individual duplicates of the flower development genes SEPALLATA show only weak phenotypic effects if their function is lost due to mutations [187, 605]. Similarly, some Hox gene duplicates have retained partially redundant functions, remnants of the robustness that gene duplications cause. Examples include zebrafish Hoxa2 and Hoxb2, which function redundantly in embryonic patterning of the second pharyngeal arch [345]; and the mouse Hox8 genes, which have redundant roles in positioning the hindlimbs [795].
Many new phenotypes become accessible through duplication
The observation that gene duplications (temporarily) increase robustness of duplicated genes is important, because, as we saw in Chapter 8, robustness can facilitate evolutionary innovation. First, it permits the existence of genotype networks. Second, the cryptic genetic variation that a population of molecules with a highly robust phenotype can accumulate allows it to access many novel phenotypes. Gene duplications not only systematically increase robustness, they also do so in a peculiar way that increases the size of the genotype space in which an evolutionary search can take place. In doing so, they increase the number of different phenotypes that can be explored without destroying an existing phenotype. It is easiest to appreciate this phenomenon with a simple example. In Chapter 4, I discussed the enzyme chorismate mutase with its 93 amino acids. For this enzyme, an estimated fraction 10^{-24} of genotype space encodes proteins with the same structure and activity [761]. This fraction would translate into a total number of 10^{-24}×20^{93}≈10^{97} chorismate mutase genotypes. Now consider a hypothetical duplication of a gene encoding this protein. After the duplication, one of the duplicates must maintain its function, whereas the other is free to vary. The genotype space of both proteins taken together has the squared size of the original genotype space. It thus contains (20^{93})^2 genotypes. Mathematically speaking, it is the Cartesian product of the two original genotype spaces. Figure 9.3 illustrates how we can think of the two combined spaces geometrically, although, as usual, my two-dimensional caricature does not do justice to the high-dimensional nature of genotype spaces. The two open circles in the middle represent two
(identical) gene duplicates immediately after duplication, and the jagged lines terminated by an arrow indicate the independent evolutionary trajectories that the duplicates can take as they begin to change independently in an evolving population. Even if we restricted both evolving duplicates to their respective genotype networks, as shown in the figure, an evolving population containing them could access twice as much phenotypic variation as before the duplication. The reason is that both duplicates undergo mutation independently. They thus explore their respective genotype network independently, and gain access to novel phenotypes in its neighborhood independently. The effect is equivalent to doubling the size of a population of molecules that explores a given genotype network (in the absence of duplication). In reality, the effect of gene duplication would be much more dramatic than indicated by this argument. For as long as one of the proteins is confined to its genotype network, the other is free to explore the genotype space. If a complete exploration of this space were possible, then the total number of genotypes that would become accessible through the duplication, while preserving the chorismate mutase phenotype, is given by 10^{97} (the number of chorismate mutase genotypes) times 20^{93} (the size of genotype space, for the freely evolving molecule), or about 10^{218}, more than a hundred orders of magnitude greater than the original genotype network. This calculation neglects that the genotype space is much too large to be explored by any one molecule. However, even if the second molecule could merely explore the k-neighborhoods of its genotype network for some small value of k, a vast number of phenotypes would become accessible that is beyond reach if only the immediate 1-neighborhood can be explored. My exposition so far has focused on duplications of individual genes, but much the same holds for larger scale duplications.
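The orders of magnitude in the chorismate mutase example can be verified with a few lines of arithmetic on base-10 logarithms. The sketch below is only a check of that arithmetic, assuming the values given in the text (20 amino acids, 93 positions, a fraction of 10^-24); the function names are hypothetical.

```python
import math


def log10_genotype_space(building_blocks, length):
    """log10 of the number of genotypes, i.e. of building_blocks ** length."""
    return length * math.log10(building_blocks)


def log10_network_size(building_blocks, length, log10_fraction):
    """log10 of the genotype network size, given the log10 of its fraction of the space."""
    return log10_genotype_space(building_blocks, length) + log10_fraction


def log10_accessible_after_duplication(building_blocks, length, log10_fraction):
    """One duplicate stays on the network while the other roams the whole space."""
    return (log10_network_size(building_blocks, length, log10_fraction)
            + log10_genotype_space(building_blocks, length))


if __name__ == "__main__":
    print(round(log10_genotype_space(20, 93)))                     # about 121
    print(round(log10_network_size(20, 93, -24)))                  # about 97
    print(round(log10_accessible_after_duplication(20, 93, -24)))  # about 218
```

Working with logarithms avoids handling the very large integers directly and makes the gain of roughly 121 orders of magnitude easy to read off.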
For example, whole genome duplications that occurred in vertebrates and many other organisms duplicated entire regulatory circuits [793]. As far as the genes of these circuits are concerned, the same reasoning as above applies: Such duplications free individual circuit genes to explore new structures and activities, as long as one of the duplicates preserves the old activity. But with the duplication of a regulatory circuit,
Figure 9.3 Gene duplications increase the size of the search space for novel phenotypes. The left and right parallelograms stand for the genotype space associated with each of two duplicate genes. Inscribed into the parallelograms are the genotype networks of each duplicate’s phenotype, with gray circles corresponding to genotypes with the same phenotype, and gray lines indicating neighboring genotypes. Symbols of different shapes and shading indicate genotypes with novel phenotypes that are only one mutation away from the genotype network shown. The two open circles in the middle stand for two hypothetical identical duplicate genes immediately after duplication. The identical protein structures (from chorismate mutase, Protein Data Bank identifier: 2gtv [201]) are shown merely to indicate that after a duplication that creates identical duplicates, the duplicates’ phenotypes will be identical. The jagged black lines illustrate that each duplicate mutates independently, and thus explores genotype space independently from the other duplicate.
not only the circuit genes, but also their regulatory interactions undergo duplication. This permits, in addition, the exploration of novel regulatory interactions, and thus of new gene activity or expression phenotypes, while the original phenotype can be preserved. This regulatory divergence is more difficult to analyze experimentally than the divergence of individual molecules. For example, while duplicate genes mutate and diversify independently from one another, regulatory interactions in a duplicate regulatory circuit are intertwined: They involve both original and duplicate genes. Taken together, these observations show that the kind of robustness caused by gene duplications can increase access to novel phenotypes. But as if this was not enough, gene duplications also solve another problem that is common in evolutionary innovation: an old function of a system may need to be preserved not only during exploration of genotype space, but also afterwards, after a genotype with a novel function has been found. And often, a single genotype may not be able to execute both functions, or execute them equally well. A candidate example would be an enzyme that needs to catalyze reactions with two different substrates at high rate, or a regulatory protein that needs to bind two very different DNA sequences with high specificity. Here, gene duplications can preserve the
old function of one duplicate, while facilitating not only the origin, but also the optimization of a novel function in the other duplicate [56, 135, 328].
Summary
Gene duplications increase mutational robustness in a way that greatly increases the accessibility of novel phenotypes. They can preserve an existing phenotype not only during a search for novel phenotypes, but also afterwards, once such phenotypes have been encountered. It is thus not surprising that duplications have been associated not only with dramatic innovations in individual organs, such as the heart, but also with the diversification of vast groups of organisms, such as flowering plants and vertebrates. Although we may never have absolute certainty about their involvement in innovations whose origin and refinement may require many millions of years, it would be surprising if their association with such innovations was coincidental. In the most general terms, my observations here can be summarized in a syllogism, a logical argument dating back to ancient Greece, where two premises are used to infer a conclusion. The first premise is that gene duplication causes robustness; the second premise is that robustness can facilitate innovation. The conclusion is that gene duplications can facilitate innovation.
CHAPTER 10
The role of recombination
Thus far, I have focused on the smallest unit of genetic change, mutations that affect one system part, be it a nucleotide, an amino acid, a regulatory interaction, or a metabolic gene. As we have seen, a sequence of such small changes can transform a system gradually while leaving its phenotype intact, and yet allow it to explore many novel phenotypes. Mutations are undoubtedly important for evolutionary innovation [68, 330, 607, 684]. However, recombination, a larger scale kind of change (Figure 10.1), may be at least as important [414, 576, 694, 888, 890]. From the perspective I discuss here, recombination causes long jumps in genotype space. By reaching into far-flung regions of this space, such jumps facilitate the exploration of different phenotypes. To see this, recall that neighborhoods in far-apart regions of this space contain very different novel phenotypes. In other words, recombination can be more effective than mutation for exploring new phenotypes. (I will discuss some experimental evidence below.) But while long jumps may facilitate phenotypic exploration, they also cause a major problem. The genotype network of any one phenotype typically comprises a tiny fraction of genotype space. One would thus think that a long jump through this space should almost always end outside this network. If so, then a key benefit of genotype networks, the preservation of the old during a search for the new, would be lost. The central theme of this chapter is that recombination, perhaps surprisingly, does not suffer from this problem. For example, recombination perturbs phenotypes much less than mutation. It has weak effects on existing phenotypes, and yet it can help explore novel phenotypes that are very different from a starting phenotype. These two properties make recombination a powerful facilitator of evolutionary innovation.
Different kinds of recombination
Perhaps the simplest kind of recombination is homologous recombination, as it occurs during meiosis. When taking place between two protein-coding regions, such recombination exchanges parts of these regions and leaves their length unchanged (Figure 10.1a). Other kinds of recombination are more complex and involve unequal exchange among the recombining partners. Such exchange is facilitated if two molecules share at least short stretches of identical DNA sequence. Unequal exchange can occur, for example, between protein-coding genes that share short, repeated motifs of DNA sequences. It can also occur between genes that encode multiple, similar protein domains. Such recombination can create novel proteins with new and unique domain combinations. The genomes of higher organisms encode thousands of proteins with multiple domains, and the same domain is often found in many different proteins of different functions. These observations speak to the power of recombination to create novel proteins [93, 269, 418, 449, 580]. In addition to this role in generating novel proteins, recombination can also cause change on a much larger scale. It can rearrange genomes, causing duplications, inversions, and translocations, where entire chromosome segments are swapped. Repetitive DNA, and in particular transposable elements, plays an important role in such recombination [458, 576]. Wherever repetitive DNA occurs in a genome, it facilitates an increased incidence of unequal recombination. In humans, some 50 percent of the genome consists of repetitive DNA, much of it derived from transposable elements [432, 806]. In addition to their passive role in facilitating recombination, transposable elements can also generate recombinant DNA through their active movement.
(a) MPTYIHELLYTLLLTYLSSPSPRSGPLRSGPLRFRRIQHINSPSPSSTRAVLASFSEENLIPD
MPTYIHELLYTLLLTYLSSPSPRSGPLRSGPLRFRRIQHINSPSPSSRATVLASFSEENPSSD
MYPTIHEPPYTLLLTYLSSHLPRSGPLRSGPLRFRRIQHINSPSPSSRATVLASFSEENPSSD
(b)
Figure 10.1 Recombination in proteins and regulatory circuits. (a) The left side of the figure shows two hypothetical “parental” protein sequences. The right side shows a chimaeric protein resulting from a reciprocal recombination event between these parents. (b) The left side of the figure shows two hypothetical regulatory circuits that differ in regulatory interactions (black and grey arrows) among circuit genes (black rectangles). The right side shows a chimaeric circuit created through recombination between the parents. The recombinant circuit contains a mix of the parents’ regulatory interactions.
When inserting into a gene, they can lead to the creation of novel coding regions; and when inserting near a gene they can affect its regulation. Some 50 human genes consist largely of sequences derived from transposable elements, and many more contain exons derived from such elements [576]. A final form of recombination is lateral gene transfer [96, 445, 568, 569]. It is typically not a reciprocal exchange of DNA, but a unidirectional transfer from a donor to a host genome. Aside from the addition of new genes, it thus does not change the host’s gene content as radically as other kinds of genome rearrangements. For this reason, lateral gene transfer may be less disruptive. Therefore, it poses less of a problem for preserving well-adapted
phenotypes than other kinds of recombination. I will not discuss it in detail here. The non-reciprocal kinds of recombination I just discussed can change the size of a system, be it the length of a molecule, or the number of genes in a cellular circuit. This property makes it difficult to compare the substrates and the outcome of non-reciprocal recombination. Simply put, the reason is that recombination substrates and products exist in genotype spaces of different dimensions. This may represent a fundamental obstacle to analyzing the effects of non-reciprocal recombination systematically [724]. Because my objective is such a systematic analysis, I will thus here focus on reciprocal recombination, whose substrates and products occur in the same
genotype space. The principles I will describe, however, may well apply to all kinds of recombination.
The power of recombination
I will highlight the power of recombination with DNA shuffling [731], a widely used technique to engineer novel proteins and higher order systems in the laboratory [111, 152, 153, 443, 552, 719, 731, 888]. Briefly, DNA shuffling starts from a mix of different “parental” variants of equally long DNA sequences, such as different alleles of a gene. These sequences are cut into small fragments at random locations within them, denatured (made single-stranded), and reannealed to render them double-stranded again. The result is a complex mixture of partially single-stranded, partially double-stranded chimaeric DNA sequences. The double-stranded regions of these DNA sequences are then extended by DNA polymerase in a polymerase chain reaction. After multiple cycles of denaturation, reannealing, and synthesis of new DNA using the polymerase chain reaction, recombinant DNA molecules of the same length as the parental DNA emerge. Each such molecule consists of multiple recombined fragments of the parental DNA molecules [731]. Two experiments that use this technique illustrate the power of recombination. Crameri and collaborators applied DNA shuffling to recombine genes encoding cephalosporinases. These are enzymes that confer resistance to cephalosporins, a class of antibiotics. The aim of the experiment was to create cephalosporinases that confer resistance against high concentrations of antibiotics. A single DNA shuffling experiment recombined four cephalosporinases that showed between 18 and 42 percent divergence on the DNA level [153]. The experiment yielded a chimaeric cephalosporinase with a 270-fold increase of resistance to the cephalosporin moxalactam, as compared to the parental sequences. By comparison, the highest improvement achievable in the same amount of time through point mutations was an 8-fold increase over the parent [153].
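The outcome of DNA shuffling, chimaeras of parental length assembled from fragments of several parents, can be mimicked in silico. The following is a minimal sketch under simplifying assumptions, not a model of the laboratory protocol: the function name shuffle_chimaera, the uniformly random cut points, and the uniformly random choice of donor for each segment are all hypothetical choices made for illustration.

```python
import random

def shuffle_chimaera(parents, n_crossovers, rng):
    """Build one chimaera from equally long parental sequences.

    Hypothetical simplification: cut points are uniform random positions,
    and each segment is copied from a randomly chosen parent.
    """
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parental sequences must have equal length")
    # Distinct interior cut points, sorted, bracketed by both sequence ends.
    cuts = sorted(rng.sample(range(1, length), n_crossovers))
    bounds = [0] + cuts + [length]
    segments = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        donor = rng.choice(parents)          # parent that donates this segment
        segments.append(donor[start:end])    # same positions, so length is kept
    return "".join(segments)

rng = random.Random(0)
# Toy parents chosen so that the donor of each position is visible.
parents = ["AAAAAAAAAAAA", "CCCCCCCCCCCC", "GGGGGGGGGGGG"]
chimaera = shuffle_chimaera(parents, n_crossovers=3, rng=rng)
print(chimaera)
```

Because every segment is copied from the same positions in its donor, the chimaera has the parental length, which is the property that keeps substrates and products in the same genotype space.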
The same approach can also shuffle DNA sequences on a much larger scale, recombining DNA containing multiple genes or entire genomes [888]. For example, recombination of entire genomes has been used to produce strains of the bacterium
Streptomyces fradiae that produce high amounts of the antibiotic tylosin. In this approach, recombination was 20 times more effective than random mutagenesis in improving tylosin production [888]. These and other experiments show that experimental recombination of DNA sequences can rapidly generate new genes, pathways, and genomes with new and desirable features [111, 152, 153, 443, 552, 719, 731, 888].
Recombination preserves existing gene expression phenotypes in regulatory circuits
Engineering experiments like these illustrate the power of recombination to identify phenotypes with novel properties. However, they are not designed to examine the central problem recombination poses: it might disrupt already existing, well-adapted phenotypes, and thus often have large deleterious effects. Specifically, the superior genotypes that these experiments find might be few among an astronomical number of potentially inactive chimaeras [888]. I will now address this problem of deleterious recombination effects for regulatory circuits and molecules (Figure 10.1). I will not discuss recombination in genome-scale metabolic networks here, for two reasons. First, in prokaryotes, the kind of frequent, obligate recombination that is characteristic of meiosis is absent, and horizontal gene transfer, as I discussed, will usually be less disruptive to phenotypes. Second, in eukaryotes with their frequent and often obligate meiotic sex, individuals in the same interbreeding population are typically similar in their genotypes. They would differ in the presence of few (if any) of their hundreds to thousands of metabolic reactions. Recombination would thus usually not alter their metabolic genotype dramatically. I will begin by examining the effects of recombination in transcriptional regulatory circuits of the kind studied in Chapter 3. Consider two individuals (“parents”), each of which harbors a regulatory circuit genotype that produces a gene expression phenotype. Both individuals have identical phenotypes. Thus, they belong to the same genotype network or genotype set (Figure 10.2). Their genotypes may differ in one or more regulatory interactions.
Figure 10.2 Recombination causes long jumps through genotype space. The figure illustrates schematically that the offspring of a recombination event may lie far from either parent in genotype space. Its neighborhood will thus contain novel phenotypes that are different from those accessible near either parent (Chapter 5). The large rectangle stands for genotype space. Small grey circles connected by lines indicate neighboring genotypes on one hypothetical genotype network. Symbols of different shapes and shading indicate genotypes with a novel phenotype that are just one mutation away from genotypes on this genotype network. The large black and white circles indicate two hypothetical parental genotypes. The large grey circle stands for a recombinant offspring of the two parents. In the image, the offspring genotype is equally distant from either parent, but in reality it may be closer to one or the other parent, depending on details of the recombination event that produced it. The offspring may lie on the same genotype network, and thus have the same phenotype as the parents, as indicated in this hypothetical example; or it may lie outside this genotype network and thus have a different phenotype.
These two individuals produce offspring through a reciprocal exchange of their regulatory genotypes. If all circuit genes occurred in a closely linked gene cluster on a single chromosome, then the likelihood
of a meiotic recombination event between them would be small. For this reason, I will here consider the opposite scenario, where the individual genes of a circuit are not closely linked, and thus recombine freely. This scenario is important, because here the potentially deleterious effects of recombination will be most evident. Specifically, this scenario requires that every gene in each “offspring” network receives with probability one half the regulatory region of one of the parents, and with probability one half the regulatory region of the other parent. (Recall that in these circuits we focus on the evolution of regulation through changes in cis-regulatory regions.) A quantity of interest is the probability that the offspring of recombination between two parents would no longer have the parental phenotype. This probability indicates the disruptive effects of recombination. A complication is that its value will depend on how different the parents are from one another. Recombination between genotypically similar parents will produce offspring whose genotypes are also similar to either parent. Thus, we would expect that their phenotypes are also often unchanged. Conversely, genotypically very different parents would usually produce genotypically and phenotypically diverse offspring. One way to take parental similarity into account is to compare the offspring’s genotype to one of the parents, and determine the number m of regulatory interactions in which it differs from this parent. To this end, I will examine the probability RR(m) that a recombination event changing m regulatory interactions of a viable circuit preserves its phenotype. It is useful to compare this quantity to the probability Rμ(m) that m independent random changes (mutations) of individual regulatory interactions preserve the phenotype. 
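The free-recombination scheme just described can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in [492]. It assumes that a circuit is stored as a matrix whose row i holds the regulatory region of gene i (the interaction strengths acting on that gene); the function names recombine and interactions_changed and the toy matrices are hypothetical. Estimating RR(m) itself would additionally require a phenotype model such as the one in Chapter 3, which is not reproduced here.

```python
import random

def recombine(parent1, parent2, rng):
    """Free recombination: each gene (row) inherits its whole regulatory
    region from parent1 or parent2, each with probability one half."""
    offspring = []
    for row1, row2 in zip(parent1, parent2):
        offspring.append(list(row1 if rng.random() < 0.5 else row2))
    return offspring

def interactions_changed(circuit_a, circuit_b):
    """Number m of regulatory interactions (matrix entries) that differ."""
    return sum(a != b
               for row_a, row_b in zip(circuit_a, circuit_b)
               for a, b in zip(row_a, row_b))

# Two toy 3-gene circuits; entry [i][j] is the effect of gene j on gene i.
parent1 = [[0.0, 1.2, 0.0], [-0.7, 0.0, 0.0], [0.0, 0.4, 0.0]]
parent2 = [[0.0, 1.2, -0.3], [-0.7, 0.0, 0.9], [0.0, 0.0, 0.0]]

rng = random.Random(1)
child = recombine(parent1, parent2, rng)
m = interactions_changed(child, parent1)          # distance to parent 1
total = interactions_changed(parent1, parent2)    # distance between parents
print(m, total)
```

Because whole rows are inherited, the offspring's distances to the two parents always add up to the distance between the parents. Every interaction in the offspring has also already been part of a viable circuit, which is what distinguishes m recombinational changes from m random mutations.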
By comparing the two quantities RR(m) and Rμ(m), we can assess how strongly recombination affects a genotype when compared to an equivalent amount of mutational change. Figure 10.3 shows RR(m) and Rμ(m) for circuits sampled at random from the same genotype set [492]. One can see that for recombination events that change only m=1 regulatory interaction, recombination is already less likely to change a
[Figure 10.3 plot. Vertical axis: robustness, from 0.0 to 1.0. Horizontal axis: number m of regulatory interactions changed, from 1 to 12. Series: recombinational robustness RR(m) and mutational robustness Rμ(m).]
Figure 10.3 Recombination has weaker phenotypic effects than mutation. The figure shows the probabilities Rμ(m) and RR(m) that m changes of individual regulatory interactions caused by mutation and recombination, respectively, leave a circuit’s gene expression phenotype intact. The data is based on 106 circuits of S=12 genes randomly sampled from the same genotype network [492]. Lengths of error bars show one standard deviation. A mutation may cause (i) an existing interaction to disappear, in which case the respective interaction strength wij (Chapter 3) is set to zero; (ii) a new regulatory interaction to appear, in which case the new value is chosen as a Gaussian random variable with mean zero and variance one (N(0,1)); and (iii) an existing interaction to change in magnitude. In the latter case, the sign of the interaction is forced to remain unchanged by choosing a Gaussian (N(0,1)) random variable and multiplying it by (–1) if it is of the wrong sign. For the data shown, the number of regulatory interactions per circuit lies in the interval (S, 3S), and gene expression states E(0) and E∞ differ in the expression of half of their genes. All relevant circuit properties depend only on how different these two gene expression states are [123]. Similar observations exist for circuits of different size. Figure and caption adapted from [492], used with permission from Genetics Society of America.
circuit phenotype. Specifically, more than 90 percent of recombinant offspring that differ from their most closely related parent by only one regulatory interaction preserve the parental phenotype. In contrast, only 75 percent of circuits where mutations changed one regulatory interaction preserve this phenotype. With increasing numbers of changes m, these differences increase. For example, when a recombination event changes m=12 regulatory interactions, 50 percent of all offspring circuits preserve the parental phenotype, whereas fewer than 8 percent of circuits with 12 random mutations preserve this phenotype (Figure 10.3). These observations show that exchanging regulatory interactions that are already part of a viable circuit greatly increases the likelihood of preserving the circuit’s phenotype.
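For comparison with recombination, the three kinds of mutation listed in the caption of Figure 10.3 can be written as a single function acting on one interaction strength. This is a sketch of the caption's description only. The function name mutate_interaction and the string labels "loss", "gain", and "magnitude" are hypothetical, and how [492] chooses which interaction to mutate, or which kind of mutation to apply, is not specified here.

```python
import random

def mutate_interaction(w, kind, rng):
    """Return the mutated strength of one regulatory interaction w."""
    if kind == "loss":
        # (i) An existing interaction disappears: strength is set to zero.
        return 0.0
    if kind == "gain":
        # (ii) A new interaction appears with an N(0,1) strength.
        return rng.gauss(0.0, 1.0)
    if kind == "magnitude":
        # (iii) An existing (nonzero) interaction changes in magnitude.
        # Draw from N(0,1) and multiply by -1 if the sign is wrong.
        draw = rng.gauss(0.0, 1.0)
        same_sign = (draw > 0) == (w > 0)
        return draw if same_sign else -draw
    raise ValueError(f"unknown mutation kind: {kind}")

rng = random.Random(2)
print(mutate_interaction(-0.5, "magnitude", rng))  # stays negative
```

A magnitude mutation therefore never turns an activating interaction into a repressing one, or vice versa. A mutation of kind (ii), in contrast, introduces an interaction that has never been part of a viable circuit.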
The following is a complementary way of examining the effects of recombination [492]. If the parent circuits differ in I regulatory interactions, then one of the recombinant offspring will differ from one parent by m regulatory interactions, whereas the other offspring will differ by (I–m) regulatory interactions. We can then express the distance of the offspring from either parent as a fraction of I, i.e., as a recombination distance DR=m/I. This recombination distance varies between 0 and 1. A value of DR close to zero means that the offspring is close to the reference parent, whereas a value of DR close to one means that the offspring is very distant from the reference parent, but very close to the other parent. Intermediate values of DR mean that the offspring is approximately equally distant from either parent.
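A small worked example may help fix the definition of the recombination distance. The helper below is a hypothetical illustration; its name and argument names are not taken from [492]. It returns DR for both offspring of one reciprocal exchange, measured against the same reference parent.

```python
def recombination_distances(m, total_differences):
    """Return D_R = m/I for one offspring and (I - m)/I for the other.

    m is the number of interactions in which the first offspring differs
    from the reference parent; total_differences is I, the number of
    interactions in which the two parents differ.
    """
    if total_differences <= 0:
        raise ValueError("parents must differ in at least one interaction")
    if not 0 <= m <= total_differences:
        raise ValueError("m must lie between 0 and I")
    d_first = m / total_differences
    d_second = (total_differences - m) / total_differences
    return d_first, d_second

# Parents differing in I = 12 interactions, first offspring with m = 3.
print(recombination_distances(3, 12))  # (0.25, 0.75)
```

The two distances always sum to one, so an offspring that is close to the reference parent has a sibling that is correspondingly close to the other parent.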
[Figure 10.4 plot. Vertical axis: fraction of recombinant networks with preserved phenotype, from 0.4 to 1.1. Horizontal axis: recombination distance DR, from 0 to 0.95. Series: sample, mutation–selection, and mutation–selection–recombination.]
Figure 10.4 Recombination’s disruptive effect on a phenotype depends on the distance of an offspring circuit from either parent. The vertical axis shows the fraction of recombinant offspring circuits with the same phenotype as the parent, as a function of the recombination distance DR between parent and offspring (horizontal axis, see text). The recombination distance is normalized to values ranging between zero and one. Data are shown for parental circuits sampled uniformly from the same genotype network (“sample”), for circuits from a population in mutation–selection balance, and for circuits from a population in mutation–selection–recombination balance. Note the very high fraction of viable recombinants for the population in mutation–selection–recombination balance. Data is based on circuits with S=12 genes, number of regulatory interactions per circuit in the interval (S,3S), as well as initial (E(0)) and equilibrium (E∞) gene expression states where 50 percent of genes differ in their activity. All relevant circuit properties depend only on how different these two gene expression states are [123]. The middle and upper curves are based on populations of 1000 circuits, and μ=1 mutations of regulatory interactions per circuit and generation. Lengths of error bars correspond to one standard deviation, and are too small to be visible for any of the data points shown. From [492], used with permission from Genetics Society of America.
Figure 10.4 examines the relationship between the recombination distance DR and the probability that recombination preserves the parental phenotype. For now, I will focus on the lower-most set of points (closed circles). These data are based on parental regulatory circuits that are sampled at random from a set of genotypes with the same phenotype [492]. The figure shows that offspring very similar to one of the parents, where DR is close to zero or one, are very likely to preserve the parental phenotype. The likelihood that a recombination event preserves the phenotype has a parabolic, U-shaped distribution, with a minimum at intermediate recombination distances DR. This means that recombination is most likely to change a phenotype if the recombinant circuit is maximally different from either parent.
I will return to the significance of this figure for my main argument shortly.
Recombination preserves protein structure and function
The weak effects of recombination in Figure 10.3 may be peculiarities of transcriptional regulation circuits. Alternatively, they may be generic properties that hold for broader classes of systems, and that reflect fundamental organizational properties of genotype space. A mix of computational and experimental evidence from proteins argues for the latter possibility [156, 196]. One such study focused on lattice proteins, the computational models of protein folding I discussed in Chapter 3 [156]. Its authors studied sequences that fold into the same structure and subjected pairs of such
sequences to recombination. They found that 78.9 percent of recombination products fold stably into a structure, and the vast majority of them (99.3 percent) adopt a structure identical to that of the parents. Another relevant study compared the effects of recombination in real proteins and lattice proteins [196]. The study’s authors estimated the probabilities RR(m) that a recombination event changing m amino acids preserves protein structure, and compared it to the probability Rμ(m) that the same number of mutational changes preserves secondary structure. For lattice proteins with Rμ(1)=0.1, that is, where 10 percent of a protein’s 1-mutant neighbors have the same structure, they found that the fraction of recombination events that change a single amino acid and preserve protein structure is RR(1)≈0.7. In other words, recombination is seven times more likely than point mutation to preserve a lattice protein’s structure. For mutationally more robust proteins where Rμ(1)=0.5, RR(1) exceeds 0.85. For larger numbers m of amino acid changes, mutations typically have dramatically more disruptive effects than recombination. For example, for a structure that remains intact with a probability of less than 1 percent after five independent mutations (Rμ(5)1) or monomorphic (Nμ0.9999 and RR(2)=0.998 [492]. In contrast, the probability that one or two mutations leave this phenotype intact is Rμ(1)=0.94 and Rμ(2)=0.88. This means that the genetic load is dominated by the effects of mutation and not recombination. A second observation needed to understand the phenomenon of Figure 10.6 regards the product of population size and mutation rate Nμ. In a population that is monomorphic most of the time (Nμ