CBMS-NSF REGIONAL CONFERENCE SERIES IN APPLIED MATHEMATICS

A series of lectures on topics of current research interest in applied mathematics under the direction of the Conference Board of the Mathematical Sciences, supported by the National Science Foundation and published by SIAM.

GARRETT BIRKHOFF, The Numerical Solution of Elliptic Equations
D. V. LINDLEY, Bayesian Statistics, A Review
R. S. VARGA, Functional Analysis and Approximation Theory in Numerical Analysis
R. R. BAHADUR, Some Limit Theorems in Statistics
PATRICK BILLINGSLEY, Weak Convergence of Measures: Applications in Probability
J. L. LIONS, Some Aspects of the Optimal Control of Distributed Parameter Systems
ROGER PENROSE, Techniques of Differential Topology in Relativity
HERMAN CHERNOFF, Sequential Analysis and Optimal Design
J. DURBIN, Distribution Theory for Tests Based on the Sample Distribution Function
SOL I. RUBINOW, Mathematical Problems in the Biological Sciences
P. D. LAX, Hyperbolic Systems of Conservation Laws and the Mathematical Theory of Shock Waves
I. J. SCHOENBERG, Cardinal Spline Interpolation
IVAN SINGER, The Theory of Best Approximation and Functional Analysis
WERNER C. RHEINBOLDT, Methods of Solving Systems of Nonlinear Equations
HANS F. WEINBERGER, Variational Methods for Eigenvalue Approximation
R. TYRRELL ROCKAFELLAR, Conjugate Duality and Optimization
SIR JAMES LIGHTHILL, Mathematical Biofluiddynamics
GERARD SALTON, Theory of Indexing
CATHLEEN S. MORAWETZ, Notes on Time Decay and Scattering for Some Hyperbolic Problems
F. HOPPENSTEADT, Mathematical Theories of Populations: Demographics, Genetics and Epidemics
RICHARD ASKEY, Orthogonal Polynomials and Special Functions
L. E. PAYNE, Improperly Posed Problems in Partial Differential Equations
S. ROSEN, Lectures on the Measurement and Evaluation of the Performance of Computing Systems
HERBERT B. KELLER, Numerical Solution of Two Point Boundary Value Problems
J. P. LASALLE, The Stability of Dynamical Systems; Z. ARTSTEIN, Appendix A: Limiting Equations and Stability of Nonautonomous Ordinary Differential Equations
D. GOTTLIEB AND S. A. ORSZAG, Numerical Analysis of Spectral Methods: Theory and Applications
PETER J. HUBER, Robust Statistical Procedures
HERBERT SOLOMON, Geometric Probability
FRED S. ROBERTS, Graph Theory and Its Applications to Problems of Society
JURIS HARTMANIS, Feasible Computations and Provable Complexity Properties
ZOHAR MANNA, Lectures on the Logic of Computer Programming
ELLIS L. JOHNSON, Integer Programming: Facets, Subadditivity, and Duality for Group and Semi-Group Problems
SHMUEL WINOGRAD, Arithmetic Complexity of Computations
J. F. C. KINGMAN, Mathematics of Genetic Diversity
MORTON E. GURTIN, Topics in Finite Elasticity
THOMAS G. KURTZ, Approximation of Population Processes
Mathematics of Genetic Diversity
J.F.C. Kingman University of Oxford Oxford, England
SOCIETY FOR INDUSTRIAL AND APPLIED MATHEMATICS PHILADELPHIA
Copyright © 1980 by the Society for Industrial and Applied Mathematics. All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Library of Congress Catalog Card Number: 80-51290
ISBN 0-89871-166-5

SIAM is a registered trademark.
Contents

Foreword
Terminology and Notation

Chapter 1. THE PROBLEM
1.1 Why mathematics?
1.2 Genes and their inheritance
1.3 Selection
1.4 Mutation

Chapter 2. SURVIVAL OF THE FITTEST
2.1 Balanced polymorphisms
2.2 Multi-locus selection
2.3 Balance between selection and mutation
2.4 The house of cards
2.5 The diploid house of cards
2.6 The resistance of polymorphisms to mutation

Chapter 3. THE NEUTRAL ALTERNATIVE
3.1 Evolution in the absence of selection
3.2 A general model for mutation in finite populations
3.3 The random walk case
3.4 The frequency spectrum
3.5 The Ewens sampling formula
3.6 The Poisson-Dirichlet distribution
3.7 Partition structures
3.8 Testing neutrality

Chapter 4. SELECTION IN FINITE POPULATIONS
4.1 Deleterious mutants
4.2 The Wright-Fisher model
4.3 Wright's formula
4.4 The infinite alleles limit

Appendix I. AN INEQUALITY
Appendix II. THE GENEALOGY OF THE WRIGHT-FISHER MODEL
Bibliography
Foreword A Regional Conference held at Iowa State University, in June 1979, under the auspices of the Conference Board of the Mathematical Sciences and of the National Science Foundation, has given me this opportunity of drawing together some mathematical ideas which have in recent years been useful in population genetics. Of course, population genetics covers a wide area of science, and even those aspects which admit some mathematical analysis would be impossible to survey exhaustively in a slim volume. The present account, like the lectures on which it is based, concentrates on a few aspects which seem to be both biologically relevant and mathematically interesting. My choice is entirely personal, and the book should be read as the adventures of one mathematician in the fascinating, but formidably complicated, world of population genetics. Were it not for the efforts of Oscar Kempthorne and his colleagues at ISU, the conference would not have taken place, nor would it have attracted a distinguished audience whose comments have improved both form and content of these chapters. To them I express my thanks, but I acknowledge too my debt to Warren Ewens and Geoff Watterson, whose profoundly original work has inspired, as their friendship has encouraged, my own contributions to the subject. Finally, I dedicate this book to John and Charlotte, excellent examples of its theme of genetic diversity. J. F. C. K. Oxford, 1979
Terminology and Notation

The now standard notation $\mathbb{R}$, $\mathbb{R}^n$, $\mathbb{N}$, $\mathbb{Z}$ is used for, respectively, the real line, $n$-dimensional Euclidean space, the set of natural numbers 1, 2, 3, ..., and the set of (positive and negative) integers. Probability and expectation are denoted by $P$ and $E$. Words like "positive" or "increasing" are to be understood in the weak sense unless qualified by the word "strictly."
CHAPTER 1
The Problem 1.1. Why mathematics? The essence of the application of mathematics to any branch of science is the recognition and exploitation of pattern or regularity. The regularity may be rigid and striking, like the procession of the planets around the sun, or it may be a dimly observed tendency hardly distinguishable amidst a general confusion. The biological sciences more often yield examples of the latter than of the former kind, but genetics is unusual because, despite the many complex and unpredictable factors which govern the everyday life of a biological organism, it inherits its genetic make-up from its parents according to a mechanism which does possess a definite structure. Indeed, the inductive process by which Mendel inferred the existence of genes and their role in transmitting heritable characteristics was itself a fine example of mathematical reasoning. As it became accepted that the evolution of a natural population was to be understood in terms of Mendelian genetics and Darwinian natural selection, so also it became clear that this understanding could not be sought only at a qualitative level. The biologist needed to know how fast evolution would be expected to take place, how much genetic diversity would persist in a population, how intrusive would be the effect of various chance factors, and so on. Such questions led to the work of such scientists as R. A. Fisher, J. B. S. Haldane and Sewall Wright, who analysed models of populations in which Mendel's laws were combined with assumptions about the effect of different features of the natural environment. One of the goals of "mathematical population genetics" has been to explore the mechanisms which maintain diversity in a population. In any group of humans, it is rare to find two who look alike, and any superficial resemblance (save between identical twins) is unlikely to reflect a genetical identity. 
A similar observation would be made by an educated fruit fly about his fellow Drosophila; it is a general characteristic of natural populations to exhibit great genetic diversity. Yet if Darwin was right to stress "survival of the fittest", should not the less fit characteristics have been driven out by the fitter? How is the observed diversity to be reconciled with natural selection? This is not a question with a single answer, since there are many different mechanisms that could explain the persistence of variety. The mathematician can construct models to display such mechanisms at work, so that the biologist can compare them with experiment and observation, and so (perhaps) arrive at a judgement as to which mechanism is at work in any particular population. The purpose of the mathematical analysis is to start with a set of biological
assumptions and to explore the way in which a population would evolve if those assumptions were satisfied. Typically this process also clarifies and refines the assumptions themselves, and helps to sharpen the questions which the scientist asks of the world. It also serves as the starting point for the statistician, who can set up sensitive methods for using available data to discriminate between different models, and to estimate the parameters which appear in them.

In all this, the basic framework is supplied by the existing biological concepts and experimental techniques. As science advances, so new mathematical problems are posed and, probably, new mathematical tools must be forged. One such advance is a major theme of these notes: the realisation that, at a particular chromosome locus, the number of possible alleles is almost unlimited. Since the classical theory concentrated on loci with two, or perhaps three or four, alleles, this is clearly a fundamental change of emphasis. But of course mathematics thrives on large numbers, and the new emphasis has led to interesting mathematics, some of which is described herein.

1.2. Genes and their inheritance. In order to establish a framework for the mathematics, it may be helpful to begin with a greatly over-simplified account of the basic genetical theory. Most interesting organisms are diploid, which means that the genetical information which governs their development is carried by several pairs of chromosomes. Thanks to modern advances in molecular biology, we now have a picture of what a chromosome is; it is essentially a very long string of symbols drawn from a four-letter alphabet (the four different nucleotides of DNA). Most of the symbols appear to have no meaning, but at certain particular places on the chromosome there is a meaningful string of (several hundred) symbols. This is called the gene at that particular locus.
Thus we can think of a gene as a message written in this four-letter alphabet, on a tiny label which is attached to the chromosome at a particular place. The possible messages that could be written on that label are called the alleles of that locus. A diploid organism has a chromosome from each of its parents, so at a particular locus it has two genes (which may or may not be the same allele). It often makes sense to concentrate attention on a particular locus, so that the genetic character of an individual is described by specifying its two genes at that locus, but it should always be remembered that the genetics really depends on all loci at once (some aspects of this fact will be mentioned in §2.2). An organism reproduces by mating with another, and usually contributes one of its chromosomes, as does its mate, to each child. Here there are many possible complications. Many species are dioecious, which means that they have two sexes which must be considered separately. The choice of mate may be (and in species like our own invariably is) correlated with genetic characteristics. Moreover, the chromosomes may break up and recombine in the reproductive process, so that what is passed on to the next generation is not a perfect copy of the parental chromosome. These and other effects ought to enter any realistic model of the reproductive
process, but they will not do so in the models discussed in this monograph. We shall assume that our populations are monoecious (have only one "sex") and that each individual chooses its mate randomly from the population. Sometimes the same methods work, and sometimes similar conclusions are valid, for dioecious populations (as, for example, a recent account by Ethier and Nagylaki [1] suggests). Our main interest will be with the gene frequencies in a large population at a particular locus. If there are N individuals in the population, there are 2N chromosomes containing this locus. For each allele, we consider the proportion of the 2N chromosomes at which the gene is of this allele. This gives a probability distribution over the set of possible alleles which clearly describes the genetic make-up of the population as far as the chosen locus is concerned. The problem is to model the dynamical process by which this distribution changes in time, or from generation to generation. In some populations there are well-defined breeding seasons, and it therefore makes sense to talk of successive generations. This is the point of view adopted here, although some species (like our own) breed all the time and the generations soon get mixed up. We then have to work in real time, with allowance for the ages at which mating takes place; the analysis is more complicated but not essentially different. It is convenient to have a definite model for the reproductive process in a monoecious, randomly mating population. Suppose we have a population whose size N is held constant (for example, by constraints on living space or food supply). Direct attention to a particular locus. One can then imagine that each individual produces a very large number of cells called gametes, each of which contains only one gene at the locus. Half the gametes inherit copies of one of the individual's genes, the other half copies of the other. 
All the gametes produced by all the individuals are thrown into a pool, and an individual of the next generation is produced by drawing two gametes at random from the pool and combining them. The N individuals of the next generation are obtained by 2N independent drawings from the pool. This basic model is usually referred to as the Wright-Fisher model; it is of course a greatly simplified one and any conclusion from it needs to be checked for robustness. The gene frequencies in a generation are the same as those in the subsequent gamete pool, so instead of successive generations we can consider the successive intercalated pools. This simplifies the problem by replacing the diploid organism by haploid gametes (having only one set of chromosomes). For this reason it is often sufficient to develop a theory of haploid populations, though care needs to be exercised because many biological mechanisms (notably selection and recombination) act in an essentially diploid way. 1.3. Selection. The genetic composition of an organism determines, by interaction with its environment, the way in which the organism will develop, and, therefore, the way it will adapt to the problems of its existence: how successful it will be in garnering food, in warding off predators, and in mating and
producing offspring. Some alleles, or combinations of alleles at several loci, will tend to be more successful than others, and so tend to increase their frequency in the population at the expense of those less successful. This selective effect, as far as a single locus is concerned, can easily be incorporated into the Wright-Fisher model. One imagines that initially many more than N individuals are produced by drawings of pairs of gametes, and that the probability of survival to maturity depends on the two genes (the genotype) of the individual, so that the N who survive to maturity will typically form a population with gene frequencies weighted according to these survival probabilities. The relative values of these probabilities, normalised in some convenient way, are the relative fitnesses of the possible allele-pairs.

With this model, we can begin to consider the problem posed in §1.1, the reconciliation of observed genetic diversity with postulated selective pressure. Several general answers might be envisaged, of which the most important are the following.

(i) It may very well happen that our population is not in equilibrium, that the less fit alleles are on the way out but have not yet disappeared. Certainly it is important to understand the nonequilibrium behaviour of models like the Wright-Fisher (methods exist for that; see, for example, Kimura [1], [2], Nei and Li [1], Kingman [12]), but it hardly suffices for a general explication of genetic diversity to argue that it does exist, but that by some genetical analogue of the second law of thermodynamics it is steadily decreasing. So a convincing model ought to maintain diversity even when it has reached equilibrium.

(ii) Diversity can be maintained by differences in environment in space (or perhaps time). If one allele is adapted to one environment, and another to another, and there is migration between the environments, then an equilibrium can be established in which both alleles thrive (see, for example, Fleming [1], Gillespie [2]).

(iii) Selection typically acts on allele pairs rather than on individual alleles, and an allele can be favourable in some combinations but not in others. That this can result in a stable configuration with several alleles will be shown in §2.1.

(iv) The last two mechanisms may explain why diversity is maintained, but they give no clue as to how it originated. The usual explanation is in terms of mutation, spontaneous change which means that an offspring inherits not the gene from the parental gamete but a different one. The label has not been accurately transcribed. Such mutations occur with very small probability, but their cumulative effect over many births may provide enough variety to balance the effect of selection. Models for mutation-selection balance will be discussed in §§2.3-2.5, and from a different point of view in Chapter 4.

(v) Some geneticists take the view that the effect of selection has been exaggerated, and that much genetic diversity is to be understood mainly in terms of mutation and the random fluctuations which are inherent in the reproductive
process. Such an attitude leads to "neutral" models without selection; these are the subject matter of Chapter 3.

It should be stressed that there is no question of showing that one of these five answers (or perhaps yet a different one) is universally valid. The behaviour of any real population will probably have ingredients from each, together with other complications. But the biologist's explanations will give prominence to one or more, and fall to be judged in the light of an understanding of models based on them.

1.4. Mutation. Suppose that the alleles at a locus are listed $A_1, A_2, \ldots, A_k$. A parental gene $A_i$ appears also as $A_i$ in the genotype of an offspring, unless mutation has occurred, but mutation may cause the offspring to inherit a different allele $A_j$. If $u_{ij}$ denotes the probability that mutation changes $A_i$ to $A_j$, and

\[ u_{ii} = 1 - \sum_{j \ne i} u_{ij}, \tag{1.4.1} \]

then the $u_{ij}$ ($j \ne i$) are small and $u_{ii}$ is near 1.

The usual type of mutation postulated is the change of a single letter in the "message" which constitutes the gene. For such a mutation, it is natural to suppose that different mutations are statistically independent, so that the mutation model is completely specified by the matrix $(u_{ij};\ i, j = 1, 2, \ldots, k)$. This is the viewpoint which will be adopted here. There are, however, other types of events which give the same mutation effect but are much more difficult to model. A chromosome break can occur in the middle of a single gene, so that the daughter gene contains part of the messages from each of its parents. The mutation rate then depends on whether the parents had the same genes at that locus, and a complicated correlation between different mutations enters.

Even with the simple picture of "single letter" mutation, it is far from clear what form to assume for the matrix $(u_{ij})$. The alleles $A_i$ are (some or all of) the messages that can be written with the appropriate number of symbols from the four-letter alphabet. Then $u_{ij}$ is zero unless $A_i$ and $A_j$ differ in only one letter. A model reflecting this structure will be described in §3.3. Fortunately, however, the results are insensitive to the detailed properties of $(u_{ij})$ so long as the possibility of recurrent mutation can be neglected. That is to say, once an allele $A_i$ has been corrupted by a mutation, it is supposed unlikely that further corruption will by coincidence restore $A_i$. In §3.5 we explore the consequences of this approximation and the very useful simplification which it brings to the analysis, when selective effects are ignored.

In selective models, one has also to bring in the fitness of the genotype $A_iA_j$, and genetical theory is silent as to how this is likely to depend on the messages of the alleles $A_i$ and $A_j$. To make any progress, a much more stringent hypothesis is made about mutation, namely, that the mutant allele is independent of its parent, so that $u_{ij}$ does not depend on $i$ at all, so long as $i \ne j$. The rationale for this assumption is that it is extremely unlikely that the corrupt message means anything at all. If it does, it probably bears no biological relation to the original (just as there is no direct association between the meanings of the words "chance" and "change"). An understanding of the real nature of mutation will no doubt come as biologists learn the language in which the allele labels are written, but it seems reasonable in the meantime to exploit as a first approximation the simplification which the independence assumption brings.

These arguments assume, however, that the alleles are all distinguishable. Experimental techniques have hardly reached this level. For example, it is common to detect variation at a locus by electrophoretic techniques which give each allele a (positive or negative) integer, representing in a sense the total charge on that part of the chromosome. If this is a sum of contributions from the different letters, then a mutation will probably cause an increase or decrease of one charge unit. Thus it is supposed that the alleles are placed at integer points of a line, that we can only distinguish between alleles at distinct points, and that the mutation probabilities $u_{ij}$ are nonzero only if $A_i$ and $A_j$ are adjacent. This gives a model in which $(u_{ij})$ is the transition matrix of a random walk. This can be generalised to cover the situation in which several independent electrophoretic determinations are made (as for instance in Singh, Lewontin and Felton [1]), when the alleles sit at points of a $d$-dimensional integer lattice $\mathbb{Z}^d$, and $(u_{ij})$ is again a random walk transition matrix. Models of this type are discussed in §3.3, where it is shown that (in the absence of selection) the approximation of nonrecurrent mutation becomes quite quickly accurate as the dimension $d$ increases.
To this extent, the random walk model (also called the "charge-state" or "ladder-rung" model, and first introduced by Ohta and Kimura [1]) is appropriate only to the situation in which alleles are very imperfectly distinguished.
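The mutation matrices of §1.4 and the gamete-pool sampling of §1.2 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a model taken from the text: the function names are hypothetical, the random walk is confined to a finite window of k charge states with equal up and down rates (the text places alleles on all of the integers), and mutation is applied to the pool frequencies before the 2N gametes are drawn.

```python
import random


def parent_independent_matrix(v):
    """Mutation matrix in which the mutant allele is independent of its parent.

    v[j] is the (small) probability of mutating to allele j from any other
    allele. The diagonal is fixed by (1.4.1): u_ii = 1 - sum_{j != i} u_ij.
    """
    k = len(v)
    u = [[v[j] if j != i else 0.0 for j in range(k)] for i in range(k)]
    for i in range(k):
        u[i][i] = 1.0 - sum(u[i])
    return u


def random_walk_matrix(k, rate):
    """Charge-state ("ladder-rung") matrix on k adjacent integer states.

    Assumption of this sketch: a gene moves one unit up or down with
    probability rate/2 each, and steps that would leave the finite window
    are dropped, so u_ij is nonzero only for adjacent states.
    """
    u = [[0.0] * k for _ in range(k)]
    for i in range(k):
        if i > 0:
            u[i][i - 1] = rate / 2
        if i < k - 1:
            u[i][i + 1] = rate / 2
        u[i][i] = 1.0 - sum(u[i])  # diagonal from (1.4.1)
    return u


def wright_fisher_step(counts, u, rng):
    """One generation for a pool of 2N genes, without selection.

    counts[i] is the number of genes of allele i. Mutation acts on the pool
    frequencies, then 2N independent draws form the next generation.
    """
    two_n = sum(counts)
    k = len(counts)
    freqs = [c / two_n for c in counts]
    mutated = [sum(freqs[i] * u[i][j] for i in range(k)) for j in range(k)]
    draws = rng.choices(range(k), weights=mutated, k=two_n)
    new_counts = [0] * k
    for allele in draws:
        new_counts[allele] += 1
    return new_counts
```

For example, `wright_fisher_step([10, 0, 10, 0, 0], random_walk_matrix(5, 0.02), random.Random(1))` returns the allele counts of the next generation of 20 genes. Because the drawing is random, the counts fluctuate from run to run even though no selection is present.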
CHAPTER 2
Survival of the Fittest

2.1. Balanced polymorphisms. Throughout this chapter it is supposed that the population considered is so large that random fluctuations can be ignored. Thus the evolution of gene frequencies can be modelled deterministically. In this section we confine our attention to a single locus, and to the effect of selection alone.

Suppose that the possible alleles at this locus are listed as $A_1, A_2, \ldots, A_k$. Let the corresponding gene frequencies be $p_1, p_2, \ldots, p_k$, so that in the corresponding gamete pool a gamete chosen at random has probability $p_i$ of carrying a gene of allele $A_i$; clearly

\[ p_i \ge 0, \qquad \sum_{i=1}^{k} p_i = 1. \tag{2.1.1} \]

The process of drawing from the gamete pool yields an individual of genotype $A_iA_j$ with probability $p_i p_j$, this reflecting the assumption of random mating. (In fact, nature does not distinguish between $A_iA_j$ and $A_jA_i$, but we can do so for present purposes.) Now suppose that the probability of an individual with genotype $A_iA_j$ surviving to maturity is $w_{ij}$; thus

\[ w_{ij} = w_{ji} \ge 0, \tag{2.1.2} \]

and only the ratios of the fitnesses $w_{ij}$ are relevant. The number of such survivors is proportional to $w_{ij} p_i p_j$. This enables us to compute the gene frequencies $p_i'$ in the next generation as

\[ p_i' = \frac{p_i \sum_{j} w_{ij} p_j}{W}, \tag{2.1.3} \]

where

\[ W = \sum_{i,j} w_{ij} p_i p_j. \tag{2.1.4} \]

Thus we can plot a point $p = (p_1, p_2, \ldots, p_k)$ in $\mathbb{R}^k$ to represent the genetic composition of the population. The conditions (2.1.1) confine $p$ to a subset $\Delta$ of $\mathbb{R}^k$, a $(k-1)$-dimensional simplex. Equations (2.1.3-4) can be written in the form

\[ p' = \Phi(p), \tag{2.1.5} \]

where $\Phi$ is a function from $\Delta$ into $\Delta$ determined by the fitness matrix $(w_{ij};\ i, j = 1, 2, \ldots, k)$. If $p(t)$ denotes the point of $\Delta$ describing the $t$th generation, then $p(t)$ is obtained from $p(0)$ by $t$ applications of (2.1.5), so that the evolution of the population is determined by the iterates of the function $\Phi$.

It is instructive to consider first the simple case $k = 2$. Here, writing $p$ for $p_1$ (so that $p_2 = 1 - p$), (2.1.5) can be written

\[ p' = \phi(p), \]

where

\[ \phi(p) = \frac{p\,\{w_{11}\,p + w_{12}(1-p)\}}{w_{11}\,p^2 + 2 w_{12}\,p(1-p) + w_{22}(1-p)^2}. \tag{2.1.6} \]

The function $\phi$ maps the unit interval onto itself and satisfies $\phi(0) = 0$, $\phi(1) = 1$. It also satisfies $\phi(\hat p) = \hat p$ if

\[ \hat p = \frac{w_{22} - w_{12}}{w_{11} - 2 w_{12} + w_{22}} \tag{2.1.7} \]

lies in this interval. From this it is easy to see what happens in four different cases (omitting borderline situations in which two of $w_{11}$, $w_{22}$, $w_{12}$ are equal):

(I) $w_{11} < w_{12} < w_{22}$. In this case $p' < p$ unless $p = 0$, and $p(t)$ decreases to zero as $t \to \infty$. Thus allele $A_1$ dies out and $A_2$ is "fixed."

(II) $w_{22} < w_{12} < w_{11}$. The opposite to (I); $p' > p$ and allele $A_1$ is fixed.

(III) $w_{12} < w_{11}, w_{22}$. The ultimate outcome now depends on $p(0)$, because $0 < \hat p < 1$ and $p' < p$ if $0 < p < \hat p$ while $p' > p$ if $\hat p < p < 1$. Thus $A_1$ is fixed if $p(0) > \hat p$, but $A_2$ is fixed if $p(0) < \hat p$.

(IV) $w_{12} > w_{11}, w_{22}$. Here $p'$ always lies between $p$ and $\hat p$, and it follows that $p(t) \to \hat p$ as $t \to \infty$, whatever the value of $p(0)$.

Thus the two alleles will coexist indefinitely if the heterozygote $A_1A_2$ is fitter than either of the homozygotes $A_1A_1$ and $A_2A_2$. If this condition is satisfied, then selection alone can account for the variety at the locus.

It is natural to try to generalise this argument to more than two alleles, and the key to doing this is to observe that the "mean fitness" $W$, regarded as a function of $p$, increases from one generation to the next. To prove this, the Mandel-Scheuer inequality [1], note that the value of $W$ in the daughter generation is

\[ W' = \sum_{i,j} w_{ij}\, p_i'\, p_j' = W^{-2} \sum_{i,j,l} p_i p_j p_l\, w_{ij} w_{il} W_j, \]

where $W_j = \sum_m w_{jm} p_m$, so that $p_j' = p_j W_j / W$ and $W = \sum_j p_j W_j$. Interchanging the roles of $j$ and $l$ and adding, we obtain

\[
\begin{aligned}
W^2 W' &= \tfrac12 \sum_{i,j,l} p_i p_j p_l\, w_{ij} w_{il}\,(W_j + W_l) \\
&\ge \sum_{i,j,l} p_i p_j p_l\, w_{ij} w_{il}\,(W_j W_l)^{1/2} \\
&= \sum_i p_i \Big( \sum_j p_j w_{ij} W_j^{1/2} \Big)^2 \\
&\ge \Big( \sum_i p_i \sum_j p_j w_{ij} W_j^{1/2} \Big)^2
 = \Big( \sum_j p_j W_j^{3/2} \Big)^2 \\
&\ge \Big( \sum_j p_j W_j \Big)^3 = W^3 .
\end{aligned}
\]

In this chain, the first inequality comes from the fact that the arithmetic mean of two numbers exceeds the geometric mean, the second from the convexity inequality

\[ \sum_i p_i x_i^r \ge \Big( \sum_i p_i x_i \Big)^r \qquad (x_i \ge 0,\ r \ge 1) \]

when $r = 2$, and the third from the same inequality when $r = 3/2$. The argument shows, moreover, that $W' > W$ unless $p' = p$.

Hence $W(p(t))$ increases with $t$, and it follows without difficulty that, if $W$ attains its maximum in $\Delta$ at a point $\hat p$ with $\hat p_i > 0$ for all $i$, then $p(t) \to \hat p$ as $t \to \infty$, whatever the starting point $p(0)$. The condition for $W$ to have a stationary value at $\hat p$ is that, for some $\hat W$,

\[ \sum_j w_{ij}\, \hat p_j = \hat W \qquad (i = 1, 2, \ldots, k), \tag{2.1.8} \]
and this will be a strict maximum if the fitness matrix is such that

\[ \sum_{i,j} w_{ij}\, x_i x_j < 0 \tag{2.1.9} \]

for all $x_i$, not all zero, satisfying $\sum_i x_i = 0$. (See Kingman [1] for further details, and for the conditions under which random fluctuations may safely be ignored.)

If $W$ attains its maximum on the boundary of $\Delta$, this argument fails. Various possibilities exist, but what always happens is that at least one allele dies out. After this has happened, the evolution is governed by equations of the form (2.1.3-4), but with a smaller value of $k$ and a reduced fitness matrix.

Thus selection according to a fitness matrix $(w_{ij})$ can maintain $k$ alleles in equilibrium, but only if the equations (2.1.8) admit a strictly positive solution $(\hat p_i)$, and the negative-definiteness condition (2.1.9) is satisfied. When $k$ is large, these are very demanding conditions (cf. Gillespie [1]), very unlikely to be fulfilled by a fitness matrix written down at random. It might therefore be thought that the simple theory of this section cannot explain a polymorphism with a large number of alleles without some mechanism for evolving a rather special array of fitnesses.

This however is not so. If the $(k \times k)$ fitness matrix does not lead to an internal equilibrium, some alleles will die out, and eventually equilibrium will be established with some smaller number $k'$ of alleles. The number $k'$ depends on the $w_{ij}$ and on the initial gene frequencies, but there seems no reason why it should not be large if $k$ is large.

How large then would one expect $k'$ to be for "typical" fitness matrices and initial conditions? One could formulate this question by allowing the $w_{ij}$ ($1 \le i \le j \le k$) to be, say, independent random variables uniformly distributed on (0,1), and $p(0)$ uniformly distributed on $\Delta$. The distribution of $k'$ in terms of $k$ could then in principle be computed, though the problem would be one of great difficulty. An upper bound for $k'$ given in Kingman [1] (see also Karlin [2]) shows that on the average k' is at most H' for k' large, but how sharp this is remains obscure.
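The selection recursion of this section lends itself to a small numerical check. The following Python sketch iterates (2.1.3-4) for a given symmetric fitness matrix, records the mean fitness W in each generation, and compares the two-allele limit with the fixed point (2.1.7). The function names and the numerical fitness values are illustrative assumptions and are not taken from the text.

```python
def marginal_fitnesses(p, w):
    """Return the inner sums sum_j w[i][j] * p[j] of (2.1.3), one per allele."""
    return [sum(w_ij * p_j for w_ij, p_j in zip(row, p)) for row in w]


def mean_fitness(p, w):
    """Mean fitness W = sum_{i,j} w_ij p_i p_j, equation (2.1.4)."""
    return sum(p_i * m_i for p_i, m_i in zip(p, marginal_fitnesses(p, w)))


def selection_step(p, w):
    """One generation of selection: the map p -> p' of (2.1.3)."""
    marg = marginal_fitnesses(p, w)
    big_w = sum(p_i * m_i for p_i, m_i in zip(p, marg))
    return [p_i * m_i / big_w for p_i, m_i in zip(p, marg)]


def iterate(p, w, generations):
    """Apply the selection map repeatedly, recording W in every generation."""
    history = [mean_fitness(p, w)]
    for _ in range(generations):
        p = selection_step(p, w)
        history.append(mean_fitness(p, w))
    return p, history


def two_allele_equilibrium(w11, w12, w22):
    """Interior fixed point (2.1.7) for k = 2 (frequency of allele A_1)."""
    return (w22 - w12) / (w11 - 2.0 * w12 + w22)


# Case (IV): the heterozygote is fitter than both homozygotes
# (hypothetical fitness values chosen for illustration only).
w_overdominant = [[0.8, 1.0], [1.0, 0.6]]
p_final, history = iterate([0.05, 0.95], w_overdominant, 200)
p_hat = two_allele_equilibrium(0.8, 1.0, 0.6)
```

With these values `p_final[0]` agrees with `p_hat` to within rounding error, and `history` is nondecreasing, as the inequality W' ≥ W requires. Replacing the matrix by one with w12 below both w11 and w22 (case (III)) makes the limit depend on the starting frequency.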
Of course this whole approach is unrealistic in that it ignores the very mechanism by which the k alleles are presumably established: mutation. A general, perhaps too general, approach to the synthesis of mutation and selection in a one locus diploid context is outlined in §2.5. In §2.6 a much cruder approach is adopted, which suggests that large polymorphisms may be less susceptible to attack by new mutations. 2.2. Multi-locus selection. The elegant theory sketched in the last section is, alas, too good to be true, and much more difficult problems arise when the fitness of an individual depends on the genes at more than one locus. Take the simplest case of two loci. Let the alleles at the first be Aa(a = 1, 2, • • • , a) and at the second B0((3 = 1, 2, • • • , & ) . Then the possible gametes will be described by pairs / = (a, /3) and the possible individuals by genotypes (/, j )
SURVIVAL OF THE FITTEST
11
which are pairs of pairs. The fitness matrix is now of order ab, the fitnesses wu being indexed by the pairs / = («, (3),j = (a , /3'). This complication would not be serious if equation (2.1.3) continued to hold, but it does not, because of the possibility of recombination. The gametes contributed to the pool by an individual of type ( i , j ) are not all of types / = (a, /3) and./ = («', /3'); some are of the mixed types (a, (3') and (a', /3). Moreover, the proportions of these four varieties depend on whether the two loci are on the same chromosome, and if so how far they are apart. The effect is to replace (2.1.3) by a recurrence of the form
where aul ^ 0 depends on the fitnesses and on the recombination probabilities, and «j, = The passage from (2.1.3) to (2.2.1) is fatal to the methods of §2.1. Moran [2J showed that the mean fitness does not always increase, and efforts to find a function with this (Lyapunov) property have not prospered. Moreover, Karlin and others (Karlin [1], Karlin and Carmelli [1], Karlin and Liberman [1]) have shown that there can be many stable equilibria with all alleles present, whereas in the one-locus case there can be at most one. This throws doubt on the application of the results of §2.1 to any natural polymorphism. Observed diversity at a particular locus could be the result of a multi-locus system in stable equilibrium. In principle we could estimate the fitness matrix at that locus, but (2.1.3) could only be applied bearing in mind that the fitnesses will themselves be changing, since they are functions of the gene frequencies at the linked loci. Thus there is an urgent need for a general theory of multi-locus systems and the consequent marginal behaviour at single loci. Some progress in this direction has been made by Ewens and Thomson [1], using the fact that the evolution of such a system is always governed by an equation of the form (2.2.1) for suitable coefficients au!. Moreover, these coefficients satisfy certain conditions which can be exploited. Perhaps the most important conclusions of Ewens and Thomson concern the situation at one A-allele locus of an arbitrary multi-locus system in stable equilibrium. At that locus it makes sense to define vv ;j (/, j = 1, 2, • • • , A.) as the average fitness of those genotypes which have alleles A, and A j . Then it is shown that the marginal gene frequencies pt do satisfy (2.1.8). What is not known, but is conjectured to be true (cf. Karlin and Carmelli [1]) is that (2.1.9) also holds for wu so defined. For example, when k = 2, the equilibrium frequency of A: is given by (2.1.7). 
Since this expression lies outside $(0, 1)$ if $w_{12}$ falls between $w_{11}$ and $w_{22}$, a necessary condition for stability is that either $w_{12} < w_{11}, w_{22}$ or $w_{12} > w_{11}, w_{22}$.
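The remark about the position of the heterozygote fitness is easily checked by arithmetic. The sketch below assumes that (2.1.7) is the standard two-allele formula $p_1 = (w_{12} - w_{22})/(2w_{12} - w_{11} - w_{22})$; the function name and the fitness values are illustrative only.

```python
def equilibrium_p1(w11, w12, w22):
    """Interior equilibrium frequency of A1 at a two-allele locus.

    Assumes the standard formula p1 = (w12 - w22) / (2*w12 - w11 - w22),
    which requires a non-zero denominator.
    """
    denom = 2.0 * w12 - w11 - w22
    if denom == 0.0:
        raise ValueError("no isolated interior equilibrium")
    return (w12 - w22) / denom


# Heterozygote fitter than both homozygotes: value inside (0, 1).
p_over = equilibrium_p1(0.9, 1.0, 0.8)
# Heterozygote less fit than both homozygotes: also inside (0, 1).
p_under = equilibrium_p1(1.0, 0.5, 1.0)
# Heterozygote intermediate between the homozygotes: value outside (0, 1).
p_mid = equilibrium_p1(1.0, 0.95, 0.8)

print(p_over, p_under, p_mid)
```

Only the first two configurations give a frequency in $(0, 1)$, in agreement with the necessary condition just stated.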
What is conjectured is that the former possibility cannot arise from a stable system; it is a measure of the difficulty of the problem that this has not been resolved even for a general two-locus system with two alleles at each locus. Even the problems of existence of, and convergence to, an equilibrium in (2.2.1) are very difficult to resolve in general, though some deep theorems of Kesten [1] (and with random fluctuations, [2]) are sometimes applicable.

2.3. Balance between selection and mutation. Little progress has been made with the analysis of models involving both mutation and selection at the level of generality of the arguments in §2.1. To get useful results it seems that some restriction must be placed on the fitnesses $w_{ij}$. Following Moran [4], [5], [6], we shall assume multiplicative fitness, so that for some $w_i$ ($i = 1, 2, \ldots, k$),
$$w_{ij} = w_i w_j. \tag{2.3.1}$$
This would hold if selection operated on the gametes rather than on the diploid organism, and more generally it would be a reasonable approximation for loci without dominance, in which the heterozygote $A_iA_j$ has a fitness intermediate between those of the corresponding homozygotes $A_iA_i$ and $A_jA_j$. When (2.3.1) holds, the equations (2.1.3-4), governing gene frequencies under selection alone, simplify to the point of triviality:
$$p_j' = \frac{w_j p_j}{\sum_i w_i p_i}, \tag{2.3.2}$$
and the composition of the $t$th generation is given by
$$p_j(t) = \frac{w_j^t\, p_j(0)}{\sum_i w_i^t\, p_i(0)}. \tag{2.3.3}$$
As $t$ increases, all the alleles die out except that (or those) for which $w_i$ is greatest. Mutation can be introduced into this simple model by supposing, as in §1.4, that of the gametes of allele $A_i$ which are contributed to the pool, a proportion $u_{ij}$ mutate to $A_j$ before the selection process which forms the next generation. As before, we define $u_{ii}$ by (1.4.1), so that $(u_{ij};\ i, j = 1, 2, \ldots, k)$ is a stochastic matrix which describes the effect of mutation. Then (2.3.2) must be replaced by the equation
$$p_j' = \frac{w_j \sum_i p_i u_{ij}}{\sum_l w_l \sum_i p_i u_{il}}. \tag{2.3.4}$$
It is to obtain this simple bilinear equation that we adopt the stringent assumption (2.3.1); without it we would be faced with an intractable biquadratic recursion. It reduces our diploid problem to a haploid one; however, we shall return to the truly diploid situation in §2.5. The way to solve (2.3.4) is to follow Moran [5] in defining a matrix $V = (w_j u_{ij})$. Denote by $v_{ij}^{(n)}$ the $(i, j)$th element of the $n$th power $V^n$. Then by induction on $t$,
$$p_j(t) = \frac{\sum_i p_i(0)\, v_{ij}^{(t)}}{\sum_i \sum_l p_i(0)\, v_{il}^{(t)}}. \tag{2.3.5}$$
Thus the behaviour of the gene frequencies in successive generations depends on that of the powers of the positive matrix $V$. For simplicity, suppose that some power of $V$ has no zero element. Then the Perron-Frobenius theorem asserts that $V$ has a maximal eigenvalue $\lambda > 0$ and left and right eigenvectors $(x_i)$ and $(y_j)$ which can be normalised so that
Moreover (see, for example, Seneta [1, Thm. 1.2]),
which, when substituted into (2.3.5), shows that
regardless of the initial frequencies $p_i(0)$. In other words, the equations
$$\lambda p_j = w_j \sum_i p_i u_{ij} \qquad (j = 1, 2, \ldots, k) \tag{2.3.9}$$
have a unique solution $(p_j)$ satisfying (2.1.1) for exactly one value of $\lambda$. This is the limiting set of gene frequencies, from whatever starting point, and the value of $\lambda$ is the limiting (haploid) mean fitness. When $k$ is small, these equations may be approached directly, giving a polynomial equation of degree $k$ for $\lambda$, but when $k$ is large it may be difficult to extract useful information from (2.3.9). To see how this works out in detail, consider a very simple situation in which
$A_1$ is favourable over equally fit alleles $A_2, \ldots, A_k$:
Then (2.3.9) becomes
Summing over j, we obtain
so that (2.3.11) has the iterative solution
in terms of the powers $(u_{ij}^{(n)})$ of the stochastic matrix $(u_{ij})$. Putting $j = 1$, we have the equation
which determines $\lambda$. With this value of $\lambda$, $p_j$ is then given by
Of course, the fact that (2.3.13) has a unique real solution $\lambda > 1$ is obvious, because the left-hand side decreases continuously with $\lambda$, with limits $\sum_n u_{11}^{(n)}$ and $0$ as $\lambda \to 1$ and $\lambda \to \infty$, and the positivity condition on $V$ implies that
Without the positivity condition it might happen that this series converges: indeed
where $f_1$ is the probability, in the Markov chain with transition probabilities $u_{ij}$, of ultimate return to the state 1 (Feller [1]). Thus equation (2.3.13) has a solution only if $s^{-1} < f_1/(1 - f_1)$, that is, if
a nontrivial condition if $f_1 < 1$.
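Returning to the general equations (2.3.4) and (2.3.9), the convergence to a limit independent of the starting point can be illustrated numerically. The following sketch assumes that (2.3.4) describes mutation followed by gametic selection, so that one generation maps $p_j$ to a multiple of $w_j \sum_i p_i u_{ij}$; the fitnesses, the mutation matrix and the function names are illustrative choices, not taken from the text.

```python
def step(p, w, u):
    """One generation: mutation (stochastic matrix u), then selection (weights w).

    Returns the new frequencies and the normalising constant, which plays
    the role of the (haploid) mean fitness.
    """
    k = len(p)
    unnorm = [w[j] * sum(p[i] * u[i][j] for i in range(k)) for j in range(k)]
    mean_fitness = sum(unnorm)
    return [x / mean_fitness for x in unnorm], mean_fitness


def iterate(p, w, u, generations=2000):
    """Apply `step` repeatedly; returns the final frequencies and mean fitness."""
    lam = None
    for _ in range(generations):
        p, lam = step(p, w, u)
    return p, lam


# A1 slightly favoured over two equally fit alleles (illustrative numbers).
w = [1.05, 1.0, 1.0]
u = [[0.98, 0.01, 0.01],
     [0.01, 0.98, 0.01],
     [0.01, 0.01, 0.98]]

# Two very different starting compositions.
p_a, lam_a = iterate([0.98, 0.01, 0.01], w, u)
p_b, lam_b = iterate([0.01, 0.01, 0.98], w, u)

print(p_a, lam_a)
print(p_b, lam_b)
```

Under these assumptions both runs settle on the same frequencies, and the normalising constant at the limit is the eigenvalue $\lambda$ of (2.3.9).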
For finite Markov chains, $f_1 < 1$ only occurs if return to 1 is forbidden from some other states, which is hardly likely in the context of mutation. The analysis of the special case (2.3.10) does give a danger signal when, for convenience of modelling, we admit the possibility of infinitely many alleles. For example, it was noted in §1.4 that, in the electrophoretic picture, it is natural to regard the "alleles" as the integer points on the line, with the only nonzero $u_{ij}$ as
For this random walk, $f_1 = 1 - |u_+ - u_-|$, so that (2.3.13) has a solution for all $s$ only if $u_+ = u_-$. The situation is even worse when (2.3.17) is replaced by the corresponding values for a $d$-dimensional random walk, since then even the symmetric case has $f_1 < 1$ when $d \ge 3$. In one sense, this is an artificial difficulty, since $k$ is always finite in real systems, and a model like (2.3.17) can always be truncated by introducing suitable boundaries at large distances. But the lack of a solution to the infinite model will show itself even in the finite model because the distribution $(p_j)$ will be very thinly spread among the $k$ alleles.

It is therefore an interesting question to ask, given an infinite stochastic matrix $(u_{ij};\ i, j = 1, 2, \ldots)$ and a set of positive numbers $(w_i;\ i = 1, 2, \ldots)$, whether the equations (2.3.9) with $k = \infty$ admit a solution $(p_j)$ satisfying (2.1.1) with $k = \infty$. No necessary and sufficient condition is known, but useful sufficient conditions have been established by Moran [5], [6] and Kingman [5]. Thus it is sufficient that, for some $c$, it should be true that (i) $w_i \le c$ for all but a finite number of $i$, and (ii) there is a finite subset $B$ such that the maximal eigenvalue of the matrix $(w_j u_{ij};\ i, j \in B)$ is strictly greater than $c$. In genetical terms, conditions like this insist that a few alleles are sufficiently advantageous to prevent the population spreading itself too thinly over many different alleles.

2.4. The house of cards. The two special models for mutation which have been most deeply studied are the electrophoretic model (2.3.17) and, at the other extreme, that in which $u_{ij}$ does not depend on $i$ (for $j \ne i$). The latter is more appropriate when the alleles are supposed to be completely distinguished, and when, as in §1.4, the effect of mutation is to bring down the biochemical "house of cards" painfully built up by past evolution.
Whether or not this is regarded as a realistic picture of mutation, it does make a useful contrast to the random walk model, and has the great advantage of admitting fairly explicit analysis. Thus in this section we assume that the mutation matrix is given by
where the $u_j$ are strictly positive numbers with
$u$ is a measure of the overall mutation rate for each gamete, and $u_j/u$ is the probability that a mutant is of allele $A_j$. Substituting (2.4.1) into (2.3.4) we obtain
so that, writing
the gene frequencies $p_j(t)$ in the $t$th generation are given recursively by
If for a moment we regard $W_t$ as known, this linear recursion can be solved separately for each $j$, to give
where
Summing (2.4.5) over j gives the recursion
which determines the $\Pi_t$, and hence the $W_t$. In particular, $W_t \le w = \max w_i$, and hence so long as $w|z| < 1$, (2.4.7) can be expressed in the generating function form
which simplifies to
It is easy to check that the right-hand side of this equation is a rational function of $z$ whose nearest singularity to the origin is a simple pole at the unique real point $z$ in $0 < z < [(1 - u)w]^{-1}$ satisfying
This solution will be written $z = W^{-1}$. Expressing (2.4.8) in terms of partial fractions and expanding as a power series in $z$, we see that $\Pi_t \sim CW^t$ as $t \to \infty$, for some $C > 0$, and therefore
Since $W > (1 - u)w \ge (1 - u)w_j$, it follows from (2.4.5) that, as $t \to \infty$,
In other words, whatever the initial gene frequencies, the frequencies in the tth generation converge to those of a stable limiting distribution
the value of $W$ being exactly that which makes the $p_j$ satisfy (2.1.1). It should be noted that (2.4.11) expresses the equilibrium distribution of fitness (which assigns probability $p_j$ to the point $w_j$ in $[0, w]$ for each $j$) in terms of the mutant fitness distribution (which assigns probability $u_j/u$ to each $w_j$) and the mutation parameter $u$. In this interpretation the model is a special case of one studied in Kingman [11], where a new mutant is supposed to have a fitness drawn at random from a given (not necessarily discrete) distribution. For example, if the mutant fitness distribution has a density $f$ on $[0, 1]$, the equilibrium fitness distribution has a density
where $W$ is chosen in $(1 - u, 1)$, if possible, so that
However, it can happen that no value of $W$ satisfies this condition, in which case the equilibrium fitness distribution has an atom of probability at the upper limit of mutant fitness. The only difficulty in using these formulae is in evaluating the limiting mean fitness $W$ from (2.4.9). Once this is done (perhaps approximately, using the fact that $u$ is small) the explicit form of the equilibrium gene frequencies shows very clearly the way in which selective differences affect the genetic structure of the population.

2.5. The diploid house of cards. The engagingly simple conclusions of the last section encourage the hope that it might be possible to adapt the treatment to a truly diploid model, in which the fitnesses are not supposed to take the multiplicative form (2.3.1). We therefore seek a general model without (2.3.1), in which a new mutant always exhibits a completely novel allele, never before encountered, so that in particular there is no recurrent mutation. In this situation the natural way to label the alleles is in the order of their arrival on the scene, taking those which appear between one generation and the next in random order. Thus we envisage an infinite sequence of possible alleles $A_1, A_2, \ldots$, and correspondingly an infinite array of fitnesses $w_{ij}$ satisfying (2.1.2). Because mutation is a random process, the $w_{ij}$ are random variables, and we have to model the random array $(w_{ij})$. Rather than proposing a specific model, it seems more convincing to postulate a simple symmetry property, which asserts that the joint distribution of the $w_{ij}$ would be the same if the alleles happened to arise in some other order. This is reasonable if any dependence between a mutant and its parent allele is ignored. More precisely, it will be assumed that, for any $n$ and any permutation $\pi$ of $\{1, 2, \ldots, n\}$, the joint distribution of the random variables $w_{\pi i, \pi j}$ ($i, j = 1, 2, \ldots, n$) does not depend on $\pi$.
This has been called weak exchangeability, and is a very much weaker property than exchangeability of the $n^2$ variables $w_{ij}$. (It involves invariance under a group of order $n!$ rather than $(n^2)!$.) However, an analogue of de Finetti's theorem has recently been established by Aldous [1], who shows that the most general infinite array, satisfying $w_{ij} = w_{ji}$ and the weak exchangeability property, is of the form
for some function $F$, symmetric in the second and third arguments, where the random variables $\xi$, $\xi_i$ ($i = 1, 2, \ldots$) and $\varepsilon_{ij}$ ($i, j = 1, 2, \ldots$) are independent (except for the constraint $\varepsilon_{ij} = \varepsilon_{ji}$) and uniformly distributed on the interval $(0, 1)$.
This remarkable and deep result can for our purposes be somewhat simplified, since the variable $\xi$ runs through all the alleles, and takes only one value throughout the evolution of the population. The fact that it might take other values in other realisations can never affect the population, so that $\xi$ can be taken as constant and absorbed into the function $F$. Hence it will be assumed that fitnesses are of the form
for some function $F: (0, 1)^3 \to [0, \infty)$ which is symmetric in its first two arguments. One can now argue as follows, assuming as throughout this chapter that the population is large enough for random fluctuations to be ignored. Suppose that in a particular generation the $\xi$-values of the genes in the gamete pool have an empirical distribution approximated (the population being large) by a probability density $p(x)$ ($0 < x < 1$). The individuals in the next generation inherit a $\xi$-value from each gamete, and so may be described by a pair of $\xi$-values $(\xi', \xi'')$ whose joint distribution, assuming random mating, is $p(x)p(y)$. Individuals with this description have on the average a fitness $K(\xi', \xi'')$, which by (2.5.2) is given by
so that the joint density in the mature population is
where
Thus, were it not for mutation, the $\xi$-values in the next gamete pool would have density
but mutation at rate $u$ (remembering that the mutant $\xi$-values are uniformly distributed) replaces this by $(1 - u)\hat{p}(x) + u$. Hence, assuming equilibrium, we have the equation
which together with (2.5.4) is a nonlinear integral equation determining $p$ in terms of $K$ and $u$. Of course, $p$ is not a function of direct interest because the $\xi$-values have no biological significance, but once $p$ is known it determines via (2.5.2) the distribution of quantities defined in terms of the fitnesses. When $K$ is of the form $K(x, y) = k(x)k(y)$ the analysis reduces in effect to that of §2.4. Other special forms of $K$ would be relevant to other biological contexts. For example, recessive selection in which $w_{ij} = \max(w_{ii}, w_{jj})$ implies that $K(x, y) = \max\{k(x), k(y)\}$ for some function $k$, with corresponding simplification to (2.5.5). In the presence of (2.5.5), equation (2.5.4) reduces to
It is convenient to write (2.5.5) as
so the problem is to solve (2.5.7) for $p$ and $\sigma = (1 - u)W^{-1}$ subject to the condition (2.5.6). Note that, when $u = 0$, (2.5.7) is the continuous analogue of (2.1.8), and it might be conjectured to have similar properties. Thus if $W$, regarded as a quadratic form in $p$, attains its supremum over all probability densities at a smooth strictly positive density $p_0$, then $p_0$ satisfies (2.5.7) with $u = 0$, and a perturbation technique would presumably yield a solution $p_u$ as a power series in $u$ for sufficiently small $u$ (and $u$ is always small in biological reality). On the other hand, when $W$ does not have such a smooth maximising $p_0$, as in the multiplicative case, such a simple argument would fail, and the dependence of $p$ on $K$ would be singular at $u = 0$. This seems a possibly fruitful area for future research. It should be stressed that, in the generality contemplated here, the functions $F$ and $K$ may be very irregular in behaviour. Nevertheless, one could consider more restrictive models in which the fitnesses take the form (2.5.2) and the $\xi_i$ do have some biological meaning. An example of this (complicated by random environmental fluctuations) is the "SAS-CFF" model of Gillespie [2].

2.6. The resistance of polymorphisms to mutation. Suppose a polymorphism at a single locus is in equilibrium under selective forces alone, as in §2.1. Thus alleles $A_1, \ldots, A_k$ are present in frequencies $p_1, \ldots, p_k$, and the fitnesses $w_{ij}$ satisfy
for all $i$, where $W$ is the mean fitness. Suppose for definiteness that the $w_{ij}$ lie
between 0 and 1, and that $W$ (which is the maximum of the quadratic form over all $x$ in $\Delta$) is only slightly below the upper limit 1. This is likely when $k$ is large, because for instance $W$ cannot be less than the maximum homozygotic fitness $\max_i w_{ii}$. Now suppose that mutation introduces a new allele $A_0$. Either this will die out again, or it will establish itself and in doing so create a new polymorphism with $(k + 1)$ or fewer alleles. Let us estimate (rather crudely) the probability $P$ that the latter event takes place, so as to get an idea of how vulnerable the original polymorphism is to mutation. If the genotype $A_0A_j$ has fitness $w_{0j}$, the mean fitness of all the individuals containing the mutant (assuming that the initial population frequency of $A_0$ is negligible) is
Hence the selective advantage of $A_0$ is initially
A celebrated result of Fisher (see, for instance, Ewens [1], §7.1) indicates that the probability that $A_0$ survives, and takes its place in a new equilibrium, is approximately $P = 2\varepsilon^+$, where $\varepsilon^+ = \max(\varepsilon, 0)$. Hence we arrive at the approximate formula
To see how this probability depends on the parameters of the polymorphism, think of the $w_{0j}$ as being independent random variables, drawn from some distribution on $(0, 1)$, and replace $P$ by its expectation $\bar{P}$. For any random variable $y$, and any $\theta > 0$,
so that
If therefore $w = w_{0j}$ has
we arrive at the inequality
Moreover, by analogy with the large deviation theorem of Chernoff [1], one would expect this upper bound to be reasonably sharp. It is a general consequence of the form of (2.6.2) that $P$ is least when all the $p_j$ are equal (see Appendix 1), so that the best-defended polymorphisms with given $W$ and $k$ are those with $p_j = k^{-1}$ for all $j$. For these the right-hand side of (2.6.3) is
Hence $P$ decays to zero as $e^{-\gamma k}$, where
This is of course a very crude argument indeed, but it does perhaps suggest that large polymorphisms, once established in stable equilibrium, are exponentially more difficult to overturn by mutation. The detailed justification or refutation of this suggestion is left to future research.
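Before leaving the selective models, the house-of-cards equilibrium of §2.4 lends itself to a quick numerical check. The sketch below assumes (this is a reading of the text, not a quotation) that the mutation matrix is $u_{ij} = (1 - u)\delta_{ij} + u_j$ with $\sum_j u_j = u$, so that one generation maps $p_j$ to $w_j[(1 - u)p_j + u_j]/W_t$, and that the limiting mean fitness $W$ is the root of $\sum_j u_j w_j/(W - (1 - u)w_j) = 1$ in $(1 - u)w < W \le w$. The function names and the numerical values are illustrative.

```python
def generation(p, w, mu, u):
    """Mutation at total rate u (a mutant is of type j at rate mu[j]), then selection."""
    unnorm = [wj * ((1.0 - u) * pj + mj) for pj, wj, mj in zip(p, w, mu)]
    mean_fitness = sum(unnorm)
    return [x / mean_fitness for x in unnorm]


def limiting_mean_fitness(w, mu, u, iterations=200):
    """Bisection for the root W of sum_j mu_j w_j / (W - (1-u) w_j) = 1."""
    def lhs(W):
        return sum(mj * wj / (W - (1.0 - u) * wj) for wj, mj in zip(w, mu))

    # lhs exceeds 1 just above lo and is at most 1 at hi; lo itself is never evaluated.
    lo, hi = (1.0 - u) * max(w), max(w)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:  # interval exhausted in floating point
            break
        if lhs(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return hi


w = [1.0, 0.9, 0.8, 0.7]        # allele fitnesses (illustrative)
u = 0.01                        # overall mutation rate
mu = [u / len(w)] * len(w)      # mutants equally likely to be of each allele

W = limiting_mean_fitness(w, mu, u)
p_equilibrium = [mj * wj / (W - (1.0 - u) * wj) for wj, mj in zip(w, mu)]

p = [0.25, 0.25, 0.25, 0.25]    # arbitrary starting frequencies
for _ in range(5000):
    p = generation(p, w, mu, u)

print(W)
print(p_equilibrium)
print(p)
```

With these values the iterated frequencies agree with the closed form, and almost all of the population carries the fittest allele, the remainder being maintained by mutation.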
CHAPTER 3
The Neutral Alternative

3.1. Evolution in the absence of selection. Models like those described in the last chapter show that there is no necessary conflict between selection and diversity. Selective pressure alone can maintain a balanced polymorphism when heterozygotes are sufficiently favourable, and even when the selective effects are such as to reduce diversity, these can come into balance with the opposite tendency of mutation. On the other hand, such selective models are not immune from criticism. A stable polymorphism at a single locus requires the fitnesses to satisfy the strong condition (2.1.9), for which there is no clear biological reason. If mutation is an important factor, formulae like those of §2.4 show that selective differences must be small so as not to swamp the comparatively small mutation rates observed in reality. More generally, selective models fall under suspicion because they can too easily explain any observed situation; for example, any set of frequencies $p_j$ can be realised in a balanced polymorphism by a suitable choice of the fitnesses $w_{ij}$. (It suffices to take $w_{ij} = 1$ for $i \ne j$ and $w_{ii} = 1 - a p_i^{-1}$ for small $a$.)

Partly for these reasons, there has been a tendency for some geneticists (see, for example, Kimura [3]) to seek to explain diversity without appeal to selection. More precisely (since no one claims that differences in fitness have no genetical component) it is maintained that at many loci the observed genetic differences contribute little to the overall fitness of the organism. Hence in this chapter we explore the models which have been constructed for the evolution of populations in which selection plays a negligible role. These models all involve mutation and assume that a mutant allele is equally favourable with the existing alleles. This is not true in practice, since most mutants are deleterious.
In effect what we are assuming is that every mutant is either "good", in the sense of being as fit as the existing alleles, or "bad", in that it is so much less fit that its contribution to future generations can be ignored. Thus a bad mutation is here being treated as a death. A more realistic treatment of deleterious mutations will be sketched in Chapter 4 (see Li [3] for a more extended discussion).

All the analysis of Chapter 2 can be specialised to the neutral situation simply by setting the letter $w$ (in lower case or capitals, with or without affixes) equal to 1 whenever it appears, and everywhere this reduces the argument to complete triviality. In §2.1, for example, the evolution equation becomes $p_i' = p_i$, so that gene frequencies apparently remain constant from generation to generation. But this cannot be realistic even in the absence of mutation; if the population evolves for long enough there will be chance fluctuations with no stabilising mechanism to counteract them. Eventually this may cause some alleles to die out, so that diversity will be reduced not by selection but by randomness (or "genetic drift" as the unfortunate biological nomenclature has it). This reduction takes place only slowly in large populations, and it therefore makes sense to imagine it being counteracted by mutation. Hence there is a need to understand a balance between mutation and the random effects which come from the finiteness of the populations considered. Models for this balance are of necessity stochastic, and it turns out that comparatively sophisticated tools of probability theory are needed to formulate and manipulate them with efficiency and economy.

3.2. A general model for mutation in finite populations. Let us explore the effect of introducing mutation into the Wright-Fisher model of §1.2. A population of fixed size $N$ was there postulated, so that a particular generation $G_t$ consists of $N$ diploid individuals. Consider a single locus at which the possible alleles are $A_i$ ($i \in S$), labelled by the elements of a countable set $S$ (which for later convenience will not be assumed finite). As far as this locus is concerned, the genetical structure of the population is then specified by $2N$ elements, not necessarily distinct, of the set $S$, the corresponding $A_i$ being the genes carried by the $N$ individuals. As far as the production of the next generation is concerned, the order in which these elements of $S$ appear is irrelevant, and it helps the mathematics to list them in random order as $X_1(t), X_2(t), \ldots, X_{2N}(t)$. Thus, if the $2N$ genes are sampled without replacement, the $r$th to be drawn is $A_i$, where $i = X_r(t)$. Each gene in the daughter generation $G_{t+1}$ comes from a copy of one of the form $A_i$, where $i$ is equally likely to be any one of the $X_r(t)$. Moreover, this random choice is independent, as between one gene and another. (These assertions are equivalent to those of the Wright-Fisher model, with no selection.)
However, mutation ensures that this daughter gene is of the form $A_j$ with probability $u_{ij}$ (as in §1.4). Hence, conditional on $G_t$, the random elements $X_r(t + 1)$ of $S$ are independent, with a common conditional distribution
(To ease the notation we sometimes write $u(i, j)$ for $u_{ij}$.) The distribution of a typical member of the $t$th generation is given by the probabilities
and because of (3.2.1),
Thus $\pi_t(j)$ is determined by recursion on $t$ by the equation
and this is just the equation for the successive distributions in a Markov chain with transition matrix $(u_{ij})$. Suppose that this transition matrix is irreducible, aperiodic and positive-recurrent (the last being trivially true if $S$ is finite). Then the limits
exist and form a probability distribution on $S$, and do not depend on the initial conditions. The random variables $X_r(t)$ for different $r$ are not independent. Consider the joint distribution
of two members of $G_t$; by the symmetry of the definition of the $X_r$ this does not depend on $r$ and $s$ so long as $r \ne s$. By (3.2.1) and conditional independence,
on splitting the double summation into the two cases $\alpha \ne \beta$ and $\alpha = \beta$. Hence
The presence of the factor $1 - 1/2N < 1$, together with (3.2.4), makes it easy to see that the limits
exist, and form a probability distribution on the set $S^2 = S \times S$ of pairs $(j_1, j_2)$. They are determined uniquely (supposing the $\pi(j)$ are known) by the equation (3.2.6) with the suffix $t$ removed. To take the simplest possible example, in which there are only two alleles $A_1$ and $A_2$, and in which $u_{12} = u_{21} = u$, equation (3.2.6) has the solution
In more complicated cases, the solution in explicit terms of (3.2.6) may be very difficult. The argument leading to (3.2.6) generalises to the higher joint distributions
The limits
exist, and are determined by analogues of (3.2.6) whose algebraic complexity increases rapidly with $n$. For each $n$, these equations are to be solved for the $\pi(j_1, \ldots, j_n)$ in terms of $\pi(j_1, \ldots, j_m)$ for $m < n$. The recursion on $n$ ends when $n = 2N$, when it gives the joint distribution of all the $X_r$ in the limit as $t \to \infty$.
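For the two-allele example with $u_{12} = u_{21} = u$ the pair recursion can be iterated directly. In the sketch below the one-step map and the closed form for its fixed point are derived from the model as described above (two daughter genes share a parent with probability $1/2N$, and otherwise descend from two distinct parent genes); they are this sketch's own working rather than formulae quoted from the text, and the values of $N$ and $u$ are illustrative. The fixed point is compared with the large-$N$ approximation $2Nu/(1 + 8Nu)$.

```python
def pair_step(x, N, u):
    """One generation of the recursion for x = P{X_r = 1, X_s = 2}, r != s.

    Two alleles with symmetric mutation rate u; the two daughter genes
    copy the same parent gene with probability 1/(2N).
    """
    same_parent = 1.0 / (2 * N)
    # Same parent: the copies end up as (1, 2) with probability u(1 - u).
    # Distinct parents: the pair (1, 2) is kept with probability (1 - u)^2,
    # (2, 1) is reversed with probability u^2, and a like pair (total
    # probability 1 - 2x) is split with probability u(1 - u).
    from_distinct = (1.0 - 2.0 * x) * u * (1.0 - u) + x * ((1.0 - u) ** 2 + u ** 2)
    return same_parent * u * (1.0 - u) + (1.0 - same_parent) * from_distinct


def stationary_exact(N, u):
    """Fixed point of pair_step, obtained by solving x = pair_step(x)."""
    return u * (1.0 - u) / (1.0 - (1.0 - 1.0 / (2 * N)) * (1.0 - 2.0 * u) ** 2)


def stationary_approx(N, u):
    """Approximation valid when N is large and u is of order 1/N."""
    return 2.0 * N * u / (1.0 + 8.0 * N * u)


N, u = 1000, 1e-4          # illustrative values, so that N*u = 0.1
x = 0.0                    # start from a population with no unlike pairs
for _ in range(60000):
    x = pair_step(x, N, u)

print(x, stationary_exact(N, u), stationary_approx(N, u))
```

With these values the iterated, exact and approximate figures agree to about three decimal places, which illustrates how little is lost by working with the product $Nu$ alone.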
This algebraic programme is quite impracticable, but some advantage can be taken from the fact that, in most cases of biological interest, the mutation rates $u_{ij}$ ($i \ne j$) are small while the population size $N$ is large. In the simple case (3.2.8) a good approximation is $\pi(1, 2) = 2Nu/(1 + 8Nu)$, which depends only on the product $Nu$. One may ask whether there is more generally an approximation to (3.2.6) and its higher order analogues when $N$ is large and the $u_{ij}$ are of order $N^{-1}$. This question can be answered by supposing that we can write
for some stochastic matrix $(q_{ij})$ and some parameter $u$ of order $N^{-1}$. Substituting this into (3.2.6) and ignoring terms of order $N^{-2}$, we arrive at the equation
Here $\delta_{ij} = \delta(i, j)$ is the Kronecker delta, and we adopt the conventional notation
Although perhaps (3.2.12) does not look much simpler than (3.2.6), the approximation yields dividends with the higher order equations; the limit (3.2.10) satisfies
where $\nu_a$ is the number of $b \ne a$ with $j_b = j_a$.¹ Methods for extracting useful information from these elaborate equations will be considered in the next three sections. However, there is one general consequence which is worth noting here (Kingman [10, §5]). Equation (3.2.14) solves to give, by induction on $n$, a probability distribution $\pi_n$ on the set $S^n$ of $n$-tuples, and this for every $n \ge 1$ since the restriction $n \le 2N$ evaporates as $N \to \infty$. Because of the symmetric definition of the $X_r$ this distribution is exchangeable in the sense that $\pi(j_1, j_2, \ldots, j_n)$ is a symmetric function of its $n$ arguments. Moreover, the $\pi_n$ ($n = 1, 2, \ldots$) form a consistent family, in the sense that
Hence de Finetti's theorem applies, to show that there is a family of random variables $p(j)$, $j \in S$, satisfying
¹ More precisely, what can be shown is that the limits in (3.2.10) themselves converge as $N \to \infty$ to quantities satisfying (3.2.14).
and such that, for any $n$,
The methods of Kallenberg [1] or of Kingman [4] may be used to show that the joint distribution of the $p(j)$ is the limiting distribution (as $N \to \infty$ with $\theta = 4Nu$ fixed) of the empirical distribution of the $2N$ variables $X_r$. We return to this interpretation in §3.7.

3.3. The random walk case. The analysis of the last section allows a quite general transition matrix $(u_{ij})$ describing mutation, and it is not surprising that progress is impossible at this level of generality. To be more specific we must make some detailed assumption about the mutation rates. In §1.4 it was noted that an "electrophoretic" picture suggested that the alleles be labelled by the set $Z$ of (positive and negative) integers, with
and all other $u_{ij}$ zero. Unfortunately this transition matrix is not positive-recurrent, but we can recover this property by truncating $Z$ to the set $Z_m = \{0, 1, 2, \ldots, m - 1\}$ made into an additive group by addition modulo $m$. (Or equivalently, winding the integers round a circle of circumference $m$.) When $m$ is large, the distinction between $Z$ and $Z_m$ will be insignificant. The refinement by which $d$ independent electrophoretic measurements are made leads in an obvious way to the choices $S = Z^d$ or $S = Z_m^d$, the respective direct products of $d$ copies of $Z$ or $Z_m$. It is then natural to generalise (3.3.1) at least to the extent of supposing that $u_{ij}$ depends only on the (componentwise) difference $j - i$. Thus it might be supposed that various biological interpretations could be modelled by taking $S$ to be a finite abelian group (written additively) with the assumption that $u_{ij}$ depends only on $j - i$. If this is so, then $u = 1 - u_{ii}$ does not depend on $i$, and there is a probability distribution $q_j$ ($j \in S$) such that (3.2.11) holds in the special form
We describe this as the "random walk" case, and will show how (3.2.14) can to some extent be analyzed by the Fourier transform in the group $S$. Before doing this, we note that the random walk case covers a simple model of the fine structure of the gene. If indeed a gene is a long word, we can label the word at some initial moment as a string of $d$ zeroes ($d$ being the word length), and indicate a mutation of one letter as a change from 0 to 1. If the effect of multiple mutations of a single letter is ignored, the alleles are then words in the letters 0 and 1, so that $S = Z_2^d$. Moreover, a natural assignment of mutation rates is (3.3.2), where $q_j = d^{-1}$ if $j$ has the symbol 1 exactly once, and