Hudson_OxSurv_1990 - OXFORD SURVEYS IN EVOLUTIONARY BIOLOGY...

OXFORD SURVEYS IN EVOLUTIONARY BIOLOGY

EDITED BY DOUGLAS FUTUYMA AND JANIS ANTONOVICS

Volume 7 1990

OXFORD UNIVERSITY PRESS 1990

Gene genealogies and the coalescent process

RICHARD R. HUDSON

1. INTRODUCTION

When a collection of homologous DNA sequences is compared, the pattern of similarities between the different sequences typically contains information about the evolutionary history of those sequences. Under a wide variety of circumstances, sequence data provide information about which sequences are most closely related to each other, and about how far back in time the most recent common ancestors of different sequences occurred. If the sequences were obtained from distinct species, then the information is frequently extracted and displayed in the form of an inferred phylogenetic tree, which may represent the evolutionary relationships of the species from which the sequences were sampled. If, instead of being from different species, the sequences are from different individuals of the same population, the information is genealogical, and in this case gene trees can sometimes be inferred. A gene tree shows which sampled sequences are most closely related to each other and perhaps the times when the most recent common ancestors of different sequences occurred. A hypothetical gene tree, or genealogy, of five sampled sequences is shown in Fig. 1. In the absence of recombination, each sequence has a single ancestor in the previous generation. (It is important to distinguish a gene tree of sampled sequences from the pedigree of a sample of diploid individuals, in which the number of ancestors grows as one proceeds back in time, because each diploid individual has two parents.)

The possibility of obtaining detailed information about the genealogy of sampled genes dramatically changes the situation for molecular population geneticists.
Before the DNA era, molecular polymorphism data were primarily in the form of frequencies of electromorphs, alleles distinguished by their mobility on electrophoretic gels. With protein electrophoresis, two homologous copies of a gene could be classified as being the same or different. If they were different, one could not measure how different; if the two copies were the same, one could not with confidence distinguish whether they were really the same or simply convergent in certain physical properties leading to similar electrophoretic mobility. Thus detailed information about the genealogies of genes could not be extracted from data on electromorph frequencies. With modern DNA techniques, sequences of homologous regions of many individuals are obtainable and detailed information about the genealogy of sampled genes will be obtained. Examples of genealogies inferred from sampled alleles are given in Stephens and Nei (1985), Aquadro et al. (1986), Bermingham and Avise (1986), Avise et al. (1987) and Cann et al. (1987).

[Fig. 1. An example of a genealogy of a sample of five alleles, showing the time intervals between coalescent events. In this figure, the intervals, T(i), are shown with lengths proportional to their expected values as given by eqn (5). The time axis runs from Past to Present; the tips are the sampled alleles.]

The obvious challenge for molecular population geneticists is: How can we utilize this information to increase our understanding of the forces acting on molecular variation in natural populations? From the theory side, we can begin by examining the properties of genealogies that arise under a variety of population genetic models. It is important to ask: Are genealogies expected to be very different under different competing models? Can we devise statistical tests that take advantage of the different genealogies expected?
To proceed with this task, one needs to examine the statistical properties of genealogies of sampled genes under different models. In the following, I will describe a variety of circumstances in which properties of genealogies can be derived analytically or by computer simulation. This will not constitute a comprehensive review of gene genealogy theory, but rather a very personal view that concentrates on the infinite-site model. Some properties of genealogies will be described under selectively neutral models, with and without recombination, and with and without geographic structure. The effects of some forms of selection will also be described. I will indicate some applications of this genealogical approach for carrying out statistical tests or estimating parameters or simply allowing an 'eye-ball' test of the fit of observations to data. I will also indicate how simulations based on the coalescent process can be constructed and used to investigate a variety of models.

This will not be a rigorous mathematical treatment. Those interested in a more precise analysis should consult the seminal work of Kingman (1980, 1982a,b) and the review by Tavaré (1984). Much of the very elegant and useful work of Griffiths (1980), Watterson (1984) and Padmadisastra (1987, 1988) on coalescents and lines of descent that focus on the infinite-allele model will not be covered. This includes a large body of work on the ages of alleles (Donnelly 1986; Donnelly and Tavaré 1986; Tavaré et al. 1989) that is reviewed by Ewens (1989). The infinite-allele models and the infinite-site models are very closely related, as will be described later, and results from one can often be used immediately to answer questions about the other. However, the questions asked and the parameter values considered are often quite distinct for the two models.
In this chapter, I will concentrate on results that directly concern infinite-site models, which I feel are most useful in the interpretation of nucleotide variation in populations. I will focus on properties of relatively small samples of alleles. The work on properties of genealogies of entire populations, including fixation times, will not be considered (Donnelly and Tavaré 1987; Watterson 1982a, 1982b). Also, the important work on the relationship between gene trees and species trees will not be discussed (Hudson 1983b; Neigel and Avise 1986; Pamilo and Nei 1988; Takahata 1989).

Statistical properties of genealogies depend very strongly on the kind of sampling that occurs to produce one generation from the last. In this chapter, only the Wright-Fisher (W-F) model will be considered. The sampling that produces one generation from the last under this model is described briefly in the next section. A range of alternative neutral models have been found that have essentially the same genealogical properties as the W-F model, with only a change of time-scale (Kingman 1982a,b; Watterson 1975; see also the reviews by Tavaré, 1984, and Ewens, 1989).

2. SEPARATING THE GENEALOGICAL PROCESS FROM THE NEUTRAL MUTATION PROCESS

As will be discussed in great detail in the following pages, the statistical properties of genealogies depend on such factors as population size, geographic structure and the presence of selectively maintained alleles. That properties of genealogies should depend on these demographic properties is obvious, because actual genealogies depend on who had offspring and who did not, who migrated and to where, and whose offspring bore selectively important mutations. It should also be clear that strictly neutral mutations (mutations that have not and will not affect fitness) should have no effect on the genealogies of random samples. This is because, by definition, neutral mutations do not affect the number of offspring or tendency to migrate of individuals bearing those mutations. That being the case, we can study the properties of genealogies without regard to a specific mutation model for neutral variants. So, for example, the statistical properties of genealogies do not depend on whether neutral mutations are more frequently transitions than transversions or whether an infinite-site, finite-site or infinite-allele model is most appropriate. Of course, the statistical properties of our inferences about the genealogical process are likely to depend strongly on the mutation process. For example, if the neutral mutation rate is very low, all the sequences in a sample may be identical and we could get no information about the genealogy of the sample.

With the neutral mutation process that we will consider, each offspring differs from its parent at the locus under consideration by a Poisson distributed number of mutations. The mean number of mutations, μ, will be assumed constant, independent of genotype, population size and time. The mutations are assumed to occur independently in different individuals and different generations. This mutation model will be referred to as the constant-rate neutral mutation process. This is the standard neutral mutation model (Kimura 1983; Watterson 1975). Under these assumptions, mutations accumulate along lineages in an inexorable fashion independent of, for example, population size or selection events at linked loci. Given t, the number of generations since the most recent common ancestor of two sampled homologous sequences, S, the number of mutations that have occurred in the descent to the two descendant sequences, is Poisson distributed with mean 2μt. When t is a random quantity, the mean and the variance (in fact all the moments) of S are determined by the moments of t, assuming the constant-rate neutral mutation process.

To emphasize this point, consider a population that at time 0 is completely homozygous at a locus at which only neutral mutations occur. After t generations of evolution, one examines the sequence at the locus in a single randomly selected individual. Under the mutation scheme we have described in the previous paragraph, the number of mutations that will have occurred to distinguish our randomly sampled individual from the individuals in the population at time 0 is just the number of mutations that have occurred along a particular lineage of length t. This number of mutations is Poisson distributed with mean μt. It does not matter what the population size has been, whether selection has been occurring at linked loci, or whether there is population subdivision. This is the basis for the results of Birky and Walsh (1988) concerning the rate of accumulation of neutral mutations when selection is occurring at linked loci. In the example above, the number of mutations that have fixed in the entire population between time 0 and time t will depend on these demographic aspects of the population. Similarly, the amount of polymorphism in the population at time t will depend on population size and other demographic factors, but the number of mutations that will have occurred along individual lineages in the past t generations, that distinguish a sampled sequence from its ancestor t generations back, is Poisson distributed with mean μt, regardless of these other factors.

This property of the constant-rate neutral mutation process will be exploited in the following way. Let Ttot denote the sum of the lengths of the branches of the genealogy of a sample. As discussed in the previous paragraph, S, the number of mutations on the genealogy, given Ttot, is Poisson distributed with mean μTtot. Once the distribution of Ttot is determined under a particular model, the distribution of S can easily be obtained. For example, if the first two moments of Ttot are determined, then the first two moments of S can be calculated using properties of compound distributions as:

$$E(S) = \mu E(T_{tot}) \qquad (1)$$

and

$$\mathrm{Var}(S) = \mu E(T_{tot}) + \mu^{2}\,\mathrm{Var}(T_{tot}) \qquad (2)$$

Reiterating, under the models that we will consider, the properties of genealogies do not depend on the neutral mutation process, and therefore can be studied without precise specification of the neutral mutation process. For example, we can study the statistical properties of Ttot without specifying the rate or pattern of neutral mutation. Furthermore, statistical properties of neutral variation in samples are completely determined by the statistical properties of the genealogies and the neutral mutation process. In other words, if two different models make the same assumptions about the neutral mutation process and if the two different models lead to the same distribution of genealogies, then the pattern of neutral variation will be the same for the two models. For example, if the neutral mutation process is as we have described above, the mean value of S is completely determined by the mean value of Ttot. Two different models that lead to the same mean value of Ttot will have the same mean value of S.

Throughout this chapter, we will consider an ideal W-F model, with either N haploids or N diploids. Briefly, this is a discrete generation model in which, for the haploid version, the N haploids of an offspring generation are obtained by sampling (and replicating possibly with mutation) N times with replacement from the parent generation. In the selectively neutral version, all parents are equally likely as parents of each of the N haploid offspring. A detailed description of this model is contained in Ewens (1979).
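The haploid, selectively neutral W-F sampling step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the chapter: the function names, the population size N = 1000 and the random seed are assumptions chosen for the example, and mutation is left out.

```python
import random
from collections import Counter

def wright_fisher_parents(N, rng):
    """For each of the N haploid offspring, pick a parent index
    uniformly at random, with replacement (neutral W-F sampling)."""
    return [rng.randrange(N) for _ in range(N)]

def offspring_counts(parents, N):
    """Number of offspring left by each of the N parents."""
    counts = Counter(parents)
    return [counts.get(i, 0) for i in range(N)]

# Hypothetical settings, chosen only for illustration.
rng = random.Random(1)
N = 1000

parents = wright_fisher_parents(N, rng)
counts = offspring_counts(parents, N)

mean_offspring = sum(counts) / N       # exactly 1, since N is constant
frac_childless = counts.count(0) / N   # expected value is (1 - 1/N)**N

print(mean_offspring, frac_childless)
```

Because every offspring chooses its parent independently, the expected fraction of parents that leave no offspring is (1 - 1/N)^N, which is close to e^-1 (about 0.37) when N is large.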
We will assume that N is large and constant, in which case individuals have approximately Poisson distributed numbers of offspring. Most of the results concerning this model will be approximate, ignoring terms of order (1/N²) relative to (1/N). This corresponds to the usual assumptions made for using diffusion approximations and will be referred to as the diffusion approximation. In contrast to the W-F model, exact results can often be obtained for the Moran model (see, for example, Watterson 1975). The Moran model will not be considered here.

3. THE SIMPLEST CASE: NO SELECTION AND NO RECOMBINATION

Although genealogical processes are implicit in much of the work on identity coefficients that has been carried on for many years, it was the knowledge of the nature of the genetic material and the possibility of obtaining sequence data (or restriction map data) that stimulated some of the earliest work that considers the genealogical process directly. Watterson's (1975) remarkable paper describes the basic properties of genealogies under neutral models and marks the beginning of modern coalescent theory. The following description of the no-recombination genealogy under the W-F neutral model draws heavily from the work of Watterson (1975), Kingman (1980, 1982a,b), Griffiths (1980) and Tajima (1983).

To begin, we consider an ideal haploid species without recombination, without geographic subdivision and without selection: a typical garden-variety haploid species. We wish to examine properties of the genealogy of a random sample of n individuals from this population. Let us label the population from which the sample was drawn generation 0. The ancestral population t generations back in time will be referred to as generation t. The basic property of a sample drawn from such a population, upon which much of the following is based, concerns the probability, P(n), that all the n sampled individuals have separate distinct ancestors in the preceding generation. Consider first a sample of two individuals. The probability that the second individual sampled has the same parent as the first is 1/N, as under the W-F neutral model each individual of the previous generation is equally likely to be the parent of any individual of the current generation. Thus P(2) is 1 - 1/N. If three individuals are sampled, the probability that all three have distinct ancestors in the previous generation is the probability that the first two have distinct parents × the probability that the parent of the third individual drawn is distinct from the parents of the first two. The probability that the third individual has a distinct parent from the first two, given that the first two have distinct parents, is (N - 2)/N = 1 - 2/N. In general, the probability that n sampled individuals have n distinct parents in the previous generation is:

$$P(n) = \prod_{i=1}^{n-1}\left(1 - \frac{i}{N}\right) \approx 1 - \frac{\binom{n}{2}}{N} \qquad (3)$$

We can ask the same question about these n distinct ancestors: What is the probability that they have n distinct ancestors one generation earlier? Clearly, this is also P(n). This means that the probability that the n sampled individuals have n distinct ancestors in each of the preceding t generations, and that in the t + 1 generation back in time, two or more of the sampled individuals have common ancestors is:

$$P(n)^{t}\,[1 - P(n)] \approx \frac{\binom{n}{2}}{N}\, e^{-\binom{n}{2} t / N} \qquad (4)$$

In words, the time back until the first occurrence of a common ancestor is geometrically distributed and will be approximated by an exponential distribution with mean $N/\binom{n}{2}$. For large N and small n, as we will assume throughout, the probability that more than two individuals of our sample have common ancestors in a single generation is very small and will be ignored. Thus with high probability, the recent history of our sample consists of t generations in which n distinct lineages exist, and then at generation t + 1, a single pair of lineages 'coalesce' at the most recent common ancestor of two of the sampled individuals. Each of the $\binom{n}{2}$ possible pairs of lineages are equally likely to form the coalescing pair.

To continue tracing the history of our sample back in time, we note that in the generations preceding the first coalescence, there are n - 1 ancestors or lineages to follow. The probability, each generation, that all of these ancestors have distinct ancestors in the preceding generation is P(n - 1). So the time to the next coalescence is approximately exponentially distributed with mean $N/\binom{n-1}{2}$. At this coalescence, each of the $\binom{n-1}{2}$ possible pairs of lineages are equally likely to coalesce at this node. Note that one of these (n - 1) lineages has two descendants in our original sample, the other lineages having a single descendant in the sample. We can continue in this way until all the lineages have coalesced into a single lineage, the common ancestor of the entire sample of n individuals. A genealogy of five sampled alleles is shown in Fig. 1.

The stochastic process that generates a genealogy, referred to as the coalescent process, can be summarized very briefly. The time, T(j), during which there are j distinct lineages is approximately exponentially distributed, and if time is measured in units of N generations, the mean of T(j) is:

$$E[T(j)] = \frac{1}{\binom{j}{2}} = \frac{2}{j(j-1)} \qquad (5)$$

[...]lies that can be considered completely linked. For the model being considered, sufficiently low means that Nr < 1, where r is the recombination rate per generation between the ends of the region being considered. If time is measured in units of N generations for haploid models and in units of 2N generations for diploid models, the results are exactly the same for haploids and diploids, i.e. the mean of T(j) is given by eqn (5). Unlinked loci in large populations are essentially independent and will have their own independent genealogies. Linked loci, which have correlated genealogies, will be considered later.

4. ADDING NEUTRAL MUTATIONS TO THE GENEALOGY

Given the properties of the genealogies just described, we can predict properties of samples under various mutation schemes. As discussed in the previous section, we will assume a constant-rate neutral mutation process, in which each offspring gamete differs from its parent by an average of μ mutations. In addition, we will assume an infinite-site model (Kimura 1969). Under this model, the locus is composed of many sites so that no more than one mutation occurs at any site in the genealogy of our sample. The oft-employed infinite-allele model (Kimura and Crow 1964) is similar, assuming that each mutation produces a new allele not present anywhere else in the genealogy of the sample. For our purposes the infinite-site model and the infinite-allele model are essentially the same bu...
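The coalescent process of Section 3 and the constant-rate neutral mutation process of Section 2 can be combined in a short simulation. The Python sketch below is illustrative only and makes several assumptions that are not in the text: the function names, the sample size n = 5, the seed, and the parameter mu_N (hypothetical, standing for Nμ, the mutation rate per lineage per unit of N generations) are all chosen for the example. It draws each T(j) from an exponential distribution with mean 2/(j(j-1)) in units of N generations (that is, $N/\binom{j}{2}$ generations), sums j·T(j) to obtain Ttot, and then draws S as a Poisson variate with mean mu_N·Ttot.

```python
import random

def poisson(lam, rng):
    """Poisson draw: count unit-rate exponential arrivals in [0, lam]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def coalescent_times(n, rng):
    """Return {j: T(j)} in units of N generations.
    T(j) is exponential with rate j(j-1)/2, i.e. mean 2/(j(j-1))."""
    return {j: rng.expovariate(j * (j - 1) / 2) for j in range(n, 1, -1)}

def total_branch_length(times):
    """Ttot: while j lineages exist, j branches each grow by T(j)."""
    return sum(j * t for j, t in times.items())

def simulate_sample(n, mu_N, rng):
    """One replicate: (Ttot, S), with S | Ttot ~ Poisson(mu_N * Ttot)."""
    ttot = total_branch_length(coalescent_times(n, rng))
    return ttot, poisson(mu_N * ttot, rng)

def mean_and_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

# Hypothetical settings, chosen only for illustration.
rng = random.Random(42)
n, mu_N, reps = 5, 2.5, 20000

draws = [simulate_sample(n, mu_N, rng) for _ in range(reps)]
mean_ttot, var_ttot = mean_and_var([t for t, _ in draws])
mean_S, var_S = mean_and_var([s for _, s in draws])

# Expected Ttot from the exponential means: sum over j of j * 2/(j(j-1)).
expected_ttot = sum(j * 2 / (j * (j - 1)) for j in range(2, n + 1))

print(mean_ttot, expected_ttot)
print(mean_S, mu_N * expected_ttot)                # compare with eqn (1)
print(var_S, mu_N * mean_ttot + mu_N**2 * var_ttot)  # compare with eqn (2)
```

Up to sampling error, the simulated mean and variance of S should agree with the compound-distribution results of eqns (1) and (2). The simulation never needs the tree topology, because under the infinite-site model the number of mutations on the genealogy depends only on the total branch length Ttot.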