OXFORD SURVEYS IN EVOLUTIONARY BIOLOGY
EDITED BY DOUGLAS FUTUYMA AND JANIS ANTONOVICS
Volume 7, 1990
OXFORD UNIVERSITY PRESS 1990

Gene genealogies and the coalescent process
RICHARD R. HUDSON

1. INTRODUCTION

When a collection of homologous DNA sequences is compared, the
pattern of similarities between the different sequences typically contains
information about the evolutionary history of those sequences. Under a
wide variety of circumstances, sequence data provide information about
which sequences are most closely related to each other, and about how far back in time the most recent common ancestors of different sequences
occurred. If the sequences were obtained from distinct species, then the
information is frequently extracted and displayed in the form of an inferred
phylogenetic tree, which may represent the evolutionary relationships of
the species from which the sequences were sampled. If, instead of being
from different species, the sequences are from different individuals of the
same population, the information is genealogical, and in this case gene
trees can sometimes be inferred. A gene tree shows which sampled
sequences are most closely related to each other and perhaps the times
when the most recent common ancestors of different sequences occurred.
A hypothetical gene tree, or genealogy, of ﬁve sampled sequences is
shown in Fig. 1. In the absence of recombination, each sequence has a
single ancestor in the previous generation. (It is important to distinguish
a gene tree of sampled sequences from the pedigree of a sample of diploid
individuals, in which the number of ancestors grows as one proceeds back
in time, because each diploid individual has two parents.) The possibility
of obtaining detailed information about the genealogy of sampled genes
dramatically changes the situation for molecular population geneticists.

Before the DNA era, molecular polymorphism data were primarily in
the form of frequencies of electromorphs, alleles distinguished by their mobility on electrophoretic gels. With protein electrophoresis, two homologous
copies of a gene could be classified as being the same or different.
If they were different, one could not measure how different; if the two copies were the same, one could not with confidence distinguish whether
they were really the same or simply convergent in certain physical properties
leading to similar electrophoretic mobility. Thus detailed information
about the genealogies of genes could not be extracted from data on
electromorph frequencies. With modern DNA techniques, sequences of
homologous regions of many individuals are obtainable and detailed information
about the genealogy of sampled genes will be obtained. Examples
of genealogies inferred from sampled alleles are given in Stephens and
Nei (1985), Aquadro et al. (1986), Bermingham and Avise (1986), Avise
et al. (1987) and Cann et al. (1987).

Fig. 1. An example of a genealogy of a sample of five alleles, showing the time
intervals between coalescent events. In this figure, the intervals, T(i), are shown
with lengths proportional to their expected values as given by eqn (5). (Figure labels: Past, Present, Sampled alleles.)

The obvious challenge for molecular population geneticists is: How can
we utilize this information to increase our understanding of the forces
acting on molecular variation in natural populations? From the theory
side, we can begin by examining the properties of genealogies that arise
under a variety of population genetic models. It is important to ask:
Are genealogies expected to be very different under different competing
models? Can we devise statistical tests that take advantage of the different
genealogies expected? To proceed with this task, one needs to examine the statistical properties of genealogies of sampled genes under different
models.

In the following, I will describe a variety of circumstances in which
properties of genealogies can be derived analytically or by computer
simulation. This will not constitute a comprehensive review of gene genealogy
theory, but rather a very personal view that concentrates on the
infinite-site model. Some properties of genealogies will be described under selectively neutral models, with and without recombination, and with and
without geographic structure. The effects of some forms of selection will also be described. I will indicate some applications of this genealogical
approach for carrying out statistical tests or estimating parameters or simply allowing an 'eyeball' test of the fit of observations to data. I will
also indicate how simulations based on the coalescent process can be constructed and used to investigate a variety of models.

This will not be a rigorous mathematical treatment. Those interested
in a more precise analysis should consult the seminal work of Kingman
(1980, 1982a,b) and the review by Tavaré (1984). Much of the very elegant
and useful work of Griffiths (1980), Watterson (1984) and Padmadisastra
(1987, 1988) on coalescents and lines of descent that focus on the infinite-allele
model will not be covered. This includes a large body of work on
the ages of alleles (Donnelly 1986; Donnelly and Tavaré 1986; Tavaré et
al. 1989) that is reviewed by Ewens (1989). The infinite-allele models
and the infinite-site models are very closely related, as will be described
later, and results from one can often be used immediately to answer
questions about the other. However, the questions asked and the parameter
values considered are often quite distinct for the two models. In
this chapter, I will concentrate on results that directly concern infinite-site
models, which I feel are most useful in the interpretation of nucleotide
variation in populations.

I will focus on properties of relatively small samples of alleles. The
work on properties of genealogies of entire populations, including ﬁxation
times, will not be considered (Donnelly and Tavaré 1987; Watterson
1982a, 1982b). Also, the important work on the relationship between
gene trees and species trees will not be discussed (Hudson 1983b; Neigel
and Avise 1986; Pamilo and Nei 1988; Takahata 1989).

Statistical properties of genealogies depend very strongly on the kind
of sampling that occurs to produce one generation from the last. In this
chapter, only the Wright-Fisher (W-F) model will be considered. The
sampling that produces one generation from the last under this model is
described briefly in the next section. A range of alternative neutral models
have been found that have essentially the same genealogical properties as
the W-F model, with only a change of time-scale (Kingman 1982a,b;
Watterson 1975; see also the reviews by Tavaré, 1984, and Ewens, 1989).

2. SEPARATING THE GENEALOGICAL PROCESS FROM THE NEUTRAL MUTATION PROCESS

As will be discussed in great detail in the following pages, the statistical properties of genealogies depend on such factors as population size, geographic structure and the presence of selectively maintained alleles. That properties of genealogies should depend on these demographic properties is obvious, because actual genealogies depend on who had offspring and who did not, who migrated and to where, and whose offspring bore selectively important mutations. It should also be clear that strictly neutral mutations (mutations that have not affected and will not affect fitness) should have no effect on the genealogies of random samples. This is because, by definition, neutral mutations do not affect the number of offspring or tendency to migrate of individuals bearing those mutations. That being the case, we can study the properties of genealogies without regard to a specific mutation model for neutral variants. So, for example, the statistical properties of genealogies do not depend on whether neutral mutations are more frequently transitions than transversions or whether an infinite-site, finite-site or infinite-allele model is most appropriate. Of course, the statistical properties of our inferences about the genealogical process are likely to depend strongly on the mutation process. For example, if the neutral mutation rate is very low, all the sequences in a sample may be identical and we could get no information about the genealogy of the sample.

With the neutral mutation process that we will consider, each offspring differs from its parent at the locus under consideration by a Poisson distributed number of mutations. The mean number of mutations, $\mu$, will be assumed constant, independent of genotype, population size and time. The mutations are assumed to occur independently in different individuals and different generations. This mutation model will be referred to as the constant-rate neutral mutation process. This is the standard neutral mutation model (Kimura 1983; Watterson 1975). Under these assumptions, mutations accumulate along lineages in an inexorable fashion independent of, for example, population size or selection events at linked loci. Given t, the number of generations since the most recent common ancestor of two sampled homologous sequences, S, the number of mutations that have occurred in the descent to the two descendent sequences, is Poisson distributed with mean $2\mu t$. When t is a random quantity, the mean and the variance (in fact all the moments) of S are determined by the moments of t, assuming the constant-rate neutral mutation process.

To emphasize this point, consider a population that at time 0 is completely homozygous at a locus at which only neutral mutations occur. After t generations of evolution, one examines the sequence at the locus in a single randomly selected individual. Under the mutation scheme we have described in the previous paragraph, the number of mutations that will have occurred to distinguish our randomly sampled individual from the individuals in the population at time 0 is just the number of mutations that have occurred along a particular lineage of length t. This number of mutations is Poisson distributed with mean $\mu t$. It does not matter what the population size has been, whether selection has been occurring at linked loci, or whether there is population subdivision. This is the basis for the results of Birky and Walsh (1988) concerning the rate of accumulation of neutral mutations when selection is occurring at linked loci. In the example above, the number of mutations that have fixed in the entire population between time 0 and time t will depend on these demographic aspects of the population. Similarly, the amount of polymorphism in the population at time t will depend on population size and other demographic factors, but the number of mutations that will have occurred along individual lineages in the past t generations, that distinguish a sampled sequence from its ancestor t generations back, is Poisson distributed with mean $\mu t$, regardless of these other factors.

This property of the constant-rate neutral mutation process will be exploited in the following way. Let $T_{tot}$ denote the sum of the lengths of the branches of the genealogy of a sample. As discussed in the previous paragraph, S, the number of mutations on the genealogy, given $T_{tot}$, is Poisson distributed with mean $\mu T_{tot}$. Once the distribution of $T_{tot}$ is determined under a particular model, the distribution of S can easily be obtained. For example, if the first two moments of $T_{tot}$ are determined, then the first two moments of S can be calculated using properties of compound distributions as:

$$E(S) = \mu E(T_{tot}) \qquad (1)$$

and

$$\mathrm{Var}(S) = \mu E(T_{tot}) + \mu^2 \mathrm{Var}(T_{tot}) \qquad (2)$$

Reiterating, under the models that we will consider, the properties of genealogies do not depend on the neutral mutation process, and therefore can be studied without precise specification of the neutral mutation process. For example, we can study the statistical properties of $T_{tot}$ without specifying the rate or pattern of neutral mutation. Furthermore, statistical properties of neutral variation in samples are completely determined by the statistical properties of the genealogies and the neutral mutation process. In other words, if two different models make the same assumptions about the neutral mutation process and if the two different models lead to the same distribution of genealogies, then the pattern of neutral variation will be the same for the two models. For example, if the neutral mutation process is as we have described above, the mean value of S is
completely determined by the mean value of $T_{tot}$. Two different models that lead to the same mean value of $T_{tot}$ will have the same mean value of S.

Throughout this chapter, we will consider an ideal W-F model, with either N haploids or N diploids. Briefly, this is a discrete generation model in which, for the haploid version, the N haploids of an offspring generation are obtained by sampling (and replicating, possibly with mutation) N times with replacement from the parent generation. In the selectively neutral version, all parents are equally likely as parents of each of the N haploid offspring. A detailed description of this model is contained in Ewens (1979). We will assume that N is large and constant, in which case individuals have approximately Poisson distributed numbers of offspring. Most of the results concerning this model will be approximate, ignoring terms of order $1/N^2$ relative to $1/N$. This corresponds to the usual assumptions made for using diffusion approximations and will be referred to as the diffusion approximation. In contrast to the W-F model, exact results can often be obtained for the Moran model (see, for example, Watterson 1975). The Moran model will not be considered here.

3. THE SIMPLEST CASE: NO SELECTION AND NO RECOMBINATION

Although genealogical processes are implicit in much of the work on identity coefficients that has been carried on for many years, it was the knowledge of the nature of the genetic material and the possibility of obtaining sequence data (or restriction map data) that stimulated some of the earliest work that considers the genealogical process directly. Watterson's (1975) remarkable paper describes the basic properties of genealogies under neutral models and marks the beginning of modern coalescent theory. The following description of the no-recombination genealogy under the W-F neutral model draws heavily from the work of Watterson (1975), Kingman (1980, 1982a,b), Griffiths (1980) and Tajima (1983).

To begin, we consider an ideal haploid species without recombination, without geographic subdivision and without selection: a typical garden-variety haploid species. We wish to examine properties of the genealogy of a random sample of n individuals from this population. Let us label the population from which the sample was drawn generation 0. The ancestral population t generations back in time will be referred to as generation t.

The basic property of a sample drawn from such a population, upon which much of the following is based, concerns the probability, P(n), that all the n sampled individuals have separate distinct ancestors in the preceding generation. Consider first a sample of two individuals. The probability that the second individual sampled has the same parent as the first is 1/N, as under the W-F neutral model each individual of the previous generation is equally likely to be the parent of any individual of the current generation. Thus P(2) is $1 - 1/N$. If three individuals are sampled, the probability that all three have distinct ancestors in the previous generation is the probability that the first two have distinct parents × the probability that the parent of the third individual drawn is distinct from the parents of the first two. The probability that the third individual has a distinct parent from the first two, given that the first two have distinct parents, is $(N-2)/N = 1 - 2/N$. In general, the probability that n sampled individuals have n distinct parents in the previous generation is:

$$P(n) = \prod_{i=1}^{n-1}\left(1 - \frac{i}{N}\right) \approx 1 - \frac{\binom{n}{2}}{N} \qquad (3)$$

We can ask the same question about these n distinct ancestors: What is the probability that they have n distinct ancestors one generation earlier? Clearly, this is also P(n). This means that the probability that the n sampled individuals have n distinct ancestors in each of the preceding t generations, and that in the t + 1 generation back in time, two or more of the sampled individuals have common ancestors is:

$$P(n)^t\,[1 - P(n)] \approx \frac{\binom{n}{2}}{N}\, e^{-\binom{n}{2} t/N} \qquad (4)$$

In words, the time back until the first occurrence of a common ancestor is geometrically distributed and will be approximated by an exponential distribution with mean $N/\binom{n}{2}$. For large N and small n, as we will assume throughout, the probability that more than two individuals of our sample have common ancestors in a single generation is very small and will be ignored. Thus with high probability, the recent history of our sample consists of t generations in which n distinct lineages exist, and then at generation t + 1, a single pair of lineages 'coalesce' at the most recent common ancestor of two of the sampled individuals. Each of the $\binom{n}{2}$
possible pairs of lineages is equally likely to form the coalescing pair. To continue tracing the history of our sample back in time, we note that in the generations preceding the first coalescence, there are n − 1 ancestors or lineages to follow. The probability, each generation, that all of these ancestors have distinct ancestors in the preceding generation is P(n−1). So the time to the next coalescence is approximately exponentially distributed with mean $N/\binom{n-1}{2}$. At this coalescence, each of the $\binom{n-1}{2}$ possible pairs of lineages is equally likely to coalesce at this node. Note that one of these (n−1) lineages has two descendants in our original sample, the other lineages having a single descendant in the sample. We can continue in this way until all the lineages have coalesced into a single lineage, the common ancestor of the entire sample of n individuals.

A genealogy of five sampled alleles is shown in Fig. 1. The stochastic process that generates a genealogy, referred to as the coalescent process, can be summarized very briefly. The time, T(j), during which there are j distinct lineages is approximately exponentially distributed, and if time is measured in units of N generations, the mean of T(j) is:

$$E(T(j)) = \frac{1}{\binom{j}{2}} = \frac{2}{j(j-1)} \qquad (5)$$

[...] that can be considered completely linked. For the model being considered, sufficiently low means that Nr < 1, where r is the recombination rate per generation between the ends of the region being considered. If time is measured in units of N generations for haploid models and in units of 2N generations for diploid models, the results are exactly the same for haploids and diploids, i.e. the mean of T(j) is given by eqn (5). Unlinked loci in large populations are essentially independent and will have their own independent genealogies. Linked loci, which have correlated genealogies, will be considered later.

4. ADDING NEUTRAL MUTATIONS TO THE GENEALOGY

Given the properties of the genealogies just described, we can predict properties of samples under various mutation schemes. As discussed in the previous section, we will assume a constant-rate neutral mutation process, in which each offspring gamete differs from its parent by an average of $\mu$ mutations. In addition, we will assume an infinite-site model (Kimura 1969). Under this model, the locus is composed of many sites so that no more than one mutation occurs at any site in the genealogy of our sample. The oft-employed infinite-allele model (Kimura and Crow 1964) is similar, assuming that each mutation produces a new allele not present anywhere else in the genealogy of the sample. For our purposes the infinite-site model and the infinite-allele model are essentially the same bu...
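
As a rough illustration of how Sections 3 and 4 fit together, the following Python sketch draws the times T(j) as exponentials with mean 1/(j choose 2) (time in units of N generations), sums the branch lengths to obtain T_tot, and then places a Poisson number of mutations on the genealogy. It is a minimal sketch under stated assumptions, not the chapter's own procedure: the function names, the parameter values and the standard-library Poisson sampler are invented for the example, and the T(j) are assumed to be independent. Under that assumption E(T_tot) = 2 Σ 1/i and Var(T_tot) = 4 Σ 1/i², with i running from 1 to n − 1, so the simulated moments of S can be compared with eqns (1) and (2).

```python
import random


def n_pairs(j):
    """Number of distinct pairs among j lineages, i.e. 'j choose 2'."""
    return j * (j - 1) // 2


def coalescent_times(n, rng):
    """Draw T(j) for j = n, n-1, ..., 2, in units of N generations.

    Each T(j) is exponential with mean 1 / (j choose 2); the draws are
    taken to be independent (an assumption of this sketch).
    """
    return {j: rng.expovariate(n_pairs(j)) for j in range(n, 1, -1)}


def total_branch_length(times):
    """T_tot: while j lineages exist, each of the j branches grows by T(j)."""
    return sum(j * t for j, t in times.items())


def poisson_draw(mean, rng):
    """Poisson variate, obtained by counting unit-rate arrivals before `mean`."""
    count = 0
    arrival = rng.expovariate(1.0)
    while arrival < mean:
        count += 1
        arrival += rng.expovariate(1.0)
    return count


def simulate_s(n, mu, pop_size, rng):
    """One replicate of S, the number of mutations on the genealogy.

    mu is the mean number of mutations per sequence per generation, so the
    Poisson mean is mu * pop_size * T_tot when T_tot is in units of N
    generations (pop_size plays the role of N; the name is hypothetical).
    """
    t_tot = total_branch_length(coalescent_times(n, rng))
    return poisson_draw(mu * pop_size * t_tot, rng)


def expected_s(n, mu, pop_size):
    """Eqn (1) with E(T_tot) = 2 * sum_{i=1}^{n-1} 1/i (scaled time)."""
    e_ttot = 2.0 * sum(1.0 / i for i in range(1, n))
    return mu * pop_size * e_ttot


def variance_s(n, mu, pop_size):
    """Eqn (2) with Var(T_tot) = 4 * sum_{i=1}^{n-1} 1/i^2 (scaled time)."""
    rate = mu * pop_size
    e_ttot = 2.0 * sum(1.0 / i for i in range(1, n))
    var_ttot = 4.0 * sum(1.0 / i ** 2 for i in range(1, n))
    return rate * e_ttot + rate ** 2 * var_ttot


if __name__ == "__main__":
    rng = random.Random(1)
    n, mu, pop_size, reps = 5, 1e-4, 10_000, 20_000
    draws = [simulate_s(n, mu, pop_size, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    var = sum((d - mean) ** 2 for d in draws) / (reps - 1)
    print("simulated mean, variance:", mean, var)
    print("predicted mean, variance:",
          expected_s(n, mu, pop_size), variance_s(n, mu, pop_size))
```

Because the genealogy is generated without any reference to the mutation process, the same `coalescent_times` draw could be reused with a different mutation scheme; only `simulate_s` would change.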