Genetics, Vol. 162, 1811-1823, December 2002, Copyright © 2002

Likelihood and Bayes Estimation of Ancestral Population Sizes in Hominoids Using Data From Multiple Loci

Ziheng Yanga
a Galton Laboratory, Department of Biology, University College London, London WC1E 6BT, England

Corresponding author: Ziheng Yang, University College London, Darwin Bldg., Gower St., London WC1E 6BT, England., z.yang{at}ucl.ac.uk (E-mail)

Communicating editor: Y.-X. FU


*  ABSTRACT
*TOP
*ABSTRACT
*MAXIMUM-LIKELIHOOD ESTIMATION
*THE BAYES APPROACH USING...
*DISCUSSION
*APPENDIX
*LITERATURE CITED

Polymorphisms in an ancestral population can cause conflicts between gene trees and the species tree. Such conflicts can be used to estimate ancestral population sizes when data from multiple loci are available. In this article I extend previous work for estimating ancestral population sizes to analyze sequence data from three species under a finite-site nucleotide substitution model. Both maximum-likelihood (ML) and Bayes methods are implemented for joint estimation of the two speciation dates and the two population size parameters. Both methods account for uncertainties in the gene tree due to few informative sites at each locus and make an efficient use of information in the data. The Bayes algorithm using Markov chain Monte Carlo (MCMC) enjoys a computational advantage over ML and also provides a framework for incorporating prior information about the parameters. The methods are applied to a data set of 53 nuclear noncoding contigs from human, chimpanzee, and gorilla published by Chen and Li. Estimates of the effective population size for the common ancestor of humans and chimpanzees by both ML and Bayes methods are ~12,000–21,000, comparable to estimates for modern humans, and do not support the notion of a dramatic size reduction in early human populations. Estimates published previously from the same data are several times larger and appear to be biased due to methodological deficiency. The divergence between humans and chimpanzees is dated at ~5.2 million years ago and the gorilla divergence 1.1–1.7 million years earlier. The analysis suggests that typical data sets contain useful information about the ancestral population sizes and that it is advantageous to analyze data of several species simultaneously.


THE amount of sequence polymorphism in a population is mainly determined by the parameter {theta} = 4Nµ, where N is the (effective) population size and µ is the mutation rate per nucleotide site per generation. In addition to its effect on the amount of neutral variation maintained in a population, the population size is also important in affecting the efficiency of natural selection and is a critical parameter in population genetics models of mutation and selection. It is important to competing theories of origin and evolution of modern humans. When an estimate of the mutation rate is available, as is assumed here, the population size can be obtained from estimates of {theta}. The population size of a present-day species can be estimated by using observed DNA sequence variation in the population (e.g., KREITMAN 1983 Down; FU 1994 Down; TAKAHATA et al. 1995 Down; YANG 1997A Down). The population size of modern humans has been estimated to be ~10,000 (e.g., TAKAHATA et al. 1995 Down; ZHAO et al. 2000 Down; YU et al. 2001 Down).

The population size of an extinct ancestral species, such as the common ancestor of humans and chimpanzees, is harder to estimate, but a few methods have been developed (see EDWARDS and BEERLI 2000 Down and SATTA et al. 2000 Down for extensive reviews). They require data from multiple loci and make use of the information in the conflict of gene trees among loci. For example, TAKAHATA 1986 Down suggested a method for estimating the population size of the common ancestor of two closely related species when a pair of homologous DNA sequences are available from the two species at each locus. The average divergence at many loci provides information on the species divergence time, while the variation of sequence divergence among loci reflects the ancestral population size. The method of TAKAHATA 1986 Down uses summary statistics from the data and was later extended to a full-likelihood analysis and to the case of three species, where the population sizes of the two extinct ancestors as well as the two speciation dates were estimated (TAKAHATA et al. 1995 Down).

In the case of three species, another simpler method has also been used (NEI 1987 Down; WU 1991 Down; HUDSON 1992 Down). This approach uses the proportion of loci at which the gene tree does not match the species tree to estimate the ancestral population size and exploits the fact that ancestral polymorphism creates conflicts between the gene tree and the species tree. I refer to it as the tree-mismatch method. Its application to human and great ape sequence data has led to estimates of the population size for the ancestor of humans and chimpanzees that are much larger than estimated population sizes for modern humans (RUVOLO 1997 Down; CHEN and LI 2001 Down). For example, CHEN and LI 2001 Down sequenced 53 noncoding contigs from human, common chimpanzee, gorilla, and orangutan, and estimated the population size of the common ancestor of humans and chimpanzees to range from 52,000 to 150,000, depending on the generation time used (15 or 20 years) and on whether parsimony or neighbor joining was used to infer the gene trees.

The tree-mismatch approach has room for improvement. First, the conflicts in the estimated gene trees among loci can be due to both ancestral polymorphism and sampling errors in tree reconstruction. As the sequences are highly similar, with few variable or informative sites at each locus, the reconstructed gene tree, no matter what method was used to infer it, is unreliable. Ignoring the sampling error in the estimated gene tree leads to overestimation of the conflict among loci and overestimation of the ancestral population size (see below). Second, the branch lengths in the gene tree provide, in a probabilistic sense, information about the ancestral population parameters, but are ignored by the method. The method clearly makes poor use of the information in the data. Third, the method assumes that one knows the species divergence times, while they might not be known. Indeed, some authors argue for the importance of accounting for ancestral polymorphism when estimating speciation dates (TAKAHATA et al. 1995 Down).

It seems advantageous to use information in both the conflicting gene genealogies as well as the branch lengths while accounting for their uncertainties. This can be achieved by using the likelihood method under a coalescent model developed by TAKAHATA et al. 1995 Down. However, the infinite-sites model assumed in the method is sometimes violated by the data. In this article, I extend the method of Takahata et al. for estimating ancestral population sizes, using data from three species. I use the nucleotide substitution model of JUKES and CANTOR 1969 Down to correct for multiple substitutions at the same site. The model is also implemented in a Bayes framework, with Markov chain Monte Carlo (MCMC) used to achieve efficient calculation. The methods are applied to the data of CHEN and LI 2001 Down.


*  MAXIMUM-LIKELIHOOD ESTIMATION
*TOP
*ABSTRACT
*MAXIMUM-LIKELIHOOD ESTIMATION
*THE BAYES APPROACH USING...
*DISCUSSION
*APPENDIX
*LITERATURE CITED

The species phylogeny is represented ((12)3) and is assumed known. Species 1 and 2 diverged {tau}1 generations ago while species 3 diverged {tau}0 generations earlier (Fig 1A). In this article, species 1, 2, and 3 represent human (H), chimpanzee (C), and gorilla (G), respectively. The effective population size of the ancestor of species 1 and 2 is N1, and that of the common ancestor of all three species is N0. The mutation rate µ is measured as the number of nucleotide substitutions per site per generation. As rate and time are confounded in such analysis, the parameters of the model are {theta}0 = 4N0µ, {theta}1 = 4N1µ, {gamma}0 = {tau}0µ, and {gamma}1 = {tau}1µ (Fig 1A). Collectively they are denoted {Theta} = {{theta}0, {theta}1, {gamma}0, {gamma}1}. The data contain DNA sequences from multiple loci, with one individual sequenced from each of the three species at each locus. It is assumed that there is no recombination within a locus and free recombination between loci. The population is assumed to be random mating, with no gene flow after species divergence.



View larger version (17K):
In this window
In a new window
Download PPT slide
 
Figure 1. (a) The species tree ((12)3) for three species: 1 (human), 2 (chimpanzee), and 3 (gorilla). Species 1 and 2 diverged {tau}1 generations ago and species 3 diverged {tau}0 generations earlier. The population sizes are N0 in the common ancestor of all three species and N1 in the common ancestor of species 1 and 2. The four parameters in the model are {theta}0 = 4N0µ, {theta}1 = 4N1µ, {gamma}0 = {tau}0µ, and {gamma}1 = {tau}1µ, where µ is the mutation rate per site per generation. There are four possible gene trees, represented by b–e. In case I (b), sequences 1 and 2 coalescence between the two speciation events and the gene tree T0 = ((12)3) is consistent with the species tree. This is referred to as case I in the text. In c–e, both coalescence events occur in the common ancestor of all three species. In this case (case II), the gene tree can be T1 = ((12)3), T2 = ((23)1), or T3 = ((31)2), with equal probability.

The likelihood function:
Consider the probability distribution of the gene tree and its branch lengths at any locus i. Two cases are possible and are dealt with separately. Case I is represented by Fig 1B, where the gene tree is T0 = ((12)3), identical to the species tree, and sequences 1 and 2 coalesce between the two speciation events C and D. The coalescent times t0 and t1 are defined in Fig 1B. Case I occurs when t1 < {tau}0, with probability

(1)

(e.g., TAKAHATA et al. 1995 Down). In case II, both coalescent events occur in the population ancestral to all three species (Fig 1, c–e). The three gene trees T1 = ((12)3), T2 = ((23)1), and T3 = ((31)2) occur with equal probability (1 - {psi})/3.

In case I (tree T0 in Fig 1B), the coalescent time t1 is an exponential random variable truncated at t1 < {tau}0, and t0 is an independent random variable with an exponential distribution

(2)

Let branch lengths b0 and b1 in the gene tree of Fig 1B be defined as follows:

(3)

Note that b0 is the length of the internal branch AB and b1 is the distance from sequence 1 to ancestor B (Fig 1B). Given that the gene tree at locus i is Ti = T0 and its branch lengths are bi0 and bi1, the probability of observing sequence data Di at locus i, P(Di|T0, bi0, bi1), is the likelihood in traditional molecular phylogenetics (FELSENSTEIN 1981 Down). Calculation of this probability is discussed in the next section. By averaging over the distribution of the random variables t0 and t1 (or b0 and b1), the probability of observing the data at locus i for case I is

(4)

after a change of variables.

In case II (Fig 1, c–e), both coalescence events occur in the population ancestral to all three species. The three gene trees T1, T2, and T3 have equal probabilities. The coalescent times t0 and t1, defined in Fig 1C&NDASH;E, are independent random variables with exponential distributions

(5)

for k = 1, 2, or 3. Similarly, the branch lengths in the gene tree are defined as

(6)

Calculation of the probability, P(Di|Tk, bi0, bi1), of observing data Di at locus i, given the gene tree Ti = Tk (k = 1, 2, 3) and its branch lengths bi0 and bi1, will be described in the next section. The probability of observing the data at locus i for case II is

(7)

for k = 1, 2, 3.

Averaging over cases I and II or over the four gene trees T0, T1, T2, T3 in Fig 1, we obtain the marginal probability of observing data Di at locus i as

(8)

The log likelihood is a sum over all the L loci,

(9)

where D = {Di} represents data at all L loci. Parameters {theta}0, {theta}1, {gamma}0, and {gamma}1 are estimated by numerical maximization of the log-likelihood function. A C-optimization routine from the PAML package (YANG 1997B Down) was used. Each likelihood calculation requires evaluation of 2L two-dimensional integrals (Equation 8), which are calculated numerically using Mathematica. The Mathlink library was used to establish communication between the C routine and Mathematica. For the data of CHEN and LI 2001 Down with L = 53 loci, the ML iteration takes ~1 hr on a Pentium III PC at 1.2 GHz.

Probability of data at a locus given the gene tree and branch lengths:
Given the gene tree Ti, which is one of T0, T1, T2, or T3 in Fig 1B, and its branch lengths (bi0 and bi1), the probability of observing the data, P(Di|Ti, bi0, bi1), in Equation 8 can be calculated under any model of nucleotide substitution (LIO and GOLDMAN 1998 Down). The model of JUKES and CANTOR 1969 Down is used in this article, which seems sufficient for such highly similar sequences. Under this model, the sequence data can be summarized as counts of five possible site patterns (configurations): xxx, xxy, yxx, xyx, and xyz, where x, y, and z are any three different nucleotides. The data at locus i can thus be represented by the number of sites with those five site patterns: Di = {ni0, ni1, ni2, ni3, ni4}. The observed number of counts from the data of CHEN and LI 2001 Down is listed in Table 1.


 
View this table:
In this window
In a new window

 
Table 1. Number of site patterns from the data of CHEN and LI 2001 Down

For gene trees T0 or T1, the probabilities of observing the five site patterns are

(10)

(YANG 1994 Down). For gene trees T2 and T3 (Fig 1D and Fig E), the probabilities can be obtained by considering the symmetry of the problem. Thus with functions p0p4 defined above, the probability of data at locus i, conditional on the gene tree Tk, k = 1 (or 0), 2, 3, and its branch lengths b0 and b1, is given by the multinomial distribution

(11)

where C = ni!/{Pi}4j=0nij!, and ni = ni0 + ni1 + ni2 + ni3 + ni4.

Mutation rate variation among loci:
An important factor that may influence the estimation of ancestral population sizes is the variation of mutation rates among loci (YANG 1997A Down; CHEN and LI 2001 Down). For example, estimation of ancestral population size from comparison between two species was found to be sensitive to even slight rate variation (YANG 1997A Down). If relative rates for the loci are available, it will be straightforward to incorporate them in the likelihood calculation (YANG 1997A Down). In this article, I use the average distance from the orangutan to the three African apes to calculate the relative rate for the locus (Table 1). This ad hoc approach appears sensible since the orangutan diverged from the African apes very early and ancestral polymorphism in their common ancestor does not seem important. The likelihood calculation proceeds as before except that the branch lengths for the gene tree at each locus are multiplied by the relative rate for that locus.

Application to the data of Chen and Li:
CHEN and LI 2001 Down sequenced one individual from each of the four species, human, chimpanzee, gorilla, and orangutan, at 53 independent noncoding loci (contigs). An advantage of the data set is that the sequences are expected to be outside and far away from coding regions and not affected by selection at linked sites or loci. The model of this article assumes the molecular clock and uses only three species. The counts of site patterns are listed in Table 1. At some loci, the total number of sites used in this article is larger than that in CHEN and LI 2001 Down(Table 1), because some sites had alignment gaps in the orangutan and were removed by Chen and Li.

The three-species data are analyzed by the maximum-likelihood (ML) method of this article. The estimates of parameters are given in Table 2. If we assume a generation time of 20 years and a mutation rate of 10-9 substitutions per site per year (e.g., NACHMAN and CROWELL 2000 Down), the estimates suggest a population size for the ancestor of humans and chimpanzees of ~12,000. This is several times smaller than the estimates of CHEN and LI 2001 Down from the same data, at a minimum of 52,000. The estimate is also similar to estimates of the population size of modern humans, for example, 12,000 by YU et al. 2001 Down. The population size for the common ancestor of all three species is estimated to be ~38,000. The same analysis estimated the human-chimpanzee divergence time at 5.1 million years ago (MYA) and the gorilla divergence at ~1.1 million years (MY) earlier. Those estimates are largely consistent with those of previous studies (e.g., HASEGAWA et al. 1985 Down; RUVOLO 1997 Down; KUMAR and HEDGES 1998 Down; YODER and YANG 2000 Down). Fig 2A shows the likelihood surface as a function of {theta}0 and {theta}1 when {gamma}0 and {gamma}1 are fixed at their maximum-likelihood estimates (MLEs). The ~95% confidence region is given by the likelihood contour at 3.32 units below the optimum, that is, at -3102.73 (Fig 2A). The sampling errors are quite large. Analysis of the human and chimpanzee sequences at the 53 loci using the ML method of TAKAHATA et al. 1995 Down under the infinite-sites model leads to 1 = 0.0017 and 1 = 0.0055 (N. TAKAHATA, personal communication). Those estimates are similar to the MLEs of Table 2.



View larger version (50K):
In this window
In a new window
Download PPT slide
 
Figure 2. Log-likelihood surface (contour) as a function of {theta}0 and {theta}1 when {gamma}0 and {gamma}1 are fixed at their MLEs. (a) The same substitution (mutation) rate is assumed for all loci. (b) Fixed relative rates obtained from comparison between the orangutan and the African apes (human, chimpanzee, and gorilla) are used to account for possible evolutionary rate variation among loci. Maximum-likelihood estimates of parameters are listed in Table 2.


 
View this table:
In this window
In a new window

 
Table 2. Maximum-likelihood estimates of parameters

To examine the effect of mutation rate variation among loci, I calculated the average sequence distance under the model of JUKES and CANTOR 1969 Down from the human, chimpanzee, and gorilla to the orangutan, that is, dHCG-O = (dH-O + dC-O + dG-O)/3. This is divided by the average across all loci to give the relative rate for that locus (Table 1). The average distance from the orangutan to the African apes is found to be 0.0296; this is consistent with a mutation rate of 10-9 substitutions per site per year and an orangutan divergence date of ~13 MYA (e.g., HASEGAWA et al. 1987 Down) since 0.0296/(2 x 13 x 106) = 1.1387 x 10-9. The relative rates for loci calculated this way are used as fixed constants in the likelihood calculation for the data of three species. The MLEs of parameters are shown in Table 2. Using the same generation time and mutation rate as above, we get the estimate of the population size for the ancestor of humans and chimpanzees to be 21,000, larger than the estimate under the assumption of a constant rate for all loci. The population size for the ancestor of all three species is estimated to be 29,000, smaller than under the constant-rate model. The differences between the two analyses are somewhat surprising, as one might expect the population sizes to be smaller when rate variation among loci is accounted for. However, it is noted that the distance from orangutan to the African apes has only a weak correlation (0.44) with the average distance within the H-C-G group, which appears to suggest that the mutation rates are rather homogeneous among loci and that the conflict among loci in sequence divergence is mainly caused by ancestral polymorphism.

If the average distances within the H-C-G group, (dHC + dCG + dGH)/3, are used as relative rates for the loci, parameter estimates become 0 = 0.0000014, 1 = 0.002902, 0 = 0.003229, and 1 = 0.004555, with {ell} = -3069.73. Those correspond to a population size of 36,000 for the ancestor of humans and chimpanzees, a population size of only 18 for the ancestor of all three species, 4.5 MY for the H-C divergence, and 7.7 MY for the gorilla divergence. This calculation effectively attributes all variation in sequence divergence among loci to mutation rate variation and causes underestimation of {theta}0 and {gamma}1 and overestimation of {theta}1 and {gamma}0 (see also Table 2).

Comparison with the tree-mismatch method:
When the gene tree is T2 or T3 (Fig 1), there is a mismatch between the species tree and the gene tree. This occurs with probability PSG = f(T2) + f(T3) = 2(1 - {psi})/3 = 2/3e-2{gamma}0/{theta}1 (e.g., NEI 1987 Down, pp. 288–289). The tree-mismatch method estimates {theta}1 by equating this probability to the proportion () of loci at which the estimated gene tree differs from the species tree, with {gamma}0 being assumed known; that is, 1 = -2{gamma}0/log{3/2}. CHEN and LI 2001 Down used the orangutan to root the H-C-G tree and were able to resolve the gene tree at 33 loci, out of which 9 mismatches were found, at a proportion of 27.3%. Several coding loci were included as well, so that 16 mismatches were found at a total of 52 resolved loci, at the proportion 30.8%. The authors assumed an H-C divergence at {gamma}1 = 1.6 MYA and arrived at 1 = 0.00414, which, if the generation time is 20 years, corresponds to a minimum population size of 1 = 52,000 for the ancestor of humans and chimpanzees. In this article, the molecular clock has been assumed, which can also be used to root the H-C-G tree. Under the clock, T1, T2, or T3 is the ML tree if ni1, ni2, or ni3 is the greatest among the three, respectively (SAITOU 1988 Down; YANG 1994 Down). The clock-rooting approach uses more "informative" sites than out-group rooting and resolves the gene tree at 49 of the 53 loci, out of which 18 are mismatches, at the proportion 36.7%. This proportion is even higher than those of Chen and Li and produces even larger estimates of {theta}1 and N1.

To understand the difference between the tree-mismatch method and ML, note that three aspects of the data are ignored by the tree-mismatch method and accounted for by ML: (i) uncertainty in the estimated gene tree due to the finite number of nucleotide sites at the locus, (ii) unresolved loci (ties), and (iii) branch lengths in the gene tree reflected in the sequence divergences. While all three probably contribute to the large differences discussed above, uncertainty in the estimated gene tree seems to be the predominant reason. A more "proper" tree-mismatch method should equate the observed proportion of mismatches () not to PSG but to PSE, the probability of a mismatch between the species tree and the estimated gene tree. This probability is given by

(12)

where f is the probability of data Di = {ni0, ni1, ni2, ni3, ni4} in Equation 8, and the summation is over all data configurations in which the ML tree for the locus is either T2 or T3 (Fig 1). Unlike PSG, PSE is dependent on all four parameters, {theta}0, {theta}1, {gamma}0, and {gamma}1, as well as the sequence length ni and appears no easier to calculate than the full likelihood (Equation 9). Instead I use Monte Carlo simulation to calculate those probabilities to assess the impact of errors in gene tree reconstruction on the difference between PSG and PSE. The MLEs of parameters in Table 2 ("constant rate") are used to generate gene trees, which are used to "evolve" sequences. The sites are counted to obtain the data Di = {ni0, ni1, ni2, ni3, ni4}, which are used to estimate the gene tree by ML.

Fig 3 shows the probabilities that the species tree (S), the gene tree (G), and the estimated gene tree (E) differ from each other. The probability of a mismatch between the species tree and the gene tree is PSG = 2(1 - {psi})/3 = 0.0739, much lower than the observed mismatch proportion = 0.367. The probability of a mismatch between the species tree and the estimated gene tree is higher. With 466 sites (the average across the 53 loci, Table 1), PSE = 0.2028, which is 2.7 times as high as PSG (Fig 3). Now consider the four gene trees T0, T1, T2, and T3 (Fig 1), which occur with probabilities f(T0) = {psi} = 0.8892 and f(T1) = f(T2) = f(T3) = (1 - {psi})/3 = 0.0369 (Equation 1). According to the simulation, the probability that the topology of the gene tree is incorrectly reconstructed when the true gene tree is T0, T1, T2, or T3 is P0 = 0.1453 and P1 = P2 = P3 = 0.1981. Note that for the estimated gene tree, we consider only the topology and disregard its divergence times relative to the speciation times. The average error probability of gene tree reconstruction is thus PGE = {sum}3i=0f(Ti)Pi = 0.8892 x 0.1453 + 3 x 0.0369 x 0.1981 = 0.1512 (see Fig 3). When gene trees T0 or T1 are incorrectly reconstructed, the estimated gene tree will always be a mismatch with the species tree; such errors will cause an overcount of f(T0)P0 + f(T1)P1 = 0.1366. Conversely, when gene trees T2 or T3 are incorrectly reconstructed, the estimated tree will not be counted as a mismatch one-half of the time, so the undercount is f(T2)P2/2 + f(T3)P3/2 = f(T2)P2 = 0.0074. The difference between those two error rates gives rise to the net error due to gene tree reconstruction: PSE - PSG = f(T0)P0 = {psi}P0 = 0.1290. The above argument suggests that ignoring errors in gene tree reconstruction always causes overestimation of the mismatch between the species tree and the gene tree and leads to overestimation of the ancestral population size N1. It is interesting that the bias in the tree-mismatch method is caused by reconstruction errors for gene tree T0 alone and thus can be reduced if {psi} is reduced, for example, if the two speciation events are very close or if the ancestral population size N1 is large. Obviously factors that reduce the reconstruction error P0, such as longer sequences (Fig 3) and higher mutation rates, will reduce the bias as well.



View larger version (15K):
In this window
In a new window
Download PPT slide
 
Figure 3. Tree-mismatch probabilities calculated using Monte Carlo simulation plotted as functions of the sequence length. MLEs of parameters in Table 2 (one rate for all loci) are used in the simulation. Note that S, G, and E in the subscripts stand for the species tree, the gene tree, and the estimated gene tree (the ML tree), respectively. Thus PSG is the probability that the species tree and the gene tree differ; this is 0.0739 for the parameter values used. PSE is the probability of a mismatch between the species tree and the estimated gene tree, and PGE is the probability of a mismatch between the gene tree and the estimated gene tree. Ptie is the proportion of replicates in which a tie occurs, that is, the two best trees are equally good. Ties are excluded in calculation of PSE and PGE. Ten million replicates were simulated for each sequence length.


*  THE BAYES APPROACH USING MCMC
*TOP
*ABSTRACT
*MAXIMUM-LIKELIHOOD ESTIMATION
*THE BAYES APPROACH USING...
*DISCUSSION
*APPENDIX
*LITERATURE CITED

A Bayes approach is implemented under the same model, using MCMC. As parameters {Theta} = {{theta}0, {theta}1, {gamma}0, {gamma}1} are all positive, I use independent gamma distributions as the prior. The gamma density is

(13)

with mean {alpha}/ß and variance {alpha}2. The hyperparameters {alpha} and ß are chosen to reflect the range and likely values of the parameters.

Instead of the coalescent times t0 and t1, which have different definitions in different gene trees (Fig 1), branch lengths b0 and b1 in the gene tree are used in the MCMC. The joint prior distributions of the gene tree T0 and its branch lengths b0 and b1 given {Theta} can be easily derived from the distributions of the coalescent times t0 and t1 (Equation 2),

(14)

for {gamma}1 < b1 < {gamma}1 + {gamma}0 and {gamma}1 + {gamma}0 - b1 < b0 < {infty}.

Similarly, the joint prior distribution of gene tree Tk, k = 1, 2, 3, and its branch lengths b0 and b1 given {Theta} is

(15)

with 0 < b0 < {infty} and {gamma}1 + {gamma}0 < b1 < {infty}.

The variables to be updated in the Markov chain include the parameters {Theta} = {{theta}0, {theta}1, {gamma}0, {gamma}1} and the gene trees and branch lengths at all L loci, G = {Ti, bi0, bi1}, i = 1, 2, ... , L. The Markov chain is constructed so that its steady-state distribution is the posterior distribution of those variables. Bayes inference is then based on the joint posterior distribution

(16)

The denominator f(D) is the marginal probability of the data

(17)

where the integral over G represents summation over the four gene trees (T0, T1, T2, T3 in Fig 1) and integration over branch lengths in each tree. The posterior distribution of any parameter is then given by integrating over the joint posterior distribution. For example,

(18)

A Metropolis-Hastings algorithm (METROPOLIS et al. 1953 Down; HASTING 1970 Down) is used to update variables in the MCMC. Given the current state of the chain ({Theta}, G), a new state ({Theta}*, G*) is proposed through a proposal distribution q({Theta}*, G*|{Theta}, G). The new state is then accepted with probability

(19)

If the new state is accepted, the chain moves to the new state ({Theta}*, G*). Otherwise the chain stays in the old state ({Theta}, G). Note that f(D) in Equation 16 cancels in calculation of the acceptance ratio R. Calculation of f({Theta}*, G*|D)/f({Theta}, G|D) is straightforward due to the conditional independence in the model as described above. So the focus here is the proposal mechanism and the proposal ratio q({Theta}, G|{Theta}*, G*)/q({Theta}*, G*|{Theta}, G).

The proposal density q can be rather flexible as long as it specifies an aperiodic and irreducible Markov chain. The algorithm I implemented cycles through several steps, with each step updating some but not all variables. In step 1, the gene tree and branch lengths at each locus i (Ti, bi0, bi1) are updated, while parameters {Theta} are fixed. Each locus is updated once in this step. Step 2 updates parameters {Theta} while the branch lengths {bi0, bi1} are fixed. This step can cause changes to the gene trees at some loci. Step 3 is a mixing step, in which parameters {theta}0, {theta}1, {gamma}0, {gamma}1 and branch lengths at all loci are multiplied by a constant while the gene trees remain unchanged. The MCMC algorithm is tedious and the details are given in the Appendix.

The Markov chain is started from a random initial state. Sampling starts after a certain number of generations, which are discarded as burn-in, and samples are taken every certain number of steps, thus "thinning" the chain. Convergence of the chain is checked by examining the output and also by running multiple chains. The algorithm is also run without sequence data, and the posterior distribution generated is found to be close to the prior.

Application to the data of Chen and Li:
The Bayes MCMC algorithm is applied to the data of CHEN and LI 2001 Down(see Table 1). I used two sets of priors (Table 3). Parameters {alpha} and ß in the gamma prior distributions are chosen by considering likely values and ranges of ancestral population sizes and species divergence times and converting them into parameters {theta}0, {theta}1, {gamma}0, and {gamma}1 using a generation time of 20 years and a mutation rate of 10-9 substitutions per site per year. The first set is considered more realistic and referred to as the "good priors" in Table 3. Ancestral population sizes N0 and N1 are centered ~12,500, close to estimates for modern humans, with the 2.5 and 97.5% percentiles at 1500 and 34,800, respectively. The divergence time for humans and chimpanzees is centered ~5 MY, while the divergence time for the gorilla is centered ~1.6 MY. Note that parameters {theta}0, {theta}1, {gamma}0, and {gamma}1 are all <<1 but are definitely >0; thus values of {alpha} >1 are used so that the gamma distribution has a mode >0.


 
View this table:
In this window
In a new window

 
Table 3. Prior and posterior distributions of parameters in the Bayes analysis

The posterior distributions of parameters {theta}0, {theta}1, {gamma}0, and {gamma}1 are shown in Fig 4 together with their priors. They are also summarized in Table 3 (good priors). The means of the posterior distributions for {theta}0, {theta}1, {gamma}0, and {gamma}1 are listed, and then the means and the 95% credibility sets for the two population sizes (N0 and N1) and for the two speciation times are listed. The posterior means and medians are close. The population size for the ancestor of humans and chimpanzees is estimated to be 12,400, with the 95% credibility interval (CI) to be (1700, 32,100). The H-C divergence is dated at 5.3 MY, with the 95% CI to be (4.4, 6.1). The estimates of {theta}1 and {gamma}1 are very similar to the MLEs. The posterior mean of {theta}0 is smaller and that of {gamma}0 is larger than the MLEs (Table 2 and Table 3). The correlation coefficients calculated from the posterior distributions of parameters {theta}0, {theta}1, {gamma}0, and {gamma}1 are shown in Table 4. There is strong negative correlation between {theta}0 and {gamma}0 and between {theta}1 and {gamma}1. Comparison of the prior and posterior distributions (Fig 4) suggests that the data contain much more information about the divergence times, especially the H-C divergence time ({gamma}1), than about the population sizes.



View larger version (26K):
In this window
In a new window
Download PPT slide
 
Figure 4. Prior and posterior distributions for parameters {theta}0, {theta}1, {gamma}0, and {gamma}1. Parameter estimates are shown in Table 3 (good priors).


 
View this table:
In this window
In a new window

 
Table 4. Correlation coefficients in the posterior distribution

To see the effect of prior assumptions on the posterior distributions, I used a second set of priors, which are more spread out and also assume large population sizes (mean 62,500). The posterior distributions are summarized in Table 3 under the heading "Poor priors." The point estimates of both N0 and N1 are ~33,000, smaller than the prior means. The H-C divergence is dated at 4.6 MY, and the gorilla divergence is dated 1.9 MY earlier. Those estimates appear reasonable, although the H-C divergence date is too recent. Similar strong correlations between the parameters are discovered as in the analysis using the good priors. The negative correlation between {gamma}1 and {theta}1 (calculated to be -0.76), combined with the assumed and estimated large population sizes, appears to have led to a H-C divergence date that is too recent.


*  DISCUSSION
*TOP
*ABSTRACT
*MAXIMUM-LIKELIHOOD ESTIMATION
*THE BAYES APPROACH USING...
*DISCUSSION
*APPENDIX
*LITERATURE CITED

ML and Bayes methods of this article estimated the population size for the common ancestor of humans and chimpanzees to be ~12,000, similar to estimates for modern humans. The estimates are several times smaller than those obtained by CHEN and LI 2001 Down from the same data using the tree-mismatch method, which range from 52,000 to 150,000. Even the worst-case estimates—e.g., 36,000 by ML under the assumption that all sequence divergence variation among loci is due to mutation rate variation and 33,000 from the Bayes analysis using the poor priors—are smaller than the minimum estimate of Chen and Li. The tree-mismatch method used by Chen and Li appears to have serious biases due to errors in gene tree reconstruction, and the likelihood and Bayes estimates reported here are probably more reliable. Thus it may be concluded that the sequence data of CHEN and LI 2001 Down do not support much larger ancestral populations than the modern humans or the notion that early human populations experienced dramatic size reductions (HACIA 2001 Down; KAESSMANN et al. 2001 Down).

While the ML and Bayes methods are expected to have better statistical properties than the simple tree-mismatch method, it is worthwhile to examine some of the assumptions made in this article. First, the evolutionary rate is assumed to be constant over lineages. This assumption seems reasonable as the species compared are very closely related; CHEN and LI's (2001) relative-rate tests supported the molecular clock. The large differences between the tree-mismatch method and the likelihood and Bayes methods are clearly not due to the use of the clock assumption in this article; use of clock rooting in the tree-mismatch method produced even larger estimates of the population size for the ancestor of humans and chimpanzees. Second, the analysis assumes no recombination within a locus. The effect of recombination on estimation of parameters {theta}0, {theta}1, {gamma}0, and {gamma}1 is not well understood, although SATTA et al. 2000 Down emphasized its possible significance. As the human, chimpanzee, and gorilla sequences are extremely similar, most of the recombination events will not be visible in the sequence data, and the few sites at which more than two nucleotides are observed in the data (see counts ni4 for site pattern xyz in Table 1) are probably due to multiple substitutions at the same site. Third, the substitution model of JUKES and CANTOR 1969 Down is simplistic. More complex models, such as those that account for variable substitution rates among sites within the locus, can be easily implemented, but are expected to have little effect. The most serious issue seems to be mutation rate variation among loci. In the case of two species, the ancestral population size is overestimated when mutation rate variation is ignored and accounting for the bias leads to dramatic reduction in the estimated population size (YANG 1997A Down). In this article, the population size of the ancestor of humans and chimpanzees is not very large under the constant-rate model and becomes larger when variable rates for loci are assumed. The effect is much less important and also in the opposite direction compared with the two-species case. Lack of strong correlation among sequence distances with the orangutan seems to suggest that the rates are relatively homogeneous among those loci. It seems that simultaneous analysis of data from three species allows the parameters to constrain each other, leading to a better use of information in the data. It is quite likely that the estimation can be further improved by sampling multiple individuals from the same species.

The ML and Bayes methods produced similar results for the data analyzed in this article. However, the ML calculation is slower than the MCMC algorithm. The Bayes approach also provides a framework for incorporating prior information about the parameters. For example, a wealth of information is available about the divergence time between humans and chimpanzees. By forcing a very narrow prior distribution for {gamma}1, such information can be incorporated in the Bayes analysis. Using an informative prior will reduce the adverse effect of strong correlation among parameters when other parameters are estimated. Furthermore, the Bayes algorithm seems easier than ML to extend to data that contain more than three species and more than one individual from each species.

Program availability:
C programs implementing the MCMC algorithm and calculating the mismatch probabilities (PSG, PSE, and PGE, etc.) are available from the author upon request. The C and Mathematica programs for the likelihood method are available as well, but they make use of the Mathlink library and are awkward to use.


*  ACKNOWLEDGMENTS

I am very grateful to Drs. W.-H. Li and F.-C. Chen for providing the data analyzed in this article. I thank M. Hasegawa and B. Larget for discussions and N. Takahata for comments. This study is supported by grant 31/G13580 from the Biotechnology and Biological Sciences Research Council (United Kingdom).

Manuscript received April 18, 2002; Accepted for publication September 6, 2002.


*  APPENDIX
*TOP
*ABSTRACT
*MAXIMUM-LIKELIHOOD ESTIMATION
*THE BAYES APPROACH USING...
*DISCUSSION
*APPENDIX
*LITERATURE CITED

IMPLEMENTATION OF THE MCMC ALGORITHM
Updating the gene tree and branch lengths at each locus:
This step of the algorithm goes through the L loci to update the branch lengths and possibly the gene tree as well. The algorithm visits each locus once. The description in this section concerns one locus i. The subscript i in bi0 and bi1 may be suppressed. The proposal algorithm for gene trees T2 and T3 is simpler and is described first, before the algorithm for gene trees T0 and T1 (Fig 1).

If the current gene tree is either T2 or T3, only the branch lengths are updated and the gene tree is not changed. Note that b0 and b1 have to satisfy the requirements 0 < b0 < {infty} and {gamma}0 + {gamma}1 < b1 < {infty}. A sliding window is used to propose new branch lengths,

(A1)

where U(a, b) is the uniform distribution in the interval (a, b). The window size is set at w = H1, with H1 a small fine-tuning parameter. If the proposed value is outside the range of the variables, the excess is reflected back into the range; that is, if b*0 < 0, then b*0 is reset to -b*0, and if b*1 < {gamma}0 + {gamma}1, then b*1 is reset to 2({gamma}0 + {gamma}1) - b*1. The proposal ratio is 1. The acceptance ratio is

(A2)

If the current gene tree is either T0 or T1 (Fig 1B and Fig C), a change between those two trees is allowed when the branch lengths are updated. Both the current and the new branch lengths should satisfy the constraints {gamma}1 < b1 < {infty} and {gamma}0 + {gamma}1 < b0 + b1 < {infty}. To propose new branch lengths under those constraints, I apply the transformation

(A3)

with y0 > 0 and 0 < y1 < 1. Note that y0 is the distance between the root of the gene tree (node A in Fig 1) and the first speciation event (C in Fig 1), and changing y0 will shrink or expand the height of the gene tree, while y1 is the ratio of distance BD to AD, and changing y1 will slide node B between A and D (Fig 1B and Fig C).

New states are proposed for y0 and y1 as

(A4)

where r is a random variable from U(0, 1). If y*1 is out of the range (0, 1), it is reflected back into the range. The new branch lengths b*0 and b*1 are calculated from the relationships

(A5)

If b*1 > {gamma}0 + {gamma}1, the new gene tree T*i is set to T1; otherwise it is set to T0. This change of gene tree does not change the proposal ratio. The proposal ratio for variables y0 and y1 is y*0/y0. The proposal ratio for the original variables b0 and b1 can be derived by noting

Thus the acceptance ratio is

(A6)

If the gene tree is T1, T2, or T3, the algorithm may (with a small probability of, say, 0.2 or 0.3) attempt to swap the gene trees. The current gene tree is replaced by one of the other two trees chosen at random. The proposal ratio is 1, and the acceptance ratio is

(A7)

Updating population size and speciation date parameters:
This step makes two proposals: the first to change {theta}0 and {theta}1 and the second to change {gamma}0 and {gamma}1. Parameters {theta}0 and {theta}1 are positive but are not constrained otherwise. They are updated simultaneously, with all other variables fixed. New values are proposed around the current values by

(A8)

where r0 and r1 are uniform random numbers in the interval (0, 1) and H2 is a small fine-tuning parameter. The proposal ratio is {theta}*0{theta}*1/({theta}0{theta}1) and the acceptance ratio is

(A9)

Next, speciation date parameters {gamma}0 and {gamma}1 are updated while {theta}0 and {theta}1 as well as branch lengths bi0 and bi1 for all loci are fixed. At each locus the following constraints have to be satisfied: 0 < {gamma}1 < b1 < {gamma}0 + {gamma}1 < b0 + b1 < {infty} in gene tree T0 and 0 < b0 < {infty} and {gamma}0 + {gamma}1 < b1 < {infty} in gene tree T1, T2, or T3 (Fig 1). To allow the chain to move more freely, the gene tree is allowed to change, if necessary, from T0 to T1, T2, or T3 or vice versa. Thus only the following constraints have to be satisfied when new values are proposed for {gamma}0 and {gamma}1: 0 < {gamma}1 < b1 and {gamma}0 + {gamma}1 < b0 + b1 at every locus; that is, {gamma}1 < min{bi1} = c and {gamma}1 < {gamma}0 + {gamma}1 < min{bi0 + bi1} = d. The following transformation is used to propose new states,

(A10)

with constraints 0 < y0 < 1 and 0 < y1 < c. Note that y0 is the ratio of the distance CD to AD in Fig 1, and changing y0 will slide the speciation date C (Fig 1) between A and D. Note that {gamma}0 = y0(d - {gamma}1), {gamma}1 = y1, and

Sliding windows are used to propose new values,

(A11)

where H3 is a small fine-tuning parameter. If the new values y*0 and y*1 are out of the range, they are reflected back into the range. The proposal ratio for the transformed variables y0 and y1 is 1. The proposal ratio for variables {gamma}0 and {gamma}1 is (d - {gamma}*1)/(d - {gamma}1). Next, all loci are scanned to see whether the gene tree needs updating. If the current gene tree is T0 but {gamma}*0 + {gamma}*1 < bi1, the gene tree T*i is set to be one of T1, T2, or T3, chosen at random. The proposal ratio will be multiplied by 3 since there are three trees to move to and only one tree to move back. If the current tree is T1, T2, or T3 but {gamma}*0 + {gamma}*1 > bi1, the gene tree is set to T*i = T0, and the proposal ratio is divided by 3. Otherwise the gene tree for the locus remains unchanged: T*i = Ti. In sum the acceptance ratio is

(A12)

where f({gamma}*0, {gamma}*1)/f({gamma}0, {gamma}1) = ({gamma}*0{gamma}*1/{gamma}0{gamma}1){alpha}-1e-ß({gamma}*0-{gamma}0+{gamma}*1-{gamma}1) from the gamma priors and cT is the proposal ratio due to changes to gene trees at some loci (a product of threes and one-thirds).

Mixing step:
A mixing step is found to be effective in speeding up convergence when a poor starting point is chosen for the chain. In this step, the gene trees remain unchanged, but parameters {theta}0, {theta}1, {gamma}0, and {gamma}1 and branch lengths bi0 and bi1 for all loci are multiplied by a constant

(A13)

where r is a random number from U(0, 1) and H4 is a small fine-tuning parameter. The proposal ratio is c4+2L. The acceptance ratio is

where f({Theta}*)/f({Theta}) = c4({alpha}-1)e({theta}*0-{theta}0+{theta}*1-{theta}1+{gamma}*0-{gamma}0+{gamma}*1-{gamma}1),

given by the gamma prior distributions.

Performance of the algorithm:
The performance of the MCMC algorithm is noted to depend on the choice of priors (values of {alpha} and ß in the gamma distributions). For some priors, the algorithm is noted not to mix well, and in particular, parameters {theta}0 and {gamma}0 appear to change slowly. The high correlation among parameters and the constraints seem to cause difficulties for the algorithm. In such cases, the chain has to be run much longer than usual to achieve stable estimates. In proposing new values for {gamma}0 and {gamma}1, only small steps are taken (with H3 in the range 0.01–0.05) to achieve an acceptance rate of ~0.1–0.3. For other variables, even large steps (with H1, H2, H4 in the range 0.1–0.5) are accepted at high frequencies (>50%). The mixing step seems rather effective so that ~1000 generations seem enough for the burn-in. For some priors, <500,000 generations appear sufficient, which takes a few minutes on a Pentium III PC at 1.2 GHz for the data of CHEN and LI 2001 Down. For other priors, the algorithm has to be run much longer.


*  LITERATURE CITED
*TOP
*ABSTRACT
*MAXIMUM-LIKELIHOOD ESTIMATION
*THE BAYES APPROACH USING...
*DISCUSSION
*APPENDIX
*LITERATURE CITED

CHEN, F.-C. and W.-H. LI, 2001  Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am. J. Hum. Genet. 68:444-456.[Medline]

EDWARDS, S. V. and P. BEERLI, 2000  Perspective: gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies. Evolution 54:1839-1854.[Medline]

FELSENSTEIN, J., 1981  Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.[Medline]

FU, Y.-X., 1994  A phylogenetic estimator of effective population size or mutation rate. Genetics 136:685-692.[Abstract]

HACIA, J. G., 2001  Genome of the apes. Trends Genet. 17:637-645.[Medline]

HASEGAWA, M., H. KISHINO, and T. YANO, 1985  Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160-174.[Medline]

HASEGAWA, M., H. KISHINO, and T. YANO, 1987  Man's place in Hominoidea as inferred from molecular clocks of DNA. J. Mol. Evol. 26:132-147.[Medline]

HASTING, W. K., 1970  Monte Carlo sampling methods using Markov chains and their application. Biometrika 57:97-109.[Abstract/Free Full Text]

HUDSON, R. R., 1992  Gene trees, species trees and the segregation of ancestral alleles. Genetics 131:509-513.[Medline]