- THIS ARTICLE
-
Abstract
- Full Text (PDF)
-
All Versions of this Article:
genetics.108.087049v1
179/2/951 most recent - Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Vasco, D. A.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Vasco, D. A.
Originally published as Genetics Published Articles Ahead of Print on May 27, 2008.
Genetics, Vol. 179, 951-963, June 2008, Copyright © 2008
doi:10.1534/genetics.108.087049
A Fast and Reliable Computational Method for Estimating Population Genetic Parameters
Daniel A. Vasco1
Odum School of Ecology, University of Georgia, Athens, Georgia 30602
1 Address for correspondence: Animal Genomics Group, S162 ASRC, Division of Animal Sciences, University of Missouri, 920 E. Campus Dr., Columbia, MO 65211-5300.
E-mail: vasco.daniel.a{at}gmail.com
>ABSTRACT
GENOMIC DATA, COALESCENTS, AND...
STATISTICAL MODEL
STATISTICS AND COMPUTATION-BASED...
RESULTS
DISCUSSION
ACKNOWLEDGEMENTS
LITERATURE CITED
The estimation of ancestral and current effective population sizes in expanding populations is a fundamental problem in population genetics. Recently it has become possible to scan entire genomes of several individuals within a population. These genomic data sets can be used to estimate basic population parameters such as the effective population size and population growth rate. Full-data-likelihood methods potentially offer a powerful statistical framework for inferring population genetic parameters. However, for large data sets, computationally intensive methods based upon full-likelihood estimates may encounter difficulties. First, the computational method may be prohibitively slow or difficult to implement for large data. Second, estimation bias may markedly affect the accuracy and reliability of parameter estimates, as suggested from past work on coalescent methods. To address these problems, a fast and computationally efficient least-squares method for estimating population parameters from genomic data is presented here. Instead of modeling genomic data using a full likelihood, this new approach uses an analogous function, in which the full data are replaced with a vector of summary statistics. Furthermore, these least-squares estimators may show significantly less estimation bias for growth rate and genetic diversity than a corresponding maximum-likelihood estimator for the same coalescent process. The least-squares statistics also scale up to genome-sized data sets with many nucleotides and loci. These results demonstrate that least-squares statistics will likely prove useful for nonlinear parameter estimation when the underlying population genomic processes have complex evolutionary dynamics involving interactions between mutation, selection, demography, and recombination.
THE estimation of ancestral and current effective population sizes in expanding populations is central to understanding the genetics of natural populations (CRANDALL et al. 1999). It is now possible to scan entire genomes of several individuals within a population (NIELSEN et al. 2005; SCHAFFNER et al. 2005; MCVEAN and SPENCER 2006). In this article I present a fast and reliable statistical method for estimating population parameters such as effective population size and growth rate using genomic data. Although the method is applicable to more complex population and selection models, I focus in this article on illustrating the method in a model of exponential population growth.
The problem of determining the parameters of a demographic expansion is a fundamental problem in population genetics theory and coalescents (AVISE 2000). Several article have appeared addressing this problem, ranging from methods based upon summary statistics (ROGERS and HARPENDING 1993) to those using the full data in a sample such as maximum-likelihood (ML) estimators (GRIFFITHS and TAVARÉ 1994; KUHNER et al. 1998). However, for large data sets and complex models, computationally intensive methods may exhibit difficulties.
First, the methods may be prohibitively slow or difficult to implement on large data, especially when integrating the Felsenstein equation (STEPHENS 1999; HEY and NIELSEN 2007). This often involves analysis of the "mixing properties" of a complex Markov chain Monte Carlo (MCMC) algorithm—a technically difficult task (SISSON 2007). In previous work, VASCO et al. (2001) demonstrated the close relationship of summary-statistics-based phylogenetic estimation methods to earlier coalescent methods (WATTERSON 1975; TAJIMA 1983, 1989; FU 1995) as well as those based on full-likelihood approaches. They argued that instead of utilizing the full likelihood, an analogous function, determined by least-squares (LS) fitting of the data, may be computed in which the full data are replaced by a vector of summary statistics. These are still standard methods that are widely used for coalescent data analysis across the subdisciplines of evolutionary genetics (AVISE 2000; EMERSON et al. 2001; KNOWLES 2004; HICKERSON et al. 2006). However, important questions remain: When does the minimum of the LS function coincide with the maximum of the ML approach? How do the performance and reliability of the two methods compare over various samples sizes of nucleotide sequences? How does performance compare for fast vs. slow population expansions? As data sets become larger and more complex, these kinds of questions are likely to loom large in the near future.
A second and related problem involves diagnosing systematic patterns of estimation bias for likelihood modeling of nucleotide sequence data. Recent work utilizing MCMC methods has indicated that likelihoods for infinite-sites coalescent models may be quite complicated for changing population size (STEPHENS 1999; STEPHENS and DONNELLY 2000; WAKELEY et al. 2001; HEY and NIELSEN 2007). Similar observations have been made in likelihood modeling of microsatellite data (BEAUMONT 1999). These studies have suggested that there may be rugged topographies underlying the likelihood surface, rendering accurate estimation (or identifiability) of population parameters difficult or impossible in some regions. Recently, SISSON (2007) discussed the general problem of Bayesian computation with regard to intractable likelihoods for complex genetical data.
To address both of these problems I take the following approach. First, I present the coalescent model with a focus on its close relationship to summary genetic data in a sample (VASCO et al. 2001; HEIN et al. 2005; MARJORAM and TAVARÉ 2006). Next, I review two standard coalescent point estimation methods and from these discuss more computationally intensive, simulation-based estimation methods (HJORTH 1994; GOURIEROUX and MONFORT 1996; GIVENS and HOETING 2005). Simulation-based estimation is a natural extension of the computationally efficient LS coalescent point estimators developed in past work (FU 1994a,b; LI and FU 1994; DENG and FU 1996; VASCO et al. 2001). I then compare the computationally intensive statistics with estimates obtained using EVE (VASCO et al. 2001), which gives a LS estimate, and FLUCTUATE (KUHNER et al. 1998), which gives a maximum-likelihood estimate. Both of these programs give point estimates of the genetic diversity parameter
= 2Nµ (where N is the effective population size, and µ is the mutation rate) and the population growth rate given a sample of nucleotide sequences. Finally, I demonstrate that it is possible to obtain a nearly unbiased estimate of the population parameters by recursive use of the LS estimator. The estimator is applied to the African human mtDNA sample collected by VIGILANT et al. (1991).
ABSTRACT
>GENOMIC DATA, COALESCENTS, AND...
STATISTICAL MODEL
STATISTICS AND COMPUTATION-BASED...
RESULTS
DISCUSSION
ACKNOWLEDGEMENTS
LITERATURE CITED
Figure 1A shows how to go from the complete data, a sampled set of five DNA sequences, to a summary of the data: each large dot represents a mutation (neutral gene substitution) on one of the sequences shown in Figure 1C (shown as smaller dots). The pattern of polymorphism for the sample can be observed as the number of neutral nucleotide substitutions (KIMURA 1983) along a lineage of the tree, di, where each i corresponds to a branch in the coalescent tree. This quantity is useful as summary measure of the complete data (the set of sequences).
|
Next, consider the coalescent tree of the sample itself, without recombination and migration, so that the sampled sequences are connected by a single genealogy as shown in Figure 1B. In the Wright–Fisher model each generation is discrete and formed randomly by sampling N parents with replacement from the current generation. In general, for any coalescent tree, it is possible to label the time intervals, tn, for 2, ..., n sequences and these are referred to here as the nth coalescent times. These intervals are random variables representing the time duration (the number of generations) required for coalescence from n sequences to n – 1 sequences. Thus, in Figure 1B each of the five sequences can be traced back in time first to n – 1 ancestral sequences, next to n – 2 sequences, and so on until a single ancestral sequence remains (the most recent common ancestor, MRCA).
We have now covered the first two criteria for developing a statistical theory of genomic data: it has been demonstrated how the complete data (a sampled set of sequences) are connected to a coalescent tree. Before the role of stochastic simulation and statistical inference can be introduced, I first develop some bookkeeping rules that are useful for large coalescent-based phylogenies.
Each branch length of the tree can be represented as a simple sum of the coalescent times between ancestors. It is not difficult to show that, since the jump process of a coalescent tree is fixed (FU 1994a,b), the branch lengths of any coalescent tree can be computed as
![]() | (1) |
)T, where the superscript T signifies transposition, then any vector of branch lengths
, satisfying Equation 1, can be written in matrix form
![]() |
Consider next the effect of exponential population expansion on the coalescent times and branch lengths of the tree. Thus, N(t) = N0egt, where g is the growth rate (g > 0), t is the time since the initial generation, and N0 is the initial effective population size. In using a coalescent approach, it is useful to look backward in time at N(–t) and from this vantage it appears that the population exponentially declines in size as the population experiences coalescence toward its most recent common ancestor. In a coalescing population, currently at size N0, one unit of time corresponds to 2N0 generations. I now describe how to simulate coalescent trees under this model. The kth coalescent time, tk, in a genealogy corresponds to a time interval in which exactly k ancestors exist in the sample. Each tk is conditioned upon a time
k when there are more than k ancestors in the sample. The intervals between coalescent events can be efficiently estimated using the distribution
![]() |
n = 0 (GRIFFITHS and TAVARÉ 1994). The dependence of the time tk upon
k has an intuitive basis: it occurs because each period tk in a genealogy starts only when n sequences have coalesced to k sequences. For the cases of constant population size and exponential growth, the distribution for the tk can be determined exactly (see HUDSON 1990; SLATKIN and HUDSON 1991). The value tk (k = 2, ..., n) is an outcome of the coalescent with exponential growth g,
![]() | (2) |
and U is randomly drawn from the uniform distribution on the open interval (0, 1). This last result determines the final property required to describe genomic data (stated earlier in this section): by generating random numbers stochastic simulation may be used to study the complex stochastic processes by which coalescent trees are formed. The result can be combined with a model of DNA sequence evolution to simulate the evolution of sampled sequences on the tree itself as explained below. Before this is done, however, I show how to develop a statistical coalescent model that may be used to estimate population parameters.
Assume now that the coalescent for a sample of five sequences in an exponentially growing population has produced a coalescent tree with parameters (
, g). Although this genealogy is the product of a stochastic process, the branch lengths, li, and number of mutations on a branch, di, for a single genealogical history can be observed exactly for a simulation or accurately reconstructed for a set of sequences. Throughout this article the di's are the components of a data vector
, where the convention upon numbering and ordering is shown in Figure 1. I now show how to approximate the average branch lengths and number of mutations on each lineage of such a tree given its topology. For the moment I assume that the tree topology is known exactly. To compute the expected branch lengths, one simulates the coalescent time distribution using Equation 2 many times and averages the results. Each average coalescent time,
, is computed using
![]() | (3) |
![]() | (4) |
ABSTRACT
GENOMIC DATA, COALESCENTS, AND...
>STATISTICAL MODEL
STATISTICS AND COMPUTATION-BASED...
RESULTS
DISCUSSION
ACKNOWLEDGEMENTS
LITERATURE CITED
li conditional on li, where
= 2N0µ. All moments of di (including the mean and variance) are determined by the moments of ti. The expected number of mutations can be found using Equation 4 and is equal to
![]() | (5) |
is conditioned on the vector of branch lengths
by the Poisson distribution with expectation
![]() | (6) |
![]() | (7) |
is the diagonal matrix for which nonzero entries correspond to li with respect to a given di.
Given a fixed genealogy with J and E(T), Equations 4–7 can be used to define the following nonlinear mutation model,
![]() | (8) |
is the vector of expected branch lengths, and the model error term is
, with the ith component
.
Expectations for the coalescent model can be efficiently computed over the entire genealogy:
![]() | (9) |
for coalescent genealogies satisfying the mutation model can likewise be computed, but are not utilized in the present article. For each branch of a coalescent genealogy the data di can be computed using distance, parsimony, or any other method that gives an accurate estimate of the number of mutations on a branch (see below). These data are stored in the vector
. All other quantities such as
and E(JT) represent theoretical expectations that can be efficiently computed using Monte Carlo integration of sampled genealogies based upon Equations 3 and 4. ABSTRACT
GENOMIC DATA, COALESCENTS, AND...
STATISTICAL MODEL
>STATISTICS AND COMPUTATION-BASED...
RESULTS
DISCUSSION
ACKNOWLEDGEMENTS
LITERATURE CITED
using a LS and a ML method, how to use computational statistics and simulation to quantify the bias associated with a targeted estimate, and how to correct the bias of the targeted estimate.
Least-squares estimation:
Assume the exponential growth coalescent model with parameters
, g, N0, where the effective population size N0 is fixed at the time of sampling. Using (8) define the scalar function,
![]() | (10) |
, g) upon solution of the optimization problem
![]() |
and g. Application to multiple loci is straightforward: compute a function Vi(
, g) using the coalescent history for each ith locus, sum over all i, and then minimize the sum with respect to (
, g). This can be rapidly computed for large numbers of loci, for example. A computer program written in the C language, called Estimators for Variable Environments (EVE), was developed to perform these computations (VASCO et al. 2001). At the point g = 0, the LS minimum point determining
for the EVE algorithm is expected to be the same as that computed by Fu's UPBLUE recursive estimator for
when the effective population size is constant (FU 1994a). Thus, this statistic can be considered an extension of Fu's method to the case of variable population size.
Maximum-likelihood estimation:
For the problem of estimation of
and g from sequence data KUHNER et al. (1998) proposed the maximum-likelihood computation program FLUCTUATE. This program computes
using a transformation in which the coalescent structure of a genealogy becomes identical to the constant population expectation of the program COALESCE (KUHNER et al. 1995). The conversion between EVE's
and FLUCTUATE's
is made by multiplication of the FLUCTUATE
by number of sites in a sequence. Since EVE assumes that coalescent times are scaled by 2N0 generations, I rescaled FLUCTUATE growth rate estimates appropriately. In addition, growth rate is scaled in per mutation units of 1/µ. Note that to make this conversion and comparison with the unscaled EVE estimate of growth rate, essentially a priori information is required with regard to the constant population size expectation of
0 = 2N0µ. When computing genealogies, FLUCTUATE takes a step that is the construction of a single genealogy; a "chain" for a set of genealogies is formed to compute parameter estimates. I used those suggested in KUHNER et al. (1998) for a single locus: 10 short chains of 1000 steps followed by 2 long chains of 15,000 steps. For an initial guess of the
, WATTERSON's (1975) estimate was used. For an initial guess of g I used 10–5.
Computation-based estimation:
Statistical problems such as severe bias create difficulties for obtaining accurate estimates of parameters for a point estimate. Simulation-based estimation is a powerful way to diagnose some of these difficulties if one has access to a fast estimator. In this section I describe three computationally intensive estimators based upon jackknifing, bootstrapping, and simulation principles (EFRON and TIBSHIRANI 1993; SHAO and TU 1995; GOURIEROUX and MONFORT 1996) that are used investigate statistical properties of the point estimators EVE and FLUCTUATE.
Jackknife estimation of parameters:
QUENOUILLE (1956) proposed utilizing the jackknife to estimate the bias of an estimator by deleting one datum each time from the original data set and then recalculating the estimator on the basis of the rest of the data. The jackknife estimator of a parameter p (where p = g or
in this article), utilizes the subsample estimates of an estimator
to quantify the bias-reduction process,
![]() | (11) |
is the estimate computed applying a method like ML and LS to the whole sample and
is the jth subsample, respectively, with the ith sequence removed, this being denoted as the subscript –i. For subsampling of the set of sequences one can take the complete set
deleting a single sequence
at a time. Each
is then a statistic computed using 2n – 1 observations
with datum dr removed (which is uniquely determined when
is removed from the sample, so that r is equal to the subscript of the external branch corresponding to the removed sequence drawn from
). To compute the estimator stated by Equation 11 one iteratively solves the LS problem
![]() | (12) |
and Vj,–i. Since the removal of a sequence corresponds to the removal of an external branch, to find
it suffices to compute the sik matrix with 0's corresponding to coalescent time entries in the full sik matrix for the removed branch. Having determined
and
, one then performs the matrix multiplication given by Equation 12 to obtain Vj,–i. Minimizing Vj,–i with respect to (
, g) gives a jackknifed LS estimate for the sample. The subscript j bookkeeps the computation of the jackknifed LS statistic for each
. Summing the estimates with respect to j over all n subsamples then gives
in Equation 11.
Parametric bootstrap-known tree:
In a parametric bootstrap, samples (replicate sets of nucleotide sequences) are simulated from a specified value of g and
. The estimator then is fed the known tree topology (J), the known branch lengths
, and the known number of mutations on each branch
for a given MC replicate. From these data, which correspond to perfect knowledge of the phylogenetic tree of the coalescent, a LS estimate is computed for each replicate. The bootstrap expectation
, where
is the MC replicate. I computed the MC estimated standard deviation
as
.
Parametric bootstrap-unknown tree:
This case corresponds to that stated above except that for each MC replicate the tree topology is estimated using UPGMA (see, for example, SWOFFORD and OLSEN 1990). For each replicate, UPGMA is used to compute an approximation to the branch lengths of the tree, giving an estimate of the observed number of mutations between pairs of coalescing sequences. Statistics for the MC replicates are computed as stated in the last section, with the notational change
and
for the mean and the MC replicate, respectively.
Simulation of sampled sequences:
When implementing simulation studies I assumed a Jukes–Cantor model of DNA nucleotide sequence evolution (LI 1997). This assumes that all four nucleotides are equally frequent and all substitution types are equally likely in a sample. I assumed a sequence length of 1000 sites and a per locus mutation rate µ in simulations when the target parameters were known. When simulating the VIGILANT et al. (1991) data I assumed a sequence length of 610 sites. Under the infinite-sites model, the number of mutations separating a pair of sequences is simply the number of nucleotide differences between the pair of sequences (FU 1994a). In this case the number of substitutions between sequences i and j corresponds to the number of mutations separating two coalescing sequences dij (TAJIMA 1983). Once the distance matrix for a sample of sequences is computed the UPGMA method can be applied to compute the tree. When applied to an empirical set of nucleotide sequences one computes the genetic distance and applies UPGMA to the sample and then inputs this into EVE (see VASCO et al. 2001 for an example using HIV sequences).ABSTRACT
GENOMIC DATA, COALESCENTS, AND...
STATISTICAL MODEL
STATISTICS AND COMPUTATION-BASED...
>RESULTS
DISCUSSION
ACKNOWLEDGEMENTS
LITERATURE CITED
is known for pT. For this case all that one can hope for is that
is sufficiently close to pT since pT itself is unknown. Nature gives researchers a single stochastic realization when fitting their stochastic simulation models to genomic data. So an important statistical question is how to robustly characterize the statistical variation associated with that realization.
To investigate the second case I used an African human mtDNA sample drawn from VIGILANT et al. (1991) to compute an LS point estimate (the mtDNA control region with a sequence length of 610 bp and a sample size of n = 97). I then use simulation-based estimation to compute Monte Carlo replicates. This allows using the parametric bootstrap means to quantify bias. The bootstrap replicates also allow estimation of the variance associated with the biased mean. Finally I use the LS statistic to assess bias and correct for it. As noted by EFRON and TIBSHIRANI (1993), bias correction is a worthwhile pursuit but a dangerous one. Here I show it is possible to compute a nearly unbiased estimate of
. For the purpose of computational approaches to population genetics data analysis this result clearly demonstrates that the LS statistic has a peak at or near an estimated quantity (CABRERA and FERNHOLZ 1999). Elimination of troublesome sources of bias allows computing reliable distributions of statistical errors for the sample that can be utilized in subsequent statistical studies such as the development of power tests and model prediction and selection (HJORTH 1994).
Performance of the LS statistic:
Tables 1 and 2 show the performance of the LS method in estimating g and
for the case
= 40. In these tables the "size" of the table, in terms of increasing g, corresponds to the range of consistency for estimation of g. Thus, higher sample sizes generally gave better estimates of g. Table 2 shows that simultaneous estimation of
is nearly unbiased over a wide range of g—the reason being that
appears in Equation 8 linearly. It is also shown that a significant reduction in MC standard deviation of the estimators can be achieved by increasing sample size. In all cases, the MC standard deviation of the estimators decreases with increased numbers of sequences. There is a slight bias in estimation near g = 0.001 because of the singularity at g = 0 in Equation 2. As noted previously, however, one may use the UPBLUE method to estimate
by FU (1994a) at g = 0 (or at least its assumption of constant population size to compute the expected branch lengths in Equation 10).
|
|
Figure 2 graphically shows the effect of sample size on the performance of LS estimation. For higher substitution rate (
> 40), a significant range of growth rates can be consistently estimated. Better knowledge of the gene tree always allows expansion of the range of consistent estimation. Figure 2A shows that for a range of g from 0.001 to 6, consistent estimation of g was possible for the unknown tree. Lack of knowledge of the actual gene tree significantly affects the ability to consistently estimate g > 5 for
= 40. Figure 2B shows that the range of consistent estimation for g in the case
= 40 is in the range of g from 0.001 to 80 for the known gene tree. Since many rapid population expansions, such as for humans and viruses, may lie in this range it is important to correct the bias for any point estimates obtained that lie in this range. Below, I show how to address this problem using the jackknife bias-correction statistic. I then compare this statistic to those obtained using the more standard parametric bootstrap method.
|
Comparison of LS and ML statistics:
In this section, I compare the LS and ML statistics, utilizing the LS surface as a tool to diagnose performance. A ML surface should exhibit a maximum with respect to the minimum on the LS surface. Therefore, it is possible to investigate how the statistics behave by visually scanning the distribution of MC replicates around the LS minimum in (
, g) parameter space. I show that there are two scales in parameter space that determine the properties of the estimators. The first is determined by a "local" region surrounding a LS minimum point that determines EVE estimates. This is determined by the LS criterion, i.e., the minimization of Equation 10. In this region, one would expect the distribution of MC replicates to cluster around the minimum, with density highest over the target parameters. In fact, a second "global" region over (
, g) parameter space plays a role with respect to the LS surface. This happens because of the appearance of a large plateau in parameter space. In this section I show how this plateau affects performance of the ML vs. the LS estimator.
Figure 3C shows the distribution of Monte Carlo replicates for the target g = 5 and
= 60 at a sample size n = 10. Figure 3, D–F, shows the distribution of Monte Carlo replicates for the target g = 15 and
= 60 at three different sample sizes n = 10, 30, and 100. For both g = 5 and g = 15, the sample size of n = 10 shows a cloud of LS and ML replicates close to the targets (see Figure 3, C and D). Also several outliers appear for both the LS and the ML estimates. However, the LS outliers appear to track the valley in the LS surface, while the ML estimators tend to be dispersed on a steeply rising ridge in the LS surface as shown in Figure 3A. The height of these ridges requires blowing up the region around the targets to see contours of the valley (see Figure 3B). FLUCTUATE outperforms EVE for the case shown in Figure 3C. It has a MC mean closer to the target than EVE, with smaller standard deviation. However, for the cases shown in Figure 3, D–F, the presence of outliers shifts the mean MC value for
away from its target. This gets progressively worse as the sample size increases—see Figure 3, E (n = 30) and F (n = 100). In contrast to the severe estimation bias observed for
, FLUCTUATE estimation of g appears to be nearly unbiased. However, this conclusion must be accepted with the proviso that perfect knowledge of
0 = 2N0µ is available. Table 3 gives the MC means and standard deviation for the comparison between LS and ML statistics.
|
|
For larger sample sizes the distributions of the ML and LS estimates move further apart. Overall, these results imply that the ML estimator FLUCTUATE does not exhibit consistency at high growth rates and sample sizes when estimating
. However, at small growth rate and small sample size FLUCTUATE outperforms EVE.
Previous MCMC studies for infinite-sites models (WAKELEY et al. 2001) have shown that the likelihood surface for changing population size has a complex topography: the parameters become nonidentifiable in some regions and plateaus appear. BEAUMONT (1999) made similar observations with likelihood modeling of microsatellite data. The findings of this section lend support to this earlier work. Under rapid population expansion the majority of the mutations are in the terminal branches of the tree. In a population sample, this appears as a sequence with a common ancestral haplotype and with the remaining haplotypes appearing as singletons. Such a star-shaped coalescent tree gives rise to a difficult data set for an MCMC coalescent likelihood method. For these data the MCMC computation may generate the same star-shaped pattern for many choices of
and growth rate. Indeed, STEPHENS (1999) and STEPHENS and DONNELLY (2000) have previously predicted such computational difficulties for certain classes of MCMC methods. They noted that some MCMC algorithms, such as FLUCTUATE, use a fixed or "driving" value of
. This gives rise to the possibility that such algorithms find modes in an essentially flat surface or ridge where none are present. Figure 3A shows a steeply rising LS contour up to a long flat plateau. The FLUCTUATE MC replicates project onto the contours of such a plateau (see Figure 3F). The large pink dot centered at the cloud of replicates suggests that it may be at the center of a mode "found" by FLUCTUATE where clearly none is present.
Effect of recombination on the LS statistic:
SCHIERUP and HEIN (2000a) demonstrated that even small amounts of recombination can produce biases in tree reconstruction. These biases can affect phylogenetic inferences using reconstructed trees for nucleotide sequence data. For example, bias from recombination can cause trees to superficially resemble phylogenies for sequences from an exponentially growing population. They also found that recombination created a large overestimation of rate heterogeneity and that standard ML methods [the DNAml and DNAmlk programs of PHYLIP (FELSENSTEIN 1995)] could not infer the presence of a molecular clock from sequence data (SCHIERUP and HEIN 2000b). Given that the LS method uses the shape of a single phylogenetic tree to estimate parameters, I tested the LS estimator on several thousand simulated samples of nucleotide sequences exhibiting varying amounts of recombination. I used the coalescent algorithm with recombination described in HUDSON (1983), modified so as to include exponential population growth. Since the generation of the coalescence and recombination events are independent, I was able to use the approach described in this article to efficiently simulate the coalescent time distribution under recombination (FEARNHEAD and DONNELLY 2001). The mutation model used was the Jukes–Cantor model (LI 1997) and was applied to simulate nucleotide data sets under recombination as described in the article by SCHIERUP and HEIN (2000a).I examined the performance of estimation under two scenarios. First, if the growth rate is zero, how well will the LS estimator perform given varying rates of recombination? Second, in an exponentially growing population with high growth rate, how well does the LS estimator perform under varying rates of recombination?
The results for the first case are shown in Figure 4. Setting the growth rate to zero in the simulations allows generating a null distribution on the estimators to determine the range of bias created by recombination. For low to moderate rates of recombination (
= 0–15) the LS estimator performs reasonably well. One observes growth rates close to zero in each case. The range of estimation bias lies well within the range of that observed to be caused by intrinsic bias in the coalescent time distribution. For
> 15, however, the bias due to recombination gradually increases with the amount of recombination. Each point in Figure 4 was computed from the mean over 1000 replicated data sets.
|
For the second case I simulated samples of nucleotide sequences under moderate growth rate (g = 15) and varying rates of recombination
= 0–50. The results are shown in Figure 5. As in the previous case, in the region (
= 0–15) the LS estimator appears to perform reasonably well. In this region the estimates of g and
exhibit low bias, on the order of that observed in the absence of recombination. For
> 15, the bias caused by recombination starts to increase the estimation bias of the observed estimate of g, causing it to become more and more inflated. This bias appears to rise more rapidly in the presence of high growth than in the absence of growth (compare Figures 4 and 5). Each point in Figure 5 was computed from the mean over 1000 replicated data sets.
|
Inference on sample with parameters unknown:
LS statistics were computed for the African subset of the sampled human mtDNA sequenced by VIGILANT et al. (1991). This gave the LS point estimates
, which imply a significant recent population expansion (see Figure 6A). The distributions of the MC replicates for all three simulation-based estimators track the LS surface. This makes sense in light of the results shown earlier (see Tables 1 and 2, Figure 2) for the case when the target is known.
|
Consider next the quantification of estimation bias. Parametric bootstrap estimates of the point estimate were computed with the tree assumed unknown. The results for this are shown in Figure 6B. The mean is shown as the large dot at (
, g) = (197.82 ± 39.35, 39.70 ± 13.13), and the small blue dots correspond to MC replicates. Although extreme bias is present in this estimator the spread of the replicates closely tracks the LS surface determined by the point estimate. Next parametric bootstraps were computed of the point estimate with the tree assumed known. The results are shown in Figure 6C. The MC mean is shown as the large dot at (
, g) = (259.52 ± 66.78, 69.88 ± 30.25). This estimator shows substantially less bias. Finally, the nonparametric jackknife estimator gives a nearly unbiased estimate of the target. The small dots in Figure 6D are the MC jackknife replicates and these cluster tightly around the target. Although the estimated population growth rate for the African sample was very high it was still possible to obtain a nearly unbiased estimate of the point estimate using the jackknife.
The jackknifing statistic allows taking advantage of the intuitive idea that most of the mutations with information with regard to computing accurate estimates of
and g lie on the tips of the coalescent tree. Thus, jackknifing may be regarded as a computationally cheap way to take advantage of this property to obtain nearly unbiased estimates of the parameters. It is essentially a recalibration method that allows reweighting the estimator using the residuals of the nonlinear mutation model. The jackknifing is so efficient it effectively weights each external branch as if it were adding genetic information from many independent loci (see Figure 2 for the cases n = 30, l = 10). However, in this case the only information in the sample is from a single locus—the human mtDNA.
ABSTRACT
GENOMIC DATA, COALESCENTS, AND...
STATISTICAL MODEL
STATISTICS AND COMPUTATION-BASED...
RESULTS
>DISCUSSION
ACKNOWLEDGEMENTS
LITERATURE CITED
Least-squares vs. maximum-likelihood estimation:
In many situations a ML estimator is preferable over a LS estimator because the former is more flexible and can incorporate statistical tests more readily than the latter. However, estimation of demographic and selective parameters using the coalescent is an inherently nonlinear problem. That is, the surface to be minimized or maximized is likely to have multiple optima and a complex topography. Estimating likelihoods when the tree is not known or approximated is extremely computationally intensive in determining this surface. Despite this, using a parametric bootstrap it is possible to compare the ML and LS estimators. One generates Monte Carlo replicate data sets and superimposes the resulting estimates on top of each other. The LS estimates should cluster around the minimum and the ML should cluster around a maximum. The minima and the maxima should be approximately centered over each other. Thus the two clusters should lie roughly on top of each other with centers over the target. This is true for the LS estimator (EVE) but not always for the ML estimator analyzed in this article (FLUCTUATE). As sample size increases the correlation between optima located on the LS surface and optima located on the ML surface becomes lost. As sample size increases one would expect that an efficient estimator should become more accurate in predicting the target. My results show that the LS estimator does have a minimum over the target and becomes more accurate in predicting the target as sample size increases. I also show that the LS point estimator can be efficiently bias corrected if a nonparametric resampling estimation method is used.
Measurement error, summary statistics, and MCMC:
Recently SISSON (2007) pointed out some of the technical difficulties associated with developing MCMC algorithms for complex genetic data. He reviews computational problems associated with application of standard Metropolis–Hastings samplers to population genetics problems involving intractable likelihoods. As noted above, STEPHENS (1999) earlier pointed to similar problems, specifically centered around integration of the Felsenstein equation (HEY and NIELSEN 2007). An important part of Sisson's critique lies in evaluating how to construct proposals for MCMC algorithms on the basis of summary statistics that allow efficient "mixing" of the algorithm. He points to the analysis of TANAKA et al. (2006), who constructed a stochastic model at the genetic level that allows direct application of the Metropolis–Hastings approach of MARJORAM et al. (2003) to a summary statistic. Such an approach is related to the methods developed in the present article. However, the method would have been substantially slower and I would never have been quite sure when to stop the algorithm. Instead I implemented a cheaper LS model that allowed obtaining a fast answer. The frequentist framework developed here can be expressed within the framework of Bayesian sensitivity analysis (OAKLEY and O'HAGAN 2004). Whatever framework one chooses, however, the present article offers up a cheap and fast method of obtaining parameter estimates for complex genetic data sets that can be compared to computationally expensive MCMC methods. Future work should lie in bringing together these two approaches into a more complete and useful statistical theory of genomic data.
ABSTRACT
GENOMIC DATA, COALESCENTS, AND...
STATISTICAL MODEL
STATISTICS AND COMPUTATION-BASED...
RESULTS
DISCUSSION
>ACKNOWLEDGEMENTS
LITERATURE CITED
ABSTRACT
GENOMIC DATA, COALESCENTS, AND...
STATISTICAL MODEL
STATISTICS AND COMPUTATION-BASED...
RESULTS
DISCUSSION
ACKNOWLEDGEMENTS
>LITERATURE CITED
AVISE, J. C., 2000 Phylogeography: The History and Formation of Species. Harvard University Press, Cambridge, MA.
BEAUMONT, M. A., 1999 Detecting population expansion and decline using microsatellites. Genetics 153: 2013–2029.
CABRERA, J., and L. T. FERNHOLZ, 1999 Target estimation for bias and mean square error reduction. Ann. Stat. 27(3): 1080–1104.[CrossRef]
CRANDALL, K., D. POSADA and D. A. VASCO, 1999 Effective population sizes: missing measures and missing concepts. Anim. Conserv. 2: 317–319.[CrossRef]
CHRISTIANINI, N., and M. W. HAHN, 2007 Introduction to Computational Genomics. Cambridge University Press, Cambridge, UK.
DENG, W.-H., and Y.-X. FU, 1996 The effect of variable mutation rates across sites on phylogenetic estimation of effective population size. Genetics 144: 1271–1281.[Abstract]
EFRON, B., and R. J. TIBSHIRANI, 1993 An Introduction to the Bootstrap (Monographs on Statistics and Applied Probability 57). Chapman & Hall, London.
EMERSON, B. C., E. PARADIS and C. THEBAUD, 2001 Revealing the demographic histories of species using DNA sequences. Trends Ecol. Evol. 16(12): 707.[CrossRef]
FEARNHEAD, P., and D. P. DONNELLY, 2001 Estimating recombination rates from population genetic data. Genetics 159: 1231–1241.
FELSENSTEIN, J., 1995 PHYLIP (Phylogeny Inference Package), Version 3.572. University of Washington, Seattle.
FU, Y.-X., 1994a A phylogenetic estimator of effective population size or mutation rate. Genetics 136: 685–692.[Abstract]
FU, Y.-X., 1994b Estimating effective population size or mutation rate using the frequencies of mutations of various classes in a sample of DNA sequences. Genetics 138: 1375–1386.[Abstract]
FU, Y.-X., 1995 Statistical properties of segregating sites. Theor. Popul. Biol. 48: 172–197.[CrossRef][Medline]
FU, Y.-X., and W.-H. LI, 1999 Coalescing into the 21st century: an overview and prospects of coalescent theory. Theor. Popul. Biol. 56: 1–10.[CrossRef][Medline]
GIVENS, G. H., and J. A. HOETING, 2005 Computational Statistics. Wiley-Interscience, Hoboken, NJ.
GOURIEROUX, C., and A. MONFORT, 1996 Simulation-Based Econometric Methods. Oxford University Press, Oxford.
GRIFFITHS, R. C., and S. TAVARÉ, 1994 Sampling theory for neutral alleles in a varying environment. Philos. Trans. R. Soc. Lond. B 344: 403–410.[Medline]
HEIN, J., M. H. SCHIERUP and C. WIUF, 2005 Gene Genealogies, Variation and Evolution. A Primer in Coalescent Theory. Oxford University Press. Oxford.
HEY, J., and R. NIELSEN, 2007 Integration with the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc. Natl. Acad. Sci. USA 104: 2785–2790.
HICKERSON, M. J., G. DOLMAN and C. MORITZ, 2006 Comparative phylogeographic summary statistics for testing simultaneous vicariance. Mol. Ecol. 15: 209.[CrossRef][Medline]
HJORTH, J. S. U., 1994 Computer Intensive Statistical Methods: Validation Model Selection and Bootstrap. Chapman & Hall, London.
HUDSON, R. R., 1983 Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23: 183–201.[CrossRef][Medline]
HUDSON, R. R., 1990 Gene genealogies and the coalescent process, pp. 1–44 in Oxford Surveys in Evolutionary Biology, edited by P. H. HARVEY and L. PARTRIDGE. Oxford University Press, London/New York/Oxford.
KIMURA, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.
KNOWLES, L. L., 2004 The burgeoning field of statistical phylogeography. J. Evol. Biol. 17: 1–10.[CrossRef][Medline]
KUHNER, M. K., J. YAMOTO and J. FELSENSTEIN, 1995 Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics 140: 1421–1430.[Abstract]
KUHNER, M. K., J. YAMOTO and J. FELSENSTEIN, 1998 Maximum likelihood estimation of population growth rates based on the coalescent. Genetics 149: 429.
LI, W.-H., 1997 Molecular Evolution. Sinauer Associates, Sunderland, MA.
LI, W.-H., and Y.-X. FU, 1994 Estimation of population parameters and detection of natural selection from DNA sequences, pp. 112–125 in Non-Neutral Evolution: Theories and Molecular Data, edited by B. GOLDING. Chapman & Hall, London.
MARJORAM, P., and S. TAVARÉ, 2006 Modern computational approaches for analyzing molecular genetic variation data. Nat. Rev. Genet. 7: 759–770.[CrossRef][Medline]
MCVEAN, G., and C. C. A. SPENCER, 2006 Scanning the human genome for signals of selection. Curr. Opin. Genet. Dev. 16: 624–629.[CrossRef][Medline]
NIELSEN, R., S. WILLIAMSON, Y. KIM, M. J. HUBISZ, A. G. CLARK et al., 2005 Genomic scans for selective sweeps using SNP data. Genome Res. 15: 1566–1575.
OAKLEY, J. E., and A. O'HAGAN, 2004 Probabilistic sensitivity analysis of complex models: a Bayesian approach. J. R. Stat. Soc. B 66(3): 751–769.[CrossRef]
QUENOUILLE, M. H., 1956 Notes on bias in estimation. Biometrika 43: 353–360.
ROGERS, A. R., and H. HARPENDING, 1993 Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9: 552–569.
SCHAFFNER, S., C. FOO, S. GABRIEL, D. REICH, M. J. DALY et al., 2005 Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15: 1576–1583.
SCHIERUP, M., and J. HEIN, 2000a Consequences of recombination on traditional phylogenetic analysis. Genetics 156: 879–891.
SCHIERUP, M., and J. HEIN, 2000b Recombination and the molecular clock. Mol. Biol. Evol. 17: 1578–1579.
SHAO, J., and D. TU, 1995 The Jackknife and the Bootstrap. Springer-Verlag, New York.
SISSON, S. A., 2007 Genetics and stochastic simulation do mix! Am. Stat. 61(2): 112–119.[CrossRef]
SLATKIN, M., and R. R. HUDSON, 1991 Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129: 555–562.[Abstract]
STEPHENS, M., 1999 Problems with computational methods in population genetics. Bulletin of the 52nd Session of the International Statistical Institute, Helsinki.
STEPHENS, M., and P. DONNELLY, 2000 Inference in molecular population genetics. J. R. Stat. Soc. B 62: 605–635.[CrossRef]
SWOFFORD, D. L., and G. J. OLSEN, 1990 Phylogenetic reconstruction, pp. 411–501 in Molecular Systematics, edited by D. M. HILLIS and C. MORITZ. Sinauer Associates, Sunderland, MA.
TAJIMA, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460.
TAJIMA, F., 1989 The effect of change in population size on DNA polymorphism. Genetics 123: 597–601.
TANAKA, M. M., A. R. FRANCIS, F. LUCIANI and S. A. SISSON, 2006 Estimating tuberculosis transmission parameters from genotype data using approximate Bayesian computation. Genetics 173: 1511–1520.
WAKELEY, J., R. NIELSEN, S. N. LIU-CORDERO and K. ARDLIE, 2001 The discovery of single-nucleotide polymorphisms—inferences about human demographic history. Am. J. Hum. Genet. 60: 1332–1347.
WATTERSON, G. A., 1975 On the number of segregating sites in genetic models without recombination. Theor. Popul. Biol. 7: 256–276.[CrossRef][Medline]
VASCO, D. A., K. A. CRANDALL and Y.-X. FU, 2001 Molecular population genetics: coalescent methods based on summary statistics, pp. 173–216 in Computational and Evolutionary Analysis of HIV Molecular Sequences, edited by A. G. RODRIGO and G. H. LEARN, JR. Kluwer Academic Publishers, Dordrecht, The Netherlands.
VIGILANT, L. M., M. STONEKING, H. HARPENDING, K. HAWKES and A. C. WILSON, 1991 African populations and the evolution of human mtDNA. Science 253: 1503–1507.
Communicating editor: R. NIELSEN
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
-
All Versions of this Article:
genetics.108.087049v1
179/2/951 most recent - Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Vasco, D. A.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Vasco, D. A.




, with associated labeled mutations 1, 2, 3, 4, 5, 6. This schematic shows the structure of the summary statistics if the sequences evolve according to the coalescent genealogical structures shown in B–D. Comparing summary data shown in A with coalescent processes shown in B–D demonstrates that it is possible to separate demographic processes of the coalescent that are determined by the ti with those determined by genetic data (mutations di, see also 




















