- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Laval, G.
- Articles by Chevalet, C.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Laval, G.
- Articles by Chevalet, C.
Maximum-Likelihood and Markov Chain Monte Carlo Approaches to Estimate Inbreeding and Effective Size From Allele Frequency Changes
Guillaume Lavala, Magali SanCristobalb, and Claude Chevaletba Computational and Molecular Population Genetics Laboratory, Zoology Institute, University of Bern, 3012 Bern, Switzerland
b Laboratoire de Génétique Cellulaire, Institut National de la Recherche Agronomique, 31326 Castanet-Tolosan, France
Corresponding author: Magali SanCristobal, INRA, Chemin de Borde-Rouge--Auzeville, BP 27, 31326 Castanet-Tolosan Cedex, France., msc{at}toulouse.inra.fr (E-mail)
Communicating editor: S. W. SCHAEFFER
| ABSTRACT |
|---|
Maximum-likelihood and Bayesian (MCMC algorithm) estimates of the increase of the Wright-Malécot inbreeding coefficient, Ft, between two temporally spaced samples, were developed from the Dirichlet approximation of allelic frequency distribution (model MD) and from the admixture of the Dirichlet approximation and the probabilities of fixation and loss of alleles (model MDL). Their accuracy was tested using computer simulations in which Ft = 10% or less. The maximum-likelihood method based on the model MDL was found to be the best estimate of Ft provided that initial frequencies are known exactly. When founder frequencies are estimated from a limited set of founder animals, only the estimates based on the model MD can be used for the moment. In this case no method was found to be the best in all situations investigated. The likelihood and Bayesian approaches give better results than the classical F-statistics when markers exhibiting a low polymorphism (such as the SNP markers) are used. Concerning the estimations of the effective population size all the new estimates presented here were found to be better than the F-statistics classically used.
MONITORING genetic diversity in animal populations has become an important concern for institutions involved in the preservation of wild life and of endangered species, as well as in animal breeding. In the latter case, a few breeds are selected with high selection pressure, while other breeds are no longer extensively used and are faced with risks of extinction. In both cases, severe reductions in the genetic effective size are observed, leading to the loss of genetic diversity. Measuring this loss can be performed by means of historical surveys of allelic frequencies at a number of polymorphic loci. Such time-spaced sampling protocols have been applied in a wide range of populations, from natural and domestic taxa (![]()
![]()
Consider a neutral marker exhibiting k alleles and an isolated population of effective size N, which is described by its allele frequencies p0 = (p0,1, ... , p0,i, ... , p0,k) and pt = (pt,1, ... , pt,i, ... , pt,k) (for i = 1, ... , k), in a first generation, named the founder generation, and t generations later, respectively. In the tth generation the expectation and variance of the allelic frequencies distribution are given by
![]() |
(1) |
![]() |
(2) |
when genetic drift is assumed during t discrete and nonoverlapping generations. The quantity 1 - (1 - 1/2N)t, which can be viewed as the growth from the founder generation of the Wright-Malécot inbreeding coefficient (![]()
![]()
A theoretical framework based on F-statistics proposed by ![]()
![]()
![]()
= t/2
t in which t, the number of generations between two samplings, is known. The estimations of Ft and of the variance effective size performed in a natural population give the increase in the inbreeding coefficient and the size of a Wright-Fisher population that would experience a comparable increase in variance of gene frequency over time.
In fact, estimates of allele frequencies are obtained by sampling a number of individuals in the population, which suggests using the theory of coalescence. Following it (Fig 1, right, broken arrows), the probability to get a sample mt = {mt,1, ... , mt,i, ... mt,J0,·} at time t, given the true initial frequencies p0 = {pt,1, ... , pt,i, ... pt,J0,·}, is obtained by conditioning on the true number J0,· =
i J0,i of original genes, or lines of descent, from which the final sample originates. The probability distribution of J0,·, given the size mt,· =
imt,i of the final sample, can be derived exactly (![]()
![]()
![]()
|
In the following, we consider the alternative approach (Fig 1, left, solid arrows), introducing the direct transition between initial allele frequencies p0 and allele frequencies pt at time t, followed by the final (multinomial) sampling giving the observed sample mt. Asymptotic distribution of allele frequencies is known under special cases involving mutations between a finite number of alleles and leading to Dirichlet distributions (![]()
![]()
![]()
The exact transition due to drift has no simple analytical expression. We therefore investigated approximations of joint allele probability using the Dirichlet distribution (![]()
![]()
![]()
0 for every allele existing in the founder sample, it is clear that this first model cannot take into account the loss of alleles due to the drift process. We introduced a second model, which is an admixture of allele fixation and loss probabilities (![]()
Distributions of the sample mt are obtained conditionally on initial exact frequencies p0. In this article we therefore consider the following two cases. First, we propose maximum-likelihood estimates based on two models, Dirichlet only and Dirichlet and allele loss probabilities admixture, when founder frequencies are known. This theoretical situation allows us to discuss the relevance of the Dirichlet approximation and the benefit given by the introduction of the allele loss probabilities.
Second, the founder frequencies are estimated from a small sample of founder animals. In such a case the second model taking allele loss into account is hardly tractable. Developing unbiased estimates requires further statistical developments that are not presented in this article. Here, we consider only the first model (Dirichlet only) in which the sampling process within the founder generation is dealt with either from a heuristic point of view, checking for possible corrections of maximum likelihood estimates, or using a Monte Carlo Markov chain (MCMC) algorithm in a Bayesian context.
Comparisons with two F-statistic methods, one introduced by ![]()
![]()
![]()
Since the number of generations between the founder and the tth generation is known, we test the accuracy of the estimations of N from the maximum-likelihood estimations of Ft using the classical formula
= t/2
t. We also introduce a corrected estimate of N.
Furthermore, to test these new methods (maximum-likelihood estimates and MCMC algorithm) to a real data set, a French snail population (a species of importance for French consumers) was analyzed.
| STATISTICAL BACKGROUND |
|---|
Multinomial-Dirichlet model:
In the following sections, this model is called multinomial-Dirichlet (MD). We assumed that the allele-frequency distributions in the current generation can be approximated by a Dirichlet distribution (![]()
![]()
![]()
t = (
t,1, ... ,
t,i, ... ,
t,k),

where
i
t,i = At. Parameters
t can be related to Ft by adjusting the first two moments under the genetic drift model (Equation 1 andEquation 2) and under the statistical Dirichlet model. This leads to
![]() |
(3) |
It may be shown that the drift and the Dirichlet distributions are only approximately equal, since third moments are different in an amount proportional to F2t, indicating that the approximation is valid only for small Ft values.
Taking account of gene sampling in the tth generation is performed as follows: for one locus, the gene sampling gives partitions mt = (mt,1, ... , mt,i, ... , mt,k). Let us denote the total number of sampled alleles, i.e., twice the number of individuals, as
i mt,i = mt,·. Assuming that the sample stage is described by a multinomial drawing, the whole model is a compound multinomial-Dirichlet model:

Accordingly, an approximate likelihood
MD of mt given p0 and At (after integrating pt out) can be written as
![]() |
(4) |
Allele loss taken into account:
In the following sections, this model is called multinomial-Dirichlet and allele loss (MDL). The Dirichlet distribution is no longer valid when fixation or loss of alleles occurs and is expected to bias the likelihood since any null allele frequency is considered as the result of final sampling only. Taking into account a possible loss of alleles during the drift process implies that f(pt|p0, Ft) is written as a mixture of discrete and continuous terms, introducing the probabilities that some alleles can be lost. Let S be a state in which some of the alleles are lost; then
![]() |
(5) |
Pr(S|p0, Ft) is the probability of getting state S at reduced time Ft from the initial conditions p0 and is derived from the probabilities that alleles are fixed,
![]() |
(6) |
which are given by solutions of the classical partial differential equation of ![]()
In practice, the following approximations were used. Fixation probabilities were approximated following ![]()
S,i parameters of the Dirichlet characterizing a transient state S were set to

keeping a single unconditional At parameter, and taking
![]() |
(7) |
where p0,(S-lost) is the total of initial frequencies of lost alleles in state S, both approximations being justified by the short-term scope of the present scenario.
Writing f(pt|p0, Ft) as a mixture (Equation 5), the likelihood of a sample becomes a mixture involving various S terms:

Since the first term Pr(mt|pt, S) is zero for any S state in which one allele is observed in the sample while it is of null frequency in state S, there are only one or two terms in the previous likelihood. If all the alleles are observed in the sample (state S1), then the likelihood
MDL is equivalent to
MD inEquation 4, weighted by Pr(S1|p0, Ft). Otherwise, at least one allele is not observed in the sample. In that case, two S states must be considered, the one in which all frequencies pt,i are positive but sampling did not allow some alleles to be observed (state S1) and the state that indicates that unobserved alleles have been lost (state S2). For example, for three alleles (see Appendix A), if mt,1 > 0, mt,2 > 0, and mt,3 = 0, the whole likelihood
MDL is given by
![]() |
(8) |
with

where 111 stands for state S1 in which the three alleles have been kept, and 110 for state S2 in which the third allele has been lost during the drift process.
| ESTIMATION PROCEDURE |
|---|
The likelihoods
MD and
MDL, inEquation 4 andEquation 8, depend on the true founder frequencies. Thus we consider the following two cases:
Known founder frequencies:
This situation allows us to check by simulation the validity of the Dirichlet approximation made in this study. However, this situation can be found in some selected breeds in which all founder animals are known and can be genotyped.
Assuming statistical independence between loci (
), the maximization of the multilocus log-likelihood

in which 
=
MD in model MD (Equation 4) and 
=
MDL in model MDL (Equation 8), was performed using a quasi-Newton algorithm. In model MD and in model MDL, the maximum-likelihood estimators are named
MD(ML) and
MDL(ML), respectively.
Estimated founder frequencies:
For one locus, gene sampling within the founder generation gives partitions m0 = (m0,1, ... , m0,i, ... , m0,k). Let the total number of alleles sampled be
i m0,i = m0,·. The estimations of Ft can be based on likelihoods (Equation 4 andEquation 8), where p0 are replaced by consistent estimators x0 (x0,i = m0,i/m0,·). Since maximum-likelihood estimations do not explicitly account for the sampling process in the founder generation, this was performed as follows.
Model MD (Dirichlet):
MD(ML) shows a positive bias of 1/m0,· (data not shown), corresponding to ![]()
![]()
MD(corrML) =
MD(ML) - 1/m0,· was proposed.
Alternatively, a full Bayesian approach can be implemented, taking advantage of prior knowledge on parameters At and p0 through Gamma and Dirichlet distributions, respectively, as
![]() |
(9) |
![]() |
(10) |
with
0 = (
0,1, ... ,
0,k). Computer simulations showed that maximum a posteriori estimates of Ft were biased, with biases depending on sample size (data not shown), and could not be simply corrected as was the case for
MD(ML). Therefore, Bayes estimates, noted
MD(MC) in the following, were derived from the mean of the marginal a posteriori distribution of parameters of interest, f(Ft|m0, mt), obtained by means of a MCMC approach using a Metropolis-Hasting algorithm (![]()
![]()
Model MDL (Dirichlet and allele loss): Combining the model accounting for gene loss and the sampling process in the founder generation (p0 are replaced by x0) did not give satisfactory results. No simple rule was found to obtain unbiased estimates through maximization of likelihood or of a posteriori probability. Extending the MCMC algorithm was postponed for future studies.
Bias correction for F-statistics:
As in ![]()
NT (the index t is omitted) of ![]()
![]() |
(11) |
We also used Reynolds' F-statistic
R (![]()
![]() |
(12) |
in which, for generations 0 and t,
i x2i is replaced by (
i x2i - (1/m·))/(1 - (1/m·)) (![]()
Multilocus estimation and variance prediction for F-statistics:
Denoting 
a single-locus estimate (Equation 11 orEquation 12), multilocus estimates were written as
![]() |
(13) |
where n0,
are the observed numbers of founder alleles at locus
(![]()
- 1 allows heterogeneity of information between markers to be taken into account in minimizing the variance of the multilocus estimation assuming statistical independence between loci.
The predictions of the standard error (SE) of NEI and TAJIMA's (1981) distance given in ![]()
![]()
![]()
![]()
![]()
![]() |
(14) |
| SIMULATION PROCEDURE |
|---|
Twenty genetically independent loci were always considered in the simulations. A simulated population of size N = 500, with allele frequencies p00,i (for i = 1, ... , k), was initially considered, and a pure genetic drift process was simulated 1000 times through five generations. This process generates 1000 quasi-independent populations used as starting points for simulation runs.
To provide inbreeding coefficient values in the interval [0.002, 0.008], each one of these 1000 populations, described by its founder frequencies p0, was submitted to a further pure genetic drift process, with constant diploid effective sizes set to 500 individuals during 8 nonoverlapping generations. Samples of mt,· = 50 alleles were drawn (sampling of 25 diploids) in a multinomial way every two generations. To provide inbreeding coefficient values in the interval [0.01, 0.1], the same process was applied to a population of 100 individuals evolving during 22 generations.
Two kinds of sampling at the stage of the founder population were considered: (i) exhaustive sampling, where the founder sample size is made up of the complete founder population, so that the true founder allele frequencies p0,i are known, and (ii) finite sampling, where the founder sample size m0,· was set to 50 alleles (sampling of 25 diploids).
In Uniform founder frequencies, three sets of 1000 simulations, in which allele frequencies of the initial population were set to p00,i = 1/k, were performed with k = 2, k = 4, and k = 8 alleles, respectively.
In Biochemical and microsatellite founder frequencies two sets of 1000 simulations were used, in which allele frequencies p00,i in the initial populations were set to biochemical and microsatellite marker frequencies published in ![]()
![]()
| RESULTS |
|---|
Performances of the estimates were compared using the bias, Bt (or the relative bias, Bt/Ft with Ft the true value of inbreeding coefficient) and the standard error, SEt, computed over S simulations (unless it was explicitly indicated, S was always set to 1000). The global accuracy of estimations was established on the basis of the square root of MSE
a criterion combining bias and variance, which is equal to the standard error when the estimation is unbiased.
Approximate confidence intervals of relative biases and standard errors were computed using formulas that are valid for normal distributions of estimates. They are indicated in the figures when relative biases are significantly different from zero (zero outside the 95% confidence interval, P < 0.05) and when differences between standard errors were significant (no overlapping of the 95% confidence intervals, P < 0.05).
Uniform founder frequencies
Known founder frequencies ( Fig 2A, Fig 3A, Fig 4A, and Fig 5A):
Square root of the mean squared error Fig 2A:
With two founder alleles per locus (top curves),
NT gives the least accurate estimations (
are the highest) whatever the level of drift. With four founder alleles (middle curves) and eight founder alleles (bottom curves) per locus, the two likelihood methods
MD(ML) (Dirichlet without gene loss) and
MDL(ML) (Dirichlet with gene loss) diverge from each other for
F > 0.07 in the first case and F > 0.04 in the second case.
MD(ML) shows a large loss of accuracy with eight alleles whereas
MDL(ML) always gives the best estimations for any Ft value (with results significantly better for F > 0.08) and a small to medium number of alleles.
|
|
|
|
Relative bias Fig 3A:
As expected from its definition,
R shows no significant bias as is illustrated for eight alleles (Fig 3A) and for fewer alleles (data not shown). In contrast to this result, other measures show biases that are dependent on polymorphism and on the amount of drift.
NT shows significant positive bias with two alleles (data not shown); with eight alleles
NT, and also
MDL(ML), exhibit small but significant negative biases for F > 0.07 (confidence intervals in Fig 3A). However, biases remain small. Therefore, except for
MD(ML) computed with eight alleles per locus, the global accuracy of every method is determined mainly by the standard errors of the estimations (
SE) as long as the number of loci is not too large (<50).
Standard error Fig 4A and Fig 5A:
The prediction given byEquation 14 provides a conservative basis for choosing the number and the polymorphism of markers needed to achieve a given level of precision as defined by the variance. Indeed it is apparent that the different methods considered here are characterized by a variance equal to or less than that predicted by the formula. In every situation encountered (four alleles, data not shown, and eight alleles, Fig 5A)
MDL(ML) proves to be the best with respect to the criterion of minimal variance, with the difference with the theoretical reference (Equation 14) always significant for F > 0.07. For eight alleles (Fig 5A)
NT performs nearly as well but is slightly less accurate [differences between
NT and
MDL(ML) are significant].
Estimated founder frequencies ( Fig 2B, Fig 3B, Fig 4B, and Fig 5B):
As mentioned above, taking account of the sampling process within the founder generation needs additional statistical treatments when the Dirichlet approximation and loss of gene probabilities are combined. Replacing p0,i by x0,i provides biased estimates, with biases inversely proportional to the founder sample size. The results given by
MDL(ML) are not shown.
Square root of the mean square error Fig 2B:
For two alleles (top curves) a gain in accuracy was found with the corrected likelihood estimate
MD(corrML). Beyond 7% drift,
R is slightly better than
NT. For four alleles, results are nearly identical for all the compared methods. With eight alleles
NT turns out to be better than the others.
Relative bias Fig 3B:
R is always unbiased (biases were never statistically different from zero under the simulation conditions used in this article). Both
NT and
MD(corrML) measures show a bias that depends on the allelic distribution, changing in sign with the number of alleles and the true value of Ft. In the range of Ft values considered, and discarding very low values, the relative bias remains smaller than
10%. However, the global accuracy of methods is still due mainly to the standard error of estimation [
SE, except for
MD(corrML) computed with eight alleles per locus and Ft > 0.07].
Standard error Fig 4B and Fig 5B:
The best method is
MD(corrML) when the number of alleles is reduced (two alleles, Fig 4B). With eight alleles per locus (Fig 5B)
NT is the best method with a reduction in standard error of up to 25% for an Ft value
0.10.
The results presented in this section highlight the main drawback of the Dirichlet approximation. Biases depending on the founder polymorphism of markers and on the Ft values appear even for a small amount of drift (Ft
0.1). This problem can be avoided when allele loss probabilities are taken into account. These results suggest that this new model should be used to develop further estimates, which will be more accurate when highly polymorphic markers are used. However
MD(corrML) performs well with markers exhibiting two alleles. This method should, then, be recommended when markers such as single-nucleotide polymorphisms (SNPs) are used.
This method, as well as the MCMC algorithm, was tested with two data sets generated with allele frequencies observed in pig and cattle breeds. Furthermore, the accuracy of the estimations of N obtained with the corrected maximum-likelihood method (estimations of N were derived from the estimations of Ft) was determined with these data sets.
Biochemical and microsatellite founder frequencies
In this section the allele frequency distributions include rare founder alleles and different numbers of alleles between markers. The
in Fig 6 and Fig 7 were computed with 20 biochemical (![]()
![]()
|
|
Estimation of Ft:
To show the relevance of the model combining Dirichlet and allele loss probabilities, we give the results obtained when the founder frequencies are known (Fig 6A and Fig 7A).
MDL(ML) is still the best inbreeding estimator in every case (being unbiased and with the smallest standard error).
When the founder frequencies of biochemical markers are estimated (Fig 6B), the likelihood approach
MD(corrML) and NEI and TAJIMA's (1981) statistics provide the best inbreeding estimations:
MD(corrML) is nearly the best for biochemical markers. As for the replications performed in the previous section, the effect of bias on the global accuracy (
SE) is negligible. For microsatellite markers (Fig 7B), the method giving the smallest standard error still provides the most accurate F estimations, in this case
NT.
The MCMC algorithm requires larger computation times than the likelihood methods. The results are shown for two different values of Ft, Ft = 0.01 and Ft = 0.1 (Table 1), and the number of simulations was limited to S = 100. In practice with a real data set the parameters of candidate generating densities were empirically chosen so that the Metropolis-Hasting algorithm accepts 25% of drawn values (![]()
|
With biochemical and microsatellite markers,
MD(MC) was found to be significantly more accurate than
NT and
MD(corrML) for Ft values of 0.01 (the first two rows in Table 1). The
is almost halved in every case. For Ft = 0.1 the MCMC algorithm does not give the same large gain in accuracy. There are no significant differences among the three methods. The MCMC algorithm allows us to compute the posterior marginal distribution of the parameter of interest, f(Ft|m0, mt) (Fig 8 and Fig 9; Ft = 0.01 and Ft = 0.1, respectively).
|
|
Estimation of N:
These data sets were also used to obtain estimations of the effective population size since t is known (and small) and biases of F estimations are small. For Ft between 0.01 and 0.1 (a simulated population of N = 100) estimations of N can be simply obtained from
![]() |
(15) |
However, even if
t is unbiased,
is biased as
![]() |
(16) |
Equation 14 suggests the following estimate
' of N:
![]() |
(17) |
The performances of these estimates are given in Table 2 and Table 3. Their names are given with the same subscript as used for the
notation.
|
|
The results obtained with
R and
MD(MC) are not shown since they are less accurate than those obtained with
NT and
MD(corrML), respectively.
In every case, the
' estimations are less biased with a smaller standard error than that of the noncorrected estimators
. The corrected-likelihood method is more accurate than the F-statistics, for every t and for both biases and standard errors. It is more suitable to combine
' with the corrected maximum-likelihood approach: in the last row of Table 3 (20 microsatellites),
'MD(corrML) gives an unbiased estimation and leads to a decrease of 30% of the standard error of
NT.
It should be mentioned here that this corrected estimate must be used when the experimental conditions lead to a small coefficient of variation of the estimation of Ft, sinceEquation 16 is accurate only when Var(
t)/E(
t)2 is negligible. The highest relative standard error of Ft in Table 2 and Table 3 is <0.5. In practiceEquation 14 can be used to estimate the relative standard error to decide whether
' is relevant.
| A REAL DATA SET |
|---|
We used a data set provided by J. F. ARNAULD (unpublished results) to illustrate the behavior of the corrected maximum-likelihood method and the MCMC algorithm with a real data set. The population of Helix aspersa (Gastropoda: Helicidae) belongs to an intensive agricultural zone located in Brittany (northwestern France), in the polders of the Bay of Mont-Saint-Michel. The snail population was sampled in 1998 and 2000 with 15 and 30 individuals, respectively. The estimations of Ft and N were computed with four microsatellite markers, exhibiting 5, 6, 8, and 12 alleles in the founder generation, respectively.
The estimations of Ft obtained with
NT,
R, and
MD(corrML) are equal to 0.011, 0.016, and 0.032, respectively. Their coefficients of variation, computed fromEquation 14, are equal to 1.715, 1.264, and 1 respectively. With such coefficients of variation
' cannot be used. We estimated the effective size fromEquation 15 assuming one generation per year (![]()
![]()
The MCMC estimations of Ft and of the coefficient of variation (computed with the standard error obtained from the marginal posterior distribution of Ft, Fig 10) are equal to 0.007 and 0.857, respectively.
|
| DISCUSSION |
|---|
Since many different methods and situations were investigated in this study, a summary table (Table 4) is given to highlight the main results obtained when the founder frequencies are estimated with a limited founder sample size (here 25 individuals), the situation most widely found in an experimental scheme.
|
Scope of the present work:
The scope of this article was limited to a short-term investigation. The comparisons of methods are based on an expected increase in inbreeding of 10% or less, which corresponds to 10 generations for a population of 50 individuals. This seemed a realistic timescale since the validity of the assumption of no mutation may be questionable for longer time intervals. The number of markers used was 20, which is a common value found in practical (![]()
![]()
![]()
![]()
Increasing the number of markers to get a better estimate of drift, which reduces the standard error around the expectation but not around the true value, requires the use of unbiased measures. Bias becomes a significant consideration when the number of alleles is large, and analytical corrections for
NT could be derived to avoid this bias. In practice, however, this may be not crucial since with the number of markers commonly used and the small values of Ft observed biases remain small in comparison with the standard errors. To illustrate this point when Ft = 0.08, markers exhibit eight alleles, and founder allele frequencies are exactly known, >200 markers are needed for the unbiased
R statistic to outperform NEI and TAJIMA's (1981) method. Hence working with measures with a low bias may be an advantage when the level of inbreeding is low.
It must be stressed that this conclusion is quite different when we consider the distance between two different breeds. Indeed, the values of distances are often >0.1. Methods such as
NT and similar statistics as well as likelihood estimates show nonnegligible biases when allele numbers and Ft values are large and therefore cannot be recommended on the basis of this work. Unbiased methods such as ![]()
![]()
For dominant markers such as randomly amplified polymorphic DNA or amplified fragment length polymorphism, allelic frequencies can be estimated with the square root rule from the frequencies of absence of bands and can be used to estimate Ft, keeping in mind that a deviation from the Hardy-Weinberg proportions leads to biased maximum-likelihood estimations of allelic frequencies. Some software (Arlequin, ![]()
![]()
Performances of various estimates of Ft were sensitive to the total number of alleles over loci (Equation 14). Considering biallelic markers, SNPs seem to be full of potential (![]()
The theoretical prediction (Equation 14), although computed under the assumption of the statistical independence of loci (linkage disequilibrium remains null), seems to be a conservative estimate of the variance of the optimized F-statistics (Equation 13) and of the other measures considered. With a real data set in which some markers are in linkage disequilibrium, the expected standard error cannot be computed easily. Some preliminary simulations have been undertaken with 100 individuals, 22 generations, and 20 highly polymorphic loci (20 founder alleles per locus), with a recombination rate r between them varying from 0.5 to 0.001. Results show that the variance of the FR estimate is not affected by linkage between loci as long as r > 0.1. The standard deviation is increased by 17% if r = 0.05 and by 155% if r = 0.001 (9 and 104%, respectively, with 4 founder alleles per locus). More work is needed to quantify the influence of nonindependence of loci on the variances of the estimates.
Analytical approximation:
We derived an approximate likelihood by using the Dirichlet approximation of conditional allele frequency distribution. Comparisons between likelihood methods computed when true founder frequencies are known and the F-statistics (unbiased like
R and with a small variance like
NT) indicate that the MD approximation model is relevant when the polymorphism of markers is low.
When factors enhance the probability that alleles may be lost (high polymorphism leading to the occurrence of rare founder alleles and intermediate and high levels of drift), this simple approximation is no longer relevant. Several levels of approximations were used, to circumvent the combinatorial problems raised by the exact coalescent approach when the number of alleles increases. Likelihoods and posterior probabilities were derived by means of a Dirichlet model and probabilities of fixation and loss of alleles made use of a simple approximation. The latter was previously checked (![]()
MD(ML)) observed in Fig 2A shows the deleterious effects of increasing drift or number of alleles, with both effects increasing the probabilities that some alleles may be lost by drift. Combining probabilities of losses and a mixture of distributions in model MDL allowed the best estimates to be obtained when using initial known frequencies (Fig 2A). The advantage over other methods seems to be uniform over all considered cases, and the most pronounced improvement concerns the variance of F estimates, although the gain seems to be less for a large number of alleles.
The latter observation suggests that the combined approximation becomes less accurate for more than eight alleles per loci. It may be suggested that the choice of the Dirichlet distributions used for transient allele frequencies (Appendix A) is the most sensitive step sinceconsidering the first two moments of distributionsit can be proved that it is not possible to equate the true mixture (Equation 5) to the mixture of Dirichlet distributions that yieldsEquation 8. Other parameter adjustments, such as using exact conditional expectations (Appendix A) rather than the simplified expression ofEquation 7, and adjusting the dispersion parameters (A) for each transient state S, might give a better account of drift for mid- or long-term processes. At the same time this should avoid the combinatorial problem driven by a large number of alleles in the exact coalescence.
Initial sampling:
Precision in the estimation of founder allele frequencies is a key to accurately estimate the amount of drift. For example, the best estimate obtained with exact biochemical founder frequencies [ Fig 6A,
MDL(ML) estimate] shows the same mean square error as the best estimate possible with microsatellite markers sampled in the founder generation (Fig 7B,
NT estimate).
As mentioned above in experimental schemes applied to domestic breeds all founder animals are known and can be sampled. In this special case, we have shown that it is possible to derive a valuable approximation of the drift process, which allows improvements of Ft estimation to be obtained with both biochemical and microsatellites markers. The gain is significant for intermediate values of Ft (in the range of 510%) and if the mean number of alleles is not too large. From a practical point of view concerning natural populations the methods based on this MDL model might be used when the founder sample is large (up to 100 individuals). The bias of the maximum-likelihood estimates, which is inversely proportional to the founder sample size (data not shown), tends to be small in front of the standard error.
For small founder sample sizes no method was found to be consistently better than the others over all the situations tested. Introducing the allele loss probabilities in the Dirichlet model is hardly tractable in this case and we kept this for future work. However, the methods based on the Dirichlet model without allele loss greatly improve the estimation of the amount of drift in several interesting situations.
The corrected maximum-likelihood methods (
MD(corrML)) and the MCMC algorithm should be preferred when markers of low polymorphism are used, a situation that may be of importance with the advent of the SNP markers. Using estimations of founder frequencies can lead to biased Ft estimations with the maximum-likelihood method based on the MD model but we have shown that this problem can be solved simply with a heuristic correction. This bias correction gives more accurate estimations than the F-statistics give without changing the standard error of the maximum-likelihood estimate. In contrast, the MCMC algorithm greatly reduces the standard error of the maximum-likelihood estimate. Using this algorithm, which performs a numerical integration of the nuisance parameters (here p0), is the most relevant when a large part of this standard error is due to the sampling in the founder generation, as was the case for small Ft. We can deduce fromEquation 14 that the sampling process is largely responsible for the decrease of the accuracy of estimations when Ft is small: the relative standard error SEt/Ft becomes large when Ft tends to 0 [the part depending on 1/(m0,·) is inversely proportional to Ft].
The analysis of the French snail population shows that the MCMC algorithm can be applied to a real data set. The Markov chain well converges and the estimation of Ft is of the order of magnitude of the estimations given by the other approaches, suggesting that the results
















0.01) were computed over 20 loci and 1000 replications performed with populations of N = 500, t varying from 2 to 8 (respectively N = 100, t varying from 2 to 22). Every sample size was equal to 25 individuals. In A, the true founder frequencies are known. In B, the founder sample size was set to 25 individuals. In both A and B, dotted lines show the results obtained with uniform distributions involving four founder alleles. Solid lines show the results obtained with uniform distributions involving two (top curves) and eight (bottom curves) founder alleles.









