Originally published as Genetics Published Articles Ahead of Print on July 29, 2007.

Genetics, Vol. 177, 861-873, October 2007, Copyright © 2007
doi:10.1534/genetics.107.077263

Empirical Bayes Inference of Pairwise FST and Its Distribution in the Genome

* Faculty of Marine Science, Tokyo University of Marine Science and Technology, Minato, Tokyo 108-8477, Japan and {dagger} Graduate School of Agriculture and Life Sciences, University of Tokyo, Bunkyo, Tokyo 113-8657, Japan

1 Corresponding author: Tokyo University of Marine Science and Technology, 4-5-7 Konan, Minato, Tokyo, 108-8477, Japan.
E-mail: kitada{at}kaiyodai.ac.jp

Manuscript received June 11, 2007. Accepted for publication July 17, 2007.

ABSTRACT

Populations often have very complex hierarchical structure. Therefore, it is crucial in genetic monitoring and conservation biology to have a reliable estimate of the pattern of population subdivision. FST's for pairs of sampled localities or subpopulations are crucial statistics for the exploratory analysis of population structures, such as cluster analysis and multidimensional scaling. However, the estimation of FST is not precise enough to reliably estimate the population structure and the extent of heterogeneity. This article proposes an empirical Bayes procedure to estimate locus-specific pairwise FST's. The posterior mean of the pairwise FST can be interpreted as a shrinkage estimator, which reduces the variance of conventional estimators largely at the expense of a small bias. The global FST of a population generally varies among loci in the genome. Our maximum-likelihood estimates of global FST's can be used as sufficient statistics to estimate the distribution of FST in the genome. We demonstrate the efficacy and robustness of our model by simulation and by an analysis of the microsatellite allele frequencies of the Pacific herring. The heterogeneity of the global FST in the genome is discussed on the basis of the estimated distribution of the global FST for the herring and examples of human single nucleotide polymorphisms (SNPs).


INFERRING genetic population structure has been a major theme in population biology, ecology, and human genetics. The fixation index FST, introduced by WRIGHT (1951), is a key parameter for such studies and is most commonly used to measure genetic divergence among subpopulations (PALSBØLL et al. 2007). It is defined as the correlation between random gametes drawn from the same subpopulation relative to the total population. Another measure used frequently is COCKERHAM's (1969, 1973) coancestry coefficient, which is the probability that two random genes from different individuals are identical by descent, and the average overall pairs of individuals within the same subpopulation equal Wright's FST (EXCOFFIER 2003). We use the notation {theta}WC for the average coancestry coefficient and {theta}WC = FST as shown by WEIR and COCKERHAM (1984). NEI's (1973) GST is analogous to FST and identical to FST for diploid random-mating populations (EXCOFFIER 2003).

NEI and CHESSER (1983) proposed an estimator for FST and GST. The estimation of these parameters accounts only for the sampling error within subpopulations and therefore assumes that all subpopulations have been sampled (COCKERHAM and WEIR 1986; EXCOFFIER 2003). WEIR and COCKERHAM (1984) developed the moment estimator Formula for the coancestry coefficient {theta}WC, which takes the sampling error for the subpopulations into account. Several moment estimators with different weighting schemes have also been derived (ROBERTSON and HILL 1984; WEIR and COCKERHAM 1984). An alternative estimation has been discussed using the method of ordinary least squares (REYNOLDS et al. 1983). WEIR and HILL (2002) extended {theta}WC to a population-specific parameter to allow different levels of coancestry for different populations. They also derived an estimator for {theta}WC with confidence intervals using a normal theory approach.

Despite the development of methods for assigning individuals to populations (PAETKAU et al. 1995; PRITCHARD et al. 2000; HUELSENBECK and ANDOLFATTO 2007), the differentiation estimators remain the most commonly used tools for describing population structure (BALLOUX and LUGON-MOULIN 2002). WEIR and COCKERHAM (1984) showed that their estimator Formula provides the smallest bias among the moment estimators. GOUDET et al. (1996) confirmed this using simulations and showed that Formula generates the least-biased estimate of FST but has the largest variance when FST is small. RAUFASTE and BONHOMME (2000) showed that Formula is nearly unbiased, with minimal variance for large FST, and that the estimator of ROBERTSON and HILL (1984) Formula is negatively biased, with minimal variance for small FST. They proposed a correction for the bias of Formula but this cannot be corrected properly in the range of [0.05, 0.1]. Therefore, a precise estimate of FST is crucial, especially for small and moderate levels of genetic differentiation.

In addition to the estimation of FST over all subpopulations in a metapopulation (hereafter, we call this global FST), FST's for pairs of sampled localities or subpopulations (pairwise FST) are usually estimated in conservation biology and ecology. In fact, the computer programs Arlequin (EXCOFFIER et al. 2005), FSTAT (GOUDET 1995), and Genepop (RAYMOND and ROUSSET 1995) estimate these parameters and are used widely in ecological studies. These three software packages produce the same or similar values for pairwise FST estimates and provide the basic statistics for exploratory analyses of population structure, such as cluster analysis and multidimensional scaling. They are also used as a criterion for population differentiation (WAPLES and GAGGIOTTI 2006; PALSBØLL et al. 2007). However, the estimation of FST's is not precise enough to reliably estimate the population structure and the extent of heterogeneity, especially for large gene flow species.

Small numbers of individuals taken from each locality should also affect the precision of Formula Populations often have very complex hierarchical structures, and geographical samples are usually taken from many localities to include a wide area. Therefore, the numbers of individuals from each locality are frequently limited by the large number of sampling points. Small sample sizes can result in biased estimates of the allele frequencies of each subpopulation. This bias may be larger for cases with larger numbers of alleles, such as microsatellite DNA. Uncertainty in the estimates of allele frequencies should affect the estimation of FST's. The Bayesian approach provides better estimates of allele frequencies by taking uncertainty into account (LANGE 1995; LOCKWOOD et al. 2001). Posterior distributions of global FST were simulated from posterior distributions of allele frequencies, assuming common hyperparameters across all loci (HOLSINGER 1999; HOLSINGER et al. 2002; CORANDER et al. 2003). However, accurate estimation of pairwise FST, the essential parameter in ecological studies, has not been fully investigated.

In this article, we propose an empirical Bayes procedure to estimate locus-specific pairwise FST's, taking into account the uncertainty of the allele frequencies of subpopulations. The estimation procedure has two stages. First, the hyperparameters of Dirichlet prior distributions for allele frequencies at each locus are estimated from observed allele counts by a maximum-likelihood method. The global FST is then estimated at each locus. Second, on the basis of the estimates of the hyperparameters, and given the allele counts, posterior distributions of the allele frequencies are generated for each locus, from which the posterior distributions of locus-specific pairwise FST's are simulated. The posterior mean of our empirical Bayes pairwise FST estimates can be interpreted as a shrinkage estimator (STEIN 1956; MARITZ and LEWIN 1989) anchored to the average of the true values among pairs. It performs better than conventional differentiation estimators and robustly estimates the population structure, even for non-Dirichlet cases, as stepping-stone models. The posterior distribution of pairwise FST's can be used to calculate a criterion of population differentiation. Our maximum-likelihood estimates of the global FST's can also be used as sufficient statistics to estimate the distribution of FST among loci in the genome. Our model assumes random mating or random sampling of alleles at each locality and that linkage equilibrium holds between loci. It also assumes that allele counts at each locus, given the true allele frequencies, are independent among populations. Our method can be applied to frequency data for common genetic markers, including isozymes, mitochondrial DNA, microsatellites, and single nucleotide polymorphisms (SNPs). We show the efficacy of our model by simulation and by an analysis of microsatellite allele frequencies of the Pacific herring. The heterogeneity of FST in the genome is discussed on the basis of the estimated distribution of global FST for the herring and examples of human SNPs.


MODELS AND METHODS

The model:

Consider a simple random sampling from multiple localities in a metapopulation. Suppose that K random-mating demes or subpopulations are drawn from the metapopulation. Let Formula be a vector of the true allele frequencies at locus l in subpopulation k, where Jl is the number of different alleles at the locus, and Formula We assume a Dirichlet distribution as the prior distribution of pkl. The probability density function is

Formula
where Formula are the hyperparameters and Formula is a scale parameter that is specific for the locus. This model describes well a metapopulation that has a continuous structure and consists of an infinite number of subpopulations or demes (PANNELL and CHARLESWORTH 2000; ROUSSET 2003; HANSKI and GAGGIOTTI 2004). Let Formula be the mean allele frequency for the metapopulation at the locus satisfying Formula Hence, we have the relation ßlj = {alpha}lj/{theta}l.

Under this model, the global FST (hereafter denoted as Formula) at each locus is expressed simply by the scale parameter, as

Formula 1(1)
as given by WRIGHT (1969), RANNALA and HARTIGAN (1996), BALDING and NICHOLS (1997), LOCKWOOD et al. (2001), BALDING (2003), and KITADA and KISHINO (2004). In this model, the variance of the jth allele frequency for the locus, pklj, is expressed by

Formula 1
as given by WEIR (1996), HOLSINGER et al. (2002), BALDING (2003), and KITAKADO et al. (2006). The Dirichlet distribution assumes an evolutionary equilibrium and an equal mutation rate for all alleles (WEIR and HILL 2002; EWENS 2004). Under this assumption, the scale parameter {theta}l refers to the rate of gene flow, as given by RANNALA and HARTIGAN (1996). We use the symbol {theta} for the scale parameter, following RANNALA and HARTIGAN (1996) and BALDING and NICHOLS (1997). WEIR and COCKERHAM (1984) also used the same symbol {theta} for the coancestry coefficient (= FST), so we use {theta}WC for their {theta}. Our Formula 1 is equivalent to {theta}WC (WEIR 1996, pp. 47–48) and Holsinger's {theta}B (HOLSINGER 1999; HOLSINGER et al. 2002).

Maximum-likelihood estimation of hyperparameters and global FST:

The maximum-likelihood estimation of the hyperparameters has been discussed by LANGE (1995), KITADA et al. (2000), and BALDING (2003). A pseudo-likelihood approach was also taken by RANNALA and HARTIGAN (1996). In the maximum-likelihood framework, a method for the simultaneous estimation of Formula 1 and the linkage disequilibrium coefficient between two SNPs has been proposed (KITADA and KISHINO 2004). KITAKADO et al. (2006) proposed an integrated-likelihood approach to reduce the negative bias of Formula 1 particularly for cases with few sampling points.

Suppose that Formula 1 alleles of diploid organisms (Nk/2 individuals) are counted at locus l and Formula 1 denotes a vector of observed allele counts at the locus in subpopulation k. We assume that all individuals are successfully genotyped at all loci, so Formula 1 The marginal likelihood of the observed allele counts at a locus nkl has a Dirichlet-multinomial distribution (LANGE 1995; RANNALA and HARTIGAN 1996; WEIR 1996; BALDING and NICHOLS 1997; KITADA et al. 2000; BALDING 2003; ROUSSET 2003). The parameters to be estimated are Formula 1 Because we assume the independence of subpopulations, the overall likelihood for these parameters is given by the product of the likelihood functions for K samples, as

Formula 2(2)
The hyperparameters {alpha}l are estimated by maximizing this marginal likelihood (LANGE 1995; KITADA et al. 2000). Our method can be used for both allele and haplotype counts without modification, but some notations differ slightly. For haploid organisms, Nk refers to the individuals genotyped; and nkl should be nk and {alpha}lj should be {alpha}j. Henceforth, for simplicity, we focus on diploid organisms.

The locus-specific Formula 2 can be estimated by substituting Formula 2 for {theta}l in Equation 1. The variance estimator for Formula 2 is calculated from the Fisher information matrix for {alpha}l, as Formula 2 The asymptotic variance for Formula 2 is estimated using the Delta method (SEBER 1982) as

Formula 3(3)
In our metapopulation model or infinite-island model, the sampled localities are regarded as a sample from all possible demes or subpopulations, including those not sampled. Hence, Equation 2 estimates the locus-specific genetic differentiation under the random-effect model of population sampling (WEIR 1996). The average estimate of Formula 3 for all loci is calculated as an arithmetic mean across the loci.

Empirical Bayes estimation of pairwise FST:

The posterior distribution of allele frequencies pkl at locus l in subpopulation k is again a Dirichlet distribution, with parameters modified by the sample allele counts

Formula 4(4)
(LANGE 1995; WEIR 1996). Given the estimates of the hyperparameters and the sampled allele counts, random numbers of pkl can be generated through this posterior distribution. The posterior distributions for any parametric functions of pkl can then be simulated by the empirical Bayes procedure (KITADA et al. 2000).

When population differentiation between or among specific subpopulations is of interest, the selected populations can be regarded as the entire set of populations. Hence, applying the fixed-effect model of population sampling (WEIR 1996) is appropriate. Therefore, we use Nei's GST formula (NEI and CHESSER 1983), which defines quantities with respect to fixed extant populations (COCKERHAM and WEIR 1986), to estimate the posterior distributions of pairwise FST's (hereafter denoted as Formula 4), as did HOLSINGER (1999) and CORANDER et al. (2003) in estimating global FST. Nei's gene diversity analysis compares expected heterozygosities under Hardy–Weinberg equilibrium (HWE), and the GST estimator is expressed as a function of allele frequencies. Therefore the posterior distribution of Formula 4 at each locus can easily be generated on the basis of the GST estimator, without using genotype frequencies. We set the number of each simulation to 10,000, so 10,000 Formula 4's are calculated at each locus from the 10,000 sets of allele frequencies pkl between a set of two populations. From the posterior distribution of Formula 4 the posterior mean and 95% credible interval are calculated. We use the posterior mean as the empirical Bayes estimator of locus-specific Formula 4 We can also calculate the probability that Formula 4 is smaller than an arbitrary value [e.g., P(Formula 4)], which can be used as the criterion for population differentiation (WAPLES and GAGGIOTTI 2006; PALSBØLL et al. 2007). The average estimate of Formula 4 for overall loci is calculated as an arithmetic mean across the loci.

ROSENBERG et al. (2003) proposed a general measure for determining the amount of information on individual ancestry on the basis of the Kullback–Leibler information. The informativeness for assignment In is defined as

Formula 5(5)
where Formula 5 The authors showed that In and FST are very closely correlated but that In is more informative than the standard SNP-specific pairwise FST. In an additional analysis, we examine how our empirical Bayes method works to measure this under the same simulation protocol.

Inferring heterogeneity of global FST among loci:

We estimate locus-specific Formula 5's on the basis of Formula 5 estimated at each locus. Evolutionary forces may differ among sites in the genome. Therefore, it is important to investigate the heterogeneity of FST among loci. One practical analysis is to test the null hypothesis H0, the homogeneity of Formula 5 among L loci, Formula 5) against the alternative hypothesis H1, the heterogeneity of Formula 5 among loci, on the basis of estimates of Formula 5 When a large number of subpopulations are sampled, the maximum-likelihood estimate Formula 5 follows a normal distribution of Formula 5 The maximum likelihood under H0 is then given as Formula 5 where Formula 5 Formula 5 and Formula 5 The maximum likelihood under H1 is Formula 5 maximizing Formula 5 by Formula 5 The negative twice-log-likelihood ratio is then Formula 5 which follows the {chi}2-distribution with (L – 1) d.f. under the null hypothesis. We can test the heterogeneity of Formula 5 on the basis of the test statistics.

The other approach to investigate the heterogeneity of Formula 5 is to estimate the distribution of Formula 5 in the genome. In recent years, the number of loci analyzed has been increasing in ecological studies, but is still smaller than those used for human SNPs. For such cases, it would be difficult to directly estimate the specific distribution of Formula 5 from the data. Here, we estimate the distribution of Formula 5 in the genome from estimates of randomly selected loci, Formula 5 When the distribution is expressed by the parametric model Formula 5 the unknown parameter Formula 5 which defines the distribution of Formula 5 is estimated by maximizing the log marginal likelihood

Formula 5
where Formula 5 is the distribution of Formula 5 Here, we assume the independence of loci Formula 5 Because the maximum-likelihood estimator Formula 5 is a sufficient statistic, it is possible to estimate the distribution of Formula 5 in the genome on the basis of the estimates Formula 5 for randomly sampled loci, instead of using a direct estimation from the data.

For the preliminary discussion here, we assume that Formula 5 is normally distributed in the genome. When the distribution of Formula 5 is expected to be different from 0, a simple approximation may be a normal distribution. We then assume Formula 5 follows N(µ, {sigma}2) as a first step in estimating the distribution of Formula 5 in the genome under the limited number of loci analyzed. In this case, the parameter vector {rho} refers to µ and {sigma}2. The general form of the log marginal likelihood given above becomes

Formula 6(6)
Here, Formula 6 is the variance of the estimates, Formula 6 We estimate µ and {sigma}2 numerically, regarding Formula 6 as Formula 6 The distribution of Formula 6's can also be calculated with slight modifications to the above procedure.


RESULTS

Improved precision of our empirical Bayes estimator for pairwise FST:

We investigated the performance of our method of estimating pairwise Formula 6 using numerical simulations. Random vectors of allele frequencies at locus l in subpopulation k, pkl's, were generated independently from the Dirichlet distribution with the parameter {alpha}l(= {theta}l ßl). Here, the number of sampling localities (K) was set at 5, 10, and 50, and the mean allele frequencies at a locus were assumed Formula 6 with Jl = 50. As the true values of the global FST,l's (= 1/(1 + {theta}l)), we chose four different levels: Formula 6 = 0.01, 0.05, 0.1, and 0.2. The sample size (Nk/2) was deemed to be common to all the localities and was set at 20, 30, and 50 individuals. Then, allele counts nkl's were drawn independently from the multinomial distribution Multi(Nk, pkl) for K localities. The pairwise Formula 6 values between the first and second localities were evaluated by the conventional Nei's GST estimator and the empirical Bayes method. In the latter procedure, 500 Formula 6's were simulated on the basis of Nei's GST formula to save computation time, and the posterior mean was calculated as the estimator of the pairwise FST. The point estimate for the conventional GST estimator and the posterior mean of the empirical Bayes estimates were compared with the true Formula 6 These procedures were iterated R = 1000 times.

Figure 1 and supplemental Figure S1 at http://www.genetics.org/supplemental/ show the general features of the empirical Bayes estimator compared with those of the conventional GST estimator. The former examines the case of a relatively large number of sampling points and a limited sample size from each sampling point (K = 50, Nk/2 = 20). The latter investigates the case of moderately large samples from a small number of sampling points (K = 10, Nk/2 = 50), which is common in ecological studies. It is clearly apparent that the conventional GST estimator greatly overestimates the true values and has a large variance, especially for small Formula 6 In contrast, the empirical Bayes estimates of Formula 6's shrink toward the average of the true Formula 6 and reduce the positive bias of the conventional estimator. This is reasonable because our empirical Bayes estimator can be interpreted as a shrinkage estimator (STEIN 1956; MARITZ and LEWIN 1989).


Figure 1
View larger version (27K):
In this window
In a new window
Download PPT slide
 
FIGURE 1.—

The conventional GST (A) and empirical Bayes (B) estimates of pairwise Formula 6 from 1000 simulations under the infinite-island model at various levels of Formula 6 (0.01, 0.05, 0.1, 0.2) over subpopulations. The mean allele frequencies assumed were Formula 6 with J = 50. The number of sampling localities (K) was set at 50. The sample size Nk/2 (individuals) was common to all localities and was set at 20 individuals.

 
The effects of the number of sampling points K and sampled individuals Nk/2 are also shown in the supplemental material at http://www.genetics.org/supplemental/. The numbers of sampling points did not affect the positive bias of the conventional GST estimator because K was fixed at 2. The empirical Bayes procedure provided larger variation for smaller numbers of subpopulations, especially for small Formula 6 values, and vice versa (supplemental Figure S2 at http://www.genetics.org/supplemental/). Conversely, larger numbers of sampled individuals reduced the positive bias of the conventional GST estimator and the variation of the empirical Bayes estimator (supplemental Figure S3 at http://www.genetics.org/supplemental/).

For a quantitative comparison of the two estimators, we used the following two measures: Formula 6 (relative mean bias) and Formula 6 (root relative mean squared error), where Formula 6 and Formula 6 are an estimate and the true value, respectively, for pairwise Formula 6 in the r th iteration. As shown in Table 1, smaller K and sample sizes (Nk/2) resulted in larger positive biases for the conventional estimator for various levels of true Formula 6 values. The bias and variation became smaller as Formula 6 became larger. In contrast, the empirical Bayes estimator provided smaller biases and variations for all cases of Formula 6 although smaller K and sample sizes resulted in a slight negative bias.


View this table:
In this window
In a new window

 
TABLE 1

Relative mean bias and root relative mean squared error (in parentheses) of the conventional GST and empirical Bayes estimators of pairwise FST from 1000 simulations at various levels of global FST (see text)

 
The negative bias of the empirical Bayes estimator of pairwise FST is large, especially when gene flow is large and the estimation is based on a sample from a few sampling points. Underestimation of population differences should lead to optimistic management strategies. Therefore, it is recommended in the conservation genetics of birds and fish that samples be collected from as many sampling points as possible. It is noted that the bias of the empirical Bayes estimator is reduced by increasing the number of sampling points, whereas the bias of the conventional GST estimator is not.

We estimated Rosenberg et al.'s informativeness of assignment In between the first and second localities using Equation 5 (K = 2) and the empirical Bayes method based on the In estimator for the case of Formula 6 The mean allele frequencies were assumed to be Formula 6 with J = 50. The number of sampling localities (K) was set at 50 and the sample size (Nk/2 individuals) was set at 20 individuals for each sampling point. Figure 2 shows that the conventional estimator for In was positively biased, whereas the empirical Bayes estimator of In performed much better, consistent with the fact that In produces upwardly biased estimates in small samples (ROSENBERG et al. 2003).


Figure 2
View larger version (14K):
In this window
In a new window
Download PPT slide
 
FIGURE 2.—

The conventional informativeness of assignment In (ROSENBERG et al. 2003) (A) and empirical Bayes (B) estimates of In from 1000 simulations for the case of Formula 6 The mean allele frequencies assumed were Formula 6 with J = 50. The number of sampling localities (K) was set at 50 and the sample size Nk/2 (individuals) was common to all localities and was set at 20 individuals.

 

Robustness of our shrinkage estimator for pairwise FST:

The robustness of our Formula 6 estimator was explored using numerical simulations for non-Dirichlet distributions of the allele frequencies. We considered cases in which genetic differentiation becomes larger with geographic distance. Such a stepping-stone model is biologically realistic and may be common (PALSBØLL et al. 2007). We set the number of sampled subpopulations (K) to 15 and the Formula 6 between two adjacent populations to 0.001 (case 1) or 0.0005 (case 2). We considered biallelic cases and set the allele frequencies to (0.5, 0.5) at a locus for the middle population. We then calculated the allele frequencies pkl's at the locus for another 14 subpopulations by numerical optimization. The sample size (Nk/2) was deemed to be common to all localities and was set at 20 individuals. Then, allele counts nk's were drawn independently from the multinomial distribution Multi(Nk, pkl) for 15 localities. A total of 105 pairwise FST values between all sets of the two localities were evaluated by the conventional GST and the empirical Bayes estimator following the simulation protocol described above.

As shown in Figure 3 (case 1) and supplemental Figure S4 at http://www.genetics.org/supplemental/ (case 2, top left), our empirical Bayes procedure provided better estimates than those of the conventional GST estimator for smaller Formula 6's (~ <0.06 for case 1 and 0.04 for case 2). In conservation, management units should be defined among subpopulations with little genetic differentiation. PALSBØLL et al. (2007) concluded that Formula 6 could be used as the criterion for deciding the separate management units of sockeye salmon spawning sites. For such cases with very small genetic differentiation, our method performs more efficiently, even for stepping-stone models. On the contrary, the conventional GST estimator displayed a positive bias for the whole range of Formula 6's in both cases. Our method reduced the positive bias for small Formula 6's and the bias became negative for larger Formula 6's. Reflecting the characteristics of the shrinkage estimator, the relative mean bias of the empirical Bayes estimator, which was the average of the 105 pairwise FST estimates, was slightly (1.06 times) larger than that of the conventional estimator for case 1 (Table 2). The precision of our estimates was much better for the whole range of Formula 6 [Figure 3, top right (case 1) and supplemental Figure S4 (case 2)].


Figure 3
View larger version (22K):
In this window
In a new window
Download PPT slide
 
FIGURE 3.—

Mean (top left) and root relative mean squared error (top right) of the conventional GST (red circle) and empirical Bayes estimators (blue circle) of pairwise Formula 6 from 1000 simulations under the stepping-stone models. Means and root MSEs were plotted on the true Formula 6's which fluctuated very slightly when small uniform random variables were added to prevent the points overlapping heavily. The number of subpopulations (K) was set at 15 and the pairwise Formula 6 between two adjacent populations was set at 0.001 (case 1) and 0.0005 (case 2). The sample size Nk/2 (individuals) was common to all localities and set at 20 individuals. Only the results for case 1 are shown. The results of the MDS analysis of two data sets are given in the bottom section; black circles show the true population structure, and the estimated population structure based on the conventional (red "*") and empirical Bayes estimates (blue "+") of Formula 6 is shown.

 

View this table:
In this window
In a new window

 
TABLE 2

Relative mean bias and root relative mean squared error of the conventional GST and empirical Bayes estimators of pairwise FST from 1000 simulations under the stepping-stone models (see text)

 
We also investigated the robustness of estimating population structure on the basis of estimated Formula 6's. The results of the multidimensional scaling (MDS) (TORGERSON 1952; YOUNG and HAMER 1987) analysis of two data sets in the simulation showed that our method describes the true population structure well. The conventional GST estimator also worked, despite the larger positive bias and variance (Figure 3, bottom).

The case of Pacific herring:

We analyzed six geographical samples of the Pacific herring Clupea pallasii from spawning areas in Lake Akkeshi (AK), Yudonuma Lake (YD), and Funka Bay (FK), which are located off the east coast of Hokkaido in Japan, and Obuchinuma Lake (OB), Miyako Bay (MY), and Matsushima Bay (MT), located off the northern Pacific coast of Honshu. Hatchery fish, released and recaptured in Lake Akkeshi (AKH) and Miyako Bay (MYH), were also distinguished from wild fish on the basis of otoliths stained with alizarin complexon. A total of 2055 mature individuals were genotyped at five microsatellite loci. Allele frequencies are given in supplemental Tables S1–S5 at http://www.genetics.org/supplemental/. HWE was satisfied in each sample except at four localities for three loci, and the assumption of our metapopulation model was considered to be satisfied (supplemental Figure S5). On the basis of the estimates of the hyperparameters (supplemental Figure S6), the scale parameter {theta}l and Formula 6 were estimated over all subpopulations (Table 3).


View this table:
In this window
In a new window

 
TABLE 3

Estimated locus-specific scale parameters {theta} and Formula 6 over subpopulations of the Pacific herring, with standard errors in parentheses

 
We estimated the posterior distributions of Formula 6 for all sets of subpopulations at all loci. As an example, the posterior distributions between FK and MY at the five loci are shown in Figure 4. The posterior means of Formula 6 varied from 0.0064 to 0.0245, with the 95% credible intervals in parentheses (Table 4). The P-values in Figures 4 and 5 are for the homogeneity contingency test performed with Genepop (RAYMOND and ROUSSET 1995). At all loci, the allelic differences between FK and MY were highly significant (P < 0.0000). We used Formula 6 of 0.01 as a population criterion and defined the probability of Formula 6 ≤ 0.01 as P*. Here, Formula 6 = 0.01 refers to Formula 6 which means that the effective number of migrants is 25 individuals per generation (WAPLES and GAGGIOTTI 2006), where Ne is the effective population size and m is the migration rate. The P*-value was near 1.0 for Cha 63, indicating that the genetic differentiation at the locus was small. In contrast, P*-values were 0 or near 0 for Cha 17, Cha 113, and Cha 123 and 0.5318 for Cha 20 (Figure 4). The posterior distribution of the average Formula 6 over all the loci was calculated as the mean of the posterior distributions at five loci (Figure 4, bottom right). Both the P- and P*-values coincided at 0, showing significant genetic differentiation between Funka Bay and Miyako Bay.


Figure 4
View larger version (21K):
In this window
In a new window
Download PPT slide
 
FIGURE 4.—

Posterior distributions of Formula 6 for the Pacific herring between Funka Bay and Miyako Bay at each locus, and over all loci, which were averaged over Formula 6 at five loci.

 

View this table:
In this window
In a new window

 
TABLE 4

Posterior means and 95% credible and confidence intervals for pairwise FST of the Pacific herring between Funka Bay and Miyako Bay at each locus

 

Figure 5
View larger version (27K):
In this window
In a new window
Download PPT slide
 
FIGURE 5.—

Posterior distributions of Formula 6 for the Pacific herring over all loci, which were averaged over Formula 6 for five loci: AK, Lake Akkeshi; YD, Yudonuma Lake; FK, Funka Bay; OB, Obuchinuma Lake; MY, Miyako Bay; and MT, Matsushima Bay.

 
The posterior distributions of average Formula 6 over all loci for all pairs of wild subpopulations were also simulated as simple means of the posterior distributions at five loci (Figure 5). The allelic differences were highly significant with P = 0.0000, except between MY and MT. The criterion P* resulted in a very different evaluation of population differentiation, even for the same P-value of 0.0000. This result clearly shows the difficulty in the hypothesis-testing framework in evaluating the genetic differentiation between subpopulations (e.g., DIZON et al. 1995; RYMAN and JORDE 2001; RYMAN et al. 2006). The P*-based criterion works well in this case and is recommended for use in hypothesis testing.

The variation in Formula 6 over five loci was not trivial, with a coefficient of variation (CV) of 23% (Table 3). However, the negative twice log-likelihood ratio {lambda} was calculated to be 6.9264 and the hypothesis of constant Formula 6 for all loci was not rejected (P = 0.8602, d.f. = 4). We estimated the distribution of Formula 6 in the genome, assuming normality based on Equation 6. The maximum-likelihood estimates (MLEs) for µ and {sigma} were obtained with 90, 95, and 99% confidence intervals, which define the distribution of Formula 6 in the genome via the likelihood profile (Figure 6A). The MLE for µ, with the 95% confidence interval, was 0.0160 (0.0128, 0.0206) [Figure 6A (a, e)] and for {sigma} was 0 (0, 0.00716) [Figure 6A (0, c)]. The weighted mean Formula 6 coincided with the MLE of µ (Table 3) and the distribution of the weighted mean Formula 6 shown as the blue line in Figure 6B, described well the 95% confidence interval of µ [Figure 6A (a, e)]. Here, Formula 6 and Formula 6 were estimated with Equation 3. The log-likelihood profile for {sigma} was monotonic and decreasing, indicating that the point estimate of {sigma} was 0 (supplemental Figure S7 at http://www.genetics.org/supplemental/). Hence, the point estimate of the distribution of Formula 6 for the Pacific herring is constant, supporting the hypothesis of constant Formula 6 throughout the genome. However, the distribution of Formula 6 has uncertainty, which accounts for the confidence regions of µ and {sigma} (Figure 6B).


Figure 6
Figure 6
View larger version (32K):
In this window
In a new window
Download PPT slide
 
FIGURE 6.—

Confidence regions of the distribution of Formula 6 in the genome of the Pacific herring, assuming a normal distribution (see the text). (A) The confidence regions of µ and {sigma}2, which specify the distribution. (B) The MLE distribution (red line, the delta distribution) and the representative distributions on the boundary of the confidence regions (a–e), which correspond to the points in A. The distribution of the weighted mean of Formula 6 is superimposed (blue line).

 


DISCUSSION
Our empirical Bayes estimator of Formula 6 performed better than the conventional GST estimator for various levels of Formula 6 and various sampling conditions under the infinite-island model. Even for non-Dirichlet distributions of allele frequencies, such as stepping-stone models, our method provided better estimates of Formula 6 than did the conventional GST, especially for cases with large gene flow.

Integrated likelihood method:

The empirical Bayes estimator of Formula 6's is negatively biased, especially when the population has large gene flow and the estimation is based on a sample from only a few sampling points (Table 1, supplemental Figure S2 at http://www.genetics.org/supplemental/). With a small number of sampling points, the MLE of {theta} is not precise. Therefore, it is recommended in the conservation genetic analysis of such a population that the samples be collected from as many sampling points as possible. When sampling from many localities is not feasible, the integrated-likelihood method (KITAKADO et al. 2006) can reduce the negative bias of Formula 6 We estimated pairwise Formula 6's by the empirical Bayes method based on integrated-likelihood estimates (ILEs) of {theta} for cases with Formula 6 = 0.01, K = 5, and Nk/2 = 20, 30, or 50 with the same simulation protocol used for Table 1 (supplemental Figure S8 at http://www.genetics.org/supplemental/). The relative mean biases of the Formula 6 estimates, with root relative mean squared errors in parentheses, were –0.238 (0.506), –0.132 (0.365), and –0.060 (0.272), respectively. These values were much smaller than those estimated on the basis of the MLE of {theta}, given in Table 1 (top three rows). The negative bias of Formula 6's based on the MLE was reduced to 32.8, 24.4, and 18.2% for Nk/2 = 20, 30, and 50, respectively. The integrated-likelihood method uses a uniform prior for the mean allele frequency ßl and eliminates ßl by integration regarding it as a nuisance parameter. By using the ILE of {theta} instead of the MLE, the empirical Bayes method proposed here provides nearly unbiased estimates of Formula 6's when the sample sizes (Nk/2 individuals) are large and works more efficiently, even for cases with a small number of sampling points (supplemental Figure S8).

Weir and Cockerham's Formula 6WC:

When we estimate genetic differentiation between two specific subpopulations, selected subpopulations can be regarded as the entire set of populations. Nei's GST formula defines quantities with respect to fixed extant populations (COCKERHAM and WEIR 1986). In addition, GST is a function of allele frequencies under HWE, and the posterior distribution can easily be simulated from only allele frequencies. Therefore, we used the GST estimator to estimate the posterior distribution of pairwise FST.

The citation record suggests that the most widely used estimator for Wright's FST is Weir and Cockerham's Formula 6 (WEIR and HILL 2002). This moment estimator takes the sampling error for subpopulations into account and essentially estimates the global FST, Formula 6 The estimator is also widely used to estimate pairwise FST's among fixed pairs of populations. With the assumption of no local inbreeding, {theta}WC is estimated only from sample allele frequencies, but these need to be inferred from sample genotype frequencies (WEIR and HILL 2002). With J alleles at a locus, the number of possible genotypes is J(J + 1)/2. In microsatellite DNA analyses, J is generally large. If J = 50, the number of genotypes is 1225. Such a situation makes our simulation more complicated and the uncertainty of the genotype counts becomes large under small or moderate sample sizes.

Here, we investigated the properties of Formula 6 on estimating pairwise FST on the basis of the relationship between GST and {theta}WC. Weir and Cockerham's estimates Formula 6 can be approximated as a function of GST by Equation 2 in WEICKER et al. (2001): Formula 6 Using this equation, we calculated the conventional estimates of pairwise Weir and Cockerham's {theta}WC from GST estimates with the simulation protocol described in RESULTS. We examined cases with a global FST of Formula 6 = 0.01, 0.05, 0.1, and 0.2. The mean allele frequencies were assumed Formula 6 with J = 50. The number of sampling localities K was set at 10 and the sample size Nk/2 (individuals) was deemed to be common to all the localities and was set at 50 individuals. As shown in supplemental Figure S9 at http://www.genetics.org/supplemental/, the two estimators have linear relationships. Hence, our simulation results on the conventional GST estimator can be extended straightforwardly to pairwise {theta}WC. In fact, the pairwise Formula 6-values calculated for the Pacific herring with Genepop (RAYMOND and ROUSSET 1995) were 1.93 ± 0.47 times larger than the posterior means of Formula 6 (supplemental Figure S10 at http://www.genetics.org/supplemental/), which coincides with our simulations for small Formula 6's (Figure 1, supplemental Figures S1 and S11).

Weir and Cockerham's estimator Formula 6 is nearly unbiased (RAUFASTE and BONHOMME 2000), although it has a negative bias for the two-allele case (WEIR and HILL 2002). Nevertheless, when estimating the pairwise FST, Formula 6 is considered to have a large positive bias, especially for species with large gene flows. Nei's GST and Rosenberg et al.'s informativeness of assignment In also showed the same phenomenon, suggesting the positive bias is irrelevant to the estimators. This positive bias of the conventional estimators was larger for smaller genetic differentiation. This might be caused by the large variation in the sample allele frequencies, which is larger for smaller sample sizes (individuals) and largely exceeded the real variation between subpopulations. The shrinkage estimator stabilizes such variation and provides better estimates.

Weir and Hill's normal theory:

WEIR and HILL's (2002) normal theory approach has the same variance as a Dirichlet distribution when i != i' in their notation. Hence, their estimator Formula 6 is equivalent to our Formula 6 Formula 6 is the variance of the allele frequencies among subpopulations relative to the total population. Hence, Formula 6 refers to a sample variance of allele frequencies, and therefore Formula 6 follows a {chi}2-distribution when the number of sampling points K is sufficiently large, as shown by WEIR and HILL (2002). The shape of the posterior distribution of Formula 6 at each locus was unimodal and slightly right tailed, which reminded us of {chi}2-distributions (Figure 4). We estimated Weir and Hill's confidence intervals by substituting the posterior mean of Formula 6 with Formula 6 as Formula 6 where d.f. = (K – 1)(J – 1) is the degrees of freedom. The confidence intervals of Formula 6 obtained with the {chi}2-approximation coincided well with the credible intervals calculated from the posterior distributions, although our credible intervals were narrower than the confidence intervals of the {chi}2-approximation, which were slightly right tailed (Table 4). The slight difference in the intervals might have been the effect of the small K(= 8) on the {chi}2-approximation, although it was not substantial. On the contrary, the confidence interval for the sample mean over all loci was narrower than our credible interval (Table 4). The distribution of a sample mean is normal when the sample size is large with the variance reduced by the central limit theorem. A {chi}2-distribution approaches a normal distribution as the degrees of freedom become larger. For our case of 1309 d.f. [Formula 6 as given in WEIR and HILL 2002], the two distributions are equal. The property of the sample mean should cause the narrower confidence interval of the normal theory approach. The result shows that the posterior distribution of Formula 6 describes the distribution of Formula 6 well, both for each locus and for the average over all loci.

LD among loci:

The case study of the Pacific herring, based on a few microsatellite markers, did not detect significant variation in Formula 6 among loci in the genome (Figure 6). The assumption of normality for the MLE of Formula 6 is valid when more than a few sampling points are surveyed. This assumption might be violated when the data are collected from only a few sampling points or Formula 6 is close to 0 or 1.0. We also assumed the independence of loci Formula 6 However, it is necessary to take into account the linkage disequilibrium (LD) among loci, when the molecular markers are tightly linked.

Recent progress in whole-genome analysis of human populations provides a new perspective on the inference of FST and its distribution in genomes (GARTE 2003; HINDS et al. 2005; WEIR et al. 2005; WALSH et al. 2006). In their Figure 1, WEIR et al. (2005) showed that values for the single-locus marker Formula 6 over the whole human genome for three (Perlegen) or four (HapMap) populations had a distribution very much like the {chi}2-distributions with 2 or 3 d.f. and suggested that values of Formula 6 are genome-region specific. However, Formula 6 follows a {chi}2-distribution under constant Formula 6 as reported by WEIR and HILL (2002) and as demonstrated in our analysis of the Pacific herring. Therefore, the distributions of the values for the single-locus marker Formula 6 in WEIR et al. (2005) do not necessarily support the genome-region-specific FST hypothesis in the human genome.

Weir et al.'s 5-Mb window average values for Formula 6 were close to normal because of the property of the sample mean. The standard deviations (SDs) of the 5-Mb window average values of Formula 6 decreased substantially from single-locus estimates of 0.12 to 0.02 for HapMap and from 0.11 to 0.02 for Perlegen (Tables 2 and 3 in WEIR et al. 2005). The average number of markers in a 5-Mb window was 1114 for HapMap and 1834 for Perlegen (calculated from Table 1 in WEIR et al. 2005). Hence, these SDs are expected to be Formula 6 and Formula 6 if all SNPs are independent and distributed identically in the genome. The effective sample size (KISH 1965, p. 259) could be much smaller than the real number of SNPs in the 5-Mb windows if the Formula 6's are correlated, because of LD between the SNPs. A coalescent simulation of human population history implies that linkage equilibrium holds for SNPs separated by >10–100 kb (Figure 1 in KRUGLYAK 1999). Therefore, we can estimate the effective number of human SNPs per 100-kb window to be ~1–20. If we assumed it to be ~10, the effective number of SNPs per 5-Mb window becomes 500. Therefore, the SDs of the 5-Mb window Formula 6 are expected to be Formula 6 for HapMap and Formula 6 for Perlegen. The actual value (0.02) is much larger than these, even when the LD among the SNPs is taken into account. This discrepancy can be explained by the large-scale heterogeneity of FST between the 5-Mb windows. New data show that LD is highly structured into discrete blocks of sequences separated by hot spots of recombination (GOLDSTEIN 2001; MCVEAN et al. 2004) and differs among species (HERNANDEZ et al. 2007). The simultaneous estimation of FST's of SNPs and the LD between the SNPs should give us an accurate picture of the distribution of FST in genomes.


ACKNOWLEDGEMENTS
We are grateful to Bruce Weir for valuable comments on the manuscript and encouragement with this work. We also thank the anonymous reviewers for their constructive comments, which improved the earlier version of the manuscript. This work was supported by grants from the Japan Society for the Promotion of Science.


LITERATURE CITED

BALDING, D. J., 2003 Likelihood-based inference for genetic correlation coefficients. Theor. Popul. Biol. 63: 221–230.[CrossRef][Medline]

BALDING, D. J., and R. A. NICHOLS, 1997 Significant genetic correlations among Caucasians at forensic DNA loci. Heredity 78: 583–589.[CrossRef][Medline]

BALLOUX, F., and N. LUGON-MOULIN, 2002 The estimation of population differentiation with microsatellite markers. Mol. Ecol. 11: 155–165.[CrossRef][Medline]

COCKERHAM, C. C., 1969 Variance of gene frequencies. Evolution 23: 72–83.[CrossRef]

COCKERHAM, C. C., 1973 Analysis of gene frequencies. Genetics 74: 679–700.[Abstract/Free Full Text]

COCKERHAM, C. C., and B. S. WEIR, 1986 Estimation of inbreeding parameters in stratified populations. Ann. Hum. Genet. 50: 271–281.[Medline]

CORANDER, J., P. WALDMANN and M. J. SILLANPÄÄ, 2003 Bayesian analysis of genetic differentiation between populations. Genetics 163: 367–374.[Abstract/Free Full Text]

DIZON, A. E., B. L. TAYLOR and G. M. O'CORRY-CROWE, 1995 Why statistical power is necessary to link analyses of molecular variation to decisions about population structure, pp. 288–294 in Evolution and the Aquatic Ecosystem, edited by J. L. NIELSEN and D. A. POWERS. American Fisheries Society Symposium 17, Bethesda, MD.

EWENS, W. J., 2004 Mathematical Population Genetics, I. Theoretical Introduction, Ed. 2. Springer, Berlin/Heidelberg, Germany/New York.

EXCOFFIER, L., 2003 Analysis of population subdivision, pp. 713–750 in Handbook of Statistical Genetics, Ed. 2, edited by D. J. BALDING, M. BISHOP and C. CANNINGS. Wiley, Chichester, UK.

EXCOFFIER, L., G. LAVAL and S. SCHNEIDER, 2005 Arlequin ver. 3.0: an integrated software package for population genetics data analysis. Evol. Bioinform. Online 1: 47–50.

GARTE, S., 2003 Locus-specific genetic diversity between human populations: an analysis of the literature. Am. J. Hum. Biol. 15: 814–823.[CrossRef][Medline]

GOLDSTEIN, D. B., 2001 Islands of linkage disequilibrium. Nat. Genet. 29: 109–111.[CrossRef][Medline]

GOUDET, J., 1995 FSTAT (version 1.2): a computer program to calculate F-statistics. J. Hered. 86: 485–486.[Free Full Text]

GOUDET, J., M. RAYMOND, T. DE MEEÜS and F. ROUSSET, 1996 Testing differentiation in diploid populations. Genetics 144: 1933–1940.[Abstract]

HANSKI, I., and O. E. GAGGIOTTI, 2004 Ecology, Genetics, and Evolution of Metapopulations. Academic Press, San Diego.

HERNANDEZ, R. D., M. J. HUBISZ, D. A. WHEELER, D. G. SMITH, B. FERGUSON et al., 2007 Demographic histories and patterns of linkage disequilibrium in Chinese and Indian rhesus macaques. Science 316: 240–243.[Abstract/Free Full Text]

HINDS, D. A., L. L. STUVE, G. B. NELSEN, E. HALPERIN, E. ESKIN et al., 2005 Whole-genome patterns of common DNA variation in three human populations. Science 307: 1072–1079.[Abstract/Free Full Text]

HOLSINGER, K. E., 1999 Analysis of genetic diversity in geographically structured populations: a Bayesian perspective. Hereditas 130: 245–255.[CrossRef]

HOLSINGER, K. E., P. O. LEWIS and D. K. DEY, 2002 A Bayesian approach to inferring population structure from dominant markers. Mol. Ecol. 11: 1157–1164.[CrossRef][Medline]

HUELSENBECK, J. P., and P. ANDOLFATTO, 2007 Inference of population structure under a Dirichlet process model. Genetics 175: 1787–1802.[Abstract/Free Full Text]

KISH, L., 1965 Survey Sampling. Wiley, New York.

KITADA, S., and H. KISHINO, 2004 Simultaneous detection of linkage disequilibrium and genetic differentiation of subdivided populations. Genetics 167: 2003–2013.[Abstract/Free Full Text]

KITADA, S., T. HAYASHI and H. KISHINO, 2000 Empirical Bayes procedure for estimating genetic distance between populations and effective population size. Genetics 156: 2063–2079.[Abstract/Free Full Text]

KITAKADO, T., S. KITADA, H. KISHINO and H. J. SKAUG, 2006 An integrated-likelihood method for estimating genetic differentiation between populations. Genetics 173: 2073–2082.[Abstract/Free Full Text]

KRUGLYAK, L., 1999 Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22: 139–144.[CrossRef][Medline]

LANGE, K., 1995 Application of the Dirichlet distribution to forensic match probabilities. Genetica 96: 107–117.[CrossRef][Medline]

LOCKWOOD, J. R., K. ROEDER and B. DEVLIN, 2001 A Bayesian hierarchical model for allele frequencies. Genet. Epidemiol. 20: 17–33.[CrossRef][Medline]

MARITZ, J. S., and T. L. LEWIN, 1989 Empirical Bayes Methods, Ed. 2. Chapman & Hall, London.

MCVEAN, G. A. T., S. R. MYERS, S. HUNT, P. DELOUKAS, D. R. BENTLEY et al., 2004 The fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584.[Abstract/Free Full Text]

NEI, M., 1973 Analysis of gene diversity in subdivided populations. Proc. Natl. Acad. Sci. USA 70: 3321–3323.[Abstract/Free Full Text]

NEI, M., and R. K. CHESSER, 1983 Estimation of fixation indices and gene diversities. Ann. Hum. Genet. 47: 253–259.[Medline]

PAETKAU, D., W. CALVERT, I. STIRLING and C. STROBECK, 1995 Microsatellite analysis of population structure in Canadian polar bears. Mol. Ecol. 4: 347–354.[Medline]

PALSBØLL, P. J., M. BÉRUBÉ and F. W. ALLENDORF, 2007 Identification of management units using population genetic data. Trends Ecol. Evol. 22: 11–16.[CrossRef][Medline]

PANNELL, J. R., and B. CHARLESWORTH, 2000 Effects of metapopulation processes on measures of genetic diversity. Philos. Trans. R. Soc. Lond. B 355: 1851–1864.[CrossRef][Medline]

PRITCHARD, J. K., M. STEPHENS and P. DONNELLY, 2000 Inference of population structure using multilocus genotype data. Genetics 156: 945–959.

RANNALA, B., and J. A. HARTIGAN, 1996 Estimating gene flow in island populations. Genet. Res. 67: 147–158.[Medline]

RAUFASTE, N., and F. BONHOMME, 2000 Properties of bias and variance of two multiallelic estimators of FST. Theor. Popul. Biol. 57: 285–296.[CrossRef][Medline]

RAYMOND, M., and F. ROUSSET, 1995 GENEPOP (version 3.4): population genetics software for exact tests and ecumenicism. J. Hered. 86: 248–249.[Free Full Text]

REYNOLDS, J., B. S. WEIRS and C. C. COCKERHAM, 1983 Estimation of the coancestry coefficient: basis for a short-term genetic distance. Genetics 105: 767–779.[Abstract/Free Full Text]

ROBERTSON, A., and W. G. HILL, 1984 Deviation from Hardy–Weinberg proportions: sample variances and use in estimation of inbreeding coefficients. Genetics 107: 703–718.[Abstract/Free Full Text]

ROSENBERG, N. A., L. M. LI, R. WARD and J. K. PRITCHARD, 2003 Informativeness of genetic markers for inference of ancestry. Am. J. Hum. Genet. 73: 1402–1422.[CrossRef][Medline]

ROUSSET, F., 2003 Inferences from spatial population genetics, pp. 681–712 in Handbook of Statistical Genetics, Ed. 2, edited by D. J. BALDING, M. BISHOP and C. CANNINGS. John Wiley & Sons, Chichester, UK.

RYMAN, N., and P. E. JORDE, 2001 Statistical power when testing for genetic differentiation. Mol. Ecol. 10: 2361–2373.[CrossRef][Medline]

RYMAN, N., S. PALM, C. ANDRÉ, G. R. CARVALHO, T. G. DAHLGREN et al., 2006 Power for detecting genetic divergence: differences between statistical methods and marker loci. Mol. Ecol. 15: 2031–2045.[CrossRef][Medline]

SEBER, G. A. F., 1982 The Estimation of Animal Abundance and Related Parameters, Ed. 2. Griffin, London.

STEIN, C., 1956 Inadmissibility of the usual estimator for the mean of a multivariate distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, California, Vol. 1, pp. 197–206. University of California Press, Berkeley, CA.

TORGERSON, W. S., 1952 Multidimensional scaling: 1. theory and methods. Psychometrika 17: 401–419.