A Method for Estimating Nucleotide Diversity From AFLP Data
Hideki Innan (a), Ryohei Terauchi (b), Günter Kahl (b), and Fumio Tajima (a)
a Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Hongo 7-3-1, Tokyo 113-0033, Japan
b Plant Molecular Biology, Biocenter, University of Frankfurt, D-60439 Frankfurt am Main, Germany
Corresponding author: Fumio Tajima, Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033, Japan. E-mail: ftajima{at}biol.s.u-tokyo.ac.jp
Communicating editor: A. G. CLARK
| ABSTRACT |
|---|
A method for estimating the nucleotide diversity from AFLP data is developed by using the relationship between the number of nucleotide changes and the proportion of shared bands. The estimation equation is based on the assumption that GC-content is 0.5. Computer simulations, however, show that this method gives a reasonably accurate estimate even when GC-content deviates from 0.5, as long as the number of nucleotide changes per site (nucleotide diversity) is small. As an example, the nucleotide diversity of the wild yam, Dioscorea tokoro, was estimated. The estimated nucleotide diversity is 0.0055, which is larger than the estimates obtained from nucleotide sequence data for Adh and Pgi.
THE amplified fragment length polymorphism (AFLP) technique, developed by ![]()
Although AFLP has been increasingly applied to linkage mapping of genomes in various organisms (![]()
Here, we report the application of the AFLP technique for estimating the nucleotide diversity ($\pi$), defined as the average number of pairwise nucleotide changes per site.
| ESTIMATION METHOD |
|---|
For estimation of the nucleotide diversity from AFLP data, we consider a random nucleotide sequence under the Jukes and Cantor model (JUKES and CANTOR 1969).
The nucleotide diversity ($\pi$) in a sample of n haploid individuals can be estimated by averaging the estimated numbers of nucleotide changes (d) over all pairs in the sample. Namely, $\pi$ can be estimated by

$$\hat{\pi} = \binom{n}{2}^{-1} \sum_{i<j} \hat{d}_{ij} \qquad (1)$$

where $\hat{d}_{ij}$ is the estimated number of nucleotide changes between the ith and jth haploid individuals. Note that an estimated value is marked with a circumflex.
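The averaging step described above can be sketched in a few lines. This is only an illustration: the function name `nucleotide_diversity` and the numbers in `d_hat` are hypothetical, not values from the study.

```python
from itertools import combinations

def nucleotide_diversity(d_hat):
    """Average the pairwise estimates d_hat[i][j] (i < j) over all pairs."""
    n = len(d_hat)
    pairs = list(combinations(range(n), 2))
    return sum(d_hat[i][j] for i, j in pairs) / len(pairs)

# Hypothetical estimates for n = 3 haploid individuals (only the upper triangle is read).
d_hat = [
    [0.0, 0.004, 0.006],
    [0.0, 0.0, 0.008],
    [0.0, 0.0, 0.0],
]
pi_hat = nucleotide_diversity(d_hat)
```

With n haploid individuals there are n(n - 1)/2 pairs, so the cost of this step grows quadratically with sample size.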
First, we consider the probability that a fragment is conserved by time t. If we follow the original protocol of the AFLP technique, we have three classes of PCR products: those flanked by EcoRI-adapters on both sides, those flanked by EcoRI- and MseI-adapters, and those flanked by MseI-adapters on both sides. As only EcoRI-primers are labeled, the first and second classes of fragments are visible on the autoradiograph. We call these two classes of fragments type 1 and type 2 fragments, respectively. Let Q1(L) and Q2(L) be the probabilities that type 1 and type 2 fragments with L nucleotides are conserved by time t. Note that L is not the real length of the amplified fragment; it is the nucleotide length of the fragment excluding the adapter sequences. In other words, L is the length of the sequence that originated from the genomic DNA. A fragment is conserved if no nucleotide change occurs at either primer site and no new restriction site appears between them. Let c1 and c2 be the numbers of selective bases of the EcoRI- and MseI-primers, respectively. Under the Jukes and Cantor model, the probability (p) that the nucleotide at a particular site is the same as that t generations ago is given by $p = \frac{1}{4} + \frac{3}{4} e^{-4\mu t/3}$
(![]()
![]()
![]()
![]() |
(2) |
In the same way, the probability that a new MseI site appears in a given 4-bp sequence, b2, is also obtained. Because the probability that one or more nucleotide substitutions occur in this 4-bp sequence by time t is $1 - p_2 = 1 - p^4$, b2 becomes

$$b_2 = a_2 (1 - p^4) \qquad (3)$$

where a2 is the probability that a new MseI site forms in the 4-bp sequence ($a_2 = 0.25^4$). In a fragment with L nucleotides, there are L - 6 + 1 possible 6-bp sequences and L - 4 + 1 possible 4-bp sequences. Then, using p'1, p'2, b1, and b2, we have Q1(L) and Q2(L), which are approximately given by
![]() |
(4) |
and
![]() |
(5) |
Equations 4 and 5 are approximations because the events in which a new restriction site appears are treated as independent for all the 6- or 4-bp sequences. Clearly, these events are not independent. For example, if a new EcoRI site forms in a 6-bp sequence, say the sequence between nucleotide positions x and x + 5 (x is the nucleotide position number from the 5' end of the fragment), a new EcoRI site cannot also form in the 6-bp sequences that start at positions x - 5, x - 4, ... , x - 1, x + 1, ... , x + 5. However, (4) and (5) can be good approximations (![]()
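The building blocks of this derivation can be written out as a small sketch. It assumes the standard Jukes and Cantor expression for p, and it assumes that a primer site of 6 (or 4) bp plus c selective bases is conserved with probability p raised to that total length; the function names are hypothetical and the expressions for Q1(L) and Q2(L) themselves are not reproduced here.

```python
import math

def p_same(mu_t):
    """Jukes-Cantor probability that a site shows the same nucleotide after time t.

    mu_t is the expected number of substitutions per site (mu * t).
    """
    return 0.25 + 0.75 * math.exp(-4.0 * mu_t / 3.0)

def new_site_prob(mu_t, site_len):
    """Probability that a given site_len-bp window becomes a new restriction site:
    (at least one substitution in the window) x (a random window matches the site)."""
    p = p_same(mu_t)
    return (1.0 - p ** site_len) * 0.25 ** site_len

def primer_site_conserved(mu_t, site_len, c):
    """Assumed form: recognition site plus c selective bases all unchanged."""
    return p_same(mu_t) ** (site_len + c)

# Example values for a small divergence (mu * t = 0.005 is an arbitrary choice).
b2 = new_site_prob(0.005, 4)                       # MseI recognition site, 4 bp
eco_primer_ok = primer_site_conserved(0.005, 6, 1) # EcoRI site with one selective base
```

Because new-site probabilities are tiny compared with the chance of losing a primer site, band loss in this model is driven mainly by substitutions at the primer sites.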
Next, we consider the distribution of L. Assume that L is restricted to the range between Lmin and Lmax, the minimum and maximum nucleotide lengths of the fragments that can be scored on the autoradiograph. Let G1(L) be the distribution of L of type 1 fragments and a'1 be the probability that a (6 + c1)-bp sequence matches the EcoRI-primer ($a'_1 = 0.25^{6+c_1}$). Then G1(L) is given by
![]() |
(6) |
where g1(L) is approximately given by
![]() |
(7) |
If we denote (1 - a1)(1 - a2) by A, (7) can be rewritten as
![]() |
(8) |
and (6) becomes
![]() |
(9) |
In the same way, we can obtain G2(L), the distribution of L of type 2 fragments. Let a'2 be the probability that a (4 + c2)-bp sequence matches the MseI-primer ($a'_2 = 0.25^{4+c_2}$). Then, we have
![]() |
(10) |
where g2(L) is approximately given by
![]() |
(11) |
After some calculations, (10) becomes
![]() |
(12) |
indicating that the distributions of L of types 1 and 2 fragments follow the same geometric distribution in the interval between Lmin and Lmax.
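A geometric distribution truncated to the scorable range can be sketched as follows. This is one reading of the statement above, in which the weight of length L is proportional to A^L with A = (1 - a1)(1 - a2). The value a1 = 0.25^6 is an assumption made by analogy with a2 = 0.25^4, and the range 50 to 500 bp is a hypothetical choice.

```python
def length_distribution(l_min, l_max, a1=0.25 ** 6, a2=0.25 ** 4):
    """Truncated geometric distribution of fragment length L on [l_min, l_max]."""
    ratio = (1.0 - a1) * (1.0 - a2)  # the quantity A in the text
    weights = {length: ratio ** length for length in range(l_min, l_max + 1)}
    total = sum(weights.values())
    return {length: w / total for length, w in weights.items()}

# Hypothetical scoring window of 50 to 500 bp.
G = length_distribution(50, 500)
```

Because A is close to 1, the distribution declines slowly with L, so short and long scorable fragments are almost equally frequent.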
Finally, we consider the relationship between the number of nucleotide changes (d) and the expected proportion of shared bands (F) for a pair of haploid individuals. Denote by R1 the average probability that a type 1 fragment is conserved by time t in both lineages of a pair of haploid individuals. When they diverged t generations ago, the expectation of d is 2µt. Therefore, R1 is written as the average of $Q_1(L)^2$ weighted by G1(L) in the interval between Lmin and Lmax. Namely,
![]() |
(13) |
In the same way, the average probability that a type 2 fragment is conserved in both haploid individuals, R2, is given by
![]() |
(14) |
Because the expected ratio of the number of type 1 fragments to that of type 2 fragments is
, the probability that a fragment is conserved by both of haploid individuals is given by
![]() |
(15) |
Here, let us consider the relationship between F and R. In RFLP analysis, ![]()
![]() |
(16) |
where C is the expected proportion of bands shared by chance. Let m be the expected number of bands scored. Because the expected number of bands that are conserved in both lineages of the pair of haploid individuals is Rm, the remaining (1 - R)m bands may be shared by chance. The probability that a band with length L is shared by chance is G1(L) {= G2(L)}, and the distribution of L also follows G1(L). Hence, C is given by
![]() |
(17) |
where
![]() |
(18) |
From (16) and (17), we have
![]() |
(19) |
From the relationship between F and d (= 2µt), we can estimate d from F. Let n be the number of haploid individuals investigated and $\hat{F}_{ij}$ be the estimated proportion of shared bands when the ith and jth haploid individuals are compared. Following NEI and LI (1979), $\hat{F}_{ij}$ is given by

$$\hat{F}_{ij} = \frac{2 m_{ij}}{m_i + m_j} \qquad (20)$$

where mi and mj are the observed numbers of bands scored in the ith and jth haploid individuals and mij is the observed number of bands shared by both haploid individuals. Because we can estimate dij from (19), the nucleotide diversity ($\pi$) is obtained by averaging $\hat{d}_{ij}$ as shown in (1).
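The band-sharing proportion for one pair can be computed directly from the scored bands. In this minimal sketch each lane is represented as a set of band lengths, which is an assumption about data layout; the function name and the band lengths are hypothetical.

```python
def shared_band_proportion(bands_i, bands_j):
    """Proportion of shared bands: twice the shared count over the total count."""
    m_i, m_j = len(bands_i), len(bands_j)
    m_ij = len(bands_i & bands_j)  # bands present in both lanes
    return 2.0 * m_ij / (m_i + m_j)

# Hypothetical band lengths (bp) scored in two haploid individuals.
lane_i = {112, 145, 203, 260, 318}
lane_j = {112, 145, 203, 275}
f_ij = shared_band_proportion(lane_i, lane_j)
```

Here three of the nine scored bands are common to both lanes, so the proportion is 2 x 3 / 9.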
There is another method for estimating $\pi$, in which the average of $\hat{F}_{ij}$ ($\bar{F}$) is used. Namely, we have

$$\bar{F} = \binom{n}{2}^{-1} \sum_{i<j} \hat{F}_{ij} \qquad (21a)$$
If $\bar{F}$ is substituted for F in (19), we can estimate $\pi$ directly (![]()
![]()
$\hat{\pi}$ estimated by this method is virtually the same as that estimated by (1) when $\pi$ is relatively small (![]()
is available on request.
F can also be estimated by
![]() |
(21b) |
This method uses the averages of mij and mi to estimate F. We can also estimate $\pi$ from the estimate of F given by (21b). In the AFLP analysis, this estimate appears to be almost the same as $\bar{F}$, because the numbers of bands for all haploid individuals are relatively large and not so different from each other.
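Since the relationship between F and d is not solved for d in closed form, estimating d from an observed F amounts to one-dimensional root finding. The sketch below uses bisection and assumes only that the expected F decreases as d increases. The function `toy_curve` is a stand-in chosen for illustration; it is NOT Equation 19, and all names here are hypothetical.

```python
import math

def estimate_d(f_obs, expected_f, d_max=1.0, tol=1e-10):
    """Solve expected_f(d) = f_obs for d by bisection (expected_f must decrease in d)."""
    lo, hi = 0.0, d_max
    if not (expected_f(hi) <= f_obs <= expected_f(lo)):
        raise ValueError("f_obs is outside the range covered by expected_f")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_f(mid) > f_obs:
            lo = mid  # curve still predicts too much sharing, so d must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

def toy_curve(d):
    """Stand-in decreasing curve; replace with an implementation of Equation 19."""
    return math.exp(-10.0 * d)

d_hat = estimate_d(0.914, toy_curve)
```

Any monotone root finder would serve equally well; bisection is used here because it needs no derivative of the curve.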
| COMPUTER SIMULATION |
|---|
In the above equations we have made several assumptions and approximations. To assess the accuracy of the present method, a computer simulation was conducted. The procedure of the simulation is as follows. A random ancestral sequence with a length of M million bp is constructed. The sequence consists of four nucleotides, A, T, G, and C, with a given GC-content (g). On this sequence, random mutations are generated. The number of mutations follows the Poisson distribution with mean µt. As models of mutation, we used the equal-input and equal-output models in ![]()
and µAG = µAC = µTG = µTC = µGC = µCG =
. In the equal-input model with g = 0.67, µAT = µTA = µGA = µGT = µCA = µCT =
and µAG = µAC = µTG = µTC = µGC = µCG =
. In the equal-output model with g = 0.33, µAT = µAG = µAC = µTA = µTG = µTC =
and µGA = µGT = µGC = µCA = µCT = µCG =
. In the equal-output model with g = 0.67, µAT = µAG = µAC = µTA = µTG = µTC =
and µGA = µGT = µGC = µCA = µCT = µCG =
. Apparently, all the mutation rates are
when g = 0.5 in both models. This mutational process is carried out twice so that two descendant sequences are obtained. For these two sequences, the AFLP fragments are detected and the lengths of the fragments (L) are scored if Lmin ≤ L ≤ Lmax, and the proportion of shared bands (fragments) is calculated by (20).
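The first steps of this procedure (a random ancestral sequence and two independently mutated descendants) can be sketched as below. The sketch covers only the equal-rate case corresponding to g = 0.5, treats µt as the mean number of mutations per site (an assumption about how the mean is scaled), uses a far shorter sequence than the simulation described, and omits the in silico detection of AFLP fragments. All function names are hypothetical.

```python
import math
import random

def random_sequence(length, gc_content, rng):
    """Draw a random sequence (list of bases) with the requested GC-content."""
    at, gc = (1.0 - gc_content) / 2.0, gc_content / 2.0
    return rng.choices("ACGT", weights=[at, gc, gc, at], k=length)

def poisson(mean, rng):
    """Poisson variate by Knuth's multiplication method (fine for modest means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def mutate(seq, mu_t, rng):
    """Apply a Poisson number of substitutions, each to one of the other three bases."""
    out = list(seq)
    for _ in range(poisson(mu_t * len(out), rng)):
        pos = rng.randrange(len(out))
        out[pos] = rng.choice([b for b in "ACGT" if b != out[pos]])
    return out

rng = random.Random(1)
ancestor = random_sequence(10_000, 0.5, rng)
desc_a = mutate(ancestor, 0.005, rng)  # first descendant lineage
desc_b = mutate(ancestor, 0.005, rng)  # second, independent lineage
```

Running the mutation step twice from the same ancestor mirrors the two lineages of a pair of haploid individuals, so the expected divergence between `desc_a` and `desc_b` is about twice the per-lineage value.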
The results of the simulation for M = 1.6 and g = 0.5 are shown in Figure 1. The selective base of the EcoRI-primer was A and that of the MseI-primer was G, so that c1 = 1 and c2 = 1. The number of replications for a given d was 1000. Note that the equal-input and equal-output models reduce to the same model when g = 0.5. The average number of bands (m) that can be scored was ~38. Figure 1A shows the average of $\hat{F}$ with the theoretical expectation obtained by (19). The average of $\hat{F}$ is very close to the expected value. From $\hat{F}$, d is estimated by (19), and the average of $\hat{d}$ is plotted in Figure 1B. $\hat{d}$ is very close to the true d. The variance of $\hat{d}$ increases as d increases, although the variance of $\hat{F}$ is nearly constant.
It is known that GC-content is not 0.5 in many organisms. By computer simulation, we investigated whether the relationship between d and F presented by Equation 19 holds when GC-content deviates from 0.5. Note that this formula assumes that GC-content is 0.5. Two values of GC-content were investigated (g = 0.33 and 0.67). Since GC-content affects the number of bands (m), the genome size (M) was adjusted so that m ≈ 38 (M = 1.3 and 5.8 for g = 0.33 and 0.67, respectively). From $\hat{F}$, d was estimated by (19). In Figure 2, the average of $\hat{d}$ is plotted against the true d. When g = 0.33, $\hat{d}$ is smaller than the true value (Figure 2A). On the other hand, $\hat{d}$ is larger than the true value when g = 0.67 (Figure 2B). The deviation of $\hat{d}$ from the true d is larger in the equal-output model than in the equal-input model, indicating that the degree of the deviation of $\hat{d}$ from the true d depends on the mutation model. However, if d < 0.025, $\hat{d}$ is very close to the true value in our simulation even when g = 0.33 and 0.67, suggesting that Equation 19 is quite useful for GC-contents between 0.33 and 0.67 when d is small.
| APPLICATIONS |
|---|
Using the relationship between F and d, we estimated the nucleotide diversity in Dioscorea tokoro. D. tokoro is a dioecious, diploid, wild yam species distributed in East Asia. The AFLP data are unpublished results of R. TERAUCHI and G. KAHL. Two individuals [DT5 (female) and DT7 (male)], collected from Wakayama Prefecture in Japan, were investigated. For linkage analysis, segregation data of AFLP patterns are available for their F1 progeny. In the present article, we estimate the nucleotide diversity in these two individuals, DT5 and DT7 (corresponding to four haploid individuals), from the AFLP data.
Table 1 summarizes the results of AFLP detected between DT5 and DT7 for 14 primer combinations. PCR primers complementary to EcoRI- and MseI-adapters have two and three selective bases at their 3' ends, respectively. As there are segregation data among progeny (R. TERAUCHI and G. KAHL, unpublished results), it was possible to distinguish the homozygous (indicated by ++) and heterozygous (+-) states of the fragments. Thus the combinations of the AFLP genotypes for DT5 and DT7 could be classified into eight classes. The number of AFLP fragments (bands) detected for each primer combination ranged from 48 to 102, with a total of 897 fragments for 14 primer combinations. About 76% of bands were homozygous (++) for both individuals.
From Table 1, $\bar{F}$ was calculated as follows. Note that (21b) is not applicable because D. tokoro is a diploid. Because we have data of diploid individuals, it is necessary to consider each diploid individual as a unit of two haploid genomes. Fortunately, in this example, we know from F1 data whether each scored band is homozygous or heterozygous (Table 1). Here, consider the banding patterns of n diploid individuals, which consist of a total of K types of bands. If we focus on a particular band (for example, the xth band), we know the number of haploid genomes that have this band on the autoradiograph. Denote this number by Sx, where Sx ranges from 1 to 2n. Let us consider the probability, Hx, that the band is shared by two haploid genomes randomly chosen from the sample. There are $\binom{2n}{2}$ ways to choose a pair of haploid genomes from the sample, of which $\binom{S_x}{2}$ pairs share the band. Then, we have

$$H_x = \binom{S_x}{2} \Big/ \binom{2n}{2} \qquad (22)$$
Considering all the K types of bands, therefore, we can obtain the average proportion of the shared bands ($\bar{F}$) for a pair of haploid genomes in the sample. Namely,

$$\bar{F} = \frac{\sum_{x=1}^{K} H_x}{\sum_{x=1}^{K} S_x / (2n)} \qquad (23)$$
where the denominator of the right side is the average number of bands per haploid genome. From (19), then, we can estimate $\pi$ using $\bar{F}$.
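The calculation for diploid data with known zygosity can be sketched as follows. The sketch takes the numerator to be the sum of Hx over bands and the denominator to be the average number of bands per haploid genome, as described above; this is one reading of the text, and the function name and the counts are hypothetical rather than taken from Table 1.

```python
from math import comb

def mean_shared_proportion(copy_counts, n_diploid):
    """Average proportion of shared bands for a random pair of haploid genomes.

    copy_counts[x] is S_x, the number of haploid genomes (out of 2n) carrying band x.
    """
    total_pairs = comb(2 * n_diploid, 2)
    shared = sum(comb(s, 2) / total_pairs for s in copy_counts)  # sum of H_x
    bands_per_genome = sum(copy_counts) / (2 * n_diploid)
    return shared / bands_per_genome

# Hypothetical counts for n = 2 diploids (4 haploid genomes): three bands fixed in
# all genomes, one band carried by 2 genomes, one band carried by a single genome.
f_bar = mean_shared_proportion([4, 4, 4, 2, 1], n_diploid=2)
```

A band homozygous in both individuals contributes Sx = 4, a band heterozygous in one individual only contributes Sx = 1, and so on for the other genotype classes.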
In this case, $\bar{F}$ was calculated to be 0.914. Then we have $\hat{\pi}$ = 0.0055 from (19). The sampling variance of $\hat{\pi}$ was computed by the jackknife method (EFRON 1982).
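A generic delete-one jackknife can be sketched as below. The surviving text does not say which units were deleted in turn (bands, primer combinations, or individuals), so the `units` list, the stand-in estimator (a simple mean), and the numbers are all hypothetical.

```python
def jackknife_variance(units, estimator):
    """Delete-one jackknife variance of estimator(units)."""
    n = len(units)
    leave_one_out = [estimator(units[:i] + units[i + 1:]) for i in range(n)]
    mean_loo = sum(leave_one_out) / n
    return (n - 1) / n * sum((v - mean_loo) ** 2 for v in leave_one_out)

def mean(xs):
    """Stand-in estimator; a real analysis would recompute the diversity estimate."""
    return sum(xs) / len(xs)

# Hypothetical per-unit values.
values = [0.90, 0.92, 0.91, 0.93, 0.89]
var_hat = jackknife_variance(values, mean)
```

For the simple mean used here the jackknife variance equals the familiar sample variance divided by n, which gives a quick check that the routine is wired correctly.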
The nucleotide diversities of six Lens species were calculated. The data are taken from Table 2 of ![]()
by averaging $\hat{F}_{ij}$. The obtained $\bar{F}$ is summarized in Table 2. From $\bar{F}$, the nucleotide diversity was calculated by (19), and the results are also shown in Table 2. The estimated nucleotide diversity ranges from 0.0048 to 0.0220. The sampling variance was also estimated by the jackknife method. ![]()
In the case of D. tokoro, we know whether the scored band is homozygous or heterozygous, because we have data of F1 progeny. If such data are not available, we cannot use (23) for estimating $\bar{F}$. In this case, we have to use the frequency of the band in the population. The following procedure is essentially the same as in ![]()
Let fx be the frequency of the xth band in the population (1 ≤ x ≤ K), where K is the number of types of scored bands. Consider that n diploid individuals are sampled from a population, and assume that the population is in Hardy-Weinberg equilibrium. Let Sx be the number of (diploid) individuals that have the xth band (1 ≤ Sx ≤ n). Then, we have
![]() |
(24) |
Using this relationship with Haldane's correction (![]()
![]() |
(25) |
Let hx be the probability that the xth band is shared by two haploid genomes randomly chosen from the population, so that hx corresponds to the homozygosity of the xth band ($h_x = f_x^2$). From (24), hx can also be estimated by
![]() |
(26) |
where $\hat{f}_x$ is given by (25). Therefore, $\bar{F}$ is given by
![]() |
(27) |
where the denominator of the right side is the expected number of bands per haploid genome. Using the above $\bar{F}$, we can calculate the nucleotide diversity.
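When zygosity is unknown, the band frequency has to be inferred from presence or absence in diploids. The sketch below is a plain plug-in version: under Hardy-Weinberg equilibrium a dominant band with frequency f is absent from a fraction (1 - f)^2 of individuals, which is inverted to obtain f. It deliberately omits Haldane's correction, so it is not the estimator used in the equations above; the function names and counts are hypothetical.

```python
import math

def band_frequency(s_x, n):
    """Uncorrected frequency estimate from s_x band carriers among n diploids."""
    return 1.0 - math.sqrt(1.0 - s_x / n)

def mean_shared_proportion_hw(carrier_counts, n):
    """Plug-in average band sharing: sum of f_x^2 over sum of f_x."""
    freqs = [band_frequency(s, n) for s in carrier_counts]
    shared = sum(f * f for f in freqs)  # sum of h_x = f_x^2
    bands_per_genome = sum(freqs)       # expected number of bands per haploid genome
    return shared / bands_per_genome

# Hypothetical sample of n = 4 diploids: one band seen in all 4, one band seen in 3.
f_bar_hw = mean_shared_proportion_hw([4, 3], n=4)
```

A band present in every sampled individual gives a frequency estimate of exactly 1 with this plug-in form, which is the kind of small-sample artifact a bias correction is meant to soften.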
| DISCUSSION |
|---|
In this study, we developed a method for estimating nucleotide diversity ($\pi$) from AFLP data. Although Equation 19 is very complex to calculate, the computer simulation indicates that this equation gives a good estimate of d, as shown in Figure 1. The variance of the estimate increases with d, indicating that the estimate is less reliable when d is large.
Our method was directly applied to the AFLP data set from D. tokoro. The estimated value of $\pi$ was 0.0055 ± 0.0001 (SD). This value was compared with those in two gene regions of D. tokoro, which were estimated from DNA sequences by ![]()
$\pi$ from DNA sequences. The sampling variance of the estimated $\pi$ from DNA sequences is also calculated by Equation 32 in ![]()
$\hat{\pi}$ estimated from AFLP is larger than $\hat{\pi}$ from DNA sequences, except for Adh introns. Apparently, $\hat{\pi}$ from AFLP represents the nucleotide diversity of the total genome of D. tokoro. It is known that in eukaryote genomes many regions have little or no function, and that in such regions the selective constraint may be very weak in comparison with functional regions (![]()
![]()
$\pi$ for the total genome may be larger than that for a specific coding region.
Another explanation for the large value of $\hat{\pi}$ based on AFLP data is the effect of insertions and deletions, which are assumed to be very rare events and are neglected in this study. If insertion and deletion events are not rare, $\hat{\pi}$ estimated by our method might be an overestimate. This problem also appears in the estimation of $\pi$ from RFLP data without a restriction map (![]()
![]()
To investigate the amount of intraspecific variation, the AFLP pattern of D. tokoro was analyzed. As expected from the results with other plant species (![]()
| ACKNOWLEDGMENTS |
|---|
The authors thank Naohiko Miyashita and Akira Kawabe for their comments and suggestions. This work was supported in part by a grant-in-aid from the Ministry of Education, Science, Sports, and Culture of Japan.
Manuscript received July 16, 1998; Accepted for publication November 12, 1998.
| LITERATURE CITED |
|---|
CLARK, A. G., and C. M. S. LANIGAN, 1993 Prospects for estimating nucleotide divergence with RAPDs. Mol. Biol. Evol. 10:1096-1111.
EFRON, B., 1982 The Jackknife, the Bootstrap, and Other Resampling Plans. Society for Industrial and Applied Mathematics, Philadelphia.
HALDANE, J. B. S., 1956 The estimation of viabilities. J. Genet. 54:294-296.
HILL, M., H. WITSENBOER, M. ZABEAU, P. VOS, R. KESSELI et al., 1996 PCR-based fingerprinting using AFLPs as a tool for studying genetic relationships in Lactuca spp. Theor. Appl. Genet. 93:1202-1210.
JUKES, T. H., and C. R. CANTOR, 1969 Evolution of protein molecules, pp. 21-132 in Mammalian Protein Metabolism, edited by H. N. MUNRO. Academic Press, New York.
KIMURA, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.
MAHESWARAN, M., P. K. SUBUDHI, S. NANDI, J. C. XU, A. PARCO et al., 1997 Polymorphism, distribution, and segregation of AFLP markers in a doubled haploid rice population. Theor. Appl. Genet. 94:39-45.
MAUGHAN, P. J., M. A. SAGHAI MAROOF, G. R. BUSS, and G. M. HUESTIS, 1996 Amplified fragment length polymorphism (AFLP) in soybean: species diversity, inheritance, and near-isogenic line analysis. Theor. Appl. Genet. 93:392-401.
NEI, M., 1987 Molecular Evolutionary Genetics. Columbia University Press, New York.
NEI, M., and W.-H. LI, 1979 Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl. Acad. Sci. USA 76:5269-5273.
NEI, M., and J. C. MILLER, 1990 A simple method for estimating average number of nucleotide substitutions within and between populations from restriction data. Genetics 125:873-879.
NEI, M., and F. TAJIMA, 1981 DNA polymorphism detectable by restriction endonucleases. Genetics 97:145-163.
NEI, M., and F. TAJIMA, 1983 Maximum likelihood estimation of the number of nucleotide substitutions from restriction sites data. Genetics 105:207-217.
SHARMA, S. K., M. R. KNOX, and T. H. N. ELLIS, 1996 AFLP analysis of the diversity and phylogeny of Lens and its comparison with RAPD analysis. Theor. Appl. Genet. 93:751-758.
STEPHENS, J. C., D. A. GILBERT, N. YUHKI, and S. J. O'BRIEN, 1992 Estimation of heterozygosity for single-probe multilocus DNA fingerprints. Mol. Biol. Evol. 9:729-743.
TAJIMA, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437-460.
TAJIMA, F., and M. NEI, 1982 Biases of the estimates of DNA divergence obtained by the restriction enzyme technique. J. Mol. Evol. 18:115-120.
TAJIMA, F., and M. NEI, 1984 Estimation of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 1:269-285.
TERAUCHI, R., T. TERACHI, and N. T. MIYASHITA, 1997 DNA polymorphism at the Pgi locus of a wild yam, Dioscorea tokoro. Genetics 147:1899-1914.
THOMAS, C. M., P. VOS, M. ZABEAU, D. A. JONES, K. A. NORCOTT et al., 1995 Identification of amplified restriction fragment polymorphism (AFLP) markers tightly linked to the tomato Cf-9 gene for resistance to Cladosporium fulvum. Plant J. 8:785-794.
VOS, P., R. HOGERS, M. BLEEKER, M. REIJANS, T. VAN DE LEE et al., 1995 AFLP: a new technique for DNA fingerprinting. Nucleic Acids Res. 23:4407-4414.
[Figure legend: separate plot symbols mark the result of the equal-input model and the result of the equal-output model.]








