- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Li, H.
- Articles by Fu, Y.-X.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Li, H.
- Articles by Fu, Y.-X.
Neutrality Tests Using DNA Polymorphism From Multiple Samples
Haipeng Lib,c, Yunwu Zhangc, Ya-Ping Zhangc, and Yun-Xin Fub,aa Laboratory of Bioinformatics, Yunnan University, Kunming 650991, People's Republic of China,
b Human Genetics Center, University of Texas, Houston, Texas 77030
c Laboratory of Molecular Evolution and Genome Diversity, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, People's Republic of China
Corresponding author: Yun-Xin Fu, UT School of Public Health, P.O. Box 20186, 1200 Herman Pressler, Houston, TX 77030., yunxin.fu{at}uth.tmc.edu (E-mail)
Communicating editor: G. B. GOLDING
| ABSTRACT |
|---|
The polymorphism of a gene or a locus is studied with increasing frequency by multiple laboratories or the same group at different times. Such practice results in polymorphism being revealed by different samples at different regions of the locus. Tests of neutrality have been widely conducted for polymorphism data but commonly used statistical tests cannot be applied directly to such data. This article provides a procedure to conduct a neutrality test and details are given for two commonly used tests. Applying the two new tests to the chemokine-receptor gene (CCR5) in humans, we found that the hypothesis that all mutations are selectively neutral cannot explain the observed pattern of DNA polymorphism.
THE amount and pattern of polymorphism in DNA sequence samples from a population reflects not only mutations in the ancestors of the sequences but also random genetic drift as well as other evolutionary forces, such as natural selection. How to detect the presence of natural selection in molecular population genetics and evolution is an important issue. It is possible to detect the presence of natural selection because natural selection often causes the pattern of polymorphism to differ from that under the neutral mutation hypothesis, which postulates that the majority of mutations that have contributed significantly to the genetic variation in natural populations is neutral or nearly neutral (![]()
![]()
A number of statistical tests have been proposed, and almost all of them are designed for a single sample. In reality, polymorphic data can be accumulated over time in the same or different laboratories, which means different sites may be examined using different samples. How to conduct neutrality tests in such situations is the focus of this article. To date, millions of single nucleotide polymorphisms (SNPs) have been identified. It is very likely that SNPs from a single gene or tightly linked regions will be typed for different samples by different research groups over time. The newly developed method will be valuable for analyzing such data. We present an example of such an analysis using data from the CCR5 gene.
| STATISTICAL TESTS |
|---|
A number of statistical tests can be extended for multiple samples. However, for the sake of discussion, we focus on two particular tests partly because they are used widely. First, ![]()
(=4Nµ), where N is the effective population size, and µ is the mutation rate per sequence per generation, to detect the presence of selection. His test statistic is
![]() |
(1) |
where
is the mean number of nucleotide differences between two sequences, K is the number of segregating sites, which is equal to the total number of mutations under the infinite-sites model, n is sample size, and
![]() |
(2) |
Second, ![]()
![]() |
(3) |
where
1 is the number of mutations in the external branches, that is, mutations that are inherited by only one sequence in the sample.
The above two tests can be extended to multiple samples in the following way. Assume that a locus without recombination is divided into m regions that have been surveyed using different or partially overlapping samples (Fig 1). It should be emphasized that the assumption of no recombination is made here to make the null model as simple as possible, similar to the original Tajima and Fu and Li tests. Just as the presence of recombination does not invalidate the Tajima and Fu and Li tests, but makes them more conservative, the new tests will likely behave similarly and this will also be applicable to data with recombination.
|
Define
= 4Nµ, where µ is the mutation rate per sequence per entire locus, and N is the effective population size. Also define
i = 4Nµi, where µi is the mutation rate for the ith region (i = 1, 2, ... , m), and ri = µi/µ, which is the proportion of the mutation rate of the ith region. If the mutation rate per site is constant, the ri is equal to the proportion of the length of the ith region,

Furthermore, we define the mean number of nucleotide differences between two sequences in the ith region as
i, the number of segregating sites in the ith region as Ki, the number of mutations in the external branches of the genealogy in the ith region as
1i, and the sample size in the ith region as ni (Fig 1). Define
![]() |
(4) |
![]() |
(5) |
and
![]() |
(6) |
It is obvious that when m = 1, these equations reduce to their conventional definitions for a single sample. Note that in these equations the same weight is given to every region. An alternative approach is to give weight to a region according to certain criterion. However, so far we have not found any other weighting scheme to perform better than the equal-weighting scheme.
Furthermore, we define
![]() |
(7) |
where ani is given by (2). That is,
is a weighted average of ani. Then the tests (1) and (3) become
![]() |
(8) |
and
![]() |
(9) |
| PERFORMING THE TESTS |
|---|
Since
and
, we can compute Var(
- K/
) and Var(K - 
1) if we are able to compute Var(K), Var(
), Var(
1), Cov(
, K), and Cov(K,
1). Some of these terms can be computed analytically; others have to be estimated.
Analytical result:
Assume the total sample consists of n sequences, and let those sequences be numbered from 1 to n. S = {1, 2, ... , n} will represent the whole sample. The sample, Si, for the ith region will be a subset of S. That is Si
S. We do not make any assumption here on the relationship among Si. We note that in one extreme, we can have Si
Sj =
for every pair of i and j, and on the other hand, we may have S1 = S2 = ... = Sm = S. In many situations, it is likely that Si
Sj
(e.g., Fig 1). We use |Si| to represent the number of elements in the set Si, which is the sample size. Then we have ni = |Si|, n = |S|.
The computation of Var(
), Var(K), and other components that we mentioned above requires understanding of the sample genealogy. Let Li be the total branch length of subtree for the ith region scaled so that 1 unit corresponds to 4N generations, and the length of a branch here means a time duration.
ki is the number of k-size mutations in the ith region, and lki is the length of k-size branches in the subtree for the ith region scaled similarly as Li (Fig 1). The size of a branch is the number of sequences in the sample that are descendants of that branch, and a mutation is said to be size k if it occurs in a branch of size k (![]()
,
i, and ri, we have
![]() |
(10) |
Moreover, since 1 unit in time corresponds to 4N generations, we have from ![]()
![]() |
(11) |
Under the infinite-sites model, we have
![]() |
(12) |
Conditioning on the coalescent times, the number of mutations in each branch follows a Poisson distribution with parameter
l, where l is the branch length. We thus have

where tk (2
k
n) is the time duration required for k sequences to coalesce to k - 1 sequence, i.e., the so-called k-coalescent time (Fig 1), and i
j. Then we have
![]() |
(13) |
which leads to
![]() |
(14) |
![]()
![]() |
(15) |
where

Also we can partition E(LiLj) further as
![]() |
(16) |
Therefore, from (5), (10), (14), (15), and (16), we have
![]() |
(17) |
Similar to (13), we have for i
j that
![]() |
(18) |
![]()
i) =
i, so we have
![]() |
(19) |
Since
i is the mean number of nucleotide differences between two sequences in the ith region, it can be calculated from
ki as
![]() |
(20) |
(![]()
![]()
i as
![]() |
(21) |
So we have
![]() |
(22) |
and thus
![]() |
(23) |
where Var(
i) and E(
i
j) are given by (21) and (22), respectively. Moreover, from (12) and (19), we have
![]() |
(24) |
where E(K
) is given later (Equation 25). ![]()
ki
pi). After putting these terms together, we have
![]() |
(25) |
![]() |
(26) |
which leads to
![]() |
(27) |
and
![]() |
(28) |
![]()
![]() |
(29) |
where cn = 1 when n = 2, and when n > 2

From (18) and (26), we have
![]() |
(30) |
Substituting (29) and (30) into (28), we have
![]() |
(31) |
and
![]() |
(32) |
where E(
ki
1i) is given by ![]()
Numerical estimation:
The above derivation shows that E(lkilpj) is critical for computing Var(K), Var(
), Cov(
, K), Var(
1), and Cov(K,
1). The value of E(lkilpj) is dependent on the relationship of branches among subtrees of the regions. A simple example of the relationship is given in Fig 1. Although ![]()
j appears to be analytically intractable. However, an estimate can be obtained relatively easily from the following procedure:
- Simulate a genealogy g (topology without mutation) of a sample of n sequences numbered from 1 to n.
- Compute the value of Eg(lkilpj) for the simulated genealogy.
- Repeat steps 1 and 2 M times. Then E(lkilpj) is finally estimated as

The computation of Eg(lkilpj) in step 2 is done in a similar manner as
FU 1994 . As it is obvious, the accuracy of the estimation can be improved by using large values of M.
- The components mentioned above can be obtained after E(lkilpj) is estimated, and then DT and DF can be calculated. Similar to
TAJIMA'S 1989 and
FU and LI'S 1993 tests, DT and DF do not follow well-known standard distributions. Since E(lkilpj) have to be estimated from simulated samples, it is natural to use computer simulations to determine the critical points of the tests. Overall, this approach gives more accurate critical values than using approximation by a standard distribution.
| AN EXAMPLE |
|---|
We consider data from the CCR5 gene from Caucasians (![]()
![]()
![]()
Sj =
, where i
j. In the second case, we assume that smaller samples are a subset of larger samples. That is, we assume S8 (64)
S11 (90)
S6 (98)
S3 (170)
S12 (174)
S10 (242)
S1 (382)
S5 (462)
S9 (490)
S2 (698)
S4 (708)
S7 (5210), where the numbers in parentheses are the sample sizes. We refer to those two cases as case 1 and case 2, respectively. By assuming samples are independent in the first case, we basically obtain the maximum possible information from such data. On the other hand, by assuming smaller samples are a subset of larger samples, we have the minimal amount of information.
|
From (12), we can get the estimate of
as
since
so we have
= 1.913. Also we have
=
i = 0.392. Table 2 shows that both tests yield significant negative values in both cases 1 and 2. Thus we conclude that the CCR5 region has not evolved according to the neutral model. One possibility is that it has evolved under natural selection, which remains to be seen by further study.
|
| ACKNOWLEDGMENTS |
|---|
We are grateful to Ms. Sara Barton for her help. This work is supported by National Institutes of Health grants R01 GM50428 and R01 GM55759 (Yun-Xin Fu), the Chinese Academy of Sciences (KSCX2-1-05), the National Nature Science Foundation of China, and the Nature Science Foundation of Yunnan Province in China.
Manuscript received April 23, 2002; Accepted for publication November 21, 2002.
| LITERATURE CITED |
|---|
CARRINGTON, M., T. KISSNER, B. GERRARD, S. IVANOV, and S. J. O'BRIEN et al., 1997 Novel alleles of the chemokine-receptor gene CCR5.. Am. J. Hum. Genet. 61:1261-1267.[Medline]
FU, Y. X., 1994 Estimating effective population-size or mutation-rate using the frequencies of mutations of various classes in a sample of DNA sequences. Genetics 138:1375-1386.[Abstract]
FU, Y. X., 1995 Statistical properties of segregating sites. Theor. Popul. Biol. 48:172-197.[Medline]
FU, Y. X. and W. H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133:693-709.[Abstract]
KIMURA, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.
LI, W. H., and Y. X. FU, 1998 Coalescent theory and its applications in population genetics, pp. 4579 in Statistics in Genetics, edited by E. HALLORAN. Springer-Verlag, New York.
TAJIMA, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437-460.
TAJIMA, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595.
WATTERSON, G. A., 1975 On the number of segregating sites in genetic models without recombination. Theor. Popul. Biol. 7:256-276.[Medline]
This article has been cited by other articles:
![]() |
Y.-w. Zhang, O. A. Ryder, and Y.-p. Zhang Intra- and Interspecific Variation of the CCR5 Gene in Higher Primates Mol. Biol. Evol., October 1, 2003; 20(10): 1722 - 1729. [Abstract] [Full Text] |
||||
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Li, H.
- Articles by Fu, Y.-X.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Li, H.
- Articles by Fu, Y.-X.

































