Detecting Linkage Disequilibrium in Bacterial Populations
Bernhard Haubold^a, Michael Travisano^a, Paul B. Rainey^a, and Richard R. Hudson^b
a Department of Plant Sciences, University of Oxford, Oxford OX1 3RB, United Kingdom
b Department of Ecology and Evolutionary Biology, University of California, Irvine, California 92717
Corresponding author: Bernhard Haubold, Max-Planck-Institut für Chemische Ökologie, Tatzendpromenade 1a, D-07745 Jena, Germany. E-mail: haubold{at}ice.mpg.de
Communicating editor: P. L. FOSTER
| ABSTRACT |
|---|
The distribution of the number of pairwise differences calculated from comparisons between n haploid genomes has frequently been used as a starting point for testing the hypothesis of linkage equilibrium. For this purpose the variance of the pairwise differences, VD, is used as a test statistic to evaluate the null hypothesis that all loci are in linkage equilibrium. The problem is to determine the critical value of the distribution of VD. This critical value can be estimated either by Monte Carlo simulation or by assuming that VD is distributed normally and calculating a one-tailed 95% critical value for VD, L, as $L = E(V_D) + 1.645\sqrt{\mathrm{Var}(V_D)}$, where E(VD) is the expectation of VD, and Var(VD) is the variance of VD. If VD (observed) > L, the null hypothesis of linkage equilibrium is rejected. Using Monte Carlo simulation we show that the formula currently available for Var(VD) is incorrect, especially for genetically highly diverse data. This has implications for hypothesis testing in bacterial populations, which are often genetically highly diverse. For this reason we derive a new, exact formula for Var(VD). The distribution of VD is examined and shown to approach normality as the sample size increases. This makes the new formula a useful tool in the investigation of large data sets, where testing for linkage using Monte Carlo simulation can be very time consuming. Application of the new formula, in conjunction with Monte Carlo simulation, to populations of Bradyrhizobium japonicum, Rhizobium leguminosarum, and Bacillus subtilis reveals linkage disequilibrium where linkage equilibrium has previously been reported.
BACTERIA might be called "facultative sexuals" because they can exchange genetic material through conjugation, transformation, and transduction, but genetic exchange is not a part of their reproductive mode. Just how frequently recombination takes place in bacteria has been a topic of debate since the first major study of bacterial population genetics, in which Escherichia coli genomes were assumed to recombine frequently, leading to linkage equilibrium.
The conclusion of linkage equilibrium reached in these studies is based on the variance of the distribution of the number of pairwise differences (VD) among bacterial isolates that have been subjected to genetic analysis at multiple loci. VD can be compared to a critical value obtained under the null hypothesis that all loci are in linkage equilibrium. This approach was first developed by ![]()
There are two methods of calculating a critical value for VD: (1) the null distribution of VD can be simulated on a computer, and (2) assuming the null distribution of VD is normal, a critical value can be calculated by the well-known method of adding x standard deviations to E(VD). But, as it is not known whether the null distribution of VD is normal, Monte Carlo simulation has recently emerged as the preferred way of testing linkage equilibrium in bacterial populations.
| THE TRADITIONAL METHOD OF COMPUTING THE VARIANCE OF VD |
|---|
Suppose we have n sampled haploid individuals, arbitrarily numbered from 1 to n, that have been genetically assayed at q loci. Let dij denote the number of loci at which individuals i and j differ. Then the variance of pairwise differences is by definition equal to
![]() |
(1) |
![]() |
(2) |
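The definition of dij and of the variance of pairwise differences can be illustrated with a minimal Python sketch. It assumes that the variance is taken over all n(n - 1)/2 pairs and divided by that number of pairs; the function names and the toy sample are hypothetical and serve only as illustration.

```python
from itertools import combinations


def pairwise_differences(haplotypes):
    """Return d_ij for every pair i < j: the number of loci at which two haplotypes differ."""
    return [sum(a != b for a, b in zip(h1, h2))
            for h1, h2 in combinations(haplotypes, 2)]


def variance_of_pairwise_differences(haplotypes):
    """V_D, assuming the sum of squared deviations is divided by the number of pairs."""
    d = pairwise_differences(haplotypes)
    mean_d = sum(d) / len(d)
    return sum((x - mean_d) ** 2 for x in d) / len(d)


# Toy sample: n = 4 haploid individuals assayed at q = 3 loci (alleles coded as integers).
sample = [(1, 1, 1), (1, 2, 1), (2, 2, 2), (1, 1, 2)]
print(pairwise_differences(sample))
print(variance_of_pairwise_differences(sample))
```

If a different divisor is intended (for example, the number of pairs minus one), only the final division in `variance_of_pairwise_differences` needs to change.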
The distribution of VD depends on how replicate samples would be generated. In this article, we assume that replicate samples are generated by randomly shuffling the original alleles among the sampled haplotypes. In this way, the numbers of alleles and the frequencies of the alleles at individual loci are exactly the same in each replicate as in the original sample, but there is no statistical association of alleles on haplotypes except that which arises by chance. This shuffling method is the method suggested by ![]()
![]() |
(3) |
![]() |
(4) |
![]()
![]() |
(5) |
![]() |
(6) |
![]() |
(7) |
In the next section we derive a formula for the variance of VD under the randomization scheme of ![]()
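The shuffling scheme described in this section, and a Monte Carlo critical value built on it, can be sketched in Python as follows. This is an illustration under stated assumptions, not the program used for the article: V_D is assumed to be divided by the number of pairs, the critical value is taken as a simple empirical quantile, and all function names are hypothetical.

```python
import random
from itertools import combinations


def v_d(haplotypes):
    """Variance of pairwise differences, divided by the number of pairs (assumed form)."""
    d = [sum(a != b for a, b in zip(h1, h2))
         for h1, h2 in combinations(haplotypes, 2)]
    mean_d = sum(d) / len(d)
    return sum((x - mean_d) ** 2 for x in d) / len(d)


def shuffle_alleles(haplotypes, rng):
    """Permute the alleles at each locus independently among the sampled individuals."""
    columns = [list(column) for column in zip(*haplotypes)]
    for column in columns:
        rng.shuffle(column)  # without replacement, so allele counts per locus are unchanged
    return [tuple(row) for row in zip(*columns)]


def monte_carlo_critical_value(haplotypes, replicates=10_000, level=0.95, seed=1):
    """Empirical upper quantile of V_D over shuffled replicates (simple order statistic)."""
    rng = random.Random(seed)
    values = sorted(v_d(shuffle_alleles(haplotypes, rng)) for _ in range(replicates))
    return values[min(int(level * replicates), replicates - 1)]
```

The null hypothesis would be rejected when `v_d(sample)` exceeds `monte_carlo_critical_value(sample)`. Fixing the seed makes a run reproducible; the number of replicates trades precision of the quantile against running time.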
| COMPUTING THE VARIANCE OF VD |
|---|
In this section we obtain an exact expression for the variance of VD under the shuffling of alleles across individuals (the sampling without replacement method).
![]() |
(8) |
![]() |
(9) |
Because, under the randomization scheme that we are considering, the mean number of pairwise differences is a constant, it follows that
![]() |
(10) |
We now proceed to derive expressions for each of the terms on the right-hand side of the last line of Equation 10. Let xk be an indicator variable, equal to one if individual 1 and individual 2 are identical at locus k, and zero otherwise. Then
![]() |
(11) |
![]() |
(12) |
k is the probability that two randomly chosen individuals are identical at locus k. For our case, ![]() |
(13) |
![]() |
(14) |
To calculate $E(s_{ij}^4)$, we write
![]() |
(15) |
To arrive at the last line, we have used the fact that an indicator variable raised to any positive power is equal to the indicator variable itself. (For example, $x_k^4 = x_k$.) We have also made use of the fact that xk is independent of xj, for j ≠ k. We show later that the double, triple, and quadruple sums on the last line of (15) can be written as single sums and products of single sums of terms involving powers of the
i's.
Similarly, to calculate the other terms in (10) we define zk to be one if individuals 3 and 4 are identical at locus k and zero otherwise, and we define yk to be one if individuals 1 and 3 are identical at locus k. It follows that
![]() |
(16) |
k is the probability that individuals 1 and 2 are identical at locus k and individuals 3 and 4 are also identical at this locus. Recall that alleles are assigned to individuals randomly without replacement, so

Similarly,
![]() |
(17) |
k is the probability that individuals 1, 2, and 3 are identical to each other at locus k,

One can now calculate Var(VD) using (10) together with (15), (16), and (17).
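The three per-locus probabilities used above follow directly from the allele counts when alleles are assigned to individuals without replacement. The Python sketch below computes them for a single locus; the function and variable names are hypothetical, and the sample is assumed to contain at least four individuals.

```python
from collections import Counter


def falling(n, r):
    """Falling factorial n (n - 1) ... (n - r + 1); zero whenever 0 <= n < r."""
    result = 1
    for i in range(r):
        result *= n - i
    return result


def identity_probabilities(alleles):
    """Per-locus identity probabilities under sampling without replacement.

    Returns (pair, triple, two_pairs):
      pair      - individuals 1 and 2 carry the same allele
      triple    - individuals 1, 2, and 3 all carry the same allele
      two_pairs - 1 and 2 are identical and 3 and 4 are identical
    """
    n = len(alleles)
    counts = list(Counter(alleles).values())
    pair = sum(falling(c, 2) for c in counts) / falling(n, 2)
    triple = sum(falling(c, 3) for c in counts) / falling(n, 3)
    # Both pairs may share one allele, or each pair may carry a different allele.
    same_allele = sum(falling(c, 4) for c in counts)
    different_alleles = sum(falling(a, 2) * falling(b, 2)
                            for i, a in enumerate(counts)
                            for j, b in enumerate(counts) if i != j)
    two_pairs = (same_allele + different_alleles) / falling(n, 4)
    return pair, triple, two_pairs
```

Because the counts enter only through falling factorials, rare alleles that cannot supply two, three, or four copies contribute zero automatically.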
We can write the results in a way that does not require double, triple, or quadruple sums. For example, note that

In a similar fashion, the other multiple sums can be reduced to terms involving the following single sums:

After some manipulation, the result is
![]() |
(18) |
Finally, we define an ~95% critical value as
$L = E(V_D) + 1.645\sqrt{\mathrm{Var}(V_D)}$ |
(19) |
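As a small numerical illustration of the normal approximation, the sketch below computes a one-tailed critical value from an expectation and a variance that are assumed to be already known. The function name and the input values are hypothetical; the multiplier is obtained from the standard normal quantile function rather than hard-coded.

```python
from math import sqrt
from statistics import NormalDist


def normal_critical_value(expected_vd, var_vd, level=0.95):
    """One-tailed critical value: expectation plus z standard deviations."""
    z = NormalDist().inv_cdf(level)  # about 1.645 for level = 0.95
    return expected_vd + z * sqrt(var_vd)


# Hypothetical inputs, for illustration only.
print(normal_critical_value(2.0, 0.25))
```

An observed VD larger than the returned value would lead to rejection of linkage equilibrium at roughly the 5% level, provided the normal approximation holds.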
| RESULTS AND DISCUSSION |
|---|
To convince ourselves of the correctness of the above algebra and to demonstrate the inadequacy of Var(VD)old we used Monte Carlo simulations. Eleven artificial samples were constructed in the following way: The first data set containing 100 strains and 10 loci with five alleles at each locus was constructed from 96 strains of genotype

and one each of genotype




The second data set was made up of 88 strains of the major genotype and 3 strains of each of the minor genotypes and so on until a data set of maximum genetic diversity was reached consisting of 20 strains of each genotype. In this way we obtained artificial data sets with genetic diversities ranging from 0.078 to 0.8, which represent the range of genetic diversities to which the test developed by ![]()
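The construction of the eleven artificial samples can be sketched in Python. The genotypes themselves are assumed here: because there are five genotypes and five alleles at each locus, each genotype is taken to carry its own allele at every locus, coded 1 to 5. The diversity is computed per locus as one minus the sum of squared allele frequencies, which is the form that reproduces the quoted endpoints of 0.078 and 0.8; all names are hypothetical.

```python
def artificial_sample(major, minor, loci=10):
    """One major genotype and four minor genotypes, each with a private allele per locus."""
    counts = [major] + [minor] * 4
    return [(allele,) * loci
            for allele, count in enumerate(counts, start=1)
            for _ in range(count)]


def gene_diversity(alleles):
    """Per-locus diversity, 1 - sum of squared allele frequencies (no small-sample correction)."""
    n = len(alleles)
    return 1 - sum((alleles.count(a) / n) ** 2 for a in set(alleles))


# (major, minor) strain counts: 96/1, 88/3, ..., 24/19, and finally 20/20.
compositions = [(100 - 4 * m, m) for m in range(1, 20, 2)] + [(20, 20)]

for major, minor in compositions:
    sample = artificial_sample(major, minor)
    print(major, minor, round(gene_diversity([h[0] for h in sample]), 3))
```

Every composition sums to 100 strains, and every locus has the same allele frequencies within a data set, so a single locus suffices to report the diversity.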
For each sample, Var(VD)old and Var(VD) were computed (using Equation 6 and Equation 10, respectively). In addition, the randomization method suggested by ![]()
When Var(VD)old was compared with Var(VD)MC, it was found that the two values diverged dramatically for input matrices of high genetic diversity (Figure 1). This causes similar divergence between true and estimated critical values (data not shown) and has implications for testing linkage equilibrium in bacterial populations that will be discussed later. Clearly, Equation 6 should not be used. No discrepancies were found between Var(VD)MC and the variance calculated with Equation 10 (see Figure 1).
The usefulness of (19) for hypothesis testing depends on whether the distribution of VD is approximately normal under our null hypothesis of linkage equilibrium with replicates being produced by shuffling of alleles on haplotypes. For multilocus data sets there are three variables that may influence the shape of the distribution of VD: the number of loci, the degree of diversity at each locus, and the number of strains. We investigated the effect of these three variables on the skewness of the distribution of VD through Monte Carlo simulation by calculating g1 as a measure of skewness from sets of resampled VD values,
![]() |
(20) |
; Figure 3). Sample size also had a strong effect on skewness. In general, the larger the sample, the closer the sampling distribution of VD approached normality (Table 1).
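A skewness statistic of this kind can be computed from a list of resampled VD values as sketched below. The form used is the sample skewness g1 as commonly given in biometry textbooks, with the (n - 1)(n - 2) correction; that it matches the expression used for the article is an assumption, and the function name is hypothetical.

```python
from math import sqrt


def g1(values):
    """Sample skewness: n * sum((x - mean)^3) / ((n - 1) * (n - 2) * s^3)."""
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample standard deviation
    third_moment_sum = sum((v - mean) ** 3 for v in values)
    return n * third_moment_sum / ((n - 1) * (n - 2) * s ** 3)


print(g1([1, 2, 3, 4, 5]))    # symmetric values
print(g1([1, 1, 1, 1, 10]))   # long right tail
```

A positive value indicates a longer upper tail, which is the direction that matters for a one-tailed test on VD.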
Given that the distribution of VD has positive skewness even for large samples, we investigated the effect of this deviation from normality on hypothesis testing. Data sets consisting of between 15 and 480 strains and 10 loci, each with genetic diversity of 0.444, were resampled to calculate the frequency with which VD exceeded the critical values that would be obtained if the distribution of VD was normal. Even for small data sets the discrepancy was slight. For instance, with 15 strains 6.69% of the resampled VD values exceeded the 5% normal critical value (Table 1). For a sample of 480 strains the discrepancy between 5.13% and 5.0% was negligible. Note that the probabilities of exceeding the normal critical values were always slightly too large, as would be expected from the positive skewness of the distribution of VD. For real data this means that whenever a sample has been diagnosed as being in linkage equilibrium, the same conclusion would be reached by Monte Carlo simulation. Further, the more time consuming it becomes to test the hypothesis of linkage equilibrium due to large sample size, the more useful our formula becomes. This is because the sampling distribution of VD approaches normality for large samples.
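The comparison described in this paragraph amounts to counting how often resampled VD values exceed a normal critical value. In the minimal Python sketch below, the mean and standard deviation of the resampled values stand in for E(VD) and the square root of Var(VD); this substitution is an assumption made to keep the example self-contained, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def exceedance_rate(resampled_vd, z=1.645):
    """Fraction of resampled values above mean + z * standard deviation."""
    threshold = mean(resampled_vd) + z * pstdev(resampled_vd)
    return sum(v > threshold for v in resampled_vd) / len(resampled_vd)
```

For a normal distribution the rate is close to 0.05; a rate somewhat above 0.05 is what positive skewness would be expected to produce.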
Several recent reports of panmixis in bacteria have used the observed variance of pairwise differences (VD) as a test statistic. Panmixis was concluded if the critical value of VD was greater than the observed value of VD.
(H. spontaneum) = 0.145 (![]()
Bacterial populations:
To test the usefulness of this derivation in the study of bacterial population genetics, we investigated published allozyme data for the ECOR collection of E. coli (![]()
Generally we observed that bacterial populations are highly diverse (mean genetic diversity = 0.311 to 0.691; Table 2) and that the genetic diversity varies strongly between loci (standard deviation = 0.178 to 0.304; Table 2). Further, the distribution of VD displayed positive skewness in all cases, as observed in the simulations (Table 2).
E. coli:
As expected from previous work (![]()
B. japonicum:
![]()

rather than on the unbiased estimator (Equation 4) employed in this study. Using hbj, Lold = 3.996, which is slightly greater than VD = 3.985. This result is due to the large difference between biased and unbiased estimators of the genetic diversity per locus in a sample consisting of only 17 ETs.
R. leguminosarum:
![]()
B. subtilis:
![]()
N. gonorrhoeae:
This group of bacteria is the best established example of a bacterial population in linkage equilibrium. An extensive allozyme data set comprising 228 isolates has been published and reported to be in linkage equilibrium.
For all the bacterial populations tested, LMC and Lnew agreed well. This contrasted with the strong divergence of Lold from LMC, which led to conflicting conclusions about the genetic structure of E. coli, B. japonicum, R. leguminosarum, and B. subtilis. Using computer simulations, ![]()
We conclude that past attempts to detect linkage disequilibrium in haploid multilocus data sets through the computation of a critical value for VD were based on an erroneous formula for the variance of VD. The correct formula for Var(VD) communicated in this article forms the basis of a simple test of linkage. Furthermore, we find that VD is approximately normally distributed (especially for large samples). Hence the algebraic test proposed here is a useful alternative to Monte Carlo simulation in cases where simulation is deemed too expensive or time consuming. A computer program written in FORTRAN77, which implements both the algebraic and the Monte Carlo tests, can be obtained from B.H. upon request.
| ACKNOWLEDGMENTS |
|---|
We thank J. Maynard Smith for first drawing our attention to the problem of testing linkage equilibrium from mismatch data and for helpful discussion. Thanks are also due to P. J. Bottomley for providing the Rhizobium allozyme data, and to T. S. Whittam and two anonymous reviewers for comments on the manuscript. This work was supported by grants from the Royal Society, Oxford University and the Biotechnology and Biological Sciences Research Council (United Kingdom).
Manuscript received April 2, 1998; Accepted for publication August 21, 1998.
| LITERATURE CITED |
|---|
BOTTOMLEY, P. J., H.-H. CHENG, and S. R. STRAIN, 1994 Genetic structure and symbiotic characteristics of a Bradyrhizobium population recovered from a pasture soil. Appl. Environ. Microbiol. 60:1754-1761.
BROWN, A. H. D., M. W. FELDMAN, and E. NEVO, 1980 Multilocus structure of natural populations of Hordeum spontaneum. Genetics 96:523-536.
DUNCAN, K. E., N. FERGUSON, K. KIMURA, X. ZHOU, and C. ISTOCK, 1994 Fine-scale genetic and phenotypic structure in natural populations of Bacillus subtilis and Bacillus licheniformis: implications for bacterial evolution and speciation. Evolution 48:2002-2025.
GO, M. F., V. KAPUR, D. Y. GRAHAM, and J. M. MUSSER, 1996 Population genetic analysis of Helicobacter pylori by multilocus enzyme electrophoresis: extensive allelic diversity and recombinational population structure. J. Bacteriol. 178:3934-3938.
GUTTMAN, D. S. and D. E. DYKHUIZEN, 1994 Clonal divergence in Escherichia coli as a result of recombination, not mutation. Science 266:1380-1383.
HAUBOLD, B. and P. B. RAINEY, 1996 Genetic and ecotypic structure of a fluorescent Pseudomonas population. Mol. Ecol. 5:747-761.
HUDSON, R. R., 1994 Analytical results concerning linkage disequilibrium in models with genetic transformation and conjugation. J. Evol. Biol. 7:535-548.
ISTOCK, C. A., K. E. DUNCAN, N. FERGUSON, and X. ZHOU, 1992 Sexuality in a natural population of bacteria: Bacillus subtilis challenges the clonal paradigm. Mol. Ecol. 1:95-103.
MARUYAMA, T. and M. KIMURA, 1980 Genetic variability and effective population size when local extinction and recolonization of subpopulations are frequent. Proc. Natl. Acad. Sci. USA 77:6710-6714.
MAYNARD SMITH, J., 1994 Estimating the minimum rate of genetic transformation in bacteria. J. Evol. Biol. 7:525-534.
MAYNARD SMITH, J., N. H. SMITH, C. G. DOWSON, and B. G. SPRATT, 1993 How clonal are bacteria? Proc. Natl. Acad. Sci. USA 90:4384-4388.
MILKMAN, R., 1973 Electrophoretic variation in Escherichia coli from natural sources. Science 182:1024-1026.
MILLER, R. D. and D. L. HARTL, 1986 Biotyping confirms a nearly clonal population structure in Escherichia coli. Evolution 40:1-12.
OCHMAN, H. and R. K. SELANDER, 1984 Standard reference strains of Escherichia coli from natural populations. J. Bacteriol. 157:690-693.
O'ROURKE, M. and E. STEVENS, 1993 Genetic structure of Neisseria gonorrhoeae populations: a non-clonal pathogen. J. Gen. Microbiol. 139:2603-2611.
ROBERTS, M. S. and F. M. COHAN, 1995 Recombination and migration rates in natural populations of Bacillus subtilis and Bacillus mojavensis. Evolution 49:1081-1094.
SELANDER, R. K. and B. R. LEVIN, 1980 Genetic diversity and structure in Escherichia coli populations. Science 210:545-547.
SOKAL, R. R. and F. J. ROHLF, 1981 Biometry, Ed. 2. W. H. Freeman, New York.
SOUZA, V., T. T. NGUYEN, R. R. HUDSON, D. PIÑERO, and R. E. LENSKI, 1992 Hierarchical analysis of linkage disequilibrium in Rhizobium populations: evidence for sex? Proc. Natl. Acad. Sci. USA 89:8389-8393.
STRAIN, S. R., T. S. WHITTAM, and P. J. BOTTOMLEY, 1995 Analysis of genetic structure in soil populations of Rhizobium leguminosarum recovered from the USA and the UK. Mol. Ecol. 4:105-114.
WHITTAM, T. S., H. OCHMAN, and R. K. SELANDER, 1983 Multilocus genetic structure in natural populations of Escherichia coli. Proc. Natl. Acad. Sci. USA 80:1751-1755.
WISE, M. G., L. J. SHIMKETS, and J. V. MCARTHUR, 1995 Genetic structure of a lotic population of Burkholderia (Pseudomonas) cepacia. Appl. Environ. Microbiol. 61:1791-1798.




















[...], by resampling 10,000 times without replacement, or by using [...].

[...] diverge at both extremes of the distribution, although for testing the hypothesis of linkage equilibrium only the positive skew apparent in the high cumulative probability values is of interest. The resampled artificial input data set consisted of 100 strains and 10 loci, each with a genetic diversity of 0.558.






