Genetics, Vol. 161, 1641-1650, August 2002, Copyright © 2002

The Matrix Coalescent and an Application to Human Single-Nucleotide Polymorphisms

Stephen Woodinga and Alan Rogersb
a Eccles Instititute of Human Genetics, University of Utah, Salt Lake City, Utah 84112-5330
b Department of Anthropology, University of Utah, Salt Lake City, Utah 84112-0060

Corresponding author: Stephen Wooding, University of Utah, 15 N. 2030 E., Salt Lake City, UT 84112-5330., swooding{at}genetics.utah.edu (E-mail)

Communicating editor: Y.-X. FU


*  ABSTRACT
*TOP
*ABSTRACT
*MODEL
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

The "matrix coalescent" is a reformulation of the familiar coalescent process of population genetics. It ignores the topology of the gene tree and treats the coalescent as a Markov process describing the decay in the number of ancestors of a sample of genes as one proceeds backward in time. The matrix formulation of this process is convenient when the population changes in size, because such changes affect only the eigenvalues of the transition matrix, not the eigenvectors. The model is used here to calculate the expectation of the site frequency spectrum under various assumptions about population history. To illustrate how this method can be used with data, we then use it in conjunction with a set of SNPs to test hypotheses about the history of human population size.


THE history of population size is a point of general interest in studies of biological variation. Among other things, population size changes can affect levels of heterozygosity, allele frequency, and the extent of linkage disequilibrium (HARPENDING et al. 1998 Down; TERWILLIGER et al. 1998 Down). In humans, these effects are an important consideration in problems ranging from evolutionary biology to gene mapping. Thus, information about long-term population size contributes to the understanding of both ancient human history and modern human biology.

Information about the history of human population size comes from a variety of sources. Archaeology, paleoanthropology, linguistics, and historical documentation are all important. Over the last 25 years, however, genetic evidence has risen to the forefront. By providing information inaccessible through traditional means, genetic data play a key role in inferences about the ancient human past. Central to this role are the theoretical tools of population genetics, which attempt to describe the relationship between demography and genetic diversity. Among these tools, models of the coalescent process have distinguished themselves as a way to extract information about past patterns of population size change from present patterns of genetic variation (FU and LI 2001 Down).

The coalescent process (KINGMAN 1982A Down; HUDSON 1990 Down) describes the ancestry of a sample of genes. As we trace the ancestry of each modern gene backward from ancestor to ancestor, we occasionally encounter common ancestors—genes whose descendants include more than one gene in the modern sample. Each time this happens, the number of ancestors decreases by one. Eventually, we reach the gene that is ancestral to the entire modern sample, and the process ends. This process provides a natural description of genetic variation, which can be described both in terms of the topological (or genealogical) relationships among genetic lineages and the genetic distances (or coalescence times) between them.

The coalescent process is an example of a Markov process—a stochastic process in which the probability of moving from one state to another depends only on the state you are in, not on the states you have previously visited. In previous literature, attention has focused on the Markov chain that governs not only the lengths of the intervals between coalescent events but also the topology of the resulting gene genealogy (e.g., TAKAHATA 1988 Down). In this article, we introduce a reduced version of the Markov chain that ignores topology and deals only with the lengths of intervals. Our procedure has a number of advantages, especially in dealing with variation in population size. After introducing the model, we use it to study a set of human single-nucleotide polymorphisms (SNPs).


*  MODEL
*TOP
*ABSTRACT
*MODEL
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

The matrix coalescent:
If time is measured backward into the past, and a sample of k lineages is selected t generations before present from a haploid population with size N(t), then the probability that the k sampled lineages have k - 1 distinct ancestors t + 1 generations before present is approximately

(1)

(HUDSON 1990 Down). For diploid populations, 2N(t) can be replaced by 4N(t).

A sample of n lineages gathered at the present (t = 0 generations ago) will have a genealogy proceeding from the state of having n distinct lineages to the state of having n - 1 lineages, and so on down to one lineage, at a rate determined by the transition probabilities {alpha}n(t), {alpha}n-1(t), ... , {alpha}2(t). In general, the probability, pk(t), of observing k lineages t generations before present where n >= k >= 1 is described by a system of recurrence equations

(2)

with initial condition pn(0) = 1, pn-1(0) = 0, ... , p1(0) = 0.

In calculations, we exclude terms for the absorbing state, in which there is just a single lineage. This is not restrictive, since we can always calculate

In matrix notation, Equation 2 becomes

(3)

where p(t) is a column vector with entries p2(t), p3(t), ... , pn(t), where I is the identity matrix, and where

is the transition rate matrix. Equation 3 can be approximated in continuous time by an ordinary differential equation

(4)

which is solved by

(5)

The entries, pi(0), of the initial vector p(0) are defined above.

Eigenvalues and eigenvectors:
Since A(t) is a triangular matrix, its eigenvalues are equal to its diagonal entries: -{alpha}2(t), ... , -{alpha}n(t). The column eigenvectors of A(t) are defined by the equation , where {lambda} is a scalar—one of the eigenvalues of A—and c a column eigenvector. This equation can be reexpressed (suppressing t) as

(6)

where ci is the ith entry in vector c. The jth eigenvector is calculated by setting , setting c1 to an arbitrary constant, and then applying (6) repeatedly. When i = j, this equation becomes cj+1 = cj x 0. Consequently ci = 0 for all i > j, and the matrix C of column eigenvectors is upper triangular.

Equation 6 also implies that the column eigenvectors are time invariant: Substitute (1) into (6) for the jth column eigenvector to obtain

Since this expression does not depend on t, the matrix C of column eigenvectors is time invariant.

The row eigenvectors of A are defined by rA = {lambda}r, where {lambda} is an eigenvalue of A and r is the corresponding row eigenvector. This equation can be reexpressed as

(7)

and row eigenvectors can be calculated iteratively in the same way as column eigenvectors. Like C, the matrix R of row eigenvectors will be upper triangular and time invariant.

Before these eigenvectors can be used, they must be normalized so that RC = I, where I is the identity matrix. Since both matrices are upper triangular, this requires only that, for eigenvector j, we ensure that rjcj = 1. Our computer program normalizes the eigenvectors by setting rj = cj = 1.

By expanding the matrix exponential in Equation 5 in diagonal form, we obtain

(8)

where P(t) is a diagonal matrix whose xth diagonal element is

(9)

and p(0) = [0, 0, ... , 1] as described for (2). The kth element of p(t) contains the probability of observing k distinct lineages t generations before present when population size is described by the function N(t).

The second row of plots in Fig 1A shows how pk(t) varies with t for several different values of k and under three population histories: a sudden population increase, a gradual increase, and a gradual increase with periodic cycling.



View larger version (25K):
In this window
In a new window
Download PPT slide
 
Figure 1. Theoretical expectations. Shown are the relationships between population history, probability of coalescence, expected interval length, and theoretical frequency spectrum under three population histories for samples of n = 5 lineages. (a) Top, population size over time; middle, the probability that there are still k distinct lineages in the genealogy t generations ago, given the population history at top; bottom, expected interval lengths given the population history at top. Dashed vertical lines indicate that no particular branching order is implied for the genealogies. (b) Normalized frequency spectra for the genealogies represented in a. These results were generated using the Maple 5.1 software package using default numerical precision.

Expected lengths of coalescent intervals:
Let m denote the vector whose kth entry, mk, is the expected duration in generations of the interval during which the process contains k lineages. There is a close relationship between m and p: The kth entry in p(t) is the probability that generation t makes a contribution to the interval during which the process has k lineages. To calculate m, we integrate across p:

(10)

After substituting Equation 8, this becomes

(11)

where E is a diagonal matrix whose xth diagonal element is

(12)

(ROSS 1997 Down, Chap. 5). The third row of plots in Fig 1A shows the expected length of each coalescent interval under several hypothetical population histories.

The theoretical frequency spectrum of mutations:
The frequency spectrum is the distribution describing the relative abundance of alleles occurring i = 1, 2, ... , n - 1 times in a sample of n homologous genes. Spectra from populations that have increased in size show an overabundance of rare variants relative to populations of constant size, but populations that have decreased show an underabundance (HARPENDING et al. 1998 Down; WOODING 1999 Down). The sensitivity of the frequency spectrum to population size change is exploited in several statistical tests of stationarity or neutrality (TAJIMA 1989 Down; FU 1997 Down).

A polymorphic nucleotide site is ordinarily present in only two states within a sample, one of which is ancestral and the other derived. The expected fraction, {sigma}k, of sites at which the derived allele occurs k times is given by

(13)

where n is the number of DNA sequences in the sample, mj is the expected length of the coalescent interval containing j distinct lineages, and y(j, k, n) is the probability that a single lineage within coalescent interval j has k descendants in a sample of size n. This equation is derived in APPENDIX A.

The probability y(j, k, n) is given by Polya's distribution:

(FELSENSTEIN 1992 Down; SHERRY et al. 1997 Down; see also Equation 21 in FU 1995 Down). Fig 1B shows the theoretical frequency spectrum under several assumptions about population history.

Numerical methods:
Equation 5 contains a matrix exponential, and these are notoriously difficult to evaluate numerically (MOLER and LOAN 1978 Down). It does not help to expand the exponential in terms of eigenvalues and eigenvectors. When the sample size is much over 50, the two eigenvector matrices in Equation 8 will contain very large numbers as well as small ones. Even worse, the entries of each row of C alternate in sign, leading to severe cancellation errors in the matrix product on the right-hand side of Equation 8. With samples of even moderate size, straightforward evaluation of these equations can produce results without any significant digits.

We deal with these problems in two different ways. For population histories of arbitrary complexity, we resort to brute force and use the CLN-1.0.1 programming library (HAIBLE 2000 Down) to perform computations either with high-precision floating-point numbers or with rational numbers. By varying the precision, it is possible to determine how many digits of precision are needed. Some of our calculations were performed with floating-point numbers using 500 decimal digits of precision.

Better alternatives are available when the population's history is piecewise constant. By this we mean that the history is divided into a series of epochs within each of which N(t) is constant. If the number of epochs is large, the piecewise constant model can approximate any history of population size. Even with only a few epochs, it is probably realistic for populations whose sizes are ordinarily held constant by density-dependent population regulation.

For such histories, we use the "uniformization" algorithm of STEWART 1994 Down(Chap. 8), to evaluate Equation 5 across a single epoch of the population's history. This makes it possible to project the vector p backward in time epoch by epoch. With this method, double-precision floating-point calculations are able to deal with problems involving samples of at least 1000.

This method for projecting p backward in time also makes it easy to calculate m. Details are given in Appendix B.

Statistical methods:
Under the assumption that the genealogies of unlinked sites are statistically independent, the log-likelihood of an observed data set (D) given a hypothetical population history (H) is

where Sk is the number of sites occurring k times in the sample and {sigma}k is the probability of a variant site occurring k times in the sample. If different sample sizes are used for different loci, {sigma}k changes from site to site. The ratios of likelihoods under different population histories can be compared using standard likelihood-ratio tests (BULMER 1979 Down; EDWARDS 1992 Down).

Application to human SNPs:
SNPs are a potentially valuable source of information about population history: They are abundant, they are spread widely across the genome, and they are relatively inexpensive to assay. Most studies of SNPs are focused on their potential epidemiological applications (e.g., CARGILL et al. 1999 Down; HALUSHKA et al. 1999 Down). SNPs have also been exploited as a source of information about the process of natural selection (SUNYAEV et al. 2000 Down; FAY et al. 2001 Down). We focus here on human population history, although we include some discussion of selective processes.

CARGILL et al. 1999 Down surveyed SNPs in 196.2 kb of nuclear DNA sequence in 20 Europeans, 14 Asians, 10 African Americans, and 7 African Pygmies. Most of the sequence was from the coding portion of genes implicated in cardiovascular, endocrine, and neuropsychiatric diseases, but some noncoding sequence was sequenced in flanking and intervening regions. Each amplified segment was screened by both DNA sequencing and denaturing high-performance liquid chromatography, and every putative SNP was verified by resequencing (CARGILL et al. 1999 Down). In total, 612 SNPs were identified in 106 genes.

The laboratory methodology used by CARGILL et al. 1999 Down avoided some problems such as false positives, but two features of the SNP data made analysis difficult. First, the SNPs were a combination of linked and unlinked loci. Second, different SNP loci were assayed in different numbers of chromosomes. Some SNPs were sampled in 28 chromosomes, for example, while others were sampled in 114. To cope with these problems, CARGILL et al. 1999 Down were forced in some analyses to rely on doubtful assumptions. The matrix coalescent provides an alternative approach. It cannot accomodate sites with varying levels of linkage, but likelihood-ratio tests can take varying sample sizes into account.

To take advantage of the informativeness of unlinked sites and to avoid the confounds associated with partial linkage, we resampled the original data set randomly in three steps:

  1. All of the SNPs reported by CARGILL et al. 1999 Down were divided into the three categories reported originally: coding nonsynonymous (cns) and coding synonymous (cs) and noncoding (nc) sites near genes.

  2. To minimize linkage between sampled sites, only one randomly chosen SNP from each category was scored for each reported gene. If no SNPs in a category were found in a given gene, then no SNP in that category was chosen from the gene.

  3. The number of sites in each of the frequency categories reported in CARGILL et al. 1999 Down(0–5%, 5–15%, and 15–50%) was tabulated for cns, cs, and nc SNPs using the dbSNP database (SHERRY et al. 2000 Down, SHERRY et al. 2001 Down).

Totals of 60 cns loci, 68 cs loci, and 30 nc loci were included in the randomized data set, which was composed of sites from at least 19 different chromosomes (Table 1). The sites within each category, which were always from different genes and often from different chromosomes, were assumed to be unlinked.


 
View this table:
In this window
In a new window

 
Table 1. Sampled SNPs

SNPs occurring k times could not be distinguished from SNPs occurring n - k times for roughly one-half of the SNPs in the original data set, so theoretical spectra were "folded" at frequency 0.5 in tests here, as described by HARPENDING et al. 1998 Down.

Likelihoods of hypotheses given the observed frequency spectra were generated for each data set over a series of hypothetical population histories. Although the matrix coalescent can cope with very complicated models of history, it is doubtful that we could estimate more than a few parameters with the data at hand. We have therefore limited our analysis to piecewise-constant population histories containing two history epochs. We define these histories using three parameters: N0 is the population size during the most recent epoch (epoch 0), N1 is that in the earlier epoch (epoch 1), and T is the duration of epoch 0 in generations. Epoch 1 is assumed to have infinite duration. Our analysis loses a degree of freedom because the data are a collection of polymorphic sites and do not inform us about the fraction of sites that are polymorphic within the region of the genome under study. Thus, instead of working directly with the three parameters just defined, we work instead with two: and . Here, {rho} is a parameter representing the magnitude of population growth and {tau} is a parameter representing the time of population size change. Each parameter introduced 1 d.f. in likelihood-ratio tests. Maximum-likelihood estimates of {rho} and {tau} were obtained for each SNP category by iterating over a series of values of {rho} and {tau}.

Five hypotheses were tested for each SNP category. First, the maximum likelihood of each category was compared with the category's likelihood under the maximum-likelihood parameters of the other two categories (Fig 2). Then the maximum likelihood of each SNP category was tested against the category's likelihood of three alternatives: (a) stationarity, (b) the most recent population expansion not excluded by ROGERS 1995 Down(), and (c) the most ancient population expansion not excluded by ROGERS 1995 Down(); (see Fig 2).



View larger version (14K):
In this window
In a new window
Download PPT slide
 
Figure 2. Pitch plot of deviations from expectation. The deviation of each frequency category (0–5%, 5–15%, and 15–50%) from expectation in each SNP category (cns, cs, and nc is shown). Deviations given are the mean difference from expectation across all sample sizes within a SNP category, since not all loci were sampled in the same number of chromosomes. The hypotheses are as follows: (a) stationarity, constant population size, and selective neutrality; (b) mtDNA (recent), the most recent population expansion not rejected by ROGERS 1995 Down on the basis of mtDNA polymorphism (see MODEL); (c) mtDNA (ancient), the most ancient population expansion not rejected by ROGERS 1995 Down; (d) the maximum-likelihood parameters for coding nonsynonymous SNPs (cns); (e) the maximum-likelihood parameters for coding synonymous SNPs (cs); (f) the maximum-likelihood parameters for noncoding SNPs near genes (nc). The probability of the observed data given the history is indicated for hypotheses that were rejected.

CARGILL et al. 1999 Down found that the frequency distribution of cs and nc SNPs differed significantly from that of cns SNPs, and that cns SNPs showed an excess of low-frequency variants. FAY et al. 2001 Down found differences between the frequency spectra of synonymous and nonsynonymous changes in the CARGILL et al. 1999 Down data set as well. Our parameter estimates confirm these results (Fig 3). The cns category had maximum-likelihood parameters implying recent population growth under the assumption of selective neutrality , and the cs and nc categories yielded estimates implying little or no change in population size ({rho} = 0.4 for cs and 0.6 for nc).



View larger version (12K):
In this window
In a new window
Download PPT slide
 
Figure 3. Maximum-likelihood parameter estimates. Open circles show the parameters of two alternative hypotheses estimated from mitochondrial DNA (see MODEL).

Maximum-likelihood estimates for the nc data set were not rejected as an explanation for the cs data set at the 0.05 level, but the maximum-likelihood estimates for nc and cs data sets were rejected as an explanation for the cns data set. The maximum-likelihood parameters for the cns data set were almost (but not quite) rejected as an explanation for the nc (p < 0.08). The nc and cs data were indistinguishable, but both could be distinguished from cns (Fig 2). In addition, the cns data showed an excess of low frequency variation relative to expectations under stationarity, as CARGILL et al. 1999 Down also observed.

The failure of likelihood-ratio tests to distinguish between cs and nc categories is a result of their similar {rho} estimates. When {rho} is near 1 the time of population size change has little effect on the frequency spectrum, and confidence intervals around {tau} are broad. When {rho} is exactly 1 they extend to infinity regardless of sample size. Given the nearness of the nc SNPs to coding regions, the similarity of nc and cs frequency spectra is consistent.

If evolutionary processes in SNPs are neutral, then the three categories should be indistinguishable, yet clearly they are not. The frequency spectrum in cns SNPs differs from that of nc and cs SNPs, and none of the observed spectra is consistent with hypotheses about human population growth inferred from mtDNA.


*  DISCUSSION
*TOP
*ABSTRACT
*MODEL
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

The model introduced here differs from the coalescent theory introduced by KINGMAN 1982A Down in that it ignores the topology of the gene genealogy. This simplified theory has a smaller state space than the classical theory, and it is easy to apply the elementary methods of the theory of Markov chains. This opens up opportunities for the study of populations that vary in size.

Conventional coalescent theory can deal with varying population sizes, as well: One simply uses 1/N(t) as the unit of time in generation t. [This procedure was suggested by KINGMAN 1982B Down(p. 31) and has been used by many later authors; we use it in Appendix B of this article.] However, this procedure is awkward when mutations are introduced, because mutations occur at a constant rate on the normal (not the rescaled) timescale. This has complicated efforts to calculate quantities like the expected site frequency spectrum under models of varying population size. Such problems are easier under the formulation introduced here.

The results of this study clearly reject the hypothesis that the cns, cs, and nc SNP data were produced by drift and mutation alone under a model of recent population expansion. The simplest explanation for the present results, taken in isolation, is that human population size has been constant, but some form of selection has affected the cns data. The preponderance of low-frequency polymorphisms in those data is consistent either with purifying selection acting on linked sites or with a selective sweep (BRAVERMAN et al. 1995 Down; FU 1997 Down). FAY et al. 2001 Down, for example, found evidence for purifying selection in an analysis of the ratios of synonymous and nonsynonymous variants in different frequency categories in the CARGILL et al. 1999 Down data set. Similar patterns of variation have been attributed to weak purifying selection elsewhere (PRZEWORSKI et al. 1999 Down).

Yet the present results should probably not be taken in isolation. Genetic data from substantial human samples involving a variety of genetic systems are now published. These can be divided into two categories: noncoding regions that on a priori grounds ought to be selectively neutral and coding regions (or closely linked introns) that on a priori grounds are more likely to be selected. The presumably neutral systems all show evidence either of population growth or of a selective sweep. (We cannot tell the difference.) The presumably selected systems are all consistent either with neutral evolution under constant population size or with weak balancing selection. To account for this strange pattern, HARPENDING and ROGERS 2000 Down suggested that population growth did in fact occur during the Late Pleistocene, but that its signature has been obscured in the coding portions of the human genome by pervasive balancing selection. Additional data sets that have appeared since then have been consistent with this hypothesis (ROGERS 2001 Down). Thus, it is natural to wonder whether the present data set is also consistent with this hypothesis. Let us consider, then, the possibility that the present data reflect the simultaneous effects of population growth and selection.

In the absence of selection, population growth produces a genealogy without deep branches. Balancing selection has the opposite effect; it may maintain two or more allelic classes for a very long time. Since balancing selection and population growth affect genealogies in opposite ways, each tends to obscure the effect of the other. These countervailing effects, however, would not be reflected equally in our three categories of data. Many mutations would occur on the long branches that separate allelic classes, but only the neutral mutations would survive long. Consequently, these long branches would contribute mainly to the SNPs in our cs and nc categories. This is of interest because mutations that occur on the deepest branches of the genealogy can have intermediate frequencies (i.e., far from 0 or 1). Thus, balancing selection inflates the count of loci with intermediate frequencies, but this effect is visible mainly in the the cs and nc categories. Since mutations on deep branches contribute less to the cns category, balancing selection is less likely to obscure the effect of population growth there. Thus, cns SNPs are more likely to show the elevated count of alleles with extreme frequencies (near 0 or 1) that one associates with a population expansion. The count of extreme-frequency cns SNPs should be additionally elevated by recent deleterious mutations that have not yet been removed by purifying selection. For both reasons, the count of extreme-frequency loci should be larger among cns SNPs than among cs or nc SNPs. This is exactly the pattern that we observe.

There are undoubtedly other ways to explain these data, and there is no good reason for confidence in the hypothesis we just proposed. Our point is merely that the present data are consistent with the view that the human population underwent an expansion whose effects are visible in data from neutral loci but are hidden by balancing selection at protein-coding loci.


*  ACKNOWLEDGMENTS

Henry Harpending, Jon Seger, Stewart Ethier, John Hawks, Pat Corneli, David Witherspoon, Josh Cherry, Pui-Yan Kwok, Brad Demarest, and Lara Carroll provided helpful comments and discussion. Nelson Beebe provided helpful advice on numerical methods. Yun-Xin Fu and two anonymous reviewers provided helpful comments. S.W. was supported by a National Institutes of Health (NIH) Genome Sciences Training Grant (Genome Informatics) to the University of Utah. A.R. was supported by NIH grant GM-59290 to the University of Utah. Software developed for this project is available at http://www.anthro.utah.edu/~rogers/src.

Manuscript received July 5, 2001; Accepted for publication May 10, 2002.


*  APPENDIX A
*TOP
*ABSTRACT
*MODEL
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

THE EXPECTED SITE FREQUENCY SPECTRUM
We assume that mutations are rare enough that the possibility of multiple mutations in a single gene genealogy can be ignored. Let Aj denote the event that exactly one mutation occurs within the portion of the genealogy containing j lineages, B the event that exactly one mutation occurs within the genealogy as a whole, and Pr[Aj|B] the conditional probability of Aj given B. The conditional probability that the mutant site will appear k times within a sample, given B, is

(14)

Using Bayes' rule,

(15)

Here, because event B occurs whenever Aj does.

To calculate the unconditional probability of Aj, let Lj denote the length of the jth coalescent interval in a random gene tree. Then jLj is the total branch length associated with that coalescent interval. The conditional probability, given Lj, that a single mutation occurs within this interval is

where µ is the mutation rate, and we assume a Poisson distribution of mutations. The approximation here assumes that µ2 is negligible in comparison to µ. The unconditional probability of Aj is

where mj is the expected value of Lj.

A similar argument gives

where the sum on the right is the expected length of the gene tree as a whole. Substituting these results back into Equation 15 and Equation 14 gives Equation 13.


*  APPENDIX B
*TOP
*ABSTRACT
*MODEL
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

CALCULATING EXPECTED INTERVAL LENGTHS UNDER PIECEWISE CONSTANT POPULATION HISTORIES
Our goal in this section is to calculate the vector m, which contains the expected lengths of the intervals between coalescent events. To simplify the problem, we first separate N(t) from A(t) by defining

and

With these definitions, substitution of (5) into (10) gives

(16)

where and

Suppose now that the population's history is divided into K + 1 epochs within each of which N(t) is constant. We can reexpress F(0, {infty}) as a sum of contributions from these epochs:

Here, (0, v1) is the interval of variable v that is encompassed by history epoch 0, (v1, v2) is that encompassed by epoch 1, and (vK, {infty}) is that encompassed by epoch K. Within the interval between vi and vi+1, the population size is a constant, Ni. Consequently, these integrals can be evaluated directly. For epochs of finite length,

For the final epoch, which has infinite length, this becomes

To recover m from Equation 16, we must right multiply each of the F's by p(0), a process that yields

where

is the result of projecting the initial vector, p(0), backward by v units of time under the assumption that , a constant.

Our computer program uses the projection methods discussed previously to calculate the probability vectors (v), then subtracts pairs of vectors, and finally applies NiB-1. This last step is easy. For example, if n = 4,

where -ß2, -ß3, and -ß4 are the diagonal entries of B.


*  LITERATURE CITED
*TOP
*ABSTRACT
*MODEL
*DISCUSSION
*APPENDIX A
*APPENDIX B
*LITERATURE CITED

BRAVERMAN, J. M., R. R. HUDSON, N. L. KAPLAN, C. H. LANGLEY, and W. STEPHAN, 1995  The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics 140:783-796.[Abstract]

BULMER, M. G., 1979 Principles of Statistics. Dover Publications, New York.

CARGILL, M., D. ALTSHULER, J. IRELAND, P. SKLAR, and K. ARDLIE et al., 1999  Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22:231-238.[Medline]

EDWARDS, A. W. F., 1992 Likelihood. The Johns Hopkins University Press, Baltimore.

FAY, J. C., G. J. WYCKOFF, and C.-I WU, 2001  Positive and negative selection on the human genome. Genetics 158:1227-1234.[Abstract/Free Full Text]

FELSENSTEIN, J., 1992  Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates. Genet. Res. 59:139-147.[Medline]

FU, Y.-X., 1995  Statistical properties of segregating sites. Theor. Popul. Biol. 48:172-197.[Medline]

FU, Y.-X., 1997  Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147:915-925.[Abstract]

FU, Y.-X. and W.-H. LI, 2001  Coalescing into the 21st century: an overview and prospects of coalescent theory. Theor. Popul. Biol. 56:1-10.

HAIBLE, B., 2000 CLN: Class Library for Numbers Version 1.0.1. Computer program distributed by the author, http://clisp.cons.org/~haible/packages-cln.html.

HALUSHKA, M. K., J.-B. FAN, K. BENTLEY, L. HSIE, and N. SHEN et al., 1999  Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet. 22:239-247.[Medline]

HARPENDING, H. C. and A. R. ROGERS, 2000  Genetic perspectives on human origins and differentiation. Annu. Rev. Genomics Hum. Gen. 1:361-385.

HARPENDING, H. C., M. A. BATZER, M. GURVEN, L. B. JORDE, and A. R. ROGERS, 1998  Genetic traces of ancient demography. Proc. Natl. Acad. Sci. USA 95:1961-1967.[Abstract/Free Full Text]

HUDSON, R. R., 1990 Gene genealogies and the coalescent process, pp. 1–44 in Oxford Series in Evolutionary Biology, Vol. 7, edited by D. FUTUYMA and J. ANTONOVICS. Oxford University Press, Oxford.

KINGMAN, J. F. C., 1982a  The Coalescent.. Stoc. Proc. Appl. 13:235-248.

KINGMAN, J. F. C., 1982b  On the genealogy of large populations. J. Appl. Prob. 19a:27-43.

MOLER, C. B. and C. F. V. LOAN, 1978  Nineteen dubious ways to compute the exponential of a matrix. SIAM Rev. 20:801-836.

PRZEWORSKI, M., B. CHARLESWORTH, and J. D. WALL, 1999  Genealogies and weak purifying selection. Mol. Biol. Evol. 16:246-252.[Abstract]

ROGERS, A. R., 1995  Genetic evidence for a Pleistocene population explosion. Evolution 49:608-615.

ROGERS, A. R., 2001  Order emerging from chaos in human evolutionary genetics. Proc. Natl. Acad. Sci. USA 98:779-780.[Free Full Text]

ROSS, S., 1997 A First Course in Probability, Ed. 5. Prentice Hall, Upper Saddle River, NJ.

SHERRY, S. T., H. C. HARPENDING, M. A. BATZER, and M. STONEKING, 1997  Alu evolution in human populations: Using the coalescent to estimate effective population size. Genetics 147:1977-1982.[Abstract]

SHERRY, S. T., M. WARD, and K. SIROTKIN, 2000  Use of molecular variation in the NCBI dbSNP database. Hum. Mutat. 15:68-75.[Medline]

SHERRY, S. T., M. H. WARD, M. KHOLODOV, J. BAKER, and L. PHAN et al., 2001  dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29:308-311.[Abstract/Free Full Text]

STEWART, W. J., 1994 Introduction to the Numerical Solution of Markov Chains. Princeton University Press, Princeton, NJ.

SUNYAEV, S. R., W. LATHE, V. E. RAMENSKY, and P. BORK, 2000  SNP frequencies in human genes: an excess of rare alleles and differing modes of selection. Trends Genet. 16:335-337.[Medline]

TAJIMA, F., 1989  Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585-595.[Abstract/Free Full Text]

TAKAHATA, N., 1988  The coalescent in two partially isolated diffusion populations. Genet. Res. 52:213-222.[Medline]

TERWILLIGER, J. D., S. ZÖLLNER, M. LAAN, and S. PÄÄBO, 1998  Mapping genes through the use of linkage disequilibrium generated by genetic drift: "drift mapping" in small populations with no demographic expansion. Hum. Hered. 48:138-154.[Medline]

WOODING, S., 1999 TreeToy Coalescent Simulation Version 1.0b. Computer program distributed by the author, http://www.anthro.utah.edu/popgen/programs/TreeToy




This article has been cited by other articles:


Home page
Genome ResHome page
I. Hellmann, Y. Mang, Z. Gu, P. Li, F. M. de la Vega, A. G. Clark, and R. Nielsen
Population genetic analysis of shotgun assemblies of genomic sequences from multiple individuals
Genome Res., July 1, 2008; 18(7): 1020 - 1029.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
A. M. Adams and R. R. Hudson
Maximum-Likelihood Estimation of Demographic Parameters Using the Frequency Spectrum of Unlinked Single-Nucleotide Polymorphisms
Genetics, November 1, 2004; 168(3): 1699 - 1712.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
G. T. Marth, E. Czabarka, J. Murvai, and S. T. Sherry
The Allele Frequency Spectrum in Genome-Wide Human Variation Data Reveals Signals of Differential Demographic History in Three Large World Populations
Genetics, January 1, 2004; 166(1): 351 - 372.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
A. Polanski and M. Kimmel
New Explicit Expressions for Relative Frequencies of Single-Nucleotide Polymorphisms With Application to Statistical Inference on Population Growth
Genetics, September 1, 2003; 165(1): 427 - 436.
[Abstract] [Full Text] [PDF]