- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Grassly, N. C.
- Articles by Holmes, E. C.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Grassly, N. C.
- Articles by Holmes, E. C.
Population Dynamics of HIV-1 Inferred From Gene Sequences
Nicholas C. Grasslya, Paul H. Harveya, and Edward C. Holmesaa Wellcome Trust Centre for the Epidemiology of Infectious Disease, Department of Zoology, University of Oxford, Oxford OX1 3PS, United Kingdom
Corresponding author: Nicholas C. Grassly, Max Planck Institute for Evolutionary Anthropology, c/o Zoology Institute, University of Munich, Luisenstr. 14, Munich D-80333, Germany., grassly{at}zi.biologie.uni-muenchen.de (E-mail)
Communicating editor: M. SLATKIN
| ABSTRACT |
|---|
A method for the estimation of population dynamic history from sequence data is described and used to investigate the past population dynamics of HIV-1 subtypes A and B. Using both gag and env gene alignments the effective population size of each subtype is estimated and found to be surprisingly small. This may be a result of the selective sweep of mutations through the population, or may indicate an important role of genetic drift in the fixation of mutations. The implications of these results for the spread of drug-resistant mutations and transmission dynamics, and also the roles of selection and recombination in shaping HIV-1 genetic diversity, are discussed. A larger estimated effective population size for subtype A may be the result of differences in time of origin, transmission dynamics, and/or population structure. To investigate the importance of population structure a model of population subdivision was fitted to each subtype, although the improvement in likelihood was found to be nonsignificant.
HUMAN immunodeficiency virus (HIV) type-1 currently infects >30 million people, and it is estimated that more than 16,000 new infections are generated every day (UNAIDS and WHO 1998). Global viral isolates from this pandemic have revealed extensive genetic diversity, but restriction of this diversity into distinct viral subtypes (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Consider the two subtypes, A and B, for which large amounts of sequence data are available due to their dominance of the HIV-1 pandemic. Subtype A is common throughout Africa and is spread predominantly by heterosexual sex. Subtype B, in contrast, is found mainly in the United States and parts of Europe (although it has also been found in China, Japan, South America, Australasia, and Southeast Asia), and is usually associated with transmission by needle sharing among intravenous drug users (IDUs) or male homosexual sex. These subtypes may therefore be expected to show different patterns of genetic diversity, as a result of differences in their epidemiological history and mode of transmission.
The inference of past population dynamics from randomly sampled genetic sequence data can be based on either the genealogical structure relating sampled sequences (e.g., ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
In this article a statistical framework for the inference of past population dynamics based on summary statistics is described and used to investigate the population dynamics of HIV-1 subtypes A and B. This framework is designed to be flexible and allows not only population dynamic parameter estimation, but also the fit of different population dynamic and mutational models to be tested. Population dynamic models of the epidemic spread of the virus are implemented, with or without further subdivision of the subtypes into partially isolated subpopulations or "demes." A mutational model that allows heterogeneity in the rate of mutation both along sequences and between nucleotides at a single site is used (following their demonstrated importance in HIV-1 evolution; ![]()
![]()
Estimates of the rate of spread of the two subtypes and their effective population sizes are obtained from both the gag and the env genes. The latter parameter is important because it determines the relative roles of genetic drift and selection in determining the evolution of HIV-1. In particular, the spread of drug-resistant mutations, which may be selectively disadvantageous in the absence of the drug (e.g., ![]()
![]()
![]()
![]()
| THEORY |
|---|
Coalescent process:
If we take K sequences, randomly sampled from a large population of size N, it is possible to model their evolutionary or genealogical relationships with a simple process known as the coalescent (![]()
![]()
![]()
![]()
In this article a model of the coalescent process for an expanding, subdivided haploid population is used, of which a panmictic (unsubdivided) model is a special case. The island model of subdivision due to ![]()
- Coalescence with rate

where ki is the number of lineages under consideration in the ith deme, Nx the population size at generation x, and
2 the variance in the number of offspring had by each member of the population [thus for larger
2 the chance of having the same parent in the preceding generation increases; under Wright-Fisher reproduction
2 = 1.0 (![]()
![]()
![]()
2 = 2N-1x]. Migration with rate mi,jk, where mi,j is the migration rate per generation out of deme i into deme j. In the implementation here the migration rate is constant for all i and j and assumed to be independent of deme size Nx (thus mi,j = m).
If time is measured in units other than generations (e.g., years), and g is the generation time in these units, then assuming that the probability of an event occurring in a unit time is small, we can calculate the probability of nothing occurring in a unit time as
![]() |
(1) |
t, where N0 is the deme size at time t = 0, and
is the exponential growth rate per unit time. Given that an event has occurred at time a, then the probability that nothing occurs within the next x time units, in other words that the time between events, T, is greater than x, is given by ![]() |
(2) |
If we switch to continuous time, this then becomes
![]() |
(3) |
The cumulative density function (cdf) for an event occurring within the next x time units is simply 1 - P(T > x). Because the area under the cdf must sum to one, the distribution of T can be obtained by setting (3) equal to a uniformly distributed random number between 0 and 1 and solving for x. If we ignore population subdivision and migration, then this cdf simplifies to that used by ![]()
To generate times between events for the entire subdivided population, consisting of L demes, we calculate min[xi] for all i {i = 1, 2, ... L}, and assume that an event has occurred at this time in the corresponding deme.
As for any population dynamic model where the population or deme size declines into the past, the coalescent approximations, based on the assumption that not more than one event can occur per unit time, will begin to break down as N becomes smaller. However, for an exponentially growing population most coalescent events tend to occur at time t =
(assuming g = 1), at which time the population size is approximately 1/
and is independent of N0 (![]()
Given that an event (migration, M, or coalescence, C) has occurred in the ith deme, the probability that it is a migration event is
![]() |
(4) |
Using Equation 3 and Equation 4, it is possible to simulate gene genealogies under a range of population dynamic scenarios. Sequences can subsequently be evolved down these gene genealogies, given a suitable substitution model, to generate sets of aligned sequences. A program to simulate sequence alignments under certain population dynamic scenarios and mutational models, called Treevolve, is available on the World Wide Web at http://evolve.zoo.ox.ac.uk/.
Parameter estimation:
Sequence alignments, simulated under certain population dynamic models, can be compared to observed data to not only test the fit of these models, but also to estimate their associated parameters. In the method for estimating population dynamic parameters described here (see also ![]()
![]()
![]()
![]()
![]()
![]()
![]()
o and S2k,o, respectively), and for sets of data simulated under a particular coalescent and mutational model (which we will denote
, giving 
,i and S2k,
,i; i = 1, 2, 3, ...). To calculate an approximate likelihood of the model given
o and S2k,o, define the index, Ii (
), as
![]() |
(5) |
is an arbitrary small number (following
, to the observed data (
o and S2k,o). It therefore seems reasonable to assume that the probability of the data given the model can be given by ![]() |
(6) |
. Because the likelihood L (
|
o, S2k,o is by definition proportional to P(
o, S2k,o|
)
(![]()
![]()
gives the maximum-likelihood estimate (MLE),
. For small n, variance in the simulated
and S2k can translate into bias in
. For example, the variance in
and S2k depends on the total genealogical length, which in turn is linearly proportional to the effective population size, N0. Thus, for the above method with a finite number of simulations, as the estimator
0 (N0
) increases, so too does its variance, leading to an upward bias. However, as n
and
0, the variance is expected to tend to zero, and the estimator becomes unbiased.
It is possible to calculate the region of the parameter space about MLEs where simple hypotheses would not be rejected at the 95% significance level using the likelihood ratio. This space determines the set of admissible hypotheses (sensu., ![]()
). At a confidence limit we are effectively restricting the parameter(s) of interest to a particular value(s),
R. We can therefore calculate the likelihood ratio as
![]() |
(7) |
By definition,
R is nested in
and hence for large n, -2
is approximately
2-distributed, with the number of degrees of freedom equal to the number of parameters restricted (![]()
R, which satisfy
= -1.92.
As the number of estimated parameters increases, the confidence interval becomes wider. Both the population parameters and the two summary statistics used to estimate them are nonindependent to differing extents. Thus it is unclear exactly how many degrees of freedom there are, and whether two unique population genetic parameter estimates can be obtained from the two summary statistics. Fortunately, likelihood surfaces can be calculated about MLEs to reveal the range of plausible parameter estimates (those with a likelihood where
is less than the critical, 1/2
2-distributed, value).
Because of the assumption of neutrality made by the coalescent process, the process of mutation can be decoupled from that generating the gene genealogy (![]()
![]()
![]()
![]()
![]()
| MATERIALS AND METHODS |
|---|
Sequence data:
Complete sequences of the gag and env genes from subtypes A and B of HIV-1 were extracted from the Los Alamos HIV sequence database (![]()
![]()
![]()
In total, 13 env and 23 gag sequences were obtained for subtype A, and 54 env and 26 gag sequences for subtype B. The accession numbers, dates, and places of isolation of these sequences are available upon request from the authors. All sequences were isolated between 1983 and 1996 about a mean isolation date of 1991 (SD = 3.65 yr). The coalescent models described here assume sequences are isolated at the same time. The effect of noncontemporaneous-isolation-date estimates of population parameters is analogous to the effects of a substitutional process that is more stochastic than the assumed Poisson process (see DISCUSSION).
Phylogenetic relationships and inference of substitution process:
Phylogenetic trees for the four sets of aligned sequences were inferred at the same time as the substitution process, using the maximum likelihood method implemented in a test version of PAUP* made available by the author David Swofford. The general reversible substitution model was used (![]()
![]()
![]()
, were estimated from the data. For each alignment two trees were constructed, one where sequences (tips) were constrained to be contemporaneous with a single rate of substitution, and one where the sequences were not constrained (and thus substitution rates down each branch could vary). Using the likelihood-ratio statistic the goodness-of-fit of the coalescent assumption of contemporary tips and a molecular clock were assessed. The constrained tree is a special case of the unconstrained tree, where the unconstrained tree has k - 2 extra parameters, which are free to vary (![]()
2-distribution with k - 2 d.f. (![]()
The substitution rate of the gag and env genes of the virus, µ, was set to 5 x 10-3 per site per year for all estimates of population dynamic parameters, consistent with previous estimates for these genes (![]()
Inference of population dynamic history:
Initially a panmictic coalescent model was fitted to each of the two HIV-1 subtypes. Such a model is simply derived from Equation 3 by setting m = 0 and considering only a single deme. Under this population dynamic model there are two free parameters to be estimated: the exponential growth rate
, and the compound parameter g
-2N0, which determines the rate of coalescence. This latter parameter can be defined as the current real-time effective population size and is denoted throughout this article as N
. The real-time effective population size determines the probability that two randomly sampled gene sequences have a common ancestor within the last year (
N-1
, for large N
) and is analogous to WRIGHT's (1931) effective population number, Ne, which determines the probability that two randomly chosen individuals shared a common ancestor in the preceding generation (
N-1e). Because each sequence in the sample of HIV-1 sequences can be considered to represent the viral population from an infected patient, the "individual" of the population genetic model implemented here is an infected patient. Infected individuals transmit their virus to other individuals to generate more "offspring" whose viral sequence resembles that of the "parent." This coalescent model of the transmission process allows variation in offspring number (i.e.,
2 > 1), but assumes that there is no correlation across generations in the number of offspring had by a genealogical lineage (no Hill-Robertson effect; ![]()
any error in µ can be accounted for by allowing N
to vary (although of course any error in µ will affect the actual estimates of N
).
MLEs of N
and
for the four alignments were obtained and likelihood surfaces about the estimates produced. Using the
2-approximation to the likelihood-ratio statistic ~95% confidence limits about these estimates were calculated. Subsequently the exponential growth rate (
) was fixed at 0.693 yr-1, and estimates of the real-time effective population size were obtained. This growth rate is equivalent to a doubling time of 1 yr, a value consistent with that observed for the incidence of HIV-1 infection (where a range from 4.9 to 15.6 mo is seen; ![]()
The subdivided population coalescent model with two demes was also fitted to the observed sequence data. This model has three free parameters:
, N
, and the migration rate, M =
(where
and M have units per year, and N
is the real-time effective deme size). Although nonunique due to the use of only two summary statistics (
o and S2k,o), and therefore only 2 d.f., MLEs of these parameters were obtained and the likelihood at these MLEs recorded. The likelihood ratio of the subdivided to the unsubdivided model was calculated for each sequence alignment and its significance assessed using the
2-approximation with 1 d.f. (because there is one additional free parameter, M).
Performance of 
and the
2-approximation to the likelihood ratio:
The performance of the estimator of the real-time effective population size and the validity of the
2-approximation to the likelihood ratio were assessed using simulations. For each value of N
, which was estimated from HIV-1 with the doubling time set at 1 yr (i.e., for A and B subtypes, gag and env genes), 100 sequence alignments were simulated with the relevant number of sequences and the mutational parameters in Table 1. For each sequence alignment N
was estimated and the ML value recorded. The likelihood of the actual N
under which the data were simulated was also recorded, and the ratio to the ML value calculated. In this way the performance of both the estimator and the
2-approximation to the likelihood ratio could be assessed.
|
| RESULTS |
|---|
Phylogenetic relationships and inference of substitution process:
For each of the four alignments a molecular clock was rejected using the likelihood-ratio statistic [log-likelihood ratio = 14.32 (P = 0.01) and 57.45 (P < 0.005) for subtype A env and gag genes, respectively, and 322.01 (P < 0.005) and 52.89 (P < 0.005) for subtype B env and gag genes]. This is unsurprising given the noncontemporaneous nature of the viral isolates and the possibility of greater-than-Poisson stochasticity in the substitution process. Furthermore, the likelihood-ratio test of a molecular clock based on a complete phylogeny is very stringent (powerful). A single anomalous branch length can cause rejection of the clock. In contrast, the method for inferring population parameters described here is based on summary statistics, which are less sensitive to such rate heterogeneity. Thus, despite the rejection of a clock by the likelihood ratio, a coalescent analysis is considered appropriate (see DISCUSSION for a consideration of the implications of rate heterogeneity for estimates of population parameters).
The substitution processes inferred from both the constrained and unconstrained phylogenies are shown in Table 1. It can be seen that constraint of the tips of the phylogeny has little effect on the estimates of the substitution parameters, despite the lack of fit of a molecular clock. Adenosine (A) is more common than the other nucleotides, a result of the preference for G to A substitution in the HIV-1 genome (![]()
, governing rate heterogeneity along the sequences is <0.4 in all cases, indicating a fair degree of rate heterogeneity (in agreement with estimates from other gag and env alignments; ![]()
The phylogenetic trees constrained to have contemporaneous tips for the two genes from each subtype are shown in Figure 1. They have a fairly pronounced bush- or star-like topology, which is indicative of exponential growth of the population from which the sequences have been sampled (![]()
|
Inferred population dynamic history:
Under a panmictic model, the population growth rate,
, and the current real-time effective population size, N
, were coestimated to give the MLEs and ~95% confidence limits shown in Figure 2. It can be seen that the likelihood surface is ridged, and indeed the confidence limits are open ended as
and N
increase. This is expected for any coestimate of these two parameters based on the properties of gene genealogies, and is a result of the coupling of the effect of
and N
for large
. A star-like phylogeny is produced when
is large, and thus if N
further increases,
can also increase to produce a genealogy of the same shape and length. However, as
becomes smaller, a more structured phylogeny is produced, and the coupling of
and N
breaks down. This therefore allows a minimum possible
and N
to be estimated. In the case of subtypes A and B of HIV-1 it is meaningful to ask what their minimum rate of spread is. For subtype B this was found to be 0.5 and 0.4 per year for the env and gag genes, respectively (corresponding to a maximum possible doubling time, t1/2, of 17 and 21 mo; Figure 2). This fits with the average t1/2 of the HIV-1 pandemic estimated from AIDS case reports to the World Health Organization at ~10 mo (![]()
for subtype A will be wider due to fewer available sequences. Hence, it is difficult to draw conclusions about differences in the minimum rate of spread for the two subtypes.
|
Although conclusions about the exact rate of spread of the two subtypes are difficult, it is clear from Figure 2 that the ratio of
to 
for subtype B is larger than that for subtype A. This implies that either subtype B is spreading at a faster rate than A (1.2 or 3.2 times faster based on the env and gag genes, respectively), or that subtype A has a larger current real-time effective population size than B, or a combination of the two. Given our knowledge of the epidemiology of HIV in Africa and in the United States and Europe, it seems unlikely that subtype B is growing at a much faster rate than subtype A. Indeed, the doubling time of HIV incidence in Africa (where subtype A is found) is similar, if somewhat shorter, than in the United States and Europe (where subtype B is found). If t1/2 is fixed at 1 yr, consistent with this epidemiological data, it is possible to estimate the corresponding N
that maximizes the likelihood for each subtype, as shown in Figure 3. This figure clearly demonstrates that subtype A must have a larger N
than subtype B if they have been growing at the same rate, although for the env alignments the 95% confidence limits overlap.
|
Figure 3 also reveals that both subtypes have a very low current real-time effective population size, N
, despite the large numbers of individuals infected with these subtypes. If HIV-1 has been spreading with a doubling time of about 1 yr, then the 95% confidence limits about 
for the env and gag alignments from subtype A range from 2.4 x 104 to 5.8 x 106, while for subtype B they range from 2.6 to 8.8 x 104. At the lower confidence limits of 
the coalescent approximations begin to break down. However, at the mean 
for subtypes A and B of ~105 and ~102, respectively, the approximations are valid, and at the upper confidence limits for each subtype (5.8 x 106 and 8.8 x 104) the approximations will be accurate. These low values of 
are robust to error in the specified mutation rate µ (= 5 x 10-3), which is unlikely to vary by more than an order of magnitude, and therefore may indicate an important role for the stochastic process of genetic drift in the fixation of mutations. However, it is also possible that selection is acting to reduce N
, as will be discussed.
The results of fitting a model of population subdivision to the data are shown in Table 2. This table gives the likelihood ratio,
, or support, of the subdivided model over the panmictic model, and the MLEs of its associated parameters. The improvement in the fit of the model when subdivision is added can be seen to be small, and absent in one case (subtype B, env alignment). Although
is on average somewhat higher for subtype A than subtype B, in no case is the improved fit significant at the 5% level when
is assessed using the
2-approximation (
2-value for 1 d.f. = 1.92). Estimates of growth rates are similar to those obtained for the panmictic model.
|
Performance of 
and the
2-approximation to the likelihood ratio:
The mean estimated N
for each simulated dataset and the likelihood of the true value, together with the number of times the true value was rejected at the 95% significance level (
= 0.05) using the
2-approximation to the likelihood ratio, are shown in Table 3. The percentage of cases where the true 
was rejected at the 95% significance level is consistent with this significance level (the 95% confidence interval about the expected
of 0.05 is 0.0160.113 under the binomial distribution with n = 100). Furthermore, the distribution of likelihood ratios was well fitted by a
2-distribution (results not shown), as expected given the nesting of the different hypotheses for N
(see THEORY: Parameter estimation).
|
It can be seen that the estimates of N
are biased upward, although the true N
falls within the 95% confidence limits at the correct frequency. This upward bias is a result of the finite number of simulations carried out when calculating each likelihood value (in this case, n = 20 in Equation 6). Simulated values of the summary statistics
and S2k under a large N
have a high variance and so can result in a reasonable likelihood for a large estimated 
even if the sequence data are from a population with a small N
. Conversely, simulations under a small N
have a low variance, and so the likelihood for a small 
is unlikely to be large for data sampled from a big population. As n
and
0, this variance will tend to zero and the bias will disappear.
It can be concluded that the small N
reported here for HIV-1 subtypes A and B are not a result of bias in the method, which would tend to result in slight over- rather than underestimation of N
.
| DISCUSSION |
|---|
The analysis of gag and env sequences presented here reveals a small current real-time effective population size, N
, for subtypes A and B of HIV-1, of ~105 and ~102, respectively. This real-time effective population size can be converted to WRIGHT's (1931) effective population size, Ne, by dividing by the generation time, g. However, it is unclear what the generation time is for HIV-1 infections. Rates of partner change for homosexuals and heterosexuals tend to be on the order of 1 yr (![]()
![]()
![]()
= Ne, while for shorter generation times N
< Ne. However, even for a very short generation time of 1 mo, the estimated N
imply that the Ne of subtypes A and B is ~106 and ~103, respectively. Both these estimates fall below the census size of HIV-1 infections, particularly for subtype B that predominates in the United States and Europe, where HIV-1 is estimated to infect more than 1.4 million people (UNAIDS and WHO 1998).
Why is Ne small?
Two features of HIV-1 may be important in causing a small effective population size: the transmission dynamics and/or natural selection. The transmission dynamics of HIV-1 are such that there is a large variance in the rate at which new infections are generated (i.e., a large
2, and hence small Ne). The different modes of transmission of HIV-1, via homosexual or heterosexual sex, needle-sharing among IDUs, or contamination of blood products, are all associated with different rates of transmission of the virus. For example, transmission of the virus through a susceptible network of needle-sharing IDUs is likely to be initially more rapid than through a susceptible heterosexual population (![]()
![]()
![]()
![]()
![]()
![]()
![]()
In an analogous way to heterogeneity in susceptibility, subdivision of the human population according to geographic, social, and behavioral barriers may also play a role in reducing the effective population size below the census size, N. If the migration rate is high, then a subdivided population of total size Ne, consisting of L subpopulations or demes, will begin to approximate a panmictic population of size Ne/L as the migration rate M
(for lower migration rates, subdivision can inflate the effective population size, Ne, above N). Subdivision of the HIV-1 subtypes A and B is suggested by a certain degree of clustering of the places of isolation on the phylogenetic trees relating the sequences (see Figure 1), but a simple two-deme model of population subdivision was not found to give a significantly better fit to either subtype (see Table 2). This may reflect the minimal effect subdivision has on patterns of sequence diversity when the population has been growing at a rapid rate. In such cases, a star-like phylogeny where all coalescences occur at approximately the same time will be produced no matter what the level of subdivision, unless single lineages survive into the past in demes of very small size (Ne < 1). It is also possible that the particular model of subdivision and the use of summary statistics to make inferences result in a poor improvement in the likelihood when the subdivided model is assessed. The use of a genealogy-based method (sensu., ![]()
![]()
The analysis of the gag and env genes also reveals a larger 
for subtype A (found in Africa) than subtype B (found in the United States and Europe), although the confidence limits overlap for the env gene (see Figure 3). This may simply be the result of a greater age, number of infections, and hence diversity of HIV-1 in Africa than in Europe and the United States. Alternatively, it may be a result of differences in transmission dynamics. Subtype B is associated mainly with the AIDS epidemic in homosexuals and IDUs in developed countries, where transmission rates may be more variable than for the heterosexual transmission associated with subtype A (![]()
![]()
![]()
2 for subtype B could therefore explain the smaller estimated N
.
Although transmission dynamics shape the observed sequence diversity of HIV-1, natural selection is also likely to play a role. The high levels of rate heterogeneity observed along both the gag and env genes (Table 1) suggest an important role of functional constraint. This is further evidenced by the restricted number of nonsynonymous as opposed to synonymous changes for these two genes (the dN:dS ratio is 0.44 and 0.51 for the env subtype A and B alignments, and 0.25 and 0.24 for the gag alignments). However, although functional constraint can reduce linked genetic diversity (![]()
e (which is confounded with µ) unless such constraint is accompanied by occasional selective sweeps of fit mutants through the subtype. Such selective sweeps would reduce linked variation while maintaining a high substitution rate and hence could therefore decrease
e. Selective sweeps may be a feature of HIV-1 evolution, but the high frequency at which HIV-1 recombines (![]()
![]()
![]()
![]()

substantially, but only if they occur fairly often. So far, little evidence concerning whether such sweeps occur at the subtype level or what their frequency is has accumulated.
Role of recombination:
Recombination reduces the variance of the distribution of pairwise differences, while little affecting the mean of this distribution. Thus if recombination is ignored, when
and Ne are coestimated,
will be overestimated (the phylogeny simply becomes more star-like). However, estimates of Ne for a fixed
should be little affected because for a star phylogeny
e is mainly determined by the mean of the distribution of pairwise differences. The estimates of Ne for HIV-1 reported here are therefore valid, despite the lack of recombination in the coalescent model. In the future it may be possible to use a coalescent model (e.g., see ![]()
![]()
Nonrandom sampling:
HIV-1 sequences are unlikely to be sampled randomly with respect to geographic origin for political and economic reasons. Because the estimates of Ne given here are based mainly on mean sequence diversity, such nonrandom sampling will not be too problematic unless subtype diversity is systematically underrepresented. This will occur when countries with prevalent HIV-1 are not included in the sample. In such cases estimates of Ne will be for the regions that are adequately sampled only. For example, because partial V3 sequences from China that group with subtype B (![]()
Importance of intrahost dynamics:
Both the substitution and genealogical processes modeled focus on interhost evolution. This is because each pair of sequences sampled from subtypes A and B is likely to be separated by a large number of transmission events. A lineage in the coalescent model therefore reflects a population with an effective size of ~103 (the estimated intrahost population size; ![]()
![]()
![]()
If many mutations within the host are nonneutral (e.g., due to immune surveillance), then viral population size within a coalescent lineage becomes important and the rate of interhost substitution may be more variable than the assumed Poisson process (![]()
e based on summary statistics, given that the mean sequence diversity is likely to remain the same. However, for likelihood calculations based on explicit representation of the HIV-1 genealogy, a more accurate model of the substitution process will be required. There is no evidence for systematic differences in the substitution process or rate between subtypes A and B that may be causing the apparent population parameter differences.
It is possible that selective pressures within the host may increase the effect of intrahost polymorphism on observed differences between sampled sequences. For this reason the large number of immunogenic epitopes along the env protein (reflected by a larger dN:dS ratio compared with the gag polyprotein; ![]()
![]()
![]()
Implications of a small Ne:
The small value of
e, if not a consequence of past selective sweeps, has important implications for the spread of drug-resistant mutations, which are typically at a selective disadvantage in the absence of the drug (e.g., ![]()
![]()
e may indicate an important role for genetic drift in the fixation of drug-resistant mutations. Although the values of selective coefficients for drug-resistant mutations seem to be variable (s = -0.004 to -0.25 within drug-naïve patients; ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
e therefore suggests that drug resistance requiring multiple mutations may arise through the stochastic effects of drift. This has important implications for the efficacy of drugs and combination drug therapy, because preexistence of a few drug-resistant viruses within an individual implies that even if the drug therapy is effective in dramatically reducing viral load, resistance will quickly emerge (![]()
![]()
| ACKNOWLEDGMENTS |
|---|
Many thanks to Oliver Pybus for providing early sequence alignments. This work was supported by grants from the Biotechnology and Biological Sciences Research Council (BBSRC), The Royal Society, and The Wellcome Trust.
Manuscript received July 17, 1998; Accepted for publication October 26, 1998.
| LITERATURE CITED |
|---|
ANDERSON, R. M. and R. M. MAY, 1988 Epidemiological parameters of HIV transmission. Nature 333:514-519[Medline].
ANDERSON, R. M., R. M. MAY, T. W. NG, and J. T. ROWLEY, 1992 Age-dependent choice of sexual partners and the transmission dynamics of HIV in sub-Saharan Africa. Philos. Trans. R. Soc. Lond. B Biol. Sci. 336:135-155[Medline].
ARAKI, H. and H. TACHIDA, 1997 Bottleneck effect on evolutionary rate in the nearly neutral mutation model. Genetics 147:907-914[Abstract].
BIRKY, C. W. and J. B. WALSH, 1988 Effects of linkage on rates of molecular evolution. Proc. Natl. Acad. Sci. USA 85:6414-6418
BLOWER, S. M., D. HARTEL, H. DOWLATABADI, R. M. ANDERSON, and R. M. MAY, 1991 Drugs, sex and HIV: a mathematical model for New York City. Philos. Trans. R. Soc. Lond. B Biol. Sci. 331:171-187[Medline].
BONHOEFFER, S. and M. A. NOWAK, 1997 Pre-existence and emergence of drug resistance in HIV-1 infection. Proc. R. Soc. Lond. Ser. B Biol. Sci. 264:631-637[Medline].
BONHOEFFER, S., R. M. MAY, G. M. SHAW, and M. A. NOWAK, 1997 Virus dynamics and drug therapy. Proc. Natl. Acad. Sci. USA 94:6971-6976
CHARLESWORTH, B., M. T. MORGAN, and D. CHARLESWORTH, 1993 The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289-1303[Abstract].
CONLON, C. P., P. KLENERMAN, A. EDWARDS, B. A. LARDER, and R. E. PHILLIPS, 1994 Heterosexual transmission of human immunodeficiency virus type-1 variants associated with zidovudine resistance. J. Infect. Dis. 169:411-415[Medline].
CORNELISSEN, M., R. VAN DEN BURG, F. ZORGDRAGER, V. LUKASHOV, and J. GOUDSMIT, 1997 pol gene diversity of five human immunodeficiency virus type 1 subtypes: evidence for naturally occurring mutations that contribute to drug resistance, limited recombination patterns, and common ancestry for subtypes B and D. J. Virol. 71:6348-6358[Abstract].
DEAN, M., M. CARRINGTON, C. WINKLER, G. A. HUTTLEY, and M. W. SMITH et al., 1996 Genetic restriction of HIV-1 infection and progression to AIDS by a deletion allele of the CKR5 structural gene. Science 273:1856-1862
DIAZ, R. S., E. C. SABINO, A. MAYER, J. W. MOSLEY, and M. P. BUSCH, 1995 Dual Human Immunodeficiency Virus type-1 infection and recombination in a dually exposed transfusion recipient. J. Virol. 69:3273-3281[Abstract].
EDWARDS, A. W. F., 1992 Likelihood. Extended Edition. The Johns Hopkins University Press, London.
EIGEN, M. and K. NIESELT-STRUWE, 1990 How old is the immunodeficiency virus? AIDS 4:S85-S93.
FELSENSTEIN, J., 1974 The evolutionary advantage of recombination. Genetics 78:737-756
FELSENSTEIN, J., 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376[Medline].
FELSENSTEIN, J., 1992 Estimating effective population-size from samples of sequencesinefficiency of pairwise and segregating sites as compared to phylogenetic estimates. Genet. Res. 59:139-147[Medline].
FISHER, R. A., 1921 On the `probable error' of a coefficient of correlation deduced from a small sample. Metron 1: part 4, 332.
FISHER, R. A., 1930 The Genetical Theory of Natural Selection. Clarendon Press, Oxford.
FU, Y. X., 1994 A phylogenetic estimator of effective population size or mutation rate. Genetics 136:685-692[Abstract].
FU, Y. X. and W. H. LI, 1993 Maximum likelihood estimation of population parameters. Genetics 134:1261-1270[Abstract].
GAO, F., D. L. ROBINSON, S. G. MORRISON, H. X. HUI, and S. CRAIG et al., 1996 The heterosexual human immunodeficiency virus type 1 epidemic in Thailand is caused by an intersubtype (A/E) recombinant of African origin. J. Virol. 70:7013-7029
GOUDSMIT, J., A. DE RONDE, D. D. HO, and A. S. PERELSON, 1996 Human immunodeficiency virus fitness in vivo: calculations based on a single zidovudine resistance mutation at codon 215 of reverse transcriptase. J. Virol. 70:5662-5664
GOUDSMIT, J., A. DE RONDE, E. DE ROOIJ, and R. DE BOER, 1997 Broad spectrum of in vivo fitness of human immunodeficiency virus type 1 subpopulations differing at reverse transcriptase codons 41 and 215. J. Virol. 71:4479-4484[Abstract].
GRASSLY, N. C., and E. C. HOLMES, 1998 The use of Monte Carlo simulation to infer population dynamic history from DNA sequence data, pp. 91112 in Proceedings of the Trinational Workshop on Molecular Evolution, edited by M. K. UYENOYAMA, A. VON HAESELER and N. TAKAHATA. Duke University Publications Group, Durham, NC.
GREENHALGH, D., 1996 Effects of heterogeneity on the spread of HIV AIDS among intravenous drug users in `shooting galleries'. Math. Biosci. 136:141-186[Medline].
GRIFFITHS, R. C. and S. TAVARE, 1994a Ancestral inference in population genetics. Stat. Sci. 9:307-319.









