Genetics, Vol. 151, 427-438, February 1999, Copyright © 1999

Population Dynamics of HIV-1 Inferred From Gene Sequences

Nicholas C. Grasslya, Paul H. Harveya, and Edward C. Holmesa
a Wellcome Trust Centre for the Epidemiology of Infectious Disease, Department of Zoology, University of Oxford, Oxford OX1 3PS, United Kingdom

Corresponding author: Nicholas C. Grassly, Max Planck Institute for Evolutionary Anthropology, c/o Zoology Institute, University of Munich, Luisenstr. 14, Munich D-80333, Germany., grassly{at}zi.biologie.uni-muenchen.de (E-mail)

Communicating editor: M. SLATKIN


*  ABSTRACT
*TOP
*ABSTRACT
*THEORY
*MATERIALS AND METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

A method for the estimation of population dynamic history from sequence data is described and used to investigate the past population dynamics of HIV-1 subtypes A and B. Using both gag and env gene alignments the effective population size of each subtype is estimated and found to be surprisingly small. This may be a result of the selective sweep of mutations through the population, or may indicate an important role of genetic drift in the fixation of mutations. The implications of these results for the spread of drug-resistant mutations and transmission dynamics, and also the roles of selection and recombination in shaping HIV-1 genetic diversity, are discussed. A larger estimated effective population size for subtype A may be the result of differences in time of origin, transmission dynamics, and/or population structure. To investigate the importance of population structure a model of population subdivision was fitted to each subtype, although the improvement in likelihood was found to be nonsignificant.


HUMAN immunodeficiency virus (HIV) type-1 currently infects >30 million people, and it is estimated that more than 16,000 new infections are generated every day (UNAIDS and WHO 1998). Global viral isolates from this pandemic have revealed extensive genetic diversity, but restriction of this diversity into distinct viral subtypes (MYERS et al. 1991 Down; LOUWAGIE et al. 1993 Down). HIV-1 represents one of a number of lineages of primate lentiviruses (SHARP et al. 1994 Down) and can itself be divided into an M and O group (MCCUTCHAN et al. 1996 Down), and most recently an N group (SIMON et al. 1998 Down). Most infections are due to the M group, which is currently classified into 10 further subtypes, A through J (MCCUTCHAN et al. 1996 Down). The reason for the existence of subtypes is unclear. They do not correlate with classical neutralization serotype (WEBER et al. 1996 Down; JANSSENS et al. 1997 Down), making their maintenance by epistatic selection of epitopes due to the immune system an unlikely explanation (GUPTA et al. 1996 Down). It seems more probable that they simply reflect the past population dynamic history of the virus.

Consider the two subtypes, A and B, for which large amounts of sequence data are available due to their dominance of the HIV-1 pandemic. Subtype A is common throughout Africa and is spread predominantly by heterosexual sex. Subtype B, in contrast, is found mainly in the United States and parts of Europe (although it has also been found in China, Japan, South America, Australasia, and Southeast Asia), and is usually associated with transmission by needle sharing among intravenous drug users (IDUs) or male homosexual sex. These subtypes may therefore be expected to show different patterns of genetic diversity, as a result of differences in their epidemiological history and mode of transmission.

The inference of past population dynamics from randomly sampled genetic sequence data can be based on either the genealogical structure relating sampled sequences (e.g., FELSENSTEIN 1992 Down; LUNDSTROM et al. 1992 Down; FU and LI 1993 Down; FU 1994 Down; GRIFFITHS and TAVARE 1994A Down, GRIFFITHS and TAVARE 1994B Down; KUHNER et al. 1995 Down, KUHNER et al. 1998 Down) or summary statistics, such as the number of segregating sites or the distribution of pairwise differences (WATTERSON 1975 Down; TAJIMA 1983 Down; ROGERS and HARPENDING 1992 Down; GRIFFITHS and TAVARE 1996 Down). The former approach has the advantage that it makes full use of the available data, and consequently estimators of population dynamic parameters tend to have a smaller variance when the number of sequences sampled is finite (FELSENSTEIN 1992 Down; FU and LI 1993 Down). However, the latter approach is computationally much faster, and can allow easier implementation of complex population dynamic and mutational models.

In this article a statistical framework for the inference of past population dynamics based on summary statistics is described and used to investigate the population dynamics of HIV-1 subtypes A and B. This framework is designed to be flexible and allows not only population dynamic parameter estimation, but also the fit of different population dynamic and mutational models to be tested. Population dynamic models of the epidemic spread of the virus are implemented, with or without further subdivision of the subtypes into partially isolated subpopulations or "demes." A mutational model that allows heterogeneity in the rate of mutation both along sequences and between nucleotides at a single site is used (following their demonstrated importance in HIV-1 evolution; EIGEN and NIESELT-STRUWE 1990 Down; LEITNER et al. 1997 Down). The performance of the estimator of effective population size is investigated using Monte Carlo simulation.

Estimates of the rate of spread of the two subtypes and their effective population sizes are obtained from both the gag and the env genes. The latter parameter is important because it determines the relative roles of genetic drift and selection in determining the evolution of HIV-1. In particular, the spread of drug-resistant mutations, which may be selectively disadvantageous in the absence of the drug (e.g., GOUDSMIT et al. 1996 Down, GOUDSMIT et al. 1997 Down), is dependent on the relative importance of genetic drift (LEIGH BROWN 1997 Down; LEIGH BROWN and RICHMAN 1997 Down). The parameter estimates for subtypes A and B are contrasted, and reasons for their differences discussed.


*  THEORY
*TOP
*ABSTRACT
*THEORY
*MATERIALS AND METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

Coalescent process:
If we take K sequences, randomly sampled from a large population of size N, it is possible to model their evolutionary or genealogical relationships with a simple process known as the coalescent (KINGMAN 1982A Down, KINGMAN 1982B Down, KINGMAN 1982C Down). This process is concerned with the rate at which sampled lineages coalesce when following a genealogy backward in time. Under the assumption of neutrality (or more specifically, exchangeability in offspring number; KINGMAN 1982C Down), this rate of coalescence of the sampled lineages is dependent solely on the dynamics of the population, and is independent of the mutational process. Conversely, the genealogy of randomly sampled sequences can therefore be used to make inferences about the dynamics of the population from which they were sampled, given a suitable coalescent model, and with a mutational process specified a priori.

In this article a model of the coalescent process for an expanding, subdivided haploid population is used, of which a panmictic (unsubdivided) model is a special case. The island model of subdivision due to WRIGHT 1931 Down is implemented, and it is assumed that selection and recombination are absent (see DISCUSSION). If we follow a single subpopulation or deme backward in time there are therefore two types of event:

  • Coalescence with rate

where ki is the number of lineages under consideration in the ith deme, Nx the population size at generation x, and {sigma}2 the variance in the number of offspring had by each member of the population [thus for larger {sigma}2 the chance of having the same parent in the preceding generation increases; under Wright-Fisher reproduction {sigma}2 = 1.0 (FISHER 1930 Down; WRIGHT 1931 Down), while for MORAN 1958 Down overlapping generations model {sigma}2 = 2N-1x]. Migration with rate mi,jk, where mi,j is the migration rate per generation out of deme i into deme j. In the implementation here the migration rate is constant for all i and j and assumed to be independent of deme size Nx (thus mi,j = m).

If time is measured in units other than generations (e.g., years), and g is the generation time in these units, then assuming that the probability of an event occurring in a unit time is small, we can calculate the probability of nothing occurring in a unit time as

(1)
where Nt is the deme size at time t. Because the deme has been growing exponentially, the deme size at time t into the past is given by Nt = N0e-{lambda}t, where N0 is the deme size at time t = 0, and {lambda} is the exponential growth rate per unit time. Given that an event has occurred at time a, then the probability that nothing occurs within the next x time units, in other words that the time between events, T, is greater than x, is given by

(2)

If we switch to continuous time, this then becomes

(3)

The cumulative density function (cdf) for an event occurring within the next x time units is simply 1 - P(T > x). Because the area under the cdf must sum to one, the distribution of T can be obtained by setting (3) equal to a uniformly distributed random number between 0 and 1 and solving for x. If we ignore population subdivision and migration, then this cdf simplifies to that used by SLATKIN and HUDSON 1991 Down(p. 559) to simulate coalescent times.

To generate times between events for the entire subdivided population, consisting of L demes, we calculate min[xi] for all i {i = 1, 2, ... L}, and assume that an event has occurred at this time in the corresponding deme.

As for any population dynamic model where the population or deme size declines into the past, the coalescent approximations, based on the assumption that not more than one event can occur per unit time, will begin to break down as N becomes smaller. However, for an exponentially growing population most coalescent events tend to occur at time t = (assuming g = 1), at which time the population size is approximately 1/{lambda} and is independent of N0 (SLATKIN and HUDSON 1991 Down). Given that these events occur in a short space of time, the effect of the breakdown of the coalescent approximation on the total genealogical structure should therefore be minimal.

Given that an event (migration, M, or coalescence, C) has occurred in the ith deme, the probability that it is a migration event is

(4)

Using Equation 3 and Equation 4, it is possible to simulate gene genealogies under a range of population dynamic scenarios. Sequences can subsequently be evolved down these gene genealogies, given a suitable substitution model, to generate sets of aligned sequences. A program to simulate sequence alignments under certain population dynamic scenarios and mutational models, called Treevolve, is available on the World Wide Web at http://evolve.zoo.ox.ac.uk/.

Parameter estimation:
Sequence alignments, simulated under certain population dynamic models, can be compared to observed data to not only test the fit of these models, but also to estimate their associated parameters. In the method for estimating population dynamic parameters described here (see also GRASSLY and HOLMES 1998 Down), summary statistics are used to make this comparison. These summary statistics are the mean and variance of the distribution of pairwise differences, and were used because of their well-documented responsiveness to changes in population dynamics (WATTERSON 1975 Down; HUDSON 1987 Down; SLATKIN and HUDSON 1991 Down; ROGERS and HARPENDING 1992 Down; SIMONSEN et al. 1995 Down; HEY 1997 Down). These can be calculated for the observed data (o and S2k,o, respectively), and for sets of data simulated under a particular coalescent and mutational model (which we will denote {Theta}, giving {Theta},i and S2k,{Theta},i; i = 1, 2, 3, ...). To calculate an approximate likelihood of the model given o and S2k,o, define the index, Ii ({Theta}), as

(5)
where {partial} is an arbitrary small number (following WEISS and VON HAESELER 1998 Down). This gives a measure of the goodness-of-fit of the model {Theta}, to the observed data (o and S2k,o). It therefore seems reasonable to assume that the probability of the data given the model can be given by

(6)
where n is a large number corresponding to the number of simulations under {Theta}. Because the likelihood L ({Theta}|o, S2k,o is by definition proportional to P(o, S2k,o|{Theta})

(FISHER 1921 Down; EDWARDS 1992 Down), maximization of the right-hand side of Equation 6 over {Theta} gives the maximum-likelihood estimate (MLE), . For small n, variance in the simulated and S2k can translate into bias in . For example, the variance in and S2k depends on the total genealogical length, which in turn is linearly proportional to the effective population size, N0. Thus, for the above method with a finite number of simulations, as the estimator 0 (N0 {isin} {Theta}) increases, so too does its variance, leading to an upward bias. However, as n -> {infty} and {partial} -> 0, the variance is expected to tend to zero, and the estimator becomes unbiased.

It is possible to calculate the region of the parameter space about MLEs where simple hypotheses would not be rejected at the 95% significance level using the likelihood ratio. This space determines the set of admissible hypotheses (sensu., WILKS 1938 Down), which are termed here approximate confidence limits or intervals (although they do not represent confidence limits in the strictest sense where the type I error rate should equal the significance level, {alpha}). At a confidence limit we are effectively restricting the parameter(s) of interest to a particular value(s), {theta}R. We can therefore calculate the likelihood ratio as

(7)

By definition, {theta}R is nested in {Theta} and hence for large n, -2{Lambda} is approximately {chi}2-distributed, with the number of degrees of freedom equal to the number of parameters restricted (WILKS 1938 Down). Thus, for example, the 95% confidence limits about a single MLE are defined by the {theta}R, which satisfy {Lambda} = -1.92.

As the number of estimated parameters increases, the confidence interval becomes wider. Both the population parameters and the two summary statistics used to estimate them are nonindependent to differing extents. Thus it is unclear exactly how many degrees of freedom there are, and whether two unique population genetic parameter estimates can be obtained from the two summary statistics. Fortunately, likelihood surfaces can be calculated about MLEs to reveal the range of plausible parameter estimates (those with a likelihood where {Lambda} is less than the critical, 1/2{chi}2-distributed, value).

Because of the assumption of neutrality made by the coalescent process, the process of mutation can be decoupled from that generating the gene genealogy (KINGMAN 1982A Down, KINGMAN 1982B Down, KINGMAN 1982C Down). The parameters governing the mutational process can therefore be calculated independently and directly from the sequence data, using explicit substitution models (equivalent to mutational models under neutrality). The procedure used here estimates the parameters of a substitution model chosen a priori, at the same time as the branch lengths and topology of the phylogenetic tree linking the sampled sequences, by maximizing their likelihood given the observed sequence data (for a description see SWOFFORD et al. 1996 Down). The validity of this procedure depends on the accuracy of the chosen substitution model, and of the maximum-likelihood (ML) tree topology. The former is improved by using a wide range of testable substitution models. In the case of the latter, the accuracy of the tree topology tends to have little effect on the inferred substitution model unless the deepest bifurcation of the tree (at the root) partitions the taxa incorrectly (e.g., see SULLIVAN et al. 1996 Down), which is unlikely for any method of phylogenetic reconstruction. Furthermore, even if recombination occurs, invalidating a strictly bifurcating phylogeny, it is unlikely to cause an incorrect phylogenetic partition at such a deep level.


*  MATERIALS AND METHODS
*TOP
*ABSTRACT
*THEORY
*MATERIALS AND METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

Sequence data:
Complete sequences of the gag and env genes from subtypes A and B of HIV-1 were extracted from the Los Alamos HIV sequence database (MYERS et al. 1996 Down), and the published alignments used. Sequences previously demonstrated to be intersubtype recombinants were removed (including gag sequences isolated in Thailand, which are subtype A in origin, but constitute the recombinant subtype E; GAO et al. 1996 Down; MCCUTCHAN et al. 1996 Down). In addition, those sequences cloned from a common source (patient or isolate) were removed to promote the random sampling of the sequences. Despite this removal it is likely that sequences were not sampled entirely randomly with respect to geographic origin. The effect of this on estimates of population parameters is considered in the DISCUSSION.

In total, 13 env and 23 gag sequences were obtained for subtype A, and 54 env and 26 gag sequences for subtype B. The accession numbers, dates, and places of isolation of these sequences are available upon request from the authors. All sequences were isolated between 1983 and 1996 about a mean isolation date of 1991 (SD = 3.65 yr). The coalescent models described here assume sequences are isolated at the same time. The effect of noncontemporaneous-isolation-date estimates of population parameters is analogous to the effects of a substitutional process that is more stochastic than the assumed Poisson process (see DISCUSSION).

Phylogenetic relationships and inference of substitution process:
Phylogenetic trees for the four sets of aligned sequences were inferred at the same time as the substitution process, using the maximum likelihood method implemented in a test version of PAUP* made available by the author David Swofford. The general reversible substitution model was used (LANAVE et al. 1984 Down; RODRIGUEZ et al. 1990 Down), together with a discrete gamma model of rate heterogeneity with four rate categories (YANG 1994 Down). In each case the parameters governing the relative rate of substitution between the different nucleotides and the gamma shape parameter, {alpha}, were estimated from the data. For each alignment two trees were constructed, one where sequences (tips) were constrained to be contemporaneous with a single rate of substitution, and one where the sequences were not constrained (and thus substitution rates down each branch could vary). Using the likelihood-ratio statistic the goodness-of-fit of the coalescent assumption of contemporary tips and a molecular clock were assessed. The constrained tree is a special case of the unconstrained tree, where the unconstrained tree has k - 2 extra parameters, which are free to vary (FELSENSTEIN 1981 Down). Thus the significance of the likelihood ratio was assessed using the {chi}2-distribution with k - 2 d.f. (WILKS 1938 Down).

The substitution rate of the gag and env genes of the virus, µ, was set to 5 x 10-3 per site per year for all estimates of population dynamic parameters, consistent with previous estimates for these genes (LI et al. 1988 Down). This parameter is completely confounded with the effective population size N0, and hence estimates of N0 depend on the accuracy of µ. Although the codon-based structure of the genes has been ignored, the inference of population dynamic history relies on the shape of the genealogy relating the sampled sequences, which is likely to remain the same for all three codon positions due to linkage. Selection does, however, play a role in shaping some of the diversity of HIV-1, and the implications of this for the analysis presented here are considered in the DISCUSSION.

Inference of population dynamic history:
Initially a panmictic coalescent model was fitted to each of the two HIV-1 subtypes. Such a model is simply derived from Equation 3 by setting m = 0 and considering only a single deme. Under this population dynamic model there are two free parameters to be estimated: the exponential growth rate {lambda}, and the compound parameter g{sigma}-2N0, which determines the rate of coalescence. This latter parameter can be defined as the current real-time effective population size and is denoted throughout this article as N{tau}. The real-time effective population size determines the probability that two randomly sampled gene sequences have a common ancestor within the last year ({approx}N-1{tau}, for large N{tau}) and is analogous to WRIGHT's (1931) effective population number, Ne, which determines the probability that two randomly chosen individuals shared a common ancestor in the preceding generation ({approx}N-1e). Because each sequence in the sample of HIV-1 sequences can be considered to represent the viral population from an infected patient, the "individual" of the population genetic model implemented here is an infected patient. Infected individuals transmit their virus to other individuals to generate more "offspring" whose viral sequence resembles that of the "parent." This coalescent model of the transmission process allows variation in offspring number (i.e., {sigma}2 > 1), but assumes that there is no correlation across generations in the number of offspring had by a genealogical lineage (no Hill-Robertson effect; FELSENSTEIN 1974 Down). This latter assumption seems valid given that there is so far little evidence for the existence of HIV-1 genetic factors determining transmission rates, or viral virulence. The census population size, N, for this study is therefore the number of individuals infected by HIV-1 subtypes A and B. The effective population size may be smaller or bigger than N, determining the role of genetic drift in the fixation of mutations. It is completely confounded with the substitution rate, and hence when estimating {lambda} any error in µ can be accounted for by allowing N{tau} to vary (although of course any error in µ will affect the actual estimates of N{tau}).

MLEs of N{tau} and {lambda} for the four alignments were obtained and likelihood surfaces about the estimates produced. Using the {chi}2-approximation to the likelihood-ratio statistic ~95% confidence limits about these estimates were calculated. Subsequently the exponential growth rate ({lambda}) was fixed at 0.693 yr-1, and estimates of the real-time effective population size were obtained. This growth rate is equivalent to a doubling time of 1 yr, a value consistent with that observed for the incidence of HIV-1 infection (where a range from 4.9 to 15.6 mo is seen; MAY and ANDERSON 1989 Down).

The subdivided population coalescent model with two demes was also fitted to the observed sequence data. This model has three free parameters: {lambda}, N{tau}, and the migration rate, M = (where {lambda} and M have units per year, and N{tau} is the real-time effective deme size). Although nonunique due to the use of only two summary statistics (o and S2k,o), and therefore only 2 d.f., MLEs of these parameters were obtained and the likelihood at these MLEs recorded. The likelihood ratio of the subdivided to the unsubdivided model was calculated for each sequence alignment and its significance assessed using the {chi}2-approximation with 1 d.f. (because there is one additional free parameter, M).

Performance of {tau} and the {chi}2-approximation to the likelihood ratio:
The performance of the estimator of the real-time effective population size and the validity of the {chi}2-approximation to the likelihood ratio were assessed using simulations. For each value of N{tau}, which was estimated from HIV-1 with the doubling time set at 1 yr (i.e., for A and B subtypes, gag and env genes), 100 sequence alignments were simulated with the relevant number of sequences and the mutational parameters in Table 1. For each sequence alignment N{tau} was estimated and the ML value recorded. The likelihood of the actual N{tau} under which the data were simulated was also recorded, and the ratio to the ML value calculated. In this way the performance of both the estimator and the {chi}2-approximation to the likelihood ratio could be assessed.


 
View this table:
In this window
In a new window

 
Table 1. Maximum-likelihood estimated substitution parameters for constrained (clock) and unconstrained phylogenies inferred from the env and gag genes isolated from HIV-1 subtypes A and B


*  RESULTS
*TOP
*ABSTRACT
*THEORY
*MATERIALS AND METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

Phylogenetic relationships and inference of substitution process:
For each of the four alignments a molecular clock was rejected using the likelihood-ratio statistic [log-likelihood ratio = 14.32 (P = 0.01) and 57.45 (P < 0.005) for subtype A env and gag genes, respectively, and 322.01 (P < 0.005) and 52.89 (P < 0.005) for subtype B env and gag genes]. This is unsurprising given the noncontemporaneous nature of the viral isolates and the possibility of greater-than-Poisson stochasticity in the substitution process. Furthermore, the likelihood-ratio test of a molecular clock based on a complete phylogeny is very stringent (powerful). A single anomalous branch length can cause rejection of the clock. In contrast, the method for inferring population parameters described here is based on summary statistics, which are less sensitive to such rate heterogeneity. Thus, despite the rejection of a clock by the likelihood ratio, a coalescent analysis is considered appropriate (see DISCUSSION for a consideration of the implications of rate heterogeneity for estimates of population parameters).

The substitution processes inferred from both the constrained and unconstrained phylogenies are shown in Table 1. It can be seen that constraint of the tips of the phylogeny has little effect on the estimates of the substitution parameters, despite the lack of fit of a molecular clock. Adenosine (A) is more common than the other nucleotides, a result of the preference for G to A substitution in the HIV-1 genome (VARTANIAN et al. 1994 Down). The gamma parameter, {alpha}, governing rate heterogeneity along the sequences is <0.4 in all cases, indicating a fair degree of rate heterogeneity (in agreement with estimates from other gag and env alignments; LEITNER et al. 1997 Down). This heterogeneity is most likely a result of functional constraint at certain positions, although there may also be a role of positive selection (see DISCUSSION).

The phylogenetic trees constrained to have contemporaneous tips for the two genes from each subtype are shown in Figure 1. They have a fairly pronounced bush- or star-like topology, which is indicative of exponential growth of the population from which the sequences have been sampled (SLATKIN and HUDSON 1991 Down). The places of isolation of the sequences are shown on the trees. These can be seen to cluster to a certain extent, indicating a possible role of population subdivision in generating the observed sequence diversity.



View larger version (34K):
In this window
In a new window
Download PPT slide
 
Figure 1. Maximum-likelihood phylogenetic trees inferred for the gag and env alignments from HIV-1 subtypes A and B, showing the place of isolation of the sequences. A question mark indicates that the place of isolation was not recorded, and in such cases the location of the authors' academic institute has been substituted.

Inferred population dynamic history:
Under a panmictic model, the population growth rate, {lambda}, and the current real-time effective population size, N{tau}, were coestimated to give the MLEs and ~95% confidence limits shown in Figure 2. It can be seen that the likelihood surface is ridged, and indeed the confidence limits are open ended as {lambda} and N{tau} increase. This is expected for any coestimate of these two parameters based on the properties of gene genealogies, and is a result of the coupling of the effect of {lambda} and N{tau} for large {lambda}. A star-like phylogeny is produced when {lambda} is large, and thus if N{tau} further increases, {lambda} can also increase to produce a genealogy of the same shape and length. However, as {lambda} becomes smaller, a more structured phylogeny is produced, and the coupling of {lambda} and N{tau} breaks down. This therefore allows a minimum possible {lambda} and N{tau} to be estimated. In the case of subtypes A and B of HIV-1 it is meaningful to ask what their minimum rate of spread is. For subtype B this was found to be 0.5 and 0.4 per year for the env and gag genes, respectively (corresponding to a maximum possible doubling time, t1/2, of 17 and 21 mo; Figure 2). This fits with the average t1/2 of the HIV-1 pandemic estimated from AIDS case reports to the World Health Organization at ~10 mo (MAY and ANDERSON 1989 Down). The maximum possible t1/2 of subtype A was 42 and 83 mo for the env and gag genes, respectively. Although this suggests a less rapid spread of subtype A than subtype B, the confidence limits on the estimate of {lambda} for subtype A will be wider due to fewer available sequences. Hence, it is difficult to draw conclusions about differences in the minimum rate of spread for the two subtypes.



View larger version (29K):
In this window
In a new window
Download PPT slide
 
Figure 2. Approximate 95% confidence limits about the MLEs of {lambda} and N{tau} (marked by crosses) for HIV-1 subtypes A and B, based on gag and env sequence alignments. Because we are interested in the minimum possible value for {lambda}, not N{tau}, the confidence limits have been constructed using the {chi}2-approximation to the likelihood-ratio statistic with 1 d.f.

Although conclusions about the exact rate of spread of the two subtypes are difficult, it is clear from Figure 2 that the ratio of to {tau} for subtype B is larger than that for subtype A. This implies that either subtype B is spreading at a faster rate than A (1.2 or 3.2 times faster based on the env and gag genes, respectively), or that subtype A has a larger current real-time effective population size than B, or a combination of the two. Given our knowledge of the epidemiology of HIV in Africa and in the United States and Europe, it seems unlikely that subtype B is growing at a much faster rate than subtype A. Indeed, the doubling time of HIV incidence in Africa (where subtype A is found) is similar, if somewhat shorter, than in the United States and Europe (where subtype B is found). If t1/2 is fixed at 1 yr, consistent with this epidemiological data, it is possible to estimate the corresponding N{tau} that maximizes the likelihood for each subtype, as shown in Figure 3. This figure clearly demonstrates that subtype A must have a larger N{tau} than subtype B if they have been growing at the same rate, although for the env alignments the 95% confidence limits overlap.



View larger version (25K):
In this window
In a new window
Download PPT slide
 
Figure 3. Likelihood surfaces about the MLE of N{tau} on a log scale, for HIV-1 subtypes A and B based on the env and gag alignments, with {lambda} fixed at 0.693 (corresponding to a doubling time of 1 yr). {tau} for subtype B was 17.8 and 2.1 x 103 for the gag and env genes, respectively, while for subtype A it was 8.3 x 105 and 1.2 x 105, respectively. Approximate 95% confidence limits about {tau}, calculated using the {chi}2-approximation (1 d.f.), are shown. These are 2.6 to 400 and 23 to 8.8 x 104 for subtype B gag and env alignments, respectively, and 2.4 x 104 to 5.8 x 106 and 2.6 x 104 to 2.8 x 105 for subtype A.

Figure 3 also reveals that both subtypes have a very low current real-time effective population size, N{tau}, despite the large numbers of individuals infected with these subtypes. If HIV-1 has been spreading with a doubling time of about 1 yr, then the 95% confidence limits about {tau} for the env and gag alignments from subtype A range from 2.4 x 104 to 5.8 x 106, while for subtype B they range from 2.6 to 8.8 x 104. At the lower confidence limits of {tau} the coalescent approximations begin to break down. However, at the mean {tau} for subtypes A and B of ~105 and ~102, respectively, the approximations are valid, and at the upper confidence limits for each subtype (5.8 x 106 and 8.8 x 104) the approximations will be accurate. These low values of {tau} are robust to error in the specified mutation rate µ (= 5 x 10-3), which is unlikely to vary by more than an order of magnitude, and therefore may indicate an important role for the stochastic process of genetic drift in the fixation of mutations. However, it is also possible that selection is acting to reduce N{tau}, as will be discussed.

The results of fitting a model of population subdivision to the data are shown in Table 2. This table gives the likelihood ratio, {Lambda}, or support, of the subdivided model over the panmictic model, and the MLEs of its associated parameters. The improvement in the fit of the model when subdivision is added can be seen to be small, and absent in one case (subtype B, env alignment). Although {Lambda} is on average somewhat higher for subtype A than subtype B, in no case is the improved fit significant at the 5% level when {Lambda} is assessed using the {chi}2-approximation ({chi}2-value for 1 d.f. = 1.92). Estimates of growth rates are similar to those obtained for the panmictic model.


 
View this table:
In this window
In a new window

 
Table 2. Fit of model of subdivision to HIV-1 subtypes A and B

Performance of {tau} and the {chi}2-approximation to the likelihood ratio:
The mean estimated N{tau} for each simulated dataset and the likelihood of the true value, together with the number of times the true value was rejected at the 95% significance level ({alpha} = 0.05) using the {chi}2-approximation to the likelihood ratio, are shown in Table 3. The percentage of cases where the true {tau} was rejected at the 95% significance level is consistent with this significance level (the 95% confidence interval about the expected {alpha} of 0.05 is 0.016–0.113 under the binomial distribution with n = 100). Furthermore, the distribution of likelihood ratios was well fitted by a {chi}2-distribution (results not shown), as expected given the nesting of the different hypotheses for N{tau} (see THEORY: Parameter estimation).


 
View this table:
In this window
In a new window

 
Table 3. Performance of {tau} and the {chi}2-approximation to the likelihood ratio

It can be seen that the estimates of N{tau} are biased upward, although the true N{tau} falls within the 95% confidence limits at the correct frequency. This upward bias is a result of the finite number of simulations carried out when calculating each likelihood value (in this case, n = 20 in Equation 6). Simulated values of the summary statistics and S2k under a large N{tau} have a high variance and so can result in a reasonable likelihood for a large estimated {tau} even if the sequence data are from a population with a small N{tau}. Conversely, simulations under a small N{tau} have a low variance, and so the likelihood for a small {tau} is unlikely to be large for data sampled from a big population. As n -> {infty} and {partial} -> 0, this variance will tend to zero and the bias will disappear.

It can be concluded that the small N{tau} reported here for HIV-1 subtypes A and B are not a result of bias in the method, which would tend to result in slight over- rather than underestimation of N{tau}.


*  DISCUSSION
*TOP
*ABSTRACT
*THEORY
*MATERIALS AND METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

The analysis of gag and env sequences presented here reveals a small current real-time effective population size, N{tau}, for subtypes A and B of HIV-1, of ~105 and ~102, respectively. This real-time effective population size can be converted to WRIGHT's (1931) effective population size, Ne, by dividing by the generation time, g. However, it is unclear what the generation time is for HIV-1 infections. Rates of partner change for homosexuals and heterosexuals tend to be on the order of 1 yr (ANDERSON and MAY 1988 Down), while it is unclear what the rate of needle sharing is among IDUs (KAPLAN 1989 Down; BLOWER et al. 1991 Down). There is a similar ambiguity regarding the probability of transmission in each case. If the generation time is on the order of 1 yr, then N{tau} = Ne, while for shorter generation times N{tau} < Ne. However, even for a very short generation time of 1 mo, the estimated N{tau} imply that the Ne of subtypes A and B is ~106 and ~103, respectively. Both these estimates fall below the census size of HIV-1 infections, particularly for subtype B that predominates in the United States and Europe, where HIV-1 is estimated to infect more than 1.4 million people (UNAIDS and WHO 1998).

Why is Ne small?
Two features of HIV-1 may be important in causing a small effective population size: the transmission dynamics and/or natural selection. The transmission dynamics of HIV-1 are such that there is a large variance in the rate at which new infections are generated (i.e., a large {sigma}2, and hence small Ne). The different modes of transmission of HIV-1, via homosexual or heterosexual sex, needle-sharing among IDUs, or contamination of blood products, are all associated with different rates of transmission of the virus. For example, transmission of the virus through a susceptible network of needle-sharing IDUs is likely to be initially more rapid than through a susceptible heterosexual population (KAPLAN 1989 Down; BLOWER et al. 1991 Down). However, even within a sexual mode of transmission, there may be substantial heterogeneity in the rate of spread due to variation in partner exchange rates and transmission probabilities (ANDERSON et al. 1992 Down; SERVICE and BLOWER 1995 Down). For instance, with regard to the former, for the average rate of homosexual partner exchange (8.7 per year) in England and Wales, the variance is ~600 per year (ANDERSON and MAY 1988 Down). With regard to the latter, there is substantial evidence for heterogeneity in the susceptibility of different people to infection. For example, ~10% of the Caucasian population in the United States possess a 32-bp deletion in the chemokine receptor CCR5, and are less susceptible to HIV-1 infection when homozygous (DEAN et al. 1996 Down; MICHAEL et al. 1997 Down).

In an analogous way to heterogeneity in susceptibility, subdivision of the human population according to geographic, social, and behavioral barriers may also play a role in reducing the effective population size below the census size, N. If the migration rate is high, then a subdivided population of total size Ne, consisting of L subpopulations or demes, will begin to approximate a panmictic population of size Ne/L as the migration rate M -> {infty} (for lower migration rates, subdivision can inflate the effective population size, Ne, above N). Subdivision of the HIV-1 subtypes A and B is suggested by a certain degree of clustering of the places of isolation on the phylogenetic trees relating the sequences (see Figure 1), but a simple two-deme model of population subdivision was not found to give a significantly better fit to either subtype (see Table 2). This may reflect the minimal effect subdivision has on patterns of sequence diversity when the population has been growing at a rapid rate. In such cases, a star-like phylogeny where all coalescences occur at approximately the same time will be produced no matter what the level of subdivision, unless single lineages survive into the past in demes of very small size (Ne < 1). It is also possible that the particular model of subdivision and the use of summary statistics to make inferences result in a poor improvement in the likelihood when the subdivided model is assessed. The use of a genealogy-based method (sensu., GRIFFITHS and TAVARE 1994B Down; KUHNER et al. 1998 Down), where place of isolation is an explicit part of the model, would be likely to result in improved power in testing the fit of a model of subdivision.

The analysis of the gag and env genes also reveals a larger {tau} for subtype A (found in Africa) than subtype B (found in the United States and Europe), although the confidence limits overlap for the env gene (see Figure 3). This may simply be the result of a greater age, number of infections, and hence diversity of HIV-1 in Africa than in Europe and the United States. Alternatively, it may be a result of differences in transmission dynamics. Subtype B is associated mainly with the AIDS epidemic in homosexuals and IDUs in developed countries, where transmission rates may be more variable than for the heterosexual transmission associated with subtype A (ANDERSON and MAY 1988 Down; BLOWER et al. 1991 Down; GREENHALGH 1996 Down). A larger value of {sigma}2 for subtype B could therefore explain the smaller estimated N{tau}.

Although transmission dynamics shape the observed sequence diversity of HIV-1, natural selection is also likely to play a role. The high levels of rate heterogeneity observed along both the gag and env genes (Table 1) suggest an important role of functional constraint. This is further evidenced by the restricted number of nonsynonymous as opposed to synonymous changes for these two genes (the dN:dS ratio is 0.44 and 0.51 for the env subtype A and B alignments, and 0.25 and 0.24 for the gag alignments). However, although functional constraint can reduce linked genetic diversity (CHARLESWORTH et al. 1993 Down), such a restriction will be reflected by a lower estimated substitution rate, µ, and therefore is unlikely to result in a reduction of e (which is confounded with µ) unless such constraint is accompanied by occasional selective sweeps of fit mutants through the subtype. Such selective sweeps would reduce linked variation while maintaining a high substitution rate and hence could therefore decrease e. Selective sweeps may be a feature of HIV-1 evolution, but the high frequency at which HIV-1 recombines (DIAZ et al. 1995 Down; ROBERTSON et al. 1995 Down; ZHU et al. 1995 Down; MOUTOUH et al. 1996 Down) will restrict this reduction in diversity to the vicinity of the selected mutation. Thus selective sweeps could be reducing {tau} substantially, but only if they occur fairly often. So far, little evidence concerning whether such sweeps occur at the subtype level or what their frequency is has accumulated.

Role of recombination:
Recombination reduces the variance of the distribution of pairwise differences, while little affecting the mean of this distribution. Thus if recombination is ignored, when {lambda} and Ne are coestimated, {lambda} will be overestimated (the phylogeny simply becomes more star-like). However, estimates of Ne for a fixed {lambda} should be little affected because for a star phylogeny e is mainly determined by the mean of the distribution of pairwise differences. The estimates of Ne for HIV-1 reported here are therefore valid, despite the lack of recombination in the coalescent model. In the future it may be possible to use a coalescent model (e.g., see HUDSON 1987 Down; GRASSLY and HOLMES 1998 Down) to estimate the rates of recombination within and between HIV-1 subtypes.

Nonrandom sampling:
HIV-1 sequences are unlikely to be sampled randomly with respect to geographic origin for political and economic reasons. Because the estimates of Ne given here are based mainly on mean sequence diversity, such nonrandom sampling will not be too problematic unless subtype diversity is systematically underrepresented. This will occur when countries with prevalent HIV-1 are not included in the sample. In such cases estimates of Ne will be for the regions that are adequately sampled only. For example, because partial V3 sequences from China that group with subtype B (SHAO and WOLF 1995 Down) were not included in the alignment, the estimates of Ne presented here reflect subtype B diversity in the United States, Europe, and Japan, but not mainland Asia.

Importance of intrahost dynamics:
Both the substitution and genealogical processes modeled focus on interhost evolution. This is because each pair of sequences sampled from subtypes A and B is likely to be separated by a large number of transmission events. A lineage in the coalescent model therefore reflects a population with an effective size of ~103 (the estimated intrahost population size; LEIGH BROWN 1997 Down; LEIGH BROWN and RICHMAN 1997 Down) punctuated by repeated bottlenecks occurring at transmission (HOLMES et al. 1992 Down). For neutral mutations, where substitution rates are independent of population size, it is therefore reasonable to model the interhost substitution process with a standard Poisson model.

If many mutations within the host are nonneutral (e.g., due to immune surveillance), then viral population size within a coalescent lineage becomes important and the rate of interhost substitution may be more variable than the assumed Poisson process (ARAKI and TACHIDA 1997 Down). Such rate heterogeneity is supported by the rejection of constrained molecular clock phylogenetic trees by the likelihood-ratio test (see RESULTS). The effect of this rate heterogeneity may be further enhanced by the noncontemporaneous nature of the HIV-1 sequence isolates, which can cause sister taxa in the viral genealogy to have unequal branch lengths. The effect of departure from the assumption of a molecular clock and a genealogy with contemporary tips on estimates of Ne is unclear. It is not obvious how it could cause a reduction in estimates of e based on summary statistics, given that the mean sequence diversity is likely to remain the same. However, for likelihood calculations based on explicit representation of the HIV-1 genealogy, a more accurate model of the substitution process will be required. There is no evidence for systematic differences in the substitution process or rate between subtypes A and B that may be causing the apparent population parameter differences.

It is possible that selective pressures within the host may increase the effect of intrahost polymorphism on observed differences between sampled sequences. For this reason the large number of immunogenic epitopes along the env protein (reflected by a larger dN:dS ratio compared with the gag polyprotein; KORBER et al. 1996 Down) may explain why the env gene gives a less clear distinction between subtypes A and B. Selection during primary infection for particular configurations of the env V3 loop (e.g., see ZHANG et al. 1993 Down) will not affect the rate of substitution of linked (or unlinked) neutral variants (BIRKY and WALSH 1988 Down), and so is unlikely to have any effect on rates of interhost evolution, and hence estimates of Ne.

Implications of a small Ne:
The small value of e, if not a consequence of past selective sweeps, has important implications for the spread of drug-resistant mutations, which are typically at a selective disadvantage in the absence of the drug (e.g., GOUDSMIT et al. 1996 Down, GOUDSMIT et al. 1997 Down). In general, for populations of constant size Ne, selective effects that are less than the reciprocal of Ne are negligible compared to the stochasticity of drift. Thus for HIV-1, depending on the future rate of spread of the virus, the small e may indicate an important role for genetic drift in the fixation of drug-resistant mutations. Although the values of selective coefficients for drug-resistant mutations seem to be variable (s = -0.004 to -0.25 within drug-naïve patients; GOUDSMIT et al. 1996 Down), the cases of drug resistance in drug-naïve patients that have begun to be reported (CONLON et al. 1994 Down; NAJERA et al. 1995 Down; GOUDSMIT et al. 1996 Down; KOZAL et al. 1996 Down; CORNELISSEN et al. 1997 Down) suggest that, at least in some cases, s is small enough for genetic drift to occur. Of course, the high mutation rate and rapid turnover of HIV-1 within infected individuals (HO et al. 1995 Down; WEI et al. 1995 Down) means that single mutations conferring resistance to drugs are likely to arise by chance even in drug-naïve patients (BONHOEFFER and NOWAK 1997 Down). However, for drug resistance requiring multiple mutations preexistence is unlikely (BONHOEFFER and NOWAK 1997 Down) unless the mutations provide some drug resistance on their own, in which case they may spread and then be brought together in a single viral genome by recombination. The small e therefore suggests that drug resistance requiring multiple mutations may arise through the stochastic effects of drift. This has important implications for the efficacy of drugs and combination drug therapy, because preexistence of a few drug-resistant viruses within an individual implies that even if the drug therapy is effective in dramatically reducing viral load, resistance will quickly emerge (BONHOEFFER and NOWAK 1997 Down; BONHOEFFER et al. 1997 Down).


*  ACKNOWLEDGMENTS

Many thanks to Oliver Pybus for providing early sequence alignments. This work was supported by grants from the Biotechnology and Biological Sciences Research Council (BBSRC), The Royal Society, and The Wellcome Trust.

Manuscript received July 17, 1998; Accepted for publication October 26, 1998.


*  LITERATURE CITED
*TOP
*ABSTRACT
*THEORY
*MATERIALS AND METHODS
*RESULTS
*DISCUSSION
*LITERATURE CITED

ANDERSON, R. M. and R. M. MAY, 1988  Epidemiological parameters of HIV transmission. Nature 333:514-519[Medline].

ANDERSON, R. M., R. M. MAY, T. W. NG, and J. T. ROWLEY, 1992  Age-dependent choice of sexual partners and the transmission dynamics of HIV in sub-Saharan Africa. Philos. Trans. R. Soc. Lond. B Biol. Sci. 336:135-155[Medline].

ARAKI, H. and H. TACHIDA, 1997  Bottleneck effect on evolutionary rate in the nearly neutral mutation model. Genetics 147:907-914[Abstract].

BIRKY, C. W. and J. B. WALSH, 1988  Effects of linkage on rates of molecular evolution. Proc. Natl. Acad. Sci. USA 85:6414-6418[Abstract/Free Full Text].

BLOWER, S. M., D. HARTEL, H. DOWLATABADI, R. M. ANDERSON, and R. M. MAY, 1991  Drugs, sex and HIV: a mathematical model for New York City. Philos. Trans. R. Soc. Lond. B Biol. Sci. 331:171-187[Medline].

BONHOEFFER, S. and M. A. NOWAK, 1997  Pre-existence and emergence of drug resistance in HIV-1 infection. Proc. R. Soc. Lond. Ser. B Biol. Sci. 264:631-637[Medline].

BONHOEFFER, S., R. M. MAY, G. M. SHAW, and M. A. NOWAK, 1997  Virus dynamics and drug therapy. Proc. Natl. Acad. Sci. USA 94:6971-6976[Abstract/Free Full Text].

CHARLESWORTH, B., M. T. MORGAN, and D. CHARLESWORTH, 1993  The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289-1303[Abstract].

CONLON, C. P., P. KLENERMAN, A. EDWARDS, B. A. LARDER, and R. E. PHILLIPS, 1994  Heterosexual transmission of human immunodeficiency virus type-1 variants associated with zidovudine resistance. J. Infect. Dis. 169:411-415[Medline].

CORNELISSEN, M., R. VAN DEN BURG, F. ZORGDRAGER, V. LUKASHOV, and J. GOUDSMIT, 1997  pol gene diversity of five human immunodeficiency virus type 1 subtypes: evidence for naturally occurring mutations that contribute to drug resistance, limited recombination patterns, and common ancestry for subtypes B and D. J. Virol. 71:6348-6358[Abstract].

DEAN, M., M. CARRINGTON, C. WINKLER, G. A. HUTTLEY, and M. W. SMITH et al., 1996  Genetic restriction of HIV-1 infection and progression to AIDS by a deletion allele of the CKR5 structural gene. Science 273:1856-1862[Abstract/Free Full Text].

DIAZ, R. S., E. C. SABINO, A. MAYER, J. W. MOSLEY, and M. P. BUSCH, 1995  Dual Human Immunodeficiency Virus type-1 infection and recombination in a dually exposed transfusion recipient. J. Virol. 69:3273-3281[Abstract].

EDWARDS, A. W. F., 1992 Likelihood. Extended Edition. The Johns Hopkins University Press, London.

EIGEN, M. and K. NIESELT-STRUWE, 1990  How old is the immunodeficiency virus? AIDS 4:S85-S93.

FELSENSTEIN, J., 1974  The evolutionary advantage of recombination. Genetics 78:737-756[Abstract/Free Full Text].

FELSENSTEIN, J., 1981  Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376[Medline].

FELSENSTEIN, J., 1992  Estimating effective population-size from samples of sequences—inefficiency of pairwise and segregating sites as compared to phylogenetic estimates. Genet. Res. 59:139-147[Medline].

FISHER, R. A., 1921 On the `probable error' of a coefficient of correlation deduced from a small sample. Metron 1: part 4, 3–32.

FISHER, R. A., 1930 The Genetical Theory of Natural Selection. Clarendon Press, Oxford.

FU, Y. X., 1994  A phylogenetic estimator of effective population size or mutation rate. Genetics 136:685-692[Abstract].

FU, Y. X. and W. H. LI, 1993  Maximum likelihood estimation of population parameters. Genetics 134:1261-1270[Abstract].

GAO, F., D. L. ROBINSON, S. G. MORRISON, H. X. HUI, and S. CRAIG et al., 1996  The heterosexual human immunodeficiency virus type 1 epidemic in Thailand is caused by an intersubtype (A/E) recombinant of African origin. J. Virol. 70:7013-7029[Abstract/Free Full Text].

GOUDSMIT, J., A. DE RONDE, D. D. HO, and A. S. PERELSON, 1996  Human immunodeficiency virus fitness in vivo: calculations based on a single zidovudine resistance mutation at codon 215 of reverse transcriptase. J. Virol. 70:5662-5664[Abstract/Free Full Text].

GOUDSMIT, J., A. DE RONDE, E. DE ROOIJ, and R. DE BOER, 1997  Broad spectrum of in vivo fitness of human immunodeficiency virus type 1 subpopulations differing at reverse transcriptase codons 41 and 215. J. Virol. 71:4479-4484[Abstract].

GRASSLY, N. C., and E. C. HOLMES, 1998 The use of Monte Carlo simulation to infer population dynamic history from DNA sequence data, pp. 91–112 in Proceedings of the Trinational Workshop on Molecular Evolution, edited by M. K. UYENOYAMA, A. VON HAESELER and N. TAKAHATA. Duke University Publications Group, Durham, NC.

GREENHALGH, D., 1996  Effects of heterogeneity on the spread of HIV AIDS among intravenous drug users in `shooting galleries'. Math. Biosci. 136:141-186[Medline].

GRIFFITHS, R. C. and S. TAVARE, 1994a  Ancestral inference in population genetics. Stat. Sci. 9:307-319.

G