- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by McVean, G. A. T.
- Articles by Vieira, J.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by McVean, G. A. T.
- Articles by Vieira, J.
Inferring Parameters of Mutation, Selection and Demography From Patterns of Synonymous Site Evolution in Drosophila
Gilean A. T. McVeana and Jorge Vieiraaa Institute of Cell, Animal and Population Biology, University of Edinburgh, Edinburgh EH9 3JT, United Kingdom
Corresponding author: Gilean A. T. McVean, Department of Statistics, 1 S. Parks Rd., Oxford OX1 3TG, United Kingdom., g.mcvean{at}ed.ac.uk (E-mail)
| ABSTRACT |
|---|
Selection acting on codon usage can cause patterns of synonymous evolution to deviate considerably from those expected under neutrality. To investigate the quantitative relationship between parameters of mutation, selection, and demography, and patterns of synonymous site divergence, we have developed a novel combination of population genetic models and likelihood methods of phylogenetic sequence analysis. Comparing 50 orthologous gene pairs from Drosophila melanogaster and D. virilis and 27 from D. melanogaster and D. simulans, we show considerable variation between amino acids and genes in the strength of selection acting on codon usage and find evidence for both long-term and short-term changes in the strength of selection between species. Remarkably, D. melanogaster shows no evidence of current selection on codon usage, while its sister species D. simulans experiences only half the selection pressure for codon usage of their common ancestor. We also find evidence for considerable base asymmetries in the rate of mutation, such that the average synonymous mutation rate is 2030% higher than in noncoding regions. A Bayesian approach is adopted to investigate how accounting for selection on codon usage influences estimates of the parameters of mutation.
IN phylogenetic and population genetic analyses, mutations at synonymous codon positions are typically considered to behave according to the neutral model of molecular evolution (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
Selection acting on synonymous codon positions affects not only base composition but also patterns of polymorphism and divergence (![]()
![]()
![]()
![]()
![]()
![]()
A second consequence of selection on codon usage is that changes in the strength or efficacy of selection acting on codon usage may lead to unusual patterns of divergence. For example, an excess of substitutions to unpreferred codons in the lineage leading to Drosophila melanogaster from its most recent common ancestor (MRCA) with D. simulans has been interpreted in terms of a recent reduction in the effective population size (Ne) of D. melanogaster (![]()
![]()
![]()
Such deviations from the neutral model of molecular evolution will clearly introduce biases into methods of phylogenetic analysis that assume that synonymous mutations are of no consequence to organismal fitness. The quantitative nature of the relationship between parameters influencing codon usage and patterns of divergence can be investigated through the use of explicit population genetics models (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
In this article, we present a new approach to analyzing patterns of molecular evolution and specifically patterns of synonymous site divergence. The idea is to use conventional likelihood methods of phylogenetic sequence analysis, but to parameterize the underlying models of sequence evolution in terms of the population genetics of mutation, selection, and drift. The benefit of using models constructed simply from biologically meaningful parameters is that we can use patterns of divergence to provide estimates of the parameters of interest. The benefit of a likelihood approach is that we can both test whether increasingly complex models provide significantly better fits to the data and get an idea of the range of parameter values compatible with the data.
Population genetic models incorporating selection have previously been applied to patterns of synonymous-site evolution (![]()
![]()
![]()
The second major advantage of the method is that we can use information from both patterns of codon usage and patterns of synonymous divergence. Population genetic models predict not only substitution rates but also equilibrium codon frequencies. Data on codon frequencies have been previously used to analyze patterns of codon usage (![]()
![]()
In the following section, we describe a simple model for sequence evolution that includes parameters of mutation, selection, and demography. We then describe how this can be applied to data on synonymous site divergence and use it to analyze patterns of evolution in 50 homologous genes from D. melanogaster and D. virilis and 27 homologous genes from D. melanogaster and D. simulans. Simultaneous analysis of multiple genes lends much power to the approach and provides a way of testing for heterogeneity between genes in patterns of divergence. To this end, we develop a series of goodness-of-fit tests to consider the extent to which the model provides an adequate description of the data.
| A POPULATION GENETIC MODEL FOR CODON USAGE EVOLUTION |
|---|
The mutation-selection-drift model for the evolution of codon usage provides a simple population genetic description of the forces influencing synonymous site evolution (![]()
![]()
![]()
![]() |
(1) |
where s is the selective advantage of the mutant allele over the ancestral allele and is negative if the mutation is deleterious.
|
If the rate of mutation per site per generation from codon type i to j is µij, the number of new mutations entering the population each generation at a site fixed for codon i is a Poisson distribution with mean 2Nµij. For 2Nµij << 1 and u(s) << 1, the fate of a new mutation is essentially independent of others occurring at the same locus either in the same or subsequent generations. If selection acts independently at each site in the genome, the rate of substitution from codon i to codon j is
![]() |
(2) |
and Sij is 4Nesij (4Ne times the difference in fitness between alleles, scaled to an average fitness of one). When S = 0, this reduces to the mutation rate.
To use this population genetic model as a description of sequence evolution we must make the simplifying assumption that substitution events are instantaneous on an evolutionary timescale. That is, all differences observed between genes sampled from different species are the result of fixations governed by Equation 2 and are not the result of mutations segregating in either population. This assumption is implicit to almost all phylogenetic models of sequence evolution (see, e.g., ![]()
For an amino acid with n synonymous codons, and using a continuous time approximation, the substitution probabilities determine an n x n rate matrix Q that characterizes the rates of change in state for each codon. For example, an amino acid encoded for by two C- and T-ending codons, such as asparagine, will have the matrix
![]() |
(3) |
which is time independent if the parameters of mutation, selection, and demography (Ne) are invariant. The population size of a species is likely to vary over time, but if the timescale of fluctuations is short relative to the substitution process, Ne is well approximated by the harmonic mean population size (![]()
To describe patterns of sequence evolution, consider the transition probability matrix P(t). For an amino acid encoded for by C- and T-ending codons this will have the form
![]() |
(4) |
which describes the probabilities pij that a codon is of type j given that it was of type i, t generations ago. For a time-independent transition matrix with a continuous timescale approximation, this is given by
![]() |
(5) |
(see, e.g., ![]()
![]()
= µt, where µ is the transversion mutation rate, as µ and t cannot be estimated separately.
For a given amino acid, the expected divergence matrix for homologous codons sampled at time
after speciation, D(
), where dij is the probability of observing a site with nucleotide i in species 1 and j in species 2, is given by
![]() |
(6) |
where P1(
) and P2(
) are the transition probability matrices for each species and F is a diagonal matrix of the frequencies of each codon in their common ancestor, given by the behavior of (5) as
.
In the full model, Ne is estimated for each daughter species relative to the MRCA, such that the ratio Ne(species)/Ne(MRCA) is given by the parameter fspecies. Changes in Ne generate nonstationary patterns of substitution and asymmetrical divergence matrices. This approach also differs from current methods for estimating sequence divergence (e.g., ![]()
| APPLICATION TO EMPIRICAL DATA |
|---|
The model outlined describes the behavior of codon usage for a given set of selection coefficients. But differences in codon usage between amino acids and genes should be allowed for when estimating parameters from data. As there is insufficient power to estimate the parameters of selection for each occurrence of each amino acid in each gene, our solution is to consider an explicit model for how the selection coefficient acting on codon usage varies between amino acids and genes (![]()
![]()
The power gained by these assumptions is that we can combine information from different genes, without assuming complete uniformity of the pressures affecting codon usage across the genome. This flexibility is considerably more realistic than previous approaches, but at the same time provides considerable power both to estimate the parameters determining codon usage and test the models against empirical data. In addition we can compare the fully parameterized model against simpler models through means of likelihood-ratio tests. We also use a simplified mutation model in which we estimate 4 of the 12 different mutation rates: transversions, rate µ; A
G/T
C transitions, rate
µ; G
A transitions, rate
G
µ; and C
T transitions, rate
C
µ (see Fig 1). Models with more parameters do not provide a better fit to the data. Variation in the mutation rates between genes or sites is not considered.
We have applied the models to two sets of data concerning codon usage evolution in Drosophila. The first consists of 50 orthologous genes from D. melanogaster and D. virilis (![]()
40 MY (![]()
![]()
3 MY (![]()
![]()
![]()
![]()
Sequences:
Only complete gene sequences were used in the comparisons. Gene sequences for the D. melanogaster-D. virilis comparison were as previously used (![]()
-amylase distal (D17733/L22720/413896),
-amylase proximal, (D17734/L22735/413898), amylase-like protein (U96159/U69607/AF039561), cecropin A1 (AB010790/3192094), cecropin B (AB010790/AF018999), decapentaplegic (U63854/M30116), esterase-6 (L10670/M15961), fat body protein 2 (AF045786/AF045796), glucose dehydrogenase (U63325/M29298), lethal-of-scute (AB005802/298089/3618332), myosin light chain (L08051/K01567), metallothionein N (M69016/M69015), cytochrome P450 (AF017005/AF017002), glucosephosphate isomerase (L27552/U20567/L27684), ras 1 (AF186649/K01960), ras3 (AF186655/8407), ref2p (U23930/8420/7407), spalt (M21227/8536/M21579), scute (AB005801/AH000975/3618330), Cu-Zn superoxide dismutase (X15685/8644/AF127159), triosephosphate isomerase (U60861/10945/U60870), vermilion (U27204/M34147) and white (U64875/10873). Protein sequence alignments were constructed for each gene pair, and only conserved amino acid positions were used in the analysis. Conserved amino acid positions are under stronger selection for codon usage in D. melanogaster (![]()
![]()
Likelihood estimation:
In the implemented version of the model, there are 45 parameters common to all genes: the time of species divergence, the ratio of current to ancestral population size for each species, 3 mutational parameters, and 39 parameters for the hierarchy of selection among codons. We treat serine as two separate amino acids because the two relevant sets of codons cannot be bridged in a single point mutation. For each gene the relative strength of selection on codon usage is also estimated.
The likelihood of the data for a given set of parameter values is calculated in two parts. For a given set of parameter values we generate a surface of expected codon divergence matrices (Equation 6) for each amino acid for a fixed number of classes of the strength of selection in the MRCA (21 classes evenly arranged over the interval 4Nes = 04, where the value of 4Nes is that distinguishing between the codons AAC and AAT for asparagine). Codon divergence tables for each gene are compared to the expected frequencies and the likelihood is calculated as
![]() |
(7) |
where Ci and fij are the observed count and expected frequency of the ith pair of codons. For each gene we find the class with the highest likelihood and the marginal maximum likelihoods are summed across genes. While we cannot guarantee to find the true maximum-likelihood value for the strength of selection, the error should be small for a sufficiently fine grid. We have also tried fitting a discrete gamma distribution to the selection coefficients; results obtained by this method are essentially identical to those presented here.
To explore the likelihood surface for the common parameters, we have adopted a Monte Carlo Markov chain (MCMC) approach, using the Metropolis-Hastings rejection algorithm. A parameter is chosen at random and incremented by a pseudorandom number drawn from a uniform distribution over the range (-
,
), where
has been previously fixed. The overall likelihood is then compared to the previous likelihood. If it is greater (i.e., more likely), the change is accepted. If it is less likely, then the change is accepted with probability

through comparison to a pseudorandom number drawn from a uniform distribution (0, 1). For reasonably smooth likelihood surfaces, this approach will explore parameter space in proportion to the likelihood, so samples from the MCMC can be used to obtain the posterior distribution of each parameter (with the assumption of a uniform prior for each parameter).
Although the Metropolis-Hastings algorithm will explore parameter values in proportion to their likelihood, two practical issues must be addressed. First, this behavior only applies once the MCMC reaches equilibrium, which takes an unknown period of time to achieve. Second, we also wish to find the maximum-likelihood (ML) parameter values (to compare models), and with such high dimensionality the MCMC takes a long time to find the ML values. For these reasons, we have split the problem in two. First, we use a form of simulated tempering to find the ML values. The MCMC is run as before, but for proposed changes with lower likelihood, the difference in log-likelihood is multiplied by a factor c (range 150). From the starting parameter values, the chain rapidly increases in likelihood. After a number of iterations (typically 5000 proposed changes), we then reduce c to <1 (range 00.5) for a number of iterations (typically 1000) and then revert to the original value. This is repeated a number of times and with different starting points, until the same ML values are found repeatedly. The parameter values arrived at by this method are taken to be the ML estimates. The MCMC is then run for 5 x 106 proposed changes, with samples taken every 500 proposed changes, after a burn-in of 5 x 105 proposed changes, to estimate the posterior distribution of each parameter. The procedure was repeated several times to check for convergence.
| RESULTS |
|---|
We first consider whether the addition of extra parameters of selection and demography significantly improves the fit between model and data. Table 1 shows the increase in log-likelihood for a series of improvements to a basic model in which all synonymous mutations are neutral. The major codon usage (MCU) selection model assumes that the difference in fitness between preferred and unpreferred codons is the same for all amino acids in each gene, but that there is variation between genes in the strength of selection acting on codon usage. Preferred codons are those defined by ![]()
2 distributed with the degrees of freedom equal to the difference in the number of parameters (![]()
|
Relative mutation rates:
We use a simplified mutation model that considers four different mutation rates (see Fig 1). The reason for separating different types of transition is that there is considerable evidence for an AT-biased mutation process in Drosophila (![]()
![]()
T and G
A mutations are 1.53 times higher than T
C and A
G mutations (Table 2). In both comparisons the maximum a posteriori (MAP) parameter estimates are considerably lower than the ML estimates, but included within the 2-unit support intervals.
|
We also find evidence for an elevated rate of other transitions: A
G and T
C transitions are estimated to be one to two times as common as transversions in the two comparisons. Both sets of values predict that in regions under no selection on base composition, the GC content should be 3940%. This compares with empirical estimates from intron sequences in which the average GC content is
37% (![]()
![]()
The hierarchy of selection on codon usage:
To describe the influence of selection on codon usage, we estimate the fitness of codons relative to the others within the same synonymous group and the strength of selection relative to other amino acids: what we have previously called the hierarchy of selection on codon usage (![]()
= 0.73, P = 0.004 by randomization; mel-sim,
= 0.57, P = 0.01); see also ![]()
|
There are two possible explanations. With more codons and more tRNAs, there is greater opportunity for unfavorable tRNA-codon pairings and consequently greater selection for codons that correspond to the more abundant tRNAs. Alternatively, given that there is a correlation between the relative frequency of amino acids and the number of codons (
= 0.53, P = 0.018;
= 0.58, P = 0.01 for the two data sets), a correlation between codon number and strength of selection is expected if the frequency of an amino acid determines the strength of selection on codon usage. To distinguish these hypotheses, we can consider the relationship between amino acid frequency and strength of selection within each group of amino acids. We find that observed patterns vary between comparisons, but, if anything, there is a negative correlation between amino acid abundance and strength of selection on codon usage within groups: amino acids with two codons,
= -0.09, P = 0.80;
= -0.57, P = 0.09; four codons,
= -0.85, P = 0.026;
= 0.68, P = 0.15. Abundance per se is therefore not an important determinant of the strength of selection on codon usage. We suggest that amino acids with more codons have stronger selection on codon usage because they have a greater opportunity for unfavorable codon-tRNA pairing.
Demographic differences between species:
For each species comparison we consider a simple demographic model in which an ancestral population at mutation-selection-drift equilibrium gives rise to two daughter species of variable population size. We estimate the strength of selection acting on codon usage in the ancestral population, the relative strength in each daughter species (fsp), and the time since the speciation event in terms of the expected number of mutations in neutral noncoding regions (Table 2).
As previously (![]()
50%, based on amino acids with two codons, the relative strength in D. virilis is estimated to be
90% that of the D. melanogaster lineage, and both experience less than half the strength of selection on codon usage than their MRCA.
In the mel-sim comparison, we find that both species have a lower strength of selection on codon usage than their MRCA. But whereas D. simulans has about one-half the selection of the ancestral species, selection in D. melanogaster has been reduced to one-tenth of its ancestral level and is not significantly different from zero. A decrease in the strength of selection (Nes) in D. melanogaster since divergence from its MRCA with D. simulans has been previously suggested from patterns of molecular evolution (![]()
![]()
2
=1 = 4.07, P < 0.05) in the D. simulans lineage (preferred codons classified according to ![]()
|
Testing the adequacy of the model:
The methods presented here provide a powerful approach to estimating parameters of mutation, selection, and demography from patterns of synonymous site evolution. However, that the parameter values we estimate are only as good as the model is an accurate description of the data. There are two ways in which our model may be inaccurate. First, the population genetic model makes assumptions that are not valid (particularly independence between sites and the neglect of polymorphism). These are discussed in the Appendix Second, the constraints we impose to apply the model to sequence data may be inaccurate. For example, we do not consider variation in the selection coefficient between occurrences of an amino acid within a gene or variation in the mutation rate between genes. In short, the number of parameters in the model is bound to be fewer than in reality.
To test the adequacy of the model as providing an explanation of the data, we consider a series of goodness-of-fit tests. The simplest test is to compare the observed number of each codon pair with that expected from the ML parameter values. We calculate a G-test statistic as the sum over all possible homologous codon pairs i, amino acids j, and genes k,

where Cijk is the observed number of such pairs and E[Cijk] is the expected number (![]()
What is the cause of the discrepancy? There are two important ways in which the model may fail. First, the level of divergence may vary more between genes and amino acids than the model predicts. Second, deviation between observed and expected codon frequencies may result from the constraints imposed by the fixed hierarchy of selection. To investigate discrepancies between the model and data in the level of divergence, we consider a G-test statistic obtained by comparing the observed (Djk) and expected numbers of homologous codons at which there are differences, across amino acids j and genes k:

The significance level of each test is calculated by comparing the observed value to those from 20,000 simulated data sets.
In both comparisons, a number of genes show strong deviation from the expected divergence (nine and six, respectively), although this is not consistently in one direction. We also consider Gd for each amino acid by summing the observed and expected levels of divergence for each amino acid in each gene. For 7 (mel-vir) and 3 (mel-sim) of the 19 amino acids, the model predicts levels of divergence significantly different from those observed.
To investigate the extent to which the model provides an adequate fit to observed codon frequencies in each species, we consider a G-test statistic obtained by summing the discrepancy between observed and expected codon counts, over codons i, amino acids j, and genes k:

Note that this test does not consider divergence, and codon frequencies in each species are considered separately. We also partition this value by genes and amino acids. Significance is assessed by comparison to 10,000 data sets simulated from the ML parameter estimates.
We find significant deviations from the model for both genes and amino acids. In the mel-vir comparison, 18 (mel) and 25 (vir) genes have Gcod values >95% of simulated values, while for the mel-sim comparison, the numbers are 3 (mel) and 5 (sim) genes. Partitioning by amino acids we find that most amino acids show significant deviation in the mel-vir comparison, although only 3 (mel) and 5 (sim) do in the mel-sim comparison. In short, the simple model is adequate for the majority of genes and amino acids in terms of describing levels of sequence divergence, but there is considerable discrepancy between the model and data in terms of the fitted codon frequencies for the more divergent species pair. A possible cause of the discrepancy is changes in the relative strength of selection acting on codon usage between amino acids over the timescale separating D. melanogaster and D. virilis (![]()
| DISCUSSION |
|---|
The method presented here is the first attempt to combine conventional phylogenetic methods of sequence analysis with populations genetic models of the underlying substitution process, incorporating mutation, selection, and drift. By applying these ideas to patterns of synonymous site divergence in Drosophila, we have estimated the strength of selection acting on codon usage and detected demographic events in the history of the species that have had a profound effect on the efficacy of selection. We suggest that this approach provides a powerful method for understanding patterns of sequence evolution because the underlying models are constructed exclusively from biologically meaningful parameters.
The effect of base composition on the mutation rate:
Our analysis provides strong evidence for considerable mutational bias in Drosophila, such that the rates of C
T and G
A mutation are about two times higher than the reverse. This has important consequences for interpreting patterns of molecular evolution and variation, because it implies a link between the base composition of a region and its mutation rate. This point has been made before (![]()
S, the average synonymous mutation rate (in terms of synonymous mutations per synonymous site as estimated by the method of ![]()
![]()
4, at synonymous positions. Almost identical values are obtained if we use a set of 2070 genes (![]()
|
A consequence of a higher synonymous mutation rate is that we expect synonymous site polymorphism to be greater than that in noncoding regions, as long as the strength of selection acting on codon usage is relatively weak (![]()
![]()
![]()
20% higher (![]()
An alternative explanation for higher variability at synonymous sites is that introns and other noncoding regions of the genome are under stronger constraint than synonymous sites. To test this, we can compare the level of divergence observed in introns with that predicted by our best-fitting model under the assumption of neutrality. For the mel-sim comparison, the ML values predict that the average proportion of differences between homologous sequences in neutral noncoding regions should be 8.9%. From an analysis of 30 homologous internal introns (within the open reading frame) from 21 genes in D. melanogaster and D. simulans we find an average difference of 8.7% (274 changes in 3154 alignable sites, excluding GT . . AG motifs). While this analysis is not conclusive evidence for a lack of constraint in introns, it is compatible with the majority of changes in intron sequences being neutral. Higher synonymous than noncoding diversity in D. melanogaster and D. simulans may therefore simply be due to a higher mutability of these sequences. For D. virilis it may be that selection on codon usage is sufficiently strong to reduce synonymous polymorphism below that in introns. However, ![]()
Estimating parameters of mutation, selection, and drift from patterns of molecular evolution:
Using sequence data to estimate parameters such as the mutation rate and the strength of selection influencing new mutations has many advantages, most importantly the ease of data collection. But the accuracy and efficacy of such methods is debatable. The factors affecting molecular evolution are clearly far more complex than any model that can be fitted, and naive model comparisons have the potential to lead to greater faith in parameter estimates than the data truly allow.
A partial solution is to consider the ability of a model to provide an adequate explanation of the data before using it to obtain parameter estimates. The inability to reject a model does not guarantee that the model is correct, but parameters estimated from models that can be rejected should be treated with caution. The problem with this approach is that the factors affecting molecular evolution are so diverse that as the complexity of the fitted model increases, so may the power to reject it. For example, the model we have considered here is by a long way the most complex yet analyzed, but we have also shown that the model is inadequate in several important respects. While we do not wish to claim that all our parameter estimates are insensitive to such inadequacy, it may well be that some, or even the majority, are. No general statement can be made about which parameters we expect to be strongly biased by model inaccuracy; this can only be addressed through simulation (see the Appendix).
The second limitation in using molecular evolutionary analysis is that often the models do not estimate the parameters of interest, but only some compound parameter such as tµ or Nes. To estimate the per-generation mutation rate, or the selection coefficient differentiating between alternative codons, we therefore have to obtain estimates of other parameters. Often no particularly reliable estimate of such nuisance parameters is available.
One solution is to use conservative, or extreme, estimates to define upper and lower bounds. This approach not only throws away information but also leads to parameter estimates that cannot be combined. For example, to estimate the per-generation neutral mutation rate in Drosophila, we could use the point estimates of 45 MY for the separation of the D. melanogaster and D. virilis lineages, 3 MY for the separation of D. melanogaster and D. simulans, and an average of 10 generations per year in the wild (![]()
An alternative solution is to adopt a Bayesian approach and estimate the posterior distribution of the parameters of interest. The posterior distributions of the model parameters are obtained from the Monte Carlo Markov chain analysis (though note the previous caveat about model adequacy). The choice of priors for the other parameters is subjective, and we should err on the side of caution. The split of the Drosophila and Sophophora subgenera has been put at no more recent than 30 MYA and perhaps as early as 60 MYA, while from biogeography the split between D. melanogaster and D. simulans is thought to be 24 MYA (![]()
These priors used are represented in Fig 3, as are the resulting posterior distributions for each comparison and the combined posterior. The combined MAP estimate is 1.5 x 10-9 mutations per site per generation in noncoding DNA, and the 95% credibility interval is 10-92.5 x 10-9. The distributions obtained from the two comparisons are very similar, which perhaps implies that sequence data are not the limiting factor in the accuracy of our estimate. The single greatest element of uncertainty in the calculation is the number of generations per year. Remarkably, the MAP estimate is almost exactly that obtained by ![]()
![]()
![]()
|
Why should accounting for selection on codon usage have such a small effect on estimates of the per-generation mutation rate? Fig 4 shows, for each comparison, the relationship between the relative strength of selection acting on codon usage, the observed and expected proportions of differences at synonymous sites, and the differences expected at completely neutral sites. Two features are of note. First, the observed average proportions of synonymous differences are very close to the expected proportions of differences at noncoding sites. Second, the expected relationship between the strength of selection acting on codon usage and rate of synonymous divergence is fairly weak, particularly for the mel-sim comparison. In short, weak selection, of the order required to explain codon usage patterns in Drosophila, does not greatly influence the overall rate of substitution (though it does influence the type of mutations that become fixed between species).
|
The genome-wide rate of mutation:
For several aspects of evolutionary theory, genome-wide parameters of mutation are more relevant than rates at individual sites. Specifically, the per-genome rate of deleterious mutation is a critical parameter in determining the relevance of one class of theory for the maintenance of sexual reproduction (![]()
![]()
![]()
This approach is clearly not valid in organisms for which there is selection acting on codon usage (![]()
Such complications are directly addressed by the methods presented here. To calculate the genome-wide rate of synonymous mutation in the D. melanogaster genome, US, we use the previous priors for the times of divergence and the number of generations per year, and a prior for the number of genes in the D. melanogaster genome as shown in Fig 5 (Ggenome: range 1115 (![]()
![]()
5 x 10-3. However, from the mel-sim comparison we estimate that no synonymous mutations experience a selection intensity of |Nes| > 1 in either D. melanogaster or D. simulans, and only 6% did in their MRCA. Using a data set of 2070 genes (![]()
|
The per-genome rate of nonsynonymous mutation, UN, in D. melanogaster can be estimated by counting the number of mutations in genes that would lead to a change in the amino acid. We obtain a combined MAP estimate of 0.024 nonsynonymous mutations per haploid genome per generation from the data sets used here and an estimate of 0.028 from the EDGP data (Table 4 and Fig 5). Again, this value is almost identical to an estimate that did not take into account selection on codon usage (![]()
![]()
The genetic load caused by synonymous mutations:
While the selection coefficients associated with mutations at any one site are likely to be very small, the cumulative effects of a large number of unpreferred codons across the genomes may be considerable. ![]()

where si is the selective disadvantage of a codon relative to the best possible and Ci is the number of occurrences of that codon. From the results presented, we can estimate the genetic load caused by unpreferred codons in the D. melanogaster genome.
Using the approximation (1 - s)C
e-sC, the estimated genetic load in the genome as a whole is given by L = 1 - exp[(
)
i(wi - wopt)
], where the relative fitnesses of codons have been estimated from the model and Gsamp and Ggenome are the numbers of genes in the sample and the genome as a whole. Estimates of the recent Ne of D. melanogaster are
106 (![]()
| ACKNOWLEDGMENTS |
|---|
We thank Brian Charlesworth, Adam Eyre-Walker, Peter Keightley, Arcadi Navarro, Rasmus Nielsen, Ziheng Yang, and an anonymous reviewer for discussion and comments on the manuscript. G.M. is funded by the Natural Environment Research Council, and J.V. is supported by the Fundação para a Ciencia e Tecnologia (PRAXIS XXI/BPD/14120/97).
Manuscript received June 19, 2000; Accepted for publication September 27, 2000.
| APPENDIX |
|---|
The assumptions of the model:
The underlying model of sequence evolution we have employed makes a number of critical assumptions. The two most important are first that the substitution process is instantaneous and, therefore, that segregating mutations do not contribute to differences between sequences from different species. The second is that evolution acts independently at all sites. Neither assumption is realistic, so it is important to understand the effects of relaxing these assumptions on parameter estimates.
To this end we have conducted a series of simulations in which we consider a more biologically plausible speciation model and a finite sample size. We consider a two-allele model with selection and symmetric, reversible mutation, and a population of 500 diploid individuals, each with a genome of 1000 linked sites. At speciation, we assume that the population is simply duplicated (each daughter species has the same population size). We sample a single sequence from each daughter population at different time points and find the ML estimates of the selection coefficient and time since divergence (we assume that the relative mutation rates are known). For each combination of selection coefficient and time we take 50 independent samples: the scaled parameter values used are 4Neµ = 0.04, 4Ner = 0.1, and 4Nes in the range 04.
|
Table A1 shows the results of the simulations. The main effect is that the estimated strength of selection tends to be lower than the true strength of selection, even when the recombination rate between adjacent sites is relatively high. For weakly selected sites the bias is small (1020%), while for 4Nes
4, the average downward bias can be at least 40%, although for lower values of Neµ, the effects are weaker (data not shown). Both ancestral polymorphism and that generated subsequent to species divergence tend to lead to overestimates of the time since divergence, and for sites under strong selection the effect can be considerable. For 4Nes = 10 and
= 0.5, the ML estimate was always infinity.
These simulations are a worst-case scenario, both in terms of the sample size and value of Neµ. With large data sets, in which the majority of sites are under weak selection, 4Nes < 4, as seems to be the case for selection on codon usage in Drosophila, estimates of the time since divergence should be reasonably unbiased. The strength of selection acting on codon usage is likely to be an underestimate. In the mel-sim comparison, the upper limit for the estimated strength of selection on codon usage in their MRCA is
4Nes = 7.5 (for the difference between the codons TTG and TTA for leucine in the gene amyp). Given our simulation results, it is possible that this may be half the true value, but it is unlikely to be an order of magnitude higher. We cannot rule out the possibility that a fraction of synonymous sites have much higher selection coefficients.
| LITERATURE CITED |
|---|
ADAMS, M., S. CELNIKER, R. HOLT, C. EVANS, and J. GOCAYNE et al., 2000 The genome sequence of Drosophila melanogaster.. Science 287:2185-2195
AKASHI, H., 1994 Synonymous codon usage in Drosophila melanogaster: natural selection and translational accuracy. Genetics 136:927-935[Abstract].
AKASHI, H., 1995 Inferring weak selection from patterns of polymorphism and divergence at "silent" sites in Drosophila DNA. Genetics 139:1067-1076[Abstract].
AKASHI, H., 1996 Molecular evolution between Drosophila melanogaster and D. simulans: reduced codon bias, faster rates of amino acid substitution and larger proteins in D. melanogaster.. Genetics 144:1297-1307[Abstract].
AKASHI, H., 1999 Inferring the fitness effects of DNA mutations from polymorphism and divergence data: statistical power to detect directional selection under stationarity and free recombination. Genetics 151:221-238
AKASHI, H. and S. W. SCHAEFFER, 1997 Natural selection and the frequency distributions of "silent" DNA polymorphism in Drosophila. Genetics 146:295-307[Abstract].
ASHBURNER, M., 1989 Drosophila: A Laboratory Handbook. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
BULMER, M. G., 1991 The selection-mutation-drift theory of synonymous codon usage. Genetics 129:897-907[Abstract].
CHIAPELLO, H., F. LISACEK, M. CABOCHE, and A. HENAUT, 1998 Codon usage and gene function are related in sequences of Arabidopsis thaliana. Gene 209:GC1-GC38[Medline].
CROW, J. F., and M. KIMURA, 1970 An Introduction to Population Genetics Theory. Harper and Rowe, New York.
DRAKE, J. W., B. CHARLESWORTH, D. CHARLESWORTH, and J. F. CROW, 1998 Rates of spontaneous mutation. Genetics 148:1667-1686
DURET, L. and D. MOUCHIROUD, 1999 Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila and Arabidopsis.. Proc. Natl. Acad. Sci. USA 96:4482-44











