Originally published as Genetics Published Articles Ahead of Print on June 11, 2007.
Genetics, Vol. 176, 2035-2054, August 2007, Copyright © 2007
doi:10.1534/genetics.107.074377
Predicting the Size of the Progeny Mapping Population Required to Positionally Clone a Gene
Stephen J. Dinka*, Matthew A. Campbell†, Tyler Demers* and Manish N. Raizada*,1
* Department of Plant Agriculture, University of Guelph, Guelph, Ontario, Canada N1G 2W1 and † The Institute for Genomic Research, Rockville, MD 20850
1 Corresponding author: Department of Plant Agriculture, University of Guelph, 50 Stone Rd., Guelph, Ontario, Canada N1G 2W1.
E-mail: Raizada{at}uoguelph.ca
A key frustration during positional gene cloning (map-based cloning) is that the size of the progeny mapping population is difficult to predict, because the meiotic recombination frequency varies along chromosomes. We describe a detailed methodology to improve this prediction using rice (Oryza sativa L.) as a model system. We derived and/or validated, then fine-tuned, equations that estimate the mapping population size by comparing these theoretical estimates to 41 successful positional cloning attempts. We then used each validated equation to test whether neighborhood meiotic recombination frequencies extracted from a reference RFLP map can help researchers predict the mapping population size. We developed a meiotic recombination frequency map (MRFM) for ~1400 marker intervals in rice and anchored each published allele onto an interval on this map. We show that neighborhood recombination frequencies (R-map, >280-kb segments) extracted from the MRFM, in conjunction with the validated formulas, better predicted the mapping population size than the genome-wide average recombination frequency (R-avg), with improved results whether the recombination frequency was calculated as genes/cM or kb/cM. Our results offer a detailed road map for better predicting mapping population size in diverse eukaryotes, but useful predictions will require robust recombination frequency maps based on sampling more progeny.
A limited number of forward genetics techniques exist to isolate an allele that underlies a mutant or polymorphic phenotype and that require no prior knowledge of the gene product. These include protocols to isolate host DNA flanking insertional mutagens (e.g., transposons) (BALLINGER and BENZER 1989; RAIZADA 2003) and positional gene cloning techniques (BOTSTEIN et al. 1980; PATERSON et al. 1988; TANKSLEY et al. 1995) that permit the discovery of alleles created by chemical mutagens, radiation, or natural genetic variation. Positional gene cloning is feasible when the following conditions are met: (1) two parents exist that differ in a trait of interest; (2) the parents can be distinguished at the chromosome level by polymorphic DNA markers (e.g., RFLP); and (3) in a population of progeny, the underlying gene can be mapped relative to nearby DNA segments that have previously been cloned (BOTSTEIN et al. 1980; TANKSLEY et al. 1995). Unfortunately, positional gene cloning suffers from unpredictability in terms of the number of post-meiotic progeny that a researcher can expect to genotype to narrow a candidate chromosomal region to a small number of candidate genes (DINKA and RAIZADA 2006). For example, in rice (Oryza sativa L.), only 1160 gametes were genotyped to narrow the Pi36(t) allele to a resolution of 17 kb (LIU et al. 2005), whereas 18,944 gametes were genotyped to map the Bph15 allele to a lower resolution of 47 kb (YANG et al. 2004). During fine mapping, the physical distance between a known physical location on a chromosome (i.e., the molecular marker) and the target allele is inferred by the frequency of meiotic recombinants that can break cosegregation of the phenotype encoded by the target allele with physically anchored molecular markers (BOTSTEIN et al. 1980; PATERSON et al. 1988). 
Ideally, a gene hunt ends once a molecular marker is found that always cosegregates with the target phenotype in a large population of genotyped and phenotyped F2 (or post-F2) progeny. Therefore, the frequency of meiotic recombination in the vicinity of the target locus (defined as R = kilobase/cM), along with the local density of molecular markers, determines the size of the mapping population. We are interested in helping researchers predict mapping population size. As initial analysis assigns a target allele to a 1–5-cM map interval, the goal of this study is to determine whether the recombination frequency at this interval size, obtained from a high-density molecular marker map, can be used to predict the number of progeny required for subsequent sub-centimorgan mapping in combination with user-friendly mathematical formulas.
DURRETT et al. (2002) used the kb/cM ratio (R) as the basis of an equation (which we will refer to as the Durrett–Tanksley equation) to predict genotyping requirements during positional cloning, the only such equation we could find in the literature. DURRETT et al. compared the results of their equation to empirical evidence from 12 published positional cloning successes in Arabidopsis thaliana; the model often appeared to overestimate the number of progeny required to be genotyped. However, the accuracy of the model was difficult to assess, because only the genome-wide recombination frequency was employed, rather than local rates of recombination. Perhaps as a result, it was simply concluded that some researchers were lucky or unlucky (DURRETT et al. 2002).
Building upon the work of DURRETT et al., we have tried to understand and predict when a researcher will be lucky or unlucky during positional gene cloning by accounting for: (1) over-genotyping (resulting in redundant crossovers between the target locus and the closest molecular markers); (2) a low density of available molecular markers in the target interval (causing some crossovers to be missed); and most important, (3) high or low local rates of recombination (R) compared to the genome-wide average (NACHMAN 2002). We have compared the predictions of the Durrett–Tanksley equation to empirical data obtained from 41 positional cloning studies in rice (O. sativa L.), which is a model system for the world's most important crops, the cereals (PATERSON et al. 2005). Specifically, we have measured the predictive accuracy of the Durrett–Tanksley equation and then focused on whether "neighborhood" (<2 cM) recombination values obtained from a reference genetic map (HARUSHIMA et al. 1998) further improve the accuracy of the model compared to using the genome-wide average recombination rate (R-avg). In addition, we have derived and tested a simpler equation that predicts the size of the progeny mapping population. Finally, we have measured the utility of employing R-values calculated as genes/cM rather than kb/cM to predict mapping population size, as the former allows the candidate gene number to be estimated, which is of greater interest to researchers targeting sequenced, annotated genomes.
Use and modification of the Durrett–Tanksley equation:
First, we used the Durrett–Tanksley equation (DURRETT et al. 2002), which estimates the number of F2/post-F2 meiotic gametes required to positionally clone an allele as derived from an F1 heterozygote, based on the following probability:
$$P = 1 - \left(1 + \frac{NT}{100R}\right) e^{-NT/100R}$$
As the equation is dependent only on the value NT/100R, then if the probability is set at 0.95, NT/100R = 4.744, which may be rewritten as N = (4.744 x 100R)/T.
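The constant 4.744 can be checked numerically. The sketch below assumes the probability takes the form P = 1 − (1 + x)e^(−x) with x = NT/100R, a form that reproduces the constant quoted above; the function names and the bisection bounds are illustrative choices, not part of the published method.

```python
import math

def durrett_tanksley_probability(x: float) -> float:
    """P(success) as a function of x = N*T / (100*R) (assumed functional form)."""
    return 1.0 - (1.0 + x) * math.exp(-x)

def solve_x(p_target: float, lo: float = 0.0, hi: float = 50.0,
            tol: float = 1e-10) -> float:
    """Find x with P(x) = p_target by bisection; P increases monotonically in x."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if durrett_tanksley_probability(mid) < p_target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def gametes_required(p_target: float, t: float, r: float) -> float:
    """N = x * 100R / T, with T and R in matching units (kb and kb/cM)."""
    return solve_x(p_target) * 100.0 * r / t

print(round(solve_x(0.95), 3))  # approximately 4.744
```

Choosing a different success threshold only changes the constant, so the same helper can be reused for, say, P = 0.90 or P = 0.99.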
To adjust for the target number of gametes containing an informative crossover (χT), which we assume may decrease T (better map resolution), we introduced the empirically derived T modifier, 4.744/χT (see RESULTS); the resulting modified Durrett–Tanksley equation is as follows:

$$N = \frac{4.744 \times 100R}{T_{\text{marker}} \times (4.744/\chi_T)} = \frac{100\,R\,\chi_T}{T_{\text{marker}}}$$

where T-marker is the distance between the flanking molecular markers and χT is the number of crossovers between the closest two molecular markers (≥2). The Durrett–Tanksley equation assumes that the recombination frequency (R) is constant in the vicinity T of the target allele. This equation also requires that the genotype of the target allele (a) in F2/post-F2 progeny can be assigned. Thus, in the case of a recessive target allele, N equals the number of F2 testcross progeny. Alternatively, where F2 progeny are the product of selfing F1 heterozygotes (such as in plants), then since each F2 progeny is derived from two meioses, N equals two times the number of F2 progeny genotyped; this is only true, however, when the F2 progeny genotype AA can be distinguished from the genotype Aa since this is required to determine whether a crossover occurred on the proximal or distal side of the target allele. Such a determination requires testing progeny for segregation of phenotypes in the F3 generation (progeny testing).
Derivation of a simplified equation based on single-crossover probability:
We developed the following user-friendly equation to estimate the fine-mapping population size, an estimate of the number of F2 testcross progeny required to be genotyped to detect sufficient crossovers to achieve a desired kilobase or gene block resolution:
$$N = \frac{\log(1 - P)}{\log\left(1 - \dfrac{T_{\text{marker}}}{100R}\right)}$$
where N is the number of meiotic gametes (chromosomes) that must be genotyped in which it can be determined whether a crossover is located proximal or distal to the target allele, P is threshold probability of success (e.g., 0.95), T-marker is expected distance between flanking molecular markers (kilobases or candidate genes), and R is local or genome-wide average recombination frequency (kb/cM or genes/cM).
This equation was based on the assumption that if a crossover occurs in a segment (with length T) on the proximal side of a target allele in a large population of F2 progeny (N), then there is an equal chance that a recombination event will be carried by a sibling F2 gamete on the distal side within a distance of <T from the target allele, as shown in Figure 1B. Hence, because only the probability of a single recombination event occurring within the mapping population must be calculated, the equation is simplified. It is recognized that the distance between the two crossovers will range from zero to 2T; on average, the distance will be T, and likely <T when there are more than two informative crossovers and/or when the molecular marker resolution is limiting. Since the majority of positional cloning studies report more than two informative crossovers (χT) (see Table 2), and since the minimum distance between flanking molecular markers (T-marker) is often limiting, the probability is high that the distance between the closest two crossovers will be <T-marker.
The detailed derivation of this equation is as follows:
- P(failure) of a crossover in the target interval (T) per gamete = (total genome crossovers – target interval crossovers)/total genome crossovers.
- Alternatively, P(failure) per gamete = 1 – (fraction of genome x number of crossovers in whole genome).
- Thus, P(failure) per gamete = 1 – [(kb resolution/kb genome size) x (genome map in cM/100)] or P(failure) per gamete = 1 – [(gene block resolution/genome-wide gene number) x (genome map in cM/100)].
- Since P(failure) = (P(failure) per gamete)^N, where N is number of informative gametes, then

$$P_{\text{success}} = 1 - P_{\text{failure}}$$

and

$$P_{\text{success}} = 1 - \left(P_{\text{failure per gamete}}\right)^{N}$$

- Therefore, N = Log (1 – Psuccess)/Log [1 – ((gene block/genome gene number) x (genome map cM/100))] or N = Log (1 – Psuccess)/Log [1 – ((kb target/genome kb) x (genome map cM/100))].
- Simplified, the above equation can be rewritten as:

$$N = \frac{\log(1 - P)}{\log\left(1 - \dfrac{T_{(\text{kb})}}{100\,R_{(\text{kb/cM})}}\right)}$$

or

$$N = \frac{\log(1 - P)}{\log\left(1 - \dfrac{T_{(\text{gene})}}{100\,R_{(\text{genes/cM})}}\right)}$$

where R is local or genome-wide recombination frequency.
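The derivation above can be sketched in a few lines. The example assumes the simplified form N = log(1 − P)/log(1 − T/100R) that follows from substituting R = genome size/genome map length; the function names are hypothetical, and the inputs (T = 44.5 kb, a 430,000-kb genome, a 1521-cM map) are the rice values quoted elsewhere in this article, used here only for illustration.

```python
import math

def single_crossover_gametes(p_success: float, t: float, r: float) -> float:
    """Simplified form: N = log(1 - P) / log(1 - T / (100 R))."""
    per_gamete = t / (100.0 * r)  # chance one gamete carries a crossover in T
    if not 0.0 < per_gamete < 1.0:
        raise ValueError("T/(100R) must lie strictly between 0 and 1")
    return math.log(1.0 - p_success) / math.log(1.0 - per_gamete)

def single_crossover_gametes_genome_form(p_success: float, t_kb: float,
                                         genome_kb: float,
                                         genome_cm: float) -> float:
    """Unsimplified form, written with genome size and genetic map length."""
    per_gamete = (t_kb / genome_kb) * (genome_cm / 100.0)
    return math.log(1.0 - p_success) / math.log(1.0 - per_gamete)

# Illustrative inputs: both forms should agree when R = genome_kb / genome_cm.
genome_kb, genome_cm = 430_000.0, 1521.0
r_avg = genome_kb / genome_cm
n_simple = single_crossover_gametes(0.95, 44.5, r_avg)
n_genome = single_crossover_gametes_genome_form(0.95, 44.5, genome_kb, genome_cm)
print(round(n_simple), round(n_genome))
```

Because log(1 − x) ≈ −x for small x, the result is close to −ln(1 − P) × 100R/T, which makes the dependence on R and T easy to read off.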
Additional assumptions of this model are as follows:
- The equation assumes that the phenotype of the trait of interest can be readily scored to determine if a crossover occurred proximal or distal to the target allele; hence N is equivalent to the number of testcross progeny, 0.5 x the number of F2 (selfed) progeny (if no progeny testing performed), or 2 x the number of F2 (selfed) progeny (if F3 progeny testing is performed).
- The equation assumes that the frequency of double-recombinants in a small interval is negligible due to crossover interference.
- The equation assumes that the crossover may occur anywhere in the defined interval T such that the distance between each informative crossover and the target locus is <T.
- The recombination frequency is assumed to be constant in the region <2T.
Modified single crossover equation:
Based on empirical data, we then modified this equation by adjusting the genetic map resolution T by the number of crossovers (see RESULTS), resulting in the equation:
![]() |
where χT is the number of crossovers between the closest two molecular markers (≥2).
Analysis of published positional cloning studies:
We analyzed 41 published positional cloning/fine-mapping studies in rice to extract or calculate the three variables, N, T, and R (Table 1). The candidate gene resolution (T) [in kb or gene number, T(kb) or T(gene)] was either reported in each study or obtained by personal communication with the authors. In the latter case, these were confirmed by corroborating the kilobase resolution with the gene resolution using the TIGR Pseudomolecules Release 4.0 database (YUAN et al. 2005); retroelements, transposons, and transposases were excluded for gene resolution. The calculation of N gametes genotyped was more complex; it required us to distinguish the actual number of progeny genotyped (g) from the number of informative chromosomes (N), defined as chromosomes that had the potential of having a crossover between the target allele and a flanking molecular marker, and where the location of that crossover (proximal or distal to the target) was distinguished (e.g., using progeny testing). To convert g to N, we multiplied g by a meiosis factor (f) as shown in Table 1 (also see footnotes to Table 1). This required us to classify the mapping strategy used and note whether the target trait was dominant, recessive, or was expressed in the haploid generation (gamete or gametophyte). For example, for the cloning of the recessive bc1 allele (Y. LI et al. 2003), since only F2 recessive progeny were genotyped (7068 recessives genotyped out of 30,000 F2 progeny) and hence the genotype of the target allele was non-ambiguous, the total number of informative chromosomes genotyped was 2 x 7068 (i.e., f = 2, hence N = 2 x g). In contrast, for the fine mapping of the dominant Psr1 allele (NISHIMURA et al. 2005), since 3800 (Backcross 3, BC3) F1 progeny were genotyped, and thus only 50% of the target chromosomes underwent informative meioses, then f = 0.5, and N = 1900 informative chromosomes. 
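The conversion from progeny genotyped (g) to informative chromosomes (N) is a single multiplication by the meiosis factor (f). A minimal sketch using the two worked cases from the paragraph above follows; the function name is hypothetical.

```python
def informative_chromosomes(progeny_genotyped: int, meiosis_factor: float) -> float:
    """N = f * g, where f depends on the mapping strategy and dominance."""
    if progeny_genotyped < 0 or meiosis_factor <= 0:
        raise ValueError("g must be non-negative and f must be positive")
    return meiosis_factor * progeny_genotyped

# bc1: only recessive F2 progeny were genotyped, so f = 2.
n_bc1 = informative_chromosomes(7068, 2.0)
# Psr1: BC3F1 progeny, only half of the target chromosomes informative, so f = 0.5.
n_psr1 = informative_chromosomes(3800, 0.5)
print(n_bc1, n_psr1)  # 14136.0 1900.0
```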
For rice, it was assumed that males and females had equal rates of recombination, but in many species, such as zebrafish, this is not true (SINGER et al. 2002; LENORMAND and DUTHEIL 2005) and must be accounted for in the meiosis factor. Finally, to calculate the local recombination frequency (R-local) (Table 2), we used the following equation:

$$R_{\text{local}} = \frac{T}{m}, \qquad m = \frac{100 \times (\chi_1 + \chi_2)}{N}$$

where T is the map resolution (kb or genes), m is the genetic distance (cM) across T, χ1 is number of closest, proximal crossovers (Table 2), χ2 is number of closest, distal crossovers (Table 2), and N is total number of informative gametes (chromosomes) genotyped (Table 1). In a testcross, m = 100 x χ/progeny, whereas in a selfed cross with progeny testing, m = 100 x χ/(2 x progeny) since genotyping permits both chromosomes to contribute to the mapping population.
The only crossovers (χT) counted in the calculation were those located between the two molecular markers used to define T. For each of the 41 studies, we applied the values for R(local) and T(kb) to the Durrett–Tanksley equation, with P set at 0.95, and compared the number of informative gametes (N) required by this equation to the empirical numbers shown in Table 1. We performed both nonparametric correlation analysis (Spearman coefficient) and linear regression analysis using the software program Instat 3 (GraphPad Software).
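The R-local calculation can be sketched as follows, assuming R-local = T/m with m = 100 × (proximal + distal crossovers)/N. The function names are hypothetical. The Pi36(t) values for N (1160) and T (17 kb) come from this article, but the crossover counts (3 proximal, 4 distal) are assumed purely for illustration; they were chosen because a total of 7 reproduces the reported R-local of about 28.2 kb/cM.

```python
def map_distance_cm(crossovers: int, informative_gametes: float) -> float:
    """m = 100 * crossovers / N, assuming no double crossovers."""
    return 100.0 * crossovers / informative_gametes

def local_recombination_frequency(t: float, proximal: int, distal: int,
                                  informative_gametes: float) -> float:
    """R-local = T / m, in kb/cM or genes/cM depending on the units of T."""
    m = map_distance_cm(proximal + distal, informative_gametes)
    if m <= 0:
        raise ValueError("at least one crossover is needed to estimate R-local")
    return t / m

# Crossover counts below are ASSUMED, not taken from Table 2.
r_pi36 = local_recombination_frequency(17.0, proximal=3, distal=4,
                                       informative_gametes=1160)
print(round(r_pi36, 1))  # 28.2
```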
Generation of a reference meiotic recombination frequency map (MRFM) for rice:
To determine whether recombination frequencies derived from a reference genetic map could be used to predict progeny sampling requirements using the Durrett–Tanksley equation, we first assembled such a map, inspired by a previous report (WU et al. 2003), to generate two types of recombination values: R(gene), in genes/cM; and R(kb), in kilobases/cM (see supplemental Table 1 at http://www.genetics.org/supplemental/). The names and GenBank accession numbers of RFLP markers genetically mapped in an F2 population between Nipponbare and Kasalath were obtained from the Rice Genome Project (RGP: http://rgp.dna.affrc.go.jp/) (HARUSHIMA et al. 1998). FASTA sequence files for the markers were obtained from NCBI. The RFLP marker sequences from the RGP map were physically mapped onto the version 4 TIGR rice pseudomolecules map (http://www.rice.tigr.org) using the Genomic Mapping and Alignment Program (GMAP) (WU and WATANABE 2005). The physical map position of each marker was derived from the top hit that exceeded a threshold of 95% identity over 90% of the length. After physically positioning the RFLP markers onto the pseudomolecules, Perl scripts and manual inspection were used to remove all markers showing map incongruency (where the physical and genetic position of the markers were at odds). We obtained 1391 congruent markers for the RGP map. This established both physical and genetic locations and hence interval distances for each RFLP marker; from these values, the kb/cM recombination frequency was calculated for each marker pair. To generate the corresponding genes/cM frequencies, we queried the Osa1 database at TIGR: the coordinates of all 42,535 non-transposable element-related transcription units were obtained (YUAN et al. 2005). Custom Perl scripts were written to bin these transcription units between each RFLP marker pair. 
This established the number of non-transposable element candidate genes for each interval along with the genetic locations of these markers, and hence the following parameters were calculated for each RFLP marker pair: the genetic distance between each marker and the corresponding genes/cM recombination rate.
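The per-interval calculation described above can be sketched as below. This is not the authors' Perl code: the `Marker` record, the function name, and the rule of binning each transcription unit by its start coordinate are assumptions, and the toy coordinates in the usage lines are invented.

```python
from bisect import bisect_left
from dataclasses import dataclass

@dataclass(frozen=True)
class Marker:
    name: str
    cm: float  # genetic position on the reference map
    bp: int    # physical position on the pseudomolecule

def interval_rates(markers, gene_starts):
    """Return kb/cM and genes/cM for each adjacent marker pair.

    `markers` must be congruent (sorted the same way genetically and
    physically) and `gene_starts` must be a sorted list of coordinates.
    """
    rows = []
    for left, right in zip(markers, markers[1:]):
        d_cm = right.cm - left.cm
        d_kb = (right.bp - left.bp) / 1000.0
        # Genes whose start falls in [left.bp, right.bp) belong to this interval.
        n_genes = bisect_left(gene_starts, right.bp) - bisect_left(gene_starts, left.bp)
        if d_cm > 0:
            rows.append((left.name, right.name, d_kb / d_cm, n_genes / d_cm))
        else:
            rows.append((left.name, right.name, None, None))  # rate undefined
    return rows

# Invented toy data for one chromosome.
toy_markers = [Marker("M1", 0.0, 0), Marker("M2", 1.0, 300_000), Marker("M3", 1.5, 400_000)]
toy_genes = [10_000, 150_000, 299_999, 300_000, 350_000]
toy_rows = interval_rates(toy_markers, toy_genes)
print(toy_rows)
```

Intervals with zero genetic distance are reported with undefined rates rather than dropped, so that the caller can decide whether to merge them with a neighbor.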
Testing the predictive value of the Modified Durrett–Tanksley equation using R-map recombination frequencies:
Next we assigned each target allele to a physical location on the RGP physical map, which contains 1400 marker intervals. To accomplish this, each target allele was assigned a TIGR locus number (if cloned) or placed onto a BAC/PAC clone (if not cloned; TIGR Pseudomolecules Release 4.0); sometimes this information was published. In the remaining examples, the GenBank gene sequence or molecular marker information was used to screen the TIGR rice sequence database; the genetic map position, marker data, and BAC/PAC assignment helped to verify the physical assignment. The locus or BAC/PAC name and sequence were then used to assign each allele to an interval between two mapped markers on the RGP MRFM of rice (Table 2; supplemental Table 1 at http://www.genetics.org/supplemental/). The recombination frequency of the corresponding marker interval (R-map) was then employed; because we feared that chance crossovers might distort the recombination frequency in small intervals (<277 kb, 1-cM average) on this map, adjacent segments were sometimes added together (to achieve a >280-kb interval) before calculating an average R-map value, with the goal of situating the target allele at the physical center of the larger interval. In rare situations, an R-map value for an interval of <280 kb was accepted because adjacent intervals were unusually large. The choice to add or not add marker intervals was made blind to the R-local values in order to not bias R-map values. The R-map values were then applied to each equation.
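One way to automate the interval-merging step is sketched below. The published merging was done by inspection, so the greedy rule used here (extend toward whichever side leaves the target closer to the center until the span reaches 280 kb) is an assumption, as are the function name and the toy marker positions.

```python
from bisect import bisect_right

def neighborhood_r_map(marker_kb, marker_cm, target_kb, min_span_kb=280.0):
    """Return (R-map in kb/cM, merged span in kb) around a target position.

    marker_kb and marker_cm are parallel, sorted lists of marker positions.
    """
    if len(marker_kb) != len(marker_cm) or len(marker_kb) < 2:
        raise ValueError("need at least two markers with both coordinates")
    if not marker_kb[0] <= target_kb < marker_kb[-1]:
        raise ValueError("target lies outside the mapped region")
    hi = bisect_right(marker_kb, target_kb)  # first marker beyond the target
    lo = hi - 1
    while marker_kb[hi] - marker_kb[lo] < min_span_kb:
        can_left, can_right = lo > 0, hi < len(marker_kb) - 1
        if not (can_left or can_right):
            break  # chromosome exhausted; accept the shorter span
        left_arm = target_kb - marker_kb[lo]
        right_arm = marker_kb[hi] - target_kb
        # Extend the shorter arm so the target stays near the center.
        if can_left and (not can_right or left_arm <= right_arm):
            lo -= 1
        else:
            hi += 1
    span_kb = marker_kb[hi] - marker_kb[lo]
    span_cm = marker_cm[hi] - marker_cm[lo]
    if span_cm <= 0:
        raise ValueError("merged interval has no genetic length")
    return span_kb / span_cm, span_kb

# Invented toy map: the 150-kb interval holding the target is merged leftward.
r_map, span = neighborhood_r_map([0, 100, 250, 400, 900],
                                 [0.0, 0.5, 1.0, 1.2, 3.0], target_kb=300)
print(round(r_map, 1), span)
```

A rule of this kind does not capture the exception noted above, where a sub-280-kb interval was accepted because its neighbors were unusually large; that case would still need manual review.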
Calculation of R-avg values:
The genome-wide average recombination frequency in kilobases/cM was calculated by dividing the total genome size (~430 Mb) (IRGSP 2005) by the total genetic map length (~1521 cM) (HARUSHIMA et al. 1998); the average recombination frequency in genes/cM was calculated by dividing the total number of non-transposable element-encoded transcription units (~42,535) (YUAN et al. 2005) by the map length. The resulting genome-wide recombination frequency (R-avg) in rice is 277 kb/cM and 28 genes/cM.
Initial equations to predict mapping population size:
Initially, we employed two equations to predict the size of the fine-mapping population, one of which is developed here. First, we used the Durrett–Tanksley equation (DURRETT et al. 2002), which estimates the number of F2/post-F2 meiotic gametes required to positionally clone an allele as generated from an F1 heterozygote; it calculates the probability (P) that if a (proximal) crossover occurs in the vicinity of a target allele that a second (distal) crossover will be carried by a sibling gamete, such that the distance between the two crossovers will be the kilobase distance T (Figure 1A), for a prescribed number of genotyped gametes (N) (informative chromosomes) and for a given recombination frequency (R), according to the following equation:
$$P = 1 - \left(1 + \frac{NT}{100R}\right) e^{-NT/100R}$$
The primary assumption of the equation is that the progeny number will vary with the recombination frequency: the higher the frequency of recombination, the fewer progeny will be required to detect a crossover between the target allele and flanking molecular markers. See MATERIALS AND METHODS for additional details.
We then derived a second equation with the goal of making it more user-friendly for researchers. This equation was based on the following premise: if a crossover occurs in a segment (with length T) on the proximal side of a target allele in a large population of F2 progeny (N), then there is an equal probability that a sibling gamete will carry a crossover on the distal side within a distance of <T from the target allele as shown in Figure 1B. This simplifies the equation by only having to calculate the probability of a single crossover within the population, noting, however, that although on average any two crossovers will be distance T apart, they may range from zero to 2T (see MATERIALS AND METHODS for further details). The number of F2 testcross progeny required to be genotyped to detect sufficient crossovers to achieve a desired kilobase or gene block resolution is thus as follows:
$$N = \frac{\log(1 - P)}{\log\left(1 - \dfrac{T_{\text{marker}}}{100R}\right)}$$
where N is the number of meiotic gametes (chromosomes) that must be genotyped in which it can be determined whether a crossover is located proximal or distal to the target allele, P is threshold probability of success (e.g., 0.95), T-marker is expected distance between flanking molecular markers (kilobases or candidate genes), and R is local or genome-wide average recombination frequency (kb/cM or genes/cM).
Similar to the Durrett–Tanksley equation, this model assumes that the phenotype of the trait of interest can be readily scored to determine if a crossover occurred proximal or distal to the target allele; hence N is equivalent to the number of testcross progeny, 0.5 times the number of F2 (selfed) progeny (if no progeny testing performed), or two times the number of F2 (selfed) progeny (if F3 progeny testing is performed). The derivation of this equation is in the MATERIALS AND METHODS section.
Empirical gamete number, mapping resolution, and lessons from published studies in rice:
To validate the equations noted above, we first analyzed 41 published positional cloning/fine-mapping studies in rice, to extract or calculate N and T (Table 1) (see MATERIALS AND METHODS). We made several observations that might be useful to future research groups who wish to undertake positional cloning in rice. First, as in other species, in rice there was a wide range in the number of informative gametes (N) (potential recombinant chromosomes) that were genotyped to positionally clone target alleles: this ranged from only 416 gametes for the Pi-kh allele (SHARMA et al. 2005) to ~20,000 gametes for the alleles Gn1a (ASHIKARI et al. 2005), qSH1 (KONISHI et al. 2006), and Bph15 (YANG et al. 2004), an ~25-fold range. The average number of informative gametes genotyped was 5686; the median was 4200. The median target resolution (T) achieved was 44.5 kb or five genes. There were seven examples of single-gene resolution mapping (Table 1), and to achieve this resolution, the number of informative gametes employed ranged from 2800 to 26,000 (~10-fold range); the average was 11,593 gametes. Single gene resolution mapping in a smaller genome, A. thaliana, has been much rarer (DINKA and RAIZADA 2006). Several fine-mapping strategies were used successfully:
- Of 41 studies, 11 groups reported isolation of a quantitative trait locus (QTL); to reduce the effects of minor QTL and/or to be able to employ a background with well-characterized molecular markers, the target QTL was isolated by limited backcrossing (BC) or full introgression (near isogenic line, NIL) into a new genetic background. In other examples (e.g., qSH1) (KONISHI et al. 2006), the original QTL genome was used for mapping such that all but the target QTL was fixed (not segregating); to create heterozygosity in the region containing the target allele for mapping, a corresponding chromosome segment from a polymorphic genotype was crossed in [segment substitution line (SSL)] (Table 1).
- Because outcrosses/testcrosses are challenging in rice, most studies involved selfing progeny, which has the potential of carrying informative crossover events on both diploid chromosomes, thus potentially doubling the effective number of informative gametes (N). One of the challenges created by selfing, however, for recessive alleles, is that it is not possible to determine whether a crossover occurred proximal or distal to the target without checking for the segregation pattern (progeny testing, PT) in the subsequent generation (e.g., F3) to distinguish all genotype combinations (aa, Aa, AA) at the target locus. Six groups progeny-tested to check the recessive genotype (e.g., chl1) (H. T. ZHANG et al. 2006). Alternatively, to avoid F3 generation phenotyping, 15 groups (e.g., bc1) (Y. LI et al. 2003) preselected recessive (mutant) progeny by phenotyping and then only genotyped this subset, thus discarding 75% of all progeny.
- There were 12 fully dominant alleles targeted; in these cases, as in recessive alleles, because the proximal vs. distal location of flanking crossovers could not be distinguished without distinguishing AA from Aa genotypes, researchers either progeny-tested in the subsequent generation (e.g., Pi-kh) (SHARMA et al. 2005) or, cleverly, preselected only the recessive progeny class for genotyping (e.g., Xa1) (YOSHIMURA et al. 1998).
- Finally, there were four examples [f5-DU, Rf-1, S32(t), S5n] where the target alleles were expressed in the haploid generation (e.g., pollen grain, embryo sac) and where the nature of the gene products often required generating outcross/testcross progeny for mapping. In the case of f5-DU (WANG et al. 2006), an allele that boosts pollen viability in specific hybrid genotypes, testcross progeny were used for mapping, since phenotyping required a hybrid background to check for segregation of viable pollen grains (either high or low). Similarly, to fine map the S5n locus (QIU et al. 2005), which confers embryo sac viability to wide-cross hybrids, 8000 hybrids were generated by outcrossing a heterozygous NIL S5n/– parent (NIL F1) to a wide-cross tester; phenotyping was performed by measuring segregation of fertility of F2 embryo sacs on hybrid rice spikelets. In the case of S32(t) (LI et al. 2007), which also confers (post-meiotic, haploid) embryo sac viability, the segregation of embryo sac viability was measured in the spikelets of selfed F2 plants. Finally, in the case of Rf-1, a nuclear locus that restores male gamete (pollen) fertility by overcoming the effects of a mitochondrial [cytoplasmic male sterility (CMS)] gene, 5145 testcross F2 progeny (three-way cross: heterozygote restorer x non-restorer tester) were generated for mapping and the segregation of pollen viability scored (KOMORI et al. 2003, 2004).
Lessons from calculating empirical local recombination frequencies (R-local) and their use in validating predictive equations:
To both validate the equations noted in this study and later understand any discrepancies between the experimental data and predictions based on the molecular marker map, we then calculated the experimental (local) recombination frequency (R-local) for each of the 41 successful fine-mapping studies in rice (see MATERIALS AND METHODS) (Table 2). From each study, we counted the number of crossovers located between the closest two markers used to define the final map resolution (T); these are the first recombinants used to define the edges of the candidate target region. Although we expected to find only 1 crossover on each distal or proximal flank (2 total), in 32 of 41 examples we found between 3 and 16 total crossovers, due to hotspots of recombination and/or poor marker density; such redundant crossover targets suggested that an excess number of progeny were genotyped given the available marker density in the majority of rice positional cloning attempts, an important observation.
Since a high density of molecular markers and large progeny numbers are used in positional cloning, the R-local values provide an interesting snapshot into the variation in recombination frequency in the rice genome: we found that though the genome-wide average R was 277 kb/cM or 28.0 genes/cM in rice, locally, R-values ranged from 3.3 to 1344.2 genes/cM or 28.2 to 14,718 kb/cM, an ~400-fold and ~500-fold range, respectively. Strongly influenced by chance, such a wide range in recombination frequencies would largely explain the wide range in the number of progeny that were genotyped in rice (Table 1). The most hyper-recombinogenic region (3.3 genes/cM, 28.2 kb/cM) flanked the Pi36(t) allele (LIU et al. 2005), which required only 1160 informative gametes to achieve a map resolution of 17 kb or two candidate genes. The region with the least amount of recombination (1344.2 genes/cM or 14,718 kb/cM) encompassed the chl9 allele; in this study, although 4906 informative chromosomes were genotyped, the map resolution was 1500 kb or 137 genes (H. T. ZHANG et al. 2006). These two groups define the extremes of good and bad "luck," respectively, in rice, and as such may set upper and lower map-population-size boundaries for future positional cloning attempts in this important species.
We then compared the empirical number of gametes that were genotyped (N) in each study to the number predicted by both equations (see above) given only the variables T and R-local; this allowed us to first test the validity of the equations in rice and to modify the equations if necessary. The size of the mapping population (informative chromosomes) (N) predicted by the Durrett–Tanksley equation compared to the empirical data, for given T and R-local values (in kb/cM), is shown in Figure 2A; we found a strong positive correlation between the mapping size predicted by the Durrett–Tanksley equation and the experimental results (Spearman r = 0.85, P < 0.0001, n = 41). In at least 10 examples (10/41), however, in spite of using the actual recombination frequencies, we found that the Durrett–Tanksley equation overestimated the mapping population by at least twofold, which would have caused researchers to unnecessarily genotype thousands of extra progeny. The simpler, Single Crossover model appeared to be a slightly better predictor of the progeny mapping population size as shown in Figure 2B. Although this second equation predicted the mapping population N with a near-equivalent correlation as the Durrett–Tanksley equation (Spearman r = 0.86; P < 0.0001; n = 41), linear regression analysis of the two models (Figure 3, A and B) demonstrated that the single crossover equation came closer to a linear slope of m = 1 on an x–y scatter plot of predicted vs. experimental N values; in the case of the Durrett–Tanksley model, the best-fit line followed the equation y = 1.70x – 1323 (goodness of fit r2 = 0.76, Sy.x = 5456), whereas for the single crossover equation, the best-fit line was y = 1.07x – 833 (r2 = 0.76, Sy.x = 3426). 
Although the Single Crossover equation performed slightly better than the Durrett–Tanksley equation, these results demonstrate for the first time that both simple formulas, if based on accurate local recombination frequency values, can provide significant guidance in predicting the mapping population size for the majority of alleles targeted for positional cloning.
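The comparison described above rests on two statistics: a Spearman rank correlation and a least-squares best-fit line on a plot of predicted vs. empirical N. The following is a minimal sketch of both calculations, not the authors' analysis. The data values are hypothetical (they are not taken from Table 1 or Table 2), and the choice of predicted N as the y-axis is an assumption.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman r is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def best_fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# Hypothetical values for illustration only (not from the published tables).
empirical_n = [500, 1200, 2500, 4000, 8000]
predicted_n = [700, 1100, 3900, 3500, 12000]

print("Spearman r:", spearman(empirical_n, predicted_n))
print("slope, intercept:", best_fit_line(empirical_n, predicted_n))
```

A slope near 1 with a small intercept would indicate that the predictions track the empirical values without systematic over- or underestimation, which is the criterion used above to compare the two equations.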
Fine-tuning of the equations based on empirical studies:
We then wondered if we could fine-tune both predictive models. We noticed that the Durrett–Tanksley equation overestimated the number of progeny needed when the experimental number of crossovers found in distance T was low (<5 total); when the number of crossovers found was high (>5), this equation underestimated the number of progeny required (Figure 2A; Table 2). In the latter cases, it appeared as if T was limited by the local density of molecular markers; given this low density, the published studies appear to have "over-genotyped" the progeny population. Restated, when many crossovers were found within the interval T (final map resolution), the actual candidate distance (in kilobases) might have been smaller (higher map resolution) had more molecular markers been available in the vicinity. By plotting the ratio Nmodel/Nempirical against the total number of crossovers within T (the sum of the two crossover counts listed in Table 2) on a scatter plot, we found an inverse power relationship between the two variables such that Nmodel/Nempirical = 4.744/(total number of crossovers). Therefore, we adjusted T by multiplying it by 4.744/(total number of crossovers in this region). Accordingly, we also redefined T as T-marker to note that marker density often rate-limits the physical resolution. The resulting modified Durrett–Tanksley equation is
[Equation image not recovered in this text version.]
Here the crossover term is the number of crossovers between the closest two molecular markers (at least two). This is a rewritten version of the standard map distance calculation, m = 100 × recombinants/progeny for a testcross, assuming no double crossovers (HALDANE 1919).
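The standard map distance calculation cited above can be rearranged to estimate how many gametes are needed to recover a chosen number of crossovers within a target interval. The sketch below illustrates that rearrangement only. It is not necessarily the exact published equation, which may include further terms such as the meiosis factor f. The function names are hypothetical, and the crossover count of 5 in the example is an assumed value, not one reported in the text.

```python
def map_distance_cm(recombinants, gametes):
    """Standard testcross map distance in cM, assuming no double crossovers."""
    return 100.0 * recombinants / gametes

def gametes_needed(crossovers, r_kb_per_cm, t_kb):
    """Rearranged map distance: gametes required to expect `crossovers`
    recombinants within an interval of t_kb, given a recombination
    frequency expressed as kb/cM."""
    target_cm = t_kb / r_kb_per_cm      # interval size in cM
    return 100.0 * crossovers / target_cm

# Values quoted in the text for the chl9 region: 14,718 kb/cM and a final
# resolution of 1500 kb. The crossover count (5) is an assumption.
n = gametes_needed(crossovers=5, r_kb_per_cm=14718, t_kb=1500)
print(round(n))
```

Under these assumptions the estimate comes to about 4906 gametes, the same order as the number of informative chromosomes reported for that study, which shows how strongly a low local recombination frequency inflates the required population.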
We then compared the predictions of the modified Durrett–Tanksley equation, using R-local values (Table 2), to the published mapping population sizes (N); as shown in Figure 3C, the modified equation was 100% predictive (y = 1.0x, r2 = 1.0, F = 0). Using a similar approach, we also modified the Single Crossover equation. By plotting the ratio Nmodel/Nempirical against the total number of crossovers within T (the sum of the two crossover counts listed in Table 2) on a scatter plot, we found an inverse power relationship between the two variables such that Nmodel/Nempirical ≈ 3/(total number of crossovers). Therefore, we modified the genetic map resolution T by the number of crossovers, resulting in the following modified Single Crossover equation:
[Equation image not recovered in this text version.]
As shown in Figure 3D, again the modified equation was close to 100% predictive of the empirical results (y = 1.0x – 1.5, r2 = 1.0).
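The fine-tuning step above amounts to fitting an inverse power relationship between the ratio Nmodel/Nempirical and the crossover count, then rescaling T by the fitted factor. A minimal sketch of such a fit on log-log axes follows. The data points are synthetic, generated from the reported coefficient 4.744 purely to show that the fit recovers it, and the function names are hypothetical.

```python
import math

def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b on log-log axes; returns (a, b)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b

def adjusted_resolution(t, crossovers, a=4.744):
    """Rescale the map resolution T by a / crossovers, as described in the text
    for the Durrett-Tanksley equation (a = 4.744); pass a = 3 for the
    Single Crossover variant."""
    return t * a / crossovers

# Synthetic ratios that follow 4.744 / crossovers exactly (illustration only).
crossover_counts = [2, 3, 5, 8, 12]
ratios = [4.744 / c for c in crossover_counts]

a, b = fit_power_law(crossover_counts, ratios)
print("a =", round(a, 3), "b =", round(b, 3))
```

An exponent b close to -1 corresponds to the inverse relationship reported above. With real data the fitted values would carry scatter, so the quality of the fit should be checked before the factor is applied.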
These modified equations offer some advantages for researchers: they define probability explicitly as the number of crossovers (informative gametes) that a researcher can expect to achieve for a given progeny population. A researcher is taking more of a risk if the goal is to achieve only two informative gametes, each carrying a crossover on either side of the target allele (a total crossover count of 2), than if the target is five informative gametes. These equations also make it explicit that the density of available molecular markers in the target region is critical: if there are few available molecular markers, a researcher does not achieve better resolution by increasing the number of progeny genotyped (N) beyond a certain threshold. We suggest that users of these equations who wish to predict N should select T based on a realistic density of achievable molecular markers in the vicinity of the target allele, and adjust the target crossover count according to their own risk assessment. For example, if obtaining only two informative recombinant gametes is too risky, N should be increased.
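One way to make this risk assessment concrete is to ask how likely it is that a population of N gametes yields at least a chosen number of recombinants in the target interval. The sketch below uses a plain binomial model for this. That model is an illustrative simplification introduced here, not a method from the paper: it treats each gamete as an independent trial and ignores the requirement that crossovers fall on both sides of the target allele. The per-gamete probability and population sizes are hypothetical.

```python
from math import comb

def prob_at_least(k, n_gametes, p):
    """Binomial probability of observing at least k recombinant gametes
    among n_gametes, each recombinant in the interval with probability p."""
    below = sum(comb(n_gametes, i) * p ** i * (1 - p) ** (n_gametes - i)
                for i in range(k))
    return 1.0 - below

# Hypothetical interval of 0.1 cM, i.e. p = 0.001 per gamete.
p = 0.001
for n in (2000, 5000):
    print(n, "gametes: P(at least 2 recombinants) =",
          round(prob_at_least(2, n, p), 3))
```

A population sized so that only two recombinants are expected leaves a substantial chance of recovering fewer than two, whereas sizing for five expected recombinants makes that outcome much less likely. This is the trade-off between genotyping effort and risk described above.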
Predictive value of the equations using recombination frequencies derived from a MRFM:
In the analysis above, we validated both Durrett–Tanksley equations and the Single Crossover equations using published high-resolution, local recombination frequencies (R-local) derived from already fine-mapped alleles. Our goal was to predict the progeny mapping population (N informative gametes) in advance, however, whereas R-local data is not available until the conclusion of a positional cloning attempt. Previous a priori mapping population estimates only used the genome-wide average recombination frequency (R-avg) (DURRETT et al. 2002), but as we have confirmed (Table 2) and as many others have noted (WU et al. 2003; CRAWFORD et al. 2004; MCVEAN et al. 2004), recombination frequencies vary tremendously along any chromosome. Therefore, we wondered if we could more accurately predict N in advance by employing regional meiotic recombination frequencies from a high-density molecular marker map (R-map). To accomplish this, we first developed a MRFM for 1400 marker intervals in rice, based on the Rice Genome Project (RGP) F2 [Nipponbare (Japonica) x Kasalath (Indica)] RFLP map (HARUSHIMA et al. 1998). Mean R-map values were 33.5 genes/cM and 294 kb/cM, similar to calculations of the whole-genome average recombination frequency (R-avg) for rice (28 genes/cM and 277 kb/cM). The entire R-map data set is located in supplemental Table 1 (http://www.genetics.org/supplemental/) and it should serve as a useful reference for future positional cloning studies in rice.
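The construction of a recombination frequency map from a marker map, as described above, reduces to dividing the physical distance between adjacent markers by the genetic distance between them. The following sketch shows that calculation for kb/cM. The marker names and positions are hypothetical and are not drawn from the RGP map or supplemental Table 1. Treating intervals with zero genetic distance as undefined is an assumption of this sketch.

```python
def recombination_map(markers):
    """Compute kb/cM for each interval between adjacent markers.

    markers: list of (name, genetic_cM, physical_kb), sorted along one
    chromosome. Returns (left, right, delta_kb, delta_cM, kb_per_cM)
    tuples; kb_per_cM is None when no recombination separates the markers.
    """
    intervals = []
    for (n1, cm1, kb1), (n2, cm2, kb2) in zip(markers, markers[1:]):
        d_cm = cm2 - cm1
        d_kb = kb2 - kb1
        rate = d_kb / d_cm if d_cm > 0 else None
        intervals.append((n1, n2, d_kb, d_cm, rate))
    return intervals

# Hypothetical markers for illustration only.
markers = [
    ("m1", 0.0, 0),
    ("m2", 1.0, 300),
    ("m3", 1.0, 450),   # co-segregates with m2: no crossovers observed
    ("m4", 3.0, 950),
]

intervals = recombination_map(markers)
for left, right, d_kb, d_cm, rate in intervals:
    print(left, right, d_kb, "kb", d_cm, "cM", rate)
```

The same loop would give genes/cM if the number of annotated genes in each interval were substituted for the physical distance.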
Next, in silico, we mapped each cloned allele onto a physical and genetic interval on this map as shown in Table 2 (see MATERIALS AND METHODS). We then used the corresponding "neighborhood" recombination frequencies (R-map) to calculate mapping population sizes (N). As shown in Figure 4, we found a modest but significant improvement in predicting the number of informative gametes (N) required to be genotyped when recombination frequencies (calculated as kilobases/cM) were based on rice RGP R-map values; as we suspected, there was no significant correlation between the empirical mapping size (N) and the mapping sizes predicted by either of the two (unmodified) equations when the R-avg value was used (Spearman r = 0.30, P = 0.0547, n = 41) (Figure 4, A and D). In contrast, the correlation was significant when R-map values were used (Spearman r = 0.46, P = 0.0022, n = 41) (Figure 4, B and E) and this correlation increased even further when several outliers were removed (Spearman r = 0.61, P < 0.0001, n = 36) (Figure 4, C and F). Surprisingly, however, the correlation did not improve further when the modified equations, which take into account the number of immediate crossovers, were used (for R-map, Spearman r = 0.35, P = 0.0232, considered significant); nevertheless, the correlation was still an improvement over that obtained when the R-avg value was used in conjunction with the modified equations (Spearman r = 0.21, P = 0.19, n = 41, not significant; data not shown). We conclude that mapping size predictions based on neighborhood (>280-kb segments) recombination frequencies (in kilobases/cM) better predict the number of progeny required to be genotyped to positionally clone a gene than predictions based on the genome-wide average recombination frequency.
The effect of using R-map recombination frequencies calculated as kb/cM vs. genes/cM:
Although use of R-map values better predicted the size of the progeny mapping population compared to the genome-wide average recombination frequency, we were disappointed that the improvement was not more significant. In order to understand the reason, we asked to what extent R-map values calculated as kilobases/cM (from the rice RGP 1400-marker map) in fact correlated with the R-local values that we extracted from the 41 published studies. As shown in Figure 5A, the correlation was in fact poor (Spearman r = 0.23, P = 0.1428, considered not significant); of course, there was no correlation when R-local was compared to R-avg, so the R-map (kb/cM) values were still useful.
However, we then asked whether the correlation improved when R-map was calculated as genes/cM instead of kb/cM. Limited evidence (FU et al. 2001) suggested that the crossovers contributing to R-map values might primarily be occurring in and around genes. In fact, as shown in Figure 5B, we found a significantly improved correlation between R-map values calculated as genes/cM to R-local values also calculated as genes/cM (Spearman r = 0.48, P = 0.0016).
Therefore, we retested whether we could better predict progeny mapping population sizes (N) when using rice RGP R-map values calculated as genes/cM rather than kilobases/cM. Using R-map (genes/cM) calculations shown in Table 2, Figure 6 demonstrates that indeed the map population (N) predicted by both the (unmodified) Durrett–Tanksley equation and the (unmodified) Single-Crossover equation based on R-map (genes/cM) values better predicted the published results over the genome-wide R-avg (28 genes/cM) or R-map values based on kb/cM (Figure 6 vs. Figure 4). In fact, with three outliers removed, the correlation between the progeny size predictions based on R-map vs. the published data was extremely significant (Spearman r = 0.67, P < 0.0001, n = 38) (Figure 6, C and F). Although the predictions did not improve further when the modified equations were used (for R-map, Spearman r = 0.38, P = 0.0151, considered significant), the predictions were significantly better than when the R-avg value was used in conjunction with the modified equations (Spearman r = 0.05, P = 0.7662, n = 41, not significant; data not shown). We conclude that mapping size predictions based on neighborhood (>280-kb segments) recombination frequencies (R-map) better predict the number of progeny required to be genotyped for positional gene cloning in rice when R-values are calculated as genes/cM rather than kilobases/cM, and both are significant improvements over calculations based on the genome-wide R-avg.
The limiting factor is that R-map values often do not reflect R-local frequencies, but when they do the progeny mapping size can be accurately predicted:
As calculated in Table 2 and shown in Figure 7A, the limiting factor is that the neighborhood recombination frequency often does not reflect the local recombination frequency, even though it is more reflective of local rates of recombination than the genome-wide average. The situation may or may not be better for other maps in other species, particularly as more robust, higher-resolution maps are constructed. Indeed, the rice map gave us hope for the future: in spite of the problems with our use of this map (see DISCUSSION), as shown in Figure 7A, we found 11 examples where the R-map value (calculated as genes/cM) differed by <30% from the corresponding R-local value. These corresponded to the following loci: f5-DU, spl11, gl-3, pla1, hd1, moc1, S32(t), bel, dl1, fon4, and Pi-d2. When the mapping population size (N) was calculated for only these 11 alleles, shown in Figure 7, B–E, linear regression analysis showed that both the modified Durrett–Tanksley equation and the modified Single Crossover equation very accurately predicted the mapping population size (N) using recombination frequency (R-map) values from the RGP map: the best-fit lines were linear (m = 1.2) and the predictions matched the best-fit lines with very high r2 values (0.95–0.98). Similar results were obtained for 10 examples where R-map values, calculated as kb/cM, were used; in that case, the predictions also matched the best-fit line with an r2 value of 0.98 (best-fit line y = 0.8x – 590; data not shown).
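Selecting the loci whose neighborhood and local recombination frequencies agree is a simple filtering step. The sketch below shows one way to express it. The locus names and rates are hypothetical, and measuring the difference relative to R-local is an assumption, since the text does not state which value serves as the denominator.

```python
def percent_difference(r_map, r_local):
    """Absolute difference between R-map and R-local, as a percentage
    of R-local (the choice of denominator is an assumption)."""
    return abs(r_map - r_local) / r_local * 100.0

def concordant_loci(rates, threshold=30.0):
    """Return loci whose R-map value lies within `threshold` percent
    of the corresponding R-local value."""
    return [locus for locus, (r_map, r_local) in rates.items()
            if percent_difference(r_map, r_local) < threshold]

# Hypothetical (R-map, R-local) pairs in genes/cM, for illustration only.
hypothetical_rates = {
    "locusA": (30.0, 28.0),
    "locusB": (10.0, 40.0),
    "locusC": (55.0, 50.0),
}

print(concordant_loci(hypothetical_rates))
```

Restricting the regression to loci that pass such a filter shows how well the equations perform when the reference map is informative, at the cost of a much smaller sample.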
The utility of our approach was best demonstrated by comparing the data for bel (PAN et al. 2006) vs. Pi-d2 (CHEN et al. 2006); empirically, only 462 informative gametes (N) were genotyped to fine map bel to a map resolution (T) of 18 genes; in contrast, 8000 informative gametes were required to fine map Pi-d2 to a map resolution of 33 genes. The RGP map correctly predicted that the recombination frequency (R-local) flanking Pi-d2 was ~20-fold lower than that flanking bel. As a result, both modified equations would have predicted in advance that mapping bel to this resolution would require ~360 gametes, and that Pi-d2 would require ~10,000 gametes. If such accurate predictions could be made across the majority of target loci in the future, then researchers will be able to generate appropriately sized map populations and properly allocate human, growth room, and financial resources.
) was again a useful equation modifier (Figures 2 and 3). With the equations validated, and because researchers do not have the luxury of robust recombination frequencies in the vicinity of their target allele in advance, we measured whether recombination frequencies derived from a 1400-marker reference genetic map (supplemental Table 1 at http://www.genetics.org/supplemental/) could be useful, and indeed the map population size was more accurately predicted when these values were used instead of the genome-wide average recombination frequency (Figures 4 and 6). Since researchers targeting a fully sequenced genome care more about how many candidate genes they must distinguish than about the number of kilobases per se, we also determined that the models could predict gene resolution as well as or better than kilobase resolution (Figures 5 and 6). Although the rice map, in conjunction with our formulas, could have accurately predicted the unusually large or small mapping populations required for several target alleles, including alleles located near centromeres suffering from suppressed meiotic recombination (e.g., chl9, Pi-d2, and Bph15), we found that the limiting factor was the correlation between R-map and R-local recombination frequencies (Table 2, Figure 7).
Understanding R-map vs. R-local discrepancies:
There are likely several reasons why recombination frequencies from a reference genetic map (R-map) in rice often did not match the frequency in the vicinity of target alleles (R-local), and these are important lessons for future attempts to predict mapping population size. First and most obvious, even within a >280-kb interval (~1 cM average), the rice RGP map demonstrated that the meiotic recombination frequency could vary significantly (WU et al. 2003) (supplemental Table 1 at http://www.genetics.org/supplemental/). Second, as is the case with many whole-genome genetic maps, only small numbers of progeny (typically 100–200) were genotyped to generate the RGP map (HARUSHIMA et al. 1998); as a result, the location of rare crossovers was more subject to chance. In other words, had the RGP map been generated multiple times using independent populations, the recombination frequencies would likely have varied significantly within 1–2-cM intervals. Third, whereas the RGP map was based on two parental genotypes, the rice Indica variety (Kasalath) and the Japonica variety (Nipponbare) (HARUSHIMA et al. 1998), only 8 of the 41 studies that we compared our models to also used these genotypes to generate their mapping populations. Differences between genotypes, such as the density of repetitive DNA or local cytogenetic rearrangements as seen in maize (BENNETZEN and RAMAKRISHNA 2002; WANG and DOONER 2006), might have caused R-map values from the RGP map to differ from the published studies. Indeed, it has been shown that domesticated rice cultivars have an unusually high rate of ongoing gene duplications, vary considerably in the location and density of repetitive DNA (e.g., retroelements), and have very high rates of intergenic nucleotide polymorphisms (SNPs, indels), perhaps in part due to human selection in geographically isolated locations (GARRIS et al. 2005; YU et al. 2005; TANG et al. 2006). Finally, the RGP map was generated using F2 selfed progeny, whereas the mapping populations used in the 41 published studies were generated by diverse methods, including the use of NILs, chromosome SSLs, and recombinant inbred lines (RILs), and at at least one locus with low recombination rates, fon4-1, an ~200-kb chromosome deletion was involved (H. W. CHU et al. 2006). It has been shown that when two chromatids differ in their relatedness to one another, as in RILs vs. NILs, the local recombination frequency may be affected (BURR and BURR 1991; LUKACSOVICH and WALDMAN 1999; LI et al. 2006); in the most extreme case, unequal deletions between chromatids, suppression of meiotic recombination has long been observed (RIESEBERG 2001). All of these factors might have contributed to our observation that R-map values from the rice RGP map often did not match recombination frequencies in the vicinity of target alleles.
Applying these results:
We recommend that researchers undertaking positional cloning rely on the R-map strategy only when they have access to a reference genetic map that has been demonstrated to have a strong correlation between R-map values and R-local values. To make this possible, higher resolution maps, with more markers, must be generated and/or employed to account for sub-centimorgan R variation. In potato, a genetic map with 10,000 markers was recently constructed (VAN OS et al. 2006), demonstrating progress in this area. Such high-resolution maps will provide researchers with a range of recombination frequencies across a 1–2-cM interval, and thus, at best, researchers could expect to predict an upper and lower range of N, not the precise number. To improve the robustness (reproducibility) of R-map frequencies, genetic maps must be generated based on sampling hundreds to thousands of progeny rather than only 100–200 individuals (FERREIRA et al. 2006). To make reference map frequencies relevant to the genotypic targets of positional cloning, maps must be constructed from more parental genotype pairs. In addition, for some species, the number of informative gametes (N) might need to be adjusted to account for male vs. female differences in recombination frequency (LENORMAND and DUTHEIL 2005) by adjusting the meiosis factor (f) (see MATERIALS AND METHODS). As to whether R-map values based on genes/cM or kilobases/cM should be used, we had assumed that the genes/cM ratio would be less variable than the kb/cM ratio, given that meiotic recombination in plant genomes has been shown to be highly biased to gene regions rather than flanking heterochromatin (FU et al. 2001); in other words, if most recombination occurs within or flanking genes, then as the number of genes in an interval increased, the frequency of crossovers would also increase in proportion, keeping the genes/cM ratio constant.
However, in retrospect, two pieces of data now suggest that this assumption was incorrect. First, in the meiotic recombination frequency calculations we made on the RGP rice map, we found that the genes/cM ratio varied within the genome nearly as much as the kb/cM ratio; the coefficient of variation for R (genes/cM) was 98% across the rice genome (n = 971) compared to 113% for R (kb/cM) (n = 952). Second, if recombination was biased to within or near genes, then the recombination frequencies from positional cloning studies (R-local) would be predicted to be higher than the genome-wide average for rice (R-avg = 277 kb/cM); in fact, out of the 41 published studies, 20 studies had an R-local value below R-avg with 20 above the R-avg, suggesting no bias in recombination near genes (Table 2). It is therefore possible that the stronger correlation we found for the RGP map between R-map and R-local, when calculated as genes/cM, was random, but this should be tested for more maps and for more species. Indeed, it will be interesting to test the predictions of this paper in both larger and more compact genomes.
As more robust, higher-resolution maps across more parental genotypes become available, our hope is that the methodology we have described here will generate accurate mapping population size graphs that predict a range of N-values for a given target allele. We conclude by showing an example of such a map in Figure 8, representing our predictions for rice chromosome 3. In spite of the challenges noted, this map did accurately predict the very different mapping population sizes required for the five alleles shown.
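The coefficient of variation used above to compare the variability of genes/cM and kb/cM is the standard deviation expressed as a percentage of the mean. A minimal sketch follows. The interval values are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the text does not say which was used.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100.0

# Hypothetical per-interval recombination frequencies (illustration only).
genes_per_cm = [10.0, 20.0, 30.0]
kb_per_cm = [100.0, 300.0, 500.0]

print("CV genes/cM:", coefficient_of_variation(genes_per_cm))
print("CV kb/cM:", coefficient_of_variation(kb_per_cm))
```

Because the coefficient of variation is unitless, it allows the spread of the two measures to be compared directly even though their absolute scales differ.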
ASHIKARI, M., H. SAKAKIBARA, S. Y. LIN, T. YAMAMOTO, T. TAKASHI et al., 2005 Cytokinin oxidase regulates rice grain production. Science 309: 741–745.
BALLINGER, D. G., and S. BENZER, 1989 Targeted gene mutations in Drosophila. Proc. Natl. Acad. Sci. USA 86: 9402–9406.
BENNETZEN, J. L., and W. RAMAKRISHNA, 2002 Exceptional haplotype variation in maize. Proc. Natl. Acad. Sci. USA 99: 9093–9095.
BOTSTEIN, D., R. L. WHITE, M. SKOLNICK and R. W. DAVIS, 1980 Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am. J. Hum. Genet. 32: 314–331.
BURR, B., and F. A. BURR, 1991 Recombinant inbreds for molecular mapping in maize: theoretical and practical considerations. Trends Genet. 7: 55–60.
CHEN, X. W., J. J. SHANG, D. X. CHEN, C. L. LEI, Y. ZOU et al., 2006 A B-lectin receptor kinase gene conferring rice blast resistance. Plant J. 46: 794–804.
CHU, H. W., Q. QIAN, W. Q. LIANG, C. S. YIN, H. X. TAN et al., 2006 The floral organ number4 gene encoding a putative ortholog of Arabidopsis CLAVATA3 regulates apical meristem size in rice. Plant Physiol. 142: 1039–1052.
CHU, Z. H., B. Y. FU, H. YANG, C. G. XU, Z. K. LI et al., 2006 Targeting xa13, a recessive gene for bacterial blight resistance in rice. Theoret. Appl. Genet. 112: 455–461.
CRAWFORD, D. C., T. BHANGALE, N. LI, G. HELLENTHAL, M. J. RIEDER et al., 2004 Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet. 36: 700–706.
DINKA, S. J., and M. N. RAIZADA, 2006 Inexpensive fine mapping and positional cloning in plants using visible, mapped transgenes. Can. J. Bot. 84: 179–188.
DURRETT, R. T., K. Y. CHEN and S. D. TANKSLEY, 2002 A simple formula useful for positional cloning. Genetics 160: 353–355.
FERREIRA, A., M. F. DA SILVA, L. SILVA and C. D. CRUZ, 2006 Estimating the effects of population size and type on the accuracy of genetic maps. Genet. Molec. Biol. 29: 187–192.
FU, H. H., W. K. PARK, X. H. YAN, Z. W. ZHENG, B. Z. SHEN et al., 2001 The highly recombinogenic bz locus lies in an unusually gene-rich region of the maize genome. Proc. Natl. Acad. Sci. USA 98: 8903–8908.
GARRIS, A. J., T. H. TAI, J. COBORN, S. KRESOVICH and S. R. MCCOUCH, 2005 Genetic structure and diversity in Oryza sativa L. Genetics 169: 1631–1638.
HAGA, K., M. TAKANO, R. NEUMANN and M. IINO, 2005 The rice coleoptile phototropism gene encoding an ortholog of Arabidopsis NPH3 is required for phototropism of coleoptiles and lateral translocation of aux
















