Cladistic Structure Within the Human Lipoprotein Lipase Gene and Its Implications for Phenotypic Association Studies
Alan R. Templeton a, Kenneth M. Weiss b,c, Deborah A. Nickerson d, Eric Boerwinkle e, and Charles F. Sing f
a Department of Biology, Washington University, St. Louis, Missouri 63130-4899,
b Institute of Molecular Evolutionary Genetics, Department of Biology, Pennsylvania State University, University Park, Pennsylvania 16802,
c Department of Anthropology, Pennsylvania State University, University Park, Pennsylvania 16802,
d Department of Molecular Biotechnology, University of Washington, Seattle, Washington 98195-7730,
e Human Genetics Center, University of Texas Health Science Center, Houston, Texas 77225-0334
f Department of Human Genetics, University of Michigan Medical School, Ann Arbor, Michigan 48109-0618
Corresponding author: Alan R. Templeton, Department of Biology, Campus Box 1137, Washington University, St. Louis, MO 63130-4899. E-mail: temple_a@biology.wustl.edu
| ABSTRACT |
|---|
Haplotype variation in 9.7 kb of genomic DNA sequence from the human lipoprotein lipase (LPL) gene was scored in three populations: African-Americans from Jackson, Mississippi (24 individuals), Finns from North Karelia, Finland (24), and non-Hispanic whites from Rochester, Minnesota (23). Earlier analyses had indicated that recombination was common but concentrated into a hotspot and that recurrent mutations at multiple sites may have occurred. We show that much evolutionary structure exists in the haplotype variation on either side of the recombinational hotspot. By peeling off significant recombination events from a tree estimated under the null hypothesis of no recombination, we also reveal some cladistic structure not disrupted by recombination during the time to coalescence of this variation. Additional cladistic structure is estimated to have emerged after recombination. Many apparent multiple mutational events at sites still remain after removing the effects of the detected recombination/gene conversion events. These apparent multiple events are found primarily at sites identified as highly mutable by previous studies, strengthening the conclusion that they are true multiple events. This analysis portrays the complexity of the interplay among many recombinational and mutational events that would be needed to explain the patterns of haplotype diversity in this gene. The cladistic structure in this region is used to identify four to six single-nucleotide polymorphisms (SNPs) that would provide disequilibrium coverage over much of this region. These sites may be useful in identifying phenotypic associations with variable sites in this gene. Evolutionary considerations also imply that the SNPs in the 3' region should have general utility in most human populations, but the 5' SNPs may be more population specific. Choosing SNPs at random would generally not provide adequate disequilibrium coverage of the sequenced region.
CORONARY artery disease (CAD) is the major cause of death in many countries, and several genetic and nongenetic risk factors have been identified (![]()
The first goal of this article is to show that there is indeed considerable cladistic structure in the 5' and 3' portions of this 9.7-kb segment that flank the region of high recombination identified by ![]()
A third goal of this article is to use the subregional, nonrecombinant, and postrecombinant trees to examine the possible role of multiple mutational events at highly mutable sites as a source of homoplasy. The fourth goal of this article is to show how the results of such an evolutionary/recombinational analysis can be applied to the problem of genetic/phenotypic associations. A final goal is to show how analyses of the evolutionary history, amount of recombination, and patterns of mutation can be used to identify a small number of single nucleotide polymorphisms (SNPs) that could be used in phenotypic association studies through linkage disequilibrium.
| MATERIALS AND METHODS |
|---|
Population samples:
We use the data of ![]()
DNA sequencing:
DNA sequencing was performed on diploid genotypes as described in ![]()
Haplotype determination:
Haplotypes were determined by a mixture of allele-specific PCR (AS-PCR) and the haplotype subtraction algorithm of ![]()
Inference of cladistic structure when recombination is present:
Fig 1 presents the overall flowchart for inferring cladistic structure used in this article. Fig 1 also illustrates our inference procedure with a fictional data set that makes the steps easier to follow than with the much more complicated LPL data set. Step 1 in Fig 1 starts with known haplotypes, just as we start with the LPL haplotypes given in ![]()
The second step is to estimate the statistical parsimony (SP) tree for the entire DNA region under the null hypothesis of no recombination or gene conversion. Statistical parsimony was specifically designed for estimating intraspecific haplotype trees (![]()
Step 3 in Fig 1 is the estimation and testing of recombination and gene conversion events with the haplotypes given in step 1 through the use of the algorithm given in ![]()
In step 4 we remove the impact of recombination upon the cladistic structure. Two different approaches are used, indicated by 4a and 4b in Fig 1. We proceed to step 4a in our inference procedure only if the recombination encountered in step 3 is rare and/or concentrated into hotspots, as is true for LPL. Step 4a is to subdivide the DNA region into smaller segments that show little or no internal recombination. This is then followed by step 5a: estimating separate SP trees for each subregion. Since step 3 indicated that the 5' subregion of LPL defined by variable sites 1-18 and the 3' subregion defined by variable sites 36-69 have experienced little internal recombination, we estimate separate haplotype trees in the 5' subregion (variable sites 1-18) and in the 3' subregion (36-69), thereby excluding the recombinational hotspot (19-35) from the analysis. When using only a subset of the sites, many of the original haplotypes collapse into a common state. Table 2 and Table 3 present these collapsed haplotype categories for the 5' and 3' regions, respectively. Although the 5' and 3' regions have very little internal recombination, there is some (![]()
Steps 4b and 5b (Fig 1) show an alternative and novel method for removing the effects of recombination and estimating cladistic structure in a DNA region that has experienced recombination. Step 4b removes all the recombination events inferred in step 3 from the SP tree estimated for the entire region obtained in step 2. By removal, we mean that the recombinant haplotype itself is removed from the SP tree, the homoplasies that were used to identify the recombinant under the ![]()
The peeled tree should ideally reflect the component of haplotype diversity that has not been affected by recombination during the coalescence of this DNA region. However, additional cladistic structure could have arisen in haplotype lineages derived from the recombinant haplotype. This postrecombinational cladistic structure is estimated in step 4b as those subsets of the SP tree estimated in step 2 that consist of branches and haplotypes derived from each of the original recombinants by subsequent mutational events.
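The partition into nonrecombinant and postrecombinant components can be illustrated with a deliberately simplified sketch. It assumes the step 2 tree is available as a child-to-parent mapping and that the recombinants from step 3 are already known; the function name `partition_by_recombination`, the haplotype labels, and the toy tree are all hypothetical. The sketch omits the homoplasy bookkeeping that peeling also involves and only separates haplotypes not descended from any recombinant from those derived from each recombinant.

```python
from collections import defaultdict


def partition_by_recombination(parent, recombinants):
    """Split tree nodes into a peeled set and per-recombinant descendant sets.

    parent: dict mapping each haplotype to its parent (None for the root).
    recombinants: iterable of haplotypes inferred to be recombinants.
    """
    # Build a parent -> children index so descendants can be traversed.
    children = defaultdict(list)
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)

    # Postrecombinational structure: each recombinant plus everything derived from it.
    post = {}
    for rec in recombinants:
        members, stack = set(), [rec]
        while stack:
            node = stack.pop()
            if node not in members:
                members.add(node)
                stack.extend(children[node])
        post[rec] = members

    # Peeled structure: haplotypes untouched by any inferred recombination event.
    affected = set().union(*post.values())
    peeled = set(parent) - affected
    return peeled, post


# Hypothetical toy tree: R1 is a recombinant and H4 arose from it by later mutation.
toy_parent = {"H1": None, "H2": "H1", "H3": "H1", "R1": "H2", "H4": "R1", "H5": "H3"}
peeled, post = partition_by_recombination(toy_parent, {"R1"})
print(sorted(peeled), {k: sorted(v) for k, v in post.items()})
```

If one recombinant descends from another, its descendants appear in both postrecombinational sets, which is one simple way to represent interlocking structure.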
To see if these two methods of estimating cladistic structure in a recombining DNA region yield compatible results, we next collapse the cladistic networks estimated in step 4b by first considering only the 5' sites (1-18) and then the 3' sites (36-69). This results in separate 5' and 3' tree topologies (step 5b in Fig 1) that are contained within the cladistic structure estimated in step 4b but that correspond to the subregions identified in step 4a. This allows a direct comparison of the two methods (steps 4a and 4b) for removing the effects of recombination. Step 6 in Fig 1 is therefore to test the concordance of the tree topologies estimated in step 5a vs. those estimated in step 5b. The null hypothesis that a given data set fits equally well into two alternative evolutionary trees or networks is tested with a Wilcoxon matched-pairs signed-rank test according to the procedures given in ![]()
Testing for an association between mutagenic sites and homoplasy:
2).
| RESULTS |
|---|
The first three steps in the inference chain (Fig 1) have already been completed and published in previous articles for LPL (![]()
Tree estimation of the subregions flanking the recombinational hotspot (step 5a):
Fig 4 and Fig 5 show, respectively, the SP trees estimated for the 5' and 3' regions that flank the recombinational hotspot. The 5' region has two statistically parsimonious solutions (Fig 4) for the haplotype categories given in Table 2. Fig 5 shows the resulting statistical parsimony networks for the remaining haplotypes as grouped into the 3' categories shown in Table 3. The long branches leading to the four major termini that frequently serve as candidates for parental types in recombination (![]()
Estimation of cladistic structure for the entire 9.7-kb region with no detected recombination (step 4b):
Fig 3 shows the SP tree estimated by ![]()
Estimation of cladistic structure for the entire 9.7-kb region that evolved from recombinant haplotypes (step 4b):
Fig 7 shows the 29 recombination events and 1 gene conversion event (given at the MDECODE website), along with the cladistic structure estimated to have arisen as recombinant haplotypes and their descendants accumulated subsequent mutations. Sometimes one of these descendant haplotypes served as an inferred parental type in a subsequent recombination event. This results in an interlocking of the cladistic structure that evolved from one recombinant with that of another recombinant, as is also shown in Fig 7. In other cases, neither the recombinant nor any of its descendant haplotypes engaged in any subsequent recombination events. Such cases are indicated in Fig 7 by the absence of any connection to any other recombination event or its postrecombinational cladistic structure.
Collapsing the nonrecombinant and postrecombinant cladistic structure into separate 5' and 3' subregional trees (step 5b):
We collapsed the peeled tree of nonrecombinant cladistic structure obtained in step 4b into its 5' and 3' subsets as defined by the subregions identified in step 4a by simply removing all variable characters not in the subregion of interest (step 5b, Fig 1). Fig 8A shows the resulting haplotype network when the peeled nonrecombinant tree for the entire 9.7-kb region includes only variable characters 1-18 (the 5' region flanking the recombinational hotspot), and Fig 8B shows the corresponding 3' collapsed tree (characters 36-69). Similarly, we collapsed the postrecombinational cladistic structure shown in Fig 7 by considering only the 5' characters (1-18), with the result shown in Fig 9A, and by considering only the 3' characters (36-69), with the result shown in Fig 9B.
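Collapsing by removing characters outside a subregion can be sketched as follows. This is only an illustration, assuming each haplotype is stored as a string with one character state per variable site, with sites numbered from 1; the function name `collapse_haplotypes`, the haplotype labels, and the six-site toy data are hypothetical and are not taken from the LPL data set.

```python
from collections import defaultdict


def collapse_haplotypes(haplotypes, first_site, last_site):
    """Group haplotypes that are identical over sites first_site..last_site (1-based, inclusive)."""
    groups = defaultdict(list)
    for name, states in haplotypes.items():
        # Keep only the character states inside the subregion of interest.
        key = states[first_site - 1:last_site]
        groups[key].append(name)
    return dict(groups)


# Hypothetical haplotypes over six variable sites.
toy = {"h1": "001011", "h2": "001100", "h3": "011011"}
five_prime = collapse_haplotypes(toy, 1, 3)   # h1 and h2 become indistinguishable
three_prime = collapse_haplotypes(toy, 4, 6)  # h1 and h3 become indistinguishable
print(five_prime, three_prime)
```

For the regions described here, the corresponding calls would use sites 1-18 for the 5' region and 36-69 for the 3' region.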
Cross-validation of the cladistic structures emerging from steps 5a and 5b (step 6):
We first compare the cladistic structures estimated for the 5' and 3' regions flanking the recombinational hotspot (with any recombination events involving those flanking regions excluded) as shown in Fig 4 and Fig 5 with the nonrecombinant cladistic structure given in Fig 8. There are fewer haplotype categories in Fig 8 than in Fig 4 and Fig 5 because all crossover events anywhere in the 9.7-kb region were peeled off in obtaining Fig 8, whereas the trees given in Fig 4 and Fig 5 excluded only crossover events that were internal to either the 5' or 3' flanking regions, respectively. Consequently, the contrast of alternative topologies is limited only to that portion of the topology defined by the shared haplotype categories. The Templeton test statistic for the 5' region is 1, with only 1 observation out of 18 not tied. A minimum of 5 untied observations is required for significance at the 5% level, so this result is not even close to significance at the 5% level. The Templeton test statistic for the 3' region is 3, with only 2 untied observations out of 34. This result is also not significant at the 5% level.
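The stated minimum of 5 untied observations can be checked by enumerating the null distribution of the signed-rank statistic. The sketch below assumes a one-sided exact test with no ties among the ranked magnitudes; the sidedness is an assumption, since the text does not state it, and the function names are hypothetical. Under that assumption the smallest attainable probability with n untied observations is 1/2^n, which is 1/16 = 0.0625 for n = 4 and 1/32 = 0.03125 for n = 5.

```python
def signed_rank_counts(n):
    """counts[t] = number of sign assignments to ranks 1..n whose positive-rank sum equals t."""
    max_sum = n * (n + 1) // 2
    counts = [1] + [0] * max_sum
    for rank in range(1, n + 1):
        # Iterate downward so each rank is used at most once (subset-sum counting).
        for total in range(max_sum, rank - 1, -1):
            counts[total] += counts[total - rank]
    return counts


def one_sided_p(statistic, n):
    """Exact P(T <= statistic) under the null, for n untied observations."""
    counts = signed_rank_counts(n)
    return sum(counts[: statistic + 1]) / 2 ** n


def min_untied_for_significance(alpha=0.05):
    """Smallest n for which the most extreme outcome (T = 0) reaches the given level."""
    n = 1
    while one_sided_p(0, n) > alpha:
        n += 1
    return n


print(min_untied_for_significance(0.05))
```

With only 1 or 2 untied observations, as reported for the 5' and 3' regions, no outcome can reach the 5% level under these assumptions.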
We next cross-validate the postrecombinational cladistic structure given in Fig 9 by comparing it with the 5' and 3' trees shown in Fig 4 and Fig 5, respectively. There are no topological inconsistencies for the 5' region, so there was no need to perform a Templeton test. For the 3' region, the Templeton test statistic is 2, with only 1 untied observation out of 34, a result not significant at the 5% level.
Association between homoplasies in the cladistic structure with mutagenic categories after the removal of the effects of detected recombination:
The cladistic structure is probably most accurately reconstructed for the 5' and 3' flanking regions, which show considerable cross-validation through two different estimation techniques (step 6 above). Hence, we count the number of homoplasies at sites 1-18 and 36-69 as displayed in Fig 4 and Fig 5, but using the resolved loops from Fig 8 and Fig 9 that are topologically consistent with Fig 4 and Fig 5. Making use of this added resolution is justified because the trees in Fig 8 and Fig 9 are based on more character state information than those in Fig 4 and Fig 5. In addition, six recombination/gene conversion events were excluded in estimating the cladistic structures shown in Fig 4 and Fig 5, as mentioned earlier. Fig 9 shows the mutations inferred to have occurred after these recombination/gene conversion events, and the sites found in the relevant flanking regions given in Fig 9 are also included in this analysis. In this fashion, all 18 sites in the 5' region and all 34 sites in the 3' region are included. Each of these 52 sites was then characterized as being highly mutable [CG dinucleotides, mononucleotide runs of length five or greater, and DNA polymerase α arrest sites having the motif TG(A/G)(A/G)GA] or not, as detailed in ![]()
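The three sequence categories named above lend themselves to a simple pattern search. The following sketch is illustrative only: it scans the given strand alone (whether the complementary strand was also considered is not stated here), uses 0-based positions, and the names `MUTABLE_PATTERNS` and `mutable_categories` as well as the toy sequence are hypothetical.

```python
import re

# Patterns for the three highly mutable categories described in the text.
MUTABLE_PATTERNS = {
    "CG_dinucleotide": re.compile(r"CG"),
    "mononucleotide_run": re.compile(r"A{5,}|C{5,}|G{5,}|T{5,}"),
    "polymerase_arrest_site": re.compile(r"TG[AG][AG]GA"),
}


def mutable_categories(sequence, position):
    """Return the set of categories whose matches cover the 0-based position."""
    found = set()
    for name, pattern in MUTABLE_PATTERNS.items():
        for match in pattern.finditer(sequence):
            if match.start() <= position < match.end():
                found.add(name)
                break
    return found


# Toy sequence: a CG at positions 2-3, a run of six T at 4-9, and TGAGGA at 9-14.
toy_sequence = "AACGTTTTTTGAGGAC"
print(mutable_categories(toy_sequence, 3))
print(mutable_categories(toy_sequence, 9))
```

A site would then be scored as highly mutable when the returned set is non-empty.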
The results are shown in Table 4, and a Fisher's exact test reveals a significant pattern in which homoplasies are disproportionately found at highly mutable sites, when the recombinational hotspot sites (19-35) are excluded (a two-tailed probability of 0.0089 under the null hypothesis of homogeneity) or included (a two-tailed probability of 0.0013 under the null hypothesis of homogeneity).
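For readers who wish to reproduce this kind of test, a two-tailed Fisher's exact test on a 2 x 2 table (highly mutable or not, by homoplasy present or absent) can be sketched with the hypergeometric distribution. The function name is hypothetical, the two-tailed rule used here (summing all tables no more probable than the observed one) is one common convention and may not match the software used for Table 4, and the counts in the usage line are invented for illustration rather than taken from Table 4.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact probability for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is no more probable than the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Invented counts, for illustration only (not the Table 4 data).
print(fisher_exact_two_tailed(8, 2, 1, 5))
```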
| DISCUSSION |
|---|
Cladistic structure:
Although recombination is common in the LPL gene, the inference that most of this recombination is concentrated into a hotspot (![]()
These conclusions are reinforced by the peeled tree shown in Fig 6, which provides an estimate of the cladistic structure contained within the entire 9.7-kb region sequenced that has not been altered by any detected recombination or gene conversion events. This peeled tree also contains the long branches among the four major termini that are defined primarily by sites 3' to the recombinational hotspot. Indeed, the peeled tree in Fig 6 shows that some nonrecombinant states have persisted all the way back to the root of the LPL gene tree. The peeled tree also reinforces the conclusion of a recombinational hotspot because the peeled tree has only one current haplotype associated with termini 2, 3, and 4, thereby implying that virtually all the haplotypes defined by the evolutionarily old 3' structure on the long branch between T-1 and the remaining termini have undergone recombination. Fig 7 reveals that substantial cladistic structure has also arisen after recombination events occurred. Thus, there is both nonrecombinant and postrecombinant cladistic structure in this region of the LPL locus. We anticipate that cladistic structure will exist for other nuclear DNA regions showing recombination when the recombination is concentrated into hotspots (![]()
Partitioning cladistic structure into nonrecombinational and postrecombinational components:
Peeling off inferred recombinants from a tree estimated through statistical parsimony under the null hypothesis of no recombination is a novel method of estimating cladistic structure in a DNA region subject to recombination. Such peeling partitions cladistic structure into a component that has never been influenced by detectable recombinational events and a component that evolved after recombination. To check the accuracy of this novel approach, we cross-validated it with the haplotype networks estimated in a more traditional fashion within subregions with little evidence for recombination (Fig 4 and Fig 5) and found no significant differences. The few topological differences that did exist were minor. Starting with the 5' region, the only discrepancy between Fig 8A and Fig 4 is in the position of 5'-2. There is an additional homoplasy at site 8 in the peeled tree that is not present in the more traditionally estimated tree. The single discrepancy between the topologies could be due to a recombination event that placed site 8 upon a new 3' background but had too few markers to be statistically significant under the recombination test given by ![]()
Similarly, the 3' region of the peeled nonrecombinant tree shown in Fig 8B differs from the topology given in Fig 5 by only two additional homoplasies involving sites 53 and 65 and by a slightly different but equally parsimonious (when inference is restricted to characters 36-69) placement of haplotype 59N. Once again, these topological differences are not significant by the Templeton test and can be explained by two additional recombination events that were not statistically significant because of too few markers. Therefore, the peeled tree shown in Fig 6 accurately reflects both the 5' and 3' evolutionary structure that flanks the recombinational hotspot found in this 9.7-kb region. The few topological discrepancies that are detected are explicable by three additional recombination events, but these three events did not involve enough markers to achieve statistical significance. Given the overall number of recombination events that were statistically significant, it is reasonable to expect that some additional recombination events occurred but were not detected (![]()
The results given in Fig 7 imply that much of the cladistic structure observed in the 9.7-kb region arose due to mutational accumulation in haplotype lineages that were initially created by a recombination or gene conversion event. In addition, Fig 7 implies that a single haplotype lineage could have been affected by multiple recombination events during its evolutionary history. For example, haplotypes 63N and 71R have 10 inferred recombination events in their evolutionary history, as well as postrecombination mutational accumulation. Overall, Fig 7 reveals a complex history of interlocking recombination events as a major force in shaping the haplotype diversity found in this region of the LPL gene. At this point, we have not yet determined how accurately this complex recombinational history has been reconstructed, although we do know that two events are not robust to alternative tree topologies (recombination events 23 and 24 in Fig 7) and that events 26 and 28 are collapsed into a single recombination event under some alternative tree topologies, as are events 16 and 20 (![]()
As with the peeled nonrecombinant tree, we found that the postrecombinational cladistic structure suggested by Fig 7 was topologically consistent with the 5' and 3' trees shown in Fig 4 and Fig 5, respectively, as ascertained by the Templeton test. There were no topological inconsistencies for the 5' region between Fig 4 and Fig 9A, and there is only a single inconsistency involving two additional homoplasies at sites 53 and 69 in the 3' region.
All this topological consistency indicates that this peeling method has promise in partitioning the cladistic structure in gene regions with recombination into nonrecombinant and postrecombinant components. We used the existence of a recombinational hotspot in the LPL region as a tool to provide a cross-validation test, but the inference scheme given in steps 1, 2, 3, 4b, and 5b in Fig 1 could also be applied to regions with uniform recombination. In contrast, step 4a would be difficult to implement under uniform recombination unless recombination were rare because it would be impossible to find large subregions that show little or no internal recombination. Hence, the peeling method of inferring cladistic structure has a broader range of applicability than the method of subdividing a gene into smaller regions that show little or no internal recombination.
Evidence for multiple mutational events:
Because undetected recombination or gene conversion would be expected to weaken this association, these results indicate that the highly mutable sites have indeed been subjected to multiple mutational events. Hence, the infinite sites mutation model, which assumes that multiple events can never occur, is not strictly applicable to the LPL region. This is an important conclusion because many of the statistics commonly used to analyze DNA sequence data are based upon the infinite sites model, and some of these statistics, but perhaps not all (![]()
Implications for genotype/phenotype association studies:
As shown in Fig 3 through Fig 7, there is much cladistic structure in the LPL locus despite common recombination, primarily because the recombination is concentrated into a small hotspot. This cladistic structure is important for future studies on associations between genetic variation and phenotypic variation. Indeed, the LPL gene is an ideal candidate for a cladistic analysis because the recombinational hotspot preserves much cladistic structure while simultaneously allowing positional inferences to be made on any detected phenotypic associations (![]()
Choosing a small number of SNPs for disequilibrium mapping in the LPL region:
One of the major new goals of the Human Genome Project is to develop a map of more than 100,000 SNPs distributed over the entire human genome (![]()
It is virtually impossible for any single SNP to show disequilibrium across the entire 9.7-kb region sequenced because of the recombinational hotspot. First, SNPs in the hotspot show little disequilibrium either with one another or with any markers in the 5' and 3' flanking regions (see Fig 8 in ![]()
20% of the sequenced portion of the LPL gene represents a disequilibrium blind spot. If a "randomly" chosen SNP fell into this hotspot region, it would most probably be useless for any analysis requiring disequilibrium. The only SNPs that would be expected to provide some disequilibrium coverage are those falling into the 5' and 3' regions that flank the recombinational hotspot. Obviously, significant disequilibrium is expected only within each of the flanking regions and not between them. Therefore, a single SNP could not provide adequate disequilibrium coverage for even this third of the LPL gene. Rather, separate SNPs would be needed for each flanking region.
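To make the notion of disequilibrium between two SNPs concrete, the sketch below computes the squared correlation r² between two biallelic sites from phased haplotypes. This is a standard measure chosen here for illustration; the disequilibrium statistic used in the analyses cited above may differ, and the function name and the toy haplotypes (coded "0"/"1") are hypothetical.

```python
def r_squared(haplotypes, i, j):
    """Squared correlation between biallelic sites i and j (0-based) across phased haplotypes."""
    n = len(haplotypes)
    p_a = sum(h[i] == "1" for h in haplotypes) / n
    p_b = sum(h[j] == "1" for h in haplotypes) / n
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    # D is the departure of the joint frequency from independence.
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both sites must be polymorphic in the sample")
    return d * d / denominator


# Toy data: sites 0 and 1 are perfectly associated, sites 0 and 2 are independent.
toy_haplotypes = ["110", "111", "000", "001"]
print(r_squared(toy_haplotypes, 0, 1), r_squared(toy_haplotypes, 0, 2))
```

A marker SNP provides disequilibrium coverage of another site only to the extent that such a measure is high, which is why sites separated by the hotspot are poor proxies for one another.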
Given that choosing SNPs nonrandomly by population frequency criteria implies a nonrandom choice by evolutionary history as well, it is far better to use the evolutionary history explicitly. We now address how the cladistic and homoplasy structure estimated in this article can be used for choosing SNPs in the two flanking regions.
Starting with the 3' end, there are four major clades (the four termini in Fig 5 or Fig 6). Given that the 3' cladistic structure implies that these are old and long-surviving clades, these four clades capture the majority of mutational divergence found in the 3' end of the gene. Moreover, because the branches interconnecting these four major clades are long, we have great confidence that this portion of the estimated LPL haplotype tree is accurate and fully resolved. Cladistic analyses of phenotypic associations do not require that the haplotype tree be estimated in a completely accurate, resolved fashion (![]()
Three of these nine SNPs (at variable sites 39, 43, and 47) unambiguously mark the oldest and most extensive divergence found within this gene region, the distinction between terminus 1 vs. termini 2, 3, and 4. Another important consideration in choosing among these three candidate SNPs is whether or not they are at highly mutable sites. Highly mutable sites should tend to show high levels of polymorphism, which makes them highly informative of their direct phenotypic effects and useful in defining additional haplotype variation when used in conjunction with other nearby variable sites. However, highly mutable sites are not ideal as single-site markers of association (the primary purpose of choosing SNPs in this context) because they could show complicated patterns of disequilibrium due to the fact that identity by state may not reflect identity by descent if multiple mutational events occurred. Our demonstration that mutable sites are strongly associated with increased homoplasy indicates that this possibility cannot be ignored. Given that almost half of the SNPs in the sequenced portion of the LPL gene come from the highly mutable classes, this consideration imposes another constraint upon choosing SNPs for disequilibrium mapping. For the particular task of choosing among the three SNPs to discriminate T-1 from the remaining 3' termini, the SNP at site 39 is at a highly mutable site and should be excluded, leaving two candidates (the SNPs at sites 43 and 47). To discriminate among the other termini, the SNP at site 51 unambiguously discriminates between terminus 4 and the node leading to termini 2 and 3 and is not at a highly mutable site. Finally, any one of the SNPs at sites 32, 37, 48, or 52 unambiguously discriminates between T-2 and T-3, but the SNP at site 37 would be the best because it is not in one of the highly mutable categories.
Hence, three SNPs (at sites 43 or 47, site 51, and site 37) are needed to identify the four major 3' clades that represent the oldest and most extensive genetic diversity within the region sequenced.
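The selection rule applied above (for each branch separating major clades, keep the candidate SNPs that are not at highly mutable sites) can be written down compactly. In this sketch the candidate sites are those listed in the text; treating sites 32, 48, and 52 as highly mutable is an inference from the statement that site 37 is preferred for not being in a highly mutable category, and the function name and branch labels are hypothetical.

```python
def choose_marker_snps(branch_candidates, highly_mutable):
    """For each branch, keep candidate sites that are not in the highly mutable set."""
    return {
        branch: [site for site in sites if site not in highly_mutable]
        for branch, sites in branch_candidates.items()
    }


# Candidate sites per branch, as listed in the text for the 3' region.
candidates_3prime = {
    "T-1 vs T-2/T-3/T-4": [39, 43, 47],
    "T-4 vs T-2/T-3": [51],
    "T-2 vs T-3": [32, 37, 48, 52],
}
# Site 39 is stated to be highly mutable; 32, 48, and 52 are assumed to be (see lead-in).
highly_mutable_sites = {39, 32, 48, 52}

print(choose_marker_snps(candidates_3prime, highly_mutable_sites))
```

Under these assumptions the rule leaves sites 43 or 47, site 51, and site 37, matching the three 3' markers named above. The same filter could be applied to the 5' candidates once sites with 5' homoplasy have also been excluded.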
The 5' end consists of evolutionarily closer haplotypes, and there are no sites totally lacking homoplasy within this end with respect to the total data set (Fig 3). However, much of this homoplasy is due to recombination with the 3' end, and there are many sites that have no homoplasy within the 5' end (Fig 4 and Fig 9A). Hence, the 5' sites with no homoplasy in the 5' tree can be used to mark the major sources of haplotype divergence in the 5' end, but should always be used in conjunction with the minimal set of three 3' markers identified above because all sites show apparent homoplasy in the total data set due to recombination. The single branch in Fig 4 that captures most of the 5' variation is the branch between the 5'-3 and 5'-6 haplotype sets. To the left of this branch is a set of closely related 5' haplotype categories that differ from their nearest neighbors by only a single site, and to the right of this branch is a set of more distantly related 5' haplotype categories (Fig 4). This branch is marked by two sites, and of those, site 5 shows no 5' homoplasy and is not highly mutable. Hence, the SNP at site 5 captures most of the 5' variation. With additional sites, even greater 5' resolution is possible. The right half of Fig 4 consists of three clusters of 5' haplotypes (5'-6, 32J, and 36J; 5'-8; and 5'-7, 8J, and 14J), all of which are at least two mutational steps from a common node. Variable sites 9 and 10 show no 5' homoplasy and are not in highly mutable categories, and either would discriminate the [5'-6, 32J, 36J] cluster from the remaining right-side clades. Similarly, sites 2 and 18 show no 5' homoplasy, but only site 2 is not in a highly mutable category. Hence, the SNP at site 2 should be used for discriminating between the remaining two right-side clades. In summary, scoring three SNPs at sites 5, 9 or 10, and 2 would mark the major sources of haplotype divergence in the 5' region of the sequenced portion of this gene.
These considerations indicate that a single SNP cannot provide adequate disequilibrium coverage for even this third of the LPL gene and that at least four to six SNPs are needed to mark most of the mutational divergence found in the regions flanking the recombinational hotspot. Note that choosing these four to six SNPs required detailed analysis of the entire 9.7-kb region with respect to cladistic structure, recombination, and mutation. This reinforces the conclusion of ![]()
However, even these cladistically informed choices may not always be meaningful because of genetic differences among populations. The four to six SNPs indicated above are informative for the three specific populations we have sampled, but these SNPs may not be polymorphic in all human populations, and hence may not be informative for all human populations. Evolutionary considerations are also relevant to the problem of choosing SNPs in light of potential genetic heterogeneity among populations. Detailed phylogeographic analyses of human mitochondrial DNA, Y-linked DNA, and autosomal DNA all indicate that the primary pattern in recent human evolution has been one of gene flow constrained by isolation by distance (![]()
The evolutionary situation is quite different at the 5' end, which does not reveal any deep cladistic structure. Hence, the SNPs in this part of the LPL gene may have little utility beyond the populations actually sampled. For example, as discussed above, the single most informative SNP in the 5' region for the three populations sampled is the SNP at site 5 (Fig 4). The haplotypes found to the left of the SNP at site 5 in Fig 4 are found in all three populations sampled, but the haplotypes to the right of this SNP in Fig 4 are found only in the Jackson population. Hence, if we had sampled just the Rochester and North Karelian populations and not the Jackson population, the SNP at site 5 would have been uninformative. On the basis of these evolutionary considerations, we conclude that the SNPs identified in the 3' flanking region are likely to have general utility for most human populations, whereas the SNPs in the 5' flanking region are likely to be informative only for specific populations.
The ability to identify such highly informative SNPs for particular populations illustrates the role that simultaneous estimation of recombinational, mutational, and evolutionary structure can play in human genetic epidemiology. Moreover, the ability to identify SNPs that may have general utility in most human populations is another important application of evolutionary features that can arise from such a cladistic analysis of haplotype variation. However, as the 5' flanking region shows, the evolutionary analysis may also indicate that no single SNP is likely to be informative for most human populations. When dealing with such DNA regions, it would be better to obtain sequence data for the entire region in the populations of interest rather than relying upon one or a few SNPs.
| ACKNOWLEDGMENTS |
|---|
We thank Jody Hey and two anonymous reviewers for their excellent suggestions for improving an earlier draft of this article. This work was supported by the National Heart, Lung, and Blood Institute grants HL39107, HL58238, HL58239, and HL58240.
Manuscript received January 10, 2000; Accepted for publication June 29, 2000.
| LITERATURE CITED |
|---|
AGARWAL, S. K., L. V. DEBELENKO, M. B. KESTER, S. C. GURU, and P. MANICKAM et al., 1998 Analysis of recurrent germline mutations in the Men1 gene encountered in apparently unrelated families. Hum. Mutat. 12:75-82.
BOERWINKLE, E., D. L. ELLSWORTH, D. M. HALLMAN, and A. BIDDINGER, 1996 Genetic analysis of atherosclerosis: a research paradigm for the common chronic diseases. Hum. Mol. Genet. 5:1405-1410.
CASTELLOE, J. and A. R. TEMPLETON, 1994 Root probabilities for intraspecific gene trees under neutral coalescent theory. Mol. Phylogenet. Evol. 3:102-113.
CHAKRAVARTI, A., K. H. BUETOW, S. E. ANTONARAKIS, P. G. WEBER, and C. D. BOEHM et al., 1984 Nonuniform recombination within the human β-globin gene cluster. Am. J. Hum. Genet. 36:1239-1258.
CLARK, A. G., 1990 Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7:111-122.
CLARK, A. G., K. M. WEISS, D. A. NICKERSON, S. L. TAYLOR, and A. BUCHANAN et al., 1998 Haplotype structure and population genetic inferences from nucleotide sequence variation in human lipoprotein lipase. Am. J. Hum. Genet. 63:595-612.








