Genetics, Vol. 163, 1533-1548, April 2003, Copyright © 2003

A Haplotype-Based Algorithm for Multilocus Linkage Disequilibrium Mapping of Quantitative Trait Loci With Epistasis

Xiang-Yang Loua,c, George Casellaa, Ramon C. Littella, Mark C. K. Yanga, Julie A. Johnsonb, and Rongling Wua
a Department of Statistics, University of Florida, Gainesville, Florida 32611
b Department of Pharmacy Practice, University of Florida, Gainesville, Florida 32611
c Department of Agronomy, Zhejiang University, Hangzhou, Zhejiang 310029, People's Republic of China

Corresponding author: Rongling Wu, 533 McCarty Hall C, University of Florida, Gainesville, FL 32611., rwu{at}stat.ufl.edu (E-mail)

Communicating editor: C. HALEY


*  ABSTRACT
*TOP
*ABSTRACT
*POPULATION GENETIC MODEL
*QUANTITATIVE GENETIC MODEL
*STATISTICAL MODEL
*MONTE CARLO SIMULATION
*A CASE STUDY FROM...
*DISCUSSION
*LITERATURE CITED

For tightly linked loci, cosegregation may lead to nonrandom associations between alleles in a population. Because of its evolutionary relationship with linkage, this phenomenon is called linkage disequilibrium. Today, linkage disequilibrium-based mapping has become a major focus of recent genome research into mapping complex traits. In this article, we present a new statistical method for mapping quantitative trait loci (QTL) of additive, dominant, and epistatic effects in equilibrium natural populations. Our method is based on haplotype analysis of multilocus linkage disequilibrium and exhibits two significant advantages over current disequilibrium mapping methods. First, we have derived closed-form solutions for estimating the marker-QTL haplotype frequencies within the maximum-likelihood framework implemented by the EM algorithm. The allele frequencies of putative QTL and their linkage disequilibria with the markers are estimated by solving a system of regular equations. This procedure has significantly improved the computational efficiency and the precision of parameter estimation. Second, our method can detect marker-QTL disequilibria of different orders and QTL epistatic interactions of various kinds on the basis of a multilocus analysis. This can not only enhance the precision of parameter estimation, but also make it possible to perform whole-genome association studies. We carried out extensive simulation studies to examine the robustness and statistical performance of our method. The application of the new method was validated using a case study from humans, in which we successfully detected significant QTL affecting human body heights. Finally, we discuss the implications of our method for genome projects and its extension to a broader circumstance. The computer program for the method proposed in this article is available at the webpage http://www.ifasstat.ufl.edu/genome/~LD.


DESPITE its significant importance in plant and animal breeding and evolutionary studies, the genetic basis of a quantitatively inherited trait has not been well understood. Two factors have caused this. First, a quantitative trait, or a complex trait, is likely to be controlled by many genes, each having a small effect (LYNCH and WALSH 1998 Down). Such genes are difficult to identify using classic genetic approaches. Second, as revealed in a considerable body of literature, the genes underlying a complex trait are not independent in exerting an effect on quantitative variation (reviewed in MACKAY 2001 Down). The dependence between alleles at different loci may be due to their linkage on the same chromosome or nonrandom association even if they are on different chromosomes. In particular, the nonrandom association between two linked genes is called linkage (gametic) disequilibrium (RISCH and MERIKANGAS 1996 Down; HUDSON 2000 Down). Unlinked genes may also be associated if biological processes, such as population differentiation, population admixture, and natural selection, occur in a population. Linkage and nonrandom association leading to cosegregation of different genes affect gamete distribution and frequencies and, thus, are considered to be population genetic parameters (MACKAY 2001 Down).

Apart from possible allelic associations, different genes may be mutually dependent when one gene prevents another from manifesting its effect through a particular biochemical pathway (PHILLIPS 1998 Down). Such dependence of genes is referred to as epistasis, a ubiquitous phenomenon in the genetic control of a complex trait. Epistasis, described as a type of gene interaction on phenotypes, possesses the property of a quantitative genetic parameter. Classical estimation of quantitative genetic parameters is based on the resemblance among relatives using a theory established by FISHER 1918 Down, but this approach lacks the power to estimate epistasis because epistasis contributes little to the relative resemblance. A lack of powerful methods for estimating epistasis has diminished its central role in trait evolution and domestication, as claimed by S. Wright and his followers (WOLF et al. 2000 Down).

The advent of complete DNA-marker linkage maps has provided a powerful tool for investigating the genetic architecture of a quantitative trait in plants, animals, or humans (reviewed in MACKAY 2001 Down; BARTON and KEIGHTLEY 2002 Down). A number of statistical methods have been proposed to map genes responsible for complex traits, referred to as quantitative trait loci (QTL), based on different genetic designs, different marker types, and different mapping populations (reviewed in JANSEN 2000 Down; HOESCHELE 2000 Down). Most of these methods were developed to adapt gene segregation patterns in a mapping pedigree derived from a controlled cross and have two significant limits when they are applied on a wider scheme. First, for many species such as humans, it is not possible to obtain controlled crosses and thus their QTL mapping should be based on existing natural populations. Second, many methods assume that epistatic effects due to gene interaction are absent, to facilitate the estimation of quantitative genetic parameters. Given the complexity of quantitative variation, however, the assumption of no epistasis most likely deviates from biological reality (MACKAY 2001 Down; BARTON and KEIGHTLEY 2002 Down).

In this article, we present a new statistical method for mapping QTL with epistasis in natural populations. Our analysis is based on an important population property of gene segregation, that is, linkage disequilibrium (LYNCH and WALSH 1998 Down; HUDSON 2000 Down), although this property made quantitative genetic study more difficult in the past. Linkage causing linkage disequilibrium is used as a basis for fine mapping of human diseases (TEMPLETON 1999 Down) and has been successful in several case studies of gene positional cloning (reviewed in ARDLIE et al. 2002 Down). LUO and SUHAI 1999 Down, LUO et al. 2000 Down, and WU et al. 2002 Down extended disequilibrium-based mapping strategies to map QTL for a quantitative trait using a random sample drawn from a natural population. However, in these studies only the simplest model, based on one marker and one QTL, was formulated. The application of these models to study the genetic architecture of polygenic traits is therefore limited.

Our model presented here uses a multilocus linkage disequilibrium analysis to detect epistatic QTL and estimate the effects of their gene action and interaction on a quantitative trait. Our model can be considered to be comprehensive, because it includes all possible disequilibria among different markers and QTL and considers quantitative gene interaction effects of different kinds. We derive a simple closed form of the expectation-maximization (EM) algorithm for estimating the allelic frequencies and coefficients of linkage disequilibria and gene action and interaction effects. Because of this computational innovation, the estimation precision of QTL population and quantitative parameters from our multilocus model can be significantly improved over that of traditional treatments, despite the fact that the multilocus model contains an increasingly larger number of unknown parameters. Extensive simulations are performed to study the robustness and performance of our multilocus disequilibrium mapping of QTL in a natural population. Our method has been validated by the successful detection of QTL for human body height.


*  POPULATION GENETIC MODEL
*TOP
*ABSTRACT
*POPULATION GENETIC MODEL
*QUANTITATIVE GENETIC MODEL
*STATISTICAL MODEL
*MONTE CARLO SIMULATION
*A CASE STUDY FROM...
*DISCUSSION
*LITERATURE CITED

Suppose there is a panmictic natural population, in which m biallelic markers, 1, 2, ... , m, and q biallelic QTL, 1, 2, ... q, are segregating. For the sake of distinction, we use the subscripts to denote the markers and the superscripts to denote the QTL. Two alleles at a locus are denoted by M1k and M2k, or indexed by rk (rk = 1, 2), for marker k, and Ql1 and Ql2, or indexed by sl (sl = 1, 2), for QTL l. This population is assumed to be in disequilibrium, so that all markers and QTL are associated with one another. We denote Dk1k2 as the coefficient of digenic linkage disequilibrium between markers k1 and k2, Dl1l2 as the coefficient of digenic linkage disequilibrium between QTL l1 and l2 and Dl1k1 as the coefficient of digenic linkage disequilibrium between marker k1 and QTL l1. Similarly, the coefficients of trigenic (Dl1k1k2 and Dl1l2k1), quadrigenic (Dl1l2k1k2), or higher-order linkage disequilibrium can be defined.

Haplotypes, zygote configurations, and zygote genotypes:
With the above notation for the markers and QTL and their corresponding alleles, a general multilocus haplotype mix for the markers (M1 M2 ... Mm) and the QTL (Q1 Q2 ... Qq) is denoted by gs1s2...sqr1r2...rm (r1, ... , rm; s1, ... , sq = 1, 2), with population frequency expressed as ps1s2...sqr1r2...rm. This haplotype frequency can be divided into the allele frequencies and a set of disequilibrium coefficients of different orders (HUDSON 2000 Down). For example, when there is no interference between recombinations, a one-marker (k)/one-QTL (l) haplotype frequency is divided into two parts, no disequilibrium and digenic disequilibrium, expressed as

(1)

From now on we omit the symbols k for markers and l for QTL for simplicity, unless they are needed. A two-marker/one-QTL haplotype frequency is expressed as

(2)

When four or more genetic loci are associated, disequilibria of one set of loci may interact with those of other sets to form so-called multiplicative disequilibria (BENNETT 1954 Down). When (at least two) epistatic QTL are associated with multiple markers, these multiplicative disequilibria should be considered. For example, under the assumption of no interference between recombination events (BENNETT 1954 Down), a four-marker/two-QTL haplotype frequency is expressed as

(3)

For the m markers and q QTL, a total of 2m+q haplotypes randomly unite to generate 3m+q zygote genotypes containing 2m+q(2m+q + 1)/2 zygote configurations or modes of zygote formation via different combinations of paternal and maternal gametes. The number of zygote genotypes is smaller than that of zygote configurations because different configurations may generate the same zygote genotype. A zygote genotype can be generally formulated as

where {·}{·} are used to separate two different loci, and r1 <= r'1, ... , rm <= r'm,s1 <= s'1, ... ,sq <= s'q = 1, 2 are the two alternative alleles of a zygote at the same locus. The joint frequency of a zygote for all markers and QTL is expressed as

At Hardy-Weinberg equilibrium, the population frequency of a (m + q)-locus zygotic configuration is calculated as the product of the two corresponding haplotype frequencies (LYNCH and WALSH 1998 Down). Yet, the population frequency of a (m + q)-locus zygotic genotype is the summation of the population frequencies of all possible zygotic configurations.

Table 1 describes a matrix for the population frequencies of 27 zygote genotypes at two markers and one QTL, which contain 36 zygote configurations through random combinations of the eight haplotypes. As shown in the table, we use P{xi}({xi} = 1, ... , 36) to denote the frequencies of different zygote configurations. Because different configurations may produce the same zygote genotype when paternal and maternal gametes have different alleles at the same locus, the number of zygote configurations may be greater than the number of zygote genotypes. For example, zygote genotype G{12}{11}{12} is generated via configuration g111 x g212 or g211 x g112.


 
View this table:
In this window
In a new window

 
Table 1. Joint zygote probabilities of the QTL genotypes and two-marker genotypes in terms of zygote configurations P{xi}({xi} = 1, ... , 36)

Of the eight three-locus haplotype frequencies contained in the joint probability matrix of Table 1, seven are independent because of the constraint . These seven independent frequencies, arrayed by , are composed of three allele frequencies (pr1,pr2,ps), three digenic linkage disequilibria (Dk1k2,Dlk1,Dlk2), and one trigenic linkage disequilibrium (Dlk1k2). We develop a statistical algorithm to estiate {Omega}p and ultimately estimate those population genetic parameters describing allelic association and population evolutionary history.

A more complicated situation is that in which two QTL (l1 and l2) are detected by two pairs of markers (k1 - k2 and k3 - k4). We permit these six loci to be mutually associated in the population. A total of 64 (26) six-locus haplotypes randomly unite to generate 729 (36) six-locus zygotes whose population frequencies are arrayed by matrix . If these six loci contain two independent sets each with two markers and one QTL, we have where {otimes} is the Krockner product. If the two sets are not independent, however, the joint zygote probabilities of four markers and two QTL should be derived on the basis of different zygote configurations via random combinations of the 64 haplotypes. Summing to 1, the 64 six-locus haplotype frequencies contain 63 independent frequencies (arrayed as {Omega}p), which are expressed in terms of an equal number of composite parameters, six-allele frequencies, and 57 coefficients of linkage disequilibria of different orders.

Multiallelic disequilibrium analysis:
For outcrossing populations, it is not uncommon to have multiple alleles at a locus. Although the measure of linkage disequilibrium between two multiallelic loci was well defined (WEIR 1996 Down), a multiallelic disequilibrium analysis has not been extended to many different loci, because of an increasingly large number of disequilibrium coefficients. Assume that there are uk alleles for marker k and vl for QTL l. According to BENNETT 1954 Down, digenic linkage disequilibrium between marker allele rk and QTL allele sl can be defined as

with constraints Trigenic linkage disequilibrium among two marker alleles rk1 and rk2 at markers k1 and k2 and one QTL allele sl at QTL l is expressed as

with constraints .

Hexagenic linkage disequilibria between four marker nonalleles and two QTL nonalleles are expressed as


with constraints


*  QUANTITATIVE GENETIC MODEL
*TOP
*ABSTRACT
*POPULATION GENETIC MODEL
*QUANTITATIVE GENETIC MODEL
*STATISTICAL MODEL
*MONTE CARLO SIMULATION
*A CASE STUDY FROM...
*DISCUSSION
*LITERATURE CITED

In the above section, we discussed the population properties of markers and QTL and derived the joint probabilities of the QTL genotypes and the marker genotypes on the basis of their zygote formation. We now introduce a genetic model for characterizing the additive, dominant, and epistatic effects of QTL affecting a quantitative trait. We use MATHER and JINKS' (1982) notation to describe the epistasis, under which the genotypic values for any two QTL (l1 and l2) are expressed as

(4)

where µ is the overall mean; al1, al2, dl1, and dl2 are the additive and dominant effects of the two QTL, respectively; and il1l2, jl1l2, kl1l2, and ll1l2 are the additive x additive, additive x dominant, dominant x additive, and dominant x dominant epistatic effects between the two QTL, respectively.

The phenotypic value of a quantitative trait due to these two putative QTL for individual i from a random sample of the population can be written as

(5)

where xiS1S'1S2S'2 is an indicator variable, defined as 1 if individual i has QTL genotype Qs1Qs'1Qs2Qs'2 and 0 otherwise, and ei is the residual error, distributed as N(0, {sigma}2). It is possible that each QTL genotype has a different residual variance. Since there are nine QTL genotypes, nine independent genotypic values arrayed by {µ{s1s'1}{s2s'2}}1x9 can be defined as in (4). These nine genotypic values are composed of nine unknown quantitative genetic parameters: one overall mean (µ), two additive {al1, al2}, two dominant {dl1, dl2}, and four epistatic effects {il1l2, jl1l2, kl1l2, ll1l2}. The nine genotypic values and residual variance comprise a quantitative genetic parameter vector arrayed by {Omega}q = {µ{11}{11}, ... , µ{22}{22}, {sigma}2}, which can be estimated by incorporating the population genetic properties of the QTL within a mapping framework.

For many outcrossing populations, the biallelic-QTL genetic model described by Equation 5 may not be sufficient and should be extended to a multiallelic-QTL situation. Quantitative genetic models based on multiallelic inheritance have been well discussed in the literature (see COCKERHAM 1980 Down).


*  STATISTICAL MODEL
*TOP
*ABSTRACT
*POPULATION GENETIC MODEL
*QUANTITATIVE GENETIC MODEL
*STATISTICAL MODEL
*MONTE CARLO SIMULATION
*A CASE STUDY FROM...
*DISCUSSION
*LITERATURE CITED

Likelihood of the complete data:
In this QTL mapping study, marker data (M) and phenotype data (y) for all n individuals randomly drawn from a natural population are observed data, whereas zygote formation modes, i.e., zygote configurations (g), are unobservable or missing data. The observed data alone are incomplete data, which, along with the missing data, comprise the complete data. For simplicity, the statistical framework is illustrated for the relatively simple two-marker/one-QTL model. This framework can be extended to consider a multiple-marker/multiple-QTL model, considering multiple alleles at each locus. The likelihood function of the complete data given the unknown parameters {Omega} = ({Omega}p, {Omega}q) is written, under the two-marker/one-QTL model, as

(6)

or

(7)

where n{r1r'1}{r2r'2} is the observed number of individuals with marker genotype Mr1Mr'1Mr2Mr'2,


and f{ss'}(yi) is a normal density of observation yi with a particular QTL genotype QsQs' having mean µ{ss'} and variance {sigma}2. The maximization of the likelihood (7) is achieved by differentiating the log-likelihood with respect to {Omega}, which leads to

(8)

where By letting the score (Equation 8) equal zero and solving the log-likelihood equation, we obtain the maximum-likelihood estimates (MLE) of the unknown parameters as

where denotes the sample size of a particular zygote configuration {xi}({xi} = 1, ... , 36) as defined in Table 1. For the complete data, {xi} is observable. Based on the invariance property of maximum-likelihood estimates, the MLEs of allelic frequencies, the coefficients of linkage disequilibrium, and QTL effects can be obtained from

The likelihood of incomplete data:
The data of QTL mapping are incomplete because only marker data and phenotype data can be observed and the data on zygote configurations and QTL genotypes are missing. The likelihood of incomplete data is built upon a mixture model in that each individual is assumed to have arisen from one of these unknown QTL genotypes. Finite mixtures of distributions used to model a wide variety of random phenomena (see MCLACHLAN and PEEL 2000 Down) can provide a sound mathematical-based approach for distinguishing unknown QTL genotypes. With a normal mixture model-based approach for separating QTL genotypes, the likelihood function of incomplete data including the phenotype (y) and marker information (M) is written as

(9)

where Prob(G{r1r'1}{r2r'2}) is the population frequency of marker genotypes at the two markers,

and the other notation is defined as above.

The estimation of haplotype frequencies is based on the number of a particular haplotype in the population. For the complete data, 36 modes of zygote formation via random combinations of paternal haplotypes and maternal haplotypes are known a priori and, therefore, the number of various marker-QTL haplotypes can be directly counted (see Table 1). But for the incomplete data, zygote formation is unobservable and its inference is viewed as a missing data problem. In fact, it is possible to calculate the expected number of a particular marker-QTL haplotype for a given marker zygote genotype. Although marker-QTL zygote genotypes are also unknown, they can be inferred on the basis of marker information and phenotype data by implementing the EM algorithm. By differentiating Equation 9, we have

(10)

where

(11)

which could be thought of as a posterior probability that individual i has a marker-QTL zygote configuration {xi} given marker zygote genotype G{r1r'1}{r2r'2}. We then implement the EM algorithm (DEMPSTER et al. 1977 Down) with the expanded parameter set {{Omega}, {Pi}}, where

Each iteration consists of two steps. First, on the E step, the complete-data log-likelihood is averaged over the conditional distribution of the indicator variables given the observed data, using the current estimate of the parameter vector. Since the complete-data log-likelihood is linear in these indicator variables, the E-step of the EM algorithm involves simply replacing them by the current values of their conditional expectations, that is, the posterior probabilities of component membership expressed in Equation 11. Next, in the M step, conditional on {Pi}, we solve for the zeros of Equation 10 (likelihood equations) to get our estimates of {Omega}. The likelihood equation can be split into two terms: The first term refers to the genetic association as specified by haplotype frequencies, and the second term to the phenotype-genotype relationship described by the genotypic means and residual variance. The estimates are then used to update {Pi} in the E step, and the process is repeated until convergence. The values at convergence are the MLEs.

It is interesting to note that the estimation of allelic frequencies and coefficients of linkage disequilibria can be obtained from the closed-form solutions of haplotype frequencies. This will largely improve the computational efficiency of these parameters that can only be expressed as polynomial equations using traditional procedures (see LUO and SUHAI 1999 Down; LUO et al. 2000 Down; WU et al. 2002 Down).

Disequilibrium mapping of epistatic QTL:
We consider two cases for mapping QTL with epistasis for a quantitative trait based on linkage disequilibrium. First, two epistatic QTL are predicted by two independent pairs of markers. Second, two epistatic QTL are predicted by two pairs of genetically associated markers. In the first case, the two QTL can be assumed to be independent from a population genetic standpoint, although they are not so in terms of quantitative genetic effects. In the second case, the two QTL should be associated genetically and interact epistatically to affect a quantitative trait. In both cases, the EM algorithm proposed for the above two-marker/one-QTL model is extended to estimate both population and quantitative genetic parameters.

Two basic tasks should be completed before the implementation of the EM algorithm for epistasis mapping. The first task is the derivation of an (81 x 9) matrix for joint probabilities of four-marker/two-QTL zygote genotypes based on zygote configurations. For two independent sets of markers and QTL, such probabilities are simply calculated as the Krockner product of the joint probability for each set, so that there are 16 haplotype frequencies (and 14 independent haplotype frequencies) in this case. But for two dependent sets of loci, the derivations of such probabilities should consider the formation principle of a six-locus zygote, in which case there are a total of 64 haplotype frequencies, of which 63 are independent. In the first case, we have only digenic and trigenic linkage disequilibria, whereas, in the second case, we must face digenic, trigenic, quadrigenic, pentagenic, and hexagenic disequilibria (see Equation 3).

The second task is the modeling of epistatic QTL with additive, dominant, and additive x additive, additive x dominant, dominant x additive, and dominant x dominant interaction effects (see matrix 4). The estimation of both population (from the first task) and quantitative genetic parameters (from the second task) for epistatic QTL can be obtained using closed forms.

Hypothesis tests:
We can test two major hypotheses in the following sequence: (1) the existence of QTL by testing the significance of QTL effects and (2) the association of significant QTL with markers by testing linkage disequilibrium. The test for the existence of a significant QTL may not rely upon the association between the QTL and known marker(s), but the random association of a QTL with markers can be tested only when the existence of the QTL is statistically tested.

The existence of a QTL can be tested on the basis of two alternative hypotheses:

(12)

The log-likelihood-ratio test statistic for the existence of a QTL is calculated by comparing the likelihood values under H1 (full model) and H0 (reduced model), using

(13)

where the tilde denotes the MLEs of unknown parameters under H0 and the other notation is defined as above. The LRQ is considered to asymptotically follow a {chi}2 distribution with the degrees of freedom equaling the difference of the number of unknown parameters to be estimated under the two hypotheses. However, the approximation of a {chi}2 distribution may be inappropriate when some regularity conditions are violated. The permutation test approach proposed by CHURCHILL and DOERGE 1994 Down, which does not rely upon the distribution of the LRQ, may be used to determine the critical threshold for claiming the existence of a QTL.

After the existence of the QTL is statistically tested, we formulate a second test for its association with two markers under consideration. The log-likelihood ratio for such a test is calculated by LRQ

(14)

which is asymptotically distributed as a {chi}2 distribution with 3 d.f.

Other more specific tests for the existence of additive, dominant, or epistatic effect and the association of a QTL with a particular marker can also be formulated similarly. Generally, these hypotheses can be tested on the basis of a critical threshold calculated from permutation tests (CHURCHILL and DOERGE 1994 Down). Other approaches as proposed by PIEPHO 2001 Down can also be used, which do not require expensive computations.


*  MONTE CARLO SIMULATION
*TOP
*ABSTRACT
*POPULATION GENETIC MODEL
*QUANTITATIVE GENETIC MODEL
*STATISTICAL MODEL
*MONTE CARLO SIMULATION
*A CASE STUDY FROM...
*DISCUSSION
*LITERATURE CITED

Simulation scenarios:
We have performed extensive simulation studies to investigate the robustness and performance of our linkage disequilibrium-based mapping of quantitative traits. A number of simulation scenarios are performed to compare results under different QTL models (one-marker/one-QTL vs. multiple-marker/multiple-QTL), different heritability levels, different sample sizes, and different population genetic parameters (equal vs. extreme allele frequencies at a locus). Marker and QTL genotype data are simulated on the basis of joint frequencies for a given sample size, whereas phenotype data are simulated by summing genotypic values and environmental errors, distributed as N(0, {sigma}2). The residual variance {sigma}2 is determined under a given heritability level. The simulated data are analyzed using the genetic model proposed in this article. Each simulation scenario was repeated 200 times to estimate the biases and mean square errors of the MLEs.

To show the advantage of our multilocus disequilibrium model, we first perform the analysis on the basis of a one-marker/one-QTL model (MQ), as used in earlier studies (LUO and SUHAI 1999 Down; LUO et al. 2000 Down; WU et al. 2002 Down). The results from MQ are compared with those obtained from a two-marker/one-QTL model (MQM) and two epistatic QTL models (two separate sets of two-marker/one-QTL models, denoted by MQM-MQM, and two dependent sets of two-marker/one-QTL models, denoted by MQMMQM). For each QTL model, we design two different heritabilities (H2 = 0.1 vs. 0.4), two different sample sizes (n = 500 vs. 1000), and two different allele frequencies for a locus. These designs including the hypothesized parameter values are listed in Table 2 Table 3 Table 4.


 
View this table:
In this window
In a new window

 
Table 2. Experimental designs of linkage disequilibrium mapping based on the one-marker/one-QTL model (MQ)


 
View this table:
In this window
In a new window

 
Table 3. Experimental designs of linkage disequilibrium mapping based on the two-marker/one-QTL model (MQM)


 
View this table:
In this window
In a new window

 
Table 4. Experimental designs of linkage disequilibrium mapping based on two independent sets of two markers and one QTL (MQM-MQM)

Results:
Compared to the excellent accuracy and precision of the MLEs of marker allele frequencies that were obtained directly from the observed data, the accuracy and precision of the MLEs of QTL-allele frequency and disequilibrium and QTL effects are relatively reduced, sometimes seriously. As expected, the MLEs of QTL-related parameters under all given QTL models from MQ to MQMMQM (Table 2 Table 3 Table 4) display increased accuracy and precision with increased heritabilities and sample sizes, although the extent of the increase may be different among the models (Table 5 Table 6 Table 7). Generally, a more significant improvement of parameter estimation, especially for QTL allele frequency and disequilibrium, was observed when H2 is increased from 0.1 to 0.4 than when N is increased from 500 to 1000.


 
View this table:
In this window
In a new window

 
Table 5. Biases (and squared roots of mean square errors) of genetic parameter estimates based on the one-marker/one-QTL model (MQ)


 
View this table:
In this window
In a new window

 
Table 6. Biases (and squared roots of mean square errors) of genetic parameter estimates based on the two-marker/one-QTL model (MQM)


 
View this table:
In this window
In a new window

 
Table 7. Biases (and squared roots of mean square errors) of genetic parameter estimates based on two independent sets of two markers and one QTL (MQM-MQM)

Additive-dominant model: Under MQ, we consider two situations, one with no disequilibrium between the marker and QTL (MQ1–MQ4) and the other with disequilibrium between the two loci (MQ5–MQ8). The first situation is similar to the detection of major genes because marker information is actually not used. It is found that the MLEs of all QTL-related parameters in the second situation are improved over those in the first situation, especially for the dominant effect of the QTL and model mean (Table 5), suggesting the value of utilizing marker information in quantitative genetic research.

In MQM, we first assume that both markers are not associated with a putative QTL (MQM1–MQM4). As expected, this has no improvement in parameter estimation compared to MQ1–MQ4 (Table 6). Second, we assume that one of the two markers is associated with the QTL (MQM5–MQM8), whereas the second marker is not. This design is similar to composite interval mapping in which those markers outside the marker interval carrying a QTL are used as cofactors to control genome background. We did not observe any improvement in parameter estimation from composite disequilibrium mapping (MQM5–MQM8; Table 6) over disequilibrium mapping (MQ5–MQ8; Table 5).

We also design a scheme in which two markers are associated with each other, both of which are further associated with a putative QTL (MQM9–MQM12). Although this kind of mutual association may not lead to a significant improvement for the MLEs of QTL allele frequency and residual variance, the estimation for the model mean, the additive and dominant effects of the QTL, and the residual variance is largely benefitted by a simultaneous use of two associated markers (MQM9–MQM12; Table 6).

We examine the effects of different allelic frequencies on parameter estimation. The relative frequencies of two alternative alleles reveal the degree of genetic differentiation in a population. When QTL alleles are less differentiated, as shown by the allele frequencies of 0.3 vs. 0.7 (MQM13–MQM16), the precision of the MLEs is not affected as long as more differentiated markers are used (Table 6). We also found that the inclusion of some less differentiated markers did not affect the MLEs of QTL parameters (MQM17–MQM20). However, when less differentiated markers are used to predict less differentiated QTL (MQM21–MQM24), the estimation precision of QTL effects and model parameters is largely affected, although this may not be true for the MLEs of the population genetic parameters of QTL.

Additive-dominant-epistasis model: When two QTL interact epistatically, we develop two different multilocus disequilibrium schemes to estimate their epistasis. The first scheme (MQM-MQM) permits simultaneous use of two independent sets of markers to predict their corresponding QTL, whereas the second scheme (MQMMQM) assumes two dependent sets of marker-QTL relationships. In general, more tightly linked loci tend to have a greater chance for nonrandom association than less tightly linked loci do. Thus, the first scheme is roughly similar to a situation in which two sets of markers and QTL are from different chromosomes, whereas the second scheme is similar to a situation in which two sets of loci are from the same chromosome.

In the first scheme, all genetic parameters can be estimated with reasonable accuracy and precision (Table 7), suggesting that our disequilibrium-based mapping strategy can be used to map epistatic QTL. When two independent marker pairs are used at the same time, the MLEs of QTL-allele frequencies and QTL-marker disequilibrium are more precise than when only one marker pair is used. The root mean-square errors of these estimates are ~0.122 for QTL allele frequencies, 0.033 for QTL-marker digenic disequilibrium, and 0.017 for QTL-marker trigenic disequilibrium under MQM-MQM1 when two independent marker sets are used (Table 7). Yet, the corresponding values are 0.161, 0.045, and 0.023 under MQM9 when one marker pair is used (Table 6). But compared to MQM, the estimation precision of QTL additive and dominant effects is reduced for MQM-MQM (Table 7).

In terms of estimation precision, more precise MLEs of the additive x additive interaction effect between two different QTL than those for the additive effect at a single QTL are obtained, whereas the additive x dominant or dominant x additive interaction effect is similar to the dominant effect, but the dominant x dominant interaction effect is poorer than the dominant effect (Table 7). As expected, the MLE of the dominant x dominant interaction effect may be quite imprecise. However, when both sample size and heritability are significantly increased, the MLEs of the dominant x dominant effect can be close to those of other parameters (Table 7).

In the second scheme, the precision of the MLEs of all parameters is reduced, compared to the first scheme, because of the inclusion of many unknowns. We perform an additional simulation under H2 = 0.4 for the second scheme by increasing the sample size from 1000 to 2000 to 5000, keeping the other parameters unchanged. The precision of the MLEs of population genetic parameters remained approximately constant with increased sample sizes (data not given), but the MLEs of gene effects including the additive, dominant, and epistatic effects displayed significantly improved precision when the sample size is increased (Table 8). When N = 2000, the estimation of the additive and additive x additive interaction effects achieves acceptable precision under the complex MQMMQM model. The estimation of the other effects is certainly acceptable when a sample size is 5000 (Table 8).


 
View this table:
In this window
In a new window

 
Table 8. Biases (and squared roots of mean square errors) of genetic effect estimates for H2 = 0.4 based on two dependent sets of markers and QTL (MQMMQM)


*  A CASE STUDY FROM HUMANS
*TOP
*ABSTRACT
*POPULATION GENETIC MODEL
*QUANTITATIVE GENETIC MODEL
*STATISTICAL MODEL
*MONTE CARLO SIMULATION
*A CASE STUDY FROM...
*DISCUSSION
*LITERATURE CITED

A case study from a human genome project for mapping obesity genes is used to validate our method. The subjects sampled consisted of 643 women from the National Heart, Lung, and Blood Institute-funded Woman's Ischemic Syndrome Evaluation (WISE) study (J. A. JOHNSON, unpublished results). Women included in the WISE study were >18 years old and consist of 105 blacks and 538 whites, each measured for body height and other traits. However, as an example, only body height is analyzed here. All study participants provided informed written consent prior to participation in the study. WISE study protocols were approved by the Institutional Review Boards of the participants' institutions.

Genotypes for the women studied were determined using Orchid BioScience's proprietary primer extension technology (SNP-IT). A total of 10 single-nucleotide polymorphisms (SNPs) were genotyped at six candidate genes for obesity on different chromosomes, which are ADRB1, ADRB2, ADRB3, ADRA1A, the GS protein {alpha}-subunit (GNAS1), and the G protein ß3 subunit (GNB3; J. A. JOHNSON, unpublished results). The ADRB1 gene contains two common nonsynonymous polymorphisms at codon 49 (Ser -> Gly) and codon 389 (Arg -> Gly). At the ADRB2 gene, there are three polymorphism sites at codon 19 of the ß2AR 5' leader cistron (Cys -> Arg), codon 16 (Gly -> Arg), and codon 27 (Gln -> Gly). Two sites at codon 131 and codon 371 were genotyped for the GNAS1 gene.

Our newly developed statistical method is used to perform the association studies between SNPs and body heights in two different populations. According to MACKAY 2001 Down, the gene for a quantitative trait detected by a SNP is called a quantitative trait nucleotide (QTN). We found different QTN that affect human body heights between the black and white populations (Table 9). Two SNPs, Gly16Arg and Cys19Arg, at gene ADRB2 can identify significant QTN in the black population, although the significance of the QTL detected by the second SNP is marginal (Table 9). In fact, the QTL detected by these two SNPs may be identical for two reasons. First, the sum of the allele frequencies of the QTN identified is one, implying that variants detected by the two SNPs within the same gene are actually two alternative alleles. This alternative relationship is confirmed by different signs of disequilibrium coefficients. Second, the QTN displays a similar genetic effect on body height, with different directions for the additive effect as expected. This black-specific height QTN can also be identified by combining one of these two SNPs with SNP C825T at gene GNB3.


 
View this table:
In this window
In a new window

 
Table 9. QTNs associated with SNPs or SNP combinations for body heights in the black and white populations

In the white population, a significant QTN was detected by SNP C835T at gene GNB3 (Table 9). As shown by allele frequencies, linkage disequilibria, and genetic effects, we conjecture that the same QTN has been detected by combining this SNP with other SNPs. Although body height QTNs may be population specific, they can be detected in both populations when some common SNPs are combined.


*  DISCUSSION
*TOP
*ABSTRACT
*POPULATION GENETIC MODEL
*QUANTITATIVE GENETIC MODEL
*STATISTICAL MODEL
*MONTE CARLO SIMULATION
*A CASE STUDY FROM...
*DISCUSSION
*LITERATURE CITED

Since its successful use in mapping human diseases (HASTBACKA et al. 1992 Down, HASTBACKA et al. 1994 Down; JORDE et al. 1993 Down), linkage disequilibrium-based mapping methods have become one of the most active areas in current theoretical and empirical genomic research (ALLISON 1997 Down; XIONG and GUO 1997 Down; COLLINS and MORTON 1998 Down; TERWILLIGER and WEISS 1998 Down; KRUGLYAK 1999 Down; MEUWISSEN and GODDARD 2000 Down; PRITCHARD and PRZEWORSKI 2001 Down; REICH et al. 2001 Down; WEISS and CLARK 2002 Down). A major advantage of disequilibrium mapping is that it can potentially provide high-resolution mapping of actual genes responsible for variation of biomedically important diseases by exploiting their associations with adjacent markers (HALEY 1999 Down; ZAYKIN et al. 2002 Down). The intense interest in linkage disequilibrium mapping is further stimulated by the advent of high-density SNP maps (WANG et al. 1998 Down; SACHIDANANDAM et al. 2001 Down) and an inherited paucity of power for narrowing the genomic interval carrying disease genes for traditional pedigree-based linkage analyses (ARDLIE et al. 2002 Down).

Linkage disequilibrium mapping makes use of cumulative crossovers between markers and the target gene during a large number of generations since the first appearance of the disequilibrium. However, current applications of linkage disequilibrium are limited to genetic diseases in humans that are fairly simple, monogenic, and obey the rule of Mendelian inheritance. In this article, we present a maximum-likelihood-based method for multilocus linkage disequilibrium mapping of complex traits that are controlled by many, genetically interacting, genes. The motivation of our study includes three aspects. First, current disequilibrium-based mapping is based mostly on a one-marker/one-gene model, although TERWILLINGER 1995 Down and GARNER and SLATKIN 2002 Down proposed a likelihood-based method for two-marker haplotype data in disease mapping. The theoretical principle of the one-marker/one-gene model has been extended for fine mapping QTL for quantitative traits (LUO and SUHAI 1999 Down; LUO et al. 2000 Down; WU et al. 2002 Down). However, single-marker likelihood analysis does not adequately make use of information from multiple cosegregated markers. We extend single-marker analysis to multilocus linkage disequilibrium mapping. As demonstrated by the simulations, such an extended model provides a higher resolution of QTL localization by more precisely estimating the linkage disequilibrium between the QTL and markers.

Second, in many organisms, including humans, quantitatively inherited traits are economically or biomedically important. Yet, their inheritance mode can be complex and likely includes multiple genes with interactions at both the gene and population levels (MACKAY 2001 Down; BARTON and KEIGHTLEY 2002 Down). Given the genetic complexity of a quantitative trait, the current one-marker/one-QTL model of LUO and SUHAI 1999 Down, LUO et al. 2000 Down, and WU et al. 2002 Down may be oversimplified. Our method can incorporate the complex inheritance of a quantitative trait by including multiple QTL in the mapping framework. The degree of interaction between two different QTL at the population level is expressed as the coefficient of linkage disequilibrium. With linkage disequilibrium, the segregation of one QTL will depend on the segregation of other QTL. The interaction at the gene level is the epistasis that triggers a direct effect on the phenotype. Gene association and gene interaction have presented much difficulty in genetic research. The statistical method proposed in this study displays power to estimate both linkage disequilibrium and epistasis, although their precise estimation relies upon a large sample size and/or high heritability level. Indeed, given the possible contributions of these two phenomena to the genetic architecture of a complex trait (MACKAY 2001 Down; BARTON and KEIGHTLEY 2002 Down), a reasonable investment in the maintenance of a huge mapping population and its careful management to reduce environmental noises should be well rewarded. Also, with our method and such a mapping population, we will be able to address the current debate, initiated by R. A. Fisher and S. Wright a long time ago, on the role of epistasis in trait development and evolution (WOLF et al. 2000 Down).

Third, one of the critical problems that has not been solved well in previous linkage disequilibrium mapping is the estimation of QTL allele frequencies and disequilibrium coefficients. Unlike gene effect parameters, these population parameters have no explicit closed forms for their solutions in the M step of the EM algorithm. The estimation of these two parameters in previous studies (LUO and SUHAI 1999 Down; LUO et al. 2000 Down; WU et al. 2002 Down) relied upon the resolutions of polynomial log-likelihood equations by other numerical methods, such as the Newton-Raphson iteration. However, this is not efficient from a computational perspective and has difficulty in providing precise estimates of the population parameters. In this article we have derived a closed-form solution for estimating haplotype frequencies, from which the MLEs of QTL allele frequency and linkage disequilibrium are obtained. This theoretical derivation can be regarded as one of our major contributions to linkage disequilibrium mapping and will likely make our method more attractive to real-world applications.

In analyzing two epistatic QTL using multiple markers, we presented two schemes by assuming, respectively, that the two QTL interact at the gene level (epistasis), but not at the population level (no disequilibrium), and that the two QTL interact at both the gene and population levels. Our simulation results suggest that the population and quantitative genetic parameters of QTL in the first epistasis scheme can be estimated with reasonable precision. More interesting, the MLEs of the population genetic parameters for two nonassociated but epistatic QTL are more precise than when only a single QTL is modeled. This result may have practical implications for selecting optimal marker combinations in disequilibrium mapping for a complex trait. However, our simulation results also show that parameter estimation for QTL in the second epistasis scheme has reduced precision because of the inclusion of too many known parameters. To overcome this problem in the second scheme, the use of a much larger sample size is recommended. Other strategies include the development of an analytical method that can typically handle a high dimension of parameter space, like Markov chain Monte Carlo algorithms (MCMC; ROBERT and CASELLA 1999 Down).

The strengths of the Bayesian approach for haplotype analysis have been demonstrated by LIU et al. 2001 Down and NIU et al. 2002 Down. First, it can make a reasonable inference of haplotypes comprising a large number of markers, larger than can be handled by the EM algorithm (ZHANG et al. 2001 Down). Second, the Bayesian approach, implemented with MCMC, is powerful in constructing ancestral haplotypes by considering multiple founders, uncertainty of linkage phases, and incompletion of marker data (LIU et al. 2001 Down). We expect that a similar Bayesian framework can be developed to map QTL based on linkage disequilibrium analysis. However, our method based on the EM algorithm is still needed. It can efficiently analyze a relatively small number of haplotypes and provide accurate information about haplotype construction. Moreover, the results from the EM algorithm can provide prior information for the Bayesian approach with capacity to solve more complicated problems.

Our method can also be modified in other areas. First, we assume an equilibrium population. When this assumption is violated, as is often observed in practice, the introduction of the coefficient of Hardy-Weinberg disequilibrium is necessary. Recently, YANG 2000 Down, YANG 2002 Down proposed a multilocus statistic to examine zygotic associations in nonequilibrium populations. The disequilibria of different degrees due to a single locus or multiple loci can be summarized in such a statistic. It would be interesting to incorporate Yang's zygotic association theory to map QTL in an arbitrary natural population. Second, our method can estimate only linkage disequilibria; however, linkage disequilibrium alone may not be effective to infer the linkage between the marker and QTL. WU and ZENG 2001 Down have developed a new statistical strategy for simultaneously estimating linkage and linkage disequilibrium, and this strategy shows great potential for fine mapping of quantitative trait loci in natural populations (WU et al. 2002 Down). Once our current method is integrated with Wu and Zeng's model, we can test the relationship between linkage and linkage disequilibrium and make linkage disequilibrium mapping a more predictable and powerful approach.


*  ACKNOWLEDGMENTS

This work is partly supported by an Outstanding Young Investigator Award of the National Science Foundation (NSF) of China (30128017), a University of Florida Research Opportunity Fund (02050259), and a University of South Florida Biodefense grant (7222061-12) to R.W., NSF of China grant (30000097) to X.-Y. Lou, and NSF grant (9971586) to G.C. The collection of the SNP and human body height data used in this study was supported by grants from the National Institutes of Health (NIH; HL65729) and Orchid Biosciences to J.A.J. and from the NIH (HL64924) to C.J.P. The publication of this manuscript was approved as journal series no. R-09224 by the Florida Agricultural Experiment Station.

Manuscript received October 7, 2002; Accepted for publication January 16, 2003.


*  LITERATURE CITED
*TOP
*ABSTRACT
*POPULATION GENETIC MODEL
*QUANTITATIVE GENETIC MODEL
*STATISTICAL MODEL
*MONTE CARLO SIMULATION
*A CASE STUDY FROM...
*DISCUSSION
*LITERATURE CITED

ALLISON, D. B., 1997  Transmission disequilibrium tests for quantitative traits. Am. J. Hum. Genet. 60:676-690.[Medline]

ARDLIE, K. G., L. KRUGLYAK, and M. SEIELSTAD, 2002  Patterns of linkage disequilibrium in the human genome. Nat. Rev. Genet. 3:299-309.[Medline]

BARTON, N. H. and P. D. KEIGHTLEY, 2002  Understanding quantitative genetic variation. Nat. Rev. Genet. 3:11-21.[Medline]

BENNETT, J. H., 1954  On the theory of random mating. Ann. Eugen. 18:311-317.

CHURCHILL, G. A. and R. W. DOERGE, 1994  Empirical threshold values for quantitative trait mapping. Genetics 138:963-971.[Abstract]

COCKERHAM, C. C., 1980  Random and fixed effects in plant genetics. Theor. Appl. Genet. 56:119-131.

COLLINS, A. and N. E. MORTON, 1998  Mapping a disease locus by allelic association. Proc. Natl. Acad. Sci. USA 95:1741-1745.[Abstract/Free Full Text]

DEMPSTER, A. P., N. M. LAIRD, and D. B. RUBIN, 1977  Maximum likelihood from incomplete data via EM algorithm. J. R. Stat. Soc. Ser. B 39:1-38.

FISHER, R. A., 1918  The correlation between relatives on the supposition of Mendelian inheritance. Proc. R. Soc. Edinb. 52:399-433.

GARNER, C. and M. SLATKIN, 2002  Likelihood-based disequilibrium mapping for two-marker haplotype data. Theor. Popul. Biol. 61:153-161.[Medline]

HALEY, C., 1999 Advances in quantitative trait locus mapping, pp. 47–59 in From J. L. Lush to Genomics: Visions for Animal Breeding and Genetics, edited by J. C. M. DEKKERS, S. J. LAMONT and M. F. ROTHSCHILD. Iowa State University Press, Ames, IA.

STBACKA, J., A. DE LA CHAPELLE, I. KAITILA, P. SISTONEN, and A. WEAVER et al., 1992  Linkage disequilibrium mapping in isolated founder populations: diastropic dysplasia in Finland. Nat. Genet. 2:204-221.[Medline]

STBACKA, J., A. DE LA CHAPELLE, M. MAHTANI, G. CLINES, and M. P. REEVE-DALY et al., 1994  The diastropic dysplasia gene encodes a novel sulfate transporter: positional cloning by fine-structure linkage disequilibrium mapping. Cell 78:1073-1087.[Medline]

HOESCHELE, I., 2000 Mapping quantitative trait loci in outbred pedigrees, pp. 599–644 in Handbook of Statistical Genetics, edited by D. J. BALDING, M. BISHOP and C. CANNINGS. John Wiley & Sons, New York.

HUDSON, R. R., 2000 Linkage disequilibrium and recombination, pp. 309–324 in Handbook of Statistical Genetics, edited by D. J. BALDING, M. BISHOP and C. CANNINGS. John Wiley & Sons, New York.

JANSEN, R. C., 2000 Quantitative trait loci in inbred lines, pp. 567–597 in Handbook of Statistical Genetics, edited by D. J. BALDING, M. BISHOP and C. CANNINGS. John Wiley & Sons, New York.

JORDE, L. B., W. S. WATKINS, D. VISKOCHIL, P. OCONNELL, and K. WARD, 1993  Linkage disequilibrium in the neurofibromatosis-I (NFI) region: implications for gene mapping. Am. J. Hum. Genet. 53:1038-1050.[Medline]

KRUGLYAK, L., 1