Genotyping Error Detection Through Tightly Linked Markers
Guohua Zou, Deyun Pan, and Hongyu Zhao
Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, Connecticut 06520-8034
Corresponding author: Hongyu Zhao, Yale University School of Medicine, 60 College St., New Haven, CT 06520-8034. E-mail: hongyu.zhao{at}yale.edu
Communicating editor: Y.-X. FU
| ABSTRACT |
|---|
The identification of genotyping errors is an important issue in mapping complex disease genes. Although it is common practice to genotype multiple markers in a candidate region in genetic studies, the potential benefit of jointly analyzing multiple markers to detect genotyping errors has not been investigated. In this article, we discuss genotyping error detections for a set of tightly linked markers in nuclear families, and the objective is to identify families likely to have genotyping errors at one or more markers. We make use of the fact that recombination is a very unlikely event among these markers. We first show that, with family trios, no extra information can be gained by jointly analyzing markers if no phase information is available, and error detection rates are usually low if Mendelian consistency is used as the only standard for checking errors. However, for nuclear families with more than one child, error detection rates can be greatly increased with the consideration of more markers. Error detection rates also increase with the number of children in each family. Because families displaying Mendelian consistency may still have genotyping errors, we calculate the probability that a family displaying Mendelian consistency has correct genotypes. These probabilities can help identify families that, although showing Mendelian consistency, may have genotyping errors. In addition, we examine the benefit of available haplotype frequencies in the general population on genotyping error detections. We show that both error detection rates and the probability that an observed family displaying Mendelian consistency has correct genotypes can be greatly increased when such additional information is available.
THE problem of genotyping errors has received much attention in human genetics because of its importance in the analysis and interpretation of genetic data from linkage and association studies.
Mendelian consistency is the most common criterion for identifying genotyping errors. Families that fail the Mendelian-consistency check should be flagged for error checking. In the case of single markers, [...]
| METHODS |
|---|
In this section, we discuss our methods for deriving analytical results of error detection rates for two tightly linked markers and the probability that a family displaying Mendelian consistency has correct genotypes for family trios with one or two markers. We then outline our simulation procedures for nuclear families with multiple children, multiple markers, and multiple alleles.
Error detection rates for family trios with two markers:
Consider family trios where each individual is typed at two biallelic markers. The two markers, denoted by $A$ and $B$, have alleles A1/A2 and B1/B2, respectively. For simplicity, in the following discussion we denote A1 and A2 by 1 and 2, respectively, and similarly denote B1 and B2 by 1 and 2, respectively.
We use 2 x 2 matrices to denote two-marker diploid genotype data, where elements in each column represent the two alleles at the same marker. When phase information is known, elements in the same row represent the alleles on the same chromosome, and the two rows are exchangeable. For example, matrices
$$\begin{pmatrix} 1 & 1 \\ 2 & 2 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} 2 & 2 \\ 1 & 1 \end{pmatrix}$$
both represent an individual with one chromosome carrying (11) and one chromosome carrying (22). We make no distinction between these two matrices in the following discussion. In addition, no distinction is made between parent 1 and parent 2. For example, the trio genotypes

and

are regarded as equivalent. For a family with genotype M, we define the conjugate of M, denoted by $\bar{M}$, as the genotype with each 1 in M replaced by 2 and each 2 in M replaced by 1.
We now define Mendelian consistency for two or more markers. If phase is known in both parents, let (HP1, HP2) denote the two haplotypes in the father, and (HM1, HM2) denote the two haplotypes in the mother. For tightly linked markers, we say the trio is Mendelian consistent if the child has one of the following genotypes, (HP1, HM1), (HP1, HM2), (HP2, HM1), or (HP2, HM2); i.e., each parent passes one of the two whole haplotypes intact to the offspring. We expect this to be the case, in general, for tightly linked markers as recombinations are unlikely among them.
However, phase information is usually unknown. In this case, genotypes

are not distinguishable. It is important to keep this in mind when we calculate the error detection rate and the probability that a family trio displaying Mendelian consistency has correct genotypes. For phase-unknown data, there may be multiple haplotype sets in the parents that are consistent with the observed genotypes across the set of markers. In this case, we say that the trio is Mendelian consistent if one of the haplotype sets is Mendelian consistent in the sense defined above for phase-known data.
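The phase-unknown definition above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it enumerates every haplotype pair compatible with each unphased genotype and declares the trio consistent if some choice of parental phases lets each parent pass one whole haplotype to the child. The function names and the tuple-based genotype encoding are assumptions made for this example.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Return all unordered haplotype pairs compatible with an unphased genotype.

    genotype: tuple of (allele, allele) pairs, one pair per marker.
    Each returned pair is a sorted tuple of two haplotypes (tuples of alleles).
    """
    pairs = set()
    # For each marker, choose which allele sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = tuple(g[c] for g, c in zip(genotype, choice))
        h2 = tuple(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return pairs


def trio_consistent(parent1, parent2, child):
    """True if some phasing of the parents transmits whole haplotypes to the child."""
    child_pairs = haplotype_pairs(child)
    for pair1 in haplotype_pairs(parent1):
        for pair2 in haplotype_pairs(parent2):
            for a in pair1:
                for b in pair2:
                    if tuple(sorted((a, b))) in child_pairs:
                        return True
    return False


# A double heterozygote has two possible phasings: (11)/(22) and (12)/(21).
double_het = ((1, 2), (1, 2))
n_phasings = len(haplotype_pairs(double_het))

# Parents homozygous for haplotypes (11) and (22) can only have a (11)/(22) child.
ok = trio_consistent(((1, 1), (1, 1)), ((2, 2), (2, 2)), double_het)
bad = trio_consistent(((1, 1), (1, 1)), ((1, 1), (1, 1)), ((1, 2), (1, 1)))
```

The enumeration grows as 2^k per individual, so this brute-force form is only practical for a handful of markers.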
For genotyping errors, we assume that errors are introduced independently. At marker $A$, the genotyping error rate from true allele 1 to erroneous allele 2 is e1 and from true allele 2 to erroneous allele 1 is e2. At marker $B$, the genotyping error rates from 1 to 2 and from 2 to 1 are ε1 and ε2, respectively. This general error model includes the stochastic error model (e1 = e2 = ε1 = ε2) and the directed error model (e2 = ε2 = 0) as special cases.
For each trio genotype M, 0 to 12 errors may be introduced for two markers in a family trio (three individuals, each with two alleles at each of two markers). We say a family has undetected errors if the observed trio is Mendelian consistent despite containing errors. The probability that the errors are not detected via a Mendelian-consistency check, when there is at least one error, is
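$$P(\text{undetected errors} \mid \text{at least one error}) = \sum_{i=1}^{12} P(\text{undetected errors} \mid i \text{ errors})\, P(i \text{ errors} \mid \text{at least one error}), \tag{1}$$
where the first conditional probability can be calculated as
$$P(\text{undetected errors} \mid i \text{ errors}) = \frac{\sum_{M \in S} P(\text{undetected errors} \mid i \text{ errors in } M)\, P(M)\, P(i \text{ errors in } M)}{\sum_{M \in S} P(M)\, P(i \text{ errors in } M)},$$
where S is the set of all family trio genotypes. Thus, Equation 1 is simplified to
$$P(\text{undetected errors} \mid \text{at least one error}) = \frac{\sum_{i=1}^{12} \sum_{M \in S} P(\text{undetected errors} \mid i \text{ errors in } M)\, P(M)\, P(i \text{ errors in } M)}{1 - P(\text{no error in trio})}. \tag{2}$$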
The calculations of the probabilities that i errors exist in genotype M and that no error exists in the trio in Equation 2 are discussed in Appendix A. Note that for the stochastic error model, the probabilities that any trio genotype has i errors are the same. In this case, Equation 2 reduces to that in [...]
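As a numerical cross-check of the quantity described above, the probability of undetected errors given at least one error can be obtained by direct enumeration instead of the closed-form sums. The sketch below is not the authors' derivation. It assumes random mating with given haplotype frequencies, no recombination, and the independent allele-level error model of this section, and it relies on the result stated later in the article (Appendix D) that a phase-unknown trio is jointly consistent exactly when it is consistent at each marker. All function and parameter names are illustrative.

```python
from functools import lru_cache
from itertools import product


def marker_consistent(father, mother, child):
    """Single-marker Mendelian check on unordered allele pairs."""
    return any(sorted((a, b)) == sorted(child) for a in father for b in mother)


@lru_cache(maxsize=None)
def marker_probs(true_alleles, err12, err21):
    """Return (P(observed marker is consistent), P(no error at this marker)).

    true_alleles: six true alleles (father 2, mother 2, child 2) at one marker.
    err12 / err21: error rates for true allele 1 -> 2 and true allele 2 -> 1.
    """
    p_clean = 1.0
    for a in true_alleles:
        p_clean *= (1 - err12) if a == 1 else (1 - err21)
    p_cons = 0.0
    for flips in product((0, 1), repeat=6):
        prob, obs = 1.0, []
        for a, flip in zip(true_alleles, flips):
            rate = err12 if a == 1 else err21
            prob *= rate if flip else 1 - rate
            obs.append(3 - a if flip else a)
        if marker_consistent(obs[0:2], obs[2:4], obs[4:6]):
            p_cons += prob
    return p_cons, p_clean


def undetected_given_error(hap_freq, e1, e2, eps1, eps2):
    """P(trio is Mendelian consistent | at least one genotyping error)."""
    haps = list(hap_freq)
    p_consistent = 0.0
    p_no_error = 0.0
    for f1, f2, m1, m2 in product(haps, repeat=4):
        w_parents = hap_freq[f1] * hap_freq[f2] * hap_freq[m1] * hap_freq[m2]
        if w_parents == 0:
            continue
        # Each parent transmits one whole haplotype (no recombination).
        for hf, hm in product((f1, f2), (m1, m2)):
            w = w_parents / 4
            cons_a, clean_a = marker_probs(
                (f1[0], f2[0], m1[0], m2[0], hf[0], hm[0]), e1, e2)
            cons_b, clean_b = marker_probs(
                (f1[1], f2[1], m1[1], m2[1], hf[1], hm[1]), eps1, eps2)
            p_consistent += w * cons_a * cons_b
            p_no_error += w * clean_a * clean_b
    # Error-free trios are always consistent, so subtract them out.
    return (p_consistent - p_no_error) / (1 - p_no_error)


# Illustrative inputs: linkage equilibrium, equal allele frequencies, 1% errors.
equal_freq = {(1, 1): 0.25, (1, 2): 0.25, (2, 1): 0.25, (2, 2): 0.25}
undetected = undetected_given_error(equal_freq, 0.01, 0.01, 0.01, 0.01)
detection_rate = 1 - undetected
```

The function is undefined when all error rates are zero, because the conditioning event (at least one error) then has probability zero.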
In our calculations, we first calculate P(undetected errors | i errors in M) for 1 ≤ i ≤ 6; for 7 ≤ i ≤ 12, we obtain the probabilities through the conjugate genotype $\bar{M}$ and the following lemma. Results similar to Lemma 1(i) were derived by [...] (1 ≤ i ≤ 6) are available from the authors upon request.
- LEMMA 1. (i) For any trio genotype M and for any i, 0 ≤ i ≤ 12, we have (a) P(undetected errors | i errors in $\bar{M}$) = P(undetected errors | i errors in M) and (b) P(undetected errors | i errors in M) = P(undetected errors | 12 - i errors in $\bar{M}$).
- (ii) If P(M) = f(p11, p12, p21, p22), where f is a function of p11, p12, p21, and p22, and pij is the frequency of haplotype ij (i, j = 1, 2), then P($\bar{M}$) = f(p22, p21, p12, p11).
Error detection rates for nuclear families with more than one child and more than two markers:
For the general case of multiple children, multiple markers, and multiple alleles, we conduct simulation studies to obtain error detection rates as follows:
1. We generate the genotypes of the parents according to a set of haplotype frequencies pi1...ik (i1 = 1, ... , I1; ... ; ik = 1, ... , Ik), where k is the number of tightly linked markers, and Ij is the number of alleles at marker j. On the basis of parental haplotypes, we simulate haplotype pairs in the children by randomly assigning one of the two haplotypes in each parent to each child. Then we introduce errors independently into the alleles of parents and children according to a given error model. On the basis of the resulting genotypes for the parents in the nuclear family, we obtain all haplotype pairs that are consistent with the genotypes of the parents.
2. We number the children in the family by the number of homozygous sites; e.g., after numbering, child 1 in the family has the largest number of markers with two identical copies of an allele. For the first child, we consider all possible haplotype pairs that are consistent with this child's genotype. For each haplotype pair in the consistent haplotype pair set, we use the procedure described in Appendix B to (a) identify whether this pair is consistent with the parents' possible haplotype pairs and (b) if yes, determine possible haplotype pairs for other children based on this pair. If none of the haplotype pairs for the first child is consistent with the parents' haplotype pairs, then we say we have detected genotyping errors. Otherwise, we collect all the possible haplotype pairs for other children based on the first child and call this set C1.
3. Consider child 2 in the family. If no haplotype pairs consistent with this child's genotype belong to C1, then genotyping errors are detected. Otherwise, discard the haplotype pairs that are not consistent with the second child's genotype and call the remaining set C2.
4. Repeat steps 2 and 3 until the nth child (assuming this family has n children) is checked and we end up with a set Cn. If Cn is empty, the errors are detected. Otherwise, the whole family is consistent with Mendelian inheritance.
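The simulation steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' program: it handles biallelic markers only, uses the stochastic error model with a single rate `err`, and replaces the child-by-child procedure of Appendix B with a brute-force search over parental phasings. The names `phases`, `family_consistent`, and `simulate_detection_rate` are hypothetical.

```python
import random
from itertools import product


def phases(genotype):
    """Unordered haplotype pairs compatible with an unphased multi-marker genotype."""
    out = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = tuple(g[c] for g, c in zip(genotype, choice))
        h2 = tuple(g[1 - c] for g, c in zip(genotype, choice))
        out.add(tuple(sorted((h1, h2))))
    return out


def family_consistent(father, mother, children):
    """True if one phasing of both parents explains every child without recombination."""
    child_phases = [phases(c) for c in children]
    for f in phases(father):
        for m in phases(mother):
            allowed = {tuple(sorted((a, b))) for a in f for b in m}
            if all(cp & allowed for cp in child_phases):
                return True
    return False


def simulate_detection_rate(hap_freq, n_children, err, n_sims, seed=0):
    """Estimate P(family flagged as inconsistent | at least one allele error)."""
    rng = random.Random(seed)
    haps, weights = zip(*hap_freq.items())
    detected = with_error = 0
    for _ in range(n_sims):
        f = rng.choices(haps, weights, k=2)
        m = rng.choices(haps, weights, k=2)
        people = [f, m] + [[rng.choice(f), rng.choice(m)] for _ in range(n_children)]
        observed, any_error = [], False
        for h1, h2 in people:
            geno = []
            for a, b in zip(h1, h2):
                pair = []
                for allele in (a, b):
                    if rng.random() < err:  # flip 1 <-> 2 with probability err
                        allele, any_error = 3 - allele, True
                    pair.append(allele)
                geno.append(tuple(pair))
            observed.append(tuple(geno))
        if any_error:
            with_error += 1
            detected += not family_consistent(observed[0], observed[1], observed[2:])
    return detected / with_error if with_error else float("nan")


# Illustrative run: two markers, four equally frequent haplotypes, two children.
freqs = {(1, 1): 0.25, (1, 2): 0.25, (2, 1): 0.25, (2, 2): 0.25}
rate = simulate_detection_rate(freqs, n_children=2, err=0.05, n_sims=300)
```

Sorting the children by number of homozygous sites, as in step 2, only affects speed, so the brute-force check omits it.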
To estimate error detection rates, we base our results on 100,000 simulations for single markers, 10,000 simulations when the number of markers is two or three, and 5000 simulations when there are four markers. Different numbers of simulations are used because the true error detection rates vary according to the number of markers, with lower detection rates for smaller numbers of markers. Therefore, a larger number of simulations are necessary when the number of markers is smaller.
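The reasoning about simulation counts can be made concrete with the usual binomial standard error of a Monte Carlo proportion. In the sketch below, the two detection rates (0.05 and 0.5) are hypothetical values chosen only to illustrate the trade-off; they are not taken from the article's tables.

```python
import math


def relative_standard_error(rate, n_sims):
    """Standard error of an estimated proportion divided by the proportion itself."""
    return math.sqrt((1 - rate) / (rate * n_sims))


# A low detection rate with many simulations versus a high rate with few.
rse_low_rate = relative_standard_error(0.05, 100_000)
rse_high_rate = relative_standard_error(0.5, 5_000)
```

Under these assumed rates the two relative errors are of similar size, which is the sense in which a smaller detection rate calls for more simulations.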
Probability that a family trio displaying Mendelian consistency has correct genotypes:
In addition to calculating error detection rates, another quantity that is of relevance is the probability that an observed trio displaying Mendelian consistency has correct genotypes. We first discuss the single-marker case. There are a total of nine trio genotypes with one marker, which is denoted by S0. We use similar notation on trio genotypes as in the two-marker case. For example, the following trio genotypes are considered equivalent:

With similar genotyping error models, we can derive the probability that i errors are introduced in the trio in the one-marker case. Note that "an observed trio has correct genotypes" is not equivalent to "there is no genotyping error in the trio." For example, for an individual with genotype 12 at one marker, there may be two errors with 1 → 2 and 2 → 1, but the observed genotype is true.
For an observed genotype M that is Mendelian consistent, the probability that it is the true genotype is given by
$$P(T = M \mid O = M) = \frac{P(O = M \mid T = M)\, P(T = M)}{P(O = M)}, \tag{3}$$
where P(T = M) is the probability that the true trio genotype is M, and P(O = M) is the probability that the observed trio genotype is M, which is
$$P(O = M) = \sum_{M' \in S_0} P(O = M \mid T = M')\, P(T = M'), \tag{4}$$
where the set S0 was defined above. An example for calculating the conditional probability P(T = M|O = M) is provided in Appendix C.
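The Bayes calculation above can be illustrated for a single biallelic marker by enumerating every true trio and every pattern of allele errors. This is a sketch under assumptions that the text does not spell out here: Hardy-Weinberg parental genotypes with allele-1 frequency `p`, and independent allele errors with rates `e1` (1 to 2) and `e2` (2 to 1). Because genotypes are compared after sorting alleles, two cancelling errors in a heterozygote count as a correct genotype, as noted above. The helper names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def canonical(alleles):
    """Unordered trio genotype: sorted alleles per person, parents interchangeable."""
    f, m, c = (tuple(sorted(alleles[i:i + 2])) for i in (0, 2, 4))
    p1, p2 = sorted((f, m))
    return (p1, p2, c)


def trio_posteriors(p, e1, e2):
    """Return ({M: P(T = M | O = M)}, overall P(correct | Mendelian consistent))."""
    freq = {1: p, 2: 1 - p}
    joint = defaultdict(float)  # keyed by (true genotype, observed genotype)
    for f1, f2, m1, m2 in product((1, 2), repeat=4):
        w_parents = freq[f1] * freq[f2] * freq[m1] * freq[m2]
        for cf, cm in product((f1, f2), (m1, m2)):
            true = (f1, f2, m1, m2, cf, cm)
            for flips in product((0, 1), repeat=6):
                prob = w_parents / 4
                obs = []
                for a, flip in zip(true, flips):
                    rate = e1 if a == 1 else e2
                    prob *= rate if flip else 1 - rate
                    obs.append(3 - a if flip else a)
                joint[canonical(true), canonical(obs)] += prob
    p_obs = defaultdict(float)
    for (_, o), pr in joint.items():
        p_obs[o] += pr
    # True trios are exactly the Mendelian-consistent genotypes (for 0 < p < 1).
    consistent = {t for t, _ in joint}
    post = {m: joint[m, m] / p_obs[m] for m in consistent if p_obs[m] > 0}
    overall = (sum(joint[m, m] for m in consistent)
               / sum(p_obs[m] for m in consistent))
    return post, overall


posteriors, overall = trio_posteriors(p=0.5, e1=0.01, e2=0.01)
```

The second return value corresponds to pooling over all Mendelian-consistent genotypes, which is the overall probability discussed in the next paragraph.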
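In general, in addition to calculating P(T = M|O = M) for a given genotype, we can also calculate the overall probability that a Mendelian-consistent family trio has correct genotypes by summing over all possible genotypes:
$$P(\text{correct genotypes} \mid \text{Mendelian consistency}) = \sum_{M \in C} P(T = M \mid O = M)\, P(O = M \mid O \in C), \tag{5}$$
where $C$ denotes the set of Mendelian-consistent trio genotypes. It is readily seen that Equation 5 can also be expressed as
$$\frac{\sum_{M \in C} P(O = M \mid T = M)\, P(T = M)}{\sum_{M \in C} P(O = M)}.$$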
In deriving the probabilities, we use the following lemma for the one-marker case.
- LEMMA 2. (i) Let P(O = M0|T = M) = u(e1, e2). Then (a) P(O = $\bar{M}_0$|T = M) = u(1 - e1, 1 - e2) and (b) P(O = M0|T = $\bar{M}$) = u(1 - e2, 1 - e1).
- (ii) Let P(T = M0|O = M0) = v(p, e1, e2). Then P(T = $\bar{M}_0$|O = $\bar{M}_0$) = v(q, e2, e1), where p is the frequency of allele 1 and q = 1 - p.
For the case of two markers, 125 distinct trio genotypes display Mendelian consistency in the absence of phase information. The general results (3) and (4) still hold. In the calculation of terms in (3) and (4), using the same genotyping error model discussed before, we have
- LEMMA 3. (i) Let P(O = M0|T = M) = g(e1, e2, ε1, ε2). Then (a) P(O = $\bar{M}_0$|T = M) = g(1 - e1, 1 - e2, 1 - ε1, 1 - ε2) and (b) P(O = M0|T = $\bar{M}$) = g(1 - e2, 1 - e1, 1 - ε2, 1 - ε1).
- (ii) Let P(T = M0|O = M0) = h(p11, p12, p21, p22, e1, e2, ε1, ε2). Then P(T = $\bar{M}_0$|O = $\bar{M}_0$) = h(p22, p21, p12, p11, e2, e1, ε2, ε1).
| RESULTS |
|---|
Error detection rates for family trios:
Let the frequencies of alleles 1 and 2 at marker $A$ be p1+ and p2+ (= 1 - p1+), respectively. Similarly, we denote the allele frequencies at marker $B$ by p+1 and p+2, respectively. The haplotype frequencies are denoted by p11, p12, p21, and p22, respectively. For different sets of haplotype frequencies, we summarize the results in Table 1 when the error rates are assumed to be the same. The results are qualitatively similar when the error rates differ (data not shown). In addition, we considered the following three cases in more detail.
- Linkage equilibrium: In this case, p11 = p1+p+1, p12 = p1+p+2, p21 = p2+p+1, and p22 = p2+p+2.
- Perfect linkage disequilibrium (equal allele frequencies): In this case, only two of the four possible haplotypes are present in the population. Without loss of generality, we assume that p11 = p1+ = p+1, p12 = 0, p21 = 0, and p22 = p2+ = p+2.
- Complete linkage disequilibrium (unequal allele frequencies): In this case, three of the four haplotypes are present in the population. Without loss of generality, we assume haplotypes 11, 12, and 21 are present.
Table 1. Error detection rates for trios with two markers when e1, e2, ε1, and ε2 are all equal
Table 1 and the results for the above three special cases (data not shown) indicate that, when Mendelian consistency is the criterion for error checking, the error detection rates are generally low when the error rates are low. When the error rates are high (>20%), the error detection rates based on two markers can be significantly higher than those based on single markers, even higher than those for the case of quartet considered by [...] 5%). This is not unexpected, as we can show (Appendix D) that if a trio is Mendelian consistent at each individual marker, then the trio genotype is Mendelian consistent across all the markers, even when multiple tightly linked markers are used. Therefore, error checking through Mendelian consistency offers little more information for family trios in the absence of additional information, e.g., phase and/or population genotypes.
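The equivalence claimed for trios (consistency at each marker implies joint consistency) can be checked exhaustively for two biallelic markers. The sketch below is an independent brute-force check, not the argument of Appendix D: it compares a per-marker test with a joint, phase-unknown test over every ordered trio, and also counts the distinct consistent trios when the two parents are treated as interchangeable. The helper names are illustrative.

```python
from itertools import product

GENOTYPES = [(1, 1), (1, 2), (2, 2)]  # unordered single-marker genotypes


def phases(genotype):
    """Unordered haplotype pairs compatible with an unphased two-marker genotype."""
    out = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = tuple(g[c] for g, c in zip(genotype, choice))
        h2 = tuple(g[1 - c] for g, c in zip(genotype, choice))
        out.add(tuple(sorted((h1, h2))))
    return out


def joint_consistent(p1, p2, child):
    """Each parent passes one whole haplotype under some phasing."""
    kid = phases(child)
    return any(tuple(sorted((a, b))) in kid
               for f in phases(p1) for m in phases(p2)
               for a in f for b in m)


def marker_consistent(g1, g2, gc):
    """Ordinary single-marker Mendelian check."""
    return any(tuple(sorted((a, b))) == gc for a in g1 for b in g2)


two_marker_genotypes = list(product(GENOTYPES, repeat=2))  # 9 per individual
mismatches = 0
consistent_trios = set()
for p1, p2, c in product(two_marker_genotypes, repeat=3):
    per_marker = all(marker_consistent(p1[j], p2[j], c[j]) for j in range(2))
    joint = joint_consistent(p1, p2, c)
    mismatches += per_marker != joint
    if joint:
        # Treat parent 1 and parent 2 as interchangeable.
        consistent_trios.add((tuple(sorted((p1, p2))), c))

n_consistent = len(consistent_trios)
```

With unordered parents the enumeration finds 125 consistent two-marker trio genotypes and no disagreement between the two checks.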
Error detection rates for families with more than one child:
When additional family members are available, joint consideration of two tightly linked markers offers more information than single markers. For example, consider a family with both parents and two children. Let the two-marker quartet genotype be

where the first two matrices denote the parents' genotypes and the last two matrices denote the children's genotypes. It is easy to see that, although the two one-marker quartet genotypes

show Mendelian consistency, the two-marker quartet genotype does not. On the basis of this, it can be expected that if we consider two-marker quartet genotypes, the error detection rate will be increased, as evidenced from our simulation results shown in Table 2 and Table 3, where n and k denote the numbers of children and markers, respectively, and Ij denotes the number of alleles at marker j (j = 1, ... , k).
Our results show that when more than one child is in a nuclear family, the error detection rates can be greatly increased by adding additional markers. Furthermore, the error detection rates increase with the number of children. The rate of increase is the greatest from one child to two children, and there is usually not much difference between having five or six children.
In addition to considering biallelic markers, we also consider markers with multiple alleles and the results are summarized in Table 4. It can be seen that, as expected, the error detection rates are higher for the case of multiple alleles. Comparing the results of Table 4 with those of the third column of Table 3 is interesting because Table 4 is for a marker with eight alleles of equal frequency and column 3 of Table 3 is for a haplotype system with eight haplotypes having equal frequencies. Although there is substantial difference between error detection rates when there is only one child and when there are two children, the difference for the case of multiple children becomes smaller when the number of children is larger.
Probability that a family trio displaying Mendelian consistency has correct genotypes:
In addition to error detection rates, the probability that a family with Mendelian consistency has correct genotypes may be of more relevance as these probabilities will help the investigators to prioritize families for genotyping error checking among Mendelian-consistent families.
Fig 1 shows the probability that a trio with Mendelian consistency has correct genotypes for the case of single markers when the error rates are the same. When the true error rates are low (≤1%), most of the Mendelian-consistent trios have correct genotypes. On the other hand, when the error rates are high (≥20%), most of the Mendelian-consistent trios have incorrect genotypes. A similar observation can be made for different values of e1 and e2 (see Fig 2 and Fig 3). Furthermore, it can be seen from the probability curves for p = 0.1 and 0.5 in Fig 1 that, when e1 = e2, the probability that a trio displaying consistency has correct genotypes is not much affected by allele frequencies.
In the case of two markers, we consider the cases for p11 = p12 = 0.4, p21 = p22 = 0.1, and e1 = e2 = ε1 = ε2 = 0.005; p11 = 0.4, p12 = 0.2, p21 = 0.3, p22 = 0.1, and e1 = e2 = ε1 = ε2 = 0.01; and the three special cases considered in error detection rates. Our analytical results (data not shown) indicate that if the error rates are low (≤0.5%), the probability of an observed genotype being true is >95%, which implies that a trio genotype displaying consistency is often true. If the error rates are between 0.5 and 2%, then the trio genotypes still tend to be true. However, if the error rates are large (≥10%), then the probability is often <40%, which means that a trio genotype displaying consistency is usually not true. As in the case of one marker, the probability that an observed genotype displaying consistency is correct is only slightly affected by haplotype frequencies under the stochastic error model.
Probability that a nuclear family with more than one child displaying Mendelian consistency has correct genotypes:
For the general case of multiple children and multiple markers, we conduct simulation studies to estimate the probability that a family displaying Mendelian consistency has correct genotypes. The simulation results are presented in Table 5. It can be seen from the table that when the error rate is 0.01, the probability that a family displaying Mendelian consistency has correct genotypes is high, even though multiple children and multiple markers are considered. Also, except for the case of only one child, these probabilities are very similar when the number of children differs.
Use of population haplotype information:
One way to increase error detection rates is to make use of additional information. Consider the special case of perfect linkage disequilibrium with equal allele frequencies. In this scenario, there are only two possible haplotypes in the population, (11) and (22), and there are a total of 10 possible trio genotypes. Therefore, any pattern differing from these 10 will be identified as caused by genotyping errors. The conditional probabilities that a family trio with i errors (1 ≤ i ≤ 6) will be undetected, i.e., fall into one of the 10 categories, are available from the authors upon request. Note that we can still use Lemma 1 to calculate the probabilities when i is between 7 and 12. When all the error rates are the same, the error detection rates are as summarized in Table 6. As expected, the error detection rates are indeed greatly increased.
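The count of 10 possible trio genotypes under perfect linkage disequilibrium can be reproduced by enumeration. The short sketch below assumes only what the paragraph states: the two haplotypes (11) and (22), whole-haplotype transmission, and interchangeable parents. The encoding of genotypes as tuples is an illustrative choice.

```python
from itertools import combinations_with_replacement, product

HAPLOTYPES = [(1, 1), (2, 2)]  # the only haplotypes present under perfect LD


def genotype(h1, h2):
    """Unphased two-marker genotype formed by two haplotypes."""
    return tuple(tuple(sorted(pair)) for pair in zip(h1, h2))


possible_trios = set()
for father in combinations_with_replacement(HAPLOTYPES, 2):
    for mother in combinations_with_replacement(HAPLOTYPES, 2):
        # The child receives one whole haplotype from each parent.
        for hf, hm in product(father, mother):
            parents = tuple(sorted((genotype(*father), genotype(*mother))))
            possible_trios.add((parents, genotype(hf, hm)))

n_possible = len(possible_trios)  # 10
```

Any observed trio outside `possible_trios` would be flagged as containing a genotyping error when this population information is used.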
Such additional information will also affect the calculation of the probability that a family displaying Mendelian consistency has correct genotypes. Still consider the above case where two markers are in perfect linkage disequilibrium with equal allele frequency. The conditional probability that a Mendelian-consistent trio is true is presented in Table 7. It is apparent that even when the error rates are as high as 10%, the probability that an observed Mendelian-consistent trio is true is quite high.
| DISCUSSION |
|---|
In this article, we investigated genotyping error detections through multiple tightly linked markers in nuclear families. Our error detection rate is calculated using families, not markers, as a unit, with the objective of being able to identify families having genotyping errors. We first calculated the error detection rates for family trios with two markers using an analytical method. We showed that in the absence of phase information, genotyping errors can be detected if and only if there is Mendelian inconsistency at one or more of the markers. This means that only the information on each marker is helpful for detecting genotyping errors. Joint consideration of multiple tightly linked markers will not provide more information. Therefore, the error detection rates will not be greatly increased when the error rates are low. As a result, the error detection rates are generally low if Mendelian consistency is used as the unique criterion for checking errors. However, when more than one child is in a family, joint consideration of tightly linked markers can offer more information than single markers. In fact, the error detection rates can be greatly increased by adding tightly linked markers.
Table 2 and Table 3 reveal different properties between markers with equal and unequal allele frequencies: If the number of markers k is small (≤2) or only one child is in the family, the error detection rates for markers with unequal allele frequencies are greater than those for markers with equal allele frequencies. However, if there are more than two markers and more than one child is in the family, the error detection rates for markers with equal allele frequencies are greater. This is also seen for the case of linkage disequilibrium (data not shown). An explanation for this phenomenon is as follows. For the case of unequal haplotype (allele) frequencies, the genotype of the first child can often be used to detect the error, but for the case of equal haplotype (allele) frequencies, the genotype of the first child is often used to determine the haplotypes of parents and often cannot be used to detect the error except for the case of k = 1. Thus, when n = 1, the error detection rates for unequal haplotype (allele) frequencies are greater. If k is not small, the errors will often be introduced into each of the genotypes of parents and children, and the errors are often easier to detect for the case of equal haplotype (allele) frequencies because when k becomes large, more and more alleles at each marker will be heterozygous for the case of unequal haplotype (allele) frequencies but the genotype at each marker is more likely to change to homozygotes for the case of equal haplotype (allele) frequencies. Let us consider an extreme case of the following two three-marker genotypes in a family with two parents and two children,
![]() |
(6) |
and
![]() |
(7) |
The first configuration is more likely to occur if the allele frequencies are 0.9 and 0.1 at each marker, and the second one is more likely to occur if the allele frequencies are 0.5 and 0.5 at each marker. If only one error is introduced into some marker for each person, say marker 1 for parent 1, marker 2 for parent 2, then when the genotypes of parents in (6) become

the probability that the errors can be detected is $20(1 - \varepsilon)^{10}\varepsilon^{2}$ (where ε is the genotyping error rate from true allele 1 to erroneous allele 2 and from true allele 2 to erroneous allele 1); and when the genotypes of parents in (7) become

the probability of error detection is $22(1 - \varepsilon)^{10}\varepsilon^{2}$. The difference between the former and the latter is $-2(1 - \varepsilon)^{10}\varepsilon^{2}$. On the other hand, if the family trio is considered [i.e., only one child is considered in (6) and (7)], then the corresponding difference is $2(1 - \varepsilon)^{5}\varepsilon - 2(1 - \varepsilon)^{5}\varepsilon = 0$. Note that when k is small, the possibility that the errors are introduced into each of the genotypes of parents and children is not great. For this case, the possibility that the haplotypes of parents can be determined through the first child is small.
We have also examined error detections for multiallelic markers and the error detection rates are greater for equal allele frequencies (see Table 4). This can be readily understood by noting that unlike the case of biallelic markers, for the case of multiallelic markers, the errors in the genotypes of parents have greater effect on error detections. Although haplotypes can be thought of as a multiallelic marker, the error detection rates are lower for a haplotype system than for a multiallelic marker with the same allele frequencies as the set of haplotype frequencies. However, the difference is smaller when a larger number of children are considered in a nuclear family.
The probability formula derived in this article, e.g., the probability that i errors are introduced under the general error model, can be used to calculate error detection rates for other sampling types such as quartet under the general error model. For example, for the case of quartet considered by [...]
In addition to error detection rates, we have also calculated the probability that a family displaying Mendelian consistency has correct genotypes. The calculations of such quantities are useful as they may point to certain families that, although showing Mendelian consistency, are likely to have genotyping errors. The calculations require haplotype frequencies from the population and estimated error rates. A potential application of calculating these probabilities is to conduct transmission/disequilibrium tests in the presence of genotyping errors. We showed that when the error rates are low, the overall probability that a Mendelian-consistent trio has correct genotypes is quite high, and the overall probability is not very sensitive to haplotype frequencies under the stochastic error model. We expect that the number of families showing Mendelian consistency and having correct genotypes decreases with the increase of the number of children in the family and the number of markers. Our simulation results indeed show this property (data not shown). On the other hand, our simulation results show that conditional on a family showing Mendelian consistency, the probability that this family has correct genotypes is not a monotonic function of n and k. We offer an explanation by considering two markers with equal allele frequencies: p1 = p2 = 0.5. Consider the following two-marker genotype,
![]() |
(8) |
which is the most common family configuration. After the errors are introduced, if the parents and the first child show Mendelian consistency and they have correct genotypes, then their genotypes must be

Now we consider the case of adding another child, that is, a quartet (see Equation 8). After the errors are introduced into the genotype of the second child, P(the resulting quartet genotype is Mendelian consistent) = $1 - 4 \times (1 - 0.01)^{3} \times 0.01 - 4 \times (1 - 0.01) \times 0.01^{3} = 0.9612$, and P(the genotype of the second child is correct|the resulting quartet genotype is consistent) = $[(1 - 0.01)^{4} + 2 \times (1 - 0.01)^{2} \times 0.01^{2} + 0.01^{4}]/0.9612 = 0.9996$. Thus, the ratio of the probabilities that the genotype is true and consistent is 1.04. This shows that conditional on Mendelian consistency, the probability of having correct genotypes becomes larger when an additional child is added.
For the case of one marker, we consider the following one-marker genotype:
![]() |
(9) |
After the errors are introduced, if the parents and the first child show Mendelian consistency and they have correct genotypes, then their genotypes must be

If we add one child, then after the errors are introduced into the genotype of the second child (see Equation 9), P(the resulting quartet genotype is consistent) = 1, and P(the genotype of the second child is true|the resulting quartet genotype is consistent) = $1 - 2 \times (1 - 0.01) \times 0.01 = 0.9802$. Thus, the ratio of the probabilities that the genotype is correct and consistent is 0.9802, which means that conditional on Mendelian consistency, the probability of having correct genotypes becomes smaller when an additional child is added.
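The arithmetic in the two-marker and one-marker examples above can be reproduced directly. The snippet below simply evaluates the expressions quoted in the text with an error rate of 0.01; the variable names are illustrative.

```python
e = 0.01  # per-allele error rate used in both examples

# Two markers: adding a second child to the configuration in Equation 8.
p_consistent_two = 1 - 4 * (1 - e) ** 3 * e - 4 * (1 - e) * e ** 3
p_correct_two = ((1 - e) ** 4 + 2 * (1 - e) ** 2 * e ** 2 + e ** 4) / p_consistent_two
ratio_two = p_correct_two / p_consistent_two

# One marker: adding a second child to the configuration in Equation 9.
p_consistent_one = 1.0
p_correct_one = 1 - 2 * (1 - e) * e
ratio_one = p_correct_one / p_consistent_one

print(round(p_consistent_two, 4), round(p_correct_two, 4), round(ratio_two, 2))
print(round(p_correct_one, 4), round(ratio_one, 4))
```

The rounded values match those quoted above: 0.9612, 0.9996, and 1.04 for two markers, and 0.9802 for one marker.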
If the phase information is known, errors can be detected although each individual marker shows Mendelian consistency. For example, consider a family whose individual marker genotypes are

and their haplotypes are
![]() |
(10) |
Although each marker is Mendelian consistent, the joint haplotypes are not, unless we assume there is a recombination event between these two tightly linked markers. Therefore, when phase information is available, error detection rates may be improved. However, phase information may be difficult to obtain except through some molecular techniques. Instead, we have examined the benefit of perfect linkage disequilibrium information, which can be regarded as partial haplotype information in genotyping error detections. We have considered a situation where only two of the haplotypes are known to exist in a given population. In this case, utilizing this information may significantly increase the chance to detect errors through tightly linked markers and increase the confidence that a Mendelian-consistent trio has correct genotypes, and this line of research is worth pursuing.
In this article, we have considered tightly linked markers by assuming no recombination events among these markers. If we allow the occurrence of recombinations, there would be little benefit from using a Mendelian-consistency check as the only criterion for identifying families with genotyping errors. However, if reliable estimates of recombination fractions among these markers are available, we can calculate the probability for each family, incorporating recombination fraction information as well as population haplotype frequency information if it is available. Therefore, although fewer families can be detected as having genotyping errors purely on the basis of Mendelian-consistency check, we are still able to order families by the likelihoods of their genotypes and pursue those with very small likelihoods to be observed.
| ACKNOWLEDGMENTS |
|---|
The authors are grateful to the two reviewers for their valuable comments and suggestions, which greatly improved the original manuscript. This work was supported in part by grant GM59507 from the National Institutes of Health.
Manuscript received May 17, 2002; Accepted for publication March 19, 2003.
| APPENDIX A |
|---|
THE CALCULATION OF P(i ERRORS IN M)
For family M, let Mi be the number of errors at marker i, where i = 1, 2. Then we have
[equation image not available] (A1)
Let Mij denote the number of allele j's errors at marker i in M, i, j = 1, 2. For example, M12 is the number of allele 2's errors, i.e., from true allele 2 to erroneous allele 1, at marker 1 in M. Then M1j ~ Binomial(N1j, ej), and M2j ~ Binomial(N2j, [symbol not available]j), j = 1, 2, where Nij is the number of allele j at marker i in M (i, j = 1, 2). Note that Mij = 0 when Nij = 0. Then,
[equation image not available] (A2)
(Here we define the binomial coefficient $\binom{N}{n} = 0$ if N < n.) Similarly,
[equation image not available] (A3)
Substituting (A2) and (A3) in (A1), we obtain
[equation image not available] (A4)
In particular,
[equation image not available]
Thus,
[equation image not available]
Noting that
[equation image not available]
and
[equation image not available]
we see that for the stochastic error model, (A4) reduces to
[equation image not available]
For the directed error model, we have
[equation image not available]
| APPENDIX B |
|---|
In the following, we describe how to determine whether a haplotype pair consistent with child 1 is also consistent with both parents.
Let $(h_1, h_2)_0$ denote a consistent haplotype pair of the first child, where $(\cdot)_0$ means phase information is known. Further, if hF (hM) is a haplotype of the father (the mother), let $\bar{h}_F$ ($\bar{h}_M$) denote the complementary haplotype in the sense that $\bar{h}_F$ ($\bar{h}_M$) consists of the remaining alleles of the father (the mother).
(1) If h1 is consistent with the father but not the mother, then h2 has to be consistent with the mother unless there are genotyping errors. Thus, $\bar{h}_F$ can be determined by h1 and the genotype of the father and $\bar{h}_M$ can be determined by h2 and the genotype of the mother. Hence, possible haplotype pairs for the children in this family determined by such $(h_1, h_2)_0$ are
[equation image not available]
(2) If h1 is consistent with the mother but not the father, then h2 has to be consistent with the father unless there are genotyping errors. Thus, possible haplotype pairs for the children in this family determined by $(h_1, h_2)_0$ are
[equation image not available]
(3) If h1 is consistent with both the father and the mother, then when h2 is consistent with the father but not the mother, possible haplotype pairs for the children in this family determined by $(h_1, h_2)_0$ are the same as those in possibility 2. When h2 is consistent with the mother but not the father, possible haplotype pairs for the children in this family determined by $(h_1, h_2)_0$ are the same as those in possibility 1. When h2 is consistent with both the father and the mother, possible haplotype pairs for the children in this family determined by $(h_1, h_2)_0$ are
[equation image not available]
and
[equation image not available]
| APPENDIX C |
|---|
As an example, we calculate P(T = M|O = M), where
[equation image not available]
It can be shown that
[equation image not available]
Similarly, we can get the other conditional probability P(O = M|T = M') (M' ≠ M). Substituting these formulas into (4) and using the values of P(T = M), we obtain
[equation image not available]
where p is the population frequency of allele 1, and q = 1 - p. Thus, we have
[equation image not available]
| APPENDIX D |
|---|
THEOREM. If there is no phase information and each marker in a set of tightly linked markers is Mendelian consistent, the trio is Mendelian consistent across these markers.
Proof. Let [symbol not available], [symbol not available], and [symbol not available] denote the genotypes for the two parents and the child across a set of k markers, where
[equation image not available]
Let cpi be the allele consistent with one of the two alleles ai1 and ai2 in the father and cmi be the allele consistent with one of the two alleles bi1 and bi2 in the mother. Then we would infer that one of the haplotypes in the father is (cp1cp2 ... cpk) and one of the haplotypes in the mother is (cm1cm2 ... cmk). It is easy to see that such inference would imply Mendelian consistency for this family trio without recombinations among these markers.
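The construction used in the proof can be sketched in Python. This is a minimal illustration under the same no-recombination assumption, not the authors' code. Genotypes are lists of per-marker allele pairs, and the function names and example trio are hypothetical. At each marker the code picks one paternal allele and one maternal allele that together give the child's genotype; stringing these choices together across markers yields one transmitted haplotype per parent.

```python
def marker_assignments(father, mother, child):
    """All (paternal allele, maternal allele) choices that explain the child at one marker."""
    return [
        (cp, cm)
        for cp in sorted(set(father))
        for cm in sorted(set(mother))
        if sorted((cp, cm)) == sorted(child)
    ]


def infer_transmitted_haplotypes(fathers, mothers, children):
    """Return (paternal haplotype, maternal haplotype) built marker by marker.

    Returns None if any single marker is Mendelian inconsistent.
    Without phase information, any per-marker choice is admissible,
    so the first one found is used.
    """
    paternal, maternal = [], []
    for f, m, c in zip(fathers, mothers, children):
        options = marker_assignments(f, m, c)
        if not options:
            return None
        cp, cm = options[0]
        paternal.append(cp)
        maternal.append(cm)
    return tuple(paternal), tuple(maternal)


# Hypothetical two-marker trio in which each marker is Mendelian consistent.
trio_father = [(1, 2), (1, 1)]
trio_mother = [(2, 2), (1, 2)]
trio_child = [(1, 2), (1, 1)]
haplotypes = infer_transmitted_haplotypes(trio_father, trio_mother, trio_child)
```

Whenever every marker passes the single-marker check, the function returns a haplotype pair, which is the sense in which joint analysis of unphased trio data adds no detection power.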
| LITERATURE CITED |
|---|
AKEY, J. M., K. ZHANG, M. XIONG, P. DORIS, and L. JIN, 2001 The effect that genotyping errors have on the robustness of common linkage-disequilibrium measures. Am. J. Hum. Genet. 68:1447-1456.
BROMAN, K. W. and J. L. WEBER, 1998 Estimation of pairwise relationships in the presence of genotyping errors. Am. J. Hum. Genet. 63:1563-1564.
BUETOW, K. H., 1991 Influence of aberrant observations on high-resolution linkage analysis outcomes. Am. J. Hum. Genet. 49:985-994.
DOUGLAS, J. A., M. BOEHNKE, and K. LANGE, 2000 A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. Am. J. Hum. Genet. 66:1287-1297.
DOUGLAS, J. A., A. D. SKOL, and M. BOEHNKE, 2002 Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. Am. J. Hum. Genet. 70:487-495.
EHM, M. G., M. KIMMEL, and R. W. COTTINGHAM, JR., 1996 Error detection for pedigree data, using likelihood methods. Am. J. Hum. Genet. 58:225-234.
EHM, M. G. and M. WAGNER, 1998 A test statistic to detect errors in sib-pair relationships. Am. J. Hum. Genet. 62:181-188.
GOLDSTEIN, D. R., H. ZHAO, and T. P. SPEED, 1997 The effects of genotyping errors and interference on estimation of genetic distance. Hum. Hered. 47:86-100.
GORDON, D. and J. OTT, 2001 Assessment and management of single nucleotide polymorphism genotype errors in genetic association analysis. Pac. Symp. Biocomput. 6:18-29.
GORDON, D., S. C. HEATH, and J. OTT, 1999 True pedigree errors more frequent than apparent errors for single nucleotide polymorphisms. Hum. Hered. 49:65-70.
GORDON, D., S. M. LEAL, S. C. HEATH, and J. OTT, 2000 An analytic solution to single nucleotide polymorphism error-detection rates in nuclear families: implications for study design. Pac. Symp. Biocomput. 5:663-674.
GORDON, D., S. C. HEATH, X. LIU, and J. OTT, 2001 A transmission/disequilibrium test that allows for genotyping errors in the analysis of single-nucleotide polymorphism data. Am. J. Hum. Genet. 69:371-380.
GÖRING, H. H. H. and J. D. TERWILLIGER, 2000a Linkage analysis in the presence of errors I: complex-valued recombination fractions and complex phenotypes. Am. J. Hum. Genet. 66:1095-1106.
GÖRING, H. H. H. and J. D. TERWILLIGER, 2000b Linkage analysis in the presence of errors II: marker-locus genotyping errors modeled with hypercomplex recombination fractions. Am. J. Hum. Genet. 66:1107-1118.
GÖRING, H. H. H. and J. D. TERWILLIGER, 2000c Linkage analysis in the presence of errors III: marker loci and their map as nuisance parameters. Am. J. Hum. Genet. 66:1298-1309.
GÖRING, H. H. H. and J. D. TERWILLIGER, 2000d Linkage analysis in the presence of errors IV: joint pseudomarker analysis of linkage and/or linkage disequilibrium on a mixture of pedigrees and singletons when the mode of inheritance cannot be accurately specified. Am. J. Hum. Genet. 66:1310-1327.
LINCOLN, S. E. and E. S. LANDER, 1992 Systematic detection of errors in genetic linkage data. Genomics 14:604-610.
O'CONNELL, J. R. and D. E. WEEKS, 1998 PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am. J. Hum. Genet. 63:259-266.
OTT, J., 1993 Detecting marker inconsistencies in human gene mapping. Hum. Hered. 43:25-30.
SHIELDS, D. C., A. COLLINS, K. H. BUETOW, and N. E. MORTON, 1991 Error filtration, interference, and the human linkage map. Proc. Natl. Acad. Sci. USA 88:6501-6505.
SOBEL, E., J. PAPP, and K. LANGE, 2002 Detection of genotyping errors. Am. J. Hum. Genet. 70:496-508.
STRINGHAM, H. M. and M. BOEHNKE, 1996 Identifying marker typing incompatibilities in linkage analysis. Am. J. Hum. Genet. 59:946-950.
TERWILLIGER, J. D., D. E. WEEKS, and J. OTT, 1990 Laboratory errors in the reading of marker alleles cause massive reductions in lod score and lead to gross overestimates of the recombination fraction. Am. J. Hum. Genet. 47(Suppl.):A201.