Genetics, Vol. 154, 1053-1068, March 2000, Copyright © 2000

Trinucleotide Repeats Are Clustered in Regulatory Genes in Saccharomyces cerevisiae

Elton T. Young, James S. Sloan, and Kristen Van Riper
Department of Biochemistry, University of Washington, Seattle, Washington 98195-7350

Corresponding author: Elton T. Young, Department of Biochemistry, University of Washington, Box 357350, Seattle, WA 98195-7350. E-mail: ety@u.washington.edu

Communicating editor: L. S. SYMINGTON


ABSTRACT

The genome of Saccharomyces cerevisiae contains numerous unstable microsatellite sequences. Mononucleotide and dinucleotide repeats are rarely found in ORFs, and when present in an ORF are frequently located in an intron or at the C terminus of the protein, suggesting that their instability is deleterious to gene function. DNA trinucleotide repeats (TNRs) are found at a higher-than-expected frequency within ORFs, and the amino acids encoded by the TNRs represent a biased set. TNRs are rarely conserved between genes with related sequences, suggesting high instability or a recent origin. The genes in which TNRs are most frequently found are related to cellular regulation. The protein structural database is notably lacking in proteins containing amino acid tracts, suggesting that they are not located in structured regions of a protein but are rather located between domains. This conclusion is consistent with the location of amino acid tracts in two protein families. The preferred location of TNRs within the ORFs of genes related to cellular regulation together with their instability suggest that TNRs could have an important role in speciation. Specifically, TNRs could serve as hot spots for recombination leading to domain swapping, or mutation of TNRs could allow rapid evolution of new domains of protein structure.


REPETITIVE DNA sequences are a hallmark of eukaryotic genomes. Their instability during transmission creates polymorphic alleles that are widely used in population genetics, medicine, and forensic analysis. Microsatellites are a class of repetitive DNA sequences consisting of 1–10 nucleotides tandemly repeated many times (for review see SIA et al. 1997A). They are dispersed in the genome and are usually present in tracts of 50 bp or less. They play a role in human disease, both as causative agents in the trinucleotide repeat (TNR) diseases (GUSELLA and MACDONALD 1996; PAULSON and FISCHBECK 1996; SIA et al. 1997A) and as indicators of error-prone phenotypes caused by mutations in other genes, such as those that cause human nonpolyposis colon cancer (LENGAUER et al. 1997).

Speculations about the function of repetitive sequences range from their being selfish DNA to their having a role in chromosome structure or in gene expression. The clearest example of the function of these sequences comes from studies on pathogenic microbes where repetitive sequences cause antigenic variation and adaptive evolution (MOXON et al. 1994).

The yeast Saccharomyces cerevisiae has an abundance of microsatellites that are distributed throughout its 16 chromosomes (FIELD and WILLS 1998). They are found in both open reading frames (ORFs) and intergenic regions. As in multicellular eukaryotes, microsatellites in S. cerevisiae are unstable, leading to numerous polymorphic alleles (RICHARD and DUJON 1996; SIA et al. 1997A; FIELD and WILLS 1998). Genetic studies have helped define the mechanisms of instability in yeast (MAURER et al. 1996; FREUDENREICH et al. 1997; MIRET et al. 1997; SCHWEITZER and LIVINGSTON 1997; SIA et al. 1997A, 1997B).

In addition to providing insight into the mechanism of instability of repetitive DNA, studies of repetitive DNA in yeast can shed new light on its potential function. Analysis of the number of microsatellites in the yeast genome database indicated that negative selection pressure has been relaxed in comparison to the genomes of other microbes (FIELD and WILLS 1998), suggesting that their distribution in different types of genes might resemble that in multicellular eukaryotes. Because the function of a relatively large fraction of yeast genes, estimated at 45–60% of the 6304 genes (GOFFEAU 1996, 1997; DAS et al. 1997), is known, it should be possible to determine whether particular classes of genes, based on function, harbor a disproportionate number of repetitive sequences.

Our analysis of yeast microsatellites reveals that TNRs are found in ORFs at a frequency that is higher than expected and that TNRs are preferentially located in ORFs that are related to transcription and signal transduction. They appear to be located primarily in nonessential regions of the proteins. The instability of microsatellites, and their location in ORFs of regulatory genes, suggests that alterations of their sequence could lead to changes in gene function that could have important consequences for evolution.


MATERIALS AND METHODS

Strains and species of the genus Saccharomyces:
S. cerevisiae, NRRL-Y12632, S. paradoxus, and S. bayanus were provided by Dr. C. P. Kurtzman, Northern Regional Research Center, USDA, Peoria, IL. S. douglasii was obtained from Dr. C. Hawthorne (University of Washington).

Cloning SWI1/ADR6 homologues:
The PCR product of SWI1/ADR6 was amplified using primers ADR6-5 (CTGAAAGAGCTGCAATGTTTGCCG from bp 980 to 1003 in the ADR6 ORF) and ADR6-6 (CTTTGTTGTTGCTGCCGTTGACTC from bp 1157 to 1134 in the ADR6 ORF). Following amplification by Taq polymerase (35 cycles of 30 sec at 94°, 30 sec at 52°, and 1 min at 72°) the PCR product was purified using the QIAquick PCR purification kit (QIAGEN, Chatsworth, CA). Sequencing products were generated using the ABI Prism dRhodamine dye terminator sequencing reaction kit (PE Applied Biosystems, Foster City, CA) and analyzed on the ABI Prism 377 DNA sequencer.

Computer analysis and data processing:
The DNA sequences of the S. cerevisiae chromosomes and a database of known and predicted open reading frames were obtained as ASCII-formatted files from the Saccharomyces Genome Database World Wide Web site (Stanford University, SGD: ftp://genome-ftp.stanford.edu/pub/yeast/tables/ORF_locations/ORF_table.txt) on December 13, 1997. The repetitive ribosomal RNA genes were not included in this analysis. Analysis of the functional classification of the ORFs containing TNRs was updated in May 1999, using data at the MIPS website (http://www.mips.biochem.mpg.de/). Programs for performing the sequence searches and processing the data were written in C++ on an Intel Pentium-based personal computer using Visual Studio 97 (Microsoft Corp., Redmond, WA). Fig 1 and Fig 2 were drawn from a Postscript file that was generated by a program written in C++. The source code for the programs is available upon request. To determine the frequency of each amino acid in yeast proteins and to search for amino acid repeats in yeast protein sequences, all known and hypothetical open reading frames of 100 or more codons in length were translated from the chromosome sequence data into amino acid sequence. Because some of the small ORFs are now known to be functionally significant based on expression data (BASRAI et al. 1997), a small error may have been introduced by neglecting these ORFs. Homologues of yeast genes containing triplet repeats were found at http://acer.gen.tcd.ie/~khwolfe/yeast/topmenu.html. Lists of the ORFs, the type of repeat they contain, and their functional classification can be found at http://weber.u.washington.edu/~etyoung/. Information on yeast C6-zinc cluster proteins was found on the MIPS website given above. Data on protein kinases were found at the Protein Kinase database (http://www.sdsc.edu/kinases/). The protein structural database was located at http://www.rcsb.org/pdb/.



Figure 1. The distribution of pure mono-, di-, tri-, and tetranucleotide repeats of length 15 bp or greater in the 16 chromosomes of S. cerevisiae. The chromosomes are aligned with their centromeres on the vertical gray line. The colored lines above the chromosomes indicate the type and relative length of the repeats. The gray and black lines and boxes within the chromosomes indicate the positions of known genes and predicted open reading frames; those that are black contain a repeat.



Figure 2. The distribution of pure TNRs of length 15 bp or greater in the 16 chromosomes of S. cerevisiae. The chromosomes are aligned with their centromeres on the vertical gray line. The colored lines above the chromosomes indicate the type and relative length of the repeats. The gray and black lines and boxes within the chromosomes indicate the positions of known genes and predicted open reading frames; those that are black contain a trinucleotide repeat.


RESULTS

Microsatellites in S. cerevisiae:
We searched the S. cerevisiae database for perfect mono-, di-, tri-, and tetranucleotide repeats containing at least 15 nucleotides (16 for di- and tetranucleotides). The number, average length, longest repeat, and distribution between ORF and non-ORF DNA in each class are listed in Table 1. The total amount of microsatellite DNA of these four classes, as defined above, represents ~0.1% of the yeast genome. Considering all four classes, they occur on average about once every 12 kb (or one microsatellite every 29 kb for a repeat size of 20). The yeast genome contains 6304 ORFs (SGD: ftp://genome-ftp.stanford.edu/pub/yeast/tables/ORF_locations/ORF_table.txt) representing ~70% of the genome. About 95% of the mono-, di-, and tetranucleotide tracts occur between ORFs whereas trinucleotide tracts fall within ORFs ~83% of the time. These numbers vary somewhat from previous reports (RICHARD and DUJON 1996; FIELD and WILLS 1998) but different criteria for tract length and purity were used. All types of microsatellites appear to be uniformly distributed between and within chromosomes, as described previously using a different criterion for microsatellites (FIELD and WILLS 1998; Fig 1). Imperfect repeats of 18 nucleotides are distributed similarly (data not shown).
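The search criterion described above can be illustrated with a short sketch. This is not the C++ program used in the study; it is a minimal Python illustration that assumes a regular-expression scan of an uppercase DNA string, and the function name `find_perfect_repeats` and the test sequence are hypothetical.

```python
import re

def find_perfect_repeats(seq, unit_len, min_len):
    """Return (start, unit, length) for perfect tandem repeats of a
    unit of unit_len bases that span at least min_len bases."""
    # Whole copies of the unit needed to reach min_len (ceiling division).
    min_copies = -(-min_len // unit_len)
    # Group 1 captures one unit; \1 requires the remaining copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        # Skip units that are themselves repeats of a shorter motif,
        # e.g. AAA is a mononucleotide repeat, not a trinucleotide repeat.
        if any(unit == unit[:k] * (unit_len // k)
               for k in range(1, unit_len) if unit_len % k == 0):
            continue
        hits.append((m.start(), unit, m.end() - m.start()))
    return hits

# Hypothetical test sequence: a (CAG)6 tract followed by a poly(A) run.
example = "GGT" + "CAG" * 6 + "TTA" + "A" * 15 + "C"
tri_hits = find_perfect_repeats(example, 3, 15)
mono_hits = find_perfect_repeats(example, 1, 15)
```

With thresholds of 15 for mono- and trinucleotides and 16 for di- and tetranucleotides, the minimum copy numbers become 15, 8, 5, and 4, respectively. Only whole copies of the unit are counted, so a partial trailing copy does not contribute to the reported length.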


 
Table 1. Microsatellite number, average (and maximum) size, and ORF/non-ORF distribution in yeast genomic DNA

Mono-, di-, and tetranucleotide repeats:
The paucity of mono-, di-, and tetranucleotide tracts within ORFs prompted us to examine in more detail the ORFs containing them (Table 2). Among the 20 ORFs containing mononucleotide repeats, 8 are genes of known function. In 6 of these cases, representing 8 repeats, the repetitive sequence is in an intron. Because only 4% of yeast genes contain introns, we would expect only one intron-containing ORF to also contain a mononucleotide repeat if the repeats were distributed randomly in the yeast ORFs. Only 1 of the 13 ORFs containing a dinucleotide repeat is in a gene of known or suspected function, reflecting the fact that most of these are small and questionable ORFs (DAS et al. 1997). Because 45–60% of yeast genes have been assigned a function, we would expect ~6 of the ORFs containing a dinucleotide repeat to have a known function if the repeats were distributed randomly in the yeast ORFs. The dinucleotide repeat is at the C terminus of the two longest ORFs, one of which encodes the reverse transcriptase component of yeast telomerase. Their position and low frequency within ORFs suggest that mono- and dinucleotide repeats are deleterious to an ORF. Nothing unusual was noted about the position of the repeat in the 3 ORFs (BRF1, YGR079w, and YMR245W) containing tetranucleotide repeats.
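The expected values quoted above follow from simple proportions, as the sketch below shows. The binomial tail probability is an added illustration and was not reported in the study; it assumes, hypothetically, that each of the 20 ORFs independently has a 4% chance of containing an intron. The function names are illustrative only.

```python
from math import comb

def expected_count(n_orfs, fraction):
    """Expected number of ORFs with a property under random distribution."""
    return n_orfs * fraction

def binomial_tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# 20 ORFs with mononucleotide repeats; 4% of yeast genes contain introns.
expected_intron_orfs = expected_count(20, 0.04)   # about 0.8, i.e. roughly one

# 13 ORFs with dinucleotide repeats; 45-60% of genes have a known function.
expected_known_low = expected_count(13, 0.45)     # about 5.9
expected_known_high = expected_count(13, 0.60)    # about 7.8

# Assumption-laden illustration: chance of 6 or more intron-containing
# ORFs among 20 if introns were distributed independently at 4%.
tail_probability = binomial_tail(20, 0.04, 6)
```

The lower bound of the 45–60% range gives the value of ~6 cited in the text.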


 
Table 2. ORFs containing mono- and dinucleotide tracts

Trinucleotide repeats:
TNRs are the most varied and interesting class of microsatellite sequences. There are 64 nucleotide triplets. Repeats of single nucleotides, AAA, TTT, GGG, or CCC, are classified as mononucleotide repeats. Every other triplet is sixfold degenerate when both strands and all three reading frames are considered, so the remaining 60 triplets define 10 types of TNR. Thus, the triplet repeat AGC is the same as GCA, CAG, GCT, CTG, and TGC. Fig 2 shows the positions of 9 of the 10 different types of TNR in the S. cerevisiae genome. There are no perfect repeats containing only G and C that met the length criterion we used.
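The count of 10 TNR types can be checked by grouping each triplet with its rotations and with the rotations of its reverse complement. The sketch below is an illustration only; the choice of the alphabetically first member as the class name is an assumption made here for convenience.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(triplet):
    """Alphabetically first member of the repeat class of a triplet,
    i.e. of its three rotations on each of the two strands."""
    reverse_complement = triplet.translate(COMPLEMENT)[::-1]
    forms = set()
    for strand in (triplet, reverse_complement):
        for i in range(3):
            forms.add(strand[i:] + strand[:i])
    return min(forms)

classes = {}
for triplet in map("".join, product("ACGT", repeat=3)):
    if len(set(triplet)) == 1:
        continue  # AAA, CCC, GGG, TTT are mononucleotide repeats
    classes.setdefault(canonical(triplet), set()).add(triplet)

# Classes built only from G and C (none met the length criterion in yeast).
gc_only = [name for name in classes if set(name) <= set("GC")]
```

Each of the 10 classes has exactly six members; the AGC class, for example, contains AGC, GCA, CAG, GCT, CTG, and TGC, and the only class composed solely of G and C is the one containing CCG.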

Trinucleotide repeats have a biased genomic distribution: Previous work noted the large number of TNRs in the S. cerevisiae genome compared to other microbial genomes (FIELD and WILLS 1998). Table 3 documents the length distribution, type, and partition between ORF and non-ORF regions for TNRs. For TNRs within ORFs, the codon usage and encoded amino acid are shown. TNRs with the sequence AAT are least frequent within ORFs. All other types of TNRs are found primarily within ORFs. The difference between the value observed and that predicted based on a random distribution in the genome is most significant for the triplets AAG, AAC, AGC, and ATC. Their overrepresentation in ORFs is statistically significant at the 95% confidence level by a chi square test. Overall, 83% of the TNRs are found within ORFs. This is significantly higher than expected in a genome where 70% of the DNA represents ORFs (GOFFEAU 1996). Because TNRs occur predominantly in ORFs, the non-ORF regions of the genome that contain TNRs could be examined more closely to define new small ORFs.
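A chi square comparison of this kind can be sketched as follows. The exact setup of the test is not described in the text, so the sketch assumes a one-degree-of-freedom goodness-of-fit test against the 70% ORF fraction of the genome. The counts used (83 of 100 repeats in ORFs) are hypothetical and chosen only to mirror the overall 83% figure; the actual counts are those of Table 3.

```python
# Critical value of the chi square distribution, 1 degree of freedom, P = 0.05.
CHI2_CRITICAL_1DF_95 = 3.841

def chi_square_orf_bias(n_in_orf, n_outside_orf, orf_fraction=0.70):
    """Chi square statistic for repeats inside vs. outside ORFs, compared
    with a random distribution over a genome that is orf_fraction ORF."""
    total = n_in_orf + n_outside_orf
    expected_in = total * orf_fraction
    expected_out = total - expected_in
    return ((n_in_orf - expected_in) ** 2 / expected_in
            + (n_outside_orf - expected_out) ** 2 / expected_out)

# Hypothetical counts, for illustration only.
statistic = chi_square_orf_bias(83, 17)
is_significant = statistic > CHI2_CRITICAL_1DF_95
```

The statistic grows with sample size, so the same 83% proportion observed in a small class of repeats may not reach significance.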


 
Table 3. Distribution of perfect TNRs in the yeast genome

Trinucleotide repeats encode a biased set of amino acids: The 10 different types of TNRs occur at frequencies that do not reflect the base composition of yeast DNA. In addition, the total number of TNRs, except for pure G/C TNRs, is higher than expected, suggesting that a mechanism exists for expanding or maintaining these repeats (FIELD and WILLS 1998). The number of each kind of tract suggests that certain TNRs are preferentially maintained. For example, there are many more ATC repeats than ACT repeats, despite their identical base composition, and there are many more AGC repeats than ACT repeats, despite the base composition bias that would favor ACT repeats. These and similar observations reported in Table 3 suggest that selection is acting on TNRs in yeast ORFs. The data do not allow one to determine whether it is acting positively on those repeats that are abundant, such as AAG, AAC, AAT, AGC, and ATC, or negatively on those repeats that are rare, such as ACT and ACC.

The data do suggest that this selection operates at the level of protein structure and function, not nucleotide sequence. Table 4 shows that for those TNRs within ORFs the type of amino acid encoded is strongly biased toward the amide-containing amino acids, Gln and Asn, and the acidic amino acids, Glu and Asp. Hydrophobic amino acids, and also Gly and Thr, are either absent or infrequently encoded in TNRs. The distribution of amino acids within yeast proteins, as deduced from a database derived from the genomic DNA sequence, indicates that Leu is the most abundant amino acid in yeast proteins but it is 10-fold underrepresented in TNRs within ORFs. On the other hand Gln is sevenfold overrepresented in TNRs within ORFs.


 
Table 4. Distribution of amino acids in yeast proteins

This bias must operate at the protein level because it is unlikely that a bias operating at the nucleotide level would be sensitive to the identity of the codons. For example, the codon GAT (Asp) is found frequently in TNRs but its reverse complement ATC (Ile) is never found in repeats of five or greater. Similarly, the triplet AAC is well represented in TNRs, but almost invariably encodes Gln (CAA) or Asn (AAC), not Thr (ACA). Because these represent the same DNA sequence, it is likely that the bias against runs of ATC or ACA codons in an ORF is due to selection against runs of these amino acids.
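The argument rests on the fact that a single repeated triplet can be read as six different codons, three on each strand. The following sketch lists them using the standard genetic code; the function name `reading_frames` is hypothetical, and the code table is built from the conventional TCAG ordering.

```python
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reading_frames(unit):
    """Map each codon that a repeated triplet can be read as, on either
    strand and in any frame, to its one-letter amino acid."""
    reverse_complement = unit.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand in (unit, reverse_complement):
        for i in range(3):
            codon = strand[i:] + strand[:i]
            frames[codon] = CODON_TABLE[codon]
    return frames

aac_frames = reading_frames("AAC")
gat_frames = reading_frames("GAT")
```

For an AAC repeat the same strand reads as Asn (AAC), Thr (ACA), or Gln (CAA) depending on the frame, and a GAT repeat reads as Asp on one strand and Ile (ATC) on the other, which are the cases discussed above.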

We also determined whether amino acid repeats in proteins showed a bias similar to that revealed by amino acids encoded by TNRs in yeast ORFs (Table 4, column 3). The interpretation of the TNR data, that certain amino acid repeats are excluded from yeast proteins, would be in error if a significant number of amino acid tracts that are not encoded by TNRs were present. The analysis of amino acid tracts with n >= 5 showed that polySer was the most frequent amino acid repeat in yeast proteins, followed by polyGln, polyGlu, polyAsn, and polyAsp. PolySer was most frequently encoded by a degenerate repeat of the type TCN. The least frequent amino acid tracts were polyIle, polyVal, polyTyr, polyMet, polyCys, and polyTrp. Thus, the data derived from analyzing yeast protein sequences also suggest that hydrophobic amino acid tracts are, in general, excluded from ORFs.
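A search for homopolymeric amino acid tracts of this kind can be sketched as below. This is an illustration rather than the program used in the study; it assumes protein sequences in one-letter code, and the function names and the example sequence are hypothetical.

```python
import re
from collections import Counter

def amino_acid_tracts(protein, min_len=5):
    """Return (start, residue, length) for each run of min_len or more
    identical amino acids in a one-letter protein sequence."""
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), m.end() - m.start())
            for m in pattern.finditer(protein.upper())]

def tally_tracts(proteins, min_len=5):
    """Count tracts per amino acid over a collection of proteins."""
    counts = Counter()
    for protein in proteins:
        for _, residue, _ in amino_acid_tracts(protein, min_len):
            counts[residue] += 1
    return counts

# Hypothetical sequence with a Gln6 tract, an Asn5 tract, and an Asp4 run.
example = "MSQQQQQQAPNNNNNDDDDK"
tracts = amino_acid_tracts(example)
```

Raising `min_len` to 8 gives the stricter criterion of eight or more identical residues.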

Trinucleotide repeats occur preferentially in regulatory genes: We next asked whether the TNRs fell into different types of ORFs in a random or biased manner. The functional classification is derived from the MIPS database, which often places an ORF into several categories. When this occurred, we chose the category that we felt reflected the proximate role of the protein. For example, most protein kinases and transcription factors play important roles in metabolism. However, their role is usually a regulatory one. For this reason, we classified ORFs encoding transcription-related proteins into the transcription category and ORFs encoding protein kinases into the signal transduction category.

TNRs appear to fall into different functional classes of ORFs in a biased manner. TNRs are more frequent in ORFs encoding transcription-related proteins than in any other type of ORF. Table 5 shows the category, ORF name and number, type, and length of repeat(s) found in each ORF. The length of the amino acid tracts in an ORF is often much longer than the trinucleotide repeat shown in Table 5 due to imperfect repeats interspersed with perfect repeats. The data are summarized in Table 6. Overall ~7–10% of yeast ORFs are involved in transcription (GOFFEAU 1996, 1997). For those ORFs containing TNRs, and whose function could be reasonably deduced, ~40% are involved in some aspect of transcription, a four- to fivefold overrepresentation in this category. To make this estimate we assumed that the distribution of the types of genes in the known and unknown ORFs was the same. To the extent that some types of genes may be over- or underrepresented in the known category, the estimate may be in error. Because a large fraction of yeast genes have been characterized, the estimate can be made more accurately in yeast than in other eukaryotic organisms. If it were the case that more transcription factors have been identified by various means than ORFs of other functional classes, this would lead to an overrepresentation of this class of ORF in the category of known genes and hence an overestimate of the bias of TNR-containing ORFs of this type. However, even if all of the transcription-related genes were in the known category, which seems unlikely, there would still be an approximately twofold bias of TNR-containing ORFs in this category.
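The fold overrepresentation is the ratio of the fraction of TNR-containing ORFs in a category to the fraction of all ORFs in that category. The sketch below reproduces the four- to fivefold figure. The conservative bound is one plausible reading of the twofold estimate, not a calculation spelled out in the text: it assumes that all transcription-related genes are among the 45–60% of genes of known function. The function names are hypothetical.

```python
def fold_enrichment(observed_fraction, background_fraction):
    """Ratio of the observed fraction to the genome-wide background."""
    return observed_fraction / background_fraction

def conservative_fold(observed_fraction, background_fraction, known_fraction):
    """Enrichment if every gene of the category is already in the known
    set, so the background among known genes is inflated accordingly."""
    background_among_known = background_fraction / known_fraction
    return observed_fraction / background_among_known

# ~40% of classifiable TNR-containing ORFs vs. 7-10% of all ORFs.
fold_low = fold_enrichment(0.40, 0.10)    # about 4
fold_high = fold_enrichment(0.40, 0.07)   # about 5.7

# Assumed worst case: 10% background, only 45% of genes of known function.
fold_conservative = conservative_fold(0.40, 0.10, 0.45)   # about 1.8
```

Under these assumptions the bias remains close to twofold even in the least favorable case.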


 
Table 5. Functional classification of yeast genes containing trinucleotide repeats


 
Table 6. Summary of functional classification of yeast genes containing TNRs

A large number of triplet repeats within coding regions are in ORFs encoding what are commonly called "transcription factors," DNA binding proteins whose function is to activate or repress transcription, or members of multiprotein complexes involved in transcription. Thirteen percent of genes containing TNRs have two or more such regions. A majority of these genes are transcription-related. No TNRs are present in ORFs encoding "core" elements of transcription such as the RNA polymerase subunits themselves.

The second most frequent class of ORFs containing TNRs is in the signal transduction category. About 20% of the TNR-containing ORFs are in this functional category. Many of these genes encode protein kinases likely involved in regulating transcription or other cellular processes. The same caveat applies to this estimate as to the bias for transcription-related functions, but in this case it seems even less likely that all signal transduction-related ORFs have already been discovered. There are a significant number of TNRs found in ORFs involved in cell growth and division, DNA replication, cell organization and biogenesis, protein destination and synthesis, and intracellular transport.

Perhaps most striking is the observation that very few ORFs involved in common metabolic functions such as glycolysis, the TCA cycle, respiration, amino acid biosynthesis, fatty acid synthesis, or carbohydrate breakdown and synthesis contain triplet repeats.

To test the hypothesis that transcription factors and signal transduction molecules are the most frequent classes of proteins containing homopolymeric amino acid repeats we analyzed the proteins derived from the yeast genomic DNA sequence for tracts of eight or more identical amino acids. There are 238 ORFs containing such repeats, of which 142 are due to TNRs and were already classified. The 96 new ORFs identified in this way encode proteins with homopolymeric amino acid tracts that are not encoded by TNRs. Of these, 49 can be functionally classified. Eighteen, or 37%, are in the transcription category and 15 (31%) are in the signal transduction category. The remaining ORFs are distributed across the different functional categories in much the same manner as the ORFs containing TNRs (Table 6). This analysis leads to the same conclusion as that based on analysis of TNR distribution: ORFs involved in cellular regulation are more likely to contain repetitive DNA, and encode homopolymeric amino acid tracts, than are other classes of yeast genes.

Polyamino acid tracts are usually in nonfunctional regions of a protein: Genes encoding regulatory proteins frequently contain more than one functional or structural domain. Polyamino acid tracts could lie between such domains or they could lie within a functional domain. We tested this idea in several ways. First, deletion studies of several genes encoding transcription-related proteins containing TNRs have addressed the functional importance of these regions in yeast. The TNRs in SNF5 (LAURENT et al. 1990), CYC8/SSN6 (SCHULTZ et al. 1990), TAF61 (MOQTADERI et al. 1996), and SWI1/ADR6 (E. PRATT, personal communication) have been deleted with no apparent loss of function, suggesting that they do not encode an essential functional domain. MCM1 encodes a yeast transcription factor related to mammalian serum response factor. Functional domains for DNA binding, dimerization, and transcription activation have been defined by deletion analysis coupled with functional assays. MCM1-encoded polyGln and polyAsp tracts are not located in a region of known function (TREISMAN 1995), suggesting that the TNRs may lie between functional domains of the protein. Many TNR-containing ORFs contain multiple repeats. In about half of the cases the TNRs are clustered in a small region of the ORF, consistent with its being a linker or nonessential region.

Comparing genes with related sequences, presumably homologues, also provides some indication of important parts of the ORF. Twenty ORFs containing triplet repeats had putative homologues in the genome. In only four cases were the triplet repeats present in both homologues (SIS2/YOR054C, NGR1/PUB1, YPR042C/JSN1, and YLR449W/FPR3). There are six B-type cyclin genes and only one, CLB5, contains a triplet tract (AAG). In several cases the repetitive tract was at the terminus of the ORF, usually the C terminus, and it was present in only one member of a family of duplicated genes. Genes such as CTK1 (two functional homologues, PHO85 and CDC28, lack the repeat), RAD6 (UBC4, UBC5, UBC7, and UBC13 are functional homologues that lack the repeat), and ZDS1 are examples. GAL11 encodes two polyGln tracts that are not essential for function and they are not highly conserved in its Kluyveromyces lactis homologue, although its N-terminal region is also Gln-rich (MYLIN et al. 1991). These results suggest that TNRs are not highly conserved, and polyamino acid tracts in general may not be essential for protein function.

Another test of the idea that TNRs do not encode structurally or functionally essential domains of a protein was made by examining a database of known protein structures for repetitive tracts of amino acids. Very few of the structures in the database are of proteins containing tracts of five or more of the same amino acid in structured domains.

As a final test of the idea that amino acid tracts do not lie in functional domains of a protein, we examined two families of proteins in which many members contain a repeating amino acid tract. A second criterion was the availability of structural and functional information about at least one member of the family so that we could determine whether the repeat might interrupt that domain.

The first family we examined was the zinc cluster or C6 zinc finger proteins. There are 52 such proteins in the yeast database and 13 of them contain a repetitive amino acid tract, in this case defined as at least five repeating amino acids. Fig 3A depicts the location of the C6 zinc finger and the amino acid tract(s) in these proteins. None of the amino acid tracts interrupt the C6 domain and most likely fall outside of the adjacent dimerization domain as well. The example of Dal81p is particularly interesting because the C6 motif is closely flanked on both sides by repetitive amino acid tracts.




Figure 3. Amino acid repeats and functional domains in yeast C6 zinc cluster proteins and protein kinases. The open boxes represent the zinc cluster region (A), and the stippled boxes represent the kinase catalytic domain (B). Data for both types of proteins were derived from the MIPS database (http://www.mips.biochem.mpg.de/). The data for the protein kinase subdomains (C) are from the Protein Kinase database (http://www.sdsc.edu/kinases/).

The locations of repetitive tracts in these proteins may also provide some insight into the importance of other regions of the proteins. All of the repetitive tracts lie within either the amino-terminal 350 amino acids or beyond amino acid 700, suggesting that the region between these endpoints may comprise a domain of protein function. In the related C6 zinc finger protein Gal4p, this region has been implicated in sensitivity to glucose repression. The minimal length of this class of proteins, at least those containing amino acid repeats, also may be demarcated by the position of the repeats. The smallest ORF containing a repeat is just large enough to encode the hypothetical functionally important central region.

The second protein family examined comprised the Ser/Thr protein kinases. There are 117 yeast ORFs encoding proteins in this class and 16 of them encode amino acid tracts. The catalytic domain contains ~230 amino acids, including 11 conserved subdomains, I–XI. In 15 of the 16 Ser/Thr protein kinases in yeast containing polyamino acid tracts, none of the repeats would fall within the sequence containing the catalytic domain (Fig 3B). The single exception is the kinase encoded by YKL171w in which there is an apparent insertion of ~200 amino acids in the region between subdomains VII and IX (Fig 3C). This insertion is Ser-rich and includes two polyserine tracts. Subdomain VIII is mostly an unstructured loop in the crystal structure of PKA. Subdomain VIII in other protein kinases is involved in substrate recognition in which phosphorylation has been implicated. Thus, expansion of subdomain VIII in this ORF may play a role in regulating kinase activity.

In summary, the data are consistent with the hypothesis that TNRs usually do not encode essential structural or functional information, and the amino acid tracts they encode may reside primarily in linker regions of the protein (SAPOLSKY et al. 1993). This observation may be useful to help define functional coding sequences in genomic sequence data.

Extreme trinucleotide polymorphisms in yeast regulatory genes:
Microsatellite sequences are unstable in yeast as in other organisms. The length of the tract usually varies by one or a few repeats between different laboratory strains or between different Saccharomyces species (RICHARD and DUJON 1996; SIA et al. 1997A; FIELD and WILLS 1998). In our investigations of TNR instability we encountered two cases of extreme changes in TNRs in two regulatory genes, ADR1 and SWI1/ADR6. ADR1 from S. cerevisiae has a short and imperfect AAT repeat encoding Asn5 that is expanded to Asn17 and Asn15 in the ADR1 genes from S. douglasii and S. paradoxus, respectively, and is lost in the ADR1 gene from S. bayanus (YOUNG et al. 2000).

The change in one of the five TNRs in the ADR6/SWI1 gene is also dramatic (Fig 4). A Gln repeat in the SWI1 gene of S. cerevisiae is altered by addition of an extensive Gln-Pro repeat at the same position in its homologue from S. douglasii. The nucleotides flanking the TNR in ADR6/SWI1 from S. douglasii are ~95% identical to those flanking the S. cerevisiae homologue, indicating that these are closely related genes. The extensive polymorphism suggests that the repetitive sequences in the ORFs of these regulatory genes have undergone extensive changes associated with speciation.



Figure 4. Amino acid (a) and DNA (b) sequence comparisons of TNR region of ADR6/SWI1. Sc, S. cerevisiae; Sd, S. douglasii; dashes represent sequences not present. Changes that differ from the sequence in Sc ADR1 are shown below that sequence. The TNR region of ADR6/SWI1 was amplified with forward and reverse primers from genomic DNA prepared from the strains indicated. The PCR products were purified by agarose gel electrophoresis and subjected to sequence analysis.


DISCUSSION

In the yeast genome mono-, di-, and tetranucleotide tracts are preferentially located in non-ORF DNA (RICHARD and DUJON 1996; FIELD and WILLS 1998), whereas TNRs are located primarily in ORFs (Table 3). The likely explanation for this difference between triplet and non-triplet repeats with respect to location invokes the consequences of slipped-strand mispairing during DNA replication in the different types of repeats (HANCOCK 1996). Slippage in mono-, di-, and tetranucleotide repeats, but not trinucleotide repeats, would lead to frameshift errors during translation. The location of many mono- and dinucleotide repeats at the extreme C terminus of ORFs, and in introns, is consistent with their instability being deleterious for protein function. A few ORFs of unknown function have a dinucleotide repeat near the N terminus of the protein, where its instability could lead to altered protein products as in the case of "contingency" genes in pathogenic microbes (MOXON et al. 1994).

Do the TNRs within ORFs have a function? We have no answer to this question, but polyamino acid tracts could have a function, perhaps specific to the proteins in which they are located, or they could be maintained in proteins because mutation pressure and negative selection pressure are balanced. In small genomes, such as those of bacteria, long repetitive tracts have been excluded by negative selection (FIELD and WILLS 1998). The same study reported that strong mutation pressures created long repeats in S. cerevisiae, despite the selection for small genome size. Our data suggest that selection against certain TNRs is due to deleterious consequences for protein function because TNRs encoding hydrophobic amino acids are infrequently found in yeast ORFs, whereas the same repeat representing a different reading frame may be abundant (Table 3). Certain TNRs appear to be especially well tolerated: those encoding polyGln, polyAsn, polyGlu, polyAsp, and polySer in particular. The disparity between different types of TNRs in ORFs could reflect the balance of strong mutation pressure and different levels of negative selection. It appears that TNRs encoding hydrophobic amino acids are subject to the strongest negative selection and those encoding polyGln and polyAsn are subject to the least negative selection.

Whether the TNRs encoding polyGln and polyAsn might be subject to positive selection is difficult to answer. Although studies designed to test the importance of polyamino acid tracts in yeast proteins have generally led to the conclusion that the tracts are not essential, laboratory studies may not be sufficient to answer the important question of whether the tracts have evolutionary significance for the organism. Small changes in fitness would be sufficient to produce dramatic changes in the relative frequency of a mutant in comparison to its wild-type progenitor in a mixed population. This was dramatically demonstrated by testing a series of yeast disruption mutants in mixed-growth experiments (THATCHER et al. 1998). Yeast mutants with no detectable growth phenotype were shown to have highly significant fitness defects when cocultured with a wild-type progenitor. Similarly, negative fitness changes might be detected when mutants containing deletions of TNRs were cocultured with their wild-type progenitors.
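How a small fitness difference compounds in a mixed culture can be sketched with a simple deterministic competition model. This is an assumption for illustration and is not the analysis used by THATCHER et al. (1998): the mutant is given a constant relative fitness of 1 - s per generation, and the starting frequency, selection coefficient, and number of generations are hypothetical values.

```python
def mutant_frequency(p0: float, s: float, generations: int) -> float:
    """Frequency of a mutant with relative fitness (1 - s) after competitive
    growth with wild type, starting from mutant frequency p0."""
    # The mutant:wild-type ratio is multiplied by (1 - s) every generation.
    w = (1.0 - s) ** generations
    return p0 * w / (p0 * w + (1.0 - p0))


# Hypothetical example: a 1% fitness defect, starting from a 1:1 mixture.
for generations in (0, 100, 250, 500):
    freq = mutant_frequency(0.5, 0.01, generations)
    print(f"generation {generations}: mutant frequency {freq:.4f}")
```

Under these assumptions a 1% defect, which would be hard to detect as a growth phenotype in pure culture, reduces the mutant from half of the population to less than 1% within 500 generations.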

One argument suggesting that TNRs in ORFs have a positive selection value is based on their higher representation in ORFs as opposed to intergenic regions. If the abundance of a particular TNR were based solely on a balance of strong mutation pressure and negative selection, it is difficult to see why many of the TNRs occur more frequently in ORFs than in intergenic regions.

A clue to the continued presence of TNRs in yeast ORFs may lie in the function of the genes in which they are found. We found a disproportionate number of TNRs in yeast ORFs encoding regulatory proteins (Table 6). This bias is significant because a similar bias was found when a protein database derived from yeast genomic sequence was queried for amino acid tracts. Transcription factors seem to be the most frequent group of proteins with TNRs, and they frequently contain multiple TNRs. These proteins consist of multiple domains, as do many other proteins that are involved in complex signaling processes. Analysis of the yeast genome suggests a high proportion of complex, multidomain proteins (DAS et al. 1997). Thus, many of the amino acid repeats that are encoded by TNRs could be within linker regions (SAPOLSKY et al. 1993) or at the termini of complex proteins. At these positions perhaps their length could be variable without deleterious effects. In contrast, many enzymes catalyzing reactions of intermediary metabolism consist of a single functional domain and their ORFs are notably lacking in TNRs. Our analysis of yeast genes containing polyamino acid tracts is consistent with the hypothesis that the amino acid tracts lie between structural or functional domains of the protein.

There may be important evolutionary implications for the frequent occurrence of TNRs in genes regulating the synthesis and activity of DNA, RNA, and proteins, and their exclusion from the most ancestral genes in the cell. Unstable TNRs in genes encoding transcription factors and related proteins could have an important influence on the regulation of gene expression. By allowing relatively frequent and often benign alterations to occur in genes encoding information-related processes, a variety of phenotypes could exist in a population, allowing selection to occur for those individuals best suited to new conditions. By acting as a source of genetic variation these sequences could play an important role in evolution.

This hypothesis is similar to that of "contingency" genes in pathogenic microbes (MOXON et al. 1994), where phenotypic variation is generated by mutations within repetitive DNA in genes related to the virulence properties of the organism. Such mutations are strongly selected in the microbe because they allow it to cope with the host's constantly changing immune repertoire. Could some of the yeast microsatellites have a similar function? As far as we know, there is only one report of a naturally occurring mutation within a yeast microsatellite that creates a potentially adaptive phenotype in yeast. Two mutations within a polydA20 tract in the promoter of the ADH2 gene extend the length to 54 or 55 A's, create a strong promoter-up mutation, and also change the transcription factor dependence in a novel way (RUSSELL et al. 1983; CIRIACY et al. 1991). The nature and consequence of this mutation are similar to those in Mycoplasma, which generate antigenic diversity (YOGEV et al. 1991). The ADH2 promoter mutation is dominant. We imagine that other changes in microsatellites could also create gain-of-function alleles that could be selected for in a diploid organism. Thus, adaptive diversity may also be created in yeast by slipped-strand mispairing during DNA replication.

Microsatellites within ORFs could also have an important evolutionary role by allowing recombination to shuffle functional domains of a protein. If TNRs lie between domains of protein structure, rare recombination events between the repeats within different genes could lead to rearrangement of genetic information much in the same way that exon shuffling is suggested to occur by recombination within introns. Recombination within TNRs would maintain the correct reading frame and the recombinant genes would encode multidomain proteins with new functions. Remnants of such recombination events might be detected by comparing chromosome and gene organization in related species.

Expansion of TNRs is responsible for a growing list of human genetic diseases (PAULSON and FISCHBECK 1996; SIA et al. 1997a; PANDOLFO 1998). Some of the expansions within ORFs behave in a semidominant manner, suggesting that the abnormal protein interferes with the function of the normal protein. Another situation in which an abnormal protein interferes with the function of its normal counterpart is in prion transmission in mammals and in yeast. A recent report indicates that iteration of a portion of Sup35p responsible for prion formation enhances its transmission (LIU and LINDQUIST 1999). Interestingly, the portion of URE2, a yeast gene encoding another protein capable of prion formation, which is both necessary and sufficient for prion formation, contains a TNR encoding polyAsn. Whether expansion of the polyAsn enhances prion transmission has not been tested. It is interesting to consider whether some of the triplet repeat diseases may have etiologies similar to those associated with prion diseases. Both of them show delayed appearance and are particularly associated with nervous tissues in which protein aggregation may play an important role in pathogenesis of the disease. It would be interesting to test the ability of proteins containing expanded amino acid tracts to induce misfolding in their normal counterparts.


*  ACKNOWLEDGMENTS

We thank Albert LaSpada and Ken Dombek for comments on the manuscript and Jon Cooper for help with the protein kinase analysis. This work was supported by research grants from the National Institutes of Health (R29-GM54043 and GM26079).

Manuscript received July 20, 1999; accepted for publication November 17, 1999.


*  LITERATURE CITED

BASRAI, M. A., P. HIETER, and J. D. BOEKE, 1997  Small open reading frames: beautiful needles in the haystack. Genome Res. 7:768-771.

CIRIACY, M., K. FREIDEL, and C. LOHNING, 1991  Characterization of trans-acting mutations affecting Ty and Ty-mediated transcription in Saccharomyces cerevisiae. Curr. Genet. 20:441-448.

DAS, S., L. YU, G. GAITATZES, R. ROGERS, and J. FREEMAN et al., 1997  Biology's new Rosetta stone. Nature 385:29-30.

FIELD, D. and C. WILLS, 1998  Abundant microsatellite polymorphism in Saccharomyces cerevisiae, and the different distributions of microsatellites in eight prokaryotes and S. cerevisiae, result from strong mutation pressures and a variety of selective forces. Proc. Natl. Acad. Sci. USA 95:1647-1652.

FREUDENREICH, C. H., J. B. STAVENHAGEN, and V. A. ZAKIAN, 1997  Stability of CTG/CAG trinucleotide repeats in yeast is dependent on its orientation in the genome. Mol. Cell. Biol. 17:2090-2098.

GOFFEAU, A., 1997  The yeast genome directory. Nature 387(Suppl.):5-103.

GOFFEAU, A., et al., 1996  Life with 6000 genes. Science 274:546-567.

GUSELLA, J. F. and M. E. MAC DONALD, 1996  Trinucleotide instability: a repeating theme in human genetic disorders. Annu. Rev. Med. 47:201-209.

HANCOCK, J. M., 1996  Simple sequences and the expanding genome. BioEssays 18:421-425.

LAURENT, B. C., M. A. TREITEL, and M. CARLSON, 1990  The SNF5 protein of Saccharomyces cerevisiae is a glutamine- and proline-rich transcriptional activator that affects expression of a broad spectrum of genes. Mol. Cell. Biol. 10:5616-5625.

LENGAUER, C., K. W. KINZLER, and B. VOGELSTEIN, 1997  Genetic instability in colorectal cancers. Nature 386:623-627.

LIU, J. J. and S. LINDQUIST, 1999  Oligopeptide-repeat expansions modulate 'protein-only' inheritance in yeast. Nature 400:573-576.

MAURER, D. J., B. L. O'CALLAGHAN, and D. M. LIVINGSTON, 1996  Orientation dependence of trinucleotide CAG repeat instability in Saccharomyces cerevisiae. Mol. Cell. Biol. 16:6617-6622.

MIRET, J. J., L. PESSOA-BRANDOA, and R. S. LAHUE, 1997  Instability of CAG and CTG trinucleotide repeats in Saccharomyces cerevisiae. Mol. Cell. Biol. 17:3382-3387.

MOQTADERI, Z., J. D. YALE, K. STRUHL, and S. BURATOWSKI, 1996  Yeast homologues of higher eukaryotic TFIID subunits. Proc. Natl. Acad. Sci. USA 93:14654-14658.

MOXON, E. R., P. B. RAINEY, M. A. NOWAK, and R. E. LENSKI, 1994  Adaptive evolution of highly mutable loci in pathogenic bacteria. Curr. Biol. 4:24-33.

MYLIN, L. M., C. J. GERARDOT, J. E. HOPPER, and R. C. DICKSON, 1991  Sequence conservation in the Saccharomyces and Kluyveromyces GAL11 transcription activators suggests functional domains. Nucleic Acids Res. 19:5345-5350.

PANDOLFO, M., 1998  Molecular genetics and pathogenesis of Friedreich ataxia. Neuromuscul. Disord. 8:409-415.

PAULSON, H. L. and K. H. FISCHBECK, 1996  Trinucleotide repeats in neurogenetic disorders. Annu. Rev. Neurosci. 19:79-107.

RICHARD, G.-F. and B. DUJON, 1996  Distribution and variability of trinucleotide repeats in the genome of the yeast Saccharomyces cerevisiae. Gene 174:165-174.

RUSSELL, D. W., M. SMITH, D. COX, V. M. WILLIAMSON, and E. T. YOUNG, 1983  DNA sequences of two yeast promoter-up mutants. Nature 304:652-654.

SAPOLSKY, R. J., V. BRENDEL, and S. KARLIN, 1993  A comparative analysis of distinctive features of yeast protein sequences. Yeast 9:1287-1298.

SCHULTZ, J., L. MARSHALL-CARLSON, and M. CARLSON, 1990  The N-terminal TPR region is the function domain of SSN6, a nuclear phosphoprotein of Saccharomyces cerevisiae. Mol. Cell. Biol. 10:4744-4756.

SCHWEITZER, J. K. and D. M. LIVINGSTON, 1997  Destabilization of CAG trinucleotide repeat tracts by mismatch repair mutations in yeast. Hum. Mol. Genet. 6:349-355.

SIA, E. A., S. JINKS-ROBERTSON, and T. D. PETES, 1997a  Genetic control of microsatellite instability. Mutat. Res. 383:62-70.

SIA, E. A., R. J. KOKOSKA, M. DOMINSKA, P. GREENWELL, and T. D. PETES, 1997b  Microsatellite instability in yeast: dependence on repeat unit size and DNA mismatch repair genes. Mol. Cell. Biol. 17:2851-2858.

THATCHER, J. W., J. M. SHAW, and W. J. DICKINSON, 1998  Marginal fitness contribution of nonessential genes in yeast. Proc. Natl. Acad. Sci. USA 95:253-257.

TREISMAN, R., 1995  Inside the MADS box. Nature 376:468-469.

YOGEV, D., R. ROSENGARTEN, R. WATSON-MCKOWN, and K. S. WISE, 1991  Molecular basis of Mycoplasma surface antigenic variation: a novel set of divergent genes undergo spontaneous mutation of periodic coding regions and 5' regulatory sequences. EMBO J. 10:4069-4079.

YOUNG, E. T., B. M. MILLER, J. S. SLOAN, K. VAN RIPER, N. LI and K. M. DOMBEK, 2000 Evolution of glucose-regulated ADH isozymes and ADR1 in Saccharomyces. Gene (in press).



