- A corrigendum has been published
An Exploration of the Sequence of a 2.9-Mb Region of the Genome of Drosophila melanogaster: The Adh Region
M. Ashburner (a,b), S. Misra (d), J. Roote (a), S. E. Lewis (d), R. Blazej (g), T. Davis (c), C. Doyle (g), R. Galle (g), R. George (g), N. Harris (g), G. Hartzell (d), D. Harvey (d,e), L. Hong (d), K. Houston (g), R. Hoskins (g), G. Johnson (a), C. Martin (1,g), A. Moshrefi (g), M. Palazzolo (2,g), M. G. Reese (d), A. Spradling (f), G. Tsang (d,e), K. Wan (g), K. Whitelaw (g), B. Kimmel (2,g), S. Celniker (g), and G. M. Rubin (g,d,e)
a Department of Genetics, University of Cambridge, Cambridge, CB2 3EH, England,
b EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, England,
c Department of Pathology, University of Wales College of Medicine, Cardiff, CF4 4XN, Wales,
d Berkeley Drosophila Genome Project, Department of Molecular and Cell Biology, University of California, Berkeley, California 94720-3200,
e Howard Hughes Medical Institute, Life Sciences Annex, University of California, Berkeley, California 94720,
f Howard Hughes Medical Institute, Carnegie Institution of Washington, Baltimore, Maryland
g Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, Berkeley, California 94720
Corresponding author: M. Ashburner, Department of Genetics, Downing St., Cambridge, CB2 3EH, England. E-mail: m.ashburner@gen.cam.ac.uk
Communicating editor: T. C. KAUFMAN
| ABSTRACT |
|---|
A contiguous sequence of nearly 3 Mb from the genome of Drosophila melanogaster has been sequenced from a series of overlapping P1 and BAC clones. This region covers 69 chromosome polytene bands on chromosome arm 2L, including the genetically well-characterized "Adh region." A computational analysis of the sequence predicts 218 protein-coding genes, 11 tRNAs, and 17 transposable element sequences. At least 38 of the protein-coding genes are arranged in clusters of from 2 to 6 closely related genes, suggesting extensive tandem duplication. The gene density is one protein-coding gene every 13 kb; the transposable element density is one element every 171 kb. Of 73 genes in this region identified by genetic analysis, 49 have been located on the sequence; P-element insertions have been mapped to 43 genes. Ninety-five (44%) of the known and predicted genes match a Drosophila EST, and 144 (66%) have clear similarities to proteins in other organisms. Genes known to have mutant phenotypes are more likely to be represented in cDNA libraries, and far more likely to have products similar to proteins of other organisms, than are genes with no known mutant phenotype. Over 650 chromosome aberration breakpoints map to this chromosome region, and their nonrandom distribution on the genetic map reflects variation in gene spacing on the DNA. This is the first large-scale analysis of the genome of D. melanogaster at the sequence level. In addition to the direct results obtained, this analysis has allowed us to develop and test methods that will be needed to interpret the complete sequence of the genome of this species.
Before beginning a Hunt, it is wise to ask someone what you are looking for before you begin looking for it. MILNE 1926
IT is nearly 100 years since W. E. Castle and his colleagues at Harvard University introduced Drosophila melanogaster to the joys and rigors of scientific research.
The analysis and interpretation of long genomic sequences pose several unsolved problems, among which are gene prediction and correlation of genetically identified loci with computationally predicted genes. We have selected the 2.9-Mb Adh region, a region of the genome of D. melanogaster that was already well characterized by conventional genetic analyses, as a test-bed to develop and evaluate approaches to large-scale genomic sequence annotation in Drosophila. This chromosome region is defined as the 69 polytene chromosome bands from 34C4 to 36A2 on chromosome arm 2L, which is the region between (and including) the previously known genes kuzbanian (kuz) and dachshund (dac). Genetic analysis of this chromosome region began with the studies of E. H. Grell in the early 1960s and the recovery of an Adh- deletion, Df(2L)64j.
Genetic analysis has defined 73 genes in this chromosome region. Of these genes, 65 are represented by mutant alleles and 8 more are predicted on the basis of the phenotypes of overlapping deletions. Of those with mutant alleles, 50 genes have at least one lethal allele (i.e., they are genes whose activities are vital), 6 are known only from sterile alleles (2 male sterile and 4 female sterile), 8 only from alleles with clear visible phenotypes, and 2 genes have alleles with no gross phenotype: Adh and smi35A. Forty-nine protein-coding genes (and 5 tRNA genes) in this region had been molecularly characterized prior to or during our work; these included 7 that had not been identified by genetic analysis. In addition to a collection of over 1038 different mutant alleles of genes in this region, the genetic analysis was enormously aided by a very large collection of chromosome aberrations, including 86 inversions, 109 translocations, 317 deletions, and 40 duplications. Apart from some conventional recombination mapping in the early stages of the project, all genes have been ordered by deletion mapping. The genetic positions of the breakpoints of many inversions and translocations have been mapped with respect to the genes, often by combining these breakpoints with others to synthesize deletions or duplications.
These genetic data posed two major questions. The first was that of "saturation": What proportion of the genes had been identified by the genetic analysis?
There is direct experimental evidence, or prediction, for 229 genes in the 2.9 Mb of sequenced DNA. Of these, there is evidence for function or some hint of function from sequence matches for 102 genes. One of the challenges for the future is to discover, by experiment, the function of all of the genes.
| MATERIALS AND METHODS |
|---|
Genetics:
All of the mutations and chromosome aberrations used in this study are fully described in FlyBase.
P elements were obtained from several laboratories, from screens for lethal P elements on chromosome 2.
P-element excisions and male recombinants were generated using P{Δ2-3}99B as the source of an active P transposase. These derivatives were then characterized by conventional genetic complementation analyses.
Cytology:
For conventional polytene chromosome analysis we used propionic-carmine-orcein squash preparations. In situ hybridization was performed by standard procedures using biotinylated probes and horseradish peroxidase staining. Polytene chromosomes were interpreted using the revised maps of C. B. and P. N. Bridges.
Clones:
The P1 clone library, with an average insert size of 80 kb, was that prepared from an isogenic y; cn bw sp stock in the vectors pNS583tet14Ad10 and pAd10sacBII.
The P1 clones were first assembled into eight contigs by screening a 5-hit P1 clone library. STS sequences were then determined from the ends of these contigs and mapped to a second, larger P1 clone library (10 hit); together with directed PCR experiments, this merged the contigs into two, of 0.8 Mb and 1.9 Mb, plus an isolated P1 clone containing the kuzbanian gene. The gaps between the two long contigs and between the isolated P1 clone and the 1.9-Mb contig were closed by screening the BAC clone library with sequences prepared from the appropriate end clones.
DNA sequencing:
The sequence of the Adh region has been assembled by first determining the sequences of the 51 individual P1 clones that comprise the 0.8-Mb and 1.9-Mb contigs. The gap between the two contigs was filled by sequencing the BAC clone BACR44L22. The gap between the P1 clones DS07660 and DS01368 was filled by sequencing BACR48E02. Table 2 lists the clones sequenced and their DDBJ/EMBL/GenBank accession numbers.
The sequencing strategies have evolved over time. Essentially, ca. 3-kb subclone libraries of randomly sheared DNA were prepared from each P1 clone in plasmid vectors. The sequences of both ends of each plasmid insert were determined using primers complementary to the vector, and these sequences were used to assemble a set of overlapping 3-kb clones that span an entire P1 clone. The 3-kb clones were then sequenced using a combination of methods that included transposon-mediated sequencing.
cDNA identification and sequencing:
cDNA clones derived from genes in the 34D-36A region were identified by searching for sequence matches between the genomic DNA sequence and 5' expressed sequence tags (ESTs) from the Berkeley Drosophila Genome Project (BDGP)/Howard Hughes Medical Institute (HHMI) Drosophila EST project (http://www.fruitfly.org/EST/). In addition, cDNAs corresponding to crp, heix, l(2)35Fe, anon-35Fa, anon-35F/36A, BG:DS02740.2, BG:DS02740.4, BG:DS02740.8, BG:DS02740.9, and BG:DS02740.10 were isolated by screening the LD cDNA library.
Molecular mapping of P-element insertion sites:
The precise insertion sites of all P elements described here were determined by using sim4 to compare the reference genomic sequence with a sequence that spanned the junction between a P element and the genome. These junction sequences were determined from either plasmid-rescued clones or inverse PCR products.
Sequence analysis:
Two broad categories of computational method were used together to predict and identify genes. The first was gene prediction algorithms, based on the statistical properties of protein-coding regions. The second category of method used alignment algorithms for predictions based upon similarities of the sequence with other sequences in the public domain, both nucleic acid and protein.
The main gene prediction program used in the early stages of this analysis was GENEFINDER (v. 0.83).
To estimate the statistical properties of D. melanogaster protein-coding regions, a nonredundant data set of coding regions (CDS) was made. By nonredundant we mean that for any one gene only one CDS is included, even if the gene encodes multiple protein products (the CDS included was usually the longest complete sequence available from the EMBL Nucleotide Sequence Data Library). All of the CDS regions were checked for legitimate start and stop codons and for a continuous open reading frame between them. Four genes with non-ATG starts were included in this data set (CTG, amn, ewg; GTG, Cha; CTC, cpo) following advice from D. Cavener, as were two CDSs (oaf and kelch) with in-frame UGA codons, perhaps coding for selenocysteine. This data set of 1335 CDSs was used for the construction of normalized codon and di-codon (hexamer) tables.
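The CDS checks and table construction described above can be illustrated with a minimal Python sketch. It is not the code used for this analysis: the function names, the `allowed_starts` parameter, and the choice to normalize by dividing each count by the total are assumptions made for illustration, and the special handling of in-frame UGA codons (oaf, kelch) is not modeled.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_cds(seq, allowed_starts=("ATG",)):
    """Check for a permitted start codon, a terminal stop codon,
    and a continuous open reading frame between them."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in allowed_starts or codons[-1] not in STOP_CODONS:
        return False
    # No stop codon may interrupt the frame before the terminal one.
    return not any(c in STOP_CODONS for c in codons[1:-1])


def codon_and_dicodon_tables(cds_list):
    """Return (codon, dicodon) frequency tables for a list of CDS strings.
    Dicodons are in-frame pairs of adjacent codons (hexamers).
    Normalization here is count / total, which is an assumption."""
    codon_counts, dicodon_counts = Counter(), Counter()
    for seq in cds_list:
        seq = seq.upper()
        codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
        codon_counts.update(codons)
        dicodon_counts.update(a + b for a, b in zip(codons, codons[1:]))

    def normalize(counts):
        total = sum(counts.values())
        return {key: n / total for key, n in counts.items()}

    return normalize(codon_counts), normalize(dicodon_counts)


if __name__ == "__main__":
    # Toy sequences, not taken from the 1335-CDS data set.
    toy = ["ATGGCCGCCTAA", "ATGGCCTGA"]
    valid = [s for s in toy if is_valid_cds(s)]
    codon_freq, dicodon_freq = codon_and_dicodon_tables(valid)
    print(codon_freq)
    print(dicodon_freq)
```

A CDS with a non-ATG start, such as those listed above, would pass the check only if its start codon is added explicitly, e.g. `is_valid_cds(seq, allowed_starts=("ATG", "CTG"))`.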
Databases against which similarity searches were made included GenBank, dbEST, SWISS-PROT, SPTREMBL, and sequences from the European Drosophila Genome Project (EDGP). Updates of these were collected weekly, the sequence data sorted into species-specific files, and all submissions from the Berkeley Drosophila Genome Project removed to provide data sets for searches. These data sets were then processed to append all database cross-references to FASTA header lines. For sequence similarity searches the BLASTN, BLASTX, and TBLASTX programs (version 2.0a) of W. GISH (unpublished results) were used (with the option B = 1,000,000, options filter = SEG + XNU).
Transposable elements were screened using a nonredundant data set of transposable element sequences from which all "flanking" DNA sequences had been trimmed. This data set was originally derived from the EMBL Nucleotide Sequence Data Library records, but as our analysis progressed more complete sequences of elements only known before from partial sequence were added, replacing incomplete sequences. This data set is available from ftp://ftp.ebi.ac.uk/pub/databases/edgp/sequence_sets/transposon_sequence_set.embl and from http://www.fruitfly.org/sequence/download.html (as na_te.dros).
A collection of repetitive sequences from D. melanogaster, not otherwise included in the transposable element sequence set, was also made. This data set includes, e.g., satellite DNA sequences and a miscellany of sequences annotated as being repetitive by FlyBase. It is not as nonredundant as the other two data sets, and was only used for screening for sequences similar to those previously described as repetitive. The data set is available from ftp://ftp.ebi.ac.uk/pub/databases/edgp/sequence_sets/repeat_sequence_set.embl and http://www.fruitfly.org/sequence/download.html (as na_re.dros).
The data output from these various computational analyses is voluminous and requires intelligent filtering to remove redundant and irrelevant information before being passed to the human annotators. Moreover, the task of annotation is almost impossible without tools for the visualization of these data. An application, BLAST Output Parser (v. 01; BOP), was written (S. LEWIS, unpublished results). BOP summarizes all automatically computed analysis data for an individual sequence into one file (i.e., all output from the programs mentioned previously: BLAST, GENSCAN, etc.). This file is in XML syntax. BOP also removes as much of the "noise" as possible (e.g., redundant matches, "shadow" matches on the noncoding strand, and matches to sequences of very biased base composition). These condensed data were then presented to the annotator in a graphical view (CloneCurator v. 0.1; S. LEWIS, N. HARRIS, S. MISRA and G. HELT, unpublished results).
CloneCurator was used to isolate individual genes from the clone sequences, based on expert evaluation of these analyses. CloneCurator allowed the annotator to compare results from different programs and to view the results using filters to determine a desired level of probability of prediction. The annotator used this visual summary to endorse a set of results as evidence, thereby generating a verified annotation. Annotations can be edited in CloneCurator and the annotators can add textual comments to any particular annotation, assign gene symbols, etc. This program was used to generate nucleic acid and amino acid FASTA files for each gene annotation. When a gene spanned more than one clone, manual intervention by an annotator was necessary to construct virtual mRNA sequences.
Open reading frames of predicted genes were validated using ORFfinder (v. 0.1; E. FRISE, unpublished results) and all predicted proteins were then tested with BLASTP (v. 2.0a) with the options filter = SEG + XNU (unless the results are stated as being "unfiltered") against SWISS-PROT and SPTREMBL protein sets organized into nine taxonomic groups (Drosophila, Caenorhabditis elegans, Saccharomyces cerevisiae, other invertebrates, primates, rodents, other vertebrates, plants, and bacteria). Matches with an expectation above P = 10^-7 were ignored.
Protein domains and motifs were analyzed against the PROSITE database (release 15.0).
The output from the various sequence analysis programs is archived on FlyBase as FlyBase-Annotation files linked to the sequenced clones. Version 1 of these files includes the analyses used for this article. Subsequent versions will result from reanalysis of the sequence data.
Nomenclature:
All genes are named according to the conventions agreed between the Berkeley and European Drosophila Genome Projects and FlyBase (http://flybase.bio.indiana.edu/docs/nomenclature). Each gene is given a unique name composed of three parts: a prefix (BG for genes defined by the Berkeley Project, EG for those defined by the European Project), followed by a clone name and an integer. The clone name is that of the clone on which the gene was first defined (regardless of whether or not the gene overlaps more than one clone). The final integer is simply a serial number, and does not imply the order of a gene within a clone. An example is BG:DS09218.6, the sixth gene annotated on P1 clone DS09218. If a gene was already known to FlyBase, then a formal name is still assigned but will be treated by FlyBase as a synonym of the established name.
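The three-part naming scheme can be made concrete with a small parser. This is only a sketch: the regular expression assumes that clone names are purely alphanumeric (as in DS09218), which is an assumption rather than something stated in the nomenclature rules, and the names `AnnotationName` and `parse_annotation_name` are hypothetical.

```python
import re
from typing import NamedTuple


class AnnotationName(NamedTuple):
    project: str  # "BG" (Berkeley Project) or "EG" (European Project)
    clone: str    # clone on which the gene was first defined
    serial: int   # serial number; does not imply order within the clone


# Assumed pattern: prefix, colon, alphanumeric clone name, dot, integer.
_NAME_RE = re.compile(r"^(BG|EG):([A-Za-z0-9]+)\.(\d+)$")


def parse_annotation_name(name: str) -> AnnotationName:
    """Split a systematic gene name into prefix, clone name, and serial number."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a systematic annotation name: {name!r}")
    return AnnotationName(match.group(1), match.group(2), int(match.group(3)))


if __name__ == "__main__":
    print(parse_annotation_name("BG:DS09218.6"))
```

Established FlyBase symbols such as Adh do not follow this pattern and are rejected by the parser, which matches their treatment as the valid name with the systematic name as a synonym.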
All genes known to FlyBase are named by those names and symbols declared by FlyBase as valid. In addition, the historical names of the lethals identified by the genetic analysis of the Adh region are given.
Availability of data and materials:
The DNA sequence of the Adh region is made available for file transfer protocol (ftp) and searching (using BLAST) at http://www.fruitfly.org/data/genomic_fasta/Adh_and_cactus. All sequence data from genomic clones, ESTs, cDNAs, and P-element flanking regions are deposited in GenBank. Supplementary tables of data, cited in this article as Tables S1, S2, and S3, are available from http://www.genetics.org/supplemental/. Accession numbers for the genomic sequences are given in Table 2, for P-element flanking regions in Table S1 (http://www.genetics.org/cgi/content/full/153/1/179/DC1), and for cDNAs and ESTs in Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2). P1 clones are available from laboratories listed on FlyBase. cDNA clones are available from Research Genetics (Huntsville, AL) or from Genome Systems (St. Louis, MO). BAC clones (library RPCI-98) are available from Dr. P. de Jong (Roswell Park Cancer Institute, Buffalo, NY). P-element alleles are available from the Bloomington and Szeged Drosophila Stock Centers or from the Berkeley Drosophila Genome Project (BDGP). The annotated sequences can be viewed through FlyBase as CloneCurator reports.
| RESULTS AND DISCUSSION |
|---|
The physical map and sequence of the Adh region:
The physical map of the Adh region was assembled and sequenced from P1 and BAC clones as described in MATERIALS AND METHODS. The P1 clones formed three contigs: one of 1,940,896 bp, one of 798,089 bp, and a third consisting of a single P1 clone. The gap between the 1.9-Mb and 0.8-Mb contigs could not be closed in P1 clones but was readily closed by screening the BAC library; it was found to be 43,803 bp in length. A BAC clone also linked the isolated P1 clone (DS07660) to the distal end of the 1.9-Mb contig. This gap was 35,162 bp in length. The total length of sequence studied is 2,919,020 bp. A summary of the interpretation of this sequence is given in Figure 1, with an expanded view of three selected regions in Figure 2.
General features of the sequence:
The overall base composition of the sequence is 40.82% G + C, to be compared to the figure of 43% for the genome as a whole.
Gene prediction in the Adh region:
A primary objective of the sequence analysis was to identify genes, both protein coding and others (e.g., tRNA), in the 2.9 Mb of sequenced DNA. We predict the existence of 229 genes, of which 218 are predicted to be protein coding and 11 tRNA coding (Figure 1). The bases for the predictions are summarized in Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2). Forty-one of the protein-coding genes are predicted only on the basis of a high score with a gene-finding program; of these, 16 have both GENSCAN and GENEFINDER predictions (above the thresholds we used), 2 have only GENEFINDER predictions, and 23 only GENSCAN predictions. All of the other protein-coding genes are predicted by either (or both) sequence similarities (a BLAST score of P < 10^-7; 156, 71%) or a match with a Drosophila EST, cDNA, or genomic sequence (110, 52% of protein-coding genes). (Seventeen more genes had matches to Drosophila ESTs, but these matches were clearly due to the ESTs being derived from genes encoding similar sequences, i.e., from paralogous genes.)
It is important to get an estimate of the false-negative and false-positive frequencies of prediction. A GENSCAN threshold of 45 fails to predict 22 protein-coding genes predicted by other means (or known prior to this work). Of these 22, 10 have EST matches and 3 were known prior to this analysis (Mst35Ba, Mst35Bb, and cni). Lowering the threshold for GENSCAN to 30 would include 8 of these 22 false negatives, but this would also predict a further 25 protein-coding genes in this region, none of which would have any other support. The GENEFINDER program, at a threshold of 20, fails to predict 56 of the protein-coding genes. Of these false negatives, 35 have support from experimental data and 21 have support from GENSCAN predictions [Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2)]. One feature of GENSCAN that we have noticed is that its scores tend to be low in regions of very high gene density.
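The counts quoted above translate into miss rates as in the sketch below. The rates are computed against the 218 known or predicted protein-coding genes, which is itself partly a set of predictions, so they should be read as rates relative to the annotated set rather than true false-negative frequencies. The variable and function names are illustrative only.

```python
TOTAL_PROTEIN_CODING = 218  # known or predicted protein-coding genes in the region


def miss_rate(missed, total=TOTAL_PROTEIN_CODING):
    """Fraction of annotated protein-coding genes that a program fails to predict."""
    return missed / total


# Counts taken from the text.
genscan_missed_at_45 = 22
genscan_recovered_at_30 = 8      # of the 22 missed at threshold 45
genscan_unsupported_at_30 = 25   # extra predictions with no other support
genefinder_missed_at_20 = 56

genscan_missed_at_30 = genscan_missed_at_45 - genscan_recovered_at_30

if __name__ == "__main__":
    print(f"GENSCAN, threshold 45: {miss_rate(genscan_missed_at_45):.1%} missed")
    print(f"GENSCAN, threshold 30: {miss_rate(genscan_missed_at_30):.1%} missed, "
          f"{genscan_unsupported_at_30} additional unsupported predictions")
    print(f"GENEFINDER, threshold 20: {miss_rate(genefinder_missed_at_20):.1%} missed")
```

Laid out this way, the trade-off at the lower GENSCAN threshold is explicit: 8 genes are recovered at the cost of 25 predictions that have no other support.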
ESTs and cDNA sequences of genes in the Adh region:
Even the best computational methods are imperfect in their ability to determine the intron-exon structures of genes from genomic sequence alone. Moreover, because such methods rely on information from codon usage and the maintenance of open reading frames, they are inherently unable to predict the presence of introns in 5' or 3' untranslated regions or to predict the transcriptional start sites. For these reasons it is necessary to isolate and sequence cDNAs (or RT-PCR products). We have used sequence matches between the genomic sequence and 5' ESTs as a rapid way of identifying cDNAs for sequencing [see MATERIALS AND METHODS; Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2)]. cDNAs corresponding to 95 genes were identified by matches to ESTs (44% of known or predicted protein-coding genes) at a time when the total number of Drosophila ESTs available was 53,000.
Of the 68 protein-coding genes for which there was some prior knowledge (i.e., both genetic and molecular data or molecular data alone), 50 (74%) have ESTs; of the 150 genes that are newly discovered, only 44 (29%) have ESTs. This is a rather surprising result. It may indicate either a bias in the sample of genes that had already been studied or an overprediction of new genes, or it may be a biologically interesting result (see below).
P-element hits:
Several collections of lethal P elements were screened against deletions that, in sum, covered the entire Adh region.
Gene density in the Adh region:
Of the 229 genes, 218 are protein coding and 11 are tRNAs. The average gene density for protein-coding genes is one per 13.4 kb. The average size of the genes, as estimated both from computational analysis and the "full"-length cDNAs, is 5.5 kb (from ATG to terminator, including introns). The average gene density of one gene per 13.4 kb hides enormous variation in density. Some regions are very dense, with genes being separated by only a few hundreds of base pairs; others are, by comparison, very gene poor (see Figure 1 and Figure 2).
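The average densities follow directly from the sequence length and the feature counts given in this article. A minimal sketch, with an illustrative function name:

```python
REGION_BP = 2_919_020        # total length of sequence studied
PROTEIN_CODING_GENES = 218
TRANSPOSABLE_ELEMENTS = 17   # from the abstract


def kb_per_feature(region_bp, count):
    """Average spacing, in kb, of `count` features over `region_bp` base pairs."""
    return region_bp / count / 1000


if __name__ == "__main__":
    print(f"{kb_per_feature(REGION_BP, PROTEIN_CODING_GENES):.1f} kb per protein-coding gene")
    print(f"{kb_per_feature(REGION_BP, TRANSPOSABLE_ELEMENTS):.1f} kb per transposable element")
```

These are region-wide averages only; as noted above, they conceal large local variation in gene spacing.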
There are few studies of long genomic sequences of Drosophila that we can use for comparison with the Adh region. Preliminary analyses of 2 Mb of genomic sequence from region 13 of the X chromosome give a gene density of one gene per 8 kb (T. BENOS and M. ASHBURNER, unpublished analyses of European Drosophila Genome Project data). In the 338-kb bithorax region there are 13 known or predicted genes (1 per 24 kb), but 3 of these (Ubx, abd-A, and Abd-B) are exceptionally large (22 to 78 kb for their coding regions alone). In the Antp region Celniker et al. (S. CELNIKER, B. PFEIFFER, J. KNAFELS, C. MAYEDA, C. MARTIN and M. PALAZZOLO, unpublished data) have identified 26 protein-coding genes in 430 kb, a density of 1 gene per 16.5 kb.
Transcriptional bias:
The number of genes transcribed from each DNA strand is approximately equal (121 vs. 108). In very gene-dense regions there is a strong tendency for the direction of transcription to alternate (see Figure 1); overall, however, the pattern of transcriptional direction appears to be random. This was tested by expressing the pattern as a binary string and attempting to compress it using the Lempel-Ziv compression algorithm.
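The idea behind the compression test can be sketched as follows. This is not the procedure used in this study: it substitutes Python's `zlib` (DEFLATE, which combines LZ77 with Huffman coding) for whatever Lempel-Ziv implementation was used, compares the observed pattern with shuffled versions of itself as an assumed reference, and runs on a made-up strand pattern rather than the real one.

```python
import random
import zlib


def compressed_size(bits: str) -> int:
    """Size in bytes of the zlib-compressed pattern (one character per gene)."""
    return len(zlib.compress(bits.encode("ascii"), 9))


def compression_test(bits: str, n_shuffles: int = 200, seed: int = 0):
    """Compare the compressibility of a strand pattern with shuffled copies.

    Returns (observed size, list of shuffled sizes, fraction of shuffles
    that compress at least as well as the observed pattern). A small
    fraction suggests nonrandom structure; a large one is consistent
    with a random order of transcriptional directions.
    """
    rng = random.Random(seed)
    observed = compressed_size(bits)
    chars = list(bits)
    shuffled_sizes = []
    for _ in range(n_shuffles):
        rng.shuffle(chars)
        shuffled_sizes.append(compressed_size("".join(chars)))
    fraction = sum(size <= observed for size in shuffled_sizes) / n_shuffles
    return observed, shuffled_sizes, fraction


if __name__ == "__main__":
    # Hypothetical pattern with the strand counts from the text (121 vs. 108);
    # the real gene order is not reproduced here.
    rng = random.Random(1)
    pattern = list("1" * 121 + "0" * 108)
    rng.shuffle(pattern)
    observed, sizes, fraction = compression_test("".join(pattern))
    print(observed, min(sizes), max(sizes), fraction)
```

A strictly alternating pattern compresses far better than any of its shuffles, which is the kind of signal the test is designed to detect.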
Estimates of total gene number in Drosophila:
Any estimate of total gene number, based on the analysis of the Adh region, depends on this region being "typical" of the genome as a whole, with respect to the number of genes. This is a difficult question to answer with any rigor. Genetically, there are no indications that the Adh region is atypical. The number of genes discovered by genetic analysis is, given the number of polytene chromosome bands included, very similar to that in other well-studied regions. Classical "saturation" studies give a ratio of lethal complementation groups to polytene chromosome bands of ~0.84 (Table 3); for the Adh region this ratio is 0.81.
Our estimates of the total gene number rely on estimates of the total DNA content of D. melanogaster. This has been independently estimated to be 170 Mb.
Simple arithmetic, 115 Mb/13.4 kb, gives an estimate of 8600 protein-coding genes for the Drosophila genome as a whole. This is a remarkably low number, being less than half as much again as the yeast S. cerevisiae (6000).
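The extrapolation can be written out explicitly. The sketch below uses only the figures quoted in the text (115 Mb, 13.4 kb per gene, and 6000 genes for yeast); the function name is illustrative, and the result inherits the assumption that the Adh region is typical of the genome.

```python
def estimated_gene_count(genome_kb, kb_per_gene):
    """Extrapolate a genome-wide gene count from an average gene spacing."""
    return genome_kb / kb_per_gene


GENOME_KB = 115_000      # 115 Mb, the figure used in the text
KB_PER_GENE = 13.4       # average spacing of protein-coding genes in the Adh region
YEAST_GENES = 6000       # S. cerevisiae, as quoted in the text

estimate = estimated_gene_count(GENOME_KB, KB_PER_GENE)
ratio_to_yeast = estimate / YEAST_GENES

if __name__ == "__main__":
    print(f"estimated protein-coding genes: {estimate:.0f}")
    print(f"ratio to yeast: {ratio_to_yeast:.2f}")
```

The ratio to yeast comes out below 1.5, which is what "less than half as much again" expresses.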
Local duplications of genes:
A number of genes in Drosophila have been found to exist as locally duplicated gene pairs. Members of a pair may be functionally distinct (e.g., en, inv) or functionally redundant (e.g., gsb-d, gsb-p; ph-d, ph-p). The most obvious model for the origin of gene pairs is unequal recombination.
In this chromosome region we have identified at least 12 (protein-coding) gene repeats. One had already been identified, first in Drosophila pseudoobscura.
Five genes, closely clustered in the region between RpII33 and Ance, show between 30 and 37% amino acid sequence similarities. These are BG:DS00941.11 to BG:DS00941.15, genes whose proteins are about the same size but all lack any sequence matches. BG:DS00180.7 to BG:DS00180.10, BG:DS00180.12, and BG:DS00180.14 are six genes, all with epidermal growth factor (EGF) domains, clustered within a few tens of kilobases just distal to rk. Their sequence similarities are not high, but are evidence of ancient duplications.
In the region between the lace and CycE genes there are six predicted genes within 21 kb, each encoding a protein of the astacin subfamily of Zn-metalloproteases.
The 38 genes in the 34C-36A region that appear to be members of tandem series represent 17% of the total number of protein-coding genes. This is a minimum estimate, because a BLASTP search of all 218 known and predicted protein sequences against themselves identifies other potential duplications, which require further study. Many of these duplications are very old, as judged by the sequence similarities between members of a set. Tandem series of genes are also a feature of C. elegans (THE C. ELEGANS SEQUENCING CONSORTIUM 1998; THE C. ELEGANS GENOME SEQUENCING PROJECT 1999) and Arabidopsis thaliana.
Genes within genes:
The first example of a gene known to be entirely included within another gene was that of a pupal cuticle protein gene (Pcp) fully encoded within an intron of ade3.
The inclusion of Adh within osp was first suggested by genetic data, because osp aberrations mapped to either side of Adh.
An open reading frame in the 5' intron of vasa (vig, for vasa intronic gene) was first identified by K. EDWARDS (personal communication) by a comparison of sequences from D. grimshawi with those from this project. There is another CDS within vasa: BG:DS00929.15 in the long third intron, first identified as a ubiquitous transcript from RNA blots with genomic DNA by P. LASKO (personal communication).
The phenotypes of overlapping and contiguous deletions and the search for more genes:
We have evidence that the genetic screens failed to recover mutations at loci expected to have scorable phenotypes; the failure to recover any alleles of beat is an example (see Appendix). One new lethal locus (l(2)35Fg) was discovered when the chromosome 2 P elements were systematically screened. One further genetic technique to discover genes is to systematically screen heterozygotes between two overlapping deletions. We have made transheterozygotes between all possible pairs of deletions that, by genetic criteria, abut; i.e., the distal end of one and the proximal end of another are located between the same pair of genes identified by mutant alleles. These pairs of deletions may or may not physically overlap.
Pairwise combinations (836) have been made and the genotypes scored for viability, male and female fertility, and obvious visible phenotypes. Although these phenotypes could be the result of the additive effects of haplo-insufficiency, we have predicted the existence of four lethal loci from these data, two loci required for male fertility and two loci required for female fertility (each "locus" could include more than one gene, of course). A variation on this protocol for the discovery of mutant phenotypes is to test combinations of deletions that are known to overlap by only one gene with a mutant phenotype in the presence of a transgene that is known independently to rescue the mutant phenotype. If the transgene rescues the deficiency heterozygote to phenotypic normality, then we can conclude that no other genes capable of giving a mutant phenotype are located in the deleted interval; and if not, then we can conclude the existence of a previously unsuspected locus.
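The selection of abutting deletion pairs can be illustrated with a small sketch. It assumes a simplified representation in which the genes identified by mutant alleles are numbered in map order and each deletion is described by the first and last gene it removes; the deletion names and coordinates below are hypothetical, and the real data involve breakpoints whose genetic limits are not always this sharply defined.

```python
from itertools import combinations
from typing import NamedTuple


class Deletion(NamedTuple):
    name: str
    first: int  # index (in map order) of the first gene removed
    last: int   # index of the last gene removed


def abutting_pairs(deletions):
    """Return pairs of deletions whose facing breakpoints lie between the
    same two adjacent genes, i.e. one ends at gene k and the other starts
    at gene k + 1, so that no gene known from mutant alleles is shared."""
    pairs = []
    for a, b in combinations(deletions, 2):
        left, right = (a, b) if a.first <= b.first else (b, a)
        if left.last + 1 == right.first:
            pairs.append((left.name, right.name))
    return pairs


if __name__ == "__main__":
    # Hypothetical deletions over ten ordered genes (indices 0 to 9).
    deletions = [
        Deletion("Df-A", 0, 2),
        Deletion("Df-B", 3, 5),
        Deletion("Df-C", 2, 4),
        Deletion("Df-D", 6, 6),
    ]
    for pair in abutting_pairs(deletions):
        print(pair)
```

Any phenotype seen in a transheterozygote for such a pair points to a gene, not yet identified by mutant alleles, lying in the interval between the two flanking known genes.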
Overlapping Ance- deletions are lethal, which is expected, since Ance itself is a vital gene. There is, however, evidence for another lethal near Ance, because the lethality of some, but not all, overlapping deletion pairs can be rescued by a 16.5-kb transformant that includes both Ance and anon-34Ea (carried on P{RACE}). l(2)34Ec is predicted on the basis of the failure of this transformant to rescue the lethality of, e.g., Df(2L)SR407/Df(2L)b82a1. This predicted gene is not in the overlap of, e.g., Df(2L)SR407/Df(2L)b74c6.
The existence of ms(2)35Bi, between the 5' exons of osp and l(2)35Bb, is predicted on the basis of viable, but male-sterile, overlapping deletion heterozygotes (see Appendix). l(2)35Cc is predicted on the basis of the recessive lethality of Df(2L)rd9.
The region between esg and sna is, genetically, rather complex.
fs(2)35Ec is inferred from the sterility of Df(2L)RA5 females heterozygous with 18 different deletions, e.g., Df(2L)TE35D-3. The existence of fs(2)35Ed is suggested by the sterility of Df(2L)RM5/Df(2L)TE35D-2 females and of four similar genotypes; this gene may correspond to beat-C. ms(2)35Eb is inferred from the male sterility of the heterozygote Df(2L)RA5/Df(2L)TE35D-14. The predicted female steriles, fs(2)35Ec and fs(2)35Ed, are tentative; we are concerned that these phenotypes may simply result from haplo-insufficiency, particularly for BicC.
There are several regions that are homozygous viable when deleted. We estimate that the longest of these, the overlap of Df(2L)A178 and Df(2L)A446, is 190 kb. This overlap deletes or disrupts four known genes (noc, Adh, Adhr, and osp), eight tRNA genes, and five predicted protein-encoding genes in the noc-BG:DS07721.3 interval.
The structure and function of gene products:
We have used three computational techniques to infer structural and functional attributes of the products of the genes predicted for this chromosome region. These are searches for protein motifs or domains using the PFAM and PROSITE databases, BLASTP similarities of the predicted open reading frames with proteins in the SWISSPROT and SPTREMBL databases, and some analysis of protein features using the PSORT and SAPS programs (see MATERIALS AND METHODS). We have been rather conservative in making these inferences, as we have for gene prediction in general. These functional inferences are summarized in Table S3 (http://www.genetics.org/cgi/content/full/153/1/179/DC3), using a classification now being developed by the Gene Ontology Consortium (FlyBase, Mouse Genome Informatics and the Saccharomyces Genome Database; GO 1999). Of the 218 known or predicted protein-coding genes, we know, from previous work by others, or have inferred, the function of fewer than half (91, 42%). Of these, 41 are obviously enzymes and 18 are predicted to be proteases; the rest cover the functional spectrum from structural proteins (e.g., cuticle protein) to growth factors and transporters. From our analysis of protein motifs we predict that 16 of the proteins are DNA or RNA binding; the PSORT analysis predicts that 82 are nuclear localized, but this may well be an overestimate. Some features of the domain analysis deserve further study, in particular the cluster of six genes (BG:DS00180.10 and neighbors) whose products are predicted to have EGF domains.
Evolutionary conservation:
Of the 218 known or predicted protein-coding genes, 156 (72%) have clear matches with those in other organisms [summarized in Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2)]. Of these, 120 have matches to the sequences of C. elegans, 69 to the sequence of S. cerevisiae, 35 to sequences of A. thaliana, 114 to sequences from rodents (nearly all mouse, with a few rat), 125 to human sequences, and 128 to rodent + human sequences. Thirty proteins have matches in yeast, C. elegans, Arabidopsis, and rodents + human, and 55 in yeast, C. elegans, and rodents + human. With the exception of S. cerevisiae and C. elegans (whose genomes are entirely sequenced, or almost so) these numbers reflect the available sequence data, although, overall, they are an impressive witness to the conservation of protein sequence across very different taxa. These sequence similarities are, of course, very useful for making functional inferences about new Drosophila genes; they must, however, be treated with some caution as the evolution of function and sequence may not be as tightly linked as is sometimes believed. We see evidence for this in the genes of this region; e.g., the fact that the three genes we first identified by their sequence characteristics as chitinases are in fact secreted imaginal disc growth factors, as has been shown experimentally (![]()
In addition to sequence similarities between genes in this chromosome region and sequences from other taxa, 49 of the predicted or known protein-coding genes have significant database matches outside the Adh region to the known protein universe of Drosophila. This is from a sample of only 2000 or so proteins, <15% of the expected total. The conclusion, which is no great surprise, is that nearly all proteins of Drosophila will be members of protein sequence families. In some cases the similarities in sequence between different proteins are very striking, e.g., the two "stress-activated" mitogen activated protein (MAP) kinases p38b and Mpk2 are 77% identical in sequence (see Appendix). There is no obvious clustering of the genes that are paralogs of genes in the Adh region; this would have been evidence of large-scale genomic duplications, such as are found in S. cerevisiae (![]()
Correspondence between known genes and the sequence:
One of the major objectives of this study was to identify the 73 genes known or predicted from the genetic analyses on the sequence and, if possible, to infer their function. For those that had been sequenced previously their identification was straightforward. Others have been identified by mapping to the sequence the sites of insertion of P-element alleles and by correlating the genetic and sequence maps. Forty-nine of these 73 genes have been identified on the sequence [see Figure 1 and Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2)]. For the remaining 24, candidate sequences can be identified, but no firm correlation can be made on the available data. Detailed consideration of these 49 genes and others of interest identified on the sequence is given in the Appendix.
Genes with phenotypes are more likely to be conserved:
Genes that can mutate to an observable phenotype are far more conserved than those that cannot. The data are shown in Table 4. We compare the sequence similarities between known and predicted proteins in two groups: the first is of all 218 proteins, the second just that subset of 49 encoded by genes for which we have phenotypically detectable mutant alleles. Even at a BLASTP threshold of P = 10^-50, 63% of the 49 genes with phenotypes (and known sequences) have sequence similarities in other taxa, compared to only 31% for the total sample of 218 genes. This difference is also observed if one only considers the comparisons to individual species, such as C. elegans and S. cerevisiae, whose genomes are completely sequenced; this argues that the observation cannot be due to an ascertainment bias.
We know, or predict from genetic data, that 73 out of 218 genes have mutant phenotypes. If we assume that the 24 genes that we have not yet managed to tie to the sequence are as conserved as the 49 that we have, then we can calculate the expected properties of the total sets of genes with and without mutant phenotypes. For example, we can predict 46/73 will have BLASTP hits to other species at an expectation of P = 10^-50. Because there are only 67 hits to other species from the total of 218 genes (at this cutoff) we can conclude that 63% of the genes with mutant phenotypes are conserved, but only 14% (21/(218-73)) of the genes without detectable mutant phenotypes. If we raise the BLASTP cutoff to P = 10^-100, then the numbers are even more striking: 37 and 2%, respectively, for genes of the two classes.
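The arithmetic behind these estimates can be reproduced with a few lines of code. The sketch below uses only the figures quoted in the text for the P = 10^-50 cutoff; the function and variable names are ours, and rounding the extrapolated count to the nearest whole gene is an assumption.

```python
# Figures quoted in the text for the BLASTP cutoff P = 10^-50.
TOTAL_GENES = 218            # known or predicted protein-coding genes
GENES_WITH_PHENOTYPE = 73    # known or predicted from genetic data
CONSERVED_FRACTION = 0.63    # measured on the 49 genes tied to the sequence
TOTAL_CONSERVED = 67         # genes with hits to other species at this cutoff


def conserved_split(total, with_phenotype, fraction, total_conserved):
    """Split the conserved genes between the two phenotype classes.

    Assumes the genes with phenotypes not yet tied to the sequence are as
    conserved as those that are (the assumption stated in the text).
    """
    # Extrapolate the measured fraction to all genes with phenotypes.
    conserved_with = round(fraction * with_phenotype)
    # The remaining conserved genes must lack a detectable phenotype.
    conserved_without = total_conserved - conserved_with
    without_phenotype = total - with_phenotype
    return conserved_with, conserved_without, conserved_without / without_phenotype


conserved_with, conserved_without, fraction_without = conserved_split(
    TOTAL_GENES, GENES_WITH_PHENOTYPE, CONSERVED_FRACTION, TOTAL_CONSERVED
)
print(f"{conserved_with}/{GENES_WITH_PHENOTYPE} genes with phenotypes conserved")
print(f"{conserved_without}/{TOTAL_GENES - GENES_WITH_PHENOTYPE} genes without "
      f"phenotypes conserved ({fraction_without:.0%})")
```

The same function could be applied at the P = 10^-100 cutoff, but the underlying counts for that cutoff are in Table 4 and are not restated here.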
We realize that this analysis has its limitations. The distinction between genes with and without discernible mutant phenotypes is not hard and fast, but we point out that the great majority of mutant phenotypes known in this chromosome region are very obvious, i.e., lethality, sterility, or marked changes to adult morphology. We can, in addition, have reasonable confidence that mutations have been detected in nearly all of the genes in this region that can mutate to these phenotypes.
Conserved genes are more highly expressed:
Genes known before this analysis are far more likely to have ESTs than those newly discovered (see above). We were concerned that this could indicate an overoptimism in predicting new genes. Yet the analysis of Table 4 shows that this cannot be so, or at least it cannot be the entire reason. Genes with BLAST similarities with P values <10^-7 are unlikely to be false predictions. Yet in the total data set of 218 genes we see that the fraction that have ESTs increases as the threshold becomes more stringent: for "all" species hits it is 48% at P = 10^-7, 53% at P = 10^-20, 60% at P = 10^-50, and 80% at P = 10^-100. Genes with mutant phenotypes have ESTs at an overall higher frequency than do those without phenotypes (Table 4). The observation that "conserved" genes are more highly expressed than are "nonconserved" genes, as judged by the occurrence of ESTs, was first made by ![]()
tRNA genes:
An initial rush of enthusiasm mapped many tRNA genes by in situ hybridization to the polytene chromosomes and many of these were subsequently cloned and sequenced (e.g., ![]()
Transposable elements:
About 12% of the genome of D. melanogaster is estimated to be composed of transposable element sequences, ribosomal DNA, and core histone genes (![]()
A new retrotransposon element has been identified. It has been called yoyo in view of its sequence similarity with an element of the medfly Ceratitis capitata with this name. The yoyo LTR seems to be a hotspot for P-element insertion; k08808, a lethal allele of l(2)35Bc, is inserted in an LTR of yoyo and at least four other examples are known of P elements in yoyo LTRs (PZ06264, EP(2)0533, EP(2)0396, and EP(2)0417).
About 1.8% of the sequence of the Adh region is within identified transposable elements. This is much less than the 9% of the genome as a whole estimated to be composed of such sequences (![]()
There are other sequences that are clearly related to those of transposable elements but whose identity cannot be confidently stated. For example, on P1 clone DS07108 there are three very A + T-rich sequence regions that show similarities to elements such as 297 and mdg1 but appear to be very degenerate. In addition, in an intron of crp there is an 860-bp sequence very similar to the repetitive element described as Su(Ste) (![]()
Breakpoint distribution:
We have genetically mapped 658 aberration breakpoints to this region of the Drosophila genome. Sixty-three breakpoints disrupt genes. Of these breakpoints many had previously been mapped to chromosome walks, usually in λ phage. Ninety-four of these were mapped to restriction fragments in the 450-kb "Adh" walk from Ashburner's laboratory (![]()



