Genetics, Vol. 153, 179-219, September 1999, Copyright © 1999

An Exploration of the Sequence of a 2.9-Mb Region of the Genome of Drosophila melanogaster: The Adh Region

M. Ashburnera,b, S. Misrad, J. Rootea, S. E. Lewisd, R. Blazejg, T. Davisc, C. Doyleg, R. Galleg, R. Georgeg, N. Harrisg, G. Hartzelld, D. Harveyd,e, L. Hongd, K. Houstong, R. Hoskinsg, G. Johnsona, C. Martin1,g, A. Moshrefig, M. Palazzolo2,g, M. G. Reesed, A. Spradlingf, G. Tsangd,e, K. Wang, K. Whitelawg, B. Kimmel2,g, S. Celnikerg, and G. M. Rubing,d,e
a Department of Genetics, University of Cambridge, Cambridge, CB2 3EH, England,
b EMBL—European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, England,
c Department of Pathology, University of Wales College of Medicine, Cardiff, CF4 4XN, Wales,
d Berkeley Drosophila Genome Project, Department of Molecular and Cell Biology, University of California, Berkeley, California 94720-3200,
e Howard Hughes Medical Institute, Life Sciences Annex, University of California, Berkeley, California 94720,
f Howard Hughes Medical Institute, Carnegie Institution of Washington, Baltimore, Maryland
g Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, Berkeley, California 94720

Corresponding author: M. Ashburner, Department of Genetics, Downing St., Cambridge, CB2 3EH, England. E-mail: m.ashburner@gen.cam.ac.uk

Communicating editor: T. C. KAUFMAN


*  ABSTRACT

A contiguous sequence of nearly 3 Mb from the genome of Drosophila melanogaster has been sequenced from a series of overlapping P1 and BAC clones. This region covers 69 polytene chromosome bands on chromosome arm 2L, including the genetically well-characterized "Adh region." A computational analysis of the sequence predicts 218 protein-coding genes, 11 tRNAs, and 17 transposable element sequences. At least 38 of the protein-coding genes are arranged in clusters of 2 to 6 closely related genes, suggesting extensive tandem duplication. The gene density is one protein-coding gene every 13 kb; the transposable element density is one element every 171 kb. Of 73 genes in this region identified by genetic analysis, 49 have been located on the sequence; P-element insertions have been mapped to 43 genes. Ninety-five (44%) of the known and predicted genes match a Drosophila EST, and 144 (66%) have clear similarities to proteins in other organisms. Genes known to have mutant phenotypes are more likely to be represented in cDNA libraries, and far more likely to have products similar to proteins of other organisms, than are genes with no known mutant phenotype. Over 650 chromosome aberration breakpoints map to this chromosome region, and their nonrandom distribution on the genetic map reflects variation in gene spacing on the DNA. This is the first large-scale analysis of the genome of D. melanogaster at the sequence level. In addition to the direct results obtained, this analysis has allowed us to develop and test methods that will be needed to interpret the complete sequence of the genome of this species.

Before beginning a Hunt, it is wise to ask someone what you are looking for before you begin looking for it. MILNE 1926


IT is nearly 100 years since W. E. Castle and his colleagues at Harvard University introduced Drosophila melanogaster to the joys and rigors of scientific research (KOHLER 1994). From that slender beginning research with this small fly has dominated genetics and much of biology. It is, therefore, wholly appropriate that Drosophila melanogaster should join the new elite of organisms, as one whose genome will be sequenced in its entirety (MIKLOS and RUBIN 1996; RUBIN 1998). That goal is still some time away, but significant progress has already been made, with the determination of the complete sequences of the 338-kb bithorax and 430-kb Antennapedia regions (LEWIS et al. 1995; MARTIN et al. 1995; S. CELNIKER, B. PFEIFFER, J. KNAFELS, C. MAYEDA, C. MARTIN and M. PALAZZOLO, unpublished results) and with over 40 Mb of genomic sequence now available in the public domain (BERKELEY DROSOPHILA GENOME PROJECT 1999; EUROPEAN DROSOPHILA GENOME PROJECT 1999). There are many reasons, both pragmatic and theoretical, for wanting to complete the sequence of a model organism such as Drosophila. On a practical level, the availability of this sequence will be of immediate benefit to all studying particular genes. More theoretically, only by the completion of this sequence can we contemplate a description of the protein universe of Drosophila, can we answer with assurance the question of gene number in Drosophila, can we know the nature, number, and distribution of noncoding regions of DNA (including transposable elements), or can we explore the Drosophila genome for regularities in sequence organization that may correlate with chromosome organization. Moreover, the availability of the complete sequence of Drosophila will itself be a major impetus to evolutionary studies and to comparative insect genomics. Finally, but by no means least important, the sequence itself will spur functional studies, themselves of great interest to all biologists, especially those struggling to interpret the function of genes of the larger genomes of mammals.

The analysis and interpretation of long genomic sequences pose several unsolved problems, among which are gene prediction and correlation of genetically identified loci with computationally predicted genes. We have selected the 2.9-Mb Adh region, a region of the genome of D. melanogaster that was already well characterized by conventional genetic analyses, as a test-bed to develop and evaluate approaches to large-scale genomic sequence annotation in Drosophila. This chromosome region is defined as the 69 polytene chromosome bands from 34C4 to 36A2 on chromosome arm 2L, which is the region between (and including) the previously known genes kuzbanian (kuz) and dachshund (dac). Genetic analysis of this chromosome region began with the studies of E. H. Grell in the early 1960s and the recovery of an Adh- deletion, Df(2L)64j (GRELL et al. 1968). W. Sofer and students, especially J. M. O'Donnell (O'DONNELL et al. 1977), recovered several more deletions, using formaldehyde as a mutagen, and defined 12 loci by complementation analysis among 33 EMS-induced lethal mutations uncovered by these deletions. These studies have been continued in the last 20 years by M. Ashburner's group (e.g., WOODRUFF and ASHBURNER 1979A; WOODRUFF and ASHBURNER 1979B).

Genetic analysis has defined 73 genes in this chromosome region. Of these genes, 65 are represented by mutant alleles and 8 more are predicted on the basis of the phenotypes of overlapping deletions. Of those with mutant alleles, 50 genes have at least one lethal allele (i.e., they are genes whose activities are vital), 6 are known only from sterile alleles (2 male sterile and 4 female sterile), 8 only from alleles with clear visible phenotypes, and 2 genes have alleles with no gross phenotype: Adh and smi35A. Forty-nine protein-coding genes (and 5 tRNA genes) in this region had been molecularly characterized prior to or during our work; these included 7 that had not been identified by genetic analysis. In addition to a collection of over 1038 different mutant alleles of genes in this region, the genetic analysis was enormously aided by a very large collection of chromosome aberrations, including 86 inversions, 109 translocations, 317 deletions, and 40 duplications. Apart from some conventional recombination mapping in the early stages of the project, all genes have been ordered by deletion mapping. The genetic positions of the breakpoints of many inversions and translocations have been mapped with respect to the genes, often by combining these breakpoints with others to synthesize deletions or duplications.

These genetic data posed two major questions. The first was that of "saturation": What proportion of the genes had been identified by the genetic analysis? It is well known (e.g., BARRETT 1980) that the distribution of mutant hits to genes defies any rigorous statistical estimation of the size of the class of genes that are mutationally silent (see LEFEVRE and WATKINS 1986). This is particularly true in the present case, since many independent mutagenesis screens using a variety of deletions have been done, as have several specific locus screens. These mutation screens have been done with a variety of chemical agents, with ionizing radiation and with P elements, and although the most mutable genes in general screens have 50 or more alleles (e.g., wb and esg), we already know, or predict, some genes that have been refractory, including those eight genes predicted from overlapping deletion phenotypes. Moreover, we had no experimental estimate of the number of genes that give no phenotype when mutant (see below). The second question is that raised by the very nonrandom clustering of aberration breakpoints. There are two extreme interpretations of this clustering: that the different regions differ in target size or that there is some intrinsic property that biases the recovery of chromosomal breaks. Both this question, and that of "saturation," have been answered from the analysis of the sequence of this region.

There is direct experimental evidence, or prediction, for 229 genes in the 2.9 Mb of sequenced DNA. Of these, there is evidence for function or some hint of function from sequence matches for 102 genes. One of the challenges for the future is to discover, by experiment, the function of all of the genes.


*  MATERIALS AND METHODS

Genetics:
All of the mutations and chromosome aberrations used in this study are fully described in FlyBase (FLYBASE CONSORTIUM 1999). Table 1 presents a summary of the mutations that have been identified. The majority of these have been published in previous articles from M. Ashburner's laboratory, and others have been given to us by colleagues; those that are new are described in FlyBase. Where possible we have mapped aberration breakpoints genetically by combining the elements of translocations (by segregation) or inversion breakpoints (by recombination, using autosynaptic intermediates in the case of pericentric inversions; see GUBB 1998) so as to synthesize deletions whose limits could be mapped by complementation. All genetic crosses were, unless otherwise stated, done between balancer heterozygotes and care was always taken to allow any very delayed progeny to eclose. A failure of complementation is based upon the absence of nonbalancer progeny, usually in progenies of 200 flies or more. Crosses were routinely done on standard laboratory food at 25°.


 

 
Table 1. Genes in the Adh region identified by genetic analysis

P elements from several laboratories, from screens for lethal P elements on chromosome 2 (see SPRADLING et al. 1995), were screened against three deletions, Df(2L)b84a7, Df(2L)A48, and Df(2L)r10, that in sum cover the entire genetic interval of interest, and were then mapped more precisely using appropriate deletions and mutant alleles. We are very grateful to I. Kiss for the preliminary screen with his P-element collection. Further P elements were initially identified only on the basis of the chromosomal mapping of their insertion site by in situ hybridization to polytene chromosomes, using a P-element probe and standard techniques. These were then subjected to genetic analysis, typically tests for complementation with appropriate deletions and mutant alleles representative of candidate loci. The EP lines used in this study were from the collection described by RORTH et al. 1998.

P-element excisions and male recombinants were generated using P{Δ2-3}99B as the source of an active P transposase. These derivatives were then characterized by conventional genetic complementation analyses.

Cytology:
For conventional polytene chromosome analysis we used propionic-carmine-orcein squash preparations. In situ hybridization was performed by standard procedures using biotinylated probes and horseradish peroxidase staining. Polytene chromosomes were interpreted using the revised maps of C. B. and P. N. Bridges (see LEFEVRE 1976).

Clones:
The P1 clone library, with an average insert size of 80 kb, was that prepared from an isogenic y; cn bw sp stock in the vectors pNS583tet14Ad10 and pAd10sacBII (STERNBERG 1990) and described by SMOLLER et al. 1991. The strategy for building contigs of overlapping clones has been described by KIMMERLY et al. 1996. The first stage was to build a "framework" map of the genome of D. melanogaster by mapping over 2600 of the P1 clones to the polytene chromosomes by in situ hybridization (HARTL et al. 1994). Then, short sequence-tagged sites (STSs) were used to determine overlaps between P1 clones by STS-content mapping, using a PCR-based approach (OLSON et al. 1989; GREEN and OLSON 1990). STS sequences were derived from a number of sources: end sequences of P1 clones, insertion sites of P elements determined after plasmid rescue or inverse PCR, and sequences of known Drosophila genes. BAC clones were from a newly constructed library in pBACe3.6 (OSOEGAWA et al. 1998; K. OSOEGAWA, A. MAMMOSER and P. DE JONG, unpublished results). This is a 20-hit library from a partial EcoRI digestion of DNA from the y; cn bw sp isogenic stock.

The P1 clones were first assembled into eight contigs by screening a 5-hit P1 clone library. By generating STS sequences determined from the ends of these contigs, and then mapping these to a second larger P1 clone library (10 hit), and by directed PCR experiments, these contigs were assembled into two, of 0.8 Mb and 1.9 Mb, plus an isolated P1 clone containing the kuzbanian gene. The gaps between the two long contigs and between the isolated P1 clone and the 1.9-Mb contig were closed by screening the BAC clone library with sequences prepared from the appropriate end clones.

DNA sequencing:
The sequence of the Adh region has been assembled by first determining the sequences of the 51 individual P1 clones that comprise the 0.8-Mb and 1.9-Mb contigs. The gap between the two contigs was filled by sequencing the BAC clone BACR44L22. The gap between the P1 clones DS07660 and DS01368 was filled by sequencing BACR48E02. Table 2 lists the clones sequenced and their DDBJ/EMBL/GenBank accession numbers.


 

 
Table 2. Sequenced P1 and BAC clones in region 34D-36A

The sequencing strategies have evolved over time. Essentially, ca. 3-kb subclone libraries of randomly sheared DNA were prepared from each P1 clone in plasmid vectors. The sequences of both ends of each plasmid insert were determined using primers complementary to the vector and these sequences were used to assemble a set of overlapping 3-kb clones that span an entire P1 clone. The 3-kb clones were then sequenced using a combination of transposon-mediated sequencing (KIMMEL et al. 1997) and custom oligonucleotide-primed sequence runs. All sequences were determined on both DNA strands and assembled using the PHRAP program (P. GREEN, unpublished results). The error rate was estimated using PHRAP quality scores as <1 in 10,000. We wrote our own genomic assembler to generate a single complete sequence of the entire region from the individual clone sequences. The core alignment software used in this assembler was the sim4 program of FLOREA et al. 1998. The assembler iteratively runs sim4 against pairs of sequences that are known to overlap from the physical mapping data. The assembler then uses the exact alignment that covers the two ends of the clones to incrementally construct the complete sequence, performing reverse complementation when needed.
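The incremental construction described above can be illustrated with a minimal sketch. It is not the assembler used for this work: an exact suffix-prefix match stands in for the sim4 alignment, the clones are assumed to be supplied in physical-map order, and the function names and the `min_len` parameter are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def exact_overlap(left: str, right: str, min_len: int = 20) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`.

    A simple stand-in for a real aligner; returns 0 when no overlap of at
    least `min_len` bases exists.
    """
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(clones: list[str], min_len: int = 20) -> str:
    """Merge clones, given in physical-map order, into a single sequence."""
    assembled = clones[0]
    for clone in clones[1:]:
        # Try the clone as given, then its reverse complement.
        for candidate in (clone, reverse_complement(clone)):
            size = exact_overlap(assembled, candidate, min_len)
            if size:
                assembled += candidate[size:]  # append only the new bases
                break
        else:
            raise ValueError("no overlap found with the next clone")
    return assembled
```

Real clone sequences contain sequencing differences in their overlaps, which is why an alignment program rather than exact string matching is needed in practice.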

cDNA identification and sequencing:
cDNA clones derived from genes in the 34D-36A region were identified by searching for sequence matches between the genomic DNA sequence and 5' expressed sequence tags (ESTs) from the Berkeley Drosophila Genome Project (BDGP)/Howard Hughes Medical Institute (HHMI) Drosophila EST project (http://www.fruitfly.org/EST/). In addition, cDNAs corresponding to crp, heix, l(2)35Fe, anon-35Fa, anon-35F/36A, BG:DS02740.2, BG:DS02740.4, BG:DS02740.8, BG:DS02740.9, and BG:DS02740.10 were isolated by screening the LD cDNA library using the method of MUNROE et al. 1995. The LD cDNA library was made from poly(A)+-selected RNA from 0–22-hr embryos, size fractionated (~1 to 6 kb), and directionally cloned in either the Stratagene (La Jolla, CA) Uni-Zap XR vector or the pOT2 plasmid (both EcoRI/XhoI digested; L. HONG, unpublished results). For each gene, the longest available cDNA was sequenced from one strand to allow unambiguous alignment with the genomic sequence. The cDNA sequences were aligned with the genomic sequence using the sim4 program of FLOREA et al. 1998. Because these cDNA sequences were low-pass, single-strand reads, it was not always possible to construct a single open reading frame from sim4 alignments. In those cases, adjustments were made by an annotator. The virtual cDNA sequences were verified using the ORFfinder program (v. 0.1, E. FRISE, unpublished results) and their structures relative to the genomic sequence manually checked in CloneCurator (see below).

Molecular mapping of P-element insertion sites:
The precise insertion sites of all P elements described here were determined by comparison of the reference genomic sequence with a sequence that spanned the junction between a P element and the genome using sim4. These junction sequences were determined from either plasmid-rescued clones or inverse PCR products, as described in SPRADLING et al. 1999. The insertion site is reported as the first base pair of the 8-bp target site duplication generated by the P-element insertion.
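The coordinate convention can be made concrete with a small sketch. It assumes, for illustration only, that the genomic flank has already been trimmed of P-element sequence, is in reference orientation, and matches the reference exactly once; exact string search stands in for the sim4 alignment, and the function name and `side` argument are hypothetical.

```python
TSD_LENGTH = 8  # P elements generate an 8-bp target site duplication


def insertion_site(reference: str, flank: str, side: str) -> int:
    """Return the 1-based coordinate of the first base of the duplicated target site.

    side="right": the flank begins at the junction, so it starts with the
                  8-bp duplication.
    side="left":  the flank ends at the junction, so it ends with the
                  8-bp duplication.
    """
    start = reference.find(flank)
    if start == -1 or reference.find(flank, start + 1) != -1:
        raise ValueError("flank must match the reference exactly once")
    if side == "right":
        return start + 1
    if side == "left":
        return start + len(flank) - TSD_LENGTH + 1
    raise ValueError("side must be 'left' or 'right'")
```

Because the target site is duplicated on both sides of the element, flanks recovered from either side should give the same reported coordinate.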

Sequence analysis:
Two broad categories of computational method were used together to predict and identify genes. The first was gene prediction algorithms, based on the statistical properties of protein-coding regions. The second category of method used alignment algorithms for predictions based upon similarities of the sequence with other sequences in the public domain, both nucleic acid and protein.

The main gene prediction program used in the early stages of this analysis was GENEFINDER (v. 0.83; GREEN 1995), trained on a Drosophila sequence data set (G. HELT, unpublished results). GENEFINDER predicts genes on the basis of the statistical properties of their sequence, codon usage, codon preference, and splice site profiles. More recently, we made a comparison of the performance of a number of different programs using the sequence of the P1 clone DS02740. This showed that GENSCAN (v. 1.0; BURGE 1997; BURGE and KARLIN 1997), trained on a vertebrate sequence data set, gave more reliable predictions than GENEFINDER, GENIE (REESE et al. 1997), or a version of GRAIL trained on a Drosophila sequence training set (XU et al. 1995). The same comparison showed a tendency for GENSCAN to overpredict genes, a characteristic complemented by GENEFINDER, which tends to underpredict them. For this reason, both programs were used for the final data analyses, using their default parameters. Predictions with scores lower than 45 for GENSCAN or 20 for GENEFINDER were ignored. No current gene prediction program behaves well with introns that are either very large or very small, and these errors were corrected, whenever possible, by using available alignment data. tRNA genes were predicted using the tRNAscan-SE program (v. 1.02) of LOWE and EDDY 1997.
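The score cutoffs can be expressed as a short filter. This is only a sketch of the thresholding step: the `Prediction` record and the function names are hypothetical, and the programs' real output formats would need to be parsed into such records first.

```python
from dataclasses import dataclass

# Cutoffs stated in the text: scores below these values were ignored.
THRESHOLDS = {"GENSCAN": 45.0, "GENEFINDER": 20.0}


@dataclass(frozen=True)
class Prediction:
    program: str   # "GENSCAN" or "GENEFINDER"
    gene_id: str   # hypothetical identifier for the predicted gene
    score: float


def filter_predictions(predictions):
    """Keep predictions whose score reaches the cutoff for their program."""
    return [p for p in predictions if p.score >= THRESHOLDS[p.program]]


def supporting_programs(predictions):
    """Map each gene to the set of programs with a retained prediction."""
    support = {}
    for p in filter_predictions(predictions):
        support.setdefault(p.gene_id, set()).add(p.program)
    return support
```

Grouping retained predictions by gene makes it easy to separate genes supported by both programs from those supported by only one.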

To estimate the statistical properties of D. melanogaster protein-coding regions a nonredundant data set of coding regions (CDS) was made. By nonredundant we mean that for any one gene only one CDS is included, even if the gene encodes multiple protein products (the one included was usually the longest complete sequence available from the EMBL Nucleic Acid Sequence Data Library). All of the CDS regions were checked for legitimate start and stop codons and for a continuous open reading frame in between these. Four genes with non-ATG starts were included in this data set (CTG, amn, ewg; GTG, Cha; CTC, cpo) following advice from D. Cavener, as were two CDSs (oaf and kelch) with in-frame UGA codons, perhaps coding for selenocysteine. This data set of 1335 CDSs was used for the construction of normalized codon and di-codon (hexamer) tables (HELT 1997) and is available as cds_sequence_set.embl.v1.5 from ftp://ftp.ebi.ac.uk/pub/databases/edgp/sequence_sets/ and as na_embl.dros.v1.5 from http://www.fruitfly.org/sequence/download.html.
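A minimal sketch of the two steps just described, checking each CDS and tabulating in-frame codon and di-codon frequencies, is given below. The normalization used by HELT 1997 is not specified here, so simple relative frequencies are assumed; the function names are hypothetical, and the exceptional CDSs mentioned above (non-ATG starts, in-frame UGA) would need to be admitted explicitly.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_cds(cds: str, allowed_starts=("ATG",)) -> bool:
    """Check for a legitimate start codon, a terminal stop codon,
    and no in-frame stop codon in between."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] not in allowed_starts or codons[-1] not in STOP_CODONS:
        return False
    return not any(codon in STOP_CODONS for codon in codons[1:-1])


def normalized_table(cds_list, word_codons: int) -> dict:
    """Relative frequencies of in-frame words spanning `word_codons` codons
    (1 gives a codon table, 2 gives a di-codon, i.e. hexamer, table)."""
    counts = Counter()
    width = 3 * word_codons
    for cds in cds_list:
        for i in range(0, len(cds) - width + 1, 3):
            counts[cds[i:i + width]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}
```

A non-ATG start such as CTG can be allowed for a specific gene by passing `allowed_starts=("ATG", "CTG")`.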

Databases against which similarity searches were made included GenBank, dbEST, SWISS-PROT, SPTREMBL, and sequences from the European Drosophila Genome Project (EDGP). Updates of these were collected weekly, the sequence data sorted into species-specific files, and all submissions from the Berkeley Drosophila Genome Project removed to provide data sets for searches. These data sets were then processed to append all database cross-references to FASTA header lines. For sequence similarity searches the BLASTN, BLASTX, and TBLASTX programs (version 2.0a) of W. GISH (unpublished results) were used (with the option B = 1,000,000, options filter = SEG + XNU).

Transposable elements were screened using a nonredundant data set of transposable element sequences from which all "flanking" DNA sequences had been trimmed. This data set was originally derived from the EMBL Nucleotide Sequence Data Library records, but as our analysis progressed more complete sequences of elements only known before from partial sequence were added, replacing incomplete sequences. This data set is available from ftp://ftp.ebi.ac.uk/pub/databases/edgp/sequence_sets/transposon_sequence_set.embl and from http://www.fruitfly.org/sequence/download.html (as na_te.dros).

A collection of repetitive sequences from D. melanogaster, not otherwise included in the transposable element sequence set, was also made. This data set includes, e.g., satellite DNA sequences and a miscellany of sequences annotated as being repetitive by FlyBase. It is not as nonredundant as the other two data sets, and was only used for screening for sequences similar to those previously described as repetitive. The data set is available from ftp://ftp.ebi.ac.uk/pub/databases/edgp/sequence_sets/repeat_sequence_set.embl and http://www.fruitfly.org/sequence/download.html (as na_re.dros).

The data output from these various computational analyses is voluminous and requires intelligent filtering to remove redundant and irrelevant information before being passed to the human annotators. Moreover, the task of annotation is almost impossible without tools for the visualization of these data. An application, BLAST Output Parser (v. 01; BOP), was written (S. LEWIS, unpublished results). BOP summarizes all automatically computed analysis data for an individual sequence into one file (i.e., all output from the programs mentioned previously: BLAST, GENSCAN, etc.). This file is in XML syntax. BOP also removes as much of the "noise" as possible (e.g., redundant matches, "shadow" matches on the noncoding strand, and matches to sequences of very biased base composition). These condensed data were then presented to the annotator in a graphical view (CloneCurator v. 0.1; S. LEWIS, N. HARRIS, S. MISRA and G. HELT, unpublished results).

CloneCurator was used to isolate individual genes from the clone sequences, based on expert evaluation of these analyses. CloneCurator allowed the annotator to compare results from different programs and to view the results using filters to determine a desired level of probability of prediction. The annotator used this visual summary to endorse a set of results as evidence, thereby generating a verified annotation. Annotations can be edited in CloneCurator and the annotators can add textual comments to any particular annotation, assign gene symbols, etc. This program was used to generate nucleic acid and amino acid FASTA files for each gene annotation. When a gene spanned more than one clone, manual intervention by an annotator was necessary to construct virtual mRNA sequences.

Open reading frames of predicted genes were validated using ORFfinder (v. 0.1; E. FRISE, unpublished results) and all predicted proteins were then tested with BLASTP (v. 2.0a) with the options filter = SEG + XNU (unless the results are stated as being "unfiltered") against SWISS-PROT and SPTREMBL protein sets organized into nine taxonomic groups (Drosophila, Caenorhabditis elegans, Saccharomyces cerevisiae, other invertebrates, primates, rodents, other vertebrates, plants, and bacteria). Matches with an expectation greater than P = 10^-7 (i.e., less significant) were ignored.

Protein domains and motifs were analyzed against the PROSITE (release 15.0; HOFMANN et al. 1999) and PFAM (v. 2.1.1; SONNHAMMER et al. 1997; BATEMAN et al. 1999) databases using the programs PPSEARCH [a Unix implementation of MacPattern at http://www2.ebi.ac.uk/services.html (FUCHS 1994)] and HMMER2.1 (EDDY 1998). PROSITE output was filtered using EMOTIF (NEVILL-MANNING et al. 1998) at the European Bioinformatics Institute (EBI). The SAPS program (version of July 23, 1993; BRENDEL et al. 1992) was run from the EBI server (http://www2.ebi.ac.uk/SAPS/) to analyze various compositional features of predicted protein sequences. The PSORTII suite of programs (HORTON and NAKAI 1997), trained on the proteins of S. cerevisiae, was used to predict the subcellular localization of proteins. Sequence alignments were generated using CLUSTALW (HIGGINS et al. 1996) from the European Bioinformatics Institute server (http://www2.ebi.ac.uk/services.html).

The output from the various sequence analysis programs is archived on FlyBase as FlyBase-Annotation files linked to the sequenced clones. Version 1 of these files includes the analyses used for this article. Subsequent versions will result from reanalysis of the sequence data.

Nomenclature:
All genes are named according to the conventions agreed between the Berkeley and European Drosophila Genome Projects and FlyBase (http://flybase.bio.indiana.edu/docs/nomenclature). Each gene is given a unique name composed of three parts: a prefix (BG for genes defined by the Berkeley Project, EG for those defined by the European Project), followed by a clone name and an integer. The clone name is that of the clone on which the gene was first defined (regardless of whether or not the gene overlaps more than one clone). The final integer is simply a serial number, and does not imply the order of a gene within a clone. An example is BG:DS09218.6, the sixth gene annotated on P1 clone DS09218. If a gene was already known to FlyBase, then a formal name is still assigned but will be treated by FlyBase as a synonym of the established name.
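The three-part naming scheme lends itself to simple parsing. The sketch below is illustrative only: the regular expression assumes that clone names are plain alphanumeric strings, which may not hold for every clone, and the `GeneName` type and function name are hypothetical.

```python
import re
from typing import NamedTuple


class GeneName(NamedTuple):
    project: str  # "BG" (Berkeley) or "EG" (European)
    clone: str    # clone on which the gene was first defined
    serial: int   # serial number; does not imply order within the clone


# Assumed pattern: prefix, colon, alphanumeric clone name, dot, integer.
_NAME = re.compile(r"^(BG|EG):([A-Za-z0-9]+)\.(\d+)$")


def parse_gene_name(name: str) -> GeneName:
    """Split a project gene name such as 'BG:DS09218.6' into its parts."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a project gene name: {name!r}")
    project, clone, serial = match.groups()
    return GeneName(project, clone, int(serial))
```

For example, `parse_gene_name("BG:DS09218.6")` yields the prefix `BG`, the clone `DS09218`, and the serial number 6.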

All genes known to FlyBase are named by those names and symbols declared by FlyBase as valid. In addition, the historical names of the lethals identified by the genetic analysis of the Adh region are given.

Availability of data and materials:
The DNA sequence of the Adh region is made available for file transfer protocol (ftp) and searching (using BLAST) at http://www.fruitfly.org/data/genomic_fasta/Adh_and_cactus. All sequence data from genomic clones, ESTs, cDNAs, and P-element flanking regions are deposited in GenBank. Supplementary tables of data, cited in this article as Tables S1, S2, and S3, are available from http://www.genetics.org/supplemental/. Accession numbers for the genomic sequences are given in Table 2, for P-element flanking regions in Table S1 (http://www.genetics.org/cgi/content/full/153/1/179/DC1), and for cDNAs and ESTs in Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2). P1 clones are available from laboratories listed on FlyBase. cDNA clones are available from Research Genetics (Huntsville, AL) or from Genome Systems (St. Louis, MO). BAC clones (library RPCI-98) are available from Dr. P. de Jong (Roswell Park Cancer Institute, Buffalo, NY). P-element alleles are available from the Bloomington and Szeged Drosophila Stock Centers or from the Berkeley Drosophila Genome Project (BDGP). The annotated sequences can be viewed through FlyBase as CloneCurator reports.


*  RESULTS AND DISCUSSION

The physical map and sequence of the Adh region:
The physical map of the Adh region was assembled and sequenced from P1 and BAC clones as described in MATERIALS AND METHODS. The P1 clones formed three contigs: one of 1,940,896 bp, one of 798,089 bp, and a third consisting of a single P1 clone. The gap between the 1.9-Mb and 0.8-Mb contigs could not be closed in P1 clones but was readily closed by screening the BAC library; it was found to be 43,803 bp in length. A BAC clone also linked the isolated P1 clone (DS07660) to the distal end of the 1.9-Mb contig; this gap was 35,162 bp in length. The total length of sequence studied is 2,919,020 bp. A summary of the interpretation of this sequence is given in Figure 1, with an expanded view of three selected regions in Figure 2.





 
Figure 1. A summary molecular map of the Adh region, covering 2.9 Mb of DNA. Genes located on the top of each map are transcribed from distal to proximal (with respect to the telomere of chromosome arm 2L); those on the bottom are transcribed from proximal to distal. The gene symbols used in this figure are boldface type; if not the formal symbol then the latter is shown in a lighter font (formal symbols are abbreviated, their BG: prefix being omitted from Figure 1 and Figure 2). P-element insertions are shown as triangles projecting to the molecular map. Red bars indicate transcribed regions, with intron-exon structures as predicted. Those in dark red are confirmed by a cDNA or were previously known; those in light red have only GENEFINDER or GENSCAN predictions (with cutoffs of 20 and 45, respectively). The blue and green boxes are BLASTX or TBLASTX matches detected using genomic DNA sequences from a GenBank submission (usually a single P1 or BAC clone) to search against sequences of other species in the databases. Similarities are shown in green for expectations between P = 10^-8 and P = 10^-50; blue for expectations of P = 10^-51 or lower. Once translations of predicted or known genes were used for BLASTP searches, some similarities that had not been detected using the nucleic acid sequence of the genomic clones were found. A summary of these BLASTP data is found in Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2). Transposable elements are indicated by black boxes and are named according to FlyBase. Genes defined genetically are shown above the map. Genes whose symbols are within square brackets are not tied to the map. These genes are indicated above a horizontal line when their order with respect to the genes below the line is not known. A scale in kilobases is shown; ~1 cm = 10 kb.



 
Figure 2. Enlarged views of the Sos-RpII33, l(2)35Bb-vas, and twe-chif regions. Symbols and conventions as in Figure 1. A scale in kilobases is shown; ~3 cm = 10 kb.

General features of the sequence:
The overall base composition of the sequence is 40.82% G + C, to be compared to the figure of 43% for the genome as a whole (LAIRD and MCCARTHY 1969). The G + C contents of functionally different regions of the sequence (protein-coding regions, introns, and intergenic spacer) are 49.7, 38.7, and 39.6%, respectively (intergenic regions may well be overestimated in size, because the gene prediction programs will have missed 5' exons distant from the body of a gene unless full-length cDNAs were available). The average number of exons per gene is 4.4, but this figure must be treated with caution for the reasons just mentioned.
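Base composition figures of this kind can be computed with a few lines of code. The sketch below assumes that feature classes are supplied as 0-based, half-open coordinate intervals and that ambiguous bases are excluded from the denominator; both conventions, and the function names, are assumptions made for illustration.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G + C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence has no unambiguous bases")
    return 100.0 * gc / (gc + at)


def gc_by_feature(seq: str, features: dict) -> dict:
    """G + C percentage for each feature class.

    `features` maps a class name (e.g. "coding", "intron", "intergenic")
    to a list of (start, end) intervals, 0-based and half-open.
    """
    return {
        name: gc_percent("".join(seq[start:end] for start, end in spans))
        for name, spans in features.items()
    }
```

Applied to an annotated sequence, `gc_by_feature` returns one percentage per class, which can then be compared with the overall value from `gc_percent`.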

Gene prediction in the Adh region:
A primary objective of the sequence analysis was to identify genes, both protein coding and others (e.g., tRNA), in the 2.9 Mb of sequenced DNA. We predict the existence of 229 genes, of which 218 are predicted to be protein coding and 11 tRNA coding (Figure 1). The bases for the predictions are summarized in Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2). Forty-one of the protein-coding genes are predicted only on the basis of a high score with a gene-finding program; of these, 16 have both GENSCAN and GENEFINDER predictions (above the thresholds we used), 2 have only GENEFINDER predictions, and 23 only GENSCAN predictions. All of the other protein-coding genes are predicted by either (or both) sequence similarities (a BLAST score of P < 10^-7; 156, 71%) or a match with a Drosophila EST, cDNA, or genomic sequence (110, 52% of protein-coding genes). (Seventeen more genes had matches to Drosophila ESTs, but these matches were clearly due to the ESTs being derived from genes encoding similar sequences, i.e., from paralogous genes.)

It is important to get an estimate of the false-negative and false-positive frequencies of prediction. A GENSCAN threshold of 45 fails to predict 22 protein-coding genes predicted by other means (or known prior to this work). Of these 22, 10 have EST matches and 3 were known prior to this analysis (Mst35Ba, Mst35Bb, and cni). Lowering the threshold for GENSCAN to 30 would include 8 of these 22 false negatives, but this would also predict a further 25 protein-coding genes in this region, none of which would have any other support. The GENEFINDER program, at a threshold of 20, fails to predict 56 of the protein-coding genes. Of these false negatives, 35 have support from experimental data and 21 have support from GENSCAN predictions [Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2)]. One feature of GENSCAN that we have noticed is that its scores tend to be low in regions of very high gene density.

ESTs and cDNA sequences of genes in the Adh region:
Even the best computational methods are imperfect in their ability to determine the intron-exon structures of genes from genomic sequence alone. Moreover, because such methods rely on information from codon usage and the maintenance of open reading frames, they are inherently unable to predict the presence of introns in 5' or 3' untranslated regions or to predict the transcriptional start sites. For these reasons it is necessary to isolate and sequence cDNAs (or RT-PCR products). We have used sequence matches between the genomic sequence and 5' ESTs as a rapid way of identifying cDNAs for sequencing [see MATERIALS AND METHODS; Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2)]. cDNAs corresponding to 95 genes were identified by matches to ESTs (44% of known or predicted protein-coding genes) at a time when the total number of Drosophila ESTs available was 53,000.

Of the 68 protein-coding genes for which there was some prior knowledge (i.e., both genetic and molecular data or molecular data alone), 50 (74%) have ESTs; of the 150 genes that are newly discovered, only 44 (29%) have ESTs. This is a rather surprising result. It may indicate either a bias in the sample of genes that had already been studied or an overprediction of new genes, or it may be a biologically interesting result (see below).

P-element hits:
Several collections of lethal P elements were screened against deletions that, in sum, covered the entire Adh region (see SPRADLING et al. 1995, 1999). We have also analyzed genetically P elements from these collections that had not been recovered in the screens for lethals or semilethals, but which were found to map to the region by in situ hybridization to polytene chromosomes or by a sequence match of the sequences flanking the P-element insertion (SPRADLING et al. 1999). Similarly, sequences flanking 2300 insertions of the EP element (RORTH et al. 1998) were determined (J. REHM and G. RUBIN, unpublished data) and used to identify 24 EP insertions in this region. From these screens, and from those identified by others, 181 independent P-element insertions in 43 genes have been identified [Table 1 and Table S1 (http://www.genetics.org/cgi/content/full/153/1/179/DC1)]. P-element insertions in 35 genes give a lethal, semilethal, sterile, or visible phenotype. In the remaining eight genes all known insertions are without obvious phenotypic effect.

Gene density in the Adh region:
Of the 229 genes, 218 are protein coding and 11 are tRNAs. The average gene density for protein-coding genes is one per 13.4 kb. The average size of the genes, as estimated both from computational analysis and the "full"-length cDNAs, is 5.5 kb (from ATG to terminator, including introns). The average gene density of one gene per 13.4 kb hides enormous variation in density. Some regions are very dense, with genes being separated by only a few hundreds of base pairs; others are, by comparison, very gene poor (see Figure 1 and Figure 2).

There are few studies of long genomic sequences of Drosophila that we can use for comparison with the Adh region. Preliminary analyses of 2 Mb of genomic sequence from region 1–3 of the X chromosome give a gene density of one gene per 8 kb (T. BENOS and M. ASHBURNER, unpublished analyses of European Drosophila Genome Project data). In the 338-kb bithorax region there are 13 known or predicted genes (1 per 24 kb), but 3 of these (Ubx, abd-A, and Abd-B) are exceptionally large (22 to 78 kb for their coding regions alone). In the Antp region Celniker et al. (S. CELNIKER, B. PFEIFFER, J. KNAFELS, C. MAYEDA, C. MARTIN and M. PALAZZOLO, unpublished data) have identified 26 protein-coding genes in 430 kb, a density of 1 gene per 16.5 kb. MALESZKA et al. (1998) predicted 12 genes within one 67-kb P1 clone from the base of the X chromosome (1 gene per 5.6 kb).

Transcriptional bias:
The number of genes transcribed from each DNA strand is approximately equal (121 vs. 108). In very gene-dense regions there is a strong tendency for the direction of transcription to alternate (see Figure 1); overall, however, the pattern of transcriptional direction appears to be random. This was tested by expressing the pattern as a binary string and attempting to compress it using the Lempel-Ziv compression algorithm (ZIV and LEMPEL 1977). The string did not compress any better than did 1000 randomly generated strings of the same length.
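The randomness test described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: Python's zlib (DEFLATE, which builds on LZ77) stands in for the Lempel-Ziv implementation; the random comparison strings are produced by shuffling the observed string, whereas the text says only that they had the same length; and the input is a made-up, strictly alternating pattern, because the real strand pattern is not reproduced here. The names compressed_size and compression_test are hypothetical.

```python
import random
import zlib


def compressed_size(bits: str) -> int:
    """Size in bytes of the zlib-compressed binary string."""
    return len(zlib.compress(bits.encode("ascii"), 9))


def compression_test(bits: str, n_random: int = 1000, seed: int = 0):
    """Compare the compressed size of `bits` with shuffled copies of itself.

    Returns (observed_size, fraction), where fraction is the share of random
    strings that compress at least as well as the observed string, with a
    +1 correction so that it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = compressed_size(bits)
    chars = list(bits)
    as_good = 0
    for _ in range(n_random):
        rng.shuffle(chars)  # assumption: random strings keep the same 0/1 counts
        if compressed_size("".join(chars)) <= observed:
            as_good += 1
    return observed, (as_good + 1) / (n_random + 1)


# Hypothetical input: 229 genes whose strands alternate perfectly.
# '0' and '1' stand for the two DNA strands.
alternating = ("01" * 115)[:229]
size, fraction = compression_test(alternating)
print(size, fraction)
```

A strongly patterned string such as this one compresses far better than its shuffled copies, so the returned fraction is small. A pattern like the one reported for the Adh region would instead give an observed size within the range of the random strings.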

Estimates of total gene number in Drosophila:
Any estimate of total gene number, based on the analysis of the Adh region, depends on this region being "typical" of the genome as a whole, with respect to the number of genes. This is a difficult question to answer with any rigor. Genetically, there are no indications that the Adh region is atypical. The number of genes discovered by genetic analysis is, given the number of polytene chromosome bands included, very similar to that in other well-studied regions. Classical "saturation" studies give a ratio of lethal complementation groups to polytene chromosome bands of ~0.84 (Table 3); for the Adh region this ratio is 0.81.


 

 
Table 3. Selected regions of the genome of D. melanogaster subjected to "saturation" genetic analysis for lethal complementation groups, showing the average ratio of lethal loci to polytene chromosome bands

Our estimates of the total gene number rely on estimates of the total DNA content of D. melanogaster. This has been independently estimated to be 170 Mb by RUDKIN 1972 (cited in KAVENOFF and ZIMM 1973), using UV microspectrophotometry of diploid ganglion cells; by RASCH et al. 1971, using Feulgen microspectrophotometry of sperm and haemocyte cells; and by KAVENOFF and ZIMM 1973, from the kinetics of relaxation of whole chromosome-length DNA molecules. The kinetics of reassociation of denatured DNA gave a slightly lower estimate (LAIRD 1971). Of this 170 Mb of DNA, some 21% is estimated to be low-complexity satellite sequence (LOHE and BRUTLAG 1987) and 12% transposable elements and other repeated sequences, such as the histone and rRNA genes (LAIRD and MCCARTHY 1968). This gives an estimate of ~115 Mb of "unique" DNA sequence.

Simple arithmetic, 115 Mb/13.4 kb, gives an estimate of 8600 protein-coding genes for the Drosophila genome as a whole. This is a remarkably low number, being less than half as many again as the yeast S. cerevisiae (6000; MEWES et al. 1997) and less than half the number now estimated for Caenorhabditis elegans (19,090; THE C. ELEGANS SEQUENCING CONSORTIUM 1998). An independent estimate can be made, knowing that the sequenced region covers 69 polytene chromosome bands, an average of 42 kb/band plus its adjacent interband [rather higher than Sorsa's estimate of 21.6 kb/band (SORSA 1988)]. The total band number is estimated to be 5160 (V. Sorsa, quoted in ASHBURNER 1989). In terms of band number, therefore, the Adh region is 1.34% of the total. If the density of genes per band in this region is typical of the genome as a whole, then this leads to an estimate of 16,975 genes. Our two estimates of the total gene number in D. melanogaster, 8600 and 16,975, bracket the estimate of 12,000 by MIKLOS and RUBIN 1996, based on the sizes of 276 individual genes.
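The two estimates can be retraced with a few lines of arithmetic. The sketch below uses only figures quoted in the text; the variable and function names are hypothetical. The text does not state which gene count was scaled by band number to reach 16,975, so the band-based estimate is evaluated here for both the 218 protein-coding genes and all 229 genes, which bracket the quoted value.

```python
# Figures quoted in the text.
GENOME_MB = 170.0            # total DNA content of D. melanogaster
SATELLITE_FRACTION = 0.21    # low-complexity satellite sequence
REPEAT_FRACTION = 0.12       # transposable elements and other repeats
KB_PER_GENE = 13.4           # protein-coding gene spacing in the Adh region
BANDS_IN_REGION = 69
BANDS_IN_GENOME = 5160

# Estimate 1: "unique" DNA divided by the average gene spacing.
unique_mb = GENOME_MB * (1 - SATELLITE_FRACTION - REPEAT_FRACTION)
density_estimate = 115 * 1000 / KB_PER_GENE  # the text rounds unique DNA to ~115 Mb

# Estimate 2: scale the genes in the region by its share of polytene bands.
band_share = BANDS_IN_REGION / BANDS_IN_GENOME


def band_estimate(genes_in_region: int) -> float:
    """Genome-wide gene number if genes per band matched the Adh region."""
    return genes_in_region / band_share


print(f"unique DNA: {unique_mb:.1f} Mb")
print(f"density-based estimate: {density_estimate:.0f} genes")
print(f"band share: {band_share:.2%}")
print(f"band-based estimate: {band_estimate(218):.0f} to {band_estimate(229):.0f} genes")
```

The density-based figure rounds to the 8600 given in the text, and the band share comes out at 1.34%.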

Local duplications of genes:
A number of genes in Drosophila have been found to exist as locally duplicated gene pairs. Members of a pair may be functionally distinct (e.g., en, inv) or functionally redundant (e.g., gsb-d, gsb-p; ph-d, ph-p). The most obvious model for the origin of gene pairs is unequal recombination (STURTEVANT 1925; INGRAM 1961; BAGLIONI 1963; SMITHIES et al. 1962) followed by sequence divergence.

In this chromosome region we have identified at least 12 (protein-coding) gene repeats. One had already been identified, first in Drosophila pseudoobscura (SCHAEFFER and AQUADRO 1987), i.e., Adh and Adhr, genes just 300 bp apart that have protein products only 33% identical in sequence, yet with a conserved position of introns. Remarkably, Adhr is only transcribed as a dicistronic transcript with Adh (BROGNA and ASHBURNER 1997). The second gene repeat is a triplication of three zinc finger domain transcription factors, escargot, worniu, and snail, within 150 kb. The proteins encoded by these genes show 31–37% pairwise identity. Interestingly, although each of these is required for viability, there is some residual functional redundancy between at least esg and sna (see Appendix). The third example is BG:DS01514.2 and BG:DS05899.1, two genes 7.5 kb apart that encode protein products 43% identical in sequence; these proteins show similarity to mouse long-chain fatty acid coenzyme-A ligase. Mst35Ba and Mst35Bb are a tandem pair of genes encoding protamine-like proteins characterized by RUSSELL and KAISER (1993). These proteins are 91% identical over their common region (that of Mst35Bb is longer by 25 amino acids than that of Mst35Ba). At the nucleic acid level the duplication extends over ~1 kb.

Five genes, closely clustered in the region between RpII33 and Ance, show between 30 and 37% amino acid sequence similarities. These are BG:DS00941.11–BG:DS00941.15, genes whose proteins are about the same size but all lack any sequence matches. BG:DS00180.7–BG:DS00180.10, BG:DS00180.12, and BG:DS00180.14 are six genes all with epidermal growth factor (EGF) domains clustered within a few tens of kilobases just distal to rk. Their sequence similarities are not high, but are evidence of ancient duplications.

In the region between the lace and CycE genes there are six predicted genes within 21 kb, each encoding a protein of the astacin subfamily of Zn-metalloproteases (BARRETT et al. 1998; BG:BACR44L22.1–BG:BACR44L22.4, BG:BACR44L22.6, and BG:BACR44L22.8). The predicted protein sequences of these genes are between 29 and 64% identical. There are two clusters of genes encoding proteins predicted to be serine proteases. One is of two genes within 14.8 kb and showing 45% pairwise similarity (BG:DS06874.4 and BG:DS06874.6); the other is a pair of genes within 10.2 kb showing 35% sequence similarity (BG:DS07108.1 and BG:DS07108.5). Right at the proximal margin of the region sequenced are three genes encoding proteins identified by KAWAMURA et al. (1999) as imaginal disc growth factors (see below). These genes show 51–55% pairwise similarity in sequence and are within 7.7 kb (Idgf1, Idgf2, and Idgf3). Interestingly, there is evidence for a tandem triplication of chitinase genes, which these resemble, in mosquitoes (DE LA VEGA et al. 1998). A further triplication is exemplified by beat and two similar genes, beat-B and beat-C, first discovered in this sequence by T. PIPES (personal communication). These three genes are not contiguous, but are clustered within 200 kb. The proteins predicted for beat-B and beat-C are 51 and 46% identical, respectively, to that of beat. The three genes have a similar structure. The final example of duplicate genes is that of noc and BG:DS06238.3, a gene some 100 kb distal, which we suggest is elB (see below). These two genes encode Zn-finger proteins with 27% amino acid identity.

The 38 genes in the 34C-36A region that appear to be members of tandem series represent 17% of the total number of protein-coding genes. This is a minimum estimate, because a BLASTP search of all 218 known and predicted protein sequences against themselves identifies other potential duplications, which require further study. Many of these duplications are very old, as judged by the sequence similarities between members of a set. Tandem series of genes are also a feature of C. elegans (THE C. ELEGANS SEQUENCING CONSORTIUM 1998; THE C. ELEGANS GENOME SEQUENCING PROJECT 1999) and Arabidopsis thaliana (BEVAN et al. 1998). The fraction of genes included in tandem sets of two or more (18%) is about the same as that found in the Adh region (JONES 1999). One possible reason why C. elegans appears to have more genes than D. melanogaster would be that these local tandem arrays are, on average, larger in C. elegans. The data available so far do not support this suggestion.

Genes within genes:
The first example of a gene known to be entirely included within another gene was that of a pupal cuticle protein gene (Pcp) fully encoded within an intron of ade3 (HENIKOFF et al. 1986). Since then, >30 examples have been discovered (data from FlyBase) and in the majority of cases (25/32) the included gene is transcribed from the opposite strand of the including gene. In the Adh region we have identified 17 examples of nested genes, 12/17 following the majority rule of antiparallel transcription.

The inclusion of Adh within osp was first suggested by genetic data, because osp aberrations mapped to either side of Adh (CHIA et al. 1985; see below). This suggestion, and the inclusion of Adhr in the same intron, was confirmed by molecular analysis (MCNABB et al. 1996) and is proven here by the comparison of the sequence of a full-length osp cDNA with the genomic sequence (see below). Two other predicted genes are within osp: BG:DS07721.1 and BG:DS09219.1.

An open reading frame in the 5' intron of vasa (vig, for vasa intronic gene) was first identified by K. EDWARDS (personal communication) by a comparison of sequences from D. grimshawi with those from this project. There is another CDS within vasa: BG:DS00929.15 in the long third intron, first identified as a ubiquitous transcript from RNA blots with genomic DNA by P. LASKO (personal communication; see STYHLER et al. 1998). The other examples of putative included genes are BG:BACR48E02.1, BG:BACR48E02.2, and BG:BACR48E02.3, all included within the second intron of B4; BG:DS07486.3, BG:DS07486.4, and BG:DS07486.5 included within introns of beat-B, the former in intron 1 and the latter two in intron 2; BG:DS03792.2 is within wb; BG:DS03192.4 is within BG:DS03192.2; BG:DS07295.4 is within BG:DS07295.1; BG:DS07660.1 is within kuz; and BG:DS01514.1 is within BG:DS01514.3.

The phenotypes of overlapping and contiguous deletions—the search for more genes:
We have evidence that the genetic screens failed to recover mutations at loci expected to have scorable phenotypes; the failure to recover any alleles of beat is an example (see Appendix). One new lethal locus (l(2)35Fg) was discovered when the chromosome 2 P elements were systematically screened. One further genetic technique to discover genes is to systematically screen heterozygotes between two overlapping deletions. We have made transheterozygotes between all possible pairs of deletions that, by genetic criteria, abut, i.e., the distal end of one and the proximal end of another are located between the same pair of genes identified by mutant alleles. These pairs of deletions may or may not physically overlap.

Pairwise combinations (836) have been made and the genotypes scored for viability, male and female fertility, and obvious visible phenotypes. Although these phenotypes could be the result of the additive effects of haplo-insufficiency, we have predicted the existence of four lethal loci from these data, two loci required for male fertility and two loci required for female fertility (each "locus" could include more than one gene, of course). A variation on this protocol for the discovery of mutant phenotypes is to test combinations of deletions that are known to overlap by only one gene with a mutant phenotype in the presence of a transgene that is known independently to rescue the mutant phenotype. If the transgene rescues the deficiency heterozygote to phenotypic normality, then we can conclude that no other genes capable of giving a mutant phenotype are located in the deleted interval; and if not, then we can conclude the existence of a previously unsuspected locus.

Overlapping Ance- deletions are lethal, which is expected, since Ance itself is a vital gene. There is, however, evidence for another lethal near Ance, because the lethality of some, but not all, overlapping deletion pairs can be rescued by a 16.5-kb transformant that includes both Ance and anon-34Ea (carried on P{RACE}). l(2)34Ec is predicted on the basis of the failure of this transformant to rescue the lethality of, e.g., Df(2L)SR407/Df(2L)b82a1. This predicted gene is not in the overlap of, e.g., Df(2L)SR407/Df(2L)b74c6.

The existence of ms(2)35Bi, between the 5' exons of osp and l(2)35Bb, is predicted on the basis of viable, but male-sterile, overlapping deletion heterozygotes (see Appendix). l(2)35Cc is predicted on the basis of the recessive lethality of Df(2L)rd9 (ASHBURNER et al. 1990). rd9 is lethal with deletions of rd; all five other known alleles of rd are hemizygous viable. The existence of l(2)35Cc is confirmed by the complementation behavior of deletions generated from gftPZ06430 by male recombination. Of nine deletions, one extended distally and was rd+ but lethal with Df(2L)rd9 and gft; the other eight extended proximally from gft to include ms(2)35Ci.

The region between esg and sna is, genetically, rather complex. From the phenotypes of overlapping deletions ASHBURNER et al. (1990) identified a region that, when homozygously deleted, can result in either lethality or an absence of the halteres. These phenotypes are separable; e.g., the Df(2L)osp38/Df(2L)TE35D-22 heterozygote is viable and lacks halteres, but Df(2L)osp18/Df(2L)TE35D-22 is lethal. Both map between esg and worniu. The lethal is here named l(2)35Cg. There is another predicted lethal in this region, simply called l by ASHBURNER et al. (1990) (Figure 2). It (l(2)35Ch) is predicted from the lethality of, e.g., Df(2L)el20 when heterozygous with Df(2L)Scorv25. There is only one gene prediction in the esg-worniu interval; this is BG:DS03023.4.

fs(2)35Ec is inferred from the sterility of Df(2L)RA5 females heterozygous with 18 different deletions, e.g., Df(2L)TE35D-3. The existence of fs(2)35Ed is suggested by the sterility of Df(2L)RM5/Df(2L)TE35D-2 females and of four similar genotypes; this gene may correspond to beat-C. ms(2)35Eb is inferred from the male sterility of the heterozygote Df(2L)RA5/Df(2L)TE35D-14. The predicted female steriles, fs(2)35Ec and fs(2)35Ed, are tentative; we are concerned that these phenotypes may simply result from haplo-insufficiency, particularly for BicC.

There are several regions that are homozygous viable when deleted. We estimate that the longest of these, the overlap of Df(2L)A178 and Df(2L)A446, is 190 kb. This overlap deletes or disrupts four known genes (noc, Adh, Adhr, and osp), eight tRNA genes, and five predicted protein-encoding genes in the noc-BG:DS07721.3 interval.

The structure and function of gene products:
We have used three computational techniques to infer structural and functional attributes of the products of the genes predicted for this chromosome region. These are searches for protein motifs or domains using the PFAM and PROSITE databases, BLASTP similarities of the predicted open reading frames with proteins in the SWISSPROT and SPTREMBL databases, and some analysis of protein features using the PSORT and SAPS programs (see MATERIALS AND METHODS). In general, we have been rather conservative in making these inferences, as we have for gene prediction in general. These functional inferences are summarized in Table S3 (http://www.genetics.org/cgi/content/full/153/1/179/DC3), using a classification now being developed by the Gene Ontology Consortium (FlyBase, Mouse Genome Informatics and the Saccharomyces Genome Database; GO 1999). Of the 218 known or predicted protein-coding genes, we know, from previous work by others, or have inferred, the function of less than half (91, 42%). Of these, 41 are obviously enzymes and 18 are predicted to be proteases; the rest cover the functional spectrum from structural proteins (e.g., cuticle protein) to growth factors and transporters. From our analysis of protein motifs we predict that 16 of the proteins are DNA or RNA binding; the PSORT analysis predicts that 82 are nuclear localized, but this may well be an overestimate. There are some features of the domain analysis that deserve further study: the cluster of six genes (BG:DS00180.10 and neighbors) whose products are predicted to have EGF domains in particular.

Evolutionary conservation:
Of the 218 known or predicted protein-coding genes, 156 (72%) have clear matches with those in other organisms [summarized in Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2)]. Of these, 120 have matches to the sequences of C. elegans, 69 to the sequence of S. cerevisiae, 35 to sequences of A. thaliana, 114 to sequences from rodents (nearly all mouse, with a few rat), 125 to human sequences, and 128 to rodent + human sequences. Thirty proteins have matches in yeast, C. elegans, Arabidopsis, and rodents + human, and 55 in yeast, C. elegans, and rodents + human. With the exception of S. cerevisiae and C. elegans (whose genomes are entirely sequenced, or almost so) these numbers reflect the available sequence data, although, overall, they are an impressive witness to the conservation of protein sequence across very different taxa. These sequence similarities are, of course, very useful for making functional inferences about new Drosophila genes; they must, however, be treated with some caution as the evolution of function and sequence may not be as tightly linked as is sometimes believed. We see evidence for this in the genes of this region; e.g., the fact that the three genes we first identified by their sequence characteristics as chitinases are in fact secreted imaginal disc growth factors, as has been shown experimentally (KAWAMURA et al. 1999). The inferences we have made are only hypotheses that demand experimental verification or falsification.

In addition to sequence similarities between genes in this chromosome region and sequences from other taxa, 49 of the predicted or known protein-coding genes have significant database matches outside the Adh region to the known protein universe of Drosophila. This is from a sample of only 2000 or so proteins, <15% of the expected total. The conclusion, which is no great surprise, is that nearly all proteins of Drosophila will be members of protein sequence families. In some cases the similarities in sequence between different proteins are very striking, e.g., the two "stress-activated" mitogen activated protein (MAP) kinases p38b and Mpk2 are 77% identical in sequence (see Appendix). There is no obvious clustering of the genes that are paralogs of genes in the Adh region; this would have been evidence of large-scale genomic duplications, such as are found in S. cerevisiae (WOLFE and SHIELDS 1997).

Correspondence between known genes and the sequence:
One of the major objectives of this study was to identify the 73 genes known or predicted from the genetic analyses on the sequence and, if possible, to infer their function. For those that had been sequenced previously their identification was straightforward. Others have been identified by mapping to the sequence the sites of insertion of P-element alleles and by correlating the genetic and sequence maps. Forty-nine of these 73 genes have been identified on the sequence [see Figure 1 and Table S2 (http://www.genetics.org/cgi/content/full/153/1/179/DC2)]. For the remaining 24, candidate sequences can be identified, but no firm correlation can be made on the available data. Detailed consideration of these 49 genes and others of interest identified on the sequence is given in the Appendix.

Genes with phenotypes are more likely to be conserved:
Genes that can mutate to an observable phenotype are far more conserved than those that cannot. The data are shown in Table 4. We compare the sequence similarities between known and predicted proteins in two groups: the first is of all 218 proteins, the second just that subset of 49 encoded by genes for which we have phenotypically detectable mutant alleles. Even at a BLASTP threshold of P = 10^-50, 63% of the 49 genes with phenotypes (and known sequences) have sequence similarities in other taxa, compared to only 31% for the total sample of 218 genes. This difference is also observed if one only considers the comparisons to individual species, such as C. elegans and S. cerevisiae, whose genomes are completely sequenced; this argues that the observation cannot be due to an ascertainment bias.


 

 
Table 4. A comparison of the sequence similarities between genes with known mutant phenotypes and those without

We know, or predict from genetic data, that 73 out of 218 genes have mutant phenotypes. If we assume that the 24 genes that we have not yet managed to tie to the sequence are as conserved as the 49 that we have, then we can calculate the expected properties of the total sets of genes with and without mutant phenotypes. For example, we can predict 46/73 will have BLASTP hits to other species at an expectation of P = 10^-50. Because there are only 67 hits to other species from the total of 218 genes (at this cutoff) we can conclude that 63% of the genes with mutant phenotypes are conserved, but only 14% (21/(218-73)) of the genes without detectable mutant phenotypes. If we raise the BLASTP cutoff to P = 10^-100, then the numbers are even more striking: 37 and 2%, respectively, for genes of the two classes.
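The extrapolation in the preceding paragraph can be retraced step by step. The sketch below uses only the counts quoted in the text for the P = 10^-50 cutoff; the variable names are hypothetical, and rounding the expected number of hits to a whole gene is an assumption made here to match the quoted 46/73.

```python
TOTAL_GENES = 218
PHENOTYPE_GENES = 73                # known or predicted from genetic data
CONSERVED_FRACTION_LOCATED = 0.63   # of the 49 located genes with phenotypes
TOTAL_HITS = 67                     # hits among all 218 genes at this cutoff

# Assume the 24 unlocated genes are as conserved as the 49 located ones.
expected_phenotype_hits = round(CONSERVED_FRACTION_LOCATED * PHENOTYPE_GENES)

# Whatever remains of the 67 hits must belong to genes without phenotypes.
no_phenotype_genes = TOTAL_GENES - PHENOTYPE_GENES
no_phenotype_hits = TOTAL_HITS - expected_phenotype_hits
no_phenotype_fraction = no_phenotype_hits / no_phenotype_genes

print(f"with phenotype:    {expected_phenotype_hits}/{PHENOTYPE_GENES} conserved")
print(f"without phenotype: {no_phenotype_hits}/{no_phenotype_genes} "
      f"= {no_phenotype_fraction:.0%} conserved")
```

This gives 46 of 73 genes with phenotypes and 21 of 145 genes without, in line with the 63% and 14% quoted above.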

We realize that this analysis has its limitations. The distinction between genes with and without discernible mutant phenotypes is not hard and fast, but we point out that the great majority of mutant phenotypes known in this chromosome region are very obvious, i.e., lethality, sterility, or marked changes to adult morphology. We can, in addition, have reasonable confidence that mutations have been detected in nearly all of the genes in this region that can mutate to these phenotypes.

Conserved genes are more highly expressed:
Genes known prior to this analysis are far more likely to have ESTs than those newly discovered (see above). We were concerned that this could indicate an overoptimism in predicting new genes. Yet the analysis of Table 4 shows that this cannot be so, or at least it cannot be the entire reason. Genes with BLAST similarities with P values <10^-7 are unlikely to be false predictions. Yet in the total data set of 218 genes we see that the fraction that have ESTs increases the more stringent we set the expectation: for "all" species hits it is 48% at P = 10^-7, 53% at P = 10^-20, 60% at P = 10^-50, and 80% for P = 10^-100. Genes with mutant phenotypes have ESTs at an overall higher frequency than do those without phenotypes (Table 4). The observation that "conserved" genes are more highly expressed than are "nonconserved" genes, as judged by the occurrence of ESTs, was first made by GREEN et al. (1993) in their analysis of evolutionarily conserved regions in proteins. They suggested that highly expressed genes might be under a higher selection pressure. The similar bias in C. elegans, where genes with matches to proteins in distant taxa (i.e., non-Nematodes) are three times more likely to have an EST than genes with no such match, was confirmed by an analysis of the C. elegans sequence (THE C. ELEGANS SEQUENCING CONSORTIUM 1998).

tRNA genes:
An initial rush of enthusiasm mapped many tRNA genes by in situ hybridization to the polytene chromosomes and many of these were subsequently cloned and sequenced (e.g., KUBLI 1982). A total of 182 tRNA genes have so far been mapped in Drosophila (data from FlyBase), yet others remain to be discovered (e.g., tryptophan and cysteine tRNAs). Many tRNA genes occur in clusters, either of isoaccepting or diverse tRNAs. A cluster of five glycine tRNAs was already known in the Adh region (MENG et al. 1988; 13 others are known). In addition we have identified a single glutamine tRNA (the first to be sequenced in Drosophila; BG:DS01514.1) and a single leucine tRNA (five others are known; BG:DS03192.1), four proline tRNAs (two others are known), one (BG:DS04641.2) immediately distal to the glycyl-tRNA cluster, and three (BG:DS01486.2–.4) just proximal to this cluster, immediately distal to osp. The 100-kb region between noc and osp therefore contains nine tRNA genes.

Transposable elements:
About 12% of the genome of D. melanogaster is estimated to be composed of transposable element sequences, ribosomal DNA, and core histone genes (LAIRD and MCCARTHY 1968; SPRADLING and RUBIN 1981). Seventeen elements have been recognized in the sequence of the Adh region; 6 are LINE-like elements (G, F, Doc, and jockey), 11 are retrotransposons with long terminal repeats (LTRs; copia, roo, 297, blood, mdg1-like and yoyo; see Figure 1 and Figure 2). This is an average spacing of 1 element per 171 kb. On the basis of kinetic data the "middle-repetitive" sequences of D. melanogaster had been estimated to be ~5.6 kb in length, and separated by 13 kb or more of single-copy DNA (MANNING et al. 1975; CRAIN et al. 1976).

A new retrotransposon element has been identified. It has been called yoyo in view of its sequence similarity with an element of the medfly Ceratitis capitata with this name. The yoyo LTR seems to be a hotspot for P-element insertion; k08808, a lethal allele of l(2)35Bc, is inserted in an LTR of yoyo and at least four other examples are known of P elements in yoyo LTRs (PZ06264, EP(2)0533, EP(2)0396, and EP(2)0417).

About 1.8% of the sequence of the Adh region is within identified transposable elements. This is much less than the 9% of the genome as a whole estimated to be composed of such sequences (SPRADLING and RUBIN 1981). The reason for this difference is probably that the density of transposable elements is higher in the heterochromatic and peri-heterochromatic regions of the chromosomes (see SUN et al. 1997). Perhaps only half the retroviral elements are euchromatic. That this is so is indicated by a comparison of the total numbers of elements estimated by DNA reassociation kinetics and those seen in the euchromatic arms by in situ hybridization. For the 412 element, e.g., the numbers were 40 (POTTER et al. 1979) and 26 (STROBEL et al. 1979), respectively, in Oregon-R; similar data were found for the 297 and copia elements.

There are other sequences that are clearly related to those of transposable elements but whose identity cannot be confidently stated. For example, on P1 clone DS07108 there are three very A + T-rich sequence regions that show similarities to elements such as 297 and mdg1 but appear to be very degenerate. In addition, in an intron of crp there is an 860-bp sequence very similar to the repetitive element described as Su(Ste) (BALAKIREVA et al. 1992).

Breakpoint distribution:
We have mapped genetically 658 aberration breakpoints to this region of the Drosophila genome. Sixty-three breakpoints disrupt genes. Of these breakpoints many had previously been mapped to chromosome walks, usually in λ phage. Ninety-four of these were mapped to restriction fragments in the 450-kb "Adh" walk from Ashburner's laboratory (CHIA et al. 1985; MCGILL et al. 1988; DAVIS et al. 1990, 1997; GUBB et al. 1990; CHEAH et al. 1994; MCNABB et al. 1996), while others had been mapped to the vasa (LASKO and ASHBURNER 1988), Su(H) (SCHWEISGUTH and POSAKONY 1992), Sos (BONFINI et al. 1992), BicC (MAHONE et al. 1995), beat (FAMBROUGH and GOODMAN 1996), twe (ALPHEY et al. 1992), fzy (DAWSON et al. 1995), and cni (R