Total Coverage as of Feb 2008 = 320X
This success reflects the technological advances in ultra-high-throughput resequencing. More importantly, our recent experience and the growing potential of new sequencing platforms have inspired another ambitious proposal: resequencing of hundreds of Drosophila melanogaster genomes. By providing the research community with deep population sampling based on the high-throughput platforms, our project will foster the development of new theoretical ideas, talent, and tools. Combined with the talent and creativity of the Drosophila research community, these can advance ideas and applications with potential impact on human population genomics.
An isogenic (or inbred) Drosophila melanogaster genome sequenced to 10X with the Illumina GA has a low rate of missing data. We routinely achieve 98% or higher coverage of the non-repetitive genome at Q40 or higher for such isogenic genomes (go to figure) with a single run of the instrument.
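As a rough illustration of what "10X" means in terms of read counts, the sketch below applies the usual definition of average depth (total sequenced bases divided by genome size). The 36 bp read length comes from the README text on this page; the 120 Mb genome size is only a round illustrative figure and the function names are hypothetical, not part of any DPGP tool.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth = total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_for_coverage(target_depth: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose total length reaches the target depth."""
    return math.ceil(target_depth * genome_size / read_length)


if __name__ == "__main__":
    # ASSUMPTION: 120 Mb is an illustrative genome size, not a DPGP figure.
    genome_size = 120_000_000
    read_length = 36  # single-end 36 bp reads, as described in the README
    n = reads_for_coverage(10, read_length, genome_size)
    print(n, expected_coverage(n, read_length, genome_size))
```

This ignores reads that fail to map or fall in repetitive sequence, so the number of raw reads needed in practice would be higher.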
One major goal will be to utilize public database resources as much as possible to disseminate our sequences in a timely manner. We are placing the raw data in the NCBI's short read trace archive as quickly as possible given our resources:
Along with the large human population genomics sequencing community, we are working on creating solid and serviceable genomic assemblies and associated publications. As these assemblies emerge, we will post them on this website and submit them to public databases.
This is the README file for the preview release (Release 0.5) of the initial sample of sequenced Drosophila melanogaster genomes by the DPGP using first generation (single-end and 36 bp) Solexa/Illumina technology (Bentley, et al., 2008 Nature 456:53-59) assembled using maq 0.6.8 (Li, Ruan and Durbin, 2008 Genome Res. 18: 1851-1858). This data preview is intended to clearly show the scope and quality of the data. Release 1.0 will be a reference dataset. The sample consists of 39 inbred genomes from Trudy Mackay's set of inbred lines sampled in Raleigh, NC (Jordan, et al., 2007 Genome Biology 8: R172. doi:10.1186/gb-2007-8-8-r172.) and a set of sequenced chromosomes (8 chrXs, 6 chr2s and 5 chr3s) from a sample of Malawi isofemale lines (Begun and Lindfors, 2005 Mol. Biol. Evol. 22: 2010-2021) that were inbred using balancers. Regions of residual heterozygosity and repeated sequence are filtered (set to "N"). The "raw data" are available in the NCBI Short Read Trace Archive. The Release 0.5 data are in the form of fasta files for each of the major chromosome arms of each sampled genome. The average coverage of the unique portions of all these genomes is >10X. The called bases are those with a consensus (Solexa) nominal quality score >= 30. Bases in repetitive sequences or in regions of (inbred) residual heterozygosity are not called, i.e. they are set to "N".
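Because uncalled bases are written as "N", the fraction of each chromosome arm that was actually called can be read straight off the fasta files. The following is a minimal sketch, assuming ordinary fasta text with one ">" header per record; the function name is hypothetical and not part of the release.

```python
from typing import Dict, Iterable, List


def called_fraction_per_record(lines: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of called (non-N) bases for each fasta record."""
    counts: Dict[str, List[int]] = {}  # record name -> [called, total]
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = [0, 0]
        elif name is not None:
            seq = line.upper()
            counts[name][0] += len(seq) - seq.count("N")
            counts[name][1] += len(seq)
    return {k: (c / t if t else 0.0) for k, (c, t) in counts.items()}


if __name__ == "__main__":
    example = [">chr2L", "ACGTNN", "NNAC"]
    print(called_fraction_per_record(example))
```

For a file on disk, pass an open file handle instead of a list of strings.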
Basic statistics and examples are available HERE.
The download tarball can be found here: dpgp_solexa_preview.tar.gz.
Release 1.0, which will include calibrated quality values for each called
base (fasta and qfasta files), annotation of indels where possible, and
additional filtering of low quality basecalls, will appear shortly.
That version is a snapshot that will form the basis of a paper
describing the collection, assembly, and initial analyses
(another genome paper!). We anticipate that many researchers will use
Release 1.0 data for a wide diversity of purposes in a timely fashion.
In deference to the academic careers of the junior colleagues who invested
great time and effort in this project and with an interest in a coherent
and efficient presentation in the literature of all the analyses, we
ask that users of these data (both Release 0.5 and Release 1.0) defer
publication for six (6) months after the appearance of Release 1.0.
Redundant effort, excessive overlap and publication difficulties must be
balanced against independent and creative analyses that happen to
coincide. The DPGP participants are ready to discuss the emerging
content of the "genome paper" and to facilitate coordination of efforts.
This is the README file for Release 1.0 of the initial sample of sequenced Drosophila melanogaster genomes by the DPGP using first generation (single-end and 36 bp) Solexa/Illumina technology (Bentley, et al., 2008 Nature 456:53-59) assembled using maq 0.6.8 (Li, Ruan and Durbin, 2008 Genome Res. 18: 1851-1858). The sample consists of 37 inbred genomes from Trudy Mackay's set of inbred lines sampled in Raleigh, NC (Jordan, et al., 2007 Genome Biology 8: R172. doi:10.1186/gb-2007-8-8-r172.) and a set of sequenced chromosomes (7 chrXs, 6 chr2s and 5 chr3s) from a sample of Malawi isofemale lines (Begun and Lindfors, 2005 Mol. Biol. Evol. 22: 2010-2021) that were inbred using balancers. Regions of repeated sequence are filtered (set to "N"). The "raw data" are available in the NCBI Short Read Trace Archive.
Release 1.0 is in the form of FASTQ files, one for each of the major chromosome arms (inbred and sequenced) from each sampled genome. The average coverage of the unique portions of all these genomes is over 10X.
This release adds the following enhancements to Release 0.5
Basic statistics and examples from Release 0.5 are still quite representative and are available HERE.
The download tarball can be found here: dpgp_solexa_r1.0.tar. Checksum: dpgp_solexa_r1.0.tar.md5.
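After downloading, the tarball can be checked against the published checksum. The sketch below uses Python's standard hashlib module; it assumes the .md5 file contains a 32-character hexadecimal digest somewhere in its text (as written by common md5 tools), which should be confirmed against the actual file. The function names are hypothetical.

```python
import hashlib
import re


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_path: str) -> str:
    """Extract the first 32-hex-digit token from a checksum file.

    ASSUMPTION: the checksum file holds the digest as plain text.
    """
    with open(md5_path) as handle:
        match = re.search(r"\b[0-9a-fA-F]{32}\b", handle.read())
    if match is None:
        raise ValueError("no MD5 digest found in " + md5_path)
    return match.group(0).lower()


def verify_download(tar_path: str, md5_path: str) -> bool:
    """True when the computed digest matches the published one."""
    return md5_of_file(tar_path) == expected_md5(md5_path)
```

For example, verify_download("dpgp_solexa_r1.0.tar", "dpgp_solexa_r1.0.tar.md5") returns True only if the two digests agree.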
List of regions of residual heterozygosity: dpgp_r1.0_reshet.txt.
List of large regions of identity by descent: dpgp_r1.0_ibd.txt.
We are providing an unsupported FASTQ to FASTA converter: fastq_2_fasta.pl.
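For readers who prefer not to use the Perl script, the general idea of such a conversion can be sketched in Python. This is not a port of fastq_2_fasta.pl and its behavior may differ. It assumes Phred+33 quality encoding and FASTQ records whose sequence and quality may be wrapped over several lines; both assumptions, along with the default threshold of 30 and all function names, are illustrative and should be checked against the release files.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from FASTQ text with possibly wrapped records."""
    it = (ln.rstrip("\r\n") for ln in lines)
    for header in it:
        if not header:
            continue
        if not header.startswith("@"):
            raise ValueError("expected '@' header, got: " + header[:20])
        parts = header[1:].split()
        name = parts[0] if parts else ""
        seq_parts = []
        for ln in it:  # sequence lines run until the '+' separator
            if ln.startswith("+"):
                break
            seq_parts.append(ln.strip())
        seq = "".join(seq_parts)
        qual = ""
        # Read quality by length, since a quality line may itself start with '@'.
        while len(qual) < len(seq):
            try:
                qual += next(it).strip()
            except StopIteration:
                raise ValueError("truncated record: " + name) from None
        if len(qual) != len(seq):
            raise ValueError("sequence/quality length mismatch: " + name)
        yield name, seq, qual


def mask_low_quality(seq: str, qual: str, min_q: int = 30, offset: int = 33) -> str:
    """Replace bases whose quality is below min_q with 'N' (ASSUMES Phred+offset)."""
    return "".join(b if ord(q) - offset >= min_q else "N" for b, q in zip(seq, qual))


def fastq_to_fasta(lines: Iterable[str], min_q: int = 30, width: int = 60) -> Iterator[str]:
    """Yield fasta lines, masking low-quality bases and wrapping at `width`."""
    for name, seq, qual in read_fastq(lines):
        masked = mask_low_quality(seq, qual, min_q)
        yield ">" + name
        for i in range(0, len(masked), width):
            yield masked[i:i + width]
```

Passing an open file handle to fastq_to_fasta and writing each yielded line followed by a newline produces a fasta file in which only bases at or above the chosen quality threshold remain called.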
Release 1.0 will be the basis of a paper that describes the collection, assembly, and initial population genetics analyses of these genomes (another genome paper!). We anticipate that many researchers will use Release 1.0 data for a wide diversity of purposes in a timely fashion. In deference to the academic careers of the junior colleagues who invested great time and effort in this project and with an interest in a coherent and efficient presentation in the literature of all analyses, we ask that users of these data (both Release 0.5 and Release 1.0) defer publication for six (6) months after the appearance of Release 1.0. Redundant effort, excessive overlap and publication difficulties must be balanced against independent and creative analyses that happen to coincide. The DPGP participants are ready to discuss the emerging content of the "genome paper" and to facilitate coordination of efforts.