Nextgen in Bioperl
(see thread at bioperl-l here; thanks to Elia)
This is a page for developer and user discussion of next-generation sequencing support in BioPerl. Please comment freely to help create priorities and use cases for development going forward. Thanks to all for your contributions!
Wish List
The following is a list of features users would like to see:
Improved support for fastq
After some discussion on the mailing list, the consensus so far is to support current next-gen formats within SeqIO, using the same naming convention as Biopython, i.e.:
- "fastq" in Biopython means the original Sanger standard FASTQ files encoding PHRED qualities using an ASCII offset of 33.
- "fastq-solexa" in Biopython means the early Solexa/Illumina style FASTQ files which encode Solexa qualities using an ASCII offset of 64.
- "fastq-illumina" in Biopython will mean recent Solexa/Illumina style FASTQ files (from pipeline version 1.3+) which encode PHRED qualities using an ASCII offset of 64. This is in the Biopython repository, but hasn't been released yet - so the name "fastq-illumina" isn't set in stone yet.
- A "standard" implementation will be used for the moment, although its performance is not optimal for the large numbers of reads that usually need to be handled. Possible future improvements include light-weight versions that avoid or reduce object creation, use C, etc.
- If possible, provide some level of validation for the quality values of each format, i.e. check bounds during the parse and warn if they are exceeded.
- Heikki has started developing some code (his discussion can be found here), and Heng Li's code is a useful guide, so it would be best to compare and merge these to address issues identified in the past.
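The offsets and the bounds check described in the list above can be sketched as follows. This is only an illustration in Python, not the BioPerl implementation: the function name `decode_qualities` and the `VARIANTS` table are hypothetical, and the score ranges are the commonly cited ones for each encoding rather than anything agreed on in the discussion.

```python
import warnings

# Variant name -> (ASCII offset, lowest valid score, highest valid score).
# Names follow the Biopython convention quoted above; the bounds are an
# assumption of this sketch (printable ASCII up to 126 for each offset).
VARIANTS = {
    "fastq": (33, 0, 93),            # Sanger, PHRED scores
    "fastq-solexa": (64, -5, 62),    # early Solexa/Illumina, Solexa scores
    "fastq-illumina": (64, 0, 62),   # Illumina 1.3+, PHRED scores
}


def decode_qualities(qual_string, variant="fastq"):
    """Turn a FASTQ quality string into a list of integer scores.

    Scores outside the expected range for the variant trigger a warning
    instead of an exception, so parsing can continue.
    """
    offset, low, high = VARIANTS[variant]
    scores = []
    for position, char in enumerate(qual_string):
        score = ord(char) - offset
        if score < low or score > high:
            warnings.warn(
                "quality %d at position %d is outside [%d, %d] for %s"
                % (score, position, low, high, variant)
            )
        scores.append(score)
    return scores
```

A warning on a supposedly "fastq-illumina" file often means the file is really Sanger-encoded, which is one reason bounds checking during the parse is useful.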
Some further discussion of this topic can be found in these bioperl-l threads:
Progress
- Bio::SeqIO::fastq now has initial support for Illumina (v1.3) and Solexa (Illumina v1.0) read/write --Chris Fields 19:31, 1 July 2009 (UTC)
- Follows the Biopython convention.
- Needs serious testing for quality data.
- Do we want to incorporate FASTQ-int?
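Reading one variant and writing another comes down to re-encoding the quality string. The sketch below, in Python, shows one way to convert the two offset-64 variants to Sanger encoding. The function names are hypothetical, and the code does not describe how Bio::SeqIO::fastq does it. Solexa scores are mapped to PHRED scores with the standard relation Q_phred = 10 * log10(10^(Q_solexa / 10) + 1) and rounded to the nearest integer.

```python
import math


def solexa_to_phred(solexa_score):
    """Convert a Solexa quality score to the equivalent PHRED score (a float)."""
    return 10 * math.log10(10 ** (solexa_score / 10) + 1)


def to_sanger(qual_string, variant):
    """Re-encode an offset-64 quality string as Sanger FASTQ (offset 33).

    variant is "fastq-illumina" (PHRED scores, offset 64) or
    "fastq-solexa" (Solexa scores, offset 64).
    """
    if variant not in ("fastq-illumina", "fastq-solexa"):
        raise ValueError("unsupported variant: %s" % variant)
    chars = []
    for char in qual_string:
        score = ord(char) - 64
        if variant == "fastq-solexa":
            # Solexa and PHRED scores differ noticeably only at low quality.
            score = round(solexa_to_phred(score))
        chars.append(chr(score + 33))
    return "".join(chars)
```

For high-quality bases the two scales are nearly identical, so the conversion mostly matters for low scores. This is why the quality data needs careful testing at the low end.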
Support for sequences produced by pyrosequencing (454): SFF format
- Possibly via io_lib (bindings possible via BioLib?)
Bioperl-run support of common external tools
- Bio::Assembly::IO::maq and Bio::Tools::Run::Maq are now in beta in their respective trunks --maj 12:59, 12 November 2009 (UTC)
- Bio::Tools::Run::Samtools now in beta in bioperl-run trunk --maj 04:00, 30 November 2009 (UTC)
- Bio::Tools::Run::BWA and Bio::Assembly::IO::sam now in beta in their respective trunks --maj 04:00, 30 November 2009 (UTC)
- Florent has worked this into Bio::Assembly::IO --maj 04:00, 30 November 2009 (UTC)
- Velvet - being developed at the moment by John Marshall
- Mira
- A C-based FASTQ parser
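As a rough illustration of the light-weight parsing idea raised in the fastq discussion, the Python sketch below yields plain tuples instead of building a sequence object for each read. It is not a C parser and not a proposed BioPerl interface: the name `iter_fastq` is hypothetical, and the code assumes unwrapped records of exactly four lines each.

```python
import io


def iter_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near %r" % header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in %r" % header)
        yield header[1:], seq, qual


# Usage: iterate over an in-memory example without creating per-read objects.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2 desc\nGG\n+r2\n!!\n")
for name, seq, qual in iter_fastq(example):
    print(name, seq, qual)
```

Because nothing is decoded or validated beyond the record layout, quality strings stay as raw text until a caller actually needs the scores.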