Nextgen in Bioperl

(see thread at bioperl-l here; thanks to Elia)

This is a page for developer and user discussion of next-generation sequencing support in BioPerl. Please comment freely to help create priorities and use cases for development going forward. Thanks to all for your contributions!

Wish List

The following is a list of features users would like to see.

Improved support for fastq

After some discussion on the mailing list, the consensus so far is to support current next-gen formats within SeqIO, using the same naming convention as Biopython:

  • "fastq" in Biopython means the original Sanger standard FASTQ files encoding PHRED qualities using an ASCII offset of 33.
  • "fastq-solexa" in Biopython means the early Solexa/Illumina style FASTQ files which encode Solexa qualities using an ASCII offset of 64.
  • "fastq-illumina" in Biopython will mean recent Solexa/Illumina style FASTQ files (from pipeline version 1.3+) which encode PHRED qualities using an ASCII offset of 64. This is in the Biopython repository but has not been released yet, so the name "fastq-illumina" is not yet final.
  • A "standard" implementation will be used for the moment, although its performance is not optimal for the large numbers of reads that usually need to be handled. Possible future improvements include light-weight versions that avoid or reduce object creation, utilize C, etc.
  • If possible, provide some level of validation for the quality values of each format, i.e. check bounds during the parse and warn if they are exceeded.
  • Heikki has started developing some code (his discussion can be found here), and Heng Li's code is a useful guide. It would be best to compare and merge these to address issues identified in the past.
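
The offsets and the bounds check described in the list above can be sketched in a few lines. This is a minimal Python illustration only, not BioPerl or Biopython code: the `VARIANTS` table and the `decode_quality` function are hypothetical names, and the score ranges used for the warning (0 to 93 for Sanger, -5 to 62 for Solexa, 0 to 62 for Illumina 1.3+) are assumptions based on the printable ASCII range for each offset.

```python
import warnings

# Hypothetical lookup table: variant name -> (ASCII offset, min score, max score).
# The ranges are assumptions derived from printable ASCII (up to 126).
VARIANTS = {
    "fastq": (33, 0, 93),            # Sanger, PHRED scores
    "fastq-solexa": (64, -5, 62),    # early Solexa/Illumina, Solexa scores
    "fastq-illumina": (64, 0, 62),   # Illumina 1.3+, PHRED scores
}


def decode_quality(qual_string, variant="fastq"):
    """Convert a FASTQ quality string into a list of integer scores.

    Warns (rather than failing) when a score falls outside the range
    assumed for the chosen variant.
    """
    offset, low, high = VARIANTS[variant]
    scores = [ord(ch) - offset for ch in qual_string]
    out_of_range = [s for s in scores if s < low or s > high]
    if out_of_range:
        warnings.warn(
            f"{len(out_of_range)} quality value(s) outside "
            f"[{low}, {high}] for variant '{variant}'"
        )
    return scores
```

For example, `decode_quality("IIII", "fastq")` gives four scores of 40, while the same string read as "fastq-illumina" gives scores of 9, which is why the variant has to be named explicitly rather than guessed.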

Some further discussion of this topic can be found in these bioperl-l threads:

Progress

  • Bio::SeqIO::fastq now has initial read/write support for Illumina (v1.3) and Solexa (Illumina v1.0) formats --Chris Fields 19:31, 1 July 2009 (UTC)
    • Follows biopython convention.
    • Needs serious testing for quality data.
    • Do we want to incorporate FASTQ-int?
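
One thing that testing of quality data would need to cover is conversion between Solexa and PHRED scores, since the two scales diverge at low quality. The sketch below shows the standard conversion formulas in Python; the function names are hypothetical and this is not the code used in Bio::SeqIO::fastq.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa quality score to the equivalent PHRED score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def phred_to_solexa(q_phred):
    """Convert a PHRED quality score to the equivalent Solexa score.

    A PHRED score of 0 has no finite Solexa equivalent, so it is
    rejected here; a real parser would need to choose a policy
    (for example, clamping to a minimum Solexa score).
    """
    if q_phred <= 0:
        raise ValueError("PHRED score must be positive to convert")
    return 10 * math.log10(10 ** (q_phred / 10) - 1)
```

A Solexa score of 0 corresponds to a PHRED score of about 3.01, whereas at scores of 30 and above the two scales are nearly identical, so test data should include low-quality reads.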

Support for sequences produced by pyrosequencing (454): SFF format

  • Possibly via io_lib (bindings possible via BioLib?)

Bioperl-run support of common external tools

Bio::Assembly::IO::maq and Bio::Tools::Run::Maq are now in beta in their respective trunks --maj 12:59, 12 November 2009 (UTC)
Bio::Tools::Run::Samtools is now in beta in bioperl-run trunk --maj 04:00, 30 November 2009 (UTC)
Bio::Tools::Run::BWA and Bio::Assembly::IO::sam are now in beta in their respective trunks --maj 04:00, 30 November 2009 (UTC)
Florent has worked this into Bio::Assembly::IO --maj 04:00, 30 November 2009 (UTC)

Use Cases

Current Priorities
