Align Refactor

From BioPerl

This is a stub for any Bio::Align::AlignI-related refactoring, centering on Bio::LocatableSeq and Bio::SimpleAlign. Feel free to add to this page; I'll try to keep it organized!

Overall

Consider moving closer to class-based method naming and deprecating the older methods (something to consider globally?)

Bio::LocatableSeq

  • issues with location inconsistency
  • start/end-point checking and validation
  • Residue/Gap/Frameshift/Other symbols and how to handle them
    • Globals work for now, but it's not the best way to handle these, particularly if mixing LocatableSeqs derived from different sources with different symbol definitions
      • I had a go at this a while back using closures to capture the gap symbols value at the time of sequence creation. Is this the kind of thing envisioned? --Majensen 20:47, 18 February 2009 (UTC)
    • Possibly modify the seq string internally for consistency?
  • subseq needs to be refactored (calling start/end gets the start/end of the string, not the sequence). Also, named args shouldn't clash with the parent class or the Bio::PrimarySeqI definition; otherwise that definition needs to be changed.
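
To make the symbol and coordinate points above concrete, here is a minimal, language-neutral sketch in Python. It is not BioPerl code: the class GappedSeq, its gap_symbols argument, and the column_to_residue method are hypothetical names invented for illustration. It assumes gap symbols are fixed per object at creation time (rather than read from a global) and that start/end refer to residue coordinates, not string positions.

```python
class GappedSeq:
    """Toy stand-in for a located, gapped sequence (hypothetical, not the BioPerl API)."""

    def __init__(self, seq, start=1, gap_symbols="-."):
        self.seq = seq
        self.start = start
        # Gap symbols are captured per object, so sequences from sources
        # with different conventions can be mixed safely.
        self.gap_symbols = frozenset(gap_symbols)

    def is_gap(self, char):
        return char in self.gap_symbols

    @property
    def end(self):
        # End is derived from the number of residues, not the string length.
        residues = sum(1 for c in self.seq if not self.is_gap(c))
        return self.start + residues - 1

    def column_to_residue(self, column):
        """Map a 1-based alignment column to a residue coordinate (None for a gap)."""
        if not 1 <= column <= len(self.seq):
            raise IndexError("column out of range")
        if self.is_gap(self.seq[column - 1]):
            return None
        preceding = sum(1 for c in self.seq[:column - 1] if not self.is_gap(c))
        return self.start + preceding


a = GappedSeq("AC--GT", start=10)                   # default gap symbols
b = GappedSeq("AC~~GT", start=1, gap_symbols="~")   # source using '~' for gaps
print(a.end, a.column_to_residue(5), a.column_to_residue(3))  # 13 12 None
print(b.end, b.column_to_residue(5))                          # 4 3
```

Keeping the symbol set on the object is one possible answer to the globals problem; the closure approach mentioned above would achieve the same isolation.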

Other issues (likely requires a significant all-around refactor):

This is a semantic issue I ran into when working on the Bio::AlignIO::stockholm parser. With that format one can have annotation that relates to the full alignment, but also have annotation that is specific for the sequence.

Herein lies the issue: Bio::LocatableSeq is-a Bio::PrimarySeq, not a Bio::Seq (and thus not Bio::AnnotatableI). And one can't be both Bio::SeqI and Bio::PrimarySeqI (see 2262 for an example as to why not).

Options?

  • I may test out an idea I had a while back that would use a RangeI-capable Bio::Seq instead of a Bio::LocatableSeq. It would use a symbol-conscious Bio::PrimarySeq, reconfigured for subseq, etc. Class would store any location info in a 'source' feature and delegate to that.
    • I like a structure like this; I have grafted on annotation facilities to LocatableSeqs and it is a distasteful process. Would this mean LocatableSeqs going the way of the snail darter, with the thing never quite going extinct in the Maintainosphere? --Majensen 20:47, 18 February 2009 (UTC)
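
The delegation idea in the first option could look roughly like the following Python sketch. Everything here is hypothetical (the Feature and RangedSeq classes and their attributes are made up for illustration and do not correspond to existing BioPerl classes); it only assumes that location info lives in a single 'source' feature and that start/end are read from it.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Hypothetical minimal feature: a tag plus a residue range."""
    primary_tag: str
    start: int
    end: int


class RangedSeq:
    """Hypothetical annotatable sequence that delegates its range to a 'source' feature."""

    def __init__(self, seq, start, end, annotations=None):
        self.seq = seq
        # Sequence-level annotation sits alongside the sequence string.
        self.annotations = dict(annotations or {})
        # Location info is stored only in the 'source' feature.
        self.features = [Feature("source", start, end)]

    @property
    def source(self):
        return next(f for f in self.features if f.primary_tag == "source")

    @property
    def start(self):
        return self.source.start

    @property
    def end(self):
        return self.source.end

    def length(self):
        return self.end - self.start + 1


s = RangedSeq("AC--GT", start=10, end=13, annotations={"note": "example only"})
print(s.start, s.end, s.length())  # 10 13 4
```

Because the range is stored in one place, updating the 'source' feature is enough to move the sequence; there is no second copy of start/end to keep in sync.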

Bio::SimpleAlign

  • As raised on the mailing list, some methods may have discrepancies with sequence indices, particularly add_seq(), among others.
  • Minor cleanups of the interface (lots of methods that could be moved to Bio::Align::Utilities for instance).

Bio::Assembly-related

With current high-throughput sequencing technologies, the Bio::Assembly modules are limited by the following issues:

  • Memory usage is way too high: Loading 10,000 sequences of ~100 bp in an assembly requires > 1GB of memory. Trying to load 100,000 sequences on a machine with 2GB of RAM crashes the application due to lack of memory.
  • Contig features are saved in SeqFeature::Collection, which has a tied DB_File filehandle. The number of open filehandles can exceed the system limit (Bug 2577 [1]). A solution may be to move the collection to the Assembly level, and/or use Bio::DB::SeqFeature::Store (which supports in-memory storage) instead of SeqFeature::Collection.
  • Some of the assembly/contig features that are saved are not useful to everyone, and not all assemblers output the same information. Maybe decide on a set of core features and a set of optional features.
  • Some people have reported that parsing assemblies is slow and have proposed implementing a next_contig method in addition to the next_assembly method. This would also reduce memory usage, as only bite-size fragments of the assembly are dealt with at a time (mailing list thread [2]). See also this feature request: [3]
  • Parsers (not really a refactoring problem, but something that should probably be done):
    • The ACE parser lacks writing ability (Bug 2483 [4])
    • The phrap parser is ancient and doesn't really parse the sequence string (Bug 2620 [5])
    • Consider writing an AMOS [6] parser. AMOS has plenty of scripts to convert from AMOS to other assembly formats
      • This looks cool. I would be interested in helping here. --Majensen 03:34, 22 March 2009 (UTC)
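
The next_contig proposal above amounts to streaming one contig at a time instead of building the whole assembly in memory. A rough Python sketch of that pattern follows. The input format ("contig NAME" and "read NAME SEQ" lines) is made up for this example and is not ACE, phrap, or AMOS; the Contig class and iter_contigs function are likewise hypothetical.

```python
import io
from dataclasses import dataclass, field


@dataclass
class Contig:
    """Hypothetical contig record: a name plus its (read name, sequence) pairs."""
    name: str
    reads: list = field(default_factory=list)


def iter_contigs(handle):
    """Yield one Contig at a time from a made-up line-based format."""
    current = None
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "contig":
            if current is not None:
                yield current  # hand over the finished contig before starting the next
            current = Contig(parts[1])
        elif parts[0] == "read" and current is not None:
            current.reads.append((parts[1], parts[2]))
    if current is not None:
        yield current  # last contig in the file


data = io.StringIO(
    "contig c1\nread r1 ACGT\nread r2 CGTA\n"
    "contig c2\nread r3 TTGA\n"
)
for contig in iter_contigs(data):
    print(contig.name, len(contig.reads))
# c1 2
# c2 1
```

Only the contig currently being read is held by the parser, so peak memory depends on the largest contig rather than on the full assembly, provided the caller does not keep references to earlier contigs.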

Incorporate Lincoln's Bio::SamTools

This is a general set of tools that could possibly be used within this refactor, but we need to find a good way for it to fit in.
