Consider moving closer to class-based method name syntax and deprecating older methods (something to consider globally?)
- issues with location inconsistency
- start/end-point checking and validation
- Residue/Gap/Frameshift/Other symbols and how to handle them
- Globals work for now, but it's not the best way to handle these, particularly if mixing LocatableSeqs derived from different sources with different symbol definitions
- Possibly modify the seq string internally for consistency?
- subseq needs to be refactored: calling start/end returns the start/end of the string, not of the sequence. Also, named args shouldn't clash with the parent class or the Bio::PrimarySeqI definition; otherwise the definition itself needs to be changed.
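To make the string-versus-sequence coordinate problem concrete, here is a minimal, language-neutral sketch in Python. It is not the BioPerl API: the class `GappedSeq`, its method names, and the default gap symbols are all hypothetical. It shows two of the ideas above together: gap symbols stored per object instead of in globals, and a `subseq` that takes sequence coordinates and maps them to columns of the gapped string.

```python
class GappedSeq:
    """Toy stand-in for a locatable, gapped sequence (hypothetical, not BioPerl)."""

    def __init__(self, seq, start=1, gap_symbols="-."):
        self.seq = seq                        # aligned string, gaps included
        self.start = start                    # sequence coordinate of the first residue
        self.gap_symbols = set(gap_symbols)   # per-object symbols, no globals
        # The end coordinate is derived from the residue count, not the string length.
        self.end = start + self.residue_count() - 1

    def residue_count(self):
        """Number of non-gap characters in the aligned string."""
        return sum(1 for ch in self.seq if ch not in self.gap_symbols)

    def column_from_location(self, pos):
        """Map a sequence coordinate to a 1-based column of the aligned string."""
        if not self.start <= pos <= self.end:
            raise ValueError(f"{pos} is outside {self.start}..{self.end}")
        seen = self.start - 1
        for col, ch in enumerate(self.seq, start=1):
            if ch not in self.gap_symbols:
                seen += 1
                if seen == pos:
                    return col
        raise ValueError("coordinate not found")  # not reachable after the range check

    def subseq(self, start, end):
        """Return the aligned slice covering sequence coordinates start..end."""
        if start > end:
            raise ValueError("start must not exceed end")
        first = self.column_from_location(start)
        last = self.column_from_location(end)
        return self.seq[first - 1:last]
```

With `GappedSeq("--AC-GT..A", start=5)`, the string has 10 columns but the sequence runs from 5 to 9, and `subseq(6, 8)` returns the slice from the C to the T, internal gap included. Two objects from different sources can carry different `gap_symbols` without interfering with each other.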
Other issues (likely requires a significant all-around refactor):
This is a semantic issue I ran into when working on the Bio::AlignIO::stockholm parser. With that format one can have annotation that relates to the full alignment, but also have annotation that is specific for the sequence.
Herein lies the issue: Bio::LocatableSeq is-a Bio::PrimarySeq, not a Bio::Seq (and thus not Bio::AnnotatableI). And one can't be both Bio::SeqI and Bio::PrimarySeqI (see 2262 for an example as to why not).
- I may test out an idea I had a while back that would use a RangeI-capable Bio::Seq instead of a Bio::LocatableSeq. It would use a symbol-conscious Bio::PrimarySeq, reconfigured for subseq, etc. Class would store any location info in a 'source' feature and delegate to that.
- I like a structure like this; I have grafted on annotation facilities to LocatableSeqs and it is a distasteful process. Would this mean LocatableSeqs going the way of the snail darter, with the thing never quite going extinct in the Maintainosphere? --Majensen 20:47, 18 February 2009 (UTC)
- As raised on the mailing list, some methods may have discrepancies with sequence indices, particularly add_seq(), among others.
- Minor cleanups of the interface (lots of methods that could be moved to Bio::Align::Utilities for instance).
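The idea floated earlier, of a range-capable sequence that keeps its location in a 'source' feature and delegates to it, could look roughly like the sketch below. This is Python used only for illustration; `SourceFeature` and `RangedSeq` are hypothetical names and do not correspond to existing BioPerl classes. The point is that the sequence object holds annotation directly, while start/end/strand live in exactly one place.

```python
class SourceFeature:
    """Holds the location of a sequence on its parent (hypothetical)."""

    def __init__(self, start, end, strand=1):
        if start > end:
            raise ValueError("start must not exceed end")
        self.start = start
        self.end = end
        self.strand = strand


class RangedSeq:
    """Annotatable sequence that delegates all range queries to its source feature."""

    def __init__(self, seq, start, end, strand=1):
        self.seq = seq
        self.annotations = {}                       # per-sequence annotation
        self.source = SourceFeature(start, end, strand)

    # Range accessors read from the source feature instead of storing copies.
    @property
    def start(self):
        return self.source.start

    @property
    def end(self):
        return self.source.end

    @property
    def strand(self):
        return self.source.strand

    def length(self):
        """Length of the covered range, independent of the stored string."""
        return self.end - self.start + 1

    def overlaps(self, other):
        """True if the two ranges share at least one position."""
        return self.start <= other.end and other.start <= self.end
```

Because the accessors read through to `source`, changing the feature's coordinates is immediately visible on the sequence object, so there is no second copy of the location to keep in sync.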
With current high-throughput sequencing technologies, the Bio::Assembly modules are limited by the following issues:
- Memory usage is way too high: Loading 10,000 sequences of ~100 bp in an assembly requires > 1GB of memory. Trying to load 100,000 sequences on a machine with 2GB of RAM crashes the application due to lack of memory.
- Contig features are saved in SeqFeature::Collection, which has a tied DB_File filehandle. The number of open filehandles can exceed the system limit (Bug 2577). A solution may be to move the collection to the Assembly level, and/or use Bio::DB::SeqFeature::Store (which can work in-memory) instead of SeqFeature::Collection.
- Some assembly/contig features saved do not seem very useful to everyone and not all assemblers output the same information. Maybe decide on a set of core features and optional features.
- Some people have reported that parsing assemblies is slow and have proposed implementing a next_contig method in addition to the next_assembly method. This would also reduce memory usage, as only bite-size fragments of the assembly are dealt with at a time (mailing list thread). See also this feature request: 
- Parsers (not really a refactoring problem but something that should probably be done):
- The ACE parser lacks writing ability (Bug 2483)
- The phrap parser is ancient and doesn't really parse the sequence string (Bug 2620)
- Consider writing an AMOS parser. AMOS has plenty of scripts to convert from AMOS to other assembly formats.
- This looks cool. I would be interested in helping here. --Majensen 03:34, 22 March 2009 (UTC)
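The next_contig proposal above amounts to streaming: hand back one contig at a time instead of building the whole assembly in memory. A minimal sketch of that pattern, in Python for illustration only, follows. The function name `iter_contigs` is hypothetical, and the input handling is deliberately simplified: it assumes ACE-like text and looks only at `CO` (contig) and `RD` (read) record lines, ignoring sequence, quality, and all other records.

```python
import io


def iter_contigs(handle):
    """Yield (contig_name, read_names) one contig at a time from ACE-like text.

    Simplified on purpose: only the first field of 'CO' and 'RD' lines is used.
    """
    name, reads = None, []
    for line in handle:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        if fields[0] == "CO":
            if name is not None:
                yield name, reads         # previous contig is complete
            name, reads = fields[1], []
        elif fields[0] == "RD" and name is not None:
            reads.append(fields[1])
    if name is not None:
        yield name, reads                 # last contig in the stream


# Small in-memory example; a real file handle works the same way.
example = io.StringIO(
    "AS 2 3\n"
    "\n"
    "CO Contig1 8 2 1 U\n"
    "ACGTACGT\n"
    "RD r1 4 0 0\n"
    "RD r2 4 0 0\n"
    "\n"
    "CO Contig2 4 1 1 U\n"
    "RD r3 4 0 0\n"
)
contigs = list(iter_contigs(example))
```

Since the generator holds only the current contig, peak memory is bounded by the largest contig and not by the size of the assembly, which is the property the proposal is after.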
Incorporate Lincoln's Bio::SamTools
This is a general set of tools that could possibly be used within this refactor, but we need to find a good way for it to fit in.