GFF Refactor

From BioPerl

This is a page for discussion of GFF-based refactoring in BioPerl, focusing on core modules including (but not limited to) Bio::FeatureIO, Bio::SeqFeature::Annotated, and any other GFF-related code. This page will be updated as more discussion takes place and plans coalesce.

Implementation Plan

People working on this:

On the bioperl-live TRY_gff_refactor branch:

  • Split the Bio::FeatureIO modules off into their own CPAN distribution
  • Rename some TypedSeqFeatureI methods as suggested in Hilmar's post
  • Implement a new lighter-weight feature object, Bio::SeqFeature::Typed, that implements the Bio::SeqFeature::TypedSeqFeatureI interface and lazily implements the Bio::AnnotatableI interface, so that the AnnotationI objects it contains are not actually constructed until they are fetched via get_Annotations
  • Add a feature_ontology(Ontology) accessor to FeatureIO.pm, which will be inherited by all FeatureIO objects
  • Implement a new create_feature(Hash) method in FeatureIO that creates Typed seqfeatures if feature_ontology() is set, and Generic features otherwise
  • Merge the features of Bio::Tools::GFF and Bio::FeatureIO::gff into Bio::FeatureIO::gff, and deprecate Bio::Tools::GFF
  • Add custom hooks to Bio::SeqFeature::Typed to support compact serialization via Storable
  • Rectify all problems reported in the bugs linked at the bottom of this page
  • Add some new Bio::FeatureIO modules for GenomeThreader XML and FGENESH parsing (rbuels has these already)
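
The feature_ontology / create_feature items above could work roughly as follows. This is only a language-neutral sketch in Python, not BioPerl code: the names GenericFeature, TypedFeature and FeatureReader are hypothetical stand-ins, and the ontology is reduced to a plain mapping from term name to accession.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GenericFeature:
    """Stand-in for an untyped feature: the type is just a string."""
    seq_id: str
    start: int
    end: int
    primary_tag: str
    tags: dict = field(default_factory=dict)


@dataclass
class TypedFeature(GenericFeature):
    """Stand-in for a typed feature: the type is resolved to an ontology term."""
    term_id: str = ""


class FeatureReader:
    """Hypothetical reader base class; format parsers would inherit both members."""

    def __init__(self, feature_ontology: Optional[dict] = None):
        # Toy ontology (assumption): maps a term name to its accession.
        self.feature_ontology = feature_ontology

    def create_feature(self, data: dict) -> GenericFeature:
        # No ontology set: fall back to the cheap, untyped feature.
        if self.feature_ontology is None:
            return GenericFeature(**data)
        name = data["primary_tag"]
        if name not in self.feature_ontology:
            raise ValueError(f"unknown feature type: {name!r}")
        return TypedFeature(**data, term_id=self.feature_ontology[name])
```

A parser subclass would only ever call create_feature, so whether callers get typed or generic features is decided in one place by whether an ontology was supplied.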


Bio::SeqFeature::Annotated

Judging from the mailing list archives (not always a reliable guide), Bio::SeqFeature::Annotated was designed primarily to ensure that feature data could be checked for consistency prior to being used in a GFF3-related database. This led to several problems within core code, so the ensuing features were rolled back, and Bio::SeqFeature::Annotated became an experimental Bio::SeqFeature::TypedSeqFeatureI implementation that is strongly type-checked against the latest Sequence Ontology.

Type checking

With Bio::SeqFeature::Annotated, all data is stored as Bio::AnnotationI objects in a Bio::Annotation::Collection, as Bio::SeqFeature::Annotated objects in subfeatures, or in a Bio::LocationI (which is accessible through the Bio::RangeI interface). However, the only type checking that GFF3 itself appears to call for is on the primary_tag (aka the feature 'type'). So, as I see it, two levels of type checking are being performed:

  1. checking the primary_tag against SO/SOFA, and additionally...
  2. checking the rest of the tag values, possibly mapping against specific annotation types (Bio::Annotation::DBLink, Bio::Annotation::OntologyTerm, Bio::Annotation::Target come to mind).
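
To make the two levels concrete, here is a minimal sketch in Python (not BioPerl code). Everything in it is an assumption for illustration: SOFA_TERMS is a toy subset rather than the real ontology, and the validators only check the general shape of Dbxref, Ontology_term and Target values.

```python
SOFA_TERMS = {"gene", "mRNA", "exon", "CDS"}  # toy subset, not the real ontology


def _is_dbxref(value):
    # Expect "DBTAG:ID" with both halves non-empty.
    db, sep, accession = value.partition(":")
    return bool(db and sep and accession)


def _is_target(value):
    # Expect "target_id start end [strand]".
    parts = value.split(" ")
    if len(parts) not in (3, 4):
        return False
    if not (parts[1].isdigit() and parts[2].isdigit()):
        return False
    return len(parts) == 3 or parts[3] in ("+", "-")


# Hypothetical mapping from a tag name to the check applied to each value.
TAG_VALIDATORS = {
    "Dbxref": _is_dbxref,
    "Ontology_term": _is_dbxref,
    "Target": _is_target,
}


def check_feature(primary_tag, tags, strict=False):
    """Return a list of problems; an empty list means the feature passed."""
    problems = []
    # Level 1: the feature type must be a known ontology term.
    if primary_tag not in SOFA_TERMS:
        problems.append(f"type {primary_tag!r} is not a known term")
    # Level 2 (only when strict): check the values of tags that have a validator.
    if strict:
        for tag, values in tags.items():
            validator = TAG_VALIDATORS.get(tag)
            if validator is None:
                continue  # tags without a validator pass through unchecked
            for value in values:
                if not validator(value):
                    problems.append(f"{tag}={value!r} is malformed")
    return problems
```

With strict left off, a feature whose type is known passes regardless of its tag values; turning it on adds the per-tag checks without changing the first level.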

Implementation issues

We have several significant problems with the current Bio::SeqFeature::Annotated implementation that should be addressed by the BioPerl 1.7 release. Feel free to add to or delete from the list and make suggestions.

  • Rectify Bio::SeqFeature::TypedSeqFeatureI interface methods, possibly adding several other helper methods. See this post.
  • Ensure all methods comply with the Bio::SeqFeatureI interface: methods that the interface specifies as accepting and returning scalar values must not return anything else (such as objects). No overloading!
  • Adding typed tag data as Bio::AnnotationI will require using the Bio::AnnotatableI and Bio::AnnotationCollectionI interfaces. This is already in place.
  • Do we want optional checking of type? This has been requested...
  • The current implementation has been criticized as far too heavy (everything is an object), and it doesn't persist well. For instance, it quietly loads a singleton Bio::Ontology::Store (Issue #2513). When using a freeze/thaw database like Bio::DB::SeqFeature::Store, this singleton may be coming along for the ride.
  • Do we want two levels of typing? Maybe 'weak/loose' (primary_tag only) vs. 'strong/strict' (all tags)?
  • Related to the above, we could work on a lightweight alternative Bio::SeqFeature::TypedSeqFeatureI implementation. It may be possible (for instance) to have an alternative Bio::AnnotationCollectionI that creates the proper Bio::AnnotationI lazily; tag data could be stored lightly as a Data::Stag or simpler data structure.
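
The lazy-annotation and persistence points above can be sketched together. This is an illustration in Python rather than BioPerl code; Annotation and LazyFeature are hypothetical names, the build counter exists only to make the laziness observable, and the __getstate__/__setstate__ pair plays the role that custom Storable hooks would play in Perl.

```python
class Annotation:
    """Stand-in for a heavyweight annotation object."""
    built = 0  # counts constructions so the laziness can be observed

    def __init__(self, tag, value):
        Annotation.built += 1
        self.tag = tag
        self.value = value


class LazyFeature:
    def __init__(self, primary_tag, raw_tags, ontology=None):
        self.primary_tag = primary_tag
        self._raw_tags = raw_tags  # plain dict: tag -> list of strings
        self._ontology = ontology  # shared and possibly large; never serialized
        self._cache = {}           # tag -> list of Annotation, filled on demand

    def get_annotations(self, tag):
        # Build annotation objects only on first request for this tag.
        if tag not in self._cache:
            self._cache[tag] = [
                Annotation(tag, value) for value in self._raw_tags.get(tag, [])
            ]
        return self._cache[tag]

    def __getstate__(self):
        # Persist only the lightweight data: no ontology, no built objects.
        return {"primary_tag": self.primary_tag, "raw_tags": self._raw_tags}

    def __setstate__(self, state):
        self.primary_tag = state["primary_tag"]
        self._raw_tags = state["raw_tags"]
        self._ontology = None  # would be re-attached by whoever thaws the feature
        self._cache = {}
```

Because the serialized state holds only the type and the raw tag strings, a shared ontology store cannot come along for the ride, and a thawed feature rebuilds its annotation objects on demand.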

Bio::FeatureIO and other GFF-related code

TODO: add in links to all GFF-related code

  • Should we coalesce around a central mode of generating GFF data for output (focusing on GFF3)?
  • How flexible should it be? Do we want callback 'hooks' for advanced developers to make custom changes as needed?
  • Should we work towards consistent methods for all GFF-related output (not just GFF3, but GTF and GFF2)?
  • How do we want to deal with hierarchical data (i.e. the canonical gene model for GFF3)?
    • Lazily by default (just pass back to the user), but maybe have a standard way to get unflattened features (Bio::DB::SeqFeature::Store?).
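
As one possible shape for the questions above, the following Python sketch (not BioPerl code) shows a single GFF3 line writer with an optional callback hook, plus a helper that groups features by their Parent attribute. The dict-based feature layout and the function names are assumptions made for the example; only the nine-column layout and the percent-escaping of reserved characters follow the GFF3 format.

```python
import copy

COLUMNS = ("seqid", "source", "type", "start", "end", "score", "strand", "phase")
_RESERVED = set(";=&,%\t\n\r")


def escape_attr(text):
    # Percent-encode only reserved and control characters; spaces stay as-is.
    return "".join(
        f"%{ord(ch):02X}" if ch in _RESERVED or ord(ch) < 0x20 or ord(ch) == 0x7F else ch
        for ch in text
    )


def format_gff3_line(feature, hook=None):
    """Render one feature dict as a GFF3 line; hook may rewrite a copy first."""
    if hook is not None:
        feature = hook(copy.deepcopy(feature))  # the caller's data is left untouched
    fields = [str(feature.get(column, ".")) for column in COLUMNS]
    pairs = []
    for tag, values in feature.get("attributes", {}).items():
        joined = ",".join(escape_attr(value) for value in values)
        pairs.append(f"{escape_attr(tag)}={joined}")
    fields.append(";".join(pairs) or ".")
    return "\t".join(fields)


def group_by_parent(features):
    """Map each Parent ID to its child features, leaving the input list flat."""
    children = {}
    for feature in features:
        for parent_id in feature.get("attributes", {}).get("Parent", []):
            children.setdefault(parent_id, []).append(feature)
    return children
```

Keeping the hook as a plain function that receives and returns a feature lets advanced users customize output without subclassing the writer, while group_by_parent leaves the default flat, but gives one standard way to recover the gene model when it is wanted.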

GFF-related code:

Related bugs

See also
