GFF Refactor
This page is for discussion of GFF-based refactoring in BioPerl, focusing on core modules including (but not limited to) Bio::FeatureIO, Bio::SeqFeature::Annotated, and any other GFF-related code. It will be updated as more discussion takes place and plans coalesce.
Implementation Plan
People working on this:
On the bioperl-live TRY_gff_refactor branch:
- Split the Bio::FeatureIO modules off into their own CPAN distribution
- Rename some TypedSeqFeatureI methods as suggested in Hilmar's post
- Implement a new lighter-weight feature object, Bio::SeqFeature::Typed, that implements the Bio::SeqFeature::TypedSeqFeatureI interface and implements Bio::AnnotatableI lazily, so that the AnnotationI objects it contains are not constructed until they are fetched via get_Annotations
- Add a feature_ontology(Ontology) accessor to FeatureIO.pm, which will be inherited by all FeatureIO objects
- Implement a new create_feature(Hash) method in FeatureIO that creates Typed seqfeatures if feature_ontology() is set, and Generic features otherwise
- Merge the features of Bio::Tools::GFF and Bio::FeatureIO::gff into Bio::FeatureIO::gff, and deprecate Bio::Tools::GFF
- Add custom hooks to Bio::SeqFeature::Typed to support compact serialization via Storable
- Rectify all problems reported in the bugs linked at the bottom of this page
- Add some new Bio::FeatureIO modules for GenomeThreader XML and FGENESH parsing (rbuels has these already)
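The create_feature dispatch described in the plan can be sketched roughly as follows. This is an illustration in Python, not BioPerl code: the class names, the use of a plain set of term names as the "ontology", and the decision to raise on an unknown term are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class GenericFeature:
    """Untyped feature: primary_tag is just a string (hypothetical class)."""
    seq_id: str
    start: int
    end: int
    primary_tag: str
    tags: dict = field(default_factory=dict)


@dataclass
class TypedFeature(GenericFeature):
    """Feature whose primary_tag was checked against an ontology (hypothetical class)."""


class FeatureReader:
    """Stand-in for a FeatureIO-like parser base class (hypothetical)."""

    def __init__(self, feature_ontology: Optional[Set[str]] = None):
        # Assumption: the ontology is modeled as a set of valid term names.
        self.feature_ontology = feature_ontology

    def create_feature(self, **args):
        # No ontology set: fall back to untyped, generic features.
        if self.feature_ontology is None:
            return GenericFeature(**args)
        # Ontology set: only accept types it knows about.
        # Raising here is an assumption; warning and falling back is another option.
        if args["primary_tag"] not in self.feature_ontology:
            raise ValueError(f"unknown feature type: {args['primary_tag']}")
        return TypedFeature(**args)
```

Because every parser subclass would inherit create_feature, individual format parsers would not need to know which feature class is in use.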
Bio::SeqFeature::Annotated
From reading the history on the mailing list (not always a good metric), Bio::SeqFeature::Annotated was designed primarily to ensure that feature data could be checked for consistency prior to being used in a GFF3-related database. This led to several problems within core code, so the ensuing features were rolled back, and Bio::SeqFeature::Annotated became an experimental Bio::SeqFeature::TypedSeqFeatureI implementation that is strongly type-checked against the latest Sequence Ontology.
Type checking
With Bio::SeqFeature::Annotated, all data is stored as Bio::AnnotationI in a Bio::Annotation::Collection, as Bio::SeqFeature::Annotated in subfeatures, or in a Bio::LocationI (which is accessible through the Bio::RangeI interface). However, the only type-checking apparently called for with GFF3 is via the primary_tag (aka the feature 'type'). So, as I see it, two levels of type-checking are being performed:
- checking the primary_tag against SO/SOFA, and additionally...
- checking the rest of the tag values, possibly mapping against specific annotation types (Bio::Annotation::DBLink, Bio::Annotation::OntologyTerm, Bio::Annotation::Target come to mind).
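A minimal sketch of the two levels, in Python for illustration only: "weak" checks just the feature type, while "strict" also checks tag values. The tiny term set stands in for SO/SOFA, the function and table names are hypothetical, and the value checks only test the surface syntax that GFF3 defines for these attributes (DBTAG:ID for Dbxref and Ontology_term, "id start end [strand]" for Target).

```python
# Tiny stand-in for SO/SOFA; a real check would query a loaded ontology.
SO_TERMS = {"gene", "mRNA", "exon", "CDS"}


def _is_dbtag_id(value):
    """True if value looks like DBTAG:ID (e.g. EMBL:AA816246)."""
    db, sep, ident = value.partition(":")
    return bool(db and sep and ident)


def _is_target(value):
    """True if value looks like 'target_id start end [strand]'."""
    parts = value.split()
    if len(parts) not in (3, 4):
        return False
    if not (parts[1].isdigit() and parts[2].isdigit()):
        return False
    return len(parts) == 3 or parts[3] in ("+", "-")


# Tags that would map onto specific annotation types; others are left unchecked.
TAG_CHECKS = {
    "Dbxref": _is_dbtag_id,
    "Ontology_term": _is_dbtag_id,
    "Target": _is_target,
}


def check_feature(primary_tag, tags, level="weak"):
    """Return a list of problems; an empty list means the feature passed."""
    if level not in ("weak", "strict"):
        raise ValueError(f"unknown checking level: {level}")
    problems = []
    # Level 1: the feature type must be a known ontology term.
    if primary_tag not in SO_TERMS:
        problems.append(f"unknown feature type: {primary_tag}")
    # Level 2 (strict only): tag values must match their expected form.
    if level == "strict":
        for tag, values in tags.items():
            check = TAG_CHECKS.get(tag)
            if check is None:
                continue
            for value in values:
                if not check(value):
                    problems.append(f"malformed {tag} value: {value}")
    return problems
```

Returning a list of problems rather than raising lets the caller decide whether a failed check is fatal, which would also make the checking optional.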
Implementation issues
We have several significant problems with the current Bio::SeqFeature::Annotated implementation that should be addressed by the BioPerl 1.7 release. Feel free to add to or delete from the list and make suggestions.
- Rectify Bio::SeqFeature::TypedSeqFeatureI interface methods, possibly adding several other helper methods. See this post.
- Ensure all methods comply with Bio::SeqFeatureI interface, in that methods specified in the interface that accept and return scalar values do not return other things (objects). No overloading!
- Adding typed tag data as Bio::AnnotationI will require using the Bio::AnnotatableI and Bio::AnnotationCollectionI interfaces. This is already in place.
- Do we want optional checking of type? This has been requested...
- The current implementation has been described as far too heavy (everything is an object), and it doesn't persist well. For instance, it loads a singleton Bio::Ontology::Store on the sly (Issue #2513). When using a freeze/thaw database like Bio::DB::SeqFeature::Store, this singleton may be coming along for the ride.
- Do we want two levels of typing? Maybe 'weak/loose' (primary_tag only) vs. 'strong/strict' (all tags)?
- Related to the above, we could work on a lightweight alternative Bio::SeqFeature::TypedSeqFeatureI implementation. It may be possible (for instance) to have an alternative Bio::AnnotationCollectionI that creates the proper Bio::AnnotationI lazily; tag data could be stored lightly as a Data::Stag or simpler data structure.
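The lazy-annotation idea can be sketched as below, in Python for illustration only. Raw tag values are kept as plain strings, annotation objects are built and cached on first access, and the serialization hooks keep only the raw strings so that cached objects do not come along for the ride. All class and method names are hypothetical, and the objects_built counter exists only to make the laziness observable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleValue:
    """Generic tag/value annotation (hypothetical class)."""
    tag: str
    value: str


@dataclass(frozen=True)
class DBLink:
    """Database cross-reference annotation (hypothetical class)."""
    database: str
    primary_id: str


def _build(tag, raw):
    # Assumption: only Dbxref gets a specialized annotation type here.
    if tag == "Dbxref":
        database, _, primary_id = raw.partition(":")
        return DBLink(database, primary_id)
    return SimpleValue(tag, raw)


class LazyAnnotations:
    """Stores raw tag strings; builds annotation objects only on demand."""

    def __init__(self, raw_tags):
        self._raw = {tag: list(values) for tag, values in raw_tags.items()}
        self._cache = {}
        self.objects_built = 0

    def get_annotations(self, tag):
        # Build and cache the objects for this tag the first time it is asked for.
        if tag not in self._cache:
            built = [_build(tag, raw) for raw in self._raw.get(tag, [])]
            self.objects_built += len(built)
            self._cache[tag] = built
        return list(self._cache[tag])

    def __getstate__(self):
        # Serialize only the compact raw data, never the cached objects.
        return {"_raw": self._raw}

    def __setstate__(self, state):
        self._raw = state["_raw"]
        self._cache = {}
        self.objects_built = 0
```

Python's pickle module calls __getstate__ and __setstate__ during freezing and thawing; custom Storable hooks on a Perl feature object would play the same role.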
TODO: add in links to all GFF-related code
- Should we coalesce around a central mode of generating GFF data for output (focusing on GFF3)?
- How flexible should it be? Do we want callback 'hooks' for advanced developers to make custom changes as needed?
- Should we work towards consistent methods for all GFF-related output (not just GFF3, but also GTF and GFF2)?
- How do we want to deal with hierarchical data (i.e. the canonical gene model for GFF3)?
- Lazily by default (just pass back to the user), but maybe have a standard way to get unflattened features (Bio::DB::SeqFeature::Store?).
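For the hierarchical case, one possible shape for a standard "unflatten" step is sketched below, in Python for illustration only. It assumes each flat feature is a dict with an optional "ID" string and an optional "Parent" list, mirroring the GFF3 ID and Parent attributes; the function name and the nested output format are hypothetical, and cycles in the Parent links are not detected.

```python
from collections import defaultdict


def unflatten(features):
    """Group flat features into trees using their ID/Parent attributes."""
    by_id = {f["ID"]: f for f in features if "ID" in f}
    children = defaultdict(list)
    roots = []
    for feature in features:
        parents = feature.get("Parent", [])
        if not parents:
            roots.append(feature)
        for parent_id in parents:
            # Assumption: a Parent that was never declared is treated as an error.
            if parent_id not in by_id:
                raise KeyError(f"unknown Parent: {parent_id}")
            children[parent_id].append(feature)

    def build(feature):
        # A feature with several parents (e.g. a shared exon) appears under each.
        kids = children.get(feature.get("ID"), [])
        return {"feature": feature, "children": [build(k) for k in kids]}

    return [build(root) for root in roots]
```

With this split, parsers can keep handing back flat features by default, and only callers who want the gene model pay for building it.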
GFF-related code:
- Bio::DB::SeqFeature::Store::GFF3Loader - GFF3 file loader for Bio::DB::SeqFeature::Store
- Bio::Tools::GFF - A Bio::SeqAnalysisParserI compliant GFF format parser
- bp_seqfeature_load.pl