GFF code audit
Note: This is a tracker page and a stub for now.
Introduction
We are planning a code audit of the various ways GFF output is generated via BioPerl classes. Feel free to modify this page as needed. However, please limit discussions to the Discussion Page or the mailing list.
Current classes which generate GFF format
Maybe a table here, with class/GFF format supported/output
Classes
Below is a list of classes which either read or write GFF.
- Bio::DB::GFF
- Bio::DB::SeqFeature
- Bio::FeatureIO
- Bio::Graphics::FeatureBase
- Bio::Graphics::FeatureFile
- Bio::Map::Physical
- Bio::SearchIO::Writer::GbrowseGFF
- Bio::SeqFeature::Annotated
- Bio::SeqFeature::Collection
- Bio::SeqFeature::Generic
- Bio::SeqFeature::PositionProxy
- Bio::SeqFeature::Similarity
- Bio::SeqFeature::SimilarityPair
- Bio::SeqFeature::SiRNA::Oligo
- Bio::SeqFeature::SiRNA::Pair
- Bio::SeqFeature::Tools::IDHandler
- Bio::SeqFeature::Tools::Unflattener
- Bio::SeqFeatureI
- Bio::SeqI
- Bio::Tools::Analysis::DNA::ESEfinder
- Bio::Tools::Analysis::Protein::Mitoprot
- Bio::Tools::Analysis::Protein::NetPhos
- Bio::Tools::Analysis::Protein::Scansite
- Bio::Tools::BPlite::HSP
- Bio::Tools::EMBOSS::Palindrome
- Bio::Tools::EPCR
- Bio::Tools::Eponine
- Bio::Tools::FootPrinter
- Bio::Tools::Geneid
- Bio::Tools::GFF
- Bio::Tools::Glimmer
- Bio::Tools::GuessSeqFormat
- Bio::Tools::ipcress
- Bio::Tools::isPcr
- Bio::Variation::AAChange
- Bio::Variation::DNAMutation
- Bio::Variation::RNAChange
Scripts
- scripts/Bio-DB-GFF/bulk_load_gff.PLS
- scripts/Bio-DB-GFF/fast_load_gff.PLS
- scripts/Bio-DB-GFF/genbank2gff.PLS
- scripts/Bio-DB-GFF/genbank2gff3.PLS
- scripts/Bio-DB-GFF/generate_histogram.PLS
- scripts/Bio-DB-GFF/load_gff.PLS
- scripts/Bio-DB-GFF/meta_gff.PLS
- scripts/Bio-DB-GFF/process_gadfly.PLS
- scripts/Bio-DB-GFF/process_sgd.PLS
- scripts/Bio-DB-GFF/process_wormbase.PLS
- scripts/Bio-SeqFeature-Store/bp_seqfeature_gff3.PLS
- scripts/Bio-SeqFeature-Store/bp_seqfeature_load.PLS
- scripts/graphics/feature_draw.PLS
- scripts/graphics/frend.PLS
- scripts/seq/unflatten_seq.PLS
- scripts/utilities/search2BSML.PLS
- scripts/utilities/search2gff.PLS
Examples
- examples/Bio-DB-GFF/load_ucsc.pl
- examples/biographics/feature_data.gff
- examples/searchio/waba2gff.pl
- examples/searchio/waba2gff3.pl
- examples/tools/gb_to_gff.pl
- examples/tools/gff2ps.pl
Problems with current output
- Features generally have to be built within the parser objects to be GTF/GFF2 or GFF3 compatible.
- CDS-typed features require a phase component, which currently isn't mapped into SeqFeatures via SeqIO or bp_genbank2gff3 (see Issue #2322).
- Consistency/flexibility when generating GFF3 from Bio::SearchIO-generated data.
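To make the phase requirement concrete, here is a minimal sketch (in Python rather than Perl, and not part of any BioPerl class) of how the phase column could be derived from CDS coordinates alone. The function name cds_phases is hypothetical, and the sketch assumes that the first CDS segment in translation order starts on a complete codon.

```python
def cds_phases(cds_segments, strand="+"):
    """Return the GFF3 phase (0, 1 or 2) for each CDS segment.

    cds_segments: list of (start, end) pairs, 1-based inclusive, all from
    one transcript. Assumption: the first segment in translation order
    begins a complete codon (phase 0).
    """
    # Walk segments in translation order: ascending on '+', descending on '-'.
    ordered = sorted(cds_segments, reverse=(strand == "-"))
    phases = {}
    phase = 0
    for start, end in ordered:
        phases[(start, end)] = phase
        length = end - start + 1
        # After skipping `phase` bases and consuming whole codons, the
        # leftover bases decide how many bases the next segment must skip.
        phase = (3 - (length - phase) % 3) % 3
    # Report phases in the caller's original segment order.
    return [phases[segment] for segment in cds_segments]


if __name__ == "__main__":
    # CDS coordinates taken from the SNAP example on this page.
    print(cds_phases([(505, 673), (730, 1446), (1472, 3447)]))  # [0, 2, 2]
```

The result agrees with the phase column of the three CDS lines in the example data on this page. A partial CDS (for instance a 5'-truncated gene model) would break the phase-0 assumption and would need the starting phase supplied from elsewhere.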
Examples
Here is some gene feature data in GTF and GFF3. Note that interleaving the exon and CDS lines is not required in GTF; it is simply how the results look when sorted by start position*strand. GFF3 does not require the gene feature to precede the mRNA, but the modules used by Gbrowse (Bio::DB::SeqFeature and Bio::DB::GFF at least) take this shortcut in parsing, so it is best to keep them ordered in this fashion.
GFF3
Chrom1 SNAP gene 505 3447 . + . ID=gene000002;Name=Chrom1.0-snap.1
Chrom1 SNAP mRNA 505 3447 . + . ID=mRNA000002;Name=Chrom1.0-snapCCIN.1.1
Chrom1 SNAP exon 505 673 21.624 + . ID=exon000013;Parent=mRNA000002
Chrom1 SNAP exon 730 1446 46.298 + . ID=exon000014;Parent=mRNA000002
Chrom1 SNAP exon 1472 3447 147.456 + . ID=exon000015;Parent=mRNA000002
Chrom1 SNAP CDS 505 673 21.624 + 0 ID=cds000013;Parent=mRNA000002
Chrom1 SNAP CDS 730 1446 46.298 + 2 ID=cds000014;Parent=mRNA000002
Chrom1 SNAP CDS 1472 3447 147.456 + 2 ID=cds000015;Parent=mRNA000002
GTF
Chrom1 SNAP start_codon 505 507 . + . transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1 SNAP CDS 505 673 21.624 + 0 exontype "initial"; transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1 SNAP exon 505 673 21.624 + . exontype "initial"; transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1 SNAP CDS 730 1446 46.298 + 2 exontype "internal"; transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1 SNAP exon 730 1446 46.298 + . exontype "internal"; transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1 SNAP CDS 1472 3447 147.456 + 2 exontype "terminal"; transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1 SNAP exon 1472 3447 147.456 + . exontype "terminal"; transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
Chrom1 SNAP stop_codon 3445 3447 . + . transcript_id "Chrom1.0-snapCCIN.1.1"; gene_id "Chrom1.0-snap.1";
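The two formats differ mainly in how the ninth column encodes grouping: GFF3 uses tag=value pairs with explicit ID and Parent tags, while GTF uses tag "value"; pairs and groups rows by transcript_id and gene_id. The following is a rough Python sketch of parsing both styles. The function names are hypothetical, and the GTF pattern is a simplification that assumes quoted values contain no double quotes.

```python
import re
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Parse 'ID=x;Parent=y,z' into {'ID': ['x'], 'Parent': ['y', 'z']}."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing or doubled semicolon
        tag, _, value = pair.partition("=")
        # GFF3 escapes reserved characters with URL-style percent encoding.
        attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attributes


# One 'tag "value";' or 'tag value;' pair (simplified, see lead-in).
_GTF_PAIR = re.compile(r'\s*(\S+)\s+(?:"([^"]*)"|(\S+?))\s*;')


def parse_gtf_attributes(column9):
    """Parse 'gene_id "g1"; transcript_id "t1";' into a dict of lists."""
    attributes = {}
    for tag, quoted, bare in _GTF_PAIR.findall(column9):
        attributes.setdefault(tag, []).append(bare if bare else quoted)
    return attributes


if __name__ == "__main__":
    # Attribute columns copied from the example records on this page.
    print(parse_gff3_attributes("ID=exon000013;Parent=mRNA000002"))
    print(parse_gtf_attributes(
        'exontype "initial"; transcript_id "Chrom1.0-snapCCIN.1.1"; '
        'gene_id "Chrom1.0-snap.1";'
    ))
```

Both parsers return lists of values per tag, so a multi-valued GFF3 Parent and a repeated GTF tag are handled the same way downstream.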
Proposals
- Rewrite Bio::FeatureIO to accept/produce any Bio::SeqFeatureI/Bio::FeatureHolderI, or set up a similar set of tools capable of reading in and producing the required output (Bio::Tools::GFF3?).
- Add new methods to SeqFeatureI interface to deal with typing?
- Should we merge TypedSeqFeatureI and SeqFeatureI?
- Note that Bio::DB::SeqFeature and Bio::SeqFeature::Annotated use type() and source() in conflicting ways (the former returns a string, the latter a Bio::AnnotationI).
- Implement optional typing/unflattening within FeatureIO itself and not within the Bio::SeqFeatureI class.
- Build in an interface to Bio::DB::SeqFeature::Store or other Bio::SeqFeature::CollectionI?
- My thought is to use the SeqFeature::Collection to help build hierarchical data --Chris Fields 10:41, 29 October 2007 (EDT)
GMOD Discussion
- Build Hierarchical Features (see Bio::SeqFeature::Slim CVS on lightweight_feature_branch branch)
- These will explicitly have PARENT and ID semantic fields
- Map GTF and GFF to this hierarchy
- SO compliance and validation can be done on this, but not explicitly coded in, to keep the objects lightweight.
- Configurable filters which define what the Group/Parent is and the ID field from GTF or GFF3
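One possible reading of the points above, sketched in Python for illustration only: each feature carries explicit ID and parent fields, and a small pluggable function (the "filter") decides which attributes supply them for a given format. All names here (Feature, gff3_keys, gtf_keys, build_hierarchy, implied_parent_type) are hypothetical and are not taken from Bio::SeqFeature::Slim. The sketch also assumes a single parent per feature, although GFF3 allows several.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Feature:
    type: str
    start: int
    end: int
    id: Optional[str] = None
    parent_id: Optional[str] = None
    children: list = field(default_factory=list)


def gff3_keys(attrs):
    """Filter for GFF3: explicit ID and Parent tags."""
    return attrs.get("ID"), attrs.get("Parent")


def gtf_keys(attrs):
    """Filter for GTF: rows have no ID; group them by transcript_id."""
    return None, attrs.get("transcript_id")


def build_hierarchy(records, keys, implied_parent_type=None):
    """Link (type, start, end, attrs) records into trees; return the roots.

    If implied_parent_type is given, a parent that never appears as a
    record (the usual GTF case) is synthesised with that type.
    """
    by_id, features, implied = {}, [], set()
    for ftype, start, end, attrs in records:
        fid, pid = keys(attrs)
        feat = Feature(ftype, start, end, fid, pid)
        features.append(feat)
        if fid is not None:
            by_id[fid] = feat
    roots = []
    # Second pass, so children may precede their parents in the input.
    for feat in features:
        parent = by_id.get(feat.parent_id) if feat.parent_id else None
        if parent is None and feat.parent_id and implied_parent_type:
            parent = Feature(implied_parent_type, feat.start, feat.end,
                             feat.parent_id)
            by_id[parent.id] = parent
            implied.add(parent.id)
            roots.append(parent)
        if parent is None:
            roots.append(feat)
            continue
        parent.children.append(feat)
        if parent.id in implied:
            # Grow the synthesised parent to span all of its children.
            parent.start = min(parent.start, feat.start)
            parent.end = max(parent.end, feat.end)
    return roots
```

Because linking happens in a second pass, this sketch does not depend on parents being listed before their children. Validation against SO is deliberately left out, in keeping with the goal of lightweight objects.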
Lightweight SF objects
- Array-based SFs ~70% faster than Bio::SeqFeature::Generic, but only ~7% faster than Bio::Graphics::Feature (do we have comparisons with Bio::DB::SeqFeature?)
- Can this approach handle fuzzy locations with ease (for instance, if we want to use these instead of Bio::SeqFeature::Generic)?
- Maybe use a Factory, which could be used in Bio::SeqIO (in place of Bio::SeqIO::FTHelper) and Bio::FeatureIO
- Typing and unflattening would be handled in the Factory
- Typing using local or retrieved SO data (are we dropping SOFA completely?)
- Unflattening using Chris M.'s Bio::SeqFeature::Tools::Unflattener or similar.
- The FT data is hard-coded in Bio::SeqFeature::Tools::Unflattener. Maybe have a way to update to the latest mapping via SO CVS or use alternative mappings?
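As a rough illustration of the Factory idea, the sketch below (Python, not BioPerl) builds slot-based features, a loose analogue of array-based Perl objects, and applies typing through a replaceable mapping instead of a hard-coded table. SlimFeature, FeatureFactory and the two-entry type map are hypothetical; the map is only an example and is not the table used by Bio::SeqFeature::Tools::Unflattener. No performance claim is implied for this Python version.

```python
class SlimFeature:
    """Fixed-slot feature: no per-instance dict, so each object stays small."""

    __slots__ = ("seq_id", "source", "type", "start", "end", "strand",
                 "attributes")

    def __init__(self, seq_id, source, type, start, end, strand, attributes):
        self.seq_id = seq_id
        self.source = source
        self.type = type
        self.start = start
        self.end = end
        self.strand = strand
        self.attributes = attributes


class FeatureFactory:
    """Creates features and maps feature-table keys to SO terms on the way."""

    def __init__(self, type_map=None):
        # The mapping is injected so it can be reloaded or swapped for an
        # alternative mapping without touching the factory itself.
        self.type_map = dict(type_map or {})

    def create(self, seq_id, source, primary_tag, start, end, strand,
               **attributes):
        # Unknown keys pass through unchanged rather than raising.
        so_type = self.type_map.get(primary_tag, primary_tag)
        return SlimFeature(seq_id, source, so_type, start, end, strand,
                           attributes)


if __name__ == "__main__":
    # Illustrative mapping only (assumption, not the Unflattener table).
    factory = FeatureFactory({"5'UTR": "five_prime_UTR",
                              "3'UTR": "three_prime_UTR"})
    utr = factory.create("Chrom1", "SNAP", "5'UTR", 400, 504, "+")
    print(utr.type)
```

Keeping the mapping outside the factory is the design point: the same factory could serve both a Bio::SeqIO-style reader and a Bio::FeatureIO-style reader, each with its own mapping.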