Handler-based SeqIO parsers
This is a spec page for developers only!
Although some code may be committed to CVS for testing, do NOT rely on any of this functionality appearing in future BioPerl releases until/unless the spec is verified with other BioPerl core developers. It is possible that this page will be moved/deleted in the future to another spot (probably as a HOWTO) as development progresses. You have been warned!
The current Bio::SeqIO approach
As noted in the Project priority list, several of the parsers for Bio::SeqIO are showing signs of age. In particular, the original code for Bio::SeqIO::genbank, Bio::SeqIO::embl, and Bio::SeqIO::swiss is over six years old at this point and needs serious refactoring.
Problems with older parsers
Code maintenance
The Bio::SeqIO parsers are currently over six years old. As noted on the mailing list over the years and in the Project priority list, repeated patching and bug fixing of some parsers (in particular, Bio::SeqIO::genbank, Bio::SeqIO::embl, and Bio::SeqIO::swiss) has added a lot of code which has seen very little refactoring. Furthermore, the current structure of the parsers themselves (in general, large if-elsif-else blocks as a consequence of Perl's current lack of a switch control statement) makes the code difficult for a newcomer to decipher and for an experienced developer to refactor.
Code redundancy
As recent issues with Bio::Species/Bio::Taxon illustrate, several parsers contain almost exactly the same code for parsing common types of data, including (in no particular order) accessions, keywords, references, comments, species information, sequence features, dates, and sequence data. When debugging or developing new code this can become a huge problem, as you need to change similar code in at least three different parsers in order to fix the same issue.
Data consistency
Some parsers demonstrate inconsistent behavior with data. As an example, date information in GenBank records is scrupulously checked for possible variations in the date prior to storing, while dates parsed from EMBL/UniProt files are not checked at all. Some parsers make use of a Bio::Seq::SeqBuilder to build sequence objects, while others don't.
Code customization
The current parser structure works well for very simple formats (such as FASTA) but doesn't easily accommodate customization or development of alternative classes for storing data. For instance, as recently pointed out on the mailing list, Bio::SeqIO::FTHelper works well as a lightweight object for storing data, but it is only an intermediary step: a final object (a Bio::SeqFeature::Generic) is still generated for every Bio::SeqIO::FTHelper. As noted above, changes which would convert the data directly to SeqFeatures would need to be implemented in at least three different parsers. Furthermore, developing alternative classes for SeqFeatures/Annotations is harder under the current scheme.
The proposal
Adopt an XML parser-like (or event-driven) approach, probably more similar in spirit to XML::Twig as opposed to the SAX2-based XML::SAX. In short, split the parsing of the data and the handling of the data into two distinct tasks.
The Driver
In a very abstract way, next_seq() or the requisite parser would act as a simple driver method, generically parsing data into chunks that would be passed off to a handler object, which either passes the data off to the correct handler method or tosses it.
sub next_seq {
    my $self = shift;
    local($/) = "\n";
    my $hobj = $self->handler;
    PARSER:
    while (defined(my $line = $self->_readline)) {
        # ... parse into a data structure here
        if ($data_chunk) {
            $hobj->data_handler($data_chunk);
        }
        # ...bail out of parser at end of seq record
    }
    return $hobj->build_sequence;
}
The Handler
The data_handler() method would be responsible for passing on the data to the relevant private object handler methods:
package Bio::SeqIO::RichSeq::MyHandler;
use strict;
use warnings;

# use other classes as needed

# implement interface
use base qw(Bio::Root::Root Bio::SeqIO::HandlerI);

# define lookup table for private class methods that
# deal with the data
my %HANDLER = (
    'genbank' => {
        'LOCUS'      => \&_genbank_locus,
        'DEFINITION' => \&_genbank_definition,
        'ACCESSION'  => \&_genbank_accession,
        'VERSION'    => \&_genbank_version,
        'DBSOURCE'   => \&_genbank_dbsource,
        'SOURCE'     => \&_generic_species,
        'REFERENCE'  => \&_generic_reference,
        'COMMENT'    => \&_generic_comment,
        'FEATURES'   => \&_generic_seqfeatures,
        #... and so on
        # maybe have alternate methods for testing, commented out
        #'SOURCE'    => \&_generic_taxon,
        #'FEATURES'  => \&_generic_lightweightfeature,
        # skip this feature (can be done in the driver as well)
        'BASE'       => \&noop,    # this is generated from scratch
        'ORIGIN'     => \&_generic_seq,
        '_DEFAULT_'  => \&_generic_simplevalue,
    },
    'embl' => {
        'ID' => \&_embl_id,
        'OS' => \&_generic_species,
        #... and so on
    },
    'swiss' => {
        'ID' => \&_embl_id,
        'OS' => \&_generic_species,
        #... and so on
    },
);

# fun class stuff here

# set the proper handlers somewhere (constructor?)
# based on the format being parsed...
$self->{'handlers'} = $HANDLER{ $format };

# this might be defined in a GenericHandler
sub data_handler {
    my ($self, $data) = @_;
    # grab the name, which is a key for the method handler
    my $nm = $data->{NAME} || $self->throw("No name tag defined!");
    my $method = (exists $self->{'handlers'}->{$nm})        ? $self->{'handlers'}->{$nm}
               : (exists $self->{'handlers'}->{'_DEFAULT_'}) ? $self->{'handlers'}->{'_DEFAULT_'}
               :                                               undef;
    if (!$method) {
        $self->debug("No handler defined for $nm\n");
        return;
    }
    $self->$method($data);
    return;
}
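The dispatch logic can be tried out without any BioPerl classes. The following is only a minimal sketch under stated assumptions: the package name Demo::Handler is hypothetical, plain die stands in for Bio::Root::Root's throw(), and the two handler methods are invented stand-ins that merely record what they receive. It is meant to show the table lookup, the '_DEFAULT_' fallback, and how a chunk with no handler is tossed.

```perl
use strict;
use warnings;

package Demo::Handler;    # hypothetical name, not part of BioPerl

# Lookup table keyed by format, then by chunk name.
my %HANDLER = (
    genbank => {
        LOCUS     => \&_genbank_locus,
        _DEFAULT_ => \&_generic_simplevalue,
    },
    embl => {
        ID => \&_genbank_locus,    # no _DEFAULT_ here: unknown chunks are tossed
    },
);

sub new {
    my ($class, %args) = @_;
    my $format = $args{format};
    die "Unknown format\n"
        unless defined $format && exists $HANDLER{$format};
    return bless { handlers => $HANDLER{$format}, seen => [] }, $class;
}

sub data_handler {
    my ($self, $data) = @_;
    my $nm = $data->{NAME} or die "No name tag defined!\n";
    my $table  = $self->{handlers};
    my $method = exists $table->{$nm}       ? $table->{$nm}
               : exists $table->{_DEFAULT_} ? $table->{_DEFAULT_}
               :                              undef;
    return unless $method;    # no handler defined: toss the chunk
    $self->$method($data);
    return;
}

# Invented handlers that only record what they were given.
sub _genbank_locus {
    my ($self, $d) = @_;
    push @{ $self->{seen} }, "locus:$d->{DATA}";
    return;
}

sub _generic_simplevalue {
    my ($self, $d) = @_;
    push @{ $self->{seen} }, "simple:$d->{NAME}";
    return;
}

package main;

my $h = Demo::Handler->new(format => 'genbank');
$h->data_handler({ NAME => 'LOCUS',    DATA => 'NT_021877' });
$h->data_handler({ NAME => 'KEYWORDS', DATA => '.' });
print join(', ', @{ $h->{seen} }), "\n";
```

In this sketch the KEYWORDS chunk falls through to the '_DEFAULT_' handler because the genbank table defines one; with the embl table, which has no default, an unknown chunk is dropped.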
Specification
Several key issues would need to be resolved. In no particular order:
- What would the handler object look like (i.e. interface design)?
- How is the data passed to the handler?
- What constitutes a "chunk of data"?
- ...others?
Handler Interface
A very simple interface would define a single 'public' method for the data, such as data_handler() above, and other get/setters needed to help build the sequence object (Bio::Seq::SeqBuilder, Bio::AnnotationI, Bio::Species, etc).
Private handler methods are defined in implementations to handle data based on the lookup table and sequence format.
Other methods?
Passing Data to the Handler
One could adopt two (maybe more?) approaches here. The first would be to pass all data to one handler method (data_handler()); from there the determination is made to either pass the data on or toss it.
The second would be to call the handlers directly from within the driver itself; this would require a second interface method that passes back the proper handler coderefs to call from within the driver.
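A minimal sketch of the second approach, assuming a hypothetical interface method get_handler() that returns a coderef for a chunk name (or undef when the chunk should be skipped). The package name, the handler methods, and the 'result' hash are all invented for illustration and are not part of any existing BioPerl API.

```perl
use strict;
use warnings;

package Demo::DirectHandler;    # hypothetical name

my %HANDLER = (
    LOCUS      => \&_locus,
    DEFINITION => \&_definition,
);

sub new {
    my $class = shift;
    return bless { result => {} }, $class;
}

# Hypothetical second interface method: hand the driver the coderef
# for a chunk name, or undef if there is nothing to call.
sub get_handler {
    my ($self, $name) = @_;
    return $HANDLER{$name};
}

sub _locus {
    my ($self, $d) = @_;
    # first whitespace-separated field of the LOCUS data
    $self->{result}{display_id} = (split ' ', $d->{DATA})[0];
    return;
}

sub _definition {
    my ($self, $d) = @_;
    $self->{result}{desc} = $d->{DATA};
    return;
}

package main;

my $hobj = Demo::DirectHandler->new;

# Here the driver itself decides whether to call a handler or skip the chunk.
for my $chunk (
    { NAME => 'LOCUS',      DATA => 'NT_021877 10001 bp DNA linear CON 17-OCT-2003' },
    { NAME => 'DEFINITION', DATA => 'Homo sapiens chromosome 1 genomic contig.' },
    { NAME => 'KEYWORDS',   DATA => '.' },
) {
    my $code = $hobj->get_handler($chunk->{NAME}) or next;
    $hobj->$code($chunk);
}

print "$hobj->{result}{display_id}: $hobj->{result}{desc}\n";
```

One possible advantage of this variant is that the driver could ask for the handler before doing any parsing work on a chunk and skip unwanted sections early; the cost is a wider interface between driver and handler.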
What constitutes a chunk of data
This may be the trickiest part. Here I'll use GenBank format to demonstrate what I think a proper chunk would be:
LOCUS       NT_021877 10001 bp DNA linear CON 17-OCT-2003
DEFINITION  Homo sapiens chromosome 1 genomic contig.
ACCESSION   NT_021877 REGION: 13920000..13930000
VERSION     NT_021877.16 GI:37539616
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1 (bases 1 to 10001)
  AUTHORS   International Human Genome Sequencing Consortium.
  TITLE     The DNA sequence of Homo sapiens
  JOURNAL   Unpublished (2003)
COMMENT     GENOME ANNOTATION REFSEQ: Features on this sequence have been
            produced for build 34 of the NCBI's genome annotation [see
            documentation].
            On Oct 7, 2003 this sequence version replaced gi:29789880.
            The DNA sequence is part of the second release of the finished
            human reference genome. It was assembled from individual clone
            sequences by the Human Genome Sequencing Consortium in consultation
            with NCBI staff.
            COMPLETENESS: not full length.
FEATURES             Location/Qualifiers
     source          1..10001
                     /organism="Homo sapiens"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9606"
                     /chromosome="1"
     source          <1..>10001
                     /organism="Homo sapiens"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9606"
                     /clone="RP11-302I18"
                     /note="Accession AL451081 sequenced by The Sanger Centre"
     gene            complement(3024..6641)
                     /gene="LOC127086"
                     /note="Derived by automated computational analysis using
                     gene prediction method: GNOMON."
                     /db_xref="GeneID:127086"
                     /db_xref="InterimID:127086"
     mRNA            complement(join(3024..4108,4110..4258,4357..4533,
                     5985..6225,6324..6641))
                     /gene="LOC127086"
                     /product="similar to ATP-dependent DNA helicase II, 70 kDa
                     subunit (Lupus Ku autoantigen protein p70) (Ku70) (70 kDa
                     subunit of Ku antigen) (Thyroid-lupus autoantigen) (TLAA)
                     (CTC box binding factor 75 kDa subunit) (CTCBF) (CTC75)"
                     /note="Derived by automated computational analysis using
                     gene prediction method: GNOMON."
                     /transcript_id="XM_060320.3"
                     /db_xref="GI:37539614"
                     /db_xref="GeneID:127086"
                     /db_xref="InterimID:127086"
...
For the annotation data above, one would have to decide whether the annotation was primary (designates the main type of annotation) or secondary (should be included as part of the primary annotation data chunk). For the above, anything that is at the beginning of the line is primary (LOCUS, DEFINITION, REFERENCE, SOURCE, etc) and others (ORGANISM, JOURNAL, etc) are secondary.
The LOCUS and REFERENCE data structures would be something like this (using Data::Dumper):
$VAR1 = {
          'NAME' => 'LOCUS',
          'DATA' => 'NT_021877 10001 bp DNA linear CON 17-OCT-2003'
        };
...
$VAR1 = {
          'NAME' => 'REFERENCE',
          'DATA' => '1 (bases 1 to 10001)',
          'AUTHORS' => 'International Human Genome Sequencing Consortium.',
          'TITLE' => 'The DNA sequence of Homo sapiens',
          'JOURNAL' => 'Unpublished (2003)'
        };
Note that the parsing is generic. For instance, the data in LOCUS needs to be processed further. This is in keeping with the idea that the driver would minimally parse data.
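As a rough illustration of how a driver might build such chunks, here is a minimal sketch. It assumes the usual GenBank flat-file layout: a keyword starting in column 1 opens a primary chunk, a keyword indented by a few spaces is secondary, and a line with no keyword continues the previous value. The function name chunk_annotation is hypothetical, and the sketch stops at the FEATURES line because the feature table is chunked differently.

```perl
use strict;
use warnings;

# Hypothetical helper: split GenBank annotation lines into chunks.
sub chunk_annotation {
    my @lines = @_;
    my (@chunks, $chunk, $key);
    for my $line (@lines) {
        last if $line =~ /^FEATURES/;    # feature table handled elsewhere
        if ($line =~ /^([A-Z]+)\s+(.*?)\s*$/) {
            # keyword in column 1: start a new primary chunk
            $chunk = { NAME => $1, DATA => $2 };
            $key   = 'DATA';
            push @chunks, $chunk;
        }
        elsif ($chunk && $line =~ /^ {1,5}([A-Z]+)\s+(.*?)\s*$/) {
            # slightly indented keyword: secondary data of the current chunk
            $key = $1;
            $chunk->{$key} = $2;
        }
        elsif ($chunk && $line =~ /^\s+(\S.*?)\s*$/) {
            # no keyword: continuation of whatever value came last
            $chunk->{$key} .= " $1";
        }
    }
    return @chunks;
}

my @record = (
    'LOCUS       NT_021877 10001 bp DNA linear CON 17-OCT-2003',
    'REFERENCE   1 (bases 1 to 10001)',
    '  AUTHORS   International Human Genome Sequencing Consortium.',
    '  TITLE     The DNA sequence of Homo sapiens',
    '  JOURNAL   Unpublished (2003)',
    'COMMENT     GENOME ANNOTATION REFSEQ: Features on this sequence have been',
    "            produced for build 34 of the NCBI's genome annotation [see",
    '            documentation].',
    'FEATURES             Location/Qualifiers',
);

my @chunks = chunk_annotation(@record);
printf "%s => %s\n", $_->{NAME}, $_->{DATA} for @chunks;
```

Note that the helper does no interpretation of the values at all (the LOCUS data is still one string), in keeping with the idea of a minimally parsing driver.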
Feature data would carry the common name 'FEATURES' so the same handler could be used for all the data. Qualifiers would have their own key, with multiple qualifiers in an array reference:
$VAR1 = {
          'mol_type' => 'genomic DNA',
          'LOCATION' => '<1..>10001',
          'NAME' => 'FEATURES',
          'FEATURE_KEY' => 'source',
          'note' => 'Accession AL451081 sequenced by The Sanger Centre',
          'db_xref' => 'taxon:9606',
          'clone' => 'RP11-302I18',
          'organism' => 'Homo sapiens'
        };
$VAR1 = {
          'db_xref' => [
                         'GeneID:127086',
                         'InterimID:127086'
                       ],
          'LOCATION' => 'complement(3024..6641)',
          'NAME' => 'FEATURES',
          'FEATURE_KEY' => 'gene',
          'gene' => 'LOC127086',
          'note' => 'Derived by automated computational analysis using gene prediction method: GNOMON.'
        };
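A feature chunk of this shape could be assembled roughly as follows. This is a sketch under assumptions: the helper name chunk_feature is hypothetical, it receives the lines of exactly one feature, wrapped location lines are joined without a space, wrapped qualifier values are joined with a single space, and any continuation line that starts with "/" is treated as a new qualifier (which real data could violate).

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: turn the lines of ONE feature into a chunk.
sub chunk_feature {
    my ($first, @rest) = @_;
    my ($fkey, $loc) = $first =~ /^\s+(\S+)\s+(\S.*?)\s*$/
        or die "Not a feature line: $first\n";

    my @items = ($loc);    # item 0 is the location, the rest are qualifiers
    for my $line (@rest) {
        (my $text = $line) =~ s/^\s+|\s+$//g;
        if    ($text =~ m{^/}) { push @items, $text }       # new qualifier
        elsif (@items == 1)    { $items[0]  .= $text }       # wrapped location
        else                   { $items[-1] .= " $text" }    # wrapped value
    }

    my $location = shift @items;
    my %chunk = (
        NAME        => 'FEATURES',
        FEATURE_KEY => $fkey,
        LOCATION    => $location,
    );
    for my $item (@items) {
        my ($tag, $value) = $item =~ m{^/([^=]+)(?:=(.*))?$} or next;
        $value = '' unless defined $value;    # valueless qualifier
        $value =~ s/^"|"$//g;                 # strip surrounding quotes
        if    (!exists $chunk{$tag}) { $chunk{$tag} = $value }
        elsif (ref $chunk{$tag})     { push @{ $chunk{$tag} }, $value }
        else                         { $chunk{$tag} = [ $chunk{$tag}, $value ] }
    }
    return \%chunk;
}

my $chunk = chunk_feature(
    '     gene            complement(3024..6641)',
    '                     /gene="LOC127086"',
    '                     /note="Derived by automated computational analysis using',
    '                     gene prediction method: GNOMON."',
    '                     /db_xref="GeneID:127086"',
    '                     /db_xref="InterimID:127086"',
);

$Data::Dumper::Sortkeys = 1;
print Dumper($chunk);
```

A qualifier that occurs once stays a plain string and is promoted to an array reference only when it repeats, matching the two structures shown above. A handler method therefore has to check ref() before using a qualifier value; always storing array references would be a simpler contract, at the cost of slightly noisier data.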
These would be passed on to handler methods which further process the data. The upper-case names I have picked are a bit arbitrary; it's possible a common name could be used for all formats and mapped prior to passing the data off to the handler.
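The name-mapping idea could look something like the sketch below. Everything in it is an assumption made for illustration: the table %COMMON_NAME, the function map_chunk_name, and the common names 'ID' and 'SPECIES' are invented here and are not proposed names.

```perl
use strict;
use warnings;

# Hypothetical mapping from format-specific chunk names to common names.
my %COMMON_NAME = (
    genbank => { LOCUS => 'ID', SOURCE => 'SPECIES' },
    embl    => { ID    => 'ID', OS     => 'SPECIES' },
);

# Return a copy of the chunk with NAME mapped to the common name;
# names without a mapping pass through unchanged.
sub map_chunk_name {
    my ($format, $chunk) = @_;
    my $map  = $COMMON_NAME{$format} || {};
    my $name = $chunk->{NAME};
    my %copy = (%$chunk, NAME => exists $map->{$name} ? $map->{$name} : $name);
    return \%copy;
}

my $gb   = map_chunk_name('genbank', { NAME => 'SOURCE', DATA => 'Homo sapiens (human)' });
my $embl = map_chunk_name('embl',    { NAME => 'OS',     DATA => 'Homo sapiens (human)' });
print "$gb->{NAME} $embl->{NAME}\n";
```

With a mapping step like this in the driver, a handler would need only one lookup table rather than one per format, although format-specific handlers (such as the LOCUS and ID parsers) would still need some way to know which format the data came from.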