Handler-based SeqIO parsers

This is a spec page for developers only!

Although some code may be committed to CVS for testing, do NOT rely on any of this functionality appearing in future BioPerl releases unless and until the spec is verified with other BioPerl core developers. This page may be moved elsewhere (probably as a HOWTO) or deleted as development progresses. You have been warned!

The current Bio::SeqIO approach

As noted in the Project priority list, several of the parsers for Bio::SeqIO are showing signs of age. In particular, the original code for Bio::SeqIO::genbank, Bio::SeqIO::embl, and Bio::SeqIO::swiss is over six years old at this point and needs serious refactoring.

Problems with older parsers

Code maintenance

The Bio::SeqIO parsers are now over six years old. As noted on the mailing list over the years and in the Project priority list, repeated patches and bug fixes for some parsers (in particular Bio::SeqIO::genbank, Bio::SeqIO::embl, and Bio::SeqIO::swiss) have added a lot of code that has seen very little refactoring. Furthermore, the structure of the parsers themselves (in general, large if-elsif-else blocks, a consequence of Perl's current lack of a switch control statement) makes them difficult for a newcomer to decipher and for an experienced developer to refactor.

Code redundancy

As recent issues with Bio::Species/Bio::Taxon illustrate, several parsers contain almost exactly the same code for parsing common types of data, including (in no particular order) accessions, keywords, references, comments, species information, sequence features, dates, and sequence data. When debugging or developing new code this can become a huge problem, as you need to change similar code in at least three different parsers in order to fix the same issue.

Data consistency

Some parsers handle the same kind of data inconsistently. As an example, date information in GenBank records is carefully checked for format variations before it is stored, while dates parsed from EMBL/UniProt files are not checked at all. Some parsers use a Bio::Seq::SeqBuilder to build sequence objects, while others don't.

Code customization

The current parser structure works well for very simple formats (such as FASTA) but doesn't easily accommodate customization or the development of alternative classes for storing data. For instance, as recently pointed out on the mailing list, Bio::SeqIO::FTHelper works well as a lightweight object for storing data, but it is only an intermediate step: a final object (a Bio::SeqFeature::Generic) still has to be generated from every Bio::SeqIO::FTHelper. As noted above, changes that convert the data directly to SeqFeatures would need to be implemented in at least three different parsers. Furthermore, developing alternative classes for SeqFeatures/Annotations is harder under the current scheme.

The proposal

Adopt an XML parser-like (event-driven) approach, probably closer in spirit to XML::Twig than to the SAX2-based XML::SAX. In short, split parsing the data and handling the data into two distinct tasks.

The Driver

In a very abstract way, next_seq() or the requisite parser would act as a simple driver method, generically parsing data into chunks that would be passed off to a handler object, which either passes the data off to the correct handler method or tosses it.

sub next_seq {
    my $self = shift;    
    local($/) = "\n";
    my $hobj = $self->handler;
    PARSER:
    while (defined(my $line = $self->_readline)) {
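        my $data_chunk;    # hash ref for one parsed chunk, e.g. { NAME => ..., DATA => ... }
        # ... parse $line (and any continuation lines) into $data_chunk here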
        if ($data_chunk) {
            $hobj->data_handler($data_chunk);
        }
        # ...bail out of parser at end of seq record
    }
    return $hobj->build_sequence;
}
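As a rough illustration of the driver idea, here is a minimal, self-contained sketch. It is not the proposed implementation: the class My::RecordingHandler and the function next_seq_sketch are hypothetical names, the record is read from an in-memory filehandle rather than through _readline, and every indented line is simply treated as a continuation of the previous chunk.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the handler object: it only records what it is given.
package My::RecordingHandler;
sub new { return bless { chunks => [] }, shift }
sub data_handler {
    my ($self, $chunk) = @_;
    push @{ $self->{chunks} }, $chunk;
    return;
}
# A real handler would return a sequence object; this one returns the raw chunks.
sub build_sequence { return $_[0]->{chunks} }

package main;

# Simplified driver: one chunk per top-level tag, flushed when the next tag starts.
sub next_seq_sketch {
    my ($fh, $hobj) = @_;
    local $/ = "\n";
    my $chunk;
    while (defined(my $line = <$fh>)) {
        chomp $line;
        last if $line =~ m{^//};                      # end of record
        if ($line =~ /^([A-Z]+)\s*(.*)$/) {           # tag in column 1 starts a new chunk
            $hobj->data_handler($chunk) if $chunk;
            $chunk = { NAME => $1, DATA => $2 };
        }
        elsif ($chunk && $line =~ /^\s+(\S.*)$/) {    # continuation line
            $chunk->{DATA} .= " $1";
        }
    }
    $hobj->data_handler($chunk) if $chunk;
    return $hobj->build_sequence;
}

my $record = <<'END';
LOCUS       NT_021877              10001 bp    DNA     linear   CON 17-OCT-2003
DEFINITION  Homo sapiens chromosome 1 genomic contig.
KEYWORDS    .
//
END

open my $fh, '<', \$record or die "Cannot open in-memory record: $!";
my $chunks = next_seq_sketch($fh, My::RecordingHandler->new);
print "$_->{NAME}: $_->{DATA}\n" for @$chunks;
```

The driver knows nothing about what LOCUS or DEFINITION mean; it only decides where one chunk ends and the next begins.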

The Handler

The data_handler() method would be responsible for passing on the data to the relevant private object handler methods:

package Bio::SeqIO::RichSeq::MyHandler;
use strict;
use warnings;
# use other classes as needed
 
# implement interface
use base qw(Bio::Root::Root Bio::SeqIO::HandlerI);
 
# define lookup table for private class methods that
# deal with the data
 
my %HANDLER = (
    'genbank'   => {
        'LOCUS'         => \&_genbank_locus,
        'DEFINITION'    => \&_genbank_definition,
        'ACCESSION'     => \&_genbank_accession,
        'VERSION'       => \&_genbank_version,
        'DBSOURCE'      => \&_genbank_dbsource,
        'SOURCE'        => \&_generic_species,
        'REFERENCE'     => \&_generic_reference,
        'COMMENT'       => \&_generic_comment,
        'FEATURES'      => \&_generic_seqfeatures,
        #... and so on
 
        # maybe have alternate methods for testing, commented out
        #'SOURCE'        => \&_generic_taxon,
        #'FEATURES'      => \&_generic_lightweightfeature,
 
        # skip this feature (can be done in the driver as well)
        'BASE'          => \&noop,    # this is generated from scratch
 
        'ORIGIN'        => \&_generic_seq,
        '_DEFAULT_'     => \&_generic_simplevalue,
        },
    'embl'      => {
        'ID'    => \&_embl_id,
        'OS'    => \&_generic_species,
        #... and so on
        },
    'swiss'     => {
        'ID'    => \&_embl_id,
        'OS'    => \&_generic_species,
        #... and so on
        },
    );
 
 
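# fun class stuff here

# handler that deliberately ignores its data (used for 'BASE' above)
sub noop { return }

# set the proper handlers somewhere; one option is the constructor,
# based on the format being parsed...

sub new {
    my ($class, @args) = @_;
    my $self = $class->SUPER::new(@args);
    my ($format) = $self->_rearrange([qw(FORMAT)], @args);
    $self->{'handlers'} = $HANDLER{ $format }
        || $self->throw("No handlers defined for format '$format'");
    return $self;
}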
 
# this might be defined in a GenericHandler
sub data_handler {
    my ($self, $data) = @_;
 
    # grab the name, which is a key for the method handler
    my $nm = $data->{NAME} || $self->throw("No name tag defined!");
 
    my $method = (exists $self->{'handlers'}->{$nm}) ?
                    $self->{'handlers'}->{$nm} :
                (exists $self->{'handlers'}->{'_DEFAULT_'}) ?
                    $self->{'handlers'}->{'_DEFAULT_'} :
                undef;
    if (!$method) {
        $self->debug("No handler defined for $nm\n");
        return;
    }
 
    $self->$method($data);
    return;
}
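To see the dispatch in action without BioPerl installed, the following is a stripped-down sketch of the same lookup-table idea. The class My::SketchHandler, its 'seen' hash, and the two handler bodies are assumptions made for illustration; they use plain die instead of the Bio::Root::Root methods.

```perl
use strict;
use warnings;

package My::SketchHandler;   # hypothetical class, not part of BioPerl

# Lookup table: chunk name => private method that deals with the chunk.
my %HANDLER = (
    genbank => {
        LOCUS     => \&_genbank_locus,
        _DEFAULT_ => \&_generic_simplevalue,
    },
);

sub new {
    my ($class, %args) = @_;
    my $handlers = $HANDLER{ $args{format} }
        or die "No handlers for format '$args{format}'\n";
    return bless { handlers => $handlers, seen => {} }, $class;
}

sub data_handler {
    my ($self, $data) = @_;
    my $nm = $data->{NAME} or die "No name tag defined!\n";
    # Fall back to the default handler when no specific one exists.
    my $method = $self->{handlers}{$nm} || $self->{handlers}{_DEFAULT_};
    return unless $method;
    $self->$method($data);
    return;
}

sub _genbank_locus {
    my ($self, $data) = @_;
    # Keep only the first whitespace-separated field as the identifier.
    ($self->{seen}{display_id}) = split ' ', $data->{DATA};
    return;
}

sub _generic_simplevalue {
    my ($self, $data) = @_;
    $self->{seen}{ lc $data->{NAME} } = $data->{DATA};
    return;
}

package main;

my $h = My::SketchHandler->new(format => 'genbank');
$h->data_handler({ NAME => 'LOCUS',      DATA => 'NT_021877  10001 bp' });
$h->data_handler({ NAME => 'DEFINITION', DATA => 'Homo sapiens chromosome 1 genomic contig.' });
print "$_ => $h->{seen}{$_}\n" for sort keys %{ $h->{seen} };
```

Swapping in an alternative implementation (for example the commented-out _generic_taxon entry above) then only means changing one value in the table.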

Specification

Several key issues would need to be resolved. In no particular order:

  1. What would the handler object look like (i.e. interface design)?
  2. How is the data passed to the handler?
  3. What constitutes a "chunk of data"?

...others???

Handler Interface

A very simple interface would define a single 'public' method for the data, such as data_handler() above, and other get/setters needed to help build the sequence object (Bio::Seq::SeqBuilder, Bio::AnnotationI, Bio::Species, etc).

Private handler methods are defined in implementations to handle data based on the lookup table and sequence format.
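One possible shape for such an interface is sketched below. This is only an assumption about how it might look: My::HandlerI and My::CountingHandler are hypothetical names, and the sketch uses a plain die where BioPerl code would normally call throw_not_implemented().

```perl
use strict;
use warnings;

package My::HandlerI;   # hypothetical interface, modeled on the proposal

sub new { my ($class, %args) = @_; return bless { %args }, $class }

# The single public entry point; implementations must override it.
sub data_handler {
    my $self = shift;
    die ref($self) . " must implement data_handler()\n";
}

# Get/setter for whatever object assembles the final sequence.
sub seqbuilder {
    my $self = shift;
    $self->{seqbuilder} = shift if @_;
    return $self->{seqbuilder};
}

package My::CountingHandler;   # trivial implementation: counts chunks by name
our @ISA = ('My::HandlerI');

sub data_handler {
    my ($self, $data) = @_;
    $self->{count}{ $data->{NAME} }++;
    return;
}

package main;

my $handler = My::CountingHandler->new;
$handler->data_handler({ NAME => 'REFERENCE', DATA => '1  (bases 1 to 10001)' });
$handler->data_handler({ NAME => 'REFERENCE', DATA => '2  (bases 1 to 10001)' });
print "REFERENCE chunks seen: $handler->{count}{REFERENCE}\n";

# Calling the unimplemented method on the bare interface dies.
my $ok = eval { My::HandlerI->new->data_handler({}); 1 };
print "Interface call failed as expected\n" unless $ok;
```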

Other methods?

Passing Data to the Handler

One could adopt two (maybe more?) approaches here. The first would be to pass all data to one handler method (data_handler()); from there the determination is made to either pass the data on or toss it.

The second would be to call the handlers directly from within the driver itself; this would require a second interface method that passes back the proper handler coderefs to call from within the driver.
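The second approach might look roughly like the sketch below. The method name handler_methods and the class My::TableHandler are hypothetical; the point is only that the driver fetches a name-to-coderef table once and then dispatches on its own.

```perl
use strict;
use warnings;

package My::TableHandler;   # hypothetical class for illustration

sub new { return bless { values => {} }, shift }

# Hypothetical second interface method: returns the name => coderef table.
sub handler_methods {
    my $self = shift;
    return {
        DEFINITION => sub { $self->{values}{desc} = $_[0]{DATA} },
        VERSION    => sub { ($self->{values}{version}) = split ' ', $_[0]{DATA} },
    };
}

package main;

my $hobj  = My::TableHandler->new;
my $table = $hobj->handler_methods;

my @chunks = (
    { NAME => 'DEFINITION', DATA => 'Homo sapiens chromosome 1 genomic contig.' },
    { NAME => 'VERSION',    DATA => 'NT_021877.16  GI:37539616' },
    { NAME => 'KEYWORDS',   DATA => '.' },   # no handler: the driver tosses it
);

# The driver, not the handler, decides which chunks are passed on.
for my $chunk (@chunks) {
    my $code = $table->{ $chunk->{NAME} } or next;
    $code->($chunk);
}
print "$_ => $hobj->{values}{$_}\n" for sort keys %{ $hobj->{values} };
```

The trade-off is that the driver can skip building chunks nobody wants, at the cost of a wider interface between driver and handler.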

What constitutes a chunk of data

This may be the trickiest part. Here I'll use GenBank format to demonstrate what I think a proper chunk would be:

LOCUS       NT_021877              10001 bp    DNA     linear   CON 17-OCT-2003
DEFINITION  Homo sapiens chromosome 1 genomic contig.
ACCESSION   NT_021877 REGION: 13920000..13930000
VERSION     NT_021877.16  GI:37539616
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 10001)
  AUTHORS   International Human Genome Sequencing Consortium.
  TITLE     The DNA sequence of Homo sapiens
  JOURNAL   Unpublished (2003)
COMMENT     GENOME ANNOTATION REFSEQ:  Features on this sequence have been
            produced for build 34 of the NCBI's genome annotation [see
            documentation].
            On Oct 7, 2003 this sequence version replaced gi:29789880.
            The DNA sequence is part of the second release of the finished
            human reference genome. It was assembled from individual clone
            sequences by the Human Genome Sequencing Consortium in consultation
            with NCBI staff.
            COMPLETENESS: not full length.
FEATURES             Location/Qualifiers
     source          1..10001
                     /organism="Homo sapiens"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9606"
                     /chromosome="1"
     source          <1..>10001
                     /organism="Homo sapiens"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9606"
                     /clone="RP11-302I18"
                     /note="Accession AL451081 sequenced by The Sanger Centre"
     gene            complement(3024..6641)
                     /gene="LOC127086"
                     /note="Derived by automated computational analysis using
                     gene prediction method: GNOMON."
                     /db_xref="GeneID:127086"
                     /db_xref="InterimID:127086"
     mRNA            complement(join(3024..4108,4110..4258,4357..4533,
                     5985..6225,6324..6641))
                     /gene="LOC127086"
                     /product="similar to ATP-dependent DNA helicase II, 70 kDa
                     subunit (Lupus Ku autoantigen protein p70) (Ku70) (70 kDa
                     subunit of Ku antigen) (Thyroid-lupus autoantigen) (TLAA)
                     (CTC box binding factor 75 kDa subunit) (CTCBF) (CTC75)"
                     /note="Derived by automated computational analysis using
                     gene prediction method: GNOMON."
                     /transcript_id="XM_060320.3"
                     /db_xref="GI:37539614"
                     /db_xref="GeneID:127086"
                     /db_xref="InterimID:127086"
...

For the annotation data above, one has to decide whether each tag is primary (it designates the main type of annotation and starts a new chunk) or secondary (it is included as part of the preceding primary chunk). Here, any tag that starts in the first column is primary (LOCUS, DEFINITION, REFERENCE, SOURCE, etc.) and the indented tags (ORGANISM, JOURNAL, etc.) are secondary.
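The primary/secondary rule can be sketched as a small chunking function. This is an illustration under simplifying assumptions, not the proposed parser: chunk_annotation is a hypothetical name, it only handles the annotation block (not the feature table), and it assumes secondary tags are indented by exactly two spaces as in the record above.

```perl
use strict;
use warnings;

# Group annotation lines into chunks: a tag in column 1 is primary, a tag
# indented by two spaces is secondary, deeper indentation is a continuation.
sub chunk_annotation {
    my @lines = @_;
    my (@chunks, $chunk, $key);
    for my $line (@lines) {
        if ($line =~ /^([A-Z]+)\s+(.*)$/) {                    # primary tag
            push @chunks, $chunk if $chunk;
            $chunk = { NAME => $1, DATA => $2 };
            $key   = 'DATA';
        }
        elsif ($chunk && $line =~ /^ {2}([A-Z]+)\s+(.*)$/) {   # secondary tag
            $key = $1;
            $chunk->{$key} = $2;
        }
        elsif ($chunk && $line =~ /^\s+(\S.*)$/) {             # continuation
            $chunk->{$key} .= " $1";
        }
    }
    push @chunks, $chunk if $chunk;
    return @chunks;
}

my @lines = split /\n/, <<'END';
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 10001)
  AUTHORS   International Human Genome Sequencing Consortium.
  TITLE     The DNA sequence of Homo sapiens
  JOURNAL   Unpublished (2003)
END

my @chunks = chunk_annotation(@lines);
for my $c (@chunks) {
    print "$c->{NAME}\n";
    print "  $_ => $c->{$_}\n" for grep { $_ ne 'NAME' } sort keys %$c;
}
```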

The LOCUS and REFERENCE data structures would be something like this (using Data::Dumper):

$VAR1 = {
          'NAME' => 'LOCUS',
          'DATA' => 'NT_021877              10001 bp    DNA     linear   CON 17-OCT-2003'
        };
...
$VAR1 = {
          'NAME' => 'REFERENCE',
          'DATA' => '1  (bases 1 to 10001)',
          'AUTHORS' => 'International Human Genome Sequencing Consortium.',
          'TITLE' => 'The DNA sequence of Homo sapiens',
          'JOURNAL' => 'Unpublished (2003)'
        };

Note that the parsing is generic. For instance, the data in LOCUS needs to be processed further. This is in keeping with the idea that the driver would minimally parse data.
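As an example of the further processing a handler might do, the sketch below splits the raw LOCUS string into named fields. The function name parse_locus and the field names are illustrative assumptions, and the pattern only covers the column layout shown above; older or unusual LOCUS lines would need more cases.

```perl
use strict;
use warnings;

# Split the raw LOCUS DATA string into named fields. Returns nothing if the
# line does not match the layout this sketch knows about.
sub parse_locus {
    my ($data) = @_;
    my ($name, $length, $unit, $molecule, $topology, $division, $date) =
        $data->{DATA} =~ /^(\S+)\s+(\d+)\s+(bp|aa)\s+(\S+)\s+(linear|circular)\s+(\S+)\s+(\S+)$/
        or return;
    return {
        display_id  => $name,
        length      => $length,
        alphabet    => ($unit eq 'aa' ? 'protein' : lc $molecule),
        is_circular => ($topology eq 'circular' ? 1 : 0),
        division    => $division,
        date        => $date,
    };
}

my $locus = parse_locus({
    NAME => 'LOCUS',
    DATA => 'NT_021877              10001 bp    DNA     linear   CON 17-OCT-2003',
});
print "$_ => $locus->{$_}\n" for sort keys %$locus;
```

Because this logic lives in the handler, the same driver output could feed a different handler that validates the date or ignores the division entirely.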

Feature data would carry the common name 'FEATURES' so the same handler could be used for all of it. Each qualifier would be stored under its own key, with repeated qualifiers collected in an array reference:

$VAR1 = {
          'mol_type' => 'genomic DNA',
          'LOCATION' => '<1..>10001',
          'NAME' => 'FEATURES',
          'FEATURE_KEY' => 'source',
          'note' => 'Accession AL451081 sequenced by The Sanger Centre',
          'db_xref' => 'taxon:9606',
          'clone' => 'RP11-302I18',
          'organism' => 'Homo sapiens'
        };
$VAR1 = {
          'db_xref' => [
                         'GeneID:127086',
                         'InterimID:127086'
                       ],
          'LOCATION' => 'complement(3024..6641)',
          'NAME' => 'FEATURES',
          'FEATURE_KEY' => 'gene',
          'gene' => 'LOC127086',
          'note' => 'Derived by automated computational analysis using gene prediction method: GNOMON.'
        };

These would be passed on to handler methods which further process the data. The upper-case names I have picked are a bit arbitrary; it's possible a common name could be used for all formats and mapped prior to passing the data off to the handler.
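The qualifier convention used in the feature chunks above (a scalar for a single value, an array reference when a qualifier repeats) could be built up on the driver side with something like the following. The helper name add_qualifier is hypothetical.

```perl
use strict;
use warnings;

# Store a qualifier on a feature chunk: a plain scalar for the first value,
# promoted to an array reference when the same qualifier appears again.
sub add_qualifier {
    my ($chunk, $tag, $value) = @_;
    if (!exists $chunk->{$tag}) {
        $chunk->{$tag} = $value;
    }
    elsif (ref($chunk->{$tag}) eq 'ARRAY') {
        push @{ $chunk->{$tag} }, $value;
    }
    else {
        $chunk->{$tag} = [ $chunk->{$tag}, $value ];
    }
    return $chunk;
}

my $gene = {
    NAME        => 'FEATURES',
    FEATURE_KEY => 'gene',
    LOCATION    => 'complement(3024..6641)',
};
add_qualifier($gene, 'gene',    'LOC127086');
add_qualifier($gene, 'db_xref', 'GeneID:127086');
add_qualifier($gene, 'db_xref', 'InterimID:127086');

# A handler can normalize either form back to a list.
my @xrefs = ref($gene->{db_xref}) ? @{ $gene->{db_xref} } : ($gene->{db_xref});
print join(', ', @xrefs), "\n";
```

One thing this layout leaves open is a qualifier whose name collides with a reserved upper-case key; GenBank qualifiers are lower-case, so it does not arise in the example, but other formats might need a nested structure instead.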
