HOWTO:SQLite for BioPerl indexing
Abstract
Idea: use the AnyDBM_File system to widen the choice of DBMs in as many BioPerl indexing modules as is reasonable.
Author
maj -at- fortinbras -dot- us
The Issue
Many of the BioPerl standard indexing modules use the DB_File module exclusively as a DBM to create tied hashes and arrays. While DB_File is part of Perl core, it depends on the external installation of Berkeley DB. On at least one platform (Windows/32 using ActiveState Perl), Berkeley DB is difficult to integrate with DB_File.
The Perl core AnyDBM_File module lets the developer supply a list of candidate DBMs to fail over across; the tie uses the first one that loads. By default the list includes SDBM, which ships with Perl. One idea is to replace every DB_File instance with AnyDBM_File. However, DB_File also provides a direct API to Berkeley DB (the put(), get(), seq(), etc. methods) that is used fairly extensively in BioPerl indexing code. SDBM does not provide these methods, and it also has a record length limit (in the standard Perl build) that is too restrictive for many applications.
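The failover mechanism described above can be sketched with core modules only. This is a minimal illustration, not BioPerl code: the candidate list (DB_File, then SDBM_File), the scratch directory and the key/value pair are all assumptions chosen so the script runs whether or not Berkeley DB is installed.

```perl
use strict;
use warnings;
use Fcntl;                          # exports O_CREAT and O_RDWR
use File::Temp qw(tempdir);

# Candidate DBMs in order of preference. The list must be set before
# AnyDBM_File is compiled, hence the BEGIN block.
BEGIN { @AnyDBM_File::ISA = qw(DB_File SDBM_File) }
use AnyDBM_File;

# After loading, @AnyDBM_File::ISA holds only the DBM that was selected.
my $backend = $AnyDBM_File::ISA[0];

my $dir = tempdir( CLEANUP => 1 );  # scratch directory (illustrative)
my %db;
tie( %db, 'AnyDBM_File', "$dir/demo", O_CREAT | O_RDWR, 0644 )
    or die "Cannot tie to $backend: $!";

$db{seq1} = 'ACGT';                 # ordinary hash access goes to the DBM
my $value = $db{seq1};
untie %db;

print "backend: $backend\n";
print "seq1 => $value\n";
```

Only the plain tied-hash interface is used here, which is the part every DBM in the list supports. The put(), get() and seq() methods are what a simple swap to SDBM would lose.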
Being able to use SQLite for ties would provide an attractive alternative to both BDB and SDBM. The SQLite DBI driver (DBD::SQLite) bundles the Perl DBI driver and the SQLite DBMS itself as XS in a single distribution, obviating an external install. There are no record length restrictions.
Modules
Currently available in CPAN:
Other modules have also been modified on branch anydbm-branch.
Usage
BEGIN {
    @AnyDBM_File::ISA = qw( DB_File SQLite_File )
        unless @AnyDBM_File::ISA == 1; # single member indicates AnyDBM_File already loaded
}
use AnyDBM_File;
use vars qw( $DB_BTREE &R_DUP ); # must declare the globals you expect to use
use AnyDBM_File::Importer qw(:bdb); # an import tag is REQUIRED

my %db;
$DB_BTREE->{'flags'} = R_DUP;
tie( %db, 'AnyDBM_File', 'my.db', O_CREAT | O_RDWR, 0644, $DB_BTREE );
Description
DB_File Emulation
The intention was to create a DBM that could almost completely substitute for DB_File, so that DB_File could be replaced everywhere in the code by AnyDBM_File, and things would just work. Currently, it is slightly more complicated than that, but not too much more.
Versions of $DB_HASH, $DB_BTREE, and $DB_RECNO, as well as
the necessary flags (R_DUP, R_FIRST, R_NEXT, etc.) are
imported by using the AnyDBM_File::Importer module. The desired
constants need to be declared global in the calling program, as well
as imported, to avoid compilation errors (at this point). See
Converting from DB_File below.
Arguments to the tie function mirror those of DB_File, and all should
work the same way. See Converting from DB_File.
All of DB_File's random and sequential access functions work:
get() put() del() seq()
as well as the duplicate key handlers
get_dup() del_dup() find_dup()
seq() works by finding partial matches, like DB_File::seq().
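The DB_File documentation states that, for a BTREE database, seq() with the R_CURSOR flag returns the smallest key greater than or equal to the requested key, which is what allows partial key matches. The sketch below mimics that lookup rule over an ordinary hash. It is an illustration of the semantics only, not the SQLite_File implementation, and the helper name seq_cursor and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the (key, value) pair with the smallest key
# that is greater than or equal to $probe, in default string order.
sub seq_cursor {
    my ( $data, $probe ) = @_;
    for my $key ( sort keys %$data ) {
        return ( $key, $data->{$key} ) if $key ge $probe;
    }
    return;    # no key at or after the probe
}

my %index = (
    human => 'Homo sapiens',
    moose => 'Alces alces',
    mouse => 'Mus musculus',
);

my ( $key, $value ) = seq_cursor( \%index, 'mo' );
print "$key => $value\n";    # moose => Alces alces
```

A real BTREE keeps its keys ordered, so it does not need to sort on every call as this sketch does.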
The extra array functions (shift(), pop(), etc.) are not yet
implemented as method calls, though all these functions (including splice) are available on the tied arrays.
Some HASHINFO fields are functional:
$DB_BTREE->{'compare'} = sub { -( $_[0] cmp $_[1] ) };
will provide sequential access in reverse lexicographic order, for example.
$DB_HASH->{'cachesize'} = 20000;
will enforce
PRAGMA cache_size = 20000
in the underlying database.
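To see what a reversing comparator for the compare field does, it can be tried outside any database. This sketch assumes the callback receives the two keys as its arguments, as DB_File's compare callback does, and simply feeds it to Perl's sort. The sample keys are made up.

```perl
use strict;
use warnings;

# Negating cmp swaps "sorts before" and "sorts after".
my $compare = sub { -( $_[0] cmp $_[1] ) };

my @keys = sort { $compare->( $a, $b ) } qw(banana apple cherry);
print "@keys\n";    # cherry banana apple
```

The parentheses matter: unary minus binds more tightly than cmp, so the negation has to be applied to the result of the comparison rather than to the first key.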
Converting from DB_File
To fail over to SQLite_File from DB_File, go from this:
use DB_File;
# ...
$DB_BTREE->{cachesize} = 100000;
$DB_BTREE->{flags} = R_DUP;
my %db;
my $obj = tie( %db, 'DB_File', 'my.db', $flags, 0666, $DB_BTREE);
to this:
use vars qw( $DB_BTREE &R_DUP );
BEGIN {
    @AnyDBM_File::ISA = qw( DB_File SQLite_File )
        unless @AnyDBM_File::ISA == 1;
}
use AnyDBM_File;
use AnyDBM_File::Importer qw(:bdb);
# ...
$DB_BTREE->{cachesize} = 100000;
$DB_BTREE->{flags} = R_DUP;
my %db;
my $obj = tie( %db, 'AnyDBM_File', 'my.db', $flags, 0666, $DB_BTREE);
Implementation and Testing
Design and Motivation
Objective
- Provide an SQLite-based DBM as a drop-in alternative for DB_File
Two things are required of such a module:
- The machinery for tying hashes and arrays (see perltie), and
- An emulation of the DB_File API to Berkeley DB.
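The two requirements can be illustrated with a toy in-memory class. This is a sketch under stated assumptions, not SQLite_File itself: the class name TinyDBM is hypothetical, storage is a plain Perl hash inherited from the core Tie::StdHash class, and only get(), put() and del() are emulated, using DB_File's convention of returning 0 on success and 1 when the key does not exist.

```perl
use strict;
use warnings;

package TinyDBM;    # hypothetical class name, for illustration only
require Tie::Hash;
our @ISA = ('Tie::StdHash');    # requirement 1: TIEHASH, FETCH, STORE, ...

# Requirement 2: DB_File-style methods on the tied object.
sub get {
    my ( $self, $key ) = @_;
    return 1 unless exists $self->{$key};
    $_[2] = $self->{$key};      # write the value into the caller's variable
    return 0;
}

sub put {
    my ( $self, $key, $value ) = @_;
    $self->{$key} = $value;
    return 0;
}

sub del {
    my ( $self, $key ) = @_;
    return 1 unless exists $self->{$key};
    delete $self->{$key};
    return 0;
}

package main;

my %db;
my $obj = tie %db, 'TinyDBM';

$db{seq1} = 'ACGT';                          # tied-hash interface
my $value;
my $status = $obj->get( 'seq1', $value );    # method interface
print "status=$status value=$value\n";       # status=0 value=ACGT

$obj->put( 'seq2', 'TTGA' );
print join( ',', sort keys %db ), "\n";      # seq1,seq2
```

Both interfaces operate on the same store, which is the property calling code relies on when it mixes hash access with method calls on the object returned by tie.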
Tests and Modifications
Unit Tests
Test Suites currently passing
These tests currently (r16252) pass under both DB_File and SQLite_File (for the author, under ActiveState Perl 5.8 without DB_File and Cygwin Perl 5.10 with DB_File), with modifications to the core modules at that revision:
- t/LocalDB/BioDBGFF.t
- t/LocalDB/BlastIndex.t
- t/LocalDB/DBFasta.t
- t/LocalDB/DBQual.t
- t/LocalDB/Index.t
- t/LocalDB/transfac_pro.t
- t/LocalDB/SeqFeature_mysql.t (created by ./Build)
Modified Modules in the Bio::DB and Bio::Index Namespaces
The following modules have a DB_File or related dependency:
- Bio::DB::Fasta (uses AnyDBM_File)
- Bio::DB::FileCache
- Bio::DB::Flat::BDB
- Bio::DB::Flat::BinarySearch
- Bio::DB::Flat
- Bio::DB::GFF::Adaptor::berkeleydb::iterator
- Bio::DB::GFF::Adaptor::berkeleydb
- Bio::DB::GFF
- Bio::DB::Qual
- Bio::DB::SeqFeature::Store::bdb
- Bio::DB::SeqFeature::Store::berkeleydb
- Bio::DB::SeqFeature::Store::berkeleydb3
- Bio::DB::SeqFeature::Store::LoadHelper
- Bio::DB::Taxonomy::flatfile
- Bio::DB::TFBS::transfac_pro
- There are others.
Bio::DB::Flat has a BDB dependency that is very particular to its implementation and would not benefit from an AnyDBM_File-based conversion. The Bio::DB::SeqFeature::Store::DBI modules for MySQL and SQLite still rely on BDB-tied hashes for indexing. Other modules are there to provide a BDB option; these could benefit from a SQLite-based BDB emulation (as a workaround in the absence of BDB).
The following modules have been modified to make use of SQLite_File, either because they already use the AnyDBM_File system, or because DB_File was integrated into tied hash and array indexes. By converting these to AnyDBM_File, DB_File remains available as the default choice, but the modules remain functional on systems that don't support DB_File for whatever reason.
- Bio::DB::Fasta
- Bio::DB::FileCache
- Bio::DB::Qual
- Bio::DB::GFF::Adaptor::berkeleydb
- Bio::DB::SeqFeature::Store::LoadHelper
- Bio::DB::Taxonomy::flatfile
- Bio::DB::TFBS::transfac_pro
- Bio::Index::Abstract
Modifications were successful if all tests passed under both DB_File and SQLite_File.