HOWTO:Local Databases
Contents |
Authors
Brian Osborne, Peter Schattner.
Abstract
This is a HOWTO that talks about using Bioperl to create local sequence databases for fast retrieval.
Introduction
Bioperl offers many ways to retrieve sequences from online databases like NCBI and Swissprot but there may be times when you want build local databases for fast, secure retrieval. This HOWTO discusses the different Bioperl modules you might use. Also see HOWTO:OBDA Flat databases.
Bio::Index
The following sequence data formats are supported by Bio::Index: Bio::Index::Swissprot, Bio::Index::SwissPfam, Bio::Index::EMBL, Bio::Index::Blast, Bio::Index::BlastTable, Bio::Index::Fastq, Bio::Index::Qual, Bio::Index::Hmmer, Bio::Index::Stockholm, Bio::Index::GenBank, and Bio::Index::Fasta. Once the set of sequences have been indexed using Bio::Index*, individual sequences can be accessed using syntax very similar to that used for accessing remote databases.
For example, if one wants to set up an indexed flat-file database of fasta files one could write a script like using Bio::Index::Fasta:
# script 1: create the index use Bio::Index::Fasta; # using fasta file format use strict; # some users have reported that this is necessary my $Index_File_Name = shift; my $inx = Bio::Index::Fasta->new( -filename => $Index_File_Name, -write_flag => 1); $inx->make_index(@sequence_files);
This script then retrieves sequences:
# script 2: retrieve some files use Bio::Index::Fasta; use strict; # some users have reported that this is necessary my $Index_File_Name = shift; my $inx = Bio::Index::Fasta->new($Index_File_Name); foreach my $id (@ARGV) { my $seq = $inx->fetch($id); # Returns Bio::Seq object # do something with the sequence }
To facilitate the creation and use of more complex or flexible indexing systems, the Bioperl distribution includes two sample scripts in the scripts/index directory, bp_index.PLS and bp_fetch.PLS. These scripts can be used as templates to develop customized local data-file indexing systems.
Bio::DB::Fasta
Bioperl also supplies Bio::DB::Fasta as a means to index and query Fasta format files. It's similar in spirit to Bio::Index::Fasta but offers more methods, e.g.
use Bio::DB::Fasta; use strict; my $db = Bio::DB::Fasta->new($file); # one file or many files my $seqstring = $db->seq($id); # get a sequence as string my $seqobj = $db->get_Seq_by_id($id); # get a PrimarySeq obj my $desc = $db->header($id); # get the header, or description line
See Bio::DB::Fasta for more information on this fully-featured module.
Indexing using a specific substring
Both modules also offer the user the ability to designate a specific string within the fasta header as the desired id, such as the gi number within the string "gi|4556644|gb|X45555". Consider the following fasta-formatted sequence, in "test.fa":
>gi|523232|emb|AAC12345|sp|D12567 titin fragment MHRHHRTGYSAAYGPLKJHGYVHFIMCVVVSWWASDVVTYIPLLLNNSSAGWKRWWWIIFGGE GHGHHRTYSALWWPPLKJHGSKHFILCVKVSWLAKKERTYIPKKILLMMGGWWAAWWWI
By default Bio::Index::Fasta and Bio::DB::Fasta will use the first "word" they encounter in the fasta header as the retrieval key, in this case "gi|523232|emb|AAC12345|sp|D12567". What would be more useful as a key would be a single id. The code below will index the "test.fa" file and create an index file called "test.fa.idx" where the keys are the Swissprot, or "sp", identifiers.
$ENV{BIOPERL_INDEX_TYPE} = "SDBM_File"; # look for the index in the current directory $ENV{BIOPERL_INDEX} = "."; my $file_name = "test.fa"; my $inx = Bio::Index::Fasta->new( -filename => $file_name . ".idx", -write_flag => 1 ); # pass a reference to the critical function to the Bio::Index object $inx->id_parser(\&get_id); # make the index $inx->make_index($file_name); # here is where the retrieval key is specified sub get_id { my $header = shift; $header =~ /^>.*\bsp\|([A-Z]\d{5}\b)/; $1; }
Here is how you would retrieve the sequence, as a Bio::Seq object:
my $seq = $inx->fetch("D12567"); print $seq->seq;
What if you wanted to retrieve a sequence using either a Swissprot id or a gi number and the fasta header was actually a concatenation of headers with multiple gi's and Swissprots?
>gi|523232|emb|AAC12345|sp|D12567|gi|7744242|sp|V11223 titin fragment
Modify the function that's passed to the id_parser() method:
sub get_id { my $header = shift; my (@sps) = $header =~ /^>.*\bsp\|([A-Z]\d{5})\b/g; my (@gis) = $header =~ /gi\|(\d+)\b/g; return (@sps,@gis); }
The Bio::DB::Fasta module uses the same principle, but the syntax is slightly different, for example:
my $db = Bio::DB::Fasta->new('test.fa', -makeid=>\&make_my_id); my $seqobj = $db->get_Seq_by_id($id); sub make_my_id { my $description_line = shift; $description_line =~ /gi\|(\d+)\|emb\|(\w+)/; ($1,$2); }
Storing sequences in a relational database
The core Bioperl package does not support accessing sequences and data stored in relational databases but this capability is available in the Bioperl-db package.