Bioperl FAQ
-----------
v. 1.0.1

This FAQ maintained by:
* Jason Stajich
* Brian Osborne
* Heikki Lehvaslaiho

---------------------------------------------------------------------------
Contents
---------------------------------------------------------------------------

0. About this FAQ
   Q0.1: What is this FAQ?
   Q0.2: How is it maintained?

1. Bioperl in general
   Q1.1: What is Bioperl?
   Q1.2: Where do I go to get the latest release?
   Q1.3: What is the difference between 0.9.x and 0.7.x? What do you mean
         developer release?
   Q1.4: Is it BioPerl, bioperl, bio.perl.org, Bioperl? What's the deal?
   Q1.5: How do I figure out how to use a module?
   Q1.6: I'm interested in the bleeding edge version of the code, where
         can I get it?
   Q1.7: Who uses this toolkit?
   Q1.8: How should I cite Bioperl?
   Q1.9: What are the License terms for Bioperl?
   Q1.10: I want to help, where do I start?
   Q1.11: I've got an idea for a module how do I contribute it?

2. Sequences
   Q2.1: How do I parse a sequence file?
   Q2.2: I can't get sequences with Bio::DB::GenBank any more, why not?
   Q2.3: How can I get NT_ or NM_ accessions from NCBI (Reference
         Sequences)?
   Q2.4: How can I use SeqIO to parse sequence data from a string?

3. Report parsing
   Q3.1: I want to parse BLAST, how do I do this?
   Q3.2: What's wrong with Bio::Tools::Blast?
   Q3.3: I want to parse FastA or NCBI -m7 (XML) format, how do I do this?
   Q3.4: Let's say I want to do pairwise alignments of 2 sequences how can
         I do this?
   Q3.5: I'm using BPLite.pm and its frame() to parse Blast but I'm seeing
         0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am
         I seeing these different numbers and how do I get the frame
         according to Blast?

4. Utilities
   Q4.1: How do I find all the ORFs in a nucleotide sequence? Antigenic
         sites in a protein? Calculate nucleotide melting temperature?
         Find repeats?
   Q4.2: How do I do motif searches with Bioperl? Can I do "find all
         sequences that are 75% identical" to a given motif?
   Q4.3: Can I query MEDLINE or other bibliographic repositories using
         Bioperl?

---------------------------------------------------------------------------
0. About this FAQ
---------------------------------------------------------------------------

Q0.1: What is this FAQ?

    It is the list of Frequently Asked Questions about Bioperl.

Q0.2: How is it maintained?

    This FAQ was generated using a Perl script and an XML file. All the
    files are in the Bioperl distribution directory doc/faq. So do not
    edit this file! Edit file faq.xml and run:

        % faq.pl -text faq.xml

    The XML structure was originally used by the Perl XML project. Their
    website seems to have vanished, though. The XML and modifying scripts
    were copied from Michel Rodriguez's web site
    http://www.xmltwig.com/xmltwig/XML-Twig-FAQ.html and modified to our
    needs.

---------------------------------------------------------------------------
1. Bioperl in general
---------------------------------------------------------------------------

Q1.1: What is Bioperl?

    Bioperl is a toolkit of perl modules useful in building
    bioinformatics solutions in perl. It is built in an object-oriented
    manner so that many modules depend on each other to achieve a task.
    The collection of modules in the bioperl-live repository makes up the
    core functionality of bioperl. Additionally, auxiliary modules for
    creating graphical interfaces (bioperl-gui), persistent storage in an
    RDBMS (bioperl-db), and CORBA bridges to the BioCORBA
    (http://www.biocorba.org) specification (bioperl-corba-server and
    bioperl-corba-client) are all available as CVS modules in our
    repository.

Q1.2: Where do I go to get the latest release?

    You can always get our releases from ftp://bioperl.org/pub/DIST.
    Official releases will be noted on the website http://bioperl.org.

Q1.3: What is the difference between 0.9.x and 0.7.x? What do you mean
      developer release?

    The 0.7.X series (0.7.0, 0.7.2) were all released in 2001 and were
    stable releases on the 0.7 branch.
    This means they had a set of functionality that was maintained
    throughout (no experimental modules) and were guaranteed to pass all
    tests, and subsequent bug-fix releases with the 0.7 designation would
    not have any API changes.

    The 0.9.X series was our first attempt at releasing so-called
    developer releases. These are snapshots of the actively developed
    code that at a minimum pass all our tests.

    But really, you should be using version 1.*!

Q1.4: Is it BioPerl, bioperl, bio.perl.org, Bioperl? What's the deal?

    Well, the perl.org guys granted us use of bio.perl.org. We prefer to
    be called Bioperl or BioPerl (unlike our Biopython friends). We're
    part of the Open Bioinformatics Foundation (OBF) and so as part of
    the Bio{*} toolkits we prefer the Bioperl spelling. But we're not
    really all that picky so no worries.

Q1.5: How do I figure out how to use a module?

    Read the embedded perl documentation (Plain Old Documentation - POD)
    that is part of every module. Do:

        % perldoc MODULE

    (careful - spelling and case count!).

    The bioperl tutorial - bptutorial.pl - provided in the root directory
    of the bioperl release will also provide a good introduction. There
    are links to tutorials off the bioperl website that may provide some
    additional help.

    There are also many scripts in the examples/ and scripts/ directories
    that could be useful - see bioperl.pod for a brief description of all
    of them.

    Additionally, we have written many tests for our modules; you can see
    test data and example usage of the modules in these tests - look in
    the test dir (called 't').

Q1.6: I'm interested in the bleeding edge version of the code, where can
      I get it?

    Go to http://cvs.bioperl.org and you'll see instructions on how to
    get the CVS code. Basically:

        % cvs -d :pserver:cvs@cvs.bioperl.org:/home/repository/bioperl login

    (enter 'cvs' for the password)

        % cvs -d :pserver:cvs@cvs.bioperl.org:/home/repository/bioperl co bioperl_all

Q1.7: Who uses this toolkit?

    Lots of people.
    Users include the Sanger Centre, EBI, many large and small academic
    laboratories, and large and small pharmaceutical companies. All the
    developers on the bioperl list use the toolkit in some capacity on a
    regular basis. The Genquire annotation system
    (http://www.bioinformatics.org/Genquire/) and Ensembl
    (http://www.ensembl.org/) use bioperl as the basis for their
    implementation.

Q1.8: How should I cite Bioperl?

    For now, cite it as "The Bioperl Project, http://www.bioperl.org".

Q1.9: What are the License terms for Bioperl?

    Bioperl is licensed under the same terms as Perl itself, which is the
    Perl Artistic License. You can see more information on that license
    at http://www.perl.com/pub/a/language/misc/Artistic.html and
    http://www.opensource.org/licenses/artistic-license.html.

Q1.10: I want to help, where do I start?

    Bioperl is a pretty diverse collection of modules which has grown
    from the direct needs of the developers participating in the project.
    So if you don't have a need for a specific module in the toolkit, it
    is hard to describe how that module needs to be expanded or adapted.

    One area where help is always useful, however, is the development of
    stand-alone scripts which use bioperl components for common tasks.
    Some starting points for scripts: find out what people in your
    institution do routinely that a shortcut can be developed for.
    Identify modules in bioperl that need easy interfaces and write that
    wrapper - you'll learn how to use the module inside and out.

    We always need people to help fix bugs - check the jitterbug bug
    tracking system (webpage linked from the bioperl website sidebar
    under "Bugs").

Q1.11: I've got an idea for a module how do I contribute it?

    We suggest the following. Post your idea to the bioperl list. If it
    is a really new idea consider taking us through your thought process.
    We'll help you tease out the necessary information such as what
    methods you'll want and how it can interact with other bioperl
    modules.
    If it is a port of something you've already worked on, give us a
    summary of the current methods. Make sure there is an interface to
    the module, not just an implementation (see biodesign.pod for more
    info), and make sure there will be a set of tests in the t/ directory
    to ensure that your module is tested.

---------------------------------------------------------------------------
2. Sequences
---------------------------------------------------------------------------

Q2.1: How do I parse a sequence file?

    Use the Bio::SeqIO system. This will create Bio::Seq objects for you.
    See the tutorial bptutorial.pl for more information or the
    documentation for Bio::SeqIO (e.g. 'perldoc Bio::SeqIO').

Q2.2: I can't get sequences with Bio::DB::GenBank any more, why not?

    NCBI changed the web CGI script that provided this access. If you
    see this problem you must be using bioperl <= 0.7.2. The developer
    release 0.9.3 contains the fix, as does the 1.0 release.

Q2.3: How can I get NT_ or NM_ accessions from NCBI (Reference Sequences)?

    Use Bio::DB::RefSeq, not Bio::DB::GenBank, when you are retrieving
    the NM_ accessions. This is still an area of active development
    because the data providers have not provided the best interface for
    us to query. EBI has provided a mirror with their dbfetch system,
    which is accessible through the Bio::DB::RefSeq object; however,
    there are cases where NT_ accessions will not be retrievable.

Q2.4: How can I use SeqIO to parse sequence data from a string?

        use IO::String;
        use Bio::SeqIO;

        # $string holds the sequence data, e.g. one or more FASTA records
        my $stringfh = IO::String->new($string);
        my $seqio    = Bio::SeqIO->new(-fh     => $stringfh,
                                       -format => 'fasta');
        while ( my $seq = $seqio->next_seq ) {
            # process each seq
        }

---------------------------------------------------------------------------
3. Report parsing
---------------------------------------------------------------------------

Q3.1: I want to parse BLAST, how do I do this?

    Well, you might notice that there are a lot of choices. Sorry about
    that.
    We've been evolving towards a single solution. Currently the best way
    to parse a report is to use the SearchIO system. This supports blast
    and fasta report parsing. The bptutorial provides an example of how
    to use this system, as does the documentation for Bio::SearchIO.

Q3.2: What's wrong with Bio::Tools::Blast?

    Nothing is really wrong with it; it has just been outgrown by a more
    generic approach to reports. This generic approach allows us to just
    write pluggable modules for fasta and Blast parsing while using the
    same framework. This is completely analogous to the Bio::SeqIO system
    of parsing sequence files. However, the objects produced are of the
    Bio::Search rather than Bio::Seq variety.

Q3.3: I want to parse FastA or NCBI -m7 (XML) format, how do I do this?

    It is as simple as parsing text BLAST results - you simply need to
    specify the format as "fasta" or "blastxml" and the parser will load
    the appropriate module for you. You can use the exact same logic and
    code for all of these formats, as we have generalized the modules for
    sequence database searching.

Q3.4: Let's say I want to do pairwise alignments of 2 sequences how can I
      do this?

    See Bio::Factory::EMBOSS for how to use the 'water' and 'needle'
    alignment programs that are part of the EMBOSS suite.

    Additionally, you can use the pSW module that is part of the
    bioperl-ext package (distributed separately at
    ftp://bioperl.org/pub/DIST). However, note that this only does
    protein alignments and is no longer a supported module. Instead, the
    EMBOSS implementation is the best path ahead unless someone else
    wants to provide an Inline::C implementation.

Q3.5: I'm using BPLite.pm and its frame() to parse Blast but I'm seeing
      0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am I
      seeing these different numbers and how do I get the frame according
      to Blast?
    These are GFF frames - so +1 is 0 in GFF, and -3 will be encoded with
    a frame of 2 with the strand being set to -1 (for more on GFF see
    http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml).

    Frames are relative to the hit or query sequence, so you need to
    query it based on the sequence you are interested in:

        $hsp->hit->strand();
        $hsp->hit->frame();

    or

        $hsp->query->strand();
        $hsp->query->frame();

    So the value according to a blast report of -3 can be constructed as

        my $blastvalue = ($hsp->query->frame + 1) * $hsp->query->strand;

---------------------------------------------------------------------------
4. Utilities
---------------------------------------------------------------------------

Q4.1: How do I find all the ORFs in a nucleotide sequence? Antigenic
      sites in a protein? Calculate nucleotide melting temperature? Find
      repeats?

    In fact, none of these functions are built into Bioperl, but they are
    all available in the EMBOSS package (http://www.emboss.org/), as are
    many others. The Bioperl developers created a simple interface to
    EMBOSS such that any and all EMBOSS programs can be run from within
    Bioperl. See Bio::Factory::EMBOSS for more information.

    If you can't find the functionality you want in Bioperl then make
    sure to look for it in EMBOSS; its programs integrate quite
    gracefully with Bioperl. Of course, you will have to install EMBOSS
    to get this access.

    In addition, Bioperl after version 1.0.1 contains the Pise/Bioperl
    modules. The Pise package (http://www-alt.pasteur.fr/~letondal/Pise)
    was designed to provide a uniform interface to bioinformatics
    applications, and currently provides wrappers to more than 250 such
    applications! Included amongst these wrapped apps are HMMER, Phylip,
    BLAST, GENSCAN, even the EMBOSS suite. Use of the Pise/Bioperl
    modules does not require installation of the Pise package.

Q4.2: How do I do motif searches with Bioperl? Can I do "find all
      sequences that are 75% identical" to a given motif?
    There are a number of approaches inside and outside of Bioperl.
    Within Bioperl take a look at Bio::Tools::SeqPattern, but it's also
    conceivable that the combination of Bioperl and Perl's regular
    expressions could do the trick. You might also consider the CPAN
    module String::Approx (this module addresses the percent match
    query). Or, take a look at the TFBS package, at
    http://forkhead.cgb.ki.se/TFBS (Transcription Factor Binding Site).
    This Bioperl-compliant package specializes in pattern searching of
    nucleotide sequence using matrices. Finally, you could use EMBOSS, as
    discussed in the previous question (or you could use Pise to run
    EMBOSS applications). The relevant programs would be fuzzpro or
    fuzznuc.

Q4.3: Can I query MEDLINE or other bibliographic repositories using
      Bioperl?

    Yes! The solution lies in Bio::Biblio*, a set of modules that provide
    access to MEDLINE and OpenBQS-compliant servers using SOAP. See
    Bio/Biblio.pm or examples/biblio.pl for details and example code.

---------------------------------------------------------------------------
Copyright (c)2002 Open Bioinformatics Foundation. You may distribute this
FAQ under the same terms as perl itself.