Core 1.5.2 new features

The main new features in 1.5.2 are described in extended detail here.

Build.PL installation system

Previously, Bioperl was installed using Makefile.PL, which relies on ExtUtils::MakeMaker. Version 1.5.2 continues to use (an improved version of) Makefile.PL, but also introduces Build.PL as a recommended alternative. Build.PL is implemented using a subclass of Module::Build. Benefits include:

  • Man and html documentation are now created and installed
  • Bundle::Bioperl is no longer required; you can now interactively choose which (if any, or all) of the optional external dependencies you would like to install during the Bioperl installation process
  • If a particular optional dependency fails to install, Bioperl itself will still install fine

Bio::Map overhaul

Overhauled to allow relative positions and multiple maps and positions per marker. More details to follow...

Species, taxonomy overhaul

Changes have been made to transition Bio::Species over to Bio::Taxon. A Bio::Taxon is as usable with a database connection as it is without one.

Bio::DB::Taxonomy, ::*

API-CHANGES

  • get_Taxonomy_Node() has been renamed get_taxon(). get_Taxonomy_Node() remains as a synonym of get_taxon(), but will eventually be deprecated.
  • New methods ancestor() and each_Descendent() correspond to similar methods in Bio::Taxon and Bio::Tree::NodeI, removing the need to store parent_id on each Taxon.
  • New internal method _handle_internal_id(). See Implementation changes below.

Implementation changes

  • Normally, when you create a Bio::Taxon it automatically receives a new unique internal id. However, when you request the same Taxon from a database more than once, you always get an object with the same internal id. This allows get_lca() to work, and lets you modify one copy of a returned object but still compare it to another copy and see that they are the same taxon. This even applies across different databases. The Taxon objects returned will still have different memory locations.
  • The scientific name field in the database isn't touched except for s/\(class\)$// (with the truly untouched name stored as a common name). Previously, a species node would have its name converted from 'Homo sapiens' to 'sapiens', but the conversion badly mangled certain other species names.
  • All common names of a taxon are now stored in the resulting Taxon object. This means that the Genbank common name is now just one amongst others, and isn't guaranteed to be the first in the list either.

Bio::DB::Taxonomy::flatfile

BUG-FIXES

  • Removed invalid requirement that all species nodes have at least 7 named-rank parents.
  • The names->id solution used by get_taxonid() only stored the last id associated with a name. However, the name used wasn't necessarily unique, so multiple ids could match. The names->id solution now remembers all ids that match a name.

API-CHANGES

  • get_Children_Taxids is deprecated - method no longer part of the DB::Taxonomy interface, and superseded by each_Descendent (which is actually implemented by all databases).
  • Renamed get_taxonid() to get_taxonids() and it returns an array of ids in list context. For backward compatibility it returns one of the ids in scalar context, and get_taxonid remains as a synonym of get_taxonids.
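
The context-sensitive return described above can be sketched in plain Perl with wantarray. This is a standalone illustration, not the BioPerl implementation: the %ids_for_name index, its names and its ids are made up, and returning the first id in scalar context is an arbitrary choice for the sketch (the notes only promise "one of the ids").

```perl
use strict;
use warnings;

# Hypothetical in-memory names->ids index (names and ids are made up).
my %ids_for_name = (
    'Examplea' => [101, 202],
    'Uniqueus' => [303],
);

sub get_taxonids {
    my ($name) = @_;
    my @ids = @{ $ids_for_name{$name} || [] };
    # List context: every matching id. Scalar context: just one of them
    # (the first here, which is an arbitrary choice for this sketch).
    return wantarray ? @ids : $ids[0];
}

# Backward-compatible synonym that shares the same code.
*get_taxonid = \&get_taxonids;

my @all = get_taxonids('Examplea');   # list context
my $one = get_taxonid('Examplea');    # scalar context, old method name
print "all: @all\n";
print "one: $one\n";
```

Old callers that assign the result to a scalar keep working, while new callers can collect every match by assigning to an array.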

Implementation changes

  • No longer includes the fake root node 'root'; there are multiple roots now (10239, 12884, 12908, 29384 and 131567). This means when getting the lineage you no longer have to remove the root node. This is now consistent with the results possible with entrez. NB: You have to delete your current indexes before you will notice the change.
  • Like Bio::DB::Taxonomy::entrez, flatfile now retrieves and stores the common names, genetic code and mitochondrial genetic code in each node it makes.
    • Note: entrez also stores creation, publication and update dates, but this data is not available in the taxdump from NCBI ftp site.
    • Note: the common names are stored in no particular order; the genbank common name in particular isn't necessarily the first in the list (cf. old entrez.pm behaviour).
  • Previously stored the division within the nodes it makes as a three-letter code, like 'PRI'. However, for consistency with entrez and with the scientific_name() of the node the division is supposed to correspond to, it is now stored as the full name, like 'Primates'.
  • The names->id solution also stores the artificially uniqued names like 'Craniata <chordata>', allowing you for the first time to retrieve the correct id. Previously the search would have simply failed completely.
  • The names->id solution now handles nodes with scientific names of 'xyz (class)', allowing you to retrieve the id with both get_taxonids('xyz') and get_taxonids('xyz (class)'). Previously only the latter would work.
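
As a rough illustration of the names->id behaviour described in this section, the plain-Perl sketch below indexes each record under several forms of its name. It is not the flatfile module's code: the record layout, the ids, the 'Xyz (class)' entry and the ids_for() helper are all hypothetical, and treating the plain name 'Craniata' as matching both uniqued entries is an assumption based on the "remembers all ids that match a name" fix.

```perl
use strict;
use warnings;

# Hypothetical records; ids are made up. 'unique' models an artificially
# uniqued name and is empty when the plain name is already unique.
my @records = (
    { id => 11, name => 'Craniata',    unique => 'Craniata <chordata>' },
    { id => 22, name => 'Craniata',    unique => 'Craniata <brachiopod>' },
    { id => 33, name => 'Xyz (class)', unique => '' },
);

my %ids_for_name;
for my $rec (@records) {
    # Collect every form of the name this record should be findable under.
    my %forms = ($rec->{name} => 1);
    $forms{ $rec->{unique} } = 1 if length $rec->{unique};
    (my $plain = $rec->{name}) =~ s/\s*\(class\)$//;
    $forms{$plain} = 1;
    # Remember all ids per form instead of overwriting earlier ones.
    push @{ $ids_for_name{$_} }, $rec->{id} for keys %forms;
}

sub ids_for {
    my ($name) = @_;
    return @{ $ids_for_name{$name} || [] };
}

print join(',', ids_for('Craniata')), "\n";
print join(',', ids_for('Craniata <chordata>')), "\n";
print join(',', ids_for('Xyz')), "\n";
print join(',', ids_for('Xyz (class)')), "\n";
```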

Bio::DB::Taxonomy::entrez

BUG-FIXES

  • Special characters like ", ( and ) in the input query string to get_taxonid() caused the search to fail or return inaccurate results. These characters are now removed prior to submission, allowing for correct search results.

API-CHANGES

  • entrez has always been able to return multiple ids that match a single input name, so I've renamed get_taxonid() to get_taxonids() and it returns an array of ids in list context. It returns one of the ids in scalar context. For backward compatibility, get_taxonid remains a synonym of get_taxonids.
  • get_node has new option -full that tells it to retrieve full details on a taxon from the website. (Otherwise, it may return a Taxon with minimal information if only minimal information had previously been cached.)

Implementation changes

  • Caches the data it gets from the website and tries to minimise the number of website accesses it does.
  • Now throws on failure to retrieve data from website.
  • get_taxonids() now copes with queries with '<something>' like 'Craniata <chordata>'.

Bio::DB::Taxonomy::list

NEW

An implementation of Bio::DB::Taxonomy that accepts lists of words to build a database. Used especially by Bio::Species for backward compatibility purposes, but also useful generally to quickly and easily create a lineage of Bio::Taxon objects or a Tree.

Bio::Tree::TreeI

BUG-FIXES

number_nodes() returned the number of descendants belonging to the root node, but forgot to count the root node itself. Now number_nodes() == scalar(get_nodes()).
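
The invariant can be sketched with a toy tree in plain Perl. This is not the Bio::Tree::TreeI code: the nested-hash tree and both subs are hypothetical stand-ins that only show why counting the list returned by a get_nodes-style traversal includes the root.

```perl
use strict;
use warnings;

# Hypothetical toy tree: each node is a hash ref with a name and children.
my $tree = {
    name     => 'root',
    children => [
        { name => 'A', children => [ { name => 'A1', children => [] } ] },
        { name => 'B', children => [] },
    ],
};

# All nodes in the tree, the starting node included.
sub get_nodes {
    my ($node) = @_;
    return ($node, map { get_nodes($_) } @{ $node->{children} });
}

# Counting that same list keeps the two methods in agreement.
sub number_nodes {
    my ($root) = @_;
    my @nodes = get_nodes($root);
    return scalar @nodes;
}

print number_nodes($tree), "\n";   # root + A + A1 + B
```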

Bio::Tree::Tree

API-CHANGES

Added a -node option to new(), which calls get_lineage_nodes() on the supplied NodeI and sets the tree root that way. This lets you easily make a tree from a Bio::Taxon. To stop a Tree made from a Bio::Taxon with a db_handle from pulling in the entire database, ancestor() / add_Descendent() is set for each member of the lineage while the root is being found from the -node. As a result, the database is no longer asked for the ancestors or descendants of those taxa.

Bio::Tree::TreeFunctionsI

API-CHANGES

  • New method get_lineage_nodes(). Returns all the ancestors of a particular node, up to the tree's defined root node.
  • get_lca() can now also accept just a list of nodes, and also more than 2 nodes.
  • Removed _check_two_nodes() since no longer necessary.
  • New method splice(). Removes the requested nodes from a tree, making each removed node's ancestor the new ancestor of that node's descendants (ie. removes nodes without making the tree fall apart).
  • New method contract_linear_paths(). Splices out all nodes in the tree that have an ancestor and only one descendant.
  • New method merge_lineage(). Merges a lineage of nodes with an existing Tree.
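
The idea behind splice() and contract_linear_paths() can be sketched on a toy tree in plain Perl. This is an illustration only, not the TreeFunctionsI implementation: the child => parent hash and the subs children_of(), splice_node() and contract_linear_paths() are hypothetical, and the sketch refuses to splice the root.

```perl
use strict;
use warnings;

# Hypothetical toy tree stored as child => parent; 'A' is the root.
my %parent_of = (B => 'A', C => 'B', D => 'C', E => 'C');

sub children_of {
    my ($node) = @_;
    my @children = grep { $parent_of{$_} eq $node } keys %parent_of;
    return @children;
}

# Remove a non-root node and hand its children to its former ancestor.
sub splice_node {
    my ($node) = @_;
    my $ancestor = $parent_of{$node};
    die "this sketch does not splice the root\n" unless defined $ancestor;
    my @children = children_of($node);
    delete $parent_of{$node};
    $parent_of{$_} = $ancestor for @children;
    return;
}

# Splice out every node that has an ancestor and exactly one descendant.
sub contract_linear_paths {
    my @linear = grep { my @kids = children_of($_); @kids == 1 } keys %parent_of;
    splice_node($_) for @linear;
    return;
}

contract_linear_paths();   # 'B' sits on a linear path A -> B -> C
print join(', ', map { "$_ => $parent_of{$_}" } sort keys %parent_of), "\n";
```

Here 'C' ends up attached directly to 'A', while 'D' and 'E' stay attached to 'C', so the tree stays connected after the removal.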

Implementation changes

  • get_lca() now uses get_lineage_nodes() and is a correct implementation; previously it was not guaranteed to give the correct answer. It can get the lca of more than 2 nodes.
  • reroot() uses get_lineage_nodes().
  • Methods distance(), is_monophyletic() and is_paraphyletic() reimplemented with the new get_lca().
  • find_node() no longer warns about an unknown search type (allowing you to search on -rank and any other thing in the future).
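
One way to build a lowest common ancestor on top of lineages, for any number of nodes, is sketched below in plain Perl. It is not the BioPerl code: the child => parent hash and the subs lineage_nodes() and lca() are hypothetical, and counting a node as its own ancestor when it lies on another node's lineage is a convention of this sketch.

```perl
use strict;
use warnings;

# Hypothetical toy tree stored as child => parent; the root has no entry.
my %parent_of = (
    Hominidae => 'Primates',
    Homo      => 'Hominidae',
    Pan       => 'Hominidae',
    Pongo     => 'Hominidae',
    Macaca    => 'Primates',
);

# Ancestors of a node, ordered from the root down to the direct parent.
sub lineage_nodes {
    my ($node) = @_;
    my @lineage;
    while (defined(my $parent = $parent_of{$node})) {
        unshift @lineage, $parent;
        $node = $parent;
    }
    return @lineage;
}

# Lowest common ancestor of any number of nodes: walk the root-to-node
# paths in parallel and keep the last position where they all agree.
sub lca {
    my @nodes = @_;
    my @paths = map { [ lineage_nodes($_), $_ ] } @nodes;
    my $lca;
    for (my $i = 0; ; $i++) {
        my $candidate = $paths[0][$i];
        last unless defined $candidate;
        for my $path (@paths) {
            return $lca unless defined $path->[$i] && $path->[$i] eq $candidate;
        }
        $lca = $candidate;
    }
    return $lca;
}

print lca('Homo', 'Pan'), "\n";
print lca('Homo', 'Pan', 'Macaca'), "\n";
```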

Bio::Tools::Phylo::PAML

Implementation changes

  • Methods that make use of get_lca() reimplemented with the new get_lca() in mind. This change was not required, but is advantageous.
  • Support for PAML 3.15

Bio::Tree::Node

Implementation changes

When the ancestor is changed, ancestor() now correctly removes the node from its previous ancestor's descendants and adds it to the new ancestor's descendants.

t/Node.t

Added tests for setting ancestor()

Bio::Taxonomy::Node

DEPRECATED

(name change) isa Bio::Taxon

Implementation changes

No code; delegates to Bio::Taxon

Bio::Taxon

NEW

(name change from Bio::Taxonomy::Node) Changes below relate to changes to Bio::Taxonomy::Node

API-CHANGES

  • Removed the following options from new(): -classification, -sub_species, -variant and -organelle. The corresponding methods are no longer present.
  • New option to new(): -id. For Tree::Node compatibility. -object_id and -ncbi_taxid are no longer mentioned in docs but still work.
  • The -dbh option to new() no longer defaults to any database. A Bio::Taxon is now fully usable without ever setting a database handle.
  • Removed the methods binomial(), species(), genus(), sub_species(), variant(), classification() and show_all(). It is not appropriate to have rank-specific methods in a class that models any single rank, and definitely not appropriate to store information about other taxa in a Taxon. These questions can be answered using Tree* methods, or with Bio::Species.
  • Removed method organelle(). Organelle isn't part of a taxonomy. Other modules like SeqIO should have their own storage of organelle information as necessary (but Bio::Species retains organelle() in the meantime).
  • Removed methods get_Lineage_Nodes() and get_LCA_Node(). For these kinds of methods you should now use Bio::Tree::TreeFunctionsI methods.
  • You can no longer set parent_id(). The id of your parent is determined by the Taxon that is your ancestor. Setting it is no longer needed (previously it was central to the workings of the object), so doing so is now deprecated. The method issues a warning if you try to set its value.
  • get_Parent_Node() eventually to be deprecated, is now a synonym of new method ancestor(). (For Tree::Node compatibility.)
  • get_Children_Nodes() eventually to be deprecated, is now a synonym of new method each_Descendent(). (For Tree::Node compatibility.)
  • object_id() eventually to be deprecated, is now a synonym of new method id(). (For Tree::Node compatibility.)

Implementation changes

  • Is now also a Bio::Tree::Node.
  • node_name() used to be an alias to name('common'). Now it is an alias to name('scientific').
    • Note: node_name is what is set when ->new(-name => $name) is called, so database and user-created taxa now implicitly associate the name of the taxon they create with its scientific name.
  • scientific_name() used to be an alias to binomial(). Now it is an alias of node_name().
  • parent_taxon_id() is now a direct synonym of parent_id(). (Previously, you could assign and retrieve different values to/from each method.)
  • New method common_names() supersedes common_name(), returning a list of all common_names. For backward compatibility, returns one of the names in scalar context, and common_name remains a synonym of common_names.
  • -factory option to new() and factory() method removed, since Bio::Taxonomy::FactoryI is deprecated and was never used.
  • Removed methods validate_name() and validate_species_name().
  • division() was implemented via $self->name('division',@_). Now name('division') will only allow one value to be set, and division() only ever returns a single scalar or undef, never an array.
  • common_names() returns the last common_name in scalar context (instead of first), so set/get/set/get works as expected with common_name().
  • db_handle() similar to before when getting, but now setting the handle will locate $self in the new database (by id or name) and merge data (eg. if rank was 'no rank' and new database node has rank 'species', $self->rank() will become 'species').
  • get_Parent_Node() (now ancestor()) and get_Children_Nodes() (now each_Descendent()) use the Bio::Tree::Node implementation. ancestor() falls back to asking the database for the ancestor if one has not been manually set by the user. each_Descendent() does NOT fall back to the database, which prevents the whole database being pulled into a Tree object made with a Bio::Taxon.
  • parent_id() now gets the ancestor Taxon with ancestor() and returns $ancestor->id().
  • Had to remove the clean up methods from Bio::Tree::Node since they were in a CODE ref, preventing Bio::Species objects from being frozen with Storable. Will come up with a better solution in the future.

t/Taxonomy.t

This is the main test file for Bio::Taxon and related things.

  • Runs a slightly more comprehensive set of tests on entrez, which are now only skipped if data retrieval fails.
  • Tests flatfile on a cut-down version of the taxdump.

Bio::Taxonomy

DEPRECATED

Redundant

Bio::Taxonomy::Taxon

DEPRECATED

Redundant

Bio::Taxonomy::Tree

DEPRECATED

Redundant

Bio::Taxonomy::FactoryI

DEPRECATED

Redundant

Bio::Species

Implementation changes

Bio::Species isa Bio::Taxon.

  • No method uses validate_species_name() any more (but the method remains unaltered, as does validate_name(), which just returns 1; no change).
  • classification() set implemented as: Set db_handle() to a new Bio::DB::Taxonomy::list with the supplied classification array and make a Bio::Tree::Tree of self, stored in self. Getting the classification implemented as: Return the scientific_name() of each Taxon returned by our tree->get_lineage_nodes.
  • Methods ncbi_taxid(), division() and common_name() implemented by Taxon.
  • Methods species(), genus(), subspecies() and variant() no longer get/set elements in the classification array or store direct values. They are implemented like this: ask our tree for the taxon with rank() eq method name and get the scientific_name of that. Otherwise, for methods species() and genus(), assume we are rank() 'species', our parent taxon is rank() 'genus', and try again. For subspecies() and variant(), fall back to the old implementation (store data directly on self). Since species() is purely there for backwards compatibility, it now munges the species name in the same way that Bio::DB::Taxonomy modules used to (and that some Bio::SeqIO modules used to/still do). So it will return 'sapiens', not 'Homo sapiens', as before.
  • binomial() prefers to simply return scientific_name() if we are a Taxon with rank() 'species' and the scientific_name is at least a 2 word scalar. It interprets the 'FULL' option as wanting the trinomial name and prefers to simply return scientific_name() if we have rank() 'subspecies' or 'variant' and at least 3 word scalar. Failing these two cases, it falls back on the old implementation (build 'genus species' from the classification), but with a little more intelligence to try and not duplicate names.
  • Stores a Bio::Tree::Tree on itself. Had to remove the tree's clean up methods since they were in a CODE ref, preventing us from being frozen with Storable. Will come up with a better solution in the future.

Bio::SeqIO::*

A number of these modules make use of Bio::Species when parsing taxonomic information. They probably all have, or had, problems. Only genbank and swiss have been updated to correctly parse taxonomic information under the new system; the others need to be properly tested to see whether taxonomic data they read in can be written out again identically to the input file. It is probably the case that some currently fail at this.

Bio::SeqIO::bsml_sax

BUG-FIXES

It used to include the genus twice in the classification array of Bio::Species object. Now it doesn't.

Bio::SeqIO::embl

BUG-FIXES

When the OC lines include the species name, the Bio::Species classification array included the true species name as a rank above genus and the real genus duplicated as a rank above that. Now it doesn't.

Implementation changes

Uses a brand-new way of parsing the taxonomic information, mostly shared with genbank and swiss.

Bio::SeqIO::genbank

BUG-FIXES

  • Now that Bio::Species isa Bio::Taxon, it is possible to ensure that output of input matches the input (in the SOURCE and ORGANISM lines at least). Usage of Bio::Species re-implemented to get all tests in t/genbank.t to pass.
  • Fixed handling of organism lines split over 2 lines and creation of classification array when multi-word name is split over 2 lines.

Implementation changes

Uses a brand-new way of parsing the taxonomic information, mostly shared with embl and swiss.

t/genbank.t

Now checks taxonomic information parsing more carefully to ensure that output of input matches the original input file.

Bio::SeqIO::swiss

Implementation changes

Uses a brand-new way of parsing the taxonomic information, mostly shared with embl and genbank.

scripts/taxa/taxonomy2tree.PLS

  • Added some extra options to define the location of the database indexes and files, or use the entrez on-line database instead. (Note how entrez and flatfile are now truly interchangeable.)
  • Reimplemented using the new Bio::Taxon system. Now much simpler. You also get the correct answer, eg. instead of (("Pongo pygmaeus",(Gorilla,"Pan troglodytes","Homo sapiens")"Homo/Pan/Gorilla group")Hominidae)root; you now get (("Pongo pygmaeus",(Gorilla,"Pan troglodytes","Homo sapiens")"Homo/Pan/Gorilla group")Hominidae)"cellular organisms";

Bio::SearchIO overhaul

These changes are related to speeding up Bio::SearchIO modules so that, for example, BLAST result parsing is quicker and more efficient.

Bio::Search::HSP::GenericHSP

Implementation changes

  • Call to new() now calls no methods of its own; no work is done simply to create a GenericHSP. The code that was previously in new() has been moved to private methods that are called just-in-time, as the user desires to know certain information by manually calling HSPI methods.
  • Added new options to new() -hit_desc and -query_desc for setting the description text for the sequences.
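
The just-in-time approach can be sketched in plain Perl as follows. LazyHSP is a hypothetical stand-in, not GenericHSP; the percent_identity() method, its formula and the $computations counter exist only to make the deferred work visible.

```perl
use strict;
use warnings;

package LazyHSP;   # hypothetical stand-in, not Bio::Search::HSP::GenericHSP

our $computations = 0;   # counts how often derived data is actually computed

sub new {
    my ($class, %args) = @_;
    # Store the raw arguments only; derive nothing yet.
    return bless { raw => {%args} }, $class;
}

sub percent_identity {
    my $self = shift;
    # Computed on first request, then cached on the object.
    unless (exists $self->{percent_identity}) {
        $computations++;
        my $raw = $self->{raw};
        $self->{percent_identity} = 100 * $raw->{identical} / $raw->{hsp_length};
    }
    return $self->{percent_identity};
}

package main;

my @hsps = map { LazyHSP->new(identical => 45, hsp_length => 50) } 1 .. 1000;
my $after_construction = $LazyHSP::computations;

my $pid = $hsps[0]->percent_identity;
$hsps[0]->percent_identity;   # second call is served from the cache

print "computed during construction: $after_construction\n";
print "percent identity: $pid\n";
print "computed after two calls: $LazyHSP::computations\n";
```

Creating a thousand objects costs no derived-value work at all; only the one object that is actually queried pays for the calculation, and only once.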

Bio::Search::Hit::GenericHit

API-CHANGES

  • new() has extra option -hsp_factory.
  • New method hsp_factory() which gets/sets a Bio::Factory::ObjectFactoryI.
  • add_hsp() can now accept a hash ref instead of just a HSPI.

Implementation changes

  • next_hsp() and hsps() convert hash ref hsp data to HSPI objects using the hsp_factory() as necessary.
  • num_hsps() claimed to throw if there were no HSPs, but returns '-'. Updated docs.
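
Deferred conversion of hash refs into objects can be sketched in plain Perl like this. ToyHit and ToyHSP are hypothetical stand-ins, and the factory is a simple code ref here rather than a Bio::Factory::ObjectFactoryI implementation.

```perl
use strict;
use warnings;

package ToyHSP;    # hypothetical stand-in for an HSPI implementation

sub new   { my ($class, %args) = @_; return bless {%args}, $class }
sub score { $_[0]{score} }

package ToyHit;    # hypothetical stand-in, not Bio::Search::Hit::GenericHit

sub new {
    my ($class, %args) = @_;
    return bless { hsps => [], hsp_factory => $args{hsp_factory} }, $class;
}

# Accepts either a ready-made object or a plain hash ref of its data.
sub add_hsp {
    my ($self, $hsp) = @_;
    push @{ $self->{hsps} }, $hsp;
    return scalar @{ $self->{hsps} };
}

# Hash refs are upgraded to objects in place, only when HSPs are requested.
sub hsps {
    my $self = shift;
    for my $hsp (@{ $self->{hsps} }) {
        $hsp = $self->{hsp_factory}->(%$hsp) if ref $hsp eq 'HASH';
    }
    return @{ $self->{hsps} };
}

package main;

my $hit = ToyHit->new(hsp_factory => sub { ToyHSP->new(@_) });
$hit->add_hsp({ score => 120 });            # raw data, no object built yet
$hit->add_hsp(ToyHSP->new(score => 80));    # a ready-made object also works

my $before = ref $hit->{hsps}[0];
my @scores = map { $_->score } $hit->hsps;
my $after  = ref $hit->{hsps}[0];

print "stored as: $before\n";
print "scores: @scores\n";
print "now stored as: $after\n";
```

A parser can therefore hand over cheap hash refs for every HSP, and object construction is only paid for on results the user actually inspects.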

Bio::Search::Result::GenericResult

API-CHANGES

  • new() has extra option -hit_factory.
  • New method hit_factory() which gets/sets a Bio::Factory::ObjectFactoryI.
  • add_hit() can now accept a hash ref instead of just a HitI.

Implementation changes

  • next_hit() and hits() convert hash ref hit data to HitI objects using the hit_factory() as necessary.
  • statistic and parameter-related methods now all correctly deal with the statistic and parameter objects using their methods, not direct data structure access.

Bio::Search::Iteration::GenericIteration

API-CHANGES

  • new() has extra option -hit_factory.
  • New method hit_factory() which gets/sets a Bio::Factory::ObjectFactoryI.
  • add_hit() can now accept a hash ref instead of just a HitI.

Implementation changes

Various methods convert hash ref hit data to HitI objects using the hit_factory() as necessary.

Bio::SearchIO::SearchResultEventBuilder

Implementation changes

  • Methods end_hsp() and end_hit() return hash refs containing data suitable for creating HSPI and HitI objects respectively.

Bio::SearchIO::IteratedSearchResultEventBuilder

Implementation changes

  • _add_hit() deals with the new way hit information is stored.
  • end_iteration() supplies the hit factory to created and returned iteration factories.

Bio::Search::GenericStatistics

API-CHANGES

Added new method available_statistics() corresponding to the method in Bio::Search::Result:: modules, so delegation is possible.

Implementation changes

Now properly inherits from Bio::Root::Root, with a concomitant change in the internal storage structure.

Bio::Tools::Run::GenericParameters

API-CHANGES

Added new method available_parameters() corresponding to the method in Bio::Search::Result:: modules, so delegation is possible.

Implementation changes

Now properly inherits from Bio::Root::Root, with a concomitant change in the internal storage structure.

Bio::PullParserI

NEW

While not specific to SearchIO, this new module is used for making new high-performance SearchIO modules with the intent of eventually replacing all the SearchResultEventBuilder-based parsers.

Bio::SearchIO::hmmer_pull

NEW

Replacement SearchIO parser for hmmpfam reports, using PullParserI. Will eventually support hmmsearch reports as well. The differences between hmmer_pull and the existing hmmer modules are:

  • hmmer.pm breaks Bio::Search::HitI API by having hit (model) name()s that are not unique within a ResultI. It also only ever has one domain per model. hmmer_pull.pm has unique model names and as many domains per model as there are in the file being parsed.
  • hmmer_pull.pm gives back more correct answers when you try to use the full variety of HitI, GenericHit, HSPI and GenericHSP methods.
  • hmmer_pull is more memory efficient; in one test it used 1.8x less memory
  • hmmer_pull is faster; in tests it was worst-case 2.2x faster, best-case 38x faster and in one realistic-case example, 23.5x faster.

Bio::Search::Result::HmmpfamResult

NEW

A PullParserI for parsing hmmpfam results. Used by SearchIO::hmmer_pull, see above.

Bio::Search::Hit::HmmpfamHit

NEW

A PullParserI for parsing hmmpfam hits (models). Used by Search::Result::HmmpfamResult, see above.

Bio::Search::HSP::HmmpfamHSP

NEW

A PullParserI for parsing hmmpfam hsps (domains). Used by Search::Hit::HmmpfamHit, see above.
