BioPerl Modularization

Currently, BioPerl is distributed as several large source code packages (called distributions), consisting of several hundred Perl modules each. The core BioPerl distribution, called bioperl-live by developers and published on CPAN as BioPerl (e.g. BioPerl-1.6.901), now consists of nearly 900 modules. As many of our users know, this extremely large distribution can be quite time-consuming and even challenging to install, since it has a very large number of tests and external dependencies. Moreover, since our user base is very diverse, most users actually need and use only a small portion of the modules in each of these distributions. These large monolithic distributions are also growing increasingly difficult for BioPerl core developers to maintain and update.

In late 2010 and early 2011, BioPerl core developers developed a plan to subdivide BioPerl into smaller units that can be installed independently. The initial plan is to split the current bioperl-live CPAN distribution, over time, into approximately 20 to 30 different repositories, starting with Bio-Root and proceeding upward through the dependency hierarchy to Bio-Range, Bio-Coordinate, and so on. After each split, both the newly split-off distribution and the (now smaller) bioperl-live will be released to CPAN, with bioperl-live including a dependency on the split-off code. In this way, CPAN users of BioPerl will not be affected, and bioperl-live will over time become a pseudo-distribution containing nothing but dependencies on the various split-off distributions and some modules for backward compatibility (such as Bio::Root::Version).

The project was finally initiated by Sheena Scroggins in summer 2011 with her Google Summer of Code project, mentored by BioPerl core developer Robert Buels. Her project successfully developed the core process and infrastructure for the subdivision process (including a Dist::Zilla plugin bundle) and completed the first few divisions. BioPerl core developers are currently following up by continuing to migrate more code out of bioperl-live.

For BioPerl Users

1. Users who install BioPerl with CPAN clients such as cpan or cpanm will not be affected by these changes.

2. Users who run BioPerl from a cloned git repository will be continually disrupted by these changes. This method of running BioPerl has always been discouraged by the development team, and now we strongly urge all users to use a CPAN client for BioPerl installation.

For BioPerl Developers

Release Plan

Releases will be pushed to CPAN incrementally to ensure things are working; CPANtesters is a great tool for this. Pushing new ones in cases where a bug is found should also be straightforward. Setting up a new release of code is very easy with Dist::Zilla, but use of Dist::Zilla isn't required except in the following situations:

  1. When installing directly from the git repository, when a Build.PL isn't present (installing developer code for production use is highly discouraged in any situation)
  2. Running tests using XS-based modules (not a concern in almost all BioPerl code)
  3. If the modules in question are easily maintained and released without using Dist::Zilla (e.g. have enough information in place to release on their own, such as a Makefile.PL, Build.PL, and some documentation indicating how to make a simple release). Those decisions are left to the maintainer.

We can also create a pass-through Build.PL or Makefile.PL to deal with cases such as XS-based code.

Users can still use git-based code by simply checking out the code and pointing their local environment (for example, PERL5LIB) to the proper directory. We may include a git submodule directory for those modules that are known to work together cohesively.
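As a minimal sketch of pointing a local environment at a checkout, the helper below prepends a checkout to PERL5LIB. The function name use_checkout and the clone location in the usage comment are hypothetical, and the lib/ test is an assumption: it covers repositories that keep modules under lib/ as well as those that keep Bio/ at the top level.

```shell
# Prepend a git checkout to PERL5LIB so perl loads modules from it first.
# Uses <checkout>/lib when that directory exists, otherwise the checkout root.
use_checkout() {
    checkout=$1
    if [ -d "$checkout/lib" ]; then
        dir="$checkout/lib"
    else
        dir="$checkout"
    fi
    PERL5LIB="$dir${PERL5LIB:+:$PERL5LIB}"
    export PERL5LIB
}

# Example with a hypothetical clone location:
# use_checkout "$HOME/src/Bio-EUtilities"
# perl -MBio::DB::EUtilities -e 'print $INC{"Bio/DB/EUtilities.pm"}, "\n"'
```

The commented perl one-liner prints the path of the file that was actually loaded, which is a quick way to confirm that the checkout, and not an installed copy, is in use.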

We'll gradually pull out bundles of related code into their own repositories; a section below gives a git recipe that can do this and retain the history of the commits. The eventual goal is to basically have 'bioperl-live' (and the BioPerl release on CPAN) be a Task:: or Bundle:: module that can install whatever BioPerl code you want, along with required dependencies (i.e. only install the code you need, not everything by default along with all the dependencies thereof that you likely won't need). This will also remove a certain amount of code that to the best of our knowledge is no longer used or maintained.

Current Progress

During GSoC 2011 a few distributions were split out of bioperl-live. However, they were never released and the modules were kept in bioperl-live. This led to a branching of development, since work continued in both bioperl-live and the newly split repositories.

To solve this, the two branches need to be merged (both for the modules themselves and for the scripts, examples and tests). After fixing this, the modules must be removed from bioperl-live, so the new distribution can be made.

Currently, only Bio-Biblio and Bio-EUtilities have seen new releases.

BioPerl distributions

As of April 2013, the following repositories have been split out of bioperl-live and are no longer in the master branch.

The following modules are being re-evaluated:

  • Bio-Range : also includes Bio::Location modules

The problem with splitting this set of modules out is that they are quite central to many of the other modules remaining in core, namely Seq/SeqFeature/Annotation. These may be harder to split out (and the worry is the current distribution will fall out of date with master commits). Should these be split out, they will move to the below group.

The following modules lack commit history and may need to be recreated.

The following modules are slated to be split out:

  • Bio::Cluster/Cluster::IO
  • Bio::Index
  • Bio::LiveSeq
  • Bio::Map/MapIO
  • Bio::Matrix
  • Bio::MolEvol
  • Bio::Nexml (and related)
  • Bio::Ontology/OntologyIO
  • Bio::Phenotype
  • Bio::PhyloNetwork
  • Bio::PopGen
  • Bio::Restriction
  • Bio::SeqEvolution
  • Bio::Structure
  • Bio::Symbol
  • Bio::Taxonomy
  • Bio::Variation

In addition, the following new distributions have been created:

Initial steps for existing code

  • Identify code that could be split out
  • Where possible, use git filter-branch to split out code of interest, along with tests, scripts, examples
  • Add appropriate documentation (README, Changes) for the split distribution
  • Push repo to bioperl organization repository on github

At this point, we are not modifying the history of the main repository. Personally, I don't think it's necessary, even though the history is then redundant, primarily because we should ensure that the commits up to that point remain intact. This means the main bioperl-live repo will stay large (the removed modules will still exist in the history of the repo as objects or 'blobs'), but we will also have an unmodified archive, in the sense that no filter-branch changes will be introduced.

HOWTO: splitting a new repository/distribution off of bioperl-live

To maintain the full version control history of the split-out modules, use the git filter-branch command for the initial split. The --index-filter option is particularly useful for this, in combination with git ls-tree.

Generate filter

First, work on creating a filter for the files you want to include in the split-out repository using git ls-tree. One can use a chained grep -v in conjunction with git ls-tree. The example below is the command I used for creating the Bio-EUtilities repo:

 git ls-tree -r --name-only --full-tree HEAD | \
 grep -v "Bio/.*/EUtilities" | \
 grep -v "scripts/Bio-DB-EUtilities" | \
 grep -v "t/.*EUtilities*" | \
 grep -v "t/data/eutils"

This emits to standard output a long list of files that do NOT match those that we want to retain. The reason we want a list of everything else is that we will be filtering out commits to everything else from the history, only retaining the files we want in the split-out repo.
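To make the inverted logic concrete, here is the same grep -v chain applied to a short hand-written list instead of git ls-tree output. The function name drop_list and the sample paths are made up for illustration; only the four patterns come from the command above.

```shell
# The same grep -v chain as above, reading paths from standard input.
drop_list() {
    grep -v "Bio/.*/EUtilities" |
    grep -v "scripts/Bio-DB-EUtilities" |
    grep -v "t/.*EUtilities*" |
    grep -v "t/data/eutils"
}

# Hypothetical sample paths standing in for `git ls-tree` output.
printf '%s\n' \
    Bio/Seq.pm \
    Bio/DB/EUtilities.pm \
    Bio/Tools/EUtilities/Info.pm \
    scripts/Bio-DB-EUtilities/bp_einfo.pl \
    t/RemoteDB/EUtilities.t \
    t/data/eutils/esearch.xml \
    t/SeqIO/fasta.t |
    drop_list
# prints only the paths that will be removed from history:
#   Bio/Seq.pm
#   t/SeqIO/fasta.t
```

Running the chain against a small list like this before the real rewrite is a cheap way to check that every file meant for the new repository is absent from the output.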

Clone repo

Make a new cloned copy of the original repository (in this example, bioperl-live). It is always a good idea to clone from a read-only version of the repository, just in case (the one here is the read-only link from GitHub).

 git clone git:// bioperl-tmp/

Use git filter-branch

In the new temporary repo, run git filter-branch using the --index-filter option. This is using the same git ls-tree command as above, but using the imported $GIT_COMMIT for each commit to be checked, and the addition of a pipe to git rm to remove the files for each step. In this case, I also added a bit of flexibility based on the knowledge that the directory structure of the tests changed, so hopefully this will capture as much of the history as possible:

 git filter-branch --prune-empty --index-filter   \
     'git ls-tree -r --name-only --full-tree $GIT_COMMIT | \
     grep -v "Bio/.*/EUtilities" | \
     grep -v "scripts/Bio-DB-EUtilities" | \
     grep -v "t/.*EUtilities*" | \
     grep -v "t/data/eutils" | \
     sed "s/^.*$/\"&\"/" | \
     xargs git rm --cached --ignore-unmatch' -- --all

Note the -- --all, which applies to all refs, and --prune-empty, which prunes possibly empty commits. The sed call is to wrap everything in quotes, in case there are files that have spaces in them (in our case, we have one test file that matches this). Enjoy a beer and wait for a bit.
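The effect of the sed step can be seen in isolation with a short sketch. The file names and the helper name paths are hypothetical, and printf stands in for git rm so that nothing is deleted.

```shell
# Hypothetical file names; the second one contains spaces.
paths() {
    printf '%s\n' 'Bio/Seq.pm' 't/data/file with spaces.txt'
}

# Without quoting, xargs splits on whitespace: four arguments.
paths | xargs printf '<%s>\n'

# With the sed step, each line is wrapped in double quotes and
# xargs passes it through as one argument: two arguments.
paths | sed "s/^.*$/\"&\"/" | xargs printf '<%s>\n'
```

This quoting handles spaces, but a file name that itself contains a double quote would still confuse xargs.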

Clean up

The fastest way to clean up everything is to simply create a clean clone:

 git remote rm origin
 cd ..
 git clone "file://$PWD/bioperl-tmp" Bio-EUtilities

The reason: git holds on to objects even after they are filtered; it generally cleans them up only when asked. However, cloning over the file:// transport does not carry over removed objects, thus this acts as a quick cleaning step. (Cloning from a plain local path would simply hardlink or copy the whole object store, removed objects included.)
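If re-cloning is not convenient, the filtered repository can also be cleaned in place. The sketch below follows the steps suggested in the git filter-branch documentation; the function name shrink_filtered_repo is made up, and it is meant to be run inside the temporary repository only, since it discards the backup refs and reflog.

```shell
# In-place alternative to re-cloning; run inside the filtered repository.
shrink_filtered_repo() {
    # Delete the backup refs that git filter-branch leaves under refs/original/.
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done
    # Expire all reflog entries so they no longer keep old commits reachable.
    git reflog expire --expire=now --all
    # Repack and delete unreachable objects immediately.
    git gc --prune=now
}
```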

The final part is to change to the newly cloned, clean repo, add a new clean remote repository (from github or wherever), and push the code there.

 # add repo on github
 git remote rm origin
 git remote add origin
 git push --all

The newly created repository used for this demonstration is located here.


There may be alternative ways to accomplish this. If the modules are contained within a single subdirectory, one could use the simpler --subdirectory-filter option, though with BioPerl this isn't likely to be common (unless you don't care about grabbing the tests as well). Other options are git subtree (a git add-on) and git-stitch-repo; the latter can be used to combine two independent repos with their history intact.
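For the subdirectory case, a minimal sketch looks like the following. The function name split_subdirectory and the directory in the usage comment are hypothetical; as with the recipe above, this should only be run in a throwaway clone.

```shell
# Keep only the history of one subdirectory and make it the new repository root.
# Run inside a throwaway clone; "$1" is the subdirectory to keep.
split_subdirectory() {
    git filter-branch --prune-empty --subdirectory-filter "$1" -- --all
}

# Example with a hypothetical directory name:
# split_subdirectory Bio-Foo
```

Anything outside the chosen directory, including tests kept elsewhere in the tree, is dropped from the rewritten history.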


Policy for new code

First, and most importantly, all developers are free to publish new distributions under the "Bio::" namespace on CPAN at any time. The BioPerl organization does not own this namespace; it is free for all to use. Thus, in most cases, new BioPerl-compatible code should be published on CPAN by its original author, without official affiliation with the core of BioPerl itself.

In certain rare cases, BioPerl core developers may accept new modules for inclusion in the "official" BioPerl repositories. This will be decided on a case-by-case basis through discussion among the core developers.


  • Set up appropriate post-commit hooks:
    • Forward commits to bioperl-guts-l
    • Forward commits to (for issue tracking/management) - do not use the github redmine hook for this!
    • (optional but recommended) add IRC commits for #bioperl on via (contact cjfields on this, should set this up for a group...)
    • others?
  • Do we want to set a git submodule repo to collect the various splits for devs/users? these can be problematic...