Project priority list

Projects

Here is an evolving list of projects and aspects of BioPerl that need to be worked on. There are also important Orphan modules that need caretakers.

This list is generally ordered from most important (or doable) to least, but feel free to work on anything that interests you. We want to track who is doing what, so it is best to add a comment here or to keep track of particular enhancements on a Bugzilla page. Post to the mailing list as well to get feedback. If particularly difficult design decisions crop up, make a wiki page for them and link it here under the item so other people can follow (and backtrack) why particular decisions were made.

NOTE

  • Please leave a signature tag if you decide to accept a project.
  • Do not cross out projects until/unless they have been completed. Accepting a project and completing a project are not the same thing!

Core Modules

This list is not necessarily in order - feel free to consider any of these things worth working on.

Module testing

  1. An ongoing request is better testing of modules. Help ensure that each module has tests that cover it well.
  2. Consider moving tests to use Test::More and implementing test coverage with Pod::Coverage and Devel::Cover. Track our progress here.
    • I have started moving modules over to Test::More. --Chris Fields 21:07, 21 September 2006 (EDT)
    • bioperl-live and bioperl-run have converted over to Test::More. We need to work on bioperl-db and bioperl-network next. --Chris Fields 20:49, 6 March 2008 (EST)
  3. A list of modules with no tests or insufficient tests
    • I'll start to add additional tests etc to some of the bioperl-run modules. --Nath 17:16, 2 November 2006 (EST)
  4. A more complete test suite for file formats. Collect examples of the many sequence, alignment, and pairwise search report outputs. There is a CVS repository called biodata where this was initially started. Write a testing script to fully stress test the parsers against it. This would probably be better outside of the t/ directory, as we fully expect it to be large and to take a long time to run.
  5. I am working towards setting up a nightly code coverage report using Code::Coverage --Spiros Denaxas 12:00, 19 September 2008 (UTC)
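
As a rough illustration of the Test::More style mentioned in item 2, here is a minimal sketch of a test script. The revcom() function is a hypothetical stand-in defined inline so the script is self-contained; a real test would load the BioPerl module under test instead.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical function under test (not a BioPerl method); a real test
# script would 'use' the module being covered instead.
sub revcom {
    my ($seq) = @_;
    my $rc = reverse $seq;              # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;       # complement, leaving other symbols alone
    return $rc;
}

# Each assertion carries a description, which makes failures easy to locate.
is( revcom('ATGC'), 'GCAT', 'reverse complement of a short DNA string' );
is( revcom(''),     '',     'empty string stays empty' );
like( revcom('aaccNN'), qr/^NNggtt$/, 'case is preserved, N is left untouched' );
is_deeply(
    [ map { revcom($_) } qw(A C G T) ],
    [ qw(T G C A) ],
    'single bases are complemented'
);

done_testing();
```

Assuming Devel::Cover is installed, coverage for a t/ directory of such scripts can usually be collected by running the tests with HARNESS_PERL_SWITCHES=-MDevel::Cover set and then running the cover command.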

Improve the speed of the toolkit

Slowness is generally due to the object-oriented code. This does not need any Bio knowledge.

  1. Improve Bio::PopGen::Statistics. Calculation of population statistics is slow because of the object code, not the basic math. How can it be made efficient for simulation sets of 10,000 to 100,000? There also seem to be some redundant calculations going on. Testing for haploid population perhaps?
  2. Add Bio::SearchIO methods to pass hash refs containing multiple elements directly for mapping vs. passing report elements one at a time using element(). Each call to element() currently requires three additional calls (to start_element, characters, and end_element), whereas a direct mapping of related data would likely be a bit faster.
    1. Sendu Bala has an interesting pull parser implementation. It might be worth using some tricks from Higher Order Perl to implement speedups (lookup tables, iterators, etc.) --Chris Fields 00:08, 30 January 2008 (EST)
  3. Utilize or develop optional XS-based C extensions (in bioperl-ext) for tasks which require lots of calculations or for parsing common filetypes. Using libsequence is mentioned below.
    1. This will require an overhaul of bioperl-ext, which uses XS from pre-perl 5.6.
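
The call overhead described in item 2 can be sketched with a toy handler. Everything here is an assumption made for illustration: ToyHandler and elements_from_hash() are hypothetical names, not part of Bio::SearchIO, and the element names are only examples. The sketch just counts method calls for the two approaches.

```perl
use strict;
use warnings;

# Toy stand-in for a SearchIO-style event handler (hypothetical, not BioPerl code).
package ToyHandler;

sub new {
    my $class = shift;
    return bless { values => {}, calls => 0 }, $class;
}

sub start_element {
    my ($self, $d) = @_;
    $self->{calls}++;
    $self->{current} = $d->{Name};
}

sub characters {
    my ($self, $d) = @_;
    $self->{calls}++;
    $self->{values}{ $self->{current} } = $d->{Data};
}

sub end_element {
    my ($self, $d) = @_;
    $self->{calls}++;
    delete $self->{current};
}

# One report element at a time: element() plus three further calls.
sub element {
    my ($self, $d) = @_;
    $self->{calls}++;
    $self->start_element($d);
    $self->characters($d);
    $self->end_element($d);
}

# Hypothetical alternative: map a whole hash ref of related data in one call.
sub elements_from_hash {
    my ($self, $data) = @_;
    $self->{calls}++;
    @{ $self->{values} }{ keys %$data } = values %$data;
}

package main;

my %hsp = ( Hsp_score => 52, Hsp_evalue => '1e-10', 'Hsp_query-from' => 1 );

my $slow = ToyHandler->new;
$slow->element( { Name => $_, Data => $hsp{$_} } ) for sort keys %hsp;

my $fast = ToyHandler->new;
$fast->elements_from_hash( \%hsp );

printf "element(): %d method calls; hash ref: %d method call\n",
    $slow->{calls}, $fast->{calls};
```

Both handlers end up holding the same data; whether the saved calls translate into a worthwhile speedup on real reports would need benchmarking.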

Parsing code

  1. Revamp Bio::SeqIO's GenBank, EMBL, and Swissprot Sequence formats parsing code. Some of these modules are very hacky now after 5+ years of evolution.
    • I am developing event-based parsers (with writers and handlers) for all three formats. There seems to be a ton of repeated code present in all three modules which makes maintenance a hassle, so passing some of the data to the same generic handlers may help a bit. I'll commit these as a separate group of SeqIO modules for further testing in the next few months. As a note, the developed handlers will not support any deprecated (read Bio::Species) parsing or methods, but one could subclass and/or overload the relevant handler if necessary. --Chris Fields 10:15, 26 January 2007 (EST)
  2. Changes are afoot at Swissprot in the file format, so we need to be able to keep up with the variations in the format easily.
    • There have been several commits lately to handle this, but I'll look into it. The above-mentioned event-based parsers should make maintenance a bit easier --Chris Fields 10:15, 26 January 2007 (EST)
  3. Support the TIGR XML standard for sequence files better. Bio::SeqIO::tigr is written with regexps instead of a standard XML parser; perhaps use XML::SAX, as was done for Bio::SeqIO::tigrxml (which supports a different TIGR sequence format). This project may need its own page to fully describe what has been done previously.
  4. Many XML tags are not currently handled by the XML parser in Bio::ClusterIO::dbsnp. --Chris Fields 19:59, 26 June 2006 (EDT)
  5. Standardize all XML parsing to use one of four different XML parsers. XML::Parser and XML::DOM are no longer actively maintained.
  6. Add support for INSDseqXML, EMBLXML, and UniProtKB XML formats.
    • The old EMBL XML sequence format (XEMBL) is no longer used, but we should keep it around for the time being.
  7. Sequence validation. There have been several requests for methods/classes that validate sequence/alignment data. This is too much to handle for the already overburdened SeqIO/AlignIO parsers, but a separate validation object could be created in order to check formatting. Any validation could be optionally triggered within a parser using such a system and would be 'off' by default. A starting place could be Bio::Tools::GuessSeqFormat. See Issue #1508.
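
The opt-in validation idea in item 7 could look roughly like the sketch below. ToySeqValidator and ToyParser are hypothetical classes written for this page, not existing BioPerl modules, and the check is limited to DNA with IUPAC ambiguity codes and gap characters. The point is only that the parser does no validation unless a validator object is handed to it.

```perl
use strict;
use warnings;

# Hypothetical validator, kept separate from any parser.
package ToySeqValidator;

# IUPAC nucleotide codes plus '-' as a gap character (DNA only, for brevity).
my $DNA = qr/^[ACGTRYKMSWBDHVN\-]+$/i;

sub new {
    my $class = shift;
    return bless {}, $class;
}

# Returns a list of problems; an empty list means the record passed.
sub validate {
    my ($self, $id, $seq) = @_;
    my @problems;
    push @problems, 'missing identifier' unless defined $id && length $id;
    if ( !defined $seq || !length $seq ) {
        push @problems, 'empty sequence';
    }
    elsif ( $seq !~ $DNA ) {
        push @problems, 'illegal characters for DNA';
    }
    return @problems;
}

# Toy parser: validation is off unless a validator is supplied.
package ToyParser;

sub new {
    my ($class, %args) = @_;
    return bless { validator => $args{validator} }, $class;
}

sub parse_record {
    my ($self, $id, $seq) = @_;
    if ( my $validator = $self->{validator} ) {
        my @problems = $validator->validate( $id, $seq );
        die sprintf( "record %s: %s\n", $id, join( '; ', @problems ) ) if @problems;
    }
    return { id => $id, seq => $seq };
}

package main;

my $lenient = ToyParser->new;                                        # default: no checks
my $strict  = ToyParser->new( validator => ToySeqValidator->new );   # opt-in checks

my $rec = $lenient->parse_record( 'seq1', 'ACGTXX!' );               # accepted as-is
my $ok  = eval { $strict->parse_record( 'seq1', 'ACGTXX!' ); 1 };
print "strict parser rejected the record: $@" unless $ok;
```

Keeping the checks in their own object means the existing parsers pay no cost when validation is off, and new rules can be added without touching the parsing code.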

Restriction enzymes

  1. At this time, using Bio::Restriction::IO only gets data into Bio::Restriction::Enzyme or Bio::Restriction::EnzymeCollection objects. See bug #2011.
  2. None of the Bio::Restriction::IO modules currently implement write methods except for 'base' format. Bio::Restriction::IO::base does have a write() method but it doesn't recognize multicut/multisite enzymes so is not recommended.
  3. Format parsing for Bio::Restriction::IO::bairoch is currently broken (it does not recognize multicut and multisite enzymes). It will only grab the first site if multiple cut sites are given.
  4. Tests need to be added to check for multicut and multisite enzyme object types for each REBASE format supported.
    • This was fixed in the Great Bio::Restriction Refactor of '09 --maj 16:01, 18 September 2009 (UTC)
  5. Code and old POD for Bio::Restriction::IO suggested that enzymes in XML format could be parsed, but this obviously isn't available yet. Should support for this be added and are there any other formats that should be supported as well?

Unimplemented methods and other 'unfinished business'

Generally, new BioPerl classes are submitted with a particular intent in mind, such as converting one format to another (Bio::SeqIO), analyzing data using object methods (such as taking a slice from an alignment), or parsing data output, such as XML. As the road to hell is paved with good intentions, one could guess that, with a large distribution such as BioPerl, there could also be 'unfinished business.' Here are some areas with 'unfinished business':

  1. Bio::Assembly::Contig has ~25 unimplemented methods (see Issue #2021).
  2. Bio::Restriction::IO modules have unimplemented write methods (see above).
  3. From Bio::AlignIO: only those formats which were implemented in Bio::SimpleAlign have been incorporated in Bio::AlignIO. Specifically, Mase, Stockholm, and ProDom have only been implemented for input. See the specific module (e.g. Bio::AlignIO::meme) for notes on supported versions.

General phylogenetic, population genetics, and molecular evolution tools

  1. Test that parsing alignments in the different flavors of PHYLIP multiple alignment format really works - there seem to be some problems with reading sequential or interleaved formats (can't remember which). Also check that writing these formats is robustly supported. --jason stajich
  2. Support XML tree formats like phyloxml?
  3. Albert's list of enhancements
  4. Better interface with Hudson's ms
  5. Perhaps an XS link to Kevin Thornton's libsequence (although the calculation speed is not the problem in BioPerl AFAIK). Doing this would give access to more statistics but our object model is a little different from Kevin's so it may take some more thinking. An open area for someone to jump into.
  6. In general, support haplotype data more explicitly in Bio::PopGen modules. This would allow implementation of haplotype LD, which is currently lacking.
  7. Add several functions to Bio::Tree::Node, including a collapse function to collapse nodes below a level.
  8. Implement the Heads or Tails test for assessment of alignment quality [1]. This could go in as an extra method that is part of modules that can run alignment programs (it basically requires reversing sequences and re-aligning). The result could be stored in the SimpleAlign object.
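
The reverse-and-realign idea behind the Heads or Tails test can be sketched as follows. This is a simplified column-agreement score under stated assumptions, not the exact measures from the reference: toy_align() is a hypothetical stand-in that just left-justifies sequences (a real implementation would call an actual alignment program), and heads_or_tails() and column_signatures() are illustrative names.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Toy "aligner" (an assumption standing in for a real program):
# left-justifies the sequences and pads the right end with gaps.
sub toy_align {
    my ($seqs) = @_;
    my $width = max map { length $_ } values %$seqs;
    my %aligned;
    for my $id ( keys %$seqs ) {
        $aligned{$id} = $seqs->{$id} . '-' x ( $width - length $seqs->{$id} );
    }
    return \%aligned;
}

# Describe each column by the residue index (or '-') it holds per sequence.
sub column_signatures {
    my ($aln) = @_;
    my @ids   = sort keys %$aln;
    my $width = length $aln->{ $ids[0] };
    my %next  = map { ( $_ => 0 ) } @ids;    # next residue index per sequence
    my %signatures;
    for my $col ( 0 .. $width - 1 ) {
        my @cells;
        for my $id (@ids) {
            my $char = substr $aln->{$id}, $col, 1;
            push @cells, $char eq '-' ? '-' : $next{$id}++;
        }
        $signatures{ join ',', @cells } = 1;
    }
    return \%signatures;
}

# Fraction of "heads" columns that are reproduced when the reversed
# sequences are aligned and the result is flipped back ("tails").
sub heads_or_tails {
    my ( $seqs, $aligner ) = @_;
    my $heads = $aligner->($seqs);

    my %reversed = map { ( $_ => scalar reverse $seqs->{$_} ) } keys %$seqs;
    my $tails_rev = $aligner->( \%reversed );
    my %tails = map { ( $_ => scalar reverse $tails_rev->{$_} ) } keys %$tails_rev;

    my $h = column_signatures($heads);
    my $t = column_signatures( \%tails );
    my $shared = grep { $t->{$_} } keys %$h;
    return $shared / scalar( keys %$h );
}

printf "equal lengths:   %.2f\n",
    heads_or_tails( { a => 'ACGT', b => 'ACGA' }, \&toy_align );
printf "unequal lengths: %.2f\n",
    heads_or_tails( { a => 'ACGT', b => 'ACG' }, \&toy_align );
```

Passing the aligner as a code reference keeps the scoring independent of any one alignment program, which matches the suggestion to offer this as an extra method on modules that run aligners.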

StructureIO

People who have used, or attempted to use, Bio::Structure::IO::pdb believe that it needs to be refactored. A discussion of the issues can be found here:

http://portal.open-bio.org/pipermail/bioperl-l/2006-September/022990.html


Ontology file parsing

  1. Bioperl doesn't parse OBO-format ontologies as of v. 1.5.1. This would be a useful addition to the package, and it would involve migrating code from go-perl into BioPerl. go-perl's author, Chris Mungall, is a contributor to BioPerl, so he may be able to provide assistance. Recent work by Sohel Merchant with input from Hilmar Lapp may be solving this soon. (This appears to be basically solved with the new OBOEngine; Sohel will need to comment if it is indeed finished). --jason stajich 20:10, 19 June 2006 (EDT)

Module Enhancements

  1. Most of the items in our Bugzilla tracking system are not bugs, but requests for specific enhancements. Take a look; perhaps you might find an item that interests you. That's the key: your willingness and curiosity, not whether or not you're familiar with the topic. By asking questions on the bioperl-l mailing list you'll get all the information you need.
  2. Find new caregivers for Orphan modules

Test and fix Bioperl's use of proxies

  1. Bug 1884 and bug 1770 concern querying remote databases while using a proxy. No Bio knowledge required. Bugs closed by BrianO and ChrisF.
  2. We need more users who use various proxies to run the test suite to make sure that they all pass. So far, though, everything seems fine. --Chris Fields 00:49, 28 September 2006 (EDT)

New modules to be written

  1. See for a link to module/tool descriptions that are in need of something
  2. Listing some particular wishes
  3. Parsers for interfacing with MUMMER and AVID for whole genome alignment
  4. Parser for Bacterial Glimmer 2.x and Glimmer 3.x - the current Bio::Tools::Glimmer only supports Eukaryotic GlimmerM and GlimmerHMM
  5. Bio::AlignIO::stockholm write_aln function needs to be implemented
  6. There seems to be a general lack of RNA-specific software (report parsers, structure formats, wrappers, etc.). Chris Fields is working on some parsers (RNAMotif, ERPIN, Infernal, FASTR) but are there others that should be added?

Toolkit changes

  1. File::Spec should be standard in all modern Perl distros now so perhaps remove the catfile and catdir functions from Bio::Root::IO (which delegates to File::Spec if it is installed anyways) in favor of the standard modules.
    Function catdir does not exist in BioPerl anywhere? --Torsten 23:56, 21 September 2006 (EDT)
  2. Similarly the use of File::Temp could replace tempfile and tempdir methods in Bio::Root::IO?
    The Bioperl ones require File::Temp anyway? --Torsten 23:56, 21 September 2006 (EDT)
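
For reference, this is roughly what using the standard modules directly looks like. It is a minimal sketch with only core modules; the directory and file names are made up for illustration, and it says nothing about how Bio::Root::IO currently wraps these calls.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempfile tempdir);

# Temporary directory that is removed automatically when the program exits.
my $dir = tempdir( CLEANUP => 1 );

# Portable path building; the 'data' and 'test.fasta' names are illustrative only.
my $subdir = File::Spec->catdir( $dir, 'data' );
my $path   = File::Spec->catfile( $subdir, 'test.fasta' );

# Temporary file inside that directory; the XXXXXX part is replaced
# with random characters by File::Temp.
my ( $fh, $filename ) = tempfile(
    'bioperl_XXXXXX',
    DIR    => $dir,
    SUFFIX => '.tmp',
    UNLINK => 1,
);
print {$fh} ">seq1\nACGT\n";
close $fh or die "Could not close $filename: $!";

# Read the file back to show it is an ordinary file on disk.
open my $in, '<', $filename or die "Could not open $filename: $!";
my @lines = <$in>;
close $in;

print "built path: $path\n";
print "temp file:  $filename (", scalar @lines, " lines)\n";
```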

Open Bugs

Open bugs are listed in the Bugs page as well as on bugzilla directly. We could certainly use volunteers to work on any of these. Just sign up for a bugzilla account on the site and add a comment about what you will be doing with the bug.

Run Modules

  1. More extensive tests, several modules have no tests.
  2. Support more command line options for many of the modules.
  3. Find new caretakers for Orphan modules
  4. Module to run formatdb

Documentation/Website

Website - The largest complaint about BioPerl is that it is too complicated and that the documentation is poor or disorganized (mostly that people can't seem to find what they think they are looking for). The Wiki should be our place to start to organize this properly.

  1. Volunteer to help improve and extend this Wiki site - some jobs and pages that need to be written are below
  2. Blog weekly mailing list traffic summaries (rotating job?)
    • Chris Fields had started this up but has spent more time lately helping maintain BioPerl, so has fallen waaay behind. Any other volunteers?
  3. Generate prettier POD2HTML like on the search.cpan.org site. No Bio knowledge required. Mauricio has taken this one.
    • Is this still necessary? Is Pdoc site ain't enough or 'sort of the same'? -- Mauricio 15:42, 22 September 2006 (EDT)
      • I think the idea is to have HTML-derived docs similar to those ActiveState provides for their Perl distribution. --Chris Fields 15:06, 2 November 2006 (EST)
  4. Improve Pdoc site and API documentation site. No Bio knowledge required.
    • Raphael Leplae provided a newer version of Pdoc which works better and fixes some issues. Many thanks to him!
    • Mauricio takes care of the Pdoc site now.
  5. Move existing HOWTOs to wiki rather than as Docbook.
  6. More HOWTOs. Possible topics:
  7. How to integrate documentation, tutorials with Wiki new site?
    • Wiki seems to be primary site now? --Torsten 00:05, 22 September 2006 (EDT)
  8. Move FAQ to the Wiki --jason stajich 10:29, 19 November 2005 (EST)
  9. General organization of Documentation to orient people towards.
    • Appears to be done on 'Doc' sub menu on Wiki --Torsten 00:05, 22 September 2006 (EDT)
  10. Re-generate Class Diagram. No Bio knowledge required.
    • The current class diagrams we have are derived from v. 1, very old indeed. Having up-to-date diagrams would be good. Look into auto-generating diagram from Perl code?
      • Mauricio had started this up but has run out of spare time to get it done. He tried class auto-generation with AutoDia and a recent method described at Perl.com, but BioPerl's class hierarchy has become too complicated to be 'drawn' by these automatic solutions (the rendered diagrams suck!). Probably better to do it by hand? Any other volunteers?

References

  1. Landan G and Graur D. Heads or tails: a simple reliability check for multiple sequence alignments. Mol Biol Evol. 2007 Jun;24(6):1380-3. DOI:10.1093/molbev/msm060 | PubMed ID:17387100