XML parsers

From BioPerl

These are tools that parse XML into a stream of events (the SAX model) or an in-memory tree (the DOM model). In Perl, several modules implement one or both of these models.

If you plan on developing new classes that will parse XML data, we strongly suggest using one of the following XML parsers (you will have to convince us if you plan on adopting another).

  • XML::SAX - Next-generation SAX parser - uses SAX2 specification
  • XML::LibXML - Probably the most fully realized XML parser/writer distribution. Supports DOM, SAX, XPath, and RELAX NG, with some support for libxml2's StAX-like pull parser (XML::LibXML::Reader).
  • XML::Simple - typically used for processing small XML documents or small chunks of 'balanced' XML.
  • XML::Twig - combines the DOM and SAX models; processes large XML files one chunk at a time, without the memory overhead of loading the whole document.
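
The chunk-wise style that XML::Twig offers can be sketched as follows. This is a minimal illustration, not code taken from any BioPerl module: the sample document and the @names array are made up for the example. The handler fires once per completed Taxon element, and the parsed chunk is then purged so memory use stays flat.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample input; for a real file, use $twig->parsefile($path) instead.
my $xml = <<'XML';
<TaxaSet>
  <Taxon><TaxId>9606</TaxId><ScientificName>Homo sapiens</ScientificName></Taxon>
  <Taxon><TaxId>10090</TaxId><ScientificName>Mus musculus</ScientificName></Taxon>
</TaxaSet>
XML

my @names;    # hypothetical collector for this example

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <Taxon> element has been parsed.
        Taxon => sub {
            my ($t, $taxon) = @_;
            push @names, $taxon->first_child_text('ScientificName');
            $t->purge;    # discard what has been parsed so far to free memory
        },
    },
);

$twig->parse($xml);
print "$_\n" for @names;
```

Within the handler the element can be navigated like a small DOM tree, while the document as a whole is never held in memory at once.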

Older BioPerl modules also use the following parsers, which are no longer actively maintained:

  • XML::Parser - distributed with several Perl distributions (e.g. ActivePerl) but hasn't been updated in years.
  • XML::DOM

We anticipate that modules using these two parsers will be switched to one of the recommended parsers by the next major BioPerl release.

Jason advocates switching to XML::SAX for SAX parsing because several different backend engines can be plugged in: either a slow but portable Perl-only engine, or a faster C-based engine such as expat. We recommend installing expat or libxml2 for your system and using XML::SAX::ExpatXS or XML::LibXML (stay away from XML::SAX::Expat; it isn't actively maintained). See Bio::SearchIO::blastxml for an example implementation using XML::SAX.
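
As a rough sketch of what an XML::SAX handler looks like (this is not the Bio::SearchIO::blastxml implementation), the example below collects the text of every Hit_id element. The HitIdCollector class and the sample document are hypothetical. XML::SAX::ParserFactory picks whichever backend is installed, so the same handler runs unchanged on the pure-Perl engine or on a C-based one.

```perl
use strict;
use warnings;

# Hypothetical handler class, for illustration only.
package HitIdCollector;
use base qw(XML::SAX::Base);

sub start_document {
    my ($self) = @_;
    $self->{ids}    = [];
    $self->{buffer} = undef;    # defined only while inside <Hit_id>
}

sub start_element {
    my ($self, $el) = @_;
    $self->{buffer} = '' if $el->{Name} eq 'Hit_id';
}

sub characters {
    my ($self, $chars) = @_;
    # Text may arrive in several pieces, so append rather than assign.
    $self->{buffer} .= $chars->{Data} if defined $self->{buffer};
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'Hit_id') {
        push @{ $self->{ids} }, $self->{buffer};
        $self->{buffer} = undef;
    }
}

package main;
use XML::SAX::ParserFactory;

# Made-up sample input, loosely shaped like BLAST XML output.
my $xml = '<Iteration><Hit><Hit_id>hit_a</Hit_id></Hit>'
        . '<Hit><Hit_id>hit_b</Hit_id></Hit></Iteration>';

# To force a particular backend, one could set, for example:
#   local $XML::SAX::ParserPackage = 'XML::SAX::PurePerl';
my $handler = HitIdCollector->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string($xml);

print "$_\n" for @{ $handler->{ids} };
```

Because the handler only sees events, it never needs to know which engine produced them; swapping backends is a configuration choice rather than a code change.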

For example implementations using XML::Twig and XML::Simple, see Bio::DB::Taxonomy::entrez and Bio::DB::EUtilities, respectively. The various Bio::DB::EUtilities modules will likely switch to XML::Twig in the future.

There are currently no implementations using XML::LibXML.
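
Since no BioPerl module demonstrates XML::LibXML, here is a minimal sketch of its DOM and XPath interface. The sample document, its element names, and the @rows array are made up for illustration and do not reflect any BioPerl data format.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up sample input; use load_xml(location => $path) to read from a file.
my $xml = <<'XML';
<taxa>
  <taxon id="9606"><name>Homo sapiens</name></taxon>
  <taxon id="10090"><name>Mus musculus</name></taxon>
</taxa>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

my @rows;    # hypothetical collector: one "id<TAB>name" string per taxon
for my $taxon ($doc->findnodes('/taxa/taxon')) {
    my $id   = $taxon->getAttribute('id');
    my $name = $taxon->findvalue('./name');    # XPath relative to this node
    push @rows, "$id\t$name";
}

print "$_\n" for @rows;
```

This loads the whole document into memory, so it suits small to medium inputs; for very large files an event-based or chunk-wise parser is the better fit.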
