HOWTO:Short-read assemblies with BWA
Abstract
Using BioPerl to create and manipulate short-read assemblies with the bwa and samtools suites. Quicklink to synopsis.
Contents |
Author
maj -at- fortinbras -dot- us
Introduction
bwa is a suite of C programs that perform efficient alignments (based in part on the Burrows-Wheeler transform) of short (20-100bp) sequence reads, guided by a set of reference sequences provided in FASTA format. bwa input and output complies with the Sequence/Alignment Map (SAM) binary (.bam) and test (.sam) formats. Alignment files in SAM formats can be converted, indexed, sliced and diced by the samtools suite.
These tools are comprehensive and allow the user to tweak many different parameters, and outputs can be directed to inputs to create highly application-specific workflows. The BioPerl run wrappers Bio::Tools::Run::BWA and Bio::Tools::Run::Samtools are designed to help automate and manage such workflows, and help reduce the number of cryptic command-line options and file order inconsistencies the user must remember. Bio::Tools::Run::BWA can also act as a canned assembler, accepting input data and returning a Bio::Assembly::IO object in a single command. A new assembly IO object has been created for this purpose, Bio::Assembly::IO::sam.
Dependencies and Installation
Bio::Tools::Run::BWA (r16425) can be found in the trunk of bioperl-run. It depends on Bio::Tools::Run::BWA::Config (@ r16416) and Bio::Tools::Run::AssemblerBase (@ r16418). The canned assembler method run() also requires Bio::Tools::Run::Samtools in bioperl-run at r16427.
Bio::Assembly::IO::sam is in bioperl-live, and depends on recent small changes in Bio::Assembly::IO (@ r16414).
To download and install BioPerl core and packages from Git, see Using Git.
Like all run wrappers, these modules need the underlying programs to work. Get bwa and samtools at their Sourceforge sites: [1] [2].
Also, the Bio::DB::Sam modules must be installed on your system. These are not BioPerl modules. (They were written by a core developer, though, so they're all right.) You can get them on CPAN.
Creating an Assembly
The run() method of Bio::Tools::Run::BWA creates an assembly in sorted .bam format, using reads in FASTQ format and a reference sequence database in FASTA. Create a BWA factory, then call run from it with your files as arguments:
$bwa = Bio::Tools::Run::BWA->new() $bwa->out_type('asm.sam'); # specify output file # for single-end reads: $bwa->run( 'read1.fastq', 'refdb.fas' ) # for paired-end reads: $bwa->run( 'read1.fastq', 'refdb.fas', 'read2.fastq');
If you have Bio::Tools::Run::Samtools installed, the output file (asm.sam) will already be sorted and converted to binary SAM.
Getting an assembly object
The BWA::run() method will also create a Bio::Assembly::Scaffold object, containing Bio::Assembly::Contig alignments with associated consensus sequence objects, quality data, and sequence features.
To do this, leave out_type unset above, and capture the return value of run():
$bwa = Bio::Tools::Run::BWA->new() # for single-end reads: $aio = $bwa->run( 'read1.fastq', 'refdb.fas' ) # for paired-end reads: $aio = $bwa->run( 'read1.fastq', 'refdb.fas', 'read2.fastq');
The assembly pipeline
The Bio::Tools::Run::BWA::run() method performs the following steps:
| Action | Program | Commands |
|---|---|---|
| create a bwa index for ref seq | bwa | index |
| map sequence reads to reference seq | bwa | aln |
| assemble | bwa | samse/sampe |
| sort on coordinates | samtools | sort |
| create bam index | samtools | index |
Command-line options can be directed to the aln and samse/sampe steps using factory arguments. See Specifying Options.
Running separate bwa components
A second mode for Bio::Tools::Run::BWA allows direct access to bwa commands. To run a command, construct a run factory, specifying the desired command using the -command argument in the factory constructor, along with options specific to that command (see Specifying Options):
$bwa = Bio::Tools::Run::BWA->new( -command => 'view' );
To execute, use the run_bwa() method off the factory. Input and output files are specified in the arguments of run_maq (see Specifying Files):
$bwa->run_bwa( -bam=>'mysam.bam' -out=>'mysam.sam' );
Specifying options
bwa is complex, with many subprograms (commands) and command-line options and file specs for each. The wrapper module attempts to provide commands and options comprehensively. You can browse the choices like so:
$bwa = Bio::Tools::Run::BWA->new( -command => 'aln' ); # all bwa commands @all_commands = $bwa->available_parameters('commands'); @all_commands = $bwa->available_commands; # alias # just for aln @assemble_params = $bwa->available_parameters('params'); @assemble_switches = $bwa->available_parameters('switches'); @assemble_all_options = $bwa->available_parameters();
Reasonably mnemonic names have been assigned to the single-letter command line options. These are the names returned by available_parameters, and can be used in the factory constructor like typical BioPerl named parameters.
Options can be directed to the aln and samse/sampe components of the assembly pipeline implemented by the run() method. Identify the desired options, for example
@map_params = Bio::Tools::Run::Maq->new(-command => 'aln')->available_parameters(); # returns: # 'max_edit_dist' # 'max_gap_opens' # 'max_gap_extns' # 'deln_protect_3p' # 'deln_protect_ends' # 'subseq_seed' # 'max_edit_dist_seed' # 'n_threads' # 'mm_penalty' # 'gap_open_penalty' # 'gap_extn_penalty' # 'subopt_hit_threshold' # 'trim_parameter' # 'reverse_no_comp' # 'no_iter_search'
then in the factory construction, specify the desired parameters prefixed by aln_, sms_, or smp_, as appropriate (note, no -command parameter is specified in the run() method):
$bwa = Bio::Tools::Run::BWA->new( -aln_n_threads => 3 ); $assy = $bwa->run( "read1.fastq", "refseq.fas" );
See the bwa manpage for many gory details.
Specifying files
When a command requires filenames, these are provided to the run_bwa method, not the constructor (new()). To see the set of files required by a command, use available_parameters('filespec') or the alias filespec().
$bwa = Bio::Tools::Run::BWA->new( -command => 'aln' ); @filespec = $bwa->filespec;
This example returns the following array:
fas faq >sai
This indicates that the FASTA database (fas) and the FASTQ reads (faq) MUST be specified, and the STDOUT of this program (SA coordinates) will be slurped into a file specified in the run_bwa argument list:
$bwa->run_bwa( -fas => 'my.db.fas', -faq => 'reads.faq',
-sai => 'out.sai' );
If capture files are not specified per the filespec, text sent to STDOUT and STDERR is saved and is accessible with $bwa->stdout() and $bwa->stderr().
Running samtools components
By an odd coincidence, the Bio::Tools::Run::Samtools wrapper accesses the samtools commands in a similar way. Create a Samtools factory, specifying the command line options in the constructor argument:
$converter = Bio::Tools::Run::Samtools->new( -command => 'view', -sam_input => 1, -bam_output => 1 );
and run using the run(), specifying necessary files in its arguments:
$converter->run( -bam => 'mysam.sam', -out => 'mysam.bam' );
Parameters and filespecs can be browsed as described in Specifying options and Specifying files.
Synopsis
# create a single-read assembly object $bwa = Bio::Tools::Run::BWA->new(); $assy = $bwa->run( 'read1.fastq', 'refseqs.fas' ); # create a paired-read assembly object $assy_prd = $bwa->run( 'read1.fastq', 'refseqs.fas', 'read2.fastq' ); # create just the sorted, binary SAM file $bwa->out_type( 'assy_prd.bam' ); $bwa->run( 'read1.fastq', 'refseqs.fas', 'read2.fastq' ); # extract regions from the assy $start1 = 150000; $end1 = 200000; $start2 = 250000; $end2 = 275000; $samt = Bio::Tools::Run::Samtools->new( -command => 'view', -bam_output => 1 ); $samt->run( -bam => 'assy_prd.bam', -rgn => [ "my_seqid:$start1\-$end1", "my_seqid:$start2\-$end2" ], -out => 'assy_rgns.bam' ); # convert a text SAM to binary format $samt = Bio::Tools::Run::Samtools->new( -command => 'view', -sam_input => 1, -bam_output => 1 ); $samt->run( -bam => 'mysam.sam', -out => 'mysam.bam' ); # sort it $samt = Bio::Tools::Run::Samtools->new( -command => 'sort' ); # creates 'mysam.srt.bam': $samt->run( -bam => 'mysam.bam', -pfx => 'mysam.srt' );
SEE ALSO
Short-read assemblies with maq