Removing sequencing adapters

From BioPerl
Jump to: navigation, search

(See the bioperl-l threads here and here for discussion on next-gen modules)

This is some example code for the removal of sequencing adapters from next generation sequence reads. This is useful for cleaning up sequence from Solexa/Illumina GA machines, and may also be relevant for the removal of adapter/primer sequence from other types of sequence machine.

These code excerpts are from a prototype sequence pre-processing pipeline, and the approach described may soon become obsolete. This example code is functional, but not optimal. The first stage of the pipeline is to parse the sequence (with quality) and extract the good quality sequence. This must be done prior to using this method for adapter removal.

Before we start with the Bioperl, we align all of the good quality sequence (some.fasta) with the sequencing adapter (adapter.fasta) and output the alignments.

needle -gapopen 100 -gapextend 10 -auto -asequence adapter.fasta -bsequence stdin -outfile stdout < some.fasta > lots_of.aln

We don't want any gaps, and Needle doesn't support ungapped alignments, so we set the gap penalties as high as possible. Note that we don't use Water, because we want to extract the sequence before the adapter, and so need the full sequences in the alignment output.

Now we have our alignments we can pipe them into our adapter removal script.

use Bio::AlignIO;
my $aln_in = Bio::AlignIO->new(-fh => \*STDIN, -format => 'emboss');	# clustalw format is a handy alternative for readability
while ( my $aln = $aln_in->next_aln )
	# Get the coordinates of the start and end of the sequence and adapter in the alignment
	my $r1 = Bio::Range->new(	-start=>	$aln->column_from_residue_number( $aln->get_seq_by_pos(1)->display_id, "1" ),
					-end=>		$aln->column_from_residue_number( $aln->get_seq_by_pos(1)->display_id, $aln->select(1,1)->no_residues ),
					-strand=>	+1);
	my $r2 = Bio::Range->new(	-start=>	$aln->column_from_residue_number( $aln->get_seq_by_pos(2)->display_id, "1" ),
					-end=>		$aln->column_from_residue_number( $aln->get_seq_by_pos(2)->display_id, $aln->select(2,2)->no_residues ),
					-strand=>	+1);
	my ($start, $end, $strand) = $r1->intersection($r2);			# Coordinates where the sequence and adapter align
	my $aln2 = $aln->slice($start, $end);					# Get the alignment slice
	# Neither Emboss nor Bio::Align::PairwiseStatistics offer a suitable scoring method for adapter removal, so here is one I made earlier
	my @seq1 = split ("", $aln2->get_seq_by_pos(1)->seq);
	my @seq2 = split ("", $aln2->get_seq_by_pos(2)->seq);
	my ($score, $mismatches, $gaps) = qw (0 0 0);
	for (my $pos = 0; $pos < $aln2->length; $pos++)
		if ($seq1[$pos] eq $seq2[$pos]) { $score ++; }			# Matches
		elsif ($seq1[$pos] ne "N" && $seq2[$pos] ne "N")		# Mismatches
			$score --;
			$mismatches ++;
		if ($seq1[$pos] eq "-" || $seq2[$pos] eq "-") {	$gaps ++; }	# Gaps
	my $mismatch_rate = $mismatches / $aln2->length;
	# Now we have some alignment stats we can do something with the sequence
	if ($score >= 4 && $mismatch_rate <= 0.1 && $gaps <= 1 && $start >= 2)	# If the adapter sequence match is good enough...
		process_seq($aln->get_seq_by_pos(2)->trunc(1,($start -1)));	# Extract the sequence before the adapter and do something with it
	else									# Or
		process_seq($aln->get_seq_by_pos(2));				# Just do something with the sequence

Once we have trimmed the adapter from our sequences we can pass them on to the next stage of the processing pipeline.

The main issues with this approach are:

  • Needle may produce gapped alignments (bad) and is presumably slower than an ungapped matching algorithm would be.
  • The alignment scores reported by Emboss and by Bio::Align::PairwiseStatistics aren't useful for identifying true sequence adapter matches.
  • Adapter removal cannot be done in conjunction with quality trimming, which might be preferable.

Issues aside, this approach does seem to work, and performance is adequate.

--Giles.weaver 13:23, 8 July 2009 (UTC)

Personal tools
Main Links