HOWTO:Multiple Sequence Alignment
Contents |
Author
Jun Yin, [1]. Email:jun.yin-at-ucd.ie
Abstract
This HOWTO intends to show how to use the BioPerl Bio::SimpleAlign and Bio::DB::Align to manipulate Multiple Sequence Alignment (MSA) sequences and retrieve online alignment sequences.
Please notice that, the modules mentioned in this tutorial, Bio::AlignIO, Bio::SimpleAlign and Bio::DB::Align, read/write/edit alignment sequences, but not to do the alignment. If you want to align your sequences, you may consider clustalw2. The packages introduced here deal with the output file from clustalw2 and other alignment softwares.
In BioPerl, alignment sequence reading in and writing out are performed using Bio::AlignIO. Online alignement sequence can be retrieved using Bio::DB::Align. These two packages (actually bundle of packages) will save the alignment into Bio::SimpleAlign objects. Bio::SimpleAlign provides dozens of methods to manipulate the alignment, calculate consensus sequences and annotate the alignment.
Refactoring of Bio::SimpleAlign and implementation of Bio::DB::Align were funded as part of the Google Summer of Code project, under the guidance of Chris Fields in Open Bioinformatics Foundation.
Reading and Writing MSA sequences
Reading in and writing out MSA sequences in local file are performed using Bio::AlignIO. Here is an example code:
use Bio::AlignIO; use Bio::SimpleAlign; # Use Bio::AlignIO to read in the alignment my $str = Bio::AlignIO->new(-file => 't/data/testaln.pfam'); my $aln = $str->next_aln(); # do something here print $aln->length; print $aln->num_residues; print $aln->is_flush; foreach my $seq ($aln->next_Seq) { #do something with $seq print $seq->get_nse; }
Retrieving online alignment sequences
Bio::DB::Align uses RESTful service provided by the database to retrieve online alignment sequences. The alignment sequences are saved into Bio::SimpleAlign object. Now BioPerl supports Pfam, Rfam, Prosite and NCBI Protein Clusters (ProtClustDB).
Pfam
This package uses the RESTful service provided by Pfam. It retrieves online alignment sequences and save it into Bio::SimpleAlign object.
The retrieval can be based on protein domain id, e.g. "Piwi", or accession number, e.g. "PF02171".
use Bio::DB::Align; use Bio::DB::Align::Pfam; use Bio::SimpleAlign; my $dbobj = Bio::DB::Align::Pfam->new(); #create a db object my $dbobj2 = Bio::DB::Align->new(-db=>"Pfam"); #create a db object #the same with above #retrieve a Bio::SimpleAlign object my $aln=$dbobj->get_Aln_by_id("Piwi"); #do something here with the align object print $aln->length,"\n"; print $aln->num_sequences,"\n"; foreach my $seq ($aln->next_Seq) { #do something with $seq }
Or you can do more complicated calling such as:
#parameter based calling my $aln2 = $dbobj->get_Aln_by_acc(-accession=>"PF02171", -alignment=>"full", -format=>"stockholm", -order=>"a", -case=>"u", -gap=>"dots"); print $aln2->id,"\n"; print $aln2->accession,"\n";
You can also convert the id into accession and vice versa.
#id accession conversion my $acc = $dbobj->id2acc("Piwi"); my $id = $dbobj->acc2id("PF02171");
Rfam
This package uses the RESTful service provided by Rfam. It retrieves online alignment sequences and save it into Bio::SimpleAlign object. Bio::DB::Align::Rfam only supports alignment sequence retrieval using RNA accession number, e.g. "RF00360".
use Bio::DB::Align; use Bio::DB::Align::Rfam; use Bio::SimpleAlign; my $dbobj=Bio::DB::Align::Rfam->new(); #create a db object my $dbobj2=Bio::DB::Align->new(-db=>"Rfam"); #create a db object #the same with above #retrieve a Bio::SimpleAlign object my $aln=$dbobj->get_Aln_by_acc("RF00360"); #or parameter based calling my $aln2=$dbobj->get_Aln_by_acc(-accession=>"RF00360", -alignment=>"full", -nselabel=>1, -format=>"stockholm", -gap=>"dashes");
Prosite
This package uses the RESTful service provided by Prosite. It retrieves online alignment sequences and save it into Bio::SimpleAlign object. Bio::DB::Align::Prosite only supports alignment sequence retrieval using protein domain accession number, e.g. "RF00360".
use Bio::DB::Align; use Bio::DB::Align::Prosite; use Bio::SimpleAlign; my $dbobj=Bio::DB::Align::Prosite->new(); #create a db object my $dbobj2=Bio::DB::Align->new(-db=>"Prosite"); #create a db object #the same with above #retrieve a Bio::SimpleAlign object my $aln=$dbobj->get_Aln_by_acc("PS51092"); #or parameter based calling my $aln2=$dbobj->get_Aln_by_acc(-accession=>"PS00023", -format=>"clustalw");
ProtClustDB
This package supports alignment sequence retrieval of protein clusters from NCBI Entrez Protein Clusters (ProtClustDB). It retrieves online alignment sequences and save it into Bio::SimpleAlign object.This package depends on Bio::DB::EUtilities, which is used to convert id and accession.
The retrieval can be based on protein cluster id, e.g. "2725839", or accession number, e.g. "CLSN2725839".
You may need to provide your email address to use this service, as per NCBI E-utilities policy.
use Bio::DB::Align; use Bio::DB::Align::ProtClustDB; use Bio::SimpleAlign; #create a db object, using either of the two methods below my $dbobj=Bio::DB::Align::ProtClustDB->new(-email=>'bioperl-test@foo.bar'); my $dbobj2=Bio::DB::Align->new(-db=>"ProtClustDB", -email=>'bioperl-test@foo.bar'); #retrieve a Bio::SimpleAlign object my $aln=$dbobj->get_Aln_by_id("2725839"); #or parameter based calling my $aln2=$dbobj->get_Aln_by_acc(-accession=>"CLSN2725839"); #id accession conversion my $acc=$dbobj->id2acc("2725839"); my $id=$dbobj->acc2id("CLSN2725839");
Alignment description
After you read in the alignment, a Bio::SimpleAlign object will be built. You can use the methods below to check the attributes of the alignment. These may include general MSA attributes, e.g. id, accession, or sequence related descriptors, e.g. No. of residues, No. of sequences, and sequence percentage identity.
Here are some example codes returning alignment attributes.
my $id = $aln->id; #id of the alignment my $accession = $aln->accession; #accession of the alignment my $desc = $aln->description; #description my $source = $aln->source; #sequence source, usually the format of the sequence my @symchars = $aln->symbol_chars #Returns all the seen symbols except gap/missing/match characters
Here are some example codes returning sequence based descriptors.
my $score = $aln->score; #score of the alignment my $length = $aln->length; #length of the alignment block my $num_res = $aln->num_residues; #number of residues in total in the alignment my $num_seq = $aln->num_sequences; #number of sequences in the alignment
Bio::SimpleAlign provides several methods to calculate sequence percentage identity.
#This function calculates the percentage identity of the conserved columns my $iden1 = $aln->average_percentage_identity; #This function uses a fast method to calculate the average percentage identity of the alignment my $iden2 = $aln->overall_percentage_identity; #Returns pairwise percentage identity of each sequence to the reference sequence(first sequence as default) my @pairwise_iden1 = $aln->pairwise_percentage_identity; #Returns pairwise percentage identity of each sequence to the 3rd sequence my @pairwise_iden2 = $aln->pairwise_percentage_identity(3);
Selecting sequences or horizontal/vertical subsets of the current MSA
You may select horizontal (sequences) or vertical (columns) subtsets of the current MSA. Here is some example code of doing that. Beware that, in BioPerl sequence and columns are usually "1" based, e.g. the first sequence is 1, and the first column is also 1.
Selecting sequences in the alignment
Here are some codes to select horizontal subsets (sequences) of the alignment:
#selecting sequence object one by one foreach my $seq ($aln->next_Seq) { #do something with $seq print $seq->seq(); } #the sequence objects are returned in alphabetically sorted order foreach my $seq ($aln->next_alphabetically) { #do something with $seq print $seq->seq(); } #selecting a specific sequence by position in the alignment my $seq3 = $aln->get_Seq_by_pos(3); #the third sequence #selecting a specific sequence by id my $seq1 = $aln->get_Seq_by_id("JAK_HUMAN"); #selecting multiple sequences by position, and save it into another alignment object my $aln2 = $aln->select_Seqs([1..5,9,11]); #parameter based way of doing that, and reverse the selection #If you have 11 sequences in the alignment, this code below is the same with the code above my $aln3 = $aln->select_Seqs(-selection=>[6..8,10],-toggle=>1);
Selecting columns in the alignment
Here are some codes to select/de-select columns in the alignment. The result will be saved into a new alignment object. The gap only sequences are usually removed after the selection, unless you defined -keepgaponly=>1 in the selection.
#For example, if an alignment has 35 columns, all the methods below are doing the same thing. #these methods all select the 3rd to 5th, and 28th to 32nd columns and save them into a new alignment object my $newaln = $aln->select_columns([3...5,28..32]); my $newaln2 = $aln->select_columns(-selection=>[1..2,6..27,33..35],-toggle=>1); my $newaln3 = $aln->remove_columns([1..2,6..27,33..35]); my $newaln4 = $aln->remove_columns([3...5,28..32],1); my $newaln5 = $aln->remove_columns(-selection=>[3...5,28..32],-toggle=>1); #You can also select the columns based on the type of these columns my $aln2 = $aln->select_columns(['mismatch']); my $aln2 = $aln->remove_columns(['weak']); #keep gap only sequences my $aln2 = $aln->select_columns(-selection=>['mismatch'],-keepgaponly=>1); my $aln2 = $aln->select_columns(['mismatch'],0,1); # doing the same thing as above #remove gaps my $aln3 = $aln->remove_gaps;
Modifying the current alignment
These methods will modify the original alignment object, so use them with care.
Add or Remove sequences
#For example, we have three Bio::LocatableSeq objects, we can add them into an alignment object. #Beware that, the sequences should be in the same length, otherwise they will be padded to the same length. my $s1 = Bio::LocatableSeq->new( -id => 'AAA', -seq => 'aawtat-tn-', -start => 12, -end => 19, -alphabet => 'dna' ); my $s2 = Bio::LocatableSeq->new( -id => 'BBB', -seq => '-aaaat-tt-', -start => 1, -end => 7, -alphabet => 'dna' ); my $s3 = Bio::LocatableSeq->new( -id => 'BBB', -seq => '-aaaat-t--', -start => 31, -end => 36, -alphabet => 'dna' ); $a = Bio::SimpleAlign->new(); $a->add_Seq($s1); $a->add_Seq($s2); $a->add_Seq($s3); #Now $a has three sequence objects. #We can also remove sequence objects from the alignment object. $a->remove_LocatableSeq($s1); #Now $a has two sequence objects
Sort sequences
$aln->sort_alphabetically; #Sort the alignment by the alphabetical order of sequence ID $aln->sort_by_list($list_file); #Sort the alignmetn by the given list file $aln->sort_by_pairwise_identity; #Sort by the pairwise percentage identity of the first (default) or the given sequence $aln->sort_by_length; #Sort by the length the sequences $aln->sort_by_start; #Sort by the start of the sequences
Remove sequences
$aln->remove_Seqs([1..3,5..8]); #Remove the selected sequences from the alignment $aln->remove_redundant_Seqs(0.7); #Remove redundant sequences by defined percentage $aln->uniq_Seq; #Only retain the non-redundant sequences
Calculating consensus sequences
Bio::SimpleAlign provides several methods calculating consensus sequences. For details of these methods, please refer to the description of these methods in the POD file.
my $str = $aln->consensus_string(0.5);#The consensus residue has to appear at least threshold 50% my $str = $aln->consensus_iupac; #Makes a consensus using IUPAC ambiguity codes my $str = $aln->bracket_string; #a string in BIC format. This is used for allelic comparisons. my $str = $aln->match_line; #Generates a match line my $str = $aln->gap_line;#Generates a gap line my $str = $aln->all_gap_line; #Generates a gap line, where '-' represents all-gap column my %cigars = $aln->cigar_line; #Generates a "cigar" (Compact Idiosyncratic Gapped Alignment Report) line for each sequence in the alignment.
Modifying alignment sequences
Change the letters in the sequences
These methods can be used to modify the sequences in the alignment. For example, you can substitute a specific character(map_chars), capitalize the letters (uppercase), or change the alignment into match characters (match). Here are some code examples:
$aln->map_chars( '\.', '-' ); #Substitute the '.' into '-'. Maybe useful to assign your own gap char $aln->uppercase; #Change all the letters in the alignment to uppercase $aln->lowercase; #Change all the letters in the alignment to lowercase $aln->togglecase; #Change all the letters in the alignment to opposite case $aln->match; #Goes through all columns and changes residues that are identical to residue in first sequence (reference sequence) to match character $aln->unmatch; #Undo the previous change
Set reference sequence
You can set the new reference sequence in the alignment, which is to move the assigned sequence to the first position. The reference sequence (the first sequence) is used as the default sequence in methods such as $aln->match, $aln->pairwise_percentage_identity, and $aln->remove_gaps.
$aln->set_new_reference(3); #Select the 3rd sequence as the reference (1st) sequence