SimpleAlign Updates

From BioPerl
Jump to: navigation, search

Contents

Author

Jun Yin, [1]. Email:jun.yin-at-yale.edu

Abstract

Bio::SimpleAlign is a BioPerl package on manipulating Multiple Sequence Alignment (MSA) sequences. Refactoring of Bio::SimpleAlign and implementation of Bio::DB::Align were funded as part of the Google Summer of Code project, under the guidance of Chris Fields in Open Bioinformatics Foundation.

A new test was implemented for Bio::SimpleAlign, t/Align/SimpleAlign2.t.

Here is a HOWTO on the new Bio::SimpleAlign, HOWTO:Multiple_Sequence_Alignment.

Update Note

Bio::SimpleAlign module update note


1. Module association

Bio::SimpleAlign module is dependent on several other BioPerl modules. Particularly on Bio::AlignIO for input and output of alignment sequences, Bio::LocatableSeq for sequence manipulation, and Bio::AlignI for module interface.

The update this time only applies on Bio::SimpleAlign. The corresponding interface Bio::AlignI can be updated later.

There was a special package created by me in 2010, called Bio::DB::Align, which includes wrappers dependent on Bio::DB::EUtilities to download alignment sequences from different databases. I haven’t checked the compatibility of Bio::DB::Align, so this package is not in this update either.

2. Major improvement

The major improvement for Bio::SimpleAlign is to provide more user friendly and consistent methods on manipulating alignment sequences. As shown in SimpleAlign_structure_121002.xlsx, the improvement includes:

2.1 six new methods for sorting, removing sequences and sequence identity calculation

2.2. A consistent sequence coordinate system for methods Previously, different methods in Bio::SimpleAlign used different coordiates. $aln->remove_columns used 0 based start coordinate, and $aln->slice used 1 based start coordinate. Because the dependent sequence package Bio::LocatableSeq used 1 based coordinate, all the methods for column selection, removal and masking are changed to 1 based method.

2.3 A consistent special character system Previouly, the special characters in Bio::SimpleAlign are defined by each method itself, for example, gap chars and mask chars. It is easy to confuse the methods/programmers what is the current character usage. Now, the characters are defined by the attribute of alignment object. For example, all the gap char usages in the Bio::SimpleAlign are defined by $aln->gap_char

2.4 A more convenient way of programming 2.4.1 selecting multiple non-continuous columns/sequences Now a new feature of selecting non-continuous columns and sequences in the alignment is added. Previously only continuous columns/sequences can be selected, for example $aln2 = $aln->slice(20,30). Now multiple non-continuous columns can be selected as $newaln = $aln->select_columns([20..30,45..48]);

2.4.2 alignment modifier or alignment selector Previously, it is a bit confused whether the Bio::SimpleAlign method will modify the original alignment. Now the alignment manipulating methods are separated into alignment modifiers, which will change the original alignment object, and alignment selectors, which will generate a new user selected alignment object. 2.4.3 an object oriented naming system. As suggested by BioPerl, any method dealing with an object should be capitalized. For example, $aln->each_seq is returning sequence objects, instead of sequences themselves. So a more meaningful was is to call $aln->next_Seq.


Failed tests in t/SimpleAlign.t

There are 2 warnings and 10 failed tests if you use the old t/SimpleAlign.t to test the new Bio::SimpleAlign. These are all caused by $aln->remove_columns method. Some failed tests are caused by a bug on coordinate calculation in remove_columns, and some are caused by the new parameter settings in this method.

ok 72

--------------------- WARNING ---------------------
 MSG: select_columns start is 1 based and in a single array. Please check the method description.
To be removed in 1.008
---------------------------------------------------

Explanation:

It is caused by line 252

$aln2   = $aln1->remove_columns( [ 0, 0 ] );

Now remove_columns is 1 based for module consistency. If there is a “0” in the coordinates, all the coordinates will be added by 1. The computation will be continued but with a warning.


ok 73 - remove_columns by position

--------------------- WARNING ---------------------
MSG: select_columns start is 1 based and in a single array. Please check the method description.
To be removed in 1.008

It is caused by line 264

my $aln3 = $aln1->remove_columns( [ 1, 1 ], [ 30, 30 ], [ 5, 6 ] );

Now remove_columns is called by non-continuous list. If multiple anonymous arrays are seen in the parameter of remove_columns, they will be merged, and added by 1. The computation will be continued but with a warning. Because the coordinate of the old method was 0 based.

The new way of selecting columns is:

my $aln3 = $aln1->remove_columns( [ 2, 6, 7, 31 ] )


not ok 147
#   Failed test at t/Align/SimpleAlign.t line 611.
#          got: '50'
#     expected: '51'
not ok 148
#   Failed test at t/Align/SimpleAlign.t line 612.
#          got: '62'
#     expected: '63'

not ok 156
#   Failed test at t/Align/SimpleAlign.t line 632.
#          got: '52'
#     expected: '53'
not ok 157
#   Failed test at t/Align/SimpleAlign.t line 633.
#          got: '60'
#     expected: '61'

Explanation: Tests 147, 148, 156 and 157 are caused by a bug in SimpleAlign in remove_columns. The way how remove_columns calculated coordinates for Bio::LocatableSeq with reverse strand was wrong. (This bus is not reported by BugZilla)

For example:

The original test is from line 557.

# is _remove_col really working correctly?
my $a = Bio::LocatableSeq->new(
    -id    => 'a',
    -strand => 1,
    -seq   => 'atcgatcgatcgatcg',
    -start => 5,
    -end   => 20
);
my $b = Bio::LocatableSeq->new(
    -id    => 'b',
    -strand => 1,
    -seq   => '-tcgatc-atcgatcg',
    -start => 30,
    -end   => 43
);
my $c = Bio::LocatableSeq->new(
    -id    => 'c',
    -strand => -1,
    -seq   => 'atcgatcgatc-atc-',
    -start => 50,
    -end   => 63
);
my $d = Bio::LocatableSeq->new(
    -id    => 'd',
    -strand => -1,
    -seq   => '--cgatcgatcgat--',
    -start => 80,
    -end   => 91
);
my $e = Bio::LocatableSeq->new(
    -id    => 'e',
    -strand => 1,
    -seq   => '-t-gatcgatcga-c-',
    -start => 100,
    -end   => 111
);
$aln = Bio::SimpleAlign->new();
$aln->add_seq($a);
$aln->add_seq($b);
$aln->add_seq($c);
 
my $gapless = $aln->remove_gaps();
foreach my $seq ( $gapless->each_seq ) {
    if ( $seq->id eq 'a' ) {
        is $seq->start, 6;
        is $seq->end,   19;
        is $seq->seq,   'tcgatcatcatc';
    }
    elsif ( $seq->id eq 'b' ) {
        is $seq->start, 30;
        is $seq->end,   42;
        is $seq->seq,   'tcgatcatcatc';
    }
    elsif ( $seq->id eq 'c' ) {
        is $seq->start, 51;   #<----------------where the error reports
        is $seq->end,   63;
        is $seq->seq,   'tcgatcatcatc';
    }
}

As you can see from seq “c”, its strand is -1. The sequence of “c” is “atcgatcgatc-atc-“, and coordinates are 50 to 63. As defined by Bio::LocatableSeq, the first base “a” is actually at the coordinate of 63, and the last base “c” is at the coordinate of 50. After the removal of gap, the new “c” is “tcgatcatcatc” which should start from the second base “t”, which is 62, and end at the last base “c”, which is 50. The original implementation of remove_columns was wrong, and gave the answer of 51 and 63.


not ok 192
#   Failed test at t/Align/SimpleAlign.t line 727.
#          got: '6'
#     expected: '5'
not ok 193
#   Failed test at t/Align/SimpleAlign.t line 728.
#          got: '8'
#     expected: '5'
not ok 194
#   Failed test at t/Align/SimpleAlign.t line 729.
#          got: 'tc'
#     expected: 'a'
not ok 195
#   Failed test at t/Align/SimpleAlign.t line 732.
#          got: '30'
#     expected: '0'
not ok 196
#   Failed test at t/Align/SimpleAlign.t line 733.
#          got: '32'
#     expected: '0'
not ok 197
#   Failed test at t/Align/SimpleAlign.t line 734.
#          got: 'tg'
#     expected: '-'

These errors are caused by the update of coordinate system of remove_columns from 0 based to 1 based.

In line 724:

my $removed = $aln->remove_columns( [ 1, 3 ] );

This means two different things in the old method and in the new method. In the old method, it is to remove the 1 to 3 (or 1, 2, 3) columns of the alignment. But in the new method, it is to remove 1st and 3rd columns of the alignment. It is impossible to detect this inconsistency.

Detailed Updates

General Name Changed New method Deprecated method name Parameter change Debug Test
new
Alignment modifier methods
add_Seq ! add_seq Added a return value Y
remove_LocatableSeq ! remove_seq Y
remove_Seqs ! New Enabled multiple parameters Y
remove_redundant_Seqs ! purge Y
uniq_Seq ! uniq_seq Y
sort_alphabetically
sort_by_list
sort_by_pairwise_identity ! New Y
sort_by_length ! New Y
sort_by_start
set_new_reference
Alignment selection methods
next_Seq ! each_seq Y
next_alphabetically ! each_alphabetically Y
next_Seq_with_id ! each_seq_with_id Y
get_Seq_by_pos ! get_seq_by_pos Y
get_Seq_by_id ! get_seq_by_id Y
select_Seqs ! select, select_noncont Enabled multiple parameters Y
select_noncont_by_name New (by DaveMessina) Y
select_columns ! slice Enabled multiple parameters Y
remove_columns ! Enabled multiple parameters Y
remove_gaps ! splice_by_seq_pos Enabled multiple parameters Y
mask_columns Enabled multiple parameters Y
seq_with_features
Change sequences within the MSA
map_chars
uppercase Y
lowercase ! New Y
togglecase ! New Y
match
unmatch
Consensus sequences
consensus_string
consensus_iupac
consensus_meta
bracket_string
cigar_line
match_line
gap_line
all_gap_line
gap_col_matrix
MSA attributes
id
accession
description
source
missing_char ! Set default value Y
match_char ! Set default value Y
gap_char ! Set default value Y
mask_char ! Set default value Y
symbol_chars
Alignment descriptors
score
is_flush
length
maxdisplayname_length
max_metaname_length
num_residues
num_sequences ! Y
average_percentage_identity
percentage_identity
overall_percentage_identity ! gap_char Y
pairwise_percentage_identity ! New gap_char Y
column_from_residue_number
Sequence names
displayname
set_displayname_count
set_displayname_flat
set_displayname_normal
set_displayname_safe
restore_displayname
methods implementing Bio::FeatureHolderI
get_SeqFeatures
add_SeqFeature
remove_SeqFeatures
feature_count
get_all_SeqFeatures
methods for Bio::AnnotatableI
annotation
Personal tools
Namespaces
Variants
Actions
Main Links
documentation
community
development
Toolbox