Finding homopolymer stretches in contigs

(see bioperl-l thread here, and the scrap Regular expressions and Repeats)

Abhi Pratap asks:

Is there a quick way to find the homopolymer stretches in the contigs and also report their base start and end positions?


From Heikki Lehväslaiho:

If you can load the sequence strings into memory, I'd use a regular expression to detect the homopolymers and then use the pos function to find the location of the hits:

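 use strict;
 use warnings;
 
 my $s   = "AGGGGGGGAAAAACGATCGGGGGGGTGTGGGGGCCCCCGTG";
 my $min = 4;    # minimum run length to report
 
 # Match runs of at least $min identical bases; pos() is the offset just past the match
 while ( $s =~ /(A{$min,}|T{$min,}|G{$min,}|C{$min,})/g ) {
     my $end   = pos($s);                  # 1-based end of the run
     my $start = $end - length($1) + 1;    # 1-based start of the run
     print "$start, $end, $1\n";
 }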
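
The loop above can also be wrapped in a subroutine so that the hits are collected rather than printed. The following is only a minimal sketch: the name find_homopolymers and the [start, end, base] return layout are assumptions made for illustration and are not part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): return every homopolymer run
# of at least $min bases as [start, end, base], using 1-based inclusive positions.
sub find_homopolymers {
    my ($seq, $min) = @_;
    my @hits;
    while ($seq =~ /(A{$min,}|C{$min,}|G{$min,}|T{$min,})/g) {
        my $end   = pos($seq);               # offset just past the match = 1-based end
        my $start = $end - length($1) + 1;   # 1-based start
        push @hits, [ $start, $end, substr($1, 0, 1) ];
    }
    return @hits;
}

# Usage: one tab-separated line per run (base, start, end)
my @hits = find_homopolymers("AGGGGGGGAAAAACGATCGGGGGGGTGTGGGGGCCCCCGTG", 4);
printf "%s\t%d\t%d\n", $_->[2], $_->[0], $_->[1] for @hits;
```

Returning a list of array references keeps the search separate from the reporting, so the same hits could be filtered by length or written out in another format.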

One-liner from Russell Smithies:

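You can also use a back-reference together with the built-in regex variables to get the positions of the matches. The capture group consumes the first base of a run, so \1 has to repeat at least $min - 1 more times. $-[0] is the 0-based start offset of the match and $+[0] is the offset just past its end, so adding 1 to $-[0] gives the same 1-based positions as above:

 my $rep = $min - 1;
 print join(", ", $-[0] + 1, $+[0], $&), "\n" while $s =~ /([ACGT])\1{$rep,}/g;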
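
Since the original question was about contigs, the back-reference approach might be applied to several sequences at once, roughly as sketched below. The %contigs hash and its contents are made-up example data; in practice the strings would come from your assembly. The run length is derived from the @- and @+ offsets.

```perl
use strict;
use warnings;

# Hypothetical example data: contig id => sequence string
my %contigs = (
    contig1 => "ACGTAAAATTTTTC",
    contig2 => "GGGGGCATCCCC",
);

my $min = 4;           # minimum run length to report
my $rep = $min - 1;    # the capture group already matches the first base

my @rows;
for my $id (sort keys %contigs) {
    my $s = $contigs{$id};
    while ($s =~ /([ACGT])\1{$rep,}/g) {
        # $-[0] is the 0-based start offset; $+[0] is the offset just past the match
        my $length = $+[0] - $-[0];
        push @rows, join("\t", $id, $-[0] + 1, $+[0], $1, $length);
    }
}

# Columns: contig id, 1-based start, 1-based end, base, run length
print "$_\n" for @rows;
```

The sketch reports the base from $1 instead of printing the whole match with $&, which keeps the output compact for long runs.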