Finding homopolymer stretches in contigs
From BioPerl
(see bioperl-l thread here, and the scrap Regular expressions and Repeats)
Abhi Pratap asks:
Is there a quick way to find the homopolymer stretches in the contigs and also report their base start and end positions?
from Heikki Lehväslaiho:
If you can load the sequence strings into memory, I'd use a regular expression to detect the homopolymers and then use the pos function to find the location of the hits:
    $s = "AGGGGGGGAAAAACGATCGGGGGGGTGTGGGGGCCCCCGTG";
    $min = 4;   # minimum run length to report
    while ( $s =~ /(A{$min,}|T{$min,}|G{$min,}|C{$min,})/g ) {
        $end = pos($s);                      # offset just past the match = 1-based end
        $start = $end - length($1) + 1;      # 1-based start
        print "$start, $end, $1\n";
    }
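Since the question is about contigs in the plural, the same loop can be wrapped in a sub and called once per sequence. The following is only a sketch under assumptions: the `find_homopolymers` name, the `%contigs` hash, and its IDs and sequences are made up for illustration, and the contigs are assumed to be plain uppercase strings already held in memory. Coordinates are 1-based and inclusive, as in the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one [start, end, run] triple per homopolymer
# of at least $min bases. Coordinates are 1-based and inclusive.
sub find_homopolymers {
    my ($seq, $min) = @_;
    my @hits;
    while ($seq =~ /(A{$min,}|T{$min,}|G{$min,}|C{$min,})/g) {
        my $end   = pos($seq);               # offset just past the match
        my $start = $end - length($1) + 1;
        push @hits, [$start, $end, $1];
    }
    return @hits;
}

# Made-up contigs, for illustration only.
my %contigs = (
    contig1 => "ACGTTTTTACGT",
    contig2 => "GGGGGACCCCA",
);

# One tab-separated line per hit: contig ID, start, end, run.
for my $id (sort keys %contigs) {
    for my $hit (find_homopolymers($contigs{$id}, 4)) {
        print join("\t", $id, @$hit), "\n";
    }
}
```

Returning the hits instead of printing them inside the loop keeps the search separate from the report format, so the same sub could feed a table, a GFF writer, or a filter.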
one-liner from Russell Smithies:
You can also use the built-in regex variables and back-references to get the positions of the matches:
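    # The captured base counts as one, so \1 needs $min - 1 repeats; $-[0] is 0-based, hence the + 1
    $n = $min - 1; print join(", ", $-[0] + 1, $+[0], $&), "\n" while $s =~ /([ACGT])\1{$n,}/g;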
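Two details of the back-reference form are easy to miss. The captured base already counts as one, so a run of at least $min bases needs \1 repeated $min - 1 or more times. Also, $-[0] is a 0-based offset, while $+[0] is the offset just past the match and therefore equals the 1-based end. A minimal sketch under those assumptions follows; the `homopolymer_spans` name is illustrative, $min is assumed to be at least 1, and substr is used instead of $& because $& can slow down regex matching in older Perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper: back-reference variant returning [start, end, run]
# triples with 1-based inclusive coordinates. Assumes $min >= 1.
sub homopolymer_spans {
    my ($seq, $min) = @_;
    my $extra = $min - 1;    # repeats required after the captured first base
    my @spans;
    while ($seq =~ /([ACGT])\1{$extra,}/g) {
        # $-[0]: 0-based start of the whole match.
        # $+[0]: offset just past the match, i.e. the 1-based inclusive end.
        my ($from, $to) = ($-[0], $+[0]);
        push @spans, [$from + 1, $to, substr($seq, $from, $to - $from)];
    }
    return @spans;
}

print join(", ", @$_), "\n" for homopolymer_spans("ACGTTTTACGGGGG", 4);
# prints:
# 4, 7, TTTT
# 10, 14, GGGGG
```

The run of exactly four T's is reported here, which it would not be if the quantifier on \1 used $min directly.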