Regular expressions and Repeats
From BioPerl
How do I find an iteration of any sequence of a specific length?
/(QA)+/ matches one or more iterations of QA, but what if you want to match a repeat of any unit of length 2?
/(..)\1+/
Then $1 tells you what the repeated unit was, and length($&)/2 gives the number of copies (always at least 2, since \1+ requires the unit to occur again).
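As a rough sketch, the same pattern can be wrapped in a subroutine that takes the unit length as a parameter. The name find_repeat and the test sequence are made up for illustration. The copy count is taken from a capture group rather than from $&, which carried a performance penalty in older Perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper: find the first tandem repeat whose unit is $len residues long.
# Returns (unit, number of copies), or an empty list if nothing repeats.
sub find_repeat {
    my ($seq, $len) = @_;
    # $1 is the whole repeat, $2 is one unit; \2+ demands at least one more copy.
    if ($seq =~ /((.{$len})\2+)/) {
        return ($2, length($1) / $len);
    }
    return;
}

my ($unit, $copies) = find_repeat("MKQAQAQAQAL", 2);
print "$unit x $copies\n";    # prints "QA x 4"
```

Note that a homopolymer such as AAAA also counts as a repeat of the two-residue unit AA under this pattern.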
How do I find some sequence flanked by homopolymers of a given length?
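For example, to find FAFCRCFCFAFAFCRF flanked on both sides by at least n Q residues, as in:
AGTWRWDFDQQQQQQQQFAFCRCFCFAFAFCRFQQQQQQQQQQQQQ
The regular expression would be something like the following, where $n is the minimum number of flanking Q residues and $x is the minimum length of the flanked sequence: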
/(Q{$n,})([^Q]{$x,})(Q{$n,})/
Example:
perl -e '$n=5; $x=9; $_= "AGTWRWDFDQQQQQQQQFAFCRCFCFAFAFCRFQQQQQQQQQQQQQ"; print "$1|$2|$3\n" if /(Q{$n,})([^Q]{$x,})(Q{$n,})/;'
QQQQQQQQ|FAFCRCFCFAFAFCRF|QQQQQQQQQQQQQ
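The one-liner hard-codes Q as the flanking residue. A minimal sketch of a more general version follows, assuming you also want the offset of the flanked sequence. The subroutine name find_flanked and its argument order are invented here, and the test sequence is assembled to resemble the example above.

```perl
use strict;
use warnings;

# Hypothetical helper: find a stretch of at least $x residues that contains no
# $res and is flanked on both sides by at least $n copies of $res.
# Returns (left flank, core, right flank, 0-based start of core) or an empty list.
sub find_flanked {
    my ($seq, $res, $n, $x) = @_;
    # \Q...\E quotes the residue so it is always treated as a literal character.
    my $flank = qr/(?:\Q$res\E){$n,}/;
    my $core  = qr/[^\Q$res\E]{$x,}/;
    if ($seq =~ /($flank)($core)($flank)/) {
        # $-[2] is the offset at which capture group 2 begins.
        return ($1, $2, $3, $-[2]);
    }
    return;
}

my $seq = "AGTWRWDFD" . ("Q" x 8) . "FAFCRCFCFAFAFCRF" . ("Q" x 13);
my ($left, $core, $right, $start) = find_flanked($seq, "Q", 5, 9);
print "$core starts at offset $start\n";    # prints "FAFCRCFCFAFAFCRF starts at offset 17"
```

Because the core is matched with a negated character class, a flanked sequence that itself contains the flanking residue will not be found by this sketch.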
How do I find any homopolymer flanked on both sides by the same amino acid?
For example, HTTTTTTTTTTH or TGGGGGGGGGGGT.
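/(.)((?!\1).)\2*\1/
Here $1 is the flanking residue and $2 is the repeated residue. A backreference cannot be used inside a character class: [^\1] is read as an octal character escape, not as "anything but $1", so it would accept any mixture of residues between the flanks. The negative lookahead (?!\1) requires a residue that differs from the flank, and \2* requires the rest of the run to be that same residue.
In action:
perl -e '$_ = "HTTH"; print "|$1|\n" if /((.)((?!\2).)\3*\2)/;'
|HTTH|
Note that the "homopolymer" could have a length of 1!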
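To list every such hit in a longer sequence, something along these lines may work. It is only a sketch: the subroutine name flanked_homopolymers and the test sequence are invented for illustration. It uses a negative lookahead, (?!\1), to require the run residue to differ from the flank, and it checks the closing flank with a lookahead so that the same residue can also open the next hit.

```perl
use strict;
use warnings;

# Hypothetical helper: list every homopolymer run that has the same residue
# immediately before and after it. Each hit is [flank, run residue, run length].
sub flanked_homopolymers {
    my ($seq) = @_;
    my @hits;
    # $1 = flank, $2 = whole run, $3 = run residue (must differ from the flank).
    # (?=\1) looks at the closing flank without consuming it.
    while ($seq =~ /(.)(((?!\1).)\3*)(?=\1)/g) {
        push @hits, [$1, $3, length $2];
    }
    return @hits;
}

for my $hit (flanked_homopolymers("AHTTTTHGGHA")) {
    my ($flank, $res, $len) = @$hit;
    print "$flank|$res x $len|$flank\n";
}
# prints:
# H|T x 4|H
# H|G x 2|H
```

If the closing flank were consumed instead, the second hit in this test sequence would be missed, because its opening H would already be part of the first match.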