HOWTO:Testing molecular clock

From BioPerl
Jump to: navigation, search

Testing for a molecular clock

This is just a dumping ground for some scripts for testing for a molecular clock. Justification and background should be placed here.

Code and Scripts

Scripts focused on PAML

Run Codeml (PAML) with and without clock on a directory of files ending in '.cds.phy'. Will create a bunch of files ending in .codeml in the clock and noclock directories. It assumes that the alignments are in a directory call 'aln' and that the genes are named with the species abbreviation that is used in the phylogeny.tre file. I will put up examples of phylogeny.tre and a file from the aln directory.

#!/usr/bin/perl -w
 
my $filecontent = <<EOF
      seqfile = %s * sequence data filename
     treefile = %s * tree structure file name
      outfile = %s           * main result file name
        noisy = 2  * 0,1,2,3,9: how much rubbish on the screen
      verbose = 0  * 0: concise; 1: detailed, 2: too much
      runmode = 0  * 0: user tree; 
      seqtype = 1  * 1:codons; 2:AAs; 3:codons-->AAs
    CodonFreq = 2  * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
        clock = %d  * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
       aaDist = 0  * 0:equal, +:geometric; -:linear, 1-6:G1974,Miyata,c,p,v,a
        model = 0 
      NSsites = 0 
        icode = 0  * 0:universal code; 1:mammalian mt; 2-10:see below
        Mgene = 0
    fix_kappa = 0  * 1: kappa fixed, 0: kappa to be estimated
        kappa = 2  * initial or fixed kappa
    fix_omega = 0  * 1: omega or omega_1 fixed, 0: estimate 
        omega = .4 * initial or fixed omega, for codons or codon-based AAs
    fix_alpha = 1  * 0: estimate gamma shape parameter; 1: fix it at alpha
        alpha = 0. * initial or fixed alpha, 0:infinity (constant rate)
       Malpha = 0  * different alphas for genes
        ncatG = 8  * # of categories in dG of NSsites models
        getSE =1   * 
 RateAncestor = 0  * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)
   Small_Diff = .5e-6
    cleandata = 1  * remove sites with ambiguity data (1:yes, 0:no)?
       method = 0   * 0: simultaneous; 1: one branch at a time
EOF
;
my $dir = 'aln';
my $clockdir = 'clock';
my $noclockdir = 'noclock';
mkdir($clockdir) unless -d $clockdir;
mkdir($noclockdir) unless -d $noclockdir;
my $ctl = 'codeml.ctl';
my $force = shift || 0;
 
opendir(DIR,$dir) || die $!;
 
for my $file (readdir(DIR) ) {
    next unless $file =~ /(\S+)\.cds\.phy$/;    
    my $stem = $1;
    my $clock = 0;
    for my $o ( ["$noclockdir/$stem.codeml", "phylogeny.unrooted",
		 "$clockdir/$stem.codeml", "phylogeny.rooted" ) {
        my ($outfile,$phylogeny) = @{$o}; # unroll the arrayref
	if( $force || (! -f $outfile || -z $outfile) ) {
	    # clock
	    open(CODEML,">$ctl") || die $!;
	    printf CODEML $filecontent, "$dir/$file", $phylogeny,$outfile,
	    $clock;
	    close(CODEML);
	    `codeml`;
	}
	$clock++;
    }
}

Processing the results

#!/usr/bin/perl -w
use strict;
use Statistics::Distributions;
 
my $signif_high = '0.01';
my $signif_mod  = '0.05';
my @dirs = qw(clock noclock);
my $ext = 'codeml';
my %sets;
for my $dir ( @dirs ) {
    opendir(DIR, $dir) || die $!;
    for my $file ( readdir(DIR) ) {
	next unless $file =~ /(\S+)\.$ext$/;
	my $num = $1;
	open(IN,"$dir/$file") || die $!;
	while(<IN>) {
 
	    if( /lnL\([^\)]+\):\s+(\-?\d+\.\d+)/ ) {
		$sets{$num}->{$dir} = $1 * -1; # this is just a lnL it is not -lnL
	    } elsif( /^ns\s+=\s+(\d+)/ ) {
		$sets{$num}->{'ns'} = $1;
	    }
	}
    }
}
 
# to then compare the clock vs no-clock
my $i = 0;
print join("\t", qw(CLUSTER NOCLOCK CLOCK LRT SIGNIF)),"\n";
for my $k ( sort keys %sets ) {    
    my $d = $sets{$k};
    if( defined $d ) {
        # H1,H0 = - log (likelihood)
 
        # LRT = 2*(lnL1 - lnL2);
        my $lrt = 2 * ($d->{'clock'} - $d->{'noclock'} );
        if( ! defined $d->{clock} || ! defined $d->{'noclock'} ) {
            warn("no dat for $i\n");
        }
        my $p = Statistics::Distributions::chisqrprob ($d->{'ns'} - 2,$lrt);
        printf "%s\t%9.3f\t%9.3f\t%-7.2f\t%-5.2g%s\n",$i,$d->{'noclock'},
        $d->{'clock'}, $lrt, $p, $p < $signif_high ? '**' : $p < $signif_mod ? '*' : '';
    }
    $i++;
}
Personal tools
Namespaces
Variants
Actions
Main Links
documentation
community
development
Toolbox