Chromosomes from an .agp file plus contigs
From BioPerl
The sound of one man coding...
Russell Smithies asks the rhetorical question:
Does anyone have a script for building chromosomes from an .agp file and a directory full of contigs?
and his answer:
use Bio::DB::Fasta; use Bio::Seq; use Bio::SeqIO; open(AGP,"Mt2.0_pgp.agp") or die $!; my @chr = (); my $db = Bio::DB::Fasta->new("contigs.fa"); while(<AGP>){ chomp; split /\s/; # extend temp string if it's too short do{$chr[$_[0]] .= ' ' x 1_000_000;}while length $chr[$_[0]] < $_[2] ; if($_[4] !~ m/N/){ ($start,$stop) = $_[8] eq '+'?($_[6], $_[7]):($_[7], $_[6]); $s = substr $chr[$_[0]], $_[1], $_[9], $db->seq($_[5],$start,$stop); }else{ $s = substr $chr[$_[0]], $_[1], $_[5], "N" x $_[5] ; } } #remove any trailing whitespace @chr = map{s/\s+//g;$_}@chr; #print the sequence. chromosomes are chr0 -> chr8 foreach(0..$#chr){ my $seqobj = Bio::Seq->new( -display_id => "chr$_", -seq => $chr[$_]); my $seq_out = Bio::SeqIO->new('-file' => ">chr$_.fa",'-format' => 'fasta'); $seq_out->write_seq($seqobj); }
Russell adds:
Please excuse my hacky use of substrings but this .agp file had overlapping runs of 'N' and this was the easiest way to deal with it; e.g.
0 1 50000 1 N 50000 clone yes 0 50001 167645 2 F AC144644.3 1 117645 + 117645 0 167646 217645 3 N 50000 clone yes 0 217646 317645 4 N 100000 contig no 0 317646 367645 5 N 50000 clone yes 0 367646 411754 6 F AC146805.17 1 44109 + 44109
to the #top