Talk:GFF code audit
From BioPerl
BLAT and PSL to GFF3
See Don Gilbert's message on Gmod-Gbrowse (it isn't available in the archive right now; the SourceForge mailing list archival system is really not very good).
From Don:
From: Don Gilbert <gilbertd@>
Date: October 21, 2007 12:50:41 AM EDT
To: gmod-gbrowse@lists.sourceforge.net
Subject: Re: [Gmod-gbrowse] GFF and PSL

I also was looking for a blat psl to gff conversion program, but failed to find one that handled the exon structure and distinct matches to same query that blat produces (i.e. tandem genes). Find now this fairly simple program (1 page): http://iubio.bio.indiana.edu/gmod/tandy/blat2gff.pl

This brings up a topic of genome analyses using Bioperl's SearchIO conversions (whether blat, blast or other). That is the details of distinct gene matches are lost in the conversions. Duplicate genes are very common, and tandem duplicates tend to confuse the heck out of many genome analysis programs.

Blat computes and writes these explicitly, each row is a distinct match (say of EST x genome), with the exon structure in the array of Q,T-starts on a match row. Bioperl unfortunately smooshes these together into one match when they are the same query EST. One can parse the same distinct match detail from BLAST (e.g. tabular output 8,9), by looking at query and source HSP locations, but it is more programming effort.

-- Don

# example tandem match pair
grep EB634440 dgri-gnoest.blat
670 .. + EB634440 753 0 683 scaffold_14830 6267026 2489511 2490194 2 398,272, 0,411, 2489511,2489922,
683 .. - EB634440 753 0 683 scaffold_14830 6267026 2484805 2485488 1 683, 70, 2484805,
646 .. - EB634440 753 0 683 scaffold_14830 6267026 2480511 2481194 3 272,71,305, 70,355,448, 2480511,2480796,2480889,

grep EB634440 dgri-gnoest.blat | blat2gff.pl -match EST_match
##gff-version 3
scaffold_14830 BLAT EST_match 2489512 2490194 670 + . ID=EB634440_mid1;Target=EB634440 1 683
scaffold_14830 BLAT match_part 2489512 2489909 670 + . Parent=EB634440_mid1;Target=EB634440 1 398
scaffold_14830 BLAT match_part 2489923 2490194 670 + . Parent=EB634440_mid1;Target=EB634440 412 683
scaffold_14830 BLAT EST_match 2484806 2485488 683 - . ID=EB634440_mid2;Target=EB634440 1 683
scaffold_14830 BLAT match_part 2484806 2485488 683 - . Parent=EB634440_mid2;Target=EB634440 71 753
scaffold_14830 BLAT EST_match 2480512 2481194 646 - . ID=EB634440_mid3;Target=EB634440 1 683
scaffold_14830 BLAT match_part 2480512 2480783 646 - . Parent=EB634440_mid3;Target=EB634440 71 342
scaffold_14830 BLAT match_part 2480797 2480867 646 - . Parent=EB634440_mid3;Target=EB634440 356 426
scaffold_14830 BLAT match_part 2480890 2481194 646 - . Parent=EB634440_mid3;Target=EB634440 449 753

View these EST matches at http://insects.eugenes.org/species/cgi-bin/gbrowse/dgri/?name=scaffold_14830:2479354-2492663;label=hsgDM-EST-NCBI_GNO

# equivalent Bioperl bp_search2gff3.pl
# turns distinct 3 gene matches into one, including reversed one, and ignores exon detail ...
grep EB634440 dgri-gnoest.blat | lib/Bio/script/bp_search2gff3.pl -f psl -m -ver 3 -t hit -i -
##gff-version 3
scaffold_14830 BLAT match_part 2489512 2490194 . + 0 Parent=EB634440;Target=Sequence:EB634440 1 683
scaffold_14830 BLAT match_part 2484806 2485488 . - 0 Parent=EB634440;Target=Sequence:EB634440 1 683
scaffold_14830 BLAT match_part 2480512 2481194 . - 0 Parent=EB634440;Target=Sequence:EB634440 1 683
scaffold_14830 BLAT match 2480512 2490194 . - . ID=EB634440

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd_AT_indiana.edu--http://marmot.bio.indiana.edu/
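To make the point in Don's message concrete, here is a rough Python sketch of the per-row conversion he describes: one parent match per PSL row, plus one match_part per alignment block. It is not the code of blat2gff.pl or of any BioPerl module. The function name, the dictionary keys (borrowed from the PSL column names), and the use of the matches count as the GFF score are assumptions made for illustration. It assumes nucleotide PSL with a single-character strand. PSL block starts are 0-based, so each start gets +1 for GFF3. For "-" strand rows the sketch also flips qStarts back to forward query coordinates, which is a choice of this sketch; the match_part Target values in the blat2gff.pl output quoted above (e.g. 71 753) are left as they appear in the PSL row.

```python
def psl_to_gff3(hit, match_id, source="BLAT", match_type="EST_match"):
    """Turn one PSL row (as a dict) into GFF3 lines: a parent match
    followed by one match_part per block. Illustrative sketch only."""

    def ints(field):
        # PSL list columns look like "398,272," (note the trailing comma)
        return [int(x) for x in field.split(",") if x]

    strand, q_size, q_name = hit["strand"], hit["qSize"], hit["qName"]

    parts = []
    for size, q0, t0 in zip(ints(hit["blockSizes"]),
                            ints(hit["qStarts"]),
                            ints(hit["tStarts"])):
        if strand == "-":
            # Assumption of this sketch: report query coordinates on the
            # forward strand, so undo PSL's reverse-complement numbering.
            q0 = q_size - (q0 + size)
        # 0-based half-open (PSL) -> 1-based inclusive (GFF3)
        parts.append((t0 + 1, t0 + size, q0 + 1, q0 + size))

    def row(ftype, t_lo, t_hi, attrs):
        cols = [hit["tName"], source, ftype, t_lo, t_hi,
                hit["matches"], strand, ".", attrs]
        return "\t".join(str(c) for c in cols)

    t_lo = min(p[0] for p in parts)
    t_hi = max(p[1] for p in parts)
    q_lo = min(p[2] for p in parts)
    q_hi = max(p[3] for p in parts)

    lines = [row(match_type, t_lo, t_hi,
                 f"ID={match_id};Target={q_name} {q_lo} {q_hi}")]
    for p_t_lo, p_t_hi, p_q_lo, p_q_hi in parts:
        lines.append(row("match_part", p_t_lo, p_t_hi,
                         f"Parent={match_id};Target={q_name} {p_q_lo} {p_q_hi}"))
    return lines


# First row of the example in the message (the columns elided there
# as ".." are not needed by this sketch).
first_row = {
    "matches": 670, "strand": "+", "qName": "EB634440", "qSize": 753,
    "tName": "scaffold_14830",
    "blockSizes": "398,272,", "qStarts": "0,411,",
    "tStarts": "2489511,2489922,",
}
for line in psl_to_gff3(first_row, "EB634440_mid1"):
    print(line)
```

Because each PSL row gets its own ID, the three tandem matches of EB634440 stay separate instead of being merged under a single parent.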
bp_search2gff.pl: two bugs (enhancement requests, really) have been reported on Bugzilla that we should take note of:
bp_genbank2gff3.pl: doesn't calculate phase:
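For context on the phase item: in GFF3, the phase of a CDS segment is the number of bases (0, 1 or 2) to skip at the segment's 5' end to reach the start of the next codon. A minimal Python sketch of that calculation follows. It is not taken from bp_genbank2gff3.pl. The function name is hypothetical, the coordinates in the usage lines are made up, and it assumes the first segment in transcript order begins on a codon boundary (phase 0).

```python
def cds_phases(segments, strand):
    """Return [((start, end), phase), ...] in transcript (5'->3') order.

    segments: list of (start, end) CDS coordinates, 1-based inclusive.
    Assumes the first segment starts on a codon boundary (phase 0).
    """
    # On the minus strand, the 5' end is the segment with the highest coordinates.
    ordered = sorted(segments, reverse=(strand == "-"))
    result = []
    phase = 0
    for start, end in ordered:
        result.append(((start, end), phase))
        length = end - start + 1
        # Bases left over after the last full codon of this segment
        # decide how many bases the next segment must contribute.
        phase = (3 - (length - phase) % 3) % 3
    return result


# Made-up coordinates, purely for illustration.
for (start, end), phase in cds_phases([(1, 10), (21, 25), (31, 40)], "+"):
    print(start, end, phase)
```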