FASTA alignment program

From BioPerl

Description

This entry refers to the FASTA alignment program [1, 2]. Its output can be parsed in BioPerl by Bio::SearchIO. There is also a FASTA sequence format, the sequence file format that was originally designed as input for these tools. A simple extension of the sequence format gives the FASTA multiple alignment format, which is different from the database search result format output by the FASTA applications.

FASTA is Bill Pearson's package for sequence database searching.

History

(Wanted: someone to add some history of FASTA here)

Tips and Hints

Output options

BioPerl can parse both the default output and the -m 9 output, which is much more compact and gives smaller files (since alignments are not produced). If you only need E-values from SSEARCH or FASTA, you can use the following options together with the fastam9_to_table.PLS script to produce a small tab-delimited file.

fasta34 -H -E 1e-5 -m 9 -d 0 QueryFile SearchDatabase | fastam9_to_table > results.tab

The resulting file is small, which limits disk space usage and can speed up your analysis.
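
When many queries need the same treatment, the pipeline above can be wrapped in a small shell function. This is only a sketch: the function name run_fasta_table and the EVALUE variable are hypothetical, and it assumes that fasta34 and fastam9_to_table are available on your PATH under exactly those names, as in the command above.

```shell
#!/bin/sh
# Hypothetical helper: run the compact (-m 9) search for one query and
# write the tab-delimited table to the given output file.
# Usage: run_fasta_table QUERY DATABASE OUTPUT
run_fasta_table() {
    query=$1
    database=$2
    output=$3

    # EVALUE is an assumed, optional environment variable; the default
    # (1e-5) is the cutoff used in the command shown in the text.
    fasta34 -H -E "${EVALUE:-1e-5}" -m 9 -d 0 "$query" "$database" \
        | fastam9_to_table > "$output"
    # In plain sh the function returns the exit status of the last
    # command in the pipeline (fastam9_to_table), not of fasta34.
}
```

For example, run_fasta_table QueryFile SearchDatabase results.tab reproduces the command shown above, and EVALUE=1e-10 run_fasta_table QueryFile SearchDatabase strict.tab applies a stricter cutoff.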

Profile searches

From the release notes, here is information on how to search a sequence profile against a database using the Smith-Waterman (SW) algorithm.

>>June 16, 2003 version: fasta34t22
ssearch34 now supports PSI-BLAST PSSM/profiles.  Currently, it only
supports the "checkpoint" file produced by blastall, and only on
certain architectures where byte-reordering is unnecessary.  It has not
been tested extensively with the -S option.

       ssearch34 -P blast.ckpt -f -11 -g -1 -s BL62 query.aa library

Will use the frequency information in the blast.ckpt file to do a
position specific scoring matrix (PSSM) search using the
Smith-Waterman algorithm.  Because ssearch34 calculates scores for
each of the sequences in the database, we anticipate that PSSM
ssearch34 statistics will be more reliable than PSI-Blast statistics.

The Blast checkpoint file is mostly double precision frequency
numbers, which are represented in a machine specific way.  Thus, you 
must generate the checkpoint file on the same machine that you run
ssearch34 or prss34 -P query.ckpt.  To generate a checkpoint file,
run:

blastpgp -j 2 -h 1e-6 -i query.fa -d swissprot -C query.ckpt -o /dev/null

(This searches swissprot for 2 iterations ("-j 2") using an E()
threshold of 1e-6, saving the resulting position specific frequencies in
query.ckpt.  Note that the original query.fa and query.ckpt must match.)
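
The two steps from the release notes (generate a checkpoint file, then search with it) can be chained so that the query and the checkpoint always match. The following is a minimal sketch, not part of the FASTA package: the function name pssm_ssearch and the way the checkpoint filename is derived from the query filename are assumptions made for illustration, and blastpgp and ssearch34 are assumed to be on your PATH. Both steps must run on the same machine, as noted above.

```shell
#!/bin/sh
# Hypothetical helper: build a PSI-BLAST checkpoint for a query and then
# use it for a PSSM Smith-Waterman search with ssearch34.
# Usage: pssm_ssearch QUERY_FASTA BLAST_DB LIBRARY
pssm_ssearch() {
    query=$1
    blast_db=$2
    library=$3

    # Assumed naming scheme: query.fa -> query.ckpt
    ckpt="${query%.*}.ckpt"

    # Step 1: two PSI-BLAST iterations with an E() threshold of 1e-6,
    # saving the position specific frequencies in the checkpoint file.
    blastpgp -j 2 -h 1e-6 -i "$query" -d "$blast_db" -C "$ckpt" -o /dev/null \
        || return 1

    # Step 2: PSSM search with the same query file, so that the query
    # and the checkpoint are guaranteed to match.
    ssearch34 -P "$ckpt" -f -11 -g -1 -s BL62 "$query" "$library"
}
```

Called as pssm_ssearch query.fa swissprot library, this runs the blastpgp command shown above and then ssearch34 -P query.ckpt on the same query file.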


References

  1. Pearson WR, Wood T, Zhang Z, and Miller W. Comparison of DNA sequences with protein sequences. Genomics. 1997 Nov 15;46(1):24-36. DOI:10.1006/geno.1997.4995 | PubMed ID:9403055
  2. Pearson WR and Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 1988 Apr;85(8):2444-8. PubMed ID:3162770