UVa FASTA Server


If you are interested in using the FASTA WWW service for teaching a class, please email me (wrp@virginia.edu) and I can make arrangements for you to use a Beowulf cluster of FASTA servers.
The following exercises illustrate important principles in sequence similarity searching: (1) the relationship between homology and statistical estimates; (2) the importance of using protein (or translated protein) rather than DNA sequences for searching; (3) the similarity in the results produced by FASTA, BLAST, and SSEARCH (Smith-Waterman); and (4) the detection of local duplications from significant similarity.

Most of the searches in this exercise should be done against a small protein database, e.g. the PIR1 database available at the FASTA WWW site. Searching a small database makes it practical to consider each of the high scoring similarities, and to evaluate further whether they are likely to be biologically meaningful.


Identifying homologs and non-homologs; effects of scoring matrices and algorithms

1. Use the FASTA search page to compare Drosophila glutathione transferase GSTT1_DROME (gi|121694) to the PIR1 Annotated protein sequence database.

  1. What is the highest scoring non-homolog? (How would you confirm that your candidate non-homolog was truly unrelated?)

  2. Note that this Drosophila glutathione transferase shares significant similarity with sequences from both bacteria (SSPA_SHIFL, stringent starvation protein) and mammals. How might you test whether the stringent starvation protein is homologous to glutathione transferases? (Hint: search SwissProt for a more comprehensive view of the family.)

  3. Compare the expectation (E()) value for the distant relationship between GSTT1_DROME and GSTM2_RAT (class-mu). How would you demonstrate that GSTT1_DROME is homologous to GSTM2_RAT?

  4. Examine how the expectation value changes with different scoring matrices (BLOSUM62, BlastP62, PAM250) and different gap penalties. (The default scoring matrix for the FASTA programs is BLOSUM50, with gap penalties of -10 to open a gap and -2 for each residue in the gap - e.g. -12 for a one residue gap).

    What happens to the E()-value for the highest scoring unrelated sequence with the different matrices?

    Look at the distribution of scores and the E()-value of the highest scoring unrelated sequence when the gap-open/gap-ext penalties are small (-7/-1).

  5. Try the search with ssearch (Smith-Waterman). Again, look at the E()-values for distant homologs and the highest scoring unrelated sequence.

  6. (optional) Try the search with ktup=1 (What is ktup?). FASTA uses the ktup parameter to adjust the sensitivity and speed of the search. With ktup=2, FASTA looks for "pairs" of matched identical residues to find regions of similarity. ktup=1 looks for singly-aligned residues, and thus takes longer.
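
The ktup idea in step 6 can be sketched in a few lines. This is only an illustration of word lookup, not the FASTA implementation: the function name, the example sequences, and the choice to count hits per diagonal are all assumptions made for this sketch.

```python
from collections import defaultdict


def ktup_diagonal_hits(query, library_seq, ktup=2):
    """Count identical ktup-length word matches on each diagonal.

    A diagonal is (library position - query position); many hits on one
    diagonal suggest a region of similarity without gaps.
    """
    # Index every ktup-length word in the query by its start positions.
    word_positions = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        word_positions[query[i:i + ktup]].append(i)

    # Scan the library sequence and record which diagonal each match falls on.
    hits = defaultdict(int)
    for j in range(len(library_seq) - ktup + 1):
        for i in word_positions.get(library_seq[j:j + ktup], ()):
            hits[j - i] += 1
    return dict(hits)


# Hypothetical toy sequences, chosen only to make the effect visible.
query = "MKVLAAGIV"
library_seq = "TTMKVLGAGIV"

hits_ktup2 = ktup_diagonal_hits(query, library_seq, ktup=2)
hits_ktup1 = ktup_diagonal_hits(query, library_seq, ktup=1)
print(hits_ktup2)  # {2: 6}
print(hits_ktup1)
```

With ktup=2 only the one diagonal with paired matches is reported; with ktup=1 every single-residue identity counts, so more diagonals must be examined, which is why the search is more sensitive but slower.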

2. Do the same search (121694) using the Course BLAST WWW page.

  1. What is the highest scoring non-homolog?

  2. How do the blastp E()-values compare with the FASTA (BLOSUM62) E()-values for the distantly related mammalian and plant sequences?
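
The default gap penalties quoted in exercise 1 (-10 to open, -2 per residue, so -12 for a one-residue gap) imply that a gap of k residues costs open + k × extend. A minimal sketch of that arithmetic follows; the function name is hypothetical, and other programs may define the gap-open penalty differently.

```python
def gap_penalty(length, gap_open=-10, gap_extend=-2):
    """Penalty for a gap of `length` residues under the convention in the text:
    the open penalty is charged once, the extension penalty for every residue.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * length


print(gap_penalty(1))                                # -12, as stated in the text
print(gap_penalty(5))                                # -20
print(gap_penalty(5, gap_open=-7, gap_extend=-1))    # -12 with the smaller -7/-1 penalties
```

With the -7/-1 penalties a five-residue gap costs no more than a one-residue gap does under the defaults, which is one way to see why alignments of unrelated sequences score higher when the penalties are small.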

Comparison of protein:protein and translated DNA:protein searches to DNA:DNA searches - more sensitive DNA searches

3. In the next exercises, we will try to find GSTT1_DROME homologs in the Arabidopsis genome, using (a) protein:protein (BLASTP), (b) DNA:protein (BLASTX), (c) protein:DNA (TBLASTN), and (d,e) DNA:DNA (BLASTN) searches.

In each of the exercises below, the BLASTP, BLASTX etc. links are pre-set to search Arabidopsis sequences.

  1. BLASTP. Compare the GSTT1_DROME (gi|121694) protein sequence to Arabidopsis proteins using BLASTP.

    What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1 and ATGSTU4?

  2. BLASTX. Try the same search using the GSTT1_DROME cDNA DMGST (gi|8033) against Arabidopsis proteins using BLASTX.

    What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1 and ATGSTU4?

  3. TBLASTN. Use GSTT1_DROME (gi|121694) against translated Arabidopsis DNA using TBLASTN.

    What are the E()-values for Arabidopsis ATGSTT1, ATGSTF10, ATGSTZ1 and ATGSTU4?

  4. Finally, try the DNA:DNA comparison. Use BLASTN to compare DMGST (gi|8033) to the DNA sequences in Arabidopsis.

    Are there detectable Arabidopsis homologs?

  5. The default BLAST DNA parameters are designed for very close matches (98% identity). Use the Other advanced options: -r +5 -q -4 -G 10 -E 6, which target the DNA search at 70% identity.
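
One rough way to see why the match/mismatch scores matter is to ask at what percent identity an ungapped DNA alignment stops accumulating positive score. The sketch below computes that break-even point; the function name is hypothetical, and the +1/-3 values are assumed here to be the default reward and penalty. The break-even identity is only a lower bound on what can score positively; it is not the same quantity as the target identities (98%, 70%) quoted above.

```python
from fractions import Fraction


def break_even_identity(reward, penalty):
    """Fraction identity p at which the expected ungapped score per column,
    p * reward + (1 - p) * penalty, equals zero.
    """
    if reward <= 0 or penalty >= 0:
        raise ValueError("reward must be positive and penalty negative")
    return Fraction(-penalty, reward - penalty)


# +1/-3 is an assumed default; +5/-4 is the setting given in the exercise.
print(break_even_identity(1, -3))   # 3/4
print(break_even_identity(5, -4))   # 4/9
```

Under the assumed +1/-3 scoring, an alignment below 75% identity loses score on average, so matches near 70% identity cannot build up a significant score; with +5/-4 they can.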

Confirming statistical estimates with shuffles

4. Use the PRSS shuffle program to evaluate the statistical significance of a match.

  1. Compare GSTT1_DROME (gi|121694) to XURTG (gi|66611) using PRSS

    What is the E()-value? What database size is used to calculate the E()-value? Why?

  2. Compare SKIL_HUMAN (gi|134594) to KINH_STRPU (kinesin heavy chain) using PRSS. Compare the results with and without window shuffling.
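
The difference between the two kinds of shuffle can be sketched as follows. This is a minimal illustration under stated assumptions, not the PRSS code: the function names, the window size of 8, and the toy sequence are hypothetical.

```python
import random


def uniform_shuffle(seq, rng):
    """Shuffle all residues; preserves overall composition only."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def window_shuffle(seq, window, rng):
    """Shuffle residues within consecutive windows; preserves the
    composition of each window, and so keeps local composition bias in place.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    pieces = []
    for start in range(0, len(seq), window):
        block = list(seq[start:start + window])
        rng.shuffle(block)
        pieces.append("".join(block))
    return "".join(pieces)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
seq = "MKKKKKKKAAAAAAAAGGGGGGGGLLLLLLLL"  # toy sequence with strong local bias
uniform = uniform_shuffle(seq, rng)
windowed = window_shuffle(seq, 8, rng)
print(uniform)
print(windowed)
```

A uniform shuffle spreads a biased region across the whole sequence, while a window shuffle leaves it where it was. When a high score comes from local composition rather than from residue order, window-shuffled sequences tend to score nearly as well as the real one, and the match looks less significant.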

Significant similarities within sequences - domain duplication

5. Exploring domains with local alignments --- Calmodulin

  1. Use lalign to examine local similarities between calmodulin CALM_HUMAN and itself.
  2. Use plalign to plot the same alignment. How many repeats are present in this sequence?
  3. What happens to the domain alignment plot when you use a shallower scoring matrix (try BP62, MD20)?

6. Exploring domains with local alignments --- Death Associated Protein Kinase 1 (DAPK1)

  1. Use lalign to examine local similarities between DAPK1_HUMAN and itself.
  2. Use plalign to plot the same alignment. How many repeats are present in this sequence? Try zooming in by doing the alignment plot using the subset of the sequence from 350-650.
  3. What happens to the domain alignment plot when you use a shallower scoring matrix (try BP62, MD20)?

    You can look at the Pfam annotation of this protein at: DAPK1_HUMAN Pfam

For more complex domain alignments, try mwkw, or mouse RNA polymerase (rpb1_mouse residues 1500-) against itself. Try the rpb1_mouse alignment using the MD20 scoring matrix as well as BLOSUM50.


Where to get the FASTA package: Download

The "normal" FASTA WWW site:

Contact Bill Pearson: wrp@virginia.edu