The FASTA ktup parameter

What is ktup?

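Both FASTA and BLAST use a rapid word-based lookup strategy to speed the initial phase of the similarity search. The ktup parameter sets the length of the words used in that lookup. In protein searches with ktup=2, FASTA looks for pairs of adjacent, aligned identical amino acids (marked with ^ below), e.g.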

seq1   KDKEAYADRQELQDELRQEREARQKLEMMIKELKLQILKSSKTAKE
       . ::.    .::..::..::. ::: :.. :::. .:         
seq2   NAKEGLEKIEELEEELENERKLRQKSELQRKELESRIEELQDQLET
         ^^      ^^  ^^  ^^  ^^^     ^^^

With ktup=2, FASTA would ignore a region like:

seq1 LNKKLLNLKQAGEHLKPE
     .....:. ..  :.:. .
seq2 FEEEFLETREQYEKLQKD

in the initial scanning phase, because it contains no two adjacent identical residues. With ktup=1, single identities are enough to seed the scan, so searches with ktup=1 can be more sensitive than searches with ktup=2. However, a more sensitive algorithm may also raise the scores of unrelated sequences, so the statistical significance of an intermediate-distance match can be reduced even as the significance of a very distant match is improved.
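
The word lookup described above can be sketched in a few lines of Python. This is a simplified illustration, not FASTA's actual implementation: the function names word_hits and hits_on_diagonal are made up for this example, and only exact word matches are counted, with none of the later rescoring or joining steps. It counts the ktup-length words shared by the two example alignments along the diagonal shown above (offset 0, no gaps).

```python
from collections import defaultdict


def word_hits(query, subject, ktup):
    """Return (i, j) start positions of identical ktup-length words."""
    # Index every word of the query by its start position.
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    # Look up each word of the subject in the query index.
    hits = []
    for j in range(len(subject) - ktup + 1):
        for i in table.get(subject[j:j + ktup], ()):
            hits.append((i, j))
    return hits


def hits_on_diagonal(hits, diagonal=0):
    """Keep hits whose offset i - j equals the given diagonal."""
    return sorted(i for i, j in hits if i - j == diagonal)


# The two example regions from the text (0-based positions below).
strong1 = "KDKEAYADRQELQDELRQEREARQKLEMMIKELKLQILKSSKTAKE"
strong2 = "NAKEGLEKIEELEEELENERKLRQKSELQRKELESRIEELQDQLET"
weak1 = "LNKKLLNLKQAGEHLKPE"
weak2 = "FEEEFLETREQYEKLQKD"

strong_k2 = hits_on_diagonal(word_hits(strong1, strong2, 2))
weak_k2 = hits_on_diagonal(word_hits(weak1, weak2, 2))
weak_k1 = hits_on_diagonal(word_hits(weak1, weak2, 1))

print(strong_k2)  # [2, 10, 14, 18, 22, 23, 30, 31]
print(weak_k2)    # []
print(weak_k1)    # [5, 12, 14]
```

With ktup=2 the second region produces no word hits on its alignment diagonal, so nothing seeds the scan there; with ktup=1 its three isolated identities are found.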

BLAST also begins with a word lookup, using a word size (the equivalent of ktup) of 3, but it accepts conservative substitutions as well as identities within a word. Thus, BLAST with a word size of 3 is often more sensitive than FASTA with ktup=2.

For DNA sequences, FASTA uses ktup=6 by default. DNA searches with ktup=3 are more sensitive still, but ktup=1 is less sensitive than ktup=3 at a given statistical significance threshold. ktup=1 is appropriate when searching for oligonucleotides (< 20 nt).
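
One rough way to see why very short words become less selective, especially for DNA, is to estimate how many word matches two unrelated sequences would share purely by chance. The sketch below assumes residues are independent and equally frequent, which real sequences are not, so the numbers are only indicative. The function name expected_chance_hits and the sequence length of 1000 are chosen for illustration.

```python
def expected_chance_hits(m, n, ktup, alphabet_size):
    """Expected number of identical ktup-words between two random sequences.

    Assumes independent, uniformly distributed residues (a simplification).
    """
    # Number of (query word, subject word) position pairs that can be compared.
    pairs = max(m - ktup + 1, 0) * max(n - ktup + 1, 0)
    # Two random words of length ktup are identical with probability 1 / A**ktup.
    return pairs / alphabet_size ** ktup


length = 1000  # illustrative length for both sequences

for label, alphabet, ktups in (("DNA", 4, (6, 3, 1)), ("protein", 20, (2, 1))):
    for k in ktups:
        hits = expected_chance_hits(length, length, k, alphabet)
        print(f"{label:8s} ktup={k}  possible words={alphabet ** k:5d}  "
              f"expected chance hits={hits:.1f}")
```

Under these assumptions there are only 4 possible DNA words at ktup=1 against 4096 at ktup=6, so chance matches between unrelated sequences become far more frequent as ktup shrinks.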

FASTA Exercises