Downloading Sequence Libraries
Protein and DNA sequence library files can be downloaded from many
different sources, including the NCBI and EMBL-EBI.
Library formats
The FASTA programs read many different library formats directly, so
you do not need to run file conversion or formatting programs before
searching a sequence library with FASTA. By default, however, the
FASTA programs assume that a library is in FASTA format; to
search a library in another format, specify the format number
after the file name, e.g.
fasta36 -q mgstm1.aa "/slib/ncbi/refseq_protein 12"
would search the NCBI refseq_protein library in
NCBI/BLAST formatdb format.
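The file-name-plus-format convention can be sketched as a small shell helper. This is only an illustration: `lib_spec` is a hypothetical function name, not part of the FASTA package, and the script prints the command line instead of running it. It assumes, as in the example above, that the path and the format number are passed to the program together as one quoted argument.

```shell
#!/bin/sh
# Hypothetical helper (not part of FASTA): build a library argument of the
# form "path format_number".
# $1 = library path, $2 = format number (optional; empty or 0 = FASTA default)
lib_spec() {
    if [ -z "${2:-}" ] || [ "$2" = "0" ]; then
        # FASTA format is the default, so no format number is appended.
        printf '%s\n' "$1"
    else
        printf '%s %s\n' "$1" "$2"
    fi
}

lib=$(lib_spec /slib/ncbi/refseq_protein 12)

# Print the command instead of running it. In a real call, "$lib" must stay
# quoted so the shell passes path and format number as a single argument.
printf 'fasta36 -q mgstm1.aa "%s"\n' "$lib"
```

Leaving out the quotes would make the shell split the path and the number into two separate arguments, which is not what the example above passes to the program.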
Supported popular library formats include:
Format | Description
0 | FASTA format (default)
1 | GenBank flatfile
3 | EMBL-EBI/Swiss-Prot flatfile
5 | GCG/PIR flatfile
6 | GCG compressed binary
12 | NCBI BLAST formatdb version 2 (current version)
16 | MySQL SQL query
17 | PostgreSQL SQL query
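In scripts, it can be easier to refer to a format by name than by number. The following is a minimal sketch of such a lookup; the function `format_number` and the short names (`blast`, `gcg-binary`, and so on) are hypothetical labels chosen for this example, and only the numbers come from the table above.

```shell
#!/bin/sh
# Hypothetical helper: map a short, made-up format name to the library
# format number listed in the table above.
format_number() {
    case "$1" in
        fasta)          echo 0 ;;
        genbank)        echo 1 ;;
        embl|swissprot) echo 3 ;;
        gcg|pir)        echo 5 ;;
        gcg-binary)     echo 6 ;;
        blast)          echo 12 ;;
        mysql)          echo 16 ;;
        postgresql)     echo 17 ;;
        *)
            # Unknown names are reported on stderr and signalled by a
            # non-zero return status.
            echo "unknown library format: $1" >&2
            return 1
            ;;
    esac
}

# Example: look up the number for the NCBI BLAST formatdb format.
format_number blast
```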
Protein and DNA sequence databases
Today, there is little reason to choose one sequence database provider
over another, particularly for DNA sequence libraries, which are
synchronized nightly between NCBI, EMBL-EBI, and the DDBJ. For
protein sequence libraries, both NCBI and EMBL-EBI offer very
comprehensive but highly redundant collections of protein sequences,
e.g. NCBI NR and EMBL-EBI/PIR UniProt. Both groups also offer much
higher quality curated databases, e.g. NCBI refseq_protein and
EMBL-EBI Swiss-Prot. Because sequence similarity searches are
more sensitive when smaller databases are used, it makes the most
sense to search a smaller, higher quality database first, and then
search a more comprehensive database only if the initial search
finds no significant similarities.
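The two-stage strategy described above could be scripted roughly as follows. This is a sketch under several assumptions, not a recommended pipeline: it assumes `fasta36` is run with `-m 8` to produce BLAST-style tabular output with the E-value in column 11, and the function names `has_significant_hit` and `tiered_search`, as well as the 0.001 E-value cutoff, are illustrative choices.

```shell
#!/bin/sh
# Succeed if any row of tab-separated, BLAST-style tabular output on stdin
# has an E-value (assumed to be column 11) at or below the threshold in $1.
has_significant_hit() {
    awk -F '\t' -v max="$1" '
        /^#/ { next }                                  # skip comment lines
        NF >= 11 && ($11 + 0) <= (max + 0) { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Search the smaller curated library first; search the comprehensive library
# only if the first search reports no significant hit.
# $1 = query file, $2 = curated library, $3 = comprehensive library
tiered_search() {
    out=$(fasta36 -q -m 8 "$1" "$2")
    if printf '%s\n' "$out" | has_significant_hit 0.001; then
        printf '%s\n' "$out"
    else
        fasta36 -q -m 8 "$1" "$3"
    fi
}
```

A call might look like `tiered_search mgstm1.aa "/slib/ncbi/refseq_protein 12" "/slib/ncbi/nr 12"`, where the second library path is a hypothetical location for a local copy of NCBI NR.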