DIAMOND protein alignment databases

The DIAMOND protein aligner is a recent tool offering much faster (100× to 1000× faster than BLAST) alignment of protein sequences against reference databases. On UPPMAX, DIAMOND is available by loading the diamond module; as of this writing, the most recent installed version is diamond/0.9.10. Note that databases built with different diamond minor versions (such as diamond/0.7.12, diamond/0.8.26, and diamond/0.9.10) are not compatible with each other. Later versions of diamond build databases more quickly, and the resulting databases are smaller.

As with BLAST databases, UPPMAX provides several pre-built databases that can be used directly with the -d flag to diamond. The local UPPMAX copy of each of these databases is checked for updates once a month.

For each of the databases listed below, the method of versioning is indicated. To determine the version at UPPMAX, take the path given below and remove the database name from the end. The remaining path ends in latest, a symbolic link that points to a directory named after the version of the most recent update. Old database versions are removed after updates, so please use latest rather than addressing a database version directly.
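
As a minimal sketch of the lookup described above, the shell function below strips the database name from a path and resolves the latest symbolic link to report the name of the version directory. The function name db_version is hypothetical, and readlink -f assumes GNU coreutils, as found on typical Linux clusters.

```shell
# Hypothetical helper: print the version directory that "latest" points to.
# Argument: $1 = a database path ending in .../latest/<database name>
db_version() {
    db_path=$1
    latest_dir=$(dirname "$db_path")        # remove the database name
    resolved=$(readlink -f "$latest_dir")   # follow the "latest" symlink
    basename "$resolved"                    # the version is the directory name
}

# Example:
#   db_version /sw/data/diamond_databases/Blast/latest/nr
```

The same call works for any of the paths listed below, since each has latest as its second-to-last component.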

Each of the database locations below is also available in the indicated environment variable, which is set when any version of the diamond module is loaded. These are simple to use; for example, to search nr:

diamond -d $DIAMOND_NR ...
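
A slightly fuller sketch of the same search follows. It assumes a protein query, so it uses the blastp subcommand. The wrapper name search_nr and the file names are hypothetical, and depending on the cluster setup other modules may need to be loaded before diamond.

```shell
# Hypothetical wrapper around a DIAMOND protein search against nr.
# Arguments: $1 = query protein FASTA file, $2 = output file for the hits.
search_nr() {
    query=$1
    out=$2
    # DIAMOND_NR is set by the diamond module; refuse to run without it.
    if [ -z "${DIAMOND_NR:-}" ]; then
        echo "DIAMOND_NR is not set; load the diamond module first" >&2
        return 1
    fi
    diamond blastp -d "$DIAMOND_NR" -q "$query" -o "$out"
}

# Example (after loading the diamond module):
#   search_nr my_proteins.fasta my_proteins_vs_nr.tsv
```

Checking the variable first gives a clear error message when the module has not been loaded, instead of a failure inside diamond.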

NCBI Protein Databases

Downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/. These are updated frequently at NCBI, so they are versioned here by the monthly download date.

Database   Environment variable for diamond -d  UPPMAX path
nr         DIAMOND_NR                           /sw/data/diamond_databases/Blast/latest/nr
env_nr     DIAMOND_ENV_NR                       /sw/data/diamond_databases/Blast/latest/env_nr
swissprot  DIAMOND_SWISSPROT                    /sw/data/diamond_databases/Blast/latest/swissprot

NCBI RefSeq Proteins

RefSeq protein databases are downloaded from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/, with an update occurring if there is a new release as indicated by the contents of ftp://ftp.ncbi.nlm.nih.gov/refseq/release/RELEASE_NUMBER.

Database                               Environment variable for diamond -d  UPPMAX path
complete.nonredundant_protein.protein  DIAMOND_REFSEQ_NONREDUNDANT          /sw/data/diamond_databases/RefSeq/latest/complete.nonredundant_protein.protein
complete.protein                       DIAMOND_REFSEQ                       /sw/data/diamond_databases/RefSeq/latest/complete.protein


UniRef90

The UniRef90 protein database is downloaded in FASTA format from the ExPASy mirror at ftp://ftp.expasy.org/databases/uniprot/current_release/uniref/uniref90/, with an update occurring if there is a new version as indicated by the version tag in the XML description available at ftp://ftp.expasy.org/databases/uniprot/current_release/uniref/uniref90/RELEASE.metalink.

Database  Environment variable for diamond -d  UPPMAX path
uniref90  DIAMOND_UNIREF90                     /sw/data/diamond_databases/UniRef90/latest/uniref90
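
To check which of the variables listed in this document are defined in the current shell, a small loop such as the following can be used. This is only a sketch: the function name report_diamond_dbs is hypothetical, and it reports whether each variable is set without verifying the database files themselves.

```shell
# Hypothetical helper: report which DIAMOND_* variables from this page are set.
report_diamond_dbs() {
    for var in DIAMOND_NR DIAMOND_ENV_NR DIAMOND_SWISSPROT \
               DIAMOND_REFSEQ_NONREDUNDANT DIAMOND_REFSEQ DIAMOND_UNIREF90; do
        # Indirect lookup of the variable named in $var (names are fixed above).
        eval "value=\${$var:-}"
        if [ -n "$value" ]; then
            echo "$var=$value"
        else
            echo "$var is unset"
        fi
    done
}

# Example (after loading the diamond module):
#   report_diamond_dbs
```

If every line reads "is unset", the diamond module has most likely not been loaded in this shell.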