Simons Genome Diversity Project datasets

The Simons Foundation's Genome Diversity Project (SGDP) datasets are now available on UPPMAX. They consist of deep human genome sequence data, with samples chosen to cover as much human diversity as possible:

[Figure: world map showing the geographical distribution of genomes]

There are currently approximately 8.5 TB of data, in two forms: all-sites VCF files, with variants called using a method designed to minimise reference bias, and the associated sample-specific reference genomes in FASTA format. For more details on methods, see the project website.

The main archive is found at /sw/data/SGDP/SGDP/samples/. Within this directory are two subdirectories containing the data currently available on UPPMAX: FullyPublicGeneral/ and FullyPublicHGDP/. The other subdirectories, prefixed with SignedLetter, represent data sources that are not yet publicly available; their contents consist of 0-sized files. Each currently available directory contains a Summary/Summary_info.txt file with metadata describing each sample.
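
The layout described above can be checked with a short script. The following is a minimal sketch in Python, not an official UPPMAX tool: the function names are hypothetical, and it assumes only the paths quoted above. It reports which subdirectories hold real data (at least one non-empty file) and which hold only 0-sized placeholders, and it prints the location of each Summary_info.txt file without assuming anything about that file's internal format.

```python
from pathlib import Path
from typing import Dict

# Path quoted in the text above; it only exists on UPPMAX systems.
SGDP_SAMPLES = Path("/sw/data/SGDP/SGDP/samples")


def has_data(directory: Path) -> bool:
    """Return True if any regular file below `directory` is non-empty.

    The search stops at the first non-empty file it finds.
    """
    return any(
        p.is_file() and p.stat().st_size > 0 for p in directory.rglob("*")
    )


def classify_subdirectories(root: Path) -> Dict[str, bool]:
    """Map each immediate subdirectory of `root` to whether it holds data."""
    return {d.name: has_data(d) for d in sorted(root.iterdir()) if d.is_dir()}


def summary_file(root: Path, subdirectory: str) -> Path:
    """Build the path to the metadata file of one available subdirectory."""
    return root / subdirectory / "Summary" / "Summary_info.txt"


if __name__ == "__main__":
    if SGDP_SAMPLES.is_dir():
        for name, available in classify_subdirectories(SGDP_SAMPLES).items():
            if available:
                print(name, "available, metadata:", summary_file(SGDP_SAMPLES, name))
            else:
                print(name, "placeholder only (0-sized files)")
    else:
        print("SGDP archive not found; this sketch expects an UPPMAX system.")
```

If the account running the script does not have read access to the archive, the directory listing may fail with a PermissionError.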

To access this data, please request membership in the kgp group by email. As with the 1000 Genomes Project, the purpose is not to restrict access in any way, but to make it easier to inform UPPMAX users who use the datasets of any relevant changes. Because the local copies of these datasets are hosted on UPPMAX systems, access is restricted to UPPMAX users; non-UPPMAX users will need to follow the procedures described on the SGDP website to download their own copies of the datasets.
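
Once membership has been granted, it can be confirmed from a login session. The following is a minimal sketch in Python, assuming that kgp is an ordinary Unix group visible through the system group database (the text above does not state how the group is implemented); the helper name in_group is hypothetical. A new membership usually takes effect only in sessions started after it was granted.

```python
import grp
import os


def in_group(group_name: str) -> bool:
    """Return True if the current process belongs to the named Unix group."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        # The group is not known on this system at all.
        return False
    # os.getgroups() lists supplementary groups; also check the effective
    # primary group, which is not guaranteed to be included in that list.
    return gid in os.getgroups() or gid == os.getegid()


if __name__ == "__main__":
    # "kgp" is the group named in the text above.
    if in_group("kgp"):
        print("This session is a member of kgp.")
    else:
        print("This session is not a member of kgp.")
```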