Uppsala Multidisciplinary Center for Advanced Computational Science

How to use the node's own hard drive for analysis

Short version: When possible, copy the files you want to use in the analysis to $SNIC_TMP at the start of the job, and store all output there as well. As the last step of the job, copy the files you want to keep back to your project directory.

Long version: Parallel network file systems are very fast when accessed from many nodes, but can nevertheless become a bottleneck. For instance, if many jobs are doing many file operations on the same file system, those jobs compete with each other and performance degrades. Additionally, the metadata server on these kinds of file systems can be overburdened if very large numbers of files are created and/or deleted.

For this reason, jobs that perform a lot of file accesses, especially on temporary files, should use the compute node's local hard drive. If you do, then any slow-down due to file I/O is limited to the node(s) on which these jobs are running. 

The hard drive of the node is located at /scratch, and each job that runs on a node automatically gets a folder named after its job ID, /scratch/<jobid>. This folder name is also stored in the environment variable $SNIC_TMP for ease of use. The idea is that the first thing you do in the job is copy all the files you will be reading to $SNIC_TMP. You then run your analysis and write all the output files to $SNIC_TMP as well. After the analysis is done, you copy the output files you want to keep back to your project folder. Everything in /scratch/<jobid> is deleted as soon as the job finishes.
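To see what this looks like from inside a job, a couple of lines like the following can be added to a job script. This is only a minimal sketch: the mktemp fallback is an assumption added so the lines can also be tried outside a job, where $SNIC_TMP is normally not set.

```shell
# Inside a job $SNIC_TMP is already set. The fallback to a throwaway
# directory is an assumption, only there so this can be tried outside a job.
SNIC_TMP="${SNIC_TMP:-$(mktemp -d)}"

# show where the job's folder on the local drive is
echo "Local job folder: $SNIC_TMP"

# show how much space is free on the drive that holds it
df -h "$SNIC_TMP"
```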

An example would be a script that runs bwa to align reads. Such a script usually looks something like this:

#!/bin/bash -l
#SBATCH -A b2017999
#SBATCH -t 01:00:00
#SBATCH -p core
#SBATCH -n 16
 
# load modules
module load bioinfo-tools bwa/0.7.13 samtools/1.3
 
# run the alignment and convert it to bam format directly
bwa mem -t 16 /proj/b2017999/nobackup/ref/hg19.fa /proj/b2017999/rawdata/sample.fq.gz | samtools view -b -o /proj/b2017999/nobackup/results/sample.bam

The only changes needed are to copy the input files to $SNIC_TMP first and to copy the results back once the alignment is done:

#!/bin/bash -l
#SBATCH -A b2017999
#SBATCH -t 01:00:00
#SBATCH -p core
#SBATCH -n 16
 
# load modules
module load bioinfo-tools bwa/0.7.13 samtools/1.3
 
# copy the files used in the analysis to $SNIC_TMP
cp /proj/b2017999/nobackup/ref/hg19.fa* /proj/b2017999/rawdata/sample.fq.gz $SNIC_TMP
 
# go to the $SNIC_TMP folder to make sure any temporary files are created there as well
cd $SNIC_TMP
 
# run the alignment using the files in $SNIC_TMP and convert it to bam format directly
bwa mem -t 16 $SNIC_TMP/hg19.fa $SNIC_TMP/sample.fq.gz | samtools view -b -o $SNIC_TMP/sample.bam
 
# copy the results back to the network file system
cp $SNIC_TMP/sample.bam /proj/b2017999/nobackup/results/

It's not harder than that. This way, the files are copied to $SNIC_TMP in a single long operation, which puts much less strain on the file system than many small random reads and writes. The whole analysis then only uses the node's local hard drive, which keeps the load off the network file system. When the alignment is finished, the results are copied back to the project directory so that they can be used in other analyses.
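Because everything in /scratch/<jobid> is deleted when the job finishes, results are lost if the script stops before the final cp. One way to guard against that is a shell trap that copies the results back whenever the script exits. This is a sketch under stated assumptions, not an UPPMAX recommendation: RESULTS_DIR and copy_back are hypothetical names, and the mktemp fallbacks exist only so the sketch can be tried outside a job.

```shell
# Hypothetical locations for illustration; in a real job $SNIC_TMP is set
# for you and RESULTS_DIR would point into your project folder.
SNIC_TMP="${SNIC_TMP:-$(mktemp -d)}"
RESULTS_DIR="${RESULTS_DIR:-$(mktemp -d)}"

# copy whatever bam files exist back to the network file system
copy_back() {
    cp "$SNIC_TMP"/*.bam "$RESULTS_DIR"/
}

# run copy_back whenever the script exits, whether it finished or failed
trap copy_back EXIT

# work inside the local folder so temporary files end up there as well
cd "$SNIC_TMP" || exit 1

# stand-in for the real analysis step, such as the bwa command above
echo "placeholder output" > sample.bam
```

A trap like this cannot run if the shell itself is killed outright, so it complements the explicit copy at the end of the script and does not replace it.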

One potential problem is that your files and results may be too large for the node's hard drive. The drive is 2 TiB on Rackham and 4 TiB on Bianca, so if your files are larger than that you will not be able to use this approach.
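A job script can check this before copying anything. The helper below is a minimal sketch: fits_on_scratch is a hypothetical function name, and it only compares the size of the input files with the free space on the target drive, so the room needed for results and temporary files still has to be estimated separately.

```shell
# Hypothetical helper: succeeds if the given files fit in the free space
# of the target directory. Sizes are compared in KiB.
fits_on_scratch() {
    local target="$1"
    shift
    local need_kb free_kb
    # total size of the input files in KiB (last line of du -c is the total)
    need_kb=$(du -ck "$@" | tail -n 1 | cut -f 1)
    # free space in KiB on the drive holding the target (POSIX output format)
    free_kb=$(df -Pk "$target" | awk 'NR == 2 { print $4 }')
    [ "$need_kb" -lt "$free_kb" ]
}
```

In the job script above, the function could be called with $SNIC_TMP as the first argument, followed by the reference and read files, just before the cp line, and the script could exit with an error message if the call fails.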