Using the GPU nodes on Snowy

Snowy has a moderate number of nodes, each equipped with a single Nvidia T4 GPU. This page explains how to use them.

Who has access?

Anyone who can run jobs on Snowy has access. Members of projects with core-hour allocations on Rackham can run jobs on Snowy in the bonus (lower priority) queue. Members of projects with core-hour allocations on Snowy can run jobs at ordinary priority. Members of projects that invested in the GPU resource (e.g. uppmax2020-2-2) will receive a higher priority in the queue.

How do we access them?

There are two ways to ask Slurm for GPU resources: the older "gres" syntax, for example "sbatch --gres=gpu:1", and the newer "--gpu*" options, for example "sbatch --gpus=1".
 
If you give Slurm either option, the job will be scheduled on a GPU node and can use the GPU when it runs.
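For example, both forms can be given directly on the sbatch command line (job.sh is a hypothetical jobscript name):

```shell
# Older "gres" syntax: request one GPU on the Snowy cluster
sbatch -M snowy --gres=gpu:1 job.sh

# Newer "--gpu*" syntax: an equivalent request
sbatch -M snowy --gpus=1 job.sh
```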
Here is an example jobscript for the Snowy hybrid nodes that uses the "--gres" option:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -A snicxxxx-x-xx
#SBATCH -t 03-00:00:00
#SBATCH --exclusive
#SBATCH -p node
#SBATCH -N 1
#SBATCH -M snowy
#SBATCH --gres=gpu:1
#SBATCH --gpus-per-node=1
##For jobs shorter than 15 minutes (max 4 nodes), also uncomment:
##SBATCH --qos=short
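The directives above only reserve resources; the script body does the actual work. As a minimal sketch, assuming nvidia-smi is in the PATH on the GPU nodes, one might verify that the job can see the GPU before starting the real workload:

```shell
# List the GPU(s) visible to this job; should show one Tesla T4
nvidia-smi

# Slurm sets CUDA_VISIBLE_DEVICES to the GPU(s) allocated to the job
echo "Allocated GPU(s): $CUDA_VISIBLE_DEVICES"

# ...then launch the actual GPU application here
```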
One may ask for an interactive node this way:
salloc -A snicxxxx-x-xx -N 1 -M snowy --gres=gpu:1 --gpus-per-node=1

Limits

The time limit for jobs using GPUs is currently 3 days.

GPU sharing

If you would like to share a GPU, you can use the flag --gres=mps:50 when submitting a job to Slurm.
This reserves 50% of a GPU for your job. You can specify an mps value from 1 to 100.
 
Note 1. We only use the Slurm mps option, not the Nvidia MPS support: Nvidia MPS nowadays does not support different users sharing a GPU. This means that we cannot strictly control the jobs sharing a GPU. For example, if you specify mps:50 but no other job is using the GPU, your job will use 100% of the GPU until another job starts on that node; then the jobs will get half of the GPU each. Likewise, if three jobs, one with mps:50 and two with mps:25, are on the same node, they will get about a third of the GPU each. We also cannot control how much of the GPU memory each job uses.

Note 2. The --gres=mps option is mostly for students in courses that do not use the GPUs all the time. Most other users will probably not need this option.
 
Here is an example jobscript for the Snowy hybrid nodes that uses the mps option. This example books half (50%) of a GPU:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -A snicxxxx-x-xx
#SBATCH -t 03-00:00:00
#SBATCH -p core
#SBATCH -n 2
#SBATCH -M snowy
#SBATCH --gres=mps:50

How do we use CUDA and related software?

There is a system installation of CUDA v11.
Modules for CUDA 9 and 10 are also installed.
Example:
module use /sw/EasyBuild/snowy/modules/all/
module load intelcuda/2019b
will give you CUDA version 10.1.
Or, if one does not want to use the Intel toolchain, one may run:
module use /sw/EasyBuild/snowy/modules/all
module load fosscuda/2019b
or
module load fosscuda/2018b
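As a sketch of how such a module might be used in practice (my_kernel.cu is a hypothetical source file; nvcc is assumed to be provided by the loaded CUDA toolchain):

```shell
module use /sw/EasyBuild/snowy/modules/all
module load fosscuda/2019b

# Check which CUDA compiler version the module provides
nvcc --version

# Compile and run a CUDA program (hypothetical source file)
nvcc -o my_kernel my_kernel.cu
./my_kernel
```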

Where can I read more?

We have followed the current best practices in Slurm when
configuring the GPUs. Technically the GPUs are configured as what is
known as a "gres" (generic resource), which is then tracked using
"tres" (trackable resources). This means that most GPU-related options
you can find in Slurm's documentation are expected to work.

Slurm's GPU-related documentation is here:

  https://slurm.schedmd.com/gres.html#Running_Jobs

You can also search for "GPU" in the sbatch man-page on Snowy.
Last updated: 2021-04-21