Milou user guide

This is the user guide for Milou, a high-performance computer cluster at UPPMAX. Guides for the other systems at UPPMAX can be found here.

Please read this user guide for up-to-date information.
All heavy usage of the cluster must go through the batch system, SLURM. The login nodes only allow up to 30 minutes of CPU time per process.

System configuration

The login node for Milou is called (In fact, there may be multiple login nodes hidden behind this name; you will be automatically redirected to any one of these.)

See the Milou presentation page for information about the hardware, the SLURM user guide for information about how to use our queue system, and the installed software list for details about available compilers and installed software. Software is managed with a module system.

You will probably find the following commands useful:

  • uquota - shows your file system usage.
  • projinfo - shows the CPU hour usage of your projects.
  • jobinfo - shows running and waiting jobs on Milou.
  • projmembers - shows project memberships.
  • projsummary [project id] - summarizes some useful information about a project.

Accounts and log in

All access to this system is done via secure shell (a.k.a. SSH) interactive login to the login node, using the domain name if you're a UPPNEX user and (or if you're a Physics and Astronomy user at UU.

ssh -AX
ssh -AX

For questions concerning accounts and access to Milou, please contact UPPMAX support.

Note that the machine you arrive at when logged in is only a so-called login node, where you can do various smaller tasks. We have some limits in place that restrict your usage on login nodes. For larger tasks you must use our batch system, which pushes your jobs onto other machines within the cluster.

To allow fair and efficient usage of the system, we use the SLURM resource manager to coordinate user demands. Read our SLURM user guide for detailed information on how to use SLURM.

Some Limits

  • There is a job walltime limit of ten days (240 hours).
  • We restrict each user to at most 5000 running and waiting jobs in total.
  • Each project has a 30-day running allocation of CPU hours. We do not forbid running jobs after the allocation is overdrawn; instead, we allow you to submit jobs with a very low queue priority, so that you may be able to run your jobs anyway if a sufficient number of nodes happens to be free on the system.
  • Very wide jobs will only be started within a maintenance window (just before the maintenance window or at the end of the maintenance window). These are planned for the first Wednesday of each month. On Tintin a "very wide" job asks for 54 nodes or more.

Convenience Variables

  • $SNIC_TMP - Path to node-local temporary disk space

    The $SNIC_TMP variable contains the path to a node-local temporary file directory that you can use when running your jobs, in order to get maximum disk performance (since the disks are local to the current compute node). This directory will be automatically created on your (first) compute node before the job starts and automatically deleted when the job has finished.

    The path specified in $SNIC_TMP is equal to the path: /scratch/$SLURM_JOB_ID, where the job variable $SLURM_JOB_ID contains the unique job identifier of your job.

    WARNING: Please note that in your "core" (see below) jobs, if you write data in the /scratch directory but outside of the /scratch/$SLURM_JOB_ID directory, your data may be automatically deleted during your job run.
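
As an illustration of the $SNIC_TMP usage described above, the following is a minimal sketch of a job script that stages data to the node-local disk, works there, and copies the result back before the job ends. The project id, the file names, and the function name run_on_local_scratch are hypothetical, and the sort command is only a placeholder for real work. The mktemp fallback is an assumption added so that the script can be tried outside Slurm, where $SNIC_TMP is not set.

```shell
#!/bin/bash
#SBATCH -A p0000000      # hypothetical project id
#SBATCH -n 1             # one core; a "core" job by default
#SBATCH -t 00:10:00      # requested walltime

# run_on_local_scratch INPUT OUTPUT
# Copy INPUT to node-local disk, process it there, copy the result to OUTPUT.
run_on_local_scratch() {
    local input="$1" output="$2"
    # Inside a job the system sets $SNIC_TMP; the fallback is only for dry runs.
    local scratch="${SNIC_TMP:-$(mktemp -d)}"
    cp "$input" "$scratch/input.dat"
    # Placeholder for real work, reading and writing on the local disk.
    sort "$scratch/input.dat" > "$scratch/result.dat"
    # Copy results back now: $SNIC_TMP is deleted when the job finishes.
    cp "$scratch/result.dat" "$output"
}

# Hypothetical call; replace the file names with your own:
# run_on_local_scratch input.dat sorted.dat
```

Keeping all intermediate files under $SNIC_TMP, rather than elsewhere in /scratch, also avoids the automatic deletion mentioned in the warning above.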

Details about the "core" and "node" partitions

A normal Milou node contains 128 GB of RAM and sixteen compute cores. An equal share of RAM for each core would mean that each core gets at most 8 GB of RAM. This simple calculation gives one of the limits mentioned below for a "core" job.

You need to choose between running a "core" job or a "node" job. A "core" job must keep within certain limits, to be able to run together with up to fifteen other "core" jobs on a shared node. A job that cannot keep within those limits must run as a "node" job.

Some serial jobs must run as "node" jobs. You tell Slurm that you need a "node" job with the flag "-p node". (If you forget to tell Slurm, you are by default choosing to run a "core" job.)

A "core" job:

  • Will use a part of the resources on a node, from a 1/16 share to a 15/16 share of a node.

  • Must request fewer than 16 cores, i.e. between "-n 1" and "-n 15".

  • Must not demand "-N", "--nodes", or "--exclusive".

  • Should preferably not demand "--mem".

  • Must not demand to run on a fat node (see below, for an explanation of "fat"), a devel node or a GPU node.

  • Must not use more than 8 GB of RAM for each core it demands. If a job needs half of the RAM, i.e. 64 GB, you also need to reserve at least half of the cores on the node, i.e. 8 cores, with the "-n 8" flag.
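
The core and memory limits in the list above can be sketched as a small calculation. The function names cores_for_mem and suggest_flags are hypothetical helpers, not UPPMAX tools; they assume whole gigabytes and thin nodes with 8 GB per core. A job that needs more than 128 GB additionally needs a fat node, which this sketch does not handle.

```shell
# cores_for_mem MEM_GB
# Smallest number of cores whose 8 GB shares cover MEM_GB (rounded up).
cores_for_mem() {
    echo $(( ($1 + 7) / 8 ))
}

# suggest_flags CORES MEM_GB
# Print "-n N" if the job fits the "core" limits, else "-p node".
suggest_flags() {
    local cores="$1" by_mem
    by_mem=$(cores_for_mem "$2")
    # The request must cover both the cores used and the RAM needed.
    if [ "$by_mem" -gt "$cores" ]; then
        cores=$by_mem
    fi
    if [ "$cores" -le 15 ]; then
        echo "-n $cores"
    else
        echo "-p node"
    fi
}

suggest_flags 1 64    # prints: -n 8
suggest_flags 4 20    # prints: -n 4
suggest_flags 1 128   # prints: -p node
```

The first call reproduces the example from the list: a serial job that needs 64 GB must still reserve 8 cores.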

A "core" job is accounted on your project as one "core hour" (sometimes also called a "CPU hour") per core you have been allocated, for each wallclock hour that it runs. On the other hand, a "node" job is accounted on your project as sixteen core hours for each wallclock hour that it runs, multiplied by the number of nodes that you have asked for.
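
The accounting rule above amounts to simple arithmetic, sketched here with two hypothetical helper functions (core_job_hours and node_job_hours are illustrative names, not UPPMAX commands). The sketch assumes whole hours; actual accounting may be finer-grained.

```shell
# core_job_hours CORES HOURS
# Core hours charged for a "core" job: one per allocated core per wallclock hour.
core_job_hours() {
    echo $(( $1 * $2 ))
}

# node_job_hours NODES HOURS
# Core hours charged for a "node" job: sixteen per node per wallclock hour.
node_job_hours() {
    echo $(( 16 * $1 * $2 ))
}

core_job_hours 4 10   # prints: 40
node_job_hours 2 10   # prints: 320
```

Note that a serial job that has to run as a "node" job is still charged for all sixteen cores, e.g. 160 core hours for ten wallclock hours on one node.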

Node types

Milou has two node types: thin nodes, the typical cluster nodes with 128 GB of memory, and fat nodes, with 256 GB or 512 GB of memory. You may request a node with more RAM by adding "-C fat" to your job submission line, which ensures that you get at least 256 GB of RAM on each node in your job.

If you absolutely must have more than 256 GB of RAM, you can specifically request 512 GB of RAM by adding "-C mem512GB" to your job submission line.

Please note that there are only 17 nodes with 256 GB of RAM, as well as 17 nodes with 512 GB of RAM.
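
The choice between the node types above can be sketched as follows. The function constraint_for_mem is a hypothetical helper and myjob.sh is a placeholder script name; the sketch treats the nominal memory sizes as fully usable, which is a simplification.

```shell
# constraint_for_mem MEM_GB
# Print the "-C" flag needed for a job that requires MEM_GB of RAM per node.
constraint_for_mem() {
    if [ "$1" -le 128 ]; then
        echo ""                 # a thin node is enough; no constraint needed
    elif [ "$1" -le 256 ]; then
        echo "-C fat"           # at least 256 GB
    elif [ "$1" -le 512 ]; then
        echo "-C mem512GB"      # 512 GB specifically
    else
        echo "no Milou node has more than 512 GB of RAM" >&2
        return 1
    fi
}

constraint_for_mem 200   # prints: -C fat
constraint_for_mem 400   # prints: -C mem512GB

# Hypothetical submission; fat nodes require a "node" job:
# sbatch -p node $(constraint_for_mem 400) myjob.sh
```

Since there are only a few fat nodes, requesting one only when the job really needs the memory is likely to shorten the time spent waiting in the queue.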


File storage and disk space

At UPPMAX we have a few different kinds of storage areas for files. See the Disk Storage User Guide for more information and recommended use.