Uppsala Multidisciplinary Center for Advanced Computational Science

Short introduction to using R at UPPMAX

Installation

Different versions of R are already available via the module system on milou, tintin and halvan. For example, at the time of writing we have the following on milou:

[johanhe@milou1 ~]$ module spider R
-------------------------------------------------------------------------------------------------------------------------------
R:
-------------------------------------------------------------------------------------------------------------------------------
Versions:
R/2.10.1
R/2.11.1
R/2.12.1
R/2.12.2
R/2.13.0
R/2.14.0
R/2.15.0
R/2.15.1
R/2.15.2
R/2.8
R/2.8.1
R/3.0.1
R/3.0.2
R/3.1.0
R/3.2.2
R/3.2.3
R/3.3.0

To load a specific version of R into your environment, just type e.g. "module load R/3.0.1". On the tintin cluster R is also accessible via the system-installed package, which at the time of writing is version 3.3.0.

[johanhe@tintin1 ~]$ which R
/usr/bin/R
[johanhe@tintin1 ~]$ R --version
R version 3.3.0 (2016-05-03) -- "Supposedly Educational"

How to install personal packages

To install personal packages in your own home directory you just type

install.packages("package_name")

as usual. That will install your packages under the path ~/R/[arch]/[version of R]/. You can then load a package by simply doing "library(package_name)" or "require(package_name)" in the R environment.
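A minimal sketch of this default workflow ("package_name" is a placeholder, as above); .libPaths() lets you confirm where your personal library actually is:

```r
# Where will install.packages() put things? The first writable entry
# in .libPaths() is used, typically ~/R/[arch]/[version of R]/.
.libPaths()

install.packages("package_name")   # installs into your personal library
library(package_name)              # loads it from there
```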

You can also specify a specific folder for where to put your packages, with

install.packages("package_name", lib="~/some/path/under/your/home/directory/")

But to then be able to find the package inside the R environment, you need to either export the R_LIBS_USER environment variable, or specify the "lib.loc" argument when calling require()/library(), e.g.

library(package_name, lib.loc='~/some/path/under/your/home/directory')
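To make the custom location permanent, you can export R_LIBS_USER from your shell startup file instead; R will then search that directory automatically. The path below is the same placeholder as above:

```shell
# Put this in your ~/.bash_profile so every new shell (and hence R)
# picks it up. Use your actual library directory instead of the placeholder.
export R_LIBS_USER=~/some/path/under/your/home/directory/
echo "R_LIBS_USER is set to: $R_LIBS_USER"
```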

Notice that if you are planning on running R on different clusters, it is probably wisest to specify the installation directory manually and to keep a separate directory for each cluster. Some of the clusters have different architectures (e.g. milou and tintin), which will render some packages unusable if you compile them on one system but try to run them on the other.
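One way to arrange such per-cluster libraries is to derive the library path from the hostname. This is only a sketch; the directory layout and the hostname pattern are assumptions you should adapt to your own setup:

```r
# Pick a library directory based on which cluster we are running on, so
# packages compiled on one architecture are never loaded on another.
host <- Sys.info()[["nodename"]]              # e.g. "milou1" or "tintin1"
cluster <- sub("[0-9]+$", "", host)           # strip the trailing node number
libdir <- file.path("~", "R-libs", cluster)   # e.g. ~/R-libs/milou
dir.create(libdir, recursive = TRUE, showWarnings = FALSE)

install.packages("package_name", lib = libdir)
library(package_name, lib.loc = libdir)
```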

How to use RStudio

The easiest way is to use the system-installed version, available via the "rstudio" command.

If you need a specific version you'll have to go to http://www.rstudio.com/ide/download/desktop and pick the tar file for 64-bit Fedora. Either download it to your local machine and then scp it over to your home directory at UPPMAX, or right-click on the link, copy the address, paste it into the terminal and download it with "wget [address]".

When you have downloaded the file, just unpack it with "tar xvfz [file]", then step into the folder and run the binary with "bin/rstudio". Make sure that you have logged in to UPPMAX with X forwarding enabled! (I.e. with the -X flag on Linux/Mac OS.)

You might also want to make a bash alias for starting rstudio, so that you don't have to type the whole path each time (e.g. put "alias rstudio=~/path/to/your/rstudio" inside your ~/.bash_profile).

If you're going to run heavier computations within RStudio, remember to do so inside an interactive session on one of the compute nodes, not on a login node. But if you mostly want to use it as a pretty code editor, you can run it on a login node as well.
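For example, you could request a one-core interactive session before launching RStudio. The project name and time limit below are placeholders; use your own:

```shell
# Request an interactive session on a compute node
# (replace "proj" with your own project name), then start RStudio there.
interactive -A proj -p devel -n 1 -t 1:00:00
rstudio
```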

How to install rJava

This is a short recipe for how to install rJava. If you need more hand holding, please contact us.

  1. module load R/3.0.1 gcc/4.8
  2. install the latest 64-bit version of the JDK locally (jdk1.7.0_25 at the time of writing)
  3. export JAVA_HOME pointing to your local installation directory (i.e. something like ~/some/path/jdk1.7.0_25/jre/) -- note that this has to be exported each time you want to load rJava
  4. R CMD javareconf -e
  5. start R
  6. install.packages("rJava", lib="~/path/to/where/you/want/to/install/", configure.args='--enable-jri=no')

How to run parallel jobs

There are several packages available for R that let you run parallel jobs. Some of them can only run on a single node, while others can leverage several machines. We will not go through all of the available options here, but simply illustrate how you can easily launch R computations on several nodes using packages built on top of the MPI architecture available at UPPMAX.

First we have to install the R package Rmpi (inside R, just type: install.packages("Rmpi")). Remember that you will also have to load the openmpi module before starting R, so that the MPI header files can be found (e.g. with the command "module load openmpi").

Batch example with Rmpi

We first create our batch script for SLURM, where we allocate two nodes ("-n 32" on milou, whose nodes have 16 cores each) and ask for the devel partition with the short QOS to get higher priority for our quick test job. Notice also how we tell MPI to launch only one master R process; it is later, inside the R script, that the master process spawns the worker processes.

[johanhe@milou1 slask]$ cat rmpi-test.slurm
#!/bin/bash -l
#SBATCH -A staff
#SBATCH -J rmpi-test
#SBATCH -o rmpi.out
#SBATCH -t 00:10:00
#SBATCH -p devel
#SBATCH --qos=short
#SBATCH -n 32
 
module load openmpi R/3.0.1
mpirun -n 1 R --no-save < rmpi-test.R
[johanhe@milou1 slask]$

We then create our R script as follows. Here we request 32 R worker processes ("slaves"), one for each CPU core allocated by our SLURM script above. The R process launched by mpirun is our master, which will reside on one of the cores as well, but won't be involved in the actual computational work.

[johanhe@milou1 slask]$ cat rmpi-test.R
library("Rmpi")
 
mpi.spawn.Rslaves(nslaves=32)
mpi.bcast.cmd(id <- mpi.comm.rank())
mpi.bcast.cmd(n <- mpi.comm.size())
mpi.bcast.cmd(host <- mpi.get.processor.name())
result <- mpi.remote.exec(paste("I am", id, "of", n, "running on", host))
print(unlist(result))
mpi.close.Rslaves(dellog = FALSE)
mpi.exit()
[johanhe@milou1 slask]$

Now we can schedule our job as usual with "sbatch rmpi-test.slurm". After a while you will see that there's a log file created with the output (as well as a little log file for each slave):

[johanhe@milou1 slask]$ tail -10 rmpi.out
"I am 28 of 33 running on m34" "I am 29 of 33 running on m34"
slave29 slave30
"I am 30 of 33 running on m34" "I am 31 of 33 running on m34"
slave31
"I am 32 of 33 running on m34"
> mpi.close.Rslaves(dellog = FALSE)
[1] 1
> mpi.exit()
[1] "Detaching Rmpi. Rmpi cannot be used unless relaunching R."
[johanhe@milou1 slask]$ cat q34.385+1.31322.log
Host: m34       Rank(ID): 9     of Size: 33 on comm 1
[1] "Done"
[johanhe@milou1 slask]$ ls -l *log |wc -l
32

For more usage examples, please see Rmpi's CRAN page.

Interactive example with snow

snow is a package that builds upon Rmpi, with the goal of making parallel computations easier. First we have to install it as usual with the install.packages() command. We then request our interactive session:

[johanhe@milou1 ~]$ interactive -A staff -p devel --qos=short -n 32
Your job may run for at most fifteen minutes.
There are free nodes, so your job is expected to start at once.
Waiting for job 4369906 to start...
Starting job now -- you waited for 1 second.

Then we load the required modules and start R:

[johanhe@m33 ~]$ module load openmpi R/3.0.1
mod: no compiler requested, try gcc4.4
mod: loaded OpenMPI 1.4.5, compiled with gcc4.4 (found in /opt/openmpi/1.4.5gcc4.4/)
[johanhe@m33 ~]$ R

And now we just type in our program. As in the batch example we ask for 32 R workers with makeCluster(); since the master process occupies one of the 32 allocated cores, 31 workers are actually spawned, as the output shows.

> library("Rmpi")
> library("snow")
> cl <- makeCluster(32, type = "MPI")
31 slaves are spawned successfully. 0 failed.
> myfun <- function(x = 2) { x + 1 }
> myfun_arg <- 5
> clusterCall(cl, myfun, myfun_arg)
[[1]]
[1] 6
 
[[2]]
[1] 6
 
[[3]]
[1] 6
 
( ... lots and lots of more output from each spawned R process ...)
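When you are done with the cluster, shut the workers down before quitting R; stopCluster() is snow's counterpart to the mpi.close.Rslaves() call in the batch example:

```r
# Stop the snow workers and detach Rmpi cleanly when finished
stopCluster(cl)
mpi.exit()
```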

More examples can be found on snow's CRAN page, as well as on the snow simplified page.

Problems!

Why are all CPU cores always 100% busy when using MPI?

To achieve the best possible performance, OpenMPI lets all the worker processes constantly poll for new messages from the master. This busy polling keeps the CPUs at 100% usage even when your R processes aren't doing any actual computational work.
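If the constant polling is a problem (for example when sharing or oversubscribing a node), OpenMPI can be told to yield the CPU while idle instead. This is a sketch based on the mpi_yield_when_idle MCA parameter in OpenMPI 1.x (the version shown in this guide); check the documentation for your OpenMPI version, as the exact parameter may differ:

```shell
# Ask OpenMPI workers to yield the CPU while waiting for messages,
# trading some message latency for lower idle CPU usage.
mpirun --mca mpi_yield_when_idle 1 -n 1 R --no-save < rmpi-test.R
```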