R
Submitting a single job
Single CPU
In order to run a simple R script on one core, two files need to be created. The first file is the R script itself (with the .R extension); here it is called R_example1.R and it holds the following contents:
- R_example1.R
# Simple t-test between two equally large groups.
# Let us generate some data.
apples <- rnorm(20, mean = 1.5, sd = .1)
pears <- rnorm(20, mean = 1.6, sd = .1)
print(t.test(pears, apples))
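Because rnorm draws random numbers, the exact output will differ between runs. If you want reproducible results, you can add a seed at the top of the script; a minimal addition:

set.seed(42)  # any fixed integer makes the rnorm draws reproducible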
The second file is the job script that sends the R code, which computes the t-test for you, to the cluster. This file is called r_ex1.sh and can be edited with any text editor. These are its contents:
- r_ex1.sh
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=r_example1
#SBATCH --output=r_example1.out
#SBATCH --mem=1000

module purge
module load R/4.2.1-foss-2022a

Rscript R_example1.R
When both files are created and are in the same directory, the following command can be used to submit the job from that directory:
sbatch r_ex1.sh
The job is now submitted; use the squeue command if you want to see the status of your job:
squeue -u $USER
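A job only appears in the squeue output while it is pending or running. To inspect a job that has already finished, you can use sacct instead, where <jobid> is the job ID reported by sbatch:

sacct -j <jobid> --format=JobID,JobName,State,Elapsed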
A file named r_example1.out is created; open it with a text editor or viewer to see whether the job went well. If the file is empty, the job is probably still running (which can be verified by running the squeue command again).
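For example, you can print the file to the terminal with:

cat r_example1.out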
The final output should look like:
	Welch Two Sample t-test

data:  pears and apples
t = 4.3811, df = 37.978, p-value = 8.983e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.07295218 0.19828580
sample estimates:
mean of x mean of y
 1.616908  1.481289
Multiple CPUs
The following example shows how to submit an R job that uses multiple processors. In this job two matrices of 5000 by 5000 are multiplied, once using a single core and once using four cores.
The script below is used to submit the R job; the R code itself is shown after it. The batch script in this example is named R_parallel_batch.sh.
- R_parallel_batch.sh
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --job-name=R_example_parallel
#SBATCH --mem=10GB

module purge
module load R/4.2.1-foss-2022a

Rscript parallel.R
The file containing the R code is named parallel.R. Note that the packages parallel and snow are used, but others might be available as well. If necessary, additional packages have to be installed; see this page.
- parallel.R
library("snow") library("parallel") cpu <- Sys.getenv("SLURM_CPUS_ON_NODE", 1) # Number of cores requested (use 1 core if running outside a job). hosts <- rep("localhost",cpu) cl <- makeCluster(hosts, type = "SOCK") # Create random matrices. n <- 5000 A <- matrix(rnorm(n^2),n) B <- matrix(rnorm(n^2),n) # Single core time of matrix multiplication of matrices # A and B. message("Single core matrix multiplication time: ") system.time(A %*% B) # Multiple core time of matrix multiplication of # matrices A and B. message("Parallel matrix multiplication time: ") system.time(parMM(cl, A, B)) # Stop cores properly. stopCluster(cl)
Now we submit the job using:
sbatch R_parallel_batch.sh
The output shows how much time the matrix multiplication took on a single core and how long it took on 4 cores. Note that for less time-consuming (easier) tasks it can be faster to use 1 core rather than multiple cores, because each additional core takes time to initialize. It is possible to use the apply (or a similar) function with multiple CPUs in R by substituting the standard R functions with their parallel counterparts from the snow package. These functions are listed in the table below, followed by a short example:
| Standard R | snow |
|---|---|
| lapply | parLapply |
| sapply | parSapply |
| apply | parApply |
| apply (rows) | parRapply |
| apply (columns) | parCapply |
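As a minimal sketch (the input vector and function are just illustrations), replacing sapply by parSapply only requires passing the cluster object as an extra first argument:

library("snow")

# Create four local workers, as in parallel.R above.
cl <- makeCluster(rep("localhost", 4), type = "SOCK")

# sapply(1:1000, function(x) x^2) becomes:
squares <- parSapply(cl, 1:1000, function(x) x^2)

stopCluster(cl)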
Multiple nodes
In order to use multiple nodes with R, you can make use of an MPI cluster in your R code. Due to the way our MPI libraries are installed, it is not possible to use the makeCluster or makeMPIcluster functions, though. The correct way is to use the getMPIcluster function, as shown in the following example:
library("snow") library("parallel") cl <- getMPIcluster()
In your job script, make sure to request a number of tasks (with --ntasks or --ntasks-per-node) equal to the number of MPI tasks that you want. Furthermore, your R session has to be started in a special way, using srun and RMPISNOW:
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --job-name=R_example_parallel
#SBATCH --mem=10GB

module purge
module load R/4.2.1-foss-2022a

srun ${EBROOTR}/lib64/R/library/snow/RMPISNOW < parallel.R
GPU
In order to let R use the GPUs of the Hábrók cluster, we first have to install the R library gpuR. Unfortunately, this library has been removed from CRAN, so we will need to install it from GitHub. To do this, we first start an interactive job on one of the GPU nodes with an NVIDIA V100 GPU. You can similarly install it from a node with an NVIDIA K40 GPU, but we have fewer of those. This is how you start an interactive job from the command line on the login node:
srun --time=00:30:00 --partition=gpushort --gres=gpu:v100:1 --cpus-per-task=12 --pty bash -i
Once the interactive job starts, you will find yourself in a shell on a GPU node, where you can then load R:
module load R/4.0.0-foss-2020a
and start an interactive R session:
R
Within this session, installing gpuR is rather straightforward:
library(devtools)
install_github("cdeterman/gpuR")
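If the devtools package is not available yet for this R version, you can install it from CRAN first, inside the same R session:

install.packages("devtools")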
Create a personal library if you haven't done so before, and install the package. This may take some minutes. Once it is done, you can exit the interactive R session and terminate the interactive job as well, which will bring you back to the command line on the login node.
You can now create the batch script gpuExampleR.sh and the R script gpu.R. In this batch script we request one GPU on a GPU node.
- gpuExampleR.sh
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=10GB

module load R/4.0.0-foss-2020a

Rscript gpu.R
In this example we use an R script that performs a simple matrix multiplication on the GPU and the same multiplication on the CPU.
- gpu.R
library("gpuR") ORDER = 10000 A = matrix(rnorm(ORDER^2), nrow=ORDER) B = matrix(rnorm(ORDER^2), nrow=ORDER) gpuA = gpuMatrix(A, type="double") gpuB = gpuMatrix(B, type="double") print("cpu time: ") system.time(A %*% B) print("gpu time: ") system.time(gpuA %*% gpuB)
Now we can submit the job using sbatch gpuExampleR.sh.
In this example the matrix multiplication is slower on the GPU when small matrices (e.g. 100×100) are used; however, as the matrices get larger, the GPU scales better, as can be seen from the output.
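To observe this crossover yourself, you could time both versions for a few matrix sizes. A minimal sketch that varies the ORDER from gpu.R (the chosen sizes are just illustrations):

library("gpuR")

for (n in c(100, 1000, 5000)) {
  A <- matrix(rnorm(n^2), nrow = n)
  gpuA <- gpuMatrix(A, type = "double")
  cat("order", n,
      "cpu:", system.time(A %*% A)["elapsed"],
      "gpu:", system.time(gpuA %*% gpuA)["elapsed"], "\n")
}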
Installing additional R libraries
R provides the install.packages command to install additional libraries/packages. Before you start submitting jobs, you can use it to install the packages that you need in the following way:
- log in and load the R module that you would like to use, e.g.:
module load R/4.2.1-foss-2022a
- launch R by running the command:
R
- run the appropriate install.packages command for every package that you want to install (see the example after this list). The first time you do this, R will ask for permission to create a personal library directory in your home directory.
- quit R using:
q()
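For example, to install the CRAN package ggplot2 (any package name works the same way):

install.packages("ggplot2")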
Now the packages will be available for any job that you run. Note that you do have to do this again once you switch to a completely different R version.
Login node limits
Note that if you are trying to install something on the login node, you may run into certain limits that prevent you from using too much memory on the login node. See this question in the FAQ for more information.
Useful packages
One of our users has developed an R package to send function calls as jobs on Slurm via SSH. You can find the documentation and installation instructions on this GitHub Page.