Resource allocation: Jobs and jobscripts

On the cluster users can execute sets of tasks, which are called jobs. These jobs are a combination of a set of requirements and a list of tasks that need to be run. A scheduling system will take these jobs and run them on the available resources.

For each job a user has to specify the amount of resources required for the job to run properly. The following resources can be requested:

  • A number of CPU cores per node
  • A number of nodes/computers
  • An amount of memory per core or node
  • The amount of time the job needs
  • Special hardware classes/partitions
  • Special features like GPUs

In order to ensure that the jobs of all users run as efficiently as possible, the scheduling system will give each job exclusive access to the resources it requested. This also means that a job cannot exceed these resource limits, as this would affect the other jobs running on the system!

When resources are available, jobs will start immediately. If the requested resources are not available, jobs will be put in a queue. The ordering of this queue is based on priority. High resource usage will lower your priority for new jobs. A period of low activity will cause your priority for new jobs to increase again.
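Detail: You can check whether your jobs are still waiting in the queue or already running with the squeue command. A minimal sketch, showing only your own jobs (the username is taken from the environment):

squeue -u $USER

Jobs shown in state PD (pending) are still waiting for resources or priority; jobs in state R are running.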

IMPORTANT For those who know most of this, or are too lazy to read the whole page, there are some important guidelines that you should know.

On the Hábrók cluster the SLURM resource scheduler is used. This means that jobs need to be specified according to SLURM syntax rules and that the SLURM commands have to be used. These commands are fully documented in the SLURM manual pages.

In order to run a job on the cluster, a job script should be constructed first. This script contains the commands that you want to run. It also contains special lines starting with: #SBATCH. These lines are interpreted by the SLURM workload manager.

All job scripts must start with a line describing which shell/program is used to interpret the script. Since most people want to use the command-line interpreter for the script, and use the default Bash shell, this first line should be:

#!/bin/bash
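Putting this together, a job script is an ordinary text file that starts with this line, followed by #SBATCH lines describing the requirements (explained below) and then the commands you want to run. A minimal sketch, with placeholder requirements and a hypothetical program name:

#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --mem=1GB

./my_program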

In order to submit jobs to the cluster one first has to describe the job requirements. Requirements have to be specified in terms of nodes, cores, memory and time. These requirements can be specified either on the command line or in the script which is to be submitted. Since for most people it will be easier to use scripts, we will only describe this option here.

The way to set job parameters in a script is by the inclusion of lines starting with:

#SBATCH

The resulting job script can then be submitted to the scheduling system using the command sbatch. The full documentation of the options available for sbatch can be found in the sbatch documentation. The most common options are described here.
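For example, if the requirements and commands are stored in a file called jobscript.sh (the name is just an illustration), the job can be submitted with:

sbatch jobscript.sh

sbatch will respond with the id of the new job, which you can use to track it in the queue and in the accounting information.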

The wall clock time requirement for a job can be specified using the --time parameter. The time can be specified in several formats: “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.

*IMPORTANT* A job is only allowed to run for the time that has been requested. After this time has passed the job will be stopped, regardless of its state. Therefore one has to make sure that the job will finish within this time.

Examples:

#SBATCH --time=12:00:00

This will request a time limit of 12 hours.

#SBATCH --time=2-10:30:00

This line requests 2 days, 10 hours and 30 minutes of wall clock time.

Detail: Wall clock time and CPU time. Within computer systems a distinction is made between wall clock time and CPU time. Wall clock time is the normal time that passes by and can be measured using, for example, two readings from a wall clock. CPU time is the fraction of this time that a CPU spends on calculations. Time during which the CPU is waiting for the operating system or for incoming data from the file system is not counted. When using a single CPU core the CPU time can therefore never be greater than the wall clock time. To make things more complex, a program can make use of multiple CPU cores. In that case the CPU time accumulates over all these cores and will therefore normally increase much faster than the time that passes on the wall clock in the same period.
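You can see this difference for a finished job by comparing the elapsed (wall clock) time with the accumulated CPU time in the accounting information. A sketch using the sacct tool, assuming a hypothetical job id 1234567:

sacct -j 1234567 --format=JobID,Elapsed,TotalCPU,NCPUS

For a job that keeps 4 CPU cores busy, the TotalCPU value can approach 4 times the Elapsed value.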

The requirements for nodes (full computers) and cores can be given using the parameters --nodes, --ntasks, --ntasks-per-node, --ntasks-per-core and --cpus-per-task. Here is a basic description of what they mean:

Parameter           Description                                                                                    Default value
--nodes             Number of nodes to use for the job                                                             1
--ntasks            Total number of tasks that the job will start (only useful if --ntasks-per-node is not used)   1
--ntasks-per-node   Number of tasks to start per node (only useful if --ntasks is not used)                        1
--cpus-per-task     Number of threads per task (for multithreaded applications)                                    1

IMPORTANT The numbers given here depend on the capabilities of the program being run. The number of tasks and/or CPUs per task may only be set higher than 1 for programs that can actually use multiple CPU cores. The number of nodes can only be higher than 1 for software that is capable of running on multiple physical computers, using network communication.

VERY IMPORTANT! If you don't know whether your program is capable of running in parallel, do not request multiple cores or nodes! In most cases this is useless and a waste of resources.

The precise requirements are determined both by the software and its scalability, and by the users themselves, who have to decide how to balance runtime, waiting time in the queue and the number of jobs they want to run.

Examples:

#SBATCH --cpus-per-task=4

This will assign four cores on a single machine to your job and is useful for multithreaded applications, e.g. MATLAB.

If you have an application that uses something like MPI to scale beyond a single machine, you can increase the number of nodes and tasks per node:

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4

This will request four machines and you can run four tasks (MPI processes) on each of them. Note that your software has to be able to actually use these resources!

Once again, if this is not the case just use the following:

#SBATCH --ntasks=1

Now only a single task is requested.

Note that if you only use --ntasks to request N cores, these N cores may be distributed over 1 to N nodes. If your software cannot handle this or if you do not want this, you will have to use the --nodes parameter to limit the number of nodes.
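For example, the following sketch requests 8 tasks and forces all of them onto a single node:

#SBATCH --nodes=1
#SBATCH --ntasks=8

Without the --nodes=1 line the 8 cores could end up spread over up to 8 different nodes.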

If your application uses MPI and may benefit from a high bandwidth (the amount of data transferred per second) and/or a low latency (the amount of time it takes for the first bit to arrive), you can send the job to the parallel partition. The nodes in this partition are equipped with an Omni-Path network adapter, which has 100 Gbps bandwidth and a latency of a few microseconds. You can do this by specifying the partition in the job script using:

#SBATCH --partition=parallel

Since there are only limited resources available in this partition, there are two important guidelines:

  1. When using just a few cores you might as well run your application on a single node
  2. It would be wise to test the performance difference between a job running on the regular nodes and one running on the Omni-Path nodes, since there may be more capacity available in the regular partition.
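To get an impression of how busy the partition is, you can for example list its nodes and their states (idle, mixed, allocated) with sinfo; a minimal sketch:

sinfo --partition=parallel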

Jobs also need a memory requirement. This can be specified using either --mem or --mem-per-cpu, as shown in the following table:

Parameter       Description
--mem           The amount of memory needed for the job per node, in megabytes
--mem-per-cpu   The amount of memory needed for the job per physical CPU core, in megabytes

Both options assume a number in megabytes is given. You can include a suffix to use a different unit: K or KB for kilobytes, M or MB for megabytes, G or GB for gigabytes, and T or TB for terabytes. Make sure to not include a space between the number and the suffix, e.g.:

#SBATCH --mem=2GB

Example:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=500

This will request 500MB of memory per physical core requested, hence 2000MB in total.

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=2G

This will just request 2GB on one node, regardless of the number of requested cores.

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --mem=64G

This will request two nodes, with 24 cores per node and 64GB of memory on each node.

#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --mem=4000

This will request two nodes, with 8 tasks distributed over these nodes. Each node will get 4000 megabytes of memory, which is shared by all tasks that end up on that node.

IMPORTANT Each job requests a number of cores and an amount of memory. When the job is running, it will be limited to the amounts requested. This has the following effects, which are important to keep in mind:

  • If your job or application starts more processes than the number of cores requested, these processes will have to share the requested cores. This does not benefit the performance of your software, but will normally slow it down.
  • Your job and the tasks it starts are limited to the amount of memory requested. More memory is not available and your program will stop if it tries to use it. The error messages you get depend on the software you use. The job status will show “OUT_OF_MEMORY”.
  • If your run takes longer than the time you requested, it will be stopped prematurely. The job status will show “TIMEOUT”.

Therefore, be sure to request the right amounts and, if possible, instruct your program such that it knows about these limits.

The only (easy) way to find out how much time and memory a run needs is to get these numbers from practice. If you have done similar calculations on your desktop or laptop, you may have some idea how long a run takes. You may also have obtained some idea about memory usage by looking at a task manager.

For the first runs you can then use overestimates for the time and memory requirements to make sure your calculations will not be aborted. Once you have gotten feedback from the scheduler about the actual time and memory consumption (one way to retrieve this is shown after the list below), you can then use more precise amounts. Some hints about reasonable sizes:

  • The memory on a standard Hábrók node is at least 4GB per core. So memory requests around 4GB are no problem at all.
  • For memory requests above 4GB/core you should check the job output for the actual memory usage and adjust the number for subsequent runs. VERY IMPORTANT Please don't request more than 10GB/core unless you are sure that your program needs it! You are wasting valuable resources others may need if you do.
  • VERY IMPORTANT Never request more than 1 CPU core if you don't know that your program can actually use multiple cores. Check the program documentation for information on this.
  • IMPORTANT When requesting multiple cores, check the actual speed gain with respect to runs using fewer cores. Most programs will not scale beyond a certain number of CPU cores or nodes. Beyond that point runs will not be any faster, so you will just be wasting resources without improving the time to result, while also experiencing longer waiting times for your jobs.
  • A reasonable time requirement will mainly improve the scheduler performance. Shorter jobs can be more easily scheduled. You will therefore benefit yourself if you don't request long times if you don't need them. There are also limits on the number of very long jobs that are allowed to run in the system simultaneously.
  • Smaller (CPU cores & memory) and shorter jobs can be more easily scheduled as fewer resources need to be freed up for them. They may even be squeezed in before large jobs that are waiting for resources to become available. So it is beneficial to use precise job requirements. But balance this with the fact that running out of memory or time will kill your job.
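One way to get this feedback for a finished job is the sacct command shown earlier, extended with memory fields; a sketch, assuming a hypothetical job id 1234567 (MaxRSS is the peak memory actually used, ReqMem the amount requested):

sacct -j 1234567 --format=JobID,Elapsed,TotalCPU,ReqMem,MaxRSS,State

If MaxRSS is far below ReqMem, or the Elapsed time is far below the requested time limit, you can lower the requirements for subsequent runs.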

The following table gives an overview and description of other useful parameters that can be used:

Parameter    Description
--job-name   A name for the job, which makes it easier to identify in the queue overview
--output     Name of the job output file (default: slurm-<jobid>.out). Use %j if you want to include the job id in the filename.
--partition  Specify in which partition the job has to run

Example:

#SBATCH --job-name=my_first_slurm_job
#SBATCH --output=job-%j.log
#SBATCH --partition=short

Depending on the parameters specified for the job, certain environment variables will be set inside the job; these generally correspond to the requested parameters.
The following table lists the most useful environment variables; the full list of all available environment variables can be found in the sbatch documentation. Their availability may depend on whether a certain parameter has been set or not.

Environment variable     Description
SLURM_JOB_ID             The job id of the job; can be useful for creating unique temporary directories or filenames.
SLURM_JOB_NUM_NODES      Number of nodes for the job (if --nodes is defined)
SLURM_NTASKS             Total number of tasks for the job (if --ntasks or --ntasks-per-node is defined)
SLURM_NTASKS_PER_NODE    Number of tasks per node for the job (if --ntasks-per-node is defined)
SLURM_TASKS_PER_NODE     Number of tasks per node for the job, including the number of nodes (always defined by SLURM)
SLURM_CPUS_PER_TASK      The number of CPUs per task (if --cpus-per-task has been defined)
SLURM_JOB_CPUS_ON_NODE   Total number of CPUs allocated on this node. This includes allocated hyperthreaded cores.
SLURM_NTASKS_PER_CORE    Should correspond to the setting of --ntasks-per-core.
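These variables can be used directly inside the job script. A common pattern, sketched below, is to pass the allocated core count to a multithreaded (OpenMP) program and to use the job id to create a unique working directory (the directory path and program name are just examples):

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mkdir -p /scratch/$USER/$SLURM_JOB_ID
./my_openmp_program

Note that SLURM_CPUS_PER_TASK is only set when --cpus-per-task has been specified for the job.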

In order to start tasks, the srun command can be used:

srun some_application <arguments>


When this command is used, SLURM will execute the task on the allocated resources; this is especially convenient if you request multiple nodes, as you do not have to worry about which nodes to use yourself. If you are familiar with MPI’s mpirun command, srun does a similar job and can actually be used to replace mpirun.

Furthermore, SLURM will also keep track of the status of these tasks in this case. If you have multiple srun invocations in your script, for instance if you want to run multiple sequential or parallel tasks, SLURM can show which one is currently running. This also allows you to get detailed information and accounting details about the resource usage for each individual step, instead of just getting a total overview for the entire job.
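As a sketch of what such a multi-step job could look like (the program names are placeholders), each srun line below becomes a separate job step that shows up individually in the accounting information:

srun ./prepare_input <arguments>
srun ./run_simulation <arguments>
srun ./postprocess_results <arguments>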

In case you just want to run one sequential task or if your application handles the parallelism itself (e.g. OpenMP-based applications), it is still possible to use srun, but you can also just run your program in the usual way.

A complete job script can then look as follows:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=2-12:00
#SBATCH --mem=4000

module purge
module load GROMACS/2021.5-foss-2021b

srun gmx_mpi <arguments>

This script will ask for 2 nodes and 4 tasks per node. The maximum runtime is 2 days and 12 hours. The amount of memory available for the job is almost 4 GiB per node. Once the job is executed, it will first clear the module environment and load the module for GROMACS 2021.5. To start a parallel (MPI) run, we use srun (instead of mpirun) to start all GROMACS processes on the allocated nodes.