On the cluster users can execute sets of tasks, which are called jobs. These jobs are a combination of a set of requirements and a list of tasks that need to be run. A scheduling system will take these jobs and run them on the available resources.
For each job a user has to specify the amount of resources required for the job to run properly. These requirements are expressed in terms of nodes, CPU cores, memory and wall clock time.
In order to ensure that the jobs of all users run as efficiently as possible, the scheduling system gives each job exclusive access to the resources it requested. This also means that a job cannot exceed these resource limits, as this would affect the other jobs running on the system!
When resources are available, jobs will start immediately. If the requested resources are not available, jobs will be put in a queue. The ordering of this queue is based on priority: high resource usage will lower your priority for new jobs, and a period of low activity will cause your priority to increase again.
IMPORTANT For those who know most of this, or are too lazy to read the whole page, there are some important guidelines that you should know.
On the Hábrók cluster the SLURM resource scheduler is used. This means that jobs need to be specified according to SLURM syntax rules and that the SLURM commands have to be used. These commands are fully documented in the SLURM manual pages.
In order to run a job on the cluster, a job script should be constructed first. This script contains the commands that you want to run. It also contains special lines starting with #SBATCH, which are interpreted by the SLURM workload manager.
All job scripts must start with a line describing which shell/program is used to interpret the script. Since most people want to use the command-line interpreter for the script, and use the default Bash shell, this first line should be:
#!/bin/bash
In order to submit jobs to the cluster one first has to describe the job requirements. Requirements have to be specified in terms of nodes, cores, memory and time. These requirements can be specified either on the command line or in the script which is to be submitted. Since for most people it will be easier to use scripts, we will only describe this option here.
The way to set job parameters in a script is by the inclusion of lines starting with:
#SBATCH
The resulting job script can then be submitted to the scheduling system using the sbatch command. The full documentation of the options available for sbatch can be found in the sbatch documentation. The most common options are described here.
The wall clock time requirement for a job can be specified using the --time parameter. The time can be specified in several formats: “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.
IMPORTANT A job is only allowed to run for the time that has been requested. After this time has passed, the job will be stopped, regardless of its state. Therefore one has to make sure that the job will finish within this time.
Examples:
#SBATCH --time=12:00:00
This will request a time limit of 12 hours.
#SBATCH --time=2-10:30:00
This line requests 2 days, 10 hours and 30 minutes of wall clock time.
Detail: Wall clock time and CPU time. Within computer systems a distinction is made between wall clock time and CPU time. Wall clock time is the normal time that passes by, which can be measured using, for example, two readings from a wall clock. CPU time is the fraction of this time that a CPU spends on calculations; time during which the CPU is waiting for the operating system or for data from the file system is not counted. When using a single CPU core, the CPU time can therefore never be greater than the wall clock time. However, a program can make use of multiple CPU cores. In that case the CPU time accumulates over all these cores and will therefore normally increase much faster than the time that passes on the wall clock in the same period.
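For example, a program that keeps four CPU cores fully busy for one hour of wall clock time accumulates roughly four hours of CPU time, while the wall clock advances by only one hour.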
The requirements for nodes (full computers) and cores can be given using the parameters --nodes, --ntasks, --ntasks-per-node, --ntasks-per-core and --cpus-per-task. Here is a basic description of what they mean:
Parameter | Description | Default value |
---|---|---|
--nodes | Number of nodes to use for the job | 1 |
--ntasks | Total number of tasks that the job will start (only useful if --ntasks-per-node is not used) | 1 |
--ntasks-per-node | Number of tasks to start per node (only useful if --ntasks is not used) | 1 |
--cpus-per-task | Number of threads per task (for multithreaded applications) | 1 |
IMPORTANT The numbers given here depend on the capabilities of the program being run. Only for programs that can use multiple CPU cores may the number of tasks and/or CPUs per task be set higher than 1. The number of nodes can only be higher than 1 for software that is capable of running on multiple physical computers, using network communication.
VERY IMPORTANT! If you don't know if your program is capable of running in parallel, do not request multiple cores, or nodes! In most cases this is useless and a waste of resources.
The precise requirements are determined both by the software and its scalability, and by the user, who has to decide how to balance runtime, waiting time in the queue and the number of jobs he or she wants to run.
Examples:
#SBATCH --cpus-per-task=4
This will assign four cores on a single machine to your job and is useful for multithreaded applications, e.g. MATLAB.
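Many multithreaded programs read the number of threads to use from an environment variable. A minimal sketch for an OpenMP-based program (the program name my_threaded_program is just a placeholder) that passes on the allocated core count:

# Tell the program how many threads it may use, based on the SLURM allocation
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./my_threaded_program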
If you have an application that uses something like MPI to scale beyond a single machine, you can increase the number of nodes and tasks per node:
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
This will request four machines and you can run four tasks (MPI processes) on each of them. Note that your software has to be able to actually use these resources!
Once again, if this is not the case just use the following:
#SBATCH --ntasks=1
Now only a single task is requested.
Note that if you only use --ntasks to request N cores, these N cores may be distributed over 1 to N nodes. If your software cannot handle this, or if you do not want this, you will have to use the --nodes parameter to limit the number of nodes.
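A sketch of how the two can be combined to keep all tasks on a single machine:

#SBATCH --nodes=1
#SBATCH --ntasks=4

This requests four tasks that will all be placed on one node.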
If your application is using MPI and may benefit from a high bandwidth (the amount of data transferred per second) and/or low latency (the amount of time it takes for the first bit to arrive), you can send the job to the parallel partition. The nodes in this partition are equipped with an Omni-Path network adapter, which has 100 Gbps bandwidth and a latency of a few microseconds. You can do this by specifying the partition in the job script using:
#SBATCH --partition=parallel
Since there are only limited resources available in this partition, only submit jobs to it that actually benefit from the fast interconnect; all other jobs should be submitted to the regular partition.
Jobs also need a memory requirement. This can be specified using either --mem or --mem-per-cpu, as shown in the following table:
Parameter | Description |
---|---|
--mem | The amount of memory needed for a job per node in megabytes |
--mem-per-cpu | The amount of memory needed for a job per physical cpu core in megabytes |
Both options assume a number in megabytes is given. You can include a suffix to use a different unit: K or KB for kilobytes, M or MB for megabytes, G or GB for gigabytes, and T or TB for terabytes. Make sure not to include a space between the number and the suffix, e.g.:
#SBATCH --mem=2GB
Example:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=500
This will request 500MB of memory per physical core requested, hence 2000MB in total.
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=2G
This will just request 2GB on one node, regardless of the number of requested cores.
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --mem=64G
This will request two nodes, with 24 cores per node and 64GB of memory on each node.
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --mem=4000
This will request two nodes, with 8 tasks distributed over these nodes, and 4000 megabytes of memory on each node. The tasks that land on the same node share that node's memory.
IMPORTANT Each job requests a number of cores and an amount of memory. When the job is running, it will be limited to the amounts requested. This has the following effects, which are important to keep in mind: a job that tries to use more memory than requested will be killed, and a job cannot use more CPU cores than requested, so processes that try to use more will have to share the requested cores and will run more slowly.
Therefore, be sure to request the right amounts and, if possible, instruct your program such that it knows about these limits.
The only (easy) way to find out how much time and memory a run needs is to get these numbers from practice. If you have done similar calculations on your desktop or laptop, you may have some idea how long a run takes. You may also have obtained some idea about memory usage by looking at a task manager.
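If a similar job has already run on the cluster itself, the SLURM accounting database can also report the actual usage afterwards. A minimal sketch using the standard sacct command (the job id is a placeholder):

sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,ReqMem

Here Elapsed shows the wall clock time used and MaxRSS the peak memory usage of each job step.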
For the first runs you can then use overestimates for the time and memory requirement to make sure your calculations will not be aborted. Once you have gotten feedback from the scheduler about the actual time and memory consumption you can then use more precise amounts. Some hints about reasonable sizes:
The following table gives an overview and description of other useful parameters that can be used:
Parameter | Description |
---|---|
--job-name | Specify a name for the job, which will be shown in the job overview |
--output | Name of the job output file (default: slurm-<jobid>.out). Use %j if you want to include a job id in the filename. |
--partition | Specify in which partition the job has to run |
Example:
#SBATCH --job-name=my_first_slurm_job
#SBATCH --output=job-%j.log
#SBATCH --partition=short
Depending on the parameters specified for the job, certain environment variables will be set. In principle those environment variables correspond to these parameters.
The following table lists the most useful environment variables; the full list of all available environment variables can be found in the sbatch documentation. Their availability may depend on whether a certain parameter has been set or not.
Environment variable | Description |
---|---|
SLURM_JOB_ID | The job id of the job; can be useful for creating unique temporary directories or filenames (see the sketch after this table). |
SLURM_JOB_NUM_NODES | Number of nodes for the job (if --nodes is defined) |
SLURM_NTASKS | Total number of tasks for the job (if --ntasks or --ntasks-per-node is defined) |
SLURM_NTASKS_PER_NODE | Number of tasks per node for the job (if --ntasks-per-node is defined) |
SLURM_TASKS_PER_NODE | Number of tasks per node for the job, including the number of nodes (always defined by SLURM) |
SLURM_CPUS_PER_TASK | The number of CPUs per tasks (if --cpus-per-task has been defined) |
SLURM_JOB_CPUS_ON_NODE | Total number of CPUs allocated on this node. This includes allocated hyperthreaded cores. |
SLURM_NTASKS_PER_CORE | Should correspond to the setting of --ntasks-per-core. |
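As an illustration of using these variables, the following sketch creates a job-specific working directory based on the job id (the /scratch path is only an assumed example location; adjust it to your cluster):

# Create a unique working directory for this job
mkdir -p /scratch/$USER/$SLURM_JOB_ID
cd /scratch/$USER/$SLURM_JOB_ID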
In order to start tasks, the srun command can be used:
srun some_application <arguments>
When this command is used, SLURM will execute the task on the allocated resources; this is especially convenient if you request multiple nodes, as you do not have to worry about which nodes to use yourself. If you are familiar with MPI’s mpirun command, srun does a similar job and can actually be used to replace mpirun.
Furthermore, SLURM will also keep track of the status of these tasks in this case. If you have multiple srun invocations in your script, for instance if you want to run multiple sequential or parallel tasks, SLURM can show which one is currently running. This also allows you to get detailed information and accounting details about the resource usage for each individual step, instead of just getting a total overview for the entire job.
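For instance, a script body with two sequential job steps (the program names are placeholders) could contain:

srun preprocess_data <arguments>
srun run_simulation <arguments>

Each srun invocation then shows up as a separate job step in the status and accounting information.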
In case you just want to run one sequential task or if your application handles the parallelism itself (e.g. OpenMP-based applications), it is still possible to use srun, but you can also just run your program in the usual way.
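As a minimal sketch of this last case, a job script for a single sequential task (the program name is a placeholder) could look like:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=1000

./my_sequential_program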
A complete job script can then look as follows:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=2-12:00
#SBATCH --mem=4000

module purge
module load GROMACS/2021.5-foss-2021b

srun gmx_mpi <arguments>
This script will ask for 2 nodes and 4 tasks per node. The maximum runtime is 2 days and 12 hours. The amount of memory available for the job is almost 4 GiB per node. Once the job is executed, it will first clear any loaded modules and then load the module for GROMACS 2021.5. To start a parallel (MPI) run, we use srun (instead of mpirun) to start all GROMACS processes on the allocated nodes.