The jobinfo tool may give hints on improving the efficiency of your jobs. A further explanation of these hints is given below.
In general, poor efficiency has two causes: inefficient use of the requested CPU cores, and inefficient use of the requested memory.
When jobinfo warns about your job's CPU efficiency, three cases are distinguished. We will explain each of these cases in more detail below.
This hint is given for a program that runs on a single CPU core only and is thus not running in parallel. When the CPU time used is much lower than the elapsed wall-clock time, this normally means that the program is waiting for data to be read from or written to the file system.
The CPU is then idle, waiting for these operations to finish. In general, the following issues can occur:
The following tips may help in reducing the problem:
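The symptom described above, CPU time far below wall-clock time, can be reproduced with the shell's time command. This is only a sketch: sleep stands in for a program that spends its time waiting on the file system.

```shell
# 'sleep' uses almost no CPU while wall-clock time keeps running,
# just like a program that is stalled on file-system I/O.
time sleep 2
# 'real' is ~2 s, while 'user' and 'sys' stay near 0 s:
# the process was waiting, not computing.
```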
This hint is given if the efficiency when requesting n cores is below 100/n percent. This basically means that the program is running on a single core, while n cores have been requested.
This is normally caused by the program not being parallelized at all. A program can only use multiple cores and/or nodes if it has been written in a way that supports this. Please note that ordinary code will not run in parallel by itself: the programmer has to specify in the program code that this is to be done, and how. This involves making use of special tools and libraries like OpenMP or MPI.
If the documentation of the program you are using does not state that the code has been adapted for parallel computation, you can safely assume that it will not run in parallel. So check your program's documentation for sections explaining how to make use of parallelism. If this is not documented, requesting multiple cores for your program does not make sense: it will only increase your waiting time, reduce your priority faster, and leave the resources you claimed unused. The latter also increases the waiting times for other users. So please don't request more cores than your program can use.
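To make the 100/n threshold concrete: with, say, 8 requested cores, an efficiency at or below 12.5% indicates that effectively only one core was doing work. The core count below is just an example value:

```shell
n=8   # number of cores requested (example value)
# A fully busy single core out of n cores yields an efficiency of 100/n percent.
awk -v n="$n" 'BEGIN { printf "single-core threshold: %.1f%%\n", 100 / n }'
# prints: single-core threshold: 12.5%
```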
If multiple cores and/or nodes have been claimed, but the CPU efficiency is higher than in the previous scenario, this lack of efficiency can have two causes:
Your program can find the number of cores available to it in the environment variable $SLURM_JOB_CPUS_PER_NODE, which is set by the scheduler. You can supply $SLURM_JOB_CPUS_PER_NODE to your program's arguments where it selects the number of threads. For OpenMP programs you may have to set the number of threads explicitly by setting $OMP_NUM_THREADS to the correct value, e.g.: export OMP_NUM_THREADS=$SLURM_JOB_CPUS_PER_NODE
Note that for hybrid MPI/threaded applications this works differently: there you have to use $SLURM_CPUS_PER_TASK, because in that case you have to differentiate between tasks and CPUs per task.
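As an illustration of the hybrid case, a job-script fragment might look as follows. This is a sketch: the program name my_hybrid_program and the task and thread counts are placeholders, not a recommendation.

```shell
#!/bin/bash
#SBATCH --ntasks=4            # 4 MPI tasks (example value)
#SBATCH --cpus-per-task=6     # 6 threads per task (example value)

# For hybrid MPI/OpenMP jobs the thread count per task comes from
# $SLURM_CPUS_PER_TASK, not from $SLURM_JOB_CPUS_PER_NODE, which counts
# all CPUs on the node rather than those of a single task.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./my_hybrid_program      # placeholder program name
```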
SLURM tries to monitor the memory usage of your application. If this memory usage is much lower than the amount you requested, you will get this hint.
This means that it is wise to check before the next run whether your program can run with less memory. Please take the following guidelines into account:
- If your job crashed with an OUT_OF_MEMORY status, you will just have to increase the requested amount of memory, or check whether you can reduce the memory requirement of your program.
- If you have not run into OUT_OF_MEMORY errors before, just check whether the amount of memory requested is still correct. If the program won't work with less memory, the memory reporting apparently does not capture the real memory usage of your program. We have especially seen this happen with Java programs.
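As a rough sketch of how to act on this hint, the function below derives a lower memory request from the peak usage SLURM reported (e.g. the MaxRSS column of sacct). The 25% head room and the name suggest_mem_mb are our own assumptions, not a site policy.

```shell
# Suggest a new memory request (in MB) from the originally requested amount
# and the peak usage reported by SLURM. Adds ~25% head room on top of the
# peak usage, but never suggests more than was requested originally.
suggest_mem_mb() {
    req_mb=$1
    peak_mb=$2
    suggested=$(( peak_mb + peak_mb / 4 ))
    if [ "$suggested" -lt "$req_mb" ]; then
        echo "$suggested"
    else
        echo "$req_mb"
    fi
}

suggest_mem_mb 16000 3000   # prints 3750: far less than the 16000 MB requested
```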