Getting information about submitted jobs
Using squeue
To get information about the jobs running on the cluster, the squeue command is available. The command shows a (long) list of jobs in the system. Here is a shortened example of squeue output:
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
 4983     nodes  testjob  p456789 PD  0:00    20 (Resources)
 4984     nodes  testjob  p456789 PD  0:00    20 (Priority)
 4985     nodes  testjob  p456789 PD  0:00    20 (Priority)
 4986     nodes  testjob  p456789 PD  0:00    20 (Priority)
 4987     nodes  testjob  p456789 PD  0:00    20 (Priority)
 4978     nodes  testjob  p456789  R  0:01    20 node[041-060]
 4979     nodes  testjob  p456789  R  0:01    20 node[061-080]
 4980     nodes  testjob  p456789  R  0:01    20 node[081-100]
 4981     nodes  testjob  p456789  R  0:01    20 node[101-120]
 4982     nodes  testjob  p456789  R  0:01    20 node[121-140]
 4976     nodes  testjob  p456789  R  0:04    20 node[001-020]
 4977     nodes  testjob  p456789  R  0:04    20 node[021-040]
Hint: Since the output of squeue is very long, it is useful to send the output to less for easier viewing:
squeue | less
Exit less with the q key.
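If you are only interested in a single job and know its job id, you can also limit the output to that job with the -j option. For example, using one of the job ids from the listing above:
squeue -j 4983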
Explanation of squeue output
By default the following columns are shown in the output:
JOBID | The job id, which is used by the system to refer to the job |
PARTITION | The partition the job is submitted to |
NAME | The name that has been given to the job by the user |
USER | The user id of the user that submitted the job |
ST | The job status (explained below) |
TIME | The time the job has been running |
NODES | The number of nodes requested by the job. The number of cores requested on these nodes is not shown |
NODELIST(REASON) | The reason a job is waiting (explained below) or the nodes allocated to the job |
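The columns that squeue shows can be adjusted with the -o (or --format) option. As a sketch, the following reproduces roughly the default layout; see man squeue on the cluster for the exact format codes supported by the installed version:
squeue -o "%.8i %.10P %.12j %.10u %.2t %.10M %.6D %R"
Here %i is the job id, %P the partition, %j the job name, %u the user, %t the state, %M the elapsed time, %D the number of nodes and %R the node list or waiting reason.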
The status of a job can be the following:
PD | Pending, the job is waiting |
R | Running, the job is running on one or more nodes of the cluster |
CG | Job is completing |
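If you only want to see jobs in a certain state, you can filter on it with the -t (or --states) option. For example, to list only the pending jobs:
squeue -t PD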
The reasons why a job has not started yet can be the following:
(Resources) | The job is waiting for resources to be available |
(Priority) | The job does not have enough priority compared to other jobs |
(ReqNodeNotAvail) | The nodes required for the job are not available. This can be because of upcoming maintenance, or because nodes are down due to issues. |
(QosGrpCpuLimit) | The job has hit the limit on the number of cores that are allowed to be in use for long-running jobs |
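For pending jobs, squeue can also show the scheduler's estimate of when they are expected to start. Keep in mind that this is only an estimate and may change as other jobs finish earlier or later than planned:
squeue --start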
Finding your own jobs
To find the jobs that you submitted, the -u option of squeue can be used. After -u, a username has to be supplied. In this example we use $USER, which the shell replaces with your own username:
squeue -u $USER
This command will only show the jobs submitted by the user given after -u.
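Options can be combined as well. For example, to list only your own jobs that are still waiting, or to refresh the overview every 30 seconds using the standard Linux watch utility (assuming it is installed on the login node):
squeue -u $USER -t PD
watch -n 30 squeue -u $USER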
Using jobinfo
From the moment that a job is submitted, you can request relevant information about it using the jobinfo command. If you have forgotten the job ID that you want the information for, you can list the jobs you have submitted with squeue (see above), sacct or sstat. The jobinfo command basically combines the relevant output of the squeue, sacct and sstat commands. It is also possible to use these commands themselves, especially if you want more detailed information about your jobs, such as information about available node partitions, a list of all your submitted jobs, a list of jobs that are in the queue, or information about a node (that your job is running on).
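As a sketch, the commands below show how sacct and sstat can be queried directly; replace jobID with the id of your own job. The field names used here are standard Slurm accounting fields, but check the man pages of the installed version for the full list. sacct reports accounting data for finished jobs, while sstat reports resource usage of jobs that are still running (for batch jobs you may need to query the jobID.batch step):
sacct -j jobID --format=JobID,JobName,Elapsed,State,MaxRSS
sstat -j jobID --format=JobID,AveCPU,MaxRSS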
After you have submitted a job, you can request the information by using the command:
jobinfo jobID
E.g. jobinfo 633658 will give the following information:
Name                : tutorial
User                : p123456
Partition           : regular
Nodes               : node12
Cores               : 2
State               : COMPLETED
Submit              : 2015-09-11T13:03:03
Start               : 2015-09-11T13:03:19
End                 : 2015-09-11T13:03:40
Reserved walltime   : 00:20:00
Used walltime       : 00:00:21
Used CPU time       : 00:00:20
% User (Computation): 23.87%
% System (I/O)      : 76.13%
Mem reserved        : 1000M/node
Max Mem used        : 42.58M (node012)
Max Disk Write      : 0.00 (node012)
Max Disk Read       : 51.20K (node012)
Interpreting jobinfo output
This information shows that the job ran for 21 seconds, while 20 minutes were requested. With this knowledge, similar jobs can be submitted with sbatch while requesting less time for the resources. By doing so, the SLURM scheduler might be able to schedule your job earlier than it would for a 20 minute request. In this case 20 minutes is not a lot, but if your job runs for hours or more, you might profit from requesting resources for the time the job actually needs instead of ten times more.
The same is true for the number of requested cores (requested with --ntasks, --ntasks-per-node, and/or --cpus-per-task in the batch script). The number of cores requested in this example is 2: for an efficient job, the used CPU time should then be about twice the used walltime. This is not the case here, because the two values are about the same. This implies that one core was used while the other was doing nothing. Hence, the number of cores requested for this job should have been 1 and not 2. By doing so, the SLURM scheduler is able to run your job earlier, while you do not lose any time, performance or accuracy. Furthermore, your fairshare decreases less when you request fewer cores, meaning that your next jobs will get a higher priority.
Finally, we look at the amount of memory reserved. Each standard node has 128GB of memory and 24 cores, meaning that on average ~5GB per core is available. For simple jobs this should be more than enough. If you do request more than 5GB of memory, it can be useful to check the “Max Mem used” value with jobinfo afterwards to see whether you really needed the extra memory, and to adjust (similar) future jobs accordingly. In this case at most ~42MB was used by the job, so requesting 1000MB is also not that efficient (100MB would have been enough).
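As a sketch of what this could look like in practice, the job above could be resubmitted with requests that are closer to what jobinfo reported. The values and the program name below are only an illustration; always leave some margin above the measured usage:
#!/bin/bash
#SBATCH --job-name=tutorial
#SBATCH --partition=regular
#SBATCH --ntasks=1            # jobinfo showed that only one core was actually used
#SBATCH --time=00:05:00       # the job needed 21 seconds; 5 minutes leaves a comfortable margin
#SBATCH --mem=100M            # the maximum memory used was ~42MB

srun ./my_program             # placeholder for your own program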