{{indexmenu_n>4}}
====== Getting information about submitted jobs ======
===== Using squeue =====
To get information about the jobs running on the cluster, the [[http://slurm.schedmd.com/squeue.html|squeue]] command is available. The command shows a (long) list of all jobs in the system. Here is a shortened example of ''squeue'' output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4983 nodes testjob p456789 PD 0:00 20 (Resources)
4984 nodes testjob p456789 PD 0:00 20 (Priority)
4985 nodes testjob p456789 PD 0:00 20 (Priority)
4986 nodes testjob p456789 PD 0:00 20 (Priority)
4987 nodes testjob p456789 PD 0:00 20 (Priority)
4978 nodes testjob p456789 R 0:01 20 node[041-060]
4979 nodes testjob p456789 R 0:01 20 node[061-080]
4980 nodes testjob p456789 R 0:01 20 node[081-100]
4981 nodes testjob p456789 R 0:01 20 node[101-120]
4982 nodes testjob p456789 R 0:01 20 node[121-140]
4976 nodes testjob p456789 R 0:04 20 node[001-020]
4977 nodes testjob p456789 R 0:04 20 node[021-040]
Hint: Since the output of ''squeue'' is very long, it is useful to send the output to ''less'' for easier viewing:
squeue | less
Exit ''less'' with the ''q'' key.
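If the long node lists cause lines to wrap, you can additionally pass the ''-S'' option to ''less'' to chop long lines instead:
  squeue | less -S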
==== Explanation of squeue output ====
By default the following columns are shown in the output:
| JOBID | The job id, which is used by the system to refer to the job |
| PARTITION | The partition the job has been submitted to |
| NAME | The name that has been given to the job by the user |
| USER | The user id of the user that submitted the job |
| ST | The job status. This will be explained below |
| TIME | The time the job has been running |
| NODES | The number of nodes requested by the job. The number of cores requested on these nodes is not shown |
| NODELIST(REASON) | The reason a job is waiting (explained below) or the nodes allocated to the job. |
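If you prefer a different selection of columns, the ''--format'' (or ''-o'') option of ''squeue'' can be used. As a sketch, the following roughly reproduces the default columns shown above; the format codes are described in the squeue manual page:
  squeue --format="%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"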
The status of a job can be the following:
| PD | Pending, the job is waiting |
| R | Running, the job is running on one or more nodes of the cluster |
| CG | Job is completing |
The reasons for not having started yet can be the following:
| (Resources) | The job is waiting for resources to be available |
| (Priority) | The job does not have enough priority compared to other jobs |
| (ReqNodeNotAvail) | The nodes required for the job are not available. This can be because of upcoming maintenance, or because nodes are down due to issues. |
| (QosGrpCpuLimit) | The job has hit the limits on the number of cores that are allowed to be in use for long running jobs |
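To show only the jobs that are still waiting, together with their reason, you can filter on the pending state using the ''-t''/''--states'' option of ''squeue'':
  squeue -t PENDING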
==== Finding your own jobs ====
To find the jobs that you submitted yourself, the ''-u'' option of ''squeue'' can be used. The username has to be supplied after the ''-u''. In this example we use ''$USER'', which the shell will replace by your own username:
squeue -u $USER
This command will only show the jobs submitted by the user given after the ''-u''.
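The ''-u'' option can be combined with other ''squeue'' options. For example, adding ''--start'' will also show an estimate of when your pending jobs are expected to start (if the scheduler can already provide one):
  squeue -u $USER --start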
===== Using jobinfo =====
From the moment a job has been submitted, you can request relevant information about it using the ''jobinfo'' command. If you have forgotten the ID of the job you want information about, you can list all jobs you have submitted using ''squeue'' (see above), or using [[habrok:advanced_job_management:getting_information_about_jobs_nodes_partitions|sacct or sstat]]. The ''jobinfo'' command basically combines the relevant output of ''squeue'', ''sacct'' and ''sstat''. It is also possible to use these commands directly, especially if you want more detailed information about your jobs, such as information about available node partitions, a list of all your submitted jobs, a list of jobs that are still in the queue, or information about a node (that your job is running on).
The code for the jobinfo command is available at: https://github.com/rug-cit-hpc/hb-jobinfo
After you have submitted a job, you can request this information with the command:
jobinfo jobID
\\
E.g. ''%%jobinfo 633658%%'' will give the following information:
Job ID : 633658
Name : My_job
User : p_number
Partition : regularlong
Nodes : node[6-7,14,19]
Number of Nodes : 4
Cores : 16
Number of Tasks : 4
State : COMPLETED
Submit : 2024-04-01T12:46:52
Start : 2024-04-01T16:15:22
End : 2024-04-05T20:30:22
Reserved walltime : 10-00:00:00
Used walltime : 4-04:15:00
Used CPU time : 14-22:06:02 (Efficiency: 22.33%)
% User (Computation) : 99.77%
% System (I/O) : 0.23%
Total memory reserved : 40G
Maximum memory used : 8.71G
Hints and tips :
1) The program efficiency is low. Your program is not using the assigned cores
effectively. Please check if you are using all the cores you requested.
You may also need to check the file in- and output pattern of your program.
2) You requested much more CPU memory than your program used.
Please reduce the requested amount of memory.
*) For more information on these issues see:
https://wiki.hpc.rug.nl/habrok/additional_information/job_hints
The jobinfo command supports the option ''-l'', which will show more advanced statistics.
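For example, to show these extended statistics for the job above (assuming the option is given before the job ID):
  jobinfo -l 633658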
===== Interpreting jobinfo output =====
This information shows that the job ran for a bit more than 4 days, while 10 days were requested. With this knowledge, similar jobs can be submitted with ''sbatch'' while requesting less time. By doing so, the SLURM scheduler may be able to schedule your job earlier than it would a job with a 10-day request.
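As a sketch, a similar job could then be submitted with a lower, but still safe, time limit in its batch script; the exact value is of course an estimate you have to make yourself:
  #SBATCH --time=5-00:00:00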
An important metric is the efficiency. This is related to the number of requested cores (which is requested with ''--ntasks'', ''--ntasks-per-node'', and/or ''--cpus-per-task'' in the batch script). The number of cores requested in this example is 16. For an efficient job, the used CPU time should be almost 16 times the used walltime. In this case the used CPU time is much lower, leading to an efficiency of only 22.33%. This suggests that only 4 of the 16 requested cores were actually used. Given that the job ran with four tasks on four nodes, this means that either only one node was actually used, or that each task used only a single CPU core. If the program was started with srun, it should have been started on each node, which makes it quite probable that the tasks did not use multithreading to start more processes. How to fix this should be checked in the documentation of the program.
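As a quick check, the reported efficiency can be reproduced from the numbers above: it is the used CPU time divided by the used walltime multiplied by the number of cores. The snippet below simply converts the reported times to seconds:
  # Values taken from the jobinfo example above
  cpu_time=$(( 14*86400 + 22*3600 + 6*60 + 2 ))  # Used CPU time 14-22:06:02 in seconds
  walltime=$(( 4*86400 + 4*3600 + 15*60 ))       # Used walltime 4-04:15:00 in seconds
  cores=16
  # Efficiency = CPU time / (walltime * cores); prints ~22.3, matching the reported 22.33%
  echo "scale=2; 100 * $cpu_time / ($walltime * $cores)" | bc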
The low efficiency results in a hint being displayed.
Not using the resources you requested is problematic, because somebody else could have used them instead. Furthermore, all allocated resources are attributed to your cluster usage, which reduces the priority of your next jobs more than necessary. Requesting more resources than needed can also increase the waiting time of your job, as it will take more time for these resources to become available.
Finally, we look at the amount of memory reserved. Each standard node has 512GB of memory and 128 cores, meaning that on average 4GB per core is available. For simple jobs this should be more than enough. If you do request more than 4GB of memory, it is useful to check the "Maximum memory used" reported by jobinfo afterwards, to see whether you really needed the extra memory. You can then adjust the requested amount of memory for similar future jobs.
In this case a maximum of 8.71GB was used, so requesting 40GB is not very efficient. Since the requested amount per core is only 2.5GB, which is well below the 4GB per core that is available on average, it is not a big issue here.
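For a similar future job, the memory request could nevertheless be lowered; a minimal sketch, keeping some margin above the 8.71GB that was actually used:
  #SBATCH --mem=12G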
===== jobinfo GPU example =====
Here is the output of a job that was using a GPU:
Job ID : 833913
Name : gpu_job
User : s_number
Partition : gpumedium
Nodes : a100gpu5
Number of Nodes : 1
Cores : 16
Number of Tasks : 1
State : COMPLETED
Submit : 2024-05-11T18:44:22
Start : 2024-05-11T18:46:03
End : 2024-05-11T21:14:37
Reserved walltime : 06:00:00
Used walltime : 02:28:34
Used CPU time : 23:20:49 (Efficiency: 58.93%)
% User (Computation) : 86.69%
% System (I/O) : 13.31%
Total memory reserved : 16G
Maximum memory used : 4.29G
Requested GPUs : a100=1
Allocated GPUs : a100=1
Max GPU utilization : 35%
Max GPU memory used : 3.76G
For a GPU job, information about the requested GPU resources, the GPU memory usage and the GPU utilization is shown. The GPU utilization is the maximum utilization measured over the job's lifetime. This number may therefore not be very meaningful, as there may have been long periods of much lower GPU utilization.
As you can see, CPU memory and GPU memory are reported separately, as they are different types of memory: CPU memory is attached to the CPU, while GPU memory is separate memory on the GPU board.
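For reference, a GPU like the one in this example is requested in the job script via the partition and a GPU option; a minimal sketch, assuming one A100 GPU on the ''gpumedium'' partition shown above:
  #SBATCH --partition=gpumedium
  #SBATCH --gpus-per-node=a100:1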