Differences

This shows you the differences between two versions of the page.

habrok:job_management:checking_jobs [2024/05/14 11:07] – [Interpreting jobinfo output] fokke
habrok:job_management:checking_jobs [2024/06/21 09:51] (current) – [jobinfo GPU example] admin
Line 62: Line 62:
  
 From the moment that a job is submitted, you can request relevant information about it using the jobinfo command. If you forgot the ID of the job you want information about, you can list all jobs that you have submitted with ''squeue'' (see above), [[habrok:advanced_job_management:getting_information_about_jobs_nodes_partitions|sacct or sstat]]. The jobinfo command basically combines relevant output of the ''squeue'', ''sacct'' and ''sstat'' commands. You can also use these commands directly, especially if you want more detailed information, such as the available node partitions, a list of all your submitted jobs, a list of jobs that are in the queue, or information about a node (that your job is running on).
 +
 +The code for the jobinfo command is available at: https://github.com/rug-cit-hpc/hb-jobinfo
  
 After you have submitted a job, you can request the information by using the command:
Line 101: Line 103:
 </code>
  
 +The jobinfo command supports the option ''-l'', which will show more advanced statistics.
 ===== Interpreting jobinfo output =====
  
 This information shows that the job ran for more than 4 days, while 10 days were requested. With this knowledge, similar jobs can be submitted with sbatch while requesting less time. By doing so, the SLURM scheduler might be able to schedule your job earlier than it would for a 10-day request.
  
-An important metric is the Efficiency. This is related to the number of requested cores (which is requested with --ntasks, --ntasks-per-node, and/or --cpus-per-task in the batch script). The number of cores requested in this example is 16. For an efficient job, the used CPU time should be almost 16 times the used walltime. In this case the used CPU time is much lower, leading to an efficiency of only 22.33%. This suggests that only 4 of the 16 requested cores were actually used. Given the fact that the job was run on four nodes with four tasks, this means that either only one node was actually used, or that only a single CPU core per task was used. If the program was started with srun, it should have been started on each node, which makes it quite probable that these tasks did not employ multithreading to start up more processes. The way to fix this should be checked in the program documentation.
+An important metric is the Efficiency. This is related to the number of requested cores (which is requested with ''--ntasks'', ''--ntasks-per-node'', and/or ''--cpus-per-task'' in the batch script). The number of cores requested in this example is 16. For an efficient job, the used CPU time should be almost 16 times the used walltime. In this case the used CPU time is much lower, leading to an efficiency of only 22.33%. This suggests that only about 4 of the 16 requested cores were actually used. Given that the job was run on four nodes with four tasks, this means that either only one node was actually used, or that only a single CPU core per task was used. If the program was started with srun, it should have been started on each node, which makes it quite probable that these tasks did not employ multithreading to start up more processes. The way to fix this should be checked in the program documentation.
 The low efficiency results in a hint being displayed.
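
The estimate of roughly 4 busy cores out of 16 can be reproduced with a small calculation. The Python sketch below is only an illustration: the helper name ''effective_cores'' is hypothetical and not part of jobinfo, and it assumes that the reported efficiency is the used CPU time divided by the product of requested cores and used walltime, as described above.

```python
def effective_cores(efficiency_percent, requested_cores):
    """Average number of cores kept busy, given an efficiency in percent.

    Hypothetical helper for illustration; not part of jobinfo.
    """
    return efficiency_percent / 100.0 * requested_cores


# Values taken from the example discussed above: 22.33% on 16 cores.
busy = effective_cores(22.33, 16)
print(f"{busy:.2f} of 16 cores busy on average")  # 3.57 of 16 cores busy on average
```

A result close to 4 matches the reading that only one core per task (or one node out of four) was doing work.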
  
Line 113: Line 116:
 The job used at most 8.71G of memory, so requesting 40GB is not very efficient. However, the amount requested per core is only 2.5 GB, so in this case it is not a big issue.
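
The per-core figure follows directly from the numbers above. A minimal Python sketch, assuming that the ''G'' reported by jobinfo and the requested ''GB'' denote the same unit; both helper names are hypothetical:

```python
def memory_per_core(total_memory_gb, cores):
    """Memory reserved per requested core, in the same unit as the input."""
    return total_memory_gb / cores


def memory_used_fraction(max_used_gb, reserved_gb):
    """Fraction of the reserved memory that the job actually used at its peak."""
    return max_used_gb / reserved_gb


# Values from the example above: 40 GB reserved for 16 cores, 8.71G used at most.
print(memory_per_core(40, 16))  # 2.5
print(f"{memory_used_fraction(8.71, 40):.0%} of the reserved memory was used")
```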
  
 +===== jobinfo GPU example =====
 +
 +Here is the output of a job that was using a GPU:
 +<code>
 +Job ID                         : 833913
 +Name                           : gpu_job
 +User                           : s_number
 +Partition                      : gpumedium
 +Nodes                          : a100gpu5
 +Number of Nodes                : 1
 +Cores                          : 16
 +Number of Tasks                : 1
 +State                          : COMPLETED  
 +Submit                         : 2024-05-11T18:44:22
 +Start                          : 2024-05-11T18:46:03
 +End                            : 2024-05-11T21:14:37
 +Reserved walltime              : 06:00:00
 +Used walltime                  : 02:28:34
 +Used CPU time                  : 23:20:49 (Efficiency: 58.93%)
 +% User (Computation)           : 86.69%
 +% System (I/O)                 : 13.31%
 +Total memory reserved          : 16G
 +Maximum memory used            : 4.29G
 +Requested GPUs                 : a100=1
 +Allocated GPUs                 : a100=1
 +Max GPU utilization            : 35%
 +Max GPU memory used            : 3.76G
 +</code>
 +
 +For a GPU job, information about GPU memory usage, GPU utilization and requested GPU resources is shown. The GPU utilization is the maximum utilization measured over the job's lifetime. Unfortunately, this number may therefore not be very informative, as there may have been long periods of much lower GPU utilization.
 +As you can see, CPU memory and GPU memory are reported separately, as they are different types of memory: CPU memory is connected to the CPU, while GPU memory is separate memory on the GPU board.
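
The CPU efficiency in the GPU example can be checked by hand from the walltime, CPU time and core count. The Python sketch below is an illustration under assumptions: the function names are hypothetical, durations are assumed to use the Slurm-style ''[D-]HH:MM:SS'' format, and efficiency is assumed to be CPU time divided by cores times walltime.

```python
def parse_duration(text):
    """Convert a duration of the form [D-]HH:MM:SS into seconds.

    The optional day prefix is an assumption based on the usual Slurm format.
    """
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_efficiency(cpu_time, walltime, cores):
    """CPU efficiency in percent: used CPU time / (cores * used walltime)."""
    return 100.0 * parse_duration(cpu_time) / (cores * parse_duration(walltime))


# Values from the GPU job output above.
efficiency = cpu_efficiency("23:20:49", "02:28:34", 16)
print(f"{efficiency:.2f}%")  # 58.93%
```

The result agrees with the 58.93% shown in the output. Note that this figure only covers the CPU cores; it says nothing about how well the GPU was used.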