====== Job profiling ======

Job profiling is a method to monitor and gain insight into the resource usage of your job. This can be very helpful in determining whether you are requesting the right number of cores, nodes, and amount of memory for your job. On this page we describe two different methods that can be used to profile your job: the job summary at the end of the job output, and the job profiling plugin for the SLURM scheduler.

===== Job summary =====

The job information that is printed at the end of each job output already gives a quick overview of the resource usage of your job. It might look like this:

<code>
Job details:
============

Name                 : job_example1
User                 : p123456
Partition            :
Nodes                : pg-node058
Cores                : 12
State                : COMPLETED
Submit               :
Start                : 2018-04-11T14:
End                  : 2018-04-11T15:
Reserved walltime    :
Used walltime        :
Used CPU time        : 03:25:57 (efficiency: 95%)
% User (Computation) :
% System (I/O)       : 55.58%
Mem reserved         :
Max Mem used         : 6.48G (pg-node058)
Max Disk Write       : 30.72K (pg-node058)
Max Disk Read        : 1.05M (pg-node058)
</code>

From these statistics you can already get a first impression of the resource usage of your jobs. For instance, you can find the maximum memory usage and the used CPU time. From the latter you can tell whether the job was able to use all the cores that were requested (12 in this example). Ideally, the amount of CPU time divided by the walltime should be equal or close to the number of requested cores. In the example output we have:

<code>
(Used CPU time) / (Used walltime) = 11.4
</code>

This means that the job on average used 11.4 CPUs, while 12 CPUs were requested: this corresponds to a CPU efficiency of about 95%. Most programs have to do some sequential steps (like reading input or writing output), so getting 100% is not very realistic, and 95% is perfectly fine.
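
You can also perform this check yourself for any finished job by querying the SLURM accounting database with the standard ''sacct'' command. The sketch below uses a placeholder job ID; ''Elapsed'', ''TotalCPU'' and ''AllocCPUS'' are standard SLURM accounting fields.

<code>
# Replace the placeholder job ID 1234567 with the ID of your own job.
# Elapsed is the used walltime, TotalCPU the CPU time summed over all cores,
# and AllocCPUS the number of allocated cores, so the CPU efficiency is
# TotalCPU / (Elapsed * AllocCPUS).
sacct -j 1234567 --format=JobID,Elapsed,TotalCPU,AllocCPUS
</code>

For job_example1, a CPU time of 03:25:57 corresponds to 12357 seconds; assuming, purely as an illustration, a used walltime of 00:18:04 (1084 seconds), this gives 12357 / 1084 ≈ 11.4 CPUs on average, or 11.4 / 12 ≈ 95% efficiency.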

Things are worse for the following job:

<code>
Job details:
============

Name                 : job_example2
User                 : p123456
Partition            :
Nodes                : pg-node190
Cores                : 24
State                : COMPLETED
Submit               :
Start                : 2018-04-11T11:
End                  : 2018-04-11T11:
Reserved walltime    :
Used walltime        :
Used CPU time        : 00:12:11 (efficiency: 55%)
% User (Computation) :
% System (I/O)       : 76.71%
Mem reserved         :
Max Mem used         : 511.48M (pg-node190)
Max Disk Write       : 20.48K (pg-node190)
Max Disk Read        : 1.06M (pg-node190)
</code>

This job has a CPU efficiency of only 55%, which means that it only used about half of the computing power that was allocated to it. However, these single numbers for CPU time and memory usage do not provide any further detail. It would be informative to visualize how the job behaves over time, in order to find bottlenecks or parts of the program that are not behaving as they should. This is where the job profiling plugin can help.

===== Job profiling plugin =====

A plugin for the SLURM scheduler is available that will monitor the behavior of your job and store a sample of job details for each sampling interval. A web-based dashboard has been set up that can be used to visualize the data gathered in this way.

If you want to profile your job, first add the following line to your job script:

<code>
#SBATCH --profile=task
</code>

The default sampling interval is 30 seconds. If you want to change this (note that a shorter interval may lead to degraded performance!), you can set a different interval by also adding a line like the following, which sets it to 60 seconds:

<code>
#SBATCH --acctg-freq=task=60
</code>
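
Putting the two directives together, a minimal job script with profiling enabled could look like the sketch below. The job name, resource requests, and the program being run (''my_program'') are placeholders; only the last two ''#SBATCH'' lines are specific to profiling.

<code>
#!/bin/bash
#SBATCH --job-name=profiling_example
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=8G
#SBATCH --profile=task
#SBATCH --acctg-freq=task=60

# Placeholder workload: replace this with your own program.
srun ./my_program
</code>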

Once your job is running, it will start collecting information. To visualize this information, open the job profiling dashboard, which will show plots of the metrics described below for your job.

==== Explanation of different metrics ====

The job profiling dashboard currently shows plots for six metrics:

=== Cumulative CPU time over all previous sampling intervals ===

This plot shows the total CPU time accumulated over the sampling intervals; the last sampling point should therefore be more or less equal to the total CPU time of the entire job.

=== CPU Utilization ===

This shows the CPU utilization for each sampling interval. Note that 100% represents one core at full utilization, so a job that fully uses //n// cores will show up as //n// × 100%.

=== (Virtual) memory usage ===

These plots show the maximum (virtual) memory usage during the sampling interval.

=== Read from / Write to disk ===

This metric represents the amount of data that was read from or written to disk during the sampling interval.

==== Data retention ====

In order to save space, the database that stores all data samples has a data retention policy of 14 days. This means that data that was ingested more than 14 days ago will be removed from the database automatically.

==== Report issues and feedback ====

This plugin has not been extensively tested yet, but feel free to try it out. If you encounter any issues or if you have other feedback, please send an email to [[hpc@rug.nl]].