habrok:advanced_job_management:job_profiling — last revision 2022/09/15 10:33 (external edit); page removed 2023/03/22 13:31 by fokke
====== Job profiling ======

Job profiling is a method to monitor and gain insight into the resource usage of your job. This can be very helpful in determining whether you are requesting the right number of cores and nodes, and the right amount of memory for your job. On this page we describe two different methods that can be used to profile your job: the job summary at the end of the job output (which you can also obtain using the [[peregrine:job_management:checking_jobs#Using_jobinfo|jobinfo command]]), and a specific job profiling plugin with more advanced features and nice visualizations.

===== Job summary =====

The job information that is printed at the end of each job output already gives a quick overview of the resource usage of your job. It might look like this:

<code>
Job details:
============

Name                : job_example1
User                : p123456
Partition           : regular
Nodes               : pg-node058
Cores               : 12
State               : COMPLETED
Submit              : 2018-04-11T14:49:06
Start               : 2018-04-11T14:49:06
End                 : 2018-04-11T15:07:09
Reserved walltime   : 00:40:00
Used walltime       : 00:18:03
Used CPU time       : 03:25:57 (efficiency: 95.08%)
% User (Computation): 44.42%
% System (I/O)      : 55.58%
Mem reserved        : 62G/node
Max Mem used        : 6.48G (pg-node058)
Max Disk Write      : 30.72K (pg-node058)
Max Disk Read       : 1.05M (pg-node058)
</code>
From these statistics you can already get an impression of the resource usage of your jobs. For instance, you can find the maximum memory usage and the used CPU time. From the latter you can conclude whether the job was able to use all the cores that were requested (12 in this example). Ideally, the amount of CPU time divided by the walltime should be equal to or close to the number of requested cores. In the example output we have:

<code>
(Used CPU time) / (Used walltime) = 11.4
</code>
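You can verify this number from the summary values yourself, for example in the shell (times converted to seconds, with ''bc'' doing the division):

<code>
# Values from the job summary above, converted to seconds
cpu_time=$((3*3600 + 25*60 + 57))   # Used CPU time: 03:25:57 -> 12357 s
walltime=$((18*60 + 3))             # Used walltime:  00:18:03 ->  1083 s
# Average number of CPUs used over the lifetime of the job
echo "scale=1; $cpu_time / $walltime" | bc   # prints 11.4
</code>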

This means that the job on average used 11.4 CPUs, while 12 CPUs were requested, resulting in an efficiency of about 95%. Most programs have to perform some sequential steps (like reading input or writing output), so reaching 100% is not realistic, and 95% is perfectly fine. Things are worse for the following job:

<code>
Job details:
============

Name                : job_example2
User                : p123456
Partition           : regular
Nodes               : pg-node190
Cores               : 24
State               : COMPLETED
Submit              : 2018-04-11T11:14:03
Start               : 2018-04-11T11:14:03
End                 : 2018-04-11T11:14:58
Reserved walltime   : 00:10:00
Used walltime       : 00:00:55
Used CPU time       : 00:12:11 (efficiency: 55.41%)
% User (Computation): 23.29%
% System (I/O)      : 76.71%
Mem reserved        : 125G/node
Max Mem used        : 511.48M (pg-node190)
Max Disk Write      : 20.48K (pg-node190)
Max Disk Read       : 1.06M (pg-node190)
</code>
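The efficiency percentage itself is simply the used CPU time divided by the used walltime multiplied by the number of allocated cores. As a check, computed in the shell from the summary values above (the result differs slightly from the reported 55.41% because the printed times are rounded to whole seconds):

<code>
cpu_time=$((12*60 + 11))   # Used CPU time: 00:12:11 -> 731 s
walltime=55                # Used walltime: 00:00:55
cores=24
# efficiency (%) = CPU time / (walltime * cores) * 100
echo "scale=2; 100 * $cpu_time / ($walltime * $cores)" | bc   # prints 55.37
</code>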
This job has a CPU efficiency of only 55%, which means that it used only about half of the computing power that was allocated to it. However, these single summary numbers for CPU time and memory usage do not tell you when during the job the resources were underused. To find bottlenecks or parts of the program that are not behaving as they should, it helps to visualize how the job behaves over time. This is where the job profiling plugin can help.

===== Job profiling plugin =====

A plugin for the SLURM scheduler is available that will monitor the behavior of your job and store a sample of job details per sampling interval. A [[https://grafana.com/|Grafana]] dashboard can then be used to visualize these samples as time series.

If you want to profile your job, first add the following line to your job script:

<code>
#SBATCH --profile=task
</code>
The default sampling interval is 30 seconds. If you want to change this (note that a shorter interval may lead to degraded performance!), you can also add the following line:

<code>
#SBATCH --acctg-freq=task=60
</code>
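Putting both directives together, a minimal job script with profiling enabled might look like this (the job name, the requested resources, and ''my_program'' are placeholders to adapt to your own job):

<code>
#!/bin/bash
#SBATCH --job-name=profile_example   # placeholder job name
#SBATCH --time=00:40:00
#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --profile=task               # enable task-level profiling
#SBATCH --acctg-freq=task=60         # take a sample every 60 seconds

srun ./my_program                    # placeholder for your own program
</code>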
Once your job is running, it will start collecting information. To visualize this information, go to [[https://profiling.hpc.rug.nl|https://profiling.hpc.rug.nl]], log in using your RUG credentials, and click on the Peregrine Job Profiling dashboard. At the top of the page, select your job id in the drop-down menu and, if you requested more than one node, the node that you want to visualize. You should now see data in the plots for the different metrics. If not, please make sure that the selected time range at the top corresponds to when your job was running and that your job has been running for at least a couple of minutes. If it works, you should see something like this:

{{..:jobs:advanced_topics:profiling.png|}}

==== Explanation of different metrics ====

The job profiling dashboard currently shows plots for six metrics:

=== Cumulative CPU time over all previous sampling intervals ===

This plot shows the total CPU time for each sampling interval in a cumulative way, which means that the last sampling point should, more or less, match the total CPU time of the entire job.

=== CPU Utilization ===

This shows the CPU utilization during each sampling interval. Note that 100% represents one fully used core, so a job that uses multiple cores can show values above 100%.

=== (Virtual) memory usage ===

These plots show the maximum (virtual) memory usage during the sampling interval.

=== Read from / Write to disk ===

This metric represents the amount of data that was read from or written to disk during the sampling interval.

==== Data retention ====

In order to save space, the database that stores all data samples has a data retention policy of 14 days. This means that data that was ingested more than 14 days ago will be removed from the database automatically.

==== Report issues and feedback ====

This plugin has not been extensively tested yet, but feel free to try it out. If you encounter any issues or if you have other feedback, please send an email to [[hpc@rug.nl]].