===== Exercise 0 =====

The files needed to complete these exercises are on [[https://gitrepo.service.rug.nl/cit-hpc/habrok/cluster_course.git|GitLab]]. Get a copy of the exercise files by running:

<code>
git clone https://gitrepo.service.rug.nl/cit-hpc/habrok/cluster_course.git
</code>

**Run these commands on the image file to see what they do.**

As you have seen, these commands result in numbers that we could use in our script. To store the output of a command in a variable, we can use command substitution with ''$( )''. E.g.:
<code>
myvar=$( command )
</code>

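For instance, a small sketch (the file name and command here are only an illustration, not taken from the exercise files):
<code>
# Store the number of lines of a (hypothetical) image file in a variable
nlines=$( wc -l < image.pgm )
echo "The image has $nlines lines"
</code>
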
After performing this exercise, you should obtain something like the following:
{{:habrok:additional_information:course_material:openmp_times.png?nolink |}}

The ''Ideal Performance'' shows the case where the scaling is perfect. The work is fully parallelizable, and the walltime is halved when the number of CPUs is doubled. The real case is not as efficient: the ''CPU Time'' is consistently larger than the ''Ideal Performance'', suggesting that there is some inefficiency in the parallelization; furthermore, the ''Walltime'' is somewhat larger still, which means that some overhead is introduced by adding additional CPUs to the computation.

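If you want to quantify this from your own measurements, the speedup and parallel efficiency can be computed from the walltimes. A small sketch, using made-up placeholder timings rather than results from the exercise:
<code>
# Placeholder values: t1 = walltime on 1 CPU, tn = walltime on n CPUs
t1=160
tn=24
n=8
echo "speedup:    $( echo "$t1 / $tn" | bc -l )"
echo "efficiency: $( echo "$t1 / ($n * $tn)" | bc -l )"
</code>
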
**Run the blurring app with 2, 4, 8, and 16 MPI tasks, each using one core and running on a separate node.** Make note of the runtimes, as well as the overall wallclock time. How does this differ from the previous exercise?

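One possible shape for such a job script, shown here only as a sketch (the module name, program name, and input file are assumptions, not the exact names from the exercise files):
<code>
#!/bin/bash
#SBATCH --nodes=4              # repeat the run with 2, 8 and 16 nodes as well
#SBATCH --ntasks-per-node=1    # one MPI task per node
#SBATCH --cpus-per-task=1      # each task uses a single core
#SBATCH --time=00:10:00

# Placeholder module and program names; use the ones from your copy of the exercise files.
module load foss/2022a
srun ./blur_image image.pgm
</code>
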
You can try to resubmit the job with 4 nodes to the ''parallel'' partition, in which the nodes have a faster, low-latency interconnect. Does this make a difference? Note that using more nodes will result in a long waiting time, as there are only 24 nodes in this partition.

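Assuming the partition is indeed named ''parallel'', as the text above suggests, it can be selected by adding the following directive to the job script:
<code>
#SBATCH --partition=parallel
</code>
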
"Low-latency" means that the time it takes for the first byte of a message to reach the other node is very small. It takes only 1.2 μs on our 100 Gb/s Omni-Path network, whereas on our 25 Gb/s Ethernet the latency is 19.7 μs.

<hidden Solution>
After performing this exercise, you should get something like this:

{{:habrok:additional_information:course_material:mpi_times_nodes.png?nolink |}}

It is interesting to compare this graph with the one from exercise 2.2. The main difference is in ''Walltime'', which does not scale the same way with the number of CPUs. When all the CPUs were on the same machine, as in the previous exercise, the ''Walltime'' scaling was similar to that for ''CPU Time'' and ''Ideal Performance'', though less steep. When the CPUs are distributed over many machines, however, we see that, even though the ''CPU Time'' scales the same way as previously, and close to ''Ideal Performance'', the ''Walltime'' eventually levels off and remains constant, not decreasing with an increasing number of CPUs. This points to a fundamental limitation of MPI, which stems from the fact that memory is not shared among the CPUs, and data needs to be copied over the network between machines, which limits the scaling.
</hidden>
Programming the GPU is not for the faint of heart, though OpenACC makes it relatively easy. If you can read C code, **study the code and try to figure out where the GPU is used**. If you plan to use an existing application with the GPU, you needn't worry about the implementation.

<hidden Solution>
<code>
#SBATCH --gpus-per-node=v100:1
#SBATCH --reservation=advanced_course
</code>
</hidden>
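
For reference, a minimal sketch of how these directives might fit into a complete GPU job script (the module and program names below are placeholders, not taken from the exercise files):
<code>
#!/bin/bash
#SBATCH --gpus-per-node=v100:1
#SBATCH --reservation=advanced_course
#SBATCH --time=00:10:00

# Placeholder module and program names; use the ones provided with the exercise.
module load NVHPC
srun ./blur_gpu image.pgm
</code>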