habrok:advanced_job_management:many_file_jobs [2025/07/09 14:23] (current)
You can extract archives (e.g. tarballs) at the destination using
<code bash>
mkdir $TMPDIR/dataset
tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
</code>
  
Here's an example of training a neural network that can classify different types of rice:

<code bash>
module load matplotlib/3.5.2-foss-2022a
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

# ...

cd $TMPDIR

python train.py 3
</code>
  
At the end of the job, you will probably want to copy some files back to the shared storage, otherwise they will be deleted from the local storage. The easiest way to do that is to create another archive and then copy it over to the shared storage:

<code bash>
mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
</code>
  
To facilitate jobs using this method, below is an example script; you can simply replace the example dataset archive ''dataset.tar.gz'' and the ''/scratch'' paths with the name and location of your own data.
<code bash>
#!/bin/bash
#SBATCH --job-name=rice_classifier
#SBATCH --output=rice_classifier.out
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
# ...

mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots
  
# Extract tar file (which could be stored on /scratch) to local disk
tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
cd $TMPDIR
# ...

# Run the training
python train.py 3

mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
</code>
  
To make the scheduler send a signal to the job some time before the time limit is reached, you can add a line of the form

<code bash>
#SBATCH --signal=B:<sig_num>@<sig_time>
</code>

to the jobscript. Here ''<sig_time>'' is the time **in seconds** before the time limit is reached when the signal should be sent. ''<sig_num>'' is the ID of the signal to be sent, and we follow the recommendation of our colleagues at Göttingen University to set this to 12 (SIGUSR2), as it is unlikely to be used in your program. Thus, to send a signal 10 minutes before the job's time runs out, you would add:

<code bash>
#SBATCH --signal=B:12@600
</code>
Once the signal is sent, the job script needs to be told what to do with it. We will have it copy the ''results'' folder to the shared storage (archived and compressed). For that we use the ''trap'' command to catch the signal and execute some code:
  
<code bash>
trap 'echo "Trapped the signal!"; exit 12' 12
</code>
or, more usefully:
  
<code bash>
trap 'mkdir -p /scratch/$USER/path/job_${SLURM_JOBID}; tar czvf /scratch/$USER/path/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results' 12
</code>
  
This will create a folder ''job_${SLURM_JOBID}'' in ''/scratch/$USER/path'', and then archive ''$TMPDIR/results'' into that folder under the name ''results.tar.gz''. You will, of course, need to adapt the paths to your own situation.
  
Because of the way the command ''trap'' works -- it waits for the currently running process to finish before doing anything -- the calculation you will perform needs to be started in the background, and then the job needs to wait. This can be achieved by using something like:
  
<code bash>
python train.py &
wait
</code>
Thus, the new jobscript file might look something like:
  
<code bash>
#!/bin/bash
#SBATCH --job-name=rice_classifier
#SBATCH --output=rice_classifier.out
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=4GB
#SBATCH --partition=regular
#SBATCH --signal=B:12@600

mkdir $TMPDIR/dataset

# Extract tar file (which could be stored on /scratch) to local disk
tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
cd $TMPDIR

# Compress and save the results if the time limit is close
trap 'mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}; tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results' 12

# Load modules
module load matplotlib/3.5.2-foss-2022a
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

# Create folders for final results
mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots

# Run in the background and wait
python train.py &
wait

mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
</code>
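The signal can also be handled inside the training script itself. Since the ''B'' in ''--signal=B:12@600'' delivers the signal only to the batch shell, the trap would then have to forward it to the Python process (for example ''trap 'kill -12 $child' 12'' after starting ''python train.py & child=$!''). Below is a minimal, hypothetical sketch of the Python side, using only the standard ''signal'' module -- this is an illustration, not part of the course's ''train.py'':

<code python>
import os
import signal
import time

stop_requested = False

def save_and_stop(signum, frame):
    # A real handler would write checkpoints/plots to $TMPDIR/results here
    global stop_requested
    stop_requested = True

# Signal 12 is SIGUSR2, matching "#SBATCH --signal=B:12@600" above
signal.signal(signal.SIGUSR2, save_and_stop)

# Simulate the forwarded signal instead of waiting for a real time limit
os.kill(os.getpid(), signal.SIGUSR2)
time.sleep(0.1)

print(stop_requested)  # True
</code>

The training loop would then check a flag like ''stop_requested'' once per epoch and save its state before exiting.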
  
Be mindful of the fact that ''numpy'' arrays default to ''int64'' type, which means you would be storing data in a much larger data type than required. If your image is encoded in RGB format, you usually only require ''uint8'' (unsigned 8-bit integers; 0-255) at most. Additionally, ''numpy'' arrays are not like Python (linked) lists, which can be expanded with very little effort. It is preferable to allocate all the space required in one go and then fill the array, as expanding it later will cause ''numpy'' to allocate a second array and then copy the first into it. This can become an issue for large datasets, as you would be keeping multiple copies of the data in memory (RAM). Below is an example code snippet showing the process used for making ''numpy'' arrays from many image files using ''numpy'' and ''cv2''.
  
<code python>
import numpy as np
import cv2
# ...
</code>
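In stripped-down form, the preallocation pattern above can be sketched as follows; random data stands in for the ''cv2.imread'' calls, so the snippet does not depend on any dataset:

<code python>
import numpy as np

n_images, height, width, channels = 100, 64, 64, 3

# Allocate the whole array once, as uint8, instead of growing it image by image
images = np.zeros((n_images, height, width, channels), dtype=np.uint8)

for i in range(n_images):
    # Stand-in for reading and resizing one image file with cv2
    images[i] = np.random.randint(0, 256, size=(height, width, channels), dtype=np.uint8)

print(images.nbytes)  # 1228800 bytes; the int64 default would be 8 times larger
</code>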
  
Say you are working with a tar archive named ''archive.tar''; you can get the paths and filenames of all of its contents in the following way:
<code python>
import tarfile

tar = tarfile.open('archive.tar')
print(tar.getnames())
</code>
Keep in mind that all of the filenames will have the full path attached to them, so a file called ''image.png'' stored inside the sub-directory ''archive/images'' of the archive will appear in the output as the string ''%%'archive/images/image.png'%%''. You can then extract this file by using this string:
<code python>
file = tar.extractfile('archive/images/image.png')
</code>
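As a self-contained illustration of both ''getnames()'' and ''extractfile()'', the snippet below first builds a small throwaway archive in memory; the member name mirrors the example above:

<code python>
import io
import tarfile

# Build a tiny in-memory archive so the example can run anywhere
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w') as archive:
    payload = b'fake image bytes'
    info = tarfile.TarInfo(name='archive/images/image.png')
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buf.seek(0)

tar = tarfile.open(fileobj=buf)
print(tar.getnames())  # ['archive/images/image.png']

file = tar.extractfile('archive/images/image.png')
print(file.read())  # b'fake image bytes'
</code>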
The resulting file is a byte array, not an image file that you can directly work on with Python. To convert this to a workable image, say in ''cv2'', you need to convert it to a ''numpy'' array and then to a ''cv2''-readable image. Example code is provided below to help with this.
  
<code python>
import tarfile
import numpy as np
# ...
</code>
  
For a text file, we can similarly extract it from the tar archive. Say we wish to extract a file saved under ''text.txt'' in the top-level directory of the archive; the following code can do this:
<code python>
import tarfile