You can extract archives (e.g. tarballs) at the destination using
<code bash>
mkdir $TMPDIR/dataset
tar xzf /scratch/$USER/path/to/compressed.tar.gz -C $TMPDIR/dataset
</code>
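If you do not yet have such an archive, it can be created once on the shared storage (e.g. on a login node) before submitting any jobs. A minimal sketch, assuming your input files live in a directory called ''dataset'' under your ''/scratch'' area:
<code bash>
# Create the archive once, before submitting jobs; adjust the directory
# and archive names to your own data
cd /scratch/$USER
mkdir -p path/to
tar czf path/to/compressed.tar.gz dataset
</code>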
Here's an example of training a neural network that can classify different types of rice:
<code bash>
module load matplotlib/3.5.2-foss-2022a
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

cd $TMPDIR
python train.py 3
</code>
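Before starting a long training run it can be worth checking that the data really ended up on the node-local disk. A quick sanity check, assuming the ''dataset'' directory from the extraction example above:
<code bash>
# Show how much data was extracted and list a few of the files
du -sh $TMPDIR/dataset
ls $TMPDIR/dataset | head
</code>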
At the end of the job, you will probably want to copy some files back to the shared storage, otherwise they will be deleted from the local storage. The easiest way to do that is to create another archive and then copy it over to the shared storage:
<code bash>
mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
</code>
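Once the job has finished you can unpack the results archive again on the shared storage. A short sketch, assuming the archive is called ''results.tar.gz'' as in the example above and ''<jobid>'' is the numeric job ID:
<code bash>
# Unpack the results that the job copied back to /scratch
cd /scratch/$USER/rice_classifier/job_<jobid>
tar xzf results.tar.gz
</code>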
To facilitate jobs using this method, below is an example script; you can simply replace all instances of the example names (such as ''rice_classifier'') with ones appropriate for your own application:
<code bash>
#!/bin/bash
#SBATCH --job-name=rice_classifier
#SBATCH --output=rice_classifier.out
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=4GB
#SBATCH --partition=regular

mkdir $TMPDIR/dataset
mkdir -p $TMPDIR/results

# Extract tar file (which could be stored on /scratch) to local disk
tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
cd $TMPDIR

# Load modules
module load matplotlib/3.5.2-foss-2022a
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

# Run the training
python train.py 3

mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
</code>
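Assuming the script above is saved as ''rice_classifier.sh'' (any name will do), it can be submitted and monitored with the usual Slurm commands:
<code bash>
# Submit the job and check its position in the queue
sbatch rice_classifier.sh
squeue -u $USER
</code>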
to the jobscript. Here ''<sig_time>'' should be replaced by the number of seconds before the time limit at which the signal (signal 12 in this example) should be sent:
<code bash>
#SBATCH --signal=B:12@<sig_time>
</code>
Once the signal is sent, the scheduler needs to be told what to do. We will have the scheduler copy the ''results'' folder back to the shared storage before the job is killed. This can be done with the ''trap'' command, which runs a given command when a signal is received:
<code bash>
trap 'echo "Time limit is approaching, saving results"' 12
</code>
or, more usefully:
<code bash>
trap 'mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}; tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results' 12
</code>
This will create a folder ''job_${SLURM_JOBID}'' in ''/scratch/$USER/rice_classifier'' and copy an archive with the results produced so far into it, just before the job runs out of time.

Because of the way the command ''trap'' works, the main program has to be started in the background and the shell has to wait for it with the ''wait'' command; otherwise the trap would only be executed after the main program has already finished:
<code bash>
python train.py 3 &
wait
</code>
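To see why the ''&'' and ''wait'' are needed, here is a minimal stand-alone sketch (not a jobscript) that can be tested in a normal shell, using signal 12 as above and ''sleep 300'' as a stand-in for the real computation:
<code bash>
#!/bin/bash
# Run this script, then send it signal 12 from another shell: kill -12 <pid>
trap 'echo "signal received, saving results"' 12

# Because the long command runs in the background and the shell sits in
# 'wait', the trap runs as soon as the signal arrives; with 'sleep 300' in
# the foreground the trap would only run after the sleep had finished.
sleep 300 &
wait
</code>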
Thus, the new jobscript file might look something like:
<code bash>
#!/bin/bash
#SBATCH --job-name=rice_classifier
#SBATCH --output=rice_classifier.out
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=4GB
#SBATCH --partition=regular
#SBATCH --signal=B:12@600

mkdir $TMPDIR/dataset

# Extract tar file (which could be stored on /scratch) to local disk
tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
cd $TMPDIR

# Compress and save the results if the timelimit is close
trap 'mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}; tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results' 12

# Load modules
module load matplotlib/3.5.2-foss-2022a
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

# Create folders for final results
mkdir -p $TMPDIR/results

# Run in the background and wait
python train.py 3 &
wait

mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
</code>
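Afterwards you can check whether the job actually ran into its time limit (and the ''trap'' was therefore triggered) by looking at its final state, for instance with ''sacct'':
<code bash>
# Replace <jobid> with the ID reported by sbatch;
# a State of TIMEOUT means the time limit was reached
sacct -j <jobid> --format=JobID,JobName,State,Elapsed
</code>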
Be mindful of the fact that ''
<code python>
import numpy as np
import cv2
</code>
Say you are working with a tar archive named ''dataset.tar.gz''. You can list the files stored inside it with:
<code python>
tar = tarfile.open('dataset.tar.gz')
print(tar.getnames())
</code>
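For a large archive it can be more convenient to loop over the members instead of printing all names at once. A small sketch, again using the example archive name from above:
<code python>
import tarfile

# Print only regular files, skipping the directory entries
with tarfile.open('dataset.tar.gz') as tar:
    for member in tar.getmembers():
        if member.isfile():
            print(member.name)
</code>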
Keep in mind that all of the filenames will have the full path attached to them, so a file called ''image.jpg'' stored in a folder called ''images'' inside the archive will show up as ''images/image.jpg''. A single file can then be extracted with:
<code python>
file = tar.extractfile('images/image.jpg')
</code>
The resulting file is a byte array, not an image file that you can directly work on with python. To convert this to a workable image, say in ''numpy'' array format, you can do the following:
<code python>
import tarfile
import numpy as np
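import cv2

# A sketch of the conversion; the archive and file names ('dataset.tar.gz',
# 'images/image.jpg') are the example names used above, not necessarily the
# ones from the original dataset.
tar = tarfile.open('dataset.tar.gz')
file = tar.extractfile('images/image.jpg')

# extractfile() returns a byte stream; read it into a NumPy buffer and
# decode the bytes into an OpenCV image
data = np.frombuffer(file.read(), dtype=np.uint8)
image = cv2.imdecode(data, cv2.IMREAD_COLOR)
print(image.shape)  # e.g. (height, width, 3)

tar.close()
</code>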
For a text file, we can similarly extract it from the tar archive. Say we wish to extract a file saved under a known path inside the archive:
<code python>
import tarfile
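# A sketch of reading a text file straight from the archive; 'labels.txt'
# is an example name, not necessarily a file from the original dataset.
tar = tarfile.open('dataset.tar.gz')
file = tar.extractfile('labels.txt')

# extractfile() again returns a byte stream, so decode it into text
text = file.read().decode('utf-8')
print(text)

tar.close()
</code>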