You can extract archives (e.g. tarballs) at the destination using

<code>
mkdir $TMPDIR/dataset
tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
</code>
  
==== Running your computations ====
  
You can then operate on your extracted files located in ''$TMPDIR/dataset''. This will not use the cluster's shared storage systems, meaning that file operations of all kinds should be much faster. Please note that local storage is cleaned once your job completes, so this is not useful for long-term storage.
  
Here's an example of training a neural network that can classify different types of rice:

<code>
module load matplotlib/3.5.2-foss-2022a
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots
cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
cd $TMPDIR

python train.py 3
</code>

The script ''train.py'' uses the dataset we just extracted to the local disk.
==== Copying results to shared storage ====
  
  
<code>
mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
</code>
  
To facilitate jobs using this method, below is an example script; you can simply replace the dataset archive, the ''train.py'' script, and the result paths with those for your own data and program.
<code>
#!/bin/bash
#SBATCH --job-name=rice_classifier
#SBATCH --output=rice_classifier.out
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=4GB
#SBATCH --partition=regular

mkdir $TMPDIR/dataset
mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots

# Extract tar file (which could be stored on /scratch) to local disk
tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
cd $TMPDIR

# Load modules
module load matplotlib/3.5.2-foss-2022a
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

# Run the training
python train.py 3

# Copy the results to shared storage
mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
</code>
  
</code>
  
This will create a folder ''job_${SLURM_JOBID}'' in ''/scratch/$USER/rice_classifier'', and then archive ''$TMPDIR/results'' into that folder as ''results.tar.gz''. You will, of course, need to adapt the paths to your own needs.
  
Because of the way the command ''trap'' works -- it waits for the currently running process to finish before doing anything -- the calculation you will perform needs to be started in the background, and then the job needs to wait. This can be achieved by using something like:
  
<code>
python train.py &
wait
</code>
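You can try this pattern outside of a job with a small standalone sketch (hypothetical, not tied to Slurm: signal 12 corresponds to SIGUSR2 on Linux, a background ''sleep'' stands in for the long computation, and a second background process stands in for Slurm delivering the signal):

```shell
#!/bin/bash
# Sketch of the trap + background + wait pattern (assumptions: signal 12
# is SIGUSR2, as on Linux; 'sleep 30' stands in for train.py).
cleanup() { echo "signal received: saving results"; SAVED=1; }
trap cleanup 12

sleep 30 &                   # the "computation", started in the background
comp_pid=$!

( sleep 1; kill -12 $$ ) &   # stands in for Slurm sending the signal

wait $comp_pid               # returns as soon as the trapped signal arrives
echo "back in control after wait"
kill $comp_pid 2>/dev/null || true   # tidy up the leftover background process
```

If the computation were run in the foreground instead, the handler would only fire after it finished, which is exactly what the background-plus-''wait'' pattern avoids.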
  
<code>
#!/bin/bash
#SBATCH --job-name=rice_classifier
#SBATCH --output=rice_classifier.out
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=4GB
#SBATCH --partition=regular
#SBATCH --signal=B:12@600

mkdir $TMPDIR/dataset

# Extract tar file (which could be stored on /scratch) to local disk
tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
cd $TMPDIR

# Compress and save the results if the time limit is close
trap 'mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}; tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results' 12

# Load modules
module load matplotlib/3.5.2-foss-2022a
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

# Create folders for final results
mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots

# Run in the background and wait
python train.py &
wait

# Compress and save the results on normal completion
mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
</code>

===== Numpy Arrays =====
Another way of making large amounts of data accessible with fewer file requests is to concatenate it into ''%%numpy%%'' arrays. Given that these work best when all items are uniform in size, ''%%numpy%%'' arrays are likely best suited for storing already pre-processed data. In addition, this approach has only been tried with data in image format.
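As a sketch of the idea (the file name, image sizes, and label scheme here are made up for illustration): pre-process your images once, stack them into a single array, and save everything as one file that a job can read with a single file request:

```python
import numpy as np

# Hypothetical stand-in for a folder of pre-processed images: in practice
# you would load each image (e.g. with PIL or imageio), resize it to a
# common shape, and stack the results; random data plays that role here.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(1000, 64, 64, 3), dtype=np.uint8)  # 1000 RGB 64x64 "images"
labels = rng.integers(0, 5, size=1000, dtype=np.int64)                 # one class label per image

# One file on shared storage instead of thousands of small image files
np.savez("dataset_packed.npz", images=images, labels=labels)

# A job then needs only a single file request to get the whole dataset
packed = np.load("dataset_packed.npz")
print(packed["images"].shape, packed["labels"].shape)  # (1000, 64, 64, 3) (1000,)
```

For a single large ''.npy'' file saved with ''np.save'', passing ''mmap_mode='r''' to ''np.load'' additionally lets you read slices without loading the whole array into memory.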