Differences

This shows you the differences between two versions of the page.

habrok:advanced_job_management:many_file_jobs [2023/09/19 18:02] – [Running your computations] camarocico
habrok:advanced_job_management:many_file_jobs [2023/10/02 12:20] (current) – Removed verbose option from extraction command aurel
Line 13: Line 13:
 <code> <code>
 mkdir $TMPDIR/dataset mkdir $TMPDIR/dataset
-tar xvzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
+tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
 </code> </code>
  
Line 28: Line 28:
 module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0 module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
  
 +mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots
 cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
 cd $TMPDIR cd $TMPDIR
  
-python train.py
+python train.py 3
 </code> </code>
  
-The script 'train.py' uses the dataset we just copied to the local disk.
+The script ''train.py'' uses the dataset we just copied to the local disk.
 ==== Copying results to shared storage ==== ==== Copying results to shared storage ====
  
Line 40: Line 41:
  
 <code> <code>
-tar czvf /scratch/$USER/path/to/where/results/should/be/stored/results.tar.gz $TMPDIR/results+mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID} 
 +tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
 </code> </code>
  
Line 49: Line 51:
 To make this method easier to use, an example script is given below; replace all instances of 'compressed.tar.gz' and 'compressed' with the name of your data archive.
 <code> <code>
-#!/usr/bin/env bash +#!/bin/bash 
 +#SBATCH --job-name=rice_classifier 
 +#SBATCH --output=rice_classifier.out
 #SBATCH --time=00:10:00 #SBATCH --time=00:10:00
 #SBATCH --nodes=1 #SBATCH --nodes=1
 #SBATCH --ntasks=1 #SBATCH --ntasks=1
-#SBATCH --cpus-per-task=1 +#SBATCH --cpus-per-task=16 
-#SBATCH --mem=2GB +#SBATCH --mem=4GB 
-#SBATCH --partition=gpu +#SBATCH --partition=regular
-#SBATCH --gres=gpu:v100:1+
  
-# Change directory to local directory +mkdir $TMPDIR/dataset 
-cd $TMPDIR+mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots 
  
 # Extract tar file (which could be stored on /scratch) to local disk # Extract tar file (which could be stored on /scratch) to local disk
-tar xvzf /scratch/$USER/path/to/compressed.tar.gz $TMPDIR+tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset 
 +cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR 
 +cd $TMPDIR
  
-Your code goes here +# # Load modules 
-# Load modules +module load matplotlib/3.5.2-foss-2022a 
-# Run scripts +module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0 
-# etc.+ 
 +# Run the training 
 +python train.py 3
  
-tar czvf /scratch/$USER/path/to/where/results/should/be/stored/results.tar.gz $TMPDIR/results+mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID} 
 +tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
 </code> </code>
  
Line 103: Line 110:
 </code> </code>
  
-This will create a folder ''job_${SLURM_JOBID}'' in ''/scratch/$USER/path'', and then archive the ''$TMPDIR/results'' into that folder, with the name ''results.tar.gz''. Obviously, you need to adapt it to your needs.
+This will create a folder ''job_${SLURM_JOBID}'' in ''/scratch/$USER/rice_classifier'', and then archive the ''$TMPDIR/results'' into that folder, with the name ''results.tar.gz''. Obviously, you need to adapt it to your needs.
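
The folder-then-archive step described here can be tried outside a job with the minimal sketch below. It is an illustration under stated assumptions, not the page's own script: ''workdir'' and ''storage'' are hypothetical temporary directories standing in for ''$TMPDIR'' and ''/scratch/$USER/rice_classifier'', the job ID falls back to a made-up value when ''SLURM_JOBID'' is not set, and ''tar'' is called with ''-C'' so that the archive stores relative paths, which differs slightly from the command in the page.

```shell
#!/bin/bash
# Stand-ins so the sketch runs outside a Slurm job (hypothetical values)
workdir=$(mktemp -d)            # plays the role of $TMPDIR
storage=$(mktemp -d)            # plays the role of /scratch/$USER/rice_classifier
jobid=${SLURM_JOBID:-12345}     # made-up job ID when not running under Slurm

# Fake results, as if produced by the computation
mkdir -p "$workdir/results/logs"
echo "epoch 1 done" > "$workdir/results/logs/run.log"

# Create the per-job folder and archive the results into it
dest="$storage/job_${jobid}"
mkdir -p "$dest"
tar czf "$dest/results.tar.gz" -C "$workdir" results

# List the archive contents to confirm what was stored
listing=$(tar tzf "$dest/results.tar.gz")
echo "$listing"
```

Using one folder per job ID keeps the results of separate runs from overwriting each other.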
  
 Because ''trap'' only runs its handler after the currently running foreground command has finished, the calculation needs to be started in the background, and the job script then needs to wait for it. This can be achieved by using something like:
  
 <code> <code>
-python main.py &
+python train.py &
 wait wait
 </code> </code>
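
The interaction between ''trap'', a background process and ''wait'' can be checked locally with the small sketch below. It makes several assumptions that are not part of the page: ''sleep 30'' stands in for the real computation, ''save_results'' is a hypothetical handler that only records that it ran instead of writing an archive, and the signal is sent by the script itself after one second rather than by Slurm. Signal 12 corresponds to SIGUSR2 on Linux.

```shell
#!/bin/bash
# Hypothetical stand-in for the real save step; it only records that it ran
saved=no
save_results() { saved=yes; }

# Run the handler when signal 12 arrives (SIGUSR2 on Linux)
trap 'save_results' 12

# Stand-in for the long computation, started in the background
sleep 30 &
pid=$!

# Simulate the scheduler sending the signal shortly before the time limit
( sleep 1; kill -12 $$ ) &

# wait returns as soon as the trapped signal arrives, then the handler runs
wait "$pid"
status=$?

# Clean up the stand-in computation, which is still running
kill "$pid" 2>/dev/null
echo "handler ran: $saved (wait exit status $status)"
```

If the computation were started in the foreground instead, the handler would only run once the computation had finished, which is too late when the job is about to hit its time limit.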
Line 117: Line 124:
  
 <code> <code>
-#!/usr/bin/env bash +#!/bin/bash 
- +#SBATCH --job-name=rice_classifier 
-#SBATCH --time=00:10:00+#SBATCH --output=rice_classifier.out 
 +#SBATCH --time=00:30:00
 #SBATCH --nodes=1 #SBATCH --nodes=1
 #SBATCH --ntasks=1 #SBATCH --ntasks=1
-#SBATCH --cpus-per-task=1 +#SBATCH --cpus-per-task=16 
-#SBATCH --mem=2GB +#SBATCH --mem=4GB 
-#SBATCH --partition=gpu +#SBATCH --partition=regular
-#SBATCH --gres=gpu:v100:1+
 #SBATCH --signal=B:12@600 #SBATCH --signal=B:12@600
  
-# Change directory to local directory +mkdir $TMPDIR/dataset
-cd $TMPDIR+
  
 # Extract tar file (which could be stored on /scratch) to local disk # Extract tar file (which could be stored on /scratch) to local disk
-tar xvzf /scratch/$USER/path/to/compressed.tar.gz $TMPDIR+tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset 
 +cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR 
 +cd $TMPDIR
  
-trap 'mkdir /scratch/$USER/path/job_${SLURM_JOBID}; tar czvf /scratch/$USER/path/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results' 12+# Compress and save the results if the timelimit is close 
 +trap 'mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}; tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results' 12
  
 # Load modules # Load modules
-module load Python/3.10.8-GCCcore-12.2.0+module load matplotlib/3.5.2-foss-2022a 
 +module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0 
 + 
 +# Create folders for final results 
 +mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots 
 # Run in the background and wait # Run in the background and wait
-python main.py &
+python train.py &
 wait wait
 +
 +mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
 +tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
 </code> </code>