--- habrok:advanced_job_management:many_file_jobs [2023/09/19 18:02] – [Running your computations] camarocico
+++ habrok:advanced_job_management:many_file_jobs [2023/10/02 12:20] (current) – Removed verbose option from extraction command aurel
@@ Line 13: / Line 13: @@
 <code>
 mkdir $TMPDIR/dataset
-tar xvzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
+tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
 </code>
@@ Line 28: / Line 28: @@
 module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
+mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots
 cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
 cd $TMPDIR
-python train.py
+python train.py 3
 </code>
-The script 'train.py' uses the dataset we just copied to the local disk.
+The script ''train.py'' uses the dataset we just copied to the local disk.
 ==== Copying results to shared storage ====
@@ Line 40: / Line 41: @@
 <code>
-tar czvf /scratch/$USER/path/to/where/results/should/be/stored/results.tar.gz $TMPDIR/results
+mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
+tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
 </code>
@@ Line 49: / Line 51: @@
 To facilitate jobs using this method, below is an example script; you can simply replace all instances of 'compressed.tar.gz' and 'compressed' with the name of your data archive.
 <code>
-#!/usr/bin/env bash
+#!/bin/bash
+#SBATCH --job-name=rice_classifier
+#SBATCH --output=rice_classifier.out
 #SBATCH --time=00:10:00
 #SBATCH --nodes=1
 #SBATCH --ntasks=1
-#SBATCH --cpus-per-task=1
+#SBATCH --cpus-per-task=16
-#SBATCH --mem=2GB
+#SBATCH --mem=4GB
-#SBATCH --partition=gpu
+#SBATCH --partition=regular
-#SBATCH --gres=gpu:v100:1
-# Change directory to local directory
+mkdir $TMPDIR/dataset
-cd $TMPDIR
+mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots
 # Extract tar file (which could be stored on /scratch) to local disk
-tar xvzf /scratch/$USER/path/to/compressed.tar.gz $TMPDIR
+tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
+cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
+cd $TMPDIR
-# Your code goes here
 # Load modules
+module load matplotlib/3.5.2-foss-2022a
-# Run scripts
+module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
-# etc.
+# Run the training
+python train.py 3
-tar czvf /scratch/$USER/path/to/where/results/should/be/stored/results.tar.gz $TMPDIR/results
+mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
+tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
 </code>
@@ Line 103: / Line 110: @@
 </code>
-This will create a folder ''job_${SLURM_JOBID}'' in ''/scratch/$USER/path'', and then archive the ''$TMPDIR/results'' into that folder, with the name ''results.tar.gz''. Obviously, you need to adapt it to your needs.
+This will create a folder ''job_${SLURM_JOBID}'' in ''/scratch/$USER/rice_classifier'', and then archive the ''$TMPDIR/results'' into that folder, with the name ''results.tar.gz''. Obviously, you need to adapt it to your needs.
 Because of the way the command ''trap'' works -- it waits for the currently running process to finish before doing anything -- the calculation you will perform needs to be started in the background, and then the job needs to wait. This can be achieved by using something like:
 <code>
-python main.py &
+python train.py 3 &
 wait
 </code>
@@ Line 117: / Line 124: @@
 <code>
-#!/usr/bin/env bash
+#!/bin/bash
+#SBATCH --job-name=rice_classifier
-#SBATCH --time=00:10:00
+#SBATCH --output=rice_classifier.out
+#SBATCH --time=00:30:00
 #SBATCH --nodes=1
 #SBATCH --ntasks=1
-#SBATCH --cpus-per-task=1
+#SBATCH --cpus-per-task=16
-#SBATCH --mem=2GB
+#SBATCH --mem=4GB
-#SBATCH --partition=gpu
+#SBATCH --partition=regular
-#SBATCH --gres=gpu:v100:1
 #SBATCH --signal=B:12@600
-# Change directory to local directory
+mkdir $TMPDIR/dataset
-cd $TMPDIR
 # Extract tar file (which could be stored on /scratch) to local disk
-tar xvzf /scratch/$USER/path/to/compressed.tar.gz $TMPDIR
+tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
+cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
+cd $TMPDIR
-trap 'mkdir /scratch/$USER/path/job_${SLURM_JOBID}; tar czvf /scratch/$USER/path/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results' 12
+# Compress and save the results if the time limit is close
+trap 'mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}; tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results' 12
 # Load modules
-module load Python/3.10.8-GCCcore-12.2.0
+module load matplotlib/3.5.2-foss-2022a
+module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0
+# Create folders for final results
+mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots
 # Run in the background and wait
-python main.py &
+python train.py 3 &
 wait
+mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
+tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
 </code>
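
The ''trap'' and ''wait'' behaviour described in the hunk around line 103 can be tried outside Slurm. The following is a minimal sketch, not part of the wiki page: ''sleep 30'' is a stand-in for ''python train.py 3'', the subshell that sends the signal is a stand-in for Slurm's ''--signal=B:12@600'', and the name ''save_results'' is hypothetical. It assumes a Linux system where signal 12 is ''USR2''.

```shell
#!/bin/bash
# Illustrative sketch only: `sleep 30` stands in for the real computation,
# and the subshell below stands in for Slurm delivering signal 12.

handled=0
save_results() {
    # In a real job this is where the results would be archived to /scratch.
    handled=1
}
trap save_results USR2       # USR2 is signal 12 on typical Linux systems

sleep 30 &                   # stand-in workload, started in the background
worker=$!

# Send the signal to this shell after one second.
( sleep 1; kill -USR2 $$ ) &

wait "$worker"               # interrupted by the trapped signal
status=$?                    # greater than 128 when wait was interrupted

kill "$worker" 2>/dev/null   # clean up the stand-in workload
echo "wait returned $status, handler ran: $handled"
```

Because the workload runs in the background, ''wait'' returns as soon as the signal arrives and the handler runs after about one second. If ''sleep 30'' were started in the foreground instead, the shell would defer the handler until the command had finished, which in a real job would be too late to save anything.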