{{indexmenu_n>1}}

====== Many File Jobs ======

This page of the wiki is aimed at helping you reduce the number of files your jobs work with when these number in the thousands. Jobs with many files put a heavy load on the cluster's I/O systems, slowing things down for everyone, including yourself. Since most jobs that require many thousands of files fall within the field of data science and therefore use Python, most of the methods described here target Python, but at least one universal method is covered as well.

===== Using Local Storage =====

Each node has a certain amount of fast local storage attached directly to the hardware. This is generally used by Slurm to cache important job-related files, but we can also make use of it more directly. A node's local storage can be accessed through the ''$TMPDIR'' variable, which is set to a unique directory for each job and which is automatically cleaned up when the job completes.

==== Extracting to local storage ====

You can extract archives (e.g. tarballs) at the destination using:

<code bash>
mkdir $TMPDIR/dataset
tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
</code>

It is recommended to transfer your files to the local storage as a single ''.tar'' or ''.tar.gz'' archive, as otherwise you are still copying thousands of individual files, which defeats the purpose of the whole exercise.

==== Running your computations ====

You can then operate on your extracted files located in ''$TMPDIR/dataset''. This does not use the cluster's shared storage systems, meaning that file operations of all kinds should be much faster. Please note that the local storage is cleaned once your job completes, so it is not useful for long-term storage. Here is an example of training a neural network that can classify different types of rice:

<code bash>
module load matplotlib/3.5.2-foss-2022a
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots
cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
cd $TMPDIR
python train.py 3
</code>

The script ''train.py'' uses the dataset we just copied to the local disk.

==== Copying results to shared storage ====

At the end of the job you will probably want to save some files back to the shared storage, otherwise they will be deleted along with the rest of the local storage. The easiest way to do this is to archive the results and write the archive directly onto the shared storage:

<code bash>
mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
</code>

where ''results'' is the folder on the local storage containing the files that need to be saved. Notice that the ''tar'' command here creates the archive directly on the shared storage, so there is nothing to copy afterwards.

==== Putting it all together: example jobscript ====

To facilitate jobs using this method, below is an example jobscript; simply replace the example dataset archive (''dataset.tar.gz''), the ''train.py'' script and the ''rice_classifier'' paths with the names of your own data archive, program and project.
<code bash>
#!/bin/bash
#SBATCH --job-name=rice_classifier
#SBATCH --output=rice_classifier.out
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=4GB
#SBATCH --partition=regular

mkdir $TMPDIR/dataset
mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots

# Extract tar file (which could be stored on /scratch) to local disk
tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
cd $TMPDIR

# Load modules
module load matplotlib/3.5.2-foss-2022a
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

# Run the training
python train.py 3

# Archive the results directly onto the shared storage
mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
</code>

==== Dealing with exceeding the allocated time ====

It can happen that your job runs over its allocated time before the computation finishes, in which case the results are never copied back to the shared storage and are lost. To get around this, we can use a feature of Slurm that sends a signal to the job a specified time before the end of its allocated time, which the jobscript can catch in order to perform some task (such as copying files to the shared storage). The documentation of this feature can be found [[https://slurm.schedmd.com/sbatch.html#OPT_signal|here]], and a good short tutorial from the University of Göttingen can be found [[https://docs.gwdg.de/doku.php?id=en:services:application_services:high_performance_computing:running_jobs_slurm:signals|here]].

In essence, we begin by adding a line like:

<code bash>
#SBATCH --signal=B:<sig_num>@<sig_time>
</code>

to the jobscript. Here ''%%<sig_num>%%'' is the number of the signal to be sent, and ''%%<sig_time>%%'' is the time **in seconds** before the time limit is reached at which the signal should be sent. We follow the recommendation of our colleagues at the University of Göttingen and set the signal number to 12 (''SIGUSR2''), as it is unlikely to be used by your program. Thus, to send this signal 10 minutes before the job's time runs out, you would add:

<code bash>
#SBATCH --signal=B:12@600
</code>

to your jobscript.

Once the signal is sent, the jobscript needs to know what to do with it. In our case it should copy the ''results'' folder to the shared storage (archived and compressed). For that we use the ''trap'' command to catch the signal and execute some code:

<code bash>
trap 'echo "Trapped the signal!"; exit 12' 12
</code>

or, more usefully:

<code bash>
trap 'mkdir -p /scratch/$USER/path/job_${SLURM_JOBID}; tar czvf /scratch/$USER/path/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results' 12
</code>

This creates a folder ''job_${SLURM_JOBID}'' in ''/scratch/$USER/path'' (replace ''path'' with a location of your choice, e.g. ''rice_classifier''), and then archives ''$TMPDIR/results'' into that folder under the name ''results.tar.gz''. Obviously, you need to adapt this to your own needs.

Because of the way the ''trap'' command works -- it waits for the currently running foreground process to finish before doing anything -- the calculation you perform needs to be started in the background, and the job then needs to wait for it. This can be achieved by using something like:

<code bash>
python train.py 3 &
wait
</code>

in the computation section of your jobscript.
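To see how these pieces fit together before looking at the full jobscript, here is a minimal, generic sketch of the pattern; ''long_computation'' and the echoed message are placeholders for illustration, not part of the rice-classifier example:

<code bash>
# Minimal sketch of the trap + background + wait pattern.
# 'long_computation' is a placeholder for your real program.
trap 'echo "Caught signal 12, saving results before the time limit"' 12

long_computation &   # start the real work in the background
wait                 # the trap handler can run while the script waits here
</code>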
Thus, the new jobscript file might look something like:

<code bash>
#!/bin/bash
#SBATCH --job-name=rice_classifier
#SBATCH --output=rice_classifier.out
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=4GB
#SBATCH --partition=regular
#SBATCH --signal=B:12@600

mkdir $TMPDIR/dataset

# Extract tar file (which could be stored on /scratch) to local disk
tar xzf /scratch/public/hb-courses/basic/inputfiles/dataset.tar.gz -C $TMPDIR/dataset
cp /scratch/public/hb-courses/basic/inputfiles/train.py $TMPDIR
cd $TMPDIR

# Compress and save the results if the time limit is close
trap 'mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}; tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results' 12

# Load modules
module load matplotlib/3.5.2-foss-2022a
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

# Create folders for final results
mkdir -p $TMPDIR/results/logs $TMPDIR/results/plots

# Run in the background and wait
python train.py 3 &
wait

# Archive the results onto the shared storage if the job finishes in time
mkdir -p /scratch/$USER/rice_classifier/job_${SLURM_JOBID}
tar czvf /scratch/$USER/rice_classifier/job_${SLURM_JOBID}/results.tar.gz $TMPDIR/results
</code>

===== Numpy Arrays =====

Another way of making large amounts of data accessible with fewer file requests is to concatenate it into ''numpy'' arrays. Since these work best when all data points are uniform in size, ''numpy'' arrays are most suitable for storing data that has already been pre-processed. Note that this approach has so far only been tried with image data.

Be mindful of the fact that ''numpy'' arrays default to 64-bit data types, which means you could be storing your data in a much larger type than required. If your images are encoded in RGB format, ''uint8'' (unsigned 8-bit integers; 0-255) is usually all you need.

Additionally, ''numpy'' arrays are not like Python lists, which can be expanded with very little effort. It is preferable to allocate all the required space in one go and then fill the array, as expanding it later causes ''numpy'' to allocate a second array and copy the first one into it. This can become an issue for large datasets, as you would be keeping multiple copies of your data in memory (RAM).

Below is an example code snippet showing the process used for making ''numpy'' arrays from many image files using ''numpy'' and ''cv2''.
<code python>
import numpy as np
import cv2

# The width and height we want our images to have after pre-processing and for storage
im_height = 64
im_width = 64

# Define a pre-processing function
def preprocess(image):
    # Takes a cv2 image (a numpy array) and returns a modified cv2 image
    # Add your own custom image pre-processing here
    image = cv2.resize(image, (im_width, im_height))
    # Make sure the image is stored as unsigned 8-bit integer values
    image = image.astype('uint8')
    return image

pathList = []             # create a list of paths to your dataset's images
data_len = len(pathList)  # number of data points/images

# Allocate the numpy array
# 'dtype=np.uint8' specifies we want an array of unsigned 8-bit integers
# instead of the default 64-bit values
data = np.empty((data_len, im_height, im_width, 3), dtype=np.uint8)

# Populate the data array
for idx in range(data_len):
    current_path = pathList[idx]
    image = cv2.imread(current_path)
    data[idx] = preprocess(image)

save_location = ""  # where you want to store your newly created numpy array
np.save(save_location + "fileName", data)
</code>

The example provided above is extremely simplistic and should be adapted to your exact needs. It also glosses over the fact that we usually want labels for this data as well. The recommendation here is to create a label array of equal length to the data array and populate it in exactly the same order; for simplicity this can be done in the same loop that populates the data array.

For this method to be effective at bypassing the cluster's congested I/O systems, it is best to pre-process your data into a ''numpy'' array either on your local machine or on a node's local storage (most nodes, not only GPU nodes, have 1TB of attached storage that you can use for this, as described in the local storage section above). If you use a node's local storage for this, it is recommended not to use GPU nodes: neither ''cv2'' nor ''numpy'' benefits from an attached GPU in this scenario, and avoiding the GPU nodes should shorten your waiting time somewhat.

===== Tarfile =====

''tarfile'' is a Python package from the standard library that allows you to use files stored inside ''.tar'' and ''.tar.gz'' archives without extracting the whole archive to disk first. Use this package by importing it in Python with ''import tarfile''. Say you are working with a tar archive named ''archive.tar''; you can get the paths and filenames of all its contents in the following way:

<code python>
import tarfile

tar = tarfile.open('archive.tar')
print(tar.getnames())
</code>

Keep in mind that all of the filenames will have their full path inside the archive attached to them, so a file called ''image.png'' stored in the sub-directory ''archive/images'' of the archive will appear in the output as the string ''%%'archive/images/image.png'%%''. You can then extract this file using that string:

<code python>
file = tar.extractfile('archive/images/image.png')
</code>

The result is a file-like object containing raw bytes, not an image you can directly work on in Python. To convert this into a workable image, say in ''cv2'', you need to turn the bytes into a ''numpy'' array and then decode that into a ''cv2'' image. Example code is provided to help with this.
<code python>
import tarfile
import numpy as np
import cv2

def tar2numpy(tar_file):
    # Read the raw bytes from the extracted file object into a numpy array
    return np.asarray(bytearray(tar_file.read()), dtype=np.uint8)

tar = tarfile.open('archive.tar', mode='r')
file = tar.extractfile('archive/images/image.png')
numArr = tar2numpy(file)
# For cv2.imdecode, 1 decodes a colour image and 0 a grayscale one
image = cv2.imdecode(numArr, 1)
</code>

At the end of this example, ''image'' holds a cv2 image, which we can manipulate and save with cv2 as usual.

For a text file, we can similarly extract it from the tar archive. Say we wish to read a file saved as ''text.txt'' in the top-level directory of the archive; the following code does this:

<code python>
import tarfile

tar = tarfile.open('archive.tar', mode='r')
file = tar.extractfile('archive/text.txt')
text = file.read().decode('utf-8')
</code>

Here ''text'' ends up holding a normal Python string with the contents of the file. If you wish to go through it line by line, you can use ''%%text.split('\n')%%'' to separate it into a list of lines to iterate over. Note that if, for whatever reason, the system on which the text file was created does not use ''utf-8'' encoding, you should pass the appropriate encoding to ''.decode()'' instead.
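Finally, the ''tarfile'' and ''numpy'' approaches can be combined. The sketch below is only an illustration: the archive name, the ''.png'' filter and the image dimensions are assumptions, not fixed requirements. It reads every image straight from the archive, decodes it with ''cv2'' and stores it in a pre-allocated ''uint8'' array, without ever extracting individual files to the shared storage:

<code python>
import tarfile
import numpy as np
import cv2

im_height = 64   # target image height, as in the numpy example above
im_width = 64    # target image width

with tarfile.open('archive.tar', mode='r') as tar:
    # Keep only regular files that look like PNG images
    members = [m for m in tar.getmembers() if m.isfile() and m.name.endswith('.png')]

    # Pre-allocate the data array once, as recommended in the numpy section
    data = np.empty((len(members), im_height, im_width, 3), dtype=np.uint8)

    for idx, member in enumerate(members):
        # Read the raw bytes from the archive and decode them into a colour image
        raw = np.asarray(bytearray(tar.extractfile(member).read()), dtype=np.uint8)
        image = cv2.imdecode(raw, 1)
        # Resize to the target dimensions and store in the pre-allocated array
        data[idx] = cv2.resize(image, (im_width, im_height))

# Save the resulting array; pick your own destination and filename
np.save('dataset.npy', data)
</code>

The resulting ''.npy'' file can then be loaded in your jobs with a single ''np.load()'' call instead of thousands of individual file reads.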