Job arrays allow you to submit a large number of very similar jobs with a single job script. All jobs in the array need to have the same resource requirements. The job array lets you define a range of numbers; the length of this range determines how many jobs will be submitted. Furthermore, each job receives one of the numbers in this range through the environment variable $SLURM_ARRAY_TASK_ID, which you can use, for instance, to pass the right input or parameter to each job.
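As a minimal sketch of how this works (the job name and time limit below are just placeholders), each job in a small array could simply print the value it receives:

#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --array=1-5

# Each of the 5 jobs prints its own index (1, 2, 3, 4 or 5)
echo "This is array task ${SLURM_ARRAY_TASK_ID}"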
In order to create a job array, start by creating the job script that you would need to run just one instance of the job. For instance, consider the following job script for running R:
#!/bin/bash
#SBATCH --job-name=R_job
#SBATCH --time=12:00:00
#SBATCH --ntasks=1
#SBATCH --mem=1gb

module load R/3.4.2-foss-2016a-X11-20160819
Rscript myscript.r
Now suppose you want this to be run 100 times. You can simply add the array definition by adding the following line somewhere at the top of the script:
#SBATCH --array=1-100
Then you can use

sbatch <name of job script>

just once, and the job will run 100 times. Each of the 100 jobs will get one core, 1 GB of memory and 12 hours of wall-clock time. Do note that they will all run the same R script, myscript.r in this case, which in most cases is probably not very useful!
So let us take it a step further: suppose we have 100 different R scripts that have to be run, which are named myscript1.r, myscript2.r, …, myscript100.r. Now we can use the aforementioned environment variable to pick the right R script for each job:
#!/bin/bash
#SBATCH --job-name=R_job
#SBATCH --time=12:00:00
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --array=1-100

module load R/3.4.2-foss-2016a-X11-20160819
Rscript myscript${SLURM_ARRAY_TASK_ID}.r
Here the variable ${SLURM_ARRAY_TASK_ID} will be replaced for each job by a value from the given range.
The range does not necessarily have to be an interval of integers. You can define multiple intervals and/or use a step size to define more complex ranges:
--array=1,3-5,8,101-103

# Step size 2: task IDs 1, 3, 5, ..., 99
--array=1-99:2
Suppose that you want to use this to pass input parameters to your program. If the input parameter takes a complex range of values, or if you need more than one parameter, the approach described above will probably not work. In that case you can put all your input parameter combinations in a file, with each combination on a separate line. You can then use the $SLURM_ARRAY_TASK_ID variable to get the n-th line from the file and pass that to your program. For instance:
INPUTFILE=parameters.in

# Get the n-th line from $INPUTFILE
ARGS=$(sed "${SLURM_ARRAY_TASK_ID}q;d" $INPUTFILE)
myprogram $ARGS
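As a concrete sketch, a hypothetical parameters.in with three parameter combinations (the values and the program name myprogram are placeholders) could look like the lines below; combined with #SBATCH --array=1-3, array task 2 would then run myprogram 0.5 100:

0.1 100
0.5 100
0.5 1000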
Alternatively, you could also declare arrays with the input parameters, and use $SLURM_ARRAY_TASK_ID to fetch the input parameters from those arrays:
parameter1=(1 2 3)
parameter2=(100 1000 10000)

myprogram ${parameter1[${SLURM_ARRAY_TASK_ID}]} ${parameter2[${SLURM_ARRAY_TASK_ID}]}
Note that with a file your range should start at 1 and go up to the number of lines in the file, since sed counts lines from 1. In the latter case, with the Bash arrays, your range should start at 0 and go up to the number of elements in the array minus one, because Bash arrays are indexed from 0.
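Putting this together, a sketch of a complete job script for the array-based approach could look as follows (the job name, time limit and the program name myprogram are placeholders, not part of the example above):

#!/bin/bash
#SBATCH --job-name=param_job
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --array=0-2   # Bash arrays are 0-indexed: 3 elements give indices 0, 1 and 2

parameter1=(1 2 3)
parameter2=(100 1000 10000)

# Task 0 runs "myprogram 1 100", task 1 runs "myprogram 2 1000", task 2 runs "myprogram 3 10000"
myprogram ${parameter1[${SLURM_ARRAY_TASK_ID}]} ${parameter2[${SLURM_ARRAY_TASK_ID}]}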
A job array will get just one main job ID, just like a regular job. However, the index values of the range will be used as a suffix to the job ID: <jobid>_1, <jobid>_2, etcetera. Furthermore, each job will produce its own output file with a filename like slurm-<jobid>_<index>.out. It is also possible to provide a custom name for the Slurm output file with #SBATCH --output=. In normal circumstances a name such as R_job.out would be fine; with job arrays, however, that would result in every job writing to the same output file, overwriting the others. We can get around this by using %A and %a, which will be replaced with the main job ID and the array index, respectively:

#SBATCH --output=R_job_%A_%a.out
The same kind of job IDs will also be used in the output of Slurm tools like squeue and sacct. The squeue command will usually try to combine the jobs in the array into a single line, e.g.:
JOBID          PARTITION  NAME   USER     ST  TIME  NODES  NODELIST(REASON)
12345_[1-100]  nodes      R_job  p123456  PD  0:00  1      (Resources)
If you want each job of the array to appear on a separate line, you can pass the -r or --array option to squeue.
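For instance, to show every array task of your own jobs on its own line (filtering on your username with -u is optional and just shown here for convenience):

squeue -r -u $USER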
The scancel command can be used to cancel an entire job array:
scancel 12345
If you want to cancel only specific jobs of the array, you can use the index as suffix to the job id:
scancel 12345_12
Using square brackets you can cancel ranges of jobs, where a range can be defined in a similar way as described in the part about creating the job array:
scancel 12345_[1-10,15]