Partitions and Limits
The cluster is divided into several partitions. A partition groups the resources in the cluster based on either the physical attributes of the machines or the types of jobs that are allowed to run on them.
Cluster layout
The node types of the cluster are described in Cluster description.
Partitions
Partitions have been made in the scheduling system for the following reasons:
- To differentiate the physical machine types
- To separate jobs of different lengths, preventing short jobs from being delayed by a cluster full of long-running jobs.
Only the partitions based on machine type can be selected by the user; the sub-partitions based on job length are assigned automatically.
The available partitions are described in the following table. An example job script selecting a partition is shown below the table.
Partition name | Description | Time limit | Remarks
---|---|---|---
regular | Standard 128-core, 512 GB memory nodes | 10 days |
parallel | Standard 128-core, 512 GB memory nodes, with a fast Omni-Path network connection | 5 days |
gpu | GPU nodes | 3 days | See this page for more information
himem | Big memory nodes with 4 TB of memory and 80 cores | 10 days |
gelifes | Nodes purchased by the GELIFES institute | 10 days | See this page for more information
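As an illustration, a minimal job script requesting the regular partition could look like the sketch below. The job name, resource values, and program name are placeholders; adapt them to your own job and keep the requested time within the partition's limit.

```bash
#!/bin/bash
#SBATCH --job-name=my_analysis     # placeholder name for your job
#SBATCH --partition=regular        # regular, parallel, gpu, himem or gelifes
#SBATCH --time=2-00:00:00          # requested wall time (2 days), within the 10-day limit
#SBATCH --cpus-per-task=8          # example CPU request
#SBATCH --mem=16G                  # example memory request

# Run the actual program (placeholder command)
srun ./my_program input.dat
```

The sub-partition based on the job length is assigned automatically from the requested time, so only the main partition name needs to be given.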
Details on time limits
The Hábrók cluster allows very long jobs, which may take up to 10 days. Running such long jobs has certain disadvantages, however.
These are:
- Long jobs increase the waiting time for new jobs
- Parallel jobs (taking full nodes) will have longer queuing times
- Chances of node or system failure are higher when jobs run longer
- These long jobs cannot be started in the 10 days before maintenance periods
- Urgent system maintenance may cause these jobs to be killed
We therefore urge you to make use of any save and restart (checkpointing) options your program offers if you need to run for this long. If the program does not have such options, they should be added; a sketch of what this can look like in a job script is shown below.
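The job script below resumes from a checkpoint file if one exists. The program name and its `--resume-from` and `--checkpoint-file` options are hypothetical; the actual save and restart mechanism depends entirely on your application.

```bash
#!/bin/bash
#SBATCH --partition=regular
#SBATCH --time=2-00:00:00          # a shorter run instead of one 10-day job

# Hypothetical application flags; replace with your program's own
# checkpoint/restart options.
if [ -f checkpoint.dat ]; then
    # A previous run left a checkpoint: continue from there.
    srun ./my_program --resume-from checkpoint.dat --checkpoint-file checkpoint.dat
else
    # First run: start from scratch and write checkpoints as the run progresses.
    srun ./my_program --checkpoint-file checkpoint.dat
fi
```

Each run then fits in a much shorter time limit, and the work is continued by submitting a follow-up job rather than relying on a single 10-day job surviving until the end.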
We can also help you optimize your code so that it runs faster; please contact us if you need help.
To alleviate these scheduling problems, we limit long-running jobs to only part of the cluster. For example, on the standard nodes at most 80% of the machines may be used for jobs taking more than 3 days. The precise settings depend on the partition and may be changed to improve job scheduling.
Limits on number of jobs
To prevent the scheduling system from being flooded with jobs, there are some limits in place that define how many jobs you are allowed to have in the cluster at any time.
Note that these limits only count your current number of (waiting and running) jobs, so once you reach the limit you can submit new jobs again after some of your existing jobs have finished.
If you try to submit more jobs than allowed, the sbatch command will deny the job submission and show the following error:
```
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
```
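To check how close you are to this limit, you can count your own waiting and running jobs, for example with squeue:

```bash
# Count your own jobs currently in the queue (pending and running)
squeue -u $USER -h | wc -l
```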