The cluster is divided into several partitions. Partitions divide the cluster's resources based either on the physical attributes of the machines or on the job types that are allowed to run on certain resources.
The node types of the cluster are described in Cluster description.
Partitions have been set up in the scheduling system based on these criteria: the main partitions correspond to node types, and each is further divided into sub-partitions based on job length. Only the main partitions can be selected by the user; the sub-partitions based on job length are assigned automatically.
The available partitions are described in the following table.
| Partition name | Description | Time limit | Remarks |
|---|---|---|---|
| regular | Standard nodes with 128 cores and 512 GB of memory | 10 days | |
| parallel | Standard nodes with 128 cores and 512 GB of memory, with a fast Omni-Path network connection | 5 days | |
| gpu | GPU nodes | 3 days | See this page for more information |
| himem | Big memory nodes with 4 TB of memory and 80 cores | 10 days | |
| gelifes | Nodes purchased by the GELIFES institute | 10 days | See this page for more information |
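
As an illustration, a job can be directed to one of the selectable partitions with the `--partition` option of `sbatch`. The sketch below uses a hypothetical program name (`./my_program`) and placeholder resource values; adjust them to your own job. The requested time limit determines the job-length sub-partition, which is assigned automatically.

```bash
#!/bin/bash
#SBATCH --partition=regular   # one of the selectable partitions from the table
#SBATCH --time=2-00:00:00     # 2 days; the length sub-partition is assigned automatically
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16    # placeholder values; adjust to your job
#SBATCH --mem=32G

./my_program                  # hypothetical program name
```

The script can then be submitted with `sbatch jobscript.sh`.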
The Hábrók cluster allows for running very long jobs, which may take up to 10 days. Running these long jobs has certain disadvantages, however. These are:

* If a node fails or the program crashes near the end of the run, all work done up to that point is lost.
* Long-running jobs are harder to fit into the schedule, which increases waiting times for other jobs.
We therefore urge you to make use of any save-and-restart options your program has if you need to run for this long.
If the program does not have such options, they should be added.
We can also help you optimize your code so that it runs faster. Please contact us if you need help.
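
As a sketch of such a save-and-restart setup, the job script below assumes a hypothetical program `./my_simulation` with `--restart` and `--checkpoint` flags; the actual option names depend entirely on your application.

```bash
#!/bin/bash
#SBATCH --partition=regular
#SBATCH --time=3-00:00:00

# Hypothetical flags: consult your application's documentation for its
# actual checkpoint/restart options.
if [ -f checkpoint.dat ]; then
    # A checkpoint from an earlier run exists: continue from it.
    ./my_simulation --restart checkpoint.dat --checkpoint checkpoint.dat
else
    # First run: start from scratch, writing checkpoints periodically.
    ./my_simulation --checkpoint checkpoint.dat
fi
```

Each run can then stay within a shorter time limit, and a failed or expired job only loses the work done since the last checkpoint.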
In order to alleviate the scheduling problems, we limit long-running jobs to only part of the cluster. For example, on the standard nodes at most 80% of the machines may be used for jobs taking more than 3 days. The precise settings depend on the partition and may be changed to improve job scheduling.
To prevent the scheduling system from being flooded with jobs, there are some limits in place that define how many jobs you are allowed to have in the cluster at any time.
Note that these limits only count the current number of (waiting and running) jobs. So if you reach the limit, you can submit new jobs after some other ones have finished.
If you try to submit more jobs than allowed, the sbatch command will deny the job submission and show the following error:
```
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
```
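
You can check how many jobs you currently have in the queue (waiting and running) with `squeue`, for example:

```bash
# Count your own waiting and running jobs; -h suppresses the header line
squeue -u $USER -h | wc -l
```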