The cluster is divided into several partitions. Partitions divide the cluster's resources based either on the physical attributes of the machines or on the job types that are allowed to run on certain resources.
The node types of the cluster are described in Cluster description.
Partitions have been set up in the scheduling system based on these criteria: the main partitions correspond to node types, and each is further divided into sub-partitions based on job length. Only the main partitions can be selected by the user; the sub-partitions based on job length are assigned automatically.
The available partitions are described in the following table.
| Partition name | Description | Time limit | Remarks |
|---|---|---|---|
| regular | Standard nodes with 128 cores and 512 GB of memory | 10 days | |
| parallel | Standard nodes with 128 cores and 512 GB of memory, with a fast Omni-Path network connection | 5 days | |
| gpu | GPU nodes | 3 days | See this page for more information |
| himem | Big memory nodes with 4 TB of memory and 80 cores | 10 days | |
| gelifes | Nodes purchased by the GELIFES institute | 10 days | See this page for more information |
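
As an illustration, a job can be directed to one of the selectable partitions with the `--partition` option of `sbatch`. The sketch below uses a hypothetical program name (`./my_program`) and placeholder resource values; adjust them to your own job. The requested time limit determines the job-length sub-partition, which is assigned automatically.

```bash
#!/bin/bash
#SBATCH --partition=regular   # one of the selectable partitions from the table
#SBATCH --time=2-00:00:00     # 2 days; the length sub-partition is assigned automatically
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16    # placeholder values; adjust to your job
#SBATCH --mem=32G

./my_program                  # hypothetical program name
```

The script can then be submitted with `sbatch jobscript.sh`.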
The Hábrók cluster allows for running very long jobs, which may take up to 10 days. Running these long jobs has certain disadvantages, however. These are:

* If a node fails or the program crashes near the end of the run, all work done up to that point is lost.
* Long-running jobs are harder to fit into the schedule, which increases waiting times for other jobs.
We therefore urge you to make use of any save-and-restart options your program has if you need to run for this long.
If the program does not have such options, they should be added.
We can also help you optimize your code so that it runs faster. Please contact us if you need help.
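
As a sketch of such a save-and-restart setup, the job script below assumes a hypothetical program `./my_simulation` with `--restart` and `--checkpoint` flags; the actual option names depend entirely on your application.

```bash
#!/bin/bash
#SBATCH --partition=regular
#SBATCH --time=3-00:00:00

# Hypothetical flags: consult your application's documentation for its
# actual checkpoint/restart options.
if [ -f checkpoint.dat ]; then
    # A checkpoint from an earlier run exists: continue from it.
    ./my_simulation --restart checkpoint.dat --checkpoint checkpoint.dat
else
    # First run: start from scratch, writing checkpoints periodically.
    ./my_simulation --checkpoint checkpoint.dat
fi
```

Each run can then stay within a shorter time limit, and a failed or expired job only loses the work done since the last checkpoint.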
In order to alleviate the scheduling problems, we limit long-running jobs to only part of the cluster. For example, on the standard nodes at most 80% of the machines may be used for jobs taking more than 3 days. The precise settings depend on the partition and may be changed to improve job scheduling.
To prevent the scheduling system from being flooded with jobs, there are some limits in place that define how many jobs you are allowed to have in the cluster at any time.
Note that these limits only count the current number of (waiting and running) jobs. So if you reach the limit, you can submit new jobs after some other ones have finished.
If you try to submit more jobs than allowed, the sbatch command will deny the job submission and show the following error:
```
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
```
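
You can check how many jobs you currently have in the queue (waiting and running) with `squeue`, for example:

```bash
# Count your own waiting and running jobs; -h suppresses the header line
squeue -u $USER -h | wc -l
```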