Migration to Hábrók

You can find the slides for the presentation on March 27th here.

Because of increased security requirements, we will only allow RUG accounts on Hábrók. Since most people are already using their p- or s-number to log in, this should not cause issues. Existing f-accounts can still be used, but note that the RUG is working on reducing the number of f-accounts, and external collaborators may need a p-number in the future. Given the lack of provenance of the existing umcg- accounts, we will no longer support these. UMCG staff members will need an account based on a p-number to log in.

See the FAQ for information about the transition procedure for the accounts.

For logins on Hábrók, multi-factor authentication (MFA) will be used, just like for all other university services. To make life easier for users, the token will only be requested once per 8 hours, as long as you connect to the same login node, from the same computer, as the same user.

Because there are quite a number of inactive accounts on Peregrine, we have decided not to automatically migrate the accounts to Hábrók, so your account on Peregrine will not automatically work on Hábrók.

If you want to use the new cluster, you need to request access to it by using the Self-Service Portal IRIS.

Please go to: Research and Innovation Support → Computing and Research Support Facilities → High Performance Computing Cluster → Request Hábrók Account.

The existing groups on Peregrine will be recreated on Hábrók with an hb- prefix. When group members get a new account on Hábrók, we will add these accounts to the corresponding new group on Hábrók. After three months we will check for groups without active members and remove those groups from Hábrók. See the data migration section for more details.

For Hábrók we will have two login nodes and two interactive nodes. This is to increase availability: when one of the nodes is down, you can use another one. One or two interactive GPU nodes are planned for the near future. The host names are:

  • login1.hb.hpc.rug.nl
  • login2.hb.hpc.rug.nl
  • interactive1.hb.hpc.rug.nl
  • interactive2.hb.hpc.rug.nl

On all these nodes limits are in place for CPU and memory usage. The limits are higher for the interactive nodes.
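
As an illustration, connecting to one of these nodes from a terminal could look like the sketch below; p123456 is a placeholder for your own account name.

  ssh p123456@login1.hb.hpc.rug.nl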

On Hábrók a clear separation is made between the home directories, long-term, medium-term, and short-term storage.

The home directories are available on all nodes and are meant for storing personal software and settings. Each user has 50 GiB of storage space available. This amount is fixed. We do make backups of the home directories.

The long- and medium-term storage areas are only available on the login nodes. For long-term storage the RDMS system can be used. For medium-term a /projects storage area is mounted on the login nodes. This /projects area will not be available on the compute nodes, as this storage is not optimized for data processing.

Data that needs to be processed, or the data resulting from processing, can be stored in the /scratch area. Because the 30-day retention time for /scratch was circumvented by many users, we will no longer remove data from /scratch automatically. As a consequence we will now apply smaller limits to /scratch by default. This limit can be increased on request, in which case you will need to explain how and where you are going to store important data for the long term.

We will not make a backup of /scratch! And in case of file system issues we can decide to wipe and reformat /scratch.

In Hábrók all nodes have been equipped with fast local storage, which is available during the runtime of the jobs. This storage will perform better than any shared storage that we currently have. This storage can be accessed using the environment variable $TMPDIR in your job scripts. Using this storage area is especially important for use cases with many small files, as most shared file systems (at least those within the available budget) are based on spinning disks and centralized file metadata.
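
As a sketch of how this can be used, the job script below copies input data to the local disk, processes it there, and copies the result back to /scratch; the file names, the program name my_program and the requested resources are arbitrary placeholders.

  #!/bin/bash
  #SBATCH --time=01:00:00
  #SBATCH --mem=4GB

  # Copy the input data from shared storage to the fast local disk of the node
  cp /scratch/$USER/input.dat $TMPDIR/

  # Process the data on the local disk (my_program is a placeholder)
  my_program $TMPDIR/input.dat > $TMPDIR/output.dat

  # Copy the result back to shared storage before the job ends,
  # because the local disk is only available during the runtime of the job
  cp $TMPDIR/output.dat /scratch/$USER/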

On all storage areas quotas are applied to prevent single users from taking up too much space, thereby limiting what is available for others.

For the home directories a fixed quota of 50 GiB is set for each user. For the /projects and /scratch areas the default quota is 250 GiB per user. For the latter (/scratch) there is also a limit of 200,000 files. This limit is much lower than on Peregrine, as /data and /scratch on Peregrine were overloaded by the number of files.
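
To get an idea of how close you are to the file limit, you can count your files with standard tools; the example below assumes your personal /scratch directory is /scratch/$USER and may take a while for large directory trees.

  # Count the number of files below your personal /scratch directory
  find /scratch/$USER -type f | wc -l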

Please store huge collections of small files in archives (tar, zip) and extract these to the fast local disk of the nodes before processing.
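
A minimal sketch of this pattern inside a job script, assuming the archive is stored in your personal /scratch directory and that the processing step writes its output to a directory called results:

  # Extract an archive with many small files onto the fast local disk of the node
  tar xzf /scratch/$USER/dataset.tar.gz -C $TMPDIR

  # ... run the processing on the extracted files in $TMPDIR ...

  # Pack the output into a single archive again before copying it back
  tar czf /scratch/$USER/results.tar.gz -C $TMPDIR results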

The quotas on /projects are handled by the “data handling” project.

For /scratch the quota can only be increased if you can guarantee that you can safely store important data elsewhere. The quotas are based upon a fair-share principle: requests must be reasonable compared to the available space, and quotas may be reduced when other users need additional space and there is no longer sufficient storage space available.

The data in the home directories and /data of Peregrine was available read-only on the login nodes of Hábrók for a period of three months (until July 1st 2023), and is no longer available.

The data on Peregrine /scratch was not migrated, since it is temporary space only.

In Hábrók several hardware classes are available. Here is a short overview:

Type     | Nodes | Cores | Memory (GiB) | GPU               | Partition | Local storage (TiB)      | Notes
Regular  | 117   | 128   | 512          | -                 | regular   | 3.5                      |
Omnipath | 24    | 128   | 512          | -                 | parallel  | 3.5                      | High-bandwidth, low-latency network connection
Memory   | 4     | 80    | 4096         | -                 | himem     | 14                       |
GPU1     | 6     | 64    | 512          | 4 x A100 (40 GiB) | gpu       | 12                       | Some GPUs have been divided into smaller 20 GiB units.
GPU2     | 36    | 12    | 128          | 1 x V100 (32 GiB) | gpu       | 1                        |
Gelifes  | 15    | 64    | 512          | -                 | gelifes   | 15 (spinning disk based) | Nodes owned by the GELIFES institute.

In Hábrók four main partitions are available: regular, himem, parallel and gpu. The partitions correspond to the hardware classes in the table above. Besides this the gelifes partition is accessible to the members of the GELIFES research institute.

All these partitions are subdivided into a short, medium and long sub-partition. As a user you do not have to select these yourself; this is done automatically based on the length of the job. All nodes in a class are available in the short sub-partition, a large part in the medium sub-partition and a limited fraction in the long sub-partition.

This setup prevents long waiting times for shorter jobs and makes sure that long-running jobs are not spread out over all the nodes.

When no partition is specified the job will be sent to regular or himem nodes depending on the CPU and memory requirements for the job.

Here is a short description of the partitions:

Partition | Description
regular   | Partition for the standard CPU nodes
himem     | Partition for the nodes with a large amount of memory
parallel  | Partition for nodes with a fast, low-latency interconnect. This partition is meant for jobs that use multiple nodes and require high bandwidth or low latency.
gpu       | Partition with the GPU nodes. More details in the GPU section below.
gelifes   | Partition with the nodes owned by the GELIFES institute.
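
As an illustration, requesting a specific partition in a job script could look like the sketch below; the resource values are arbitrary examples and my_mpi_program is a placeholder for your own application.

  #!/bin/bash
  # Request the nodes with the fast Omnipath interconnect
  #SBATCH --partition=parallel
  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=128
  #SBATCH --time=12:00:00

  srun ./my_mpi_program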

More details about the GPU nodes: coming soon.

Please see the Known issues page.