===== Compute nodes =====
  
{{ :habrok:introduction:habrok_standard_compute_node.jpg?direct&150|Standard Hábrók compute node}}
  
  * 119 standard nodes with the following components:
    * 128 cores @ 2.45 GHz (two AMD 7763 CPUs)
    * 512 GB memory
    * 3.5 TB internal SSD disk space
  
  * 24 nodes for multi-node jobs with the following components:
    * 128 cores @ 2.45 GHz (two AMD 7763 CPUs)
    * 512 GB memory
    * 3.5 TB internal SSD disk space
    * 100 Gbps Omni-Path link
  
{{ :habrok:introduction:habrok_nodes.jpg?direct&150|Rack of Hábrók nodes}}
  
  * 4 big memory nodes with the following components:
    * 80 cores @ 2.3 GHz (two Intel Xeon Platinum 8380 CPUs)
    * 4096 GB memory
    * 14 TB internal SSD disk space

  * 2 Interactive GPU nodes (Delivered by Fujitsu in an earlier purchase) with the following components:
    * 24 cores @ 2.4 GHz (two Intel Xeon Gold 6240R CPUs)
    * 768 GB memory
    * 1 Nvidia L40s GPU accelerator card with 48 GB RAM
  
  * 6 GPU nodes with the following components:
    * 64 cores @ 2.6 GHz (two Intel Xeon Platinum 8358 CPUs)
    * 512 GB memory
    * 4 Nvidia A100 GPU accelerator cards with 40 GB RAM
    * 12 TB internal SSD NVMe disk space
    * 100 Gbps Omni-Path link
{{ :habrok:introduction:habrok_power_nodes.jpg?direct&150|Hábrók nodes with cables for power and network}}
  
  * 18 GPU nodes (Delivered by Fujitsu in an earlier purchase) with the following components:
    * 18 cores @ 2.7 GHz (two Intel Xeon Gold 6150 CPUs)
    * 768 GB memory (621 GB used for temporary disk space)
    * 1 Nvidia V100 GPU accelerator card with 32 GB RAM
    * 621 GB RAM disk
  
  * 15 nodes with the following components:
    * 512 GB memory
    * 16 TB internal disk space
    * Only accessible by GELIFES users, see [[habrok:job_management:partition_details:gelifes|GELIFES Partition]]
  

  * 1 node with the following components:
    * 64 cores @ 2.1 GHz (two Intel Xeon Gold 6448Y CPUs)
    * 1 TB memory
    * 440 GB internal disk space
    * 4 Nvidia H100 GPU accelerator cards with 80 GB RAM
    * Only accessible for educational purposes within the scope of the [[https://myuniversity.rug.nl/infonet/medewerkers/fse/education/who-is-who/sse-teachers-and-x-lab-teams/digital-lab|Digital Lab project (employee login required)]]
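
As an illustration of how the GPU nodes listed above are used, below is a minimal sketch of a Slurm batch script requesting a single GPU on one of the A100 nodes. The partition name ''gpu'' and the GRES type ''a100'' are assumptions for illustration only; consult the partition documentation for the actual names.

<code bash>
#!/bin/bash
# Hypothetical example: ask for 1 of the 4 A100 GPUs in a node, plus
# roughly a quarter of its 64 cores and 512 GB of memory.
# The partition name and GPU type below are assumptions, not real names.
#SBATCH --partition=gpu
#SBATCH --gres=gpu:a100:1
#SBATCH --cpus-per-task=16
#SBATCH --mem=120G
#SBATCH --time=02:00:00

nvidia-smi   # show the GPU that was allocated to the job
</code>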
  
===== Network =====
  
{{ :habrok:introduction:habrok_network.jpg?direct&150|Hábrók network switches}}
  
  * A 100 Gbps low-latency non-blocking Omni-Path network for 24 compute and 6 GPU nodes
    * High bandwidth (100 Gigabit per second)
    * Low latency (a few microseconds of delay before a client starts receiving a message)
    * Useful for parallel processing over multiple computers (see the job script sketch below)
  * Two 25 Gbps Ethernet networks
    * Used for accessing the storage areas and for job communication
    * Can also be useful to access remote data more quickly
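The Omni-Path fabric is what makes jobs spanning several nodes feasible. A minimal sketch of such a multi-node MPI job script, assuming Slurm; the partition name ''parallel'' and the ''foss'' module are placeholder assumptions:

<code bash>
#!/bin/bash
# Hypothetical example: run an MPI program on 2 Omni-Path connected nodes,
# using all 128 cores of each node. Partition and module names are assumed.
#SBATCH --partition=parallel
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --time=01:00:00

module load foss        # assumed toolchain module providing an MPI library
srun ./my_mpi_program   # srun launches one MPI rank per task
</code>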
  
  
===== Storage =====
  
{{ :habrok:introduction:habrok_storage_rack.jpg?direct&150|Standard Hábrók storage rack}}
  
  * The cluster has 2.5 PB (2562 TB) of formatted storage available. This scratch storage is set up using the Lustre parallel file system.
  * 50 GB of home directory storage per user
  
See [[habrok:data_management:storage_areas|]] for more information.
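
To get a quick impression of your own usage of these areas, standard Linux tools suffice; a small sketch (the ''/scratch/$USER'' layout is an assumption, see the storage page linked above for the actual paths):

<code bash>
# Free space on the file system behind your home directory
df -h ~

# Total size of your own scratch area (directory layout is an assumption)
du -sh /scratch/$USER
</code>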
  
  
==== Clock speed and turbo mode ====
  
Our standard systems have two sockets, each holding an AMD 7763 processor. Each processor has 64 CPU cores running at 2.45 GHz. When not all cores of a processor are in use, the active cores can run at a higher clock speed (at most 3.5 GHz).
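
You can observe this turbo behaviour on a node with standard Linux tools; for example:

<code bash>
# Report the current clock speed of every core; on a partially idle node
# the busy cores will show values above the 2.45 GHz base clock
grep "cpu MHz" /proc/cpuinfo | sort -n | uniq -c
</code>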
  
==== Hyperthreading ====
  
In principle each CPU core of a modern processor can run multiple threads (programs) simultaneously. This is called hyperthreading. This feature has been disabled on most nodes of the Hábrók cluster, as the performance benefits are minimal and it introduces additional complexity for both the scheduling system and the user.
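
Whether hyperthreading is enabled on the node you are logged in to can be checked with ''lscpu'':

<code bash>
# "Thread(s) per core: 1" means hyperthreading is disabled on this node
lscpu | grep "Thread(s) per core"
</code>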
  
==== Memory access ====
  
Each processor has its own memory controllers and is connected to its own set of memory. For our standard systems this means that each processor has **direct** access to 256 GB of the 512 GB in the system.
  
When a processor wants to access the memory attached to the other processor, it has to use the Infinity Fabric links between the processors. This connection is much slower than the connection to the local memory, so it is important that processes running on one of the processors use the memory local to that processor!
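
The NUMA layout, and the binding of a process to one socket and its local memory, can be inspected and enforced with the standard ''numactl'' tool; a short sketch:

<code bash>
# Show the NUMA layout: which cores and how much memory belong to each socket
numactl --hardware

# Run a program pinned to socket 0, allocating only from its local memory,
# so it never pays the slower cross-socket access penalty
numactl --cpunodebind=0 --membind=0 ./my_program
</code>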
  
**NOTE**
You can still request all the memory on a machine, even with a single core. This is very inefficient, however, since most of the machine's cores will then sit idle; you should look into parallelizing your workload instead.
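
A sketch of a more proportional request on a standard node (128 cores, 512 GB), assuming Slurm: ask for memory in line with the number of cores, here one eighth of the machine:

<code bash>
#!/bin/bash
# Hypothetical example: 16 of 128 cores together with 64 of 512 GB,
# so cores and memory are requested in the same proportion
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=01:00:00

./my_program
</code>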