Differences

This shows you the differences between two versions of the page.

--- habrok:connecting_to_the_system:login_nodes [2023/03/20 11:51] – admin
+++ habrok:connecting_to_the_system:login_nodes [2025/10/23 15:50] (current) – [Long process termination] pedro
@@ Line 3: / Line 3: @@
 ====== Login nodes ======
-Hábrók has five login nodes that can be used to connect to the system. Besides redundancy reasons (you can always try another one if one of them is down), they all serve different purposes.
+Hábrók has six login nodes that can be used to connect to the system. Besides redundancy reasons (you can always try another one if one of them is down), they all serve different purposes.
 ===== Login nodes =====
-''login1.hb.hpc.rug.nl'' and ''login2.hb.hpc.rug.nl'' are the default login nodes that are used by most users. You can use these to connect to the system, copy your files, submit jobs, compile your code, et cetera. You should not use it to test your applications, since this might slow down the system, which will hinder other users who are trying to log in. It is also a smaller system.
+''login1.hb.hpc.rug.nl'' and ''login2.hb.hpc.rug.nl'' are the default login nodes that are used by most users. You can use these to connect to the system, copy your files, submit jobs, compile your code, et cetera. You should not use them to test your applications, since this might slow down the system and hinder other users who are trying to log in. They are also smaller machines. For this reason, long-running intensive processes are automatically killed; see the section on long process termination below.
 We have set up two of these login nodes to increase the availability of the service.
@@ Line 17: / Line 17: @@
 The interactive nodes are about half the size of a default compute node, and they allow for a bit more testing. If you just want to run your program for a couple of minutes, these are the machines to use. Do keep in mind that these are also shared machines and other people may also want to do some testing. So, if you need to do longer and/or more intensive tests, these tasks should be submitted as jobs.
-To prevent a single user from using all capacity CPU and memory limits are in place.
+To prevent a single user from using all capacity, CPU and memory limits are in place. Furthermore, long-running intensive processes are automatically killed; see the section on long process termination below.
-===== Interactive GPU node =====
+===== Interactive GPU nodes =====
-** Work in progress, this machine is not yet available **
+Finally, the interactive GPU nodes, ''gpu1.hb.hpc.rug.nl'' and ''gpu2.hb.hpc.rug.nl'', are login nodes equipped with a GPU. You can use them to develop and test your GPU applications.
-Finally, the interactive GPU node, ''gpu1.hb.hpc.rug.nl'' is a login node equipped with a GPU. You can use it to develop and test your GPU applications.
+These machines each have an NVIDIA L40S GPU, which can be shared by multiple users. The tool ''nvidia-smi'' will show whether the GPU is in use.
-This machine has an NVIDIA V100 GPU, which can be shared by multiple users. The tool ''nvidia-smi'' will show if the GPU is in use.
+Please keep in mind that these are also shared machines, and other users want to use their GPUs too. So, allow everyone to make use of these GPUs and do not perform long runs here. Long runs should be submitted as jobs to the scheduler. Long-running processes are automatically killed; see the section on long process termination below.
-Please keep in mind that this is also a shared machine, and more users want to use the GPU in this machine. So, allow everyone to make use of these GPUs and do not perform long runs here. Long runs should be submitted as jobs to scheduler.
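As a rough illustration of checking whether a shared GPU is busy before starting a test, the sketch below wraps ''nvidia-smi'' in a small Python script. The function name ''parse_gpu_usage'' and the dictionary keys are made up for this example, and the script assumes that all three queried fields come back as plain integers; it is not an official Hábrók tool.

```python
import shutil
import subprocess

# Ask nvidia-smi for utilization and memory figures as bare CSV numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_usage(csv_text):
    """Parse nvidia-smi CSV output into one dict per GPU (hypothetical helper)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Assumption: every field is an integer (no "[N/A]" values).
        util, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append(
            {
                "utilization_pct": util,
                "memory_used_mib": used,
                "memory_total_mib": total,
            }
        )
    return gpus


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; run this on an interactive GPU node")
    else:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        for index, gpu in enumerate(parse_gpu_usage(result.stdout)):
            print(f"GPU {index}: {gpu}")
```

If the reported utilization or memory use is already high, someone else is probably using the GPU, and it may be better to wait or to submit the work as a job.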
+===== Long process termination =====
+Since 2025-10-24, we automatically kill misbehaving processes that have been running for too long and using too many resources on the login, interactive, and interactive GPU nodes. Certain processes that are expected to run for a long time are allowed (for example, ssh sessions). This is to prevent one or a few users from occupying resources that are only meant for short tests, which then prevents other users from executing legitimate tasks on these nodes. This, in addition to the periodic rebooting of these nodes, ensures that the resources are available in good order for all users.
+===== Periodic reboots =====
+In order to prevent the login/interactive nodes from being filled up with temporary files and long-running processes, these nodes are rebooted every other week on Monday morning at 6:00 CE(S)T. The odd-numbered nodes (''login1'', ''interactive1'', ''gpu1'') are rebooted in odd weeks, and the even-numbered nodes (''login2'', ''interactive2'', ''gpu2'') are rebooted in even weeks.
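The odd/even rule above can be sketched in a few lines of Python to see which nodes are due for a reboot in a given week. This is only an illustration: it assumes that "odd" and "even" weeks refer to ISO 8601 week numbers, which the page does not state explicitly, and the function name ''nodes_rebooted_in_week'' is made up for this example.

```python
from datetime import date

# Node names as listed in the text above.
ODD_NODES = ("login1", "interactive1", "gpu1")
EVEN_NODES = ("login2", "interactive2", "gpu2")


def nodes_rebooted_in_week(day):
    """Return the nodes scheduled for a reboot in the week containing `day`.

    Assumption: week parity follows the ISO 8601 week number.
    """
    week = day.isocalendar()[1]
    return ODD_NODES if week % 2 == 1 else EVEN_NODES


if __name__ == "__main__":
    today = date.today()
    nodes = ", ".join(nodes_rebooted_in_week(today))
    print(f"ISO week {today.isocalendar()[1]}: {nodes}")
```

Under the ISO assumption, a year with 53 weeks is followed directly by week 1, so two odd weeks occur back to back; how the actual schedule handles that case is not described here.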