===== Running LLMs =====
If you want to run a Large Language Model (LLM) on Habrok, here is one possible and relatively easy way to do it. Note that the module versions used below are current as of 26 February 2026.
==== Installation ====
1. Log in to Habrok with your account on an interactive node for the installation procedure:
ssh pnumber@interactive1.hb.hpc.rug.nl
2. Since the ''vllm'' installation packages require a newer glibc than our operating system provides, we will switch to the EESSI software stack, which provides a compatibility layer with a newer glibc:
module load EESSI/2025.06
3. Load the Python module in the version you would like to use:
module load Python/3.13.5-GCCcore-14.3.0
4. Create a virtual environment (only once):
python3 -m venv .env
5. Activate the venv:
source .env/bin/activate
6. Upgrade ''pip'' and ''wheel'' (optional):
pip install --upgrade pip wheel
7. Install ''vllm'' (you can also specify a version):
pip install vllm
This might take a while the first time.
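You can optionally verify the installation from within the activated venv (a quick sanity check; the exact version reported depends on what ''pip'' installed):
python -c "import vllm; print(vllm.__version__)"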
==== Running through an interactive job ====
1. Start an interactive job on an A100 node (single GPU) to be able to run the software:
srun --nodes=1 --ntasks=1 --partition=gpushort --mem=120G --time=04:00:00 --gres=gpu:a100:1 --pty bash
2. Switch to the EESSI software stack:
module load EESSI/2025.06
3. Load the Python module you used for installation:
module load Python/3.13.5-GCCcore-14.3.0
4. Activate the venv you created earlier:
source .env/bin/activate
5. Run ''vllm'' with the appropriate parameters, for example:
export HF_HOME=/tmp && vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16 --download-dir /tmp/models --max-model-len 1024 --gpu-memory-utilization 0.95 --port 8192
Explanations of some of the parameters:
* ''HF_HOME'': since the models can be large, this makes them download to the local disk of the particular GPU node that the model is running on
* The model is ''neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16''; other models should also work, but some models require a GPU compute capability that may not be available on Habrok
* ''--download-dir'': this may be the same as ''HF_HOME''
* ''--port'': you can specify whatever port you want
Once ''vllm'' is up and running, take note of the node it is running on (e.g. ''a100gpu6''), and then forward the appropriate port to your local machine (run this command on your own machine):
ssh -NL 8192:a100gpu6:8192 pnumber@login1.hb.hpc.rug.nl
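If you have lost track of which node the server is running on, you can look it up from a login node with ''squeue'' (the output below is only illustrative; your job ID and node name will differ):
squeue -u $USER
     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12345678  gpushort     bash  pnumber  R       5:01      1 a100gpu6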
You can then test that it is working with:
curl -X GET localhost:8192/v1/models
and you should get something like:
{"object":"list","data":[{"id":"neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16","object":"model","created":1729006332,"owned_by":"vllm","root":"neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16","parent":null,"max_model_len":1024,"permission":[{"id":"modelperm-13c3464597dc45dd9b661847a0343f39","object":"model_permission","created":1729006332,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
or you can go to ''http://localhost:8192/v1/models'' in a browser and get the same JSON:
{
  "object": "list",
  "data": [
    {
      "id": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
      "object": "model",
      "created": 1729006479,
      "owned_by": "vllm",
      "root": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
      "parent": null,
      "max_model_len": 1024,
      "permission": [
        {
          "id": "modelperm-5c65faf9419446fb94c80c2d669056c4",
          "object": "model_permission",
          "created": 1729006479,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
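Once this works, you can send actual requests through the OpenAI-compatible API that ''vllm'' serves, for example a chat completion (a sketch; the prompt and ''max_tokens'' value are only examples):
curl http://localhost:8192/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
        "messages": [{"role": "user", "content": "Tell me something about Groningen"}],
        "max_tokens": 256
      }'
The response is a JSON object whose ''choices'' field contains the generated text.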
==== Running Ollama in a jobscript ====
The following code can be used in a jobscript to run an Ollama model:
# Load the Ollama module
# GPU node
module load ollama/0.6.0-GCCcore-12.3.0-CUDA-12.1.1
# CPU node
# module load ollama/0.6.0-GCCcore-12.3.0
# Use /scratch for storing models
export OLLAMA_MODELS=/scratch/$USER/ollama/models
# Start the Ollama server in the background, log all its output to ollama-serve.log
ollama serve >& ollama-serve.log &
# Wait a few seconds to make sure that the server has started
sleep 5
# Run the model
echo "Tell me something about Groningen" | ollama run deepseek-r1:14b
# Kill the server process
pkill -u $USER ollama
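Note that a complete jobscript also needs the usual Slurm header in front of these commands. A minimal sketch, assuming a single A100 GPU on the ''gpushort'' partition (adjust the job name, resources and wall time to your model and workload):
#!/bin/bash
#SBATCH --job-name=ollama-example
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=gpushort
#SBATCH --gres=gpu:a100:1
#SBATCH --mem=16G
#SBATCH --time=01:00:00

# ... followed by the module load / ollama commands shown above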