===== Running LLMs =====

If you want to run a Large Language Model (LLM) on Habrok, here's one possible and relatively easy way to do it. Note that the versions are recent as of 26 February 2026.

==== Installation ====

1. Log in to Habrok on an interactive node for the installation procedure:
<code>ssh pnumber@interactive1.hb.hpc.rug.nl</code>

2. Since the ''vllm'' installation packages require a newer glibc than our operating system provides, we will switch to the EESSI software stack, which provides a compatibility layer with a newer glibc:
<code>module load EESSI/2025.06</code>

3. Load the Python module in the version you would like to use:
<code>module load Python/3.13.5-GCCcore-14.3.0</code>
  
4. Create a virtual environment (only once):
<code>python3 -m venv .env</code>

5. Activate the venv:
<code>source .env/bin/activate</code>

6. Upgrade ''pip'' and ''wheel'' (optional):
<code>pip install --upgrade pip wheel</code>

7. Install ''vllm'' (you can also specify a version):
<code>pip install vllm</code>
The installation might take a while the first time.
==== Running through an interactive job ====

1. Start an interactive job on an A100 node (single GPU) to be able to run the software:
<code>srun --nodes=1 --ntasks=1 --partition=gpushort --mem=120G --time=04:00:00 --gres=gpu:a100:1 --pty bash</code>

2. Switch to the EESSI software stack:
<code>module load EESSI/2025.06</code>

3. Load the Python module you used for the installation:
<code>module load Python/3.13.5-GCCcore-14.3.0</code>

4. Activate the venv you created earlier:
<code>source .env/bin/activate</code>

5. Run ''vllm'' with the appropriate parameters (these are some examples):
<code>export HF_HOME=/tmp && vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16 --download-dir /tmp/models --max-model-len 1024 --gpu-memory-utilization 0.95 --port 8192</code>
Explanations of some of the parameters:
  * ''HF_HOME'': since the models can be large, this downloads them to the local disk of the particular GPU node that the model is running on
  * The model is ''neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16''; other models should work as well, although some require a GPU compute capability that might not be available on Habrok
  * ''download-dir'': this may be the same as ''HF_HOME''
  * ''port'': you can specify whatever port you want
Once ''vllm'' is up and running, take note of the node it is running on (e.g. ''a100gpu6''), and then forward the appropriate port to your local machine:
<code>ssh -NL 8192:a100gpu6:8192 pnumber@login1.hb.hpc.rug.nl</code>

You can then test that it is working with:
<code>curl -X GET localhost:8192/v1/models</code>

and you should get something like:

<code>{"object":"list","data":[{"id":"neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16","object":"model","created":1729006332,"owned_by":"vllm","root":"neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16","parent":null,"max_model_len":1024,"permission":[{"id":"modelperm-13c3464597dc45dd9b661847a0343f39","object":"model_permission","created":1729006332,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}</code>

or you can go to ''http://localhost:8192/v1/models'' in a browser and get the same JSON:

<code>{
  "object": "list",
  "data": [
    {
      "id": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
      "object": "model",
      "created": 1729006479,
      "owned_by": "vllm",
      "root": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
      "parent": null,
      "max_model_len": 1024,
      "permission": [
        {
          "id": "modelperm-5c65faf9419446fb94c80c2d669056c4",
          "object": "model_permission",
          "created": 1729006479,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}</code>
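Beyond listing the models, you can send an actual chat request to the OpenAI-compatible endpoint that ''vllm'' serves. The following is a hypothetical example: the port (8192) and model name match the ''vllm serve'' command above, and should be adjusted to your own setup.

```shell
# Hypothetical example: send a chat request through the forwarded port.
# The model name must match the one passed to "vllm serve".
PAYLOAD='{
  "model": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
  "messages": [{"role": "user", "content": "Tell me something about Groningen"}],
  "max_tokens": 64
}'
curl -s -X POST localhost:8192/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```

The response is a JSON object; the model's reply is in the ''choices[0].message.content'' field.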

==== Running Ollama in a jobscript ====

The following code can be used in a jobscript to run an Ollama model:

<code>
# Load the Ollama module
# GPU node
module load ollama/0.6.0-GCCcore-12.3.0-CUDA-12.1.1
# CPU node
# module load ollama/0.6.0-GCCcore-12.3.0

# Use /scratch for storing the models
export OLLAMA_MODELS=/scratch/$USER/ollama/models

# Start the Ollama server in the background, and log all its output to ollama-serve.log
ollama serve >& ollama-serve.log &
# Wait a few seconds to make sure that the server has started
sleep 5

# Run the model
echo "Tell me something about Groningen" | ollama run deepseek-r1:14b

# Kill the server process
pkill -u $USER ollama
</code>
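To turn the snippet above into a complete batch job, it needs the usual SLURM header at the top of the jobscript. The directives below are an illustrative sketch, not a tested recipe; the job name, partition, GPU type, memory, and time limit are assumptions that you should adapt to your own job:

```shell
#!/bin/bash
# Illustrative SLURM header (all values are assumptions, adapt as needed)
#SBATCH --job-name=ollama
#SBATCH --partition=gpushort
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:a100:1
#SBATCH --mem=120G
#SBATCH --time=01:00:00

# ... followed by the Ollama commands shown above ...
```

Submit the jobscript with ''sbatch'' and inspect ''ollama-serve.log'' and the job output file once it finishes.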