===== Running LLMs =====

If you want to run a Large Language Model (LLM) on Habrok, here's one possible and relatively easy way to do it. Note that the versions are recent as of 26 February 2026.

==== Installation ====

1. Log in to Habrok on an interactive node for the installation procedure:
<code>ssh pnumber@interactive1.hb.hpc.rug.nl</code>

2. Since the ''vllm'' installation packages require a newer glibc than our operating system provides, we will switch to the EESSI software stack, which provides a compatibility layer with a newer glibc:
<code>module load EESSI/2025.06</code>

3. Load the Python module in the version you would like to use:
<code>module load Python/3.13.5-GCCcore-14.3.0</code>
  
4. Create a virtual environment (only once):
<code>python3 -m venv .env</code>

5. Activate the venv:
<code>source .env/bin/activate</code>

6. Upgrade ''pip'' and ''wheel'' (optional):
<code>pip install --upgrade pip wheel</code>

7. Install ''vllm'' (you can also specify a version):
<code>pip install vllm</code>
The installation might take a while the first time.
==== Running through an interactive job ====

1. Start an interactive job on an A100 node (single GPU) to be able to run the software:
<code>srun --nodes=1 --ntasks=1 --partition=gpushort --mem=120G --time=04:00:00 --gres=gpu:a100:1 --pty bash</code>

2. Switch to the EESSI software stack:
<code>module load EESSI/2025.06</code>

3. Load the Python module you used for the installation:
<code>module load Python/3.13.5-GCCcore-14.3.0</code>

4. Activate the venv you created earlier:
<code>source .env/bin/activate</code>

5. Run ''vllm'' with the appropriate parameters (these are some examples):
<code>export HF_HOME=/tmp && vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16 --download-dir /tmp/models --max-model-len 1024 --gpu-memory-utilization 0.95 --port 8192</code>
Explanations of some of the parameters:
  * ''HF_HOME'': since the models can be large, this downloads them to the local disk of the particular GPU node that the model is running on
  * The model is ''neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16''; other models should work as well, although some require a GPU compute capability that might not be available on Habrok
  * ''download-dir'': this may be the same as ''HF_HOME''
  * ''port'': you can specify whatever port you want
Once ''vllm'' is up and running, take note of the node it is running on (e.g. ''a100gpu6''), and then forward the appropriate port to your local machine:
<code>ssh -NL 8192:a100gpu6:8192 pnumber@login1.hb.hpc.rug.nl</code>

You can then test that it is working with:
<code>curl -X GET localhost:8192/v1/models</code>

and you should get something like:

<code>{"object":"list","data":[{"id":"neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16","object":"model","created":1729006332,"owned_by":"vllm","root":"neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16","parent":null,"max_model_len":1024,"permission":[{"id":"modelperm-13c3464597dc45dd9b661847a0343f39","object":"model_permission","created":1729006332,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}</code>

or you can go to ''http://localhost:8192/v1/models'' in a browser and get the same JSON:

<code>{
  "object": "list",
  "data": [
    {
      "id": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
      "object": "model",
      "created": 1729006479,
      "owned_by": "vllm",
      "root": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
      "parent": null,
      "max_model_len": 1024,
      "permission": [
        {
          "id": "modelperm-5c65faf9419446fb94c80c2d669056c4",
          "object": "model_permission",
          "created": 1729006479,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}</code>
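Beyond listing the models, you can send an actual chat request to the OpenAI-compatible endpoint that ''vllm'' serves. The following is a hypothetical example: the port (8192) and model name match the ''vllm serve'' command above, and should be adjusted to your own setup.

```shell
# Hypothetical example: send a chat request through the forwarded port.
# The model name must match the one passed to "vllm serve".
PAYLOAD='{
  "model": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
  "messages": [{"role": "user", "content": "Tell me something about Groningen"}],
  "max_tokens": 64
}'
curl -s -X POST localhost:8192/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```

The response is a JSON object; the model's reply is in the ''choices[0].message.content'' field.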

==== Running Ollama in a jobscript ====

The following code can be used in a jobscript to run an Ollama model:

<code>
# Load the Ollama module
# GPU node
module load ollama/0.6.0-GCCcore-12.3.0-CUDA-12.1.1
# CPU node
# module load ollama/0.6.0-GCCcore-12.3.0

# Use /scratch for storing the models
export OLLAMA_MODELS=/scratch/$USER/ollama/models

# Start the Ollama server in the background, and log all its output to ollama-serve.log
ollama serve >& ollama-serve.log &
# Wait a few seconds to make sure that the server has started
sleep 5

# Run the model
echo "Tell me something about Groningen" | ollama run deepseek-r1:14b

# Kill the server process
pkill -u $USER ollama
</code>
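To turn the snippet above into a complete batch job, it needs the usual SLURM header at the top of the jobscript. The directives below are an illustrative sketch, not a tested recipe; the job name, partition, GPU type, memory, and time limit are assumptions that you should adapt to your own job:

```shell
#!/bin/bash
# Illustrative SLURM header (all values are assumptions, adapt as needed)
#SBATCH --job-name=ollama
#SBATCH --partition=gpushort
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:a100:1
#SBATCH --mem=120G
#SBATCH --time=01:00:00

# ... followed by the Ollama commands shown above ...
```

Submit the jobscript with ''sbatch'' and inspect ''ollama-serve.log'' and the job output file once it finishes.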