Running LLMs

If you want to run a Large Language Model (LLM) on Habrok, here is one relatively easy way to do it.

1. Log in to Habrok with your account:

ssh pnumber@login1.hb.hpc.rug.nl

2. Start an interactive job on an A100 node (single GPU):

srun --nodes=1 --ntasks=1 --partition=gpushort --mem=120G --time=04:00:00 --gres=gpu:a100:1 --pty bash
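If you prefer a non-interactive workflow, the same resources can also be requested in a Slurm batch script instead of an interactive job. This is a minimal sketch; it assumes you have already completed the setup steps below (modules loaded once, virtual environment created, vllm installed), and you should adjust the resources to your needs:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --partition=gpushort
#SBATCH --mem=120G
#SBATCH --time=04:00:00
#SBATCH --gres=gpu:a100:1

# Load the same modules and activate the venv prepared in the steps below
module load Python/3.11.5-GCCcore-13.2.0 CUDA/12.1.1
source .env/bin/activate

# Start the server (same command as the interactive example below)
export HF_HOME=/tmp
vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16 --download-dir /tmp/models --max-model-len 1024 --gpu-memory-utilization 0.95 --port 8192
```

Submit it with sbatch and the server keeps running until the job's time limit, without you having to keep a terminal open.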

3. Load the Python and CUDA modules:

module load Python/3.11.5-GCCcore-13.2.0 CUDA/12.1.1

4. Create a virtual environment (this only needs to be done once):

python3 -m venv .env

5. Activate the venv:

source .env/bin/activate

6. Upgrade pip (optional):

pip install --upgrade pip

7. Install vllm (you can also specify a version):

pip install vllm

This may take a while the first time, since pip has to download quite a few dependencies.

8. Run vllm with the appropriate parameters, for example:

export HF_HOME=/tmp && vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16 --download-dir /tmp/models --max-model-len 1024 --gpu-memory-utilization 0.95 --port 8192

Explanations of some of the parameters:

- HF_HOME: directory where Hugging Face stores its cache; setting it to /tmp keeps it on the node's local disk.
- --download-dir: directory where the model weights are downloaded.
- --max-model-len: maximum context length (prompt plus generated tokens) the server will accept.
- --gpu-memory-utilization: fraction of GPU memory vllm is allowed to use (here 95%).
- --port: port the server listens on.

Once vllm is up and running, take note of the node it is running on (e.g. a100gpu6), and then forward the appropriate port to your local machine:

ssh -NL 8192:a100gpu6:8192 pnumber@login1.hb.hpc.rug.nl

You can then test that it is working with:

curl -X GET localhost:8192/v1/models

and you should get something like:

{"object":"list","data":[{"id":"neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16","object":"model","created":1729006332,"owned_by":"vllm","root":"neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16","parent":null,"max_model_len":1024,"permission":[{"id":"modelperm-13c3464597dc45dd9b661847a0343f39","object":"model_permission","created":1729006332,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

or you can open http://localhost:8192/v1/models in a browser and get the same JSON:

{
  "object": "list",
  "data": [
    {
      "id": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
      "object": "model",
      "created": 1729006479,
      "owned_by": "vllm",
      "root": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
      "parent": null,
      "max_model_len": 1024,
      "permission": [
        {
          "id": "modelperm-5c65faf9419446fb94c80c2d669056c4",
          "object": "model_permission",
          "created": 1729006479,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
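With the tunnel in place you can also send an actual completion request through vllm's OpenAI-compatible API. A sketch; the model name in the payload must match the model you served, and the prompt is just an example:

```shell
# Build the request payload (model name must match the served model)
PAYLOAD='{
  "model": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16",
  "messages": [{"role": "user", "content": "Say hello in one sentence."}],
  "max_tokens": 64
}'

# Send it to the chat completions endpoint on the forwarded port
curl -s http://localhost:8192/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```

The response is a JSON object whose choices[0].message.content field contains the generated text.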