Running LLMs
If you want to run a Large Language Model (LLM) on Habrok, here's one possible and relatively easy way to do it.
1. Log in to Habrok with your account:
ssh pnumber@login1.hb.hpc.rug.nl
2. Start an interactive job on an A100 node (single GPU):
srun --nodes=1 --ntasks=1 --partition=gpushort --mem=120G --time=04:00:00 --gres=gpu:a100:1 --pty bash
3. Load the Python and CUDA modules:
module load Python/3.11.5-GCCcore-13.2.0 CUDA/12.1.1
4. Create a virtual environment (only once):
python3 -m venv .env
5. Activate the venv:
source .env/bin/activate
6. Upgrade pip (optional):
pip install --upgrade pip
7. Install vllm (you can also specify a version):
pip install vllm
This might take a while the first time.
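To check that the installation succeeded (a quick sanity check, not part of the recipe itself), you can import the package and print its version:
python -c "import vllm; print(vllm.__version__)"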
8. Run vllm with the appropriate parameters (these are some examples):
export HF_HOME=/tmp && vllm serve neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16 --download-dir /tmp/models --max-model-len 1024 --gpu-memory-utilization 0.95 --port 8192
Explanations of some of the parameters:
- HF_HOME: since the models can be large, this downloads them to the local disk of the particular GPU node the model is running on.
- The model is neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16; other models should be possible too, but some require a GPU compute capability that might not be available on Habrok.
- --download-dir: this may be the same as HF_HOME.
- --port: you can specify whatever port you want.
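If you are unsure whether the GPU you were allocated supports a given model, you can query its compute capability on the node (this assumes a driver recent enough to support the compute_cap query field; the A100s report 8.0):
nvidia-smi --query-gpu=name,compute_cap --format=csv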
Once vllm is up and running, take note of the node it is running on (e.g. a100gpu6), and then forward the appropriate port to your local machine:
ssh -NL 8192:a100gpu6:8192 pnumber@login1.hb.hpc.rug.nl
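If you did not note the hostname when the job started, running hostname inside the interactive session will print it; alternatively, standard Slurm tooling on a login node lists your jobs and their nodes (pnumber is a placeholder for your account):
squeue -u pnumber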
You can then test that it is working with:
curl -X GET localhost:8192/v1/models
and you should get something like:
{"object":"list","data":[{"id":"neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16","object":"model","created":1729006332,"owned_by":"vllm","root":"neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16","parent":null,"max_model_len":1024,"permission":[{"id":"modelperm-13c3464597dc45dd9b661847a0343f39","object":"model_permission","created":1729006332,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
or you can go to http://localhost:8192/v1/models in a browser and get the same JSON.
{ "object": "list", "data": [ { "id": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16", "object": "model", "created": 1729006479, "owned_by": "vllm", "root": "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w8a16", "parent": null, "max_model_len": 1024, "permission": [ { "id": "modelperm-5c65faf9419446fb94c80c2d669056c4", "object": "model_permission", "created": 1729006479, "allow_create_engine": false, "allow_sampling": true, "allow_logprobs": true, "allow_search_indices": false, "allow_view": true, "allow_fine_tuning": false, "organization": "*", "group": null, "is_blocking": false } ] } ] }