Jupyter notebooks

Jupyter notebooks are used quite heavily in data science applications, and it can be useful to be able to run them on the cluster.

The main issue you'll encounter is that in a batch-scheduled system the notebook server may not start immediately. You can also run a notebook server on the interactive or GPU frontend node, but you should only use these for a limited time.

There are two ways of accessing Jupyter on the cluster nodes. The first is through the web portal; the second makes use of SSH tunnels to allow your local browser to connect to the remote Jupyter notebook server.

The web portal at https://portal.hb.hpc.rug.nl allows starting up a Jupyter notebook through a batch job and offers a direct web interface into the notebook. The details are described at the web portal page.

The main issue with these notebooks is that controlling the packages that are used is somewhat cumbersome. We have also discovered that it can be problematic to combine these with your own virtual environment. This is because the standard modules set PYTHONPATH, and this path may get preference over packages installed in your virtual environment. If this causes issues we advise you to use an SSH tunnel to connect to your notebooks, as described below.
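If you want to check whether this is happening, you can inspect Python's module search path once you have a virtual environment set up and activated (see below); paths injected via PYTHONPATH appear before the environment's own site-packages directory:

# Show the directories Python searches for packages, in order
python -c 'import sys; print("\n".join(sys.path))'
# Show the paths injected by loaded modules
echo $PYTHONPATH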

It is possible to install Jupyter and all packages that you want to use in a virtual environment. The use of virtual environments is described on the page on Python and virtual environments.

We will therefore only describe the basic installation steps here. The full steps are:

  1. Load the Python version you want to use and set up a virtual environment for your packages
  2. Create a batch job with the CPU, memory and time requirements you need, which will load the virtual environment and start the Jupyter notebook on the compute node
  3. Once the job is running you need to set up an SSH tunnel on your local system that will forward requests from your local web browser to the Jupyter notebook on a Hábrók compute node
  4. You connect your local browser to the address given in the job output

We'll now take a closer look at these steps.

Setting up the Python environment requires you to first select the Python version to start with, including any additional modules that should be used. Please note that using additional modules may override package versions in your virtual environment. If that is problematic, you should unload the module and only use the packages from your virtual environment. First load a Python module, e.g.:

module load Python/3.11.5-GCCcore-13.2.0 

If you don't need specific numerical Python package versions, consider loading the corresponding SciPy-bundle, which includes numpy, scipy and pandas. Be sure to select the module with the same Python version in the module name.

module load SciPy-bundle/2023.11-gfbf-2023b

If you need different versions of these packages you can ignore this step, or try a more recent Python/SciPy-bundle combination.
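You can check which SciPy-bundle versions are available, and for which toolchains they were built, by querying the module system:

module avail SciPy-bundle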

After this you can create the virtual environment in a directory you select, e.g.:

python -m venv ~/virtual_env/myenv

Activate the virtual environment:

source ~/virtual_env/myenv/bin/activate
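You can verify that the virtual environment is active by checking which Python interpreter will be used; it should point into the environment's directory:

which python
# expected output: something like /home2/username/virtual_env/myenv/bin/python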

Install any Python packages you need. It is wise to first update pip and install wheel, as more recent versions of these make installing packages easier and faster. In this example we have also chosen to install specific versions of several packages, including pandas and numpy.

pip install --upgrade pip wheel
pip install pandas==1.3.4 tensorflow-gpu==2.4.1 scikit-learn==1.0.1 numpy==1.19.5 simpletransformers==0.63.2 XlsxWriter==3.0.2 torch==1.9.1

And of course make sure to install jupyter.

pip install jupyter
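If you want to be able to recreate this set of packages later, e.g. in a new virtual environment, you can record the exact versions that were installed (the file name requirements.txt is just a convention):

# Record the installed package versions
pip freeze > requirements.txt
# Reinstall them later in a fresh virtual environment
pip install -r requirements.txt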

Once you've done this the packages should be in place and can be used. Note that you will have to load the Python module and activate the virtual environment each time after logging in, before using the Python packages.
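In practice this means starting each session with the following two commands before using the packages:

module load Python/3.11.5-GCCcore-13.2.0
source ~/virtual_env/myenv/bin/activate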

The next step is to create a batch job that will start jupyter within your virtual environment. A sample jobscript may look like:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=4
#SBATCH --time=02:00:00
#SBATCH --job-name=jupyter
#SBATCH --mem=8G
#SBATCH --partition=regular

# Clear the module environment
module purge
# Load the Python version that has been used to construct the virtual environment
# we are using below
module load Python/3.11.5-GCCcore-13.2.0

# Activate the virtual environment
source ~/virtual_env/myenv/bin/activate

# Start the jupyter server, using the hostname of the node as the way to connect to it
jupyter notebook --no-browser --ip=$(hostname)
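Assuming you have saved this script as, for example, jupyter.sh, you can submit it with:

sbatch jupyter.sh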

Note that you will need to adjust the job in order to claim the right size and type of resources, e.g. a GPU, as described on the page on submitting GPU jobs.
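As a minimal sketch, requesting a GPU means adding lines like the following to the job header; the partition name and GPU specification shown here are assumptions, so check the page on submitting GPU jobs for the correct values:

#SBATCH --partition=gpu        # assumed partition name, check the GPU jobs page
#SBATCH --gpus-per-node=1      # request one GPU on the node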

Once the job is running you need to check two things: the node on which the job is running, and the port number that the Jupyter notebook server is using. Both can be found in the job output, which will contain the output of the notebook server. It should have lines like:

cat slurm-22866265.out 
[I 17:24:45.784 NotebookApp] http://pg-node187:8888/?token=1c8295e800ecedf2aaa39cd777bdd3a1fefdb99fe94de60d
[I 17:24:45.784 NotebookApp]  or http://127.0.0.1:8888/?token=1c8295e800ecedf2aaa39cd777bdd3a1fefdb99fe94de60d
[I 17:24:45.784 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 17:24:45.795 NotebookApp] 
    
    To access the notebook, open this file in a browser:
        file:///home2/username/.local/share/jupyter/runtime/nbserver-18544-open.html
    Or copy and paste one of these URLs:
        http://pg-node187:8888/?token=1c8295e800ecedf2aaa39cd777bdd3a1fefdb99fe94de60d
     or http://127.0.0.1:8888/?token=1c8295e800ecedf2aaa39cd777bdd3a1fefdb99fe94de60d

The thing to take note of here is the address of the Jupyter web server, including the node name, the port number and a *secret* token. The node name is the machine name given in the http address, in this case pg-node187. We will refer to this as habrok_node. The port number is the number behind the colon after the node name and localhost, in this case 8888. We will refer to this number as jupyter_port.
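If the job output contains a lot of other text, you can quickly extract the first address line from it, adjusting the file name to your own job id:

grep -m 1 'http://' slurm-22866265.out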

We now need to set up an SSH tunnel from our local machine to the node where the job is running. For this we also need to decide on which port we want the remote session to be available locally. We will refer to this port number as jupyter_local_port. In principle you can use the same value for jupyter_local_port as was used for jupyter_port, unless you are already running a local Jupyter session, which will by default also be using port 8888. If the jupyter_port is already in use on your local machine, you can pick another value, e.g. 8889.

On the command line this can be done using a command like:

ssh username@login1.hb.hpc.rug.nl -L jupyter_local_port:habrok_node:jupyter_port

Note that this command will open a session on Hábrók, which you have to leave open for the tunnel to keep working.
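Filled in with the example values from above, and using 8889 as the local port, the command would be:

ssh username@login1.hb.hpc.rug.nl -L 8889:pg-node187:8888

If you do not need an interactive session on the login node, you can add the -N flag, which makes ssh only forward the port.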

When using MobaXterm you can set up an SSH tunnel using the Tunnel icon in the top toolbar. After clicking on this icon you can select “New SSH tunnel”, after which a settings menu will appear. Within this menu you have to add the following settings:

  1. <Forwarded port> : The jupyter_local_port value, see the notes above.
  2. <SSH server> : A Hábrók login node, like e.g. login1.hb.hpc.rug.nl
  3. <SSH login>: Your Hábrók user name.
  4. <SSH port>: This can be left at the default of 22.
  5. <Remote server> : The habrok_node from the Jupyter output.
  6. <Remote port> : The jupyter_port value from the Jupyter output.

After saving these settings, you can start the tunnel by clicking on the start button with the triangle icon.

Port number issues

In principle Jupyter will select a free port number when starting, so the notebook server itself should always start properly. If you want to select a port number yourself, the option --port port_number can be used, where port_number should be a number between 1024 and 65535.
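For example, to let the notebook server listen on port 8889, the last line of the jobscript would become:

jupyter notebook --no-browser --ip=$(hostname) --port=8889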

Since we need to connect to Jupyter on the local machine using a certain port number as well, you may also get issues if that port is already occupied, for example by a local Jupyter session. If this happens you have to adjust the jupyter_local_port value. You will notice this problem when the SSH tunnel refuses to start, or when connecting to the notebook shows you a connection to your local machine instead of the remote Hábrók one. The token will also be incorrect in that case, since the local Jupyter session will be using a different one.
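On Linux and macOS you can check beforehand whether a port is already in use on your local machine; no output means the port is free:

lsof -i :8888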

Once the tunnel is running you can connect to the notebook. This can be done on your local machine by clicking on the link given in the Jupyter output. In this example you need to select the link with localhost (127.0.0.1) as the address, e.g.:

http://127.0.0.1:8888/?token=1c8295e800ecedf2aaa39cd777bdd3a1fefdb99fe94de60d

In case the jupyter_local_port number differs from the jupyter_port, you will have to slightly adjust the URL by changing the port number value (8888 in this case) to the jupyter_local_port value you selected in the steps before. So if you, for example, have chosen that to be 9999, the link needs to be adjusted to:

http://127.0.0.1:9999/?token=1c8295e800ecedf2aaa39cd777bdd3a1fefdb99fe94de60d

Once you have finished running your notebooks it is best to cancel the job in order to release the resources. This can be done using:

scancel jobid

where jobid is the id of the job which is running the notebook. The job id can be discovered by running:

squeue -u $USER
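For the example job shown earlier, with job id 22866265, cancelling it would look like:

scancel 22866265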