Python

This page describes the recommended way to use Python on the cluster. This is a long page, so please check the table of contents menu on the right to find the information most relevant for you.

The main take-home message for Python is to load the Python module for the version you want, possibly by loading SciPy-bundle, which includes, among others, optimized numpy, scipy and pandas, and then to create a Python virtual environment for installing any packages that are not already provided through the module system. The details are given below.
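
In condensed form, the workflow looks like this (the module version is just an example; each step is explained in detail below):

module load Python/3.9.6-GCCcore-11.2.0
python3 -m venv $HOME/venvs/first_env
source $HOME/venvs/first_env/bin/activate
pip install --upgrade pip wheel
pip install <the packages you need>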

Caveat: Unfortunately, packages installed outside of the main Python installation and reached through the environment variable PYTHONPATH cannot be upgraded inside a virtual environment. This issue is often encountered with the packages from SciPy-bundle.

In these cases it may be better to stick to loading only the Python module itself, in the version you need, and to install the additional packages yourself in the virtual environment.

Another trick, if you do want to use a module that pulls in SciPy-bundle, is to unload SciPy-bundle afterwards and to install numpy, scipy, pandas and any other required packages in your virtual environment instead.
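
A minimal sketch of that trick (the module names and versions are only examples; check module avail for what is actually installed, and see below for how to create and activate a virtual environment):

module load matplotlib/3.4.3-foss-2021b    # example of a module that pulls in SciPy-bundle
module unload SciPy-bundle
source $HOME/venvs/first_env/bin/activate
pip install numpy scipy pandas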

On Hábrók, we have several versions of Python installed (six versions for Python 3 alone!). In addition to these bare-bones Python installations, we also have optimized versions of a handful of common Python packages (scipy, matplotlib, etc.). However, the Python ecosystem is so large and varied that we cannot hope to provide cluster-optimized versions of anything beyond the most common Python packages.

As a regular user on Hábrók, you have the power to build your own Python Virtual Environment, where you can install any and all Python packages you need. A Python Virtual Environment is simply a folder, saved somewhere you have access to, which contains your own copy of Python as well as all the packages you install in the Virtual Environment. You can build several Virtual Environments (for example, one for each project you're working on), each residing in its own folder and not interfering with the others. To use any of them, you simply tell the system which folder to use, so you can easily switch between these Virtual Environments.
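
Switching between environments is then just a matter of deactivating the current one and activating another, as described in more detail below; for example (the environment names are purely illustrative):

source $HOME/venvs/project_a/bin/activate
# ... work inside project_a ...
deactivate
source $HOME/venvs/project_b/bin/activate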

Below, we show a short and hopefully simple guide to setting up and using a Python Virtual Environment, using the venv Python package.

Building the Python Virtual Environment

Before setting up a Python Virtual Environment, you first need to load a specific version of Python. In this example, we will use the latest version of Python available on Hábrók, 3.9.6, but this should work for older versions as well. If you cannot follow these instructions for a specific version of Python, please let us know, and we will add special instructions for that version.

We load the Python module:

module load Python/3.9.6-GCCcore-11.2.0

and check that we have the right version:

python3 --version
Python 3.9.6

which is what we wanted.

Now, we need to decide where to save the folder that contains the Python Virtual Environment we're going to build. There is no restriction on this, as long as you have the permissions, but we suggest saving it in your home directory, since this storage works best for directories containing many files, and each Python Virtual Environment can contain several hundred files (or more), depending on how many packages you install. Therefore, we will place all environments in $HOME/venvs.
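
This parent folder can be created once with:

mkdir -p $HOME/venvs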

It is easy to build a Python Virtual Environment:

python3 -m venv $HOME/venvs/first_env

where first_env is the name of the environment, as well as of the folder it resides in. Give it a good descriptive name, otherwise you'll be sorry when you have 10-20 different environments.

Using the Python Virtual Environment

The Python Virtual Environment is now built, but we can't use it yet: first we need to activate it. We do this with the following command:

source $HOME/venvs/first_env/bin/activate

and this will change the prompt of the command line from something like [p123456@login1 ~]$ to something like (first_env) [p123456@login1 ~]$. This is a really useful feature, allowing you to see, at a glance, which Python Virtual Environment you are working with.

The environment we just built and activated is a pristine one; it only contains the Python packages that were available in the Python/3.9.6-GCCcore-11.2.0 module. However, we can now populate the environment with whatever packages we want to use in this particular project by installing them. Before installing any additional packages in the Python Virtual Environment, it is a good idea to update pip, the Python package installer, and wheel, which is used to install binary packages:

pip install --upgrade pip
pip install --upgrade wheel

This is not strictly necessary, but it is recommended, especially for older versions of Python, which also come with older versions of pip. Having up-to-date versions makes sure pip and wheel work with the latest package formats.

We are now ready to install additional Python packages into our Python Virtual Environment. This is as simple as

pip install package_name

where package_name is the name of the Python package you want to install. This will install the package into the Python Virtual Environment folder $HOME/venvs/first_env, and the package will be available every time we activate this particular environment in the future.

It is considered good practice to save the names of all the packages you wish to install in a text file (usually called requirements.txt) and use that file to install the packages all at once with the command:

pip install -r requirements.txt

A typical requirements.txt file would look something like

requirements.txt
keras
tqdm==4.59.0

where you can also specify a particular version of a certain package, as is done for tqdm.
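
Conversely, with the environment activated, you can record the currently installed packages into such a file, which makes it easy to recreate the environment later:

pip freeze > requirements.txt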

How do we use the Python Virtual Environment we just built in a job script? Here's an example of such a jobscript:

jobscript.sh
#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --partition=regular
 
module purge
module load Python/3.9.6-GCCcore-11.2.0
 
source $HOME/venvs/first_env/bin/activate
 
python3 --version
which python3
 
deactivate

which you can submit with

sbatch jobscript.sh

This jobscript will first purge your module environment, then load the correct version of Python (you always have to load the Python module before activating your Python Virtual Environment), and then it activates your environment. Once the environment is activated, we check the version of Python, and the location of the Python executable, which should be $HOME/venvs/first_env/bin/python3, the location of your environment. In place of these commands which only give you some information, you can, of course, run your own Python scripts.

Deactivating the Python Virtual Environment isn't strictly necessary, since the job ends after that in any case.

TLDR

If you need to use a specific Python library on Hábrók, don't just pip install it, as what you will get will not be an optimized version. First, check whether the library is already available from the specific version of Python you loaded. If it is not, check whether the library is installed on Hábrók as a module with module avail library_name. When using multiple libraries via the module system, pay attention to the Python and toolchain versions. Only if you have not been able to find the library should you consider installing it via pip in a virtual environment.


The Python ecosystem is extremely varied, with a lot of libraries for all sorts of purposes, from web servers, to numerical computing, and everything in between and to the sides.

As a Python user, you would usually install these libraries with pip, the Python Package Installer. You can still do that on Hábrók, as we have detailed above, but this is not always the best way, because pip doesn't optimize the libraries for the particular machines they would be running on. In an HPC environment, performance is key, especially for numerical libraries.

The Python module itself comes with a host of libraries already installed (in optimized form), so that is the first place to look for a specific library. You can do this with:

module whatis Python/3.7.4-GCCcore-8.3.0

which gives the following output:

Python/3.7.4-GCCcore-8.3.0                            : Description: Python is a programming language that lets you work more quickly and integrate your systems more effectively.
Python/3.7.4-GCCcore-8.3.0                            : Homepage: https://python.org/
Python/3.7.4-GCCcore-8.3.0                            : URL: https://python.org/
Python/3.7.4-GCCcore-8.3.0                            : Extensions: alabaster-0.7.12, asn1crypto-0.24.0, atomicwrites-1.3.0, attrs-19.1.0, Babel-2.7.0, bcrypt-3.1.7, bitstring-3.1.6, blist-1.3.6, certifi-2019.9.11, cffi-1.12.3, chardet-3.0.4, Click-7.0, cryptography-2.7, Cython-0.29.13, deap-1.3.0, decorator-4.4.0, docopt-0.6.2, docutils-0.15.2, ecdsa-0.13.2, future-0.17.1, idna-2.8, imagesize-1.1.0, importlib_metadata-0.22, ipaddress-1.0.22, Jinja2-2.10.1, joblib-0.13.2, liac-arff-2.4.0, MarkupSafe-1.1.1, mock-3.0.5, more-itertools-7.2.0, netaddr-0.7.19, netifaces-0.10.9, nose-1.3.7, packaging-19.1, paramiko-2.6.0, pathlib2-2.3.4, paycheck-1.0.2, pbr-5.4.3, pip-19.2.3, pluggy-0.13.0, psutil-5.6.3, py-1.8.0, py_expression_eval-0.3.9, pyasn1-0.4.7, pycparser-2.19, pycrypto-2.6.1, Pygments-2.4.2, PyNaCl-1.3.0, pyparsing-2.4.2, pytest-5.1.2, python-dateutil-2.8.0, pytz-2019.2, requests-2.22.0, scandir-1.10.0, setuptools-41.2.0, setuptools_scm-3.3.3, six-1.12.0, snowballstemmer-1.9.1, Sphinx-2.2.0, sphinxcontrib-applehelp-1.0.1, sphinxcontrib-devhelp-1.0.1, sphinxcontrib-htmlhelp-1.0.2, sphinxcontrib-jsmath-1.0.1, sphinxcontrib-qthelp-1.0.2, sphinxcontrib-serializinghtml-1.1.3, sphinxcontrib-websupport-1.1.2, tabulate-0.8.3, ujson-1.35, urllib3-1.25.3, virtualenv-16.7.5, wcwidth-0.1.7, wheel-0.33.6, xlrd-1.2.0, zipp-0.6.0

All these libraries will be available to you when you load the Python module with module load Python/3.7.4-GCCcore-8.3.0.
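
To quickly check whether a specific library is included, you can filter this output; note that the module command may print its information on standard error, hence the redirection (a minimal sketch):

module whatis Python/3.7.4-GCCcore-8.3.0 2>&1 | grep -i requests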

If the library you want is not listed here, it might be that we have it installed as a module on Hábrók, in an optimized version. We've done this for several common libraries, and we strongly encourage you to use these modules, rather than pip install the libraries. Doing so can speed up your computation significantly. Below, we present a list of the Python libraries which are installed as modules on Hábrók:

  • TensorFlow
  • SciPy-bundle: numpy, scipy, pandas, mpi4py, mpmath
  • scikit-learn, scikit-image
  • matplotlib
  • PyTorch
  • Numba
  • Tkinter (only usable with portal or X server forwarding)
  • h5py

This is not an exhaustive list; use module avail to check whether a module you are looking for is available before installing it with pip. This mainly applies to large, well-known libraries, however, so don't make things difficult for yourself by checking every single package you intend to import.

To find out which versions of these libraries are available on Hábrók, you can use the module avail command, e.g.

module avail TensorFlow

which will produce something like the following output:

------------------------------------------------ /software/modules/lib ------------------------------------------------
   TensorFlow/1.6.0-foss-2018a-Python-3.6.4-CUDA-9.1.85    TensorFlow/1.12.0-foss-2018a-Python-2.7.14
   TensorFlow/1.6.0-foss-2018a-Python-3.6.4                TensorFlow/1.12.0-foss-2018a-Python-3.6.4
   TensorFlow/1.8.0-foss-2018a-Python-3.6.4                TensorFlow/1.12.0-fosscuda-2018a-Python-2.7.14
   TensorFlow/1.8.0-fosscuda-2018a-Python-3.6.4            TensorFlow/1.12.0-fosscuda-2018a-Python-3.6.4
   TensorFlow/1.9.0-foss-2018a-Python-3.6.4-CUDA-9.1.85    TensorFlow/1.15.2-fosscuda-2019b-Python-3.7.4
   TensorFlow/1.9.0-foss-2018a-Python-3.6.4                TensorFlow/2.0.0-foss-2019a-Python-3.7.2
   TensorFlow/1.10.1-foss-2018a-Python-3.6.4               TensorFlow/2.1.0-fosscuda-2019b-Python-3.7.4
   TensorFlow/1.10.1-fosscuda-2018a-Python-2.7.14          TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4
   TensorFlow/1.10.1-fosscuda-2018a-Python-3.6.4           TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4   (D)

  Where:
   D:  Default Module

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

You can then load a specific version with module load, e.g.:

module load TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4

TensorFlow loads a bunch of other modules on which it depends. You can check which modules are loaded with

module list

and that will give you the following list of almost 50 modules:

Currently Loaded Modules:
  1) GCCcore/8.3.0                    25) GMP/6.1.2-GCCcore-8.3.0
  2) zlib/1.2.11-GCCcore-8.3.0        26) libffi/3.2.1-GCCcore-8.3.0
  3) binutils/2.32-GCCcore-8.3.0      27) Python/3.7.4-GCCcore-8.3.0
  4) GCC/8.3.0                        28) SciPy-bundle/2019.10-fosscuda-2019b-Python-3.7.4
  5) CUDA/10.1.243-GCC-8.3.0          29) Szip/2.1.1-GCCcore-8.3.0
  6) gcccuda/2019b                    30) HDF5/1.10.5-gompic-2019b
  7) numactl/2.0.12-GCCcore-8.3.0     31) h5py/2.10.0-fosscuda-2019b-Python-3.7.4
  8) XZ/5.2.4-GCCcore-8.3.0           32) cURL/7.66.0-GCCcore-8.3.0
  9) libxml2/2.9.9-GCCcore-8.3.0      33) double-conversion/3.1.4-GCCcore-8.3.0
 10) libpciaccess/0.14-GCCcore-8.3.0  34) flatbuffers/1.12.0-GCCcore-8.3.0
 11) hwloc/1.11.12-GCCcore-8.3.0      35) giflib/5.2.1-GCCcore-8.3.0
 12) OpenMPI/3.1.4-gcccuda-2019b      36) ICU/64.2-GCCcore-8.3.0
 13) OpenBLAS/0.3.7-GCC-8.3.0         37) JsonCpp/1.9.3-GCCcore-8.3.0
 14) gompic/2019b                     38) NASM/2.14.02-GCCcore-8.3.0
 15) FFTW/3.3.8-gompic-2019b          39) libjpeg-turbo/2.0.3-GCCcore-8.3.0
 16) ScaLAPACK/2.0.2-gompic-2019b     40) LMDB/0.9.24-GCCcore-8.3.0
 17) fosscuda/2019b                   41) nsync/1.24.0-GCCcore-8.3.0
 18) cuDNN/7.6.4.38-gcccuda-2019b     42) PCRE/8.43-GCCcore-8.3.0
 19) NCCL/2.4.8-gcccuda-2019b         43) protobuf/3.10.0-GCCcore-8.3.0
 20) bzip2/1.0.8-GCCcore-8.3.0        44) protobuf-python/3.10.0-fosscuda-2019b-Python-3.7.4
 21) ncurses/6.1-GCCcore-8.3.0        45) libpng/1.6.37-GCCcore-8.3.0
 22) libreadline/8.0-GCCcore-8.3.0    46) snappy/1.1.7-GCCcore-8.3.0
 23) Tcl/8.6.9-GCCcore-8.3.0          47) SWIG/4.0.1-GCCcore-8.3.0
 24) SQLite/3.29.0-GCCcore-8.3.0      48) TensorFlow/2.3.1-fosscuda-2019b-Python-3.7.4

As you can see, several of the associated Python modules that we listed above have also been loaded, e.g. SciPy-bundle, as well as a specific version of Python itself, i.e. Python/3.7.4-GCCcore-8.3.0.

Associated Python modules behave just like every other module on Hábrók, which means that you need to pay careful attention to the toolchain version (e.g. fosscuda/2019b) and the Python version.

IMPORTANT

Make sure that all the Associated Python Modules you load use the same Python and toolchain versions. Using different versions of these will most likely lead to conflicts.
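
For example, these two modules can safely be loaded together, because both were built with the same toolchain (fosscuda/2019b) and the same Python version (the exact versions available may differ; check module avail):

module load SciPy-bundle/2019.10-fosscuda-2019b-Python-3.7.4
module load h5py/2.10.0-fosscuda-2019b-Python-3.7.4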

This section covers submitting single jobs; for multiple jobs, see the page on Job arrays.

We take the following Python script as an example of running Python on the Hábrók cluster:

python_example.py
#!/usr/bin/env python
import math # In order to use the square root function, Python's math module is imported.
 
x = 2*3*7
print ("The answer of 2*3*7 = %d" % (x))
x = math.sqrt(1764)
print ("Also the square root of 1764 = %d" % (x))

And we save the file as python_example.py.
Next, create a new text file that reserves resources, loads the Python module and runs the Python script. In this case the text file is called python_batch.sh.

python_batch.sh
#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=python_example
#SBATCH --mem=800
module load Python/3.6.4-foss-2018a
python python_example.py

Now the job is ready to be submitted to the SLURM scheduler; the following command in the terminal will do this:

sbatch python_batch.sh

An output file is created; it should contain:

The answer of 2*3*7 = 42
Also the square root of 1764 = 42

In this example we request 10 CPUs from the Hábrók cluster and use all of them for a simple calculation: a list of 10 values (0 to 9) is created and each CPU performs a simple computation on one of these values.
Requesting resources in a batch script is done as follows:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --job-name=python_cpu
#SBATCH --mem=8000
module load Python/3.6.4-foss-2018a
python python_cpu.py

In the Python script we create a pool of 10 worker processes (one per requested CPU):

#!/usr/bin/env python
 
import multiprocessing 
import os # For reading the number of CPUs requested. 
import time # For clocking the calculation. 
 
def double(data):
    return data * 2
 
if __name__ == '__main__':
    begin = time.time()
    inputs = list(range(10)) # Makes a list of 10 values (0 to 9).
    poolSize = int(os.environ['SLURM_JOB_CPUS_PER_NODE']) # Number of CPUs requested.
    pool = multiprocessing.Pool(processes=poolSize)
    poolResults = pool.map(double, inputs) # Do the calculation.
    pool.close() # No more tasks will be submitted to the pool.
    pool.join()  # Wait for the worker processes to finish.
    print ('Pool output:', poolResults) # Results.
    elapsedTime = time.time() - begin
    print ('Time elapsed for ' , poolSize, ' workers: ', elapsedTime, ' seconds')

After the execution of this job, an output file is created in which the list with the new values is printed, together with the time elapsed for this job. Note that it is possible to request fewer CPUs, in which case each CPU will compute more than one value from the list. However, it would not make sense to request more than 10 CPUs, because there are only 10 values to compute, so some CPUs would be left unused.

This example shows how to submit a Python GPU job to the Hábrók cluster. It makes use of the pycuda library, which can be installed by typing the following in the terminal:

module load Python/3.10.4-GCCcore-11.3.0
module load CUDA/11.7.0
module load Boost/1.79.0-GCC-11.3.0
pip install pycuda --user

The --user option makes pip install pycuda in $HOME/.local/.
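
You can quickly verify that the installation succeeded with an import test (just a sanity check, not required):

python3 -c "import pycuda; print('pycuda imported successfully')"
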
Now that pycuda is installed, a new SLURM batch script can be created:

#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=8000
module load Python/3.10.4-GCCcore-11.3.0
module load CUDA/11.7.0
module load Boost/1.79.0-GCC-11.3.0
python ./python_gpu.py

And now we need a Python script that uses GPU functions:

python_gpu.py
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
import numpy
from pycuda.curandom import rand as curand
 
a_gpu = curand((50,))
b_gpu = curand((50,))
 
from pycuda.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]",
        "linear_combination")
 
c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)
 
import numpy.linalg as la
assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
print (c_gpu) # This line is added to the original file to show the final output of the c_gpu array.


The Python code above is taken from the following file in the PyCUDA distribution: examples/elementwise.py
When the job is completed, the output should show:

[ 9.33068848  1.10685492  8.71351433  6.2380209   7.40134811  4.05352402
  2.23266721  6.43384314  7.88853645  5.24907207  8.20568562  5.35862446
  4.10265684  5.24931097  7.30736542  0.65177125  2.21118498  6.48129606
  5.39043808  2.93192148  3.9563725   2.91366696  8.68741035  2.19538403
  7.98006058  3.73060822  6.01299191  5.21303606  2.10666442  2.17959881
  4.78864717  6.74258471  6.92914629  4.06129932  3.62104774  9.37001038
  3.90818572  7.15125608  9.08951855  6.56625509  3.63945365  5.43198586
  8.2178421   3.70657778  0.51833171  6.62938118  2.43193173  3.03066897
  2.44896507  6.26867485]

Avoiding I/O Bottlenecks

If you are using the GPU with, say, many small image files, you may notice that your jobs take a long time to complete, because the images are being read to the GPU sequentially. In this case you can bypass the issue by copying your data (as an archive) to the local storage on the GPU node. To do this, follow the instructions on the Many File Jobs page, which describes the process in more detail.
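
As a minimal sketch of the idea inside a job script, assuming your images are packed in a single archive and that the scheduler provides a node-local scratch directory via $TMPDIR (the paths, the archive name and the script arguments are purely illustrative; follow the Many File Jobs page for the recommended procedure):

# Copy the archive to the node-local storage and unpack it there
cp $HOME/data/images.tar.gz $TMPDIR/
tar -xzf $TMPDIR/images.tar.gz -C $TMPDIR
# Point the Python script at the local copy of the data
python train.py --data-dir $TMPDIR/images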

The next example shows how an MPI job for Python is run on the Hábrók cluster. In the SLURM batch script below, two nodes and three tasks are requested, and an array is scattered from the master process to all processes. The batch file is named python_mpi_batch.sh.

python_mpi_batch.sh
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks=3
#SBATCH --job-name=python_mpi
#SBATCH --mem=8000
module load Python/3.6.4-foss-2018a
mpirun python ./python_mpi.py

The Python script is named python_mpi.py. In this script an array is created and scattered among all processes, then rank-dependent computations are performed on these values, and finally all values are gathered back at the master process:

python_mpi.py
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()
 
# Scattering part.
if rank == 0:
    data = [(i+1)**2 for i in range(size)]
else:
    data = None
data = comm.scatter(data, root=0)
assert data == (rank+1)**2
 
# Check if data is scattered accordingly.
print ("rank ", rank, "has data: ", data)
 
 
# Rank-dependent computations on the data.
for i in range(size):
    if rank == i:
        data = data * rank
 
# Synchronization of the processes.
comm.Barrier() 
 
# Gathering part.
data = comm.gather(data, root=0)
if rank == 0:
    print (data)
else:
    assert data is None 
quit()

Submit the job by giving the command:

sbatch python_mpi_batch.sh

The output of this script gives:

rank  0 has data:  1
rank  1 has data:  4
rank  2 has data:  9
[0, 4, 18]

Note that these output lines may appear in a different order. Additional job information shows that 3 CPUs were used over 2 nodes:

###############################################################################
Hábrók Cluster
Job 1150286 for user 'p275545'
Finished at: Wed Apr 25 11:19:30 CEST 2018

Job details:
============

Name                : python_mpi
User                : p275545
Partition           : regular
Nodes               : pg-node[036,196]
Cores               : 3
State               : COMPLETED
Submit              : 2018-04-25T11:19:23
Start               : 2018-04-25T11:19:25
End                 : 2018-04-25T11:19:30
Reserved walltime   : 00:05:00
Used walltime       : 00:00:05
Used CPU time       : 00:00:01 (efficiency:  7.27%)
% User (Computation): 55.23%
% System (I/O)      : 44.77%
Mem reserved        : 8000M/node
Max Mem used        : 0.00  (pg-node036,pg-node196)
Max Disk Write      : 0.00  (pg-node036,pg-node196)
Max Disk Read       : 0.00  (pg-node036,pg-node196)


Acknowledgements:
=================

Please see this page if you want to acknowledge Hábrók in your publications:

https://wiki.hpc.rug.nl/habrok/additional_information/scientific_output

################################################################################

My program output does not appear as expected when I submit a job.

There are two possibilities here. First, your program is not reaching the line where you expect it to produce output; this is something you will have to solve yourself, preferably by testing on a local machine. Second, Python has buffered your output and an unexpected crash discarded the buffer. This can make debugging on Hábrók tough, because output you did produce is not shown, due to the way Python buffers output to the terminal, or in this case to the job output file. The easiest solution is to run Python with the -u flag, i.e. python -u <my_script> <other_arguments>. Other solutions include logging or writing to stderr instead of stdout.
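
A minimal sketch of the unbuffered approach inside a job script (the script name and arguments are placeholders; setting the PYTHONUNBUFFERED environment variable has the same effect as the -u flag):

# Run Python unbuffered so that output appears in the job output file immediately
python -u <my_script> <other_arguments>

# Alternatively, disable buffering via the environment
export PYTHONUNBUFFERED=1
python <my_script> <other_arguments>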