Running Python in batch mode

Questions

  • What is a batch job?

  • How do you create a batch job?

Objectives

  • Short introduction to SLURM scheduler

  • Show structure of a batch script

  • Try example

Compute allocations in this workshop

  • Rackham: naiss2024-22-107

  • Kebnekaise: hpc2n2024-025

Storage space for this workshop

  • Rackham: /proj/r-py-jl

  • Kebnekaise: /proj/nobackup/hpc2n2024-025

Longer, resource-intensive, or parallel jobs must be run through a batch script.

The batch system used at both UPPMAX and HPC2N is called SLURM. The same is the case at most of the Swedish HPC centres.

SLURM is an open-source job scheduler that provides three key functions:

  • Keeps track of available system resources

  • Enforces local system resource usage and job scheduling policies

  • Manages a job queue, distributing work across resources according to policies

In order to run a batch job, you need to create and submit a SLURM submit file (also called a batch submit file, a batch script, or a job script).

Guides and documentation at: http://www.hpc2n.umu.se/support and https://www.uppmax.uu.se/support/user-guides/slurm-user-guide/

Workflow

  • Write a batch script

    • Inside the batch script you need to load the modules you need (Python, Python packages … )

    • Possibly activate an isolated/virtual environment to access own-installed packages

    • Ask for resources depending on whether it is a parallel or a serial job, whether you need GPUs, etc.

    • Give the command(s) to run your Python script

  • Submit the batch script with sbatch <my-python-script.sh>

Common file extensions for batch scripts are .sh or .batch, but an extension is not required; you can choose any name that makes sense to you. By default, the job's output (stdout and stderr) is written to a file named slurm-<job-id>.out in the directory you submitted from.

Useful commands to the batch system

  • Submit job: sbatch <jobscript.sh>

  • Get list of your jobs: squeue -u <username>

  • Check on a specific job: scontrol show job <job-id>

  • Delete a specific job: scancel <job-id>

  • Useful info about a job: sacct -l -j <job-id> | less -S

  • URL to a page with info about the job (Kebnekaise only): job-usage <job-id>

Example Python batch scripts

Serial code

Short serial example script for Rackham. Loading Python 3.11.8. Numpy is preinstalled and does not need to be loaded.

#!/bin/bash
#SBATCH -A naiss2024-22-107 # Change to your own after the course
#SBATCH --time=00:10:00 # Asking for 10 minutes
#SBATCH -n 1 # Asking for 1 core

# Load any modules you need, here Python 3.11.x.
module load python/3.11.8

# Run your Python script
python mmmult.py

Submit the script to the batch system:

$ sbatch <batch script>
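The contents of mmmult.py are not reproduced on this page; as an illustration only, a minimal matrix-multiplication script along those lines might look like the sketch below (the matrix sizes and contents are made-up example values, not the course's actual file):

```python
import numpy as np

# Create two small random matrices (sizes chosen arbitrarily for the example)
rng = np.random.default_rng(seed=0)
a = rng.random((4, 4))
b = rng.random((4, 4))

# Multiply them and print the result
c = a @ b
print(c)
```

Since the script only uses NumPy, which is preinstalled with the python/3.11.8 module on Rackham, no extra packages need to be loaded in the batch script.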

Serial code + self-installed package in virt. env.

Here we are using the virtual environment we created under the “isolated environments” session earlier. It uses the Python package seaborn (https://github.com/mwaskom/seaborn). In order to run the seaborn-code.py example, you need to download the data set “tips.csv”, which you can find here: https://github.com/mwaskom/seaborn-data. If you want, there are other datasets there to play with.

Short serial example for running on Rackham. Loading Python 3.11.x + using any Python packages you have installed yourself with venv. More information under the separate session for UPPMAX. Change to your directory name and venv name below.

#!/bin/bash
#SBATCH -A naiss2024-22-107 # Change to your own after the course
#SBATCH --time=00:10:00 # Asking for 10 minutes
#SBATCH -n 1 # Asking for 1 core

# Load any modules you need, here for Python 3.11.x
module load python/3.11.8

# Activate your virtual environment.
# CHANGE <path-to-virt-env> to the full path where you installed your virtual environment
# Example: /proj/naiss2024-22-107/<user-dir>/python/<venv-name>
source /proj/naiss2024-22-107/<user-dir>/<path-to-virt-env>/<venv-name>/bin/activate

# Run your Python script
python seaborn-code.py

Submit the script to the batch system:

$ sbatch <batch script>

Note that the SLURM output file will be empty on success; the job just creates the file tipsplot.png.

GPU code

We will not test this live, but you can try it if you have Snowy access or an account on Kebnekaise with GPU access.

Note

Since the newest Python package modules on UPPMAX and HPC2N do not contain CUDA, we will use Python 3.9.x for these examples. There is some problem with PyTorch under the ML package on UPPMAX, so you need to use the virtual environment.

Short GPU example for running on Snowy. This runs the example pytorch_fitting_gpu.py program, which you can find in the Exercises/Python directory.

#!/bin/bash
#SBATCH -A naiss2024-22-107
#SBATCH -t 00:10:00
#SBATCH --exclusive
#SBATCH -p node
#SBATCH -N 1
#SBATCH -M snowy
#SBATCH --gpus=1
#SBATCH --gpus-per-node=1

# Load any modules you need, here loading Python 3.9.5 and the corresponding ML packages module

module load uppmax
module load python_ML_packages/3.9.5-gpu python/3.9.5

# Activate the Example-gpu environment to use the PyTorch we installed there
source <path-to-to-your-virtual-environment>/Example-gpu/bin/activate

# Run your code
srun python pytorch_fitting_gpu.py

Submit the script to the batch system:

$ sbatch <batch script>

Exercises

Run the first serial example script from further up on the page, but replace the Python script with this short program (sum-2args.py):

import sys

# Read the two command-line arguments as integers
x = int(sys.argv[1])
y = int(sys.argv[2])

# Use "total" rather than "sum" to avoid shadowing the built-in sum()
total = x + y

print("The sum of the two numbers is: {0}".format(total))

Remember to give the two arguments to the program in the batch script.
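As a hint, a possible batch script could look like this (the account is the course project; the arguments 2 and 3 are arbitrary example values):

```shell
#!/bin/bash
#SBATCH -A naiss2024-22-107 # Change to your own after the course
#SBATCH --time=00:05:00 # Asking for 5 minutes
#SBATCH -n 1 # Asking for 1 core

# Load any modules you need, here Python 3.11.x
module load python/3.11.8

# Run the script, giving it the two input arguments
python sum-2args.py 2 3
```

The printed sum will end up in the SLURM output file (slurm-<job-id>.out by default).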

Keypoints

  • The SLURM scheduler handles allocations on the compute nodes

  • Batch jobs run without user interaction

  • A batch script consists of a part with SLURM parameters describing the allocation and a second part describing the actual work within the job, for instance one or several Python scripts.

    • Remember to include possible input arguments to the Python script in the batch script.