Basic batch and Slurm

Questions

  • What is a batch job?

  • What are some important commands regarding batch jobs?

  • How do you make a batch job?

Objectives

  • Short introduction to SLURM scheduler commands

  • Show structure of a batch script

  • Try an example

Compute allocations in this workshop

  • Pelle: uppmax2025-2-393

  • Kebnekaise: hpc2n2025-151

  • Cosmos: lu2025-7-106

  • Tetralith: naiss2025-22-934

  • Dardel: naiss2025-22-934

  • Alvis: naiss2025-22-934

Storage space for this workshop

  • Rackham: /proj/hpc-python-uppmax

  • Kebnekaise: /proj/nobackup/fall-courses

  • Cosmos: /lunarc/nobackup/projects/lu2025-17-52

  • Tetralith: /proj/courses-fall-courses

  • Dardel: /cfs/klemming/projects/supr/courses-fall-courses

  • Alvis: /mimer/NOBACKUP/groups/courses-fall-2025

Reservation

Include with #SBATCH --reservation=<reservation-name>. On UPPMAX it is “magnetic” and so follows the project ID without you having to add the reservation name.

NOTE: As only one or a few nodes are reserved, you should NOT use the reservations for long jobs, as this will block their use for everyone else. They are intended for short test jobs.

  • UPPMAX
  • HPC2N
    • hpc-python-fri for one AMD Zen4 cpu on Friday

    • hpc-python-mon for one AMD Zen4 cpu on Monday

    • hpc-python-tue for two L40s gpus on Tuesday

    • it is magnetic, so will be used automatically

  • LUNARC

What is a batch job?

A batch system keeps track of available system resources and takes care of scheduling the jobs of multiple users running their tasks simultaneously. It typically organizes submitted jobs into some sort of prioritized queue. The batch system is also used to enforce local policies for system resource usage and job scheduling.

Most Swedish HPC clusters are running Slurm. It is an Open Source job scheduler, which provides three key functions.

  • First, it allocates exclusive or non-exclusive access to resources to users for some period of time.

  • Second, it provides a framework for starting, executing, and monitoring work on a set of allocated nodes (the cluster).

  • Third, it manages a queue of pending jobs, in order to distribute work across resources according to policies.

Slurm is designed to handle thousands of nodes in a single cluster, and can sustain throughput of 120,000 jobs per hour.
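To make the idea of a prioritized queue concrete, here is a toy sketch in Python. It is not how Slurm decides what to run: real schedulers also consider fair share, job age, time limits, and backfilling. The Job class, the start_jobs function, and the job names are all hypothetical and exist only for this illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    priority: int                       # toy convention: lower number starts first
    name: str = field(compare=False)
    cores: int = field(compare=False)


def start_jobs(pending, free_cores):
    """Start jobs in priority order until the next job in line no longer fits."""
    queue = list(pending)
    heapq.heapify(queue)
    started = []
    while queue and queue[0].cores <= free_cores:
        job = heapq.heappop(queue)
        free_cores -= job.cores
        started.append(job.name)
    # Everything left keeps waiting, still in priority order.
    # (No backfilling here: a small job behind a big one also waits.)
    waiting = [job.name for job in sorted(queue)]
    return started, waiting


jobs = [Job(2, "alice-mpi", 8), Job(1, "bob-serial", 1), Job(3, "carol-serial", 1)]
started, waiting = start_jobs(jobs, free_cores=4)
print("started:", started)
print("waiting:", waiting)
```

With 4 free cores, the highest-priority job starts, while the 8-core job must wait for more cores and blocks the job queued behind it.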

What are some important commands regarding batch jobs?


This is a brief summary of some of the most important Slurm commands:

  • Submit job: sbatch JOBSCRIPT
    • When you submit a job, the system also returns a Job-ID.

  • Get list of your jobs: squeue -u USERNAME or squeue --me
    • You can also find your Job-ID from this command.

  • Run a program through Slurm directly from the command line: srun <options-for-your-job> program

  • Check on a specific job: scontrol show job JOBID

  • Delete a specific job: scancel JOBID

  • Delete all your own jobs: scancel -u USERNAME

  • Get info on partitions and nodes: sinfo
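If you want to drive these commands from a Python script, for instance to save the Job-ID for later use with scontrol or scancel, a minimal sketch could look like the following. It assumes that sbatch prints a line of the form "Submitted batch job <Job-ID>" on success. The helper names parse_job_id and submit are hypothetical, and the Job-ID in the demo line is made up.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the numeric Job-ID from the text sbatch prints on success."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"No Job-ID found in: {sbatch_output!r}")
    return int(match.group(1))


def submit(jobscript):
    """Submit a batch script with sbatch and return its Job-ID.

    Only works on a system where the sbatch command is available.
    """
    result = subprocess.run(
        ["sbatch", jobscript], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


# Demo on a made-up output line, so this runs without a scheduler.
print(parse_job_id("Submitted batch job 123456\n"))
```

On a cluster you would call submit("JOBSCRIPT") instead of the demo line.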

Running your programs and scripts on UPPMAX, HPC2N, LUNARC, NSC, and PDC

As mentioned under interactive jobs, any longer, resource-intensive, or parallel jobs must be run through a batch script or in an interactive session on allocated compute nodes.

A batch job is not interactive, so you cannot make changes to the job while it is running.

To run a batch job, you need to create and submit a Slurm submit file (also called a batch submit file, a batch script, or a job script).

Guides and documentation at:

Workflow

  • Write a batch script

    • Inside the batch script you need to load the modules you need (Python, Python packages, any prerequisites, … )

    • Possibly activate an isolated/virtual environment to access own-installed packages

    • Ask for resources depending on if it is a parallel job or a serial job, if you need GPUs or not, etc.

    • Give the command(s) to your Python script

  • Submit batch script with sbatch <my-python-script.sh>

Common file extensions for batch scripts are .sh or .batch, but they are not necessary. You can choose any name that makes sense to you.
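The layout of such a script can be sketched by generating one from Python. This is only an illustration of the structure: resource requests as #SBATCH lines first, then module loads, then the command. The function make_batch_script is hypothetical. The project ID and module names are the Pelle values used in this lesson and need to be changed for other clusters.

```python
def make_batch_script(account, time_limit, ntasks, modules, command):
    """Return the text of a simple serial batch script."""
    lines = [
        "#!/bin/bash -l",
        f"#SBATCH -A {account}",          # project/allocation to charge
        f"#SBATCH --time={time_limit}",   # wall-time limit, HH:MM:SS
        f"#SBATCH -n {ntasks}",           # number of tasks (cores)
        "",
    ]
    lines += [f"module load {module}" for module in modules]
    lines += ["", command, ""]
    return "\n".join(lines)


script = make_batch_script(
    account="uppmax2025-2-393",
    time_limit="00:20:00",
    ntasks=1,
    modules=["Python/3.12.3-GCCcore-13.3.0", "SciPy-bundle/2024.05-gfbf-2024a"],
    command="python mmmult.py",
)
print(script)
```

Writing the returned text to a file, for example my-python-script.sh, gives something you can submit with sbatch.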

Simple example batch script

Hint

Type along!

This first example shows how to run a short, serial script. Where the batch script (named run_mmmult.sh) is located depends on how you fetched the course material:

  • If you did git clone https://github.com/UPPMAX/HPC-python.git
    • The batch script is in HPC-Python/Exercises/day2/<center>, where <center> is hpc2n, uppmax, lunarc, nsc, pdc, or c3se.

    • The Python script is in HPC-Python/Exercises/day2/programs and is named mmmult.py.

  • If you did wget https://github.com/UPPMAX/HPC-python/raw/refs/heads/main/exercises.tar.gz and then tar -xvzf exercises.tar.gz
    • The batch script is in exercises/day2/<center>, where <center> is hpc2n, uppmax, lunarc, nsc, or pdc.

    • The Python script is in exercises/day2/programs and is named mmmult.py.

  1. The batch script is run with sbatch run_mmmult.sh.

  2. Try typing squeue -u <username> to see if it is pending or running.

  3. When it has run, look at the output with nano slurm-<jobid>.out.
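Steps 2 and 3 can also be automated. The sketch below polls squeue until the job is no longer pending or running and then returns the name of the default output file. It assumes the squeue options -h (no header), -j (job ID), and -o "%T" (print only the job state). The function names and the set of states treated as active are choices made for this illustration. The demo uses a fake runner with a made-up Job-ID so that it works without a scheduler.

```python
import subprocess
import time
from types import SimpleNamespace

# States treated as "not finished yet" in this sketch.
ACTIVE_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING"}


def job_state(job_id, run=subprocess.run):
    """Return the state squeue reports, or None if the job is not listed."""
    result = run(
        ["squeue", "-h", "-j", str(job_id), "-o", "%T"],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() or None


def wait_for_job(job_id, poll_seconds=30, run=subprocess.run):
    """Poll until the job is no longer active, then return its output file name."""
    while job_state(job_id, run) in ACTIVE_STATES:
        time.sleep(poll_seconds)  # poll sparingly to avoid loading the scheduler
    return f"slurm-{job_id}.out"


# Demo with a fake runner that imitates squeue output over three polls.
fake_outputs = iter(["PENDING\n", "RUNNING\n", ""])


def fake_run(cmd, **kwargs):
    return SimpleNamespace(stdout=next(fake_outputs))


print(wait_for_job(123456, poll_seconds=0, run=fake_run))
```

On a cluster you would leave out the run argument so that the real squeue is called.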

Short serial example script for Pelle. It loads Python 3.12.3 and a compatible SciPy-bundle for NumPy.

#!/bin/bash -l
#SBATCH -A uppmax2025-2-393 # Change to your own after the course
#SBATCH --time=00:20:00 # Asking for 20 minutes
#SBATCH -n 1 # Asking for 1 core

# Load any modules you need, here Python 3.12.3
# and a compatible SciPy-bundle for numpy
module load Python/3.12.3-GCCcore-13.3.0
module load SciPy-bundle/2024.05-gfbf-2024a

# Run your Python script
python mmmult.py

Exercises

Run the first serial example script (the one that was used to run mmmult.py) from further up on the page, but for this short Python code (sum-2args.py) instead:

import sys

x = int(sys.argv[1])
y = int(sys.argv[2])

# Named "total" so the built-in sum() is not shadowed
total = x + y

print("The sum of the two numbers is: {0}".format(total))

Remember to give the two arguments to the program in the batch script.
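Because a batch job runs without interaction, a forgotten argument only shows up afterwards as an error in slurm-<jobid>.out. A slightly more defensive variant of the script, sketched below, checks the number of arguments and prints a usage message instead of a traceback. The main function and its return codes are an illustrative choice, not part of the course material.

```python
import sys


def main(argv):
    """Add the two integers given on the command line."""
    # argv[0] is the script name, so two arguments means three entries.
    if len(argv) != 3:
        print(f"Usage: python {argv[0]} X Y", file=sys.stderr)
        return 1
    x = int(argv[1])
    y = int(argv[2])
    print("The sum of the two numbers is: {0}".format(x + y))
    return 0


# In the real script the last line would be: sys.exit(main(sys.argv))
# Here the call is simulated with an explicit argument list.
exit_code = main(["sum-2args.py", "2", "3"])
```

A non-zero exit code also makes the job show up as failed in the scheduler's accounting rather than as completed.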

Keypoints

  • The Slurm scheduler handles allocations to the compute nodes

  • Batch jobs run without user interaction

  • A batch script consists of a first part with Slurm parameters describing the allocation and a second part describing the actual work within the job, for instance one or several Python scripts.

    • Remember to include possible input arguments to the Python script in the batch script.