Submitting jobs

Objectives

  • A short introduction to how to reach the compute nodes

  • Wednesday afternoon is devoted to this topic!

Slurm, sbatch, the job queue

  • Problem: 1000 users, 500 nodes, 10k cores

  • Need a queue:

[Figure: schematic of jobs scheduled in the queue]

  • x-axis: cores, one thread per core

  • y-axis: time

  • Slurm is a job scheduler

  • Plan your job and put it in the Slurm job batch (sbatch): sbatch <flags> <program> or sbatch <job script> (see the example after this list)

  • Easiest to schedule single-threaded, short jobs
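
  • For example, a minimal submission might look like this (a sketch: the project code is the course project used later in this material, and myjob.sh is a placeholder job script):

$ sbatch -A naiss2023-22-793 -p core -n 1 -t 01:00:00 myjob.sh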

[Figures: two queue schematics (left and right), described below]

  • Left: 4 one-core jobs can run immediately (or a 4-core wide job).

    • The jobs are too long to fit on cores 9-13.

  • Right: A 5-core job has to wait.

    • Too long to fit on cores 9-13 and too wide to fit on the remaining cores.

Jobs

  • Job = what happens during booked time

  • Described in a Bash script file

    • Slurm parameters (flags)

    • Load software modules

    • (Move around file system)

    • Run programs

    • (Collect output)

  • … and more

Slurm parameters

  • 1 mandatory setting for jobs:

    • Which compute project? (-A)

    • For example, if your project is named NAISS 2017/1-234 you specify -A naiss2017-1-234

  • 3 settings you really should set:

    • Type of queue? (-p)

      • core or node (for short development jobs and tests: devcore or devel)

    • How many cores? (-n)

      • up to 16 cores (20 on Rackham) for a core job

    • How long at most? (-t)

  • If in doubt:

    • -p core

    • -n 1

    • -t 7-00:00:00
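
  • In a job script, these "if in doubt" defaults become #SBATCH header lines (a sketch using the course project code):

#SBATCH -A naiss2023-22-793
#SBATCH -p core
#SBATCH -n 1
#SBATCH -t 7-00:00:00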

[Figure: core jobs vs. node jobs]

  • Where should it run? (-p node or -p core)

  • Use a whole node or just part of it?

    • 1 node = 20 cores (16 on Bianca & Snowy)

    • 1 hour walltime = 20 core hours = expensive

    • Waste of resources unless you have a parallel program or need all the memory, e.g. 128 GB per node (see the sketch after this list)

  • Default value: core
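
  • A sketch of a header for a job that needs a whole node, e.g. for all of its memory (project code and walltime are placeholders):

#SBATCH -A naiss2023-22-793
#SBATCH -p node
#SBATCH -n 20  # a whole Rackham node; use 16 on Bianca & Snowy
#SBATCH -t 10:00:00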

Maximum walltime at the different clusters

  • Rackham: 10 days

  • Snowy: 30 days

  • Bianca: 10 days

Interactive jobs

  • Most work runs most efficiently as submitted batch jobs, but some tasks, e.g. development, need responsiveness

  • Interactive jobs are high-priority but limited in -n and -t

  • Quickly gives you a job and logs you in to a compute node

  • Require same Slurm parameters as other jobs

Try interactive

$ interactive -A naiss2023-22-793 -p core -n 1 -t 10:00
  • Which node are you on?

    • Log out with <Ctrl>-D or the logout command

A simple job script template

#!/bin/bash -l
# bash is the script language; -l starts a login session with a clean environment, e.g. no modules loaded and paths reset

#SBATCH -A naiss2023-22-793  # Project name

#SBATCH -p devcore  # Partition: a core-level job (as opposed to a whole node); devcore is for short test jobs

#SBATCH -n 1  # Number of cores

#SBATCH -t 00:10:00  # Ten minutes

#SBATCH -J Template_script  # Name of the job

# go to some directory

cd /proj/introtouppmax/labs
pwd -P

# load software modules

module load bioinfo-tools
module list

# do something

echo "Hello world!"

Other Slurm tools

  • squeue — quick info about jobs in queue

  • jobinfo — detailed info about jobs

  • finishedjobinfo — summary of finished jobs

  • jobstats — efficiency of booked resources
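
  • Typical invocations (a sketch: squeue is standard Slurm, but the other tools are UPPMAX-specific, so check --help for their exact options):

$ squeue -u <username>
$ jobinfo -u <username>
$ finishedjobinfo <jobid>
$ jobstats <jobid>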

Exercise at home

  • Copy the job script template above!

  • Put it into a file named “jobtemplate.sh”

  • Make the file executable (chmod)

  • Submit the job:

$ sbatch jobtemplate.sh
  • Note the job id!

  • Check the queue:

$ squeue -u <username>
$ jobinfo -u <username>
  • When it’s done (rather fast), look for the output file (slurm-<jobid>.out):

$ ls -lrt slurm-*
  • Check the output file to see if it ran correctly

$ cat <filename>

What kind of work are you doing?

  • Compute bound

    • you use mainly CPU power (more cores can help)

  • Memory bound

    • if the bottleneck is memory: allocating, copying, or duplicating data (see the sketch below)
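
  • A rough sketch of how this maps to Slurm flags (based on the core/node discussion above; the core count is a placeholder):

# Compute bound: ask for more cores
#SBATCH -p core
#SBATCH -n 8

# Memory bound: book a whole node to get all of its memory
#SBATCH -p node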

More on Wednesday afternoon!

Keypoints

  • You are always on the login node unless you:

    • start an interactive session

    • start a batch job

  • Slurm is a job scheduler

    • add flags to describe your job.