Submitting jobs
Objectives
This is a short introduction to how to reach the compute nodes
Wednesday afternoon is devoted to this topic!
Slurm, sbatch, the job queue
Problem: 1000 users, 500 nodes, 10k cores
Need a queue:
x-axis: cores, one thread per core
y-axis: time
Slurm is a job scheduler
Plan your job and put it in the Slurm job batch (sbatch)
sbatch <flags> <program>
or sbatch <job script>
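As a sketch, the two invocation styles look like this (the project name, time limit, and script name below are placeholders, not taken from a real submission):

```shell
# Sketch: the two ways to hand a job to sbatch (names are placeholders).
# 1) All flags on the command line:
#      sbatch -A <project> -p core -n 1 -t 01:00:00 ./my_program
# 2) Flags inside a job script as #SBATCH lines, then just:
#      sbatch jobtemplate.sh
# Here we only assemble the command line as a string, to show the flag order:
project="naiss2023-22-793"   # placeholder project
cmd="sbatch -A $project -p core -n 1 -t 01:00:00 jobtemplate.sh"
echo "$cmd"
```

On a real cluster, sbatch answers with the job id of the submitted job.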
Easiest to schedule: single-threaded, short jobs
Left: 4 one-core jobs can run immediately (or one 4-core wide job).
The jobs are too long to fit in cores 9-13.
Right: a 5-core job has to wait:
too long to fit in cores 9-13 and too wide to fit in the last cores.
Jobs
Job = what happens during booked time
Described in a Bash script file
Slurm parameters (flags)
Load software modules
(Move around file system)
Run programs
(Collect output)
… and more
Slurm parameters
1 mandatory setting for jobs:
Which compute project? (-A)
For example, if your project is named NAISS 2017/1-334, you specify -A naiss2017-1-334
3 settings you really should set:
Type of queue? (-p): core or node (for short development jobs and tests: devcore, devel)
How many cores? (-n): up to 16 (20 on Rackham) for a core job
How long at most? (-t): e.g. 7-00:00:00 for seven days
If in doubt: -p core -n 1 -t 7-00:00:00
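In a job script, these fallback settings would look like the following fragment (the project name is a placeholder, use your own):

```shell
#SBATCH -A naiss2023-22-793   # placeholder: your own project
#SBATCH -p core               # partition: part of a node
#SBATCH -n 1                  # one core
#SBATCH -t 7-00:00:00         # at most seven days
```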
Where should it run? (-p node or -p core)
Use a whole node or just part of it?
1 node = 20 cores (16 on Bianca & Snowy)
1 hour walltime = 20 core hours = expensive
Waste of resources unless you have a parallel program or need all the memory, e.g. 128 GB per node
Default value: core
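A sketch of what booking a whole node could look like in a job script (the project name is a placeholder; note that -N counts nodes, an assumption here, whereas -n counts cores):

```shell
#SBATCH -A naiss2023-22-793   # placeholder: your own project
#SBATCH -p node               # book whole nodes instead of cores
#SBATCH -N 1                  # one node = 20 cores on Rackham
#SBATCH -t 01:00:00           # one hour walltime = 20 core hours
```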
Walltime at the different clusters
Rackham: 10 days
Snowy: 30 days
Bianca: 10 days
Interactive jobs
Most work is most efficient as submitted jobs, but e.g. development needs responsiveness
Interactive jobs are high-priority but limited in -n and -t
They quickly give you a job and log you in to the compute node
Require same Slurm parameters as other jobs
Try interactive
$ interactive -A naiss2023-22-793 -p core -n 1 -t 10:00
Which node are you on?
Log out with <Ctrl>-D or logout
A simple job script template
#!/bin/bash -l
# the first line says this is a Bash script; -l starts a login session with a clean environment, e.g. no modules loaded and paths reset
#SBATCH -A naiss2023-22-793 # Project name
#SBATCH -p devcore # Asking for cores (for test jobs and as opposed to multiple nodes)
#SBATCH -n 1 # Number of cores
#SBATCH -t 00:10:00 # Ten minutes
#SBATCH -J Template_script # Name of the job
# go to some directory
cd /proj/introtouppmax/labs
pwd -P
# load software modules
module load bioinfo-tools
module list
# do something
echo Hello world!
Other Slurm tools
squeue — quick info about jobs in the queue
jobinfo — detailed info about jobs
finishedjobinfo — summary of finished jobs
jobstats — efficiency of booked resources
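Sketched usage on the cluster (the job-id arguments below are assumptions about these UPPMAX tools; check each tool's --help for the exact form):

```shell
squeue -u $USER            # your jobs currently in the queue
jobinfo -u $USER           # detailed info about your jobs
finishedjobinfo -u $USER   # summary of your finished jobs (assumed flag)
jobstats -p <jobid>        # efficiency plot for a job (assumed flag)
```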
Exercise at home
Copy the job script template just above!
Put it into a file named “jobtemplate.sh”
Make the file executable (chmod)
Submit the job:
$ sbatch jobtemplate.sh
Note the job id!
Check the queue:
$ squeue -u <username>
$ jobinfo -u <username>
When it’s done (rather fast), look for the output file (slurm-<jobid>.out):
$ ls -lrt slurm-*
Check the output file to see if it ran correctly
$ cat <filename>
What kind of work are you doing?
Compute bound
you mainly use CPU power (more cores can help)
Memory bound
the bottlenecks are allocating memory and copying/duplicating data
More on Wednesday afternoon!
Keypoints
You are always on the login node unless you:
start an interactive session
start a batch job
Slurm is a job scheduler
add flags to describe your job.