Running R in batch mode

Questions

  • What is a batch job?

  • How to write a batch script and submit a batch job?

Objectives

  • Short introduction to SLURM scheduler

  • Show structure of a batch script

  • Example to try

Compute allocations in this workshop

  • Rackham: naiss2024-22-107

  • Kebnekaise: hpc2n2024-025

Storage space for this workshop

  • Rackham: /proj/r-py-jl

  • Kebnekaise: /proj/nobackup/hpc2n2024-025

Overview of the UPPMAX systems

graph TB

  Node1 -- interactive --> SubGraph2Flow
  Node1 -- sbatch --> SubGraph2Flow
  subgraph "Snowy"
  SubGraph2Flow(calculation nodes)
  end

  ThinLinc -- usr-sensXXX + 2FA + VPN ----> SubGraph1Flow
  Terminal/ThinLinc -- usr --> Node1
  Terminal -- usr-sensXXX + 2FA + VPN ----> SubGraph1Flow
  Node1 -- usr-sensXXX + 2FA + no VPN ----> SubGraph1Flow

  subgraph "Bianca"
  SubGraph1Flow(Bianca login) -- usr+passwd --> private(private cluster)
  private -- interactive --> calcB(calculation nodes)
  private -- sbatch --> calcB
  end

  subgraph "Rackham"
  Node1[Login] -- interactive --> Node2[calculation nodes]
  Node1 -- sbatch --> Node2
  end

Overview of the HPC2N system

graph TB

  Terminal/ThinLinc -- usr --> Node1

  subgraph "Kebnekaise"
  Node1[Login] -- interactive --> Node2[compute nodes]
  Node1 -- sbatch --> Node2
  end

Any longer, resource-intensive, or parallel job must be run through a batch script.

The batch system used at both UPPMAX and HPC2N (and most other HPC centres in Sweden) is called SLURM.

SLURM is an open-source job scheduler that provides three key functions:

  • Keeps track of available system resources

  • Enforces local system resource usage and job scheduling policies

  • Manages a job queue, distributing work across resources according to policies

In order to run a batch job, you need to create and submit a SLURM submit file (also called a batch submit file, a batch script, or a job script).

Guides and documentation at: http://www.hpc2n.umu.se/support and https://www.uppmax.uu.se/support/user-guides/slurm-user-guide/

Workflow

  • Write a batch script

    • Inside the batch script you need to load the modules you need (R and any prerequisites)

    • If you are using any packages you have installed yourself, make sure R_LIBS_USER is set (export R_LIBS_USER=/path/to/my/R-packages)

    • Request resources depending on whether the job is serial or parallel, whether you need GPUs, and so on

    • Give the command(s) to run your R script

  • Submit the batch script with sbatch <my-batch-script-for-R.sh>

Common file extensions for batch scripts are .sh or .batch, but they are not required; you can choose any name that makes sense to you. A skeleton script following this workflow is shown below.
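
In this sketch, the project ID, wall time, module version, package path, and script name are placeholders to replace with your own values; it simply combines the steps above:

#!/bin/bash
#SBATCH -A <your-project-id>   # Compute project to charge
#SBATCH -t 00:10:00            # Requested wall time
#SBATCH -n 1                   # Requested number of cores

# Load R and any prerequisites
module load R/4.1.1

# If you use own-installed packages, point R_LIBS_USER to them
export R_LIBS_USER=/path/to/my/R-packages

# Run your R script
R --no-save --quiet < my-script.R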

Useful commands to the batch system

  • Submit job: sbatch <jobscript.sh>

  • Get list of your jobs: squeue -u <username>

  • Check on a specific job: scontrol show job <job-id>

  • Delete a specific job: scancel <job-id>

  • Useful info about a job: sacct -l -j <job-id> | less -S

  • URL to a page with info about the job (Kebnekaise only): job-usage <job-id>
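
As a sketch of how these commands fit together in practice (the job ID 1234567 and the script name are made-up examples):

$ sbatch my-batch-script-for-R.sh
Submitted batch job 1234567
$ squeue -u <username>            # is the job pending or running?
$ scontrol show job 1234567       # detailed information about the job
$ sacct -l -j 1234567 | less -S   # accounting info, also after the job has finished
$ scancel 1234567                 # remove the job from the queue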

Keypoints

  • The SLURM scheduler handles allocations to the calculation nodes

  • Interactive sessions were presented in the previous presentation

  • Batch jobs run without user interaction

  • A batch script consists of a part with SLURM parameters describing the allocation and a second part describing the actual work within the job, for instance one or several R scripts.
    • Remember to include any input arguments to the R script in the batch script, as illustrated below.
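
As a hedged illustration of the last point, assuming a hypothetical script sum.R that reads two numbers from the command line via commandArgs(trailingOnly = TRUE), the job step in the batch script could pass the arguments like this:

# Pass two input arguments (here 2 and 3) to the R script;
# inside sum.R they are read with commandArgs(trailingOnly = TRUE)
Rscript sum.R 2 3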

Example R batch scripts

Serial code

Type-Along

Short serial batch example for running the code hello.R

Short serial example script for Rackham, loading R/4.1.1:

#!/bin/bash
#SBATCH -A naiss2024-22-107 # Course project id. Change to your own project ID after the course
#SBATCH --time=00:10:00 # Asking for 10 minutes
#SBATCH -n 1 # Asking for 1 core

# Load any modules you need, here R/4.1.1
module load R/4.1.1

# Run your R script (here 'hello.R')
R --no-save --quiet < hello.R

Send the script to the batch system:

$ sbatch <batch script>
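
For reference, hello.R only needs to contain something simple; a minimal sketch (the course material may provide its own version) is:

# hello.R: print a greeting and the R version in use
message("Hello from R ", getRversion(), "!")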

Parallel code

foreach and doParallel

Type-Along

Short parallel example, using foreach and doParallel

Short parallel example. Since we are using the packages “foreach” and “doParallel”, you need to load the module R_packages/4.1.1 instead of R/4.1.1.

#!/bin/bash
#SBATCH -A naiss2024-22-107
#SBATCH -t 00:10:00
#SBATCH -N 1
#SBATCH -c 4

ml purge > /dev/null 2>&1
ml R_packages/4.1.1

# Run the R program parallel_foreach.R
R -q --slave -f parallel_foreach.R

Send the script to the batch system:

$ sbatch <batch script>
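
For reference, a minimal parallel_foreach.R using “foreach” and “doParallel” could look like the sketch below (the course material may provide its own version; the work done inside the loop is just an example):

library(doParallel)   # also attaches foreach and parallel

# Use the number of cores allocated by SLURM (-c 4 above), falling back to 4
ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "4"))
cl <- makeCluster(ncores)
registerDoParallel(cl)

# Compute the squares of 1..8 in parallel and combine the results into a vector
result <- foreach(i = 1:8, .combine = c) %dopar% {
  i^2
}
print(result)

stopCluster(cl)
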
Rmpi

Type-Along

Short parallel example using package “Rmpi”

Short parallel example using the package “Rmpi”. We need to load the module R_packages/4.1.1 instead of R/4.1.1, as well as a suitable OpenMPI module (openmpi/4.0.3).

#!/bin/bash
#SBATCH -A naiss2024-22-107
#Asking for 10 min.
#SBATCH -t 00:10:00
#SBATCH -n 8

export OMPI_MCA_mpi_warn_on_fork=0
export OMPI_MCA_btl_openib_allow_ib=1

ml purge > /dev/null 2>&1
ml R_packages/4.1.1
ml openmpi/4.0.3

mpirun -np 1 R CMD BATCH --no-save --no-restore Rmpi.R output.out

Send the script to the batch system:

$ sbatch <batch script>
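
For reference, a minimal Rmpi.R could look like the sketch below (the course material may provide its own version). Note that mpirun starts a single R process, the master, which then spawns the workers:

library(Rmpi)

# Spawn 7 workers; together with the master this matches the 8 tasks requested with -n 8
mpi.spawn.Rslaves(nslaves = 7)

# Each worker reports its rank and the size of the communicator
print(mpi.remote.exec(paste("Worker", mpi.comm.rank(), "of", mpi.comm.size())))

# Shut down the workers and quit MPI cleanly
mpi.close.Rslaves()
mpi.quit()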

Using GPUs in a batch job

Login nodes generally either have no GPUs, or their GPUs cannot be used for computations. To use GPUs you need to either launch an interactive job or submit a batch job.

UPPMAX only

Rackham’s compute nodes do not have GPUs. You need to use Snowy for that.

You need to add these batch directives (where x is the number of GPU cards, 1 or 2):

#SBATCH -M snowy
#SBATCH --gres=gpu:x

HPC2N

Kebnekaise’s GPU nodes are considered a separate resource, and the regular compute nodes do not have GPUs.

You need to add this directive to your batch script to access the GPUs:

#SBATCH --gres=gpu:<card>:x

where <card> is v100 or a100 and x is 1 or 2.

In addition, for the A100 GPUs you need to use

#SBATCH -p amd_gpu
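
For example, to ask for one A100 GPU you would combine the two directives:

#SBATCH --gres=gpu:a100:1
#SBATCH -p amd_gpu

A V100 request would use --gres=gpu:v100:1 and does not need the extra partition line.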

Example batch script

#!/bin/bash
#SBATCH -A naiss2024-22-107
#Asking for runtime: hours, minutes, seconds. At most 1 week
#SBATCH -t HHH:MM:SS
#SBATCH --exclusive
#SBATCH -p node
#SBATCH -N 1
#SBATCH -M snowy
#SBATCH --gpus=1
#SBATCH --gpus-per-node=1
#Writing output and error files
#SBATCH --output=output%J.out
#SBATCH --error=error%J.error

ml purge > /dev/null 2>&1
ml R/4.1.1 R_packages/4.1.1

R --no-save --no-restore -f MY-R-GPU-SCRIPT.R
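
What goes into MY-R-GPU-SCRIPT.R depends on which GPU-enabled R package you use. As a minimal, package-independent sketch, the script below just checks that the job can see the allocated GPU by calling nvidia-smi:

# Check that the allocated GPU is visible from within the job
status <- system("nvidia-smi")
if (status != 0) {
  stop("No GPU visible to this job")
}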

Exercises

Serial batch script for R

Run the serial batch script from further up on the page, but for the add2.R code. Remember the arguments.

Parallel job run

Try running the parallel example with “foreach” from further up on the page.