ML with R

Questions

  • Is R suitable for Machine Learning (ML)?

  • How do you run R ML jobs on an HPC system (UPPMAX, HPC2N, …)?

Objectives

  • Short introduction to ML with R

  • Workflow

  • Show the structure of a suitable batch script

  • Examples to try

R provides many packages designed specifically for machine learning, and it is well known for its statistical capabilities for analyzing and interpreting data.

Together, these make it easier to develop and deploy models without having to write a lot of code yourself.

The R community has contributed many powerful packages, both for machine learning and data science. Some of the popular packages are:

  • dplyr

  • tidyr

  • caret

  • mlr

  • ggplot2

  • randomForest

  • mlbench

  • tidyverse

and many more.

Running your code

Workflow

  1. Determine whether you need any R libraries that are not already installed (load the R module and R_packages/R-bundle-Bioconductor, then check)

  2. Determine if you want to run on CPUs or GPUs - some of the R version modules are not CUDA-aware

  3. Install any missing R libraries in an isolated environment

  4. Download any datasets you need, if applicable

  5. Write a batch script

  6. Submit the batch script
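Step 1 of the workflow can be scripted. The sketch below is one possible way to check which packages are missing, assuming the R modules are already loaded so that Rscript is on the PATH. The function name check_r_packages is made up for this illustration and is not part of R or of the course material.

```shell
#!/bin/bash
# Hypothetical helper: report whether each named R package can be loaded.
# Assumes the R modules are already loaded, so that Rscript is on the PATH.
check_r_packages() {
    local pkg
    for pkg in "$@"; do
        # requireNamespace() returns FALSE (rather than stopping with an error)
        # when a package is not installed; we map that to exit status 1.
        if Rscript -e "quit(status = if (requireNamespace('${pkg}', quietly = TRUE)) 0 else 1)" >/dev/null 2>&1; then
            echo "${pkg}: installed"
        else
            echo "${pkg}: missing"
        fi
    done
}
```

After loading the modules, a call such as `check_r_packages caret datarium` prints one line per package, so you can see what still needs to be installed in step 3.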

Example

Type-Along

We will run a simple example taken from https://machinelearningmastery.com/machine-learning-in-r-step-by-step/

If you cannot access remote datasets, change the R code as described inside the script so that it uses the local dataset, which has already been downloaded.

$ module load R/4.1.1 R_packages/4.1.1
$ Rscript iris_ml.R

R batch scripts for ML

Since most R code for machine learning runs for a fairly long time, you will usually need to run it through a batch script.

ML on CPUs

Type-Along

Short serial batch example for running the R code above, iris_ml.R

Short serial example script for Rackham, loading R/4.1.1 and R_packages/4.1.1:

#!/bin/bash
#SBATCH -A naiss2024-22-107 # Course project id. Change to your own project ID after the course
#SBATCH --time=00:10:00 # Asking for 10 minutes
#SBATCH -n 1 # Asking for 1 core

# Load any modules you need, here R/4.1.1 and R_packages/4.1.1
module load R/4.1.1 R_packages/4.1.1

# Run your R script (here 'iris_ml.R')
R --no-save --quiet < iris_ml.R

Submit the script to the batch system:

$ sbatch <batch script>
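If you want to keep track of the job after submitting it, it helps to capture the job ID. The following is a minimal sketch, assuming Slurm's sbatch is available on the login node. The function name submit_job is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: submit a batch script and print only the numeric job ID.
submit_job() {
    local script="$1" raw
    # --parsable makes sbatch print "jobid" or "jobid;cluster" instead of
    # the usual "Submitted batch job <jobid>" sentence.
    raw="$(sbatch --parsable "$script")" || return 1
    # Drop the optional ";cluster" suffix
    echo "${raw%%;*}"
}
```

For example, `jobid=$(submit_job <batch script>)` followed by `squeue -j "$jobid"` shows the state of the job. Unless the script sets `--output`, Slurm writes the job's output to `slurm-<jobid>.out` in the directory you submitted from.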

ML on GPUs

Type-Along

Short ML example for running on Snowy.

#!/bin/bash
#SBATCH -A naiss2024-22-107
#Asking for 10 min.
#SBATCH -t 00:10:00
#SBATCH --exclusive
#SBATCH -p node
#SBATCH -N 1
#SBATCH -M snowy
#SBATCH --gres=gpu:1
#Writing output and error files
#SBATCH --output=output%J.out
#SBATCH --error=error%J.error

ml purge > /dev/null 2>&1
ml R_packages/4.1.1

R --no-save --no-restore -f Rscript.R

Submit the script to the batch system:

$ sbatch <batch script>
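Before a long GPU run, it can be useful to confirm that the job actually received a GPU. One way to do this, sketched here, is to call nvidia-smi near the top of the batch script. This assumes nvidia-smi is installed on the GPU nodes, and report_gpus is a made-up name. Note that it only shows that a GPU was allocated, not that your R packages make use of it.

```shell
#!/bin/bash
# Hypothetical helper for a GPU batch script: print the GPUs the job can see.
report_gpus() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # One line per GPU with its name and total memory
        nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
    else
        echo "nvidia-smi not found: no GPU visible on this node" >&2
        return 1
    fi
}
```

Calling `report_gpus` just before the `R --no-save --no-restore -f Rscript.R` line writes the result to the job's output file, so an empty or failed allocation is visible right away.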

Exercises

Run validation.R with Rscript

This example is taken from https://www.geeksforgeeks.org/cross-validation-in-r-programming/

On HPC2N, you need to install the datarium package in your renv before you can run this. On UPPMAX, it is already installed in R_packages.

Note: Remember that on HPC2N you need to run from within the renv directory.
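Installing datarium into the renv could look roughly like the sketch below. It assumes the renv project was created with renv::init(), so that R started in that directory activates the project library, and that the R module is already loaded. The function name install_in_renv is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: install an R package into an existing renv project.
# Assumes the project was created with renv::init(), so that R started in
# that directory activates the project library automatically.
install_in_renv() {
    local project_dir="$1" pkg="$2"
    # Run in a subshell so the caller's working directory is left unchanged
    ( cd "$project_dir" && Rscript -e "renv::install('${pkg}')" )
}
```

A call would then be `install_in_renv <path to your renv directory> datarium`, with the path replaced by wherever you created your renv.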

Create a batch script to run validation.R

You can find example batch scripts in the exercises/r directory.