ML with R

Questions

  • Is R suitable for Machine Learning (ML)?

  • How do you run R ML jobs on an HPC system (UPPMAX, HPC2N, LUNARC, NSC, PDC)?

Objectives

  • Short introduction to ML with R

  • Workflow

  • Show the structure of a suitable batch script

  • Examples to try

R provides many packages designed specifically for machine learning, and it is well known for its statistical capabilities for analysing and interpreting data.

Together, these make it easier to develop and deploy models without having to write a lot of code yourself.

The R community has contributed many powerful packages for both machine learning and data science. Some of the popular ones are:

  • dplyr

  • tidyr

  • caret

  • mlr

  • ggplot2

  • randomForest

  • mlbench

  • tidyverse

and others.

Running your code

Workflow

  1. Determine whether you need any R libraries that are not already installed (load the R module plus R_packages or R-bundle-Bioconductor, depending on the cluster, and check)

  2. Determine if you want to run on CPUs or GPUs - some of the R version modules are not CUDA-aware

  3. Install any missing R libraries in an isolated environment

  4. Download any datasets, if needed

  5. Write a batch script

  6. Submit the batch script
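Steps 1 and 3 of the workflow can be sketched in the shell as follows. This is a minimal sketch, not the course's own tooling: the function names `r_pkg_available` and `r_pkg_install_private`, the library directory `$HOME/R-libs/4.1`, and the CRAN mirror URL are assumptions chosen for illustration. `R_LIBS_USER` is the standard R environment variable for a per-user library.

```shell
#!/bin/bash
# Sketch of workflow steps 1 and 3. Both function names are hypothetical helpers.

# Step 1: exit 0 if the R package named in $1 can be loaded, 1 if it cannot.
# Returns 2 if Rscript is not on PATH (no R module loaded yet).
r_pkg_available() {
    local pkg="$1"
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found; load an R module first" >&2
        return 2
    fi
    Rscript -e "quit(status = if (requireNamespace('${pkg}', quietly = TRUE)) 0 else 1)" \
        >/dev/null 2>&1
}

# Step 3: install a missing package into a private library directory.
# The default directory is an assumption; adjust it to your own layout.
# Installing needs network access, so this is typically done on a login node.
r_pkg_install_private() {
    local pkg="$1"
    local libdir="${2:-$HOME/R-libs/4.1}"
    mkdir -p "$libdir"
    export R_LIBS_USER="$libdir"
    Rscript -e "install.packages('${pkg}', lib = Sys.getenv('R_LIBS_USER'), repos = 'https://cloud.r-project.org')"
}
```

After loading the R module for your cluster, a call such as `r_pkg_available caret || r_pkg_install_private caret` would install the package only when it is missing. Remember to export the same `R_LIBS_USER` in the batch script so the job finds the private library.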

Example

Type-Along

We will run a simple example taken from https://machinelearningmastery.com/machine-learning-in-r-step-by-step/

If you cannot access remote datasets, change the R code as described inside the script so that it uses a local dataset, which has already been downloaded.

NOTE: Normally we would run this through a batch script rather than on the command line, but since these are short examples we will run them directly on the command line.

$ module load R_packages/4.1.1
$ Rscript iris_ml.R

R batch scripts for ML

Since most R code for machine learning runs for a fairly long time, you will usually need to run it through a batch script.

Serial jobs

Type-Along

Short serial batch example for running the R code above, iris_ml.R

Short serial example script for Rackham. Loading R/4.1.1 and R_packages/4.1.1

#!/bin/bash
#SBATCH -A uppmax202u-w-xyz # Course project id. Change to your own project ID after the course
#SBATCH --time=00:10:00 # Asking for 10 minutes
#SBATCH -n 1 # Asking for 1 core

# Load any modules you need, here R_packages/4.1.1 (R/4.1.1 is loaded automatically)
module load R_packages/4.1.1

# Run your R script (here 'iris_ml.R')
R --no-save --quiet < iris_ml.R

Submit the script to the batch system:

$ sbatch <batch script>
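Submitting and then checking on the job can be wrapped in a small helper. The sketch below assumes a standard Slurm installation; the function name `submit_r_job` is hypothetical. It uses `sbatch --parsable` to capture the job ID and `squeue -j` to show the job's state.

```shell
#!/bin/bash
# Hypothetical helper: submit a batch script and show where to look for results.
submit_r_job() {
    local script="$1"
    if [ ! -f "$script" ]; then
        echo "No such batch script: $script" >&2
        return 1
    fi
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found; run this on a cluster login node" >&2
        return 2
    fi
    local jobid
    # --parsable makes sbatch print only the job ID (optionally followed by ";cluster")
    jobid=$(sbatch --parsable "$script") || return 3
    jobid="${jobid%%;*}"
    echo "Submitted job $jobid"
    squeue -j "$jobid"   # current state of the job in the queue
    echo "Output goes to slurm-${jobid}.out unless the script sets --output"
}
```

On a login node, `submit_r_job my_job.sh` (with `my_job.sh` standing in for your own batch script) submits the job and prints the name of the default output file.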

Parallel jobs

Type-Along

Short ML example for running on Snowy.

#!/bin/bash
#SBATCH -A uppmax202t-u-xyz
#Asking for 10 min.
#SBATCH -t 00:10:00
#SBATCH --exclusive
#SBATCH -p node
#SBATCH -n 1
#Writing output and error files
#SBATCH --output=output%J.out
#SBATCH --error=error%J.error

ml R_packages/4.1.1

R --no-save --no-restore -f Rscript.R

Submit the script to the batch system:

$ sbatch <batch script>
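The script above starts a single task, so any parallelism has to come from the R code itself, and the R code needs to know how many cores it may use. One possible way, sketched here under assumptions, is to read the Slurm variable `SLURM_CPUS_ON_NODE` in the batch script and pass it on. The helper name `cores_for_r` is hypothetical, and whether `Rscript.R` accepts a core count as an argument is an assumption about that script.

```shell
#!/bin/bash
# Hypothetical helper: number of CPUs Slurm reports for this job on the node,
# falling back to 1 when run outside a Slurm job.
cores_for_r() {
    echo "${SLURM_CPUS_ON_NODE:-1}"
}

# Inside the batch script one could then run, for example:
#   Rscript Rscript.R "$(cores_for_r)"
# and read the value in R with commandArgs(trailingOnly = TRUE).
```

Passing the count explicitly avoids the R code guessing from the hardware, which can overcount when a job has been given only part of a node.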

GPU jobs

Some packages are now able to use GPUs for ML jobs in R. One of them is xgboost. In the following demo you will find instructions to install this package and run a test case with GPUs.
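Independently of that demo, it can help to confirm that a job actually sees a GPU before starting a long run. The following is a minimal sketch assuming NVIDIA GPUs and that `nvidia-smi` is installed on the GPU nodes; the function name `gpu_visible` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if at least one NVIDIA GPU is visible to this job.
gpu_visible() {
    command -v nvidia-smi >/dev/null 2>&1 || return 1
    # "nvidia-smi -L" prints one line per GPU, each starting with "GPU <index>:"
    nvidia-smi -L 2>/dev/null | grep -q '^GPU '
}

if gpu_visible; then
    echo "GPU detected"
else
    echo "No GPU visible; the R code would run on CPU only"
fi
```

Placing this check near the top of a GPU batch script makes a missing GPU allocation visible in the job output right away.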

Exercises

Run validation.R with Rscript

This example is taken from https://www.geeksforgeeks.org/cross-validation-in-r-programming/

Create a batch script to run validation.R

You can find example batch scripts in the exercises/r directory.
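As a starting point for the exercise, a batch script can also be generated from a small template. This is a sketch only: the function name `make_r_batch` and the `<project-id>` placeholder are hypothetical, and the module name `R_packages/4.1.1` is taken from the Rackham examples above, so it may differ on other clusters.

```shell
#!/bin/bash
# Hypothetical helper: print a minimal serial batch script for an R script.
# $1 = R script, $2 = project ID (placeholder by default), $3 = wall time.
make_r_batch() {
    local rscript="$1"
    local project="${2:-<project-id>}"
    local walltime="${3:-00:10:00}"
    cat <<EOF
#!/bin/bash
#SBATCH -A ${project}
#SBATCH --time=${walltime}
#SBATCH -n 1

module load R_packages/4.1.1

Rscript ${rscript}
EOF
}

# Write a batch script for the exercise; replace the placeholder project ID
# in validation.sh before submitting.
make_r_batch validation.R > validation.sh
```

Compare the generated `validation.sh` with the example scripts in the exercises/r directory and adjust the project ID, wall time, and modules for your cluster.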