More about ML

This section contains one more machine learning (ML) example.

Horovod

Training is one of the most computationally demanding steps in an ML workflow, so it is worth optimizing. Horovod is a framework dedicated to making the training step more efficient by distributing the workload across several nodes, each consisting of some CPUs and GPUs. An example of the usage of Horovod can be found in the course Upscaling AI workflows offered by ENCCS.
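
At its core, Horovod needs only a few additions to a serial training script: initialize Horovod, pin each process to one GPU, wrap the optimizer, and synchronize the initial weights. The sketch below shows this pattern for a Keras model; the model, learning rate, and other settings are placeholders for illustration, not the contents of the course's actual example program:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one Horovod process per GPU / MPI rank

# Pin this process to its own local GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder model

# Scale the learning rate by the number of workers and wrap the
# optimizer so gradients are averaged across all ranks
opt = hvd.DistributedOptimizer(
    tf.keras.optimizers.Adam(learning_rate=0.001 * hvd.size()))
model.compile(loss='mse', optimizer=opt)

# Broadcast the initial weights from rank 0 so all workers start in sync;
# pass this list to model.fit(..., callbacks=callbacks) when training
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]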

The following steps need to be performed before running this example: load the required modules, create a virtual environment, and install Horovod and its dependencies into it.
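
A possible outline of these steps is sketched below. The module list is taken from the batch script further down, and the virtual-environment path uses the same placeholder as the script; the Horovod build flag and the TensorFlow extra are assumptions, so adjust them to whichever framework the example uses:

module purge
module load uppmax
module load python_ML_packages python/3.9.5
module load gcc/10.3.0 build-tools cmake/3.22.2

# Create and activate a virtual environment for Horovod
python -m venv /proj/hpc-python/<mydir-name>/env-horovod
source /proj/hpc-python/<mydir-name>/env-horovod/bin/activate

# Build and install Horovod with TensorFlow support (assumed framework)
HOROVOD_WITH_TENSORFLOW=1 pip install horovod[tensorflow]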

A sample batch script for running this Horovod example is given below:

#!/bin/bash -l
#SBATCH -A naiss2024-22-415
#SBATCH -t 00:05:00
#SBATCH -M snowy
#SBATCH -n 1               # number of tasks (MPI ranks)
#SBATCH -o output_%j.out   # output file
#SBATCH -e error_%j.err    # error messages
#SBATCH --gres=gpu:1       # request 1 GPU

# Set a path where the example programs are installed.
# Change the below to your own path to where you placed the example programs
MYPATH=/proj/hpc-python/<mydir-name>/HPC-python/Exercises/examples/programs/

# Start from a clean module environment
module purge
module load uppmax
module load python_ML_packages python/3.9.5
module load gcc/10.3.0 build-tools cmake/3.22.2

# Change the below to your own path to the virtual environment you installed horovod to
source /proj/hpc-python/<mydir-name>/env-horovod/bin/activate

srun python $MYPATH/Transfer_Learning_NLP_Horovod.py --epochs 10 --batch-size 64
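
Assuming the script above is saved as run_horovod.sh (the file name is arbitrary), submit it with:

sbatch run_horovod.sh

Slurm writes the program output to output_<jobid>.out and any error messages to error_<jobid>.err, as set by the -o and -e options in the script.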

Running the Horovod example

First do the initial steps: load the required modules for Horovod, create a virtual environment, and install Horovod and its dependencies into it (see the setup outline above).

Run the Horovod example on 1 node with 4 GPU engines; thus, 4 MPI ranks will be needed. Then run the script on 2 nodes. Compare the wall times reported at the end of the output files. One possible set of batch-script changes is sketched below.
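
For the single-node, 4-rank run, the resource requests in the sample batch script could be changed as follows; the exact GPU request syntax depends on the cluster's node layout, so treat these flags as assumptions and check the local documentation:

#SBATCH -N 1               # 1 node
#SBATCH -n 4               # 4 tasks -> 4 MPI ranks
#SBATCH --gres=gpu:4       # GPUs are requested per node

For the two-node run, change -N to 2 and scale -n to the total number of GPUs across both nodes. srun then starts one MPI rank per task, and Horovod binds each rank to its local GPU.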