Using Python for Machine Learning jobs 2

While Python is not a fast language, it is still well suited for machine learning. It is easy to code in, which is particularly useful in machine learning, where the right solution is rarely known from the start: a lot of testing and experimentation is needed, and a program usually goes through many iterations. In addition, many useful machine learning libraries are written for Python, making it a good choice for this area.

Some of the most used libraries in Python for machine learning are:

  • PyTorch

  • scikit-learn

  • TensorFlow

These are all available at UPPMAX and HPC2N.

In this course we will look at two of these, PyTorch and TensorFlow, and show how to run them at our centres.

There are some examples for beginners at https://machinelearningmastery.com/start-here/#python and at https://pytorch.org/tutorials/beginner/pytorch_with_examples.html

PyTorch

PyTorch has:

  • An n-dimensional tensor, similar to a NumPy array, but able to run on GPUs

  • Automatic differentiation for building and training neural networks

The example we will use in this course is taken from the official PyTorch page, https://pytorch.org/, and the problem is fitting \(y=\sin(x)\) with a third-order polynomial. We will run the example as a batch job.

You can find the full list of examples for this problem here: https://pytorch.org/tutorials/beginner/pytorch_with_examples.html
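Before running the GPU version, it can help to see what the problem looks like in plain Python. The sketch below fits \(y=\sin(x)\) with a cubic polynomial by gradient descent on the summed squared error, with the gradients written out by hand; it mirrors the "warm-up" example on the PyTorch tutorial page, but uses only the standard library. All names and hyperparameters here are illustrative, not taken from the tutorial verbatim.

```python
import math


def fit_sin_with_cubic(n_points=200, steps=2000, lr=1e-5):
    """Fit y = sin(x) on [-pi, pi] with y = a + b*x + c*x^2 + d*x^3
    using plain gradient descent on the summed squared error."""
    # Training data: evenly spaced sample points of y = sin(x)
    xs = [-math.pi + 2 * math.pi * i / (n_points - 1) for i in range(n_points)]
    ys = [math.sin(x) for x in xs]

    a = b = c = d = 0.0
    losses = []
    for _ in range(steps):
        grad_a = grad_b = grad_c = grad_d = 0.0
        loss = 0.0
        for x, y in zip(xs, ys):
            # Forward pass: evaluate the polynomial
            y_pred = a + b * x + c * x ** 2 + d * x ** 3
            diff = y_pred - y
            loss += diff * diff
            # Backward pass: gradients of (y_pred - y)^2 w.r.t. a, b, c, d
            grad_a += 2 * diff
            grad_b += 2 * diff * x
            grad_c += 2 * diff * x ** 2
            grad_d += 2 * diff * x ** 3
        # Gradient-descent update
        a -= lr * grad_a
        b -= lr * grad_b
        c -= lr * grad_c
        d -= lr * grad_d
        losses.append(loss)
    return (a, b, c, d), losses


if __name__ == "__main__":
    (a, b, c, d), losses = fit_sin_with_cubic()
    print(f"y = {a:.3f} + {b:.3f} x + {c:.3f} x^2 + {d:.3f} x^3")
```

The PyTorch version of the same problem replaces the hand-written gradients with tensors and automatic differentiation, which is the point of the tutorial.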

To run this at HPC2N (and at UPPMAX) you should use a batch job.

This is an example of a batch script for running the above example on GPUs, using PyTorch 1.10.0 and Python 3.9.5.
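A minimal sketch of such a batch script is shown below. The module names, GPU type, and the script name pytorch_sin_fit.py are assumptions; check what is actually available at your centre with `ml spider PyTorch` and adjust accordingly.

```shell
#!/bin/bash
#SBATCH -A project_ID          # replace with your own project ID
#SBATCH -t 00:10:00            # wall time
#SBATCH -n 1                   # one task is enough for this example
#SBATCH --gres=gpu:k80:1       # request one GPU (type/count are assumptions)

# Clean the module environment, then load a PyTorch module.
# Module names below are assumptions -- verify with: ml spider PyTorch
ml purge > /dev/null 2>&1
ml GCC/10.3.0 OpenMPI/4.1.1 PyTorch/1.10.0-CUDA-11.3.1

# Run the (hypothetical) example script
srun python pytorch_sin_fit.py
```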

TensorFlow

The example comes from https://machinelearningmastery.com/tensorflow-tutorial-deep-learning-with-tf-keras/ but there are also good examples at https://www.tensorflow.org/tutorials

We are using TensorFlow 2.6.0 and Python 3.9.5. Since there is no scikit-learn module for these versions, we have to install it ourselves:
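One way to do the installation is to create a virtual environment in your project storage and install scikit-learn into it. The module names below are assumptions (check with `ml spider TensorFlow`), and `<your-project-storage>` is a placeholder for your own project directory:

```shell
# Load the same toolchain as the TensorFlow module (names are assumptions;
# verify with: ml spider TensorFlow)
ml purge > /dev/null 2>&1
ml GCC/10.3.0 OpenMPI/4.1.1 TensorFlow/2.6.0

# Create and activate a virtual environment in your project storage
python -m venv /proj/nobackup/<your-project-storage>/sklearn-env
source /proj/nobackup/<your-project-storage>/sklearn-env/bin/activate

# Install scikit-learn into the environment
pip install --no-cache-dir scikit-learn
```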

We can now use scikit-learn in our example.

In order to run the above example, we will create a batch script and submit it.
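A sketch of such a batch script is shown below. Module names, the GPU request, the virtual-environment path, and the script name tensorflow_example.py are assumptions to be adapted to your centre and project:

```shell
#!/bin/bash
#SBATCH -A project_ID          # replace with your own project ID
#SBATCH -t 00:10:00
#SBATCH -n 1
#SBATCH --gres=gpu:k80:1       # GPU type/count are assumptions

# Module names are assumptions -- verify with: ml spider TensorFlow
ml purge > /dev/null 2>&1
ml GCC/10.3.0 OpenMPI/4.1.1 TensorFlow/2.6.0

# Activate the environment where scikit-learn was installed
source /proj/nobackup/<your-project-storage>/sklearn-env/bin/activate

srun python tensorflow_example.py
```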

Submit with sbatch <myjobscript.sh>. After submitting, you will (as usual) be given the job ID for your job. You can check on the progress of your job with squeue -u <username> or scontrol show job <job-id>.

The output and errors will in this case be written to slurm-<job-id>.out.

General

You almost always want to run several iterations of your machine learning code with changed parameters and/or added layers. If you are doing this in a batch job, it is easiest either to make a batch script that runs several variations of your Python script (changed parameters, changed layers) within one job, or to make a script that loops over the variations and submits a job for each.
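The loop-and-submit approach could be sketched as follows. This is hypothetical: it assumes your batch script myjobscript.sh reads the parameter (here a learning rate) as its first command-line argument and passes it on to the Python script:

```shell
#!/bin/bash
# Submit one batch job per parameter value (hypothetical sketch).
# Assumes myjobscript.sh forwards its first argument to the Python script.
for lr in 0.1 0.01 0.001; do
    sbatch myjobscript.sh "$lr"
done
```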

Running several jobs from within one job

This example shows how you would run several programs or variations of programs sequentially within the same job:
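A sketch of such a job script is shown below. The module names, the script name my_model.py, and its --batch-size flag are illustrative; each run starts only after the previous one has finished, all within the same allocation:

```shell
#!/bin/bash
#SBATCH -A project_ID          # replace with your own project ID
#SBATCH -t 00:30:00            # wall time must cover all the runs together
#SBATCH -n 1
#SBATCH --gres=gpu:k80:1       # GPU type/count are assumptions

# Module names are assumptions -- verify with: ml spider TensorFlow
ml purge > /dev/null 2>&1
ml GCC/10.3.0 OpenMPI/4.1.1 TensorFlow/2.6.0

# Run several variations sequentially, writing each run's output
# to its own log file
python my_model.py --batch-size 32  > out_32.log
python my_model.py --batch-size 64  > out_64.log
python my_model.py --batch-size 128 > out_128.log
```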

Horovod

Since training is one of the most computationally demanding steps in an ML workflow, it is worth optimizing. Horovod is a framework dedicated to making the training step more efficient by distributing the workload across several nodes, each consisting of some CPUs and GPUs. An example of the usage of Horovod can be found in the course Upscaling AI workflows offered by ENCCS.

The following steps need to be performed before running this example:
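These steps could look like the sketch below. The module versions match the batch script in this section; the environment path uses the same `<your-project-storage>` placeholder, and the requirements file for the example's Python dependencies is hypothetical:

```shell
# Load the modules used by the Horovod example (same versions as in
# the batch script in this section)
ml purge > /dev/null 2>&1
ml GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5
ml TensorFlow/2.4.1
ml Horovod/0.21.1-TensorFlow-2.4.1

# Create a virtual environment for extra dependencies and activate it
python -m venv /proj/nobackup/<your-project-storage>/env-horovod
source /proj/nobackup/<your-project-storage>/env-horovod/bin/activate

# Install the example's Python dependencies (hypothetical requirements file)
pip install --no-cache-dir -r requirements.txt
```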

A sample batch script for running this Horovod example is here:

#!/bin/bash
#SBATCH -A project_ID
#SBATCH -t 00:05:00
#SBATCH -N X               # nr. nodes
#SBATCH -n Y               # nr. MPI ranks
#SBATCH -o output_%j.out   # output file
#SBATCH -e error_%j.err    # error messages
#SBATCH --gres=gpu:k80:2
#SBATCH --exclusive

ml purge > /dev/null 2>&1
ml GCC/10.2.0 CUDA/11.1.1 OpenMPI/4.0.5
ml TensorFlow/2.4.1
ml Horovod/0.21.1-TensorFlow-2.4.1

source /proj/nobackup/<your-project-storage>/env-horovod/bin/activate

list_of_nodes=$( scontrol show hostname $SLURM_JOB_NODELIST | sed -z 's/\n/\:4,/g' )
list_of_nodes=${list_of_nodes%?}
mpirun -np $SLURM_NTASKS -H $list_of_nodes python Transfer_Learning_NLP_Horovod.py --epochs 10 --batch-size 64

Running the Horovod example

Do the initial steps for loading the required modules for Horovod, create an environment and install the dependencies for Horovod.

Run the Horovod example on 1 node with 4 GPU engines; thus, 4 MPI ranks will be needed. Then run the script on 2 nodes. Compare the wall times reported at the end of the output files.