Machine Learning and Deep Learning

Questions

  • Which machine learning and deep learning tools are installed at HPCs?

  • How to start the tools at HPCs?

  • How to use GPUs with ML/DL at HPCs?

Objectives

  • Get a general overview of ML/DL with Python.

  • Get a general overview of installed ML/DL tools at HPCs.

  • Get started with ML/DL in Python.

  • Code along and demos.

  • We will not learn about:
    • How to write and optimize ML/DL code.

    • How to use multi-node setup for training models on CPU and GPU.

Introduction

Python is well suited for machine learning and deep learning. For instance, it is fairly easy to code in, and this is particularly useful in ML/DL where the right solution is rarely known from the start. A lot of testing and experimentation is needed, and the program usually goes through many iterations. In addition, there are many useful libraries written for ML and DL in Python, making it a good choice for this area.

Some of the most used libraries in Python for ML/DL are:

  • scikit-learn (sklearn)

  • PyTorch

  • TensorFlow

Comparison of ML/DL Libraries

Feature               | scikit-learn                          | PyTorch                                    | TensorFlow
Primary Use           | Traditional machine learning          | Deep learning and neural networks          | Deep learning and neural networks
Ease of Use           | High, simple API                      | Moderate, more control over computations   | Moderate, high-level Keras API available
Performance           | Good for small to medium datasets     | Excellent with GPU support                 | Excellent with GPU support
Flexibility           | Limited to traditional ML algorithms  | High, supports dynamic computation graphs  | High, supports both static and dynamic computation graphs
Community and Support | Large, extensive documentation        | Large, growing rapidly                     | Large, extensive documentation and community support

In this course we will look at examples for these, and show how you run them at our centres.

The module loading is slightly different at the clusters (a short version check you can run after loading is shown below the list):
  • UPPMAX: All tools are available from the module python_ML_packages/3.11.8

  • HPC2N:
    • For TensorFlow: ml GCC/12.3.0 OpenMPI/4.1.5 TensorFlow/2.15.1-CUDA-12.1.1 scikit-learn/1.4.2 Tkinter/3.11.3 matplotlib/3.7.2

    • For PyTorch: ml GCC/12.3.0 OpenMPI/4.1.5 PyTorch/2.1.2-CUDA-12.1.1 scikit-learn/1.4.2 Tkinter/3.11.3 matplotlib/3.7.2

  • LUNARC:
    • For TensorFlow: module load GCC/11.3.0 Python/3.10.4 SciPy-bundle/2022.05 TensorFlow/2.11.0-CUDA-11.7.0 scikit-learn/1.1.2

    • For PyTorch: module load GCC/11.3.0 Python/3.10.4 SciPy-bundle/2022.05 PyTorch/1.12.1-CUDA-11.7.0 scikit-learn/1.1.2

  • NSC: On Tetralith, use a virtual environment. PyTorch and TensorFlow modules may be coming to the cluster soon!

  • PDC: For both TensorFlow and PyTorch: module load PDC singularity/4.1.1-cpeGNU-23.12
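
After loading the modules (or activating a virtual environment), you can verify which versions you actually got with a short Python check. A minimal sketch, assuming the modules you loaded provide scikit-learn and PyTorch (adapt the imports to the libraries you actually loaded):

import sklearn
import torch  # replace with tensorflow if that is what you loaded

# Print the versions picked up by this session and whether a GPU is visible
print("scikit-learn:", sklearn.__version__)
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())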

List of installed ML/DL tools

There are minor differences depending on the version of Python.

The list is not exhaustive, but covers the more popular ML/DL libraries. We encourage you to run module spider on them to see the exact versions before loading them.

Tool | UPPMAX (Python 3.11.8) | HPC2N (Python 3.11.3/3.11.5) | LUNARC (Python 3.11.3/3.11.5) | NSC (Python 3.11.3/3.11.5) | PDC (Python 3.11.7)
NumPy | python | SciPy-bundle | SciPy-bundle | N.A. | cray-python
SciPy | python | SciPy-bundle | SciPy-bundle | N.A. | cray-python
Scikit-Learn (sklearn) | python_ML_packages (Python 3.9.5-gpu and Python 3.11.8-cpu) | scikit-learn (no newer than for GCC/12.3.0 and Python 3.11.3) | scikit-learn | N.A. | N.A.
Theano | N.A. | Theano (only for some older Python versions) | N.A. | N.A. | N.A.
TensorFlow | python_ML_packages (Python 3.9.5-gpu and Python 3.11.8-cpu) | TensorFlow (newest version is for Python 3.11.3) | TensorFlow (up to Python 3.10.4) | N.A. | PDC singularity/4.1.1-cpeGNU-23.12 (v2.13)
Keras | python_ML_packages (Python 3.9.5-gpu and Python 3.11.8-cpu) | Keras (up to Python 3.8.6), TensorFlow (Python 3.11.3) | TensorFlow (up to Python 3.10.4) | N.A. | PDC singularity/4.1.1-cpeGNU-23.12 (v2.13)
PyTorch (torch) | python_ML_packages (Python 3.11.8-cpu) | PyTorch (up to Python 3.11.3) | PyTorch (up to Python 3.10.4) | N.A. | PDC singularity/4.1.1-cpeGNU-23.12 (v2.4)
Pandas | python | SciPy-bundle | SciPy-bundle | N.A. | cray-python
Matplotlib | python | matplotlib | matplotlib | N.A. | PDC/23.12 matplotlib/3.8.2-cpeGNU-23.12
Beautiful Soup (beautifulsoup4) | python_ML_packages (Python 3.9.5-gpu and Python 3.11.8-cpu) | BeautifulSoup | BeautifulSoup | N.A. | N.A.
Seaborn | python_ML_packages (Python 3.9.5-gpu and Python 3.11.8-cpu) | Seaborn | Seaborn | N.A. | N.A.
Horovod | N.A. | Horovod (up to Python 3.11.3) | N.A. | N.A. | N.A.

Scikit-Learn

Scikit-learn (sklearn) is a powerful and easy-to-use open-source machine learning library for Python. It provides simple and efficient tools for data mining and data analysis, and it is built on NumPy, SciPy, and matplotlib. Scikit-learn is designed to interoperate with the Python numerical and scientific libraries.

More often than not, scikit-learn is used along with other popular libraries like TensorFlow and PyTorch to perform exploratory data analysis, data preprocessing, model selection, and evaluation. For our examples, we will use a Jupyter notebook on a CPU node to visualize the data and the results.

Scikit-learn provides a comprehensive suite of tools for building and evaluating machine learning models, making it an essential library for data scientists and machine learning practitioners.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate some data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 3, 2, 3, 5])

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Plot the results
plt.scatter(X, y, color='black')
plt.plot(X, y_pred, color='blue', linewidth=3)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Example')
plt.show()
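
Beyond fitting a single model, scikit-learn also covers the preprocessing, model selection, and evaluation steps mentioned above. A minimal sketch of that workflow, using a small toy dataset shipped with scikit-learn rather than the Titanic data from the exercises:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a built-in toy dataset and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing + model combined in one pipeline
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Evaluation: held-out accuracy and 5-fold cross-validation
print("Test accuracy:", clf.score(X_test, y_test))
print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())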

Exercise

Try running titanic_sklearn.ipynb, which can be found in the Exercises/day4/MLDL directory, on an interactive CPU node. Also note that the datasets are kept in the Exercises/day4/MLDL/datasets directory. Give the full path to these datasets for this and the subsequent exercises.

Run it in a Jupyter notebook on an interactive CPU node. An interactive GPU node will also do.

Load the correct modules that contain the scikit-learn, numpy, seaborn, pandas, matplotlib and jupyter libraries before starting the Jupyter notebook. Users on NSC and PDC can build their own venvs. Use %matplotlib inline in Jupyter to see the plots inline.

  • Learning outcomes:
    • How to launch a Jupyter notebook on an interactive node.

    • How to load correct modules already available on the system, in order to run scikit-learn.

PyTorch and TensorFlow

The following snippets demonstrate some common tasks in PyTorch and TensorFlow, highlighting their similarities and differences (they are illustrative examples, not complete programs):

PyTorch

import torch
import torch.nn as nn
import torch.optim as optim

# Tensor creation with gradients enabled
x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32, requires_grad=True)

# Automatic differentiation
y = x.sum()
y.backward()
print("Gradient of x:", x.grad)

# Creating and using a neural network layer
layer = nn.Linear(2, 2)
input_tensor = torch.tensor([[1.0, 2.0]], dtype=torch.float32)
output = layer(input_tensor)
print("Layer output:", output)

# Optimizer usage
optimizer = optim.SGD(layer.parameters(), lr=0.01)
loss = output.sum()
optimizer.zero_grad()  # Clear gradients
loss.backward()        # Compute gradients
optimizer.step()       # Update weights
print("Updated weights:", layer.weight)

TensorFlow

import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Tensor creation with gradients enabled
x = tf.Variable([[1.0, 2.0], [3.0, 4.0]])

# Automatic differentiation
with tf.GradientTape() as tape:
    y = tf.reduce_sum(x)
grads = tape.gradient(y, x)
print("Gradient of x:", grads)

# Creating and using a neural network layer
layer = Dense(2, input_shape=(2,))
input_tensor = tf.constant([[1.0, 2.0]], dtype=tf.float32)
output = layer(input_tensor)
print("Layer output:", output)

# Optimizer usage
optimizer = SGD(learning_rate=0.01)
with tf.GradientTape() as tape:
    output = layer(input_tensor)  # forward pass must be recorded by the tape
    loss = tf.reduce_sum(output)
gradients = tape.gradient(loss, layer.trainable_variables)
optimizer.apply_gradients(zip(gradients, layer.trainable_variables))
print("Updated weights:", layer.weights)

Next, we will learn by submitting a batch job, which consists of loading a Python module, activating a Python environment, and running DNN code for image classification.
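
As orientation, here is a minimal sketch of the kind of PyTorch training script such a batch job might run. This is not the exercise code shipped with the course; the dataset root and the download flag are assumptions and must be adapted to your cluster (compute nodes often have no internet access):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Fashion-MNIST; point root to a pre-downloaded copy on clusters without internet access
train_data = datasets.FashionMNIST(root="data", train=True, download=True,
                                   transform=transforms.ToTensor())
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

# A small fully connected network for 28x28 grayscale images, 10 classes
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 10),
).to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(2):  # kept short for a demo run
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        loss = loss_fn(model(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.4f}")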

Tips and Tricks (Lessons Learned):

  • Understand your data:
    • Tensor datatypes affect performance: BF16, FP16, FP32.

    • Choose appropriate dtypes in pandas to reduce memory usage.

  • Version management:
    • Freeze all your dependencies using requirements.txt or environment.yml.

    • Document versions of all libraries in your code repository.

    • Keep your environments outside of your HOME directory if possible, unless IOPS is a problem.

  • Start small:
    • Begin with smaller batch sizes and sequence lengths.

    • Helps identify issues before scaling up.

    • Reduces debugging time when errors occur.

    • Shorter training cycles allow faster iterations.

    • Easier to monitor memory usage and prevent OOM errors.

  • Optimize I/O operations:
    • Be aware of I/O bottlenecks: many small files can hit IOPS limits.

    • Large but few files may cause slower data loading.

    • Consider using data formats designed for ML (like HDF5).

  • Storage management:
    • Monitor directory quotas carefully (both size and IOPS limits)

    • Consider using compressed formats for datasets

  • GPU memory management:
    • Monitor CPU and GPU memory usage with tools like htop, nvidia-smi, https://pytorch.org/memory_viz, NVIDIA Nsight, and the TensorBoard profiler.

    • Start with smaller batches to avoid Out-Of-Memory (OOM) errors

    • Use gradient accumulation for training with limited memory

    • Consider mixed precision training to reduce the memory footprint: autocast() in PyTorch and tf.keras.mixed_precision in TensorFlow (see the sketch after this list).

  • Job monitoring:
    • Log all experiments thoroughly - jobs may be terminated by administrators

    • Use checkpointing to resume interrupted training

    • Include timestamps and run parameters in log files

    • Monitor resource usage for optimizing future jobs

  • Performance optimization:
    • Use GPU profiling tools to identify bottlenecks

    • Accelerate PyTorch models with: model = torch.compile(model)

    • Optimize data loading operations to match GPU computation speed

    • Benchmark to find optimal batch sizes for your hardware
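
As mentioned in the memory-management tips above, gradient accumulation and mixed precision can be combined. Below is a minimal PyTorch sketch, assuming that model, optimizer, loss_fn and train_loader already exist (for example as in the image-classification sketch earlier) and that a CUDA GPU is available:

import torch

accum_steps = 4                       # effective batch size = 4 x DataLoader batch size
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 underflow

optimizer.zero_grad()
for step, (X, y) in enumerate(train_loader):
    X, y = X.cuda(), y.cuda()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(X), y) / accum_steps  # average over accumulated steps
    scaler.scale(loss).backward()                  # gradients accumulate across iterations
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                     # unscale gradients and update weights
        scaler.update()
        optimizer.zero_grad()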

Exercise

Try running either the PyTorch or the TensorFlow code for the Fashion-MNIST dataset by submitting a batch job. The dataset is stored in the datasets/pytorch or datasets/tf directory. In order to run this on any HPC resource you should either submit a batch job or run interactively on compute nodes. Remember, you should not run long or resource-heavy jobs on the login nodes, and the login nodes also do not have GPUs if you want to use those.

A PyTorch environment can be created with pip install torch torchvision jupyter, and a TensorFlow environment with pip install tensorflow[and-cuda] jupyter scikit-learn pandas.

  • Learning outcomes:
    • How to submit a batch job on an HPC GPU resource inside a virtual environment.

    • How to load the correct modules and activate the correct environment for running PyTorch or TensorFlow code.

Miscellaneous examples

Exercises

Exercise

Try running a PyTorch code example that fits a third-degree polynomial to a sine function (a minimal reference sketch is shown below). Use the PyTorch provided by the module system instead of a virtual environment (except on Tetralith (NSC), where no PyTorch module is available). Submit the job using either a batch script, or run the code interactively on a GPU node (if you are already on one).

Visit the List of installed ML/DL tools above and make sure to load the correct prerequisite modules, such as the correct Python version and GCC, if needed.
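
For reference, a minimal version of such a fit could look like the sketch below (this is only an illustration, not the course's exercise file):

import math
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.linspace(-math.pi, math.pi, 2000, device=device)
y = torch.sin(x)

# Coefficients of y = a + b x + c x^2 + d x^3, with gradients enabled
a, b, c, d = (torch.randn((), device=device, requires_grad=True) for _ in range(4))

learning_rate = 1e-6
for t in range(2000):
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    loss = (y_pred - y).pow(2).sum()
    loss.backward()
    with torch.no_grad():  # manual gradient-descent update
        for p in (a, b, c, d):
            p -= learning_rate * p.grad
            p.grad = None
print(f"y = {a.item():.3f} + {b.item():.3f} x + {c.item():.3f} x^2 + {d.item():.3f} x^3")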

  • Learning outcomes:
    • How to load PyTorch/TensorFlow from the module system instead of using a virtual environment.

    • Run the job on a GPU node either interactively or via batch script.

Keypoints

  • At all clusters except Tetralith (NSC), you will find PyTorch, TensorFlow, and scikit-learn under different modules.

  • When in doubt, search for your module and its correct version using module spider. If you want to know the correct versions for each cluster, check the summary page.

  • If you plan to use multiple libraries with complex dependencies, it is recommended to use a virtual environment and pip install your libraries.

  • Always run heavy ML/DL jobs on compute nodes and not on login nodes. For development purposes, you can use an interactive session on a compute node.