Machine Learning and Deep Learning

Questions

  • Which machine learning and deep learning tools are installed at HPC2N, UPPMAX, and LUNARC?

  • How do you start the tools at HPC2N, UPPMAX, and LUNARC?

  • How do you deploy GPUs with ML/DL at HPC2N, UPPMAX, and LUNARC?

Objectives

  • Get a general overview of ML/DL with Python.

  • Get a general overview of installed ML/DL tools at HPC2N, UPPMAX, and LUNARC.

  • Get started with ML/DL in Python.

  • Code-alongs and demos (Kebnekaise, Rackham/Snowy, Cosmos, and Tetralith).

  • We will not learn about:
    • How to write and optimize ML/DL code.

    • How to use multi-node setups for training models on CPUs and GPUs.

Introduction

Python is well suited for machine learning and deep learning. For instance, it is fairly easy to code in, which is particularly useful in ML/DL where the right solution is rarely known from the start. A lot of testing and experimentation is needed, and the program usually goes through many iterations. In addition, there are many useful libraries written for ML and DL in Python, making it a good choice for this area.

Some of the most used libraries in Python for ML/DL are:

  • scikit-learn (sklearn)

  • PyTorch

  • TensorFlow

Comparison of ML/DL Libraries

Feature               | scikit-learn                          | PyTorch                                    | TensorFlow
----------------------|---------------------------------------|--------------------------------------------|-----------------------------------------------------------
Primary Use           | Traditional machine learning          | Deep learning and neural networks          | Deep learning and neural networks
Ease of Use           | High, simple API                      | Moderate, more control over computations   | Moderate, high-level Keras API available
Performance           | Good for small to medium datasets     | Excellent with GPU support                 | Excellent with GPU support
Flexibility           | Limited to traditional ML algorithms  | High, supports dynamic computation graphs  | High, supports both static and dynamic computation graphs
Community and Support | Large, extensive documentation        | Large, growing rapidly                     | Large, extensive documentation and community support

These are all available at UPPMAX, HPC2N, and LUNARC.

In this course we will look at examples of these and show how to run them at our centres.

Module loading differs slightly between the clusters (a verification sketch follows this list):
  • UPPMAX: All tools are available from the module python_ML_packages/3.11.8

  • HPC2N:
    • For TensorFlow: ml GCC/12.3.0 OpenMPI/4.1.5 TensorFlow/2.15.1-CUDA-12.1.1 scikit-learn/1.4.2 Tkinter/3.11.3 matplotlib/3.7.2

    • For PyTorch: ml GCC/12.3.0 OpenMPI/4.1.5 PyTorch/2.1.2-CUDA-12.1.1 scikit-learn/1.4.2 Tkinter/3.11.3 matplotlib/3.7.2

  • LUNARC:
    • For TensorFlow: module load GCC/11.3.0 Python/3.10.4 SciPy-bundle/2022.05 TensorFlow/2.11.0-CUDA-11.7.0 scikit-learn/1.1.2

    • For PyTorch: module load GCC/11.3.0 Python/3.10.4 SciPy-bundle/2022.05 PyTorch/1.12.1-CUDA-11.7.0 scikit-learn/1.1.2

  • NSC: On Tetralith, use a virtual environment. PyTorch and TensorFlow modules may be coming to the cluster soon!
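
As a minimal sketch (assuming Kebnekaise at HPC2N; adapt the module list to your cluster using the commands above), loading a stack and verifying that the libraries work could look like this:

# On Kebnekaise (HPC2N): load the PyTorch stack listed above
ml GCC/12.3.0 OpenMPI/4.1.5 PyTorch/2.1.2-CUDA-12.1.1 scikit-learn/1.4.2

# Check that the libraries import and whether a GPU is visible
python -c "import torch, sklearn; print(torch.__version__, torch.cuda.is_available())"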

List of installed ML/DL tools

There are minor differences depending on the version of Python.

The list is not exhaustive, but covers the more popular ML/DL libraries. We encourage you to module spider them to see the exact versions before loading them.
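
For example, to list the available TensorFlow versions and then see which prerequisite modules one specific version needs (the version shown here is the HPC2N one from the list above):

# List all versions of TensorFlow available on the cluster
module spider TensorFlow

# Show how to load one specific version, including its prerequisites
module spider TensorFlow/2.15.1-CUDA-12.1.1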

Tool                            | UPPMAX (Python 3.11.8)                                       | HPC2N (Python 3.11.3/3.11.5)                                   | LUNARC (Python 3.11.3/3.11.5)    | NSC (Python 3.11.3/3.11.5)
--------------------------------|--------------------------------------------------------------|----------------------------------------------------------------|----------------------------------|---------------------------
NumPy                           | python                                                       | SciPy-bundle                                                   | SciPy-bundle                     | N.A.
SciPy                           | python                                                       | SciPy-bundle                                                   | SciPy-bundle                     | N.A.
Scikit-Learn (sklearn)          | python_ML_packages (Python 3.11.8-gpu and Python 3.11.8-cpu) | scikit-learn (no newer than for GCC/12.3.0 and Python 3.11.3)  | scikit-learn                     | N.A.
Theano                          | N.A.                                                         | Theano (only for some older Python versions)                   | N.A.                             | N.A.
TensorFlow                      | python_ML_packages (Python 3.11.8-gpu and Python 3.11.8-cpu) | TensorFlow (newest version is for Python 3.11.3)               | TensorFlow (up to Python 3.10.4) | N.A.
Keras                           | python_ML_packages (Python 3.11.8-gpu and Python 3.11.8-cpu) | Keras (up to Python 3.8.6), TensorFlow (Python 3.11.3)         | TensorFlow (up to Python 3.10.4) | N.A.
PyTorch (torch)                 | python_ML_packages (Python 3.11.5-gpu and Python 3.11.8-cpu) | PyTorch (up to Python 3.11.3)                                  | PyTorch (up to Python 3.10.4)    | N.A.
Pandas                          | python                                                       | SciPy-bundle                                                   | SciPy-bundle                     | N.A.
Matplotlib                      | python                                                       | matplotlib                                                     | matplotlib                       | N.A.
Beautiful Soup (beautifulsoup4) | python_ML_packages (Python 3.9.5-gpu and Python 3.11.8-cpu)  | BeautifulSoup                                                  | BeautifulSoup                    | N.A.
Seaborn                         | python_ML_packages (Python 3.9.5-gpu and Python 3.11.8-cpu)  | Seaborn                                                        | Seaborn                          | N.A.
Horovod                         | N.A.                                                         | Horovod (up to Python 3.11.3)                                  | N.A.                             | N.A.

Scikit-Learn

Scikit-learn (sklearn) is a powerful and easy-to-use open-source machine learning library for Python. It provides simple and efficient tools for data mining and data analysis, and it is built on NumPy, SciPy, and matplotlib. Scikit-learn is designed to interoperate with the Python numerical and scientific libraries.

More often than not, scikit-learn is used together with other popular libraries like TensorFlow and PyTorch to perform exploratory data analysis, data preprocessing, model selection, and evaluation. For our examples, we will use a Jupyter notebook on a CPU node to visualize the data and the results.

Scikit-learn provides a comprehensive suite of tools for building and evaluating machine learning models, making it an essential library for data scientists and machine learning practitioners.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate some data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 3, 2, 3, 5])

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Plot the results
plt.scatter(X, y, color='black')
plt.plot(X, y_pred, color='blue', linewidth=3)
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression Example')
plt.show()
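
Scikit-learn's tools for data splitting and model evaluation are used at least as often as its models. A minimal sketch of a typical preprocessing and evaluation workflow (the data here is synthetic and purely illustrative):

import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Synthetic data: 100 samples, 4 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a test set for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Chain preprocessing and model so scaling is fit on training data only
model = make_pipeline(StandardScaler(), LogisticRegression())

# 5-fold cross-validation on the training set (model selection)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))

# Fit on the full training set and score on the held-out test set
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))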

Exercise

Try running titanic_sklearn.ipynb, found in the Exercises/examples/programs directory, on an interactive CPU node. Copy the .ipynb file into your personal folder. Also copy the data directory into your personal folder, as it contains the dataset for this and subsequent exercises.

Run it in a Jupyter notebook on an interactive CPU node. An interactive GPU node will also do.

Load the correct modules containing the scikit-learn, numpy, seaborn, pandas, matplotlib, and jupyter libraries before starting the Jupyter notebook (a sketch of the workflow follows the learning outcomes below). Users on NSC can use the prebuilt tf_env or torch_env venv.

  • Learning outcomes:
    • How to start a Jupyter notebook on an interactive node.

    • How to load the correct modules already available on the system in order to run scikit-learn.
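
A minimal sketch of that workflow on Rackham at UPPMAX (the project ID is a placeholder, the interactive command and module names differ on other clusters, and we assume Jupyter is provided by the python_ML_packages module):

# Request an interactive CPU node (placeholder project ID and time limit)
interactive -A naiss202X-XX-XXX -n 4 -t 01:00:00

# Load the module stack listed above for UPPMAX
module load python_ML_packages/3.11.8-cpu

# Start the notebook server without a browser and follow the printed URL
jupyter notebook --no-browser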

PyTorch and TensorFlow

The following table demonstrates some common tasks in PyTorch and TensorFlow, highlighting their similarities and differences through code examples:

PyTorch:

import torch
import torch.nn as nn
import torch.optim as optim

# Tensor creation with gradients enabled
x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32, requires_grad=True)

# Automatic differentiation
y = x.sum()
y.backward()
print("Gradient of x:", x.grad)

# Creating and using a neural network layer
layer = nn.Linear(2, 2)
input_tensor = torch.tensor([[1.0, 2.0]], dtype=torch.float32)
output = layer(input_tensor)
print("Layer output:", output)

# Optimizer usage
optimizer = optim.SGD(layer.parameters(), lr=0.01)
loss = output.sum()
optimizer.zero_grad()  # Clear gradients
loss.backward()        # Compute gradients
optimizer.step()       # Update weights
print("Updated weights:", layer.weight)
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Tensor creation
x = tf.Variable([[1, 2], [3, 4]], dtype=tf.float32)

# Automatic differentiation
with tf.GradientTape() as tape:
    y = tf.reduce_sum(x)
grads = tape.gradient(y, x)
print("Gradient of x:", grads)

# Creating and using a neural network layer
layer = Dense(2)
input_tensor = tf.constant([[1.0, 2.0]], dtype=tf.float32)
output = layer(input_tensor)
print("Layer output:", output)

# Optimizer usage
optimizer = SGD(learning_rate=0.01)
with tf.GradientTape() as tape:
    # The forward pass must run inside the tape for gradients to flow
    output = layer(input_tensor)
    loss = tf.reduce_sum(output)
gradients = tape.gradient(loss, layer.trainable_variables)
optimizer.apply_gradients(zip(gradients, layer.trainable_variables))
print("Updated weights:", layer.weights)

Next, we learn how to submit a batch job that loads a Python module, activates a Python environment, and runs DNN code for image classification.
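
A minimal sketch of such a batch script (the project ID, GPU flag, environment path, and script name are placeholders; use the module list for your cluster from earlier in this section):

#!/bin/bash
#SBATCH -A hpc2nXXXX-YYY        # placeholder project ID
#SBATCH -t 00:30:00             # requested wall time
#SBATCH --gpus=1                # request one GPU (the exact flag varies per cluster)

# Load the PyTorch stack (HPC2N example from the list above)
ml GCC/12.3.0 OpenMPI/4.1.5 PyTorch/2.1.2-CUDA-12.1.1 scikit-learn/1.4.2

# Activate a virtual environment holding any extra packages (placeholder path)
source ~/venvs/mlenv/bin/activate

# Run the training script (placeholder name)
python fashion_mnist.py

Submit the script with sbatch and monitor it with squeue -u $USER.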

Exercise

Try running either the PyTorch or the TensorFlow code for the Fashion MNIST dataset by submitting a batch job. The dataset is stored in the data/pytorch or data/tf directory. Copy the data directory to your personal folder. In order to run this at any HPC resource you should either submit a batch job or run interactively on a compute node. Remember, you should not run long or resource-heavy jobs on the login nodes, and they also do not have GPUs if you want to use one.

  • Learning outcomes:
    • How to submit a batch job on an HPC GPU resource inside a virtual environment.

    • How to load the correct modules and activate the correct environment for running PyTorch or TensorFlow code.

Miscellaneous examples

Exercises

Exercise

Try running a PyTorch code example that fits a third-degree polynomial to a sine function. Use the PyTorch provided by the module system instead of a virtual environment (except on Tetralith (NSC), where no PyTorch module is available). Submit the job using a batch script, or run the code interactively on a GPU node (if you are already on one).

Visit the List of installed ML/DL tools and make sure to load the correct prerequisite modules, such as the right Python version and GCC, if needed.
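
For reference, a minimal sketch of such a fit (this is not the course's exercise file; the polynomial coefficients are trained with plain SGD):

import math
import torch

# Training data: x in [-pi, pi] with target y = sin(x)
x = torch.linspace(-math.pi, math.pi, 2000)
y = torch.sin(x)

# Coefficients of y ~ a + b*x + c*x^2 + d*x^3, randomly initialized
a, b, c, d = (torch.randn((), requires_grad=True) for _ in range(4))

optimizer = torch.optim.SGD([a, b, c, d], lr=1e-6)
for step in range(2000):
    y_pred = a + b * x + c * x ** 2 + d * x ** 3
    loss = (y_pred - y).pow(2).sum()  # sum of squared errors
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"y ~ {a.item():.4f} + {b.item():.4f} x + {c.item():.4f} x^2 + {d.item():.4f} x^3")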

  • Learning outcomes:
    • How to load pytorch/tensorflow from module system instead of using virtual environment.

    • Run the job on a GPU node either interactively or via batch script.

Keypoints

  • At all the clusters except Tetralith (NSC) you will find PyTorch, TensorFlow, and scikit-learn under different modules.

  • When in doubt, search for a module and its available versions using module spider. If you want the exact versions for each cluster, check the summary page.

  • If you plan to use multiple libraries with complex dependencies, it is recommended to create a virtual environment and pip install your libraries (a sketch follows this list).

  • Always run heavy ML/DL jobs on compute nodes, not on login nodes. For development purposes, you can use an interactive session on a compute node.
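
A minimal sketch of that approach (the base Python module, environment path, and package list are placeholders; pick a Python module that exists on your cluster):

# Load a base Python module first (cluster-specific version)
module load Python/3.11.3

# Create and activate a virtual environment (placeholder path)
python -m venv ~/venvs/mlenv
source ~/venvs/mlenv/bin/activate

# Install the libraries you need into the environment
pip install --upgrade pip
pip install torch scikit-learn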