Using packages
Learning outcomes
Practice using the documentation of your HPC cluster
Can find and load a Python package module
Can determine if a Python package is installed
Why Python packages are important
Python packages are pieces of tested Python code. Prefer using a Python package over writing your own code.
Some definitions
Library: A collection of code used by a program.
Package: A library that has been made easily installable and reusable. Often published on public repositories such as the Python Package Index (PyPI).
Dependency: A requirement of another program, not included in that program.
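To make the notion of a dependency concrete, here is a minimal sketch that asks the standard library module importlib.metadata which dependencies an installed package declares. The helper name declared_dependencies is hypothetical and the package names in the usage lines are only illustrations. What gets printed depends on what is installed in your environment.

```python
from importlib import metadata


def declared_dependencies(distribution_name):
    """Return the requirement strings a distribution declares.

    Returns None when the distribution is not installed and an empty
    list when it is installed but declares no dependencies.
    (Hypothetical helper, for illustration only.)
    """
    try:
        requirements = metadata.requires(distribution_name)
    except metadata.PackageNotFoundError:
        return None
    # metadata.requires() returns None when nothing is declared
    return list(requirements or [])


if __name__ == "__main__":
    # Example names only: the output depends on the loaded modules.
    for name in ("pandas", "a-package-that-does-not-exist"):
        print(name, "->", declared_dependencies(name))
```

A result of None means the package itself is missing, while a non-empty list shows the other packages it relies on.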
What packages are out there
Core numerics libraries: e.g. numpy
Plotting: e.g. matplotlib and seaborn
Data analysis and other important core packages: e.g. pandas, dask, xarray
Interactive computing and human interface: e.g. Jupyter, spyder
Data format support and data ingestion: e.g. h5py
Speeding up code and parallelism: e.g. mpi4py, numba, dask
Machine learning: e.g. scikit-learn
Deep learning: e.g. pytorch, tensorflow, keras
Plan of the week:
Cover the use of the above packages in more or less detail
Why software modules are important on an HPC cluster
Software modules allow users of any HPC cluster to activate their preferred software and/or packages in the version they need. This helps ensure reproducible research.
Where are the python packages?
Python packages can be included in a Python software module, provided in a bundle module, or need to be installed by the user.
| Cluster | Recommended Python module | Python packages |
|---|---|---|
| Dardel | | Many installed in the Python module |
| Tetralith | | Many installed in the Python module |
| Alvis | | Non-core packages are in bundle modules |
| Bianca | | Many installed in the Python module |
| Kebnekaise | | Non-core packages are in bundle modules |
| Pelle | | Non-core packages are in bundle modules |
| Cosmos | | Non-core packages are in bundle modules |
About Python bundles from EasyBuild.
How to see which Python packages are installed
There are two ways to determine which Python packages are installed (with software modules loaded):
| Where | Command to run | The package is present when … |
|---|---|---|
| On the command-line | | It shows up in the list |
| In the Python interpreter | | There is no error |
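Both checks can also be done from within Python itself. The sketch below is one possible way, using only the standard library: it does not reproduce the commands from the table above. The helper names is_importable and installed_distribution_names are hypothetical.

```python
import importlib.util
from importlib import metadata


def is_importable(module_name):
    """Return True if a top-level module with this name can be found.

    This mirrors the 'there is no error on import' check without
    actually importing the module. (Hypothetical helper.)
    """
    return importlib.util.find_spec(module_name) is not None


def installed_distribution_names():
    """Return the sorted names of all installed distributions.

    This mirrors the 'it shows up in the list' check. (Hypothetical helper.)
    """
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken or missing metadata
            names.add(name)
    return sorted(names)


if __name__ == "__main__":
    print("json importable:", is_importable("json"))
    print("number of installed distributions:", len(installed_distribution_names()))
```

Run this after loading your software modules, as the answer depends on which modules are active.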
Exercises
Want to see the answers as a video?
Some HPC clusters have multiple remote desktops. We recommend:
| HPC cluster | YouTube video |
|---|---|
| Alvis | |
| Bianca | |
| COSMOS | |
| Dardel | |
| Kebnekaise | |
| LUMI | |
| Pelle | |
| Rackham | |
| Tetralith | |
Exercise 1: using Python packages
Log in to your HPC cluster
Forgot how to do this?
The answer can be found at day 1
Load the Python module of the version below
| HPC cluster | Python version |
|---|---|
| Alvis | |
| Bianca | |
| COSMOS | |
| Dardel | |
| Kebnekaise | |
| LUMI | |
| Pelle | |
| Tetralith | |
Forgot how to do this?
| HPC cluster | Python version |
|---|---|
| Alvis | |
| Bianca | |
| COSMOS | |
| Dardel | |
| Kebnekaise | |
| LUMI | |
| Pelle | |
| Tetralith | |
Confirm that the Python package, indicated in the table below, is absent. You can use any way to do so.
| HPC cluster | Python package |
|---|---|
| Alvis | |
| Bianca | |
| COSMOS | |
| Dardel | |
| Kebnekaise | |
| LUMI | |
| Pelle | |
| Tetralith | |
Answer
From the terminal, use the command below to confirm that the package is not available yet:
| HPC cluster | Command |
|---|---|
| Alvis | |
| Bianca | |
| COSMOS | |
| Dardel | |
| Kebnekaise | |
| LUMI | |
| Pelle | |
| Tetralith | |
In all cases, the package is not yet installed, as that is what we’ll be doing next :-)
Find the software module to load the package. Use either the documentation of the HPC center or the module system.
Answer: where is this documented?
| HPC cluster | URL to documentation |
|---|---|
| Alvis | |
| Bianca | |
| COSMOS | |
| Dardel | Here, but it just says it needs to be installed |
| Kebnekaise | |
| LUMI | |
| Pelle | |
| Tetralith | |
Answer: how to use the module system?
In the terminal, type the command as shown below to get a decent hint.
There are many possible terms to use with module spider: whatever works for you is good too :-)
| HPC cluster | Command |
|---|---|
| Alvis | |
| Bianca | |
| COSMOS | |
| Dardel | |
| Kebnekaise | |
| LUMI | Has no module system, use a container instead. |
| Pelle | |
| Tetralith | |
Load the software module
Answer
In the terminal, type the following command:
| HPC cluster | Command |
|---|---|
| Alvis | `module load SciPy-bundle/2024.05-gfbf-2024a` |
| Bianca | `module load python_ML_packages/3.9.5-cpu`. You will be asked to do a `module unload python` first. Do so :-) |
| COSMOS | `module load GCC/13.3.0 SciPy-bundle/2024.05` |
| Dardel | `module load PDCOLD/23.12 matplotlib/3.8.2-cpeGNU-23.12`. It is not recommended to load a PDCOLD module, but it works and loads an older Python version |
| Kebnekaise | `module load GCC/13.3.0 SciPy-bundle/2024.05` |
| LUMI | Not applicable: we are using a container |
| Pelle | `module load PyTorch/2.6.0-foss-2024a` |
| Tetralith | `module load buildtool-easybuild/4.9.4-hpc71cbb0050 GCC/13.2.0 SciPy-bundle/2023.11`. Alternatively: `module load Python/3.11.5` (which happens to be a Python version with `scipy` installed) |
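After loading a module, it can be useful to confirm which Python interpreter is active and where a package is picked up from. The sketch below is one way to do so with the standard library. The helper name describe_environment is hypothetical, and scipy is used only as an example name.

```python
import importlib.util
import sys


def describe_environment(module_name):
    """Report the active interpreter and where a module would be loaded from.

    'module_origin' is None when the module cannot be found (it is also
    None for namespace packages). (Hypothetical helper.)
    """
    spec = importlib.util.find_spec(module_name)
    return {
        "python_executable": sys.executable,
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        "module_origin": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    # Example name only: use the package from your exercise instead.
    for key, value in describe_environment("scipy").items():
        print(key, "=", value)
```

If the reported paths point into the software tree of your cluster rather than your home folder, the package most likely comes from the loaded module.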
See the package is now present
Answer
From the terminal, use the command below to confirm that the package is now available:
| HPC cluster | Command |
|---|---|
| Alvis | |
| Bianca | |
| COSMOS | |
| Dardel | |
| Kebnekaise | |
| LUMI | |
| Pelle | |
| Tetralith | |
In all cases, the package is now installed. Well done!
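Besides checking that a package is present, you may want to record which version you got, as this helps reproducibility. A minimal sketch, assuming only the standard library (the helper name installed_version is hypothetical):

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if not installed.

    Note that this takes the distribution name (as used when installing),
    which can differ from the import name: for example, the distribution
    scikit-learn is imported as sklearn. (Hypothetical helper.)
    """
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Example names only: the output depends on the loaded modules.
    for name in ("scipy", "matplotlib", "torch"):
        print(name, installed_version(name))
```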
Done?
When done, and if you haven’t done so yet, do “Use the tarball with exercises”.
After that, get acquainted with packages in the “See also” section.
Using a cluster with bundles (all but Dardel, Tetralith and Bianca)
Read about Python bundles from EasyBuild.
More about packages
Summary taken from the Libraries section of the course Python for Scientific Computing.
Core numerics libraries
numpy - Arrays and array math.
scipy - Software for math, science, and engineering.
Plotting
matplotlib - Base plotting package, somewhat low level but almost everything builds on it.
seaborn - Higher level plotting interface; statistical graphics.
Vega-Altair - Declarative Python plotting.
mayavi - 3D plotting
Plotly - Big graphing library.
Data analysis and other important core packages
pandas - Columnar data analysis.
polars - Alternative to pandas that uses similar API, but is re-imagined for more speed.
Vaex - Alternative for pandas that uses similar API for lazy-loading and processing huge DataFrames.
Dask - Alternative to pandas that uses similar API and can do analysis in parallel.
xarray - Framework for working with multi-dimensional arrays.
statsmodels - Statistical models and tests.
SymPy - Symbolic math.
networkx - Graph and network analysis.
graph-tool - Graph and network analysis toolkit implemented in C++.
Interactive computing and human interface
- Interactive computing
IPython - Nicer interactive interpreter
Jupyter - Web-based interface to IPython and other languages (includes projects such as jupyter notebook, lab, hub, …)
- Testing
pytest - Automated testing interface
- Documentation
Sphinx - Documentation generator (also used for this lesson…)
- Development environments
Spyder - Interactive Python development environment.
Visual Studio Code - Microsoft’s flagship code editor.
PyCharm - JetBrains’s Python IDE.
Binder - load any git repository in Jupyter automatically, good for reproducible research
Data format support and data ingestion
pillow - Image manipulation. The original PIL is no longer maintained, the new “Pillow” is a drop-in replacement.
h5py and PyTables - Interfaces to the HDF5 file format.
Speeding up code and parallelism
MPI for Python (mpi4py) - Message Passing Interface (MPI) in Python for parallelizing jobs.
cython - easily make C extensions for Python, also interface to C libraries
numba - just in time compiling of functions for speed-up
PyPy - Python written in Python so that it can internally optimize more.
Dask - Distributed array data structure for distributed computation
Joblib - Easy embarrassingly parallel computing
IPyParallel - Easy parallel task engine.
numexpr - Fast evaluation of array expressions by automatically compiling the arithmetic.
Machine learning
nltk - Natural language processing toolkit.
scikit-learn - Traditional machine learning toolkit.
xgboost - Toolkit for gradient boosting algorithms.
Deep learning
tensorflow - Deep learning library by Google.
pytorch - Currently the most popular deep learning library.
keras - Simple library for doing deep learning.
huggingface - Ecosystem for sharing and running deep learning models and datasets. Includes packages like transformers, datasets, accelerate, etc.
jax - Google’s Python library for running NumPy and automatic differentiation on GPUs.
flax - Neural network framework built on Jax.
equinox - Another neural network framework built on Jax.
DeepSpeed - Algorithms for running massive scale trainings. Included in many of the frameworks.
PyTorch Lightning - Framework for creating and training PyTorch models.
Tensorboard - Tool for visualizing model training on a web page.
Other packages for special cases
dateutil and pytz - Date arithmetic and handling, timezone database and conversion.
Discussion
Questions?
About Dardel?
The upcoming Arrhenius cluster will probably be a combination of Dardel, Kebnekaise and Alvis.