Load and run python and use packages

At UPPMAX, HPC2N, LUNARC, and NSC (and most other Swedish HPC centres) we call the applications available via the module system modules.

Objectives

  • Show how to load Python

  • Show how to run Python scripts and start the Python command line

Short cheat sheet

  • See which modules exist: module spider or ml spider

  • Find module versions for a particular software: module spider <software>

  • See which modules can be loaded given what is currently loaded: module avail or ml av

  • See which modules are currently loaded: module list or ml

  • Load a module: module load <module>/<version> or ml <module>/<version>

  • Unload a module: module unload <module>/<version> or ml -<module>/<version>

  • More information about a module: module show <module>/<version> or ml show <module>/<version>

  • Unload all modules except the ‘sticky’ modules: module purge or ml purge

Warning

  • Note that the module systems at UPPMAX, HPC2N, LUNARC, and NSC are slightly different.

  • Which modules ml avail shows differs between the centres:
    • UPPMAX: all modules not directly related to bioinformatics are shown

    • NSC: all modules are shown

    • HPC2N and LUNARC: modules are hidden until a prerequisite, such as the compiler GCC, has been loaded.

  • For reproducibility reasons, you should always load a specific version of a module instead of just the default version

  • Many modules have prerequisite modules that need to be loaded first (at HPC2N/LUNARC/NSC this is also the case for the Python modules). Running module spider <module>/<version> gives you a list of the other modules that need to be loaded first

Check for Python versions

Tip

Type along!

Check all available Python versions with:

$ module avail python

NOTE that python is written in lower case!

Note

Unless otherwise stated, we recommend using Python 3.11.x in this course at HPC2N, UPPMAX, LUNARC, and NSC. We will use Python 3.10.4 at NSC for a small number of examples, since more packages are installed for that version.

Load a Python module

For reproducibility, we recommend ALWAYS loading a specific module version instead of using the default version!

Tip

Type along!

Go back and check which Python modules were available. To load version 3.11.8, do:

$ module load python/3.11.8

Note: Lowercase p. For short, you can also use:

$ ml python/3.11.8

Warning

  • UPPMAX: Don’t use system-installed python (2.7.5)

  • UPPMAX: Don’t use system-installed python3 (3.6.8)

  • HPC2N: Don’t use system-installed python (2.7.18)

  • HPC2N: Don’t use system-installed python3 (3.8.10)

  • LUNARC: Don’t use system-installed python/python3 (3.9.18)

  • NSC: Don’t use system-installed python/python3 (3.9.18)

  • ALWAYS use a python module
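
A quick way to confirm that the python you are running comes from a loaded module, and not from the system installation, is to ask the interpreter itself. The sketch below is only an illustration: the file name which_python.py and the helper describe_interpreter are made up for this example.

```python
import sys


def describe_interpreter():
    """Return the path and version of the Python that is running this code."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return {"executable": sys.executable, "version": version}


if __name__ == "__main__":
    info = describe_interpreter()
    print("Executable:", info["executable"])
    print("Version:   ", info["version"])
```

Save it as which_python.py and run it with python which_python.py. If the version matches one of the system versions listed above, a Python module is probably not loaded.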

Why are there both Python/2.X.Y and Python/3.Z.W modules?

  • Some existing software might use Python2 and some will use Python3.

  • Some of the Python packages have both Python2 and Python3 versions.

  • Check what your software as well as the installed modules need when you pick!

UPPMAX: Why are there both python/3.X.Y and python3/3.X.Y modules?

  • Sometimes existing software might use python2 and there’s nothing you can do about that.

  • In pipelines and other toolchains the different tools may together require both python2 and python3.

  • Here’s how you handle that situation:

  • You can run two python modules at the same time if ONE of the modules is python/2.X.Y and the other module is python3/3.X.Y (not python/3.X.Y).

LUNARC: Are python and python3 equivalent, or does the former load Python/2.X.Y?

The answer depends on which module is loaded. If Python/3.X.Y is loaded, then python is just an alias for python3 and it will start the same command line. However, if Python/2.7.X is loaded, then python will start the Python/2.7.X command line while python3 will start the system version (3.9.18). If you load Python/2.7.X and then try to load Python/3.X.Y as well, or vice-versa, the most recently loaded Python version will replace anything loaded prior, and all dependencies will be upgraded or downgraded to match. Only the system’s Python/3.X.Y version can be run at the same time as a version of Python/2.7.X.

Run

Run Python script

Hint

  • There are many ways to edit your scripts.

  • If you are rather new, try one of these:

    • Graphical: $ gedit <script> &

      • (& is for letting you use the terminal while editor window is open)

      • Requires ThinLinc or ssh -X

    • Terminal: $ nano <script>

  • Otherwise you would know what to do!

  • ⚠️ The teachers may use their common editor, like vi/vim
    • If you get stuck in vim, press <esc> and then type :q!

Type-Along

  • Let’s make a script with the name example.py

$ nano example.py
  • Insert the following text

# This program prints Hello, world!
print('Hello, world!')
  • Save and exit. In nano: <ctrl>+O, <ctrl>+X

You can run a python script in the shell like this:

$ python example.py
# or
$ python3 example.py

Warning

  • ONLY run jobs that are short and/or do not use a lot of resources from the command line.

  • Otherwise use the batch system (see the batch session)

Run an interactive Python shell

  • You can start a simple python terminal by:

$ python

Example

>>> a=3
>>> b=7
>>> c=a+b
>>> c
10
  • Exit Python with <Ctrl-D>, quit() or exit() in the python prompt

>>> <Ctrl-D>
>>> quit()
>>> exit()

For more interactivity you can run IPython.

Tip

Type along!

NOTE: remember to load a python module first. Then start IPython from the terminal

$ ipython

or

$ ipython3

UPPMAX also has jupyter-notebook installed and available from the loaded Python module. Start it with

$ jupyter-notebook

If you prefer to choose the browser yourself, add --no-browser and open the URL printed in the output in your favorite browser. From python/3.10.8 onwards, jupyterlab is also available.

  • Exit IPython with <Ctrl-D>, quit() or exit() in the IPython prompt

IPython

In [2]: <Ctrl-D>
In [12]: quit()
In [17]: exit()

Packages/Python modules

Python modules AKA Python packages

  • Python packages broaden the use of python to almost infinity!

  • Instead of writing code yourself, you may find that others have already written what you need!

  • Many scientific tools are distributed as python packages, so you can run them as a script from the prompt, specifying there the files to analyse and the arguments that define exactly what to do.

  • A nice introduction to packages can be found here: Python for scientific computing

Questions

  • How do I find which packages and versions are available?

  • What to do if I need other packages?

  • Are there differences between HPC2N, LUNARC, UPPMAX, and NSC?

Objectives

  • Show how to check for Python packages

  • Show how to install your own packages on the different clusters

Check current available packages

General for all four centers

Some python packages work as stand-alone tools, for instance in bioinformatics. The tool may already be installed as a module. Check if it is there with:

$ module spider <tool-name or tool-name part>

Using module spider lets you search regardless of upper- or lowercase characters and regardless of already loaded modules (like GCC on HPC2N/LUNARC/NSC and bioinfo-tools on UPPMAX).

Check the pre-installed packages of a specific python module:

$ module help python/<version>

Check the pre-installed packages of a loaded python module, in shell:

$ pip list

To see which Python packages you have installed yourself, you can use pip list --user while the environment in which you installed the packages is active.
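
pip list --user reports what is installed in the per-user site directory of the active Python. As a rough sketch, the standard-library site module can show where that directory is and whether the running interpreter searches it. The helper name user_site_info is made up for this example.

```python
import site
import sys


def user_site_info():
    """Report the per-user site-packages directory of the running Python."""
    user_dir = site.getusersitepackages()
    return {
        "user_site": user_dir,                      # where 'pip install --user' puts packages
        "enabled": bool(site.ENABLE_USER_SITE),     # False or None means the directory is ignored
        "on_path": user_dir in sys.path,            # only True if the directory exists and is enabled
    }


if __name__ == "__main__":
    for key, value in user_site_info().items():
        print(f"{key}: {value}")
```

The directory depends on the Python version, so packages installed with --user under one Python module are not automatically visible after loading a different one.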

You can also test from within python to make sure that the package is not already installed:

>>> import <package>

Does it work? Then it is there!
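
The same check can be done for several packages at once without actually importing them. The following is a minimal sketch using only the standard library. The helper name package_status is made up for this example, and it assumes that the import name and the name of the installed distribution are the same, which is not true for every package.

```python
import importlib.util
from importlib import metadata


def package_status(name):
    """Return (found, version) for a top-level module name.

    version is None when no installed distribution carries that name,
    which is the normal case for standard-library modules.
    """
    found = importlib.util.find_spec(name) is not None
    try:
        version = metadata.version(name) if found else None
    except metadata.PackageNotFoundError:
        version = None
    return found, version


if __name__ == "__main__":
    for name in ["numpy", "mpi4py", "time"]:
        found, version = package_status(name)
        print(f"{name:10s} found={found} version={version}")
```

A package reported as found is ready to use with the currently loaded modules.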

Otherwise, you can install it yourself with either pip or conda.

Check the path to the package you are using:

  • In a python session, type:

import [a_module]
print([a_module].__file__)
  • The print-out tells you the path to the file the package was loaded from (in Python 3 typically a .py file such as __init__.py), which gives you a hint about which installation it belongs to.
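
Not every module has a __file__ attribute: modules compiled into the interpreter, such as sys, do not. A small sketch that copes with both cases could look like this. The helper name module_location is made up for this example.

```python
import importlib


def module_location(name):
    """Return the file a module was loaded from, or None if it has no file."""
    module = importlib.import_module(name)
    # Built-in modules compiled into the interpreter (such as sys) lack __file__.
    return getattr(module, "__file__", None)


if __name__ == "__main__":
    for name in ["json", "sys"]:
        print(name, "->", module_location(name))
```

A path inside your home directory suggests a package you installed yourself, while a path under the software tree of the centre suggests a package that came with a module.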

Check packages (5 min)

  • See if the following packages are installed. Use python version 3.11.8 on Rackham, 3.11.3 on Kebnekaise, 3.11.5 on Cosmos, and 3.10.4 on Tetralith (remember: the Python module on kebnekaise/cosmos/tetralith has prerequisite(s)).

    • numpy

    • mpi4py

    • distributed

    • multiprocessing

    • time

    • dask

NOTE: at HPC2N, LUNARC, and NSC, the available Python packages need to be loaded as modules/module-bundles before use! See a list of some of them below, under the HPC2N/LUNARC/NSC tab, or find more as mentioned above, using module spider -r ...

A selection of the Python packages and libraries installed on UPPMAX, HPC2N, LUNARC, and NSC is given in the extra reading: UPPMAX clusters, Kebnekaise cluster, and eventually LUNARC cluster and NSC cluster

  • The python application at UPPMAX comes with several preinstalled packages.

  • You can check them here: UPPMAX packages.

  • In addition there are packages available from the module system as python tools/packages

  • Note that bioinformatics-related tools can be reached only after loading bioinfo-tools.

  • Two modules contain topic-specific packages. These are:

    • Machine learning: python_ML_packages (cpu and gpu versions, based on python/3.9.5 and python/3.11.8)

    • GIS: python_GIS_packages (cpu version based on python/3.10.8)

Demo/Type-along

This is an exercise that combines loading, running, and using site-installed packages. Later, during the batch session, we will look at running the same exercise, but as a batch job. There is also a follow-up exercise with an extended version of the script, if you want to try running that as well (see further down on the page).

We will use the pandas and matplotlib packages in this very simple example, but not explain anything about them. That comes later in the course!

Note

You need the data-file scottish_hills.csv which can be found in the directory Exercises/examples/programs. If you have cloned the git-repo for the course, or copied the tar-ball, you should have this directory. The easiest thing to do is just change to that directory and run the exercise there.

Since the exercise opens a plot, you need to log in with ThinLinc (or otherwise have an X11 server running on your system and log in with ssh -X ...).

The exercise is modified from an example found on https://ourcodingclub.github.io/tutorials/pandas-python-intro/.

Warning

Not relevant if using UPPMAX. Only if you are using HPC2N, LUNARC, or NSC!

You need to also load Tkinter.

For HPC2N:

ml GCC/12.3.0 Python/3.11.3 SciPy-bundle/2023.07 matplotlib/3.7.2 Tkinter/3.11.3

For LUNARC

ml GCC/13.2.0 Python/3.11.5 SciPy-bundle/2023.11 matplotlib/3.8.2 Tkinter/3.11.5

For NSC (Tetralith)

ml buildtool-easybuild/4.8.0-hpce082752a2 GCC/11.3.0 OpenMPI/4.1.4 Python/3.10.4 SciPy-bundle/2022.05 matplotlib/3.5.2 Tkinter/3.10.4

In addition, you need to add the following two lines to the top of your python script/run them first in Python, for HPC2N, LUNARC, and NSC:

import matplotlib
matplotlib.use('TkAgg')
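
If the same script should also run where no display is available, for instance in a batch job, one option is to choose the backend at run time. The sketch below is an assumption-laden illustration and not part of the course scripts: the helpers choose_backend and apply_backend are made up, and it assumes that a missing DISPLAY environment variable means that no X11 display can be used. Agg is matplotlib's non-interactive backend, which can only write figures to files.

```python
import os


def choose_backend(environ=None):
    """Pick a matplotlib backend name: TkAgg with a display, Agg without one."""
    environ = os.environ if environ is None else environ
    # Assumption: X11 sessions (ThinLinc, ssh -X) set DISPLAY; batch jobs normally do not.
    return "TkAgg" if environ.get("DISPLAY") else "Agg"


def apply_backend():
    """Select the backend; call this before importing matplotlib.pyplot."""
    import matplotlib  # imported here so choose_backend works without matplotlib

    matplotlib.use(choose_backend())


if __name__ == "__main__":
    print("Backend that would be used:", choose_backend())
```

With Agg selected, plt.show() does not open a window, so the figure has to be written to a file with plt.savefig(...) instead.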

Python example with packages pandas and matplotlib

NOTE if you have loaded a different Python version than what we use here, do ml purge first to get a clean work area.

We are using Python version 3.11.x except on Tetralith where we use Python/3.10.4. To access the packages pandas and matplotlib, you may need to load other modules, depending on the site where you are working.

At UPPMAX you only need to load the python module, as the relevant packages are included (as long as you are not using GPUs, but that is covered later in the course). Thus, you just do:

$ ml python/3.11.8
  1. From inside Python/interactive (if you are on Kebnekaise/Cosmos/Tetralith, mind the warning above about loading a compatible Tkinter and adding the two lines importing matplotlib and setting TkAgg at the top):

    Not on UPPMAX, but on HPC2N, LUNARC, NSC: Start Python and run these lines:

    import matplotlib
    matplotlib.use('TkAgg')
    

    On all systems: Start python (if you have not already) and run these lines:

    import pandas as pd
    import matplotlib.pyplot as plt

    dataframe = pd.read_csv("scottish_hills.csv")
    x = dataframe.Height
    y = dataframe.Latitude
    plt.scatter(x, y)
    plt.show()
    

    If you change the last line to plt.savefig("myplot.png") then you will instead get a file myplot.png containing the plot. This is what you would do if you were running a python script in a batch job.

    • On UPPMAX, LUNARC, and NSC you can view png files with the program eog
      • Test: eog myplot.png &

    • On HPC2N you can view png files with the program eom
      • Test: eom myplot.png &

  2. As a Python script (if you are on Kebnekaise/Cosmos/Tetralith, mind the warning above about Tkinter):

    Copy and save this script as a file, or just run the file pandas_matplotlib-<system>.py located in the <path-to>/Exercises/examples/programs directory that you got from the repo or copied, where <system> is either rackham, kebnekaise, cosmos, or tetralith.

    import pandas as pd
    import matplotlib.pyplot as plt
    
    dataframe = pd.read_csv("scottish_hills.csv")
    x = dataframe.Height
    y = dataframe.Latitude
    plt.scatter(x, y)
    plt.show()
    

If you have time, you can also try to run these extended versions, which also require the scipy package (included with python at UPPMAX, and available with the same modules loaded as for pandas at HPC2N/LUNARC/NSC):

Exercises (c. 10 min)

Python example that requires pandas, matplotlib, and scipy packages.

You can either save the scripts or run them line by line inside Python. The scripts are also available in the directory <path-to>/Exercises/examples/programs, as pandas_matplotlib-linreg.py and pandas_matplotlib-linreg-pretty.py.

NOTE that there are separate versions for rackham, kebnekaise, cosmos, and tetralith, and that for kebnekaise, cosmos, and tetralith you again need to add the lines regarding TkAgg mentioned in the warning before the previous exercise. The example below shows how it looks for rackham.

Remember that you also need the data file scottish_hills.csv located in the above directory.

Examples are from https://ourcodingclub.github.io/tutorials/pandas-python-intro/

pandas_matplotlib-linreg.py

import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress

dataframe = pd.read_csv("scottish_hills.csv")

x = dataframe.Height
y = dataframe.Latitude

stats = linregress(x, y)

m = stats.slope
b = stats.intercept

plt.scatter(x, y)
plt.plot(x, m * x + b, color="red")   # I've added a color argument here

plt.show()

pandas_matplotlib-linreg-pretty.py

import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress

dataframe = pd.read_csv("scottish_hills.csv")

x = dataframe.Height
y = dataframe.Latitude

stats = linregress(x, y)

m = stats.slope
b = stats.intercept

# Change the default figure size
plt.figure(figsize=(10,10))

# Change the default marker for the scatter from circles to x's
plt.scatter(x, y, marker='x')

# Set the linewidth on the regression line to 3px
plt.plot(x, m * x + b, color="red", linewidth=3)

# Add x and y labels, and set their font size
plt.xlabel("Height (m)", fontsize=20)
plt.ylabel("Latitude", fontsize=20)

# Set the font size of the number labels on the axes
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

plt.show()
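
The slope m and intercept b that the scripts above take from linregress describe the ordinary least-squares line through the points. As a hedged illustration of what those two numbers mean, the calculation can be sketched in plain Python. The helper least_squares_line is made up for this example and the points are invented, not taken from scottish_hills.csv. It does not reproduce the other statistics that linregress returns.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more points and equally long inputs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Invented points lying exactly on y = 2x + 1
    m, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
    print("slope:", m, "intercept:", b)
```

For real data, linregress is the better choice, since it also reports quantities such as the correlation coefficient and the p-value.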

Keypoints

  • Before you can run Python scripts or work in a Python shell, first load a python module and any prerequisites

  • Start a Python shell session either with python or ipython

  • Run scripts with python3 <script.py>

  • You can check for packages

    • from the Python shell with the import command

    • from BASH shell with the

      • pip list command at all four centers

      • ml help python/<version> at UPPMAX

  • Installation of Python packages can be done either with pip (from PyPI) or with Conda

  • You install your own packages with the pip install command (this is the recommended way on HPC2N)

  • At UPPMAX, LUNARC, and NSC Conda is also available (See Conda section)