Load and run python and use packages

At both UPPMAX and HPC2N we call the applications available via the module system modules.

Objectives

  • Show how to load Python

  • Show how to run Python scripts and start the Python command line

Warning

  • Note that the module systems at UPPMAX and HPC2N are slightly different.

  • While all modules at UPPMAX not directly related to bio-informatics are shown by ml avail, modules at HPC2N are hidden until one has loaded a prerequisite like the compiler GCC.

  • For reproducibility reasons, you should always load a specific version of a module instead of just the default version.

  • Many modules have prerequisite modules which need to be loaded first (at HPC2N this is also the case for the Python modules). When doing module spider <module>/<version> you will get a list of which other modules need to be loaded first.

Check for Python versions

Tip

Type along!

Check all available Python versions with:

$ module avail python

Load a Python module

For reproducibility, we recommend ALWAYS loading a specific module instead of using the default version!

For this course, we recommend using Python 3.11.x (except for some GPU examples that will use 3.9.5).

Tip

Type along!

Go back and check which Python modules were available. To load version 3.11.8, do:

$ module load python/3.11.8

Note: Lowercase p. For short, you can also use:

$ ml python/3.11.8

Warning

  • UPPMAX: Don’t use system-installed python (2.7.5)

  • UPPMAX: Don’t use system-installed python3 (3.6.8)

  • HPC2N: Don’t use system-installed python (2.7.18)

  • HPC2N: Don’t use system-installed python3 (3.8.10)

  • ALWAYS use a python module
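
To confirm that the interpreter you are running comes from the loaded module and not from the system installation, you can ask Python itself. The following is a minimal sketch using only the standard library; the helper name describe_interpreter is made up for this illustration.

```python
import platform
import sys


def describe_interpreter():
    """Return the path and version of the running Python interpreter."""
    return {
        "executable": sys.executable,          # full path to the python binary
        "version": platform.python_version(),  # version string, e.g. major.minor.patch
    }


info = describe_interpreter()
print(f"Running Python {info['version']} from {info['executable']}")
```

If the printed path points to a system location such as /usr/bin, you are most likely running the system-installed Python and should load a module first.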

Why are there both Python/2.X.Y and Python/3.Z.W modules?

Some existing software might use Python2 and some will use Python3. Some of the Python packages have both Python2 and Python3 versions. Check what your software as well as the installed modules need when you pick!

UPPMAX: Why are there both python/3.X.Y and python3/3.X.Y modules?

Sometimes existing software might use python2 and there’s nothing you can do about that. In pipelines and other toolchains the different tools may together require both python2 and python3. Here’s how you handle that situation:

  • You can run two python modules at the same time if ONE of the modules is python/2.X.Y and the other module is python3/3.X.Y (not python/3.X.Y).

Run

Run Python script

Hint

  • There are many ways to edit your scripts.

  • If you are rather new:

    • Graphical: $ gedit <script> &

      • (& is for letting you use the terminal while editor window is open)

      • Requires ThinLinc or ssh -Y ... or ssh -X

    • Terminal: $ nano <script>

  • Otherwise you would know what to do!

  • ⚠️ The teachers may use their usual editor, like vi/vim
    • If you get stuck, press: <esc> and then type :q!

Type-Along

  • Let’s make a script with the name example.py

$ nano example.py
  • Insert the following text

# This program prints Hello, world!
print('Hello, world!')
  • Save and exit. In nano: <ctrl>+O, <ctrl>+X

You can run a python script in the shell like this:

$ python example.py
# or
$ python3 example.py

Warning

  • ONLY run jobs that are short and/or do not use a lot of resources from the command line.

  • Otherwise use the batch system (see the batch session)

Run an interactive Python shell

  • You can start a simple python terminal by:

$ python

Example

>>> a=3
>>> b=7
>>> c=a+b
>>> c
10
  • Exit Python with <Ctrl-D>, quit() or exit() in the python prompt

>>> <Ctrl-D>
>>> quit()
>>> exit()

For more interactivity you can run IPython.

Tip

Type along!

NOTE: remember to load a python module first. Then start IPython from the terminal

$ ipython

or

$ ipython3

UPPMAX also has jupyter-notebook installed and available from the loaded Python module. Start it with

$ jupyter-notebook

If you prefer to use your own favorite browser, add --no-browser and open the URL printed in the output. From python/3.10.8 onward, jupyterlab is also available.

  • Exit IPython with <Ctrl-D>, quit() or exit() in the IPython prompt

IPython

In [2]: <Ctrl-D>
In [12]: quit()
In [17]: exit()

Packages/Python modules

Python modules AKA Python packages

  • Python packages broaden the use of python to almost infinity!

  • Instead of writing code yourself there may be others that have done the same!

  • Many scientific tools are distributed as Python packages, so you can run them as scripts from the command line, specifying the files to analyse and the arguments that define exactly what to do.

  • A nice introduction to packages can be found here: Python for scientific computing
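
As an illustration of a tool that takes its input file and options on the command line, here is a minimal sketch using argparse from the standard library. The tool itself, the option names --column and --verbose, and the function names are hypothetical and only serve as an example.

```python
import argparse


def build_parser():
    """Create a parser for a hypothetical analysis tool."""
    parser = argparse.ArgumentParser(description="Analyse a data file.")
    parser.add_argument("datafile", help="path to the file to analyse")
    parser.add_argument("--column", default="Height",
                        help="name of the column to analyse")
    parser.add_argument("--verbose", action="store_true",
                        help="print extra information")
    return parser


def main(argv=None):
    """Parse the arguments; argv=None means 'use the real command line'."""
    args = build_parser().parse_args(argv)
    if args.verbose:
        print(f"Analysing column {args.column!r} in {args.datafile}")
    return args


# Demonstration with an explicit argument list. In a real script you would
# call main() without arguments so that the command line is used.
args = main(["scottish_hills.csv", "--column", "Latitude", "--verbose"])
```

Saved as a script, such a tool would then be started with something like python mytool.py scottish_hills.csv --column Latitude, where mytool.py is a placeholder name.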

Questions

  • How do I find which packages and versions are available?

  • What to do if I need other packages?

  • Are there differences between HPC2N and UPPMAX?

Objectives

  • Show how to check for Python packages

  • Show how to install your own packages on the different clusters

Check current available packages

General for both centers

Some Python packages work as stand-alone tools, for instance in bioinformatics. The tool may already be installed as a module. Check if it is there with:

$ module spider <tool-name or tool-name part>

Using module spider lets you search regardless of upper- or lowercase characters and regardless of already loaded modules (like GCC on HPC2N and bioinfo-tools on UPPMAX).

Check the pre-installed packages of a specific python module:

$ module help python/<version>

Check the pre-installed packages of a loaded python module, in shell:

$ pip list

To see which Python packages you have installed yourself, you can use pip list --user while the environment in which you installed the packages is active.
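
If you want a similar listing from inside Python, for instance to record it in a log for reproducibility, the standard library module importlib.metadata can enumerate the installed distributions. This is a rough sketch and not a replacement for pip list; the function name installed_packages is made up for the example.

```python
from importlib import metadata


def installed_packages():
    """Return a list of (name, version) tuples, sorted by name."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with broken or missing metadata
            found.add((name, dist.version))
    return sorted(found, key=lambda item: item[0].lower())


for name, version in installed_packages():
    print(f"{name}=={version}")
```

Note that the names listed are distribution names, which are not always identical to the names you use in an import statement.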

You can also test from within python to make sure that the package is not already installed:

>>> import <package>

Does it work? Then it is there!

Otherwise, you can install it yourself with either pip or conda.
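
The import test can also be done for several packages at once, without actually importing them. Below is a minimal sketch using importlib.util.find_spec from the standard library; the helper name is_available and the deliberately non-existent package name are only for illustration.

```python
import importlib.util


def is_available(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ["numpy", "time", "no_such_package_xyz"]:
    status = "available" if is_available(name) else "NOT available"
    print(f"{name}: {status}")
```

The sketch is meant for top-level names such as numpy; for dotted names like a.b, find_spec first imports the parent package and raises an error if that parent is missing.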

Check packages (5 min)

  • See if the following packages are installed. Use python version 3.11.8 on Rackham and 3.11.3 on Kebnekaise (remember: the Python module on Kebnekaise has a prerequisite).

    • numpy

    • mpi4py

    • distributed

    • multiprocessing

    • time

    • dask

NOTE: at HPC2N, the available Python packages need to be loaded as modules before using! See a list of some of them below, under the HPC2N tab or find more as mentioned above, using module spider -r ...

A selection of the Python packages and libraries installed on UPPMAX and HPC2N is given in the extra reading: UPPMAX clusters and Kebnekaise cluster

  • The python application at UPPMAX comes with several preinstalled packages.

  • You can check them here: UPPMAX packages.

  • In addition there are packages available from the module system as python tools/packages

  • Note that bioinformatics-related tools can be reached only after loading bioinfo-tools.

  • Two modules contain topic-specific packages. These are:

    • Machine learning: python_ML_packages (cpu and gpu versions and based on python/3.9.5)

    • GIS: python_GIS_packages (cpu version based on python/3.10.8)

Exercises

This is an exercise that combines loading, running, and using site-installed packages. Later, during the batch session, we will look at running the same exercise, but as a batch job. There is also a follow-up exercise with an extended version of the script, if you want to try to run that as well (see further down on the page).

Note

You need the data-file scottish_hills.csv which can be found in the directory Exercises/examples/programs. If you have cloned the git-repo for the course, or copied the tar-ball, you should have this directory. The easiest thing to do is just change to that directory and run the exercise there.

Since the exercise opens a plot, you need to log in with ThinLinc (or otherwise have an X11 server running on your system and log in with ssh -X ...).

The exercise is modified from an example found on https://ourcodingclub.github.io/tutorials/pandas-python-intro/.
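
Before starting, you may want to verify that the data file is in your current directory and contains the columns used in the exercise (Height and Latitude). The following is a small sketch using only the standard library; the function name missing_columns is made up for this example.

```python
import csv
from pathlib import Path


def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; change to the data directory first")
    with path.open(newline="") as handle:
        header = next(csv.reader(handle), [])  # first row, or [] for an empty file
    return [name for name in required if name not in header]


# Only run the check if the file is present in the current directory.
if Path("scottish_hills.csv").is_file():
    print("Missing columns:", missing_columns("scottish_hills.csv", ["Height", "Latitude"]))
```

An empty list means that both columns are present.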

Warning

Not relevant if using UPPMAX. Only if you are using HPC2N!

You need to also load Tkinter. Use this:

ml GCC/12.3.0 Python/3.11.3 SciPy-bundle/2023.07 matplotlib/3.7.2 Tkinter/3.11.3

In addition, you need to add the following two lines to the top of your python script/run them first in Python:

import matplotlib
matplotlib.use('TkAgg')
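
If the same script should also run where no display is available, for example in a batch job that saves the plot to a file, one option is to choose the backend at run time. The sketch below assumes that a set DISPLAY environment variable means an X11 display is reachable, and falls back to Agg, matplotlib's non-interactive backend, otherwise. The function name pick_backend is made up for this illustration.

```python
import os


def pick_backend(environ=None):
    """Return 'TkAgg' when a display seems available, otherwise 'Agg'."""
    env = os.environ if environ is None else environ
    # Assumption: DISPLAY is set when logged in via ThinLinc or ssh -X.
    return "TkAgg" if env.get("DISPLAY") else "Agg"


backend = pick_backend()
print(f"Selected matplotlib backend: {backend}")

# In your script, before importing matplotlib.pyplot, you would then do:
#   import matplotlib
#   matplotlib.use(backend)
```

With the Agg backend plt.show() does not open a window, so save the figure with plt.savefig(...) instead.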

Python example with packages pandas and matplotlib

We are using Python version 3.11.x. To access the packages pandas and matplotlib, you may need to load other modules, depending on the site where you are working.

At UPPMAX you only need to load the python module, as the relevant packages are included (as long as you are not using GPUs; that is covered later in the course). At HPC2N, use the ml line given in the warning above instead. Thus, at UPPMAX you just do:

ml python/3.11.8
  1. From inside Python/interactive (if you are on Kebnekaise, mind the warning above):

    Start python and run these lines:

    import pandas as pd
    import matplotlib.pyplot as plt
    dataframe = pd.read_csv("scottish_hills.csv")
    x = dataframe.Height
    y = dataframe.Latitude
    plt.scatter(x, y)
    plt.show()

    If you change the last line to plt.savefig("myplot.png") then you will instead get a file myplot.png containing the plot. This is what you would do if you were running a python script in a batch job.

  2. As a Python script (if you are on Kebnekaise, mind the warning above):

    Copy and save this script as a file (or just run the file pandas_matplotlib-<system>.py that is located in the <path-to>/Exercises/examples/programs directory you got from the repo or copied, where <system> is either rackham or kebnekaise).

    import pandas as pd
    import matplotlib.pyplot as plt
    
    dataframe = pd.read_csv("scottish_hills.csv")
    x = dataframe.Height
    y = dataframe.Latitude
    plt.scatter(x, y)
    plt.show()
    

If you have time, you can also try to run these extended versions, which also require the scipy package (included with python at UPPMAX, and available at HPC2N with the same modules loaded as for pandas):

Python example that requires pandas, matplotlib, and scipy packages.

You can either save the scripts or run them line by line inside Python. The scripts are also available in the directory <path-to>/Exercises/examples/programs, as pandas_matplotlib-linreg.py and pandas_matplotlib-linreg-pretty.py.

NOTE that there are separate versions for rackham and kebnekaise. On kebnekaise you again need to add the same lines as mentioned in the warning before the previous exercise.

Remember that you also need the data file scottish_hills.csv located in the above directory.

Examples are from https://ourcodingclub.github.io/tutorials/pandas-python-intro/

import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress

dataframe = pd.read_csv("scottish_hills.csv")

x = dataframe.Height
y = dataframe.Latitude

stats = linregress(x, y)

m = stats.slope
b = stats.intercept

plt.scatter(x, y)
plt.plot(x, m * x + b, color="red")   # I've added a color argument here

plt.show()

The second script adds some formatting to the same plot:

import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress

dataframe = pd.read_csv("scottish_hills.csv")

x = dataframe.Height
y = dataframe.Latitude

stats = linregress(x, y)

m = stats.slope
b = stats.intercept

# Change the default figure size
plt.figure(figsize=(10,10))

# Change the default marker for the scatter from circles to x's
plt.scatter(x, y, marker='x')

# Set the linewidth on the regression line to 3 points
plt.plot(x, m * x + b, color="red", linewidth=3)

# Add x and y labels, and set their font size
plt.xlabel("Height (m)", fontsize=20)
plt.ylabel("Latitude", fontsize=20)

# Set the font size of the number labels on the axes
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

plt.show()
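
For intuition about the slope and intercept used in the scripts above: linregress fits a straight line by ordinary least squares. The sketch below computes the same two quantities in plain Python on a tiny made-up data set; it is only meant as an illustration, not as a replacement for scipy, and the function name least_squares_line is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on the line y = 2x + 1
slope, intercept = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(f"slope={slope}, intercept={intercept}")
```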

Keypoints

  • Before you can run Python scripts or work in a Python shell, first load a python module and any prerequisites

  • Start a Python shell session either with python or ipython

  • Run scripts with python3 <script.py>

  • You can check for packages

    • from the Python shell with the import command

    • from the Bash shell with the

      • pip list command at both centers

      • ml help python/<version> at UPPMAX

  • Installation of Python packages can be done either from PyPI (with pip) or with Conda

  • You install your own packages with the pip install command (This is the recommended way on HPC2N)

  • At UPPMAX Conda is also available (See Conda section)