`pandas`

The pandas logo

Learning outcomes

At the end of this session, learners …

have practiced using the documentation of their favorite HPC cluster
understand what pandas is
understand why pandas is important
have run Python code that uses pandas
(optional) have read a comma-separated file using pandas
(optional) have saved a table as a comma-separated file using pandas
(optional) have seen the effect of the index argument when saving a table
(optional) have tried out some of the operations on the pandas page ‘10 minutes to pandas’

For teachers

Repeat:

Q: What is an HPC cluster? A: A group of computers working together to do things one computer would not be able to do
Q: When we are on an HPC cluster, what is the name of the computer we are on? A: The login node
Q: What is a software module? A: Pre-installed software of a specific version
Q: In which two ways can we find out which software module to load? A: search the documentation or search the modules using module spider
Q: If there is no documentation, how to find out which software module to load? A: search the modules using module spider
Q: If there is no documentation, and there is no software module, what do we do? A: we install this ourselves
Q: What is pip? A: The Python package installer
Q: How do we behave on the login node? A: We run only light jobs
Q: How do we run heavy calculations? A: We submit a job to the job scheduler

Prior:

What is data wrangling?
What is tidy data?

What is `pandas`?

From the pandas homepage:

pandas is [an] […] open source data analysis and manipulation tool […]

It allows you to work with data. For example, you can turn this messy data …

Country	1952	1957	1962
Albania	-9	-9	-9
Argentina	-9	-1	-1

using this pandas code …

import pandas as pd
table = pd.read_csv("dem_score.csv")
table = table.melt(id_vars = ["country"])

into this tidy data, which is easier to work with:

Country	Year	Democracy level
Albania	1952	-9
Albania	1957	-9
Albania	1962	-9
Argentina	1952	-9
Argentina	1957	-1
Argentina	1962	-1
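The conversion above can be tried without the file dem_score.csv. Below is a minimal sketch that builds the same small messy table by hand. The column names year and democracy_level, as well as the sorting step, are choices made for this sketch and not part of the code above.

```python
import pandas as pd

# The messy (wide) table from above: one column per year
wide = pd.DataFrame(
    {
        "country": ["Albania", "Argentina"],
        "1952": [-9, -9],
        "1957": [-9, -1],
        "1962": [-9, -1],
    }
)

# melt keeps 'country' and stacks the year columns into rows.
# var_name and value_name are column names chosen for this sketch
tidy = wide.melt(id_vars=["country"], var_name="year", value_name="democracy_level")

# Sort so that the rows of each country are together, as in the tidy table above
tidy = tidy.sort_values(["country", "year"]).reset_index(drop=True)
print(tidy)
```

Without the sorting step, melt returns all rows for 1952 first, then all rows for 1957, and so on.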

pandas can do many other things, such as reshaping data (from the pandas cheat sheet):

Reshaping functions, from the cheat sheet
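One of the reshaping functions on the cheat sheet is pivot, which does the opposite of melt. The sketch below is only an illustration: the table and its column names are made up to match the democracy example above.

```python
import pandas as pd

# A small tidy table: one row per country per year
tidy = pd.DataFrame(
    {
        "country": ["Albania", "Albania", "Argentina", "Argentina"],
        "year": [1952, 1957, 1952, 1957],
        "democracy_level": [-9, -9, -9, -1],
    }
)

# pivot is the opposite of melt: each year becomes its own column again
wide = tidy.pivot(index="country", columns="year", values="democracy_level")
print(wide)
```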

Why `pandas` is important

pandas is a popular Python package that allows you to work with data and it gives you a vocabulary (and the Python functions) to do so.

How popular is pandas?

pandas is not popular enough to be in the PyPI top 20.

However, the pandas PyPI statistics page shows that it has more than 600 million downloads per month. As the number 20 package has around 800 million downloads per month, we can infer that pandas is not far behind.

Exercises

Want to see the answers as a video?

HPC cluster	YouTube video
Alvis	YouTube video
COSMOS	YouTube video
Dardel	YouTube video
Kebnekaise	YouTube video
Pelle	YouTube video
Tetralith	YouTube video

Exercise 1: minimal code

Use the documentation of the HPC cluster you work on.

Answer: where is your documentation?

Sorted by HPC cluster:

HPC center	HPC cluster	HPC cluster-specific documentation
C3SE	Alvis	Documentation
UPPMAX	Bianca	Documentation
LUNARC	COSMOS	Documentation
PDC	Dardel	Documentation
HPC2N	Kebnekaise	Documentation
UPPMAX	Pelle	Documentation
NSC	Tetralith	Documentation

In that documentation, find the software module to load the pandas Python package.

Answer: where is the pandas documentation?

HPC cluster	HPC cluster-specific `pandas` documentation
Alvis	Has no documentation on how to load `pandas`
Bianca	`pandas` documentation
COSMOS	Has no documentation on how to load `pandas`
Dardel	Has no documentation on how to load `pandas`
Kebnekaise	`pandas` documentation
Pelle	`pandas` documentation
Tetralith	`pandas` documentation

In a terminal (on your HPC cluster), load the software module to use pandas.

Answer: how to load the pandas software module

HPC cluster	How to load the `pandas` software module
Alvis	`module load matplotlib/3.9.2-gfbf-2024a`
COSMOS	`module load GCC/13.2.0 Python/3.11.5 SciPy-bundle/2023.11`
Dardel	`module load python/3.12.3 ; pip3 install pandas`
Kebnekaise	`module load GCC/13.3.0 SciPy-bundle/2024.05`
Pelle	`module load SciPy-bundle/2024.05-gfbf-2024a`
Tetralith	`module load Python/3.10.4-env-hpc1-gcc-2022a-eb`

On your HPC cluster, create a script called pandas_exercise_1.py with the following code:

import pandas
print(pandas.__version__)

Run the script.

Answer: how to run the script

HPC cluster	How to run the script
Alvis	`python pandas_exercise_1.py`
COSMOS	`python pandas_exercise_1.py`
Dardel	`python3 pandas_exercise_1.py`
Kebnekaise	`python pandas_exercise_1.py`
Pelle	`python pandas_exercise_1.py`
Tetralith	`python pandas_exercise_1.py`

What do you see?

Answer: what does that look like?

The output looks similar to this:

3.0.1

Even though the code shows nothing directly useful, why is this a useful exercise anyway?

Answer

This is a useful exercise, because it proves that you have successfully loaded/installed pandas.
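The same check can be made a bit more informative. The sketch below is a hypothetical variation on pandas_exercise_1.py, not part of the exercise itself: it uses importlib.util.find_spec from the Python standard library to report whether pandas can be found before importing it.

```python
import importlib.util

# find_spec returns None when a package cannot be found,
# without actually importing it
spec = importlib.util.find_spec("pandas")

if spec is None:
    message = "pandas not found: load the software module or install pandas first"
else:
    import pandas

    message = f"pandas {pandas.__version__} found at {spec.origin}"

print(message)
```

The location in the message shows whether pandas comes from a software module or from your own installation.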

(optional) Exercise 2: reading and saving a comma-separated file

In this exercise, we will first read the ‘diamonds’ dataset (as a comma-separated file). This dataset about diamonds is described in the documentation of ggplot2 (an R package).

Download this file to the same folder as where you are running your Python code.

How do I do that?

There are many ways:

Click on the ‘diamonds’ dataset (as a comma-separated file). This will take you to a webpage with the data. Right-click and do ‘Save as’ to save this file to your computer
Download the file from the command-line:

wget https://raw.githubusercontent.com/UPPMAX/HPC-python/refs/heads/main/docs/day3/pandas/diamonds.csv

Your favorite alternative way

On your HPC cluster, create a script called pandas_exercise_2.py with the following code:

import pandas as pd

table = pd.read_csv("diamonds.csv")
print(table)

Run the script pandas_exercise_2.py.

Answer: how to run the script

HPC cluster	How to run the script
Alvis	`python pandas_exercise_2.py`
COSMOS	`python pandas_exercise_2.py`
Dardel	`python3 pandas_exercise_2.py`
Kebnekaise	`python pandas_exercise_2.py`
Pelle	`python pandas_exercise_2.py`
Tetralith	`python pandas_exercise_2.py`

What does the script pandas_exercise_2.py do?

Answer

It reads a comma-separated file into memory.
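Once the table is in memory, it can be inspected. The sketch below assumes nothing about your files: it embeds the first three rows of the diamonds dataset as text, so it runs without diamonds.csv.

```python
import io

import pandas as pd

# First rows of the diamonds dataset, embedded as text so that this
# sketch runs without downloading diamonds.csv
csv_text = """carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
"""

# read_csv accepts a file name or a file-like object
table = pd.read_csv(io.StringIO(csv_text))

print(table.shape)    # number of rows and columns
print(table.dtypes)   # the data type pandas inferred per column
print(table.head(2))  # the first two rows
```

With the real file, replace io.StringIO(csv_text) with "diamonds.csv".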

The next step is to save it. Add the following code to pandas_exercise_2.py:

table.to_csv("pandas_exercise_2.csv")

Again, run the script pandas_exercise_2.py.

Answer: how to run the script

HPC cluster	How to run the script
Alvis	`python pandas_exercise_2.py`
COSMOS	`python pandas_exercise_2.py`
Dardel	`python3 pandas_exercise_2.py`
Kebnekaise	`python pandas_exercise_2.py`
Pelle	`python pandas_exercise_2.py`
Tetralith	`python pandas_exercise_2.py`

Take a look at the file pandas_exercise_2.csv. What has been added to the data?

Answer

An index has been added to each row of the data:

,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75

In pandas_exercise_2.py, replace the last line with this version:

table.to_csv("pandas_exercise_2.csv", index = False)

Run pandas_exercise_2.py. What does the data look like now?

Answer

Now the file looks as shown below, without an index column:

carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75

What seems to be the most useful way to save: with or without indexing?

Answer

Typically, you will want to save without indexing.
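One reason to save without an index shows up when the file is read back. The sketch below uses a small made-up table with two columns of the diamonds dataset. It relies on to_csv returning the text when no file name is given, so no files are written.

```python
import io

import pandas as pd

table = pd.DataFrame({"carat": [0.23, 0.21], "cut": ["Ideal", "Premium"]})

# Without a file name, to_csv returns the text it would have written
with_index = table.to_csv()
without_index = table.to_csv(index=False)

# Reading the text back shows the difference in columns
columns_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
columns_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

The table saved with an index gains an extra column when read back, which pandas names "Unnamed: 0" because the index column has no header.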

Why would pandas supply this option, to save with/without indexing?

Answer

For backwards compatibility.

Indexing was a useful feature in the field pandas was initially developed in, so pandas always used indexing, with no way to disable this feature.

However, later it was found that indexing is not useful in other fields.

There were two options:

Remove indexing from pandas
Allow users to disable indexing

Removing indexing would cause old code to break, so this was decided against. Instead, it was decided to allow users to disable indexing when needed.

(optional) Exercise 3: tidy data

pandas shines when the data is tidy.

Search the web for ‘What is tidy data?’. Is the diamonds dataset tidy? Why?

Answer

I found the definition below from a tidyr (an R package) article:

In tidy data:

Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.

The diamonds dataset is tidy, because:

Each feature of each single diamond has a column. Each feature is observed at more-or-less the same time
Each diamond has its own row
Each value in the table is indeed one value

Now take a look at a dataset from this book called dem_score.csv. This dataset shows ratings of the level of democracy in different countries from 1952 to 1992, where the minimum value of -10 corresponds to a highly autocratic nation and the maximum value of 10 to a highly democratic nation. Here is what it looks like:

country,1952,1957,1962,1967,1972,1977,1982,1987,1992
Albania,-9,-9,-9,-9,-9,-9,-9,-9,5
Argentina,-9,-1,-1,-9,-9,-9,-8,8,7
Armenia,-9,-7,-7,-7,-7,-7,-7,-7,7
Australia,10,10,10,10,10,10,10,10,10

Is the dem_score dataset tidy? Why?

Answer

I found the definition below from a tidyr (an R package) article:

In tidy data:

Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.

The dem_score.csv dataset is not tidy, because:

For all except the first column, the column names are values: they are the years in which the measurements were done.
Each row contains multiple observations: per country, it shows the democratic index of 1952, the democratic index of 1957, etc.
Each value in the table is indeed one value

What would this data look like if it were tidy?

Answer

Here is what this data would look like if it were tidy:

country,year,democracy_level
Albania,1952,-9
Albania,1957,-9
Albania,1962,-9
Albania,1967,-9

Create a Python script called pandas_exercise_3.py. In that script, use pandas to read the dem_score.csv dataset, convert it to tidy data and save it as tidy_dem_scores.csv.

For this use:

the pandas cheat sheet. Tip: the function you will need is this one.
Your favorite web search engine
Your favorite AI

Answer

Here is the minimal code to do so:

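import pandas as pd

# Read the wide table: one column per year
table = pd.read_csv("dem_score.csv")
# Stack the year columns into rows, keeping the country column
table = table.melt(id_vars = ["country"])
# Give the new columns descriptive names
table = table.rename(columns = {"variable": "year", "value": "democracy_level"})
# Save without the index
table.to_csv("tidy_dem_scores.csv", index = False)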
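A detail that is easy to miss: because the years were column names, they are text after the conversion. The sketch below shows one way to turn them into numbers. It embeds two rows of dem_score.csv as text so it runs without the file, and the column names year and democracy_level are choices made for this sketch.

```python
import io

import pandas as pd

# Two rows of dem_score.csv, embedded as text so that this sketch
# runs without the file
csv_text = """country,1952,1957,1962,1967,1972,1977,1982,1987,1992
Albania,-9,-9,-9,-9,-9,-9,-9,-9,5
Argentina,-9,-1,-1,-9,-9,-9,-8,8,7
"""

table = pd.read_csv(io.StringIO(csv_text))
tidy = table.melt(id_vars=["country"], var_name="year", value_name="democracy_level")

# The years were column names, so after melt they are text, not numbers
tidy["year"] = tidy["year"].astype(int)

# Now the years can be used in numerical comparisons
recent = tidy[tidy["year"] >= 1987]
print(recent)
```

When the tidy table is saved and read again with read_csv, pandas infers the numeric type of the year column by itself.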

(optional) Exercise 4: what does `pandas` mean?

The word pandas is actually a shortened version of something. Search the internet for what it stands for. In which field did pandas originate?

Answer

pandas is short for ‘panel data’. Panel data is a type of data set used in econometrics. Econometrics is the field where pandas originated.

Done?

Go to the session about matplotlib
