pandas

Learning outcomes

At the end of this session, learners …

  • have practiced using the documentation of their favorite HPC cluster

  • understand what pandas is

  • understand why pandas is important

  • have run Python code that uses pandas

  • (optional) have read a comma-separated file using pandas

  • (optional) have saved a table as a comma-separated file using pandas

  • (optional) have seen the effect of the index argument when saving a table

  • (optional) have tried out some of the operations on the pandas page ‘10 minutes to pandas’

What is pandas?

From the pandas homepage:

pandas is [an] […] open source data analysis and manipulation tool […]

It allows you to work with data. For example, you can turn this messy data …

Country    1952  1957  1962
Albania    -9    -9    -9
Argentina  -9    -1    -1

using this pandas code …

import pandas as pd
table = pd.read_csv("dem_score.csv")
# Turn the year columns into rows: one row per country and year
table = table.melt(id_vars = ["country"])

into this tidy data, which is easier to work with:

Country    Year  Democracy level
Albania    1952  -9
Albania    1957  -9
Albania    1962  -9
Argentina  1952  -9
Argentina  1957  -1
Argentina  1962  -1
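To try this reshaping without downloading a file, here is a minimal sketch that types the three year columns shown above into a string and reads them from memory. The column names `year` and `democracy_level` are illustrative choices, not names taken from the lesson. The sorting step is added only so the rows appear per country, as in the table above.

```python
import io

import pandas as pd

# The first three year columns of the messy table, typed in by hand
csv_text = """country,1952,1957,1962
Albania,-9,-9,-9
Argentina,-9,-1,-1
"""

# Read the text as if it were a comma-separated file
wide = pd.read_csv(io.StringIO(csv_text))

# var_name and value_name are illustrative choices (assumptions);
# without them, pandas names the new columns 'variable' and 'value'
tidy = wide.melt(
    id_vars=["country"],
    var_name="year",
    value_name="democracy_level",
)

# melt puts all rows of one year together; sort to group rows per country
tidy = tidy.sort_values(["country", "year"]).reset_index(drop=True)
print(tidy)
```

Note that the years end up as text, because they started out as column names.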

pandas can do many other things, such as reshaping data (from the pandas cheat sheet):

Figure: reshaping functions, from the pandas cheat sheet

Why pandas is important

pandas is a popular Python package for working with data. It gives you a vocabulary, and the matching Python functions, to do so.

Exercises

Exercise 1: minimal code

Use the documentation of the HPC cluster you work on.

In that documentation, find the software module to load the pandas Python package.

In a terminal (on your HPC cluster), load the software module to use pandas.

On your HPC cluster, create a script called pandas_exercise_1.py with the following code:

import pandas
print(pandas.__version__)

Run the script.

What do you see?

Even though the code shows nothing directly useful, why is this a useful exercise anyway?
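As an aside, a slightly more defensive variant of the script could look like the sketch below. It assumes that a missing software module shows up as Python being unable to find the package. The helper `is_available` is a hypothetical name introduced here for illustration.

```python
import importlib.util


def is_available(package_name: str) -> bool:
    """Return True if Python can find the package in the current environment."""
    return importlib.util.find_spec(package_name) is not None


if is_available("pandas"):
    import pandas

    print("pandas version:", pandas.__version__)
else:
    # Hypothetical hint; the exact module name depends on your HPC cluster
    print("pandas not found: load the software module for pandas first")
```

The plain two-line script from the exercise is enough for this session. This variant only shows one way to get a friendlier message when the module has not been loaded.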

(optional) Exercise 2: reading and saving a comma-separated file

In this exercise, we will first read the ‘diamonds’ dataset, a dataset about diamonds that is stored as a comma-separated file. It is described in the documentation of ggplot2 (an R package).

Download this file to the folder where you run your Python code.

On your HPC cluster, create a script called pandas_exercise_2.py with the following code:

import pandas as pd

table = pd.read_csv("diamonds.csv")
print(table)

Run the script pandas_exercise_2.py.

What does the script pandas_exercise_2.py do?

The next step is to save the table. Add the following code to pandas_exercise_2.py:

table.to_csv("pandas_exercise_2.csv")

Again, run the script pandas_exercise_2.py.

Take a look at the file pandas_exercise_2.csv. What has been added to the data?

In pandas_exercise_2.py, replace the last line by this version:

table.to_csv("pandas_exercise_2.csv", index = False)

Run pandas_exercise_2.py. What does the data look like now?

What seems to be the most useful way to save: with or without the index?

Why would pandas supply this option to save with or without the index?
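To compare both ways of saving side by side without opening any files, here is a minimal sketch. It uses a tiny made-up table as a stand-in for the diamonds dataset (the values are hypothetical) and relies on `to_csv` returning the comma-separated text when no file name is given.

```python
import pandas as pd

# A tiny hypothetical table, standing in for the diamonds dataset
table = pd.DataFrame({"carat": [0.23, 0.21], "cut": ["Ideal", "Premium"]})

# With no file name, to_csv returns the text instead of writing a file
with_index = table.to_csv()
without_index = table.to_csv(index=False)

print("Saved with the index:")
print(with_index)
print("Saved without the index:")
print(without_index)
```

Comparing the first line of each output is a quick way to check your answer to the questions above.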

(optional) Exercise 3: tidy data

pandas shines when the data is tidy.

Search the web for ‘What is tidy data?’. Is the diamonds dataset tidy? Why?

Now take a look at a dataset from this book called dem_score.csv. This dataset shows ratings of the level of democracy in different countries from 1952 to 1992, where the minimum value of -10 corresponds to a highly autocratic nation and a value of 10 corresponds to a highly democratic nation. Here is what it looks like:

country,1952,1957,1962,1967,1972,1977,1982,1987,1992
Albania,-9,-9,-9,-9,-9,-9,-9,-9,5
Argentina,-9,-1,-1,-9,-9,-9,-8,8,7
Armenia,-9,-7,-7,-7,-7,-7,-7,-7,7
Australia,10,10,10,10,10,10,10,10,10

Is the dem_score dataset tidy? Why?

What would this data look like if it were tidy?

Create a Python script called pandas_exercise_3.py. In that script, use pandas to read the dem_score.csv dataset, convert it to tidy data and save it as tidy_dem_scores.csv.

For this use:

(optional) Exercise 4: what does pandas mean?

The word pandas is actually a shortened version of something. Search the internet for what it stands for. In which field did pandas originate?

Done?

Go to the session about matplotlib