A Brief Introduction to the Seaborn Statistical Plotting Library

Seaborn is a plotting library built entirely with Matplotlib and designed to quickly and easily create presentation-ready statistical plots from Pandas data structures.

Seaborn can produce a wide variety of statistical plots, but in the interest of time, we will focus on a few that are especially hard to replicate with Matplotlib alone. We will use a couple of Seaborn’s built-in testing datasets, of which there are many.

Caution

Do NOT Rely on Seaborn for Regression Analysis!

Seaborn has a built-in regression function, but we will not cover it because Seaborn does not return any parameters needed to assess the quality of the fit. The official documentation itself warns that the Seaborn regression functions are only intended for quick and dirty visualizations to motivate proper in-depth analysis.

Load and Run Seaborn

Important

Interactive use (Recommended)

Go to the Open On-Demand web portal and start Jupyter Notebook (or VSCode) as described here in the Kebnekaise documentation and discussed on day 2 in the On-Demand lecture session. Available Spyder versions are old and generally not recommended.

Non-Interactive Use

To use Seaborn in a batch script, you can load

ml GCC/13.2.0 Seaborn/0.13.2

As usual, ml spider Seaborn shows the available versions and how to load them. These Seaborn modules are built to load their Matplotlib, Tkinter, and SciPy-bundle dependencies internally.

In all cases, once Seaborn or the module that provides it is loaded, it can be imported directly in Python. The typical abbreviation in online documentation is sns, but for those of us who never watched The West Wing, any sensible abbrevation will do. Here we use sb.

Attention

Remember: if you write Python scripts to be executed from the command line and you want any figures to open in a GUI window at runtime (as opposed to merely saving a figure to file), then your Python script will need to include matplotlib.use('Tkinter').

If you run these code snippets in Jupyter, you will need to include %% matplotlib inline

Common Features

Sample Datasets

This tutorial will make use of some of the free test data sets that Seaborn provides with the sb.load_dataset() function. These are also handy for playing with Pandas and a variety of machine learning packages (TensorFlow, PyTorch, etc.). The full list of datasets can be viewed with sb.get_dataset_names() (requires internet), and for more details, you can visit the GitHub repository and follow the links in the ReadMe under “Data Sources”. A few of the more popular data sets include…

  • 'penguins', sex-segregated measurements of the beaks, flippers, and body masses of 3 species of penguins that live on the Antarctic Peninsula.

  • 'iris', measurements of the petal and sepal dimensions of three species of iris flower.

  • 'titanic', records of the ticket class, demographics, and survival status of passengers on the Titanic

  • 'mpg', information about the model, year, physical characteristics, engine specifications, and fuel economy of a variety of cars.

  • 'planets', a much older, smaller sample of the exoplanets data we used in the Pandas seminar, with fewer physical and orbital parameters. Hopefully it will be updated soon.

For most of this tutorial, we will use the 'mpg' dataset. For one of the exercises, you will use the ‘penguins’` dataset.

Commonalities in Plotting

Seaborn plotting functions are designed to take Pandas DataFrames (or sometimes Series) as inputs. As such, different plot types share many of the same kwargs (there are no mandatory positional args). The following are the most important:

  • data—the DataFrame in which to search for the remaining kwargs. You can pass it as either the first positional arg or as a kwarg, but it is mandatory either way.

  • x and y—the names of two columns in your DataFrame to plot against each other. These are usually necessary, but not if you’re plotting every possible pairing of numerical data columns against each other all at once, as in pairplot or heatmap.

  • hue—this kwarg accepts a categorical variable (e.g. species, sex, brand, etc.) column name, groups the data by those categories, and plots them all on the same plot in a different color.

  • ax—this kwarg takes the name of an axis object if you want to add your Seaborn plot(s) as subplots on an existing figure.

“Figure vs. Axis-level interfaces.”

Whether you import matplotlib.pyplot and instantiate the usual fig, ax or not, Seaborn plotting commands look almost identical apart from the ax kwarg, which is only required if you want to add Seaborn subplots to other figures. If you use Seaborn plots with ax, they are essentially drop-in replacements for other Matplotlib axes methods, but you lose some of the nicer automatic formatting features, like exterior legends. Without the ax kwarg, a Seaborn plot will occupy a whole figure, which can make it trickier to format axes labels properly. A fuller explanation of the pros and cons of each approach is provided in the official documentation..

Caution

Seaborn typically titles axes by the variable names as they appear in the DataFrame, underscores and all. It’s easy enough to override the labels for a simple pairwise plot, but correcting the typesetting can get tedious and tricky when there are many subplots. An upcoming example will demonstrate one possible way to fix the axis label formatting.

Another common feature of Seaborn is that many of the high-level functions that you would ordinarily use are wrappers for more flexible base classes with methods that let you layer different plot types on top of each other. We only cover one case here, but keep in mind that if you need more customisation, check the documentation—almost everything is tunable.

Showing and Saving Figures

The easiest way to show and save figures produced with Seaborn is to import matplotlib.pyplot as plt and use the standard plt.show() and plt.savefig() commands. However, it is possible to get this functionality with Seaborn alone using the .figure accessor, though producing an interactive display can be very unintuitive when executing scripts directly from the command line.

Showing your figure

  • In an IDE, this is trivial: just assign your plotting command to a variable, and call .figure.show() off of that variable name.

  • In a script to be executed from the command line, you may as well use plt.show() because you typically still have to do import matplotlib and set matplotlib.use('TkAgg') or another backend to make the display open. Moreover, while plt.show() keeps the script from terminating until the user closes the graphic, for some reason .figure.show() does not, so the figure closes almost immediately after opening UNLESS you do one of the following:
    • After the line containing .figure.show(), add an input() command, something like input("Press any key to exit").

    • Run the script with the interactive -i option between python and the name of the script. Note that with this method, you will step into a Python shell after closing the figure.

Saving your figure

This is easier: you can assign your plotting command to a variable, and call .figure.savefig(fname) off of that variable name. Since .figure.savefig() is just a wrapper for the pyplot method of the same name, all the args and kwargs are the same. The main difference is that there is no default filename, so you must at minimum pass a file name string or path as the first arg.

Plotting with Seaborn

Here we will explore a few of the plot types Seaborn offers that are difficult to replicate in Matplotlib:

  1. sb.pairplot() (and the underlying sb.PairGrid() function)

  2. sb.jointplot() (the bivariate special case of pair plot)

  3. sb.heatmap() and sb.clustermap()

In the interest of time, we will not go into box-and-whisker or violin plots, but be aware that compared to the Matplotlib implementations, the Seaborn versions of those plotting functions produce much nicer results with far less work.

Pairplot and PairGrid

When first starting analysis, it is often necessary to view bivariate distributions of many combinations of numeric variables.

For a typical dataset and typical display settings, it is enough to use Seaborn’s pairplot() function, a wrapper around the underlying, more customizable PairGrid(). Any categorical column not specified with hue is ignored automatically. If you need more flexibility in what is displayed on, above, and below the diagonal, see the Seaborn documentation on PairGrid.

Let’s use the 'mpg' dataset for demonstration. First, we need to see how many variables there are and whether any of them take a small number of discrete values. If there are more than about 5-6 numeric variables, a pairplot featuring all of them can become hard to read if constrained to the size of a journal page, so it’s best to plot only as many as necessary.

import seaborn as sb
mpg = sb.load_dataset('mpg')
print(mpg.info())
print(mpg.nunique())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB
None
mpg             129
cylinders         5
displacement     82
horsepower       93
weight          351
acceleration     95
model_year       13
origin            3
name            305
dtype: int64

Let’s drop ‘cylinders’, ‘model_year’, and ‘name’, and keep ‘origin’ for the hue. While we’re at it, let’s see what it would take just to give the axis labels proper capitalization and units.

import seaborn as sb
mpg = sb.load_dataset('mpg')
temp = mpg.drop(['model_year','cylinders', 'name'], axis='columns')
g = sb.pairplot(data=temp, corner=True, hue='origin')
### corner=True just turns off the redundant upper off-diagonal plots
### everything from here down is just fixing the axis labels
import string
for i in range(5):
    for j in range(5):
        try:
            xlabel = g.axes[i,j].xaxis.get_label_text()
            ylabel = g.axes[i,j].yaxis.get_label_text()
            if xlabel == 'mpg':
                g.axes[i,j].set_xlabel('Fuel Economy [mpg]')
            elif xlabel=='weight':
                g.axes[i,j].set_xlabel('Weight [lbs]')
            else:
                g.axes[i,j].set_xlabel(string.capwords(xlabel))

            if ylabel == 'mpg':
                g.axes[i,j].set_ylabel('Fuel Economy [mpg]')
            elif ylabel=='weight':
                g.axes[i,j].set_ylabel('Weight [lbs]')
            else:
                g.axes[i,j].set_ylabel(string.capwords(ylabel))
        except AttributeError:
            pass
../_images/seaborn-new_1_0.png

As you can see, most of the code was spent fixing the labels. The plot itself required only 1 line.

By default the off-diagonals are scatter plots, and the marginal distributions on the diagonal are either histograms if the data are not shaded by a categorical variable, or kernel density estimations (KDEs, basically histograms smoothed by convolution with a usually Gaussian kernel) if the hue kwarg is used. These options can be modified without resorting to PairGrid(), as detailed in the documentation.

Exercise

Load the dataset 'penguins' and make a pairplot where the data are colored by 'species'. You do not need to do anything to format the axis labels.

Note

Unlike most other plots demonstrated here, pairplot() and PairGrid() do not have an ax kwarg because they are already plotting multiple subplots. They will and must occupy an entire figure.

Joint Plots

A joint plot is a special case of a pair plot with just 2 variables. The 1-line Seaborn .jointplot() command replaces roughly a dozen lines of pure Matplotlib commands. There is also an underlying, more tunable .JointGrid() function, similar to how .pairplot() wraps .PairGrid().

To demonstrate with the 'mpg' dataset, let’s plot the fuel economy in mpg against vehicle weight, and color the data by region of origin.

import seaborn as sb
mpg = sb.load_dataset('mpg')
jp = sb.jointplot(data=mpg, x='weight', y='mpg', hue='origin', marginal_ticks=True)
#fix the labels to make them presentable
from matplotlib import pyplot as plt
plt.xlabel('Weight [lbs]')
plt.ylabel('Fuel Economy [mpg]')
plt.show()
../_images/seaborn-new_3_0.png

By default the main plot is a scatter plot, and the marginal plots are either histograms if the data are not shaded by a categorical variable, or KDEs if the hue kwarg is used. The type of central plot can be changed with the kind kwarg, (see `the documentation on joint plots for options and other kwargs <https://seaborn.pydata.org/generated/seaborn.jointplot.html>__). Some options change the appearance of the marginal distributions.

Heatmap and Clustermap

Sometimes you have too many variables to look at with pair plots/corner plots, and the best you can do is map the correlation coeffcients between different parameters. Alternatively, you might have a DataFrame with a comparable number of numeric rows and columns, and you want to see how the rows and columns correlate. Either way, the DataFrame must be able to be coerced to ndarray.

Once again, making this type of plot is extremely tedious in pure Matplotlib, but can require as little as one line of code with Seaborn. There are two functions that do this: sb.heatmap() and sb.clustermap(). The main difference between the two is that clustermap() attempts to rearrange variables so those that are correlated are positioned next to each other and connected by a tree diagram.

The mpg DataFrame can’t be used directly, but the correlation matrix of it can be. Fortunately, .corr() is a DataFrame method. Let’s see what heatmap() looks like for the numeric columns of mpg.

import seaborn as sb
mpg = sb.load_dataset('mpg')
sb.heatmap(mpg.corr(numeric_only=True), annot=True, fmt=".2f", cbar_kws={'label':'Correlation Coefficients'})
<Axes: >
../_images/seaborn-new_4_1.png

Exercise

Reformat the code above to run on your own system in your choice of interface, but use clustermap instead of heatmap.

The most handy kwargs for these two functions is annot, which prints the values of the squares is True or accepts an alternative annotation array, and cbar_kws, which accepts formatting kwargs for Matplotlib’s fig.colorbar() as a dictionary. Also shown are fmt, which tells annot how to render the numbers and uses the same form as the expressions in the curly braces of {:}.format() after the colon. For other kwargs, see the heatmap() documentation here.

Warning

There is a bug in Seaborn/0.12.2 that causes heatmap() and clustermap() with annot=True to only label the top row. This is fixed in later versions.

Keypoints

  • Seaborn makes statistical plots easy and good-looking!

  • Seaborn plotting functions take in a Pandas DataFrame, sometimes the names of variables in the DataFrame to extract as x and y, and often a hue that makes different subsets of the data appear in different colors depending on the value of the given categorical variable.

  • Seaborn also offers datasets to play with.

  • Typesetting axes labels can be tedious, though.

  • Saving and showing figures are best done with the standard pyplot functions.