Big data with Python

Learning outcomes

Learners

  • can decide on useful file formats

  • can allocate resources sufficient for the data size

  • can use data chunking as a technique

  • have scratched the surface of more efficient packages

High-Performance Data Analytics (HPDA)

Types of scientific data

Bit and Byte

  • The smallest building block of storage in the computer is a bit, which stores either a 0 or 1.

  • Normally, 8 bits are grouped together to make one byte.

  • One byte (8 bits) can represent/hold at most 2^8 = 256 distinct values. Organising bytes in different ways can represent different types of information, i.e. data.
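
For example, with NumPy (assumed to be available here, as it is used later in this lesson) you can inspect how many bytes a given data type uses per value:

import numpy as np

# An 8-bit integer occupies one byte and can hold 2**8 = 256 distinct values
print(np.dtype(np.int8).itemsize)                    # 1 byte per value
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127

# A 64-bit float occupies eight bytes per value
print(np.dtype(np.float64).itemsize)                 # 8 bytes per value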

Data and storage format

In real scientific applications, data is complex and structured and usually contains both numerical and text data. Here we list a few commonly used data and file storage formats.

An overview of common data formats

Name          | Human readable | Space efficiency | Arbitrary data | Tidy data | Array data | Long-term storage/sharing
------------- | -------------- | ---------------- | -------------- | --------- | ---------- | -------------------------
Pickle        | ❌             | 🟨               | ✅             | 🟨        | 🟨         | ❌
CSV           | ✅             | ❌               | ❌             | ✅        | 🟨         | ✅
Feather       | ❌             | ✅               | ❌             | ✅        | ❌         | ❌
Parquet       | ❌             | ✅               | 🟨             | ✅        | 🟨         | ✅
npy           | ❌             | 🟨               | ❌             | ❌        | ✅         | ❌
HDF5          | ❌             | ✅               | ❌             | ❌        | ✅         | ✅
NetCDF4       | ❌             | ✅               | ❌             | ❌        | ✅         | ✅
JSON          | ✅             | ❌               | 🟨             | ❌        | ❌         | ✅
Excel         | ❌             | ❌               | ❌             | 🟨        | ❌         | 🟨
Graph formats | 🟨             | 🟨               | ❌             | ❌        | ❌         | ✅

Important

Legend

  • ✅ : Good

  • 🟨 : Ok / depends on a case

  • ❌ : Bad

Adapted from Aalto University's Python for Scientific Computing course.

See also

  • ENCCS course “HPDA-Python”: Scientific data

  • Aalto Scientific Computing course “Python for Scientific Computing”: Xarray
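
To get a feeling for the trade-offs in the table, you can write the same table to a few formats and compare the file sizes. This is only a sketch with made-up file names; it assumes pandas plus a Parquet engine (e.g. pyarrow) and PyTables (for HDF5) are installed:

import os
import numpy as np
import pandas as pd

# A simple "tidy" table with one million rows
df = pd.DataFrame({
    "x": np.random.random(1_000_000),
    "y": np.random.randint(0, 100, 1_000_000),
})

df.to_csv("data.csv", index=False)        # human readable, but large
df.to_parquet("data.parquet")             # binary and compressed (needs pyarrow)
df.to_hdf("data.h5", key="df", mode="w")  # HDF5 (needs PyTables)

for fname in ["data.csv", "data.parquet", "data.h5"]:
    print(fname, os.path.getsize(fname), "bytes")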

Computing efficiency with Python

Python is an interpreted language, and many of the features that make development with Python rapid come at the price of reduced performance in many cases:

  • Dynamic typing

  • Flexible data structures

  • There are packages that can be more efficient than plain NumPy and Pandas for certain tasks.

    • SciPy is a library that builds on top of NumPy (a short example follows below this list).

      • It contains a lot of interfaces to battle-tested numerical routines written in Fortran or C, as well as Python implementations of many common algorithms.

    • See also the ENCCS course material.
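
As an illustration of the SciPy point above, here is a minimal sketch of calling one of those battle-tested routines, the linear solver from scipy.linalg (which uses LAPACK under the hood):

import numpy as np
from scipy import linalg

# Solve the linear system A x = b
A = np.array([[3.0, 2.0],
              [1.0, 4.0]])
b = np.array([5.0, 6.0])

x = linalg.solve(A, b)
print(x)                      # solution vector
print(np.allclose(A @ x, b))  # True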

The xarray package

  • xarray is a Python package that builds on NumPy but adds labels to multi-dimensional arrays.

  • It also borrows heavily from the Pandas package for labelled tabular data and integrates tightly with dask for parallel computing.

  • Xarray is particularly tailored to working with NetCDF files.

  • It reads from and writes to NetCDF files using

    • open_dataset() function

    • open_dataarray() function

    • to_netcdf() method.

  • Explore these in the exercise below!
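
As a minimal sketch of what those labels give you (the variable and coordinate names here are made up for illustration), you can select data by coordinate values instead of integer positions:

import numpy as np
import xarray as xr

# A 2D array with named dimensions and labelled coordinates
temperature = xr.DataArray(
    np.random.random((3, 4)),
    dims=("time", "station"),
    coords={"time": [2021, 2022, 2023],
            "station": ["A", "B", "C", "D"]},
)

# Select by label instead of by integer index
print(temperature.sel(time=2022, station="B"))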

Allocating RAM

  • Storing the data in an efficient way is one thing!

  • Using the data in a program is another.

  • How much is actually loaded into working memory (RAM)?

  • How much additional data is created in variables during the run?
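
A rough way to answer these questions is to check how many bytes your arrays occupy before scaling up; a minimal NumPy sketch:

import numpy as np

# A 20000 x 20000 array of 64-bit floats
x = np.zeros((20000, 20000))
print(x.nbytes / 1e9, "GB")  # 3.2 GB for this single array

# Intermediate results need memory too: an expression like x + x.T
# temporarily creates another array of the same size.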

Important

  • Allocate many cores or a full node!

  • You do not have to explicitly use threads or other parallelism to benefit from the extra memory.

  • Note that shared memory among the cores only works within a single node.

Discussion

  • Take some time to find the answers to the questions below, using the table of hardware

  • I’ll ask around in a few minutes

  • If necessary, choose a node with more RAM
    • See your local HPC centre's documentation on how to do so!

Dask

How to use more resources than available?

[Figure: when to use Pandas (when-to-use-pandas.png)]

Dask is very popular for data analysis and is used by a number of high-level Python libraries, for example xarray and Dask-ML.

  • Dask is composed of two parts:

    • Dask Clusters
      • Dynamic task scheduling optimized for computation. Similar to other workflow management systems, but optimized for interactive computational workloads.

      • ENCCS course

    • “Big Data” Collections
      • Like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.

      • ENCCS course

Dask Collections

  • Dask provides dynamic parallel task scheduling and three main high-level collections: dask.array, dask.dataframe, and dask.bag.

  • A Dask array looks and feels a lot like a NumPy array.

  • However, a Dask array uses the so-called “lazy” execution mode, which allows one to
    • build up complex, large calculations symbolically

    • before turning them over to the scheduler for execution.

  • Dask divides arrays into many small pieces (chunks), as small as necessary to fit them into memory.

  • Operations are delayed (lazy computing), i.e. tasks are queued and no computation is performed until you actually ask for values to be computed (for instance, printing mean values).

  • Then data is loaded into memory and computation proceeds in a streaming fashion, block-by-block.

Example from dask.org

# Arrays implement the Numpy API
import dask.array as da
x = da.random.random(size=(10000, 10000),
                     chunks=(1000, 1000))
x + x.T - x.mean(axis=0)
# It runs using multiple threads on your machine.
# It could also be distributed to multiple machines
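
Note that the expression above only builds a lazy task graph; nothing has been computed yet. To execute it and get a concrete (NumPy) result, call .compute(), for example:

y = x + x.T - x.mean(axis=0)
result = y.compute()  # runs the task graph, chunk by chunk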

See also

  • dask_ml package: Dask-ML provides scalable machine learning in Python using Dask alongside popular machine learning libraries like Scikit-Learn, XGBoost, and others.

  • Dask.distributed: Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate sized clusters.
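
As a minimal sketch (assuming dask.distributed is installed; the worker count is arbitrary here), a local cluster of workers can be started on the node you have allocated, and Dask collections will then use it:

from dask.distributed import Client
import dask.array as da

client = Client(n_workers=4)  # starts a local scheduler plus 4 worker processes

x = da.random.random((10000, 10000), chunks=(1000, 1000))
print(x.mean().compute())     # the computation runs on the workers

client.close()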

Exercises

Note

  • You can do the Python exercises in the Python command line or in Spyder, but they work best in Jupyter.

  • You may want to use an interactive session (on-demand or interactive/salloc)

Compute allocations in this workshop

  • Rackham: uppmax2025-2-296

  • Kebnekaise: hpc2n2025-076

  • Cosmos: lu2025-7-34

  • Tetralith: naiss2025-22-403

  • Dardel: naiss2025-22-403

Storage space for this workshop

  • Rackham: /proj/hpc-python-uppmax

  • Kebnekaise: /proj/nobackup/hpc-python-spring

  • Cosmos: /lunarc/nobackup/projects/lu2024-17-44

  • Tetralith: /proj/hpc-python-spring-naiss

  • Dardel: /cfs/klemming/projects/snic/hpc-python-spring-naiss

Load and run

Important

For this session you should load

ml GCC/12.3.0 Python/3.11.3 SciPy-bundle/2023.07 matplotlib/3.7.2 Tkinter/3.11.3
  • And install dask and xarray to ~/.local/ if you don't already have them

pip install xarray dask

While installing or asking for compute nodes, read a bit more about the different data storage formats

  • Give it 5 minutes

  • Discuss the different formats in the group a bit later

  • Do you have a new favorite?

Chunk sizes in Dask

  • The following examples calculate the mean value of a randomly generated array.

  • Run the two examples (the NumPy version and the Dask sketch that follows) and see the performance improvement from using Dask.

# Run the import in its own cell, since %%time must be the first line of a cell
import numpy as np

%%time
x = np.random.random((20000, 20000))
y = x.mean(axis=0)
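
A possible Dask counterpart is sketched below (the 1000 x 1000 chunk size is just a starting point, try others):

# Again, run the import in its own cell
import dask.array as da

%%time
x = da.random.random((20000, 20000), chunks=(1000, 1000))
y = x.mean(axis=0)
y.compute()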

But what happens if we use different chunk sizes? Try it out:

  • What happens with chunks=(20000, 20000)?

  • What happens with chunks=(250, 250)?

Use Xarray to work with NetCDF files

This exercise is derived from Xarray Tutorials, which is distributed under an Apache-2.0 License.

First create an Xarray dataset:

import numpy as np
import xarray as xr

ds1 = xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.random.randn(4, 2)),
        "b": (("z", "x"), np.random.randn(6, 4)),
    },
    coords={
        "x": np.arange(4),
        "y": np.arange(-2, 0),
        "z": np.arange(-3, 3),
    },
)
ds2 = xr.Dataset(
    data_vars={
        "a": (("x", "y"), np.random.randn(7, 3)),
        "b": (("z", "x"), np.random.randn(2, 7)),
    },
    coords={
        "x": np.arange(6, 13),
        "y": np.arange(3),
        "z": np.arange(3, 5),
    },
)

Then write the datasets to disk using the to_netcdf() method:

ds1.to_netcdf("ds1.nc")
ds2.to_netcdf("ds2.nc")

You can read an individual file from disk using the open_dataset() function:

ds3 = xr.open_dataset("ds1.nc")

or using the load_dataset() function:

ds4 = xr.load_dataset('ds1.nc')

Tasks:

  • Explore the hierarchical structure of the ds1 and ds2 datasets in a Jupyter notebook by typing the variable names in a code cell and execute. Click the disk-looking objects on the right to expand the fields.

  • Explore ds3 and ds4 datasets, and compare them with ds1. What are the differences?

Follow-up discussion

  • Any new insights?

  • Useful file formats

  • Resources sufficient to data size

  • Data-chunking as technique if not enough RAM

  • Is xarray useful for you?

Keypoints

  • File formats
    • No format fits all requirements

    • HDF5 and NetCDF4 are good for big data

  • Packages
    • xarray
      • can deal with 3D data and higher dimensions

    • Dask
      • uses lazy execution

      • Use it only for processing very large amounts of data

  • Allocate more RAM by asking for
    • Several cores

    • Nodes with more RAM