More about reproducibility

Versioning systems

Semantic (SemVer)

Given a version number MAJOR.MINOR.PATCH, increment the:

  1. MAJOR version when you make incompatible API changes

  2. MINOR version when you add functionality in a backwards compatible manner

  3. PATCH version when you make backwards compatible bug fixes Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

Isolated tracks

<https://nehckl0.medium.com/semver-and-calver-2-popular-software-versioning-schemes-96be80efe36

Calender (CalVer)

Isolated tracks

<https://nehckl0.medium.com/semver-and-calver-2-popular-software-versioning-schemes-96be80efe36

Recording dependencies

  • Reproducibility: We can control our code but how can we control dependencies?

  • 10-year challenge: Try to build/run your own code that you have created 10 (or less) years ago. Will your code from today work in 5 years if you don’t change it?

  • Dependency hell: Different codes on the same environment can have conflicting dependencies.

Conda, Anaconda, pip, Virtualenv, Pipenv, pyenv, Poetry, requirements.txt …

These tools try to solve the following problems:

  • Defining a specific set of dependencies, possibly with well defined versions

  • Installing those dependencies mostly automatically

  • Recording the versions for all dependencies

  • Isolated environments

  • On your computer for projects so they can use different software.

  • Isolate environments on computers with many users (and allow self-installations)

  • Using different Python/R versions per project??

  • Provide tools and services to share packages

The tools

  • Python

    • PyPI -pip freeze > requirements.txt

    • Conda: any language, also compiled code and libraries.

      • conda-forge is a GitHub organization containing repositories of conda recipes.

      • Export the requirements into requirements.txt with conda list –export > requirements.txt.

      • Export the full environment using conda env export > environment.yml, and compare the .yml file format to the .txt file format.

    • Virtualenv

      • Pipenv

      • Poetry

      • Pyenv

      • Mamba (faster conda)

  • R

  • Packrat, jetpack, rsuite, renv, automagic, deplearning, devtools

  • C/C+

    • CMake

    • Conan

    • Conda

  • Fortran

  • Fortran package manager

  • Julia

    • Pkg.jl

      • designed around using isolated environments with independent sets of packages. Environments can either be local to a particular project or shared and selected by name.

Course advertisement

Python for scientific computing

Recording computational steps

Questions

  • How can we create a reproducible workflow?

  • When to use scientific workflow management systems.

Objectives

  • Discuss pros/cons of GUI vs. manual steps vs. scripted vs. workflow tools.

4 ways (from CodeRefinery)

Graphical user interface (GUI)

Imagine we have programmed a GUI with a nice interface with icons where you can select scripts and input files by clicking:

  1. Click on counting script

  2. Select book txt file

  3. Give a name for the dat file

  4. Click on a run symbol

  5. Click on plotting script

  6. Select book dat file

  7. Give a name for the image file

  8. Click on a run symbol

  9. Go to next book …

  10. Click on counting script

  11. Select book txt file

Disclaimer: not all GUIs behave this way - there exist very good GUI solutions which enable reproducibility and automation.


Solution 2: Manual steps

It is not too much work for four files:

$ python source/wordcount.py data/abyss.txt processed_data/abyss.dat
$ python source/plotcount.py processed_data/abyss.dat processed_data/abyss.png

$ python source/wordcount.py data/isles.txt processed_data/isles.dat
$ python source/plotcount.py processed_data/isles.dat processed_data/isles.png

$ python source/wordcount.py data/last.txt processed_data/last.dat
$ python source/plotcount.py processed_data/last.dat processed_data/last.png

$ python source/wordcount.py data/sierra.txt processed_data/sierra.dat
$ python source/plotcount.py processed_data/sierra.dat processed_data/sierra.png

$ python source/zipf_test.py processed_data/abyss.dat processed_data/isles.dat processed_data/last.dat processed_data/sierra.dat

This is imperative style: first do this, then to that, then do that, finally do …


Solution 3: Script

Let’s express it more compactly with a shell script (Bash). Let’s call it script.sh:

#!/usr/bin/env bash

# loop over all books
for title in abyss isles last sierra; do
    python source/wordcount.py data/${title}.txt processed_data/${title}.dat
    python source/plotcount.py processed_data/${title}.dat processed_data/${title}.png
done

# this could be done using variables but nevermind
python source/zipf_test.py processed_data/abyss.dat processed_data/isles.dat processed_data/last.dat processed_data/sierra.dat

We can run it with:

bash script.sh

Solution 4: Using Snakemake or other workflow manager

Isolated tracks

Isolated tracks of work.

Discussion

Discuss the pros and cons of these different approaches. Which are reproducible? Which scale to hundreds of books and which can it be automated?

Containers

Popular container implementations:

  • Docker

  • Singularity (popular on high-performance computing systems)

  • Apptainer (popular on high-performance computing systems, fork of Singularity)

  • podman

They are to some extent interoperable:

  • podman is very close to Docker

  • Docker images can be converted to Singularity/Apptainer images

  • Singularity Python can convert Dockerfiles to Singularity definition files

https://coderefinery.github.io/reproducible-research/environments

Reproducible publications

  • Git (overleaf, authorea), hackmd, manuscripts.io, google docs

  • Scholarly output reproducible: rrtools, jupyter, binder, research compendia

  • Reprohack

Keypoints

  • Preserve the steps for re-generating published results.

  • Hundreds of workflow management tools exist.

  • Snakemake is a comparatively simple and lightweight option to create transferable and scalable data analyses.

  • Sometimes a script is enough.