
Reproducibility and Workflow Management

Max Planck Institute for Astronomy, Königstuhl 17, 69117 Heidelberg, Germany

What is Reproducibility?

Reproducibility (closely related to replicability and repeatability) is a major principle underpinning the scientific method. For the findings of a study to be reproducible means that the results obtained by an experiment, an observational study, or a statistical analysis of a data set can be achieved again with a high degree of reliability when the work is redone. There are different kinds of “re-doing”, depending on whether the data, the methods, and the team involved in the analysis are the same.

What are the differences between the terms?

Definitions

🔄 Repeatability

Definition: The ability to obtain the same results using the same experimental or observational setup, under the same conditions, and (usually) by the same team.

Key Ideas:


🔁 Reproducibility

Definition: The ability to obtain the same results using the same data and the same code or analysis pipeline, whether by a different team or by the same team at a later time.

Key Ideas:


🧪 Replicability

Definition: The ability to obtain consistent results using new data or a different setup that tests the same hypothesis.

Key Ideas:

Examples

Example 1: Galaxy Stellar Mass Function


Example 2: Exoplanet Detection via Transit Method


Example 3: Weak Lensing in Cluster Surveys

What if we use the same data but a different method?

Why is reproducibility important?

Reproducibility is crucial in research because it is the foundation of scientific integrity, trust, and progress. Here are the key reasons why:

🧭 1. Verifies Scientific Accuracy: If a result can be independently reproduced using the same data and code, it confirms that the original analysis was correctly performed. It helps catch mistakes, bugs, or unintentional biases in data processing.

🔍 2. Builds Trust and Transparency: Open, reproducible research allows others to see exactly how results were obtained. This transparency increases confidence in the findings, especially in high-impact or policy-relevant work.

🔁 3. Enables Reuse and Extension: Other researchers can build on your work more easily if your code and data are available and reproducible. This accelerates discovery and innovation by reducing duplication of effort.

🧪 4. Supports Scientific Self-Correction: Science advances by challenging and refining previous findings. Reproducibility allows the community to evaluate which results are robust and which need revision.

📈 5. Enhances Training and Education: Well-documented, reproducible research makes it easier to teach students and early-career researchers how complex analyses are done in practice.

💼 6. Improves Credibility with Funders and Journals: Many funders, journals, and institutions now require data and code sharing as part of reproducibility and open science initiatives.

In short, reproducibility turns research from a personal insight into a public contribution — one that others can understand, verify, and build on.

BUT IT’S. SO. HARD.

Yes, it is hard to make work reproducible after the fact. Not so much if you start working with reproducibility in mind.

Try out the tools we cover today as part of your next project, not for your colleagues but for yourself, and see if you find any use in them.

Reproduction Packages

One way to make sure your project is well organized is to use a template: a standard layout for the data and scripts behind a paper that makes it possible for someone else (say, a student) to reproduce your work. Depending on the project, this may include:

Example Structure:

Existing Templates

The “easiest” way to create a reproducible workflow is to use a template that forces you to capture all the necessary information in the same place.

Open data template: https://github.com/jdeplaa/open-data-template

Data Science Cookiecutter: https://cookiecutter-data-science.drivendata.org/

Show your work: https://github.com/showyourwork/showyourwork (currently broken)

(Notice all the .gitignores!!! The data should not go in the repo!!!)

Work through the Installation and quickstart pages.

Data Versioning

🔹 1. Manual Versioning (File-Based)

Store datasets as separate files with explicit version labels (e.g., data_v1.csv, data_v2.1.fits).

Simple but error-prone for large or collaborative projects.

Pros: Easy to implement.
Cons: No change tracking or metadata; manual bookkeeping.

🔹 2. Checksum or Hashing

Use tools like md5sum or sha256 to generate a fingerprint of a dataset.

Include the hash in your publication or README to verify exact versions.

Pros: Verifies data integrity.
Cons: Doesn’t help manage versions, only verify them.
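As a small Python sketch of this approach (the file names here are hypothetical): compute a SHA-256 fingerprint of a dataset and append it to a manifest you can publish alongside the paper or quote in the README.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=2**20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    data_file = Path("data_v2.1.fits")  # hypothetical dataset
    digest = sha256_of(data_file)
    # Record the fingerprint in a manifest that lives next to the paper/README
    with open("CHECKSUMS.txt", "a") as manifest:
        manifest.write(f"{digest}  {data_file.name}\n")
    print(digest, data_file.name)
```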

🔹 3. Git or Git-LFS (for Small/Medium Files)

Git can track small text-based datasets.

Git-LFS (Large File Storage) supports larger files (like .csv, .fits, or .hdf5) by storing them outside the Git repository while keeping version history.

Pros: Version control with history; integrates with Git workflows.
Cons: Git doesn’t scale well to very large datasets or binary blobs.

🔹 4. DVC (Data Version Control)

A Git-like tool specifically for tracking datasets and models.

Stores metadata in Git and data itself in remote storage (e.g., S3, GDrive, Azure Blob).

Pros: Designed for machine learning and scientific workflows; integrates well with code.
Cons: Adds some setup complexity; collaborators need to install DVC.

🔹 5. Zenodo, Figshare, or Dryad (DOI-Based Archives)

Upload a dataset to a long-term repository that provides a DOI (Digital Object Identifier).

Some platforms (like Zenodo) allow versioned deposits with persistent links.

Pros: Citable, FAIR-compliant, supports open access.
Cons: Not designed for frequent updates or large datasets (GBs–TBs).
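Zenodo also exposes record metadata through its REST API, which is handy for checking that a downloaded file matches the published deposit. A hedged sketch (the record ID is hypothetical, and the exact JSON layout may differ between API versions):

```python
import requests

RECORD_ID = "1234567"  # hypothetical Zenodo record ID
resp = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}", timeout=30)
resp.raise_for_status()
record = resp.json()

# Print the DOI and the checksums Zenodo stores for each file, so readers
# can verify they downloaded the exact published version.
print("DOI:", record.get("doi"))

files = record.get("files", [])
if isinstance(files, dict):  # some API versions nest the file entries
    files = list(files.get("entries", {}).values())
for entry in files:
    print(entry.get("key"), entry.get("checksum"))
```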

🔹 6. Dataverse or Institutional Repositories

Supports formal data publication with versioning, metadata, and access control.

Good for institutional reproducibility mandates.

Pros: Robust metadata; structured versioning; long-term preservation.
Cons: Varies by institution; may be less flexible for large-scale collaborative workflows.

🔹 7. Database Snapshots

If working with dynamic databases (e.g., SQL, APIs), take a snapshot (dump) and record the timestamp or version.

Save as a static file or container image for reproducibility.

Pros: Captures dynamic sources.
Cons: Snapshots can be large and hard to archive if not properly compressed/indexed.
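As a minimal sketch, assuming a local SQLite database (the file names are hypothetical; for PostgreSQL or MySQL you would use pg_dump or mysqldump instead), a timestamped snapshot can be written like this:

```python
import sqlite3
from datetime import datetime, timezone

DB_PATH = "observations.sqlite"  # hypothetical local database
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
dump_path = f"observations_snapshot_{stamp}.sql"

conn = sqlite3.connect(DB_PATH)
with open(dump_path, "w") as f:
    for line in conn.iterdump():  # full SQL dump of schema + data
        f.write(line + "\n")
conn.close()

print("Wrote snapshot:", dump_path)
```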

Recommended tool/platform by task:

- Simple file tracking: Git + semantic filenames
- Binary or large files: Git-LFS, DVC
- Long-term, citable archive: Zenodo, Figshare, Dryad
- Team-scale reproducibility: DVC + remote storage
- Dynamic or database-based data: snapshots with version tags

Data Versioning by Volume:

< 1 GB:
- Git (for text/CSV)
- Git-LFS (for binary files)
- Zenodo/Figshare for archiving

1–10 GB:
- Git-LFS (for moderate file sizes)
- DVC with remote storage (e.g., S3, GDrive)
- Archive to Zenodo with DOIs

10–100 GB:
- DVC with S3, Azure, or institutional object storage
- RClone or rsync for syncing large datasets
- Publish a snapshot to Zenodo (check size limits) or an institutional repository

100 GB–1 TB:
- DVC or Pachyderm for full data pipelines
- Structured cloud storage buckets
- Archive metadata and reference paths (not full data) in Git
- Institutional repositories

1–10 TB:
- DVC + scalable cloud storage (e.g., AWS S3, Google Cloud Storage)
- Snapshot-based backups (e.g., ZFS/Btrfs or database dumps)
- Versioning via hash-based manifest files or tools like Quilt
- DOIs should reference access instructions, not raw files
- Institutional repositories

Zenodo limits (https://about.zenodo.org/policies/):

- Default limit: 50 GB total file size per record, with a maximum of 100 files.
- Quota increase: a one-time quota increase of up to 200 GB is available for a single record.
- File count: more than 100 files can be accommodated by archiving them into a single ZIP file.
- Case by case: Zenodo may grant larger quotas on request if you need more space.

Git-LFS also has limits:

Maximum file size per product:

- GitHub Free: 2 GB
- GitHub Pro: 2 GB
- GitHub Team: 4 GB
- GitHub Enterprise Cloud: 5 GB

DVC

What is it and how it works: https://dvc.org/doc/user-guide

Options for remote storage: https://dvc.org/doc/user-guide/data-management/remote-storage

Installation: https://dvc.org/doc/install

Versioning and data models tutorial: https://dvc.org/doc/use-cases/versioning-data-and-models/tutorial
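Besides the command line, DVC has a small Python API for reading tracked data programmatically. A sketch, assuming a hypothetical repository, file path, and tag:

```python
import dvc.api

# Open a specific, tagged version of a DVC-tracked file, pulled from the
# repository's configured remote storage rather than a local copy.
with dvc.api.open(
    "data/catalog.csv",                      # hypothetical tracked file
    repo="https://github.com/user/project",  # hypothetical repository
    rev="v1.0",                              # git tag, branch, or commit
) as f:
    header = f.readline()
    print(header)
```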

Workflows

Make:

https://mfouesneau.github.io/astro_ds/chapters/Unix/chapter5-snakemake.html
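The core idea behind Make (and Snakemake) is that every step declares its inputs and outputs, and a step is rerun only when its output is missing or older than its inputs. A toy Python sketch of that logic, just to illustrate what these tools automate for you (the file names and scripts are hypothetical):

```python
import os
import subprocess

def needs_rebuild(target, sources):
    """Rebuild if the target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    t = os.path.getmtime(target)
    return any(os.path.getmtime(src) > t for src in sources)

# Hypothetical two-step pipeline: raw data -> cleaned table -> figure
if needs_rebuild("cleaned.csv", ["raw.csv", "clean.py"]):
    subprocess.run(["python", "clean.py", "raw.csv", "cleaned.csv"], check=True)

if needs_rebuild("figure.png", ["cleaned.csv", "plot.py"]):
    subprocess.run(["python", "plot.py", "cleaned.csv", "figure.png"], check=True)
```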

Snakemake

https://slides.com/johanneskoester/snakemake-tutorial

Pixi and Snakemake

PyCon talk by Raphael: https://github.com/TheSkyentist/pycon2025-examples

Pixi:

https://pixi.sh/latest/

https://pixi.sh/latest/#getting-started

Work through 01_linear_workflow/Snakefile from the examples repository linked above.

Containers

https://carpentries-incubator.github.io/docker-introduction/index.html

https://docs.sylabs.io/guides/3.5/user-guide/introduction.html

One step further: Open Workflows

(Based on Goodman et al., 2014, Ten Simple Rules for the Care and Feeding of Scientific Data)

Data Science Experiment Trackers

Weights & Biases: https://docs.wandb.ai/tutorials/

AimStack: https://aimstack.io/
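As a minimal sketch of what an experiment tracker records, here is a hypothetical run logged with the Weights & Biases Python client (the project name and metrics are made up, and running it requires a wandb account and login):

```python
import math
import wandb

# Start a tracked run; the hyperparameters are stored alongside the results.
run = wandb.init(project="reproducibility-demo",  # hypothetical project name
                 config={"learning_rate": 1e-3, "n_epochs": 5})

for epoch in range(run.config["n_epochs"]):
    fake_loss = math.exp(-epoch)  # stand-in for a real training metric
    wandb.log({"epoch": epoch, "loss": fake_loss})

run.finish()
```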

What is reasonable reuse?

It is practically impossible to document every single decision that goes into making a dataset. Some choices will seem arbitrary, some will have to be arbitrary, and most will not be the best possible choices for the people who want to reuse your data. So think about what level of reproducibility you want your users to have. What is a reasonable level of reproducibility your users may hope for or expect?

Publish Workflow

There are applications dedicated to managing sequences of reduction and analysis steps (Taverna, Kepler, Wings, etc.). These are not commonly used in astronomy.

Instead, consider publishing your Jupyter notebooks; AAS journals, for example, will publish notebooks.

Publish your Snakemake files.

And your SLURM scripts.

Publish your code.

Put your code on GitHub, add a license, archive the repository on Zenodo, and add the resulting DOI to the GitHub README. And add a CITATION.cff file. Just do it. Better published than tidy.

Attend the workshop on software packaging.

Also consider JOSS, the Journal of Open Source Software (for software representing a significant scholarly effort), which has a joint publication arrangement with the AAS journals.

Papers are still the most findable artifacts in research.

Always state how you want to get credit!

How should people cite your code? How should they cite your data? Don’t make them cite an (in prep.) paper. Use DOIs and published papers.

“make information about you (e.g., your name, institution), about the data and/or code (e.g., origin, version, associated files, and metadata), and about exactly how you would like to get credit, as clear as possible.”

It is astounding how many repositories on GitHub have no contact information for the author!

Reward colleagues who share data and code

Be a booster for open science

Progress will not happen by itself. The practices described here are increasingly incentivized by requirements from journals and funding agencies, but the time and skills required to actually do them are still not being valued.

At a local level, principal investigators (PIs) can have the most impact, requiring that the research their lab produces follow these recommendations. Even if a PI doesn’t have a background in computation, they can require that students show and share their code in lab meetings and with lab mates, those data are available and accessible to all in the lab, and that computational methods sections are comprehensive. PIs can also value the time it takes to do these things effectively and provide opportunities for training.

Many funding agencies now require data-management plans, education, and outreach activities. Just having a plan does not mean you have a good plan or that you will actually execute it. Be a good community citizen by setting an example for what good open science looks like. The true cost of implementing these plans includes training; it is unfair as well as counterproductive to insist that researchers do things without teaching them how. We believe it is now time for funders to invest in such training; we hope that our recommendations will help shape consensus on what “good enough” looks like and how to achieve it.

References
  1. Goodman, A., Pepe, A., Blocker, A. W., Borgman, C. L., Cranmer, K., Crosas, M., Di Stefano, R., Gil, Y., Groth, P., Hedstrom, M., Hogg, D. W., Kashyap, V., Mahabal, A., Siemiginowska, A., & Slavkovic, A. (2014). Ten Simple Rules for the Care and Feeding of Scientific Data. PLoS Computational Biology, 10(4), e1003542. 10.1371/journal.pcbi.1003542