
Data Management and Open Science

Max Planck Institute for Astronomy, Königstuhl 17, 69117 Heidelberg, Germany

# SharingIsCaring


## Definitions

### What is Data Management?

Data management comprises all disciplines related to handling data as a valuable resource; it is the practice of managing an organization’s data so it can be analyzed for decision making (Wikipedia). It includes areas such as data governance (who is responsible for what, who makes decisions, how the data are kept and by whom, and what the ethics are surrounding the collection, storage, and distribution of the data); data architecture (how data are structured); data modeling and design (how the data should be represented); database management; metadata; data quality; etc.

### What is Open Science?

Open science is a movement to make research as a whole (or specifically its artifacts: data, software, publications, etc.) and the dissemination of scientific results accessible to all levels of society.


Open research is another term, considered to be more inclusive of the humanities. Also sometimes used: open scholarship.

Generally, there are six pillars of open science:

In the context of OPEN DATA you will often hear about the FAIR principles: The FAIR principles (Findable, Accessible, Interoperable, and Reusable) are designed to make research data more discoverable, both by humans and machines, and to promote wider sharing and reuse of data. They can be applied to digital data and artefacts from any discipline. Working towards making research data more FAIR provides a range of benefits including:

FAIR research data is easier to understand and share with collaborators or with the wider world.

Findable: To make research data findable, publish the data with as much metadata as possible, plus a license, a DOI, and a sharing platform.

Accessible: Make sure one can actually download the data without barriers; consider offering programmatic access (an API).

Interoperable: To make research data interoperable, use community-accepted languages, formats and vocabularies in the data and metadata.

Reusable: Data should come with a clear human- and machine-readable licence and provenance information on how it was collected or generated. It should also abide by discipline-specific data and metadata standards where possible, to ensure it retains important context.

(also see the FORCE 11 original publication: https://force11.org/info/the-fair-data-principles/)
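As a concrete (and entirely hypothetical) illustration, a minimal machine-readable metadata record covering the FAIR basics — identifier, license, creators, provenance — might look like this; the field names loosely follow schema.org/Dataset and all values are placeholders:

```json
{
  "title": "Example survey catalogue",
  "identifier": "https://doi.org/10.5281/zenodo.0000000",
  "license": "CC-BY-4.0",
  "creators": ["Jane Doe (Example Institute)"],
  "dateCreated": "2024-01-15",
  "description": "Hypothetical catalogue used only as a metadata example",
  "provenance": "Reduced from raw frames with pipeline v1.0.0"
}
```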

## Motivation

Why talk about these topics?

Problems:

## Types of data

## Best practices for data management

(from Wilson et al. (2017), Good enough practices in scientific computing, PLoS Comput Biol 13(6): e1005510)

Taken in order, the recommendations above will produce intermediate data files with increasing levels of cleanliness and task specificity. An alternative approach to data management would be to fold all data-management tasks into a monolithic procedure for data analysis, so that intermediate data products are created “on the fly” and stored only in memory, not saved as distinct files.

While the latter approach may be appropriate for projects in which very little data cleaning or processing is needed, we recommend the explicit creation and retention of intermediate products. Saving intermediate files makes it easy to rerun parts of a data analysis pipeline, which in turn makes it less onerous to revisit and improve specific data-processing tasks. Breaking a lengthy workflow into pieces makes it easier to understand, share, describe, and modify. This is particularly true when working with large data sets, for which storage and transfer of the entire data set is not trivial or inexpensive.
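The idea of explicit intermediate products can be sketched in a few lines. This is a minimal, hypothetical example (the file names and the "drop rows with a missing flux" cleaning rule are illustrative, not a prescription): each stage reads the previous stage's file and writes a new, distinct file, so any step can be rerun on its own.

```python
# Sketch: one explicit data-cleaning step that saves its intermediate product.
import csv
from pathlib import Path

def clean(raw_path, out_path):
    """Read raw CSV rows, drop rows with a missing 'flux' value,
    and save the result as a distinct intermediate file."""
    raw_path, out_path = Path(raw_path), Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["flux"]:  # keep only rows with a measured flux
                writer.writerow(row)
    return out_path

# Chained on disk rather than in memory, e.g.:
# clean("data/raw/obs.csv", "data/cleaned/obs.csv")
```

Because the intermediate file sits on disk, revisiting the cleaning rule later does not require rerunning the upstream steps.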

## Publishing It

GET THYSELF A DIGITAL OBJECT IDENTIFIER!

“The best option for releasing your data with long-term guarantee is to deposit them in whatever data archive is the “go to” place for your field. A proper, trustworthy archive will: (1) assign an identifier such as a “handle” (hdl) or “digital object identifier” (doi); (2) require that you provide adequate documentation and metadata; and (3) manage the “care and feeding” of your data by employing good curation practices.”

Let’s be honest: astro journals don’t have teeth when it comes to requiring authors to publish data. MNRAS and A&A have a data availability policy, but one of the possible answers is “The data underlying this article will be shared on reasonable request to the corresponding author,” which is equivalent to “go away”.

MPG provided resources: https://www.mpdl.mpg.de/en/services/service-catalog.html?filter=research-services

MAST/IPAC and others can mint DOIs for datasets.

You can also give your final data back to its parent archive, MAST for example will take data and store it for you “in perpetuity”.

Other options:

A: General Data Repositories

Dataverse (http://thedata.org): A repository for research data that takes care of long-term preservation and good archival practices, while researchers can share, keep control of, and get recognition for their data.

FigShare (http://figshare.com): A repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner.

Zenodo (http://zenodo.org): A repository service that enables researchers, scientists, projects, and institutions to share and showcase multidisciplinary research results (data and publications) that are not part of existing institutional or subject-based repositories.

Dryad (http://datadryad.org): A repository that aims to make data archiving as simple and as rewarding as possible through a suite of services not necessarily provided by publishers or institutional websites.

B: Directories of Research Data Repositories

DataBib (http://databib.org): Databib is a tool for helping people identify and locate online repositories of research data. Users and bibliographers create and curate records that describe data repositories that users can search.

Re3data.org (http://www.re3data.org): Re3data is a global registry of research data repositories from different academic disciplines for researchers, funding bodies, publishers, and scholarly institutions.

Open Access Directory (http://oad.simmons.edu/oadwiki/Data_repositories): A list of repositories and databases for open data.

Force 11 Catalog (http://www.force11.org/catalog): A dynamic inventory of web-based scholarly resources, a collection of alternative publication systems, databases, organizations and groups, software, services, standards, formats, and training tools.

## Reproduction Packages

One way to make sure your project is well organized is to use a template: a standard structure for the data and scripts behind a paper, so that, for example, a student can reproduce your work. Depending on the project this may include:

Example Structure:
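One possible layout might look like the sketch below; the directory names are illustrative, not prescriptive:

```text
paper-repro-package/
├── README.md            # what this is and how to run it
├── LICENSE
├── CITATION.cff
├── environment.yml      # pinned software environment
├── data/
│   ├── raw/             # .gitignore'd; fetched from the archive
│   └── reduced/         # intermediate products
├── scripts/             # one script per processing step
├── notebooks/           # analysis and figures
└── results/             # final tables and figures
```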

Here is an example: https://github.com/jdeplaa/open-data-template

Another example: cookiecutter https://cookiecutter-data-science.drivendata.org/


Show your work: https://github.com/showyourwork/showyourwork

Notice all the .gitignores!!! The data should not go in the repo!!!
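A hypothetical `.gitignore` for a reproduction package might look like this (the paths are illustrative; the point is that bulky data files are excluded while code and metadata are versioned):

```gitignore
# Keep code and metadata in git; keep data out of the repo
data/raw/
data/reduced/
results/*.fits
*.hdf5
```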

## One step further: Open Workflows

(Based on Goodman et al., 2014, Ten Simple Rules for the Care and Feeding of Scientific Data)

### What is reasonable reuse?

It is practically impossible to document every single decision that goes into making a dataset. Some choices will seem arbitrary, some will have to be arbitrary, and most will not be the best choices for the people who want to reuse your data. So think about what level of reproducibility you want your users to have: what is a reasonable level of reproducibility your users may hope for or expect?

### Publish Workflow

There are applications dedicated to managing sequences of reduction and analysis steps (Taverna, Kepler, Wings, etc.), but these are not used in astro.

Instead consider publishing your Jupyter notebooks. I think AAS will publish notebooks.

Publish your snakemake files.

And your SLURM scripts.
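To make this concrete, here is a minimal Snakefile sketch. Everything in it is hypothetical (the script names, file paths, and rules are placeholders); the point is that the workflow file itself documents which step produces which file, which is exactly the kind of provenance worth publishing:

```snakemake
# Hypothetical Snakefile: each rule names its inputs and outputs explicitly
rule all:
    input:
        "results/lightcurve.csv"

rule reduce:
    input:
        "data/raw/obs.fits"
    output:
        "data/reduced/obs.csv"
    shell:
        "python scripts/reduce.py {input} {output}"

rule analyze:
    input:
        "data/reduced/obs.csv"
    output:
        "results/lightcurve.csv"
    shell:
        "python scripts/analyze.py {input} {output}"
```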

Publish your code.

Put your code on GitHub, add a licence, archive it on Zenodo, and add the DOI to the GitHub repository. And add a CITATION.cff file. Just do it. Better published than tidy.
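A minimal CITATION.cff might look like the following; every value here is a placeholder (name, affiliation, version, and especially the DOI), to be replaced with your own:

```yaml
# Hypothetical CITATION.cff — replace names, version, and DOI with your own
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "mypipeline"
type: software
version: "1.0.0"
doi: "10.5281/zenodo.0000000"
authors:
  - family-names: "Doe"
    given-names: "Jane"
    affiliation: "Example Institute"
```

With this file in the repository root, GitHub shows a “Cite this repository” button automatically.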

Attend the workshop on software packaging.

Also consider JOSS (for software representing a significant scholarly effort), which offers joint publications with AAS journals.

Papers are still the most findable artifacts in research.

Always state how you want to get credit!

How should people cite your code? How should they cite your data? Don’t make them cite an (in prep.) paper; use DOIs and published papers.
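One low-effort option is to put a ready-made BibTeX entry in your README. The entry below is a hypothetical sketch (author, year, version, DOI, and URL are all placeholders):

```bibtex
% Hypothetical @software entry — fill in your own DOI and version
@software{doe_mypipeline_2024,
  author  = {Doe, Jane},
  title   = {mypipeline},
  year    = {2024},
  version = {1.0.0},
  doi     = {10.5281/zenodo.0000000},
  url     = {https://github.com/example/mypipeline}
}
```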

“make information about you (e.g., your name, institution), about the data and/or code (e.g., origin, version, associated files, and metadata), and about exactly how you would like to get credit, as clear as possible.”

It is astounding how many repositories on GitHub have no contact info for the author!

### Reward colleagues who share data and code

### Be a booster for open science

Progress will not happen by itself. The practices described here are increasingly incentivized by requirements from journals and funding agencies, but the time and skills required to actually do them are still not being valued.

At a local level, principal investigators (PIs) can have the most impact by requiring that the research their lab produces follow these recommendations. Even if a PI doesn’t have a background in computation, they can require that students show and share their code in lab meetings and with lab mates, that data are available and accessible to all in the lab, and that computational methods sections are comprehensive. PIs can also value the time it takes to do these things effectively and provide opportunities for training.

Many funding agencies now require data-management plans, education, and outreach activities. Just having a plan does not mean you have a good plan or that you will actually execute it. Be a good community citizen by setting an example for what good open science looks like. The true cost of implementing these plans includes training; it is unfair as well as counterproductive to insist that researchers do things without teaching them how. We believe it is now time for funders to invest in such training; we hope that our recommendations will help shape consensus on what “good enough” looks like and how to achieve it.

## References
  1. Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLOS Computational Biology, 13(6), e1005510. https://doi.org/10.1371/journal.pcbi.1005510