
Python Projects

Max Planck Institute for Astronomy, Königstuhl 17, 69117 Heidelberg, Germany

This workshop will examine what makes a good project structure, i.e., the decisions you make concerning how your project best meets its objective. In practical terms, structure means writing clean code with clear logic and explicit dependencies, and organizing the files and folders sensibly in the filesystem.

Before we start, some logistics:

  • The workshop will run for about 3.5 hours and there will be a coffee break in the middle.
  • This is an interactive workshop with hands-on activities. Please follow along.
  • Please ask questions throughout.
  • Post-its.
  • Introductions.

REMEMBER: One-minute feedback!!

In this workshop we will look closely at each of the 4 main elements of a software project:

  • Source code
  • Data (examples, external files, reference data, etc.)
  • Tests that ensure that the code runs as expected
  • Documentation which tells the user how to use the code

Think about the structure of a science paper.

This workshop focuses on Python packages. However, many of the recommendations can be applied to other languages and the tools we discuss exist for other languages as well.

This workshop looks closely at Python’s module and import systems, as they are the central elements for enforcing structure in your project. We then discuss various perspectives on building code that one can extend and test reliably.

0. Example Data

This workshop will use a test repository which is available here:

https://github.com/mpi-astronomy/mpia-python-template

Create a Virtual Environment

To keep your project’s dependencies isolated and reproducible, it’s a good idea to create a virtual environment.

You can use Python’s built-in venv module and specify the environment name, or use conda (we recommend installing Miniconda), which also lets you pick the Python version.

In a nutshell, virtual environments allow you to:

  • Keep dependencies isolated. This avoids situations where different projects need different package versions and you have to globally uninstall/reinstall packages every time you switch projects.
  • Share your dependencies with other people.

Here is an example of how to create an environment with conda:

conda create -n my_env python=3.9
conda activate my_env

And here is an example with venv:

python3 -m venv my_env
source my_env/bin/activate

1. Source Code

In any language, a project contains several source code files organized into logical units or modules. These could be single files or a collection of them in different directories.

Thanks to how Python handles imports and modules, it is relatively easy to structure a Python project. There is only a limited set of constraints on how to structure a module in Python. Therefore, you can focus on the pure architectural task of crafting the different parts of your project and their interactions. Python modules are one of the main abstraction layers, and a natural one. Abstraction layers allow separating code into components that hold related data and functionality together.

For example, one component can handle reading and writing the data from/to a remote database, while another deals with the complex training of your neural network model. The most natural way to separate these two is to regroup all I/O functionality in one file, and the training operations in another. The training file may import the I/O file through import ... or from ... import A, B, C.
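As a sketch (the module and function names here are hypothetical), such a split might look like this:

# io_utils.py -- all I/O functionality lives in one module (hypothetical name)
def read_table(path):
    """Read whitespace-separated training data from a file."""
    with open(path) as f:
        return [line.split() for line in f]

# training.py -- the training code imports the I/O module it depends on
import io_utils

def train(path):
    data = io_utils.read_table(path)
    ...  # fit the model on `data`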

As soon as you use import statements, you use modules. These can be either built-in modules (e.g., os, sys, math), third-party modules you have installed in your environment (e.g., numpy, astropy, pandas), or your project’s internal modules.

In Python, a module is a file or a folder containing Python definitions and statements. In the case of a file, the filename is the module name with the suffix .py.

Modules

Aside from some naming restrictions, Python does not require anything special for a Python file to be a module, but one needs to understand the import mechanism to avoid some pitfalls. For a directory, there are some constraints that we will detail later.

The import mymodule statement looks for a mymodule.py file in the same directory as the caller. If it does not exist there, the Python interpreter searches the directories on the Python path (sys.path), and raises an ImportError exception if the module cannot be located anywhere.

import mysupermodule  # ImportError

When Python locates the file, the interpreter will execute it in an isolated scope (namespace). This execution includes any top-level statement (e.g. other imports) and function and class definitions. These are stored in the module’s dictionary.

import math
math.__dict__

Then, the module’s variables, functions, and classes will be available to the caller through the module’s namespace. (e.g., math.cos)

Let’s create a file called hello.py in which we code the usual helloworld function.

%%file hello.py

def helloworld():
    print("Hello, world!")

If our current working directory contains this file, we can import hello, and Python will automatically find it.

import hello
hello.helloworld()

The Python Code Style section emphasizes that readability is one of the main features of Python, i.e., avoiding useless boilerplate text and clutter. Being able to tell immediately where a class or function comes from dramatically improves the code readability and understandability of a project.

However, we do not want to work with a loose bunch of Python files or manually set the Python path, and we sometimes need a more complex organization.

Packages in Python

Python provides a straightforward packaging system that extends the module mechanism to a directory.

Any directory with an __init__.py file is considered a Python package. The package’s modules (python files) obey the import rules mentioned before. Only the __init__.py file has a particular behavior as it gathers all package-wide definitions (e.g., special values, documentation): it is always imported with the package.

To import a file hello.py in the directory examples/, you need

import examples.hello

The interpreter will look for examples/__init__.py and first execute its content. Then it will look for examples/hello.py and execute its top-level statements. After these operations, any variable, function, or class is available in the examples.hello namespace.
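For instance, a minimal examples/__init__.py might hold only package-wide definitions (the contents below are illustrative):

# examples/__init__.py
"""Example package for this workshop."""
__version__ = "0.1.0"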

Often in complex projects, there may be sub-packages and sub-sub-packages in a deep directory structure. In this case, importing a single item from a sub-sub-package will require executing all __init__.py files down to the module. Leaving an __init__.py file (almost) empty is considered standard and even good practice if the package’s modules and sub-packages do not need to share any code.

Lastly, Python provides a convenient syntax for importing deeply nested packages: import very.deep.module as mod. This allows you to use mod in place of the long list of packages.

Directories within the package are subpackages:

package
|-- __init__.py
|-- module.py
|-- sub_package/
       |--  __init__.py
       |-- sub_module.py
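With this layout, the corresponding import statements look like this (names as in the tree above):

import package.module
from package import module
import package.sub_package.sub_module as sub_module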

Exercise: Follow the instructions for the template here and make a copy under your own GitHub account: https://github.com/mpi-astronomy/mpia-python-template and clone it locally. If you decide not to use GitHub, you can download the .ZIP file for the static_version branch.

The standard structure is to have a src directory and then have the directory with the package name inside that:

my_package
|-- src
    |-- my_package
        |-- __init__.py
        |-- code.py
        |-- more_code.py

You can set this up for your own code or take a look at the template.

Exercise: Rename the my_package subdirectory to something else.

Setting the Python path by hand is an outdated method: it does not translate to other people’s directory structures and operating systems, and it does not install any of the dependencies. It is better to use the same method you use to install any other package:

pip install numpy
pip show numpy

How do you get your project to behave that way?

You need to provide some additional files that allow the built-in Python tools to treat your code as a package. The instructions on how to turn your code into a library are contained in the pyproject.toml file. Let’s review the contents of this file to understand what these instructions are. A minimum pyproject.toml is already included in the template repository. At this point it would be helpful to open the template directory in a text editor.

The first section of the file contains some basic information about your project. Edit it to correspond to your project.

[project]
name = "my_package"
authors = [{name = "Example Author", email = "author@mpia.de"}]
description = "An example package"
readme = "README.md"
license = { file = 'LICENSE' }
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: BSD License",
    "Operating System :: OS Independent",
]

dynamic = ['version']

[project.urls]
"Bug Tracker" = "https://github.com/mpi-astronomy/snowblind/issues"
"Source Code" = "https://github.com/mpi-astronomy/snowblind"

There are dependencies that are needed to build the package in the first place. These are specified in the [build-system] block. At minimum, setuptools and wheel are needed. In addition, if you want dynamic versioning, you should also include setuptools_scm. If your package has C or Fortran extensions that interface with numpy, you may also need to add numpy to the list below; look up the setuptools documentation for more details. The [build-system] block looks like this:

[build-system]
requires = [
    "setuptools>=60",
    "setuptools_scm>=8.0",
    "wheel",
]
build-backend = "setuptools.build_meta"

Your package also has dependencies that need to be installed in order for it to run. These are specified in the dependencies list, like this. Extend the list as needed.

dependencies = [
    "numpy",
]

Finally, there is a [project.optional-dependencies] block, which specifies packages that are only needed for tests and documentation. Users generally should not need to install these. You can also create some special installation setups here: for example, if your code supports several different instruments or compute environments, you can specify different sets or versions of dependencies. In the example repo we have test and docs dependency sets specified.
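As a sketch, such a block can look like this (the extra names and packages below are hypothetical; the actual test and docs sets used in the template appear later in this workshop):

[project.optional-dependencies]
test = ["pytest"]
docs = ["sphinx"]
# hypothetical extras for two different instruments
instrument_a = ["numpy"]
instrument_b = ["numpy", "some-special-package>=2.0"]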

Now you can install your library. Try the commands below.

pip install -e .
pip show my_package
pip uninstall my_package
pip install .
pip show my_package

Try installing with -e and without it, and check where the files are located in each case. Note that when you do the latter pip installation (without -e), only the files in the src/my_package/ directory are copied to site-packages. Within Python you can do the following:

import sys
import my_package
sys.modules['my_package']
my_package?  # IPython/Jupyter syntax for inspecting an object

Push any changes to GitHub. Now you can try installing directly from GitHub:

pip install git+https://

And anyone else can clone your repo and install it locally:

git clone ....
pip install -e .

Exercise: Add your own function to the package. Install. Commit & push.

touch src/my_package/my_func.py

"""
This is documentation for the file.
"""


def add_two_numbers(a, b):
    """
    This is my function and it adds two numbers.
    """
    # this is my comment
    return a + b

Good Enough Practices for Scientific Computing

Some signs of a poorly structured project

  • Circular dependencies: if you have classes Star and Planet in two different modules (or files), but you need to import Planet to answer star.get_planetary_system() and similarly import Star to answer planet.which_star(), then you have a circular dependency. In this case you will have to resort to fragile hacks, such as using import statements inside your methods or functions (see the sketch after this list).

  • Hidden coupling: if changes to a given class always break tests of a different class (e.g., changes in Planet break many tests of Star), this means the Star class relies heavily on knowing the details of Planet.

  • Heavy usage of global state or context: it is very tempting to avoid an explicit list of arguments to some functions and instead rely on global variables that can be, and are, modified on the fly by different parts of the code. This practice is very common in notebooks, for instance. It is bad practice because you need to scrutinize all access to these global variables to understand which part changes what.

  • Spaghetti code: multiple pages of nested if clauses and for loops with a lot of copy-pasted procedural code and no proper segmentation are known as spaghetti code. Python’s meaningful indentation makes it very hard to maintain this kind of code.

  • Ravioli code: hundreds of similar little pieces of logic, often classes or objects, without proper structure. If you can never remember whether you have to use a list, tuple, np.array, or pd.DataFrame for the task at hand, then you might be swimming in ravioli code.
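Here is a minimal sketch of the circular-dependency problem and the fragile deferred-import workaround (the class bodies are hypothetical, following the Star/Planet example above):

# star.py
from planet import Planet  # a top-level "from star import Star" in planet.py
                           # would make this import fail -> circular dependency

class Star:
    def get_planetary_system(self):
        ...  # uses Planet

# planet.py -- the fragile workaround: defer the import into the method
class Planet:
    def which_star(self):
        from star import Star  # imported here instead of at the top of the file
        ...

A better fix is usually to restructure the code, e.g., to move the shared logic into a third module that both can import.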

The things that make code good are also the things that make scientific papers good:

  • (logical structure): organize the code so it can be read and understood in order
  • (short paragraphs): short functions, small-ish classes, with unique functionality
  • (subsections): split code in different files based on functionality, split out different types of work: read data, work on data, write data out
  • Be ruthless about eliminating duplication.
  • (cite your sources): always search for well-maintained software libraries that do what you need
  • (consistent language): follow standard (PEP8) or at least be consistent within the code base
  • (clarity): function names, variable names should be meaningful, write for people
  • Documentation: DOC STRINGS!
  • No obvious bugs ;) (more on tests)
  • Flexible vs. generic: don’t generalize too early, focus on doing the task at hand without hard-coding too many things.
  • Provide a simple example or test data set.

PEP 8 and Linters

  • pycodestyle
  • pylint
  • pyflakes
  • flake8
  • black - an auto-formatter
  • autopep8
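For example, to run a linter and the auto-formatter on the template package (assuming you install the tools first):

pip install flake8 black
flake8 src/
black src/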

2. Data and Configuration Files

Fixed data files should generally be avoided in packaged code; in fact, data is the one element of a project that is entirely optional. Before you add data files (i.e., non-code files) to your project, consider the following alternatives:

  • Is there an (authoritative) copy of this data elsewhere on the internet?
  • Is this data accessible via an API or a client library?
  • Do I need this whole file, or will a small piece of it do?
  • Can I create an effective test with simulated data?

For resources which are available on the internet (say, an archive or an observatory website), use client libraries and have your user download the file when they need it. This is especially true for binary files: do not put these in repositories.

A very feature-rich client library is astroquery: it has modules for most major astronomy archives. See the astroquery documentation for details.
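As a small illustration, here is a query to the SIMBAD archive (requires astroquery to be installed; the object name is just an example):

> from astroquery.simbad import Simbad
> result = Simbad.query_object("M31")
> print(result)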

For anything that you do download from the internet, do not shell out to wget or curl. Instead, use the built-in urllib.request module:

> import urllib.request
> urllib.request.urlretrieve(url, file_name)  # url and file_name are strings you define

Test data: generate it or get it through an API, and think about the minimum amount of data needed to test the functionality.

Ok, so you do actually have to include data with your code. How do you do it?

Data files should be included in a data directory in the src/package/ directory. For example in our repo:

% mkdir src/my_package/data

Now let’s put some data here to see how this works.

% touch src/my_package/data/data.txt
% pip uninstall my_package
% pip install .
% pip show my_package
% ls ...../site-packages/my_package/data/

By default, data put in this data directory will be included, so you should now see your data.txt file. But subdirectories will not be included. Let’s test this:

% mkdir src/my_package/data/filter
% touch src/my_package/data/filter/filter.flt
% pip uninstall my_package
% pip install .
% pip show my_package
% ls ...../site-packages/my_package/data/filter/

This one is probably not there: just because a data file exists in the source tree does not mean it is automatically included in the installed package. The information about which files should be included goes, again, in the pyproject.toml file, where the following section needs to be added:

[tool.setuptools.package-data]
my_package = ["data/filter/*"]

Try the installation again:

% pip install .
% pip show my_package
% ls .........../site-packages/my_package/data

And now, since this is not some random directory set via an environment variable, you can always find the location of the data via the installation directory of the code:

> import os
> import pkg_resources
> my_package_dir = pkg_resources.resource_filename("my_package", "")
> data_dir = os.path.join(my_package_dir, 'data')
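Note that pkg_resources is deprecated in recent setuptools releases; on Python 3.9+, the standard library provides the same lookup via importlib.resources:

> from importlib.resources import files
> data_dir = files("my_package") / "data"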

One occasionally legitimate reason to include data is an example configuration file. Even then, it is much better to code up the generation so your code can create a fresh configuration file on demand. But if you must include one, do not invent your own configuration file format: use configparser.

> import configparser
> config = configparser.ConfigParser()
> config.read('src/data/config/center.pcf')
> config.sections()
> for key in config['PSF']: 
      print(key)
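And here is a minimal sketch of generating such a file on demand (the section and option names below are hypothetical):

> import configparser
> config = configparser.ConfigParser()
> config['PSF'] = {'fwhm': '2.5', 'model': 'gaussian'}  # hypothetical defaults
> with open('center.pcf', 'w') as f:
      config.write(f)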

3. Tests

Tests are an absolutely critical element of a good software project. You could write the most beautiful code and yet it could be irreparably broken a year later.

  • Detecting problems early - Tests find problems early in development.

  • Mitigating change - Tests allow you to change the source code, both during development and later on, while making sure the module still works as expected.

  • Simplifying integration - By testing the separate components of an application first and then testing them together, integration testing becomes much easier.

  • Allowing contribution and collaboration - Automated testing makes sure that changes contributed by others don’t break the core code.

Testing tips:

  • Add assertions to programs to check their operation.
  • Use an off-the-shelf unit testing library.
  • Turn bugs into test cases.
  • Use a symbolic debugger (pdb) to explore the code

Where do the tests live in a package? At the top level:

% mkdir tests
% touch tests/test_my_func.py

Tests can have their own dependencies that are different from those of the code. For example, if the tests need to generate or fetch data, those can be dependencies of JUST the tests and not of the main code. Therefore, you can specify a separate list of dependencies in the pyproject.toml file, like so:

[project.optional-dependencies]
test = [
    "pytest",
    "pytest-doctestplus",
    "flake8",
    "flake8-pyproject",
    "codecov",
    "pytest-cov",
]

To install both the main and the test dependencies, you can use pip as follows:

pip install ".[test]"

Let’s write a really simple test, just for example purposes.

from my_package import my_func
import numpy as np
import pytest

def test_add():
    assert my_func.add_two_numbers(2, 2) == 4
    # floating-point results must be compared approximately, not with ==
    assert my_func.add_two_numbers(2.2, 3.4) == pytest.approx(5.6)

def test_nan():
    results = my_func.add_two_numbers(2, np.nan)
    assert np.isnan(results)

And then we can run them:

pip install pytest
pip install pytest-cov
pytest
pytest --cov=src/my_package/ tests/

Test Best Practices:

  • Write tests for parts that have the fewest dependencies on external resources first, and work your way up.
  • Tests should be logically as simple as possible.
  • Each unit test should be independent of all other tests.
  • Each unit test should be clearly named and well documented.
  • All methods, regardless of visibility, should have appropriate Python unit tests.
  • Strive for one assertion per test case.
  • Create unit tests that cover exceptions.
  • DO write tests concurrently with code. DO NOT “leave tests for later”.

Test Driven Development: for each new functionality that your application must have, you first design Python unit tests and only then do you carry on writing the code that will implement this feature. TDD may seem like developing Python applications upside-down, but it has some benefits. For example, it ensures that you won’t overlook unit testing for some feature. Further, developing test-first will help you to focus first on the tasks a certain function, test suite, or test case should achieve, and only afterwards to deal with how to implement that function, test suite, or test case.

Once you do have tests, it is really easy to check your code periodically or as new functionality is added. This is called a continuous integration / continuous deployment (CI/CD) workflow. The testing is just the CI part.

With GitHub Actions, the workflow is defined in a file in your repository, e.g., .github/workflows/ci.yml.
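A minimal sketch of what such a workflow file might contain (the template repository’s actual workflow may differ):

name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install ".[test]"
      - run: pytest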

4. Documentation

There are several different ways to document code. Use them.

Overall tips:

  • Document design and purpose, not mechanics.
  • Document interfaces and reasons, not implementations.
  • Refactor code in preference to explaining how it works: i.e., rather than write a paragraph to explain a complex piece of code, reorganize the code itself so that it doesn’t need such an explanation. This may not always be possible—some pieces of code are intrinsically difficult—but the onus should always be on the author to convince his or her peers of that.
  • Embed the documentation for a piece of software in that software: Doing this increases the probability that when programmers change the code, they will update the documentation at the same time.

README

The first piece of documentation a user may encounter is the README.md file. This file lives in the root directory of your project, is usually written in Markdown (md) or reStructuredText (rst), and should contain the following basic info:

  • What does this code do?
  • Who wrote it?
  • How does it get installed?

Optional:

  • Links to other related resources: documentation pages, related packages, papers that use it.
  • How to contribute (also see CONTRIBUTING.md)?
  • Changelog (can also be a separate file) containing a short overview of changes by version. This information can be captured by the commit history but it is good to have a short summary.
  • Sometimes you will see to-do lists in a README, but these belong in the issue tracker.

Docstrings

The docstring describes the operation of the function or class:

# This function slows down program execution for some reason.
def square_and_rooter(x):
    """Returns the square root of self times self."""
    ...

Unlike block comments, docstrings are built into the Python language itself. This means you can use all of Python’s powerful introspection capabilities to access docstrings at runtime, whereas comments are discarded when the code is parsed. Docstrings are accessible through the __doc__ dunder attribute of almost every Python object, as well as with the built-in help() function.

While block comments are usually used to explain what a section of code is doing, or the specifics of an algorithm, docstrings are intended to explain to other users of your code (or to you in six months’ time) how a particular function can be used, and the general purpose of a function, class, or module.

Depending on the complexity of the function, method, or class being written, a one-line docstring may be perfectly appropriate. These are generally used for really obvious cases, such as:

def add(a, b):
    """Add two numbers and return the result."""
    return a + b
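For example, once add is defined, its docstring is available at runtime:

> print(add.__doc__)
> help(add)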

The docstring should describe the function in a way that is easy to understand. For simple cases like trivial functions and classes, simply embedding the function’s signature (i.e. add(a, b) -> result) in the docstring is unnecessary. This is because with Python’s inspect module, it is already quite easy to find this information if needed, and it is also readily available by reading the source code.

> import inspect
> print(inspect.getsource(add))
> print(inspect.getdoc(add))
> print(inspect.isfunction(add))

True

In larger or more complex projects however, it is often a good idea to give more information about a function, what it does, any exceptions it may raise, what it returns, or relevant details about the parameters.

For more detailed documentation of code, a popular style is the one used by the NumPy project, often called NumPy-style docstrings. While it can take up more lines than the previous example, it allows the developer to include a lot more information about a method, function, or class.

def random_number_generator(arg1, arg2):
    """
    Summary line.

    Extended description of function.

    Parameters
    ----------
    arg1 : int
        Description of arg1
    arg2 : str
        Description of arg2

    Returns
    -------
    int
        Description of return value

    """
    return 42 

Many text editors come with an extension that helps you generate docstrings; VS Code, for example, has several such extensions.

Compiled Documentation

Once you have docstrings in the location and format that Python expects them, you can autogenerate documentation. This is usually done with the sphinx library. You can add narrative documentation to that as well.

Create a docs directory at the top level of your project:

mkdir docs
cd docs
pip install sphinx
sphinx-quickstart
make html

Sphinx uses reStructuredText, which is just different enough from Markdown to be confusing; look at the Sphinx documentation.

The docs can have their own dependencies that need to be documented in the pyproject.toml file:

[project.optional-dependencies]
docs = [
    "sphinx",
    "sphinx-automodapi",
    "numpydoc",
]

If you already have a [project.optional-dependencies] section from the tests above, you don’t need to repeat the heading. Just add the new extra docs and its dependencies.

To install all (main, test, and docs) dependencies, the pip command is:

pip install ".[docs,test]"

5. Collaboration

  • CITATION.cff
  • CODE_OF_CONDUCT.md
  • CONTRIBUTING.md
  • SECURITY.md
  • Collaborative development: what is it? why is it important?

A word about the license

The License is arguably the most critical part of your repository, aside from the source code itself. The full license text and copyright claims should exist in this file.

If you aren’t sure which license to use for your project, check out choosealicense.com. For astronomy projects, we recommend the BSD-3-Clause license. This license allows private and commercial use, modification, and redistribution of the code, with preservation of the credits of the copyright holder, while disclaiming your liability if the code does not work as expected. It is a permissive license similar to the BSD 2-Clause License, but with a third clause that prohibits others from using the name of the copyright holder or its contributors to promote derived products without written consent.

Of course, you can publish code without a license, but this would prevent many people from potentially using or contributing to your code.

  • Publishing, DOI, Citation

6. Summary and Template

A typical Python project contains the following:

| Filename | Description |
| --- | --- |
| src/packagename | The actual project source code |
| tests/ | Contains all the unit tests |
| docs/ | Contains all the documentation |
| .gitignore | Defines rules for your versioning workflow (here git) |
| CITATION.cff | Specifies how the code should be cited |
| CODE_OF_CONDUCT.md | Community guidelines, reporting, consequences (example) |
| CONTRIBUTING.md | How users can contribute to the code |
| LICENSE | Defines rules on how your code can be used/modified/distributed |
| MANIFEST.in | (optional?) Sets files to be distributed with the source distribution (e.g. LICENSE, README) |
| README.md | At least contains a project name and a description |
| pyproject.toml | The file format specified by PEP 518; contains the build system requirements, dependencies, and metadata of the project |

Template: https://github.com/mpi-astronomy/mpia-python-template

7. Resources

Best practices for scientific computing: Wilson et al. (2014)

Good enough practices for scientific computing: Wilson et al. (2017)

Documentation advice: https://docs.python-guide.org/writing/documentation/

Further info on docstrings: https://peps.python.org/pep-0257/

Linters and formatters: https://books.agiliq.com/projects/essential-python-tools/en/latest/linters.html

References
  1. Wilson, G., Aruliah, D. A., Brown, C. T., Chue Hong, N. P., Davis, M., Guy, R. T., Haddock, S. H. D., Huff, K. D., Mitchell, I. M., Plumbley, M. D., Waugh, B., White, E. P., & Wilson, P. (2014). Best Practices for Scientific Computing. PLoS Biology, 12(1), e1001745. 10.1371/journal.pbio.1001745
  2. Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLOS Computational Biology, 13(6), e1005510. 10.1371/journal.pcbi.1005510