Dev process

Peter Ebert edited this page Jun 7, 2024 · 6 revisions
author | date | tags
PE | 2024-06-07 | update, git, commands
PE | 2022-09-21 | cubi, internal, convention, rule, policy, standard

The CUBI code development process

The code development process in the CUBI follows certain standards for the two main types of code repositories created by CUBI team members (described in the following).

Regarding naming and style conventions, please refer to the respective wiki article.

Repository types

1. Workflow repository

Workflow repositories contain pipeline code to execute a series of bioinformatic tools in the same way for many different samples or batches of samples. The goal of workflow design should always be that the code can be executed by third parties.

Important workflow restrictions

Workflow code must not make any assumptions about the execution infrastructure or environment, except that it will be a Linux system. In particular, never hard-code any of the following (non-exhaustive list):

  • file system paths or locations
  • non-standard environment variables
  • download of resources on-the-fly from the internet
  • (input) data transformations or (meta-)data cleanup specific to a project

In general, a new workflow repository should be derived from the respective workflow template repository. The relation between workflow and workflow template has to be documented in the workflow's pyproject.toml.

2. Project repository

Project repositories may contain metadata, small and usually hand-curated annotation files, project-specific code performing preprocessing or cleanup tasks, and should generally document project-specific decisions. A single project can make use of several workflows, potentially executed manually in a serial fashion. The relation between project and workflow(s) has to be documented in the project's pyproject.toml.

3. Other repositories

Other repositories, such as this knowledge base, can be organized differently. Use your best judgement or ask your colleagues for advice.

Making (git) history

The desired state for all shared repositories (mostly workflow and project) is a linear git commit history in the central branches main and dev. Maintaining that state takes some effort and requires a "rebase and merge" strategy. In other words, merging pull requests via "merge commits" or "squash commits" is forbidden.

Caution: rebase recomputes commit hashes

The git rebase development and merging strategy must not be applied when merging between the two constant branches of a CUBI git repository, main and dev. The default direction is dev into main; main into dev is needed only for emergency fixes that were applied to main directly (if there were really compelling reasons to do so). The git rebase command always recomputes the commit hashes (each hash depends on the parent commit, and so on), which implies that carelessly rebasing dev onto main risks changing a substantial number of commit hashes that may already exist in some branch created from dev. For a clean and linear commit history, one can instead merge two branches by fast-forwarding, which does not create a merge commit. In other words, the new commits from dev are simply applied on top of the last commit in main.

This series of operations has to be realized on the command line level and does not work via the github web interface!
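
The hash recomputation described above can be observed in a throwaway repository. The following is a minimal sketch, assuming git 2.28 or newer; the sandbox directory, the `commit` helper and the identity settings are illustrative and not part of the CUBI setup.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway sandbox repository (illustrative only).
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet --initial-branch=main
git config user.name "Example Dev"
git config user.email "dev@example.org"

# Hypothetical helper: create a file named after the commit and commit it.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit --quiet -m "$1"
}

commit A                          # main:      A
git switch --quiet -c dev
commit B                          # dev:       A-B
git switch --quiet -c feature-1
commit E                          # feature-1: A-B-E
git switch --quiet dev
commit H                          # dev moves on: A-B-H

before="$(git rev-parse feature-1)"
git switch --quiet feature-1
git rebase --quiet dev            # replay E on top of H
after="$(git rev-parse feature-1)"

# Same change, same message, but a different parent and thus a new hash.
echo "E before rebase: $before"
echo "E after rebase:  $after"
```

The two printed hashes differ although the commit content is identical, which is why any branch that was created from the old commit no longer lines up after a rebase.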

Example strategies

  1. For the non-primary branches (feature-, analysis-, issue- etc., i.e., everything that is not main, dev or prototype [if applicable]), you may rebase onto dev and use github's "rebase and merge" option to close an open pull request. Note that if you rebase without merging the branch into dev, you will have to force push (push --force) your local history to the remote, which effectively breaks the history for all other developers working on that branch. Talk to them first!
$ git switch feature-xyz
$ git rebase --keep-base --interactive dev
# fix conflicts if necessary
$ git push --force all
  2. For the primary and constant branches main, dev and prototype [if applicable], do not merge via the rebase strategy for the reasons explained above. If properly implemented, there shouldn't be any commits in main (or dev) that do not exist in dev (or prototype) when you want to merge dev into main. Hence, you can execute:
# assuming all branches are up-to-date
$ git switch main
$ git merge --ff-only dev
# this merges all new commits from dev into/onto main
# and does not create an unnecessary merge commit
  3. If main and dev contain unshared commits, so that fast-forwarding does not work or would create duplicated code changes, resolve the situation by cherry-picking the commit(s). Cherry-picking works on a single commit or on a series of commits and copies commits from one branch into another. Use this strategy to ensure that fast-forwarding works.
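
The situation in the last strategy can be reproduced in a sandbox. This is a minimal sketch, assuming git 2.28 or newer; the `hotfix` commit, the `commit` helper and the sandbox directory are hypothetical. It only shows the mechanics of a refused fast-forward and of a cherry-pick; how the two branch tips are brought back in line afterwards depends on the concrete case.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway sandbox repository (illustrative only).
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet --initial-branch=main
git config user.name "Example Dev"
git config user.email "dev@example.org"

# Hypothetical helper: create a file named after the commit and commit it.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit --quiet -m "$1"
}

commit A
git switch --quiet -c dev
commit C                          # regular work on dev
git switch --quiet main
commit hotfix                     # emergency fix that exists only in main

# main and dev have diverged, so git refuses to fast-forward.
if git merge --ff-only dev >/dev/null 2>&1; then
    ff_status="fast-forwarded"
else
    ff_status="refused"
fi
echo "git merge --ff-only dev: $ff_status"

# Copy the hotfix onto dev; the copy carries the same change under a new hash.
hotfix_on_main="$(git rev-parse main)"
git switch --quiet dev
git cherry-pick "$hotfix_on_main" >/dev/null
hotfix_on_dev="$(git rev-parse dev)"
echo "hotfix in main: $hotfix_on_main"
echo "hotfix in dev:  $hotfix_on_dev"
```

A series of commits can be picked in one call with a range, for example `git cherry-pick <first>^..<last>`.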

The main branch

Please do:

  • make production releases
  • coordinated single pushes of a bug fix only in absolute emergencies
    • may be followed by a post-mortem to clarify what went wrong
  • name this branch main

Please don't:

  • (force) push directly
  • start feature or issue branches from here
  • rename the branch
  • delete the branch
  • merge main into another branch

The dev branch

Please do:

  • make development releases
  • name this branch dev
  • start feature or issue branches from here
  • rebase and merge finished feature, issue or analysis branches into dev

Please don't:

  • (force) push directly
  • rename the branch
  • delete the branch
  • merge dev into another branch except for fast-forwarding (git merge --ff-only) into main

The prototype branch

Please do:

  • use only in the very beginning of the dev process "when nothing works"
  • name this branch prototype
  • delete this branch as soon as main development has been moved to dev

Please don't:

  • keep using this branch forever
  • keep using this branch if several people contribute to the development
  • start feature or issue branches from here

Suggested init steps

  • create the main branch and populate it with the appropriate metadata files
    • LICENSE, CITATION info, pyproject.toml etc.
    • push to git
  • from main, create dev and populate it with template files
    • (if applicable)
    • push to git
  • from dev, create prototype and start adding your code
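
The init steps above could look roughly like this on the command line. This is a sketch under assumptions: the file name CITATION.md and the template placeholder file are hypothetical, the files are left empty, and the pushes are only indicated as comments because no remote exists in the sandbox.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway sandbox repository (illustrative only); requires git 2.28+.
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet --initial-branch=main
git config user.name "Example Dev"
git config user.email "dev@example.org"

# Step 1: main holds the metadata files (empty placeholders here).
touch LICENSE CITATION.md pyproject.toml
git add LICENSE CITATION.md pyproject.toml
git commit --quiet -m "add repository metadata"
# git push <remote> main          # remote name depends on your setup

# Step 2: dev is created from main and receives the template files.
git switch --quiet -c dev
mkdir -p workflow
touch workflow/template_placeholder.txt   # hypothetical template file
git add workflow
git commit --quiet -m "add template files"
# git push <remote> dev

# Step 3: prototype is created from dev; early code goes here.
git switch --quiet -c prototype
git branch --list
```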

feature, analysis and issue branches

Important: feature- branches should only exist for workflow or tool repositories. The analogous branch in standard CUBI project repositories is called analysis- and is more lenient in terms of the development and merge policy. The following dos and don'ts are binding for workflow and tool repositories.

Please do:

  • start a new branch for every single unit of work
  • always branch off from dev
  • follow naming conventions as described in the naming and style wiki article
  • clean up your commit history every now and then via git rebase --interactive dev (see below)
  • force push into the feature/issue branch after a rebase if necessary
    • notify your colleagues if you are sharing the implementation work

Please don't:

  • use a branch as a hidden fork of a repo and implement breaking changes
  • keep branches alive in production and use them as pipeline run targets
  • start a pull request before testing, linting and formatting your code
  • forget to delete your branch everywhere (!) after it has been merged
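
The life cycle of a feature branch, including the cleanup "everywhere", can be sketched as follows. The bare repository stands in for the git server and the remote name `origin` is an assumption; the direct pushes to main and dev only set up the sandbox, and the local fast-forward merge stands in for github's "rebase and merge" button.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sandbox: a bare repository plays the role of the git server.
sandbox="$(mktemp -d)"
git init --quiet --bare "$sandbox/remote.git"
git init --quiet --initial-branch=main "$sandbox/work"
cd "$sandbox/work"
git config user.name "Example Dev"
git config user.email "dev@example.org"
git remote add origin "$sandbox/remote.git"

# Hypothetical helper: create a file named after the commit and commit it.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit --quiet -m "$1"
}

commit A
git push --quiet origin main      # sandbox setup only
git switch --quiet -c dev
commit B
git push --quiet origin dev       # sandbox setup only

# One branch per unit of work, always branched off from dev.
git switch --quiet -c feature-xyz dev
commit add-function-X
git push --quiet origin feature-xyz

# Stand-in for a merged pull request: dev is fast-forwarded to the feature.
git switch --quiet dev
git merge --quiet --ff-only feature-xyz
git push --quiet origin dev

# Delete the branch everywhere: locally and on the remote.
git branch --quiet -d feature-xyz
git push --quiet origin --delete feature-xyz
```

If the repository is mirrored to more than one remote, the last command has to be repeated for each of them.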

Illustrated examples

Remark: the Mermaid Gitgraph capabilities are still under development, and the following examples are thus not showing the full (possible) complexity.

The goal is to have a simple, linear commit history in main and dev:

gitGraph
  commit id: "A"
  commit id: "B"
  commit id: "C"
  commit id: "D"
  commit id: "E"
  commit id: "F"

The dev branch is used as the central development branch, i.e., it is the starting point for all feature or issue branches:

gitGraph
  commit id: "A"
  commit id: "B"
  branch dev
  commit id: "C"
  commit id: "D"
  branch feature-1
  commit id: "E"
  commit id: "F"
  commit id: "G"
  checkout dev
  branch issue-1
  commit id: "H"
  commit id: "I"
  commit id: "J"

Given the different speed of development in the various branches, the order in which branches are merged back into dev should be considered unpredictable. In the example below, the issue-1 branch is merged back into dev before the work in feature-1 is complete:

gitGraph
  commit id: "A"
  commit id: "B"
  branch dev
  commit id: "C"
  commit id: "D"
  branch feature-1
  commit id: "E"
  commit id: "F"
  commit id: "G"
  checkout dev
  branch issue-1
  commit id: "H"
  commit id: "I"
  commit id: "J"
  checkout dev
  merge issue-1

Remark: Mermaid does not support visualizing the git rebase operation (yet).

Let's assume the issue-1 branch could be rebased and merged into dev without problems because it had no conflicts with dev. After the successful merge, the issue-1 branch was deleted. It is quite possible, though, that the code changes merged via issue-1 conflict with the feature-1 branch, in which case feature-1 could not simply be merged back into dev as well.

gitGraph
  commit id: "A"
  commit id: "B"
  branch dev
  commit id: "C"
  commit id: "D"
  branch feature-1
  commit id: "E"
  commit id: "F"
  commit id: "G"
  checkout dev
  commit id: "H"
  commit id: "I"
  commit id: "J" tag: "v1.0.0dev"

At this point, you could rebase feature-1 onto dev even though the development is still incomplete, just to resolve all conflicts and make feature-1 consistent with dev again. Since the rebase changes the history of the feature-1 branch (the parent in dev changes from D to J), the commit hashes are recomputed. Consequently, you would need to force push (git push --force all) your changes to the git server, effectively rewriting history. This operation breaks the feature-1 branch for all other developers working on that branch, hence they should be notified.

gitGraph
  commit id: "A"
  commit id: "B"
  branch dev
  commit id: "C"
  commit id: "D"
  commit id: "H"
  commit id: "I"
  commit id: "J" tag: "v1.0.0dev"
  branch feature-1
  commit id: "E'"
  commit id: "F'"
  commit id: "G'"
  commit id: "K"
  commit id: "L"
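
Why the force push becomes necessary can be tried out in a sandbox. In this sketch, a bare repository stands in for the git server and the remote is called `origin`, which is an assumption and differs from the remote name used in the command above; the direct pushes to dev stand in for merged pull requests.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sandbox: a bare repository plays the role of the git server.
sandbox="$(mktemp -d)"
git init --quiet --bare "$sandbox/remote.git"
git init --quiet --initial-branch=main "$sandbox/work"
cd "$sandbox/work"
git config user.name "Example Dev"
git config user.email "dev@example.org"
git remote add origin "$sandbox/remote.git"

# Hypothetical helper: create a file named after the commit and commit it.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit --quiet -m "$1"
}

commit A
git switch --quiet -c dev
commit D
git push --quiet origin dev
git switch --quiet -c feature-1
commit E
git push --quiet origin feature-1  # the remote now knows the original E
git switch --quiet dev
commit J                           # stand-in for the merged issue-1 work
git push --quiet origin dev

# Rebase feature-1 onto the new tip of dev: E becomes E'.
git switch --quiet feature-1
git rebase --quiet dev

# A plain push is rejected, because E' is not a descendant of the remote E.
if git push --quiet origin feature-1 >/dev/null 2>&1; then
    plain_push="accepted"
else
    plain_push="rejected"
fi
echo "plain push: $plain_push"

# Only a forced push replaces the remote history of feature-1.
git push --quiet --force origin feature-1
```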

As soon as the development in feature-1 is complete, it can be merged into dev via merge --ff-only or using the "rebase and merge" option from github. Remember to delete the branch afterwards:

gitGraph
  commit id: "A"
  commit id: "B"
  branch dev
  commit id: "C"
  commit id: "D"
  commit id: "H"
  commit id: "I"
  commit id: "J" tag: "v1.0.0dev"
  commit id: "E'"
  commit id: "F'"
  commit id: "G'"
  commit id: "K"
  commit id: "L"

For a production release, the dev branch is merged into main via merge --ff-only. Do not use github's "rebase and merge" and do not rebase dev onto main; there shouldn't be any conflicts if the development process was implemented in a proper manner by all developers. The development cycle can now start again, with the parent of the dev branch being the last commit in main:

gitGraph
  commit id: "A"
  commit id: "B"
  commit id: "C"
  commit id: "D"
  commit id: "H"
  commit id: "I"
  commit id: "J" tag: "v1.0.0dev"
  commit id: "E'"
  commit id: "F'"
  commit id: "G'"
  commit id: "K"
  commit id: "L" tag: "v1.0.0"
  branch dev
  commit id: "M"
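
The release step can be checked in a sandbox as sketched below. The tag name is taken from the diagram; creating it as a lightweight tag with `git tag` is an assumption, as is the `commit` helper.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway sandbox repository (illustrative only); requires git 2.28+.
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet --initial-branch=main
git config user.name "Example Dev"
git config user.email "dev@example.org"

# Hypothetical helper: create a file named after the commit and commit it.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit --quiet -m "$1"
}

commit A
git switch --quiet -c dev
commit C
commit L                          # last commit before the release

# Production release: move main forward to dev without a merge commit.
git switch --quiet main
git merge --quiet --ff-only dev
git tag v1.0.0                    # assumption: lightweight tag

merge_commits="$(git rev-list --merges --count main)"
echo "merge commits in main: $merge_commits"

# The next development cycle continues on dev, on top of the release.
git switch --quiet dev
commit M
git log --oneline --decorate
```

After the fast-forward, main and dev point to the same commit, so the history stays linear and no hash changes.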

Interactive rebase: what for?

The point of interactively rebasing a feature or issue branch on dev is to reduce the commit history in the branch to the relevant commits. Consider the following example:

gitGraph
  commit id: "A"
  branch dev
  commit id: "B"
  branch feature-1
  commit id: "add function X"
  commit id: "add function Y"
  commit id: "fix syntax error"
  commit id: "fix syntax"
  commit id: "add test case X"
  commit id: "fix spelling"
  commit id: "add test case Y"
  commit id: "fix formatting"
  commit id: "fix language"
  commit id: "update docs"
  commit id: "update docs 2"

Arguably, fixes for trivial syntax errors or spelling mistakes are not changes worth keeping in the commit history after merging the feature-1 branch into dev. Hence, during an interactive rebase, you are presented with various options to edit, combine or delete commits (see section "Changing Multiple Commit Messages"). In the above case, the fixup option, which combines a commit with the previous one, could be used to simplify the commit history as follows:

gitGraph
  commit id: "A"
  branch dev
  commit id: "B"
  branch feature-1
  commit id: "add function X"
  commit id: "add function Y"
  commit id: "add test case X"
  commit id: "add test case Y"
  commit id: "update docs"

Depending on the complexity of the functions and test cases, the history could be simplified even further by also combining related commits and rewording their messages in addition to the fixup operation:

gitGraph
  commit id: "A"
  branch dev
  commit id: "B"
  branch feature-1
  commit id: "add functions X and Y"
  commit id: "add test cases X and Y"
  commit id: "update docs"
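
The fixup step can be reproduced without an editor session. In this sketch, the todo list that git would normally open in your editor is modified by sed via the GIT_SEQUENCE_EDITOR environment variable; GNU sed (Linux) is assumed, and the file names and the `commit_change` helper are hypothetical. In day-to-day work you would simply edit the todo list by hand.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway sandbox repository (illustrative only); requires git 2.28+.
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet --initial-branch=main
git config user.name "Example Dev"
git config user.email "dev@example.org"

# Hypothetical helper: append a line to a file and commit the change.
commit_change() {
    echo "$2" >> "$1"
    git add "$1"
    git commit --quiet -m "$3"
}

commit_change README.md "init" "A"
git switch --quiet -c dev
commit_change README.md "dev" "B"
git switch --quiet -c feature-1
commit_change x.py "def x(): return 1" "add function X"
commit_change x.py "# syntax fixed" "fix syntax error"
commit_change test_x.py "assert True" "add test case X"
commit_change test_x.py "# typo fixed" "fix spelling"

# The todo list has one "pick" line per commit, oldest first.
# Lines 2 and 4 are turned into "fixup", which melds each fix into the
# commit directly above it and discards the fix's commit message.
export GIT_EDITOR=true
export GIT_SEQUENCE_EDITOR="sed -i -e '2s/^pick/fixup/' -e '4s/^pick/fixup/'"
git rebase --quiet --interactive dev

# Remaining commits on feature-1 that are not in dev (newest first).
git log --format=%s dev..feature-1
```

The content of both files is unchanged by the rebase; only the number of commits and their hashes differ.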