Skip to content
This repository has been archived by the owner on Dec 30, 2023. It is now read-only.

A tutorial on the use of git, and GitHub for open and reproducible science. See the live version of this here: https://aezarebski.github.io/misc/github-tutorial/README.html

License

Notifications You must be signed in to change notification settings

dataeducation/github-tutorial

 
 

Repository files navigation

Git, GitHub and Reproducible Research

Disclaimer

These materials makes use of proprietary products and a simplified workflow in order to emphasise the main concepts and to save on installation and configuration time. Some references will be given at the end to direct you to free and open source solutions and more sustainable workflows; the core ideas are the same though so learning the material here should make it easier to migrate to more sensible methods in the future should the need arise.

Background material

This section provides some context for why you might care about git, GitHub and reproducible research. This document only provides the absolute basics of these topics but should get you started with them and help direct further learning.

Version control

Version control refers to systems which help to manage the writing and maintenance of things such as software, documents, and websites. These systems were developed to manage large software projects but are useful at many levels. For example, version control can help you avoid the following situation: my-analysis.R, my-analysis-final.R, my-analysis-final-again.R, and so on. Try coming back to that 3 months later when a reviewer asks you to re-run something with a slight modification!

Reproducible research and Open research

In addition to helping you with organising your files, a version control system and it’s associated tooling can also help the scientific community by helping to make your research reproducible and open.

For your computational research to be considered reproducible, it needs to be described in such a way that others can replicate your results. For it to be open, the materials (code, data and sufficient documentation) need to be available for others. Simply dumping all of your code into something like GitHub is not sufficient for your research to be considered reproducible.

Git

Created in 2005 by Linus Torvalds to help with the development of the Linux kernel, git has become a fundamental tool in software development. In the 2021 Stackoverflow Developer Survey over \(90\%\) of respondents used git; it is nearly synonymous with version control. If you intend to collaborate in the writing of substantial amounts of code, taking the time to learn how to use git is a good idea.

Working with git will be much easier if you get familiar with some of the terminology first. Unless you are familiar with git already, you should at least skim these before continuing.

Repository

A repository is a directory containing your files and the history of all the edits (see commit) made to these files. You can have a repository that only lives on your machine, but they are often shared on a platform such as GitHub.

Commit

An edit to a file that you have recorded as part of the history of edits is called a commit. It is both a noun and a verb, you commit an edit and the repository contains all of your commits. This can be thought of as a stronger version of saving a file. Each commit gets a unique identifier (called a hash). Sometimes we use “commit” to refer to the state of all the code after an edit.

Clone

When you make a copy of a repository you are cloning that repository. The resulting copy is referred to as a clone. Typically this will mean you have downloaded a copy from a platform such as GitHub.

Pull

Suppose you cloned a repository a while ago and you want to get a copy of all the commits that have been made to the original repository since then. To get these commits you pull them, which is a fancy way of saying updating your files. This is sometimes referred to as fetching.

Push

If you have committed some changes to your clone of a repository and want the original repository to have these changes made, you push these changes. This is a fancy way of saying use your edits to update the original files.

Branch

A branch is similar to a clone in that it is a copy of a repository. This provides a more sophisticated way for people to work on their own version of code, without messing up the main copy. This is not particularly important unless you are collaborating with others on a project.

Merge

If someone has made some useful changes on their branch the owner of the repository may decide to include their commits in the main copy. This process of including the changes on someone’s branch is called merging the changes.

Fork

When you make a copy of a repository that sits on your GitHub account. This is similar to, but distinct from cloning and making a branch.

GitHub

What is GitHub?

GitHub, Inc. is a subsidiary of Microsoft. Their website provides freemium hosting of git repositories. In addition to hosting the repositories, it offers additional tools to assist with software development. We will make extensive use of GitHub in this tutorial to avoid you needing to install anything on your machine. If you are going to use git extensively, it would be wise to learn how to do this from the command line or some other program.

Setting up a GitHub account

To register an account you will need an email address that can be used for verification.

  1. Visit https://github.com/ and click Sign Up.
  2. Fill in the forms to create an account.
  3. Verify that account by entering the access code GitHub sends to the email address you registered with.
  4. Verify that you can summon the Command Palette with crtl k for Windows and Linux and command k on a mac.
  5. The appearance and accessibility settings can be reached by searching for them in the command palette.

Zenodo

Zenodo is an open access archive operated by CERN which allows researchers to archive research materials with a DOI which makes them easier to cite. This is a more permanent form of storage than GitHub. It is easy to archive a particular commit of a repository which is good practice if you want to refer to a particular version of some code in a paper.

Worked example

Now that we have an understanding of version control and its associated tooling, we can see an example of how this enables us to do more reproducible research. Suppose that you wanted to include Figure fig:demo-result-1 in a manuscript and you wanted to ensure your analysis reproducible.

./git-usage-1.png

Code and data

The data and the code that generated this figure are included below. The data is saved in a file stackoverflow-git-data.csv.

year,percentage
2015,69.3
2017,69.2
2018,87.2
2020,82.8
2021,93.43

The code is saved in a file make-plot.R

library(ggplot2)

sods_data <- read.csv("stackoverflow-git-data.csv")

g <- ggplot(
  data = sods_data,
  mapping = aes(x = year, y = percentage)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_text(
    aes(x = 2020, y = 82.8, label = "only GitHub"),
    nudge_x = 0.2,
    nudge_y = -4) +
  labs(
    x = "Year",
    y = "Percentage who used git",
    title = "Git usage has increased",
    subtitle = "Data from Stackoverflow Developer Survey")

ggsave(filename = "git-usage.png",
       plot = g,
       height = 7.4,
       width = 10.5,
       units = "cm")

sink(file = "regression-summary.txt")
summary(lm(percentage ~ year, data = sods_data))
sink()

If we put these into a directory called git-usage we end up with the following

git-usage
├── git-usage.png
├── make-plot.R
├── regression-summary.txt
└── stackoverflow-git-data.csv

Copy the code and data into a suitable place on your machine and run the R script to ensure that it works. In this worked example we will go through cleaning this up so it is easier for people (including ourselves) to make sense of this.

Organising the data and code

As a first step we will use directories to impose a bit of structure. Organising our files in this way is useful as it makes it far easier for someone to understand the purpose of each of the files. Follow the following steps to organise your code more appropriately.

  1. Make a directory called src and move make-plot.R there.
  2. Make a directory called data and move stackoverflow-git-data.csv there.
  3. Make a directory called out which we will write results to.
  4. Fix the call to read.csv so it can find the CSV.
  5. Fix the calls to ggsave and sink so it writes the output into out.

Once you have done this, the R script should look like the following.

sods_data <- read.csv("data/stackoverflow-git-data.csv")

...

ggsave(filename = "out/git-usage.png",
       plot = g,
       height = 7.4,
       width = 10.5,
       units = "cm")

sink(file = "out/regression-summary.txt")
summary(lm(percentage ~ year, data = sods_data))
sink()

After you have run the code, the directory structure should look like the following.

git-usage
├── data
│   └── stackoverflow-git-data.csv
├── out
│   ├── git-usage.png
│   └── regression-summary.txt
└── src
    └── make-plot.R

Uploading to GitHub

Now that our code is in a reasonable state, we can upload it to GitHub. If you do not already have a GitHub account, please follow the instructions above, which describe how to make one. Once you have done this, follow the following steps:

  1. Visit https://github.com/ and create a new repository by clicking New, you will need to pick a name for the repository (I called mine git-usage.) The default settings are fine. Click Create repository.
  2. Click creating a new file to start the process of adding src/make-plot.R.
    1. Ensure the name of the file is git-usage/src/make-plot.R.
    2. Copy-and-paste the code in make-plot.R into the editor.
    3. Click Commit new file.
  3. Repeat this process with data/stackoverflow-git-data.csv and the output files by clicking on Add file and selecting Create new file. Note that for git-usage.png you will need to use Upload file instead of Create new file.

Adding a license

A license specifies what people can do with your code. If you aren’t sure what license suits your needs, you might find https://choosealicense.com/ has some helpful information. Most of the time, I will opt for the MIT license.

There are two ways you might add a license. The manual method is to copy and paste the license text into a file called LICENSE to your repository, filling in [year] and [fullname] as appropriate. Alternatively, you can Add file and Create new file and specify that the file will be called “LICENSE” and it will offer you some templates to choose from. It will auto-fill the details of your name and the year.

Adding a README

When you encounter a repository online it can be difficult to understand what its purpose is and how to use it. “README” is the name given to a file that contains this sort of information. Typically these will be written in markdown (similar to RMarkdown). Add a file called README.md to your repository with text similar to the following.

This repository contains an analysis of git usage through time.

To run this analysis use the following command:

```
Rscript src/make-plot.R
```

The input data is in `data` and the results are in `out`.

Recording the session information

Software gets updated, and sometimes these updates cause things to break. Where possible, it is very good practise to include details of the versions of software you have used. When working with R the sessionInfo command makes this simple. Try adding the following to the end of the make-plot.R script.

sink(file = "out/package-versions.txt")
sessionInfo()
sink()

The next time that you run this script, it will write a description of the version of R you used and the versions of all the loaded packages to the file out/package-versions.txt. Try running the script again to make sure this additional file was generated and contains something similar to the following.

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggplot2_3.3.5

loaded via a namespace (and not attached):
 [1] magrittr_2.0.1   splines_4.1.2    tidyselect_1.1.1 munsell_0.5.0
 [5] colorspace_2.0-2 lattice_0.20-45  R6_2.5.1         rlang_0.4.12
 [9] fansi_0.5.0      dplyr_1.0.7      tools_4.1.2      grid_4.1.2
[13] gtable_0.3.0     nlme_3.1-153     mgcv_1.8-38      utf8_1.2.2
[17] withr_2.4.3      ellipsis_0.3.2   digest_0.6.29    tibble_3.1.6
[21] lifecycle_1.0.1  crayon_1.4.2     Matrix_1.3-4     farver_2.1.0
[25] purrr_0.3.4      vctrs_0.3.8      glue_1.6.0       labeling_0.4.2
[29] compiler_4.1.2   pillar_1.6.4     generics_0.1.1   scales_1.1.1
[33] pkgconfig_2.0.3

Once you are happy that this has worked, we need to commit these changes. First by editing the script, and second, add the package-versions.txt file.

Branching and merging

Suppose that after doing all of this one of your collaborators wants to adjust the figure. We will now go through the steps involved with doing this using branches.

Branching to make changes

Figure fig:demo-result-2 is a modification of Figure fig:demo-result-1 with the desired changes.

./git-usage-2.png

To avoid making changes to the main copy of the code we will work on a branch, and then when we are happy with the changes we will merge them. To start with, create a new branch by clicking on the drop-down menu labelled “main” as shown in Figure fig:create-new-branch. I called it “edit-plot”, but you can use anything other than “main” (because that is the default branch name used by GitHub).

./create-new-branch.png

Make desired edits to the code and output

Making sure that you are on your branch — if you’re not sure, click on the branch button to double check — edit the make-plot.R script so that it has the following

g <- ggplot(
  data = sods_data,
  mapping = aes(x = year, y = percentage)) +
  geom_point() +
  geom_smooth(method = "lm", colour = "darkgrey") +
  geom_text(
    aes(x = 2020, y = 82.8, label = "only GitHub"),
    size = 3,
    nudge_x = 0.2,
    nudge_y = -6) +
  labs(
    x = "Year",
    y = "Percentage who used git") +
  ylim(c(0,100)) +
  theme_bw()

Once you have made the changes and re-run that script the figure in git-usage.png will have changed — it should look like Figure fig:demo-result-2 now. Ordinarily, you would update the figure in the same way that you update code, by committing the changes. However, this is tricky to do via the GitHub website for image files, so instead, delete the file and upload the modified one. At this point it might be interesting to move between the main branch and your new branch to see how the files change between the two.

One motivation for branches is that you can make exploratory changes without risking messing up your code on the main branch. If you have a collaborator that wanted to try something, they could do so on a separate branch and then, if you like their edits, you can merge them into main as we are about to do now.

Merge the changes

To merge your changes via the website, go back to the main page of the repository and you should see a new button, like the one shown in Figure fig:pull-request, inviting you to compare the changes on this branch, i.e., to inspect if you consider this work worthy of inclusion.

./pull-request.png

Inspect the differences between the branches and if you are happy with them create a pull request by clicking the button as shown in Figure fig:create-pull-request.

./create-pull-request.png

Once you have created the pull request, the next step is to merge that branch into the main branch. To do this you just need to click the button shown in Figure fig:merge-pull-request.

./merge-pull-request.png

Once a branch has been merged it will hang around until you delete it. Since having old branches around can lead to confusion, it is sensible to delete them afterwards. As shown in Figure fig:delete-branch there is a button to achieve this.

./delete-branch.png

At this point you should only have a single branch left and it should have the modifications to the figure. Congratulations on a reproducible analysis!

Next steps and alternative solutions

Upload to Zenodo

The Zenodo FAQs contain information about how to archive a GitHub repository if you want a more permanent form of storage. Ideally, one would archive the commit used to generate the contents of a manuscript so it has a DOI and reference both the archive and the live version of the code on GitHub in the manuscript.

Learn more about git

Learn more about GitHub

There are lots of features in GitHub that haven’t been covered but may be worth looking into: the issue tracker, the wiki, VSCode integration, and GitHub Pages and GitHub Actions.

Alternative solutions

Git

Git has the greatest market share but there are alternatives such as Subversion, Mercurial, CVS and Darcs. Given that the vast majority of people use git, your time is probably best spent learning git.

GitHub

While git dominates the market as the choice of version control system, there are many viable alternatives platforms to GitHub which may be more suitable for your needs:

Zenodo

There are good general purpose alternatives to Zenodo such as figshare and Dryad. There are also numerous alternatives that are more field specific, such as GISAID.

Homework

Question 1

Explain (in 100–200 words) the purpose of git, GitHub, and Zenodo and the relationship between these things. Describe the value of one feature of GitHub not overed in this tutorial (in 100–150 words).

Question 2

Explain (in 100–200 words) the role of version control in reproducible research. Give an example (in 100–150 words) of a situation in which version control does not suffice to make a piece of work reproducible.

Question 3

Download the following script and data and organise this material in a repository in a suitable way. Give a brief overview of the decisions you made along the way (100–200 words).

Question 4

Fork the repository at https://github.com/aezarebski/conflict-resolution-example and merge the pull request. Note that this will require resolving the conflict in a sensible way. Explain what you did (in <100 words). If you have done this well, the commit log should look like Figure fig:merge-task.

./merge-task.png

Question 5

Read the editorial Ten Simple Rules for Reproducible Computational Research and (in 200–300 words) give a brief explanation of how git and GitHub would or would not be relevant to each rule.

About

A tutorial on the use of git, and GitHub for open and reproducible science. See the live version of this here: https://aezarebski.github.io/misc/github-tutorial/README.html

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • CSS 100.0%