This document describes the goal, requirements and plan of this project.
Gain insight in potential problems regarding the licensing of software packages we develop at eScienceCenter
This tool should produce a report about the license compatibility for 1 or more pieces of software and their dependencies.
To use open source software, a license needs to be applied to it. Most software have dependencies which themselves have licenses. Not all licenses can be combined with each other. It is often difficult to find out the licenses of the dependencies, and to gain insight into whether there are any conflicting licenses in use in a software package.
At the Netherlands eScience Center (NLeSC) there is a need for more attention to licensing issues. Personas in the center are
- program manager, someone accountable for software adhering to guidelines and best practices.
- research software engineer, someone writing software
As a program manager at NLeSC I would like have a tool that can find potential license problems in the software created in a project.
As a research software engineer at NLeSC I would like to find out if my piece of software has any license violations.
Regardless of whether we choose to use an existing tool/service or make something ourselves, the solution needs to have the following elements:
-
input is a list of URLs to our repositories on GitHub, e.g. from Research Software Directory
-
visit each, look in the root of the repository for a list of runtime (?) dependencies, for example from dependency files such as
- requirements.txt
- setup.cfg
- setup.py
- Pipfile
- Pipfile.lock
- pyproject.toml
- environment.yml
- yarn.lock
- package.json
- package-lock.json
- DESCRIPTION
when repositories contain subdirectories with any of these dependency files, we need to do more nested/recursing evaluation
-
for each of the project dependencies,
- figure out whose copy is being used (e.g. PyPI, GitHub, crates.io, npm, maven, Anaconda cloud, bower, etc. more here)
- identify the license as stated on the platform whose copy we're using
-
yields a one-to-many graph (a tree) with relations ("If I have a GPL-3.0 here, can I have Apache-2.0 one level higher?")
-
figure out if there is any license of a dependency that yields a conflict with the top level license (should be Apache-2.0) (based on which rules?)
-
choose which result best represents the situation:
- didn't find a dependency file
- found a dependency file,
- every name was resolvable
- no conflicts
- had conflicts (with list of conflicts found)
- not every name was resolvable
- needs human evaluation
- every name was resolvable
After evaluating existing tools we decided to wrap ort in a script. As ort can detect licenses of dependences of a freshly cloned repository and produce a nice HTML report and machine readable reports. Ort is also widely used and supported by the OSS community. The runner up was https://github.com/pivotal/LicenseFinder, but it requires that dependencies are already installed while ort installs them for you.
- Write a script to regularly
- Clone the repository
- For each repo generate a report using ort
- Define curation files
- Start with some predefined curations
- If users want to change or overwrite these curations they can
- Add index page to list repos
- Each repo should have a detailed report (html webapp)
- Instructions for program managers and engineers
- How to run analysis if you are a program manager
- How to run analysis if you are an engineer
- How to update curations
The tool will be written as a GitHub action so it can be called in a GitHub workflow The action has 2 modes:
This mode targets program managers.
Steps in a Github workflow:
- Weekly scheduled (eg fairtally-test)
- Action fetches repos from RSD
- Use ort Docker image from https://hub.docker.com/r/philipssoftware/ort/
- Run https://github.com/tortellini-tools/action/blob/rsd-software-vs-ort/ort/batch.sh, replace shell script with Typescript
- Create
index-<timestamp>.json
with stats of all repos using Typescript - Create symlink/copy
index-<timestamp>.json
to index-latest.json - To S3 Upload index-latest.json,
index-<timestamp>.json
and for each repo- scan-report-web-app.html,
- out.txt
- evaluation-result.yml
- Vue app index.html which shows index-latest.json in table
- Upload files to S3, the S3 bucket can be visted with web browser
Steps 2 .. 8 could be captured in Github Actions like
on:
schedule:
- cron: '0 0 * * 4' # Every thursday
# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:
jobs:
tortellini:
runs-on: ubuntu-latest
steps:
- name: Get the data dump from the RSD
run: curl https://research-software.nl/api/software > software.json
- name: Extract the list of URLs
run: cat software.json | jq -r '[.[].repositoryURLs.github] | flatten | .[]' > urls.txt
- name: Run ort on urls
uses: tortellini-tools/action@v3
with:
repositories: urls.txt
output-dir: results
curations: conf/curations.yml
rules: conf/rules.kts
classifications: conf/license-classifications.yml
- uses: jakejarvis/s3-sync-action@master
with:
args: --acl public-read --follow-symlinks
env:
AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_REGION: 'us-west-1' # optional: defaults to us-east-1
SOURCE_DIR: 'results'
This mode targets engineers.
Using the GitHub Action for a single repo is very similar to using it for multiple repositories, but works on the currently checked out repository instead of on a list of URLs.
on:
release:
# Allows you to run this workflow manually from the Actions tab
workflow_dispatch:
jobs:
tortellini:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Run ort on .
uses: tortellini-tools/action@v3
- uses: actions/upload-artifact@v2
with:
name: tortellini-results
path: results/**
Action repo will have:
- action.yml, GitHub Action definition file
- src/index.ts which will clone repos, start ort container, save output to repos//, render index-latest.json
- package.json with docker library
- README/LICENSE
Repo in NLeSC org with ort default config files
Start repo from boilerplate https://github.com/actions/typescript-action Use same repo as a test