FlowCam visit - documentation #17

Merged
merged 39 commits into from
Aug 8, 2024
daa87c8
high altitude sketches of sources to storage
metazool Jul 9, 2024
f349263
very rudimentary workflow engine diagram
metazool Jul 9, 2024
2c73040
stub of a graphviz re-render workflow
metazool Jul 10, 2024
a9392b3
bash script to re-render diagrams
metazool Jul 10, 2024
22bc41e
guess the syntax for publishing graphs to GH pages
metazool Jul 10, 2024
f316fad
whitespace change to test the pipeline trigger
metazool Jul 10, 2024
bf22b55
cribbing all the jekyll build/deploy from discoverability project
metazool Jul 10, 2024
b2ae1c3
add an index for the diagrams, one page for now
metazool Jul 10, 2024
55063e9
copypasta Jekyll Gemfile
metazool Jul 10, 2024
49794a8
simplify to just test the pages render
metazool Jul 10, 2024
d0aa31a
token change underneath ./docs
metazool Jul 10, 2024
759c1a9
set default working directory
metazool Jul 10, 2024
39cad77
more token change
metazool Jul 10, 2024
64828ad
revert to the deprecated upload action, then back away
metazool Jul 10, 2024
1cfa10f
another whitespace change for pipeline
metazool Jul 11, 2024
c960305
try reinstating the graph render step
metazool Jul 11, 2024
1d39d4d
doc update, really to test pipeline
metazool Jul 11, 2024
433ed67
fiddle about with relative paths in workflow
metazool Jul 11, 2024
e6e63df
tweak graph render to output where Jekyll does
metazool Jul 11, 2024
7224e80
move the graphviz step after the jekyll one
metazool Jul 11, 2024
a290572
whitespace sigh
metazool Jul 11, 2024
b36dda8
tweak the diagram path. general brain typo
metazool Jul 11, 2024
347f354
hopefully the last relative path typo
metazool Jul 11, 2024
df88fce
decorate the docs a bit
metazool Jul 11, 2024
6b21e0c
assume we're rendering underneath ./docs
metazool Jul 11, 2024
5b8e25e
whitespace tweak N
metazool Jul 11, 2024
9fef7d3
embedded style for dark mode diagrams
metazool Jul 16, 2024
e97cf8d
bypass the SVG dark mode issue by using an object tag in markdown
metazool Jul 18, 2024
9fa7756
Brief descriptive text on future data pipeline state
metazool Jul 18, 2024
a15abba
update the site title
metazool Jul 22, 2024
113601f
Merge branch 'diagram_view' into flowcam_visit
metazool Jul 29, 2024
82e9815
Add images and video clips from the FlowCam visit
metazool Jul 29, 2024
513f48c
add image links and alt text
metazool Jul 29, 2024
c816bb3
correct relative image paths
metazool Jul 29, 2024
688b8d3
Two more images of the instrument
metazool Jul 29, 2024
e4f4966
short story about a visit to the flow microscope
metazool Jul 29, 2024
d5af31a
whitespace for readability
metazool Jul 29, 2024
1758c93
Merge branch 'main' into flowcam_visit
metazool Jul 30, 2024
1646943
fix duplicate link due to number blindness
metazool Jul 30, 2024
10 changes: 8 additions & 2 deletions .github/workflows/lint.yml
Original file line number Diff line number Diff line change
@@ -1,7 +1,13 @@
name: flake8 Lint

on: [push, pull_request]

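on:
  push:
    paths:
      # A bare "cyto_ml" only matches a path with exactly that name;
      # the glob is needed to match files inside the directory
      - "cyto_ml/**"
  pull_request:
    paths:
      - "cyto_ml/**"

jobs:
  flake8-lint:
    runs-on: ubuntu-latest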
59 changes: 59 additions & 0 deletions .github/workflows/pages_graphs.yml
@@ -0,0 +1,59 @@
name: Pages and Graphviz re-render
on:
  push:
    paths: 'docs/**/*'

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

jobs:
  build:
    name: Rebuild graphs and pages
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: docs
    steps:
      - uses: actions/checkout@v4
      - name: Setup Ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: '3.3' # Not needed with a .ruby-version file
          bundler-cache: true # runs 'bundle install' and caches installed gems automatically
          cache-version: 0 # Increment this number if you need to re-download cached gems
          working-directory: '${{ github.workspace }}/docs'
      - name: Setup Pages
        id: pages
        uses: actions/configure-pages@v3
      - name: Build with Jekyll
        # Outputs to the './_site' directory by default
        # Will this copy the diagrams tho
        run: bundle exec jekyll build --baseurl "${{ steps.pages.outputs.base_path }}"
        env:
          JEKYLL_ENV: production
      - uses: ts-graphviz/setup-graphviz@v2
      - name: Diagrams
        run: chmod +x ../scripts/render_diagrams.sh; bash ../scripts/render_diagrams.sh
      - name: Upload artifact
        # Automatically uploads an artifact from the './_site' directory by default
        uses: actions/upload-pages-artifact@v1
        with:
          path: "docs/_site"

  # Deployment job
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v2
1 change: 1 addition & 0 deletions .github/workflows/pytest_coverage.yml
@@ -2,6 +2,7 @@
on:
  pull_request:
    branches: [ "main" ]
    # A bare "cyto_ml" only matches a path with exactly that name;
    # the glob is needed to match files inside the directory
    paths: "cyto_ml/**"

jobs:
tests:
35 changes: 35 additions & 0 deletions docs/Gemfile
@@ -0,0 +1,35 @@
source "https://rubygems.org"
# Hello! This is where you manage which Jekyll version is used to run.
# When you want to use a different version, change it below, save the
# file and run `bundle install`. Run Jekyll with `bundle exec`, like so:
#
# bundle exec jekyll serve
#
# This will help ensure the proper Jekyll version is running.
# Happy Jekylling!
#gem "jekyll", "~> 4.3.3"
# This is the default theme for new Jekyll sites. You may change this to anything you like.
gem "minima", "~> 2.5"
# If you want to use GitHub Pages, remove the "gem "jekyll"" above and
# uncomment the line below. To upgrade, run `bundle update github-pages`.
gem "github-pages", "~> 231", group: :jekyll_plugins
gem "webrick"
gem "just-the-docs"
# If you have any plugins, put them here!
group :jekyll_plugins do
gem "jekyll-feed", "~> 0.12"
end

# Windows and JRuby does not include zoneinfo files, so bundle the tzinfo-data gem
# and associated library.
platforms :mingw, :x64_mingw, :mswin, :jruby do
gem "tzinfo", ">= 1", "< 3"
gem "tzinfo-data"
end

# Performance-booster for watching directories on Windows
gem "wdm", "~> 0.1.1", :platforms => [:mingw, :x64_mingw, :mswin]

# Lock `http_parser.rb` gem to `v0.6.x` on JRuby builds since newer versions of the gem
# do not have a Java counterpart.
gem "http_parser.rb", "~> 0.6.0", :platforms => [:jruby]
12 changes: 12 additions & 0 deletions docs/_config.yml
@@ -0,0 +1,12 @@
title: Plankton ML / pipelines
email: jowals@ceh.ac.uk
description: >- # this means to ignore newlines until "baseurl:"
  This repository contains code, proofs of concept, test cases and workflows for low-investment methods to apply image machine learning to plankton characterisation.
baseurl: "" # the subpath of your site, e.g. /blog
url: "" # the base hostname & protocol for your site, e.g. http://example.com
github_username: metazool

# Build settings
theme: just-the-docs
plugins:
- jekyll-feed
34 changes: 34 additions & 0 deletions docs/diagrams/as_is/instrument_to_store.dot
@@ -0,0 +1,34 @@
# http://www.graphviz.org/content/cluster

digraph G {
rankdir=LR;
graph [fontname = "Handlee"];
node [fontname = "Handlee"];
edge [fontname = "Handlee"];

bgcolor=transparent;

scope [shape=rect label="Microscope \n(FlowCam)"];
pc [shape=rect label="Local PC"]

scope2 [shape=rect label="Laser Imaging \n(Flow Cytometer)"];
pc2 [shape=rect label="Local PC"]

san [shape=cylinder label="SAN \nprivate cloud"]
vm [shape=rect label="VM \nprivate cloud"]
store [shape=cylinder label="S3 \nobject store"]

vm->store [label="triggered by app?" fontsize=10];
scope->pc
scope2->pc2

pc2->san [label="physically, via USB stick", fontsize=10];
pc->san [label="physically, via USB stick", fontsize=10];


san->vm [dir=back] [label="manually run script" fontsize=10];

}



33 changes: 33 additions & 0 deletions docs/diagrams/could_be/instrument_to_store.dot
@@ -0,0 +1,33 @@
# http://www.graphviz.org/content/cluster

digraph G {
rankdir=LR;
graph [fontname = "Handlee"];
node [fontname = "Handlee"];
edge [fontname = "Handlee"];

bgcolor=transparent;

scope [shape=rect label="Microscope \n(FlowCam)"];
pc [shape=rect label="Local PC"]

scope2 [shape=rect label="Laser imaging \n(Flow Cytometer)"];
pc2 [shape=rect label="Local PC"]

san [shape=cylinder label="SAN \nprivate cloud"]
engine [shape=rect label="Workflow engine"]
tasks [label="Task graph"]
store [shape=cylinder label="S3 \nobject store"]

engine->tasks
tasks->san;
tasks->store [];
scope->pc
scope2->pc2

pc2->san [label="pull on a schedule?", dir=back,fontsize=10];

pc->san [label="push on a schedule?", fontsize=10];

}

22 changes: 22 additions & 0 deletions docs/diagrams/could_be/task_graph.dot
@@ -0,0 +1,22 @@
# http://www.graphviz.org/content/cluster

digraph G {
rankdir=LR;

edge [fontname = "Handlee"];

graph [fontsize=10 fontname="Handlee"];
node [shape=record fontsize=10 fontname="Handlee"];

bgcolor=transparent;

subgraph cluster_0 {
style=filled;
color=lightgrey;
node [color=white,style=filled];
store -> chunk -> sift -> profile -> upload;
label = "Task flow";
fontsize = 20;
}
}

33 changes: 33 additions & 0 deletions docs/diagrams/index.md
@@ -0,0 +1,33 @@
---
# Feel free to add content and custom Front Matter to this file.
# To modify the layout, see https://jekyllrb.com/docs/themes/#overriding-theme-defaults

layout: home
title: Plankton ML - workflow diagrams
---

# Workflow Diagrams

Views of the flow of data from the imaging instrument to cloud-accessible storage

### As is

Data saved during a session with the microscope is downloaded onto a USB key, then uploaded from a researcher's laptop into a shared storage area on a site-specific SAN.

Later, a data scientist logs into a virtual machine in the on-premises "private cloud" and runs several scripts to read the data, process it for analysis, and upload it to S3 storage hosted at JASMIN. Authorisation in this chain requires personal credentials.

<object data="as_is/instrument_to_store.svg" type="image/svg+xml">
</object>

The file naming conventions carry metadata, which doesn't follow the same path as the data, and there are spatio-temporal properties of the samples which could be recorded.

### Could be

The PC that drives the instrument is connected to the storage network, but not to the internet (for security standards compliance reasons). What are the current precedents for either saving output directly to shared storage, or running a watcher process that pulls or pushes data from a lab PC to networked storage?

An automated workflow (it could be based on Apache Airflow or Beam; the FDRI project is trialling components) watches for new source data, distributes the preprocessing with Dask or Spark if necessary, and continuously publishes analysis-ready data _and metadata_ to cloud storage.

<object data="could_be/instrument_to_store.svg" type="image/svg+xml">
</object>
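
The "watches for new source data" step could be sketched as a simple polling function. This is a minimal illustration rather than a proposed design: the function name `find_new_files`, the `*.tif` pattern and the in-memory `seen` set are all assumptions, and a workflow engine would normally provide its own sensor or trigger for this.

```python
import tempfile
from pathlib import Path


def find_new_files(source_dir: Path, seen: set[str], pattern: str = "*.tif") -> list[Path]:
    """Return files in source_dir that are not yet in `seen`, and record them.

    `pattern` is a hypothetical file extension for the exported images.
    """
    new_files = []
    for path in sorted(source_dir.glob(pattern)):
        if path.name not in seen:
            seen.add(path.name)
            new_files.append(path)
    return new_files


# Demonstration against a temporary directory standing in for shared storage
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp)
    seen: set[str] = set()
    (source / "a.tif").touch()
    first = [p.name for p in find_new_files(source, seen)]
    (source / "b.tif").touch()
    second = [p.name for p in find_new_files(source, seen)]

print(first, second)
```

Each call only returns files that have appeared since the previous call, so a scheduled job could hand just the new arrivals to the preprocessing tasks. A real deployment would need to persist `seen` between runs.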


Binary file added docs/flowcam/images/20240725_154135.jpg
Binary file added docs/flowcam/images/20240725_154320.jpg
Binary file added docs/flowcam/images/20240725_154511.jpg
Binary file added docs/flowcam/images/20240725_154600.jpg
Binary file added docs/flowcam/images/20240725_154812.jpg
Binary file added docs/flowcam/images/20240725_155207.jpg
Binary file added docs/flowcam/images/20240725_155433.jpg
Binary file added docs/flowcam/images/20240725_161537.jpg
Binary file added docs/flowcam/images/20240725_161806.jpg
Binary file added docs/flowcam/images/20240725_162442.jpg
Binary file added docs/flowcam/images/20240725_163521.jpg
Binary file added docs/flowcam/images/YouCut_20240729_123745260.mp4
Binary file not shown.
Binary file added docs/flowcam/images/YouCut_20240729_124027460.mp4
Binary file not shown.
79 changes: 79 additions & 0 deletions docs/flowcam/index.md
@@ -0,0 +1,79 @@
---
# Feel free to add content and custom Front Matter to this file.
# To modify the layout, see https://jekyllrb.com/docs/themes/#overriding-theme-defaults

layout: home
title: Plankton ML - FlowCam walkthrough
---

# FlowCam walkthrough

Report of a visit to the FlowCam instrument in Lancaster on 25/07/2024 to see the flow from specimen to analysis data, to see the interpretation through a researcher's eyes, and to understand the problem better.

## Sampling

The samples are collected once every two weeks. At this stage the plankton are suspended in ether. The ones we analysed had been collected a day or two earlier.
<img src="images/20240725_154135.jpg" alt="Contents of a sample jar, plankton suspended in ether" style="max-height:640px;max-width:640px;">

They're washed through a very fine sieve and then diluted back into a beaker of water.
<img src="images/20240725_154320.jpg" alt="Sample jars marked with site, date, and depth" style="max-height:640px;max-width:640px;">


The diluted sample runs through the hose and through this flat section between two glass plates, which is where the camera points.
<img src="images/20240725_154600.jpg" alt="The diluted sample runs through the hose and through this flat section between two glass plates, which is where the camera points" style="max-height:640px;max-width:640px;">

Sampling through analysis involves swirling a hose through the water while it pumps through the flat section at an adjustable pressure.

<video height="640" controls>
<source src="images/YouCut_20240729_124027460.mp4" type="video/mp4"/>

</video>


At this stage the researchers are looking for relative volumes of half a dozen plankton genera, with an index to their types and typical features helpfully printed out and stuck to the window.

<img src="images/20240725_154812.jpg" alt="Window display of print-outs of plankton genera with identifying shapes and features" style="max-height:640px;max-width:640px;">

An onboard model does basic object detection of specific plankton as they pass through the flat section. You can see that it's picking up _a lot_ of empty space - either slightly out of sync with the flow through the device, or with a very low threshold of acceptance for what it perceives as possibly plankton, or both.

<video width="640" controls>
<source src="images/YouCut_20240729_123745260.mp4" type="video/mp4"/>
Short video of the FlowCam in action, stream through the device on the right, output of object detection on the left
</video>


This results in a set of "collage" images which are what we see _at the start of_ the pipeline in this project. In the sample we recorded, there were 350 pages of these and over 250 of them were purely blank images.

<img src="images/20240725_162442.jpg" alt="A page of the output once it has stopped showing blank images" style="max-height:640px;max-width:640px;">
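
With so many blank pages in an export, a cheap filter before any further processing may be worthwhile. The following is a minimal sketch, assuming each page is loaded as a greyscale NumPy array; the function names and the `tolerance` threshold are hypothetical and would need tuning against real exports.

```python
import numpy as np


def is_blank(image: np.ndarray, tolerance: float = 2.0) -> bool:
    """Return True when pixel intensities barely vary across the image.

    `tolerance` is a hypothetical threshold on the standard deviation of
    pixel values (0-255 scale), not a value taken from the project.
    """
    return float(np.std(image.astype(np.float64))) <= tolerance


def keep_non_blank(pages: list[np.ndarray]) -> list[int]:
    """Return the indices of pages that contain more than plain background."""
    return [i for i, page in enumerate(pages) if not is_blank(page)]


# Synthetic stand-ins for collage pages: two uniform pages and one with an object
blank = np.full((64, 64), 255, dtype=np.uint8)
busy = blank.copy()
busy[20:40, 20:40] = 30
print(keep_non_blank([blank, busy, blank]))
```

A standard deviation test is deliberately crude: it would also discard a page holding a single very faint object, so the threshold is a trade-off rather than a safe default.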

There's quite a lot of computer vision happening onboard the FlowCam. Here it's doing edge detection and deriving a set of metrics for shapes (area, circularity, complexity) which could be used for shallow ML approaches.

<img src="images/20240725_161537.jpg" alt="Collaged images showing the onboard edge detection function" style="max-height:640px;max-width:640px;">
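
For illustration, two of these shape metrics can be approximated from a binary mask with NumPy alone. This is a sketch under stated assumptions and not the FlowCam's own method: the perimeter is estimated by counting exposed pixel edges, and circularity uses the common definition 4πA/P², which may differ from the instrument's definition.

```python
import math

import numpy as np


def shape_metrics(mask: np.ndarray) -> dict[str, float]:
    """Compute area, perimeter and circularity for the foreground of a mask."""
    mask = mask.astype(bool)
    area = int(mask.sum())
    # Pad with background so objects touching the border still have edges counted
    padded = np.pad(mask, 1, constant_values=False)
    # Count pixel edges where foreground meets background, in both directions
    perimeter = int(
        np.sum(padded[1:, :] != padded[:-1, :])
        + np.sum(padded[:, 1:] != padded[:, :-1])
    )
    circularity = 4 * math.pi * area / perimeter**2 if perimeter else 0.0
    return {"area": area, "perimeter": perimeter, "circularity": circularity}


# A 10 x 10 pixel square as a stand-in for a segmented plankton mask
square = np.zeros((20, 20), dtype=bool)
square[5:15, 5:15] = True
metrics = shape_metrics(square)
print(metrics)
```

Counting pixel edges overestimates the perimeter of curved outlines, so values from this sketch are only comparable with each other, not with figures reported by the instrument.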

We don't see a data dump of these in a way that can be recoupled with the exported collage images - I'm told that involves a licence that we don't have resource for - or any sign of a programmatic interface for development on the FlowCam.

<img src="images/20240725_161806.jpg" alt="Blurred view of interesting metrics derived from the edge detection" style="max-height:640px;max-width:640px;">

There's an amount of potentially interesting intermediate image data left behind on the device - including snapshots of the raw flow through the camera, and all the binary masks of the collages from which the shape analysis is done.

<img src="images/20240725_163521.jpg" alt="Binary masks stored with the intermediate images" style="max-height:640px;max-width:640px;">

The exported collage images are managed using a file naming convention which includes geographic location (WGS84 lon/lat) and sample depth as well as date. This is getting detached from the single-plankton images in the current workflow, and we very much need to preserve it. I'd wondered if depending so heavily on file naming conventions was an overhead, but it looks like a good affordance for the researcher's workflow in the FlowCam application; they take the previous session and tweak a small part of the filename.

<img src="images/20240725_155433.jpg" alt="Example of the file naming convention with embedded metadata" style="max-height:640px;max-width:640px;">
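
One way to keep this metadata attached to the data is to parse it out of the filename as early as possible. The sketch below assumes an entirely hypothetical naming pattern (`<lon>_<lat>_<YYYY-MM-DD>_<depth>m`); the real convention is the one shown in the photo above and the regular expression would need rewriting to match it.

```python
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical pattern, e.g. "-2.95_54.37_2024-07-23_5m_page12.tif"
PATTERN = re.compile(
    r"(?P<lon>-?\d+(?:\.\d+)?)_(?P<lat>-?\d+(?:\.\d+)?)_"
    r"(?P<day>\d{4}-\d{2}-\d{2})_(?P<depth>\d+(?:\.\d+)?)m"
)


@dataclass(frozen=True)
class SampleMetadata:
    lon: float  # WGS84 longitude
    lat: float  # WGS84 latitude
    sampled_on: date
    depth_m: float


def parse_sample_name(filename: str) -> SampleMetadata:
    """Extract location, date and depth from a filename, or raise ValueError."""
    match = PATTERN.search(filename)
    if match is None:
        raise ValueError(f"No sample metadata found in {filename!r}")
    return SampleMetadata(
        lon=float(match["lon"]),
        lat=float(match["lat"]),
        sampled_on=date.fromisoformat(match["day"]),
        depth_m=float(match["depth"]),
    )


meta = parse_sample_name("-2.95_54.37_2024-07-23_5m_page12.tif")
print(meta)
```

Raising an error on an unparseable name, rather than returning empty metadata, makes it harder for a renamed file to slip through the pipeline with its context silently lost.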

The FlowCam unit has a built-in PC running Windows, rather than attaching to an external one. I didn't ask what version, whether there's any lifecycle for it receiving updates, or whether support through updates has an extra manufacturer cost.

<img src="images/20240725_154511.jpg" alt="The FlowCam unit has a built-in PC running Windows" style="max-height:640px;max-width:640px;">

It has an ethernet port, but isn't connected to the network; the reason cited is the default implementation of Cyber Essentials Plus for risk management. Data is transferred via USB stick, and when the disk fills with intermediate images they're simply deleted.

<img src="images/20240725_155207.jpg" alt="Ports available on the back of the FlowCam instrument's integrated PC" style="max-height:640px;max-width:640px;">

## Next steps

* [Diagrams](../diagrams/) show the as-is and could-be versions of a data pipeline which takes the exported FlowCam images, breaks them back down into single plankton samples and publishes them to an object store for use with model building
* The workflow loses metadata at too many points, though a lot of it's knitted up by the file naming convention
* The arduous part from the researcher POV is paging through images classifying them by hand and eye. A model interpretation of them retrospectively, done in the cloud, isn't going to reduce the immediate work in the lab; at worst it risks casualising it, by reducing the _apparent_ need for expertise in interpretation at the point of sampling
* The recommendation is to take small steps (a "hello world" Python application, then `scikit-image`, then a PyTorch-based classifier) towards running models on the instrument itself, as direct assistance to the researcher. That's hard to justify for a single-purpose application, but it's quite interesting as a feasibility test of a general approach that could influence future design (there's already precedent for [running .NET applications directly on a flow cytometer](https://github.com/OBAMANEXT/cyz2json))
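
As an illustration of the "breaks them back down into single plankton samples" step, here is a minimal sketch that crops each connected region out of a collage. It is not the project's implementation: it assumes a greyscale collage on a uniform background (the `background` value of 255 is a guess), and the function names are hypothetical.

```python
from collections import deque

import numpy as np


def object_boxes(mask: np.ndarray) -> list[tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) boxes, one per 4-connected region."""
    mask = mask.astype(bool)
    visited = np.zeros_like(mask, dtype=bool)
    height, width = mask.shape
    boxes = []
    for row, col in zip(*np.nonzero(mask)):
        if visited[row, col]:
            continue
        # Breadth-first flood fill from this unvisited foreground pixel
        queue = deque([(int(row), int(col))])
        visited[row, col] = True
        top, left, bottom, right = int(row), int(col), int(row), int(col)
        while queue:
            r, c = queue.popleft()
            top, bottom = min(top, r), max(bottom, r)
            left, right = min(left, c), max(right, c)
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nr < height and 0 <= nc < width
                if inside and mask[nr, nc] and not visited[nr, nc]:
                    visited[nr, nc] = True
                    queue.append((nr, nc))
        # bottom and right are exclusive, so the box can be used as a slice
        boxes.append((top, left, bottom + 1, right + 1))
    return boxes


def split_collage(collage: np.ndarray, background: int = 255) -> list[np.ndarray]:
    """Crop every non-background region out of a 2D greyscale collage."""
    boxes = object_boxes(collage != background)
    return [collage[top:bottom, left:right] for top, left, bottom, right in boxes]


# Synthetic collage with two objects on a white background
collage = np.full((20, 30), 255, dtype=np.uint8)
collage[2:6, 3:8] = 10
collage[10:15, 20:25] = 50
crops = split_collage(collage)
print([crop.shape for crop in crops])
```

Keeping the bounding box alongside each crop would also give a natural place to attach the sample metadata that the current workflow loses.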

20 changes: 20 additions & 0 deletions docs/index.md
@@ -0,0 +1,20 @@
---
# Feel free to add content and custom Front Matter to this file.
# To modify the layout, see https://jekyllrb.com/docs/themes/#overriding-theme-defaults

layout: home
title: Plankton ML
---

# Plankton ML

This is a small experimental project on automating the analysis of plankton images. It aims to:

* Inform related work on reproducible analytical pipelines for bioimage machine learning by grounding them in a concrete use case
* Evaluate reusable components (e.g. the Cefas plankton model from scivision) and associated trade-offs
* Evolve a shared template for similar smaller projects undertaken by members of the RSE group in the Environmental Data Service, UK Centre for Ecology and Hydrology

Please see the associated GitHub repository, which has [outline tasks in Issues](https://github.com/NERC-CEH/plankton_ml/issues) and [prototype work in pull requests](https://github.com/NERC-CEH/plankton_ml/pulls).



26 changes: 26 additions & 0 deletions scripts/render_diagrams.sh
@@ -0,0 +1,26 @@
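#!/bin/bash
# Copilot generated script to render diagrams as SVG
# Run from the docs directory, after the Jekyll build has created _site/

# Stop on the first error, and skip directories that contain no .dot files
set -euo pipefail
shopt -s nullglob

# Source diagrams, and the Jekyll output directory to render them into
DIR="./diagrams/"
SITE="./_site/diagrams/"

# Loop through each subdirectory
for sub_dir in "$DIR"*/; do
    # Mirror the subdirectory under the Jekyll output directory
    dir_path="$SITE${sub_dir#"$DIR"}"
    mkdir -p "$dir_path"

    # Loop through each dot file in the subdirectory
    for dotfile in "$sub_dir"*.dot; do
        # Get the base name without extension
        base_name=$(basename "$dotfile" .dot)
        output="$dir_path$base_name.svg"

        # Render the dot file to SVG
        dot -Tsvg "$dotfile" -o "$output"

        # Print a success message
        echo "Rendered $dotfile to $output"
    done
done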

