Drop scivision #24

jmarshrossney · 2024-08-24T15:35:07Z

SORRY - I thought I could open a PR on my fork and then change the base repository to after the fact. It turns out I can't (as far as I can see) and so I've had a whole conversation with myself over here on the original PR.

I am proposing to drop scivision as a dependency, but to keep an eye on how it develops in case we want to adopt their model-loading protocol in future.

Why?

Bluntly, there are no functional benefits to using scivision, for us, at this point in time.

As far as I can tell the only reason to keep using it is because we support what they are trying to do and want to inject momentum into the project rather than remove it. That is a good reason, but given that we are in the very early stages of a project and don't know quite the direction we want to take it, I think it's reasonable for us to make this decision down the line.

Another reason for waiting is that intake (which scivision uses heavily) is currently going through a full-scale rewrite, and the maintainers say (see docs)

All old “sources”, whether still working or not, should be considered deprecated.

My instinct is to wait and see how this plays out and if/when scivision gets updated. [EDIT: my worry is that it just won't.. see observation at the end of this comment ]

Finally, the single model in the scivision catalogue that we are interested in locks us into a very old version of Python (3.9).

What changes

I forked the cefas/turing model, updated it to Python 3.12 and stripped out all the stuff that either didn't work or that we didn't need.

So at first approximation the only changes in the code should be that instead of loading the pretrained model from the original repo via scivision.load_pretrained_model, we instead just pip install the fork.

I also found that scivision.load_dataset can be trivially replaced by intake.open_catalog (see comment)

metazool · 2024-08-27T13:54:08Z

I concur with the "remove scivision, for now" take, and here's the longer, philosophical essay version of why: thank you for the PR. I'm happy to merge, waiting to do this in sequence starting before the layout changes.

It was helpful for notebook prototyping and exploration, but ended up brittle to work with (the older python dependency; the requirement for certain EXIF headers to be present to use the data loading code in the model)
Though its central repo still looks relatively active it's not clear what the roadmap is for longer term preservation / continuous update. It's a conversation to try having with Turing Inst folks during RSECon.
This plankton model is a stopgap, there's new work from Turing/Cefas with a choice of a couple of different architectures to try (release pending on a publication by @noushineftekhari that's due out soon!) as well as options for more "foundation model" setups like Extract and store embeddings from BioCLIP #10 that we haven't explored, but could have reuse value well beyond a plankton-specific model. I don't know whether the plan is to package the new model(s) through scivision.
It's because the model is a stopgap that I don't raise any objection to keeping the fork of it in a personal rather than institutional namespace, and otherwise would!

Regarding the sweeping changes to intake though, I thought the scivision codebase and by extension this one was already on the newer Take2. I've got mixed feelings about intake as well; it seems very powerful but almost too generic, in a way that you have to relearn how to configure it slightly every time you work with it? It seemed full of promise at the start as a alternative to STAC for data without a spatio-temporal component (and I'd not come across it other than its connection to scivision).

I found that once i'd constructed a local database with filenames, option for other metadata, in order to store the embeddings, it was a lot more straightforward to keep using that than to refer back to the intake interface
I never managed to persuade intake to consume a whole s3 bucket using a wildcard in a path, or tell whether this was an overoptimistic assumption about what its interface would offer, or a storage permissions configuration issue that i'm not sure i need to understand better. So effectively ended up bypassing intake-xarray in favour of reading the metadata as CSV and the image data separately - and this counter-acts a lot of the benefits of choosing intake.

We do have spatio-temporal metadata available for this project now after #22 - and I honestly can't think of other projects in the image machine learning line of work here that don't- so it could be a lot healthier to look again at the STAC approach, try to keep it standards-oriented, and there are others within CEH whose expertise could be drawn on for an environmental sample STAC extension - some notes and links here #4

metazool

Thank you, needs update to the fork, happy in principle

jmarshrossney added 13 commits August 22, 2024 12:32

don't move to gpu inside prepare_data

724c096

switch to src layout

65cc62a

undo accidental nesting of test folder

9600b33

run black

5807fed

change python -m pytest to pytest in README

49b1d21

update python version; deps in pyproject

70b82c2

load scivision model from forked repo

bc7c43b

move vector store; minor adjustments

6c7aaf2

test install/setup using pip and conda

c03d9f2

minimally remove scivision from ImageEmbeddings notebook

d84c768

update remainder of ImageEmbeddings.md

daa6733

add matplotlib dependence for notebooks

32bdc1f

run black; update flake8 config to ignore directories

2bf3586

jmarshrossney mentioned this pull request Aug 25, 2024

Single-source pip and conda dependencies #26

Closed

metazool mentioned this pull request Aug 27, 2024

Drop scivision jmarshrossney/plankton_ml#1

Closed

Merge branch 'main' into drop-scivision

a21719c

metazool approved these changes Aug 28, 2024

View reviewed changes

jmarshrossney merged commit 59531eb into NERC-CEH:main Aug 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Drop scivision #24

Drop scivision #24

jmarshrossney commented Aug 24, 2024

metazool commented Aug 27, 2024 •

edited

Loading

metazool left a comment

Drop scivision #24

Drop scivision #24

Conversation

jmarshrossney commented Aug 24, 2024

Why?

What changes

metazool commented Aug 27, 2024 • edited Loading

metazool left a comment

Choose a reason for hiding this comment

metazool commented Aug 27, 2024 •

edited

Loading