Commit 071377b

Add in the dependencies and documentation from spin-out project

metazool committed Oct 3, 2024
1 parent 8f39f39
Showing 3 changed files with 95 additions and 1 deletion.
87 changes: 87 additions & 0 deletions PIPELINES.md
@@ -0,0 +1,87 @@

# Luigi Pipeline for Decollaging and Uploading FlowCam Images

## Overview

This Luigi pipeline processes the large `.tif` collage images generated by a FlowCam device. It breaks each collage down into smaller "vignette" images, writes metadata (e.g., latitude, longitude, date, and depth) into the resulting files, and uploads the processed vignettes to a specified destination (e.g., an S3 bucket or an external API).

The pipeline is structured as a series of Luigi tasks, each handling a specific step in the workflow:
1. **Reading Metadata**: Parses `.lst` files to extract metadata.
2. **Decollaging**: Extracts individual images from large `.tif` files.
3. **Uploading**: Uploads processed images to a specified endpoint.

## Pipeline Architecture

The pipeline consists of the following Luigi tasks:

### 1. `ReadMetadata`
- **Purpose**: Reads the `.lst` file to extract metadata for image slicing.
- **Input**: `.lst` file generated by the FlowCam device.
- **Output**: A `.csv` file (`metadata.csv`) containing the parsed metadata, as sketched below.
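
A minimal sketch of this task, assuming the `.lst` file is a delimited table that pandas can read directly; the pipe delimiter and single-file-per-run layout are assumptions, and the real parser may need to skip a FlowCam header block:

```python
import glob

import luigi
import pandas as pd


class ReadMetadata(luigi.Task):
    directory = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"{self.directory}/metadata.csv")

    def run(self):
        # Assumption: one .lst file per run, with pipe-delimited columns.
        lst_file = glob.glob(f"{self.directory}/*.lst")[0]
        df = pd.read_csv(lst_file, sep="|")
        df.to_csv(self.output().path, index=False)
```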

### 2. `DecollageImages`
- **Purpose**: Uses metadata to slice a large `.tif` image into smaller vignette images.
- **Input**: The `metadata.csv` file generated by `ReadMetadata`.
- **Output**: Individual vignette images with EXIF metadata, saved in the specified output directory (see the sketch below).
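
In outline, the slicing might look like the following, using `scikit-image` for image I/O; the bounding-box column names here are hypothetical stand-ins for whatever the FlowCam metadata actually provides:

```python
import pandas as pd
from skimage import io


def decollage(tif_path: str, metadata_csv: str, output_dir: str) -> None:
    collage = io.imread(tif_path)
    df = pd.read_csv(metadata_csv)
    for i, row in df.iterrows():
        # Hypothetical column names for each vignette's bounding box.
        x, y = int(row["image_x"]), int(row["image_y"])
        w, h = int(row["image_w"]), int(row["image_h"])
        io.imsave(f"{output_dir}/vignette_{i}.tif", collage[y : y + h, x : x + w])
```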

### 3. `UploadDecollagedImagesToS3`
- **Purpose**: Uploads processed vignette images to a specified S3 bucket or an external API.
- **Input**: Processed vignette images generated by `DecollageImages`.
- **Output**: A confirmation file (`s3_upload_complete.txt`) indicating a successful upload, as in the sketch below.
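
A sketch of the upload step using plain `boto3` (the commit also adds `aioboto3` for an async variant); the `requires()` link to `DecollageImages` is omitted here, and the credentials are read from the environment as described in the `.env` setup below:

```python
import os
from pathlib import Path

import boto3
import luigi


class UploadDecollagedImagesToS3(luigi.Task):
    output_directory = luigi.Parameter()
    s3_bucket = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"{self.output_directory}/s3_upload_complete.txt")

    def run(self):
        # boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the
        # environment; the custom endpoint supports S3-compatible stores.
        s3 = boto3.client("s3", endpoint_url=os.environ["AWS_URL_ENDPOINT"])
        for image in Path(self.output_directory).glob("*.tif"):
            s3.upload_file(str(image), self.s3_bucket, image.name)
        with self.output().open("w") as f:
            f.write("upload complete")
```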

### 4. `FlowCamPipeline` (Wrapper Task)
- **Purpose**: Runs all of the above tasks in sequence.
- **Dependencies**: Manages the dependency graph and execution order of the entire pipeline; a sketch follows below.
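
A sketch of the wrapper using `luigi.WrapperTask`, which completes once everything it requires is done. The parameter names mirror the CLI flags shown under "Running the Pipeline"; the exact parameters passed through depend on how the tasks above are defined:

```python
import luigi


class FlowCamPipeline(luigi.WrapperTask):
    directory = luigi.Parameter()
    output_directory = luigi.Parameter()
    experiment_name = luigi.Parameter()
    s3_bucket = luigi.Parameter()

    def requires(self):
        # The upload task transitively requires DecollageImages, which in
        # turn requires ReadMetadata, so yielding it pulls in the whole chain.
        yield UploadDecollagedImagesToS3(
            output_directory=self.output_directory,
            s3_bucket=self.s3_bucket,
        )
```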

## Prerequisites

- Python 3.12 or above (matching `requires-python` in `pyproject.toml`)
- The following Python packages:
- `luigi`
- `pandas`
- `numpy`
- `scikit-image`
- `requests`
- `pytest` (for testing)
- `boto3` (for S3 interactions)
- `aioboto3` (for async S3 interactions)
- `fastapi` and `uvicorn` (for the external API)

## Setup and Installation

1. **Clone the Repository**

```bash
git clone https://github.com/your_username/plankton_pipeline_luigi.git
cd plankton_pipeline_luigi
```
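
The Luigi-specific dependencies are grouped under the `pipeline` extra added to `pyproject.toml` in this commit, so, assuming an editable install of the package, they can be pulled in with:

```bash
pip install -e .[pipeline]
```

Alternatively, the conda `environment.yml` in this commit lists the same packages.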

2. **Setup JASMIN credentials**

If using S3 for uploading (e.g., the JASMIN object store), make sure your credentials are set in a `.env` file in the root directory:

```bash
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_URL_ENDPOINT=your_endpoint_url
```
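
At runtime these variables can be loaded with `python-dotenv` (already a project dependency), so that `boto3` finds them in the environment, e.g.:

```python
from dotenv import load_dotenv

load_dotenv()  # reads the AWS_* values from .env into os.environ for boto3
```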

## Running the Pipeline

1. **Start the Luigi Central Scheduler**

The `--logdir` path is optional; set it to a writable location if you don't have permission to write to the default `/var/log`:
```bash
luigid --background --logdir=./logs
```
2. **Run the Pipeline Script**
```bash
python -m luigi --module pipeline.pipeline_decollage FlowCamPipeline \
--directory /path/to/flowcam/data \
--output-directory /path/to/output \
--experiment-name test_experiment \
--s3-bucket your-s3-bucket-name
```
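
For a quick test without the central scheduler from step 1, Luigi's built-in `--local-scheduler` flag can be appended to the same command:

```bash
python -m luigi --module pipeline.pipeline_decollage FlowCamPipeline \
    --directory /path/to/flowcam/data \
    --output-directory /path/to/output \
    --experiment-name test_experiment \
    --s3-bucket your-s3-bucket-name \
    --local-scheduler
```
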
4 changes: 4 additions & 0 deletions environment.yml
@@ -8,12 +8,15 @@ dependencies:
   - pytorch
   - black
   - boto3
+  - aioboto3
   - chromadb
   - dvc[s3]
+  - fastapi
   - flake8
   - isort
   - jupyterlab
   - jupytext
+  - luigi
   - matplotlib
   - pandas
   - Pillow
@@ -25,5 +28,6 @@ dependencies:
   - pip
   - streamlit
   - plotly
+  - uvicorn
   - pip:
     - git+https://github.com/jmarshrossney/resnet50-cefas
5 changes: 4 additions & 1 deletion pyproject.toml
@@ -9,7 +9,8 @@ requires-python = ">=3.12"
 description = "This package supports the processing and analysis of plankton sample data"
 readme = "README.md"
 dependencies = [
-    "boto3",
+    "boto3",
+    "aioboto3",
     "chromadb",
     "dvc[s3]",
     "imagecodecs",
@@ -18,6 +19,7 @@ dependencies = [
     "plotly",
     "pyexiftool",
     "python-dotenv",
+    "requests",
     "scikit-image",
     "scikit-learn",
     "streamlit",
@@ -31,6 +33,7 @@ dependencies = [
 jupyter = ["jupyterlab", "jupytext", "matplotlib", "scikit-learn"]
 test = ["pytest", "pytest-cov", "parameterized"]
 lint = ["flake8", "isort", "ruff"]
+pipeline = ["luigi", "fastapi", "uvicorn"]
 all = ["cyto_ml[jupyter,test,lint]"]

 [tool.jupytext]
