Updating hdbscan-clustering plugin (#498)

* fix merge conflicts
* fix apply manifest
* fix apply manifest
* remove file
* updated hdbscan-clustering-plugin
* fix bug in tests
* fixed random generation of floats
* fixed docker file and shell script for running docker
* fixed docker files
* renamed plugin and fixed merged conflicts
* fixed docker files
PolusAI · Aug 6, 2024 · a6bfd1d
1 parent 014f4dd
commit a6bfd1d
Showing 17 changed files with 804 additions and 0 deletions.
diff --git a/clustering/hdbscan-clustering-tool/.bumpversion.cfg b/clustering/hdbscan-clustering-tool/.bumpversion.cfg
@@ -0,0 +1,27 @@
+[bumpversion]
+current_version = 0.4.8-dev0
+commit = True
+tag = False
+parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\-(?P<release>[a-z]+)(?P<dev>\d+))?
+serialize = 
+	{major}.{minor}.{patch}-{release}{dev}
+	{major}.{minor}.{patch}
+
+[bumpversion:part:release]
+optional_value = _
+first_value = dev
+values = 
+	dev
+	_
+
+[bumpversion:part:dev]
+
+[bumpversion:file:pyproject.toml]
+search = version = "{current_version}"
+replace = version = "{new_version}"
+
+[bumpversion:file:plugin.json]
+
+[bumpversion:file:VERSION]
+
+[bumpversion:file:src/polus/images/clustering/hdbscan_clustering/__init__.py]
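The `parse` option above splits a version string into named parts. A minimal sketch of what that regular expression captures, using only Python's `re` module; the helper name `split_version` is hypothetical, and `fullmatch` is used here for strictness, which is an assumption rather than a statement about how bump2version applies the pattern:

```python
import re

# Pattern copied from the `parse` option in .bumpversion.cfg.
PARSE = re.compile(
    r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"(\-(?P<release>[a-z]+)(?P<dev>\d+))?"
)


def split_version(version: str) -> dict:
    """Return the named parts of a version string (hypothetical helper).

    Parts that are absent, such as `release` in "0.4.8", come back as None.
    """
    match = PARSE.fullmatch(version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return match.groupdict()


dev_parts = split_version("0.4.8-dev0")
final_parts = split_version("0.4.8")
print(dev_parts)
print(final_parts)
```

The optional trailing group is what lets both `serialize` formats, with and without the `-{release}{dev}` suffix, round-trip through the same pattern.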
diff --git a/clustering/hdbscan-clustering-tool/.gitignore b/clustering/hdbscan-clustering-tool/.gitignore
@@ -0,0 +1,23 @@
+# Jupyter Notebook
+.ipynb_checkpoints
+poetry.lock
+../../poetry.lock
+# Environments
+.env
+.myenv
+.venv
+env/
+venv/
+# test data directory
+data
+# yaml file
+.pre-commit-config.yaml
+# hidden files
+.DS_Store
+.ds_store
+# flake8
+.flake8
+../../.flake8
+__pycache__
+.mypy_cache
+requirements.txt
diff --git a/clustering/hdbscan-clustering-tool/Dockerfile b/clustering/hdbscan-clustering-tool/Dockerfile
@@ -0,0 +1,21 @@
+FROM polusai/bfio:2.3.6
+
+# environment variables defined in polusai/bfio
+ENV EXEC_DIR="/opt/executables"
+ENV POLUS_LOG="INFO"
+ENV POLUS_IMG_EXT=".ome.tif"
+ENV POLUS_TAB_EXT=".csv"
+
+# Work directory defined in the base container
+WORKDIR ${EXEC_DIR}
+
+COPY pyproject.toml ${EXEC_DIR}
+COPY VERSION ${EXEC_DIR}
+COPY README.md ${EXEC_DIR}
+COPY src ${EXEC_DIR}/src
+
+RUN pip3 install ${EXEC_DIR} --no-cache-dir
+
+
+ENTRYPOINT ["python3", "-m", "polus.images.clustering.hdbscan_clustering"]
+CMD ["--help"]
diff --git a/clustering/hdbscan-clustering-tool/README.md b/clustering/hdbscan-clustering-tool/README.md
@@ -0,0 +1,52 @@
+# Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) Clustering (v0.4.8-dev0)
+
+The HDBSCAN Clustering plugin clusters data using the [HDBSCAN clustering](https://pypi.org/project/hdbscan/) library. The input and output of this plugin are CSV files. Each observation (row) in the input CSV file is assigned to one of the clusters. The output CSV file contains the column `cluster`, which identifies the cluster to which each observation belongs. A user can supply a regular expression with capture groups to cluster each group independently, or to average the numerical features across each group and treat each group as a single observation.
+
+## Inputs:
+
+### Input directory:
+This plugin supports all file formats supported by [vaex](https://vaex.readthedocs.io/en/latest/guides/io.html).
+
+### Filename pattern:
+This plugin uses the [filepattern](https://filepattern2.readthedocs.io/en/latest/Home.html) Python library to parse the names of the tabular files to be processed.
+
+### Grouping pattern:
+The input for this parameter is a regular expression with a capture group. It splits the data into groups based on the matched pattern. A new column `group` is created in the output file that holds the group matched by the given pattern. Unless `averageGroups` is set to `true`, providing a grouping pattern will cluster each group independently.
+
+### Average groups:
+Set `averageGroups` to `true` together with `groupingPattern` to average the numerical features and produce a single row per group, which is then clustered. The resulting cluster is assigned to all observations belonging to that group.
+
+### Label column:
+This is the name of the column containing the labels to be used with `groupingPattern`.
+
+### Minimum cluster size:
+This parameter defines the smallest number of points that should be considered a cluster. This is a required parameter. The value must be an integer greater than 1.
+
+### Increment outlier ID:
+When set, this parameter sets the ID of the outlier cluster to `1`; otherwise it is `0`. This is useful for visualization purposes if the resulting cluster IDs are turned into image annotations.
+
+## Output:
+The output is a tabular file containing the clustered data.
+
+## Building
+To build the Docker image for this plugin, run
+`./build-docker.sh`.
+
+## Install WIPP Plugin
+If WIPP is running, navigate to the plugins page and add a new plugin. Paste the contents of `plugin.json` into the pop-up window and submit.
+For more information on WIPP, visit the [official WIPP page](https://isg.nist.gov/deepzoomweb/software/wipp).
+
+## Options
+
+This plugin takes the following input and output arguments:
+
+| Name                   | Description                                                                                    | I/O    | Type          |
+| ---------------------- | ---------------------------------------------------------------------------------------------- | ------ | ------------- |
+| `--inpDir`             | Input tabular data files.                                                                      | Input  | genericData   |
+| `--groupingPattern`    | Regular expression to group rows. Clustering will be applied across capture groups by default. | Input  | string        |
+| `--averageGroups`      | Average data across groups. Requires capture groups                                            | Input  | boolean       |
+| `--labelCol`           | Name of the column containing labels for grouping pattern.                                     | Input  | string        |
+| `--minClusterSize`     | Minimum cluster size.                                                                          | Input  | number        |
+| `--incrementOutlierId` | Increments outlier ID to 1.                                                                    | Input  | boolean       |
+| `--outDir`             | Output collection                                                                              | Output | genericData   |
+| `--preview`            | Generate a JSON file with outputs                                                              | Output | JSON          |
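The grouping and averaging behavior described in the README can be sketched in plain Python. This is an illustration under stated assumptions, not the plugin's implementation (which, per `pyproject.toml`, depends on vaex): the function names, the sample labels, and the pattern `^(plate\d+)_` are all hypothetical.

```python
import re
from collections import defaultdict
from statistics import fmean


def group_rows(labels, pattern):
    """Map each captured group name to the row indices whose label matches it."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        match = regex.search(label)
        if match is None:
            raise ValueError(f"label {label!r} does not match {pattern!r}")
        # The first capture group decides which group the row belongs to.
        groups[match.group(1)].append(index)
    return dict(groups)


def average_groups(features, groups):
    """Collapse each group to one row: the column-wise mean of its members."""
    return {
        name: [fmean(column) for column in zip(*(features[i] for i in indices))]
        for name, indices in groups.items()
    }


# Hypothetical data: one label per row, two numerical features per row.
labels = ["plate1_A01", "plate1_A02", "plate2_A01", "plate2_A02"]
features = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]

groups = group_rows(labels, r"^(plate\d+)_")
means = average_groups(features, groups)
print(groups)
print(means)
```

With `averageGroups` enabled, the averaged rows in `means` would be the observations passed to the clusterer, and each resulting cluster ID would be copied back to every row index listed in `groups`. Without it, the rows of each group would be clustered separately.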
diff --git a/clustering/hdbscan-clustering-tool/VERSION b/clustering/hdbscan-clustering-tool/VERSION
@@ -0,0 +1 @@
+0.4.8-dev0
diff --git a/clustering/hdbscan-clustering-tool/build-docker.sh b/clustering/hdbscan-clustering-tool/build-docker.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+
+version=$(<VERSION)
+docker build . -t polusai/hdbscan-clustering-tool:${version}
diff --git a/clustering/hdbscan-clustering-tool/package-release.sh b/clustering/hdbscan-clustering-tool/package-release.sh
@@ -0,0 +1,16 @@
+# This script is designed to help package a new version of a plugin
+
+# Get the new version
+version=$(<VERSION)
+
+# Bump the version
+bump2version --config-file .bumpversion.cfg --new-version ${version} --allow-dirty part
+
+# Build the container
+./build-docker.sh
+
+# Push to dockerhub
+docker push polusai/hdbscan-clustering-tool:${version}
+
+# Run pytests
+python -m pytest -s tests
diff --git a/clustering/hdbscan-clustering-tool/plugin.json b/clustering/hdbscan-clustering-tool/plugin.json
@@ -0,0 +1,123 @@
+{
+  "name": "Hdbscan Clustering",
+  "version": "0.4.8-dev0",
+  "title": "Hdbscan Clustering",
+  "description": "Cluster the data using HDBSCAN.",
+  "author": "Jayapriya Nagarajan (github.com/Priyaaxle), Hythem Sidky (hythem.sidky@nih.gov) and Hamdah Shafqat Abbasi (hamdahshafqat.abbasi@nih.gov)",
+  "institution": "National Center for Advancing Translational Sciences, National Institutes of Health",
+  "repository": "https://github.com/PolusAI/image-tools",
+  "website": "https://ncats.nih.gov/preclinical/core/informatics",
+  "citation": "",
+  "containerId": "polusai/hdbscan-clustering-tool:0.4.8-dev0",
+  "baseCommand": [
+    "python3",
+    "-m",
+    "polus.images.clustering.hdbscan_clustering"
+  ],
+  "inputs": {
+    "inpDir": {
+      "type": "genericData",
+      "title": "Input tabular data",
+      "description": "Input tabular data.",
+      "required": "True"
+    },
+    "filePattern": {
+      "type": "string",
+      "title": "Filename pattern",
+      "description": "Filename pattern used to separate data.",
+      "required": "False"
+    },
+    "groupingPattern": {
+      "type": "string",
+      "title": "Grouping pattern",
+      "description": "Regular expression for optional row grouping.",
+      "required": "False"
+    },
+    "averageGroups": {
+      "type": "boolean",
+      "title": "Average groups",
+      "description": "Whether to average data across groups. Requires grouping pattern to be defined.",
+      "required": "False"
+    },
+    "labelCol": {
+      "type": "string",
+      "title": "Label Column",
+      "description": "Name of column containing labels. Required for grouping pattern.",
+      "required": "False"
+    },
+    "minClusterSize": {
+      "type": "number",
+      "title": "Minimum cluster size",
+      "description": "Minimum cluster size.",
+      "required": "True"
+    },
+    "incrementOutlierId": {
+      "type": "boolean",
+      "title": "Increment Outlier ID",
+      "description": "Increments outlier ID to 1.",
+      "required": "True"
+    },
+    "preview": {
+      "type": "boolean",
+      "title": "Preview",
+      "description": "Generate an output preview.",
+      "required": "False"
+    }
+  },
+  "outputs": {
+    "outDir": {
+      "type": "genericData",
+      "description": "Output collection."
+    }
+  },
+  "ui": {
+    "inpDir": {
+      "type": "genericData",
+      "title": "Input tabular data",
+      "description": "Input tabular data to be processed by this plugin.",
+      "required": "True"
+    },
+    "filePattern": {
+      "type": "string",
+      "title": "Filename pattern",
+      "description": "Filename pattern used to separate data.",
+      "required": "False"
+    },
+    "groupingPattern": {
+      "type": "string",
+      "title": "Grouping pattern",
+      "description": "Regular expression for optional row grouping.",
+      "required": "False"
+    },
+    "averageGroups": {
+      "type": "boolean",
+      "title": "Average groups",
+      "description": "Whether to average data across groups. Requires grouping pattern to be defined.",
+      "required": "False"
+    },
+    "labelCol": {
+      "type": "string",
+      "title": "Label Column",
+      "description": "Name of column containing labels. Required for grouping pattern.",
+      "required": "False"
+    },
+    "minClusterSize": {
+      "type": "number",
+      "title": "Minimum cluster size",
+      "description": "Minimum cluster size.",
+      "required": "True"
+    },
+    "incrementOutlierId": {
+      "type": "boolean",
+      "title": "Increment Outlier ID",
+      "description": "Increments outlier ID to 1.",
+      "required": "True"
+    },
+    "preview": {
+      "type": "boolean",
+      "title": "Preview",
+      "description": "Generate an output preview.",
+      "required": "False"
+    }
+  }
+}
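The `incrementOutlierId` entry above is easier to follow with a concrete label shift. The hdbscan library marks noise points with the label `-1`; the README states that the outlier ID is `0` by default and `1` when the option is set. The sketch below assumes every cluster ID is shifted by the same offset, which is an assumption about the plugin rather than something this diff shows, and `shift_labels` is a hypothetical name.

```python
def shift_labels(raw_labels, increment_outlier_id=False):
    """Shift HDBSCAN-style labels so noise becomes 0, or 1 when the flag is set.

    Assumption: all labels move by the same offset, so real clusters keep
    their relative order and never collide with the outlier ID.
    """
    offset = 2 if increment_outlier_id else 1
    return [label + offset for label in raw_labels]


raw = [-1, 0, 0, 1, -1]  # -1 marks noise in hdbscan's labels_
default_ids = shift_labels(raw)
incremented_ids = shift_labels(raw, increment_outlier_id=True)
print(default_ids)
print(incremented_ids)
```

Keeping the outlier ID at `1` rather than `0` matters when IDs become image annotations, because `0` is commonly reserved for background.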
diff --git a/clustering/hdbscan-clustering-tool/pyproject.toml b/clustering/hdbscan-clustering-tool/pyproject.toml
@@ -0,0 +1,32 @@
+[tool.poetry]
+name = "polus-images-clustering-hdbscan-clustering"
+version = "0.4.8-dev0"
+description = "Cluster the data using HDBSCAN."
+authors = [
+           "Jayapriya Nagarajan <jayapriya.nagarajan@axleinfo.com>",
+           "Hythem Sidky <hythem.sidky@nih.gov>",
+           "Hamdah Shafqat Abbasi <hamdahshafqat.abbasi@nih.gov>"
+           ]
+readme = "README.md"
+packages = [{include = "polus", from = "src"}]
+
+[tool.poetry.dependencies]
+python = ">=3.9,<3.12"
+filepattern = "^2.0.4"
+typer = "^0.7.0"
+tqdm = "^4.64.1"
+preadator = "0.4.0.dev2"
+vaex = "^4.17.0"
+hdbscan = "^0.8.34rc1"
+
+
+[tool.poetry.group.dev.dependencies]
+pre-commit = "^3.3.3"
+bump2version = "^1.0.1"
+pytest = "^7.3.2"
+pytest-xdist = "^3.3.1"
+pytest-sugar = "^0.9.7"
+
+[build-system]
+requires = ["poetry-core"]
+build-backend = "poetry.core.masonry.api"
diff --git a/clustering/hdbscan-clustering-tool/run-docker.sh b/clustering/hdbscan-clustering-tool/run-docker.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+
+version=$(<VERSION)
+datapath=$(readlink --canonicalize data)
+echo ${datapath}
+
+# Inputs
+inpDir=${datapath}/input
+filePattern=".*.csv"
+groupingPattern="\w+$"
+labelCol="species"
+minClusterSize=3
+outDir=${datapath}/output
+
+docker run -v "${datapath}":"${datapath}" \
+            polusai/hdbscan-clustering-tool:${version} \
+            --inpDir "${inpDir}" \
+            --filePattern "${filePattern}" \
+            --groupingPattern "${groupingPattern}" \
+            --labelCol "${labelCol}" \
+            --minClusterSize "${minClusterSize}" \
+            --incrementOutlierId \
+            --outDir "${outDir}"
diff --git a/...tering/hdbscan-clustering-tool/src/polus/images/clustering/hdbscan_clustering/__init__.py b/...tering/hdbscan-clustering-tool/src/polus/images/clustering/hdbscan_clustering/__init__.py
@@ -0,0 +1,4 @@
+"""Hdbscan Clustering Plugin."""
+
+__version__ = "0.4.8-dev0"
+from . import hdbscan_clustering