Updating hdbscan-clustering plugin (#498)
* fix merge conflicts

* fix apply manifest

* fix apply manifest

* remove file

* updated hdbscan-clustering-plugin

* fix bug in tests

* fixed random generation of floats

* fixed docker file and shell script for running docker

* fixed docker files

* renamed plugin and fixed merged conflicts

* fixed docker files
hamshkhawar authored Aug 6, 2024
1 parent 014f4dd commit a6bfd1d
Showing 17 changed files with 804 additions and 0 deletions.
27 changes: 27 additions & 0 deletions clustering/hdbscan-clustering-tool/.bumpversion.cfg
@@ -0,0 +1,27 @@
[bumpversion]
current_version = 0.4.8-dev0
commit = True
tag = False
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(\-(?P<release>[a-z]+)(?P<dev>\d+))?
serialize =
    {major}.{minor}.{patch}-{release}{dev}
    {major}.{minor}.{patch}

[bumpversion:part:release]
optional_value = _
first_value = dev
values =
    dev
    _

[bumpversion:part:dev]

[bumpversion:file:pyproject.toml]
search = version = "{current_version}"
replace = version = "{new_version}"

[bumpversion:file:plugin.json]

[bumpversion:file:VERSION]

[bumpversion:file:src/polus/images/clustering/hdbscan_clustering/__init__.py]
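As an aside, the `parse` expression above splits a version string such as `0.4.8-dev0` into named parts. The following quick Python illustration is not part of the committed files; it only shows the named groups that regex produces:

```python
# Illustration only: how the bumpversion `parse` regex splits a version string.
import re

PARSE = (
    r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"(\-(?P<release>[a-z]+)(?P<dev>\d+))?"
)

match = re.fullmatch(PARSE, "0.4.8-dev0")
assert match is not None
print(match.groupdict())
# {'major': '0', 'minor': '4', 'patch': '8', 'release': 'dev', 'dev': '0'}
```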
23 changes: 23 additions & 0 deletions clustering/hdbscan-clustering-tool/.gitignore
@@ -0,0 +1,23 @@
# Jupyter Notebook
.ipynb_checkpoints
poetry.lock
../../poetry.lock
# Environments
.env
.myenv
.venv
env/
venv/
# test data directory
data
# yaml file
.pre-commit-config.yaml
# hidden files
.DS_Store
.ds_store
# flake8
.flake8
../../.flake8
__pycache__
.mypy_cache
requirements.txt
21 changes: 21 additions & 0 deletions clustering/hdbscan-clustering-tool/Dockerfile
@@ -0,0 +1,21 @@
FROM polusai/bfio:2.3.6

# environment variables defined in polusai/bfio
ENV EXEC_DIR="/opt/executables"
ENV POLUS_LOG="INFO"
ENV POLUS_IMG_EXT=".ome.tif"
ENV POLUS_TAB_EXT=".csv"

# Work directory defined in the base container
WORKDIR ${EXEC_DIR}

COPY pyproject.toml ${EXEC_DIR}
COPY VERSION ${EXEC_DIR}
COPY README.md ${EXEC_DIR}
COPY src ${EXEC_DIR}/src

RUN pip3 install ${EXEC_DIR} --no-cache-dir


ENTRYPOINT ["python3", "-m", "polus.images.clustering.hdbscan_clustering"]
CMD ["--help"]
52 changes: 52 additions & 0 deletions clustering/hdbscan-clustering-tool/README.md
@@ -0,0 +1,52 @@
# Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) Clustering (v0.4.8-dev0)

The HDBSCAN Clustering plugin clusters tabular data using the [hdbscan](https://pypi.org/project/hdbscan/) library. The input and output for this plugin are tabular files (CSV by default). Each observation (row) in the input file is assigned to one of the clusters, and the output file contains a `cluster` column identifying the cluster to which each observation belongs. A user can supply a regular expression with capture groups to cluster each group independently, or to average the numerical features across each group and treat it as a single observation.

## Inputs:

### Input directory:
This plugin supports all [vaex](https://vaex.readthedocs.io/en/latest/guides/io.html)-supported file formats.

### Filename pattern:
This plugin uses the [filepattern](https://filepattern2.readthedocs.io/en/latest/Home.html) Python library to parse the file names of the tabular files to be processed.

### Grouping pattern:
The input for this parameter is a regular expression with a capture group. It splits the data into groups based on the matched pattern, and a new `group` column recording each row's group is added to the output file. Unless `averageGroups` is set to `true`, providing a grouping pattern clusters each group independently.

### Average groups:
Set this to `true` to use the `groupingPattern` to average the numerical features and produce a single row per group, which is then clustered. The resulting cluster is assigned to all observations belonging to that group.

### Label column:
This is the name of the column containing the labels to be used with `groupingPattern`.
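The three options above work together: the grouping pattern extracts a group from the label column, and averaging collapses each group into one observation before clustering. The following minimal Python sketch is illustrative only, not the plugin's source; the column names, data, and regex are hypothetical.

```python
# Illustrative sketch: group rows by a regex capture group applied to a label
# column, then average numerical features per group.
import re
from statistics import mean

# Hypothetical tabular data with a "species" label column.
rows = [
    {"species": "setosa_01", "petal_length": 1.4, "petal_width": 0.2},
    {"species": "setosa_02", "petal_length": 1.3, "petal_width": 0.2},
    {"species": "virginica_01", "petal_length": 6.0, "petal_width": 2.5},
]

grouping_pattern = r"^([a-z]+)_"  # hypothetical regex with one capture group
label_col = "species"

groups: dict[str, list[dict]] = {}
for row in rows:
    match = re.search(grouping_pattern, row[label_col])
    row["group"] = match.group(1) if match else ""  # `group` column in the output
    groups.setdefault(row["group"], []).append(row)

# With averaging enabled, each group collapses to a single observation, which
# is what gets clustered; the resulting cluster ID is then assigned back to
# every row in that group.
averaged = {
    name: {
        "petal_length": mean(r["petal_length"] for r in members),
        "petal_width": mean(r["petal_width"] for r in members),
    }
    for name, members in groups.items()
}
print(averaged)  # two averaged observations: one for "setosa", one for "virginica"
```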

### Minimum cluster size:
This parameter defines the smallest number of points that should be considered a cluster. It is a required parameter; the value must be an integer greater than 1.

### Increment outlier ID:
When enabled, this parameter sets the ID of the outlier cluster to `1` rather than the default `0`. This is useful for visualization purposes when the resulting cluster IDs are turned into image annotations.
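For reference, hdbscan labels noise points `-1` and numbers clusters from `0`, so the behavior described above amounts to shifting the labels. Below is a minimal sketch of the underlying library call, assuming the relabeling rule described in this README; it is not the plugin's source.

```python
# Illustrative sketch: cluster with HDBSCAN and relabel so outliers get ID 0,
# or ID 1 when incrementOutlierId is set.
import hdbscan
import numpy as np

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered outliers.
data = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
    rng.uniform(-10, 10, size=(5, 2)),
])

clusterer = hdbscan.HDBSCAN(min_cluster_size=3)  # must be an integer > 1
labels = clusterer.fit_predict(data)             # noise points are labeled -1

increment_outlier_id = True
offset = 2 if increment_outlier_id else 1
cluster_ids = labels + offset  # outliers -> 1 (or 0); clusters start at 2 (or 1)
print(sorted(set(cluster_ids.tolist())))
```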

## Output:
The output is a tabular file containing the clustered data.

## Building
To build the Docker image for this plugin, run
`./build-docker.sh`.

## Install WIPP Plugin
If WIPP is running, navigate to the plugins page and add a new plugin. Paste the contents of `plugin.json` into the pop-up window and submit.
For more information on WIPP, visit the [official WIPP page](https://isg.nist.gov/deepzoomweb/software/wipp).

## Options

This plugin takes the following arguments:

| Name | Description | I/O | Type |
| ---------------------- | ---------------------------------------------------------------------------------------------- | ------ | ------------- |
| `--inpDir` | Input tabular data files. | Input | genericData |
| `--groupingPattern` | Regular expression to group rows. Clustering will be applied across capture groups by default. | Input | string |
| `--averageGroups` | Average data across groups. Requires capture groups in the grouping pattern. | Input | boolean |
| `--labelCol` | Name of the column containing labels for grouping pattern. | Input | string |
| `--minClusterSize` | Minimum cluster size. | Input | number |
| `--incrementOutlierId` | Increments outlier ID to 1. | Input | boolean |
| `--outDir` | Output collection | Output | genericData |
| `--preview` | Generate a JSON file with outputs | Output | JSON |
1 change: 1 addition & 0 deletions clustering/hdbscan-clustering-tool/VERSION
@@ -0,0 +1 @@
0.4.8-dev0
4 changes: 4 additions & 0 deletions clustering/hdbscan-clustering-tool/build-docker.sh
@@ -0,0 +1,4 @@
#!/bin/bash

version=$(<VERSION)
docker build . -t polusai/hdbscan-clustering-tool:${version}
16 changes: 16 additions & 0 deletions clustering/hdbscan-clustering-tool/package-release.sh
@@ -0,0 +1,16 @@
# This script is designed to help package a new version of a plugin

# Get the new version
version=$(<VERSION)

# Bump the version
bump2version --config-file .bumpversion.cfg --new-version ${version} --allow-dirty part

# Build the container
./build-docker.sh

# Push to dockerhub
docker push polusai/hdbscan-clustering-tool:${version}

# Run pytests
python -m pytest -s tests
123 changes: 123 additions & 0 deletions clustering/hdbscan-clustering-tool/plugin.json
@@ -0,0 +1,123 @@
{
"name": "Hdbscan Clustering",
"version": "0.4.8-dev0",
"title": "Hdbscan Clustering",
"description": "Cluster the data using HDBSCAN.",
"author": "Jayapriya Nagarajan (github.com/Priyaaxle), Hythem Sidky (hythem.sidky@nih.gov) and Hamdah Shafqat Abbasi (hamdahshafqat.abbasi@nih.gov)",
"institution": "National Center for Advancing Translational Sciences, National Institutes of Health",
"repository": "https://github.com/PolusAI/image-tools",
"website": "https://ncats.nih.gov/preclinical/core/informatics",
"citation": "",
"containerId": "polusai/hdbscan-clustering-tool:0.4.8-dev0",
"baseCommand": [
"python3",
"-m",
"polus.images.clustering.hdbscan_clustering"
],
"inputs": {
"inpDir": {
"type": "genericData",
"title": "Input tabular data",
"description": "Input tabular data.",
"required": "True"
},
"filePattern": {
"type": "string",
"title": "Filename pattern",
"description": "Filename pattern used to separate data.",
"required": "False"
},
"groupingPattern": {
"type": "string",
"title": "Grouping pattern",
"description": "Regular expression for optional row grouping.",
"required": "False"
},
"averageGroups": {
"type": "boolean",
"title": "Average groups",
"description": "Whether to average data across groups. Requires grouping pattern to be defined.",
"required": "False"
},
"labelCol": {
"type": "string",
"title": "Label Column",
"description": "Name of column containing labels. Required for grouping pattern.",
"required": "False"
},
"minClusterSize": {
"type": "number",
"title": "Minimum cluster size",
"description": "Minimum cluster size.",
"required": "True"
},
"incrementOutlierId": {
"type": "number",
"title": "Increment Outlier ID",
"description": "Increments outlier ID to 1.",
"required": "True"
},
"preview": {
"type": "boolean",
"title": "Preview",
"description": "Generate an output preview.",
"required": "False"
}
},
"outputs": {
"outDir": {
"type": "genericData",
"description": "Output collection."
}
},
"ui": {
"inpDir": {
"type": "genericData",
"title": "Input tabular data",
"description": "Input tabular data to be processed by this plugin.",
"required": "True"
},
"filePattern": {
"type": "string",
"title": "Filename pattern",
"description": "Filename pattern used to separate data.",
"required": "False"
},
"groupingPattern": {
"type": "string",
"title": "Grouping pattern",
"description": "Regular expression for optional row grouping.",
"required": "False"
},
"averageGroups": {
"type": "boolean",
"title": "Average groups",
"description": "Whether to average data across groups. Requires grouping pattern to be defined.",
"required": "False"
},
"labelCol": {
"type": "string",
"title": "Label Column",
"description": "Name of column containing labels. Required for grouping pattern.",
"required": "False"
},
"minClusterSize": {
"type": "number",
"title": "Minimum cluster size",
"description": "Minimum cluster size.",
"required": "True"
},
"incrementOutlierId": {
"type": "number",
"title": "Increment Outlier ID",
"description": "Increments outlier ID to 1.",
"required": "True"
},
"preview": {
"type": "boolean",
"title": "Preview",
"description": "Generate an output preview.",
"required": "False"
}
}
}
32 changes: 32 additions & 0 deletions clustering/hdbscan-clustering-tool/pyproject.toml
@@ -0,0 +1,32 @@
[tool.poetry]
name = "polus-images-clustering-hdbscan-clustering"
version = "0.4.8-dev0"
description = "Cluster the data using HDBSCAN."
authors = [
"Jayapriya Nagarajan <jayapriya.nagarajan@axleinfo.com>",
"Hythem Sidky <hythem.sidky@nih.gov>",
"Hamdah Shafqat abbasi <hamdahshafqat.abbasi@nih.gov>"
]
readme = "README.md"
packages = [{include = "polus", from = "src"}]

[tool.poetry.dependencies]
python = ">=3.9,<3.12"
filepattern = "^2.0.4"
typer = "^0.7.0"
tqdm = "^4.64.1"
preadator="0.4.0.dev2"
vaex = "^4.17.0"
hdbscan = "^0.8.34rc1"


[tool.poetry.group.dev.dependencies]
pre-commit = "^3.3.3"
bump2version = "^1.0.1"
pytest = "^7.3.2"
pytest-xdist = "^3.3.1"
pytest-sugar = "^0.9.7"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
23 changes: 23 additions & 0 deletions clustering/hdbscan-clustering-tool/run-docker.sh
@@ -0,0 +1,23 @@
#!/bin/bash

version=$(<VERSION)
datapath=$(readlink --canonicalize data)
echo ${datapath}

# Inputs
inpDir=${datapath}/input
filePattern=".*.csv"
groupingPattern="\w+$"
labelCol="species"
minClusterSize=3
outDir=${datapath}/output

docker run -v ${datapath}:${datapath} \
polusai/hdbscan-clustering-tool:${version} \
--inpDir ${inpDir} \
--filePattern ${filePattern} \
--groupingPattern ${groupingPattern} \
--labelCol ${labelCol} \
--minClusterSize ${minClusterSize} \
--incrementOutlierId \
--outDir ${outDir}
4 changes: 4 additions & 0 deletions clustering/hdbscan-clustering-tool/src/polus/images/clustering/hdbscan_clustering/__init__.py
@@ -0,0 +1,4 @@
"""Hdbscan Clustering Plugin."""

__version__ = "0.4.8-dev0"
from . import hdbscan_clustering