Hierarchical clustering on the latent image representations #4

agitter · 2018-06-18T16:01:23Z

Start by clustering the transformed images with http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

Check whether compounds cluster together. Then try comparing image clustering with fingerprint clustering.
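A minimal sketch of this first step, assuming the latent representations are available as a NumPy array with one row per image. The `latent` and `compounds` arrays below are synthetic placeholders, not the project's data, and the default Ward linkage is only one possible choice.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical stand-in for the latent image representations:
# 40 images, 16 latent dimensions, drawn as two separated groups.
rng = np.random.default_rng(0)
latent = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 16)),
    rng.normal(loc=4.0, scale=0.5, size=(20, 16)),
])

# Hypothetical compound label for each image (two compounds here).
compounds = np.array(["A"] * 20 + ["B"] * 20)

# Default settings: Ward linkage on Euclidean distances.
model = AgglomerativeClustering(n_clusters=2)
cluster_ids = model.fit_predict(latent)

# Count how many images of each compound fall into each cluster.
for compound in np.unique(compounds):
    counts = np.bincount(cluster_ids[compounds == compound], minlength=2)
    print(compound, counts)
```

If images of the same compound cluster together, each printed row should be concentrated in a single cluster.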

xiaohk · 2018-06-25T14:03:50Z

Here are some visualizations of the hierarchical clustering with different distance functions and the corresponding silhouette coefficients.

Distance functions that fail to partition the dataset into at least 2 clusters tend to have higher silhouette scores.
Among the distances which have at least 2 clear clusters, yule gives the highest score.
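One way such a comparison could be set up is sketched below. It assumes SciPy's `pdist`, average linkage, and a cut into at most 2 clusters, none of which is confirmed by this thread; the `latent` array is synthetic, and the metric list is only a small illustrative subset.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

# Synthetic placeholder for the latent representations: two positive-valued groups.
rng = np.random.default_rng(0)
center_a = np.array([5.0, 5.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0])
center_b = center_a[::-1]
latent = np.vstack([
    rng.normal(loc=center_a, scale=0.3, size=(20, 8)),
    rng.normal(loc=center_b, scale=0.3, size=(20, 8)),
])

metrics = ["cosine", "euclidean", "braycurtis", "cityblock"]
scores = {}
for metric in metrics:
    condensed = pdist(latent, metric=metric)           # pairwise distances
    tree = linkage(condensed, method="average")        # assumed linkage method
    labels = fcluster(tree, t=2, criterion="maxclust") # cut into at most 2 clusters
    if len(np.unique(labels)) < 2:
        # The silhouette coefficient is undefined for a single cluster.
        scores[metric] = None
        continue
    scores[metric] = silhouette_score(
        squareform(condensed), labels, metric="precomputed"
    )

for metric, score in scores.items():
    print(metric, score)
```

Passing the same distance matrix to both the clustering and the silhouette computation keeps the score consistent with the metric being evaluated.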

efbd7d5 also gives an example that checks the hierarchical clustering against our UMAP visualization.

chao1224 · 2018-06-25T14:29:37Z

Nice plots.

For the final paper/report/presentation, I would suggest clustering all data points but plotting only a subset of them for visualization. Make sure you sample the i.i.d. data points once, and use that same subset to compare the different metrics.
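A small sketch of that sampling idea, assuming cluster labels have already been computed on the full dataset for each metric. The dataset size, subset size, and label arrays here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the subset is reproducible
n_points = 1000                  # hypothetical dataset size
n_plot = 100                     # hypothetical number of points to plot

# Draw the plotting subset once, uniformly and without replacement.
plot_idx = rng.choice(n_points, size=n_plot, replace=False)

# Placeholder labels standing in for clusterings computed on ALL points.
labels_by_metric = {
    "cosine": rng.integers(0, 2, size=n_points),
    "yule": rng.integers(0, 2, size=n_points),
}

# Reuse the same indices for every metric so the plots are comparable.
subset_by_metric = {
    metric: labels[plot_idx] for metric, labels in labels_by_metric.items()
}
```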

Besides, how to evaluate the clustering is another issue. I guess what @agitter suggests now is just to see which metric best aligns with the fingerprint clustering. The best evaluation method is always to put the clustering back into the problem setting and see which metric/algorithm best fits the goal.

There are also cases where people don't have a specific problem setting and just want to check the clustering performance. sklearn has some useful tools for this, like the silhouette coefficient and the Calinski-Harabasz index.
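Both scores are available in `sklearn.metrics`. The sketch below uses synthetic features and a default `AgglomerativeClustering`; note that `calinski_harabasz_score` is the current function name, while older scikit-learn releases spelled it `calinski_harabaz_score`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic placeholder features: two separated groups of 30 points each.
rng = np.random.default_rng(1)
features = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(30, 4)),
    rng.normal(loc=3.0, scale=0.5, size=(30, 4)),
])
labels = AgglomerativeClustering(n_clusters=2).fit_predict(features)

# Silhouette lies in [-1, 1]; Calinski-Harabasz is unbounded above.
# For both, higher values indicate better-separated clusters.
sil = silhouette_score(features, labels)
ch = calinski_harabasz_score(features, labels)
print(f"silhouette={sil:.3f}  calinski_harabasz={ch:.1f}")
```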

agitter · 2018-07-18T14:12:45Z

We agree that the metrics that produce >= 2 clusters and have a silhouette score > 0.37 look reasonable for the most part. There are some exceptions (e.g. sokalsneath, which produces many clusters). We can use the adjusted Rand index to assess whether the other metrics actually produce the same clusters. That is, are the purple cells in yule the same as the red cells in cosine?
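The adjusted Rand index ignores how clusters are named (or colored), so it suits this question. A small illustration with made-up assignments for six cells, using `sklearn.metrics.adjusted_rand_score`:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster assignments for the same six cells under two metrics.
# The integer codes are arbitrary names, like plot colors.
yule_labels = [0, 0, 0, 1, 1, 1]    # e.g. 0 = purple
cosine_labels = [1, 1, 1, 0, 0, 0]  # e.g. 1 = red

# Same grouping with different names: the score is 1.
same_partition = adjusted_rand_score(yule_labels, cosine_labels)

# Move one cell to the other cluster: the score drops below 1.
shifted_labels = [1, 1, 0, 0, 0, 0]
partial_agreement = adjusted_rand_score(yule_labels, shifted_labels)

print(same_partition, partial_agreement)
```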

agitter · 2018-07-18T14:41:51Z

Once we choose an image clustering and distance metric, we can compare that clustering to the clustering of chemicals. Computational chemists traditionally cluster chemicals by computing the ECFP fingerprint (bit vector) and using the Tanimoto similarity, which is either similar to or equivalent to the Jaccard index. jaccard is an option in scipy.spatial.distance.
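For binary fingerprints, the Tanimoto similarity is the number of shared on-bits divided by the number of bits set in either vector, and `scipy.spatial.distance.jaccard` returns one minus that value. The two short bit vectors below are invented for illustration; real ECFP fingerprints would come from a cheminformatics toolkit.

```python
import numpy as np
from scipy.spatial.distance import jaccard

# Two made-up 6-bit fingerprints.
fp_a = np.array([1, 1, 0, 1, 0, 0], dtype=bool)
fp_b = np.array([1, 0, 0, 1, 1, 0], dtype=bool)

# Tanimoto similarity: |A AND B| / |A OR B|.
shared = np.logical_and(fp_a, fp_b).sum()
either = np.logical_or(fp_a, fp_b).sum()
tanimoto_similarity = shared / either

# SciPy's jaccard is a dissimilarity, so it equals 1 - Tanimoto here.
jaccard_distance = jaccard(fp_a, fp_b)

print(tanimoto_similarity, jaccard_distance)
```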

The chemicals will cluster into more, and smaller, groups. We may not be able to directly compare the two clusterings with the adjusted rand index.

xiaohk · 2018-07-23T13:53:55Z

cfa9077 adds an adjusted Rand index comparison of some candidate distance functions.

The adjusted Rand score is related to the difference in size of the smaller cluster.
Except for 3 variants in braycurtis, all smaller clusters are subsets of the smaller cluster of cosine.
To be conservative, we can choose cosine as our final distance function.
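A subset check of this kind could look like the sketch below. The labelings are invented for ten samples, and `smaller_cluster` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

def smaller_cluster(labels):
    """Return the set of sample indices in the least populated cluster."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    smallest = values[np.argmin(counts)]
    return set(np.flatnonzero(labels == smallest).tolist())

# Hypothetical 2-cluster labelings of the same ten samples.
labels_by_metric = {
    "cosine":     [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "yule":       [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    "braycurtis": [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
}

# Compare every metric's smaller cluster against cosine's smaller cluster.
reference = smaller_cluster(labels_by_metric["cosine"])
for metric, labels in labels_by_metric.items():
    outside = sorted(smaller_cluster(labels) - reference)
    status = "subset" if not outside else f"not a subset, extra samples: {outside}"
    print(metric, status)
```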

See the notebook for more details.

xiaohk · 2018-07-30T13:49:26Z

0a70e25 adds the distance function comparison for the ECFP of all compounds in the dataset.

Some metrics give small groups while some give bigger ones. Jaccard looks reasonable here.
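As a rough sketch of fingerprint clustering with the Jaccard distance, assuming the ECFP bit vectors are already stacked in a boolean matrix. The fingerprints below are synthetic, and average linkage with a 2-cluster cut is an illustrative choice rather than the setting used in the commit.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Synthetic 32-bit fingerprints: two families with mostly disjoint on-bits.
rng = np.random.default_rng(0)
fingerprints = np.zeros((10, 32), dtype=bool)
fingerprints[:5, :16] = True
fingerprints[5:, 16:] = True

# Flip two random bits per compound so family members are not identical.
for row in fingerprints:
    flip = rng.choice(32, size=2, replace=False)
    row[flip] = ~row[flip]

# Jaccard distance between every pair of fingerprints, then a hierarchical tree.
distances = pdist(fingerprints, metric="jaccard")
tree = linkage(distances, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")

print(labels)
```

Swapping the `metric` argument of `pdist` is enough to try the other distance functions on the same fingerprints.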

agitter mentioned this issue Jul 27, 2018

High-throughput run over all plates #5

Closed

xiaohk mentioned this issue Jul 31, 2019

Project summary #14

Open
