Hierarchical clustering on the latent image representations #4

Open
agitter opened this issue Jun 18, 2018 · 6 comments

Comments

@agitter
Member

agitter commented Jun 18, 2018

Start by clustering the transformed images with http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

Check whether compounds cluster together. Then try comparing image clustering with fingerprint clustering.
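A minimal sketch of this first step, assuming the latent representations are rows of a NumPy array and each image carries a compound label. The `latent` and `compounds` arrays below are synthetic stand-ins, not the project's data, and the default Euclidean/average-linkage settings are only one possible choice.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Hypothetical stand-in for the latent image representations:
# two well-separated blobs of 20 images each, 8 latent dimensions.
latent = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 8)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 8)),
])
# Hypothetical compound label for each image.
compounds = np.array(["compound_a"] * 20 + ["compound_b"] * 20)

# Agglomerative clustering with the default Euclidean distance.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(latent)

# Count how the images of each compound are spread over the clusters.
for compound in np.unique(compounds):
    counts = np.bincount(labels[compounds == compound], minlength=2)
    print(compound, counts)
```

If compounds cluster together, each compound's counts should concentrate in a single cluster.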

@xiaohk
Member

xiaohk commented Jun 25, 2018

Here are some visualizations of the hierarchical clustering with different distance functions and the corresponding silhouette coefficient.

[Image: distance]

  • Distance functions that cannot partition the dataset into at least 2 clusters tend to have higher silhouette scores.
  • Among the distance functions that give at least 2 clear clusters, yule gives the highest score.
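The comparison above can be sketched roughly as follows. This is not the notebook's code: the data are synthetic, the metric list is an arbitrary subset of what `scipy.spatial.distance.pdist` accepts, and average linkage cut into two flat clusters is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical latent vectors: two blobs pointing in different directions.
center_a = np.array([3.0, 0.0, 0.0, 0.0, 0.0, 0.0])
center_b = np.array([0.0, 3.0, 0.0, 0.0, 0.0, 0.0])
latent = np.vstack([
    center_a + rng.normal(scale=0.3, size=(25, 6)),
    center_b + rng.normal(scale=0.3, size=(25, 6)),
])

scores = {}
for metric in ["cosine", "euclidean", "cityblock"]:
    condensed = pdist(latent, metric=metric)      # pairwise distances
    tree = linkage(condensed, method="average")   # hierarchical clustering
    labels = fcluster(tree, t=2, criterion="maxclust")
    if len(np.unique(labels)) < 2:
        # The silhouette coefficient is undefined for a single cluster.
        scores[metric] = None
        continue
    scores[metric] = silhouette_score(
        squareform(condensed), labels, metric="precomputed"
    )

for metric, score in scores.items():
    print(metric, score)
```

Reusing the same distance matrix for both the clustering and the silhouette score keeps the two consistent for each metric.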

efbd7d5 also gives an example that checks the hierarchical clustering against our UMAP visualization.

[Image: hierarchical_cluster_1]

@chao1224
Collaborator

chao1224 commented Jun 25, 2018

Nice plots.

For the final paper/report/presentation, I would suggest clustering all data points but plotting only a subset of them for visualization. Make sure you sample the i.i.d. data points once and reuse that same subset when comparing different metrics.

Besides, how to evaluate the clustering is another issue. I guess what @agitter suggests for now is just to see which metric best aligns with the clustering based on fingerprints. The best evaluation method is always to put the clustering back into the problem setting and see which metric/algorithm best fits the goal.

There are also cases where people don't have a specific problem setting and just want to check the clustering performance. sklearn has some useful functions for that, like the silhouette coefficient and the Calinski-Harabasz index.
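One way to follow this suggestion is sketched below, with synthetic data and an arbitrary subset size; `latent`, `plot_idx`, and the metric list are hypothetical names and choices. The clustering always runs on every point, and a single fixed index set picks what gets plotted.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 8))  # hypothetical latent representations

# Draw the plotting subset once, without replacement, and reuse it everywhere.
plot_idx = rng.choice(latent.shape[0], size=100, replace=False)

subset_labels = {}
for metric in ["cosine", "euclidean"]:
    # Cluster ALL points with this metric.
    tree = linkage(pdist(latent, metric=metric), method="average")
    labels = fcluster(tree, t=3, criterion="maxclust")
    # Keep only the labels of the fixed subset for plotting.
    subset_labels[metric] = labels[plot_idx]
```

Because `plot_idx` is fixed, plots for different metrics show the same points and can be compared directly.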
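A short sketch of both indices on synthetic blobs. Note that current scikit-learn releases spell the function `calinski_harabasz_score`; older releases used `calinski_harabaz_score`. The data and cluster count here are assumptions for illustration only.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data standing in for the real feature vectors.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Both indices need only the data and the predicted labels, no ground truth.
sil = silhouette_score(X, labels)         # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)   # unbounded above, higher is better
print(sil, ch)
```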

@agitter
Member Author

agitter commented Jul 18, 2018

We agree that the metrics that produce >= 2 clusters and have a silhouette score > 0.37 look reasonable for the most part. There are some exceptions (e.g. sokalsneath, which produces many clusters). We can use the adjusted Rand index to assess whether the other metrics actually produce the same clusters. That is, are the purple cells in yule the same as the red cells in cosine?
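A toy illustration of that check with `sklearn.metrics.adjusted_rand_score`. The label vectors are made up; the point is that the index ignores how clusters are named (or colored) and only compares which cells are grouped together.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster assignments of the same 8 cells under three metrics.
labels_yule = [0, 0, 0, 1, 1, 1, 1, 1]
labels_cosine = [1, 1, 1, 0, 0, 0, 0, 0]   # same partition, names swapped
labels_other = [0, 0, 1, 1, 1, 1, 1, 1]    # one cell moved to the other cluster

ari_same = adjusted_rand_score(labels_yule, labels_cosine)
ari_diff = adjusted_rand_score(labels_yule, labels_other)
print(ari_same, ari_diff)
```

Identical partitions score 1.0 regardless of label names, and any disagreement lowers the score.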

@agitter
Member Author

agitter commented Jul 18, 2018

Once we choose an image clustering and distance metric, we can compare that clustering to the clustering of chemicals. Computational chemists traditionally cluster chemicals by computing the ECFP fingerprint (bit vector) and using the Tanimoto similarity, which is either similar to or equivalent to the Jaccard index. jaccard is an option in scipy.spatial.distance.
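For binary fingerprints, the Tanimoto similarity is the number of bits set in both vectors divided by the number of bits set in either, and `scipy.spatial.distance.jaccard` returns the corresponding dissimilarity (one minus that ratio). A small sketch with made-up 5-bit vectors, not real ECFP output; the `tanimoto` helper is written here for illustration.

```python
import numpy as np
from scipy.spatial.distance import jaccard

# Hypothetical fingerprints (real ECFP bit vectors are much longer).
fp_a = np.array([1, 1, 0, 1, 0], dtype=bool)
fp_b = np.array([1, 0, 0, 1, 1], dtype=bool)


def tanimoto(a, b):
    """Tanimoto similarity of two bit vectors with at least one bit set."""
    both = np.logical_and(a, b).sum()
    either = np.logical_or(a, b).sum()
    return both / either


sim = tanimoto(fp_a, fp_b)    # 2 shared bits / 4 bits set in either = 0.5
dist = jaccard(fp_a, fp_b)    # Jaccard dissimilarity = 1 - similarity
print(sim, dist)
```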

The chemicals will cluster into more, and smaller, groups. We may not be able to directly compare the two clusterings with the adjusted rand index.

@xiaohk
Member

xiaohk commented Jul 23, 2018

cfa9077 adds the adjusted rand index comparison of some potential distance functions.

  • The adjusted Rand score between two distance functions is related to the size difference of their smaller clusters.
  • Except for 3 variants in braycurtis, all smaller clusters are subsets of the smaller cluster of cosine.
  • To be conservative, we can choose cosine as our final distance function.
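The subset relationship described above could be checked along these lines. The label arrays are invented for illustration and `smaller_cluster` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np


def smaller_cluster(labels):
    """Return the indices of the points in the least populated cluster."""
    labels = np.asarray(labels)
    values, counts = np.unique(labels, return_counts=True)
    smallest = values[np.argmin(counts)]
    return set(np.flatnonzero(labels == smallest).tolist())


# Hypothetical two-cluster assignments of the same 8 cells.
labels_cosine = np.array([0, 0, 0, 0, 0, 1, 1, 1])
labels_other = np.array([1, 1, 1, 1, 1, 1, 0, 0])

small_cosine = smaller_cluster(labels_cosine)
small_other = smaller_cluster(labels_other)

# True when every cell in the other metric's smaller cluster is also
# in the smaller cluster of cosine.
is_subset = small_other <= small_cosine
print(is_subset)
```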

See the notebook for more details.

@xiaohk
Member

xiaohk commented Jul 30, 2018

0a70e25 adds the distance function comparison for the ECFP of all compounds in the dataset.

[Image: distance_fp]

Some metrics give small groups while some give bigger ones. Jaccard looks reasonable here.

3 participants