Hierarchical clustering on the latent image representations #4
Comments
Here are some visualizations of the hierarchical clustering with different distance functions and the corresponding silhouette coefficient.
efbd7d5 also gives an example to check hierarchical clustering and our UMAP visualization.
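For readers without the notebook at hand, here is a minimal sketch of how a silhouette coefficient could be computed per distance function. It is not the code from the commit: the `latent` array is a synthetic stand-in for the latent image representations, and the helper name `silhouette_by_metric`, the average linkage, and the choice of two clusters are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Hypothetical stand-in for the latent image representations: two separated blobs.
latent = np.vstack([rng.normal(loc, 0.3, size=(50, 8)) for loc in (1.0, 4.0)])


def silhouette_by_metric(X, metrics, n_clusters=2):
    """Cluster X hierarchically under each metric and score the result."""
    scores = {}
    for metric in metrics:
        dists = pdist(X, metric=metric)          # condensed pairwise distances
        tree = linkage(dists, method="average")  # average linkage is an assumption
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        if len(np.unique(labels)) < 2:
            continue  # the silhouette is undefined for a single cluster
        scores[metric] = silhouette_score(
            squareform(dists), labels, metric="precomputed"
        )
    return scores


scores = silhouette_by_metric(latent, ["euclidean", "cityblock", "cosine"])
for metric, score in scores.items():
    print(f"{metric}: {score:.3f}")
```

Passing the same precomputed distances to both the clustering and the silhouette keeps the score consistent with the metric being evaluated.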
Nice plots. For the final paper, report, or presentation, I would suggest clustering on all data points but plotting only a subset for visualization. Make sure you draw the i.i.d. subset once and reuse it when comparing the different metrics. Besides, how to evaluate the clustering is another issue. I guess what @agitter suggests for now is just to see which metric best aligns with the clustering based on fingerprints. The best evaluation method is always to put the clustering back into the problem setting and see which metric or algorithm best fits the goal. There are also cases where people don't have a specific problem setting and just want to check the clustering performance.
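The subsampling suggestion can be sketched as follows: cluster every point, but draw the plotting indices a single time and reuse them for each metric. The `latent` array, the subset size of 100, the three clusters, and the helper `cluster_all` are hypothetical placeholders, not values from this project.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Hypothetical stand-in for the full set of latent image representations.
latent = rng.normal(size=(500, 16))

# Draw the plotting subset once, uniformly without replacement,
# so every metric is visualized on exactly the same points.
plot_idx = np.sort(rng.choice(latent.shape[0], size=100, replace=False))


def cluster_all(X, metric, n_clusters=3):
    """Hierarchical clustering on ALL rows of X under the given metric."""
    tree = linkage(X, method="average", metric=metric)
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Cluster on everything, then keep only the shared subset for plotting.
labels_to_plot = {
    metric: cluster_all(latent, metric)[plot_idx]
    for metric in ("euclidean", "cosine")
}
```

The embedding coordinates used in the figure (e.g. from UMAP) would be indexed with the same `plot_idx`, so colors line up across panels.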
We agree that the metrics that produce >= 2 clusters and have a silhouette score > 0.37 look reasonable for the most part. There are some exceptions (e.g. sokalsneath, which produces many clusters). We can use the adjusted Rand index to assess whether the other metrics actually produce the same clusters. That is, are the purple cells in yule the same as the red cells in cosine?
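As a small illustration of why the adjusted Rand index suits this question: it ignores how clusters are named (or colored) and only looks at which points are grouped together. The label arrays below are made-up toy values, not results from our data.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Toy cluster assignments for the same 8 images (hypothetical values).
labels_yule = np.array([0, 0, 0, 1, 1, 1, 2, 2])
# Same grouping as above, but with the cluster names permuted.
labels_cosine = np.array([2, 2, 2, 0, 0, 0, 1, 1])
# A different grouping of the same images.
labels_other = np.array([0, 1, 0, 1, 0, 1, 0, 1])

ari_same = adjusted_rand_score(labels_yule, labels_cosine)
ari_other = adjusted_rand_score(labels_yule, labels_other)

print(f"same partition, renamed clusters: {ari_same:.2f}")
print(f"different partition:              {ari_other:.2f}")
```

Identical partitions score 1 regardless of naming, while scores near 0 indicate agreement no better than chance.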
Once we choose an image clustering and distance metric, we can compare that clustering to the clustering of chemicals. Computational chemists traditionally cluster chemicals by computing the ECFP fingerprint (bit vector) and using the Tanimoto similarity, which for bit vectors is equivalent to the Jaccard index. The chemicals will cluster into more, and smaller, groups. We may not be able to directly compare the two clusterings with the adjusted Rand index.
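To make the Tanimoto and Jaccard relationship concrete, here is a minimal sketch on two toy bit vectors. These are not real ECFP fingerprints (no cheminformatics library is used), and `tanimoto_similarity` is a hypothetical helper written for this example.

```python
import numpy as np
from scipy.spatial.distance import jaccard


def tanimoto_similarity(a, b):
    """Tanimoto similarity of two bit vectors: |a AND b| / |a OR b|."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    either = np.count_nonzero(a | b)
    if either == 0:
        return 1.0  # convention chosen here for two all-zero vectors
    return np.count_nonzero(a & b) / either


# Toy 8-bit "fingerprints" (hypothetical, far shorter than real ECFP vectors).
fp_a = np.array([1, 1, 0, 1, 0, 0, 1, 0], dtype=bool)
fp_b = np.array([1, 0, 0, 1, 1, 0, 1, 0], dtype=bool)

sim = tanimoto_similarity(fp_a, fp_b)
# SciPy's `jaccard` returns a dissimilarity, so similarity is 1 minus it.
sim_from_scipy = 1.0 - jaccard(fp_a, fp_b)
print(sim, sim_from_scipy)
```

This means the `jaccard` metric in SciPy's distance functions can serve as the Tanimoto distance when clustering binary fingerprints.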
cfa9077 adds the adjusted Rand index comparison of some potential distance functions.
See the notebook for more details.
0a70e25 adds the distance function comparison for the ECFP of all compounds in the dataset. Some metrics give small groups while others give bigger ones.
Start by clustering the transformed images with http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Check whether compounds cluster together. Then try comparing image clustering with fingerprint clustering.
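One possible reading of "check whether compounds cluster together" is to ask, for each compound, what fraction of its images land in a single cluster. The sketch below does that with `AgglomerativeClustering` on synthetic data; the `latent` array, the `compound_ids` labels, the cluster count, and the `compound_purity` helper are all assumptions for illustration, not part of the project code.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Hypothetical data: 3 compounds with 20 images each, embedded in 2-D.
compound_ids = np.repeat(np.arange(3), 20)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
latent = centers[compound_ids] + rng.normal(scale=0.3, size=(60, 2))

# Default settings (Ward linkage on Euclidean distances); n_clusters is a guess.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(latent)


def compound_purity(compound_ids, labels):
    """Fraction of each compound's images that fall in its most common cluster."""
    purity = {}
    for compound in np.unique(compound_ids):
        counts = np.bincount(labels[compound_ids == compound])
        purity[int(compound)] = counts.max() / counts.sum()
    return purity


purity = compound_purity(compound_ids, labels)
print(purity)
```

A purity close to 1 for most compounds would suggest that images of the same compound stay together; values near 1/n_clusters would suggest they are scattered.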