Unexpectedly large number of clusters #7
Comments
Hi Sam! For datasets with multiple batches, it's best to use CHOIR's built-in batch correction method rather than providing a pre-generated dimensionality reduction. Try setting
CHOIR will run Harmony on the dimensionality reductions generated, on which the initial hierarchical clustering tree is then based. But more importantly, the batch information will be used in the random forest classifier comparisons to ensure that batch effects do not bias the ultimate clusters that pass the significance and accuracy thresholds imposed. Essentially, within the permutation tests, only cells from the same batch are compared within a single iteration. Since most datasets these days have multiple batches, CHOIR's performance was largely optimized with this batch correction method enabled. Let me know how your results look after trying this!
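To make the "only cells from the same batch are compared within a single iteration" idea concrete, here is a minimal Python sketch of batch-restricted sampling. It is not CHOIR's code: the function name `sample_same_batch`, the tuple layout, and the rule of returning `None` when no batch has enough cells in both clusters are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_same_batch(cells, cluster_a, cluster_b, n_per_cluster, rng):
    """Draw cells for one comparison iteration from a single batch.

    cells: list of (cell_id, cluster, batch) tuples (hypothetical layout).
    Returns (batch, ids_a, ids_b), or None when no batch holds at least
    n_per_cluster cells from both clusters.
    """
    # Group the cell ids of the two clusters by batch.
    by_batch = defaultdict(lambda: {cluster_a: [], cluster_b: []})
    for cell_id, cluster, batch in cells:
        if cluster in (cluster_a, cluster_b):
            by_batch[batch][cluster].append(cell_id)

    # Keep only batches that can supply enough cells from both clusters.
    eligible = sorted(
        batch
        for batch, groups in by_batch.items()
        if len(groups[cluster_a]) >= n_per_cluster
        and len(groups[cluster_b]) >= n_per_cluster
    )
    if not eligible:
        return None

    # Pick one batch, then sample both clusters from that batch only.
    batch = rng.choice(eligible)
    groups = by_batch[batch]
    ids_a = rng.sample(groups[cluster_a], n_per_cluster)
    ids_b = rng.sample(groups[cluster_b], n_per_cluster)
    return batch, ids_a, ids_b


# Toy data: cluster B exists only in batch1, so only batch1 is usable.
cells = (
    [(f"a{i}", "A", "batch1") for i in range(5)]
    + [(f"b{i}", "B", "batch1") for i in range(5)]
    + [(f"c{i}", "A", "batch2") for i in range(5)]
)
rng = random.Random(0)
result = sample_same_batch(cells, "A", "B", 3, rng)
too_many = sample_same_batch(cells, "A", "B", 6, rng)
```

In this toy example every comparison is drawn from `batch1`, because `batch2` contains no cells from cluster B. Asking for more cells than any shared batch can supply yields `None`, which is the situation where two clusters share too few same-batch cells to be compared directly.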
Thanks for the quick response! I ran the code as you suggested, and I ended up with 21 clusters; the relevant figures are attached at the bottom. However, they do seem somewhat strange to me compared to my previous analysis. For example, almost all of the clusters that I had previously annotated as excitatory neurons have been assigned to only one CHOIR cluster. The marker gene analysis of my previous excitatory neuron clusters did show unique markers that mapped to cortical layers, so I think there should be more than one cluster here. Any further advice? I am also curious based on what you said:
Since CHOIR's performance relies on this batch information, does this mean that users should not supply dim reductions from other batch correction methods (scVI, LIGER, etc.)? It would be great to be able to use more than just Harmony here.
I think a couple of things may be going on, related to CHOIR’s statistical power when working with multiple batches within a small-ish dataset. First, it'd be useful to see the clusters plotted on a UMAP based on the dimensionality reduction that CHOIR generates. To easily do this, run:
Are you considering each of your 8 individuals as a separate batch? Would be helpful to see a UMAP colored by batch also on CHOIR's UMAP embedding. Would you expect any cell types/states to be largely confined to a single batch or small subset of batches? In instances where the two clusters being compared by CHOIR have few cells belonging to the same batch, these clusters are assumed to be batch-confounded instances of the same cell population and are merged. CHOIR's batch correction was designed to work best with batches that include more than 1 sample and that are not confounded by groups of interest (e.g., disease vs. control). If you upload the records from
Regarding your question:
Yes, for now, CHOIR’s batch correction is limited to the built-in Harmony method. But I’d like to expand compatibility to more methods of batch correction in the future, particularly scVI. Regardless of batch correction, for most users, I’d advise against providing a pre-generated dimensionality reduction. This is because CHOIR generates multiple dimensionality reductions as it builds out the hierarchical clustering tree. A “root” dimensionality reduction encompassing all cells is followed by multiple “subtree” dimensionality reductions that each encompass a subset of the total cells. These subsetted reductions, and their accompanying sets of highly variable features, allow CHOIR to pick up on more nuanced differences between cell subtypes/states. You can visualize the CHOIR clusters on these subsetted dimensionality reductions as well, by specifying the reduction to use in the code up top.
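A toy illustration of why per-subtree feature selection matters, written as a rough Python sketch. It ranks features by plain population variance, which is a simplification standing in for real highly-variable-feature selection, and it is not CHOIR's implementation. The gene names, cell groups, and the helper `top_variable_features` are hypothetical.

```python
from statistics import pvariance


def top_variable_features(matrix, feature_names, n_top):
    """Rank features by plain variance across cells (simplified stand-in).

    matrix: list of cells, each a list of feature values.
    """
    variances = {
        name: pvariance([row[j] for row in matrix])
        for j, name in enumerate(feature_names)
    }
    return sorted(variances, key=variances.get, reverse=True)[:n_top]


features = ["geneX", "geneY"]

# geneX separates neurons from glia; geneY separates two neuron subtypes.
neurons_1 = [[10.0, 0.0] for _ in range(4)]
neurons_2 = [[10.0, 3.0] for _ in range(4)]
glia = [[0.0, 0.0] for _ in range(8)]

all_cells = neurons_1 + neurons_2 + glia  # analogous to the "root"
neurons_only = neurons_1 + neurons_2      # analogous to one "subtree"

root_top = top_variable_features(all_cells, features, 1)
subtree_top = top_variable_features(neurons_only, features, 1)
print(root_top, subtree_top)
```

Across all cells the most variable feature is `geneX`, which only distinguishes the broad lineages. Within the neuron subset `geneX` is constant, so `geneY` becomes the top feature, and the subtype difference it carries is no longer crowded out.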
Hi again, Based on what you said:
My 8 samples actually do belong to a single batch. Previously I was using the sample ID as the "batch" grouping. I re-ran CHOIR without any batch correction since all the samples did come from the same batch. Unfortunately it looks like I still have the problem of having a large number of clusters. Here's the code:
If you are interested, here are the records (.rda file).
Okay, this makes sense! I was assuming that this was more akin to other clustering algorithms where you just need to supply any dim reduction. It also seems to me like the feature selection step would play a huge role in the downstream results, since some genes that are discriminatory between two groups of cells may not be present in the variable features.
Hi Sam, we're still looking into this—balancing the performance of CHOIR with and without batch correction has turned out to be a bit of a complicated issue. The fundamental issue seems to be related to the number of cells used for each iteration of the permutation test (so each train/test random forest classifier application), and we're hoping to have an update on this soon.
I do think the feature selection step matters, and we're hoping to explore some of the newer feature selection methods in the future! CHOIR does incorporate a couple of ways to modify the features used—there's an option to use all features as input to the random forest classifiers, rather than just variable features (although this is quite a bit slower), and a parameter
We will update here when we have something tangible to report on the batch correction side of things!
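For readers unfamiliar with the permutation-test idea referred to above, here is a generic Python sketch of a classifier-based permutation test. It swaps in a one-dimensional nearest-centroid classifier for the random forest and uses a simple half/half train/test split, so it should be read as an illustration of the general technique rather than CHOIR's procedure. All function names and the toy values are assumptions.

```python
import random


def split_accuracy(values_a, values_b, rng):
    """Train a nearest-centroid classifier on half of each group and
    return its accuracy on the held-out half (each group needs >= 2 cells)."""
    a, b = values_a[:], values_b[:]
    rng.shuffle(a)
    rng.shuffle(b)
    half_a, half_b = len(a) // 2, len(b) // 2
    mean_a = sum(a[:half_a]) / half_a
    mean_b = sum(b[:half_b]) / half_b
    test = [(v, "A") for v in a[half_a:]] + [(v, "B") for v in b[half_b:]]
    correct = sum(
        1
        for value, label in test
        if ("A" if abs(value - mean_a) <= abs(value - mean_b) else "B") == label
    )
    return correct / len(test)


def permutation_p_value(values_a, values_b, n_perm, rng):
    """Compare the observed accuracy with accuracies under shuffled labels."""
    observed = split_accuracy(values_a, values_b, rng)
    pooled = values_a + values_b
    at_least_as_good = 0
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)  # break the link between values and labels
        perm_a = shuffled[: len(values_a)]
        perm_b = shuffled[len(values_a):]
        if split_accuracy(perm_a, perm_b, rng) >= observed:
            at_least_as_good += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (at_least_as_good + 1) / (n_perm + 1)


# Two clearly separated toy "clusters" of 20 cells each.
cluster_a = [0.1 * i for i in range(20)]
cluster_b = [10.0 + 0.1 * i for i in range(20)]
rng = random.Random(0)
observed, p_value = permutation_p_value(cluster_a, cluster_b, 99, rng)
```

With fewer cells per group, the held-out accuracy estimates become noisier and the smallest attainable p-value is bounded by the number of permutations, which gives a sense of why the number of cells used per iteration affects the test's power.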
Hi Cathrine,
First of all, thanks for developing this very interesting method! I tried out CHOIR on one of my datasets to see how the results would compare with my previous clustering analysis. I understand that one of the main advantages of CHOIR is that in principle it should identify the "correct" number of clusters without over- or under-clustering.
I have a snRNA-seq dataset of postmortem human cortical tissue from 8 individuals totaling ~66k nuclei. Based on my previous analysis and what we expect based on other studies, this dataset has clusters corresponding to seven major cell lineages (oligodendrocytes, OPCs, microglia, astrocytes, vascular cells, excitatory neurons, and inhibitory neurons). We generally expect that each of these major cell types would have some subclusters as well. I was surprised that after running CHOIR, I ended up with 132 clusters identified in my dataset, many of which contained very few nuclei. I ran CHOIR using the harmony dim reduction that I had already computed. I am not sure what is going on here or if you have any advice in this scenario? Below I am including the code that I ran as well as a UMAP plot comparing my previous clustering to the CHOIR clustering.