Different results each time CITE-seq count is run #165

Open
colin986 opened this issue Mar 11, 2022 · 5 comments

@colin986

Hi,

I'm getting a different output each time CITE-seq count is run. My whitelist and parameters do not change between runs.

Is this expected? Is there any way to control this for reproducibility (e.g. by setting a seed)?

Thanks,
Colin

@Hoohm
Owner

Hoohm commented Mar 11, 2022 via email

@colin986
Author

Hi Hoohm,

Thanks for coming back to me.

You were right. The CITE-seq count output is the same each time.

The variation in the result seems to come from the HTODemux function in Seurat when using the clara clustering option (with kmeans clustering the output is consistent). The function has an option to set the seed, but the output still changes each time I re-run CITE-seq count. What I mean is that HTODemux is reproducible for a given CITE-seq count output, and CITE-seq count itself is also reproducible. However, when I re-run both CITE-seq count and HTODemux I get a different result, and I don't understand why this is happening.

I know HTODemux draws 100 samples from the dataset for clara clustering. I wonder whether CITE-seq count writes the same data in a different order on each run, so that the 100 samples are drawn differently, and whether that gives rise to the variability in the output?

Thanks,
Colin

@johnyaku

johnyaku commented Sep 5, 2024

I can verify "different" CITE-seq-count results on different runs.

The difference is in the column order, not in the actual content of the count matrices. Reordering the columns to match each other (or the whitelist) results in identical matrices.
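One way to run this kind of check is to pair columns by barcode instead of by position. The following is a minimal Python sketch, assuming each run has already been loaded into memory as a list of column barcodes plus one list of counts per column, with the tags in the same row order in both runs. The function name and the data layout are hypothetical and are not part of CITE-seq count.

```python
def same_counts_ignoring_order(barcodes_a, columns_a, barcodes_b, columns_b):
    """Return True if two count matrices match once columns are paired by barcode.

    barcodes_x: list of cell barcodes, one per column (assumed unique).
    columns_x:  list of per-column count lists, in the same order as barcodes_x.
    """
    by_barcode_a = dict(zip(barcodes_a, columns_a))
    by_barcode_b = dict(zip(barcodes_b, columns_b))
    # Dict comparison ignores insertion order, so only the content matters.
    return by_barcode_a == by_barcode_b


# Toy data (made up): identical counts, columns written in a different order.
run1_barcodes = ["AAAC", "CCGT", "GGTA"]
run1_columns = [[5, 0], [1, 7], [0, 3]]
run2_barcodes = ["GGTA", "AAAC", "CCGT"]
run2_columns = [[0, 3], [5, 0], [1, 7]]

# A position-by-position comparison reports a difference...
positional_match = (run1_barcodes, run1_columns) == (run2_barcodes, run2_columns)
# ...while a barcode-keyed comparison does not.
barcode_match = same_counts_ignoring_order(
    run1_barcodes, run1_columns, run2_barcodes, run2_columns
)
print(positional_match, barcode_match)
```

On this toy data the positional comparison fails while the barcode-keyed comparison succeeds, which is the pattern described above.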

I haven't been able to pin down the source of the variation. I can't see any random functions. Initially I suspected parallelization, with different chunks finishing in different orders depending on the run, but the problem persists even with only one thread.

This difference in ordering produces different assignments from Seurat::HTODemux() when kfunc='clara' (the default). In the good quality dataset where I have been testing this, assignments are different for about 5% of total barcodes. In a low or even medium quality dataset I suspect the variability might be worse.

I haven't looked at why, but @colin986's suggestion that different ordering might produce different sampling (even with the same seed) seems plausible to me.
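To illustrate why a fixed seed would not be enough if sampling is done by position, here is a small Python sketch. It does not reproduce clara or HTODemux; it only assumes a generic sampler that picks column positions with a seeded generator. The barcode names, the sample size, and the helper function are made up for the example.

```python
import random


def sample_barcodes(barcodes, k, seed):
    """Draw k barcodes by column position using a seeded generator."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(barcodes)), k)
    return positions, [barcodes[i] for i in positions]


# Same 200 cells in both runs; run 2 has its columns rotated by one place.
barcodes_run1 = [f"cell{i:03d}" for i in range(200)]
barcodes_run2 = barcodes_run1[1:] + barcodes_run1[:1]

pos1, picked1 = sample_barcodes(barcodes_run1, k=10, seed=42)
pos2, picked2 = sample_barcodes(barcodes_run2, k=10, seed=42)

# The seed fixes which positions are drawn, not which cells sit at them.
print(pos1 == pos2)
print(set(picked1) == set(picked2))
```

The same seed yields the same positions in both runs, but because the cells at those positions differ, the sampled cells differ too. Any clustering fitted on the sample would then start from different data.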

Setting kfunc = 'kmeans' results in consistent demux assignments, despite the difference in ordering.

For now I am reordering CITE-seq-count outputs based on the whitelist, and also using kmeans rather than clara.
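The reordering step can be sketched in Python roughly as follows, assuming the count matrix is held in memory as a list of column barcodes plus one count list per column. The function name, the data layout, and the handling of barcodes missing from either side are assumptions of this sketch, not CITE-seq count behaviour.

```python
def reorder_to_whitelist(barcodes, columns, whitelist):
    """Return (barcodes, columns) rearranged to follow the whitelist order.

    Assumptions of this sketch: barcodes are unique, columns whose barcode
    is not in the whitelist are dropped, and whitelist entries without a
    column are skipped.
    """
    by_barcode = dict(zip(barcodes, columns))
    ordered = [bc for bc in whitelist if bc in by_barcode]
    return ordered, [by_barcode[bc] for bc in ordered]


# Toy data (made up): columns arrive in an arbitrary order.
whitelist = ["AAAC", "CCGT", "GGTA", "TTTT"]
run_barcodes = ["GGTA", "AAAC", "CCGT"]
run_columns = [[0, 3], [5, 0], [1, 7]]

fixed_barcodes, fixed_columns = reorder_to_whitelist(
    run_barcodes, run_columns, whitelist
)
print(fixed_barcodes)
print(fixed_columns)
```

Applying the same reordering to every run gives downstream tools a stable column order regardless of how the columns were written.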

@Hoohm
Owner

Hoohm commented Sep 5, 2024

Thank you for looking into this. I was afraid there was a bug I had missed in my code, but the downstream issues seem more plausible. By the way, if you are interested in testing it, I have a beta branch rewritten in Polars available. Some input names have changed, but overall it should decrease memory usage and improve speed.

@johnyaku

Thanks @Hoohm. I'll check out the beta branch when I get a moment.

I'm not sure if it is worth making a feature request, but I do think it would be helpful if CITE-seq-count produced identical output for identical input (including sort order).
