Very slow paired reads mode for transcriptome #31

siddharthab · 2024-09-03T22:17:44Z

Hi!

I am trying to make UMICollapse the default tool in one of the popular RNAseq analysis pipelines -- nf-core/rnaseq#1087.

Not sure if this is covered by #5 already, but when using paired reads aligned to the human transcriptome, UMICollapse seems to be about 20x slower than umi-tools. UMICollapse takes between 9 and 10 hours for the BAM files we are considering, whereas umi-tools takes ~30 minutes. The slowness is present in both two-pass and single-pass modes.
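For reference, the slowdown factor follows directly from the two runtimes quoted above. This is only a sketch of that arithmetic: the helper name `slowdown` and the constants are illustrative, and "~30 minutes" is taken as exactly 0.5 hours.

```python
def slowdown(slow_hours: float, fast_hours: float) -> float:
    """Return how many times longer the slow run takes than the fast one."""
    return slow_hours / fast_hours


# Runtimes as reported in this issue (assumed exact for the calculation).
UMI_TOOLS_HOURS = 0.5             # "~30 minutes"
UMICOLLAPSE_HOURS = (9.0, 10.0)   # "between 9 and 10 hours"

low, high = (slowdown(h, UMI_TOOLS_HOURS) for h in UMICOLLAPSE_HOURS)
print(f"UMICollapse is roughly {low:.0f}x to {high:.0f}x slower")
# prints "UMICollapse is roughly 18x to 20x slower"
```

So the reported range works out to roughly 18x to 20x, which matches the "20x" figure quoted.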

I have not gone through how UMICollapse works, so I do not have an opinion on whether this is expected or not. If it is expected, some commentary on this in the README would be appreciated.

I have made some test data available in Google Drive. You will notice that the BAM file has 44319354 read pairs with 8 bp UMIs.

Thank you for continuing to follow up on your work from a long time ago.

siddharthab · 2024-09-04T04:44:50Z

On profiling, it seems like 98% of the CPU time is spent in write.

siddharthab mentioned this issue Sep 3, 2024

Add umicollapse and benchmark it against umitools dedup nf-core/rnaseq#1087

Open

siddharthab linked a pull request Sep 4, 2024 that will close this issue

Remove redundant BAM file open in paired mode #32

Open
