New Kallisto-0.48.0, usage of -x BDWTA, outputting too many barcodes #77
Comments
bustools whitelist should be used on a sorted BUS file, not the original BUS file output by kallisto.
Hi Yenaled,
Number of distinct barcodes: 1799828
Number of distinct UMIs: 65536
Estimated number of new records at 2x sequencing depth: 28228399
Thank you for your help!
Can you run "bustools sort" again immediately after "bustools correct", and then run inspect and count on the sorted file? Generally, whenever we use bustools, we sort twice: once at the very beginning and once at the very end. If this doesn't fix it, feel free to send me an email with your sorted BUS file attached (my email is listed in my GitHub profile) and I'll look into it further.
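The ordering suggested in this thread (sort, build the whitelist from the sorted file, correct, sort again, then inspect and count) can be sketched as a small shell script. This is one reading of the advice, not the exact commands anyone here ran: the file names reuse those from the original post, `sample` is a hypothetical label, and the `run` helper with its `DRY_RUN` switch is an illustrative wrapper that only prints the commands by default.

```shell
#!/bin/sh
# Sketch only: sort -> whitelist -> correct -> sort -> inspect -> count.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

bus_workflow() {
    sample="$1"   # hypothetical sample label used to name intermediate files
    # First sort: whitelist should see a sorted BUS file, not raw kallisto output.
    run bustools sort -o "sorted_${sample}.bus" output.bus
    run bustools whitelist -o "${sample}_whitelist" "sorted_${sample}.bus"
    run bustools correct -o "corr_${sample}.bus" --whitelist "${sample}_whitelist" "sorted_${sample}.bus"
    # Second sort: immediately after correct, before inspect and count.
    run bustools sort -o "sorted_corr_${sample}.bus" "corr_${sample}.bus"
    run bustools inspect "sorted_corr_${sample}.bus"
    run bustools count --genecounts -g /mus_musculus/transcripts_to_genes.txt \
        -t transcripts.txt -e matrix.ec -o "${sample}_counts" "sorted_corr_${sample}.bus"
}

bus_workflow control
```

Setting `DRY_RUN=0` would execute the commands instead of printing them, assuming bustools is on the `PATH` and the input files exist.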
Here is the final solution/workflow that worked for me. Many, many thanks to @Yenaled for helping me every step of the way!
Number of distinct barcodes: 232194
Number of distinct UMIs: 65536
Estimated number of new records at 2x sequencing depth: 26326635
At this point I was worried because I was still getting many more barcodes than I was expecting (~2500). In the end, most of the barcode removal occurred when filtering the matrix in R based on UMI counts per barcode.
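For readers who want to try the same kind of filter outside R, here is a minimal sketch that sums counts per barcode in a MatrixMarket file and keeps barcodes at or above a threshold. It is not the R code used in this thread. The function name `umi_filter`, the default threshold of 500, and the assumption that barcodes are indexed by the first field of each entry are all illustrative; check the orientation of your own `.mtx` file and pass `2` as the third argument if barcodes are in the second field.

```shell
#!/bin/sh
# umi_filter FILE [MIN_UMIS] [BARCODE_FIELD]
# Prints "barcode_index total_count" for barcodes whose summed counts >= MIN_UMIS.
# MIN_UMIS defaults to 500 (a hypothetical value, tune it per dataset).
# BARCODE_FIELD defaults to 1 (assumes barcodes are the matrix rows).
umi_filter() {
    mtx="$1"
    min_umis="${2:-500}"
    bc_field="${3:-1}"
    awk -v min="$min_umis" -v f="$bc_field" '
        /^%/ { next }                            # skip MatrixMarket comment lines
        !seen_size { seen_size = 1; next }       # skip the "rows cols nnz" size line
        { total[$f] += $3 }                      # accumulate counts per barcode index
        END {
            for (b in total)
                if (total[b] >= min) print b, total[b]
        }
    ' "$mtx" | sort -n
}

# Example call (file name is an assumption about what bustools count wrote):
# umi_filter control_counts.mtx 500
```

The printed indices refer to line numbers in the accompanying barcodes file, so the surviving barcodes can be looked up there.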
Does bustools work for the new BD Enhanced (Version 2) beads?
Hi,
I am using the 0.48.0 version of kallisto, as well as bustools (0.41.0) to demultiplex and obtain gene count tables for my BD Rhapsody WTA data. This is my initial kallisto bus script:
kallisto bus --index ./mus_musculus/transcriptome.idx -o /${f} --technology=BDWTA --threads=16 --fr-stranded ${f}_R1.fastq ${f}_R2.fastq -g /mus_musculus/Mus_musculus.GRCm38.96.gtf
Example result:
[index] k-mer length: 31
[index] number of targets: 118,489
[index] number of k-mers: 100,614,952
[index] number of equivalence classes: 433,624
[quant] will process sample 1: control_R1.fastq
control_R2.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 289,230,676 reads, 224,235,182 reads pseudoaligned
From there I sorted my .bus file and tried to generate a count table:
bustools sort -o sorted.bus output.bus
bustools count --genecounts -g /mus_musculus/transcripts_to_genes.txt -t transcripts.txt -e matrix.ec -o counts sorted.bus
This gave me an odd matrix with dimensions: 13 18494348
From there I decided to correct the .bus file with bustools correct. I didn't see any whitelists for the BDWTA data so I also generated my own whitelists for each set of data and then sorted it:
bustools whitelist -o control_whitelist output.bus
Example results:
Read in 102086448 BUS records, wrote 232194 barcodes to whitelist with threshold 61
bustools correct -o corr_control.bus --whitelist control_whitelist output.bus
Example results:
Found 232194 barcodes in the whitelist
Processed 224235182 BUS records
In whitelist = 176801187
Corrected = 5916173
Uncorrected = 41517822
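As a quick sanity check on these numbers, the three categories can be summed and expressed as fractions of the processed records. The sketch below uses a hypothetical helper, `correct_summary`, that takes the three counts reported by bustools correct; it only does arithmetic and does not call bustools.

```shell
#!/bin/sh
# correct_summary IN_WHITELIST CORRECTED UNCORRECTED
# Prints the total and each category as a percentage of that total.
correct_summary() {
    awk -v in_wl="$1" -v corrected="$2" -v uncorrected="$3" 'BEGIN {
        total = in_wl + corrected + uncorrected
        printf "total=%d\n", total
        printf "in_whitelist=%.2f%%\n", 100 * in_wl / total
        printf "corrected=%.2f%%\n", 100 * corrected / total
        printf "uncorrected=%.2f%%\n", 100 * uncorrected / total
    }'
}

# Counts taken from the bustools correct output above.
correct_summary 176801187 5916173 41517822
```

The total comes to 224235182, which matches both the "Processed" line and the number of pseudoaligned reads reported by kallisto.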
Then I sorted the .bus file
bustools sort -o sorted_corr_control.bus corr_control.bus
and ran bustools count:
bustools count --genecounts -g /mus_musculus/transcripts_to_genes.txt -t transcripts.txt -e matrix.ec -o control_counts sorted_corr_control.bus
I now have a matrix with more reasonable dimensions: 16632 9838 (with 9838 barcodes detected), but I am expecting to see ~2500 unique barcodes per sample. Across my 4 samples I am actually seeing between ~2,500 and ~10,000 barcodes per sample. Have I made a mistake in how I am generating the whitelist? Is there already a built-in whitelist for the BDWTA data?
Thank you for your time!