Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Kallisto-0.48.0, usage of -x BDWTA, outputting too many barcodes #77

Open
eyscott opened this issue Mar 10, 2022 · 5 comments
Open

Comments

@eyscott
Copy link

eyscott commented Mar 10, 2022

Hi,
I am using the 0.48.0 version of kallisto, as well as bustools (0.41.0) to demultiplex and obtain gene count tables for my BD Rhapsody WTA data. This is my initial kallisto bus script:
kallisto bus --index ./mus_musculus/transcriptome.idx -o /${f} --technology=BDWTA --threads=16 --fr-stranded ${f}_R1.fastq ${f}_R2.fastq -g /mus_musculus/Mus_musculus.GRCm38.96.gtf
Example result:
[index] k-mer length: 31
[index] number of targets: 118,489
[index] number of k-mers: 100,614,952
[index] number of equivalence classes: 433,624
[quant] will process sample 1: control_R1.fastq
control_R2.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 289,230,676 reads, 224,235,182 reads pseudoaligned
From there I sorted my .bus file and tried to generate a count table:
bustools sort -o sorted.bus output.bus
bustools count --genecounts -g /mus_musculus/transcripts_to_genes.txt -t transcripts.txt -e matrix.ec -o counts sorted.bus
This through me an odd matrix with dimensions: 13 18494348
From there I decided to correct the .bus file with bustools correct. I didn't see any whitelists for the BDWTA data so I also generated my own whitelists for each set of data and then sorted it:
bustools whitelist -o control_whitelist output.bus
Example results:
Read in 102086448 BUS records, wrote 232194 barcodes to whitelist with threshold 61
bustools correct -o corr_control.bus --whitelist control_whitelist output.bus
Example results:
Found 232194 barcodes in the whitelist
Processed 224235182 BUS records
In whitelist = 176801187
Corrected = 5916173
Uncorrected = 41517822
Then I sorted the .bus file
bustools sort -o sorted_corr_control.bus corr_control.bus
and ran bustools count:
bustools count --genecounts -g /mus_musculus/transcripts_to_genes.txt -t transcripts.txt -e matrix.ec -o control_counts sorted_corr_control.bus
I now have a matrix with more reasonable dimensions: 16632 9838 (with 9838 barcodes detected), but I am expecting to see ~2500 unique barcodes per sample. I am actually seeing a range between ~10,000 to 2500 barcodes per sample (across 4 samples). Do I have a mistake in how I am generating the whitelist? Is there already a built-in whitelist for the BDWTA data?
Thank you for your time!

@Yenaled
Copy link
Collaborator

Yenaled commented Mar 10, 2022

bustools whitelist should be used on a sorted BUS file, not the original BUS file outputted by kallisto.

@eyscott
Copy link
Author

eyscott commented Mar 17, 2022

Hi Yenaled,
Sorry for the late response, I have also tried sorting first then running through the pipeline with the sorted file:
bustools sort -o sorted.bus output.bus
bustools whitelist -o control_whitelist sorted.bus
bustools correct -o corr_sorted_control.bus --whitelist control_whitelist sorted.bus
bustools inspect corr_sorted_control.bus
And still get way too many detected barcodes:
Read in 80594529 BUS records
Total number of reads: 182717360

Number of distinct barcodes: 1799828
Median number of reads per barcode: 1.000000
Mean number of reads per barcode: 101.519345

Number of distinct UMIs: 65536
Number of distinct barcode-UMI pairs: 61452334
Median number of UMIs per barcode: 1.000000
Mean number of UMIs per barcode: 34.143448

Estimated number of new records at 2x sequencing depth: 28228399

Thank you for your help!

@Yenaled
Copy link
Collaborator

Yenaled commented Mar 17, 2022

Can you "bustools sort" again immediately after "bustools correct", and then afterwards, run inspect and count on the sorted file?

Generally, whenever we use bustools, we sort twice: at the very beginning and then at the very end.

If this doesn't fix it, feel free to send me an email with your attached sorted BUS file (my email is list in my github profile) and I'll look into it further.

@eyscott
Copy link
Author

eyscott commented Mar 24, 2022

Here is the final solution/workflow that worked for me, many many thanks to @Yenaled for helping me every step of the way!
bustools sort -o sorted_control.bus output.bus
bustools correct -o corr_sorted_control.bus --whitelist BD_WTA_whitelist_final.txt sorted_control.bus
#where BD_WTA_whitelist_final.txt was provided by Delaney
bustools inspect corr_sorted_control.bus
Read in 78824653 BUS records
Total number of reads: 182717360

Number of distinct barcodes: 232194
Median number of reads per barcode: 132.000000
Mean number of reads per barcode: 786.916802

Number of distinct UMIs: 65536
Number of distinct barcode-UMI pairs: 59560112
Median number of UMIs per barcode: 64.000000
Mean number of UMIs per barcode: 256.510125

Estimated number of new records at 2x sequencing depth: 26326635

From here I was worried bc I was still getting many more barcodes than I was expecting (I was expecting ~2500 barcodes)
I made this into a count matrix:
bustools count --genecounts -g //mus_musculus/transcripts_to_genes.txt
-t transcripts.txt -e matrix.ec -o /control_counts corr_sorted_control.bus

And imported the matrix into R:
read_count_output <- function(dir, name) {
dir <- normalizePath(dir, mustWork = TRUE)
m <- readMM(paste0(dir, "/", name, ".mtx"))
m <- Matrix::t(m)
m <- as(m, "dgCMatrix")
ge <- ".genes.txt"
genes <- readLines(file(paste0(dir, "/", name, ge)))
barcodes <- readLines(file(paste0(dir, "/", name, ".barcodes.txt")))
colnames(m) <- barcodes
rownames(m) <- genes
return(m)
}

control_mat <- read_count_output("/Control/", name = "control_counts")
dim(control_mat)
[1] 36047 232194

I then made an elbow plot of the UMIs per barcode to establish a cut off
umi.per.barcode <- colSums(control_mat)
x <- sort(umi.per.barcode, decreasing = TRUE)
plot(x, log="xy",type="l", xlab="Barcodes", ylab="UMI counts")

image
control_mat <- control_mat[ , which(colSums(control_mat) >= 1000) ]
dim(control_mat)

[1] 36047 2725

Hence, most of the removal of barcodes occurred when filtering the matrix in R based upon UMI counts per barcode
Thank you again to @Yenaled for all the help solving this issue!

@BDXH-FY24
Copy link

does bustools work for new BD enhanced Version 2 beads?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants