Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kallisto gene vs transcript count #434

Open
Rajesh-HKU opened this issue Apr 15, 2024 · 4 comments
Open

kallisto gene vs transcript count #434

Rajesh-HKU opened this issue Apr 15, 2024 · 4 comments

Comments

@Rajesh-HKU
Copy link

Rajesh-HKU commented Apr 15, 2024

Dear Laura,

There is a discrepancy between the gene count and transcript count from the count matrices generated by Kallisto. I generated the gene count matrix and tcc matrix from the same set of fastq files. Subsequently, I compared the gene count vs the total of transcript count. I used only the equivalence classes which map only to the one gene of interest. I found that these counts do not match. Surprisingly, even when the transcript count, based on the equivalence classes which only map to the gene of interest, is greater than zero, the gene count for the corresponding cells is zero.

The details are as follows:

Software versions:
Kb_python 0.28.2
Kallisto 0.48

Count Matrix Commands:
The code reads a list of fastq files and creates a count matrix of either genes or tcc associated with each specimen and saves the file with the specimen prefix.

Gene count matrix :
system(paste("kb count -i index.idx -g t2g.txt -x 10xv2 -t 2 --overwrite", paste(fastq_fns[(2i-1):(2i)], collapse = " "), sep = " "), intern = TRUE)
my_files <- list.files(path = "./counts_unfiltered",pattern = "^cells_x_genes")
file.copy(from = paste0("./counts_unfiltered/", my_files), to = paste0("./cell_out/unfcounts/",specimens[2*i],"_", my_files))

Tcc_count matrix:
system(paste("kb count -i index.idx -g t2g.txt -x 10xv2 -t2 --tcc --overwrite", paste(fastq_fns[(2i-1):(2i)], collapse = " "), sep = " "), intern = TRUE)
my_files <- list.files(path = "./counts_unfiltered",pattern = "^cells_x_tcc")
file.copy(from = paste0("./counts_unfiltered/", my_files), to=paste0("./cell_out/unfcounts_tcc/",specimendata$specimenID[i],"_", my_files))

For the purpose of comparison of a particular gene vs tcc count, we only included the equivalent classes which had one gene mapping only.

#read the gene count matrix and identify rows in the gene count matrix matching to BIN1 gene
lib <- lib <- specimendata$specimenID[i]
count_file <- paste0(lib,"_cells_x_genes")
rowbin1 <- which(grepl(pattern = "^ENSG00000136717",rownames(res_matx))) #row 7729
res_mat <- read_count_output(dir = "./cell_out/unfcounts_tcc/", name = count_file, tcc = TRUE)
totalgene <- res_mat[rowbin1_tcc,]

#read the tcc count matrix, identify rows in the t2g file matching BIN1 and use only those rows for counting BIN1 transcripts
lib <- lib <- specimendata$specimenID[i]
count_file <- paste0(lib,"_cells_x_tcc")
rowbin1_tcc <- which(grepl(pattern = "\tBIN1$",t2g$V1)) #rows 28581 to 28594
res_mat <- read_count_output(dir = "./cell_out/unfcounts_tcc/", name = count_file, tcc = TRUE)
res_matx_tcc <- res_mat[rowbin1_tcc,]
#find total of 14 BIN1 transcripts for each cell
totalbin1 <- colSums(as.matrix(res_matx_tcc))

#compare res_matx and res_matx_tcc

table(totalgene,totalbin1)

      totalbin1 (sum of count of 14 transcripts)
            
            0    1   2 3 4 5 6

0 6877 933 151 29 15 1 0

1           1164 174 30 6 1 0 0

2           182 32 3 0 0 0 1

3          26 3 0 1 0 0 0

4         3 0 0 0 0 0 0

   
Please note above that when total of transcripts is >=1, the gene count is still predominantly zero.

Thanks.

Best regards,
Rajesh

@Yenaled
Copy link
Collaborator

Yenaled commented Apr 15, 2024

I'm sorry, but I'm not going to look through many lines of unformatted code to identify why you're observing a discrepancy.

A few things to keep in mind: 1) Line numbers (and mtx row/col numbers) are 1-indexed while EC/transcript identifiers are 0-indexed. 2) Equivalence classes will NOT be the same between two different kallisto runs, leading to differences.

@Rajesh-HKU
Copy link
Author

Dear Delaney,
Thanks for the prompt feedback. Sorry for the unformatted code. I am new to this or any similar forum.
Regarding your suggestions: 1) I have taken into account the 1-indexed vs 0-indexed difference 2) I have checked the equivalence classes manually as well.
The problem seems to be that for my dataset the counts generated by kallisto when run with tcc= TRUE vs tcc= FALSE option seem to be different. Is this true in other cases you have encountered as well? How do I achieve better correspondence between these two runs?

Thanks.

Best regards,
Rajesh

@Yenaled
Copy link
Collaborator

Yenaled commented Apr 15, 2024

Yes, I've been able to recapitulate my desired counts between TCCs and genes on my end.

If you want, you can run the bustools commands on your BUS file, and manually run bustools count to get TCC matrices and gene count matrices from the same BUS file. (You can even download bustools from source and edit the bustools_count.cpp C++ file to manually output things).

@lauraluebbert
Copy link
Member

Just linking the original issue here for reference pachterlab/gget_examples#6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants