This repository contains the code used to analyze the data from DISIST. Note that:
- We can't provide the data used (`merged_count_matrix_3`) or its metadata (`condtion_3`).
- During the analysis, we found that the dataset has problems at the preprocessing level. DIGIST is working on resolving them. We will re-analyze the corrected data and report the results.
- The code at lines 92-113 might not work. I will fix this part after getting the data.
Suppose we have an `n_gene` by `n_cell` matrix (it is fine to think of a gene as a feature and a cell as one data point). Each cell comes from a known population. In this situation, which genes represent a population, and how strongly do they represent it?
Statistical models give good answers. "How much" is ranked by p-value, and "which" is determined by the genes with the lowest p-values. However, if there is a covariate effect, we should adjust the original data or build a corrected data matrix. Our data combines two different kinds of sequencing data: one from Drop-seq and the other from Smart-seq2. They differ substantially. One way to address this issue is to introduce a surrogate variable, which is used together with the data at the modeling step. Please refer to `Draft_1_4.pptx` in this repository if you want to know more about the background, methods, and results. (NB: the definition of our surrogate variable is still private.)
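
To make the idea of ranking genes by p-value while accounting for a covariate concrete, here is a minimal sketch in Python. It is not the method used in this repository: the surrogate variable is private, so the sketch swaps in a stratified permutation test that treats the sequencing platform as a known covariate. The function names, the toy matrix, and the gene names (`marker`, `platform_only`) are all hypothetical.

```python
import random


def stratified_perm_pvalue(expr, population, platform, n_perm=2000, seed=0):
    """Permutation p-value for a population effect on one gene.

    Labels are shuffled only within each platform, so a constant platform
    offset cannot masquerade as a population difference.
    """
    rng = random.Random(seed)

    # Group cell indices by platform (the known covariate).
    strata = {}
    for i, p in enumerate(platform):
        strata.setdefault(p, []).append(i)

    def stat(labels):
        # Sum over platforms of (mean of population 1 - mean of population 0).
        total = 0.0
        for p in sorted(strata):
            ones = [expr[i] for i in strata[p] if labels[i] == 1]
            zeros = [expr[i] for i in strata[p] if labels[i] == 0]
            if ones and zeros:
                total += sum(ones) / len(ones) - sum(zeros) / len(zeros)
        return total

    observed = abs(stat(population))
    hits = 0
    for _ in range(n_perm):
        labels = list(population)
        for idx in strata.values():
            shuffled = [population[i] for i in idx]
            rng.shuffle(shuffled)
            for i, lab in zip(idx, shuffled):
                labels[i] = lab
        if abs(stat(labels)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def rank_genes(matrix, gene_names, population, platform):
    """Return (gene, p-value) pairs sorted from lowest to highest p-value."""
    pvals = [
        (gene, stratified_perm_pvalue(row, population, platform))
        for gene, row in zip(gene_names, matrix)
    ]
    return sorted(pvals, key=lambda pair: pair[1])


# Hypothetical toy data: 2 genes x 12 cells, 6 cells per platform.
platform = ["DropSeq"] * 6 + ["SmartSeq2"] * 6
population = [0, 0, 0, 1, 1, 1] * 2
gene_names = ["marker", "platform_only"]
matrix = [
    # Shifts with population (+5) and with platform (+10).
    [1.0, 1.2, 0.9, 6.1, 5.8, 6.0, 11.1, 10.9, 11.0, 16.2, 15.9, 16.0],
    # Shifts with platform only; no population effect.
    [1.0, 1.3, 0.8, 1.1, 0.9, 1.2, 11.0, 11.2, 10.9, 11.1, 10.8, 11.3],
]

ranked = rank_genes(matrix, gene_names, population, platform)
for gene, p in ranked:
    print(gene, round(p, 4))
```

Shuffling within each platform is one simple way to keep a platform difference from being counted as a population signal. A model-based adjustment, such as including the covariate or a surrogate variable as a term in a regression, serves the same purpose at the modeling step.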