Scalable doublet finder #103

GreenGilad · 2021-02-17T10:39:07Z

A major limitation of the current DoubletFinder version is its limited ability to scale to larger datasets. The current implementation computes the distance matrix (over the PC space) for all cells in the dataset at once, resulting in O(n^2) space complexity.

To improve space complexity, the distance matrix can be computed for only a subset of batch.size cells at a time, resulting in O(n*k) space complexity, where k = batch.size. The default value of batch.size is Inf, so the default behaviour of the algorithm is unchanged.
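The batching idea can be illustrated with a minimal sketch. This is plain Python for illustration only, not the R code in this PR: the function name `pann_in_batches`, its arguments, and the convention that real cells are indexed before artificial ones are all assumptions. For each batch of real cells, only a block of at most batch_size x n distances is held in memory, and the proportion of artificial nearest neighbours (pANN) is computed from that block before moving on.

```python
import heapq
import math


def pann_in_batches(real, artificial, n_neighbours, batch_size=math.inf):
    """Fraction of each real cell's nearest neighbours that are artificial.

    Hypothetical helper: `real` and `artificial` are lists of coordinate
    tuples (e.g. PC embeddings); real cells are indexed first.
    """
    everyone = list(real) + list(artificial)
    n_real = len(real)
    # batch_size = inf means a single batch, i.e. the unbatched behaviour.
    step = max(1, n_real if math.isinf(batch_size) else int(batch_size))

    pann = [0.0] * n_real
    for start in range(0, n_real, step):
        stop = min(start + step, n_real)
        # Distances from the cells of this batch to every cell:
        # at most batch_size * n values exist at any one time.
        block = [[math.dist(everyone[i], q) for q in everyone]
                 for i in range(start, stop)]
        for offset, row in enumerate(block):
            i = start + offset
            # Exclude the cell itself, then keep the closest n_neighbours.
            candidates = (j for j in range(len(everyone)) if j != i)
            nearest = heapq.nsmallest(n_neighbours, candidates,
                                      key=row.__getitem__)
            # Indices >= n_real belong to artificial doublets.
            pann[i] = sum(j >= n_real for j in nearest) / n_neighbours
    return pann
```

Because each batch is processed independently against the full set of cells, the result should not depend on the batch size; calling the sketch with `batch_size=1` and with the default `math.inf` on the same input gives the same pANN values.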

In addition, I found it beneficial to store, for each real cell, the IDs of its artificial nearest neighbours, as well as the real cell identities that were used to generate each artificial cell. Once DoubletFinder has been run on a dataset, this information helps interpret the doublet/singlet classification. Both the list of artificial nearest neighbours and the parent idents data frame are stored as a Tool record in the Seurat object.
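As a rough illustration of how these two records can be combined to interpret a call, here is a small Python sketch. It is not the data layout used in this PR: the helper names, the `X1`, `X2`, ... naming of artificial cells, and the use of plain dictionaries instead of data frames are assumptions made for the example.

```python
import random


def sample_doublet_parents(cell_ids, n_doublets, seed=0):
    """Map each artificial doublet name to the pair of real cells it was built from."""
    rng = random.Random(seed)
    parents = {}
    for d in range(n_doublets):
        # Two distinct real cells are drawn for every artificial doublet.
        parent_a, parent_b = rng.sample(cell_ids, 2)
        parents[f"X{d + 1}"] = (parent_a, parent_b)
    return parents


def keep_artificial_neighbours(neighbours_by_cell, parents):
    """For each real cell, keep only the neighbours that are artificial doublets."""
    return {cell: [nb for nb in nbs if nb in parents]
            for cell, nbs in neighbours_by_cell.items()}


def explain_call(cell, artificial_neighbours, parents):
    """List the parent pairs behind the artificial neighbours of one real cell."""
    return {art: parents[art] for art in artificial_neighbours[cell]}
```

With these two lookups, a cell classified as a doublet can be traced back to the real cells whose simulated mixtures it resembles, which is the kind of interpretation the stored records are meant to support.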

Lastly, I added a call to the LogSeuratCommand function to store the parameters used to run DoubletFinder.


GreenGilad and others added 3 commits February 17, 2021 12:23

Computation of distance matrix and pANN in batches

33ca25b

Added artificial doublet parents and artificial doublet neighbors data frames as a Tool record.

e23fc5c

Added support for sampling pairs of different identities

6fe3501
