-
Notifications
You must be signed in to change notification settings - Fork 6
/
getting_started.Rmd
152 lines (130 loc) · 4.49 KB
/
getting_started.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
---
title: "Getting started"
author: "Lambda"
date: "6/15/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
fig.align = "center")
```
In thihs notebook, we load the output of `bustools` and do some basic analysis with R.
```{r, message=FALSE}
library(Seurat)
library(DropletUtils)
library(Matrix)
library(data.table)
library(ggplot2)
library(magrittr)
fn <- "/home/single_cell_analysis/kallisto_out_single/kallisto_SRR8599150_v2/"
```
Here's a function that can load the output into R in one step. This is part of the `BUSpaRse` package, which implement utility functions related to `bustools`.
```{r}
# Function to load matrix
#' Read matrix along with barcode and gene names
#'
#' This function takes in a directory and name and reads the mtx file, genes,
#' and barcodes from the output of `bustools` to return a sparse matrix with
#' column names and row names.
#'
#' @param dir Directory with the bustools count outputs.
#' @param name The files in the output directory should be <name>.mtx, <name>.genes.txt,
#' and <name>.barcodes.txt.
#' @param tcc Logical, whether the matrix of interest is a TCC matrix. Defaults
#' to \code{FALSE}.
#' @return A dgCMatrix with barcodes as column names and genes as row names.
#'
read_count_output <- function(dir, name, tcc = TRUE) {
dir <- normalizePath(dir, mustWork = TRUE)
m <- readMM(paste0(dir, "/", name, ".mtx"))
m <- t(m)
m <- as(m, "dgCMatrix")
# The matrix read has cells in rows
ge <- if (tcc) ".ec.txt" else ".genes.txt"
genes <- fread(paste0(dir, "/", name, ge), header = FALSE)$V1
barcodes <- fread(paste0(dir, "/", name, ".barcodes.txt"), header = FALSE)$V1
colnames(m) <- barcodes
rownames(m) <- genes
return(m)
}
```
```{r}
# The directory where the data is placed
list.files(fn)
```
```{r}
# Load data
m <- read_count_output(fn, "genes", tcc = FALSE)
dim(m)
```
We expect 3949 cells, but there are way more barcodes (columns) than expected.
# Remove empty droplets
The vast majority of barcodes we see in the data are empty droplets. We will remove barcodes that are likely to be empty droplets and keep those that are likely to be real cells. Here we use the classic knee plot method to remove empty droplets.
```{r}
bc_rank <- barcodeRanks(m)
```
```{r}
qplot(bc_rank$rank, bc_rank$total, geom = "line") +
geom_hline(yintercept = metadata(bc_rank)$knee, color = "blue", linetype = 2) +
geom_hline(yintercept = metadata(bc_rank)$inflection, color = "green", linetype = 2) +
annotate("text", x = 1000, y = 1.2 * c(metadata(bc_rank)$knee, metadata(bc_rank)$inflection),
label = c("knee", "inflection"), color = c("blue", "green")) +
scale_x_log10() +
scale_y_log10() +
labs(x = "Rank", y = "Total reads") +
theme_bw()
```
Now we filter the barcodes based on the inflectionsn point. We will also remove genes not detected in any cell.
```{r}
# Filter the barcodes based on
tot_counts <- colSums(m)
tot_genes <- rowSums(m)
m_filtered <- m[, tot_counts > metadata(bc_rank)$inflection]
m_filtered <- m_filtered[tot_genes > 0, ]
dim(m_filtered)
```
That's a more reasonable number of barcodes.
# Basic analysis
First, we normali and scale the data with [`SCTransform`](https://github.com/ChristophH/sctransform).
```{r, message=FALSE, warning=FALSE}
seu <- CreateSeuratObject(m_filtered) %>%
SCTransform(verbose = FALSE)
```
Some QC: distribution of total UMI counts and number of genes detected
```{r}
VlnPlot(seu, c("nCount_RNA", "nFeature_RNA"), pt.size = 0.1)
```
Also see how total UMI counts in each cell relates to the number of genes detected.
```{r}
ggplot(seu@meta.data, aes(nCount_RNA, nFeature_RNA)) +
geom_point(size = 0.5, alpha = 0.3) +
theme_bw() +
labs(x = "Total UMI counts", y = "Number of genes detected")
```
Then PCA.
```{r}
seu <- RunPCA(seu, npcs = 50, verbose = FALSE)
ElbowPlot(seu, ndims = 50) + labs(x = "Principal component")
```
The y axis of this plot is the standard deviation of the data explained by the principal component (PC). Judging from the elbow plot, we will use the first 30 PCs for tSNE.
```{r}
# Run tSNE
seu <- RunTSNE(seu, dims = 1:30, verbose = FALSE)
```
Here we cluster the data with the Louvain algorithm. The algorithm used in Seurat can be changed.
```{r}
seu <- FindNeighbors(seu, verbose = FALSE)
seu <- FindClusters(seu, verbose = FALSE)
```
What do the clusters look like?
```{r}
# On PCA
DimPlot(seu, pt.size = 0.5, reduction = "pca")
```
```{r}
# On tSNE
DimPlot(seu, pt.size = 0.5, reduction = "tsne")
```
```{r}
sessionInfo()
```