FredHutch · cansavvy · Mar 22, 2024 · Mar 11, 2024
diff --git a/vignettes/getting_started.Rmd b/vignettes/getting_started.Rmd
@@ -1,5 +1,5 @@
 ---
-title: "getting-started"
+title: "getting_started"
 output: rmarkdown::html_vignette
 vignette: >
   %\VignetteIndexEntry{getting-started}
@@ -22,18 +22,26 @@ gimap performs analysis of dual-targeting CRISPR screening data, with the goal o
 library(gimap)
 ```
 
+```{r}
+library(dplyr)
+```
+
 ## Data requirements
 
 Let's examine this example pgPEN counts table. It's divided into columns containing:
 
-- an ID corresponding to the names of paired guides
-- gRNA sequence 1, targeting "paralog A"
-- gRNA sequence 2, targeting "paralog B"
-- The sample, day, and replicate number for which gRNAs were sequenced
+- `id`: an ID corresponding to the names of paired guides
+- `seq_1`: gRNA sequence 1, targeting "paralog A"
+- `seq_2`: gRNA sequence 2, targeting "paralog B"
+- `Day00_RepA`: read counts from Day 00 for Replicate A
+- `Day05_RepA`: read counts from Day 05 for Replicate A
+- `Day22_RepA`: read counts from Day 22 for Replicate A
+- `Day22_RepB`, `Day22_RepC`: read counts from Day 22 for Replicates B and C
 
 ```{r}
 example_data <- get_example_data("count")
 ```
+
 The metadata you have may vary slightly from this but you'll want to make sure you have the essential variables and information regarding how you collected your data.
 
 ```{r}
@@ -62,7 +70,7 @@ The first data set contains the readcounts from each sample type. Required for a
 
 ```{r}
 example_counts <- example_data %>%
-  dplyr::select(c("Day00_RepA", "Day05_RepA", "Day22_RepA", "Day22_RepB", "Day22_RepC")) %>%
+  select(c("Day00_RepA", "Day05_RepA", "Day22_RepA", "Day22_RepB", "Day22_RepC")) %>%
   as.matrix()
 ```
 
@@ -74,11 +82,9 @@ The next datasets are metadata that describe the dimensions of the count data.
 One of these (`example_pg_metadata`) is required because it is necessary to know the IDs and be able to map them to pgRNA constructs.
 
 ```{r}
-# pg metadata is the information that describes the paired guide RNA targets and will be loaded/explained later
-
-#pg id are just the unique IDs listed in the same order/sorted the same way as the count data and can be used for mapping between the count data and the metadata
-example_pg_id <- example_data %>%
-  dplyr::select("id")
+example_pg_metadata <- example_data %>%
+  select(c("id", "seq_1", "seq_2"))
+```
 
 # sample metadata is the information that describes the samples and is sorted the same way as the columns in the count data
 example_sample_metadata <- data.frame(
@@ -106,6 +112,17 @@ gimap_dataset <- setup_data(counts = example_counts,
 
 You'll notice that this setup gives us a list of formatted data. It contains the original counts we gave to the `setup_data()` function, as well as normalized counts and the total counts per sample.
 
+- `raw_counts`: The original count data, which reflect the number of cells alive in each sample. Samples are the columns and paired guide constructs are the rows.
+- `counts_per_sample`: The total counts for each sample, summed over all of the paired guide designs.
+- Transformed data: The normalized and adjusted versions of the raw count data:
+  - `count_norm`: For each sample, the data are normalized as `-log10((counts + 1) / (total counts for the sample over all the pg designs))`
+  - `cpm`: Counts per million. For each sample, this is calculated as `(counts / total counts for the sample over all the pg designs) * 1e6`
+  - `log2cpm`: log2-transformed counts per million, calculated as `log2(cpm + 1)`
+- `pg_metadata`: Paired guide metadata. Information that describes the paired guide RNA designs. This may include the sequences used in the CRISPR design as well as which genes are targeted.
+- `sample_metadata`: Metadata that describes the samples. This likely includes the time point information, replicates, sample IDs, and any other information needed about the experimental setup.
+
+
+
 ```{r}
 str(gimap_dataset)
 ```
@@ -117,7 +134,7 @@ Later explain other parameters and how they can be used
 ```{r}
 run_qc(gimap_dataset,
        output_file = "example_qc_report.Rmd",
-       overwrite = TRUE, 
+       overwrite = TRUE,
        quiet = TRUE)
 ```
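
Note on the transformed data added in the `setup_data()` hunk: the three formulas in that bullet list (`count_norm`, `cpm`, `log2cpm`) can be sketched in base R as below. This is only an illustration of the arithmetic as the vignette text describes it, not gimap's actual implementation. The toy matrix, its values, and the helper `divide_by_sample_total()` are made up for this example.

```r
# Toy counts matrix: rows are paired guide constructs, columns are samples.
# All values are invented for illustration only.
counts <- matrix(
  c(10, 0, 90,
    20, 30, 50),
  nrow = 3,
  dimnames = list(c("pg_1", "pg_2", "pg_3"), c("Day00_RepA", "Day22_RepA"))
)

# Total counts per sample, summed over all paired guide designs
counts_per_sample <- colSums(counts)

# Hypothetical helper: divide every column by that sample's total
divide_by_sample_total <- function(x) sweep(x, 2, counts_per_sample, "/")

# -log10((counts + 1) / total counts for the sample)
count_norm <- -log10(divide_by_sample_total(counts + 1))

# (counts / total counts for the sample) * 1e6
cpm <- divide_by_sample_total(counts) * 1e6

# log2(cpm + 1)
log2cpm <- log2(cpm + 1)
```

The `+ 1` in `count_norm` and `log2cpm` keeps zero counts finite, since `log10(0)` and `log2(0)` are `-Inf` in R.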