-
Notifications
You must be signed in to change notification settings - Fork 3
/
README.Rmd
154 lines (121 loc) · 5.58 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# kgp
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/kgp)](https://CRAN.R-project.org/package=kgp)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![arXiv](https://img.shields.io/badge/arXiv-2210.00539-b31b1b.svg)](https://arxiv.org/abs/2210.00539)
<!-- badges: end -->
This kgp data package provides metadata about populations and data about samples from the 1000 Genomes Project, including the 2,504 samples sequenced for the Phase 3 release and the expanded collection of 3,202 samples with 602 additional trios.
## Installation
You can install the released version of kgp from [CRAN](https://CRAN.R-project.org/package=kgp) with:
```r
install.packages("kgp")
```
You can install the development version of kgp from [GitHub](https://github.com/stephenturner/kgp) with:
```r
# install.packages("devtools")
devtools::install_github("stephenturner/kgp")
```
## About the data
The 1000 Genomes Project data Phase 3 data contains 2,504 samples with sequence data available, and was later expanded to 3,202 samples with high coverage adding 602 trios. Data is available through the [1000 Genomes FTP site](http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/) and [GitHub](https://github.com/igsr/1000Genomes_data_indexes/).
- Pilot publication: [An integrated map of genetic variation from 1,092 human genomes](https://www.nature.com/articles/nature11632)
- Phase 1 publication: [A map of human genome variation from population scale sequencing](https://www.nature.com/articles/nature09534)
- Phase 3 publication: [A global reference for human genetic variation](https://www.nature.com/articles/nature15393)
- Expanded high-coverage publication: [High-coverage whole-genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios](https://pubmed.ncbi.nlm.nih.gov/36055201/)
There are three data sets available in the kgp package.
```{r example}
library(kgp)
data(kgp)
```
The `kgp3` data contains pedigree and population information for the 2,504 samples included in the Phase 3 release of the 1000 Genomes Project data.
```{r}
kgp3
```
The `kgpe` data contains pedigree and population information all 3,202 samples included in the expanded 1000 Genomes Project data, which includes 602 trios.
```{r}
kgpe
```
The `kgpmeta` contains population metadata for the 26 populations across five continental regions.
```{r}
kgpmeta
```
## Examples
```{r, message=FALSE, warning=FALSE}
library(dplyr)
library(ggplot2)
library(kgp)
data(kgp)
```
Count the number of samples in each region, or in each population:
```{r}
kgp3 %>%
count(region) %>%
knitr::kable()
```
```{r}
kgp3 %>%
count(region, population) %>%
knitr::kable()
```
```{r kgp3barplot, fig.width=9, fig.height=12}
kgp3 %>%
count(region, population) %>%
arrange(region, n) %>%
mutate(population=forcats::fct_inorder(population)) %>%
ggplot(aes(population, n)) +
geom_col(aes(fill=region)) +
labs(fill=NULL, x=NULL, x="N") +
coord_flip() +
theme_bw() +
theme(legend.position="bottom")
```
The latitude and longitude coordinates in `kgpmeta` can be used to plot a map of the locations of the 1000 Genomes populations. There is also a column for region color, which provides a hexadecimal color code to enable reproduction of the population data map as shown on the IGSR population data page. The figure below shows a static map produced using ggplot2, but interactive maps such as that shown on the IGSR population data portal can be created with the leaflet package.
```{r kgpmap, fig.cap="Map showing locations of the 1000 Genomes Phase 3 populations.", fig.width=8, fig.height=6}
pal <- kgpmeta %>% distinct(reg, regcolor) %>% tibble::deframe()
ggplot() +
geom_polygon(data=map_data("world"),
aes(long, lat, group=group),
col="gray30", fill="gray95", lwd=.2, alpha=.5) +
geom_point(data=kgpmeta, aes(lng, lat, col=reg), size=4) +
scale_colour_manual(values=pal) +
theme_minimal() +
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title = element_blank(),
legend.title = element_blank(),
panel.grid = element_blank(),
legend.position = "bottom")
```
The table below shows a selection of samples from `kgpe` showing pedigree information for each sample. This pedigree information could be used in downstream analysis to filter out related individuals, select only trios, or to visualize family structure.
```{r kgpe}
kgpe %>%
filter(pid!="0" & mid!="0") %>%
group_by(pop) %>%
slice(1) %>%
head(12) %>%
arrange(reg, pop) %>%
select(fid:reg) %>%
select(-sexf) %>%
knitr::kable()
```
The figure below shows an example of a pedigree plot made by parsing the pedigree information using [skater](https://cran.r-project.org/package=skater) and plotting using [kinship2](https://cran.r-project.org/package=kinship2). The skater package provides documentation, examples, and a vignette demonstrating how to iteratively plot all pedigrees in a given data set.
```{r pedplot, fig.height=5, fig.width=8, fig.cap="Trios in 1000 Genomes Project family 13291."}
kgpe %>%
filter(fid=="13291") %>%
transmute(fid, id, dadid=pid, momid=mid, sex, affected=1) %>%
skater::fam2ped() %>%
pull(ped) %>%
purrr::pluck(1) %>%
kinship2::plot.pedigree(mar=c(4,2,4,2), cex=.8)
```