annotables

Provides tables for converting and annotating Ensembl Gene IDs.

Installation

This is an R package.

Bioconductor method

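# biocLite() was retired in Bioconductor 3.8; BiocManager replaces it.
# Installing from GitHub through BiocManager also needs the remotes package.
install.packages(c("BiocManager", "remotes"))
BiocManager::install("stephenturner/annotables")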

devtools method

install.packages("devtools")
devtools::install_github("stephenturner/annotables")

Rationale

Many bioinformatics tasks require converting gene identifiers from one convention to another, or annotating gene identifiers with gene symbol, description, position, etc. Sure, biomaRt does this for you, but I got tired of remembering biomaRt syntax and hammering Ensembl's servers every time I needed to do this.

This package has basic annotation information from Ensembl Genes 91 for:

Human build 38 (grch38)
Human build 37 (grch37)
Mouse (grcm38)
Rat (rnor6)
Chicken (galgal5)
Worm (wbcel235)
Fly (bdgp6)
Macaque (mmul801)

Each table contains:

ensgene: Ensembl gene ID
entrez: Entrez gene ID
symbol: Gene symbol
chr: Chromosome
start: Start
end: End
strand: Strand
biotype: Protein coding, pseudogene, mitochondrial tRNA, etc.
description: Full gene name/description

Additionally, there are tx2gene tables that link Ensembl gene IDs to Ensembl transcript IDs.

Usage

library(annotables)

Look at the human genes table (note the description column gets cut off because the table becomes too wide to print nicely):

grch38

## # A tibble: 64,428 x 9
##    ensgene  entrez symbol chr    start    end strand biotype description  
##    <chr>     <int> <chr>  <chr>  <int>  <int>  <int> <chr>   <chr>        
##  1 ENSG000…   7105 TSPAN6 X     1.01e⁸ 1.01e⁸     -1 protei… tetraspanin …
##  2 ENSG000…  64102 TNMD   X     1.01e⁸ 1.01e⁸      1 protei… tenomodulin …
##  3 ENSG000…   8813 DPM1   20    5.09e⁷ 5.10e⁷     -1 protei… dolichyl-pho…
##  4 ENSG000…  57147 SCYL3  1     1.70e⁸ 1.70e⁸     -1 protei… SCY1 like ps…
##  5 ENSG000…  55732 C1orf… 1     1.70e⁸ 1.70e⁸      1 protei… chromosome 1…
##  6 ENSG000…   2268 FGR    1     2.76e⁷ 2.76e⁷     -1 protei… FGR proto-on…
##  7 ENSG000…   3075 CFH    1     1.97e⁸ 1.97e⁸      1 protei… complement f…
##  8 ENSG000…   2519 FUCA2  6     1.43e⁸ 1.44e⁸     -1 protei… fucosidase, …
##  9 ENSG000…   2729 GCLC   6     5.35e⁷ 5.36e⁷     -1 protei… glutamate-cy…
## 10 ENSG000…   4800 NFYA   6     4.11e⁷ 4.11e⁷      1 protei… nuclear tran…
## # ... with 64,418 more rows

Look at the human genes-to-transcripts table:

grch38_tx2gene

## # A tibble: 219,288 x 2
##    enstxp          ensgene        
##    <chr>           <chr>          
##  1 ENST00000373020 ENSG00000000003
##  2 ENST00000496771 ENSG00000000003
##  3 ENST00000494424 ENSG00000000003
##  4 ENST00000612152 ENSG00000000003
##  5 ENST00000614008 ENSG00000000003
##  6 ENST00000373031 ENSG00000000005
##  7 ENST00000485971 ENSG00000000005
##  8 ENST00000371588 ENSG00000000419
##  9 ENST00000466152 ENSG00000000419
## 10 ENST00000371582 ENSG00000000419
## # ... with 219,278 more rows
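
One common use for a transcript-to-gene table is collapsing transcript-level values to gene level. The following is a minimal base R sketch of that idea, not something the package itself provides. The small tx2gene data frame is a stand-in that copies the first seven rows printed above, and the tx_counts values are made-up numbers used only for illustration.

```r
# Stand-in for the first seven rows of grch38_tx2gene printed above
tx2gene <- data.frame(
  enstxp = c("ENST00000373020", "ENST00000496771", "ENST00000494424",
             "ENST00000612152", "ENST00000614008", "ENST00000373031",
             "ENST00000485971"),
  ensgene = c(rep("ENSG00000000003", 5), rep("ENSG00000000005", 2)),
  stringsAsFactors = FALSE
)

# Hypothetical transcript-level counts (made-up numbers, illustration only)
tx_counts <- c(ENST00000373020 = 120, ENST00000496771 = 8,
               ENST00000494424 = 3,   ENST00000612152 = 0,
               ENST00000614008 = 15,  ENST00000373031 = 2,
               ENST00000485971 = 1)

# Look up the gene for each transcript, then sum counts within each gene
gene_of_tx  <- tx2gene$ensgene[match(names(tx_counts), tx2gene$enstxp)]
gene_counts <- tapply(tx_counts, gene_of_tx, sum)
gene_counts
```

With the package loaded, the same two lines at the end should work with grch38_tx2gene in place of the stand-in, since they rely only on the enstxp and ensgene columns.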

Tables are saved in tibble format, pipe-able with dplyr:

grch38 %>% 
    dplyr::filter(biotype == "protein_coding" & chr == "1") %>% 
    dplyr::select(ensgene, symbol, chr, start, end, description) %>% 
    head %>% 
    knitr::kable(.)

ensgene	symbol	chr	start	end	description
ENSG00000000457	SCYL3	1	169849631	169894267	SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
ENSG00000000460	C1orf112	1	169662007	169854080	chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
ENSG00000000938	FGR	1	27612064	27635277	FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:3697]
ENSG00000000971	CFH	1	196651878	196747504	complement factor H [Source:HGNC Symbol;Acc:HGNC:4883]
ENSG00000001460	STPG1	1	24356999	24416934	sperm tail PG-rich repeat containing 1 [Source:HGNC Symbol;Acc:HGNC:28070]
ENSG00000001461	NIPAL3	1	24415794	24472976	NIPA like domain containing 3 [Source:HGNC Symbol;Acc:HGNC:25233]
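
For plain identifier conversion without dplyr, a lookup with match() is often enough. This is a minimal sketch: ens_to_symbol is a hypothetical helper, not a function exported by annotables, and the annot data frame is a stand-in that copies three rows from the table above. The last ID in ids is deliberately made up to show what happens when an ID has no match.

```r
# Stand-in for grch38, using three rows from the table printed above
annot <- data.frame(
  ensgene = c("ENSG00000000457", "ENSG00000000460", "ENSG00000000938"),
  symbol  = c("SCYL3", "C1orf112", "FGR"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the symbol for each Ensembl gene ID,
# in the same order as the input; unmatched IDs give NA
ens_to_symbol <- function(ids, annot) {
  annot$symbol[match(ids, annot$ensgene)]
}

# The third ID is made up so that the unmatched case is visible
ids <- c("ENSG00000000938", "ENSG00000000457", "ENSG99999999999")
ens_to_symbol(ids, annot)
```

With the package loaded, passing grch38 as the second argument should behave the same way, because the helper only uses the ensgene and symbol columns. Note that match() returns the first matching row only.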

Example with DESeq2 results from the airway package, made tidy with biobroom:

library(DESeq2)
library(airway)

data(airway)
airway <- DESeqDataSet(airway, design = ~cell + dex)
airway <- DESeq(airway)
res <- results(airway)

# tidy results with biobroom
library(biobroom)
res_tidy <- tidy.DESeqResults(res)
head(res_tidy)

## # A tibble: 6 x 7
##   gene            baseMean estimate stderror statistic  p.value p.adjusted
##   <chr>              <dbl>    <dbl>    <dbl>     <dbl>    <dbl>      <dbl>
## 1 ENSG00000000003  709       0.381     0.101     3.79   1.52e⁻⁴    0.00128
## 2 ENSG00000000005    0      NA        NA        NA     NA         NA      
## 3 ENSG00000000419  520     - 0.207     0.112   - 1.84   6.53e⁻²    0.197  
## 4 ENSG00000000457  237     - 0.0379    0.143   - 0.264  7.92e⁻¹    0.911  
## 5 ENSG00000000460   57.9     0.0882    0.287     0.307  7.59e⁻¹    0.895  
## 6 ENSG00000000938    0.318   1.38      3.50      0.394  6.94e⁻¹   NA

res_tidy %>% 
    dplyr::arrange(p.adjusted) %>% 
    head(20) %>% 
    dplyr::inner_join(grch38, by = c("gene" = "ensgene")) %>% 
    dplyr::select(gene, estimate, p.adjusted, symbol, description) %>% 
    knitr::kable(.)

gene	estimate	symbol	description
ENSG00000152583	-4.574919	SPARCL1	SPARC like 1 [Source:HGNC Symbol;Acc:HGNC:11220]
ENSG00000165995	-3.291062	CACNB2	calcium voltage-gated channel auxiliary subunit beta 2 [Source:HGNC Symbol;Acc:HGNC:1402]
ENSG00000120129	-2.947810	DUSP1	dual specificity phosphatase 1 [Source:HGNC Symbol;Acc:HGNC:3064]
ENSG00000101347	-3.766995	SAMHD1	SAM and HD domain containing deoxynucleoside triphosphate triphosphohydrolase 1 [Source:HGNC Symbol;Acc:HGNC:15925]
ENSG00000189221	-3.353580	MAOA	monoamine oxidase A [Source:HGNC Symbol;Acc:HGNC:6833]
ENSG00000211445	-3.730403	GPX3	glutathione peroxidase 3 [Source:HGNC Symbol;Acc:HGNC:4555]
ENSG00000157214	-1.976773	STEAP2	STEAP2 metalloreductase [Source:HGNC Symbol;Acc:HGNC:17885]
ENSG00000162614	-2.035665	NEXN	nexilin F-actin binding protein [Source:HGNC Symbol;Acc:HGNC:29557]
ENSG00000125148	-2.210979	MT2A	metallothionein 2A [Source:HGNC Symbol;Acc:HGNC:7406]
ENSG00000154734	-2.345604	ADAMTS1	ADAM metallopeptidase with thrombospondin type 1 motif 1 [Source:HGNC Symbol;Acc:HGNC:217]
ENSG00000139132	-2.228903	FGD4	FYVE, RhoGEF and PH domain containing 4 [Source:HGNC Symbol;Acc:HGNC:19125]
ENSG00000162493	-1.891217	PDPN	podoplanin [Source:HGNC Symbol;Acc:HGNC:29602]
ENSG00000134243	-2.195712	SORT1	sortilin 1 [Source:HGNC Symbol;Acc:HGNC:11186]
ENSG00000179094	-3.191750	PER1	period circadian clock 1 [Source:HGNC Symbol;Acc:HGNC:8845]
ENSG00000162692	3.692662	VCAM1	vascular cell adhesion molecule 1 [Source:HGNC Symbol;Acc:HGNC:12663]
ENSG00000163884	-4.459128	KLF15	Kruppel like factor 15 [Source:HGNC Symbol;Acc:HGNC:14536]
ENSG00000178695	2.528174	KCTD12	potassium channel tetramerization domain containing 12 [Source:HGNC Symbol;Acc:HGNC:14678]
ENSG00000198624	-2.918436	CCDC69	coiled-coil domain containing 69 [Source:HGNC Symbol;Acc:HGNC:24487]
ENSG00000107562	1.911670	CXCL12	C-X-C motif chemokine ligand 12 [Source:HGNC Symbol;Acc:HGNC:10672]
ENSG00000148848	1.814543	ADAM12	ADAM metallopeptidase domain 12 [Source:HGNC Symbol;Acc:HGNC:190]
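
One thing worth checking before a join like the one above is whether the annotation table has more than one row per Ensembl ID, for example when a gene maps to several Entrez IDs. That is an assumption about the data, and anyDuplicated(grch38$ensgene) is a quick way to test it. The sketch below uses small made-up tables (placeholder IDs, symbols and estimates, not real genes) to show how a duplicated key repeats result rows, and one way to avoid that.

```r
# Made-up annotation table in which one Ensembl ID has two Entrez IDs
annot <- data.frame(
  ensgene = c("ENSG_A", "ENSG_A", "ENSG_B"),
  entrez  = c(1L, 2L, 3L),
  symbol  = c("GENEA", "GENEA", "GENEB"),
  stringsAsFactors = FALSE
)

# Made-up results table with one row per gene
res <- data.frame(
  gene     = c("ENSG_A", "ENSG_B"),
  estimate = c(-1.5, 2.0),
  stringsAsFactors = FALSE
)

# A plain join repeats the result row for the duplicated ID
joined <- merge(res, annot, by.x = "gene", by.y = "ensgene")
nrow(joined)         # 3 rows for 2 genes

# Keep the first annotation row per Ensembl ID, then join
annot_unique  <- annot[!duplicated(annot$ensgene), c("ensgene", "symbol")]
joined_unique <- merge(res, annot_unique, by.x = "gene", by.y = "ensgene")
nrow(joined_unique)  # 2 rows, one per gene
```

Keeping the first row is a simplification. If the Entrez ID matters downstream, dropping duplicates this way discards the alternative mappings.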
