Introduction

Thanks to research becoming more open in the last decade, there is now a huge amount of freely available data online. The push towards repeatable analysis and open access to data has had an enormous impact on medical science, and will continue to do so. There is too much data to cite properly in a single document, but hopefully the main areas are covered here.

This document provides an overview of some of the commonly used open-access archives.

Databases

Jump to

Genomics & Functional Elements
Transcriptomics
Networks, pathways & reactions
Variation
Proteomics
Microbiomics / Metagenomics
Imaging
Organism Specific
Misc

Genomics & Functional Elements

Name	Data stored	Organisms	Ease of Access	Amount of data	Data curation /quality
Nucleotide Databases
ENA	All nucleotide sequences	All	🟢	🟢	🔴
NCBI Nucleotide	All nucleotide sequences	All	🟡	🟢	🔴
DDBJ	All nucleotide sequences	All	🔴	🟢	🔴
Functional Elements
ENCODE	Annotations for human functional DNA elements	Human + select model organisms	🟢	🟢	🟢
GENCODE	Annotations for human (and mouse) genes	Human, Mouse	🟢	🟢	🟢
GeneCards	Aggregator for all gene-centric data. Each gene listed once.	Human	🟢	🟢	🟡
NCBI Gene	Genes and links to data/metadata	All	🟡	🟢	🔴
Sequence Reads
ENA	All nucleotide data	All	🟢	🟢	🔴
SRA	High-throughput sequence data	All	🟡	🟢	🔴
DRA	High-throughput sequence data	All	🔴	🟢	🔴
NCBI Trace Archive	Capillary sequencing only	All	🔴	🟢	🔴
DDBJ Trace Archive (DTA)	Capillary sequencing only	All	🔴	🟢	🔴
Genome Assemblies
NCBI Assembly	Genome Assemblies	All	🟢	🟢	🔴
ENA	All nucleotide sequences	All	🟡	🟢	🔴
DDBJ	All nucleotide sequences	All	🔴	🟢	🔴
Taxonomy
NCBI Taxonomy	The standard taxonomy system	All	🟡	🟢	🔴
Ontologies
-	-	-	-	-	-

Transcriptomics

Name	Data stored	Organisms	Ease of Access	Amount of data	Data curation /quality
Bulk Tissue Gene Expression
GTEx	Tissue-specific gene expression and regulation	Human	🟢	🟢	🟢
AOE	Aggregates publicly available gene expression data	All	🟢	🟢	🟡
Expression Atlas	Abundance and localisation of RNA	All	🟢	🟡	🟢
GEO datasets	Functional Genomics Data (from NGS, Arrays etc)	All	🟡	🟢	🟡
GEO profiles	Expression profiles for a specific condition	All	🟡	🟢	🟡
ArrayExpress	Functional Genomics Data (NGS, Arrays etc)	All	🟡	🟢	🟡
Single Cell Gene Expression
The Human Cell Atlas	Single cell studies	Human	🟢	🟢	🟡
Single Cell Expression Atlas	Single cell studies	All	🟢	🟡	🟢
Single Cell Portal	Single cell studies	All	🟡	🟢	🟡
Tabula Muris	Single-cell transcriptome data	Mouse	🟡	🔴	🟢
Human cell landscape	Cell types and localisations	Human	🔴	🔴	🟢
Gene Regulation
ENCODE	Annotations for human functional DNA elements	Human + select model organisms	🟢	🟢	🟢
GTEx	Tissue-specific gene expression and regulation	Human	🟢	🟢	🟢
Transcript Isoforms
GTEx	Tissue-specific gene expression and regulation	Human	🟢	🟢	🟢
Noncoding RNA
RNAcentral	All RNA information	All	🟢	🟢	🟡

Networks, pathways & Reactions

Name	Data stored	Organisms	Ease of Access	Amount of data	Data curation /quality
Networks
NetworkAnalyst	Networks relevant to a set of input genes (PPI, TF-gene networks etc)	Many Model Organisms	🟡	🟢	🟢
PINA	Visualisation of protein-protein interactions & expression in cancer subtypes	Many Model Organisms	🟡	🟢	🟡
Connectivity Map (CMap)	Transcriptional responses to chemical, genetic, and disease perturbation	Human	🔴	🟢	🟡
JASPAR	Transcription factor binding sites (DNA-binding preferences)	Many Model Organisms	🟢	🟢	🟢
TRANSFAC	Transcription factor binding sites & regulated genes. Has paywall.	Select Eukaryotes	🔴	🟢	🟢
KEGG	15 databases related to the function of biological systems	All	🟢	🟡	🟢
Pathways
Reactome	Biological pathways	All	🟢	🟢	🟢
KEGG Pathway	Biological systems	All	🔴	🟢	🟢
Interactions
STRING	Known and predicted protein-protein interactions	All	🟢	🟢	🟡
MINT	Molecular interactions - primarily protein-protein interactions (feeds to IMEx)	Many Model Organisms	🟢	🔴	🟢
IMEx	Molecular interactions (primarily PPI) curated to internationally agreed standard	Many Model Organisms	🔴	🟡	🟢
IntAct	Aggregates molecular interactions from multiple databases (feeds to IMEx)	Many Model Organisms	🔴	🟢	🟡

Reactions & Metabolites

Name	Data stored	Organisms	Ease of Access	Amount of data	Data curation /quality
Reactions
Rhea	Reactions of biological interest	All	🟡	🟢	🟢
KEGG Reaction	Details of all reactions found in KEGG Pathway	All	🟡	🟢	🟢
Metabolites
ChEMBL	Bioactive molecules	All	🟢	🟢	🟢
MetaboLights	Studies of Metabolites	All	🟡	🟢	🟡

Variation

Name	Data stored	Organisms	Ease of Access	Amount of data	Data curation /quality
Sequence variants (SNVs/SNPs, small indels etc)
EVA	All variant data	All	🟡	🟢	🔴
NCBI dbSNP	All sequence variant data	Human	🔴	🟢	🔴
ClinVar	Variant-phenotype relationship (health)	Human	🔴	🟢	🟡
OMIM	Gene-phenotype relationship	Human	🔴	🟡	🟢
COSMIC	Somatic mutations in human cancer	Human	🟢	🟢	🟢
Structural Variants
EVA	All variant data	All	🟡	🟢	🔴
NCBI dbVar	All structural variant data	Human	🔴	🟢	🔴
DGV	Structural variation in healthy control samples (archived)	Human	🔴	🟡	🟡

Proteomics

Name	Data stored	Organisms	Ease of Access	Amount of data	Data curation /quality
Protein Sequences
UniProt	Protein sequences and annotations	All	🟢	🟢	🟢
Enzyme portal	Concise summary of enzymes	All	🟢	🟡	🟢
NCBI Protein	Protein sequences and annotations	All	🔴	🟢	🔴
Protein Domains & Families
InterPro	Protein domains & families	All	🟢	🟢	🟡
Pfam	Protein families	All	🔴	🟢	🟡
Protein Expression
The Human Protein Atlas	Antibody-based imaging, mass spectrometry, transcriptomics data	Human	🟢	🟢	🟢
PRIDE	Mass spectrometry data	All	🟡	🟢	🟢
Tertiary Structures
PDB	Protein structures & associated data	All	🟢	🟢	🟢
PDBe	Protein structures & associated data	All	🟡	🟢	🟢
PDBJ	Protein structures & associated data	All	🔴	🟢	🟢
EM, XRay, & NMR
EMDB	3D EM density maps	All	🟡	🟢	🟡
EMDataResource	3D EM density maps, models & metadata	All	🔴	🟢	🟡
EMPIAR	Raw electron microscopy images	All	🟡	🟡	🟡
BMRB	NMR data	All	🔴	🟢	🟡

Microbiomics / Metagenomics

Name	Data stored	Organisms	Ease of Access	Amount of data	Data curation /quality
Metagenomics
SILVA	ribosomal RNA sequences	All	🟡	🟢	🟢
Ribosomal database project (RDP)	ribosomal RNA sequences	Bacteria, Archaea, Fungi	🟡	🟢	🟡
Microbiomics
MGnify	Microbiome experiments & data	All	🟢	🟢	🔴
BacDive	Bacterial information (Geographical, biochemical)	Bacteria	🟢	🟢	🟢

Organism Specific

Name	Data stored	Organisms	Ease of Access	Amount of data	Data curation /quality
Viruses
VIPR	Pathogenic viral genomes incl. functional annotations	Viruses	-	-	-
GISAID	Influenza & COVID-19 coronavirus sequences & analysis	Viruses	-	-	-
Enterobacteria
Enterobase	Databases for multiple enteric bacteria	Enteric Bacteria	-	-	-
Eukaryotic Pathogens (incl Malaria)
VEuPathDB	Databases for multiple eukaryotic pathogens incl plasmodium, giardia etc	Many Eukaryotic Pathogens	-	-	-
Fruit flies
FlyBase	All	Fruit flies	🔴	🟢	🔴
Mouse
MGI	All	Mus Musculus (house mouse)	-	-	-
Rat
RGD	All	Rattus norvegicus (common rat)	-	-	-
Zebrafish
ZFIN	All	Danio rerio (zebrafish)	-	-	-
Worms
WormBase	All	C. elegans (roundworm)	-	-	-
Yeast
SGD	All	S. cerevisiae (Brewer's Yeast)	-	-	-

Imaging

Name	Data stored	Organisms	Ease of Access	Amount of data	Data curation /quality
BioImage archive	All biological image data	All	🟢	🟢	🟢
Image Data Resource (IDR)	Image datasets from published studies	All	🟢	🟡	🟢
Cell Image Library	Images, videos, and animations of cells	All	🟢	🟢	🟡

Misc

Name	Data stored	Organisms	Ease of Access	Amount of data	Data curation /quality
Neuroscience
Allen Brain Map	Data and analysis related to the brain	Human, Mouse	🟢	🟢	🟢
Immunology
ImmGen	Microarray gene expression & regulation	Mouse	🟢	🟢	🟢
Interferome	-	-	-	-	-
Epigenomics
MethBase	Reference methylomes (bisulfite-seq)	Selected model organisms	🔴	🟡	🟡
Biodiversity
GBIF	Biodiversity data	All	🟡	🟢	🟢
Disease Biomarkers
BIONDA	Biomarker candidates published in PubMed articles	Human	🔴	🟢	🟡

Summaries

This section provides a quick summary of each resource mentioned above. It is a work in progress, so some entries are still empty.

Jump to

Genomics
Transcriptomics
Networks, pathways & Reactions
Variation
- Sequence Variation (SNVs/SNPs, Indels etc)
- Structural Variation (SVs)
Proteomics
Metagenomics / Microbiomics
Metabolomics
Imaging
Domain Specific

Genomics

Contents

NCBI, ENA, & DDBJ
Organisation - BioProjects & BioSamples
Genome Assemblies
Taxonomy
Functional Elements (Annotations)

NCBI, ENA, & DDBJ

Data Sharing - INSDC

NCBI, EMBL-EBI and DDBJ share data on a daily basis as members of the International Nucleotide Sequence Database Collaboration (INSDC).

All nucleotide data submitted to any of the following organisations is automatically shared between them. The choice of archive therefore mainly depends on familiarity: use whichever one you personally find easiest.

ENA (European Nucleotide Archive)

The European Nucleotide Archive (ENA) contains all publicly available EMBL-EBI nucleotide sequences. This includes coding sequences (genes), non-coding DNA elements, genome assemblies, DNA/RNA sequence read sets and much more. Both the data itself and its metadata (information about the data: what it is, how it was derived, what techniques were used etc) are stored.

When searching ENA, all types of genomic data will be returned. You can then choose the specific kind of nucleotide sequence you want using filters (e.g. only genome assemblies, only coding sequences etc). ENA advanced search allows you to build a more specific search for your needs.

ENA has the cleanest UI amongst ENA, NCBI Nucleotide, and DDBJ.

NCBI Nucleotide

The National Center for Biotechnology Information (NCBI) Nucleotide is a search tool which pulls results from GenBank, RefSeq, the TPA and other repositories. Searching NCBI Nucleotide is akin to searching all of NCBI's sequence data, so it is comparable to ENA. As with ENA, all kinds of genomic data are available, rather than one type only.

In general, NCBI has an archive specific to your needs (e.g. NCBI Assembly for assemblies, NCBI Gene for gene sequences etc), but searching NCBI Nucleotide can indicate the total data of all types matching your search. NCBI advanced search is a powerful tool for searching, provided you know the syntax.
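As a rough illustration of the advanced search syntax, the sketch below builds a query from field tags and wraps it in a URL for NCBI's E-utilities `esearch` endpoint. The helper names (`build_entrez_term`, `build_esearch_url`) and the example organism are illustrative assumptions rather than part of any NCBI library, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_term(organism, title_word=None, min_len=None, max_len=None):
    """Combine field-tagged clauses with AND (hypothetical helper)."""
    parts = [f'"{organism}"[Organism]']
    if title_word:
        parts.append(f"{title_word}[Title]")
    if min_len is not None and max_len is not None:
        # Range syntax: lower:upper followed by the field tag.
        parts.append(f"{min_len}:{max_len}[Sequence Length]")
    return " AND ".join(parts)


def build_esearch_url(term, db="nucleotide", retmax=20):
    """Return an esearch URL for the given term; nothing is fetched here."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


term = build_entrez_term("Bacillus subtilis", title_word="complete")
print(term)
print(build_esearch_url(term))
```

The same term string can be pasted directly into the NCBI Nucleotide search box, which is a convenient way to check a query before scripting it.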

DDBJ

The DNA Data Bank of Japan (DDBJ) is also a member of the INSDC and so contains virtually the same nucleotide data as the archives above. DDBJ's UI and web pages are harder to use than NCBI Nucleotide or ENA, and feel a little dated. Given that DDBJ collects and shares data with the other INSDC members, similar results will appear using ENA or NCBI Nucleotide searches. The search tool for DDBJ is called ARSA.

Organisation - BioProjects & BioSamples

Understanding the hierarchy between archives is one of the trickiest aspects of navigating public data. Anyone who has worked with databases will know that the relationships between data are often hard to express in a standard way. The three main organisations (NCBI, EMBL-EBI and DDBJ) arrange information into BioProjects, BioSamples, and Data, which is a good solution given the challenge.

BioProjects are containers which store links. They are like folders which hold links to all the data and metadata associated with some project. The links can be directly to data, or can be to descriptions of the data (metadata).

Side note: EMBL-EBI call these ‘BioStudies’ instead of BioProjects for some unknown reason. We will use the term BioProject from here.

BioSamples are actually just descriptions of biological material. They do not relate to the data which was generated, but they can link to data which was derived from the particular biological sample / material. For example, if you isolated a colony of bacteria for whole genome sequencing (WGS), a BioSample entry would be created to describe the bacterial isolate. The BioSample would then have a link to the WGS data, specifying “the WGS dataset was generated from this biological material!”.
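The relationship described above can be sketched as a small data structure. This is only a toy model of the idea (projects hold links, samples describe material and link to data); the class names, fields, and placeholder accessions are assumptions made for illustration, not the archives' actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class BioSample:
    """Description of biological material, with links to derived data."""
    accession: str
    description: str
    data_accessions: list = field(default_factory=list)


@dataclass
class BioProject:
    """A folder-like container of links to samples and data."""
    accession: str
    title: str
    sample_accessions: list = field(default_factory=list)
    data_accessions: list = field(default_factory=list)


def data_for_project(project, samples):
    """Collect data linked directly to a project or via its samples."""
    found = list(project.data_accessions)
    for sample_acc in project.sample_accessions:
        sample = samples.get(sample_acc)
        if sample is None:
            continue  # link to a sample we have no record for
        for data_acc in sample.data_accessions:
            if data_acc not in found:
                found.append(data_acc)
    return found


# Placeholder accessions, not real records.
isolate = BioSample("SAMN00000000", "bacterial isolate", ["SRR0000000"])
project = BioProject("PRJNA000000", "toy WGS project",
                     sample_accessions=[isolate.accession])
print(data_for_project(project, {isolate.accession: isolate}))
```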

Genome Assemblies

NCBI Assembly

NCBI Assembly specifically displays genome assemblies and associated data. It offers the best filtering options when searching, as searches can be narrowed by attributes such as assembly level (complete, scaffold etc), organism group, ploidy, contig N50, and annotation level.
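For readers unfamiliar with the contig N50 filter, the sketch below shows the usual definition: the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total assembly length. The function name and example lengths are illustrative only.

```python
def contig_n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


print(contig_n50([80, 70, 50, 40, 30, 20]))
```

A higher N50 generally indicates a more contiguous assembly, which is why it is a useful filter when many assemblies exist for one organism.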

In terms of metadata, each assembly has an organism name, the submitter name and submission date, accession numbers, and the actual genome sequence data. Other useful information, including the assembly level - ‘complete genome’, ‘chromosome’, ‘Scaffold’ or ‘Contig’ - is available.

The following is usually downloadable:

DNA/RNA genome sequence
Genomic features (annotations)
Coding sequences (gene products)
RNA data
RepeatMasker output
& others

Most assemblies are annotated, but the quality of the annotation is variable. Genomic features are usually inferred using software first, then may be validated experimentally at a later date. The quality of software annotation often depends on how similar the particular organism is to other, well studied organisms.

ENA (European Nucleotide Archive)

The European Nucleotide Archive (ENA) contains all publicly available EMBL-EBI nucleotide sequences, including genome assemblies. When searching, select 'Assembly' from the filters on the left side of the page to restrict results to genome assemblies.

Unfortunately, assembly searches in ENA cannot be filtered as easily as in NCBI Assembly. If searching for a bacterial organism, e.g. Bacillus subtilis, hundreds of assemblies for individual strains are returned. The only way to run a more specific search is ENA advanced search, which is a fantastic tool in any case.

Once you have selected an assembly, the sequence and annotations can be downloaded.

DDBJ

The DNA Data Bank of Japan (DDBJ) is also a member of the INSDC and so contains virtually the same nucleotide data as the archives above. DDBJ's UI and web pages are harder to use than NCBI Nucleotide or ENA, and feel a little dated. Assembly searches may be easier using NCBI Assembly or ENA.

Taxonomy

https://asia.ensembl.org/info/about/speciestree.html

NCBI Taxonomy

The INSDC (mentioned above) maintains a database of taxonomic classifications for each known organism. This taxonomic information is shared across NCBI, ENA, and DDBJ, but only NCBI has built a specific tool to browse and explore taxonomic clades in a web browser.

The NCBI Taxonomy resource allows users to search for taxonomic groups, then provides information on the subgroups within. For a given taxon, you can view and link to the records in NCBI databases, including genome assemblies, protein sequences, read sets, genes & other functional element annotations etc.

The entire INSDC taxonomy can be downloaded here: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
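Once downloaded, the taxonomy dump can be explored offline. The sketch below assumes the `nodes.dmp` file from the taxdump archive, whose fields are separated by `|` with surrounding tabs and whose first three columns are taxon ID, parent taxon ID, and rank. The three input lines are toy values for illustration, not real taxonomy records.

```python
def parse_nodes(lines):
    """Parse nodes.dmp-style lines into parent and rank lookups."""
    parents, ranks = {}, {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Fields look like "id\t|\tparent\t|\trank\t|..." in the dump.
        fields = [f.strip() for f in line.split("|")]
        tax_id, parent_id, rank = int(fields[0]), int(fields[1]), fields[2]
        parents[tax_id] = parent_id
        ranks[tax_id] = rank
    return parents, ranks


def lineage(tax_id, parents):
    """Walk from a taxon up to the root (the node that is its own parent)."""
    path = [tax_id]
    while parents[tax_id] != tax_id:
        tax_id = parents[tax_id]
        path.append(tax_id)
    return path


toy_dump = [
    "1\t|\t1\t|\tno rank\t|",
    "10\t|\t1\t|\tgenus\t|",
    "100\t|\t10\t|\tspecies\t|",
]
parents, ranks = parse_nodes(toy_dump)
print(lineage(100, parents), ranks[100])
```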

Functional Elements

ENCODE

The Encyclopedia of DNA Elements (ENCODE) is a high-quality and extensive catalogue of all known functional elements in the human genome. In addition to genes, ENCODE includes any region with functional impact, such as noncoding RNA and promoter / enhancer regulatory regions. ENCODE data is high quality, and multiple sources of evidence are used when annotating new functional elements. A variety of methods including bioinformatics analysis of current data, sequencing, DNase hypersensitivity assays, DNA methylation and binding assays etc are routinely used to identify and confirm new elements.

GENCODE

The Encyclopedia of genes and gene variants (GENCODE) catalogues all the gene features in the human and mouse genomes. Gene classifications are detailed and high-quality, as they are supported by biological evidence. GENCODE can be seen as a subset of ENCODE, which attempts to catalogue all functional elements in the human genome.

GeneCards

GeneCards provides a summary for each human gene. It integrates information from more than 150 web sources, and presents it to the user in one location. A huge amount of data for each gene is presented, including summaries, regulatory elements of the gene, proteomics information, detailed annotations, and noteworthy genetic variants to name a few (if available). If you want to improve your knowledge of a particular gene, GeneCards is a great option.

NCBI Gene

While the archives above only catalogue human & mouse genes, NCBI Gene spans all organisms. Searches usually need to be narrowed using filters or advanced search to be useful, but the amount of information given per gene is high. Sometimes the data is good quality and verified; other times it is only software predictions. Links to all NCBI data, as well as academic publications, are provided when browsing a particular gene.

Sequence Reads

Contents

Next Gen Sequencing
Capillary Electrophoresis

Next Gen Sequencing

Data Sharing - INSDC

NCBI, EMBL-EBI and DDBJ share data on a daily basis as members of the International Nucleotide Sequence Database Collaboration (INSDC).

All read sets submitted to the following organisations are automatically shared between them - the choice of archive therefore mainly depends on familiarity - which one you personally find is easiest to use.

ENA

The European Nucleotide Archive (ENA) will display read sets in their default search. In the filter menu under 'Reads', both 'Runs' and 'Experiments' contain read sets with download links to the raw FASTQ files. For a specific sequencing experiment or run, there is a 'Show Column Selection' bar above the read files section - clicking this allows a huge amount of metadata to be displayed for each read set, which can be handy if you have certain demands. The ENA advanced search facilitates searching only for read sets, and allows us to restrict the results based on numerous conditions such as taxonomic group, instrument platform, geographical location, and read length to name a few.
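The same kind of read set search can be scripted. The sketch below builds a URL for the ENA Portal API search endpoint and parses a tab-separated response into FASTQ links per run. The query clauses and field names (`tax_tree`, `instrument_platform`, `run_accession`, `fastq_ftp`) are written from memory of the ENA documentation and should be checked against it; the sample response and its paths are made up, and no request is sent.

```python
import csv
import io
from urllib.parse import urlencode

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_read_search_url(tax_id, platform, limit=10):
    """Build a read_run search URL restricted by taxon and platform."""
    query = f'tax_tree({tax_id}) AND instrument_platform="{platform}"'
    params = {
        "result": "read_run",
        "query": query,
        "fields": "run_accession,instrument_platform,fastq_ftp",
        "format": "tsv",
        "limit": limit,
    }
    return f"{ENA_SEARCH}?{urlencode(params)}"


def fastq_links(tsv_text):
    """Map each run accession to its list of FASTQ file locations."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    links = {}
    for row in reader:
        # Paired-end runs list several files separated by semicolons.
        links[row["run_accession"]] = [u for u in row["fastq_ftp"].split(";") if u]
    return links


# Made-up response text standing in for what the API might return.
sample = (
    "run_accession\tinstrument_platform\tfastq_ftp\n"
    "ERR0000001\tILLUMINA\tftp.example.org/a_1.fastq.gz;ftp.example.org/a_2.fastq.gz\n"
)
print(build_read_search_url(1423, "ILLUMINA"))
print(fastq_links(sample))
```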

SRA

The NCBI Sequence Read Archive (SRA) is another portal for read set access. Unlike ENA and DDBJ, it is limited to read sets only. After doing a basic search, there are a number of useful filters on the left side of the screen (taxon filters are on the right) to help narrow your results, without the need for an SRA advanced search. That said, advanced searches are always better if you know how to use them. Accessing the actual raw read files is trickier with SRA than with ENA, as a few links need to be followed. After selecting a read experiment, click on a sequence run accession (starts with SRR) in the 'Runs' section at the bottom, then on the following page select the 'Data Access' tab to access the raw data.
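Because navigating between experiments and runs relies on recognising accession prefixes, a small classifier can help when handling lists of accessions. The patterns below cover commonly seen prefixes (S/E/D for NCBI, EMBL-EBI and DDBJ submissions respectively) and are an assumption-based sketch, not an exhaustive or official specification.

```python
import re

# Common accession shapes; the first letter reflects the receiving archive.
ACCESSION_PATTERNS = [
    ("run", re.compile(r"^[SED]RR\d+$")),
    ("experiment", re.compile(r"^[SED]RX\d+$")),
    ("sample", re.compile(r"^[SED]RS\d+$")),
    ("study", re.compile(r"^[SED]RP\d+$")),
    ("bioproject", re.compile(r"^PRJ(NA|EB|DB)\d+$")),
    ("biosample", re.compile(r"^SAM(N|EA|D)\d+$")),
]


def classify_accession(accession):
    """Return the record type suggested by an accession's prefix."""
    for kind, pattern in ACCESSION_PATTERNS:
        if pattern.match(accession):
            return kind
    return "unknown"


# Placeholder accessions for illustration.
for acc in ["SRR0000001", "ERX0000001", "PRJNA000000", "not-an-accession"]:
    print(acc, classify_accession(acc))
```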

DRA

The DDBJ Sequence Read Archive (DRA) contains virtually the same data as NCBI SRA and ENA. The search is similar to a simple advanced search, but has far fewer options than an advanced search using NCBI SRA or ENA. The format of the results displays all the important information, but again is lacking compared to the other portals.

Capillary Electrophoresis

Overview

Capillary electrophoresis data is also included in the NGS archives. The trace archives (NCBI Trace Archive and DDBJ Trace Archive) are permanent stores of DNA sequence chromatograms (traces), alongside the actual base calls and quality scores. The FASTQ data now feeds into modern archives (SRA, ENA, DRA), and can be specifically searched for using the advanced search tools (instrument platform = 'capillary').
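To make the quality scores mentioned above concrete, the sketch below decodes a FASTQ quality string into Phred scores and converts a score into an error probability. It assumes the common Phred+33 ASCII encoding; older files may use a different offset, so treat the default as an assumption to verify for your data.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


scores = phred_scores("II5!")
print(scores)
print([error_probability(q) for q in scores])
```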

Transcriptomics

ENCODE & GeneCards for transcriptomics - regulation of gene expression, promoters etc.

Bulk Tissue Gene Expression


Single Cell Gene Expression

Gene Regulation

Transcript Isoforms

Noncoding RNA

Networks, pathways & reactions

Networks

NetworkAnalyst

Text

PINA

The Protein Interaction Network Analysis (PINA)

Pathways

Reactome

Reactome allows the user to interactively explore cellular pathways for multiple model organisms (incl human). Reactome consists of metabolic and signalling molecules placed into the biological pathways and processes they are associated with. For example, the human apoptosis pathway can be explored to find which molecules take part in this pathway, and which reactions occur. User data can be uploaded to perform pathway enrichment analysis. The data is curated by domain experts and is backed by primary literature.
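Pathway enrichment (over-representation) analysis is commonly framed as a hypergeometric test: given a gene list, how surprising is its overlap with a pathway? The sketch below shows that calculation in plain Python. It is a generic illustration under that assumption; Reactome's own analysis service may differ in its exact test, background set, and multiple-testing correction.

```python
from math import comb


def enrichment_p_value(universe, pathway_size, sample_size, overlap):
    """P(X >= overlap) when drawing sample_size genes from the universe.

    universe     -- total number of genes considered
    pathway_size -- genes in the universe annotated to the pathway
    sample_size  -- genes in the submitted list
    overlap      -- submitted genes that fall in the pathway
    """
    upper = min(pathway_size, sample_size)
    total = comb(universe, sample_size)
    favourable = sum(
        comb(pathway_size, k) * comb(universe - pathway_size, sample_size - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Toy numbers: 3 of 3 submitted genes hit a 3-gene pathway among 10 genes.
print(enrichment_p_value(10, 3, 3, 3))
```

In practice one p-value is computed per pathway, so a correction for multiple testing is needed before interpreting the results.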

Interactions

STRING

STRING contains known and predicted protein-protein interactions. Both direct (physical) and indirect (functional) interactions are shown. More than 24 million proteins are documented spanning over 5,000 organisms. Searching for a protein and organism returns a small PPI network which users can follow and interact with to understand the nature of these interactions. STRING uses various methods to support the interactions it presents, including co-expression experiments, genomic context (nearby regulatory elements), text-mining, and data pulled from other databases.

IMEx

The IMEx Consortium catalogues experimentally demonstrated molecular interactions. It focuses on protein-protein interactions within model organisms such as Homo sapiens and Mus musculus. Biochemical evidence (Y2H, coimmunoprecipitation assays etc) is provided for each interaction in the catalogue. To properly search for interactions, the Molecular Interaction Query Language should be used. Queries are built from key-value tags (e.g. species:human) which allow searches to be narrowed.
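A query made of key-value tags can be assembled programmatically. The helper below is a hypothetical convenience function, not part of any IMEx tool; the `species` tag comes from the text above, while `detmethod` is a field name recalled from the MIQL documentation and should be confirmed there before use.

```python
def miql_query(**tags):
    """Join key:value tags with AND, quoting values that contain spaces."""
    parts = []
    for key, value in tags.items():
        value = str(value)
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{key}:{value}")
    return " AND ".join(parts)


print(miql_query(species="human"))
print(miql_query(species="human", detmethod="two hybrid"))
```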

Variation

HGMD (Human Gene Mutation Database)

100k genomes project https://www.genomicsengland.co.uk/about-genomics-england/the-100000-genomes-project/

European Genome Phenome Archive (EGA) https://ega-archive.org/

GnomAD haplotypes (phased genotypes ): International HapMap Project

confused: are individuals with mental health conditions included?
what about late-onset conditions? these individuals may have disease waiting to occur

The Cancer Genome Atlas TCGA COSMIC

Proteomics

Pathways & Reactions

Metagenomics / Microbiomics

Metabolomics

Imaging

Domain Specific

Graveyard

Data has been divided into sections as best as possible.
In each section, the following is covered:

Who uses the data? (what type of analysis / field)
How to access (download or use)
Format of the data

🔴 Low 🟡 Med 🟢 High

GraceAHall/accessing-public-data