Research has become far more open over the last decade, and a huge amount of data is now freely available online. The push towards repeatable analysis and open access to data has had an enormous impact on medical science, and will continue to do so. There is far too much data to cite properly in a single document, but the main areas are covered here.
This document provides an overview of some of the commonly used open-access archives.
Jump to
- Genomics & Functional Elements
- Transcriptomics
- Networks, pathways & reactions
- Variation
- Proteomics
- Microbiomics / Metagenomics
- Imaging
- Organism Specific
- Misc
Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation / quality |
---|---|---|---|---|---|
Nucleotide Databases |
|||||
ENA | All nucleotide sequences | All | 🟢 | 🟢 | 🔴 |
NCBI Nucleotide | All nucleotide sequences | All | 🟡 | 🟢 | 🔴 |
DDBJ | All nucleotide sequences | All | 🔴 | 🟢 | 🔴 |
Functional Elements |
|||||
ENCODE | Annotations for human functional DNA elements | Human + select model organisms | 🟢 | 🟢 | 🟢 |
GENCODE | Annotations for human (and mouse) genes | Human, Mouse | 🟢 | 🟢 | 🟢 |
GeneCards | Aggregator for all gene-centric data. Each gene listed once. | Human | 🟢 | 🟢 | 🟡 |
NCBI Gene | Genes and links to data/metadata | All | 🟡 | 🟢 | 🔴 |
Sequence Reads |
|||||
ENA | All nucleotide data | All | 🟢 | 🟢 | 🔴 |
SRA | High-throughput sequence data | All | 🟡 | 🟢 | 🔴 |
DRA | High-throughput sequence data | All | 🔴 | 🟢 | 🔴 |
NCBI Trace Archive | Capillary sequencing only | All | 🔴 | 🟢 | 🔴 |
DDBJ Trace Archive (DTA) | Capillary sequencing only | All | 🔴 | 🟢 | 🔴 |
Genome Assemblies |
|||||
NCBI Assembly | Genome Assemblies | All | 🟢 | 🟢 | 🔴 |
ENA | All nucleotide sequences | All | 🟡 | 🟢 | 🔴 |
DDBJ | All nucleotide sequences | All | 🔴 | 🟢 | 🔴 |
Taxonomy |
|||||
NCBI Taxonomy | The standard taxonomy system | All | 🟡 | 🟢 | 🔴 |
Ontologies |
|||||
- | - | - | - | - | - |
Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation / quality |
---|---|---|---|---|---|
Bulk Tissue Gene Expression |
|||||
GTEx | Tissue-specific gene expression and regulation | Human | 🟢 | 🟢 | 🟢 |
AOE | Aggregates publicly available gene expression data | All | 🟢 | 🟢 | 🟡 |
Expression Atlas | Abundance and localisation of RNA | All | 🟢 | 🟡 | 🟢 |
GEO datasets | Functional Genomics Data (from NGS, Arrays etc) | All | 🟡 | 🟢 | 🟡 |
GEO profiles | Expression profiles for a specific condition | All | 🟡 | 🟢 | 🟡 |
ArrayExpress | Functional Genomics Data (NGS, Arrays etc) | All | 🟡 | 🟢 | 🟡 |
Single Cell Gene Expression |
|||||
The Human Cell Atlas | Single cell studies | Human | 🟢 | 🟢 | 🟡 |
Single Cell Expression Atlas | Single cell studies | All | 🟢 | 🟡 | 🟢 |
Single Cell Portal | Single cell studies | All | 🟡 | 🟢 | 🟡 |
Tabula Muris | Single-cell transcriptome data | Mouse | 🟡 | 🔴 | 🟢 |
Human cell landscape | Cell types and localisations | Human | 🔴 | 🔴 | 🟢 |
Gene Regulation |
|||||
ENCODE | Annotations for human functional DNA elements | Human + select model organisms | 🟢 | 🟢 | 🟢 |
GTEx | Tissue-specific gene expression and regulation | Human | 🟢 | 🟢 | 🟢 |
Transcript Isoforms |
|||||
GTEx | Tissue-specific gene expression and regulation | Human | 🟢 | 🟢 | 🟢 |
Noncoding RNA |
|||||
RNAcentral | All RNA information | All | 🟢 | 🟢 | 🟡 |
Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation / quality |
---|---|---|---|---|---|
Networks |
|||||
NetworkAnalyst | Networks relevant to a set of input genes (PPI, TF-gene networks etc) | Many Model Organisms | 🟡 | 🟢 | 🟢 |
PINA | Visualisation of protein-protein interactions & expression in cancer subtypes | Many Model Organisms | 🟡 | 🟢 | 🟡 |
Connectivity Map (CMap) | Transcriptional responses to chemical, genetic, and disease perturbation | Human | 🔴 | 🟢 | 🟡 |
JASPAR | Transcription factor binding sites (DNA-binding preferences) | Many Model Organisms | 🟢 | 🟢 | 🟢 |
TRANSFAC | Transcription factor binding sites & regulated genes. Has paywall. | Select Eukaryotes | 🔴 | 🟢 | 🟢 |
KEGG | 15 databases related to the function of biological systems | All | 🟢 | 🟡 | 🟢 |
Pathways |
|||||
Reactome | Biological pathways | All | 🟢 | 🟢 | 🟢 |
KEGG Pathway | Biological systems | All | 🔴 | 🟢 | 🟢 |
Interactions |
|||||
STRING | Known and predicted protein-protein interactions | All | 🟢 | 🟢 | 🟡 |
MINT | Molecular interactions - primarily protein-protein interactions (feeds to IMEx) | Many Model Organisms | 🟢 | 🔴 | 🟢 |
IMEx | Molecular interactions (primarily PPI) curated to internationally agreed standard | Many Model Organisms | 🔴 | 🟡 | 🟢 |
IntAct | Aggregates molecular interactions from multiple databases (feeds to IMEx) | Many Model Organisms | 🔴 | 🟢 | 🟡 |
Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation / quality |
---|---|---|---|---|---|
Reactions |
|||||
Rhea | Reactions of biological interest | All | 🟡 | 🟢 | 🟢 |
KEGG Reaction | Details of all reactions found in KEGG Pathway | All | 🟡 | 🟢 | 🟢 |
Metabolites |
|||||
ChEMBL | Bioactive molecules | All | 🟢 | 🟢 | 🟢 |
MetaboLights | Studies of Metabolites | All | 🟡 | 🟢 | 🟡 |
Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation / quality |
---|---|---|---|---|---|
Sequence variants (SNVs/SNPs, small indels etc) |
|||||
EVA | All variant data | All | 🟡 | 🟢 | 🔴 |
NCBI dbSNP | All sequence variant data | Human | 🔴 | 🟢 | 🔴 |
ClinVar | Variant-phenotype relationship (health) | Human | 🔴 | 🟢 | 🟡 |
OMIM | Gene-phenotype relationship | Human | 🔴 | 🟡 | 🟢 |
COSMIC | Somatic mutations in human cancer | Human | 🟢 | 🟢 | 🟢 |
Structural Variants |
|||||
EVA | All variant data | All | 🟡 | 🟢 | 🔴 |
NCBI dbVar | All structural variant data | Human | 🔴 | 🟢 | 🔴 |
DGV | Structural variation in healthy control samples (archived) | Human | 🔴 | 🟡 | 🟡 |
Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation / quality |
---|---|---|---|---|---|
Protein Sequences |
|||||
UniProt | Protein sequences and annotations | All | 🟢 | 🟢 | 🟢 |
Enzyme portal | Concise summary of enzymes | All | 🟢 | 🟡 | 🟢 |
NCBI Protein | Protein sequences and annotations | All | 🔴 | 🟢 | 🔴 |
Protein Domains & Families |
|||||
InterPro | Protein domains & families | All | 🟢 | 🟢 | 🟡 |
Pfam | Protein families | All | 🔴 | 🟢 | 🟡 |
Protein Expression |
|||||
The Human Protein Atlas | Antibody-based imaging, mass spectrometry, transcriptomics data | Human | 🟢 | 🟢 | 🟢 |
PRIDE | Mass spectrometry data | All | 🟡 | 🟢 | 🟢 |
Tertiary Structures |
|||||
PDB | Protein structures & associated data | All | 🟢 | 🟢 | 🟢 |
PDBe | Protein structures & associated data | All | 🟡 | 🟢 | 🟢 |
PDBj | Protein structures & associated data | All | 🔴 | 🟢 | 🟢 |
EM, XRay, & NMR |
|||||
EMDB | 3D EM density maps | All | 🟡 | 🟢 | 🟡 |
EMDataResource | 3D EM density maps, models & metadata | All | 🔴 | 🟢 | 🟡 |
EMPIAR | Raw electron microscopy images | All | 🟡 | 🟡 | 🟡 |
BMRB | NMR data | All | 🔴 | 🟢 | 🟡 |
Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation / quality |
---|---|---|---|---|---|
Metagenomics |
|||||
SILVA | ribosomal RNA sequences | All | 🟡 | 🟢 | 🟢 |
Ribosomal database project (RDP) | ribosomal RNA sequences | Bacteria, Archaea, Fungi | 🟡 | 🟢 | 🟡 |
Microbiomics |
|||||
MGnify | Microbiome experiments & data | All | 🟢 | 🟢 | 🔴 |
BacDive | Bacterial information (Geographical, biochemical) | Bacteria | 🟢 | 🟢 | 🟢 |
Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation / quality |
---|---|---|---|---|---|
Viruses |
|||||
VIPR | Pathogenic viral genomes incl. functional annotations | Viruses | - | - | - |
GISAID | Influenza & COVID-19 coronavirus sequences & analysis | Viruses | - | - | - |
Enterobacteria |
|||||
Enterobase | Databases for multiple enteric bacteria | Enteric Bacteria | - | - | - |
Eukaryotic Pathogens (incl Malaria) |
|||||
VEuPathDB | Databases for multiple eukaryotic pathogens incl plasmodium, giardia etc | Many Eukaryotic Pathogens | - | - | - |
Fruit flies |
|||||
FlyBase | All | Fruit flies | 🔴 | 🟢 | 🔴 |
Mouse |
|||||
MGI | All | Mus musculus (house mouse) | - | - | - |
Rat |
|||||
RGD | All | Rattus norvegicus (common rat) | - | - | - |
Zebrafish |
|||||
ZFIN | All | Danio rerio (zebrafish) | - | - | - |
Worms |
|||||
WormBase | All | C. elegans (roundworm) | - | - | - |
Yeast |
|||||
SGD | All | S. cerevisiae (Brewer's Yeast) | - | - | - |
Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation / quality |
---|---|---|---|---|---|
BioImage archive | All biological image data | All | 🟢 | 🟢 | 🟢 |
Image Data Resource (IDR) | Image datasets from published studies | All | 🟢 | 🟡 | 🟢 |
Cell Image Library | Images, videos, and animations of cells | All | 🟢 | 🟢 | 🟡 |
Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation / quality |
---|---|---|---|---|---|
Neuroscience |
|||||
Allen Brain Map | Data and analysis related to the brain | Human, Mouse | 🟢 | 🟢 | 🟢 |
Immunology |
|||||
ImmGen | Microarray gene expression & regulation | Mouse | 🟢 | 🟢 | 🟢 |
Interferome | - | - | - | - | - |
Epigenomics |
|||||
MethBase | Reference methylomes (bisulfite-seq) | Selected model organisms | 🔴 | 🟡 | 🟡 |
Biodiversity |
|||||
GBIF | Biodiversity data | All | 🟡 | 🟢 | 🟢 |
Disease Biomarkers |
|||||
BIONDA | Biomarker candidates published in PubMed articles | Human | 🔴 | 🟢 | 🟡 |
This section provides a quick summary of each resource mentioned above. It is a work in progress, and summaries are being added over time.
Jump to
- Genomics
- Transcriptomics
- Networks, pathways & Reactions
- Variation
- Proteomics
- Metagenomics / Microbiomics
- Metabolomics
- Imaging
- Domain Specific
Contents
- NCBI, ENA, & DDBJ
- Organisation - BioProjects & BioSamples
- Genome Assemblies
- Taxonomy
- Functional Elements (Annotations)
NCBI, EMBL-EBI and DDBJ share data on a daily basis as members of the International Nucleotide Sequence Database Collaboration (INSDC).
All nucleotide data submitted to any of the following organisations is automatically shared between them. The choice of archive therefore depends mainly on familiarity: use whichever one you personally find easiest.
ENA (European Nucleotide Archive)
The European Nucleotide Archive (ENA) contains all publicly available EMBL-EBI nucleotide sequences. This includes coding sequences (genes), non-coding DNA elements, genome assemblies, DNA/RNA sequence read sets and much more. Both the data itself and its metadata (information about the data: what it is, how it was derived, what techniques were used etc) are stored.
When searching ENA, all types of genomic data will be returned. You can then choose the specific kind of nucleotide sequence you want using filters (e.g. only genome assemblies, only coding sequences etc). ENA advanced search allows you to build a more specific search for your needs.
ENA has the cleanest UI amongst ENA, NCBI Nucleotide, and DDBJ.
The National Center for Biotechnology Information (NCBI) Nucleotide is a search tool which pulls results from GenBank, RefSeq, the TPA and other repositories. Searching NCBI Nucleotide is akin to searching all of NCBI's sequence data, so it is comparable to ENA. As with ENA, all kinds of genomic data are available, rather than one type only.
In general, NCBI has an archive specific to your needs (e.g. NCBI Assembly for assemblies, NCBI Gene for gene sequences etc), but searching NCBI Nucleotide shows the total data of all types matching your search. NCBI advanced search is a powerful tool, provided you know the syntax.
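As a rough illustration of the advanced search syntax mentioned above, the sketch below assembles a field-tagged query and an E-utilities `esearch` URL for it. The helper names (`build_entrez_term`, `esearch_url`) are made up for this example, and the choice of field tags (`[Organism]`, `[Title]`, `[SLEN]`) is an assumption about what a typical search might need. No request is sent; the code only builds strings.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_term(organism, title_words=(), min_len=None, max_len=None):
    """Combine field-tagged clauses into one Entrez query string (hypothetical helper)."""
    clauses = [f'"{organism}"[Organism]']
    clauses += [f"{word}[Title]" for word in title_words]
    if min_len is not None and max_len is not None:
        # Range syntax: lower:upper followed by the sequence length tag.
        clauses.append(f"{min_len}:{max_len}[SLEN]")
    return " AND ".join(clauses)


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an esearch request URL."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


term = build_entrez_term(
    "Bacillus subtilis",
    title_words=["complete", "genome"],
    min_len=3_000_000,
    max_len=5_000_000,
)
url = esearch_url("nucleotide", term)
print(term)
print(url)
```

The same query string can be pasted directly into the NCBI Nucleotide search box, which is a convenient way to check a query before scripting it.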
The DNA Data Bank of Japan (DDBJ) is also a member of the INSDC and so contains virtually the same nucleotide data as the archives above. The DDBJ's UI and web pages are harder to use than NCBI Nucleotide or ENA, and feel a little dated. Given that the DDBJ collects and shares data for INSDC members, similar results will appear using ENA or NCBI Nucleotide searches. The search tool for DDBJ is called ARSA.
Understanding the hierarchy between archives is one of the trickiest aspects of navigating public data. Anyone who has worked with databases will know that the relationships between data are often hard to express in a standard way. The three main organisations (NCBI, EMBL-EBI and DDBJ) arrange information into BioProjects, BioSamples, and Data, which is a good solution given the challenge.
BioProjects are containers which store links. They are like folders which hold links to all the data and metadata associated with some project. The links can be directly to data, or can be to descriptions of the data (metadata).
Side note: EMBL-EBI call these ‘BioStudies’ instead of BioProjects for some unknown reason. We will use the term BioProject from here.
BioSamples are actually just descriptions of biological material. They do not relate to the data which was generated, but they can link to data which was derived from the particular biological sample / material. For example, if you isolated a colony of bacteria for whole genome sequencing (WGS), a BioSample entry would be created to describe the bacterial isolate. The BioSample would then have a link to the WGS data, specifying “the WGS dataset was generated from this biological material!”.
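The relationship described above can be sketched as a small data model. This is only an illustration of the idea (a project links to samples, and each sample links to the data derived from it), not the archives' actual schema. All class names, field names, and accessions below are invented placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """A dataset derived from a sample (e.g. one WGS read set)."""
    accession: str
    library_strategy: str


@dataclass
class BioSample:
    """A description of biological material, with links to derived data."""
    accession: str
    description: str
    runs: list[Run] = field(default_factory=list)


@dataclass
class BioProject:
    """A folder-like container holding links to samples and their data."""
    accession: str
    title: str
    samples: list[BioSample] = field(default_factory=list)

    def all_runs(self):
        """Follow project -> sample -> data links and collect every run."""
        return [run for sample in self.samples for run in sample.runs]


# Placeholder accessions, not real records.
isolate = BioSample("SAM_EXAMPLE_1", "Bacterial isolate from a single colony")
isolate.runs.append(Run("RUN_EXAMPLE_1", "WGS"))
project = BioProject("PRJ_EXAMPLE", "Example sequencing project", [isolate])

print([run.accession for run in project.all_runs()])
```

Note that the sample itself holds no sequence data: it only describes the material and points at the runs generated from it.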
NCBI Assembly specifically displays genome assemblies and associated data. It offers the best filtering options when searching, as searches can be narrowed by attributes such as assembly level (complete, scaffold etc), organism group, ploidy, contig N50, and annotation level.
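Contig N50 is one of the filter attributes listed above. For readers unfamiliar with it, the sketch below computes N50 under its usual definition: the length of the contig at which the running total of contigs, sorted longest first, reaches half of the total assembly length. The function name and the example lengths are illustrative only.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that takes the running sum to half the total is the N50 contig.
        if running >= half_total:
            return length


# Total length 54, half is 27: 10 + 9 + 8 = 27, so N50 is 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

A higher N50 generally indicates a more contiguous assembly, which is why it is useful as a search filter.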
In terms of metadata, each assembly has an organism name, the submitter name and submission date, accession numbers, and the actual genome sequence data. Other useful information, including the assembly level - ‘complete genome’, ‘chromosome’, ‘Scaffold’ or ‘Contig’ - is available.
The following is usually downloadable:
- DNA/RNA genome sequence
- Genomic features (annotations)
- Coding sequences (gene products)
- RNA data
- RepeatMasker output
- & others
Most assemblies are annotated, but the quality of the annotation is variable. Genomic features are usually inferred using software first, then may be validated experimentally at a later date. The quality of software annotation often depends on how similar the particular organism is to other, well studied organisms.
ENA (European Nucleotide Archive)
The European Nucleotide Archive (ENA) contains all publicly available EMBL-EBI nucleotide sequences, including genome assemblies. When searching, select 'Assembly' from the filters on the left side of the page to restrict results to genome assemblies.
Unfortunately, assembly searches in ENA cannot be filtered as easily as in NCBI Assembly. When searching for a bacterial organism, e.g. Bacillus subtilis, hundreds of assemblies for individual strains are returned. The only way to run a more specific search is to use ENA advanced search, which is a fantastic tool in any case.
Once you have selected an assembly, the sequence and annotations can be downloaded.
The DNA Data Bank of Japan (DDBJ) is also a member of the INSDC and so contains virtually the same nucleotide data as the archives above. The DDBJ's UI and web pages are harder to use than NCBI Nucleotide or ENA, and feel a little dated. Assembly searches may be easier using NCBI Assembly or ENA.
The INSDC (mentioned above) maintains a database of taxonomic classifications for each known organism. This taxonomic information is shared across NCBI, ENA, and DDBJ, but only NCBI has built a specific tool to browse and explore taxonomic clades in a web browser.
The NCBI Taxonomy resource allows users to search for taxonomic groups, then provides information on the subgroups within. For a given taxon, you can view and link to the records in NCBI databases, including genome assemblies, protein sequences, read sets, genes and other functional element annotations.
The entire INSDC taxonomy can be downloaded here: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
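The taxonomy dump at that location contains `nodes.dmp` (parent links) and `names.dmp` (names), whose fields are separated by a tab, a pipe, and a tab. The sketch below parses rows in that layout and walks from a taxon up to the root. The rows are a shortened toy excerpt with invented IDs and names, not real taxonomy entries, and the real `nodes.dmp` has more columns than the three shown here.

```python
def parse_dmp_line(line):
    """Split one taxdump row into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    return line.split("\t|\t")


def load_nodes(lines):
    """Map each tax ID to its parent tax ID (first two columns of nodes.dmp)."""
    parents = {}
    for line in lines:
        fields = parse_dmp_line(line)
        parents[int(fields[0])] = int(fields[1])
    return parents


def load_scientific_names(lines):
    """Map each tax ID to its scientific name, ignoring other name classes."""
    names = {}
    for line in lines:
        tax_id, name, _unique, name_class = parse_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name
    return names


def lineage(tax_id, parents):
    """Walk parent links until reaching the root, which is its own parent."""
    path = [tax_id]
    while parents[path[-1]] != path[-1]:
        path.append(parents[path[-1]])
    return path


# Toy rows with invented IDs and names.
NODES = [
    "10\t|\t10\t|\tno rank\t|\n",
    "20\t|\t10\t|\tgenus\t|\n",
    "30\t|\t20\t|\tspecies\t|\n",
]
NAMES = [
    "10\t|\troot\t|\t\t|\tscientific name\t|\n",
    "20\t|\tExamplegenus\t|\t\t|\tscientific name\t|\n",
    "30\t|\tExamplegenus demo\t|\t\t|\tscientific name\t|\n",
    "30\t|\tdemo bug\t|\t\t|\tcommon name\t|\n",
]

parents = load_nodes(NODES)
names = load_scientific_names(NAMES)
print([names[t] for t in lineage(30, parents)])
```

For the full dump, the same functions can be fed the open file handles for `nodes.dmp` and `names.dmp` instead of the toy lists.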
The Encyclopedia of DNA Elements (ENCODE) is a high-quality and extensive catalogue of all known functional elements in the human genome. In addition to genes, ENCODE includes any region with functional impact, such as noncoding RNA and promoter / enhancer regulatory regions. ENCODE data has a high level of quality, and uses multiple sources of evidence when annotating new functional elements. A variety of methods, including bioinformatics analysis of current data, sequencing, DNase hypersensitivity assays, DNA methylation and binding assays, are routinely used to identify and confirm new elements.
The Encyclopedia of genes and gene variants (GENCODE) catalogues all the gene features in the human and mouse genomes. Gene classifications are detailed and high-quality, as they are supported by biological evidence. GENCODE can be seen as a subset of ENCODE: ENCODE attempts to catalogue all functional elements in the human genome, while GENCODE focuses on gene features.
GeneCards provides a summary for each human gene. It integrates information from more than 150 web sources, and presents it to the user in one location. A huge amount of data for each gene is presented, including summaries, regulatory elements of the gene, proteomics information, detailed annotations, and noteworthy genetic variants to name a few (if available). If you want to improve your knowledge of a particular gene, GeneCards is a great option.
While the archives above only catalogue human & mouse genes, NCBI Gene spans all organisms. Searches usually need to be narrowed using filters or advanced search to be useful, but the amount of information given per gene is high. Sometimes the data is good quality and verified; other times it consists only of software predictions. Links to all NCBI data, as well as academic publications, are provided when browsing a particular gene.
NCBI, EMBL-EBI and DDBJ share data on a daily basis as members of the International Nucleotide Sequence Database Collaboration (INSDC).
All read sets submitted to any of the following organisations are automatically shared between them. The choice of archive therefore depends mainly on familiarity: use whichever one you personally find easiest.
The European Nucleotide Archive (ENA) will display read sets in their default search. In the filter menu under 'Reads', both 'Runs' and 'Experiments' contain read sets with download links to the raw FASTQ files. For a specific sequencing experiment or run, there is a 'Show Column Selection' bar above the read files section - clicking this allows a huge amount of metadata to be displayed for each read set, which can be handy if you have certain demands. The ENA advanced search facilitates searching only for read sets, and allows us to restrict the results based on numerous conditions such as taxonomic group, instrument platform, geographical location, and read length to name a few.
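The kind of restricted read-set search described above can also be expressed as a request to the ENA Portal API. The sketch below only builds the request URL and does not send it. The helper name `ena_read_run_url` is made up, the default field list is an assumption about what is useful, and the taxon ID 1423 is used on the assumption that it is the NCBI Taxonomy ID for Bacillus subtilis (worth confirming in NCBI Taxonomy before relying on it).

```python
from urllib.parse import urlencode

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def ena_read_run_url(
    tax_id,
    platform=None,
    fields=("run_accession", "sample_accession", "instrument_platform", "fastq_ftp"),
    limit=10,
):
    """Build (but do not send) an ENA Portal API search for read runs."""
    # tax_tree() matches the taxon and everything below it in the taxonomy.
    query = f"tax_tree({tax_id})"
    if platform:
        query += f' AND instrument_platform="{platform}"'
    params = {
        "result": "read_run",
        "query": query,
        "fields": ",".join(fields),
        "format": "tsv",
        "limit": limit,
    }
    return f"{ENA_SEARCH}?{urlencode(params)}"


url = ena_read_run_url(1423, platform="ILLUMINA")
print(url)
```

The `fastq_ftp` column in the response holds the download locations of the raw FASTQ files, which mirrors the download links shown in the web interface.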
The NCBI Sequence Read Archive (SRA) is another portal for read set access. Unlike ENA and DDBJ, it is limited to read sets only. After doing a basic search, there are a number of useful filters on the left side of the screen (taxon filters are on the right) to help narrow your results, without the need for an SRA advanced search. That said, advanced searches are always better if you know how to use them. Accessing the actual raw read files is trickier with SRA than with ENA, as a few links need to be followed. After selecting a read experiment, click on a sequence run accession (starts with SRR) in the 'Runs' section at the bottom, then on the following page select the 'Data Access' tab to access the raw data.
The DDBJ Sequence Read Archive (DRA) contains virtually the same data as NCBI SRA and ENA. The search is similar to a simple advanced search, but has far fewer options than an advanced search using NCBI SRA or ENA. The format of the results displays all the important information, but again is lacking compared to the other portals.
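Because the three read archives mirror each other, the same kinds of accession appear in all of them. The first letter indicates which archive issued the accession, and the third letter indicates the record type, e.g. SRR for an SRA run. The sketch below classifies an accession from its prefix. The function name is invented and the accessions in the example are placeholders chosen for their format, not references to particular datasets.

```python
import re

# Prefix scheme: [S|E|D] R [P|S|X|R] followed by digits.
_ACCESSION = re.compile(r"^([SED])R([PSXR])(\d+)$")

ARCHIVE = {"S": "NCBI SRA", "E": "ENA", "D": "DDBJ DRA"}
LEVEL = {"P": "study", "S": "sample", "X": "experiment", "R": "run"}


def classify_accession(accession):
    """Return (issuing archive, record type), or None if the format is not recognised."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        return None
    archive_code, level_code, _digits = match.groups()
    return ARCHIVE[archive_code], LEVEL[level_code]


for acc in ["SRR000001", "ERX123456", "DRP000001", "not-an-accession"]:
    print(acc, classify_accession(acc))
```

Only run-level accessions point at raw read files, so a check like this can help when a paper cites a study or experiment accession instead.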
Overview
Capillary electrophoresis data is now included in the NGS archives. The trace archives (NCBI Trace Archive and DDBJ Trace Archive) are permanent stores of DNA sequence chromatograms (traces), alongside the actual base calls and quality scores. The FASTQ data now feeds into the modern archives (SRA, ENA, DRA), and can be searched for specifically using the advanced search tools (instrument platform = 'capillary').
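For readers who have not worked with the base calls and quality scores mentioned above, the sketch below decodes the quality line of a FASTQ record into Phred scores and error probabilities. It assumes the Phred+33 encoding, which is the common convention but should be checked for older files. The record itself is a made-up three-base example.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Convert a Phred score into the estimated probability of a wrong base call."""
    return 10 ** (-q / 10)


# A made-up FASTQ record: header, bases, separator, qualities.
record = ["@example_read", "ACG", "+", "!5I"]
bases, qualities = record[1], phred_scores(record[3])

for base, q in zip(bases, qualities):
    print(base, q, error_probability(q))
```

A score of 20 corresponds to a 1 in 100 chance of error, and 40 to 1 in 10,000.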
ENCODE & GeneCards for transcriptomics - regulation of gene expression, promoters etc.
The Protein Interaction Network Analysis (PINA)
Reactome allows the user to interactively explore cellular pathways for multiple model organisms (incl human). Reactome consists of metabolic and signalling molecules placed into the biological pathways and processes they are associated with. For example, the human apoptosis pathway can be explored to find which molecules take part in this pathway, and which reactions occur. User data can be uploaded to perform pathway enrichment analysis. The data is curated by domain experts and is backed by primary literature.
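To give a feel for what pathway enrichment analysis computes, the sketch below runs a generic over-representation test using the hypergeometric distribution. This is a textbook approach chosen for illustration and is not claimed to be the statistical method Reactome itself uses. The function name and all numbers are invented.

```python
from math import comb


def enrichment_p_value(population, pathway_size, sample_size, overlap):
    """P(at least `overlap` pathway genes in the sample) under random sampling.

    population   - number of genes in the background set
    pathway_size - number of background genes annotated to the pathway
    sample_size  - number of genes in the user's list
    overlap      - number of user genes that fall in the pathway
    """
    total = comb(population, sample_size)
    upper = min(pathway_size, sample_size)
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Invented numbers: 20,000 background genes, a 150-gene pathway,
# and a 200-gene input list with 10 genes in the pathway.
print(enrichment_p_value(20_000, 150, 200, 10))
```

A small p-value suggests the pathway appears in the input list more often than chance would predict. In practice the values are also corrected for testing many pathways at once.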
STRING contains known and predicted protein-protein interactions. Both direct (physical) and indirect (functional) interactions are shown. More than 24 million proteins are documented spanning over 5,000 organisms. Searching for a protein and organism returns a small PPI network which users can follow and interact with to understand the nature of these interactions. STRING uses various methods to support the interactions it presents, including co-expression experiments, genomic context (nearby regulatory elements), text-mining, and data pulled from other databases.
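The same protein and organism search can be scripted through the STRING API. The sketch below builds a request URL for the `network` method without sending it. The helper name is made up, the gene names are arbitrary examples, and 9606 is the NCBI taxon ID for human.

```python
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"


def string_network_url(identifiers, species, required_score=None, output_format="tsv"):
    """Build (but do not send) a STRING network request for a set of proteins."""
    # Multiple identifiers are separated by a carriage return, encoded as %0D.
    params = {"identifiers": "\r".join(identifiers), "species": species}
    if required_score is not None:
        params["required_score"] = required_score
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


url = string_network_url(["TP53", "MDM2"], species=9606)
print(url)
```

Passing `required_score` restricts the network to interactions above that confidence threshold, which helps when predicted interactions dominate the result.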
The IMEx Consortium catalogues experimentally demonstrated molecular interactions. It focuses on protein-protein interactions within model organisms such as Homo sapiens and Mus musculus. Biochemical evidence (Y2H, coimmunoprecipitation assays etc) is provided for each interaction in the catalogue. To search for interactions properly, the Molecular Interaction Query Language (MIQL) should be used. Queries are built from key-value tags (e.g. species:human) which narrow the search.
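As a small illustration of the key-value style described above, the sketch below joins tags into a single query string. The helper is invented for this example. Only `species:human` comes from the text above; the `detmethod` tag and its value are assumptions about the available field names and should be checked against the search documentation.

```python
def miql_query(**fields):
    """Join key-value tags into one query string, quoting values that contain spaces."""
    clauses = []
    for key, value in fields.items():
        text = str(value)
        if " " in text:
            text = f'"{text}"'
        clauses.append(f"{key}:{text}")
    return " AND ".join(clauses)


# `detmethod` is an assumed field name used for illustration.
print(miql_query(species="human", detmethod="two hybrid"))
```

The resulting string can be pasted into the search box, so a query can be refined interactively before it is used in a script.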
HGMD (Human Gene Mutation Database)
100k genomes project https://www.genomicsengland.co.uk/about-genomics-england/the-100000-genomes-project/
European Genome Phenome Archive (EGA) https://ega-archive.org/
GnomAD haplotypes (phased genotypes ): International HapMap Project
- confused: are individuals with mental health conditions included?
- what about late-onset conditions? these individuals may have disease waiting to occur
The Cancer Genome Atlas TCGA COSMIC
Data has been divided into sections as best as possible.
In each section, the following is covered:
- Who uses the data? (what type of analysis / field)
- How to access (download or use)
- Format of the data
🔴 Low 🟡 Med 🟢 High