GraceAHall/accessing-public-data

Introduction

Research has become more open over the last decade, and a huge amount of data is now freely available online. The push towards repeatable analysis and open access to data has had an enormous impact on medical science, and will continue to do so. There is too much data to cite properly in a single document, but hopefully the main areas are covered here.

This document provides an overview of some of the commonly used open-access archives.


Databases



Genomics & Functional Elements

Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation/quality

Nucleotide Databases
ENA | All nucleotide sequences | All | 🟢 | 🟢 | 🔴
NCBI Nucleotide | All nucleotide sequences | All | 🟡 | 🟢 | 🔴
DDBJ | All nucleotide sequences | All | 🔴 | 🟢 | 🔴

Functional Elements
ENCODE | Annotations for human functional DNA elements | Human + select model organisms | 🟢 | 🟢 | 🟢
GENCODE | Annotations for human (and mouse) genes | Human, Mouse | 🟢 | 🟢 | 🟢
GeneCards | Aggregator for all gene-centric data. Each gene listed once. | Human | 🟢 | 🟢 | 🟡
NCBI Gene | Genes and links to data/metadata | All | 🟡 | 🟢 | 🔴

Sequence Reads
ENA | All nucleotide data | All | 🟢 | 🟢 | 🔴
SRA | High-throughput sequence data | All | 🟡 | 🟢 | 🔴
DRA | High-throughput sequence data | All | 🔴 | 🟢 | 🔴
NCBI Trace Archive | Capillary sequencing only | All | 🔴 | 🟢 | 🔴
DDBJ Trace Archive (DTA) | Capillary sequencing only | All | 🔴 | 🟢 | 🔴

Genome Assemblies
NCBI Assembly | Genome Assemblies | All | 🟢 | 🟢 | 🔴
ENA | All nucleotide sequences | All | 🟡 | 🟢 | 🔴
DDBJ | All nucleotide sequences | All | 🔴 | 🟢 | 🔴

Taxonomy
NCBI Taxonomy | The standard taxonomy system | All | 🟡 | 🟢 | 🔴

Ontologies
- | - | - | - | - | -



Transcriptomics

Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation/quality

Bulk Tissue Gene Expression
GTEx | Tissue-specific gene expression and regulation | Human | 🟢 | 🟢 | 🟢
AOE | Aggregates publicly available gene expression data | All | 🟢 | 🟢 | 🟡
Expression Atlas | Abundance and localisation of RNA | All | 🟢 | 🟡 | 🟢
GEO datasets | Functional Genomics Data (from NGS, Arrays etc) | All | 🟡 | 🟢 | 🟡
GEO profiles | Expression profiles for a specific condition | All | 🟡 | 🟢 | 🟡
ArrayExpress | Functional Genomics Data (NGS, Arrays etc) | All | 🟡 | 🟢 | 🟡

Single Cell Gene Expression
The Human Cell Atlas | Single cell studies | Human | 🟢 | 🟢 | 🟡
Single Cell Expression Atlas | Single cell studies | All | 🟢 | 🟡 | 🟢
Single Cell Portal | Single cell studies | All | 🟡 | 🟢 | 🟡
Tabula Muris | Single-cell transcriptome data | Mouse | 🟡 | 🔴 | 🟢
Human cell landscape | Cell types and localisations | Human | 🔴 | 🔴 | 🟢

Gene Regulation
ENCODE | Annotations for human functional DNA elements | Human + select model organisms | 🟢 | 🟢 | 🟢
GTEx | Tissue-specific gene expression and regulation | Human | 🟢 | 🟢 | 🟢

Transcript Isoforms
GTEx | Tissue-specific gene expression and regulation | Human | 🟢 | 🟢 | 🟢

Noncoding RNA
RNAcentral | All RNA information | All | 🟢 | 🟢 | 🟡



Networks, pathways & Reactions

Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation/quality

Networks
NetworkAnalyst | Networks relevant to a set of input genes (PPI, TF-gene networks etc) | Many Model Organisms | 🟡 | 🟢 | 🟢
PINA | Visualisation of protein-protein interactions & expression in cancer subtypes | Many Model Organisms | 🟡 | 🟢 | 🟡
Connectivity Map (CMap) | Transcriptional responses to chemical, genetic, and disease perturbation | Human | 🔴 | 🟢 | 🟡
JASPAR | Transcription factor binding sites (DNA-binding preferences) | Many Model Organisms | 🟢 | 🟢 | 🟢
TRANSFAC | Transcription factor binding sites & regulated genes. Has paywall. | Select Eukaryotes | 🔴 | 🟢 | 🟢
KEGG | 15 databases related to the function of biological systems | All | 🟢 | 🟡 | 🟢

Pathways
Reactome | Biological pathways | All | 🟢 | 🟢 | 🟢
KEGG Pathway | Biological systems | All | 🔴 | 🟢 | 🟢

Interactions
STRING | Known and predicted protein-protein interactions | All | 🟢 | 🟢 | 🟡
MINT | Molecular interactions - primarily protein-protein interactions (feeds to IMEx) | Many Model Organisms | 🟢 | 🔴 | 🟢
IMEx | Molecular interactions (primarily PPI) curated to internationally agreed standard | Many Model Organisms | 🔴 | 🟡 | 🟢
IntAct | Aggregates molecular interactions from multiple databases (feeds to IMEx) | Many Model Organisms | 🔴 | 🟢 | 🟡



Reactions & Metabolites



Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation/quality

Reactions
Rhea | Reactions of biological interest | All | 🟡 | 🟢 | 🟢
KEGG Reaction | Details of all reactions found in KEGG Pathway | All | 🟡 | 🟢 | 🟢

Metabolites
ChEMBL | Bioactive molecules | All | 🟢 | 🟢 | 🟢
MetaboLights | Studies of Metabolites | All | 🟡 | 🟢 | 🟡



Variation

Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation/quality

Sequence variants (SNVs/SNPs, small indels etc)
EVA | All variant data | All | 🟡 | 🟢 | 🔴
NCBI dbSNP | All sequence variant data | Human | 🔴 | 🟢 | 🔴
ClinVar | Variant-phenotype relationship (health) | Human | 🔴 | 🟢 | 🟡
OMIM | Gene-phenotype relationship | Human | 🔴 | 🟡 | 🟢
COSMIC | Somatic mutations in human cancer | Human | 🟢 | 🟢 | 🟢

Structural Variants
EVA | All variant data | All | 🟡 | 🟢 | 🔴
NCBI dbVar | All structural variant data | Human | 🔴 | 🟢 | 🔴
DGV | Structural variation in healthy control samples (archived) | Human | 🔴 | 🟡 | 🟡



Proteomics

Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation/quality

Protein Sequences
UniProt | Protein sequences and annotations | All | 🟢 | 🟢 | 🟢
Enzyme portal | Concise summary of enzymes | All | 🟢 | 🟡 | 🟢
NCBI Protein | Protein sequences and annotations | All | 🔴 | 🟢 | 🔴

Protein Domains & Families
InterPro | Protein domains & families | All | 🟢 | 🟢 | 🟡
Pfam | Protein families | All | 🔴 | 🟢 | 🟡

Protein Expression
The Human Protein Atlas | Antibody-based imaging, mass spectrometry, transcriptomics data | Human | 🟢 | 🟢 | 🟢
PRIDE | Mass spectrometry data | All | 🟡 | 🟢 | 🟢

Tertiary Structures
PDB | Protein structures & associated data | All | 🟢 | 🟢 | 🟢
PDBe | Protein structures & associated data | All | 🟡 | 🟢 | 🟢
PDBj | Protein structures & associated data | All | 🔴 | 🟢 | 🟢

EM, XRay, & NMR
EMDB | 3D EM density maps | All | 🟡 | 🟢 | 🟡
EMDataResource | 3D EM density maps, models & metadata | All | 🔴 | 🟢 | 🟡
EMPIAR | Raw electron microscopy images | All | 🟡 | 🟡 | 🟡
BMRB | NMR data | All | 🔴 | 🟢 | 🟡



Microbiomics / Metagenomics

Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation/quality

Metagenomics
SILVA | Ribosomal RNA sequences | All | 🟡 | 🟢 | 🟢
Ribosomal database project (RDP) | Ribosomal RNA sequences | Bacteria, Archaea, Fungi | 🟡 | 🟢 | 🟡

Microbiomics
MGnify | Microbiome experiments & data | All | 🟢 | 🟢 | 🔴
BacDive | Bacterial information (geographical, biochemical) | Bacteria | 🟢 | 🟢 | 🟢



Organism Specific


Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation/quality

Viruses
ViPR | Pathogenic viral genomes incl. functional annotations | Viruses | - | - | -
GISAID | Influenza & COVID-19 coronavirus sequences & analysis | Viruses | - | - | -

Enterobacteria
Enterobase | Databases for multiple enteric bacteria | Enteric Bacteria | - | - | -

Eukaryotic Pathogens (incl Malaria)
VEuPathDB | Databases for multiple eukaryotic pathogens incl Plasmodium, Giardia etc | Many Eukaryotic Pathogens | - | - | -

Fruit flies
FlyBase | All | Fruit flies | 🔴 | 🟢 | 🔴

Mouse
MGI | All | Mus musculus (house mouse) | - | - | -

Rat
RGD | All | Rattus norvegicus (common rat) | - | - | -

Zebrafish
ZFIN | All | Danio rerio (zebrafish) | - | - | -

Worms
WormBase | All | C. elegans (roundworm) | - | - | -

Yeast
SGD | All | S. cerevisiae (Brewer's Yeast) | - | - | -



Imaging

Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation/quality
BioImage archive | All biological image data | All | 🟢 | 🟢 | 🟢
Image Data Resource (IDR) | Image datasets from published studies | All | 🟢 | 🟡 | 🟢
Cell Image Library | Images, videos, and animations of cells | All | 🟢 | 🟢 | 🟡



Misc

Name | Data stored | Organisms | Ease of Access | Amount of data | Data curation/quality

Neuroscience
Allen Brain Map | Data and analysis related to the brain | Human, Mouse | 🟢 | 🟢 | 🟢

Immunology
ImmGen | Microarray gene expression & regulation | Mouse | 🟢 | 🟢 | 🟢
Interferome | - | - | - | - | -

Epigenomics
MethBase | Reference methylomes (bisulfite-seq) | Selected model organisms | 🔴 | 🟡 | 🟡

Biodiversity
GBIF | Biodiversity data | All | 🟡 | 🟢 | 🟢

Disease Biomarkers
BIONDA | Biomarker candidates published in PubMed articles | Human | 🔴 | 🟢 | 🟡



Summaries

This section provides a quick summary of each resource mentioned above. It is being filled in over time.



Genomics



NCBI, ENA, & DDBJ

Data Sharing - INSDC      

NCBI, EMBL-EBI and DDBJ share data on a daily basis as members of the International Nucleotide Sequence Database Collaboration (INSDC).

All nucleotide data submitted to any one of these organisations is automatically shared with the others. The choice of archive therefore mainly depends on which one you find easiest to use.


ENA (European Nucleotide Archive)      

The European Nucleotide Archive (ENA) contains all publicly available EMBL-EBI nucleotide sequences. This includes coding sequences (genes), non-coding DNA elements, genome assemblies, DNA/RNA sequence read sets and much more. Both the data itself and its metadata (information about the data: what it is, how it was derived, what techniques were used etc) are stored.

When searching ENA, all types of genomic data are returned. You can then use filters to choose the specific kind of nucleotide sequence you want (e.g. only genome assemblies, or only coding sequences). ENA advanced search lets you build a more specific search.

Of the three archives (ENA, NCBI Nucleotide, and DDBJ), ENA has the cleanest UI.


NCBI Nucleotide      

NCBI Nucleotide, from the National Center for Biotechnology Information (NCBI), is a search tool which pulls results from GenBank, RefSeq, the TPA and other repositories. Searching NCBI Nucleotide is akin to searching all of NCBI's sequence data, so it is comparable to ENA. As with ENA, all kinds of genomic data are available, rather than one type only.

In general, NCBI has an archive specific to your needs (e.g. NCBI Assembly for assemblies, NCBI Gene for gene sequences), but searching NCBI Nucleotide shows how much data of all types matches your search. NCBI advanced search is a powerful tool, provided you know the syntax.
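As a rough illustration of that syntax, the sketch below builds a field-tagged search term and the matching E-utilities ESearch URL without sending any request. The helper names `entrez_term` and `esearch_url` are made up for this example, and the choice of the `Organism` and `Title` field tags and the `nuccore` database name is an assumption about what a typical search might use; check the NCBI documentation before relying on it.

```python
from urllib.parse import urlencode

# Public E-utilities ESearch endpoint (no request is sent in this sketch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(**fields):
    """Join field-tagged clauses with AND, e.g. Organism='x' -> "x"[Organism]."""
    return " AND ".join(f'"{value}"[{tag}]' for tag, value in fields.items())


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL for the given database."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Hypothetical search: complete genomes of Bacillus subtilis.
term = entrez_term(Organism="Bacillus subtilis", Title="complete genome")
url = esearch_url("nuccore", term)
print(term)
print(url)
```

The same term string can be pasted into the search box on the NCBI Nucleotide web page, which is a quick way to check it before scripting anything.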


DDBJ      

The DNA Data Bank of Japan (DDBJ) is also a member of the INSDC and so contains virtually the same nucleotide data as the archives above. The DDBJ's UI and web pages are harder to use than NCBI Nucleotide or ENA, and feel a little dated. Because INSDC members share their data, a DDBJ search returns similar results to an ENA or NCBI Nucleotide search. The search tool for DDBJ is called ARSA.



Organisation - BioProjects & BioSamples


Understanding the hierarchy between archives is one of the trickiest aspects of navigating public data. Anyone who has worked with databases will know that the relationships between data are often hard to express in a standard way. The three main organisations (NCBI, EMBL-EBI and DDBJ) arrange information into BioProjects, BioSamples, and Data, which is a good solution given the challenge.

BioProjects are containers which store links. They are like folders which hold links to all the data and metadata associated with some project. The links can be directly to data, or can be to descriptions of the data (metadata).

Side note: EMBL-EBI call these ‘BioStudies’ instead of BioProjects for some unknown reason. We will use the term BioProject from here.

BioSamples are actually just descriptions of biological material. They do not relate to the data which was generated, but they can link to data which was derived from the particular biological sample / material. For example, if you isolated a colony of bacteria for whole genome sequencing (WGS), a BioSample entry would be created to describe the bacterial isolate. The BioSample would then have a link to the WGS data, specifying “the WGS dataset was generated from this biological material!”.
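The relationships described above can be sketched as a small data model. This is only an illustration of the idea (a project holds links to samples, a sample describes material and links to the data derived from it); the class names, fields, and accession strings are invented for the example and do not mirror any archive's real schema.

```python
from dataclasses import dataclass, field


@dataclass
class DataRecord:
    """A dataset, e.g. a set of sequencing reads."""
    accession: str
    description: str


@dataclass
class BioSample:
    """A description of biological material, linking to data derived from it."""
    accession: str
    description: str
    derived_data: list = field(default_factory=list)


@dataclass
class BioProject:
    """A folder-like container holding links to samples (and, via them, data)."""
    accession: str
    title: str
    samples: list = field(default_factory=list)

    def all_data(self):
        """Collect every data record reachable from this project's samples."""
        return [rec for sample in self.samples for rec in sample.derived_data]


# Placeholder accessions, not real archive identifiers.
wgs = DataRecord("RUN0001", "WGS reads")
isolate = BioSample("SAMPLE0001", "bacterial colony isolate", [wgs])
project = BioProject("PROJECT0001", "Example sequencing project", [isolate])
print([rec.accession for rec in project.all_data()])
```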



Genome Assemblies


NCBI Assembly      

NCBI Assembly specifically displays genome assemblies and associated data. It offers the best filtering options when searching, as searches can be narrowed by attributes such as assembly level (complete, scaffold etc), organism group, ploidy, contig N50, and annotation level.
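For readers unfamiliar with the contig N50 filter, it is commonly defined as the contig length at which contigs of that length or longer cover at least half of the total assembly. A minimal sketch of that calculation, using made-up contig lengths, might look like this:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half the assembly.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Illustrative lengths only: total is 290, and 80 + 70 = 150 passes the halfway mark.
print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```

A higher N50 generally indicates a more contiguous assembly, which is why it is useful as a search filter.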

In terms of metadata, each assembly has an organism name, the submitter name and submission date, accession numbers, and the actual genome sequence data. Other useful information is also available, including the assembly level: ‘complete genome’, ‘chromosome’, ‘scaffold’ or ‘contig’.

The following is usually downloadable:

  • DNA/RNA genome sequence
  • Genomic features (annotations)
  • Coding sequences (gene products)
  • RNA data
  • RepeatMasker output
  • & others

Most assemblies are annotated, but the quality of the annotation is variable. Genomic features are usually inferred using software first, then may be validated experimentally at a later date. The quality of software annotation often depends on how similar the particular organism is to other, well studied organisms.


ENA (European Nucleotide Archive)      

The European Nucleotide Archive (ENA) contains all publicly available EMBL-EBI nucleotide sequences, including genome assemblies. When searching, select 'Assembly' from the filters on the left side of the page to restrict results to genome assemblies.

Unfortunately, assembly searches in ENA cannot be filtered as easily as in NCBI Assembly. If searching for a bacterial organism, e.g. Bacillus subtilis, hundreds of assemblies for individual strains are returned. The only way to do a more specific search is with ENA advanced search, which is a fantastic tool in any case.

Once you have selected an assembly, the sequence and annotations can be downloaded.


DDBJ      

The DNA Data Bank of Japan (DDBJ) is also a member of the INSDC and so contains virtually the same nucleotide data as the archives above. The DDBJ's UI and web pages are harder to use than NCBI Nucleotide or ENA, and feel a little dated. Assembly searches may be easier using NCBI Assembly or ENA.



Taxonomy


NCBI Taxonomy      

The INSDC (mentioned above) maintains a database of taxonomic classifications for each known organism. This taxonomic information is shared across NCBI, ENA, and DDBJ, but only NCBI has built a specific tool to browse and explore taxonomic clades in a web browser.

The NCBI Taxonomy resource allows users to search for taxonomic groups, then provides information on the subgroups within. For a given taxon, you can view and link to the records in NCBI databases, including genome assemblies, protein sequences, read sets, genes and other functional element annotations.

The entire INSDC taxonomy can be downloaded here: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
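The taxonomy dump at that address includes a `nodes.dmp` file in which each row lists a taxon ID, its parent's ID, and a rank, with fields separated by tab-pipe-tab. The sketch below parses a few rows in that layout and walks up to the root. The three sample rows are invented and trimmed to the first three columns (real rows carry more fields), so treat this as an outline of the approach rather than a complete parser.

```python
# Made-up rows in the nodes.dmp layout: tax_id | parent tax_id | rank | ...
SAMPLE_NODES = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tgenus\t|\n"
    "100\t|\t10\t|\tspecies\t|\n"
)


def parse_nodes(text):
    """Map each tax_id to a (parent_id, rank) tuple."""
    nodes = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tax_id, parent_id, rank = [f.strip() for f in line.split("|")][:3]
        nodes[int(tax_id)] = (int(parent_id), rank)
    return nodes


def lineage(nodes, tax_id):
    """Follow parent links until reaching the root, whose parent is itself."""
    path = [tax_id]
    while nodes[tax_id][0] != tax_id:
        tax_id = nodes[tax_id][0]
        path.append(tax_id)
    return path


nodes = parse_nodes(SAMPLE_NODES)
print(lineage(nodes, 100))  # prints [100, 10, 1]
```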



Functional Elements

ENCODE      

The Encyclopedia of DNA Elements (ENCODE) is a high-quality and extensive catalogue of all known functional elements in the human genome. In addition to genes, ENCODE includes any region with functional impact, such as noncoding RNA and promoter / enhancer regulatory regions. ENCODE data has a high level of quality, and uses multiple sources of evidence when annotating new functional elements. A variety of methods, including bioinformatics analysis of current data, sequencing, DNase hypersensitivity assays, DNA methylation assays and binding assays, are routinely used to identify and confirm new elements.


GENCODE      

The Encyclopedia of genes and gene variants (GENCODE) catalogues all the gene features in the human and mouse genomes. Gene classifications are detailed and high-quality, as they are supported by biological evidence. GENCODE can be seen as a subset of ENCODE, which attempts to catalogue all functional elements in the human genome.


GeneCards      

GeneCards provides a summary for each human gene. It integrates information from more than 150 web sources, and presents it to the user in one location. A huge amount of data for each gene is presented, including summaries, regulatory elements of the gene, proteomics information, detailed annotations, and noteworthy genetic variants to name a few (if available). If you want to improve your knowledge of a particular gene, GeneCards is a great option.


NCBI Gene      

While the archives above only catalogue human & mouse genes, NCBI Gene spans all organisms. Searches usually need to be narrowed using filters or advanced search to be useful, but the amount of information given per gene is high. Sometimes the data is good quality and verified; other times it is only software predictions. Links to all NCBI data, as well as academic publications, are provided when browsing a particular gene.



Sequence Reads



Next Gen Sequencing

Data Sharing - INSDC      

NCBI, EMBL-EBI and DDBJ share data on a daily basis as members of the International Nucleotide Sequence Database Collaboration (INSDC).

All read sets submitted to any one of these organisations are automatically shared with the others. The choice of archive therefore mainly depends on which one you find easiest to use.


ENA      

The European Nucleotide Archive (ENA) will display read sets in their default search. In the filter menu under 'Reads', both 'Runs' and 'Experiments' contain read sets with download links to the raw FASTQ files. For a specific sequencing experiment or run, there is a 'Show Column Selection' bar above the read files section - clicking this allows a huge amount of metadata to be displayed for each read set, which can be handy if you have certain demands. The ENA advanced search facilitates searching only for read sets, and allows us to restrict the results based on numerous conditions such as taxonomic group, instrument platform, geographical location, and read length to name a few.
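The same kind of restricted read-set search can also be expressed as a URL for the ENA Portal API. The sketch below only builds the URL and sends nothing. The helper name `ena_read_search_url` is invented for this example; the query shown (reads under NCBI taxon 1423, Bacillus subtilis, sequenced on an Illumina platform) and the returned fields are assumptions chosen for illustration, so confirm field names against the ENA documentation before use.

```python
from urllib.parse import urlencode

# ENA Portal API search endpoint (no request is sent in this sketch).
ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def ena_read_search_url(query, fields, limit=10):
    """Build a Portal API URL that searches sequencing runs ('read_run')."""
    params = {
        "result": "read_run",         # restrict results to read sets
        "query": query,               # conditions, as in advanced search
        "fields": ",".join(fields),   # metadata columns to return
        "format": "tsv",
        "limit": limit,
    }
    return ENA_SEARCH + "?" + urlencode(params)


# Assumed example conditions: taxonomic group plus instrument platform.
query = 'tax_tree(1423) AND instrument_platform="ILLUMINA"'
url = ena_read_search_url(query, ["run_accession", "fastq_ftp"])
print(url)
```

The advanced search page in the browser can generate an equivalent query string, which is a convenient way to check the conditions before scripting them.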


SRA      

The NCBI Sequence Read Archive (SRA) is another portal for read set access. Unlike ENA and DDBJ, it is limited to read sets only. After doing a basic search, there are a number of useful filters on the left side of the screen (taxon filters are on the right) to help narrow your results, without the need for an SRA advanced search. That said, advanced searches are always better if you know how to use them. Accessing the actual raw read files is trickier with SRA than with ENA, as a few links need to be followed. After selecting a read experiment, click on a sequence run accession (starts with SRR) in the 'Runs' section at the bottom, then on the following page select the 'Data Access' tab to access the raw data.
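Run accessions such as SRR follow a pattern shared by the three read archives: by convention the first letter reflects the archive the data was submitted to (S for NCBI, E for ENA, D for DDBJ) and the third letter the record type. The sketch below decodes that pattern; the function name is invented, the accession strings are placeholders, and only the four most common record types are handled.

```python
import re

# First letter: archive of submission. Third letter: record type.
ARCHIVES = {"S": "NCBI SRA", "E": "ENA", "D": "DDBJ DRA"}
LEVELS = {"P": "study", "S": "sample", "X": "experiment", "R": "run"}
PATTERN = re.compile(r"^([SED])R([PSXR])(\d+)$")


def describe_accession(accession):
    """Return (archive, record type) for a read-archive accession, else None."""
    match = PATTERN.match(accession)
    if match is None:
        return None
    return ARCHIVES[match.group(1)], LEVELS[match.group(2)]


# Placeholder accessions used purely to exercise the pattern.
print(describe_accession("SRR000001"))  # prints ('NCBI SRA', 'run')
print(describe_accession("ERX000001"))  # prints ('ENA', 'experiment')
```

Because the archives mirror each other, an accession with any of the three prefixes can normally be looked up in all of them.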


DRA      

The DDBJ Sequence Read Archive (DRA) contains virtually the same data as NCBI SRA and ENA. The search is similar to a simple advanced search, but has far fewer options than an advanced search using NCBI SRA or ENA. The format of the results displays all the important information, but again is lacking compared to the other portals.



Capillary Electrophoresis


Overview

Data specific to capillary electrophoresis is also included in the NGS archives. The repositories above are permanent stores of DNA sequence chromatograms (traces), alongside the actual base calls and quality scores. The FASTQ data now feeds into modern archives (SRA, ENA, DRA), and can be specifically searched for using the advanced search tools (instrument platform = 'capillary').
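To make the link between base calls and quality scores concrete, here is a minimal sketch that reads one four-line FASTQ record and decodes its quality string. It assumes the widely used Phred+33 encoding (quality = character code minus 33); older files may use a different offset. The record itself is invented.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (read name, sequence, scores)."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, so '!' is quality 0 and 'I' is quality 40.
    scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, scores


# A made-up record: one read of four bases.
record = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
name, sequence, scores = parse_fastq_record(record)
print(name, sequence, scores)  # prints read1 ACGT [40, 40, 20, 0]
```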



Transcriptomics

ENCODE & GeneCards for transcriptomics - regulation of gene expression, promoters etc.

Bulk Tissue Gene Expression

Single Cell Gene Expression


Gene Regulation

Transcript Isoforms

Noncoding RNA

Networks, pathways & reactions

Networks


NetworkAnalyst      

Text


PINA      

The Protein Interaction Network Analysis (PINA)


Pathways


Reactome      

Reactome allows the user to interactively explore cellular pathways for multiple model organisms (incl human). Reactome consists of metabolic and signalling molecules placed into the biological pathways and processes they are associated with. For example, the human apoptosis pathway can be explored to find which molecules take part in this pathway, and which reactions occur. User data can be uploaded to perform pathway enrichment analysis. The data is curated by domain experts and is backed by primary literature.


Interactions


STRING      

STRING contains known and predicted protein-protein interactions. Both direct (physical) and indirect (functional) interactions are shown. More than 24 million proteins are documented spanning over 5,000 organisms. Searching for a protein and organism returns a small PPI network which users can follow and interact with to understand the nature of these interactions. STRING uses various methods to support the interactions it presents, including co-expression experiments, genomic context (nearby regulatory elements), text-mining, and data pulled from other databases.
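Besides the web interface, STRING documents a URL-based API. The sketch below builds a request URL for a small network without sending it. The URL layout (`/api/<format>/network` with `identifiers` and `species` parameters) reflects my reading of the public STRING API and should be checked against its current documentation; the helper name and the example proteins are illustrative only. 9606 is the NCBI taxon ID for human.

```python
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"


def string_network_url(proteins, species_taxid, output_format="tsv"):
    """Build (but do not send) a STRING 'network' request URL."""
    params = {
        # Multiple identifiers are separated by a carriage return.
        "identifiers": "\r".join(proteins),
        "species": species_taxid,
    }
    return f"{STRING_API}/{output_format}/network?" + urlencode(params)


# Hypothetical query: interactions between two human proteins.
url = string_network_url(["TP53", "MDM2"], 9606)
print(url)
```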


IMEx      

The IMEx Consortium catalogues experimentally demonstrated molecular interactions. It focuses on protein-protein interactions within model organisms such as Homo sapiens and Mus musculus. Biochemical evidence (Y2H, coimmunoprecipitation assays etc) is provided for each interaction in the catalogue. To properly search for interactions, the Molecular Interaction Query Language should be used. Queries are built from key-value tags (e.g. species:human) which allow the search to be narrowed.
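A query made of such key-value tags can be assembled with a few lines of code. The sketch below joins tags with AND and quotes values that contain spaces. The helper name is invented, and apart from `species:human` (taken from the text above) the `detmethod` key is an assumption about which fields are available, so check the query language reference before use.

```python
def miql_query(**terms):
    """Join key:value tags with AND, quoting values that contain spaces."""
    parts = []
    for key, value in terms.items():
        text = f'"{value}"' if " " in value else value
        parts.append(f"{key}:{text}")
    return " AND ".join(parts)


print(miql_query(species="human"))
# 'detmethod' (detection method) is an assumed field name for illustration.
print(miql_query(species="human", detmethod="two hybrid"))
```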


Variation

HGMD (Human Gene Mutation Database)

100k genomes project https://www.genomicsengland.co.uk/about-genomics-england/the-100000-genomes-project/

European Genome Phenome Archive (EGA) https://ega-archive.org/

GnomAD haplotypes (phased genotypes ): International HapMap Project

  • confused: are individuals with mental health conditions included?
  • what about late-onset conditions? these individuals may have disease waiting to occur

The Cancer Genome Atlas TCGA COSMIC

Proteomics

Pathways & Reactions

Metagenomics / Microbiomics

Metabolomics

Imaging

Domain Specific




Graveyard

Data has been divided into sections as best as possible.
In each section, the following is covered:

  • Who uses the data? (what type of analysis / field)
  • How to access (download or use)
  • Format of the data

🔴 Low 🟡 Med 🟢 High
