Data intake formats

xzhou82 edited this page Oct 9, 2024 · 1 revision

Intake formats for the file types commonly produced by a cancer genomics study

Somatic SNVs/indels should be in VCF or tabular format with the following fields:

chromosome
1-based (VCF) or 0-based (tabular) position
reference allele
mutant allele
DNA sequencing read count per allele
RNAseq read count per allele (optional, but very helpful for cases that have RNAseq data)
sample name
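
The field list above can be checked with a short script. This is only a minimal sketch: the page names the fields but not the header labels or the delimiter, so the column names in `SNV_COLUMNS`, the tab delimiter, and the split of "read count per allele" into reference and mutant counts are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical header labels; the page lists the fields but not their names.
SNV_COLUMNS = ["chromosome", "position", "ref", "alt",
               "dna_ref_count", "dna_alt_count", "sample"]


def vcf_pos_to_zero_based(pos_1based):
    """Convert a 1-based VCF POS to the 0-based position used in tabular files."""
    if pos_1based < 1:
        raise ValueError("VCF positions start at 1")
    return pos_1based - 1


def read_snv_table(text):
    """Parse a tab-delimited SNV/indel table (assumed layout) into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = [c for c in SNV_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        # Positions and read counts must be integers.
        for key in ("position", "dna_ref_count", "dna_alt_count"):
            row[key] = int(row[key])
        if row["position"] < 0:
            raise ValueError("0-based positions cannot be negative")
        rows.append(row)
    return rows
```

A position taken from a VCF record would pass through `vcf_pos_to_zero_based` before being written to a tabular file, so that both formats describe the same base.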

SVs and fusions should be in tabular format with the following fields:

chromosome A
0-based position A
strand A, value is + or -
chromosome B
0-based position B
strand B, value is + or -
sample name
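
As a rough illustration of the SV/fusion layout, the sketch below validates one tab-delimited line. The field keys in `SV_FIELDS`, the tab delimiter, and the column order (taken from the order of the list above) are assumptions, as is the choice to hold strand B to the same + or - rule as strand A.

```python
# Hypothetical keys, in the same order as the field list on this page.
SV_FIELDS = ["chr_a", "pos_a", "strand_a", "chr_b", "pos_b", "strand_b", "sample"]


def parse_sv_row(line):
    """Split one tab-delimited SV/fusion line and validate strands and positions."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(SV_FIELDS):
        raise ValueError(f"expected {len(SV_FIELDS)} fields, got {len(values)}")
    row = dict(zip(SV_FIELDS, values))
    for key in ("strand_a", "strand_b"):
        if row[key] not in ("+", "-"):
            raise ValueError(f"{key} must be + or -, got {row[key]!r}")
    for key in ("pos_a", "pos_b"):
        row[key] = int(row[key])
        if row[key] < 0:
            raise ValueError(f"{key} must be a 0-based (non-negative) position")
    return row
```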

CNVs should be in tabular format with the following fields:

chromosome
0-based start position
0-based stop position
Magnitude of change, for example log2(tumor/germline) or seg.mean. The same quantification must be used for all CNV calls
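
To make the log2(tumor/germline) option concrete, here is a small sketch. The function names, the tab delimiter, and the four-column order are assumptions, and because the page does not say whether the stop position is inclusive, the check only rejects a stop that falls before the start.

```python
import math


def log2_ratio(tumor, germline):
    """Return log2(tumor/germline) for two positive copy or coverage values."""
    if tumor <= 0 or germline <= 0:
        raise ValueError("both values must be positive")
    return math.log2(tumor / germline)


def parse_cnv_row(line):
    """Parse one tab-delimited CNV line: chromosome, start, stop, value."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    chrom, start, stop, value = fields
    start, stop = int(start), int(stop)
    if start < 0 or stop < start:
        raise ValueError("positions must be 0-based with start <= stop")
    return {"chromosome": chrom, "start": start, "stop": stop,
            "value": float(value)}
```

With this definition a segment at twice the germline level gets a value of 1, and a segment at half the germline level gets -1.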

Gene expression, for both FPKM/TPM and raw counts, should be a tabular gene-by-sample value matrix:

Genes on rows, samples on columns
Please use either HGNC symbols or GENCODE gene accessions as gene names
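
A gene-by-sample matrix of this shape could be loaded roughly as follows. The sketch assumes a tab-delimited file whose first row holds sample names and whose first column holds gene names; the function name and the duplicate checks are illustrative choices, not requirements stated on this page.

```python
import csv
import io


def read_expression_matrix(text):
    """Read a gene-by-sample matrix: genes on rows, samples on columns."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first header cell labels the gene column
    if len(set(samples)) != len(samples):
        raise ValueError("duplicate sample names in header")
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, values = row[0], row[1:]
        if len(values) != len(samples):
            raise ValueError(f"{gene}: expected {len(samples)} values")
        if gene in matrix:
            raise ValueError(f"duplicate gene name: {gene}")
        matrix[gene] = dict(zip(samples, (float(v) for v in values)))
    return samples, matrix
```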

tSNE results based on the bulk transcriptome or methylome should be in a tabular file that:

has a header row
has columns for sample and X/Y coordinates

Clinical data can be a tabular matrix with samples on rows and variables on columns, or the reverse.

When variables are on columns, there should be a row called DATATYPE that identifies the data type of each column. Supported values are "categorical", "integer", and "float"
Likewise, when variables are on rows, DATATYPE should be a column
For variables identified as numeric ("integer" or "float"), non-numeric values will be treated as exceptions
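
The DATATYPE convention can be sketched for the variables-on-columns case. The assumptions here are a tab-delimited file, sample names in the first column, and a DATATYPE row whose first cell is the literal text DATATYPE; none of these details are specified on this page. Non-numeric values in numeric columns are collected as exceptions rather than stopping the load.

```python
import csv
import io

# Casting function for each supported DATATYPE value.
CASTS = {"categorical": str, "integer": int, "float": float}


def read_clinical(text):
    """Return (values, exceptions) for a samples-on-rows clinical matrix."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header = rows[0]
    type_rows = [r for r in rows[1:] if r[0] == "DATATYPE"]
    if len(type_rows) != 1:
        raise ValueError("expected exactly one DATATYPE row")
    types = dict(zip(header[1:], type_rows[0][1:]))
    for var, dtype in types.items():
        if dtype not in CASTS:
            raise ValueError(f"{var}: unsupported data type {dtype!r}")
    values, exceptions = {}, []
    for row in rows[1:]:
        if row[0] == "DATATYPE":
            continue
        sample = row[0]
        values[sample] = {}
        for var, raw in zip(header[1:], row[1:]):
            try:
                values[sample][var] = CASTS[types[var]](raw)
            except ValueError:
                # Non-numeric value in a numeric column.
                exceptions.append((sample, var, raw))
    return values, exceptions
```

For the variables-on-rows layout the same logic would apply after transposing the table, with DATATYPE read from a column instead of a row.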

Survival data should be a tabular matrix with the following columns:

Sample name
Time to event in decimal years
Exit code: 0 = alive, 1 = dead
(Additional columns for time to event and exit code could be added when more types of survival data are available)
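
A survival table in this layout might be read as sketched below. The header labels `sample`, `time_years`, and `exit_code` and the tab delimiter are hypothetical, and the days-to-years helper uses a 365.25-day year, which is an assumption rather than a convention stated on this page.

```python
import csv
import io


def days_to_decimal_years(days):
    """Convert a follow-up time in days to decimal years (365.25-day year assumed)."""
    return days / 365.25


def read_survival(text):
    """Parse a tab-delimited survival table with hypothetical column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    records = []
    for row in reader:
        time_years = float(row["time_years"])
        exit_code = int(row["exit_code"])
        if time_years < 0:
            raise ValueError(f"{row['sample']}: negative time to event")
        if exit_code not in (0, 1):
            raise ValueError(f"{row['sample']}: exit code must be 0 or 1")
        records.append({"sample": row["sample"],
                        "time_years": time_years,
                        "dead": exit_code == 1})  # 0 = alive, 1 = dead
    return records
```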

Single-cell RNAseq: TBA