Data intake formats

xzhou82 edited this page Oct 9, 2024 · 1 revision

Intake formats for the file types typically produced by a cancer genomics study

Somatic SNVs/indels should be in VCF or tabular format with the following fields:

  • chromosome
  • 1-based (VCF) or 0-based (tabular) position
  • reference allele
  • mutant allele
  • DNA sequencing read count per allele
  • RNAseq read count per allele (optional; please include it for cases that have RNAseq data)
  • sample name
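
The two position conventions above differ by exactly one. As a minimal sketch, the conversion from a VCF data line to a 0-based tabular row could look like the following. The helper name `vcf_to_tabular_row`, the output column order, and the example values are assumptions made for illustration and are not prescribed by this page. The sketch only shifts the coordinate; it does not address read counts or indel representation, which depend on the variant caller.

```python
def vcf_to_tabular_row(vcf_line, sample_name):
    """Convert one VCF data line into a 0-based tabular row.

    Hypothetical helper: the output column order is an assumption.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    zero_based_pos = int(pos) - 1  # VCF POS is 1-based
    return [chrom, str(zero_based_pos), ref, alt, sample_name]


# Made-up example line, not real data.
row = vcf_to_tabular_row("chr17\t7674220\t.\tC\tT\t.\tPASS\t.", "SAMPLE_01")
print("\t".join(row))
```

Converting in one place, at export time, avoids mixing 0-based and 1-based positions within the same file.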

SVs and fusions should be in tabular format with the following fields:

  • chromosome A
  • 0-based position A
  • strand A, value is + or -
  • chromosome B
  • 0-based position B
  • strand B, value is + or -
  • sample name
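
A quick pre-submission check of an SV/fusion table can catch malformed positions and strands. The sketch below assumes a tab-delimited file with a header row, the hypothetical column labels in `COLUMNS` (this page lists the fields but not their labels), and that strand B takes the same + or - values as strand A. The example rows are made up.

```python
import csv
import io
import re

# Hypothetical header labels; only the field meanings come from this page.
COLUMNS = ["chrA", "posA", "strandA", "chrB", "posB", "strandB", "sample"]


def check_sv_row(row):
    """Return a list of problems found in one SV/fusion row (empty if none)."""
    problems = []
    for key in COLUMNS:
        if not row.get(key):
            problems.append(f"missing value for {key}")
    for key in ("posA", "posB"):
        value = row.get(key) or ""
        # 0-based positions must be non-negative integers.
        if value and not re.fullmatch(r"[0-9]+", value):
            problems.append(f"{key} is not a non-negative integer: {value!r}")
    for key in ("strandA", "strandB"):
        value = row.get(key) or ""
        if value and value not in ("+", "-"):
            problems.append(f"{key} must be + or -: {value!r}")
    return problems


# Made-up example: the second row has a bad position and a bad strand.
EXAMPLE = (
    "chrA\tposA\tstrandA\tchrB\tposB\tstrandB\tsample\n"
    "chr9\t1000\t+\tchr22\t2000\t-\tSAMPLE_01\n"
    "chr9\t-5\t+\tchr22\t2000\t?\tSAMPLE_02\n"
)
for row in csv.DictReader(io.StringIO(EXAMPLE), delimiter="\t"):
    print(row["sample"], check_sv_row(row))
```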

CNV should be in tabular format with the following fields:

  • chromosome
  • 0-based start position
  • 0-based stop position
  • Magnitude of change, for example log2(tumor/germline) or seg.mean. Please use the same quantification for all CNV calls
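
For the log2(tumor/germline) option, the value is zero when tumor and germline signals are equal, positive for gains, and negative for losses. A minimal sketch follows; the function name `log2_ratio` and the choice of input signal (for example normalized coverage over the segment) are assumptions, since this page does not specify how the ratio is derived.

```python
import math


def log2_ratio(tumor_signal, germline_signal):
    """Return log2(tumor/germline) for one segment.

    Both inputs are assumed to be positive and on the same scale,
    for example normalized read coverage over the segment.
    """
    if tumor_signal <= 0 or germline_signal <= 0:
        raise ValueError("signals must be positive to take a log ratio")
    return math.log2(tumor_signal / germline_signal)


print(log2_ratio(2.0, 2.0))  # equal signal gives 0.0
print(log2_ratio(4.0, 2.0))  # doubled signal gives 1.0
print(log2_ratio(1.0, 2.0))  # halved signal gives -1.0
```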

Gene expression, for both FPKM/TPM and raw counts, should be a tabular gene-by-sample value matrix:

  • Genes on rows, samples on columns
  • Please use either HGNC symbols or GENCODE gene accessions as gene names
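
The gene-by-sample layout can be read with a few lines of standard-library Python. This sketch assumes a tab-delimited file whose first row holds sample names and whose first column holds gene names; the delimiter, the function name `read_expression_matrix`, and the example values are assumptions for illustration.

```python
import csv
import io


def read_expression_matrix(handle):
    """Read a gene-by-sample matrix into {gene: {sample: value}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first cell labels the gene column
    matrix = {}
    for fields in reader:
        gene, values = fields[0], fields[1:]
        if len(values) != len(samples):
            raise ValueError(f"{gene}: expected {len(samples)} values, got {len(values)}")
        matrix[gene] = dict(zip(samples, (float(v) for v in values)))
    return samples, matrix


# Made-up example values, not real expression data.
EXAMPLE = (
    "gene\tSAMPLE_01\tSAMPLE_02\n"
    "TP53\t12.5\t3.0\n"
    "MYC\t0\t88.25\n"
)
samples, matrix = read_expression_matrix(io.StringIO(EXAMPLE))
print(samples)
print(matrix["MYC"]["SAMPLE_02"])
```

Rejecting rows whose length differs from the header catches shifted or truncated lines before they are loaded.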

t-SNE results based on bulk transcriptome or methylome data should be in a tabular file that:

  • has a header row
  • has columns for sample and X/Y coordinates

Clinical data can be a tabular matrix with samples on rows and variables on columns, or the reverse.

  • When variables are on columns, there should be a row named DATATYPE that identifies the data type of each column. Supported values are "categorical", "integer", and "float"
  • Likewise, when variables are on rows, DATATYPE should be a column
  • For variables identified as numeric ("integer" or "float"), non-numeric values will be treated as exceptions
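
The DATATYPE convention can be illustrated for the variables-on-columns layout. The sketch below assumes a tab-delimited file, a header row of variable names, and a DATATYPE row identified by its first cell. How exceptions are handled downstream is not described on this page, so the sketch simply collects them in a separate list; that choice, the function name `read_clinical`, and the example values are assumptions.

```python
import csv
import io

# Supported DATATYPE values mapped to Python casts.
CASTS = {"categorical": str, "integer": int, "float": float}


def read_clinical(handle):
    """Parse a samples-on-rows clinical table that has a DATATYPE row."""
    rows = list(csv.reader(handle, delimiter="\t"))
    variables = rows[0][1:]
    type_rows = [r for r in rows[1:] if r[0] == "DATATYPE"]
    if len(type_rows) != 1:
        raise ValueError("expected exactly one DATATYPE row")
    types = dict(zip(variables, type_rows[0][1:]))
    for variable, datatype in types.items():
        if datatype not in CASTS:
            raise ValueError(f"{variable}: unsupported DATATYPE {datatype!r}")
    values, exceptions = {}, []
    for row in rows[1:]:
        if row[0] == "DATATYPE":
            continue
        sample = row[0]
        values[sample] = {}
        for variable, raw in zip(variables, row[1:]):
            try:
                values[sample][variable] = CASTS[types[variable]](raw)
            except ValueError:
                # Non-numeric value in a numeric column: record as an exception.
                exceptions.append((sample, variable, raw))
    return types, values, exceptions


# Made-up example table.
EXAMPLE = (
    "sample\tdiagnosis\tage_years\tpurity\n"
    "DATATYPE\tcategorical\tinteger\tfloat\n"
    "S1\tAML\t12\t0.85\n"
    "S2\tALL\tunknown\t0.6\n"
)
types, values, exceptions = read_clinical(io.StringIO(EXAMPLE))
print(values["S1"])
print(exceptions)
```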

Survival data should be a tabular matrix with the following columns:

  • Sample name
  • Time to event in decimal years
  • Exit code, 0=alive, 1=dead
  • (Additional columns for time to event and exit code could be added when more types of survival data are available)
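
Follow-up time is often recorded in days, so it may need converting to decimal years before submission. A minimal sketch, assuming 365.25 days per year (this page does not specify a conversion factor), three decimal places of precision, and a hypothetical helper named `survival_row`:

```python
DAYS_PER_YEAR = 365.25  # assumption; not specified by this page


def survival_row(sample, days_to_event, is_dead):
    """Build one survival row: sample name, decimal years, exit code."""
    years = days_to_event / DAYS_PER_YEAR
    exit_code = "1" if is_dead else "0"  # 0=alive, 1=dead
    return [sample, f"{years:.3f}", exit_code]


# Made-up example values.
print("\t".join(survival_row("SAMPLE_01", 1461, False)))
print("\t".join(survival_row("SAMPLE_02", 731, True)))
```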

Single-cell RNAseq TBA