Data intake formats
xzhou82 edited this page Oct 9, 2024 · 1 revision
Somatic SNV/indel data should be in VCF or tabular format with the following fields:
- chromosome
- 1-based (VCF) or 0-based (tabular) position
- reference allele
- mutant allele
- DNA sequencing read count per allele
- RNAseq read count per allele (optional; very helpful for cases that have RNAseq data)
- sample name
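The difference between the two coordinate conventions is easy to get wrong, so here is a minimal Python sketch of converting a 1-based VCF position into a 0-based tabular row. The function names, the column order, and the choice to store the per-allele DNA read counts as separate reference and mutant count columns are assumptions made for illustration; this page does not prescribe them.

```python
def vcf_to_tabular_position(vcf_pos: int) -> int:
    """Convert a 1-based VCF position to a 0-based tabular position."""
    if vcf_pos < 1:
        raise ValueError("VCF positions are 1-based and must be >= 1")
    return vcf_pos - 1


def snv_row(chrom, vcf_pos, ref, alt, ref_count, alt_count, sample) -> str:
    """Build one tab-delimited row (hypothetical column order)."""
    fields = [
        chrom,
        vcf_to_tabular_position(vcf_pos),  # 0-based in the tabular format
        ref,
        alt,
        ref_count,   # DNA reads supporting the reference allele
        alt_count,   # DNA reads supporting the mutant allele
        sample,
    ]
    return "\t".join(str(f) for f in fields)


# Made-up values, for illustration only.
print(snv_row("chr1", 101, "C", "T", 30, 12, "S1"))
```

A variant at VCF position 101 is therefore written with position 100 in the tabular file.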
SV and fusion data should be in tabular format with the following fields:
- chromosome A
- 0-based position A
- strand A, value is + or -
- chromosome B
- 0-based position B
- strand B, value is + or -
- sample name
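As a rough illustration, the fields above could be checked with a small Python parser before submission. This is only a sketch: it assumes a tab-delimited line with the seven fields in the order listed, and it assumes strand B takes the same + or - values as strand A. `SVRecord` and `parse_sv_line` are hypothetical names.

```python
from typing import NamedTuple


class SVRecord(NamedTuple):
    chr_a: str
    pos_a: int      # 0-based
    strand_a: str
    chr_b: str
    pos_b: int      # 0-based
    strand_b: str
    sample: str


def parse_sv_line(line: str) -> SVRecord:
    """Parse and validate one tab-delimited SV/fusion line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated fields, got {len(fields)}")
    chr_a, pos_a, strand_a, chr_b, pos_b, strand_b, sample = fields
    for strand in (strand_a, strand_b):
        if strand not in ("+", "-"):
            raise ValueError(f"strand must be + or -, got {strand!r}")
    record = SVRecord(chr_a, int(pos_a), strand_a, chr_b, int(pos_b), strand_b, sample)
    if record.pos_a < 0 or record.pos_b < 0:
        raise ValueError("0-based positions cannot be negative")
    return record


# Made-up breakpoints, for illustration only.
print(parse_sv_line("chr1\t1000\t+\tchr2\t2000\t-\tS1"))
```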
CNV data should be in tabular format with the following fields:
- chromosome
- 0-based start position
- 0-based stop position
- Magnitude of change, e.g. log2(tumor/germline) or seg.mean. The same quantification must be used consistently for all CNV calls
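For the log2(tumor/germline) option, the value is the base-2 logarithm of the tumor signal divided by the germline signal. A minimal Python sketch follows; the function name and the idea that the inputs are per-segment coverage (or copy-number) values are assumptions, since this page does not say how the ratio is derived.

```python
import math


def log2_ratio(tumor: float, germline: float) -> float:
    """Return log2(tumor/germline) for one segment."""
    if tumor <= 0 or germline <= 0:
        raise ValueError("both values must be positive to take a log2 ratio")
    return math.log2(tumor / germline)


print(log2_ratio(4, 2))  # doubling relative to germline -> 1.0
print(log2_ratio(1, 2))  # halving relative to germline  -> -1.0
print(log2_ratio(2, 2))  # no change                     -> 0.0
```

Gains come out positive, losses negative, and unchanged segments at 0, which is one reason a single quantification across all calls makes them comparable.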
Gene expression data, for both FPKM/TPM and raw counts, should be a tabular gene-by-sample value matrix:
- Genes on rows, samples on columns
- Please use either HGNC symbol or GENCODE gene accession as gene names
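A gene-by-sample matrix of this shape can be read with the Python standard library, as sketched below. It assumes a tab-delimited file whose first row holds sample names and whose first column holds gene names; the delimiter, the `gene` header label, and the function name are assumptions, and the values are made up.

```python
import csv
import io


def read_expression_matrix(text: str):
    """Parse a tab-delimited gene-by-sample matrix into nested dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    samples = header[1:]          # first cell labels the gene column
    matrix = {}
    for row in reader:
        gene, values = row[0], row[1:]
        if len(values) != len(samples):
            raise ValueError(f"{gene}: expected {len(samples)} values, got {len(values)}")
        matrix[gene] = dict(zip(samples, (float(v) for v in values)))
    return samples, matrix


example = "gene\tS1\tS2\nTP53\t12.5\t8.0\nMYC\t3.0\t40.25\n"
samples, matrix = read_expression_matrix(example)
print(samples)            # ['S1', 'S2']
print(matrix["MYC"]["S2"])  # 40.25
```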
t-SNE results based on bulk transcriptome or methylome data should be a tabular file that:
- has a header row
- has columns for sample and X/Y coordinates
Clinical data can be in a tabular matrix, with samples on rows and variables on columns or vice versa.
- When variables are on columns, there should be a row called DATATYPE that identifies the data type of each column. Supported values are "categorical", "integer", and "float"
- Likewise, when variables are on rows, DATATYPE should be a column
- For variables identified as numeric ("integer" or "float"), non-numeric values will be treated as exceptions
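To make the DATATYPE convention concrete, here is a hedged Python sketch for the variables-on-columns layout. It assumes a tab-delimited file whose first column holds sample names and whose DATATYPE row is labeled in that same column; the function name, the example variables, and the values are invented. The sketch only collects non-numeric values found in numeric columns; how the intake process actually handles such exceptions is not specified on this page.

```python
import csv
import io

SUPPORTED_TYPES = {"categorical", "integer", "float"}
CASTERS = {"integer": int, "float": float}


def find_exceptions(text: str):
    """Return (sample, variable, value) for non-numeric values in numeric columns."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    variables = header[1:]
    # Locate the DATATYPE row by its label in the first column.
    type_row = next(r for r in body if r[0] == "DATATYPE")
    types = dict(zip(variables, type_row[1:]))
    unknown = set(types.values()) - SUPPORTED_TYPES
    if unknown:
        raise ValueError(f"unsupported DATATYPE values: {sorted(unknown)}")
    exceptions = []
    for row in body:
        if row[0] == "DATATYPE":
            continue
        for name, value in zip(variables, row[1:]):
            caster = CASTERS.get(types[name])
            if caster is None:
                continue  # categorical: any value is accepted
            try:
                caster(value)
            except ValueError:
                exceptions.append((row[0], name, value))
    return exceptions


example = (
    "sample\tage\tdiagnosis\tpurity\n"
    "DATATYPE\tinteger\tcategorical\tfloat\n"
    "S1\t12\tALL\t0.8\n"
    "S2\tunknown\tAML\t0.65\n"
)
print(find_exceptions(example))  # [('S2', 'age', 'unknown')]
```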
Survival data should be a tabular matrix with the following columns:
- Sample name
- Time to event in decimal years
- Exit code, 0=alive, 1=dead
- (Additional columns for time to event and exit code can be added when more types of survival data are available)
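If follow-up time is recorded in days, it has to be converted to decimal years before submission. The Python sketch below shows one way to do that and to emit a row in the column order listed above. Dividing by 365.25 days per year is an assumption (this page does not define the conversion), and the function names and three-decimal formatting are likewise hypothetical.

```python
DAYS_PER_YEAR = 365.25  # assumed average year length, not specified by this page


def days_to_decimal_years(days: float) -> float:
    """Convert follow-up time in days to decimal years."""
    if days < 0:
        raise ValueError("time to event cannot be negative")
    return days / DAYS_PER_YEAR


def survival_row(sample: str, days: float, is_dead: bool) -> str:
    """Build one tab-delimited row: sample, time in decimal years, exit code."""
    exit_code = 1 if is_dead else 0  # 0=alive, 1=dead
    return f"{sample}\t{days_to_decimal_years(days):.3f}\t{exit_code}"


# Made-up follow-up times, for illustration only.
print(survival_row("S1", 730.5, True))
print(survival_row("S2", 365.25, False))
```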
Single-cell RNAseq: TBA