Testing aberrant expression pipeline using external counts #584

TrueMaverick · 2024-10-28T15:49:18Z

Hello, I am new to DROP. We have a single affected sample and hence wanted to use the publicly available control dataset of external counts. First I wanted to test the pipeline to see if I have setup the input files properly(using version 1.4.0).
This is my current sample annotation file taken from https://doi.org/10.5281/zenodo.3963470:

The gene annotation file I use is from https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh37_mapping/gencode.v34lift37.annotation.gtf.gz

RNA_ID RNA_BAM_FILE DNA_VCF_FILE DNA_ID DROP_GROUP PAIRED_END COUNT_MODE COUNT_OVERLAPS STRAND GENE_COUNTS_FILE GENE_ANNOTATION GENOME
WB100071 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100005 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100061 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100043 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100083 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100079 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100078 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100074 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100038 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100017 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19

This is my config.yaml:

projectTitle: "DROP: Detection of RNA Outliers Pipeline"
root: /projectnb/4kajdatx/drop_project/Output            # root directory of all output objects and tables
htmlOutputPath: Output/html   # path for HTML rendered reports
indexWithFolderName: true # whether the root base name should be part of the index name

hpoFile: null  # if null, downloads it from webserver
sampleAnnotation: /projectnb/4kajdatx/drop_project/Data/sample_annotation.tsv # path to sample annotation (see documentation on how to create it)

geneAnnotation:
    gencode34: /projectnb/4kajdatx/drop_project/Data/gencode.v34lift37.annotation.gtf
genomeAssembly: hg19
genome:
    hg19: /projectnb/4kajdatx/drop_project/Data/GRCh37.primary_assembly.genome.fa # path to reference genome sequence in fasta format.
    # You can define multiple reference genomes in yaml format, ncbi: path/to/ncbi, ucsc: path/to/ucsc
    # the keywords that define the path should be in the GENOME column of the sample annotation table

random_seed: true  # just for demo data, remove for analysis

exportCounts:
    # specify which gene annotations to include and which
    # groups to exclude when exporting counts
    geneAnnotations:
        - gencode34
    excludeGroups:

aberrantExpression:
    run: true
    groups:
        - outrider
    fpkmCutoff: 1
    implementation: autoencoder
    padjCutoff: 0.05
    zScoreCutoff: 0
    genesToTest: null
    maxTestedDimensionProportion: 3
    yieldSize: 2000000

aberrantSplicing:
    run: false
    groups:
        - fraser
    recount: false
    longRead: false
    keepNonStandardChrs: false
    filter: true
    minExpressionInOneSample: 20
    quantileMinExpression: 10
    minDeltaPsi: 0.05
    implementation: PCA
    padjCutoff: 0.1
    maxTestedDimensionProportion: 6
    genesToTest: null
    ### FRASER1 configuration
    FRASER_version: "FRASER"
    deltaPsiCutoff : 0.3
    quantileForFiltering: 0.95
    ### For FRASER2, use the follwing parameters instead of the 3 lines above:
    # FRASER_version: "FRASER2"
    # deltaPsiCutoff : 0.1
    # quantileForFiltering: 0.75

mae:
    run: false
    groups:
"config.yaml" 100L, 2981C

When I run snakemake aberrantExpression --cores 1, I get this error:

Error in rowRanges<-(*tmp*, value = new("CompressedGRangesList", unlistData = new("GRanges", :

nb of rows in 'assay' (62455) must equal nb of rows in 'rowData' (62492)
Calls: rowRanges<- -> rowRanges<-
Execution halted
[Mon Oct 28 11:47:03 2024]
Error in rule AberrantExpression_pipeline_Counting_mergeCounts_R:
jobid: 4
input: /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz, /projectnb/4kajdatx/drop_project/Output/processed_data/aberrant_expression/gencode34/count_ranges.Rds, /projectnb/4kajdatx/drop_project/Output/processed_data/aberrant_expression/gencode34/params/merge/outrider_mergeParams.csv, Scripts/AberrantExpression/pipeline/Counting/mergeCounts.R
output: /projectnb/4kajdatx/drop_project/Output/processed_data/aberrant_expression/gencode34/outrider/outrider/total_counts.Rds
log: /projectnb/4kajdatx/drop_project/.drop/tmp/AE/gencode34/outrider/merge.Rds (check log file(s) for error details)

RuleException:
CalledProcessError in file /tmp/tmpn8qsm0fb, line 48:
Command 'set -euo pipefail; Rscript --vanilla /projectnb/4kajdatx/drop_project/.snakemake/scripts/tmpo5sn0bp9.mergeCounts.R' returned non-zero exit status 1.
File "/tmp/tmpn8qsm0fb", line 48, in __rule_AberrantExpression_pipeline_Counting_mergeCounts_R
File "/projectnb/4kajdatx/miniconda3/envs/drop_env/lib/python3.12/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-10-28T114347.025705.snakemake.log

I am not sure what I am doing wrong. I would greatly appreciate your guidance in this.

Best,
Madesh

The text was updated successfully, but these errors were encountered:

TrueMaverick · 2024-10-29T20:03:04Z

So apparently the annotation file has more genes than the counts file and the excess transcripts had to be removed? I had to edit countsRanges.Rds manually and rerun from the incomplete step. Why does the pipeline expect the same rows in both?
This is what I did to resolve the issue -

In R,

Load geneCounts.tsv.gz:
control_counts <- read.table("/projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz", header = TRUE,sep = "\t") 
control_gene_ids <- unique(control_counts$geneID)

# Load count_ranges
count_ranges <- readRDS("/projectnb/4kajdatx/drop_project/Output/processed_data/aberrant_expression/gencode34/count_ranges.Rds")



# Filter count_ranges based on control_gene_ids
count_ranges_filtered <- count_ranges[names(count_ranges) %in% control_gene_ids]

# Save the filtered count_ranges
saveRDS(count_ranges_filtered, "/projectnb/4kajdatx/drop_project/Output/processed_data/aberrant_expression/gencode34/count_ranges.Rds")

Then I ran : snakemake aberrantExpression --rerun-incomplete --cores 1 again

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Testing aberrant expression pipeline using external counts #584

Testing aberrant expression pipeline using external counts #584

TrueMaverick commented Oct 28, 2024

TrueMaverick commented Oct 29, 2024 •

edited

Loading

Testing aberrant expression pipeline using external counts #584

Testing aberrant expression pipeline using external counts #584

Comments

TrueMaverick commented Oct 28, 2024

TrueMaverick commented Oct 29, 2024 • edited Loading

TrueMaverick commented Oct 29, 2024 •

edited

Loading