Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Testing aberrant expression pipeline using external counts #584

Open
TrueMaverick opened this issue Oct 28, 2024 · 1 comment
Open

Testing aberrant expression pipeline using external counts #584

TrueMaverick opened this issue Oct 28, 2024 · 1 comment

Comments

@TrueMaverick
Copy link

Hello, I am new to DROP. We have a single affected sample and hence wanted to use the publicly available control dataset of external counts. First I wanted to test the pipeline to see if I have setup the input files properly(using version 1.4.0).
This is my current sample annotation file taken from https://doi.org/10.5281/zenodo.3963470:

The gene annotation file I use is from https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh37_mapping/gencode.v34lift37.annotation.gtf.gz

RNA_ID RNA_BAM_FILE DNA_VCF_FILE DNA_ID DROP_GROUP PAIRED_END COUNT_MODE COUNT_OVERLAPS STRAND GENE_COUNTS_FILE GENE_ANNOTATION GENOME
WB100071 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100005 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100061 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100043 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100083 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100079 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100078 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100074 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100038 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19
WB100017 outrider TRUE IntersectionStrict TRUE yes /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz gencode34 hg19

This is my config.yaml:

projectTitle: "DROP: Detection of RNA Outliers Pipeline"
root: /projectnb/4kajdatx/drop_project/Output            # root directory of all output objects and tables
htmlOutputPath: Output/html   # path for HTML rendered reports
indexWithFolderName: true # whether the root base name should be part of the index name

hpoFile: null  # if null, downloads it from webserver
sampleAnnotation: /projectnb/4kajdatx/drop_project/Data/sample_annotation.tsv # path to sample annotation (see documentation on how to create it)

geneAnnotation:
    gencode34: /projectnb/4kajdatx/drop_project/Data/gencode.v34lift37.annotation.gtf
genomeAssembly: hg19
genome:
    hg19: /projectnb/4kajdatx/drop_project/Data/GRCh37.primary_assembly.genome.fa # path to reference genome sequence in fasta format.
    # You can define multiple reference genomes in yaml format, ncbi: path/to/ncbi, ucsc: path/to/ucsc
    # the keywords that define the path should be in the GENOME column of the sample annotation table

random_seed: true  # just for demo data, remove for analysis

exportCounts:
    # specify which gene annotations to include and which
    # groups to exclude when exporting counts
    geneAnnotations:
        - gencode34
    excludeGroups:

aberrantExpression:
    run: true
    groups:
        - outrider
    fpkmCutoff: 1
    implementation: autoencoder
    padjCutoff: 0.05
    zScoreCutoff: 0
    genesToTest: null
    maxTestedDimensionProportion: 3
    yieldSize: 2000000

aberrantSplicing:
    run: false
    groups:
        - fraser
    recount: false
    longRead: false
    keepNonStandardChrs: false
    filter: true
    minExpressionInOneSample: 20
    quantileMinExpression: 10
    minDeltaPsi: 0.05
    implementation: PCA
    padjCutoff: 0.1
    maxTestedDimensionProportion: 6
    genesToTest: null
    ### FRASER1 configuration
    FRASER_version: "FRASER"
    deltaPsiCutoff : 0.3
    quantileForFiltering: 0.95
    ### For FRASER2, use the follwing parameters instead of the 3 lines above:
    # FRASER_version: "FRASER2"
    # deltaPsiCutoff : 0.1
    # quantileForFiltering: 0.75

mae:
    run: false
    groups:
"config.yaml" 100L, 2981C      

When I run snakemake aberrantExpression --cores 1, I get this error:

Error in rowRanges<-(*tmp*, value = new("CompressedGRangesList", unlistData = new("GRanges", :

nb of rows in 'assay' (62455) must equal nb of rows in 'rowData' (62492)
Calls: rowRanges<- -> rowRanges<-
Execution halted
[Mon Oct 28 11:47:03 2024]
Error in rule AberrantExpression_pipeline_Counting_mergeCounts_R:
jobid: 4
input: /projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz, /projectnb/4kajdatx/drop_project/Output/processed_data/aberrant_expression/gencode34/count_ranges.Rds, /projectnb/4kajdatx/drop_project/Output/processed_data/aberrant_expression/gencode34/params/merge/outrider_mergeParams.csv, Scripts/AberrantExpression/pipeline/Counting/mergeCounts.R
output: /projectnb/4kajdatx/drop_project/Output/processed_data/aberrant_expression/gencode34/outrider/outrider/total_counts.Rds
log: /projectnb/4kajdatx/drop_project/.drop/tmp/AE/gencode34/outrider/merge.Rds (check log file(s) for error details)

RuleException:
CalledProcessError in file /tmp/tmpn8qsm0fb, line 48:
Command 'set -euo pipefail; Rscript --vanilla /projectnb/4kajdatx/drop_project/.snakemake/scripts/tmpo5sn0bp9.mergeCounts.R' returned non-zero exit status 1.
File "/tmp/tmpn8qsm0fb", line 48, in __rule_AberrantExpression_pipeline_Counting_mergeCounts_R
File "/projectnb/4kajdatx/miniconda3/envs/drop_env/lib/python3.12/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-10-28T114347.025705.snakemake.log

I am not sure what I am doing wrong. I would greatly appreciate your guidance in this.

Best,
Madesh

@TrueMaverick
Copy link
Author

TrueMaverick commented Oct 29, 2024

So apparently the annotation file has more genes than the counts file and the excess transcripts had to be removed? I had to edit countsRanges.Rds manually and rerun from the incomplete step. Why does the pipeline expect the same rows in both?
This is what I did to resolve the issue -

In R,

Load geneCounts.tsv.gz:
control_counts <- read.table("/projectnb/4kajdatx/drop_project/Data/control_counts/geneCounts.tsv.gz", header = TRUE,sep = "\t") 
control_gene_ids <- unique(control_counts$geneID)

# Load count_ranges
count_ranges <- readRDS("/projectnb/4kajdatx/drop_project/Output/processed_data/aberrant_expression/gencode34/count_ranges.Rds")



# Filter count_ranges based on control_gene_ids
count_ranges_filtered <- count_ranges[names(count_ranges) %in% control_gene_ids]

# Save the filtered count_ranges
saveRDS(count_ranges_filtered, "/projectnb/4kajdatx/drop_project/Output/processed_data/aberrant_expression/gencode34/count_ranges.Rds")

Then I ran : snakemake aberrantExpression --rerun-incomplete --cores 1 again

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant