Merge pull request #231 from hoelzer-lab/pig

Add Sus scrofa
hoelzer-lab · Feb 26, 2024 · 8cd6307
2 parents 094be4d + 203b76d

Showing 6 changed files with 36 additions and 11 deletions.
diff --git a/README.md b/README.md
@@ -194,7 +194,7 @@ nextflow pull hoelzer-lab/rnaflow -r <RELEASE>
 nextflow run hoelzer-lab/rnaflow --reads input.csv --autodownload hsa --pathway hsa --max_cores 6 --cores 2
 ```
 
-with `--autodownload <hsa|mmu|mau|eco>` [build-in species](#build-in-species), or define your own genome reference and annotation files in CSV files:
+with `--autodownload <hsa|mmu|ssc|mau|eco>` [build-in species](#build-in-species), or define your own genome reference and annotation files in CSV files:
 
 ```bash
 nextflow run hoelzer-lab/rnaflow --reads input.csv --genome fastas.csv --annotation gtfs.csv --max_cores 6 --cores 2
@@ -204,13 +204,13 @@ Genomes and annotations from `--autodownload`, `--genome` and `--annotation` are
 
 By default, all possible comparisons are performed. Use `--deg` to change this.
 
-`--pathway <hsa|mmu|mau>` performs downstream pathway analysis. Available are WebGestalt set enrichment analysis (GSEA) for `hsa`, piano GSEA with different settings and consensus scoring for `hsa`, `mmu` and `mau`.
+`--pathway <hsa|mmu|mau|ssc>` performs downstream pathway analysis. Available are WebGestalt set enrichment analysis (GSEA) for `hsa`, `mmu` and `ssc`, piano GSEA with different settings and consensus scoring for `hsa`, `mmu`, `mau`, and `ssc`.
 
 ### Input files
 
 #### Read files (required)
 
-Specify your read files in `FASTQ` format with `--reads input.csv`. The file `input.csv` has to look like this for single-end reads (just leave R2 empty):
+Specify your read files in `FASTQ` format with `--reads input.csv`. The file `input.csv` has to look like this for single-end reads:
 
 ```csv
 Sample,R1,R2,Condition,Source,Strandedness
@@ -258,10 +258,11 @@ You can add a [build-in species](#build-in-species) to your defined genomes and
 
 We provide a small set of build-in species for which the genome and annotation files are automatically downloaded from [Ensembl](https://www.ensembl.org/index.html) with `--autodownload xxx`. Please let us know, we can easily add other species.
 
-| Species      | three-letter shortcut | Genome                              | Annotation                                    |
+| Species      | three-letter shortcut | Annotation                              | Genome                                    |
 | ------------ | --------------------- | ----------------------------------- | --------------------------------------------- |
 | Homo sapiens | `hsa` <sup>*</sup>               | Homo_sapiens.GRCh38.98              | Homo_sapiens.GRCh38.dna.primary_assembly      |
 | Mus musculus | `mmu` <sup>*</sup>               | Mus_musculus.GRCm38.99              | Mus_musculus.GRCm38.dna.primary_assembly      |
+| Sus scrofa   | `ssc` <sup>*</sup>               | Sus_scrofa.Sscrofa11.1.111              | Sus_scrofa.Sscrofa11.1.dna.toplevel      |
 | Mesocricetus auratus | `mau` <sup>*</sup>               | Mesocricetus_auratus.MesAur1.0.100  | Mesocricetus_auratus.MesAur1.0.dna.toplevel   |
 | Escherichia coli | `eco`                 | Escherichia_coli_k_12.ASM80076v1.45 | Escherichia_coli_k_12.ASM80076v1.dna.toplevel |
 
@@ -313,7 +314,7 @@ Nextflow will need access to the working directory where temporary calculations
 --strand                        # strandness for counting with featureCounts: 0 (unstranded), 1 (stranded) and 2 (reversely stranded) [default 0]
 --tpm                           # threshold for TPM (transcripts per million) filter [default 1]
 --deg                           # a CSV file following the pattern: conditionX,conditionY
---pathway                       # perform different downstream pathway analysis for the species hsa|mmu|mau
+--pathway                       # perform different downstream pathway analysis for the species hsa|mmu|mau|ssc
 --feature_id_type               # ID type for downstream analysis [default: ensembl_gene_id]
 ```
 
@@ -468,7 +469,7 @@ We provide `DESeq2` normalized, regularized log (rlog), variance stabilized (vsd
 
 For each comparison (specified with `--deg` or, per default, all possible pairwise comparisons in one direction), a new folder `X_vs_Y` is created. This also describes the direction of the comparison, e.g., the log2FoldChange describes the change of a gene A under condition Y with respect to the gene under condition X. For example, a log2FoldChange of +2 for gene A would tell you that this gene is 4-fold upregulated when we compare condition X vs. condition Y. The gene A is higher expressed in samples belonging to condition X.
 
-Downstream analysis (`--pathway xxx`) are currently provided for some species: GSEA consensus scoring with `piano` for *Homo sapiens* (`hsa`), *Mus musculus* (`mmu`) and *Mesocricetus auratus* (`mau`); and `WebGestalt` GSEA for *Homo sapiens* and *Mus musculus*.
+Downstream analysis (`--pathway xxx`) are currently provided for some species: GSEA consensus scoring with `piano` for *Homo sapiens* (`hsa`), *Mus musculus* (`mmu`), *Mesocricetus auratus* (`mau`), and *Sus scrofa* (`ssc`); and `WebGestalt` GSEA for *Homo sapiens*, *Mus musculus*, and *Sus scrofa*.
 
 ## Working offline
 
@@ -518,6 +519,7 @@ Input:
                                     - hsa [Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly | Homo_sapiens.GRCh38.98]
                                     - eco [Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel | Escherichia_coli_k_12.ASM80076v1.45]
                                     - mmu [Ensembl: Mus_musculus.GRCm38.dna.primary_assembly | Mus_musculus.GRCm38.99.gtf]
+                                    - ssc [Ensembl: Sus_scrofa.Sscrofa11.1.dna.toplevel | Sus_scrofa.Sscrofa11.1.111 ]
                                     - mau [Ensembl: Mesocricetus_auratus.MesAur1.0.dna.toplevel | Mesocricetus_auratus.MesAur1.0.100]
 --species                Specifies the species identifier for downstream path analysis. (DEPRECATED)
                          If `--include_species` is set, reference genome and annotation are added and automatically downloaded. [default: ]
@@ -552,6 +554,7 @@ DEG analysis options:
                              - hsa | Homo sapiens
                              - mmu | Mus musculus
                              - mau | Mesocricetus auratus
+                             - ssc | Sus scrofa
 --feature_id_type        ID type for downstream analysis [default: ensembl_gene_id]                            
 
 Transcriptome assembly options:

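The README hunk above relates log2FoldChange values to fold changes. Since log2FoldChange is a base-2 logarithm, a value `v` corresponds to a linear fold change of `2^v`. A minimal sketch of that conversion follows; the helper name `fold_change` is hypothetical and not part of rnaflow, and `awk` is assumed to be available.

```shell
# fold_change LOG2FC: print the linear fold change 2^LOG2FC.
# Hypothetical helper for illustration only.
fold_change() {
  awk -v l2fc="$1" 'BEGIN { printf "%g\n", 2 ^ l2fc }'
}

fold_change 2    # prints 4
fold_change 1    # prints 2
fold_change -1   # prints 0.5
```

A negative log2FoldChange gives a value below 1, i.e. lower expression relative to the reference condition.
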
diff --git a/bin/piano.R b/bin/piano.R
@@ -26,6 +26,8 @@ try.biomart <- try(
     biomart.ensembl <- useMart('ensembl', dataset='mmusculus_gene_ensembl')
   } else if (species == 'hsa') {
     biomart.ensembl <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
+  } else if (species == 'ssc') {
+    biomart.ensembl <- useMart('ensembl', dataset='sscrofa_gene_ensembl')
   } else if (species == 'mau') {
     biomart.ensembl <- useMart('ensembl', dataset='mauratus_gene_ensembl')
   } else {

diff --git a/bin/webgestalt.R b/bin/webgestalt.R
@@ -18,6 +18,8 @@ if ( species == 'hsa' ){
 organism <- "hsapiens"
 } else if (species == 'mmu') {
 organism <- "mmusculus"
+} else if (species == 'ssc') {
+organism <- "sscrofa"
 } else {
 organism <- NA
 }
@@ -42,5 +44,5 @@ if (! is.na(organism)) {
         print(paste('SKIPPING: WebGestaltR. Feature ID', id_type, 'not supported.'))
     }
 } else {
-    print("Unknown organism, only organisms 'hsapiens' and 'mmusculus' are supported by default. Exiting.")
+    print("Unknown organism, only organisms 'hsapiens', 'mmusculus', and 'sscrofa' are supported by default. Exiting.")
 }
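The two R hunks above extend the same species-to-organism mapping in parallel: `webgestalt.R` maps the code to an organism name, and `piano.R` selects a BioMart dataset that, for the codes shown, is that organism name plus `_gene_ensembl`. A rough sketch of the mapping as one lookup follows. The function names `species_to_organism` and `biomart_dataset` are hypothetical, and note that `mau` appears only in the `piano.R` hunk, not in `webgestalt.R`.

```shell
# Map an rnaflow species code to the organism prefix seen in the R scripts.
# Hypothetical helpers; the values are copied from the hunks above.
species_to_organism() {
  case "$1" in
    hsa) echo "hsapiens" ;;
    mmu) echo "mmusculus" ;;
    ssc) echo "sscrofa" ;;
    mau) echo "mauratus" ;;   # present in piano.R only
    *)   return 1 ;;          # unknown code
  esac
}

# Derive the BioMart dataset name used in piano.R from the organism prefix.
biomart_dataset() {
  organism="$(species_to_organism "$1")" || return 1
  echo "${organism}_gene_ensembl"
}

for s in hsa mmu ssc mau; do
  echo "$s -> $(biomart_dataset "$s")"
done
```

Keeping the codes in one table like this would mean a new species is added in a single place, which is the kind of change this commit has to make twice.
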
diff --git a/main.nf b/main.nf
@@ -85,9 +85,9 @@ if (params.nanopore) {
 }
 
 
-Set species = ['hsa', 'eco', 'mmu', 'mau']
-Set autodownload = ['hsa', 'eco', 'mmu', 'mau']
-Set pathway = ['hsa', 'mmu', 'mau']
+Set species = ['hsa', 'eco', 'mmu', 'mau', 'ssc']
+Set autodownload = ['hsa', 'eco', 'mmu', 'mau', 'ssc']
+Set pathway = ['hsa', 'mmu', 'mau', 'ssc']
 
 if ( params.profile ) { exit 1, "--profile is WRONG use -profile" }
 
@@ -916,14 +916,15 @@ def helpMSG() {
     ${c_dim}Genomes and annotations from --autodownload, --genome and --annotation are concatenated.${c_reset}
 
     ${c_yellow}Input:${c_reset}
-    ${c_green}--reads${c_reset}                  A CSV file following the pattern: Sample,R,Condition,Source for single-end or Sample,R1,R2,Condition,Source for paired-end
+    ${c_green}--reads${c_reset}                  A CSV file following the pattern: Sample,R1,R2,Condition,Source,Strandedness (for single-end leave 'R2' column empty)
                                         ${c_dim}(check terminal output if correctly assigned)
                                         Per default, all possible comparisons of conditions in one direction are made. Use --deg to change.${c_reset}
     ${c_green}--autodownload${c_reset}           Specifies the species identifier for automated download [default: $params.autodownload]
                                         ${c_dim}Currently supported are:
                                         - hsa [Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly | Homo_sapiens.GRCh38.98]
                                         - eco [Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel | Escherichia_coli_k_12.ASM80076v1.45]
                                         - mmu [Ensembl: Mus_musculus.GRCm38.dna.primary_assembly | Mus_musculus.GRCm38.99.gtf]
+                                        - ssc [Ensembl: Sus_scrofa.Sscrofa11.1.dna.toplevel | Sus_scrofa.Sscrofa11.1.111 ]
                                         - mau [Ensembl: Mesocricetus_auratus.MesAur1.0.dna.toplevel | Mesocricetus_auratus.MesAur1.0.100]${c_reset}
     ${c_dim}--species                Specifies the species identifier for downstream path analysis. (DEPRECATED)
                              If `--include_species` is set, reference genome and annotation are added and automatically downloaded. [default: $params.species]
@@ -960,6 +961,7 @@ def helpMSG() {
                              ${c_dim}Currently supported are:
                                  - hsa | Homo sapiens
                                  - mmu | Mus musculus
+                                 - ssc | Sus scrofa
                                  - mau | Mesocricetus auratus${c_reset}
     --feature_id_type        ID type for downstream analysis [default: $params.feature_id_type]
 

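The `main.nf` hunk above adds `ssc` to the `autodownload` and `pathway` sets. How the pipeline checks parameters against these sets is not shown in this diff, so the following is only a sketch of such a membership check, assuming the sets are used for validation. `in_set` and `check_species` are hypothetical names.

```shell
# Allowed codes copied from the updated Sets in main.nf.
AUTODOWNLOAD_SPECIES="hsa eco mmu mau ssc"
PATHWAY_SPECIES="hsa mmu mau ssc"

# in_set CODE "LIST": succeed if CODE is one of the space-separated words in LIST.
in_set() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

# check_species OPTION CODE "LIST": report whether CODE is allowed for OPTION.
check_species() {
  if in_set "$2" "$3"; then
    echo "OK: --$1 $2"
  else
    echo "ERROR: --$1 $2 is not supported (allowed: $3)"
    return 1
  fi
}

check_species autodownload ssc "$AUTODOWNLOAD_SPECIES"
check_species pathway ssc "$PATHWAY_SPECIES"
```

With these lists, `eco` passes for `--autodownload` but is rejected for `--pathway`, matching the sets in the hunk.
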
diff --git a/modules/annotationGet.nf b/modules/annotationGet.nf
@@ -27,6 +27,12 @@ process annotationGet {
       gunzip -f Mus_musculus.GRCm38.99.gtf.gz
       mv Mus_musculus.GRCm38.99.gtf ${species}.gtf
       """
+    else if (species == 'ssc')
+      """
+      wget ftp://ftp.ensembl.org/pub/release-111/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.111.gtf.gz
+      gunzip -f Sus_scrofa.Sscrofa11.1.111.gtf.gz
+      mv Sus_scrofa.Sscrofa11.1.111.gtf ${species}.gtf
+      """
     else if (species == 'eco')
       """
       wget ftp://ftp.ensemblgenomes.org/pub/release-45/bacteria//gtf/bacteria_90_collection/escherichia_coli_k_12/Escherichia_coli_k_12.ASM80076v1.45.gtf.gz

diff --git a/modules/referenceGet.nf b/modules/referenceGet.nf
@@ -27,6 +27,16 @@ process referenceGet {
       gunzip -f Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
       mv Mus_musculus.GRCm38.dna.primary_assembly.fa ${species}.fa
       """
+    else if (species == 'ssc')
+      """
+      # Primary assembly contains all toplevel sequence regions excluding haplotypes and patches. 
+      # This file is best used for performing sequence similarity searches where patch and haplotype 
+      # sequences would confuse analysis. If the primary assembly file is not present, that 
+      # indicates that there are no haplotype/patch regions, and the 'toplevel' file is equivalent.
+      wget ftp://ftp.ensembl.org/pub/release-111/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
+      gunzip -f Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
+      mv Sus_scrofa.Sscrofa11.1.dna.toplevel.fa ${species}.fa
+      """
     else if (species == 'eco')
       """
       wget ftp://ftp.ensemblgenomes.org/pub/release-45/bacteria//fasta/bacteria_90_collection/escherichia_coli_k_12/dna/Escherichia_coli_k_12.ASM80076v1.dna.toplevel.fa.gz
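
The `ssc` branches added to `annotationGet.nf` and `referenceGet.nf` share the Ensembl release number (111) and the same download, unpack, rename pattern. As a sketch only, the URLs can be built from one release variable and the resulting commands printed as a dry run, with no network access. The variable and function names here are hypothetical, and whether the same file names exist for releases other than 111 is an assumption that has not been checked.

```shell
# Release and base URL taken from the ssc branches in this commit.
ENSEMBL_RELEASE=111
ENSEMBL_FTP="ftp://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}"

# Annotation (GTF) URL; the file name embeds the release number.
ssc_gtf_url() {
  printf '%s/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.%s.gtf.gz\n' "$ENSEMBL_FTP" "$ENSEMBL_RELEASE"
}

# Genome (FASTA) URL; 'toplevel' is used because the commit downloads that file.
ssc_fasta_url() {
  printf '%s/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz\n' "$ENSEMBL_FTP"
}

# Dry run: print the commands the two processes would execute for 'ssc'.
print_ssc_commands() {
  gtf="$(ssc_gtf_url)"
  fasta="$(ssc_fasta_url)"
  echo "wget ${gtf}"
  echo "gunzip -f $(basename "$gtf")"
  echo "mv $(basename "$gtf" .gz) ssc.gtf"
  echo "wget ${fasta}"
  echo "gunzip -f $(basename "$fasta")"
  echo "mv $(basename "$fasta" .gz) ssc.fa"
}

print_ssc_commands
```

Printing the commands first makes it easy to compare them against the hunks above before anything is downloaded.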