Merge pull request #231 from hoelzer-lab/pig
Add Sus scrofa
hoelzer authored Feb 26, 2024
2 parents 094be4d + 203b76d commit 8cd6307
Showing 6 changed files with 36 additions and 11 deletions.
15 changes: 9 additions & 6 deletions README.md
@@ -194,7 +194,7 @@ nextflow pull hoelzer-lab/rnaflow -r <RELEASE>
nextflow run hoelzer-lab/rnaflow --reads input.csv --autodownload hsa --pathway hsa --max_cores 6 --cores 2
```

with `--autodownload <hsa|mmu|mau|eco>` [build-in species](#build-in-species), or define your own genome reference and annotation files in CSV files:
with `--autodownload <hsa|mmu|ssc|mau|eco>` [build-in species](#build-in-species), or define your own genome reference and annotation files in CSV files:

```bash
nextflow run hoelzer-lab/rnaflow --reads input.csv --genome fastas.csv --annotation gtfs.csv --max_cores 6 --cores 2
@@ -204,13 +204,13 @@ Genomes and annotations from `--autodownload`, `--genome` and `--annotation` are

By default, all possible comparisons are performed. Use `--deg` to change this.

`--pathway <hsa|mmu|mau>` performs downstream pathway analysis. Available are WebGestalt set enrichment analysis (GSEA) for `hsa`, piano GSEA with different settings and consensus scoring for `hsa`, `mmu` and `mau`.
`--pathway <hsa|mmu|mau|ssc>` performs downstream pathway analysis. Available are WebGestalt set enrichment analysis (GSEA) for `hsa`, `mmu` and `ssc`, piano GSEA with different settings and consensus scoring for `hsa`, `mmu`, `mau`, and `ssc`.

### Input files

#### Read files (required)

Specify your read files in `FASTQ` format with `--reads input.csv`. The file `input.csv` has to look like this for single-end reads (just leave R2 empty):
Specify your read files in `FASTQ` format with `--reads input.csv`. The file `input.csv` has to look like this for single-end reads:

```csv
Sample,R1,R2,Condition,Source,Strandedness
@@ -258,10 +258,11 @@ You can add a [build-in species](#build-in-species) to your defined genomes and
We provide a small set of build-in species for which the genome and annotation files are automatically downloaded from [Ensembl](https://www.ensembl.org/index.html) with `--autodownload xxx`. Please let us know, we can easily add other species.
| Species | three-letter shortcut | Genome | Annotation |
| Species | three-letter shortcut | Annotation | Genome |
| ------------ | --------------------- | ----------------------------------- | --------------------------------------------- |
| Homo sapiens | `hsa` <sup>*</sup> | Homo_sapiens.GRCh38.98 | Homo_sapiens.GRCh38.dna.primary_assembly |
| Mus musculus | `mmu` <sup>*</sup> | Mus_musculus.GRCm38.99 | Mus_musculus.GRCm38.dna.primary_assembly |
| Sus scrofa | `ssc` <sup>*</sup> | Sus_scrofa.Sscrofa11.1.111 | Sus_scrofa.Sscrofa11.1.dna.toplevel |
| Mesocricetus auratus | `mau` <sup>*</sup> | Mesocricetus_auratus.MesAur1.0.100 | Mesocricetus_auratus.MesAur1.0.dna.toplevel |
| Escherichia coli | `eco` | Escherichia_coli_k_12.ASM80076v1.45 | Escherichia_coli_k_12.ASM80076v1.dna.toplevel |
@@ -313,7 +314,7 @@ Nextflow will need access to the working directory where temporary calculations
--strand # strandness for counting with featureCounts: 0 (unstranded), 1 (stranded) and 2 (reversely stranded) [default 0]
--tpm # threshold for TPM (transcripts per million) filter [default 1]
--deg # a CSV file following the pattern: conditionX,conditionY
--pathway # perform different downstream pathway analysis for the species hsa|mmu|mau
--pathway # perform different downstream pathway analysis for the species hsa|mmu|mau|ssc
--feature_id_type # ID type for downstream analysis [default: ensembl_gene_id]
```
@@ -468,7 +469,7 @@ We provide `DESeq2` normalized, regularized log (rlog), variance stabilized (vsd
For each comparison (specified with `--deg` or, per default, all possible pairwise comparisons in one direction), a new folder `X_vs_Y` is created. This also describes the direction of the comparison, e.g., the log2FoldChange describes the change of a gene A under condition Y with respect to the gene under condition X. For example, a log2FoldChange of +2 for gene A would tell you that this gene is 4-fold upregulated when we compare condition X vs. condition Y. The gene A is higher expressed in samples belonging to condition X.
Downstream analysis (`--pathway xxx`) are currently provided for some species: GSEA consensus scoring with `piano` for *Homo sapiens* (`hsa`), *Mus musculus* (`mmu`) and *Mesocricetus auratus* (`mau`); and `WebGestalt` GSEA for *Homo sapiens* and *Mus musculus*.
Downstream analysis (`--pathway xxx`) are currently provided for some species: GSEA consensus scoring with `piano` for *Homo sapiens* (`hsa`), *Mus musculus* (`mmu`), *Mesocricetus auratus* (`mau`), and *Sus scrofa* (`ssc`); and `WebGestalt` GSEA for *Homo sapiens*, *Mus musculus*, and *Sus scrofa*.
## Working offline
@@ -518,6 +519,7 @@ Input:
- hsa [Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly | Homo_sapiens.GRCh38.98]
- eco [Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel | Escherichia_coli_k_12.ASM80076v1.45]
- mmu [Ensembl: Mus_musculus.GRCm38.dna.primary_assembly | Mus_musculus.GRCm38.99.gtf]
- ssc [Ensembl: Sus_scrofa.Sscrofa11.1.dna.toplevel | Sus_scrofa.Sscrofa11.1.111 ]
- mau [Ensembl: Mesocricetus_auratus.MesAur1.0.dna.toplevel | Mesocricetus_auratus.MesAur1.0.100]
--species Specifies the species identifier for downstream path analysis. (DEPRECATED)
If `--include_species` is set, reference genome and annotation are added and automatically downloaded. [default: ]
@@ -552,6 +554,7 @@ DEG analysis options:
- hsa | Homo sapiens
- mmu | Mus musculus
- mau | Mesocricetus auratus
- ssc | Sus scrofa
--feature_id_type ID type for downstream analysis [default: ensembl_gene_id]
Transcriptome assembly options:
2 changes: 2 additions & 0 deletions bin/piano.R
@@ -26,6 +26,8 @@ try.biomart <- try(
biomart.ensembl <- useMart('ensembl', dataset='mmusculus_gene_ensembl')
} else if (species == 'hsa') {
biomart.ensembl <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
} else if (species == 'ssc') {
biomart.ensembl <- useMart('ensembl', dataset='sscrofa_gene_ensembl')
} else if (species == 'mau') {
biomart.ensembl <- useMart('ensembl', dataset='mauratus_gene_ensembl')
} else {
4 changes: 3 additions & 1 deletion bin/webgestalt.R
@@ -18,6 +18,8 @@ if ( species == 'hsa' ){
organism <- "hsapiens"
} else if (species == 'mmu') {
organism <- "mmusculus"
} else if (species == 'ssc') {
organism <- "sscrofa"
} else {
organism <- NA
}
@@ -42,5 +44,5 @@ if (! is.na(organism)) {
print(paste('SKIPPING: WebGestaltR. Feature ID', id_type, 'not supported.'))
}
} else {
print("Unknown organism, only organisms 'hsapiens' and 'mmusculus' are supported by default. Exiting.")
print("Unknown organism, only organisms 'hsapiens', 'mmusculus', and 'sscrofa' are supported by default. Exiting.")
}
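
The two R hunks above each extend an if/else chain that maps the three-letter species code to a backend-specific name. As a rough illustration only (written in Python, not part of this commit), the same mapping can be sketched as lookup tables. The dataset and organism strings are taken from the diff; the dictionary names and the `pathway_backends` helper are hypothetical.

```python
# Hypothetical lookup tables mirroring the if/else chains in bin/piano.R and
# bin/webgestalt.R. The string values come from the diff; the names do not.
BIOMART_DATASETS = {
    "mmu": "mmusculus_gene_ensembl",
    "hsa": "hsapiens_gene_ensembl",
    "ssc": "sscrofa_gene_ensembl",
    "mau": "mauratus_gene_ensembl",
}

WEBGESTALT_ORGANISMS = {
    "hsa": "hsapiens",
    "mmu": "mmusculus",
    "ssc": "sscrofa",
}


def pathway_backends(species):
    """Return the backend names for a species code, or None where unsupported."""
    return {
        "biomart_dataset": BIOMART_DATASETS.get(species),
        "webgestalt_organism": WEBGESTALT_ORGANISMS.get(species),
    }


if __name__ == "__main__":
    for code in ("hsa", "mmu", "ssc", "mau"):
        print(code, pathway_backends(code))
```

In this sketch `mau` has a biomart dataset but no WebGestalt organism, which matches the README statement that WebGestalt GSEA covers only `hsa`, `mmu` and `ssc`.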
10 changes: 6 additions & 4 deletions main.nf
@@ -85,9 +85,9 @@ if (params.nanopore) {
}


Set species = ['hsa', 'eco', 'mmu', 'mau']
Set autodownload = ['hsa', 'eco', 'mmu', 'mau']
Set pathway = ['hsa', 'mmu', 'mau']
Set species = ['hsa', 'eco', 'mmu', 'mau', 'ssc']
Set autodownload = ['hsa', 'eco', 'mmu', 'mau', 'ssc']
Set pathway = ['hsa', 'mmu', 'mau', 'ssc']

if ( params.profile ) { exit 1, "--profile is WRONG use -profile" }

@@ -916,14 +916,15 @@ def helpMSG() {
${c_dim}Genomes and annotations from --autodownload, --genome and --annotation are concatenated.${c_reset}
${c_yellow}Input:${c_reset}
${c_green}--reads${c_reset} A CSV file following the pattern: Sample,R,Condition,Source for single-end or Sample,R1,R2,Condition,Source for paired-end
${c_green}--reads${c_reset} A CSV file following the pattern: Sample,R1,R2,Condition,Source,Strandedness (for single-end leave 'R2' column empty)
${c_dim}(check terminal output if correctly assigned)
Per default, all possible comparisons of conditions in one direction are made. Use --deg to change.${c_reset}
${c_green}--autodownload${c_reset} Specifies the species identifier for automated download [default: $params.autodownload]
${c_dim}Currently supported are:
- hsa [Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly | Homo_sapiens.GRCh38.98]
- eco [Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel | Escherichia_coli_k_12.ASM80076v1.45]
- mmu [Ensembl: Mus_musculus.GRCm38.dna.primary_assembly | Mus_musculus.GRCm38.99.gtf]
- ssc [Ensembl: Sus_scrofa.Sscrofa11.1.dna.toplevel | Sus_scrofa.Sscrofa11.1.111 ]
- mau [Ensembl: Mesocricetus_auratus.MesAur1.0.dna.toplevel | Mesocricetus_auratus.MesAur1.0.100]${c_reset}
${c_dim}--species Specifies the species identifier for downstream path analysis. (DEPRECATED)
If `--include_species` is set, reference genome and annotation are added and automatically downloaded. [default: $params.species]
@@ -960,6 +961,7 @@ def helpMSG() {
${c_dim}Currently supported are:
- hsa | Homo sapiens
- mmu | Mus musculus
- ssc | Sus scrofa
- mau | Mesocricetus auratus${c_reset}
--feature_id_type ID type for downstream analysis [default: $params.feature_id_type]
6 changes: 6 additions & 0 deletions modules/annotationGet.nf
@@ -27,6 +27,12 @@ process annotationGet {
gunzip -f Mus_musculus.GRCm38.99.gtf.gz
mv Mus_musculus.GRCm38.99.gtf ${species}.gtf
"""
else if (species == 'ssc')
"""
wget ftp://ftp.ensembl.org/pub/release-111/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.111.gtf.gz
gunzip -f Sus_scrofa.Sscrofa11.1.111.gtf.gz
mv Sus_scrofa.Sscrofa11.1.111.gtf ${species}.gtf
"""
else if (species == 'eco')
"""
wget ftp://ftp.ensemblgenomes.org/pub/release-45/bacteria//gtf/bacteria_90_collection/escherichia_coli_k_12/Escherichia_coli_k_12.ASM80076v1.45.gtf.gz
10 changes: 10 additions & 0 deletions modules/referenceGet.nf
@@ -27,6 +27,16 @@ process referenceGet {
gunzip -f Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
mv Mus_musculus.GRCm38.dna.primary_assembly.fa ${species}.fa
"""
else if (species == 'ssc')
"""
# Primary assembly contains all toplevel sequence regions excluding haplotypes and patches.
# This file is best used for performing sequence similarity searches where patch and haplotype
# sequences would confuse analysis. If the primary assembly file is not present, that
# indicates that there are no haplotype/patch regions, and the 'toplevel' file is equivalent.
wget ftp://ftp.ensembl.org/pub/release-111/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
gunzip -f Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
mv Sus_scrofa.Sscrofa11.1.dna.toplevel.fa ${species}.fa
"""
else if (species == 'eco')
"""
wget ftp://ftp.ensemblgenomes.org/pub/release-45/bacteria//fasta/bacteria_90_collection/escherichia_coli_k_12/dna/Escherichia_coli_k_12.ASM80076v1.dna.toplevel.fa.gz
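
For reference, the two `ssc` download URLs added in `modules/annotationGet.nf` and `modules/referenceGet.nf` follow the same release/assembly pattern. The Python sketch below only rebuilds those two strings from their parts; the function name and its parameters are hypothetical, and nothing here checks that other release/assembly combinations exist on the Ensembl FTP server.

```python
# Illustrative only: reconstructs the two Sus scrofa URLs that appear in the diff.
ENSEMBL_FTP = "ftp://ftp.ensembl.org/pub"


def ssc_download_urls(release=111, assembly="Sscrofa11.1"):
    """Return (gtf_url, fasta_url) for Sus scrofa, following the pattern in the diff."""
    base = f"{ENSEMBL_FTP}/release-{release}"
    gtf_url = f"{base}/gtf/sus_scrofa/Sus_scrofa.{assembly}.{release}.gtf.gz"
    # The toplevel FASTA is used because, per the comment in referenceGet.nf,
    # a missing primary assembly file means the toplevel file is equivalent.
    fasta_url = f"{base}/fasta/sus_scrofa/dna/Sus_scrofa.{assembly}.dna.toplevel.fa.gz"
    return gtf_url, fasta_url


if __name__ == "__main__":
    gtf, fasta = ssc_download_urls()
    print(gtf)
    print(fasta)
```

With the defaults, both strings are identical to the `wget` targets in the two `ssc` branches.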