Add Sus scrofa #231

Merged on Feb 26, 2024 (8 commits)
15 changes: 9 additions & 6 deletions README.md
@@ -194,7 +194,7 @@ nextflow pull hoelzer-lab/rnaflow -r <RELEASE>
nextflow run hoelzer-lab/rnaflow --reads input.csv --autodownload hsa --pathway hsa --max_cores 6 --cores 2
```

-with `--autodownload <hsa|mmu|mau|eco>` [build-in species](#build-in-species), or define your own genome reference and annotation files in CSV files:
+with `--autodownload <hsa|mmu|ssc|mau|eco>` [build-in species](#build-in-species), or define your own genome reference and annotation files in CSV files:

```bash
nextflow run hoelzer-lab/rnaflow --reads input.csv --genome fastas.csv --annotation gtfs.csv --max_cores 6 --cores 2
@@ -204,13 +204,13 @@ Genomes and annotations from `--autodownload`, `--genome` and `--annotation` are

By default, all possible comparisons are performed. Use `--deg` to change this.

-`--pathway <hsa|mmu|mau>` performs downstream pathway analysis. Available are WebGestalt set enrichment analysis (GSEA) for `hsa`, piano GSEA with different settings and consensus scoring for `hsa`, `mmu` and `mau`.
+`--pathway <hsa|mmu|mau|ssc>` performs downstream pathway analysis. Available are WebGestalt set enrichment analysis (GSEA) for `hsa`, `mmu` and `ssc`, piano GSEA with different settings and consensus scoring for `hsa`, `mmu`, `mau`, and `ssc`.

### Input files

#### Read files (required)

-Specify your read files in `FASTQ` format with `--reads input.csv`. The file `input.csv` has to look like this for single-end reads (just leave R2 empty):
+Specify your read files in `FASTQ` format with `--reads input.csv`. The file `input.csv` has to look like this for single-end reads:

```csv
Sample,R1,R2,Condition,Source,Strandedness
@@ -258,10 +258,11 @@ You can add a [build-in species](#build-in-species) to your defined genomes and

We provide a small set of build-in species for which the genome and annotation files are automatically downloaded from [Ensembl](https://www.ensembl.org/index.html) with `--autodownload xxx`. Please let us know, we can easily add other species.

-| Species | three-letter shortcut | Genome | Annotation |
+| Species | three-letter shortcut | Annotation | Genome |
| ------------ | --------------------- | ----------------------------------- | --------------------------------------------- |
| Homo sapiens | `hsa` <sup>*</sup> | Homo_sapiens.GRCh38.98 | Homo_sapiens.GRCh38.dna.primary_assembly |
| Mus musculus | `mmu` <sup>*</sup> | Mus_musculus.GRCm38.99 | Mus_musculus.GRCm38.dna.primary_assembly |
+| Sus scrofa | `ssc` <sup>*</sup> | Sus_scrofa.Sscrofa11.1.111 | Sus_scrofa.Sscrofa11.1.dna.toplevel |
| Mesocricetus auratus | `mau` <sup>*</sup> | Mesocricetus_auratus.MesAur1.0.100 | Mesocricetus_auratus.MesAur1.0.dna.toplevel |
| Escherichia coli | `eco` | Escherichia_coli_k_12.ASM80076v1.45 | Escherichia_coli_k_12.ASM80076v1.dna.toplevel |

@@ -313,7 +314,7 @@ Nextflow will need access to the working directory where temporary calculations
--strand # strandness for counting with featureCounts: 0 (unstranded), 1 (stranded) and 2 (reversely stranded) [default 0]
--tpm # threshold for TPM (transcripts per million) filter [default 1]
--deg # a CSV file following the pattern: conditionX,conditionY
---pathway # perform different downstream pathway analysis for the species hsa|mmu|mau
+--pathway # perform different downstream pathway analysis for the species hsa|mmu|mau|ssc
--feature_id_type # ID type for downstream analysis [default: ensembl_gene_id]
```

@@ -468,7 +469,7 @@ We provide `DESeq2` normalized, regularized log (rlog), variance stabilized (vsd

For each comparison (specified with `--deg` or, per default, all possible pairwise comparisons in one direction), a new folder `X_vs_Y` is created. This also describes the direction of the comparison, e.g., the log2FoldChange describes the change of a gene A under condition Y with respect to the gene under condition X. For example, a log2FoldChange of +2 for gene A would tell you that this gene is 2-fold upregulated when we compare condition X vs. condition Y. The gene A is higher expressed in samples belonging to condition X.

-Downstream analysis (`--pathway xxx`) are currently provided for some species: GSEA consensus scoring with `piano` for *Homo sapiens* (`hsa`), *Mus musculus* (`mmu`) and *Mesocricetus auratus* (`mau`); and `WebGestalt` GSEA for *Homo sapiens* and *Mus musculus*.
+Downstream analysis (`--pathway xxx`) are currently provided for some species: GSEA consensus scoring with `piano` for *Homo sapiens* (`hsa`), *Mus musculus* (`mmu`), *Mesocricetus auratus* (`mau`), and *Sus scrofa* (`ssc`); and `WebGestalt` GSEA for *Homo sapiens*, *Mus musculus*, and *Sus scrofa*.

## Working offline

@@ -518,6 +519,7 @@ Input:
- hsa [Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly | Homo_sapiens.GRCh38.98]
- eco [Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel | Escherichia_coli_k_12.ASM80076v1.45]
- mmu [Ensembl: Mus_musculus.GRCm38.dna.primary_assembly | Mus_musculus.GRCm38.99.gtf]
+- ssc [Ensembl: Sus_scrofa.Sscrofa11.1.dna.toplevel | Sus_scrofa.Sscrofa11.1.111 ]
- mau [Ensembl: Mesocricetus_auratus.MesAur1.0.dna.toplevel | Mesocricetus_auratus.MesAur1.0.100]
--species Specifies the species identifier for downstream path analysis. (DEPRECATED)
If `--include_species` is set, reference genome and annotation are added and automatically downloaded. [default: ]
@@ -552,6 +554,7 @@ DEG analysis options:
- hsa | Homo sapiens
- mmu | Mus musculus
- mau | Mesocricetus auratus
+- ssc | Sus scrofa
--feature_id_type ID type for downstream analysis [default: ensembl_gene_id]

Transcriptome assembly options:
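
Review note: with `ssc` added to both option lists, a pig run would presumably mirror the README invocation with the species code swapped in. The sketch below is only an illustration. `is_supported` and `build_rnaflow_cmd` are hypothetical helpers, not part of the pipeline, and the two lists are copied from the `autodownload` and `pathway` sets in `main.nf` as changed in this PR.

```shell
#!/bin/sh
# Hypothetical helpers (not part of rnaflow); lists mirror the Sets in main.nf after this PR.
supported_autodownload="hsa eco mmu mau ssc"
supported_pathway="hsa mmu mau ssc"

# is_supported CODE "LIST": succeeds if CODE is one of the space-separated entries in LIST.
is_supported() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# build_rnaflow_cmd CODE: prints the README-style command line for a built-in species.
# --pathway is only appended when the species supports downstream pathway analysis.
build_rnaflow_cmd() {
  species="$1"
  if ! is_supported "$species" "$supported_autodownload"; then
    echo "unsupported --autodownload species: $species" >&2
    return 1
  fi
  cmd="nextflow run hoelzer-lab/rnaflow --reads input.csv --autodownload $species"
  if is_supported "$species" "$supported_pathway"; then
    cmd="$cmd --pathway $species"
  fi
  echo "$cmd --max_cores 6 --cores 2"
}

build_rnaflow_cmd ssc
```

The helper only prints the command, so it can be checked without Nextflow installed; `eco` yields a command without `--pathway`, since it is not in the pathway set.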
2 changes: 2 additions & 0 deletions bin/piano.R
@@ -26,6 +26,8 @@ try.biomart <- try(
biomart.ensembl <- useMart('ensembl', dataset='mmusculus_gene_ensembl')
} else if (species == 'hsa') {
biomart.ensembl <- useMart('ensembl', dataset='hsapiens_gene_ensembl')
+} else if (species == 'ssc') {
+biomart.ensembl <- useMart('ensembl', dataset='sscrofa_gene_ensembl')
} else if (species == 'mau') {
biomart.ensembl <- useMart('ensembl', dataset='mauratus_gene_ensembl')
} else {
4 changes: 3 additions & 1 deletion bin/webgestalt.R
@@ -18,6 +18,8 @@ if ( species == 'hsa' ){
organism <- "hsapiens"
} else if (species == 'mmu') {
organism <- "mmusculus"
+} else if (species == 'ssc') {
+organism <- "sscrofa"
} else {
organism <- NA
}
@@ -42,5 +44,5 @@ if (! is.na(organism)) {
print(paste('SKIPPING: WebGestaltR. Feature ID', id_type, 'not supported.'))
}
} else {
-print("Unknown organism, only organisms 'hsapiens' and 'mmusculus' are supported by default. Exiting.")
+print("Unknown organism, only organisms 'hsapiens', 'mmusculus', and 'sscrofa' are supported by default. Exiting.")
}
10 changes: 6 additions & 4 deletions main.nf
@@ -85,9 +85,9 @@ if (params.nanopore) {
}


-Set species = ['hsa', 'eco', 'mmu', 'mau']
-Set autodownload = ['hsa', 'eco', 'mmu', 'mau']
-Set pathway = ['hsa', 'mmu', 'mau']
+Set species = ['hsa', 'eco', 'mmu', 'mau', 'ssc']
+Set autodownload = ['hsa', 'eco', 'mmu', 'mau', 'ssc']
+Set pathway = ['hsa', 'mmu', 'mau', 'ssc']

if ( params.profile ) { exit 1, "--profile is WRONG use -profile" }

@@ -916,14 +916,15 @@ def helpMSG() {
${c_dim}Genomes and annotations from --autodownload, --genome and --annotation are concatenated.${c_reset}

${c_yellow}Input:${c_reset}
-${c_green}--reads${c_reset} A CSV file following the pattern: Sample,R,Condition,Source for single-end or Sample,R1,R2,Condition,Source for paired-end
+${c_green}--reads${c_reset} A CSV file following the pattern: Sample,R1,R2,Condition,Source,Strandedness (for single-end leave 'R2' column empty)
${c_dim}(check terminal output if correctly assigned)
Per default, all possible comparisons of conditions in one direction are made. Use --deg to change.${c_reset}
${c_green}--autodownload${c_reset} Specifies the species identifier for automated download [default: $params.autodownload]
${c_dim}Currently supported are:
- hsa [Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly | Homo_sapiens.GRCh38.98]
- eco [Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel | Escherichia_coli_k_12.ASM80076v1.45]
- mmu [Ensembl: Mus_musculus.GRCm38.dna.primary_assembly | Mus_musculus.GRCm38.99.gtf]
+- ssc [Ensembl: Sus_scrofa.Sscrofa11.1.dna.toplevel | Sus_scrofa.Sscrofa11.1.111 ]
- mau [Ensembl: Mesocricetus_auratus.MesAur1.0.dna.toplevel | Mesocricetus_auratus.MesAur1.0.100]${c_reset}
${c_dim}--species Specifies the species identifier for downstream path analysis. (DEPRECATED)
If `--include_species` is set, reference genome and annotation are added and automatically downloaded. [default: $params.species]
@@ -960,6 +961,7 @@ def helpMSG() {
${c_dim}Currently supported are:
- hsa | Homo sapiens
- mmu | Mus musculus
+- ssc | Sus scrofa
- mau | Mesocricetus auratus${c_reset}
--feature_id_type ID type for downstream analysis [default: $params.feature_id_type]

6 changes: 6 additions & 0 deletions modules/annotationGet.nf
@@ -27,6 +27,12 @@ process annotationGet {
gunzip -f Mus_musculus.GRCm38.99.gtf.gz
mv Mus_musculus.GRCm38.99.gtf ${species}.gtf
"""
+else if (species == 'ssc')
+"""
+wget ftp://ftp.ensembl.org/pub/release-111/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.111.gtf.gz
+gunzip -f Sus_scrofa.Sscrofa11.1.111.gtf.gz
+mv Sus_scrofa.Sscrofa11.1.111.gtf ${species}.gtf
+"""
else if (species == 'eco')
"""
wget ftp://ftp.ensemblgenomes.org/pub/release-45/bacteria//gtf/bacteria_90_collection/escherichia_coli_k_12/Escherichia_coli_k_12.ASM80076v1.45.gtf.gz
10 changes: 10 additions & 0 deletions modules/referenceGet.nf
@@ -27,6 +27,16 @@ process referenceGet {
gunzip -f Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
mv Mus_musculus.GRCm38.dna.primary_assembly.fa ${species}.fa
"""
+else if (species == 'ssc')
+"""
+# Primary assembly contains all toplevel sequence regions excluding haplotypes and patches.
+# This file is best used for performing sequence similarity searches where patch and haplotype
+# sequences would confuse analysis. If the primary assembly file is not present, that
+# indicates that there are no haplotype/patch regions, and the 'toplevel' file is equivalent.
+wget ftp://ftp.ensembl.org/pub/release-111/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
+gunzip -f Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
+mv Sus_scrofa.Sscrofa11.1.dna.toplevel.fa ${species}.fa
+"""
else if (species == 'eco')
"""
wget ftp://ftp.ensemblgenomes.org/pub/release-45/bacteria//fasta/bacteria_90_collection/escherichia_coli_k_12/dna/Escherichia_coli_k_12.ASM80076v1.dna.toplevel.fa.gz
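
Review note: the two new `ssc` download steps follow the same Ensembl FTP layout as the existing `mmu` branch (release number, lowercase species directory, file stem). As a rough sketch of that pattern, the hypothetical helpers below rebuild the two URLs added in this PR from their parts. The function names and argument order are assumptions made for illustration; the modules themselves hard-code the URLs.

```shell
#!/bin/sh
# Hypothetical helpers (not part of rnaflow) that rebuild the Ensembl FTP URLs used above.

# ensembl_gtf_url RELEASE SPECIES_DIR STEM
#   e.g. 111 sus_scrofa Sus_scrofa.Sscrofa11.1
ensembl_gtf_url() {
  echo "ftp://ftp.ensembl.org/pub/release-$1/gtf/$2/$3.$1.gtf.gz"
}

# ensembl_fasta_url RELEASE SPECIES_DIR STEM [ASSEMBLY_TYPE]
#   ASSEMBLY_TYPE defaults to "toplevel"; the mmu branch uses "primary_assembly".
ensembl_fasta_url() {
  echo "ftp://ftp.ensembl.org/pub/release-$1/fasta/$2/dna/$3.dna.${4:-toplevel}.fa.gz"
}

# Prints the two URLs fetched by the new ssc branches.
ensembl_gtf_url 111 sus_scrofa Sus_scrofa.Sscrofa11.1
ensembl_fasta_url 111 sus_scrofa Sus_scrofa.Sscrofa11.1
```

Nothing is downloaded here; the functions only print strings, which makes it easy to compare them against the `wget` lines in `annotationGet.nf` and `referenceGet.nf`.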