Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Seqkit #59

Draft
wants to merge 11 commits into
base: dev
Choose a base branch
from
Draft

Seqkit #59

wants to merge 11 commits into from

Conversation

sarahjeeeze
Copy link

@sarahjeeeze sarahjeeeze commented Oct 29, 2024

Add seqkit stats nf-core module

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/seqinspector branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nf-test test main.nf.test -profile test,docker).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Copy link

github-actions bot commented Oct 29, 2024

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 76b882d

+| ✅ 191 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗  21 tests had warnings |!

❗ Test warnings:

  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file
  • pipeline_todos - TODO string in nextflow.config: Specify your pipeline's command line flags
  • pipeline_todos - TODO string in nextflow.config: Optionally, you can add a pipeline-specific nf-core config at https://github.com/nf-core/configs
  • pipeline_todos - TODO string in README.md: TODO nf-core:
  • pipeline_todos - TODO string in README.md: Include a figure that guides the user through the major workflow steps. Many nf-core
  • pipeline_todos - TODO string in README.md: Fill in short bullet-pointed list of the default steps in the pipeline
  • pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
  • pipeline_todos - TODO string in README.md: Add bibliography of tools and data used in your pipeline
  • pipeline_todos - TODO string in usage.md: Add documentation about anything specific to running your pipeline. For general topics, please point to (and add to) the main nf-core website.
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in test_full.config: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA)
  • pipeline_todos - TODO string in test_full.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in test.config: Specify the paths to your test data on nf-core/test-datasets
  • pipeline_todos - TODO string in test.config: Give any required params for the test so that command line flags are not needed
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2024-10-30 10:41:32

Copy link
Member

@MatthiasZepper MatthiasZepper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution!

Ultimately, the pipeline will allow to flexibly choose the tools that are run, so having more tools in the pipeline is always great, but what exactly was your rationale here?

Admittedly, I just skimmed over the description, but to me, it seems that it is predominantly meant to run on FASTA files and for judging the quality of genome assemblies. Unless I missed some relevant arguments, its application on sequencing reads is in my opinion does not really yield too many meaningful statistics - at least for Illumina, for Nanopore probably yes.

Do you happen to have a Nanopore example? For Illumina, this is basically all you get:

file	format	type	num_seqs	sum_len	min_len	avg_len	max_len
S11_L001_R1_001.fastq.gz	FASTQ	DNA	35875355	5417178605	151	151.0	151
S11_L001_R1_001.fastq.gz	FASTQ	DNA	35875355	322878195	9	9.0	9
S11_L001_R2_001.fastq.gz	FASTQ	DNA	35875355	5417178605	151	151.0	151

But with regard to the code, this already looks very good and tidy!

I am only missing a publishDir directive in the config, so that the reports from the stats output channel are published in a subfolder of the outdir.

CHANGELOG.md Outdated Show resolved Hide resolved
CITATIONS.md Show resolved Hide resolved
@@ -22,6 +22,10 @@ process {
ext.args = '--quiet'
}

withName: SEQKIT_STATS {
ext.args = ''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you not want to publish the results as well in the outdir? Or only MultiQC?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it is defined here - I guess you would only overwrite it if you dont want your output published in the standard way.

publishDir = [

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooh fancy - a central publishDir directive for all modules ?!? That's smart and has totally escaped me...sorry then for the false accusations!

//
// MODULE: Run SEQKIT_STATS
//
SEQKIT_STATS (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, you are neither mixing the output into the MultiQC channel nor publishing it in the outdir. Unless I am missing something, the results of the run are therefore not used at all?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah am outputting them

docs/output.md Outdated

</details>

[SeqkitStats](https://bioinf.shenwei.me/seqkit/usage/#stats) it gives general quality metrics about your sequenced reads including average read lengths, GC(%) and n50's. For further reading and documentation see the [Seqkit help pages]([Seqkit help](https://bioinf.shenwei.me/seqkit/)).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is true that SeqkitStats computes some quality metrics, but to my best knowledge it is more useful for FASTA files and genomic assemblies than sequencing reads?

For example, for an Illumina run, an average read length prior to trimming should be known, because it corresponds to the number of cycles. For Nanopore, admittedly, such a statistic is more useful, so you may want to structure the documentation accordingly?

Copy link
Author

@sarahjeeeze sarahjeeeze Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated text a little, should i specifically mention this is more useful for nanopore data? for ilumina etc you do still get n50's etc but agree it is less useful - want me to add a nanopore test? :D

file	format	type	num_seqs	sum_len	min_len	avg_len	max_len	Q1	Q2	Q3	sum_gap	N50	N50_num	Q20(%)	Q30(%)	AvgQual	GC(%)
sample1_R1.fastq.gz	FASTQ	DNA	1377513	411872902	35	299.0	301	300.0	301.0	301.0	0	301	1	99.10	97.54	29.31	38.53
sample1_R2.fastq.gz	FASTQ	DNA	1377513	411840994	35	299.0	301	300.0	301.0	301.0	0	301	1	97.11	93.54	25.78	38.54

docs/output.md Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants