To give eBook authors a better sense of the workflow layout, we briefly describe the purpose of each directory below.
- cache: Intermediate datasets or results generated during the preprocessing steps.
- graphs: The graphs/figures produced during the analysis.
- input: The raw input data. Files larger than 100 MB are not allowed. We recommend using a small sample dataset to illustrate the workflow. If you have files larger than 100 MB, please contact the chapter editor to find a solution.
- lib: The source code, functions, or algorithms used within the workflow.
- output: The final output results of the workflow.
- workflow: The step-by-step pipeline. It may contain sub-directories.
  - We suggest using a numbering system and keywords to indicate the order and the main purpose of the scripts, e.g., 1_fastq_quality_checking.py, 2_cleaned_reads_alignment.py.
  - To ensure reproducibility, please use relative paths within the workflow.
- README: In the readme file, please briefly describe the purpose of the repository, the installation, and the input data format.
  - We recommend using a diagram to describe the workflow briefly.
  - Provide the installation details.
  - Show a small portion of the input data (e.g., the head or tail of the file), unless the data file is in a well-known standard format.
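The layout conventions above (numbered scripts, relative paths, and the input size limit) can be checked with a few lines of Python. This is only a minimal sketch: the helper names `repo_path`, `ordered_scripts`, and `oversized_inputs` are hypothetical, and reading "100M" as 100 * 1024**2 bytes is an assumption.

```python
import re
from pathlib import Path

# Assumption: the "100M" guideline is read as 100 * 1024**2 bytes.
MAX_INPUT_BYTES = 100 * 1024 * 1024


def repo_path(script_file, *parts):
    """Build a path relative to the repository root.

    Assumes the calling script lives directly inside workflow/,
    so the repository root is the script's grandparent directory.
    """
    return Path(script_file).resolve().parent.parent.joinpath(*parts)


def ordered_scripts(workflow_dir):
    """Return numbered script names sorted by their numeric prefix."""
    numbered = []
    for path in Path(workflow_dir).iterdir():
        match = re.match(r"(\d+)_", path.name)
        if path.is_file() and match:
            numbered.append((int(match.group(1)), path.name))
    return [name for _, name in sorted(numbered)]


def oversized_inputs(input_dir, limit=MAX_INPUT_BYTES):
    """Return the names of files in input_dir larger than limit bytes."""
    return sorted(
        p.name
        for p in Path(input_dir).iterdir()
        if p.is_file() and p.stat().st_size > limit
    )
```

Inside a workflow script, `repo_path(__file__, "input", "reads1.fastq")` would then resolve the input file regardless of the directory the script is launched from. Sorting on the integer prefix rather than the raw file name keeps `10_...` after `2_...`.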
This is an example workflow to check the quality of paired-end FASTQ files using the FastQC software.
- Running environment:
  - The workflow was built on a Linux system running an Oracle Java Runtime Environment (JRE), v1.6 to 1.8.
- Required software and versions:
  - FastQC v0.11.9
  - multiqc
  - R 3.6.3 for plotting the results
The example data used here are paired-end FASTQ files generated on an Illumina platform.
- R1 FASTQ file: input/reads1.fastq
- R2 FASTQ file: input/reads2.fastq
Each entry in a FASTQ file consists of 4 lines:
- A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary based on the BCL to FASTQ conversion software used.
- The sequence (the base calls; A, C, T, G and N).
- A separator, which is simply a plus (+) sign.
- The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.
The first entry of the input data:
@HWI-ST361_127_1000138:2:1101:1195:2141/1
CGTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGAGGGGTTNNNNNNNNNNNNNNN
+
[[[_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
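The four-line record structure and the quality encoding described above can be illustrated with a short Python sketch. The names `FastqRecord`, `read_fastq`, and `phred_scores` are hypothetical, and the record used in the usage lines is a toy example, not taken from the input data. The offset defaults to 33 as stated above but is left as a parameter, since older Illumina pipelines (1.3 to 1.7) wrote Phred+64 scores.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # header line without the leading "@"
    sequence: str    # base calls (A, C, T, G, N)
    quality: str     # ASCII-encoded quality string


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines, checking the basic layout."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to numeric Phred scores."""
    return [ord(char) - offset for char in quality]


# Toy record for illustration only.
toy = "@read1/1\nACGTN\n+\nIIII#\n"
record = next(read_fastq(io.StringIO(toy)))
print(phred_scores(record.quality))  # [40, 40, 40, 40, 2]
```

If decoded scores look implausibly high for your data (for example, well above 41), it may be worth checking whether the file uses the older +64 offset.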
- Note that you may need to adjust the paths in the shell scripts so that they resolve correctly on your system.
sh workflow/1_run_fastqc.sh
sh workflow/2_aggregate_results.sh
- Results can be viewed by opening output/multiqc_report.html.
- Alternatively, you can plot the results yourself using the R code in 3_visualize_results.Rmd.
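The contents of the two shell scripts are not shown here. As a rough sketch of what such steps might invoke, the Python snippet below assembles a FastQC call followed by a MultiQC call. The function name `build_commands` and the default `output` directory are assumptions for illustration; only the `-o` (output directory) option of each tool is used.

```python
import shlex


def build_commands(reads, output_dir="output"):
    """Return the FastQC and MultiQC command lines as argument lists.

    Nothing is executed here; the lists can be passed to
    subprocess.run(cmd, check=True) once the tools are installed.
    """
    # FastQC writes one report per input file into output_dir,
    # which is expected to exist before the tool is run.
    fastqc = ["fastqc", "-o", str(output_dir)] + [str(r) for r in reads]
    # MultiQC scans output_dir for FastQC reports and aggregates them.
    multiqc = ["multiqc", str(output_dir), "-o", str(output_dir)]
    return [fastqc, multiqc]


for command in build_commands(["input/reads1.fastq", "input/reads2.fastq"]):
    print(shlex.join(command))
```

Using relative paths such as `input/reads1.fastq` means the commands should be run from the repository root.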
This is free and open-source software, licensed under (choose a license from the suggested list: GPLv3, MIT, or CC BY 4.0).