Data, Code and Workflows Guideline

To give eBook authors a better sense of the workflow layout, we briefly describe the purpose of each directory.

cache: Intermediate datasets or results generated during the preprocessing steps.
graphs: The graphs/figures produced during the analysis.
input: The raw input data. Files larger than 100 MB are not allowed. We recommend using a small sample dataset to illustrate the workflow. If you have files larger than 100 MB, please contact the chapter editor to find a solution.
lib: The source code, functions, or algorithms used within the workflow.
output: The final output results of the workflow.
workflow: The step-by-step pipeline. It may contain sub-directories.
- We suggest using a numbering system and keywords to indicate the order and main purpose of each script, e.g., 1_fastq_quality_checking.py, 2_cleaned_reads_alignment.py.
- To ensure reproducibility, please use relative paths within the workflow.
README: In the readme file, please briefly describe the purpose of the repository, the installation, and the input data format.
- We recommend using a diagram to describe the workflow briefly.
- Provide the installation details.
- Show a small portion of the input data (e.g., the head or tail of the file) unless the data file is in a well-known standard format.
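
The layout conventions above (relative paths, numbered scripts, and the input size limit) can be sketched in a few helper functions. This is a minimal illustration, not part of the template itself: the function names are hypothetical, the 100 MB threshold is written as 100 * 1024 * 1024 bytes by assumption, and `project_root` assumes the calling script sits directly inside workflow/.

```python
from pathlib import Path
from typing import List

# Assumed interpretation of the size limit stated for the input directory.
MAX_INPUT_BYTES = 100 * 1024 * 1024


def project_root(script_file: str) -> Path:
    """Return the repository root, assuming the script lives directly in workflow/."""
    return Path(script_file).resolve().parent.parent


def oversized_inputs(input_dir: Path, limit: int = MAX_INPUT_BYTES) -> List[Path]:
    """List files under input_dir whose size exceeds the limit."""
    return sorted(
        p for p in input_dir.rglob("*") if p.is_file() and p.stat().st_size > limit
    )


def ordered_scripts(workflow_dir: Path) -> List[Path]:
    """Sort scripts by their numeric prefix so that 10_... comes after 2_...."""

    def step_number(path: Path) -> float:
        prefix = path.name.split("_", 1)[0]
        # Files without a numeric prefix are placed last.
        return int(prefix) if prefix.isdigit() else float("inf")

    scripts = (p for p in workflow_dir.iterdir() if p.is_file())
    return sorted(scripts, key=lambda p: (step_number(p), p.name))
```

Inside a workflow script, `project_root(__file__) / "input" / "reads1.fastq"` would then locate the data without relying on an absolute path or on the current working directory.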

Overview of an example workflow: Fastq data quality checking

This is an example workflow that checks the quality of paired-end FASTQ files using the FastQC software.

Installation

Running environment:
- The workflow was built on a Linux system running the Oracle Java Runtime Environment (JRE) v1.6 to v1.8.
Required software and versions:
- FastQC v0.11.9
- MultiQC
- R 3.6.3 for plotting the results
  - RStudio 1.4, ggplot2 3.3.3, tidyr 1.1.2

Input Data

The example data used here are paired-end FASTQ files generated on an Illumina platform.

R1 FASTQ file: input/reads1.fastq
R2 FASTQ file: input/reads2.fastq

Each entry in a FASTQ file consists of 4 lines:

A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary based on the BCL to FASTQ conversion software used.
The sequence (the base calls; A, C, T, G and N).
A separator, which is simply a plus (+) sign.
The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.

The first entry of the input data:

@HWI-ST361_127_1000138:2:1101:1195:2141/1
CGTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGAGGGGTTNNNNNNNNNNNNNNN
+
[[[_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
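
To make the four-line layout concrete, the sketch below reads FASTQ entries and decodes the quality line under the Phred+33 encoding described above (quality score = ASCII code minus 33). It is a minimal illustration only: `FastqRecord` and `read_fastq` are hypothetical names, and the record used in the usage lines is made up rather than taken from the example data.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry, decoding Phred+33 qualities."""
    lines = (line.rstrip("\r\n") for line in handle)
    for header in lines:
        if not header:
            continue  # tolerate blank lines between or after entries
        sequence = next(lines, None)
        separator = next(lines, None)
        quality = next(lines, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ entry")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ entry")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, [ord(c) - 33 for c in quality])


# Hypothetical record, for illustration only.
example = io.StringIO("@read1/1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
print(records[0].qualities)  # [40, 40, 40, 40, 2]
```

For real analyses an established parser is usually preferable; this sketch only shows how the four lines relate to each other.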

Major steps

Step 1: run FastQC to conduct quality checking

Note that you need to adjust the paths in the shell script so that they match your local setup.

sh workflow/1_run_fastqc.sh
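
The contents of workflow/1_run_fastqc.sh are not shown here, so the following is only a sketch of what such a step could look like when driven from Python. It assumes the `fastqc` executable is on the PATH and uses its `--outdir` option; the helper names and the choice of output directory are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_fastqc_command(reads: Sequence[Path], outdir: Path) -> List[str]:
    """Assemble a FastQC call that writes its reports into outdir."""
    return ["fastqc", "--outdir", str(outdir), *[str(r) for r in reads]]


def run_fastqc(reads: Sequence[Path], outdir: Path) -> None:
    """Create the output directory if needed, then run FastQC on the reads."""
    outdir.mkdir(parents=True, exist_ok=True)  # FastQC expects the directory to exist
    subprocess.run(build_fastqc_command(reads, outdir), check=True)
```

Calling `run_fastqc([Path("input/reads1.fastq"), Path("input/reads2.fastq")], Path("cache/fastqc"))` from the repository root would keep every path relative, where cache/fastqc is an assumed location for the intermediate reports.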

Step 2: aggregate results from FastQC

sh workflow/2_aggregate_results.sh
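
As with the previous step, the actual aggregation script is not reproduced here. A minimal sketch, assuming the `multiqc` executable is on the PATH and that the FastQC reports live in a directory of your choosing, could assemble the call as follows; the helper names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_multiqc_command(analysis_dir: Path, outdir: Path) -> List[str]:
    """Assemble a MultiQC call that scans analysis_dir and writes into outdir."""
    return ["multiqc", str(analysis_dir), "--outdir", str(outdir)]


def expected_report(outdir: Path) -> Path:
    """Path of the report under MultiQC's default file name."""
    return outdir / "multiqc_report.html"


def run_multiqc(analysis_dir: Path, outdir: Path) -> Path:
    """Run MultiQC and return the path where the report is expected."""
    subprocess.run(build_multiqc_command(analysis_dir, outdir), check=True)
    return expected_report(outdir)
```

With `outdir` set to `Path("output")`, the expected report path matches the output/multiqc_report.html file mentioned in Step 3.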

Step 3: view the results

Results can be viewed by opening output/multiqc_report.html in a web browser.
Alternatively, you can plot the results yourself using the R code below.

3_visualize_results.Rmd

Expected results

License

This is free and open-source software, licensed under (choose a license from the suggested list: GPLv3, MIT, or CC BY 4.0).
