Primary repo and development on https://gitlab.com/aschuerch/bactofidia.git
bactofidia is a bacterial assembly and basic analysis pipeline using Snakemake and bioconda. It is currently written for paired-end Illumina data with length 250 or 150. The pipeline is written to ensure reproducibility, and creates virtual software environments with the software versions that are used for analysis.
bactofidia runs under bash and relies on software available from bioconda and a (mini)conda installation. If conda is not present, the script will suggest a miniconda installation.
Clone this repository with
git clone https://gitlab.com/aschuerch/bactofidia.git bactofidia_[myproject]
where [myproject] is the name of your project.
Copy or symlink your paired-end read sequencing files (Sample1_R1.fastq.gz, Sample1_R2.fastq.gz, Sample2_R1.fastq.gz and Sample2_R2.fastq.gz) to the bactofidia_[myproject] directory. After succesful execution of the pipeline, these files will be removed from this folder. Make sure they have been stored elsewhere.
The first underscore in the sample names is regarded as the delimiter for the sample name. Avoid other underscores in the sample names.
Run the pipeline with
./bactofidia.sh Sample1_R1.fastq.gz Sample1_R2.fastq.gz Sample2_R1.fastq.gz Sample2_R2.fastq.gz
or
./bactofidia.sh ALL
The pipeline takes Illumina paired-end sequencing reads as compressed sequencing files (.fastq.gz) which must be present in the same directory from where the script is run.
The config.yaml or config_miseq.yaml files in the config/ directory can be adjusted for parameters of the different tools.
The different versions of the packages that are run are defined in the envs/
directory.
The first time bactofidia is run, it generates all virtual environments which can take a considerable time depending on the speed of your internet connection. Do not interrupt this process!
For debugging or testing purposes, the pipeline itself can be dry-run with
./dryrun_bactofidia.sh
The pipeline takes compressed sequencing files (.fastq.gz) which must be present in the same directory from where the script is called.
Test the whole pipeline with:
ln -s test/Test*gz .
./bactofidia.sh Test_R1.fastq.gz Test_R2.fastq.gz
This will run the pipeline on the included test files.
Currently it runs:
- quality check before trimming using fastqc
- trimming with trimgalore
- assembly with spades
- mlst with mlst
- resistance gene determination with abricate using the resfinder database
- annotation (general, not genus-specific) with prokka
- quality assessment of assembly with quast
- coverage estimation with bbmap2
- summarizing report with multiqc
Running only
./bactofidia.sh
will give an explanation of the (limited) options.
The output can be found in the 'results' directory which contains the following files representing the output of the different tools
results/
├── assembly_graphs
│ ├── ... # contains the assembly graphs of each scaffold in gfa format
│
├── config
│ ├── ... # contains all configuration definitions, e.g. parameter choices
│
├── envs
│ ├── ... # contains all environment definitions, e.g. used versions of programs
│
├── scaffolds
│ ├── Test.fna # scaffolds
│
└── stats
├── annotated # results of PROKKA
│ ├── Test
│ ├── Test.err
│ ├── Test.faa
│ ├── Test.ffn
│ ├── Test.fna
│ ├── Test.fsa
│ ├── Test.gbk
│ ├── Test.gff
│ ├── Test.log
│ ├── Test.sqn
│ ├── Test.tbl
│ ├── Test.tsv
│ └── Test.txt
│
├── CoverageStatistics_summary.tsv # Summary of coverage for all sample
├── Extra
│ ├── CoverageStatistics_Test.txt # Coverage details on each sample
│
├── MLST.tsv # MLST results in table format
├── MultiQC_report_data.zip # Contains data to generate the MultiQC report
├── MultiQC_report.html # Check this file first for overview of quality
└── ResFinder.tsv # Resistance genes per sample in table format
Opening the MultiQC_report.html in a browser is a good starting point to judge quality of data and assembly.
Command line parameters for the different tools can be adjusted in the config.yaml or config_miseq.yaml file in the config/ directory, or in the Snakefile.assembly directly. For many cases, the default parameters should be sufficient.
Package versions can be adjusted in envs/*yaml. The packages are in different files, mainly due to different dependencies such as python 2 / python 3. Please visit bioconda for available packages.
For further customizing, see snakemake documentation
Sometimes the snakemake workflow will give you an error due to an already running instance. In this case, unlock the snakemake instance with
./unlock_bactofidia.sh
The pipeline can be tested with
./dryrun_bactofidia.sh
To re-run parts of the pipeline, try
./rerun_incomplete.sh