The Structure of a Project Configuration File
output_directory = /home/kakapo/kakapo-output
project_name = kakapo-prj-01
entrez_api_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
run_rcorrector = Yes
run_inter_pro_scan = No
prepend_assembly_name_to_sequence_name = Yes
kraken_2_confidence = 0.20
requery_after = 7
use_colors = Yes
- output_directory: a path to a directory where kakapo places all of its output.
- project_name: a short name for the analysis. kakapo creates a subdirectory with this name where a number of project-specific output files are stored (log files, backups of the configuration files used, assembled sequences, etc.): [output_directory]/02-project-specific/[project_name]. A well-chosen name can significantly help with future data management.
- entrez_api_key: this field is required. kakapo uses GenBank, whose users are allowed 3 requests per second without an API key. With an API key, the limit is increased to 10 requests per second. To obtain your key, go here: https://www.ncbi.nlm.nih.gov/account/settings
- run_rcorrector: if set to Yes, the reads are processed by Rcorrector.
- run_inter_pro_scan: if set to Yes, the translated CDS sequences found by kakapo are submitted for functional annotation by InterProScan to https://www.ebi.ac.uk/interpro/search/sequence.
- prepend_assembly_name_to_sequence_name: if set to Yes, prepends the sample name to the assembled isoform names. The sample name is derived from the type of input:
  - SRA: GenBank metadata.
  - FASTQ: file name.
  - FASTA (user-provided assembly): file name.
  If set to No, unaltered names produced by SPAdes are used in kakapo output. (Users are strongly discouraged from setting this option to No when more than one sample is being analyzed.)
- kraken_2_confidence: a value between 0 and 1. I find that a value of 0.20 works quite well. Higher values reduce the number of reads classified (filtered out), and lower values increase the number of filtered reads. See the helpful discussion here for additional guidance.
- requery_after: a numeric value that tells kakapo not to re-query GenBank and/or Pfam if the search was already performed and the results are less than this many days old.
- use_colors: if set to Yes, adds color to the log messages in the terminal. This may not look great on light terminal backgrounds.
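The requery_after rule amounts to a simple age check on previously stored results. The following is a minimal sketch of that idea only, not kakapo's actual implementation; the function name needs_requery and the idea of keeping a timestamp for the last search are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def needs_requery(last_search, requery_after_days, now=None):
    """Return True if a new query should be made.

    last_search: datetime of the previous search, or None if there was none
                 (hypothetical bookkeeping, not taken from kakapo).
    requery_after_days: the value of requery_after from the configuration.
    """
    if last_search is None:
        return True
    now = now or datetime.now()
    # Results younger than requery_after days are reused, older ones are not.
    return now - last_search >= timedelta(days=requery_after_days)
```

With requery_after = 7, results from five days ago would be reused, while results from seven or more days ago would trigger a new query.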
allow_non_aug_start_codon = No
allow_missing_start_codon = No
allow_missing_stop_codon = No
- allow_non_aug_start_codon: allow other start codons in addition to AUG. A set of appropriate start codons is chosen using GenBank taxonomic classification. If allow_non_aug_start_codon is set to No, allow_missing_start_codon has no effect.
- allow_missing_start_codon: annotate ORFs even if the start codon is missing. For allow_missing_start_codon to have any effect, allow_non_aug_start_codon must be set to Yes.
- allow_missing_stop_codon: annotate ORFs even if the stop codon is missing.
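The dependency between the two start codon options can be sketched as follows. This is an illustrative sketch under stated assumptions, not kakapo's code: the helper names yes_no and effective_codon_settings are hypothetical, and only the Yes/No values shown in the example configuration are accepted.

```python
def yes_no(value):
    """Convert a Yes/No configuration value to a bool."""
    value = value.strip()
    if value not in ("Yes", "No"):
        raise ValueError("expected Yes or No, got: " + repr(value))
    return value == "Yes"


def effective_codon_settings(non_aug_start, missing_start, missing_stop):
    """Resolve the three options into the behavior described above."""
    allow_non_aug = yes_no(non_aug_start)
    return {
        "allow_non_aug_start_codon": allow_non_aug,
        # Has an effect only when non-AUG start codons are also allowed.
        "allow_missing_start_codon": yes_no(missing_start) and allow_non_aug,
        "allow_missing_stop_codon": yes_no(missing_stop),
    }
```

For example, effective_codon_settings("No", "Yes", "No") reports allow_missing_start_codon as False, because allow_non_aug_start_codon is set to No.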
plants
The group your samples belong to. Although it is rare, Latin binomials of vastly different organisms may contain the same words. The purpose of this setting is to restrict the search space to a relatively broad taxonomic group in order to avoid ambiguity in name resolution. You may choose between animals, archaea, bacteria, fungi, plants, and viruses. Alternatively, you may enter an NCBI TaxID for any taxon as long as all of your samples belong to it. You can look up NCBI TaxIDs here.
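A value for this setting is therefore either one of the six group names or a numeric TaxID. A minimal validation sketch, assuming exact (case-sensitive) group names and positive integer TaxIDs; the function name parse_tax_group is hypothetical and this is not how kakapo itself necessarily checks the value:

```python
GROUPS = ("animals", "archaea", "bacteria", "fungi", "plants", "viruses")


def parse_tax_group(value):
    """Return ('group', name) or ('taxid', number); raise ValueError otherwise."""
    value = value.strip()
    if value in GROUPS:
        return ("group", value)
    # Assumption: a TaxID is written as a plain positive integer.
    if value.isdigit() and int(value) > 0:
        return ("taxid", int(value))
    raise ValueError("not a known group or a numeric TaxID: " + repr(value))
```

This does not verify that the TaxID exists at NCBI or that all samples belong to it; it only distinguishes the two accepted forms.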
SRR7829961
SRR23214014
A list of SRA accessions, one entry per line.
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample1_R*.fastq.gz
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample2_R*.fastq
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample3_R*.fq.gz
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample4_R1.fastq.gz
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample5_R1.fastq
/home/kakapo/kakapo-input/fastq/Solanum_chilense_sample6_R1.fq
A list of FASTQ files, one entry per line. Files can be gzip-compressed or not. For paired-end reads, replace 1/2 or F/R with a *. Anything with a * in the file name is treated as a paired-end set. File names without a * character are treated as single-read (forward-read only) files, even if the reverse reads are in the same directory.
Important: file names are parsed as Latin_binomial_sampleid_readinfo.extension. If your files do not follow this format, you may enter a species name followed by a colon (:) before the path:
Solanum chilense:/home/kakapo/kakapo-input/fastq/sample1_R*.fastq.gz
Solanum chilense:/home/kakapo/kakapo-input/fastq/sample2_R*.fastq
Solanum chilense:/home/kakapo/kakapo-input/fastq/sample3_R*.fq.gz
Solanum chilense:/home/kakapo/kakapo-input/fastq/sample4_R1.fastq.gz
Solanum chilense:/home/kakapo/kakapo-input/fastq/sample5_R1.fastq
Solanum chilense:/home/kakapo/kakapo-input/fastq/sample6_R1.fq
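The rules for FASTQ entries described above can be sketched as a small parser. This is only an illustration of the stated rules, not kakapo's parser: the function name parse_fastq_entry is hypothetical, and the sketch assumes Unix-style paths that do not themselves contain a colon.

```python
def parse_fastq_entry(entry):
    """Split one FASTQ entry into (species, path, layout)."""
    species = None
    path = entry.strip()
    if ":" in path:
        # "Species name:/path/to/file" form.
        species, path = path.split(":", 1)
        species = species.strip()
        path = path.strip()
    file_name = path.rsplit("/", 1)[-1]
    if species is None:
        # Latin_binomial_sampleid_readinfo.extension form.
        parts = file_name.split("_")
        if len(parts) >= 2:
            species = parts[0] + " " + parts[1]
    # A * in the file name marks a paired-end set; otherwise single reads.
    layout = "paired" if "*" in file_name else "single"
    return species, path, layout
```

For instance, the entry Solanum chilense:/home/kakapo/kakapo-input/fastq/sample4_R1.fastq.gz has no * in the file name, so it would be treated as a single-read file for Solanum chilense.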
/home/kakapo/kakapo-input/assemblies/Matucana_madisoniorum_HBG13.fasta
A list of FASTA files, one entry per line. Use this if you already have a set of transcripts or any other set of sequences without introns (CDS, mRNA). kakapo will perform the gene search part of the pipeline on these files and will output the transcripts matching the search parameters it finds together with the transcripts derived from raw reads.
cactus_virus_x = /home/kakapo/kakapo-input/reference_genomes/cactus_virus_x.fasta
plastid
mitochondrion
A list of FASTA files and/or the keywords plastid and mitochondrion, one entry per line. For the keywords plastid and mitochondrion, kakapo finds the most closely related plastid or mitochondrial assembly on GenBank. Reads mapping to any of the entries listed here are stored in subdirectories in the output_directory.
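The example entries above come in two shapes: a bare keyword, or a name = path pair pointing at a FASTA file. A minimal sketch of telling them apart follows. The function name parse_filter_entry is hypothetical, and accepting a bare path without a name is an assumption, not documented kakapo behavior.

```python
KEYWORDS = ("plastid", "mitochondrion")


def parse_filter_entry(entry):
    """Classify one entry as a keyword or a FASTA reference."""
    entry = entry.strip()
    if entry in KEYWORDS:
        return {"kind": "keyword", "name": entry, "path": None}
    if "=" in entry:
        # "name = /path/to/reference.fasta" form, as in the example above.
        name, path = (part.strip() for part in entry.split("=", 1))
        return {"kind": "fasta", "name": name, "path": path}
    # Assumption: anything else is treated as a bare path without a name.
    return {"kind": "fasta", "name": None, "path": entry}
```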
16S_Silva132
16S_Silva138
viral
mitochondrion
plastid
mitochondrion_and_plastid
minikraken_8GB_2020-03-12
A list of Kraken2 databases, one entry per line. kakapo will download a few smaller Kraken2 databases during the dependency installation process. You can place (or link) additional databases in the ~/.local/share/kakapo/kraken2_dbs directory for them to be visible to kakapo. Reads classified by Kraken2 will be stored in subdirectories in the output_directory.
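To check which database names are available before listing them in the configuration, the database directory can be inspected directly. The sketch below assumes that each database is a subdirectory (or a link to one) inside the directory mentioned above, and that the subdirectory name is what is listed in the configuration file; the function name list_kraken2_dbs is hypothetical.

```python
from pathlib import Path


def list_kraken2_dbs(db_root="~/.local/share/kakapo/kraken2_dbs"):
    """Return the sorted names of the subdirectories found in db_root."""
    root = Path(db_root).expanduser()
    if not root.is_dir():
        return []
    # is_dir() follows symbolic links, so linked databases are included.
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    for name in list_kraken2_dbs():
        print(name)
```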
evalue = 1e-5
max_hsps = 10000
qcov_hsp_perc = 1
best_hit_overhang = 0.05
best_hit_score_edge = 0.25
max_target_seqs = 1000000
BLAST parameters for searching RNA-Seq reads matching the query.
evalue = 1e-20
max_hsps = 4
qcov_hsp_perc = 70
best_hit_overhang = 0.15
best_hit_score_edge = 0.15
max_target_seqs = 500
BLAST parameters for searching assembled transcripts matching the query. Settings in this section can be overridden in a search strategies file; each search strategy can have its own settings.
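The override behavior can be pictured as a dictionary merge in which per-strategy values take precedence over the values from this section. This is a sketch of the concept only; the function name blast_settings_for is hypothetical, and the format of the search strategies file is not shown here.

```python
# Values from the example section above.
BLAST_ASSEMBLY_DEFAULTS = {
    "evalue": 1e-20,
    "max_hsps": 4,
    "qcov_hsp_perc": 70,
    "best_hit_overhang": 0.15,
    "best_hit_score_edge": 0.15,
    "max_target_seqs": 500,
}


def blast_settings_for(strategy_overrides):
    """Merge one strategy's overrides over the project-wide settings."""
    unknown = set(strategy_overrides) - set(BLAST_ASSEMBLY_DEFAULTS)
    if unknown:
        raise KeyError("unknown BLAST settings: " + ", ".join(sorted(unknown)))
    # Later keys win, so strategy values replace the defaults.
    return {**BLAST_ASSEMBLY_DEFAULTS, **strategy_overrides}
```

A strategy that sets only evalue would keep the remaining five values from the project configuration file.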