Kyle A. Gervers 2024-05-12
While the UNITE general FASTA releases represent high-quality, dynamically clustered references for fungal taxon assignment, most releases consistently ignore taxa that I care about. Also, I’ve noticed that the taxonomic ranks applied are not always the most up-to-date when compared with what Index Fungorum reports, and many of the included sequences lack detailed taxonomic resolution.
Starting from the most recent (as of 2024-05-12) fungal UNITE+INSD release, this repo does the following:
- Removes sequences not identified to genus
- De-replicates sequences with the same header taxonomy and sequence
- Updates the taxonomy of the remaining sequences according to Index Fungorum
- De-replicates sequences (again) with the same header taxonomy and sequence
- Formats sequence headers in a format compatible with the assignTaxonomy() function from dada2
Because UNITE does not include the authority in the applied taxonomy, this approach ends up shedding more sequences as a conservative measure, only including sequences with unambiguous taxonomy. Also, because data dumps of Index Fungorum taxonomy (which are difficult to find anyways!) appear to only allow indexing at the genus-level, a genus-level ID requirement is placed on the UNITE+INSD release.
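The first two steps in the list above (genus-level filtering and de-replication) can be sketched in shell. This is a minimal illustration under stated assumptions, not the code used in this repo: the function name filter_and_derep is hypothetical, and the sketch assumes UNITE-style headers whose last "|"-delimited field is a "k__...;...;g__...;s__..." lineage, with "g__unidentified" or an "Incertae_sedis" genus taken to mean "not identified to genus".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): keep only records identified to
# genus, then drop records whose header taxonomy AND sequence were already seen.
# Assumes the lineage is the last "|"-delimited field of each FASTA header.
filter_and_derep() {  # usage: filter_and_derep input.fasta > output.fasta
    awk '
        # Print the buffered record if it is genus-identified and not yet seen.
        function emit(    key) {
            if (header == "") return
            if (keep) {
                key = tax "\t" toupper(seq)   # case-insensitive sequence match
                if (!(key in seen)) { seen[key] = 1; print header; print seq }
            }
        }
        /^>/ {
            emit()
            header = $0; seq = ""
            n = split($0, parts, "[|]"); tax = parts[n]
            # Assumed markers for a missing genus-level identification.
            keep = (tax ~ /;g__[^;]+/ && tax !~ /;g__unidentified(;|$)/ && tax !~ /;g__[^;]*Incertae_sedis/)
            next
        }
        { seq = seq $0 }   # join wrapped sequence lines into one string
        END { emit() }
    ' "$1"
}
```

Because the de-replication key is the lineage plus the sequence, two identical sequences carrying different taxonomy are both retained, which is why the step is worth repeating after the taxonomy has been updated.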
This whole repo will (hopefully) become obsolete if:
- future releases include species hypotheses of taxa currently only found in UNITE+INSD releases
- fungal taxonomy of future releases matches Index Fungorum’s
The 2023-07-25 UNITE general releases seem to have species hypotheses for taxa previously only found in UNITE+INSD releases (e.g., Nothophaeocryptopus). This is good news, but some species (e.g., Micraspis strobilina) are still missing. Additionally, some ranks still are not in line with current Index Fungorum taxonomy (e.g., Rhabdocline in Hemiphacidiaceae vs Cenangiaceae). For this reason, I will still use the UNITE+INSD releases moving forward.
The updated fungal taxonomy reference is found in the 02-rename directory and named if-unite-insdc.fa.gz.
As of 2023-03-22, this repo comes with the script 03-host.r, which uses the United States National Fungus Collections Fungus-Host Dataset to subset the updated reference for fungi associated with a requested plant genus. In addition to producing a gzipped fasta file, a corresponding csv including all the information in the United States National Fungus Collections Fungus-Host Dataset (with corrected taxonomy for all ranks) is also produced.
To use 03-host.r, run the following in the project directory:
Rscript code/03-host.r 32 Pseudotsuga
Here, 32 corresponds to the number of threads requested for parallel processing, while Pseudotsuga is the host request. Only genus arguments are currently accepted, but other ranks will work soon. The main purpose of this script is to generate taxonomic references for use as priors with the DADA2 R package.
As of 2023-03-25, each subregion (ITS1, 5.8S, and ITS2) of the full-ITS sequences in if-unite-insdc.fa.gz also exists in a separate file. This functionality was added to help remove taxonomy assignment noise associated with the assignTaxonomy() function in DADA2, which implements the RDP Naive Bayesian Classifier algorithm described in Wang et al., 2007. Essentially, if separate databases are used (ITS1 for ITS1, ITS2 for ITS2, etc.), non-target k-mer matches can no longer occur (e.g., where an 8-mer in an ITS1 ASV/OTU sequence just happens to match somewhere other than the ITS1 subregion of a full-ITS reference).
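The off-target matching described above can be illustrated with a toy k-mer count. This is only a sketch: shared_kmers is a hypothetical helper, the sequences are made up, and counting shared 8-mers is a simplification of what a naive Bayesian classifier actually computes from k-mer frequencies.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the distinct k-mers of a query sequence that also
# occur anywhere in a reference sequence (k defaults to 8).
shared_kmers() {  # usage: shared_kmers QUERY REFERENCE [K]
    awk -v q="$1" -v r="$2" -v k="${3:-8}" 'BEGIN {
        # Index every k-mer in the reference.
        for (i = 1; i + k - 1 <= length(r); i++) ref[substr(r, i, k)] = 1
        # Count each query k-mer found in the reference once.
        for (i = 1; i + k - 1 <= length(q); i++) {
            kmer = substr(q, i, k)
            if ((kmer in ref) && !(kmer in hit)) { hit[kmer] = 1; n++ }
        }
        print n + 0
    }'
}

# Made-up toy sequences: an "ITS1" query, an "ITS1" reference subregion, and
# the remainder of the same reference (standing in for 5.8S + ITS2).
query="TTGACCATGCA"
ref_its1="GGGGCCCCAAAATTTT"
ref_rest="CCTTGACCATGG"

shared_kmers "$query" "${ref_its1}${ref_rest}"   # full-length reference
shared_kmers "$query" "$ref_its1"                # ITS1-only reference
```

With the full-length reference, the query picks up 8-mer matches that fall entirely outside the ITS1 portion; with the ITS1-only reference those matches are impossible.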
These split references are found in the its1, 5.8s, and its2 subdirectories within the 04-trim directory. Use files following the fun.[ITS1/5_8S/ITS2].fasta.gz or euk.5_8S.fasta.gz pattern as taxonomic references. The files in the 03-extract subdirectory are outputs of ITSx and are available for inspection.
5.8S sequences prefixed with euk originate from the most current, non-singleton eukaryote general UNITE release, facilitating non-fungal ASV/OTU filtering prior to taxonomy assignment using one of the fun ITS2 references (which originate from a processed UNITE+INSD fungal release).
This separation also facilitates the use of both 5.8S and ITS2 for taxonomic assignment, as demonstrated in Heeger et al., 2019. They found that ITS1 and ITS2 sequences alone can fail to obtain taxonomic assignments at high ranks (e.g., kingdom, phylum, and class) when assignment references are incomplete (as they always are), which lines up with my experience and those of my peers. After extracting the 5.8S and ITS2 sequences from an ASV/OTU sequence, each subregion can be used to separately assign taxonomy to that sequence, allowing the 5.8S subregion to provide high-rank assignments when ITS2 fails to provide any assignment. This separation seems necessary, as Heeger et al. report that classification with combined fragments did not perform as well as independent classification.
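One simple reading of that fallback rule is sketched below. It is not the procedure from Heeger et al. or from this repo: merge_assignments is a hypothetical helper, and the input layout is an assumption (tab-separated rows of sequence ID, kingdom, phylum, class, order, family, genus, species, with "NA" marking an unassigned rank). When the ITS2 row for a sequence is entirely unassigned, the kingdom, phylum, and class from the 5.8S row are filled in.

```shell
#!/usr/bin/env bash
# Hypothetical helper: use 5.8S assignments for the high ranks (columns 2-4)
# whenever ITS2 produced no assignment at any rank for a sequence.
# Assumed layout of both files: id, kingdom, phylum, class, order, family,
# genus, species (tab-separated, "NA" = unassigned).
merge_assignments() {  # usage: merge_assignments ITS2.tsv 5_8S.tsv
    awk -F '\t' -v OFS='\t' '
        # First file read (5.8S): remember kingdom, phylum, and class per ID.
        NR == FNR { for (i = 2; i <= 4; i++) high[$1, i] = $i; next }
        # Second file read (ITS2): fill high ranks only if every rank is NA.
        {
            assigned = 0
            for (i = 2; i <= NF; i++) if ($i != "NA") assigned = 1
            if (!assigned)
                for (i = 2; i <= 4; i++) if (($1, i) in high) $i = high[$1, i]
            print
        }
    ' "$2" "$1"
}
```

Rows with any ITS2 assignment pass through unchanged, so the two subregions are still classified independently and only combined afterwards.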
LSUx is an R package developed by Brendan Furneaux that uses ribosomal large subunit covariance models (such as those available on Rfam) to estimate the start and stop positions of the 5.8S, ITS2, and 28S/32S subregions. Variable regions (e.g., D1/V2, D2/V3, etc.) within the 28S/32S region are also demarcated, which ITSx/itsxpress cannot do. ITS1 positions are also inferred with regard to 5.8S starting positions, so all three subregions can ultimately be extracted. I thought it would be interesting to see how the output of this user-friendly R package compares to ITSx output, so the 03-alt.r script and its associated output directory 03-alt.r have been added here.
LSUx calls inferrnal, which wraps Infernal. Unlike the profile hidden Markov models used by ITSx, Infernal covariance models allow potentially base-pairing nucleotides in different positions of a primary nucleic acid sequence to covary, thereby accounting for the conservation of secondary structures known to exist for rRNA sequences. The R package tzara is used here to extract subregions identified by LSUx (although it can do much more).
- retention of the original accession number associated with each UNITE+INSD sequence entry
- more plant taxonomic ranks for 03-host.r
- inclusion of fungal sequences at higher ranks (dependent on Index Fungorum data dumps)
All packages were installed and managed with conda.
conda 24.1.2
name: /home/gerverska/projects/if-unite-insdc/env
channels:
- conda-forge
- bioconda
- nodefaults
dependencies:
- bioconductor-biostrings=2.66.0
- bioconductor-shortread=1.56.0
- r-base=4.2.2
- r-dplyr=1.1.2
- r-futile.logger=1.4.3
- r-markdown=1.6
- r-readr=2.1.4
- r-remotes=2.5.0
- r-rmarkdown=2.21
- r-stringr=1.5.0
- r-tidyr=1.3.0
- infernal=1.1.5
- itsx=1.1.3
- vsearch=2.22.1
- pigz=2.8
prefix: /home/gerverska/projects/if-unite-insdc/env
Install the above bioinformatic environment from config.yml using the script 00-build.sh:
# Clone the repo (using the GitHub CLI tool) ####
gh repo clone gerverska/if-unite-insdc
# Run the build script ####
bash code/00-build.sh