update with R220
shenwei356 committed Apr 24, 2024
1 parent fdb64cb commit a78911b
Showing 6 changed files with 140 additions and 115 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -17,3 +17,6 @@ gtdb-taxid-changelog.csv.gz
gtdb-taxdump.tar.gz
taxid.map.stats*
taxonkit


gtdb_species.txt
1 change: 1 addition & 0 deletions Escherichia.tsv
@@ -5,6 +5,7 @@
1155214706 Escherichia fergusonii species
1627494196 Escherichia sp002965065 species
1705205476 Escherichia whittamii species
1831350832 Escherichia coli_F species
1854306313 Escherichia marmotae species
1904681918 Escherichia coli_E species
2087647928 Escherichia albertii species
169 changes: 92 additions & 77 deletions README.md
@@ -85,9 +85,12 @@ GTDB taxonomy files are downloaded from https://data.gtdb.ecogenomic.org/releases/,
├── R207
│   ├── ar53_taxonomy_r207.tsv.gz
│   └── bac120_taxonomy_r207.tsv.gz
└── R214
    ├── ar53_taxonomy_r214.tsv.gz
    └── bac120_taxonomy_r214.tsv.gz
├── R214
│   ├── ar53_taxonomy_r214.tsv.gz
│   └── bac120_taxonomy_r214.tsv.gz
└── R220
    ├── ar53_taxonomy_r220.tsv.gz
    └── bac120_taxonomy_r220.tsv.gz


[TaxonKit](https://github.com/shenwei356/taxonkit) v0.12.0 or a later version is needed.
@@ -129,6 +132,9 @@ TaxIds in `int32` following BLAST and DIAMOND, rather than `uint32` in previous

taxonkit create-taxdump --gtdb -x gtdb-taxdump/R207/ \
taxonomy/R214/*.tsv* --out-dir gtdb-taxdump/R214 --force

taxonkit create-taxdump --gtdb -x gtdb-taxdump/R214/ \
taxonomy/R220/*.tsv* --out-dir gtdb-taxdump/R220 --force
3. Generating TaxId changelog (Note that, it's not perfect for GTDB taxonomy).

@@ -154,11 +160,11 @@ Learn more about the [taxid-changelog](https://github.com/shenwei356/taxid-chang

set the environment variable for simplicity

export TAXONKIT_DB=gtdb-taxdump/R214/
export TAXONKIT_DB=gtdb-taxdump/R220/

Query the TaxId via an assembly accession

grep GCA_905234495.1 gtdb-taxdump/R214/taxid.map
grep GCA_905234495.1 gtdb-taxdump/R220/taxid.map
GCA_905234495.1 254122285

Query the TaxId via taxon name
@@ -194,64 +200,71 @@ All lineages

Checking consistency

$ zcat taxonomy/R214/* | cut -f 2 | sort | uniq | md5sum
7eb83651aa8491399c2684f3ceb1a404 -
$ zcat taxonomy/R220/* | cut -f 2 | sort | uniq | md5sum
f9e0f5268ab65026894703db3eab7b4b -

$ cut -f 2 gtdb_species.txt | sort | md5sum
7eb83651aa8491399c2684f3ceb1a404 -
f9e0f5268ab65026894703db3eab7b4b -
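
Matching checksums only say that the two species sets are identical. If they ever differ, the following sketch lists the entries unique to each side. It assumes each input is a plain text file with one species name or lineage per line; `species_set_diff` is a hypothetical helper and not part of TaxonKit.

```shell
# Sketch only: species_set_diff is a hypothetical helper, not part of TaxonKit.
# Each argument is a text file with one species name (or lineage) per line.
species_set_diff() {
    left=$(mktemp)
    right=$(mktemp)
    # comm needs both inputs sorted in the same collation order.
    LC_ALL=C sort -u "$1" > "$left"
    LC_ALL=C sort -u "$2" > "$right"
    # Column 1: only in the first file; column 2 (tab-indented): only in the second.
    LC_ALL=C comm -3 "$left" "$right"
    rm -f "$left" "$right"
}
```

The two inputs could be produced with the same extraction steps used for the checksums, for example `zcat taxonomy/R220/* | cut -f 2 > expected.txt` and `cut -f 2 gtdb_species.txt > observed.txt`, where both output file names are placeholders.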



### TaxId changes

<img src="stats/changes.png" alt="" width="600"/>

Note that the Y axis is the number of *TaxId*, not that of species.
Notes:
1. The Y axis is the number of *TaxIds*, not the number of species.
2. The data is generated by "taxonkit taxid-changelog", which was originally designed for NCBI taxonomy, where the TaxIds are stable.
For other taxonomic data created by "taxonkit create-taxdump", e.g., GTDB-taxdump, some change events might be wrong, because
- There can be dramatic changes between two versions.
- Different taxa in different versions might share the same TaxId, because we only
check and eliminate TaxId collisions within a single version.
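
One rough way to look for the cross-version collisions mentioned above is to compare the names attached to each TaxId in two releases. The sketch below assumes `names.dmp` uses the NCBI-style layout (`taxid<TAB>|<TAB>name<TAB>|<TAB>...`); `taxid_name_conflicts` is a hypothetical helper, not a TaxonKit command. A differing name for the same TaxId is only a hint that deserves a manual look, not proof of a collision.

```shell
# Sketch only: taxid_name_conflicts is a hypothetical helper, not a TaxonKit command.
# Assumes NCBI-style names.dmp rows: "taxid<TAB>|<TAB>name<TAB>|<TAB>...".
# Prints: taxid <TAB> name in the older release <TAB> name in the newer release.
taxid_name_conflicts() {
    old_names=$1
    new_names=$2
    awk -F '\t[|]\t' '
        FILENAME == ARGV[1] { name[$1] = $2; next }   # remember names from the older release
        ($1 in name) && name[$1] != $2 {              # same TaxId, different name
            printf "%s\t%s\t%s\n", $1, name[$1], $2
        }
    ' "$old_names" "$new_names"
}
```

With the directory layout used in this README, a call might look like `taxid_name_conflicts gtdb-taxdump/R214/names.dmp gtdb-taxdump/R220/names.dmp`.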

### Species changes

How many species are there in R214?

$ taxonkit list --data-dir gtdb-taxdump/R214/ --ids 1 -I "" \
| taxonkit filter --data-dir gtdb-taxdump/R214/ -E species \
How many species are there in R220?

$ taxonkit list --data-dir gtdb-taxdump/R220/ --ids 1 -I "" \
| taxonkit filter --data-dir gtdb-taxdump/R220/ -E species \
| wc -l
85205
113104

How many species are added in R214?
How many species are added in R220?

$ pigz -cd gtdb-taxid-changelog.csv.gz \
| csvtk grep -f version -p R214 \
| csvtk grep -f version -p R220 \
| csvtk grep -f change -p NEW \
| csvtk grep -f rank -p species \
| csvtk nrow
23660
31987

How many species are deleted in R214?
How many species are deleted in R220?

$ pigz -cd gtdb-taxid-changelog.csv.gz \
| csvtk grep -f version -p R214 \
| csvtk grep -f version -p R220 \
| csvtk grep -f change -p DELETE \
| csvtk grep -f rank -p species \
| csvtk nrow
2923
3127

How many species are merged into others in R214?
How many species are merged into others in R220?

$ pigz -cd gtdb-taxid-changelog.csv.gz \
| csvtk grep -f version -p R214 \
| csvtk grep -f version -p R220 \
| csvtk grep -f change -p MERGE \
| csvtk grep -f rank -p species \
| csvtk nrow
1430
1182

### Summary

Complete lineages (R214)
Complete lineages (R220)

$ cat gtdb-taxdump/R214/taxid.map \
$ cat gtdb-taxdump/R220/taxid.map \
| csvtk freq -Ht -f 2 -nr \
| taxonkit lineage -r -n -L --data-dir gtdb-taxdump/R214/ \
| taxonkit reformat -I 1 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' --data-dir gtdb-taxdump/R214/ \
| taxonkit lineage -r -n -L --data-dir gtdb-taxdump/R220/ \
| taxonkit reformat -I 1 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' --data-dir gtdb-taxdump/R220/ \
| csvtk add-header -t -n 'taxid,count,name,rank,superkindom,phylum,class,order,family,genus,species' \
> taxid.map.stats.tsv
@@ -264,26 +277,26 @@ Frequency of species
| csvtk pretty -t
species frequency
-------------------------- ---------
Escherichia coli 33849
Klebsiella pneumoniae 14975
Staphylococcus aureus 14959
Salmonella enterica 13832
Streptococcus pneumoniae 8895
Mycobacterium tuberculosis 7132
Pseudomonas aeruginosa 7037
Acinetobacter baumannii 6912
Clostridioides difficile 2701
Enterococcus_B faecium 2657
Enterobacter hormaechei_A 2605
Campylobacter_D jejuni 2442
Enterococcus faecalis 2314
Listeria monocytogenes 2307
Neisseria meningitidis 2243
Streptococcus pyogenes 2220
Listeria monocytogenes_B 1985
Mycobacterium abscessus 1903
Vibrio parahaemolyticus 1892
Burkholderia mallei 1824
Escherichia coli 38926
Klebsiella pneumoniae 18499
Staphylococcus aureus 16021
Salmonella enterica 15089
Streptococcus pneumoniae 9133
Acinetobacter baumannii 8536
Pseudomonas aeruginosa 8390
Mycobacterium tuberculosis 7337
Enterococcus_B faecium 3202
Enterococcus faecalis 3044
Clostridioides difficile 2991
Campylobacter_D jejuni 2873
Listeria monocytogenes 2517
Neisseria meningitidis 2336
Vibrio parahaemolyticus 2264
Streptococcus pyogenes 2258
Mycobacterium abscessus 2029
Listeria monocytogenes_B 2025
Burkholderia mallei 1934
Streptococcus agalactiae 1893


### Taxon history of Escherichia coli
@@ -294,7 +307,7 @@ Frequency of species
Get the TaxId:

$ echo Escherichia coli \
| taxonkit name2taxid --data-dir gtdb-taxdump/R214/
| taxonkit name2taxid --data-dir gtdb-taxdump/R220/
Escherichia coli 599451526

Any changes in the past? Of course: it first appeared in R80.
@@ -372,14 +385,15 @@ also shows the taxonomic information of current version (R207) and the taxon his

|Release|Domain |Phylum |Class |Order |Family |Genus |Species |
|:------|:----------|:----------------|:---------------------|:------------------|:--------------------|:-------------|:----------------------|
|R220 |d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli |
|R214 |d__Bacteria|p__Pseudomonadota|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli |
|R207 |d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia coli |
|R202 |d__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia flexneri|

### Species of the genus Escherichia

# set the directory of the taxdump files
export TAXONKIT_DB=gtdb-taxdump/R214
export TAXONKIT_DB=gtdb-taxdump/R220

$ echo Escherichia | taxonkit name2taxid
Escherichia 1028471294
@@ -395,6 +409,7 @@ also shows the taxonomic information of current version (R207) and the taxon his
1155214706 Escherichia fergusonii species
1627494196 Escherichia sp002965065 species
1705205476 Escherichia whittamii species
1831350832 Escherichia coli_F species
1854306313 Escherichia marmotae species
1904681918 Escherichia coli_E species
2087647928 Escherichia albertii species
@@ -409,11 +424,12 @@ also shows the taxonomic information of current version (R207) and the taxon his
|taxid |name |rank |#assembly|
|:---------|:----------------------|:------|:--------|
|599451526 |Escherichia coli |species|33849 |
|2087647928|Escherichia albertii |species|164 |
|1155214706|Escherichia fergusonii |species|141 |
|1854306313|Escherichia marmotae |species|103 |
|1083756244|Escherichia ruysiae |species|54 |
|599451526 |Escherichia coli |species|38926 |
|2087647928|Escherichia albertii |species|239 |
|1155214706|Escherichia fergusonii |species|161 |
|1854306313|Escherichia marmotae |species|141 |
|1831350832|Escherichia coli_F |species|97 |
|1083756244|Escherichia ruysiae |species|62 |
|300575795 |Escherichia sp005843885|species|37 |
|1705205476|Escherichia whittamii |species|4 |
|1904681918|Escherichia coli_E |species|2 |
@@ -428,45 +444,45 @@ and [GCF_023276905.1](https://gtdb.ecogenomic.org/genome?gid=GCF_023276905.1) (f
231798968 [no rank] 011881725
1417695290 [no rank] 023276905

$ grep 011881725 gtdb-taxdump/R214/taxid.map
$ grep 011881725 gtdb-taxdump/R220/taxid.map
GCF_011881725.1 231798968

### Common manipulations

In addition to the four taxdump files, we provide a `taxid.map` file which maps genome accessions to TaxIds.

$ wc -l gtdb-taxdump/R214/*
19731 gtdb-taxdump/R214/delnodes.dmp
1535 gtdb-taxdump/R214/merged.dmp
516027 gtdb-taxdump/R214/names.dmp
516027 gtdb-taxdump/R214/nodes.dmp
107 gtdb-taxdump/R214/ranks.txt
402709 gtdb-taxdump/R214/taxid.map
$ wc -l gtdb-taxdump/R220/*
23767 gtdb-taxdump/R220/delnodes.dmp
1322 gtdb-taxdump/R220/merged.dmp
743239 gtdb-taxdump/R220/names.dmp
743239 gtdb-taxdump/R220/nodes.dmp
107 gtdb-taxdump/R220/ranks.txt
596859 gtdb-taxdump/R220/taxid.map

List all the genomes of a species, e.g., *Akkermansia muciniphila*,

# Retrieve the TaxId
$ echo Akkermansia muciniphila | taxonkit name2taxid --data-dir gtdb-taxdump/R214
$ echo Akkermansia muciniphila | taxonkit name2taxid --data-dir gtdb-taxdump/R220
Akkermansia muciniphila 791276584

# list subtree
$ taxonkit list --data-dir gtdb-taxdump/R214 -nr --ids 791276584 | head -n 5
$ taxonkit list --data-dir gtdb-taxdump/R220 -nr --ids 791276584 | head -n 5
791276584 [species] Akkermansia muciniphila
9073941 [no rank] 008422865
13250174 [no rank] 008671835
25307961 [no rank] 004015245
30563015 [no rank] 004557465
2229511 [no rank] 948901395
3636769 [no rank] 948711495
7496143 [no rank] 949510945
7567111 [no rank] 949384685

# mapping TaxIds to Genome accessions with taxid.map
$ taxonkit list --data-dir gtdb-taxdump/R214 -I "" --ids 791276584 \
| csvtk join -Ht -f '1;2' - gtdb-taxdump/R207/taxid.map \
$ taxonkit list --data-dir gtdb-taxdump/R220 -I "" --ids 791276584 \
| csvtk join -Ht -f '1;2' - gtdb-taxdump/R220/taxid.map \
| head -n 5
9073941 GCF_008422865.1
13250174 GCA_008671835.1
25307961 GCF_004015245.1
30563015 GCA_004557465.1
34761210 GCF_008422665.1
2229511 GCA_948901395.1
3636769 GCA_948711495.1
7496143 GCA_949510945.1
7567111 GCA_949384685.1
7776528 GCA_959604705.1

Find the history of a taxon using scientific name:

$ zcat gtdb-taxid-changelog.csv.gz \
@@ -491,7 +507,6 @@ Find the history of a taxon using scientific name:
|174151795 |R089 |MERGE |1584917910 |Escherichia coli_A|species|
|266865208 |R086 |NEW | |Escherichia coli_B|species|
|266865208 |R089 |MERGE |1584917910 |Escherichia coli_B|species|
|525903441 |R214.1 |NEW | |Escherichia coli_E|species|
|599451526 |R080 |NEW | |Escherichia coli |species|
|599451526 |R207 |ABSORB |1223627963;1584917910;1670897256;2030830777|Escherichia coli |species|
|599451526 |R214 |CHANGE_LIN_TAX| |Escherichia coli |species|
@@ -500,9 +515,9 @@ Find the history of a taxon using scientific name:
|1584917910|R207 |MERGE |599451526 |Escherichia coli_C|species|
|1670897256|R089 |NEW | |Escherichia coli_D|species|
|1670897256|R207 |MERGE |599451526 |Escherichia coli_D|species|
|1831350832|R220 |NEW | |Escherichia coli_F|species|
|1904681918|R202 |NEW | |Escherichia coli_E|species|
|1904681918|R214 |CHANGE_LIN_TAX| |Escherichia coli_E|species|
|1945799576|R214.1 |NEW | |Escherichia coli |species|


Check more [TaxonKit commands and usages](https://bioinf.shenwei.me/taxonkit/usage/).
15 changes: 8 additions & 7 deletions stats/changes.csv
@@ -1,8 +1,9 @@
version,deleted,newly_added,merged,deleted_reused,merged_reused
R083,315,16538,220,,
R207,2876,85072,1528,96,43
R214,5346,114446,1479,1089,59
R089,6868,41664,2763,733,22
R202,2186,86349,1075,90,66
R095,2336,60252,815,133,97
R086,1484,22624,329,78,75
R083,361,16675,220,,
R202,2202,87355,1131,92,73
R086,1562,23116,392,98,69
R207,2937,86260,1577,90,61
R220,4144,232385,1312,107,99
R095,2416,60770,845,135,100
R089,7012,42082,2839,736,31
R214,5480,115496,1527,1089,62