digital-botanical-gardens-initiative/integrate_trydb_globi_enpkg: Integrate TRY-db and GLOBI data with a minimal subset of enpkg

This repository contains scripts that integrate species and their trait data from TRY-db with taxonomic ids from gbif, otol, ncbi and wikidata. Data for only 25 traits were downloaded from TRY-db. The trait metadata was then retrieved from the TRY-db website, along with a subset of enpkg. The csv file of interactions was retrieved from the GLOBI database. The retrieved csv files were converted to a duckdb database file.

Download the datasets used in this repo from zenodo and merge them into the data folder of this repo.

The TRY-db dataset with 25 traits has multiple columns, which are related as depicted in the diagram below.

I. Prerequisites:

For the scripts (R, shell) to run smoothly, install R (version 4.1.2), the following R packages, and some additional software:

a) For accessing taxonomic ids from wikidata, with mappings from gbif and ncbi (taxizedb) and from Open Tree of Life (rotl):

install.packages(c("taxizedb", "rotl"))

b) For data manipulation, install dplyr and dbplyr (a backend wrapper that converts dplyr code into SQL):

install.packages(c("dplyr", "dbplyr"))

c) For the on-disk approach of accessing and querying databases, install duckdb's API client for R:

install.packages("duckdb")

and duckdb itself.

d) For building a Virtual Knowledge Graph (VKG), download the Ontop-cli/Ontop-protege bundle (version 5.1.2).

e) For converting ontology files between formats (e.g. owl to ttl), install robot.

II. Script to map the TRY plant species names to gbif, ncbi and wikidata ids

Taxonomy mapping of the plant species names in TRY-db was done using the taxizedb R package. Values of the SpeciesName column were used for name matching. Many plant and insect species share the same name, both genus and species (e.g. Iris orientalis), so for TRY-db plant species only the ids with kingdom 'Plantae' are retained, across all databases.

For wikidata, taxizedb downloads the database from zenodo. taxizedb has no function to match wikidata ids to their full lineage or upper taxon levels. Therefore, the names were first matched to gbif names, the lineage of each gbif id was retrieved with taxizedb, and only the ids with kingdom 'Plantae' were retained. These were in turn matched to the 'external-ids' column of the wikidata database provided by taxizedb, which reduces the overall number of mapped plant species. This is a bottleneck, and work to resolve it is in progress. Also, see here for details. To run the script:

Rscript matchTaxonomy.R
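The matching logic described above (match names to gbif, keep only 'Plantae' hits, then join to wikidata through the external id) can be illustrated with a minimal Python sketch. It is not the code in matchTaxonomy.R: the lookup tables, ids and the function name map_plant_species are all hypothetical stand-ins for the taxizedb queries.

```python
# Toy stand-ins for the taxizedb lookups; every id and table below is hypothetical.
# gbif name matches: species name -> list of (gbif_id, kingdom)
gbif_matches = {
    "Iris orientalis": [("gbif:1001", "Plantae"), ("gbif:2002", "Animalia")],
    "Quercus robur": [("gbif:3003", "Plantae")],
    "Acer campestre": [("gbif:4004", "Plantae")],
}

# wikidata ids keyed by the external gbif id (only some taxa are covered)
wikidata_by_gbif = {
    "gbif:1001": "Q-example-1",
    "gbif:3003": "Q-example-2",
}


def map_plant_species(names):
    """Return {species: [(gbif_id, wikidata_id or None), ...]}, Plantae hits only."""
    mapped = {}
    for name in names:
        # Keep only the gbif ids whose lineage places them in kingdom Plantae
        plant_ids = [
            gid for gid, kingdom in gbif_matches.get(name, [])
            if kingdom == "Plantae"
        ]
        if plant_ids:
            # Join to wikidata through the external gbif id; None if no record
            mapped[name] = [(gid, wikidata_by_gbif.get(gid)) for gid in plant_ids]
    return mapped


result = map_plant_species(
    ["Iris orientalis", "Quercus robur", "Acer campestre", "Unknown sp."]
)
# Species that end up with at least one wikidata id
with_wikidata = {n for n, ids in result.items() if any(w for _, w in ids)}
```

In this toy data, one species has a gbif id but no wikidata record, which mirrors how the two-step join reduces the number of species mapped to wikidata.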

To plot distribution of the TRY-db species matched with ids from ncbi, gbif and wikidata, run

Rscript distTaxonomicIds.R

III. Script to build a duckdb database for Ontop and build the knowledge graph

sh run_duckdb.sh

The relations between tables are depicted in the following minimal ER diagram. The full diagram can be found in the file 'figures/TableRelations_ER_diagram_full.png'

IV. Script to extract the wikidata ids common to TRY-db and GLOBI, followed by retrieving pair-wise interaction data from GLOBI for those ids (this approach was eventually dropped; all interactions with kingdom 'Plantae' were used instead)

sh run_ext_com_GLOBI_TRY.sh (No longer needed)

Note that the above requires downloading interactions.csv.gz from the GLOBI database.
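The idea behind this step, intersecting the two id sets and keeping the interactions that involve a shared id, can be sketched in Python as follows. This is an illustration only, not the contents of run_ext_com_GLOBI_TRY.sh: the column names (source_wd, target_wd), the ids and the in-memory csv are assumptions and do not reflect the actual layout of interactions.csv.gz.

```python
import csv
import io

# Hypothetical wikidata ids that were mapped for TRY-db species
try_wikidata_ids = {"Q1", "Q2", "Q3"}

# Tiny in-memory stand-in for the interactions file; column names are assumptions
interactions_csv = io.StringIO(
    "source_wd,interaction,target_wd\n"
    "Q1,pollinatedBy,Q9\n"
    "Q7,eats,Q2\n"
    "Q7,eats,Q8\n"
)

rows = list(csv.DictReader(interactions_csv))

# All wikidata ids that appear on either side of an interaction
globi_ids = {r["source_wd"] for r in rows} | {r["target_wd"] for r in rows}

# Ids present in both TRY-db and GLOBI
common_ids = try_wikidata_ids & globi_ids

# Pair-wise interactions in which at least one partner is a shared id
pairs = [
    r for r in rows
    if r["source_wd"] in common_ids or r["target_wd"] in common_ids
]
```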

V. Script to split the clubbed ids in the file from step IV into individual columns

Rscript extractIntoColumns_globi.R
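As a rough illustration of what splitting clubbed ids means, the Python sketch below separates one combined id string into one column per source. It assumes the ids are clubbed as pipe-separated PREFIX:value tokens; that format, the prefixes, the placeholder values and the function name split_clubbed_ids are assumptions, and extractIntoColumns_globi.R may work differently.

```python
def split_clubbed_ids(clubbed, prefixes=("WD", "NCBI", "GBIF")):
    """Split a 'PREFIX:value | PREFIX:value' string into one column per prefix."""
    columns = {p: None for p in prefixes}
    for token in clubbed.split("|"):
        prefix, sep, value = token.strip().partition(":")
        # Keep the first id seen for each requested prefix; ignore other prefixes
        if sep and prefix in columns and columns[prefix] is None:
            columns[prefix] = value
    return columns


# Placeholder ids, not real taxon identifiers
row = split_clubbed_ids("NCBI:0000 | WD:Q0000 | EOL:0000")
empty = split_clubbed_ids("")
```

Prefixes that are missing from the input stay as None, so every row yields the same set of columns.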

VI. Script to materialize the knowledge graph in Ontop
#Set the path in data/Ontop_config/duckdb.properties

sh run_ontop.sh

VII. Script to build the graph and run SPARQL queries on the Ontop-materialized graph
#Set the path in data/Ontop_config/duckdb.properties

python3.12 graphify_n_visualize.py

VIII. Disclaimer

Ontop was used to create the mappings. Six SPARQL queries in 'graphify_n_visualize.py' work; more remain to be tested.
