A ZeroPM R package
The goal of cleanventory is to provide simple functionality to clean and partially curate data sets of common chemical inventories. The aim is to document every step, from the raw (downloaded) files to the final tables.
cleanventory aims to correctly identify all missing values in data sets, validates CAS Registry Numbers (when present), and offers functionality to convert special characters to ASCII.
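To illustrate what validating a CAS Registry Number involves, here is a minimal sketch of the standard check-digit rule in base R. The function name is_valid_cas() is hypothetical and is not taken from cleanventory's API; the package's own implementation may differ. The rule itself is the published one: the last digit must equal the sum of the preceding digits, each weighted by its position counted from the right, modulo 10.

```r
# Hypothetical helper for illustration only; not part of cleanventory's API.
is_valid_cas <- function(x) {
  x <- as.character(x)
  # Format check: 2 to 7 digits, 2 digits, 1 check digit, hyphen-separated.
  well_formed <- !is.na(x) & grepl("^[0-9]{2,7}-[0-9]{2}-[0-9]$", x)
  out <- rep(FALSE, length(x))
  for (i in which(well_formed)) {
    digits <- as.integer(strsplit(gsub("-", "", x[i], fixed = TRUE), "")[[1]])
    n <- length(digits)
    # Digits before the check digit, rightmost first, weighted 1, 2, 3, ...
    body <- rev(digits[-n])
    out[i] <- (sum(body * seq_along(body)) %% 10) == digits[n]
  }
  out
}

is_valid_cas(c("1333-74-0", "7732-18-5", "1333-74-1", "hydrogen", NA))
#> [1]  TRUE  TRUE FALSE FALSE FALSE
```

For 1333-74-0 (hydrogen), the weighted sum is 4*1 + 7*2 + 3*3 + 3*4 + 3*5 + 1*6 = 60, and 60 modulo 10 gives the check digit 0.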
The dependencies of cleanventory are kept as minimal as possible: openxlsx for handling .xlsx files, and the trio of pdftools, magick and tesseract to extract data from (image) .pdf files.
We suggest the following packages/functionalities in addition:
bit64::as.integer64() to correctly handle the us_tsca$cas_reg_no and us_cdr$chemical_id_wo_dashes columns (kept as double for compatibility).
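As a rough sketch of why a 64-bit integer type can help here: dashless CAS numbers can have up to ten digits, which may exceed R's 32-bit integer limit of 2147483647, so as.integer() is not a safe conversion. The example below assumes the bit64 package is installed; the helper cas_from_double() is hypothetical, and the input values are simply the dashless forms of two CAS numbers that appear in the example output in this README.

```r
# Hypothetical helper for illustration only; not part of cleanventory's API.
# Converts dashless CAS numbers stored as double into the usual dashed form.
cas_from_double <- function(x) {
  # integer64 avoids the 32-bit overflow that as.integer() would hit
  # for values above .Machine$integer.max.
  digits <- as.character(bit64::as.integer64(x))
  # Re-insert the hyphens: the last digit is the check digit,
  # the two digits before it form the middle block.
  sub("^([0-9]+)([0-9]{2})([0-9])$", "\\1-\\2-\\3", digits)
}

# Guarded so the snippet does not fail when bit64 is not installed.
if (requireNamespace("bit64", quietly = TRUE)) {
  print(cas_from_double(c(1333740, 21369642)))
}
```

Whether the columns should be converted at all depends on the downstream use; keeping them as double, as the package does, avoids forcing bit64 on every user.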
As of 2022-08-02, the following inventories are included:
You can install the development version of cleanventory from GitHub with:
# install.packages("remotes")
remotes::install_github("ZeroPM-H2020/cleanventory")
This is an example which shows you how to get the data set of the (current) EU CLP Annex VI:
library(cleanventory)
tmp <- tempdir()
url <- paste0(
"https://echa.europa.eu/documents/10162/17218/",
"annex_vi_clp_table_atp17_en.xlsx/",
"4dcec79c-f277-ed68-5e1b-d435900dbe34?t=1638888918944"
)
# Build the destination path once and reuse it for download and import.
path <- file.path(tmp, "annex_vi_clp_table_atp17_en.xlsx")
# .xlsx files are binary (zip archives), so download in binary mode on all platforms.
download.file(url, destfile = path, quiet = TRUE, mode = "wb")
eu_clp <- read_eu_clp(path)
invisible(file.remove(path))
head(eu_clp)
#>       index_no international_chemical_identification     ec_no     cas_no
#> 1 001-001-00-9                              hydrogen 215-605-7  1333-74-0
#> 2 001-002-00-4             aluminium lithium hydride 240-877-9 16853-85-3
#> 3 001-003-00-X                        sodium hydride 231-587-3  7646-69-7
#> 4 001-004-00-5                       calcium hydride 232-189-2  7789-78-8
#> 5 003-001-00-4                               lithium 231-102-5  7439-93-2
#> 6 003-002-00-X                        n-hexyllithium 404-950-0 21369-64-2
str(eu_clp)
#> 'data.frame':    4702 obs. of  4 variables:
#>  $ index_no                             : chr  "001-001-00-9" "001-002-00-4" "001-003-00-X" "001-004-00-5" ...
#>  $ international_chemical_identification: chr  "hydrogen" "aluminium lithium hydride" "sodium hydride" "calcium hydride" ...
#>  $ ec_no                                : chr  "215-605-7" "240-877-9" "231-587-3" "232-189-2" ...
#>  $ cas_no                               : chr  "1333-74-0" "16853-85-3" "7646-69-7" "7789-78-8" ...
This R package was developed at the Norwegian Geotechnical Institute (NGI) as part of the project ZeroPM: Zero pollution of Persistent, Mobile substances. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101036756.
If you find this package useful and can afford it, please consider making a donation to a humanitarian non-profit organization, such as Sea-Watch. Thank you.