Italian Thesaurus terms to Library of Congress Subject Headings via Wikidata.
thes2loc helps librarians build a multilingual thesaurus. In particular, it finds a mapping between the Italian Tesauro del Nuovo Soggettario (THES) of the Biblioteca Nazionale Centrale di Firenze (BNCF, the National Central Library of Florence) and the Library of Congress Subject Headings (LCSH).
The BNCF Thesaurus links some of its terms to Wikipedia articles; see for example Abbazie (Abbeys), which links to the Italian Wikipedia article Abbazia (Abbey). In May 2013 the Italian Wikipedia community created the template {{BNCF Thesaurus}} to link back to these terms and also added the data to Wikidata, creating the property P508, i.e. BNCF Thesaurus.
For the mapping between the Library of Congress Subject Headings and English Wikipedia articles, it uses the mapping compiled by John Ockerbloom: wikimap.
Thus a map THES <-> LCSH is built in this way:
THES <-> itwiki <-> wikidata <-> enwiki <-> LCSH
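The chain above amounts to composing lookups that share a common key. The following is a minimal sketch in Python, not the code used by thes2loc: it assumes the data has already been loaded into dictionaries, collapses the THES <-> itwiki <-> wikidata steps into a single lookup, and uses a hypothetical function name (`join_via_enwiki`), made-up identifiers and a made-up relation label.

```python
from typing import Dict, List, Tuple


def join_via_enwiki(
    thes_to_item: Dict[str, str],
    item_to_enwiki: Dict[str, str],
    enwiki_to_lcsh: Dict[str, List[Tuple[str, str]]],
) -> List[Tuple[str, str, str, str]]:
    """Compose the lookups into (thes_id, relation, lcsh_id, wikidata_id) rows."""
    rows = []
    for thes_id, item in thes_to_item.items():
        title = item_to_enwiki.get(item)
        if title is None:
            continue  # the Wikidata item has no English Wikipedia article
        for relation, lcsh_id in enwiki_to_lcsh.get(title, []):
            # Store the Wikidata item number without the leading "Q".
            number = item[1:] if item.startswith("Q") else item
            rows.append((thes_id, relation, lcsh_id, number))
    return rows


# Illustrative values only: none of these identifiers are real.
thes_to_item = {"thes-1": "Q42", "thes-2": "Q7"}
item_to_enwiki = {"Q42": "Abbey"}
enwiki_to_lcsh = {"Abbey": [("exact", "sh-example")]}

rows = join_via_enwiki(thes_to_item, item_to_enwiki, enwiki_to_lcsh)
```

Terms with no English article, or whose article has no LCSH entry, simply drop out of the result, which is why the final map can be smaller than the list of items carrying the BNCF Thesaurus property.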
thes2loc has only been tested on Ubuntu Linux so far. It should also work on other *nix systems.
To run it you need the following software as prerequisites:
- curl, this comes pre-packaged with most desktop Linux distributions.
- jq, a powerful CLI tool for processing JSON. You can download it from the project's website.
- GNU parallel, a shell tool for executing jobs in parallel. GNU parallel is packaged on several Linux distributions.
- pywikibot, a Python framework to interact with MediaWiki wikis and in particular with the Wikimedia projects.
USAGE:
make all
produces (among others) the file thes2lcsh.map, which is the one you are interested in.
This command comprises three other commands:
- make get: retrieves data from Wikidata (list of items with property BNCF Thesaurus) and from Wikimap (LCSH -> enwiki article titles)
- make resolve: retrieves data from Wikidata (BNCF Thesaurus item id, itwiki article title, enwiki article title)
- make match: builds thes2lcsh.map with (BNCF Thesaurus item id, relation type, LCSH id, Wikidata item no.)
The columns in thes2lcsh.map are (thes_id, relation, lcsh_id, wikidata_id), where:
- thes_id is the BNCF Thesaurus term identifier;
- relation is the relation type (as defined by John Ockerbloom's classification, see the documentation);
- lcsh_id is the Library of Congress Subject Heading identifier;
- wikidata_id is the Wikidata item no. (e.g. 42 for Q42).
To retrieve the corresponding URLs from thes2lcsh.map use the following mapping:
- for BNCF Thesaurus: http://thes.bncf.firenze.sbn.it/termine.php?id={thes_id}
- for LCSH: http://id.loc.gov/authorities/subjects/{lcsh_id}.html
- for Wikidata: http://www.wikidata.org/wiki/Q{wikidata_id}
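As an illustration, the URL templates above can be filled in with a small helper. This is a sketch only: the function name `urls_for_row` is hypothetical, the sample identifiers are made up, and the row is assumed to be already split into its fields, since the field separator used in thes2lcsh.map is not described here.

```python
from typing import Dict


def urls_for_row(thes_id: str, lcsh_id: str, wikidata_id: str) -> Dict[str, str]:
    """Fill the three URL templates with the identifiers of one row."""
    return {
        "thes": f"http://thes.bncf.firenze.sbn.it/termine.php?id={thes_id}",
        "lcsh": f"http://id.loc.gov/authorities/subjects/{lcsh_id}.html",
        # wikidata_id is stored without the leading "Q", so it is added back here.
        "wikidata": f"http://www.wikidata.org/wiki/Q{wikidata_id}",
    }


# Made-up identifiers, for illustration only.
example = urls_for_row("123", "sh-example", "42")
```

The relation column is not needed to build the URLs, so the helper leaves it out.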
Inspired by this Gist by @atomotic.
This software is released under the MIT license. It is free software. (c) 2014 by Cristian Consonni