An Julia interface to ConceptNetNumberbatch
This package is a simple API to ConceptNetNumberbatch.
The following examples illustrate some common usage patterns:
julia> using ConceptnetNumberbatch, Languages
file_conceptnet = download_embeddings(url=CONCEPTNET_HDF5_LINK,
localfile="./_conceptnet_/conceptnet.h5");
# [ Info: Download ConceptNetNumberbatch to ./_conceptnet_/conceptnet.h5...
# % Total % Received % Xferd Average Speed Time Time Time Current
# Dload Upload Total Spent Left Speed
# 100 127M 100 127M 0 0 3646k 0 0:00:35 0:00:35 --:--:-- 4107k
# "./_conceptnet_/conceptnet.h5"
# Load embeddings
julia> conceptnet = load_embeddings(file_conceptnet, languages=:en)
# ConceptNet{Languages.English} (compressed): 1 language(s), 150875 embeddings
julia> conceptnet["apple"] # Get embeddings for a single word
# 300-element Array{Int8,1}:
# 0
# 0
# 1
# -4
# ...
julia> conceptnet[["apple", "pear", "cherry"]] # Get embeddings for multiple words
# 300×3 Array{Int8,2}:
# 0 0 0
# 0 0 0
# 1 1 1
# -4 -6 -7
# ...
# Load multiple languages
julia> conceptnet = load_embeddings(file_conceptnet, languages=[:en, :fr])
# ConceptNet{Language} (compressed): 2 language(s), 174184 embeddings
julia> conceptnet["apple"] # fails, language must be specified
# ERROR: ...
julia> [conceptnet[:en, "apple"] conceptnet[:fr, "poire"]]
# 300×2 Array{Int8,2}:
# 0 -2
# 0 -2
# 1 -2
# -4 -7
# ...
# Wildcard matching
julia> conceptnet[:en, "xxyyzish"] # returns embedding for "#####ish"
# 300×1 Array{Int8,2}:
# 5
# -1
# 0
# 1
# ...
# Useful functions
julia> length(conceptnet) # total number of embeddings for all languages
# 174184
julia> size(conceptnet) # embedding vector length, number of embeddings
# (300, 174184)
julia> "apple" in conceptnet # found in the English embeddings
# true
julia> "poire" in conceptnet # found in the French embeddings
# true
julia> # `keys` returns an iterator for all words
for word in Iterators.take(keys(conceptnet),3)
println(word)
end
# définie
# invités
# couvents
Document embedding is quite straightforward:
julia> doc = "embed this document containing X_#-s231 which cannot be embedded"
edoc, idxs_missed = embed_document(conceptnet, doc, language=Languages.English(), keep_size=false)
missed_words = tokenize_for_conceptnet(doc)[idx_missed]
println("Missed word: $missed_word")
edoc
# Missed word: SubString{String}["X_#-s231"]
# 300×8 Array{Int8,2}:
# 0 0 0 0 0 1 0 0
# -1 -2 -1 -1 -3 -2 -3 0
# 1 5 0 4 6 6 6 2
# ...
- for the best speed, the
HDF5
version should be used - the API is very fast for retrieving embeddings of single word exact matches
- it is also quite fast for retrieving embeddings of wildcard matches (
xyzabcish
is matched to######ish
) and multiple word expressions of arbitrary length (provided these are embedded) - the document embedding is slower (scales with document length)
The installation can be done through the usual channels (manually by cloning the repository or installing it though the julia REPL
).
This code has an MIT license and therefore it is free.
[1] ConceptNetNumberbatch GitHub homepage