clj-mecab

Clojure wrapper for the Japanese Morphological Analyzer MeCab.

About

A minimal wrapper around the SWIG-generated Java bindings for MeCab. Currently tested with all varieties of UniDic and IPAdic; support for other dictionaries is planned.

Prerequisites

clj-mecab requires MeCab (0.996) to be installed and on your path. The mecab-config binary is used to find your MeCab configuration.
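To confirm that mecab-config is visible before going further, something like the following minimal sketch may help. The function name check_mecab is hypothetical and not part of clj-mecab. It relies only on the --version and --dicdir options of mecab-config.

```shell
# Hypothetical helper (not part of clj-mecab): report whether mecab-config
# is on PATH and, if so, which version and dictionary directory it reports.
check_mecab() {
    if ! command -v mecab-config >/dev/null 2>&1; then
        echo "mecab-config not found on PATH" >&2
        return 1
    fi
    echo "MeCab version: $(mecab-config --version)"
    echo "Dictionary directory: $(mecab-config --dicdir)"
}
```

After defining the function, run check_mecab in the same shell. A non-zero exit status means mecab-config could not be found.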

Package manager

On Debian:

apt-get install mecab mecab-utils libmecab-java libmecab-jni unidic-mecab

On macOS:

brew install mecab mecab-unidic

Note that you will need to manually install Maven dependencies on macOS (see next section).

Maven dependencies

You also need the Java JNI (SWIG) bindings for your installed version of MeCab in your local Maven repository (~/.m2). To install them:

mvn install:install-file -DgroupId=org.chasen -DartifactId=mecab -Dpackaging=jar -Dversion=0.996 -Dfile=/usr/share/java/mecab/MeCab.jar -DgeneratePom=true

Here, /usr/share/java/mecab/MeCab.jar should point to the generated jar on your system.

You will also need to manually download cmecab-java and install it into your local Maven repo:

wget https://github.com/takscape/cmecab-java/releases/download/2.1.0/cmecab-java-2.1.0.tar.gz
tar xzf cmecab-java-2.1.0.tar.gz
mvn install:install-file -DgroupId=net.moraleboost.cmecab-java -DartifactId=cmecab-java -Dpackaging=jar -Dversion=2.1.0 -Dfile=cmecab-java-2.1.0/cmecab-java-2.1.0.jar -DgeneratePom=true
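To check that both install:install-file commands placed the jars where Maven expects them, a sketch like the following may be useful. The function name check_m2_jars is hypothetical. It assumes the default local repository at ~/.m2/repository and the standard groupId/artifactId/version directory layout; if your settings.xml points elsewhere, pass that root as the first argument.

```shell
# Hypothetical helper (not part of clj-mecab): check that both jars are in
# the local Maven repository. Assumes the default root ~/.m2/repository
# unless another root is passed as the first argument.
check_m2_jars() {
    repo="${1:-$HOME/.m2/repository}"
    rc=0
    for jar in \
        org/chasen/mecab/0.996/mecab-0.996.jar \
        net/moraleboost/cmecab-java/cmecab-java/2.1.0/cmecab-java-2.1.0.jar
    do
        if [ -f "$repo/$jar" ]; then
            echo "found:   $jar"
        else
            echo "missing: $jar"
            rc=1
        fi
    done
    return "$rc"
}
```

Run check_m2_jars with no arguments for the default location. It exits non-zero if either jar is missing.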

Manually building and installing MeCab

MeCab depends on CRF++, so first install that.

wget http://crfpp.googlecode.com/files/CRF%2B%2B-0.58.tar.gz
tar xzf CRF++-0.58.tar.gz
cd CRF++-0.58 && ./configure && make -j4 && make install && cd ..

Next, install MeCab.

wget http://mecab.googlecode.com/files/mecab-0.996.tar.gz
tar xzf mecab-0.996.tar.gz
cd mecab-0.996 && ./configure --with-charset=utf8 --enable-utf8-only && make -j4 && make install && cd ..

And at least one dictionary:

  • IPAdic:

    wget http://mecab.googlecode.com/files/mecab-ipadic-2.7.0-20070801.tar.gz
    tar xzf mecab-ipadic-2.7.0-20070801.tar.gz
    cd mecab-ipadic-2.7.0-20070801 && ./configure --with-charset=utf8 && make -j4 && make install && cd ..
  • UniDic:

    curl -O https://unidic.ninjal.ac.jp/unidic_archive/cwj/2.3.0/unidic-cwj-2.3.0.zip
    unzip unidic-cwj-2.3.0.zip
    cd unidic-cwj-2.3.0 && install -d $(mecab-config --dicdir)/unidic-cwj && install -m 644 dicrc *.bin *.dic $(mecab-config --dicdir)/unidic-cwj && cd ..
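Once a dictionary is installed, it may be worth checking that MeCab can load it before moving on to Clojure. The sketch below is only an illustration: try_dictionary is a hypothetical helper, and the directory name passed to it (for example unidic-cwj, as used in the install command above) is assumed to exist under the directory reported by mecab-config --dicdir. It uses the -d option of the mecab command to select the dictionary.

```shell
# Hypothetical helper (not part of clj-mecab): analyze a sample sentence
# with the dictionary directory named "$1" under $(mecab-config --dicdir).
try_dictionary() {
    dic="$(mecab-config --dicdir)/$1"
    if [ ! -d "$dic" ]; then
        echo "dictionary not found: $dic" >&2
        return 1
    fi
    # -d selects the dictionary directory for this invocation.
    echo "こんにちは、世界!" | mecab -d "$dic"
}
```

For example, try_dictionary unidic-cwj should print one analyzed token per line if the UniDic install above succeeded.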

Usage

Interactive use:

(require '[clj-mecab.parse :as mecab])
(mecab/parse-sentence "こんにちは、世界!")

[{:orth "こんにちは", :f-type "*", :i-type "*", ...} {:orth "", :f-type "*", :i-type "*", ...} {:orth "世界", :f-type "*", :i-type "*", ...} ...]

BUGS

  • For a reason not yet known, calling .getSurface on a Node object returns an empty string the first time and only works the second time. Currently this means that :orth is not generated when using IPAdic. UniDic provides the surface form in the features array and is unaffected. This is probably the same issue as taku910/mecab#26.

License

Copyright © 2013-2020 Bor Hodošček

Distributed under the Eclipse Public License, the same as Clojure, as well as the 3-clause BSD license.
