Utility scripts and libraries for various natural language processing tasks.
- charfreq.awk: calculate character frequency.
- convcat.py: concatenate files with different encodings.
- csvcol.py: get specified columns of CSV files.
- csvsql.py: convert a CSV file to an SQL definition.
- dbsort.tcl: sort SQLite tables in place.
- detokenizer.py: detokenize Chinese text.
- dump2db.py: make a database from leaked password dumps.
- epubzhconv.py: Chinese variant conversion for EPUB books.
- filtermd5.py: remove MD5s not in a known list.
- findbadlines.py: find encoding errors in stdin.
- gbk_pua.py: convert PUA codes in GBK to Unicode.
- getautodesk.py: get Moses-format parallel text from the Autodesk corpus.
- gettxtcollection.py: merge a txt file collection into one large corpus.
- haodoo: crawl and download all books from haodoo.net.
- iconv.py: implements iconv.
- iso639.json, iso639-to-calibre.py: get ISO 639 codes from Wikipedia and convert them to a calibre po file.
- jiebazhc: tokenize Classical Chinese using jieba.
- libpinyin_bopomofo.py: decorator to use with python-pinyin, to convert Pinyin to Bopomofo. (now useless)
- ngramfreq.awk: calculate n-gram character frequency.
- num2chinese.py: convert numbers to Chinese numerals.
- phrasecombine.py: combine split words into larger phrases given a dictionary.
- pwdsort.js, zxcvbn.js: print out password strength according to zxcvbn.
- pgexplaindot.py: output a Graphviz dot file for EXPLAIN (FORMAT JSON).
- pgviewdep.tcl: output a Graphviz dot file representing view dependencies in a PostgreSQL database.
- rmdup.c: remove duplicate lines without sorting (compile with make; needs libxxhash-dev).
- simpdump.py: try to find username, email, password and hash from leaked password dumps.
- splitrecutfilter.py: read stdin, filter out non-Chinese sentences, and cut the text into sentences and words.
- tatoeba: convert Tatoeba dumps to a SQLite3 database.
- wordfreq.awk: calculate word frequency.
- WWStarClone.py: clone of WWStar, an ancient Classical Chinese translator.
- zhutil.py: misc. utils for processing Chinese.
- modelzh.json: model to detect Classical/Modern Chinese.
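
To illustrate what the character frequency scripts compute, here is a minimal Python sketch of n-gram character counting. It is not a port of ngramfreq.awk: the function name `ngram_freq`, the choice to never let an n-gram span a line break, and the sample input are all assumptions made for this example.

```python
from collections import Counter


def ngram_freq(lines, n=2):
    """Count character n-grams in an iterable of lines.

    Hypothetical helper for illustration only; n-grams are counted
    within each line and never cross a line boundary.
    """
    counts = Counter()
    for line in lines:
        text = line.rstrip("\n")
        # Slide a window of width n over the line.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Made-up sample input; print the three most common bigrams.
sample = ["天地玄黃", "宇宙洪荒"]
for gram, count in ngram_freq(sample, n=2).most_common(3):
    print(gram, count)
```

With `n=1` the same function gives plain character frequency, which is the task charfreq.awk is described as handling.
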
Unless otherwise noted in a file, all files are licensed under the MIT License.