Indonesian NLP resources

Language modeling

Kompas online collection. This corpus contains Kompas online news articles from 2001-2002. See here for more info and citations.
Tempo online collection. This corpus contains Tempo online news articles from 2000-2002. See here for more info and citations.

POS tagging

PANL10N POS tagging. This corpus has 39K sentences and 900K word tokens.
IDN tagged corpus. This corpus contains 10K sentences and 250K word tokens. The POS tags are annotated manually.

Syntactic parsing

Indonesian Treebank. This corpus contains 1K parsed sentences. (constituency parsing)
UD Indonesian. This corpus is provided by Universal Dependencies. Training, development, and test splits are already provided. (dependency parsing)
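
Universal Dependencies treebanks are distributed in the CoNLL-U format (one token per line, 10 tab-separated columns, blank line between sentences). As a rough sketch of how the UD Indonesian splits could be read without extra dependencies, the snippet below parses CoNLL-U text into sentences. The `Token` type, the `read_conllu` helper, and the three-word sample sentence are illustrative assumptions and are not taken from the treebank itself.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """A subset of the 10 CoNLL-U columns (hypothetical helper type)."""
    id: int
    form: str
    upos: str
    head: int
    deprel: str


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-U formatted lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # ignore malformed lines in this sketch
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
            continue
        sentence.append(
            Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7])
        )
    if sentence:
        yield sentence


# Made-up example sentence, only for illustration.
SAMPLE = (
    "# text = Saya makan nasi\n"
    "1\tSaya\tsaya\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tmakan\tmakan\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tnasi\tnasi\tNOUN\t_\t_\t2\tobj\t_\t_\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
for tok in sentences[0]:
    print(tok.id, tok.form, tok.upos, tok.head, tok.deprel)
```

To read a real split, pass an open file handle instead of `SAMPLE.splitlines()`; the file names depend on the treebank release and are not assumed here.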

Machine translation

PANL10N EN-ID news parallel corpus. This corpus has sentences from news articles from several categories: economy (6K sentences), international (6K sentences), science (6K sentences), and sport (4K sentences).
PANL10N Indonesian translation of Penn Treebank. This corpus contains an Indonesian translation of the Penn Treebank, 24K sentences in total.

Text Summarization

IndoSum. A collection of 20K online news article-summary pairs belonging to 6 categories and 10 sources. It has both abstractive summaries and extractive labels.

Text Classification

SMS Spam. This corpus contains 1,143 sentences, each labeled as one of three classes: normal message, fraud, or promotion. It is provided by http://nlp.yuliadi.pro/dataset

Speech recognition

TITML-IDN speech corpus. The corpus contains 20 speakers (11 male and 9 female), each of whom speaks 343 utterances. The utterances are phonetically balanced.

The corpus itself is free for academic/non-commercial use, but interested parties should make a formal request via email to the institution. The procedure is listed here.
frankydotid/Indonesian-Speech-Recognition. A small corpus of 50 utterances by a single male speaker.
CMU Wilderness Multilingual Speech Dataset. A dataset covering over 700 different languages, providing audio, aligned texts, and word pronunciations. One of the languages is Indonesian. The utterances are Bible readings recorded by bible.is.
