DiscourseConstituencyInduction-ViterbiEM

This is an implementation of the unsupervised discourse constituency parser described in the following paper:

Noriki Nishida and Hideki Nakayama. 2020. Unsupervised Discourse Constituency Parsing Using Viterbi EM. Transactions of the Association for Computational Linguistics, vol.8, pp.215-230.

Task Definition

Unsupervised discourse constituency parsing based on Rhetorical Structure Theory
Input: EDUs, syntactic features, sentence/paragraph boundaries
Output: Unlabeled RST-style constituent tree
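
To make the output format concrete, here is a minimal sketch of an unlabeled binary constituent tree over EDU indices. The nested-tuple representation and the helper names (`right_branching`, `collect_spans`) are illustrative assumptions, not the tree format used by this repository or by treetk.

```python
from typing import List, Tuple, Union

# Hypothetical representation (an assumption for illustration):
# a leaf is an EDU index, an internal node is a (left, right) pair.
Tree = Union[int, Tuple["Tree", "Tree"]]
Span = Tuple[int, int]


def right_branching(start: int, end: int) -> Tree:
    """Build a right-branching binary tree over EDUs start..end (inclusive)."""
    if start == end:
        return start
    return (start, right_branching(start + 1, end))


def collect_spans(tree: Tree) -> Tuple[int, int, List[Span]]:
    """Return (first_edu, last_edu, spans of all internal nodes, top-down)."""
    if isinstance(tree, int):
        return tree, tree, []
    left, right = tree
    first, _, left_spans = collect_spans(left)
    _, last, right_spans = collect_spans(right)
    return first, last, [(first, last)] + left_spans + right_spans


if __name__ == "__main__":
    tree = right_branching(0, 3)
    print(tree)                    # (0, (1, (2, 3)))
    print(collect_spans(tree)[2])  # [(0, 3), (1, 3), (2, 3)]
```

Because the tree is unlabeled, its set of internal-node spans fully describes it, and that set is what span-based evaluation compares.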

Setup

Requirements

numpy
spacy >= 2.1.9
chainer >= 6.1.0
multiset
jsonlines
pyprind
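
Before going further, it may help to confirm that the packages above are importable in the current environment. The sketch below only checks that each module can be found; it does not verify the minimum versions listed for spacy and chainer. The helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Import names of the packages listed under Requirements.
REQUIRED = ["numpy", "spacy", "chainer", "multiset", "jsonlines", "pyprind"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```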

Clone this repository and create directories to store preprocessed data and outputs

$ git clone https://github.com/norikinishida/DiscourseConstituencyInduction-ViterbiEM
$ cd ./DiscourseConstituencyInduction-ViterbiEM
$ mkdir ./data
$ mkdir ./results

Edit `./run_preprocessing.sh` as follows:

STORAGE=./data

Edit `./config/path.ini` as follows:

data = "./data"
results = "./results"
pretrained_word_embeddings = "/path/to/your/pretrained_word_embeddings"
rstdt = "/path/to/rst_discourse_treebank/data/RSTtrees-WSJ-main-1.0"
ptbwsj = "/path/to/LDC99T42/treebank_3/raw/wsj"
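
A quick way to catch typos in these paths is to check that each one exists before running preprocessing. The sketch below is independent of however the repository itself loads `path.ini`. It assumes the file is a flat list of `key = "value"` lines as shown above (no section header), and the helper names `read_flat_ini` and `missing_paths` are hypothetical.

```python
import configparser
from pathlib import Path


def read_flat_ini(text: str) -> dict:
    """Parse flat `key = "value"` lines into a dict, stripping the quotes.

    Assumption: the text has no section header, so a dummy one is prepended
    because configparser requires it.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string("[paths]\n" + text)
    return {key: value.strip('"') for key, value in parser["paths"].items()}


def missing_paths(paths: dict) -> list:
    """Return the keys whose configured path does not exist on disk."""
    return [key for key, value in paths.items() if not Path(value).exists()]


if __name__ == "__main__":
    config_file = Path("./config/path.ini")
    if config_file.exists():
        paths = read_flat_ini(config_file.read_text(encoding="utf-8"))
        for key in missing_paths(paths):
            print(f"Path not found for '{key}': {paths[key]}")
    else:
        print("./config/path.ini not found; run this from the repository root.")
```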

Install additional packages and clone other libraries

$ mkdir ./tmp
$ cd ./tmp
$ pip install pandas
$ pip install scikit-learn
$ pip install gensim
$ pip install nltk
$ git clone https://github.com/norikinishida/utils.git
$ git clone https://github.com/norikinishida/treetk.git
$ cp -r ./utils/utils ..
$ cp -r ./treetk/treetk ..
$ cd ..

Preprocessing

./run_preprocessing.sh

The following directories will be generated:
- ./data/rstdt/wsj/{train,test} (preprocessed RST-DT)
- ./data/ptbwsj_wo_rstdt (preprocessed PTB-WSJ)
- ./data/rstdt-vocab (vocabularies)

NOTE (2020/05/11): We rewrote the preprocessing from scratch using spaCy, which makes the code much simpler than the previous version.
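
To confirm that preprocessing finished, the directories listed above can be checked with a short script. This is a minimal sketch that assumes `STORAGE=./data` as configured earlier; the helper name `missing_dirs` is hypothetical, and the script checks only that the directories exist, not their contents.

```python
from pathlib import Path

# Directories that the README says preprocessing generates, relative to STORAGE.
EXPECTED = [
    "rstdt/wsj/train",
    "rstdt/wsj/test",
    "ptbwsj_wo_rstdt",
    "rstdt-vocab",
]


def missing_dirs(storage):
    """Return the expected sub-directories that are absent under `storage`."""
    root = Path(storage)
    return [rel for rel in EXPECTED if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("./data")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All preprocessed directories are present.")
```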

Training

Training data: RST-DT training set

python main.py --gpu 0 --model spanbasedmodel2 --initial_tree_sampling RB2_RB_LB --config ./config/hyperparams_2.ini --name trial1 --actiontype train --max_epoch 15

The following files will be generated:
- ./results/spanbasedmodel2.RB2_RB_LB.hyperparams_2.aug_False.trial1.training.log
- ./results/spanbasedmodel2.RB2_RB_LB.hyperparams_2.aug_False.trial1.training.jsonl
- ./results/spanbasedmodel2.RB2_RB_LB.hyperparams_2.aug_False.trial1.model
- ./results/spanbasedmodel2.RB2_RB_LB.hyperparams_2.aug_False.trial1.valid_pred.ctrees (optional)
- ./results/spanbasedmodel2.RB2_RB_LB.hyperparams_2.aug_False.trial1.valid_gold.ctrees (optional)
- ./results/spanbasedmodel2.RB2_RB_LB.hyperparams_2.aug_False.trial1.validation.jsonl (optional)
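
The file names above appear to follow the pattern `{model}.{initial_tree_sampling}.{config name}.aug_{flag}.{name}`. The sketch below rebuilds that prefix from the command-line values and reads a `.jsonl` log line by line. The pattern is inferred from the listed file names only, the meaning of the `aug_` field is an assumption, and the keys inside the log records are not documented here, so the reader returns them as plain dictionaries.

```python
import json
from pathlib import Path


def result_prefix(model, initial_tree_sampling, config_path, name, augmentation=False):
    """Rebuild the output-file prefix (pattern inferred from the README listing)."""
    config_name = Path(config_path).stem  # "./config/hyperparams_2.ini" -> "hyperparams_2"
    return f"{model}.{initial_tree_sampling}.{config_name}.aug_{augmentation}.{name}"


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    prefix = result_prefix(
        "spanbasedmodel2", "RB2_RB_LB", "./config/hyperparams_2.ini", "trial1"
    )
    log_path = Path("./results") / (prefix + ".training.jsonl")
    print(log_path)
    if log_path.exists():
        records = read_jsonl(log_path)
        print(f"{len(records)} records; last record: {records[-1] if records else None}")
```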

Evaluation

Metrics: RST PARSEVAL by Morey et al. (2018)
Test data: RST-DT test set

python main.py --gpu 0 --model spanbasedmodel2 --initial_tree_sampling RB2_RB_LB --config ./config/hyperparams_2.ini --name trial1 --actiontype evaluate

The following files will be generated:
- ./results/spanbasedmodel2.RB2_RB_LB.hyperparams_2.aug_False.trial1.evaluation.ctrees
- ./results/spanbasedmodel2.RB2_RB_LB.hyperparams_2.aug_False.trial1.evaluation.json
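
For intuition about span-based evaluation of unlabeled trees, the sketch below computes micro-averaged precision, recall, and F1 over the spans of predicted and gold trees. It is a simplified illustration, not the repository's evaluation code, and it does not reproduce the exact protocol of Morey et al. (2018), for example which spans are counted. Representing each tree as a list of `(first_edu, last_edu)` pairs and the function name `span_scores` are assumptions.

```python
from collections import Counter


def span_scores(pred_docs, gold_docs):
    """Micro-averaged unlabeled span precision, recall, and F1.

    Each argument is a list of documents; each document is a list of
    (first_edu, last_edu) spans taken from its tree.
    """
    matched = n_pred = n_gold = 0
    for pred_spans, gold_spans in zip(pred_docs, gold_docs):
        pred, gold = Counter(pred_spans), Counter(gold_spans)
        matched += sum((pred & gold).values())  # multiset intersection
        n_pred += sum(pred.values())
        n_gold += sum(gold.values())
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # One document with four EDUs: right-branching prediction vs. left-branching gold.
    pred = [[(0, 3), (1, 3), (2, 3)]]
    gold = [[(0, 3), (0, 2), (0, 1)]]
    print(span_scores(pred, gold))
```

In this example only the root span `(0, 3)` matches, so one of three spans is correct on each side.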

Citation

If you use the code in research publications, please cite:

@article{nishida2020unsupervised,
    author={Nishida, Noriki and Nakayama, Hideki},
    title={Unsupervised Discourse Constituency Parsing Using Viterbi EM},
    journal={Transactions of the Association for Computational Linguistics},
    volume={8},
    number={},
    pages={215-230},
    year={2020},
    doi={10.1162/tacl\_a\_00312},
    URL={https://doi.org/10.1162/tacl_a_00312},
}
