CRF chunking with word representations
--------------------------------------
scripts and steps by Joseph Turian
A standard baseline in NLP is chunking (shallow parsing), which was the
CoNLL 2000 shared task. Training a CRF is a standard approach to this task
(Sha + Pereira 2003). In fact, many CRF implementations include
instructions for reimplementing the Sha+Pereira chunker with identical
features:
crfsgd: http://leon.bottou.org/projects/sgd
crf++: http://crfpp.sourceforge.net/
CRFsuite: http://www.chokkan.org/software/crfsuite/
We use CRFsuite because it makes it simple to modify the feature
generation code, so one can easily add new features.
We provide instructions and scripts for adding word representations
(Brown clusters and/or word embeddings) to the training.
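The idea, in sketch form: each token's feature set is extended with
features derived from its word representation. The snippet below is
illustrative only (the function names, feature-name strings, and the
prefix lengths are made up for this example, not taken from the actual
feature-generation code in this repository); it assumes Brown clusters
map words to bit-strings and embeddings map words to float vectors.

```python
# Illustrative sketch, NOT the repository's feature-generation code.
# Assumes: `clusters` maps lowercased word -> Brown cluster bit-string,
# `embeddings` maps lowercased word -> list of floats.

def brown_features(word, clusters, prefixes=(4, 6, 10, 20)):
    """Emit one feature per bit-string prefix of the word's Brown cluster.
    Using several prefix lengths gives features at several granularities."""
    bits = clusters.get(word.lower())
    if bits is None:
        return []  # out-of-vocabulary word: no cluster features
    return ["brown-%d=%s" % (p, bits[:p]) for p in prefixes]

def embedding_features(word, embeddings, scale=1.0):
    """Emit one real-valued feature per embedding dimension."""
    vec = embeddings.get(word.lower())
    if vec is None:
        return []  # out-of-vocabulary word: no embedding features
    return [("embed-%d" % i, scale * v) for i, v in enumerate(vec)]
```

A scaling factor on the embedding features (here `scale`) is one of the
knobs that typically needs tuning when mixing real-valued features with
the standard binary chunking features.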
INSTALLATION:
-------------
Download and install CRFsuite: http://www.chokkan.org/software/crfsuite/
You will need my common Python library:
http://github.com/turian/common
Go into data/ and download the CoNLL train and test files:
cd data/
wget http://www.cnts.ua.ac.be/conll2000/chunking/train.txt.gz
wget http://www.cnts.ua.ac.be/conll2000/chunking/test.txt.gz
gunzip *.gz
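The downloaded files are in the standard CoNLL-2000 column format: one
token per line as "word POS chunk-tag", with a blank line between
sentences. A minimal reader, should you want to inspect or preprocess
the data yourself (this helper is not part of the repository's scripts):

```python
def read_conll2000(lines):
    """Parse CoNLL-2000 chunking data: one 'word POS chunk-tag' triple
    per line, sentences separated by blank lines. Yields each sentence
    as a list of (word, pos, chunk) tuples."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            word, pos, chunk = line.split()
            sentence.append((word, pos, chunk))
    if sentence:  # handle a file with no trailing blank line
        yield sentence
```

For example, feeding it `open("data/train.txt")` yields sentences like
`[("Confidence", "NN", "B-NP"), ("in", "IN", "B-PP"), ...]`.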
Download word representations:
cd representations/
wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c100-freq1.txt
wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c320-freq1.txt
wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c1000-freq1.txt
wget http://pylearn.org/turian/brown-clusters/brown-rcv1.clean.tokenized-CoNLL03.txt-c3200-freq1.txt
wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-1750000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d200.txt.gz
wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2030000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.EMBEDDING_SIZE%3d100.txt.gz
wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2270000000.LEARNING_RATE%3d1e-09.EMBEDDING_LEARNING_RATE%3d1e-06.txt.gz
wget http://pylearn.org/turian/embeddings-ACL2010-20100116-redo-baseline-with-100dims/model-2280000000.LEARNING_RATE%3d1e-08.EMBEDDING_LEARNING_RATE%3d1e-07.EMBEDDING_SIZE%3d25.txt.gz
ln -s model-1750000000.LEARNING_RATE=1e-09.EMBEDDING_LEARNING_RATE=1e-06.EMBEDDING_SIZE=200.txt.gz cw-embeddings-200dim.txt.gz
ln -s model-2030000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.EMBEDDING_SIZE\=100.txt.gz cw-embeddings-100dim.txt.gz
ln -s model-2270000000.LEARNING_RATE\=1e-09.EMBEDDING_LEARNING_RATE\=1e-06.txt.gz cw-embeddings-50dim.txt.gz
ln -s model-2280000000.LEARNING_RATE\=1e-08.EMBEDDING_LEARNING_RATE\=1e-07.EMBEDDING_SIZE\=25.txt.gz cw-embeddings-25dim.txt.gz
wget http://pylearn.org/turian/hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
wget http://pylearn.org/turian/hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz
ln -s hlbl_reps_clean_1.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-100dim.txt.gz
ln -s hlbl_reps_clean_2.50d.rcv1.clean.tokenized-CoNLL03.case-intact.txt.gz hlbl-embeddings-50dim.txt.gz
BATCH EVALUATIONS
-----------------
./scripts/train-and-evaluate.py --name baseline --dev --l2 2
WARNING: Every time you change the --features parameter, you should also
change the --name, so that runs with different features do not clobber
each other's output.
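One simple way to respect that warning when sweeping a hyperparameter is
to encode its value into the run name. A hypothetical sketch (the
`--name`/`--dev`/`--l2` flags come from the example above; everything
else here, including the particular l2 grid, is illustrative):

```python
def sweep_commands(l2_values, base_name="baseline"):
    """Build one train-and-evaluate.py command per l2 value, encoding
    the penalty into --name so runs don't overwrite each other."""
    cmds = []
    for l2 in l2_values:
        name = "%s-l2-%s" % (base_name, l2)
        cmds.append(["./scripts/train-and-evaluate.py",
                     "--name", name, "--dev", "--l2", str(l2)])
    return cmds

if __name__ == "__main__":
    import subprocess
    for cmd in sweep_commands([0.32, 1, 2, 3.2, 10]):
        subprocess.check_call(cmd)
```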
NOTE:
-----
CRFsuite has benchmark results on the CoNLL shared task:
http://www.chokkan.org/software/crfsuite/benchmark.html
However, I did not achieve a comparable F1 score on the CoNLL
test set until I used the following parameters:
Dev F1 Test F1 params
94.04 93.63 l2=2
94.03 93.65 l2=3.2, possible_transitions=1
94.15 93.73 l2=3.2, possible_transitions=1, possible_states=1
94.16 93.79 SGD, l2=3.2, possible_transitions=1, possible_states=1
I chose the l2 penalty on the dev set, which was a subset of the
training data.
I then used this l2 penalty and trained over the entire training set.