This project is based on work by Yiming Jiang, who wrote the initial version and evaluated CCG and Stanford tokenizers against a corpus drawn from OntoNotes.
It has been modified in the following way: additional classes were written to use cogcomp-core-utilities data structures. The underlying tokenizer is still the LBJava SentenceSplitter. The evaluation code has not been updated to use the new IllinoisTokenizer class (TODO).
The edu.illinois.cs.cogcomp.annotation.TextAnnotationBuilder interface from cogcomp-core-utilities is used to create a TextAnnotation object with SENTENCE and TOKEN views. The builder can be provided as a constructor argument to a CachingAnnotatorService that uses other annotators in pipeline fashion.
The StanfordTokenizer requires using a Java 8 runtime.
OntoNotesParser:
- input: a single OntoNotes file
- output: JSON file

OntoNotesJsonReader:
- input: JSON file generated by OntoNotesParser
- output: Curator Record data structure

IllinoisTokenizer:
- input: array of sentence strings that together make up one article
- output: Curator Record data structure

StanfordTokenizer:
- input: array of sentence strings that together make up one article
- output: Curator Record data structure

Evaluator:
- takes a gold standard Record and a sample Record

Evaluation Criteria:
- ON_GOLD_STANDARD_AGAINST_SAMPLE: iterate over each gold standard token and check whether it appears among the sample tokens
- ON_SAMPLE_AGAINST_GOLD_STANDARD: iterate over each sample token and check whether it appears among the gold standard tokens
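
The two criteria differ only in which token list is iterated and which one is searched. The following is a minimal, self-contained sketch of that idea, not the project's Evaluator. The `Tok` and `TokenOverlap` classes are hypothetical, and treating two tokens as equal only when both text and character offsets match is an assumption; the actual matching rule may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for a token: surface text plus character offsets.
final class Tok {
    final String text;
    final int start;
    final int end;

    Tok(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }

    // Assumption: tokens match only if text and both offsets are identical.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Tok)) return false;
        Tok t = (Tok) o;
        return start == t.start && end == t.end && text.equals(t.text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(text, start, end);
    }
}

final class TokenOverlap {
    // Fraction of tokens in `from` that also occur in `in`.
    // Gold-against-sample: fractionFound(gold, sample).
    // Sample-against-gold: fractionFound(sample, gold).
    static double fractionFound(List<Tok> from, List<Tok> in) {
        if (from.isEmpty()) return 0.0;
        Set<Tok> lookup = new HashSet<>(in);
        int hits = 0;
        for (Tok t : from) {
            if (lookup.contains(t)) hits++;
        }
        return (double) hits / from.size();
    }

    public static void main(String[] args) {
        // Text "don't go": gold splits the contraction, the sample does not.
        List<Tok> gold = Arrays.asList(
                new Tok("do", 0, 2), new Tok("n't", 2, 5), new Tok("go", 6, 8));
        List<Tok> sample = Arrays.asList(
                new Tok("don't", 0, 5), new Tok("go", 6, 8));

        System.out.println("gold against sample: " + fractionFound(gold, sample));
        System.out.println("sample against gold: " + fractionFound(sample, gold));
    }
}
```

In this example one of the three gold tokens is found in the sample, while one of the two sample tokens is found in the gold standard, so the two directions give different scores for the same pair of tokenizations.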
Each JSON file is generated by OntoNotesParser and corresponds to one OntoNotes file.
The JSON file has the following format:
{
  "sentences": [
    {
      "sentence_text":
      "sentence_start_offset":
      "sentence_end_offset":
      "tokens": [
        {
          "token_text":
          "token_start_offset":
          "token_end_offset":
        },
        ...
      ]
    },
    ...
  ]
}
Format:
- "sentences" has an array of sentences, with sentence text and offsets.
- "tokens" has an array of tokens, with token text and offsets.
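
To make the nesting concrete, here is a small sketch that builds a document in this shape using plain Java with no JSON library. The classes `JsonToken`, `JsonSentence`, and `JsonSketch` are hypothetical and are not part of this project. The value types (strings for text, integers for offsets) and the use of exclusive end offsets are assumptions, since the format above does not specify them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mirror of one entry in a "tokens" array.
final class JsonToken {
    final String text;
    final int start;
    final int end;

    JsonToken(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }

    String toJson() {
        return "{\"token_text\": " + JsonSketch.quote(text)
                + ", \"token_start_offset\": " + start
                + ", \"token_end_offset\": " + end + "}";
    }
}

// Hypothetical mirror of one entry in the "sentences" array.
final class JsonSentence {
    final String text;
    final int start;
    final int end;
    final List<JsonToken> tokens = new ArrayList<>();

    JsonSentence(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }

    String toJson() {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"sentence_text\": ").append(JsonSketch.quote(text))
          .append(", \"sentence_start_offset\": ").append(start)
          .append(", \"sentence_end_offset\": ").append(end)
          .append(", \"tokens\": [");
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(tokens.get(i).toJson());
        }
        return sb.append("]}").toString();
    }
}

final class JsonSketch {
    // Minimal escaping: backslashes and double quotes only (no control characters).
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static String documentToJson(List<JsonSentence> sentences) {
        StringBuilder sb = new StringBuilder("{\"sentences\": [");
        for (int i = 0; i < sentences.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(sentences.get(i).toJson());
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        // Assumed convention: end offsets are exclusive character positions.
        JsonSentence s = new JsonSentence("Hi there", 0, 8);
        s.tokens.add(new JsonToken("Hi", 0, 2));
        s.tokens.add(new JsonToken("there", 3, 8));

        List<JsonSentence> doc = new ArrayList<>();
        doc.add(s);
        System.out.println(documentToJson(doc));
    }
}
```

A real implementation would normally use a JSON library instead of string concatenation; the hand-written serializer here only keeps the sketch free of dependencies.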
// OntoNotesParser: convert a single OntoNotes file to JSON
OntoNotesParser parser = new OntoNotesParser("wsj_0089.onf");
parser.writeToFileInJson("json_output.txt");

// OntoNotesJsonReader: load the JSON into a Curator Record
OntoNotesJsonReader reader = new OntoNotesJsonReader("json_output.txt");
reader.parseIntoCuratorRecord();

// IllinoisTokenizer: tokenize the raw sentence strings of one article
OntoNotesJsonReader reader = new OntoNotesJsonReader("json_output.txt");
ArrayList<String> rawTexts = reader.getRawTexts();
IllinoisTokenizer illinoisTokenizer = new IllinoisTokenizer(rawTexts);

// StanfordTokenizer: tokenize the same raw sentence strings
OntoNotesJsonReader reader = new OntoNotesJsonReader("json_output.txt");
ArrayList<String> rawTexts = reader.getRawTexts();
StanfordTokenizer stanfordTokenizer = new StanfordTokenizer(rawTexts);

// Evaluator: evaluate each tokenizer under the chosen criterion
Evaluator evaluator = new Evaluator();
evaluator.evaluateIllinoisTokenizer(EvaluationCriteria.ON_SAMPLE_AGAINST_GOLD_STANDARD);
evaluator.evaluateStanfordTokenizer(EvaluationCriteria.ON_SAMPLE_AGAINST_GOLD_STANDARD);
- Developed by: Yiming Jiang
- Advised by: Professor Dan Roth
- Mentored by: Mark Sammons
## Citation
If you use this code in your research, please provide the URL for this GitHub repository in the relevant publications. Thank you for citing us if you use our work!