This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Add forgotten unicode punctuation normalization to get_ende_bleu.
PiperOrigin-RevId: 191758943
Lukasz Kaiser authored and Ryan Sepassi committed Apr 5, 2018
1 parent bca81be commit fc9335c
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion tensor2tensor/utils/get_ende_bleu.sh
@@ -5,8 +5,11 @@ tok_gold_targets=newstest2013.tok.de
 
 decodes_file=$1
 
+# Replace unicode.
+perl $mosesdecoder/scripts/tokenizer/replace-unicode-punctuation.perl -l de < $decodes_file > $decodes_file.n
+
 # Tokenize.
-perl $mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < $decodes_file > $decodes_file.tok
+perl $mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < $decodes_file.n > $decodes_file.tok
 
 # Put compounds in ATAT format (comparable to papers like GNMT, ConvS2S).
 # See https://nlp.stanford.edu/projects/nmt/ :
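For readers unfamiliar with the normalization step this change adds before tokenization, the sketch below illustrates the general idea on a single line of text. It is not the Moses script: `normalize_punct` is a hypothetical stand-in covering only four substitutions picked for illustration, and the real `replace-unicode-punctuation.perl` handles many more characters with rules that may differ.

```shell
#!/bin/sh
# Hypothetical stand-in for replace-unicode-punctuation.perl (illustration only).
# Maps a few Unicode quotes and the en dash to ASCII equivalents so that
# later tokenization sees plain ASCII punctuation.
normalize_punct() {
  sed -e 's/„/"/g' \
      -e 's/“/"/g' \
      -e 's/”/"/g' \
      -e 's/–/-/g'
}

# Example: German-style quotes and an en dash become ASCII punctuation.
printf '%s\n' '„Hallo“ – Welt' | normalize_punct
# prints: "Hallo" - Welt
```

In the actual script the normalized output is written to `$decodes_file.n`, and the tokenizer then reads that file instead of the raw decodes.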