This is a Java port of Jeff Pasternack's C# code from Learning Better Transliterations.
See examples in TestTransliteration or Runner.
To train a model, you need pairs of names. A common source is Wikipedia interlanguage links. For example, see this data from Transliterating From All Languages by Anne Irvine et al.
The standard data format expected is:
foreign<tab>english
That said, the Utils class has readers for many different datasets (including Anne Irvine's data).
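For data that does not match one of the existing readers, the tab-separated format above is simple to parse by hand. The following is a minimal sketch, not the library's own reader: PairFileSketch, NamePair and parsePairs are hypothetical names introduced only for illustration, and NamePair merely stands in for the library's Example class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of this library: parses "foreign<tab>english" lines.
final class PairFileSketch {

    // Simple holder for one name pair (a stand-in for the library's Example class).
    static final class NamePair {
        final String foreign;
        final String english;

        NamePair(String foreign, String english) {
            this.foreign = foreign;
            this.english = english;
        }
    }

    // Converts raw lines into pairs. Blank lines and lines that do not
    // contain exactly two tab-separated fields are skipped.
    static List<NamePair> parsePairs(List<String> lines) {
        List<NamePair> pairs = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t");
            if (fields.length != 2) {
                continue;
            }
            pairs.add(new NamePair(fields[0].trim(), fields[1].trim()));
        }
        return pairs;
    }
}
```

The input lines could come from java.nio.file.Files.readAllLines on a training file. Silently skipping malformed lines is a design choice for this sketch; a stricter reader might report them instead.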
The standard class is the SPModel. Use it as follows:
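// Load the training pairs.
List<Example> training = Utils.readWikiData(trainfile);
// Build a model from the pairs and train it.
SPModel model = new SPModel(training);
model.Train(10);
// Save the trained model to modelfile.
model.WriteProbs(modelfile);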
This will train a model and write it to the path specified by modelfile.
SPModel has another useful function called Probability(source, target), which returns the transliteration probability of a given pair.
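One use for a pair probability is choosing among a fixed set of candidate spellings. The sketch below ranks candidates by descending score. It is written under assumptions: RerankSketch and rank are hypothetical names, and the scorer argument is a stand-in for a call such as model.Probability(source, target), whose return type is assumed here to be a double.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper, not part of this library.
final class RerankSketch {

    // Returns a copy of the candidates sorted from highest to lowest score.
    // The scorer takes (source, target) and returns a pair score; it stands in
    // for something like model.Probability(source, target).
    static List<String> rank(String source,
                             List<String> candidates,
                             ToDoubleBiFunction<String, String> scorer) {
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingDouble(
                (String target) -> scorer.applyAsDouble(source, target)).reversed());
        return ranked;
    }
}
```

The sort is stable, so candidates with equal scores keep their original relative order.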
A trained model can be used immediately after training, or you can initialize SPModel using a previously trained and saved modelfile.
// Load a previously saved model.
SPModel model = new SPModel(modelfile);
// Return at most 10 candidates per query.
model.setMaxCandidates(10);
TopList<Double,String> predictions = model.Generate(testexample);
Because the maximum number of candidates is set to 10, predictions will have at most 10 elements. They are sorted by score from highest to lowest, so the first element is the best.
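To illustrate the idea of a bounded, best-first candidate list, here is a minimal sketch of one way to keep only the highest-scoring entries. It is not the library's TopList implementation: TopCandidatesSketch, Scored, add and sortedBestFirst are hypothetical names, and the min-heap is simply one reasonable technique for the job.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration, not the library's TopList class.
final class TopCandidatesSketch {

    // A candidate string with its score; natural order is ascending by score.
    static final class Scored implements Comparable<Scored> {
        final double score;
        final String text;

        Scored(double score, String text) {
            this.score = score;
            this.text = text;
        }

        @Override
        public int compareTo(Scored other) {
            return Double.compare(score, other.score);
        }
    }

    private final int maxCandidates;
    // Min-heap: the lowest-scoring retained candidate sits at the head.
    private final PriorityQueue<Scored> heap = new PriorityQueue<>();

    TopCandidatesSketch(int maxCandidates) {
        this.maxCandidates = maxCandidates;
    }

    // Adds a candidate, then evicts the current worst if the limit is exceeded.
    void add(double score, String text) {
        heap.add(new Scored(score, text));
        if (heap.size() > maxCandidates) {
            heap.poll();
        }
    }

    // Returns the retained candidates sorted from highest to lowest score.
    List<Scored> sortedBestFirst() {
        List<Scored> out = new ArrayList<>(heap);
        out.sort(Collections.reverseOrder());
        return out;
    }
}
```

With a limit of 10, adding any number of candidates leaves at most 10 in the list, and the first element returned by sortedBestFirst is the best one.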
Once you have trained a model, it is often helpful to try interacting with it. Use interactive.sh for this:
$ ./scripts/interactive.sh models/modelfile