Transliteration

This is a Java port of Jeff Pasternack's C# code from "Learning Better Transliterations".

See examples in TestTransliteration or Runner.

Training data

To train a model, you need pairs of names. A common source is Wikipedia interlanguage links. For example, see this data from Transliterating From All Languages by Anne Irvine et al.

The expected data format is a tab-separated pair:

foreign<tab>english

That said, the Utils class has readers for many different datasets (including Anne Irvine's data).
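To make the foreign<tab>english format concrete, here is a minimal sketch of a reader for it. It is not the library's Utils code: NamePairReader, NamePair, and parseLines are hypothetical names used only for illustration, and the sketch assumes one pair per line with blank lines skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of this library: parses "foreign<tab>english" lines.
class NamePairReader {

    // Hypothetical holder for one training pair.
    static final class NamePair {
        final String foreign;
        final String english;

        NamePair(String foreign, String english) {
            this.foreign = foreign;
            this.english = english;
        }
    }

    // Assumes one pair per line; blank lines are skipped, malformed lines are rejected.
    static List<NamePair> parseLines(List<String> lines) {
        List<NamePair> pairs = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t");
            if (fields.length != 2) {
                throw new IllegalArgumentException(
                        "Expected exactly two tab-separated fields: " + line);
            }
            pairs.add(new NamePair(fields[0].trim(), fields[1].trim()));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Sample data made up for this sketch; the first field is "Moskva" in Cyrillic,
        // written with Unicode escapes so the source file stays ASCII.
        List<String> sample = Arrays.asList(
                "\u041c\u043e\u0441\u043a\u0432\u0430\tMoscow",
                "",
                "Paris\tParis");
        for (NamePair pair : parseLines(sample)) {
            System.out.println(pair.foreign + " -> " + pair.english);
        }
    }
}
```

For a real file, the lines could come from java.nio.file.Files.readAllLines with an explicit UTF-8 charset, since the foreign side is usually not ASCII.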

Training a model

The standard class is the SPModel. Use it as follows:

// Load the training examples from trainfile.
List<Example> training = Utils.readWikiData(trainfile);
SPModel model = new SPModel(training);
model.Train(10);
// Save the trained model to modelfile.
model.WriteProbs(modelfile);

This trains a model and writes it to the path specified by modelfile.

SPModel also provides Probability(source, target), which returns the transliteration probability of a given pair.

Annotating

A trained model can be used immediately after training, or you can initialize SPModel using a previously trained and saved modelfile.

SPModel model = new SPModel(modelfile);
model.setMaxCandidates(10);
TopList<Double,String> predictions = model.Generate(testexample);

Because the maximum number of candidates is set to 10, predictions will have at most 10 elements. They are sorted by score from highest to lowest, so the first element is the best.
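The idea of a capped, score-sorted candidate list can be sketched independently of the library. The code below is not the library's TopList and makes no claim about how SPModel implements it: TopCandidates, Scored, and topN are hypothetical names, and the scores in main are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, NOT the library's TopList: keep the best N scored strings.
class TopCandidates {

    // Hypothetical holder for one candidate and its score.
    static final class Scored {
        final double score;
        final String text;

        Scored(double score, String text) {
            this.score = score;
            this.text = text;
        }
    }

    // Returns at most maxCandidates items, sorted by score from highest to lowest.
    static List<Scored> topN(List<Scored> candidates, int maxCandidates) {
        // Min-heap on score: the weakest kept candidate sits at the head,
        // so it is the one evicted when the cap is exceeded.
        PriorityQueue<Scored> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Scored s) -> s.score));
        for (Scored candidate : candidates) {
            heap.add(candidate);
            if (heap.size() > maxCandidates) {
                heap.poll();
            }
        }
        List<Scored> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        return result;
    }

    public static void main(String[] args) {
        // Made-up candidates and scores, for illustration only.
        List<Scored> candidates = new ArrayList<>();
        candidates.add(new Scored(0.10, "Moskow"));
        candidates.add(new Scored(0.70, "Moscow"));
        candidates.add(new Scored(0.15, "Moskva"));
        for (Scored s : topN(candidates, 2)) {
            System.out.println(s.score + "\t" + s.text);
        }
    }
}
```

A heap keeps memory bounded by the cap rather than by the total number of candidates considered, which matters when many candidates are generated per input.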

Interactive

Once you have trained a model, it is often helpful to query it interactively. Use interactive.sh for this:

$ ./scripts/interactive.sh models/modelfile
