[unmaintained since 2005] HaT is a script intended for adding diacritic marks to (Czech) text.

srcx/hat

HaT

Warning: unmaintained since 2005!

HaT is a script for adding diacritic marks to (Czech) text. It is based on statistical methods: statistics are gathered from training data, stored in a database, and then used to restore the diacritics. With the test database, the error rate is around 5%.
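The general idea of gathering statistics from training data and then using them can be illustrated with a minimal Python sketch. This is not HaT's actual algorithm, which is not described here. It assumes the simplest possible model: for every word stripped of diacritics, remember the accented form seen most often in the training text. The function names `strip_diacritics`, `train`, and `restore` are hypothetical and chosen for illustration only.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. 'žlutý' -> 'zluty'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(corpus: str) -> dict:
    """Assumed model: map each stripped word to its most frequent accented form."""
    counts = defaultdict(Counter)
    for word in corpus.split():
        counts[strip_diacritics(word)][word] += 1
    # Keep only the most frequent accented form for each stripped key.
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def restore(text: str, model: dict) -> str:
    """Replace each known word by its stored form; unknown words pass through."""
    return " ".join(model.get(word, word) for word in text.split())


model = train("žlutý kůň pěl ďábelské ódy žlutý kůň")
print(restore("zluty kun pel neznamy", model))
```

Words that never occurred in the training text (here `neznamy`) are left unchanged, which is one obvious source of errors in a model this simple.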

Running

Requirements:

  • Perl 5.x or higher (tested with v5.8.2)
  • Cz::Cstocs (tested with version 3.4)

Generation (training) of database:

./hat.pl -b hat.db il2 < train.txt

  • creates the database hat.db from the training data in train.txt, which is encoded in ISO-8859-2 (il2; encoding names follow Cz::Cstocs)

Adding diacritic marks:

./hat.pl -h hat.db il2 < ascii.txt > czech.txt

  • uses the database hat.db to add diacritic marks to ascii.txt and saves the result as czech.txt, encoded in ISO-8859-2
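To check an error rate such as the one quoted for the test database, the restored output can be compared with a correctly accented reference text. The sketch below is one possible way to do that in Python. It assumes a word-level metric (the share of words that differ), which may not be the metric behind the quoted figure, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, restored: str) -> float:
    """Fraction of words in `restored` that differ from `reference`."""
    ref_words = reference.split()
    out_words = restored.split()
    if len(ref_words) != len(out_words):
        raise ValueError("texts must contain the same number of words")
    if not ref_words:
        return 0.0
    wrong = sum(r != o for r, o in zip(ref_words, out_words))
    return wrong / len(ref_words)


# Illustrative strings only; in practice these would be read from files.
reference = "žlutý kůň pěl ódy"
restored = "žlutý kun pěl ódy"
print(f"{word_error_rate(reference, restored):.0%}")
```

A character-level comparison would give a lower figure for the same output, so the metric should be stated alongside any reported rate.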

Test database

The test database was generated from these sources:

The exact form of the texts used cannot be reconstructed from the test database (it does not contain all the information from the original sources), so I consider this to be fair use.
