Diacritic restoration tool

A tool for automatic diacritic restoration on text with potentially missing diacritics (e.g. it turns kuca into kuća if necessary). Reported accuracy of the tool: 99.5% on standard language and 99.2% on non-standard language.
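
To make the idea concrete, here is a minimal sketch of lexicon-based restoration. It is an illustration only and not the tool's actual implementation: the toy corpus, the function names and the choice of a plain frequency lexicon are all assumptions made for this example. The method actually used is described in the publication listed below.

```python
import re
from collections import Counter

# Hypothetical toy corpus of correctly diacritised text (illustration only;
# a real system would be trained on large corpora).
CORPUS = "kuća je velika . moja kuća je stara . čaša je puna . kuca pas"

# Map each diacritised letter to its base letter.
STRIP = str.maketrans("čćžšđČĆŽŠĐ", "cczsdCCZSD")


def strip_diacritics(word):
    """Return the word with diacritics removed."""
    return word.translate(STRIP)


def build_lexicon(corpus):
    """Map each stripped form to its most frequent form in the corpus."""
    counts = {}
    for token in corpus.split():
        counts.setdefault(strip_diacritics(token), Counter())[token] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}


def restore(text, lexicon):
    """Replace every known word with its most frequent diacritised form."""
    return re.sub(r"\w+", lambda m: lexicon.get(m.group(0), m.group(0)), text)


LEXICON = build_lexicon(CORPUS)
print(restore("moja kuca je stara", LEXICON))
```

Words that are not in the lexicon, or that already carry diacritics, are left unchanged, which is the "if necessary" part of the description above. This sketch is case-sensitive and ignores context, so it cannot tell apart words whose stripped forms coincide.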

Authors
Nikola Ljubešić, Tomaž Erjavec, Darja Fišer

Availability
The tool is freely available in two forms:
  1. The code and models of the tool can be downloaded from this GitHub repository.
  2. Our web service can be accessed through our Python library, which can also be downloaded from the CLARIN.SI GitHub repository. Instructions on how to install the ReLDI library from GitHub can be found here (in Serbian). Alternatively, the easiest way to install it is through PyPI from the command line. Detailed instructions are also available on GitHub.

The second option, i.e. using the ReLDI Python library, is recommended for handling larger amounts of data.

Publications
Nikola Ljubešić, Tomaž Erjavec, and Darja Fišer (2016). Corpus-based diacritic restoration for South Slavic languages. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). Portorož, Slovenia. [Link] [.bib]


Licence and citation

The software on this page is available under the Apache License 2.0. By downloading the software, you agree to the terms of use defined by this license.

When using the software, you must cite the papers listed with it as well as the ReLDI repository page.