Croatian and Serbian tokeniser [legacy]

This tool is considered legacy, as the newer NLP pipeline achieves better results on the same task; the pipeline, however, is not yet available as a web service.

A tool for automatic tokenisation (dividing text into words and sentences). It was engineered through iterative runs on representative datasets and features modes for both standard and non-standard language.

Authors: Nikola Ljubešić, Tomaž Erjavec
The tokeniser is freely available in three forms:
  1. For local use, the tokeniser can be downloaded from this GitHub repository.
  2. The tokeniser can be used online, via our web interface that can be found here.
  3. Our web service can be accessed from our Python library, which can also be downloaded from the CLARIN.SI GitHub repository. Instructions on how to install the ReLDI library from GitHub can be found here (in Serbian). Alternatively, the easiest way to install it is through PyPI from the command-line interface. (Detailed instructions are also available on GitHub.)

The third option, i.e. using the ReLDI Python library, is the recommended way to handle larger amounts of data.
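For the recommended PyPI route, the installation can be sketched as below. Note that the package name `reldi-tokeniser` and the invocation line are assumptions based on the CLARIN.SI GitHub repository, not taken from this page; the linked GitHub instructions remain the authoritative reference.

```shell
# Install the tokeniser from PyPI (package name assumed;
# see the CLARIN.SI GitHub repository for authoritative steps).
pip install reldi-tokeniser

# Hypothetical invocation: tokenise Croatian text read from stdin.
# The language code (hr) and module name are assumptions; consult
# the repository documentation for the exact command.
python -m reldi_tokeniser hr < input.txt > output.txt
```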

Licence and citation

The software on this page is available under the Apache License 2.0. By downloading the software, you agree to the terms of use defined by this licence.

When using the software, it is necessary to cite the papers listed with it, as well as the ReLDI repository page.