02 May Croatian and Serbian tokeniser
A tool for automatic tokenisation (dividing text into words and sentences). It was engineered through iterative runs on representative datasets and features modes for both standard and non-standard language.
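To illustrate what tokenisation means in practice, here is a minimal toy sketch in Python. It is not the ReLDI tokeniser and does not reproduce its rules: the regular expression, the set of sentence-final marks and the function name `tokenise` are all assumptions made for this example only.

```python
import re

# Toy rules for illustration only; these are NOT the rules used by the actual tokeniser.
# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
# Assumption: these three marks always end a sentence.
SENTENCE_END = {".", "!", "?"}


def tokenise(text):
    """Split text into sentences, each represented as a list of tokens."""
    sentences, current = [], []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        current.append(token)
        if token in SENTENCE_END:
            # Close the current sentence at a sentence-final mark.
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that lack a sentence-final mark.
        sentences.append(current)
    return sentences


print(tokenise("Ovo je rečenica. Ovo je druga!"))
```

A sketch this simple breaks on abbreviations such as "dr.", on URLs and on emoticons, which is the kind of case a dedicated tokeniser has to deal with.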
- For local use, the tokeniser can be downloaded from this GitHub repository.
- The tokeniser can be used online via our web interface, which can be found here.
- Our web service can be accessed from our Python library, which can also be downloaded from the CLARIN.SI GitHub repository. Instructions on how to install the ReLDI library from GitHub can be found here (in Serbian). Alternatively, the easiest way to install it is through PyPI from the command line. Detailed instructions are also available on GitHub.
The third option, i.e. using the ReLDI Python library, is the recommended one for handling larger amounts of data.