23 Feb NLP pipeline for Croatian and Serbian
A Python module comprising a tokeniser, a part-of-speech/MSD tagger, a lemmatiser, a dependency parser, and a named entity recogniser for most South Slavic languages. For Croatian and Serbian, there are models for processing both standard texts and non-standard Internet texts. The estimated accuracy of morphosyntactic tagging for this tool is ~94%, while lemmatisation accuracy is ~99%. Dependency parsing reaches a labelled attachment score of ~0.9, while named entity recognition achieves a micro-F1 of ~0.9.
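For readers unfamiliar with the two scores quoted above, here is a minimal sketch of how a labelled attachment score and a micro-averaged F1 are commonly computed. It does not use the module itself: the function names and the toy data are hypothetical and invented for illustration, and the tool's own evaluation setup may differ.

```python
def labeled_attachment_score(gold, pred):
    """Fraction of tokens whose predicted (head, label) pair matches the gold pair."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


def micro_f1(gold_entities, pred_entities):
    """Micro-averaged F1 over (start, end, type) entity tuples, pooled across types."""
    gold, pred = set(gold_entities), set(pred_entities)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): one (head index, relation label) pair per token.
gold_arcs = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred_arcs = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "nmod")]
las = labeled_attachment_score(gold_arcs, pred_arcs)

# Toy data (hypothetical): entities as (start token, end token, type).
gold_ents = [(0, 1, "PER"), (4, 6, "LOC")]
pred_ents = [(0, 1, "PER"), (4, 6, "ORG"), (8, 9, "LOC")]
f1 = micro_f1(gold_ents, pred_ents)

print(f"LAS = {las:.2f}, micro-F1 = {f1:.2f}")
```

In this sketch a token counts as correct only when both its head and its relation label match, and an entity counts only when its span and type both match, which is why a single wrong label lowers each score.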