NLP pipeline for Croatian and Serbian

A Python module comprising a tokeniser, a part-of-speech/MSD tagger, a lemmatiser, a dependency parser, and a named entity recogniser for most South Slavic languages. For Croatian and Serbian, there are models for processing both standard and non-standard (Internet) text. The estimated accuracy of morphosyntactic tagging is ~94%, and that of lemmatisation is ~99%. Dependency parsing reaches a labelled attachment score of ~0.9, and named entity recognition achieves a micro-F1 of ~0.9.
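For readers unfamiliar with the two scores quoted above, the following is a minimal Python sketch of how a labelled attachment score and a micro-averaged F1 are commonly computed. It is not the evaluation code behind the reported figures: the function names, the toy data, and the choice of exact-match entity spans are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# (head index, dependency label) per token; toy data, not pipeline output.
Arc = Tuple[int, str]
# (start token, end token, entity type); toy data as well.
Span = Tuple[int, int, str]


def labeled_attachment_score(gold: List[Arc], pred: List[Arc]) -> float:
    """Share of tokens whose predicted head AND label both match the gold arc."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return correct / len(gold)


def micro_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN counts over all sentences and types."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical four-token sentence: one label is wrong (obj vs iobj).
gold_arcs = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred_arcs = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "punct")]
las = labeled_attachment_score(gold_arcs, pred_arcs)

# Hypothetical entities in two sentences: one ORG span has a wrong boundary.
gold_ents = [{(0, 1, "PER")}, {(3, 3, "LOC"), (5, 6, "ORG")}]
pred_ents = [{(0, 1, "PER")}, {(3, 3, "LOC"), (5, 5, "ORG")}]
f1 = micro_f1(gold_ents, pred_ents)

print(f"LAS = {las:.2f}, micro-F1 = {f1:.2f}")
```

In this sketch a predicted entity counts as correct only if its boundaries and type both match the gold span exactly; other evaluation setups may score partial matches differently.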

Nikola Ljubešić
The experiments that led to this pipeline are described in the following paper: Nikola Ljubešić and Kaja Dobrovoljc (2019). What Does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing. Florence, Italy. pp. 29–34.

Licence and citation

The software on this page is available under the Apache License 2.0. By downloading the software, you agree to the terms of use defined by this license.

When using the software, you must cite the papers listed with it as well as the ReLDI repository page.