02 May Croatian web corpus: hrWaC
hrWaC is a web corpus collected from the .hr top-level domain. Version 2.1 of the corpus contains 1.4 billion tokens. The corpus is automatically annotated on three layers: diacritic restoration, morphosyntax, and lemma. A dependency syntax layer will be added in version 2.2.
The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
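MULTEXT-East morphosyntactic descriptions (MSDs) are positional: the first character gives the part of speech, and each following character encodes one attribute. As a rough illustration of how such a tag can be read, here is a minimal Python sketch. The function name `decode_msd` and the lookup tables are hypothetical and cover only a simplified subset (categories and the first four noun attributes); the tagset specification remains the authoritative source for the full inventory.

```python
# Illustrative subset of category codes (first character of an MSD tag).
# Assumption: simplified from the MULTEXT-East conventions; check the
# tagset specification for the complete, authoritative list.
CATEGORIES = {
    "N": "Noun", "V": "Verb", "A": "Adjective", "P": "Pronoun",
    "R": "Adverb", "S": "Adposition", "C": "Conjunction", "M": "Numeral",
    "Q": "Particle", "I": "Interjection", "Y": "Abbreviation", "X": "Residual",
}

# Positional attributes for nouns (illustrative subset: first four positions).
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "p": "plural"}),
    ("Case", {
        "n": "nominative", "g": "genitive", "d": "dative", "a": "accusative",
        "v": "vocative", "l": "locative", "i": "instrumental",
    }),
]


def decode_msd(tag):
    """Decode an MSD tag into a dict of human-readable features.

    Only nouns are expanded beyond the category; other categories
    return just their part of speech in this sketch.
    """
    if not tag:
        raise ValueError("empty MSD tag")
    features = {"Category": CATEGORIES.get(tag[0], "Unknown")}
    if tag[0] == "N":
        # Pair each remaining character with its positional attribute;
        # positions beyond the table are ignored, "-" means not applicable.
        for (name, values), code in zip(NOUN_ATTRIBUTES, tag[1:]):
            if code != "-":
                features[name] = values.get(code, code)
    return features


print(decode_msd("Ncmsn"))
```

With these tables, the tag `Ncmsn` reads as a common masculine singular noun in the nominative case. Extending the sketch to verbs, adjectives, and the other categories would mean adding one attribute table per category.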
hrWaC can also be queried online through the noSketchEngine web interface, available here.
Nikola Ljubešić, Filip Klubička (2014). {bs,hr,sr}WaC — Web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9). Gothenburg, Sweden. [Link] [.bib]