Croatian annotated corpus: hr500k

hr500k is a reference training corpus of Croatian that consists of 900 documents divided into 24 794 sentences, or 506 457 tokens. It is an extension of previous training corpora for Croatian, such as SETimes.HR and SETimes.HR+.
The corpus is manually annotated on the following levels:

  • Token, sentence, and document segmentation
  • Morphosyntax
  • Lemmas
  • Dependency syntax
  • Semantic roles
  • Named entities

The entire corpus is annotated for morphosyntax and lemmas. The morphosyntactic tags used in the corpus follow the revised MULTEXT-East V5 tagset for Croatian and Serbian.
Dependency syntax is annotated according to the Universal Dependencies v2 specification (UDv2) and covers roughly the first two fifths of hr500k, i.e. the first 197 028 tokens of the corpus.
Semantic roles are annotated in the oldest part of the corpus, namely the first 163 documents / 83 630 tokens, which come from the original SETimes.HR corpus.
Named entity annotations cover the entire hr500k and are encoded in the IOB2 format, with five NE types considered – people (PER), person derivatives (DERIV-PER), locations (LOC), organizations (ORG), and miscellaneous entities (MISC).
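
As a quick sanity check on the coverage figures quoted above, the share of the corpus covered by each partial annotation layer can be worked out from the token counts. This is only an illustrative sketch: the counts are taken from this page, while the function and variable names are hypothetical.

```python
# Token counts as stated on this page.
TOTAL_TOKENS = 506_457
LAYER_TOKENS = {
    "dependency syntax (UDv2)": 197_028,
    "semantic roles": 83_630,
}


def coverage_share(layer_tokens: int, total_tokens: int = TOTAL_TOKENS) -> float:
    """Return the fraction of the corpus covered by an annotation layer."""
    return layer_tokens / total_tokens


if __name__ == "__main__":
    for layer, count in LAYER_TOKENS.items():
        print(f"{layer}: {count} tokens, {coverage_share(count):.1%} of the corpus")
```

The dependency layer comes out at just under 39% of all tokens, which is what "two fifths" approximates, and the semantic role layer at about 16.5%.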
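
To illustrate the IOB2 encoding mentioned above, the following minimal sketch turns a sequence of per-token tags into entity spans. It assumes the tags have already been read into a Python list; the function name, the example sentence, and the choice to ignore a stray `I-` tag with no preceding `B-` are assumptions made for illustration and are not part of the corpus distribution.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert IOB2 tags into (entity_type, start, end) spans, end exclusive."""
    spans: List[Tuple[str, int, int]] = []
    current = None  # type of the entity currently open, if any
    start = None    # token index where the open entity began
    for i, tag in enumerate(tags):
        # An I- tag continues the open entity only if the types match.
        # tag[2:] keeps hyphenated types such as DERIV-PER intact.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        # Any other tag closes the entity that was open.
        if current is not None:
            spans.append((current, start, i))
            current, start = None, None
        # In IOB2 every entity starts with a B- tag.
        if tag.startswith("B-"):
            current, start = tag[2:], i
        # "O" and stray I- tags open nothing (assumption for this sketch).
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


if __name__ == "__main__":
    # Invented example sentence, not taken from hr500k.
    tokens = ["Ivan", "Horvat", "radi", "u", "Zagrebu", "."]
    tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
    for ne_type, begin, end in iob2_to_spans(tags):
        print(ne_type, " ".join(tokens[begin:end]))
```

Stripping the two-character prefix rather than splitting on the hyphen matters here because the DERIV-PER label itself contains a hyphen.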

Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, Tomaž Erjavec
For local use, a full-text version of hr500k can be downloaded from the CLARIN.SI repository. The corpus can also be accessed via the NoSketch Engine, as well as via KonText.
The compilation of the corpus is described in the following paper:
Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, and Tomaž Erjavec (2018). hr500k – A Reference Training Corpus of Croatian. In Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT-DH 2018), pp. 154–161, Ljubljana, Slovenia.

Licence and citation

The resource on this page is available under the Creative Commons Attribution-ShareAlike 4.0 International License. By downloading the resource, you agree to the terms of use defined by this license.

When using the resource, cite the papers listed with it as well as the ReLDI repository page.