Serbian annotated corpus: SETimes.SR

SETimes.SR is a reference training corpus of Serbian texts collected from the SETimes parallel news corpus.
It contains 163 documents divided into 3891 sentences, or 86 726 tokens.
The corpus is manually annotated on the following levels:

  • Token, sentence, and document segmentation
  • Morphosyntax
  • Lemmas
  • Dependency syntax
  • Named entities

The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Bosnian, Croatian and Serbian.
Dependency syntax is annotated according to the Universal Dependencies v2 specification (UDv2).
Named entity annotations are encoded in the IOB2 format, with five NE types: people (PER), person derivatives (DERIV-PER), locations (LOC), organizations (ORG), and miscellaneous entities (MISC).
Further information about the corpus can be found on its GitHub repository.
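
As an illustration of the IOB2 encoding mentioned above, the following is a minimal Python sketch that groups a sequence of IOB2 tags into entity spans. The function name `iob2_to_spans` and the example sentence are hypothetical and not taken from the corpus or its tooling; only the five NE type labels come from the description above.

```python
from typing import List, Tuple

# The five NE types listed in the corpus description.
NE_TYPES = {"PER", "DERIV-PER", "LOC", "ORG", "MISC"}


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB2 tags into (type, start, end) spans; end is exclusive.

    Hypothetical helper for illustration. A stray I- tag that does not
    continue an open entity of the same type is ignored.
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # An open entity ends at O, at any B- tag, or at an I- tag of another type.
        closes = (
            tag == "O"
            or tag.startswith("B-")
            or (tag.startswith("I-") and tag[2:] != current)
        )
        if closes and current is not None:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            # Slice instead of splitting on "-", because DERIV-PER contains a hyphen.
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Invented example sentence (not from SETimes.SR).
tokens = ["Ana", "Petrović", "radi", "u", "Beogradu", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(iob2_to_spans(tags))  # [('PER', 0, 2), ('LOC', 4, 5)]
```

In IOB2, every entity starts with a `B-` tag, so two adjacent entities of the same type remain distinguishable without any extra markers.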

Authors
Vuk Batanović, Nikola Ljubešić, Tanja Samardžić
Availability
For local use, a full-text version of SETimes.SR can be downloaded from the CLARIN.SI repository. SETimes.SR is also available on the Serbian UD treebank repository. In addition, the corpus can be accessed via the NoSketch Engine, as well as via KonText.
Publications
The compilation of the corpus is described in the following paper:
Vuk Batanović, Nikola Ljubešić, and Tanja Samardžić (2018). SETimes.SR – A Reference Training Corpus of Serbian. In Proceedings of the Conference on Language Technologies & Digital Humanities 2018 (JT-DH 2018), pp. 11–17, Ljubljana, Slovenia.

Additional information regarding the UD annotation of this corpus is available in the following paper:
Tanja Samardžić, Mirjana Starović, Željko Agić, and Nikola Ljubešić (2017). Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, Valencia, Spain.



Licence and citation

The resource on this page is available under the Creative Commons Attribution-ShareAlike 4.0 International License. By downloading the resource, you agree to the terms of use defined by this license.

When using the resource, cite the papers listed above as well as the ReLDI repository page.