Serbian semantic textual similarity news corpus:

The Serbian Semantic Textual Similarity News Corpus consists of 1192 pairs of sentences in Serbian, or around 64 thousand tokens, gathered from news sources on the web and written in the Serbian Latin script. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0–5 scale. The final scores were obtained by averaging the individual scores of five annotators.

The sentence pairs in this dataset were taken from the Serbian Paraphrase Corpus. The annotation methodology generally followed the one established in the SemEval STS shared tasks (2012–2017). The annotation instructions used in the creation of the corpus are available here. The STSAnno tool was used in the annotation process.

The average annotator self-agreement score, expressed in terms of the Pearson correlation coefficient r, is 0.93. The average inter-rater correlation between an annotator and the averaged scores of all other annotators is 0.92, which is effectively the upper bound for STS model performance on this dataset.
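The two computations described above (averaging the annotators' scores, and correlating each annotator with the average of all the others) can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function names and the toy ratings matrix are hypothetical, and the numbers it produces have no relation to the 0.92 figure reported for the corpus.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def final_scores(ratings):
    """Average the per-annotator scores for each sentence pair.

    ratings[i][j] is the score annotator i assigned to sentence pair j.
    """
    n_annotators = len(ratings)
    return [sum(column) / n_annotators for column in zip(*ratings)]


def leave_one_out_correlations(ratings):
    """Correlate each annotator with the averaged scores of all the others."""
    correlations = []
    for i, own_scores in enumerate(ratings):
        others = [r for k, r in enumerate(ratings) if k != i]
        correlations.append(pearson(own_scores, final_scores(others)))
    return correlations


# Toy data (an assumption for illustration): five annotators, five pairs,
# scores on the 0-5 scale.
ratings = [
    [0, 1, 3, 4, 5],
    [0, 2, 3, 4, 5],
    [1, 1, 2, 5, 5],
    [0, 1, 3, 3, 4],
    [0, 2, 2, 4, 5],
]

gold = final_scores(ratings)
per_annotator = leave_one_out_correlations(ratings)
average_inter_rater = sum(per_annotator) / len(per_annotator)

print("Final scores:", gold)
print("Average inter-rater correlation:", round(average_inter_rater, 3))
```

Leaving the annotator's own scores out of the average avoids inflating the correlation, which is why this figure is a reasonable ceiling to compare a model's correlation with the final scores against.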

Vuk Batanović
The corpus and its documentation can be found on the GitHub repository.
Vuk Batanović, Miloš Cvetanović, Boško Nikolić (2018). Fine-grained Semantic Textual Similarity for Serbian. Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pp. 1370–1378, Miyazaki, Japan. [Link][.bib]

Licence and citation

The resource on this page is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. By downloading the resource, you agree to the terms of use defined by this license.

When using the resource, it is necessary to cite the papers listed with it as well as the ReLDI repository page.