Serbian paraphrase corpus:

The Serbian Paraphrase Corpus consists of 1194 pairs of sentences gathered from news sources on the web. Each sentence pair was manually annotated with a binary similarity score that indicates whether the sentences in the pair are semantically similar enough to be considered close paraphrases. The corpus contains 553 sentence pairs deemed to be semantically equivalent (46.31% of the total number), and 641 semantically diverse pairs (53.69% of the total number).
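The class proportions quoted above can be re-derived from the stated counts. The following is a minimal sketch that uses only the numbers given in the description; the constant and function names are illustrative and are not part of the corpus or its documentation. It also prints the accuracy a trivial classifier would reach by always predicting the larger class, which can serve as a reference point when evaluating paraphrase detection on this data.

```python
# Counts as stated in the corpus description (names are illustrative).
TOTAL_PAIRS = 1194
PARAPHRASE_PAIRS = 553        # pairs annotated as semantically equivalent
NON_PARAPHRASE_PAIRS = 641    # pairs annotated as semantically diverse


def class_shares(positive, negative):
    """Return the fraction of positive and negative pairs."""
    total = positive + negative
    return positive / total, negative / total


pos_share, neg_share = class_shares(PARAPHRASE_PAIRS, NON_PARAPHRASE_PAIRS)

# Accuracy of a classifier that always predicts the larger class.
majority_baseline = max(pos_share, neg_share)

print(f"paraphrases: {pos_share:.2%}")
print(f"non-paraphrases: {neg_share:.2%}")
print(f"majority-class baseline accuracy: {majority_baseline:.2%}")
```

Because the two classes are close to balanced, plain accuracy is a reasonably informative metric here, although any system should still be compared against the majority-class figure.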

Vuk Batanović
The corpus and its documentation can be found in the GitHub repository.
  • Vuk Batanović, Bojan Furlan, Boško Nikolić (2011). A software system for determining the semantic similarity of short texts in Serbian. Proceedings of the 19th Telecommunications Forum (TELFOR 2011), pp. 1249–1252, Belgrade, Serbia.
  • Bojan Furlan, Vuk Batanović, Boško Nikolić (2013). Semantic similarity of short texts in languages with a deficient natural language processing support. Decision Support Systems, Vol. 55, No. 3, pp. 710–719.

Licence and citation

The resource on this page is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. By downloading the resource, you agree to the terms of use defined by this license.


When using the resource, cite the papers listed with it as well as the ReLDI repository page.