10 Dec Serbian semantic textual similarity news corpus: STS.news.sr
The Serbian Semantic Textual Similarity News Corpus (STS.news.sr) consists of 1192 pairs of sentences in Serbian, or around 64 thousand tokens, gathered from news sources on the web and written in the Serbian Latin script. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0–5 scale. The final scores were obtained by averaging the individual scores of five annotators.
The sentence pairs in this dataset were taken from the Serbian Paraphrase Corpus (paraphrase.sr). The annotation methodology generally followed the one established in the SemEval STS shared tasks (2012–2017). Annotation instructions used in the creation of the STS.news.sr corpus are available here. The STSAnno tool was used in the annotation process.
The average annotator self-agreement score, expressed in terms of the Pearson correlation coefficient r, is 0.93. The average inter-rater correlation between an annotator and the averaged scores of all other annotators is 0.92, which is effectively the upper bound for STS model performance on this dataset.
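The score averaging and the leave-one-out inter-rater correlation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function names and the toy ratings are made up for the example, and the corpus's actual file format is not assumed here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def final_scores(ratings):
    """Average the individual annotator scores for each sentence pair."""
    return [sum(pair) / len(pair) for pair in ratings]


def leave_one_out_agreement(ratings):
    """Mean correlation of each annotator with the averaged scores of the others."""
    n_annotators = len(ratings[0])
    correlations = []
    for k in range(n_annotators):
        own = [pair[k] for pair in ratings]
        # Average of the remaining annotators for each sentence pair.
        others = [(sum(pair) - pair[k]) / (n_annotators - 1) for pair in ratings]
        correlations.append(pearson(own, others))
    return sum(correlations) / n_annotators


# Hypothetical toy data: one row per sentence pair, one column per annotator,
# each score on the 0-5 scale. These numbers are NOT taken from the corpus.
toy_ratings = [
    [0, 1, 0, 0, 1],
    [2, 3, 2, 2, 3],
    [4, 4, 5, 4, 4],
    [5, 5, 5, 4, 5],
]

if __name__ == "__main__":
    print("final scores:", final_scores(toy_ratings))
    print("leave-one-out agreement:", leave_one_out_agreement(toy_ratings))
```

The same `pearson` function can be used to compare an STS model's predictions with the averaged gold scores, which makes the model's correlation directly comparable to the reported inter-rater figure.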