The SentiComments.SR dataset includes the following three corpora: the main SentiComments.SR corpus, consisting of 3490 movie-related comments; the movie verification corpus, consisting of 464 movie-related comments; and the book verification corpus, consisting of 173 book-related comments. The main SentiComments.SR corpus was constructed from comments written in Serbian by visitors of the kakavfilm.com movie review website. The movie verification corpus comments were sourced from two other Serbian movie review websites, gledajme.rs and happynovisad.com. The book verification corpus comments were also...

The Serbian Semantic Textual Similarity News Corpus (STS.news.sr) consists of 1192 pairs of sentences in Serbian, or around 64 thousand tokens, gathered from news sources on the web and written in the Serbian Latin script. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0-5 scale. The final scores were obtained by averaging the individual scores of five annotators. The sentence pairs in this dataset were taken from the Serbian Paraphrase Corpus (paraphrase.sr). The annotation methodology generally...
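The averaging step described above can be illustrated with a minimal sketch. The function name and the example scores are hypothetical and are not taken from the corpus or its tooling; the sketch only assumes that each pair receives five individual scores on the 0-5 scale.

```python
from statistics import mean


def final_similarity_score(annotator_scores):
    """Average five individual 0-5 scores into one final score.

    Hypothetical helper for illustration only; it is not part of
    the STS.news.sr distribution.
    """
    if len(annotator_scores) != 5:
        raise ValueError("expected exactly five annotator scores")
    if any(not 0 <= score <= 5 for score in annotator_scores):
        raise ValueError("scores must lie on the 0-5 scale")
    return mean(annotator_scores)


# Made-up scores for a single sentence pair.
example_scores = [3, 4, 4, 3, 5]
print(final_similarity_score(example_scores))
```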

The Serbian Paraphrase Corpus (paraphrase.sr) consists of 1194 pairs of sentences gathered from news sources on the web. Each sentence pair was manually annotated with a binary similarity score that indicates whether the sentences in the pair are semantically similar enough to be considered close paraphrases. The corpus contains 553 sentence pairs deemed to be semantically equivalent (46.31% of the total number), and 641 semantically diverse pairs (53.69% of the total number).
Author: Vuk Batanović
Availability: The corpus and its documentation can be found...
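The class shares quoted for paraphrase.sr follow directly from the two pair counts. A short sketch of the arithmetic, in which the function name and the label strings are illustrative assumptions rather than identifiers used by the corpus:

```python
def class_shares(counts):
    """Return each class's share of the total, as a percentage
    rounded to two decimal places."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Pair counts as stated in the corpus description; the label names are made up.
shares = class_shares({"paraphrase": 553, "non_paraphrase": 641})
print(shares)
```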

ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation, and named entity recognition of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).
Authors: Nikola Ljubešić, Tomaž Erjavec, Vuk Batanović, Maja Miličević, Tanja Samardžić
Availability: For local use, a full-text version of the corpus can be downloaded from the CLARIN.SI repository.
Publication: The corpus...

ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation, and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness).
Authors: Nikola Ljubešić, Tomaž Erjavec, Vuk Batanović, Maja Miličević, Tanja Samardžić
Availability: For local use, a full-text version of the corpus can be downloaded from the CLARIN.SI repository.
Publication: The corpus...

The Serbian Movie Review Dataset collection consists of three movie review datasets in Serbian which were constructed for the task of sentiment analysis:
Collected movie reviews in Serbian (ISLRN 252-457-966-231-5) - an imbalanced collection of 4725 movie reviews in Serbian.
SerbMR-2C - The Serbian Movie Review Dataset (2 Classes) (ISLRN 016-049-192-514-1) - a two-class balanced dataset that contains 1682 movie reviews (841 positive and 841 negative).
SerbMR-3C - The Serbian Movie Review Dataset (3 Classes) (ISLRN 229-533-271-984-0) - a three-class...

SETimes.SR is a reference training corpus of Serbian texts collected from the SETimes parallel news corpus. It contains 163 documents divided into 3891 sentences, or 86 726 tokens. The corpus is manually annotated on the following levels: token, sentence, and document segmentation; morphosyntax; lemmas; dependency syntax; and named entities. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Bosnian, Croatian and Serbian, available here. Dependency syntax is annotated according to the Universal Dependencies v2 specification (UDv2). Named entity annotations...

hr500k is a reference training corpus of Croatian that consists of 900 documents divided into 24 794 sentences, or 506 457 tokens. It is an extension of previous training corpora for Croatian, such as SETimes.HR and SETimes.HR+. The corpus is manually annotated on the following levels: token, sentence, and document segmentation; morphosyntax; lemmas; dependency syntax; semantic roles; and named entities. The entire corpus was annotated with regard to morphosyntax and lemmas. The set of morphosyntactic tags used in the corpus follows the revised...

srWaC is a web corpus collected from the .rs top-level domain. The 1.1 version of the corpus contains 555 million tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 1.2. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
Authors: Nikola Ljubešić, Filip Klubička
Availability: For local use, a full-text version of srWaC can be downloaded here. srWaC can also...

hrWaC is a web corpus collected from the .hr top-level domain. The 2.1 version of the corpus contains 1.4 billion tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 2.2. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
Authors: Nikola Ljubešić, Filip Klubička
Availability: For local use, a full-text version of hrWaC can be downloaded here. hrWaC can also...