The Serbian Semantic Textual Similarity News Corpus (STS.news.sr) consists of 1192 pairs of sentences in Serbian, or around 64 thousand tokens, gathered from news sources on the web and written in the Serbian Latin script. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0-5 scale. The final scores were obtained by averaging the individual scores of five annotators. The sentence pairs in this dataset were taken from the Serbian Paraphrase Corpus (paraphrase.sr). The annotation methodology generally...
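The score aggregation described for STS.news.sr (averaging five individual annotator scores on the 0-5 scale) can be sketched as follows. This is only an illustration: the function name `final_score` and the sample scores are hypothetical and are not taken from the corpus or its documentation.

```python
def final_score(scores):
    """Average individual annotator scores into one similarity score."""
    if not scores:
        raise ValueError("at least one annotator score is required")
    if not all(0 <= s <= 5 for s in scores):
        raise ValueError("scores must lie on the 0-5 scale")
    return sum(scores) / len(scores)


# Hypothetical scores from five annotators for a single sentence pair.
annotator_scores = [4, 5, 4, 3, 4]
pair_score = final_score(annotator_scores)
print(pair_score)
```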

The Serbian Paraphrase Corpus (paraphrase.sr) consists of 1194 pairs of sentences gathered from news sources on the web. Each sentence pair was manually annotated with a binary similarity score that indicates whether the sentences in the pair are semantically similar enough to be considered close paraphrases. The corpus contains 553 sentence pairs deemed to be semantically equivalent (46.31% of the total number), and 641 semantically diverse pairs (53.69% of the total number).
Author: Vuk Batanović
Availability: The corpus and its documentation can be found...
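The class proportions reported for paraphrase.sr follow directly from the two pair counts. A small check, using only the figures quoted in the entry (the variable names are illustrative):

```python
# Pair counts as reported for paraphrase.sr.
counts = {"equivalent": 553, "diverse": 641}

total = sum(counts.values())
# Share of each class as a percentage, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

print(total, shares)
```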

ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation, and named entity recognition of non-standard Serbian. Each tweet also carries automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). As an update to version 1.1, version 2.0 adds annotations for named entities.
Authors: Nikola Ljubešić, Tomaž Erjavec, Maja Miličević, Tanja Samardžić
Availability: For local use, a full-text version of...

The Serbian Movie Review Dataset collection consists of three movie review datasets in Serbian, constructed for the task of sentiment analysis:
- Collected movie reviews in Serbian (ISLRN 252-457-966-231-5): an imbalanced collection of 4725 movie reviews in Serbian.
- SerbMR-2C, the Serbian Movie Review Dataset (2 Classes) (ISLRN 016-049-192-514-1): a two-class balanced dataset that contains 1682 movie reviews (841 positive and 841 negative).
- SerbMR-3C, the Serbian Movie Review Dataset (3 Classes) (ISLRN 229-533-271-984-0): a three-class...

This package is a Java reimplementation of four previously published stemming algorithms for Serbian and Croatian:
- The greedy and the optimal subsumption-based stemmers for Serbian, by Vlado Kešelj and Danko Šipka
- A refinement of the greedy subsumption-based stemmer, by Nikola Milošević
- A "Simple stemmer for Croatian v0.1", by Nikola Ljubešić and Ivan Pandžić
All the stemmers expect UTF-8 encoded input text, and their outputs are also UTF-8 encoded.
Author: Vuk Batanović
Availability: The package and more extensive documentation can be downloaded...

A tool for automatic lemmatisation (returning the base or dictionary form of an inflected word). The tool looks words up in the hrLex/srLex lexicons and, for out-of-vocabulary (OOV) words, falls back on a predictive model trained on available corpora and lexicons.
Author: Nikola Ljubešić
Availability: The lemmatiser is freely available in three forms:
- For local use, the code and models of the lemmatiser can be downloaded from this GitHub repository.
- The lemmatiser web service can be used online, via our web interface that...
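The lookup-then-predict strategy described for the lemmatiser can be illustrated with a minimal sketch. It is not the tool's actual code: the lexicon keying by (wordform, MSD), the function names, and the placeholder fallback are all assumptions made for illustration, and the fallback merely stands in for the trained predictive model.

```python
from typing import Callable, Dict, Tuple

# Toy lexicon keyed by (lowercased wordform, MSD tag). The keying scheme is
# an assumption; the single entry is a hypothetical sample.
LEXICON: Dict[Tuple[str, str], str] = {("ženu", "Ncfsa"): "žena"}


def placeholder_fallback(wordform: str, msd: str) -> str:
    """Stand-in for the predictive OOV model: returns the lowercased form."""
    return wordform.lower()


def lemmatise(
    wordform: str,
    msd: str,
    lexicon: Dict[Tuple[str, str], str] = LEXICON,
    fallback: Callable[[str, str], str] = placeholder_fallback,
) -> str:
    """Return the lexicon lemma if present, otherwise ask the fallback."""
    key = (wordform.lower(), msd)
    if key in lexicon:
        return lexicon[key]
    return fallback(wordform, msd)


print(lemmatise("ženu", "Ncfsa"))     # found in the toy lexicon
print(lemmatise("Tviteru", "Npmsl"))  # OOV, handled by the placeholder
```

Passing the fallback as a parameter keeps the lexicon lookup separate from whatever model handles unknown words, so the placeholder can be swapped out without touching the lookup logic.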

SETimes.SR is a reference training corpus of Serbian texts collected from the SETimes parallel news corpus. It contains 163 documents divided into 3891 sentences, or 86 726 tokens. The corpus is manually annotated on the following levels:
- Token, sentence, and document segmentation
- Morphosyntax
- Lemmas
- Dependency syntax
- Named entities
The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Bosnian, Croatian and Serbian, available here. Dependency syntax is annotated according to the Universal Dependencies v2 specification (UDv2). Named entity annotations...

srLex is an inflectional lexicon of Serbian. The size of the lexicon is 108,829 lemmas, or 5,326,726 surface forms. Each entry in the lexicon consists of a (wordform, lemma, MSD, absolute frequency, in-million frequency) quintuple, e.g.: (ženu, žena, Ncfsa, 15838, 0.028556). The frequencies were estimated on the Serbian web corpus srWaC. The set of morphosyntactic tags used in the lexicon follows the MULTEXT-East V5 tagset for Bosnian (and Serbian), available here.
Authors: Nikola Ljubešić, Filip Klubička
Availability: For local use, srLex can be downloaded as a raw...
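The quintuple structure of a lexicon entry can be made concrete with a small parsing sketch. The on-disk format is not stated in this entry, so the tab-separated layout assumed below, as well as the names `LexiconEntry` and `parse_entry`, are hypothetical; only the five fields and the sample values come from the description above.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    wordform: str
    lemma: str
    msd: str                  # morphosyntactic description tag
    abs_freq: int             # absolute frequency
    per_million_freq: float   # in-million frequency, as listed in the entry


def parse_entry(line: str, sep: str = "\t") -> LexiconEntry:
    """Parse one line into a five-field entry (separator is an assumption)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    wordform, lemma, msd, abs_freq, rel_freq = fields
    return LexiconEntry(wordform, lemma, msd, int(abs_freq), float(rel_freq))


# The sample quintuple from the description, written as a tab-separated line.
entry = parse_entry("ženu\tžena\tNcfsa\t15838\t0.028556")
print(entry.wordform, entry.lemma, entry.msd)
```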

srWaC is a web corpus collected from the .rs top-level domain. The 1.1 version of the corpus contains 555 million tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 1.2. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
Authors: Nikola Ljubešić, Filip Klubička
Availability: For local use, a full-text version of srWaC can be downloaded here. srWaC can also...

A tool for automatic diacritic restoration on text with potentially missing diacritics (e.g. it turns kuca into kuća where appropriate). Reported accuracy of the tool: 99.5% on standard language and 99.2% on non-standard language.
Authors: Nikola Ljubešić, Tomaž Erjavec, Darja Fišer
Availability: The tool is freely available in two forms:
- The code and models of the tool can be downloaded from this GitHub repository.
- Our web service can be accessed from our Python library, which can also be downloaded from the CLARIN.SI GitHub...

A tool for automatic tokenisation (dividing text into words and sentences). It was engineered through iterative runs on representative datasets and features modes for both standard and non-standard language.
Authors: Nikola Ljubešić, Tomaž Erjavec
Availability: The tokeniser is freely available in three forms:
- For local use, the tokeniser can be downloaded from this GitHub repository.
- The tokeniser can be used online, via our web interface that can be found here.
- Our web service can be accessed from our Python library, which can also...

A tool for automatic annotation on the morphosyntactic level. It is capable of tagging both Croatian and Serbian as models for both languages are present in the tool. The tagger is based on the CRF algorithm trained on a 500,000-token Croatian training corpus and the hrLex/srLex lexicons for each respective language. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here. Accuracies calculated on test sets for each language:
- Croatian: 92.53%
- Serbian: 92.33%
Author: Nikola...