The Serbian Semantic Textual Similarity News Corpus (STS.news.sr) consists of 1192 pairs of sentences in Serbian, or around 64 thousand tokens, gathered from news sources on the web and written in the Serbian Latin script. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0-5 scale. The final scores were obtained by averaging the individual scores of five annotators. The sentence pairs in this dataset were taken from the Serbian Paraphrase Corpus (paraphrase.sr). The annotation methodology generally...
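
The averaging step described above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample scores are hypothetical and are not taken from the corpus or its documentation.

```python
def final_similarity(annotator_scores):
    """Average the individual 0-5 scores given to one sentence pair."""
    if len(annotator_scores) != 5:
        raise ValueError("expected scores from exactly five annotators")
    if any(not 0 <= score <= 5 for score in annotator_scores):
        raise ValueError("scores must lie on the 0-5 scale")
    return sum(annotator_scores) / len(annotator_scores)


# Hypothetical scores for a single sentence pair (not real corpus data).
print(final_similarity([4, 5, 4, 3, 4]))  # 4.0
```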

The next ReLDI meeting will take place in Zagreb on 21 December 2018. If you would like to attend, let us know via our contact form....

A new grant, "Revisiting research training in linguistics: theory, logic, method" (Nr. 2018-CH-IP-0012), has been awarded by the Swiss funding agency Movetia to Tanja Samardžić and Maja Miličević, in collaboration with Genoveva Puskas from the University of Geneva. The goal of the new partnership is to introduce the content taught in the ReLDI seminars into the undergraduate programmes at the Universities of Zurich, Belgrade and Geneva. See the announcement from the University of Zurich....

The Serbian Paraphrase Corpus (paraphrase.sr) consists of 1194 pairs of sentences gathered from news sources on the web. Each sentence pair was manually annotated with a binary similarity score that indicates whether the sentences in the pair are semantically similar enough to be considered close paraphrases. The corpus contains 553 sentence pairs deemed to be semantically equivalent (46.31% of the total number), and 641 semantically diverse pairs (53.69% of the total number). Author Vuk Batanović Availability The corpus and its documentation can be found...
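
The class proportions quoted above can be checked with simple arithmetic. The short Python sketch below uses only the counts given in the description; the constant and function names are illustrative.

```python
TOTAL_PAIRS = 1194
PARAPHRASE_PAIRS = 553       # pairs deemed semantically equivalent
NON_PARAPHRASE_PAIRS = 641   # semantically diverse pairs


def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


# The two classes should add up to the full corpus.
assert PARAPHRASE_PAIRS + NON_PARAPHRASE_PAIRS == TOTAL_PAIRS

print(share(PARAPHRASE_PAIRS, TOTAL_PAIRS))      # 46.31
print(share(NON_PARAPHRASE_PAIRS, TOTAL_PAIRS))  # 53.69
```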

Four ReLDI online courses are now open and can be accessed via the Open edX @ Zurich platform. More detailed information can be found at the online courses page....

Between 15 August and 1 October 2017 four ReLDI online courses will be opened on the Open edX @ Zurich platform. More detailed information can be accessed from the online courses page....

ReLDI-NormTag-sr 1.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). As an update to version 1.0, 1.1 corrects some minor errors. Authors Nikola Ljubešić, Daša Farkaš, Filip Klubička, Tomaž Erjavec, Maja Miličević, Teodora Vuković Availability For local use, a full-text version of...

ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). As an update to version 1.0, 1.1 corrects some minor errors. Authors Nikola Ljubešić, Daša Farkaš, Filip Klubička, Tomaž Erjavec, Maja Miličević, Matea Filko, Denis Kranjčić, Barbara Dujmić Availability For local use,...

Serbian Universal Dependency Treebank currently consists of the SETimes.SR corpus manually annotated with syntactic dependencies following the guidelines for Universal Dependencies v2. Authors Tanja Samardžić, Mirjana Starović, Željko Agić, Nikola Ljubešić Availability For local use, a full-text version of the treebank can be downloaded from the UD GitHub page. Publication The development of the treebank is described in detail in the following paper: Tanja Samardžić, Mirjana Starović, Željko Agić, Nikola Ljubešić (2017). Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages. Proceedings of the...
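
Universal Dependencies treebanks are distributed in the CoNLL-U format, where each token is a line of ten tab-separated fields. The Python sketch below shows one minimal way to read such a file. The reader function and the sample sentence are hypothetical illustrations, and the sample annotation is made up rather than copied from the treebank.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """The ten CoNLL-U columns, all kept as strings."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def read_conllu(text):
    """Yield sentences as lists of Token from CoNLL-U formatted text."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):    # sentence-level comment, skipped here
            continue
        else:
            sentence.append(Token(*line.split("\t")))
    if sentence:
        yield sentence


# Made-up example sentence; XPOS and FEATS are left unspecified ("_").
SAMPLE = "\n".join([
    "# text = Ana čita knjigu.",
    "\t".join(["1", "Ana", "Ana", "PROPN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "čita", "čitati", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "knjigu", "knjiga", "NOUN", "_", "_", "2", "obj", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

for sent in read_conllu(SAMPLE):
    for tok in sent:
        print(tok.form, tok.upos, tok.deprel, tok.head)
```

The sketch does not treat multiword token ranges or empty nodes specially; a dedicated CoNLL-U library would be a better fit for real work.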

Blog posts about ReLDI have been published on the URPP Language and Space blog (author: Tanja Samardžić), and the CLARIN blog (authors: Ana Slavec and Jakob Lenardič)....

The fifth ReLDI seminar took place from 21 to 23 June 2017 in Ljubljana, Slovenia. The seminar had about 50 participants from the entire former Yugoslavia. The program and the materials are available on the seminar page in Serbian. We are grateful to the organisers who made this additional seminar possible, and especially to the JANES project. JANES ends at the same time as ReLDI, and at the end of the seminar we celebrated several years of fruitful activities by dancing...

The fourth ReLDI seminar (the second in Zagreb) took place from 23 to 26 February 2017. The program and the materials are available on the seminar page in Serbian. Participant list and short biographies are available on the same page. ...

In collaboration with the Slovene project JANES, we are organising an additional ReLDI seminar, to be held 21-23 June 2017 in Ljubljana. Detailed information is available here....

The third ReLDI seminar (the second in Belgrade) took place from 26 to 29 January 2017. The program and materials are available on the seminar page in Serbian. The list of the participants and their short biographies will be available soon. ...

On 28 October 2016 Adriano Ferraresi (University of Bologna) and Maja Miličević held a statistics seminar for (corpus) linguists titled "Count your frequencies wisely! An introduction to concepts and methods in quantitative (corpus) linguistics". The seminar was hosted by the Department of Interpreting and Translation of the University of Bologna....

On 29 September 2016 Nikola Ljubešić gave two ReLDI-related talks at the Language Technologies & Digital Humanities 2016 conference in Ljubljana: (1) Easily Accessible Language Technologies for Slovene, Croatian and Serbian (authors: Nikola Ljubešić, Tomaž Erjavec, Darja Fišer, Tanja Samardžić, Maja Miličević, Filip Klubička, Filip Petkovski); (2) Analysing spatial distribution of linguistic variables in geoencoded tweets from Croatia, Bosnia, Montenegro and Serbia (authors: Nikola Ljubešić, Tanja Samardžić, Maja Miličević)....

The most important language technology tools developed inside the project are available via a web application (http://nl.ijs.si/services, registration point) and an API library (https://github.com/clarinsi/reldi-lib). We will continue adding tools to the ReLDI ecosystem as we develop them....

An article on our project appeared in the Belgrade daily newspaper Danas (in Serbian)....

The second ReLDI seminar took place in Zagreb from 27 to 30 June 2016. The list of seminar participants and their short biographies will be available soon on the seminar page (in Croatian)....

The Serbian Movie Review Dataset collection consists of three movie review datasets in Serbian which were constructed for the task of sentiment analysis: Collected movie reviews in Serbian (ISLRN 252-457-966-231-5) - an imbalanced collection of 4725 movie reviews in Serbian. SerbMR-2C - The Serbian Movie Review Dataset (2 Classes) (ISLRN 016-049-192-514-1) - a two-class balanced dataset that contains 1682 movie reviews (841 positive and 841 negative). SerbMR-3C - The Serbian Movie Review Dataset (3 Classes) (ISLRN 229-533-271-984-0) - a three-class...

This package is a Java reimplementation of four previously published stemming algorithms for Serbian and Croatian: The greedy and the optimal subsumption-based stemmer for Serbian, by Vlado Kešelj and Danko Šipka A refinement of the greedy subsumption-based stemmer, by Nikola Milošević A "Simple stemmer for Croatian v0.1", by Nikola Ljubešić and Ivan Pandžić All the stemmers expect the input text to be formatted in UTF-8. Their outputs are also UTF-8 encoded. Author Vuk Batanović Availability The package and a more extensive documentation can be downloaded...

Authors Vladan Pavlović, Miloš Jovanović Contents and description Questionnaire on attitudes towards the relation between language and national identity Download instrument Questionnaire on attitudes towards the relation between language and national identity Publications Pavlović, Vladan and Miloš Jovanović (2013). Stavovi studenata Univerziteta u Nišu o odnosu jezičkog i nacionalnog identiteta. Teme 38/2. 701-717. [Link] Pavlović, Vladan and Miloš Jovanović (2013). 'Language Nationalism' vs 'Language Cosmopolitanism': Divisions in the Attitudes towards the Relation between Language and National Identity. In I. Spasić and P. Cvetičanin (Eds) Us and Them –...

Author Tanja Stipeć Contents and description Picture verification task Download instrument Picture verification task Publications Kraš, Tihana and Tanja Stipeć (2013). Interpretation of ambiguous subject pronouns in Croatian by people with Down syndrome and typically developing children. In S. Baiz, N. Goldman and R. Hawkes (Eds), Proceedings of the 37th annual Boston University Conference on Language Development (pp. 178–190). Somerville, MA: Cascadilla Press. Kraš, Tihana, Helena Rubčić and Tanja Stipeć (2015). Subject pronoun interpretation in Croatian: Comparing monolinguals with simultaneous bilinguals. In Cergol Kovačević, K. and Udier,...

Author Mile Vuković Contents and description Serbian Word Reading Test Download instrument Serbian Word Reading Test Serbian Word Reading Test - examiner's form Publications Vuković, Mile (2015). Tretman afazija. 2nd edition. Belgrade: University of Belgrade – Faculty of Special Education and Rehabilitation. Vuković, Mile, Irena Vuković, Nick Miller (in press). Acquired dyslexia in Serbian speakers with Broca’s and Wernicke’s aphasia. Journal of Communication Disorders. [Link] Cite the repository page Vuković, Mile (2016). Test čitanja reči. ReLDI - Regional Linguistic Data Initiative platform. http://reldi.spur.uzh.ch/hr-sr/2016/06/12/instrumenti-vukovic2015/....

The first set of resources and tools for Croatian and Serbian developed or improved in the ReLDI project is now ready for sharing. The access details and terms of use are available on the resources and tools page....

The first ReLDI seminar took place in Belgrade from 2 to 5 June. The list of the participants and their short biographies will be available soon on the seminar page (in Serbian)....

A tool for automatic lemmatisation (returning the base or dictionary form of an inflected word). The tool looks up the hrLex/srLex lexicons and uses a predictive model for lemmatising OOVs (out of vocabulary words) which was trained on available corpora and lexicons. Author Nikola Ljubešić Availability The lemmatiser is freely available in three forms: For local use, the code and models of the lemmatiser can be downloaded from this GitHub repository. The lemmatiser web service can be used online, via our web interface that...
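
The lookup-then-predict strategy described above can be sketched as follows. This is not the lemmatiser's actual code: the tiny lexicon, the choice to key it on the (word form, MSD) pair, and the single suffix rule standing in for the trained OOV model are all assumptions made for illustration.

```python
# Toy lexicon keyed by (word form, MSD tag). Keying on the tag is an
# assumption of this sketch; the entry is the one quoted for srLex/hrLex.
LEXICON = {
    ("ženu", "Ncfsa"): "žena",
}


def guess_lemma(wordform, msd):
    """Stand-in for the trained OOV model: one hand-written suffix rule."""
    # Feminine singular accusative nouns in -u often have a lemma in -a.
    if msd == "Ncfsa" and wordform.endswith("u"):
        return wordform[:-1] + "a"
    return wordform  # no rule applies: fall back to the word form itself


def lemmatise(wordform, msd):
    """Look the form up in the lexicon first, then fall back to guessing."""
    lemma = LEXICON.get((wordform, msd))
    if lemma is not None:
        return lemma
    return guess_lemma(wordform, msd)


print(lemmatise("ženu", "Ncfsa"))    # found in the lexicon: žena
print(lemmatise("knjigu", "Ncfsa"))  # out of vocabulary, guessed: knjiga
```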

SETimes.SR is a reference training corpus of Serbian texts collected from the SETimes parallel news corpus. It contains 163 documents divided into 3891 sentences, or 86 726 tokens. The corpus is manually annotated on the following levels: Token, sentence, and document segmentation Morphosyntax Lemmas Dependency syntax Named entities The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Bosnian, Croatian and Serbian, available here. Dependency syntax is annotated according to the Universal Dependency v2 specification (UDv2). Named entity annotations...

hr500k is a reference training corpus of Croatian that consists of 900 documents divided into 24 794 sentences, or 506 457 tokens. It is an extension of previous training corpora for Croatian, such as SETimes.HR and SETimes.HR+. The corpus is manually annotated on the following levels: Token, sentence, and document segmentation Morphosyntax Lemmas Dependency syntax Semantic roles Named entities The entire corpus was annotated with regard to morphosyntax and lemmas. The set of morphosyntactic tags used in the corpus follows the revised...

Author Mile Vuković Contents and description Serbian Aphasia Screening Test Download instrument Serbian Aphasia Screening Test Publications Vuković, Mile, Bojana Drljan, and Irena Vuković (2014). Validacija skrining testa za afazije govornika srpskog jezika. Specijalna edukacija i rehabilitacija 1. 73-86. [Link] Cite the repository page Vuković, Mile (2016). Skrining test za afazije za govornike srpskog jezika. ReLDI - Regional Linguistic Data Initiative platform. http://reldi.spur.uzh.ch/hr-sr/2016/05/30/instrumenti-vukovic2010/....

srLex is an inflectional lexicon of Serbian. The size of the lexicon is 108,829 lemmas, or 5,326,726 surface forms. Each entry in the lexicon consists of a (wordform, lemma, MSD, absolute frequency, in-million frequency) quintuple, e.g.: (ženu, žena, Ncfsa, 15838, 0.028556). The frequencies were estimated on the Serbian web corpus srWaC. The set of morphosyntactic tags used in the lexicon follows the MULTEXT-East V5 tagset for Bosnian (and Serbian), available here. Authors Nikola Ljubešić, Filip Klubička Availability For local use, srLex can be downloaded as a raw...
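
An entry of the kind quoted above could be represented in Python as a small typed record. The sketch assumes one entry per line with tab-separated fields, which is a guess about the raw-text layout rather than something stated here; the class and function names are also hypothetical.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    wordform: str
    lemma: str
    msd: str            # morphosyntactic description (MULTEXT-East tag)
    abs_freq: int       # absolute frequency in the reference web corpus
    per_million: float  # the "in-million frequency" field of the entry


def parse_entry(line, sep="\t"):
    """Parse one lexicon line; the separator is an assumption."""
    wordform, lemma, msd, abs_freq, per_million = line.rstrip("\n").split(sep)
    return LexEntry(wordform, lemma, msd, int(abs_freq), float(per_million))


# The example quintuple from the description, joined with tabs.
line = "\t".join(["ženu", "žena", "Ncfsa", "15838", "0.028556"])
entry = parse_entry(line)
print(entry.lemma, entry.msd, entry.abs_freq)  # žena Ncfsa 15838
```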

hrLex is an inflectional lexicon of Croatian. The size of the lexicon is 103,077 lemmas, or 4,970,520 surface forms. Each entry in the lexicon consists of a (word form, lemma, MSD, absolute frequency, in-million frequency) quintuple, e.g.: (ženu, žena, Ncfsa, 54158, 0.038746). The frequencies were estimated on the Croatian web corpus hrWaC. The set of morphosyntactic tags used in the lexicon follows the MULTEXT-East V5 tagset for Croatian, available here. Authors Nikola Ljubešić, Filip Klubička Availability For local use, hrLex can be downloaded as a raw text...

srWaC is a web corpus collected from the .rs top-level domain. The 1.1 version of the corpus contains 555 million tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 1.2. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here. Authors Nikola Ljubešić, Filip Klubička Availability For local use, a full-text version of srWaC can be downloaded here. srWaC can also...

hrWaC is a web corpus collected from the .hr top-level domain. The 2.1 version of the corpus contains 1.4 billion tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 2.2. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here. Authors Nikola Ljubešić, Filip Klubička Availability For local use, a full-text version of hrWaC can be downloaded here. hrWaC can also...

A tool for automatic diacritic restoration on text with potentially missing diacritics (e.g. it turns kuca into kuća if necessary). Reported accuracy of the tool: 99.5% on standard language and 99.2% on non-standard language. Authors Nikola Ljubešić, Tomaž Erjavec, Darja Fišer Availability The tool is freely available in two forms: The code and models of the tool can be downloaded from this GitHub repository. Our web service can be accessed from our Python library, which can also be downloaded from the CLARIN.SI GitHub...
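
To give a feel for the task, the Python sketch below restores diacritics with a plain frequency lookup. It is a deliberately naive baseline and not the method used by the tool: the frequency counts are invented, the mapping of đ to dj is one common convention assumed here, and context is ignored entirely.

```python
# Map diacritic letters to their plain-ASCII spellings (lowercase only).
STRIP = str.maketrans({"č": "c", "ć": "c", "š": "s", "ž": "z", "đ": "dj"})


def strip_diacritics(word):
    return word.translate(STRIP)


def build_table(frequencies):
    """Map each diacritic-free form to its most frequent attested spelling."""
    best = {}
    for word, freq in frequencies.items():
        key = strip_diacritics(word)
        if key not in best or freq > best[key][1]:
            best[key] = (word, freq)
    return {key: word for key, (word, _) in best.items()}


def restore(text, table):
    """Replace each whitespace-separated token by its preferred spelling."""
    return " ".join(table.get(token, token) for token in text.split())


# Invented frequency counts, for illustration only.
FREQS = {"kuća": 120, "kuca": 7, "je": 900, "velika": 60}
TABLE = build_table(FREQS)

print(restore("kuca je velika", TABLE))  # kuća je velika
```

Because the lookup always prefers the more frequent spelling, it would wrongly rewrite a legitimate kuca; deciding when restoration is actually necessary requires looking at the surrounding words.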

A tool for automatic tokenisation (dividing text into words and sentences). It was engineered through iterative runs on representative datasets and features modes for both standard and non-standard language. Authors Nikola Ljubešić, Tomaž Erjavec Availability The tokeniser is freely available in three forms: For local use, the tokeniser can be downloaded from this GitHub repository. The tokeniser can be used online, via our web interface that can be found here. Our web service can be accessed from our Python library, which can also...

A tool for automatic annotation on the morphosyntactic level. It is capable of tagging both Croatian and Serbian as models for both languages are present in the tool. The tagger is based on the CRF algorithm trained on a 500,000-token Croatian training corpus and the hrLex/srLex lexicons for each respective language. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here. Accuracies calculated on test sets for each language: Croatian: 92.53% Serbian: 92.33% Author Nikola...

Registration is open for the second ReLDI seminar, to be held at the Faculty of Humanities and Social Sciences in Zagreb on 27-30 June 2016. More info (in Croatian) can be found on the local page of the seminar....

Author Tihana Kraš Contents and description Picture verification task - sentences Download instrument Picture verification task - sentences Publications Kraš, Tihana (2008). Anaphora resolution in Croatian: Psycholinguistic evidence from native speakers. In M. Tadić, M. Dimitrova-Vulchanova and S. Koeva (Eds), Proceedings of the Sixth International Conference Formal Approaches to South Slavic and Balkan Languages. Zagreb: Croatian Language Technologies Society – Faculty of Humanities and Social Sciences. 67-72. [Link] Cite the repository page Kraš, Tihana (2016). Zadatak odabira slike za ispitivanje (ne)izrečenih subjektnih zamjenica u hrvatskom jeziku. Platforma ReLDI...

Author Maja Miličević Contents and description 1. Sociodemographic questionnaire 2. Offline acceptability judgement task 3. Online acceptability judgement task Download instrument Sociodemographic questionnaire Offline task instructions Offline task - ListA1 Offline task - ListB1 Offline task - ListC1 Offline task - ListD1 Online task instructions Online task - ListA (E-Prime .es2 file) Online task - ListB (E-Prime .es2 file) Online task - ListC (E-Prime .es2 file) Online task - ListD (E-Prime .es2 file) Publications Miličević, Maja (2012). The possessive dative in Serbian as a valency phenomenon: a preliminary empirical study. In V. Ružić, M. Alanović and G. Štasni (Eds),...

Author Maja Miličević Contents and description 1. Sociodemographic questionnaire 2. Proficiency test (Cloze test) 3. Picture judgement task 4. Acceptability judgement task Download instrument Sociodemographic questionnaire Proficiency test (Cloze test) Picture judgement task Pictures for the picture judgement task Acceptability judgement task Complete test (native speaker version, randomisation A) Complete test (native speaker version, randomisation B) Download results Results of all tasks Publications Miličević, Maja (2012). The Acquisition of Reflexives and Reciprocals in L2 Italian, Serbian and English. Doctoral dissertation. Cambridge: University of Cambridge. [Link] Cite the repository page Miličević, Maja (2016). Test za ispitivanje usvojenosti refleksivnih i recipročnih glagola...

Author Jelena Grubor Contents and description 1. (Neologisms) Anglicisms in Serbian attitude scale 2. (Neologism) Anglicism use evaluation scale 3. Registry of lexical pairs Download instrument Grubor 2011 - Baterija za ispitivanje stavova prema anglicizmima u srpskom jeziku Publications Grubor, Jelena (2011). Stav govornika srpskog jezika prema (neologizmima) anglicizmima u srpskom jeziku. Prilozi proučavanju jezika 42: 65–79. Cite the repository page Grubor, Jelena (2016). Baterija testova za ispitivanje stavova prema anglicizmima u srpskom jeziku. The ReLDI - Regional Linguistic Data Initiative platform. http://reldi.spur.uzh.ch/hr-sr/2016/03/29/instrumenti-grubor2011/....

Author Jelena Grubor Contents and description I: Test battery for establishing attitudes towards L2 English language learning 0. Sociodemographic questionnaire 1a. History of English language learning questionnaire 1b. English language learning context evaluation scale 2. English language learning attitude scale 3. Social distance scale (Bogardus scale) 4. Aspiration scale 5. Extracurricular English language input scale II: Test battery for establishing the parents' attitudes towards L2 English language learning 1. Sociodemographic questionnaire 2. Language proficiency evaluation scale 3. English language learning attitude scale Download instrument Grubor 2012 - Test battery for establishing attitudes towards L2 English language...

Registration has opened for the first ReLDI seminar, to be held at the Faculty of Philology in Belgrade from 2 to 5 June 2016. Detailed information and the registration form are available on the seminar page....

In collaboration with the JANES Ekspres project, Nikola Ljubešić and Maja Miličević held parallel workshops for annotators of non-standard Croatian and Serbian. Nikola led the workshop held in Zagreb on 4 December, and Maja the one held in Belgrade on 10 December, both together with Tomaž Erjavec. More info is available here....

In collaboration with the Slovene project JANES, on 25 November 2015 Maja Miličević held a statistics seminar for corpus linguists titled "Beyond example extraction: Quantitative analysis of the JANES corpus". More info is available here....

We are happy to announce that the English version of our website is up and running. For now you can find information about what ReLDI is and what we will do over the next year and a half; much more stuff will be added soon. We are also working on website localisation for Croatian and Serbian, which we hope to finalise in December. So stay tuned!...