A Python module comprising a tokeniser, a part-of-speech/MSD tagger, a lemmatiser, a dependency parser, and a named entity recogniser for most South Slavic languages. For Croatian and Serbian there are models for processing both standard and Internet non-standard texts. The estimated accuracy of morphosyntactic tagging for this tool is ~94%, while for lemmatisation the accuracy is ~99%. Dependency parsing has a labeled attachment score of ~0.9, while named entity recognition achieves a micro-F1 of ~0.9. Author Nikola Ljubešić Publications The experiments yielding this pipeline...
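For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how a labeled attachment score and a micro-averaged F1 are commonly computed. The function names and the toy data are illustrative assumptions; this is not the evaluation code used for the pipeline.

```python
def labeled_attachment_score(gold, predicted):
    """Share of tokens whose predicted (head, label) pair equals the gold pair."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def micro_f1(gold_entities, predicted_entities):
    """F1 over entity spans pooled across all entity types (micro-averaging)."""
    gold, pred = set(gold_entities), set(predicted_entities)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example (invented): one (head index, dependency label) pair per token.
gold_arcs = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred_arcs = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "amod")]
las = labeled_attachment_score(gold_arcs, pred_arcs)

# Toy example (invented): entities as (start token, end token, type).
gold_ents = [(0, 1, "PER"), (4, 6, "LOC")]
pred_ents = [(0, 1, "PER"), (4, 5, "LOC")]
f1 = micro_f1(gold_ents, pred_ents)
```

In the toy data three of four arcs match, and one of two predicted entities matches one of two gold entities exactly, so an entity with a wrong boundary counts as both a false positive and a false negative.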

The SentiComments.SR dataset includes the following three corpora: (1) the main SentiComments.SR corpus, consisting of 3490 movie-related comments; (2) the movie verification corpus, consisting of 464 movie-related comments; (3) the book verification corpus, consisting of 173 book-related comments. The main SentiComments.SR corpus was constructed out of the comments written by visitors on the kakavfilm.com movie review website in Serbian. The movie verification corpus comments were sourced from two other Serbian movie review websites - gledajme.rs and happynovisad.com. The book verification corpus comments were also...

On 26 February 2021, ReLDI will participate in the conference Internationalisms in Slavic as a window into the architecture of grammar - InterSlavic 2020/2021, organised by the University of Graz. The abstract and the presentation can be found here, and in case you wish to join us and the rest of the conference, the registration instructions are available here....

The CLASSLA knowledge centre has recently published a new state-of-the-art BERT-like model covering Croatian and Serbian. The model is available via HuggingFace, and a version fine-tuned to the task of named entity recognition has been published on the same platform too (see here). The creator of the model is Nikola Ljubešić....

The ReLDI centre has recently collaborated with the Research Centre of the Slovenian Academy of Sciences and Arts on the creation of corpora of language-related news articles and news comments in Serbian, Croatian and Slovene. The work was coordinated by Vuk Batanović, and the corpora can be downloaded here....

On 24 October 2020, a number of ReLDI resources were presented by Vuk Batanović at the conference Primena slobodnog softvera i otvorenog hardvera (Application of Free Software and Open Hardware) at the School of Electrical Engineering in Belgrade. You can see the presentation and the paper here....

A three-year strategic partnership UPSKILLS - UPgrading the SKIlls of Linguistics and Language Students, whose participants are Tanja Samardžić and Maja Miličević Petrović, has been approved for funding within the Erasmus+ programme, with the addition of funding by the Swiss agency Movetia. The coordinator of the partnership is the University of Malta. The project partners will develop teaching materials for students of languages and linguistics on topics such as programming, research methods and quantitative data analysis....

A two-year project Advancing Novel Textual Similarity-based Solutions in Software Development - AVANTES has been approved for funding within the Artificial Intelligence programme of the Science Fund of the Republic of Serbia. Participants in the project are ReLDI members Vuk Batanović, Maja Miličević Petrović and Tanja Samardžić. The project is dedicated to the study of the relationship between programming code semantics and the meaning of code comments written in natural languages. Solutions regarding the problems of code comment categorization,...

On 6 May 2020, ReLDI participated in the first CLASSLA (online) workshop. You can read more about it here (and see us!)....

The recently set-up CLARIN knowledge centre for South Slavic languages [CLASSLA] and ReLDI have come to an agreement of close collaboration. While CLASSLA has a wider scope (South Slavic languages) and is mostly active in the CLARIN ERIC community, ReLDI is still a cornerstone of grassroots efforts in Croatia, Serbia and beyond, to bootstrap language technology development and improve language research methodology....

Our Movetia project ended after a successful and fun summer school held in the Petnica Science Centre 1-13 July 2019. For more information and some photos, see here....

Borders and boundaries in Bosnian, Croatian, Montenegrin and Serbian: Twitter data to the rescue Journal of Linguistic Geography, 6(2), 100-124. Nikola Ljubešić, Maja Miličević Petrović and Tanja Samardžić A new article by an all-ReLDI team shows how we can use Twitter to clarify the ever-puzzling relationships between Bosnian, Croatian, Montenegrin and Serbian. For a pre-print version follow this link.  ...

Author Sabina Halupka-Rešetar Contents and description Survey on request modification Download instrument Survey on request modification Publications Halupka-Rešetar, Sabina (2014). Request modification in the pragmatic production of intermediate ESP learners. ESP Today 2: 29-47. [Link] Cite the repository page Halupka-Rešetar, Sabina (2019). Survey on request modification. ReLDI - Regional Linguistic Data Initiative platform. http://reldi.spur.uzh.ch/hr-sr/instrumenti-halupka-resetar2014b....

Author Sabina Halupka-Rešetar Contents and description Questionnaire on EFL pragmatic competence (compliments) Download instrument Questionnaire on EFL pragmatic competence (compliments) Publications Halupka-Rešetar, Sabina (2014). Compliment responses - a study of the pragmatic competence of advanced EFL students in Serbia. In T. Prćić et al. (Eds) Engleski jezik i anglofone književnosti u teoriji i praksi: Zbornik radova u čast Draginji Pervaz (pp. 173-191). Novi Sad: Filozofski fakultet. [Link] Cite the repository page Halupka-Rešetar, Sabina (2019). Questionnaire on EFL pragmatic competence (compliments). ReLDI - Regional Linguistic Data Initiative platform. http://reldi.spur.uzh.ch/hr-sr/instrumenti-halupka-resetar2014a....

On 21 December 2018, the ReLDI network held a meeting at the Institute for Croatian Language and Linguistics in Zagreb. Network members and guests presented their ideas for future projects, followed by a discussion on possibilities for regional collaboration. We thank all the participants for their input. Special thanks go to the local organiser, Kristina Štrkalj Despot, and to the Institute for Croatian Language and Linguistics for their hospitality!   ...

The Serbian Semantic Textual Similarity News Corpus (STS.news.sr) consists of 1192 pairs of sentences in Serbian, or around 64 thousand tokens, gathered from news sources on the web and written in the Serbian Latin script. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0-5 scale. The final scores were obtained by averaging the individual scores of five annotators. The sentence pairs in this dataset were taken from the Serbian Paraphrase Corpus (paraphrase.sr). The annotation methodology generally...
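The score aggregation described above can be sketched as follows. This is a minimal illustration that assumes each pair carries exactly five annotator scores on the 0-5 scale; the function name and the example scores are invented and are not taken from the corpus.

```python
from statistics import mean


def final_similarity(annotator_scores):
    """Average five individual 0-5 scores into one final similarity score."""
    if len(annotator_scores) != 5:
        raise ValueError("expected scores from exactly five annotators")
    if any(not 0 <= score <= 5 for score in annotator_scores):
        raise ValueError("scores must lie on the 0-5 scale")
    return mean(annotator_scores)


# Invented scores for a single sentence pair.
example_score = final_similarity([4, 5, 4, 3, 4])
```

Averaging yields fine-grained, possibly fractional scores even when the individual annotators use whole numbers.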

The next ReLDI meeting will take place in Zagreb on 21 December 2018. If you would like to attend, let us know via our contact form....

A new grant, Revisiting research training in linguistics: theory, logic, method (Nr. 2018-CH-IP-0012), has been awarded by the Swiss funding agency Movetia to Tanja Samardžić and Maja Miličević in collaboration with Genoveva Puskas from the University of Geneva. The goal of the new partnership is to introduce the content taught in the ReLDI seminars to the undergraduate programmes at the Universities of Zurich, Belgrade and Geneva. See the announcement from the University of Zurich....

The Serbian Paraphrase Corpus (paraphrase.sr) consists of 1194 pairs of sentences gathered from news sources on the web. Each sentence pair was manually annotated with a binary similarity score that indicates whether the sentences in the pair are semantically similar enough to be considered close paraphrases. The corpus contains 553 sentence pairs deemed to be semantically equivalent (46.31% of the total number), and 641 semantically diverse pairs (53.69% of the total number). Author Vuk Batanović Availability The corpus and its documentation can be found...
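The class proportions quoted above follow directly from the pair counts. The short sketch below reproduces the arithmetic; the function name and the dictionary labels are illustrative assumptions.

```python
def class_shares(counts):
    """Return each class's share of the total, as a percentage with two decimals."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


pair_counts = {"paraphrase": 553, "non-paraphrase": 641}
total_pairs = sum(pair_counts.values())
shares = class_shares(pair_counts)
```

The two counts add up to the 1194 pairs in the corpus, and the resulting shares match the reported 46.31% and 53.69%.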

Four ReLDI online courses are now open and can be accessed via the Open edX @ Zurich platform. More detailed information can be found at the online courses page....

Between 15 August and 1 October 2017 four ReLDI online courses will be opened on the Open edX @ Zurich platform. More detailed information can be accessed from the online courses page....

ReLDI-NormTagNER-sr 2.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation, and named entity recognition of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). Authors Nikola Ljubešić, Tomaž Erjavec, Vuk Batanović, Maja Miličević, Tanja Samardžić Availability For local use, a full-text version of the corpus can be downloaded from the CLARIN.SI repository. Publication The corpus...

ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation, and named entity recognition of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). Authors Nikola Ljubešić, Tomaž Erjavec, Vuk Batanović, Maja Miličević, Tanja Samardžić Availability For local use, a full-text version of the corpus can be downloaded from the CLARIN.SI repository. Publication The corpus...

Blog posts about ReLDI have been published on the URPP Language and Space blog (author: Tanja Samardžić), and the CLARIN blog (authors: Ana Slavec and Jakob Lenardič)....

The fifth ReLDI seminar took place from 21 to 23 June 2017 in Ljubljana, Slovenia. The seminar had about 50 participants from the entire former Yugoslavia. The program and the materials are available on the seminar page in Serbian. We are grateful to the organisers who made this additional seminar possible, and especially to the JANES project. JANES ends at the same time as ReLDI, and at the end of the seminar we celebrated several years of fruitful activities by dancing...

The fourth ReLDI seminar (the second in Zagreb) took place from 23 to 26 February 2017. The program and the materials are available on the seminar page in Serbian. Participant list and short biographies are available on the same page. ...

In collaboration with the Slovene project JANES, we are organising an additional ReLDI seminar, to be held 21-23 June 2017 in Ljubljana. Detailed information is available here....

The third ReLDI seminar (the second in Belgrade) took place from 26 to 29 January 2017. The program and materials are available on the seminar page in Serbian. The list of the participants and their short biographies will be available soon. ...

On 28 October 2016 Adriano Ferraresi (University of Bologna) and Maja Miličević held a statistics seminar for (corpus) linguists titled "Count your frequencies wisely! An introduction to concepts and methods in quantitative (corpus) linguistics". The seminar was hosted by the Department of Interpreting and Translation of the University of Bologna....

On 29 September 2016 Nikola Ljubešić gave two ReLDI-related talks at the Language Technologies & Digital Humanities 2016 conference in Ljubljana: (1) Easily Accessible Language Technologies for Slovene, Croatian and Serbian (authors: Nikola Ljubešić, Tomaž Erjavec, Darja Fišer, Tanja Samardžić, Maja Miličević, Filip Klubička, Filip Petkovski); (2) Analysing spatial distribution of linguistic variables in geoencoded tweets from Croatia, Bosnia, Montenegro and Serbia (authors: Nikola Ljubešić, Tanja Samardžić, Maja Miličević)....

The most important language technology tools developed inside the project are available via a web application (http://nl.ijs.si/services, registration point) and an API library (https://github.com/clarinsi/reldi-lib). We will continue adding tools to the ReLDI ecosystem as we develop them....

An article on our project appeared in the Belgrade daily newspaper Danas (in Serbian)....

The second ReLDI seminar took place in Zagreb from 27 to 30 June 2016. The list of seminar participants and their short biographies will be available soon on the seminar page (in Croatian)....

The Serbian Movie Review Dataset collection consists of three movie review datasets in Serbian which were constructed for the task of sentiment analysis: (1) Collected movie reviews in Serbian (ISLRN 252-457-966-231-5) - an imbalanced collection of 4725 movie reviews in Serbian; (2) SerbMR-2C - The Serbian Movie Review Dataset (2 Classes) (ISLRN 016-049-192-514-1) - a two-class balanced dataset that contains 1682 movie reviews (841 positive and 841 negative); (3) SerbMR-3C - The Serbian Movie Review Dataset (3 Classes) (ISLRN 229-533-271-984-0) - a three-class...

This package is a Java reimplementation of four previously published stemming algorithms for Serbian and Croatian: the greedy and the optimal subsumption-based stemmers for Serbian, by Vlado Kešelj and Danko Šipka; a refinement of the greedy subsumption-based stemmer, by Nikola Milošević; and a "Simple stemmer for Croatian v0.1", by Nikola Ljubešić and Ivan Pandžić. All the stemmers expect the input text to be encoded in UTF-8. Their outputs are also UTF-8 encoded. Author Vuk Batanović Availability The package and more extensive documentation can be downloaded...

Authors Vladan Pavlović, Miloš Jovanović Contents and description Questionnaire on attitudes towards the relation between language and national identity Download instrument Questionnaire on attitudes towards the relation between language and national identity Publications Pavlović, Vladan and Miloš Jovanović (2013). Stavovi studenata Univerziteta u Nišu o odnosu jezičkog i nacionalnog identiteta. Teme 38/2. 701-717. [Link] Pavlović, Vladan and Miloš Jovanović (2013). 'Language Nationalism' vs 'Language Cosmopolitanism': Divisions in the Attitudes towards the Relation between Language and National Identity. In I. Spasić and P. Cvetičanin (Eds) Us and Them –...

Author Tanja Stipeć Contents and description Picture verification task Download instrument Picture verification task Publications Kraš, Tihana and Tanja Stipeć (2013). Interpretation of ambiguous subject pronouns in Croatian by people with Down syndrome and typically developing children. In S. Baiz, N. Goldman and R. Hawkes (Eds), Proceedings of the 37th annual Boston University Conference on Language Development (pp. 178–190). Somerville, MA: Cascadilla Press. Kraš, Tihana, Helena Rubčić and Tanja Stipeć (2015). Subject pronoun interpretation in Croatian: Comparing monolinguals with simultaneous bilinguals. In Cergol Kovačević, K. and Udier,...

Author Mile Vuković Contents and description Serbian Word Reading Test Download instrument Serbian Word Reading Test Serbian Word Reading Test - examiner's form Publications Vuković, Mile (2015). Tretman afazija. 2nd edition. Belgrade: University of Belgrade – Faculty of Special Education and Rehabilitation. Vuković, Mile, Irena Vuković, Nick Miller (in press). Acquired dyslexia in Serbian speakers with Broca’s and Wernicke’s aphasia. Journal of Communication Disorders. [Link] Cite the repository page Vuković, Mile (2016). Test čitanja reči. ReLDI - Regional Linguistic Data Initiative platform. http://reldi.spur.uzh.ch/hr-sr/instrumenti-vukovic2015/....

The first set of resources and tools for Croatian and Serbian developed or improved in the ReLDI project is now ready for sharing. The access details and terms of use are available on the resources and tools page....

The first ReLDI seminar took place in Belgrade from 2 to 5 June 2016. The list of the participants and their short biographies will be available soon on the seminar page (in Serbian)....

This is considered a legacy tool, as the NLP pipeline achieves better results on the same task; the pipeline, however, is not yet available as a web service. A tool for automatic lemmatisation (returning the base or dictionary form of an inflected word). The tool looks up the hrLex/srLex lexicons and, for out-of-vocabulary words (OOVs), uses a predictive model which was trained on available corpora and lexicons. Author Nikola Ljubešić Availability The lemmatiser is freely available in three forms: For local use, the...
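The lookup-then-predict strategy described above can be sketched roughly as follows. This is only an illustration of the control flow: the toy lexicon and the suffix rules are invented, and the suffix rules merely stand in for the trained predictive model, whose internals are not described here.

```python
def guess_lemma(word, suffix_rules):
    """Stand-in for a predictive model: rewrite the first matching suffix."""
    for suffix, replacement in suffix_rules:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule applies, so return the word unchanged


def lemmatise(word, lexicon, suffix_rules):
    """Look the word up in the lexicon; fall back to guessing for OOVs."""
    key = word.lower()
    if key in lexicon:
        return lexicon[key]
    return guess_lemma(key, suffix_rules)


# Hypothetical data; real lexicon entries also carry morphosyntactic tags.
TOY_LEXICON = {"kuće": "kuća", "knjigama": "knjiga"}
TOY_RULES = [("ama", "a"), ("e", "a")]  # longer suffixes listed first

known = lemmatise("Kuće", TOY_LEXICON, TOY_RULES)        # found in the lexicon
unknown = lemmatise("olovkama", TOY_LEXICON, TOY_RULES)  # OOV, handled by rules
```

Keeping the lexicon lookup first means that attested forms never depend on the guesser, which is consulted only for words the lexicon does not cover.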

SETimes.SR is a reference training corpus of Serbian texts collected from the SETimes parallel news corpus. It contains 163 documents divided into 3891 sentences, or 86 726 tokens. The corpus is manually annotated on the following levels: token, sentence, and document segmentation; morphosyntax; lemmas; dependency syntax; named entities. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Bosnian, Croatian and Serbian, available here. Dependency syntax is annotated according to the Universal Dependency v2 specification (UDv2). Named entity annotations...

hr500k is a reference training corpus of Croatian that consists of 900 documents divided into 24 794 sentences, or 506 457 tokens. It is an extension of previous training corpora for Croatian, such as SETimes.HR and SETimes.HR+. The corpus is manually annotated on the following levels: token, sentence, and document segmentation; morphosyntax; lemmas; dependency syntax; semantic roles; named entities. The entire corpus was annotated with regard to morphosyntax and lemmas. The set of morphosyntactic tags used in the corpus follows the revised...

Author Mile Vuković Contents and description Serbian Aphasia Screening Test Download instrument Serbian Aphasia Screening Test Publications Vuković, Mile, Bojana Drljan, and Irena Vuković (2014). Validacija skrining testa za afazije govornika srpskog jezika. Specijalna edukacija i rehabilitacija 1. 73-86. [Link] Cite the repository page Vuković, Mile (2016). Skrining test za afazije za govornike srpskog jezika. ReLDI - Regional Linguistic Data Initiative platform. http://reldi.spur.uzh.ch/hr-sr/instrumenti-vukovic2010/....

srLex is an inflectional lexicon of Serbian. The size of the lexicon is 169,328 lemmas, or 6,905,941 surface forms. Each entry in the lexicon consists of a (wordform, lemma, MSD, MSD features, UPOS, morphological features, absolute frequency, in-million frequency) 8-tuple. The frequencies were estimated on the Serbian web corpus srWaC. The set of morphosyntactic tags used in the lexicon follows the MULTEXT-East V6 tagset for Serbo-Croatian macro-language, available here. Authors Nikola Ljubešić Availability For local use, srLex can be downloaded as a raw text file here. srLex can...
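To make the 8-tuple structure concrete, here is a minimal sketch of reading one lexicon entry into a typed record. It assumes the raw text file stores one entry per line with tab-separated fields in the order listed above; the separator, the field names, and the sample line are all assumptions made for illustration and are not taken from the actual file.

```python
from typing import NamedTuple


class LexiconEntry(NamedTuple):
    wordform: str
    lemma: str
    msd: str
    msd_features: str
    upos: str
    morph_features: str
    abs_freq: int
    per_million_freq: float


def parse_entry(line, sep="\t"):
    """Split one line into the eight fields and convert the two frequencies."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    return LexiconEntry(*fields[:6], int(fields[6]), float(fields[7]))


# Invented sample line; the tag and frequency values are placeholders.
sample = "kuće\tkuća\tNcfsg\tType=common\tNOUN\tCase=Gen\t1234\t2.22\n"
entry = parse_entry(sample)
```

A named tuple keeps the positional 8-tuple shape while allowing access by field name, for example `entry.lemma` or `entry.abs_freq`.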

hrLex is an inflectional lexicon of Croatian. The size of the lexicon is 164,206 lemmas, or 6,427,709 4,970,520 surface forms. Each entry in the lexicon consists of a (word form, lemma, MSD, MSD features, UPOS, morphological features, absolute frequency, in-million frequency) 8-tuple. The frequencies were estimated on the Croatian web corpus hrWaC. The set of morphosyntactic tags used in the lexicon follows the MULTEXT-East V6 tagset for Serbo-Croatian macro-language, available here. Authors Nikola Ljubešić Availability For local use, hrLex can be downloaded as a raw text file here. hrLex...

srWaC is a web corpus collected from the .rs top-level domain. The 1.1 version of the corpus contains 555 million tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 1.2. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here. Authors Nikola Ljubešić, Filip Klubička Availability For local use, a full-text version of srWaC can be downloaded here. srWaC can also...

hrWaC is a web corpus collected from the .hr top-level domain. The 2.1 version of the corpus contains 1.4 billion tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 2.2. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here. Authors Nikola Ljubešić, Filip Klubička Availability For local use, a full-text version of hrWaC can be downloaded here. hrWaC can also...

A tool for automatic diacritic restoration on text with potentially missing diacritics (e.g. it turns kuca into kuća if necessary). Reported accuracy of the tool: 99.5% on standard language and 99.2% on non-standard language. Authors Nikola Ljubešić, Tomaž Erjavec, Darja Fišer Availability The tool is freely available in two forms: The code and models of the tool can be downloaded from this GitHub repository. Our web service can be accessed from our Python library, which can also be downloaded from the CLARIN.SI GitHub...
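As a rough illustration of the task, the sketch below restores diacritics with a plain dictionary lookup. It is a deliberately naive stand-in, not the method used by the tool: a context-free lookup always rewrites kuca, whereas deciding whether the change is necessary requires looking at the surrounding words. The toy lexicon and function names are invented.

```python
import re

# Hypothetical mapping from stripped forms to their most likely restored forms.
TOY_RESTORATIONS = {"kuca": "kuća", "casa": "čaša"}


def restore_token(token, restorations):
    """Restore one token, preserving its original capitalisation pattern."""
    restored = restorations.get(token.lower())
    if restored is None:
        return token
    if token.isupper():
        return restored.upper()
    if token[0].isupper():
        return restored.capitalize()
    return restored


def restore_diacritics(text, restorations):
    """Apply token-level restoration to every word in the text."""
    return re.sub(r"\w+", lambda m: restore_token(m.group(0), restorations), text)


restored_sentence = restore_diacritics("Moja kuca je velika.", TOY_RESTORATIONS)
```

Punctuation and unknown words pass through unchanged, so the function can be run safely over text that already contains diacritics.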

This is considered a legacy tool, as the NLP pipeline achieves better results on the same task; the pipeline, however, is not yet available as a web service. A tool for automatic tokenisation (dividing text into words and sentences). It was engineered through iterative runs on representative datasets and features modes for both standard and non-standard language. Authors Nikola Ljubešić, Tomaž Erjavec Availability The tokeniser is freely available in three forms: For local use, the tokeniser can be downloaded from this GitHub repository. The tokeniser can...

This is considered a legacy tool, as the NLP pipeline achieves better results on the same task; the pipeline, however, is not yet available as a web service. A tool for automatic annotation on the morphosyntactic level. It can tag both Croatian and Serbian, as models for both languages are included in the tool. The tagger is based on the CRF algorithm, trained on a 500,000-token Croatian training corpus and the hrLex/srLex lexicons for each respective language. The set of morphosyntactic tags used...

Registration is open for the second ReLDI seminar, to be held at the Faculty of Humanities and Social Sciences in Zagreb on 27-30 June 2016. More info (in Croatian) can be found on the local page of the seminar....

Author Tihana Kraš Contents and description Picture verification task - sentences Download instrument Picture verification task - sentences Publications Kraš, Tihana (2008). Anaphora resolution in Croatian: Psycholinguistic evidence from native speakers. In M. Tadić, M. Dimitrova-Vulchanova and S. Koeva (Eds), Proceedings of the Sixth International Conference Formal Approaches to South Slavic and Balkan Languages. Zagreb: Croatian Language Technologies Society – Faculty of Humanities and Social Sciences. 67-72. [Link] Cite the repository page Kraš, Tihana (2016). Zadatak odabira slike za ispitivanje (ne)izrečenih subjektnih zamjenica u hrvatskom jeziku. Platforma ReLDI...

Author Maja Miličević Contents and description 1. Sociodemographic questionnaire 2. Offline acceptability judgement task 3. Online acceptability judgement task Download instrument Sociodemographic questionnaire Offline task instructions Offline task - ListA1 Offline task - ListB1 Offline task - ListC1 Offline task - ListD1 Online task instructions Online task - ListA (E-Prime .es2 file) Online task - ListB (E-Prime .es2 file) Online task - ListC (E-Prime .es2 file) Online task - ListD (E-Prime .es2 file) Publications Miličević, Maja (2012). The possessive dative in Serbian as a valency phenomenon: a preliminary empirical study. In V. Ružić, M. Alanović and G. Štasni (Eds),...

Author Maja Miličević Contents and description 1. Sociodemographic questionnaire 2. Proficiency test (Cloze test) 3. Picture judgement task 4. Acceptability judgement task Download instrument Sociodemographic questionnaire Proficiency test (Cloze test) Picture judgement task Pictures for the picture judgement task Acceptability judgement task Complete test (native speaker version, randomisation A) Complete test (native speaker version, randomisation B) Download results Results of all tasks Publications Miličević, Maja (2007). The Acquisition of Reflexives and Reciprocals in L2 Italian, Serbian and English. Doctoral dissertation. Cambridge: University of Cambridge. [Link] Cite the repository page Miličević, Maja (2016). Test za ispitivanje usvojenosti refleksivnih i recipročnih glagola...

Author Jelena Grubor Contents and description 1. (Neologisms) Anglicisms in Serbian attitude scale 2. (Neologism) Anglicism use evaluation scale 3. Registry of lexical pairs Download instrument Grubor 2011 - Baterija za ispitivanje stavova prema anglicizmima u srpskom jeziku Publications Grubor, Jelena (2011). Stav govornika srpskog jezika prema (neologizmima) anglicizmima u srpskom jeziku. Prilozi proučavanju jezika 42: 65–79. Cite the repository page Grubor, Jelena (2016). Baterija testova za ispitivanje stavova prema anglicizmima u srpskom jeziku. The ReLDI - Regional Linguistic Data Initiative platform. http://reldi.spur.uzh.ch/hr-sr/instrumenti-grubor2011/....

Author Jelena Grubor Contents and description I: Test battery for establishing attitudes towards L2 English language learning 0. Sociodemographic questionnaire 1a. History of English language learning questionnaire 1b. English language learning context evaluation scale 2. English language learning attitude scale 3. Social distance scale (Bogardus scale) 4. Aspiration scale 5. Extracurricular English language input scale II: Test battery for establishing the parents' attitudes towards L2 English language learning 1. Sociodemographic questionnaire 2. Language proficiency evaluation scale 3. English language learning attitude scale Download instrument Grubor 2012 - Test battery for establishing attitudes towards L2 English language...

Registration is open for the first ReLDI seminar, to be held at the Faculty of Philology in Belgrade from 2 to 5 June 2016. Detailed information and the registration form can be found on the seminar page....

In collaboration with the JANES Ekspres project, Nikola Ljubešić and Maja Miličević held parallel workshops for annotators of non-standard Croatian and Serbian. Nikola led the workshop held in Zagreb on 4 December, and Maja the one held in Belgrade on 10 December, both together with Tomaž Erjavec. More info is available here....

In collaboration with the Slovene project JANES, on 25 November 2015 Maja Miličević held a statistics seminar for corpus linguists titled "Beyond example extraction: Quantitative analysis of the JANES corpus". More info is available here....

We are happy to announce that the English version of our website is up and running. For now you can find information about what ReLDI is and what we will do over the next year and a half; much more stuff will be added soon. We are also working on website localisation for Croatian and Serbian, which we hope to finalise in December. So stay tuned!...