Resources and tools


Croatian and Serbian lemmatiser


A tool for automatic lemmatisation (returning the base or dictionary form of an inflected word). The tool looks up the hrLex/srLex lexicons and, for OOVs (out-of-vocabulary words), uses a predictive model trained on available corpora and lexicons. Author: Nikola Ljubešić. Availability: The lemmatiser is freely available in three forms: For local use, the code and models of the lemmatiser can be downloaded from this GitHub repository. The lemmatiser web service can be used online, via…

31/05/2016


Croatian and Serbian part of speech (POS) and morphosyntactic (MSD) tagger


A tool for automatic annotation on the morphosyntactic level. It can tag both Croatian and Serbian, as the tool includes models for both languages. The tagger is based on the CRF algorithm, trained on a 500,000-token Croatian training corpus and the hrLex/srLex lexicons for each respective language. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here. Accuracies calculated on test sets for each language: Croatian:…

02/05/2016


Croatian and Serbian tokeniser


A tool for automatic tokenisation (dividing text into words and sentences). It was engineered through iterative runs on representative datasets and features modes for both standard and non-standard language. Authors: Nikola Ljubešić, Tomaž Erjavec. Availability: The tokeniser is freely available in three forms: For local use, the tokeniser can be downloaded from this GitHub repository. The tokeniser can be used online, via our web interface that can be found here. Our web service can be accessed from our Python…


Croatian annotated corpus: hr500k


hr500k is a reference training corpus of Croatian that consists of 900 documents divided into 24 794 sentences, or 506 457 tokens. It is an extension of previous training corpora for Croatian, such as SETimes.HR and SETimes.HR+. The corpus is manually annotated on the following levels: token, sentence, and document segmentation; morphosyntax; lemmas; dependency syntax; semantic roles; named entities. The entire corpus was annotated with regard to morphosyntax and lemmas. The set of morphosyntactic tags used in the corpus follows…

31/05/2016


Croatian lexicon: hrLex


hrLex is an inflectional lexicon of Croatian. The size of the lexicon is 103,077 lemmas, or 4,970,520 surface forms. Each entry in the lexicon consists of a (word form, lemma, MSD, absolute frequency, in-million frequency) quintuple, e.g.: (ženu, žena, Ncfsa, 54158, 0.038746). The frequencies were estimated on the Croatian web corpus hrWaC. The set of morphosyntactic tags used in the lexicon follows the MULTEXT-East V5 tagset for Croatian, available here. Authors: Nikola Ljubešić, Filip Klubička. Availability: For local use, hrLex…

02/05/2016
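
The quintuple entry format described for hrLex can be modelled as a small record type. The sketch below is illustrative only: the tab-separated, one-entry-per-line layout and the names LexEntry and parse_entry are assumptions made for this example, not the documented file format of the lexicon.

```python
from typing import NamedTuple


class LexEntry(NamedTuple):
    """One lexicon entry: (word form, lemma, MSD, absolute freq., relative freq.)."""
    word_form: str
    lemma: str
    msd: str
    abs_freq: int
    rel_freq: float


def parse_entry(line: str, sep: str = "\t") -> LexEntry:
    # ASSUMPTION: one entry per line, five fields separated by `sep`.
    # The real on-disk layout of hrLex may differ.
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    word_form, lemma, msd, abs_freq, rel_freq = fields
    return LexEntry(word_form, lemma, msd, int(abs_freq), float(rel_freq))


# The example entry quoted in the description above.
entry = parse_entry("ženu\tžena\tNcfsa\t54158\t0.038746")
```

Here entry.lemma holds the base form for the inflected word form in entry.word_form, and entry.msd holds the MULTEXT-East tag string.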


Croatian web corpus: hrWaC


hrWaC is a web corpus collected from the .hr top-level domain. Version 2.1 of the corpus contains 1.4 billion tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 2.2. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here. Authors: Nikola Ljubešić, Filip Klubička. Availability: For local use, a full-text version of hrWaC can…


Diacritic restoration tool


A tool for automatic diacritic restoration on text with potentially missing diacritics (e.g. it turns kuca into kuća if necessary). Reported accuracy of the tool: 99.5% on standard language and 99.2% on non-standard language. Authors: Nikola Ljubešić, Tomaž Erjavec, Darja Fišer. Availability: The tool is freely available in two forms: The code and models of the tool can be downloaded from this GitHub repository. Our web service can be accessed from our Python library, which can also be downloaded…


ReLDI-NormTag-hr 1.1


ReLDI-NormTag-hr 1.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Croatian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). As an update to version 1.0, 1.1 corrects some minor errors. Authors: Nikola Ljubešić, Daša Farkaš, Filip Klubička, Tomaž Erjavec, Maja Miličević, Matea Filko, Denis Kranjčić, Barbara Dujmić…

11/08/2017


ReLDI-NormTag-sr 1.1


ReLDI-NormTag-sr 1.1 is a manually annotated corpus of Serbian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging and lemmatisation of non-standard Serbian. Each tweet is also annotated for its automatically assigned standardness levels (T = technical standardness, L = linguistic standardness). As an update to version 1.0, 1.1 corrects some minor errors. Authors: Nikola Ljubešić, Daša Farkaš, Filip Klubička, Tomaž Erjavec, Maja Miličević, Teodora Vuković. Availability: For local use,…


Serbian annotated corpus: SETimes.SR


SETimes.SR is a reference training corpus of Serbian texts collected from the SETimes parallel news corpus. It contains 163 documents divided into 3891 sentences, or 86 726 tokens. The corpus is manually annotated on the following levels: token, sentence, and document segmentation; morphosyntax; lemmas; dependency syntax; named entities. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Bosnian, Croatian and Serbian, available here. Dependency syntax is annotated according to the Universal Dependency v2…

31/05/2016


Serbian lexicon: srLex


srLex is an inflectional lexicon of Serbian. The size of the lexicon is 108,829 lemmas, or 5,326,726 surface forms. Each entry in the lexicon consists of a (word form, lemma, MSD, absolute frequency, in-million frequency) quintuple, e.g.: (ženu, žena, Ncfsa, 15838, 0.028556). The frequencies were estimated on the Serbian web corpus srWaC. The set of morphosyntactic tags used in the lexicon follows the MULTEXT-East V5 tagset for Bosnian (and Serbian), available here. Authors: Nikola Ljubešić, Filip Klubička. Availability: For local use,…

02/05/2016


Serbian movie review dataset: SerbMR


The Serbian Movie Review Dataset collection consists of three movie review datasets in Serbian which were constructed for the task of sentiment analysis: Collected movie reviews in Serbian (ISLRN 252-457-966-231-5), an imbalanced collection of 4725 movie reviews in Serbian; SerbMR-2C, the Serbian Movie Review Dataset (2 Classes) (ISLRN 016-049-192-514-1), a two-class balanced dataset that contains 1682 movie reviews (841 positive and 841 negative); SerbMR-3C, the Serbian Movie Review Dataset (3 Classes) (ISLRN 229-533-271-984-0), a three-class…

24/06/2016


Serbian paraphrase corpus: paraphrase.sr


The Serbian Paraphrase Corpus (paraphrase.sr) consists of 1194 pairs of sentences gathered from news sources on the web. Each sentence pair was manually annotated with a binary similarity score that indicates whether the sentences in the pair are semantically similar enough to be considered close paraphrases. The corpus contains 553 sentence pairs deemed to be semantically equivalent (46.31% of the total number), and 641 semantically different pairs (53.69% of the total number). Author: Vuk Batanović. Availability: The corpus and its…

10/12/2017
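
The class shares quoted for paraphrase.sr follow directly from the two pair counts. The short check below reproduces them; the variable names and the share helper are introduced here for illustration only.

```python
# Pair counts as stated in the paraphrase.sr description.
equivalent_pairs = 553
different_pairs = 641
total_pairs = equivalent_pairs + different_pairs


def share(count: int, total: int) -> float:
    """Return `count` as a percentage of `total`, rounded to two decimals."""
    return round(100 * count / total, 2)


equivalent_share = share(equivalent_pairs, total_pairs)
different_share = share(different_pairs, total_pairs)
print(total_pairs, equivalent_share, different_share)
```

The total matches the 1194 pairs stated in the description, and the two shares round to the quoted 46.31% and 53.69%.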


Serbian semantic textual similarity news corpus: STS.news.sr


The Serbian Semantic Textual Similarity News Corpus (STS.news.sr) consists of 1192 pairs of sentences in Serbian, or around 64 thousand tokens, gathered from news sources on the web and written in the Serbian Latin script. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0–5 scale. The final scores were obtained by averaging the individual scores of five annotators. The sentence pairs in this dataset were taken from the Serbian Paraphrase Corpus (paraphrase.sr). The annotation methodology…

10/12/2018


Serbian web corpus: srWaC


srWaC is a web corpus collected from the .rs top-level domain. Version 1.1 of the corpus contains 555 million tokens. The corpus is automatically annotated on the diacritic restoration, morphosyntax and lemma layers. The dependency syntax layer will be added in version 1.2. The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here. Authors: Nikola Ljubešić, Filip Klubička. Availability: For local use, a full-text version of srWaC can…

02/05/2016


Stemmers for Serbian and Croatian: SCStemmers


This package is a Java reimplementation of four previously published stemming algorithms for Serbian and Croatian: the greedy and the optimal subsumption-based stemmer for Serbian, by Vlado Kešelj and Danko Šipka; a refinement of the greedy subsumption-based stemmer, by Nikola Milošević; a “Simple stemmer for Croatian v0.1”, by Nikola Ljubešić and Ivan Pandžić. All the stemmers expect the input text to be formatted in UTF-8. Their outputs are also UTF-8 encoded. Author: Vuk Batanović. Availability: The package and a more…

24/06/2016


Universal Dependency (UD) treebank


The Serbian Universal Dependency Treebank currently consists of the SETimes.SR corpus manually annotated with syntactic dependencies following the guidelines for Universal Dependencies v2. Authors: Tanja Samardžić, Mirjana Starović, Željko Agić, Nikola Ljubešić. Availability: For local use, a full-text version of the treebank can be downloaded from the UD GitHub page. Publication: The development of the treebank is described in detail in the following paper: Tanja Samardžić, Mirjana Starović, Željko Agić, Nikola Ljubešić (2017). Universal Dependencies for Serbian in Comparison with Croatian…

10/08/2017

