Serbian short-text sentiment analysis dataset: SentiComments.SR

The SentiComments.SR dataset includes the following three corpora:

  • The main SentiComments.SR corpus, consisting of 3490 movie-related comments
  • The movie verification corpus, consisting of 464 movie-related comments
  • The book verification corpus, consisting of 173 book-related comments

The main SentiComments.SR corpus was constructed from comments written in Serbian by visitors to a movie review website. The movie verification corpus comments were sourced from two other Serbian movie review websites. The book verification corpus comments were also sourced from a website. Comments whose token count exceeded a predefined upper bound (using basic whitespace tokenization) were discarded, as were comments not written in Serbian.
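The length filter described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and MAX_TOKENS is a placeholder, since the actual upper bound is not given here. The language filter (discarding non-Serbian comments) is not shown.

```python
def whitespace_token_count(comment: str) -> int:
    """Count tokens by splitting on runs of whitespace."""
    return len(comment.split())


def filter_by_length(comments, max_tokens):
    """Keep only comments whose token count does not exceed max_tokens."""
    return [c for c in comments if whitespace_token_count(c) <= max_tokens]


# Placeholder value: the real upper bound is not stated in this description.
MAX_TOKENS = 5

sample = [
    "Odličan film!",
    "Ovo je jedan veoma dug komentar o filmu koji sam gledao",
]
kept = filter_by_length(sample, MAX_TOKENS)
```

Splitting with `str.split()` and no arguments treats any run of spaces, tabs, or newlines as a single separator, which matches a basic whitespace tokenization.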

Six sentiment labels were used in dataset annotation: +1, -1, +M, -M, +NS, and -NS, with the addition of an ‘s’ label suffix denoting the presence of sarcasm. The annotation principles used to assign sentiment labels to items in SentiComments.SR are described in the papers listed in the Publications section. The main SentiComments.SR corpus was annotated by two annotators working together, and therefore contains a single, unified sentiment label for each comment. The verification corpora were used to evaluate the quality, efficiency, and cost-effectiveness of the annotation framework, which is why they contain separate sentiment labels for six annotators.
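As a rough illustration of the label scheme, the sketch below splits a label string into its base label and a sarcasm flag. It assumes the sarcasm marker is a lowercase 's' appended directly to one of the six base labels (for example "+NSs"); the exact encoding used in the corpus files should be checked against the corpus documentation. The names ParsedLabel and parse_label are hypothetical.

```python
from typing import NamedTuple

# The six base sentiment labels listed in the description.
BASE_LABELS = {"+1", "-1", "+M", "-M", "+NS", "-NS"}


class ParsedLabel(NamedTuple):
    base: str        # one of BASE_LABELS
    sarcastic: bool  # True if the 's' suffix was present


def parse_label(raw: str) -> ParsedLabel:
    """Split a label into its base label and sarcasm flag.

    Assumption: sarcasm is marked by a trailing lowercase 's'.
    The uppercase 'S' in '+NS' and '-NS' is part of the base label.
    """
    label = raw.strip()
    sarcastic = label.endswith("s")
    base = label[:-1] if sarcastic else label
    if base not in BASE_LABELS:
        raise ValueError(f"Unknown sentiment label: {raw!r}")
    return ParsedLabel(base, sarcastic)
```

For example, `parse_label("+NSs")` yields the base label "+NS" with the sarcasm flag set, while `parse_label("-1")` yields "-1" without it.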

Author: Vuk Batanović
The corpus and its documentation can be found on the SentiComments.SR GitHub repository.

Publications

Vuk Batanović, Miloš Cvetanović, Boško Nikolić (2020). A versatile framework for resource-limited sentiment articulation, annotation and analysis of short texts. PLoS ONE 15(11): e0242050.
Vuk Batanović (2020). A methodology for solving semantic tasks in the processing of short texts written in natural languages with limited resources. PhD thesis, University of Belgrade, School of Electrical Engineering. (Contains the full annotation guidelines in Serbian.)

Licence and citation

The resource on this page is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. By downloading the resource, you agree to the terms of use defined by this license.


When using the resource, cite the papers listed with it as well as the ReLDI repository page.