Croatian web corpus: hrWaC

hrWaC is a web corpus collected from the .hr top-level domain. Version 2.1 of the corpus contains 1.4 billion tokens. The corpus is automatically annotated with three layers: diacritic restoration, morphosyntax and lemmas. A dependency syntax layer will be added in version 2.2.

The set of morphosyntactic tags used in the corpus follows the revised MULTEXT-East V5 tagset for Croatian and Serbian, available here.
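MULTEXT-East morphosyntactic descriptions (MSDs) are positional strings whose first character encodes the part-of-speech category. As an illustration only, the sketch below maps that first character to a category name. The function name `msd_category` and the example tags are made up for this sketch, and the letter table reflects the usual MULTEXT-East convention rather than anything stated on this page, so it should be checked against the tagset documentation before use.

```python
# Assumed category letters (usual MULTEXT-East convention);
# verify against the tagset documentation before relying on them.
MSD_CATEGORIES = {
    "N": "Noun",
    "V": "Verb",
    "A": "Adjective",
    "P": "Pronoun",
    "R": "Adverb",
    "S": "Adposition",
    "C": "Conjunction",
    "M": "Numeral",
    "Q": "Particle",
    "I": "Interjection",
    "Y": "Abbreviation",
    "X": "Residual",
    "Z": "Punctuation",
}


def msd_category(tag: str) -> str:
    """Return the category name for an MSD tag, or 'Unknown'."""
    if not tag:
        raise ValueError("empty MSD tag")
    # Only the first position is interpreted here; the remaining
    # positions carry category-specific attributes and are ignored.
    return MSD_CATEGORIES.get(tag[0], "Unknown")


# Hypothetical example tags, for illustration only.
for example in ("Ncmsn", "Vmr3s", "Z"):
    print(example, msd_category(example))
```

Decoding the remaining positions (gender, number, case and so on) depends on the category and would need the full attribute tables from the tagset specification.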

Nikola Ljubešić, Filip Klubička
For local use, a full-text version of hrWaC can be downloaded here.
hrWaC can also be accessed and queried online, via the noSketchEngine web interface available here.
The compilation of the 1.0 version of the corpus is described in the following paper:
Nikola Ljubešić, Filip Klubička (2014). {bs,hr,sr}WaC - Web corpora of Bosnian, Croatian and Serbian. Proceedings of the 9th Web as Corpus Workshop (WaC-9). Gothenburg, Sweden.
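For local processing of the downloaded corpus, a reader along the following lines may be a useful starting point. This is a minimal sketch that assumes the download is in a tab-separated "vertical" format (one token per line, with structural tags such as `<s>` and `</s>` on their own lines) of the kind commonly used with noSketchEngine. The page does not state the file format or the column order, so the function name `read_vertical`, the `n_columns` parameter and the sample data are all assumptions.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_vertical(stream: TextIO, n_columns: int = 3) -> Iterator[List[Tuple[str, ...]]]:
    """Yield sentences as lists of column tuples from a vertical file.

    Assumes one token per line with tab-separated columns, and
    structural tags (e.g. <s>, </s>) on lines of their own.
    """
    sentence: List[Tuple[str, ...]] = []
    for raw in stream:
        line = raw.rstrip("\n")
        if not line:
            continue
        # A structural tag has no tab and is wrapped in angle brackets.
        if line.startswith("<") and line.endswith(">") and "\t" not in line:
            if line.startswith("</s") and sentence:
                yield sentence
                sentence = []
            continue
        fields = tuple(line.split("\t"))
        if len(fields) != n_columns:
            continue  # skip lines that do not match the assumed layout
        sentence.append(fields)
    if sentence:
        yield sentence


# Made-up sample with an assumed column order: token, tag, lemma.
sample = (
    "<s>\n"
    "Idem\tVmr1s\tići\n"
    "kući\tNcfsd\tkuća\n"
    "</s>\n"
    "<s>\n"
    "Da\tQr\tda\n"
    "</s>\n"
)
sentences = list(read_vertical(io.StringIO(sample)))
print(len(sentences), sentences[0])
```

Because the function yields one sentence at a time, it can stream a file of this size without loading it into memory. The column count and order should be adjusted once the actual layout of the download is known.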

Licence and citation

The resource on this page is available under the Creative Commons Attribution-ShareAlike 4.0 International License. By downloading the resource, you agree to the terms of use defined by this license.


When using the resource, you must cite the papers listed with it as well as the ReLDI repository page.