CL | May 7, 2020

The Danish Gigaword Project

Leon Strømberg-Derczynski, Manuel R. Ciosici, Rebekah Baglini, Morten H. Christiansen, Jacob Aarup Dalsgaard, Riccardo Fusaroli, Peter Juel Henrichsen, Rasmus Hvingelby, Andreas Kirkedal, Alex Speed Kjeldsen, Claus Ladefoged, Finn Årup Nielsen

arXiv:2005.03521v33.122 citations

Originality: Synthesis-oriented

AI Analysis

The corpus addresses the data bottleneck facing NLP researchers and practitioners who work on Danish language technology.

The paper addresses the lack of large-scale corpora that has hindered Danish language technology. It does so by creating the Danish Gigaword Corpus, a freely available one-billion-word corpus covering diverse time periods, domains, speakers' socio-economic statuses, and dialects.

Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely-available one billion word corpus of Danish text. The Danish Gigaword corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialects.
