Claus Ladefoged

1 Paper

CL, May 7, 2020
The Danish Gigaword Project

Leon Strømberg-Derczynski, Manuel R. Ciosici, Rebekah Baglini et al.

Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one-billion-word corpus of Danish text. The Danish Gigaword Corpus covers a wide array of time periods, domains, speakers' socio-economic status, and Danish dialects.