The WiLI benchmark dataset for written language identification
The paper provides a new dataset for language identification research, addressing the need for standardized benchmarks in this domain.
The paper introduces the WiLI-2018 benchmark dataset for monolingual written language identification. It covers 235 languages with 1000 paragraphs each, for a total of 235,000 paragraphs, and the task is to classify an unknown paragraph by its language.
This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free-of-charge dataset of short text extracts from Wikipedia. It contains 1000 paragraphs for each of 235 languages, totaling 235,000 paragraphs. WiLI is a classification dataset: given an unknown paragraph written in one dominant language, the task is to decide which language it is.
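
To make the task concrete, the following is a minimal sketch of paragraph-level language identification using character trigram profiles and cosine similarity. It is an illustrative baseline, not a method described in the paper. The three toy sentences, the labels `eng`, `deu`, and `spa`, and all function names are assumptions made for this example; a real experiment would load the WiLI-2018 paragraphs and labels instead.

```python
import math
from collections import Counter

# Dataset size as stated in the abstract: 235 languages x 1000 paragraphs.
N_LANGUAGES = 235
PARAGRAPHS_PER_LANGUAGE = 1000
TOTAL_PARAGRAPHS = N_LANGUAGES * PARAGRAPHS_PER_LANGUAGE  # 235,000


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def train(samples):
    """Build one n-gram profile per language from (paragraph, label) pairs."""
    profiles = {}
    for paragraph, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(paragraph))
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(paragraph, profiles):
    """Return the label whose profile is most similar to the paragraph."""
    grams = char_ngrams(paragraph)
    return max(profiles, key=lambda label: cosine(grams, profiles[label]))


# Hypothetical toy training data standing in for WiLI paragraphs.
toy_samples = [
    ("the quick brown fox jumps over the lazy dog and then runs through the forest", "eng"),
    ("der schnelle braune fuchs springt über den faulen hund und läuft dann durch den wald", "deu"),
    ("el rápido zorro marrón salta sobre el perro perezoso y luego corre por el bosque", "spa"),
]
profiles = train(toy_samples)

if __name__ == "__main__":
    print(TOTAL_PARAGRAPHS)
    print(identify("der hund läuft durch den wald", profiles))
```

Character n-grams are used here because they need no tokenizer and work across scripts, which suits a benchmark spanning many languages. With 235 classes and 1000 paragraphs each, a profile-based approach like this would serve only as a simple point of comparison.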