CV CL Jan 23, 2018

The WiLI benchmark dataset for written language identification

arXiv:1801.07779v1

Originality: Synthesis-oriented

AI Analysis

The paper provides a new dataset for written language identification research, addressing the need for a standardized benchmark in this domain.

The paper introduces the WiLI-2018 benchmark dataset for monolingual written language identification. It covers 235 languages with 1000 paragraphs each, totaling 235,000 paragraphs. The task is to assign each unknown paragraph to its language.

This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free-of-charge dataset of short text extracts from Wikipedia. It contains 1000 paragraphs for each of 235 languages, totaling 235000 paragraphs. WiLI is a classification dataset: given an unknown paragraph written in one dominant language, the task is to decide which language it is.
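To make the classification task concrete, the following is a minimal sketch of a character n-gram language identifier. It is not the method evaluated in the paper and does not load WiLI-2018. The three toy training sentences, the label codes, and the helper names (`char_ngrams`, `cosine`, `train`, `identify`) are assumptions made purely for illustration. The last lines also spell out the dataset size arithmetic.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one aggregated n-gram profile per language label."""
    profiles = {}
    for label, text in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def identify(profiles, paragraph):
    """Return the label whose profile is most similar to the paragraph."""
    query = char_ngrams(paragraph)
    return max(profiles, key=lambda label: cosine(query, profiles[label]))


# Toy training data (hypothetical, NOT taken from WiLI-2018).
TOY_SAMPLES = [
    ("eng", "The quick brown fox jumps over the lazy dog and runs through the forest."),
    ("deu", "Der schnelle braune Fuchs springt über den faulen Hund und läuft durch den Wald."),
    ("fra", "Le renard brun rapide saute par-dessus le chien paresseux et court dans la forêt."),
]

# Dataset size as described in the abstract: 1000 paragraphs per language.
NUM_LANGUAGES = 235
PARAGRAPHS_PER_LANGUAGE = 1000
TOTAL_PARAGRAPHS = NUM_LANGUAGES * PARAGRAPHS_PER_LANGUAGE

if __name__ == "__main__":
    profiles = train(TOY_SAMPLES)
    print(identify(profiles, "Der Hund läuft durch den Wald"))
    print(TOTAL_PARAGRAPHS)
```

A real experiment would replace `TOY_SAMPLES` with the dataset's training paragraphs and report accuracy on held-out paragraphs. The nearest-profile rule here is only one simple baseline among many possible classifiers.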
