CV CL | Jan 23, 2018

The WiLI benchmark dataset for written language identification

arXiv:1801.07779v1 | 20 citations
AI Analysis

The paper provides a new dataset for language identification research, addressing the need for a standardized benchmark in this domain.

The paper introduces the WiLI-2018 benchmark dataset for monolingual written language identification. It contains 1000 paragraphs for each of 235 languages, 235,000 paragraphs in total, and the task is to classify an unknown paragraph by its language.

This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification. WiLI-2018 is a publicly available, free-of-charge dataset of short text extracts from Wikipedia. It contains 1000 paragraphs for each of 235 languages, totaling 235,000 paragraphs. WiLI is a classification dataset: given an unknown paragraph written in one dominant language, the task is to decide which language it is.
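To make the classification task concrete, here is a minimal sketch of a paragraph-in, language-label-out classifier. It is not the paper's method: the character-bigram profiles, the cosine-similarity rule, the function names, the three toy sentences, and their labels are all assumptions made for illustration, and the real dataset would be loaded in place of `TOY_SAMPLES`.

```python
from collections import Counter
import math


def char_ngrams(text, n=2):
    """Count overlapping character n-grams in a lowercased paragraph."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one n-gram profile per language from (paragraph, label) pairs."""
    profiles = {}
    for paragraph, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(paragraph))
    return profiles


def identify(paragraph, profiles):
    """Return the label whose profile is most similar to the paragraph."""
    grams = char_ngrams(paragraph)
    return max(profiles, key=lambda label: cosine(grams, profiles[label]))


# Hypothetical toy data standing in for the dataset's (paragraph, label) pairs.
TOY_SAMPLES = [
    ("The quick brown fox jumps over the lazy dog.", "eng"),
    ("Der schnelle braune Fuchs springt über den faulen Hund.", "deu"),
    ("Le renard brun rapide saute par-dessus le chien paresseux.", "fra"),
]
PROFILES = train(TOY_SAMPLES)
```

Calling `identify(paragraph, PROFILES)` returns one of the labels seen during training. A benchmark evaluation would run this over a held-out set of paragraphs and report accuracy.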

