CL · AI · Feb 10, 2025

Multi-label Scandinavian Language Identification (SLIDE)

Mariia Fedorova, Jonas Sebulon Frydenberg, Victoria Handford, Victoria Ovedie Chruickshank Langø, Solveig Helene Willoch, Marthe Løken Midtgaard, Yves Scherrer, Petter Mæhlum, David Samuel

arXiv:2502.06692v1 · 17.012 citations · h-index: 11 · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses language identification (LID) for closely related Scandinavian languages. The contribution is incremental: it builds on existing LID methods, adding a new dataset and a new training approach.

The paper tackles sentence-level identification of closely related Scandinavian languages. It presents the SLIDE evaluation dataset and a suite of models with varying speed-accuracy tradeoffs, and shows that accurate identification requires the ability to assign more than one language to a sentence.

Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed-accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.
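
To illustrate why multi-label output matters, here is a minimal sketch contrasting single-label (argmax) and multi-label (per-language threshold) decoding. It is not the authors' model or training method: the language codes, the logits, the 0.5 threshold, and the function names are all assumptions chosen for illustration.

```python
import math

# Assumed label set: Danish, Norwegian Bokmål, Norwegian Nynorsk, Swedish.
LANGS = ["da", "nb", "nn", "sv"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_single(logits: list[float]) -> set[str]:
    """Single-label decoding: keep only the highest-scoring language."""
    best = max(range(len(LANGS)), key=lambda i: logits[i])
    return {LANGS[best]}


def predict_multi(logits: list[float], threshold: float = 0.5) -> set[str]:
    """Multi-label decoding: keep every language whose independent
    sigmoid probability reaches the threshold (hypothetical value)."""
    labels = {lang for lang, z in zip(LANGS, logits) if sigmoid(z) >= threshold}
    # Fall back to the single best language if nothing clears the threshold.
    return labels or predict_single(logits)


def exact_match(pred: set[str], gold: set[str]) -> bool:
    """A prediction counts as correct only if the label sets are identical."""
    return pred == gold


# Made-up scores for a sentence that is valid in both Danish and Bokmål.
logits = [2.1, 1.7, -0.4, -3.0]
gold = {"da", "nb"}

print(sorted(predict_single(logits)), exact_match(predict_single(logits), gold))
print(sorted(predict_multi(logits)), exact_match(predict_multi(logits), gold))
```

Under these assumed scores, the single-label decoder returns only one of the two valid languages and fails the exact-match check, while the thresholded decoder returns both.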

