CL, Mar 9, 2025

KréyoLID From Language Identification Towards Language Mining

Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot

arXiv:2503.06547v12.7 h-index: 2 Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of efficiently building corpora for low-resource languages, though it is incremental in that it adapts existing ideas to a specific domain.

The paper tackles the problem of creating digital corpora for less commonly written languages by reframing language identification as a data mining problem. The aim is to minimize the resources spent on uninteresting documents, which yields faster corpus creation with better coverage for French-based Creoles.

Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora much faster and with better coverage than using established pipelines. To demonstrate the effectiveness of the language mining perspective, we introduce a new pipeline and corpora for several French-based Creoles.
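The mining perspective described in the abstract can be illustrated with a two-stage filter: a very cheap test discards most documents, and a costlier classifier runs only on the few that survive. The sketch below is an illustration under stated assumptions, not the authors' pipeline. The marker words, the `min_hits` threshold, and the `cheap_filter` and `mine` helpers are all hypothetical, and the expensive classifier is passed in as a placeholder callable.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical marker words, chosen for illustration only (not from the paper).
MARKERS = frozenset({"mwen", "pou", "nan", "ak", "yo"})


def cheap_filter(text: str, min_hits: int = 2) -> bool:
    """Return True once the document contains at least `min_hits` marker words."""
    hits = 0
    for token in text.lower().split():
        # Strip common punctuation so "pwa." and "pwa" compare equal.
        if token.strip(".,;:!?\"'()") in MARKERS:
            hits += 1
            if hits >= min_hits:
                return True  # stop early: no need to read the rest
    return False


def mine(
    docs: Iterable[str],
    expensive_classifier: Callable[[str], bool],
) -> Iterator[str]:
    """Yield documents that pass the cheap filter and then the costly classifier."""
    for doc in docs:
        # `and` short-circuits, so the classifier never sees rejected documents.
        if cheap_filter(doc) and expensive_classifier(doc):
            yield doc
```

In this sketch the cost of the expensive step scales with the number of candidate documents rather than with the size of the whole collection, which is the intuition behind treating identification as mining.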
