CL · Mar 9, 2025

KréyoLID From Language Identification Towards Language Mining

arXiv:2503.06547v1 · h-index: 2
Originality: Synthesis-oriented
AI Analysis

This paper addresses the challenge of efficiently building corpora for low-resource languages, though the contribution is incremental, as it adapts existing ideas to a specific domain.

The paper tackles the problem of creating digital corpora for less commonly written languages by reframing language identification as a data mining problem. The goal is to minimize the resources spent on uninteresting documents, which yields faster corpus creation with better coverage for French-based Creoles.

Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora much faster and with better coverage than using established pipelines. To demonstrate the effectiveness of the language mining perspective, we introduce a new pipeline and corpora for several French-based Creoles.
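The mining perspective described in the abstract can be illustrated with a small two-stage cascade: a very cheap test discards most documents, and a costlier classifier runs only on the survivors. This is a minimal sketch under stated assumptions, not the authors' pipeline. The marker-word list, the `cheap_filter` and `mine` functions, and the stand-in classifier are all hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical marker words chosen for illustration only; a real list
# would be derived from linguistic resources for the target variety.
MARKER_WORDS = frozenset({"mwen", "nou", "yo", "pou", "ak", "nan"})


def cheap_filter(text: str, min_hits: int = 2) -> bool:
    """Return True if the text contains at least `min_hits` distinct marker words."""
    tokens = set(text.lower().split())
    return len(tokens & MARKER_WORDS) >= min_hits


def mine(
    docs: Iterable[str],
    expensive_classifier: Callable[[str], bool],
) -> Tuple[List[str], int]:
    """Run the expensive classifier only on documents that pass the cheap filter.

    Returns the accepted documents and the number of expensive calls made.
    """
    kept: List[str] = []
    expensive_calls = 0
    for doc in docs:
        if not cheap_filter(doc):
            continue  # most documents are expected to stop here
        expensive_calls += 1
        if expensive_classifier(doc):
            kept.append(doc)
    return kept, expensive_calls


if __name__ == "__main__":
    sample = [
        "Le chat est sur la table",
        "Mwen renmen manje ak zanmi nou yo",
        "The quick brown fox",
        "pou ak nan hello",
    ]
    # Stand-in for a real (slow) language identifier; purely an assumption.
    accepted, calls = mine(sample, lambda d: "mwen" in d.lower())
    print(accepted, calls)
```

The design choice to note is that the cost of the expensive step scales with the number of candidates that survive the filter rather than with the size of the whole crawl, which is where the claimed speed-up would come from if uninteresting documents dominate.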

Code Implementations: 1 repo
