How Should We Model the Probability of a Language?
This paper addresses the problem of poor language identification for tail languages. Its contribution is an incremental shift in perspective rather than a technical breakthrough.
The paper argues that the limited coverage of language identification (LID) systems for most of the world's languages stems from framing LID as decontextualized text classification. It proposes rethinking LID as a routing problem and developing principled ways to incorporate environmental cues to improve coverage.
Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
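To make the role of the prior concrete, the following is a minimal sketch of Bayes' rule applied to LID, not the paper's method. It assumes a text model that returns a log-likelihood per language; the function name `posterior`, the three language codes, and every number are hypothetical and chosen only for illustration. The same text evidence is scored once under a fixed global prior and once under a prior informed by local context.

```python
import math


def posterior(log_likelihoods, prior):
    """Combine per-language log-likelihoods with a prior using Bayes' rule.

    Languages with zero prior mass are excluded, which mirrors how a
    fixed global prior can make a tail language unreachable.
    """
    scores = {
        lang: ll + math.log(prior[lang])
        for lang, ll in log_likelihoods.items()
        if prior.get(lang, 0.0) > 0.0
    }
    if not scores:
        raise ValueError("prior assigns no mass to any candidate language")
    # Normalize in log space (log-sum-exp) for numerical stability.
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {lang: math.exp(s - log_z) for lang, s in scores.items()}


# Hypothetical text evidence: the text model slightly prefers Faroese ("fao").
log_likelihoods = {"eng": -12.0, "nob": -10.5, "fao": -10.0}

# Hypothetical fixed global prior dominated by a high-resource language.
global_prior = {"eng": 0.90, "nob": 0.099, "fao": 0.001}

# Hypothetical local prior, e.g. for text known to originate in the Faroe Islands.
local_prior = {"eng": 0.20, "nob": 0.20, "fao": 0.60}

global_post = posterior(log_likelihoods, global_prior)
local_post = posterior(log_likelihoods, local_prior)

print(max(global_post, key=global_post.get))
print(max(local_post, key=local_post.get))
```

With these illustrative numbers, the global prior overrides the text evidence and selects English, while the local prior lets the tail language win. How such local priors should be estimated from environmental cues is the open question the paper raises; the sketch takes them as given.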