Agnostic Language Identification and Generation

Mikael Møller Høgsgaard, Chirag Pabbaraju

arXiv:2601.23258v13.82 citations | h-index: 9

Originality: Incremental advance

AI Analysis

This work removes the realizability assumption that earlier analyses of language identification and generation rely on. The advance is incremental, but it matters for robustness in AI applications.

The paper tackles language identification and generation without assuming input data comes from a known language distribution, obtaining novel characterizations and nearly tight statistical rates.

Recent works on language identification and generation have established tight statistical rates at which these tasks can be achieved. These works typically operate under a strong realizability assumption: that the input data is drawn from an unknown distribution necessarily supported on some language in a given collection. In this work, we relax this assumption of realizability entirely, and impose no restrictions on the distribution of the input data. We propose objectives to study both language identification and generation in this more general "agnostic" setup. Across both problems, we obtain novel interesting characterizations and nearly tight rates.
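
The contrast between the two setups in the abstract can be written compactly. This is a minimal sketch in notation chosen here for illustration: the symbols $\mathcal{X}$, $\mathcal{L}$, $L_i$, and $\mathcal{D}$ are assumptions and are not taken from the paper. It does not attempt to reproduce the paper's agnostic objectives, which the abstract does not state.

```latex
% Illustrative notation only; every symbol below is an assumption, not the paper's.
% \mathcal{X}: universe of strings
% \mathcal{L} = \{L_1, L_2, \dots\}: the given collection of languages, each L_i \subseteq \mathcal{X}
% \mathcal{D}: the unknown distribution from which the input data is drawn
\[
  \text{Realizable setting:}\quad
  \exists\, i \ \text{such that}\ \mathrm{supp}(\mathcal{D}) \subseteq L_i .
\]
\[
  \text{Agnostic setting:}\quad
  \mathcal{D} \ \text{is an arbitrary distribution over } \mathcal{X}
  \ \text{(no support condition is imposed)}.
\]
```

Under this reading, the realizable setting is the special case of the agnostic one in which the support of $\mathcal{D}$ happens to lie inside a single language of the collection.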
