Banach density of generated languages: Dichotomies in topology and dimension

arXiv:2604.02385 · 64.5 · 2 citations · h-index: 2

Predicted impact: top 3% in DM · last 90 days · Originality: Incremental advance

AI Analysis

This work provides a theoretical foundation for evaluating generative models' coverage in continuous spaces, revealing topological and dimensional constraints not captured by asymptotic density.

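The paper studies language generation in the limit, using Banach density to measure breadth when strings are embedded in $d$ dimensions. In dimension one, the optimal lower Banach density of 1/2 is achievable whenever the underlying topological space of the language collection has finite Cantor-Bendixson rank, whereas for some collections of infinite rank no algorithm achieves any positive lower Banach density. In dimension $d \geq 2$, a Ramsey-theoretic obstacle arises, and overcoming it requires a nondegeneracy condition on the embedding of the true language.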

The formalism of language generation in the limit studies generative models by requiring an algorithm, given strings from a hidden true language, to eventually generate new valid strings. A core issue is the tension between validity and breadth. Prior work quantified breadth via asymptotic density, where the priority is generating strings early in a natural countable ordering. Here, we study density when the strings are embedded in $d$ dimensions, a ubiquitous structure in current generative models. Our goal is for the generated strings to be dense throughout the embedding. This requires a different measure, the Banach density, which captures whether a set contains large sparse regions. Using Banach density uncovers a rich structure based on dimension and the topology of the language collection. We prove that in dimension one, when the underlying topological space has finite Cantor-Bendixson rank, an algorithm can always generate a subset of the true language with an optimal lower Banach density of 1/2. However, for collections with infinite Cantor-Bendixson rank, there are cases where no algorithm can achieve any positive lower Banach density; the generated set must contain arbitrarily large, sparse regions. This reveals a topological contrast unseen with asymptotic density, where 1/2 is always achievable. We also extend our results to a family of measures interpolating between Banach and asymptotic density. Finally, in dimension $d \geq 2$, our positive result for Banach density encounters a Ramsey-theoretic obstacle regarding two-colored point sets. Overcoming this requires a nondegeneracy condition: the embedding of the true language must be sufficiently represented throughout the full $d$-dimensional space.
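For intuition, the gap between the two density notions can be illustrated on subsets of the natural numbers. The sketch below assumes the standard textbook definitions, which may differ in detail from the paper's (the paper measures density relative to the true language and its embedding): the lower asymptotic density of $A$ is $\liminf_{n} |A \cap [0,n)|/n$, while the lower Banach density is $\lim_{n} \inf_{m} |A \cap [m,m+n)|/n$. The code computes finite proxies only, over a horizon $N$ and a fixed window length $n$; the function names and the example set `gappy` are hypothetical and chosen for illustration.

```python
def prefix_density(member, N):
    """Finite proxy for asymptotic density: |A ∩ [0, N)| / N."""
    return sum(1 for x in range(N) if member(x)) / N


def min_window_density(member, N, n):
    """Finite proxy for lower Banach density:
    the minimum of |A ∩ [m, m+n)| / n over all windows inside [0, N)."""
    if not 0 < n <= N:
        raise ValueError("window length n must satisfy 0 < n <= N")
    flags = [1 if member(x) else 0 for x in range(N)]
    count = sum(flags[:n])          # window [0, n)
    best = count
    for m in range(1, N - n + 1):   # slide the window to [m, m+n)
        count += flags[m + n - 1] - flags[m - 1]
        best = min(best, count)
    return best / n


def gappy(x):
    """Hypothetical example set: all naturals except the blocks
    [2^k, 2^k + k) for k >= 1. The gaps grow without bound but stay rare."""
    if x < 2:
        return True
    k = x.bit_length() - 1          # 2^k <= x < 2^(k+1)
    return x - (1 << k) >= k


def evens(x):
    return x % 2 == 0


if __name__ == "__main__":
    N, n = 4096, 10
    for name, member in [("evens", evens), ("gappy", gappy)]:
        print(name, prefix_density(member, N), min_window_density(member, N, n))
```

In this toy setting the even numbers have density 1/2 under both proxies, whereas `gappy` occupies almost all of every long prefix yet contains a completely empty window of length 10. This is the kind of arbitrarily large sparse region that Banach density detects and asymptotic density does not.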
