Teaching Metric Distance to Discrete Autoregressive Language Models
This work addresses the challenge of adapting language models to non-linguistic domains where tokens carry metric meaning. The method is incremental but effective under resource constraints.
The paper addresses how to train autoregressive discrete language models on tokens that carry metric relationships, as in mathematics and multimodal tasks. It introduces DIST2Loss, a distance-aware training objective, and reports consistent performance gains in applications such as visual grounding and robotic manipulation, especially in low-data regimes.
As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are most notable in low-data regimes, demonstrating DIST2Loss's strength under resource constraints.
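The core idea of turning a distance metric into a categorical training target can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names (`distance_target`, `soft_cross_entropy`), the temperature `tau`, the choice of absolute or squared distance (giving Laplace-like or Gaussian-like targets), the toy vocabulary of ten numeric tokens, and the use of soft-label cross-entropy as the objective.

```python
import math

def distance_target(values, target, tau=1.0, squared=False):
    """Discretize exp(-d / tau) over candidate token values into a categorical distribution.

    Assumption: d is absolute distance (Laplace-like) or squared distance (Gaussian-like).
    """
    dists = [(v - target) ** 2 if squared else abs(v - target) for v in values]
    weights = [math.exp(-d / tau) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy between a soft target distribution and the model's softmax."""
    log_q = log_softmax(logits)
    return -sum(p * lq for p, lq in zip(target_probs, log_q))

# Hypothetical toy vocabulary: ten tokens standing for the numbers 0..9.
values = list(range(10))
target_probs = distance_target(values, target=3, tau=1.0)

# Two wrong predictions: one close to the target (4), one far from it (9).
near_logits = [5.0 if v == 4 else 0.0 for v in values]
far_logits = [5.0 if v == 9 else 0.0 for v in values]

near_loss = soft_cross_entropy(near_logits, target_probs)
far_loss = soft_cross_entropy(far_logits, target_probs)
print(f"near miss loss: {near_loss:.3f}, far miss loss: {far_loss:.3f}")
```

With a one-hot target, both wrong predictions would receive the same loss. With the distance-derived target, the near miss is penalized less than the far miss, which is the kind of metric structure the abstract describes the model learning to preserve.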