SOMBRERO: Measuring and Steering Boundary Placement in End-to-End Hierarchical Sequence Models

Pit Neitemeier, Alessio Serra, Jiaze Li, Sascha Wirges, Lukas Balles, Jan Hendrik Metzen

arXiv:2601.22805v1

Originality: Incremental advance

AI Analysis

This work addresses how compute is allocated in byte-level sequence modeling of text and code. It is an incremental improvement over existing end-to-end hierarchical methods.

The paper tackles the problem of measuring and steering boundary placement in hierarchical sequence models so that compute aligns with predictive difficulty. At the 1B scale, the proposed method improves the accuracy-efficiency trade-off across diverse UTF-8 corpora.

Hierarchical sequence models replace fixed tokenization with learned segmentations that compress long byte sequences for efficient autoregressive modeling. While recent end-to-end methods can learn meaningful boundaries from the language-modeling objective alone, it remains difficult to quantitatively assess and systematically steer where compute is spent. We introduce a router-agnostic metric of boundary quality, boundary enrichment B, which measures how strongly chunk starts concentrate on positions with high next-byte surprisal. Guided by this metric, we propose Sombrero, which steers boundary placement toward predictive difficulty via a confidence-alignment boundary loss and stabilizes boundary learning by applying confidence-weighted smoothing at the input level rather than on realized chunks. At the 1B scale, across UTF-8 corpora covering English and German text as well as code and mathematical content, Sombrero improves the accuracy-efficiency trade-off and yields boundaries that more consistently align compute with hard-to-predict positions.
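To make the idea behind boundary enrichment concrete, here is a minimal Python sketch. The abstract does not give the formula for B, so the ratio used here (mean next-byte surprisal at chunk starts divided by mean surprisal over all positions) is an assumption chosen for illustration only. The function name `boundary_enrichment` and the toy inputs are likewise hypothetical and not taken from the paper.

```python
from statistics import mean


def boundary_enrichment(surprisal, is_boundary):
    """Illustrative enrichment ratio (ASSUMPTION: not the paper's definition of B).

    surprisal:   per-position next-byte surprisal, e.g. -log2 p(byte | prefix)
    is_boundary: per-position flags, True where a chunk starts

    Returns mean surprisal at chunk starts divided by mean surprisal overall.
    Values above 1 mean chunk starts sit on harder-than-average positions.
    """
    if len(surprisal) != len(is_boundary):
        raise ValueError("surprisal and is_boundary must have the same length")
    at_boundaries = [s for s, b in zip(surprisal, is_boundary) if b]
    if not at_boundaries:
        raise ValueError("at least one boundary is required")
    overall = mean(surprisal)
    if overall == 0:
        raise ValueError("mean surprisal is zero; the ratio is undefined")
    return mean(at_boundaries) / overall


# Toy sequence: positions 0 and 3 are hard to predict, the rest are easy.
surprisal = [4.0, 0.5, 0.5, 3.0, 0.5, 0.5]

# Boundaries placed on the hard positions versus on easy ones.
aligned = [True, False, False, True, False, False]
misaligned = [False, True, False, False, True, False]

enrichment_aligned = boundary_enrichment(surprisal, aligned)
enrichment_misaligned = boundary_enrichment(surprisal, misaligned)
```

Under this assumed ratio, the aligned segmentation scores above 1 and the misaligned one scores below 1. Only the boundary flags and the surprisal values enter the computation, which is one way a metric of this kind can be independent of how the router produced the boundaries.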
