CV · LG · May 26

Structure over Pixels: Learning Variable-Length Visual Programs

arXiv:2605.27696 · 29.6 · h-index: 2
AI Analysis

For computer vision researchers, STROP provides a tokenizer that learns adaptive sequence lengths and structural representations without pixel-level supervision, addressing a key limitation of existing discrete tokenizers.

STROP learns variable-length discrete visual programs that adapt to scene complexity, using a curriculum supervised by DINOv3 features to bypass pixel reconstruction. The method yields compositional structure and improves downstream dense-prediction transfer.

Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene. They also typically train against pixel reconstruction, which emphasizes texture rather than structure. We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate-distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass. By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.
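The abstract does not spell out how an active prefix length is chosen or applied, so the following is only an illustrative sketch of the general idea, not STROP's implementation. It assumes a probe that picks the shortest prefix of a code sequence whose distortion falls within a tolerance, and a mask that marks that prefix as active. The names `shortest_adequate_prefix`, `active_prefix_mask`, and `toy_distortion` are hypothetical, and the toy distortion stands in for a comparison against frozen target features. In the paper a length head predicts the prefix length in a single forward pass; the explicit search here only shows the kind of quantity such a head might be trained to estimate.

```python
from typing import Callable, List, Sequence


def shortest_adequate_prefix(
    codes: Sequence[int],
    distortion: Callable[[Sequence[int]], float],
    tolerance: float,
) -> int:
    """Return the smallest prefix length whose distortion is within tolerance.

    Falls back to the full sequence length if no prefix qualifies.
    """
    for length in range(1, len(codes) + 1):
        if distortion(codes[:length]) <= tolerance:
            return length
    return len(codes)


def active_prefix_mask(max_len: int, active_len: int) -> List[bool]:
    """True for positions inside the active prefix, False for the unused tail."""
    active_len = max(0, min(active_len, max_len))  # clamp to a valid range
    return [i < active_len for i in range(max_len)]


def toy_distortion(prefix: Sequence[int]) -> float:
    """Hypothetical distortion: the error halves with every extra code."""
    return 1.0 / (2 ** len(prefix))


codes = [7, 3, 9, 1, 4, 4, 2, 8]
length = shortest_adequate_prefix(codes, toy_distortion, tolerance=0.1)
mask = active_prefix_mask(len(codes), length)
print(length, mask)
```

Under this toy distortion, a simple image (low tolerance reached early) gets a short program and a complex one gets a longer program, which mirrors the abstract's claim that program length grows with scene complexity.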

