LG | AI | Apr 13

Bottleneck Tokens for Unified Multimodal Retrieval

Siyu Sun, Jing Ren, Zhaohe Liao, Dongxiao Mao, Xiangyuan Ren, Yiyi Zhang, Haohua Zhao, Weixiong Lin, Jiang Shaohua, Liqing Zhang, Yuchao Zheng

arXiv:2604.11095 | 91.9 | h-index: 7

AI Analysis

This work provides a principled solution to the pooling and supervision bottlenecks in multimodal retrieval for decoder-only MLLMs, enabling better semantic compression and retrieval performance.

The paper introduces Bottleneck Tokens (BToks) and Generative Information Condensation to address two structural gaps in adapting decoder-only MLLMs for unified multimodal retrieval. On MMEB-V2, the method achieves state-of-the-art results among 2B-scale methods under comparable data conditions, with an Overall score of 59.0 (+3.6 over VLM2Vec-V2) and substantial gains on semantically demanding tasks (e.g., +12.6 on Video-QA).

Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token (e.g., <EOS>) as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed into it. We address both gaps with two complementary components. Architecturally, we introduce Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. For training, we propose Generative Information Condensation: a next-token prediction objective coupled with a Condensation Mask that severs the direct attention path from target tokens to query tokens. All predictive signals are thereby forced through the BToks, converting the generative loss into dense, token-level supervision for semantic compression. At inference time, only the input and BToks are processed in a single forward pass with negligible overhead over conventional last-token pooling. On MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks), our approach achieves state-of-the-art among 2B-scale methods under comparable data conditions, attaining an Overall score of 59.0 (+3.6 over VLM2Vec-V2) with substantial gains on semantically demanding tasks (e.g., +12.6 on Video-QA).
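
The Condensation Mask described in the abstract can be illustrated with a small attention-mask sketch. This is not the authors' implementation: the sequence layout `[query | BToks | target]`, the function name `condensation_mask`, and the boolean "may attend" convention are assumptions made for illustration. The sketch starts from a standard causal mask and then removes the direct path from target positions to query positions, so that target tokens can reach the query content only through the BToks.

```python
def condensation_mask(n_query: int, n_btok: int, n_target: int) -> list[list[bool]]:
    """Build mask[i][j] == True when position i may attend to position j.

    Assumed (hypothetical) layout: [query tokens | BToks | target tokens].
    """
    n = n_query + n_btok + n_target
    target_start = n_query + n_btok

    # Standard causal mask: each position sees itself and earlier positions.
    mask = [[j <= i for j in range(n)] for i in range(n)]

    # Condensation step: sever the direct attention path from every
    # target position to every query position. BToks still see the query,
    # and targets still see the BToks and earlier targets.
    for i in range(target_start, n):
        for j in range(n_query):
            mask[i][j] = False
    return mask


if __name__ == "__main__":
    # 3 query tokens, 2 BToks, 2 target tokens; print rows as 0/1 strings.
    for row in condensation_mask(3, 2, 2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this assumed layout, the target rows have zeros in the query columns, so the next-token prediction loss on the target tokens can only draw on information that the BToks have aggregated. At inference time the target segment is absent, which matches the abstract's statement that only the input and BToks are processed.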
