CV, CL | May 27, 2025

ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models

arXiv:2505.21465v1 | 1 citation | h-index: 3 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses a specific bottleneck in VLMs on tasks that require high-resolution image understanding, but the advance is incremental because it builds on existing methods such as LLaVA-Next.

The paper targets the weak token interactions that arise when Vision-Language Models encode high-resolution images, a consequence of the long-term decay of Rotary Position Embedding (RoPE). It proposes ID-Align, which improves relation reasoning on MMBench by 6.09% and shows gains across multiple benchmarks.

Currently, a prevalent approach for enhancing the performance of Vision-Language Models (VLMs) is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), whose long-term decay property weakens attention between distant positions, this token count hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit IDs from their corresponding thumbnail tokens, and the overexpansion of positional indices is constrained. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench's relation reasoning tasks and notable gains across multiple benchmarks. Our code is available at the following link: https://github.com/zooblastlbz/ID-Align.
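The ID inheritance described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `id_align_position_ids`, the token order (text, thumbnail, high-resolution, text), the row-major grid layout, and the floor-based mapping from high-resolution cells to thumbnail cells are all assumptions made for illustration, and details such as image newline tokens are ignored.

```python
def id_align_position_ids(n_text_before, thumb_hw, hires_hw, n_text_after):
    """Hypothetical sketch: position IDs for [text, thumbnail, high-res, text].

    Assumes both image grids are row-major and that each high-res token
    inherits the ID of the thumbnail cell covering the same image region.
    """
    th, tw = thumb_hw   # thumbnail grid (rows, cols)
    hh, hw = hires_hw   # high-resolution grid (rows, cols)

    # Leading text tokens get ordinary sequential IDs.
    ids = list(range(n_text_before))

    # Thumbnail tokens also get sequential IDs, continuing after the text.
    start = n_text_before
    thumb_ids = [[start + r * tw + c for c in range(tw)] for r in range(th)]
    ids += [pid for row in thumb_ids for pid in row]

    # High-res tokens reuse thumbnail IDs instead of extending the index range.
    for r in range(hh):
        for c in range(hw):
            ids.append(thumb_ids[r * th // hh][c * tw // hw])

    # Trailing text resumes right after the last thumbnail ID.
    next_id = start + th * tw
    ids += list(range(next_id, next_id + n_text_after))
    return ids


ids = id_align_position_ids(2, (2, 2), (4, 4), 3)
print(len(ids), max(ids))  # 25 tokens, largest position ID is 8
```

In this toy case, sequential numbering of the same 25 tokens would end at ID 24, while the remapped sequence ends at 8, so the trailing text sits much closer (in position index) to every image token.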
