CV, AI, LG | Dec 25, 2025

Towards Long-window Anchoring in Vision-Language Model Distillation

arXiv:2512.21576v1 | 1 citation | h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses the challenge of building more efficient long-context vision-language models. It offers practical techniques and theoretical insights, but the advance is incremental because it builds on existing distillation and RoPE methods.

The paper tackles the limited context window of small vision-language models by proposing LAid, a knowledge distillation method that transfers long-range attention mechanisms from large models. The distilled models achieve up to 3.2 times longer effective context windows while maintaining or improving performance on standard benchmarks.

While large vision-language models (VLMs) demonstrate strong long-context understanding, their widely used smaller variants struggle with language-image alignment because of a limited window size. We find that knowledge distillation, as a complement to Rotary Position Embeddings (RoPE), improves the student's capability at window sizes anchored from large models. Building on this insight, we propose LAid, which directly targets the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2 times longer effective context windows than baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.
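
To make component (1) of the abstract concrete, here is a minimal NumPy sketch of a distance-weighted attention matching loss whose emphasis on long-range pairs grows during training. The abstract does not give the actual weighting function or schedule, so everything here is an assumption for illustration only: the names `distance_weights`, `attention_matching_loss`, `progress`, and `max_power`, the power-law form of the weights, and the squared-error matching term.

```python
import numpy as np


def distance_weights(seq_len, progress, max_power=2.0):
    """Hypothetical weights that favor distant token pairs as training advances.

    progress is the fraction of training completed, in [0, 1].
    At progress 0 every pair has weight 1; as progress grows, pairs with a
    larger |i - j| receive a larger weight. The power-law form is an assumption.
    """
    idx = np.arange(seq_len)
    # Normalized position difference in [0, 1].
    dist = np.abs(idx[:, None] - idx[None, :]) / max(seq_len - 1, 1)
    weights = (1.0 + dist) ** (progress * max_power)
    # Rescale so the mean weight is 1 and the loss scale stays comparable.
    return weights / weights.mean()


def attention_matching_loss(student_attn, teacher_attn, progress):
    """Weighted mean squared error between two (seq_len, seq_len) attention maps."""
    w = distance_weights(student_attn.shape[-1], progress)
    return float(np.mean(w * (student_attn - teacher_attn) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random row-normalized maps stand in for real teacher and student attention.
    teacher = rng.random((8, 8))
    teacher /= teacher.sum(axis=-1, keepdims=True)
    student = rng.random((8, 8))
    student /= student.sum(axis=-1, keepdims=True)
    for progress in (0.0, 0.5, 1.0):
        print(progress, attention_matching_loss(student, teacher, progress))
```

Under this sketch, a mismatch between student and teacher at a distant position pair costs more late in training than the same mismatch at a nearby pair, which is one simple way to read "progressive" and "distance-weighted" in the abstract.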

