CV · Feb 10

Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions

arXiv:2602.09483v1 · 2 citations · h-index: 4 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses deployment challenges for MLLMs by offering a more efficient distillation method, though it is incremental as it builds on existing knowledge distillation techniques.

The paper tackles the problem of compressing multimodal large language models (MLLMs) for deployment by proposing Align-TI, a knowledge distillation framework built around token interactions. It reports a 2.6% relative improvement over vanilla KD, and the distilled Align-TI-2B outperforms the much larger LLaVA-1.5-7B by 7.0%.

Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions that embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components. IVA enables the student model to imitate the teacher's ability to extract instruction-relevant visual information by aligning on salient visual regions. TPA captures the teacher's dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI's superiority. Notably, our approach achieves a $2.6\%$ relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by $7.0\%$, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at https://github.com/lchen1019/Align-TI.
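For readers unfamiliar with the baseline that the abstract contrasts against, the following is a minimal sketch of "next-token alignment" as it is commonly formulated in knowledge distillation: a per-position KL divergence between the teacher's and the student's next-token distributions. It is not Align-TI's IVA or TPA component, and the exact loss behind the paper's "Vanilla KD" baseline is an assumption here. The names `softmax` and `next_token_kd_loss`, the `temperature` parameter, and the toy logits are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn one vector of logits into probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Mean over positions of KL(teacher || student).

    Each argument is a list of per-position logit vectors over the same
    vocabulary. Every position is scored independently, which is why this
    kind of objective says nothing about how tokens interact.
    """
    if not teacher_logits or len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same, non-empty positions")
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row, temperature)  # teacher distribution
        q = softmax(s_row, temperature)  # student distribution
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return total / len(teacher_logits)


# Toy data (assumed): 2 response positions, vocabulary of 3 tokens.
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student = [[1.0, 1.0, -0.5], [0.0, 1.0, 0.8]]

loss = next_token_kd_loss(teacher, student)
identical = next_token_kd_loss(teacher, teacher)
print(f"KD loss (student vs teacher): {loss:.4f}")
print(f"KD loss (teacher vs itself):  {identical:.4f}")
```

The loss is zero only when the student matches the teacher's distribution at every position. Because each position contributes separately, an objective of this shape does not directly constrain vision-instruction or token-to-token interactions, which is the gap the abstract says Align-TI targets.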

