CV Oct 27, 2025

MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding

Xin Jin, Siyuan Li, Siyong Jian, Kai Yu, Huan Wang

arXiv:2510.23479v1 9 citations h-index: 4

Originality: Incremental advance

AI Analysis

The paper addresses preference alignment in multi-modal models and offers a scalable approach, though it appears incremental because it builds on existing mixup and preference-driven methods.

The paper tackles the trade-off between scalability, robustness, and alignment quality in vision-language alignment for multi-modal large language models by proposing MergeMix, a training-time augmentation paradigm that bridges supervised fine-tuning and reinforcement learning. It achieves competitive accuracy with improved efficiency in classification tasks.

Vision-language alignment in multi-modal large language models (MLLMs) typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). SFT is stable and efficient but requires large-scale human annotations and cannot capture subtle preferences, while RL introduces a reward signal for training but suffers from overhead and instability. These limitations highlight a trade-off between scalability, robustness, and alignment quality. To address this, we propose MergeMix, a training-time augmentation paradigm that bridges SFT and RL. It first applies attention-aware image mixing via token merging, with richer cluster representation and spatial context, and then introduces a preference-driven training paradigm for MLLMs that builds preference pairs from mixed images and raw images and optimizes with the SimPO loss. As a mixup augmentation, MergeMix enhances attention consistency and efficiency, surpassing other heuristic-based methods in classification. Extensive experiments demonstrate that MergeMix achieves competitive accuracy with improved efficiency, providing a scalable approach to preference alignment in classification and MLLMs.
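The abstract names the SimPO loss but does not spell it out. As a rough illustration, the sketch below computes the standard SimPO objective for a single preference pair, using length-normalized log-probabilities and a target margin. It is not the authors' implementation: the function names, the default `beta` and `gamma` values, and the choice to treat the response conditioned on the raw image as "chosen" and the one conditioned on the mixed image as "rejected" are all assumptions made for this example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=0.5):
    """SimPO loss for one preference pair (illustrative sketch).

    chosen_token_logps / rejected_token_logps: per-token log-probabilities
    of each response under the policy. beta and gamma are hypothetical
    defaults, not values taken from the paper.
    """
    # Length-normalized implicit reward: beta times the mean token log-prob.
    r_chosen = beta * sum(chosen_token_logps) / len(chosen_token_logps)
    r_rejected = beta * sum(rejected_token_logps) / len(rejected_token_logps)
    # The chosen reward should exceed the rejected reward by at least gamma.
    margin = r_chosen - r_rejected - gamma
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# Assumed pairing: the response given the raw image is "chosen",
# the response given the mixed image is "rejected".
raw_image_logps = [-0.2, -0.1, -0.3]
mixed_image_logps = [-1.5, -2.0, -1.0, -2.5]
loss = simpo_loss(raw_image_logps, mixed_image_logps)
```

Because the rewards are averaged over tokens, a longer response is not favored simply for its length, and no separate reference model is needed, which is part of what makes this kind of objective cheaper than a full RL loop.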
