AI | Oct 26, 2025

Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes

Guanyu Yao, Qiucheng Wu, Yang Zhang, Zhaowen Wang, Handong Zhao, Shiyu Chang

arXiv:2510.22836v1 | 2 citations | h-index: 14 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses a key limitation of MLLMs that matters to both researchers and practitioners, but its contribution is incremental: it builds on existing training methods rather than introducing a new paradigm.

The paper tackles an imbalance in multimodal large language models (MLLMs): they over-rely on textual cues and under-attend to visual content, which leads to suboptimal performance on vision-centric tasks. It analyzes this modality gap through the lens of training recipes and explores strategies to bridge it via data and loss design.

Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks. However, recent findings reveal an imbalance in their reasoning capabilities across visual and textual modalities. Specifically, current MLLMs often over-rely on textual cues while under-attending to visual content, resulting in suboptimal performance on tasks that require genuine visual reasoning. We refer to this phenomenon as the modality gap, defined as the performance disparity between text-centric and vision-centric inputs. In this paper, we analyze the modality gap through the lens of training recipes. We first show that existing training recipes tend to amplify this gap. Then, we systematically explore strategies to bridge it from two complementary perspectives: data and loss design. Our findings provide insights into developing training recipes that mitigate the modality gap and promote more balanced multimodal reasoning. Our code is publicly available at https://github.com/UCSB-NLP-Chang/Bridging-Modality-Gap.
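
To make the definition of the modality gap concrete, here is a minimal sketch of how such a performance disparity could be measured. The abstract gives no formula, so everything here is an assumption made for illustration: that performance is exact-match accuracy, that the same questions are posed once in text-centric form and once in vision-centric form, and that the gap is the simple difference between the two accuracies. The function names `accuracy` and `modality_gap` and the toy data are hypothetical and are not taken from the authors' repository.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def modality_gap(
    text_preds: Sequence[str],
    vision_preds: Sequence[str],
    answers: Sequence[str],
) -> float:
    """Assumed metric: text-input accuracy minus vision-input accuracy.

    A positive value means the model does better when the same question
    is presented as text than when it is presented visually.
    """
    return accuracy(text_preds, answers) - accuracy(vision_preds, answers)


# Toy data (invented for illustration): four questions, each asked twice.
answers = ["A", "C", "B", "D"]
text_preds = ["A", "C", "B", "A"]    # 3 of 4 correct with text-centric input
vision_preds = ["A", "B", "B", "A"]  # 2 of 4 correct with vision-centric input

gap = modality_gap(text_preds, vision_preds, answers)
print(f"modality gap: {gap:.2f}")
```

Under these assumptions, a gap near zero would indicate balanced reasoning across modalities, while a large positive gap would indicate the over-reliance on text that the abstract describes.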
