LG | AI | CV | Jun 13, 2025

RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer

arXiv:2506.11465v1 | 1 citation | h-index: 12 | Has Code | ICML
Originality: Incremental advance
AI Analysis

The work addresses a critical bottleneck in multimodal learning for applications that rely on Transformer architectures, though it appears incremental because it modifies an existing mechanism.

The paper tackles the problem of multimodal Transformers losing dynamic adaptability and developing modality bias, proposing Rolling Query (RollingQ) to restore cooperation dynamics, which improves performance across various multimodal scenarios.

Multimodal learning faces challenges in effectively fusing information from diverse modalities, especially when modality quality varies across samples. Dynamic fusion strategies, such as the attention mechanism in Transformers, aim to address this challenge by adaptively emphasizing modalities based on the characteristics of the input data. However, through a large number of carefully designed experiments, we surprisingly observed that the dynamic adaptability of widely used self-attention models diminishes. The model tends to prefer one modality regardless of data characteristics. This bias triggers a self-reinforcing cycle that progressively overemphasizes the favored modality, widening the distribution gap in attention keys across modalities and deactivating the attention mechanism's dynamic properties. To revive adaptability, we propose a simple yet effective method, Rolling Query (RollingQ), which balances attention allocation by rotating the query to break the self-reinforcing cycle and mitigate the key distribution gap. Extensive experiments on various multimodal scenarios validate the effectiveness of RollingQ and show that restoring cooperation dynamics is pivotal for enhancing the broader capabilities of widely deployed multimodal Transformers. The source code is available at https://github.com/GeWu-Lab/RollingQ_ICML2025.
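
As a rough illustration of the idea in the abstract, the toy sketch below shows how a gap between the key distributions of two modalities can concentrate attention on one of them, and how rotating the query can rebalance the allocation. It is not the authors' implementation: the abstract does not say how the rotation is computed, so the 2-D setup, the hand-picked keys, the helper names (`modality_mass`, `rotate2d`, `mean_vector`), and the rule of rotating the query onto the bisector of the two mean key directions are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rotate2d(v, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def modality_mass(query, keys_a, keys_b):
    """Share of scaled dot-product attention landing on each modality."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys_a + keys_b]
    weights = softmax(scores)
    mass_a = sum(weights[:len(keys_a)])
    return mass_a, 1.0 - mass_a


# Hypothetical keys: the two modalities occupy clearly separated regions,
# mimicking a large key distribution gap.
keys_a = [[3.0, 0.2], [2.8, -0.1], [3.1, 0.0]]
keys_b = [[0.1, 3.0], [-0.2, 2.9], [0.0, 3.1]]

# Hypothetical query that has drifted toward modality A's keys.
query = [2.0, 0.0]
biased = modality_mass(query, keys_a, keys_b)

# Assumed balancing rule (illustration only): rotate the query onto the
# bisector of the two mean key directions, keeping its norm unchanged.
mu_a, mu_b = mean_vector(keys_a), mean_vector(keys_b)
unit_a = [x / math.hypot(*mu_a) for x in mu_a]
unit_b = [x / math.hypot(*mu_b) for x in mu_b]
bisector = [unit_a[0] + unit_b[0], unit_a[1] + unit_b[1]]
theta = math.atan2(bisector[1], bisector[0]) - math.atan2(query[1], query[0])
rolled_query = rotate2d(query, theta)
balanced = modality_mass(rolled_query, keys_a, keys_b)

print("attention mass (A, B) before rotation:", biased)
print("attention mass (A, B) after rotation: ", balanced)
```

In this toy setting the unrotated query places nearly all attention on modality A, while the rotated query splits attention roughly evenly. The sketch covers only a single query at one point in time; the self-reinforcing cycle described in the abstract arises during training and is not modeled here.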

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
