CV Jul 31, 2024

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

Mingrui Wu, Xinyue Cai, Jiayi Ji, Jiale Li, Oucheng Huang, Gen Luo, Hao Fei, Guannan Jiang, Xiaoshuai Sun, Rongrong Ji

arXiv:2407.21534v624.043 citations h-index: 18 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the challenge of adding referring abilities to MLLMs without costly training, which benefits users who need efficient vision-language capabilities. The contribution is incremental, since it builds on existing attention mechanisms.

The authors enable Multimodal Large Language Models (MLLMs) to perform detailed region description and reasoning without retraining. They propose a training-free method that injects visual prompts through test-time optimization of a learnable latent variable. The method supports referring with box, mask, scribble, and point inputs, and exhibits out-of-domain generalization and interpretability.

In this work, we propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs) through test-time optimization of a learnable latent variable. We observe that attention, as the core module of MLLMs, connects text prompt tokens and visual tokens, ultimately determining the final results. Our approach involves adjusting visual tokens from the MLP output at test time, controlling the attention response to ensure text prompt tokens attend to visual tokens in referring regions. We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map. This enables detailed region description and reasoning without the need for substantial training costs or model retraining. Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box, mask, scribble and point. The results demonstrate that our method exhibits out-of-domain generalization and interpretability.
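The idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention head, a single text query token, and a simple energy E = (1 - M)^2, where M is the attention mass on the referred region. That energy form, the function names (`region_mass`, `optimize_latent`), the learning rate, and the toy data are all assumptions chosen for illustration. A real MLLM would backpropagate through many layers and heads, whereas here the gradient is derived by hand.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def region_mass(query, visual_tokens, latent, region):
    """Attention of one text query over visual tokens shifted by the latent.

    Returns the attention weights and the total weight inside `region`.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * (v + p) for q, v, p in zip(query, tok, lat)) / scale
        for tok, lat in zip(visual_tokens, latent)
    ]
    attn = softmax(scores)
    return attn, sum(attn[i] for i in region)

def optimize_latent(query, visual_tokens, region, steps=20, lr=1.0):
    """Gradient descent on a latent added to the visual tokens.

    Assumed energy (hypothetical, for illustration): E = (1 - M)^2,
    where M is the attention mass on the referred region.
    """
    dim = len(query)
    scale = math.sqrt(dim)
    latent = [[0.0] * dim for _ in visual_tokens]
    history = []
    for _ in range(steps):
        attn, mass = region_mass(query, visual_tokens, latent, region)
        history.append((1.0 - mass) ** 2)
        for j, a_j in enumerate(attn):
            in_region = 1.0 if j in region else 0.0
            # Chain rule: dE/dM = -2(1 - M), dM/ds_j = a_j (1[j in region] - M)
            grad_score = -2.0 * (1.0 - mass) * a_j * (in_region - mass)
            # ds_j/dp_j = query / sqrt(dim)
            for k in range(dim):
                latent[j][k] -= lr * grad_score * query[k] / scale
    return latent, history

# Toy data: four visual tokens; tokens 2 and 3 form the referred region.
query = [1.0, 0.0, -1.0, 0.5]
visual_tokens = [
    [0.2, 0.1, 0.0, 0.3],
    [0.5, -0.2, 0.1, 0.0],
    [-0.3, 0.4, 0.2, 0.1],
    [0.0, 0.0, 0.5, -0.4],
]
region = {2, 3}

zero_latent = [[0.0] * len(query) for _ in visual_tokens]
_, before = region_mass(query, visual_tokens, zero_latent, region)
latent, history = optimize_latent(query, visual_tokens, region)
_, after = region_mass(query, visual_tokens, latent, region)
print(f"region attention mass: {before:.3f} -> {after:.3f}")
```

The model weights stay fixed throughout; only the latent added to the visual tokens changes, which is what makes the approach training-free in the sense the abstract describes. A box, mask, scribble, or point would each reduce to a different choice of `region` in this sketch.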
