Test-Time Computing for Referring Multimodal Large Language Models
This work addresses the need for more precise and adaptable visual reasoning in AI systems, although the contribution is incremental in that it builds on existing MLLM frameworks.
The paper tackles the problem of enabling fine-grained, region-based visual reasoning in frozen multimodal large language models (MLLMs) without retraining or fine-tuning. It proposes ControlMLLM++, which injects learnable visual prompts at test time and demonstrates strong out-of-domain generalization.
We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention towards user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at https://github.com/mrwu-mac/ControlMLLM.
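The core idea of optimizing a latent visual token modifier against an attention-based energy can be illustrated with a toy sketch. This is not the released implementation: it assumes a single text query attending over a handful of visual tokens, an energy defined as the negative log of the attention mass inside the referred region, and plain gradient descent with a hand-derived gradient. It omits Optim++ and PromptDebias entirely. All function names, shapes, and hyperparameters below are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, tokens, offsets):
    """Attention of one text query over visual tokens shifted by learnable offsets."""
    scale = math.sqrt(len(query))
    logits = [
        sum(q * (t + o) for q, t, o in zip(query, tok, off)) / scale
        for tok, off in zip(tokens, offsets)
    ]
    return softmax(logits)

def region_energy(attn, mask):
    """Assumed energy: low when attention mass falls inside the masked region."""
    return -math.log(sum(a for a, m in zip(attn, mask) if m))

def optimize_offsets(query, tokens, mask, steps=100, lr=1.0):
    """Gradient descent on the offsets only; query and tokens stay frozen."""
    dim = len(query)
    scale = math.sqrt(dim)
    offsets = [[0.0] * dim for _ in tokens]
    for _ in range(steps):
        attn = attention(query, tokens, offsets)
        inside = sum(a for a, m in zip(attn, mask) if m)
        for j, (a, m) in enumerate(zip(attn, mask)):
            # dE/dlogit_j for E = -log(sum of in-region attention)
            g_logit = a - (a / inside if m else 0.0)
            for k in range(dim):
                # chain rule: logit_j = query . (token_j + offset_j) / scale
                offsets[j][k] -= lr * g_logit * query[k] / scale
    return offsets

# Toy setup: 16 visual tokens (a 4x4 grid), region = the central 2x2 block.
dim, n_tokens = 4, 16
query = [0.5, -1.0, 0.25, 0.75]
tokens = [[math.sin(i + k) for k in range(dim)] for i in range(n_tokens)]
mask = [i in (5, 6, 9, 10) for i in range(n_tokens)]
zero = [[0.0] * dim for _ in tokens]

before = region_energy(attention(query, tokens, zero), mask)
learned = optimize_offsets(query, tokens, mask)
after = region_energy(attention(query, tokens, learned), mask)
print(f"energy before: {before:.3f}, after: {after:.3f}")
```

In this sketch the frozen parts (the query and the visual tokens) are never updated; only the additive offsets change, which mirrors the training-free setting described above. A box, mask, scribble, or point prompt would each reduce to a different choice of `mask` under this simplification.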