Learning to Select Visual In-Context Demonstrations
For researchers working on visual in-context learning, this work offers a principled method to improve demonstration selection for factual regression tasks, though the contribution is incremental: it applies RL to a known bottleneck.
The paper addresses the sub-optimality of kNN-based demonstration selection for visual in-context learning in MLLMs, particularly for factual regression tasks. Evaluated on five visual regression benchmarks, the proposed RL-based method, LSD, significantly outperforms baselines on objective, factual tasks, while kNN remains the better choice for subjective preference tasks.
Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task's full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.
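The contrast between similarity-first kNN selection and sequential selection can be illustrated with a minimal sketch. It is not the authors' implementation. The embeddings are plain lists of floats, and `toy_q_fn` is a hypothetical hand-written stand-in (relevance minus redundancy) for the learned network; the actual agent is trained against MLLM downstream performance, which is not modeled here. Only `dueling_q` follows a standard formula, the usual dueling aggregation Q(s, a) = V(s) + A(s, a) - mean(A).

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_select(query, pool, k):
    """Baseline: indices of the k pool embeddings most similar to the query."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: cosine(query, pool[i]),
                    reverse=True)
    return ranked[:k]


def dueling_q(value, advantages):
    """Standard dueling aggregation: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def toy_q_fn(query, chosen, candidates):
    """Hypothetical stand-in for a learned Q-network (an assumption, not LSD).

    Scores each candidate by similarity to the query minus its maximum
    similarity to the demonstrations already chosen.
    """
    advantages = []
    for cand in candidates:
        relevance = cosine(query, cand)
        redundancy = max((cosine(cand, c) for c in chosen), default=0.0)
        advantages.append(relevance - redundancy)
    # The state value is fixed at 0.0 here; a trained network would predict it.
    return dueling_q(0.0, advantages)


def sequential_select(query, pool, k, q_fn):
    """Build the demonstration set one item at a time, greedy w.r.t. q_fn."""
    chosen = []
    for _ in range(k):
        candidates = [i for i in range(len(pool)) if i not in chosen]
        q_values = q_fn(query,
                        [pool[i] for i in chosen],
                        [pool[i] for i in candidates])
        best = max(range(len(candidates)), key=lambda j: q_values[j])
        chosen.append(candidates[best])
    return chosen


if __name__ == "__main__":
    # Two near-duplicate embeddings and one dissimilar embedding.
    pool = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
    query = [1.0, 0.5]
    print("kNN:       ", knn_select(query, pool, 2))
    print("sequential:", sequential_select(query, pool, 2, toy_q_fn))
```

On this toy pool, kNN returns the two near-duplicates, whereas the sequential selector pairs the closest example with the dissimilar one. This mirrors the redundancy problem described above, though only under the hand-written scoring assumption rather than a learned policy.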