Towards Efficient Online Tuning of VLM Agents via Counterfactual Soft Reinforcement Learning
This work addresses a specific bottleneck in online tuning of VLM agents for dynamic environments, offering an incremental improvement over prior methods.
The paper tackles inefficient online exploration when fine-tuning vision-language model (VLM) agents, a problem caused by their open-ended textual action space. It proposes Counterfactual Soft Reinforcement Learning (CoSo), which dynamically prioritizes action-critical tokens, and reports improved exploration efficiency and consistent performance gains on tasks such as Android device control and embodied AI.
Online fine-tuning of vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, their open-ended textual action space and the non-end-to-end nature of action generation present significant challenges to effective online exploration in RL, such as an explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide theoretical analysis proving CoSo's convergence and policy improvement guarantees, and extensive empirical evaluations supporting CoSo's effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains. The code is available at https://github.com/langfengQ/CoSo.
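To make the idea of token-level prioritization concrete, the following is a minimal toy sketch in Python. It is not the authors' implementation. It assumes a hypothetical post-processing rule (`parse_action`, which treats everything after an `Action:` marker as the executed action), a crude counterfactual test (replace one token with a hypothetical `<null>` placeholder and check whether the parsed action changes), and binary weights. The abstract does not specify how CoSo measures causal influence; this sketch only shows how a per-token weight could concentrate an entropy-style exploration term on action-critical tokens instead of spreading it uniformly.

```python
import math


def parse_action(tokens):
    """Toy post-processing (assumption): the action is whatever follows 'Action:'."""
    if "Action:" not in tokens:
        return None
    idx = tokens.index("Action:")
    return tuple(tokens[idx + 1:])


def counterfactual_weights(tokens, null_token="<null>"):
    """Weight 1.0 if swapping the token for a placeholder changes the parsed action."""
    base = parse_action(tokens)
    weights = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [null_token] + tokens[i + 1:]
        weights.append(1.0 if parse_action(perturbed) != base else 0.0)
    return weights


def entropy(probs):
    """Shannon entropy (in nats) of one token's output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def weighted_entropy(token_dists, weights):
    """Entropy bonus in which each token's entropy is scaled by its weight."""
    return sum(w * entropy(d) for w, d in zip(weights, token_dists))


# Hypothetical agent output: free-form reasoning followed by the action.
tokens = ["I", "should", "open", "settings", "Action:", "tap", "settings_icon"]
weights = counterfactual_weights(tokens)
print(weights)  # [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

# Made-up per-token distributions: every token equally uncertain (two options).
token_dists = [[0.5, 0.5] for _ in tokens]
uniform_bonus = weighted_entropy(token_dists, [1.0] * len(tokens))
targeted_bonus = weighted_entropy(token_dists, weights)
print(uniform_bonus, targeted_bonus)
```

In this toy setting the uniform bonus rewards uncertainty on all seven tokens, while the targeted bonus counts only the three tokens that actually determine the parsed action. A real system would need a graded, learned estimate of influence in place of the binary placeholder test used here.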