AI
Apr 24, 2025

Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning

MILA
arXiv:2504.17282v1
1 citation
h-index: 13
Originality: Incremental advance
AI Analysis

This paper addresses sample efficiency for agents in sparse-reward, large-action-space environments such as web navigation. The contribution appears incremental, since it builds on existing vision-language models and program-generation techniques.

The paper tackles low sample efficiency in reinforcement learning agents that navigate web GUIs with large action spaces. It proposes CoGA, a method that uses pre-trained vision-language models to generate code identifying the affordable actions, i.e., the subset of actions that can achieve a desired outcome in the current situation. On the MiniWob++ benchmark, the reported results show that CoGA is orders of magnitude more sample efficient than its underlying RL agent alone, that its programs generalize within task families, and that it matches or exceeds behavior cloning when only a few expert demonstrations are available.

Agents that can autonomously navigate the web through a graphical user interface (GUI) using a unified action space (e.g., mouse and keyboard actions) can require very large amounts of domain-specific expert demonstrations to achieve good performance. Low sample efficiency is often exacerbated in sparse-reward and large-action-space environments, such as a web GUI, where only a few actions are relevant in any given situation. In this work, we consider the low-data regime, with limited or no access to expert behavior. To enable sample-efficient learning, we explore the effect of constraining the action space through intent-based affordances -- i.e., considering in any situation only the subset of actions that achieve a desired outcome. We propose Code as Generative Affordances (CoGA), a method that leverages pre-trained vision-language models (VLMs) to generate code that determines affordable actions through implicit intent-completion functions and using a fully-automated program generation and verification pipeline. These programs are then used in-the-loop of a reinforcement learning agent to return a set of affordances given a pixel observation. By greatly reducing the number of actions that an agent must consider, we demonstrate on a wide range of tasks in the MiniWob++ benchmark that: 1) CoGA is orders of magnitude more sample efficient than its RL agent, 2) CoGA's programs can generalize within a family of tasks, and 3) CoGA performs better or on par compared with behavior cloning when a small number of expert demonstrations is available.
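
To make the idea of using an affordance program in the loop of an RL agent more concrete, here is a minimal sketch of action masking. It is not the authors' implementation. The grid discretization, the `toy_affordance_program` function, and the 0/1 "observation" are all hypothetical stand-ins chosen for illustration; in the paper the program is generated by a VLM and operates on pixel observations.

```python
import math
import random

# Hypothetical discretization of click positions into an 8x8 grid (assumption).
GRID_W, GRID_H = 8, 8


def toy_affordance_program(observation):
    """Stand-in for a generated affordance program (hypothetical).

    `observation` is a GRID_H x GRID_W grid of 0/1 values, where 1 marks a cell
    covered by a clickable element. Returns the set of affordable action indices.
    """
    return {
        row * GRID_W + col
        for row in range(GRID_H)
        for col in range(GRID_W)
        if observation[row][col] == 1
    }


def masked_softmax(logits, allowed):
    """Softmax over `logits`, with zero probability outside `allowed`."""
    if not allowed:
        # Fall back to the full action space if the program returns nothing.
        allowed = set(range(len(logits)))
    peak = max(logits[i] for i in allowed)  # subtract the max for stability
    weights = [
        math.exp(logit - peak) if i in allowed else 0.0
        for i, logit in enumerate(logits)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(logits, observation, rng):
    """Sample an action from the policy restricted to affordable actions."""
    allowed = toy_affordance_program(observation)
    probs = masked_softmax(logits, allowed)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Toy screen with two clickable cells out of 64 possible click targets.
observation = [[0] * GRID_W for _ in range(GRID_H)]
observation[2][3] = 1
observation[5][6] = 1

logits = [0.0] * (GRID_W * GRID_H)  # an untrained, uniform policy
rng = random.Random(0)
probs = masked_softmax(logits, toy_affordance_program(observation))
action = sample_action(logits, observation, rng)
```

In this toy setting, even an untrained policy only ever samples one of the two affordable cells rather than exploring all 64, which illustrates why constraining the action space can help in sparse-reward settings.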

