AI SY

May 26, 2023

A Reminder of its Brittleness: Language Reward Shaping May Hinder Learning for Instruction Following Agents

Sukai Huang, Nir Lipovetzky, Trevor Cohn

arXiv:2305.16621v25.42 citations

Has Code

Originality: Incremental advance

AI Analysis

For researchers in reinforcement learning and natural language processing, this work points out an oversight in prior studies of language reward shaping and offers incremental insight into the limitations of those methods.

The paper examines language reward shaping (LRS) in reinforcement learning for instruction-following agents. It argues that the apparent success of LRS is brittle, attributing prior positive results to weak RL baselines and to suboptimal LRS designs that reward partially matched trajectories, and it gives theoretical and empirical evidence that LRS-trained agents converge more slowly than pure RL agents.

Teaching agents to follow complex written instructions has been an important yet elusive goal. One technique for enhancing learning efficiency is language reward shaping (LRS). Within a reinforcement learning (RL) framework, LRS involves training a reward function that rewards behaviours precisely aligned with given language instructions. We argue that the apparent success of LRS is brittle, and prior positive findings can be attributed to weak RL baselines. Specifically, we identified suboptimal LRS designs that reward partially matched trajectories, and we characterised a novel reward perturbation to capture this issue using the concept of loosening task constraints. We provided theoretical and empirical evidence that agents trained using LRS rewards converge more slowly compared to pure RL agents. Our work highlights the brittleness of existing LRS methods, which has been overlooked in previous studies.
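To make the idea of rewarding partially matched trajectories concrete, here is a minimal toy sketch in Python. It is not the authors' method: the LRS reward models discussed in the abstract are learned, whereas the two functions below are hand-written stand-ins, and every name in it (`strict_shaping_reward`, `loose_shaping_reward`, the sub-goal labels) is hypothetical. It only contrasts a shaping signal that respects the ordering constraint of an instruction with one that loosens that constraint.

```python
from typing import Sequence


def strict_shaping_reward(instruction: Sequence[str], trajectory: Sequence[str]) -> float:
    """Return 1.0 only if every sub-goal occurs in the trajectory in the instructed order."""
    events = iter(trajectory)
    # `goal in events` advances the iterator, so this is an ordered-subsequence check.
    return 1.0 if all(goal in events for goal in instruction) else 0.0


def loose_shaping_reward(instruction: Sequence[str], trajectory: Sequence[str]) -> float:
    """Return the fraction of sub-goals seen anywhere in the trajectory, ignoring order."""
    seen = set(trajectory)
    return sum(goal in seen for goal in instruction) / len(instruction)


# Hypothetical instruction: "get the key, then open the door, then reach the goal".
instruction = ["get_key", "open_door", "reach_goal"]
correct = ["move", "get_key", "move", "open_door", "reach_goal"]
wrong_order = ["open_door", "move", "get_key"]  # violates the ordering constraint

for name, trajectory in [("correct", correct), ("wrong_order", wrong_order)]:
    print(
        name,
        strict_shaping_reward(instruction, trajectory),
        round(loose_shaping_reward(instruction, trajectory), 3),
    )
```

In this toy setup the out-of-order trajectory earns nothing under the strict check but still collects a partial reward under the loose one, which is the kind of partial-match signal the abstract identifies as a suboptimal design.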
