LG HC | Jan 23

The Trajectory Alignment Coefficient in Two Acts: From Reward Tuning to Reward Learning

Calarina Muslimani, Yunshu Du, Kenta Kawamoto, Kaushik Subramanian, Peter Stone, Peter Wurman

arXiv:2601.16906v11.4 | h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses the labor-intensive, error-prone process of reward specification for RL practitioners by offering both a tuning aid and a learning objective, though its contribution is incremental in scope.

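The paper tackles the challenge of designing accurate reward functions in reinforcement learning using the Trajectory Alignment Coefficient (TAC), a metric that measures how closely a reward function's induced preferences match those of a domain expert. In a human-subject study on Lunar Lander, participants given TAC during tuning produced more performant reward functions and reported lower cognitive workload than those tuning without it. In Gran Turismo 7, reward models trained with Soft-TAC, a differentiable approximation of TAC, yielded policies with qualitatively more distinct behaviors than models trained with standard cross-entropy loss.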

The success of reinforcement learning (RL) is fundamentally tied to having a reward function that accurately reflects the task objective. Yet, designing reward functions is notoriously time-consuming and prone to misspecification. To address this issue, our first goal is to understand how to support RL practitioners in specifying appropriate weights for a reward function. We leverage the Trajectory Alignment Coefficient (TAC), a metric that evaluates how closely a reward function's induced preferences match those of a domain expert. To evaluate whether TAC provides effective support in practice, we conducted a human-subject study in which RL practitioners tuned reward weights for Lunar Lander. We found that providing TAC during reward tuning led participants to produce more performant reward functions and report lower cognitive workload relative to standard tuning without TAC. However, the study also underscored that manual reward design, even with TAC, remains labor-intensive. This limitation motivated our second goal: to learn a reward model that maximizes TAC directly. Specifically, we propose Soft-TAC, a differentiable approximation of TAC that can be used as a loss function to train reward models from human preference data. Validated in the racing simulator Gran Turismo 7, reward models trained using Soft-TAC successfully captured preference-specific objectives, resulting in policies with qualitatively more distinct behaviors than models trained with standard Cross-Entropy loss. This work demonstrates that TAC can serve as both a practical tool for guiding reward tuning and a reward learning objective in complex domains.
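The abstract does not give the formula for TAC or Soft-TAC, so the following is only a rough sketch of the idea under stated assumptions. It assumes the alignment score is a Kendall-tau-style average of pairwise agreements between the reward-induced ordering of trajectories and the expert's preferences, and that the soft variant replaces the hard sign with a tanh so the score becomes differentiable in the returns. The function names, the `temperature` parameter, the two-feature linear reward, and the toy data are all hypothetical and are not taken from the paper.

```python
import math

def linear_return(weights, features):
    """Return of one trajectory under a linear reward (assumed form):
    dot product of reward weights and summed trajectory features."""
    return sum(w * f for w, f in zip(weights, features))

def pairwise_alignment(returns, expert_prefs):
    """Kendall-tau-style agreement in [-1, 1] (simplified, ties count as 0).

    returns: one scalar return per trajectory under the candidate reward.
    expert_prefs: dict mapping (i, j) to +1 if the expert prefers
                  trajectory i over j, or -1 if the expert prefers j.
    """
    total = 0
    for (i, j), pref in expert_prefs.items():
        diff = returns[i] - returns[j]
        sign = (diff > 0) - (diff < 0)  # -1, 0, or +1
        total += sign * pref
    return total / len(expert_prefs)

def soft_pairwise_alignment(returns, expert_prefs, temperature=1.0):
    """Smooth surrogate (an assumption, not the paper's definition):
    tanh of the return gap replaces the hard sign."""
    total = 0.0
    for (i, j), pref in expert_prefs.items():
        total += math.tanh((returns[i] - returns[j]) / temperature) * pref
    return total / len(expert_prefs)

# Toy data: three trajectories described by two hypothetical features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical expert ordering: trajectory 0 > trajectory 2 > trajectory 1.
expert_prefs = {(0, 1): +1, (2, 1): +1, (0, 2): +1}

weights_a = [1.0, -0.5]   # candidate reward weights that match the expert
weights_b = [-1.0, 0.5]   # candidate reward weights that invert the expert

returns_a = [linear_return(weights_a, f) for f in features]
returns_b = [linear_return(weights_b, f) for f in features]

score_a = pairwise_alignment(returns_a, expert_prefs)
score_b = pairwise_alignment(returns_b, expert_prefs)
soft_a = soft_pairwise_alignment(returns_a, expert_prefs)

print(score_a, score_b, soft_a)
```

In the tuning setting, a practitioner would recompute the hard score after each change to the weights. In the learning setting, the negated soft score could serve as a loss for gradient-based training of a reward model, which is the role the abstract describes for Soft-TAC.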
