LG | Sep 28, 2025

STAIR: Addressing Stage Misalignment through Temporal-Aligned Preference Reinforcement Learning

Yao Luan, Ni Mu, Yiqin Yang, Bo Xu, Qing-Shan Jia

arXiv:2509.23802v19.42 citations | h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses a bottleneck in preference-based reinforcement learning for multi-stage tasks, in which an agent performs sub-tasks such as navigation and grasping in sequence. It offers an incremental improvement by mitigating stage misalignment without requiring predefined task knowledge.

The paper tackles stage misalignment in preference-based reinforcement learning for multi-stage tasks, where comparing segments from mismatched stages hinders policy learning. It proposes STAIR, which learns stage approximations via temporal distance and uses them to prioritize within-stage comparisons. The authors report superior performance in multi-stage tasks and competitive results in single-stage tasks.

Preference-based reinforcement learning (PbRL) bypasses complex reward engineering by learning rewards directly from human preferences, enabling better alignment with human intentions. However, its effectiveness in multi-stage tasks, where agents sequentially perform sub-tasks (e.g., navigation, grasping), is limited by stage misalignment: Comparing segments from mismatched stages, such as movement versus manipulation, results in uninformative feedback, thus hindering policy learning. In this paper, we validate the stage misalignment issue through theoretical analysis and empirical experiments. To address this issue, we propose STage-AlIgned Reward learning (STAIR), which first learns a stage approximation based on temporal distance, then prioritizes comparisons within the same stage. Temporal distance is learned via contrastive learning, which groups temporally close states into coherent stages, without predefined task knowledge, and adapts dynamically to policy changes. Extensive experiments demonstrate STAIR's superiority in multi-stage tasks and competitive performance in single-stage tasks. Furthermore, human studies show that stages approximated by STAIR are consistent with human cognition, confirming its effectiveness in mitigating stage misalignment.
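The idea of prioritizing comparisons within the same stage can be illustrated with a small sketch. This is not the authors' implementation: it assumes each segment has already been assigned a scalar "stage position" (for instance, a learned temporal distance from the start of the episode to the segment's first state) and simply ranks candidate pairs by how close those positions are. The function name `select_stage_aligned_pairs` and the example values are hypothetical.

```python
from itertools import combinations


def select_stage_aligned_pairs(stage_positions, num_queries):
    """Return the index pairs of segments whose stage positions are closest.

    stage_positions: one scalar per segment (assumed to come from some
        temporal-distance estimate; hypothetical input for this sketch).
    num_queries: how many segment pairs to send out for preference labels.
    """
    # Enumerate every unordered pair of segment indices.
    pairs = combinations(range(len(stage_positions)), 2)
    # Rank pairs by the gap between their stage positions, smallest first,
    # so pairs likely to lie in the same stage are queried first.
    ranked = sorted(
        pairs,
        key=lambda p: abs(stage_positions[p[0]] - stage_positions[p[1]]),
    )
    return ranked[:num_queries]


# Five segments with made-up stage positions: two early, two mid, one late.
positions = [0.0, 1.0, 10.0, 11.0, 25.0]
selected = select_stage_aligned_pairs(positions, num_queries=2)
print(selected)  # [(0, 1), (2, 3)]
```

Here the two selected pairs each compare segments from what looks like the same stage, while pairs that mix early and late segments (such as segments 0 and 4) are ranked last. A practical query selector would likely combine such a stage-alignment score with other criteria, for example reward-model uncertainty.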
