LG · RO · Feb 1, 2025

Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network

Jijia Liu, Feng Gao, Qingmin Liao, Chao Yu, Yu Wang

arXiv:2502.00288v24.1 · h-index: 4 · ICML

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of learning from suboptimal data in continuous control tasks, offering incremental improvements in sample efficiency for robotics and simulation applications.

The paper targets the large online-interaction requirements of reinforcement learning for continuous control. It proposes Auto-Regressive Soft Q-learning (ARSQ), which models Q-values in a coarse-to-fine, auto-regressive manner to improve sample efficiency. On the D4RL benchmark, which includes non-expert demonstrations, ARSQ achieves an average 1.62x performance improvement over the SOTA value-based baseline.

Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods can mitigate this burden by offering relatively high sample efficiency. Some studies further enhance sample efficiency by incorporating offline demonstration data to "kick-start" training, achieving promising results in continuous control. However, they typically compute the Q-function independently for each action dimension, neglecting interdependencies and making it harder to identify optimal actions when learning from suboptimal data, such as non-expert demonstrations and data collected online during training. To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner. First, ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks. Next, it auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks. We evaluate ARSQ on two continuous control benchmarks, RLBench and D4RL, integrating demonstration data into online training. On D4RL, which includes non-expert demonstrations, ARSQ achieves an average $1.62\times$ performance improvement over the SOTA value-based baseline. On RLBench, which incorporates expert demonstrations, ARSQ surpasses various baselines, demonstrating its effectiveness in learning from suboptimal online-collected data. The project page is at https://sites.google.com/view/ar-soft-q
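The two ideas in the abstract, coarse-to-fine discretization and auto-regressive prediction across action dimensions, can be illustrated with a small sketch. This is not the authors' implementation. The function names, the `advantage_fn` signature, the level-then-dimension loop order, the action range of [-1, 1], and the toy advantage function are all assumptions made for illustration. In ARSQ the advantages would come from a learned network, which is replaced here by a hand-written stand-in.

```python
import numpy as np

NUM_BINS = 3                      # bins per level (assumed value)
TARGET = np.array([0.5, -0.5])    # hypothetical "good" action for the toy example


def toy_advantage(state, level, dim, prev_bins, low, high):
    """Hypothetical stand-in for a learned advantage head.

    Scores each candidate bin of dimension `dim` by how close its centre is
    to TARGET[dim]. A learned model would also condition on `state` and on
    `prev_bins`, the bins already chosen for earlier dimensions at this level.
    """
    width = (high[dim] - low[dim]) / NUM_BINS
    centres = low[dim] + (np.arange(NUM_BINS) + 0.5) * width
    return -np.abs(centres - TARGET[dim])


def select_action(advantage_fn, state, action_dim, num_levels=3,
                  num_bins=NUM_BINS, alpha=1.0, greedy=False, rng=None):
    """Pick a continuous action by repeatedly zooming into sub-intervals."""
    rng = np.random.default_rng() if rng is None else rng
    low = -np.ones(action_dim)    # assumed action range [-1, 1] per dimension
    high = np.ones(action_dim)
    for level in range(num_levels):            # coarse-to-fine hierarchy
        chosen = []
        for dim in range(action_dim):          # auto-regressive over dimensions
            adv = np.asarray(
                advantage_fn(state, level, dim, tuple(chosen), low, high),
                dtype=float,
            )
            if greedy:
                idx = int(np.argmax(adv))
            else:
                # Soft (Boltzmann) policy: sample bins in proportion to exp(adv / alpha).
                logits = adv / alpha
                probs = np.exp(logits - logits.max())
                probs /= probs.sum()
                idx = int(rng.choice(num_bins, p=probs))
            chosen.append(idx)
        # Shrink every dimension's interval to the bin chosen at this level.
        width = (high - low) / num_bins
        low = low + np.array(chosen) * width
        high = low + width
    return (low + high) / 2.0                  # centre of the final interval


action = select_action(toy_advantage, state=None, action_dim=2, greedy=True)
print(action)
```

With 3 levels of 3 bins each, every dimension is resolved to one of 27 intervals while only 3 candidates are scored per step, which is the intuition behind the coarse-to-fine decomposition. Passing the earlier dimensions' choices into the advantage function is what lets a learned model capture interdependencies between action dimensions.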

