AI | LG | May 13, 2025

Monte Carlo Beam Search for Actor-Critic Reinforcement Learning in Continuous Control

arXiv:2505.09029v1 | 9.63 citations | h-index: 3 | 2025 22nd International Conference on Ubiquitous Robots (UR)

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficient exploration in continuous control tasks for reinforcement learning practitioners. The advance is incremental, since it builds on existing methods such as TD3.

The paper tackles suboptimal policy convergence in actor-critic reinforcement learning for continuous control. It introduces Monte Carlo Beam Search (MCBS), which improves exploration and action selection and yields better sample efficiency and performance. For example, MCBS reaches 90% of the maximum achievable reward in about 200,000 timesteps, compared with 400,000 for the second-best method.

Actor-critic methods, like Twin Delayed Deep Deterministic Policy Gradient (TD3), depend on basic noise-based exploration, which can result in less than optimal policy convergence. In this study, we introduce Monte Carlo Beam Search (MCBS), a new hybrid method that combines beam search and Monte Carlo rollouts with TD3 to improve exploration and action selection. MCBS produces several candidate actions around the policy's output and assesses them through short-horizon rollouts, enabling the agent to make better-informed choices. We test MCBS across various continuous-control benchmarks, including HalfCheetah-v4, Walker2d-v5, and Swimmer-v5, showing enhanced sample efficiency and performance compared to standard TD3 and other baseline methods like SAC, PPO, and A2C. Our findings emphasize MCBS's capability to enhance policy learning through structured look-ahead search while ensuring computational efficiency. Additionally, we offer a detailed analysis of crucial hyperparameters, such as beam width and rollout depth, and explore adaptive strategies to optimize MCBS for complex control tasks. Our method shows a higher convergence rate across different environments compared to TD3, SAC, PPO, and A2C. For instance, we achieved 90% of the maximum achievable reward within around 200 thousand timesteps compared to 400 thousand timesteps for the second-best method.
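The action-selection idea in the abstract (sample candidate actions around the policy's output, score each with a short-horizon rollout, keep the best) can be illustrated with a minimal sketch. This is not the authors' implementation: the 1-D dynamics `toy_step`, the actor `toy_policy`, the critic `toy_critic`, and all default parameter values are hypothetical stand-ins, and the sketch scores a single level of candidates rather than expanding a multi-level beam.

```python
import random


def toy_step(state, action):
    """Hypothetical 1-D dynamics: move by `action`; reward is higher near 0."""
    next_state = state + action
    return next_state, -abs(next_state)


def toy_policy(state):
    """Hypothetical stand-in for the TD3 actor (a deliberately weak controller)."""
    return max(-1.0, min(1.0, -0.2 * state))


def toy_critic(state, action):
    """Hypothetical stand-in for the TD3 critic Q(s, a)."""
    return -abs(state + action)


def rollout_return(state, first_action, depth, gamma, step, policy, critic):
    """Discounted return of `first_action` followed by `policy` for `depth`
    steps in total, plus a critic bootstrap at the final state."""
    total, discount, action = 0.0, 1.0, first_action
    for _ in range(depth):
        state, reward = step(state, action)
        total += discount * reward
        discount *= gamma
        action = policy(state)
    return total + discount * critic(state, action)


def select_action(state, beam_width=8, depth=3, noise_std=0.3, gamma=0.99,
                  rng=None, step=toy_step, policy=toy_policy, critic=toy_critic):
    """Return the best-scoring candidate action and its rollout score."""
    rng = rng or random.Random(0)
    base = policy(state)
    # Candidates: the policy's own action plus Gaussian perturbations of it,
    # clipped to the assumed action range [-1, 1].
    candidates = [base] + [
        max(-1.0, min(1.0, base + rng.gauss(0.0, noise_std)))
        for _ in range(beam_width - 1)
    ]
    scores = [
        rollout_return(state, a, depth, gamma, step, policy, critic)
        for a in candidates
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


if __name__ == "__main__":
    action, score = select_action(2.0)
    print(f"chosen action: {action:.3f}, rollout score: {score:.3f}")
```

Because the unperturbed policy action is always among the candidates, the selected action never scores worse than the policy's own output under the same rollout model. In this sketch, `beam_width` and `depth` play the roles of the beam width and rollout depth hyperparameters mentioned in the abstract.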
