Expressive Temporal Specifications for Reward Monitoring
This work addresses the problem of sparse rewards in long-horizon decision-making for RL practitioners, offering an incremental improvement over existing methods.
The paper tackles the challenge of specifying dense reward functions in Reinforcement Learning. It uses quantitative Linear Temporal Logic on finite traces to synthesize reward monitors that provide nuanced feedback during training. Compared to Boolean monitors, these quantitative monitors match and, depending on the environment, improve task completion and reduce convergence time.
Specifying informative and dense reward functions remains a pivotal challenge in Reinforcement Learning, as it directly affects the efficiency of agent training. In this work, we harness the expressive power of quantitative Linear Temporal Logic on finite traces ($\text{LTL}_f[\mathcal{F}]$) to synthesize reward monitors that generate a dense stream of rewards for runtime-observable state trajectories. By providing nuanced feedback during training, these monitors guide agents toward optimal behaviour and help mitigate the well-known issue of sparse rewards in long-horizon decision making, which arises under the Boolean semantics dominating the current literature. Our framework is algorithm-agnostic, relies only on a state labelling function, and naturally accommodates the specification of non-Markovian properties. Empirical results show that our quantitative monitors consistently subsume and, depending on the environment, outperform Boolean monitors in maximizing a quantitative measure of task completion and in reducing convergence time.
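To make the contrast between Boolean and quantitative monitoring concrete, the following is a minimal sketch and not the paper's construction. It assumes a hypothetical one-dimensional corridor, a hypothetical labelling function `label` that maps each state to a value in [0, 1], and a simplified max-based reading of the property "eventually reach the goal". The names `reward_stream`, `eventually_boolean`, and `eventually_quantitative` are illustrative only, and paying out the change in the monitor's value at each step is one possible way to turn a monitor into a reward signal.

```python
from typing import Callable, List, Sequence

State = int
Monitor = Callable[[Sequence[State]], float]

GOAL = 4  # hypothetical corridor with cells 0..4; the goal is cell 4


def label(state: State) -> float:
    """Hypothetical quantitative labelling: closeness to the goal, in [0, 1]."""
    return state / GOAL


def eventually_boolean(prefix: Sequence[State]) -> float:
    """Boolean reading of 'eventually goal': 1.0 only once the goal was visited."""
    return 1.0 if any(label(s) == 1.0 for s in prefix) else 0.0


def eventually_quantitative(prefix: Sequence[State]) -> float:
    """Simplified quantitative reading: best label value seen so far."""
    return max(label(s) for s in prefix)


def reward_stream(trace: Sequence[State], monitor: Monitor) -> List[float]:
    """Emit, at each step, the change in the monitor's value on the prefix so far."""
    rewards: List[float] = []
    previous = 0.0
    for t in range(1, len(trace) + 1):
        value = monitor(trace[:t])  # depends on the whole history, not just the last state
        rewards.append(value - previous)
        previous = value
    return rewards


trace = [0, 1, 2, 2, 3, 4]  # illustrative trajectory that stalls once at cell 2
boolean_rewards = reward_stream(trace, eventually_boolean)
quantitative_rewards = reward_stream(trace, eventually_quantitative)
print(boolean_rewards)       # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
print(quantitative_rewards)  # [0.0, 0.25, 0.25, 0.0, 0.25, 0.25]
```

In this toy setting both monitors pay out the same total reward, but the Boolean one stays silent until the final step, whereas the quantitative one gives feedback whenever the agent makes progress. Because each monitor evaluates the whole prefix rather than only the current state, the resulting reward is history-dependent, which is the sense in which non-Markovian properties fit this style of monitoring.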