LGJul 22, 2024
Comprehensive Overview of Reward Engineering and Shaping in Advancing Reinforcement Learning ApplicationsSinan Ibrahim, Mostafa Mostafa, Ali Jnadi et al.
The aim of Reinforcement Learning (RL) in real-world applications is to create systems capable of making autonomous decisions by learning from their environment through trial and error. This paper emphasizes the importance of reward engineering and reward shaping in enhancing the efficiency and effectiveness of reinforcement learning algorithms. Reward engineering involves designing reward functions that accurately reflect the desired outcomes, while reward shaping provides additional feedback to guide the learning process, accelerating convergence to optimal policies. Despite significant advancements in reinforcement learning, several limitations persist. One key challenge is the sparse and delayed nature of rewards in many real-world scenarios, which can hinder learning progress. Additionally, the complexity of accurately modeling real-world environments and the computational demands of reinforcement learning algorithms remain substantial obstacles. On the other hand, recent advancements in deep learning and neural networks have significantly improved the capability of reinforcement learning systems to handle high-dimensional state and action spaces, enabling their application to complex tasks such as robotics, autonomous driving, and game playing. This paper provides a comprehensive review of the current state of reinforcement learning, focusing on the methodologies and techniques used in reward engineering and reward shaping. It critically analyzes the limitations and recent advancements in the field, offering insights into future research directions and potential applications in various domains.
SYAug 31, 2022
A stabilizing reinforcement learning approach for sampled systems with partially unknown modelsLukas Beckenbach, Pavel Osinenko, Stefan Streif
Reinforcement learning is commonly associated with training of reward-maximizing (or cost-minimizing) agents, in other words, controllers. It can be applied in model-free or model-based fashion, using a priori or online collected system data to train involved parametric architectures. In general, online reinforcement learning does not guarantee closed loop stability unless special measures are taken, for instance, through learning constraints or tailored training rules. Particularly promising are hybrids of reinforcement learning with "classical" control approaches. In this work, we suggest a method to guarantee practical stability of the system-controller closed loop in a purely online learning setting, i.e., without offline training. Moreover, we assume only partial knowledge of the system model. To achieve the claimed results, we employ techniques of classical adaptive control. The implementation of the overall control scheme is provided explicitly in a digital, sampled setting. That is, the controller receives the state of the system and computes the control action at discrete, specifically, equidistant moments in time. The method is tested in adaptive traction control and cruise control where it proved to significantly reduce the cost.
LGMar 18
Benchmarking Reinforcement Learning via Stochastic Converse Optimality: Generating Systems with Known Optimal PoliciesSinan Ibrahim, Grégoire Ouerdane, Hadi Salloum et al.
The objective comparison of Reinforcement Learning (RL) algorithms is notoriously complex as outcomes and benchmarking of performances of different RL approaches are critically sensitive to environmental design, reward structures, and stochasticity inherent in both algorithmic learning and environmental dynamics. To manage this complexity, we introduce a rigorous benchmarking framework by extending converse optimality to discrete-time, control-affine, nonlinear systems with noise. Our framework provides necessary and sufficient conditions, under which a prescribed value function and policy are optimal for constructed systems, enabling the systematic generation of benchmark families via homotopy variations and randomized parameters. We validate it by automatically constructing diverse environments, demonstrating our framework's capacity for a controlled and comprehensive evaluation across algorithms. By assessing standard methods against a ground-truth optimum, our work delivers a reproducible foundation for precise and rigorous RL benchmarking.
ROJul 15, 2023
Combining model-predictive control and predictive reinforcement learning for stable quadrupedal robot locomotionVyacheslav Kovalev, Anna Shkromada, Henni Ouerdane et al.
Stable gait generation is a crucial problem for legged robot locomotion as this impacts other critical performance factors such as, e.g. mobility over an uneven terrain and power consumption. Gait generation stability results from the efficient control of the interaction between the legged robot's body and the environment where it moves. Here, we study how this can be achieved by a combination of model-predictive and predictive reinforcement learning controllers. Model-predictive control (MPC) is a well-established method that does not utilize any online learning (except for some adaptive variations) as it provides a convenient interface for state constraints management. Reinforcement learning (RL), in contrast, relies on adaptation based on pure experience. In its bare-bone variants, RL is not always suitable for robots due to their high complexity and expensive simulation/experimentation. In this work, we combine both control methods to address the quadrupedal robot stable gate generation problem. The hybrid approach that we develop and apply uses a cost roll-out algorithm with a tail cost in the form of a Q-function modeled by a neural network; this allows to alleviate the computational complexity, which grows exponentially with the prediction horizon in a purely MPC approach. We demonstrate that our RL gait controller achieves stable locomotion at short horizons, where a nominal MP controller fails. Further, our controller is capable of live operation, meaning that it does not require previous training. Our results suggest that the hybridization of MPC with RL, as presented here, is beneficial to achieve a good balance between online control capabilities and computational complexity.
ROSep 15, 2024
Critic as Lyapunov function (CALF): a model-free, stability-ensuring agentPavel Osinenko, Grigory Yaremenko, Roman Zashchitin et al.
This work presents and showcases a novel reinforcement learning agent called Critic As Lyapunov Function (CALF) which is model-free and ensures online environment, in other words, dynamical system stabilization. Online means that in each learning episode, the said environment is stabilized. This, as demonstrated in a case study with a mobile robot simulator, greatly improves the overall learning performance. The base actor-critic scheme of CALF is analogous to SARSA. The latter did not show any success in reaching the target in our studies. However, a modified version thereof, called SARSA-m here, did succeed in some learning scenarios. Still, CALF greatly outperformed the said approach. CALF was also demonstrated to improve a nominal stabilizer provided to it. In summary, the presented agent may be considered a viable approach to fusing classical control with reinforcement learning. Its concurrent approaches are mostly either offline or model-based, like, for instance, those that fuse model-predictive control into the agent.
ROSep 23, 2024
A novel agent with formal goal-reaching guarantees: an experimental study with a mobile robotGrigory Yaremenko, Dmitrii Dobriborsci, Roman Zashchitin et al.
Reinforcement Learning (RL) has been shown to be effective and convenient for a number of tasks in robotics. However, it requires the exploration of a sufficiently large number of state-action pairs, many of which may be unsafe or unimportant. For instance, online model-free learning can be hazardous and inefficient in the absence of guarantees that a certain set of desired states will be reached during an episode. An increasingly common approach to address safety involves the addition of a shielding system that constrains the RL actions to a safe set of actions. In turn, a difficulty for such frameworks is how to effectively couple RL with the shielding system to make sure the exploration is not excessively restricted. This work presents a novel safe model-free RL agent called Critic As Lyapunov Function (CALF) and showcases how CALF can be used to improve upon control baselines in robotics in an efficient and convenient fashion while ensuring guarantees of stable goal reaching. The latter is a crucial part of safety, as seen generally. With CALF all state-action pairs remain explorable and yet reaching of desired goal states is formally guaranteed. Formal analysis is provided that shows the goal stabilization-ensuring properties of CALF and a set of real-world and numerical experiments with a non-holonomic wheeled mobile robot (WMR) TurtleBot3 Burger confirmed the superiority of CALF over such a well-established RL agent as proximal policy optimization (PPO), and a modified version of SARSA in a few-episode setting in terms of attained total cost.
SYJul 18, 2022
A framework for online, stabilizing reinforcement learningGrigory Yaremenko, Georgiy Malaniya, Pavel Osinenko
Online reinforcement learning is concerned with training an agent on-the-fly via dynamic interaction with the environment. Here, due to the specifics of the application, it is not generally possible to perform long pre-training, as it is commonly done in off-line, model-free approaches, which are akin to dynamic programming. Such applications may be found more frequently in industry, rather than in pure digital fields, such as cloud services, video games, database management, etc., where reinforcement learning has been demonstrating success. Online reinforcement learning, in contrast, is more akin to classical control, which utilizes some model knowledge about the environment. Stability of the closed-loop (agent plus the environment) is a major challenge for such online approaches. In this paper, we tackle this problem by a special fusion of online reinforcement learning with elements of classical control, namely, based on the Lyapunov theory of stability. The idea is to start the agent at once, without pre-training, and learn approximately optimal policy under specially designed constraints, which guarantee stability. The resulting approach was tested in an extensive experimental study with a mobile robot. A nominal parking controller was used as a baseline. It was observed that the suggested agent could always successfully park the robot, while significantly improving the cost. While many approaches may be exploited for mobile robot control, we suggest that the experiments showed the promising potential of online reinforcement learning agents based on Lyapunov-like constraints. The presented methodology may be utilized in safety-critical, industrial applications where stability is necessary.
ROMay 18, 2025Code
Adaptive MPC-based quadrupedal robot control under periodic disturbancesElizaveta Pestova, Ilya Osokin, Danil Belov et al.
Recent advancements in adaptive control for reference trajectory tracking enable quadrupedal robots to perform locomotion tasks under challenging conditions. There are methods enabling the estimation of the external disturbances in terms of forces and torques. However, a specific case of disturbances that are periodic was not explicitly tackled in application to quadrupeds. This work is devoted to the estimation of the periodic disturbances with a lightweight regressor using simplified robot dynamics and extracting the disturbance properties in terms of the magnitude and frequency. Experimental evidence suggests performance improvement over the baseline static disturbance compensation. All source files, including simulation setups, code, and calculation scripts, are available on GitHub at https://github.com/aidagroup/quad-periodic-mpc.
ROMar 8Code
Low-Cost Teleoperation Extension for Mobile ManipulatorsDanil Belov, Artem Erkhov, Yaroslav Savotin et al.
Teleoperation of mobile bimanual manipulators requires simultaneous control of high-dimensional systems, often necessitating expensive specialized equipment. We present an open-source teleoperation framework that enables intuitive whole body control using readily available commodity hardware. Our system combines smartphone-based head tracking for camera control, leader arms for bilateral manipulation, and foot pedals for hands-free base navigation. Using a standard smartphone with IMU and display, we eliminate the need for costly VR helmets while maintaining immersive visual feedback. The modular architecture integrates seamlessly with the XLeRobot framework, but can be easily adapted to other types of mobile manipulators. We validate our approach through user studies that demonstrate improved task performance and reduced cognitive load compared to keyboard-based control.
ROMay 10, 2025Code
Quadrupedal Robot Skateboard Mounting via Reverse Curriculum LearningDanil Belov, Artem Erkhov, Elizaveta Pestova et al.
The aim of this work is to enable quadrupedal robots to mount skateboards using Reverse Curriculum Reinforcement Learning. Although prior work has demonstrated skateboarding for quadrupeds that are already positioned on the board, the initial mounting phase still poses a significant challenge. A goal-oriented methodology was adopted, beginning with the terminal phases of the task and progressively increasing the complexity of the problem definition to approximate the desired objective. The learning process was initiated with the skateboard rigidly fixed within the global coordinate frame and the robot positioned directly above it. Through gradual relaxation of these initial conditions, the learned policy demonstrated robustness to variations in skateboard position and orientation, ultimately exhibiting a successful transfer to scenarios involving a mobile skateboard. The code, trained models, and reproducible examples are available at the following link: https://github.com/dancher00/quadruped-skateboard-mounting
OCJan 4, 2025
Towards a constructive framework for control theoryPavel Osinenko
This work presents a framework for control theory based on constructive analysis to account for discrepancy between mathematical results and their implementation in a computer, also referred to as computational uncertainty. In control engineering, the latter is usually either neglected or considered submerged into some other type of uncertainty, such as system noise, and addressed within robust control. However, even robust control methods may be compromised when the mathematical objects involved in the respective algorithms fail to exist in exact form and subsequently fail to satisfy the required properties. For instance, in general stabilization using a control Lyapunov function, computational uncertainty may distort stability certificates or even destabilize the system despite robustness of the stabilization routine with regards to system, actuator and measurement noise. In fact, battling numerical problems in practical implementation of controllers is common among control engineers. Such observations indicate that computational uncertainty should indeed be addressed explicitly in controller synthesis and system analysis. The major contribution here is a fairly general framework for proof techniques in analysis and synthesis of control systems based on constructive analysis which explicitly states that every computation be doable only up to a finite precision thus accounting for computational uncertainty. A series of previous works is overviewed, including constructive system stability and stabilization, approximate optimal controls, eigenvalue problems, Caratheodory trajectories, measurable selectors. Additionally, a new constructive version of the Danskin's theorem, which is crucial in adversarial defense, is presented.
LGMay 18, 2025
A universal policy wrapper with guaranteesAnton Bolychev, Georgiy Malaniya, Grigory Yaremenko et al.
We introduce a universal policy wrapper for reinforcement learning agents that ensures formal goal-reaching guarantees. In contrast to standard reinforcement learning algorithms that excel in performance but lack rigorous safety assurances, our wrapper selectively switches between a high-performing base policy -- derived from any existing RL method -- and a fallback policy with known convergence properties. Base policy's value function supervises this switching process, determining when the fallback policy should override the base policy to ensure the system remains on a stable path. The analysis proves that our wrapper inherits the fallback policy's goal-reaching guarantees while preserving or improving upon the performance of the base policy. Notably, it operates without needing additional system knowledge or online constrained optimization, making it readily deployable across diverse reinforcement learning architectures and tasks.
LGMay 18, 2025
Multi-CALF: A Policy Combination Approach with Statistical GuaranteesGeorgiy Malaniya, Anton Bolychev, Grigory Yaremenko et al.
We introduce Multi-CALF, an algorithm that intelligently combines reinforcement learning policies based on their relative value improvements. Our approach integrates a standard RL policy with a theoretically-backed alternative policy, inheriting formal stability guarantees while often achieving better performance than either policy individually. We prove that our combined policy converges to a specified goal set with known probability and provide precise bounds on maximum deviation and convergence time. Empirical validation on control tasks demonstrates enhanced performance while maintaining stability guarantees.
DSNov 24, 2021
A note on stabilizing reinforcement learningPavel Osinenko, Grigory Yaremenko, Ilya Osokin
Reinforcement learning is a general methodology of adaptive optimal control that has attracted much attention in various fields ranging from video game industry to robot manipulators. Despite its remarkable performance demonstrations, plain reinforcement learning controllers do not guarantee stability which compromises their applicability in industry. To provide such guarantees, measures have to be taken. This gives rise to what could generally be called stabilizing reinforcement learning. Concrete approaches range from employment of human overseers to filter out unsafe actions to formally verified shields and fusion with classical stabilizing controllers. A line of attack that utilizes elements of adaptive control has become fairly popular in the recent years. In this note, we critically address such an approach in a fairly general actor-critic setup for nonlinear time-continuous environments. The actor network utilizes a so-called robustifying term that is supposed to compensate for the neural network errors. The corresponding stability analysis is based on the value function itself. We indicate a problem in such a stability analysis and provide a counterexample to the overall control scheme. Implications for such a line of attack in stabilizing reinforcement learning are discussed. Furthermore, unfortunately the said problem possess no fix without a substantial reconsideration of the whole approach. As a positive message, we derive a stochastic critic neural network weight convergence analysis provided that the environment was stabilized.
ROAug 23, 2021
A generalized stacked reinforcement learning method for sampled systemsPavel Osinenko, Dmitrii Dobriborsci, Grigory Yaremenko et al.
A common setting of reinforcement learning (RL) is a Markov decision process (MDP) in which the environment is a stochastic discrete-time dynamical system. Whereas MDPs are suitable in such applications as video-games or puzzles, physical systems are time-continuous. A general variant of RL is of digital format, where updates of the value (or cost) and policy are performed at discrete moments in time. The agent-environment loop then amounts to a sampled system, whereby sample-and-hold is a specific case. In this paper, we propose and benchmark two RL methods suitable for sampled systems. Specifically, we hybridize model-predictive control (MPC) with critics learning the optimal Q- and value (or cost-to-go) function. Optimality is analyzed and performance comparison is done in an experimental case study with a mobile robot.
ROAug 10, 2021
An experimental study of two predictive reinforcement learning methods and comparison with model-predictive controlDmitrii Dobriborsci, Pavel Osinenko
Reinforcement learning (RL) has been successfully used in various simulations and computer games. Industry-related applications, such as autonomous mobile robot motion control, are somewhat challenging for RL up to date though. This paper presents an experimental evaluation of predictive RL controllers for optimal mobile robot motion control. As a baseline for comparison, model-predictive control (MPC) is used. Two RL methods are tested: a roll-out Q-learning, which may be considered as MPC with terminal cost being a Q-function approximation, and a so-called stacked Q-learning, which in turn is like MPC with the running cost substituted for a Q-function approximation. The experimental foundation is a mobile robot with a differential drive (Robotis Turtlebot3). Experimental results showed that both RL methods beat the baseline in terms of the accumulated cost, whereas the stacked variant performed best. Provided the series of previous works on stacked Q-learning, this particular study supports the idea that MPC with a running cost adaptation inspired by Q-learning possesses potential of performance boost while retaining the nice properties of MPC.