55.5 LG Apr 23
Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement
Mahdi Kallel, Johannes Tölle, Ahmed Hendawy et al.
Standard supervised classification trains models to imitate the exact labels provided by a perfect oracle. This imitation happens in a single pass, restricting the model to a fixed compute budget even when inputs vary in complexity. Moreover, the rigid training objective forces the model to express absolute certainty on its training data, resulting in overconfident predictions during evaluation. We propose Reinforced Iterative Classification (RIC), which replaces the imitative objective with Reinforcement Learning (RL). RIC deploys a recurrent agent that iteratively updates a predictive distribution over classes, receiving reward for stepwise improvement in prediction quality. The value function provides a natural halting criterion by estimating the remaining scope for improvement. We prove that the iterative formulation recovers the same optimal predictions as cross-entropy while yielding an anytime classifier. On image classification benchmarks, RIC matches the accuracy of supervised baselines with improved calibration and learns to allocate computation adaptively across inputs.
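The stepwise reward and value-based halting described in this abstract can be illustrated with a small sketch. The abstract does not give the exact reward, so the code assumes a hypothetical one: the per-step gain in log-probability assigned to the true class. Under that assumption the rewards telescope, so the undiscounted return equals the final log-likelihood minus the initial one, which is one way to see how an iterative objective could agree with cross-entropy. The names `stepwise_rewards` and `halting_step`, and the threshold, are illustrative only.

```python
import math

def stepwise_rewards(beliefs, label):
    """Reward at each step = gain in log-probability of the true class.

    ASSUMPTION: this reward definition is illustrative; the abstract only
    says reward is given for stepwise improvement in prediction quality.
    `beliefs` is a list of class distributions, one per refinement step.
    """
    log_probs = [math.log(b[label]) for b in beliefs]
    return [log_probs[t + 1] - log_probs[t] for t in range(len(log_probs) - 1)]

def halting_step(value_estimates, threshold=0.01):
    """Return the first step whose estimated remaining improvement is small.

    ASSUMPTION: `value_estimates[t]` stands in for a learned value function
    evaluated at step t; the threshold is a hypothetical hyperparameter.
    Falls back to the last step if the threshold is never reached.
    """
    for t, v in enumerate(value_estimates):
        if v < threshold:
            return t
    return len(value_estimates) - 1

# Three refinement steps on a two-class problem, true class 0.
beliefs = [[0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]
rewards = stepwise_rewards(beliefs, label=0)
total = sum(rewards)  # telescopes to log(0.9) - log(0.5)
```

Because the sum telescopes, an agent that maximizes this return is pushed toward the same final belief that minimizing cross-entropy would favor, while every intermediate belief remains a usable prediction.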
LG Jul 5, 2024
Augmented Bayesian Policy Search
Mahdi Kallel, Debabrota Basu, Riad Akrour et al.
Deterministic policies are often preferred over stochastic ones when implemented on physical systems. They can prevent erratic and harmful behaviors while being easier to implement and interpret. However, in practice, exploration is largely performed by stochastic policies. First-order Bayesian Optimization (BO) methods offer a principled way of performing exploration using deterministic policies. This is done through a learned probabilistic model of the objective function and its gradient. Nonetheless, such approaches treat policy search as a black-box problem and thus neglect the reinforcement learning nature of the problem. In this work, we leverage the performance difference lemma to introduce a novel mean function for the probabilistic model. This results in augmenting BO methods with the action-value function. Hence, we call our method Augmented Bayesian Search (ABS). Interestingly, this new mean function enhances the posterior gradient with the deterministic policy gradient, effectively bridging the gap between BO and policy gradient methods. The resulting algorithm combines the convenience of direct policy search with the scalability of reinforcement learning. We validate ABS on high-dimensional locomotion problems and demonstrate competitive performance compared to existing direct policy search schemes.
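For reference, the performance difference lemma mentioned in this abstract can be written as follows for a deterministic policy. This is the standard textbook form; the notation ($J$ for expected discounted return, $d^{\pi'}$ for the normalized discounted state distribution of $\pi'$, $Q^{\pi}$ and $V^{\pi}$ for the value functions of $\pi$) is assumed here, and the abstract does not specify how ABS turns the lemma into a mean function.

```latex
J(\pi') - J(\pi)
  = \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d^{\pi'}}
    \left[ Q^{\pi}\bigl(s, \pi'(s)\bigr) - V^{\pi}(s) \right]
```

The right-hand side depends on the action-value function of the current policy, which is consistent with the abstract's remark that the new mean function augments BO with the action-value function.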
LG Oct 2, 2025
Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
Ahmed Hendawy, Henrik Metternich, Théo Vincent et al.
The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well known to lead to unstable learning. In this work, we aim to obtain the best of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple yet effective modification, we show that MINTO enables faster and stable value function learning by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms at negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
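A minimal sketch of the update rule described in this abstract, for a discrete-action, Q-learning-style target. The abstract only states that the target uses the minimum estimate between the target and online networks, so the details below are assumptions: the minimum is taken per action before the greedy maximum, and the function name `minto_target` and its arguments are hypothetical.

```python
def minto_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Bootstrapped target from the minimum of online and target estimates.

    ASSUMPTION: `q_online_next` and `q_target_next` are per-action value
    estimates at the next state from the online and target networks, and
    the minimum is taken per action before the greedy max. The abstract
    does not fix this ordering.
    """
    # Pessimistic per-action estimate: the smaller of the two networks.
    q_min = [min(qo, qt) for qo, qt in zip(q_online_next, q_target_next)]
    # No bootstrapping past a terminal transition.
    bootstrap = 0.0 if done else max(q_min)
    return reward + gamma * bootstrap

# Online network is optimistic about action 1; the min tempers it.
y = minto_target(1.0, False, [2.0, 5.0], [3.0, 1.0], gamma=0.5)
y_terminal = minto_target(1.0, True, [2.0, 5.0], [3.0, 1.0], gamma=0.5)
```

Here the online estimate of 5.0 for the second action is capped by the target network's 1.0, so the bootstrap value comes from the first action instead. This is the mechanism by which a minimum can curb overestimation while still letting fresh online estimates through whenever they are the lower of the two.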