Rinu Boney

LG
11 papers
205 citations
Novelty 45%
AI Score 30

11 Papers

LG · Oct 25, 2022 · Code
Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning

Yi Zhao, Rinu Boney, Alexander Ilin et al.

Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment. However, depending on the quality of the offline dataset, such pre-trained agents may have limited performance and would further need to be fine-tuned online by interacting with the environment. During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data. While constraints enforced by offline RL methods such as a behavior cloning loss prevent this to an extent, these constraints also significantly slow down online fine-tuning by forcing the agent to stay close to the behavior policy. We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability. Moreover, we use a randomized ensemble of Q functions to further increase the sample efficiency of online fine-tuning by performing a large number of learning updates. Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark. Code is available at https://github.com/zhaoyi11/adaptive_bc.

LG · Jun 15, 2023
Simplified Temporal Consistency Reinforcement Learning

Yi Zhao, Wenshuai Zhao, Rinu Boney et al.

Reinforcement learning is able to solve complex sequential decision-making tasks but is currently limited by sample efficiency and required computation. To improve sample efficiency, recent work focuses on model-based RL which interleaves model learning with planning. Recent methods further utilize policy learning, value estimation, and self-supervised learning as auxiliary objectives. In this paper, we show that, surprisingly, a simple representation learning approach relying only on a latent dynamics model trained by latent temporal consistency is sufficient for high-performance RL. This applies when using pure planning with a dynamics model conditioned on the representation, but also when utilizing the representation as policy and value function features in model-free RL. In experiments, our approach learns an accurate dynamics model to solve challenging high-dimensional locomotion tasks with online planners while being 4.1 times faster to train compared to ensemble-based methods. With model-free RL without planning, especially on high-dimensional tasks, such as the DeepMind Control Suite Humanoid and Dog tasks, our approach outperforms model-free methods by a large margin and matches model-based methods' sample efficiency while training 2.4 times faster.

CV · Jul 2, 2024
Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Elmira Amirloo, Jean-Philippe Fauconnier, Christoph Roesmann et al.

Preference alignment has become a crucial component in enhancing the performance of Large Language Models (LLMs), yet its impact in Multimodal Large Language Models (MLLMs) remains comparatively underexplored. Similar to language models, MLLMs for image understanding tasks encounter challenges like hallucination. In MLLMs, hallucination can occur not only by stating incorrect facts but also by producing responses that are inconsistent with the image content. A primary objective of alignment for MLLMs is to encourage these models to align responses more closely with image information. Recently, multiple works have introduced preference datasets for MLLMs and examined different alignment methods, including Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). However, due to variations in datasets, base model types, and alignment methods, it remains unclear which specific elements contribute most significantly to the reported improvements in these works. In this paper, we independently analyze each aspect of preference alignment in MLLMs. We start by categorizing the alignment algorithms into two groups, offline (such as DPO) and online (such as online-DPO), and show that combining offline and online methods can improve the performance of the model in certain scenarios. We review a variety of published multimodal preference datasets and discuss how the details of their construction impact model performance. Based on these insights, we introduce a novel way of creating multimodal preference data called Bias-Driven Hallucination Sampling (BDHS) that needs neither additional annotation nor external models, and show that it can achieve performance competitive with previously published alignment work for multimodal models across a range of benchmarks.

RO · Nov 5, 2020 · Code
RealAnt: An Open-Source Low-Cost Quadruped for Education and Research in Real-World Reinforcement Learning

Rinu Boney, Jussi Sainio, Mikko Kaivola et al.

Current robot platforms available for research are either very expensive or unable to handle the abuse of exploratory controls in reinforcement learning. We develop RealAnt, a minimal low-cost physical version of the popular 'Ant' benchmark used in reinforcement learning. RealAnt costs only about 350 EUR ($410) in materials and can be assembled in less than an hour. We validate the platform with reinforcement learning experiments and provide baseline results on a set of benchmark tasks. We demonstrate that the RealAnt robot can learn to walk from scratch from less than 10 minutes of experience. We also provide simulator versions of the robot (with the same dimensions, state-action spaces, and delayed noisy observations) in the MuJoCo and PyBullet simulators. We open-source hardware designs, supporting software, and baseline results for educational use and reproducible research.

RO · Aug 3, 2020 · Code
Learning to Drive (L2D) as a Low-Cost Benchmark for Real-World Reinforcement Learning

Ari Viitala, Rinu Boney, Yi Zhao et al.

We present Learning to Drive (L2D), a low-cost benchmark for real-world reinforcement learning (RL). L2D involves a simple and reproducible experimental setup where an RL agent has to learn to drive a Donkey car around three miniature tracks, given only monocular image observations and the speed of the car. The agent has to learn to drive from disengagements, which occur when it drives off the track. We present and open-source our training pipeline, which makes it straightforward to apply any existing RL algorithm to the task of autonomous driving with a Donkey car. We test imitation learning, state-of-the-art model-free, and model-based algorithms on the proposed L2D benchmark. Our results show that existing RL algorithms can learn to drive the car from scratch in less than five minutes of interaction. We demonstrate that RL algorithms can learn from sparse and noisy disengagements to drive even faster than imitation learning and a human operator.

LG · Jun 15, 2021
Learning of feature points without additional supervision improves reinforcement learning from images

Rinu Boney, Alexander Ilin, Juho Kannala

In many control problems that include vision, optimal controls can be inferred from the location of the objects in the scene. This information can be represented using feature points, which are a list of spatial locations in learned feature maps of an input image. Previous works show that feature points learned using unsupervised pre-training or human supervision can provide good features for control tasks. In this paper, we show that it is possible to learn efficient feature point representations end-to-end, without the need for unsupervised pre-training, decoders, or additional losses. Our proposed architecture consists of a differentiable feature point extractor that feeds the coordinates of the estimated feature points directly to a soft actor-critic agent. The proposed algorithm yields performance competitive with the state of the art on DeepMind Control Suite tasks.
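As an illustration of what a differentiable feature point extractor can look like, here is a minimal sketch of a spatial soft-argmax, a common way to turn one feature map into an (x, y) coordinate. The abstract does not say which extractor the paper uses, so this is an assumption for illustration; the function name `soft_argmax_2d` and the [-1, 1] coordinate convention are hypothetical choices.

```python
import math

def soft_argmax_2d(feature_map):
    """Expected (x, y) location under a softmax over one 2-D feature map.

    Illustrative sketch only (not necessarily the paper's extractor).
    Coordinates are normalized to [-1, 1]; the map must be at least 2x2.
    """
    h, w = len(feature_map), len(feature_map[0])
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(max(row) for row in feature_map)
    weights = [[math.exp(v - m) for v in row] for row in feature_map]
    total = sum(sum(row) for row in weights)
    x = y = 0.0
    for i in range(h):
        for j in range(w):
            p = weights[i][j] / total          # softmax probability of cell (i, j)
            x += p * (2.0 * j / (w - 1) - 1.0)  # column index -> x in [-1, 1]
            y += p * (2.0 * i / (h - 1) - 1.0)  # row index -> y in [-1, 1]
    return x, y
```

Because the output is a smooth function of the feature map, gradients from a downstream agent can flow back into the layers that produce the map. One such coordinate pair per feature map gives the list of feature points the abstract describes.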

AI · Dec 22, 2020
Learning to Play Imperfect-Information Games by Imitating an Oracle Planner

Rinu Boney, Alexander Ilin, Juho Kannala et al.

We consider learning to play multiplayer imperfect-information games with simultaneous moves and large state-action spaces. Previous attempts to tackle such challenging games have largely focused on model-free learning methods, often requiring hundreds of years of experience to produce competitive agents. Our approach is based on model-based planning. We tackle the problem of partial observability by first building an (oracle) planner that has access to the full state of the environment and then distilling the knowledge of the oracle to a (follower) agent which is trained to play the imperfect-information game by imitating the oracle's choices. We experimentally show that planning with naive Monte Carlo tree search does not perform very well in large combinatorial action spaces. We therefore propose planning with a fixed-depth tree search and decoupled Thompson sampling for action selection. We show that the planner is able to discover efficient playing strategies in the games of Clash Royale and Pommerman and the follower policy successfully learns to implement them by training on a few hundred battles.
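To make the action-selection idea concrete, the sketch below shows one plausible reading of decoupled Thompson sampling: each decision slot (for example, each player or unit) keeps its own win/loss counts per action and samples independently, so the joint action is never enumerated. The abstract does not give the paper's exact formulation; the Beta-Bernoulli model, the class name `DecoupledThompson`, and the binary win/loss outcome are all assumptions made for illustration.

```python
import random

class DecoupledThompson:
    """Independent Beta-Bernoulli Thompson sampling per decision slot.

    Hypothetical sketch: statistics are kept per (slot, action) rather than
    per joint action, which keeps memory linear in the number of slots.
    """

    def __init__(self, n_slots, n_actions, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior for every (slot, action) pair.
        self.wins = [[1.0] * n_actions for _ in range(n_slots)]
        self.losses = [[1.0] * n_actions for _ in range(n_slots)]

    def select(self):
        """Sample a success rate for each action and pick the best per slot."""
        joint = []
        for w, l in zip(self.wins, self.losses):
            samples = [self.rng.betavariate(a, b) for a, b in zip(w, l)]
            joint.append(max(range(len(samples)), key=samples.__getitem__))
        return joint

    def update(self, joint, won):
        """Credit the observed outcome to the action chosen in every slot."""
        for slot, action in enumerate(joint):
            if won:
                self.wins[slot][action] += 1.0
            else:
                self.losses[slot][action] += 1.0
```

In a fixed-depth search, one such object could sit at each tree node, with `select` called during simulation and `update` called with the rollout outcome.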

LG · Oct 12, 2019
Regularizing Model-Based Planning with Energy-Based Models

Rinu Boney, Juho Kannala, Alexander Ilin

Model-based reinforcement learning could enable sample-efficient learning by quickly acquiring rich knowledge about the world and using it to improve behaviour without additional data. Learned dynamics models can be directly used for planning actions but this has been challenging because of inaccuracies in the learned models. In this paper, we focus on planning with learned dynamics models and propose to regularize it using energy estimates of state transitions in the environment. We visually demonstrate the effectiveness of the proposed method and show that off-policy training of an energy estimator can be effectively used to regularize planning with pre-trained dynamics models. Further, we demonstrate that the proposed method enables sample-efficient learning to achieve competitive performance in challenging continuous control tasks such as Half-cheetah and Ant in just a few minutes of experience.

LG · Mar 28, 2019
Regularizing Trajectory Optimization with Denoising Autoencoders

Rinu Boney, Norman Di Palo, Mathias Berglund et al.

Trajectory optimization using a learned model of the environment is one of the core elements of model-based reinforcement learning. This procedure often suffers from exploiting inaccuracies of the learned model. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. We show that the proposed regularization leads to improved planning with both gradient-based and gradient-free optimizers. We also demonstrate that using regularized trajectory optimization leads to rapid initial learning in a set of popular motor control tasks, which suggests that the proposed approach can be a useful tool for improving sample efficiency.

LG · Nov 29, 2017
Semi-Supervised and Active Few-Shot Learning with Prototypical Networks

Rinu Boney, Alexander Ilin

We consider the problem of semi-supervised few-shot classification where a classifier needs to adapt to new tasks using a few labeled examples and (potentially many) unlabeled examples. We propose a clustering approach to the problem. The features extracted with Prototypical Networks are clustered using K-means with the few labeled examples guiding the clustering process. We note that in many real-world applications the adaptation performance can be significantly improved by requesting the few labels through user feedback. We demonstrate good performance of the active adaptation strategy using image data.
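The clustering step described above can be sketched as a label-seeded K-means over already-extracted feature vectors. This is a minimal illustration rather than the paper's implementation: it assumes the features are plain lists of floats, that labeled points stay pinned to their class, and that squared Euclidean distance is used; the function names `seeded_kmeans` and `classify` are hypothetical.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def seeded_kmeans(labeled, unlabeled, iters=10):
    """K-means whose clusters are initialized and anchored by labeled examples.

    labeled: dict mapping class -> list of feature vectors (the few shots).
    unlabeled: list of feature vectors without labels.
    Returns a dict mapping class -> refined prototype.
    """
    classes = sorted(labeled)
    # Initial prototypes are the class means of the labeled examples.
    protos = {c: mean(labeled[c]) for c in classes}
    for _ in range(iters):
        # Labeled points always stay in their own class (assumption).
        assigned = {c: list(labeled[c]) for c in classes}
        for x in unlabeled:
            nearest = min(classes, key=lambda k: dist2(x, protos[k]))
            assigned[nearest].append(x)
        protos = {c: mean(assigned[c]) for c in classes}
    return protos

def classify(x, protos):
    """Assign a query vector to the class of its nearest prototype."""
    return min(protos, key=lambda k: dist2(x, protos[k]))
```

For example, with one labeled shot per class and a handful of unlabeled points, `seeded_kmeans` pulls each prototype toward the unlabeled points near it, and `classify` then labels new queries by nearest prototype.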

NE · Jul 28, 2017
Recurrent Ladder Networks

Isabeau Prémont-Schwarz, Alexander Ilin, Tele Hotloo Hao et al.

We propose a recurrent extension of the Ladder networks whose structure is motivated by the inference required in hierarchical latent variable models. We demonstrate that the recurrent Ladder is able to handle a wide variety of complex learning tasks that benefit from iterative inference and temporal modeling. The architecture shows close-to-optimal results on temporal modeling of video data, competitive results on music modeling, and improved perceptual grouping based on higher-order abstractions, such as stochastic textures and motion cues. We present results for fully supervised, semi-supervised, and unsupervised tasks. The results suggest that the proposed architecture and principles are powerful tools for learning a hierarchy of abstractions, learning iterative inference, and handling temporal information.