Marvin Alles

LG
h-index44
5papers
32citations
Novelty52%
AI Score39

5 Papers

LGDec 10, 2025
Latent Action World Models for Control with Unlabeled Trajectories

Marvin Alles, Xingyuan Zhang, Patrick van der Smagt et al.

Inspired by how humans combine direct interaction with action-free experience (e.g., videos), we study world models that learn from heterogeneous data. Standard world models typically rely on action-conditioned trajectories, which limits effectiveness when action labels are scarce. We introduce a family of latent-action world models that jointly use action-conditioned and action-free data by learning a shared latent action representation. This latent space aligns observed control signals with actions inferred from passive observations, enabling a single dynamics model to train on large-scale unlabeled trajectories while requiring only a small set of action-labeled ones. We use the latent-action world model to learn a latent-action policy through offline reinforcement learning (RL), thereby bridging two traditionally separate domains: offline RL, which typically relies on action-conditioned data, and action-free training, which is rarely used with subsequent RL. On the DeepMind Control Suite, our approach achieves strong performance while using about an order of magnitude fewer action-labeled samples than purely action-conditioned baselines. These results show that latent actions enable training on both passive and interactive data, which makes world models learn more efficiently.

LGNov 7, 2024
Constrained Latent Action Policies for Model-Based Offline Reinforcement Learning

Marvin Alles, Philip Becker-Ehmck, Patrick van der Smagt et al.

In offline reinforcement learning, a policy is learned using a static dataset in the absence of costly feedback from the environment. In contrast to the online setting, only using static datasets poses additional challenges, such as policies generating out-of-distribution samples. Model-based offline reinforcement learning methods try to overcome these by learning a model of the underlying dynamics of the environment and using it to guide policy search. It is beneficial but, with limited datasets, errors in the model and the issue of value overestimation among out-of-distribution states can worsen performance. Current model-based methods apply some notion of conservatism to the Bellman update, often implemented using uncertainty estimation derived from model ensembles. In this paper, we propose Constrained Latent Action Policies (C-LAP) which learns a generative model of the joint distribution of observations and actions. We cast policy learning as a constrained objective to always stay within the support of the latent action distribution, and use the generative capabilities of the model to impose an implicit constraint on the generated actions. Thereby eliminating the need to use additional uncertainty penalties on the Bellman update and significantly decreasing the number of gradient steps required to learn a policy. We empirically evaluate C-LAP on the D4RL and V-D4RL benchmark, and show that C-LAP is competitive to state-of-the-art methods, especially outperforming on datasets with visual observations.

LGMay 20, 2025
FlowQ: Energy-Guided Flow Policies for Offline Reinforcement Learning

Marvin Alles, Nutan Chen, Patrick van der Smagt et al.

The use of guidance to steer sampling toward desired outcomes has been widely explored within diffusion models, especially in applications such as image and trajectory generation. However, incorporating guidance during training remains relatively underexplored. In this work, we introduce energy-guided flow matching, a novel approach that enhances the training of flow models and eliminates the need for guidance at inference time. We learn a conditional velocity field corresponding to the flow policy by approximating an energy-guided probability path as a Gaussian path. Learning guided trajectories is appealing for tasks where the target distribution is defined by a combination of data and an energy function, as in reinforcement learning. Diffusion-based policies have recently attracted attention for their expressive power and ability to capture multi-modal action distributions. Typically, these policies are optimized using weighted objectives or by back-propagating gradients through actions sampled by the policy. As an alternative, we propose FlowQ, an offline reinforcement learning algorithm based on energy-guided flow matching. Our method achieves competitive performance while the policy training time is constant in the number of flow sampling steps.

ROMar 22, 2024
Guided Decoding for Robot On-line Motion Generation and Adaption

Nutan Chen, Botond Cseke, Elie Aljalbout et al.

We present a novel motion generation approach for robot arms, with high degrees of freedom, in complex settings that can adapt online to obstacles or new via points. Learning from Demonstration facilitates rapid adaptation to new tasks and optimizes the utilization of accumulated expertise by allowing robots to learn and generalize from demonstrated trajectories. We train a transformer architecture, based on conditional variational autoencoder, on a large dataset of simulated trajectories used as demonstrations. Our architecture learns essential motion generation skills from these demonstrations and is able to adapt them to meet auxiliary tasks. Additionally, our approach implements auto-regressive motion generation to enable real-time adaptations, as, for example, introducing or changing via-points, and velocity and acceleration constraints. Using beam search, we present a method for further adaption of our motion generator to avoid obstacles. We show that our model successfully generates motion from different initial and target points and that is capable of generating trajectories that navigate complex tasks across different robotic platforms.

ROOct 8, 2021
Learning to Centralize Dual-Arm Assembly

Marvin Alles, Elie Aljalbout

Robotic manipulators are widely used in modern manufacturing processes. However, their deployment in unstructured environments remains an open problem. To deal with the variety, complexity, and uncertainty of real-world manipulation tasks, it is essential to develop a flexible framework with reduced assumptions on the environment characteristics. In recent years, reinforcement learning (RL) has shown great results for single-arm robotic manipulation. However, research focusing on dual-arm manipulation is still rare. From a classical control perspective, solving such tasks often involves complex modeling of interactions between two manipulators and the objects encountered in the tasks, as well as the two robots coupling at a control level. Instead, in this work, we explore the applicability of model-free RL to dual-arm assembly. As we aim to contribute towards an approach that is not limited to dual-arm assembly, but dual-arm manipulation in general, we keep modeling efforts at a minimum. Hence, to avoid modeling the interaction between the two robots and the used assembly tools, we present a modular approach with two decentralized single-arm controllers which are coupled using a single centralized learned policy. We reduce modeling effort to a minimum by using sparse rewards only. Our architecture enables successful assembly and simple transfer from simulation to the real world. We demonstrate the effectiveness of the framework on dual-arm peg-in-hole and analyze sample efficiency and success rates for different action spaces. Moreover, we compare results on different clearances and showcase disturbance recovery and robustness, when dealing with position uncertainties. Finally we zero-shot transfer policies trained in simulation to the real world and evaluate their performance.