17.1 RO Jun 3
MoDex: A Diffusion Policy for Sequential Multi-Object Dexterous Grasping
Haofei Lu, Hongjia Liu, Yifei Dong et al.
This work addresses sequentially grasping multiple objects with a single dexterous hand without releasing those already held. Most dexterous grasping methods commit all of the hand's degrees of freedom to a single object, underutilizing its dexterity and leaving no redundancy for subsequent grasps. The proposed solution, MoDex, is a diffusion policy that predicts the next gripper pose directly from observations, conditioned on an opposition space and point cloud. The opposition space condition specifies which fingers participate in the current grasp, enabling the gripper to use only a subset of its available degrees of freedom while reserving the remaining degrees of freedom for subsequent grasps. To facilitate sim-to-real transfer, MoDex is trained in two stages: first through imitation learning on expert demonstrations, and subsequently through reinforcement learning fine-tuning, which consistently improves success rates over the pre-trained policy. We evaluate MoDex in simulation on a MuJoCo-based Franka Emika Panda robot equipped with an Allegro Hand and on the corresponding real-world hardware platform. Across both simulation and real-world experiments, MoDex achieves higher success rates than the evaluated learning-based baselines, improving performance by 2.92-17.92% and 6.67-17.78%, respectively. Project page: https://modex2026.github.io/.
LG Jun 16, 2022
Active Nearest Neighbor Regression Through Delaunay Refinement
Alexander Kravberg, Giovanni Luca Marchetti, Vladislav Polianskii et al.
We introduce an algorithm for active function approximation based on nearest neighbor regression. Our Active Nearest Neighbor Regressor (ANNR) relies on the Voronoi-Delaunay framework from computational geometry to subdivide the space into cells with constant estimated function value and select novel query points in a way that takes the geometry of the function graph into account. We consider the recent state-of-the-art active function approximator called DEFER, which is based on incremental rectangular partitioning of the space, as the main baseline. The ANNR addresses a number of limitations that arise from the space subdivision strategy used in DEFER. We provide a computationally efficient implementation of our method, as well as theoretical halting guarantees. Empirical results show that ANNR outperforms the baseline for both closed-form functions and real-world examples, such as gravitational wave parameter inference and exploration of the latent space of a generative model.
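The idea of nearest neighbor regression with geometry-aware query selection can be illustrated in one dimension. The following is a minimal sketch and not the ANNR algorithm itself: in 1D the Delaunay triangulation reduces to the intervals between sorted samples, and the selection rule used here (query the midpoint of the interval whose segment on the function graph is longest) is an assumed stand-in for the paper's refinement criterion. The names `nn_predict`, `next_query`, and `active_fit` are hypothetical.

```python
import math


def nn_predict(xs, ys, x):
    """1-nearest-neighbor regression: return the value of the closest sample.

    The estimate is constant on each Voronoi cell of the sample set.
    """
    i = min(range(len(xs)), key=lambda k: abs(xs[k] - x))
    return ys[i]


def next_query(xs, ys):
    """Return the midpoint of the 1D Delaunay edge with the longest graph segment.

    Edge length is measured between lifted points (x, f(x)), so steep regions
    of the function attract more queries (an assumed heuristic).
    """
    pts = sorted(zip(xs, ys))
    a, b = max(
        zip(pts, pts[1:]),
        key=lambda e: math.hypot(e[1][0] - e[0][0], e[1][1] - e[0][1]),
    )
    return 0.5 * (a[0] + b[0])


def active_fit(f, lo, hi, budget):
    """Start from the interval endpoints and add `budget` actively chosen samples."""
    xs = [lo, hi]
    ys = [f(lo), f(hi)]
    for _ in range(budget):
        x = next_query(xs, ys)
        xs.append(x)
        ys.append(f(x))
    return xs, ys
```

On a step function, this rule concentrates samples near the discontinuity, while the regressor stays piecewise constant away from it.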
RO Dec 10, 2025
ViTA-Seg: Vision Transformer for Amodal Segmentation in Robotics
Donato Caramia, Florian T. Pokorny, Giuseppe Triggiani et al.
Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We propose two architectures: a) Single-Head for amodal mask prediction; b) Dual-Head for amodal and occluded mask prediction. We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenarios. Extensive experiments on two amodal benchmarks, COOCA and KINS, demonstrate that ViTA-Seg Dual-Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency, enabling robust, real-time robotic manipulation.
RO Oct 3, 2023
How Physics and Background Attributes Impact Video Transformers in Robotic Manipulation: A Case Study on Planar Pushing
Shutong Jin, Ruiyu Wang, Muhammad Zahid et al.
As model and dataset sizes continue to scale in robot learning, the need to understand how the composition and properties of a dataset affect model performance becomes increasingly urgent to ensure cost-effective data collection and model performance. In this work, we empirically investigate how physics attributes (color, friction coefficient, shape) and scene background characteristics, such as the complexity and dynamics of interactions with background objects, influence the performance of Video Transformers in predicting planar pushing trajectories. We investigate three primary questions: How do physics attributes and background scene characteristics influence model performance? What kind of changes in attributes are most detrimental to model generalization? What proportion of fine-tuning data is required to adapt models to novel scenarios? To facilitate this research, we present CloudGripper-Push-1K, a large real-world vision-based robot pushing dataset comprising 1278 hours and 460,000 videos of planar pushing interactions with objects with different physics and background attributes. We also propose Video Occlusion Transformer (VOT), a generic modular video-transformer-based trajectory prediction framework which features 3 choices of 2D-spatial encoders as the subject of our case study. The dataset and source code are available at https://cloudgripper.org.
11.7 RO Mar 19
Speculative Policy Orchestration: A Latency-Resilient Framework for Cloud-Robotic Manipulation
Chanh Nguyen, Shutong Jin, Florian T. Pokorny et al.
Cloud robotics enables robots to offload high-dimensional motion planning and reasoning to remote servers. However, for continuous manipulation tasks requiring high-frequency control, network latency and jitter can severely destabilize the system, causing command starvation and unsafe physical execution. To address this, we propose Speculative Policy Orchestration (SPO), a latency-resilient cloud-edge framework. SPO utilizes a cloud-hosted world model to pre-compute and stream future kinematic waypoints to a local edge buffer, decoupling execution frequency from network round-trip time. To mitigate unsafe execution caused by predictive drift, the edge node employs an $ε$-tube verifier that strictly bounds kinematic execution errors. The framework is coupled with an Adaptive Horizon Scaling mechanism that dynamically expands or shrinks the speculative pre-fetch depth based on real-time tracking error. We evaluate SPO on continuous RLBench manipulation tasks under emulated network delays. Results show that even when deployed with learned models of modest accuracy, SPO reduces network-induced idle time by over 60% compared to blocking remote inference. Furthermore, SPO discards approximately 60% fewer cloud predictions than static caching baselines. Ultimately, SPO enables fluid, real-time cloud-robotic control while maintaining bounded physical safety.
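The interplay between the edge buffer, the $ε$-tube check, and horizon scaling can be sketched as follows. This is a simplified illustration under stated assumptions, not the SPO implementation: the Euclidean error metric, the doubling and halving rule, the 0.5 threshold, and the names `scale_horizon` and `run_edge` are all hypothetical choices made for the example.

```python
import math
from collections import deque


def scale_horizon(horizon, error, eps, h_min=1, h_max=32):
    """Adapt the speculative pre-fetch depth to the current tracking error.

    Assumed rule: double the horizon when the error is well inside the tube,
    halve it when the tube is violated, otherwise keep it unchanged.
    """
    if error <= 0.5 * eps:
        return min(h_max, horizon * 2)
    if error > eps:
        return max(h_min, horizon // 2)
    return horizon


def run_edge(buffer, observations, eps, horizon):
    """Execute buffered waypoints while each stays within eps of the observation.

    On a violation the remaining speculative waypoints are discarded, so the
    edge node never executes predictions that have drifted outside the tube.
    """
    executed = []
    for observed in observations:
        if not buffer:  # buffer starved: nothing left to execute
            break
        waypoint = buffer.popleft()
        error = math.dist(waypoint, observed)
        horizon = scale_horizon(horizon, error, eps)
        if error > eps:
            buffer.clear()  # drop stale predictions and wait for the cloud
            break
        executed.append(waypoint)
    return executed, horizon
```

Because the loop only consumes waypoints that are already buffered locally, execution frequency does not depend on the network round-trip time; the network only determines how quickly the buffer is refilled.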
CV Feb 1
Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas
Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong et al.
We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (such as images, point clouds, videos, or event camera streams), our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, but only with partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets that span 4 modalities. We find that either PaPE or PaPE-RI achieves the top performance on 7 out of 8 datasets. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best position encoding. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.
LG Feb 12, 2025
One-Shot Federated Learning with Classifier-Free Diffusion Models
Obaidullah Zaland, Shutong Jin, Florian T. Pokorny et al.
Federated learning (FL) enables collaborative learning without data centralization but introduces significant communication costs due to multiple communication rounds between clients and the server. One-shot federated learning (OSFL) addresses this by forming a global model with a single communication round, often relying on server-side model distillation or auxiliary dataset generation, frequently through pre-trained diffusion models (DMs). Existing DM-assisted OSFL methods, however, typically employ classifier-guided DMs, which require training auxiliary classifier models at each client, introducing additional computation overhead. This work introduces OSCAR (One-Shot Federated Learning with Classifier-Free Diffusion Models), a novel OSFL approach that eliminates the need for auxiliary models. OSCAR uses foundation models to devise category-specific data representations at each client, seamlessly integrated into a classifier-free diffusion model pipeline for server-side data generation. OSCAR is a simple yet cost-effective OSFL approach that outperforms the state-of-the-art on four benchmarking datasets while reducing the communication load by at least 99%.
AI Apr 15, 2025
Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs
Chang Yang, Ruiyu Wang, Junzhe Jiang et al.
Reasoning is the fundamental capability of large language models (LLMs). Due to the rapid progress of LLMs, current benchmarks face two main issues: i) they can be crushed in a short time (less than 1 year), and ii) they may be easily hacked. To handle these issues, we propose ever-scalingness as a principle for building benchmarks that are uncrushable, unhackable, auto-verifiable and general. This paper presents the Nondeterministic Polynomial-time Problem Challenge (NPPC), an ever-scaling reasoning benchmark for LLMs. Specifically, NPPC has three main modules: i) npgym, which provides a unified interface to 25 well-known NP-complete problems and can generate any number of instances at any level of complexity, ii) npsolver, which provides a unified interface to evaluate the problem instances with both online and offline models via APIs and local deployments, respectively, and iii) npeval, which provides comprehensive and ready-to-use tools to analyze the performance of LLMs over different problems, the number of tokens, the aha moments, the reasoning errors and the solution errors. Extensive experiments over widely-used LLMs demonstrate: i) NPPC can successfully decrease the performance of advanced LLMs to below 10%, demonstrating that NPPC is uncrushable, ii) DeepSeek-R1, Claude-3.7-Sonnet, and o1/o3-mini are the most powerful LLMs, where DeepSeek-R1 outperforms Claude-3.7-Sonnet and o1/o3-mini on most of the NP-complete problems considered, and iii) the numbers of tokens and aha moments in advanced LLMs, e.g., Claude-3.7-Sonnet and DeepSeek-R1, are observed first to increase and then to decrease as the problem instances become more difficult. We believe that NPPC is the first ever-scaling reasoning benchmark, serving as an uncrushable and unhackable testbed for LLMs toward artificial general intelligence (AGI).
LG Jul 23, 2025
Federated Learning for Large-Scale Cloud Robotic Manipulation: Opportunities and Challenges
Obaidullah Zaland, Chanh Nguyen, Florian T. Pokorny et al.
Federated Learning (FL) is an emerging distributed machine learning paradigm, where the collaborative training of a model involves dynamic participation of devices to achieve broad objectives. In contrast, classical machine learning (ML) typically requires data to be located on-premises for training, whereas FL leverages numerous user devices to train a shared global model without the need to share private data. Current robotic manipulation tasks are constrained by the individual capabilities and speed of robots due to limited low-latency computing resources. Consequently, the concept of cloud robotics has emerged, allowing robotic applications to harness the flexibility and reliability of computing resources, effectively alleviating their computational demands across the cloud-edge continuum. Within this distributed computing context, as exemplified in cloud robotic manipulation scenarios, FL offers manifold advantages while also presenting several challenges and opportunities. In this paper, we present fundamental concepts of FL and their connection to cloud robotic manipulation. Additionally, we envision the opportunities and challenges associated with realizing efficient and reliable cloud robotic manipulation at scale through FL, where researchers design and verify FL models in either centralized or decentralized settings.
RO Mar 28, 2025
Grasping a Handful: Sequential Multi-Object Dexterous Grasp Generation
Haofei Lu, Yifei Dong, Zehang Weng et al.
We introduce the sequential multi-object robotic grasp sampling algorithm SeqGrasp that can robustly synthesize stable grasps on diverse objects using the robotic hand's partial Degrees of Freedom (DoF). We use SeqGrasp to construct the large-scale Allegro Hand sequential grasping dataset SeqDataset and use it for training the diffusion-based sequential grasp generator SeqDiffuser. We experimentally evaluate SeqGrasp and SeqDiffuser against the state-of-the-art non-sequential multi-object grasp generation method MultiGrasp in simulation and on a real robot. The experimental results demonstrate that SeqGrasp and SeqDiffuser reach an 8.71%-43.33% higher grasp success rate than MultiGrasp. Furthermore, SeqDiffuser is approximately 1000 times faster at generating grasps than SeqGrasp and MultiGrasp. Project page: https://yulihn.github.io/SeqGrasp/.
CV Dec 19, 2023
RealCraft: Attention Control as A Tool for Zero-Shot Consistent Video Editing
Shutong Jin, Ruiyu Wang, Florian T. Pokorny
Even though large-scale text-to-image generative models show promising performance in synthesizing high-quality images, applying these models directly to image editing remains a significant challenge. This challenge is further amplified in video editing due to the additional dimension of time. This is especially the case for editing real-world videos as it necessitates maintaining a stable structural layout across frames while executing localized edits without disrupting the existing content. In this paper, we propose RealCraft, an attention-control-based method for zero-shot real-world video editing. By swapping cross-attention for new feature injection and relaxing spatial-temporal attention of the editing object, we achieve localized shape-wise edit along with enhanced temporal consistency. Our model directly uses Stable Diffusion and operates without the need for additional information. We showcase the proposed zero-shot attention-control-based method across a range of videos, demonstrating shape-wise, time-consistent and parameter-free editing in videos of up to 64 frames.
RO Feb 20
Latent Diffeomorphic Co-Design of End-Effectors for Deformable and Fragile Object Manipulation
Kei Ikemura, Yifei Dong, Florian T. Pokorny
Manipulating deformable and fragile objects remains a fundamental challenge in robotics due to complex contact dynamics and strict requirements on object integrity. Existing approaches typically optimize either end-effector design or control strategies in isolation, limiting achievable performance. In this work, we present the first co-design framework that jointly optimizes end-effector morphology and manipulation control for deformable and fragile object manipulation. We introduce (1) a latent diffeomorphic shape parameterization enabling expressive yet tractable end-effector geometry optimization, (2) a stress-aware bi-level co-design pipeline coupling morphology and control optimization, and (3) a privileged-to-pointcloud policy distillation scheme for zero-shot real-world deployment. We evaluate our approach on challenging food manipulation tasks, including grasping and pushing jelly and scooping fillets. Simulation and real-world experiments demonstrate the effectiveness of the proposed method.
RO Mar 6, 2025
Data-augmented Learning of Geodesic Distances in Irregular Domains through Soner Boundary Conditions
Rafael I. Cabral Muchacho, Florian T. Pokorny
Geodesic distances play a fundamental role in robotics, as they efficiently encode global geometric information of the domain. Recent methods use neural networks to approximate geodesic distances by solving the Eikonal equation through physics-informed approaches. While effective, these approaches often suffer from unstable convergence during training in complex environments. We propose a framework to learn geodesic distances in irregular domains by using the Soner boundary condition, and systematically evaluate the impact of data losses on training stability and solution accuracy. Our experiments demonstrate that incorporating data losses significantly improves convergence robustness, reducing training instabilities and sensitivity to initialization. These findings suggest that hybrid data-physics approaches can effectively enhance the reliability of learning-based geodesic distance solvers with sparse data.
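For orientation, the hybrid data-physics objective described above can be written schematically. This is a generic sketch under assumptions rather than the paper's exact formulation: it assumes unit speed, a single source point $x_s$, collocation points $x_i$, supervised samples $(x_j, d_j)$, and a weighting $\lambda$, all of which are illustrative symbols; the Soner boundary condition term is deliberately omitted because the abstract does not specify its form.

```latex
% Eikonal equation for the geodesic distance u to a source x_s (unit speed assumed)
\|\nabla u(x)\| = 1 \quad \text{for } x \in \Omega \setminus \{x_s\},
\qquad u(x_s) = 0 .

% Schematic hybrid loss for a neural approximation u_\theta:
% physics residual on collocation points plus a data term on sparse labels
\mathcal{L}(\theta)
  = \frac{1}{N} \sum_{i=1}^{N} \bigl( \|\nabla u_\theta(x_i)\| - 1 \bigr)^2
  + \lambda \, \frac{1}{M} \sum_{j=1}^{M} \bigl( u_\theta(x_j) - d_j \bigr)^2 .
```

The first term penalizes violations of the Eikonal equation, while the second anchors the solution to the sparse distance samples that the abstract refers to as data losses.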
RO Oct 29, 2024
PACA: Perspective-Aware Cross-Attention Representation for Zero-Shot Scene Rearrangement
Shutong Jin, Ruiyu Wang, Kuangyi Chen et al.
Scene rearrangement, like table tidying, is a challenging task in robotic manipulation due to the complexity of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion can aid by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scenes with the generated goals and to calculate object pose transformations. Current methods typically use a multi-step design that involves separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspectives of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages perspective-aware cross-attention representation derived from Stable Diffusion. Specifically, we develop a representation that integrates generation, segmentation, and feature encoding into a single step to produce object-level representations. Additionally, we introduce perspective control, thus enabling the matching of 6-DoF camera views and extending past approaches that were limited to 3-DoF top-down views. The efficacy of our method is demonstrated through its zero-shot performance in real robot experiments across various scenes, achieving an average matching accuracy and execution success rate of 87% and 67%, respectively.
RO Jul 6, 2021
Approximate Topological Optimization using Multi-Mode Estimation for Robot Motion Planning
Andreas Orthey, Florian T. Pokorny, Marc Toussaint
In this extended abstract, we report on ongoing work towards an approximate multimodal optimization algorithm with asymptotic guarantees. Multimodal optimization is the problem of finding all locally optimal solutions (modes) to a path optimization problem. This is important for compressing path databases, for providing contingencies for replanning, and as a source of symbolic representations. Following ideas from Morse theory, we define modes as paths invariant under optimization of a cost functional. We develop a multi-mode estimation algorithm which approximately finds all modes of a given motion optimization problem and asymptotically converges. This is made possible by integrating sparse roadmaps with an existing single-mode optimization algorithm. Initial evaluation results show the multi-mode estimation algorithm as a promising direction to study path spaces from a topological point of view.
RO Feb 7, 2020
Free Space of Rigid Objects: Caging, Path Non-Existence, and Narrow Passage Detection
Anastasiia Varava, J. Frederico Carvalho, Danica Kragic et al.
In this work we propose algorithms to explicitly construct a conservative estimate of the configuration spaces of rigid objects in 2D and 3D. Our approach is able to detect compact path components and narrow passages in configuration space which are important for applications in robotic manipulation and path planning. Moreover, as we demonstrate, they are also applicable to identification of molecular cages in chemistry. Our algorithms are based on a decomposition of the resulting 3 and 6 dimensional configuration spaces into slices corresponding to a finite sample of fixed orientations in configuration space. We utilize dual diagrams of unions of balls and uniform grids of orientations to approximate the configuration space. We carry out experiments to evaluate the computational efficiency on a set of objects with different geometric features thus demonstrating that our approach is applicable to different object shapes. We investigate the performance of our algorithm by computing increasingly fine-grained approximations of the object's configuration space.
RO Oct 27, 2017
A Decomposition-Based Approach to Reasoning about Free Space Path-Connectivity for Rigid Objects in 2D
Anastasiia Varava, J. Frederico Carvalho, Danica Kragic et al.
In this paper, we compute a conservative approximation of the path-connected components of the free space of a rigid object in a 2D workspace in order to solve two closely related problems: to determine whether there exists a collision-free path between two given configurations, and to verify whether an object can escape arbitrarily far from its initial configuration -- i.e., whether the object is caged. Furthermore, we consider two quantitative characteristics of the free space: the volume of path-connected components and the width of narrow passages. To address these problems, we decompose the configuration space into a set of two-dimensional slices, approximate them as two-dimensional alpha-complexes, and then study the relations between them. This significantly reduces the computational complexity compared to a direct approximation of the free space. We implement our algorithm and run experiments in a three-dimensional configuration space of a simple object showing runtime of less than 2 seconds.
RO Oct 17, 2016
CapriDB - Capture, Print, Innovate: A Low-Cost Pipeline and Database for Reproducible Manipulation Research
Florian T. Pokorny, Yasemin Bekiroglu, Karl Pauwels et al.
We present a novel approach and database which combines the inexpensive generation of 3D object models via monocular or RGB-D camera images with 3D printing and a state of the art object tracking algorithm. Unlike recent efforts towards the creation of 3D object databases for robotics, our approach does not require expensive and controlled 3D scanning setups and enables anyone with a camera to scan, print and track complex objects for manipulation research. The proposed approach results in highly detailed mesh models whose 3D printed replicas are at times difficult to distinguish from the original. A key motivation for utilizing 3D printed objects is the ability to precisely control and vary object properties such as the mass distribution and size in the 3D printing process to obtain reproducible conditions for robotic manipulation research. We present CapriDB - an extensible database resulting from this approach containing initially 40 textured and 3D printable mesh models together with tracking features to facilitate the adoption of the proposed approach.
AI Jul 25, 2016
Estimating Activity at Multiple Scales using Spatial Abstractions
Majd Hawasly, Florian T. Pokorny, Subramanian Ramamoorthy
Autonomous robots operating in dynamic environments must maintain beliefs over a hypothesis space that is rich enough to represent the activities of interest at different scales. This is important both to accommodate the availability of evidence at varying degrees of coarseness, such as when interpreting and assimilating natural instructions, and to make subsequent reactive planning more efficient. We present an algorithm that combines a topology-based trajectory clustering procedure, which generates hierarchically-structured spatial abstractions, with a bank of particle filters at each of these abstraction levels, so as to produce probability estimates over an agent's navigation activity that are kept consistent across the hierarchy. We study the performance of the proposed method using a synthetic trajectory dataset in 2D, as well as a dataset taken from AIS-based tracking of ships in an extended harbour area. We show that, in comparison to a baseline particle filter that estimates activity without exploiting such structure, our method achieves a better normalised error in predicting the trajectory as well as a better time to convergence to the true class when compared against ground truth.
RO Apr 21, 2016
HIRL: Hierarchical Inverse Reinforcement Learning for Long-Horizon Tasks with Delayed Rewards
Sanjay Krishnan, Animesh Garg, Richard Liaw et al.
Reinforcement Learning (RL) struggles in problems with delayed rewards, and one approach is to segment the task into sub-tasks with incremental rewards. We propose a framework called Hierarchical Inverse Reinforcement Learning (HIRL), which is a model for learning sub-task structure from demonstrations. HIRL decomposes the task into sub-tasks based on transitions that are consistent across demonstrations. These transitions are defined as changes in local linearity w.r.t. a kernel function. Then, HIRL uses the inferred structure to learn reward functions local to the sub-tasks but also handle any global dependencies such as sequentiality. We have evaluated HIRL on several standard RL benchmarks: Parallel Parking with noisy dynamics, Two-Link Pendulum, 2D Noisy Motion Planning, and a Pinball environment. In the parallel parking task, we find that rewards constructed with HIRL converge to a policy with an 80% success rate in 32% fewer time-steps than those constructed with Maximum Entropy Inverse RL (MaxEnt IRL), and with partial state observation, the policies learned with IRL fail to achieve this accuracy while HIRL still converges. We further find that the rewards learned with HIRL are robust to environment noise: they can tolerate 1 stdev. of random perturbation in the poses of the environment obstacles while maintaining roughly the same convergence rate. We find that HIRL rewards can converge up to 6x faster than rewards constructed with IRL.
LG Jan 26, 2015
On a Family of Decomposable Kernels on Sequences
Andrea Baisero, Florian T. Pokorny, Carl Henrik Ek
In many applications data is naturally presented in terms of orderings of some basic elements or symbols. Reasoning about such data requires a notion of similarity capable of handling sequences of different lengths. In this paper we describe a family of Mercer kernel functions for such sequentially structured data. The family is characterized by a decomposable structure in terms of symbol-level and structure-level similarities, representing a specific combination of kernels which allows for efficient computation. We provide an experimental evaluation on sequential classification tasks, comparing kernels from our family to a state-of-the-art sequence kernel called the Global Alignment kernel, which has been shown to outperform Dynamic Time Warping.