RO Sep 16, 2024
Industry 6.0: New Generation of Industry driven by Generative AI and Swarm of Heterogeneous Robots
Artem Lykov, Miguel Altamirano Cabrera, Mikhail Konenkov et al.
This paper presents the concept of Industry 6.0, introducing the world's first fully automated production system that autonomously handles the entire product design and manufacturing process based on user-provided natural language descriptions. By leveraging generative AI, the system automates critical aspects of production, including product blueprint design, component manufacturing, logistics, and assembly. A heterogeneous swarm of robots, each equipped with individual AI through integration with Large Language Models (LLMs), orchestrates the production process. The robotic system includes manipulator arms, delivery drones, and 3D printers capable of generating assembly blueprints. The system was evaluated using commercial and open-source LLMs, functioning through APIs and local deployment. A user study demonstrated that the system reduces the average production time to 119.10 minutes, significantly outperforming a team of expert human developers, who averaged 528.64 minutes (an improvement factor of 4.4). Furthermore, in the product blueprinting stage, the system surpassed human CAD operators by an unprecedented factor of 47, completing the task in 0.5 minutes compared to 23.5 minutes. This breakthrough represents a major leap towards fully autonomous manufacturing.
72.7 RO May 2
Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion
Jeffrin Sam, Nguyen Khang, Yara Mahmoud et al.
We present Action Agent, a two-stage framework that unifies agentic navigation video generation with flow-constrained diffusion control for multi-embodiment robot navigation. In Stage I, a large language model (LLM) acts as an orchestration module that selects video diffusion models, refines prompts through iterative validation, and accumulates cross-task memory to synthesize physically plausible first-person navigation videos from language and image inputs. This increases video generation success from 35% (single-shot) to 86% across 50 navigation tasks. In Stage II, we introduce FlowDiT, a Flow-Constrained Diffusion Transformer that converts optimized goal videos and language instructions into continuous velocity commands using action-space denoising diffusion. FlowDiT integrates DINOv2 visual features, learned optical flow for ego-motion representation, and CLIP language embeddings for semantic stopping. We pretrain on the RECON outdoor navigation dataset and fine-tune on 203 Unitree G1 humanoid episodes collected in Isaac Sim to calibrate velocity dynamics. A single 43M-parameter checkpoint achieves 73.2% navigation success in simulation and 64.7% task completion on a real Unitree G1 in unseen indoor environments under open-loop execution, while operating at 40-47 Hz. We evaluate Action Agent across three embodiments: a Unitree G1 humanoid (real hardware), a drone, and a wheeled mobile robot (Isaac Sim), demonstrating that decoupling trajectory imagination from execution yields a scalable and embodiment-aware paradigm for language-guided navigation.
66.0 RO Mar 27
DiffusionAnything: End-to-End In-context Diffusion Learning for Unified Navigation and Pre-Grasp Motion
Iana Zhura, Yara Mahmoud, Jeffrin Sam et al.
Efficiently predicting motion plans directly from vision remains a fundamental challenge in robotics, where planning typically requires explicit goal specification and task-specific design. Recent vision-language-action (VLA) models infer actions directly from visual input but demand massive computational resources and extensive training data, and they fail zero-shot in novel scenes. We present a unified image-space diffusion policy handling both meter-scale navigation and centimeter-scale manipulation via multi-scale feature modulation, with only 5 minutes of self-supervised data per task. Three key innovations drive the framework: (1) multi-scale FiLM conditioning on task mode, depth scale, and spatial attention enables task-appropriate behavior in a single model; (2) trajectory-aligned depth prediction focuses metric 3D reasoning along generated waypoints; (3) self-supervised attention from AnyTraverse enables goal-directed inference without vision-language models or depth sensors. Operating purely from RGB input (2.0 GB memory, 10 Hz), the model achieves robust zero-shot generalization to novel scenes while remaining suitable for onboard deployment.
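To make the FiLM conditioning mentioned above concrete, here is a minimal sketch of feature-wise linear modulation: a conditioning vector is mapped to a per-channel scale and shift that are applied to a feature map. This is only an illustration of the general FiLM idea, not the paper's architecture. The conditioning layout, the weight matrices, and the function names `film_params` and `film` are all assumptions made for the example.

```python
from typing import List, Tuple

Vector = List[float]
FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def linear(weights: List[Vector], x: Vector) -> Vector:
    """Plain matrix-vector product, one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def film_params(cond: Vector, w_gamma: List[Vector],
                w_beta: List[Vector]) -> Tuple[Vector, Vector]:
    """Map a conditioning vector to per-channel scale (gamma) and shift (beta).

    gamma is parameterised around 1 so that zero weights leave features unchanged.
    """
    gamma = [1.0 + g for g in linear(w_gamma, cond)]
    beta = linear(w_beta, cond)
    return gamma, beta


def film(features: FeatureMap, gamma: Vector, beta: Vector) -> FeatureMap:
    """Apply gamma[c] * x + beta[c] to every element of channel c."""
    return [[[g * v + b for v in row] for row in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Hypothetical conditioning vector, e.g. [task_mode, depth_scale].
cond = [1.0, 0.5]
w_gamma = [[0.5, 0.0], [0.0, -1.0]]   # assumed weights, 2 channels x 2 inputs
w_beta = [[1.0, 0.0], [0.0, 2.0]]
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[2.0, 4.0], [6.0, 8.0]]]

gamma, beta = film_params(cond, w_gamma, w_beta)
modulated = film(features, gamma, beta)
```

In a multi-scale setting, the same conditioning vector would typically produce a separate (gamma, beta) pair for each feature resolution; the sketch shows a single scale only.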
63.9 RO Mar 23
Closed-Loop Verbal Reinforcement Learning for Task-Level Robotic Planning
Dmitrii Plotnikov, Iaroslav Kolomiets, Dmitrii Maliukov et al.
We propose a new Verbal Reinforcement Learning (VRL) framework for interpretable task-level planning in mobile robotic systems operating under execution uncertainty. The framework follows a closed-loop architecture that enables iterative policy improvement through interaction with the physical environment. In our framework, executable Behavior Trees are repeatedly refined by a Large Language Model actor using structured natural-language feedback produced by a Vision-Language Model critic that observes the physical robot and execution traces. Unlike conventional reinforcement learning, policy updates in VRL occur directly at the symbolic planning level, without gradient-based optimization. This enables transparent reasoning, explicit causal feedback, and human-interpretable policy evolution. We validate the proposed framework on a real mobile robot performing a multi-stage manipulation and navigation task under execution uncertainty. Experimental results show that the framework supports explainable policy improvements, closed-loop adaptation to execution failures, and reliable deployment on physical robotic systems.
RO Jan 20
DroneVLA: VLA based Aerial Manipulation
Fawad Mehboob, Monijesu James, Amir Habel et al.
As aerial platforms evolve from passive observers to active manipulators, the challenge shifts toward designing intuitive interfaces that allow non-expert users to command these systems naturally. This work introduces a novel concept of an autonomous aerial manipulation system capable of interpreting high-level natural language commands to retrieve objects and deliver them to a human user. The system is intended to integrate MediaPipe, Grounding DINO, and a Vision-Language-Action (VLA) model with a custom-built drone equipped with a 1-DOF gripper and an Intel RealSense RGB-D camera. The VLA model performs semantic reasoning to interpret the intent of a user prompt and generates a prioritized task queue for grasping relevant objects in the scene. Grounding DINO and a dynamic A* planning algorithm are used to navigate and safely relocate the object. To ensure safe and natural interaction during the handover phase, the system employs a human-centric controller driven by MediaPipe. This module provides real-time human pose estimation, allowing the drone to employ visual servoing to maintain a stable, distinct position directly in front of the user, facilitating a comfortable handover. We demonstrate the system's efficacy through real-world localization and navigation experiments, which yielded maximum, mean Euclidean, and root-mean-square errors of 0.164 m, 0.070 m, and 0.084 m, respectively, highlighting the feasibility of VLA for aerial manipulation operations.
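The three error figures quoted above (maximum, mean Euclidean, and root-mean-square) can all be derived from per-sample distances between estimated and reference positions. The following is a minimal sketch of that computation. The function name `position_errors` and the sample coordinates are made up for illustration and are not data from the paper.

```python
import math
from typing import Dict, Sequence


def position_errors(estimated: Sequence[Sequence[float]],
                    reference: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Return max, mean, and RMS of the per-sample Euclidean distances."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("need two non-empty trajectories of equal length")
    # Euclidean distance between each estimated point and its reference point.
    dists = [math.dist(e, r) for e, r in zip(estimated, reference)]
    return {
        "max": max(dists),
        "mean": sum(dists) / len(dists),
        "rmse": math.sqrt(sum(d * d for d in dists) / len(dists)),
    }


# Illustrative positions in metres (hypothetical values).
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(0.0, 0.03, 0.04), (1.0, 0.0, 0.0), (2.0, 0.1, 0.0)]

errors = position_errors(estimated, reference)
```

For any set of samples the RMS error is at least as large as the mean error, which matches the ordering of the reported values (0.084 m versus 0.070 m).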
RO Sep 24, 2024
TiltXter: CNN-based Electro-tactile Rendering of Tilt Angle for Telemanipulation of Pasteur Pipettes
Miguel Altamirano Cabrera, Jonathan Tirado, Aleksey Fedoseev et al.
The shape of deformable objects can change drastically during grasping by robotic grippers, causing an ambiguous perception of their alignment and hence resulting in errors in robot positioning and telemanipulation. Rendering clear tactile patterns is fundamental to increasing users' precision and dexterity through tactile haptic feedback during telemanipulation. Therefore, different methods have to be studied to decode the sensors' data into haptic stimuli. This work presents a telemanipulation system for plastic pipettes that consists of a Force Dimension Omega.7 haptic interface endowed with two electro-stimulation arrays and two tactile sensor arrays embedded in the 2-finger Robotiq gripper. We propose a novel approach based on convolutional neural networks (CNN) to detect the tilt of deformable objects. The CNN generates a tactile pattern based on recognized tilt data to render further electro-tactile stimuli provided to the user during the telemanipulation. The study has shown that using the CNN algorithm, tilt recognition by users increased from 23.13% with the downsized data to 57.9%, and the success rate during teleoperation increased from 53.12% using the downsized data to 92.18% using the tactile patterns generated by the CNN.
RO Jan 9, 2025
UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation
Oleg Sautenkov, Yasheerah Yaqoot, Artem Lykov et al.
The UAV-VLA (Visual-Language-Action) system is a tool designed to facilitate communication with aerial robots. By integrating satellite imagery processing with the Visual Language Model (VLM) and the powerful capabilities of GPT, UAV-VLA enables users to generate general flight path-and-action plans through simple text requests. This system leverages the rich contextual information provided by satellite images, allowing for enhanced decision-making and mission planning. The combination of visual analysis by VLM and natural language processing by GPT can provide the user with the path-and-action set, making aerial operations more efficient and accessible. The newly developed method showed a 22% difference in the length of the created trajectory and a mean error of 34.22 m in finding the objects of interest on a map, measured by Euclidean distance in the K-Nearest Neighbors (KNN) approach.
42.1 RO Apr 21
GenerativeMPC: VLM-RAG-guided Whole-Body MPC with Virtual Impedance for Bimanual Mobile Manipulation
Marcelino Julio Fernando, Miguel Altamirano Cabrera, Jeffrin Sam et al.
Bimanual mobile manipulation requires a seamless integration between high-level semantic reasoning and safe, compliant physical interaction - a challenge that end-to-end models approach opaquely and classical controllers lack the context to address. This paper presents GenerativeMPC, a hierarchical cyber-physical framework that explicitly bridges semantic scene understanding with physical control parameters for bimanual mobile manipulators. The system utilizes a Vision-Language Model with Retrieval-Augmented Generation (VLM-RAG) to translate visual and linguistic context into grounded control constraints, specifically outputting dynamic velocity limits and safety margins for a Whole-Body Model Predictive Controller (MPC). Simultaneously, the VLM-RAG module modulates virtual stiffness and damping gains for a unified impedance-admittance controller, enabling context-aware compliance during human-robot interaction. Our framework leverages an experience-driven vector database to ensure consistent parameter grounding without retraining. Experimental results in MuJoCo, IsaacSim, and on a physical bimanual platform confirm a 60% speed reduction near humans and safe, socially-aware navigation and manipulation through semantic-to-physical parameter grounding. This work advances the field of human-centric cybernetics by grounding large-scale cognitive models into predictable, high-frequency physical control loops.
RO Feb 5, 2024
DogSurf: Quadruped Robot Capable of GRU-based Surface Recognition for Blind Person Navigation
Artem Bazhenov, Vladimir Berman, Sergei Satsevich et al.
This paper introduces DogSurf - a new approach of using quadruped robots to help visually impaired people navigate in the real world. The presented method allows the quadruped robot to detect slippery surfaces and to use audio and haptic feedback to inform the user when to stop. A state-of-the-art GRU-based neural network architecture with a mean accuracy of 99.925% was proposed for the task of multiclass surface classification for quadruped robots. A dataset was collected on a Unitree Go1 Edu robot. The dataset and code have been posted to the public domain.
22.2 RO Apr 7
GraspSense: Physically Grounded Grasp and Grip Planning for a Dexterous Robotic Hand via Language-Guided Perception and Force Maps
Elizaveta Semenyakina, Ivan Snegirev, Mariya Lezina et al.
Dexterous robotic manipulation requires more than geometrically valid grasps: it demands physically grounded contact strategies that account for the spatially non-uniform mechanical properties of the object. However, existing grasp planners typically treat the surface as structurally homogeneous, even though contact in a weak region can damage the object despite a geometrically perfect grasp. We present a pipeline for grasp selection and force regulation in a five-fingered robotic hand, based on a map of locally admissible contact loads. From an operator command, the system identifies the target object, reconstructs its 3D geometry using SAM3D, and imports the model into Isaac Sim. A physics-informed geometric analysis then computes a force map that encodes the maximum lateral contact force admissible at each surface location without deformation. Grasp candidates are filtered by geometric validity and task-goal consistency. When multiple candidates are comparable under classical metrics, they are re-ranked using a force-map-aware criterion that favors grasps with contacts in mechanically admissible regions. An impedance controller scales the stiffness of each finger according to the locally admissible force at the contact point, enabling safe and reliable grasp execution. Validation on paper, plastic, and glass cups shows that the proposed approach consistently selects structurally stronger contact regions and keeps grip forces within safe bounds. In this way, the work reframes dexterous manipulation from a purely geometric problem into a physically grounded joint planning problem of grasp selection and grip execution for future humanoid systems.
RO Oct 25, 2021
CoboGuider: Haptic Potential Fields for Safe Human-Robot Interaction
Viktor Rakhmatulin, Miguel Altamirano Cabrera, Fikre Hagos et al.
Modern industry still relies on manual manufacturing operations, and safe human-robot interaction is of great interest nowadays. Speed and Separation Monitoring (SSM) allows close and efficient collaborative scenarios by maintaining a protective separation distance during robot operation. The paper focuses on a novel approach to strengthen the SSM safety requirements by introducing haptic feedback to a robotic cell worker. Tactile stimuli provide early warning of dangerous movements and proximity to the robot, based on the human reaction time and instantaneous velocities of robot and operator. A preliminary experiment was performed to identify the reaction time of participants when they are exposed to tactile stimuli in a collaborative environment with controlled conditions. In a second experiment, we evaluated our approach in a case study where a human worker and a cobot performed collaborative planetary gear assembly. Results show that the applied approach increased the average minimum distance between the robot's end-effector and hand by 44% compared to the operator relying only on the visual feedback. Moreover, the participants without the haptic support failed several times to maintain the protective separation distance.
RO Sep 13, 2021
CoHaptics: Development of Human-Robot Collaborative System with Forearm-worn Haptic Display to Increase Safety in Future Factories
Miguel Altamirano Cabrera, Juan Heredia, Jonathan Tirado et al.
Complex tasks require human collaboration since robots do not have enough dexterity. However, robots are still used as instruments and not as collaborative systems. We introduce a framework to ensure safety in a human-robot collaborative environment. The system is composed of a haptic feedback display, low-cost wearable mocap, and a new collision avoidance algorithm based on Artificial Potential Fields (APF). The wearable optical motion capturing system enables tracking the human hand position with high accuracy and low latency on large working areas. This study evaluates whether haptic feedback improves safety in human-robot collaboration. Three experiments were carried out to evaluate the performance of the proposed system. The first one evaluated human responses to the haptic device during interaction with the Robot Tool Center Point (TCP). The second experiment analyzed human-robot behavior during an imminent collision. The third experiment evaluated the system in a collaborative activity in a shared working environment. This study showed that when haptic feedback was included in the control loop, the safe distance (minimum robot-obstacle distance) increased by 4.16 cm, from 12.39 cm to 16.55 cm, and the robot's path, when the collision avoidance algorithm was activated, was reduced by 81%.
RO Feb 7, 2021
DroneTrap: Drone Catching in Midair by Soft Robotic Hand with Color-Based Force Detection and Hand Gesture Recognition
Aleksey Fedoseev, Valerii Serpiva, Ekaterina Karmanova et al.
The paper proposes a novel concept of docking drones to make this process as safe and fast as possible. The idea behind the project is that a robot with a soft gripper grasps the drone in midair. The human operator navigates the robotic arm with the ML-based gesture recognition interface. The 3-finger robot hand with soft fingers is equipped with touch sensors, making it possible to achieve safe drone catching and avoid inadvertent damage to the drone's propellers and motors. Additionally, the soft hand features a unique color-based force estimation technology based on a computer vision (CV) system. Moreover, the visual color-changing system makes it easier for the human operator to interpret the applied forces. Without any additional programming, the operator has full real-time control of the robot's motion and task execution by wearing a mocap glove with gesture recognition, which was developed and applied for the high-level control of DroneTrap. The experimental results revealed that the developed color-based force estimation can be applied for rigid object capturing with high precision (95.3%). The proposed technology can potentially revolutionize the landing and deployment of drones for parcel delivery on uneven ground, structure maintenance and inspection, risky operations, etc.
RO Jul 20, 2020
CobotGear: Interaction with Collaborative Robots using Wearable Optical Motion Capturing Systems
Juan Heredia, Miguel Altamirano Cabrera, Jonathan Tirado et al.
In industrial applications, complex tasks require human collaboration since the robot doesn't have enough dexterity. However, the robots are still implemented as tools and not as collaborative intelligent systems. To ensure safety in human-robot collaboration, we introduce a system that integrates low-cost wearable mocap and an improved collision avoidance algorithm based on artificial potential fields. Wearable optical motion capturing allows tracking the human hand position with high accuracy and low latency on large working areas. To increase the efficiency of the proposed algorithm, two obstacle types are discriminated according to their collision probability. A preliminary experiment was performed to analyze the algorithm behavior and to select the best values for the obstacle's threshold angle $\theta_{OBS}$ and for the avoidance threshold distance $d_{AT}$. The second experiment was carried out to evaluate the system performance with $d_{AT}$ = 0.2 m and $\theta_{OBS}$ = 45 degrees. The third experiment evaluated the system in a real collaborative task. The results demonstrate the robust performance of the robotic arm generating smooth collision-free trajectories. The proposed technology will allow consumer robots to safely collaborate with humans in cluttered environments, e.g., factories, kitchens, living rooms, and restaurants.
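As a rough illustration of how an avoidance threshold distance enters a potential-field planner, the sketch below implements the classic textbook artificial potential field: an attractive force toward the goal plus a repulsive force that is active only within $d_{AT}$ of an obstacle. This is not the improved algorithm described in the abstract. The obstacle threshold angle and the two obstacle types are not modelled, and the gains `k_att` and `k_rep`, the step size, and all function names are assumptions chosen for the example.

```python
import math
from typing import List, Sequence

D_AT = 0.2  # avoidance threshold distance d_AT in metres (value quoted in the abstract)

Vec = List[float]


def sub(a: Sequence[float], b: Sequence[float]) -> Vec:
    return [x - y for x, y in zip(a, b)]


def add(a: Sequence[float], b: Sequence[float]) -> Vec:
    return [x + y for x, y in zip(a, b)]


def scale(a: Sequence[float], s: float) -> Vec:
    return [x * s for x in a]


def norm(a: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in a))


def attractive_force(pos: Sequence[float], goal: Sequence[float],
                     k_att: float = 1.0) -> Vec:
    """Linear pull toward the goal (k_att is an assumed gain)."""
    return scale(sub(goal, pos), k_att)


def repulsive_force(pos: Sequence[float], obstacle: Sequence[float],
                    d_at: float = D_AT, k_rep: float = 0.01) -> Vec:
    """Classic repulsion, non-zero only when closer than d_at to the obstacle."""
    diff = sub(pos, obstacle)
    d = norm(diff)
    if d == 0.0 or d >= d_at:
        # Outside the threshold (or direction undefined): no repulsion.
        return [0.0] * len(pos)
    magnitude = k_rep * (1.0 / d - 1.0 / d_at) / (d * d)
    return scale(diff, magnitude / d)  # points away from the obstacle


def apf_step(pos: Sequence[float], goal: Sequence[float],
             obstacles: Sequence[Sequence[float]], step: float = 0.05) -> Vec:
    """Move a fixed step length along the direction of the total force."""
    force = attractive_force(pos, goal)
    for obs in obstacles:
        force = add(force, repulsive_force(pos, obs))
    n = norm(force)
    if n == 0.0:
        return list(pos)
    return add(pos, scale(force, step / n))
```

A caller would invoke `apf_step` repeatedly with the current end-effector position and the tracked hand position as an obstacle. Plain potential fields can stall in local minima, which is one reason practical systems refine the basic scheme.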
HC Jun 22, 2020
Tactile Perception of Objects by the User's Palm for the Development of Multi-contact Wearable Tactile Displays
Miguel Altamirano Cabrera, Juan Heredia, Dzmitry Tsetserukou
The user's palm plays an important role in object detection and manipulation. The design of a robust multi-contact tactile display must consider the sensation and perception of the stimulated area, aiming to deliver the right stimuli at the correct location. To the best of our knowledge, no study has obtained human palm data for this purpose. The objective of this work is to introduce a method to investigate the user's palm sensations during the interaction with objects. An array of fifteen Force Sensitive Resistors (FSRs) was located at the user's palm to get the area of interaction and the normal force delivered to four different convex surfaces. Experimental results showed the active areas at the palm during the interaction with each of the surfaces at different forces. The obtained results can be applied in the development of multi-contact wearable tactile and haptic displays for the palm, and in training a machine-learning algorithm to predict stimuli aiming to achieve a highly immersive experience in Virtual Reality.