Valerii Serpiva

RO
h-index24
11papers
105citations
Novelty50%
AI Score49

11 Papers

ROSep 16, 2024Code
Industry 6.0: New Generation of Industry driven by Generative AI and Swarm of Heterogeneous Robots

Artem Lykov, Miguel Altamirano Cabrera, Mikhail Konenkov et al.

This paper presents the concept of Industry 6.0, introducing the world's first fully automated production system that autonomously handles the entire product design and manufacturing process based on user-provided natural language descriptions. By leveraging generative AI, the system automates critical aspects of production, including product blueprint design, component manufacturing, logistics, and assembly. A heterogeneous swarm of robots, each equipped with individual AI through integration with Large Language Models (LLMs), orchestrates the production process. The robotic system includes manipulator arms, delivery drones, and 3D printers capable of generating assembly blueprints. The system was evaluated using commercial and open-source LLMs, functioning through APIs and local deployment. A user study demonstrated that the system reduces the average production time to 119.10 minutes, significantly outperforming a team of expert human developers, who averaged 528.64 minutes (an improvement factor of 4.4). Furthermore, in the product blueprinting stage, the system surpassed human CAD operators by an unprecedented factor of 47, completing the task in 0.5 minutes compared to 23.5 minutes. This breakthrough represents a major leap towards fully autonomous manufacturing.

57.6ROJun 2
AgenticDiffusion: Agentic Diffusion-based Path Planning for Vision-Based UAV Navigation

Faryal Batool, Muhammad Ahsan Mustafa, Fawad Mehboob et al.

Indoor UAV navigation requires efficient exploration, scene understanding, and reliable trajectory execution under limited field-of-view observations. Existing vision-based navigation frameworks typically rely on single-view observations, limiting their ability to reason about occlusions, target visibility, and global scene structure. In this work, we propose AgenticDiffusion, a multi-view UAV navigation framework that coordinates language-guided reasoning, open-vocabulary target grounding, vision-based diffusion planning, and NMPC within a unified aerial navigation pipeline. Given a natural language instruction and synchronized first-person-view (FPV) and top-view observations, the framework determines the most informative viewpoint for navigation and generates a mission plan prior to trajectory execution. The targets are localized using an open-vocabulary grounding model, after which viewpoint-specific diffusion planners generate navigation trajectories for UAV execution. Using complementary viewpoints, the proposed framework reduces repeated target exploration and improves navigation efficiency in cluttered indoor environments. The framework was validated in four real-world UAV navigation scenarios involving adaptive viewpoint selection, multi-stage mission execution, long-horizon navigation, and safe landing-site selection. The experimental results demonstrated an overall mission success rate of 80% in 40 real-world trials, while the diffusion planners achieved a trajectory generation success rate of 100%.

CVMar 11, 2022
Multi-sensor large-scale dataset for multi-view 3D reconstruction

Oleg Voynov, Gleb Bobrovskikh, Pavel Karpyshev et al.

We present a new multi-sensor dataset for multi-view 3D surface reconstruction. It includes registered RGB and depth data from sensors of different resolutions and modalities: smartphones, Intel RealSense, Microsoft Kinect, industrial cameras, and structured-light scanner. The scenes are selected to emphasize a diverse set of material properties challenging for existing algorithms. We provide around 1.4 million images of 107 different scenes acquired from 100 viewing directions under 14 lighting conditions. We expect our dataset will be useful for evaluation and training of 3D reconstruction algorithms and for related tasks. The dataset is available at skoltech3d.appliedai.tech.

ROJan 21
HumanDiffusion: A Vision-Based Diffusion Trajectory Planner with Human-Conditioned Goals for Search and Rescue UAV

Faryal Batool, Iana Zhura, Valerii Serpiva et al.

Reliable human--robot collaboration in emergency scenarios requires autonomous systems that can detect humans, infer navigation goals, and operate safely in dynamic environments. This paper presents HumanDiffusion, a lightweight image-conditioned diffusion planner that generates human-aware navigation trajectories directly from RGB imagery. The system combines YOLO-11--based human detection with diffusion-driven trajectory generation, enabling a quadrotor to approach a target person and deliver medical assistance without relying on prior maps or computationally intensive planning pipelines. Trajectories are predicted in pixel space, ensuring smooth motion and a consistent safety margin around humans. We evaluate HumanDiffusion in simulation and real-world indoor mock-disaster scenarios. On a 300-sample test set, the model achieves a mean squared error of 0.02 in pixel-space trajectory reconstruction. Real-world experiments demonstrate an overall mission success rate of 80% across accident-response and search-and-locate tasks with partial occlusions. These results indicate that human-conditioned diffusion planning offers a practical and robust solution for human-aware UAV navigation in time-critical assistance settings.

ROMar 6
DreamToNav: Generalizable Navigation for Robots via Generative Video Planning

Valerii Serpiva, Jeffrin Sam, Chidera Simon et al.

We present DreamToNav, a novel autonomous robot framework that uses generative video models to enable intuitive, human-in-the-loop control. Instead of relying on rigid waypoint navigation, users provide natural language prompts (e.g. ``Follow the person carefully''), which the system translates into executable motion. Our pipeline first employs Qwen 2.5-VL-7B-Instruct to refine vague user instructions into precise visual descriptions. These descriptions condition NVIDIA Cosmos 2.5, a state-of-the-art video foundation model, to synthesize a physically consistent video sequence of the robot performing the task. From this synthetic video, we extract a valid kinematic path using visual pose estimation, robot detection and trajectory recovery. By treating video generation as a planning engine, DreamToNav allows robots to visually "dream" complex behaviors before executing them, providing a unified framework for obstacle avoidance and goal-directed navigation without task-specific engineering. We evaluate the approach on both a wheeled mobile robot and a quadruped robot in indoor navigation tasks. DreamToNav achieves a success rate of 76.7%, with final goal errors typically within 0.05-0.10 m and trajectory tracking errors below 0.15 m. These results demonstrate that trajectories extracted from generative video predictions can be reliably executed on physical robots across different locomotion platforms.

ROMar 4, 2025
RaceVLA: VLA-based Racing Drone Navigation with Human-like Behaviour

Valerii Serpiva, Artem Lykov, Artyom Myshlyaev et al.

RaceVLA presents an innovative approach for autonomous racing drone navigation by leveraging Visual-Language-Action (VLA) to emulate human-like behavior. This research explores the integration of advanced algorithms that enable drones to adapt their navigation strategies based on real-time environmental feedback, mimicking the decision-making processes of human pilots. The model, fine-tuned on a collected racing drone dataset, demonstrates strong generalization despite the complexity of drone racing environments. RaceVLA outperforms OpenVLA in motion (75.0 vs 60.0) and semantic generalization (45.5 vs 36.3), benefiting from the dynamic camera and simplified motion tasks. However, visual (79.6 vs 87.0) and physical (50.0 vs 76.7) generalization were slightly reduced due to the challenges of maneuvering in dynamic environments with varying object sizes. RaceVLA also outperforms RT-2 across all axes - visual (79.6 vs 52.0), motion (75.0 vs 55.0), physical (50.0 vs 26.7), and semantic (45.5 vs 38.8), demonstrating its robustness for real-time adjustments in complex environments. Experiments revealed an average velocity of 1.04 m/s, with a maximum speed of 2.02 m/s, and consistent maneuverability, demonstrating RaceVLA's ability to handle high-speed scenarios effectively. These findings highlight the potential of RaceVLA for high-performance navigation in competitive racing contexts. The RaceVLA codebase, pretrained weights, and dataset are available at this http URL: https://racevla.github.io/

HCAug 3, 2021
SwarmPlay: Interactive Tic-tac-toe Board Game with Swarm of Nano-UAVs driven by Reinforcement Learning

Ekaterina Karmanova, Valerii Serpiva, Stepan Perminov et al.

Reinforcement learning (RL) methods have been actively applied in the field of robotics, allowing the system itself to find a solution for a task otherwise requiring a complex decision-making algorithm. In this paper, we present a novel RL-based Tic-tac-toe scenario, i.e. SwarmPlay, where each playing component is presented by an individual drone that has its own mobility and swarm intelligence to win against a human player. Thus, the combination of challenging swarm strategy and human-drone collaboration aims to make the games with machines tangible and interactive. Although some research on AI for board games already exists, e.g., chess, the SwarmPlay technology has the potential to offer much more engagement and interaction with the user as it proposes a multi-agent swarm instead of a single interactive robot. We explore user's evaluation of RL-based swarm behavior in comparison with the game theory-based behavior. The preliminary user study revealed that participants were highly engaged in the game with drones (70% put a maximum score on the Likert scale) and found it less artificial compared to the regular computer-based systems (80%). The affection of the user's game perception from its outcome was analyzed and put under discussion. User study revealed that SwarmPlay has the potential to be implemented in a wider range of games, significantly improving human-drone interactivity.

HCAug 1, 2021
SwarmPlay: A Swarm of Nano-Quadcopters Playing Tic-tac-toe Board Game against a Human

Ekaterina Karmanova, Valerii Serpiva, Stepan Perminov et al.

We present a new paradigm of games, i.e. SwarmPlay, where each playing component is presented by an individual drone that has its own mobility and swarm intelligence to win against a human player. The motivation behind the research is to make the games with machines tangible and interactive. Although some research on the robotic players for board games already exists, e.g., chess, the SwarmPlay technology has the potential to offer much more engagement and interaction with a human as it proposes a multi-agent swarm instead of a single interactive robot. The proposed system consists of a robotic swarm, a workstation, a computer vision (CV), and Game Theory-based algorithms. A novel game algorithm was developed to provide a natural game experience to the user. The preliminary user study revealed that participants were highly engaged in the game with drones (69% put a maximum score on the Likert scale) and found it less artificial compared to the regular computer-based systems (77% put maximum score). The affection of the user's game perception from its outcome was analyzed and put under discussion. User study revealed that SwarmPlay has the potential to be implemented in a wider range of games, significantly improving human-drone interactivity.

ROJul 23, 2021
DronePaint: Swarm Light Painting with DNN-based Gesture Recognition

Valerii Serpiva, Ekaterina Karmanova, Aleksey Fedoseev et al.

We propose a novel human-swarm interaction system, allowing the user to directly control a swarm of drones in a complex environment through trajectory drawing with a hand gesture interface based on the DNN-based gesture recognition. The developed CV-based system allows the user to control the swarm behavior without additional devices through human gestures and motions in real-time, providing convenient tools to change the swarm's shape and formation. The two types of interaction were proposed and implemented to adjust the swarm hierarchy: trajectory drawing and free-form trajectory generation control. The experimental results revealed a high accuracy of the gesture recognition system (99.75%), allowing the user to achieve relatively high precision of the trajectory drawing (mean error of 5.6 cm in comparison to 3.1 cm by mouse drawing) over the three evaluated trajectory patterns. The proposed system can be potentially applied in complex environment exploration, spray painting using drones, and interactive drone shows, allowing users to create their own art objects by drone swarms.

ROJun 28, 2021
SwarmPaint: Human-Swarm Interaction for Trajectory Generation and Formation Control by DNN-based Gesture Interface

Valerii Serpiva, Ekaterina Karmanova, Aleksey Fedoseev et al.

Teleoperation tasks with multi-agent systems have a high potential in supporting human-swarm collaborative teams in exploration and rescue operations. However, it requires an intuitive and adaptive control approach to ensure swarm stability in a cluttered and dynamically shifting environment. We propose a novel human-swarm interaction system, allowing the user to control swarm position and formation by either direct hand motion or by trajectory drawing with a hand gesture interface based on the DNN gesture recognition. The key technology of the SwarmPaint is the user's ability to perform various tasks with the swarm without additional devices by switching between interaction modes. Two types of interaction were proposed and developed to adjust a swarm behavior: free-form trajectory generation control and shaped formation control. Two preliminary user studies were conducted to explore user's performance and subjective experience from human-swarm interaction through the developed control modes. The experimental results revealed a sufficient accuracy in the trajectory tracing task (mean error of 5.6 cm by gesture draw and 3.1 cm by mouse draw with the pattern of dimension 1 m by 1 m) over three evaluated trajectory patterns and up to 7.3 cm accuracy in targeting task with two target patterns of 1 m achieved by SwarmPaint interface. Moreover, the participants evaluated the trajectory drawing interface as more intuitive (12.9 %) and requiring less effort to utilize (22.7%) than direct shape and position control by gestures, although its physical workload and failure in performance were presumed as more significant (by 9.1% and 16.3%, respectively).

ROFeb 7, 2021
DroneTrap: Drone Catching in Midair by Soft Robotic Hand with Color-Based Force Detection and Hand Gesture Recognition

Aleksey Fedoseev, Valerii Serpiva, Ekaterina Karmanova et al.

The paper proposes a novel concept of docking drones to make this process as safe and fast as possible. The idea behind the project is that a robot with a soft gripper grasps the drone in midair. The human operator navigates the robotic arm with the ML-based gesture recognition interface. The 3-finger robot hand with soft fingers is equipped with touch sensors, making it possible to achieve safe drone catching and avoid inadvertent damage to the drone's propellers and motors. Additionally, the soft hand is featured with a unique color-based force estimation technology based on a computer vision (CV) system. Moreover, the visual color-changing system makes it easier for the human operator to interpret the applied forces. Without any additional programming, the operator has full real-time control of the robot's motion and task execution by wearing a mocap glove with gesture recognition, which was developed and applied for the high-level control of DroneTrap. The experimental results revealed that the developed color-based force estimation can be applied for rigid object capturing with high precision (95.3\%). The proposed technology can potentially revolutionize the landing and deployment of drones for parcel delivery on uneven ground, structure maintenance and inspection, risque operations, and etc.