Arash Ajoudani

RO
h-index39
38papers
289citations
Novelty46%
AI Score54

38 Papers

ROMay 31
PLanAR: Planning-Language-Grounded Agentic Reasoning for Robot Manipulation

Pengyuan Guo, Zhonghao Mai, Zhengtong Xu et al.

Recent advances in vision-language models (VLMs) have enabled increasing progress in real-world robot manipulation. However, long-horizon manipulation in unstructured environments requires VLMs to reason about changing scene states, action constraints, and execution outcomes, which remains difficult with natural language reasoning alone. We present PLanAR, a planning-language-grounded robot agent framework for open-vocabulary, long-horizon manipulation. PLanAR uses a planning-language interface to define the VLM reasoning space: object predicates represent scene states, action schemas specify robot skills with preconditions and effects, and symbolic plans provide executable intermediate representations. This interface enables stepwise verification: after each action, PLanAR uses onboard observations to check whether the expected symbolic effects have been achieved, allowing the VLM-based agent to update task states, detect failures, and replan when execution deviates from expectation. Across robot embodiments, VLM backends, and tasks including stacking, crossword solving, and long-horizon kitchen workflows, PLanAR demonstrates strong real-world capability while revealing key limitations of current VLMs in embodied reasoning.

ROMar 28, 2022Code
Open-VICO: An Open-Source Gazebo Toolkit for Vision-based Skeleton Tracking in Human-Robot Collaboration

Luca Fortini, Mattia Leonori, Juan M. Gandarias et al.

Simulation tools are essential for robotics research, especially for those domains in which safety is crucial, such as Human-Robot Collaboration (HRC). However, it is challenging to simulate human behaviors, and existing robotics simulators do not integrate functional human models. This work presents Open-VICO, an open-source toolkit to integrate virtual human models in Gazebo focusing on vision-based human tracking. In particular, Open-VICO allows to combine in the same simulation environment realistic human kinematic models, multi-camera vision setups, and human-tracking techniques along with numerous robot and sensor models thanks to Gazebo. The possibility to incorporate pre-recorded human skeleton motion with Motion Capture systems broadens the landscape of human performance behavioral analysis within Human-Robot Interaction (HRI) settings. To describe the functionalities and stress the potential of the toolkit four specific examples, chosen among relevant literature challenges in the field, are developed using our simulation utils: i) 3D multi-RGB-D camera calibration in simulation, ii) creation of a synthetic human skeleton tracking dataset based on OpenPose, iii) multi-camera scenario for human skeleton tracking in simulation, and iv) a human-robot interaction example. The key of this work is to create a straightforward pipeline which we hope will motivate research on new vision-based algorithms and methodologies for lightweight human-tracking and flexible human-robot applications.

CVApr 12Code
IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly

Di Wen, Zeyun Zhong, David Schneider et al.

We introduce IMPACT, a synchronized five-view RGB-D dataset for deployment-oriented industrial procedural understanding, built around real assembly and disassembly of a commercial angle grinder with professional-grade tools. To our knowledge, IMPACT is the first real industrial assembly benchmark that jointly provides synchronized ego-exo RGB-D capture, decoupled bimanual annotation, compliance-aware state tracking, and explicit anomaly--recovery supervision within a single real industrial workflow. It comprises 112 trials from 13 participants totaling 39.5 hours, with multi-route execution governed by a partial-order prerequisite graph, a six-category anomaly taxonomy, and operator cognitive load measured via NASA-TLX. The annotation hierarchy links hand-specific atomic actions to coarse procedural steps, component assembly states, and per-hand compliance phases, with synchronized null spans across views to decouple perceptual limitations from algorithmic failure. Systematic baselines reveal fundamental limitations that remain invisible to single-task benchmarks, particularly under realistic deployment conditions that involve incomplete observations, flexible execution paths, and corrective behavior. The full dataset, annotations, and evaluation code are available at https://github.com/Kratos-Wen/IMPACT.

RONov 21, 2023
FollowMe: a Robust Person Following Framework Based on Re-Identification and Gestures

Federico Rollo, Andrea Zunino, Gennaro Raiola et al.

Human-robot interaction (HRI) has become a crucial enabler in houses and industries for facilitating operational flexibility. When it comes to mobile collaborative robots, this flexibility can be further increased due to the autonomous mobility and navigation capacity of the robotic agents, expanding their workspace and consequently, the personalizable assistance they can provide to the human operators. This however requires that the robot is capable of detecting and identifying the human counterpart in all stages of the collaborative task, and in particular while following a human in crowded workplaces. To respond to this need, we developed a unified perception and navigation framework, which enables the robot to identify and follow a target person using a combination of visual Re-Identification (Re-ID), hand gestures detection, and collision-free navigation. The Re-ID module can autonomously learn the features of a target person and use the acquired knowledge to visually re-identify the target. The navigation stack is used to follow the target avoiding obstacles and other individuals in the environment. Experiments are conducted with few subjects in a laboratory setting where some unknown dynamic obstacles are introduced.

CVSep 14, 2022
Semantic Visual Simultaneous Localization and Mapping: A Survey

Kaiqi Chen, Junhao Xiao, Jialing Liu et al.

Visual Simultaneous Localization and Mapping (vSLAM) has achieved great progress in the computer vision and robotics communities, and has been successfully used in many fields such as autonomous robot navigation and AR/VR. However, vSLAM cannot achieve good localization in dynamic and complex environments. Numerous publications have reported that, by combining with the semantic information with vSLAM, the semantic vSLAM systems have the capability of solving the above problems in recent years. Nevertheless, there is no comprehensive survey about semantic vSLAM. To fill the gap, this paper first reviews the development of semantic vSLAM, explicitly focusing on its strengths and differences. Secondly, we explore three main issues of semantic vSLAM: the extraction and association of semantic information, the application of semantic information, and the advantages of semantic vSLAM. Then, we collect and analyze the current state-of-the-art SLAM datasets which have been widely used in semantic vSLAM systems. Finally, we discuss future directions that will provide a blueprint for the future development of semantic vSLAM.

ROJul 3, 2023
Artifacts Mapping: Multi-Modal Semantic Mapping for Object Detection and 3D Localization

Federico Rollo, Gennaro Raiola, Andrea Zunino et al.

Geometric navigation is nowadays a well-established field of robotics and the research focus is shifting towards higher-level scene understanding, such as Semantic Mapping. When a robot needs to interact with its environment, it must be able to comprehend the contextual information of its surroundings. This work focuses on classifying and localising objects within a map, which is under construction (SLAM) or already built. To further explore this direction, we propose a framework that can autonomously detect and localize predefined objects in a known environment using a multi-modal sensor fusion approach (combining RGB and depth data from an RGB-D camera and a lidar). The framework consists of three key elements: understanding the environment through RGB data, estimating depth through multi-modal sensor fusion, and managing artifacts (i.e., filtering and stabilizing measurements). The experiments show that the proposed framework can accurately detect 98% of the objects in the real sample environment, without post-processing, while 85% and 80% of the objects were mapped using the single RGBD camera or RGB + lidar setup respectively. The comparison with single-sensor (camera or lidar) experiments is performed to show that sensor fusion allows the robot to accurately detect near and far obstacles, which would have been noisy or imprecise in a purely visual or laser-based approach.

CVOct 30, 2023
CARPE-ID: Continuously Adaptable Re-identification for Personalized Robot Assistance

Federico Rollo, Andrea Zunino, Nikolaos Tsagarakis et al.

In today's Human-Robot Interaction (HRI) scenarios, a prevailing tendency exists to assume that the robot shall cooperate with the closest individual or that the scene involves merely a singular human actor. However, in realistic scenarios, such as shop floor operations, such an assumption may not hold and personalized target recognition by the robot in crowded environments is required. To fulfil this requirement, in this work, we propose a person re-identification module based on continual visual adaptation techniques that ensure the robot's seamless cooperation with the appropriate individual even subject to varying visual appearances or partial or complete occlusions. We test the framework singularly using recorded videos in a laboratory environment and an HRI scenario, i.e., a person-following task by a mobile robot. The targets are asked to change their appearance during tracking and to disappear from the camera field of view to test the challenging cases of occlusion and outfit variations. We compare our framework with one of the state-of-the-art Multi-Object Tracking (MOT) methods and the results show that the CARPE-ID can accurately track each selected target throughout the experiments in all the cases (except two limit cases). At the same time, the s-o-t-a MOT has a mean of 4 tracking errors for each video.

CVMar 31, 2023
Markerless 3D human pose tracking through multiple cameras and AI: Enabling high accuracy, robustness, and real-time performance

Luca Fortini, Mattia Leonori, Juan M. Gandarias et al.

Tracking 3D human motion in real-time is crucial for numerous applications across many fields. Traditional approaches involve attaching artificial fiducial objects or sensors to the body, limiting their usability and comfort-of-use and consequently narrowing their application fields. Recent advances in Artificial Intelligence (AI) have allowed for markerless solutions. However, most of these methods operate in 2D, while those providing 3D solutions compromise accuracy and real-time performance. To address this challenge and unlock the potential of visual pose estimation methods in real-world scenarios, we propose a markerless framework that combines multi-camera views and 2D AI-based pose estimation methods to track 3D human motion. Our approach integrates a Weighted Least Square (WLS) algorithm that computes 3D human motion from multiple 2D pose estimations provided by an AI-driven method. The method is integrated within the Open-VICO framework allowing simulation and real-world execution. Several experiments have been conducted, which have shown high accuracy and real-time performance, demonstrating the high level of readiness for real-world applications and the potential to revolutionize human motion capture.

CVApr 19, 2023
Automatic Interaction and Activity Recognition from Videos of Human Manual Demonstrations with Application to Anomaly Detection

Elena Merlo, Marta Lagomarsino, Edoardo Lamon et al.

This paper presents a new method to describe spatio-temporal relations between objects and hands, to recognize both interactions and activities within video demonstrations of manual tasks. The approach exploits Scene Graphs to extract key interaction features from image sequences while simultaneously encoding motion patterns and context. Additionally, the method introduces event-based automatic video segmentation and clustering, which allow for the grouping of similar events and detect if a monitored activity is executed correctly. The effectiveness of the approach was demonstrated in two multi-subject experiments, showing the ability to recognize and cluster hand-object and object-object interactions without prior knowledge of the activity, as well as matching the same activity performed by different subjects.

ROJan 19, 2023
A Unified Architecture for Dynamic Role Allocation and Collaborative Task Planning in Mixed Human-Robot Teams

Edoardo Lamon, Fabio Fusaro, Elena De Momi et al.

The growing deployment of human-robot collaborative processes in several industrial applications, such as handling, welding, and assembly, unfolds the pursuit of systems which are able to manage large heterogeneous teams and, at the same time, monitor the execution of complex tasks. In this paper, we present a novel architecture for dynamic role allocation and collaborative task planning in a mixed human-robot team of arbitrary size. The architecture capitalizes on a centralized reactive and modular task-agnostic planning method based on Behavior Trees (BTs), in charge of actions scheduling, while the allocation problem is formulated through a Mixed-Integer Linear Program (MILP), that assigns dynamically individual roles or collaborations to the agents of the team. Different metrics used as MILP cost allow the architecture to favor various aspects of the collaboration (e.g. makespan, ergonomics, human preferences). Human preference are identified through a negotiation phase, in which, an human agent can accept/refuse to execute the assigned task.In addition, bilateral communication between humans and the system is achieved through an Augmented Reality (AR) custom user interface that provides intuitive functionalities to assist and coordinate workers in different action phases. The computational complexity of the proposed methodology outperforms literature approaches in industrial sized jobs and teams (problems up to 50 actions and 20 agents in the team with collaborations are solved within 1 s). The different allocated roles, as the cost functions change, highlights the flexibility of the architecture to several production requirements. Finally, the subjective evaluation demonstrating the high usability level and the suitability for the targeted scenario.

RONov 11, 2025
Intuitive Programming, Adaptive Task Planning, and Dynamic Role Allocation in Human-Robot Collaboration

Marta Lagomarsino, Elena Merlo, Andrea Pupa et al.

Remarkable capabilities have been achieved by robotics and AI, mastering complex tasks and environments. Yet, humans often remain passive observers, fascinated but uncertain how to engage. Robots, in turn, cannot reach their full potential in human-populated environments without effectively modeling human states and intentions and adapting their behavior. To achieve a synergistic human-robot collaboration (HRC), a continuous information flow should be established: humans must intuitively communicate instructions, share expertise, and express needs. In parallel, robots must clearly convey their internal state and forthcoming actions to keep users informed, comfortable, and in control. This review identifies and connects key components enabling intuitive information exchange and skill transfer between humans and robots. We examine the full interaction pipeline: from the human-to-robot communication bridge translating multimodal inputs into robot-understandable representations, through adaptive planning and role allocation, to the control layer and feedback mechanisms to close the loop. Finally, we highlight trends and promising directions toward more adaptive, accessible HRC.

ROMar 17
CompliantVLA-adaptor: VLM-Guided Variable Impedance Action for Safe Contact-Rich Manipulation

Heng Zhang, Wei-Hsing Huang, Qiyi Tong et al.

We propose a CompliantVLA-adaptor that augments the state-of-the-art Vision-Language-Action (VLA) models with vision-language model (VLM)-informed context-aware variable impedance control (VIC) to improve the safety and effectiveness of contact-rich robotic manipulation tasks. Existing VLA systems (e.g., RDT, Pi0.5, OpenVLA-oft) typically output position, but lack force-aware adaptation, leading to unsafe or failed interactions in physical tasks involving contact, compliance, or uncertainty. In the proposed CompliantVLA-adaptor, a VLM interprets task context from images and natural language to adapt the stiffness and damping parameters of a VIC controller. These parameters are further regulated using real-time force/torque feedback to ensure interaction forces remain within safe thresholds. We demonstrate that our method outperforms the VLA baselines on a suite of complex contact-rich tasks, both in simulation and the real world, with improved success rates and reduced force violations. This work presents a promising path towards a safe foundation model for physical contact-rich manipulation. We release our code, prompts, and force-torque-impedance-scenario context datasets at https://sites.google.com/view/compliantvla.

LGApr 20
Dissipative Latent Residual Physics-Informed Neural Networks for Modeling and Identification of Electromechanical Systems

Youyuan Long, Gokhan Solak, Arash Ajoudani

Accurate dynamical modeling is essential for simulation and control of embodied systems, yet first-principles models of electromechanical systems often fail to capture complex dissipative effects such as joint friction, stray losses, and structural damping. While residual-learning physics-informed neural networks (PINNs) can effectively augment imperfect first-principles models with data-driven components, the residual terms are typically implemented as unconstrained multilayer perceptrons (MLPs), which may inadvertently inject artificial energy into the system. To more faithfully model the dissipative dynamics, we propose DiLaR-PINN, a dissipative latent residual PINN designed to learn unmodeled dissipative effects in a physically consistent manner. Structurally, the residual network operates only on unmeasurable (latent) state components and is parameterized in a skew-dissipative form that guarantees non-increasing energy for any choice of network parameters. To enable stable and data-efficient training under partial measurability of the state, we further develop a recurrent rollout scheme with a curriculum-based sequence length extension strategy. We validate DiLaR-PINN on a real-world helicopter system and compare it against four baselines: a pure physical model (without a residual network), an unstructured residual MLP, a DiLaR variant with a soft dissipativity constraint, and a black-box LSTM. The results demonstrate that DiLaR-PINN more accurately captures dissipative effects and achieves superior long-horizon extrapolation performance.

LGMar 10
Reward-Zero: Language Embedding Driven Implicit Reward Mechanisms for Reinforcement Learning

Heng Zhang, Haddy Alchaer, Arash Ajoudani et al.

We introduce Reward-Zero, a general-purpose implicit reward mechanism that transforms natural-language task descriptions into dense, semantically grounded progress signals for reinforcement learning (RL). Reward-Zero serves as a simple yet sophisticated universal reward function that leverages language embeddings for efficient RL training. By comparing the embedding of a task specification with embeddings derived from an agent's interaction experience, Reward-Zero produces a continuous, semantically aligned sense-of-completion signal. This reward supplements sparse or delayed environmental feedback without requiring task-specific engineering. When integrated into standard RL frameworks, it accelerates exploration, stabilizes training, and enhances generalization across diverse tasks. Empirically, agents trained with Reward-Zero converge faster and achieve higher final success rates than conventional methods such as PPO with common reward-shaping baselines, successfully solving tasks that hand-designed rewards could not in some complex tasks. In addition, we develop a mini benchmark for the evaluation of completion sense during task execution via language embeddings. These results highlight the promise of language-driven implicit reward functions as a practical path toward more sample-efficient, generalizable, and scalable RL for embodied agents. Code will be released after peer review.

ROApr 29
Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies

Pokuang Zhou, Yuhao Zhou, Quan Luu et al.

Quadrupedal loco-manipulation is commonly built on visual perception and proprioception. Yet reliable contact-rich manipulation remains difficult: vision and proprioception alone cannot resolve uncertain, evolving interactions with the environment. Tactile sensing offers direct contact observability, but scalable tactile-aware learning framework for quadrupedal loco-manipulation is still underexplored. In this paper, we present a tactile-aware loco-manipulation policy learning pipeline with a hierarchical structure. Our approach has two key components. First, we leverage real-world human demonstrations to train a tactile-conditioned visuotactile high-level policy. This policy predicts not only end-effector trajectories for manipulation, but also the evolving tactile interaction cues that characterize how contact should develop over time. Second, we perform large-scale reinforcement learning in simulation to learn a tactile-aware whole-body control policy that tracks diverse commanded trajectories and tactile interaction cues, and transfers zero-shot to the real world. Together, these components enable coordinated locomotion and manipulation under contact-rich scenarios. We evaluate the system on real-world contact-rich tasks, including in-hand reorientation with insertion, valve tightening, and delicate object manipulation. Compared to vision-only and visuotactile baselines, our method improves performance by 28.54% on average across these tasks.

ROJun 16, 2025
A Survey on Imitation Learning for Contact-Rich Tasks in Robotics

Toshiaki Tsuji, Yasuhiro Kato, Gokhan Solak et al.

This paper comprehensively surveys research trends in imitation learning for contact-rich robotic tasks. Contact-rich tasks, which require complex physical interactions with the environment, represent a central challenge in robotics due to their nonlinear dynamics and sensitivity to small positional deviations. The paper examines demonstration collection methodologies, including teaching methods and sensory modalities crucial for capturing subtle interaction dynamics. We then analyze imitation learning approaches, highlighting their applications to contact-rich manipulation. Recent advances in multimodal learning and foundation models have significantly enhanced performance in complex contact tasks across industrial, household, and healthcare domains. Through systematic organization of current research and identification of challenges, this survey provides a foundation for future advancements in contact-rich robotic manipulation.

CLMar 28, 2025
Scaling Laws in Scientific Discovery with AI and Robot Scientists

Pengsong Zhang, Heng Zhang, Huazhe Xu et al.

Scientific discovery is poised for rapid advancement through advanced robotics and artificial intelligence. Current scientific practices face substantial limitations as manual experimentation remains time-consuming and resource-intensive, while multidisciplinary research demands knowledge integration beyond individual researchers' expertise boundaries. Here, we envision an autonomous generalist scientist (AGS) concept combines agentic AI and embodied robotics to automate the entire research lifecycle. This system could dynamically interact with both physical and virtual environments while facilitating the integration of knowledge across diverse scientific disciplines. By deploying these technologies throughout every research stage -- spanning literature review, hypothesis generation, experimentation, and manuscript writing -- and incorporating internal reflection alongside external feedback, this system aims to significantly reduce the time and resources needed for scientific discovery. Building on the evolution from virtual AI scientists to versatile generalist AI-based robot scientists, AGS promises groundbreaking potential. As these autonomous systems become increasingly integrated into the research process, we hypothesize that scientific discovery might adhere to new scaling laws, potentially shaped by the number and capabilities of these autonomous systems, offering novel perspectives on how knowledge is generated and evolves. The adaptability of embodied robots to extreme environments, paired with the flywheel effect of accumulating scientific knowledge, holds the promise of continually pushing beyond both physical and intellectual frontiers.

ROMar 27, 2025
Bresa: Bio-inspired Reflexive Safe Reinforcement Learning for Contact-Rich Robotic Tasks

Heng Zhang, Gokhan Solak, Arash Ajoudani

Ensuring safety in reinforcement learning (RL)-based robotic systems is a critical challenge, especially in contact-rich tasks within unstructured environments. While the state-of-the-art safe RL approaches mitigate risks through safe exploration or high-level recovery mechanisms, they often overlook low-level execution safety, where reflexive responses to potential hazards are crucial. Similarly, variable impedance control (VIC) enhances safety by adjusting the robot's mechanical response, yet lacks a systematic way to adapt parameters, such as stiffness and damping throughout the task. In this paper, we propose Bresa, a Bio-inspired Reflexive Hierarchical Safe RL method inspired by biological reflexes. Our method decouples task learning from safety learning, incorporating a safety critic network that evaluates action risks and operates at a higher frequency than the task solver. Unlike existing recovery-based methods, our safety critic functions at a low-level control layer, allowing real-time intervention when unsafe conditions arise. The task-solving RL policy, running at a lower frequency, focuses on high-level planning (decision-making), while the safety critic ensures instantaneous safety corrections. We validate Bresa on multiple tasks including a contact-rich robotic task, demonstrating its reflexive ability to enhance safety, and adaptability in unforeseen dynamic environments. Our results show that Bresa outperforms the baseline, providing a robust and reflexive safety mechanism that bridges the gap between high-level planning and low-level execution. Real-world experiments and supplementary material are available at project website https://jack-sherman01.github.io/Bresa.

CVMay 9, 2025
Graph-based Online Monitoring of Train Driver States via Facial and Skeletal Features

Olivia Nocentini, Marta Lagomarsino, Gokhan Solak et al.

Driver fatigue poses a significant challenge to railway safety, with traditional systems like the dead-man switch offering limited and basic alertness checks. This study presents an online behavior-based monitoring system utilizing a customised Directed-Graph Neural Network (DGNN) to classify train driver's states into three categories: alert, not alert, and pathological. To optimize input representations for the model, an ablation study was performed, comparing three feature configurations: skeletal-only, facial-only, and a combination of both. Experimental results show that combining facial and skeletal features yields the highest accuracy (80.88%) in the three-class model, outperforming models using only facial or skeletal features. Furthermore, this combination achieves over 99% accuracy in the binary alertness classification. Additionally, we introduced a novel dataset that, for the first time, incorporates simulated pathological conditions into train driver monitoring, broadening the scope for assessing risks related to fatigue and health. This work represents a step forward in enhancing railway safety through advanced online monitoring using vision-based technologies.

AIApr 2
GenGait: A Transformer-Based Model for Human Gait Anomaly Detection and Normative Twin Generation

Elisa Motta, Marta Lorenzini, Clara Mouawad et al.

Gait analysis provides an objective characterization of locomotor function and is widely used to support diagnosis and rehabilitation monitoring across neurological and orthopedic disorders. Deep learning has been increasingly applied to this domain, yet most approaches rely on supervised classifiers trained on disease-labeled data, limiting generalization to heterogeneous pathological presentations. This work proposes a label-free framework for joint-level anomaly detection and kinematic correction based on a Transformer masked autoencoder trained exclusively on normative gait sequences from 150 adults, acquired with a markerless multi-camera motion-capture system. At inference, a two-pass procedure is applied to potentially pathological input sequences, first it estimates joint inconsistency scores by occluding individual joints and measuring deviations from the learned normative prior. Then, it withholds the flagged joints from the encoder input and reconstructs the full skeleton from the remaining spatiotemporal context, yielding corrected kinematic trajectories at the flagged positions. Validation on 10 held-out normative participants, who mimicked seven simulated gait abnormalities, showed accurate localization of biomechanically inconsistent joints, a significant reduction in angular deviation across all analyzed joints with large effect sizes, and preservation of normative kinematics. The proposed approach enables interpretable, subject-specific localization of gait impairments without requiring disease labels. Video is available at https://youtu.be/Rcm3jqR5pN4.

ROMar 13
TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation

Kaidi Zhang, Heng Zhang, Zhengtong Xu et al.

Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance in tasks involving visual occlusion, fine-grained manipulation, and physical contact. To address these challenges, we propose TacVLA, a fine-tuned VLA model by incorporating tactile modalities into the transformer-based policy to enhance fine-grained manipulation capabilities. Specifically, we introduce a contact-aware gating mechanism that selectively activates tactile tokens only when contact is detected, enabling adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are jointly processed within the transformer architecture to strengthen cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking and robustness evaluations demonstrate that our model outperforms baselines, improving the performance by averaging 20% success rate in disassembly, 60% in in-box picking and 2.1x improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla and code will be released.

LGOct 13, 2025
Lightweight Facial Landmark Detection in Thermal Images via Multi-Level Cross-Modal Knowledge Transfer

Qiyi Tong, Olivia Nocentini, Marta Lagomarsino et al.

Facial Landmark Detection (FLD) in thermal imagery is critical for applications in challenging lighting conditions, but it is hampered by the lack of rich visual cues. Conventional cross-modal solutions, like feature fusion or image translation from RGB data, are often computationally expensive or introduce structural artifacts, limiting their practical deployment. To address this, we propose Multi-Level Cross-Modal Knowledge Distillation (MLCM-KD), a novel framework that decouples high-fidelity RGB-to-thermal knowledge transfer from model compression to create both accurate and efficient thermal FLD models. A central challenge during knowledge transfer is the profound modality gap between RGB and thermal data, where traditional unidirectional distillation fails to enforce semantic consistency across disparate feature spaces. To overcome this, we introduce Dual-Injected Knowledge Distillation (DIKD), a bidirectional mechanism designed specifically for this task. DIKD establishes a connection between modalities: it not only guides the thermal student with rich RGB features but also validates the student's learned representations by feeding them back into the frozen teacher's prediction head. This closed-loop supervision forces the student to learn modality-invariant features that are semantically aligned with the teacher, ensuring a robust and profound knowledge transfer. Experiments show that our approach sets a new state-of-the-art on public thermal FLD benchmarks, notably outperforming previous methods while drastically reducing computational overhead.

LGSep 30, 2025
Physics-Informed Learning for Human Whole-Body Kinematics Prediction via Sparse IMUs

Cheng Guo, Giuseppe L'Erario, Giulio Romualdi et al.

Accurate and physically feasible human motion prediction is crucial for safe and seamless human-robot collaboration. While recent advancements in human motion capture enable real-time pose estimation, the practical value of many existing approaches is limited by the lack of future predictions and consideration of physical constraints. Conventional motion prediction schemes rely heavily on past poses, which are not always available in real-world scenarios. To address these limitations, we present a physics-informed learning framework that integrates domain knowledge into both training and inference to predict human motion using inertial measurements from only 5 IMUs. We propose a network that accounts for the spatial characteristics of human movements. During training, we incorporate forward and differential kinematics functions as additional loss components to regularize the learned joint predictions. At the inference stage, we refine the prediction from the previous iteration to update a joint state buffer, which is used as extra inputs to the network. Experimental results demonstrate that our approach achieves high accuracy, smooth transitions between motions, and generalizes well to unseen subjects

CVSep 1, 2025
Anticipatory Fall Detection in Humans with Hybrid Directed Graph Neural Networks and Long Short-Term Memory

Younggeol Cho, Gokhan Solak, Olivia Nocentini et al.

Detecting and preventing falls in humans is a critical component of assistive robotic systems. While significant progress has been made in detecting falls, the prediction of falls before they happen, and analysis of the transient state between stability and an impending fall remain unexplored. In this paper, we propose a anticipatory fall detection method that utilizes a hybrid model combining Dynamic Graph Neural Networks (DGNN) with Long Short-Term Memory (LSTM) networks that decoupled the motion prediction and gait classification tasks to anticipate falls with high accuracy. Our approach employs real-time skeletal features extracted from video sequences as input for the proposed model. The DGNN acts as a classifier, distinguishing between three gait states: stable, transient, and fall. The LSTM-based network then predicts human movement in subsequent time steps, enabling early detection of falls. The proposed model was trained and validated using the OUMVLP-Pose and URFD datasets, demonstrating superior performance in terms of prediction error and recognition accuracy compared to models relying solely on DGNN and models from literature. The results indicate that decoupling prediction and classification improves performance compared to addressing the unified problem using only the DGNN. Furthermore, our method allows for the monitoring of the transient state, offering valuable insights that could enhance the functionality of advanced assistance systems.

AIMay 6, 2025
Improving Failure Prediction in Aircraft Fastener Assembly Using Synthetic Data in Imbalanced Datasets

Gustavo J. G. Lahr, Ricardo V. Godoy, Thiago H. Segreto et al.

Automating aircraft manufacturing still relies heavily on human labor due to the complexity of the assembly processes and customization requirements. One key challenge is achieving precise positioning, especially for large aircraft structures, where errors can lead to substantial maintenance costs or part rejection. Existing solutions often require costly hardware or lack flexibility. Used in aircraft by the thousands, threaded fasteners, e.g., screws, bolts, and collars, are traditionally executed by fixed-base robots and usually have problems in being deployed in the mentioned manufacturing sites. This paper emphasizes the importance of error detection and classification for efficient and safe assembly of threaded fasteners, especially aeronautical collars. Safe assembly of threaded fasteners is paramount since acquiring sufficient data for training deep learning models poses challenges due to the rarity of failure cases and imbalanced datasets. The paper addresses this by proposing techniques like class weighting and data augmentation, specifically tailored for temporal series data, to improve classification performance. Furthermore, the paper introduces a novel problem-modeling approach, emphasizing metrics relevant to collar assembly rather than solely focusing on accuracy. This tailored approach enhances the models' capability to handle the challenges of threaded fastener assembly effectively.

ROMay 5, 2025
Learning and Online Replication of Grasp Forces from Electromyography Signals for Prosthetic Finger Control

Robin Arbaud, Elisa Motta, Marco Domenico Avaro et al.

Partial hand amputations significantly affect the physical and psychosocial well-being of individuals, yet intuitive control of externally powered prostheses remains an open challenge. To address this gap, we developed a force-controlled prosthetic finger activated by electromyography (EMG) signals. The prototype, constructed around a wrist brace, functions as a supernumerary finger placed near the index, allowing for early-stage evaluation on unimpaired subjects. A neural network-based model was then implemented to estimate fingertip forces from EMG inputs, allowing for online adjustment of the prosthetic finger grip strength. The force estimation model was validated through experiments with ten participants, demonstrating its effectiveness in predicting forces. Additionally, online trials with four users wearing the prosthesis exhibited precise control over the device. Our findings highlight the potential of using EMG-based force estimation to enhance the functionality of prosthetic fingers.

CVApr 23, 2025
WiFi based Human Fall and Activity Recognition using Transformer based Encoder Decoder and Graph Neural Networks

Younggeol Cho, Elisa Motta, Olivia Nocentini et al.

Human pose estimation and action recognition have received attention due to their critical roles in healthcare monitoring, rehabilitation, and assistive technologies. In this study, we proposed a novel architecture named Transformer based Encoder Decoder Network (TED Net) designed for estimating human skeleton poses from WiFi Channel State Information (CSI). TED Net integrates convolutional encoders with transformer based attention mechanisms to capture spatiotemporal features from CSI signals. The estimated skeleton poses were used as input to a customized Directed Graph Neural Network (DGNN) for action recognition. We validated our model on two datasets: a publicly available multi modal dataset for assessing general pose estimation, and a newly collected dataset focused on fall related scenarios involving 20 participants. Experimental results demonstrated that TED Net outperformed existing approaches in pose estimation, and that the DGNN achieves reliable action classification using CSI based skeletons, with performance comparable to RGB based systems. Notably, TED Net maintains robust performance across both fall and non fall cases. These findings highlight the potential of CSI driven human skeleton estimation for effective action recognition, particularly in home environments such as elderly fall detection. In such settings, WiFi signals are often readily available, offering a privacy preserving alternative to vision based methods, which may raise concerns about continuous camera monitoring.

ROMar 10, 2025
Intelligent Framework for Human-Robot Collaboration: Dynamic Ergonomics and Adaptive Decision-Making

Francesco Iodice, Elena De Momi, Arash Ajoudani

The integration of collaborative robots into industrial environments has improved productivity, but has also highlighted significant challenges related to operator safety and ergonomics. This paper proposes an innovative framework that integrates advanced visual perception, continuous ergonomic monitoring, and adaptive Behaviour Tree decision-making to overcome the limitations of traditional methods that typically operate as isolated components. Our approach synthesizes deep learning models, advanced tracking algorithms, and dynamic ergonomic assessments into a modular, scalable, and adaptive system. Experimental validation demonstrates the framework's superiority over existing solutions across multiple dimensions: the visual perception module outperformed previous detection models with 72.4% mAP@50:95; the system achieved high accuracy in recognizing operator intentions (92.5%); it promptly classified ergonomic risks with minimal latency (0.57 seconds); and it dynamically managed robotic interventions with exceptionally responsive decision-making capabilities (0.07 seconds), representing a 56% improvement over benchmark systems. This comprehensive solution provides a robust platform for enhancing human-robot collaboration in industrial environments by prioritizing ergonomic safety, operational efficiency, and real-time adaptability.

ROFeb 27, 2022
MOCA-S: A Sensitive Mobile Collaborative Robotic Assistant exploiting Low-Cost Capacitive Tactile Cover and Whole-Body Control

Mattia Leonori, Juan M. Gandarias, Arash Ajoudani

Safety is one of the most fundamental aspects of robotics, especially when it comes to collaborative robots (cobots) that are expected to physically interact with humans. Although a large body of literature has focused on safety-related aspects for fixed-based cobots, a low effort has been put into developing collaborative mobile manipulators. In response to this need, this work presents MOCA-S, i.e., Sensitive Mobile Collaborative Robotic Assistant, that integrates a low-cost, capacitive tactile cover to measure interaction forces applied to the robot base. The tactile cover comprises a set of 11 capacitive large-area tactile sensors distributed as a 1-D tactile array around the base. Characterization of the tactile sensors with different materials is included. Moreover, two expanded whole-body controllers that exploit the platform's tactile cover and the loco-manipulation features are proposed. These controllers are tested in two experiments, demonstrating the potential of MOCA-S for safe physical Human-Robot Interaction (pHRI). Finally, an experiment is carried out in which an undesired collision occurs between MOCA-S and a human during a loco-manipulation task. The results demonstrate the intrinsic safety of MOCA-S and the proposed controllers, suggesting a new step towards creating safe mobile manipulators.

ROJan 25, 2022
Human-Robot Collaborative Carrying of Objects with Unknown Deformation Characteristics

Doganay Sirintuna, Alberto Giammarino, Arash Ajoudani

In this work, we introduce an adaptive control framework for human-robot collaborative transportation of objects with unknown deformation behaviour. The proposed framework takes as input the haptic information transmitted through the object, and the kinematic information of the human body obtained from a motion capture system to create reactive whole-body motions on a mobile collaborative robot. In order to validate our framework experimentally, we compared its performance with an admittance controller during a co-transportation task of a partially deformable object. We additionally demonstrate the potential of the framework while co-transporting rigid (aluminum rod) and highly deformable (rope) objects. A mobile manipulator which consists of an Omni-directional mobile base, a collaborative robotic arm, and a robotic hand is used as the robotic partner in the experiments. Quantitative and qualitative results of a 12-subjects experiment show that the proposed framework can effectively deal with objects of unknown deformability and provides intuitive assistance to human partners.

ROJan 17, 2022
SUPER-MAN: SUPERnumerary Robotic Bodies for Physical Assistance in HuMAN-Robot Conjoined Actions

Alberto Giammarino, Juan M. Gandarias, Pietro Balatti et al.

This paper presents a mobile supernumerary robotic approach to physical assistance in human-robot conjoined actions. The study starts with a description of the SUPER-MAN concept. The idea is to develop and utilize mobile collaborative systems that can follow human loco-manipulation commands to perform industrial tasks through three main components: i) an admittance-type interface, ii) a human-robot interaction controller, and iii) a supernumerary robotic body. Next, we present two possible implementations within the framework from theoretical and hardware perspectives. The first system is called MOCA-MAN and comprises a redundant torque-controlled robotic arm and an omnidirectional mobile platform. The second one is called Kairos-MAN, formed by a high-payload 6-DoF velocity-controlled robotic arm and an omnidirectional mobile platform. The systems share the same admittance interface, through which user wrenches are translated to loco-manipulation commands generated by whole-body controllers of each system. Besides, a thorough user study with multiple and cross-gender subjects is presented to reveal the quantitative performance of the two systems in effort-demanding and dexterous tasks. Moreover, we provide qualitative results from the NASA-TLX questionnaire to demonstrate the SUPER-MAN approach's potential and its acceptability from the users' viewpoint.

RONov 11, 2021
An Online Multi-Index Approach to Human Ergonomics Assessment in the Workplace

Marta Lorenzini, Wansoo Kim, Arash Ajoudani

Work-related musculoskeletal disorders (WMSDs) remain one of the major occupational safety and health problems in the European Union nowadays. Thus, continuous tracking of workers' exposure to the factors that may contribute to their development is paramount. This paper introduces an online approach to monitor kinematic and dynamic quantities on the workers, providing on the spot an estimate of the physical load required in their daily jobs. A set of ergonomic indexes is defined to account for multiple potential contributors to WMSDs, also giving importance to the subject-specific requirements of the workers. To evaluate the proposed framework, a thorough experimental analysis was conducted on twelve human subjects considering tasks that represent typical working activities in the manufacturing sector. For each task, the ergonomic indexes that better explain the underlying physical load were identified, following a statistical analysis, and supported by the outcome of a surface electromyography (sEMG) analysis. A comparison was also made with a well-recognised and standard tool to evaluate human ergonomics in the workplace, to highlight the benefits introduced by the proposed framework. Results demonstrate the high potential of the proposed framework in identifying the physical risk factors, and therefore to adopt preventive measures. Another equally important contribution of this study is the creation of a comprehensive database on human kinodynamic measurements, which hosts multiple sensory data of healthy subjects performing typical industrial tasks.

RONov 5, 2021
Dynamic Human-Robot Role Allocation based on Human Ergonomics Risk Prediction and Robot Actions Adaptation

Elena Merlo, Edoardo Lamon, Fabio Fusaro et al.

Despite cobots have high potential in bringing several benefits in the manufacturing and logistic processes, but their rapid (re-)deployment in changing environments is still limited. To enable fast adaptation to new product demands and to boost the fitness of the human workers to the allocated tasks, we propose a novel method that optimizes assembly strategies and distributes the effort among the workers in human-robot cooperative tasks. The cooperation model exploits AND/OR Graphs that we adapted to solve also the role allocation problem. The allocation algorithm considers quantitative measurements that are computed online to describe human operator's ergonomic status and task properties. We conducted preliminary experiments to demonstrate that the proposed approach succeeds in controlling the task allocation process to ensure safe and ergonomic conditions for the human worker.

ROSep 24, 2021
Improving Standing Balance Performance through the Assistance of a Mobile Collaborative Robot

Francisco J. Ruiz-Ruiz, Alberto Giammarino, Marta Lorenzini et al.

This paper presents the design and development of a robotic system to give physical assistance to the elderly or people with neurological disorders such as Ataxia or Parkinson's. In particular, we propose using a mobile collaborative robot with an interaction-assistive whole-body interface to help people unable to maintain balance. The robotic system consists of an Omni-directional mobile base, a high-payload robotic arm, and an admittance-type interface acting as a support handle while measuring human-sourced interaction forces. The postural balance of the human body is estimated through the projection of the body Center of Mass (CoM) to the support polygon (SP) representing the quasi-static Center of Pressure (CoP). In response to the interaction forces and the tracking of the human posture, the robot can create assistive forces to restore balance in case of its loss. Otherwise, during normal stance or walking, it will follow the user with minimum/no opposing forces through the generation of coupled arm and base movements. As the balance-restoring strategy, we propose two strategies and evaluate them in a laboratory setting on healthy human participants. Quantitative and qualitative results of a 12-subjects experiment are then illustrated and discussed, comparing the performances of the two strategies and the overall system.

ROSep 8, 2021
An Online Framework for Cognitive Load Assessment in Assembly Tasks

Marta Lagomarsino, Marta Lorenzini, Elena De Momi et al.

The ongoing trend towards Industry 4.0 has revolutionised ordinary workplaces, profoundly changing the role played by humans in the production chain. Research on ergonomics in industrial settings mainly focuses on reducing the operator's physical fatigue and discomfort to improve throughput and avoid safety hazards. However, as the production complexity increases, the cognitive resources demand and mental workload could compromise the operator's performance and the efficiency of the shop floor workplace. State-of-the-art methods in cognitive science work offline and/or involve bulky equipment hardly deployable in industrial settings. This paper presents a novel method for online assessment of cognitive load in manufacturing, primarily assembly, by detecting patterns in human motion directly from the input images of a stereo camera. Head pose estimation and skeleton tracking are exploited to investigate the workers' attention and assess hyperactivity and unforeseen movements. Pilot experiments suggest that our factor assessment tool provides significant insights into workers' mental workload, even confirmed by correlations with physiological and performance measurements. According to data gathered in this study, a vision-based cognitive load assessment has the potential to be integrated into the development of mechatronic systems for improving cognitive ergonomics in manufacturing.

ROMay 25, 2021
An Integrated Dynamic Method for Allocating Roles and Planning Tasks for Mixed Human-Robot Teams

Fabio Fusaro, Edoardo Lamon, Elena De Momi et al.

This paper proposes a novel integrated dynamic method based on Behavior Trees for planning and allocating tasks in mixed human robot teams, suitable for manufacturing environments. The Behavior Tree formulation allows encoding a single job as a compound of different tasks with temporal and logic constraints. In this way, instead of the well-studied offline centralized optimization problem, the role allocation problem is solved with multiple simplified online optimization sub-problem, without complex and cross-schedule task dependencies. These sub-problems are defined as Mixed-Integer Linear Programs, that, according to the worker-actions related costs and the workers' availability, allocate the yet-to-execute tasks among the available workers. To characterize the behavior of the developed method, we opted to perform different simulation experiments in which the results of the action-worker allocation and computational complexity are evaluated. The obtained results, due to the nature of the algorithm and to the possibility of simulating the agents' behavior, should describe well also how the algorithm performs in real experiments.

HCMay 20, 2021
Quantitative Physical Ergonomics Assessment of Teleoperation Interfaces

Soheil Gholami, Marta Lorenzini, Elena De Momi et al.

Human factors and ergonomics are the essential constituents of teleoperation interfaces, which can significantly affect the human operator's performance. Thus, a quantitative evaluation of these elements and the ability to establish reliable comparison bases for different teleoperation interfaces are the keys to select the most suitable one for a particular application. However, most of the works on teleoperation have so far focused on the stability analysis and the transparency improvement of these systems, and do not cover the important usability aspects. In this work, we propose a foundation to build a general framework for the analysis of human factors and ergonomics in employing diverse teleoperation interfaces. The proposed framework will go beyond the traditional subjective analyses of usability by complementing it with online measurements of the human body configurations. As a result, multiple quantitative metrics such as joints' usage, range of motion comfort, center of mass divergence, and posture comfort are introduced. To demonstrate the potential of the proposed framework, two different teleoperation interfaces are considered, and real-world experiments with eleven participants performing a simulated industrial remote pick-and-place task are conducted. The quantitative results of this analysis are provided, and compared with subjective questionnaires, illustrating the effectiveness of the proposed framework.

ROMay 20, 2021
Enhancing Flexibility and Adaptability in Conjoined Human-Robot Industrial Tasks with a Minimalist Physical Interface

Juan M. Gandarias, Pietro Balatti, Edoardo Lamon et al.

This paper presents a physical interface for collaborative mobile manipulators in industrial manufacturing and logistics applications. The proposed work builds on our earlier MOCA-MAN interface, through which an operator could be physically coupled to a mobile manipulator to be assisted in performing daily activities. The previous interface was based on a magnetic clamp attached to one arm of the user for the coupling stage, and a bracelet based on EMG sensors on the other arm for human-robot communication via gestures. The new interface instead presents the following additions: i) An industrial-like design that allows the worker to couple/decouple easily and to operate mobile manipulators locally; ii) A simplistic communication channel via a simple buttons board that allows controlling the robot with one hand only; iii) The interface offers enhanced loco-manipulation capabilities that do not compromise the worker mobility. In addition, an experimental evaluation with six human subjects is carried out to analyze the enhanced locomotion and flexibility of the proposed interface in terms of mobility constraint, usability, and physical load reduction.