ROJun 29, 2023
Identifying Important Sensory Feedback for Learning Locomotion SkillsWanming Yu, Chuanyu Yang, Christopher McGreavy et al.
Robot motor skills can be learned through deep reinforcement learning (DRL) by neural networks as state-action mappings. While the selection of state observations is crucial, there has been a lack of quantitative analysis to date. Here, we present a systematic saliency analysis that quantitatively evaluates the relative importance of different feedback states for motor skills learned through DRL. Our approach can identify the most essential feedback states for locomotion skills, including balance recovery, trotting, bounding, pacing and galloping. By using only key states including joint positions, gravity vector, base linear and angular velocities, we demonstrate that a simulated quadruped robot can achieve robust performance in various test scenarios across these distinct skills. The benchmarks using task performance metrics show that locomotion skills learned with key states can achieve comparable performance to those with all states, and the task performance or learning success rate will drop significantly if key states are missing. This work provides quantitative insights into the relationship between state observations and specific types of motor skills, serving as a guideline for robot motor learning. The proposed method is applicable to differentiable state-action mapping, such as neural network based control policies, enabling the learning of a wide range of motor skills with minimal sensing dependencies.
CVAug 31, 2023Code
Pose-Graph Attentional Graph Neural Network for Lidar Place RecognitionMilad Ramezani, Liang Wang, Joshua Knights et al.
This paper proposes a pose-graph attentional graph neural network, called P-GAT, which compares (key)nodes between sequential and non-sequential sub-graphs for place recognition tasks as opposed to a common frame-to-frame retrieval problem formulation currently implemented in SOTA place recognition methods. P-GAT uses the maximum spatial and temporal information between neighbour cloud descriptors -- generated by an existing encoder -- utilising the concept of pose-graph SLAM. Leveraging intra- and inter-attention and graph neural network, P-GAT relates point clouds captured in nearby locations in Euclidean space and their embeddings in feature space. Experimental results on the large-scale publically available datasets demonstrate the effectiveness of our approach in scenes lacking distinct features and when training and testing environments have different distributions (domain adaptation). Further, an exhaustive comparison with the state-of-the-art shows improvements in performance gains. Code is available at https://github.com/csiro-robotics/P-GAT.
ROAug 15, 2023
Hierarchical generative modelling for autonomous robotsKai Yuan, Noor Sajid, Karl Friston et al.
Humans can produce complex whole-body motions when interacting with their surroundings, by planning, executing and combining individual limb movements. We investigated this fundamental aspect of motor control in the setting of autonomous robotic operations. We approach this problem by hierarchical generative modelling equipped with multi-level planning-for autonomous task completion-that mimics the deep temporal architecture of human motor control. Here, temporal depth refers to the nested time scales at which successive levels of a forward or generative model unfold, for example, delivering an object requires a global plan to contextualise the fast coordination of multiple local movements of limbs. This separation of temporal scales also motivates robotics and control. Specifically, to achieve versatile sensorimotor control, it is advantageous to hierarchically structure the planning and low-level motor control of individual limbs. We use numerical and physical simulation to conduct experiments and to establish the efficacy of this formulation. Using a hierarchical generative model, we show how a humanoid robot can autonomously complete a complex task that necessitates a holistic use of locomotion, manipulation, and grasping. Specifically, we demonstrate the ability of a humanoid robot that can retrieve and transport a box, open and walk through a door to reach the destination, approach and kick a football, while showing robust performance in presence of body damage and ground irregularities. Our findings demonstrated the effectiveness of using human-inspired motor control algorithms, and our method provides a viable hierarchical architecture for the autonomous completion of challenging goal-directed tasks.
83.4AIApr 20Code
Using large language models for embodied planning introduces systematic safety risksTao Zhang, Kaixian Qu, Zhibin Li et al.
Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.
LGMar 22, 2022
Exploring Linear Feature Disentanglement For Neural NetworksTiantian He, Zhibin Li, Yongshun Gong et al.
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs). Due to the complex non-linear characteristic of samples, the objective of those activation functions is to project samples from their original feature space to a linear separable feature space. This phenomenon ignites our interest in exploring whether all features need to be transformed by all non-linear functions in current typical NNs, i.e., whether there exists a part of features arriving at the linear separable feature space in the intermediate layers, that does not require further non-linear variation but an affine transformation instead. To validate the above hypothesis, we explore the problem of linear feature disentanglement for neural networks in this paper. Specifically, we devise a learnable mask module to distinguish between linear and non-linear features. Through our designed experiments we found that some features reach the linearly separable space earlier than the others and can be detached partly from the NNs. The explored method also provides a readily feasible pruning strategy which barely affects the performance of the original model. We conduct our experiments on four datasets and present promising results.
RONov 9, 2023
Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in ClutterGeorgios Tziafas, Yucheng Xu, Arushi Goel et al.
Robots operating in human-centric environments require the integration of visual grounding and grasping capabilities to effectively manipulate objects based on user instructions. This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes. Existing approaches often employ multi-stage pipelines that first segment the referred object and then propose a suitable grasp, and are evaluated in private datasets or simulators that do not capture the complexity of natural indoor scenes. To address these limitations, we develop a challenging benchmark based on cluttered indoor scenes from OCID dataset, for which we generate referring expressions and connect them with 4-DoF grasp poses. Further, we propose a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs. Our results show that vanilla integration of CLIP with pretrained models transfers poorly in our challenging benchmark, while CROG achieves significant improvements both in terms of grounding and grasping. Extensive robot experiments in both simulation and hardware demonstrate the effectiveness of our approach in challenging interactive object grasping scenarios that include clutter.
LGJul 18, 2023
Exploiting Field Dependencies for Learning on Categorical DataZhibin Li, Piotr Koniusz, Lu Zhang et al.
Traditional approaches for learning on categorical data underexploit the dependencies between columns (\aka fields) in a dataset because they rely on the embedding of data points driven alone by the classification/regression loss. In contrast, we propose a novel method for learning on categorical data with the goal of exploiting dependencies between fields. Instead of modelling statistics of features globally (i.e., by the covariance matrix of features), we learn a global field dependency matrix that captures dependencies between fields and then we refine the global field dependency matrix at the instance-wise level with different weights (so-called local dependency modelling) w.r.t. each field to improve the modelling of the field dependencies. Our algorithm exploits the meta-learning paradigm, i.e., the dependency matrices are refined in the inner loop of the meta-learning algorithm without the use of labels, whereas the outer loop intertwines the updates of the embedding matrix (the matrix performing projection) and global dependency matrix in a supervised fashion (with the use of labels). Our method is simple yet it outperforms several state-of-the-art methods on six popular dataset benchmarks. Detailed ablation studies provide additional insights into our method.
AIJul 5, 2024
Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum GamesNathan Herr, Fernando Acero, Roberta Raileanu et al.
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored. To fully benefit from the potential of LLMs, it's essential to understand their ability to function in complex social scenarios. Game theory, which is already used to understand real-world interactions, provides a good framework for assessing these abilities. This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, Stag Hunt and Prisoner Dilemma. Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one of the following systematic biases: positional bias, payoff bias, or behavioural bias. This indicates that LLMs do not fully rely on logical reasoning when making these strategic decisions. As a result, it was found that the LLMs' performance drops when the game configuration is misaligned with the affecting biases. When misaligned, GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B show an average performance drop of 32\%, 25\%, 34\%, and 29\% respectively in Stag Hunt, and 28\%, 16\%, 34\%, and 24\% respectively in Prisoner's Dilemma. Surprisingly, GPT-4o (a top-performing LLM across standard benchmarks) suffers the most substantial performance drop, suggesting that newer models are not addressing these issues. Interestingly, we found that a commonly used method of improving the reasoning capabilities of LLMs, chain-of-thought (CoT) prompting, reduces the biases in GPT-3.5, GPT-4o, and Llama-3-8B but increases the effect of the bias in GPT-4-Turbo, indicating that CoT alone cannot fully serve as a robust solution to this problem. We perform several additional experiments, which provide further insight into these observed behaviours.
AIJun 15, 2023
Real-Time Network-Level Traffic Signal Control: An Explicit Multiagent Coordination MethodWanyuan Wang, Tianchi Qiao, Jinming Ma et al.
Efficient traffic signal control (TSC) has been one of the most useful ways for reducing urban road congestion. Key to the challenge of TSC includes 1) the essential of real-time signal decision, 2) the complexity in traffic dynamics, and 3) the network-level coordination. Recent efforts that applied reinforcement learning (RL) methods can query policies by mapping the traffic state to the signal decision in real-time, however, is inadequate for unexpected traffic flows. By observing real traffic information, online planning methods can compute the signal decisions in a responsive manner. We propose an explicit multiagent coordination (EMC)-based online planning methods that can satisfy adaptive, real-time and network-level TSC. By multiagent, we model each intersection as an autonomous agent, and the coordination efficiency is modeled by a cost (i.e., congestion index) function between neighbor intersections. By network-level coordination, each agent exchanges messages with respect to cost function with its neighbors in a fully decentralized manner. By real-time, the message passing procedure can interrupt at any time when the real time limit is reached and agents select the optimal signal decisions according to the current message. Moreover, we prove our EMC method can guarantee network stability by borrowing ideas from transportation domain. Finally, we test our EMC method in both synthetic and real road network datasets. Experimental results are encouraging: compared to RL and conventional transportation baselines, our EMC method performs reasonably well in terms of adapting to real-time traffic dynamics, minimizing vehicle travel time and scalability to city-scale road networks.
LGJun 6, 2023
Value Functions are Control Barrier Functions: Verification of Safe Policies using Control TheoryDaniel C. H. Tan, Fernando Acero, Robert McCarthy et al.
Guaranteeing safe behaviour of reinforcement learning (RL) policies poses significant challenges for safety-critical applications, despite RL's generality and scalability. To address this, we propose a new approach to apply verification methods from control theory to learned value functions. By analyzing task structures for safety preservation, we formalize original theorems that establish links between value functions and control barrier functions. Further, we propose novel metrics for verifying value functions in safe control tasks and practical implementation details to improve learning. Our work presents a novel method for certificate learning, which unlocks a diversity of verification techniques from control theory for RL policies, and marks a significant step towards a formal framework for the general, scalable, and verifiable design of RL-based control systems. Code and videos are available at this https url: https://rl-cbf.github.io/
ROSep 28, 2023
Intrinsic Language-Guided Exploration for Complex Long-Horizon Robotic Manipulation TasksEleftherios Triantafyllidis, Filippos Christianos, Zhibin Li
Current reinforcement learning algorithms struggle in sparse and complex environments, most notably in long-horizon manipulation tasks entailing a plethora of different sequences. In this work, we propose the Intrinsically Guided Exploration from Large Language Models (IGE-LLMs) framework. By leveraging LLMs as an assistive intrinsic reward, IGE-LLMs guides the exploratory process in reinforcement learning to address intricate long-horizon with sparse rewards robotic manipulation tasks. We evaluate our framework and related intrinsic learning methods in an environment challenged with exploration, and a complex robotic manipulation task challenged by both exploration and long-horizons. Results show IGE-LLMs (i) exhibit notably higher performance over related intrinsic methods and the direct use of LLMs in decision-making, (ii) can be combined and complement existing learning methods highlighting its modularity, (iii) are fairly insensitive to different intrinsic scaling parameters, and (iv) maintain robustness against increased levels of uncertainty and horizons.
81.9DBApr 13
Ozone: A Unified Platform for Transportation ResearchOu Zheng, Ruyi Feng, Yufeng Yang et al.
Intelligent Transportation Systems increasingly depend on heterogeneous data from roadside cameras, UAV imagery, LiDAR, and in-vehicle sensors, yet the lack of unified data standards, model interfaces, and evaluation protocols across these sources hampers reproducibility, cross-dataset benchmarking, and cross-region transferability of research findings. Existing trajectory datasets follow incompatible conventions for coordinate systems, object representations, and metadata fields, forcing researchers to build custom preprocessing pipelines for each dataset and simulator combination. To address these challenges, we propose Ozone, a unified platform for transportation research organized around five interconnected layers -- Hardware, Data, Model, Evaluation, and Prototype -- each with standardized schemas, automated conversion pipelines, and interoperable interfaces. In the first release, the data schema unifies four trajectory datasets -- NGSIM, highD, CitySim, and UTE -- into a canonical format with oriented bounding boxes, kinematic variables, and pre-computed surrogate safety measures. Digital-twin maps in CARLA and calibrated traffic models provide integrated benchmarking environments. Case studies in human-factor research, traffic scene generation, and safety-critical modeling demonstrate that Ozone reduces experiment setup time by 85%, achieves 91% cross-city transfer efficiency for safety models, and improves cross-dataset reproducibility to within 3% variance. The source code and datasets are publicly available.
CVMar 9, 2023
Controllable Video Generation by Learning the Underlying Dynamical System with Neural ODEYucheng Xu, Li Nanbo, Arushi Goel et al.
Videos depict the change of complex dynamical systems over time in the form of discrete image sequences. Generating controllable videos by learning the dynamical system is an important yet underexplored topic in the computer vision community. This paper presents a novel framework, TiV-ODE, to generate highly controllable videos from a static image and a text caption. Specifically, our framework leverages the ability of Neural Ordinary Differential Equations~(Neural ODEs) to represent complex dynamical systems as a set of nonlinear ordinary differential equations. The resulting framework is capable of generating videos with both desired dynamics and content. Experiments demonstrate the ability of the proposed method in generating highly controllable and visually consistent videos, and its capability of modeling dynamical systems. Overall, this work is a significant step towards developing advanced controllable video generation models that can handle complex and dynamic scenes.
AISep 22, 2023
TrTr: A Versatile Pre-Trained Large Traffic Model based on Transformer for Capturing Trajectory Diversity in Vehicle PopulationRuyi Feng, Zhibin Li, Bowen Liu et al.
Understanding trajectory diversity is a fundamental aspect of addressing practical traffic tasks. However, capturing the diversity of trajectories presents challenges, particularly with traditional machine learning and recurrent neural networks due to the requirement of large-scale parameters. The emerging Transformer technology, renowned for its parallel computation capabilities enabling the utilization of models with hundreds of millions of parameters, offers a promising solution. In this study, we apply the Transformer architecture to traffic tasks, aiming to learn the diversity of trajectories within vehicle populations. We analyze the Transformer's attention mechanism and its adaptability to the goals of traffic tasks, and subsequently, design specific pre-training tasks. To achieve this, we create a data structure tailored to the attention mechanism and introduce a set of noises that correspond to spatio-temporal demands, which are incorporated into the structured data during the pre-training process. The designed pre-training model demonstrates excellent performance in capturing the spatial distribution of the vehicle population, with no instances of vehicle overlap and an RMSE of 0.6059 when compared to the ground truth values. In the context of time series prediction, approximately 95% of the predicted trajectories' speeds closely align with the true speeds, within a deviation of 7.5144m/s. Furthermore, in the stability test, the model exhibits robustness by continuously predicting a time series ten times longer than the input sequence, delivering smooth trajectories and showcasing diverse driving behaviors. The pre-trained model also provides a good basis for downstream fine-tuning tasks. The number of parameters of our model is over 50 million.
ROOct 31, 2025
End-to-End Dexterous Arm-Hand VLA Policies via Shared Autonomy: VR Teleoperation Augmented by Autonomous Hand VLA Policy for Efficient Data CollectionYu Cui, Yujian Zhang, Lina Tao et al.
Achieving human-like dexterous manipulation remains a major challenge for general-purpose robots. While Vision-Language-Action (VLA) models show potential in learning skills from demonstrations, their scalability is limited by scarce high-quality training data. Existing data collection methods face inherent constraints: manual teleoperation overloads human operators, while automated planning often produces unnatural motions. We propose a Shared Autonomy framework that divides control between macro and micro motions. A human operator guides the robot's arm pose through intuitive VR teleoperation, while an autonomous DexGrasp-VLA policy handles fine-grained hand control using real-time tactile and visual feedback. This division significantly reduces cognitive load and enables efficient collection of high-quality coordinated arm-hand demonstrations. Using this data, we train an end-to-end VLA policy enhanced with our novel Arm-Hand Feature Enhancement module, which captures both distinct and shared representations of macro and micro movements for more natural coordination. Our Corrective Teleoperation system enables continuous policy improvement through human-in-the-loop failure recovery. Experiments demonstrate that our framework generates high-quality data with minimal manpower and achieves a 90% success rate across diverse objects, including unseen instances. Comprehensive evaluations validate the system's effectiveness in developing dexterous manipulation capabilities.
CVJul 10, 2022
Self-attention on Multi-Shifted Windows for Scene SegmentationLitao Yu, Zhibin Li, Jian Zhang et al.
Scene segmentation in images is a fundamental yet challenging problem in visual content understanding, which is to learn a model to assign every image pixel to a categorical label. One of the challenges for this learning task is to consider the spatial and semantic relationships to obtain descriptive feature representations, so learning the feature maps from multiple scales is a common practice in scene segmentation. In this paper, we explore the effective use of self-attention within multi-scale image windows to learn descriptive visual features, then propose three different strategies to aggregate these feature maps to decode the feature representation for dense prediction. Our design is based on the recently proposed Swin Transformer models, which totally discards convolution operations. With the simple yet effective multi-scale feature learning and aggregation, our models achieve very promising performance on four public scene segmentation datasets, PASCAL VOC2012, COCO-Stuff 10K, ADE20K and Cityscapes.
ROJun 30, 2023
RObotic MAnipulation Network (ROMAN) -- Hybrid Hierarchical Learning for Solving Complex Sequential TasksEleftherios Triantafyllidis, Fernando Acero, Zhaocheng Liu et al.
Solving long sequential tasks poses a significant challenge in embodied artificial intelligence. Enabling a robotic system to perform diverse sequential tasks with a broad range of manipulation skills is an active area of research. In this work, we present a Hybrid Hierarchical Learning framework, the Robotic Manipulation Network (ROMAN), to address the challenge of solving multiple complex tasks over long time horizons in robotic manipulation. ROMAN achieves task versatility and robust failure recovery by integrating behavioural cloning, imitation learning, and reinforcement learning. It consists of a central manipulation network that coordinates an ensemble of various neural networks, each specialising in distinct re-combinable sub-tasks to generate their correct in-sequence actions for solving complex long-horizon manipulation tasks. Experimental results show that by orchestrating and activating these specialised manipulation experts, ROMAN generates correct sequential activations for accomplishing long sequences of sophisticated manipulation tasks and achieving adaptive behaviours beyond demonstrations, while exhibiting robustness to various sensory noises. These results demonstrate the significance and versatility of ROMAN's dynamic adaptability featuring autonomous failure recovery capabilities, and highlight its potential for various autonomous manipulation tasks that demand adaptive motor skills.
CVNov 3, 2021Code
A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognitionZiwang Fu, Feng Liu, Hanyang Wang et al.
The audio-video based multimodal emotion recognition has attracted a lot of attention due to its robust performance. Most of the existing methods focus on proposing different cross-modal fusion strategies. However, these strategies introduce redundancy in the features of different modalities without fully considering the complementary properties between modal information, and these approaches do not guarantee the non-loss of original semantic information during intra- and inter-modal interactions. In this paper, we propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition. Firstly, we perform representation learning for audio and video modalities to obtain the semantic features of the two modalities by efficient ResNeXt and 1D CNN, respectively. Secondly, we feed the features of the two modalities into the cross-modal blocks separately to ensure efficient complementarity and completeness of information through the self-attention mechanism and residual structure. Finally, we obtain the output of emotions by splicing the obtained fused representation with the original representation. To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset. The experimental results show that the proposed CFN-SR achieves the state-of-the-art and obtains 75.76% accuracy with 26.30M parameters. Our code is available at https://github.com/skeletonNN/CFN-SR.
LGDec 1, 2020Code
Field-wise Learning for Multi-field Categorical DataZhibin Li, Jian Zhang, Yongshun Gong et al.
We propose a new method for learning with multi-field categorical data. Multi-field categorical data are usually collected over many heterogeneous groups. These groups can reflect in the categories under a field. The existing methods try to learn a universal model that fits all data, which is challenging and inevitably results in learning a complex model. In contrast, we propose a field-wise learning method leveraging the natural structure of data to learn simple yet efficient one-to-one field-focused models with appropriate constraints. In doing this, the models can be fitted to each category and thus can better capture the underlying differences in data. We present a model that utilizes linear models with variance and low-rank constraints, to help it generalize better and reduce the number of parameters. The model is also interpretable in a field-wise manner. As the dimensionality of multi-field categorical data can be very high, the models applied to such data are mostly over-parameterized. Our theoretical analysis can potentially explain the effect of over-parametrization on the generalization of our model. It also supports the variance constraints in the learning objective. The experiment results on two large-scale datasets show the superior performance of our model, the trend of the generalization error bound, and the interpretability of learning outcomes. Our code is available at https://github.com/lzb5600/Field-wise-Learning.
52.1ROMar 10
SCDP: Learning Humanoid Locomotion from Partial Observations via Mixed-Observation DistillationMilo Carroll, Tianhu Peng, Lingfan Bao et al.
Distilling humanoid locomotion control from offline datasets into deployable policies remains a challenge, as existing methods rely on privileged full-body states that require complex and often unreliable state estimation. We present Sensor-Conditioned Diffusion Policies (SCDP) that enables humanoid locomotion using only onboard sensors, eliminating the need for explicit state estimation. SCDP decouples sensing from supervision through mixed-observation training: diffusion model conditions on sensor histories while being supervised to predict privileged future state-action trajectories, enforcing the model to infer the motion dynamics under partial observability. We further develop restricted denoising, context distribution alignment, and context-aware attention masking to encourage implicit state estimation within the model and to prevent train-deploy mismatch. We validate SCDP on velocity-commanded locomotion and motion reference tracking tasks. In simulation, SCDP achieves near-perfect success on velocity control (99-100%) and 93% tracking success in AMASS test set, performing comparable to privileged baselines while using only onboard sensors. Finally, we deploy the trained policy on a real G1 humanoid at 50 Hz, demonstrating robust real robot locomotion without external sensing or state estimation.
ROApr 30, 2024
Towards Generalist Robot Learning from Internet Video: A SurveyRobert McCarthy, Daniel C. H. Tan, Dominik Schmidt et al.
Scaling deep learning to massive and diverse internet data has driven remarkable breakthroughs in domains such as video generation and natural language processing. Robot learning, however, has thus far failed to replicate this success and remains constrained by a scarcity of available data. Learning from videos (LfV) methods aim to address this data bottleneck by augmenting traditional robot data with large-scale internet video. This video data provides foundational information regarding physical dynamics, behaviours, and tasks, and can be highly informative for general-purpose robots. This survey systematically examines the emerging field of LfV. We first outline essential concepts, including detailing fundamental LfV challenges such as distribution shift and missing action labels in video data. Next, we comprehensively review current methods for extracting knowledge from large-scale internet video, overcoming LfV challenges, and improving robot learning through video-informed training. The survey concludes with a critical discussion of future opportunities. Here, we emphasize the need for scalable foundation model approaches that can leverage the full range of available internet video and enhance the learning of robot policies and dynamics models. Overall, the survey aims to inform and catalyse future LfV research, driving progress towards general-purpose robots.
ROSep 5, 2025
RoboBallet: Planning for Multi-Robot Reaching with Graph Neural Networks and Reinforcement LearningMatthew Lai, Keegan Go, Zhibin Li et al.
Modern robotic manufacturing requires collision-free coordination of multiple robots to complete numerous tasks in shared, obstacle-rich workspaces. Although individual tasks may be simple in isolation, automated joint task allocation, scheduling, and motion planning under spatio-temporal constraints remain computationally intractable for classical methods at real-world scales. Existing multi-arm systems deployed in the industry rely on human intuition and experience to design feasible trajectories manually in a labor-intensive process. To address this challenge, we propose a reinforcement learning (RL) framework to achieve automated task and motion planning, tested in an obstacle-rich environment with eight robots performing 40 reaching tasks in a shared workspace, where any robot can perform any task in any order. Our approach builds on a graph neural network (GNN) policy trained via RL on procedurally-generated environments with diverse obstacle layouts, robot configurations, and task distributions. It employs a graph representation of scenes and a graph policy neural network trained through reinforcement learning to generate trajectories of multiple robots, jointly solving the sub-problems of task allocation, scheduling, and motion planning. Trained on large randomly generated task sets in simulation, our policy generalizes zero-shot to unseen settings with varying robot placements, obstacle geometries, and task poses. We further demonstrate that the high-speed capability of our solution enables its use in workcell layout optimization, improving solution times. The speed and scalability of our planner also open the door to new capabilities such as fault-tolerant planning and online perception-based re-planning, where rapid adaptation to dynamic task sets is required.
CLMar 28, 2025
Scaling Laws in Scientific Discovery with AI and Robot ScientistsPengsong Zhang, Heng Zhang, Huazhe Xu et al.
Scientific discovery is poised for rapid advancement through advanced robotics and artificial intelligence. Current scientific practices face substantial limitations as manual experimentation remains time-consuming and resource-intensive, while multidisciplinary research demands knowledge integration beyond individual researchers' expertise boundaries. Here, we envision an autonomous generalist scientist (AGS) concept combines agentic AI and embodied robotics to automate the entire research lifecycle. This system could dynamically interact with both physical and virtual environments while facilitating the integration of knowledge across diverse scientific disciplines. By deploying these technologies throughout every research stage -- spanning literature review, hypothesis generation, experimentation, and manuscript writing -- and incorporating internal reflection alongside external feedback, this system aims to significantly reduce the time and resources needed for scientific discovery. Building on the evolution from virtual AI scientists to versatile generalist AI-based robot scientists, AGS promises groundbreaking potential. As these autonomous systems become increasingly integrated into the research process, we hypothesize that scientific discovery might adhere to new scaling laws, potentially shaped by the number and capabilities of these autonomous systems, offering novel perspectives on how knowledge is generated and evolves. The adaptability of embodied robots to extreme environments, paired with the flywheel effect of accumulating scientific knowledge, holds the promise of continually pushing beyond both physical and intellectual frontiers.
RODec 21, 2023
Modular Neural Network Policies for Learning In-Flight Object Catching with a Robot Hand-Arm SystemWenbin Hu, Fernando Acero, Eleftherios Triantafyllidis et al.
We present a modular framework designed to enable a robot hand-arm system to learn how to catch flying objects, a task that requires fast, reactive, and accurately-timed robot motions. Our framework consists of five core modules: (i) an object state estimator that learns object trajectory prediction, (ii) a catching pose quality network that learns to score and rank object poses for catching, (iii) a reaching control policy trained to move the robot hand to pre-catch poses, (iv) a grasping control policy trained to perform soft catching motions for safe and robust grasping, and (v) a gating network trained to synthesize the actions given by the reaching and grasping policy. The former two modules are trained via supervised learning and the latter three use deep reinforcement learning in a simulated environment. We conduct extensive evaluations of our framework in simulation for each module and the integrated system, to demonstrate high success rates of in-flight catching and robustness to perturbations and sensory noise. Whilst only simple cylindrical and spherical objects are used for training, the integrated system shows successful generalization to a variety of household objects that are not used in training.
ROMar 21, 2024
Distilling Reinforcement Learning Policies for Interpretable Robot Locomotion: Gradient Boosting Machines and Symbolic RegressionFernando Acero, Zhibin Li
Recent advancements in reinforcement learning (RL) have led to remarkable achievements in robot locomotion capabilities. However, the complexity and ``black-box'' nature of neural network-based RL policies hinder their interpretability and broader acceptance, particularly in applications demanding high levels of safety and reliability. This paper introduces a novel approach to distill neural RL policies into more interpretable forms using Gradient Boosting Machines (GBMs), Explainable Boosting Machines (EBMs) and Symbolic Regression. By leveraging the inherent interpretability of generalized additive models, decision trees, and analytical expressions, we transform opaque neural network policies into more transparent ``glass-box'' models. We train expert neural network policies using RL and subsequently distill them into (i) GBMs, (ii) EBMs, and (iii) symbolic policies. To address the inherent distribution shift challenge of behavioral cloning, we propose to use the Dataset Aggregation (DAgger) algorithm with a curriculum of episode-dependent alternation of actions between expert and distilled policies, to enable efficient distillation of feedback control policies. We evaluate our approach on various robot locomotion gaits -- walking, trotting, bounding, and pacing -- and study the importance of different observations in joint actions for distilled policies using various methods. We train neural expert policies for 205 hours of simulated experience and distill interpretable policies with only 10 minutes of simulated interaction for each gait using the proposed method.
APMar 25, 2025
Structured and sparse partial least squares coherence for multivariate cortico-muscular analysisJingyao Sun, Qilu Zhang, Di Ma et al.
Multivariate cortico-muscular analysis has recently emerged as a promising approach for evaluating the corticospinal neural pathway. However, current multivariate approaches encounter challenges such as high dimensionality and limited sample sizes, thus restricting their further applications. In this paper, we propose a structured and sparse partial least squares coherence algorithm (ssPLSC) to extract shared latent space representations related to cortico-muscular interactions. Our approach leverages an embedded optimization framework by integrating a partial least squares (PLS)-based objective function, a sparsity constraint and a connectivity-based structured constraint, addressing the generalizability, interpretability and spatial structure. To solve the optimization problem, we develop an efficient alternating iterative algorithm within a unified framework and prove its convergence experimentally. Extensive experimental results from one synthetic and several real-world datasets have demonstrated that ssPLSC can achieve competitive or better performance over some representative multivariate cortico-muscular fusion methods, particularly in scenarios characterized by limited sample sizes and high noise levels. This study provides a novel multivariate fusion method for cortico-muscular analysis, offering a transformative tool for the evaluation of corticospinal pathway integrity in neurological disorders.
CVMar 6, 2025
NsBM-GAT: A Non-stationary Block Maximum and Graph Attention Framework for General Traffic Crash Risk PredictionKequan Chen, Pan Liu, Yuxuan Wang et al.
Accurate prediction of traffic crash risks for individual vehicles is essential for enhancing vehicle safety. While significant attention has been given to traffic crash risk prediction, existing studies face two main challenges: First, due to the scarcity of individual vehicle data before crashes, most models rely on hypothetical scenarios deemed dangerous by researchers. This raises doubts about their applicability to actual pre-crash conditions. Second, some crash risk prediction frameworks were learned from dashcam videos. Although such videos capture the pre-crash behavior of individual vehicles, they often lack critical information about the movements of surrounding vehicles. However, the interaction between a vehicle and its surrounding vehicles is highly influential in crash occurrences. To overcome these challenges, we propose a novel non-stationary extreme value theory (EVT), where the covariate function is optimized in a nonlinear fashion using a graph attention network. The EVT component incorporates the stochastic nature of crashes through probability distribution, which enhances model interpretability. Notably, the nonlinear covariate function enables the model to capture the interactive behavior between the target vehicle and its multiple surrounding vehicles, facilitating crash risk prediction across different driving tasks. We train and test our model using 100 sets of vehicle trajectory data before real crashes, collected via drones over three years from merging and weaving segments. We demonstrate that our model successfully learns micro-level precursors of crashes and fits a more accurate distribution with the aid of the nonlinear covariate function. Our experiments on the testing dataset show that the proposed model outperforms existing models by providing more accurate predictions for both rear-end and sideswipe crashes simultaneously.
CVJun 26, 2024
3D Feature Distillation with Object-Centric PriorsGeorgios Tziafas, Yucheng Xu, Zhibin Li et al.
Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel-level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, both in terms of grounding accuracy, as well as segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuse features at object-level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic grasping in clutter.
CVDec 21, 2021
Distribution-aware Margin Calibration for Semantic Segmentation in ImagesLitao Yu, Zhibin Li, Min Xu et al.
The Jaccard index, also known as Intersection-over-Union (IoU), is one of the most critical evaluation metrics in image semantic segmentation. However, direct optimization of IoU score is very difficult because the learning objective is neither differentiable nor decomposable. Although some algorithms have been proposed to optimize its surrogates, there is no guarantee provided for the generalization ability. In this paper, we propose a margin calibration method, which can be directly used as a learning objective, for an improved generalization of IoU over the data-distribution, underpinned by a rigid lower bound. This scheme theoretically ensures a better segmentation performance in terms of IoU score. We evaluated the effectiveness of the proposed margin calibration method on seven image datasets, showing substantial improvements in IoU score over other learning objectives using deep segmentation models.
CVOct 22, 2021
EvoGAN: An Evolutionary Computation Assisted GANFeng Liu, HanYang Wang, Jiahao Zhang et al.
The image synthesis technique is relatively well established which can generate facial images that are indistinguishable even by human beings. However, all of these approaches uses gradients to condition the output, resulting in the outputting the same image with the same input. Also, they can only generate images with basic expression or mimic an expression instead of generating compound expression. In real life, however, human expressions are of great diversity and complexity. In this paper, we propose an evolutionary algorithm (EA) assisted GAN, named EvoGAN, to generate various compound expressions with any accurate target compound expression. EvoGAN uses an EA to search target results in the data distribution learned by GAN. Specifically, we use the Facial Action Coding System (FACS) as the encoding of an EA and use a pre-trained GAN to generate human facial images, and then use a pre-trained classifier to recognize the expression composition of the synthesized images as the fitness function to guide the search of the EA. Combined random searching algorithm, various images with the target expression can be easily sythesized. Quantitative and Qualitative results are presented on several compound expressions, and the experimental results demonstrate the feasibility and the potential of EvoGAN.
ROSep 28, 2021
Learning Perceptual Locomotion on Uneven Terrains using Sparse Visual ObservationsFernando Acero, Kai Yuan, Zhibin Li
To proactively navigate and traverse various terrains, active use of visual perception becomes indispensable. We aim to investigate the feasibility and performance of using sparse visual observations to achieve perceptual locomotion over a range of common terrains (steps, ramps, gaps, and stairs) in human-centered environments. We formulate a selection of sparse visual inputs suitable for locomotion over the terrains of interest, and propose a learning framework to integrate exteroceptive and proprioceptive states. We specifically design the state observations and a training curriculum to learn feedback control policies effectively over a range of different terrains. We extensively validate and benchmark the learned policy in various tasks: omnidirectional walking on flat ground, and forward locomotion over various obstacles, showing high success rate of traversability. Furthermore, we study exteroceptive ablations and evaluate policy generalization by adding various levels of noise and testing on new unseen terrains. We demonstrate the capabilities of autonomous perceptual locomotion that can be achieved by only using sparse visual observations from direct depth measurements, which are easily available from a Lidar or RGB-D sensor, showing robust ascent and descent over high stairs of 20 cm height, i.e., 50% leg length, and robustness against noise and unseen terrains.
ROSep 23, 2021
Accessibility-Based Clustering for Efficient Learning of Locomotion SkillsChong Zhang, Wanming Yu, Zhibin Li
For model-free deep reinforcement learning of quadruped locomotion, the initialization of robot configurations is crucial for data efficiency and robustness. This work focuses on algorithmic improvements of data efficiency and robustness simultaneously through automatic discovery of initial states, which is achieved by our proposed K-Access algorithm based on accessibility metrics. Specifically, we formulated accessibility metrics to measure the difficulty of transitions between two arbitrary states, and proposed a novel K-Access algorithm for state-space clustering that automatically discovers the centroids of the static-pose clusters based on the accessibility metrics. By using the discovered centroidal static poses as the initial states, we can improve data efficiency by reducing redundant explorations, and enhance the robustness by more effective explorations from the centroids to sampled poses. Focusing on fall recovery as a very hard set of locomotion skills, we validated our method extensively using an 8-DoF quadrupedal robot Bittle. Compared to the baselines, the learning curve of our method converges much faster, requiring only 60% of training episodes. With our method, the robot can successfully recover to standing poses within 3 seconds in 99.4% of the test cases. Moreover, the method can generalize to other difficult skills successfully, such as backflipping.
ROSep 9, 2021
Fine Manipulation and Dynamic Interaction in Haptic TeleoperationCarlo Tiseo, Quentin Rouxel, Zhibin Li et al.
The teleoperation of robots enables remote intervention in distant and dangerous tasks without putting the operator in harm's way. However, remote operation faces fundamental challenges due to limits in communication delays. The proposed work improves the performances of teleoperation architecture based on Fractal Impedance Controller (FIC) by integrating into the haptic teleoperation pipeline a postural optimisation that also accounts for the replica robots' physical limitations. This update improves dynamic interactions by trading off tracking accuracy to maintain the system within its power limits. Thus, allowing fine manipulation without renouncing the robustness of the FIC controller. Additionally, the proposed method allows an online trade-off between tracking the autonomous trajectory and executing the teleoperated command, allowing their safe superimposition. The validated experimental results show that the proposed method is robust to increased communication delays. Moreover, we demonstrated that the remote teleoperated robot remains stable and safe to interact with, even when the communication with the master side is abruptly interrupted. with, even when the communication with the master side is abruptly interrupted.
ROSep 9, 2021
Robust Impedance Control for Dexterous Interaction Using Fractal Impedance Controller with IK-OptimisationCarlo Tiseo, Quentin Rouxel, Zhibin Li et al.
Robust dynamic interactions are required to move robots in daily environments alongside humans. Optimisation and learning methods have been used to mimic and reproduce human movements. However, they are often not robust and their generalisation is limited. This work proposed a hierarchical control architecture for robot manipulators and provided capabilities of reproducing human-like motions during unknown interaction dynamics. Our results show that the reproduced end-effector trajectories can preserve the main characteristics of the initial human motion recorded via a motion capture system, and are robust against external perturbations. The data indicate that some detailed movements are hard to reproduce due to the physical limits of the hardware that cannot reach the same velocity recorded in human movements. Nevertheless, these technical problems can be addressed by using better hardware and our proposed algorithms can still be applied to produce imitated motions.
ROSep 9, 2021
Learning Vision-Guided Dynamic Locomotion Over Challenging TerrainsZhaocheng Liu, Fernando Acero, Zhibin Li
Legged robots are becoming increasingly powerful and popular in recent years for their potential to bring the mobility of autonomous agents to the next level. This work presents a deep reinforcement learning approach that learns a robust Lidar-based perceptual locomotion policy in a partially observable environment using Proximal Policy Optimisation. Visual perception is critical to actively overcome challenging terrains, and to do so, we propose a novel learning strategy: Dynamic Reward Strategy (DRS), which serves as effective heuristics to learn a versatile gait using a neural network architecture without the need to access the history data. Moreover, in a modified version of the OpenAI gym environment, the proposed work is evaluated with scores over 90% success rate in all tested challenging terrains.
ROAug 10, 2021
Learning Autonomous Mobility Using Real Demonstration DataJiacheng Gu, Zhibin Li
This work proposed an efficient learning-based framework to learn feedback control policies from human teleoperated demonstrations, which achieved obstacle negotiation, staircase traversal, slipping control and parcel delivery for a tracked robot. Due to uncertainties in real-world scenarios, eg obstacle and slippage, closed-loop feedback control plays an important role in improving robustness and resilience, but the control laws are difficult to program manually for achieving autonomous behaviours. We formulated an architecture based on a long-short-term-memory (LSTM) neural network, which effectively learn reactive control policies from human demonstrations. Using datasets from a few real demonstrations, our algorithm can directly learn successful policies, including obstacle-negotiation, stair-climbing and delivery, fall recovery and corrective control of slippage. We proposed decomposition of complex robot actions to reduce the difficulty of learning the long-term dependencies. Furthermore, we proposed a method to efficiently handle non-optimal demos and to learn new skills, since collecting enough demonstration can be time-consuming and sometimes very difficult on a real robotic system.
HCJun 12, 2021
Metrics for 3D Object Pointing and Manipulation in Virtual RealityEleftherios Triantafyllidis, Wenbin Hu, Christopher McGreavy et al.
Assessing the performance of human movements during teleoperation and virtual reality is a challenging problem, particularly in 3D space due to complex spatial settings. Despite the presence of a multitude of metrics, a compelling standardized 3D metric is yet missing, aggravating inter-study comparability between different studies. Hence, evaluating human performance in virtual environments is a long-standing research goal, and a performance metric that combines two or more metrics under one formulation remains largely unexplored, particularly in higher dimensions. The absence of such a metric is primarily attributed to the discrepancies between pointing and manipulation, the complex spatial variables in 3D, and the combination of translational and rotational movements altogether. In this work, four experiments were designed and conducted with progressively higher spatial complexity to study and compare existing metrics thoroughly. The research goal was to quantify the difficulty of these 3D tasks and model human performance sufficiently in full 3D peripersonal space. Consequently, a new model extension has been proposed and its applicability has been validated across all the experimental results, showing improved modelling and representation of human performance in combined movements of 3D object pointing and manipulation tasks than existing work. Lastly, the implications on 3D interaction, teleoperation and object task design in virtual reality are discussed.
ROMay 3, 2021
Reachability Map for Diverse Balancing Strategies and Energy Efficient Stepping of HumanoidsChristopher McGreavy, Zhibin Li
In legged locomotion, the relationship between different gait behaviors and energy consumption must consider the full-body dynamics and the robot control as a whole, which cannot be captured by simple models. This work studies the robot dynamics and whole-body optimal control as a coupled system to investigate energy consumption during balance recovery. We developed a 2-phase nonlinear optimization pipeline for dynamic stepping, which generates reachability maps showing complex energy-stepping relations. We optimize gait parameters to search all reachable locations and quantify the energy cost during dynamic transitions, which allows studying the relationship between energy consumption and stepping locations given different initial conditions. We found that to achieve efficient actuation, the stepping location and timing can have simple approximations close to the underlying optimality. Despite the complexity of this nonlinear process, we show that near-minimal effort stepping locations fall within a region of attractions, rather than a narrow solution space suggested by a simple model. This provides new insights into the non-uniqueness of near-optimal solutions in robot motion planning and control, and the diversity of stepping behavior in humans.
HCMar 23, 2021
Considerations and Challenges of Measuring Operator Performance in Telepresence and Teleoperation Entailing Mixed Reality TechnologiesEleftherios Triantafyllidis, Zhibin Li
Assessing human performance in robotic scenarios such as those seen in telepresence and teleoperation has always been a challenging task. With the recent spike in mixed reality technologies and the subsequent focus by researchers, new pathways have opened in elucidating human perception and maximising overall immersion. Yet with the multitude of different assessment methods in evaluating operator performance in virtual environments within the field of HCI and HRI, inter-study comparability and transferability are limited. In this short paper, we present a brief overview of existing methods in assessing operator performance including subjective and objective approaches while also attempting to capture future technical challenges and frontiers. The ultimate goal is to assist and pinpoint readers towards potentially important directions with the future hope of providing a unified immersion framework for teleoperation and telepresence by standardizing a set of guidelines and evaluation methods.
ROJan 20, 2021
Trajectory optimization for contact-rich motions using implicit differential dynamic programmingIordanis Chatzinikolaidis, Zhibin Li
This paper presents a novel approach using sensitivity analysis for generalizing Differential Dynamic Programming (DDP) to systems characterized by implicit dynamics, such as those modelled via inverse dynamics and variational or implicit integrators. It leads to a more general formulation of DDP, enabling for example the use of the faster recursive Newton-Euler inverse dynamics. We leverage the implicit formulation for precise and exact contact modelling in DDP, where we focus on two contributions: (1) Contact dynamics in acceleration level that enables high-order integration schemes; (2) Formulation using an invertible contact model in the forward pass and a closed form solution in the backward pass to improve the numerical resolution of contacts. The performance of the proposed framework is validated (1) by comparing implicit versus explicit DDP for the swing-up of a double pendulum, and (2) by planning motions for two tasks using a single leg model making multi-body contacts with the environment: standing up from ground, where a priori contact enumeration is challenging, and maintaining balance under an external perturbation.
ROJan 19, 2021
Meta-Reinforcement Learning for Adaptive Motor Control in Changing Robot Dynamics and EnvironmentsTimothée Anne, Jack Wilkinson, Zhibin Li
This work developed a meta-learning approach that adapts the control policy on the fly to different changing conditions for robust locomotion. The proposed method constantly updates the interaction model, samples feasible sequences of actions of estimated the state-action trajectories, and then applies the optimal actions to maximize the reward. To achieve online model adaptation, our proposed method learns different latent vectors of each training condition, which are selected online given the newly collected data. Our work designs appropriate state space and reward functions, and optimizes feasible actions in an MPC fashion which are then sampled directly in the joint space considering constraints, hence requiring no prior design of specific walking gaits. We further demonstrate the robot's capability of detecting unexpected changes during interaction and adapting control policies quickly. The extensive validation on the SpotMicro robot in a physics simulation shows adaptive and robust locomotion skills under varying ground friction, external pushes, and different robot models including hardware faults and changes.
HCJan 1, 2021
The Challenges in Modeling Human Performance in 3D Space with Fitts' LawEleftherios Triantafyllidis, Zhibin Li
With the rapid growth in virtual reality technologies, object interaction is becoming increasingly more immersive, elucidating human perception and leading to promising directions towards evaluating human performance under different settings. This spike in technological growth exponentially increased the need for a human performance metric in 3D space. Fitts' law is perhaps the most widely used human prediction model in HCI history attempting to capture human movement in lower dimensions. Despite the collective effort towards deriving an advanced extension of a 3D human performance model based on Fitts' law, a standardized metric is still missing. Moreover, most of the extensions to date assume or limit their findings to certain settings, effectively disregarding important variables that are fundamental to 3D object interaction. In this review, we investigate and analyze the most prominent extensions of Fitts' law and compare their characteristics pinpointing to potentially important aspects for deriving a higher-dimensional performance model. Lastly, we mention the complexities, frontiers as well as potential challenges that may lay ahead.
RODec 10, 2020
Multi-expert learning of adaptive legged locomotionChuanyu Yang, Kai Yuan, Qiuguo Zhu et al.
Achieving versatile robot locomotion requires motor skills which can adapt to previously unseen situations. We propose a Multi-Expert Learning Architecture (MELA) that learns to generate adaptive skills from a group of representative expert skills. During training, MELA is first initialised by a distinct set of pre-trained experts, each in a separate deep neural network (DNN). Then by learning the combination of these DNNs using a Gating Neural Network (GNN), MELA can acquire more specialised experts and transitional skills across various locomotion modes. During runtime, MELA constantly blends multiple DNNs and dynamically synthesises a new DNN to produce adaptive behaviours in response to changing situations. This approach leverages the advantages of trained expert skills and the fast online synthesis of adaptive policies to generate responsive motor skills during the changing tasks. Using a unified MELA framework, we demonstrated successful multi-skill locomotion on a real quadruped robot that performed coherent trotting, steering, and fall recovery autonomously, and showed the merit of multi-expert learning generating behaviours which can adapt to unseen scenarios.
RONov 27, 2020
Learning Hybrid Locomotion Skills -- Learn to Exploit Residual Dynamics and Modulate Model-based Gait ControlMohammadreza Kasaei, Miguel Abreu, Nuno Lau et al.
This work aims to combine machine learning and control approaches for legged robots, and developed a hybrid framework to achieve new capabilities of balancing against external perturbations. The framework embeds a kernel which is a fully parametric closed-loop gait generator based on analytical control. On top of that, a neural network with symmetric partial data augmentation learns to automatically adjust the parameters for the gait kernel and to generate compensatory actions for all joints as the residual dynamics, thus significantly augmenting the stability under unexpected perturbations. The performance of the proposed framework was evaluated across a set of challenging simulated scenarios. The results showed considerable improvements compared to the baseline in recovering from large external forces. Moreover, the produced behaviours are more natural, human-like and robust against noisy sensing.
CVNov 3, 2020
Distribution-aware Margin Calibration for Medical Image SegmentationZhibin Li, Litao Yu, Jian Zhang
The Jaccard index, also known as Intersection-over-Union (IoU score), is one of the most critical evaluation metrics in medical image segmentation. However, directly optimizing the mean IoU (mIoU) score over multiple objective classes is an open problem. Although some algorithms have been proposed to optimize its surrogates, there is no guarantee provided for their generalization ability. In this paper, we present a novel data-distribution-aware margin calibration method for a better generalization of the mIoU over the whole data-distribution, underpinned by a rigid lower bound. This scheme ensures a better segmentation performance in terms of IoU scores in practice. We evaluate the effectiveness of the proposed margin calibration method on two medical image segmentation datasets, showing substantial improvements of IoU scores over other learning schemes using deep segmentation models.
RONov 1, 2020
Efficient Learning of Control Policies for Robust Quadruped Bounding using Pretrained Neural NetworksZhicheng Wang, Anqiao Li, Yixiao Zheng et al.
Bounding is one of the important gaits in quadrupedal locomotion for negotiating obstacles. The authors proposed an effective approach that can learn robust bounding gaits more efficiently despite its large variation in dynamic body movements. The authors first pretrained the neural network (NN) based on data from a robot operated by conventional model based controllers, and then further optimised the pretrained NN via deep reinforcement learning (DRL). In particular, the authors designed a reward function considering contact points and phases to enforce the gait symmetry and periodicity, which improved the bounding performance. The NN based feedback controller was learned in the simulation and directly deployed on the real quadruped robot Jueying Mini successfully. A variety of environments are presented both indoors and outdoors with the authors approach. The authors approach shows efficient computing and good locomotion results by the Jueying Mini quadrupedal robot bounding over uneven terrain.
ROJul 22, 2020
Contact-Implicit Trajectory Optimization using an Analytically Solvable Contact Model for Locomotion on Variable GroundIordanis Chatzinikolaidis, Yangwei You, Zhibin Li
This paper presents a novel contact-implicit trajectory optimization method using an analytically solvable contact model to enable planning of interactions with hard, soft, and slippery environments. Specifically, we propose a novel contact model that can be computed in closed-form, satisfies friction cone constraints and can be embedded into direct trajectory optimization frameworks without complementarity constraints. The closed-form solution decouples the computation of the contact forces from other actuation forces and this property is used to formulate a minimal direct optimization problem expressed with configuration variables only. Our simulation study demonstrates the advantages over the rigid contact model and a trajectory optimization approach based on complementarity constraints. The proposed model enables physics-based optimization for a wide range of interactions with hard, slippery, and soft grounds in a unified manner expressed by two parameters only. By computing trotting and jumping motions for a quadruped robot, the proposed optimization demonstrates the versatility for multi-contact motion planning on surfaces with different physical properties.
HCJul 18, 2020
Applications of brain imaging methods in driving behaviour researchMilad Haghani, Michiel C. J. Bliemer, Bilal Farooq et al.
Applications of neuroimaging methods have substantially contributed to the scientific understanding of human factors during driving by providing a deeper insight into the neuro-cognitive aspects of driver brain. This has been achieved by conducting simulated (and occasionally, field) driving experiments while collecting driver brain signals of certain types. Here, this sector of studies is comprehensively reviewed at both macro and micro scales. Different themes of neuroimaging driving behaviour research are identified and the findings within each theme are synthesised. The surveyed literature has reported on applications of four major brain imaging methods. These include Functional Magnetic Resonance Imaging (fMRI), Electroencephalography (EEG), Functional Near-Infrared Spectroscopy (fNIRS) and Magnetoencephalography (MEG), with the first two being the most common methods in this domain. While collecting driver fMRI signal has been particularly instrumental in studying neural correlates of intoxicated driving (e.g. alcohol or cannabis) or distracted driving, the EEG method has been predominantly utilised in relation to the efforts aiming at development of automatic fatigue/drowsiness detection systems, a topic to which the literature on neuro-ergonomics of driving particularly has shown a spike of interest within the last few years. The survey also reveals that topics such as driver brain activity in semi-automated settings or the brain activity of drivers with brain injuries or chronic neurological conditions have by contrast been investigated to a very limited extent. Further, potential topics in relation to driving behaviour are identified that could benefit from the adoption of neuroimaging methods in future studies.
ROMay 20, 2020
Learning natural locomotion behaviors for humanoid robots using human knowledgeChuanyu Yang, Kai Yuan, Shuai Heng et al.
This paper presents a new learning framework that leverages the knowledge from imitation learning, deep reinforcement learning, and control theories to achieve human-style locomotion that is natural, dynamic, and robust for humanoids. We proposed novel approaches to introduce human bias, i.e. motion capture data and a special Multi-Expert network structure. We used the Multi-Expert network structure to smoothly blend behavioral features, and used the augmented reward design for the task and imitation rewards. Our reward design is composable, tunable, and explainable by using fundamental concepts from conventional humanoid control. We rigorously validated and benchmarked the learning framework which consistently produced robust locomotion behaviors in various test scenarios. Further, we demonstrated the capability of learning robust and versatile policies in the presence of disturbances, such as terrain irregularities and external pushes.
CVMay 13, 2020
Towards Better Graph Representation: Two-Branch Collaborative Graph Neural Networks for Multimodal Marketing Intention DetectionLu Zhang, Jian Zhang, Zhibin Li et al.
Inspired by the fact that spreading and collecting information through the Internet becomes the norm, more and more people choose to post for-profit contents (images and texts) in social networks. Due to the difficulty of network censors, malicious marketing may be capable of harming society. Therefore, it is meaningful to detect marketing intentions online automatically. However, gaps between multimodal data make it difficult to fuse images and texts for content marketing detection. To this end, this paper proposes Two-Branch Collaborative Graph Neural Networks to collaboratively represent multimodal data by Graph Convolution Networks (GCNs) in an end-to-end fashion. Experimental results demonstrate that our proposed method achieves superior graph classification performance for marketing intention detection.