Botao He

RO
h-index5
9papers
147citations
Novelty51%
AI Score47

9 Papers

CVAug 13, 2023Code
Condition-Adaptive Graph Convolution Learning for Skeleton-Based Gait Recognition

Xiaohu Huang, Xinggang Wang, Zhidianqiu Jin et al.

Graph convolutional networks have been widely applied in skeleton-based gait recognition. A key challenge in this task is to distinguish the individual walking styles of different subjects across various views. Existing state-of-the-art methods employ uniform convolutions to extract features from diverse sequences and ignore the effects of viewpoint changes. To overcome these limitations, we propose a condition-adaptive graph (CAG) convolution network that can dynamically adapt to the specific attributes of each skeleton sequence and the corresponding view angle. In contrast to using fixed weights for all joints and sequences, we introduce a joint-specific filter learning (JSFL) module in the CAG method, which produces sequence-adaptive filters at the joint level. The adaptive filters capture fine-grained patterns that are unique to each joint, enabling the extraction of diverse spatial-temporal information about body parts. Additionally, we design a view-adaptive topology learning (VATL) module that generates adaptive graph topologies. These graph topologies are used to correlate the joints adaptively according to the specific view conditions. Thus, CAG can simultaneously adjust to various walking styles and viewpoints. Experiments on the two most widely used datasets (i.e., CASIA-B and OU-MVLP) show that CAG surpasses all previous skeleton-based methods. Moreover, the recognition performance can be enhanced by simply combining CAG with appearance-based methods, demonstrating the ability of CAG to provide useful complementary information.The source code will be available at https://github.com/OliverHxh/CAG.

CVApr 7, 2022Code
Multi-scale Context-aware Network with Transformer for Gait Recognition

Duowang Zhu, Xiaohu Huang, Xinggang Wang et al.

Although gait recognition has drawn increasing research attention recently, since the silhouette differences are quite subtle in spatial domain, temporal feature representation is crucial for gait recognition. Inspired by the observation that humans can distinguish gaits of different subjects by adaptively focusing on clips of varying time scales, we propose a multi-scale context-aware network with transformer (MCAT) for gait recognition. MCAT generates temporal features across three scales, and adaptively aggregates them using contextual information from both local and global perspectives. Specifically, MCAT contains an adaptive temporal aggregation (ATA) module that performs local relation modeling followed by global relation modeling to fuse the multi-scale features. Besides, in order to remedy the spatial feature corruption resulting from temporal operations, MCAT incorporates a salient spatial feature learning (SSFL) module to select groups of discriminative spatial features. Extensive experiments conducted on three datasets demonstrate the state-of-the-art performance. Concretely, we achieve rank-1 accuracies of 98.7%, 96.2% and 88.7% under normal walking, bag-carrying and coat-wearing conditions on CASIA-B, 97.5% on OU-MVLP and 50.6% on GREW. The source code will be available at https://github.com/zhuduowang/MCAT.git.

ROJul 1, 2024
Active Human Pose Estimation via an Autonomous UAV Agent

Jingxi Chen, Botao He, Chahat Deep Singh et al.

One of the core activities of an active observer involves moving to secure a "better" view of the scene, where the definition of "better" is task-dependent. This paper focuses on the task of human pose estimation from videos capturing a person's activity. Self-occlusions within the scene can complicate or even prevent accurate human pose estimation. To address this, relocating the camera to a new vantage point is necessary to clarify the view, thereby improving 2D human pose estimation. This paper formalizes the process of achieving an improved viewpoint. Our proposed solution to this challenge comprises three main components: a NeRF-based Drone-View Data Generation Framework, an On-Drone Network for Camera View Error Estimation, and a Combined Planner for devising a feasible motion plan to reposition the camera based on the predicted errors for camera views. The Data Generation Framework utilizes NeRF-based methods to generate a comprehensive dataset of human poses and activities, enhancing the drone's adaptability in various scenarios. The Camera View Error Estimation Network is designed to evaluate the current human pose and identify the most promising next viewing angles for the drone, ensuring a reliable and precise pose estimation from those angles. Finally, the combined planner incorporates these angles while considering the drone's physical and environmental limitations, employing efficient algorithms to navigate safe and effective flight paths. This system represents a significant advancement in active 2D human pose estimation for an autonomous UAV agent, offering substantial potential for applications in aerial cinematography by improving the performance of autonomous human pose estimation and maintaining the operational safety and efficiency of UAVs.

59.9CVMar 16
FEEL (Force-Enhanced Egocentric Learning): A Dataset for Physical Action Understanding

Eadom Dessalene, Botao He, Michael Maynord et al.

We introduce FEEL (Force-Enhanced Egocentric Learning), the first large-scale dataset pairing force measurements gathered from custom piezoresistive gloves with egocentric video. Our gloves enable scalable data collection, and FEEL contains approximately 3 million force-synchronized frames of natural unscripted manipulation in kitchen environments, with 45% of frames involving hand-object contact. Because force is the underlying cause that drives physical interaction, it is a critical primitive for physical action understanding. We demonstrate the utility of force for physical action understanding through application of FEEL to two families of tasks: (1) contact understanding, where we jointly perform temporal contact segmentation and pixel-level contacted object segmentation; and, (2) action representation learning, where force prediction serves as a self-supervised pretraining objective for video backbones. We achieve state-of-the-art temporal contact segmentation results and competitive pixel-level segmentation results without any need for manual contacted object segmentation annotations. Furthermore we demonstrate that action representation learning with FEEL improves transfer performance on action understanding tasks without any manual labels over EPIC-Kitchens, SomethingSomething-V2, EgoExo4D and Meccano.

91.3ROMay 24
HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

Zhi, Wang, Botao He et al.

Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity-level representation of hand-object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot-data-free, hardware-agnostic, data-efficient, and zero-shot human-to-robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and robustly transfers zero-shot across novel robots, cameras, and environments.

ROOct 24, 2024
Search-Based Path Planning in Interactive Environments among Movable Obstacles

Zhongqiang Ren, Bunyod Suvonov, Guofei Chen et al.

This paper investigates Path planning Among Movable Obstacles (PAMO), which seeks a minimum cost collision-free path among static obstacles from start to goal while allowing the robot to push away movable obstacles (i.e., objects) along its path when needed. To develop planners that are complete and optimal for PAMO, the planner has to search a giant state space involving both the location of the robot as well as the locations of the objects, which grows exponentially with respect to the number of objects. This paper leverages a simple yet under-explored idea that, only a small fraction of this giant state space needs to be searched during planning as guided by a heuristic, and most of the objects far away from the robot are intact, which thus leads to runtime efficient algorithms. Based on this idea, this paper introduces two PAMO formulations, i.e., bi-objective and resource constrained problems in an occupancy grid, and develops PAMO*, a planning method with completeness and solution optimality guarantees, to solve the two problems. We then further extend PAMO* to hybrid-state PAMO* to plan in continuous spaces with high-fidelity interaction between the robot and the objects. Our results show that, PAMO* can often find optimal solutions within a second in cluttered maps with up to 400 objects.

CVJun 5, 2024
Event3DGS: Event-Based 3D Gaussian Splatting for High-Speed Robot Egomotion

Tianyi Xiong, Jiayi Wu, Botao He et al.

By combining differentiable rendering with explicit point-based scene representations, 3D Gaussian Splatting (3DGS) has demonstrated breakthrough 3D reconstruction capabilities. However, to date 3DGS has had limited impact on robotics, where high-speed egomotion is pervasive: Egomotion introduces motion blur and leads to artifacts in existing frame-based 3DGS reconstruction methods. To address this challenge, we introduce Event3DGS, an {\em event-based} 3DGS framework. By exploiting the exceptional temporal resolution of event cameras, Event3GDS can reconstruct high-fidelity 3D structure and appearance under high-speed egomotion. Extensive experiments on multiple synthetic and real-world datasets demonstrate the superiority of Event3DGS compared with existing event-based dense 3D scene reconstruction frameworks; Event3DGS substantially improves reconstruction quality (+3dB) while reducing computational costs by 95\%. Our framework also allows one to incorporate a few motion-blurred frame-based measurements into the reconstruction process to further improve appearance fidelity without loss of structural accuracy.

ROSep 10, 2021
GPA-Teleoperation: Gaze Enhanced Perception-aware Safe Assistive Aerial Teleoperation

Qianhao Wang, Botao He, Zhiren Xun et al.

Gaze is an intuitive and direct way to represent the intentions of an individual. However, when it comes to assistive aerial teleoperation which aims to perform operators' intention, rare attention has been paid to gaze. Existing methods obtain intention directly from the remote controller (RC) input, which is inaccurate, unstable, and unfriendly to non-professional operators. Further, most teleoperation works do not consider environment perception which is vital to guarantee safety. In this paper, we present GPA-Teleoperation, a gaze enhanced perception-aware assistive teleoperation framework, which addresses the above issues systematically. We capture the intention utilizing gaze information, and generate a topological path matching it. Then we refine the path into a safe and feasible trajectory which simultaneously enhances the perception awareness to the environment operators are interested in. Additionally, the proposed method is integrated into a customized quadrotor system. Extensive challenging indoor and outdoor real-world experiments and benchmark comparisons verify that the proposed system is reliable, robust and applicable to even unskilled users. We will release the source code of our system to benefit related researches.

ROMar 10, 2021
FAST-Dynamic-Vision: Detection and Tracking Dynamic Objects with Event and Depth Sensing

Botao He, Haojia Li, Siyuan Wu et al.

The development of aerial autonomy has enabled aerial robots to fly agilely in complex environments. However, dodging fast-moving objects in flight remains a challenge, limiting the further application of unmanned aerial vehicles (UAVs). The bottleneck of solving this problem is the accurate perception of rapid dynamic objects. Recently, event cameras have shown great potential in solving this problem. This paper presents a complete perception system including ego-motion compensation, object detection, and trajectory prediction for fast-moving dynamic objects with low latency and high precision. Firstly, we propose an accurate ego-motion compensation algorithm by considering both rotational and translational motion for more robust object detection. Then, for dynamic object detection, an event camera-based efficient regression algorithm is designed. Finally, we propose an optimizationbased approach that asynchronously fuses event and depth cameras for trajectory prediction. Extensive real-world experiments and benchmarks are performed to validate our framework. Moreover, our code will be released to benefit related researches.