Xuewei Qi

CV
h-index16
11papers
168citations
Novelty46%
AI Score40

11 Papers

ROJan 23, 2023
Learning to View: Decision Transformers for Active Object Detection

Wenhao Ding, Nathalie Majcherczyk, Mohit Deshpande et al. · cmu

Active perception describes a broad class of techniques that couple planning and perception systems to move the robot in a way to give the robot more information about the environment. In most robotic systems, perception is typically independent of motion planning. For example, traditional object detection is passive: it operates only on the images it receives. However, we have a chance to improve the results if we allow planning to consume detection signals and move the robot to collect views that maximize the quality of the results. In this paper, we use reinforcement learning (RL) methods to control the robot in order to obtain images that maximize the detection quality. Specifically, we propose using a Decision Transformer with online fine-tuning, which first optimizes the policy with a pre-collected expert dataset and then improves the learned policy by exploring better solutions in the environment. We evaluate the performance of proposed method on an interactive dataset collected from an indoor scenario simulator. Experimental results demonstrate that our method outperforms all baselines, including expert policy and pure offline RL methods. We also provide exhaustive analyses of the reward distribution and observation space.

LGOct 21, 2022
Group Distributionally Robust Reinforcement Learning with Hierarchical Latent Variables

Mengdi Xu, Peide Huang, Yaru Niu et al.

One key challenge for multi-task Reinforcement learning (RL) in practice is the absence of task indicators. Robust RL has been applied to deal with task ambiguity, but may result in over-conservative policies. To balance the worst-case (robustness) and average performance, we propose Group Distributionally Robust Markov Decision Process (GDR-MDP), a flexible hierarchical MDP formulation that encodes task groups via a latent mixture model. GDR-MDP identifies the optimal policy that maximizes the expected return under the worst-possible qualified belief over task groups within an ambiguity set. We rigorously show that GDR-MDP's hierarchical structure improves distributional robustness by adding regularization to the worst possible outcomes. We then develop deep RL algorithms for GDR-MDP for both value-based and policy-based RL methods. Extensive experiments on Box2D control tasks, MuJoCo benchmarks, and Google football platforms show that our algorithms outperform classic robust training algorithms across diverse environments in terms of robustness under belief uncertainties. Demos are available on our project page (\url{https://sites.google.com/view/gdr-rl/home}).

ROMar 23
GaussianSSC: Triplane-Guided Directional Gaussian Fields for 3D Semantic Completion

Ruiqi Xian, Jing Liang, He Yin et al.

We present \emph{GaussianSSC}, a two-stage, grid-native and triplane-guided approach to semantic scene completion (SSC) that injects the benefits of Gaussians without replacing the voxel grid or maintaining a separate Gaussian set. We introduce \emph{Gaussian Anchoring}, a sub-pixel, Gaussian-weighted image aggregation over fused FPN features that tightens voxel--image alignment and improves monocular occupancy estimation. We further convert point-like voxel features into a learned per-voxel Gaussian field and refine triplane features via a triplane-aligned \emph{Gaussian--Triplane Refinement} module that combines \emph{local gathering} (target-centric) and \emph{global aggregation} (source-centric). This directional, anisotropic support captures surface tangency, scale, and occlusion-aware asymmetry while preserving the efficiency of triplane representations. On SemanticKITTI~\cite{behley2019semantickitti}, GaussianSSC improves Stage~1 occupancy by +1.0\% Recall, +2.0\% Precision, and +1.8\% IoU over state-of-the-art baselines, and improves Stage~2 semantic prediction by +1.8\% IoU and +0.8\% mIoU.

CVOct 14, 2024Code
ET-Former: Efficient Triplane Deformable Attention for 3D Semantic Scene Completion From Monocular Camera

Jing Liang, He Yin, Xuewei Qi et al.

We introduce ET-Former, a novel end-to-end algorithm for semantic scene completion using a single monocular camera. Our approach generates a semantic occupancy map from single RGB observation while simultaneously providing uncertainty estimates for semantic predictions. By designing a triplane-based deformable attention mechanism, our approach improves geometric understanding of the scene than other SOTA approaches and reduces noise in semantic predictions. Additionally, through the use of a Conditional Variational AutoEncoder (CVAE), we estimate the uncertainties of these predictions. The generated semantic and uncertainty maps will help formulate navigation strategies that facilitate safe and permissible decision making in the future. Evaluated on the Semantic-KITTI dataset, ET-Former achieves the highest Intersection over Union (IoU) and mean IoU (mIoU) scores while maintaining the lowest GPU memory usage, surpassing state-of-the-art (SOTA) methods. It improves the SOTA scores of IoU from 44.71 to 51.49 and mIoU from 15.04 to 16.30 on SeamnticKITTI test, with a notably low training memory consumption of 10.9 GB. Project page: https://github.com/jingGM/ET-Former.git.

ROApr 1, 2025
HomeEmergency -- Using Audio to Find and Respond to Emergencies in the Home

James F. Mullen, Dhruva Kumar, Xuewei Qi et al.

In the United States alone accidental home deaths exceed 128,000 per year. Our work aims to enable home robots who respond to emergency scenarios in the home, preventing injuries and deaths. We introduce a new dataset of household emergencies based in the ThreeDWorld simulator. Each scenario in our dataset begins with an instantaneous or periodic sound which may or may not be an emergency. The agent must navigate the multi-room home scene using prior observations, alongside audio signals and images from the simulator, to determine if there is an emergency or not. In addition to our new dataset, we present a modular approach for localizing and identifying potential home emergencies. Underpinning our approach is a novel probabilistic dynamic scene graph (P-DSG), where our key insight is that graph nodes corresponding to agents can be represented with a probabilistic edge. This edge, when refined using Bayesian inference, enables efficient and effective localization of agents in the scene. We also utilize multi-modal vision-language models (VLMs) as a component in our approach, determining object traits (e.g. flammability) and identifying emergencies. We present a demonstration of our method completing a real-world version of our task on a consumer robot, showing the transferability of both our task and our method. Our dataset will be released to the public upon this papers publication.

CVFeb 28, 2022
Spatiotemporal Transformer Attention Network for 3D Voxel Level Joint Segmentation and Motion Prediction in Point Cloud

Zhensong Wei, Xuewei Qi, Zhengwei Bai et al.

Environment perception including detection, classification, tracking, and motion prediction are key enablers for automated driving systems and intelligent transportation applications. Fueled by the advances in sensing technologies and machine learning techniques, LiDAR-based sensing systems have become a promising solution. The current challenges of this solution are how to effectively combine different perception tasks into a single backbone and how to efficiently learn the spatiotemporal features directly from point cloud sequences. In this research, we propose a novel spatiotemporal attention network based on a transformer self-attention mechanism for joint semantic segmentation and motion prediction within a point cloud at the voxel level. The network is trained to simultaneously outputs the voxel level class and predicted motion by learning directly from a sequence of point cloud datasets. The proposed backbone includes both a temporal attention module (TAM) and a spatial attention module (SAM) to learn and extract the complex spatiotemporal features. This approach has been evaluated with the nuScenes dataset, and promising performance has been achieved.

CVFeb 28, 2022
Cyber Mobility Mirror: A Deep Learning-based Real-World Object Perception Platform Using Roadside LiDAR

Zhengwei Bai, Saswat Priyadarshi Nayak, Xuanpeng Zhao et al.

Object perception plays a fundamental role in Cooperative Driving Automation (CDA) which is regarded as a revolutionary promoter for the next-generation transportation systems. However, the vehicle-based perception may suffer from the limited sensing range and occlusion as well as low penetration rates in connectivity. In this paper, we propose Cyber Mobility Mirror (CMM), a next-generation real-time traffic surveillance system for 3D object perception and reconstruction, to explore the potential of roadside sensors for enabling CDA in the real world. The CMM system consists of six main components: 1) the data pre-processor to retrieve and preprocess the raw data; 2) the roadside 3D object detector to generate 3D detection results; 3) the multi-object tracker to identify detected objects; 4) the global locator to map positioning information from the LiDAR coordinate to geographic coordinate using coordinate transformation; 5) the cloud-based communicator to transmit perception information from roadside sensors to equipped vehicles, and 6) the onboard advisor to reconstruct and display the real-time traffic conditions via Graphical User Interface (GUI). In this study, a field-operational system is deployed at a real-world intersection, University Avenue and Iowa Avenue in Riverside, California to assess the feasibility and performance of our CMM system. Results from field tests demonstrate that our CMM prototype system can provide satisfactory perception performance with 96.99% precision and 83.62% recall. High-fidelity real-time traffic conditions (at the object level) can be geo-localized with an average error of 0.14m and displayed on the GUI of the equipped vehicle with a frequency of 3-4 Hz.

CVJan 28, 2022
Infrastructure-Based Object Detection and Tracking for Cooperative Driving Automation: A Survey

Zhengwei Bai, Guoyuan Wu, Xuewei Qi et al.

Object detection plays a fundamental role in enabling Cooperative Driving Automation (CDA), which is regarded as the revolutionary solution to addressing safety, mobility, and sustainability issues of contemporary transportation systems. Although current computer vision technologies could provide satisfactory object detection results in occlusion-free scenarios, the perception performance of onboard sensors could be inevitably limited by the range and occlusion. Owing to flexible position and pose for sensor installation, infrastructure-based detection and tracking systems can enhance the perception capability for connected vehicles and thus quickly become one of the most popular research topics. In this paper, we review the research progress for infrastructure-based object detection and tracking systems. Architectures of roadside perception systems based on different types of sensors are reviewed to show a high-level description of the workflows for infrastructure-based perception systems. Roadside sensors and different perception methodologies are reviewed and analyzed with detailed literature to provide a low-level explanation for specific methods followed by Datasets and Simulators to draw an overall landscape of infrastructure-based object detection and tracking methods. Discussions are conducted to point out current opportunities, open problems, and anticipated future trends.

SEJan 24, 2022
Cyber Mobility Mirror for Enabling Cooperative Driving Automation in Mixed Traffic: A Co-Simulation Platform

Zhengwei Bai, Guoyuan Wu, Xuewei Qi et al.

Endowed with automation and connectivity, Connected and Automated Vehicles are meant to be a revolutionary promoter for Cooperative Driving Automation. Nevertheless, CAVs need high-fidelity perception information on their surroundings, which is available but costly to collect from various onboard sensors as well as vehicle-to-everything (V2X) communications. Therefore, authentic perception information based on high-fidelity sensors via a cost-effective platform is crucial for enabling CDA-related research, e.g., cooperative decision-making or control. Most state-of-the-art traffic simulation studies for CAVs rely on situation-awareness information by directly calling on intrinsic attributes of the objects, which impedes the reliability and fidelity of the assessment of CDA algorithms. In this study, a \textit{Cyber Mobility Mirror (CMM)} Co-Simulation Platform is designed for enabling CDA by providing authentic perception information. The \textit{CMM} Co-Simulation Platform can emulate the real world with a high-fidelity sensor perception system and a cyber world with a real-time rebuilding system acting as a "\textit{Mirror}" of the real-world environment. Concretely, the real-world simulator is mainly in charge of simulating the traffic environment, sensors, as well as the authentic perception process. The mirror-world simulator is responsible for rebuilding objects and providing their information as intrinsic attributes of the simulator to support the development and evaluation of CDA algorithms. To illustrate the functionality of the proposed co-simulation platform, a roadside LiDAR-based vehicle perception system for enabling CDA is prototyped as a study case. Specific traffic environments and CDA tasks are designed for experiments whose results are demonstrated and analyzed to show the performance of the platform.

LGJun 19, 2021
Scalable Safety-Critical Policy Evaluation with Accelerated Rare Event Sampling

Mengdi Xu, Peide Huang, Fengpei Li et al.

Evaluating rare but high-stakes events is one of the main challenges in obtaining reliable reinforcement learning policies, especially in large or infinite state/action spaces where limited scalability dictates a prohibitively large number of testing iterations. On the other hand, a biased or inaccurate policy evaluation in a safety-critical system could potentially cause unexpected catastrophic failures during deployment. This paper proposes the Accelerated Policy Evaluation (APE) method, which simultaneously uncovers rare events and estimates the rare event probability in Markov decision processes. The APE method treats the environment nature as an adversarial agent and learns towards, through adaptive importance sampling, the zero-variance sampling distribution for the policy evaluation. Moreover, APE is scalable to large discrete or continuous spaces by incorporating function approximators. We investigate the convergence property of APE in the tabular setting. Our empirical studies show that APE can estimate the rare event probability with a smaller bias while only using orders of magnitude fewer samples than baselines in multi-agent and single-agent environments.

CVJan 24, 2020
End-to-End Vision-Based Adaptive Cruise Control (ACC) Using Deep Reinforcement Learning

Zhensong Wei, Yu Jiang, Xishun Liao et al.

This paper presented a deep reinforcement learning method named Double Deep Q-networks to design an end-to-end vision-based adaptive cruise control (ACC) system. A simulation environment of a highway scene was set up in Unity, which is a game engine that provided both physical models of vehicles and feature data for training and testing. Well-designed reward functions associated with the following distance and throttle/brake force were implemented in the reinforcement learning model for both internal combustion engine (ICE) vehicles and electric vehicles (EV) to perform adaptive cruise control. The gap statistics and total energy consumption are evaluated for different vehicle types to explore the relationship between reward functions and powertrain characteristics. Compared with the traditional radar-based ACC systems or human-in-the-loop simulation, the proposed vision-based ACC system can generate either a better gap regulated trajectory or a smoother speed trajectory depending on the preset reward function. The proposed system can be well adaptive to different speed trajectories of the preceding vehicle and operated in real-time.