Wenxiao Zhang

h-index 38

14 papers

618 citations

Novelty 47%

AI Score 33

Ranked #116,274 of 194,257 authors (top 60%); #38,750 in CV (top 66%)

14 Papers

11.3 | CV | Jul 3, 2024 | Code

Explicitly Guided Information Interaction Network for Cross-modal Point Cloud Completion

Hang Xu, Chen Long, Wenxiao Zhang et al.

In this paper, we explore a novel framework, EGIInet (Explicitly Guided Information Interaction Network), a model for the View-guided Point cloud Completion (ViPC) task, which aims to restore a complete point cloud from a partial one with a single view image. In comparison with previous methods that relied on the global semantics of input images, EGIInet efficiently combines the information from two modalities by leveraging the geometric nature of the completion task. Specifically, we propose an explicitly guided information interaction strategy supported by modal alignment for point cloud completion. First, in contrast to previous methods which simply use 2D and 3D backbones to encode features respectively, we unify the encoding process to promote modal alignment. Second, we propose a novel explicitly guided information interaction strategy that helps the network identify critical information within images, thus achieving better guidance for completion. Extensive experiments demonstrate the effectiveness of our framework, and we achieve a new state-of-the-art (+16% CD over XMFnet) on benchmark datasets despite using fewer parameters than the previous methods. The pre-trained model and code are available at https://github.com/WHU-USI3DV/EGIInet.
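The "CD" figure quoted above refers to Chamfer Distance, a common metric for comparing a predicted point cloud with a reference one. The following is a minimal pure-Python sketch of one symmetric variant (mean nearest-neighbour Euclidean distance in both directions). The function name `chamfer_distance` and the sample points are illustrative assumptions; the exact variant used by EGIInet (for example L1 versus squared L2, or any scaling factor) is not stated in the abstract.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two non-empty point sets.

    Assumption: this is the unsquared (L2-norm) variant averaged per set;
    papers differ on squaring and scaling.
    """
    def mean_nearest(src, dst):
        # Average distance from each point in src to its nearest neighbour in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


# Hypothetical toy data: a partial cloud missing one point of the complete one.
partial = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
complete = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(chamfer_distance(partial, complete))
```

This brute-force version is quadratic in the number of points; practical evaluations typically rely on batched GPU implementations or spatial indexes.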

9.1 | CV | Nov 30, 2023 | Code

SparseDC: Depth Completion from sparse and non-uniform inputs

Chen Long, Wenxiao Zhang, Zhe Chen et al.

We propose SparseDC, a model for Depth Completion of Sparse and non-uniform depth inputs. Unlike previous methods focusing on completing fixed distributions on benchmark datasets (e.g., NYU with 500 points, KITTI with 64 lines), SparseDC is specifically designed to handle depth maps with poor quality in real-world usage. The key contributions of SparseDC are two-fold. First, we design a simple strategy, called SFFM, to improve the robustness under sparse input by explicitly filling the unstable depth features with stable image features. Second, we propose a two-branch feature embedder to predict both the precise local geometry of regions with available depth values and accurate structures in regions with no depth. The key to the embedder is an uncertainty-based fusion module called UFFM to balance the local and long-term information extracted by CNNs and ViTs. Extensive indoor and outdoor experiments demonstrate the robustness of our framework when facing sparse and non-uniform input depths. The pre-trained model and code are available at https://github.com/WHU-USI3DV/SparseDC.

11.7 | CV | Apr 21, 2022

DGECN: A Depth-Guided Edge Convolutional Network for End-to-End 6D Pose Estimation

Tuo Cao, Fei Luo, Yanping Fu et al.

Monocular 6D pose estimation is a fundamental task in computer vision. Existing works often adopt a two-stage pipeline by establishing correspondences and utilizing a RANSAC algorithm to calculate the 6 degrees-of-freedom (6DoF) pose. Recent works try to integrate differentiable RANSAC algorithms to achieve end-to-end 6D pose estimation. However, most of them hardly consider the geometric features in 3D space, and ignore the topology cues when performing differentiable RANSAC algorithms. To this end, we propose a Depth-Guided Edge Convolutional Network (DGECN) for the 6D pose estimation task. We have made efforts from the following three aspects: 1) We take advantage of estimated depth information to guide both the correspondences-extraction process and the cascaded differentiable RANSAC algorithm with geometric information. 2) We leverage the uncertainty of the estimated depth map to improve accuracy and robustness of the output 6D pose. 3) We propose a differentiable Perspective-n-Point (PnP) algorithm via edge convolution to explore the topology relations between 2D-3D correspondences. Experiments demonstrate that our proposed network outperforms current works on both effectiveness and efficiency.

23.7 | HC | Aug 6, 2024 | Code

OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents

Qiang Sun, Yuanyi Luo, Sirui Li et al.

Multimodal conversational agents are highly desirable because they offer natural and human-like interaction. However, there is a lack of comprehensive end-to-end solutions to support collaborative development and benchmarking. While proprietary systems like GPT-4o and Gemini demonstrate impressive integration of audio, video, and text with response times of 200-250ms, challenges remain in balancing latency, accuracy, cost, and data privacy. To better understand and quantify these issues, we developed OpenOmni, an open-source, end-to-end pipeline benchmarking tool that integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval Augmented Generation, and Large Language Models, along with the ability to integrate customized models. OpenOmni supports local and cloud deployment, ensuring data privacy and supporting latency and accuracy benchmarking. This flexible framework allows researchers to customize the pipeline, focusing on real bottlenecks and facilitating rapid proof-of-concept development. OpenOmni can significantly enhance applications like indoor assistance for visually impaired individuals, advancing human-computer interaction. Our demonstration video is available at https://www.youtube.com/watch?v=zaSiT3clWqY, the demo is available via https://openomni.ai4wa.com, and the code is available via https://github.com/AI4WA/OpenOmniFramework.

15.7 | RO | Sep 3, 2024

SafeEmbodAI: a Safety Framework for Mobile Robots in Embodied AI Systems

Wenxiao Zhang, Xiangrui Kong, Thomas Braunl et al.

Embodied AI systems, including AI-powered robots that autonomously interact with the physical world, stand to be significantly advanced by Large Language Models (LLMs), which enable robots to better understand complex language commands and perform advanced tasks with enhanced comprehension and adaptability. However, this advancement also introduces safety challenges, particularly in robotic navigation tasks. Improper safety management can lead to failures in complex environments and make the system vulnerable to malicious command injections, resulting in unsafe behaviours such as detours or collisions. To address these issues, we propose SafeEmbodAI, a safety framework for integrating mobile robots into embodied AI systems. SafeEmbodAI incorporates secure prompting, state management, and safety validation mechanisms to secure and assist LLMs in reasoning through multi-modal data and validating responses. We designed a metric to evaluate mission-oriented exploration, and evaluations in simulated environments demonstrate that our framework effectively mitigates threats from malicious commands and improves performance in various environment settings, ensuring the safety of embodied AI systems. Notably, in complex environments with mixed obstacles, our method demonstrates a significant performance increase of 267% compared to the baseline in attack scenarios, highlighting its robustness in challenging conditions.

5.7 | RO | Jul 2, 2024

Embodied AI in Mobile Robots: Coverage Path Planning with Large Language Models

Xiangrui Kong, Wenxiao Zhang, Jin Hong et al.

In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and solving mathematical problems, leading to advancements in various fields. We propose an LLM-embodied path planning framework for mobile agents, focusing on solving high-level coverage path planning issues and low-level control. Our proposed multi-layer architecture uses prompted LLMs in the path planning phase and integrates them with the mobile agents' low-level actuators. To evaluate the performance of various LLMs, we propose a coverage-weighted path planning metric to assess the performance of the embodied models. Our experiments show that the proposed framework improves LLMs' spatial inference abilities. We demonstrate that the proposed multi-layer framework significantly enhances the efficiency and accuracy of these tasks by leveraging the natural language understanding and generative capabilities of LLMs. Our experiments show that this framework can improve LLMs' 2D plane reasoning abilities and complete coverage path planning tasks. We also tested three LLM kernels: gpt-4o, gemini-1.5-flash, and claude-3.5-sonnet. The experimental results show that claude-3.5 can complete the coverage planning task in different scenarios, and its indicators are better than those of the other models.
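The abstract above does not define its coverage-weighted path planning metric. As a rough illustration of the basic quantity that coverage metrics build on, the sketch below computes the fraction of free grid cells that a planned path visits. The function name `coverage_ratio`, the grid representation, and the sample path are all hypothetical and are not taken from the paper.

```python
def coverage_ratio(path, free_cells):
    """Fraction of free grid cells visited at least once by the path.

    Assumption: cells are (row, col) tuples; cells on the path that are
    not in free_cells (e.g. obstacles) do not count toward coverage.
    """
    free = set(free_cells)
    if not free:
        raise ValueError("free_cells must not be empty")
    visited = set(path) & free
    return len(visited) / len(free)


# Hypothetical 2x2 obstacle-free grid and a path that misses one cell.
grid = [(r, c) for r in range(2) for c in range(2)]
plan = [(0, 0), (0, 1), (1, 1), (0, 1)]
print(coverage_ratio(plan, grid))
```

A weighted metric would presumably combine a ratio like this with path length or revisit penalties, but that combination is left out here because the source does not specify it.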

8.0 | CV | Sep 20, 2021 | Code

PC²-PU: Patch Correlation and Point Correlation for Effective Point Cloud Upsampling

Chen Long, Wenxiao Zhang, Ruihui Li et al.

Point cloud upsampling aims to densify a sparse point set acquired from 3D sensors, providing a denser representation of the underlying surface. Existing methods divide the input points into small patches and upsample each patch separately, ignoring the global spatial consistency between patches. In this paper, we present a novel method, PC²-PU, which explores patch-to-patch and point-to-point correlations for more effective and robust point cloud upsampling. Specifically, our network has two appealing designs: (i) We take adjacent patches as supplementary inputs to compensate for the loss of structure information within a single patch and introduce a Patch Correlation Module to capture the difference and similarity between patches. (ii) After augmenting each patch's geometry, we further introduce a Point Correlation Module to reveal the relationship of points inside each patch to maintain the local spatial consistency. Extensive experiments on both synthetic and real scanned datasets demonstrate that our method surpasses previous upsampling methods, particularly with noisy inputs. The code and data are at https://github.com/chenlongwhu/PC2-PU.git.

8.7 | CV | Mar 15, 2024

Enhancing Human-Centered Dynamic Scene Understanding via Multiple LLMs Collaborated Reasoning

Hang Zhang, Wenxiao Zhang, Haoxuan Qu et al.

Human-centered dynamic scene understanding plays a pivotal role in enhancing the capability of robotic and autonomous systems, in which Video-based Human-Object Interaction (V-HOI) detection is a crucial task in semantic scene understanding, aimed at comprehensively understanding HOI relationships within a video to benefit the behavioral decisions of mobile robots and autonomous driving systems. Although previous V-HOI detection models have made significant strides in accurate detection on specific datasets, they still lack the human-like general reasoning ability needed to effectively induce HOI relationships. In this study, we propose V-HOI Multi-LLMs Collaborated Reasoning (V-HOI MLCR), a novel framework consisting of a series of plug-and-play modules that could facilitate the performance of current V-HOI detection models by leveraging the strong reasoning ability of different off-the-shelf pre-trained large language models (LLMs). We design a two-stage collaboration system of different LLMs for the V-HOI task. Specifically, in the first stage, we design a Cross-Agents Reasoning scheme that has the LLM conduct reasoning from different aspects. In the second stage, we perform Multi-LLMs Debate to get the final reasoning answer based on the different knowledge in different LLMs. Additionally, we devise an auxiliary training strategy that utilizes CLIP, a large vision-language model, to enhance the base V-HOI models' discriminative ability to better cooperate with LLMs. We validate the superiority of our design by demonstrating its effectiveness in improving the prediction accuracy of the base V-HOI model via reasoning from multiple perspectives.

7.3 | CV | Nov 23, 2021

KTNet: Knowledge Transfer for Unpaired 3D Shape Completion

Zhen Cao, Wenxiao Zhang, Xin Wen et al.

Unpaired 3D object completion aims to predict a complete 3D shape from an incomplete input without knowing the correspondence between the complete and incomplete shapes. In this paper, we propose the novel KTNet to solve this task from the new perspective of knowledge transfer. KTNet elaborates a teacher-assistant-student network to establish multiple knowledge transfer processes. Specifically, the teacher network takes the complete shape as input and learns the knowledge of the complete shape. The student network takes the incomplete one as input and restores the corresponding complete shape. The assistant modules not only help to transfer the knowledge of the complete shape from the teacher to the student, but also judge the learning effect of the student network. As a result, KTNet makes use of a more comprehensive understanding to establish the geometric correspondence between complete and incomplete shapes from the perspective of knowledge transfer, which enables more detailed geometric inference for generating high-quality complete shapes. We conduct comprehensive experiments on several datasets, and the results show that our method outperforms previous methods of unpaired point cloud completion by a large margin.

6.4 | HC | Nov 9, 2021

EdgeXAR: A 6-DoF Camera Multi-target Interaction Framework for MAR with User-friendly Latency Compensation

Wenxiao Zhang, Sikun Lin, Farshid Hassani Bijarbooneh et al.

The computational capabilities of recent mobile devices enable the processing of natural features for Augmented Reality (AR), but the scalability is still limited by the devices' computation power and available resources. In this paper, we propose EdgeXAR, a mobile AR framework that utilizes the advantages of edge computing through task offloading to support flexible camera-based AR interaction. We propose a hybrid tracking system for mobile devices that provides lightweight tracking with 6 Degrees of Freedom and hides the offloading latency from users' perception. A practical, reliable and unreliable communication mechanism is used to achieve fast response and consistency of crucial information. We also propose a multi-object image retrieval pipeline that executes fast and accurate image recognition tasks on the cloud and edge servers. Extensive experiments are carried out to evaluate the performance of EdgeXAR by building mobile AR Apps upon it. Regarding the Quality of Experience (QoE), the mobile AR Apps powered by the EdgeXAR framework run on average at 30 frames per second with precise tracking of only 1~2 pixel errors and accurate image recognition of at least 97% accuracy. Compared to Vuforia, one of the leading commercial AR frameworks, EdgeXAR transmits 87% less data while providing a stable 30 FPS performance and reducing the offloading latency by 50 to 70% depending on the transmission medium. Our work facilitates the large-scale deployment of AR as the next generation of ubiquitous interfaces.

22.3 | CV | Jul 5, 2020 | Code

Detail Preserved Point Cloud Completion via Separated Feature Aggregation

Wenxiao Zhang, Qingan Yan, Chunxia Xiao

Point cloud shape completion is a challenging problem in 3D vision and robotics. Existing learning-based frameworks leverage encoder-decoder architectures to recover the complete shape from a highly encoded global feature vector. Though the global feature can approximately represent the overall shape of 3D objects, it would lead to the loss of shape details during the completion process. In this work, instead of using a global feature to recover the whole complete surface, we explore the functionality of multi-level features and aggregate different features to represent the known part and the missing part separately. We propose two different feature aggregation strategies, named global & local feature aggregation (GLFA) and residual feature aggregation (RFA), to express the two kinds of features and reconstruct coordinates from their combination. In addition, we also design a refinement component to prevent the generated point cloud from non-uniform distribution and outliers. Extensive experiments have been conducted on the ShapeNet dataset. Qualitative and quantitative evaluations demonstrate that our proposed network outperforms current state-of-the-art methods, especially on detail preservation.

24.7 | CV | Apr 22, 2019 | Code

PCAN: 3D Attention Map Learning Using Contextual Information for Point Cloud Based Retrieval

Wenxiao Zhang, Chunxia Xiao

Point cloud based retrieval for place recognition is an emerging problem in the vision field. The main challenge is how to find an efficient way to encode the local features into a discriminative global descriptor. In this paper, we propose a Point Contextual Attention Network (PCAN), which can predict the significance of each local point feature based on point context. Our network makes it possible to pay more attention to the task-relevant features when aggregating local features. Experiments on various benchmark datasets show that the proposed network outperforms current state-of-the-art approaches.
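The idea of weighting local point features by a per-point significance score before aggregating them into a global descriptor can be sketched as a softmax-weighted sum. This is a deliberately simplified illustration and not PCAN's actual architecture: here the scores are supplied by hand, whereas in the paper they are predicted by the network from point context, and the name `attention_pool` is hypothetical.

```python
import math


def attention_pool(features, scores):
    """Aggregate local feature vectors into one descriptor.

    Assumption: each score is an unnormalised significance value for the
    matching feature; a softmax turns the scores into weights that sum to 1.
    """
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract peak for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Hypothetical two local features; the first is scored as far more significant.
local_features = [[1.0, 0.0], [0.0, 1.0]]
descriptor = attention_pool(local_features, [10.0, 0.0])
print(descriptor)
```

With equal scores the result reduces to plain average pooling, which makes the effect of the attention weights easy to compare against an unweighted baseline.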

2.3 | MM | May 8, 2018

CloudAR: A Cloud-based Framework for Mobile Augmented Reality

Wenxiao Zhang, Sikun Lin, Farshid Hassani Bijarbooneh et al.

Computation capabilities of recent mobile devices enable natural feature processing for Augmented Reality (AR). However, mobile AR applications are still faced with scalability and performance challenges. In this paper, we propose CloudAR, a mobile AR framework utilizing the advantages of cloud and edge computing through recognition task offloading. We explore the design space of cloud-based AR exhaustively and optimize the offloading pipeline to minimize the time and energy consumption. We design an innovative tracking system for mobile devices which provides lightweight tracking in 6 degrees of freedom (6DoF) and hides the offloading latency from users' perception. We also design a multi-object image retrieval pipeline that executes fast and accurate image recognition tasks on servers. In our evaluations, the mobile AR application built with the CloudAR framework runs at 30 frames per second (FPS) on average with precise tracking of only 1~2 pixel errors and image recognition of at least 97% accuracy. Our results also show that CloudAR outperforms one of the leading commercial AR frameworks in several performance metrics.

6.1 | HC | Nov 25, 2016

Dissecting the End-to-end Latency of Interactive Mobile Video Applications

Teemu Kämäräinen, Matti Siekkinen, Antti Ylä-Jääski et al.

In this paper we measure the step-wise latency in the pipeline of three kinds of interactive mobile video applications that are rapidly gaining popularity, namely Remote Graphics Rendering (RGR) of which we focus on mobile cloud gaming, Mobile Augmented Reality (MAR), and Mobile Virtual Reality (MVR). The applications differ from each other by the way in which the user interacts with the application, i.e., video I/O and user controls, but they all share in common the fact that their user experience is highly sensitive to end-to-end latency. Long latency between a user control event and display update renders the application unusable. Hence, understanding the nature and origins of latency of these applications is of paramount importance. We show through extensive measurements that control input and display buffering have a substantial effect on the overall delay. Our results shed light on the latency bottlenecks and the maturity of technology for seamless user experience with these applications.