Xinyan Lu

CV
h-index: 12
Papers: 3
Citations: 15
Novelty: 73%
AI Score: 39

3 Papers

CV · Dec 24, 2024
UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision

Yuru Wang, Pei Liu, Songtao Wang et al.

Open-world 3D scene understanding is a critical challenge that involves recognizing and distinguishing diverse objects and categories from 3D data, such as point clouds, without relying on manual annotations. Traditional methods struggle with this open-world task, especially due to the limitations of constructing extensive point cloud-text pairs and handling multimodal data effectively. In response to these challenges, we present UniPLV, a robust framework that unifies point clouds, images, and text within a single learning paradigm for comprehensive 3D scene understanding. UniPLV leverages images as a bridge to co-embed 3D points with pre-aligned images and text in a shared feature space, eliminating the need for labor-intensive point cloud-text pair crafting. Our framework achieves precise multimodal alignment through two innovative strategies: (i) Logit and feature distillation modules between images and point clouds to enhance feature coherence; (ii) A vision-point matching module that implicitly corrects 3D semantic predictions affected by projection inaccuracies from points to pixels. To further boost performance, we implement four task-specific losses alongside a two-stage training strategy. Extensive experiments demonstrate that UniPLV significantly surpasses state-of-the-art methods, with average improvements of 15.6% and 14.8% in semantic segmentation for Base-Annotated and Annotation-Free tasks, respectively. These results underscore UniPLV's efficacy in pushing the boundaries of open-world 3D scene understanding. We will release the code to support future research and development.
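The abstract names logit and feature distillation between images and point clouds but does not give the loss forms. The following is a minimal sketch of what such losses commonly look like, not the UniPLV implementation: it assumes each 3D point has already been paired with one image pixel, uses a temperature-scaled KL divergence for logits and a cosine distance for features, and all function names and parameters are hypothetical.

```python
import numpy as np

def log_softmax(z, axis=-1):
    """Numerically stable log-softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def logit_distillation(point_logits, image_logits, temperature=1.0):
    """Mean KL(teacher || student) over paired points.

    Assumption: the image branch is the teacher and the point branch
    is the student; both arrays have shape (num_points, num_classes).
    """
    log_p_teacher = log_softmax(image_logits / temperature)
    log_p_student = log_softmax(point_logits / temperature)
    kl = (np.exp(log_p_teacher) * (log_p_teacher - log_p_student)).sum(axis=-1)
    return float(kl.mean())

def feature_distillation(point_feats, image_feats, eps=1e-8):
    """Mean cosine distance between paired point and pixel features."""
    s = point_feats / (np.linalg.norm(point_feats, axis=-1, keepdims=True) + eps)
    t = image_feats / (np.linalg.norm(image_feats, axis=-1, keepdims=True) + eps)
    return float((1.0 - (s * t).sum(axis=-1)).mean())

# Toy data: 5 points, 4 classes, 8-dimensional features (all made up).
rng = np.random.default_rng(0)
image_logits = rng.standard_normal((5, 4))
point_logits = rng.standard_normal((5, 4))
image_feats = rng.standard_normal((5, 8))
point_feats = rng.standard_normal((5, 8))

loss_logit = logit_distillation(point_logits, image_logits, temperature=2.0)
loss_feat = feature_distillation(point_feats, image_feats)
```

Both losses go to zero when the point branch reproduces the image branch exactly, which is the sense in which distillation of this kind encourages feature coherence across modalities.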

CV · Aug 31, 2025
OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving

Pei Liu, Qingtian Ning, Xinyan Lu et al.

Recent advances in vision-language models (VLMs) have demonstrated impressive spatial reasoning capabilities for autonomous driving, yet existing methods predominantly focus on static scene understanding while neglecting the essential temporal dimension of real-world driving scenarios. To address this critical limitation, we propose the OmniReason framework, which establishes robust spatiotemporal reasoning by jointly modeling dynamic 3D environments and their underlying decision-making processes. Our work makes two fundamental advances: (1) We introduce OmniReason-Data, two large-scale vision-language-action (VLA) datasets with dense spatiotemporal annotations and natural language explanations, generated through a novel hallucination-mitigated auto-labeling pipeline that ensures both physical plausibility and temporal coherence; (2) We develop the OmniReason-Agent architecture, which integrates a sparse temporal memory module for persistent scene context modeling and an explanation generator that produces human-interpretable decision rationales, facilitated by our spatiotemporal knowledge distillation approach that effectively captures spatiotemporal causal reasoning patterns. Comprehensive experiments demonstrate state-of-the-art performance, where OmniReason-Agent achieves significant improvements in both open-loop planning tasks and visual question answering (VQA) benchmarks, while establishing new capabilities for interpretable, temporally-aware autonomous vehicles operating in complex, dynamic environments.

LG · Mar 2, 2018
Convolutional Geometric Matrix Completion

Kai-Lang Yao, Wu-Jun Li, Jianbo Yang et al.

Geometric matrix completion (GMC) has been proposed for recommendation by integrating the relationship (link) graphs among users/items into matrix completion (MC). Traditional GMC methods typically adopt graph regularization to impose smoothness priors for MC. Geometric deep learning on graphs (GDLG) has recently been proposed to solve the GMC problem, showing better performance than existing GMC methods, including traditional graph-regularization-based methods. To the best of our knowledge, there exists only one GDLG method for GMC, called RMGCNN. RMGCNN combines a graph convolutional network (GCN) and a recurrent neural network (RNN) for GMC. In its original paper, RMGCNN demonstrates better performance than a pure GCN-based method. In this paper, we propose a new GMC method, called convolutional geometric matrix completion (CGMC), for recommendation with graphs among users/items. CGMC is a pure GCN-based method with a newly designed graph convolutional network. Experimental results on real datasets show that CGMC can outperform other state-of-the-art methods, including RMGCNN, in terms of both accuracy and speed.
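The traditional baseline mentioned in the abstract, graph regularization as a smoothness prior for matrix completion, can be illustrated with a small sketch. This is not CGMC or RMGCNN: it assumes a low-rank factorization X = U Vᵀ, unnormalized graph Laplacians, and plain gradient descent on the objective ½‖mask ⊙ (U Vᵀ − M)‖² + (λ/2)(tr(Uᵀ L_row U) + tr(Vᵀ L_col V)). All names, graphs, ratings, and hyperparameters are made up for illustration.

```python
import numpy as np

def graph_laplacian(adj):
    """Unnormalized Laplacian L = D - A of a symmetric adjacency matrix."""
    return np.diag(adj.sum(axis=1)) - adj

def graph_regularized_mc(M, mask, L_row, L_col, rank=2, lam=0.1,
                         lr=0.005, steps=3000, seed=0):
    """Fit U, V by gradient descent; return the completed matrix U @ V.T."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    for _ in range(steps):
        R = mask * (U @ V.T - M)              # residual on observed entries only
        grad_U = R @ V + lam * (L_row @ U)    # data term + user-graph smoothness
        grad_V = R.T @ U + lam * (L_col @ V)  # data term + item-graph smoothness
        U -= lr * grad_U
        V -= lr * grad_V
    return U @ V.T

# Toy example: 4 users x 4 items; a 0 in `mask` marks a missing rating.
M = np.array([[5., 4., 1., 1.],
              [5., 5., 1., 2.],
              [1., 1., 5., 4.],
              [2., 1., 4., 5.]])
mask = np.array([[1., 1., 1., 0.],
                 [1., 0., 1., 1.],
                 [1., 1., 0., 1.],
                 [0., 1., 1., 1.]])
# Hypothetical link graphs: users 0-1 and 2-3 are linked, likewise items.
A_users = np.array([[0., 1., 0., 0.],
                    [1., 0., 0., 0.],
                    [0., 0., 0., 1.],
                    [0., 0., 1., 0.]])
A_items = A_users.copy()

completed = graph_regularized_mc(M, mask, graph_laplacian(A_users),
                                 graph_laplacian(A_items))
```

The Laplacian terms penalize differences between the latent factors of linked users (or linked items), which is how the link graphs inform the predictions for missing entries.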