Qihang Peng

CV
h-index4
7papers
86citations
Novelty54%
AI Score43

7 Papers

CVDec 28, 2025
ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

Qihang Peng, Xuesong Chen, Chenye Yang et al.

Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.

CVMay 8, 2025
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding

Henry Zheng, Hao Shi, Qihang Peng et al.

Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based on verbal descriptions. However, this task faces two significant challenges: (1) loss of fine-grained visual semantics due to sparse fusion of point clouds with ego-centric multi-view images, (2) limited textual semantic context due to arbitrary language descriptions. We propose DenseGrounding, a novel approach designed to address these issues by enhancing both visual and textual semantics. For visual features, we introduce the Hierarchical Scene Semantic Enhancer, which retains dense semantics by capturing fine-grained global scene features and facilitating cross-modal alignment. For text descriptions, we propose a Language Semantic Enhancer that leverages large language models to provide rich context and diverse language descriptions with additional context during model training. Extensive experiments show that DenseGrounding significantly outperforms existing methods in overall accuracy, with improvements of 5.81% and 7.56% when trained on the comprehensive full dataset and smaller mini subset, respectively, further advancing the SOTA in egocentric 3D visual grounding. Our method also achieves 1st place and receives the Innovation Award in the CVPR 2024 Autonomous Grand Challenge Multi-view 3D Visual Grounding Track, validating its effectiveness and robustness.

CVFeb 26, 2025
ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding

Qihang Peng, Henry Zheng, Gao Huang

Embodied intelligence requires agents to interact with 3D environments in real time based on language instructions. A foundational task in this domain is ego-centric 3D visual grounding. However, the point clouds rendered from RGB-D images retain a large amount of redundant background data and inherent noise, both of which can interfere with the manifold structure of the target regions. Existing point cloud enhancement methods often require a tedious process to improve the manifold, which is not suitable for real-time tasks. We propose Proxy Transformation suitable for multimodal task to efficiently improve the point cloud manifold. Our method first leverages Deformable Point Clustering to identify the point cloud sub-manifolds in target regions. Then, we propose a Proxy Attention module that utilizes multimodal proxies to guide point cloud transformation. Built upon Proxy Attention, we design a submanifold transformation generation module where textual information globally guides translation vectors for different submanifolds, optimizing relative spatial relationships of target regions. Simultaneously, image information guides linear transformations within each submanifold, refining the local point cloud manifold of target regions. Extensive experiments demonstrate that Proxy Transformation significantly outperforms all existing methods, achieving an impressive improvement of 7.49% on easy targets and 4.60% on hard targets, while reducing the computational overhead of attention blocks by 40.6%. These results establish a new SOTA in ego-centric 3D visual grounding, showcasing the effectiveness and robustness of our approach.

SPJan 24, 2022
Knowledge Graph Based Waveform Recommendation: A New Communication Waveform Design Paradigm

Wei Huang, Tianfu Qi, Yundi Guan et al.

Traditionally, a communication waveform is designed by experts based on communication theory and their experiences on a case-by-case basis, which is usually laborious and time-consuming. In this paper, we investigate the waveform design from a novel perspective and propose a new waveform design paradigm with the knowledge graph (KG)-based intelligent recommendation system. The proposed paradigm aims to improve the design efficiency by structural characterization and representations of existing waveforms and intelligently utilizing the knowledge learned from them. To achieve this goal, we first build a communication waveform knowledge graph (CWKG) with a first-order neighbor node, for which both structured semantic knowledge and numerical parameters of a waveform are integrated by representation learning. Based on the developed CWKG, we further propose an intelligent communication waveform recommendation system (CWRS) to generate waveform candidates. In the CWRS, an improved involution1D operator, which is channel-agnostic and space-specific, is introduced according to the characteristics of KG-based waveform representation for feature extraction, and the multi-head self-attention is adopted to weigh the influence of various components for feature fusion. Meanwhile, multilayer perceptron-based collaborative filtering is used to evaluate the matching degree between the requirement and the waveform candidate. Simulation results show that the proposed CWKG-based CWRS can automatically recommend waveform candidates with high reliability.

ITAug 1, 2019
Robust Deep Sensing Through Transfer Learning in Cognitive Radio

Qihang Peng, Andrew Gilman, Nuno Vasconcelos et al.

We propose a robust spectrum sensing framework based on deep learning. The received signals at the secondary user's receiver are filtered, sampled and then directly fed into a convolutional neural network. Although this deep sensing is effective when operating in the same scenario as the collected training data, the sensing performance is degraded when it is applied in a different scenario with different wireless signals and propagation. We incorporate transfer learning into the framework to improve the robustness. Results validate the effectiveness as well as the robustness of the proposed deep spectrum sensing framework.

SPJul 15, 2019
Comparison of Neural Network Architectures for Spectrum Sensing

Ziyu Ye, Andrew Gilman, Qihang Peng et al.

Different neural network (NN) architectures have different advantages. Convolutional neural networks (CNNs) achieved enormous success in computer vision, while recurrent neural networks (RNNs) gained popularity in speech recognition. It is not known which type of NN architecture is the best fit for classification of communication signals. In this work, we compare the behavior of fully-connected NN (FC), CNN, RNN, and bi-directional RNN (BiRNN) in a spectrum sensing task. The four NN architectures are compared on their detection performance, requirement of training data, computational complexity, and memory requirement. Given abundant training data and computational and memory resources, CNN, RNN, and BiRNN are shown to achieve similar performance. The performance of FC is worse than that of the other three types, except in the case where computational complexity is stringently limited.

SPJul 15, 2019
A Neural Network Detector for Spectrum Sensing under Uncertainties

Ziyu Ye, Qihang Peng, Kelly Levick et al.

Spectrum sensing is of critical importance in any cognitive radio system. When the primary user's signal has uncertain parameters, the likelihood ratio test, which is the theoretically optimal detector, generally has no closed-form expression. As a result, spectrum sensing under parameter uncertainty remains an open question, though many detectors exploiting specific features of a primary signal have been proposed and have achieved reasonably good performance. In this paper, a neural network is trained as a detector for modulated signals. The result shows by training on an appropriate dataset, the neural network gains robustness under uncertainties in system parameters including the carrier frequency offset, carrier phase offset, and symbol time offset. The result displays the neural network's potential in exploiting implicit and incomplete knowledge about the signal's structure.