Xuanang Gao

CV
h-index11
4papers
13citations
Novelty59%
AI Score54

4 Papers

98.8CVMay 11Code
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

Zhiwei Ning, Xuanang Gao, Jiaxi Cao et al.

Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset comprising over 80k samples to guide the model to assign higher prior confidence to correct action paths. Extensive experiments across eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, delivering an average improvement of 19.7% on the Qwen3-VL-8B baseline and consistent gains across both open-source and proprietary models.

CVFeb 23
Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection

Zhiwei Ning, Xuanang Gao, Jiaxi Cao et al.

Linear modeling methods like Mamba have been merged as the effective backbone for the 3D object detection task. However, previous Mamba-based methods utilize the bidirectional encoding for the whole non-empty voxel sequence, which contains abundant useless background information in the scenes. Though directly encoding foreground voxels appears to be a plausible solution, it tends to degrade detection performance. We attribute this to the response attenuation and restricted context representation in the linear modeling for fore-only sequences. To address this problem, we propose a novel backbone, termed Fore-Mamba3D, to focus on the foreground enhancement by modifying Mamba-based encoder. The foreground voxels are first sampled according to the predicted scores. Considering the response attenuation existing in the interaction of foreground voxels across different instances, we design a regional-to-global slide window (RGSW) to propagate the information from regional split to the entire sequence. Furthermore, a semantic-assisted and state spatial fusion module (SASFMamba) is proposed to enrich contextual representation by enhancing semantic and geometric awareness within the Mamba model. Our method emphasizes foreground-only encoding and alleviates the distance-based and causal dependencies in the linear autoregression model. The superior performance across various benchmarks demonstrates the effectiveness of Fore-Mamba3D in the 3D object detection task.

CVJul 14, 2025Code
FIX-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text

Bingchao Wang, Zhiwei Ning, Jianyu Ding et al.

CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on under-stream tasks with long-text inputs ($>77$ tokens). To remedy this issue, we propose FIX-CLIP, which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images, respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that FIX-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we reveal that FIX-CLIP's text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input. The code is available at https://github.com/bcwang-sjtu/Fix-CLIP.

CVAug 18, 2025
CMF-IoU: Multi-Stage Cross-Modal Fusion 3D Object Detection with IoU Joint Prediction

Zhiwei Ning, Zhaojiang Liu, Xuanang Gao et al.

Multi-modal methods based on camera and LiDAR sensors have garnered significant attention in the field of 3D detection. However, many prevalent works focus on single or partial stage fusion, leading to insufficient feature extraction and suboptimal performance. In this paper, we introduce a multi-stage cross-modal fusion 3D detection framework, termed CMF-IOU, to effectively address the challenge of aligning 3D spatial and 2D semantic information. Specifically, we first project the pixel information into 3D space via a depth completion network to get the pseudo points, which unifies the representation of the LiDAR and camera information. Then, a bilateral cross-view enhancement 3D backbone is designed to encode LiDAR points and pseudo points. The first sparse-to-distant (S2D) branch utilizes an encoder-decoder structure to reinforce the representation of sparse LiDAR points. The second residual view consistency (ResVC) branch is proposed to mitigate the influence of inaccurate pseudo points via both the 3D and 2D convolution processes. Subsequently, we introduce an iterative voxel-point aware fine grained pooling module, which captures the spatial information from LiDAR points and textural information from pseudo points in the proposal refinement stage. To achieve more precise refinement during iteration, an intersection over union (IoU) joint prediction branch integrated with a novel proposals generation technique is designed to preserve the bounding boxes with both high IoU and classification scores. Extensive experiments show the superior performance of our method on the KITTI, nuScenes and Waymo datasets.