CV · Feb 28

PointAlign: Feature-Level Alignment Regularization for 3D Vision-Language Models

Yuanhao Su, Shaofeng Zhang, Xiaosong Jia, Qi Fan

arXiv:2603.00412v1 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses data scarcity and geometric degradation in 3D VLMs, which matter for applications such as robotics and autonomous driving. It represents an incremental improvement over existing methods.

The paper tackles geometric information loss in 3D vision-language models, which stems from limited paired data and inefficient supervision. It proposes a feature-level alignment regularization method that achieves a 2.08 percentage point average improvement on classification tasks, with gains of up to 7.50 pp in open-vocabulary settings.

The development of 3D Vision-Language Models (VLMs), crucial for applications in robotics, autonomous driving, and augmented reality, is severely constrained by the scarcity of paired 3D-text data. Existing methods rely solely on next-token prediction loss, using only language tokens for supervision. This results in inefficient utilization of limited 3D data and leads to a significant degradation and loss of valuable geometric information in intermediate representations. To address these limitations, we propose PointAlign, a novel feature-level alignment regularization method. PointAlign explicitly supervises intermediate point cloud tokens to preserve fine-grained 3D geometric-semantic information throughout the language modeling process. Specifically, we constrain the intermediate point cloud tokens within the LLM to align with visual input tokens via a consistency loss. By training only a lightweight alignment projector and LoRA adapters, PointAlign achieves explicit feature-level supervision with minimal computational overhead, effectively preventing geometric degradation. Extensive experiments on the ModelNet40 and Objaverse datasets demonstrate that our method achieves a 2.08 pp improvement on average for classification tasks, with a substantial 7.50 pp gain on the challenging open-vocabulary Objaverse classification task and a 4.88 pp improvement on 3D object captioning evaluated by Qwen2-72B-Instruct, validating the effectiveness of PointAlign. Code is publicly available at https://github.com/yharoldsu0627/PointAlign.
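The feature-level consistency loss described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact form of the loss, so a cosine-distance objective is assumed here, the alignment projector is assumed to be a single linear map, and the names `project`, `consistency_loss`, `total_loss`, and the weighting factor `lam` are all hypothetical.

```python
import math


def project(token, weight):
    """Apply a linear projector to one token (hypothetical stand-in for the
    lightweight alignment projector). `weight` is a list of rows, out_dim x in_dim."""
    return [sum(w * x for w, x in zip(row, token)) for row in weight]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two vectors, guarded against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (max(norm_a, eps) * max(norm_b, eps))


def consistency_loss(hidden_tokens, input_tokens, weight):
    """Mean cosine distance between projected intermediate point cloud tokens
    and the corresponding visual input tokens.

    Assumption: the cosine-distance form is illustrative; the source only
    states that a consistency loss aligns the two sets of tokens.
    """
    if len(hidden_tokens) != len(input_tokens):
        raise ValueError("token sequences must have the same length")
    distances = [
        1.0 - cosine_similarity(project(h, weight), v)
        for h, v in zip(hidden_tokens, input_tokens)
    ]
    return sum(distances) / len(distances)


def total_loss(next_token_loss, alignment_loss, lam=0.1):
    """Combine the language-modeling loss with the alignment regularizer.

    Assumption: `lam` is a hypothetical weighting factor, not a value
    taken from the paper.
    """
    return next_token_loss + lam * alignment_loss
```

In this sketch, the loss is zero when every projected intermediate token points in the same direction as its input token and grows as the two drift apart, which is one way a regularizer could discourage the geometric degradation the abstract describes.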
