CV · RO · Apr 17

PLAF: Pixel-wise Language-Aligned Feature Extraction for Efficient 3D Scene Understanding

arXiv:2604.15770 · 59.5 · h-index: 2 · Has Code
AI Analysis

This work addresses the challenge of balancing semantic precision and scalability in open-vocabulary 3D scene understanding for robotics and AR/VR applications.

PLAF introduces a pixel-wise language-aligned feature extraction framework for 3D scene understanding that achieves dense semantic alignment in 2D while reducing redundancy in 3D storage and querying. The method enables efficient open-vocabulary 3D understanding without sacrificing accuracy.

Accurate open-vocabulary 3D scene understanding requires semantic representations that are both language-aligned and spatially precise at the pixel level, while remaining scalable when lifted to 3D space. However, existing representations struggle to jointly satisfy these requirements, and densely propagating pixel-wise semantics to 3D often results in substantial redundancy, leading to inefficient storage and querying in large-scale scenes. To address these challenges, we present PLAF, a Pixel-wise Language-Aligned Feature extraction framework that enables dense and accurate semantic alignment in 2D without sacrificing open-vocabulary expressiveness. Building upon this representation, we further design an efficient semantic storage and querying scheme that significantly reduces redundancy across both 2D and 3D domains. Experimental results show that PLAF provides a strong semantic foundation for accurate and efficient open-vocabulary 3D scene understanding. The code is publicly available at https://github.com/RockWenJJ/PLAF.
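The abstract does not describe how the storage and querying scheme works, so the following is only an illustrative sketch of the general idea of removing redundancy: store each distinct semantic feature once and let every 3D point hold an index into that shared table. It is not PLAF's actual method. The class name `SemanticStore`, the thresholds, and the toy 2D vectors are all hypothetical; real language-aligned features would be high-dimensional embeddings from a vision-language model.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


class SemanticStore:
    """Hypothetical deduplicated store (an assumption, not the paper's design).

    Near-identical features share one codebook entry, so memory grows with
    the number of distinct semantics rather than the number of 3D points.
    """

    def __init__(self, merge_threshold=0.95):
        self.merge_threshold = merge_threshold
        self.codebook = []     # unit-norm feature vectors, one per group
        self.point_index = []  # point i -> index of its codebook entry

    def add_point(self, feature):
        """Attach a point to the first sufficiently similar entry, or add one."""
        f = normalize(feature)
        for k, c in enumerate(self.codebook):
            if cosine(f, c) >= self.merge_threshold:
                self.point_index.append(k)
                return k
        self.codebook.append(f)
        self.point_index.append(len(self.codebook) - 1)
        return len(self.codebook) - 1

    def query(self, text_embedding, min_score=0.5):
        """Return ids of points whose codebook entry matches the query.

        Similarity is computed once per codebook entry, not once per point.
        """
        q = normalize(text_embedding)
        hits = {k for k, c in enumerate(self.codebook)
                if cosine(q, c) >= min_score}
        return [i for i, k in enumerate(self.point_index) if k in hits]


store = SemanticStore()
for feat in [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]:
    store.add_point(feat)

matches = store.query([0.9, 0.1])  # stand-in for a text embedding
print(len(store.codebook), matches)
```

In this toy run, four points collapse into two codebook entries, and the query is scored against two vectors instead of four. The greedy first-match merge is the simplest possible choice; a real system might cluster features or merge per object instance instead.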

