CV · Dec 15, 2025

LitePT: Lighter Yet Stronger Point Transformer

arXiv:2512.13689v1 · 13 citations · h-index: 46 · Has Code
Originality: Highly original
AI Analysis

This work addresses the need for more efficient and effective 3D point cloud backbones in computer vision. It reduces parameter count, runtime, and memory use relative to Point Transformer V3 while matching or exceeding its accuracy.

The paper tackles the design of efficient neural architectures for 3D point cloud processing. It proposes LitePT, which uses convolutions in early layers and attention in deeper layers. The resulting model has 3.6× fewer parameters, runs 2× faster, and uses 2× less memory than the state-of-the-art Point Transformer V3, while matching or outperforming it.
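The early-convolution, late-attention layout can be illustrated with a small planning sketch. It is not the LitePT configuration: the stage count, the switch point (`first_attention_stage`), the downsampling factor, and the patch size are all hypothetical values chosen for illustration. The sketch only shows why patch-based attention is far cheaper at low resolution, where few points remain.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    index: int
    num_points: int  # points left at this resolution (illustrative)
    block: str       # "conv" or "attention"


def plan_stages(num_points: int, num_stages: int = 5,
                first_attention_stage: int = 3,
                downsample: int = 4) -> List[Stage]:
    """Assign conv blocks to early stages and attention to later ones.

    All defaults are assumptions for this sketch, not values from the paper.
    """
    stages = []
    n = num_points
    for i in range(num_stages):
        block = "conv" if i < first_attention_stage else "attention"
        stages.append(Stage(i, n, block))
        n = max(1, n // downsample)  # pooling reduces the point count
    return stages


def attention_pairs(num_points: int, patch_size: int = 1024) -> int:
    """Query-key pairs when attention is restricted to fixed-size patches."""
    return num_points * min(patch_size, num_points)


stages = plan_stages(num_points=100_000)
for s in stages:
    print(s.index, s.block, s.num_points, attention_pairs(s.num_points))
```

The printed pair counts are what attention would cost at each stage if it were used there. Under these assumed numbers, the first stage would need several hundred times more query-key pairs than the last one, which is the intuition behind keeping attention out of the high-resolution stages.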

Modern neural architectures for 3D point cloud processing contain both convolutional layers and attention blocks, but the best way to assemble them remains unclear. We analyse the role of different computational blocks in 3D point cloud networks and find an intuitive behaviour: convolution is adequate to extract low-level geometry at high-resolution in early layers, where attention is expensive without bringing any benefits; attention captures high-level semantics and context in low-resolution, deep layers more efficiently. Guided by this design principle, we propose a new, improved 3D point cloud backbone that employs convolutions in early stages and switches to attention for deeper layers. To avoid the loss of spatial layout information when discarding redundant convolution layers, we introduce a novel, training-free 3D positional encoding, PointROPE. The resulting LitePT model has $3.6\times$ fewer parameters, runs $2\times$ faster, and uses $2\times$ less memory than the state-of-the-art Point Transformer V3, but nonetheless matches or even outperforms it on a range of tasks and datasets. Code and models are available at: https://github.com/prs-eth/LitePT.
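The abstract does not spell out how PointROPE is built, so the following is only a minimal sketch of one plausible reading: a rotary positional encoding extended to 3D by splitting the feature channels into three groups and rotating each group by one coordinate axis. The function names `rope_1d` and `rope_3d`, the channel grouping, and the frequency base of 10000 are assumptions for illustration, not the authors' implementation. The sketch has no learned parameters, which is consistent with the "training-free" description.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate channel pairs (m, m + half) of vec by angles proportional to pos."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for m in range(half):
        angle = pos * base ** (-m / half)  # one frequency per channel pair
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[m], vec[m + half]
        out[m] = a * c - b * s
        out[m + half] = a * s + b * c
    return out


def rope_3d(vec, xyz, base=10000.0):
    """Split channels into three equal groups; rotate each by x, y, or z."""
    if len(vec) % 6 != 0:
        raise ValueError("channel count must be divisible by 6")
    group = len(vec) // 3
    out = []
    for axis in range(3):
        chunk = vec[axis * group:(axis + 1) * group]
        out.extend(rope_1d(chunk, xyz[axis], base))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query/key features (12 channels) and point coordinates.
q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 0.9, -0.3, 1.2, 0.7, -1.1]
k = [1.0, 0.5, -0.5, 2.0, -1.5, 0.25, 0.8, -0.2, 0.6, -0.9, 1.3, 0.4]
p_q, p_k = (1.0, 2.0, 3.0), (4.0, 0.0, -1.0)
shift = (10.0, -5.0, 2.5)

score = dot(rope_3d(q, p_q), rope_3d(k, p_k))
score_shifted = dot(
    rope_3d(q, [p + t for p, t in zip(p_q, shift)]),
    rope_3d(k, [p + t for p, t in zip(p_k, shift)]),
)
print(abs(score - score_shifted) < 1e-9)
```

Because each channel pair is rotated by an angle linear in the coordinate, the query-key score depends only on the offset between the two points. Translating the whole point cloud therefore leaves attention scores unchanged, which is the property that lets a rotary encoding carry spatial layout without extra layers.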

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
