CV · May 15

Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction

arXiv:2605.15923 · 67.8
Predicted impact top 46% in CV · last 90 days
Originality Highly original
AI Analysis

For robotics and 3D vision, this addresses the bottleneck of point cloud encoders overfitting to specific resolutions and scales, enabling robust real-world deployment.

Invaria introduces a point cloud encoder that achieves scale and density invariance via next-resolution prediction and receptive field calibration, resulting in a 56% higher mIoU at 3× lower resolution and 20% improvement under 3× scale reduction on ScanNet, with 45% smaller model size and 40% fewer input tokens.

Modern image encoders achieve high generalization by decoupling semantic meaning from resolution, an ability yet to be fully realized in the 3D domain. We investigate the failure of 3D point cloud encoders to achieve similar generalization and find that existing models are highly sensitive to changes in sampling resolution and scale, leading to significant performance degradation. This sensitivity is a major bottleneck for real-world deployment in robotics, as it suggests models overfit to specific quantization densities and object scales rather than learning invariant semantic features. To mitigate this dependency, we propose Invaria, a point cloud encoder that achieves scale and density invariance through next-resolution prediction and receptive field calibration. While our objective is not the explicit generation of high-resolution point clouds, we find that this training objective encourages the model to learn robust structural invariants. The resulting encoder achieves significant performance gains under resolution shifts while maintaining high efficiency through a compact model size and reduced token requirements. Specifically, on ScanNet, Invaria achieves a 56.0% higher mIoU at 3× lower resolution and a 20% improvement when the object scale is reduced by a factor of 3. These gains are achieved with a 45% smaller model size and an average reduction of 40% in input tokens.
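To make the two distribution shifts in the abstract concrete, here is a minimal sketch of how a 3× resolution reduction and a 3× scale reduction could be simulated on a raw point cloud. The abstract does not describe the evaluation protocol, so everything here is an assumption: voxel-grid quantization stands in for the "lower resolution" setting, uniform scaling about the origin stands in for the "scale reduced by a factor of 3" setting, and the function names `voxel_downsample` and `rescale` are hypothetical. It is not Invaria's method or its benchmark code.

```python
import numpy as np


def voxel_downsample(points: np.ndarray, voxel_size: float) -> np.ndarray:
    """Return one centroid per occupied voxel (assumed stand-in for a resolution change)."""
    # Integer voxel coordinates for every point.
    keys = np.floor(points / voxel_size).astype(np.int64)
    # inverse[i] is the index of the voxel that point i falls into.
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # guard against NumPy versions that return shape (N, 1)
    n_voxels = int(inverse.max()) + 1
    # Accumulate per-voxel coordinate sums, then divide by per-voxel counts.
    sums = np.zeros((n_voxels, points.shape[1]))
    np.add.at(sums, inverse, points)
    counts = np.bincount(inverse, minlength=n_voxels)
    return sums / counts[:, None]


def rescale(points: np.ndarray, factor: float) -> np.ndarray:
    """Uniformly scale the cloud about the origin (assumed stand-in for a scale change)."""
    return points * factor


# Synthetic cloud in the unit cube; a real test would load a ScanNet scene instead.
rng = np.random.default_rng(0)
cloud = rng.random((10_000, 3))

base = voxel_downsample(cloud, voxel_size=0.02)      # assumed training-time grid
coarse = voxel_downsample(cloud, voxel_size=0.06)    # 3x coarser grid
shrunk = voxel_downsample(rescale(cloud, 1 / 3), voxel_size=0.02)  # 3x smaller scene, same grid

print(len(base), len(coarse), len(shrunk))
```

Under a fixed voxel grid, shrinking the scene by 3× and coarsening the grid by 3× occupy the same set of voxels, which is one way to see why scale and density sensitivity are closely related for quantization-based encoders.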

