CV | Aug 19, 2024

P3P: Pseudo-3D Pre-training for Scaling 3D Voxel-based Masked Autoencoders

Xuechao Chen, Ying Chen, Jialin Li, Qiang Nie, Hanqiu Deng, Yong Liu, Qixing Huang, Yang Li

arXiv:2408.10007v32.0 | h-index: 12 | Has Code

Originality: Highly original

AI Analysis

The paper addresses data scarcity in 3D perception for computer vision applications, and it proposes a novel method rather than an incremental improvement.

The paper tackles the challenge of scaling 3D pre-training by incorporating millions of images via depth estimation, proposing a linear-time tokenizer to handle varying point numbers and achieving state-of-the-art results in 3D classification, few-shot learning, and segmentation.

3D pre-training is crucial to 3D perception tasks. However, because clean and complete 3D data is difficult to collect, 3D pre-training has persistently faced data scaling challenges. In this work, we introduce a novel self-supervised pre-training framework that incorporates millions of images into 3D pre-training corpora by leveraging a large depth estimation model. These new pre-training corpora pose new challenges for the representation ability and embedding efficiency of models. Previous pre-training methods rely on farthest point sampling and k-nearest neighbors to embed a fixed number of 3D tokens. These approaches are inadequate for embedding millions of samples whose point counts vary widely, from 1,000 to 100,000. In contrast, we propose a tokenizer with linear-time complexity, which enables the efficient embedding of a flexible number of tokens. Accordingly, we propose a new 3D reconstruction target that works with our 3D tokenizer. Our method achieves state-of-the-art performance in 3D classification, few-shot learning, and 3D segmentation. Code is available at https://github.com/XuechaoChen/P3P-MAE.
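The abstract does not spell out how images become 3D samples or how the tokenizer reaches linear time, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` to lift a depth map into points, and it uses voxel hashing as one plausible way to group points in a single pass. The function names and the choice of a centroid per voxel are illustrative assumptions; the linked repository is the reference for the actual method.

```python
from collections import defaultdict
from math import floor


def backproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to 3D points with a pinhole model.

    fx, fy, cx, cy are assumed camera intrinsics (hypothetical here).
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip pixels with no valid depth
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def voxel_tokenize(points, voxel_size):
    """Group points by voxel in one pass over the input.

    Each occupied voxel yields one token (here, the centroid of its
    points), so the token count adapts to the sample. Hash-map inserts
    are expected O(1), giving expected O(N) time for N points.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for x, y, z in points:
        key = (
            floor(x / voxel_size),
            floor(y / voxel_size),
            floor(z / voxel_size),
        )
        acc = sums[key]
        acc[0] += x
        acc[1] += y
        acc[2] += z
        acc[3] += 1
    return {
        key: (sx / n, sy / n, sz / n)
        for key, (sx, sy, sz, n) in sums.items()
    }


# Usage: a tiny 2x2 depth map with one invalid pixel.
depth_map = [[1.0, 1.0], [0.0, 2.0]]
cloud = backproject_depth(depth_map, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
tokens = voxel_tokenize(cloud, voxel_size=1.0)
```

Unlike farthest point sampling followed by k-nearest-neighbor grouping, this pass never compares points against each other, which is why its cost grows with the number of points rather than with pairs of points.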

