Collect-and-Distribute Transformer for 3D Point Cloud Analysis
This work addresses point cloud analysis for applications such as 3D object recognition and scene understanding, and represents an incremental improvement over existing transformer-based methods.
The paper tackles the challenge of learning both local and global structures in 3D point clouds by proposing CDFormer, a transformer network with a collect-and-distribute mechanism, which achieves several new state-of-the-art results on classification and segmentation tasks evaluated on five datasets.
Remarkable advances have recently been made in point cloud analysis through the exploration of transformer architectures, but it remains challenging to effectively learn local and global structures within point clouds. In this paper, we propose a new transformer network equipped with a collect-and-distribute mechanism to communicate short- and long-range contexts of point clouds, which we refer to as CDFormer. Specifically, we first employ self-attention to capture short-range interactions within each local patch, and the updated local features are then collected into a set of proxy reference points from which we can extract long-range contexts. Afterward, we distribute the learned long-range contexts back to local points via cross-attention. To account for position clues in both short- and long-range contexts, we additionally introduce a context-aware position encoding to facilitate position-aware communication between points. We perform experiments on five popular point cloud datasets, namely ModelNet40, ScanObjectNN, ShapeNetPart, S3DIS and ScanNetV2, for classification and segmentation. Results show the effectiveness of the proposed CDFormer, which delivers several new state-of-the-art results on point cloud classification and segmentation tasks. The source code is available at \url{https://github.com/haibo-qiu/CDFormer}.
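The collect-and-distribute flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the released CDFormer implementation: the function names, the mean pooling used to form proxy points, the residual connection, and the assumption that consecutive rows already form a local patch are all illustrative choices, and learned projections and the context-aware position encoding are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Single-head scaled dot-product attention without learned projections.
    # q: (..., n, d); k and v: (..., m, d); returns (..., n, d).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def collect_and_distribute(feats, patch_size):
    # feats: (N, d) point features. Assumption: rows are already ordered so
    # that every `patch_size` consecutive rows form one local patch.
    n, d = feats.shape
    assert n % patch_size == 0, "N must be divisible by patch_size"
    patches = feats.reshape(n // patch_size, patch_size, d)

    # 1) Short-range: self-attention inside each local patch.
    local = attention(patches, patches, patches)            # (P, S, d)

    # 2) Collect: summarise each patch into one proxy point
    #    (mean pooling is an assumption made for this sketch).
    proxies = local.mean(axis=1)                            # (P, d)

    # 3) Long-range: self-attention among the proxy points.
    proxies = attention(proxies, proxies, proxies)          # (P, d)

    # 4) Distribute: every local point queries the proxies via
    #    cross-attention; the residual sum is also an assumption.
    local_flat = local.reshape(n, d)
    return local_flat + attention(local_flat, proxies, proxies)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.normal(size=(32, 8))    # 32 points, 8-dim features
    out = collect_and_distribute(points, patch_size=8)
    print(out.shape)
```

Because the proxies are far fewer than the points, steps 3 and 4 let every point receive context from distant patches without computing full attention over all point pairs.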