From Theory to Throughput: CUDA-Optimized APML for Large-Batch 3D Learning
This work addresses a computational bottleneck for researchers and practitioners in 3D computer vision by enabling large-batch training with transport-based losses. The contribution is incremental, since it optimizes an existing method.
The paper tackles the high memory cost of the APML loss for 3D point cloud learning with CUDA-APML, a sparse GPU implementation that reduces peak GPU memory by 99.9% while matching dense APML performance within a small tolerance on the ShapeNet and MM-Fi datasets.
Loss functions are fundamental to learning accurate 3D point cloud models, yet common choices trade geometric fidelity for computational cost. Chamfer Distance is efficient but permits many-to-one correspondences, while Earth Mover Distance better reflects one-to-one transport at high computational cost. APML approximates transport with differentiable Sinkhorn iterations and an analytically derived temperature, but its dense formulation scales quadratically in memory. We present CUDA-APML, a sparse GPU implementation that thresholds negligible assignments and runs adaptive softmax, bidirectional symmetrization, and Sinkhorn normalization directly in COO form. This yields near-linear memory scaling and preserves gradients on the stored support, while pairwise distance evaluation remains quadratic in the current implementation. On ShapeNet and MM-Fi, CUDA-APML matches dense APML within a small tolerance while reducing peak GPU memory by 99.9%. Code available at: https://github.com/Multimodal-Sensing-Lab/apml
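The pipeline described above (thresholded softmax, bidirectional symmetrization, and Sinkhorn normalization on a COO support) can be illustrated with a minimal pure-Python sketch. It is not the authors' CUDA code. The function names, the fixed `temperature` (standing in for APML's analytically derived one), the squared Euclidean cost, and the `threshold` value are all assumptions made for illustration. Distances are recomputed on the fly, so compute stays quadratic while storage is limited to per-row and per-column statistics plus the retained entries.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points (assumed cost)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sparse_soft_assignment(X, Y, temperature=10.0, threshold=1e-6):
    """Symmetrized row/column softmax of -temperature * distance, kept in COO form.

    Only entries with probability >= threshold are stored.
    """
    n, m = len(X), len(Y)
    # Pass 1: streaming softmax statistics, O(n + m) extra memory.
    row_min = [min(sq_dist(x, y) for y in Y) for x in X]
    col_min = [min(sq_dist(x, y) for x in X) for y in Y]
    row_sum = [0.0] * n
    col_sum = [0.0] * m
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            d = sq_dist(x, y)
            row_sum[i] += math.exp(-temperature * (d - row_min[i]))
            col_sum[j] += math.exp(-temperature * (d - col_min[j]))
    # Pass 2: emit only the non-negligible symmetrized entries.
    rows, cols, vals = [], [], []
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            d = sq_dist(x, y)
            p_row = math.exp(-temperature * (d - row_min[i])) / row_sum[i]
            p_col = math.exp(-temperature * (d - col_min[j])) / col_sum[j]
            p = 0.5 * (p_row + p_col)  # bidirectional symmetrization
            if p >= threshold:
                rows.append(i)
                cols.append(j)
                vals.append(p)
    return rows, cols, vals

def sinkhorn_coo(rows, cols, vals, n, m, iters=20, eps=1e-12):
    """Alternate row and column normalization over the stored entries only."""
    vals = list(vals)
    for _ in range(iters):
        row_tot = [0.0] * n
        for r, v in zip(rows, vals):
            row_tot[r] += v
        vals = [v / (row_tot[r] + eps) for r, v in zip(rows, vals)]
        col_tot = [0.0] * m
        for c, v in zip(cols, vals):
            col_tot[c] += v
        vals = [v / (col_tot[c] + eps) for c, v in zip(cols, vals)]
    return vals

def transport_cost(X, Y, rows, cols, vals):
    """Loss value: assignment-weighted cost summed over the stored support."""
    return sum(v * sq_dist(X[i], Y[j]) for i, j, v in zip(rows, cols, vals))

# Toy usage: Y is a slightly perturbed, permuted copy of X.
X = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
Y = [(1.0, 0.1, 0.0), (0.0, 1.0, 0.1), (0.1, 0.0, 0.0)]
rows, cols, vals = sparse_soft_assignment(X, Y, temperature=50.0)
vals = sinkhorn_coo(rows, cols, vals, len(X), len(Y))
print(len(vals), "of", len(X) * len(Y), "entries stored")
print("transport cost:", transport_cost(X, Y, rows, cols, vals))
```

In an autodiff framework, gradients would flow only through the stored `vals`, which is consistent with the statement that gradients are preserved on the stored support. A GPU version would parallelize both passes and the per-row and per-column reductions.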