PS-Transformer: Learning Sparse Photometric Stereo Network using Self-Attention Mechanism
This work addresses the challenge of accurate 3D surface reconstruction from images captured under only a few lighting conditions, and proposes an approach that improves performance in this sparse setting.
The paper tackles sparse calibrated photometric stereo, where existing methods often fail when given only a few input images. It proposes PS-Transformer, which uses a self-attention mechanism to capture complex inter-image interactions. Its surface normal prediction accuracy significantly outperforms state-of-the-art methods given the same number of images and is comparable to that of dense algorithms using 10x more images.
Existing deep calibrated photometric stereo networks aggregate observations under different lights using pre-defined operations such as linear projection and max pooling. While these are effective with dense capture, such simple first-order operations often fail to capture the high-order interactions among observations under a small number of different lights. To tackle this issue, this paper presents a deep sparse calibrated photometric stereo network named {\it PS-Transformer}, which leverages a learnable self-attention mechanism to properly capture the complex inter-image interactions. PS-Transformer builds upon a dual-branch design to explore both pixel-wise and image-wise features, and each feature is trained with intermediate surface normal supervision to maximize geometric feasibility. A new synthetic dataset named CyclesPS+ is also presented, along with a comprehensive analysis, to successfully train photometric stereo networks. Extensive results on publicly available benchmark datasets demonstrate that the surface normal prediction accuracy of the proposed method significantly outperforms other state-of-the-art algorithms with the same number of input images and is even comparable to that of dense algorithms that take 10$\times$ as many input images.
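
To make the contrast between pre-defined aggregation and self-attention concrete, the following is a minimal NumPy sketch rather than the PS-Transformer architecture itself. It assumes each light contributes one feature vector for a given pixel; the feature values and projection matrices are random stand-ins for learned quantities, and the names `max_pool_aggregate`, `self_attention`, `num_lights`, and `dim` are hypothetical, chosen only for illustration.

```python
import numpy as np


def max_pool_aggregate(features):
    """Pre-defined aggregation: element-wise max over the light axis."""
    return features.max(axis=0)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(features, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the light axis.

    features: (num_lights, dim) array, one row per light.
    Returns the attended features and the (num_lights, num_lights) weights.
    """
    q = features @ w_q
    k = features @ w_k
    v = features @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


# Assumed toy setup: 10 lights, 8-dimensional per-light features.
rng = np.random.default_rng(0)
num_lights, dim = 10, 8
features = rng.normal(size=(num_lights, dim))

# Random stand-ins for projection matrices that would normally be learned.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

# Baseline: every light is reduced independently of the others.
pooled = max_pool_aggregate(features)

# Self-attention: each light's feature becomes a weighted mix of all lights,
# so pairwise interactions are modelled before any pooling takes place.
attended, weights = self_attention(features, w_q, w_k, w_v)
fused = attended.max(axis=0)
```

In this sketch the attention weights depend on pairs of observations, which is the kind of inter-image interaction that a plain max over independent features cannot express. Because no positional encoding is used, the pooled result is still independent of the order in which the lights are listed.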