Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
This work provides a real-time solution for improving the visual quality of 3D streaming for users in AR/VR applications, where sparse camera setups lead to incomplete rendered images.
This paper addresses the problem of missing information and incomplete surfaces in 3D streaming from sparse multi-camera setups, which often leads to visual artifacts. The authors propose a transformer-based inpainting method that acts as an image-based post-processing step to complete missing textures. Their method achieves the best trade-off between quality and speed compared to state-of-the-art inpainting techniques under real-time constraints, outperforming competitors in both image- and video-based metrics.
High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views, often a consequence of real-time constraints, leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures with a novel, application-targeted inpainting method that runs as an image-based post-processing step after novel view rendering and is therefore independent of the underlying scene representation. The method is designed as a standalone module compatible with any calibrated multi-camera system. To this end, we introduce a multi-view aware, transformer-based network architecture that uses spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, enabling real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image- and video-based metrics.
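The abstract does not spell out how the adaptive patch selection works, so the following is only a minimal sketch of one plausible reading: split the hole mask of a rendered frame into fixed-size patches, keep only patches that contain missing pixels, and process at most a fixed budget of them per frame, most-damaged first. The function name `select_patches`, the parameters `patch_size` and `max_patches`, and the ranking by hole-pixel count are all assumptions made for illustration, not the authors' method.

```python
import numpy as np


def select_patches(hole_mask, patch_size=16, max_patches=64):
    """Return (row, col) grid indices of the patches with the most missing pixels.

    hole_mask: 2D boolean array, True where the rendered image has no data.
    The ranking criterion and the budget are illustrative assumptions.
    """
    h, w = hole_mask.shape
    # Pad to a multiple of patch_size so any input resolution is accepted.
    pad_h = (-h) % patch_size
    pad_w = (-w) % patch_size
    padded = np.pad(hole_mask.astype(bool), ((0, pad_h), (0, pad_w)))
    rows = padded.shape[0] // patch_size
    cols = padded.shape[1] // patch_size
    # Count missing pixels in every patch of the grid.
    counts = padded.reshape(rows, patch_size, cols, patch_size).sum(axis=(1, 3))
    flat = counts.ravel()
    # Patches without holes need no inpainting and are skipped entirely.
    candidates = np.flatnonzero(flat > 0)
    # Most-damaged patches first; the stable sort keeps raster order for ties.
    order = candidates[np.argsort(-flat[candidates], kind="stable")]
    chosen = order[:max_patches]
    return [(int(i // cols), int(i % cols)) for i in chosen]


if __name__ == "__main__":
    mask = np.zeros((40, 50), dtype=bool)
    mask[0:4, 0:4] = True      # small hole in patch (0, 0)
    mask[20:30, 20:30] = True  # larger hole in patch (1, 1)
    mask[39, 49] = True        # single missing pixel in patch (2, 3)
    print(select_patches(mask, patch_size=16, max_patches=2))
```

Under this reading, lowering `max_patches` caps the per-frame workload of the inpainting network, which is how such a budget would trade quality for inference speed; the padding step is what lets the same routine run on any camera resolution.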