RePAST: Relative Pose Attention Scene Representation Transformer
This work addresses a scalability issue for transformer-based rendering methods in large-scale scenes, though it is incremental as it builds directly on SRT.
The paper addresses the fact that the Scene Representation Transformer (SRT) is not invariant to the order of its input views, because it expresses camera poses relative to an arbitrarily chosen reference camera; this limits its applicability to large-scale scenes. The authors propose RePAST, which injects pairwise relative camera poses directly into the attention mechanism, making the model invariant to the choice of global reference frame. Empirical results indicate that this invariance comes without a loss in quality.
The Scene Representation Transformer (SRT) is a recent method to render novel views at interactive rates. Since SRT uses camera poses with respect to an arbitrarily chosen reference camera, it is not invariant to the order of the input views. As a result, SRT is not directly applicable to large-scale scenes where the reference frame would need to be changed regularly. In this work, we propose Relative Pose Attention SRT (RePAST): Instead of fixing a reference frame at the input, we inject pairwise relative camera pose information directly into the attention mechanism of the Transformers. This leads to a model that is by definition invariant to the choice of any global reference frame, while still retaining the full capabilities of the original method. Empirical results show that adding this invariance to the model does not lead to a loss in quality. We believe that this is a step towards applying fully latent transformer-based rendering methods to large-scale scenes.
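The invariance claim in the abstract rests on a simple geometric fact: pairwise relative poses do not change when every camera is re-expressed in a different global frame. The sketch below illustrates only that fact, not the RePAST attention mechanism itself, whose details are not given here. It assumes camera poses are 4x4 rigid camera-to-world matrices, and all names (`pose_z`, `rigid_inverse`, `relative_poses`, `max_abs_diff`) are hypothetical helpers written for this illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rigid_inverse(T):
    """Invert a 4x4 rigid transform [R | t] as [R^T | -R^T t]."""
    R_t = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(R_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [R_t[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def pose_z(angle, tx, ty, tz):
    """Toy pose (assumption): rotation about the z axis plus a translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def relative_poses(cams):
    """All pairwise relative poses: rel[i][j] = inv(T_i) @ T_j."""
    return [[matmul(rigid_inverse(Ti), Tj) for Tj in cams] for Ti in cams]


def max_abs_diff(rel_a, rel_b):
    """Largest element-wise difference between two grids of 4x4 matrices."""
    return max(abs(x - y)
               for row_a, row_b in zip(rel_a, rel_b)
               for Ma, Mb in zip(row_a, row_b)
               for ra, rb in zip(Ma, Mb)
               for x, y in zip(ra, rb))


# Three toy cameras in some world frame.
cams = [pose_z(0.3, 1.0, 0.0, 2.0),
        pose_z(-1.1, 0.5, 4.0, 0.0),
        pose_z(2.0, -3.0, 1.0, 1.0)]

# Re-express every camera in a different global frame G.
G = pose_z(0.7, 10.0, -5.0, 3.0)
moved = [matmul(G, T) for T in cams]

# Absolute poses change, but pairwise relative poses do not:
# inv(G T_i) (G T_j) = inv(T_i) inv(G) G T_j = inv(T_i) T_j.
diff = max_abs_diff(relative_poses(cams), relative_poses(moved))
print(diff < 1e-9)
```

A model that only ever consumes `rel[i][j]` is therefore unaffected by the choice of global frame, whereas a model fed poses relative to one fixed reference camera depends on which camera is chosen as that reference.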