Multi-task View Synthesis with Neural Radiance Fields
This work addresses two limitations of current multi-task dense prediction methods in computer vision: they lack multi-view consistency and the capability for versatile imagination. The contribution is incremental in that it builds on existing NeRF backbones.
The paper introduces multi-task view synthesis (MTVS), a problem setting that reinterprets multi-task prediction as a set of novel-view synthesis tasks for multiple scene properties. To solve it, the paper proposes MuvieNeRF, a framework that outperforms conventional discriminative models in various settings.
Multi-task visual learning is a critical aspect of computer vision. Current research, however, predominantly concentrates on the multi-task dense prediction setting, which overlooks the intrinsic 3D world and its multi-view consistent structures, and lacks the capability for versatile imagination. In response to these limitations, we present a novel problem setting -- multi-task view synthesis (MTVS), which reinterprets multi-task prediction as a set of novel-view synthesis tasks for multiple scene properties, including RGB. To tackle the MTVS problem, we propose MuvieNeRF, a framework that incorporates both multi-task and cross-view knowledge to simultaneously synthesize multiple scene properties. MuvieNeRF integrates two key modules, the Cross-Task Attention (CTA) and Cross-View Attention (CVA) modules, enabling the efficient use of information across multiple views and tasks. Extensive evaluation on both synthetic and realistic benchmarks demonstrates that MuvieNeRF is capable of simultaneously synthesizing different scene properties with promising visual quality, even outperforming conventional discriminative models in various settings. Notably, we show that MuvieNeRF exhibits universal applicability across a range of NeRF backbones. Our code is available at https://github.com/zsh2000/MuvieNeRF.
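The abstract names the Cross-View Attention (CVA) and Cross-Task Attention (CTA) modules without giving their internals. The sketch below is not the authors' implementation; it only illustrates the general idea of attending along a view axis and then along a task axis, using plain scaled dot-product self-attention with identity projections. The function names, the toy feature sizes, and the choice to apply CVA before CTA are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    tokens: list of n feature vectors (lists of floats), each of length d.
    Returns n vectors; each is a convex combination of the inputs.
    (Assumption: a real module would use learned projections.)
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def cross_view_attention(feats):
    """For each task, mix features across source views.

    feats[v][t] is the feature vector for source view v and task t.
    """
    n_views, n_tasks = len(feats), len(feats[0])
    out = [[None] * n_tasks for _ in range(n_views)]
    for t in range(n_tasks):
        mixed = self_attention([feats[v][t] for v in range(n_views)])
        for v in range(n_views):
            out[v][t] = mixed[v]
    return out


def cross_task_attention(feats):
    """For each source view, mix features across tasks."""
    return [self_attention(view_feats) for view_feats in feats]


# Toy input: 2 source views, 3 tasks (e.g. RGB plus two other scene
# properties), 4-dimensional features. Values are arbitrary.
feats = [[[(v + t + i) * 0.1 for i in range(4)] for t in range(3)] for v in range(2)]

# Hypothetical ordering: fuse across views first, then across tasks.
fused = cross_task_attention(cross_view_attention(feats))
```

Because the projections are identity maps, every output feature is a weighted average of input features, which makes the information flow easy to inspect: after both steps, each (view, task) feature depends on every other view and task in the toy input.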