Generative World Renderer
This work addresses the synthetic-to-real domain gap in rendering, a concern for computer vision and graphics researchers, though it is incremental in that it builds on existing datasets and evaluation methods.
The authors tackle the limited realism and temporal coherence of synthetic datasets for generative inverse and forward rendering by introducing a large-scale dynamic dataset of 4M continuous frames captured from AAA games. Inverse renderers fine-tuned on this data generalize better across datasets, and the data supports controllable, G-buffer-guided generation. They also propose a VLM-based assessment protocol, which correlates with human judgment, for evaluating inverse rendering without ground truth.
Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.
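To give a sense of the scale quoted above, the following is a rough back-of-the-envelope sketch, not a figure reported by the authors. It assumes that "4M" means exactly four million frames, that 720p means 1280x720 pixels, and that each frame contributes one image per stream (RGB plus the five G-buffer channels). All constant and function names are illustrative.

```python
# Back-of-the-envelope scale estimate from the numbers in the abstract.
# Assumptions (not stated in the source): "4M" = exactly 4,000,000 frames,
# 720p = 1280x720, and one image per stream per frame.

FRAMES = 4_000_000          # "4M continuous frames"
FPS = 30                    # capture rate given in the abstract
WIDTH, HEIGHT = 1280, 720   # conventional 720p resolution (assumption)
STREAMS = 1 + 5             # RGB plus five G-buffer channels


def footage_hours(frames: int, fps: int) -> float:
    """Total playback time in hours for `frames` frames at `fps`."""
    return frames / fps / 3600


hours = footage_hours(FRAMES, FPS)
images = FRAMES * STREAMS            # synchronized images across all streams
pixels_per_image = WIDTH * HEIGHT    # pixels in a single 720p image

print(f"footage: about {hours:.1f} hours")
print(f"synchronized images: {images:,}")
print(f"pixels per image: {pixels_per_image:,}")
```

Under these assumptions the dataset corresponds to roughly 37 hours of continuous footage and 24 million synchronized images. The actual storage layout and frame count may differ from this estimate.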