CV · Feb 2, 2023

SceneScape: Text-Driven Consistent Scene Generation

arXiv:2302.01133v2 · 180 citations · h-index: 33
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of text-driven perpetual view generation, with applications in virtual reality and content creation, though it builds incrementally on existing pre-trained models.

The paper tackles the problem of generating long-term, 3D-consistent videos from text prompts by combining a pre-trained text-to-image model with a monocular depth prediction model, achieving geometrically-plausible scene synthesis for diverse environments like spaceships and caves.

We present a method for text-driven perpetual view generation: synthesizing long-term videos of various scenes, given solely an input text prompt describing the scene and camera poses. We introduce a novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with the geometric priors learned by a pre-trained monocular depth prediction model. To tackle the pivotal challenge of achieving 3D consistency, i.e., synthesizing videos that depict geometrically plausible scenes, we deploy online test-time training to encourage the predicted depth map of the current frame to be geometrically consistent with the synthesized scene. The depth maps are used to construct a unified mesh representation of the scene, which is progressively built up along the video generation process. In contrast to previous works, which are applicable only to limited domains, our method generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles.
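The geometric idea behind the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple pinhole camera with intrinsics `K`, and the function names `unproject_depth`, `project_points`, and `masked_l1` are hypothetical. It shows how a depth map lifts a frame to 3D points, how those points land in the next camera view, and one plausible form (a masked L1 term, chosen here for illustration) of a loss that pulls a predicted depth map toward the depth rendered from the existing scene.

```python
import numpy as np

def unproject_depth(depth, K):
    """Lift an (H, W) depth map to (H*W, 3) camera-space points (pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grid, u = column, v = row
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                 # back-projected rays with z = 1
    return rays * depth.reshape(-1, 1)              # scale each ray by its depth

def project_points(points, K, R, t):
    """Project points from the current camera frame into the next camera.

    R (3x3) and t (3,) are assumed to map current-frame coordinates
    to next-frame coordinates. Returns pixel coordinates and depths.
    """
    cam = points @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

def masked_l1(pred_depth, rendered_depth, mask):
    """Illustrative consistency term: mean |pred - rendered| over visible pixels."""
    if not mask.any():
        return 0.0
    return float(np.abs(pred_depth - rendered_depth)[mask].mean())

if __name__ == "__main__":
    K = np.array([[50.0, 0.0, 2.0], [0.0, 50.0, 1.5], [0.0, 0.0, 1.0]])
    depth = np.full((3, 4), 2.0)                    # flat wall 2 units away
    pts = unproject_depth(depth, K)
    # Camera moves 0.5 units forward, so points shift by -0.5 along z.
    uv, z = project_points(pts, K, np.eye(3), np.array([0.0, 0.0, -0.5]))
    print(z.min(), z.max())
```

In a full pipeline of the kind the abstract describes, the reprojected content would leave holes for a text-to-image model to fill, and a term like `masked_l1` would be minimized over the depth network's weights at test time; both of those steps are outside this sketch.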

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
