TextMesh: Generation of Realistic 3D Meshes From Text Prompts
This work addresses the need for practical 3D content generation in gaming, VR, and design, although its contribution is incremental, since it builds on existing diffusion-based approaches.
The paper tackles the problem of generating realistic 3D meshes from text prompts. It addresses two limitations of existing methods: they output neural radiance fields (NeRFs), which are impractical for most applications, and their results tend to be over-saturated. The proposed approach achieves improved mesh extraction and texture detail.
The ability to generate highly realistic 2D images from mere text prompts has recently made huge progress in terms of speed and quality, thanks to the advent of image diffusion models. Naturally, the question arises whether this can also be achieved for the generation of 3D content from such text prompts. To this end, a new line of methods has recently emerged that harnesses diffusion models, trained on 2D images, to supervise 3D model generation using view-dependent prompts. While achieving impressive results, these methods have two major drawbacks. First, rather than the commonly used 3D meshes, they generate neural radiance fields (NeRFs), making them impractical for most real applications. Second, these approaches tend to produce over-saturated models, giving the output a cartoonish look. Therefore, in this work we propose a novel method for the generation of highly realistic-looking 3D meshes. To this end, we extend NeRF to employ an SDF backbone, leading to improved 3D mesh extraction. In addition, we propose a novel way to finetune the mesh texture, removing the effect of high saturation and improving the details of the output 3D mesh.
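To illustrate why an SDF backbone can help with mesh extraction, the following is a minimal sketch and not the paper's implementation. It assumes a Laplace-CDF mapping from signed distance to volume density, as used in VolSDF-style models; the abstract does not state which mapping is used. The analytic `sphere_sdf` stands in for a learned network, and the function names and the parameters `alpha` and `beta` are hypothetical. The point is that an SDF defines the surface explicitly as its zero level set, so the surface can be located by a sign change, whereas a plain density field requires choosing an ad hoc threshold.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Signed distance from point p to an origin-centered sphere (negative inside).

    Stand-in for a learned SDF network; purely illustrative.
    """
    return math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2) - radius


def sdf_to_density(d, alpha=10.0, beta=0.1):
    """Map a signed distance d to a volume density (assumed VolSDF-style mapping).

    density = alpha * Psi_beta(-d), where Psi_beta is the CDF of a zero-mean
    Laplace distribution with scale beta. Density is high inside the surface
    and decays smoothly outside, so volume rendering remains possible.
    """
    s = -d
    if s <= 0:
        psi = 0.5 * math.exp(s / beta)
    else:
        psi = 1.0 - 0.5 * math.exp(-s / beta)
    return alpha * psi


def find_surface(origin, direction, sdf, t_max=4.0, steps=64):
    """Return the ray parameter t of the first outside-to-inside zero crossing.

    Marches along the ray in uniform steps and linearly interpolates between
    the two samples that bracket the sign change. Returns None if no crossing
    is found within t_max.
    """
    dt = t_max / steps
    t_prev = 0.0
    d_prev = sdf(origin)
    for i in range(1, steps + 1):
        t = i * dt
        p = tuple(o + t * v for o, v in zip(origin, direction))
        d = sdf(p)
        if d_prev > 0 and d <= 0:
            # Linear interpolation of the zero crossing between t_prev and t.
            return t_prev + dt * d_prev / (d_prev - d)
        t_prev, d_prev = t, d
    return None


# A ray starting outside the unit sphere and pointing at its center.
t_hit = find_surface((0.0, 0.0, -3.0), (0.0, 0.0, 1.0), sphere_sdf)
print("surface hit at t =", t_hit)
print("density inside / on / outside:",
      sdf_to_density(-1.0), sdf_to_density(0.0), sdf_to_density(1.0))
```

In a full pipeline, the same zero level set would typically be extracted over a 3D grid with an algorithm such as marching cubes to obtain a triangle mesh; the per-ray search here only demonstrates the principle.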