LLM as an Art Director (LaDi): Using LLMs to improve Text-to-Media Generators
The paper addresses the challenge of giving creators more artistic control over media generation, though the contribution appears incremental: it applies existing LLM methods to a known bottleneck.
The paper tackles the problem that inadequate text prompts limit the artistic coherence and subject relevance of text-to-image and text-to-video generators. It uses Large Language Models as Art Directors (LaDi) to enhance generation through techniques such as constrained decoding and intelligent prompting, and it reports deployment in apps by Plai Labs.
Recent advancements in text-to-image generation have revolutionized numerous fields, including art and cinema, by automating the generation of high-quality, context-aware images and video. However, the utility of these technologies is often limited by the inadequacy of text prompts in guiding the generator to produce artistically coherent and subject-relevant images. In this paper, we describe techniques that make Large Language Models (LLMs) act as Art Directors that enhance image and video generation, and we present our unified system for this, called "LaDi". We explore how LaDi integrates multiple techniques for augmenting the capabilities of text-to-image generators (T2Is) and text-to-video generators (T2Vs), with a focus on constrained decoding, intelligent prompting, fine-tuning, and retrieval. LaDi and these techniques are being used today in apps and platforms developed by Plai Labs.
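To make the idea of constrained decoding concrete, the following is a minimal sketch of its simplest form: restricting the candidates for each art-direction slot to an allowed set before taking the highest-scoring one. It is not LaDi's implementation. The slot names, the vocabulary in `ALLOWED`, and the `toy_score` function (a stand-in for an LLM's score of an option given the subject) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical slot vocabulary; the abstract does not specify one.
ALLOWED: Dict[str, List[str]] = {
    "style": ["watercolor", "film noir", "pixel art"],
    "lighting": ["soft daylight", "hard rim light", "neon glow"],
    "framing": ["close-up", "wide shot", "overhead"],
}


def toy_score(subject: str, option: str) -> float:
    """Stand-in for an LLM's score of `option` given `subject`.

    Counts shared letters, purely so the sketch runs without a model.
    """
    return float(len(set(subject.lower()) & set(option.lower())))


def constrained_direct(
    subject: str,
    score: Callable[[str, str], float] = toy_score,
) -> Dict[str, str]:
    """For each slot, pick the best-scoring option from the allowed set only."""
    return {
        slot: max(options, key=lambda option: score(subject, option))
        for slot, options in ALLOWED.items()
    }


def to_prompt(subject: str, choices: Dict[str, str]) -> str:
    """Assemble the subject and the chosen directions into a generator prompt."""
    return ", ".join(
        [subject, choices["style"], choices["lighting"], choices["framing"]]
    )


if __name__ == "__main__":
    subject = "a lighthouse at dusk"
    print(to_prompt(subject, constrained_direct(subject)))
```

Because every choice is drawn from `ALLOWED`, the resulting prompt cannot contain a direction the downstream generator was not prepared for, whatever the scoring model prefers. Swapping `toy_score` for real model scores leaves the constraint logic unchanged.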