Multimodal Markup Document Models for Graphic Design Completion
This work addresses design automation for graphic designers and developers. It offers a versatile foundation but is incremental, as it builds on existing multimodal and language model approaches.
The paper tackles the automation of graphic design tasks by introducing MarkupDM, a multimodal markup document model that represents designs as interleaved documents of markup language and images, enabling unified completion of attributes, images, and text. In instruction-guided design completion, it compares favorably to state-of-the-art image editing models, especially in textual completion.
We introduce MarkupDM, a multimodal markup document model that represents graphic design as an interleaved multimodal document consisting of both markup language and images. Unlike existing holistic approaches that rely on an element-by-attribute grid representation, our representation accommodates variable-length elements, type-dependent attributes, and text content. Inspired by fill-in-the-middle training in code generation, we train the model to complete the missing part of a design document from its surrounding context, allowing it to treat various design tasks in a unified manner. Our model also supports image generation by predicting discrete image tokens through a specialized tokenizer with support for image transparency. We evaluate MarkupDM on three tasks (attribute value, image, and text completion) and demonstrate that it can produce plausible designs consistent with the given context. To further illustrate the flexibility of our approach, we evaluate it on a new instruction-guided design completion task, where our instruction-tuned MarkupDM compares favorably to state-of-the-art image editing models, especially in textual completion. These findings suggest that multimodal language models with our document representation can serve as a versatile foundation for broad design automation.
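The fill-in-the-middle idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a character-level span and three sentinel strings (`<|prefix|>`, `<|suffix|>`, `<|middle|>`) that are hypothetical and not taken from the paper. The SVG-like markup is only an illustrative stand-in for a design document, and image tokens are left out entirely.

```python
import random

# Hypothetical sentinel strings; the paper's actual special tokens may differ.
PREFIX, SUFFIX, MIDDLE = "<|prefix|>", "<|suffix|>", "<|middle|>"


def make_fim_example(document: str, start: int, end: int) -> tuple[str, str]:
    """Hide document[start:end] and return (model_input, target).

    The input presents the prefix and suffix first, so a left-to-right
    model can predict the hidden middle from context on both sides.
    """
    if not 0 <= start <= end <= len(document):
        raise ValueError("span out of range")
    prefix, middle, suffix = document[:start], document[start:end], document[end:]
    model_input = f"{PREFIX}{prefix}{SUFFIX}{suffix}{MIDDLE}"
    return model_input, middle


def random_fim_example(document: str, rng: random.Random) -> tuple[str, str]:
    """Pick a random span to hide (an assumed sampling scheme)."""
    start = rng.randint(0, len(document))
    end = rng.randint(start, len(document))
    return make_fim_example(document, start, end)


def reassemble(model_input: str, completion: str) -> str:
    """Splice a completion back between the prefix and the suffix.

    Assumes the document itself never contains the sentinel strings.
    """
    body = model_input[len(PREFIX):-len(MIDDLE)]
    prefix, suffix = body.split(SUFFIX, 1)
    return prefix + completion + suffix


# Illustrative document: the text content "Sale" is the part to complete.
doc = '<svg><text x="10" y="20">Sale</text></svg>'
span_start = doc.index("Sale")
model_input, target = make_fim_example(doc, span_start, span_start + len("Sale"))
```

Under this framing, hiding an attribute value, a text span, or an image reference only changes which span is masked, which is one way to read the claim that the tasks are handled in a unified manner.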