CV | Mar 31, 2023

Towards Flexible Multi-modal Document Models

Naoto Inoue, Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, Kota Yamaguchi

arXiv:2303.18248v1 | 19.036 citations | h-index: 28 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for holistic models in creative workflows for designers, though it appears incremental as it builds on existing multi-modal and pre-training techniques.

The authors tackle the problem of jointly solving multiple design tasks for graphical documents by developing FlexDM, a unified model that predicts masked fields such as element type and styling attributes and achieves performance competitive with task-specific baselines.

Creative workflows for generating graphical documents involve complex inter-related tasks, such as aligning elements, choosing appropriate fonts, or employing aesthetically harmonious colors. In this work, we attempt to build a holistic model that can jointly solve many different design tasks. Our model, which we denote by FlexDM, treats vector graphic documents as a set of multi-modal elements, and learns to predict masked fields such as element type, position, styling attributes, image, or text, using a unified architecture. Through the use of explicit multi-task learning and in-domain pre-training, our model can better capture the multi-modal relationships among the different document fields. Experimental results corroborate that our single FlexDM is able to successfully solve a multitude of different design tasks, while achieving performance that is competitive with task-specific and costly baselines.
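
The masked-field formulation in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the field names, the `<MASK>` token, the `mask_fields` helper, and the sample document are all hypothetical, chosen only to show how one document representation could yield different design tasks depending on which fields are hidden.

```python
import random

MASK = "<MASK>"  # hypothetical placeholder for a hidden field


def mask_fields(document, fields_to_mask, prob=1.0, seed=0):
    """Return a masked copy of `document` plus the hidden ground-truth values.

    document: list of dicts, one dict per element (assumed representation).
    fields_to_mask: names of the fields eligible for masking.
    prob: probability of masking each eligible field.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for index, element in enumerate(document):
        new_element = dict(element)  # shallow copy; the input is left untouched
        for field in fields_to_mask:
            if field in element and rng.random() < prob:
                targets[(index, field)] = element[field]
                new_element[field] = MASK
        masked.append(new_element)
    return masked, targets


# Illustrative document: elements may carry different sets of fields.
document = [
    {"type": "text", "position": (0.1, 0.1, 0.8, 0.2),
     "font": "Lato", "color": "#222222", "text": "Summer Sale"},
    {"type": "image", "position": (0.0, 0.3, 1.0, 0.7)},
]

# A layout-style task: hide every position and let a model fill them in.
layout_input, layout_targets = mask_fields(document, ["position"])

# A styling-style task: hide fonts and colors instead.
style_input, style_targets = mask_fields(document, ["font", "color"])
```

Under these assumptions, a single model trained to recover `targets` from the masked input would see each task as a different masking pattern over the same element set, which is the intuition behind solving several design tasks with one architecture.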

