CL CV | Apr 16, 2024

Autoregressive Pre-Training on Pixels and Texts

Yekun Chai, Qingyi Liu, Jingwu Xiao, Shuohuan Wang, Yu Sun, Hua Wu

arXiv:2404.10710v315.227 citations | h-index: 15 | Has Code | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses multimodal language modeling, offering an incremental advance by pre-training a single autoregressive model on both document images and text.

The paper studies how visual and textual inputs interact when a language model is pre-trained on both document images and texts. It reports two findings: adding textual data significantly improves the performance of pixel-based language models, and a unidirectional pixel-based model trained solely on visual data achieves results comparable to state-of-the-art bidirectional models on several language understanding tasks.

The integration of visual and textual information represents a promising direction in the advancement of language models. In this paper, we explore the dual modality of language, both visual and textual, within an autoregressive framework, pre-trained on both document images and texts. Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head. We focus on understanding the interaction between these two modalities and their combined impact on model performance. Our extensive evaluation across a wide range of benchmarks shows that incorporating both visual and textual data significantly improves the performance of pixel-based language models. Remarkably, we find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks. This work uncovers the untapped potential of integrating visual and textual modalities for more effective language modeling. We release our code, data, and model checkpoints at https://github.com/ernie-research/pixelgpt.
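
The two training objectives named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the loss functions, so mean squared error for the regression head and cross-entropy for the classification head are assumptions, and the function names `next_patch_regression_loss` and `next_token_classification_loss` are hypothetical. In both cases the head's output at position t is scored against the input at position t + 1, which is what makes the objective autoregressive.

```python
import math


def next_patch_regression_loss(pred_patches, patches):
    """Mean squared error for next patch prediction (assumed loss).

    pred_patches[t] is the regression head's output at position t and is
    scored against the flattened pixel patch at position t + 1.
    """
    pairs = list(zip(pred_patches[:-1], patches[1:]))
    sq_err = [(p - y) ** 2 for pred, tgt in pairs for p, y in zip(pred, tgt)]
    return sum(sq_err) / len(sq_err)


def next_token_classification_loss(logits, tokens):
    """Mean cross-entropy for next token prediction (assumed loss).

    logits[t] is the classification head's output at position t and is
    scored against the token id at position t + 1.
    """
    total = 0.0
    for row, target in zip(logits[:-1], tokens[1:]):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
    return total / (len(tokens) - 1)


# Toy inputs: 4 positions, 2-pixel patches, a 5-token vocabulary.
patch_loss = next_patch_regression_loss([[0.0, 0.0]] * 4, [[1.0, 1.0]] * 4)
token_loss = next_token_classification_loss([[0.0] * 5] * 4, [0, 1, 2, 3])
```

With all-zero logits the predicted distribution is uniform, so the token loss equals the log of the vocabulary size. The "and/or" in the abstract suggests the two losses may be used separately or together; how they are combined is not specified there.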
