CV · AI · Jun 10, 2024

TRINS: Towards Multimodal Language Models that Can Read

arXiv:2406.06730v1 · 7 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of improving the reading ability of multimodal AI for applications such as document analysis and accessibility. The advance is incremental because it builds on existing datasets and models.

Multimodal language models struggle to read text in images, largely because of limited training data. The paper addresses this by introducing TRINS, a text-rich image instruction dataset with 39,153 images and 102,437 questions, and LaRA, a model reported to outperform state-of-the-art methods on this dataset and other benchmarks.

Large multimodal language models have shown remarkable proficiency in understanding and editing images. However, a majority of these visually-tuned models struggle to comprehend the textual content embedded in images, primarily due to the limitation of training data. In this work, we introduce TRINS: a Text-Rich image INStruction dataset, with the objective of enhancing the reading ability of multimodal large language models. TRINS is built upon LAION using hybrid data annotation strategies that include machine-assisted and human-assisted annotation processes. It contains 39,153 text-rich images, captions, and 102,437 questions. Specifically, we show that annotations in TRINS contain significantly more words than those of related datasets, which poses new challenges. Furthermore, we introduce a simple and effective architecture, called Language-vision Reading Assistant (LaRA), which is good at understanding textual content within images. LaRA outperforms existing state-of-the-art multimodal large language models on the TRINS dataset, as well as on other classical benchmarks. Lastly, we conduct a comprehensive evaluation with TRINS on various text-rich image understanding and generation tasks, demonstrating its effectiveness.

Code Implementations: 1 repo