CV | CL | Dec 20, 2024

A High-Quality Text-Rich Image Instruction Tuning Dataset via Hybrid Instruction Generation

arXiv:2412.16364v1 | 22 citations | h-index: 13 | Has Code | COLING
Originality: Incremental advance
AI Analysis

This paper addresses poor multimodal alignment for text-rich images in large multimodal models. The advance is incremental because it builds on existing self-instruct and hybrid generation approaches.

Large multimodal models struggle with text-rich images because of inadequate training data. To address this, the authors created LLaVAR-2, a dataset of 424k instruction pairs generated through a hybrid method that combines human annotators with GPT-4o. Models fine-tuned on LLaVAR-2 reportedly improve noticeably over models trained on self-instruct data.

Large multimodal models still struggle with text-rich images because of inadequate training data. Self-Instruct provides an annotation-free way for generating instruction data, but its quality is poor, as multimodal alignment remains a hurdle even for the largest models. In this work, we propose LLaVAR-2, to enhance multimodal alignment for text-rich images through hybrid instruction generation between human annotators and large language models. Specifically, it involves detailed image captions from human annotators, followed by the use of these annotations in tailored text prompts for GPT-4o to curate a dataset. It also implements several mechanisms to filter out low-quality data, and the resulting dataset comprises 424k high-quality pairs of instructions. Empirical results show that models fine-tuned on this dataset exhibit impressive enhancements over those trained with self-instruct data.
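The hybrid generation flow described in the abstract (human caption, then a tailored text prompt for an LLM, then quality filtering) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the prompt template, the `Q: ... | A: ...` output format, the `keep` filter rule, and the `fake_llm` stand-in are hypothetical and are not taken from the LLaVAR-2 paper or its code. In practice the `llm` callable would wrap a real model such as GPT-4o.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InstructionPair:
    instruction: str
    response: str


def build_prompt(caption: str, n_pairs: int = 3) -> str:
    """Embed a human-written caption in a text prompt (hypothetical template)."""
    return (
        f"Image description written by a human annotator:\n{caption}\n\n"
        f"Write {n_pairs} question-answer pairs about the text in this image, "
        "one per line, formatted as 'Q: ... | A: ...'."
    )


def strip_prefix(text: str, prefix: str) -> str:
    """Remove a leading label such as 'Q:' if present, then trim whitespace."""
    text = text.strip()
    if text.startswith(prefix):
        text = text[len(prefix):]
    return text.strip()


def parse_pairs(raw: str) -> List[InstructionPair]:
    """Parse 'Q: ... | A: ...' lines; lines without a '|' are skipped."""
    pairs = []
    for line in raw.splitlines():
        if "|" not in line:
            continue
        question, answer = line.split("|", 1)
        pairs.append(
            InstructionPair(strip_prefix(question, "Q:"), strip_prefix(answer, "A:"))
        )
    return pairs


def keep(pair: InstructionPair, min_answer_chars: int = 2) -> bool:
    """Toy quality filter (assumption): drop empty questions and near-empty answers."""
    return bool(pair.instruction) and len(pair.response) >= min_answer_chars


def generate(caption: str, llm: Callable[[str], str], n_pairs: int = 3) -> List[InstructionPair]:
    """Caption -> prompt -> LLM output -> parsed and filtered instruction pairs."""
    raw = llm(build_prompt(caption, n_pairs))
    return [p for p in parse_pairs(raw) if keep(p)]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns one good pair, one bad pair, and noise."""
    return (
        "Q: What is the title on the poster? | A: Summer Jazz Night\n"
        "Q: What year is shown? | A:\n"
        "not a pair"
    )


pairs = generate("A concert poster titled 'Summer Jazz Night'.", fake_llm)
for p in pairs:
    print(p.instruction, "->", p.response)
```

Passing the model as a plain callable keeps the sketch independent of any particular API client, so the filtering logic can be exercised without network access.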

Code Implementations: 1 repo
