CL · CV · Sep 20, 2023

KOSMOS-2.5: A Multimodal Literate Model

Microsoft
arXiv:2309.11419v2 · 103 citations · h-index: 102
Originality: Incremental advance
AI Analysis

This work addresses the challenge of document understanding for AI applications, presenting an incremental advancement in multimodal literate models.

The paper tackles the problem of automatically reading text-intensive images by introducing KOSMOS-2.5, a multimodal model that generates spatially-aware text blocks and structured markdown output. It reports performance comparable to GPT-4o on two new benchmarks, OCREval and MarkdownEval. Its fine-tuned 1.3B-parameter variant, KOSMOS-2.5-CHAT, performs comparably to 7B generalist models on text-rich visual question answering benchmarks.

The automatic reading of text-intensive images represents a significant advancement toward achieving Artificial General Intelligence (AGI). In this paper, we present KOSMOS-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on a large-scale corpus of text-intensive images, KOSMOS-2.5 excels in two distinct yet complementary transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned spatial coordinates within the image, and (2) producing structured text output that captures both style and structure in markdown format. This unified multimodal literate capability is achieved through a shared decoder-only autoregressive Transformer architecture and task-specific prompts. Building on this foundation, we fine-tune KOSMOS-2.5 for document understanding tasks, resulting in a document understanding generalist named KOSMOS-2.5-CHAT. Additionally, a large corpus of 357.4 million document pages spanning diverse domains was curated for pre-training. We evaluate KOSMOS-2.5 on two newly proposed benchmarks, OCREval and MarkdownEval, for document-level text recognition and image-to-markdown generation, demonstrating impressive literate capabilities comparable to GPT-4o. KOSMOS-2.5-CHAT achieves performance comparable to other state-of-the-art generalists that are five times larger (1.3B vs. 7B) across nine text-rich visual question answering benchmarks. Models and code are available at https://aka.ms/kosmos25.
