CL | AI | Sep 18, 2024

DocMamba: Efficient Document Pre-training with State Space Model

arXiv:2409.11887v2 | 4 citations | h-index: 19
Originality: Incremental advance
AI Analysis

This paper addresses the problem of processing long documents efficiently for document understanding tasks. It offers a novel method that is not an incremental refinement but a new paradigm.

The paper tackles the inefficiency of Transformer-based models for visually-rich document understanding, which stems from the quadratic computational complexity of self-attention. It proposes DocMamba, a state space model framework that reduces complexity to linear while achieving new state-of-the-art results on datasets such as FUNSD, CORD, and SROIE, with significant speed and memory improvements.

In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the self-attention mechanism's quadratic computational complexity hinders their efficiency and ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and reducing memory usage. Notably, experiments on HRDoc confirm DocMamba's potential for length extrapolation.
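The linear-versus-quadratic contrast in the abstract can be illustrated with a toy state space recurrence. The sketch below is a generic diagonal linear SSM, not DocMamba's implementation. The function names `ssm_scan` and `bidirectional_scan`, the fixed parameters `a`, `b`, `c`, and the choice to sum the forward and backward outputs are all assumptions made for illustration. Mamba-style models use input-dependent parameters, and the segment-first ordering of SFBS is not modeled here.

```python
def ssm_scan(xs, a, b, c):
    """Run a diagonal linear state space recurrence over a scalar sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state dimension)
    y_t = sum(c * h_t)

    Each token is visited once, so the cost grows linearly with len(xs),
    whereas self-attention compares every token with every other token.
    """
    h = [0.0] * len(a)  # hidden state, one entry per state dimension
    ys = []
    for x in xs:
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


def bidirectional_scan(xs, a, b, c):
    """Hypothetical bidirectional variant: scan left-to-right and
    right-to-left, then sum the two outputs position by position."""
    forward = ssm_scan(xs, a, b, c)
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    tokens = [1.0, 0.0, 0.0]
    print(ssm_scan(tokens, a=[0.5], b=[1.0], c=[1.0]))
    print(bidirectional_scan(tokens, a=[0.5], b=[1.0], c=[1.0]))
```

A causal scan only lets each position see what came before it. The backward pass is one common way to give every position context from both sides, which is the general motivation for bidirectional scanning.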

