CL AI Jan 27

Up to 36x Speedup: Mask-based Parallel Inference Paradigm for Key Information Extraction in MLLMs

Xinzhong Wang, Ya Guo, Jing Li, Huan Chen, Yi Tu, Yijie Hong, Gongshen Liu, Huijia Zhu

arXiv:2601.19613v1 0.6 h-index: 4

Originality: Highly original

AI Analysis

The work addresses a critical efficiency problem for real-world KIE applications and offers a scalable solution, though it is incremental in that it builds on existing MLLM frameworks.

The paper tackles the efficiency bottleneck of autoregressive inference in Key Information Extraction (KIE) from visually-rich documents by introducing PIP, a parallel inference paradigm that uses mask tokens for simultaneous generation, achieving a 5-36x speedup with minimal performance loss.

Key Information Extraction (KIE) from visually-rich documents (VrDs) is a critical task, for which recent Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have demonstrated strong potential. However, their reliance on autoregressive inference, which generates outputs sequentially, creates a significant efficiency bottleneck, especially as KIE tasks often involve extracting multiple, semantically independent fields. To overcome this limitation, we introduce PIP: a Parallel Inference Paradigm for KIE. Our approach reformulates the problem by using "[mask]" tokens as placeholders for all target values, enabling their simultaneous generation in a single forward pass. To facilitate this paradigm, we develop a tailored mask pre-training strategy and construct large-scale supervised datasets. Experimental results show that our PIP-models achieve a 5-36x inference speedup with negligible performance degradation compared to traditional autoregressive base models. By substantially improving efficiency while maintaining high accuracy, PIP paves the way for scalable and practical real-world KIE solutions.
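The mask-placeholder idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the fixed `slots_per_field` budget, and the stand-in predictor `toy_predictor` are all assumptions made for illustration, and the abstract does not say how PIP handles variable-length values. The sketch only shows why filling every "[mask]" position in one call needs a single forward pass, whereas autoregressive decoding needs one pass per generated token.

```python
MASK = "[mask]"


def build_masked_template(fields, slots_per_field):
    """Build a flat token list: each field name followed by mask placeholders.

    slots_per_field is a hypothetical fixed budget of value tokens per field.
    """
    tokens = []
    for name in fields:
        tokens.append(name)
        tokens.append(":")
        tokens.extend([MASK] * slots_per_field)
        tokens.append(";")
    return tokens


def fill_masks_in_one_pass(tokens, predict_tokens):
    """Fill every mask position using a single call to the predictor.

    predict_tokens stands in for one model forward pass: it receives the full
    token list and the mask positions, and returns one token per position.
    """
    mask_positions = [i for i, tok in enumerate(tokens) if tok == MASK]
    predictions = predict_tokens(tokens, mask_positions)
    filled = list(tokens)
    for position, token in zip(mask_positions, predictions):
        filled[position] = token
    return filled


def forward_passes(num_target_tokens, parallel):
    """Count model calls: one in total if parallel, else one per target token."""
    return 1 if parallel else num_target_tokens


def toy_predictor(tokens, mask_positions):
    """Hypothetical stand-in for a model: labels each slot with its index."""
    return [f"tok{i}" for i, _ in enumerate(mask_positions)]


if __name__ == "__main__":
    template = build_masked_template(["date", "total"], slots_per_field=2)
    print(template)
    print(fill_masks_in_one_pass(template, toy_predictor))
    print(forward_passes(4, parallel=False), "passes vs", forward_passes(4, parallel=True))
```

In this toy setting the ratio of forward passes grows with the number of target tokens. The abstract's reported 5-36x range is a measured inference speedup and should not be read as following directly from this count.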
