CL · AI · Sep 26, 2025

Extract-0: A Specialized Language Model for Document Information Extraction

arXiv:2509.22906v1 · 1 citation

Originality: Highly original

AI Analysis

This work addresses efficient and accurate document information extraction for users who need to process diverse documents. It represents a strong, specific gain rather than a broad paradigm shift.

The paper tackles document information extraction by developing Extract-0, a specialized 7-billion parameter language model that achieves a mean reward of 0.573 on a benchmark, outperforming larger models like GPT-4.1.

This paper presents Extract-0, a 7-billion parameter language model specifically optimized for document information extraction that achieves performance exceeding models with parameter counts several orders of magnitude larger. Through a novel combination of synthetic data generation, supervised fine-tuning with Low-Rank Adaptation (LoRA), and reinforcement learning via Group Relative Policy Optimization (GRPO), Extract-0 achieves a mean reward of 0.573 on a benchmark of 1,000 diverse document extraction tasks, outperforming GPT-4.1 (0.457), o3 (0.464), and GPT-4.1-2025 (0.459). The training methodology employs a memory-preserving synthetic data generation pipeline that produces 280,128 training examples from diverse document sources, followed by parameter-efficient fine-tuning that modifies only 0.53% of model weights (40.4M out of 7.66B parameters). The reinforcement learning phase introduces a novel semantic similarity-based reward function that handles the inherent ambiguity in information extraction tasks. This research demonstrates that task-specific optimization can yield models that surpass general-purpose systems while requiring substantially fewer computational resources.
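Two quantities in the abstract can be illustrated with a short Python sketch: the share of weights touched by LoRA, and the group-relative advantage at the core of GRPO. This is a minimal illustration, not the paper's implementation. The function names, the sample rewards, the use of the population standard deviation, and the `eps` stabilizer are all assumptions made for the example.

```python
from statistics import mean, pstdev


def trainable_fraction(trainable_params: float, total_params: float) -> float:
    """Share of model weights updated during parameter-efficient fine-tuning."""
    return trainable_params / total_params


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    GRPO scores several completions for the same prompt and uses the
    group's mean and standard deviation as the baseline, instead of a
    learned value function. Population std and eps are assumptions here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Figures quoted in the abstract: 40.4M trainable out of 7.66B total.
    print(f"{trainable_fraction(40.4e6, 7.66e9):.2%}")

    # Hypothetical rewards for four sampled extractions of one document.
    print(group_relative_advantages([0.31, 0.57, 0.62, 0.48]))
```

Completions that score above their group's mean receive a positive advantage and are reinforced; those below the mean are discouraged. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.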
