Spatial ModernBERT: Spatial-Aware Transformer for Table and Key-Value Extraction in Financial Documents at Scale
This work addresses the need for automated table and key-value extraction in financial documents for business workflows like auditing and invoice processing, but it is incremental as it builds on existing transformer and spatial embedding methods.
The paper tackled the problem of extracting tables and key-value pairs from financial documents by introducing Spatial ModernBERT, a transformer-based model with spatial embeddings, which achieved robust performance in accurately detecting and extracting tabular data and key-value fields.
Extracting tables and key-value pairs from financial documents is essential for business workflows such as auditing, data analytics, and automated invoice processing. In this work, we introduce Spatial ModernBERT-a transformer-based model augmented with spatial embeddings-to accurately detect and extract tabular data and key-value fields from complex financial documents. We cast the extraction task as token classification across three heads: (1) Label Head, classifying each token as a label (e.g., PO Number, PO Date, Item Description, Quantity, Base Cost, MRP, etc.); (2) Column Head, predicting column indices; (3) Row Head, distinguishing the start of item rows and header rows. The model is pretrained on the PubTables-1M dataset, then fine-tuned on a financial document dataset, achieving robust performance through cross-entropy loss on each classification head. We propose a post-processing method to merge tokens using B-I-IB tagging, reconstruct the tabular layout, and extract key-value pairs. Empirical evaluation shows that Spatial ModernBERT effectively leverages both textual and spatial cues, facilitating highly accurate table and key-value extraction in real-world financial documents.