Neurosymbolic Information Extraction from Transactional Documents
This work addresses the extraction of structured information from transactional documents in domains such as finance and logistics. It appears incremental, since it builds on existing neurosymbolic methods.
The paper tackles information extraction from transactional documents by introducing a neurosymbolic framework that combines language models with symbolic validation. It reports significant improvements in $F_1$-scores and accuracy.
This paper presents a neurosymbolic framework for information extraction from documents, evaluated on transactional documents. We introduce a schema-based approach that integrates symbolic validation methods to enable more effective zero-shot output and knowledge distillation. The methodology uses language models to generate candidate extractions, which are then filtered through syntactic-, task-, and domain-level validation to ensure adherence to domain-specific arithmetic constraints. Our contributions include a comprehensive schema for transactional documents, relabeled datasets, and an approach for generating high-quality labels for knowledge distillation. Experimental results demonstrate significant improvements in $F_1$-scores and accuracy, highlighting the effectiveness of neurosymbolic validation in transactional document processing.
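As an illustration of the three validation levels described above, the following is a minimal sketch in Python, not the paper's implementation. It assumes that candidate extractions arrive as JSON strings and that the schema is a simple invoice-like record; the field names (`line_items`, `amount`, `subtotal`, `tax`, `total`), the helper names, and the rounding tolerance are all hypothetical choices made for this example.

```python
import json
from decimal import Decimal, InvalidOperation

# Hypothetical schema fields, chosen for illustration only.
REQUIRED_FIELDS = {"line_items", "subtotal", "tax", "total"}


def to_amount(value):
    """Convert a raw value to a finite Decimal, or return None if it is not one."""
    try:
        amount = Decimal(str(value))
    except InvalidOperation:
        return None
    return amount if amount.is_finite() else None


def parse_candidate(raw):
    """Syntactic level: the candidate must be valid JSON encoding an object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def satisfies_schema(data):
    """Task level: required fields are present and every amount is numeric."""
    if not REQUIRED_FIELDS.issubset(data):
        return False
    if not isinstance(data["line_items"], list):
        return False
    for item in data["line_items"]:
        if not isinstance(item, dict) or to_amount(item.get("amount")) is None:
            return False
    return all(to_amount(data[key]) is not None for key in ("subtotal", "tax", "total"))


def satisfies_arithmetic(data, tolerance=Decimal("0.01")):
    """Domain level: line items sum to the subtotal, and subtotal + tax = total."""
    items_sum = sum((to_amount(item["amount"]) for item in data["line_items"]), Decimal("0"))
    subtotal = to_amount(data["subtotal"])
    tax = to_amount(data["tax"])
    total = to_amount(data["total"])
    return (abs(items_sum - subtotal) <= tolerance
            and abs(subtotal + tax - total) <= tolerance)


def filter_candidates(raw_candidates):
    """Keep only the candidates that pass all three validation levels, in order."""
    valid = []
    for raw in raw_candidates:
        data = parse_candidate(raw)
        if data is not None and satisfies_schema(data) and satisfies_arithmetic(data):
            valid.append(data)
    return valid


if __name__ == "__main__":
    candidates = [
        '{"line_items": [{"amount": "10.00"}, {"amount": "5.50"}], '
        '"subtotal": "15.50", "tax": "1.55", "total": "17.05"}',
        '{"line_items": [{"amount": "10.00"}], "subtotal": "10.00", '
        '"tax": "1.00", "total": "12.00"}',  # violates subtotal + tax = total
        "not json at all",                    # fails the syntactic check
    ]
    print(filter_candidates(candidates))
```

Under these assumptions, only the first candidate passes all three checks. Candidates that survive such filtering could then serve as labels for knowledge distillation, which is the use the abstract describes for validated outputs.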