DeepCPCFG: Deep Learning and Context Free Grammars for End-to-End Information Extraction
DeepCPCFG addresses the challenge of automating information extraction from complex documents such as invoices, reducing the need for costly manual annotations.
The paper tackles the problem of extracting structured information from business documents without detailed annotations by proposing DeepCPCFG, an end-to-end system that uses deep learning and context-free grammars, achieving state-of-the-art results on scanned invoices.
We address the challenge of extracting structured information from business documents without detailed annotations. We propose Deep Conditional Probabilistic Context Free Grammars (DeepCPCFG) to parse complex two-dimensional documents, and we use Recursive Neural Networks to create an end-to-end system for finding the most probable parse that represents the structured information to be extracted. This system is trained end-to-end with scanned documents as input and only relational-records as labels. The relational-records are extracted from existing databases, avoiding the cost of annotating documents by hand. We apply this approach to extract information from scanned invoices, achieving state-of-the-art results despite using no hand-annotations.
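To make the idea of "finding the most probable parse" concrete, the following is a minimal sketch of Viterbi-style CKY parsing over a probabilistic context-free grammar. It is not the paper's method: it assumes a one-dimensional token sequence rather than a two-dimensional page, and it replaces the learned neural scoring with hand-set probabilities. The grammar, the symbol names (`Invoice`, `Header`, `Total`, `Label`, `Amount`, `Word`), and the functions `lexical_log_prob` and `best_parse` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy grammar in Chomsky normal form: (parent, (left, right)) -> log-probability.
# The values are made up; in a learned system they would come from a model.
BINARY_RULES = {
    ("Invoice", ("Header", "Total")): math.log(1.0),
    ("Header", ("Word", "Word")): math.log(1.0),
    ("Total", ("Label", "Amount")): math.log(1.0),
}
LEXICAL_SYMBOLS = ("Word", "Label", "Amount")


def lexical_log_prob(symbol, token):
    """Hand-written stand-in for a learned token scorer; returns None if impossible."""
    is_number = token.replace(".", "", 1).isdigit()
    if symbol == "Amount":
        return math.log(0.9) if is_number else None
    if symbol == "Label":
        return math.log(0.8) if token.lower() == "total" else None
    if symbol == "Word":
        return None if is_number else math.log(0.5)
    return None


def best_parse(tokens, root="Invoice"):
    """Return (log_prob, tree) for the most probable parse rooted at `root`, or None."""
    n = len(tokens)
    if n == 0:
        return None
    chart = {}  # (start, end) -> {symbol: (log_prob, tree)}
    # Width-1 spans: score each token against each lexical symbol.
    for i, tok in enumerate(tokens):
        cell = {}
        for sym in LEXICAL_SYMBOLS:
            lp = lexical_log_prob(sym, tok)
            if lp is not None:
                cell[sym] = (lp, (sym, tok))
        chart[(i, i + 1)] = cell
    # Wider spans: try every split point and rule, keeping the best score per symbol.
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            end = start + length
            cell = {}
            for split in range(start + 1, end):
                left, right = chart[(start, split)], chart[(split, end)]
                for (parent, (lsym, rsym)), rule_lp in BINARY_RULES.items():
                    if lsym in left and rsym in right:
                        lp = rule_lp + left[lsym][0] + right[rsym][0]
                        if parent not in cell or lp > cell[parent][0]:
                            cell[parent] = (lp, (parent, left[lsym][1], right[rsym][1]))
            chart[(start, end)] = cell
    return chart[(0, n)].get(root)
```

Calling `best_parse(["ACME", "Corp", "Total", "42.50"])` returns a log-probability together with a nested tuple tree in which `"42.50"` sits under the `Amount` symbol. Reading the labeled leaves off such a tree is one way to obtain a record-like output, which is the role the most probable parse plays in the abstract's description.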