Design and Implementation of an OCR-Powered Pipeline for Table Extraction from Invoices
This work addresses the need for efficient data extraction from noisy, non-standard invoices in financial and archival domains. Its contribution is incremental, since it builds on existing OCR methods.
The paper tackles the problem of extracting structured tabular data from scanned invoices by designing an OCR-powered pipeline, which the authors report improves extraction accuracy and consistency for automated financial workflows.
This paper presents the design and development of an OCR-powered pipeline for efficient table extraction from invoices. The system leverages Tesseract OCR for text recognition and custom post-processing logic to detect, align, and extract structured tabular data from scanned invoice documents. Our approach includes dynamic preprocessing, table boundary detection, and row-column mapping, optimized for noisy and non-standard invoice formats. The resulting pipeline significantly improves data extraction accuracy and consistency, supporting real-world use cases such as automated financial workflows and digital archiving.
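The row-column mapping step can be illustrated with a minimal sketch. It is not the authors' implementation: the Word container, the function names, the y_tol tolerance, and the assumption that the topmost text row is the table header are all hypothetical choices made for illustration. The input is a list of word boxes with left, top, width, and height in pixels, which is the kind of per-word geometry an OCR engine such as Tesseract can report. Words are clustered into rows by vertical center, column boundaries are placed midway between header word centers, and each body word is assigned to a column by its horizontal center.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word box (hypothetical container, not taken from the paper)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def group_rows(words, y_tol=8.0):
    """Cluster words into rows whose vertical centers lie within y_tol pixels."""
    rows = []  # each entry: [running mean of vertical centers, list of words]
    for w in sorted(words, key=lambda w: w.top + w.height / 2):
        cy = w.top + w.height / 2
        if rows and abs(cy - rows[-1][0]) <= y_tol:
            rows[-1][1].append(w)
            # Update the running mean so slightly skewed scans stay in one row.
            rows[-1][0] += (cy - rows[-1][0]) / len(rows[-1][1])
        else:
            rows.append([cy, [w]])
    # Within each row, order words left to right.
    return [sorted(ws, key=lambda w: w.left) for _, ws in rows]


def column_boundaries(header):
    """Place a boundary midway between the centers of adjacent header words."""
    centers = [w.left + w.width / 2 for w in header]
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


def extract_table(words, y_tol=8.0):
    """Map word boxes to a list of {column name: cell text} dictionaries.

    Assumption: the topmost row is the header and each header cell is one word.
    """
    rows = group_rows(words, y_tol)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    names = [w.text for w in header]
    bounds = column_boundaries(header)
    table = []
    for row in body:
        cells = [[] for _ in names]
        for w in row:
            # The word's horizontal center decides which column it falls into.
            col = bisect_right(bounds, w.left + w.width / 2)
            cells[col].append(w.text)
        table.append({n: " ".join(c) for n, c in zip(names, cells)})
    return table


# Synthetic boxes standing in for OCR output on a small invoice table.
sample = [
    Word("Item", 10, 100, 40, 20), Word("Qty", 200, 101, 30, 20),
    Word("Price", 300, 99, 50, 20),
    Word("Blue", 10, 130, 40, 20), Word("widget", 55, 132, 60, 20),
    Word("2", 210, 131, 10, 20), Word("9.50", 305, 129, 40, 20),
]
for record in extract_table(sample):
    print(record)
```

Using word centers rather than left edges keeps right-aligned numeric columns in the correct cell as long as they stay under their header. Real invoices with multi-word headers or merged cells would need a more careful boundary detection step than this sketch provides.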