CV, CL | Jun 6, 2024

ReceiptSense: Beyond Traditional OCR -- A Dataset for Receipt Understanding

arXiv:2406.04493v2 | 2 citations | Has Code
AI Analysis

This dataset targets automated multilingual document processing for researchers and practitioners. Its contribution is incremental, since it focuses on a single domain (receipts).

The authors address multilingual OCR and information extraction from receipts, particularly for complex scripts like Arabic. They introduce a dataset of 20,000 annotated receipts, 30,000 OCR-annotated images, and 10,000 item-level annotations, and establish baseline performance with traditional and neural methods.

Multilingual OCR and information extraction from receipts remains challenging, particularly for complex scripts like Arabic. We introduce ReceiptSense, a comprehensive dataset designed for Arabic-English receipt understanding. It comprises 20,000 annotated receipts from diverse retail settings, 30,000 OCR-annotated images, and 10,000 item-level annotations, plus a new Receipt QA subset of 1,265 receipt images, each paired with 40 question-answer pairs, to support LLM evaluation for receipt understanding. The dataset captures merchant names, item descriptions, prices, receipt numbers, and dates to support object detection, OCR, and information extraction tasks. We establish baseline performance using traditional methods (Tesseract OCR) and advanced neural networks, demonstrating the dataset's effectiveness for processing complex, noisy real-world receipt layouts. Our publicly accessible dataset advances automated multilingual document processing research (see https://github.com/Update-For-Integrated-Business-AI/CORU).
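
To make the annotated fields concrete, here is a minimal Python sketch of how one receipt record might be represented and scored. The fields (merchant name, item descriptions, prices, receipt number, date, question-answer pairs) come from the abstract. The class names, attribute names, and the `field_exact_match` helper are hypothetical and are not the dataset's actual schema or the authors' evaluation protocol.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReceiptItem:
    description: str  # item description as printed (Arabic or English)
    price: float


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class ReceiptRecord:
    # Hypothetical layout; the real annotation format may differ.
    merchant_name: str
    receipt_number: Optional[str] = None
    date: Optional[str] = None  # kept as raw text, format not assumed
    items: List[ReceiptItem] = field(default_factory=list)
    qa_pairs: List[QAPair] = field(default_factory=list)


def field_exact_match(pred: ReceiptRecord, gold: ReceiptRecord) -> float:
    """Fraction of header fields that match exactly after trimming whitespace."""
    names = ["merchant_name", "receipt_number", "date"]
    hits = 0
    for name in names:
        p = (getattr(pred, name) or "").strip()
        g = (getattr(gold, name) or "").strip()
        if p == g:
            hits += 1
    return hits / len(names)


gold = ReceiptRecord(
    merchant_name="مخبز الريف",
    receipt_number="000123",
    date="2024-06-06",
    items=[ReceiptItem(description="خبز", price=2.5)],
)
pred = ReceiptRecord(
    merchant_name=" مخبز الريف ",
    receipt_number="000128",  # simulated OCR error in the last digit
    date="2024-06-06",
)
score = field_exact_match(pred, gold)
```

Exact match is only one possible scoring choice. It is used here because it needs no assumptions about text normalization beyond trimming whitespace.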

Code Implementations: 1 repo
