CV | CL | Jun 6, 2024

ReceiptSense: Beyond Traditional OCR -- A Dataset for Receipt Understanding

Abdelrahman Abdallah, Mohamed Mounis, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Ibrahim Abdelhalim, Mohamed Elkasaby, Yasser ElBendary, Adam Jatowt

arXiv:2406.04493v23.72 citations | h-index: 31 | Has Code

Originality: Synthesis-oriented

AI Analysis

This dataset targets automated multilingual document processing for researchers and practitioners, though its contribution is incremental because it focuses on a single domain (receipts).

The authors tackled the challenge of multilingual OCR and information extraction from receipts, particularly for complex scripts like Arabic, by introducing a comprehensive dataset of 20,000 annotated receipts, 30,000 OCR-annotated images, and 10,000 item-level annotations, establishing baseline performance with traditional and neural methods.

Multilingual OCR and information extraction from receipts remains challenging, particularly for complex scripts like Arabic. We introduce ReceiptSense, a comprehensive dataset designed for Arabic-English receipt understanding comprising 20,000 annotated receipts from diverse retail settings, 30,000 OCR-annotated images, and 10,000 item-level annotations, and a new Receipt QA subset with 1,265 receipt images paired with 40 question-answer pairs each to support LLM evaluation for receipt understanding. The dataset captures merchant names, item descriptions, prices, receipt numbers, and dates to support object detection, OCR, and information extraction tasks. We establish baseline performance using traditional methods (Tesseract OCR) and advanced neural networks, demonstrating the dataset's effectiveness for processing complex, noisy real-world receipt layouts. Our publicly accessible dataset advances automated multilingual document processing research (see https://github.com/Update-For-Integrated-Business-AI/CORU).
