Joint Extraction Matters: Prompt-Based Visual Question Answering for Multi-Field Document Information Extraction
This paper addresses document information extraction and shows practical improvements for multi-field extraction, though the contribution appears incremental because it builds on existing VQA methods.
This paper tackles the problem of extracting multiple information fields from document images using visual question answering, finding that joint extraction of dependent fields improves accuracy compared to isolated queries, with specific gains observed for numeric and contextual dependencies.
Visual question answering (VQA) has emerged as a flexible approach for extracting specific pieces of information from document images. However, existing work typically queries each field in isolation, overlooking potential dependencies among fields. This paper investigates the merits of extracting multiple fields jointly versus separately. Through experiments on multiple large vision-language models and datasets, we show that jointly extracting fields often improves accuracy, especially when the fields share strong numeric or contextual dependencies. We further analyze how performance scales with the number of requested items and use a regression-based metric to quantify inter-field relationships. Our results suggest that multi-field prompts can mitigate confusion arising from similar surface forms and related numeric values, providing practical methods for designing robust VQA systems in document information extraction tasks.
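To make the contrast between isolated and joint querying concrete, the following is a minimal sketch of how the two prompt styles might be built for a vision-language model. The prompt wording, the JSON answer format, the example field names, and the helper names (`separate_prompts`, `joint_prompt`, `parse_joint_answer`) are assumptions made for illustration; they are not the prompts or code used in the paper.

```python
import json


def separate_prompts(fields):
    """Build one question per field; each would be sent as its own query."""
    return [
        f'What is the value of "{name}" in this document? '
        "Answer with the value only."
        for name in fields
    ]


def joint_prompt(fields):
    """Build a single question that requests all fields in one JSON object."""
    keys = ", ".join(f'"{name}"' for name in fields)
    return (
        "Extract the following fields from this document and answer with a "
        f"single JSON object whose keys are exactly: {keys}."
    )


def parse_joint_answer(answer, fields):
    """Parse the model's reply; unparseable or missing fields map to None."""
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {name: data.get(name) for name in fields}


# Hypothetical numerically dependent fields (subtotal + tax = total).
fields = ["subtotal", "tax", "total"]
isolated = separate_prompts(fields)  # three independent queries
joint = joint_prompt(fields)         # one query covering all three fields
```

In the joint setting the model sees all requested fields at once, which is the condition under which the abstract reports fewer confusions between similar-looking or numerically related values.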