HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images
The work addresses a specific limitation of Vision Question Answering in real-world scenarios, but it appears incremental: it builds on existing vision-text models and contributes mainly a new dataset and task.
The paper tackles the problem of vision-text models struggling with human annotations on text-heavy images by proposing the HAUR task, introducing the HAUR-5 dataset with five annotation types, and developing the OCR-Mix model, which outperforms other models in cross-model comparisons.
Vision Question Answering (VQA) tasks, in which images convey information critical to answering text-based questions, are among the most common forms of question answering in real-world scenarios. Numerous vision-text models exist today and perform well on certain VQA tasks. However, these models exhibit significant limitations in understanding human annotations on text-heavy images. To address this, we propose the Human Annotation Understanding and Recognition (HAUR) task. As part of this effort, we introduce the Human Annotation Understanding and Recognition-5 (HAUR-5) dataset, which encompasses five common types of human annotations. Additionally, we developed and trained our own model, OCR-Mix. Comprehensive cross-model comparisons show that OCR-Mix outperforms other models on this task. Our dataset and model will be released soon.