CV · Dec 24, 2024

HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images

Yuchen Yang, Haoran Yan, Yanhao Chen, Qingqiang Wu, Qingqi Hong

arXiv:2412.18327v1 2.0 h-index: 1

Originality Synthesis-oriented

AI Analysis

The paper addresses a specific limitation of Vision Question Answering in real-world scenarios, but the contribution appears incremental: it builds on existing vision-text models and adds a new task and dataset.

The paper tackles the problem of vision-text models struggling with human annotations on text-heavy images by proposing the HAUR task, introducing the HAUR-5 dataset with five annotation types, and developing the OCR-Mix model, which outperforms other models in cross-model comparisons.

Vision Question Answering (VQA) tasks use images to convey the critical information needed to answer text-based questions, and they are among the most common forms of question answering in real-world scenarios. Numerous vision-text models exist today and perform well on certain VQA tasks. However, these models exhibit significant limitations in understanding human annotations on text-heavy images. To address this, we propose the Human Annotation Understanding and Recognition (HAUR) task. As part of this effort, we introduce the Human Annotation Understanding and Recognition-5 (HAUR-5) dataset, which encompasses five common types of human annotations. Additionally, we developed and trained our model, OCR-Mix. Through comprehensive cross-model comparisons, our results demonstrate that OCR-Mix outperforms other models on this task. Our dataset and model will be released soon.
