CV | AI | Nov 7, 2024

TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models

Amazon
arXiv:2411.04642v1 | 1 citation | h-index: 13 | 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Incremental advance
AI Analysis

This paper addresses text recognition in images for vision-language tasks, offering an incremental improvement to existing OCR-based strategies.

The paper tackles the challenge of handling text within images in Vision-Language (VL) models. It introduces TAP-VL, a method that treats OCR information as a distinct modality and integrates it into any VL model, and it reports consistent performance improvements across scene-text and document-based benchmarks.

Vision-Language (VL) models have garnered considerable research interest; however, they still face challenges in effectively handling text within images. To address this limitation, researchers have developed two approaches. The first method involves utilizing external Optical Character Recognition (OCR) tools to extract textual information from images, which is then prepended to other textual inputs. The second strategy focuses on employing extremely high-resolution images to improve text recognition capabilities. In this paper, we focus on enhancing the first strategy by introducing a novel method, named TAP-VL, which treats OCR information as a distinct modality and seamlessly integrates it into any VL model. TAP-VL employs a lightweight transformer-based OCR module to receive OCR with layout information, compressing it into a short fixed-length sequence for input into the LLM. Initially, we conduct model-agnostic pretraining of the OCR module on unlabeled documents, followed by its integration into any VL architecture through brief fine-tuning. Extensive experiments demonstrate consistent performance improvements when applying TAP-VL to top-performing VL models, across scene-text and document-based VL benchmarks.
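The abstract says the OCR module compresses OCR tokens with layout information into a short fixed-length sequence, but it does not say how. One common way to get a fixed-length output from a variable-length input is to cross-attend a fixed set of query vectors over the input. The sketch below illustrates that idea in plain Python. It is a toy under stated assumptions, not the TAP-VL implementation: the function names (`embed_ocr_token`, `compress`), the hash-seeded word embedding, the additive bounding-box term, the random (untrained) queries, and the single attention head are all hypothetical choices made for illustration.

```python
import math
import random


def embed_ocr_token(word, box, dim):
    """Toy embedding of one OCR token with its layout.

    `box` is (x0, y0, x1, y1), normalized to [0, 1]. The word-seeded random
    vector and the additive layout term are illustrative assumptions only.
    """
    rng = random.Random(word)  # deterministic pseudo-embedding per word
    text_vec = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    # Cycle the four box coordinates across the embedding dimensions.
    layout_vec = [box[i % 4] for i in range(dim)]
    return [t + l for t, l in zip(text_vec, layout_vec)]


def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress(ocr_items, num_queries=4, dim=8, seed=0):
    """Cross-attend `num_queries` query vectors over the OCR tokens.

    Returns `num_queries` vectors of size `dim`, however many OCR tokens
    are given. The queries are random stand-ins for learned parameters.
    """
    rng = random.Random(seed)
    queries = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
               for _ in range(num_queries)]
    keys = [embed_ocr_token(word, box, dim) for word, box in ocr_items]
    if not keys:  # no text detected: return zero vectors of the same shape
        return [[0.0] * dim for _ in range(num_queries)]

    output = []
    for q in queries:
        # Scaled dot-product attention; the keys double as the values.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        output.append([sum(w * k[d] for w, k in zip(weights, keys))
                       for d in range(dim)])
    return output


# One scene-text token versus a 50-token "document": same output length.
short = [("EXIT", (0.1, 0.1, 0.3, 0.2))]
long = [("word%d" % i, (0.0, i / 50, 1.0, (i + 1) / 50)) for i in range(50)]
print(len(compress(short)), len(compress(long)))  # prints "4 4"
```

The output length depends only on `num_queries`, so the downstream LLM sees the same number of OCR-derived inputs for a street sign as for a dense page. In a trained system the queries and embeddings would be learned; here they are fixed random values so the example runs without any dependencies.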

