CV LG
Jan 18, 2023

Towards Models that Can See and Read

Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Kittenplon, Shai Mazor, Ron Litman

Amazon

arXiv:2301.07389v214.117 citations
h-index: 14

Originality: Incremental advance

AI Analysis

The paper addresses the limited ability of multimodal models to integrate scene-text in vision-language tasks. It offers a unified solution that improves performance, though the advance is incremental because it builds on existing architectures.

The paper tackles the problem that existing vision-language methods handle either the general or the scene-text versions of visual question answering (VQA) and image captioning (CAP), but not both. It proposes UniTNT, a unified approach that lets a single model handle both task types. The authors also report that scene-text understanding improves performance on general VQA and CAP by up to 2.69% and 0.6 CIDEr, respectively.

Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite their obvious resemblance, the two are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on general VQA and CAP by up to 2.69% and 0.6 CIDEr, respectively.
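The idea of treating scene-text as an additional modality fused into a pretrained model can be illustrated with a toy sketch. The abstract does not describe the designated fusion modules, so everything below is an assumption made for illustration: the function names (`cross_attend`, `fuse_scene_text`), the fixed `gate` value, and the choice of a residual cross-attention are hypothetical, and learned projections are omitted entirely. It is not UniTNT's implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention; keys_values serve as both keys and values."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[i] for w, v in zip(weights, keys_values))
            for i in range(dim)
        ])
    return out


def fuse_scene_text(visual_tokens, ocr_tokens, gate=0.5):
    """Hypothetical residual fusion of OCR-token features into visual features.

    Each visual token attends over the OCR tokens, and the attended result is
    added back with a fixed gate (an assumption; a real module would learn it).
    With no OCR tokens, the visual features pass through unchanged.
    """
    if not ocr_tokens:
        return [list(v) for v in visual_tokens]
    attended = cross_attend(visual_tokens, ocr_tokens)
    return [
        [v + gate * a for v, a in zip(vis, att)]
        for vis, att in zip(visual_tokens, attended)
    ]
```

The residual form is chosen so that an image without detected text leaves the pretrained model's features untouched, which matches the goal of adding reading ability without losing general visual understanding.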
