CV
May 11, 2022

TextMatcher: Cross-Attentional Neural Network to Compare Image and Text

arXiv:2205.05507v2
1 citation
h-index: 2
Originality: Incremental advance
AI Analysis

The paper addresses a multimodal-learning problem with applications such as bank cheque processing, but the advance is incremental: it adapts existing cross-attention mechanisms to a specific task.

The paper tackles text matching, that is, assessing whether an image containing text corresponds to a candidate transcription. It reports that, on the IAM dataset, TextMatcher achieves higher performance than the baselines while running faster at inference time.

We study a novel multimodal-learning problem, which we call text matching: given an image containing a single-line text and a candidate text transcription, the goal is to assess whether the text represented in the image corresponds to the candidate text. We devise the first machine-learning model specifically designed for this problem. The proposed model, termed TextMatcher, compares the two inputs by applying a cross-attention mechanism over the embedding representations of image and text, and it is trained in an end-to-end fashion. We extensively evaluate the empirical performance of TextMatcher on the popular IAM dataset. Results attest that, compared to a baseline and existing models designed for related problems, TextMatcher achieves higher performance on a variety of configurations, while at the same time running faster at inference time. We also showcase TextMatcher in a real-world application scenario concerning the automatic processing of bank cheques.
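The cross-attention comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the function names (`cross_attention`, `match_probability`), using text embeddings as queries and image embeddings as keys and values, the mean pooling, the linear scoring vector `w`, and the random vectors standing in for learned embeddings.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_emb, image_emb):
    """Scaled dot-product cross-attention (assumed direction: text attends to image).

    text_emb:  T x d matrix, one row per text token (queries).
    image_emb: W x d matrix, one row per image position (keys and values).
    Returns a T x d context matrix and the T x W attention weights.
    """
    d = len(text_emb[0])
    scores = matmul(text_emb, transpose(image_emb))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    context = matmul(weights, image_emb)
    return context, weights


def match_probability(text_emb, image_emb, w, b=0.0):
    """Hypothetical matching head: mean-pool the context, then a logistic score."""
    context, _ = cross_attention(text_emb, image_emb)
    pooled = [sum(row[j] for row in context) / len(context) for j in range(len(w))]
    logit = sum(p * wj for p, wj in zip(pooled, w)) + b
    return 1.0 / (1.0 + math.exp(-logit))


# Random stand-ins for learned embeddings: 5 text tokens, 12 image positions, d = 8.
rng = random.Random(0)
d = 8
text_emb = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(5)]
image_emb = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(12)]
w = [rng.gauss(0, 1) for _ in range(d)]

context, weights = cross_attention(text_emb, image_emb)
prob = match_probability(text_emb, image_emb, w)
```

In an end-to-end setup, the embeddings and the scoring parameters would be learned jointly from matching and non-matching image-text pairs; here they are random, so `prob` carries no meaning beyond showing the data flow.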

