CV | Nov 19, 2019

KISS: Keeping It Simple for Scene Text Recognition

arXiv:1911.08400v1 | 18 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses scene text recognition for computer vision applications by offering a simpler, more generalizable approach, though it is incremental in its use of existing components.

The paper tackles scene text recognition by proposing a model (KISS) that uses only off-the-shelf neural network building blocks, achieving state-of-the-art or competitive performance on benchmarks without specialized methods like 2D-attention or image rectification.

Over the past few years, several new methods for scene text recognition have been proposed. Most of these methods propose novel building blocks for neural networks. These novel building blocks are specially tailored for the task of scene text recognition and can thus hardly be used in any other tasks. In this paper, we introduce a new model for scene text recognition that only consists of off-the-shelf building blocks for neural networks. Our model (KISS) consists of two ResNet based feature extractors, a spatial transformer, and a transformer. We train our model only on publicly available, synthetic training data and evaluate it on a range of scene text recognition benchmarks, where we reach state-of-the-art or competitive performance, although our model does not use methods like 2D-attention, or image rectification.
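The abstract names a spatial transformer as one of the off-the-shelf components but does not describe how it is wired into the model. As a rough illustration of the generic mechanism only, the sketch below builds an affine sampling grid and reads pixels from it with bilinear interpolation, which is how a spatial transformer typically crops a region from an image. The function names (`affine_grid`, `bilinear_sample`, `crop_region`) and the plain-list image format are assumptions made for this example, not the KISS implementation.

```python
import math


def affine_grid(theta, out_h, out_w):
    """Map each output pixel to normalized source coordinates in [-1, 1].

    theta is a 2x3 affine matrix [[a, b, tx], [c, d, ty]] (hypothetical layout).
    """
    grid = []
    for i in range(out_h):
        # Normalized y of the output row, spanning [-1, 1].
        y = -1.0 + 2.0 * i / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            x = -1.0 + 2.0 * j / (out_w - 1) if out_w > 1 else 0.0
            src_x = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            src_y = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((src_x, src_y))
        grid.append(row)
    return grid


def pixel(image, x, y):
    """Return image[y][x], or 0.0 for positions outside the image."""
    h, w = len(image), len(image[0])
    return image[y][x] if 0 <= x < w and 0 <= y < h else 0.0


def bilinear_sample(image, src_x, src_y):
    """Read one value at normalized coordinates using bilinear interpolation."""
    h, w = len(image), len(image[0])
    # Convert normalized coordinates to pixel coordinates (corners aligned).
    px = (src_x + 1.0) * (w - 1) / 2.0
    py = (src_y + 1.0) * (h - 1) / 2.0
    x0, y0 = math.floor(px), math.floor(py)
    fx, fy = px - x0, py - y0
    top = (1 - fx) * pixel(image, x0, y0) + fx * pixel(image, x0 + 1, y0)
    bottom = (1 - fx) * pixel(image, x0, y0 + 1) + fx * pixel(image, x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom


def crop_region(image, theta, out_h, out_w):
    """Sample an out_h x out_w region of the image described by theta."""
    grid = affine_grid(theta, out_h, out_w)
    return [[bilinear_sample(image, sx, sy) for sx, sy in row] for row in grid]
```

In a learned model the affine parameters would be predicted by a network rather than written by hand. Because the sampling is differentiable with respect to those parameters, the cropping step can be trained end to end with the rest of the model.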

Code Implementations: 1 repo
