CV Oct 13, 2022

Scene Text Image Super-Resolution via Content Perceptual Loss and Criss-Cross Transformer Blocks

arXiv:2210.06924v18.814 citations h-index: 72

Originality: Incremental advance

AI Analysis

This work addresses the enhancement of low-resolution text images in natural scenes, which matters for applications such as scene text recognition. The contribution is incremental, since it builds on prior content-based methods.

The paper tackles text image super-resolution to improve readability for both humans and scene text recognizers, and reports state-of-the-art results in recognition accuracy and human perception across datasets in several languages.

Text image super-resolution is a unique and important task that enhances the readability of text images for humans. It is widely used as pre-processing in scene text recognition. However, due to the complex degradation in natural scenes, recovering high-resolution text from low-resolution inputs is ambiguous and challenging. Existing methods mainly leverage deep neural networks trained with pixel-wise losses designed for natural image reconstruction, which ignore the distinctive characteristics of text. A few works have proposed content-based losses. However, they focus only on text recognizers' accuracy, while the reconstructed images may still be ambiguous to humans. Further, they often generalize poorly across languages. To this end, we present TATSR, a Text-Aware Text Super-Resolution framework, which effectively learns the unique text characteristics using Criss-Cross Transformer Blocks (CCTBs) and a novel Content Perceptual (CP) Loss. The CCTB extracts vertical and horizontal content information from text images using two orthogonal transformers, respectively. The CP Loss supervises the text reconstruction with content semantics through multi-scale text recognition features, which effectively incorporates content awareness into the framework. Extensive experiments on various language datasets demonstrate that TATSR outperforms state-of-the-art methods in terms of both recognition accuracy and human perception.
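
The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and the abstract does not specify the details, so several choices below are assumptions: the attention uses a single head with no learned projections, the horizontal and vertical branches are combined by a residual sum, and the content loss is a mean absolute difference summed over scales. The function names (`axis_attention`, `criss_cross_block`, `content_perceptual_loss`) are hypothetical, and the feature lists stand in for multi-scale features that would come from a pretrained text recognizer.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def axis_attention(feat, axis):
    """Single-head self-attention along one spatial axis of an (H, W, C) map.

    axis=1 attends within each row (horizontal branch);
    axis=0 attends within each column (vertical branch).
    Assumption: queries, keys and values are the features themselves.
    """
    x = feat if axis == 1 else feat.transpose(1, 0, 2)        # (N, L, C)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])  # (N, L, L)
    out = softmax(scores, axis=-1) @ x                        # (N, L, C)
    return out if axis == 1 else out.transpose(1, 0, 2)


def criss_cross_block(feat):
    # Assumption: the two orthogonal branches are merged by a residual sum.
    return feat + axis_attention(feat, axis=1) + axis_attention(feat, axis=0)


def content_perceptual_loss(sr_feats, hr_feats):
    """Sum over scales of the mean absolute feature difference.

    sr_feats / hr_feats: lists of arrays, one per scale, assumed to be
    recognizer features of the super-resolved and ground-truth images.
    """
    return sum(float(np.mean(np.abs(s - h))) for s, h in zip(sr_feats, hr_feats))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.standard_normal((16, 64, 32))   # (height, width, channels)
    print(criss_cross_block(feat).shape)
    print(content_perceptual_loss([feat], [feat + 0.5]))
```

Attending along rows and columns separately keeps each attention matrix small (width by width, or height by height) rather than forming one over all pixel pairs, which suits the elongated shape of text lines.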
