CL | Feb 28, 2023

Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition

arXiv:2302.14261v1 | 2 citations | h-index: 112
Originality: Incremental advance
AI Analysis

This work addresses the recognition of complex multilingual text in real-world scenes. The contribution is incremental: it adapts existing transformer methods to a specific domain.

The paper tackles multilingual scene text recognition by proposing an augmented transformer architecture with adaptive n-grams embedding and cross-language rectification. It reports state-of-the-art performance on benchmark datasets and on a new multilingual dataset collected from Indonesian tourism scenes.

While vision transformers have been highly successful in improving performance on image-based tasks, little work has been reported on applying transformers to multilingual scene text recognition, owing to the complexity of the visual appearance of multilingual texts. To fill this gap, this paper proposes an augmented transformer architecture with n-grams embedding and cross-language rectification (TANGER). TANGER consists of a primary transformer with single patch embeddings of visual images, and a supplementary transformer with adaptive n-grams embeddings that aims to flexibly explore the potential correlations between neighbouring visual patches, which is essential for feature extraction from multilingual scene texts. Cross-language rectification is achieved with a loss function that takes into account both language identification and contextual coherence scoring. Extensive comparative studies are conducted on four widely used benchmark datasets as well as a new multilingual scene text dataset containing Indonesian, English, and Chinese, collected from tourism scenes in Indonesia. Our experimental results demonstrate that TANGER considerably outperforms the state of the art, especially in handling complex multilingual scene texts.
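The abstract does not spell out how the adaptive n-grams embeddings are computed, so the following is only a minimal sketch of the general idea of combining neighbouring patches into n-gram tokens, not the authors' method. Everything in it is an assumption made for illustration: the function names (`patchify`, `ngram_tokens`, `adaptive_ngram_embedding`), the averaging of neighbours in raster order, the softmax gate over n-gram sizes, and the random projection standing in for learned weights.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) grayscale image into flattened, non-overlapping patches."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    image = image[:rows * patch, :cols * patch]  # drop any ragged border
    blocks = image.reshape(rows, patch, cols, patch).transpose(0, 2, 1, 3)
    return blocks.reshape(rows * cols, patch * patch)

def ngram_tokens(patches, n):
    """Average each patch with its next n-1 neighbours in raster order.

    Simplifications (assumed, not from the paper): neighbours wrap across
    patch rows, and the sequence is zero-padded at the end.
    """
    num, dim = patches.shape
    padded = np.vstack([patches, np.zeros((n - 1, dim))])
    windows = np.stack([padded[i:i + num] for i in range(n)], axis=0)
    return windows.mean(axis=0)

def adaptive_ngram_embedding(patches, gate_logits, proj):
    """Mix 1..N-gram tokens with softmax weights, then project linearly.

    In a trained model gate_logits and proj would be learned; here they
    are plain arrays passed in by the caller.
    """
    weights = np.exp(gate_logits - gate_logits.max())
    weights /= weights.sum()
    mixed = sum(w * ngram_tokens(patches, n + 1) for n, w in enumerate(weights))
    return mixed @ proj

rng = np.random.default_rng(0)
image = rng.random((32, 128))             # hypothetical text-line crop
patches = patchify(image, patch=8)        # 4 x 16 = 64 patches of 64 pixels
proj = rng.normal(size=(64, 16))          # stand-in for a learned projection
gate_logits = np.zeros(3)                 # equal weight on 1-, 2-, 3-grams
tokens = adaptive_ngram_embedding(patches, gate_logits, proj)
print(tokens.shape)                       # (64, 16)
```

With all the gate weight on the 1-gram branch, this reduces to an ordinary single-patch embedding, which is roughly the role the abstract assigns to the primary transformer's input; the supplementary branch differs only in letting each token see its neighbours.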
