CV, CL · Mar 31, 2023

Improving Scene Text Recognition for Character-Level Long-Tailed Distribution

Sunghyun Park, Sunghyo Chung, Jungsoo Lee, Jaegul Choo

arXiv:2304.08592v15.03 citations · h-index: 44

Originality: Incremental advance

AI Analysis

This work addresses a domain-specific issue for STR in non-English languages, offering an incremental improvement over existing methods.

The paper tackles the performance degradation of scene text recognition (STR) in languages with many characters, such as Chinese and Korean, which stems from long-tailed character distributions. It proposes CAFE-Net, which improves STR performance by ensembling a context-aware expert and a context-free expert.

Despite the recent remarkable improvements in scene text recognition (STR), the majority of studies have focused on the English language, which includes only a small number of characters. However, STR models show a large performance degradation on languages with a large number of characters (e.g., Chinese and Korean), especially on characters that rarely appear due to the long-tailed distribution of characters in such languages. To address this issue, we conducted an empirical analysis using synthetic datasets with different character-level distributions (e.g., balanced and long-tailed distributions). While adding a substantial number of tail-class characters without considering the context helps the model recognize characters individually, training with such a synthetic dataset interferes with the model's learning of contextual information (i.e., the relations among characters), which is also important for predicting the whole word. Based on this motivation, we propose a novel Context-Aware and Free Experts Network (CAFE-Net) using two experts: 1) a context-aware expert that learns the contextual representation and is trained with a long-tailed dataset composed of common words used in everyday life, and 2) a context-free expert that focuses on correctly predicting individual characters by utilizing a dataset with a balanced number of characters. By training the two experts to focus on learning contextual and visual representations, respectively, we propose a novel confidence ensemble method to compensate for the limitations of each expert. Through experiments, we demonstrate that CAFE-Net improves STR performance on languages containing a large number of characters. Moreover, we show that CAFE-Net is easily applicable to various STR models.
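
The abstract does not spell out how the confidence ensemble combines the two experts. As a rough illustration only, the sketch below assumes one plausible reading: at each character position, keep the prediction of whichever expert assigns the higher maximum softmax probability. The function names, the three-character charset, and the logits are all hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_ensemble(aware_logits, free_logits, charset):
    """Assumed rule (not confirmed by the abstract): per position,
    take the character from the expert with the higher top probability."""
    chars = []
    for a, f in zip(aware_logits, free_logits):
        pa, pf = softmax(a), softmax(f)
        best = pa if max(pa) >= max(pf) else pf
        chars.append(charset[best.index(max(best))])
    return "".join(chars)

# Hypothetical toy charset: two common characters and one rare one.
charset = ["가", "나", "힣"]
# Hypothetical per-position logits from each expert (2 positions, 3 classes).
aware = [[4.0, 0.5, 0.1], [1.0, 1.2, 0.9]]  # unsure at position 2
free = [[2.0, 1.0, 0.5], [0.2, 0.1, 3.5]]   # confident about the rare character
result = confidence_ensemble(aware, free, charset)
print(result)
```

In this toy input the context-aware expert is more confident at the first position and the context-free expert is more confident at the second, so the output mixes both experts' predictions. A word-level variant, which picks one expert for the whole word, would be an equally reasonable assumption.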
