Relational Contrastive Learning for Scene Text Recognition
This work addresses robustness issues in scene text recognition for computer vision applications and represents an incremental improvement over existing self-supervised methods.
The paper tackles over-fitting in self-supervised scene text recognition by enriching textual relations through rearrangement, hierarchy, and interaction. It proposes RCLSTR, which outperforms state-of-the-art self-supervised methods in representation quality.
Context-aware methods have achieved great success in supervised scene text recognition (STR) by incorporating semantic priors from words. We argue that such prior contextual information can be interpreted as the relations of textual primitives, owing to the heterogeneity of text and background, and that these relations can provide effective self-supervised labels for representation learning. However, because of lexical dependencies, textual relations are restricted by the finite size of the dataset, which causes over-fitting and compromises representation robustness. To this end, we propose to enrich the textual relations via rearrangement, hierarchy, and interaction, and design a unified framework called RCLSTR: Relational Contrastive Learning for Scene Text Recognition. Based on causality, we theoretically explain that the three modules suppress the bias caused by the contextual prior and thus guarantee representation robustness. Experiments on representation quality show that our method outperforms state-of-the-art self-supervised STR methods. Code is available at https://github.com/ThunderVVV/RCLSTR.
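
To make the idea of enriching textual relations through rearrangement more concrete, the following is a minimal sketch and not the RCLSTR implementation. It assumes that rearrangement means shuffling contiguous horizontal segments of a text image, and that the contrastive objective is the standard InfoNCE loss; the abstract specifies neither. The names `rearrange_segments`, `cosine`, and `info_nce` are hypothetical and chosen only for illustration.

```python
import math
import random


def rearrange_segments(columns, num_segments, rng):
    """Split a left-to-right sequence of image columns into contiguous
    segments and shuffle the segment order (assumed form of rearrangement)."""
    n = len(columns)
    # Segment boundaries, as evenly spaced as integer indices allow.
    bounds = [round(i * n / num_segments) for i in range(num_segments + 1)]
    segments = [columns[bounds[i]:bounds[i + 1]] for i in range(num_segments)]
    rng.shuffle(segments)
    return [col for seg in segments for col in seg]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss in which keys[i] is the positive for queries[i]
    and every other key serves as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


if __name__ == "__main__":
    rng = random.Random(0)
    columns = list(range(8))  # stand-in for the pixel columns of one image
    print(rearrange_segments(columns, num_segments=4, rng=rng))

    queries = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(queries, queries))  # small when positives are aligned
```

In this sketch, a rearranged image would be encoded alongside other views of the same text, and the loss would pull matching embeddings together while pushing the rest apart. How RCLSTR actually builds its pairs across the rearrangement, hierarchy, and interaction modules is defined in the paper and the linked repository.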