CV, Mar 29, 2023

RusTitW: Russian Language Text Dataset for Visual Text in-the-Wild Recognition

Igor Markov, Sergey Nesteruk, Andrey Kuznetsov, Denis Dimitrov

arXiv:2303.16531v12.82 citations, h-index: 9, Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses a data gap in Russian-language text recognition, enabling improved automated systems in that domain. It is incremental, however, since it extends existing dataset efforts to a new language.

The authors tackled the lack of training data for Russian text-in-the-wild recognition by presenting a large-scale human-labeled dataset, along with a synthetic dataset and generation code.

Information surrounds people in modern life. Text is a very efficient type of information that people have used for communication for centuries. However, automated text-in-the-wild recognition remains a challenging problem. The major limitation for a DL system is the lack of training data. For competitive performance, the training set must contain many samples that replicate real-world cases. While there are many high-quality datasets for English text recognition, there are no available datasets for the Russian language. In this paper, we present a large-scale human-labeled dataset for Russian text recognition in-the-wild. We also publish a synthetic dataset and code to reproduce the generation process.

