A Multitask Network for Localization and Recognition of Text in Images
This addresses the need for efficient text extraction from complex images, such as documents, without relying on lexicons, though it is incremental as it builds on existing multi-task and attention-based methods.
The paper tackles the problem of lexicon-free text extraction from complex documents by developing an end-to-end multi-task network that simultaneously handles text localization and recognition, achieving high performance in challenging non-traditional OCR benchmarks.
We present an end-to-end trainable multi-task network that addresses the problem of lexicon-free text extraction from complex documents. This network simultaneously solves the problems of text localization and text recognition and text segments are identified with no post-processing, cropping, or word grouping. A convolutional backbone and Feature Pyramid Network are combined to provide a shared representation that benefits each of three model heads: text localization, classification, and text recognition. To improve recognition accuracy, we describe a dynamic pooling mechanism that retains high-resolution information across all RoIs. For text recognition, we propose a convolutional mechanism with attention which out-performs more common recurrent architectures. Our model is evaluated against benchmark datasets and comparable methods and achieves high performance in challenging regimes of non-traditional OCR.