Bidirectional Scene Text Recognition with a Single Decoder
This work addresses a computational bottleneck in scene text recognition, which matters for applications like document analysis and autonomous systems, though the contribution is incremental because it builds on existing bidirectional approaches.
The paper tackles the inefficiency of using two separate decoders for bidirectional scene text recognition. It introduces Bi-STET, a single-decoder method that outperforms two-decoder bidirectional methods and matches or beats state-of-the-art results on all benchmarks.
Scene Text Recognition (STR) is the problem of recognizing the correct word or character sequence in a cropped word image. To obtain more robust output sequences, the notion of bidirectional STR has been introduced. So far, bidirectional STR methods have been implemented with two separate decoders: one for left-to-right decoding and one for right-to-left decoding. Having two separate decoders for almost the same task with the same output space is undesirable from a computational and optimization point of view. We introduce the bidirectional Scene Text Transformer (Bi-STET), a novel bidirectional STR method with a single decoder for bidirectional text decoding. With its single decoder, Bi-STET outperforms methods that apply bidirectional decoding with two separate decoders while also being more efficient than those methods. Furthermore, Bi-STET matches or beats state-of-the-art (SOTA) results on all STR benchmarks. Finally, we provide analyses and insights into the performance of Bi-STET.
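To make the idea of one decoder serving both directions concrete, the following is a minimal sketch of how training targets and outputs might be handled. It is an illustration under stated assumptions, not the Bi-STET implementation: the abstract does not say how the decoder is told which direction to produce, so the direction markers `<l2r>` and `<r2l>`, the `<eos>` token, and the helper names `make_targets`, `to_reading_order`, and `pick_output` are all hypothetical.

```python
# Hypothetical special tokens; none of these names come from the paper.
L2R, R2L, EOS = "<l2r>", "<r2l>", "<eos>"


def make_targets(word):
    """Build both target sequences that a single shared decoder could be trained on.

    Assumption: the decoding direction is signalled by a leading marker token.
    """
    chars = list(word)
    return {
        "l2r": [L2R] + chars + [EOS],
        "r2l": [R2L] + chars[::-1] + [EOS],
    }


def to_reading_order(tokens):
    """Strip the special tokens and restore left-to-right reading order."""
    direction = tokens[0]
    body = [t for t in tokens[1:] if t != EOS]
    if direction == R2L:
        body = body[::-1]
    return "".join(body)


def pick_output(candidates):
    """Keep the higher-scoring of the decoded sequences.

    candidates: list of (tokens, score) pairs, one per direction.
    Assumption: the score is something like a sequence log-probability.
    """
    tokens, _ = max(candidates, key=lambda c: c[1])
    return to_reading_order(tokens)


if __name__ == "__main__":
    targets = make_targets("cat")
    print(targets["l2r"])
    print(targets["r2l"])
    print(to_reading_order(targets["r2l"]))
```

Because both target sequences share one vocabulary and one set of decoder weights in this sketch, only the leading marker differs between directions, which is one way to avoid duplicating the decoder.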