CV Aug 15, 2017

Sequence-to-Label Script Identification for Multilingual OCR

Yasuhisa Fujii, Karel Driesen, Jonathan Baccash, Ash Hurst, Ashok C. Popat

arXiv:1708.04671v2 12.041 citations h-index: 17

Originality: Incremental advance

AI Analysis

This work addresses inefficiencies in script identification for OCR systems, offering improvements for processing multilingual documents, though it is incremental as it builds on existing encoder-summarizer frameworks.

The paper tackles the problem of line-level script identification for multilingual OCR by reframing it as a sequence-to-label task, resulting in a 16% reduction in script identification error rate and a 33% decrease in character error rate due to script misidentification.

We describe a novel line-level script identification method. Previous work repurposed an OCR model generating per-character script codes, counted to obtain line-level script identification. This has two shortcomings. First, as a sequence-to-sequence model it is more complex than necessary for the sequence-to-label problem of line script identification. This makes it harder to train and inefficient to run. Second, the counting heuristic may be suboptimal compared to a learned model. Therefore we reframe line script identification as a sequence-to-label problem and solve it using two components, trained end-to-end: Encoder and Summarizer. The encoder converts a line image into a feature sequence. The summarizer aggregates the sequence to classify the line. We test various summarizers with identical inception-style convolutional networks as encoders. Experiments on scanned books and photos containing 232 languages in 30 scripts show a 16% reduction of script identification error rate compared to the baseline. This improved script identification reduces the character error rate attributable to script misidentification by 33%.
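The contrast between the two approaches in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean-pooling summarizer, and the hand-set weights are all assumptions made for illustration. The abstract says only that various summarizers were tested, without naming them, and the inception-style encoder is omitted here, with short feature vectors standing in for its output.

```python
from collections import Counter
import math


def count_vote(per_char_scripts):
    """Baseline-style heuristic: majority vote over per-character script codes."""
    return Counter(per_char_scripts).most_common(1)[0][0]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool_summarizer(features, weights, bias, labels):
    """Sequence-to-label sketch: average the feature sequence over positions,
    then apply a linear classifier. Mean pooling is an assumed stand-in; in a
    trained system the weights would be learned, not set by hand."""
    dim = len(features[0])
    pooled = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    scores = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical inputs: three positions along a text line.
per_char = ["Latn", "Latn", "Cyrl"]             # per-character script codes
features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]  # stand-in encoder output
weights = [[1.0, 0.0], [0.0, 1.0]]               # hand-set, not learned
bias = [0.0, 0.0]
labels = ["Latn", "Cyrl"]

vote = count_vote(per_char)
label, probs = mean_pool_summarizer(features, weights, bias, labels)
```

The counting function needs a full per-character decode before it can vote, whereas the summarizer maps the feature sequence directly to a single line label, which is the simplification the abstract argues for.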
