Development of a New Image-to-text Conversion System for Pashto, Farsi and Traditional Chinese
This work addresses the need for accurate text extraction from digital images in languages with complex scripts, which benefits digital humanities and metadata retrieval. It appears incremental, however, because it applies existing deep learning techniques to new language data.
The researchers tackle image-to-text conversion for languages with cursive scripts, such as Farsi and Pashto, and for the non-cursive script of Traditional Chinese. Their system uses machine learning and deep learning methods and is aimed at document collections exceeding a billion pages.
We report on the results of a research and prototype-building project, \emph{Worldly~OCR}, dedicated to developing new, more accurate image-to-text conversion software for several languages and writing systems. These include the cursive scripts of Farsi and Pashto, as well as Latin cursive scripts. We also describe approaches geared towards Traditional Chinese, which is non-cursive but has an extremely large character set of 65,000 characters. Our methodology is based on Machine Learning, especially Deep Learning, and Data Science, and is directed towards vast quantities of original documents, exceeding a billion pages. This paper is intended for a general audience interested in Digital Humanities or in the retrieval of accurate full text and metadata from digital images.