From Press to Pixels: Evolving Urdu Text Recognition
This work addresses the digitization of historical Urdu newspapers for researchers and archivists. Its contribution is incremental, however, because it applies existing methods to a new domain-specific dataset.
This paper tackles Optical Character Recognition (OCR) for Urdu newspapers with an end-to-end pipeline that handles complex layouts and low-resolution scans. Gemini-2.5-Pro achieves the best Word Error Rate (WER) of 0.133, and fine-tuning on 500 samples improves WER by 6.13%.
This paper introduces an end-to-end pipeline for Optical Character Recognition (OCR) on Urdu newspapers, addressing challenges posed by complex multi-column layouts, low-resolution scans, and the stylistic variability of the Nastaliq script. Our system comprises four modules: (1) article segmentation, (2) image super-resolution, (3) column segmentation, and (4) text recognition. We fine-tune YOLOv11x for segmentation, achieving 0.963 precision for articles and 0.970 for columns. A SwinIR-based super-resolution model boosts LLM text recognition accuracy by 25-70%. We also introduce the Urdu Newspaper Benchmark (UNB), a manually annotated dataset for Urdu OCR. Using UNB and the OpenITI corpus, we compare traditional CNN+RNN-based OCR models with modern LLMs. Gemini-2.5-Pro achieves the best performance with a WER of 0.133. We further analyze LLM outputs via insertion, deletion, and substitution error breakdowns, as well as character-level confusion analysis. Finally, we show that fine-tuning on just 500 samples yields a 6.13% WER improvement, highlighting the adaptability of LLMs for Urdu OCR.
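For readers unfamiliar with the metric, the following is a minimal sketch of how a word-level WER and its insertion, deletion, and substitution breakdown can be computed with a standard edit-distance alignment. It is an illustration only, not the evaluation code used in the paper: the names `ErrorCounts` and `word_errors` are hypothetical, and the whitespace tokenization is an assumption that may differ from the normalization applied to Urdu text in the actual experiments.

```python
from typing import NamedTuple


class ErrorCounts(NamedTuple):
    """Edit operations needed to turn the reference into the hypothesis."""
    substitutions: int
    deletions: int
    insertions: int
    ref_len: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / N, where N is the number of reference words.
        if self.ref_len == 0:
            return 0.0
        total = self.substitutions + self.deletions + self.insertions
        return total / self.ref_len


def word_errors(reference: str, hypothesis: str) -> ErrorCounts:
    """Align two strings word by word (Levenshtein) and count S, D, I.

    Assumption: words are separated by whitespace. Real Urdu OCR
    evaluation may require additional normalization before this step.
    """
    ref = reference.split()
    hyp = hypothesis.split()

    # dp[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no extra cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitution = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            # Tuples compare by cost first, so min() picks a cheapest path.
            dp[i][j] = min(substitution, deletion, insertion)

    _, subs, dels, ins = dp[len(ref)][len(hyp)]
    return ErrorCounts(subs, dels, ins, len(ref))


if __name__ == "__main__":
    counts = word_errors("a b c d", "a x c")
    print(counts, counts.wer)
```

When several alignments share the same minimal cost, the split between substitutions, deletions, and insertions depends on the tie-breaking rule, so breakdowns from different tools may differ slightly even when the WER itself agrees.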