Exploration of Deep Learning Based Recognition for Urdu Text
This paper addresses the problem of recognizing Urdu text, which is challenging because of its cursive script. The approach is incremental, however, as it builds on existing deep learning methods.
The paper tackles Urdu optical character recognition (OCR) by proposing component-based classification with convolutional neural networks (CNNs), achieving 99% accuracy on component classification.
Urdu is written in a cursive script and shares similarities with Arabic and with many South Asian languages. Urdu text is difficult to classify because of its complex geometrical and morphological structure. Character-level classification is feasible when the segmentation technique is reliable, but because Urdu is context sensitive, segmentation-based recognition often yields a high error rate. Our proposed approach to Urdu optical character recognition is a component-based classification that relies on an automatic feature learning technique, the convolutional neural network (CNN). The CNN is trained and tested on an Urdu text dataset generated by permuting three characters; unnecessary images are then discarded by applying a connected component technique so that only ligatures remain. A hierarchical neural network with two levels is implemented to handle the three degrees of character permutation and the component classification. Our model achieved 99% accuracy for component classification.
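The dataset construction described above (permuting up to three characters, then filtering by connected components) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `character_sequences` and `count_components`, the choice to allow repeated characters, the use of 8-connectivity, and the toy binary grids standing in for rendered text images are all assumptions made for illustration.

```python
from itertools import product


def character_sequences(alphabet, max_len=3):
    """Yield every ordered sequence of 1..max_len characters.

    Assumption: repetition is allowed; the abstract does not say
    whether the permutation step permits repeated characters.
    """
    for n in range(1, max_len + 1):
        for combo in product(alphabet, repeat=n):
            yield "".join(combo)


def count_components(image):
    """Count 8-connected foreground regions in a binary image.

    `image` is a list of rows containing 0 (background) or 1 (ink).
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] and not seen[r][c]:
                # New region found: flood-fill it with an explicit stack.
                count += 1
                seen[r][c] = True
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and image[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
    return count


# Toy usage: three Urdu letters, and two hand-made "rendered" images.
sequences = list(character_sequences(["ب", "ا", "ر"]))
joined = [[1, 1, 0],
          [0, 1, 1]]
broken = [[1, 1, 0, 0],
          [0, 0, 0, 1],
          [0, 0, 0, 1]]
print(len(sequences), count_components(joined), count_components(broken))
```

In a real pipeline each sequence would first be rendered to an image with an Urdu font, and the component count would decide whether the image is kept. The exact keep/discard rule is not stated in the abstract; note that dots and other diacritics form separate components, so a simple "exactly one component" rule would likely be too strict.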