CVNov 6, 2025
Seeing Straight: Document Orientation Detection for Efficient OCRSuranjan Goswami, Abhinav Ravi, Raja Kolla et al.
Despite significant advances in document understanding, determining the correct orientation of scanned or photographed documents remains a critical pre-processing step in the real world settings. Accurate rotation correction is essential for enhancing the performance of downstream tasks such as Optical Character Recognition (OCR) where misalignment commonly arises due to user errors, particularly incorrect base orientations of the camera during capture. In this study, we first introduce OCR-Rotation-Bench (ORB), a new benchmark for evaluating OCR robustness to image rotations, comprising (i) ORB-En, built from rotation-transformed structured and free-form English OCR datasets, and (ii) ORB-Indic, a novel multilingual set spanning 11 Indic mid to low-resource languages. We also present a fast, robust and lightweight rotation classification pipeline built on the vision encoder of Phi-3.5-Vision model with dynamic image cropping, fine-tuned specifically for 4-class rotation task in a standalone fashion. Our method achieves near-perfect 96% and 92% accuracy on identifying the rotations respectively on both the datasets. Beyond classification, we demonstrate the critical role of our module in boosting OCR performance: closed-source (up to 14%) and open-weights models (up to 4x) in the simulated real-world setting.
CVNov 6, 2025
IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMsAli Faraz, Akash, Shaharukh Khan et al.
Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks, including Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), covering 6 kinds of question types. Our final benchmark consists of a total of ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weights medium and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.
CVJul 13, 2021
A Novel Deep Learning Method for Thermal to Annotated Thermal-Optical Fused ImagesSuranjan Goswami, Satish Kumar Singh, and Bidyut B. Chaudhuri
Thermal Images profile the passive radiation of objects and capture them in grayscale images. Such images have a very different distribution of data compared to optical colored images. We present here a work that produces a grayscale thermo-optical fused mask given a thermal input. This is a deep learning based pioneering work since to the best of our knowledge, there exists no other work on thermal-optical grayscale fusion. Our method is also unique in the sense that the deep learning method we are proposing here works on the Discrete Wavelet Transform (DWT) domain instead of the gray level domain. As a part of this work, we also present a new and unique database for obtaining the region of interest in thermal images based on an existing thermal visual paired database, containing the Region of Interest on 5 different classes of data. Finally, we are proposing a simple low cost overhead statistical measure for identifying the region of interest in the fused images, which we call as the Region of Fusion (RoF). Experiments on the database show encouraging results in identifying the region of interest in the fused images. We also show that they can be processed better in the mixed form rather than with only thermal images.
IVJan 18, 2021
A Simple Mutual Information based Registration Method for Thermal-Optical Image Pairs applied on a Novel DatasetSuranjan Goswami, Satish Kumar Singh
While thermal optical registered datasets are becoming widely available, most of these works are based on image pairs which are pre-registered. However, thermal imagers where these images are registered by default are quite expensive. We present in this work, a thermal image registration technique which is computationally lightweight, and can be employed regardless of the resolution of the images captured. We use 2 different thermal imagers to create a completely new database and introduce it as a part of this work as well. The images captured are based on 5 different classes and encompass subjects like the Prayagraj Kumbh Mela, one of the largest public fairs in the world, captured over a period of 2 years.