43.2CVApr 20
Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen ModelsYakoub Bazi, Mohamad M. Al Rahhal, Mansour Zuair et al.
Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3-VL, which follows a structured vision-language pipeline with multi-depth visual conditioning and a full-attention decoder, with Qwen3.5, a native multimodal model that combines a single-stage alignment with a hybrid decoder backbone. Experimental results on the official CDVQA test splits show that recent VLMs improve over earlier specialized baselines. They further show that performance does not scale monotonically with model size, and that native multimodal models are more effective than structured vision-language pipelines for this task. These findings indicate that tightly integrated multimodal backbones contribute more to performance than scale or explicit multi-depth visual conditioning for language-driven semantic change reasoning in RS imagery.
CVJan 16, 2018
ConvSRC: SmartPhone based Periocular Recognition using Deep Convolutional Neural Network and Sparsity Augmented Collaborative RepresentationAmani Alahmadi, Muhammad Hussain, Hatim Aboalsamh et al.
Smartphone based periocular recognition has gained significant attention from biometric research community because of the limitations of biometric modalities like face, iris etc. Most of the existing methods for periocular recognition employ hand-crafted features. Recently, learning based image representation techniques like deep Convolutional Neural Network (CNN) have shown outstanding performance in many visual recognition tasks. CNN needs a huge volume of data for its learning, but for periocular recognition only limited amount of data is available. The solution is to use CNN pre-trained on the dataset from the related domain, in this case the challenge is to extract efficiently the discriminative features. Using a pertained CNN model (VGG-Net), we propose a simple, efficient and compact image representation technique that takes into account the wealth of information and sparsity existing in the activations of the convolutional layers and employs principle component analysis. For recognition, we use an efficient and robust Sparse Augmented Collaborative Representation based Classification (SA-CRC) technique. For thorough evaluation of ConvSRC (the proposed system), experiments were carried out on the VISOB challenging database which was presented for periocular recognition competition in ICIP2016. The obtained results show the superiority of ConvSRC over the state-of-the-art methods; it obtains a GMR of more than 99% at FMR = 10-3 and outperforms the first winner of ICIP2016 challenge by 10%.