CVNov 4, 2025
SCALE-VLP: Soft-Weighted Contrastive Volumetric Vision-Language Pre-training with Spatial-Knowledge SemanticsAilar Mahdizadeh, Puria Azadi Moghadam, Xiangteng He et al.
Vision-language models (VLMs) have demonstrated strong cross-modal capabilities, yet most work remains limited to 2D data and assumes binary supervision (i.e., positive vs. negative pairs), overlooking the continuous and structured dependencies present in volumetric data such as CT. Existing approaches often treat volumetric scans as independent 2D slices, compromising spatial coherence and underutilizing rich clinical semantics. We propose SCALE-VLP, a soft-weighted contrastive vision-language pre-training framework that integrates (i) volumetric spatial semantics to preserve anatomical structure and (ii) domain-aware, knowledge-infused semantics (e.g., radiological ontologies) to guide alignment. This yields structurally consistent and semantically grounded representations under limited supervision, demonstrating strong cross-task transferability (retrieval, report generation, and classification), and cross-domain generalizability with consistent gains without further fine-tuning. In particular, compared to the previous state of the art, SCALE-VLP achieves up to 4.3x higher top-1 CT-report retrieval, improves abnormality classification by 10 points, and reaches ROUGE-L 0.44 and BERT-F1 0.89 for report generation. Further, in zero-shot evaluation on an out-of-domain external dataset, we observe consistent gains, indicating the cross-task and cross-domain generalization ability of SCALE-VLP.
CVJan 5
Shallow- and Deep-fake Image Manipulation Localization Using Vision Mamba and Guided Graph Neural NetworkJunbin Zhang, Hamid Reza Tohidypour, Yixiao Wang et al.
Image manipulation localization is a critical research task, given that forged images may have a significant societal impact of various aspects. Such image manipulations can be produced using traditional image editing tools (known as "shallowfakes") or advanced artificial intelligence techniques ("deepfakes"). While numerous studies have focused on image manipulation localization on either shallowfake images or deepfake videos, few approaches address both cases. In this paper, we explore the feasibility of using a deep learning network to localize manipulations in both shallow- and deep-fake images, and proposed a solution for such purpose. To precisely differentiate between authentic and manipulated pixels, we leverage the Vision Mamba network to extract feature maps that clearly describe the boundaries between tampered and untouched regions. To further enhance this separation, we propose a novel Guided Graph Neural Network (G-GNN) module that amplifies the distinction between manipulated and authentic pixels. Our evaluation results show that our proposed method achieved higher inference accuracy compared to other state-of-the-art methods.
CVMar 13, 2018
A Learning-Based Visual Saliency Prediction Model for Stereoscopic 3D Video (LBVS-3D)Amin Banitalebi-Dehkordi, Mahsa T. Pourazad, Panos Nasiopoulos
Over the past decade, many computational saliency prediction models have been proposed for 2D images and videos. Considering that the human visual system has evolved in a natural 3D environment, it is only natural to want to design visual attention models for 3D content. Existing monocular saliency models are not able to accurately predict the attentive regions when applied to 3D image/video content, as they do not incorporate depth information. This paper explores stereoscopic video saliency prediction by exploiting both low-level attributes such as brightness, color, texture, orientation, motion, and depth, as well as high-level cues such as face, person, vehicle, animal, text, and horizon. Our model starts with a rough segmentation and quantifies several intuitive observations such as the effects of visual discomfort level, depth abruptness, motion acceleration, elements of surprise, size and compactness of the salient regions, and emphasizing only a few salient objects in a scene. A new fovea-based model of spatial distance between the image regions is adopted for considering local and global feature calculations. To efficiently fuse the conspicuity maps generated by our method to one single saliency map that is highly correlated with the eye-fixation data, a random forest based algorithm is utilized. The performance of the proposed saliency model is evaluated against the results of an eye-tracking experiment, which involved 24 subjects and an in-house database of 61 captured stereoscopic videos. Our stereo video database as well as the eye-tracking data are publicly available along with this paper. Experiment results show that the proposed saliency prediction method achieves competitive performance compared to the state-of-the-art approaches.
CVMar 13, 2018
A Learning-Based Visual Saliency Fusion Model for High Dynamic Range Video (LBVS-HDR)Amin Banitalebi-Dehkordi, Yuanyuan Dong, Mahsa T. Pourazad et al.
Saliency prediction for Standard Dynamic Range (SDR) videos has been well explored in the last decade. However, limited studies are available on High Dynamic Range (HDR) Visual Attention Models (VAMs). Considering that the characteristic of HDR content in terms of dynamic range and color gamut is quite different than those of SDR content, it is essential to identify the importance of different saliency attributes of HDR videos for designing a VAM and understand how to combine these features. To this end we propose a learning-based visual saliency fusion method for HDR content (LVBS-HDR) to combine various visual saliency features. In our approach various conspicuity maps are extracted from HDR data, and then for fusing conspicuity maps, a Random Forests algorithm is used to train a model based on the collected data from an eye-tracking experiment. Performance evaluations demonstrate the superiority of the proposed fusion method against other existing fusion methods.
GRMar 13, 2018
Effect of Eye Dominance on the Perception of Stereoscopic 3D VideoAmin Banitalebi-Dehkordi, Mahsa T. Pourazad, Panos Nasiopoulos
Asymmetric schemes have widespread applications in the 3D video transmission pipeline. The significance of eye dominance becomes a concern when designing such schemes. In this paper, in order to investigate the effect of eye dominance on the perceptual 3D video quality, a database of representative asymmetric stereoscopic sequences is prepared and the overall 3D quality of these sequences is evaluated through subjective experiments. Experiment results showed that viewers find an asymmetric video more pleasant when the view with higher quality is projected to their dominant eye. Moreover, the eye dominance changes the mean opinion quality score by 16 % at most, a result caused by slight asymmetric video compression. For all other representative types of asymmetry, the statistical difference is much lower and in some cases even negligible.