Abdul Bais

CV
h-index10
12papers
24citations
Novelty44%
AI Score53

12 Papers

CYNov 1, 2023Code
A Call to Arms: AI Should be Critical for Social Media Analysis of Conflict Zones

Afia Abedin, Abdul Bais, Cody Buntain et al.

The massive proliferation of social media data represents a transformative opportunity for conflict studies and for tracking the proliferation and use of weaponry, as conflicts are increasingly documented in these online spaces. At the same time, the scale and types of data available are problematic for traditional open-source intelligence. This paper focuses on identifying specific weapon systems and the insignias of the armed groups using them as documented in the Ukraine war, as these tasks are critical to operational intelligence and tracking weapon proliferation, especially given the scale of international military aid given to Ukraine. The large scale of social media makes manual assessment difficult, however, so this paper presents early work that uses computer vision models to support this task. We demonstrate that these models can both identify weapons embedded in images shared in social media and how the resulting collection of military-relevant images and their post times interact with the offline, real-world conflict. Not only can we then track changes in the prevalence of images of tanks, land mines, military trucks, etc., we find correlations among time series data associated with these images and the daily fatalities in this conflict. This work shows substantial opportunity for examining similar online documentation of conflict contexts, and we also point to future avenues where computer vision can be further improved for these open-source intelligence tasks.

22.1CVMay 24
CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation

Shayan Jalilian, Abdul Bais

Promptable foundation models such as the Segment Anything Model (SAM) produce high-quality masks but remain semantically blind, relying on external prompts to specify categories. Existing vision-language approaches address this limitation by using external prompt coupling, where a vision-language model generates spatial prompts for SAM as a separate stage. We propose CLIP-Guided SAM, a parameter-efficient segmentation framework built on internal semantic conditioning. Instead of using semantic signals only to generate prompts, we inject CLIP-derived text, vision, and similarity features directly into SAM's image encoder through lightweight multi-modal semantic adapters. These adapters condition SAM's internal feature representations, allowing semantic information to influence mask prediction while preserving SAM's original promptable interface. Our framework is designed for low labeled-data settings and applies to both general-domain benchmarks and specialized downstream tasks. It supports two operating modes: Manual mode, for interactive segmentation with both text and spatial prompts, and Semi-Automatic text-only mode, for applications that require concept-specific segmentation using only textual input. We show that robustness depends on aligning training with the type of prompts used at inference, making train-test prompt consistency an important design principle. Through extensive experiments and ablations, we evaluate our method against SAM+PEFT baselines without semantic conditioning, vision-language + SAM pipelines, SAM 3, and strong semi-supervised segmentation methods that rely on large amounts of unlabeled data. Across these settings, CLIP-Guided SAM consistently achieves superior or competitive performance while remaining parameter-efficient in both training and deployment.

CVOct 2, 2023
Improved Crop and Weed Detection with Diverse Data Ensemble Learning

Muhammad Hamza Asad, Saeed Anwar, Abdul Bais

Modern agriculture heavily relies on Site-Specific Farm Management practices, necessitating accurate detection, localization, and quantification of crops and weeds in the field, which can be achieved using deep learning techniques. In this regard, crop and weed-specific binary segmentation models have shown promise. However, uncontrolled field conditions limit their performance from one field to the other. To improve semantic model generalization, existing methods augment and synthesize agricultural data to account for uncontrolled field conditions. However, given highly varied field conditions, these methods have limitations. To overcome the challenges of model deterioration in such conditions, we propose utilizing data specific to other crops and weeds for our specific target problem. To achieve this, we propose a novel ensemble framework. Our approach involves utilizing different crop and weed models trained on diverse datasets and employing a teacher-student configuration. By using homogeneous stacking of base models and a trainable meta-architecture to combine their outputs, we achieve significant improvements for Canola crops and Kochia weeds on unseen test data, surpassing the performance of single semantic segmentation models. We identify the UNET meta-architecture as the most effective in this context. Finally, through ablation studies, we demonstrate and validate the effectiveness of our proposed model. We observe that including base models trained on other target crops and weeds can help generalize the model to capture varied field conditions. Lastly, we propose two novel datasets with varied conditions for comparisons.

CVJan 4Code
SwinIFS: Landmark Guided Swin Transformer For Identity Preserving Face Super Resolution

Habiba Kausar, Saeed Anwar, Omar Jamal Hammad et al.

Face super-resolution aims to recover high-quality facial images from severely degraded low-resolution inputs, but remains challenging due to the loss of fine structural details and identity-specific features. This work introduces SwinIFS, a landmark-guided super-resolution framework that integrates structural priors with hierarchical attention mechanisms to achieve identity-preserving reconstruction at both moderate and extreme upscaling factors. The method incorporates dense Gaussian heatmaps of key facial landmarks into the input representation, enabling the network to focus on semantically important facial regions from the earliest stages of processing. A compact Swin Transformer backbone is employed to capture long-range contextual information while preserving local geometry, allowing the model to restore subtle facial textures and maintain global structural consistency. Extensive experiments on the CelebA benchmark demonstrate that SwinIFS achieves superior perceptual quality, sharper reconstructions, and improved identity retention; it consistently produces more photorealistic results and exhibits strong performance even under 8x magnification, where most methods fail to recover meaningful structure. SwinIFS also provides an advantageous balance between reconstruction accuracy and computational efficiency, making it suitable for real-world applications in facial enhancement, surveillance, and digital restoration. Our code, model weights, and results are available at https://github.com/Habiba123-stack/SwinIFS.

CVNov 16, 2025Code
MSRNet: A Multi-Scale Recursive Network for Camouflaged Object Detection

Leena Alghamdi, Muhammad Usman, Hafeez Anwar et al.

Camouflaged object detection is an emerging and challenging computer vision task that requires identifying and segmenting objects that blend seamlessly into their environments due to high similarity in color, texture, and size. This task is further complicated by low-light conditions, partial occlusion, small object size, intricate background patterns, and multiple objects. While many sophisticated methods have been proposed for this task, current methods still struggle to precisely detect camouflaged objects in complex scenarios, especially with small and multiple objects, indicating room for improvement. We propose a Multi-Scale Recursive Network that extracts multi-scale features via a Pyramid Vision Transformer backbone and combines them via specialized Attention-Based Scale Integration Units, enabling selective feature merging. For more precise object detection, our decoder recursively refines features by incorporating Multi-Granularity Fusion Units. A novel recursive-feedback decoding strategy is developed to enhance global context understanding, helping the model overcome the challenges in this task. By jointly leveraging multi-scale learning and recursive feature optimization, our proposed method achieves performance gains, successfully detecting small and multiple camouflaged objects. Our model achieves state-of-the-art results on two benchmark datasets for camouflaged object detection and ranks second on the remaining two. Our codes, model weights, and results are available at \href{https://github.com/linaagh98/MSRNet}{https://github.com/linaagh98/MSRNet}.

CVNov 16, 2025Code
C3Net: Context-Contrast Network for Camouflaged Object Detection

Baber Jan, Aiman H. El-Maleh, Abdul Jabbar Siddiqui et al.

Camouflaged object detection identifies objects that blend seamlessly with their surroundings through similar colors, textures, and patterns. This task challenges both traditional segmentation methods and modern foundation models, which fail dramatically on camouflaged objects. We identify six fundamental challenges in COD: Intrinsic Similarity, Edge Disruption, Extreme Scale Variation, Environmental Complexities, Contextual Dependencies, and Salient-Camouflaged Object Disambiguation. These challenges frequently co-occur and compound the difficulty of detection, requiring comprehensive architectural solutions. We propose C3Net, which addresses all challenges through a specialized dual-pathway decoder architecture. The Edge Refinement Pathway employs gradient-initialized Edge Enhancement Modules to recover precise boundaries from early features. The Contextual Localization Pathway utilizes our novel Image-based Context Guidance mechanism to achieve intrinsic saliency suppression without external models. An Attentive Fusion Module synergistically combines the two pathways via spatial gating. C3Net achieves state-of-the-art performance with S-measures of 0.898 on COD10K, 0.904 on CAMO, and 0.913 on NC4K, while maintaining efficient processing. C3Net demonstrates that complex, multifaceted detection challenges require architectural innovation, with specialized components working synergistically to achieve comprehensive coverage beyond isolated improvements. Code, model weights, and results are available at https://github.com/Baber-Jan/C3Net.

CVOct 6, 2025Code
SPEGNet: Synergistic Perception-Guided Network for Camouflaged Object Detection

Baber Jan, Saeed Anwar, Aiman H. El-Maleh et al.

Camouflaged object detection segments objects with intrinsic similarity and edge disruption. Current detection methods rely on accumulated complex components. Each approach adds components such as boundary modules, attention mechanisms, and multi-scale processors independently. This accumulation creates a computational burden without proportional gains. To manage this complexity, they process at reduced resolutions, eliminating fine details essential for camouflage. We present SPEGNet, addressing fragmentation through a unified design. The architecture integrates multi-scale features via channel calibration and spatial enhancement. Boundaries emerge directly from context-rich representations, maintaining semantic-spatial alignment. Progressive refinement implements scale-adaptive edge modulation with peak influence at intermediate resolutions. This design strikes a balance between boundary precision and regional consistency. SPEGNet achieves 0.887 $S_α$ on CAMO, 0.890 on COD10K, and 0.895 on NC4K, with real-time inference speed. Our approach excels across scales, from tiny, intricate objects to large, pattern-similar ones, while handling occlusion and ambiguous boundaries. Code, model weights, and results are available on \href{https://github.com/Baber-Jan/SPEGNet}{https://github.com/Baber-Jan/SPEGNet}.

CVMay 14, 2024
Bird Eye-View to Street-View: A Survey

Khawlah Bajbaa, Muhammad Usman, Saeed Anwar et al.

In recent years, street view imagery has grown to become one of the most important sources of geospatial data collection and urban analytics, which facilitates generating meaningful insights and assisting in decision-making. Synthesizing a street-view image from its corresponding satellite image is a challenging task due to the significant differences in appearance and viewpoint between the two domains. In this study, we screened 20 recent research papers to provide a thorough review of the state-of-the-art of how street-view images are synthesized from their corresponding satellite counterparts. The main findings are: (i) novel deep learning techniques are required for synthesizing more realistic and accurate street-view images; (ii) more datasets need to be collected for public usage; and (iii) more specific evaluation metrics need to be investigated for evaluating the generated images appropriately. We conclude that, due to applying outdated deep learning techniques, the recent literature failed to generate detailed and diverse street-view images.

CVMay 29, 2025
MaskAdapt: Unsupervised Geometry-Aware Domain Adaptation Using Multimodal Contextual Learning and RGB-Depth Masking

Numair Nadeem, Muhammad Hamza Asad, Saeed Anwar et al.

Semantic segmentation of crops and weeds is crucial for site-specific farm management; however, most existing methods depend on labor intensive pixel-level annotations. A further challenge arises when models trained on one field (source domain) fail to generalize to new fields (target domain) due to domain shifts, such as variations in lighting, camera setups, soil composition, and crop growth stages. Unsupervised Domain Adaptation (UDA) addresses this by enabling adaptation without target-domain labels, but current UDA methods struggle with occlusions and visual blending between crops and weeds, leading to misclassifications in real-world conditions. To overcome these limitations, we introduce MaskAdapt, a novel approach that enhances segmentation accuracy through multimodal contextual learning by integrating RGB images with features derived from depth data. By computing depth gradients from depth maps, our method captures spatial transitions that help resolve texture ambiguities. These gradients, through a cross-attention mechanism, refines RGB feature representations, resulting in sharper boundary delineation. In addition, we propose a geometry-aware masking strategy that applies horizontal, vertical, and stochastic masks during training. This encourages the model to focus on the broader spatial context for robust visual recognition. Evaluations on real agricultural datasets demonstrate that MaskAdapt consistently outperforms existing State-of-the-Art (SOTA) UDA methods, achieving improved segmentation mean Intersection over Union (mIOU) across diverse field conditions.

CVJul 31, 2025
SAM-PTx: Text-Guided Fine-Tuning of SAM with Parameter-Efficient, Parallel-Text Adapters

Shayan Jalilian, Abdul Bais

The Segment Anything Model (SAM) has demonstrated impressive generalization in prompt-based segmentation. Yet, the potential of semantic text prompts remains underexplored compared to traditional spatial prompts like points and boxes. This paper introduces SAM-PTx, a parameter-efficient approach for adapting SAM using frozen CLIP-derived text embeddings as class-level semantic guidance. Specifically, we propose a lightweight adapter design called Parallel-Text that injects text embeddings into SAM's image encoder, enabling semantics-guided segmentation while keeping most of the original architecture frozen. Our adapter modifies only the MLP-parallel branch of each transformer block, preserving the attention pathway for spatial reasoning. Through supervised experiments and ablations on the COD10K dataset as well as low-data subsets of COCO and ADE20K, we show that incorporating fixed text embeddings as input improves segmentation performance over purely spatial prompt baselines. To our knowledge, this is the first work to use text prompts for segmentation on the COD10K dataset. These results suggest that integrating semantic conditioning into SAM's architecture offers a practical and scalable path for efficient adaptation with minimal computational complexity.

CVJun 16, 2025
HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment

Numair Nadeem, Saeed Anwar, Muhammad Hamza Asad et al.

In this paper, we address Semi-supervised Semantic Segmentation (SSS) under domain shift by leveraging domain-invariant semantic knowledge from text embeddings of Vision-Language Models (VLMs). We propose a unified Hierarchical Vision-Language framework (HVL) that integrates domain-invariant text embeddings as object queries in a transformer-based segmentation network to improve generalization and reduce misclassification under limited supervision. The mentioned textual queries are used for grouping pixels with shared semantics under SSS. HVL is designed to (1) generate textual queries that maximally encode domain-invariant semantics from VLM while capturing intra-class variations; (2) align these queries with spatial visual features to enhance their segmentation ability and improve the semantic clarity of visual features. We also introduce targeted regularization losses that maintain vision--language alignment throughout training to reinforce semantic understanding. HVL establishes a novel state-of-the-art by achieving a +9.3% improvement in mean Intersection over Union (mIoU) on COCO, utilizing 232 labelled images, +3.1% on Pascal VOC employing 92 labels, +4.8% on ADE20 using 316 labels, and +3.4% on Cityscapes with 100 labels, demonstrating superior performance with less than 1% supervision on four benchmark datasets. Our results show that language-guided segmentation bridges the label efficiency gap and enables new levels of fine-grained generalization.

SIJul 29, 2020
CommuNety: A Deep Learning System for the Prediction of Cohesive Social Communities

Syed Afaq Ali Shah, Weifeng Deng, Jianxin Li et al.

Effective mining of social media, which consists of a large number of users is a challenging task. Traditional approaches rely on the analysis of text data related to users to accomplish this task. However, text data lacks significant information about the social users and their associated groups. In this paper, we propose CommuNety, a deep learning system for the prediction of cohesive social networks using images. The proposed deep learning model consists of hierarchical CNN architecture to learn descriptive features related to each cohesive network. The paper also proposes a novel Face Co-occurrence Frequency algorithm to quantify existence of people in images, and a novel photo ranking method to analyze the strength of relationship between different individuals in a predicted social network. We extensively evaluate the proposed technique on PIPA dataset and compare with state-of-the-art methods. Our experimental results demonstrate the superior performance of the proposed technique for the prediction of relationship between different individuals and the cohesiveness of communities.