CVJul 25, 2023Code
Foundational Models Defining a New Era in Vision: A Survey and OutlookMuhammad Awais, Muzammal Naseer, Salman Khan et al.
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundational models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundational models, including typical architecture designs to combine different modalities (vision, text, audio, etc), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns; textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundational models in computer vision, including difficulties in their evaluations and benchmarking, gaps in their real-world understanding, limitations of their contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively. A comprehensive list of foundational models studied in this work is available at \url{https://github.com/awaisrauf/Awesome-CV-Foundational-Models}.
CVMay 30, 2022Code
GMML is All you NeedSara Atito, Muhammad Awais, Josef Kittler
Vision transformers have generated significant interest in the computer vision community because of their flexibility in exploiting contextual information, whether it is sharply confined local, or long range global. However, they are known to be data hungry. This has motivated the research in self-supervised transformer pretraining, which does not need to decode the semantic information conveyed by labels to link it to the image properties, but rather focuses directly on extracting a concise representation of the image data that reflects the notion of similarity, and is invariant to nuisance factors. The key vehicle for the self-learning process used by the majority of self-learning methods is the generation of multiple views of the training data and the creation of pretext tasks which use these views to define the notion of image similarity, and data integrity. However, this approach lacks the natural propensity to extract contextual information. We propose group masked model learning (GMML), a self-supervised learning (SSL) mechanism for pretraining vision transformers with the ability to extract the contextual information present in all the concepts in an image. GMML achieves this by manipulating randomly groups of connected tokens, ensuingly covering a meaningful part of a semantic concept, and then recovering the hidden semantic information from the visible part of the concept. GMML implicitly introduces a novel data augmentation process. Unlike most of the existing SSL approaches, GMML does not require momentum encoder, nor rely on careful implementation details such as large batches and gradient stopping, which are all artefacts of most of the current self-supervised learning techniques. The source code is publicly available for the community to train on bigger corpora: https://github.com/Sara-Ahmed/GMML.
CVAug 14, 2024Code
BAPLe: Backdoor Attacks on Medical Foundational Models using Prompt LearningAsif Hanif, Fahad Shamshad, Muhammad Awais et al.
Medical foundation models are gaining prominence in the medical community for their ability to derive general representations from extensive collections of medical image-text pairs. Recent research indicates that these models are susceptible to backdoor attacks, which allow them to classify clean images accurately but fail when specific triggers are introduced. However, traditional backdoor attacks necessitate a considerable amount of additional data to maliciously pre-train a model. This requirement is often impractical in medical imaging applications due to the usual scarcity of data. Inspired by the latest developments in learnable prompts, this work introduces a method to embed a backdoor into the medical foundation model during the prompt learning phase. By incorporating learnable prompts within the text encoder and introducing imperceptible learnable noise trigger to the input images, we exploit the full capabilities of the medical foundation models (Med-FM). Our method, BAPLe, requires only a minimal subset of data to adjust the noise trigger and the text prompts for downstream tasks, enabling the creation of an effective backdoor attack. Through extensive experiments with four medical foundation models, each pre-trained on different modalities and evaluated across six downstream datasets, we demonstrate the efficacy of our approach. BAPLe achieves a high backdoor success rate across all models and datasets, outperforming the baseline backdoor attack methods. Our work highlights the vulnerability of Med-FMs towards backdoor attacks and strives to promote the safe adoption of Med-FMs before their deployment in real-world applications. Code is available at https://asif-hanif.github.io/baple/.
CVJul 8, 2024Code
C2C: Component-to-Composition Learning for Zero-Shot Compositional Action RecognitionRongchang Li, Zhenhua Feng, Tianyang Xu et al.
Compositional actions consist of dynamic (verbs) and static (objects) concepts. Humans can easily recognize unseen compositions using the learned concepts. For machines, solving such a problem requires a model to recognize unseen actions composed of previously observed verbs and objects, thus requiring so-called compositional generalization ability. To facilitate this research, we propose a novel Zero-Shot Compositional Action Recognition (ZS-CAR) task. For evaluating the task, we construct a new benchmark, Something-composition (Sth-com), based on the widely used Something-Something V2 dataset. We also propose a novel Component-to-Composition (C2C) learning method to solve the new ZS-CAR task. C2C includes an independent component learning module and a composition inference module. Last, we devise an enhanced training strategy to address the challenges of component variations between seen and unseen compositions and to handle the subtle balance between learning seen and unseen actions. The experimental results demonstrate that the proposed framework significantly surpasses the existing compositional generalization methods and sets a new state-of-the-art. The new Sth-com benchmark and code are available at https://github.com/RongchangLi/ZSCAR_C2C.
SDNov 23, 2022
ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event ClassificationSara Atito, Muhammad Awais, Wenwu Wang et al.
Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data hungry nature of transformers and the limited amount of labelled data, most transformer-based models for audio tasks are finetuned from ImageNet pretrained models, despite the huge gap between the domain of natural images and audio. This has motivated the research in self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose \textbf{L}ocal-\textbf{G}lobal \textbf{A}udio \textbf{S}pectrogram v\textbf{I}sion \textbf{T}ransformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts the performance on all tasks and sets a new state-of-the-art performance in five audio and speech classification tasks, outperforming recent methods, including the approaches that use additional datasets for pretraining.
CVAug 22, 2023
Masked Momentum Contrastive Learning for Zero-shot Semantic UnderstandingJiantao Wu, Shentong Mo, Muhammad Awais et al.
Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In the realm of computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. Nonetheless, the escalating cost of finetuning these large models has posed a challenge due to the explosion of model size. This study endeavours to evaluate the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks, obviating the need for finetuning, with the intention of emulating human-like capabilities in generalisation and recognition of unseen objects. To this end, we propose an evaluation protocol for zero-shot segmentation based on a prompting patch. Given a point on the target object as a prompt, the algorithm calculates the similarity map between the selected patch and other patches, upon that, a simple thresholding is applied to segment the target. Another evaluation is intra-object and inter-object similarity to gauge discriminatory ability of SSP ViTs. Insights from zero-shot segmentation from prompting and discriminatory abilities of SSP led to the design of a simple SSP approach, termed MMC. This approaches combines Masked image modelling for encouraging similarity of local features, Momentum based self-distillation for transferring semantics from global to local features, and global Contrast for promoting semantics of global features, to enhance discriminative representations of SSP ViTs. Consequently, our proposed method significantly reduces the overlap of intra-object and inter-object similarities, thereby facilitating effective object segmentation within an image. Our experiments reveal that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.
LGMar 22, 2023Code
Variantional autoencoder with decremental information bottleneck for disentanglementJiantao Wu, Shentong Mo, Xiang Yang et al.
One major challenge of disentanglement learning with variational autoencoders is the trade-off between disentanglement and reconstruction fidelity. Previous studies, which increase the information bottleneck during training, tend to lose the constraint of disentanglement, leading to the information diffusion problem. In this paper, we present a novel framework for disentangled representation learning, DeVAE, which utilizes hierarchical latent spaces with decreasing information bottlenecks across these spaces. The key innovation of our approach lies in connecting the hierarchical latent spaces through disentanglement-invariant transformations, allowing the sharing of disentanglement properties among spaces while maintaining an acceptable level of reconstruction performance. We demonstrate the effectiveness of DeVAE in achieving a balance between disentanglement and reconstruction through a series of experiments and ablation studies on dSprites and Shapes3D datasets. Code is available at https://github.com/erow/disentanglement_lib/tree/pytorch#devae.
CVSep 11, 2023
SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-supervised Skeleton-based Action RecognitionCong Wu, Xiao-Jun Wu, Josef Kittler et al.
Contrastive learning has achieved great success in skeleton-based action recognition. However, most existing approaches encode the skeleton sequences as entangled spatiotemporal representations and confine the contrasts to the same level of representation. Instead, this paper introduces a novel contrastive learning framework, namely Spatiotemporal Clues Disentanglement Network (SCD-Net). Specifically, we integrate the decoupling module with a feature extractor to derive explicit clues from spatial and temporal domains respectively. As for the training of SCD-Net, with a constructed global anchor, we encourage the interaction between the anchor and extracted clues. Further, we propose a new masking strategy with structural constraints to strengthen the contextual associations, leveraging the latest development from masked image modelling into the proposed SCD-Net. We conduct extensive evaluations on the NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream tasks such as action recognition, action retrieval, transfer learning, and semi-supervised learning. The experimental results demonstrate the effectiveness of our method, which outperforms the existing state-of-the-art (SOTA) approaches significantly.
IVAug 29, 2022
SB-SSL: Slice-Based Self-Supervised Transformers for Knee Abnormality Classification from MRISara Atito, Syed Muhammad Anwar, Muhammad Awais et al.
The availability of large scale data with high quality ground truth labels is a challenge when developing supervised machine learning solutions for healthcare domain. Although, the amount of digital data in clinical workflows is increasing, most of this data is distributed on clinical sites and protected to ensure patient privacy. Radiological readings and dealing with large-scale clinical data puts a significant burden on the available resources, and this is where machine learning and artificial intelligence play a pivotal role. Magnetic Resonance Imaging (MRI) for musculoskeletal (MSK) diagnosis is one example where the scans have a wealth of information, but require a significant amount of time for reading and labeling. Self-supervised learning (SSL) can be a solution for handling the lack of availability of ground truth labels, but generally requires a large amount of training data during the pretraining stage. Herein, we propose a slice-based self-supervised deep learning framework (SB-SSL), a novel slice-based paradigm for classifying abnormality using knee MRI scans. We show that for a limited number of cases (<1000), our proposed framework is capable to identify anterior cruciate ligament tear with an accuracy of 89.17% and an AUC of 0.954, outperforming state-of-the-art without usage of external data during pretraining. This demonstrates that our proposed framework is suited for SSL in the limited data regime.
IVNov 23, 2022
SPCXR: Self-supervised Pretraining using Chest X-rays Towards a Domain Specific Foundation ModelSyed Muhammad Anwar, Abhijeet Parida, Sara Atito et al.
Chest X-rays (CXRs) are a widely used imaging modality for the diagnosis and prognosis of lung disease. The image analysis tasks vary. Examples include pathology detection and lung segmentation. There is a large body of work where machine learning algorithms are developed for specific tasks. A significant recent example is Coronavirus disease (covid-19) detection using CXR data. However, the traditional diagnostic tool design methods based on supervised learning are burdened by the need to provide training data annotation, which should be of good quality for better clinical outcomes. Here, we propose an alternative solution, a new self-supervised paradigm, where a general representation from CXRs is learned using a group-masked self-supervised framework. The pre-trained model is then fine-tuned for domain-specific tasks such as covid-19, pneumonia detection, and general health screening. We show that the same pre-training can be used for the lung segmentation task. Our proposed paradigm shows robust performance in multiple downstream tasks which demonstrates the success of the pre-training. Moreover, the performance of the pre-trained models on data with significant drift during test time proves the learning of a better generic representation. The methods are further validated by covid-19 detection in a unique small-scale pediatric data set. The performance gain in accuracy (~25%) is significant when compared to a supervised transformer-based method. This adds credence to the strength and reliability of our proposed framework and pre-training strategy.
CVNov 13, 2023
LT-ViT: A Vision Transformer for multi-label Chest X-ray classificationUmar Marikkar, Sara Atito, Muhammad Awais et al.
Vision Transformers (ViTs) are widely adopted in medical imaging tasks, and some existing efforts have been directed towards vision-language training for Chest X-rays (CXRs). However, we envision that there still exists a potential for improvement in vision-only training for CXRs using ViTs, by aggregating information from multiple scales, which has been proven beneficial for non-transformer networks. Hence, we have developed LT-ViT, a transformer that utilizes combined attention between image tokens and randomly initialized auxiliary tokens that represent labels. Our experiments demonstrate that LT-ViT (1) surpasses the state-of-the-art performance using pure ViTs on two publicly available CXR datasets, (2) is generalizable to other pre-training methods and therefore is agnostic to model initialization, and (3) enables model interpretability without grad-cam and its variants.
CVFeb 13Code
Channel-Aware Probing for Multi-Channel ImagingUmar Marikkar, Syed Sameed Husain, Muhammad Awais et al.
Training and evaluating vision encoders on Multi-Channel Imaging (MCI) data remains challenging as channel configurations vary across datasets, preventing fixed-channel training and limiting reuse of pre-trained encoders on new channel settings. Prior work trains MCI encoders but typically evaluates them via full fine-tuning, leaving probing with frozen pre-trained encoders comparatively underexplored. Existing studies that perform probing largely focus on improving representations, rather than how to best leverage fixed representations for downstream tasks. Although the latter problem has been studied in other domains, directly transferring those strategies to MCI yields weak results, even worse than training from scratch. We therefore propose Channel-Aware Probing (CAP), which exploits the intrinsic inter-channel diversity in MCI datasets by controlling feature flow at both the encoder and probe levels. CAP uses Independent Feature Encoding (IFE) to encode each channel separately, and Decoupled Pooling (DCP) to pool within channels before aggregating across channels. Across three MCI benchmarks, CAP consistently improves probing performance over the default probing protocol, matches fine-tuning from scratch, and largely reduces the gap to full fine-tuning from the same MCI pre-trained checkpoints. Code can be found in https://github.com/umarikkar/CAP.
CVSep 23, 2024
Probabilistically Aligned View-unaligned Clustering with Adaptive Template SelectionWenhua Dong, Xiao-Jun Wu, Zhenhua Feng et al.
In most existing multi-view modeling scenarios, cross-view correspondence (CVC) between instances of the same target from different views, like paired image-text data, is a crucial prerequisite for effortlessly deriving a consistent representation. Nevertheless, this premise is frequently compromised in certain applications, where each view is organized and transmitted independently, resulting in the view-unaligned problem (VuP). Restoring CVC of unaligned multi-view data is a challenging and highly demanding task that has received limited attention from the research community. To tackle this practical challenge, we propose to integrate the permutation derivation procedure into the bipartite graph paradigm for view-unaligned clustering, termed Probabilistically Aligned View-unaligned Clustering with Adaptive Template Selection (PAVuC-ATS). Specifically, we learn consistent anchors and view-specific graphs by the bipartite graph, and derive permutations applied to the unaligned graphs by reformulating the alignment between two latent representations as a 2-step transition of a Markov chain with adaptive template selection, thereby achieving the probabilistic alignment. The convergence of the resultant optimization problem is validated both experimentally and theoretically. Extensive experiments on six benchmark datasets demonstrate the superiority of the proposed PAVuC-ATS over the baseline methods.
CVFeb 27, 2025Code
One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image FusionChunyang Cheng, Tianyang Xu, Zhenhua Feng et al.
Advanced image fusion methods mostly prioritise high-level missions, where task interaction struggles with semantic gaps, requiring complex bridging mechanisms. In contrast, we propose to leverage low-level vision tasks from digital photography fusion, allowing for effective feature interaction through pixel-level supervision. This new paradigm provides strong guidance for unsupervised multimodal fusion without relying on abstract semantics, enhancing task-shared feature learning for broader applicability. Owning to the hybrid image features and enhanced universal representations, the proposed GIFNet supports diverse fusion tasks, achieving high performance across both seen and unseen scenarios with a single model. Uniquely, experimental results reveal that our framework also supports single-modality enhancement, offering superior flexibility for practical applications. Our code will be available at https://github.com/AWCXV/GIFNet.
36.2CVMar 16
Domain Adaptation Without the Compute Burden for Efficient Whole Slide Image AnalysisUmar Marikkar, Muhammad Awais, Sara Atito
Computational methods on analyzing Whole Slide Images (WSIs) enable early diagnosis and treatments by supporting pathologists in detection and classification of tumors. However, the extremely high resolution of WSIs makes end-to-end training impractical compared to typical image analysis tasks. To address this, most approaches use pre-trained feature extractors to obtain fixed representations of whole slides, which are then combined with Multiple Instance Learning (MIL) for downstream tasks. These feature extractors are typically pre-trained on natural image datasets such as ImageNet, which fail to capture domain-specific characteristics. Although domain-specific pre-training on histopathology data yields more relevant feature representations, it remains computationally expensive and fail to capture task-specific characteristics within the domain. To address the computational cost and lack of task-specificity in domain-specific pre-training, we propose EfficientWSI (eWSI), a careful integration of Parameter-Efficient-Fine-Tuning (PEFT) and Multiple Instance Learning (MIL) that enables end-to-end training on WSI tasks. We evaluate eWSI on seven WSI-level tasks over Camelyon16, TCGA and BRACS datasets. Our results show that eWSI when applied with ImageNet feature extractors yields strong classification performance, matching or outperforming MILs with in-domain feature extractors, alleviating the need for extensive in-domain pre-training. Furthermore, when eWSI is applied with in-domain feature extractors, it further improves classification performance in most cases, demonstrating its ability to capture task-specific information where beneficial. Our findings suggest that eWSI provides a task-targeted, computationally efficient path for WSI tasks, offering a promising direction for task-specific learning in computational pathology.
28.1CVMar 15
DC-ViT: Modulating Spatial and Channel Interactions for Multi-Channel ImagesUmar Marikkar, Syed Sameed Husain, Muhammad Awais et al.
Training and evaluation in multi-channel imaging (MCI) remains challenging due to heterogeneous channel configurations arising from varying staining protocols, sensor types, and acquisition settings. This heterogeneity limits the applicability of fixed-channel encoders commonly used in general computer vision. Recent Multi-Channel Vision Transformers (MC-ViTs) address this by enabling flexible channel inputs, typically by jointly encoding patch tokens from all channels within a unified attention space. However, unrestricted token interactions across channels can lead to feature dilution, reducing the ability to preserve channel-specific semantics that are critical in MCI data. To address this, we propose Decoupled Vision Transformer (DC-ViT), which explicitly regulates information sharing using Decoupled Self-Attention (DSA), which decomposes token updates into two complementary pathways: spatial updates that model intra-channel structure, and channel-wise updates that adaptively integrate cross-channel information. This decoupling mitigates informational collapse while allowing selective inter-channel interaction. To further exploit these enhanced channel-specific representations, we introduce Decoupled Aggregation (DAG), which allows the model to learn task-specific channel importances. Extensive experiments across three MCI benchmarks demonstrate consistent improvements over existing MC-ViT approaches.
LGMar 31, 2024Code
DailyMAE: Towards Pretraining Masked Autoencoders in One DayJiantao Wu, Shentong Mo, Sara Atito et al.
Recently, masked image modeling (MIM), an important self-supervised learning (SSL) method, has drawn attention for its effectiveness in learning data representation from unlabeled data. Numerous studies underscore the advantages of MIM, highlighting how models pretrained on extensive datasets can enhance the performance of downstream tasks. However, the high computational demands of pretraining pose significant challenges, particularly within academic environments, thereby impeding the SSL research progress. In this study, we propose efficient training recipes for MIM based SSL that focuses on mitigating data loading bottlenecks and employing progressive training techniques and other tricks to closely maintain pretraining performance. Our library enables the training of a MAE-Base/16 model on the ImageNet 1K dataset for 800 epochs within just 18 hours, using a single machine equipped with 8 A100 GPUs. By achieving speed gains of up to 5.8 times, this work not only demonstrates the feasibility of conducting high-efficiency SSL training but also paves the way for broader accessibility and promotes advancement in SSL research particularly for prototyping and initial testing of SSL ideas. The code is available in https://github.com/erow/FastSSL.
CLFeb 28, 2025Code
AgroLLM: Connecting Farmers and Agricultural Practices through Large Language Models for Enhanced Knowledge Transfer and Practical ApplicationDinesh Jackson Samuel, Inna Skarga-Bandurova, David Sikolia et al.
AgroLLM is an AI-powered chatbot designed to enhance knowledge-sharing and education in agriculture using Large Language Models (LLMs) and a Retrieval-Augmented Generation (RAG) framework. By using a comprehensive open-source agricultural database, AgroLLM provides accurate, contextually relevant responses while reducing incorrect information retrieval. The system utilizes the FAISS vector database for efficient similarity searches, ensuring rapid access to agricultural knowledge. A comparative study of three advanced models: Gemini 1.5 Flash, ChatGPT-4o Mini, and Mistral-7B-Instruct-v0.2 was conducted to evaluate performance across four key agricultural domains: Agriculture and Life Sciences, Agricultural Management, Agriculture and Forestry, and Agriculture Business. Key evaluation metrics included embedding quality, search efficiency, and response relevance. Results indicated that ChatGPT-4o Mini with RAG achieved the highest accuracy at 93%. Continuous feedback mechanisms enhance response quality, making AgroLLM a benchmark AI-driven educational tool for farmers, researchers, and professionals, promoting informed decision-making and improved agricultural practices.
CVMay 19, 2025Code
GeoVLM: Improving Automated Vehicle Geolocalisation Using Vision-Language MatchingBarkin Dagda, Muhammad Awais, Saber Fallah
Cross-view geo-localisation identifies coarse geographical position of an automated vehicle by matching a ground-level image to a geo-tagged satellite image from a database. Despite the advancements in Cross-view geo-localisation, significant challenges still persist such as similar looking scenes which makes it challenging to find the correct match as the top match. Existing approaches reach high recall rates but they still fail to rank the correct image as the top match. To address this challenge, this paper proposes GeoVLM, a novel approach which uses the zero-shot capabilities of vision language models to enable cross-view geo-localisation using interpretable cross-view language descriptions. GeoVLM is a trainable reranking approach which improves the best match accuracy of cross-view geo-localisation. GeoVLM is evaluated on standard benchmark VIGOR and University-1652 and also through real-life driving environments using Cross-View United Kingdom, a new benchmark dataset introduced in this paper. The results of the paper show that GeoVLM improves retrieval performance of cross-view geo-localisation compared to the state-of-the-art methods with the help of explainable natural language descriptions. The code is available at https://github.com/CAV-Research-Lab/GeoVLM
58.9LGMay 12
Information theoretic underpinning of self-supervised learning by clusteringJosef Kittler, Sara Atito, Muhammad Awais
Self-supervised learning (SSL) is recognized as an essential tool for building foundation models for Artificial Intelligence applications. The advances in SSL have been made thanks to vigorous arguments about the principles of SSL and through extensive empirical research. The aim of this paper is to contribute to the development of the underpinning theory of SSL, focusing on the deep clustering approach. By analogy to supervised learning, we formulate SSL as K-L divergence optimization. The mode collapse is prevented by imposing an optimisation constraint on the teacher distribution. This leads to normalization using inverse cluster priors. We show that using Jensen inequality this normalization simplifies to the popular batch centering procedure. Distillation and centering are common {heuristics-based} practices in SSL, {but our work underpins them theoretically.} The theoretical model developed not only supports specific existing successful SSL methods, but also suggests directions for future investigations.
IVJun 27, 2024Code
BOrg: A Brain Organoid-Based Mitosis Dataset for Automatic Analysis of Brain DiseasesMuhammad Awais, Mehaboobathunnisa Sahul Hameed, Bidisha Bhattacharya et al.
Recent advances have enabled the study of human brain development using brain organoids derived from stem cells. Quantifying cellular processes like mitosis in these organoids offers insights into neurodevelopmental disorders, but the manual analysis is time-consuming, and existing datasets lack specific details for brain organoid studies. We introduce BOrg, a dataset designed to study mitotic events in the embryonic development of the brain using confocal microscopy images of brain organoids. BOrg utilizes an efficient annotation pipeline with sparse point annotations and techniques that minimize expert effort, overcoming limitations of standard deep learning approaches on sparse data. We adapt and benchmark state-of-the-art object detection and cell counting models on BOrg for detecting and analyzing mitotic cells across prophase, metaphase, anaphase, and telophase stages. Our results demonstrate these adapted models significantly improve mitosis analysis efficiency and accuracy for brain organoid research compared to existing methods. BOrg facilitates the development of automated tools to quantify statistics like mitosis rates, aiding mechanistic studies of neurodevelopmental processes and disorders. Data and code are available at https://github.com/awaisrauf/borg.
CVApr 8, 2021Code
SiT: Self-supervised vIsion TransformerSara Atito, Muhammad Awais, Josef Kittler
Self-supervised learning methods are gaining increasing traction in computer vision due to their recent success in reducing the gap with supervised learning. In natural language processing (NLP) self-supervised learning and transformers are already the methods of choice. The recent literature suggests that the transformers are becoming increasingly popular also in computer vision. So far, the vision transformers have been shown to work well when pretrained either using a large scale supervised data or with some kind of co-supervision, e.g. in terms of teacher network. These supervised pretrained vision transformers achieve very good results in downstream tasks with minimal changes. In this work we investigate the merits of self-supervised learning for pretraining image/vision transformers and then using them for downstream classification tasks. We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model. The architectural flexibility of SiT allows us to use it as an autoencoder and work with multiple self-supervised tasks seamlessly. We show that a pretrained SiT can be finetuned for a downstream classification task on small scale datasets, consisting of a few thousand images rather than several millions. The proposed approach is evaluated on standard datasets using common protocols. The results demonstrate the strength of the transformers and their suitability for self-supervised learning. We outperformed existing self-supervised learning methods by large margin. We also observed that SiT is good for few shot learning and also showed that it is learning useful representation by simply training a linear classifier on top of the learned features from SiT. Pretraining, finetuning, and evaluation codes will be available under: https://github.com/Sara-Ahmed/SiT.
LGJun 19, 2020Code
Towards an Adversarially Robust Normalization ApproachMuhammad Awais, Fahad Shamshad, Sung-Ho Bae
Batch Normalization (BatchNorm) is effective for improving the performance and accelerating the training of deep neural networks. However, it has also shown to be a cause of adversarial vulnerability, i.e., networks without it are more robust to adversarial attacks. In this paper, we investigate how BatchNorm causes this vulnerability and proposed new normalization that is robust to adversarial attacks. We first observe that adversarial images tend to shift the distribution of BatchNorm input, and this shift makes train-time estimated population statistics inaccurate. We hypothesize that these inaccurate statistics make models with BatchNorm more vulnerable to adversarial attacks. We prove our hypothesis by replacing train-time estimated statistics with statistics calculated from the inference-time batch. We found that the adversarial vulnerability of BatchNorm disappears if we use these statistics. However, without estimated batch statistics, we can not use BatchNorm in the practice if large batches of input are not available. To mitigate this, we propose Robust Normalization (RobustNorm); an adversarially robust version of BatchNorm. We experimentally show that models trained with RobustNorm perform better in adversarial settings while retaining all the benefits of BatchNorm. Code is available at \url{https://github.com/awaisrauf/RobustNorm}.
CVFeb 6
RAIGen: Rare Attribute Identification in Text-to-Image Generative ModelsSilpa Vadakkeeveetil Sreelatha, Dan Wang, Serge Belongie et al.
Text-to-image diffusion models achieve impressive generation quality but inherit and amplify training-data biases, skewing coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering rare or minority features underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce RAIGen, the first framework, to our knowledge, for un-supervised rare-attribute discovery in diffusion models. RAIGen leverages Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation.
28.1CVApr 1
CoLA: Cross-Modal Low-rank Adaptation for Multimodal Downstream TasksWish Suharitdamrong, Tony Alex, Muhammad Awais et al.
Foundation models have revolutionized AI, but adapting them efficiently for multimodal tasks, particularly in dual-stream architectures composed of unimodal encoders, such as DINO and BERT, remains a significant challenge. Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) enable lightweight adaptation, yet they operate in isolation within each modality, limiting their ability in capturing cross-modal interactions. In this paper, we take a step in bridging this gap with Cross-Modal Low-Rank Adaptation (CoLA), a novel PEFT framework that extends LoRA by introducing a dedicated inter-modal adaptation pathway alongside the standard intra-modal one. This dual-path design enables CoLA to adapt unimodal foundation models to multimodal tasks effectively, without interference between modality-specific and cross-modal learning. We evaluate CoLA across a range of vision-language (RefCOCO, RefCOCO+, RefCOCOg) and audio-visual (AVE, AVS) benchmarks, where it consistently outperforms LORA, achieving a relative gain of around 3\% and 2\%, respectively, while maintaining parameter efficiency. Notably, CoLA enables the first multi-task PEFT framework for visual grounding, bridging a key gap in efficient multimodal adaptation.
CLJan 13
Qalb: Largest State-of-the-Art Urdu Large Language Model for 230M Speakers with Systematic Continued Pre-trainingMuhammad Taimoor Hassan, Jawad Ahmed, Muhammad Awais
Despite remarkable progress in large language models, Urdu-a language spoken by over 230 million people-remains critically underrepresented in modern NLP systems. Existing multilingual models demonstrate poor performance on Urdu-specific tasks, struggling with the language's complex morphology, right-to-left Nastaliq script, and rich literary traditions. Even the base LLaMA-3.1 8B-Instruct model shows limited capability in generating fluent, contextually appropriate Urdu text. We introduce Qalb, an Urdu language model developed through a two-stage approach: continued pre-training followed by supervised fine-tuning. Starting from LLaMA 3.1 8B, we perform continued pre-training on a dataset of 1.97 billion tokens. This corpus comprises 1.84 billion tokens of diverse Urdu text-spanning news archives, classical and contemporary literature, government documents, and social media-combined with 140 million tokens of English Wikipedia data to prevent catastrophic forgetting. We then fine-tune the resulting model on the Alif Urdu-instruct dataset. Through extensive evaluation on Urdu-specific benchmarks, Qalb demonstrates substantial improvements, achieving a weighted average score of 90.34 and outperforming the previous state-of-the-art Alif-1.0-Instruct model (87.1) by 3.24 points, while also surpassing the base LLaMA-3.1 8B-Instruct model by 44.64 points. Qalb achieves state-of-the-art performance with comprehensive evaluation across seven diverse tasks including Classification, Sentiment Analysis, and Reasoning. Our results demonstrate that continued pre-training on diverse, high-quality language data, combined with targeted instruction fine-tuning, effectively adapts foundation models to low-resource languages.
SDJun 13, 2025
SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic SoundscapesTony Alex, Sara Ahmed, Armin Mustafa et al.
Self-supervised pre-trained audio networks have seen widespread adoption in real-world systems, particularly in multi-modal large language models. These networks are often employed in a frozen state, under the assumption that the SSL pre-training has sufficiently equipped them to handle real-world audio. However, a critical question remains: how well do these models actually perform in real-world conditions, where audio is typically polyphonic and complex, involving multiple overlapping sound sources? Current audio SSL methods are often benchmarked on datasets predominantly featuring monophonic audio, such as environmental sounds, and speech. As a result, the ability of SSL models to generalize to polyphonic audio, a common characteristic in natural scenarios, remains underexplored. This limitation raises concerns about the practical robustness of SSL models in more realistic audio settings. To address this gap, we introduce Self-Supervised Learning from Audio Mixtures (SSLAM), a novel direction in audio SSL research, designed to improve, designed to improve the model's ability to learn from polyphonic data while maintaining strong performance on monophonic data. We thoroughly evaluate SSLAM on standard audio SSL benchmark datasets which are predominantly monophonic and conduct a comprehensive comparative analysis against SOTA methods using a range of high-quality, publicly available polyphonic datasets. SSLAM not only improves model performance on polyphonic audio, but also maintains or exceeds performance on standard audio SSL benchmarks. Notably, it achieves up to a 3.9\% improvement on the AudioSet-2M (AS-2M), reaching a mean average precision (mAP) of 50.2. For polyphonic datasets, SSLAM sets new SOTA in both linear evaluation and fine-tuning regimes with performance improvements of up to 9.1\% (mAP).
CVOct 23, 2024
Rethinking Positive Pairs in Contrastive LearningJiantao Wu, Sara Atito, Zhenhua Feng et al.
The training methods in AI do involve semantically distinct pairs of samples. However, their role typically is to enhance the between class separability. The actual notion of similarity is normally learned from semantically identical pairs. This paper presents SimLAP: a simple framework for learning visual representation from arbitrary pairs. SimLAP explores the possibility of learning similarity from semantically distinct sample pairs. The approach is motivated by the observation that for any pair of classes there exists a subspace in which semantically distinct samples exhibit similarity. This phenomenon can be exploited for a novel method of learning, which optimises the similarity of an arbitrary pair of samples, while simultaneously learning the enabling subspace. The feasibility of the approach will be demonstrated experimentally and its merits discussed.
CVMay 29, 2025
DVD: A Comprehensive Dataset for Advancing Violence Detection in Real-World ScenariosDimitrios Kollias, Damith C. Senadeera, Jianian Zheng et al.
Violence Detection (VD) has become an increasingly vital area of research. Existing automated VD efforts are hindered by the limited availability of diverse, well-annotated databases. Existing databases suffer from coarse video-level annotations, limited scale and diversity, and lack of metadata, restricting the generalization of models. To address these challenges, we introduce DVD, a large-scale (500 videos, 2.7M frames), frame-level annotated VD database with diverse environments, varying lighting conditions, multiple camera sources, complex social interactions, and rich metadata. DVD is designed to capture the complexities of real-world violent events.
CVDec 10, 2024
PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face GenerationFatemeh Nazarieh, Zhenhua Feng, Diptesh Kanojia et al.
Audio-driven talking face generation is a challenging task in digital communication. Despite significant progress in the area, most existing methods concentrate on audio-lip synchronization, often overlooking aspects such as visual quality, customization, and generalization that are crucial to producing realistic talking faces. To address these limitations, we introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. IdentityNet is designed to preserve identity features consistently across the generated video frames, while AnimateNet aims to enhance temporal coherence and motion consistency. This framework also integrates an audio input with the reference images, thereby reducing the reliance on reference-style videos prevalent in existing approaches. A key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms, which significantly expands creative control over the generated videos. Through extensive experiments, including a newly developed evaluation metric, our model demonstrates superior performance over the state-of-the-art methods, setting a new standard for the generation of customizable realistic talking faces suitable for real-world applications.
IVFeb 22, 2024
DiCoM -- Diverse Concept Modeling towards Enhancing Generalizability in Chest X-Ray StudiesAbhijeet Parida, Daniel Capellan-Martin, Sara Atito et al.
Chest X-Ray (CXR) is a widely used clinical imaging modality and has a pivotal role in the diagnosis and prognosis of various lung and heart related conditions. Conventional automated clinical diagnostic tool design strategies relying on radiology reads and supervised learning, entail the cumbersome requirement of high quality annotated training data. To address this challenge, self-supervised pre-training has proven to outperform supervised pre-training in numerous downstream vision tasks, representing a significant breakthrough in the field. However, medical imaging pre-training significantly differs from pre-training with natural images (e.g., ImageNet) due to unique attributes of clinical images. In this context, we introduce Diverse Concept Modeling (DiCoM), a novel self-supervised training paradigm that leverages a student teacher framework for learning diverse concepts and hence effective representation of the CXR data. Hence, expanding beyond merely modeling a single primary label within an image, instead, effectively harnessing the information from all the concepts inherent in the CXR. The pre-trained model is subsequently fine-tuned to address diverse domain-specific tasks. Our proposed paradigm consistently demonstrates robust performance across multiple downstream tasks on multiple datasets, highlighting the success and generalizability of the pre-training strategy. To establish the efficacy of our methods we analyze both the power of learned representations and the speed of convergence (SoC) of our models. For diverse data and tasks, DiCoM is able to achieve in most cases better results compared to other state-of-the-art pre-training strategies. This when combined with the higher SoC and generalization capabilities positions DiCoM to be established as a foundation model for CXRs, a widely used imaging modality.
AIJul 21, 2025
QSAF: A Novel Mitigation Framework for Cognitive Degradation in Agentic AIHammad Atta, Muhammad Zeeshan Baig, Yasir Mehmood et al.
We introduce Cognitive Degradation as a novel vulnerability class in agentic AI systems. Unlike traditional adversarial external threats such as prompt injection, these failures originate internally, arising from memory starvation, planner recursion, context flooding, and output suppression. These systemic weaknesses lead to silent agent drift, logic collapse, and persistent hallucinations over time. To address this class of failures, we introduce the Qorvex Security AI Framework for Behavioral & Cognitive Resilience (QSAF Domain 10), a lifecycle-aware defense framework defined by a six-stage cognitive degradation lifecycle. The framework includes seven runtime controls (QSAF-BC-001 to BC-007) that monitor agent subsystems in real time and trigger proactive mitigation through fallback routing, starvation detection, and memory integrity enforcement. Drawing from cognitive neuroscience, we map agentic architectures to human analogs, enabling early detection of fatigue, starvation, and role collapse. By introducing a formal lifecycle and real-time mitigation controls, this work establishes Cognitive Degradation as a critical new class of AI system vulnerability and proposes the first cross-platform defense model for resilient agentic behavior.
CRAug 4, 2025
DIRF: A Framework for Digital Identity Protection and Clone Governance in Agentic AI SystemsHammad Atta, Muhammad Zeeshan Baig, Yasir Mehmood et al.
The rapid advancement and widespread adoption of generative artificial intelligence (AI) pose significant threats to the integrity of personal identity, including digital cloning, sophisticated impersonation, and the unauthorized monetization of identity-related data. Mitigating these risks necessitates the development of robust AI-generated content detection systems, enhanced legal frameworks, and ethical guidelines. This paper introduces the Digital Identity Rights Framework (DIRF), a structured security and governance model designed to protect behavioral, biometric, and personality-based digital likeness attributes to address this critical need. Structured across nine domains and 63 controls, DIRF integrates legal, technical, and hybrid enforcement mechanisms to secure digital identity consent, traceability, and monetization. We present the architectural foundations, enforcement strategies, and key use cases supporting the need for a unified framework. This work aims to inform platform builders, legal entities, and regulators about the essential controls needed to enforce identity rights in AI-driven systems.
NAFeb 28, 2025
A data augmentation strategy for deep neural networks with application to epidemic modellingMuhammad Awais, Abu Safyan Ali, Giacomo Dimarco et al.
In this work, we integrate the predictive capabilities of compartmental disease dynamics models with machine learning ability to analyze complex, high-dimensional data and uncover patterns that conventional models may overlook. Specifically, we present a proof of concept demonstrating the application of data-driven methods and deep neural networks to a recently introduced Susceptible-Infected-Recovered type model with social features, including a saturated incidence rate, to improve epidemic prediction and forecasting. Our results show that a robust data augmentation strategy trough suitable data-driven models can improve the reliability of Feed-Forward Neural Networks and Nonlinear Autoregressive Networks, providing a complementary strategy to Physics-Informed Neural Networks, particularly in settings where data augmentation from mechanistic models can enhance learning. This approach enhances the ability to handle nonlinear dynamics and offers scalable, data-driven solutions for epidemic forecasting, prioritizing predictive accuracy over the constraints of physics-based models. Numerical simulations of the lockdown and post-lockdown phase of the COVID-19 epidemic in Italy and Spain validate our methodology.
HEP-EXFeb 18, 2025
Neuromorphic Readout for Hadron CalorimetersEnrico Lupi, Abhishek, Max Aehle et al.
We simulate hadrons impinging on a homogeneous lead-tungstate (PbWO4) calorimeter to investigate how the resulting light yield and its temporal structure, as detected by an array of light-sensitive sensors, can be processed by a neuromorphic computing system. Our model encodes temporal photon distributions as spike trains and employs a fully connected spiking neural network to estimate the total deposited energy, as well as the position and spatial distribution of the light emissions within the sensitive material. The extracted primitives offer valuable topological information about the shower development in the material, achieved without requiring a segmentation of the active medium. A potential nanophotonic implementation using III-V semiconductor nanowires is discussed. It can be both fast and energy efficient.
HEP-EXFeb 10, 2025
Unsupervised Particle Tracking with Neuromorphic ComputingEmanuele Coradin, Fabio Cufino, Muhammad Awais et al.
We study the application of a neural network architecture for identifying charged particle trajectories via unsupervised learning of delays and synaptic weights using a spike-time-dependent plasticity rule. In the considered model, the neurons receive time-encoded information on the position of particle hits in a tracking detector for a particle collider, modeled according to the geometry of the Compact Muon Solenoid Phase II detector. We show how a spiking neural network is capable of successfully identifying in a completely unsupervised way the signal left by charged particles in the presence of conspicuous noise from accidental or combinatorial hits. These results open the way to applications of neuromorphic computing to particle tracking, motivating further studies into its potential for real-time, low-power particle tracking in future high-energy physics experiments.
CVOct 26, 2025
MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity ControlFatemeh Nazarieh, Zhenhua Feng, Diptesh Kanojia et al.
Audio-driven talking face generation has gained significant attention for applications in digital media and virtual avatars. While recent methods improve audio-lip synchronization, they often struggle with temporal consistency, identity preservation, and customization, especially in long video generation. To address these issues, we propose MAGIC-Talk, a one-shot diffusion-based framework for customizable and temporally stable talking face generation. MAGIC-Talk consists of ReferenceNet, which preserves identity and enables fine-grained facial editing via text prompts, and AnimateNet, which enhances motion coherence using structured motion priors. Unlike previous methods requiring multiple reference images or fine-tuning, MAGIC-Talk maintains identity from a single image while ensuring smooth transitions across frames. Additionally, a progressive latent fusion strategy is introduced to improve long-form video quality by reducing motion inconsistencies and flickering. Extensive experiments demonstrate that MAGIC-Talk outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy, offering a robust solution for talking face generation.
QMOct 18, 2025
Few-Label Multimodal Modeling of SNP Variants and ECG Phenotypes Using Large Language Models for Cardiovascular Risk StratificationNiranjana Arun Menon, Yulong Li, Iqra Farooq et al.
Cardiovascular disease (CVD) risk stratification remains a major challenge due to its multifactorial nature and limited availability of high-quality labeled datasets. While genomic and electrophysiological data such as SNP variants and ECG phenotypes are increasingly accessible, effectively integrating these modalities in low-label settings is non-trivial. This challenge arises from the scarcity of well-annotated multimodal datasets and the high dimensionality of biological signals, which limit the effectiveness of conventional supervised models. To address this, we present a few-label multimodal framework that leverages large language models (LLMs) to combine genetic and electrophysiological information for cardiovascular risk stratification. Our approach incorporates a pseudo-label refinement strategy to adaptively distill high-confidence labels from weakly supervised predictions, enabling robust model fine-tuning with only a small set of ground-truth annotations. To enhance the interpretability, we frame the task as a Chain of Thought (CoT) reasoning problem, prompting the model to produce clinically relevant rationales alongside predictions. Experimental results demonstrate that the integration of multimodal inputs, few-label supervision, and CoT reasoning improves robustness and generalizability across diverse patient profiles. Experimental results using multimodal SNP variants and ECG-derived features demonstrated comparable performance to models trained on the full dataset, underscoring the promise of LLM-based few-label multimodal modeling for advancing personalized cardiovascular care.
CVSep 18, 2025
RespoDiff: Dual-Module Bottleneck Transformation for Responsible & Faithful T2I GenerationSilpa Vadakkeeveetil Sreelatha, Sauradip Nag, Muhammad Awais et al.
The rapid advancement of diffusion models has enabled high-fidelity and semantically rich text-to-image generation; however, ensuring fairness and safety remains an open challenge. Existing methods typically improve fairness and safety at the expense of semantic fidelity and image quality. In this work, we propose RespoDiff, a novel framework for responsible text-to-image generation that incorporates a dual-module transformation on the intermediate bottleneck representations of diffusion models. Our approach introduces two distinct learnable modules: one focused on capturing and enforcing responsible concepts, such as fairness and safety, and the other dedicated to maintaining semantic alignment with neutral prompts. To facilitate the dual learning process, we introduce a novel score-matching objective that enables effective coordination between the modules. Our method outperforms state-of-the-art methods in responsible generation by ensuring semantic alignment while optimizing both objectives without compromising image fidelity. Our approach improves responsible and semantically coherent generation by 20% across diverse, unseen prompts. Moreover, it integrates seamlessly into large-scale models like SDXL, enhancing fairness and safety. Code will be released upon acceptance.
LGAug 10, 2025
How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?Niranjana Arun Menon, Iqra Farooq, Yulong Li et al.
Cardiovascular disease (CVD) prediction remains a tremendous challenge due to its multifactorial etiology and global burden of morbidity and mortality. Despite the growing availability of genomic and electrophysiological data, extracting biologically meaningful insights from such high-dimensional, noisy, and sparsely annotated datasets remains a non-trivial task. Recently, LLMs has been applied effectively to predict structural variations in biological sequences. In this work, we explore the potential of fine-tuned LLMs to predict cardiac diseases and SNPs potentially leading to CVD risk using genetic markers derived from high-throughput genomic profiling. We investigate the effect of genetic patterns associated with cardiac conditions and evaluate how LLMs can learn latent biological relationships from structured and semi-structured genomic data obtained by mapping genetic aspects that are inherited from the family tree. By framing the problem as a Chain of Thought (CoT) reasoning task, the models are prompted to generate disease labels and articulate informed clinical deductions across diverse patient profiles and phenotypes. The findings highlight the promise of LLMs in contributing to early detection, risk assessment, and ultimately, the advancement of personalized medicine in cardiac care.
CVJun 18, 2025
Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM CaptioningAnkan Deria, Adinath Madhavrao Dukre, Feilong Tang et al.
Despite significant advances in inference-time search for vision-language models (VLMs), existing approaches remain both computationally expensive and prone to unpenalized, low-confidence generations which often lead to persistent hallucinations. We introduce \textbf{Value-guided Inference with Margin-based Reward (ViMaR)}, a two-stage inference framework that improves both efficiency and output fidelity by combining a temporal-difference value model with a margin-aware reward adjustment. In the first stage, we perform a single pass to identify the highest-value caption among diverse candidates. In the second stage, we selectively refine only those segments that were overlooked or exhibit weak visual grounding, thereby eliminating frequently rewarded evaluations. A calibrated margin-based penalty discourages low-confidence continuations while preserving descriptive richness. Extensive experiments across multiple VLM architectures demonstrate that ViMaR generates captions that are significantly more reliable, factually accurate, detailed, and explanatory, while achieving over 4$\times$ speedup compared to existing value-guided methods. Specifically, we show that ViMaR trained solely on LLaVA Mistral-7B, \textit{generalizes effectively to guide decoding in a stronger unseen model}. To further validate this, we adapt the ViMaR to steer generation in LLaVA-OneVision-Qwen2-7B, leading to consistent improvements in caption quality and demonstrating robust cross-model guidance. This cross-model generalization highlights ViMaR's flexibility and modularity, positioning it as a scalable and transferable inference-time decoding strategy. Furthermore, when ViMaR-generated captions are used for self-training, the underlying models achieve substantial gains across a broad suite of visual comprehension benchmarks, underscoring the potential of fast, accurate, and self-improving VLM pipelines.
SDJun 12, 2025
PAL: Probing Audio Encoders via LLMs - Audio Information Transfer into LLMsTony Alex, Wish Suharitdamrong, Sara Atito et al.
Integration of audio perception into large language models (LLMs) is an emerging research area for enabling machine listening applications, yet efficient transfer of rich audio semantics from audio encoders to LLMs remains underexplored. The most widely used integration paradigm projects the audio encoder output tokens into the LLM input space (e.g., via an MLP or a Q-Former), then prepends or inserts them to the text tokens. We refer to this generic scheme as Prepend to the LLM's input token space (PLITS) integration. We propose an efficient alternative, Lightweight Audio LLM Integration (LAL). LAL introduces audio representations solely via the attention mechanism within different layers of the LLM, bypassing its feedforward module. LAL encodes rich audio semantics at an appropriate level of abstraction for integration into different blocks of LLMs. Our design significantly reduces computational overhead compared to existing integration approaches. Observing with Whisper that the speech encoder benefits from PLITS integration, we propose an audio encoder aware approach for efficiently Probing Audio encoders via LLM (PAL), which employs PLITS integration for Whisper and LAL for general audio encoders. Under an identical training curriculum, LAL consistently maintains performance or outperforms existing integration approaches across multiple base LLMs and tasks. For general audio tasks, LAL improvement is up to 30% over a strong PLITS baseline while reducing memory usage by up to 64.1% and increasing throughput by up to 247.5%. Furthermore, for general audio-music-speech LLM, PAL performs on par with a fully PLITS integration-based system but with substantially improved computational and memory efficiency. Project page: https://ta012.github.io/PAL/
CVMay 24, 2025
C3R: Channel Conditioned Cell Representations for unified evaluation in microscopy imagingUmar Marikkar, Syed Sameed Husain, Muhammad Awais et al.
Immunohistochemical (IHC) images reveal detailed information about structures and functions at the subcellular level. However, unlike natural images, IHC datasets pose challenges for deep learning models due to their inconsistencies in channel count and configuration, stemming from varying staining protocols across laboratories and studies. Existing approaches build channel-adaptive models, which unfortunately fail to support out-of-distribution (OOD) evaluation across IHC datasets and cannot be applied in a true zero-shot setting with mismatched channel counts. To address this, we introduce a structured view of cellular image channels by grouping them into either context or concept, where we treat the context channels as a reference to the concept channels in the image. We leverage this context-concept principle to develop Channel Conditioned Cell Representations (C3R), a framework designed for unified evaluation on in-distribution (ID) and OOD datasets. C3R is a two-fold framework comprising a channel-adaptive encoder architecture and a masked knowledge distillation training strategy, both built around the context-concept principle. We find that C3R outperforms existing benchmarks on both ID and OOD tasks, while a trivial implementation of our core idea also outperforms the channel-adaptive methods reported on the CHAMMI benchmark. Our method opens a new pathway for cross-dataset generalization between IHC datasets, without requiring dataset-specific adaptation or retraining.
CVMay 23, 2025
Dual Branch VideoMamba with Gated Class Token Fusion for Violence DetectionDamith Chamalke Senadeera, Xiaoyun Yang, Shibo Li et al.
The rapid proliferation of surveillance cameras has increased the demand for automated violence detection. While CNNs and Transformers have shown success in extracting spatio-temporal features, they struggle with long-term dependencies and computational efficiency. We propose Dual Branch VideoMamba with Gated Class Token Fusion (GCTF), an efficient architecture combining a dual-branch design and a state-space model (SSM) backbone where one branch captures spatial features, while the other focuses on temporal dynamics. The model performs continuous fusion via a gating mechanism between the branches to enhance the model's ability to detect violent activities even in challenging surveillance scenarios. We also present a new benchmark by merging RWF-2000, RLVS, SURV and VioPeru datasets in video violence detection, ensuring strict separation between training and testing sets. Experimental results demonstrate that our model achieves state-of-the-art performance on this benchmark and also on DVD dataset which is another novel dataset on video violence detection, offering an optimal balance between accuracy and computational efficiency, demonstrating the promise of SSMs for scalable, near real-time surveillance violence detection.
CVJun 25, 2024
Investigating Self-Supervised Methods for Label-Efficient LearningSrinivasa Rao Nandam, Sara Atito, Zhenhua Feng et al.
Vision transformers combined with self-supervised learning have enabled the development of models which scale across large datasets for several downstream tasks like classification, segmentation and detection. The low-shot learning capability of these models, across several low-shot downstream tasks, has been largely under explored. We perform a system level study of different self supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling for their low-shot capabilities by comparing the pretrained models. In addition we also study the effects of collapse avoidance methods, namely centring, ME-MAX, sinkhorn, on these downstream tasks. Based on our detailed analysis, we introduce a framework involving both mask image modelling and clustering as pretext tasks, which performs better across all low-shot downstream tasks, including multi-class classification, multi-label classification and semantic segmentation. Furthermore, when testing the model on full scale datasets, we show performance gains in multi-class classification, multi-label classification and semantic segmentation.
CVJun 25, 2024
Pseudo Labelling for Enhanced Masked AutoencodersSrinivasa Rao Nandam, Sara Atito, Zhenhua Feng et al.
Masked Image Modeling (MIM)-based models, such as SdAE, CAE, GreenMIM, and MixAE, have explored different strategies to enhance the performance of Masked Autoencoders (MAE) by modifying prediction, loss functions, or incorporating additional architectural components. In this paper, we propose an enhanced approach that boosts MAE performance by integrating pseudo labelling for both class and data tokens, alongside replacing the traditional pixel-level reconstruction with token-level reconstruction. This strategy uses cluster assignments as pseudo labels to promote instance-level discrimination within the network, while token reconstruction requires generation of discrete tokens encapturing local context. The targets for pseudo labelling and reconstruction needs to be generated by a teacher network. To disentangle the generation of target pseudo labels and the reconstruction of the token features, we decouple the teacher into two distinct models, where one serves as a labelling teacher and the other as a reconstruction teacher. This separation proves empirically superior to a single teacher, while having negligible impact on throughput and memory consumption. Incorporating pseudo-labelling as an auxiliary task has demonstrated notable improvements in ImageNet-1K and other downstream tasks, including classification, semantic segmentation, and detection.
CVJun 6, 2024
Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt LearningAmandeep Kumar, Muhammad Awais, Sanath Narayan et al.
Drawing upon StyleGAN's expressivity and disentangled latent space, existing 2D approaches employ textual prompting to edit facial images with different attributes. In contrast, 3D-aware approaches that generate faces at different target poses require attribute-specific classifiers, learning separate model weights for each attribute, and are not scalable for novel attributes. In this work, we propose an efficient, plug-and-play, 3D-aware face editing framework based on attribute-specific prompt learning, enabling the generation of facial images with controllable attributes across various target poses. To this end, we introduce a text-driven learnable style token-based latent attribute editor (LAE). The LAE harnesses a pre-trained vision-language model to find text-guided attribute-specific editing direction in the latent space of any pre-trained 3D-aware GAN. It utilizes learnable style tokens and style mappers to learn and transform this editing direction to 3D latent space. To train LAE with multiple attributes, we use directional contrastive loss and style token loss. Furthermore, to ensure view consistency and identity preservation across different poses and attributes, we employ several 3D-aware identity and pose preservation losses. Our experiments show that our proposed framework generates high-quality images with 3D awareness and view consistency while maintaining attribute-specific features. We demonstrate the effectiveness of our method on different facial attributes, including hair color and style, expression, and others.
CVNov 30, 2021
MC-SSL0.0: Towards Multi-Concept Self-Supervised LearningSara Atito, Muhammad Awais, Ammarah Farooq et al.
Self-supervised pretraining is the method of choice for natural language processing models and is rapidly gaining popularity in many vision tasks. Recently, self-supervised pretraining has shown to outperform supervised pretraining for many downstream vision applications, marking a milestone in the area. This superiority is attributed to the negative impact of incomplete labelling of the training images, which convey multiple concepts, but are annotated using a single dominant class label. Although Self-Supervised Learning (SSL), in principle, is free of this limitation, the choice of pretext task facilitating SSL is perpetuating this shortcoming by driving the learning process towards a single concept output. This study aims to investigate the possibility of modelling all the concepts present in an image without using labels. In this aspect the proposed SSL frame-work MC-SSL0.0 is a step towards Multi-Concept Self-Supervised Learning (MC-SSL) that goes beyond modelling single dominant label in an image to effectively utilise the information from all the concepts present in it. MC-SSL0.0 consists of two core design concepts, group masked model learning and learning of pseudo-concept for data token using a momentum encoder (teacher-student) framework. The experimental results on multi-label and multi-class image classification downstream tasks demonstrate that MC-SSL0.0 not only surpasses existing SSL methods but also outperforms supervised transfer learning. The source code will be made publicly available for community to train on bigger corpus.
CVNov 25, 2021
Global Interaction Modelling in Vision Transformer via Super TokensAmmarah Farooq, Muhammad Awais, Sara Ahmed et al.
With the popularity of Transformer architectures in computer vision, the research focus has shifted towards developing computationally efficient designs. Window-based local attention is one of the major techniques being adopted in recent works. These methods begin with very small patch size and small embedding dimensions and then perform strided convolution (patch merging) in order to reduce the feature map size and increase embedding dimensions, hence, forming a pyramidal Convolutional Neural Network (CNN) like design. In this work, we investigate local and global information modelling in transformers by presenting a novel isotropic architecture that adopts local windows and special tokens, called Super tokens, for self-attention. Specifically, a single Super token is assigned to each image window which captures the rich local details for that window. These tokens are then employed for cross-window communication and global representation learning. Hence, most of the learning is independent of the image patches $(N)$ in the higher layers, and the class embedding is learned solely based on the Super tokens $(N/M^2)$ where $M^2$ is the window size. In standard image classification on Imagenet-1K, the proposed Super tokens based transformer (STT-S25) achieves 83.5\% accuracy which is equivalent to Swin transformer (Swin-B) with circa half the number of parameters (49M) and double the inference time throughput. The proposed Super token transformer offers a lightweight and promising backbone for visual recognition tasks.
LGNov 9, 2021
MixACM: Mixup-Based Robustness Transfer via Distillation of Activated Channel MapsMuhammad Awais, Fengwei Zhou, Chuanlong Xie et al.
Deep neural networks are susceptible to adversarially crafted, small and imperceptible changes in the natural inputs. The most effective defense mechanism against these examples is adversarial training which constructs adversarial examples during training by iterative maximization of loss. The model is then trained to minimize the loss on these constructed examples. This min-max optimization requires more data, larger capacity models, and additional computing resources. It also degrades the standard generalization performance of a model. Can we achieve robustness more efficiently? In this work, we explore this question from the perspective of knowledge transfer. First, we theoretically show the transferability of robustness from an adversarially trained teacher model to a student model with the help of mixup augmentation. Second, we propose a novel robustness transfer method called Mixup-Based Activated Channel Maps (MixACM) Transfer. MixACM transfers robustness from a robust teacher to a student by matching activated channel maps generated without expensive adversarial perturbations. Finally, extensive experiments on multiple datasets and different learning scenarios show our method can transfer robustness while also improving generalization on natural images.