Shafin Rahman

CV
h-index29
48papers
1,247citations
Novelty54%
AI Score60

48 Papers

LGApr 27, 2023Code
Uncertainty Aware Neural Network from Similarity and Sensitivity

H M Dipu Kabir, Subrota Kumar Mondal, Sadia Khanam et al.

Researchers have proposed several approaches for neural network (NN) based uncertainty quantification (UQ). However, most of the approaches are developed considering strong assumptions. Uncertainty quantification algorithms often perform poorly in an input domain and the reason for poor performance remains unknown. Therefore, we present a neural network training method that considers similar samples with sensitivity awareness in this paper. In the proposed NN training method for UQ, first, we train a shallow NN for the point prediction. Then, we compute the absolute differences between prediction and targets and train another NN for predicting those absolute differences or absolute errors. Domains with high average absolute errors represent a high uncertainty. In the next step, we select each sample in the training set one by one and compute both prediction and error sensitivities. Then we select similar samples with sensitivity consideration and save indexes of similar samples. The ranges of an input parameter become narrower when the output is highly sensitive to that parameter. After that, we construct initial uncertainty bounds (UB) by considering the distribution of sensitivity aware similar samples. Prediction intervals (PIs) from initial uncertainty bounds are larger and cover more samples than required. Therefore, we train bound correction NN. As following all the steps for finding UB for each sample requires a lot of computation and memory access, we train a UB computation NN. The UB computation NN takes an input sample and provides an uncertainty bound. The UB computation NN is the final product of the proposed approach. Scripts of the proposed method are available in the following GitHub repository: github.com/dipuk0506/UQ

CVJun 4
HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning

Moshiur Farazi, Sameera Ramasinghe, Mahbub Ahmed Turza et al.

Vision-Language Models (VLMs) struggle with compositional reasoning that requires understanding inter-object relationships. A natural remedy is to inject explicit scene graph triplets $\langle s, p, o \rangle$ from an off-the-shelf scene graph generator (SGG), but we show this backfires: discrete text labels collide with the continuous visual modality, degrading GQA accuracy from 60.38\% to 58.86\%. We propose \textbf{HyperVis}, which bypasses the SGG semantic bottleneck entirely. From $N$ class-agnostic region proposals, we compute a dense $O(N^2)$ visual relation tensor via spatially-biased cross-attention, project it onto a Lorentz hyperboloid, and enforce hierarchy through spatial physics, namely IoA-driven entailment cones and exterior-angle repulsion. We discover that HyperVis contributes in two complementary ways: (1) as a \emph{training-time regularizer}, the hyperbolic relational losses shape LoRA representations that improve generative VQA (GQA 61.03\% vs.\ 57.21\% for LoRA fine-tuning without relational losses, recovering and surpassing the baseline); and (2) as an \emph{inference-time relational encoder}, hyperbolic prefix tokens boost discriminative compositional scoring (SugarCrepe 79.94\%, $+$6.25pp over baseline). The learned curvature stabilises at $κ{=}4.0$, an order of magnitude above prior hyperbolic VLMs where $κ$ typically collapses toward zero, indicating that continuous visual features genuinely require the exponential volume of strongly curved space. A controlled Euclidean ablation confirms this decomposition: the relational pipeline regularises LoRA comparably in flat space (GQA 60.81\%), but the compositionality gain is specifically hyperbolic (SugarCrepe $+$4.58pp over Euclidean), with entailment loss ${\sim}6{\times}$ higher in Euclidean training. Codes are available at TBA.

CVOct 5, 2023Code
LumiNet: Perception-Driven Knowledge Distillation via Statistical Logit Calibration

Md. Ismail Hossain, M M Lutfe Elahi, Sameera Ramasinghe et al.

In the knowledge distillation literature, feature-based methods have dominated due to their ability to effectively tap into extensive teacher models. In contrast, logit-based approaches, which aim to distill "dark knowledge" from teachers, typically exhibit inferior performance compared to feature-based methods. To bridge this gap, we present LumiNet, a novel knowledge distillation algorithm designed to enhance logit-based distillation. We introduce the concept of "perception", aiming to calibrate logits based on the model's representation capability. This concept addresses overconfidence issues in the logit-based distillation method while also introducing a novel method to distill knowledge from the teacher. It reconstructs the logits of a sample/instances by considering relationships with other samples in the batch. LumiNet excels on benchmarks like CIFAR-100, ImageNet, and MSCOCO, outperforming the leading feature-based methods, e.g., compared to KD with ResNet18 and MobileNetV2 on ImageNet, it shows improvements of 1.5% and 2.05%, respectively. Codes are available at https://github.com/ismail31416/LumiNet.

CVAug 26, 2024Code
3D Point Cloud Network Pruning: When Some Weights Do not Matter

Amrijit Biswas, Md. Ismail Hossain, M M Lutfe Elahi et al.

A point cloud is a crucial geometric data structure utilized in numerous applications. The adoption of deep neural networks referred to as Point Cloud Neural Networks (PC- NNs), for processing 3D point clouds, has significantly advanced fields that rely on 3D geometric data to enhance the efficiency of tasks. Expanding the size of both neural network models and 3D point clouds introduces significant challenges in minimizing computational and memory requirements. This is essential for meeting the demanding requirements of real-world applications, which prioritize minimal energy consumption and low latency. Therefore, investigating redundancy in PCNNs is crucial yet challenging due to their sensitivity to parameters. Additionally, traditional pruning methods face difficulties as these networks rely heavily on weights and points. Nonetheless, our research reveals a promising phenomenon that could refine standard PCNN pruning techniques. Our findings suggest that preserving only the top p% of the highest magnitude weights is crucial for accuracy preservation. For example, pruning 99% of the weights from the PointNet model still results in accuracy close to the base level. Specifically, in the ModelNet40 dataset, where the base accuracy with the PointNet model was 87. 5%, preserving only 1% of the weights still achieves an accuracy of 86.8%. Codes are available in: https://github.com/apurba-nsu-rnd-lab/PCNN_Pruning

CVOct 5, 2023
Continual Test-time Domain Adaptation via Dynamic Sample Selection

Yanshuo Wang, Jie Hong, Ali Cheraghian et al.

The objective of Continual Test-time Domain Adaptation (CTDA) is to gradually adapt a pre-trained model to a sequence of target domains without accessing the source data. This paper proposes a Dynamic Sample Selection (DSS) method for CTDA. DSS consists of dynamic thresholding, positive learning, and negative learning processes. Traditionally, models learn from unlabeled unknown environment data and equally rely on all samples' pseudo-labels to update their parameters through self-training. However, noisy predictions exist in these pseudo-labels, so all samples are not equally trustworthy. Therefore, in our method, a dynamic thresholding module is first designed to select suspected low-quality from high-quality samples. The selected low-quality samples are more likely to be wrongly predicted. Therefore, we apply joint positive and negative learning on both high- and low-quality samples to reduce the risk of using wrong information. We conduct extensive experiments that demonstrate the effectiveness of our proposed method for CTDA in the image domain, outperforming the state-of-the-art results. Furthermore, our approach is also evaluated in the 3D point cloud domain, showcasing its versatility and potential for broader applicability.

CVOct 30, 2022
Exemplar Guided Deep Neural Network for Spatial Transcriptomics Analysis of Gene Expression Prediction

Yan Yang, Md Zakir Hossain, Eric A Stone et al.

Spatial transcriptomics (ST) is essential for understanding diseases and developing novel treatments. It measures gene expression of each fine-grained area (i.e., different windows) in the tissue slide with low throughput. This paper proposes an Exemplar Guided Network (EGN) to accurately and efficiently predict gene expression directly from each window of a tissue slide image. We apply exemplar learning to dynamically boost gene expression prediction from nearest/similar exemplars of a given tissue slide image window. Our EGN framework composes of three main components: 1) an extractor to structure a representation space for unsupervised exemplar retrievals; 2) a vision transformer (ViT) backbone to progressively extract representations of the input window; and 3) an Exemplar Bridging (EB) block to adaptively revise the intermediate ViT representations by using the nearest exemplars. Finally, we complete the gene expression prediction task with a simple attention-based prediction block. Experiments on standard benchmark datasets indicate the superiority of our approach when comparing with the past state-of-the-art (SOTA) methods.

CLAug 20, 2024Code
Beyond Labels: Aligning Large Language Models with Human-like Reasoning

Muhammad Rafsan Kabir, Rafeed Mohammad Sultan, Ihsanul Haque Asif et al.

Aligning large language models (LLMs) with a human reasoning approach ensures that LLMs produce morally correct and human-like decisions. Ethical concerns are raised because current models are prone to generating false positives and providing malicious responses. To contribute to this issue, we have curated an ethics dataset named Dataset for Aligning Reasons (DFAR), designed to aid in aligning language models to generate human-like reasons. The dataset comprises statements with ethical-unethical labels and their corresponding reasons. In this study, we employed a unique and novel fine-tuning approach that utilizes ethics labels and their corresponding reasons (L+R), in contrast to the existing fine-tuning approach that only uses labels (L). The original pre-trained versions, the existing fine-tuned versions, and our proposed fine-tuned versions of LLMs were then evaluated on an ethical-unethical classification task and a reason-generation task. Our proposed fine-tuning strategy notably outperforms the others in both tasks, achieving significantly higher accuracy scores in the classification task and lower misalignment rates in the reason-generation task. The increase in classification accuracies and decrease in misalignment rates indicate that the L+R fine-tuned models align more with human ethics. Hence, this study illustrates that injecting reasons has substantially improved the alignment of LLMs, resulting in more human-like responses. We have made the DFAR dataset and corresponding codes publicly available at https://github.com/apurba-nsu-rnd-lab/DFAR.

CVMay 30, 2022
Few-shot Class-incremental Learning for 3D Point Cloud Objects

Townim Chowdhury, Ali Cheraghian, Sameera Ramasinghe et al.

Few-shot class-incremental learning (FSCIL) aims to incrementally fine-tune a model (trained on base classes) for a novel set of classes using a few examples without forgetting the previous training. Recent efforts address this problem primarily on 2D images. However, due to the advancement of camera technology, 3D point cloud data has become more available than ever, which warrants considering FSCIL on 3D data. This paper addresses FSCIL in the 3D domain. In addition to well-known issues of catastrophic forgetting of past knowledge and overfitting of few-shot data, 3D FSCIL can bring newer challenges. For example, base classes may contain many synthetic instances in a realistic scenario. In contrast, only a few real-scanned samples (from RGBD sensors) of novel classes are available in incremental steps. Due to the data variation from synthetic to real, FSCIL endures additional challenges, degrading performance in later incremental steps. We attempt to solve this problem using Microshapes (orthogonal basis vectors) by describing any 3D objects using a pre-defined set of rules. It supports incremental training with few-shot examples minimizing synthetic to real data variation. We propose new test protocols for 3D FSCIL using popular synthetic datasets (ModelNet and ShapeNet) and 3D real-scanned datasets (ScanObjectNN and CO3D). By comparing state-of-the-art methods, we establish the effectiveness of our approach in the 3D domain.

QMMar 23, 2022
DPST: De Novo Peptide Sequencing with Amino-Acid-Aware Transformers

Yan Yang, Zakir Hossain, Khandaker Asif et al.

De novo peptide sequencing aims to recover amino acid sequences of a peptide from tandem mass spectrometry (MS) data. Existing approaches for de novo analysis enumerate MS evidence for all amino acid classes during inference. It leads to over-trimming on receptive fields of MS data and restricts MS evidence associated with following undecoded amino acids. Our approach, DPST, circumvents these limitations with two key components: (1) A confidence value aggregation encoder to sketch spectrum representations according to amino-acid-based connectivity among MS; (2) A global-local fusion decoder to progressively assimilate contextualized spectrum representations with a predefined preconception of localized MS evidence and amino acid priors. Our components originate from a closed-form solution and selectively attend to informative amino-acid-aware MS representations. Through extensive empirical studies, we demonstrate the superiority of DPST, showing that it outperforms state-of-the-art approaches by a margin of 12% - 19% peptide accuracy.

AIMay 23, 2022
Rethinking Task-Incremental Learning Baselines

Md Sazzad Hossain, Pritom Saha, Townim Faisal Chowdhury et al.

It is common to have continuous streams of new data that need to be introduced in the system in real-world applications. The model needs to learn newly added capabilities (future tasks) while retaining the old knowledge (past tasks). Incremental learning has recently become increasingly appealing for this problem. Task-incremental learning is a kind of incremental learning where task identity of newly included task (a set of classes) remains known during inference. A common goal of task-incremental methods is to design a network that can operate on minimal size, maintaining decent performance. To manage the stability-plasticity dilemma, different methods utilize replay memory of past tasks, specialized hardware, regularization monitoring etc. However, these methods are still less memory efficient in terms of architecture growth or input data costs. In this study, we present a simple yet effective adjustment network (SAN) for task incremental learning that achieves near state-of-the-art performance while using minimal architectural size without using memory instances compared to previous state-of-the-art approaches. We investigate this approach on both 3D point cloud object (ModelNet40) and 2D image (CIFAR10, CIFAR100, MiniImageNet, MNIST, PermutedMNIST, notMNIST, SVHN, and FashionMNIST) recognition tasks and establish a strong baseline result for a fair comparison with existing methods. On both 2D and 3D domains, we also observe that SAN is primarily unaffected by different task orders in a task-incremental setting.

CVDec 24, 2022
COLT: Cyclic Overlapping Lottery Tickets for Faster Pruning of Convolutional Neural Networks

Md. Ismail Hossain, Mohammed Rakib, M. M. Lutfe Elahi et al.

Pruning refers to the elimination of trivial weights from neural networks. The sub-networks within an overparameterized model produced after pruning are often called Lottery tickets. This research aims to generate winning lottery tickets from a set of lottery tickets that can achieve similar accuracy to the original unpruned network. We introduce a novel winning ticket called Cyclic Overlapping Lottery Ticket (COLT) by data splitting and cyclic retraining of the pruned network from scratch. We apply a cyclic pruning algorithm that keeps only the overlapping weights of different pruned models trained on different data segments. Our results demonstrate that COLT can achieve similar accuracies (obtained by the unpruned model) while maintaining high sparsities. We show that the accuracy of COLT is on par with the winning tickets of Lottery Ticket Hypothesis (LTH) and, at times, is better. Moreover, COLTs can be generated using fewer iterations than tickets generated by the popular Iterative Magnitude Pruning (IMP) method. In addition, we also notice COLTs generated on large datasets can be transferred to small ones without compromising performance, demonstrating its generalizing capability. We conduct all our experiments on Cifar-10, Cifar-100 & TinyImageNet datasets and report superior performance than the state-of-the-art methods.

CVOct 9, 2022
Less is More: Facial Landmarks can Recognize a Spontaneous Smile

Md. Tahrim Faroque, Yan Yang, Md Zakir Hossain et al.

Smile veracity classification is a task of interpreting social interactions. Broadly, it distinguishes between spontaneous and posed smiles. Previous approaches used hand-engineered features from facial landmarks or considered raw smile videos in an end-to-end manner to perform smile classification tasks. Feature-based methods require intervention from human experts on feature engineering and heavy pre-processing steps. On the contrary, raw smile video inputs fed into end-to-end models bring more automation to the process with the cost of considering many redundant facial features (beyond landmark locations) that are mainly irrelevant to smile veracity classification. It remains unclear to establish discriminative features from landmarks in an end-to-end manner. We present a MeshSmileNet framework, a transformer architecture, to address the above limitations. To eliminate redundant facial features, our landmarks input is extracted from Attention Mesh, a pre-trained landmark detector. Again, to discover discriminative features, we consider the relativity and trajectory of the landmarks. For the relativity, we aggregate facial landmark that conceptually formats a curve at each frame to establish local spatial features. For the trajectory, we estimate the movements of landmark composed features across time by self-attention mechanism, which captures pairwise dependency on the trajectory of the same landmark. This idea allows us to achieve state-of-the-art performances on UVA-NEMO, BBC, MMI Facial Expression, and SPOS datasets.

LGNov 14, 2025Code
Dynamic Temperature Scheduler for Knowledge Distillation

Sibgat Ul Islam, Jawad Ibn Ahad, Fuad Rahman et al.

Knowledge Distillation (KD) trains a smaller student model using a large, pre-trained teacher model, with temperature as a key hyperparameter controlling the softness of output probabilities. Traditional methods use a fixed temperature throughout training, which is suboptimal. Moreover, architectural differences between teacher and student often result in mismatched logit magnitudes. We demonstrate that students benefit from softer probabilities early in training but require sharper probabilities in later stages. We introduce Dynamic Temperature Scheduler (DTS), which adjusts temperature dynamically based on the cross-entropy loss gap between teacher and student. To our knowledge, this is the first temperature scheduling method that adapts based on the divergence between teacher and student distributions. Our method integrates seamlessly with existing KD frameworks. We validate DTS across multiple KD strategies on vision (CIFAR-100, Tiny-ImageNet) and NLP tasks (GLUE, Dolly, SelfIns, UnNI, S-NI), consistently outperforming static-temperature baselines. Code is available at https://github.com/Sibgat-Ul/DTS.

CVSep 29, 2022
Prompt-guided Scene Generation for 3D Zero-Shot Learning

Majid Nasiri, Ali Cheraghian, Townim Faisal Chowdhury et al.

Zero-shot learning on 3D point cloud data is a related underexplored problem compared to its 2D image counterpart. 3D data brings new challenges for ZSL due to the unavailability of robust pre-trained feature extraction models. To address this problem, we propose a prompt-guided 3D scene generation and supervision method that augments 3D data to learn the network better, exploring the complex interplay of seen and unseen objects. First, we merge point clouds of two 3D models in certain ways described by a prompt. The prompt acts like the annotation describing each 3D scene. Later, we perform contrastive learning to train our proposed architecture in an end-to-end manner. We argue that 3D scenes can relate objects more efficiently than single objects because popular language models (like BERT) can achieve high performance when objects appear in a context. Our proposed prompt-guided scene generation method encapsulates data augmentation and prompt-based annotation/captioning to improve 3D ZSL performance. We have achieved state-of-the-art ZSL and generalized ZSL performance on synthetic (ModelNet40, ModelNet10) and real-scanned (ScanOjbectNN) 3D object datasets.

CVDec 9, 2025
Beyond Real Weights: Hypercomplex Representations for Stable Quantization

Jawad Ibn Ahad, Maisha Rahman, Amrijit Biswas et al.

Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive reparameterization strategy that compresses these models by gradually replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers. A residual interpolation schedule, together with lightweight reconstruction and knowledge distillation losses, ensures that the PHM modules inherit the functional behavior of their dense counterparts during training. This transition yields substantial parameter and FLOP reductions while preserving strong multimodal alignment, enabling faster inference without degrading output quality. We evaluate the approach on multiple vision-language models (VLMs). Our method maintains performance comparable to the base models while delivering significant reductions in model size and inference latency. Progressive PHM substitution thus offers an architecture-compatible path toward more efficient multimodal reasoning and complements existing low-bit quantization techniques.

CLNov 14, 2025
LAET: A Layer-wise Adaptive Ensemble Tuning Framework for Pretrained Language Models

Jawad Ibn Ahad, Muhammad Rafsan Kabir, Robin Krambroeckers et al.

Natural Language Processing (NLP) has transformed the financial industry, enabling advancements in areas such as textual analysis, risk management, and forecasting. Large language models (LLMs) like BloombergGPT and FinMA have set new benchmarks across various financial NLP tasks, including sentiment analysis, stock movement prediction, and credit risk assessment. Furthermore, FinMA-ES, a bilingual financial LLM, has also demonstrated strong performance using the FLARE and FLARE-ES benchmarks. However, the high computational demands of these models limit the accessibility of many organizations. To address this, we propose Layer-wise Adaptive Ensemble Tuning (LAET), a novel strategy that selectively fine-tunes the most effective layers of pre-trained LLMs by analyzing hidden state representations while freezing less critical layers. LAET significantly reduces computational overhead while enhancing task-specific performance. Our approach shows strong results in financial NLP tasks, outperforming existing benchmarks and state-of-the-art LLMs such as GPT-4, even with smaller LLMs ($\sim$3B parameters). This work bridges cutting-edge financial NLP research and real-world deployment with efficient and scalable models for financial applications.

CVMar 27, 2024Code
Backpropagation-free Network for 3D Test-time Adaptation

Yanshuo Wang, Ali Cheraghian, Zeeshan Hayder et al.

Real-world systems often encounter new data over time, which leads to experiencing target domain shifts. Existing Test-Time Adaptation (TTA) methods tend to apply computationally heavy and memory-intensive backpropagation-based approaches to handle this. Here, we propose a novel method that uses a backpropagation-free approach for TTA for the specific case of 3D data. Our model uses a two-stream architecture to maintain knowledge about the source domain as well as complementary target-domain-specific information. The backpropagation-free property of our model helps address the well-known forgetting problem and mitigates the error accumulation issue. The proposed method also eliminates the need for the usually noisy process of pseudo-labeling and reliance on costly self-supervised training. Moreover, our method leverages subspace learning, effectively reducing the distribution variance between the two domains. Furthermore, the source-domain-specific and the target-domain-specific streams are aligned using a novel entropy-based adaptive fusion strategy. Extensive experiments on popular benchmarks demonstrate the effectiveness of our method. The code will be available at \url{https://github.com/abie-e/BFTT3D}.

CVOct 18, 2023
ChatGPT-guided Semantics for Zero-shot Learning

Fahimul Hoque Shubho, Townim Faisal Chowdhury, Ali Cheraghian et al.

Zero-shot learning (ZSL) aims to classify objects that are not observed or seen during training. It relies on class semantic description to transfer knowledge from the seen classes to the unseen classes. Existing methods of obtaining class semantics include manual attributes or automatic word vectors from language models (like word2vec). We know attribute annotation is costly, whereas automatic word-vectors are relatively noisy. To address this problem, we explore how ChatGPT, a large language model, can enhance class semantics for ZSL tasks. ChatGPT can be a helpful source to obtain text descriptions for each class containing related attributes and semantics. We use the word2vec model to get a word vector using the texts from ChatGPT. Then, we enrich word vectors by combining the word embeddings from class names and descriptions generated by ChatGPT. More specifically, we leverage ChatGPT to provide extra supervision for the class description, eventually benefiting ZSL models. We evaluate our approach on various 2D image (CUB and AwA) and 3D point cloud (ModelNet10, ModelNet40, and ScanObjectNN) datasets and show that it improves ZSL performance. Our work contributes to the ZSL literature by applying ChatGPT for class semantics enhancement and proposing a novel word vector fusion method.

CVNov 21, 2024Code
Test-Time Adaptation of 3D Point Clouds via Denoising Diffusion Models

Hamidreza Dastmalchi, Aijun An, Ali Cheraghian et al.

Test-time adaptation (TTA) of 3D point clouds is crucial for mitigating discrepancies between training and testing samples in real-world scenarios, particularly when handling corrupted point clouds. LiDAR data, for instance, can be affected by sensor failures or environmental factors, causing domain gaps. Adapting models to these distribution shifts online is crucial, as training for every possible variation is impractical. Existing methods often focus on fine-tuning pre-trained models based on self-supervised learning or pseudo-labeling, which can lead to forgetting valuable source domain knowledge over time and reduce generalization on future tests. In this paper, we introduce a novel 3D test-time adaptation method, termed 3DD-TTA, which stands for 3D Denoising Diffusion Test-Time Adaptation. This method uses a diffusion strategy that adapts input point cloud samples to the source domain while keeping the source model parameters intact. The approach uses a Variational Autoencoder (VAE) to encode the corrupted point cloud into a shape latent and latent points. These latent points are corrupted with Gaussian noise and subjected to a denoising diffusion process. During this process, both the shape latent and latent points are updated to preserve fidelity, guiding the denoising toward generating consistent samples that align more closely with the source domain. We conduct extensive experiments on the ShapeNet dataset and investigate its generalizability on ModelNet40 and ScanObjectNN, achieving state-of-the-art results. The code has been released at \url{https://github.com/hamidreza-dastmalchi/3DD-TTA}.

CVDec 2, 2024Code
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed et al.

Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video analysis. Recent joint prediction transformer models often overlook their cross-task dynamics and video-text alignment and refinement. Moreover, most models typically use limited, uni-directional attention mechanisms, resulting in weakly integrated representations and suboptimal performance in capturing the interdependence between video and text modalities. Although large-language and vision-language models (LLM/LVLMs) have gained prominence across various domains, their application in this field remains relatively underexplored. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules with an alignment loss for better video-text feature alignment, (ii) Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware clip representations, and (iii) Uni-directional joint-task feedback mechanism enhancing both tasks through correlation. In addition, (iv) we introduce hard positive/negative losses for adaptive error penalization and improved learning, and (v) leverage LVLMs like BLIP-2 for enhanced multimodal feature integration and intelligent pretraining using synthetic data generated from LVLMs. Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance. Codes and models are available at https://github.com/dpaul06/VideoLights .

CVOct 11, 2024Code
Foundation Model-Powered 3D Few-Shot Class Incremental Learning via Training-free Adaptor

Sahar Ahmadi, Ali Cheraghian, Morteza Saberi et al.

Recent advances in deep learning for processing point clouds hold increased interest in Few-Shot Class Incremental Learning (FSCIL) for 3D computer vision. This paper introduces a new method to tackle the Few-Shot Continual Incremental Learning (FSCIL) problem in 3D point cloud environments. We leverage a foundational 3D model trained extensively on point cloud data. Drawing from recent improvements in foundation models, known for their ability to work well across different tasks, we propose a novel strategy that does not require additional training to adapt to new tasks. Our approach uses a dual cache system: first, it uses previous test samples based on how confident the model was in its predictions to prevent forgetting, and second, it includes a small number of new task samples to prevent overfitting. This dynamic adaptation ensures strong performance across different learning tasks without needing lots of fine-tuning. We tested our approach on datasets like ModelNet, ShapeNet, ScanObjectNN, and CO3D, showing that it outperforms other FSCIL methods and demonstrating its effectiveness and versatility. The code is available at \url{https://github.com/ahmadisahar/ACCV_FCIL3D}.

LGAug 18, 2025Code
Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining

Samiul Basir Bhuiyan, Md. Sazzad Hossain Adib, Mohammed Aman Bhuiyan et al.

Large language models (LLMs) have rapidly advanced in recent years, achieving remarkable performance across a wide range of natural language processing tasks. However, this progress has come at the cost of increasingly large model sizes, which pose significant challenges for deployment, scalability, and energy efficiency. To address these limitations, post-training pruning has emerged as a promising approach for reducing model size and inference latency without the need for retraining. Despite these advantages, many existing pruning methods result in substantial performance degradation or require computationally expensive fine-tuning. In this work, we introduce Z-Pruner, a novel post-training pruning method designed to induce sparsity in pretrained LLMs without any retraining. Unlike conventional approaches, Z-Pruner leverages both weight update magnitudes and activation patterns to identify and eliminate redundant parameters more effectively. Our method is model-agnostic, efficient, and easy to implement. We evaluate Z-Pruner using multiple widely-used LLM architectures, including LLaMA-2, LLaMA-3, and OPT, across a diverse set of standard language benchmarks. Experimental results demonstrate that Z-Pruner surpasses state-of-the-art pruning methods that require intensive weight updates. Specifically, Z-Pruner achieves the lowest perplexity scores and the highest overall average score for zero-shot accuracy. We have made the corresponding codes publicly available at https://github.com/sazzadadib/Z-Pruner.

CVApr 15, 2025Code
TFMPathy: Tabular Foundation Model for Privacy-Aware, Generalisable Empathy Detection from Videos

Md Rakibul Hasan, Md Zakir Hossain, Aneesh Krishna et al.

Detecting empathy from video interactions is an emerging area of research, particularly in healthcare and social robotics. However, privacy and ethical concerns often prevent the release of raw video data, with many datasets instead shared as pre-extracted tabular features. Previous work on such datasets has established classical tree-based models as the state of the art. Motivated by recent successes of large-scale foundation models for text, we investigate the potential of tabular foundation models (TFMs) for empathy detection from video-derived tabular data. Our proposed system, TFMPathy, is demonstrated with two recent TFMs (TabPFN v2 and TabICL) under both in-context learning and fine-tuning paradigms. On a public human-robot interaction benchmark, TFMPathy significantly improves empathy detection accuracy reported in the literature. While the established evaluation protocol in the literature does not ensure cross-subject generalisation, our evaluation scheme also captures such generalisation. We show that TFMPathy under a fine-tuning setup has better cross-subject generalisation capacity over baseline methods (accuracy: $0.590 \rightarrow 0.730$; AUC: $0.564 \rightarrow 0.669$). Given the ongoing privacy and ethical constraints around raw video sharing, the proposed TFMPathy system provides a practical and scalable path toward building AI systems dependent on human-centred video datasets. Our code is publicly available at https://github.com/hasan-rakibul/TFMPathy (will be made available upon acceptance of this paper).

CLJan 1, 2025Code
Labels Generated by Large Language Models Help Measure People's Empathy in Vitro

Md Rakibul Hasan, Yue Yao, Md Zakir Hossain et al.

Large language models (LLMs) have revolutionised many fields, with LLM-as-a-service (LLMSaaS) offering accessible, general-purpose solutions without costly task-specific training. In contrast to the widely studied prompt engineering for directly solving tasks (in vivo), this paper explores LLMs' potential for in-vitro applications: using LLM-generated labels to improve supervised training of mainstream models. We examine two strategies - (1) noisy label correction and (2) training data augmentation - in empathy computing, an emerging task to predict psychology-based questionnaire outcomes from inputs like textual narratives. Crowdsourced datasets in this domain often suffer from noisy labels that misrepresent underlying empathy. We show that replacing or supplementing these crowdsourced labels with LLM-generated labels, developed using psychology-based scale-aware prompts, achieves statistically significant accuracy improvements. Notably, the RoBERTa pre-trained language model (PLM) trained with noise-reduced labels yields a state-of-the-art Pearson correlation coefficient of 0.648 on the public NewsEmp benchmarks. This paper further analyses evaluation metric selection and demographic biases to help guide the future development of more equitable empathy computing models. Code and LLM-generated labels are available at https://github.com/hasan-rakibul/LLMPathy.

CVOct 8, 2025Code
Label Semantics for Robust Hyperspectral Image Classification

Rafin Hassan, Zarin Tasnim Roshni, Rafiqul Bari et al.

Hyperspectral imaging (HSI) classification is a critical tool with widespread applications across diverse fields such as agriculture, environmental monitoring, medicine, and materials science. Due to the limited availability of high-quality training samples and the high dimensionality of spectral data, HSI classification models are prone to overfitting and often face challenges in balancing accuracy and computational complexity. Furthermore, most of HSI classification models are monomodal, where it solely relies on spectral-spatial data to learn decision boundaries in the high dimensional embedding space. To address this, we propose a general-purpose Semantic Spectral-Spatial Fusion Network (S3FN) that uses contextual, class specific textual descriptions to complement the training of an HSI classification model. Specifically, S3FN leverages LLMs to generate comprehensive textual descriptions for each class label that captures their unique characteristics and spectral behaviors. These descriptions are then embedded into a vector space using a pre-trained text encoder such as BERT or RoBERTa to extract meaningful label semantics which in turn leads to a better feature-label alignment for improved classification performance. To demonstrate the effectiveness of our approach, we evaluate our model on three diverse HSI benchmark datasets - Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit and report significant performance boost. Our results highlight the synergy between textual semantics and spectral-spatial data, paving the way for further advancements in semantically augmented HSI classification models. Codes are be available in: https://github.com/milab-nsu/S3FN

CVOct 19, 2020Code
Synthesizing the Unseen for Zero-shot Object Detection

Nasir Hayat, Munawar Hayat, Shafin Rahman et al.

The existing zero-shot detection approaches project visual features to the semantic domain for seen objects, hoping to map unseen objects to their corresponding semantics during inference. However, since the unseen objects are never visualized during training, the detection model is skewed towards seen content, thereby labeling unseen as background or a seen class. In this work, we propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain. Consequently, the major challenge becomes, how to accurately synthesize unseen objects merely using their class semantics? Towards this ambitious goal, we propose a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them. Further, using a unified model, we ensure the synthesized features have high diversity that represents the intra-class differences and variable localization precision in the detected bounding boxes. We test our approach on three object detection benchmarks, PASCAL VOC, MSCOCO, and ILSVRC detection, under both conventional and generalized settings, showing impressive gains over the state-of-the-art methods. Our codes are available at https://github.com/nasir6/zero_shot_detection.

CLNov 16, 2024
Empowering Meta-Analysis: Leveraging Large Language Models for Scientific Synthesis

Jawad Ibn Ahad, Rafeed Mohammad Sultan, Abraham Kaikobad et al.

This study investigates the automation of meta-analysis in scientific documents using large language models (LLMs). Meta-analysis is a robust statistical method that synthesizes the findings of multiple studies support articles to provide a comprehensive understanding. We know that a meta-article provides a structured analysis of several articles. However, conducting meta-analysis by hand is labor-intensive, time-consuming, and susceptible to human error, highlighting the need for automated pipelines to streamline the process. Our research introduces a novel approach that fine-tunes the LLM on extensive scientific datasets to address challenges in big data handling and structured data extraction. We automate and optimize the meta-analysis process by integrating Retrieval Augmented Generation (RAG). Tailored through prompt engineering and a new loss metric, Inverse Cosine Distance (ICD), designed for fine-tuning on large contextual datasets, LLMs efficiently generate structured meta-analysis content. Human evaluation then assesses relevance and provides information on model performance in key metrics. This research demonstrates that fine-tuned models outperform non-fine-tuned models, with fine-tuned LLMs generating 87.6% relevant meta-analysis abstracts. The relevance of the context, based on human evaluation, shows a reduction in irrelevancy from 4.56% to 1.9%. These experiments were conducted in a low-resource environment, highlighting the study's contribution to enhancing the efficiency and reliability of meta-analysis automation.

IRApr 19, 2025
LegalRAG: A Hybrid RAG System for Multilingual Legal Information Retrieval

Muhammad Rafsan Kabir, Rafeed Mohammad Sultan, Fuad Rahman et al.

Natural Language Processing (NLP) and computational linguistic techniques are increasingly being applied across various domains, yet their use in legal and regulatory tasks remains limited. To address this gap, we develop an efficient bilingual question-answering framework for regulatory documents, specifically the Bangladesh Police Gazettes, which contain both English and Bangla text. Our approach employs modern Retrieval Augmented Generation (RAG) pipelines to enhance information retrieval and response generation. In addition to conventional RAG pipelines, we propose an advanced RAG-based approach that improves retrieval performance, leading to more precise answers. This system enables efficient searching for specific government legal notices, making legal information more accessible. We evaluate both our proposed and conventional RAG systems on a diverse test set on Bangladesh Police Gazettes, demonstrating that our approach consistently outperforms existing methods across all evaluation metrics.

CLNov 16, 2024
BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization

Md. Nazmus Sadat Samin, Jawad Ibn Ahad, Tanjila Ahmed Medha et al.

This study focuses on recognizing Bangladeshi dialects and converting diverse Bengali accents into standardized formal Bengali speech. Dialects, often referred to as regional languages, are distinctive variations of a language spoken in a particular location and are identified by their phonetics, pronunciations, and lexicon. Subtle changes in pronunciation and intonation are also influenced by geographic location, educational attainment, and socioeconomic status. Dialect standardization is needed to ensure effective communication, educational consistency, access to technology, economic opportunities, and the preservation of linguistic resources while respecting cultural diversity. Being the fifth most spoken language with around 55 distinct dialects spoken by 160 million people, addressing Bangla dialects is crucial for developing inclusive communication tools. However, limited research exists due to a lack of comprehensive datasets and the challenges of handling diverse dialects. With the advancement in multilingual Large Language Models (mLLMs), emerging possibilities have been created to address the challenges of dialectal Automated Speech Recognition (ASR) and Machine Translation (MT). This study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech. This investigation includes constructing a large-scale diverse dataset with dialectal speech signals that tailored the fine-tuning process in ASR and LLM for transcribing the dialect speech to dialect text and translating the dialect text to standard Bangla text. Our experiments demonstrated that fine-tuning the Whisper ASR model achieved a CER of 0.8% and WER of 1.5%, while the BanglaT5 model attained a BLEU score of 41.6% for dialect-to-standard text translation.

CVNov 20, 2025
Teacher-Guided One-Shot Pruning via Context-Aware Knowledge Distillation

Md. Samiul Alim, Sharjil Khan, Amrijit Biswas et al.

Unstructured pruning remains a powerful strategy for compressing deep neural networks, yet it often demands iterative train-prune-retrain cycles, resulting in significant computational overhead. To address this challenge, we introduce a novel teacher-guided pruning framework that tightly integrates Knowledge Distillation (KD) with importance score estimation. Unlike prior approaches that apply KD as a post-pruning recovery step, our method leverages gradient signals informed by the teacher during importance score calculation to identify and retain parameters most critical for both task performance and knowledge transfer. Our method facilitates a one-shot global pruning strategy that efficiently eliminates redundant weights while preserving essential representations. After pruning, we employ sparsity-aware retraining with and without KD to recover accuracy without reactivating pruned connections. Comprehensive experiments across multiple image classification benchmarks, including CIFAR-10, CIFAR-100, and TinyImageNet, demonstrate that our method consistently achieves high sparsity levels with minimal performance degradation. Notably, our approach outperforms state-of-the-art baselines such as EPG and EPSD at high sparsity levels, while offering a more computationally efficient alternative to iterative pruning schemes like COLT. The proposed framework offers a computation-efficient, performance-preserving solution well suited for deployment in resource-constrained environments.

CVSep 23, 2025
HadaSmileNet: Hadamard fusion of handcrafted and deep-learning features for enhancing facial emotion recognition of genuine smiles

Mohammad Junayed Hasan, Nabeel Mohammed, Shafin Rahman et al.

The distinction between genuine and posed emotions represents a fundamental pattern recognition challenge with significant implications for data mining applications in social sciences, healthcare, and human-computer interaction. While recent multi-task learning frameworks have shown promise in combining deep learning architectures with handcrafted D-Marker features for smile facial emotion recognition, these approaches exhibit computational inefficiencies due to auxiliary task supervision and complex loss balancing requirements. This paper introduces HadaSmileNet, a novel feature fusion framework that directly integrates transformer-based representations with physiologically grounded D-Markers through parameter-free multiplicative interactions. Through systematic evaluation of 15 fusion strategies, we demonstrate that Hadamard multiplicative fusion achieves optimal performance by enabling direct feature interactions while maintaining computational efficiency. The proposed approach establishes new state-of-the-art results for deep learning methods across four benchmark datasets: UvA-NEMO (88.7 percent, +0.8), MMI (99.7 percent), SPOS (98.5 percent, +0.7), and BBC (100 percent, +5.0). Comprehensive computational analysis reveals 26 percent parameter reduction and simplified training compared to multi-task alternatives, while feature visualization demonstrates enhanced discriminative power through direct domain knowledge integration. The framework's efficiency and effectiveness make it particularly suitable for practical deployment in multimedia data mining applications that require real-time affective computing capabilities.

CLAug 5, 2025
UPLME: Uncertainty-Aware Probabilistic Language Modelling for Robust Empathy Regression

Md Rakibul Hasan, Md Zakir Hossain, Aneesh Krishna et al.

Supervised learning for empathy regression is challenged by noisy self-reported empathy scores. While many algorithms have been proposed for learning with noisy labels in textual classification problems, the regression counterpart is relatively under-explored. We propose UPLME, an uncertainty-aware probabilistic language modelling framework to capture label noise in the regression setting of empathy detection. UPLME includes a probabilistic language model that predicts both empathy score and heteroscedastic uncertainty and is trained using Bayesian concepts with variational model ensembling. We further introduce two novel loss components: one penalises degenerate Uncertainty Quantification (UQ), and another enforces the similarity between the input pairs on which we predict empathy. UPLME provides state-of-the-art performance (Pearson Correlation Coefficient: $0.558\rightarrow0.580$ and $0.629\rightarrow0.634$) in terms of the performance reported in the literature in two public benchmarks, having label noise. Through synthetic label noise injection, we show that UPLME is effective in separating noisy and clean samples based on the predicted uncertainty. UPLME further outperform (Calibration error: $0.571\rightarrow0.376$) a recent variational model ensembling-based UQ method designed for regression problems.

LGJul 4, 2025
Temporal Window Smoothing of Exogenous Variables for Improved Time Series Prediction

Mustafa Kamal, Niyaz Bin Hashem, Robin Krambroeckers et al.

Although most transformer-based time series forecasting models primarily depend on endogenous inputs, recent state-of-the-art approaches have significantly improved performance by incorporating external information through exogenous inputs. However, these methods face challenges, such as redundancy when endogenous and exogenous inputs originate from the same source and limited ability to capture long-term dependencies due to fixed look-back windows. In this paper, we propose a method that whitens the exogenous input to reduce redundancy that may persist within the data based on global statistics. Additionally, our approach helps the exogenous input to be more aware of patterns and trends over extended periods. By introducing this refined, globally context-aware exogenous input to the endogenous input without increasing the lookback window length, our approach guides the model towards improved forecasting. Our approach achieves state-of-the-art performance in four benchmark datasets, consistently outperforming 11 baseline models. These results establish our method as a robust and effective alternative for using exogenous inputs in time series forecasting.

CVJan 26, 2024
Spatial Transcriptomics Analysis of Zero-shot Gene Expression Prediction

Yan Yang, Md Zakir Hossain, Xuesong Li et al.

Spatial transcriptomics (ST) captures gene expression within distinct regions (i.e., windows) of a tissue slide. Traditional supervised learning frameworks applied to model ST are constrained to predicting expression from slide image windows for gene types seen during training, failing to generalize to unseen gene types. To overcome this limitation, we propose a semantic guided network (SGN), a pioneering zero-shot framework for predicting gene expression from slide image windows. Considering a gene type can be described by functionality and phenotype, we dynamically embed a gene type to a vector per its functionality and phenotype, and employ this vector to project slide image windows to gene expression in feature space, unleashing zero-shot expression prediction for unseen gene types. The gene type functionality and phenotype are queried with a carefully designed prompt from a pre-trained large language model (LLM). On standard benchmark datasets, we demonstrate competitive zero-shot performance compared to past state-of-the-art supervised learning approaches.

CVJun 27, 2021
Learning without Forgetting for 3D Point Cloud Objects

Townim Chowdhury, Mahira Jalisha, Ali Cheraghian et al.

When we fine-tune a well-trained deep learning model for a new set of classes, the network learns new concepts but gradually forgets the knowledge of old training. In some real-life applications, we may be interested in learning new classes without forgetting the capability of previous experience. Such learning without forgetting problem is often investigated using 2D image recognition tasks. In this paper, considering the growth of depth camera technology, we address the same problem for the 3D point cloud object data. This problem becomes more challenging in the 3D domain than 2D because of the unavailability of large datasets and powerful pretrained backbone models. We investigate knowledge distillation techniques on 3D data to reduce catastrophic forgetting of the previous training. Moreover, we improve the distillation process by using semantic word vectors of object classes. We observe that exploring the interrelation of old and new knowledge during training helps to learn new concepts without forgetting old ones. Experimenting on three 3D point cloud recognition backbones (PointNet, DGCNN, and PointConv) and synthetic (ModelNet40, ModelNet10) and real scanned (ScanObjectNN) datasets, we establish new baseline results on learning without forgetting for 3D data. This research will instigate many future works in this area.

CVMar 6, 2021
Semantic-aware Knowledge Distillation for Few-Shot Class-Incremental Learning

Ali Cheraghian, Shafin Rahman, Pengfei Fang et al.

Few-shot class incremental learning (FSCIL) portrays the problem of learning new concepts gradually, where only a few examples per concept are available to the learner. Due to the limited number of examples for training, the techniques developed for standard incremental learning cannot be applied verbatim to FSCIL. In this work, we introduce a distillation algorithm to address the problem of FSCIL and propose to make use of semantic information during training. To this end, we make use of word embeddings as semantic information which is cheap to obtain and which facilitate the distillation process. Furthermore, we propose a method based on an attention mechanism on multiple parallel embeddings of visual data to align visual and semantic vectors, which reduces issues related to catastrophic forgetting. Via experiments on MiniImageNet, CUB200, and CIFAR100 dataset, we establish new state-of-the-art results by outperforming existing approaches.

CVNov 30, 2020
S2FGAN: Semantically Aware Interactive Sketch-to-Face Translation

Yan Yang, Md Zakir Hossain, Tom Gedeon et al.

Interactive facial image manipulation attempts to edit single and multiple face attributes using a photo-realistic face and/or semantic mask as input. In the absence of the photo-realistic image (only sketch/mask available), previous methods only retrieve the original face but ignore the potential of aiding model controllability and diversity in the translation process. This paper proposes a sketch-to-image generation framework called S2FGAN, aiming to improve users' ability to interpret and flexibility of face attribute editing from a simple sketch. The proposed framework modifies the constrained latent space semantics trained on Generative Adversarial Networks (GANs). We employ two latent spaces to control the face appearance and adjust the desired attributes of the generated face. Instead of constraining the translation process by using a reference image, the users can command the model to retouch the generated images by involving the semantic information in the generation process. In this way, our method can manipulate single or multiple face attributes by only specifying attributes to be changed. Extensive experimental results on CelebAMask-HQ dataset empirically shows our superior performance and effectiveness on this task. Our method successfully outperforms state-of-the-art methods on attribute manipulation by exploiting greater control of attribute intensity.

CVOct 24, 2020
Classifying Eye-Tracking Data Using Saliency Maps

Shafin Rahman, Sejuti Rahman, Omar Shahid et al.

A plethora of research in the literature shows how human eye fixation pattern varies depending on different factors, including genetics, age, social functioning, cognitive functioning, and so on. Analysis of these variations in visual attention has already elicited two potential research avenues: 1) determining the physiological or psychological state of the subject and 2) predicting the tasks associated with the act of viewing from the recorded eye-fixation data. To this end, this paper proposes a visual saliency based novel feature extraction method for automatic and quantitative classification of eye-tracking data, which is applicable to both of the research directions. Instead of directly extracting features from the fixation data, this method employs several well-known computational models of visual attention to predict eye fixation locations as saliency maps. Comparing the saliency amplitudes, similarity and dissimilarity of saliency maps with the corresponding eye fixations maps gives an extra dimension of information which is effectively utilized to generate discriminative features to classify the eye-tracking data. Extensive experimentation using Saliency4ASD, Age Prediction, and Visual Perceptual Task dataset show that our saliency-based feature can achieve superior performance, outperforming the previous state-of-the-art methods by a considerable margin. Moreover, unlike the existing application-specific solutions, our method demonstrates performance improvement across three distinct problems from the real-life domain: Autism Spectrum Disorder screening, toddler age prediction, and human visual perceptual task classification, providing a general paradigm that utilizes the extra-information inherent in saliency maps for a more accurate classification.

CVOct 7, 2020
RealSmileNet: A Deep End-To-End Network for Spontaneous and Posed Smile Recognition

Yan Yang, Md Zakir Hossain, Tom Gedeon et al.

Smiles play a vital role in the understanding of social interactions within different communities, and reveal the physical state of mind of people in both real and deceptive ways. Several methods have been proposed to recognize spontaneous and posed smiles. All follow a feature-engineering based pipeline requiring costly pre-processing steps such as manual annotation of face landmarks, tracking, segmentation of smile phases, and hand-crafted features. The resulting computation is expensive, and strongly dependent on pre-processing steps. We investigate an end-to-end deep learning model to address these problems, the first end-to-end model for spontaneous and posed smile recognition. Our fully automated model is fast and learns the feature extraction processes by training a series of convolution and ConvLSTM layer from scratch. Our experiments on four datasets demonstrate the robustness and generalization of the proposed model by achieving state-of-the-art performances.

CVMar 16, 2020
Any-Shot Object Detection

Shafin Rahman, Salman Khan, Nick Barnes et al.

Previous work on novel object detection considers zero or few-shot settings where none or few examples of each category are available for training. In real world scenarios, it is less practical to expect that 'all' the novel classes are either unseen or {have} few-examples. Here, we propose a more realistic setting termed 'Any-shot detection', where totally unseen and few-shot categories can simultaneously co-occur during inference. Any-shot detection offers unique challenges compared to conventional novel object detection such as, a high imbalance between unseen, few-shot and seen object classes, susceptibility to forget base-training while learning novel classes and distinguishing novel classes from the background. To address these challenges, we propose a unified any-shot detection model, that can concurrently learn to detect both zero-shot and few-shot object classes. Our core idea is to use class semantics as prototypes for object detection, a formulation that naturally minimizes knowledge forgetting and mitigates the class-imbalance in the label space. Besides, we propose a rebalanced loss function that emphasizes difficult few-shot cases but avoids overfitting on the novel classes to allow detection of totally unseen classes. Without bells and whistles, our framework can also be used solely for Zero-shot detection and Few-shot detection tasks. We report extensive experiments on Pascal VOC and MS-COCO datasets where our approach is shown to provide significant improvements.

CVDec 16, 2019
Transductive Zero-Shot Learning for 3D Point Cloud Classification

Ali Cheraghian, Shafin Rahman, Dylan Campbell et al.

Zero-shot learning, the task of learning to recognize new classes not seen during training, has received considerable attention in the case of 2D image classification. However despite the increasing ubiquity of 3D sensors, the corresponding 3D point cloud classification problem has not been meaningfully explored and introduces new challenges. This paper extends, for the first time, transductive Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) approaches to the domain of 3D point cloud classification. To this end, a novel triplet loss is developed that takes advantage of unlabeled test data. While designed for the task of 3D point cloud classification, the method is also shown to be applicable to the more common use-case of 2D image classification. An extensive set of experiments is carried out, establishing state-of-the-art for ZSL and GZSL in the 3D point cloud domain, as well as demonstrating the applicability of the approach to the image domain.

CVJul 15, 2019
Mitigating the Hubness Problem for Zero-Shot Learning of 3D Objects

Ali Cheraghian, Shafin Rahman, Dylan Campbell et al.

The development of advanced 3D sensors has enabled many objects to be captured in the wild at a large scale, and a 3D object recognition system may therefore encounter many objects for which the system has received no training. Zero-Shot Learning (ZSL) approaches can assist such systems in recognizing previously unseen objects. Applying ZSL to 3D point cloud objects is an emerging topic in the area of 3D vision, however, a significant problem that ZSL often suffers from is the so-called hubness problem, which is when a model is biased to predict only a few particular labels for most of the test instances. We observe that this hubness problem is even more severe for 3D recognition than for 2D recognition. One reason for this is that in 2D one can use pre-trained networks trained on large datasets like ImageNet, which produces high-quality features. However, in the 3D case there are no such large-scale, labelled datasets available for pre-training which means that the extracted 3D features are of poorer quality which, in turn, exacerbates the hubness problem. In this paper, we therefore propose a loss to specifically address the hubness problem. Our proposed method is effective for both Zero-Shot and Generalized Zero-Shot Learning, and we perform extensive evaluations on the challenging datasets ModelNet40, ModelNet10, McGill and SHREC2015. A new state-of-the-art result for both zero-shot tasks in the 3D case is established.

CVFeb 27, 2019
Zero-shot Learning of 3D Point Cloud Objects

Ali Cheraghian, Shafin Rahman, Lars Petersson

Recent deep learning architectures can recognize instances of 3D point cloud objects of previously seen classes quite well. At the same time, current 3D depth camera technology allows generating/segmenting a large amount of 3D point cloud objects from an arbitrary scene, for which there is no previously seen training data. A challenge for a 3D point cloud recognition system is, then, to classify objects from new, unseen, classes. This issue can be resolved by adopting a zero-shot learning (ZSL) approach for 3D data, similar to the 2D image version of the same problem. ZSL attempts to classify unseen objects by comparing semantic information (attribute/word vector) of seen and unseen classes. Here, we adapt several recent 3D point cloud recognition systems to the ZSL setting with some changes to their architectures. To the best of our knowledge, this is the first attempt to classify unseen 3D point cloud objects in the ZSL setting. A standard protocol (which includes the choice of datasets and the seen/unseen split) to evaluate such systems is also proposed. Baseline performances are reported using the new protocol on the investigated models. This investigation throws a new challenge to the 3D point cloud recognition community that may instigate numerous future works.

CVNov 22, 2018
Task-generalizable Adversarial Attack based on Perceptual Metric

Muzammal Naseer, Salman H. Khan, Shafin Rahman et al.

Deep neural networks (DNNs) can be easily fooled by adding human imperceptible perturbations to the images. These perturbed images are known as `adversarial examples' and pose a serious threat to security and safety critical systems. A litmus test for the strength of adversarial examples is their transferability across different DNN models in a black box setting (i.e. when the target model's architecture and parameters are not known to attacker). Current attack algorithms that seek to enhance adversarial transferability work on the decision level i.e. generate perturbations that alter the network decisions. This leads to two key limitations: (a) An attack is dependent on the task-specific loss function (e.g. softmax cross-entropy for object recognition) and therefore does not generalize beyond its original task. (b) The adversarial examples are specific to the network architecture and demonstrate poor transferability to other network architectures. We propose a novel approach to create adversarial examples that can broadly fool different networks on multiple tasks. Our approach is based on the following intuition: "Perpetual metrics based on neural network features are highly generalizable and show excellent performance in measuring and stabilizing input distortions. Therefore an ideal attack that creates maximum distortions in the network feature space should realize highly transferable examples". We report extensive experiments to show how adversarial examples generalize across multiple networks for classification, object detection and segmentation tasks.

CVNov 22, 2018
Polarity Loss for Zero-shot Object Detection

Shafin Rahman, Salman Khan, Nick Barnes

Conventional object detection models require large amounts of training data. In comparison, humans can recognize previously unseen objects by merely knowing their semantic description. To mimic similar behaviour, zero-shot object detection aims to recognize and localize 'unseen' object instances by using only their semantic information. The model is first trained to learn the relationships between visual and semantic domains for seen objects, later transferring the acquired knowledge to totally unseen objects. This setting gives rise to the need for correct alignment between visual and semantic concepts, so that the unseen objects can be identified using only their semantic attributes. In this paper, we propose a novel loss function called 'Polarity loss', that promotes correct visual-semantic alignment for an improved zero-shot object detection. On one hand, it refines the noisy semantic embeddings via metric learning on a 'Semantic vocabulary' of related concepts to establish a better synergy between visual and semantic domains. On the other hand, it explicitly maximizes the gap between positive and negative predictions to achieve better discrimination between seen, unseen and background objects. Our approach is inspired by embodiment theories in cognitive science, that claim human semantic understanding to be grounded in past experiences (seen objects), related linguistic concepts (word vocabulary) and visual perception (seen/unseen object images). We conduct extensive evaluations on MS-COCO and Pascal VOC datasets, showing significant improvements over state of the art.

CVMar 16, 2018
Deep Multiple Instance Learning for Zero-shot Image Tagging

Shafin Rahman, Salman Khan

In-line with the success of deep learning on traditional recognition problem, several end-to-end deep models for zero-shot recognition have been proposed in the literature. These models are successful to predict a single unseen label given an input image, but does not scale to cases where multiple unseen objects are present. In this paper, we model this problem within the framework of Multiple Instance Learning (MIL). To the best of our knowledge, we propose the first end-to-end trainable deep MIL framework for the multi-label zero-shot tagging problem. Due to its novel design, the proposed framework has several interesting features: (1) Unlike previous deep MIL models, it does not use any off-line procedure (e.g., Selective Search or EdgeBoxes) for bag generation. (2) During test time, it can process any number of unseen labels given their semantic embedding vectors. (3) Using only seen labels per image as weak annotation, it can produce a bounding box for each predicted labels. We experiment with the NUS-WIDE dataset and achieve superior performance across conventional, zero-shot and generalized zero-shot tagging tasks.

CVMar 16, 2018
Zero-Shot Object Detection: Learning to Simultaneously Recognize and Localize Novel Concepts

Shafin Rahman, Salman Khan, Fatih Porikli

Current Zero-Shot Learning (ZSL) approaches are restricted to recognition of a single dominant unseen object category in a test image. We hypothesize that this setting is ill-suited for real-world applications where unseen objects appear only as a part of a complex scene, warranting both the `recognition' and `localization' of an unseen category. To address this limitation, we introduce a new \emph{`Zero-Shot Detection'} (ZSD) problem setting, which aims at simultaneously recognizing and locating object instances belonging to novel categories without any training examples. We also propose a new experimental protocol for ZSD based on the highly challenging ILSVRC dataset, adhering to practical issues, e.g., the rarity of unseen objects. To the best of our knowledge, this is the first end-to-end deep network for ZSD that jointly models the interplay between visual and semantic domain information. To overcome the noise in the automatically derived semantic descriptions, we utilize the concept of meta-classes to design an original loss function that achieves synergy between max-margin class separation and semantic space clustering. Furthermore, we present a baseline approach extended from recognition to detection setting. Our extensive experiments show significant performance boost over the baseline on the imperative yet difficult ZSD problem.

CVJun 27, 2017
A Unified approach for Conventional Zero-shot, Generalized Zero-shot and Few-shot Learning

Shafin Rahman, Salman H. Khan, Fatih Porikli

Prevalent techniques in zero-shot learning do not generalize well to other related problem scenarios. Here, we present a unified approach for conventional zero-shot, generalized zero-shot and few-shot learning problems. Our approach is based on a novel Class Adapting Principal Directions (CAPD) concept that allows multiple embeddings of image features into a semantic space. Given an image, our method produces one principal direction for each seen class. Then, it learns how to combine these directions to obtain the principal direction for each unseen class such that the CAPD of the test image is aligned with the semantic embedding of the true class, and opposite to the other classes. This allows efficient and class-adaptive information transfer from seen to unseen classes. In addition, we propose an automatic process for selection of the most useful seen classes for each unseen class to achieve robustness in zero-shot learning. Our method can update the unseen CAPD taking the advantages of few unseen images to work in a few-shot learning scenario. Furthermore, our method can generalize the seen CAPDs by estimating seen-unseen diversity that significantly improves the performance of generalized zero-shot learning. Our extensive evaluations demonstrate that the proposed approach consistently achieves superior performance in zero-shot, generalized zero-shot and few/one-shot learning problems.