CV · Jul 21, 2022
Novel Class Discovery without Forgetting
K J Joseph, Sujoy Paul, Gaurav Aggarwal et al.
Humans possess an innate ability to identify and differentiate instances that they are not familiar with, by leveraging and adapting the knowledge that they have acquired so far. Importantly, they achieve this without deteriorating the performance on their earlier learning. Inspired by this, we identify and formulate a new, pragmatic problem setting of NCDwF: Novel Class Discovery without Forgetting, which tasks a machine learning model to incrementally discover novel categories of instances from unlabeled data, while maintaining its performance on the previously seen categories. We propose 1) a method to generate pseudo-latent representations which act as a proxy for (no longer available) labeled data, thereby alleviating forgetting, 2) a mutual-information based regularizer which enhances unsupervised discovery of novel classes, and 3) a simple Known Class Identifier which aids generalized inference when the testing data contains instances from both seen and unseen categories. We introduce experimental protocols based on CIFAR-10, CIFAR-100 and ImageNet-1000 to measure the trade-off between knowledge retention and novel class discovery. Our extensive evaluations reveal that existing models catastrophically forget previously seen categories while identifying novel categories, whereas our method effectively balances the competing objectives. We hope our work will attract further research into this newly identified pragmatic problem setting.
CV · Jul 5, 2023
S3C: Self-Supervised Stochastic Classifiers for Few-Shot Class-Incremental Learning
Jayateja Kalla, Soma Biswas
Few-shot class-incremental learning (FSCIL) aims to learn progressively about new classes with very few labeled samples, without forgetting the knowledge of already learnt classes. FSCIL suffers from two major challenges: (i) over-fitting on the new classes due to the limited amount of data, and (ii) catastrophically forgetting the old classes due to the unavailability of data from these classes in the incremental stages. In this work, we propose a self-supervised stochastic classifier (S3C) to counter both these challenges in FSCIL. The stochasticity of the classifier weights (or class prototypes) mitigates the adverse effect not only of the absence of a large number of samples of the new classes, but also of the absence of samples from previously learnt classes during the incremental steps. This is complemented by the self-supervision component, which helps to learn features from the base classes that generalize well to unseen classes encountered in the future, thus reducing catastrophic forgetting. Extensive evaluation on three benchmark datasets using multiple evaluation metrics shows the effectiveness of the proposed framework. We also experiment on two additional realistic scenarios of FSCIL, namely where the number of annotated samples available for each of the new classes can differ, and where the number of base classes is much smaller, and show that the proposed S3C performs significantly better than the state-of-the-art in all these challenging scenarios.
CV · Apr 22, 2022
Spacing Loss for Discovering Novel Categories
K J Joseph, Sujoy Paul, Gaurav Aggarwal et al.
Novel Class Discovery (NCD) is a learning paradigm, where a machine learning model is tasked to semantically group instances from unlabeled data, by utilizing labeled instances from a disjoint set of classes. In this work, we first characterize existing NCD approaches into single-stage and two-stage methods based on whether they require access to labeled and unlabeled data together while discovering new classes. Next, we devise a simple yet powerful loss function that enforces separability in the latent space using cues from multi-dimensional scaling, which we refer to as Spacing Loss. Our proposed formulation can either operate as a standalone method or can be plugged into existing methods to enhance them. We validate the efficacy of Spacing Loss with thorough experimental evaluation across multiple settings on CIFAR-10 and CIFAR-100 datasets.
CV · Jul 27, 2023
Test Time Adaptation for Blind Image Quality Assessment
Subhadeep Roy, Shankhanil Mitra, Soma Biswas et al.
While the design of blind image quality assessment (IQA) algorithms has improved significantly, the distribution shift between the training and testing scenarios often leads to a poor performance of these methods at inference time. This motivates the study of test time adaptation (TTA) techniques to improve their performance at inference time. Existing auxiliary tasks and loss functions used for TTA may not be relevant for quality-aware adaptation of the pre-trained model. In this work, we introduce two novel quality-relevant auxiliary tasks at the batch and sample levels to enable TTA for blind IQA. In particular, we introduce a group contrastive loss at the batch level and a relative rank loss at the sample level to make the model quality aware and adapt to the target data. Our experiments reveal that even using a small batch of images from the test distribution helps achieve significant improvement in performance by updating the batch normalization statistics of the source model.
LG · Nov 27, 2023 · Code
Can Out-of-Domain data help to Learn Domain-Specific Prompts for Multimodal Misinformation Detection?
Amartya Bhattacharya, Debarshi Brahma, Suraj Nagaje Mahadev et al.
The spread of fake news using out-of-context images and captions has become widespread in this era of information overload. Since fake news can belong to different domains like politics, sports, etc., each with its own unique characteristics, inference on a test image-caption pair is contingent on how well the model has been trained on similar data. Since training individual models for each domain is not practical, we propose a novel framework termed DPOD (Domain-specific Prompt tuning using Out-of-domain data), which can exploit out-of-domain data during training to improve fake news detection of all desired domains simultaneously. First, to compute generalizable features, we modify the Vision-Language Model CLIP to extract features that help align the representations of the images and corresponding captions of both the in-domain and out-of-domain data in a label-aware manner. Further, we propose a domain-specific prompt learning technique which leverages training samples of all the available domains based on the extent to which they can be useful to the desired domain. Extensive experiments on the large-scale NewsCLIPpings and VERITE benchmarks demonstrate that DPOD achieves state-of-the-art performance for this challenging task. Code: https://github.com/scviab/DPOD.
CV · Apr 20, 2023
SATA: Source Anchoring and Target Alignment Network for Continual Test Time Adaptation
Goirik Chakrabarty, Manogna Sreenivas, Soma Biswas
Adapting a trained model to perform satisfactorily on continually changing testing domains/environments is an important and challenging task. In this work, we propose a novel framework, SATA, which aims to satisfy the following characteristics required for online adaptation: 1) it can work seamlessly with different (preferably small) batch sizes to reduce latency; 2) it should continue to work well for the source domain; 3) it should have minimal tunable hyper-parameters and storage requirements. Given a pre-trained network trained on source domain data, the proposed SATA framework modifies the batch-norm affine parameters using source anchoring based self-distillation. This ensures that the model incorporates the knowledge of the newly encountered domains, without catastrophically forgetting the previously seen ones. We also propose a source-prototype driven contrastive alignment to ensure natural grouping of the target samples, while maintaining the already learnt semantic information. Extensive evaluation on three benchmark datasets under challenging settings justifies the effectiveness of SATA for real-world applications.
CV · Nov 22, 2022
Pred&Guide: Labeled Target Class Prediction for Guiding Semi-Supervised Domain Adaptation
Megh Manoj Bhalerao, Anurag Singh, Soma Biswas
Semi-supervised domain adaptation aims to classify data belonging to a target domain by utilizing a related label-rich source domain and very few labeled examples of the target domain. Here, we propose a novel framework, Pred&Guide, which leverages the inconsistency between the predicted and the actual class labels of the few labeled target examples to effectively guide the domain adaptation in a semi-supervised setting. Pred&Guide consists of three stages, as follows: (1) first, in order to treat all the target samples equally, we perform unsupervised domain adaptation coupled with self-training; (2) second is the label prediction stage, where the current model is used to predict the labels of the few labeled target examples; and (3) finally, the correctness of the label predictions is used to effectively weigh source examples class-wise to better guide the domain adaptation process. Extensive experiments show that the proposed Pred&Guide framework achieves state-of-the-art results for two large-scale benchmark datasets, namely Office-Home and DomainNet.
CV · Aug 19, 2022
Test-time Training for Data-efficient UCDR
Soumava Paul, Titir Dutta, Aheli Saha et al.
Image retrieval under generalized test scenarios has gained significant momentum in the literature, and the recently proposed protocol of Universal Cross-domain Retrieval is a pioneer in this direction. A common practice in any such generalized classification or retrieval algorithm is to exploit samples from many domains during training to learn a domain-invariant representation of data. Such a criterion is often restrictive, and thus in this work, for the first time, we explore the generalized retrieval problem in a data-efficient manner. Specifically, we aim to generalize any pre-trained cross-domain retrieval network towards any unknown query domain/category, by adapting the model on the test data using self-supervised learning techniques. Toward that goal, we explore different self-supervised loss functions (for example, RotNet, JigSaw, Barlow Twins, etc.) and analyze their effectiveness for this purpose. Extensive experiments demonstrate that the proposed approach is simple, easy to implement, and effective in handling data-efficient UCDR.
CV · May 20
AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models
Manogna Sreenivas, Rohit Kumar, Soma Biswas
Visual storytelling with diffusion models has made impressive strides in maintaining character consistency across narrative scenes. However, a critical gap remains: while these methods ensure a character remains consistent across scenes, they provide no systematic method to ensure that fine-grained attributes, such as the color and texture of clothing and accessories, are faithfully rendered in the generated images. Towards this goal, we introduce AttriStory, a benchmark enabling attribute realization in visual storytelling. We curate 200 multi-scene stories across 10 distinct artistic styles using a Large Language Model. Each scene is constructed with detailed attribute specifications to enable rich visual narratives. Further, to address attribute realization, we propose a plug-and-play latent optimization module that operates during early denoising steps, when the model establishes structural and semantic content. We achieve this through the AttriLoss objective, designed to maximize alignment between the cross-attention maps for desired attribute-object pairs while suppressing spurious associations, guiding models to localize attributes correctly. This approach operates orthogonally to existing consistency mechanisms, integrating seamlessly with current story generation pipelines without requiring architectural modifications. Our experiments demonstrate consistent improvements from incorporating AttriLoss across all baselines. This work positions attribute realization as a distinct, complementary dimension of visual storytelling, alongside character consistency, advancing the field toward fine-grained attribute-controlled story generation. Project page: https://manogna-s.github.io/attristory/
CV · Nov 2, 2023
Robust Feature Learning and Global Variance-Driven Classifier Alignment for Long-Tail Class Incremental Learning
Jayateja Kalla, Soma Biswas
This paper introduces a two-stage framework designed to enhance long-tail class incremental learning, enabling the model to progressively learn new classes, while mitigating catastrophic forgetting in the context of long-tailed data distributions. Addressing the challenge posed by the under-representation of tail classes in long-tail class incremental learning, our approach achieves classifier alignment by leveraging global variance as an informative measure and class prototypes in the second stage. This process effectively captures class properties and eliminates the need for data balancing or additional layer tuning. Alongside traditional class incremental learning losses in the first stage, the proposed approach incorporates mixup classes to learn robust feature representations, ensuring smoother boundaries. The proposed framework can seamlessly integrate as a module with any class incremental learning method to effectively handle long-tail class incremental learning scenarios. Extensive experimentation on the CIFAR-100 and ImageNet-Subset datasets validates the approach's efficacy, showcasing its superiority over state-of-the-art techniques across various long-tail CIL settings.
CV · May 21, 2025 · Code
Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts
Debarshi Brahma, Anuska Roy, Soma Biswas
Recently, Vision-Language foundation models like CLIP and ALIGN, which are pre-trained on large-scale data, have shown remarkable zero-shot generalization to diverse datasets with different classes and even domains. In this work, we take a step further and analyze whether these models can be adapted to target datasets having very different distributions and classes compared to what these models have been trained on, using only a few labeled examples from the target dataset. In such scenarios, finetuning large pretrained models is challenging due to problems of overfitting as well as loss of generalization, and has not been well explored in prior literature. Since the pre-training data of such models are unavailable, it is difficult to comprehend the performance on various downstream datasets. First, by analyzing the common vision-language embedding space, we try to answer the question: given a target dataset with a few labelled examples, can we estimate whether further fine-tuning can enhance the performance compared to zero-shot evaluation? Based on the analysis, we propose a novel prompt-tuning method, PromptMargin, for adapting such large-scale VLMs directly on the few target samples. PromptMargin effectively tunes the text as well as visual prompts for this task, and has two main modules: 1) we use a selective augmentation strategy to complement the few training samples in each task; 2) to ensure robust training in the presence of unfamiliar class names, we increase the inter-class margin for improved class discrimination using a novel Multimodal Margin Regularizer. Extensive experiments and analysis across fifteen target benchmark datasets, with varying degrees of distribution shifts from natural images, show the effectiveness of the proposed framework over the existing state-of-the-art approaches applied to this setting. Code: github.com/debarshigit/PromptMargin.
CV · Aug 8, 2024
AggSS: An Aggregated Self-Supervised Approach for Class-Incremental Learning
Jayateja Kalla, Soma Biswas
This paper investigates the impact of self-supervised learning, specifically image rotations, on various class-incremental learning paradigms. Here, each image with a predefined rotation is considered as a new class for training. At inference, all image rotation predictions are aggregated for the final prediction, a strategy we term Aggregated Self-Supervision (AggSS). We observe a shift in the deep neural network's attention towards intrinsic object features as it learns through the AggSS strategy. This learning approach significantly enhances class-incremental learning by promoting robust feature learning. AggSS serves as a plug-and-play module that can be seamlessly incorporated into any class-incremental learning framework, leveraging its powerful feature learning capabilities to enhance performance across various class-incremental learning approaches. Extensive experiments conducted on the standard incremental learning datasets CIFAR-100 and ImageNet-Subset demonstrate the significant role of AggSS in improving performance within these paradigms.
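The inference-time aggregation described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the joint label layout `c * num_rotations + r`, and the choice to average softmax probabilities (rather than logits) are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_rotations(logits_per_rotation, num_classes, num_rotations=4):
    """Average class probabilities over the rotated views of one image.

    logits_per_rotation[r] holds the logits for the image rotated by the
    r-th angle, over num_classes * num_rotations joint labels. The joint
    label layout c * num_rotations + r is an assumption of this sketch.
    """
    scores = [0.0] * num_classes
    for r, logits in enumerate(logits_per_rotation):
        probs = softmax(logits)
        for c in range(num_classes):
            # Read the probability of "class c seen at rotation r".
            scores[c] += probs[c * num_rotations + r] / num_rotations
    return scores


def predict(logits_per_rotation, num_classes, num_rotations=4):
    """Return the class index with the highest aggregated score."""
    scores = aggregate_rotations(logits_per_rotation, num_classes, num_rotations)
    return max(range(num_classes), key=scores.__getitem__)
```

Each rotated view votes only through the joint label that matches its own rotation, so a view contributes evidence for a class without being penalized for the rotations it does not show.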
CV · Jul 10, 2024
TACLE: Task and Class-aware Exemplar-free Semi-supervised Class Incremental Learning
Jayateja Kalla, Rohit Kumar, Soma Biswas
We propose a novel TACLE (TAsk and CLass-awarE) framework to address the relatively unexplored and challenging problem of exemplar-free semi-supervised class incremental learning. In this scenario, at each new task, the model has to learn new classes from both (few) labeled and unlabeled data without access to exemplars from previous classes. In addition to leveraging the capabilities of pre-trained models, TACLE proposes a novel task-adaptive threshold, thereby maximizing the utilization of the available unlabeled data as incremental learning progresses. Additionally, to enhance the performance of the under-represented classes within each task, we propose a class-aware weighted cross-entropy loss. We also exploit the unlabeled data for classifier alignment, which further enhances the model performance. Extensive experiments on benchmark datasets, namely CIFAR10, CIFAR100, and ImageNet-Subset100 demonstrate the effectiveness of the proposed TACLE framework. We further showcase its effectiveness when the unlabeled data is imbalanced and also for the extreme case of one labeled example per class.
CV · Dec 3, 2025
Fully Unsupervised Self-debiasing of Text-to-Image Diffusion Models
Korada Sri Vardhana, Shrikrishna Lolla, Soma Biswas
Text-to-image (T2I) diffusion models have achieved widespread success due to their ability to generate high-resolution, photorealistic images. These models are trained on large-scale datasets, like LAION-5B, often scraped from the internet. However, since this data contains numerous biases, the models inherently learn and reproduce them, resulting in stereotypical outputs. We introduce SelfDebias, a fully unsupervised test-time debiasing method applicable to any diffusion model that uses a UNet as its noise predictor. SelfDebias identifies semantic clusters in an image encoder's embedding space and uses these clusters to guide the diffusion process during inference, minimizing the KL divergence between the output distribution and the uniform distribution. Unlike supervised approaches, SelfDebias does not require human-annotated datasets or external classifiers trained for each generated concept. Instead, it is designed to automatically identify semantic modes. Extensive experiments show that SelfDebias generalizes across prompts and diffusion model architectures, including both conditional and unconditional models. It effectively debiases images not only along key demographic dimensions, while maintaining the visual fidelity of the generated images, but also along more abstract concepts for which identifying biases is itself challenging.
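The quantity being minimized here, the KL divergence between an output distribution and the uniform distribution, can be illustrated with a small sketch. Several things are assumptions made for illustration only: the function name `kl_to_uniform`, the use of hard cluster assignments over a batch, and the direction KL(p || u). The abstract does not specify these, and the actual method guides the diffusion process rather than merely measuring imbalance.

```python
import math
from collections import Counter


def kl_to_uniform(cluster_ids, num_clusters):
    """KL(p || u), where p is the empirical cluster distribution of a batch
    and u is uniform over num_clusters. Zero means perfectly balanced."""
    total = len(cluster_ids)
    if total == 0:
        raise ValueError("cluster_ids must not be empty")
    counts = Counter(cluster_ids)
    kl = 0.0
    for k in range(num_clusters):
        p = counts.get(k, 0) / total
        if p > 0.0:  # the term 0 * log(0) is taken as 0
            # p * log(p / (1 / K)) simplifies to p * log(p * K).
            kl += p * math.log(p * num_clusters)
    return kl
```

A batch split evenly across clusters scores 0, while a batch collapsed onto a single cluster scores log(K), which is the largest value this measure can take.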
CV · Aug 27, 2025
Segmentation Assisted Incremental Test Time Adaptation in an Open World
Manogna Sreenivas, Soma Biswas
In dynamic environments, unfamiliar objects and distribution shifts are often encountered, which challenge the generalization abilities of the deployed trained models. This work addresses Incremental Test Time Adaptation of Vision Language Models, tackling scenarios where unseen classes and unseen domains continuously appear during testing. Unlike traditional Test Time Adaptation approaches, where the test stream comes only from a predefined set of classes, our framework allows models to adapt simultaneously to both covariate and label shifts, actively incorporating new classes as they emerge. Towards this goal, we establish a new benchmark for ITTA, integrating single image TTA methods for VLMs with active labeling techniques that query an oracle for samples potentially representing unseen classes during test time. We propose a segmentation assisted active labeling module, termed SegAssist, which is training free and repurposes the segmentation capabilities of VLMs to refine active sample selection, prioritizing samples likely to belong to unseen classes. Extensive experiments on several benchmark datasets demonstrate the potential of SegAssist to enhance the performance of VLMs in real world scenarios, where continuous adaptation to emerging data is essential. Project page: https://manogna-s.github.io/segassist/
CV · Jun 4, 2025
Multiple Stochastic Prompt Tuning for Few-shot Adaptation under Extreme Domain Shift
Debarshi Brahma, Soma Biswas
Foundation Vision-Language Models (VLMs) like CLIP exhibit strong generalization capabilities due to large-scale pretraining on diverse image-text pairs. However, their performance often degrades when applied to target datasets with significant distribution shifts in both visual appearance and class semantics. Recent few-shot learning approaches adapt CLIP to downstream tasks using limited labeled data via adapter or prompt tuning, but are not specifically designed to handle such extreme domain shifts. Conversely, some works addressing cross-domain few-shot learning consider such domain-shifted scenarios but operate in an episodic setting with only a few classes per episode, limiting their applicability to real-world deployment, where all classes must be handled simultaneously. To address this gap, we propose a novel framework, MIST (Multiple Stochastic Prompt Tuning), for efficiently adapting CLIP to datasets with extreme distribution shifts using only a few labeled examples, in scenarios involving all classes at once. Specifically, we introduce multiple learnable prompts per class to effectively capture diverse modes in visual representations arising from distribution shifts. To further enhance generalization, these prompts are modeled as learnable Gaussian distributions, enabling efficient exploration of the prompt parameter space and reducing overfitting caused by limited supervision. Extensive experiments and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed framework.
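The idea of keeping several prompts per class, each modeled as a learnable Gaussian, can be sketched as below. This is an illustration under assumptions rather than the MIST implementation: the class name `GaussianPrompt`, the diagonal covariance, the log-standard-deviation parameterization, and the helper `sample_class_prompts` are all hypothetical, and no training loop is shown.

```python
import math
import random


class GaussianPrompt:
    """A prompt vector modeled as a diagonal Gaussian (illustrative only)."""

    def __init__(self, mean, log_std):
        if len(mean) != len(log_std):
            raise ValueError("mean and log_std must have the same length")
        self.mean = list(mean)
        self.log_std = list(log_std)

    def sample(self, rng):
        """Draw mean + std * eps with eps ~ N(0, 1) (reparameterization)."""
        return [m + math.exp(s) * rng.gauss(0.0, 1.0)
                for m, s in zip(self.mean, self.log_std)]


def sample_class_prompts(prompts, rng):
    """Sample one vector from each of the several prompts kept for a class."""
    return [p.sample(rng) for p in prompts]
```

Writing the sample as mean plus scaled noise keeps the mean and spread as explicit parameters, which is what would let a gradient-based learner update them in a full implementation.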
CV · Jun 1, 2024
Efficient Open Set Single Image Test Time Adaptation of Vision Language Models
Manogna Sreenivas, Soma Biswas
Adapting models to dynamic, real-world environments characterized by shifting data distributions and unseen test scenarios is a critical challenge in deep learning. In this paper, we consider a realistic and challenging Test-Time Adaptation setting, where a model must continuously adapt to test samples that arrive sequentially, one at a time, while distinguishing between known and unknown classes. Current Test-Time Adaptation methods operate under closed-set assumptions or batch processing, differing from the real-world open-set scenarios. We address this limitation by establishing a comprehensive benchmark for Open-set Single-image Test-Time Adaptation using Vision-Language Models. Furthermore, we propose ROSITA, a novel framework that leverages dynamically updated feature banks to identify reliable test samples and employs a contrastive learning objective to improve the separation between known and unknown classes. Our approach effectively adapts models to domain shifts for known classes while rejecting unfamiliar samples. Extensive experiments across diverse real-world benchmarks demonstrate that ROSITA sets a new state-of-the-art in open-set TTA, achieving both strong performance and computational efficiency for real-time deployment. Our code can be found at the project site https://manogna-s.github.io/rosita/
CV · Sep 2, 2023
pSTarC: Pseudo Source Guided Target Clustering for Fully Test-Time Adaptation
Manogna Sreenivas, Goirik Chakrabarty, Soma Biswas
Test Time Adaptation (TTA) is a pivotal concept in machine learning, enabling models to perform well in real-world scenarios, where test data distribution differs from training. In this work, we propose a novel approach called pseudo Source guided Target Clustering (pSTarC) addressing the relatively unexplored area of TTA under real-world domain shifts. This method draws inspiration from target clustering techniques and exploits the source classifier for generating pseudo-source samples. The test samples are strategically aligned with these pseudo-source samples, facilitating their clustering and thereby enhancing TTA performance. pSTarC operates solely within the fully test-time adaptation protocol, removing the need for actual source data. Experimental validation on a variety of domain shift datasets, namely VisDA, Office-Home, DomainNet-126, CIFAR-100C verifies pSTarC's effectiveness. This method exhibits significant improvements in prediction accuracy along with efficient computational requirements. Furthermore, we also demonstrate the universality of the pSTarC framework by showing its effectiveness for the continuous TTA framework. The source code for our method is available at https://manogna-s.github.io/pstarc
CV · Dec 4, 2021
SITA: Single Image Test-time Adaptation
Ansh Khurana, Sujoy Paul, Piyush Rai et al.
In Test-time Adaptation (TTA), given a source model, the goal is to adapt it to make better predictions for test instances from a different distribution than the source. Crucially, TTA assumes no access to the source data or even any additional labeled/unlabeled samples from the target distribution to finetune the source model. In this work, we consider TTA in a more pragmatic setting which we refer to as SITA (Single Image Test-time Adaptation). Here, when making a prediction, the model has access only to the given single test instance, rather than a batch of instances, as has typically been considered in the literature. This is motivated by realistic scenarios where inference is needed on demand instead of being delayed for an incoming batch, or where inference happens on an edge device (like a mobile phone) with no scope for batching. The entire adaptation process in SITA should be extremely fast as it happens at inference time. To address this, we propose a novel approach, AugBN, that requires only a single forward pass. It can be used on any off-the-shelf trained model to test single instances for both classification and segmentation tasks. AugBN estimates normalization statistics of the unseen test distribution from the given test image using only one forward pass with label-preserving transformations. Since AugBN does not involve any back-propagation, it is significantly faster than recent test time adaptation methods. We further extend AugBN to make the algorithm hyperparameter-free. Rigorous experimentation shows that our simple algorithm is able to achieve significant performance gains for a variety of datasets, tasks, and network architectures.
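Estimating normalization statistics from augmented views of a single image, as described above, can be sketched for one channel as follows. This is not the AugBN algorithm itself: the function names are hypothetical, the augmented views are assumed to be supplied by the caller, and blending the estimate with the stored source statistics through a fixed `weight` is an assumption borrowed from common batch-norm adaptation practice.

```python
def channel_stats(views):
    """Mean and (biased) variance of one channel pooled over all views.

    views is a list of augmented copies of the same image channel, each a
    flat list of activation values.
    """
    values = [v for view in views for v in view]
    if not values:
        raise ValueError("views must contain at least one value")
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def blend_stats(source_mean, source_var, test_mean, test_var, weight):
    """Interpolate stored source statistics with the test-time estimate.

    weight = 1.0 keeps the source statistics, weight = 0.0 uses only the
    estimate from the test image. The fixed weight is an assumption.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    mean = weight * source_mean + (1.0 - weight) * test_mean
    var = weight * source_var + (1.0 - weight) * test_var
    return mean, var
```

Because both steps are plain averages, they need no gradient computation, which is consistent with the abstract's point that the method avoids back-propagation.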
CV · Aug 18, 2021
Universal Cross-Domain Retrieval: Generalizing Across Classes and Domains
Soumava Paul, Titir Dutta, Soma Biswas
In this work, for the first time, we address the problem of universal cross-domain retrieval, where the test data can belong to classes or domains which are unseen during training. Due to the dynamically increasing number of categories and the practical constraint of training on every possible domain, which requires large amounts of data, generalizing to both unseen classes and domains is important. Towards that goal, we propose SnMpNet (Semantic Neighbourhood and Mixture Prediction Network), which incorporates two novel losses to account for the unseen classes and domains encountered during testing. Specifically, we introduce a novel Semantic Neighbourhood loss to bridge the knowledge gap between seen and unseen classes and ensure that the latent space embedding of the unseen classes is semantically meaningful with respect to its neighbouring classes. We also introduce a mix-up based supervision at the image level as well as the semantic level of the data for training with the Mixture Prediction loss, which helps in efficient retrieval when the query belongs to an unseen domain. These losses are incorporated on the SE-ResNet50 backbone to obtain SnMpNet. Extensive experiments on two large-scale datasets, Sketchy Extended and DomainNet, and thorough comparisons with the state-of-the-art justify the effectiveness of the proposed model.
CV · Sep 14, 2020
SML: Semantic Meta-learning for Few-shot Semantic Segmentation
Ayyappa Kumar Pambala, Titir Dutta, Soma Biswas
The significant amount of training data required for training Convolutional Neural Networks has become a bottleneck for applications like semantic segmentation. Few-shot semantic segmentation algorithms address this problem, with an aim to achieve good performance in the low-data regime, with few annotated training images. Recently, approaches based on class-prototypes computed from available training data have achieved immense success for this task. In this work, we propose a novel meta-learning framework, Semantic Meta-Learning (SML) which incorporates class level semantic descriptions in the generated prototypes for this problem. In addition, we propose to use the well established technique, ridge regression, to not only bring in the class-level semantic information, but also to effectively utilise the information available from multiple images present in the training data for prototype computation. This has a simple closed-form solution, and thus can be implemented easily and efficiently. Extensive experiments on the benchmark PASCAL-5i dataset under different experimental settings show the effectiveness of the proposed framework.
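For reference, the closed form of ridge regression that the abstract alludes to is the standard one below. The symbols are generic and not taken from the paper: $X$ stacks the input features row-wise, $Y$ stacks the regression targets, $\lambda > 0$ is the regularization strength, and $I$ is the identity matrix. How the paper maps semantic descriptions and image features onto $X$ and $Y$ is not stated in the abstract.

```latex
W^{*} = \arg\min_{W} \; \lVert XW - Y \rVert_F^2 + \lambda \lVert W \rVert_F^2
      = \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} Y
```

Since $X^{\top} X + \lambda I$ is positive definite for any $\lambda > 0$, the inverse always exists, which is why the solution can be computed directly with a single linear solve.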
CV · Feb 3, 2020
A Novel Incremental Cross-Modal Hashing Approach
Devraj Mandal, Soma Biswas
Cross-modal retrieval deals with retrieving relevant items from one modality, when provided with a search query from another modality. Hashing techniques, where the data is represented as binary bits, have specifically gained importance due to the ease of storage, fast computations and high accuracy. In the real world, the number of data categories is continuously increasing, which requires algorithms capable of handling this dynamic scenario. In this work, we propose a novel incremental cross-modal hashing algorithm termed "iCMH", which can adapt itself to handle incoming data of new categories. The proposed approach consists of two sequential stages, namely, learning the hash codes and training the hash functions. At every stage, a small amount of old category data termed "exemplars" is used so as not to forget the old data while trying to learn from the new incoming data, i.e. to avoid catastrophic forgetting. In the first stage, the hash codes for the exemplars are used, and simultaneously, hash codes for the new data are computed such that they maintain the semantic relations with the existing data. For the second stage, we propose both non-deep and deep architectures to learn the hash functions effectively. Extensive experiments across a variety of cross-modal datasets and comparisons with state-of-the-art cross-modal algorithms show the usefulness of our approach.
CV · Oct 13, 2019
A Novel Self-Supervised Re-labeling Approach for Training with Noisy Labels
Devraj Mandal, Shrisha Bharadwaj, Soma Biswas
The major driving force behind the immense success of deep learning models is the availability of large datasets along with their clean labels. Unfortunately, these are very difficult to obtain, which has motivated research on the training of deep models in the presence of label noise and on ways to avoid over-fitting on the noisy labels. In this work, we build upon the seminal work in this area, Co-teaching, and propose a simple yet efficient approach termed mCT-S2R (modified co-teaching with self-supervision and re-labeling) for this task. First, to deal with a significant amount of noise in the labels, we propose to use self-supervision to generate robust features without using any labels. Next, using a parallel network architecture, an estimate of the clean labeled portion of the data is obtained. Finally, using this data, a part of the estimated noisy labeled portion is re-labeled, before resuming the network training with the augmented data. Extensive experiments on three standard datasets show the effectiveness of the proposed framework.
CV
May 27, 2019
Label Prediction Framework for Semi-Supervised Cross-Modal Retrieval
Devraj Mandal, Pramod Rao, Soma Biswas
Cross-modal data matching refers to the retrieval of data from one modality when given a query from another modality. In general, supervised algorithms achieve better retrieval performance than their unsupervised counterparts, as they can learn more representative features by leveraging the available label information. However, this comes at the cost of requiring a huge number of labeled examples, which may not always be available. In this work, we propose a novel framework in a semi-supervised setting, which can predict the labels of the unlabeled data using complementary information from different modalities. The proposed framework can be used as an add-on with any baseline cross-modal algorithm to give a significant performance improvement, even in the case of limited labeled data. Finally, we analyze the challenging scenario where the unlabeled examples can even come from classes not in the training data, and evaluate the performance of our algorithm under such a setting. Extensive evaluation using several baseline algorithms across three different datasets shows the effectiveness of our label prediction framework.
CV
May 11, 2019
Multi-class Novelty Detection Using Mix-up Technique
Supritam Bhattacharjee, Devraj Mandal, Soma Biswas
Multi-class novelty detection is becoming an increasingly important area of research due to the continuous increase in the number of object categories. It tries to answer the pertinent question: given a test sample, should we even try to classify it? We propose a novel solution using the mixup technique for novelty detection, termed Segregation Network. During training, a pair of examples is selected from the training data, and an interpolated data point is constructed as their convex combination. We develop a suitable loss function to train our model to predict the constituent classes of this interpolated point. During testing, each input query is combined with the known class prototypes to generate mixed samples, which are then passed through the trained network. Our model, which is trained to reveal the constituent classes, can then be used to determine whether the sample is novel or not. The intuition is that if a query comes from a known class and is mixed with the set of known class prototypes, then the prediction of the trained model for the correct class should be high. In contrast, for a query from a novel class, the predictions for all the known classes should be low. The proposed model is trained using only the available known class data and does not need access to any auxiliary dataset or attributes. Extensive experiments on two benchmark datasets, namely Caltech 256 and Stanford Dogs, and comparisons with the state-of-the-art algorithms justify the usefulness of our approach.
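The mixing and test-time decision described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for the trained Segregation Network (it scores a sample by its closeness to each prototype), and the mixing weight `lam`, the decision `threshold` and the 2-D prototypes are invented for the example.

```python
import math


def mix(x1: list, x2: list, lam: float = 0.5) -> list:
    """Convex combination lam * x1 + (1 - lam) * x2 of two feature vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]


def toy_predict(sample: list, prototypes: list) -> list:
    """Hypothetical stand-in for the trained network: one score in (0, 1] per known class."""
    return [math.exp(-math.dist(sample, p)) for p in prototypes]


def is_novel(query: list, prototypes: list, threshold: float = 0.5) -> bool:
    """Mix the query with every known prototype; flag it as novel if no class score is high."""
    best = 0.0
    for proto in prototypes:
        scores = toy_predict(mix(query, proto), prototypes)
        best = max(best, max(scores))
    return best < threshold


# Hypothetical prototypes of two known classes in a 2-D feature space.
prototypes = [[0.0, 0.0], [4.0, 0.0]]

print(is_novel([0.1, 0.0], prototypes))    # query close to a known class
print(is_novel([20.0, 20.0], prototypes))  # query far from every known class
```

With a real network, `toy_predict` would be replaced by a forward pass whose outputs indicate which known classes are present in the mixed sample.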
CV
May 11, 2019
Unified Generator-Classifier for Efficient Zero-Shot Learning
Ayyappa Kumar Pambala, Titir Dutta, Soma Biswas
Generative models have achieved state-of-the-art performance for the zero-shot learning problem, but they require re-training the classifier every time a new object category is encountered. The traditional semantic embedding approaches, though very elegant, usually do not perform on par with their generative counterparts. In this work, we propose a unified framework termed GenClass, which integrates the generator with the classifier for efficient zero-shot learning, thus combining the representative power of the generative approaches and the elegance of the embedding approaches. End-to-end training of the unified framework not only eliminates the need for an additional classifier for new object categories, as required in the generative approaches, but also facilitates the generation of more discriminative and useful features. Extensive evaluation on three standard zero-shot object classification datasets, namely AWA, CUB and SUN, shows the effectiveness of the proposed approach. The approach, without any modification, also gives state-of-the-art performance for zero-shot action classification, thus showing its generalizability to other domains.
CV
Dec 4, 2018
Semi-Supervised Cross-Modal Retrieval with Label Prediction
Devraj Mandal, Pramod Rao, Soma Biswas
Due to the abundance of data from multiple modalities, cross-modal retrieval tasks with image-text, audio-image, etc. are gaining increasing importance. Of the different approaches proposed, supervised methods usually give a significant improvement over their unsupervised counterparts, at the additional cost of labeling or annotating the training data. Semi-supervised methods have recently become popular, as they provide an elegant framework to balance the conflicting requirements of labeling cost and accuracy. In this work, we propose a novel deep semi-supervised framework which can seamlessly handle both labeled and unlabeled data. The network has two important components: (a) the label prediction component predicts the labels for the unlabeled portion of the data, and then (b) a common modality-invariant representation is learned for cross-modal retrieval. The two parts of the network are trained sequentially, one after the other. Extensive experiments on three standard benchmark datasets, Wiki, Pascal VOC and NUS-WIDE, demonstrate that the proposed framework outperforms the state-of-the-art for both supervised and semi-supervised settings.
CV
Mar 8, 2018
Preserving Semantic Relations for Zero-Shot Learning
Yashas Annadani, Soma Biswas
Zero-shot learning has gained popularity due to its potential to scale recognition models without requiring additional training data. This is usually achieved by associating categories with their semantic information like attributes. However, we believe that the potential offered by this paradigm is not yet fully exploited. In this work, we propose to utilize the structure of the space spanned by the attributes using a set of relations. We devise objective functions to preserve these relations in the embedding space, thereby inducing semanticity to the embedding space. Through extensive experimental evaluation on five benchmark datasets, we demonstrate that inducing semanticity to the embedding space is beneficial for zero-shot learning. The proposed approach outperforms the state-of-the-art on the standard zero-shot setting as well as the more realistic generalized zero-shot setting. We also demonstrate how the proposed approach can be useful for making approximate semantic inferences about an image belonging to a category for which attribute information is not available.
CV
Nov 1, 2016
Sliding Dictionary Based Sparse Representation For Action Recognition
Yashas Annadani, D L Rakshith, Soma Biswas
The task of action recognition has been at the forefront of research, given its applications in gaming, surveillance and health care. In this work, we propose a simple yet very effective approach which works seamlessly for both offline and online action recognition using the skeletal joints. We construct a sliding dictionary which contains the training data along with their time stamps. This is used to compute the sparse coefficients of the input action sequence, which is divided into overlapping windows, and each window gives a probability score for each action class. In addition, we compute another simple feature, which calibrates each of the action sequences to the training sequences and models the deviation of the action from each of the training sequences. Finally, a score-level fusion of the two heterogeneous but complementary features is obtained for each window, and the scores for the available windows are successively combined to give the confidence scores of each action class. This way of combining the scores makes the approach suitable for scenarios where only part of the sequence is available. Extensive experimental evaluation on three publicly available datasets shows the effectiveness of the proposed approach for both offline and online action recognition tasks.
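The windowing and successive score combination can be sketched as below. The abstract does not state how window scores are combined, so the running mean used here is an assumption, as are the window size, the stride, the function names and the per-window scores.

```python
def overlapping_windows(sequence: list, size: int, stride: int) -> list:
    """Split a sequence into windows of length `size` that start every `stride` frames."""
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, stride)]


def running_fusion(window_scores: list) -> list:
    """After each window, return the mean class scores over all windows seen so far."""
    totals = None
    fused = []
    for t, scores in enumerate(window_scores, start=1):
        if totals is None:
            totals = list(scores)
        else:
            totals = [a + b for a, b in zip(totals, scores)]
        fused.append([v / t for v in totals])
    return fused


frames = list(range(6))  # stand-in for six skeleton frames
print(overlapping_windows(frames, size=4, stride=2))

# Hypothetical per-window scores for two action classes.
scores = [[0.2, 0.8], [0.6, 0.4]]
print(running_fusion(scores))
```

Because a fused score is available after every window, a prediction can be read off at any point in the sequence, which is what makes this style of combination usable online.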
CV
Sep 22, 2014
Temporally Coherent Bayesian Models for Entity Discovery in Videos by Tracklet Clustering
Adway Mitra, Soma Biswas, Chiranjib Bhattacharyya
A video can be represented as a sequence of tracklets, each spanning 10-20 frames and associated with one entity (e.g., a person). The task of Entity Discovery in videos can be naturally posed as tracklet clustering. We approach this task by leveraging Temporal Coherence (TC): the fundamental property of videos that each tracklet is likely to be associated with the same entity as its temporal neighbors. Our major contributions are the first Bayesian nonparametric models for TC at the tracklet level. We extend the Chinese Restaurant Process (CRP) to propose TC-CRP, and further extend it to the Temporally Coherent Chinese Restaurant Franchise (TC-CRF) to jointly model short temporal segments. On the task of discovering persons in TV serial videos without meta-data such as scripts, these methods show considerable improvement in cluster purity and person coverage compared to state-of-the-art approaches to tracklet clustering. We represent entities with mixture components, and tracklets with vectors of very generic features, which can work for any type of entity (not necessarily a person). Unlike existing approaches, the proposed methods can perform online tracklet clustering on streaming videos with little performance deterioration, and can automatically reject tracklets resulting from false detections. Finally, we discuss entity-driven video summarization, where some temporal segments of the video are selected automatically based on the discovered entities.
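For readers unfamiliar with the Chinese Restaurant Process that TC-CRP builds on, the sketch below shows the standard CRP prior only: a new item joins an existing cluster with probability proportional to the cluster's size, or opens a new cluster with probability proportional to a concentration parameter `alpha`. The temporal-coherence extension is not specified in the abstract and is not modeled here; the function names, `alpha` value and seed are illustrative assumptions.

```python
import random


def crp_probabilities(cluster_counts: list, alpha: float) -> list:
    """Standard CRP: probability of each existing cluster, then of a new cluster (last entry)."""
    denom = sum(cluster_counts) + alpha
    probs = [count / denom for count in cluster_counts]
    probs.append(alpha / denom)
    return probs


def sample_crp(num_items: int, alpha: float, seed: int = 0) -> list:
    """Sequentially assign items (e.g. tracklets) to clusters under the CRP prior."""
    rng = random.Random(seed)
    counts, assignments = [], []
    for _ in range(num_items):
        probs = crp_probabilities(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1    # join an existing cluster
        assignments.append(k)
    return assignments


# Two clusters holding 3 and 1 items, alpha = 1: probabilities 3/5, 1/5 and 1/5.
print(crp_probabilities([3, 1], alpha=1.0))
print(sample_crp(10, alpha=1.0))
```

The number of clusters is not fixed in advance, which fits entity discovery, where the number of persons in a video is unknown.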