LGAug 27, 2025
Coresets from Trajectories: Selecting Data via Correlation of Loss DifferencesManish Nagaraj, Deepak Ravikumar, Kaushik Roy
Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences (CLD), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. CLD is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for CLD-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1% of more computationally expensive baselines even when not leading. CLD transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with <1% degradation. Moreover, CLD is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, CLD exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make CLD a principled, efficient, stable, and transferable tool for scalable dataset optimization.
LGJul 11, 2023
Memorization Through the Lens of Curvature of Loss Function Around SamplesIsha Garg, Deepak Ravikumar, Kaushik Roy
Deep neural networks are over-parameterized and easily overfit the datasets they train on. In the extreme case, it has been shown that these networks can memorize a training set with fully randomized labels. We propose using the curvature of loss function around each training sample, averaged over training epochs, as a measure of memorization of the sample. We use this metric to study the generalization versus memorization properties of different samples in popular image datasets and show that it captures memorization statistics well, both qualitatively and quantitatively. We first show that the high curvature samples visually correspond to long-tailed, mislabeled, or conflicting samples, those that are most likely to be memorized. This analysis helps us find, to the best of our knowledge, a novel failure mode on the CIFAR100 and ImageNet datasets: that of duplicated images with differing labels. Quantitatively, we corroborate the validity of our scores via two methods. First, we validate our scores against an independent and comprehensively calculated baseline, by showing high cosine similarity with the memorization scores released by Feldman and Zhang (2020). Second, we inject corrupted samples which are memorized by the network, and show that these are learned with high curvature. To this end, we synthetically mislabel a random subset of the dataset. We overfit a network to it and show that sorting by curvature yields high AUROC values for identifying the corrupted samples. An added advantage of our method is that it is scalable, as it requires training only a single network as opposed to the thousands trained by the baseline, while capturing the aforementioned failure mode that the baseline fails to identify.
LGApr 9, 2023
Homogenizing Non-IID datasets via In-Distribution Knowledge Distillation for Decentralized LearningDeepak Ravikumar, Gobinda Saha, Sai Aparna Aketi et al.
Decentralized learning enables serverless training of deep neural networks (DNNs) in a distributed manner on multiple nodes. This allows for the use of large datasets, as well as the ability to train with a wide variety of data sources. However, one of the key challenges with decentralized learning is heterogeneity in the data distribution across the nodes. In this paper, we propose In-Distribution Knowledge Distillation (IDKD) to address the challenge of heterogeneous data distribution. The goal of IDKD is to homogenize the data distribution across the nodes. While such data homogenization can be achieved by exchanging data among the nodes sacrificing privacy, IDKD achieves the same objective using a common public dataset across nodes without breaking the privacy constraint. This public dataset is different from the training dataset and is used to distill the knowledge from each node and communicate it to its neighbors through the generated labels. With traditional knowledge distillation, the generalization of the distilled model is reduced because all the public dataset samples are used irrespective of their similarity to the local dataset. Thus, we introduce an Out-of-Distribution (OoD) detector at each node to label a subset of the public dataset that maps close to the local training data distribution. Finally, only labels corresponding to these subsets are exchanged among the nodes and with appropriate label averaging each node is finetuned on these data subsets along with its local data. Our experiments on multiple image classification datasets and graph topologies show that the proposed IDKD scheme is more effective than traditional knowledge distillation and achieves state-of-the-art generalization performance on heterogeneously distributed data with minimal communication overhead.
LGMay 6, 2022
Norm-Scaling for Out-of-Distribution DetectionDeepak Ravikumar, Kaushik Roy
Out-of-Distribution (OoD) inputs are examples that do not belong to the true underlying distribution of the dataset. Research has shown that deep neural nets make confident mispredictions on OoD inputs. Therefore, it is critical to identify OoD inputs for safe and reliable deployment of deep neural nets. Often a threshold is applied on a similarity score to detect OoD inputs. One such similarity is angular similarity which is the dot product of latent representation with the mean class representation. Angular similarity encodes uncertainty, for example, if the angular similarity is less, it is less certain that the input belongs to that class. However, we observe that, different classes have different distributions of angular similarity. Therefore, applying a single threshold for all classes is not ideal since the same similarity score represents different uncertainties for different classes. In this paper, we propose norm-scaling which normalizes the logits separately for each class. This ensures that a single value consistently represents similar uncertainty for various classes. We show that norm-scaling, when used with maximum softmax probability detector, achieves 9.78% improvement in AUROC, 5.99% improvement in AUPR and 33.19% reduction in FPR95 metrics over previous state-of-the-art methods.
CVJul 2, 2024
Advancing Compressed Video Action Recognition through Progressive Knowledge DistillationEfstathia Soufleri, Deepak Ravikumar, Kaushik Roy
Compressed video action recognition classifies video samples by leveraging the different modalities in compressed videos, namely motion vectors, residuals, and intra-frames. For this purpose, three neural networks are deployed, each dedicated to processing one modality. Our observations indicate that the network processing intra-frames tend to converge to a flatter minimum than the network processing residuals, which in turn converges to a flatter minimum than the motion vector network. This hierarchy in convergence motivates our strategy for knowledge transfer among modalities to achieve flatter minima, which are generally associated with better generalization. With this insight, we propose Progressive Knowledge Distillation (PKD), a technique that incrementally transfers knowledge across the modalities. This method involves attaching early exits (Internal Classifiers - ICs) to the three networks. PKD distills knowledge starting from the motion vector network, followed by the residual, and finally, the intra-frame network, sequentially improving IC accuracy. Further, we propose the Weighted Inference with Scaled Ensemble (WISE), which combines outputs from the ICs using learned weights, boosting accuracy during inference. Our experiments demonstrate the effectiveness of training the ICs with PKD compared to standard cross-entropy-based training, showing IC accuracy improvements of up to 5.87% and 11.42% on the UCF-101 and HMDB-51 datasets, respectively. Additionally, WISE improves accuracy by up to 4.28% and 9.30% on UCF-101 and HMDB-51, respectively.
LGJul 3, 2024
Curvature Clues: Decoding Deep Learning Privacy with Input Loss CurvatureDeepak Ravikumar, Efstathia Soufleri, Kaushik Roy
In this paper, we explore the properties of loss curvature with respect to input data in deep neural networks. Curvature of loss with respect to input (termed input loss curvature) is the trace of the Hessian of the loss with respect to the input. We investigate how input loss curvature varies between train and test sets, and its implications for train-test distinguishability. We develop a theoretical framework that derives an upper bound on the train-test distinguishability based on privacy and the size of the training set. This novel insight fuels the development of a new black box membership inference attack utilizing input loss curvature. We validate our theoretical findings through experiments in computer vision classification tasks, demonstrating that input loss curvature surpasses existing methods in membership inference effectiveness. Our analysis highlights how the performance of membership inference attack (MIA) methods varies with the size of the training set, showing that curvature-based MIA outperforms other methods on sufficiently large datasets. This condition is often met by real datasets, as demonstrated by our results on CIFAR10, CIFAR100, and ImageNet. These findings not only advance our understanding of deep neural network behavior but also improve the ability to test privacy-preserving techniques in machine learning.
LGDec 7, 2025
GradientSpace: Unsupervised Data Clustering for Improved Instruction TuningShrihari Sridharan, Deepak Ravikumar, Anand Raghunathan et al.
Instruction tuning is one of the key steps required for adapting large language models (LLMs) to a broad spectrum of downstream applications. However, this procedure is difficult because real-world datasets are rarely homogeneous; they consist of a mixture of diverse information, causing gradient interference, where conflicting gradients pull the model in opposing directions, degrading performance. A common strategy to mitigate this issue is to group data based on semantic or embedding similarity. However, this fails to capture how data influences model parameters during learning. While recent works have attempted to cluster gradients directly, they randomly project gradients into lower dimensions to manage memory, which leads to accuracy loss. Moreover, these methods rely on expert ensembles which necessitates multiple inference passes and expensive on-the-fly gradient computations during inference. To address these limitations, we propose GradientSpace, a framework that clusters samples directly in full-dimensional gradient space. We introduce an online SVD-based algorithm that operates on LoRA gradients to identify latent skills without the infeasible cost of storing all sample gradients. Each cluster is used to train a specialized LoRA expert along with a lightweight router trained to select the best expert during inference. We show that routing to a single, appropriate expert outperforms expert ensembles used in prior work, while significantly reducing inference latency. Our experiments across mathematical reasoning, code generation, finance, and creative writing tasks demonstrate that GradientSpace leads to coherent expert specialization and consistent accuracy gains over state-of-the-art clustering methods and finetuning techniques.
CVFeb 16
Loss Knows Best: Detecting Annotation Errors in Videos via Loss TrajectoriesPraditha Alwis, Soumyadeep Chandra, Deepak Ravikumar et al.
High-quality video datasets are foundational for training robust models in tasks like action recognition, phase detection, and event segmentation. However, many real-world video datasets suffer from annotation errors such as *mislabeling*, where segments are assigned incorrect class labels, and *disordering*, where the temporal sequence does not follow the correct progression. These errors are particularly harmful in phase-annotated tasks, where temporal consistency is critical. We propose a novel, model-agnostic method for detecting annotation errors by analyzing the Cumulative Sample Loss (CSL)--defined as the average loss a frame incurs when passing through model checkpoints saved across training epochs. This per-frame loss trajectory acts as a dynamic fingerprint of frame-level learnability. Mislabeled or disordered frames tend to show consistently high or irregular loss patterns, as they remain difficult for the model to learn throughout training, while correctly labeled frames typically converge to low loss early. To compute CSL, we train a video segmentation model and store its weights at each epoch. These checkpoints are then used to evaluate the loss of each frame in a test video. Frames with persistently high CSL are flagged as likely candidates for annotation errors, including mislabeling or temporal misalignment. Our method does not require ground truth on annotation errors and is generalizable across datasets. Experiments on EgoPER and Cholec80 demonstrate strong detection performance, effectively identifying subtle inconsistencies such as mislabeling and frame disordering. The proposed approach provides a powerful tool for dataset auditing and improving training reliability in video-based machine learning.
LGFeb 28, 2024
Unveiling Privacy, Memorization, and Input Curvature LinksDeepak Ravikumar, Efstathia Soufleri, Abolfazl Hashemi et al.
Deep Neural Nets (DNNs) have become a pervasive tool for solving many emerging problems. However, they tend to overfit to and memorize the training set. Memorization is of keen interest since it is closely related to several concepts such as generalization, noisy learning, and privacy. To study memorization, Feldman (2019) proposed a formal score, however its computational requirements limit its practical use. Recent research has shown empirical evidence linking input loss curvature (measured by the trace of the loss Hessian w.r.t inputs) and memorization. It was shown to be ~3 orders of magnitude more efficient than calculating the memorization score. However, there is a lack of theoretical understanding linking memorization with input loss curvature. In this paper, we not only investigate this connection but also extend our analysis to establish theoretical links between differential privacy, memorization, and input loss curvature. First, we derive an upper bound on memorization characterized by both differential privacy and input loss curvature. Second, we present a novel insight showing that input loss curvature is upper-bounded by the differential privacy parameter. Our theoretical findings are further empirically validated using deep models on CIFAR and ImageNet datasets, showing a strong correlation between our theoretical predictions and results observed in practice.
LGMar 13, 2024
SAP: Corrective Machine Unlearning with Scaled Activation Projection for Label Noise RobustnessSangamesh Kodge, Deepak Ravikumar, Gobinda Saha et al.
Label corruption, where training samples are mislabeled due to non-expert annotation or adversarial attacks, significantly degrades model performance. Acquiring large, perfectly labeled datasets is costly, and retraining models from scratch is computationally expensive. To address this, we introduce Scaled Activation Projection (SAP), a novel SVD (Singular Value Decomposition)-based corrective machine unlearning algorithm. SAP mitigates label noise by identifying a small subset of trusted samples using cross-entropy loss and projecting model weights onto a clean activation space estimated using SVD on these trusted samples. This process suppresses the noise introduced in activations due to the mislabeled samples. In our experiments, we demonstrate SAP's effectiveness on synthetic noise with different settings and real-world label noise. SAP applied to the CIFAR dataset with 25% synthetic corruption show upto 6% generalization improvements. Additionally, SAP can improve the generalization over noise robust training approaches on CIFAR dataset by ~3.2% on average. Further, we observe generalization improvements of 2.31% for a Vision Transformer model trained on naturally corrupted Clothing1M.
LGOct 13, 2025
The Easy Path to Robustness: Coreset Selection using Sample HardnessPranav Ramesh, Arjun Roy, Deepak Ravikumar et al.
Designing adversarially robust models from a data-centric perspective requires understanding which input samples are most crucial for learning resilient features. While coreset selection provides a mechanism for efficient training on data subsets, current algorithms are designed for clean accuracy and fall short in preserving robustness. To address this, we propose a framework linking a sample's adversarial vulnerability to its \textit{hardness}, which we quantify using the average input gradient norm (AIGN) over training. We demonstrate that \textit{easy} samples (with low AIGN) are less vulnerable and occupy regions further from the decision boundary. Leveraging this insight, we present EasyCore, a coreset selection algorithm that retains only the samples with low AIGN for training. We empirically show that models trained on EasyCore-selected data achieve significantly higher adversarial accuracy than those trained with competing coreset methods under both standard and adversarial training. As AIGN is a model-agnostic dataset property, EasyCore is an efficient and widely applicable data-centric method for improving adversarial robustness. We show that EasyCore achieves up to 7\% and 5\% improvement in adversarial accuracy under standard training and TRADES adversarial training, respectively, compared to existing coreset methods.
CLOct 8, 2025
TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction TuningManish Nagaraj, Sakshi Choudhary, Utkarsh Saxena et al.
Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based "fingerprints" from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.
LGMar 12, 2025
Finding the Muses: Identifying Coresets through Loss TrajectoriesManish Nagaraj, Deepak Ravikumar, Efstathia Soufleri et al.
Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Loss Trajectory Correlation (LTC), a novel metric for coreset selection that identifies critical training samples driving generalization. $LTC$ quantifies the alignment between training sample loss trajectories and validation set loss trajectories, enabling the construction of compact, representative subsets. Unlike traditional methods with computational and storage overheads that are infeasible to scale to large datasets, $LTC$ achieves superior efficiency as it can be computed as a byproduct of training. Our results on CIFAR-100 and ImageNet-1k show that $LTC$ consistently achieves accuracy on par with or surpassing state-of-the-art coreset selection methods, with any differences remaining under 1%. LTC also effectively transfers across various architectures, including ResNet, VGG, DenseNet, and Swin Transformer, with minimal performance degradation (<2%). Additionally, LTC offers insights into training dynamics, such as identifying aligned and conflicting sample behaviors, at a fraction of the computational cost of traditional methods. This framework paves the way for scalable coreset selection and efficient dataset optimization.
LGDec 15, 2020
Exploring Vicinal Risk Minimization for Lightweight Out-of-Distribution DetectionDeepak Ravikumar, Sangamesh Kodge, Isha Garg et al.
Deep neural networks have found widespread adoption in solving complex tasks ranging from image recognition to natural language processing. However, these networks make confident mispredictions when presented with data that does not belong to the training distribution, i.e. out-of-distribution (OoD) samples. In this paper we explore whether the property of Vicinal Risk Minimization (VRM) to smoothly interpolate between different class boundaries helps to train better OoD detectors. We apply VRM to existing OoD detection techniques and show their improved performance. We observe that existing OoD detectors have significant memory and compute overhead, hence we leverage VRM to develop an OoD detector with minimal overheard. Our detection method introduces an auxiliary class for classifying OoD samples. We utilize mixup in two ways to implement Vicinal Risk Minimization. First, we perform mixup within the same class and second, we perform mixup with Gaussian noise when training the auxiliary class. Our method achieves near competitive performance with significantly less compute and memory overhead when compared to existing OoD detection techniques. This facilitates the deployment of OoD detection on edge devices and expands our understanding of Vicinal Risk Minimization for use in training OoD detectors.
LGAug 4, 2020
TREND: Transferability based Robust ENsemble DesignDeepak Ravikumar, Sangamesh Kodge, Isha Garg et al.
Deep Learning models hold state-of-the-art performance in many fields, but their vulnerability to adversarial examples poses threat to their ubiquitous deployment in practical settings. Additionally, adversarial inputs generated on one classifier have been shown to transfer to other classifiers trained on similar data, which makes the attacks possible even if model parameters are not revealed to the adversary. This property of transferability has not yet been systematically studied, leading to a gap in our understanding of robustness of neural networks to adversarial inputs. In this work, we study the effect of network architecture, initialization, optimizer, input, weight and activation quantization on transferability of adversarial samples. We also study the effect of different attacks on transferability. Our experiments reveal that transferability is significantly hampered by input quantization and architectural mismatch between source and target, is unaffected by initialization but the choice of optimizer turns out to be critical. We observe that transferability is architecture-dependent for both weight and activation quantized models. To quantify transferability, we use simple metric and demonstrate the utility of the metric in designing a methodology to build ensembles with improved adversarial robustness. When attacking ensembles we observe that "gradient domination" by a single ensemble member model hampers existing attacks. To combat this we propose a new state-of-the-art ensemble attack. We compare the proposed attack with existing attack techniques to show its effectiveness. Finally, we show that an ensemble consisting of carefully chosen diverse networks achieves better adversarial robustness than would otherwise be possible with a single network.