Madan Ravi Ganesh

CV
h-index: 12
14 papers
84 citations
Novelty: 46%
AI Score: 43

14 Papers

CV Apr 14, 2022
Q-TART: Quickly Training for Adversarial Robustness and in-Transferability

Madan Ravi Ganesh, Salimeh Yasaei Sekeh, Jason J. Corso

Raw deep neural network (DNN) performance is not enough; in real-world settings, computational load, training efficiency and adversarial security are just as important, if not more so. We propose to simultaneously tackle Performance, Efficiency, and Robustness, using our proposed algorithm Q-TART, Quickly Train for Adversarial Robustness and in-Transferability. Q-TART follows the intuition that samples highly susceptible to noise strongly affect the decision boundaries learned by DNNs, which in turn degrades their performance and increases their adversarial susceptibility. By identifying and removing such samples, we demonstrate improved performance and adversarial robustness while using only a subset of the training data. Through our experiments we highlight Q-TART's high performance across multiple Dataset-DNN combinations, including ImageNet, and provide insights into the complementary behavior of Q-TART alongside existing adversarial training approaches to increase robustness by over 1.3% while using up to 17.9% less training time.

CV Oct 9, 2023
Text-driven Prompt Generation for Vision-Language Models in Federated Learning

Chen Qiu, Xingyu Li, Chaithanya Kumar Mummadi et al.

Prompt learning for vision-language models, e.g., CoOp, has shown great success in adapting CLIP to different downstream tasks, making it a promising solution for federated learning for computational reasons. Existing prompt learning techniques replace hand-crafted text prompts with learned vectors that offer improvements on seen classes, but struggle to generalize to unseen classes. Our work addresses this challenge by proposing Federated Text-driven Prompt Generation (FedTPG), which learns a unified prompt generation network across multiple remote clients in a scalable manner. The prompt generation network is conditioned on task-related text input and is thus context-aware, making it suitable to generalize for both seen and unseen classes. Our comprehensive empirical evaluations on nine diverse image classification datasets show that our method is superior to existing federated prompt learning methods, achieving better overall generalization on both seen and unseen classes, and that it also generalizes to unseen datasets.

LG Nov 14, 2023
Leveraging Foundation Models to Improve Lightweight Clients in Federated Learning

Xidong Wu, Wan-Yi Lin, Devin Willmott et al.

Federated Learning (FL) is a distributed training paradigm that enables clients scattered across the world to cooperatively learn a global model without divulging confidential data. However, FL faces a significant challenge in the form of heterogeneous data distributions among clients, which leads to a reduction in performance and robustness. A recent approach to mitigating the impact of heterogeneous data distributions is through the use of foundation models, which offer better performance at the cost of larger computational overheads and slower inference speeds. We introduce foundation model distillation to assist in the federated training of lightweight client models and increase their performance under heterogeneous data settings while keeping inference costs low. Our results show improvement in the global model performance on a balanced testing set, which contains rarely observed samples, even under extreme non-IID client data distributions. We conduct a thorough evaluation of our framework with different foundation model backbones on CIFAR10, with varying degrees of heterogeneous data distributions ranging from class-specific data partitions across clients to Dirichlet data sampling, parameterized by values between 0.01 and 1.0.
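
The Dirichlet data sampling mentioned at the end of this abstract is a common way to simulate non-IID clients. The sketch below is not taken from the paper: it is a minimal standard-library illustration, and the function name `dirichlet_partition`, its parameters, and the fallback for underflow are assumptions made for this example. For each class, it draws client proportions from a symmetric Dirichlet distribution with concentration `alpha` and splits that class's samples accordingly, so smaller `alpha` values give more skewed clients.

```python
import random


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients, class by class.

    Hypothetical helper: per-class client proportions are drawn from a
    symmetric Dirichlet(alpha) distribution.
    """
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]

    # Group sample indices by class label.
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet sample is a set of independent Gamma(alpha, 1)
        # draws divided by their sum.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        if total <= 0.0:
            # Very small alpha can underflow to all zeros; fall back to
            # giving the whole class to one randomly chosen client.
            draws = [0.0] * num_clients
            draws[rng.randrange(num_clients)] = 1.0
            total = 1.0
        props = [d / total for d in draws]

        # Turn proportions into contiguous slices of this class's indices.
        start, cum = 0, 0.0
        for c in range(num_clients):
            cum += props[c]
            if c == num_clients - 1:
                end = len(idxs)  # last client takes the remainder
            else:
                end = min(len(idxs), round(cum * len(idxs)))
            end = max(end, start)
            clients[c].extend(idxs[start:end])
            start = end
    return clients
```

Calling `dirichlet_partition(labels, num_clients=5, alpha=0.01)` would tend to concentrate each class on very few clients, while `alpha=1.0` spreads classes more evenly; every sample index is assigned to exactly one client in both cases.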

35.9 CV Apr 28
The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation

Muhammad Ali, Kevin Alexander Laube, Madan Ravi Ganesh et al.

Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typically evaluated under fixed iteration schedules. These objectives substantially increase per-iteration cost, meaning equal iteration counts do not correspond to equal training budgets. It is therefore unclear whether reported gains reflect stronger distillation signals or simply greater compute. We show that iteration-based comparisons are misleading: when wall-clock compute is matched, canonical logit- and feature-based KD outperform recent segmentation-specific methods. Under extended training, feature-based distillation achieves state-of-the-art ResNet-18 performance on Cityscapes and ADE20K. A PSPNet ResNet-18 student closely approaches its ResNet-101 teacher despite using only one quarter of the parameters, reaching 99% of the teacher's mIoU on Cityscapes (79.0 vs. 79.8) and 92% on ADE20K. Our results challenge the prevailing assumption that KD for segmentation requires task-specific mechanisms and suggest that scaling, rather than complex hand-crafted objectives, should guide future method design.

CV Dec 21, 2024
HyperCLIP: Adapting Vision-Language models with Hypernetworks

Victor Akinwande, Mohammad Sadegh Norouzzadeh, Devin Willmott et al.

Self-supervised vision-language models trained with contrastive objectives form the basis of current state-of-the-art methods in AI vision tasks. The success of these models is a direct consequence of the huge web-scale datasets used to train them, but they require correspondingly large vision components to properly learn powerful and general representations from such a broad data domain. This poses a challenge for deploying large vision-language models, especially in resource-constrained environments. To address this, we propose an alternate vision-language architecture, called HyperCLIP, that uses a small image encoder along with a hypernetwork that dynamically adapts image encoder weights to each new set of text inputs. All three components of the model (hypernetwork, image encoder, and text encoder) are pre-trained jointly end-to-end, and with a trained HyperCLIP model, we can generate new zero-shot deployment-friendly image classifiers for any task with a single forward pass through the text encoder and hypernetwork. HyperCLIP increases the zero-shot accuracy of SigLIP trained models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100 with minimal training throughput overhead.

LG Jun 1, 2025
FedRPCA: Enhancing Federated LoRA Aggregation Using Robust PCA

Divyansh Jhunjhunwala, Arian Raje, Madan Ravi Ganesh et al.

LoRA has emerged as one of the most promising fine-tuning techniques, especially for federated learning (FL), since it significantly reduces communication and computation costs at resource-constrained clients. However, data heterogeneity remains a significant challenge for LoRA-based FL, and the conventional aggregation strategy based on FedAvg suffers from slow convergence and suboptimal accuracy. Motivated by recent advances in model merging, particularly Task Arithmetic, we explore the idea of aggregating client LoRA parameters using scaled averaging. We first observe that a naive application of Task Arithmetic is ineffective due to the high cosine similarity between client updates, indicating significant common knowledge in the updates across clients. To address this issue, we propose decomposing client LoRA updates via Robust Principal Component Analysis (Robust-PCA) into a common low-rank component and client-specific sparse components. Our proposed algorithm FedRPCA aggregates the low-rank components through averaging, consolidating common knowledge, and applies scaled averaging to the sparse components to amplify client-specific knowledge. We evaluate our approach across a variety of vision and language tasks and demonstrate that it achieves higher final accuracy and faster convergence compared to competing baselines.
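
The aggregation step described in this abstract can be sketched as follows. This is an illustration only, not the authors' implementation: it assumes the Robust-PCA decomposition has already been computed elsewhere, that each client update is a flat list of floats, and that the scaling factor is a single scalar, here given the hypothetical name `beta`.

```python
def aggregate_updates(low_rank, sparse, beta):
    """Combine per-client components into one aggregated update.

    Assumptions for this sketch: low_rank[i] and sparse[i] are the common
    and client-specific parts of client i's flattened update, as produced
    by some Robust-PCA routine that is not shown here.
    """
    n = len(low_rank)
    dim = len(low_rank[0])
    aggregated = []
    for j in range(dim):
        # Plain averaging consolidates the shared (low-rank) knowledge.
        common = sum(component[j] for component in low_rank) / n
        # Scaled averaging amplifies the client-specific (sparse) part.
        specific = sum(component[j] for component in sparse) / n
        aggregated.append(common + beta * specific)
    return aggregated


low_rank = [[1.0, 2.0], [1.0, 2.0]]
sparse = [[0.5, 0.0], [0.0, -1.0]]
print(aggregate_updates(low_rank, sparse, beta=2.0))  # [1.5, 1.0]
```

With `beta=1.0` this sketch reduces to averaging the full client updates, since each update is the sum of its two components; values above 1 give more weight to what differs between clients.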

CV Jun 22, 2020
Slimming Neural Networks using Adaptive Connectivity Scores

Madan Ravi Ganesh, Dawsin Blanchard, Jason J. Corso et al.

In general, deep neural network (DNN) pruning methods fall into two categories: 1) Weight-based deterministic constraints, and 2) Probabilistic frameworks. While each approach has its merits and limitations, there is a set of common practical issues, such as trial-and-error to analyze sensitivity and hyper-parameters to prune DNNs, which plague them both. In this work, we propose a new single-shot, fully automated pruning algorithm called Slimming Neural networks using Adaptive Connectivity Scores (SNACS). Our proposed approach combines a probabilistic pruning framework with constraints on the underlying weight matrices, via a novel connectivity measure, at multiple levels to capitalize on the strengths of both approaches while solving their deficiencies. In SNACS, we propose a fast hash-based estimator of Adaptive Conditional Mutual Information (ACMI), that uses a weight-based scaling criterion, to evaluate the connectivity between filters and prune unimportant ones. To automatically determine the limit up to which a layer can be pruned, we propose a set of operating constraints that jointly define the upper pruning percentage limits across all the layers in a deep network. Finally, we define a novel sensitivity criterion for filters that measures the strength of their contributions to the succeeding layer and highlights critical filters that need to be completely protected from pruning. Through our experimental validation we show that SNACS is over 17x faster than the nearest comparable method and is the state-of-the-art single-shot pruning method across three standard Dataset-DNN pruning benchmarks: CIFAR10-VGG16, CIFAR10-ResNet56 and ILSVRC2012-ResNet50.

LG Mar 18, 2020
MINT: Deep Network Compression via Mutual Information-based Neuron Trimming

Madan Ravi Ganesh, Jason J. Corso, Salimeh Yasaei Sekeh

Most approaches to deep neural network compression via pruning either evaluate a filter's importance using its weights or optimize an alternative objective function with sparsity constraints. While these methods offer a useful way to approximate contributions from similar filters, they often either ignore the dependency between layers or solve a more difficult optimization objective than standard cross-entropy. Our method, Mutual Information-based Neuron Trimming (MINT), approaches deep compression via pruning by enforcing sparsity based on the strength of the relationship between filters of adjacent layers, across every pair of layers. The relationship is calculated using conditional geometric mutual information which evaluates the amount of similar information exchanged between the filters using a graph-based criterion. When pruning a network, we ensure that retained filters contribute the majority of the information towards succeeding layers which ensures high performance. Our novel approach outperforms existing state-of-the-art compression-via-pruning methods on the standard benchmarks for this task: MNIST, CIFAR-10, and ILSVRC2012, across a variety of network architectures. In addition, we discuss our observations of a common denominator between our pruning methodology's response to adversarial attacks and calibration statistics when compared to the original network.

CV Jan 13, 2020
Rethinking Curriculum Learning with Incremental Labels and Adaptive Compensation

Madan Ravi Ganesh, Jason J. Corso

Like humans, deep networks have been shown to learn better when samples are organized and introduced in a meaningful order or curriculum. Conventional curriculum learning schemes introduce samples in their order of difficulty. This forces models to begin learning from a subset of the available data while adding the external overhead of evaluating the difficulty of samples. In this work, we propose Learning with Incremental Labels and Adaptive Compensation (LILAC), a two-phase method that incrementally increases the number of unique output labels rather than the difficulty of samples while consistently using the entire dataset throughout training. In the first phase, Incremental Label Introduction, we partition data into mutually exclusive subsets, one that contains a subset of the ground-truth labels and another that contains the remaining data attached to a pseudo-label. Throughout the training process, we recursively reveal unseen ground-truth labels in fixed increments until all the labels are known to the model. In the second phase, Adaptive Compensation, we optimize the loss function using altered target vectors of previously misclassified samples. The target vectors of such samples are modified to a smoother distribution to help models learn better. On evaluating across three standard image benchmarks, CIFAR-10, CIFAR-100, and STL-10, we show that LILAC outperforms all comparable baselines. Further, we detail the importance of pacing the introduction of new labels to a model as well as the impact of using a smooth target vector.
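
The Adaptive Compensation phase described in this abstract replaces the target vector of a previously misclassified sample with a smoother distribution. The abstract does not give the exact form of that distribution, so the sketch below assumes a simple label-smoothing style target: the true class keeps probability `eps` and the remainder is spread evenly over the other classes. The function name `target_vector` and the parameter `eps` are hypothetical.

```python
def target_vector(label, num_classes, was_misclassified, eps=0.9):
    """Build the training target for one sample.

    Assumed smoothing rule (not confirmed by the abstract): the true class
    gets probability eps, the other classes share 1 - eps equally.
    """
    if not was_misclassified:
        # Correctly classified samples keep a standard one-hot target.
        return [1.0 if k == label else 0.0 for k in range(num_classes)]
    off_value = (1.0 - eps) / (num_classes - 1)
    return [eps if k == label else off_value for k in range(num_classes)]
```

In a training loop, `was_misclassified` would come from the model's prediction for that sample in an earlier epoch, so only samples the model previously got wrong receive the softened target.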

CV Oct 7, 2019
ViP: Video Platform for PyTorch

Madan Ravi Ganesh, Eric Hofesmann, Nathan Louis et al.

This work presents the Video Platform for PyTorch (ViP), a deep learning-based framework designed to handle and extend to any problem domain based on videos. ViP supports (1) a single unified interface applicable to all video problem domains, (2) quick prototyping of video models, (3) executing large-batch operations with reduced memory consumption, and (4) easy and reproducible experimental setups. ViP's core functionality is built with flexibility and modularity in mind to allow for smooth data flow between different parts of the platform and benchmarking against existing methods. In providing a software platform that supports multiple video-based problem domains, we allow for more cross-pollination of models, ideas and stronger generalization in the video understanding research community.

LG Oct 2, 2019
A Geometric Approach to Online Streaming Feature Selection

Salimeh Yasaei Sekeh, Madan Ravi Ganesh, Shurjo Banerjee et al.

Online Streaming Feature Selection (OSFS) is a sequential learning problem where individual features across all samples are made available to algorithms in a streaming fashion. In this work, firstly, we assert that OSFS's main assumption of having data from all the samples available at runtime is unrealistic and introduce a new setting where features and samples are streamed concurrently called OSFS with Streaming Samples (OSFS-SS). Secondly, the primary OSFS method, SAOLA, utilizes an unbounded mutual information measure and requires multiple comparison steps between the stored and incoming feature sets to evaluate a feature's importance. We introduce Geometric Online Adaption (GOA), an algorithm that requires relatively fewer feature comparison steps and uses a bounded conditional geometric dependency measure. Our algorithm outperforms several OSFS baselines including SAOLA on a variety of datasets. We also extend SAOLA to work in the OSFS-SS setting and show that GOA continues to achieve the best results. Thirdly, the current paradigm of the OSFS algorithm comparison is flawed. Algorithms are measured by comparing the number of features used and the accuracy obtained by the learner, two properties that are fundamentally at odds with one another. Without fixing a limit on either of these properties, the qualities of features obtained by different algorithms are incomparable. We try to rectify this inconsistency by fixing the maximum number of features available to the learner and comparing algorithms in terms of their accuracy. Additionally, we characterize the behaviour of SAOLA and GOA on feature sets derived from popular deep convolutional featurizers.

CV Apr 16, 2018
M-PACT: An Open Source Platform for Repeatable Activity Classification Research

Eric Hofesmann, Madan Ravi Ganesh, Jason J. Corso

Many hurdles prevent the replication of existing work, which hinders the development of new activity classification models. These hurdles include switching between multiple deep learning libraries and the development of boilerplate experimental pipelines. We present M-PACT to overcome existing issues by removing the need to develop boilerplate code, which allows users to quickly prototype action classification models while leveraging existing state-of-the-art (SOTA) models available in the platform. M-PACT is the first to offer four SOTA activity classification models, I3D, C3D, ResNet50+LSTM, and TSN, under a single platform with reproducible competitive results. This platform allows for the generation of models and results over activity recognition datasets through the use of modular code, various preprocessing and neural network layers, and seamless data flow. In this paper, we present the system architecture, detail the functions of various modules, and describe the basic tools to develop a new model in M-PACT.

CV Mar 21, 2018
T-RECS: Training for Rate-Invariant Embeddings by Controlling Speed for Action Recognition

Madan Ravi Ganesh, Eric Hofesmann, Byungsu Min et al.

An action should remain identifiable when modifying its speed: consider the contrast between an expert chef and a novice chef each chopping an onion. Here, we expect the novice chef to have a relatively measured and slow approach to chopping when compared to the expert. In general, the speed at which actions are performed, whether slower or faster than average, should not dictate how they are recognized. We explore the erratic behavior caused by this phenomenon on state-of-the-art deep network-based methods for action recognition in terms of maximum performance and stability in recognition accuracy across a range of input video speeds. By observing the trends in these metrics and summarizing them based on expected temporal behaviour w.r.t. variations in input video speeds, we find two distinct types of network architectures. In this paper, we propose a preprocessing method named T-RECS, as a way to extend deep-network-based methods for action recognition to explicitly account for speed variability in the data. We do so by adaptively resampling the inputs to a given model. T-RECS is agnostic to the specific deep-network model; we apply it to four state-of-the-art action recognition architectures, C3D, I3D, TSN, and ConvNet+LSTM. On HMDB51 and UCF101, T-RECS-based I3D models show a peak improvement of at least 2.9% in performance over the baseline while T-RECS-based C3D models achieve a maximum improvement in stability by 59% over the baseline, on the HMDB51 dataset.

RO Apr 12, 2016
Spatiotemporal Articulated Models for Dynamic SLAM

Suren Kumar, Vikas Dhiman, Madan Ravi Ganesh et al.

We propose an online spatiotemporal articulation model estimation framework that estimates both articulated structure as well as a temporal prediction model solely using passive observations. The resulting model can predict future motions of an articulated object with high confidence because of the spatial and temporal structure. We demonstrate the effectiveness of the predictive model by incorporating it within a standard simultaneous localization and mapping (SLAM) pipeline for mapping and robot localization in previously unexplored dynamic environments. Our method is able to localize the robot and map a dynamic scene by explaining the observed motion in the world. We demonstrate the effectiveness of the proposed framework for both simulated and real-world dynamic environments.