Mrigank Rochan

CV
h-index18
23papers
1,369citations
Novelty51%
AI Score57

23 Papers

SEApr 14
Model Compression vs. Adversarial Robustness: An Empirical Study on Language Models for Code

Md. Abdul Awal, Mrigank Rochan, Chanchal K. Roy

Transformer-based language models for code have shown remarkable performance in various software analytics tasks, but their adoption is hindered by high computational costs, slow inference speeds, and substantial environmental impact. Model compression techniques such as pruning, quantization, and knowledge distillation have gained traction in addressing these challenges. However, the impact of these strategies on the robustness of compressed language models for code in adversarial scenarios remains poorly understood. Understanding how these compressed models behave under adversarial attacks is essential for their safe and effective deployment in real-world applications. To bridge this knowledge gap, we conduct a comprehensive evaluation of how common compression strategies affect the adversarial robustness of compressed models. We assess the robustness of compressed versions of three widely used language models for code across three software analytics tasks, using six evaluation metrics and four commonly used classical adversarial attacks. Our findings indicate that compressed models generally maintain comparable performance to their uncompressed counterparts. However, when subjected to adversarial attacks, compressed models exhibit significantly reduced robustness. These results reveal a trade-off between model size reduction and adversarial robustness, underscoring the need for careful consideration when deploying compressed models in security-critical software applications. Our study highlights the need for further research into compression strategies that strike a balance between computational efficiency and adversarial robustness, which is essential for deploying reliable language models for code in real-world software applications.

CVJul 18, 2024
Unsupervised Video Highlight Detection by Learning from Audio and Visual Recurrence

Zahidul Islam, Sujoy Paul, Mrigank Rochan

With the exponential growth of video content, the need for automated video highlight detection to extract key moments or highlights from lengthy videos has become increasingly pressing. This technology has the potential to enhance user experiences by allowing quick access to relevant content across diverse domains. Existing methods typically rely either on expensive manually labeled frame-level annotations, or on a large external dataset of videos for weak supervision through category information. To overcome this, we focus on unsupervised video highlight detection, eliminating the need for manual annotations. We propose a novel unsupervised approach which capitalizes on the premise that significant moments tend to recur across multiple videos of the similar category in both audio and visual modalities. Surprisingly, audio remains under-explored, especially in unsupervised algorithms, despite its potential to detect key moments. Through a clustering technique, we identify pseudo-categories of videos and compute audio pseudo-highlight scores for each video by measuring the similarities of audio features among audio clips of all the videos within each pseudo-category. Similarly, we also compute visual pseudo-highlight scores for each video using visual features. Then, we combine audio and visual pseudo-highlights to create the audio-visual pseudo ground-truth highlight of each video for training an audio-visual highlight detection network. Extensive experiments and ablation studies on three benchmarks showcase the superior performance of our method over prior work.

CVJun 9, 2023
Improving LiDAR 3D Object Detection via Range-based Point Cloud Density Optimization

Eduardo R. Corral-Soto, Alaap Grandhi, Yannis Y. He et al.

In recent years, much progress has been made in LiDAR-based 3D object detection mainly due to advances in detector architecture designs and availability of large-scale LiDAR datasets. Existing 3D object detectors tend to perform well on the point cloud regions closer to the LiDAR sensor as opposed to on regions that are farther away. In this paper, we investigate this problem from the data perspective instead of detector architecture design. We observe that there is a learning bias in detection models towards the dense objects near the sensor and show that the detection performance can be improved by simply manipulating the input point cloud density at different distance ranges without modifying the detector architecture and without data augmentation. We propose a model-free point cloud density adjustment pre-processing mechanism that uses iterative MCMC optimization to estimate optimal parameters for altering the point density at different distance ranges. We conduct experiments using four state-of-the-art LiDAR 3D object detectors on two public LiDAR datasets, namely Waymo and ONCE. Our results demonstrate that our range-based point cloud density manipulation technique can improve the performance of the existing detectors, which in turn could potentially inspire future detector designs.

CVApr 10
Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation

Tzu Ling Liu, Ian Stavness, Mrigank Rochan

Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.

CVOct 18, 2022
Domain Adaptation in 3D Object Detection with Gradual Batch Alternation Training

Mrigank Rochan, Xingxin Chen, Alaap Grandhi et al.

We consider the problem of domain adaptation in LiDAR-based 3D object detection. Towards this, we propose a simple yet effective training strategy called Gradual Batch Alternation that can adapt from a large labeled source domain to an insufficiently labeled target domain. The idea is to initiate the training with the batch of samples from the source and target domain data in an alternate fashion, but then gradually reduce the amount of the source domain data over time as the training progresses. This way the model slowly shifts towards the target domain and eventually better adapt to it. The domain adaptation experiments for 3D object detection on four benchmark autonomous driving datasets, namely ONCE, PandaSet, Waymo, and nuScenes, demonstrate significant performance gains over prior arts and strong baselines.

SEMar 13
MoEKD: Mixture-of-Experts Knowledge Distillation for Robust and High-Performing Compressed Code Models

Md. Abdul Awal, Mrigank Rochan, Chanchal K. Roy

Large language models for code have achieved strong performance across diverse software analytics tasks, yet their real-world adoption remains limited by high computational demands, slow inference speeds, significant energy consumption, and environmental impact. Knowledge distillation (KD) offers a practical solution by transferring knowledge from a large model to a smaller and more efficient model. Despite its effectiveness, recent studies show that models distilled from a single source often exhibit degraded adversarial robustness, even when robustness-aware distillation techniques are employed. These observations suggest a fundamental limitation of single-source distillation in simultaneously transferring high-quality and robust knowledge. To overcome this limitation, we propose Mixture of Experts Knowledge Distillation (MoEKD), a KD framework that leverages a Mixture of Experts (MoE) architecture to enable more effective and robust knowledge transfer from multiple specialized experts into a compact model. MoEKD decomposes the distillation process into expert and router training, aggregation of expert knowledge through a learned routing mechanism, and distillation from the aggregated knowledge. We evaluate MoEKD on the vulnerability detection task using CodeBERT and GraphCodeBERT models. Experimental results show that MoEKD not only improves adversarial robustness by up to 35.8%, but also enhances predictive performance by up to 13%, compared to state-of-the-art KD baselines, including Compressor and AVATAR. Furthermore, an ablation study demonstrates that aggregating expert knowledge enables ultra-compact models to maintain competitive performance even when their size is reduced by approximately half. Overall, these results highlight the effectiveness of multi-expert knowledge aggregation in addressing key limitations of existing single-source KD approaches.

SENov 7, 2025
A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?

Md. Abdul Awal, Mrigank Rochan, Chanchal K. Roy

Transformer-based language models of code have achieved state-of-the-art performance across a wide range of software analytics tasks, but their practical deployment remains limited due to high computational costs, slow inference speeds, and significant environmental impact. To address these challenges, recent research has increasingly explored knowledge distillation as a method for compressing a large language model of code (the teacher) into a smaller model (the student) while maintaining performance. However, the degree to which a student model deeply mimics the predictive behavior and internal representations of its teacher remains largely unexplored, as current accuracy-based evaluation provides only a surface-level view of model quality and often fails to capture more profound discrepancies in behavioral fidelity between the teacher and student models. To address this gap, we empirically show that the student model often fails to deeply mimic the teacher model, resulting in up to 285% greater performance drop under adversarial attacks, which is not captured by traditional accuracy-based evaluation. Therefore, we propose MetaCompress, a metamorphic testing framework that systematically evaluates behavioral fidelity by comparing the outputs of teacher and student models under a set of behavior-preserving metamorphic relations. We evaluate MetaCompress on two widely studied tasks, using compressed versions of popular language models of code, obtained via three different knowledge distillation techniques: Compressor, AVATAR, and MORPH. The results show that MetaCompress identifies up to 62% behavioral discrepancies in student models, underscoring the need for behavioral fidelity evaluation within the knowledge distillation pipeline and establishing MetaCompress as a practical framework for testing compressed language models of code derived through knowledge distillation.

CVOct 23, 2020Code
AdaCrowd: Unlabeled Scene Adaptation for Crowd Counting

Mahesh Kumar Krishna Reddy, Mrigank Rochan, Yiwei Lu et al.

We address the problem of image-based crowd counting. In particular, we propose a new problem called unlabeled scene-adaptive crowd counting. Given a new target scene, we would like to have a crowd counting model specifically adapted to this particular scene based on the target data that capture some information about the new scene. In this paper, we propose to use one or more unlabeled images from the target scene to perform the adaptation. In comparison with the existing problem setups (e.g. fully supervised), our proposed problem setup is closer to the real-world applications of crowd counting systems. We introduce a novel AdaCrowd framework to solve this problem. Our framework consists of a crowd counting network and a guiding network. The guiding network predicts some parameters in the crowd counting network based on the unlabeled images from a particular scene. This allows our model to adapt to different target scenes. The experimental results on several challenging benchmark datasets demonstrate the effectiveness of our proposed approach compared with other alternative methods. Code is available at https://github.com/maheshkkumar/adacrowd.

CVFeb 1, 2020Code
Few-Shot Scene Adaptive Crowd Counting Using Meta-Learning

Mahesh Kumar Krishna Reddy, Mohammad Hossain, Mrigank Rochan et al.

We consider the problem of few-shot scene adaptive crowd counting. Given a target camera scene, our goal is to adapt a model to this specific scene with only a few labeled images of that scene. The solution to this problem has potential applications in numerous real-world scenarios, where we ideally like to deploy a crowd counting model specially adapted to a target camera. We accomplish this challenge by taking inspiration from the recently introduced learning-to-learn paradigm in the context of few-shot regime. In training, our method learns the model parameters in a way that facilitates the fast adaptation to the target scene. At test time, given a target scene with a small number of labeled data, our method quickly adapts to that scene with a few gradient updates to the learned parameters. Our extensive experimental results show that the proposed approach outperforms other alternatives in few-shot scene adaptive crowd counting. Code is available at https://github.com/maheshkkumar/fscc.

CVAug 6, 2025
Test-Time Adaptation for Video Highlight Detection Using Meta-Auxiliary Learning and Cross-Modality Hallucinations

Zahidul Islam, Sujoy Paul, Mrigank Rochan

Existing video highlight detection methods, although advanced, struggle to generalize well to all test videos. These methods typically employ a generic highlight detection model for each test video, which is suboptimal as it fails to account for the unique characteristics and variations of individual test videos. Such fixed models do not adapt to the diverse content, styles, or audio and visual qualities present in new, unseen test videos, leading to reduced highlight detection performance. In this paper, we propose Highlight-TTA, a test-time adaptation framework for video highlight detection that addresses this limitation by dynamically adapting the model during testing to better align with the specific characteristics of each test video, thereby improving generalization and highlight detection performance. Highlight-TTA is jointly optimized with an auxiliary task, cross-modality hallucinations, alongside the primary highlight detection task. We utilize a meta-auxiliary training scheme to enable effective adaptation through the auxiliary task while enhancing the primary task. During testing, we adapt the trained model using the auxiliary task on the test video to further enhance its highlight detection performance. Extensive experiments with three state-of-the-art highlight detection models and three benchmark datasets show that the introduction of Highlight-TTA to these models improves their performance, yielding superior results.

CVJan 14, 2022
Domain Adaptation in LiDAR Semantic Segmentation via Alternating Skip Connections and Hybrid Learning

Eduardo R. Corral-Soto, Mrigank Rochan, Yannis Y. He et al.

In this paper we address the challenging problem of domain adaptation in LiDAR semantic segmentation. We consider the setting where we have a fully-labeled data set from source domain and a target domain with a few labeled and many unlabeled examples. We propose a domain adaption framework that mitigates the issue of domain shift and produces appealing performance on the target domain. To this end, we develop a GAN-based image-to-image translation engine that has generators with alternating connections, and couple it with a state-of-the-art LiDAR semantic segmentation network. Our framework is hybrid in nature in the sense that our model learning is composed of self-supervision, semi-supervision and unsupervised learning. Extensive experiments on benchmark LiDAR semantic segmentation data sets demonstrate that our method achieves superior performance in comparison to strong baselines and prior arts.

CVJul 20, 2021
Unsupervised Domain Adaptation in LiDAR Semantic Segmentation with Self-Supervision and Gated Adapters

Mrigank Rochan, Shubhra Aich, Eduardo R. Corral-Soto et al.

In this paper, we focus on a less explored, but more realistic and complex problem of domain adaptation in LiDAR semantic segmentation. There is a significant drop in performance of an existing segmentation model when training (source domain) and testing (target domain) data originate from different LiDAR sensors. To overcome this shortcoming, we propose an unsupervised domain adaptation framework that leverages unlabeled target domain data for self-supervision, coupled with an unpaired mask transfer strategy to mitigate the impact of domain shifts. Furthermore, we introduce the gated adapter module with a small number of parameters into the network to account for target domain-specific information. Experiments adapting from both real-to-real and synthetic-to-real LiDAR semantic segmentation benchmarks demonstrate the significant improvement over prior arts.

CVFeb 9, 2021
Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network

Linwei Ye, Mrigank Rochan, Zhi Liu et al.

We consider the problem of referring segmentation in images and videos with natural language. Given an input image (or video) and a referring expression, the goal is to segment the entity referred by the expression in the image or video. In this paper, we propose a cross-modal self-attention (CMSA) module to utilize fine details of individual words and the input image or video, which effectively captures the long-range dependencies between linguistic and visual features. Our model can adaptively focus on informative words in the referring expression and important regions in the visual input. We further propose a gated multi-level fusion (GMLF) module to selectively integrate self-attentive cross-modal features corresponding to different levels of visual features. This module controls the feature fusion of information flow of features at different levels with high-level and low-level semantic information related to different attentive words. Besides, we introduce cross-frame self-attention (CFSA) module to effectively integrate temporal information in consecutive frames which extends our method in the case of referring segmentation in videos. Experiments on benchmark datasets of four referring image datasets and two actor and action video segmentation datasets consistently demonstrate that our proposed approach outperforms existing state-of-the-art methods.

CVAug 31, 2020
Sentence Guided Temporal Modulation for Dynamic Video Thumbnail Generation

Mrigank Rochan, Mahesh Kumar Krishna Reddy, Yang Wang

We consider the problem of sentence specified dynamic video thumbnail generation. Given an input video and a user query sentence, the goal is to generate a video thumbnail that not only provides the preview of the video content, but also semantically corresponds to the sentence. In this paper, we propose a sentence guided temporal modulation (SGTM) mechanism that utilizes the sentence embedding to modulate the normalized temporal activations of the video thumbnail generation network. Unlike the existing state-of-the-art method that uses recurrent architectures, we propose a non-recurrent framework that is simple and allows much more parallelization. Extensive experiments and analysis on a large-scale dataset demonstrate the effectiveness of our framework.

CVJul 19, 2020
Adaptive Video Highlight Detection by Learning from User History

Mrigank Rochan, Mahesh Kumar Krishna Reddy, Linwei Ye et al.

Recently, there is an increasing interest in highlight detection research where the goal is to create a short duration video from a longer video by extracting its interesting moments. However, most existing methods ignore the fact that the definition of video highlight is highly subjective. Different users may have different preferences of highlight for the same input video. In this paper, we propose a simple yet effective framework that learns to adapt highlight detection to a user by exploiting the user's history in the form of highlights that the user has previously created. Our framework consists of two sub-networks: a fully temporal convolutional highlight detection network $H$ that predicts highlight for an input video and a history encoder network $M$ for user history. We introduce a newly designed temporal-adaptive instance normalization (T-AIN) layer to $H$ where the two sub-networks interact with each other. T-AIN has affine parameters that are predicted from $M$ based on the user history and is responsible for the user-adaptive signal to $H$. Extensive experiments on a large-scale dataset show that our framework can make more accurate and user-specific highlight predictions.

CVApr 9, 2019
Cross-Modal Self-Attention Network for Referring Image Segmentation

Linwei Ye, Mrigank Rochan, Zhi Liu et al.

We consider the problem of referring image segmentation. Given an input image and a natural language expression, the goal is to segment the object referred by the language expression in the image. Existing works in this area treat the language expression and the input image separately in their representations. They do not sufficiently capture long-range correlations between these two modalities. In this paper, we propose a cross-modal self-attention (CMSA) module that effectively captures the long-range dependencies between linguistic and visual features. Our model can adaptively focus on informative words in the referring expression and important regions in the input image. In addition, we propose a gated multi-level fusion module to selectively integrate self-attentive cross-modal features corresponding to different levels in the image. This module controls the information flow of features at different levels. We validate the proposed approach on four evaluation datasets. Our proposed approach consistently outperforms existing state-of-the-art methods.

CVApr 9, 2019
Convolutional Temporal Attention Model for Video-based Person Re-identification

Tanzila Rahman, Mrigank Rochan, Yang Wang

The goal of video-based person re-identification is to match two input videos, so that the distance of the two videos is small if two videos contain the same person. A common approach for person re-identification is to first extract image features for all frames in the video, then aggregate all the features to form a video-level feature. The video-level features of two videos can then be used to calculate the distance of the two videos. In this paper, we propose a temporal attention approach for aggregating frame-level features into a video-level feature vector for re-identification. Our method is motivated by the fact that not all frames in a video are equally informative. We propose a fully convolutional temporal attention model for generating the attention scores. Fully convolutional network (FCN) has been widely used in semantic segmentation for generating 2D output maps. In this paper, we formulate video based person re-identification as a sequence labeling problem like semantic segmentation. We establish a connection between them and modify FCN to generate attention scores to represent the importance of each frame. Extensive experiments on three different benchmark datasets (i.e. iLIDS-VID, PRID-2011 and SDU-VID) show that our proposed method outperforms other state-of-the-art approaches.

CVOct 26, 2018
Video-based Person Re-identification Using Spatial-Temporal Attention Networks

Shivansh Rao, Tanzila Rahman, Mrigank Rochan et al.

We consider the problem of video-based person re-identification. The goal is to identify a person from videos captured under different cameras. In this paper, we propose an efficient spatial-temporal attention based model for person re-identification from videos. Our method generates an attention score for each frame based on frame-level features. The attention scores of all frames in a video are used to produce a weighted feature vector for the input video. Unlike most existing deep learning methods that use global representation, our approach focuses on attention scores. Extensive experiments on two benchmark datasets demonstrate that our method achieves the state-of-the-art performance. This is a technical report.

CVJul 20, 2018
Future Semantic Segmentation with Convolutional LSTM

Seyed shahabeddin Nabavi, Mrigank Rochan, Yang et al.

We consider the problem of predicting semantic segmentation of future frames in a video. Given several observed frames in a video, our goal is to predict the semantic segmentation map of future frames that are not yet observed. A reliable solution to this problem is useful in many applications that require real-time decision making, such as autonomous driving. We propose a novel model that uses convolutional LSTM (ConvLSTM) to encode the spatiotemporal information of observed frames for future prediction. We also extend our model to use bidirectional ConvLSTM to capture temporal information in both directions. Our proposed approach outperforms other state-of-the-art methods on the benchmark dataset.

CVJun 29, 2018
Gated Feedback Refinement Network for Coarse-to-Fine Dense Semantic Image Labeling

Md Amirul Islam, Mrigank Rochan, Shujon Naha et al.

Effective integration of local and global contextual information is crucial for semantic segmentation and dense image labeling. We develop two encoder-decoder based deep learning architectures to address this problem. We first propose a network architecture called Label Refinement Network (LRN) that predicts segmentation labels in a coarse-to-fine fashion at several spatial resolutions. In this network, we also define loss functions at several stages to provide supervision at different stages of training. However, there are limits to the quality of refinement possible if ambiguous information is passed forward. In order to address this issue, we also propose Gated Feedback Refinement Network (G-FRNet) that addresses this limitation. Initially, G-FRNet makes a coarse-grained prediction which it progressively refines to recover details by effectively integrating local and global contextual information during the refinement stages. This is achieved by gate units proposed in this work, that control information passed forward in order to resolve the ambiguity. Experiments were conducted on four challenging dense labeling datasets (CamVid, PASCAL VOC 2012, Horse-Cow Parsing, PASCAL-Person-Part, and SUN-RGBD). G-FRNet achieves state-of-the-art semantic segmentation results on the CamVid and Horse-Cow Parsing datasets and produces results competitive with the best performing approaches that appear in the literature for the other three datasets.

CVMay 30, 2018
Video Summarization by Learning from Unpaired Data

Mrigank Rochan, Yang Wang

We consider the problem of video summarization. Given an input raw video, the goal is to select a small subset of key frames from the input video to create a shorter summary video that best describes the content of the original video. Most of the current state-of-the-art video summarization approaches use supervised learning and require labeled training data. Each training instance consists of a raw input video and its ground truth summary video curated by human annotators. However, it is very expensive and difficult to create such labeled training examples. To address this limitation, we propose a novel formulation to learn video summarization from unpaired data. We present an approach that learns to generate optimal video summaries using a set of raw videos ($V$) and a set of summary videos ($S$), where there exists no correspondence between $V$ and $S$. We argue that this type of data is much easier to collect. Our model aims to learn a mapping function $F : V \rightarrow S$ such that the distribution of resultant summary videos from $F(V)$ is similar to the distribution of $S$ with the help of an adversarial objective. In addition, we enforce a diversity constraint on $F(V)$ to ensure that the generated video summaries are visually diverse. Experimental results on two benchmark datasets indicate that our proposed approach significantly outperforms other alternative methods.

CVMay 26, 2018
Video Summarization Using Fully Convolutional Sequence Networks

Mrigank Rochan, Linwei Ye, Yang Wang

This paper addresses the problem of video summarization. Given an input video, the goal is to select a subset of the frames to create a summary video that optimally captures the important information of the input video. With the large amount of videos available online, video summarization provides a useful tool that assists video search, retrieval, browsing, etc. In this paper, we formulate video summarization as a sequence labeling problem. Unlike existing approaches that use recurrent models, we propose fully convolutional sequence models to solve video summarization. We firstly establish a novel connection between semantic segmentation and video summarization, and then adapt popular semantic segmentation networks for video summarization. Extensive experiments and analysis on two benchmark datasets demonstrate the effectiveness of our models.

CVMar 1, 2017
Label Refinement Network for Coarse-to-Fine Semantic Segmentation

Md Amirul Islam, Shujon Naha, Mrigank Rochan et al.

We consider the problem of semantic image segmentation using deep convolutional neural networks. We propose a novel network architecture called the label refinement network that predicts segmentation labels in a coarse-to-fine fashion at several resolutions. The segmentation labels at a coarse resolution are used together with convolutional features to obtain finer resolution segmentation labels. We define loss functions at several stages in the network to provide supervisions at different stages. Our experimental results on several standard datasets demonstrate that the proposed model provides an effective way of producing pixel-wise dense image labeling.