Siddharth Roheda

h-index6

9papers

29citations

Novelty59%

AI Score40

Ranked #73,244 of 194,257 authors (top 38%)#24,884 in CV (top 42%)

9 Papers

3.3LGOct 2, 2022

Fast OT for Latent Domain Adaptation

Siddharth Roheda, Ashkan Panahi, Hamid Krim

In this paper, we address the problem of unsupervised Domain Adaptation. The need for such an adaptation arises when the distribution of the target data differs from that which is used to develop the model and the ground truth information of the target data is unknown. We propose an algorithm that uses optimal transport theory with a verifiably efficient and implementable solution to learn the best latent feature representation. This is achieved by minimizing the cost of transporting the samples from the target domain to the distribution of the source domain.

5.2CVNov 21, 2024

GalaxyEdit: Large-Scale Image Editing Dataset with Enhanced Diffusion Adapter

Aniruddha Bala, Rohan Jaiswal, Loay Rashid et al.

Training of large-scale text-to-image and image-to-image models requires a huge amount of annotated data. While text-to-image datasets are abundant, data available for instruction-based image-to-image tasks like object addition and removal is limited. This is because of the several challenges associated with the data generation process, such as, significant human effort, limited automation, suboptimal end-to-end models, data diversity constraints and high expenses. We propose an automated data generation pipeline aimed at alleviating such limitations, and introduce GalaxyEdit - a large-scale image editing dataset for add and remove operations. We fine-tune the SD v1.5 model on our dataset and find that our model can successfully handle a broader range of objects and complex editing instructions, outperforming state-of-the-art methods in FID scores by 11.2\% and 26.1\% for add and remove tasks respectively. Furthermore, in light of on-device usage scenarios, we expand our research to include task-specific lightweight adapters leveraging the ControlNet-xs architecture. While ControlNet-xs excels in canny and depth guided generation, we propose to improve the communication between the control network and U-Net for more intricate add and remove tasks. We achieve this by enhancing ControlNet-xs with non-linear interaction layers based on Volterra filters. Our approach outperforms ControlNet-xs in both add/remove and canny-guided image generation tasks, highlighting the effectiveness of the proposed enhancement.

8.4CVApr 24, 2025

DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing

Aniruddha Bala, Rohit Chowdhury, Rohan Jaiswal et al.

Advancements in diffusion models have enabled effortless image editing via text prompts, raising concerns about image security. Attackers with access to user images can exploit these tools for malicious edits. Recent defenses attempt to protect images by adding a limited noise in the pixel space to disrupt the functioning of diffusion-based editing models. However, the adversarial noise added by previous methods is easily noticeable to the human eye. Moreover, most of these methods are not robust to purification techniques like JPEG compression under a feasible pixel budget. We propose a novel optimization approach that introduces adversarial perturbations directly in the frequency domain by modifying the Discrete Cosine Transform (DCT) coefficients of the input image. By leveraging the JPEG pipeline, our method generates adversarial images that effectively prevent malicious image editing. Extensive experiments across a variety of tasks and datasets demonstrate that our approach introduces fewer visual artifacts while maintaining similar levels of edit protection and robustness to noise purification techniques.

3.6CVJan 10, 2025

LLVD: LSTM-based Explicit Motion Modeling in Latent Space for Blind Video Denoising

Loay Rashid, Siddharth Roheda, Amit Unde

Video restoration plays a pivotal role in revitalizing degraded video content by rectifying imperfections caused by various degradations introduced during capturing (sensor noise, motion blur, etc.), saving/sharing (compression, resizing, etc.) and editing. This paper introduces a novel algorithm designed for scenarios where noise is introduced during video capture, aiming to enhance the visual quality of videos by reducing unwanted noise artifacts. We propose the Latent space LSTM Video Denoiser (LLVD), an end-to-end blind denoising model. LLVD uniquely combines spatial and temporal feature extraction, employing Long Short Term Memory (LSTM) within the encoded feature domain. This integration of LSTM layers is crucial for maintaining continuity and minimizing flicker in the restored video. Moreover, processing frames in the encoded feature domain significantly reduces computations, resulting in a very lightweight architecture. LLVD's blind nature makes it versatile for real, in-the-wild denoising scenarios where prior information about noise characteristics is not available. Experiments reveal that LLVD demonstrates excellent performance for both synthetic and captured noise. Specifically, LLVD surpasses the current State-Of-The-Art (SOTA) in RAW denoising by 0.3dB, while also achieving a 59\% reduction in computational complexity.

3.6CVSep 29, 2025

VNODE: A Piecewise Continuous Volterra Neural Network

Siddharth Roheda, Aniruddha Bala, Rohit Chowdhury et al.

This paper introduces Volterra Neural Ordinary Differential Equations (VNODE), a piecewise continuous Volterra Neural Network that integrates nonlinear Volterra filtering with continuous time neural ordinary differential equations for image classification. Drawing inspiration from the visual cortex, where discrete event processing is interleaved with continuous integration, VNODE alternates between discrete Volterra feature extraction and ODE driven state evolution. This hybrid formulation captures complex patterns while requiring substantially fewer parameters than conventional deep architectures. VNODE consistently outperforms state of the art models with improved computational complexity as exemplified on benchmark datasets like CIFAR10 and Imagenet1K.

8.4CVSep 27, 2025

Vid-Freeze: Protecting Images from Malicious Image-to-Video Generation via Temporal Freezing

Rohit Chowdhury, Aniruddha Bala, Rohan Jaiswal et al.

The rapid progress of image-to-video (I2V) generation models has introduced significant risks, enabling video synthesis from static images and facilitating deceptive or malicious content creation. While prior defenses such as I2VGuard attempt to immunize images, effective and principled protection to block motion remains underexplored. In this work, we introduce Vid-Freeze - a novel attention-suppressing adversarial attack that adds carefully crafted adversarial perturbations to images. Our method explicitly targets the attention mechanism of I2V models, completely disrupting motion synthesis while preserving semantic fidelity of the input image. The resulting immunized images generate stand-still or near-static videos, effectively blocking malicious content creation. Our experiments demonstrate the impressive protection provided by the proposed approach, highlighting the importance of attention attacks as a promising direction for robust and proactive defenses against misuse of I2V generation models.

3.7CVApr 10, 2021

Latent Code-Based Fusion: A Volterra Neural Network Approach

Sally Ghanem, Siddharth Roheda, Hamid Krim

We propose a deep structure encoder using the recently introduced Volterra Neural Networks (VNNs) to seek a latent representation of multi-modal data whose features are jointly captured by a union of subspaces. The so-called self-representation embedding of the latent codes leads to a simplified fusion which is driven by a similarly constructed decoding. The Volterra Filter architecture achieved reduction in parameter complexity is primarily due to controlled non-linearities being introduced by the higher-order convolutions in contrast to generalized activation functions. Experimental results on two different datasets have shown a significant improvement in the clustering performance for VNNs auto-encoder over conventional Convolutional Neural Networks (CNNs) auto-encoder. In addition, we also show that the proposed approach demonstrates a much-improved sample complexity over CNN-based auto-encoder with a superb robust classification performance.

5.4CVOct 21, 2019

Volterra Neural Networks (VNNs)

Siddharth Roheda, Hamid Krim

The importance of inference in Machine Learning (ML) has led to an explosive number of different proposals in ML, and particularly in Deep Learning. In an attempt to reduce the complexity of Convolutional Neural Networks, we propose a Volterra filter-inspired Network architecture. This architecture introduces controlled non-linearities in the form of interactions between the delayed input samples of data. We propose a cascaded implementation of Volterra Filtering so as to significantly reduce the number of parameters required to carry out the same classification task as that of a conventional Neural Network. We demonstrate an efficient parallel implementation of this Volterra Neural Network (VNN), along with its remarkable performance while retaining a relatively simpler and potentially more tractable structure. Furthermore, we show a rather sophisticated adaptation of this network to nonlinearly fuse the RGB (spatial) information and the Optical Flow (temporal) information of a video sequence for action recognition. The proposed approach is evaluated on UCF-101 and HMDB-51 datasets for action recognition, and is shown to outperform state of the art CNN approaches.

4.1LGJun 10, 2019

Robust Multi-Modal Sensor Fusion: An Adversarial Approach

Siddharth Roheda, Hamid Krim, Benjamin S. Riggan

In recent years, multi-modal fusion has attracted a lot of research interest, both in academia, and in industry. Multimodal fusion entails the combination of information from a set of different types of sensors. Exploiting complementary information from different sensors, we show that target detection and classification problems can greatly benefit from this fusion approach and result in a performance increase. To achieve this gain, the information fusion from various sensors is shown to require some principled strategy to ensure that additional information is constructively used, and has a positive impact on performance. We subsequently demonstrate the viability of the proposed fusion approach by weakening the strong dependence on the functionality of all sensors, hence introducing additional flexibility in our solution and lifting the severe limitation in unconstrained surveillance settings with potential environmental impact. Our proposed data driven approach to multimodal fusion, exploits selected optimal features from an estimated latent space of data across all modalities. This hidden space is learned via a generative network conditioned on individual sensor modalities. The hidden space, as an intrinsic structure, is then exploited in detecting damaged sensors, and in subsequently safeguarding the performance of the fused sensor system. Experimental results show that such an approach can achieve automatic system robustness against noisy/damaged sensors.