Rohit Jena

CV
h-index78
21papers
174citations
Novelty50%
AI Score53

21 Papers

IVDec 1, 2025Code
Disentangling Progress in Medical Image Registration: Beyond Trend-Driven Architectures towards Domain-Specific Strategies

Bailiang Jian, Jiazhen Pan, Rohit Jena et al.

Medical image registration drives quantitative analysis across organs, modalities, and patient populations. Recent deep learning methods often combine low-level "trend-driven" computational blocks from computer vision, such as large-kernel CNNs, Transformers, and state-space models, with high-level registration-specific designs like motion pyramids, correlation layers, and iterative refinement. Yet, their relative contributions remain unclear and entangled. This raises a central question: should future advances in registration focus on importing generic architectural trends or on refining domain-specific design principles? Through a modular framework spanning brain, lung, cardiac, and abdominal registration, we systematically disentangle the influence of these two paradigms. Our evaluation reveals that low-level "trend-driven" computational blocks offer only marginal or inconsistent gains, while high-level registration-specific designs consistently deliver more accurate, smoother, and more robust deformations. These domain priors significantly elevate the performance of a standard U-Net baseline, far more than variants incorporating "trend-driven" blocks, achieving an average relative improvement of $\sim3\%$. All models and experiments are released within a transparent, modular benchmark that enables plug-and-play comparison for new architectures and registration tasks (https://github.com/BailiangJ/rethink-reg). This dynamic and extensible platform establishes a common ground for reproducible and fair evaluation, inviting the community to isolate genuine methodological contributions from domain priors. Our findings advocate a shift in research emphasis: from following architectural trends to embracing domain-specific design principles as the true drivers of progress in learning-based medical image registration.

CVSep 9, 2024
Elucidating Optimal Reward-Diversity Tradeoffs in Text-to-Image Diffusion Models

Rohit Jena, Ali Taghibakhshi, Sahil Jain et al.

Text-to-image (T2I) diffusion models have become prominent tools for generating high-fidelity images from text prompts. However, when trained on unfiltered internet data, these models can produce unsafe, incorrect, or stylistically undesirable images that are not aligned with human preferences. To address this, recent approaches have incorporated human preference datasets to fine-tune T2I models or to optimize reward functions that capture these preferences. Although effective, these methods are vulnerable to reward hacking, where the model overfits to the reward function, leading to a loss of diversity in the generated images. In this paper, we prove the inevitability of reward hacking and study natural regularization techniques like KL divergence and LoRA scaling, and their limitations for diffusion models. We also introduce Annealed Importance Guidance (AIG), an inference-time regularization inspired by Annealed Importance Sampling, which retains the diversity of the base model while achieving Pareto-Optimal reward-diversity tradeoffs. Our experiments demonstrate the benefits of AIG for Stable Diffusion models, striking the optimal balance between reward optimization and image diversity. Furthermore, a user study confirms that AIG improves diversity and quality of generated images across different model architectures and reward functions.

CVJul 4, 2022
Beyond mAP: Towards better evaluation of instance segmentation

Rohit Jena, Lukas Zhornyak, Nehal Doiphode et al.

Correctness of instance segmentation constitutes counting the number of objects, correctly localizing all predictions and classifying each localized prediction. Average Precision is the de-facto metric used to measure all these constituents of segmentation. However, this metric does not penalize duplicate predictions in the high-recall range, and cannot distinguish instances that are localized correctly but categorized incorrectly. This weakness has inadvertently led to network designs that achieve significant gains in AP but also introduce a large number of false positives. We therefore cannot rely on AP to choose a model that provides an optimal tradeoff between false positives and high recall. To resolve this dilemma, we review alternative metrics in the literature and propose two new measures to explicitly measure the amount of both spatial and categorical duplicate predictions. We also propose a Semantic Sorting and NMS module to remove these duplicates based on a pixel occupancy matching scheme. Experiments show that modern segmentation networks have significant gains in AP, but also contain a considerable amount of duplicates. Our Semantic Sorting and NMS can be added as a plug-and-play module to mitigate hedged predictions and preserve AP.

CVNov 17, 2023
SplatArmor: Articulated Gaussian splatting for animatable humans from monocular RGB videos

Rohit Jena, Ganesh Subramanian Iyer, Siddharth Choudhary et al.

We propose SplatArmor, a novel approach for recovering detailed and animatable human models by `armoring' a parameterized body model with 3D Gaussians. Our approach represents the human as a set of 3D Gaussians within a canonical space, whose articulation is defined by extending the skinning of the underlying SMPL geometry to arbitrary locations in the canonical space. To account for pose-dependent effects, we introduce a SE(3) field, which allows us to capture both the location and anisotropy of the Gaussians. Furthermore, we propose the use of a neural color field to provide color regularization and 3D supervision for the precise positioning of these Gaussians. We show that Gaussian splatting provides an interesting alternative to neural rendering based methods by leverging a rasterization primitive without facing any of the non-differentiability and optimization challenges typically faced in such approaches. The rasterization paradigms allows us to leverage forward skinning, and does not suffer from the ambiguities associated with inverse skinning and warping. We show compelling results on the ZJU MoCap and People Snapshot datasets, which underscore the effectiveness of our method for controllable human synthesis.

CVMar 15, 2023
Mesh Strikes Back: Fast and Efficient Human Reconstruction from RGB videos

Rohit Jena, Pratik Chaudhari, James Gee et al.

Human reconstruction and synthesis from monocular RGB videos is a challenging problem due to clothing, occlusion, texture discontinuities and sharpness, and framespecific pose changes. Many methods employ deferred rendering, NeRFs and implicit methods to represent clothed humans, on the premise that mesh-based representations cannot capture complex clothing and textures from RGB, silhouettes, and keypoints alone. We provide a counter viewpoint to this fundamental premise by optimizing a SMPL+D mesh and an efficient, multi-resolution texture representation using only RGB images, binary silhouettes and sparse 2D keypoints. Experimental results demonstrate that our approach is more capable of capturing geometric details compared to visual hull, mesh-based methods. We show competitive novel view synthesis and improvements in novel pose synthesis compared to NeRF-based methods, which introduce noticeable, unwanted artifacts. By restricting the solution space to the SMPL+D model combined with differentiable rendering, we obtain dramatic speedups in compute, training times (up to 24x) and inference times (up to 192x). Our method therefore can be used as is or as a fast initialization to NeRF-based methods.

IVAug 11, 2024
Deep Learning in Medical Image Registration: Magic or Mirage?

Rohit Jena, Deeksha Sethi, Pratik Chaudhari et al.

Classical optimization and learning-based methods are the two reigning paradigms in deformable image registration. While optimization-based methods boast generalizability across modalities and robust performance, learning-based methods promise peak performance, incorporating weak supervision and amortized optimization. However, the exact conditions for either paradigm to perform well over the other are shrouded and not explicitly outlined in the existing literature. In this paper, we make an explicit correspondence between the mutual information of the distribution of per-pixel intensity and labels, and the performance of classical registration methods. This strong correlation hints to the fact that architectural designs in learning-based methods is unlikely to affect this correlation, and therefore, the performance of learning-based methods. This hypothesis is thoroughly validated with state-of-the-art classical and learning-based methods. However, learning-based methods with weak supervision can perform high-fidelity intensity and label registration, which is not possible with classical methods. Next, we show that this high-fidelity feature learning does not translate to invariance to domain shift, and learning-based methods are sensitive to such changes in the data distribution. Finally, we propose a general recipe to choose the best paradigm for a given registration problem, based on these observations.

CVMar 19
Factored Levenberg-Marquardt for Diffeomorphic Image Registration: An efficient optimizer for FireANTs

Rohit Jena, Pratik Chaudhari, James C. Gee

FireANTs introduced a novel Eulerian descent method for plug-and-play behavior with arbitrary optimizers adapted for diffeomorphic image registration as a test-time optimization problem, with a GPU-accelerated implementation. FireANTs uses Adam as its default optimizer for fast and more robust optimization. However, Adam requires storing state variables (i.e. momentum and squared-momentum estimates), each of which can consume significant memory, prohibiting its use for significantly large images. In this work, we propose a modified Levenberg-Marquardt (LM) optimizer that requires only a single scalar damping parameter as optimizer state, that is adaptively tuned using a trust region approach. The resulting optimizer reduces memory by up to 24.6% for large volumes, and retaining performance across all four datasets. A single hyperparameter configuration tuned on brain MRI transfers without modification to lung CT and cross-modal abdominal registration, matching or outperforming Adam on three of four benchmarks. We also perform ablations on the effectiveness of using Metropolis-Hastings style rejection step to prevent updates that worsen the loss function.

CVDec 17, 2025
The LUMirage: An independent evaluation of zero-shot performance in the LUMIR challenge

Rohit Jena, Pratik Chaudhari, James C. Gee

The LUMIR challenge represents an important benchmark for evaluating deformable image registration methods on large-scale neuroimaging data. While the challenge demonstrates that modern deep learning methods achieve competitive accuracy on T1-weighted MRI, it also claims exceptional zero-shot generalization to unseen contrasts and resolutions, assertions that contradict established understanding of domain shift in deep learning. In this paper, we perform an independent re-evaluation of these zero-shot claims using rigorous evaluation protocols while addressing potential sources of instrumentation bias. Our findings reveal a more nuanced picture: (1) deep learning methods perform comparably to iterative optimization on in-distribution T1w images and even on human-adjacent species (macaque), demonstrating improved task understanding; (2) however, performance degrades significantly on out-of-distribution contrasts (T2, T2*, FLAIR), with Cohen's d scores ranging from 0.7-1.5, indicating substantial practical impact on downstream clinical workflows; (3) deep learning methods face scalability limitations on high-resolution data, failing to run on 0.6 mm isotropic images, while iterative methods benefit from increased resolution; and (4) deep methods exhibit high sensitivity to preprocessing choices. These results align with the well-established literature on domain shift and suggest that claims of universal zero-shot superiority require careful scrutiny. We advocate for evaluation protocols that reflect practical clinical and research workflows rather than conditions that may inadvertently favor particular method classes.

CVApr 2, 2024
Neural Ordinary Differential Equation based Sequential Image Registration for Dynamic Characterization

Yifan Wu, Mengjin Dong, Rohit Jena et al.

Deformable image registration (DIR) is crucial in medical image analysis, enabling the exploration of biological dynamics such as organ motions and longitudinal changes in imaging. Leveraging Neural Ordinary Differential Equations (ODE) for registration, this extension work discusses how this framework can aid in the characterization of sequential biological processes. Utilizing the Neural ODE's ability to model state derivatives with neural networks, our Neural Ordinary Differential Equation Optimization-based (NODEO) framework considers voxels as particles within a dynamic system, defining deformation fields through the integration of neural differential equations. This method learns dynamics directly from data, bypassing the need for physical priors, making it exceptionally suitable for medical scenarios where such priors are unavailable or inapplicable. Consequently, the framework can discern underlying dynamics and use sequence data to regularize the transformation trajectory. We evaluated our framework on two clinical datasets: one for cardiac motion tracking and another for longitudinal brain MRI analysis. Demonstrating its efficacy in both 2D and 3D imaging scenarios, our framework offers flexibility and model agnosticism, capable of managing image sequences and facilitating label propagation throughout these sequences. This study provides a comprehensive understanding of how the Neural ODE-based framework uniquely benefits the image registration challenge.

CVApr 1, 2024
FireANTs: Adaptive Riemannian Optimization for Multi-Scale Diffeomorphic Matching

Rohit Jena, Pratik Chaudhari, James C. Gee

The paper proposes FireANTs, a multi-scale Adaptive Riemannian Optimization algorithm for dense diffeomorphic image matching. Existing state-of-the-art methods for diffeomorphic image matching are slow due to inefficient implementations and slow convergence due to the ill-conditioned nature of the optimization problem. Deep learning methods offer fast inference but require extensive training time, substantial inference memory, and fail to generalize across long-tailed distributions or diverse image modalities, necessitating costly retraining. We address these challenges by proposing a training-free, GPU-accelerated multi-scale Adaptive Riemannian Optimization algorithm for fast and accurate dense diffeomorphic image matching. FireANTs runs about 2.5x faster than ANTs on a CPU, and upto 1200x faster on a GPU. On a single GPU, FireANTs performs competitively with deep learning methods on inference runtime while consuming upto 10x less memory. FireANTs shows remarkable robustness to a wide variety of matching problems across modalities, species, and organs without any domain-specific training or tuning. Our framework allows hyperparameter grid search studies with significantly less resources and time compared to traditional and deep learning registration algorithms alike.

CVSep 29, 2025
A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration

Rohit Jena, Vedant Zope, Pratik Chaudhari et al.

In this work, we propose FFDP, a set of IO-aware non-GEMM fused kernels supplemented with a distributed framework for image registration at unprecedented scales. Image registration is an inverse problem fundamental to biomedical and life sciences, but algorithms have not scaled in tandem with image acquisition capabilities. Our framework complements existing model parallelism techniques proposed for large-scale transformer training by optimizing non-GEMM bottlenecks and enabling convolution-aware tensor sharding. We demonstrate unprecedented capabilities by performing multimodal registration of a 100 micron ex-vivo human brain MRI volume at native resolution - an inverse problem more than 570x larger than a standard clinical datum in about a minute using only 8 A6000 GPUs. FFDP accelerates existing state-of-the-art optimization and deep learning registration pipelines by upto 6 - 7x while reducing peak memory consumption by 20 - 59%. Comparative analysis on a 250 micron dataset shows that FFDP can fit upto 64x larger problems than existing SOTA on a single GPU, and highlights both the performance and efficiency gains of FFDP compared to SOTA image registration methods.

CVAug 24, 2025
Development of an isotropic segmentation model for medial temporal lobe subregions on anisotropic MRI atlas using implicit neural representation

Yue Li, Pulkit Khandelwal, Rohit Jena et al.

Imaging biomarkers in magnetic resonance imaging (MRI) are important tools for diagnosing and tracking Alzheimer's disease (AD). As medial temporal lobe (MTL) is the earliest region to show AD-related hallmarks, brain atrophy caused by AD can first be observed in the MTL. Accurate segmentation of MTL subregions and extraction of imaging biomarkers from them are important. However, due to imaging limitations, the resolution of T2-weighted (T2w) MRI is anisotropic, which makes it difficult to accurately extract the thickness of cortical subregions in the MTL. In this study, we used an implicit neural representation method to combine the resolution advantages of T1-weighted and T2w MRI to accurately upsample an MTL subregion atlas set from anisotropic space to isotropic space, establishing a multi-modality, high-resolution atlas set. Based on this atlas, we developed an isotropic MTL subregion segmentation model. In an independent test set, the cortical subregion thickness extracted using this isotropic model showed higher significance than an anisotropic method in distinguishing between participants with mild cognitive impairment and cognitively unimpaired (CU) participants. In longitudinal analysis, the biomarkers extracted using isotropic method showed greater stability in CU participants. This study improved the accuracy of AD imaging biomarkers without increasing the amount of atlas annotation work, which may help to more accurately quantify the relationship between AD and brain atrophy and provide more accurate measures for disease tracking.

CVJun 11, 2024
Deep Implicit Optimization enables Robust Learnable Features for Deformable Image Registration

Rohit Jena, Pratik Chaudhari, James C. Gee

Deep Learning in Image Registration (DLIR) methods have been tremendously successful in image registration due to their speed and ability to incorporate weak label supervision at training time. However, existing DLIR methods forego many of the benefits and invariances of optimization methods. The lack of a task-specific inductive bias in DLIR methods leads to suboptimal performance, especially in the presence of domain shift. Our method aims to bridge this gap between statistical learning and optimization by explicitly incorporating optimization as a layer in a deep network. A deep network is trained to predict multi-scale dense feature images that are registered using a black box iterative optimization solver. This optimal warp is then used to minimize image and label alignment errors. By implicitly differentiating end-to-end through an iterative optimization solver, we explicitly exploit invariances of the correspondence matching problem induced by the optimization, while learning registration and label-aware features, and guaranteeing the warp functions to be a local minima of the registration objective in the feature space. Our framework shows excellent performance on in-domain datasets, and is agnostic to domain shift such as anisotropy and varying intensity profiles. For the first time, our method allows switching between arbitrary transformation representations (free-form to diffeomorphic) at test time with zero retraining. End-to-end feature learning also facilitates interpretability of features and arbitrary test-time regularization, which is not possible with existing DLIR methods.

CVDec 18, 2023
Towards Establishing Dense Correspondence on Multiview Coronary Angiography: From Point-to-Point to Curve-to-Curve Query Matching

Yifan Wu, Rohit Jena, Mehmet Gulsun et al.

Coronary angiography is the gold standard imaging technique for studying and diagnosing coronary artery disease. However, the resulting 2D X-ray projections lose 3D information and exhibit visual ambiguities. In this work, we aim to establish dense correspondence in multi-view angiography, serving as a fundamental basis for various clinical applications and downstream tasks. To overcome the challenge of unavailable annotated data, we designed a data simulation pipeline using 3D Coronary Computed Tomography Angiography (CCTA). We formulated the problem of dense correspondence estimation as a query matching task over all points of interest in the given views. We established point-to-point query matching and advanced it to curve-to-curve correspondence, significantly reducing errors by minimizing ambiguity and improving topological awareness. The method was evaluated on a set of 1260 image pairs from different views across 8 clinically relevant angulation groups, demonstrating compelling results and indicating the feasibility of establishing dense correspondence in multi-view angiography.

IVJan 13, 2021
Self-Supervised Vessel Enhancement Using Flow-Based Consistencies

Rohit Jena, Sumedha Singla, Kayhan Batmanghelich

Vessel segmentation is an essential task in many clinical applications. Although supervised methods have achieved state-of-art performance, acquiring expert annotation is laborious and mostly limited for two-dimensional datasets with a small sample size. On the contrary, unsupervised methods rely on handcrafted features to detect tube-like structures such as vessels. However, those methods require complex pipelines involving several hyper-parameters and design choices rendering the procedure sensitive, dataset-specific, and not generalizable. We propose a self-supervised method with a limited number of hyper-parameters that is generalizable across modalities. Our method uses tube-like structure properties, such as connectivity, profile consistency, and bifurcation, to introduce inductive bias into a learning algorithm. To model those properties, we generate a vector field that we refer to as a flow. Our experiments on various public datasets in 2D and 3D show that our method performs better than unsupervised methods while learning useful transferable features from unlabeled data. Unlike generic self-supervised methods, the learned features learn vessel-relevant features that are transferable for supervised approaches, which is essential when the number of annotated data is limited.

LGNov 15, 2020
Predicting Human Strategies in Simulated Search and Rescue Task

Vidhi Jain, Rohit Jena, Huao Li et al.

In a search and rescue scenario, rescuers may have different knowledge of the environment and strategies for exploration. Understanding what is inside a rescuer's mind will enable an observer agent to proactively assist them with critical information that can help them perform their task efficiently. To this end, we propose to build models of the rescuers based on their trajectory observations to predict their strategies. In our efforts to model the rescuer's mind, we begin with a simple simulated search and rescue task in Minecraft with human participants. We formulate neural sequence models to predict the triage strategy and the next location of the rescuer. As the neural networks are data-driven, we design a diverse set of artificial "faux human" agents for training, to test them with limited human rescuer trajectory data. To evaluate the agents, we compare it to an evidence accumulation method that explicitly incorporates all available background knowledge and provides an intended upper bound for the expected performance. Further, we perform experiments where the observer/predictor is human. We show results in terms of prediction accuracy of our computational approaches as compared with that of human observers.

LGSep 20, 2020
Addressing reward bias in Adversarial Imitation Learning with neutral reward functions

Rohit Jena, Siddharth Agrawal, Katia Sycara

Generative Adversarial Imitation Learning suffers from the fundamental problem of reward bias stemming from the choice of reward functions used in the algorithm. Different types of biases also affect different types of environments - which are broadly divided into survival and task-based environments. We provide a theoretical sketch of why existing reward functions would fail in imitation learning scenarios in task based environments with multiple terminal states. We also propose a new reward function for GAIL which outperforms existing GAIL methods on task based environments with single and multiple terminal states and effectively overcomes both survival and termination bias.

CVApr 10, 2020
MA 3 : Model Agnostic Adversarial Augmentation for Few Shot learning

Rohit Jena, Shirsendu Sukanta Halder, Katia Sycara

Despite the recent developments in vision-related problems using deep neural networks, there still remains a wide scope in the improvement of generalizing these models to unseen examples. In this paper, we explore the domain of few-shot learning with a novel augmentation technique. In contrast to other generative augmentation techniques, where the distribution over input images are learnt, we propose to learn the probability distribution over the image transformation parameters which are easier and quicker to learn. Our technique is fully differentiable which enables its extension to versatile data-sets and base models. We evaluate our proposed method on multiple base-networks and 2 data-sets to establish the robustness and efficiency of this method. We obtain an improvement of nearly 4% by adding our augmentation module without making any change in network architectures. We also make the code readily available for usage by the community.

LGJan 21, 2020
Augmenting GAIL with BC for sample efficient imitation learning

Rohit Jena, Changliu Liu, Katia Sycara

Imitation learning is the problem of recovering an expert policy without access to a reward signal. Behavior cloning and GAIL are two widely used methods for performing imitation learning. Behavior cloning converges in a few iterations but doesn't achieve peak performance due to its inherent iid assumption about the state-action distribution. GAIL addresses the issue by accounting for the temporal dependencies when performing a state distribution matching between the agent and the expert. Although GAIL is sample efficient in the number of expert trajectories required, it is still not very sample efficient in terms of the environment interactions needed for convergence of the policy. Given the complementary benefits of both methods, we present a simple and elegant method to combine both methods to enable stable and sample efficient learning. Our algorithm is very simple to implement and integrates with different policy gradient algorithms. We demonstrate the effectiveness of the algorithm in low dimensional control tasks, gridworlds and in high dimensional image-based tasks.

CVApr 28, 2019
An approach to image denoising using manifold approximation without clean images

Rohit Jena

Image restoration has been an extensively researched topic in numerous fields. With the advent of deep learning, a lot of the current algorithms were replaced by algorithms that are more flexible and robust. Deep networks have demonstrated impressive performance in a variety of tasks like blind denoising, image enhancement, deblurring, super-resolution, inpainting, among others. Most of these learning-based algorithms use a large amount of clean data during the training process. However, in certain applications in medical image processing, one may not have access to a large amount of clean data. In this paper, we propose a method for denoising that attempts to learn the denoising process by pushing the noisy data close to the clean data manifold, using only noisy images during training. Furthermore, we use perceptual loss terms and an iterative refinement step to further refine the clean images without losing important features.

CVApr 25, 2019
Out of the Box: A combined approach for handling occlusion in Human Pose Estimation

Rohit Jena

Human Pose estimation is a challenging problem, especially in the case of 3D pose estimation from 2D images due to many different factors like occlusion, depth ambiguities, intertwining of people, and in general crowds. 2D multi-person human pose estimation in the wild also suffers from the same problems - occlusion, ambiguities, and disentanglement of people's body parts. Being a fundamental problem with loads of applications, including but not limited to surveillance, economical motion capture for video games and movies, and physiotherapy, this is an interesting problem to be solved both from a practical perspective and from an intellectual perspective as well. Although there are cases where no pose estimation can ever predict with 100% accuracy (cases where even humans would fail), there are several algorithms that have brought new state-of-the-art performance in human pose estimation in the wild. We look at a few algorithms with different approaches and also formulate our own approach to tackle a consistently bugging problem, i.e. occlusions.