Abhishek Saroha

CV
h-index36
13papers
56citations
Novelty53%
AI Score52

13 Papers

CVJul 24, 2024Code
DiffCD: A Symmetric Differentiable Chamfer Distance for Neural Implicit Surface Fitting

Linus Härenstam-Nielsen, Lu Sang, Abhishek Saroha et al.

Neural implicit surfaces can be used to recover accurate 3D geometry from imperfect point clouds. In this work, we show that state-of-the-art techniques work by minimizing an approximation of a one-sided Chamfer distance. This shape metric is not symmetric, as it only ensures that the point cloud is near the surface but not vice versa. As a consequence, existing methods can produce inaccurate reconstructions with spurious surfaces. Although one approach against spurious surfaces has been widely used in the literature, we theoretically and experimentally show that it is equivalent to regularizing the surface area, resulting in over-smoothing. As a more appealing alternative, we propose DiffCD, a novel loss function corresponding to the symmetric Chamfer distance. In contrast to previous work, DiffCD also assures that the surface is near the point cloud, which eliminates spurious surfaces without the need for additional regularization. We experimentally show that DiffCD reliably recovers a high degree of shape detail, substantially outperforming existing work across varying surface complexity and noise levels. Project code is available at https://github.com/linusnie/diffcd.

CVJun 3, 2023
Enhancing Surface Neural Implicits with Curvature-Guided Sampling and Uncertainty-Augmented Representations

Lu Sang, Abhishek Saroha, Maolin Gao et al.

Neural implicit representations have become a popular choice for modeling surfaces due to their adaptability in resolution and support for complex topology. While previous works have achieved impressive reconstruction quality by training on ground truth point clouds or meshes, they often do not discuss the data acquisition and ignore the effect of input quality and sampling methods during reconstruction. In this paper, we introduce a method that directly digests depth images for the task of high-fidelity 3D reconstruction. To this end, a simple sampling strategy is proposed to generate highly effective training data, by incorporating differentiable geometric features computed directly based on the input depth images with only marginal computational cost. Due to its simplicity, our sampling strategy can be easily incorporated into diverse popular methods, allowing their training process to be more stable and efficient. Despite its simplicity, our method outperforms a range of both classical and learning-based baselines and demonstrates state-of-the-art results in both synthetic and real-world datasets.

50.0CVMay 29
Benchmarking Single-Step Inpainting Methods for Multi-Object 3D Gaussian Splatting Scenes

Finn Dröge, Cecilia Curreli, Abhishek Saroha et al.

The tasks of object removal and inpainting 3D Gaussian Splatting (3DGS) scenes face challenges such as 3D consistency across camera views. In comparing 2D inpainters and their suitability for the 3D domain, we find that reconstruction-based inpainters outperform generative diffusion models in 3D consistency. Integrating these 2D inpainters into different single-step methods for creating and finetuning 3DGS scenes, our results indicate that initializing the scene from scratch produces higher quality results than finetuning the existing scene. Using a state-of-the-art generative 2D inpainter, we create a straightforward baseline to underline the importance of object removal before inpainting in the 3D setting. Since 360° datasets rarely include real-world ground truths, and challenging occlusion scenarios are equally sparse, we introduce a novel multi-object scene with recorded ground truth data and many views with object occlusions.

74.8LGMay 29
Graph Neural Networks Are Not Continuous Across Graph Resolutions

Christian Koke, Yuesong Shen, Abhishek Saroha et al.

We show that contrary to conventional wisdom in the community, graph neural networks (GNNs) are not continuous with respect to all natural modes of graph convergence. As a result, GNNs may generate substantially different latent representations for graphs that are very similar. In particular they assign vastly different latent embeddings to graphs that represent the same underlying object at different resolution scales. We trace this failure of continuity back to a structural obstruction arising from commonly used information-propagation schemes. Building on this insight we then derive a principled modification to standard GNN architectures which equips models with continuity across scales. The proposed modification enables consistent integration of distinct resolutions and reliable generalization between them. We systematically validate our theoretical findings in a wide range of numerical experiments.

LGSep 30, 2023
ResolvNet: A Graph Convolutional Network with multi-scale Consistency

Christian Koke, Abhishek Saroha, Yuesong Shen et al.

It is by now a well known fact in the graph learning community that the presence of bottlenecks severely limits the ability of graph neural networks to propagate information over long distances. What so far has not been appreciated is that, counter-intuitively, also the presence of strongly connected sub-graphs may severely restrict information flow in common architectures. Motivated by this observation, we introduce the concept of multi-scale consistency. At the node level this concept refers to the retention of a connected propagation graph even if connectivity varies over a given graph. At the graph-level, multi-scale consistency refers to the fact that distinct graphs describing the same object at different resolutions should be assigned similar feature vectors. As we show, both properties are not satisfied by poular graph neural network architectures. To remedy these shortcomings, we introduce ResolvNet, a flexible graph neural network based on the mathematical concept of resolvents. We rigorously establish its multi-scale consistency theoretically and verify it in extensive experiments on real world data: Here networks based on this ResolvNet architecture prove expressive; out-performing baselines significantly on many tasks; in- and outside the multi-scale setting.

CVApr 21, 2022
Implicit Shape Completion via Adversarial Shape Priors

Abhishek Saroha, Marvin Eisenberger, Tarun Yenamandra et al.

We present a novel neural implicit shape method for partial point cloud completion. To that end, we combine a conditional Deep-SDF architecture with learned, adversarial shape priors. More specifically, our network converts partial inputs into a global latent code and then recovers the full geometry via an implicit, signed distance generator. Additionally, we train a PointNet++ discriminator that impels the generator to produce plausible, globally consistent reconstructions. In that way, we effectively decouple the challenges of predicting shapes that are both realistic, i.e. imitate the training set's pose distribution, and accurate in the sense that they replicate the partial input observations. In our experiments, we demonstrate state-of-the-art performance for completing partial shapes, considering both man-made objects (e.g. airplanes, chairs, ...) and deformable shape categories (human bodies). Finally, we show that our adversarial training approach leads to visually plausible reconstructions that are highly consistent in recovering missing parts of a given object.

31.0CVMar 18
GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

Huajian Zeng, Abhishek Saroha, Daniel Cremers et al.

Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goaloriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learningbased manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian- zeng.github. io/projects/gmt/.

83.8CVApr 1
EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation

Abhishek Saroha, Huajian Zeng, Xingxing Zuo et al.

Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on real-world datasets HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79%, and strong generalization to unseen scenes. Our results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding.

CVMar 13, 2024
Gaussian Splatting in Style

Abhishek Saroha, Mariia Gladkova, Cecilia Curreli et al.

3D scene stylization extends the work of neural style transfer to 3D. A vital challenge in this problem is to maintain the uniformity of the stylized appearance across multiple views. A vast majority of the previous works achieve this by training a 3D model for every stylized image and a set of multi-view images. In contrast, we propose a novel architecture trained on a collection of style images that, at test time, produces real time high-quality stylized novel views. We choose the underlying 3D scene representation for our model as 3D Gaussian splatting. We take the 3D Gaussians and process them using a multi-resolution hash grid and a tiny MLP to obtain stylized views. The MLP is conditioned on different style codes for generalization to different styles during test time. The explicit nature of 3D Gaussians gives us inherent advantages over NeRF-based methods, including geometric consistency and a fast training and rendering regime. This enables our method to be useful for various practical use cases, such as augmented or virtual reality. We demonstrate that our method achieves state-of-the-art performance with superior visual quality on various indoor and outdoor real-world data.

CVJan 10, 2025
Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction

Cecilia Curreli, Dominik Muhle, Abhishek Saroha et al.

Probabilistic human motion prediction aims to forecast multiple possible future movements from past observations. While current approaches report high diversity and realism, they often generate motions with undetected limb stretching and jitter. To address this, we introduce SkeletonDiffusion, a latent diffusion model that embeds an explicit inductive bias on the human body within its architecture and training. Our model is trained with a novel nonisotropic Gaussian diffusion formulation that aligns with the natural kinematic structure of the human skeleton. Results show that our approach outperforms conventional isotropic alternatives, consistently generating realistic predictions while avoiding artifacts such as limb distortion. Additionally, we identify a limitation in commonly used diversity metrics, which may inadvertently favor models that produce inconsistent limb lengths within the same sequence. SkeletonDiffusion sets a new benchmark on real-world datasets, outperforming various baselines across multiple evaluation metrics. Visit our project page at https://ceveloper.github.io/publications/skeletondiffusion/ .

CVJan 7, 2025
ZDySS -- Zero-Shot Dynamic Scene Stylization using Gaussian Splatting

Abhishek Saroha, Florian Hofherr, Mariia Gladkova et al.

Stylizing a dynamic scene based on an exemplar image is critical for various real-world applications, including gaming, filmmaking, and augmented and virtual reality. However, achieving consistent stylization across both spatial and temporal dimensions remains a significant challenge. Most existing methods are designed for static scenes and often require an optimization process for each style image, limiting their adaptability. We introduce ZDySS, a zero-shot stylization framework for dynamic scenes, allowing our model to generalize to previously unseen style images at inference. Our approach employs Gaussian splatting for scene representation, linking each Gaussian to a learned feature vector that renders a feature map for any given view and timestamp. By applying style transfer on the learned feature vectors instead of the rendered feature map, we enhance spatio-temporal consistency across frames. Our method demonstrates superior performance and coherence over state-of-the-art baselines in tests on real-world dynamic scenes, making it a robust solution for practical applications.

CVMar 11, 2025
GAS-NeRF: Geometry-Aware Stylization of Dynamic Radiance Fields

Nhat Phuong Anh Vu, Abhishek Saroha, Or Litany et al.

Current 3D stylization techniques primarily focus on static scenes, while our world is inherently dynamic, filled with moving objects and changing environments. Existing style transfer methods primarily target appearance -- such as color and texture transformation -- but often neglect the geometric characteristics of the style image, which are crucial for achieving a complete and coherent stylization effect. To overcome these shortcomings, we propose GAS-NeRF, a novel approach for joint appearance and geometry stylization in dynamic Radiance Fields. Our method leverages depth maps to extract and transfer geometric details into the radiance field, followed by appearance transfer. Experimental results on synthetic and real-world datasets demonstrate that our approach significantly enhances the stylization quality while maintaining temporal coherence in dynamic scenes.

CVFeb 12, 2020
FPGA Implementation of Minimum Mean Brightness Error Bi-Histogram Equalization

Abhishek Saroha, Avichal Rakesh, Rajiv Kumar Tripathi

Histogram Equalization (HE) is a popular method for contrast enhancement. Generally, mean brightness is not conserved in Histogram Equalization. Initially, Bi-Histogram Equalization (BBHE) was proposed to enhance contrast while maintaining a the mean brightness. However, when mean brightness is primary concern, Minimum Mean Brightness Error Bi-Histogram Equalization (MMBEBHE) is the best technique. There are several implementations of Histogram Equalization on FPGA, however to our knowledge MMBEBHE has not been implemented on FPGAs before. Therefore, we present an implementation of MMBEBHE on FPGA.