CV · Mar 28, 2023
Cross-View Visual Geo-Localization for Outdoor Augmented Reality
Niluthpol Chowdhury Mithun, Kshitij Minhas, Han-Pang Chiu et al.
Precise estimation of global orientation and location is critical to ensure a compelling outdoor Augmented Reality (AR) experience. We address the problem of geo-pose estimation by cross-view matching of query ground images to a geo-referenced aerial satellite image database. Recently, neural network-based methods have shown state-of-the-art performance in cross-view matching. However, most of the prior works focus only on location estimation, ignoring orientation, which cannot meet the requirements in outdoor AR applications. We propose a new transformer neural network-based model and a modified triplet ranking loss for joint location and orientation estimation. Experiments on several benchmark cross-view geo-localization datasets show that our model achieves state-of-the-art performance. Furthermore, we present an approach to extend the single image query-based geo-localization approach by utilizing temporal information from a navigation pipeline for robust continuous geo-localization. Experimentation on several large-scale real-world video sequences demonstrates that our approach enables high-precision and stable AR insertion.
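As a rough illustration of the kind of objective involved, the sketch below implements the standard max-margin triplet ranking loss. The abstract does not describe how the authors modify it, so this is the textbook form only; the function names, the Euclidean distance, and the margin value are assumptions made for the example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss (not the paper's modified variant).

    Penalizes the case where the matching aerial embedding (positive) is not
    closer to the ground-image embedding (anchor) than a non-matching one
    (negative) by at least `margin`.
    """
    return max(0.0, margin + euclidean(anchor, positive) - euclidean(anchor, negative))

# The positive coincides with the anchor and the negative is far away: no penalty.
easy = triplet_ranking_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
# The roles are swapped: the loss grows with the distance gap plus the margin.
hard = triplet_ranking_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```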
CV · May 17, 2022
GraphMapper: Efficient Visual Navigation by Scene Graph Generation
Zachary Seymour, Niluthpol Chowdhury Mithun, Han-Pang Chiu et al.
Understanding the geometric relationships between objects in a scene is a core capability in enabling both humans and autonomous agents to navigate in new environments. A sparse, unified representation of the scene topology will allow agents to act efficiently to move through their environment, communicate the environment state with others, and utilize the representation for diverse downstream tasks. To this end, we propose a method to train an autonomous agent to learn to accumulate a 3D scene graph representation of its environment by simultaneously learning to navigate through said environment. We demonstrate that our approach, GraphMapper, enables the learning of effective navigation policies through fewer interactions with the environment than vision-based systems alone. Further, we show that GraphMapper can act as a modular scene encoder to operate alongside existing learning-based solutions to not only increase navigational efficiency but also generate intermediate scene representations that are useful for other future tasks.
AR · Mar 10, 2023
Hardware Acceleration of Neural Graphics
Muhammad Husnain Mubarik, Ramakrishna Kanungo, Tobias Zirr et al.
Rendering and inverse-rendering algorithms that drive conventional computer graphics have recently been superseded by neural representations (NR). NRs have recently been used to learn the geometric and material properties of scenes and to use that information to synthesize photorealistic imagery, thereby promising a replacement for traditional rendering algorithms with scalable quality and predictable performance. In this work we ask the question: Does neural graphics (NG) need hardware support? We studied representative NG applications and found that, to render 4K resolution at 60 FPS, there is a gap of 1.5X-55X in the desired performance on current GPUs. For AR/VR applications, there is an even larger gap of 2-4 orders of magnitude (OOM) between the desired performance and the required system power. We identify the input encoding and the MLP kernels as the performance bottlenecks, consuming 72%, 60%, and 59% of application time for multi-resolution hashgrid, multi-resolution densegrid, and low-resolution densegrid encodings, respectively. We propose an NG processing cluster (NGPC), a scalable and flexible hardware architecture that directly accelerates the input encoding and MLP kernels through dedicated engines and supports a wide range of NG applications. We also accelerate the rest of the kernels by fusing them together in Vulkan, which leads to a 9.94X kernel-level performance improvement compared to the un-fused implementation of the pre-processing and post-processing kernels. Our results show that NGPC gives up to a 58X end-to-end application-level performance improvement. For multi-resolution hashgrid encoding, on average across the four NG applications, the performance benefits are 12X, 20X, 33X, and 39X for scaling factors of 8, 16, 32, and 64, respectively. With multi-resolution hashgrid encoding, NGPC enables rendering at 4K resolution at 30 FPS for NeRF and at 8K resolution at 120 FPS for all our other NG applications.
DC · May 22
Flare: Leveraging Serverless Elasticity to Absorb Microservice Load Spikes
Dilina Dehigama, Shyam Jesalpura, David Schall et al.
Online services strive to maintain application responsiveness even when the traffic is unpredictable and fluctuating. Today's online services are commonly deployed as chains of microservices, each microservice packaged as one or more containers inside virtual machines (VMs). While performant and affordable when the load is steady, VM-based deployments are known to be slow to scale when the load spikes, resulting in degraded performance for end-users of the service. To avoid such performance degradations, service providers can over-provision their deployments; however, such a strategy is costly and inefficient, leaving resources under-utilized for extended periods. To address the challenge of unpredictable load spikes, we propose Flare, a hybrid microservice architecture that combines VMs with serverless computing. Flare utilizes VMs to cost-effectively handle steady workloads and leverages serverless elasticity to absorb traffic spikes. When a spike occurs, Flare detects which specific service(s) are overloaded and shifts the excess load of only those services to serverless, thus minimizing the cost overhead. Flare seamlessly integrates into existing auto-scaling and serverless infrastructure, requiring minimal changes to the control plane and no modifications to the application.
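The routing idea described above (keep steady load on VMs, send only the overflow of overloaded services to serverless) can be sketched as a per-service split. This is a minimal illustration, not Flare's actual control logic: the function name, the requests-per-second units, and the fixed per-service VM capacities are all assumptions made for the example.

```python
def split_load(demand_rps, vm_capacity_rps):
    """Split each service's demand into (vm_share, serverless_share).

    Hypothetical helper: a service is treated as overloaded when its demand
    exceeds its provisioned VM capacity, and only the excess is shifted.
    """
    plan = {}
    for service, demand in demand_rps.items():
        capacity = vm_capacity_rps[service]
        vm_share = min(demand, capacity)
        plan[service] = (vm_share, demand - vm_share)
    return plan

# "cart" spikes past its VM capacity; "auth" stays within its capacity.
plan = split_load({"cart": 120, "auth": 50}, {"cart": 100, "auth": 80})
```

Only `cart` sends traffic to serverless in this example, which mirrors the abstract's point that shifting the excess of just the overloaded services keeps the cost overhead small.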
NA · May 17, 2018
Generalized least square homotopy perturbations for system of fractional partial differential equations
Rakesh Kumar, Reena Koundal
In this paper, generalized aspects of least square homotopy perturbations are explored to treat systems of nonlinear fractional partial differential equations; the method is called generalized least square homotopy perturbations (GLSHP). The concept of a partial fractional Wronskian is introduced to detect the linear independence of functions depending on more than one variable through Caputo fractional calculus. A general theorem related to the Wronskian is also proved. It is found that solutions converge more rapidly through GLSHP than through classical fractional homotopy perturbations. The results of this generalization are validated with examples from nonlinear fractional wave equations.
NA · May 6
Neural-Guided Domain Restriction to Accelerate Pseudospectra Computation for Structured Non-normal Banded Matrices
Amit Punia, Rakesh Kumar, Madan Lal
Computing pseudospectra of non-normal matrices is essential for understanding the stability and transient behavior of dynamical systems. Such analysis is critical in applications including fluid dynamics, control systems, and differential operators, where non-normality can lead to significant transient amplification and sensitivity to perturbations that are not captured by eigenvalue analysis alone. At large scales, commonly used numerical approaches for pseudospectra computation can become computationally demanding, as they require repeated auxiliary computations to identify spectrally sensitive regions in the complex plane. We present a neural network-based approach that predicts sensitive regions directly from matrix features, thereby avoiding exhaustive pseudospectra evaluation across the entire complex plane. We calibrate the prediction threshold on validation data to ensure reliable coverage of sensitive regions. The trained neural network guides the selection of grid points requiring full computation, enabling focused computation only where necessary. The approach provides a practical preprocessing strategy for efficient pseudospectra computation. Numerical experiments on non-normal banded matrices demonstrate substantial speedup compared to full grid-based numerical evaluation while maintaining high accuracy in identifying sensitive regions.
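For context, the full grid-based evaluation that the abstract uses as its baseline can be sketched as follows: the ε-pseudospectrum of A is the set of complex z where the smallest singular value of zI - A is at most ε, so the baseline computes that value at every grid point. The optional `mask` argument is a hypothetical stand-in for a predicted sensitive region; the paper's matrix features, network, and threshold calibration are not described in the abstract and are not modeled here.

```python
import numpy as np

def sigma_min_grid(A, re_vals, im_vals, mask=None):
    """Smallest singular value of (z*I - A) for z = x + iy on a grid.

    Rows index im_vals, columns index re_vals. Points where `mask` is False
    are skipped and left as NaN (illustrative domain restriction only).
    """
    n = A.shape[0]
    identity = np.eye(n)
    out = np.full((len(im_vals), len(re_vals)), np.nan)
    for i, y in enumerate(im_vals):
        for j, x in enumerate(re_vals):
            if mask is not None and not mask[i, j]:
                continue
            z = complex(x, y)
            # Singular values are returned in descending order; take the last.
            out[i, j] = np.linalg.svd(z * identity - A, compute_uv=False)[-1]
    return out

# A 2x2 Jordan block is a small non-normal example with a single eigenvalue at 0.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
full = sigma_min_grid(A, [0.0, 1.0], [0.0])
restricted = sigma_min_grid(A, [0.0, 1.0], [0.0], mask=np.array([[True, False]]))
```

Each grid point costs one SVD, which is why restricting the evaluated points, as the abstract proposes, can reduce the total work.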
CV · Apr 2, 2025
Diffusion-Guided Gaussian Splatting for Large-Scale Unconstrained 3D Reconstruction and Novel View Synthesis
Niluthpol Chowdhury Mithun, Tuan Pham, Qiao Wang et al.
Recent advancements in 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have achieved impressive results in real-time 3D reconstruction and novel view synthesis. However, these methods struggle in large-scale, unconstrained environments where sparse and uneven input coverage, transient occlusions, appearance variability, and inconsistent camera settings lead to degraded quality. We propose GS-Diff, a novel 3DGS framework guided by a multi-view diffusion model to address these limitations. By generating pseudo-observations conditioned on multi-view inputs, our method transforms under-constrained 3D reconstruction problems into well-posed ones, enabling robust optimization even with sparse data. GS-Diff further integrates several enhancements, including appearance embedding, monocular depth priors, dynamic object modeling, anisotropy regularization, and advanced rasterization techniques, to tackle geometric and photometric challenges in real-world settings. Experiments on four benchmarks demonstrate that GS-Diff consistently outperforms state-of-the-art baselines by significant margins.
CV · Oct 1, 2025
GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings
Angel Daruna, Nicholas Meegan, Han-Pang Chiu et al.
Worldwide visual geo-localization seeks to determine the geographic location of an image anywhere on Earth using only its visual content. Learned representations of geography for visual geo-localization remain an active research topic despite much progress. We formulate geo-localization as aligning the visual representation of the query image with a learned geographic representation. Our novel geographic representation explicitly models the world as a hierarchy of geographic embeddings. Additionally, we introduce an approach to efficiently fuse the appearance features of the query image with its semantic segmentation map, forming a robust visual representation. Our main experiments demonstrate improved all-time bests in 22 out of 25 metrics measured across five benchmark datasets compared to prior state-of-the-art (SOTA) methods and recent Large Vision-Language Models (LVLMs). Additional ablation studies support the claim that these gains are primarily driven by the combination of geographic and visual representations.
RO · Aug 26, 2021
SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments
Muhammad Zubair Irshad, Niluthpol Chowdhury Mithun, Zachary Seymour et al.
This paper presents a novel approach for the Vision-and-Language Navigation (VLN) task in continuous 3D environments, which requires an autonomous agent to follow natural language instructions in unseen environments. Existing end-to-end learning-based VLN methods struggle at this task because they focus mostly on utilizing raw visual observations and lack the semantic spatio-temporal reasoning capabilities that are crucial for generalizing to new environments. In this regard, we present a hybrid transformer-recurrence model which focuses on combining classical semantic mapping techniques with a learning-based method. Our method creates a temporal semantic memory by building a top-down local ego-centric semantic map and performs cross-modal grounding to align map and language modalities to enable effective learning of VLN policy. Empirical results in a photo-realistic long-horizon simulation environment show that the proposed approach outperforms a variety of state-of-the-art methods and baselines with over 22% relative improvement in SPL in previously unseen environments.
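SPL (Success weighted by Path Length), the metric quoted above, is commonly defined as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success flag, l_i the shortest-path length, and p_i the length of the path the agent took. The sketch below follows that common definition; the tuple layout and the sample episodes are made up for illustration and are not from the paper.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is (success, shortest_path_length, agent_path_length).
    Failed episodes contribute 0; successful ones contribute the ratio of
    the shortest path to the longer of the two path lengths.
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# One optimal success, one success with a path twice as long, one failure.
score = spl([(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)])

def relative_improvement(new, old):
    """Relative improvement of `new` over `old`, e.g. 0.22 for a 22% gain."""
    return (new - old) / old
```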
CV · Mar 21, 2021
MaAST: Map Attention with Semantic Transformers for Efficient Visual Navigation
Zachary Seymour, Kowshik Thopalli, Niluthpol Mithun et al.
Visual navigation for autonomous agents is a core task in the fields of computer vision and robotics. Learning-based methods, such as deep reinforcement learning, have the potential to outperform the classical solutions developed for this task; however, they come at a significantly increased computational load. Through this work, we design a novel approach that focuses on performing better than or comparably to the existing learning-based solutions but under a clear time/computational budget. To this end, we propose a method to encode vital scene semantics such as traversable paths, unexplored areas, and observed scene objects -- alongside raw visual streams such as RGB, depth, and semantic segmentation masks -- into a semantically informed, top-down egocentric map representation. Further, to enable the effective use of this information, we introduce a novel 2-D map attention mechanism, based on the successful multi-layer Transformer networks. We conduct experiments on 3-D reconstructed indoor PointGoal visual navigation and demonstrate the effectiveness of our approach. We show that by using our novel attention schema and auxiliary rewards to better utilize scene semantics, we outperform multiple baselines trained with only raw inputs or implicit semantic information while operating with an 80% decrease in the agent's experience.
IV · Dec 7, 2020
Efficient Kernel based Matched Filter Approach for Segmentation of Retinal Blood Vessels
Sushil Kumar Saroj, Vikas Ratna, Rakesh Kumar et al.
The structure of retinal blood vessels carries information about diseases such as obesity, diabetes, hypertension, and glaucoma. This information is very useful in identifying and treating these fatal diseases. To obtain this information, the retinal vessels must be segmented. Many kernel-based methods have been proposed for retinal vessel segmentation, but their kernels do not fit the vessel profile well, which causes poor performance. To overcome this, a new and efficient kernel-based matched filter approach is proposed. The new matched filter is used to generate the matched filter response (MFR) image. We apply the Otsu thresholding method to the MFR image to extract the vessels. We conducted extensive experiments to choose the best parameter values for the proposed matched filter kernel. The proposed approach has been examined and validated on two datasets available online, DRIVE and STARE. It achieves specificity of 98.50% and 98.23% and accuracy of 95.77% and 95.13% on the DRIVE and STARE datasets, respectively. The results confirm that the proposed method performs better than the others. The improvement is due to the proposed kernel, which matches the retinal blood vessel profile more accurately.
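The thresholding step mentioned above can be illustrated with a plain implementation of Otsu's method, which picks the gray level that maximizes the between-class variance of the two resulting pixel groups. This sketch covers only that step: the matched filter kernel itself is not specified in the abstract, and the flat pixel list and 8-bit intensity range are assumptions made for the example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the Otsu threshold t; pixels with value > t form the foreground.

    `pixels` is a flat list of integer intensities in [0, levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    weight_bg = 0       # number of pixels with value <= t
    sum_bg = 0          # intensity sum of those pixels
    best_t, best_var = 0, -1.0
    for t in range(levels):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t

# Toy "response image": dark background at 10, bright vessel responses at 200.
response = [10] * 50 + [200] * 50
t = otsu_threshold(response)
vessel_mask = [p > t for p in response]
```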
CV · Sep 12, 2020
RGB2LIDAR: Towards Solving Large-Scale Cross-Modal Visual Localization
Niluthpol Chowdhury Mithun, Karan Sikka, Han-Pang Chiu et al.
We study an important, yet largely unexplored problem of large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to scaling up for large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering a 143 km^2 area) of RGB and aerial LIDAR depth images. We propose a novel joint embedding based method that effectively combines the appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a strong result of a median rank of 5 in matching across a large test set of 50K location pairs collected from a 14 km^2 area. This represents a significant advancement over prior works in performance and scale. We conclude with qualitative results to highlight the challenging nature of this task and the benefits of the proposed model. Our work provides a foundation for further research in cross-modal visual localization.
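The median rank figure quoted above is a retrieval metric: for each query, rank its true match among all candidates by similarity, then take the median of those ranks. The sketch below assumes a square similarity matrix in which query i's true match is candidate i; that layout, the function name, and the toy scores are assumptions for illustration, not details from the paper.

```python
from statistics import median

def median_rank(similarity):
    """Median rank of the true match, with rank 1 meaning retrieved first.

    similarity[i][j] is the score between query i and candidate j; the true
    match of query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        true_score = row[i]
        # Rank = 1 + number of candidates scoring strictly higher than the match.
        ranks.append(1 + sum(1 for s in row if s > true_score))
    return median(ranks)

scores = [
    [0.9, 0.1, 0.2],  # true match ranked 1st
    [0.8, 0.5, 0.9],  # true match ranked 3rd
    [0.1, 0.2, 0.3],  # true match ranked 1st
]
result = median_rank(scores)
```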
CV · Aug 28, 2019
ApproxNet: Content and Contention-Aware Video Analytics System for Embedded Clients
Ran Xu, Rakesh Kumar, Pengcheng Wang et al.
Videos take a lot of time to transport over the network, hence running analytics on the live video on embedded or mobile devices has become an important system driver. Considering that such devices, e.g., surveillance cameras or AR/VR gadgets, are resource constrained, creating lightweight deep neural networks (DNNs) for embedded devices is crucial. None of the current approximation techniques for object classification DNNs can adapt to changing runtime conditions, e.g., changes in resource availability on the device, the content characteristics, or requirements from the user. In this paper, we introduce ApproxNet, a video object classification system for embedded or mobile clients. It enables novel dynamic approximation techniques to achieve the desired inference latency and accuracy trade-off under changing runtime conditions. It achieves this by enabling two approximation knobs within a single DNN model, rather than creating and maintaining an ensemble of models (e.g., MCDNN [MobiSys-16]). We show that ApproxNet can adapt seamlessly at runtime to these changes and provides low and stable latency for the image and video frame classification problems, and we show the improvement in accuracy and latency over ResNet [CVPR-16], MCDNN [MobiSys-16], MobileNets [Google-17], NestDNN [MobiCom-18], and MSDNet [ICLR-18].
CY · Apr 5, 2019
Can You Explain That? Lucid Explanations Help Human-AI Collaborative Image Retrieval
Arijit Ray, Yi Yao, Rakesh Kumar et al.
While there have been many proposals on making AI algorithms explainable, few have attempted to evaluate the impact of AI-generated explanations on human performance in conducting human-AI collaborative tasks. To bridge the gap, we propose a Twenty-Questions style collaborative image retrieval game, Explanation-assisted Guess Which (ExAG), as a method of evaluating the efficacy of explanations (visual evidence or textual justification) in the context of Visual Question Answering (VQA). In our proposed ExAG, a human user needs to guess a secret image picked by the VQA agent by asking natural language questions to it. We show that overall, when AI explains its answers, users succeed more often in guessing the secret image correctly. Notably, a few correct explanations can readily improve human performance when VQA answers are mostly incorrect as compared to no-explanation games. Furthermore, we also show that while explanations rated as "helpful" significantly improve human performance, "incorrect" and "unhelpful" explanations can degrade performance as compared to no-explanation games. Our experiments, therefore, demonstrate that ExAG is an effective means to evaluate the efficacy of AI-generated explanations on a human-AI collaborative task.
CV · Dec 8, 2018
Semantically-Aware Attentive Neural Embeddings for Image-based Visual Localization
Zachary Seymour, Karan Sikka, Han-Pang Chiu et al.
We present an approach that combines appearance and semantic information for 2D image-based localization (2D-VL) across large perceptual changes and time lags. Compared to appearance features, the semantic layout of a scene is generally more invariant to appearance variations. We use this intuition and propose a novel end-to-end deep attention-based framework that utilizes multimodal cues to generate robust embeddings for 2D-VL. The proposed attention module predicts a shared channel attention and modality-specific spatial attentions to guide the embeddings to focus on more reliable image regions. We evaluate our model against state-of-the-art (SOTA) methods on three challenging localization datasets. We report an average (absolute) improvement of 19% over current SOTA for 2D-VL. Furthermore, we present an extensive study demonstrating the contribution of each component of our model, showing 8-15% and 4% improvement from adding semantic information and our proposed attention module, respectively. We finally show the predicted attention maps to offer useful insights into our model.
NA · Oct 2, 2018
Hybrid BSQI-WENO Based Numerical Scheme for Hyperbolic Conservation Laws
Rakesh Kumar, S. Baskar
In this paper, we intend to use a B-spline quasi-interpolation (BSQI) technique to develop higher order hybrid schemes for conservation laws. As a first step, we develop cubic and quintic B-spline quasi-interpolation based numerical methods for hyperbolic conservation laws in 1 space dimension, and show that they achieve the rate of convergence 4 and 6, respectively. Although the BSQI schemes that we develop are shown to be stable, they produce spurious oscillations in the vicinity of shocks, as expected. In order to suppress the oscillations, we conjugate the BSQI schemes with the fifth order weighted essentially non-oscillatory (WENO5) scheme. We use a weak local truncation based estimate to detect the high gradient regions of the numerical solution. We use this information to capture shocks using WENO scheme, whereas the BSQI based scheme is used in the smooth regions. For the time discretization, we consider a strong stability preserving (SSP) Runge-Kutta method of order three. At the end, we demonstrate the accuracy and the efficiency of the proposed schemes over the WENO5 scheme through numerical experiments.
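The time discretization named above, a third-order strong stability preserving Runge-Kutta method, is commonly written in the Shu-Osher form as three convex combinations of forward Euler steps. The sketch below uses that widely cited form and assumes it is the variant meant; `rhs` is a hypothetical callable standing in for the spatial operator (BSQI or WENO in the paper), which is not implemented here.

```python
def ssp_rk3_step(u, dt, rhs):
    """Advance u by one step of the three-stage, third-order SSP Runge-Kutta
    scheme in Shu-Osher form.

    u   : list of floats (cell values)
    rhs : callable returning the spatial operator L(u) as a list
    """
    # Stage 1: a plain forward Euler step.
    u1 = [a + dt * b for a, b in zip(u, rhs(u))]
    # Stage 2: 3/4 of the old state plus 1/4 of an Euler step from u1.
    u2 = [0.75 * a + 0.25 * (b + dt * c) for a, b, c in zip(u, u1, rhs(u1))]
    # Stage 3: 1/3 of the old state plus 2/3 of an Euler step from u2.
    return [a / 3.0 + (2.0 / 3.0) * (b + dt * c) for a, b, c in zip(u, u2, rhs(u2))]

def decay(u):
    """Toy right-hand side for u' = -u, used only to check the step."""
    return [-x for x in u]

# For a linear problem one step reproduces the cubic Taylor polynomial of exp(-dt).
u_next = ssp_rk3_step([1.0], 0.1, decay)
```

Because every stage is a convex combination of forward Euler steps, the scheme inherits the stability properties of forward Euler under a suitable time-step restriction, which is the usual reason for pairing it with shock-capturing spatial schemes.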
NA · Sep 10, 2018
Simple smoothness indicator and multi-level adaptive order WENO scheme for hyperbolic conservation laws
Rakesh Kumar, Praveen Chandrashekar
In the present work, we propose two new variants of fifth order finite difference WENO schemes of adaptive order. We compare our proposed schemes with other variants of WENO schemes, with special emphasis on the WENO-AO(5,3) scheme [Balsara, Garain, and Shu, J. Comput. Phys., 326 (2016), pp. 780-804]. The first algorithm, WENO-AON(5,3), involves the construction of a new simple smoothness indicator which reduces the computational cost of the WENO-AO(5,3) scheme. Numerical experiments show that the accuracy of the WENO-AON(5,3) scheme is comparable to that of the WENO-AO(5,3) scheme, and that its resolution of solutions involving shocks or other discontinuities is also comparable to that of the WENO-AO(5,3) scheme. The second algorithm, denoted WENO-AO(5,4,3), involves the inclusion of an extra cubic polynomial reconstruction in the base WENO-AO(5,3) scheme, which leads to a more accurate scheme. Extensive numerical experiments in 1D and 2D show that, among the considered WENO schemes, the WENO-AO(5,4,3) scheme has better resolution near shocks or discontinuities with a negligible increase in computational cost.